Resume
- Do not write anything unrelated to CS.
- Do not write too long - 1 or 2 pages are fine. Senior engineer 3 pages.
- Do not write low GPA
- Never ever write “proficient in anything”
Big Data
Most classic question is “Frequent items” (refer to July’s blog).
Find top k hot queries in a daily access log of Google.
Variation:
- k = 1 vs k = 100000 - majority numbers
- low RAM vs sufficient RAM
- single machine vs multiple machines
- accurate vs inaccurate
Sufficient RAM
- HashTable + Heap (min-heap)
- Time O(n * logk), Space O(n)
Low RAM
- Split into 1000 (i.e. LOG/M) files by hash(query) % 1000
- Using HashTable + Heap to get top k for each files
- Collect 1000 top k queries and get global top k
- This method requires a lot of disk access and r/w, still slow.
Inaccurate (reduce memory from O(n) to O(k))
- Hash Count (only need to know this one) Limit the size of HashMap. The bigger the RAM, the more accurate is the result.
- Space Saving
- Lossy Counting
- Sticky Sampling
- Count Sketch
Bloom Filter
- Regular bloom filter - use 4 线性无关 formula
- Counting bloom filter - support delete
- Better DS than HashMap, but can loose some accuracy
Trie
Bitmap
Find all unique queries - use bigmap to store 3 types of states
System Design
Design a short url system
- Cache
to store hot urls
- Load Balance
Too many click in short time
- Storage balance
Hash value of an url and then store in individual machine
Expansibility?
- Consistent Hash
Node, can increase # of machines to store information
Migration process
- Router
check which machine response my query
light-weight calculations
what is router is down?
- Locale
url frequently access by China, then put the url storage in Beijing
Need-to-know Design patterns
- Singleton
- Factory
- Master-slave (esp. for relational DB)