Woodstock Blog

a tech blog for general algorithmic interview questions

[Design] Terminology: N-gram

n-gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.

The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.

Example

| frequency | word1 | word2  | word3     |
|-----------|-------|--------|-----------|
| 1419      | much  | the    | same      |
| 461       | much  | more   | likely    |
| 432       | much  | better | than      |
| 266       | much  | more   | difficult |
| 235       | much  | of     | the       |
| 226       | much  | more   | than      |

Downloadable n-grams sets for English

  1. Google n-grams, based on the web as of 2006.
  2. COCA n-grams, based on Corpus of Contemporary American English [COCA]. 450 million words from 1990 to 2012.

With n-gram data (2-, 3-, 4-, and 5-word sequences with their frequencies), we can carry out powerful queries offline.
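As a toy illustration (not taken from the datasets above), here is a minimal Java sketch that slides a window of n tokens over a word list and tallies each n-gram; the class and method names are illustrative:

import java.util.*;

public class NgramCounter {
    // Count every contiguous n-token sequence in a list of words.
    public static Map<String, Integer> countNgrams(List<String> tokens, int n) {
        Map<String, Integer> freq = new HashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            String ngram = String.join(" ", tokens.subList(i, i + n));
            freq.merge(ngram, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("much", "more", "likely", "much", "more", "than");
        System.out.println(countNgrams(tokens, 3));
        // {much more likely=1, more likely much=1, likely much more=1, much more than=1}
    }
}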

[Question] Most Frequent Word From a Book

Question

link

Let’s say I gave you a long String and I wanted you to tell me the most common word in that String. How would you do that?

Follow-up: how about if I gave you the entire works of Alexandre Dumas, one of the most prolific authors in history. How would your solution work? How could you change it to solve this more specific problem?

Solution

There are 2 solutions: HashMap or Trie. The HashMap is easier to think of first, but remember that a Trie is designed for exactly this kind of job.

A trie is more useful in the second situation, where you have millions of words. Keep a counter at every node to record that word's frequency.
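Here is a minimal sketch of the trie idea, assuming the input has already been split into words (the class and method names below are illustrative, not from the original post): every node keeps a counter for the word ending there, and a DFS retrieves the most frequent word.

import java.util.*;

class TrieCounter {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        int count; // how many times a word ending at this node was inserted
    }

    private final Node root = new Node();

    public void insert(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.count++;
    }

    // Return the most frequent inserted word (ties broken arbitrarily).
    public String mostFrequent() {
        String[] best = {null};
        int[] bestCount = {0};
        dfs(root, new StringBuilder(), best, bestCount);
        return best[0];
    }

    private void dfs(Node node, StringBuilder path, String[] best, int[] bestCount) {
        if (node.count > bestCount[0]) {
            bestCount[0] = node.count;
            best[0] = path.toString();
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            dfs(e.getValue(), path, best, bestCount);
            path.deleteCharAt(path.length() - 1);
        }
    }
}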

Now the follow up. For big data problems, we can always do divide and conquer by hash value.

Alternatively, the comment by Prince mentioned how to solve it with MapReduce:

Instead of loading it as a string first, I would stream the data so that I avoid a memory spike. Furthermore, our map size might increase as new words are added to it. Also, we can use a map-reduce type job where we create a map<word, frequency> of each play in parallel and later join/collapse them; this will reduce the time it takes to get the result.

Finding the most common phrase should be no different from the algorithm above. However, we need to rebuild our index with <phrase, frequency>.

[Amazon] Match Triplet With Reverse Order

Question

link

Find a substring of length 3 that also appears in reverse order in the string.

Ex: if the string is abcdcba, then since cba is the reverse of abc, we should return cba.

Solution

  1. HashSet (recommended). Hash all substrings of length 3 into a set, in O(n). Then look up the reverse of each length-3 substring in this set. O(n) time and O(n) space. See the sketch after this list.

  2. KMP Algo. Take every substring of length 3. Reverse it and find it in the input using KMP. O(n^2) time and O(1) space.

  3. Build suffix tree of height 3. Then in reverse order, check triplets.

The 3 solutions above all work well. Pick the one you love.
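Below is a minimal Java sketch of solution 1, the hash-set approach (names are illustrative). Note that a palindromic triplet such as "aba" matches its own reverse; add an extra check if the two occurrences must be distinct.

import java.util.*;

public class ReverseTriplet {
    // Store every 3-character substring, then look for a reversed match.
    public static String findReversedTriplet(String s) {
        Set<String> seen = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            seen.add(s.substring(i, i + 3));
        }
        for (int i = 0; i + 3 <= s.length(); i++) {
            String reversed = new StringBuilder(s.substring(i, i + 3)).reverse().toString();
            if (seen.contains(reversed)) {
                return reversed;
            }
        }
        return null; // no such triplet
    }

    public static void main(String[] args) {
        System.out.println(findReversedTriplet("abcdcba")); // cba
    }
}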

[Design] Big Data - Top K Frequency (Hands-on)

Question

link

The input is an endless stream of English words or phrases (we refer to them as tokens).

Output the top N tokens we have seen so far (over all the tokens we have seen!).

Analysis

We will discuss the following details of implementation and optimization.

  1. String into Integer
  2. Data Storage
  3. Process Incoming Streams
  4. Save result

1. String into Integer

This is a nice trick that improves efficiency a lot.

Although there are almost infinitely many possible words on the Internet, after accumulating a large set of words the chance of encountering a new word becomes lower and lower.

We have already found 4 million different words and assigned a unique ID to each. This is important, because sorting and comparing integers is much, much faster than sorting and comparing strings.
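A minimal sketch of such a word-to-ID dictionary (the class and method names are assumptions for illustration, not part of the original design): the first time a word is seen it gets the next integer ID, so all later counting and sorting can work on ints.

import java.util.*;

public class WordIdDictionary {
    private final Map<String, Integer> wordToId = new HashMap<>();
    private final List<String> idToWord = new ArrayList<>();

    // Return the existing ID for a word, or assign the next one.
    public int getOrAssignId(String word) {
        Integer id = wordToId.get(word);
        if (id == null) {
            id = idToWord.size();
            wordToId.put(word, id);
            idToWord.add(word);
        }
        return id;
    }

    public String wordOf(int id) {
        return idToWord.get(id);
    }
}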

2. Data Storage

The system keeps archived data for every token. Basically it is pairs of (Token, Frequency).

However, the table that stores this data would be so huge that we have to partition it physically. One partition scheme is based on the n-gram size of the token: if the token is a single word, it is a 1-gram; if the token is a two-word phrase, it is a 2-gram.

Of course, we can also divide the data by hash value. For background on n-grams, read [Design] Terminology: N-gram.

3. Process Incoming Streams

The system absorbs incoming sentences until memory is fully utilized (yes, we need a MemoryManager). After taking N sentences and storing them in memory, the system pauses and starts tokenizing each sentence into words and phrases. Each token (word or phrase) is counted.

This data-processing logic runs as a process under the Memory-Manager. The next part is another process running concurrently.

4. Save result

Meanwhile, there is another process that is activated whenever it finds a disk file generated by the system, and it starts merging that file. Since the disk files are sorted, merging works much like merge sort.

There are some more steps afterwards, but they are trivial. I have listed the basic steps for processing a large stream of incoming data (as strings) and finding the top K keywords.

I suggest you read previous [Design] Big Data - Top k Frequency posts before reading this.

[Design] Real Time Top K

Question

link

Given a continuous twitter feed, design an algorithm to return the 100 most frequent words used at this minute, this hour and this day.

Analysis

This is a frequent and useful problem for companies like Google and Twitter.

The first solution below is an approximation method that selects keywords occurring more than a certain threshold fraction of the time.

The second solution is more accurate but RAM-intensive.

Lossy Count (used to get an approximate trend)

Solution 1 is a modified version of Lossy Count. The detailed steps are explained here:

Start with an empty map (red-black tree). The keys will be search terms, and the values will be a counter for the term.

  1. Look at each item in the stream.

  2. If the term exists in the map, increment the associated counter.

  3. Otherwise, if the map has fewer candidates than you’re looking for, add it to the map with a count of one.

  4. However, if the map is “full”, decrement the counter in each entry. If any counter reaches zero during this process, remove it from the map.

This slide show explains the original Lossy Counting algorithm, which divides the input data into chunks, counts elements, and decreases every counter by 1 after each chunk.

Note that the result is NOT the exact set of top-frequency items. Instead, the final results are order-dependent, giving heavier weight to the counts processed last. That may be helpful in some cases, cuz we want to check the latest trend. However, if we want more accurate top keywords over all the data, we would do a second pass over the log data.

Now let's discuss the threshold. Use "aaabcd" and map size = 2 as an example. 'a' is inserted into the map with a count of 3. Then 'b' is inserted, and removed. 'c' is inserted, and removed. 'd' is inserted. Since counters are repeatedly decreased along the way, 'a' may end with a count as low as 1, even though it occurred 3 times. As explained here:

If we limit the map to 99 entries, we are guaranteed to find any term that occurs more than 1/(1 + 99) (1%) of the time.

We change the size of the map to change the threshold. The exact occurrence counts in the final result do not matter.
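A minimal Java sketch of the counting loop described in the four steps above, using a plain HashMap instead of a red-black tree (all names are illustrative):

import java.util.*;

public class FrequentTermCounter {
    private final int capacity;                       // number of candidate slots
    private final Map<String, Integer> counts = new HashMap<>();

    public FrequentTermCounter(int capacity) {
        this.capacity = capacity;
    }

    public void offer(String term) {
        if (counts.containsKey(term)) {
            counts.merge(term, 1, Integer::sum);      // step 2: known candidate
        } else if (counts.size() < capacity) {
            counts.put(term, 1);                      // step 3: room for a new candidate
        } else {
            // step 4: map is full, decrement every entry and evict zeros
            Iterator<Map.Entry<String, Integer>> it = counts.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Integer> e = it.next();
                if (e.getValue() == 1) {
                    it.remove();
                } else {
                    e.setValue(e.getValue() - 1);
                }
            }
        }
    }

    public Set<String> candidates() {
        // terms occurring more than 1/(capacity + 1) of the time are guaranteed to be here
        return counts.keySet();
    }
}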

Solution 2

Lossy counting does not actually produce the hourly, daily, and monthly results accurately. Solution 2 discusses how to retire old data in an accurate way.

As suggested by this answer, we keep a 30-day FIFO list for each keyword, recording its daily occurrence counts. When we remove the oldest counter and insert a new one, we update the monthly total.

Alternatively, this answer suggests keeping 1440 (24 * 60) HashMaps, each storing the information for one minute, plus another two HashMaps for the rolling totals of the past hour and the past day.

You need an array of 1440 (24*60) word+count hash maps organized the way that you describe; these are your minute-by-minute counts. You need two additional hash maps - for the rolling total of the hour and the day.

Define two operations on hash maps - add and subtract, with the semantic of merging counts of identical words, and removing words when their count drops to zero.

Each minute you start a new hash map, and update counts from the feed. At the end of the minute, you place that hash map into the array for the current minute, add it to the rolling total for the hour and for the day, and then subtract the hash map of an hour ago from the hourly running total, and subtract the hash map of 24 hours ago from the daily running total.

This is a very good solution, which I would recommend as the standard solution to this “Real Time Top k” problem.
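A minimal sketch of the add/subtract bookkeeping from the quote above, showing only the minute-by-minute and hourly windows (the daily window works the same way; names are illustrative):

import java.util.*;

public class RollingTopWords {
    private final Deque<Map<String, Integer>> lastHour = new ArrayDeque<>(); // up to 60 per-minute maps
    private final Map<String, Integer> hourTotal = new HashMap<>();
    private Map<String, Integer> currentMinute = new HashMap<>();

    public void record(String word) {
        currentMinute.merge(word, 1, Integer::sum);
    }

    // Called once a minute: roll the current minute into the hourly window.
    public void closeMinute() {
        add(hourTotal, currentMinute);
        lastHour.addLast(currentMinute);
        if (lastHour.size() > 60) {
            subtract(hourTotal, lastHour.removeFirst()); // drop the minute from 60 minutes ago
        }
        currentMinute = new HashMap<>();
    }

    public List<String> topK(int k) {
        return hourTotal.entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(java.util.stream.Collectors.toList());
    }

    private static void add(Map<String, Integer> total, Map<String, Integer> delta) {
        delta.forEach((w, c) -> total.merge(w, c, Integer::sum));
    }

    private static void subtract(Map<String, Integer> total, Map<String, Integer> delta) {
        delta.forEach((w, c) -> {
            int left = total.get(w) - c;
            if (left == 0) total.remove(w); else total.put(w, left);
        });
    }
}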

[Google] Number of Distinct Substrings

Question

link

Given a string, find the number of distinct substrings of the string. Example:

input = “aaaa”,

output = 4 (the 4 substrings are “a”, “aa”, “aaa”, “aaaa”)

input = “abcd”,

output = 10 (“a”, “b”, “c”, “d”, “ab”, “bc”, “cd”, “abc”, “bcd”, “abcd”)

input = “banana”,

output = 15 (“a”, “an”, “ana”, “anan”, “anana”, “b”, “ba”, “ban”, “bana”, “banan”, “banana”, “n”, “na”, “nan”, “nana”)

This is also a question on SPOJ.

Solution

This is a very good question, which tests knowledge of suffix trees, LCP, and string manipulation.

The solution is to build a suffix tree. This is because:

If you look through the prefixes of each suffix of a string, you have covered all substrings of that string.

There are 2 implementations. First one is slightly simpler.

Implementation 1

Suffix array + LCP (longest common prefix). Take “Banana” as input, then the suffixes:

0) BANANA
1) ANANA
2) NANA
3) ANA
4) NA
5) A

Sort it:

5) A
3) ANA
1) ANANA
0) BANANA
4) NA
2) NANA

Then we count the number of substrings (that is, the prefixes of each suffix). After removing the prefixes already counted for the previous suffix in sorted order, the counts are:

5) A - 1
3) ANA - 2
1) ANANA - 2
0) BANANA - 6
4) NA - 2
2) NANA - 2

Total number is:

1 + 2 + 2 + 6 + 2 + 2 = 15

But wait, notice something? The prefix "A" is simply a duplicate of a prefix of "ANA", which in turn shares its prefixes with "ANANA". Keep this in mind, cuz we need the same observation in the 2nd implementation, too.

Finally, the total number is calculated like this:

for each suffix
    ans += length(suffix) - LCP(suffix, previous suffix)

For more details, read here.
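A minimal Java sketch of implementation 1, using a simple O(n^2 log n) suffix sort and pairwise LCP, which is fine for short inputs (names are illustrative):

import java.util.*;

public class DistinctSubstrings {
    public static long countDistinct(String s) {
        int n = s.length();
        String[] suffixes = new String[n];
        for (int i = 0; i < n; i++) {
            suffixes[i] = s.substring(i);
        }
        Arrays.sort(suffixes);

        long count = 0;
        for (int i = 0; i < n; i++) {
            int lcp = (i == 0) ? 0 : lcp(suffixes[i - 1], suffixes[i]);
            count += suffixes[i].length() - lcp; // new prefixes contributed by this suffix
        }
        return count;
    }

    private static int lcp(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(countDistinct("banana")); // 15
        System.out.println(countDistinct("aaaa"));   // 4
    }
}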

Implementation 2

Build a suffix trie containing every suffix of the string.

The number of distinct substrings is simply the number of nodes in the trie, excluding the root. For "banana", the three branches under 'b', 'a', and 'n' contain 6, 5, and 4 nodes respectively, so the total is 6 + 5 + 4 = 15.
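A minimal sketch of implementation 2 using an uncompressed suffix trie: every node except the root corresponds to exactly one distinct substring, so we just count the nodes created while inserting all suffixes (O(n^2) space, so only suitable for short strings; names are illustrative):

import java.util.*;

public class SuffixTrieCount {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
    }

    public static int countDistinct(String s) {
        Node root = new Node();
        int nodes = 0;
        for (int i = 0; i < s.length(); i++) {            // insert every suffix s[i..]
            Node cur = root;
            for (int j = i; j < s.length(); j++) {
                char c = s.charAt(j);
                Node next = cur.children.get(c);
                if (next == null) {
                    next = new Node();
                    cur.children.put(c, next);
                    nodes++;                               // new node <=> new distinct substring s[i..j]
                }
                cur = next;
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        System.out.println(countDistinct("banana")); // 15
        System.out.println(countDistinct("abcd"));   // 10
    }
}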

[Google] Check All Numbers Given the Decimal Scale

Question

link

Check whether a string contains all representations of k-digit numbers in base a.

It is guaranteed that all substrings of the original string are valid k-digit base-a numbers.

"00110, a=2, k=2" => true (it covers 00, 01, 10 and 11)

Solution

First find all substrings of length k, then generate all k-digit base-a numbers and check that each one is present. This is not a difficult question.

We may want to store the substrings in a HashMap/HashSet. The hashing is preferably done with a rolling hash.

Rolling Hash

A rolling hash is a hash function where the input is hashed in a window that moves through the input.

A few hash functions allow a rolling hash to be computed very quickly—the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window—similar to the way a moving average function can be computed much more quickly than other low-pass filters.

Rolling Hash is commonly used in Rabin-Karp Algorithm to speed up strStr() string pattern matching.
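For this particular question the rolling hash can be exact: interpret the k-character window as a base-a number and update it in O(1) as the window slides. A minimal sketch, assuming the digits are the characters '0'..'9' (so a <= 10) and that a^k fits in a long (names are illustrative):

import java.util.*;

public class BaseCoverageCheck {
    // Check whether every k-digit base-a number appears as a length-k substring of s.
    public static boolean containsAll(String s, int a, int k) {
        long total = 1;
        for (int i = 0; i < k; i++) {
            total *= a;                                      // a^k distinct values to cover
        }
        if (s.length() < k) {
            return false;
        }
        long highest = total / a;                            // weight of the leading digit, a^(k-1)
        Set<Long> seen = new HashSet<>();
        long value = 0;
        for (int i = 0; i < s.length(); i++) {
            value = value * a + (s.charAt(i) - '0');         // append the incoming digit
            if (i >= k - 1) {
                seen.add(value);
                value -= (s.charAt(i - k + 1) - '0') * highest; // drop the outgoing digit
            }
        }
        return seen.size() == total;
    }

    public static void main(String[] args) {
        System.out.println(containsAll("00110", 2, 2)); // true: covers 00, 01, 11, 10
    }
}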

People in the original post discuss a "sliding window check" algorithm. I do not understand what the benefit of this is. If you read this and would like to help me, please leave a comment. Thanks!

A similar question

This is simply the reverse of the question above:

Give the shortest string that contains all representations of k-digit base-a numbers.

This question is solved using De Bruijn sequence.

[Facebook] Scheduling Jobs With Max Cost

Question

link

Given a set of n jobs with [start time, end time, cost], find a subset so that no 2 jobs overlap and the cost is maximum.

Let’s assume the input is already sorted by start_time.

Solution

Somebody mentioned Interval Graph, so check it out if you are interested!

I am going to discuss both DP and recursive solution.

This question reminds me of [Question] 0-1 Knapsack Problem and [Question] Coin Change Problem, cuz the basic idea is to compare 2 conditions:

  1. include current element
  2. or, not include current element

A very good DP solution is presented here. The code below was written by me and is very intuitive to read.

Leave me a comment if you have questions. And one more thing~ Happy new year!

Code

Dp

private int dpSolution(Job[] jobs, int size) {
    int[] dp = new int[size];
    dp[size - 1] = jobs[size - 1].cost;
    // cost of last job equals to just itself
    for (int k = size - 2; k >= 0; k--) {
        int next = findNextJob(jobs, k);
        int includeK = jobs[k].cost;
        if (next < size) {
            includeK += dp[next];
        }
        int excludeK = dp[k + 1];
        dp[k] = Math.max(includeK, excludeK);
    }
    return dp[0];
}

private int findNextJob(Job[] jobs, int k) {
    // first job (by index) that starts at or after jobs[k] finishes;
    // returns jobs.length if there is none (jobs are sorted by start time)
    int finishTime = jobs[k].finish;
    int next = k + 1;
    while (next < jobs.length) {
        if (jobs[next].start < finishTime) {
            next++;
        } else {
            break;
        }
    }
    return next;
}

Recursion

public int recursiveSolution(Job[] jobs, int size, int k) {
    // max cost from (k)th job and onwards
    if (k == size) {
        return 0;
    }
    // (k)th job must not conflict with any previous job
    int next = findNextJob(jobs, k);
    int includeK = jobs[k].cost + recursiveSolution(jobs, size, next);
    int excludeK = recursiveSolution(jobs, size, k + 1);
    return Math.max(includeK, excludeK);
}

private int findNextJob(Job[] jobs, int k) {
    int finishTime = jobs[k].finish;
    int next = k + 1;
    while (next < jobs.length) {
        if (jobs[next].start < finishTime) {
            next++;
        } else {
            break;
        }
    }
    return next;
}
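For completeness, here is a hypothetical Job class and a small driver matching the assumptions in the code above (the fields and sample values are illustrative):

class Job {
    int start;
    int finish;
    int cost;

    Job(int start, int finish, int cost) {
        this.start = start;
        this.finish = finish;
        this.cost = cost;
    }
}

// Example usage, with jobs already sorted by start time:
//   Job[] jobs = { new Job(1, 3, 5), new Job(2, 5, 6), new Job(4, 6, 5) };
//   dpSolution(jobs, jobs.length)           -> 10  (take the 1st and 3rd jobs: 5 + 5)
//   recursiveSolution(jobs, jobs.length, 0) -> 10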

[Question] Longest Common Substring

Question

link

Given two strings ‘X’ and ‘Y’, find the length of the longest common substring.

For example, if the given strings are “GeeksforGeeks” and “GeeksQuiz”, the output should be 5 as longest common substring is “Geeks”.

Solution

This question is to be distinguished from [LintCode] Longest Common Subsequence.

The solution is DP, too. Even the code is similar. Read my code below.

Updated on Jan 26th, 2015: there is another approach using Suffix Array, suggested by this post - but wait! Do not try to read that article, because you won’t easily understand its explanations. I will summarize it in simple English:

  1. Concatenate String X with a “#” sign, and String Y with a “$” sign.
  2. Build a suffix tree using both strings
  3. Find the deepest internal node that has suffixes from both strings in its subtree (i.e. both "#" and "$" appear below it).

In the case of (X = xabxa and Y = babxba), the branch "abx" is the deepest node that qualifies, and that is the result. Code is not written.

Code

public String LCSubstr(String s, String t) {
    int longest = 0;
    int tPos = -1;

    // dp[i][j] = length of the longest common substring ending at t[i-1] and s[j-1]
    int[][] dp = new int[t.length() + 1][s.length() + 1];
    for (int i = 1; i <= t.length(); i++) {
        for (int j = 1; j <= s.length(); j++) {
            if (t.charAt(i - 1) == s.charAt(j - 1)) {
                dp[i][j] = dp[i - 1][j - 1] + 1;
                if (dp[i][j] > longest) {
                    longest = dp[i][j];
                    tPos = i;
                }
            }
        }
    }
    if (longest == 0) {
        return ""; // no common substring at all
    }
    return t.substring(tPos - longest, tPos);
}

[Design] P2P Technology

Overview

Peer-to-peer (P2P) networking is a distributed application architecture that partitions tasks or work loads between peers.

Peers are both suppliers and consumers of resources, in contrast to the traditional client-server model where communication is usually to and from a central server. A typical example of a file transfer that uses the client-server model is the File Transfer Protocol (FTP) service in which the client and server programs are distinct: the clients initiate the transfer, and the servers satisfy these requests.

This architecture was popularized by the file sharing system Napster, originally released in 1999.

Procedure

  1. Alice runs P2P client software.
    1. connects to the Internet and gets a new IP address for each connection
    2. registers her files in the P2P system
    3. requests "Harry Potter"
    4. finds other peers who have a copy
    5. chooses one and copies the file to her PC
    6. meanwhile, Alice is serving her files to other people
  2. Acts like a server
  3. Acts like a client
  4. Uses keywords to search content (like Google)

P2P Types

  1. Unstructured P2P: no coupling between nodes and file location

    1. Napster
    2. Gnutella
    3. KaZaA
  2. Structured P2P: tight coupling between nodes and file location

    1. DHT

Napster

Register at Napster server.

Centralized search, and P2P distributing.

Gnutella

Decentralized design for searching:

  1. No central directory server
  2. Each node maintains a list of neighbors (overlay network)

Search by flooding:

  1. BFS traversal.
  2. Define maximum number of hops
  3. Expanded-ring TTL search means to try 1 hop first, then try 2 hops, then 3…

Join nodes:

  1. Use Bootstrap node to get some IP addresses
  2. Join via these IPs, which become neighbors.

Shortcomings:

  1. Flooding is NOT a scalable design.
  2. Download may not complete.
  3. Possibility of search failure, even when the resource is present.

KaZaA

Combines Napster and Gnutella.

Each peer is either a supernode or is assigned to a supernode. Each supernode connects to 30~50 other supernodes. A supernode acts like a mini-Napster hub.

At registration, a PC connects to a supernode. If a supernode goes down, the peer obtains an updated list and elects another one.

Search within the own supernode first, then in other supernodes. If many nodes hold the file, download from them in parallel.

Automatic recovery if one serving peer goes down. Use a ContentHash to search.

Structured P2P

For Distributed HashTable services, refer to [Design] Distributed hash table.

Conclusion

  1. Unstructured P2P:

    1. no coupling between nodes and file location
    2. Centralized directory service (except Gnutella)
    3. Search by flooding (overhead)
    4. Hierarchical architecture (non-scalable)
  2. Structured P2P:

    1. tight coupling between nodes and file location
    2. DHT using consistent hashing (e.g. Chord, and many other types)
    3. A node is assigned to hold particular content
    4. Search with more efficiency