Woodstock Blog

a tech blog for general algorithmic interview questions

[Facebook] Query Search (HashMap, Suffix Array)

Question

link

有一个长度为n的字符串str,有非常多的关键字query(长度不超过10),需要判断每个关键字是否是str的子串。

注意:query是动态的输入进行查询的,预先并不知道所有的query。

Solution

Best idea of the solution is to use Suffix Tree and similar alternatives.

Solution 1 is an nice idea using HashMap.

我是把所有长度< =10的子串,哈希一下存放到10个哈希表中。

至于哈希函数的选取,随便选一个应该都不会超时。

Solution 2 is using ‘suffix array’. The most important point of this idea is to only make a substring instance for every 10 characters.

只用=10的子串。然后二分查找

用=10的字串排序即可,如包含更短的串会使得预处理变成O(10nlg(10n))。 查找的复杂度可能没什么变化,使用<=10会是O(lg(10n)),而只使用=10子串初始化的话,因为可能还要进行短query跟长子串间的前缀比较,复杂度会是O(10lgn),稍微有点提高。

Which is to say, using substring length == 10, we comsume less time for pre-processing, and a little more time when querying.

Code

Suffix tree solution from here, note written by me

private List<String> prefixList;

// pre-process the large string
public void initWithString(String str) {
    Set<String> strs = new HashSet<String>();

    for(int i = 0; i < str.length(); ++i) {
        strs.add(str.substring(i, Math.min(str.length(), i + 10)));
    }
    prefixList = new ArrayList<String>(strs);
    Collections.sort(prefixList);
}

// find the query substring
public boolean existSubString(String query) {
    int low = 0;
    int high = prefixList.size() - 1;
    while(low <= high) {
        int mid = (low + high) / 2;
        int comp = prefixList.get(mid).compareTo(query);
        if(comp == 0)  {
            return true;
        }
        if(prefixList.get(mid).startsWith(query)) {
            return true;
        }
        if(comp > 0) //mid > query
        {
            high = mid - 1;
        }else{
            low = mid + 1;
        }
    }
    return false;
}