Woodstock Blog

a tech blog for general algorithmic interview questions

[Design] Find Similar Library Books

Question

link

有很大一个电子图书馆,里面每本书的每一页都是OCR转换出来的text,所有每页大 约有5%的error(转换错误,错误分割单词,跳脱。。。)。设计一个方法判定图书馆 里是否有完全一样的书(duplicate),或者将来有书进来时判定同样的书是否已存在。

Solution

Very large string matching, we mush use hashing or similar technique. Since books have error, we need to do fuzzy search/matching. Bloom filter is designed to do this!

Refer to [Design] Big Data - Fuzzy Search Url (Bloom Filter).