[Design] Terminology: N-gram

n-gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.

The items can be phonemes, syllables, letters, words or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus.
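To make the definition concrete, here is a minimal Python sketch that slides a window of size n over a sequence. The function name `ngrams` and the sample sentence are made up for illustration; they are not from any particular library or corpus.

```python
from collections import Counter


def ngrams(items, n):
    """Return every contiguous run of n items from a sequence, as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L has L - n + 1 windows of size n
    # (and none at all when L < n).
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Hypothetical sample text, split into word items.
words = "much more likely than much more difficult".split()

trigrams = ngrams(words, 3)          # word 3-grams
counts = Counter(ngrams(words, 2))   # how often each word 2-gram occurs

# The same function works for letters (or any other item type).
letter_bigrams = ngrams(list("text"), 2)
```

Because the items are only sliced and never inspected, the same helper covers words, letters, phonemes or base pairs, as long as the input is already split into items.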

Example

frequency	word1	word2	word3
1419	much	the	same
461	much	more	likely
432	much	better	than
266	much	more	difficult
235	much	of	the
226	much	more	than

Downloadable n-gram sets for English

Google n-grams, based on the web as of 2006.
COCA n-grams, based on the Corpus of Contemporary American English [COCA], which contains 450 million words from 1990 to 2012.

With n-gram data (2-, 3-, 4- and 5-word sequences, each with its frequency), we can carry out powerful queries offline.
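As a rough idea of what such an offline query might look like, the sketch below loads the six rows from the Example section and looks up the most frequent continuations of a prefix. It assumes a tab-separated layout with the frequency in the first column, as in that table; the actual download formats may differ, and `load_ngrams` and `completions` are hypothetical helper names.

```python
# Rows copied from the Example table: frequency, then the words, tab-separated.
TABLE = """\
1419\tmuch\tthe\tsame
461\tmuch\tmore\tlikely
432\tmuch\tbetter\tthan
266\tmuch\tmore\tdifficult
235\tmuch\tof\tthe
226\tmuch\tmore\tthan
"""


def load_ngrams(text):
    """Parse tab-separated lines into (frequency, words) pairs."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        freq, *words = line.split("\t")
        rows.append((int(freq), tuple(words)))
    return rows


def completions(rows, prefix):
    """Return the n-grams that start with prefix, most frequent first."""
    prefix = tuple(prefix)
    matches = [(freq, words) for freq, words in rows
               if words[:len(prefix)] == prefix]
    return sorted(matches, reverse=True)


rows = load_ngrams(TABLE)
for freq, words in completions(rows, ["much", "more"]):
    print(freq, " ".join(words))
```

A linear scan is fine for a handful of rows. For a full n-gram set, an index keyed on the prefix (a dictionary or a trie) would probably be a better fit.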
