Woodstock Blog

a tech blog for general algorithmic interview questions

[Design] Terminology: N-gram

n-gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.

The items can be phonemes, syllables, letters, words or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus.
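To make the definition concrete, here is a minimal Python sketch that slides a window of size n over a sequence and counts the resulting n-grams. The function names `ngrams` and `ngram_counts` and the sample sentence are illustrative assumptions, not part of any particular library or corpus.

```python
from collections import Counter


def ngrams(items, n):
    """Return every contiguous run of n items as a tuple, in order."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A sequence shorter than n yields an empty list.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def ngram_counts(items, n):
    """Count how often each n-gram occurs in the sequence."""
    return Counter(ngrams(items, n))


# Word-level bigrams from a tiny, made-up "corpus".
words = "to be or not to be".split()
bigram_counts = ngram_counts(words, 2)

# The same function works on letters, since a string is a sequence too.
letter_trigrams = ngrams("text", 3)
```

Here `bigram_counts[("to", "be")]` is 2, because that pair appears twice, and `letter_trigrams` holds the two letter 3-grams of "text". The items only need to form a sequence, which is why words and letters are handled by the same code.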

Example

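| frequency | word1 | word2  | word3     |
|-----------|-------|--------|-----------|
| 1419      | much  | the    | same      |
| 461       | much  | more   | likely    |
| 432       | much  | better | than      |
| 266       | much  | more   | difficult |
| 235       | much  | of     | the       |
| 226       | much  | more   | than      |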

Downloadable n-gram sets for English

  1. Google n-grams, based on the web as of 2006.
  2. COCA n-grams, based on the Corpus of Contemporary American English (COCA): 450 million words from 1990 to 2012.

With n-gram data (2-, 3-, 4-, and 5-word sequences, each with its frequency), we can carry out powerful queries offline, for example finding the words that most often follow a given phrase.
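As a rough illustration of such an offline query, the sketch below loads the six example rows from the table earlier in this post and looks up the most frequent completions of a two-word prefix. The names `TRIGRAMS`, `build_index` and `complete` are hypothetical, and a real downloaded n-gram file would need its own parsing step, which is not shown here.

```python
from collections import defaultdict

# The six example rows from this post: (frequency, word1, word2, word3).
TRIGRAMS = [
    (1419, "much", "the", "same"),
    (461, "much", "more", "likely"),
    (432, "much", "better", "than"),
    (266, "much", "more", "difficult"),
    (235, "much", "of", "the"),
    (226, "much", "more", "than"),
]


def build_index(rows):
    """Map each (n-1)-word prefix to its completions, most frequent first."""
    index = defaultdict(list)
    for freq, *words in rows:
        prefix, last = tuple(words[:-1]), words[-1]
        index[prefix].append((freq, last))
    for completions in index.values():
        completions.sort(reverse=True)  # highest frequency first
    return index


def complete(index, prefix, k=3):
    """Return up to k of the most frequent words that follow `prefix`."""
    return [word for _, word in index.get(tuple(prefix), [])[:k]]


index = build_index(TRIGRAMS)
suggestions = complete(index, ["much", "more"])
```

With only these six rows, `suggestions` is `['likely', 'difficult', 'than']`. Building the index once and answering each lookup from memory is what makes the query "offline": no corpus or web service is consulted at query time.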