n-gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.
The items can be phonemes, syllables, letters, words or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus.
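As a minimal sketch of the definition above, the following Python snippet slides a window of length n over a sequence and counts the resulting n-grams. The function name `ngrams` and the sample sentence are illustrative assumptions, not part of any particular library or corpus.

```python
from collections import Counter


def ngrams(items, n):
    """Return every contiguous run of n items as a tuple (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L has L - n + 1 windows; none if it is shorter than n.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Illustrative input only: word-level items from a made-up sentence.
words = "much the same as much the same".split()
trigram_counts = Counter(ngrams(words, 3))

# The same helper works on letters, because a string is also a sequence.
letter_bigrams = ngrams("text", 2)
```

Here `trigram_counts` maps each 3-word tuple to the number of times it occurs, which is the kind of frequency information an n-gram table records.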
Example
frequency | word1 | word2 | word3
1419 | much | the | same
461 | much | more | likely
432 | much | better | than
266 | much | more | difficult
235 | much | of | the
226 | much | more | than
Downloadable n-gram sets for English
- Google n-grams, based on the web as of 2006.
- COCA n-grams, based on the Corpus of Contemporary American English (COCA): 450 million words from 1990 to 2012.
With n-gram data (2-, 3-, 4- and 5-word sequences with their frequencies), we can carry out powerful queries offline.
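One simple kind of offline query is asking which words most often follow a given prefix. The sketch below assumes the n-gram data has been loaded into memory as (frequency, word1, word2, word3) rows; the rows are copied from the example table above, and the function name `continuations` is hypothetical. The file formats of the downloadable sets may differ, so loading is left out.

```python
# Rows copied from the example table: (frequency, word1, word2, word3).
TRIGRAMS = [
    (1419, "much", "the", "same"),
    (461, "much", "more", "likely"),
    (432, "much", "better", "than"),
    (266, "much", "more", "difficult"),
    (235, "much", "of", "the"),
    (226, "much", "more", "than"),
]


def continuations(rows, *prefix):
    """List (next_word, frequency) pairs for rows starting with `prefix`,
    most frequent first (hypothetical helper)."""
    k = len(prefix)
    matches = [
        (row[1 + k], row[0])
        for row in rows
        # Skip rows too short to have a word after the prefix.
        if len(row) > 1 + k and row[1:1 + k] == prefix
    ]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


after_much_more = continuations(TRIGRAMS, "much", "more")
after_much = continuations(TRIGRAMS, "much")
```

With these six rows, `after_much_more` lists "likely", "difficult" and "than" in descending order of frequency. A linear scan is enough for a small excerpt; for a full downloaded set, an index keyed on the prefix would likely be a better fit.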