HOUSE_OVERSIGHT_017018

II.3B. Generation of historical n-grams corpora

To generate a particular historical n-grams corpus, a subset of book editions is chosen to serve as the base corpus. The chosen editions are divided by publication year. For each publication year, total counts for each n-gram are obtained by summing the n-gram counts of every book edition published in that year. In particular, three counts are generated: (1) the total number of times the n-gram appears; (2) the number of pages on which the n-gram appears; and (3) the number of books in which the n-gram appears. We then generate tables showing all three counts for each n-gram, resolved by year.

To ensure that n-grams could not easily be used to identify individual text sources, we did not report counts for any n-gram that appeared fewer than 40 times in the corpus. (As a point of reference, the total number of 1-grams that appear in the 3.2 million books written in English with highest date accuracy ('eng-all', see below) is 360 billion: a 1-gram that appears fewer than 40 times occurs at a frequency of the order of 10^-10.) As a result, rare spelling and OCR errors were also omitted. Since most n-grams are infrequent, this also served to dramatically reduce the size of the n-gram tables. Of course, the most robust historical trends are associated with frequent n-grams, so our ability to discern these trends was not compromised by this approach.

By dividing the reported counts by the corpus size (measured in words, pages, or books), it is possible to determine the normalized frequency with which an n-gram appears in the base corpus. Note that the different counts can be used for different purposes. The usage frequency of an n-gram, normalized by the total number of words, reflects both the number of authors using an n-gram and how frequently they use it. It can be driven upward markedly by a single author who uses an n-gram very frequently, for instance in a biography of 'Gottlieb Daimler' whi
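The counting, thresholding, and normalization steps described in this section can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the input layout (each edition as a dict with a 'year' and a list of pages, each page a list of n-gram strings) and the names build_ngram_table and normalized_frequency are assumptions made for illustration only. The threshold of 40 is the one stated in the text.

```python
from collections import defaultdict

MIN_TOTAL_COUNT = 40  # reporting threshold stated in the text


def build_ngram_table(editions, min_total=MIN_TOTAL_COUNT):
    """Return {(ngram, year): (match_count, page_count, book_count)}.

    `editions` is a hypothetical input format: an iterable of dicts with
    a 'year' and 'pages', where each page is a list of n-gram strings.
    """
    match = defaultdict(int)  # (ngram, year) -> number of occurrences
    page = defaultdict(int)   # (ngram, year) -> number of pages containing it
    book = defaultdict(int)   # (ngram, year) -> number of books containing it
    total = defaultdict(int)  # ngram -> occurrences across the whole corpus

    for edition in editions:
        year = edition["year"]
        seen_in_book = set()
        for page_ngrams in edition["pages"]:
            for gram in page_ngrams:
                match[(gram, year)] += 1
                total[gram] += 1
            # Count each page at most once per n-gram.
            for gram in set(page_ngrams):
                page[(gram, year)] += 1
                seen_in_book.add(gram)
        # Count each book at most once per n-gram.
        for gram in seen_in_book:
            book[(gram, year)] += 1

    # Drop n-grams whose corpus-wide total falls below the threshold.
    return {
        key: (count, page[key], book[key])
        for key, count in match.items()
        if total[key[0]] >= min_total
    }


def normalized_frequency(count, corpus_size):
    """Divide a reported count by the corpus size (words, pages, or books)."""
    return count / corpus_size
```

As a check on the figure quoted above, normalized_frequency(40, 360e9) is roughly 1.1e-10, consistent with a frequency of the order of 10^-10 for a 1-gram at the reporting threshold.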

HOUSE_OVERSIGHT_017018 — Epstein Files

This document is part of the DOJ Epstein Files Transparency Act production (Public Law 119-38) — a corpus of 1,416,848 documents (2,915,593 pages) including prosecution files, FBI investigation records, court filings, and defense materials.
