HOUSE_OVERSIGHT_017013

← Prev Next →

Loading document…

II. Construction of Historical N-grams Corpora As noted in the paper text, we did not analyze the entire set of 15 million books digitized by Google. Instead, we 1. Performed further filtering steps to select only a subset of books with highly accurate metadata. 2. Subdivided the books into ‘base corpora’ using such metadata fields as language, country of publication, and subject. 3. For each base corpus, construct a massive numerical table that lists, for each n-gram (often a word or phrase), how often it appears in the given base corpus in every single year between 1550 and 2008. In this section, we will describe these three steps. These additional steps ensure high data quality, and also make it possible to examine historical trends without violating the ‘fair use’ principle of copyright law: our object of study is the frequency tables produced in step 3 (which are available as supplemental data), and not the full-text of the books. II.1. Additional filtering of books II.1A. Accuracy of Date-of-Publication metadata Accurate date-of-publication data is crucial component in the production of time-resolved n-grams data. Because our study focused most centrally on the English language corpus, we decided to apply more stringent inclusion criteria in order to make sure the accuracy of the date-of-publication data was as high as possible. We found that the lion's share of date-of-publication errors were due to so-called 'bound-withs' - single volumes that contain multiple works, such as anthologies or collected works of a given author. Among these bound-withs, the most inaccurately dated subclass were serial publications, such as journals and periodicals. For instance, many journals had publication dates which were erroneously attributed to the year in which the first issue of the journal had been published. These journals and serial publications also represented a different aspect of culture than the books did. For these reasons, we decided to filter out all serial pub

HOUSE_OVERSIGHT_017013 — Epstein Files

This document is part of the DOJ Epstein Files Transparency Act production (Public Law 119-38) — a corpus of 1,416,848 documents (2,915,593 pages) including prosecution files, FBI investigation records, court filings, and defense materials.

Enable JavaScript to view full page images, metadata, and cross-references.

Search the corpus: epstein-data.com/search

REST API: Get full document data as JSON

PDF: Download original PDF

Source Data Investigation Reports DOJ EFTA CC BY-NC-SA 4.0 Contact

You are leaving epstein-data.com

You are being redirected to an external website not operated by this project. We are not responsible for the content or privacy practices of external sites.

Research

Explore

Entities

Reports

Source

HOUSE_OVERSIGHT_017013

You are leaving epstein-data.com