home

epstein-data
Research ▼
🔍 SearchFull-text document search 🤖 Ask AIAI research assistant 🔎 Evidence MapFBI serial resolution 📷 Reverse Image SearchCLIP + face across 614K images 🧑 Find Face BETASearch 29K faces by photo 💻 Run Your OwnDownload & search locally
Explore ▼
📚 Full Text Corpus1.39M docs, 2.77M pages 🌎 Global Heatmap145 countries mentioned 📈 Coverage MapWhat's here 🌌 AtlasSemantic map · 1.29M docs ⚖ Cases53 federal & state cases · per-case briefings 🎤 DepositionsTranscribed audio & video 💬 Hear from the SurvivorsSurvivors in their own words 📖 Cover to Cover-Up24-hour public reading, synced to the video ✉ Wolff–Epstein Emails2,009 messages · 2009–2019
📷 Images92K analyzed photographs 🔍 Multi-DB SearchSearch all databases individually 🗃 All Databases14 searchable databases
Entities Reports
News ▼
📰 NewsCoverage & reporting ⚖ Justice MonitorArrests, charges, lawsuits, firings
Source ▼
🏛 DOJ ProductionOfficial EFTA disclosures 📜 EFTA Law TextPublic Law 119-38 📁 Source Data (GitHub)Open source databases
🌐 Community ResourcesCurated external projects ✉ ContactGeneral · privacy · DMCA · press
❤️ Donate 🎧 Podcast

Research

🔍 Search Documents 🤖 Ask AI 🔎 Evidence Map 📷 Reverse Image Search 🧑 Find Face BETA 💻 Run Your Own Investigator

Explore

📚 Full Text Corpus 🌎 Global Heatmap 📈 Coverage Map 🌌 Atlas ⚖ Cases 🎤 Depositions 💬 Hear from the Survivors 📖 Cover to Cover-Up ✉ Wolff–Epstein Emails 📷 Images 🔍 Multi-DB Search 🗃 All Databases

Entities

👥 Entity Directory

Reports

Browse All Reports 📰 News ⚖ Justice Monitor

Source

🏛 DOJ Production 📜 EFTA Law 📁 Source Data (GitHub) 🌐 Community Resources ✉ Contact
🎧 Podcast & Newsletter ❤️ Donate Privacy Policy

HOUSE_OVERSIGHT_017013

← Prev Next →
Loading document…

II. Construction of Historical N-grams Corpora As noted in the paper text, we did not analyze the entire set of 15 million books digitized by Google. Instead, we 1. Performed further filtering steps to select only a subset of books with highly accurate metadata. 2. Subdivided the books into ‘base corpora’ using such metadata fields as language, country of publication, and subject. 3. For each base corpus, construct a massive numerical table that lists, for each n-gram (often a word or phrase), how often it appears in the given base corpus in every single year between 1550 and 2008. In this section, we will describe these three steps. These additional steps ensure high data quality, and also make it possible to examine historical trends without violating the ‘fair use’ principle of copyright law: our object of study is the frequency tables produced in step 3 (which are available as supplemental data), and not the full-text of the books. II.1. Additional filtering of books II.1A. Accuracy of Date-of-Publication metadata Accurate date-of-publication data is crucial component in the production of time-resolved n-grams data. Because our study focused most centrally on the English language corpus, we decided to apply more stringent inclusion criteria in order to make sure the accuracy of the date-of-publication data was as high as possible. We found that the lion's share of date-of-publication errors were due to so-called 'bound-withs' - single volumes that contain multiple works, such as anthologies or collected works of a given author. Among these bound-withs, the most inaccurately dated subclass were serial publications, such as journals and periodicals. For instance, many journals had publication dates which were erroneously attributed to the year in which the first issue of the journal had been published. These journals and serial publications also represented a different aspect of culture than the books did. For these reasons, we decided to filter out all serial pub

Suggest a category
Misclassified? Pick a better fit.
Community Notes
▸ People Mentioned
▸ Interest Level
Routine Notable Significant
▸ Dates Mentioned
▸ Related Topics
▸ Places & Organizations
▸ Transcription Correction
▸ Research Notes 0
No notes yet.
Related documents
Source Data Investigation Reports DOJ EFTA CC BY-NC-SA 4.0 Contact
Independent research project. Not affiliated with the U.S. Department of Justice, FBI, any government agency, or Anthropic. All analytical text on this site is AI-generated (Claude, Anthropic) and iteratively fact-checked against source documents, but may contain errors. Verify all claims against linked EFTA sources before citing.
Powered by Datasette  ·  ❤️ Buy me a coffee

You are leaving epstein-data.com

You are being redirected to an external website not operated by this project. We are not responsible for the content or privacy practices of external sites.

Powered by Datasette