HOUSE_OVERSIGHT_017015

(approximately 235,000) of the books were filtered out in this way. Table S1 lists the fraction removed at this stage for our other non-English corpora.

II.1D. Year Restriction

In order to further ensure publication date accuracy and consistency of dates across all our corpora, we implemented a publication year restriction and only retained books with publication years starting from 1550 and ending in 2008. We found that a significant fraction of mis-dated books have a publication year of 0 or dates prior to the invention of printing. The number of books filtered due to this year range restriction is small, usually under 2% of the original number of books.

The fraction of the corpus removed by all stages of the filtering is summarized in Table S1. Note that because the filters are applied in a fixed order, the statistics presented below are influenced by the sequence in which the filters were applied. For example, books that trigger both the OCR quality filter and the language correction filter are excluded by the OCR quality filter, which is performed first. Of course, the actual subset of books filtered is the same regardless of the order in which the filters are applied.

II.2. Metadata based subdivision of the Google Books Collection

II.2A. Determination of language

To create accurate corpora in particular languages that minimize cross-language contamination, it is important to be able to accurately associate books with the language in which they were written. To determine the language in which a text is written, we rely on metadata derived from our 100 bibliographic sources, as well as statistical language determination using the Popat algorithm (Ref S3). The algorithm takes advantage of the fact that certain character sequences, such as 'the', 'of', and 'ion', occur more frequently in English. In contrast, the sequences 'la', 'aux', and 'de' occur more frequently in French. These patterns can be used to distinguish between books written in Engl
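The point in II.1D about filter order can be illustrated with a small sketch: when filters run in a fixed sequence, each removed book is counted against the first filter it fails, so the per-stage counts change with the order while the retained set does not. Everything below is a hypothetical illustration, not the authors' pipeline: the book records, the field names (`ocr_ok`, `lang_ok`), and the function `run_filters` are assumptions, and the year range 1550 to 2008 is read as inclusive on both ends.

```python
from collections import Counter

# Hypothetical book records; field names are illustrative assumptions,
# not the schema used in the original work.
BOOKS = [
    {"id": "a", "year": 1890, "ocr_ok": True,  "lang_ok": True},
    {"id": "b", "year": 0,    "ocr_ok": True,  "lang_ok": True},   # mis-dated
    {"id": "c", "year": 1950, "ocr_ok": False, "lang_ok": False},  # fails two filters
    {"id": "d", "year": 2010, "ocr_ok": True,  "lang_ok": False},  # fails two filters
    {"id": "e", "year": 1550, "ocr_ok": True,  "lang_ok": True},
]

# Each filter returns True when the book passes.
# The year bounds are assumed to be inclusive.
FILTERS = {
    "ocr_quality": lambda b: b["ocr_ok"],
    "language": lambda b: b["lang_ok"],
    "year_range": lambda b: 1550 <= b["year"] <= 2008,
}


def run_filters(books, order):
    """Apply filters in `order`; charge each removal to the first failed filter."""
    removed_by = Counter()
    kept = []
    for book in books:
        for name in order:
            if not FILTERS[name](book):
                removed_by[name] += 1
                break  # later filters never see this book
        else:
            kept.append(book["id"])
    return kept, dict(removed_by)


kept_1, stats_1 = run_filters(BOOKS, ["ocr_quality", "language", "year_range"])
kept_2, stats_2 = run_filters(BOOKS, ["year_range", "language", "ocr_quality"])

print(kept_1, stats_1)
print(kept_2, stats_2)
```

In this toy data, book "c" is charged to the OCR quality filter in the first ordering and to the language filter in the second, yet both orderings retain the same two books.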

Independent research project. Not affiliated with the U.S. Department of Justice, FBI, any government agency, or Anthropic. All analytical text on this site is AI-generated (Claude, Anthropic) and iteratively fact-checked against source documents, but may contain errors. Verify all claims against linked EFTA sources before citing.