home

epstein-data
Research ▼
🔍 SearchFull-text document search 🤖 Ask AIAI research assistant 🔎 Evidence MapFBI serial resolution 📷 Reverse Image SearchCLIP + face across 614K images 🧑 Find Face BETASearch 29K faces by photo 💻 Run Your OwnDownload & search locally
Explore ▼
📚 Full Text Corpus1.39M docs, 2.77M pages 🌎 Global Heatmap145 countries mentioned 📈 Coverage MapWhat's here 🌌 AtlasSemantic map · 1.29M docs ⚖ Cases53 federal & state cases · per-case briefings 🎤 DepositionsTranscribed audio & video 💬 Hear from the SurvivorsSurvivors in their own words 📖 Cover to Cover-Up24-hour public reading, synced to the video ✉ Wolff–Epstein Emails2,009 messages · 2009–2019
📷 Images92K analyzed photographs 🔍 Multi-DB SearchSearch all databases individually 🗃 All Databases14 searchable databases
Entities Reports
News ▼
📰 NewsCoverage & reporting ⚖ Justice MonitorArrests, charges, lawsuits, firings
Source ▼
🏛 DOJ ProductionOfficial EFTA disclosures 📜 EFTA Law TextPublic Law 119-38 📁 Source Data (GitHub)Open source databases
🌐 Community ResourcesCurated external projects ✉ ContactGeneral · privacy · DMCA · press
❤️ Donate 🎧 Podcast

Research

🔍 Search Documents 🤖 Ask AI 🔎 Evidence Map 📷 Reverse Image Search 🧑 Find Face BETA 💻 Run Your Own Investigator

Explore

📚 Full Text Corpus 🌎 Global Heatmap 📈 Coverage Map 🌌 Atlas ⚖ Cases 🎤 Depositions 💬 Hear from the Survivors 📖 Cover to Cover-Up ✉ Wolff–Epstein Emails 📷 Images 🔍 Multi-DB Search 🗃 All Databases

Entities

👥 Entity Directory

Reports

Browse All Reports 📰 News ⚖ Justice Monitor

Source

🏛 DOJ Production 📜 EFTA Law 📁 Source Data (GitHub) 🌐 Community Resources ✉ Contact
🎧 Podcast & Newsletter ❤️ Donate Privacy Policy

HOUSE_OVERSIGHT_017014

← Prev Next →
Loading document…

metadata dates that were more than 5 years from the date determined by a human examining the book. Because errors are much more common among older books, and because the actual corpora are strongly biased toward recent works, the likelihood of error in a randomly sampled book from the final corpus is much lower than 6.2%. As a point of comparison, 27 of 100 books (27%) selected at random from an unfiltered corpus contained date-of-publication errors of greater than 5 years. The unfiltered corpus was created using a sampling strategy similar to that of Eng-1M. This selection mechanism favored recent books (which are more frequent) and pre-1800 books, which were excluded in the sampling strategy for filtered books; as such the two numbers (6.2% and 27%) give a sense of the improvement, but are not strictly comparable. Note that since the base corpora were generated (August 2009), many additional improvements have been made to the metadata dates used in Google Book Search itself. As such, these numbers do not reflect the accuracy of the Google Book Search online tool. II.1B. OCR quality The challenge of performing accurate OCR on the entire books dataset is compounded by variations in such factors as language, font, size, legibility, and physical condition of the book. OCR quality was assessed using an algorithm developed by Popat et al. (Ref S3). This algorithm yields a probability that expresses the confidence that a given sequence of text generated by OCR is correct. Incorrect or anomalous text can result from gross imperfections in the scanned images, or as a result of markings or drawings. This algorithm uses sophisticated statistics, a variant of the Partial by Partial Matching (PPM) model, to compute for each glyph (character) the probability that it is anomalous given other nearby glyphs. (‘Nearby' refers to 2-dimensional distance on the original scanned image, hence glyphs above, below, to the left, and to the right of the target glyph.) The model parameters a

Suggest a category
Misclassified? Pick a better fit.
Community Notes
▸ People Mentioned
▸ Interest Level
Routine Notable Significant
▸ Dates Mentioned
▸ Related Topics
▸ Places & Organizations
▸ Transcription Correction
▸ Research Notes 0
No notes yet.
Related documents
Source Data Investigation Reports DOJ EFTA CC BY-NC-SA 4.0 Contact
Independent research project. Not affiliated with the U.S. Department of Justice, FBI, any government agency, or Anthropic. All analytical text on this site is AI-generated (Claude, Anthropic) and iteratively fact-checked against source documents, but may contain errors. Verify all claims against linked EFTA sources before citing.
Powered by Datasette  ·  ❤️ Buy me a coffee

You are leaving epstein-data.com

You are being redirected to an external website not operated by this project. We are not responsible for the content or privacy practices of external sites.

Powered by Datasette