HOUSE_OVERSIGHT_017016


II.3. Construction of historical n-grams corpora

II.3A. Creation of a digital sequence of 1-grams and extraction of n-gram counts

All input source texts were first converted into UTF-8 encoding before tokenization. Next, the text of each book was tokenized into a sequence of 1-grams using Google's internal tokenization libraries (more details on this approach can be found in Ref. S4). Tokenization is affected by two processes: (i) the reliability of the underlying OCR, especially vis-à-vis the position of blank spaces; (ii) the specific tokenizer rules used to convert the post-OCR text into a sequence of 1-grams.

Ordinarily, the tokenizer separates the character stream into words at the whitespace characters (\n [newline]; \t [tab]; \r [carriage return]; " " [space]). There are, however, several exceptional cases:

(1) Column formatting in books often forces the hyphenation of words across lines. Thus the word "digitized" may appear on two lines in a book as "digi-<newline>tized". Prior to tokenization, we look for 1-grams that end with a hyphen ('-') followed by a newline whitespace character. We then concatenate the hyphen-ending 1-gram to the next 1-gram. In this manner, "digi-<newline>tized" becomes "digitized". This step takes place prior to any other steps in the tokenization process.

(2) Each of the following characters is always treated as a separate word:

! (exclamation-mark)
@ (at)
% (percent)
^ (caret)
* (star)
( (open-round-bracket)
) (close-round-bracket)
[ (open-square-bracket)
] (close-square-bracket)
- (hyphen)
= (equals)
{ (open-curly-bracket)
} (close-curly-bracket)
| (pipe)
\ (backslash)
: (colon)
; (semi-colon)
< (less-than)
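The tokenization rules described above can be illustrated with a minimal Python sketch. It is not Google's internal tokenizer, which is not specified here; the names `join_hyphenated`, `tokenize`, and `SEPARATE_CHARS` are hypothetical, and the separator set holds only the characters listed in this section, so it should be read as partial.

```python
import re
from typing import List

# Characters named in the text as always forming their own 1-gram.
# Assumption: only the characters listed in this section are included.
SEPARATE_CHARS = "!@%^*()[]-={}|\\:;<"


def join_hyphenated(text: str) -> str:
    """Rule (1): glue a hyphen-ending 1-gram to the 1-gram on the next line."""
    # Drop a hyphen that is immediately followed by a line break.
    return re.sub(r"-\r?\n", "", text)


def tokenize(text: str) -> List[str]:
    """Split text into 1-grams following the rules sketched in this section."""
    # Dehyphenation runs before any other tokenization step.
    text = join_hyphenated(text)
    # Rule (2): pad each listed character with spaces so it splits off alone.
    char_class = "[" + re.escape(SEPARATE_CHARS) + "]"
    text = re.sub(char_class, lambda m: " " + m.group(0) + " ", text)
    # Default rule: split at newline, tab, carriage return, and space.
    return [tok for tok in re.split(r"[ \n\t\r]+", text) if tok]


if __name__ == "__main__":
    print(tokenize("The word digi-\ntized (twice)!"))
```

Because the hyphen is itself in the separator set, a hyphen that does not sit at a line break splits its word into three 1-grams in this sketch, while a line-break hyphen is removed first and the two halves are joined.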

HOUSE_OVERSIGHT_017016 — Epstein Files

This document is part of the DOJ Epstein Files Transparency Act production (Public Law 119-38) — a corpus of 1,416,848 documents (2,915,593 pages) including prosecution files, FBI investigation records, court filings, and defense materials.
