HOUSE_OVERSIGHT_017017

, (comma) > (greater-than) ? (question-mark) / (forward-slash) ~ (tilde) ` (back-tick) " (double quote)

(3) The following characters are not tokenized as separate words: & (ampersand) _ (underscore). Examples of the resulting words include AT&T, R&D, and variable names such as HKEY_LOCAL_MACHINE.

(4) . (period) is treated as a separate word, except when it is part of a number or price, such as 99.99 or $999.95. A specific pattern matcher looks for numbers or prices and tokenizes these special strings as separate words.

(5) $ (dollar-sign) is treated as a separate word, except where it is the first character of a word consisting entirely of numbers, possibly containing a decimal point. Examples include $71 and $9.95.

(6) # (hash) is treated as a separate word, except when it is preceded by a-g, j or x. This covers musical notes such as A# (A-sharp), and the programming languages j# and x#.

(7) + (plus) is treated as a separate word, except when it appears at the end of a sequence of alphanumeric characters or "+"s. Thus the strings C++ and Na2+ would be treated as single words. These cases include many programming language names and chemical compound names.

(8) ' (apostrophe/single-quote) is treated as a separate word, except when it precedes the letter s, as in ALICE'S and Bob's.

The tokenization process for Chinese was different. For Chinese, an internal CJK (Chinese/Japanese/Korean) segmenter was used to break characters into word units. The CJK segmenter inserts spaces along common semantic boundaries. Hence, 1-grams that appear in the Chinese simplified corpora will sometimes contain strings with 1 or more Chinese characters.

Given a sequence of n 1-grams, we denote the corresponding n-gram by concatenating the 1-grams with a plain space character in between. A few examples of the tokenization and 1-gram construction method are provided in Table S2.

Each book edition was broken down into a series of 1-grams on a page-by-page basis. For each page of each book, we counted
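Rules (3) through (8) and the n-gram construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tokenizer actually used: the names `TOKEN_RE`, `tokenize`, and `ngrams` are hypothetical, the order in which the rules are tried is a guess, and because the list of separator characters is cut off at the top of this page, the sketch simply treats every other non-space character as its own word. Chinese text, which went through a separate CJK segmenter, is not handled.

```python
import re

# Core of a word: either a number/price such as 99.99 or $9.95 (rules 4, 5),
# or a run of word characters, where & and _ do not split words (rule 3).
_CORE = r"(?:\$?\d+(?:\.\d+)?(?![\w&])|[\w&]+)"

# Optional suffix that stays attached to the word:
#   '#' when preceded by a-g, j or x          (rule 6, e.g. A#)
#   one or more '+' after the word            (rule 7, e.g. C++, Na2+)
#   an apostrophe followed by the letter s    (rule 8, e.g. Bob's)
_SUFFIX = r"(?:(?<=[A-Ga-gJjXx])#|\++|'[sS](?![\w&]))?"

# Assumption: any remaining non-space character is a separate word.
TOKEN_RE = re.compile(_CORE + _SUFFIX + r"|\S")


def tokenize(text):
    """Split text into 1-grams following the sketched rules."""
    return TOKEN_RE.findall(text)


def ngrams(tokens, n):
    """Join each run of n consecutive 1-grams with a plain space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    words = tokenize("AT&T sells C++ for $9.95.")
    print(words)      # ['AT&T', 'sells', 'C++', 'for', '$9.95', '.']
    print(ngrams(words, 2))
```

A single ordered regular expression keeps the exceptions (prices, `A#`, `C++`, `Bob's`) ahead of the catch-all single-character case, which is one simple way to express "separate word, except when" rules.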

This document is part of the DOJ Epstein Files Transparency Act production (Public Law 119-38) — a corpus of 1,416,848 documents (2,915,593 pages) including prosecution files, FBI investigation records, court filings, and defense materials.
