II.3. Construction of historical n-grams corpora

II.3A. Creation of a digital sequence of 1-grams and extraction of n-gram counts

All input source texts were first converted into UTF-8 encoding before tokenization. Next, the text of each book was tokenized into a sequence of 1-grams using Google's internal tokenization libraries (more details on this approach can be found in Ref. S4). Tokenization is affected by two processes: (i) the reliability of the underlying OCR, especially vis-a-vis the position of blank spaces; (ii) the specific tokenizer rules used to convert the post-OCR text into a sequence of 1-grams.

Ordinarily, the tokenizer separates the character stream into words at the white space characters (\n [newline]; \t [tab]; \r [carriage return]; " " [space]). There are, however, several exceptional cases:

(1) Column-formatting in books often forces the hyphenation of words across lines. Thus the word "digitized" may appear on two lines in a book as "digi-<newline>tized". Prior to tokenization, we look for 1-grams that end with a hyphen ('-') followed by a newline whitespace character. We then concatenate the hyphen-ending 1-gram to the next 1-gram. In this manner, "digi-<newline>tized" becomes "digitized". This step takes place prior to any other steps in the tokenization process.

(2) Each of the following characters is always treated as a separate word:
    ! (exclamation-mark)
    @ (at)
    % (percent)
    ^ (caret)
    * (star)
    ( (open-round-bracket)
    ) (close-round-bracket)
    [ (open-square-bracket)
    ] (close-square-bracket)
    - (hyphen)
    = (equals)
    { (open-curly-bracket)
    } (close-curly-bracket)
    | (pipe)
    \ (backslash)
    : (colon)
    ; (semi-colon)
    < (less-than)
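
The rules described so far can be illustrated with a minimal Python sketch. It is not Google's internal tokenizer, whose implementation is not given here; it only mirrors the three behaviors stated in the text (joining line-break hyphenations first, splitting at the four white space characters, and emitting the listed characters as separate 1-grams). The names SEPARATE_CHARS, join_hyphenated_line_breaks and tokenize are hypothetical, the character set covers only the characters enumerated up to this point, and the handling of a carriage return before the newline is an assumption.

```python
import re

# Characters listed so far as always forming their own 1-gram.
# Assumption: this set is partial and covers only the characters
# enumerated in the text up to '<'.
SEPARATE_CHARS = "!@%^*()[]-={}|\\:;<"

_sep = re.escape(SEPARATE_CHARS)

# A token is either one separate character, or a maximal run of characters
# that are neither white space (space, tab, CR, newline) nor separate characters.
_TOKEN_RE = re.compile("[" + _sep + "]" + "|" + r"[^ \t\r\n" + _sep + "]+")


def join_hyphenated_line_breaks(text):
    """Rule (1): drop a word-final hyphen that is directly followed by a newline.

    Assumption: an optional carriage return before the newline is also dropped.
    """
    return re.sub(r"(?<=\S)-\r?\n", "", text)


def tokenize(text):
    """Return the list of 1-grams for `text` under the rules sketched above."""
    # Dehyphenation runs before any other tokenization step, as the text states.
    text = join_hyphenated_line_breaks(text)
    return _TOKEN_RE.findall(text)
```

For example, under these assumptions tokenize("digi-\ntized books (1800)") yields the 1-grams digitized, books, (, 1800 and ), while a hyphen that is not at a line break, as in "well-known", is emitted as its own 1-gram.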