I. Overview of Google Books Digitization In 2004, Google began scanning books to make their contents searchable and discoverable online. To date, Google has scanned over fifteen million books: over 11% of all the books ever published. The collection contains over five billion pages and two trillion words, with books dating back to as early as 1473 and with text in 478 languages. Over two million of these scanned books were given directly to Google by their publishers; the rest are borrowed from large libraries such as the University of Michigan and the New York Public Library. The scanning effort involves significant engineering challenges, some of which are highly relevant to the construction of the historical n-grams corpus. We survey those issues here. The result of the next three steps is a collection of digital texts associated with particular book editions, as well as composite metadata for each edition combining the information contained in all metadata sources. I.1. Metadata Over 100 sources of metadata information were used by Google to generate a comprehensive catalog of books. Some of these sources are library catalogs (e.g., the list of books in the collections of University of Michigan, or union catalogs such as the collective list of books in Bosnian libraries), some are from retailers (e.g., Decitre, a French bookseller), and some are from commercial aggregators (e.g., Ingram). In addition, Google also receives metadata from its 30,000 partner publishers. Each metadata source consists of a series of digital records, typically in either the MARC format favored by libraries, or the ONIX format used by the publishing industry. Each record refers to either a specific edition of a book or a physical copy of a book on a library shelf, and contains conventional bibliographic data such as title, author(s), publisher, date of publication, and language(s) of publication. Cataloguing practices vary widely among these sources, and even within a single source