capping strategy also minimizes the bias towards modern books that might otherwise result from the sharp rise in the number of books published in recent decades.

Eng-Modern-1M
This corpus was generated exactly as Eng-1M above, except that it contains no books from before 1800.

Eng-US
This is derived from a base corpus containing all English-language books that pass the filters described in section 1, but with a quality filtering threshold of 60%, and that have 'United States' as their country of publication, as reflected by the 2-letter country code "us".

Eng-UK
This is derived from a base corpus containing all English-language books that pass the filters described in section 1, but with a quality filtering threshold of 60%, and that have 'United Kingdom' as their country of publication, as reflected by the 2-letter country code "gb".

Fre-all
This is derived from a base corpus containing all French-language books that pass the series of filters described in section 1.

Ger-all
This is derived from a base corpus containing all German-language books that pass the series of filters described in section 1.

Spa-all
This is derived from a base corpus containing all Spanish-language books that pass the series of filters described in section 1.

Rus-all
This is derived from a base corpus containing all Russian-language books that pass the series of filters described in section 1C-D.

Chi-sim-all
This is derived from a base corpus containing all books written using the simplified Chinese character set that pass the series of filters described in section 1C-D.

Heb-all
This is derived from a base corpus containing all Hebrew-language books that pass the series of filters described in section 1.
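The corpus definitions above amount to a small set of selection rules over book metadata (language, country of publication, quality threshold). The following is a minimal sketch of that idea in Python, not the authors' actual pipeline. The `Book` and `CorpusSpec` types, their field names, the language codes, and the reading of the 60% threshold as a minimum quality score are all assumptions made for illustration; the section 1 filters themselves are assumed to have been applied upstream and are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Book:
    """Hypothetical metadata record; field names are illustrative only."""
    language: str   # assumed language code, e.g. "eng", "fre"
    country: str    # 2-letter country code of publication, e.g. "us", "gb"
    quality: float  # assumed quality score in [0, 1]


@dataclass(frozen=True)
class CorpusSpec:
    """Hypothetical selection rule for one base corpus."""
    language: str
    country: Optional[str] = None         # None: any country of publication
    min_quality: Optional[float] = None   # None: no corpus-specific threshold

    def admits(self, book: Book) -> bool:
        """Return True if the book satisfies this corpus's selection rule."""
        if book.language != self.language:
            return False
        if self.country is not None and book.country != self.country:
            return False
        # Assumption: "threshold of 60%" means quality score >= 0.60.
        if self.min_quality is not None and book.quality < self.min_quality:
            return False
        return True


# Specs mirroring three of the definitions in the text.
ENG_US = CorpusSpec(language="eng", country="us", min_quality=0.60)
ENG_UK = CorpusSpec(language="eng", country="gb", min_quality=0.60)
FRE_ALL = CorpusSpec(language="fre")


def select(spec: CorpusSpec, books: list[Book]) -> list[Book]:
    """Keep only the books admitted by the given corpus spec."""
    return [b for b in books if spec.admits(b)]
```

Under these assumptions, `select(ENG_US, books)` keeps only English-language books with country code "us" and a quality score of at least 0.60, while `select(FRE_ALL, books)` keeps every French-language book in the input.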