Allen B. Riddell

Where are the novels?

Tue 14 February 2012

Between 1800 and 1836, 2,903 novels were published in England and the British Isles. This figure comes from the bibliographies of Garside, Raven, and Schöwerling (2000; 2006), and from the collective labor of countless others. We know that almost all of these novels survive in library collections somewhere. But how many have been scanned?

To answer that question, I took a random sample from the bibliography and tried to locate scans. Of the 82 novels in the sample, 58% had scans accessible by anyone (e.g. Internet Archive, Project Gutenberg), 8% had scans held by a for-profit company or by a library that had chosen not to allow the public to download the scan, 32% had no scans but had copies in library holdings, and 1% (a solitary novel) had no scan because there are no (known) surviving copies. I counted scans of subsequent editions or printings of novels as scans of the original. For example, if I located a scan of a Philadelphia edition of a novel first published in London, I counted that novel as having a scan. The lion’s share of the private/for-profit scans are from the Corvey Collection, which the publisher Gale appears to control. Needless to say, these novels (and scans, microfiche copies, etc.) are all in the public domain in the United States and many other countries.

Figure 1

Based on the sample, we may guess that about 58% (somewhere between 47% and 68%) of the 2,903 novels have publicly accessible scans.[1] For any given novel, however, the chance of finding a scan seems to depend on two things: (1) the novel’s year of publication and (2) whether the novel had subsequent editions or printings (see Figure 2). For example, it is easier to find a scan of a novel published in 1830 than a scan of one published in 1800. Scans of novels with subsequent editions (like an American edition) are easier to find as well. This should come as no surprise. Print runs were smaller the further back one goes, and older books are, in any event, more likely to end up in libraries’ special collections or other places spared from the initial wave of library digitization. And having subsequent editions amounts to “having more copies printed” given the way I’ve tallied the data.

Figure 2

Two results stand out. First, the 19th-century British novel is a phenomenally well-preserved part of cultural history. Copies of nearly every novel published during the period survive. Second, the proportion of novels published between 1800 and 1820 that have been scanned is low, likely around 33% based on this sample. This raises concerns about any claim of representativeness made on behalf of existing corpora covering those years. As libraries and private collections continue to be digitized and, I hope, made publicly accessible, such concerns should diminish.

Data and Code

I’ve posted the data and the code for the graphs online. Figure 2 comes from a logistic regression. Good introductions to this kind of analysis include Peter Hoff’s A First Course in Bayesian Statistical Methods (2009) and Data Analysis Using Regression and Multilevel/Hierarchical Models (2006) by Andrew Gelman and Jennifer Hill.
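For readers who have not met logistic regression before, here is a minimal sketch of the general idea, using the two predictors discussed above (year of publication and whether a later edition exists). It is not the code behind Figure 2. The rows in TOY_ROWS are invented purely for illustration, the function names are my own, and the fit is plain maximum likelihood by gradient ascent rather than the Bayesian approach the books above describe.

```python
import math

# Made-up toy rows, NOT the data from this post:
# (year of publication, has a later edition, has a public scan)
TOY_ROWS = [
    (1800, 0, 0), (1802, 0, 0), (1805, 1, 1), (1808, 0, 0), (1810, 0, 1),
    (1812, 1, 0), (1815, 0, 0), (1818, 1, 1), (1820, 0, 1), (1823, 0, 0),
    (1825, 1, 1), (1828, 0, 1), (1830, 0, 1), (1832, 1, 1), (1835, 0, 1),
]


def sigmoid(z):
    """Map a real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def features(year, edition):
    """Intercept, year centred on 1818 and measured in decades, edition flag."""
    return (1.0, (year - 1818) / 10.0, float(edition))


def fit_logistic(rows, learning_rate=0.5, steps=5000):
    """Fit the three weights by gradient ascent on the mean log-likelihood."""
    xs = [features(year, edition) for year, edition, _ in rows]
    ys = [scanned for _, _, scanned in rows]
    n = len(rows)
    weights = [0.0, 0.0, 0.0]
    for _ in range(steps):
        gradient = [0.0, 0.0, 0.0]
        for x, y in zip(xs, ys):
            error = y - sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j in range(3):
                gradient[j] += error * x[j] / n
        weights = [w + learning_rate * g for w, g in zip(weights, gradient)]
    return weights


def predict(weights, year, edition):
    """Estimated probability that a novel with these traits has a scan."""
    x = features(year, edition)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


weights = fit_logistic(TOY_ROWS)
print("intercept, per-decade, later-edition:", [round(w, 2) for w in weights])
print("1800, no later edition:", round(predict(weights, 1800, 0), 2))
print("1830, with later edition:", round(predict(weights, 1830, 1), 2))
```

A positive weight on the year term means the estimated chance of finding a scan rises with publication date, and a positive weight on the edition flag means later editions help. With the real data, the posted code is the place to look for the actual estimates.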


  1. I calculated this range with a binomial sampling model (n = 83) and a uniform prior. The original sample was with replacement.
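A calculation of this kind can be sketched as follows. With a uniform prior and a binomial likelihood, the posterior for the proportion is a Beta(k + 1, n - k + 1) distribution, and the range is read off its 2.5% and 97.5% quantiles. The count k = 48 is my own back-calculation from the reported 58%, not a figure stated in the post, and the function name and grid approach are illustrative choices, not necessarily how the original range was computed.

```python
import math


def uniform_prior_interval(successes, trials, mass=0.95, grid_size=100_001):
    """Equal-tailed credible interval for a binomial proportion.

    A uniform prior gives a Beta(successes + 1, failures + 1) posterior.
    The quantiles are found by trapezoid integration of the density on a
    grid. The density is treated as zero at exactly 0 and 1, which is
    correct when there is at least one success and one failure.
    """
    a = successes + 1
    b = trials - successes + 1
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def density(p):
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return math.exp(
            log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1.0 - p)
        )

    step = 1.0 / (grid_size - 1)
    lower_tail = (1.0 - mass) / 2.0
    upper_tail = 1.0 - lower_tail
    cumulative = 0.0
    previous = density(0.0)
    lower = None
    for i in range(1, grid_size):
        p = i * step
        current = density(p)
        cumulative += 0.5 * (previous + current) * step
        previous = current
        if lower is None and cumulative >= lower_tail:
            lower = p
        if cumulative >= upper_tail:
            return lower, p
    return lower, 1.0


# Assumed counts: 48 novels with public scans out of n = 83 draws.
lower, upper = uniform_prior_interval(48, 83)
print(f"{lower:.2f} to {upper:.2f}")
```

If the assumed counts are right, the result should land close to the 47% to 68% range reported above.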