A few months ago I was asked to undertake a project for Tyndale House which involved searching through their catalogue for out-of-copyright books and trying to link these with electronic versions already on-line. This naturally brought me to archive.org to search through its massive collection of on-line texts. On the basis of that experience I thought it might be helpful to write down a few thoughts on the subject that may spark a discussion.
First of all, here are some of the many positive features of archive.org texts:
- There is a huge amount of material available. Well over 90% of the 450 titles I searched for were already there.
- This material can be downloaded in a wide range of formats, including PDF, DjVu, plain text, HTML and Kindle-compatible files.
- The site is supported by an enthusiastic user-base who are constantly adding new material.
On the negative side:
- Some books that are still under copyright in the UK because they were printed there are listed as being in the Public Domain on archive.org because it is hosted in the United States. To prevent them being downloaded outside the US, Google Books (linked from archive.org) blocks non-US IP addresses from accessing them, although this can of course be circumvented using a US-based proxy.
- Some material that is in the Public Domain in the UK is being blocked by Google Books.
- The previous two points serve as a reminder that users outside the US cannot rely on the accuracy of the copyright declarations on the site; you need to double-check everything.
- Some scans are incomplete and/or of poor quality.
- Scans to PDF are often very large files. In one trial I conducted, reprocessing the files reduced the file size by 50%.
- The search facility is fine if you know the exact title of the work you are after. However, if you misspell it or get a word wrong, the book you are after will not appear in the results.
- Perhaps as a result of the search limitations described in the previous point, the download counts shown next to certain titles are often surprisingly low.
These shortcomings also present some opportunities:
- Important UK-published theological books in the Public Domain could be re-scanned and hosted so as to avoid the unnecessary blocks on accessing them.
- Poor-quality scans can be replaced.
- For users on dial-up or slow Internet connections, there is scope for reprocessing selected works and hosting them elsewhere to reduce the file sizes.
- The site lends itself to use with specialist bibliographies (such as those provided by the TheologyOnTheWeb sites) that link directly to material hosted on archive.org. This gets round the search problem, provided the material is not being blocked.
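As a rough illustration of how a curated bibliography can get round the exact-title problem, here is a minimal Python sketch. It assumes the bibliography is held locally as a simple mapping from titles to item links. The titles, the `EXAMPLE-ID` links and the `find_title` helper are all invented for illustration and are not part of archive.org or any existing site. It uses the standard library's `difflib` to tolerate misspellings rather than requiring an exact match.

```python
import difflib

# Illustrative entries only: the titles and the EXAMPLE-ID links are
# placeholders, not real archive.org identifiers.
BIBLIOGRAPHY = {
    "A Commentary on the Epistle to the Romans": "https://archive.org/details/EXAMPLE-ID-1",
    "Lectures on the History of the Eastern Church": "https://archive.org/details/EXAMPLE-ID-2",
    "The Early History of the Christian Church": "https://archive.org/details/EXAMPLE-ID-3",
}


def find_title(query, bibliography=BIBLIOGRAPHY, cutoff=0.6):
    """Return (title, url) for the closest catalogued title, or None.

    find_title is a hypothetical helper; cutoff is the minimum similarity
    (0 to 1) that difflib must report before a match is accepted.
    """
    # Compare in lower case so that capitalisation does not affect the match.
    lowered = {title.lower(): title for title in bibliography}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    title = lowered[matches[0]]
    return title, bibliography[title]


if __name__ == "__main__":
    # A misspelled query can still be resolved to the catalogued entry.
    print(find_title("A Comentary on the Epistel to the Romans"))
```

The `cutoff` value is a judgment call: a lower value forgives more spelling mistakes but is more likely to return the wrong book, so it would need tuning against a real list of titles.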