Test sets

Main test sets (used on this site)

English Wikipedia as html files


This set consists of all articles in the English Wikipedia marked as either featured or good articles, converted into standalone html files.


Files: 5 410
Size: 298 MB
View, Download

Enron files


With a total of 43 426 files and a good mix of typical enterprise file types such as .pdf, .doc, .xls and images, the Enron file set is a good resource for simulating a file server.

This data set contains files sent as email attachments from about 150 users, mostly senior management of Enron. This data was originally made public, and posted to the web, by the US Federal Energy Regulatory Commission during its Enron investigation. The data set was created by extracting all email attachments from the original EDRM set.


Files: 43 401
Size: 7.9 GB
Top file types: doc: 21 102, xls: 9 589, pdf: 1 919, ppt: 1 823, jpg: 1 626, gif: 555, htm: 493, exe: 231, dat: 35
View, Download
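
If you want to check a downloaded copy of a set against the numbers listed here, a tally like the following may help. This is only a minimal sketch, not the tool used to produce the figures above. It assumes the set has been unpacked into a local folder, and the function names (count_file_types, total_size_bytes, summarize) are made up for illustration.

```python
from collections import Counter
from pathlib import Path


def count_file_types(root):
    """Count files under root by lowercase extension (without the dot)."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            ext = path.suffix.lower().lstrip(".") or "(none)"
            counts[ext] += 1
    return counts


def total_size_bytes(root):
    """Sum the sizes of all regular files under root, in bytes."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def summarize(root, top=9):
    """Build a short text summary in the same layout as the listings here."""
    counts = count_file_types(root)
    top_types = ", ".join(f"{ext}: {n}" for ext, n in counts.most_common(top))
    return (
        f"Files: {sum(counts.values())}\n"
        f"Size: {total_size_bytes(root)} bytes\n"
        f"Top file types: {top_types}"
    )
```

Calling print(summarize("enron_files")), where "enron_files" stands in for wherever you unpacked the set, prints the file count, the total size in bytes and the most common extensions.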

 

Additional test sets

Full English Wikipedia as html files

This set consists of all articles in English Wikipedia converted into standalone html files.

Files: 7 926 727
Size: 83 GB
View, Download

Enron email

This data set contains 852 088 emails from about 150 users, mostly senior management of Enron. This data was originally made public, and posted to the web, by the US Federal Energy Regulatory Commission during its Enron investigation. This data set was originally released by EDRM.


Files: 852 088
Size: 59 GB
View, Download

File samples

I am trying to collect as many different file formats as possible to use for testing text extraction. The collection is currently a work in progress. If you have files to contribute, please contact me.

Files: < 20
Size: < 100 MB
View

Lipsum

Example text in several different languages to test language compatibility. Thanks to lorem-ipsum.info for the text.

Files: < 20
Size: < 1 MB
View