Main test sets (used on this site)
English Wikipedia as html files
This set consists of all articles in the English Wikipedia marked as either featured or good articles, converted into standalone html files.
| Files: | 5 410 |
| Size: | 298 MB |
| View, Download |
Enron files
With a total of 43 426 files and a good mix of typical enterprise file types (.pdf, .doc, .xls, images, etc.), the Enron file set is a good resource for simulating a file server.
This data set contains files sent as email attachments by about 150 users, mostly senior management at Enron. The data was originally made public and posted to the web by the US Federal Energy Regulatory Commission during its Enron investigation. The data set was created by extracting all email attachments from the original EDRM set.
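As a rough illustration of the extraction step described above, the following Python sketch saves the attachments of a single message. It assumes the message is available as an RFC 822 `.eml` file; the function name `extract_attachments` and the fallback file name are hypothetical, and this is not the tooling that was actually used to build the set.

```python
import email
from email import policy
from pathlib import Path


def extract_attachments(eml_path, out_dir):
    """Save every attachment of one .eml message and return the saved paths.

    Hypothetical helper for illustration only.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # policy.default yields an EmailMessage, which provides iter_attachments().
    with open(eml_path, "rb") as fh:
        msg = email.message_from_binary_file(fh, policy=policy.default)

    saved = []
    for index, part in enumerate(msg.iter_attachments()):
        # Keep only the base name so a crafted file name cannot escape out_dir.
        name = Path(part.get_filename() or f"attachment_{index}").name
        payload = part.get_payload(decode=True)
        if payload is None:
            # Multipart attachments (e.g. forwarded messages) have no raw payload.
            continue
        target = out_dir / name
        target.write_bytes(payload)
        saved.append(target)
    return saved
```

Attachments that share a file name overwrite each other in this sketch; a real extraction over many mailboxes would need a naming scheme that keeps them apart.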
| Files: | 43 401 |
| Size: | 7.9 GB |
| Top file types: | doc: 21 102, xls: 9 589, pdf: 1 919, ppt: 1 823, jpg: 1 626, gif: 555, htm: 493, exe: 231, dat: 35 |
| View, Download |
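To check a downloaded copy against the figures above, a per-extension tally can be computed along these lines. This is a minimal sketch: the function name `count_file_types` and the directory name in the usage line are assumptions, not part of the data set.

```python
from collections import Counter
from pathlib import Path


def count_file_types(root):
    """Count the files under root by lower-cased extension (hypothetical helper)."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # "Report.DOC" and "report.doc" both count as "doc";
            # files without an extension are grouped under "(none)".
            ext = path.suffix.lower().lstrip(".") or "(none)"
            counts[ext] += 1
    return counts
```

For example, `count_file_types("enron_files").most_common(9)` would list the nine most frequent extensions, assuming the set was unpacked into a local directory named `enron_files`, and `sum(counts.values())` gives the total number of files.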
Additional test sets
Full English Wikipedia as html files
This set consists of all articles in the English Wikipedia, converted into standalone html files.
| Files: | 7 926 727 |
| Size: | 83 GB |
| View, Download |