Test sets

Main test sets (used on this site)

English Wikipedia as html files


This set consists of all articles in the English Wikipedia marked as either featured or good articles, converted into standalone html files.


Files: 5 410
Size: 298 MB
View, Download

Enron files


With a total of 43 426 files and a good mix of typical enterprise file types (.doc, .xls, .pdf, images, etc.), the Enron file set is a good resource for simulating a file server.

This data set contains files sent as email attachments by about 150 users, mostly senior management of Enron. The data was originally made public and posted to the web by the US Federal Energy Regulatory Commission during its Enron investigation. The data set was created by extracting all email attachments from the original EDRM set.


Files: 43 401
Size: 7.9 GB
Top file types: doc: 21 102, xls: 9 589, pdf: 1 919, ppt: 1 823, jpg: 1 626, gif: 555, htm: 493, exe: 231, dat: 35
View, Download
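
To check a downloaded and unpacked copy against the figures listed above (file count, total size, top file types), something like the following could be used. This is only a minimal sketch: the function names summarize_file_set and format_top_types and the folder name enron_files are made up for illustration and are not part of the test set. It assumes the set has been unpacked into an ordinary local folder.

```python
from collections import Counter
from pathlib import Path


def summarize_file_set(root):
    """Return (file_count, total_bytes, Counter of lowercase extensions).

    Hypothetical helper: walks `root` recursively and counts regular files.
    """
    file_count = 0
    total_bytes = 0
    extensions = Counter()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue  # skip directories
        file_count += 1
        total_bytes += path.stat().st_size
        # Group by extension without the dot; files with no extension
        # are counted under "(none)".
        extensions[path.suffix.lower().lstrip(".") or "(none)"] += 1
    return file_count, total_bytes, extensions


def format_top_types(extensions, limit=9):
    """Format the most common extensions as 'doc: 12, xls: 7, ...'."""
    return ", ".join(
        f"{ext}: {count}" for ext, count in extensions.most_common(limit)
    )


if __name__ == "__main__":
    # "enron_files" is an assumed local folder name, not a real download path.
    count, size, exts = summarize_file_set("enron_files")
    print(f"Files: {count}")
    print(f"Size: {size / 1e9:.1f} GB")
    print(f"Top file types: {format_top_types(exts)}")
```

Extensions are lowercased before counting, so .DOC and .doc end up in the same bucket. Counts can still differ slightly from the published figures depending on how the archive was unpacked.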

 

Additional test sets

Full English Wikipedia as html files

This set consists of all articles in English Wikipedia converted into standalone html files.

Files: 7 926 727
Size: 83 GB
View, Download