Main test sets (used on this site)
English Wikipedia as html files
This data set contains files sent as email attachments from about 150 users, mostly senior management of Enron. The data was originally made public and posted to the web by the US Federal Energy Regulatory Commission during its Enron investigation. The data set was created by extracting all email attachments from the original EDRM set.
|Top file types:||doc: 21 102, xls: 9 589, pdf: 1 919, ppt: 1 823, jpg: 1 626, gif: 555, htm: 493, exe: 231, dat: 35|
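As an illustration of the extraction step described above, here is a minimal Python sketch of how attachments could be pulled out of a message and tallied by file extension. It is not the procedure actually used to build this set. It assumes each message is available as raw RFC 822 style bytes, which may not match how the EDRM set is distributed. The function names `extract_attachments`, `count_extensions` and `build_sample_message` are hypothetical.

```python
from collections import Counter
from email import message_from_bytes, policy
from email.message import EmailMessage
from pathlib import PurePosixPath


def extract_attachments(raw_message: bytes) -> list[tuple[str, bytes]]:
    """Return (filename, payload) pairs for each named attachment in one message."""
    # policy.default makes the parser return EmailMessage objects,
    # which provide iter_attachments().
    msg = message_from_bytes(raw_message, policy=policy.default)
    attachments = []
    for part in msg.iter_attachments():
        name = part.get_filename()
        if name is None:
            continue  # skip parts that carry no filename
        # decode=True undoes the base64 / quoted-printable transfer encoding
        payload = part.get_payload(decode=True) or b""
        attachments.append((name, payload))
    return attachments


def count_extensions(filenames) -> Counter:
    """Tally lower-cased file extensions (without the dot)."""
    return Counter(
        PurePosixPath(name).suffix.lower().lstrip(".") for name in filenames
    )


def build_sample_message() -> bytes:
    """Build a tiny synthetic message with one attachment (assumed data, for demonstration)."""
    msg = EmailMessage()
    msg["Subject"] = "sample"
    msg.set_content("See attached.")
    msg.add_attachment(
        b"data",
        maintype="application",
        subtype="octet-stream",
        filename="report.xls",
    )
    return msg.as_bytes()


if __name__ == "__main__":
    found = extract_attachments(build_sample_message())
    print(count_extensions(name for name, _ in found))
```

Running `count_extensions` over the filenames collected from every message would give per-type counts of the kind shown in the table above.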
Additional test sets
Full English Wikipedia as html files
This set consists of all articles in English Wikipedia converted into standalone html files.
|Files:||7 926 727|