Sometimes you just need some data to test and stress things. But randomly generated data is awful: it doesn’t have realistic distributions, and it’s hard to tell whether your results are meaningful and correct. Real or quasi-real data is best. Whether you’re looking for a couple of megabytes or many terabytes, the following sources of data might help you benchmark and test under more realistic conditions.
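To make the point about distributions concrete, here is a minimal sketch in Python. It compares uniformly random keys with a skewed, Zipf-like set of keys and measures how much of the traffic lands on the most popular 1% of keys. The Zipf-like weighting, the key and sample counts, and the helper names (`top_share`, `uniform_keys`, `skewed_keys`) are all assumptions chosen for illustration; none of them is measured from the datasets listed here.

```python
import random
from collections import Counter


def top_share(samples, fraction=0.01):
    """Return the share of all samples that hit the most popular
    `fraction` of the distinct keys that were actually seen."""
    counts = Counter(samples)
    top_n = max(1, int(len(counts) * fraction))
    hottest = sum(count for _, count in counts.most_common(top_n))
    return hottest / len(samples)


def uniform_keys(rng, n_keys, n_samples):
    """Every key is equally likely: typical naive random test data."""
    return [rng.randrange(n_keys) for _ in range(n_samples)]


def skewed_keys(rng, n_keys, n_samples):
    """Key popularity falls off as 1/rank (a Zipf-like assumption),
    so a handful of keys receive most of the accesses."""
    weights = [1.0 / rank for rank in range(1, n_keys + 1)]
    return rng.choices(range(n_keys), weights=weights, k=n_samples)


# Illustrative sizes and a fixed seed so runs are repeatable.
rng = random.Random(42)
N_KEYS, N_SAMPLES = 10_000, 100_000

uniform_share = top_share(uniform_keys(rng, N_KEYS, N_SAMPLES))
skewed_share = top_share(skewed_keys(rng, N_KEYS, N_SAMPLES))

print(f"uniform: top 1% of keys get {uniform_share:.1%} of accesses")
print(f"skewed:  top 1% of keys get {skewed_share:.1%} of accesses")
```

With uniform keys the hottest 1% receive only a small slice of the accesses, while with the skewed keys they receive a far larger share. A cache or index that looks fine under the first workload can behave very differently under the second, which is why data with realistic skew makes for a more trustworthy benchmark.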
Datasets for Benchmarking
- The venerable Sakila test database: small, fake database of movies.
- The employees test database: small, fake database of employees.
- The Wikipedia page-view statistics database: large, real website traffic data.
- The IMDb database: moderately large, real database of movies.
- The FlightStats database: flight on-time arrival data, easy to import into MySQL.
- The Bureau of Transportation Statistics: airline on-time data, downloadable in customizable ways.
- The airline on-time performance and causes of delays data from data.gov: ditto.
- The statistical review of world energy from British Petroleum: real data about our energy usage.
- The Amazon AWS Public Data Sets: a large variety of data such as the mapping of the Human Genome and the US Census data.
- The Weather Underground weather data: customize and download as CSV files.
Post your favorites in the comments!