To test and compare new machine learning methods, researchers often evaluate them on the same shared datasets. A study of which datasets are actually used found that the field is increasingly concentrating on a very small number of popular ones, and that most of these influential datasets were created by people at just a handful of elite institutions. This concentration affects how scientific progress is measured and raises important questions about fairness and equal opportunity for all researchers in the field.
The Few Datasets Powering Research
Machine learning research increasingly relies on a small number of popular datasets, mostly from elite institutions.