Tech Spotlight

Overview

Data science is increasingly used by governments and businesses to make decisions that impact billions of lives worldwide. However, the models and algorithms developed by data scientists are only as unbiased as the data they are fed. Consequently, it is critical for data practitioners to build and train systems with as little bias as possible to develop equitable algorithms.

There are millions of publicly available datasets and no existing standards to rate a set’s bias or determine its completeness. The Data Nutrition Project created the Dataset Nutrition Label to provide an at-a-glance evaluation of a dataset’s quality. The Label cultivates a culture of transparency, accountability, and understanding in the realm of big data.

The Data Nutrition Project was founded in 2018 through Harvard University’s Berkman Klein Center Assembly Fellowship.

The Data Nutrition Project aims to create a standard label for interrogating datasets.

The Challenge

Today, AI algorithms are used to inform decisions ranging from mortgage approvals to criminal sentencing to medical care, and more. The quality of AI algorithms is dependent on the quality of the data they are trained on. When the data is biased, the resulting algorithm is biased. For example, dataset bias has contributed to poorer medical attention and higher mortality rates for skin cancer patients with darker skin compared to those with lighter skin.¹

There is currently a missing step in the AI development pipeline: data practitioners cannot assess datasets against a standardized measure of quality that captures both the quantitative and qualitative attributes of a dataset. In short, there is no quick way of knowing the quality of a dataset. Thus the challenge in developing a tool that allows data practitioners to assess datasets quickly is twofold. The first challenge lies in developing the assessment criteria itself. The second challenge lies in expressing a label in a way that is easily understood by practitioners training AI models.

About the Intervention

The Dataset Nutrition Label increases the transparency and understandability of datasets. Using the analogy of the Nutrition Facts Label on food, the Dataset Nutrition Label highlights the “nutrients” of datasets to concisely summarize how healthy a data set is for a particular use case.

The Label has four sections: About, Alert Count, Use Cases, and Badges. The About section contextualizes the dataset. The Alert Count measures any issues with a dataset’s completeness, provenance, collection, description, and composition. Alerts are classified into three categories based on whether they can, cannot, or might be able to be mitigated. The Use Case section describes potential appropriate uses of the dataset. Lastly, the Badges communicate a variety of information like whether the dataset has undergone quality or ethical review. Alerts are also tallied and expressed graphically to show their categorical frequency, their mitigation potential, and their potential harm.

The Dataset Nutrition Label is helpful to dataset owners and data practitioners. For dataset owners, the Label provides scaffolding in the form of questions and processes to surface relevant information about a dataset. For data practitioners, the Label helps inform whether and how to use a dataset. The code and framework are entirely open-source.

Impact & Future Plans

After two years of developing a robust assessment framework, The Data Nutrition Project worked with a consortium of hospitals to create Dataset Nutrition Labels pertaining to melanoma and rare diseases long associated with biased medical datasets.

The Data Nutrition Project is looking to broaden its impact. In the short term, the organization is developing 10 to 20 new labels on several Harvard datasets popular for benchmarking in academia and industry. The Data Nutrition Project is also working with organizations like Humans in the Loop to mitigate problems experienced by refugees perpetuated by biased datasets. Additionally, they have partnered with AI Global and the World Economic Forum to develop a certification for healthy AI using a data rubric based on the Dataset Nutrition Label schema. Longer-term, The Data Nutrition Project is focused on scalability. The Project is investigating how to automate the production of the Label to make the Label as accessible and impactful as possible.

¹ Angela Lashbrook, “AI-Driven Dermatology Could Leave Dark-Skinned Patients Behind,” The Atlantic, August 16, 2018, https://www.theatlantic.com/health/archive/2018/08/machine-learning-dermatology-skin-color/567619/. ↩