This talk covers the challenges of dealing with mistakes, inconsistencies, and subjectivity in human-labeled datasets. We will discuss how to build, use, and secure representative datasets for AI problems, paying special attention to crowdsourced data and data obtained from in-house annotation teams.
We will start with typical issues of crowdsourced and human-labeled datasets, such as annotator biases and differences in annotator backgrounds. Then, we will focus on the problems of annotator disagreement and answer subjectivity. We will present business case studies showing how these problems are addressed in practice, leading to the creation of useful training datasets. We will also discuss Web-scale dataset poisoning and ways to ensure the long-term sustainability of a dataset once it has been created. Finally, we will tackle the problem of learning from such data, showing convenient open-source tools for improving machine learning model quality.