AI data sources reveal growing tech company dominance

A comprehensive study by the Data Provenance Initiative has uncovered concerning trends in AI training data sources, according to findings reported by Melissa Heikkilä in MIT Technology Review. The research, analyzing nearly 4,000 public datasets across 67 countries, shows that data collection for AI development is increasingly concentrated among major technology companies.

Since 2018, web scraping has become the dominant method for gathering AI training data, with platforms like YouTube providing over 70% of video and speech data. The study finds that whereas datasets in the early 2010s drew on diverse sources such as parliamentary transcripts and weather reports, recent practice favors mass collection from the internet. This shift particularly benefits the large tech companies that own those platforms, such as Google, which owns YouTube.

The research also highlights significant geographical disparities: over 90% of datasets originate from Europe and North America, while fewer than 4% come from Africa. Researchers warn that this concentration of data control, combined with exclusive data-sharing deals between major tech companies and content providers, could further strengthen the dominance of leading AI developers while limiting access for smaller organizations and researchers.
