Wikipedia has announced a new initiative in collaboration with Kaggle, Google’s data science platform, aimed at supporting AI developers with structured, machine-readable datasets. The move is designed to reduce the heavy server load caused by automated scraping, which has become increasingly common in the age of artificial intelligence.
A Dataset Designed for AI
Currently in beta, the dataset includes content in English and French, featuring machine-friendly summaries, brief descriptions, infobox data, article sections, and image links. Elements like references and non-text media files are excluded. The data is provided in a clean, structured JSON format, making it well suited to machine learning workflows such as model training, benchmarking, and alignment.
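To give a sense of how such a structured export might be consumed, here is a minimal Python sketch that iterates over a local copy of the data, assuming a JSON Lines layout (one article object per line). The filename and field names used here (name, abstract, sections) are illustrative assumptions, not the confirmed schema; the authoritative field list is documented on the dataset's Kaggle page.

```python
import json

def iter_articles(path):
    """Yield one parsed article object per line of a JSON Lines file.

    Assumes each line is a self-contained JSON object, which is a common
    layout for large structured exports; the actual Kaggle files may differ.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

if __name__ == "__main__":
    # Hypothetical filename for a downloaded English-language split.
    for article in iter_articles("enwiki_structured_contents.jsonl"):
        print(article.get("name"))          # article title (assumed field)
        print(article.get("abstract"))      # machine-friendly summary (assumed field)
        for section in article.get("sections", []):
            print("-", section.get("name"))  # section heading (assumed field)
        break  # inspect only the first record
```

Because the records arrive pre-parsed, a consumer can skip the HTML-scraping and wikitext-cleaning steps that raw page downloads would require.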
The dataset is distributed under an open license (Wikipedia's text is available under Creative Commons Attribution-ShareAlike), meaning it is freely accessible to tech giants and independent developers alike. The Wikimedia Foundation, which already collaborates with organizations like Google and the Internet Archive, hopes this partnership will make Wikipedia's resources more accessible to smaller AI developers as well.
Combating the Impact of Web Scraping
This initiative comes in response to the growing problem of bots scraping Wikipedia pages at scale. According to Wikimedia, 65% of its most resource-intensive server traffic comes from bots, straining bandwidth and driving up operational costs. By offering a dedicated dataset, Wikipedia aims to provide a more sustainable and efficient alternative to scraping.
A Step Toward Broader Collaboration
Kaggle expressed strong support for the project, emphasizing the importance of keeping Wikipedia’s data accessible and useful for the machine learning community. This collaboration marks a significant step toward more responsible and cooperative use of open knowledge in the AI era.