Wikipedia on the edge of overload. AI bots could take free access away from us

The scale of automated downloading by bots that gather training data for large language models (LLMs) has grown by 50 percent in recent months, measured in bandwidth used, especially for multimedia content. According to the Wikimedia Foundation, the trend has intensified since January 2024 and affects not only Wikipedia itself but also sites such as Wikimedia Commons, which hosts about 144 million openly licensed files.
Wikipedia's resources and related projects have been feeding commercial products and academic research for many years, but it was only from the beginning of 2024 that AI companies began downloading content at this pace. They use a range of methods: standard page crawling, dedicated APIs, and wholesale downloads of entire datasets.
Wikimedia stresses that the enormous demand for fresh data for AI models generates significant technical and financial costs. The organization says it receives neither adequate financial support nor even reliable attribution for the material being collected.
Wikipedia mass-downloaded by AI bots
From the perspective of the Foundation's volunteers and engineers, the burden of this traffic was most visible in December 2024, when former US president Jimmy Carter died. The article about him recorded a record number of views, which was predictable. The biggest problem, however, came when millions of people simultaneously began watching the 1.5-hour video of his 1980 presidential debate, hosted on Wikimedia Commons.
Network traffic roughly doubled and nearly saturated several of the Foundation's internet connections. Although engineers quickly rerouted traffic to reduce the overload, they realized that a large share of the bandwidth had long been consumed by bots mass-downloading the multimedia archive.
The phenomenon is familiar to other free and open-source software projects. Fedora at one point blocked traffic from the whole of Brazil when runaway scripts caused similar problems, GNOME introduced a proof-of-work challenge on its GitLab instance, and Read the Docs cut its bandwidth costs by blocking excessive AI traffic.
Wikimedia reports that even caching content (keeping copies of data in fast storage to speed up loading and reduce server load) does not solve the problem, because a typical user mostly views popular pages, while the bots crawl virtually the entire encyclopedia, including rarely visited material. They simply take everything there is.
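To see why caching helps with human traffic but not with wholesale crawling, here is a minimal, purely illustrative sketch in Python. The page counts, cache size and traffic patterns below are made-up assumptions, not Wikimedia's figures; the point is only the contrast between popularity-skewed reading and a bot sweeping the whole catalogue.

```python
import random
from collections import OrderedDict

# Illustrative numbers only; these are not Wikimedia's real figures.
NUM_PAGES = 100_000      # size of the "encyclopedia"
CACHE_SIZE = 5_000       # how many pages fit in the cache
NUM_REQUESTS = 200_000   # requests to simulate per scenario

def hit_rate(requests, cache_size=CACHE_SIZE):
    """Replay a request stream against a simple LRU cache and return the hit rate."""
    cache, hits = OrderedDict(), 0
    for page in requests:
        if page in cache:
            hits += 1
            cache.move_to_end(page)        # mark as most recently used
        else:
            cache[page] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used page
    return hits / len(requests)

# Human-like traffic: heavily skewed toward popular pages (Zipf-style weights).
pages = list(range(1, NUM_PAGES + 1))
weights = [1 / rank for rank in pages]
human = random.choices(pages, weights=weights, k=NUM_REQUESTS)

# Crawler-like traffic: sweeps through the entire catalogue, popular or not.
bot = [(i % NUM_PAGES) + 1 for i in range(NUM_REQUESTS)]

print(f"human-like traffic hit rate: {hit_rate(human):.0%}")  # high: popular pages stay cached
print(f"bot-like traffic hit rate:   {hit_rate(bot):.0%}")    # near zero: almost every fetch misses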
The Foundation's data indicate that bots currently account for 65 percent of the most demanding (and thus most expensive) requests, even though they are responsible for only 35 percent of total pageviews. From an engineering standpoint this means that bot requests are many times heavier than human ones, and some crawlers additionally ignore the rules in the robots.txt file or try to pass themselves off as ordinary users by using fake browser identifiers and rotating IP addresses.
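For context, robots.txt is a voluntary convention: a well-behaved crawler checks it before fetching anything, while the abusive ones described above simply skip the check or present a browser-like User-Agent so the rules no longer apply to them. A minimal sketch of the polite behaviour, using Python's standard urllib.robotparser; the crawler name "ExampleLLMBot" and the example paths are illustrative assumptions only:

```python
from urllib import robotparser

# A polite crawler consults robots.txt before fetching anything.
rp = robotparser.RobotFileParser("https://en.wikipedia.org/robots.txt")
rp.read()  # downloads and parses the live robots.txt (requires network access)

# "ExampleLLMBot" is a made-up crawler name used purely for illustration.
for path in ("/wiki/Special:Random", "/w/index.php?title=Example&action=history"):
    url = f"https://en.wikipedia.org{path}"
    allowed = rp.can_fetch("ExampleLLMBot", url)
    print(f"{'allowed' if allowed else 'disallowed'}: {url}")
```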
For the team responsible for site reliability, this translates into a constant fight against waves of unwanted traffic that have to be monitored and throttled. That, in turn, diverts attention from maintaining and developing the site for volunteers and the community.
A difficult problem to tackle
The difficulty of adapting to this phenomenon also extends to Wikimedia's developer infrastructure, such as its code review and bug reporting tools, which are likewise regularly strained by bots.
There are stories similar to those told by Daniel Stenberg of the curl project, who has described a flood of false reports generated with AI, and by Drew DeVault of SourceHut, who noticed bots mass-crawling git logs in a pattern previously unseen in traffic from real users.
Introducing more advanced technical measures, such as proof-of-work challenges (forcing clients to perform extra computation before a request is served) or so-called tarpits (deliberately delayed responses), can partially relieve the problem, but it is no panacea. These measures limit excessive traffic, yet they do not eliminate the problem, because bots keep finding new ways around the defenses.
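As a rough illustration of the proof-of-work idea, here is a hashcash-style sketch; the challenge format and the difficulty value are arbitrary assumptions, not any project's actual scheme. The server hands out a random challenge and the client must find a nonce whose SHA-256 hash starts with enough zero bits before its request is served. One such puzzle is cheap for a single reader, but it adds up quickly for a crawler firing off millions of requests.

```python
import hashlib
import os

DIFFICULTY_BITS = 18  # arbitrary example: roughly a quarter-million hash attempts on average

def issue_challenge() -> str:
    """Server side: hand the client a random challenge string."""
    return os.urandom(16).hex()

def leading_zero_bits(digest: bytes) -> int:
    """Count how many leading bits of the digest are zero."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()

def solve(challenge: str, difficulty: int = DIFFICULTY_BITS) -> int:
    """Client side: brute-force a nonce until the hash has enough leading zero bits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int = DIFFICULTY_BITS) -> bool:
    """Server side: checking a submitted solution costs a single hash."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty

challenge = issue_challenge()
nonce = solve(challenge)          # expensive for the client
print(verify(challenge, nonce))   # cheap for the server -> True
```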
In some places, community-maintained bot blocklists are shared, such as the ai.robots.txt project, and commercial companies offer paid traffic analysis services, e.g. Cloudflare's AI Labyrinth, which we have already written about on Business Insider. These can recognize and block unwanted visitors more effectively. Wikimedia reminds us, however, that the content itself is free and is meant to stay that way, while maintaining the infrastructure requires real money, and those costs keep growing.
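Shared lists of this kind essentially amount to a set of known AI-crawler User-Agent names that site operators can deny in their own configuration. A minimal, hypothetical sketch of that idea as a server-side check; the deny list below is a short illustrative excerpt, not the actual contents of any published list, and as the second example shows, a crawler that spoofs a browser identity slips straight through:

```python
# Hypothetical excerpt of a shared AI-crawler denylist; real projects such as
# ai.robots.txt maintain the authoritative, much longer set of names.
AI_CRAWLER_NAMES = {"GPTBot", "CCBot", "ClaudeBot", "Bytespider"}

def is_listed_ai_crawler(user_agent: str) -> bool:
    """Return True if the request's User-Agent contains a known AI-crawler token."""
    return any(name.lower() in user_agent.lower() for name in AI_CRAWLER_NAMES)

# Example User-Agent strings (illustrative, not captured from real traffic).
requests = [
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",  # spoofed browser identity
]
for ua in requests:
    action = "block" if is_listed_ai_crawler(ua) else "serve"
    print(f"{action}: {ua}")
```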
The Foundation is therefore launching the WE5: Responsible Use of Infrastructure initiative, under which it wants to develop rules for the responsible use of its resources. Among other things, this covers more efficient ways of downloading data, transparent principles for limiting aggressive bots and, perhaps, working out financing terms between AI giants and content providers.
On the one hand, Wikimedia has long promoted free access to knowledge; on the other, it reminds us that this access is not free of consequences. If companies use the content on a massive scale for their own commercial needs while contributing nothing toward infrastructure costs, it may threaten the stability of the entire ecosystem and thus jeopardize the free access to material that has for years been the essence of Wikipedia's mission for millions of people.
A potential solution is better coordination with developers working on artificial intelligence: dedicated API interfaces or joint financing mechanisms. If these steps are not taken, Wikimedia and similar organizations may lack the funds to maintain the necessary infrastructure. That would mean reduced access to free knowledge and slower development of the whole ecosystem built on free licenses. We would then be only one step away from free content disappearing from the web, because its providers would not be able to afford to keep running servers that are this heavily overloaded by bots.
Author: Grzegorz Kubera, Business Insider Polska journalist