Artificial intelligence and cryptocurrency trading: a test of six language models

Six artificial intelligence models each received $10,000, the same data, and the same instructions to trade on real cryptocurrency markets without any human intervention. The result? After 17 days, four of them finished the first stage of the study with losses of up to 62%. The winner earned 22%.


The first season of the Alpha Arena tournament ended on November 3. Its aim was to test the capabilities of large language models in quantitative trading on the cryptocurrency market. The tournament was organized by Nof1, which presents itself as the first research laboratory focused on artificial intelligence in the context of financial markets.
Six large language models (LLMs) participated in Alpha Arena:
- GPT-5 from OpenAI,
- Gemini 2.5 Pro from Google,
- Claude Sonnet 4.5 from Anthropic,
- Grok 4 from Elon Musk's xAI,
- DeepSeek v3.1 from China's DeepSeek,
- Qwen3-Max from Alibaba.
Artificial intelligence trades cryptocurrencies
The tournament started on October 18. Each model received identical prompts and inputs, $10,000 in initial capital, and a connection to the decentralized Hyperliquid exchange.
To keep things simple, Nof1 limited the actions available to the models to opening long and short positions and holding or closing them. The instrument universe was narrowed down to six popular cryptocurrencies on Hyperliquid: BTC, ETH, SOL, BNB, DOGE, and XRP.
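Nof1 has not published its trading harness, so the snippet below is only a minimal sketch of what such a constrained decision space could look like in code; the names Instrument, TradeAction, and TradeDecision are illustrative assumptions, not part of the actual Alpha Arena setup.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical types illustrating the constrained decision space described above;
# Nof1's actual harness is not public, so names and fields here are assumptions.

class Instrument(Enum):
    BTC = "BTC"
    ETH = "ETH"
    SOL = "SOL"
    BNB = "BNB"
    DOGE = "DOGE"
    XRP = "XRP"

class TradeAction(Enum):
    OPEN_LONG = "open_long"
    OPEN_SHORT = "open_short"
    HOLD = "hold"
    CLOSE = "close"

@dataclass
class TradeDecision:
    instrument: Instrument
    action: TradeAction
    size_usd: float  # position size in USD, capped by available capital

# Example: a model's output parsed into one allowed decision
decision = TradeDecision(Instrument.ETH, TradeAction.OPEN_LONG, size_usd=1_500)
print(decision)
```

Restricting the models to a handful of discrete actions also makes their decisions straightforward to log and compare across models.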
The authors of the study emphasized that they chose the cryptocurrency market and Hyperliquid for three practical reasons:
- availability 24 hours a day, 7 days a week, which allowed the models' behavior to be observed continuously,
- abundant and easily accessible data, conducive to analysis and transparent auditing,
- the speed and reliability of Hyperliquid and the ease of integrating the platform with LLMs.
GPT-5's cryptocurrency failure
GPT-5 and Gemini started the tournament hovering around the break-even point, but after a few days they began to suffer significant losses and never recovered before the end of the competition. Their final results did not differ much from the numbers already visible on the fifth day of the competition.
GPT-5 from OpenAI turned out to be the weakest language model in the Alpha Arena test. Of the initial $10,000, it had $3,733 left by November 3, a loss of 62.7%.
The Google Gemini model finished second from the bottom, with its capital declining 56.7% to $4,329. Grok from the xAI lab lost 45.3% by November 3, ending the first stage with an account balance of $5,469.
Over 100% profit for DeepSeek. Up to a point
Claude Sonnet was the best of the “Western” LLMs, losing 30.8% and ending the tournament with $6,918. The top two places went to the Chinese models DeepSeek and Qwen3-Max, which also performed best throughout the entire test period.
After 10 days of competition, on October 27, DeepSeek dominated the rest of the field, netting over $13,000 in profit. Qwen3-Max followed hot on its heels, having doubled its initial capital. However, subsequent declines in the cryptocurrency market undermined the final results.
Ultimately, the first edition of Alpha Arena was won by Qwen3-Max from the Chinese company Alibaba, which finished the competition with $12,231, a 22.3% profit. DeepSeek ended with $10,489, a gain of 4.9%.


Alpha Arena 1.0 ranking (final account balances; the sketch below converts them into percentage returns):
- Qwen3-Max – $12,231
- DeepSeek – $10,489
- Claude Sonnet 4.5 – $6,918
- Grok – $5,469
- Gemini 2.5 Pro – $4,329
- GPT-5 – $3,733
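As a quick sanity check on the percentages quoted above, here is a minimal sketch that recomputes the returns from the final balances, assuming each model started from exactly $10,000 as stated earlier:

```python
# Minimal sketch: recompute each model's return from the final balances
# listed above, assuming a starting capital of exactly $10,000 per model.
START_CAPITAL = 10_000

final_balances = {
    "Qwen3-Max": 12_231,
    "DeepSeek": 10_489,
    "Claude Sonnet 4.5": 6_918,
    "Grok": 5_469,
    "Gemini 2.5 Pro": 4_329,
    "GPT-5": 3_733,
}

for model, balance in final_balances.items():
    pct_return = (balance - START_CAPITAL) / START_CAPITAL * 100
    print(f"{model}: ${balance:,} ({pct_return:+.1f}%)")
```

Run as-is, this reproduces the figures quoted in the article: +22.3% for Qwen3-Max, +4.9% for DeepSeek, -30.8% for Claude Sonnet 4.5, -45.3% for Grok, -56.7% for Gemini 2.5 Pro, and -62.7% for GPT-5.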
Test conclusions and announcement of Alpha Arena 1.5
The tournament's organizers stressed that a single season of live competition over a narrow time window has limited statistical power and that the early rankings may change in the future. Nof1 intends to continue the study and announced that the next round, Alpha Arena 1.5, will begin soon.
“We observed consistent deviations in the models' behavior that persisted over time and across many iterations of the prompt (instruction). Something like an investing 'personality' emerged.
We deliberately put the models in a difficult situation. LLMs are generally poor at dealing with numerical time-series data, and this was the only context we provided them. They also received a limited universe of assets and a rather narrow action space.
In the next season, we will introduce many improvements and test many different prompts in parallel, as well as numerous instances of each model,” Jay A. Zhang, founder of Nof1, summed up the study.
“Season 1 of Alpha Arena has officially ended. Qwen 3 MAX pulled ahead at the very end to secure the win, so congrats to the @Alibaba_Qwen team. Thanks to everyone who tuned in to our first experiment in understanding how LLMs handle the noisy, adversarial, non-stationary world of…”
— Jay A (@jay_azhang), November 3, 2025




