DeepSeek-R1 hallucinates nearly 4x more than V3
Vectara’s HHEM 2.1 benchmark found DeepSeek-R1 hallucinates at 14.3% versus DeepSeek-V3’s 3.9%, a gap that may raise on-chain risk for crypto AI agent tokens.
Vectara ran its HHEM 2.1 hallucination evaluation on two DeepSeek models and reported a 14.3% hallucination rate for DeepSeek-R1 compared with 3.9% for DeepSeek-V3. The company said it cross-checked the results using a FACTS methodology and found that R1 produced more false or unsupported statements in every test configuration. Vectara posted the figures publicly, noting that R1 “overhelps” by adding information that does not appear in the source text.
Vectara’s analysis describes R1’s behavior as adding details that may be independently correct yet still count as hallucinations because they introduce context absent from the source. The company reported that the pattern held across test settings and that R1’s tendency to supply extra information drove up the measured rate of unsupported statements.
The difference matters for crypto projects that wrap large language models in agent tooling. Several AI agent tokens use models to post on social media, route trades, mint tokens or execute contracts on-chain. Market data show the category of AI agent tokens grew about 39.4% over a recent 30-day window, and one leading project has a market capitalization above $576 million. One analysis of a specific agent token found the agent promoted 416 tokens with an average return of 19% on those promotions.
When a model fabricates a price level, a partnership or a contract address, the error can produce real financial effects on-chain. The risk increases with the level of autonomy: agents that only read and summarize data pose less direct risk than agents that hold treasury keys or carry out multi-step transactions. For agents designed to execute multi-step plans, Vectara noted, a single hallucinated fact early in a planning chain can affect every downstream action.
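To make the contract-address case concrete, the following is a minimal sketch of one way a team might gate agent-initiated transfers behind a human-verified allowlist so that a hallucinated address stops a plan before it reaches signing code. The `ADDRESS_ALLOWLIST` set and `propose_transfer()` function are illustrative assumptions, not taken from any project mentioned in this article.

```python
# Minimal sketch of an allowlist gate for agent-initiated transfers.
# ADDRESS_ALLOWLIST and propose_transfer() are illustrative assumptions,
# not drawn from any project cited in this article.

ADDRESS_ALLOWLIST = {
    "0x0000000000000000000000000000000000000001",  # placeholder for a team-verified address
}

def propose_transfer(model_output: dict) -> dict:
    """Validate a model-proposed transfer before it reaches signing code."""
    to_address = model_output.get("to_address", "")
    if to_address not in ADDRESS_ALLOWLIST:
        # A hallucinated or unverified address stops the plan here instead of
        # propagating into downstream signing and broadcast steps.
        raise ValueError(f"address {to_address!r} is not on the allowlist")
    return {"to": to_address, "amount": model_output["amount"]}
```

The design choice is to fail closed: anything the model proposes that cannot be verified out-of-band simply does not execute.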
The causes of hallucination are debated within the AI community. Meta chief AI scientist Yann LeCun has argued that hallucinations stem from the autoregressive architecture of current large language models and wrote that “Hallucinations in LLM are due to the Auto-Regressive prediction.” Other researchers point to improvements from techniques such as retrieval augmentation, post-training fine-tuning and verifier models as ways to reduce unsupported outputs.
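To show where a grounding check of that kind sits, here is a toy sketch of the verifier idea: before an agent acts on a model claim, the claim is compared against the retrieved source text. Production systems would typically use a trained entailment or verifier model; the token-overlap heuristic below is only a crude stand-in, and the function name and example strings are assumptions for illustration.

```python
# Toy sketch of a grounding check: a claim is accepted only if most of its
# content words appear in the retrieved source text. Real systems would use
# a trained verifier/entailment model instead of this overlap heuristic.

def is_grounded(claim: str, source_text: str, threshold: float = 0.6) -> bool:
    """Return True if most of the claim's content words appear in the source."""
    stopwords = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "are"}
    claim_words = {w.lower().strip(".,") for w in claim.split()} - stopwords
    source_words = {w.lower().strip(".,") for w in source_text.split()}
    if not claim_words:
        return False
    overlap = len(claim_words & source_words) / len(claim_words)
    return overlap >= threshold

source = "The protocol's token traded near $1.20 on Tuesday."
print(is_grounded("The token traded near $1.20.", source))             # True
print(is_grounded("The token announced a Visa partnership.", source))  # False
```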
A developer using the handle xlr8harder described a debugging session with R1 in which partial internal reasoning traces produced confident but unsupported outputs, saying the model “defaults to gaslighting me with hallucinations.”
Some development teams route every model claim through verification steps or assign financial actions to smaller, more conservative models to limit exposure. Vectara’s reported rates — 14.3% for R1 and 3.9% for V3 — are being used by builders and auditors to assess operational risk when selecting models for agent tooling.
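A rough sketch of that routing pattern, assuming hypothetical `primary_model` and `verifier` callables, might look like the following: read-only requests go straight to the primary model, while any action that can move funds fails closed unless the verification step approves it.

```python
# Hypothetical sketch of routing model output by risk: financial actions must
# pass a verification step before execution, read-only actions go straight
# through. All names here are illustrative assumptions.

FINANCIAL_ACTIONS = {"transfer", "swap", "mint", "approve"}

def route(action: str, payload: dict, primary_model, verifier) -> dict:
    plan = primary_model(action, payload)
    if action in FINANCIAL_ACTIONS and not verifier(plan):
        # Unverified financial actions fail closed rather than executing.
        return {"status": "rejected", "reason": "failed verification"}
    return {"status": "approved", "plan": plan}

if __name__ == "__main__":
    def primary(action, payload):
        return {"action": action, **payload}

    def reject_all(plan):  # stand-in for a conservative verifier
        return False

    print(route("transfer", {"to": "0xabc...", "amount": 10}, primary, reject_all))
    print(route("summarize", {"topic": "daily volume"}, primary, reject_all))
```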
Vectara and other labs continue to report benchmark results and mitigation techniques as they test reasoning-focused models. Future model updates and leaderboard cycles will provide additional data on whether hallucination rates change for reasoning-trained systems.