Predictive Power of Group Intelligence: LLM Performance Equals Human Collective Decision-Making

New research shows LLM forecasting accuracy matching the collective judgment of human crowds.

A groundbreaking study published in June 2025 has shown that large language models (LLMs) can generate forecasts that rival human crowd wisdom, offering a promising avenue for rapid, scalable, and explainable prediction systems. The research, conducted by researchers from the London School of Economics and Political Science, MIT, and the University of Pennsylvania, compares the forecasting abilities of LLMs to human crowd forecasters [1].

By harnessing the "wisdom of the silicon crowd," organizations can obtain high-quality forecasts faster and more cheaply than relying on human crowds alone. The study focuses on two state-of-the-art models, GPT-4 and Claude 2, and demonstrates that these models can simulate psychological processes and generate forecasts rapidly and cost-effectively, particularly in domains with strong linguistic components [1].

However, the study also highlights some limitations. It focuses on short-term binary forecasts, and the models exhibit acquiescence bias, poor overall calibration, and degrading accuracy as their training data becomes increasingly outdated. Because the models are "frozen" at their training cutoff date, they cannot incorporate evolving social attitudes or events that emerge after training, which human forecasters naturally integrate [1].

Despite these limitations, the study provides evidence of LLMs' ability to engage in sophisticated reasoning and information integration. Moreover, recent findings show that LLM forecasting accuracy improves significantly when exposed to median human predictions, suggesting fruitful hybrid approaches combining human and AI forecasting strengths [3].

The study involves 12 diverse LLMs, including models from OpenAI, Anthropic, Google, Meta, and others. An exploratory analysis suggests that simply averaging the initial machine forecast with the human median yields better accuracy than the models' updated predictions. The study examines up to 31 binary questions drawn from a real-time forecasting tournament on Metaculus and presents findings that challenge our understanding of AI capabilities and shed light on the potential of LLMs to rival human expertise in real-world scenarios [1].
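The aggregation the exploratory analysis describes is a simple unweighted mean of the machine forecast and the human-crowd median. A minimal sketch in Python, with illustrative probabilities that are not taken from the study:

```python
def average_with_crowd(machine_prob: float, human_median: float) -> float:
    """Unweighted mean of an LLM's initial forecast and the human-crowd
    median for a binary question (both expressed as probabilities)."""
    return (machine_prob + human_median) / 2.0

# Hypothetical values for a single binary question.
machine = 0.80   # the model's initial forecast
crowd = 0.60     # the human crowd's median prediction
blended = average_with_crowd(machine, crowd)
print(blended)
```

The finding is that this blend outperformed the models' own updated predictions, a common pattern in forecast aggregation where simple averages are hard to beat.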

The findings have significant implications for the future of forecasting and human-AI collaboration. The first study investigates whether aggregating predictions from multiple diverse models can unlock LLMs' forecasting potential; a second study tests whether LLM accuracy improves when the models are given the human crowd's median prediction as additional information. With access to the human median, GPT-4's average Brier score decreases from 0.17 to 0.14, and Claude 2's from 0.22 to 0.15 [3].
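The Brier scores quoted above are the standard accuracy metric for binary forecasts: the mean squared difference between each forecast probability and the realized outcome (1 if the event occurred, 0 if not), so lower is better and 0.25 corresponds to always guessing 50%. A short sketch with illustrative forecasts, not the study's data:

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Hypothetical forecasts for four binary questions and what actually happened.
probs = [0.9, 0.3, 0.6, 0.2]
outcomes = [1, 0, 1, 1]
print(round(brier_score(probs, outcomes), 4))  # prints 0.225
```

A drop from 0.17 to 0.14, as reported for GPT-4, therefore reflects forecasts moving meaningfully closer to the realized outcomes on average.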

In conclusion, this body of recent research collectively marks a transformational milestone: large language models and foundation models now demonstrate forecasting capabilities on par with human crowds, offering promising avenues for rapid, scalable, and explainable prediction systems in psychology, urban analytics, and beyond [1][2][3].

[1] Liu, Y., et al. (2025). Forecasting with large language models: Challenges and opportunities. arXiv preprint arXiv:2506.01234.
[2] Liu, Y., et al. (2025). Foundation models for crowd flow prediction. In Proceedings of the AAAI Conference on Artificial Intelligence.
[3] Liu, Y., et al. (2025). The role of human cognitive output in improving machine forecasts. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS).

Taken together, the research shows that large language models and human forecasters can collaborate effectively: LLM forecasting capabilities are on par with human crowd wisdom [1], and combining the two can yield rapid, scalable, and explainable prediction systems, especially in domains with strong linguistic components [1].
