Wikidata Transforms Data into Vectors for Open LLM Access
Wikidata, the world's largest open knowledge graph, is converting its data into vector embeddings and storing them in Astra DB, a vector database. The project, a collaboration between Wikimedia Deutschland and Jina AI, started in September 2024. The goal is to give Large Language Models (LLMs) a freely accessible interface to Wikidata, making them more transparent, reliable, and fair.
Wikidata, maintained by around 24,000 volunteers worldwide each month, contains approximately 119 million entries. The project recommends a two-step approach: first use semantic vector search to identify the relevant records, then structure the retrieved knowledge with a graph database (GraphRAG). The vector database supports search queries in English, French, and Arabic, with Spanish and Mandarin planned by the end of the year.
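The two-step approach can be illustrated with a minimal Python sketch. It does not use the project's actual interface: the three-dimensional vectors are hand-made toy values rather than real embeddings, and the helper names `vector_search` and `expand` are hypothetical. Only the identifiers (Q42, Q64, Q183, P17, P31, P36) are real Wikidata IDs.

```python
import math

# Toy stand-ins for embedded Wikidata entries. The vectors are invented for
# illustration; real embeddings come from a model and have far more dimensions.
ENTRIES = {
    "Q42": ("Douglas Adams", [0.9, 0.1, 0.1]),
    "Q64": ("Berlin", [0.1, 0.9, 0.2]),
    "Q183": ("Germany", [0.2, 0.8, 0.5]),
}

# Toy stand-in for the knowledge graph: subject -> [(property, object)].
STATEMENTS = {
    "Q42": [("P31", "Q5")],    # instance of: human
    "Q64": [("P17", "Q183")],  # country: Germany
    "Q183": [("P36", "Q64")],  # capital: Berlin
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query_vec, k=2):
    """Step 1 (hypothetical helper): rank entries by similarity to the query."""
    ranked = sorted(
        ENTRIES,
        key=lambda qid: cosine(query_vec, ENTRIES[qid][1]),
        reverse=True,
    )
    return ranked[:k]


def expand(qids):
    """Step 2 (hypothetical helper): collect graph statements for the hits."""
    return [(q, p, o) for q in qids for (p, o) in STATEMENTS.get(q, [])]


# A made-up query vector that lies close to the two geography entries.
hits = vector_search([0.1, 1.0, 0.3], k=2)
facts = expand(hits)
print(hits)
print(facts)
```

In a real setup, the query vector would come from the same embedding model used for the entries, and the second step would query the knowledge graph itself rather than a hard-coded dictionary.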
The new technology aims to improve LLMs by providing them with structured, up-to-date, and verified information, which is intended to reduce incorrect answers and hallucinations. Wikimedia envisions applications such as fact-checking and tools for vandalism prevention. The source code of the application is available under the open-source MIT license.
The embedding project lets developers connect Wikidata's vectorized data to LLMs using Retrieval-Augmented Generation (RAG) and the Model Context Protocol (MCP). This open access to Wikidata is intended to improve the quality of LLMs worldwide.