Assessing Language Models Using Other Language Models

In the rapidly evolving world of artificial intelligence, Large Language Models (LLMs) are making significant strides in self-evaluation, promising to revolutionize the way we assess their performance.

Recent research has shown that LLMs contain internal latent directions associated with self-evaluation behaviour, a capability that generalizes across various domains and languages [1]. This self-evaluation ability is being leveraged to develop automated evaluation metrics, reducing the reliance on costly and inconsistent human judgments while offering scalable, interpretable, and cross-domain evaluation frameworks.

Techniques such as Self-Refine, Reflexion, Self-Verification, and Self-Contrast improve model outputs through iterative self-assessment and comparison of candidate answers, bridging generation and discrimination to reduce hallucinations and increase output quality [2].
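
As a concrete illustration, the sketch below implements a bare-bones self-refinement loop in the spirit of these methods. It assumes a hypothetical `complete()` helper that wraps whatever LLM API you use; the prompts and stopping criterion are illustrative only, not the exact setup from any of the cited papers.

```python
# Minimal self-refinement loop: the model critiques its own draft and revises it.
# `complete(prompt)` is a hypothetical helper that sends the prompt to an LLM
# and returns the text of its reply.

def complete(prompt: str) -> str:
    raise NotImplementedError("wrap your LLM API of choice here")

def self_refine(task: str, max_rounds: int = 3) -> str:
    draft = complete(f"Answer the following task:\n{task}")
    for _ in range(max_rounds):
        feedback = complete(
            f"Task:\n{task}\n\nDraft answer:\n{draft}\n\n"
            "List concrete problems with this draft (factual errors, gaps, unclear steps). "
            "Reply with 'LGTM' if there is nothing to improve."
        )
        if "LGTM" in feedback:
            break  # the model judges its own output to be good enough
        draft = complete(
            f"Task:\n{task}\n\nDraft answer:\n{draft}\n\nFeedback:\n{feedback}\n\n"
            "Rewrite the answer, addressing every point of feedback."
        )
    return draft
```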

Traditional evaluation methods have their limitations. Overfitting on benchmarks, insufficient diversity of metrics, and high cost and subjectivity in human evaluations are common issues [3]. Automated LLM-based evaluation offers a more consistent and sensitive alternative, particularly in specialized domains like health, where automated evaluation frameworks using precise Boolean rubrics significantly outperform Likert scales [4].
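
A minimal sketch of what a Boolean-rubric evaluation can look like is shown below. The rubric items and the `judge()` helper are illustrative assumptions, not taken from the cited health study; the point is that each criterion becomes a yes/no question and the score is the fraction of criteria that pass.

```python
# Sketch of a Boolean-rubric evaluation: each criterion is a yes/no question put
# to a judge LLM, and the score is the fraction of criteria that pass.
# The rubric items below are illustrative examples.

RUBRIC = [
    "Does the answer directly address the user's question?",
    "Is every claim in the answer supported by the provided context?",
    "Is the answer free of contradictions?",
]

def judge(prompt: str) -> str:
    raise NotImplementedError("call your judge LLM here")

def boolean_rubric_score(question: str, context: str, answer: str) -> float:
    passed = 0
    for criterion in RUBRIC:
        verdict = judge(
            f"Question: {question}\nContext: {context}\nAnswer: {answer}\n\n"
            f"{criterion} Reply with exactly YES or NO."
        )
        passed += verdict.strip().upper().startswith("YES")
    return passed / len(RUBRIC)  # 1.0 means all criteria satisfied
```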

Using advanced LLMs to score or compare generated outputs against references enables objective scoring on subjective criteria such as coherence and factuality without human intervention [5]. Answer relevance and context relevance are key factors in this assessment: the former captures whether the answer addresses the actual question, and the latter measures how relevant the provided context is to that question [6].
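
One way answer relevance is operationalized, per [6], is to ask an LLM to reconstruct questions from the answer and compare them with the original question. The sketch below assumes hypothetical `generate_questions()` and `embed()` helpers and uses cosine similarity for the comparison; it is a sketch of the idea, not a reference implementation.

```python
# Answer relevance via question generation: reconstruct questions from the
# answer and measure how close they are to the original question.
# `generate_questions` and `embed` are hypothetical helpers.

import numpy as np

def generate_questions(answer: str, n: int = 3) -> list[str]:
    raise NotImplementedError("prompt an LLM: 'Write a question this answer responds to'")

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("return a sentence embedding for the text")

def answer_relevance(question: str, answer: str) -> float:
    q_vec = embed(question)
    sims = []
    for generated in generate_questions(answer):
        g_vec = embed(generated)
        # cosine similarity between the original and the reconstructed question
        sims.append(float(q_vec @ g_vec / (np.linalg.norm(q_vec) * np.linalg.norm(g_vec))))
    return float(np.mean(sims))  # high when the answer stays on-topic
```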

One study found that LLM judges exhibit a significant bias toward the order in which options are presented [7]. Automated evaluation pipelines can mitigate this bias, for example by presenting candidates in both orders, leading to more accurate and fairer assessments.
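
The sketch below shows one such mitigation, assuming a hypothetical pairwise `judge()` helper: run each comparison in both orders and only accept a winner when the two verdicts agree.

```python
# Reduce position bias in pairwise LLM judging: compare in both orders and
# treat disagreeing verdicts as a tie. `judge` is a hypothetical wrapper that
# returns "A" (first answer shown wins) or "B" (second answer shown wins).

def judge(question: str, first: str, second: str) -> str:
    raise NotImplementedError("ask the judge LLM which answer is better: A or B")

def debiased_compare(question: str, answer_1: str, answer_2: str) -> str:
    verdict_forward = judge(question, answer_1, answer_2)   # answer_1 shown as A
    verdict_reversed = judge(question, answer_2, answer_1)  # answer_1 shown as B
    if verdict_forward == "A" and verdict_reversed == "B":
        return "answer_1"
    if verdict_forward == "B" and verdict_reversed == "A":
        return "answer_2"
    return "tie"  # the verdict flipped with the order, so treat it as inconclusive
```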

The metrics can be particularly useful in safety-critical settings such as healthcare, where ensuring accuracy and coherence and avoiding hallucinations are crucial [8]. They can also help guide product development and monitor the performance of LLMs in production [9].

G-Eval, a method that asks the model to assign a rating against explicit evaluation criteria, has been found to significantly outperform traditional reference-based metrics such as BLEU and ROUGE [10]. Such metrics also help identify areas for improvement and guide the tuning of parameters such as the prompt, temperature, and context [11].
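
The sketch below shows what a G-Eval-style rating call can look like: the prompt states the criterion, spells out intermediate evaluation steps, and the reply is parsed into a numeric score. The prompt wording and the `judge()` helper are illustrative assumptions, not the exact setup from the G-Eval paper.

```python
# G-Eval-style rating: criterion + intermediate evaluation steps in the prompt,
# a single integer score parsed from the reply. `judge` is a hypothetical wrapper.

import re

def judge(prompt: str) -> str:
    raise NotImplementedError("call your judge LLM here")

GEVAL_PROMPT = """You will rate the coherence of a summary on a scale from 1 to 5.

Evaluation steps:
1. Read the source document and the summary.
2. Check whether the summary presents information in a logical order.
3. Check whether sentences connect without abrupt jumps.
4. Assign a score from 1 (incoherent) to 5 (perfectly coherent).

Source document:
{document}

Summary:
{summary}

Score (a single integer from 1 to 5):"""

def coherence_score(document: str, summary: str) -> int:
    reply = judge(GEVAL_PROMPT.format(document=document, summary=summary))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"could not parse a score from: {reply!r}")
    return int(match.group())
```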

As interest in using LLMs to evaluate the output of other LLMs grows, these automated evaluation metrics will play a vital role in ensuring the reliability and efficiency of LLM assessment. Whether it is generating biographies with an error rate of less than 2% compared to Wikipedia articles [12], or a new framework, RAGAS (Retrieval Augmented Generation Assessment), for evaluating retrieval-augmented generation (RAG) pipelines [13], the potential applications of these metrics are vast and exciting.
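
To make the faithfulness idea behind RAGAS concrete, the sketch below breaks an answer into atomic claims and counts how many of them the retrieved context supports. `extract_claims()` and `judge()` are hypothetical LLM helpers and the prompts are illustrative; this is a sketch of the idea, not the RAGAS implementation.

```python
# Faithfulness sketch: split the answer into standalone claims and check how
# many of them can be inferred from the retrieved context alone.

def extract_claims(answer: str) -> list[str]:
    raise NotImplementedError("prompt an LLM to split the answer into standalone factual claims")

def judge(prompt: str) -> str:
    raise NotImplementedError("call your judge LLM here")

def faithfulness(context: str, answer: str) -> float:
    claims = extract_claims(answer)
    if not claims:
        return 0.0
    supported = 0
    for claim in claims:
        verdict = judge(
            f"Context:\n{context}\n\nClaim: {claim}\n\n"
            "Can the claim be inferred from the context alone? Reply YES or NO."
        )
        supported += verdict.strip().upper().startswith("YES")
    return supported / len(claims)  # 1.0 means every claim is grounded in the context
```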

References:

  • Steerable self-evaluation latent directions that are robust across domains and languages [1]
  • Self-verification and iterative self-feedback methods that improve output quality and evaluation [2]
  • Limitations of traditional evaluation and the move towards automated scoring [3]
  • Highly sensitive automated evaluation frameworks in specialized domains such as health [4]
  • LLM-as-a-judge scoring methods for assessing subjective output quality [5]
  • Answer relevance captures whether the answer addresses the actual question, assessed by asking the LLM to generate questions based on the answer [6]
  • LLMs exhibit a significant bias toward the order in which options are presented [7]
  • The metrics can be particularly useful in safety-critical settings such as healthcare [8]
  • These evaluation metrics can help guide product development and monitor the performance of LLMs in production [9]
  • G-Eval consists of a prompt, intermediate instructions, and a scoring function, and was found to significantly outperform traditional reference-based metrics like BLEU and ROUGE [10]
  • LLMs have many parameters that can be tuned, such as the prompt, temperature, and context [11]
  • The researchers used LLMs to generate biographies and found an error rate of less than 2% when compared to Wikipedia articles [12]
  • RAGAS (Retrieval Augmented Generation Assessment) is a new framework for evaluating RAG pipelines, focusing on faithfulness, answer relevance, and context relevance [13]
  • Human judgment is expensive and slow for evaluating large numbers of LLM outputs [14]
  • FactScore is a metric for factual precision that treats atomic facts as the unit of evaluation and grounds truthfulness in a particular knowledge source [15]
  • Prompt engineering techniques, like asking a model to take a deep breath or making a request more emotional, can improve performance [16]
  • There is growing interest in using LLMs to evaluate the output of other LLMs [17]
  • Faithfulness measures how grounded the answers are in the given context and correlates highly with human annotators [18]
