
Guiding Language Model Development

Using a language model through a playground or API may have required you to select some token sampling parameters. For many users, what these parameters do (and how best to set them) remains somewhat mysterious. This article explains how to use them effectively.

In the realm of artificial intelligence, language models have become an essential tool for generating human-like text. However, these models are not perfect and can produce incoherent, repetitive, or even hallucinated responses. To address these issues, this cheat sheet helps users choose optimal sampling parameters for their model.

This cheat sheet serves as a useful reference for tuning language models, offering insights into various parameters such as temperature, top-k, top-p, frequency penalty, and presence penalty.

Understanding the Parameters

Language models break text down into tokens, predict a probability distribution over the next token in the sequence, and sample from that distribution, which injects randomness. Adjusting these parameters lets you control hallucinations, inject creativity, and otherwise tune model behavior.

  • Temperature: This parameter controls the randomness of token selection. Lower values (e.g., 0.2–0.5) make output more deterministic and focused, reducing hallucinations but limiting creativity. Higher temperatures (e.g., 0.7–1.0+) increase randomness, enhancing creativity but potentially increasing hallucinations or incoherence.
  • Top-k sampling: This method restricts selection to the k most likely tokens, excluding the low-probability tail from being picked. Top-p sampling is a related method that clips the tail based on cumulative likelihood scores rather than token ranks.
  • Top-p (nucleus) sampling: This dynamically adapts the token pool size, balancing between variety and reliability. Lower p values focus sampling on high-probability tokens, reducing hallucinations; higher p values increase diversity and creativity.
  • Frequency penalty: This subtracts a penalty from a token's score for each time it has already occurred in the text, discouraging repeated use of the same tokens, words, or phrases.
  • Presence penalty: This applies a flat, one-time penalty to any token that has already occurred in the text, encouraging the introduction of new tokens and concepts.
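The parameters above can be combined into a single sampling step. The following is a minimal, standard-library sketch of that pipeline (real inference engines vectorize this, but the logic is the same; the function name and defaults are illustrative):

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample a token id from raw logits using temperature, top-k, and top-p."""
    # Temperature scales the logits before softmax; lower = sharper distribution.
    scaled = [l / max(temperature, 1e-8) for l in logits]

    # Numerically stable softmax to turn logits into probabilities.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k highest-probability tokens (0 = disabled).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Renormalize over the surviving tokens and sample one of them.
    mass = sum(probs[i] for i in kept)
    r, acc = rng.random() * mass, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```

For example, `sample_next_token([2.0, 1.0, 0.1], temperature=0.7, top_k=2, top_p=0.9)` samples from only the two strongest candidates; setting `top_k=1` makes the choice fully deterministic.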

Balancing Creativity and Reliability

Together, these parameters enable tuning of language model outputs. Setting lower temperature, smaller top-k, and lower top-p tightens the output distribution to reduce hallucinations and improve reliability. Increasing these parameters boosts creativity but risks more hallucinations or off-topic text. The frequency and presence penalties further refine generation by controlling repetition and novelty, helping balance coherence and innovation.

In practical use, you might:

  • Use temperature ~0.3–0.6, top-k around 40–200, and top-p around 0.8–0.95 for balanced yet creative outputs with controlled hallucinations.
  • Apply some frequency and presence penalties to avoid repeated phrasing or looping, especially in longer outputs.
  • Experiment with combinations of these parameters depending on your task’s tolerance for creativity versus factual accuracy.
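As a concrete starting point, the ranges above might translate into presets like these. The names and exact values are illustrative only; parameter names follow the common OpenAI-style API convention, and your provider's documentation is authoritative:

```python
# Illustrative presets only; exact parameter names and accepted
# ranges vary by provider and model.
BALANCED = {
    "temperature": 0.5,        # mid-range: creative but controlled
    "top_k": 100,              # moderate candidate pool
    "top_p": 0.9,              # nucleus covers 90% of probability mass
    "frequency_penalty": 0.3,  # mild discouragement of repeated tokens
    "presence_penalty": 0.3,   # mild push toward new topics
}

FACTUAL = {
    "temperature": 0.2,        # near-deterministic, fewer hallucinations
    "top_k": 40,               # tight candidate pool
    "top_p": 0.8,              # clip more of the low-probability tail
    "frequency_penalty": 0.0,  # repetition control usually unneeded here
    "presence_penalty": 0.0,
}
```

Start from a preset matching your task's tolerance for creativity versus factual accuracy, then adjust one parameter at a time.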

This approach aligns with state-of-the-art token-level sampling methods that combine top-k and top-p with accept-reject mechanisms to improve quality and reduce hallucination.

Setting the Parameters

The rules for setting parameters to zero are as follows:

  • Temperature: For a single answer per prompt: Zero. For many answers per prompt: Non-zero.
  • Frequency and Presence Penalties: When there is one correct answer: Zero. When there are many correct answers: Optional.
  • Top-p/Top-k: With zero temperature: The output is not affected. With non-zero temperature: Non-zero.
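The rules above can be encoded as a small decision helper. This is a sketch of the cheat sheet's logic only; the function name and return format are my own:

```python
def cheat_sheet_settings(answers_per_prompt, correct_answers):
    """Apply the cheat-sheet rules for which parameters to zero out.

    answers_per_prompt: how many completions you request per prompt.
    correct_answers: "one" if the task has a single correct answer,
                     "many" otherwise.
    """
    settings = {}
    # Temperature: zero for a single answer, non-zero for many.
    settings["temperature"] = "zero" if answers_per_prompt == 1 else "non-zero"
    # Penalties: zero when there is one correct answer, otherwise optional.
    penalty = "zero" if correct_answers == "one" else "optional"
    settings["frequency_penalty"] = penalty
    settings["presence_penalty"] = penalty
    # Top-p/top-k only matter when temperature is non-zero.
    if settings["temperature"] == "zero":
        settings["top_p_top_k"] = "no effect"
    else:
        settings["top_p_top_k"] = "non-zero"
    return settings
```

For a question-answering task with one correct answer and one completion per prompt, every rule resolves to zero (or "no effect"); for creative tasks with several completions, the non-zero rules apply.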

Low temperatures in token sampling favor quality, while high temperatures favor diversity. When the temperature is set to zero, the model always samples the token with the highest likelihood score: there is zero diversity between queries, but we always pick the highest-quality continuation as assessed by the model.
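This effect is easy to see numerically: lowering the temperature concentrates the softmax distribution onto the highest-scoring token, so temperature zero reduces to greedy (argmax) decoding. A small sketch:

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]
spread = softmax(logits, 1.0)   # mass is spread across all three tokens
peaked = softmax(logits, 0.1)   # nearly all mass collapses onto token 0
```

At temperature 1.0 the top token gets roughly 63% of the probability mass here; at 0.1 it gets essentially all of it, which is why zero temperature behaves like always picking the argmax.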

Frequency and presence penalties add targeted penalties to inject diversity into the model's responses, unlike temperature which adds diversity with randomness. The cheat sheet provides rules for deciding which values to set to zero and tips for tuning the non-zero parameters.

The presence penalty causes the model to discuss more diverse subject matter and change topics more often without significantly discouraging the repetition of frequently used words. The frequency/presence penalties increase the diversity within a single response, while temperature increases diversity between responses.
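The two penalties can be sketched directly as an adjustment to the logits, following the penalty formula documented for OpenAI-style APIs: each token's logit is reduced by `frequency_penalty * count` plus a one-time `presence_penalty` if the count is non-zero. The function name below is my own:

```python
def apply_penalties(logits, generated_ids, frequency_penalty, presence_penalty):
    """Penalize tokens that already appeared in the generated output.

    frequency_penalty scales with how often a token occurred;
    presence_penalty is flat and applies once per distinct token.
    """
    # Count occurrences of each token id in the output so far.
    counts = {}
    for t in generated_ids:
        counts[t] = counts.get(t, 0) + 1

    adjusted = list(logits)
    for t, c in counts.items():
        # Repeated tokens lose more logit mass than once-used ones.
        adjusted[t] -= frequency_penalty * c + presence_penalty
    return adjusted
```

With `frequency_penalty=0.5` and `presence_penalty=0.2`, a token generated twice loses 1.2 of its logit while a token generated once loses 0.7, matching the intuition above: the frequency penalty targets heavy repetition, the presence penalty simply nudges the model toward tokens it has not used at all.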

By understanding and applying these parameters, users can significantly improve the quality and coherence of their language model's outputs.
