

Statistics in LLMs — Clustered Standard Errors

Report Highlights

Statistics in LLMs — Clustered Standard Errors is a report that will show you how to properly compute uncertainties in large language model evaluations when the samples are not independent. This dependence can arise from several factors in the prompt structure.

Along the way, we will show you how to characterize such dependence, covering the following topics:

  • What is a large language model? How can I measure LLM inconsistency?
  • What are clustered standard errors?
  • Clustered error formulas for sample dependence.
  • LLMs evaluation and clustered errors.
  • Conclusions.
  • And much more.

So, let us show you how to properly evaluate your prompts with the clustered standard error approach.

💡 Beyond this report, you will see that we have a lot of content and tools for you. Expand your learning skills by registering for an Omni Calculator Account today. It is fast and easy, and it allows you to create, edit, and share calculators. You can also access your previously used tools in the blink of an eye.

LLMs, or large language models, are deep learning models trained on vast amounts of data. This training makes LLMs capable of generating and comprehending natural language, so the models can parse questions we naturally ask in daily life, such as “What are the best shoes for running a half-marathon?” These questions (or instructions) are known as prompts.

The neural network architecture behind LLMs, known as the transformer, can read various types of text and find patterns in it. The transformer first splits the text of a question into small units, named tokens. Then it estimates the probability of every potential next token given those in your question and outputs the most likely ones. This process is known as inference, and it is repeated until the model completes its output. Depending on the size of the neural network and the complexity of your prompt, this process can take several minutes to produce an output.
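As a toy sketch of the inference step, the snippet below turns token scores ("logits") into probabilities with a softmax and picks the most likely next token. The vocabulary and scores are made-up values for illustration; a real transformer works over tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Turn raw model scores into a probability distribution."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate next tokens and the scores a model assigned them.
vocab = ["shoes", "marathon", "running", "the"]
logits = [2.0, 0.5, 1.2, -0.3]

probs = softmax(logits)
best = max(range(len(vocab)), key=lambda i: probs[i])
print(f"most likely next token: {vocab[best]} (p = {probs[best]:.3f})")
```

At inference time, this "score, normalize, pick" loop repeats once per generated token until the output is complete.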

[Image: Notebook screen accessing an AI platform]

Besides inference, there is another process related to LLM assessment, called evaluation or eval. Eval methods use human-led assessments and automated benchmarks to measure LLM inconsistency. They compare the model’s output with trustworthy data or human-generated responses to assess its accuracy and coherence.

The simplest analysis of LLMs assumes that all the samples from a token distribution are independent. In this case, if we have a dataset with $n$ questions, each one will receive an evaluation score defined as:

$$s_i = x_i + \epsilon_i$$

where:

  • $x_i$ — Expected performance of question $i$; and
  • $\epsilon_i$ — Random noise.
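As a minimal sketch of this independent-sample model, the simulation below draws scores $s_i = x_i + \epsilon_i$ with a made-up expected performance $x_i = 0.8$ and Gaussian noise, then computes the naive standard error:

```python
import random
import statistics

# Simulate s_i = x_i + eps_i. The expected performance (0.8) and noise
# scale (0.1) are made-up values for illustration.
random.seed(0)
n = 100
x = [0.8] * n                                       # expected performance of each question
scores = [xi + random.gauss(0, 0.1) for xi in x]    # s_i = x_i + eps_i

mean = statistics.mean(scores)
se = statistics.stdev(scores) / n ** 0.5            # naive SE, valid only for independent samples
print(f"mean = {mean:.3f}, naive SE = {se:.4f}")
```

The naive standard error divides by $\sqrt{n}$, which is exactly the step that breaks down when the samples are correlated.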

This approach can provide a good estimate of the model’s performance, as you can see in detail in our article “Statistics in LLMs: Introduction to basic concepts”.

However, if the questions are not independently sampled, we need to use clustered standard errors to properly evaluate the performance of LLMs.

When the samples are independent, the central limit theorem is valid, meaning that a normal distribution can characterize the average of the samples. However, when we have non-independent data, naively applying this theorem can underestimate the uncertainty of our measurements.

In such cases, we need to use clustered standard errors. A cluster represents a group of samples that depend on one another, and we can redefine the prompt score as $s_{ij}$, where $i$ is the cluster index and $j$ the sample within cluster $i$.

If the samples within a cluster are correlated, then we have the following constraint:

$$\mathrm{Cov}(s_{ij}, s_{ik}) > 0$$

$$\mathrm{Cov}(X, Y) = E(XY) - E(X)\,E(Y)$$

where:

  • $\mathrm{Cov}$ — Covariance; and
  • $E(X)$ — Expected value.

The covariance indicates how similar two samples from the same cluster are. If $\mathrm{Cov}(s_{ij}, s_{ik}) = 0$, then the samples are uncorrelated, and no clustering correction is needed.

🙋 You can learn more about covariance and expected values by checking out our covariance calculator and expected value calculator.
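To make the covariance check concrete, here is a small sketch that applies $\mathrm{Cov}(X, Y) = E(XY) - E(X)\,E(Y)$ to two hypothetical lists of paired scores drawn from the same cluster (say, questions sharing one prompt template):

```python
import statistics

# Hypothetical paired scores from the same cluster.
s_ij = [0.70, 0.82, 0.79, 0.81]
s_ik = [0.72, 0.88, 0.81, 0.83]

def cov(xs, ys):
    """Cov(X, Y) = E(XY) - E(X) E(Y)."""
    ex = statistics.mean(xs)
    ey = statistics.mean(ys)
    exy = statistics.mean(x * y for x, y in zip(xs, ys))
    return exy - ex * ey

# A positive value signals within-cluster dependence.
print(f"Cov = {cov(s_ij, s_ik):.5f}")
```

A positive result here is the constraint above in action: these samples co-vary, so the independent-sample formulas would understate the uncertainty.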

[Image: Example of four clusters of data with several samples]

As you can imagine, we will need to adjust some of our statistical quantities to account for clusters. So, let us consider a set of $n$ samples. When working with non-independent data, we need to compute the so-called cluster means, defined as:

$$\bar{s}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} s_{ij}$$

where $n_i$ is the number of samples inside cluster $i$. Therefore, the sample mean is written as:

$$\bar{s} = \frac{1}{C}\sum_{i=1}^{C} \bar{s}_i$$

where $C$ is the number of clusters. The clustered standard error can be determined using the following equation:

$$\mathrm{SE}_{\mathrm{cluster}} = \sqrt{\frac{1}{C\,(C-1)}\sum_{i=1}^{C}\left(\bar{s}_i - \bar{s}\right)^2}$$

So, the confidence interval taking into account the clustered standard error is:

$$\mathrm{CI} = \bar{s} \pm z \times \mathrm{SE}_{\mathrm{cluster}}$$

where $z = 1.96$ for a confidence interval of $95\%$.
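The whole pipeline (cluster means, grand mean, $\mathrm{SE}_{\mathrm{cluster}}$, and the 95% confidence interval) can be sketched in a few lines; the cluster scores below are made-up values:

```python
import statistics

# Hypothetical per-cluster scores s_ij.
clusters = [
    [0.75, 0.68, 0.67],   # cluster 1
    [0.84, 0.80, 0.82],   # cluster 2
    [0.79, 0.77, 0.81],   # cluster 3
]

cluster_means = [statistics.mean(c) for c in clusters]   # s-bar_i
s_bar = statistics.mean(cluster_means)                   # grand mean s-bar
C = len(clusters)

# SE_cluster = sqrt( sum (s-bar_i - s-bar)^2 / (C (C - 1)) )
se_cluster = (sum((m - s_bar) ** 2 for m in cluster_means)
              / (C * (C - 1))) ** 0.5

z = 1.96                                                 # 95% confidence level
ci = (s_bar - z * se_cluster, s_bar + z * se_cluster)
print(f"s_bar = {s_bar:.3f}, SE_cluster = {se_cluster:.4f}, "
      f"CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```

Note that the uncertainty is driven by how much the cluster means disagree with each other, not by the raw number of samples.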

Let us show you how to apply the concept of clustered standard errors in an LLM evaluation. Suppose that you have one dataset evaluated by two models, A and B, as shown in the table below:

| Dataset | Model A | Model B |
|---------|---------|---------|
| #1 | $0.78 \pm 0.04$ | $0.81 \pm 0.06$ |

The table shows the average scores and the confidence intervals computed assuming independent samples. Now, let us split the data for each model into four clusters, with the following cluster means:

Model A: $(0.70,\,0.82,\,0.79,\,0.81)$

Model B: $(0.72,\,0.88,\,0.81,\,0.83)$

By applying $C = 4$ and the cluster means above in the formula for $\mathrm{SE}_{\mathrm{cluster}}$, and multiplying by $z = 1.96$ to obtain the 95% half-width, we determine that:

| Dataset | Model A (clustered) | Model B (clustered) |
|---------|---------------------|---------------------|
| #1 | $0.78 \pm 0.054$ | $0.81 \pm 0.065$ |

By comparing the confidence intervals of the two tables, we can see that model A is more sensitive to clustered standard errors, as clustering increases its uncertainty by 35%. Moreover, the explicit intervals for models A and B after the computation of the clustered errors are, respectively, $[0.726,\,0.834]$ and $[0.745,\,0.875]$. These wider intervals increase the overlap between the two models when evaluating the dataset.
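You can reproduce the clustered table above in a few lines. The helper below computes the grand mean and the 95% half-width $z \times \mathrm{SE}_{\mathrm{cluster}}$ from each model's $C = 4$ cluster means:

```python
# The +/- column of the clustered table is the 95% half-width
# z * SE_cluster, computed from each model's cluster means.
def ci_half_width(means, z=1.96):
    C = len(means)
    s_bar = sum(means) / C
    se = (sum((m - s_bar) ** 2 for m in means) / (C * (C - 1))) ** 0.5
    return s_bar, z * se

for label, means in [("A", [0.70, 0.82, 0.79, 0.81]),
                     ("B", [0.72, 0.88, 0.81, 0.83])]:
    s_bar, hw = ci_half_width(means)
    print(f"Model {label}: {s_bar:.2f} +/- {hw:.3f} "
          f"-> [{s_bar - hw:.3f}, {s_bar + hw:.3f}]")
```

Running this yields $0.78 \pm 0.054$ and $0.81 \pm 0.065$, matching the table and the explicit intervals quoted above.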

This report showed how useful clustered errors and confidence intervals can be for characterizing LLMs when the samples are not independent.

The correlation between samples is quite common in benchmarks, where multiple items share the same context, prompt structure, or underlying task, leading to dependence between observations. We hope this journey helped you understand the influence of clustered errors on the evaluation of uncertainties and on LLM inconsistency.


This article was written by João Rafael Lucio dos Santos and reviewed by Steven Wooding.
