

Statistics in LLMs — Clustered Standard Errors

Report Highlights

Statistics in LLMs — Clustered Standard Errors is a report that will show you how to properly compute uncertainties in large language model evaluations when the samples are not independent. This dependence can arise from several factors in the prompt structure.

Along the way, we will show you how to characterize such dependence, covering the following topics:

  • What is a large language model? How can I measure LLM inconsistency?
  • What are clustered standard errors?
  • Clustered error formulas for sample dependence.
  • LLMs evaluation and clustered errors.
  • Conclusions.
  • And much more.

So, let us show you how to properly evaluate your prompts with the clustered standard error approach.

💡 Beyond this report, you will see that we have a lot of content and tools for you. Expand your learning skills by registering for an Omni Calculator Account today. It is fast and easy, and it allows you to create, edit, and share calculators. You can also access your previously used tools in the blink of an eye.

LLMs, or large language models, are deep learning models trained on vast amounts of data. This training makes LLMs capable of generating and comprehending natural language, so the models can parse questions we naturally ask in daily life, such as “What are the best shoes for running a half-marathon?” These questions (or instructions) are known as prompts.

The neural network architecture behind LLMs, known as the transformer, can read various types of text and find patterns in it. The transformer first splits the text of a question into small units, named tokens. Then it estimates the probability of every potential next token given those in your question and outputs the most likely ones. This process is known as inference, and it is repeated until the model completes its output. Depending on the size of the neural network and the complexity of your prompt, this process can take several minutes to produce an output.
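As a toy sketch of the inference step, the snippet below turns token scores ("logits") into probabilities with a softmax and picks the most likely next token. The vocabulary and scores are made-up values for illustration; a real transformer works over tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Turn raw model scores into a probability distribution."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate next tokens and the scores a model assigned them.
vocab = ["shoes", "marathon", "running", "the"]
logits = [2.0, 0.5, 1.2, -0.3]

probs = softmax(logits)
best = max(range(len(vocab)), key=lambda i: probs[i])
print(f"most likely next token: {vocab[best]} (p = {probs[best]:.3f})")
```

At inference time, this "score, normalize, pick" loop repeats once per generated token until the output is complete.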

[Image: Notebook screen accessing an AI platform]

Besides inference, there is another process related to LLM assessment, called evaluation or eval. Eval methods use human-led assessments and automated benchmarks to measure LLM inconsistency. They compare the model’s output with trustworthy data or human-generated responses to assess its accuracy and coherence.

The simplest analysis of LLMs assumes that all the samples from a token distribution are independent. In this case, if we have a dataset with $n$ questions, each one will receive an evaluation score defined as:

$$s_i = x_i + \epsilon_i$$

where:

  • $x_i$ — Expected performance of question $i$; and
  • $\epsilon_i$ — Random noise.
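As a minimal sketch of this independent-sample model, the simulation below draws scores $s_i = x_i + \epsilon_i$ with a made-up expected performance $x_i = 0.8$ and Gaussian noise, then computes the naive standard error:

```python
import random
import statistics

# Simulate s_i = x_i + eps_i. The expected performance (0.8) and noise
# scale (0.1) are made-up values for illustration.
random.seed(0)
n = 100
x = [0.8] * n                                       # expected performance of each question
scores = [xi + random.gauss(0, 0.1) for xi in x]    # s_i = x_i + eps_i

mean = statistics.mean(scores)
se = statistics.stdev(scores) / n ** 0.5            # naive SE, valid only for independent samples
print(f"mean = {mean:.3f}, naive SE = {se:.4f}")
```

The naive standard error divides by $\sqrt{n}$, which is exactly the step that breaks down when the samples are correlated.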

This approach can provide a good estimate of the model’s performance, as you can see in detail in our article “Statistics in LLMs: Introduction to basic concepts”.

However, if the questions are not independently sampled, we need to use clustered standard errors to properly evaluate the performance of LLMs.

When the samples are independent, the central limit theorem is valid, meaning that a normal distribution can characterize the average of the samples. However, when we have non-independent data, naively applying this theorem can underestimate the uncertainty of our measurements.

In such cases, we need to use clustered standard errors. A cluster represents a group of samples that depend on one another, and we can redefine the prompt score as $s_{ij}$, where $i$ is the cluster index and $j$ the sample within cluster $i$.

If the samples within a cluster are correlated, then we have the following constraint:

$$\mathrm{Cov}(s_{ij}, s_{ik}) > 0$$

$$\mathrm{Cov}(X, Y) = E(XY) - E(X)\,E(Y)$$

where:

  • $\mathrm{Cov}$ — Covariance; and
  • $E(X)$ — Expected value.

The covariance indicates how similar two samples from the same cluster are. If $\mathrm{Cov}(s_{ij}, s_{ik}) = 0$, then the samples are uncorrelated, and no clustering correction is needed.

🙋 You can learn more about covariance and expected values by checking out our covariance calculator and expected value calculator.
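To make the covariance check concrete, here is a small sketch that applies $\mathrm{Cov}(X, Y) = E(XY) - E(X)\,E(Y)$ to two hypothetical lists of paired scores drawn from the same cluster (say, questions sharing one prompt template):

```python
import statistics

# Hypothetical paired scores from the same cluster.
s_ij = [0.70, 0.82, 0.79, 0.81]
s_ik = [0.72, 0.88, 0.81, 0.83]

def cov(xs, ys):
    """Cov(X, Y) = E(XY) - E(X) E(Y)."""
    ex = statistics.mean(xs)
    ey = statistics.mean(ys)
    exy = statistics.mean(x * y for x, y in zip(xs, ys))
    return exy - ex * ey

# A positive value signals within-cluster dependence.
print(f"Cov = {cov(s_ij, s_ik):.5f}")
```

A positive result here is the constraint above in action: these samples co-vary, so the independent-sample formulas would understate the uncertainty.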

[Image: Example of four clusters of data with several samples]

As you can imagine, we will need to adjust some of our statistical quantities to account for clusters. So, let us consider a set of $n$ samples. When working with non-independent data, we need to compute the so-called cluster means, defined as:

$$\bar{s}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} s_{ij}$$

where $n_i$ is the number of samples inside cluster $i$. Therefore, the sample mean is written as:

$$\bar{s} = \frac{1}{C}\sum_{i=1}^{C} \bar{s}_i$$

where $C$ is the number of clusters. The clustered standard error can be determined using the following equation:

$$\mathrm{SE}_{\mathrm{cluster}} = \sqrt{\frac{1}{C\,(C-1)}\sum_{i=1}^{C}\left(\bar{s}_i - \bar{s}\right)^2}$$

So, the confidence interval taking into account the clustered standard error is:

$$\mathrm{CI} = \bar{s} \pm z \times \mathrm{SE}_{\mathrm{cluster}}$$

where $z = 1.96$ for a confidence interval of $95\%$.
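The whole pipeline (cluster means, grand mean, $\mathrm{SE}_{\mathrm{cluster}}$, and the 95% confidence interval) can be sketched in a few lines; the cluster scores below are made-up values:

```python
import statistics

# Hypothetical per-cluster scores s_ij.
clusters = [
    [0.75, 0.68, 0.67],   # cluster 1
    [0.84, 0.80, 0.82],   # cluster 2
    [0.79, 0.77, 0.81],   # cluster 3
]

cluster_means = [statistics.mean(c) for c in clusters]   # s-bar_i
s_bar = statistics.mean(cluster_means)                   # grand mean s-bar
C = len(clusters)

# SE_cluster = sqrt( sum (s-bar_i - s-bar)^2 / (C (C - 1)) )
se_cluster = (sum((m - s_bar) ** 2 for m in cluster_means)
              / (C * (C - 1))) ** 0.5

z = 1.96                                                 # 95% confidence level
ci = (s_bar - z * se_cluster, s_bar + z * se_cluster)
print(f"s_bar = {s_bar:.3f}, SE_cluster = {se_cluster:.4f}, "
      f"CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```

Note that the uncertainty is driven by how much the cluster means disagree with each other, not by the raw number of samples.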

Let us show you how to apply the concept of clustered standard errors in an LLM evaluation. Suppose that you have one dataset evaluated by two models, A and B, as shown in the table below:

| Dataset | Model A | Model B |
|---------|---------|---------|
| #1 | $0.78 \pm 0.04$ | $0.81 \pm 0.06$ |

The table shows the average scores and the confidence intervals computed assuming independent samples. Now, let us split the data for each model into four clusters, with the following cluster means:

Model A: $(0.70,\,0.82,\,0.79,\,0.81)$

Model B: $(0.72,\,0.88,\,0.81,\,0.83)$

By applying $C = 4$ and the cluster means above in the formula for $\mathrm{SE}_{\mathrm{cluster}}$, and multiplying by $z = 1.96$ to obtain the 95% half-width, we determine that:

| Dataset | Model A (clustered) | Model B (clustered) |
|---------|---------------------|---------------------|
| #1 | $0.78 \pm 0.054$ | $0.81 \pm 0.065$ |

By comparing the confidence intervals of the two tables, we can see that model A is more sensitive to clustered standard errors, as clustering increases its uncertainty by 35%. Moreover, the explicit intervals for models A and B after the computation of the clustered errors are, respectively, $[0.726,\,0.834]$ and $[0.745,\,0.875]$. These wider intervals increase the overlap between the two models when evaluating the dataset.
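You can reproduce the clustered table above in a few lines. The helper below computes the grand mean and the 95% half-width $z \times \mathrm{SE}_{\mathrm{cluster}}$ from each model's $C = 4$ cluster means:

```python
# The +/- column of the clustered table is the 95% half-width
# z * SE_cluster, computed from each model's cluster means.
def ci_half_width(means, z=1.96):
    C = len(means)
    s_bar = sum(means) / C
    se = (sum((m - s_bar) ** 2 for m in means) / (C * (C - 1))) ** 0.5
    return s_bar, z * se

for label, means in [("A", [0.70, 0.82, 0.79, 0.81]),
                     ("B", [0.72, 0.88, 0.81, 0.83])]:
    s_bar, hw = ci_half_width(means)
    print(f"Model {label}: {s_bar:.2f} +/- {hw:.3f} "
          f"-> [{s_bar - hw:.3f}, {s_bar + hw:.3f}]")
```

Running this yields $0.78 \pm 0.054$ and $0.81 \pm 0.065$, matching the table and the explicit intervals quoted above.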

This report showed how useful clustered errors and confidence intervals can be for characterizing LLMs when the samples are not independent.

The correlation between samples is quite common in benchmarks, where multiple items share the same context, prompt structure, or underlying task, leading to dependence between observations. We hope this journey helped you understand the influence of clustered errors on the evaluation of uncertainties and on LLM inconsistency.


This article was written by João Rafael Lucio dos Santos and reviewed by Steven Wooding.
