
Statistics in LLMs — Introduction to basic concepts

Report Highlights

In the era of AI, you have probably heard of large language models (LLMs), but what do the statistics of LLMs look like? In this report, we will examine this category of deep learning models in detail and show how basic statistics can be applied in this context.

Stay with us on this journey, where we will cover the following subjects:

  • What is a large language model​? Defining transformer, tokens, and evals;
  • Statistics in LLMs and LLM evaluation;
  • LLM performance and confidence intervals;
  • Conclusions;
  • And much more.

So, let’s explore this AI world, full of multiple datasets, using our well-known statistical quantities.

Beyond this report, you will find that we have plenty of content and tools for you. Expand your learning by registering for an Omni Account today. It is fast and easy, and it allows you to create, edit, and share calculators. You can also access the tools you have used before in the blink of an eye.

First things first, let us start with the definition of a large language model (LLM). It is a category of deep learning models trained on a large amount of data. Such training makes them capable of generating and comprehending natural language. Therefore, they can perform various tasks.

The neural network architecture behind LLMs is called a transformer, which can read various types of text and find patterns in them. After training, LLMs can be used to answer prompts. The transformer divides these prompts into small units of text, called tokens. Then the transformer estimates the probability of all potential tokens connected to those in the prompt, and outputs the most likely ones. This process is known as inference, and it is repeated until the model completes its output.

This architecture is designed so that the model does not know the output in advance. Instead, it uses statistical quantities learned during training to determine the final answer.
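The next-token step described above can be sketched in a few lines. This is a minimal illustration, not how a real transformer is implemented: the vocabulary and the logit values below are made up for the example, and a real model produces logits over tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Convert raw model scores (logits) into a probability distribution."""
    shifted = [x - max(logits) for x in logits]  # subtract max for stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate tokens and logits after a prompt such as "The sky is"
vocab = ["blue", "green", "loud", "falling"]
logits = [4.0, 1.5, 0.2, 0.1]

probs = softmax(logits)
# Greedy inference: output the most likely next token
next_token = vocab[probs.index(max(probs))]
```

In practice, this step is repeated, feeding each chosen token back into the model, until the output is complete.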

Another relevant process related to LLMs is evaluation (or eval). Evaluation methods use human-led assessments and automated benchmarks to test the limits of LLMs. The eval procedure involves comparing the model’s output with trustworthy data or human-generated responses to assess the model’s accuracy and coherence. In this report, we will show you how statistical quantities can be applied to evaluate LLMs.

🙋 You can better understand several statistical quantities used in LLMs by using our confidence interval calculator, z-score calculator, and normal distribution calculator.


The process of evaluating an LLM model can be viewed as a statistical problem. The reason is that each evaluation is an estimate subject to random variation. Therefore, we can use several statistical quantities to assess the potential reliability of an observed evaluation score.

Let’s start with the definition of an evaluation score for a prompt, which is given by the following equation:

s_i = x_i + \epsilon_i

where:

  • x_i — True expected performance on question i; and
  • \epsilon_i — Random noise.

One important point to highlight is that we only have access to a finite number of scores from the dataset evaluation. Thus, to estimate the performance of our LLM, we compute the sample mean \bar{s} over a finite set of evaluation scores. The sample mean has the form:

\bar{s} = \frac{1}{n}\,\sum_{i=1}^{n}\,s_i

where n is the number of questions. By repeating the evaluation process across multiple datasets, we obtain different values of \bar{s}. This procedure forms a sampling distribution, and we can measure the variation of \bar{s} across datasets by computing the standard error, which is given by:

SE=1n(n1)i=1n(sisˉ)2\mathrm{SE} = \sqrt{\frac{1}{n(n-1)}\,\sum_{i=1}^n\left(s_i-\bar{s}\right)^2}

Let’s see how it works with an example. Suppose that your model evaluated n = 5 questions whose answers were “True” or “False”. True is labeled as 1 and False as 0. The results are:

s = [1, 1, 0, 1, 0]

Therefore, \bar{s} = 3/5 = 0.6. Using these values in the formula for the standard error, we find \mathrm{SE} \approx 0.24, indicating that performance is quite variable across these questions. This is only a preliminary analysis of the answers to a given prompt, and drawing conclusions from such a small sample could easily mislead us about the statistical significance of the results.
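The sample mean and standard error for this example can be computed directly from their definitions, as a quick check of the numbers above:

```python
import math

# Scores from the example: 1 = correct ("True"), 0 = incorrect ("False")
scores = [1, 1, 0, 1, 0]
n = len(scores)

# Sample mean: fraction of correct answers
mean = sum(scores) / n  # 3/5 = 0.6

# Standard error of the mean, per the formula above
se = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n * (n - 1)))
```

With n = 5, the standard error works out to roughly 0.245, which rounds to the 0.24 quoted in the text.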


Another relevant statistical quantity for analyzing LLM performance is the confidence interval. Assuming a normal (Gaussian) distribution for our samples, the confidence interval is computed using the following equation:

\mathrm{CI} = \bar{s} \pm z \times \mathrm{SE}

where z is the z-score. For a 95% confidence interval, the z-score equals 1.96.

Thus, an interesting way to apply statistics to LLMs is to compare models using confidence intervals: if two models’ intervals overlap, the observed difference between them may not be statistically significant.

Let’s see how to perform this comparison with an example. Suppose that you have three datasets that were evaluated by two different models, named A and B. The results of the evaluation, with their respective confidence intervals, are shown in the table below:

| Dataset | Model A | Model B |
|---------|-------------|-------------|
| #1 | 0.78 ± 0.04 | 0.81 ± 0.06 |
| #2 | 0.65 ± 0.08 | 0.67 ± 0.10 |
| #3 | 0.92 ± 0.02 | 0.90 ± 0.04 |

From the table, we can observe that dataset #2 has the worst performance: it has the lowest mean scores (0.65 and 0.67) and the largest uncertainties (±0.08 and ±0.10). In contrast, dataset #3 shows the best scores and the lowest uncertainties (±0.02 and ±0.04). Furthermore, for every dataset, the confidence intervals of models A and B overlap, so there is no significant difference between their averages.

These results lead us to conclude that models A and B are statistically indistinguishable. To separate them, we would need more data or different techniques, such as clustered confidence intervals.
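The overlap check behind this conclusion can be expressed in a few lines. Two intervals written as mean ± half-width overlap exactly when the distance between the means is at most the sum of the half-widths:

```python
def intervals_overlap(a, b):
    """True if two intervals given as (mean, half_width) overlap."""
    (mean_a, hw_a), (mean_b, hw_b) = a, b
    return abs(mean_a - mean_b) <= hw_a + hw_b

# (mean, half-width) pairs for models A and B, taken from the table
results = {
    "#1": ((0.78, 0.04), (0.81, 0.06)),
    "#2": ((0.65, 0.08), (0.67, 0.10)),
    "#3": ((0.92, 0.02), (0.90, 0.04)),
}

overlaps = {name: intervals_overlap(a, b) for name, (a, b) in results.items()}
```

Running this over the table shows that the intervals overlap on all three datasets, which is why we cannot declare a winner between A and B.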

And that's the end of this learning journey, but also the beginning of a path that can lead you to a deep understanding of the concepts behind LLMs and artificial intelligence.


This article was written by João Rafael Lucio dos Santos and reviewed by Steven Wooding.
