
Statistics in LLMs — Introduction to basic concepts

Report Highlights

In the era of AI, you have probably heard of large language models (LLMs), but what do the statistics of LLMs look like? In this report, we will examine this category of deep learning models in detail and show how basic statistics can be applied in this context.

Stay with us on this journey, where we will cover the following subjects:

  • What is a large language model​? Defining transformer, tokens, and evals;
  • Statistics in LLMs and LLM evaluation;
  • LLM performance and confidence intervals;
  • Conclusions;
  • And much more.

So, let’s explore this AI world, full of multiple datasets, using our well-known statistical quantities.

Beyond this report, you will find that we have plenty of content and tools for you. Expand your learning by registering for an Omni Account today. It is fast and easy, and it allows you to create, edit, and share calculators. You can also access the tools you have used before in the blink of an eye.

First things first, let us start with the definition of a large language model (LLM). It is a category of deep learning models trained on a large amount of data. Such training makes them capable of generating and comprehending natural language. Therefore, they can perform various tasks.

The neural network architecture behind LLMs is called a transformer, which can read various types of text and find patterns in them. After training, LLMs can be used to answer prompts. The transformer divides these prompts into small units of text, called tokens. Then the transformer estimates the probability of all potential tokens connected to those in the prompt, and outputs the most likely ones. This process is known as inference, and it is repeated until the model completes its output.

This architecture is designed so that the model does not know the output in advance. Instead, it uses statistical quantities learned during training to determine the final answer.
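The next-token step described above can be sketched in a few lines. This is a minimal illustration, not how a real transformer is implemented: the vocabulary and the logit values below are made up for the example, and a real model produces logits over tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Convert raw model scores (logits) into a probability distribution."""
    shifted = [x - max(logits) for x in logits]  # subtract max for stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate tokens and logits after a prompt such as "The sky is"
vocab = ["blue", "green", "loud", "falling"]
logits = [4.0, 1.5, 0.2, 0.1]

probs = softmax(logits)
# Greedy inference: output the most likely next token
next_token = vocab[probs.index(max(probs))]
```

In practice, this step is repeated, feeding each chosen token back into the model, until the output is complete.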

Another relevant process related to LLMs is evaluation (or eval). Evaluation methods use human-led assessments and automated benchmarks to test the limits of LLMs. The eval procedure involves comparing the model’s output with trustworthy data or human-generated responses to assess the model’s accuracy and coherence. In this report, we will show you how statistical quantities can be applied to evaluate LLMs.

🙋 You can better understand several statistical quantities used in LLMs by using our confidence interval calculator, z-score calculator, and normal distribution calculator.


The process of evaluating an LLM model can be viewed as a statistical problem. The reason is that each evaluation is an estimate subject to random variation. Therefore, we can use several statistical quantities to assess the potential reliability of an observed evaluation score.

Let’s start with the definition of an evaluation score for a prompt, which is given by the following equation:

s_i = x_i + \epsilon_i

where:

  • x_i — True expected performance on question i; and
  • \epsilon_i — Random noise.

One important point to highlight is that we only have access to a finite number of scores from the dataset evaluation. Thus, to estimate the performance of our LLM, we compute the sample mean \bar{s} over a finite set of evaluation scores. The sample mean has the form:

\bar{s} = \frac{1}{n}\,\sum_{i=1}^{n}\,s_i

where n is the number of questions. By repeating the evaluation process across multiple datasets, we obtain different values of \bar{s}. This procedure forms a sampling distribution, and we can measure the variation of \bar{s} across datasets by computing the standard error, which is given by:

SE=1n(n1)i=1n(sisˉ)2\mathrm{SE} = \sqrt{\frac{1}{n(n-1)}\,\sum_{i=1}^n\left(s_i-\bar{s}\right)^2}

Let’s see how it works with an example. Suppose that your model evaluated n = 5 questions whose answers were “True” or “False”. True is labeled as 1 and False as 0. The results are:

s = [1, 1, 0, 1, 0]

Therefore, \bar{s} = 3/5 = 0.6. Using these values in the formula for the standard error, we find \mathrm{SE} \approx 0.24, indicating that performance is quite variable across these questions. This is only a preliminary analysis of the answers to a given prompt, and drawing conclusions from such a small sample could easily mislead us about the statistical significance of the results.
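The sample mean and standard error for this example can be computed directly from their definitions, as a quick check of the numbers above:

```python
import math

# Scores from the example: 1 = correct ("True"), 0 = incorrect ("False")
scores = [1, 1, 0, 1, 0]
n = len(scores)

# Sample mean: fraction of correct answers
mean = sum(scores) / n  # 3/5 = 0.6

# Standard error of the mean, per the formula above
se = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n * (n - 1)))
```

With n = 5, the standard error works out to roughly 0.245, which rounds to the 0.24 quoted in the text.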


Another relevant statistical quantity for analyzing LLM performance is the confidence interval. Assuming a normal (Gaussian) distribution for our samples, the confidence interval is computed using the following equation:

\mathrm{CI} = \bar{s} \pm z \times \mathrm{SE}

where z is the z-score. For a 95% confidence interval, the z-score equals 1.96.

Thus, an interesting way to apply statistics to LLMs is to compare models using confidence intervals: if two models’ intervals overlap, the observed difference between them may not be statistically significant.

Let’s see how to perform this comparison with an example. Suppose that you have three datasets that were evaluated by two different models, named A and B. The results of the evaluation, with their respective confidence intervals, are shown in the table below:

| Dataset | Model A | Model B |
|---------|-------------|-------------|
| #1 | 0.78 ± 0.04 | 0.81 ± 0.06 |
| #2 | 0.65 ± 0.08 | 0.67 ± 0.10 |
| #3 | 0.92 ± 0.02 | 0.90 ± 0.04 |

From the table, we can observe that dataset #2 has the worst performance: it has the lowest mean scores (0.65 and 0.67) and the largest uncertainties (±0.08 and ±0.10). In contrast, dataset #3 shows the best scores and the lowest uncertainties (±0.02 and ±0.04). Furthermore, for every dataset, the confidence intervals of models A and B overlap, so there is no significant difference between their averages.

These results lead us to conclude that models A and B are statistically indistinguishable. To separate them, we would need more data or different techniques, such as clustered confidence intervals.
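The overlap check behind this conclusion can be expressed in a few lines. Two intervals written as mean ± half-width overlap exactly when the distance between the means is at most the sum of the half-widths:

```python
def intervals_overlap(a, b):
    """True if two intervals given as (mean, half_width) overlap."""
    (mean_a, hw_a), (mean_b, hw_b) = a, b
    return abs(mean_a - mean_b) <= hw_a + hw_b

# (mean, half-width) pairs for models A and B, taken from the table
results = {
    "#1": ((0.78, 0.04), (0.81, 0.06)),
    "#2": ((0.65, 0.08), (0.67, 0.10)),
    "#3": ((0.92, 0.02), (0.90, 0.04)),
}

overlaps = {name: intervals_overlap(a, b) for name, (a, b) in results.items()}
```

Running this over the table shows that the intervals overlap on all three datasets, which is why we cannot declare a winner between A and B.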

And that's the end of this learning journey, but also the beginning of a path that can lead you to a deep understanding of the concepts behind LLMs and artificial intelligence.


This article was written by João Rafael Lucio dos Santos and reviewed by Steven Wooding.
