If you have two samples and you want to determine if they vary in a similar way, this covariance calculator is the tool you need. Here, you will learn how the covariance formula works, how to calculate covariance, and how covariance relates to correlation.
First, to answer the question: What is covariance?
What is covariance?
If X and Y are two random variables with expected values
E[X] and E[Y] respectively, their covariance is:
Cov(X, Y) = E[(X – E[X])*(Y – E[Y])].
This covariance formula can further be simplified to:
Cov(X, Y) = E[X*Y] – E[X]*E[Y].
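To see that the two expressions agree, here is a minimal Python sketch that evaluates both on a small joint distribution. The probability table `joint_pmf` and the helper `expectation` are made up purely for illustration; they are not part of the calculator.

```python
from fractions import Fraction

# Hypothetical joint distribution of (X, Y): maps (x, y) -> probability.
joint_pmf = {
    (0, 0): Fraction(1, 4),
    (0, 1): Fraction(1, 4),
    (1, 1): Fraction(1, 2),
}


def expectation(g):
    """Return E[g(X, Y)] under the illustrative joint_pmf."""
    return sum(p * g(x, y) for (x, y), p in joint_pmf.items())


e_x = expectation(lambda x, y: x)
e_y = expectation(lambda x, y: y)

# Definition: E[(X - E[X]) * (Y - E[Y])]
cov_definition = expectation(lambda x, y: (x - e_x) * (y - e_y))
# Simplified form: E[X*Y] - E[X]*E[Y]
cov_shortcut = expectation(lambda x, y: x * y) - e_x * e_y

print(cov_definition, cov_shortcut)
```

Exact fractions are used here so that the two results can be compared without floating-point rounding.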
The above formula is not practical in real-life situations, as it relies on us knowing the expected values for X, Y, and X*Y. The only way to discover these numbers is by knowing how the random variables X, Y, and X*Y are distributed. But you usually don't know that. On the contrary, this is something you want to discover!
Instead, we use samples x and y, each consisting of a finite number n of observations. We will see that this is actually enough to perform one of the following two tasks:
Find the covariance of X and Y when we have access to the whole population data.
Find an estimate of the population covariance for X and Y when we only have access to a sample.
Covariance formula explained
So, each of the two samples, x and y, consists of n randomly observed values of X and Y respectively. The elements of the first sample are denoted by
x1, x2, ..., xn, and their average by
xmean. Similarly, the elements of the second sample are
y1, y2, ..., yn, with an average of
ymean.
The following formula is the population covariance formula for two equally sized samples:
Covpop(x, y) = sum((xi - xmean) * (yi - ymean)) / n,
with summation over
i = 1, 2, ..., n.
The name here comes from the fact that we regard our two samples as all there is, i.e., they constitute our populations. We do not care here what happens outside of these samples.
The usefulness of the covariance formula is not immediately clear. Let's break it down into parts to understand it better.
Remember that n represents the sample size of each of the two samples. For each
i = 1, 2, ..., n, the terms
xi - xmean and
yi - ymean calculate the differences between the sample elements and the sample average.
In the covariance formula, these two terms are multiplied, then summed over all sample elements, and finally averaged by dividing by n, the size of each of the samples. Phew!
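The steps just described (subtract the averages, multiply, sum, divide by n) can be sketched in a few lines of Python. The function name `population_covariance` and the sample values are illustrative assumptions, not something defined by the calculator.

```python
def population_covariance(x, y):
    """Covariance of two equally sized samples, dividing by n."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Multiply the two deviations element by element, then average.
    products = [(xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)]
    return sum(products) / n


# Made-up samples: y is exactly twice x, so they vary together.
print(population_covariance([1, 2, 3], [2, 4, 6]))
```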
Don't be scared though! The idea behind the covariance formula is actually pretty simple: we want to measure how much the data from the two samples vary together.
Don't worry, we've prepared an example to make things much easier to understand.
Example on how to calculate covariance
We will now dig a bit deeper into the calculation's details by looking at how the covariance formula works in a real-life example.
John is an investor who just bought his first few stocks in "Cool Places", a company specialising in polar vacations. But, as every smart investor knows, John should diversify his portfolio, so he decided to buy some stocks in either "Star Dust" or "Time Travel Vacations", two more travel agencies. But he can't decide which stocks to buy. What should he do?
Well, the covariance formula may have the answer! John randomly selects five daily closing prices, not necessarily in chronological order, for the stocks of "Cool Places" and "Star Dust", denoted by xi and yi respectively.
Here is a step-by-step calculation:
Calculate the average of the "Cool Places" stock closing prices by adding up the values in the second column and by dividing the sum by the sample size of n = 5. Do the same for "Star Dust" stock prices in the third column:
xmean = 12.666 and
ymean = 7.016.
Complete the fourth and fifth columns by subtracting the corresponding averages
xmean and ymean from the stock prices listed in the rows:
xdiff = xi - xmean and
ydiff = yi - ymean.
Fill out the sixth column by simply multiplying the corresponding numbers from the fourth and fifth columns.
Finally, when you sum up all the numbers in the sixth column and divide the sum by the sample size, in this case n = 5, you get the population covariance
Covpop(x, y) = 0.0483!
Check this result by using the covariance calculator and read on to find out how to interpret this number.
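The column-by-column procedure above can also be mirrored in code. The sketch below uses invented closing prices, not John's actual data, so its result will not match the figure quoted above. Each tuple in `rows` corresponds to one table row: the two prices, the two differences from the averages, and their product.

```python
# Hypothetical closing prices, invented for illustration only.
x = [10.0, 12.0, 11.0, 13.0, 14.0]
y = [5.0, 6.0, 5.5, 7.0, 6.5]
n = len(x)

# Step 1: the averages of the two price columns.
x_mean = sum(x) / n
y_mean = sum(y) / n

# Steps 2 and 3: differences from the averages and their products.
rows = []
for xi, yi in zip(x, y):
    x_diff = xi - x_mean
    y_diff = yi - y_mean
    rows.append((xi, yi, x_diff, y_diff, x_diff * y_diff))

# Step 4: sum the product column and divide by the sample size.
covariance = sum(row[4] for row in rows) / n

print(f"{'xi':>6} {'yi':>6} {'xdiff':>6} {'ydiff':>6} {'product':>8}")
for xi, yi, x_diff, y_diff, product in rows:
    print(f"{xi:>6.2f} {yi:>6.2f} {x_diff:>6.2f} {y_diff:>6.2f} {product:>8.2f}")
print(f"covariance = {covariance:.4f}")
```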
How to interpret the covariance?
Note that the covariance value does not have a particular significance by itself, although there are some essential comments we can still make.
If the covariance is positive, the observations of the two samples tend to exhibit similar behavior with respect to their averages. Either they are both higher than their corresponding means, or they are both lower. So, the way they vary from their averages is similar.
In the case of negative covariance, the samples generally behave in opposite ways. When one observation is lower than its sample's average, the corresponding observation of the other sample is higher than its average, and vice versa. But we can say the samples are still related, although in a different way than in the case of positive covariance.
Also, the further the covariance is from zero, the more related the samples are, while a covariance value closer to zero suggests there isn't a strong relationship between sample variations.
[Figure: covariance values: negative, zero, positive]
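As a rough illustration of the three cases, the sketch below computes the divide-by-n covariance for three made-up pairs of samples. The helper name `population_covariance` and all values are assumptions chosen so that the sign of each result is easy to predict.

```python
def population_covariance(x, y):
    """Covariance of two equally sized samples, dividing by n."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    return sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / n


x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises whenever x rises
y_down = [10, 8, 6, 4, 2]  # falls whenever x rises
y_flat = [3, 1, 4, 1, 3]   # no consistent direction relative to x

cov_up = population_covariance(x, y_up)      # positive
cov_down = population_covariance(x, y_down)  # negative
cov_flat = population_covariance(x, y_flat)  # approximately zero

print(cov_up, cov_down, cov_flat)
```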
What does this mean for John's diversification strategy? He will probably be better off if he buys stocks whose prices have a covariance close to zero when compared to the stocks he already has. Why? Because then he'll know that the second stock is less likely to vary simultaneously with the first one.
However, to make the right decision, he still needs to calculate the covariance of the observed closing prices for the stocks of "Cool Places" and "Time Travel Vacations".
Use the sample covariance calculator on the following data, where
xi still represents "Cool Places", but this time
yi is the "Time Travel Vacations" closing prices:
Now we get a covariance equal to
0.0018, which is much closer to zero than the one for "Cool Places" and "Star Dust". John concludes that he should buy the "Time Travel Vacations" stocks in order to diversify his portfolio effectively.
How to estimate the population covariance from samples?
In practical situations, the sample to which we have access represents some larger population. Thankfully, even from limited samples, we can estimate the covariance in the whole population via the following formula:
Covsample(x, y) = sum((xi - xmean) * (yi - ymean)) / (n - 1).
The name here comes from the fact that we regard our two samples as parts of larger populations. Based on the samples, we want to describe what happens in the bigger world.
You can see that the denominator here is n-1, which gives a result somewhat larger in magnitude than the original covariance formula. We intuitively expect this, as limited samples generally don't reflect the total variability between entire populations, and it can be shown rigorously that the denominator n-1 is the right corrective factor.
The relationship between sample and population covariances is given by the formula:
Covsample(x, y) = (n / (n - 1)) * Covpop(x, y).
But bear in mind that, as the size of the samples gets bigger, the difference between n and n-1 gets smaller. So the original divide-by-n covariance formula and the population covariance estimate give similar results for large samples.
You can check what the results for John's stocks would be if we had used the estimation formula instead of the original formula.
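A short sketch can make the n versus n-1 relationship concrete. The function name `covariances` and the sample values below are illustrative assumptions; the function returns both the divide-by-n result and the divide-by-(n-1) estimate, so the factor n / (n - 1) between them can be checked directly.

```python
def covariances(x, y):
    """Return (divide-by-n covariance, divide-by-(n-1) estimate)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal size, at least 2")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    total = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    return total / n, total / (n - 1)


# Made-up samples of size n = 3.
x = [1, 2, 3]
y = [2, 4, 6]
n = len(x)
cov_pop, cov_sample = covariances(x, y)
print(cov_pop, cov_sample, n / (n - 1) * cov_pop)

# The correction factor n / (n - 1) approaches 1 as n grows.
factors = {size: size / (size - 1) for size in (5, 50, 5000)}
print(factors)
```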
Variance and correlation vs covariance
Covariance is a measure of variability between two random variables X and Y, while variance measures how much a particular random variable varies by itself. Here is the relationship between these two notions:
Var(X) = Cov(X, X).
So, the variance of X is precisely the covariance between X and itself!
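This identity is easy to check numerically. The sketch below compares a hand-written divide-by-n covariance of a sample with itself against `statistics.pvariance` from the Python standard library; the data values and the helper name `population_covariance` are made up for illustration.

```python
import statistics


def population_covariance(x, y):
    """Covariance of two equally sized samples, dividing by n."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    return sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / n


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values

cov_xx = population_covariance(data, data)
variance = statistics.pvariance(data)  # population variance, divides by n

print(cov_xx, variance)
```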
Another way to express the variability between two random variables is by the correlation between random variables X and Y. The relation between correlation and covariance is:
Corr(X, Y) = Cov(X, Y) / (σX * σY),
where σX and σY are the standard deviations of the random variables X and Y.
You can think of correlation as the normalized version of covariance: the above formula guarantees that correlation must be somewhere between the numbers -1 and 1. This property is the reason why correlation is more often and more readily used than covariance, although they do similar jobs.
Remember that in the example with John, we had the problem of interpreting how large or small the covariance is? Well, with correlation, you wouldn't have that issue.
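To illustrate why the normalization helps, the sketch below computes covariance and correlation for the same made-up prices expressed first in dollars and then in cents. The data and function names are assumptions for illustration. The covariance changes with the unit, while the correlation does not.

```python
import math


def population_covariance(x, y):
    """Covariance of two equally sized samples, dividing by n."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    return sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / n


def correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    sigma_x = math.sqrt(population_covariance(x, x))
    sigma_y = math.sqrt(population_covariance(y, y))
    return population_covariance(x, y) / (sigma_x * sigma_y)


# Hypothetical prices in dollars, then the same prices in cents.
x = [10.0, 12.0, 11.0, 13.0, 14.0]
y = [5.0, 6.0, 5.5, 7.0, 6.5]
x_cents = [100 * v for v in x]
y_cents = [100 * v for v in y]

cov_dollars = population_covariance(x, y)
cov_cents = population_covariance(x_cents, y_cents)
corr_dollars = correlation(x, y)
corr_cents = correlation(x_cents, y_cents)

print(cov_dollars, cov_cents)    # differ by a factor of 100 * 100
print(corr_dollars, corr_cents)  # the same value, between -1 and 1
```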