Get the math and the application in analytics of both terms.
Covariance and correlation are two terms used extensively in statistics and probability theory. Most articles and literature on probability and statistics presuppose a basic understanding of terms such as mean, standard deviation, correlation, sample size and covariance. Let us demystify a couple of these terms today so that we can proceed with the rest. The purpose of this article is to define correlation and covariance matrices, differentiate between the two, and understand their application in the field of analytics and datasets.
Simply put, both terms measure the relationship and the dependency between two variables. Covariance indicates the direction of the linear relationship between the variables. Correlation, on the other hand, measures both the strength and the direction of the linear relationship between two variables. Correlation is a function of covariance. What distinguishes them is the fact that correlation values are standardized, whereas covariance values are not. One can obtain the correlation coefficient of two variables by dividing the covariance of these variables by the product of their standard deviations. Recall that standard deviation essentially measures the absolute variability of a data set's distribution. Dividing the covariance by the product of the standard deviations scales the value down to a limited range of -1 to +1, which is precisely the range of the correlation values.
Mathematical definition of terms
Now let’s see the mathematical definitions of these terms.
The covariance of two variables (x and y) can be represented as cov(x, y). If E[x] is the expected value or the average of the sample 'x', then cov(x, y) can be represented as follows:

cov(x, y) = E[(x − E[x])(y − E[y])]

Using the sample means x̄ and ȳ in place of the expected values, the expression can be written in the following way:

cov(x, y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1) … (B)

Here 's²', or the sample variance, is basically the covariance of a variable with itself. It can also be defined in the following manner:

s² = cov(x, x) = Σ (xᵢ − x̄)² / (n − 1) … (A)

In the formulas above, the numerator of equation (A) is called the sum of squared deviations; in equation (B), with two variables x and y, it is called the sum of cross products. n is the number of samples in the data set, and (n − 1) indicates the degrees of freedom.
To explain what degrees of freedom are, let's take an example. In a set of 3 numbers whose average is 10, if two of the three numbers are 5 and 15, there is only one possible value the third number can assume, namely 10. The same holds for any set of 3 numbers with that average, for example 12, 8 and 10, or 9, 10 and 11: once any two values are given, the third is fixed. You can freely change two of the values, and the third value fixes itself; the degrees of freedom here are 2. Essentially, the degrees of freedom are the number of independent data points that went into calculating the estimate. As this example shows, it is not necessarily equal to the number of items in the sample (n).
The correlation coefficient is also known as the Pearson product-moment correlation coefficient, or simply the Pearson correlation coefficient. As mentioned above, it is obtained by dividing the covariance of the two variables by the product of their standard deviations. Mathematically, with sx and sy denoting the standard deviations of x and y, it can be represented as follows:

r = cov(x, y) / (sx · sy)
The values of the correlation coefficient can range from -1 to +1. The closer it is to +1 or -1, the more strongly correlated the two variables are. The sign indicates the direction of the correlation: a positive sign means that if one of the two variables increases, the other variable is expected to increase as well.
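The two definitions above can be checked numerically. The sketch below, using NumPy and a small made-up sample, computes the sample covariance as the sum of cross products over (n − 1), then scales it by the standard deviations to get the Pearson correlation, and compares both against NumPy's built-ins:

```python
import numpy as np

# Toy sample of two positively related variables (illustrative data)
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 11.0])
n = len(x)

# Sample covariance: sum of cross products divided by (n - 1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Pearson correlation: covariance scaled by the product of standard deviations
r_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(cov_xy)  # matches np.cov(x, y)[0, 1]
print(r_xy)    # matches np.corrcoef(x, y)[0, 1]
```

Note the use of `ddof=1` so the standard deviations are also computed with (n − 1) degrees of freedom, consistent with the sample covariance.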
Representation of the Covariance and Correlation Data Matrix
A data matrix X, with n rows and p columns, can be represented as follows:
A vector 'xj' denotes the (n × 1) vector extracted from the j-th column of X, where j belongs to the set (1, 2, …, p). Similarly, 'xi' represents the (1 × p) vector from the i-th row of X, where i can take a value from the set (1, 2, …, n). You can also interpret X as an array of variables, where 'xij' is the j-th variable (column) collected from the i-th item (row). For ease of reference, we call the rows items/subjects and the columns variables. Let's now look at the mean of a column in the data matrix above:
Using the same concept, let us now define the row mean: it is basically the average of the elements present in the specified row.
Now that we have the above quantities, it is easier to define the covariance matrix (S):
In the matrix above, we see that the size of the covariance matrix is p × p. It is a symmetric matrix, i.e. a square matrix that is equal to its transpose (Sᵀ). The diagonal entries of the covariance matrix are the variances of the individual variables, while the off-diagonal entries are the covariances between pairs of variables. The covariance of the j-th variable with the k-th variable is equal to the covariance of the k-th variable with the j-th variable, i.e. 'sjk' = 'skj'.
The covariance matrix can be computed from the data matrix in the following way: here, 'Xc' is the centered matrix, obtained by subtracting the respective column mean from each element. Using this centered matrix, the covariance matrix 'S' is the product of the transpose of 'Xc' with 'Xc' itself, divided by the number of items or rows in the data matrix ('n', or n − 1 for the unbiased sample estimate defined earlier).
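As a concrete sketch of this construction, the snippet below builds the covariance matrix from a small made-up n × p data matrix by centering the columns and forming Xcᵀ Xc / (n − 1), then confirms it against NumPy's `np.cov`:

```python
import numpy as np

# A small n x p data matrix: 4 items (rows), 3 variables (columns); toy values
X = np.array([[4.0, 2.0, 0.60],
              [4.2, 2.1, 0.59],
              [3.9, 2.0, 0.58],
              [4.3, 2.1, 0.62]])
n = X.shape[0]

# Center the matrix: subtract each column's mean from its elements
Xc = X - X.mean(axis=0)

# Sample covariance matrix: S = Xc' Xc / (n - 1), a symmetric p x p matrix
S = Xc.T @ Xc / (n - 1)
print(S)
```

`rowvar=False` in `np.cov(X, rowvar=False)` treats the columns as variables, matching the item-rows/variable-columns convention used here.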
Before going any further, let us review the concept of sample variance, or s-squared (s²). From this value we can derive the standard deviation of the data set: the value 's', the square root of the sample variance, is the standard deviation. It basically indicates how dispersed or spread out the data is around its mean.
Likewise, using the same data matrix and covariance matrix, we define the correlation matrix (R):
As we see here, the size of the correlation matrix is again p × p. Looking at the individual elements of the correlation matrix, the main diagonal consists entirely of 1s: the correlation of a variable with itself is 1, the highest possible value, which is logical and intuitive. The other elements, 'rjk', are Pearson's correlation coefficients between the pairs of variables 'xj' and 'xk'. As we saw before, 'xj' denotes the j-th column of the data matrix X. Moving on to how the correlation matrix can be obtained from the data matrix:
'Xs' in the above definition is called the scaled, or standardized, matrix. Here we see that the correlation matrix can be defined as the product of the transpose of the scaled matrix with itself, divided by 'n'. Revisiting the definition of standard deviation from above: each element of the standardized matrix 'Xs' is first centered (as in the covariance matrix above) and then divided by the corresponding column's standard deviation. This reinforces our understanding that the correlation matrix is a standardized, or scaled, derivative of the covariance matrix.
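The same construction can be sketched in code: standardize the columns of a toy data matrix, form Xsᵀ Xs / (n − 1), and check that the result has 1s on the diagonal and agrees with NumPy's `np.corrcoef`:

```python
import numpy as np

# The same kind of small n x p data matrix (toy values)
X = np.array([[4.0, 2.0, 0.60],
              [4.2, 2.1, 0.59],
              [3.9, 2.0, 0.58],
              [4.3, 2.1, 0.62]])
n = X.shape[0]

# Standardize: center each column, then divide by its standard deviation
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Correlation matrix: R = Xs' Xs / (n - 1)
R = Xs.T @ Xs / (n - 1)
print(R)
```

Because each column of 'Xs' has unit variance, the diagonal of R is exactly 1, and the off-diagonal entries are the pairwise Pearson correlations.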
Covariance versus Correlation
Covariance takes its units from the product of the units of the two variables. Correlation, on the other hand, is dimensionless: it is a unit-free measure of the relationship between the variables, because the covariance is divided by the product of the standard deviations, which carry those same units. The value of the covariance is affected by a change in the scale of the variables: if all the values of one variable are multiplied by a constant, and all the values of the other variable are multiplied by the same or a different constant, the covariance changes too. The correlation, however, is not affected by such a change of scale. Another difference between covariance and correlation is the range of values they can assume: correlation coefficients lie between -1 and +1, but covariance can take any value between -∞ and +∞.
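This scale behavior is easy to demonstrate. In the sketch below (toy data, positive scaling constants), rescaling the two variables by 100 and 5 multiplies the covariance by 500 but leaves the correlation untouched:

```python
import numpy as np

# Toy data for two variables
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])

# Rescale both variables by positive constants (a change of units)
x_scaled, y_scaled = 100 * x, 5 * y

# Covariance changes with the scale of the variables...
print(np.cov(x, y)[0, 1], np.cov(x_scaled, y_scaled)[0, 1])

# ...while the correlation stays exactly the same
print(np.corrcoef(x, y)[0, 1], np.corrcoef(x_scaled, y_scaled)[0, 1])
```

(With a negative scaling constant the correlation would keep its magnitude but flip its sign, since the direction of the relationship reverses.)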
Application in Analytics
So now that we are done with the mathematical theory, let's explore how and where it can be applied in the field of data analytics. Correlation analysis, as many analysts know, is a vital tool for feature selection and multivariate analysis in data preprocessing and exploration. Correlation helps us investigate and establish relationships between variables. It is employed in feature selection before any kind of statistical modeling or data analysis.
PCA, or principal component analysis, is one significant application. So how do we decide which to use: the correlation matrix or the covariance matrix? In simple terms, it is advisable to use the covariance matrix when the variables are on similar scales and the correlation matrix when the variables are on different scales.
Now let's try to understand this with the help of examples. To help you with the implementation, where necessary, I will cover the examples in both R and Python. Let's start with the first example, where we see how PCA results differ when computed with the correlation matrix versus the covariance matrix. For this first example, we will consider the 'mtcars' dataset in R.
# Loading dataset in local R environment
data(mtcars)
# Print the first 10 rows of the dataset
head(mtcars, 10)
From the output above, we see that all columns are numerical and therefore we can proceed with the analysis. We will use the prcomp() function from the 'stats' package for this.