Applications of a correlation matrix
There are three broad reasons for computing a correlation matrix:
To summarize a large amount of data where the goal is to see patterns. In our example above, the observable pattern is that all the variables highly correlate with each other.
To input into other analyses. For example, people commonly use correlation matrixes as inputs for exploratory factor analysis, confirmatory factor analysis, structural equation models, and linear regression when excluding missing values pairwise.
As a diagnostic when checking other analyses. For example, in linear regression, high correlations among the predictor variables suggest that the regression’s coefficient estimates will be unreliable.
Most correlation matrixes use Pearson’s Product-Moment Correlation (r). It is also common to use Spearman’s Correlation and Kendall’s Tau-b. Both of these are non-parametric correlations and less susceptible to outliers than r.
Coding of the variables
If your data come from a survey, you’ll need to decide how to code the responses before computing the correlations. For example, if respondents were given choices of Strongly Disagree, Somewhat Disagree, Neither Agree nor Disagree, Somewhat Agree, and Strongly Agree, you could assign codes of 1, 2, 3, 4, and 5, respectively (or, mathematically equivalent from the perspective of correlation, scores of -2, -1, 0, 1, and 2). However, other codings are possible, such as -4, -1, 0, 1, 4. Changing the coding tends to have little effect on the correlations unless the new codes are extreme.
Treatment of missing values
The data that we use to compute correlations often contain missing values, either because the data were never collected or because the responses are unknown. Various strategies exist for dealing with missing values when computing correlation matrixes. Best practice is usually multiple imputation. However, people more commonly use pairwise deletion (not to be confused with partial correlation, which is a different statistic). This computes each correlation using all the observations that have non-missing values for the two variables involved. Alternatively, some use listwise deletion, also known as case-wise deletion, which uses only observations with no missing data on any variable. Both pairwise and listwise deletion assume that the data are missing completely at random, which is why multiple imputation is generally the preferable option.
Presentation
When presenting a correlation matrix, you’ll need to consider various options including:
Whether to show the whole matrix, as above, or just the non-redundant part, as below (arguably the 1.00 values on the main diagonal should also be removed).
How to format the numbers (for example, best practice is to drop the leading zeros before the decimal point and decimal-align the numbers, as above, but this can be difficult to do in most software).
Whether to show statistical significance (e.g., by color-coding cells red).
Whether to color-code the values according to the correlation statistics (as shown below).
Whether to rearrange the rows and columns to make patterns clearer.
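Two of these formatting choices, dropping the leading zeros and showing only the non-redundant part of the matrix, are easy to sketch in plain text. The functions `format_r` and `lower_triangle`, the column width, and the example matrix below are all hypothetical; this is one possible layout rather than a standard one.

```python
def format_r(value, decimals=2):
    """Format a correlation without the leading zero, e.g. 0.5 -> '.50'."""
    text = f"{value:.{decimals}f}"  # e.g. '0.50', '-0.25', '1.00'
    if text.startswith("0."):
        return text[1:]
    if text.startswith("-0."):
        return "-" + text[2:]
    return text


def lower_triangle(names, matrix, show_diagonal=False):
    """Render only the lower triangle as right-aligned text rows."""
    label_width = max(len(name) for name in names)
    width = max(6, label_width + 1)
    # Header row: blank corner followed by right-aligned column names.
    lines = [" " * label_width + "".join(f"{name:>{width}}" for name in names)]
    for i, row_name in enumerate(names):
        cells = []
        for j in range(len(names)):
            visible = j < i or (show_diagonal and j == i)
            text = format_r(matrix[i][j]) if visible else ""
            cells.append(f"{text:>{width}}")
        # A fixed number of decimals plus right alignment lines up the decimal points.
        lines.append(f"{row_name:<{label_width}}" + "".join(cells).rstrip())
    return "\n".join(lines)


# Made-up correlation matrix for three variables.
names = ["x1", "x2", "x3"]
matrix = [
    [1.00, 0.82, -0.05],
    [0.82, 1.00, 0.40],
    [-0.05, 0.40, 1.00],
]

table = lower_triangle(names, matrix)
print(table)
```

Passing `show_diagonal=True` keeps the 1.00 values on the main diagonal for readers who prefer to see them.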