Principal component analysis and singular value decomposition are among the two common concepts of linear algebra in machine learning. After collecting raw data, is it possible to discover the structure? For instance, when we consider the interest rates of the previous week, is there any way to figure out trends in the market?
These questions grow more complicated when we have large amounts of data. Finding the answer to these questions is similar to searching for a needle in the haystack. Thus, we utilize the singular value decomposition to untangle and extract the complicated information. This article will help you understand the concept of singular value decomposition in detail.
Singular Value Decomposition
This multivariate statistical technique helps solve complex problems in atmospheric sciences. Empirical orthogonal function analysis and principal component analysis are similar sets of procedures for the same technique introduced in 1956 by Edward Lorenz.
The singular value decomposition helps reduce datasets containing a large number of values. Furthermore, this method is also helpful to generate significant solutions for fewer values. However, these fewer values also comprise immense variability available in the original data.
Data reveals large spatial correlations in the geophysical and atmospheric sciences. A Singular Value Decomposition analysis supports and yields results for a more compact demonstration of these correlations. By using multivariate datasets, you can produce insights into temporal and spatial variations. These variations exhibit data after the analysis.
Even though there are fewer limitations to the technique, you should understand these before computing the Singular Value Decomposition of the datasets. First, there should be anomalies in the data that the first structure will capture. If you are analyzing the data to find spatial correlations independent of trends, you should de-trend the data before applying it to the analysis.
Singular vectors & Singular Values
The matrix AAᵀ and AᵀA in linear algebra are very special. By multiplying the Aᵀ with the matrix after considering them × n matrix A, we can form AAᵀ and AᵀA individually. The matrices include:
- Square
- Symmetrical
- Same matrices with both positive eigenvalues
- Positive semidefinite, and
- Same r as A with both rank
A major property of symmetric matrices is that they are symmetric, and we choose eigenvectors to be orthonormal. We use these covariance matrices in machine learning a lot.
Example of Singular Value Decomposition
To understand the concept, let’s suppose the matrix m × n, A, collects the training data set. These sets of data will take the row for each training vector. Here, N indicates that the dimension of each vector will be very large.
By feeding the A in a clustering algorithm, you will generate a fixed number of cluster centers as the output. Since ‘n’ is quite large, the algorithm will be unstable or take too long. So, we will utilize singular value decomposition to reduce the number of variables. We will use a transparent method for computation, considering that we are still solving the problem with un-transformed coordinates.
Step 1: Reading in the data
We can start to read the data by filling up A. So let’s begin the tutorial in C language:
//subroutine header for performing cluster analysis:
#include “cluster.h”//maximum number of clusters:
#define MAX_CLUSTER 10int main(intargc, char **argv) {
char *infile; //input file
char *outfile; //output file
FILE *fs; //file pointer
double **a; //matrix of training data/U
int m; //number of rows in matrix
int n; //number of columns in matrix
intnsv; //number of singular values if (argc!=4) {
printf(“syntax: cluster_svdnsv train centers\n”);
printf(” where:\n”);
printf(“nsv = number of singular values (0 = use untransformed data)\n”);
printf(“infile = ASCII input file containing training data\n”);
printf(“output = ASCII output file containing cluster centers\n”);
printf(“\n”);
printf(“file format:\n”);
printf(“- one line header containing number of rows and number of columns\n”);
printf(“- row major list of each matrix element\n”);
exit(1);
} if (sscanf(argv[1], “%d”, &nsv)!=1) {
fprintf(stderr, “Error parsing first command line argument\n”);
exit(1);
}
infile=argv[2];
outfile=argv[3]; fs=fopen(infile, “r”);
if (fs==NULL) {
fprintf(stderr, “Error opening input file, %s\n”, infile);
exit(1);
} if (fscanf(fs, “%d %d”, &m, &n)!=2) {
fprintf(stderr, “Format error in input file: %s line 0”, infile);
exit(1);
}
if (nsv>n) {
fprintf(stderr, “Command line parameter nsv=%d out of range\n”, nsv);
exit(1);
} a=new double *[m];
a[0]=new double[m*n];
for (inti=1; i<m; i++) a[i]=a[0]+i*n;
for (inti=0; i<m; i++) {
for (int j=0; j<n; j++) {
if (fscanf(fs, “%lg”, a[i]+j)!=1) {
fprintf(stderr, “Format error in input file, %s, line %d\n”, infile, i);
exit(1);
}
}
} fclose(fs);
Step 2: Performing SVD
Now we will use the artificial singular value decomposition routine that the header file, svd.h contains:
#ifndef SVD_H
#define SVD_H//subroutine for singular value decomposition:
int //returns an error code (0 for success)
svd (double **a, //input matrix–replaced by U on output
int m, //number of rows
int n, //number of columns
double *s, //singular values
double **vt); //V–right singular vectors#endif
Singular value decomposition routines are complex in regards to the type of matrix and vector used. However, you can easily summarize the complete coding through wrapper function. Routine will be equally straightforward when you rely on a singular value decomposition routine. We will add the following codes after of the previous section:
double *ave;
double *s; //singular values
double **vt; //right singular vectors //first we calculate and remove the arithmetic means:
ave=new double[n];
for (inti=0; i<n; i++) ave[i]=0;
for (inti=0; i<m; i++) {
for (int j=0; j<n; j++) {
ave[j]+=a[i][j];
}
}
for (inti=0; i<n; i++) ave[i]/=m;
for (inti=0; i<m; i++) {
for (int j=0; j<n; j++) {
a[i][j]-=ave[j];
}
} if (nsv>0) {
//make space for singular values:
s=new double[n]; //make space for right singular vectors:
vt=new double *[n];
vt[0]=new double[n*n];
for (inti=1; i<n; i++) vt[i]=vt[0]+i*n; //perform the decomposition:
int err=svd(a, m, n, s, vt);
if (err!=0) {
fprintf(stderr, “Error in svd subroutine\n”);
exit(err);
}
}
Step 3: Executing the Cluster Analysis
The clustering algorithm process will generate a set of c cluster centers using {ᵢ ; i ∈ [1, c]}:
#ifndef CLUSTER_H
#define CLUSTER_Hint //returns number of cluster centers
cluster (double ** x, //training vectors
int m, //number of training vectors
int n, //dimension of each vector
intmax_nc, //maximum number of cluster centers
double **mu); //returned cluster centers#endif
We will continue from the above section and generate the cluster centers:
double **mu_p; //matrix of cluster centers
intnc; //number of cluster centers //make space for cluster centers:
mu_p=new double *[MAX_CLUSTER];
mu_p[0]=new double[MAX_CLUSTER*n];
for (inti=1; i<MAX_CLUSTER; i++) mu_p[i]=mu_p[0]+i*n; if (nsv>0) {
//make space for cluster centers:
nc=cluster(a, m, nsv, MAX_CLUSTER, mu_p);
} else {
//make space for cluster centers:
nc=cluster(a, m, n, MAX_CLUSTER, mu_p);
} if (nc<= 0) {
fprintf(stderr, “Cluster algorithm failed”);
exit(-1);
}
As we will use the transformed training data for clustering algorithm, the transformed system will include the cluster centers.
Step 4: Storing the Results
Now you can store the cluster centers by using the following equation in the un-transformed coordinate system.
P in this equation signifies the number of coordinates.
double **mu; //cluster centers in un-transformed coords //allocate space for the un-transformed cluster centers:
mu=new double *[nc];
mu[0]=new double[nc*n];
for (inti=1; i<nc; i++) mu[i]=mu[0]+i*n; //perform the coordinate transformation:
for (inti=0; i<nc; i++) {
for (int j=0; j<n; j++) {
mu[i][j]=ave[j];
if (nsv>0) {
for (int k=0; k<nsv; k++) mu[i][j]+=vt[k][j]*s[k]*mu_p[i][k];
} else {
mu[i][j]+=mu_p[i][j];
}
}
} //write the results to a file:
fs=fopen(outfile, “w”);
if (fs==NULL) {
fprintf(stderr, “Error opening output file, %s\n”, outfile);
exit(1);
} fprintf(fs, “%d %d\n”, nc, n);
for (inti=0; i<nc; i++) {
for (int j=0; j<n; j++) fprintf(fs, “%16.8lg”, mu[i][j]);
fprintf(fs, “\n”);
} fclose(fs); //clean up:
delete [] mu[0];
delete [] mu; delete [] mu_p[0];
delete [] mu_p; delete [] ave;
delete [] a[0];
delete [] a;
if (nsv>0) {
delete [] s;
delete [] vt[0];
delete [] vt;
} return 0;
}
Conclusion
In this article, we explained the definition of the singular value decomposition and helped you understand the construction of the model in C-language. You can utilize this method for recovering atmospheric variables according to satellite measurements. You can also use this technique for interpolating sparse measurements or for a machine learning algorithm. This technique helps with regression and classification of the dataset.