By Will Lienert
When you are dealing with lots of data in lots of dimensions, it can be quite difficult to understand it in an intuitive way while staying sane. One solution, you might think, is to look at a single 2D slice of the vector space. But which slice do you choose? Surely there is a slice that retains the most information about the data. This is where Principal Component Analysis (PCA) comes in.
PCA is a dimensionality reduction technique used to reduce the number of features/dimensions in a dataset. It is a linear transformation that projects the data from a high-dimensional space into a lower-dimensional one. The goal of PCA is to find the directions along which the data varies the most. These directions are called the principal components.
The principal components are orthogonal to each other and are found by calculating the eigenvectors of the covariance matrix. The eigenvectors are sorted by the amount of variance they capture, with the first eigenvector pointing in the direction of greatest variance within the original vector space. The second eigenvector is perpendicular to the first, and together they span the plane that captures the most variance of any 2D plane in the original space. This can continue for as many dimensions as you want, but in most cases we only need two, so that the data can be visualised in a 2D plot.
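In symbols (this is just the standard formulation, nothing specific to the code below): if \(\tilde{X}\) is the \(n \times d\) matrix of standardised data, the covariance matrix and the eigenvector equation are

\[ C = \frac{1}{n-1} \tilde{X}^\top \tilde{X}, \qquad C v_i = \lambda_i v_i, \]

where each eigenvector \(v_i\) is a principal component and its eigenvalue \(\lambda_i\) measures the variance captured along that direction.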

I am using this example dataset with 6 samples and 5 dimensions:
import numpy as np

# Example dataset: 6 samples, 5 features
X = np.array([
    [ 1.,  2., -1.,  4.,  10.],
    [ 3., -3., -3., 12., -15.],
    [ 2.,  1., -2.,  4.,   5.],
    [ 5.,  1., -5., 10.,   5.],
    [ 2.,  3., -3.,  5.,  12.],
    [ 4.,  0., -3., 16.,   2.]
])
Here is the code to standardise the data and calculate the covariance matrix:
# Standardise each feature to zero mean and unit standard deviation
mean_X = np.mean(X, axis=0)
std_X = np.std(X, axis=0)
std_X[std_X == 0] = 1  # avoid division by zero for constant features
standardised_X = (X - mean_X) / std_X

# Sample covariance matrix of the standardised data
covariance_matrix = np.cov(standardised_X, ddof=1, rowvar=False)
The covariance matrix is:
array([
[ 1.2 , -0.42098785, -1.0835838 , 0.90219291, -0.37000528],
[-0.42098785, 1.2 , 0.20397003, -0.77149364, 1.18751836],
[-1.0835838 , 0.20397003, 1.2 , -0.59947269, 0.22208218],
[ 0.90219291, -0.77149364, -0.59947269, 1.2 , -0.70017993],
[-0.37000528, 1.18751836, 0.22208218, -0.70017993, 1.2 ]
])
Here is the code to calculate the eigenvalues and eigenvectors of the covariance matrix (the eigenvectors are the principal components). The data can then be projected onto the first k principal components by matrix multiplying the standardised data by the corresponding eigenvectors:
# Eigendecomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)
# Sort the components by eigenvalue, largest (most variance) first
sorted_order = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[sorted_order]
sorted_eigenvectors = eigenvectors[:, sorted_order]
# Keep the first k components and project the data onto them
k = 2
reduced_data = np.matmul(standardised_X, sorted_eigenvectors[:, :k])
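Since the eigenvalues measure how much variance each component captures, a quick sanity check is to look at what fraction of the total variance the kept components account for:
explained_variance_ratio = sorted_eigenvalues / np.sum(sorted_eigenvalues)
print(explained_variance_ratio[:k])  # fraction of total variance captured by each kept component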
Here is the reduced data plotted in a 2D plot:

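If you want to recreate a plot like this yourself, a minimal matplotlib sketch (assuming k = 2 as above) looks like:
import matplotlib.pyplot as plt

# Scatter the samples in the plane spanned by the first two principal components
plt.scatter(reduced_data[:, 0], reduced_data[:, 1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Data projected onto the first two principal components")
plt.show()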