Multivariate Exploratory Data Analysis

November 15, 2020

In the “Age of Data”, understanding the data retrieved in a research and/or development project is a fundamental step in any rigorous investigation. Do you have thousands of variables or observations? Maybe millions or more? With missing data? Time series? Multiple sources? Looking at data is something you need to do in order to understand the behavior of the phenomenon under investigation.

In this second entry of the blog, I will illustrate how Multivariate Exploratory Data Analysis (MEDA for short) can be a powerful approach for data exploration. My goal is that readers understand what MEDA is and how it can be used.

Do you want to know more about MEDA?

So, what is MEDA anyway? What I call MEDA is the use of linear and simple multivariate models to visualize and interact with data, with the final goal of finding patterns that we can interpret. Basically, we ask the data “show me what you’ve got”. Thus, if you are working on a project where you can gather data, maybe this is an interesting approach for you to understand your problem better. We mainly use Principal Component Analysis (PCA) as the model in MEDA, but sometimes we use more complex approaches.

For simplicity, let’s think of a data matrix with N rows and M columns. The rows represent the individuals, items, objects or observations: what you call them depends on what your data contains and on your own jargon. They are generally the elements you would like to compare, to understand their differences and commonalities. For instance, an astronomer may have a data set where observations correspond to different stars, and a geneticist may study a set of individuals (e.g. people) in order to understand genetic differences. The columns of the data matrix represent the variables that are measured per observation. Coming back to the previous examples, an astronomer may measure the light coming from each star at different wavelengths. Thus, the astronomer’s data matrix contains N stars (in the rows) whose light is measured at M different wavelengths (in the columns). A geneticist may encode the genome of each individual in the columns. Therefore, the data contains N individuals (in the rows) with the counts of M genes (in the columns).

The motivation of a researcher to gather data can also vary. For instance, the astronomer may want to find clusters of stars that behave similarly, or single stars that behave oddly. Her interest lies in the rows of the matrix: the stars. The geneticist may aim to find genes that are related to a disease or an evolutionary advantage, and he gathers data from individuals with and without a disease in order to compare their genes. His interest therefore lies in the columns of the data set: the genes.

When you have a matrix with many observations and many variables, and you want to take a look at the data, how do you do this? That’s actually very tricky. Let’s say that the astronomer has 100 stars, measured at 200 wavelengths (these are, btw, ridiculously low numbers for an astronomy data set). Then she can plot all stars as a function of the wavelengths in order to visualize the data:

Astronomer rows

Or she can do it the other way round and plot the wavelengths as a function of the stars:

Astronomer columns

For anyone with a little experience in data analysis, it’s obvious that I used fake, random data, not real star measurements. Still, this data serves to make my point: we cannot see, let alone understand, anything from these plots. Therefore, we need a better strategy to ‘Look at data’. For that we use MEDA and, in the simplest case, PCA.

PCA is a decomposition (or data factorization) of the data matrix. Let’s call X the matrix of data. We decompose X into three new matrices, and this helps us visualize and understand the patterns hidden in the data, in particular in the rows (among stars), the columns (among wavelengths), and among rows and columns (e.g. which wavelengths make a star shine in a particular way).

Let’s see how PCA works. The PCA equation is: X = T · P’ + E. The input of PCA is the data matrix X, and the outputs are the matrices T, P and E. T is called the score matrix, and its elements are called the scores. It is useful to find patterns in the observations (rows) of X. P is called the matrix of loadings, and is equally useful to find patterns in the variables (columns). E contains the residuals and, depending on our goal, it can be of little or of much interest. The following figure illustrates what the data factorization looks like:

PCA matrix decomposition

PCA gives the best approximation of X, in the least-squares sense, for a given number of columns in T and P. These columns are called the Principal Components (PCs), and they are automatically computed from X, using algorithms like the singular value decomposition (SVD). In the case of the previous figure we have two PCs: two columns in T and P (note that P is transposed). It is often the case that we can accurately approximate the data in X with only a few of these PCs. When this is the case, we can use this advantage to visualize the data.
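To make the factorization X = T · P’ + E concrete, here is a minimal sketch in Python with NumPy. It is not the code behind the figures in this post: the function name pca_svd and the random 100 x 200 matrix (mimicking the fake astronomer data) are my own assumptions, and the data is mean-centered before the SVD, as is usual in PCA.

```python
import numpy as np


def pca_svd(X, n_pcs):
    """Hypothetical helper: factorize X as T @ P.T + E using the SVD.

    X is assumed to be preprocessed already (at least mean-centered).
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :n_pcs] * s[:n_pcs]  # scores: one row per observation
    P = Vt[:n_pcs].T              # loadings: one row per variable
    E = X - T @ P.T               # residuals: what the PCs leave out
    return T, P, E


# Fake data: 100 "stars" measured at 200 "wavelengths".
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))
X = X - X.mean(axis=0)  # mean-center each column

T, P, E = pca_svd(X, n_pcs=2)
print(T.shape, P.shape, E.shape)  # (100, 2) (200, 2) (100, 200)
```

With two PCs, T has two columns (one point per star in a score plot) and P has two columns (one point per wavelength in a loading plot). For random data like this, almost everything ends up in E, which is exactly why the line plots above show nothing.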


Let us start with a very simple data set: the Wine data set in the PLS-Toolbox (Newsweek, 127(4), 52, 1/22/1996). I always use this example in my courses, because it shows that even a simple and small data set can be hard to understand if we don’t use the right tools. This example is also used in the PLS-Toolbox tutorial. The data is presented below: the observations correspond to countries, and the variables include alcohol consumption and health information. Clearly, this data set was collected to look for patterns of connection between drinking habits and health variables across the countries:

Country  Liquor  Wine   Beer    LifeEx  HeartD
France   2.5     63.5   40.1    78      61.1
Italy    0.9     58     25.1    78      94.1
Switz    1.7     46     65      78      106.4
Austra   1.2     15.7   102.1   78      173
Brit     1.5     12.2   100     77      199.7
U.S.A.   2       8.9    87.8    76      176
Russia   3.8     2.7    17.1    69      373.6
Czech    1       1.7    140     73      283.7
Japan    2.1     1      55      79      34.7
Mexico   0.8     0.2    50.4    73      36.4

It is often hopeless to look for patterns in the numbers of a matrix. A better idea is to try some visualizations, for instance to plot the countries in terms of the variables, or the other way round:

Wine rows
Wine columns

For convenience, I normalized the data so that the variables have zero mean and unit variance. The first plot shows a country, Russia, with an extreme behavior (very high or low values). Apart from that, very little can be inferred from the previous plots.

Now let’s use MEDA with a PCA model. We decompose the matrix of data into scores and loadings with two PCs, which represent 78% of the variance of the data. The variance is a measure of the patterns of change, and therefore can be associated with the amount of information. This means that the first 2 PCs contain almost 4/5 of the information in my data, and we can display all this information in a pair of plots.
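The variance captured by each PC can be checked with a few lines of Python. This is only a sketch under stated assumptions: the numbers are transcribed from the table above, the variable names are mine, and I normalize to zero mean and unit variance with the sample standard deviation (ddof=1). The choice of ddof rescales all columns by the same constant, so it does not change the fractions of variance. If the table matches the data behind the plots, the cumulative value for 2 PCs should land close to the 78% quoted above.

```python
import numpy as np

# Wine data, transcribed from the table in this post (an assumption:
# the columns are Liquor, Wine, Beer, LifeEx, HeartD, in that order).
countries = ["France", "Italy", "Switz", "Austra", "Brit",
             "U.S.A.", "Russia", "Czech", "Japan", "Mexico"]
X = np.array([
    [2.5, 63.5,  40.1, 78,  61.1],
    [0.9, 58.0,  25.1, 78,  94.1],
    [1.7, 46.0,  65.0, 78, 106.4],
    [1.2, 15.7, 102.1, 78, 173.0],
    [1.5, 12.2, 100.0, 77, 199.7],
    [2.0,  8.9,  87.8, 76, 176.0],
    [3.8,  2.7,  17.1, 69, 373.6],
    [1.0,  1.7, 140.0, 73, 283.7],
    [2.1,  1.0,  55.0, 79,  34.7],
    [0.8,  0.2,  50.4, 73,  36.4],
])

# Normalize each variable to zero mean and unit variance.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The squared singular values are proportional to the variance per PC.
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = s**2 / np.sum(s**2)
cumulative = np.cumsum(explained)

for k, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{k}: {e:.1%} of the variance (cumulative {c:.1%})")
```

The same decomposition gives the coordinates for the plots that follow: the first two columns of U scaled by s are the scores, and the first two rows of Vt are the loadings.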

The first plot below is a score plot, which shows the distribution of the observations: the countries. In this plot we observe two patterns. Russia is far from the rest of the countries, which means that its values differ markedly from those of the other countries. The remaining countries form a quasi-linear pattern from France to the Czech Republic, which shows a gradual change in their values. I included arrows to clarify the patterns.

Wine rows

The loading plot below shows how the variables are distributed in the first 2 PCs. PC 1, on the horizontal axis, mainly models the negative relationship between life expectancy and heart disease. The types of alcohol are spread across the plot, forming an almost perfect equilateral triangle, with wine closer to life expectancy, liquor closer to heart disease, and beer at a neutral point. This suggests that, in these data, a preference for one type of alcohol goes with a reduced consumption of the other two.

Wine columns

Combining both scores and loadings in a biplot, we can infer inter-connections between countries and variables. The trend from France to the Czech Republic shows where countries lie in their preference between wine and beer. Given also that this trend leans slightly towards the right, the pattern suggests that the wine preference correlates in the data with a higher life expectancy. Finally, Russia is separated from the rest due to its preference for liquor and its higher incidence of heart disease.

Wine columns

The previous conclusions account for 78% of the patterns of change in the data. This means that there is still 22% of change/information we have not observed yet. Most often, when subsequent PCs do not contain relevant patterns (in terms of variance), we look at the residuals as squared aggregates. For instance, the residuals can be summarized as a sum-of-squares per observation (or per variable). If we find any observation (or variable) that stands out from the rest, we should probably investigate the reason for that. Take the example below. Clearly, there is more to understand about Mexico and Japan than what we observed in the first 2 PCs. Looking at the third PC (not shown), we can infer that mainly Mexico but also Japan have lower values in all variables than what would be expected from the first 2 PCs.
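The residual sum-of-squares can be sketched in the same way. Again, this is an illustration under my own assumptions rather than the code behind the figure: the data is transcribed from the table above, normalized to zero mean and unit variance (ddof=1), and modelled with 2 PCs obtained from the SVD. Note that the absolute residual values depend on the normalization, but the ranking of countries does not depend on the choice of ddof.

```python
import numpy as np

# Wine data, transcribed from the table in this post (an assumption:
# the columns are Liquor, Wine, Beer, LifeEx, HeartD, in that order).
countries = ["France", "Italy", "Switz", "Austra", "Brit",
             "U.S.A.", "Russia", "Czech", "Japan", "Mexico"]
variables = ["Liquor", "Wine", "Beer", "LifeEx", "HeartD"]
X = np.array([
    [2.5, 63.5,  40.1, 78,  61.1],
    [0.9, 58.0,  25.1, 78,  94.1],
    [1.7, 46.0,  65.0, 78, 106.4],
    [1.2, 15.7, 102.1, 78, 173.0],
    [1.5, 12.2, 100.0, 77, 199.7],
    [2.0,  8.9,  87.8, 76, 176.0],
    [3.8,  2.7,  17.1, 69, 373.6],
    [1.0,  1.7, 140.0, 73, 283.7],
    [2.1,  1.0,  55.0, 79,  34.7],
    [0.8,  0.2,  50.4, 73,  36.4],
])
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA model with 2 PCs: scores T, loadings P, residuals E.
n_pcs = 2
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
T = U[:, :n_pcs] * s[:n_pcs]
P = Vt[:n_pcs].T
E = Xs - T @ P.T

# Squared residuals aggregated per observation and per variable.
ssq_obs = np.sum(E**2, axis=1)
ssq_var = np.sum(E**2, axis=0)

# Print countries from largest to smallest residual.
for name, q in sorted(zip(countries, ssq_obs), key=lambda pair: -pair[1]):
    print(f"{name:8s} {q:.3f}")
```

Observations at the top of this list are the ones the 2-PC model describes worst, so they are the first candidates for a closer look at the next PCs.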

Wine columns

This example illustrates how MEDA can be an interesting approach to data understanding. In the following entries of this blog, I will use increasingly complex examples but the same, general, approach to data investigation.