I’ve mentioned the word ‘correlation’ a few times already in earlier articles. In this one, I’ll do a deep dive into correlation: what it is exactly, what it tells you and what you can use it for.
Correlation is basically used to test whether one factor (or variable) is related to another in a data set. For now I’ll keep talking about Covid-19, because the datasets you get from it are huge and because the data is actually real (I’m not saying anything about trustworthiness), which makes it more appealing to me to get results from a data set like that.
So, Covid-19. There are a lot of different opinions about it, and in this article I will talk about exactly NONE of them. For me it’s about the data. No, I won’t say anything about its validity, because, again, that’s not what this article is about.
An example of a correlation in Covid-19 data could be whether the number of new cases is related to the number of tests being done. Or whether the number of deaths is related to the total number of cases. With a correlation like that you don’t just get a yes or no; the result is a number, the correlation coefficient, which ranges from -1 to 1. The sign tells you the direction of the relationship and the absolute value tells you its strength, from 0 (no linear relationship) to 1 (a perfect one). Usually an absolute value above 0.7 is seen as a strong correlation between the two factors (variables).
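To make that number a bit less abstract, here is a minimal sketch of how a correlation coefficient can be calculated by hand. I’m assuming the Pearson coefficient here, which is the most common one; the function name `pearson_r` and the numbers are made up for illustration and do not come from the Covid-19 dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two variables move together around their means
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)


# Made-up numbers, purely for illustration
tests_done = [100, 200, 300, 400, 500]
new_cases = [12, 19, 33, 38, 52]

r = pearson_r(tests_done, new_cases)
print(f"correlation: {r:.3f}")
```

In practice you would let a library do this for you, but writing it out once shows that the result is simply the shared movement of the two variables divided by their individual spread.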
Especially in a bigger data set, or in a dataset you’re still exploring and preparing, a correlation is used to find (unexpected) relationships. You get a feeling for the data and the variables, and you find a starting point for which variables could be interesting for research (the ones with a high correlation). It helps you in the preparation phase to see what’s going on, and based on the results you can plan your next steps.
Correlation and covariance are often confused, since they both say something about how two variables move together. It’s important to know that you will mostly use correlation if you’re looking for a relationship between two variables and the strength of it. Covariance is mainly useful to find the direction of the relationship between the two variables (positive or negative): its size depends on the units of the variables, so it doesn’t tell you much about strength.
Correlation only says something about the relation between them. The closer the absolute value is to 1, the more likely it is that if one variable moves in a certain direction, the other one moves as well. It can still move two ways: in the same direction (ending up close to 1) or in the opposite direction (ending up close to -1).
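The difference between the two is easier to see with a small experiment. The sketch below uses NumPy and made-up numbers (the variable names are only for illustration): it rescales one variable and shows that the covariance changes with the units while the correlation does not.

```python
import numpy as np

# Made-up numbers for illustration only
cases = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
deaths = np.array([1.0, 2.0, 2.5, 4.0, 5.5])

cov = np.cov(cases, deaths)[0, 1]        # sample covariance
corr = np.corrcoef(cases, deaths)[0, 1]  # Pearson correlation

# Express cases in different units: the covariance scales along, the correlation stays put
cov_scaled = np.cov(cases * 1000, deaths)[0, 1]
corr_scaled = np.corrcoef(cases * 1000, deaths)[0, 1]

# Correlation is the covariance divided by both standard deviations
corr_manual = cov / (cases.std(ddof=1) * deaths.std(ddof=1))

# Flipping one variable flips the direction (the sign), not the strength
corr_flipped = np.corrcoef(cases, -deaths)[0, 1]

print(f"covariance:  {cov:.2f} -> {cov_scaled:.2f} after rescaling")
print(f"correlation: {corr:.3f} -> {corr_scaled:.3f} after rescaling")
print(f"flipped:     {corr_flipped:.3f}")
```

So the sign of the covariance gives you the direction, but only the correlation gives you a strength you can compare between pairs of variables.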
For now we use an example about Covid-19. Don’t get scared now, but take a look at the correlation overview below. Yes, it’s a huge figure and it might scare you off instantly, but this is a very good first step to see all the correlations in one overview. To be clear: I would personally never show something like this to anyone who makes the final decision, because that person will go nuts, not knowing where to look at all. But for an analyst/scientist, and for me personally, it’s a fascinating overview of the data. I can spend hours looking at it and doing deep dives between variables.
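An overview like this is usually built from a correlation matrix: every variable against every other variable. The sketch below shows one way to get such a matrix with pandas. The data is synthetic and the column names are hypothetical stand-ins, not the columns of the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 200

# Synthetic stand-in data; the column names are hypothetical
tests = rng.uniform(1_000, 10_000, size=n)
df = pd.DataFrame({
    "total_tests": tests,
    "new_cases": 0.05 * tests + rng.normal(0, 50, size=n),  # built to follow the tests
    "unrelated": rng.normal(0, 1, size=n),                  # pure noise
})

# Every numeric column against every other one (Pearson by default)
corr_matrix = df.corr()
print(corr_matrix.round(2))
```

The matrix is symmetric and has 1.0 on the diagonal, because every variable correlates perfectly with itself. With dozens of columns it quickly becomes a wall of numbers, which is why it is normally shown as a coloured figure.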
Highlight
For now, I just want to highlight a few potentially interesting results that come from this correlation matrix.
The big white mark is quite an obvious one, which I will not use. Hand washing facilities are of course everywhere in hospitals, and therefore they are strongly negatively related to those variables.
It’s interesting to see that life expectancy is strongly positively related to hand washing facilities, which indicates that those two have a relation with each other. It suggests that fewer hand washing facilities go together with a lower life expectancy.
Correlation results
Life expectancy vs. hand washing facilities: 0.825218
There is a strong correlation between life expectancy and hand washing facilities. This indicates that if people are able to wash their hands very regularly, it is positively related to their life expectancy (think about this in a Covid-related environment; it doesn’t make much sense in normal daily life).
Female smokers vs. aged 70 older: 0.778382
There is a strong relation between aged 70 and older (with Covid) and female smokers. It seems to be the case that where there are more people over the age of 70, there are also more female smokers.
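Pairs like the two above don’t have to be found by staring at the figure. Here is a minimal sketch of how the strong pairs could be pulled out of a correlation matrix automatically. The helper `strongest_pairs` and the columns `a` to `d` are hypothetical, and the data is synthetic; the 0.7 threshold is the rule of thumb mentioned earlier.

```python
import numpy as np
import pandas as pd


def strongest_pairs(df, threshold=0.7):
    """Return variable pairs whose absolute correlation is at least `threshold`."""
    corr = df.corr()
    # Keep only the upper triangle: each pair once, no variable against itself
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    strong = pairs[pairs.abs() >= threshold]
    # Strongest relationships first, regardless of direction
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)


# Synthetic data, purely for illustration
rng = np.random.default_rng(seed=1)
base = rng.normal(size=300)
demo = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.3, size=300),   # moves with a
    "c": -base + rng.normal(scale=0.3, size=300),  # moves against a
    "d": rng.normal(size=300),                     # unrelated noise
})

result = strongest_pairs(demo)
print(result)
```

Sorting on the absolute value matters here: a correlation of -0.9 is just as strong as one of 0.9, it only points the other way.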
Cases
| | new cases | total tests | total deaths |
| total tests | 0.77 | – | |
| total deaths | 0.97 | 0.84 | – |
| total cases | 0.96 | 0.94 | 0.98 |
In the overview above there is a strong correlation between new cases, deaths, tests and total cases, which indicates (for all of them) that the more of one, the more of the other: for example, the more new cases, the more total deaths. The general information we get on the news is that testing increased compared to the first peak. The data above covers the whole period, March to November. It could be interesting to see whether the correlation is different in April and October.
Correlation April vs October
| Correlation | April | October |
| New cases vs total tests | 0.82 | 0.89 |
| Total tests vs total deaths | 0.80 | 0.90 |
| Total deaths vs new cases | 0.88 | 0.97 |
| Total cases vs total tests | 0.90 | 0.95 |
The table above shows, for every pair of variables, the correlation in April and the correlation in October. It shows that in October there is a higher correlation between cases and tests, which indicates that the more tests are done, the more cases come up. The same holds for the other pairs: the more tests, the more deaths; the more new cases, the more total deaths; and the more total cases, the more tests are being done.
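A comparison per month like the one in the table can be set up by filtering the rows on their date before calculating the correlation. The sketch below shows the idea with pandas. The helper `monthly_correlation`, the `date` column and the other column names are assumptions for illustration, and the data is synthetic, so the numbers it prints have nothing to do with the real table.

```python
import numpy as np
import pandas as pd


def monthly_correlation(df, col_a, col_b, month, date_col="date"):
    """Pearson correlation of two columns, using only rows from one calendar month."""
    in_month = df[pd.to_datetime(df[date_col]).dt.month == month]
    return in_month[col_a].corr(in_month[col_b])


# Synthetic daily data from March to November, purely for illustration
dates = pd.date_range("2020-03-01", "2020-11-30", freq="D")
rng = np.random.default_rng(seed=2)
tests = rng.uniform(1_000, 5_000, size=len(dates))
demo = pd.DataFrame({
    "date": dates,
    "total_tests": tests,
    "new_cases": 0.04 * tests + rng.normal(0, 20, size=len(dates)),
})

april = monthly_correlation(demo, "new_cases", "total_tests", month=4)
october = monthly_correlation(demo, "new_cases", "total_tests", month=10)
print(f"April: {april:.2f}  October: {october:.2f}")
```

Note that this filter only looks at the month number. If your data covers more than one year, you would also have to filter on the year, otherwise April of different years ends up in the same bucket.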
All of this indicates that the country has the whole Covid-19 operation more on point. It’s tricky to draw a hard conclusion from these results, but knowing that testing has gone up in general, you could say that because testing has gone up, more Covid cases come out of the testing, and therefore more registered deaths come from that. The correlation between total tests and total cases is very strong and has even gone up, which indicates that only a small percentage of the registered cases comes in from another direction.
All of those conclusions might seem very logical and straightforward to a lot of people. From my perspective, I prefer to have data to confirm a certain logic or thought. Because data is knowledge, and knowledge is key.
Obviously this will not be your end station for research. There is so much Covid data that the possibilities are endless. This dataset is also definitely not complete. All sorts of data could be added to this set to expand it and make it more interesting.
Cheers,
Laura