In the upcoming portfolio articles, I’ll go more in-depth on different types of data analysis. Partly to show off my knowledge (I’m kidding), but mostly to share it. This information can be useful for a broad range of people, from future employers and managers to people who like data and want to learn more. No matter what office job you end up in, data is always around, and at some point you will come across it. How good would it be to understand a little bit more of it? It can only benefit you.
Time series analysis is, logically speaking, the analysis of data that has a time dimension. It can be applied to all sorts of data that might be available at the business, organization, or company you’re working for. For example:
- Purchases (in every way you can think of)
- Website visits
- Results over time (basically every business-related result)
- Analytics data (Google Analytics or anything like that)
- Trends (from year-on-year to hour-to-hour)
- (Strategic) analysis for advertising
- Profits and losses (over time)
Not to forget the one time series that almost everyone comes across every day: Covid-19 analysis. This is basically one huge time series analysis, from new cases and recoveries to the number of deaths and the number of tests done or available. As soon as you turn on the news there is something about it.
Time to explain a bit more about time series. To do that, I’ll use a dataset with the number of Covid tests carried out each day. After cleaning and prepping it (see more about that in data analysis), we’re ready to use the dataset.
Time series can give different insights, depending on what you’re looking for. You might be looking for a trend in the data, for example an increase or decrease over time. Or you might want to know more about how seasonality affects your business, in which case you might find a pattern that keeps coming back every summer or winter over the years.
Figure 1 shows an overview of the top 10 countries with registered Covid-19 cases, from February onwards. Over time, you see a trend: cases increase in all of those countries.
Seasonal patterns are something different. Seasonality can be exactly what you’re looking for, and once you’ve found it, you work from there. But it can also be something you want to rule out: once you’ve found it, you remove it so you can do more analysis on the same data with the seasonal behavior excluded, for example to predict or forecast.
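To make the difference between trend and seasonality a bit more concrete, here is a minimal sketch in pandas. The series `tests_per_day` is made up for illustration (it is not the Covid dataset from this article): a steady increase plus a fixed weekday effect. A 7-day rolling mean smooths out the weekly pattern and shows the trend, and averaging what is left per weekday shows the seasonal pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 8 weeks of daily counts, starting on a Monday.
days = pd.date_range("2020-03-02", periods=56, freq="D")
t = np.arange(len(days))

# Assumed weekday effect (Mon..Sun); it sums to zero so it does not shift the trend.
weekday_effect = np.array([0, 5, 10, 5, 0, -10, -10])
tests_per_day = pd.Series(100 + 2 * t + weekday_effect[t % 7], index=days, dtype=float)

# Trend: a centered 7-day rolling mean averages out the weekly pattern.
trend = tests_per_day.rolling(window=7, center=True).mean()

# Seasonality: average deviation from the trend for each day of the week (0 = Monday).
weekday_profile = (tests_per_day - trend).groupby(days.dayofweek).mean()

print(trend.dropna().head())
print(weekday_profile)
```

On real data the picture will be noisier, and the right window length depends on the kind of seasonality you expect (weekly, monthly, yearly).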
Forecasting and predictive analysis
Forecasting and predictive analysis are great opportunities in time series: predicting what might happen in the future based on historical data. Sounds amazing, right? Before you’re able to do that, you need to take a few steps to check whether your data is suitable for it and, if it isn’t, to make it ready. The first thing we need to know is whether the data is stationary or not. Stationary means that the statistical properties of the series, such as its mean and variance, don’t change over time, so there is no trend or seasonality left in it. To test for this you can use the Augmented Dickey-Fuller test (ADF test). This test starts with the null hypothesis that your time series has a unit root, which means it is not stationary. If the result of the ADF test is significant, you reject that null hypothesis and can treat the series as stationary, which means you can continue to the next step.
The Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test can be used as a second check, including for trend stationarity. Its null hypothesis and p-value interpretation are just the opposite of the Augmented Dickey-Fuller (ADF) test. Here the null hypothesis states that the series is stationary (around a constant level or around a deterministic trend, depending on how you set up the test). If your test result is significant, the series is not stationary. If the result is non-significant, you have no evidence against stationarity and you can continue with your analysis.
Once you know whether there is a trend or a seasonal pattern, you can continue with your analysis. If your data is seasonal or has a trend, it needs to be transformed first, because predicting works best when the numbers that need to be predicted do not depend on (in this case) time. You’re basically removing everything that time alone can explain, to make the prediction more trustworthy.
You can remove a linear trend with the ‘detrend’ function from SciPy’s ‘signal’ module in Python, or you can split off the seasonal component with ‘seasonal_decompose’ from statsmodels.
As I’ve mentioned before, a lot of datasets are not perfect and ready to go, with all cells filled with the exact data you need. Missing values are common and need to be addressed before moving on. It’s up to you to decide what to do with those values. In a time series it’s usually not a good idea to fill them with the overall mean or with a ‘0’. A quick fix could be a forward-fill with the previous value. Other options are the mean of the nearest neighbors or a seasonal mean. Again, making the right decision depends hugely on the data, your dataset, and your knowledge about the data.
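Two of these options can be sketched in pandas as follows. The numbers are made up for illustration. I use linear interpolation as a stand-in for the "mean of the nearest neighbors" idea: for a single missing value it gives exactly the average of the value before and the value after.

```python
import numpy as np
import pandas as pd

# Hypothetical daily counts with a few missing days.
days = pd.date_range("2020-04-01", periods=8, freq="D")
tests = pd.Series(
    [100.0, 120.0, np.nan, 160.0, np.nan, np.nan, 220.0, 240.0],
    index=days,
)

# Forward-fill: repeat the last known value.
forward_filled = tests.ffill()

# Linear interpolation: draw a straight line between the known neighbors.
interpolated = tests.interpolate(method="linear")

print(pd.DataFrame({
    "original": tests,
    "forward_filled": forward_filled,
    "interpolated": interpolated,
}))
```

Forward-fill keeps the series flat across a gap, while interpolation assumes a gradual change. Which assumption is closer to reality is something only your knowledge of the data can tell you.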
There are plenty of other things you could consider doing with time series, but the last test I would like to point out is the Granger causality test, because it can indicate whether one time series is useful for forecasting another. The null hypothesis states that the past values of the first series do not help to predict the second series. A significant result rejects that null hypothesis, which means the first series is useful for forecasting the second. Keep in mind that this is about predictive value, not proof of real cause and effect.
Predictive analysis and forecasting will be discussed in one of my next articles.