Introduction to time series analysis:
Time series analysis is one of the most sought-after skills in data science today. Any data indexed by timestamps can be treated and analyzed as a time series. In industry, time series analysis plays a central role in planning sales, offers and product launches. It is also widely used to detect events in physical systems, which makes it a general analysis tool across physics, experimental work and the study of natural phenomena. With that motivation in place, let's dive into the basic theory first; after that we will discuss the technical details of how to do
time series analysis with python
time series analysis with R
Basic theory of time series:
According to Wikipedia,
" A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average."
First of all, you need to know what to look for in a time series and why. A time series has 4 main components, each of which captures a different aspect of the series.
Time series components are:
(1) trend
(2) seasonality
(3) cyclical variation
(4) white noise or random noise
Above, I have provided examples of time series. Yes, that's right: Google Trends provides search histories over time and thus gives real-life time series. Here, the average slope of the data point values is the trend. Formally, the trend is the component that depends on time and sets the upward or downward slope of the series once the other variations are taken out. To clarify this, let us concentrate on the blue time series in the above graph. The series fluctuates a lot around the 75 line, but overall it does not change much over time. That is the trend of this series: roughly flat. In other words, the trend is the "overall" slope of the time series. As another example, take the time series of temperatures over 200 years. You will see a slow upward trend, because the temperature has increased over time, however slowly. So, the trend is the most basic element of a time series, and it determines the overall increase or decrease of the series over time.
The next element is seasonality. As the name suggests, it is the part of the time series that is seasonal, i.e. the part that changes with the season or, more generally, with a fixed period. Formally, after removing the trend, we look for the part of the time series with periodic behavior. This periodic part is called the seasonality of the time series. For example, when you take temperature as a time series, the seasonality gives the change of temperature from season to season.
The next element is the cyclical variation of the time series. Cyclical variation is like the seasonal component but with long time periods, i.e. it is a component whose period is really long. These components also vary periodically but, unlike seasonal ones, their periods are longer and not constant in length. In the case of temperature, this refers to changes in temperature due to the cyclical change in the earth's distance from the sun over a span of years. A change in the number of earthquakes in Japan due to some periodic change of tectonic plates would also be a sort of cyclical variation.
The last element is random error, or what you call white noise. While white noise occurs in a lot of scenarios, here random error/white noise means a probabilistic process with no significant trend or pattern at all. This part of the time series is unexplained, and several tests are used to check whether the error part is truly without any pattern. Arriving at such a patternless error term indicates that the time series has been successfully decomposed into its 4 components.
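To make the four components concrete, here is a small sketch that builds a synthetic monthly series by simply adding them up. The additive form, the function name make_series and all the numbers (slope, amplitudes, a 60-month cycle) are my own assumptions for illustration, not values from any real dataset.

```python
import math
import random

def make_series(n_months=120, seed=0):
    """Build a toy monthly series as trend + seasonality + cycle + noise."""
    rng = random.Random(seed)
    # (1) trend: slow, steady upward drift
    trend = [0.05 * t for t in range(n_months)]
    # (2) seasonality: repeats exactly every 12 months
    seasonal = [3.0 * math.sin(2 * math.pi * t / 12) for t in range(n_months)]
    # (3) cyclical variation: a slower swing, here with a 60-month period
    cycle = [1.5 * math.sin(2 * math.pi * t / 60) for t in range(n_months)]
    # (4) random noise: independent Gaussian draws with no pattern
    noise = [rng.gauss(0.0, 0.5) for _ in range(n_months)]
    series = [a + b + c + d for a, b, c, d in zip(trend, seasonal, cycle, noise)]
    return series, trend, seasonal, cycle, noise

series, trend, seasonal, cycle, noise = make_series()
```

Decomposition is the reverse job: starting from series alone, recover something close to the other four lists.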
Evaluating different parts of the time series:
Calculating trend:
This is the first step in a time series analysis: finding the trend. There is basically no fully automatic process for this. The main idea is to smooth the time series by averaging, weighted averaging, estimation, etc. I will briefly talk about each of these approaches.
(1) averaging: the first basic idea is to average the data points over a specified window and then look at the resulting sequence. Once you have some idea of the period of the seasonality, you replace each data point with the average of the values inside a window of a chosen width before and after that point. The heuristic is that averaging like this cancels the effect of the seasonal component and, to some extent, of the cyclical component. Therefore, the averaged sequence depicts the trend in some way.
(2) weighted averaging: the second idea is to put smaller weights on the less relevant data points. Here we apply weights during the averaging process: more weight is given to data points that are closer, and less weight to points that are farther away. The concept of the window is the same as before. The additional heuristic is that, for a given data point, points near it in time are more relevant than points far from it.
Weighted averaging raises one question: how do we assign the weights? The answer, again, is that you can use multiple schemes, such as exponentially decreasing weights, weights following a specific probability distribution, and others. What matters most is the heuristic behind the choice.
(3) estimation by regression: when the form of the trend is clear, i.e. there is a good understanding of how the trend behaves, such as a linear or a polynomial relation, we can fit a regression model to estimate the parameters of the trend, using the data points of the time series as input to that model.
Estimating the trend is also important for models like ARIMA. We will talk about it in more detail when we discuss those models in a different post.
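The three approaches above can be sketched in a few lines of plain Python. This is only a minimal illustration under my own assumptions: the function names, the toy series (slope 0.1 plus a 12-step seasonal swing), the window sizes and the decay factor are all made up for the example, and the weighting scheme is just one of the many possible choices.

```python
import math

def moving_average(values, window):
    """Average of every run of `window` consecutive points (n - window + 1 results)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def weighted_moving_average(values, half_width, decay=0.5):
    """Centered average whose weights shrink as decay**distance from the center.
    Points too close to either end are left as None."""
    weights = [decay ** abs(k) for k in range(-half_width, half_width + 1)]
    total = sum(weights)
    out = [None] * len(values)
    for i in range(half_width, len(values) - half_width):
        window = values[i - half_width:i + half_width + 1]
        out[i] = sum(w * v for w, v in zip(weights, window)) / total
    return out

def linear_trend(values):
    """Least-squares fit of values[t] ~ intercept + slope * t."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope

# Toy series: linear trend with slope 0.1 plus a seasonal swing of period 12.
series = [0.1 * t + 2.0 * math.sin(2 * math.pi * t / 12) for t in range(48)]

ma = moving_average(series, window=12)                 # window = one full season
wma = weighted_moving_average(series, half_width=6)    # closer points count more
intercept, slope = linear_trend(series)                # regression estimate
```

With the window equal to one full period, the seasonal swing cancels inside every average, so ma climbs by 0.1 per step just like the underlying trend. The regression slope on the raw series comes out close to 0.1 but not exactly, because the seasonal part is still in the data when we fit.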
Now we will talk about how we can estimate the seasonality.
Calculating Seasonality:
Seasonality is the second most important thing to find. There are multiple ways to find the seasonality, e.g. correlation methods, fast Fourier transforms, and other processes like fitting a sine-cosine function as the seasonality model. I will discuss each of these approaches briefly below:
(1) Correlation-related method, lagged autocorrelation: there are two terms here that may be new to you: lag and autocorrelation. Autocorrelation is the correlation of the data with a shifted version of itself. Say you have a data series X1, X2, ... Then the autocorrelation is the correlation of this series with X3, X4, ... or with X6, X7, ... In other words, autocorrelation is the correlation of a dataset with itself at a lag. This is also called lagged autocorrelation, and the shift we apply to the data points is called the lag. Now, I will give you the heuristic behind using the autocorrelation function to find the seasonality.
Let's say you have temperature data for 3 years. As the temperature has a seasonality of 12 months, it should have a high correlation at a 12-month lag, since it repeats its values after 12 months. Therefore the lagged correlation will attain a high peak at the period of the seasonality and smaller peaks at multiples of the period. This is the heuristic of the correlation-related method.
Often, in applications, we calculate the lagged correlations for a number of lags, plot them, read off the periodicity and thereby get an idea of the seasonality.
(2) Fast Fourier transform: first you will want to know what a Fourier transform is. A Fourier series is the decomposition of a function into periodic functions, i.e. sine and cosine functions or complex exponential functions. To learn more about Fourier series, start from the Fourier_series Wikipedia article.
In this approach, we take the time series and detrend it, i.e. estimate the trend and subtract it from the series. After detrending, the series no longer changes on average over time. The heuristic is that this modified series is a combination of periodic functions (plus noise), and therefore we can decompose it into periodic functions.
Once we apply the FFT (fast Fourier transform) and get the decomposition, the components of the time series appear as periodic functions. The strongest of these pure periodic functions are, basically, the seasonality components.
Obviously there are many more details to this method, as well as computational aspects of applying it, but the above is the basic heuristic of applying the fast Fourier transform to a time series.
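Here is a minimal sketch of that procedure on the 3-year temperature example. The data are synthetic (a clean 12-month sine wave that I made up), and the function name autocorrelation and the simple "higher than both neighbours" peak rule are my own choices for illustration.

```python
import math

def autocorrelation(values, lag):
    """Sample autocorrelation of the series with itself shifted by `lag` steps."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values)
    cov = sum((values[t] - mean) * (values[t + lag] - mean) for t in range(n - lag))
    return cov / var

# Toy data: 3 years of monthly "temperatures" with a 12-month period.
temps = [15.0 + 10.0 * math.sin(2 * math.pi * t / 12) for t in range(36)]

max_lag = 24
acf = [autocorrelation(temps, lag) for lag in range(max_lag + 1)]  # acf[0] is 1.0

# A lag is a peak if its autocorrelation is higher than at both neighbouring lags.
peaks = [lag for lag in range(1, max_lag)
         if acf[lag] > acf[lag - 1] and acf[lag] > acf[lag + 1]]
```

For this toy series the only peak inside the scanned range is at lag 12, which is the seasonal period. The value at lag 24 is smaller than at lag 12, since fewer pairs of points overlap at longer lags.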
The FFT is also used when forecasting a time series by Fourier extrapolation, but that is beyond the scope of the current post.
(3) estimating via sine-cosine functions: this is again a pretty old-school trick. Here we detrend the time series first. In theory, what remains is again a combination of periodic functions. We then estimate the periodic part by assuming the series is a linear combination of sine and cosine functions. Optimization routines can be applied to optimize the fit of the sine-cosine combination and thereby find the approximate seasonality. This is the approach to use when the functional form of the seasonality is known up to unknown coefficients. That is often the case for physical experiments and variables, where the form is already known from theory and we only have to fit the data to the theoretical formulas.
More on this can be learned by looking at spectral analysis, the periodogram and other tools.
First of all, I will describe what spectral analysis is. Spectral analysis is the process discussed above: regressing the stationary time series on sine and cosine functions and expressing it as a linear combination of them. You can read the same stuff discussed here again from this pdf.
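To show the idea without any extra packages, the sketch below uses a plain discrete Fourier transform written with the standard library. It computes the same coefficients an FFT would, only more slowly; in practice you would call a real FFT routine such as numpy.fft.rfft. The toy series (a strong 12-step seasonality plus a weaker 4-step one) and the function name dft_magnitudes are assumptions for illustration.

```python
import cmath
import math

def dft_magnitudes(values):
    """Magnitude of each DFT coefficient for k = 0 .. n//2,
    where k counts full cycles over the whole series."""
    n = len(values)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(values))
        mags.append(abs(coeff))
    return mags

# Already detrended toy series: main period 12, weaker period 4.
n = 96
detrended = [3.0 * math.sin(2 * math.pi * t / 12)
             + 1.0 * math.cos(2 * math.pi * t / 4) for t in range(n)]

mags = dft_magnitudes(detrended)
# Skip k = 0 (the mean) and take the strongest remaining frequency.
k_peak = max(range(1, len(mags)), key=lambda k: mags[k])
dominant_period = n / k_peak
```

The strongest coefficient sits at k = 8 cycles per 96 samples, i.e. a period of 12 steps, and the second seasonality shows up as a smaller spike at k = 24 (period 4).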
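A minimal sketch of the sine-cosine fit, assuming the period is already known (12 here) and only the two coefficients are unknown. I solve the two normal equations of the least-squares problem directly instead of calling an optimizer; the function name fit_sine_cosine and the toy data are made up for the example.

```python
import math

def fit_sine_cosine(values, period):
    """Least-squares fit of values[t] ~ a*sin(w*t) + b*cos(w*t), with w = 2*pi/period."""
    w = 2 * math.pi / period
    s = [math.sin(w * t) for t in range(len(values))]
    c = [math.cos(w * t) for t in range(len(values))]
    # Sums that appear in the 2x2 normal equations for a and b.
    ss = sum(x * x for x in s)
    cc = sum(x * x for x in c)
    sc = sum(x * y for x, y in zip(s, c))
    sy = sum(x * y for x, y in zip(s, values))
    cy = sum(x * y for x, y in zip(c, values))
    det = ss * cc - sc * sc
    a = (sy * cc - cy * sc) / det
    b = (cy * ss - sy * sc) / det
    return a, b

# Detrended toy series with known period 12 and "unknown" coefficients 2.0 and 0.5.
w = 2 * math.pi / 12
detrended = [2.0 * math.sin(w * t) + 0.5 * math.cos(w * t) for t in range(36)]

a, b = fit_sine_cosine(detrended, period=12)
amplitude = math.hypot(a, b)   # overall size of the seasonal swing
```

Since the toy data contain no noise, the fit recovers the two coefficients essentially exactly. With real data you would get estimates, and with several candidate periods you would add one sine-cosine pair per period.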
Calculating Cycles:
There are not many detailed processes for detecting cycles. The first option relies on the detailed processes for finding the trend and seasonality: you subtract the trend and seasonality parts from the time series, and you end up with the cyclical variation plus the random error.
In the literature, the cyclical variation is sometimes lumped together with the random error. In other cases, one can find the cyclical variation while fitting the seasonality: the coefficients of the cycle often come up during that fit, so one can find and eliminate the cycles. The same often happens when fitting the coefficients of the sine-cosine combination.
Other than that, there are some more complicated processes for finding cycles.
We will discuss them more in the next update of the blog.
I hope you have understood the basic concepts of a time series. In the next update of this blog, I will discuss how to find time series datasets, work with them in Python and R, perform the above decomposition and do some other stuff like that.
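As a rough sketch of this subtraction, the code below removes a straight-line trend and then the average seasonal pattern from a toy series, leaving a remainder that holds the cycle (and, in real data, the noise). The linear trend, the 12-step season, the 72-step cycle and both function names are my own assumptions for illustration.

```python
import math

def detrend_linear(values):
    """Subtract the least-squares straight line from the series."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return [y - (y_mean + slope * (t - t_mean)) for t, y in enumerate(values)]

def deseasonalize(values, period):
    """Subtract from each point the mean of all points at the same position in the period."""
    means = []
    for pos in range(period):
        same_pos = values[pos::period]
        means.append(sum(same_pos) / len(same_pos))
    return [v - means[t % period] for t, v in enumerate(values)]

# Toy series: trend + 12-step seasonality + slow 72-step cycle (no noise added).
series = [0.2 * t
          + 4.0 * math.sin(2 * math.pi * t / 12)
          + 1.5 * math.sin(2 * math.pi * t / 72) for t in range(144)]

remainder = deseasonalize(detrend_linear(series), period=12)
```

The remainder is only an approximation of the cycle, because the straight-line fit also absorbs a little of the slow swing.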
See you soon!!
Update with programming:
python:
The first and most basic thing one will use for time series is the statsmodels module called time series analysis (statsmodels.tsa). Its documentation can be found here on the official page. This lets you do some basic stuff. Many time series models are already available in R, so someone more used to R may want to use R from Python. The library rpy2 allows you to do that. More details on this package can be found here.
Facebook also has a library named "prophet" for time series modelling. You can read more about prophet here.
Also, I have found this amazing collection by MaxBenChrist of models for time series in Python; check https://github.com/MaxBenChrist/awesome_time_series_in_python.
Package discussions:
We will first discuss a time series package called tbats. tbats is available in both Python and R. For Python, the module name is tbats, and it comes with both the TBATS and BATS models. For R, the tbats process is available in the library named 'forecast'. We will discuss the Python version of tbats.
First let's discuss how the tbats package works in theory.
Theoretical background:
TBATS was introduced mainly to solve 2 problems. There are many standard models in statistical computing for time series analysis, but most of them handle time series with a single seasonality. Some papers go up to 2 seasonalities, but very few go beyond that and optimize such models.
The second problem is non-integer seasonal periods. Periods in time series models are often modeled as integers, which overlooks the case of non-integer periods. The TBATS model tries to solve that version of the problem as well.
Also, TBATS addresses the large optimization problem involved in choosing the best model, which it handles using some advanced mathematical manipulation. We will discuss the details of the model selection process below.
First, tbats has a number of options in modelling. The options are:
(1) using a Box-Cox transformation or not; the transformation brings non-Gaussian variables closer to Gaussian, so that non-linear behavior can be handled by a linear model
(2) using a trend or not using it
(3) using a damped trend or not using it
(4) using a number of seasonalities or no seasonality
(5) using ARMA modelling on the errors
The basic process the TBATS model uses is this: it takes each of these options and creates optimized models, one with the option and one without. Then it uses AIC to determine which model is more useful, i.e. which one has the lower AIC, or AICc in the case of small samples. This way, it picks the best setting for each option and finally completes the fit.
Here, we have used a number of terms which are a bit advanced. To learn about AIC, see here on Wikipedia. To learn more about ARMA, follow here on Wikipedia. To download the original paper for more details, visit here.
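The comparison step can be sketched with the standard AIC and AICc formulas. This is not the tbats package's own code, only an illustration of the selection rule: the function names and the two candidate models with their log-likelihoods and parameter counts are made-up numbers.

```python
def aic(log_likelihood, k):
    """Akaike information criterion for a model with k fitted parameters."""
    return 2 * k - 2 * log_likelihood

def aicc(log_likelihood, k, n):
    """AIC with the small-sample correction, for n observations."""
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)

def pick_best(candidates, n):
    """candidates maps a model name to (log_likelihood, number_of_parameters).
    The model with the lowest AICc wins."""
    return min(candidates, key=lambda name: aicc(*candidates[name], n))

# Hypothetical fits of the same series with and without a trend term.
candidates = {
    "with_trend": (-120.0, 5),      # fits slightly better but uses 5 parameters
    "without_trend": (-121.0, 3),   # fits slightly worse with only 3 parameters
}

best = pick_best(candidates, n=50)
```

Here the model without a trend is chosen: its fit is only slightly worse, and the penalty for the two extra parameters outweighs the small gain in likelihood.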
I will soon add some code snippets for tbats.
Consider this link to a dataset of daily minimum temperatures. This csv was created from a dataset compiled by another webpage on machine learning datasets for time series. I will use this dataset to explore the package now.