1. 1.
    ​Random walk - what is?
  1. 1.
    ​Time series decomposition book - stl x11 seats


  1. 1.
    sktime - an sklearn-based API (Medium); integrates algorithms from tsfresh and tslearn
  2. 2.
    (really good) A LightGBM Autoregressor — Using Sktime, explains about the basics in time series prediction, splitting, next step, delayed step, multi step, deseason.
  3. 4.
    ​TSFresh - extracts 1200 features, filters them using FDR for time series classification etc
  4. 5.
    tslearn - DTW, shapes, shapelets (Keras layer), time series k-means/clustering/SVM/SVR/KNN/barycenters/PAA/SAX
  5. 6.
    DTAIDistance - Library for time series distances (e.g. Dynamic Time Warping) used in the DTAI Research Group. The library offers a pure Python implementation and a faster implementation in C. The C implementation has only Cython as a dependency. It is compatible with NumPy and Pandas and implemented to avoid unnecessary data copy operations. See also dtaidistance.clustering.hierarchical.
* Identify anomalies, outliers or abnormal behaviour (see for example the anomatools package).
  1. 1.
    Semi supervised with DTAIDistance - Active semi-supervised clustering
The recommended method for performing active semi-supervised clustering with DTAIDistance is to use COBRAS for time series clustering: https://github.com/ML-KULeuven/cobras. COBRAS is a library for semi-supervised time series clustering using pairwise constraints, which natively supports both dtaidistance.dtw and kshape.
  1. 1.
    Affine warp, a neural net with time warping - as part of the following manuscript, which focuses on analysis of large-scale neural recordings (though this code can also be applied to many other data types)
  2. 2.
    ​Neural warp - NeuralWarp: Time-Series Similarity with Warping Networks
  3. 3.
A great introduction to time series - “The approach is to come up with a list of features that captures the temporal aspects so that the auto correlation information is not lost.” It basically tells us to take sequence features and create (auto)correlated new variables using a time window, i.e., “Time series forecasts as regression that factor in autocorrelation as well.” We can transform raw features into other types of features that explain the relationship in time between features. We measure success using loss functions: MAE, RMSE, MAPE, RMSEP, AC-ERROR-RATE.
An interesting idea on how to define ‘time series’ dummy variables that use the beginning/end of certain holiday events, including important information on what NOT to filter even if it seems insignificant, such as zero sales that may indicate some relationship to many sales the following day.
  • A trend (a,b,c) exists when there is a long-term increase or decrease in the data.
  • A seasonal (a - big waves) pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week, e.g., monthly sales induced by the change in cost at the end of the calendar year.
  • A cycle (a) occurs when the data exhibit rises and falls that are not of a fixed period - sometimes years.
​Some statistical measures (mean, median, percentiles, iqr, std dev, bivariate statistics - correlation between variables)
Bivariate formula: the correlation coefficient r measures the extent of a linear relationship between two variables. A large absolute value means a strong linear relationship between the two variables. The value of r always lies between -1 and 1, with negative values indicating a negative (decreasing) relationship and positive values indicating a positive (increasing) relationship.
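The formula itself did not survive in these notes. As a minimal sketch, assuming the coefficient meant here is the usual Pearson r, it can be computed as below. The function name `pearson_r` is made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: sum((x - mx)(y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    # Numerator: co-variation of the two variables around their means.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # Denominator: product of the two spreads, which scales r into [-1, 1].
    sx = sum((x - mx) ** 2 for x in xs)
    sy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sx * sy)


increasing = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly increasing line
decreasing = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # perfectly decreasing line
```

A perfectly increasing line gives r = 1 and a perfectly decreasing one gives r = -1. A constant input makes the denominator zero, which this sketch does not handle.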
But correlation can LIE: all of the following graphs have a correlation of 0.8:
Autocorrelation measures the linear relationship between lagged values of a time series.
L8 is correlated, and has a high measure of 0.83
  • White-noise has autocorrelation of 0.
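A rough sketch of the lag-k autocorrelation described above, assuming the common sample estimator (lagged co-variation divided by the total variation around the mean). `autocorrelation` is an illustrative name, not a library call.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation r_k between the series and itself shifted by `lag` steps."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    # Pair each observation with the one `lag` steps earlier.
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


# A toy series that repeats every 4 steps.
wave = [0, 1, 0, -1] * 6
r4 = autocorrelation(wave, 4)   # lag equal to the period: strongly positive
r2 = autocorrelation(wave, 2)   # half a period: strongly negative
```

For a series that repeats every 4 steps, the lag-4 autocorrelation is high and the lag-2 autocorrelation is negative, which is the kind of pattern an autocorrelation plot makes visible.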
  • Average: Forecasts of all future values are equal to the mean of the historical data.
  • Naive: Forecasts are simply set to be the value of the last observation.
  • Seasonal Naive: forecast to be equal to the last observed value from the same season of the year
  • Drift: A variation on the naïve method is to allow the forecasts to increase or decrease over time, the drift is set to be the average change seen in the historical data.
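The four baseline methods above can be sketched in a few lines. This is a simplified illustration on a plain Python list; the function names and the `season_length` parameter are assumptions made for the example.

```python
def mean_forecast(history, horizon):
    """Average method: every future value equals the historical mean."""
    return [sum(history) / len(history)] * horizon


def naive_forecast(history, horizon):
    """Naive method: every future value equals the last observation."""
    return [history[-1]] * horizon


def seasonal_naive_forecast(history, horizon, season_length):
    """Seasonal naive: repeat the last observed season."""
    last_season = history[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]


def drift_forecast(history, horizon):
    """Drift: extend the last value by the average historical change per step."""
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + slope * h for h in range(1, horizon + 1)]


history = [10, 12, 14, 16]
forecasts = {
    "mean": mean_forecast(history, 2),
    "naive": naive_forecast(history, 2),
    "seasonal_naive": seasonal_naive_forecast(history, 3, season_length=2),
    "drift": drift_forecast(history, 2),
}
```

These baselines are mainly useful as a yardstick: a more complex model should beat them before it is worth keeping.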
  • Log
  • Box cox
  • Back transform
  • Calendrical adjustments
  • Inflation adjustment
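As a small sketch of the log, Box-Cox and back-transform items above, assuming the standard one-parameter Box-Cox definition (lambda = 0 reduces to the natural log). The function names are illustrative only.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform: log(y) when lam == 0, otherwise (y**lam - 1) / lam."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def inverse_box_cox(transformed, lam):
    """Back-transform forecasts made on the Box-Cox scale to the original scale."""
    if lam == 0:
        return [math.exp(w) for w in transformed]
    return [(lam * w + 1) ** (1 / lam) for w in transformed]


original = [1.0, 4.0, 9.0]
transformed = box_cox(original, 0.5)
restored = inverse_box_cox(transformed, 0.5)
```

Forecasts are made on the transformed scale and then passed through the inverse to get values in the original units.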


  • Dummy variables for the day of the week: six dummies for seven days, e.g., Monday, Tuesday, Wednesday, Thursday, Friday and Saturday, with no dummy for Sunday.
  • Notice that only six dummy variables are needed to code seven categories. That is because the seventh category (in this case Sunday) is specified when the dummy variables are all set to zero. Many beginners will try to add a seventh dummy variable for the seventh category. This is known as the "dummy variable trap" because it will cause the regression to fail.
  • Outliers: If there is an outlier in the data, rather than omit it, you can use a dummy variable to remove its effect. In this case, the dummy variable takes value one for that observation and zero everywhere else.
  • Public holidays: For daily data, the effect of public holidays can be accounted for by including a dummy variable predictor taking value one on public holidays and zero elsewhere.
  • Easter: is different from most holidays because it is not held on the same date each year and the effect can last for several days. In this case, a dummy variable can be used with value one where any part of the holiday falls in the particular time period and zero otherwise.
  • Trading days: The number of trading days in a month can vary considerably and can have a substantial effect on sales data. To allow for this, the number of trading days in each month can be included as a predictor. An alternative that allows for the effects of different days of the week has the following predictors: number of Mondays in the month; number of Tuesdays in the month; ...; number of Sundays in the month.
  • Advertising: $advertising for previous month;$advertising for two months previously
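A minimal sketch of the six-dummy day-of-week encoding, assuming Sunday is the all-zeros baseline category as in the "dummy variable trap" note above. The helper name `day_of_week_dummies` is made up for this example.

```python
from datetime import date

# Six dummy columns; Sunday is the baseline and gets no column of its own.
DUMMY_DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]


def day_of_week_dummies(d):
    """Return six 0/1 values for a date; Sunday maps to all zeros."""
    weekday = d.weekday()  # Monday == 0 ... Sunday == 6
    return [1 if weekday == k else 0 for k in range(len(DUMMY_DAYS))]


monday_row = day_of_week_dummies(date(2024, 1, 1))
saturday_row = day_of_week_dummies(date(2024, 1, 6))
sunday_row = day_of_week_dummies(date(2024, 1, 7))
```

The same pattern works for outlier or public-holiday dummies: a column that is 1 on the flagged dates and 0 everywhere else.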
“compute parameter estimates over a rolling window of a fixed size through the sample. If the parameters are truly constant over the entire sample, then the estimates over the rolling windows should not be too different. If the parameters change at some point during the sample, then the rolling estimates should capture this instability”
Estimate the trend-cycle with a moving average:
  • Window size 3, 5, 7 or 9? If it is too large it flattens the curve; if it is too small the result stays close to the actual curve.
  • Two-tier moving average: first a 4-term moving average, then a 2-term moving average on the result.
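A short sketch of the moving-average smoothing described above, including the two-tier variant (a 4-term average followed by a 2-term average of the result). Function names are illustrative; end points without a full window are simply dropped.

```python
def moving_average(series, window):
    """Plain moving average over every full window of the given size."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


def two_by_four_ma(series):
    """Two-tier smoothing: a 4-term moving average, then a 2-term average of it."""
    return moving_average(moving_average(series, 4), 2)


series = [1, 2, 3, 4, 5, 6]
smooth_3 = moving_average(series, 3)
smooth_2x4 = two_by_four_ma(series)
```

Averaging the 4-term result again with a window of 2 re-centres it on an actual time step, which an even window alone cannot do.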
​Visual example of ARIMA algorithm - captures the time series trend or forecast.


  1. 1.
    ​Creating curves to explain a complex seasonal fit.

Weighted “window”

  1. 1.
    Level. The baseline value for the series if it were a straight line.
  2. 2.
    Trend. The optional and often linear increasing or decreasing behavior of the series over time.
  3. 3.
    Seasonality. The optional repeating patterns or cycles of behavior over time.
  4. 4.
    Noise. The optional variability in the observations that cannot be explained by the model.
All time series have a level, most have noise, and the trend and seasonality are optional.
One step forecast using a window of “1” and a typical sample “time, measure1, measure2”:
  • Linear/nonlinear classifiers: predict a single output value using the previous (t-1) row, i.e., “measure1 t, measure2 t, measure1 t+1, measure2 t+1 (as the class)”
  • Neural networks: predict multiple output values, i.e., “measure1 t, measure2 t, measure1 t+1 (class1), measure2 t+1 (class2)”
One-Step Forecast: This is where the next time step (t+1) is predicted.
Multi-Step Forecast: This is where two or more future time steps are to be predicted.
Multi-step forecast using a window of “1” and a typical sample “time, measure1”, i.e., the current value is the input and the next two values are the labels:
  • “measure1 t, measure1 t+1 (class1), measure1 t+2 (class2)”
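The one-step and multi-step framings above can be sketched as a single windowing helper. This is a simplified illustration for a univariate series; `make_supervised`, `n_in` and `n_out` are hypothetical names chosen for the example.

```python
def make_supervised(series, n_in=1, n_out=1):
    """Turn a series into (input window, output window) pairs.

    n_in  -- number of past steps used as features
    n_out -- number of future steps used as labels (1 = one-step forecast)
    """
    rows = []
    for t in range(n_in, len(series) - n_out + 1):
        inputs = series[t - n_in:t]      # values up to and including step t-1
        outputs = series[t:t + n_out]    # values from step t onwards
        rows.append((inputs, outputs))
    return rows


series = [1, 2, 3, 4, 5]
one_step = make_supervised(series, n_in=1, n_out=1)
multi_step = make_supervised(series, n_in=1, n_out=2)
```

Each pair can then be fed to any ordinary supervised learner, with the input window as features and the output window as the target.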
This article explains ML methods for sequential supervised learning. The following methods have been applied to solve sequential supervised learning problems:
  1. 1.
    sliding-window methods - converts a sequential supervised problem into a classical supervised problem
  2. 2.
    recurrent sliding windows
  3. 3.
    hidden Markov models
  4. 4.
    maximum entropy Markov models
  5. 5.
    input-output Markov models
  6. 6.
    conditional random fields
  7. 7.
    graph transformer networks


What is stationarity? A stationary time series has no trend or seasonality; in other words, a non-stationary series has a trend or seasonality.
There are ways to remove the trend and seasonality, i.e., take the difference between time points.
  1. 1.
    T+1 - T
  2. 2.
    Bigger lag to support seasonal changes
  3. 3.
  4. 4.
    Plot a histogram, plot a log(X) as well.
  5. 5.
    Test the unit-root null hypothesis, i.e., use the augmented Dickey-Fuller test to determine whether a series is stationary or non-stationary (seasonal/trend).
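A minimal sketch of the differencing steps listed above: lag 1 for a trend, and a larger lag for seasonality. The `difference` helper is an illustrative name; the unit-root test itself is not implemented here.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# A linear trend becomes a constant after first-order differencing (T+1 - T).
trend = [3, 5, 7, 9]
detrended = difference(trend, lag=1)

# A period-2 pattern on top of a trend: a lag equal to the period removes the pattern.
seasonal = [1, 5, 2, 6, 3, 7]
deseasoned = difference(seasonal, lag=2)
```

The differenced series is shorter than the original by `lag` values, so forecasts made on it have to be added back to the earlier observations to return to the original scale.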
(amazing) STL and more.


  1. 1.
    ​Short time series​
  2. 2.
    pmdarima - pmdarima's auto_arima function is extremely useful when building an ARIMA model, as it helps us identify the optimal p, d, q parameters and returns a fitted ARIMA model.
  3. 4.
    1. 1.
      Autoregression (AR)
    2. 2.
      Moving Average (MA)
    3. 3.
      Autoregressive Moving Average (ARMA)
    4. 4.
      Autoregressive Integrated Moving Average (ARIMA)
    5. 5.
      Seasonal Autoregressive Integrated Moving-Average (SARIMA)
    6. 6.
      Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX)
    7. 7.
      Vector Autoregression (VAR)
    8. 8.
      Vector Autoregression Moving-Average (VARMA)
    9. 9.
      Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX)
    10. 10.
      Simple Exponential Smoothing (SES)
    11. 11.
      Holt-Winters Exponential Smoothing (HWES)
Predicting actual Values of time series using observations
  1. 1.
    ​Using kalman filters - explains the concept etc, 1 out of 55 videos.
There are three types of gates within a unit:
  • Forget Gate: conditionally decides what information to throw away from the block.
  • Input Gate: conditionally decides which values from the input to update the memory state.
  • Output Gate: conditionally decides what to output based on input and the memory of the block.
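To make the three gates concrete, here is a rough sketch of one LSTM step with scalar inputs and states, assuming the commonly used formulation (sigmoid gates, tanh candidate). The weight names in `params` are invented for this example; real layers use weight matrices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step for scalar input x, hidden state h_prev and memory c_prev."""
    # Forget gate: how much of the old memory to keep.
    f = sigmoid(params["wf"] * x + params["uf"] * h_prev + params["bf"])
    # Input gate: how much of the new candidate to write into memory.
    i = sigmoid(params["wi"] * x + params["ui"] * h_prev + params["bi"])
    # Output gate: how much of the memory to expose as output.
    o = sigmoid(params["wo"] * x + params["uo"] * h_prev + params["bo"])
    # Candidate memory content computed from the current input and previous output.
    g = math.tanh(params["wg"] * x + params["ug"] * h_prev + params["bg"])
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# With all-zero weights every gate is 0.5 and the candidate is 0.
zero_params = {name: 0.0 for name in
               ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg"]}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
```

In that zero-weight case the memory is simply halved by the forget gate, which shows how the gates scale information rather than switch it on or off.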
Using an LSTM to predict sunspots; makes some use of autocorrelation


  1. 1.
    ​Stackexchange - Yes, you can use DTW approach for classification and clustering of time series. I've compiled the following resources, which are focused on this very topic (I've recently answered a similar question, but not on this site, so I'm copying the contents here for everybody's convenience):



  1. 2.
    ​mastery on arimas​
  2. 4.
    ​AD techniques, part 2, part 3​
  3. 6.
    ​Adtk a sklearn-like toolkit with an amazing intro, various algorithms for non seasonal and seasonal, transformers, ensembles.
  4. 9.
    ​Ransac is a good baseline - random sample consensus for outlier detection
    1. 1.
      ​Ransac, 2, 3, 4, 5, 6
    2. 2.
      You can feed ransac with tsfresh/tslearn features.
  5. 11.
    AD for TS, recommended by DTAIDistance, anomatools​
  6. 13.
    Sliding windows
  7. 14.
    Forecasting using Arima 1, 2​
  8. 15.
    Auto arima 1, 2, 3​
  9. 16.
    Twitter's ESD test for outliers, using z-score and t-test
    1. 1.
      Another esd test inside here​
  10. 18.
    ​Golden signals, youtube​
  11. 20.
    ​Time2vec, paper (for deep learning, as a layer)


Dynamic Time Warping (DTW)

DTW, i.e., how to compute a better distance between two time series.
  • Myth 1: The ability of DTW to handle sequences of different lengths is a great advantage, and therefore the simple lower bound that requires different-length sequences to be reinterpolated to equal length is of limited utility [10][19][21]. In fact, as we will show, there is no evidence in the literature to suggest this, and extensive empirical evidence presented here suggests that comparing sequences of different lengths and reinterpolating them to equal length produce no statistically significant difference in accuracy or precision/recall.
  • Myth 2: Constraining the warping paths is a necessary evil that we inherited from the speech processing community to make DTW tractable, and that we should find ways to speed up DTW with no (or larger) constraints [19]. In fact, the opposite is true. As we will show, the 10% constraint on warping inherited blindly from the speech processing community is actually too large for real world data mining.
  • Myth 3: There is a need (and room) for improvements in the speed of DTW for data mining applications. In fact, as we will show here, if we use a simple lower bounding technique, DTW is essentially O(n) for data mining applications. At least for CPU time, we are almost certainly at the asymptotic limit for speeding up DTW.
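For reference, a compact sketch of the classic dynamic-programming DTW distance with an optional warping-window constraint of the kind discussed in Myth 2. It is a plain O(n*m) illustration using absolute difference as the local cost, not the lower-bounded variant the quote refers to; `dtw_distance` and `window` are names chosen for this example.

```python
import math


def dtw_distance(a, b, window=None):
    """DTW distance between two numeric sequences.

    window -- maximum allowed |i - j| between aligned indices (None = unconstrained).
    """
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)
    # The band must be at least as wide as the length difference to reach the corner.
    window = max(window, abs(n - m))

    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])        # same shape, different length
unconstrained = dtw_distance([1, 2, 3], [2, 3, 4])       # warping allowed
locked = dtw_distance([1, 2, 3], [2, 3, 4], window=0)    # no warping: point-by-point
```

With `window=0` the result is just the sum of point-by-point differences, while a wider window lets the path stretch one series against the other.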
  1. 2.
    ​Python code with a good tutorial.​
  2. 3.
    Another function for dtw distance in python
  3. 4.
    ​Medium, mentions prunedDTW, sparseDTW and fastDTW
  4. 5.
    ​DTW in TSLEARN​
  5. 6.
  1. 1.
    (duplicate above in classification) Stackexchange - Yes, you can use DTW approach for classification and clustering of time series. I've compiled the following resources, which are focused on this very topic (I've recently answered a similar question, but not on this site, so I'm copying the contents here for everybody's convenience):