Predicting Hate Crime: ARIMA Model

The next models I tried for predicting hate crime were ARIMA and SARIMA. It makes sense that hate crime counts would have an autoregressive structure, with each point generally similar to the recent points in the series. I had also observed that while specific incidents produced short-term spikes in hate crime, most triggers for change, such as an economic downturn, played out over the longer term.

I tried several different models and transformations. Below, I show some of the models I examined, their accuracy, their goodness of fit, and a comparison to the Holt-Winters model.

Exploratory Data Analysis

The lag plot below shows that the value at a given point in the series is often very close to the next or previous point. In some cases, however, there is still a substantial percent change from one month to the next.
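As a quick numerical check of that claim, here is a minimal sketch using synthetic monthly counts in place of the real data (the series and its values are purely illustrative): the lag-1 correlation is high, yet individual month-over-month percent changes can still be sizeable.

```python
import numpy as np
import pandas as pd

# Synthetic monthly counts standing in for the real hate crime series.
rng = np.random.default_rng(42)
idx = pd.date_range("2010-01-01", periods=120, freq="MS")
counts = pd.Series(200 + np.cumsum(rng.normal(0, 5, 120)), index=idx)

# Pair each value with its predecessor, as a lag plot does visually.
lagged = pd.DataFrame({"y": counts, "y_lag1": counts.shift(1)}).dropna()
print("lag-1 correlation:", round(lagged["y"].corr(lagged["y_lag1"]), 3))
print("largest month-over-month % change:",
      round(counts.pct_change().abs().max() * 100, 1))
```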

Seasonal Decomposition

Non-Constant Variance: Log Transformation

The seasonal decomposition above shows clear seasonality, and since 2016 there has been a definite upward trend in hate crimes. The ARIMA model is subject to many of the same assumptions as linear regression: the residuals must be normally distributed, independent, and have constant variance. These residuals show some heteroskedasticity (non-constant variance), which is a problem for the model, so I will apply a log transformation to handle it. At the end of modeling, I will test the model assumptions.

Differencing

Weak stationarity, a constant mean and variance over time, is an additional assumption required for ARIMA. Strict stationarity, which is unusual in practice, requires the series to have the same probability distribution at any point in time as at any other (Leonie Monigatti, "Stationarity in Time Series — A Comprehensive Guide," Towards Data Science). Encyclopedia.com makes a similar point: in a stationary process, a data point and its shifted counterpart share the same probability distribution ("Stationary Process," Encyclopedia.com). Weak stationarity is acceptable for our purposes. Working in Python, I used multiple methods to check for stationarity: visual inspection and the Augmented Dickey-Fuller test (statsmodels.tsa.stattools and pmdarima.arima.stationarity). A standard technique to make data stationary is differencing: at each point in time, subtract the previous data point from the current one, repeating through the series. Sometimes the data needs to be differenced more than once; in this case I anticipated differencing only once and used functions in Python to confirm it. The pmdarima.arima.utils library provides a method to indicate the number of times a series needs to be differenced.

Interestingly, the tests disagreed on whether differencing was required. The stattools version of the Augmented Dickey-Fuller test indicated that differencing was required for stationarity. The pmdarima version returned different decisions depending on which method was applied. Likewise, ndiffs with a KPSS test indicated that differencing was required, yet the auto_arima model, which uses this method by default, did not apply a differencing step. I decided to try models with and without differencing. When using auto_arima, I explicitly passed d=1 as a parameter, forcing the method to search models with differencing applied.

Training and Test Data

I split the data into training and test sets. Most of the data is allocated to training; the last 12 months are held out as the test set, since I want to forecast one year into the future. I tested accuracy on these out-of-sample predictions. In the next update to this model, or the subsequent model, I plan to implement a walk-forward validation approach to training.
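The split itself is a one-liner in pandas; the synthetic series below stands in for the real monthly counts:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
idx = pd.date_range("2010-01-01", periods=144, freq="MS")
counts = pd.Series(200 + np.cumsum(rng.normal(0, 5, 144)), index=idx)

# Hold out the final 12 months as the test set; train on everything earlier.
train, test = counts.iloc[:-12], counts.iloc[-12:]
print(len(train), "training months,", len(test), "test months")
```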

ACF and PACF

Initially I reviewed the ACF to get an idea of the moving average (MA) component, then the PACF to assess the autoregressive (AR) component. The ACF and PACF below did not give me a clear answer. After these plots, I move on to modeling with pmdarima's auto_arima, a grid search approach, and then my own grid search built on the ARIMA class in statsmodels.tsa.arima.model.

The plot below could indicate an AR(5).

Choosing a Model

Grid Search Using statsmodels.tsa.arima.model

Here I created a function to calculate the root mean squared error (RMSE), mean absolute percentage error (MAPE), R², AIC, and BIC. AIC and BIC are goodness-of-fit measures that both apply a penalty for the number of parameters, to discourage overfitting to the training data, which would lead to poor out-of-sample performance. The difference is in the size of the penalty: AIC adds 2k, where k is the number of parameters, while BIC adds k·log(n), where n is the number of observations. Whenever log(n) exceeds 2 (i.e., for any dataset with more than about seven observations), BIC penalizes each parameter more heavily, and the penalty grows with the size of the dataset, so BIC tends to prefer models with fewer parameters. I had the function calculate both metrics.

The function above returns the RMSE, MAPE, R², AIC, and BIC for the selected model. I then created a data frame with a column for each metric and ran a grid search over candidate models, selecting the best one based on these metrics. The end result was an ARIMA(3, 1, 5), i.e., AR(3), one difference, MA(5), which was determined to be best on both AIC and BIC. The mean absolute percentage error is 7.85%, and the R² indicates that 39% of the variability is explained by this model. This model does not include seasonality. Note that I log-transformed the data before modeling and then exponentiated the forecasts, so the results are in the original form, raw counts.

The forecasts are out of sample, using the held-out test set. Below is a plot of the actual test data, the fitted values, and the out-of-sample forecast. The forecast gets close to the shape, but if you viewed my Holt-Winters model, you saw that it explained a much higher percentage of the variation, 69.7%, with a lower error: an RMSE of 69.2 versus 98.16 for this ARIMA model.

I also used the pmdarima library's auto_arima method to run a grid search that includes seasonality, creating models with and without differencing.

Seasonal Model

pmdarima auto_arima Including Seasonality

Below is the seasonal model with differencing.

Next, look at the accuracy and goodness-of-fit calculations above. The R² has improved over the non-seasonal model, but it is still less than the 69.7% R² of the Holt-Winters model (the amount of variation accounted for), and the root mean squared error of 72.62 is greater than the Holt-Winters RMSE of 69.3.

Model Goodness of Fit

As I mentioned earlier, the ARIMA model is similar to linear regression and shares some of the same assumptions, such as normally and independently distributed residuals. The results of these tests are below.

To test these assumptions, I used the normaltest function in scipy.stats.mstats on the residuals, and the Durbin-Watson test for serial correlation. Even with the log transformation, the residuals were not normally distributed and showed evidence of serial correlation, so the model assumptions have been violated. Given that there is conditional heteroskedasticity, an ARCH/GARCH model might perform better. That will be the next model.
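The two diagnostics can be sketched as below. The residuals here are simulated white noise rather than the model's actual residuals, so they pass both checks; I use scipy.stats.normaltest, which applies the same D'Agostino-Pearson test as the mstats variant mentioned above.

```python
import numpy as np
from scipy.stats import normaltest
from statsmodels.stats.stattools import durbin_watson

# Simulated well-behaved residuals; the real model's residuals failed these.
rng = np.random.default_rng(9)
resid = rng.normal(0, 1, 200)

stat, p = normaltest(resid)  # H0: residuals are normally distributed
dw = durbin_watson(resid)    # ~2 means no first-order serial correlation
print(f"normality p-value: {p:.3f}, Durbin-Watson: {dw:.2f}")
```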

Note: I built this model based on lessons learned in the Udemy course Time Series Analysis, Forecasting and Machine Learning by "The Lazy Programmer." It's a great class that I highly recommend.
