Machine Learning Model Validation

I just came across an excellent and highly relevant piece of research, "A comparison of machine learning model validation schemes for non-stationary time series data" by Matthias Schnaubelt. Features such as non-stationarity, concept drift, and structural breaks present serious modelling challenges, and validating ML time series models properly requires choosing the right validation strategy.

Dr Schnaubelt writes:
Using cross-validation for time-series applications comes at a great risk. While theoretically applicable, we find that random cross-validation often is associated with the largest bias and variance when compared to all other validation schemes. In most cases, blocked variants of cross-validation have a similar or better performance, and should therefore be preferred if cross-validation is to be used. If global stationarity is perturbed by non-periodic changes in autoregression coefficients, we find that forward-validation may be preferred over cross-validation. Within forward-validation schemes, we find that rolling-origin and growing-window schemes often achieve the best performance. A closer look on the effect of the perturbation strength reveals that there exist three performance regimes: For small perturbations, cross- and forward-validation methods perform similarly. For intermediate perturbation strengths, forward-validation performs better. For still higher perturbation strengths, last-block validation performs best.

While I had intuited and used some of this in practice, it is great that someone has thoroughly analyzed the topic. If you are using machine learning in finance and want to weigh in with your experience with these or other validation strategies, please leave a comment.
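For readers who want to see the difference between these schemes concretely, here is a minimal sketch using scikit-learn on a made-up toy series (the data, split counts, and window size are my own illustrative choices, not the paper's setup). It shows how random cross-validation, blocked cross-validation, growing-window forward-validation, and rolling-origin forward-validation carve up the same sample.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Hypothetical toy series: 100 observations with a single feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = rng.normal(size=100)

# Random cross-validation: folds are shuffled, so training folds can
# contain observations from the "future" of the test fold. This is the
# scheme the paper finds to carry the largest bias and variance.
random_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Blocked cross-validation: folds are contiguous blocks in time order
# (KFold without shuffling), which limits leakage to the fold boundaries.
blocked_cv = KFold(n_splits=5, shuffle=False)

# Growing-window forward-validation: the training set always precedes
# the test block and grows with each split.
growing_window = TimeSeriesSplit(n_splits=5)

# Rolling-origin forward-validation: like above, but the training window
# is capped so it rolls forward instead of growing.
rolling_origin = TimeSeriesSplit(n_splits=5, max_train_size=40)

for name, splitter in [("random CV", random_cv),
                       ("blocked CV", blocked_cv),
                       ("growing window", growing_window),
                       ("rolling origin", rolling_origin)]:
    # Inspect the last split of each scheme to see which indices it trains
    # and tests on.
    train_idx, test_idx = list(splitter.split(X, y))[-1]
    print(f"{name:15s} train {train_idx.min():3d}-{train_idx.max():3d} "
          f"({len(train_idx)} obs), test {test_idx.min():3d}-{test_idx.max():3d}")
```

Running it makes the leakage issue visible: under random CV the training indices span the whole series, including points after the test observations, while the forward-validation schemes always train strictly on the past.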

Weekly market report

Wall Street delivered a mixed bag of news, with VIX, VNKY, and VSTOXX and their underlying markets almost unchanged. VXD - volatility index based...