## Tuesday, 23 February 2016

### Intro to Statistical Learning

Overview

Statistical learning addresses how a fund can build a trading model from its collection of fundamental data, for example to generate index forecasts. It does so by modelling the behaviour of an outcome (the market index value) using predictors that are related to it (the fundamental data). In mathematical terms the relationship is Y = f(X) + ε, where f is an unknown function of the predictors X and ε is an error term that is independent of the predictors and has mean 0. We aim to estimate the form of f from our observations and then evaluate the accuracy of that estimate.

Prediction

Predictive modelling is the central focus of our blog posts. It is concerned with predicting Y from a newly observed value of the predictor X. Once a model for f has been estimated, we can predict the response Y for the new observation. Different predictors and models will give different accuracies in the prediction of Y. Reducible error is the error that arises because our estimate of f is imperfect, for example through poorly chosen predictors or an unsuitable model; we aim to make it as small as possible. Irreducible error, by contrast, is always present through the error term in the original equation, which captures unmeasured influences on Y.
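The split between the two kinds of error can be written out explicitly. As a sketch, writing ε for the error term, f̂ for our estimate of f and Ŷ = f̂(X) for the resulting prediction (the hat notation is introduced here for illustration), and treating f̂ and X as fixed, the expected squared prediction error decomposes as:

$$
E\big[(Y - \hat{Y})^2\big] = \underbrace{\big[f(X) - \hat{f}(X)\big]^2}_{\text{reducible}} + \underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible}}
$$

A better estimate of f shrinks the first term, while the second term sets a floor that no model can go below.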

Inference

Inference refers to identifying the relationship between the predictor variables and the response variable Y, for instance determining which predictors are important and what type of relationship exists between each predictor and Y. Simple models such as linear ones can be understood relatively easily, but this often comes at the cost of weaker predictive power. Allowing the relationship to be non-linear can improve predictive power, though the resulting model is harder to interpret. Because knowing the actual form of f matters less than the ability to produce accurate predictions, predictive modelling receives greater emphasis in the quant finance community.

Parametric Models

These methods require the user to assume a functional form for f, most commonly a linear one. Once the form is fixed, we know which parameters need to be estimated. In a linear model, the line of best fit does not necessarily pass through the origin, so one coefficient (α) specifies the intercept on the y axis. With multiple predictors, β represents the vector of slope coefficients, which can be estimated using Ordinary Least Squares, for instance. This is easier than fitting an arbitrary, potentially non-linear function, though it comes at the expense of flexibility: the assumed form is unlikely to match the true f exactly. Adding further parameters increases flexibility, but take care to avoid overfitting so that the model follows trading signals rather than noise.
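As a minimal sketch of the parametric approach, assuming a single predictor and synthetic data, the closed-form least squares estimates of the intercept α and slope β can be computed as below. The function name `fit_ols` and the "true" values 2.0 and 0.5 are illustrative assumptions, not taken from any real dataset.

```python
import random


def fit_ols(x, y):
    """Return (alpha, beta) minimising the sum of squared residuals of y = alpha + beta * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: sample covariance of x and y divided by sample variance of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta = sxy / sxx
    # Intercept: forces the fitted line through the point of means
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Synthetic data (assumption): true alpha = 2.0, true beta = 0.5, mean-zero Gaussian noise
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(500)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]

alpha_hat, beta_hat = fit_ols(x, y)
print("alpha estimate:", alpha_hat)
print("beta estimate:", beta_hat)
```

Because the linear form is assumed up front, only two numbers need to be estimated, which is what makes the parametric route comparatively cheap.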

Non-Parametric Models

Alternatively, a non-parametric model makes no explicit assumption about the form of f. It is more flexible, though because of this many more observations are required. (Similarly, a non-parametric test is a hypothesis test that does not require the population's distribution to be characterised by certain parameters.) In terms of trading, as extensive historical data is already available, non-parametric models seem to have an advantage. Nevertheless, financial time series often have a poor signal-to-noise ratio, so overfitting can still be an issue. A balance should therefore be struck between parametric and non-parametric models.
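One simple non-parametric method is k-nearest neighbours regression, used here purely as an illustration and not as a method the post prescribes. The sketch below predicts the response at a new point by averaging the responses of the k closest training points, with no functional form assumed. The function name `knn_predict`, the sine-shaped true function and the choice k = 25 are all assumptions for the example.

```python
import math
import random


def knn_predict(x_train, y_train, x_new, k=5):
    """Predict y at x_new as the average response of the k nearest training points."""
    nearest = sorted(zip(x_train, y_train), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Synthetic data (assumption): a non-linear true function with mean-zero noise
rng = random.Random(1)
x_train = [rng.uniform(0, 2 * math.pi) for _ in range(2000)]
y_train = [math.sin(xi) + rng.gauss(0, 0.1) for xi in x_train]

# Predict near the peak of the sine curve, where the true value is 1.0
pred = knn_predict(x_train, y_train, math.pi / 2, k=25)
print("prediction at pi/2:", pred)
```

The method recovers the curved shape only because there are many observations near each query point; with sparse or very noisy data, a small k would mostly be fitting noise.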

Supervised Machine Learning

In a supervised model, for each observation of the predictors there is an associated response Y. The model is "trained" on this dataset. For example, the Ordinary Least Squares method fits a linear regression model to the training data, resulting in an estimate of β, the vector of regression coefficients.

Unsupervised Machine Learning

In an unsupervised model there are no responses Y against which to train the model or evaluate its accuracy. The technique is still useful, particularly for clustering.

Parametrised Clustering Model

This model is often used to uncover unexpected relationships in the dataset that would otherwise not be easily found. In finance, it is particularly useful for analysing volatility.
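To make the clustering idea concrete, here is a rough sketch using k-means in one dimension. Note that k-means is swapped in as a simple stand-in and is not necessarily the clustering model the post has in mind. It groups hypothetical daily volatility figures into a calm and a turbulent regime without being told which day belongs to which. The data, the function name `kmeans_1d` and the regime levels (about 1% and 4%) are assumptions.

```python
import random


def kmeans_1d(values, k=2, iterations=50):
    """Lloyd's algorithm in one dimension; returns the cluster centres in ascending order."""
    ordered = sorted(values)
    # Initialise centres at evenly spaced quantiles of the data
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest centre
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            groups[nearest].append(v)
        # Move each centre to the mean of its group (keep it if the group is empty)
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return sorted(centres)


# Hypothetical daily volatilities: 300 calm days (~1%) and 100 turbulent days (~4%)
rng = random.Random(2)
vols = [abs(rng.gauss(0.01, 0.002)) for _ in range(300)]
vols += [abs(rng.gauss(0.04, 0.005)) for _ in range(100)]

low_centre, high_centre = kmeans_1d(vols, k=2)
print("calm regime centre:", low_centre)
print("turbulent regime centre:", high_centre)
```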

Linear & Logistic Regression

Regression uses supervised machine learning to model the relationship between x and y variables. The fitted equation identifies the change in the response y when x changes, ceteris paribus. For instance, linear regression uses Ordinary Least Squares to produce parameter estimates under the assumption that y depends linearly on the x predictors. A model can predict the value of the ASX from historical data and be updated dynamically with new market data to predict the next day's price. With inference, the strength of the relationship between the price and the market data predictors can be analysed to determine why the outcome changes. In developing trading algorithms, however, the underlying relationship is less of a priority than prediction quality.

Another very common and easy-to-learn technique is logistic regression, which produces a categorical response (e.g. "positive", "negative", "up", "down") as opposed to a continuous one (e.g. stock prices). We recommend taking QBUS3830 (a Usyd Business Analytics subject) to learn about Maximum Likelihood Estimation, the statistical procedure used in logistic regression to estimate parameters. MLE finds the parameter values that maximise the likelihood function, that is, the values under which the observed data are most probable.
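The following is a minimal sketch of Maximum Likelihood Estimation for logistic regression with one predictor. It assumes synthetic up/down data and uses plain gradient ascent on the average log-likelihood, which is chosen here for readability; statistical packages typically use faster Newton-type methods. The names `sigmoid` and `fit_logistic`, the learning rate and the "true" coefficients (-0.5 and 2.0) are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, learning_rate=1.0, steps=500):
    """Estimate (b0, b1) in P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient ascent."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        # Gradient of the average log-likelihood: observed minus predicted probability
        residuals = [yi - sigmoid(b0 + b1 * xi) for xi, yi in zip(x, y)]
        b0 += learning_rate * sum(residuals) / n
        b1 += learning_rate * sum(r * xi for r, xi in zip(residuals, x)) / n
    return b0, b1


# Synthetic data (assumption): y = 1 means "up", y = 0 means "down"
rng = random.Random(5)
x = [rng.gauss(0, 1) for _ in range(2000)]
y = [1 if rng.random() < sigmoid(-0.5 + 2.0 * xi) else 0 for xi in x]

b0_hat, b1_hat = fit_logistic(x, y)
print("intercept estimate:", b0_hat)
print("slope estimate:", b1_hat)
print("P(up | x = 1):", sigmoid(b0_hat + b1_hat * 1.0))
```

The fitted model outputs a probability, which is turned into a category by applying a threshold such as 0.5.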

Classification Technique

This technique refers to supervised machine learning that tries to classify an observation into a user-specified category based on its features. Such categories may be ordered (ordinal) or unordered. Classifiers are the algorithms behind this technique and are commonly used in quant finance, particularly for predicting market direction. They can predict whether a time series will have positive or negative returns in a future period (note: not the actual value, as in a regression). The predictors themselves can be continuous. Classifiers can be linear or non-linear and include logistic regression, discriminant analysis, artificial neural networks and support vector machines.
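As one hedged illustration of a classifier, the sketch below implements linear discriminant analysis for a single continuous feature, assuming each class is Gaussian with a shared variance. It labels a new observation "up" or "down" by picking the class with the larger discriminant score. The feature, the synthetic data and the function names `fit_lda_1d` and `predict_lda_1d` are assumptions made for the example.

```python
import math
import random


def fit_lda_1d(x, labels):
    """Estimate class means, class priors and the pooled variance for one-feature LDA."""
    classes = sorted(set(labels))
    n = len(x)
    means, priors = {}, {}
    pooled_ss = 0.0
    for c in classes:
        xs = [xi for xi, li in zip(x, labels) if li == c]
        means[c] = sum(xs) / len(xs)
        priors[c] = len(xs) / n
        pooled_ss += sum((xi - means[c]) ** 2 for xi in xs)
    variance = pooled_ss / (n - len(classes))
    return means, priors, variance


def predict_lda_1d(model, x_new):
    """Return the class whose linear discriminant score is largest at x_new."""
    means, priors, variance = model

    def score(c):
        return x_new * means[c] / variance - means[c] ** 2 / (2 * variance) + math.log(priors[c])

    return max(means, key=score)


# Synthetic data (assumption): a signal that tends to be higher before "up" days
rng = random.Random(4)
x = [rng.gauss(0.5, 1.0) for _ in range(500)] + [rng.gauss(-0.5, 1.0) for _ in range(500)]
labels = ["up"] * 500 + ["down"] * 500

model = fit_lda_1d(x, labels)
print("signal = 2.0 ->", predict_lda_1d(model, 2.0))
print("signal = -2.0 ->", predict_lda_1d(model, -2.0))
```

Note that the output is a direction label rather than a predicted return, which is the distinction from regression drawn above.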

Time Series Technique

This technique is often regarded as a combination of regression and classification. Time series models make use of the chronological ordering of the series, so predictors are often derived from past and present values. The main types of time series models relevant to algorithmic trading include ARIMA and ARCH models. These concepts are covered in depth in Usyd's Predictive Analytics class (QBUS 2820). ARIMA refers to linear autoregressive integrated moving average models, which are used to model changes in the level of a time series. ARCH refers to autoregressive conditional heteroskedasticity models, which are used to model the variance/volatility of a time series, i.e. using the past behaviour of the series to predict future volatility. Stochastic volatility models differ by using more than one stochastic time series to model volatility. In a time series, asset prices are observed at discrete points in time. However, it is usual in quant finance to work with continuous-time models such as Heston Stochastic Volatility, Geometric Brownian Motion and Ornstein-Uhlenbeck. These models are explained further in the next blog post, with the aim of taking advantage of their features to form trading strategies.
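To illustrate how an ARCH model lets past behaviour drive future volatility, here is a small simulation sketch of an ARCH(1) process, in which today's conditional variance is ω + α × (yesterday's return)². The parameter values, the seed and the function name `simulate_arch1` are assumptions chosen for the example, not calibrated to any market.

```python
import math
import random


def simulate_arch1(n, omega, alpha, seed=0):
    """Simulate n returns from an ARCH(1) process; returns (returns, conditional variances)."""
    rng = random.Random(seed)
    returns, variances = [], []
    prev_return = 0.0
    for _ in range(n):
        # Conditional variance depends on the previous squared return
        var_t = omega + alpha * prev_return ** 2
        r_t = math.sqrt(var_t) * rng.gauss(0, 1)
        variances.append(var_t)
        returns.append(r_t)
        prev_return = r_t
    return returns, variances


# Illustrative parameters (assumption): omega = 0.0001, alpha = 0.3
omega, alpha = 0.0001, 0.3
returns, variances = simulate_arch1(20000, omega, alpha, seed=3)

sample_var = sum(r * r for r in returns) / len(returns)
theoretical_var = omega / (1 - alpha)  # long-run (unconditional) variance of ARCH(1)
print("sample variance:", sample_var)
print("theoretical long-run variance:", theoretical_var)
```

A large move on one day raises the next day's variance, which produces the volatility clustering seen in financial returns.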