The methods we focus on utilise binary supervised classification to predict whether the % return for a particular future day is positive or negative. The parameters that need to be evaluated for their accuracy in this case include deviation from the actual outcome and magnitude.
Hit Rate
As the name suggests, the hit rate measures how many times the correct direction was predicted on the training dataset, as a % of overall predictions. The Scikit-Learn library has a function that can calculate this as part of training.
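As a minimal sketch, the hit rate can be computed by comparing predicted and actual directions element by element. The function name `hit_rate`, the +1/-1 encoding and the sample values are assumptions for illustration only; `accuracy_score` in `sklearn.metrics` computes the same quantity.

```python
import numpy as np

def hit_rate(actual, predicted):
    """Fraction of predictions whose direction matches the actual direction."""
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    return float(np.mean(actual == predicted))

# Hypothetical directions: +1 = positive return day, -1 = negative return day
actual = [1, -1, 1, 1, -1]
predicted = [1, 1, 1, -1, -1]

rate = hit_rate(actual, predicted)
print(f"Hit rate: {rate:.0%}")
```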
Confusion Matrix
This is also known as a contingency table, and is commonly used after determining the hit rate. The matrix identifies how many times the model forecasted positive returns accurately versus how many times it forecasted negative returns accurately. The point of this is to work out whether the algorithm is more useful at predicting a certain direction, e.g. better at predicting falls in the index. An example strategy based on this involves being long or short depending on the model's bias.
Behind the scenes, the matrix counts how many Type I and Type II errors a classifier algorithm makes. A Type I error is where the model incorrectly rejects a true null hypothesis (a "false positive"), whilst a Type II error is where it incorrectly fails to reject a false null hypothesis (a "false negative"). These concepts are emphasised in QBUS2810 Statistical Modelling for Business (Usyd Business Analytics subject). Again, Scikit-Learn has a function to calculate the confusion matrix.
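The sketch below uses `confusion_matrix` from `sklearn.metrics` on a handful of made-up predictions. The +1/-1 encoding and the choice to treat an up day as the "positive" class are assumptions for illustration. Rows are the actual classes and columns are the predicted classes, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical directions: +1 = up day, -1 = down day
y_true = [1, -1, 1, 1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1]

# labels=[-1, 1] fixes the row/column order: down first, then up
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])

# With "up" as the positive class, the flattened matrix reads:
# true negatives, false positives (Type I), false negatives (Type II), true positives
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"Correct down calls: {tn}, correct up calls: {tp}")
print(f"Type I errors: {fp}, Type II errors: {fn}")
```

Comparing the two diagonal entries against their row totals shows whether the model is better at calling falls or rises.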
Factor Choice
A logical approach to selecting predictors involves identifying the fundamental drivers of asset movement. It is sometimes simpler than most people think: in the S&P500, the 500 listed companies are the drivers, as the index is composed of their value. The question is whether the past can predict the future, i.e. do prior returns hold any predictive power? We could also add fundamental macro data, e.g. employment, inflation, interest rates, earnings, etc. A more creative approach is to examine exchange rates with the countries that trade most with the US as drivers.
When considering historical asset prices in a time series, indicators include lags, e.g. k-1, k-2, k-3, ..., k-p for a daily time series with p lags. Traded volume is also a common indicator. We therefore form a (p+1)-dimensional feature vector for each day, which includes the p time lags and volume.
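One way to build such a feature vector is sketched below with NumPy. The function name `make_lagged_features` is hypothetical, and using the previous day's volume (rather than the same day's) is an assumption made here to avoid looking ahead. The label is the direction of the day being predicted.

```python
import numpy as np

def make_lagged_features(returns, volume, p):
    """Build (p+1)-dimensional rows: p lagged returns plus the prior day's volume."""
    returns = np.asarray(returns, dtype=float)
    volume = np.asarray(volume, dtype=float)
    rows = []
    for k in range(p, len(returns)):
        lags = returns[k - p:k][::-1]              # ordered k-1, k-2, ..., k-p
        rows.append(np.append(lags, volume[k - 1]))
    X = np.array(rows)
    y = np.where(returns[p:] >= 0, 1, -1)          # direction of day k
    return X, y

# Hypothetical daily % returns and traded volumes
returns = [0.1, -0.2, 0.3, 0.4, -0.5]
volume = [10, 20, 30, 40, 50]

X, y = make_lagged_features(returns, volume, p=2)
print(X)
print(y)
```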
The above information is considered micro data, related to and found within the time series itself. External macro time series can also be overlaid in the forecasts. Commodity prices can be correlated with weather, or forex rates with offshore interest rates. First, we need to work out whether the correlations are statistically significant enough to include in a trading strategy.
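A simple first check is a Pearson correlation test, sketched here with `pearsonr` from `scipy.stats`. The two series are synthetic, and the 5% significance level is an assumption; a real study would use actual macro and asset data and would also need to account for multiple testing and non-stationarity.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Synthetic stand-ins for a macro series and an asset return series
macro_changes = rng.normal(size=250)
asset_returns = 0.5 * macro_changes + rng.normal(size=250)

corr, p_value = pearsonr(macro_changes, asset_returns)
significant = p_value < 0.05   # assumed significance level

print(f"correlation={corr:.2f}, p-value={p_value:.4f}, significant={significant}")
```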
Classification Models
Below are the more commonly used models in supervised classification. We describe the techniques behind each model and in what situations they would be useful.
Logistic Regression
Logistic regression measures the relationship between continuous independent variables (lagged % returns) and a binary categorical dependent variable ("positive" or "negative"). The regression outputs a probability between 0 and 1 that the next time period will be classified as positive, based on the past % returns. We use logistic rather than linear regression because a linear model can produce predicted probabilities below 0 or above 1. The classification threshold is set at 50% by default, but this can be modified.
In mathematical notation, the probability of a positive return day (Y = U, i.e. an up day), given the previous returns L1 and L2, is:
P(Y = U | L1, L2) = e^(B0 + B1*L1 + B2*L2) / (1 + e^(B0 + B1*L1 + B2*L2))
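To make the formula concrete, the sketch below evaluates it for given coefficients. The function name `prob_up` and the coefficient values are made up for illustration and are not estimates from any data. The expression 1 / (1 + e^(-z)) is algebraically the same as e^z / (1 + e^z).

```python
import math

def prob_up(l1, l2, b0, b1, b2):
    """P(Y = U | L1, L2) under the logistic model."""
    z = b0 + b1 * l1 + b2 * l2
    return 1.0 / (1.0 + math.exp(-z))

# With all coefficients at zero the model is indifferent between up and down
p_neutral = prob_up(l1=0.5, l2=-0.3, b0=0.0, b1=0.0, b2=0.0)

# Hypothetical coefficients: a positive B1 means a positive L1 raises P(up)
p_example = prob_up(l1=0.5, l2=-0.3, b0=0.0, b1=2.0, b2=0.0)

print(p_neutral, p_example)
```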
The maximum likelihood method is used to estimate the Beta coefficients; this is handled by the Scikit-Learn library.
Logistic regression has an advantage over other models such as Naive Bayes in that there are fewer restrictions on correlation between features. Because its output is a probability, it is well suited to situations where a threshold is applied, rather than automatically selecting the most probable category, so that only stronger predictions are acted on. For instance, we might require a probability of 75% before acting, rather than accepting a "positive" prediction that is only 51%.
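The thresholding idea can be sketched with Scikit-Learn's `LogisticRegression` and its `predict_proba` method. The data here is synthetic, and the 75% threshold and the 0 = "no position" signal are assumptions chosen for illustration rather than a recommended strategy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic lagged returns L1, L2 and a direction that partly depends on L1
X = rng.normal(0, 1, size=(500, 2))
y = np.where(X[:, 0] + rng.normal(0, 1, 500) > 0, 1, -1)

model = LogisticRegression()
model.fit(X[:400], y[:400])          # coefficients estimated by maximum likelihood

# Column of predict_proba that corresponds to the "up" class (+1)
up_col = list(model.classes_).index(1)
p_up = model.predict_proba(X[400:])[:, up_col]

# Only act when the model is at least 75% confident in either direction
threshold = 0.75
signals = np.where(p_up >= threshold, 1,
                   np.where(p_up <= 1 - threshold, -1, 0))

print(f"Acted on {np.count_nonzero(signals)} of {len(signals)} days")
```

A higher threshold trades fewer signals for greater confidence in each one.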
Discriminant Analysis
Discriminant analysis is stricter in its assumptions than logistic regression, though if those assumptions hold then its prediction power is stronger.
Linear Discriminant Analysis models the distribution of the lagged-return variables (L1, L2, ...) separately for each outcome, using Bayes' Theorem to obtain the probability. This theorem describes how the probability of a cause given an outcome can be calculated from the prior probability of each cause and the conditional probability of the outcome given each cause. Predictors are assumed to follow a multivariate Gaussian (normal) distribution, and the estimated parameters are used in Bayes' Theorem to predict which class an observation should be classified under. Assumptions include that all outcomes ("positive"/"negative" or "up"/"down") share a common covariance matrix, the estimation of which is handled by Python's scikit-learn library. Covariance measures how two variables move together.
Quadratic Discriminant Analysis differs in that it assumes each outcome has a separate covariance matrix. We use this analysis when the decision boundaries are non-linear and when there are enough training observations that reducing variance is not a priority. The choice between linear and quadratic discriminant analysis therefore comes down to bias versus variance.
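The difference between the two can be sketched with Scikit-Learn's `LinearDiscriminantAnalysis` and `QuadraticDiscriminantAnalysis`. The data is synthetic and deliberately built so that the two classes share a mean but have different spreads, which violates the common-covariance assumption; it is an illustration, not a claim about real return data.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(1)
n = 1000

# Synthetic two-lag features: same mean, but "down" days have a wider spread
X_up = rng.normal(0.0, 0.5, size=(n, 2))
X_down = rng.normal(0.0, 2.0, size=(n, 2))
X = np.vstack([X_up, X_down])
y = np.array([1] * n + [-1] * n)

# Shuffle, then hold out the last 600 observations for testing
order = rng.permutation(len(y))
X, y = X[order], y[order]
X_train, X_test = X[:1400], X[1400:]
y_train, y_test = y[:1400], y[1400:]

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

lda_hit = lda.score(X_test, y_test)   # mean accuracy, i.e. the hit rate
qda_hit = qda.score(X_test, y_test)
print(f"LDA hit rate: {lda_hit:.2f}, QDA hit rate: {qda_hit:.2f}")
```

On data like this the linear boundary has little to work with, while the quadratic boundary can separate the tight cluster from the wide one. When the covariance matrices really are similar, the simpler linear model would be expected to generalise better.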
Support Vector Machines
Support Vector Classifiers try to find a linear separation boundary that can accurately classify most of the observations into distinct groups. This can work if the class separation is mainly linear, though in other cases further techniques are required to enable non-linear decision boundaries. Support Vector Machines have the advantage of enabling non-linear expansion whilst still keeping computations efficient. How? Instead of using a fully linear separating boundary, we can modify the kernel used, for example to a quadratic or higher order polynomial, and thereby introduce non-linearity in the boundaries. This makes them relatively flexible models, though the right kernel must be selected for optimum results. In real life applications, Support Vector Machines are useful in fields such as text classification where there is high dimensionality, though drawbacks include complex computations, difficulty in fine tuning and difficulty in model interpretation.
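The effect of changing the kernel can be sketched with Scikit-Learn's `SVC`. The data comes from `make_circles`, a synthetic dataset of two concentric rings that no straight line can separate; it stands in for any problem with a non-linear class boundary and is not financial data.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: one class inside the other
X, y = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same classifier, two kernels: a linear boundary vs a quadratic polynomial one
linear_svc = SVC(kernel="linear").fit(X_train, y_train)
poly_svc = SVC(kernel="poly", degree=2).fit(X_train, y_train)

linear_hit = linear_svc.score(X_test, y_test)
poly_hit = poly_svc.score(X_test, y_test)
print(f"linear kernel: {linear_hit:.2f}, quadratic kernel: {poly_hit:.2f}")
```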
Decision Trees and Random Forests
Decision trees use a tree structure to allocate observations into recursive subsets via a decision at each tree node. This can be illustrated with a sample scenario. If we ask whether yesterday's price was above or below a certain level, this creates two subsets. We could then ask whether volume was above or below a certain level, forming four separate subsets. This continues until further partitioning no longer improves predictive power. An advantage of a decision tree is that it is naturally interpretable relative to the "behind the scenes" approach that Discriminant Analysis or Support Vector Machines use.
Decision Trees have several other advantages, such as the ability to handle interactions between features and being non-parametric. They are also useful when it is difficult to linearly separate data into classes, an assumption required by linear support vector classifiers.
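The two-question scenario can be sketched with Scikit-Learn's `DecisionTreeClassifier`, with `export_text` printing the learned rules. The data is synthetic, with an up day defined by an invented rule on the previous return and volume, and the feature names `prev_return` and `prev_volume` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(7)
n = 500

# Synthetic features and an invented rule: "up" only if both are above zero
prev_return = rng.normal(0, 1, n)
prev_volume = rng.normal(0, 1, n)
X = np.column_stack([prev_return, prev_volume])
y = np.where((prev_return > 0) & (prev_volume > 0), 1, -1)

# Two levels of questions, as in the scenario above
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The fitted tree can be read directly as a set of if/else rules
rules = export_text(tree, feature_names=["prev_return", "prev_volume"])
print(rules)

tree_hit = tree.score(X, y)   # in-sample hit rate
```

The printed rules are what makes the model interpretable: each path from the root to a leaf is a plain sequence of threshold checks.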
The disadvantage of using individual decision trees is that they are prone to over-fitting (high variance). A newer field in classification involves ensemble learning, where a large number of classifiers are created using the same model and trained with differing parameters. The results are then combined and averaged, with the goal of achieving a prediction accuracy greater than that of any single classifier.
One of the most popular ensemble learning techniques is the Random Forest (constantly a popular topic in Quantopian Forums, and arguably the best classifier to use in machine learning competitions). Scikit-Learn has a RandomForestClassifier class that enables predictions from thousands of decision tree learners to be combined. The main parameters associated with the RandomForestClassifier include n_jobs, which refers to how many processing cores to spread calculations over, and n_estimators, which refers to how many decision trees to form. These features will be discussed in future blog posts.
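A minimal sketch of the two parameters in use is shown below. The data is synthetic, and the values `n_estimators=200` and `n_jobs=1` are arbitrary choices for illustration; `n_jobs=-1` would use all available cores.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)

# Synthetic features with an interaction effect plus noise
X = rng.normal(0, 1, size=(600, 3))
y = np.where(X[:, 0] * X[:, 1] + 0.5 * rng.normal(size=600) > 0, 1, -1)

forest = RandomForestClassifier(
    n_estimators=200,   # number of decision trees to form
    n_jobs=1,           # number of processing cores to spread the work over
    random_state=0,
)
forest.fit(X[:450], y[:450])

forest_hit = forest.score(X[450:], y[450:])   # hit rate on held-out data
print(f"{len(forest.estimators_)} trees, hit rate {forest_hit:.2f}")
```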
Principal Components Analysis
This is an example of an unsupervised learning technique, where the algorithm identifies features by itself. We would use this technique if we wanted to filter the problem down to only the important dimensions, to find topics in large amounts of textual data, or to find features that unexpectedly hold predictive power in time series analysis.
Principal Components Analysis uses the correlation structure between variables to transform a set of potentially correlated variables into a set of linearly uncorrelated variables. These variables are the Principal Components, which are then ranked by the amount of variability they describe. So if we have many dimensions, we can reduce the feature space through principal components analysis to just the 2 or 3 components that capture almost all the variance in the data, resulting in stronger parameters being fed through to a supervised classifier model.
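The reduction can be sketched with Scikit-Learn's `PCA`. The ten features here are synthetic and are deliberately generated from two hidden factors plus a little noise, so that two components are enough; real data will rarely be this clean.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)

# Ten correlated features driven by two hidden factors (synthetic)
factors = rng.normal(size=(500, 2))
loadings = rng.normal(size=(2, 10))
X = factors @ loadings + 0.1 * rng.normal(size=(500, 10))

pca = PCA(n_components=2)
components = pca.fit_transform(X)    # 500 x 2 matrix of principal components

# Share of total variance described by the two retained components
explained = pca.explained_variance_ratio_.sum()

# The retained components are linearly uncorrelated with each other
component_corr = np.corrcoef(components.T)[0, 1]

print(f"variance explained: {explained:.3f}, correlation: {component_corr:.2e}")
```

The `components` array could then be passed to a supervised classifier in place of the original ten columns.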
Multinomial Naive Bayes Classifier
Naive Bayes is useful when we have a limited dataset, as it is a high-bias classifier which assumes conditional independence between features. This means it is unable to identify interactions between individual features unless they are added as extra features. For instance, in sentiment analysis, document classification is common due to the qualitative nature of the data. Naive Bayes learns which categories individual words are associated with, but a phrase or piece of slang whose meaning differs from that of its individual words cannot be captured this way. Instead, we treat the phrase or slang as an additional feature in its own right, rather than associating it with the categories its individual words are classified under.
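One way to add phrases as extra features is sketched below using Scikit-Learn's `CountVectorizer` with `ngram_range=(1, 2)`, which counts two-word phrases alongside single words, feeding a `MultinomialNB` classifier. The headlines and their +1/-1 sentiment labels are invented purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented headlines with invented sentiment labels (+1 positive, -1 negative)
docs = [
    "shares rally on strong earnings",
    "profit beats forecasts",
    "shares plunge on weak earnings",
    "profit misses forecasts",
    "dead cat bounce fades",
    "bounce in shares continues",
]
labels = [1, 1, -1, -1, -1, 1]

# ngram_range=(1, 2) keeps single words and adds each two-word phrase
# (e.g. "dead cat") as a separate feature
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

clf = MultinomialNB().fit(X, labels)

new_doc = ["dead cat bounce"]
prediction = clf.predict(vectorizer.transform(new_doc))[0]
print(prediction)
```

With the phrase counted as its own feature, "bounce" appearing inside "dead cat bounce" no longer has to carry the same meaning as "bounce" on its own.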
Conclusion
If tick-level data is available, the type of classifier applied matters less; rather, technical efficiency and algorithm scalability become the priorities to address. When the dataset is substantial, the marginal gain in performance from a different classifier diminishes. Note that classifiers often have differing assumptions, so if a specific classifier exhibits poor performance, it is usually due to a violation of its assumptions.