Ten Basic Machine Learning Algorithms


Alan Turing, an English mathematician, computer scientist, logician, and cryptanalyst, surmised about machines that:
“It would be like a pupil who had learnt much from his master but had added much more by his own work. When this happens I feel that one is obliged to regard the machine as showing intelligence.”

To give you an example of the impact of machine learning, Man Group's AHL Dimension programme is a $5.1 billion hedge fund that is partially managed by AI. By 2015, its machine learning algorithms were contributing more than half of the fund's profits, even though they managed a far smaller share of its assets.

Machine Learning in Trading

After reading this blog, you will understand the basic logic behind some popular and remarkably useful machine learning algorithms. The trading community already uses them, and they can serve as the foundation on which you build your own machine learning strategies. They are:

Linear Regression

Linear regression was initially developed in statistics to study the relationship between numerical input and output variables. The machine learning community adopted it to make predictions based on the linear regression equation.

The mathematical representation of linear regression is a linear equation that combines a specific set of input values (x) to predict the output value (y) for that set of inputs. The equation assigns a factor to each input variable. These factors are called coefficients and are represented by the Greek letter beta (β).

The equation below represents a linear regression model with two input variables, x1 and x2. Here y is the output of the model, and β0, β1 and β2 are the coefficients of the linear equation.

y = β0 + β1x1 + β2x2

When there is only one input variable, the linear equation represents a straight line. For simplicity, take β2 to be zero, so that the variable x2 does not influence the output of the model. The linear regression then reduces to the straight line shown below.

y = β0 + β1x1

A graph of the linear regression model is shown below.

[Figure: Linear regression graph]

Linear regression can be used to find the general price trend of a stock over a period of time. The sign of the fitted slope tells us whether the price movement is positive or negative.
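As a rough illustration of the trend idea, the sketch below fits y = β0 + β1x1 by ordinary least squares to a handful of closing prices. The prices, the variable names and the `fit_line` helper are made up for this example, and least squares is assumed as the fitting method.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = b0 + b1 * x (one input variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical closing prices over five trading days.
days = [0, 1, 2, 3, 4]
closes = [100.0, 101.5, 101.0, 103.0, 104.5]

b0, b1 = fit_line(days, closes)
trend = "up" if b1 > 0 else "down"
print(f"intercept={b0:.2f}, slope={b1:.2f}, trend={trend}")
```

A positive slope suggests an upward trend over the window and a negative slope a downward one. The result depends heavily on how long a window is chosen.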

Logistic regression

In logistic regression, our aim is to produce a discrete value, either 1 or 0. This gives us a definite answer for the scenario we are modelling.

Logistic regression can be mathematically represented as:

[Figure: Logistic regression mathematical representation]

The logistic regression model computes a weighted sum of the input variables, as linear regression does, but it passes the result through a special non-linear function, the logistic (or sigmoid) function, to produce the output y.

The sigmoid/logistic function is given by the following equation.

y = 1 / (1 + e^(-x))

[Figure: Logistic regression sigmoid function]

In simple terms, logistic regression can be used to predict the direction of the market, for example whether the next move is up (1) or down (0).
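A minimal sketch of the prediction step described above: a weighted sum of the inputs is passed through the sigmoid and then thresholded to give 1 or 0. The coefficients, the input values and the 0.5 threshold are hypothetical; in practice the coefficients would be learned from data.

```python
import math


def sigmoid(x):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(inputs, coefficients, intercept, threshold=0.5):
    """Return (probability, label) for one observation."""
    # Weighted sum, exactly as in linear regression.
    z = intercept + sum(b * x for b, x in zip(coefficients, inputs))
    probability = sigmoid(z)
    # Discrete output: 1 (e.g. "up") or 0 (e.g. "down").
    return probability, int(probability >= threshold)


# Hypothetical, already-fitted coefficients and two observations.
coefficients = [0.8, -0.5]
intercept = 0.1

prob_up, label_up = predict([1.2, 0.4], coefficients, intercept)
prob_down, label_down = predict([-1.0, 0.9], coefficients, intercept)
print(label_up, round(prob_up, 3))
print(label_down, round(prob_down, 3))
```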

KNN Classification

The purpose of K nearest neighbours (KNN) classification is to separate data points into different classes based on a similarity measure (e.g. a distance function).

KNN learns as it goes: it does not need an explicit training phase, and it classifies each data point by a majority vote of its neighbours.

The object is assigned to the class that is most common among its k nearest neighbours.

Let’s consider the task of classifying a green circle into class 1 or class 2. Start with KNN based on the 1-nearest neighbour. In this case, KNN will classify the green circle into class 1. Now let’s increase the number of nearest neighbours to 3, i.e. 3-nearest neighbour. As you can see in the figure, there are two class 2 objects and one class 1 object inside the circle. KNN will classify the green circle as a class 2 object, because class 2 forms the majority.

[Figure: KNN classification]
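The green-circle example can be reproduced with a few lines of code. This is only a sketch: the coordinates are invented so that the single nearest neighbour belongs to class 1 while two of the three nearest belong to class 2, and Euclidean distance is assumed as the similarity measure.

```python
import math
from collections import Counter


def knn_classify(query, points, labels, k):
    """Assign `query` to the most common class among its k nearest points."""
    # Rank all known points by Euclidean distance to the query.
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled points; the query plays the role of the green circle.
points = [(0.5, 0.0), (0.0, 1.0), (-1.2, 0.0), (3.0, 3.0), (-3.0, 2.5)]
labels = [1, 2, 2, 1, 2]
green_circle = (0.0, 0.0)

class_with_k1 = knn_classify(green_circle, points, labels, k=1)
class_with_k3 = knn_classify(green_circle, points, labels, k=3)
print(class_with_k1, class_with_k3)
```

The answer changes with k, which is why k is usually treated as a parameter to be tuned rather than fixed in advance.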

Support Vector Machine (SVM)

Support Vector Machine was initially used for data analysis. First, a set of training examples, each belonging to one of two categories, is fed into the SVM algorithm. The algorithm then builds a model that assigns new data to one of the categories it learned in the training phase.

In the SVM algorithm, a hyperplane is created which serves as a demarcation between the categories. When the SVM algorithm processes a new data point, it classifies the point into one of the classes depending on the side of the hyperplane on which the point falls.

[Figure: Support Vector Machine]

In trading, an SVM algorithm can be built that categorises equity data into favourable buy, sell or neutral classes and then classifies the test data according to the rules it has learned.
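The sketch below shows only the classification step: deciding on which side of a hyperplane a new point falls. The weights and bias stand in for a hyperplane that training would normally produce and are purely hypothetical. For brevity it uses two classes ("buy" and "sell") and leaves out the neutral class.

```python
def classify(point, weights, bias):
    """Classify a point by the side of the hyperplane w . x + b = 0 it lies on."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return "buy" if score >= 0 else "sell"


# Hypothetical hyperplane, as if learned in a training phase.
weights = [1.0, -1.0]
bias = 0.0

# Two new data points with two made-up features each.
first = classify([2.0, 1.0], weights, bias)   # score = 1.0
second = classify([1.0, 3.0], weights, bias)  # score = -2.0
print(first, second)
```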

Decision Trees

Decision trees are a tree-like support tool that can be used to represent a cause and its effects. Since one cause can have multiple effects, we list them out, much like a tree with its branches.

[Figure: Decision trees]

We can build a decision tree by organising the input data and predictor variables according to criteria that we specify.

The main steps to build a decision tree are:

  1. Retrieve market data for a financial instrument.
  2. Introduce the predictor variables (e.g. technical indicators, sentiment indicators, breadth indicators).
  3. Set up the target variable, i.e. the desired output.
  4. Split the data into training and test sets.
  5. Generate the decision tree by training the model.
  6. Test and analyse the model.

The disadvantage of decision trees is that they are prone to overfitting due to their inherent design structure.
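To make "criteria that we specify" concrete, the sketch below picks the single best split on one predictor using Gini impurity. The choice of Gini impurity, the RSI-style indicator values and the up/down labels are all assumptions made for this illustration. A full tree would repeat this search recursively on each side of the split.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(xs, ys):
    """Return (threshold, weighted impurity) of the best split `x <= threshold`."""
    best = (None, float("inf"))
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical predictor (an RSI-like indicator) and target (1 = price rose).
rsi = [22, 28, 35, 41, 58, 63, 71, 80]
price_rose = [1, 1, 1, 1, 0, 0, 0, 0]

threshold, impurity = best_split(rsi, price_rose)
print(threshold, impurity)
```

On data this clean one split separates the classes perfectly. On noisy market data a tree keeps splitting until it memorises the training set, which is the overfitting risk mentioned above.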

Random Forest

The random forest algorithm was designed to address some of the limitations of decision trees.

A random forest comprises many decision trees, which are graphs of decisions representing a course of action or a statistical probability. The individual trees are typically Classification and Regression Trees (CART), and their outputs are combined into a single prediction.

To classify an object based on its attributes, each tree gives a classification, which is said to “vote” for that class. The forest then chooses the classification with the greatest number of votes. For regression, it takes the average of the outputs of the different trees.

[Figure: Random forest]

Random Forest works in the following way:

  1. Let the number of cases in the training set be N. A sample of N cases is drawn at random, with replacement, and used as the training set for growing a tree.
  2. Let M be the number of input variables. A number m is chosen such that m < M. At each node, m variables are selected at random out of the M, and the best split on these m variables is used to split the node. The value of m is held constant as the trees are grown.
  3. Each tree is grown as large as possible.
  4. New data is predicted by aggregating the predictions of the n trees (i.e., majority vote for classification, average for regression).
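The sampling and aggregation steps above can be sketched without growing any actual trees. The helper names, the seed and the toy predictions are all hypothetical; the tree-growing step itself is omitted.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(cases, rng):
    """Step 1: draw N cases at random, with replacement, from the N cases."""
    return [rng.choice(cases) for _ in cases]


def random_feature_subset(num_features, m, rng):
    """Step 2: pick m of the M feature indices at random for one node."""
    return sorted(rng.sample(range(num_features), m))


def aggregate(predictions, task="classification"):
    """Step 4: majority vote for classification, average for regression."""
    if task == "classification":
        return Counter(predictions).most_common(1)[0][0]
    return mean(predictions)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
cases = list(range(10))

sample = bootstrap_sample(cases, rng)
features = random_feature_subset(num_features=6, m=2, rng=rng)

# Pretend three trees have already produced these predictions.
vote = aggregate(["buy", "sell", "buy"])
average = aggregate([1.0, 2.0, 3.0], task="regression")
print(vote, average)
```

Because each tree sees a different bootstrap sample and different candidate variables at each node, the trees make different errors, and aggregating them tends to reduce the overfitting of any single tree.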

Artificial Neural Network

In our quest to play God, the artificial neural network is one of our crowning achievements. We have created multiple interconnected nodes, as shown in the image, which mimic the neurons in our brain. In simple terms, each neuron takes in information from another neuron, performs work on it, and passes the result to another neuron as output.

[Figure: Artificial neural network]

Each circular node represents an artificial neuron, and each arrow represents a connection from the output of one neuron to the input of another.

Neural networks can be more useful if we use them to find interdependencies between various asset classes, rather than trying to predict a buy or sell choice.
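The "takes in information, performs work on it, passes it on" description can be sketched as a weighted sum followed by an activation function. The sigmoid activation, the layer sizes and every weight below are assumptions chosen for illustration; real networks learn their weights from data.

```python
import math


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, then sigmoid activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def layer(inputs, weight_rows, biases):
    """A layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
inputs = [0.5, -1.0]
hidden = layer(inputs, [[1.0, 2.0], [-1.0, 0.5]], [0.0, 0.0])
output = neuron(hidden, [1.0, -1.0], 0.0)
print(hidden, output)
```

Each arrow in the figure corresponds to one weight here: the outputs of the hidden neurons become the inputs of the output neuron.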

K-means Clustering

In this machine learning algorithm, the goal is to group the data points according to their similarity. We do not define the clusters before running the algorithm. Instead, the algorithm finds them as it goes.

As a simple example, given data on football players, we can use K-means clustering to group them according to their similarity. The resulting clusters could reflect, say, a striker's tendency to score from free kicks or a defender's rate of successful tackles, even though the algorithm is given no pre-defined labels to start with.

K-means clustering would benefit traders who feel that there might be similarities between different assets that cannot be seen on the surface.
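A bare-bones sketch of the idea, assuming the standard alternation between assigning points to the nearest centroid and recomputing centroids. The data points are invented, and the centroids are initialised naively from the first k points. Practical implementations use more careful initialisation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=10):
    """Return (centroids, labels) after a fixed number of iterations."""
    centroids = list(points[:k])  # naive initialisation (an assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical two-feature data with two visibly separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.8, 1.1), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

No labels were supplied: the grouping comes entirely from the distances between points.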

Naive Bayes theorem

If you remember basic probability, you will know that Bayes' theorem gives the probability of an event based on prior knowledge of conditions related to that event.

For example, to estimate the probability that you will be late to the office, one would like to know whether you face any traffic on the way.

The Naive Bayes classifier, however, assumes that the events used as evidence are independent of each other given the outcome, which simplifies the calculations to a large extent. Initially thought of as nothing more than an academic exercise, Naive Bayes has been shown to work remarkably well in the real world too.

The Naive Bayes algorithm can be used to find simple relationships between different parameters without having complete data.
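The late-to-the-office example can be worked through numerically. Every probability below is invented for illustration. The second call adds a made-up "rain" observation to show how the independence assumption lets the likelihoods simply be multiplied.

```python
def posterior(prior, likelihoods_if_true, likelihoods_if_false):
    """P(event | evidence) via Bayes' theorem with the naive independence assumption.

    `likelihoods_if_true[i]` is P(evidence_i | event) and
    `likelihoods_if_false[i]` is P(evidence_i | not event).
    """
    joint_true = prior
    for p in likelihoods_if_true:
        joint_true *= p
    joint_false = 1.0 - prior
    for p in likelihoods_if_false:
        joint_false *= p
    return joint_true / (joint_true + joint_false)


# Hypothetical numbers: P(late) = 0.1, P(traffic | late) = 0.8,
# P(traffic | on time) = 0.2.
p_late_given_traffic = posterior(0.1, [0.8], [0.2])

# Adding a second, assumed-independent observation: P(rain | late) = 0.6,
# P(rain | on time) = 0.3.
p_late_given_traffic_and_rain = posterior(0.1, [0.8, 0.6], [0.2, 0.3])

print(round(p_late_given_traffic, 3), round(p_late_given_traffic_and_rain, 3))
```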

Recurrent Neural Networks (RNN)

Did you know Siri and Google Assistant use RNNs in their programming? RNNs are essentially a type of neural network with a memory attached to each node, which makes it easy to process sequential data, i.e. data in which one unit depends on the previous one.

One way to explain the advantage of an RNN over a normal neural network is to suppose we are processing a word character by character. If the word is “trading”, a normal neural network node would have forgotten the character “t” by the time it moves to “d”, whereas a recurrent neural network remembers the character because it has its own memory.

[Figure: Recurrent neural network]
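The “trading” example can be imitated with a single recurrent unit whose hidden state is fed back in at every step. This is a toy sketch: the character encoding, the tanh activation and the two weights are assumptions, and nothing here is trained. Setting the recurrent weight to zero removes the memory and mimics a node that only sees the current character.

```python
import math


def run_rnn(word, w_x=0.5, w_h=0.9):
    """Process a lowercase word character by character; return the final hidden state."""
    h = 0.0  # the unit's memory, empty at the start
    for ch in word:
        x = (ord(ch) - ord("a") + 1) / 26.0  # crude numeric encoding of a letter
        # New state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h)
    return h


# With memory, the first letter still influences the final state.
with_memory_t = run_rnn("trading")
with_memory_g = run_rnn("grading")

# Without memory (w_h = 0), only the last letter matters.
no_memory_t = run_rnn("trading", w_h=0.0)
no_memory_g = run_rnn("grading", w_h=0.0)

print(with_memory_t != with_memory_g, no_memory_t == no_memory_g)
```

The two words differ only in their first letter, yet the recurrent unit ends in different states for them, while the memoryless version cannot tell them apart.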

Conclusion

According to a study by Preqin, 1,360 quantitative funds are known to use computer models in their trading process, representing 9% of all funds. Firms offer cash prizes for an individual's machine learning strategy if it makes money in the test phase and, in fact, invest their own money to take it into the live trading phase. Thus, in the race to stay one step ahead of the competition, everyone from billion-dollar hedge funds to individual traders is trying to understand and implement machine learning in their trading strategies.

You can go through the AI in Trading course on Quantra to learn these algorithms in detail and to apply them in live markets.

You can also enrol in the online machine learning course on Quantra, which covers classification algorithms, performance measures in machine learning, hyper-parameters, and the building of supervised classifiers.


Disclaimer: All investments and trading in the stock market involve risk. Any decision to place trades in the financial markets, including trading in stock or options or other financial instruments, is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies and related information mentioned in this article are for informational purposes only.