Ten Basic Machine Learning Algorithms


Alan Turing, an English mathematician, computer scientist, logician, and cryptanalyst, surmised about machines that:
“It would be like a pupil who had learnt much from his master but had added much more by his own work. When this happens I feel that one is obliged to regard the machine as showing intelligence.”

To give you an example of the impact of machine learning, Man Group's AHL Dimension programme is a $5.1 billion hedge fund which is partially managed by AI. By 2015, a few years after it started, its machine learning algorithms were contributing more than half of the fund's profits even though the assets they managed were far smaller.


[Figure: Machine Learning in Trading]

After reading this blog, you will be able to understand the basic logic behind some popular and incredibly resourceful machine learning algorithms which have been used by the trading community, and which also serve as the foundation stone on which you can build your own machine learning algorithms. They are:


Linear Regression

Initially developed in statistics to study the relationship between input and output numerical variables, it was adopted by the machine learning community to make predictions based on the linear regression equation.

The mathematical representation of linear regression is a linear equation that combines a specific set of input data (x) to predict the output value (y) for that set of input values. The linear equation assigns a factor to each input value; these factors are called coefficients and are represented by the Greek letter beta (β).

The equation mentioned below represents a linear regression model with two sets of input values, x1 and x2. y represents the output of the model, β0, β1 and β2 are the coefficients of the linear equation.


y = β0 + β1x1 + β2x2

When there is only one input variable, the linear equation represents a straight line. For simplicity, consider β2 to be equal to zero, which would imply that the variable x2 will not influence the output of the linear regression model. In this case, the linear regression will represent a straight line and its equation is shown below.

y = β0 + β1x1

A graph of the linear regression model is shown below.

[Figure: Linear regression graph]

Linear regression can be used to find the general price trend of a stock over a period of time. This helps us understand if the price movement is positive or negative.
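
To make this concrete, here is a minimal sketch assuming scikit-learn is available (the library choice and the price series are illustrative assumptions, not from the article): it fits y = β0 + β1x1 with the day index as x1 and reads the general trend off the sign of the slope.

```python
# Minimal sketch, assuming scikit-learn: fit a straight line through a
# hypothetical closing-price series and read the trend from the slope.
import numpy as np
from sklearn.linear_model import LinearRegression

prices = np.array([101.2, 102.5, 101.8, 103.1, 104.0, 103.6, 105.2, 106.1])  # made-up closes
days = np.arange(len(prices)).reshape(-1, 1)   # x1 = time index

model = LinearRegression()
model.fit(days, prices)                        # estimates β0 (intercept_) and β1 (coef_)

print("intercept (β0):", model.intercept_)
print("slope (β1):", model.coef_[0])
print("general trend:", "positive" if model.coef_[0] > 0 else "negative")
```

A positive slope corresponds to an overall upward price movement over the window, a negative slope to a downward one.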

Logistic Regression

In logistic regression, our aim is to produce a discrete value, either 1 or 0. This helps us find a definite answer to the scenario at hand.

Logistic regression can be mathematically represented as,

y = 1 / (1 + e^-(β0 + β1x1 + β2x2))

The logistic regression model computes a weighted sum of the input variables, similar to linear regression, but it runs the result through a special non-linear function, the logistic or sigmoid function, to produce the output y.

The sigmoid/logistic function is given by the following equation.

y = 1 / (1 + e^-x)

[Figure: The sigmoid (logistic) function]

In simple terms, logistic regression can be used to predict the direction of the market.
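
As a rough illustration, the sketch below (assuming scikit-learn; the two features and all numbers are made up) fits a logistic regression that maps a weighted sum of the inputs through the sigmoid to a probability, and then to a discrete 0/1 market direction.

```python
# Minimal sketch, assuming scikit-learn: predict market direction (1 = up,
# 0 = down) from two hypothetical features, e.g. yesterday's return and a
# short momentum value. The data below is illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[ 0.010,  0.03],
              [-0.004, -0.01],
              [ 0.006,  0.02],
              [-0.012, -0.04],
              [ 0.003,  0.01],
              [-0.007, -0.02]])
y = np.array([1, 0, 1, 0, 1, 0])               # observed next-day direction

clf = LogisticRegression()
clf.fit(X, y)

# The weighted sum is passed through the sigmoid 1 / (1 + e^-z), so
# predict_proba stays between 0 and 1; predict rounds it to a 0/1 answer.
print(clf.predict_proba([[0.005, 0.015]]))     # [P(down), P(up)]
print(clf.predict([[0.005, 0.015]]))           # discrete direction
```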

KNN Classification

The purpose of the K nearest neighbours (KNN) classification is to separate the data points into different classes so that we can classify them based on similarity measures (e.g. distance function).

KNN learns as it goes, in the sense that it does not need an explicit training phase; it classifies a data point by a majority vote of its neighbours.

The object is assigned to the class which is most common among its k nearest neighbours.

Let’s consider the task of classifying a green circle into class 1 or class 2. Consider first the case of KNN based on the 1-nearest neighbour. In this case, KNN will classify the green circle into class 1. Now let’s increase the number of nearest neighbours to 3, i.e. the 3-nearest neighbours. As you can see in the figure, there are two class 2 objects and one class 1 object inside the circle, so KNN will classify the green circle into class 2, as that class forms the majority among its neighbours.

[Figure: KNN classification]
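
To mirror the figure numerically, here is a minimal sketch assuming scikit-learn; the coordinates are made up so that the single nearest neighbour of the "green circle" belongs to class 1 while two of its three nearest neighbours belong to class 2.

```python
# Minimal sketch, assuming scikit-learn: the same point is assigned to
# class 1 with k = 1 and to class 2 with k = 3, as in the example above.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[ 0.0,  0.5],    # class 1 (the closest point)
              [ 2.5,  2.5],    # class 1 (far away)
              [ 0.0, -0.7],    # class 2
              [ 0.7,  0.0],    # class 2
              [-2.0, -2.0]])   # class 2 (far away)
y = np.array([1, 1, 2, 2, 2])

green_circle = np.array([[0.0, 0.0]])

for k in (1, 3):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    print(f"k = {k}: assigned to class {knn.predict(green_circle)[0]}")
```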

Support Vector Machine (SVM)

The support vector machine was initially used for data analysis. A set of training examples, each belonging to one or the other category, is fed into the SVM algorithm. The algorithm then builds a model that starts assigning new data to one of the categories it has learned in the training phase.

In the SVM algorithm, a hyperplane is created which serves as a demarcation between the categories. When the SVM algorithm processes a new data point, it is classified into one of the classes depending on which side of the hyperplane it falls on.

[Figure: Support vector machine]

When related to trading, an SVM algorithm can be built which categorises the equity data into favourable buy, sell or neutral classes and then classifies the test data according to the rules it has learned.
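
The sketch below, assuming scikit-learn, shows that kind of three-class setup; the two features (a daily return and a volume change) and every number in it are hypothetical placeholders, not real equity data.

```python
# Minimal sketch, assuming scikit-learn: train an SVM on labelled examples
# and classify new test points into buy / sell / neutral classes.
import numpy as np
from sklearn.svm import SVC

X_train = np.array([[ 0.020,  0.10], [ 0.015,  0.05],    # labelled "buy"
                    [-0.020, -0.08], [-0.018, -0.04],    # labelled "sell"
                    [ 0.001,  0.00], [-0.002,  0.01]])   # labelled "neutral"
y_train = np.array(["buy", "buy", "sell", "sell", "neutral", "neutral"])

clf = SVC(kernel="linear")      # linear hyperplanes separate the classes
clf.fit(X_train, y_train)

X_test = np.array([[0.017, 0.07], [-0.015, -0.05]])
print(clf.predict(X_test))      # expected to print something like ['buy' 'sell']
```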

Decision Trees

Decision trees are basically a tree-like decision support tool which can be used to represent a cause and its effects. Since one cause can have multiple effects, we list them out (quite like a tree with its branches).


[Figure: Decision tree]

We can build the decision tree by organising the input data and predictor variables according to some criteria that we will specify.

The main steps to build a decision tree are:

  1. Retrieve market data for a financial instrument.
  2. Introduce the predictor variables (e.g. technical indicators, sentiment indicators, breadth indicators, etc.)
  3. Set up the target variable or the desired output.
  4. Split the data between training and test data.
  5. Generate the decision tree and train the model.
  6. Test and analyse the model.

The disadvantage of decision trees is that they are prone to overfitting due to their inherent design structure.
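
Here is a minimal sketch of the six steps listed above, assuming scikit-learn, with randomly generated data standing in for real market data and two toy predictors standing in for real technical indicators.

```python
# Minimal sketch, assuming scikit-learn: synthetic returns replace real
# market data, and two toy predictors replace real technical indicators.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
returns = rng.normal(0, 0.01, 300)                        # step 1: (synthetic) market data
momentum = np.convolve(returns, np.ones(5) / 5, "same")   # step 2: predictor variable
volatility = np.abs(returns)                              # step 2: another predictor
X = np.column_stack([momentum, volatility])
y = (np.roll(returns, -1) > 0).astype(int)                # step 3: target = next-period direction

X_train, X_test, y_train, y_test = train_test_split(      # step 4: train/test split
    X[:-1], y[:-1], test_size=0.3, shuffle=False)

tree = DecisionTreeClassifier(max_depth=3)                # step 5: generate and train the tree
tree.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, tree.predict(X_test)))   # step 6: test the model
```

Limiting max_depth, as above, is one simple way to rein in the overfitting mentioned earlier.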

Random Forest

A random forest algorithm was designed to address some of the limitations of decision trees.

A random forest comprises decision trees, which are graphs of decisions representing a course of action or statistical probability. These multiple trees are mapped to a single tree which is called the Classification and Regression Tree (CART) model.

To classify an object based on its attributes, each tree gives a