[Paper Translation] An Attempt at Quantitative Investment Using Reinforcement Learning



Code repository: https://www.aiqianji.com/openoker/stock_critic.git

Innovations

Following the rules of stock investing, this article builds a simplified investment model and a matching environment, and designs and implements the corresponding algorithm library.

In traditional quantitative finance, or rather at the intersection of stock investment and machine learning, programs are usually used to predict stock trends, while the actual investment decisions made after those trend probabilities are known are still left to a human. This means the quantitative process is not fully automated by the program. Faced with probabilities and similar factors, human emotions affect not only the judgment of a stock's trend but also the choice of what position to hold once the trend and its probability have been estimated. Unlike other work that aims to predict trends and then simply buys on a predicted rise and sells on a predicted fall, this article therefore focuses on handing the entire process to the program as a kind of "game". At the beginning the program may not know how to trade at all, but after a long period of extensive training its performance may exceed that of a human.

For example, suppose the program predicts that a stock has a 0.6 probability of rising and a 0.4 probability of falling; the final decision about the stock is still made by a human. The human has to weigh the program's prediction accuracy, historical information, and the predicted magnitude of the rise or fall, and try to end up making money. Why not let the machine do this job as well: let the machine be the acting party, independently handle both prediction and trading, and through training eventually acquire the ability to make money on its own.

Program Design

Machine Learning Component (Main Body of the Program)

The program follows the Actor-Critic approach and treats the whole stock-trading process as a game. At each step it chooses one of three actions, "buy", "hold", or "sell", and receives a reward from the system. The reward is used to train the Critic network (the value network), and the resulting TD error is then used to train the Actor network (the policy network).
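
For reference, these are the standard one-step Actor-Critic updates; the discount factor $\gamma$ corresponds to GAMMA in the code:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$$L_{Critic} = \delta_t^{2}, \qquad L_{Actor} = -\log \pi(a_t \mid s_t)\,\delta_t$$

The Critic is trained by minimizing $L_{Critic}$ and the Actor by minimizing $L_{Actor}$, i.e. by maximizing the TD-error-weighted log-probability of the chosen action; this matches the exp_v objective and the squared-TD-error loss defined in the classes below.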

Actor-Critic was chosen because, on top of trend prediction, this program adds a decision-making step, and such a decision does not necessarily change the current reward right away. An actor-critic pair reduces the cases where the program is rewarded by accident, and helps balance the tension between predicted probability and chosen action (whether or not to take the risk).

The relevant code is packaged as the Actor and Critic classes in ACSDK.py and is called from the main program main.py.

The relevant code is as follows:

# ACSDK.py
import numpy as np
import tensorflow as tf

GAMMA = 0.1  # reward discount used in the Critic's TD target (matches GAMMA in main.py)

class Actor(object):
    def __init__(self, sess, n_features, n_actions, lr=0.001):
        self.sess = sess

        self.s = tf.compat.v1.placeholder(tf.float32, [1, n_features], "state")
        self.a = tf.compat.v1.placeholder(tf.int32, None, "act")
        self.td_error = tf.compat.v1.placeholder(tf.float32, None, "td_error")  # TD_error

        with tf.compat.v1.variable_scope('Actor'):
            l1 = tf.compat.v1.layers.dense(
                inputs=self.s,
                units=20,  # number of hidden units
                activation=tf.nn.relu,
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='l1'
            )

            self.acts_prob = tf.compat.v1.layers.dense(
                inputs=l1,
                units=n_actions,  # output units
                activation=tf.nn.softmax,  # get action probabilities
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='acts_prob'
            )

        with tf.compat.v1.variable_scope('exp_v'):
            log_prob = tf.compat.v1.log(self.acts_prob[0, self.a])
            self.exp_v = tf.reduce_mean(log_prob * self.td_error)  # advantage (TD_error) guided loss

        with tf.compat.v1.variable_scope('train'):
            self.train_op = tf.compat.v1.train.AdamOptimizer(lr).minimize(-self.exp_v)  # minimize(-exp_v) = maximize(exp_v)

    def learn(self, s, a, td):
        s = s[np.newaxis, :]
        feed_dict = {self.s: s, self.a: a, self.td_error: td}
        _, exp_v = self.sess.run([self.train_op, self.exp_v], feed_dict)
        return exp_v

    def choose_action(self, s):
        s = s[np.newaxis, :]
        probs = self.sess.run(self.acts_prob, {self.s: s}) 
        return np.random.choice(np.arange(probs.shape[1]), p=probs.ravel())  # returns an int action index

class Critic(object):
    def __init__(self, sess, n_features, lr=0.01):
        self.sess = sess

        self.s = tf.compat.v1.placeholder(tf.float32, [1, n_features], "state")
        self.v_ = tf.compat.v1.placeholder(tf.float32, [1, 1], "v_next")
        self.r = tf.compat.v1.placeholder(tf.float32, None, 'r')

        with tf.compat.v1.variable_scope('Critic'):
            l1 = tf.compat.v1.layers.dense(
                inputs=self.s,
                units=20,  
                activation=tf.nn.relu,  
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='l1'
            )

            self.v = tf.compat.v1.layers.dense(
                inputs=l1,
                units=1,  # output units
                activation=None,
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='V'
            )

        with tf.compat.v1.variable_scope('squared_TD_error'):
            self.td_error = self.r + GAMMA * self.v_ - self.v
            self.loss = tf.compat.v1.square(self.td_error)  # TD_error = (r+gamma*V_next) - V_eval
        with tf.compat.v1.variable_scope('train'):
            self.train_op = tf.compat.v1.train.AdamOptimizer(lr).minimize(self.loss)

    def learn(self, s, r, s_):
        s, s_ = s[np.newaxis, :], s_[np.newaxis, :]

        v_ = self.sess.run(self.v, {self.s: s_})
        td_error, _ = self.sess.run([self.td_error, self.train_op],
                                    {self.s: s, self.v_: v_, self.r: r})
        return td_error, v_  # TD error (used to train the Actor) and V(s'), logged for plotting

And the main function:

# main.py
from ACSDK import Actor, Critic
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

import Environment
import getShock

tf.compat.v1.disable_eager_execution()  # the tf.compat.v1 placeholder/Session API requires graph mode under TF2

GAMMA = 0.1  # reward discount factor (the Critic in ACSDK.py uses its own module-level GAMMA)
LR_A = 0.0001  # Actor learning rate
LR_C = 0.5  # Critic learning rate

N_F = 4  # state-space dimension (number of features)
N_A = 3  # action-space size (buy / hold / sell)

def main():
    # data settings
    stock_code = '000037'
    trade_cost = 15/10000

    #getShock.get(stock_code)

    sess = tf.compat.v1.Session()
    actor = Actor(sess, n_features=N_F, n_actions=N_A, lr=LR_A)  # initialize the Actor network
    critic = Critic(sess, n_features=N_F, lr=LR_C)  # initialize the Critic network
    sess.run(tf.compat.v1.global_variables_initializer())  # initialize all network parameters

    env = Environment.Env(stock_code, trade_cost)
    obv, reward, done = env.init_module()
    rewards = []
    td_errors = []
    vs = []
    print(obv)
    while True:
        action = actor.choose_action(obv)
        obv_, reward, done = env.step(action)
        rewards.append(reward)
        td_error,v_1 = critic.learn(obv, reward, obv_)
        td_errors.append(td_error[0])
        vs.append(v_1[0])
        actor.learn(obv, action, td_error)
        obv = obv_
        #print(reward)
        if done: break
    
    # blue: per-step rewards, black: Critic value estimates V(s'),
    # red: 100 * env.readData(), yellow: 100 * env.readStocks()
    x = np.linspace(0, env.readLines() - 1, env.readLines())
    plt.plot(x, np.array(rewards), color='blue')
    plt.plot(x, np.array(vs), color='black')
    plt.plot(x, 100 * np.array(env.readData())[1:], color='red')
    plt.plot(x, 100 * np.array(env.readStocks())[0:], color='yellow')
    plt.show()

if __name__ == "__main__":
    main()

Environment Implementation

The agent is given a reactive system that reproduces a real stock environment. Its main job is to receive an action, read the current stock price, and return the corresponding reward according to the rules; one iteration corresponds to one trading day.

For the design of the reward, the total amount of money earned is used as the reward signal, and to keep the program from acting too passively a negative bias is applied at the start, i.e. the account begins at a negative value. Each day the agent may "buy" one unit of stock, "sell" one unit of stock, or hold; cash is deducted or received at that day's closing price, and the reward is the total value of cash plus stock, with the stock valued at the next day's opening price: $reward = stock \times price + cash$.
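
Environment.py itself is not reproduced in this article. The sketch below shows one way an Env class with the interface used by main.py could be written; the method names (init_module, step, readLines, readData, readStocks) are taken from main.py, while the CSV layout, the four price features, the action encoding (0 = buy, 1 = hold, 2 = sell) and the size of the initial negative bias are illustrative assumptions, not the author's actual implementation.

# Environment_sketch.py -- illustrative only; the real Environment.py may differ
import numpy as np
import pandas as pd

class Env:
    def __init__(self, stock_code, trade_cost, init_bias=-10.0):
        # daily bars, e.g. previously downloaded by getShock.get(stock_code)
        self.data = pd.read_csv(stock_code + '.csv')  # assumed columns: open, close, high, low
        self.trade_cost = trade_cost                  # proportional transaction cost, e.g. 15/10000
        self.cash = init_bias                         # negative starting bias, as described above
        self.stock = 0                                # units of stock currently held
        self.holdings = []                            # per-day holdings, kept for plotting
        self.t = 0

    def _observe(self):
        row = self.data.iloc[self.t]
        # four features, matching N_F = 4 in main.py (the exact features are an assumption)
        return np.array([row['open'], row['close'], row['high'], row['low']], dtype=np.float32)

    def init_module(self):
        return self._observe(), 0.0, False

    def step(self, action):
        close = self.data.iloc[self.t]['close']
        if action == 0:                               # buy one unit at today's closing price
            self.cash -= close * (1 + self.trade_cost)
            self.stock += 1
        elif action == 2:                             # sell one unit at today's closing price
            self.cash += close * (1 - self.trade_cost)
            self.stock -= 1
        self.holdings.append(self.stock)
        self.t += 1                                   # advance to the next trading day
        next_open = self.data.iloc[self.t]['open']
        reward = self.stock * next_open + self.cash   # reward = stock * price + cash
        done = self.t >= len(self.data) - 1
        return self._observe(), reward, done

    def readLines(self):
        return len(self.data) - 1                     # number of tradable days (loop iterations)

    def readData(self):
        # daily relative price changes; main.py plots 100 * readData()[1:]
        return self.data['close'].pct_change().fillna(0).tolist()

    def readStocks(self):
        return self.holdings                          # holdings per day; main.py plots 100 * readStocks()

Under these assumptions the training loop in main.py runs as shown: rewards, readData()[1:] and readStocks() all contain readLines() entries, so the four curves in the final plot share the same x axis.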

For the design of the state, this experiment draws on the paper ML-TEA, a suite based on machine learning and