[Paper Translation] An Attempt at Quantitative Investment Using Reinforcement Learning



Code repository: https://www.aiqianji.com/openoker/stock_critic.git

Innovations

Following the rules of stock investing, this article builds a simplified investment model and a matching environment, and designs and implements the corresponding algorithm library.

In traditional quantitative finance, or rather at the intersection of stock investment and machine learning, programs are usually used to predict stock trends, while the actual investment decisions made after those trend probabilities are known are still left to a human. This means the quantitative process is not fully automated by the program. Faced with probabilities and similar factors, human emotions affect not only the judgment of a stock's trend but also the choice of what position to hold once the trend and its probability have been estimated. Unlike other work that aims to predict trends and then simply buys on a predicted rise and sells on a predicted fall, this article therefore focuses on handing the entire process to the program as a kind of "game". At the beginning the program may not know how to trade at all, but after a long period of extensive training its performance may exceed that of a human.

For example, suppose the program predicts that a stock has a 0.6 probability of rising and a 0.4 probability of falling; the final decision about the stock is still made by a human. The human has to weigh the program's prediction accuracy, historical information, and the predicted magnitude of the rise or fall, and try to end up making money. Why not let the machine do this job as well: let the machine be the acting party, independently handle both prediction and trading, and through training eventually acquire the ability to make money on its own.

Program Design

Machine Learning Component (Main Body of the Program)

The program follows the Actor-Critic approach and treats the whole stock-trading process as a game. At each step it chooses one of three actions, "buy", "hold", or "sell", and receives a reward from the system. The reward is used to train the Critic network (the value network), and the resulting TD error is then used to train the Actor network (the policy network).
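
For reference, these are the standard one-step Actor-Critic updates; the discount factor $\gamma$ corresponds to GAMMA in the code:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$$L_{Critic} = \delta_t^{2}, \qquad L_{Actor} = -\log \pi(a_t \mid s_t)\,\delta_t$$

The Critic is trained by minimizing $L_{Critic}$ and the Actor by minimizing $L_{Actor}$, i.e. by maximizing the TD-error-weighted log-probability of the chosen action; this matches the exp_v objective and the squared-TD-error loss defined in the classes below.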

Actor-Critic was chosen because, on top of trend prediction, this program adds a decision-making step, and such a decision does not necessarily change the current reward right away. An actor-critic pair reduces the cases where the program is rewarded by accident, and helps balance the tension between predicted probability and chosen action (whether or not to take the risk).

The relevant code is packaged as the Actor and Critic classes in ACSDK.py and is called from the main program main.py.

The relevant code is as follows:

# ACSDK.py
import numpy as np
import tensorflow as tf

GAMMA = 0.1  # reward discount used in the Critic's TD target (matches GAMMA in main.py)

class Actor(object):
    def __init__(self, sess, n_features, n_actions, lr=0.001):
        self.sess = sess

        self.s = tf.compat.v1.placeholder(tf.float32, [1, n_features], "state")
        self.a = tf.compat.v1.placeholder(tf.int32, None, "act")
        self.td_error = tf.compat.v1.placeholder(tf.float32, None, "td_error")  # TD_error

        with tf.compat.v1.variable_scope('Actor'):
            l1 = tf.compat.v1.layers.dense(
                inputs=self.s,
                units=20,  # number of hidden units
                activation=tf.nn.relu,
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='l1'
            )

            self.acts_prob = tf.compat.v1.layers.dense(
                inputs=l1,
                units=n_actions,  # output units
                activation=tf.nn.softmax,  # get action probabilities
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='acts_prob'
            )

        with tf.compat.v1.variable_scope('exp_v'):
            log_prob = tf.compat.v1.log(self.acts_prob[0, self.a])
            self.exp_v = tf.reduce_mean(log_prob * self.td_error)  # advantage (TD_error) guided loss

        with tf.compat.v1.variable_scope('train'):
            self.train_op = tf.compat.v1.train.AdamOptimizer(lr).minimize(-self.exp_v)  # minimize(-exp_v) = maximize(exp_v)

    def learn(self, s, a, td):
        s = s[np.newaxis, :]
        feed_dict = {self.s: s, self.a: a, self.td_error: td}
        _, exp_v = self.sess.run([self.train_op, self.exp_v], feed_dict)
        return exp_v

    def choose_action(self, s):
        s = s[np.newaxis, :]
        probs = self.sess.run(self.acts_prob, {self.s: s}) 
        return np.random.choice(np.arange(probs.shape[1]), p=probs.ravel())  # returns an int action index

class Critic(object):
    def __init__(self, sess, n_features, lr=0.01):
        self.sess = sess

        self.s = tf.compat.v1.placeholder(tf.float32, [1, n_features], "state")
        self.v_ = tf.compat.v1.placeholder(tf.float32, [1, 1], "v_next")
        self.r = tf.compat.v1.placeholder(tf.float32, None, 'r')

        with tf.compat.v1.variable_scope('Critic'):
            l1 = tf.compat.v1.layers.dense(
                inputs=self.s,
                units=20,  
                activation=tf.nn.relu,  
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='l1'
            )

            self.v = tf.compat.v1.layers.dense(
                inputs=l1,
                units=1,  # output units
                activation=None,
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='V'
            )

        with tf.compat.v1.variable_scope('squared_TD_error'):
            self.td_error = self.r + GAMMA * self.v_ - self.v
            self.loss = tf.compat.v1.square(self.td_error)  # TD_error = (r+gamma*V_next) - V_eval
        with tf.compat.v1.variable_scope('train'):
            self.train_op = tf.compat.v1.train.AdamOptimizer(lr).minimize(self.loss)

    def learn(self, s, r, s_):
        s, s_ = s[np.newaxis, :], s_[np.newaxis, :]

        v_ = self.sess.run(self.v, {self.s: s_})
        td_error, _ = self.sess.run([self.td_error, self.train_op],
                                    {self.s: s, self.v_: v_, self.r: r})
        return td_error, v_  # TD error (used to train the Actor) and V(s'), logged for plotting

And the main function:

# main.py
from ACSDK import Actor, Critic
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

import Environment
import getShock

tf.compat.v1.disable_eager_execution()  # the tf.compat.v1 placeholder/Session API requires graph mode under TF2

GAMMA = 0.1  # reward discount factor (the Critic in ACSDK.py uses its own module-level GAMMA)
LR_A = 0.0001  # Actor learning rate
LR_C = 0.5  # Critic learning rate

N_F = 4  # state-space dimension (number of features)
N_A = 3  # action-space size (buy / hold / sell)

def main():
    # data settings
    stock_code = '000037'
    trade_cost = 15/10000

    #getShock.get(stock_code)

    sess = tf.compat.v1.Session()
    actor = Actor(sess, n_features=N_F, n_actions=N_A, lr=LR_A)  # initialize the Actor network
    critic = Critic(sess, n_features=N_F, lr=LR_C)  # initialize the Critic network
    sess.run(tf.compat.v1.global_variables_initializer())  # initialize all network parameters

    env = Environment.Env(stock_code, trade_cost)
    obv, reward, done = env.init_module()
    rewards = []
    td_errors = []
    vs = []
    print(obv)
    while True:
        action = actor.choose_action(obv)
        obv_, reward, done = env.step(action)
        rewards.append(reward)
        td_error,v_1 = critic.learn(obv, reward, obv_)
        td_errors.append(td_error[0])
        vs.append(v_1[0])
        actor.learn(obv, action, td_error)
        obv = obv_
        #print(reward)
        if done: break
    
    # blue: per-step rewards, black: Critic value estimates V(s'),
    # red: 100 * env.readData(), yellow: 100 * env.readStocks()
    x = np.linspace(0, env.readLines() - 1, env.readLines())
    plt.plot(x, np.array(rewards), color='blue')
    plt.plot(x, np.array(vs), color='black')
    plt.plot(x, 100 * np.array(env.readData())[1:], color='red')
    plt.plot(x, 100 * np.array(env.readStocks())[0:], color='yellow')
    plt.show()

if __name__ == "__main__":
    main()

Environment Implementation

The agent is given a reactive system that reproduces a real stock environment. Its main job is to receive an action, read the current stock price, and return the corresponding reward according to the rules; one iteration corresponds to one trading day.

For the design of the reward, the total amount of money earned is used as the reward signal, and to keep the program from acting too passively a negative bias is applied at the start, i.e. the account begins at a negative value. Each day the agent may "buy" one unit of stock, "sell" one unit of stock, or hold; cash is deducted or received at that day's closing price, and the reward is the total value of cash plus stock, with the stock valued at the next day's opening price: $reward = stock \times price + cash$.
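
Environment.py itself is not reproduced in this article. The sketch below shows one way an Env class with the interface used by main.py could be written; the method names (init_module, step, readLines, readData, readStocks) are taken from main.py, while the CSV layout, the four price features, the action encoding (0 = buy, 1 = hold, 2 = sell) and the size of the initial negative bias are illustrative assumptions, not the author's actual implementation.

# Environment_sketch.py -- illustrative only; the real Environment.py may differ
import numpy as np
import pandas as pd

class Env:
    def __init__(self, stock_code, trade_cost, init_bias=-10.0):
        # daily bars, e.g. previously downloaded by getShock.get(stock_code)
        self.data = pd.read_csv(stock_code + '.csv')  # assumed columns: open, close, high, low
        self.trade_cost = trade_cost                  # proportional transaction cost, e.g. 15/10000
        self.cash = init_bias                         # negative starting bias, as described above
        self.stock = 0                                # units of stock currently held
        self.holdings = []                            # per-day holdings, kept for plotting
        self.t = 0

    def _observe(self):
        row = self.data.iloc[self.t]
        # four features, matching N_F = 4 in main.py (the exact features are an assumption)
        return np.array([row['open'], row['close'], row['high'], row['low']], dtype=np.float32)

    def init_module(self):
        return self._observe(), 0.0, False

    def step(self, action):
        close = self.data.iloc[self.t]['close']
        if action == 0:                               # buy one unit at today's closing price
            self.cash -= close * (1 + self.trade_cost)
            self.stock += 1
        elif action == 2:                             # sell one unit at today's closing price
            self.cash += close * (1 - self.trade_cost)
            self.stock -= 1
        self.holdings.append(self.stock)
        self.t += 1                                   # advance to the next trading day
        next_open = self.data.iloc[self.t]['open']
        reward = self.stock * next_open + self.cash   # reward = stock * price + cash
        done = self.t >= len(self.data) - 1
        return self._observe(), reward, done

    def readLines(self):
        return len(self.data) - 1                     # number of tradable days (loop iterations)

    def readData(self):
        # daily relative price changes; main.py plots 100 * readData()[1:]
        return self.data['close'].pct_change().fillna(0).tolist()

    def readStocks(self):
        return self.holdings                          # holdings per day; main.py plots 100 * readStocks()

Under these assumptions the training loop in main.py runs as shown: rewards, readData()[1:] and readStocks() all contain readLines() entries, so the four curves in the final plot share the same x axis.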

For the design of the state, this experiment draws on the paper ML-TEA, a suite based on machine learning and