## Using reinforcement learning to trade multiple stocks through Python and OpenAI Gym | Presented at ICAIF 2020

*Note from Towards Data Science’s editors: While we allow independent authors to publish articles in accordance with our rules and guidelines, we do not endorse each author’s contribution. You should not rely on an author’s works without seeking professional advice. See our Reader Terms for details.*

This blog is based on our paper **Deep Reinforcement Learning for Automated Stock Trading: An Ensemble Strategy**, presented at **ICAIF 2020**: ACM International Conference on AI in Finance.

Our code is available on GitHub.

Our paper is available on SSRN.

If you want to cite our paper, the reference format is as follows:

Hongyang Yang, Xiao-Yang Liu, Shan Zhong, and Anwar Walid. 2020. Deep Reinforcement Learning for Automated Stock Trading: An Ensemble Strategy. In ICAIF ’20: ACM International Conference on AI in Finance, Oct. 15–16, 2020, Manhattan, NY. ACM, New York, NY, USA.

(Video) Deep Reinforcement Learning Applied to Crypto and Stock Trading - Beginner Insights

Our most recent DRL library for automated trading, **FinRL**, can be found here:

GitHub: AI4Finance-Foundation/FinRL, a deep reinforcement learning framework for automated trading (github.com).

- FinRL for Quantitative Finance: Tutorial for Single Stock Trading
- FinRL for Quantitative Finance: Tutorial for Multiple Stock Trading
- FinRL for Quantitative Finance: Tutorial for Portfolio Allocation

**ElegantRL** supports state-of-the-art DRL algorithms and provides user-friendly tutorials in Jupyter notebooks. The core code is fewer than 1,000 lines, built on PyTorch, OpenAI Gym, and NumPy.

One can hardly overestimate the crucial role stock trading strategies play in investment.

A profitable automated stock trading strategy is vital to investment companies and hedge funds: it is used to optimize capital allocation and maximize investment performance, such as expected return, where return maximization can be based on estimates of potential return and risk. Every player wants a winning strategy, yet designing a profitable strategy in such a complex and dynamic stock market is hard.

Here, we reveal a deep reinforcement learning scheme that automatically learns a stock trading strategy by maximizing investment return.

**Our solution: an ensemble deep reinforcement learning trading strategy.** This strategy includes three actor-critic based algorithms: Proximal Policy Optimization (PPO), Advantage Actor Critic (A2C), and Deep Deterministic Policy Gradient (DDPG). It combines the best features of the three algorithms, thereby robustly adjusting to different market conditions.

The performance of the trading agent with different reinforcement learning algorithms is evaluated using the Sharpe ratio and compared with both the Dow Jones Industrial Average index and the traditional min-variance portfolio allocation strategy.

Existing works are not satisfactory, and the deep reinforcement learning approach offers several advantages over them.

## 1.1 DRL and Modern Portfolio Theory (MPT)

- MPT performs poorly on **out-of-sample data**.
- MPT is very **sensitive to outliers**.
- MPT is calculated **only from stock returns**. If we want to **take other relevant factors** into account, for example technical indicators such as **Moving Average Convergence Divergence (MACD)** and the **Relative Strength Index (RSI)**, MPT cannot combine this information well.

## 1.2 DRL and supervised machine learning prediction models

- DRL doesn’t need **large labeled training datasets**. This is a significant advantage: with the amount of data growing exponentially today, labeling a large dataset is very time- and labor-consuming.
- DRL uses a **reward function** to optimize future rewards, in contrast to an ML regression/classification model that predicts the probability of future outcomes.

## 1.3 The rationale of using DRL for stock trading

- The goal of stock trading is to **maximize returns** while avoiding risks. DRL solves this optimization problem by **maximizing the expected total reward** from future actions over a time period.
- Stock trading is a **continuous process** of testing new ideas, getting feedback from the market, and trying to optimize trading strategies over time. We can model the stock trading process as a **Markov decision process**, which is the very foundation of reinforcement learning.

## 1.4 The advantages of deep reinforcement learning

- Deep reinforcement learning algorithms can **outperform human players in many challenging games**. For example, in March 2016, DeepMind’s AlphaGo program, a deep reinforcement learning algorithm, beat the world champion Lee Sedol at the game of Go.
- **Return maximization as trading goal**: by defining the reward function as the change of the portfolio value, deep reinforcement learning maximizes the portfolio value over time.
- The stock market provides **sequential feedback**. **DRL** can **sequentially** improve model performance during the training process.
- **The exploration-exploitation technique** balances trying out new things and exploiting what has already been learned. This differs from other learning algorithms. There is also no need for a skilled human to provide training examples or labeled samples; furthermore, during exploration, the agent is encouraged to explore regions uncharted by human experts.
- **Experience replay** overcomes the correlated-samples issue: learning from a batch of consecutive samples suffers from high variance and is inefficient. Experience replay addresses this by randomly sampling mini-batches of transitions from a pre-saved replay memory.
- **Multi-dimensional data**: by using a continuous action space, DRL can handle high-dimensional data.
- **Computational power**: Q-learning is a very important RL algorithm, but it fails to handle large state spaces. DRL, empowered by neural networks as efficient function approximators, can handle extremely large state and action spaces.

## 2.1 Concepts

**Reinforcement learning** is one of the three main approaches of machine learning, alongside supervised and unsupervised learning. It trains an agent to interact with an environment by sequentially receiving states and rewards and taking actions to reach better rewards.

**Deep Reinforcement Learning** approximates the Q value with a neural network. Using a neural network as a function approximator would allow reinforcement learning to be applied to large data.

**Bellman equation** is the guiding principle for designing reinforcement learning algorithms.

**Markov Decision Process (MDP)** is used to model the environment.
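For reference, the Bellman equation expresses the Q-value recursively; in standard notation (with discount factor γ, not specific to this paper):

```latex
Q^{\pi}(s,a) \;=\; \mathbb{E}_{s'}\!\left[\, r(s,a,s') \;+\; \gamma \, \mathbb{E}_{a' \sim \pi}\!\left[ Q^{\pi}(s',a') \right] \right]
```

That is, the value of taking action 𝑎 in state 𝑠 equals the immediate reward plus the discounted value of acting according to 𝜋 thereafter.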

## 2.2 Related works

**Recent applications** of deep reinforcement learning in financial markets consider discrete or continuous state and action spaces and employ one of three learning approaches: **the critic-only approach, the actor-only approach, or the actor-critic approach**.

**1. Critic-only approach:** the critic-only learning approach, which is the most common, solves a discrete action space problem using, for example, Q-learning, Deep Q-learning (DQN) and its improvements, and trains an agent on a single stock or asset. **The idea of the critic-only approach** is to use a **Q-value function** to learn the optimal action-selection policy that maximizes the expected future reward given the current state. Instead of calculating a state-action value table, **DQN minimizes** the mean squared error between the predicted and target Q-values, using a neural network for function approximation. The major limitation of the critic-only approach is that it only works with discrete and finite state and action spaces, which is not practical for a large portfolio of stocks, since prices are of course continuous.

- **Q-learning** is a value-based reinforcement learning algorithm that is used to find the optimal action-selection policy using a Q function.
- **DQN**: in deep Q-learning, we use a neural network to approximate the Q-value function. The state is given as the input, and the Q-values of the allowed actions are the predicted output.

**2. Actor-only approach: **The idea here is that the agent directly learns the optimal policy itself. Instead of having a neural network to learn the Q-value, the neural network learns the policy. The policy is a probability distribution that is essentially a strategy for a given state, namely the likelihood to take an allowed action. The actor-only approach can handle the continuous action space environments.

- **Policy gradient** aims to maximize the expected total reward by directly learning the optimal policy itself.

**3. Actor-critic approach:** the actor-critic approach has recently been applied in finance. The idea is to **simultaneously update the actor network that represents the policy** and **the critic network that represents the value function**. The critic estimates the value function, while the actor updates the policy probability distribution guided by the critic with policy gradients. Over time, the actor learns to take better actions and the critic gets better at evaluating those actions. The actor-critic approach has proven able to learn and adapt to large and complex environments, and has been used to play popular video games such as Doom. Thus, the actor-critic approach fits well in trading with a large stock portfolio.

- **A2C** is a typical actor-critic algorithm.
- **PPO** is introduced to control the policy gradient update and ensure that the new policy will not be too different from the previous one.
- **DDPG** combines the frameworks of both Q-learning and policy gradient, and uses neural networks as function approximators.

## 3.1 Data

We track and select the **Dow Jones 30 stocks (as of 2016/01/01)** and use historical daily data from **01/01/2009 to 05/08/2020** to train the agent and test its performance. The dataset is downloaded from the Compustat database accessed through **Wharton Research Data Services (WRDS)**.

The whole dataset is split as shown in the following figure. Data from 01/01/2009 to 12/31/2014 is used for **training**, and data from 10/01/2015 to 12/31/2015 is used for **validation** and parameter tuning. Finally, we test our agent’s performance on **trading** data, the unseen out-of-sample data from 01/01/2016 to 05/08/2020. To better exploit the trading data, we continue training our agent during the trading stage, since this helps the agent better adapt to market dynamics.
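To make the split concrete, here is a minimal sketch of how the three date ranges could be carved out of a daily DataFrame. The `datadate` column name and the helper itself are illustrative, not the paper’s exact code:

```python
import pandas as pd

# Toy calendar of business days; in practice this is the daily WRDS/Compustat data
df = pd.DataFrame({"datadate": pd.date_range("2009-01-01", "2020-05-08", freq="B")})

def data_split(df, start, end):
    """Return the rows with start <= datadate < end, re-indexed by day."""
    data = df[(df.datadate >= start) & (df.datadate < end)]
    return data.sort_values("datadate").reset_index(drop=True)

train = data_split(df, "2009-01-01", "2015-01-01")       # training
validation = data_split(df, "2015-10-01", "2016-01-01")  # validation / tuning
trade = data_split(df, "2016-01-01", "2020-05-09")       # out-of-sample trading
```

The key property is that the three windows never overlap, so the trading window remains genuinely out-of-sample.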

## 3.2 MDP model for stock trading

• **State** 𝒔 = [𝒑, 𝒉, 𝑏]: a vector that includes stock prices 𝒑 ∈ R+^D, the stock shares 𝒉 ∈ Z+^D, and the remaining balance 𝑏 ∈ R+, where 𝐷 denotes the number of stocks and Z+ denotes non-negative integers.

• **Action** 𝒂: a vector of actions over 𝐷 stocks. The allowed actions on each stock include selling, buying, or holding, which result in decreasing, increasing, and no change of the stock shares 𝒉, respectively.

• **Reward** 𝑟(𝑠, 𝑎, 𝑠′): the direct reward of taking action 𝑎 at state 𝑠 and arriving at the new state 𝑠′.

• **Policy** 𝜋 (𝑠): the trading strategy at state 𝑠, which is the probability distribution of actions at state 𝑠.

• **Q-value** 𝑄𝜋 (𝑠, 𝑎): the expected reward of taking action 𝑎 at state 𝑠 following policy 𝜋 .

The state transition of our stock trading process is shown in the following figure. At each state, one of three possible actions is taken on stock 𝑑 (𝑑 = 1, …, 𝐷) in the portfolio.

- **Selling** 𝒌[𝑑] ∈ [1, 𝒉[𝑑]] shares results in 𝒉𝒕+1[𝑑] = 𝒉𝒕[𝑑] − 𝒌[𝑑], where 𝒌[𝑑] ∈ Z+ and 𝑑 = 1, …, 𝐷.
- **Holding**: 𝒉𝒕+1[𝑑] = 𝒉𝒕[𝑑].
- **Buying** 𝒌[𝑑] shares results in 𝒉𝒕+1[𝑑] = 𝒉𝒕[𝑑] + 𝒌[𝑑].

At time 𝑡 an action is taken and the stock prices update at 𝑡+1, accordingly the portfolio values may change from “portfolio value 0” to “portfolio value 1”, “portfolio value 2”, or “portfolio value 3”, respectively, as illustrated in Figure 2. Note that the portfolio value is 𝒑𝑻 𝒉 + 𝑏.
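Given 𝒑, 𝒉, and 𝑏, the portfolio value and the step reward are simple to compute; a minimal sketch with illustrative numbers (variable names are mine, not the paper’s code):

```python
import numpy as np

def portfolio_value(prices, holdings, balance):
    """Portfolio value = p^T h + b."""
    return float(np.dot(prices, holdings) + balance)

# State at time t: 3 stocks, prices p, holdings h, cash b
p_t = np.array([100.0, 50.0, 20.0])
h_t = np.array([10, 20, 0])
b_t = 5000.0

# After the action, prices move to t+1 and holdings/balance change
p_t1 = np.array([101.0, 49.0, 20.5])
h_t1 = np.array([10, 20, 10])      # bought 10 shares of the third stock
b_t1 = b_t - 10 * 20.0             # paid for them (fees ignored here)

# Reward r(s, a, s') = change in portfolio value
reward = portfolio_value(p_t1, h_t1, b_t1) - portfolio_value(p_t, h_t, b_t)
print(reward)
```

Here the purchase itself is value-neutral (cash becomes shares), so the reward reflects only the price moves between 𝑡 and 𝑡+1.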

## 3.3 Constraints

- **Market liquidity**: orders can be rapidly executed at the close price. We assume that the stock market will not be affected by our reinforcement trading agent.
- **Nonnegative balance**: the allowed actions should not result in a negative balance.
- **Transaction cost**: a transaction cost is incurred for each trade. There are many types of transaction costs, such as exchange fees, execution fees, and SEC fees, and different brokers charge different commissions. Despite these variations, we assume our transaction cost to be 1/1000 of the value of each trade (either buy or sell).
- **Risk-aversion for market crash**: sudden events, such as wars, collapses of stock market bubbles, sovereign debt defaults, and financial crises, may cause a stock market crash. To control risk in a worst-case scenario like the 2008 global financial crisis, we employ the financial **turbulence index**, which measures extreme asset price movements.
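The turbulence index is not spelled out in this excerpt; a common formulation, which we assume here, is the Mahalanobis distance of the current cross-sectional return vector from its historical distribution:

```python
import numpy as np

def turbulence_index(current_returns, hist_returns):
    """Mahalanobis distance of today's cross-sectional stock returns
    from their historical mean/covariance (one common definition of
    the financial turbulence index)."""
    mu = hist_returns.mean(axis=0)
    sigma = np.cov(hist_returns, rowvar=False)
    diff = current_returns - mu
    return float(diff @ np.linalg.pinv(sigma) @ diff)

rng = np.random.default_rng(0)
hist = rng.normal(0.0, 0.01, size=(250, 5))        # 250 days of returns, 5 stocks

calm = turbulence_index(rng.normal(0.0, 0.01, size=5), hist)
crash = turbulence_index(np.full(5, -0.10), hist)  # every stock drops 10%
print(calm, crash)
```

A day on which every stock drops sharply scores far above a typical day, so trading can be halted (all holdings sold) when the index exceeds a threshold.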

## 3.4 Return maximization as trading goal

We define our reward function as **the change of the portfolio value** when action 𝑎 is taken at state 𝑠, arriving at the new state 𝑠′.

The goal is to design a trading strategy that maximizes the **change of the portfolio value 𝑟(𝑠𝑡,𝑎𝑡,𝑠𝑡+1)** in the dynamic environment, and we employ the deep reinforcement learning method to solve this problem.

## 3.5 Environment for multiple stocks

**State space**: we use a 181-dimensional vector (1 + 30 stocks × 6) consisting of seven parts of information to represent the state of the multiple-stock trading environment:

- **Balance**: amount of money available in the account at the current time step.
- **Price**: current adjusted close price of each stock.
- **Shares**: shares owned of each stock.
- **MACD**: Moving Average Convergence Divergence, calculated using the close price.
- **RSI**: Relative Strength Index, calculated using the close price.
- **CCI**: Commodity Channel Index, calculated using the high, low, and close prices.
- **ADX**: Average Directional Index, calculated using the high, low, and close prices.

**Action Space**:

- For a single stock, the action space is defined as **{-k, …, -1, 0, 1, …, k}**, where k and -k represent the number of shares we can buy and sell, and **k ≤ h_max**, where **h_max** is a predefined parameter that sets the maximum number of shares for each buying action.
- For multiple stocks, the size of the entire action space is therefore **(2k+1)^30**.
- The action space is then normalized to **[-1, 1]**, since the RL algorithms A2C and PPO define the policy directly on a Gaussian distribution, which needs to be normalized and symmetric.
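Before execution, the normalized actions produced by the policy are mapped back to integer share counts. A minimal sketch of that mapping, assuming h_max = 100 for illustration:

```python
import numpy as np

HMAX = 100  # illustrative h_max: maximum shares per buy/sell action

def actions_to_shares(actions, h_max=HMAX):
    """Map normalized actions in [-1, 1] to integer share counts in
    {-h_max, ..., h_max}; the sign encodes sell / hold / buy."""
    actions = np.clip(actions, -1.0, 1.0)
    return (actions * h_max).astype(int)

a = np.array([-1.0, -0.25, 0.0, 0.5, 1.0])
print(actions_to_shares(a))  # sell 100, sell 25, hold, buy 50, buy 100
```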

```python
import gym
import numpy as np
from gym import spaces

# STOCK_DIM, INITIAL_ACCOUNT_BALANCE come from the project's config

class StockEnvTrain(gym.Env):
    """A stock trading environment for OpenAI gym"""
    metadata = {'render.modes': ['human']}

    def __init__(self, df, day=0):
        self.day = day
        self.df = df

        # Action space: normalized to [-1, 1], shape is STOCK_DIM
        self.action_space = spaces.Box(low=-1, high=1, shape=(STOCK_DIM,))

        # State space, shape = 181: [current balance] + [prices 1-30]
        # + [owned shares 1-30] + [macd 1-30] + [rsi 1-30]
        # + [cci 1-30] + [adx 1-30]
        self.observation_space = spaces.Box(low=0, high=np.inf, shape=(181,))

        # load data from a pandas dataframe
        self.data = self.df.loc[self.day, :]
        self.terminal = False

        # initialize state
        self.state = [INITIAL_ACCOUNT_BALANCE] + \
                     self.data.adjcp.values.tolist() + \
                     [0] * STOCK_DIM + \
                     self.data.macd.values.tolist() + \
                     self.data.rsi.values.tolist() + \
                     self.data.cci.values.tolist() + \
                     self.data.adx.values.tolist()

        # initialize reward and cost
        self.reward = 0
        self.cost = 0

        # memorize all the total balance change
        self.asset_memory = [INITIAL_ACCOUNT_BALANCE]
        self.rewards_memory = []
        self.trades = 0
        # self.reset()
        self._seed()
```
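The excerpt shows only the environment’s `__init__`. For intuition, here is a much-simplified sketch of what a `step` method for such an environment does; this is a toy single-stock illustration with the 1/1000 transaction cost from the constraints above, not the paper’s actual implementation:

```python
import numpy as np

TRANSACTION_FEE = 0.001  # 1/1000 of trade value, per the constraints above

class ToyTradingEnv:
    """Single-stock toy version of the trading MDP."""
    def __init__(self, prices, balance=10_000.0):
        self.prices, self.t = prices, 0
        self.balance, self.shares = balance, 0

    def _value(self):
        return self.balance + self.shares * self.prices[self.t]

    def step(self, shares_delta):
        """shares_delta > 0 buys, < 0 sells, 0 holds."""
        before = self._value()
        price = self.prices[self.t]
        if shares_delta < 0:   # sell at most what we hold (nonnegative shares)
            qty = min(-shares_delta, self.shares)
            self.balance += qty * price * (1 - TRANSACTION_FEE)
            self.shares -= qty
        elif shares_delta > 0:  # buy only what the balance allows (nonnegative balance)
            qty = min(shares_delta,
                      int(self.balance // (price * (1 + TRANSACTION_FEE))))
            self.balance -= qty * price * (1 + TRANSACTION_FEE)
            self.shares += qty
        self.t += 1  # prices advance to t+1
        done = self.t >= len(self.prices) - 1
        reward = self._value() - before  # change in portfolio value
        return (self.balance, self.shares), reward, done

env = ToyTradingEnv(prices=np.array([10.0, 11.0, 12.0, 11.5]))
state, reward, done = env.step(100)  # buy 100 shares at $10
print(state, reward)
```

The real environment does the same bookkeeping per stock across all 30 tickers and appends the technical indicators to the next observation.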

## 3.6 Trading agent based on deep reinforcement learning

## A2C

**A2C** is a typical **actor-critic algorithm** which we use as a component in the ensemble method. A2C is introduced to improve the policy gradient updates. A2C utilizes an **advantage function** to reduce the variance of the policy gradient. Instead of only estimating the value function, the **critic network estimates the advantage function**. Thus, the evaluation of an action depends not only on how good the action is, but also on how much better it can be. This reduces the high variance of the policy network and makes the model more robust.

A2C uses **copies of the same agent** working in parallel to update gradients with different data samples. Each agent works independently, interacting with the same environment. After all of the parallel agents finish calculating their gradients, A2C uses **a coordinator** to pass the average gradients over all the agents to a **global network**, so that the global network can update the actor and critic networks. The presence of a global network increases the diversity of the training data. The synchronized gradient update is more cost-effective, faster, and works better with large batch sizes. A2C is a great model for stock trading because of its stability.
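In standard notation (mine, not the paper’s), the advantage that A2C feeds into the policy gradient is:

```latex
A(s_t, a_t) = Q(s_t, a_t) - V(s_t), \qquad
\nabla_{\theta} J(\theta) \approx \mathbb{E}\!\left[\, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\; A(s_t, a_t) \,\right]
```

Subtracting the baseline 𝑉(𝑠𝑡) leaves the gradient unbiased while shrinking its variance, which is exactly the robustness benefit described above.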

## DDPG

**DDPG** is an actor-critic based algorithm which we use as a component in the ensemble strategy to maximize the investment return. DDPG **combines** the frameworks of both **Q-learning and policy gradient**, and uses neural networks as function approximators. In contrast with DQN, which learns indirectly through Q-value tables and suffers from the curse of dimensionality, DDPG learns directly from observations through the policy gradient. It deterministically maps states to actions to better fit continuous action space environments.

## PPO

We explore and use PPO as a component in the ensemble method. PPO is introduced to control the policy gradient update and ensure that the new policy will not be too different from the older one. PPO tries to simplify the objective of **Trust Region Policy Optimization (TRPO)** by introducing a clipping term to the objective function.

The objective function of PPO takes the **minimum of the clipped and unclipped objectives**. PPO discourages large policy updates that move outside the clipped interval. Therefore, PPO improves the stability of policy network training by restricting the policy update at each training step. We select PPO for stock trading because it is stable, fast, and simple to implement and tune.
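In standard notation, with probability ratio 𝑟𝑡(𝜃) and advantage estimate Â𝑡, the clipped surrogate objective is:

```latex
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Taking the minimum means the objective never rewards pushing the ratio beyond the [1−𝜖, 1+𝜖] band, which is what keeps the new policy close to the old one.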

## Ensemble strategy

Our purpose is to create a highly robust trading strategy, so we use **an ensemble method** to automatically select the best performing agent among **PPO**, **A2C**, and **DDPG** to trade, based on the **Sharpe ratio**. The ensemble process is described as follows:

**Step 1.** We use a growing window of **𝑛 months** to retrain our three agents concurrently. In this paper, we retrain our three agents every **three months**.

**Step 2.** We validate all three agents on a **3-month rolling validation window** that follows the training window, and pick the best performing agent, the one with the **highest Sharpe ratio**. We also adjust risk-aversion by using the **turbulence index** in the validation stage.

**Step 3.** After validation, we use only the model with the highest Sharpe ratio to predict and trade for the next quarter.
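The selection in the three steps above can be sketched as follows; this is a simplified illustration with synthetic validation returns (the real pipeline also feeds the turbulence threshold into validation):

```python
import numpy as np

def sharpe_ratio(daily_returns, trading_days=252):
    """Annualized Sharpe ratio, risk-free rate assumed to be 0."""
    r = np.asarray(daily_returns)
    return np.sqrt(trading_days) * r.mean() / r.std()

def pick_best_agent(validation_returns):
    """validation_returns maps agent name -> daily returns over the
    3-month validation window; return the name with the highest Sharpe."""
    return max(validation_returns, key=lambda k: sharpe_ratio(validation_returns[k]))

# Hypothetical validation-window returns: same noise, different mean drift
rng = np.random.default_rng(42)
base = rng.normal(0.0, 0.01, size=63)   # ~63 trading days in a quarter
val = {
    "A2C":  base + 0.0002,
    "PPO":  base + 0.0010,   # highest drift, so highest Sharpe here
    "DDPG": base - 0.0001,
}
best = pick_best_agent(val)
print(best)  # this agent trades the next quarter
```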

```python
import time

from stable_baselines import SAC, PPO2, A2C, DDPG, TD3
from stable_baselines.ddpg.policies import DDPGPolicy
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv

# `config` is the project's settings module (provides TRAINED_MODEL_DIR)


def train_A2C(env_train, model_name, timesteps=10000):
    """A2C model"""
    start = time.time()
    model = A2C('MlpPolicy', env_train, verbose=0)
    model.learn(total_timesteps=timesteps)
    end = time.time()

    model.save(f"{config.TRAINED_MODEL_DIR}/{model_name}")
    print('Training time (A2C): ', (end - start) / 60, ' minutes')
    return model


def train_DDPG(env_train, model_name, timesteps=10000):
    """DDPG model"""
    start = time.time()
    model = DDPG('MlpPolicy', env_train)
    model.learn(total_timesteps=timesteps)
    end = time.time()

    model.save(f"{config.TRAINED_MODEL_DIR}/{model_name}")
    print('Training time (DDPG): ', (end - start) / 60, ' minutes')
    return model


def train_PPO(env_train, model_name, timesteps=50000):
    """PPO model"""
    start = time.time()
    model = PPO2('MlpPolicy', env_train)
    model.learn(total_timesteps=timesteps)
    end = time.time()

    model.save(f"{config.TRAINED_MODEL_DIR}/{model_name}")
    print('Training time (PPO): ', (end - start) / 60, ' minutes')
    return model


def DRL_prediction(model, test_data, test_env, test_obs):
    """make a prediction"""
    start = time.time()
    for i in range(len(test_data.index.unique())):
        action, _states = model.predict(test_obs)
        test_obs, rewards, dones, info = test_env.step(action)
        # test_env.render()
    end = time.time()
```

## 3.7 Performance evaluations

We use Quantopian’s pyfolio to do the backtesting. The charts look pretty good, and it takes literally one line of code to produce them; you just need to convert everything into daily returns first.

```python
import pyfolio

with pyfolio.plotting.plotting_context(font_scale=1.1):
    pyfolio.create_full_tear_sheet(returns=ensemble_strat,
                                   benchmark_rets=dow_strat,
                                   set_context=False)
```
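If your backtest produces a series of daily account values rather than returns, the conversion pyfolio expects is a one-liner. A small sketch with illustrative numbers:

```python
import pandas as pd

# Daily account values from the trading stage (illustrative numbers)
account_value = pd.Series(
    [1_000_000, 1_004_000, 1_001_000, 1_010_000],
    index=pd.date_range("2016-01-04", periods=4, freq="B"),
)

# pyfolio wants a datetime-indexed series of daily percentage returns
daily_returns = account_value.pct_change().dropna()
print(daily_returns.round(4).tolist())
```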

References:

A2C:

Volodymyr Mnih, Adrià Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. The 33rd International Conference on Machine Learning (02 2016). https://arxiv.org/abs/1602.01783

DDPG:

Timothy Lillicrap, Jonathan Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR) 2016 (09 2015). https://arxiv.org/abs/1509.02971

PPO:

John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. 2015. Trust region policy optimization. In The 31st International Conference on Machine Learning. https://arxiv.org/abs/1502.05477

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv:1707.06347 (07 2017). https://arxiv.org/abs/1707.06347
