Predicting the price of Bitcoin with multivariate Pytorch LSTMs

Using multivariate, multi-output forecasting models for financial data

Charlie O'Neill
14 min read · Jan 25, 2022

In a previous post, I went into detail about constructing an LSTM for univariate time-series data. This itself is not a trivial task; you need to understand the form of the data, the shape of the inputs that we feed to the LSTM, and how to recurse over training inputs to produce an appropriate output. This knowledge is fantastic for analysing curves where the only information we have is the past values of the curve itself. For instance, we can use univariate LSTMs to learn trigonometric curves, and plot them well outside the domain of training values.

Whilst it is all well and good to take one array of data, representing a sequence of values of a function over time, and attempt to predict future values, this often doesn’t match the quantity of data available in the real world. More likely, the data we receive will be in tabular form, with each row representing a single time step. Each row probably consists of several different columns; these are the features, or predictors, that we have information about at each time step. One of these columns is not a feature, however. It is the target we want to predict. Consequently, we need a way to feed in these multiple values at each time step to our LSTM, and to produce a singular output representing the prediction at the next time step in return. In this way, we construct a multivariate LSTM.

Introduction: predicting the price of Bitcoin

For this problem, we’re going to focus on financial data. Before we begin, I would like to point out that LSTMs will not make you rich, even if they are excellent forecasters for time-series data. No model will make you rich; there’s a whole body of thinking on epistemic humility and how hard it is for anything you do to detect a genuine edge in the market, which I won’t go into here. But know this: if you’ve found an inconsistency in the price of a stock (it’s too low, or too high, and you want to capitalise on that), and you believe that no-one else has spotted this inconsistency, then you might want to rethink your certainty. Recall that “no-one else” in this scenario includes the team of PhDs sitting in Goldman Sachs with a supercomputing cluster, the lowest latency in information and order processing in the market, and multi-billion dollar financial incentives to find exactly the same inconsistencies as you. If you really believe you’ve beaten them, you haven’t. For more detail on this line of thinking, see Eliezer Yudkowsky’s Inadequate Equilibria.

We’ll frame our problem as follows. We have historical price data for Bitcoin, which includes the following predictors for each day (where we have daily time steps):

  • Opening price
  • High price
  • Low price
  • Volume traded

Our goal is to take some sequence of the above four values (say, for 100 previous days), and predict the target variable (Bitcoin’s price) for the next 50 days into the future.

Preprocessing the data

We begin by importing the data and quickly cleaning it. Fortunately, financial data is readily available online. We will use Yahoo historical prices for Bitcoin, available back to September 17, 2014. This data is available here. Import the data using Pandas and have a look.
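As a sketch of this step (assuming the Yahoo Finance export is saved locally as BTC-USD.csv, with the usual Date, Open, High, Low, Close, Adj Close and Volume columns), the import looks something like this:

import pandas as pd

# Assumed filename and column layout from the Yahoo Finance CSV export
df = pd.read_csv("BTC-USD.csv", index_col="Date", parse_dates=True)
print(df.head())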

Note: you can download historical data from Yahoo for almost any financial instrument. Interested in Tesla shares? Recreate this LSTM to see if you should buy or sell in the coming month.

The data structure for our Bitcoin prices dataframe. (Image by author)

At the bare minimum, your exploratory data analysis should consist of plotting the target variable of interest. (Some people will argue that you should do a lot more than this, such as regressing the target variable on the predictors, and looking for linear relationships between the variables.) Let’s plot the Bitcoin price over time to see what we’re actually trying to predict.
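A minimal version of that plot, assuming the dataframe from above and the Close column as the target:

import matplotlib.pyplot as plt

# Plot the target variable (the closing price) over time
df["Close"].plot(figsize=(12, 6), title="Bitcoin closing price (USD)")
plt.ylabel("Price (USD)")
plt.show()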

(Image by author)

Interesting. We want a realistic emulation of what would happen in the real world. That is, we want to take a few years of historical data and train an LSTM to predict what will happen to the price of Bitcoin over the next few months.

Setting inputs and outputs

Recall that our predictors will consist of all the columns except our target, closing price. Note that we want to use an sklearn preprocessor below, which requires reshaping the array if it consists of a single feature, as our target does. Hence, for the target y, we have to call values, which removes the axes labels and will allow us to reshape the array.
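In code, this split of predictors and target might look like the following (column names assumed from the Yahoo export; only the four predictors listed above are kept):

# Predictors: the four features we listed above; target: the closing price
X = df[["Open", "High", "Low", "Volume"]]
y = df["Close"].values.reshape(-1, 1)  # .values drops the labels so we can reshape for sklearn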

We now have the task of standardising our features. We’ll use standardisation for our training features X by removing the mean and scaling to unit variance. Standardisation helps the deep learning model to learn by ensuring that parameters can exist in the same multi-dimensional space; it wouldn’t make much sense to have the weights have to change their size simply because all the variables have different scales. For our target y, we will scale and translate each feature individually to between 0 and 1. This transformation is often used as an alternative to zero mean, unit variance scaling.
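A sketch of that preprocessing, using sklearn’s StandardScaler for the features and MinMaxScaler for the target (the names ss and mm are my own shorthand):

from sklearn.preprocessing import StandardScaler, MinMaxScaler

ss = StandardScaler()  # zero mean, unit variance for the features
mm = MinMaxScaler()    # squashes the target into [0, 1]

X_trans = ss.fit_transform(X)
y_trans = mm.fit_transform(y)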

Now comes the slightly fiddly part. We want to feed in 100 samples, up to the current day, and predict the next 50 time step values. To do this, we need a special function to ensure that the corresponding indices of X and y represent this structure. Examine this function carefully, but essentially it just boils down to getting 100 samples from X, then looking at the 50 next indices in y, and patching these together. Note that because of this we'll throw out the first 50 values of y.
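One way to write such a windowing function is sketched below (the exact indexing in the original may differ slightly, but the idea is the same: pair each 100-step block of features with the 50 target values that immediately follow it):

import numpy as np

def split_sequence(features, target, n_steps_in, n_steps_out):
    # Slide a window of n_steps_in feature rows over the data and pair it
    # with the n_steps_out target values that come immediately after it.
    X, y = [], []
    for i in range(len(features)):
        end_ix = i + n_steps_in            # end of the input window
        out_end_ix = end_ix + n_steps_out  # end of the output window
        if out_end_ix > len(features):     # stop once the output window runs off the data
            break
        X.append(features[i:end_ix])
        y.append(target[end_ix:out_end_ix, 0])
    return np.array(X), np.array(y)

X_mm, y_mm = split_sequence(X_trans, y_trans, n_steps_in=100, n_steps_out=50)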

Let’s check that the first sample in y_mm indeed starts at the 100th sample in the original target y vector.
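With the sketch above, that check amounts to:

# The first target window should start at index 100 of the original (scaled) target
print(y_mm[0, 0] == y_trans[100, 0])  # expect True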

Above, we mentioned that we wanted to predict the data several months into the future. Thus, we’ll use a training data size of 95%, with 5% left for the remaining data that we’re going to predict. This gives us a training set size of 2763 days, or about seven and a half years. We will predict 145 days into the future, which is almost 5 months.
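A sketch of that split (the 95% cutoff is applied to the windowed samples; variable names are my own):

total_samples = len(X_mm)
train_test_cutoff = round(0.95 * total_samples)

X_train, X_test = X_mm[:train_test_cutoff], X_mm[train_test_cutoff:]
y_train, y_test = y_mm[:train_test_cutoff], y_mm[train_test_cutoff:]
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)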

We now need, as usual, to convert our data into tensors. This is fairly easy — we do so by calling torch.tensor() on our object, and setting the property requires_grad=True. Some old Pytorch tutorials might have you believe that we need to apply the Variable wrapper here. However, Variable is deprecated: a plain tensor now supports automatic backpropagation (through backward()) without being wrapped.
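Converting the arrays to tensors might look like this (strictly speaking, requires_grad=True on the inputs isn’t needed for backpropagation through the model’s parameters, but it mirrors the description above):

import torch

X_train_tensors = torch.tensor(X_train, dtype=torch.float32, requires_grad=True)
X_test_tensors = torch.tensor(X_test, dtype=torch.float32, requires_grad=True)
y_train_tensors = torch.tensor(y_train, dtype=torch.float32)
y_test_tensors = torch.tensor(y_test, dtype=torch.float32)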

If we look at the documentation for the multi-layer torch.nn.LSTM, we see that the input shape depends on whether the parameter batch_first is true. Since we are accustomed to having the first dimension of our data be the batch, we will set batch_first to true. The size of the input is then (N, L, H_in), where N is the batch size, L is the sequence length, and H_in is the input size (i.e. the number of features). In other words, we want the windowed samples in the first dimension, the length of each input sequence in the second dimension, and the features (of which we have four) in the final dimension.

To reshape the tensors into our required shape, we use torch.reshape. This takes as arguments the tensor we are reshaping, and then a tuple of the shape we need. The number of windowed samples stays in the first dimension, X_train_tensors.shape[0]. Unlike the univariate case (where we fed in a sequence length of just 1 and let the memory be handled by the inner workings of the LSTM cell; see my previous article), here each sample is a window of 100 time-steps, so the sequence length is 100. Finally, we want the last dimension to be the number of features, which is stored in X_train_tensors.shape[2].
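A sketch of the reshape, assuming the windowed tensors already have shape (samples, 100, 4), so this mostly just makes the (N, L, H_in) layout explicit:

# (batch, sequence length, number of features) for a batch_first nn.LSTM
X_train_tensors_final = torch.reshape(
    X_train_tensors, (X_train_tensors.shape[0], 100, X_train_tensors.shape[2]))
X_test_tensors_final = torch.reshape(
    X_test_tensors, (X_test_tensors.shape[0], 100, X_test_tensors.shape[2]))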

One more thing we want to check: the data logic of the test set. Sequential data is hard to get your head around, especially when it comes to generating a test-set for multi-step output models. Here, we want to take the 100 previous predictors up to the current time-step, and predict 50 time-steps into the future. In the test set, we have 150 batch feature samples, each consisting of 100 time-steps and four feature predictors. In the targets for the test set, we again have 150 batch samples, each consisting of an array of length 50 of scalar outputs.

Since we want a way to validate our results, we need to predict the Bitcoin price for 50 time steps in the test set for which we have the data (i.e. the test targets). Because of the way we wrote split_sequence() above, we simply need to take the last sample of 100 days in X_test, run the model on it, and compare these predictions with the last sample of 50 days of y_test. These correspond to a period of 100 days in X_test's last sample, followed immediately by the next 50 days in the last sample of y_test.

We want to check that the 50 values we will be predicting match the last 50 values of y in the test set.
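Under the windowing sketch above, that check amounts to:

# The final target window should be the last 50 scaled closing prices in the data
print(np.allclose(y_test[-1], y_trans[-50:, 0]))  # expect True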

Perfect, the first four rows of our data are as they should be. Note that X_check[-1] should be identical to X.iloc[-149:-49], ending 50 days before the end of our dataset. So we are taking the 100 time-steps of information, up to the 26th of November 2021, and attempting to predict the 50 days after that, up to the 14th January 2022. For one final check, we make sure that the final batch sample in our test targets matches these dates for prediction.

You can remove the .values from the code above to check for yourself that the dates match.

That was a lot of indexing and checking. To summarise, the main performance test for our model will be on the last batch sample in the test set. This will consist of predictors from the 100 time-steps up to the 26th November 2021, and this information will be used by our model to predict the next 50 days of Bitcoin prices, up to the 14th January 2022. In this way, we will validate model performance by comparing predictions to the actual prices in that 50 day window.

LSTM model

Now we need to construct the LSTM class, inheriting from nn.Module. In contrast to our previous univariate LSTM, we're going to build the model with nn.LSTM rather than nn.LSTMCell. This is for two reasons: firstly, it's nice to be exposed to both so that we have the option; secondly, we don't need the flexibility that nn.LSTMCell provides. We know that nn.LSTM is essentially just a recurrent application of nn.LSTMCell. Thus, we would only use nn.LSTMCell if we wanted to apply other transformations in between the LSTM layers, such as batch normalisation and dropout. Here, however, we can apply dropout automatically using the dropout parameter in nn.LSTM, and we've already standardised our data. Thus, there isn't much reason to use the more fiddly nn.LSTMCell.

As per usual, we’ll present the entire model class first, and then break it down line by line.
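Since the original gist isn’t reproduced here, the following is a minimal sketch of such a class; the layer names match the snippets below, but the forward pass is my own reconstruction rather than the article’s exact code.

import torch
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, num_classes, input_size, hidden_size, num_layers):
        super().__init__()
        self.num_classes = num_classes  # number of output values per forward pass
        self.input_size = input_size    # number of features per time-step
        self.hidden_size = hidden_size  # neurons in each LSTM layer
        self.num_layers = num_layers    # number of stacked LSTM layers

        self.lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.fc_1 = nn.Linear(hidden_size, 128)   # fully connected
        self.fc_2 = nn.Linear(128, num_classes)   # fully connected last layer
        self.relu = nn.ReLU()

    def forward(self, x):
        # fresh hidden and cell states for each batch
        h_0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        c_0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        _, (h_n, _) = self.lstm(x, (h_0, c_0))
        out = self.relu(h_n[-1])          # hidden state of the last LSTM layer
        out = self.relu(self.fc_1(out))   # first fully connected layer
        return self.fc_2(out)             # one value per future time-step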

In our initialisation, as usual, we initialise our parent class, nn.Module. Most initialisations in a Pytorch model are separated into two distinct chunks:

  1. Any variables that the class will need to reference, for things such as hidden layer size, input size, and number of layers.
  2. Defining the layers of the model (without connecting them) using the variables instantiated above.

This is exactly what we do here. If you carefully read over the parameters for the LSTM layers, you'll see that we need to shape the LSTM with input size, hidden size, and number of recurrent layers. For instance, setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM, with the second LSTM taking in outputs of the first LSTM and computing the final results. Thus, we initialise these three variables in the first part of the __init__. We also need to tell our model how many values to output per forward pass, so we also initialise a num_classes variable.

In the second part of __init__, we set out the layers of our network. Our first layer is obviously a recurrent application of LSTM cells, with all the parameters specified above.

self.lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, 
num_layers=num_layers, batch_first=True)

Next, we pass this to a fully connected layer, which has an input of hidden_size (the size of the output from the last LSTM layer) and outputs 128 activations. Then, we pass these 128 activations to another hidden layer, which evidently accepts 128 inputs, and which outputs our num_classes values (in our case, one value per day of the 50-day forecast horizon; see below). We also define a ReLU non-linearity, which is applied between these layers in the forward pass.

self.fc_1 =  nn.Linear(hidden_size, 128) # fully connected
self.fc_2 = nn.Linear(128, num_classes) # fully connected last layer
self.relu = nn.ReLU()

Training

In our univariate example, we had a complicated loss function, because we said that the loss topology in sequential data is usually characterised by lots of long, flat valleys. Here, we’re going to keep things simple, and see what performance we can get with a typical regression loss function like mean-squared error. This also has the added benefit that we don’t have to restructure the training loop so as to include the callable closure in our parameter update. (If you don't know what I'm talking about, don't worry.)

Let’s very quickly recap what’s going on here. We do the forward pass through our network by passing in our training tensors, which we shaped appropriately before. We zero out the gradients currently stored in the Pytorch computational graph, compare the outputs from the forward pass with the actual training targets using our loss function, and backpropagate using loss.backward() to calculate the gradients of the loss with respect to the parameters (our weights and biases). We then step the optimiser to update the parameters.
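A sketch of that training loop (the periodic test-loss printout is my own addition for monitoring):

def training_loop(n_epochs, lstm, optimiser, loss_fn,
                  X_train, y_train, X_test, y_test):
    for epoch in range(n_epochs):
        lstm.train()
        outputs = lstm(X_train)           # forward pass on the training windows
        optimiser.zero_grad()             # clear gradients from the previous step
        loss = loss_fn(outputs, y_train)  # compare predictions with the targets
        loss.backward()                   # gradients of the loss w.r.t. the parameters
        optimiser.step()                  # update the weights and biases

        if epoch % 100 == 0:
            lstm.eval()
            with torch.no_grad():
                test_loss = loss_fn(lstm(X_test), y_test)
            print(f"Epoch {epoch}: train loss {loss.item():.5f}, "
                  f"test loss {test_loss.item():.5f}")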

Setting the appropriate variables below, we can create an instance of our LSTM model.
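For example (the hidden size, number of layers and learning rate below are illustrative values, not necessarily the ones used for the plots in this article):

n_steps_out = 50      # forecast horizon: 50 days
input_size = 4        # Open, High, Low, Volume
hidden_size = 2       # illustrative value
num_layers = 1        # a single LSTM layer
learning_rate = 0.001

lstm = LSTM(num_classes=n_steps_out, input_size=input_size,
            hidden_size=hidden_size, num_layers=num_layers)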

We use MSE as our loss function, and the well-known Adam optimiser.

loss_fn = torch.nn.MSELoss()    # mean-squared error for regression
optimiser = torch.optim.Adam(lstm.parameters(), lr=learning_rate)

Let’s train for 1000 epochs and see what happens. Recall from the previous article that a key part of LSTM debugging is visual cues. Here, our training is fast enough that we can just plot the result at the end, and if it’s off, we can change our parameters and run it again.
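Using the training-loop sketch from above:

training_loop(n_epochs=1000, lstm=lstm, optimiser=optimiser, loss_fn=loss_fn,
              X_train=X_train_tensors_final, y_train=y_train_tensors,
              X_test=X_test_tensors_final, y_test=y_test_tensors)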

Prediction

A key part of prediction, if the variables have been standardised, is knowing what we need to pass to the model. That is: what do we need to standardise before passing it to the model? Well, we can answer this question easily: we just apply the same transforms to the predictors and targets so that whatever we feed to the model is exactly what the model is used to seeing. It wouldn’t make sense to not standardise our inputs; the model would be confused as to why they no longer have zero mean and unit variance.
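A sketch of running the model over the whole dataset: we transform the full feature and target arrays with the scalers fitted earlier, window them exactly as before, and map the predictions back to dollars with the inverse of the MinMax scaling (all variable names below are my own):

# Apply the same fitted transforms to the full dataset
df_X_ss = ss.transform(X)
df_y_mm = mm.transform(y)

X_all, _ = split_sequence(df_X_ss, df_y_mm, n_steps_in=100, n_steps_out=50)
X_all_tensors = torch.tensor(X_all, dtype=torch.float32)

lstm.eval()
with torch.no_grad():
    preds = lstm(X_all_tensors).numpy()

# Undo the MinMax scaling so the predictions are in dollars again
preds_dollars = mm.inverse_transform(preds.reshape(-1, 1)).reshape(preds.shape)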

Let’s plot our results for the whole dataset.

(Image by author)

Although this may seem fantastic, it’s not quite as good as it seems. The plot above, where the test data is any time step beyond the red dashed line, seems to suggest that our model is amazingly accurate at predicting the price of Bitcoin quite a few months into the future. However, what’s happening here is a form of data leakage: information about the test targets has leaked into our test features. That is, whenever we run the model, it has access to the Open price for that day, which is obviously going to be extremely close to the price Bitcoin ends up at for that particular time-step. When we loop over the model’s predictions, we only keep the first predicted value for each time-step. This means that every prediction we are plotting here has the benefit of being the next predicted value, rather than a sequence of 50 values into the future, like we set up in our problem definition.

If we really want to see how our model is performing, we have to feed it the 100 time-steps of features before the final value in the test set, and then use one model forward pass to calculate the 50 time-step prices in the test set. Recall from above that we double-checked that our test features in the final batch sample of the test set were the 100 time-steps up to the time-step we’re attempting to predict. That is, we feed the model 100 days of information up to the 26th November 2021, and get it to predict 50 days of Bitcoin prices, from this day to the 14th January 2022.
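In code, that single forward pass on the last test window might look like this:

# 100 days of features up to 26 November 2021, shape (1, 100, 4)
last_window = X_test_tensors_final[-1].unsqueeze(0)

lstm.eval()
with torch.no_grad():
    forecast = lstm(last_window).numpy()

# Map the 50 predicted values back to dollars
forecast_dollars = mm.inverse_transform(forecast.reshape(-1, 1)).ravel()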

(Image by author)

This is good. If we feed in the last 100 days of information, our model successfully predicts a steady decline in the price of Bitcoin over the next 50 days. For one last plot, let’s put this in the context of the scale of the full dataset.

(Image by author)

So, if we’d run our model on 26th November 2021, we would have been correct in selling off our Bitcoin. The model correctly predicts a price drop, as well as the rate at which it drops.

Here’s the amazing part, which took me a while to process. This whole orange curve is generated without looking at any target data. Yes, to be fair, the model is very familiar with all the actual targets before the dashed red line. However, once we move past the training set into the test set, the model has no idea what the test targets are. It has to generate its predictions based on the input features alone. It’s quite remarkable, then, that our orange prediction curve so closely matches the actual price.

Conclusion

Interestingly, there’s surprisingly little information available on how to construct multi-step output LSTM models for multivariate time-series data. Hopefully, this article gave you both the intuition and technical understanding for building your own forecasting models. Just remember to think through your input and output shapes very carefully, and construct tensors that represent past data predicting future data. Pytorch’s LSTM class will take care of the rest, so long as you know the shape of your data.

In terms of next steps, I would recommend running this model on the most recent Bitcoin data from today, extending back to 100 days previously. See what the model thinks will happen to the price of Bitcoin over the next 50 days. You could also play with the time being fed to the model and the time being forecast; try for longer periods and see if the model can pick up on longer-term dependencies. Finally, you should note that these types of LSTMs are not the only solution to these multivariate, multi-output forecasting problems. There are many other deep learning solutions, including encoder-decoder networks for variable-length sequences, that you should look into. Saying that, LSTMs are a great place to start, and can give incredible performance if you know how to utilise them.


Charlie O'Neill

Researcher at Macuject, ANU. Interests include integration of deep learning, causal inference and meta-learning. Twitter: @charles0neill.