Adding artificial intelligence to your investing strategy: part 3

Machine Learning for stock selection: an application of Stochastic Gradient Descent on time-series data

Patrick Collins
8 min read · Mar 9, 2020

Today we look at a supervised binary classifier and how it can be used for stock selection. A binary classifier is a function that sorts a list of elements into two groups depending on their features.
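Before we get to stocks, here is a toy example of the idea. The function and threshold are made up purely for illustration: any rule that maps features to one of two groups is a binary classifier.

```python
# toy binary classifier: split numbers into "positive" and "non-positive"
def classify(values):
    return [v > 0 for v in values]

print(classify([3, -1, 0.5]))  # [True, False, True]
```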

Code and samples to do this are below.

Why would we want to do this?

Continuing the series from the Alpha Vantage blog on AI, we are just slamming data into ML models and seeing what happens. Or that’s what we would be doing if we didn’t have a systematic reason for going about it.

When people look at stocks, the simplest and most fundamental question they ask is often “Is this a buy or a sell?”, which can conveniently be framed as a binary question: “Is this a yes or a no?”

If we teach a computer to classify stocks as “buys” (yes) or “sells” (no) based on historical data, how good would it be at finding those buys?

Now, obviously the method we choose to teach the computer matters. For this exploratory article, we are going to look at Stochastic Gradient Descent, which in my opinion should be called “randomized multi-variable best fit lines.” If you know why I think that or can follow my logic, please agree with me in the comments section (or even better, disagree with me!)
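To see where the “randomized best fit line” intuition comes from, here is a minimal, hand-rolled sketch of SGD fitting a line y = w·x + b. The data, learning rate, and step count are all made up for illustration; the point is just that each update uses a single randomly chosen sample (the “stochastic” part) and nudges the fit in the direction that reduces that one sample’s error.

```python
import random

# synthetic, noiseless data drawn from the line y = 2x + 1
random.seed(0)
data = [(x / 10, 2 * (x / 10) + 1) for x in range(-20, 21)]

w, b, lr = 0.0, 0.0, 0.05  # slope, intercept, learning rate
for _ in range(2000):
    x, y = random.choice(data)   # stochastic: one random sample per step
    err = (w * x + b) - y        # prediction error on that sample
    w -= lr * err * x            # gradient step for the slope
    b -= lr * err                # gradient step for the intercept

print(round(w, 2), round(b, 2))  # should land near the true (2, 1)
```

The scikit-learn classifier we use later does the same kind of per-sample updates, just on many features at once and with a classification loss instead of squared error.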

This video did an AMAZING job helping me understand this concept (and please watch the prerequisites as well!) So take a look if you want to learn more about the actual ML mechanism we are using.

Let’s get started.

1. Data collection

The first step for every ML process is to collect your data. To keep it simple, we are going to look at all available NASDAQ tickers from the past 20 years. So first, we need a list of tickers to pull from. (All code is in Python.) We can get that list from http://ftp.nasdaqtrader.com/dynamic/SymDir/nasdaqlisted.txt

import requests

url = "http://ftp.nasdaqtrader.com/dynamic/SymDir/nasdaqlisted.txt"
response = requests.get(url)
# the file is pipe-delimited; slice off the header row and the
# trailing "File Creation Time" footer line
tickers = [line.split("|")[0] for line in response.text.split("\n")][1:-2]

Great, now that we have a list of tickers, it’s time to pull historical data. Luckily, we can tap into Alpha Vantage to grab a dataset. In this step we grab all 3,500 tickers using multithreading. If you don’t have a key that can make that many API calls, you can get a free key from Alpha Vantage with 5 API calls/minute and just run with a subset of the tickers.
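The original code block isn’t reproduced here, but a minimal sketch of that multithreaded pull might look like the following. The endpoint, query parameters, and the "Time Series (Daily)" JSON key come from Alpha Vantage’s TIME_SERIES_DAILY_ADJUSTED documentation; API_KEY and the 16-worker pool size are placeholders you would tune to your own key’s rate limit.

```python
import requests
from concurrent.futures import ThreadPoolExecutor

API_KEY = "YOUR_ALPHA_VANTAGE_KEY"  # placeholder: substitute your own key
URL = "https://www.alphavantage.co/query"

def fetch(ticker):
    # outputsize=full returns the complete (20 year) daily history
    params = {"function": "TIME_SERIES_DAILY_ADJUSTED", "symbol": ticker,
              "outputsize": "full", "apikey": API_KEY}
    series = requests.get(URL, params=params).json().get("Time Series (Daily)")
    return ticker, series

def fetch_all(tickers, fetch_one=fetch, workers=16):
    # run many requests in parallel; network-bound work threads well
    dataset = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for ticker, series in pool.map(fetch_one, tickers):
            if series:  # skip tickers the API returned no data for
                dataset[ticker] = series
    return dataset
```

Calling `fetch_all(tickers)` gives back a dict mapping each symbol to its daily time series.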

Boom! Now we have a dataset object that has all 3,500 tickers with 20 years of historical data to pick from.

2. Clean and prepare the data

We have this massive dataset which we can gather insights from, but the model we are going to use (scikit-learn SGD) can only take the data as numpy ndarrays, and we haven’t given it any information about how to learn. We need to tell it what would be considered a buy, and what would be considered a sell.

To keep it simple and explicit, we are going to go with the question “Will this ticker be up or down exactly one year from now?” If up, it’s a buy; if down, it’s a sell.

BUT WAIT!!!!

We ALWAYS want to split our data at least into a test set and a training set. Luckily, scikit-learn gives us an easy way to do that at random.

from sklearn.model_selection import train_test_split
random_state = 43
train_set, test_set = train_test_split(dataset, test_size=0.25, random_state = random_state)

A good rule of thumb is somewhere between an 80/20 and a 70/30 train/test split; we are going to go with 75/25 (more information on that here). We don’t want to use ALL of our data to train the model, because then we may overfit and end up with a model that works well ONLY on our specific set of data. We want to hold some data back to test on.

We have to treat data like a limited resource in this sense.

In order to plug our training data into the SGD model, we need to convert it to numpy arrays. Below is a crude (but straightforward) way to accomplish this. We are going to look only at tickers that have been listed for over a year, and we will label as a “buy” (the boolean True) every ticker that has gone up from Feb. 7th, 2019 to Feb. 7th, 2020. All those that have gone down are False.

AAPL would be considered a “buy”

This labeling is our y_set: a 1D numpy array with n_samples entries. We have 2,392 valid tickers out of our original 3,500 (we did a little work ahead of time to find that out), so it will be an array of length 2,392. Our x_set, meanwhile, will be a 2D array that is n_samples by n_features in size (number of tickers by number of features).

Our features in this case are the open, high, low, close, volume, adjusted close, dividend, and split values for each date. So each ticker has a maximum of roughly 40,000 features (20 years of weekdays times 8 fields per day). A feature in a machine learning model is an observation that can be measured and used to help teach the model.
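A sketch of that conversion might look like the code below. It assumes the dataset is shaped like Alpha Vantage’s TIME_SERIES_DAILY_ADJUSTED response (a dict of ticker → date → bar, with field names like "5. adjusted close"); the zero-padding to a fixed width and the `n_features` default are my own crude choices, not the article’s exact code.

```python
import numpy as np

# field names as they appear in Alpha Vantage's daily adjusted payload
FIELDS = ["1. open", "2. high", "3. low", "4. close", "5. adjusted close",
          "6. volume", "7. dividend amount", "8. split coefficient"]

def build_xy(dataset, start="2019-02-07", end="2020-02-07", n_features=41600):
    xs, ys = [], []
    for ticker, series in dataset.items():
        if start not in series or end not in series:
            continue  # skip tickers that haven't been listed the full year
        # flatten every day's 8 fields into one long feature vector
        flat = []
        for date in sorted(series):
            flat.extend(float(series[date][f]) for f in FIELDS)
        # crude: zero-pad (or truncate) so every row has the same length
        flat = (flat + [0.0] * n_features)[:n_features]
        xs.append(flat)
        # label: True ("buy") if the adjusted close rose over the year
        ys.append(float(series[end]["5. adjusted close"])
                  > float(series[start]["5. adjusted close"]))
    return np.array(xs), np.array(ys)
```

Running `x_set, y_set = build_xy(dataset)` yields the n_samples × n_features matrix and the matching 1D label array described above.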

This list of features is what the model is going to be using to help classify whether or not a future arbitrary stock is a buy or a sell.

Whew! Simple as that! But now we are ready to plug the data into our classifier of choice. Scikit has several models you can plug into, and as stated, we are going to use the Stochastic Gradient Descent model.

3. Choose and train the model

Often, cleaning data and coming up with an idea is the hardest part, so you can high five yourself as we go onto seeing how this model does!

# pick a classifier and train it
import numpy as np
from sklearn.linear_model import SGDClassifier

# tol=-np.inf disables early stopping so all 5 passes run
sgd_clf = SGDClassifier(max_iter=5, tol=-np.inf, random_state=random_state)
sgd_clf.fit(x_train_set, y_train_set)

Simple as that.

If only everything in life had a command like artificial_intelligence_model.fit(data)

4. Evaluate the model

Great, now let’s see if it would have done a good job picking stocks….

We can do a quick test just to see how much our model would get right, just on the training data that we gave it:

correct = 0
for x in range(len(x_train_set)):
    if sgd_clf.predict([x_train_set[x]]) == y_train_set[x]:
        correct = correct + 1
print(correct / len(x_train_set))

We get: 0.5731605351170569

Ok, not bad! Generally, stock picking is a 50/50 shot anyway, so looking at the training data alone, it seems we may have a model that can generate some signals for us!… Or so it would seem, but we remembered to set aside test data, and there are a few more sophisticated ways to evaluate the model.

Using our ticker_map, we can do a little more surface-level exploration to see how specific tickers were evaluated:
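The original exploration code isn’t shown here, but the idea can be sketched as follows. Everything in this block is a stand-in: ticker_map (symbol → row index), the tiny random feature matrix, and the labels are all invented so the snippet runs on its own; in the real pipeline you would index into the actual x_set with the trained sgd_clf.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# stand-in data: 6 tickers with 8 random features each (illustration only)
rng = np.random.default_rng(43)
tickers = ["TSLA", "AAPL", "GOOGL", "MSFT", "AMZN", "NFLX"]
x_set = rng.normal(size=(6, 8))
y_set = np.array([True, True, True, False, False, False])

# hypothetical ticker_map: symbol -> row index into the feature matrix
ticker_map = {t: i for i, t in enumerate(tickers)}

clf = SGDClassifier(random_state=43).fit(x_set, y_set)
for symbol in ("TSLA", "AAPL", "GOOGL"):
    verdict = "buy" if clf.predict([x_set[ticker_map[symbol]]])[0] else "sell"
    print(symbol, "->", verdict)
```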

It gave TSLA, AAPL, and GOOGL a True result. Since it said we should buy TSLA, that means this is a golden algorithm and should be deployed to production immediately…. Did I mention I like Tesla?


To get a little more into the details, we want to split the test data up into different sections, see how each section does, and then average the results. That way we can spot outliers more quickly. One way to do this split is built into scikit:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(sgd_clf, x_test_set, y_test_set, cv=3, scoring="accuracy")
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
>>> Accuracy: 0.54 (+/- 0.02)

This little chunk of code split the test data 3 ways and took the mean of the accuracy across the splits. Generally, when looking at the performance of an AI model, we want to look at more than just “how many it got right vs how many it got wrong,” otherwise known as the accuracy. Maybe we want to know how many buys we labeled sells, and how many sells we labeled buys. To view this we can check out a confusion matrix.

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
y_test_predict = cross_val_predict(sgd_clf, x_test_set, y_test_set, cv=3)
confusion_matrix(y_test_set, y_test_predict)

This will give us an array of (in order from left to right, top to bottom) correctly predicted sells (true negatives), incorrectly predicted buys (false positives), incorrectly predicted sells (false negatives), and correctly predicted buys (true positives).
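The four cells can also be unpacked directly with `.ravel()`. Here is a self-contained example on toy labels (the five True/False values are made up just to show the mechanics):

```python
from sklearn.metrics import confusion_matrix

y_true = [True, True, False, False, True]  # actual outcomes
y_pred = [True, False, False, True, True]  # model's calls

# sklearn orders the matrix [[tn, fp], [fn, tp]], so ravel() unpacks it
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 1 1 1 2
```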

We can now gain some more insight into this matrix.

The other three important performance measures are precision, recall, and F1, which you can read more about here, and they all have scikit integrations as well!
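For completeness, here is what those scikit integrations look like, reusing the same toy labels from the confusion-matrix example above (the values themselves are invented for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [True, True, False, False, True]
y_pred = [True, False, False, True, True]

# precision: of the predicted buys, how many actually went up?
print(precision_score(y_true, y_pred))  # 2/3
# recall: of the actual buys, how many did the model catch?
print(recall_score(y_true, y_pred))     # 2/3
# F1: the harmonic mean of precision and recall
print(f1_score(y_true, y_pred))
```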

5. Reflect, reevaluate, and change for optimizations

This step could go back to square one and change everything.

A few thoughts on what we could do to improve this example:

  1. Change the input data so that every position in the arrays mapped to a date as opposed to just throwing all the days in sequence.
  2. Create a multi-class classifier by adding multiple binary classifiers together of different time ranges other than one year.
  3. Change the start date of the one-year window.
  4. Run the one-year window starting from every day over the past 20 years.
  5. Figure out how to include tickers that haven’t been listed for a year.

The list goes on and on…..

Do you think this could be a way to generate alpha? 54% accuracy isn’t something to write home about, but casinos have as little as a 1% advantage in some games… Maybe worth some more investigation.

In any case, as always, be sure to check out the ageron/handson-ml python GitHub for more info, examples, and stats, as well as the book.

Want to learn more?

Follow Alpha Vantage on Medium and see the tutorials that are coming out, with content like blockchain applications, machine learning with python, hackathons, alpha generation, platform synergies, and a ton of other helpful content.

Get a Free API Key and start downloading historical data, pricing quotes, and more on FX, stock market, and cryptocurrency data. Want more data? We have premium API keys that have even more bandwidth.

You can also reach us on Slack, Twitter, or Discord.

#investing #machinelearning #AI #stockapi #fintech
