Predictive text is something we use every day: in text messages, emails and anywhere else we write. It has become so commonplace that it’s an expected feature across most platforms. Given its ubiquity, I was curious to see how hard it would be to create my own predictive text model. I wasn’t expecting it to be perfect, since not even the top predictive text models are, but I thought it would be a fun project to learn more about machine learning and natural language processing.
This tutorial will show you how to make a simple text-prediction model based on Jane Austen’s “Pride and Prejudice.” Because the training data comes from a romance novel, I expected the resulting predictive text to skew heavily toward the themes of love and marriage. We’ll test this theory at the end.
The interface for this will be an API. It will take prediction queries and return the three most likely words to come afterward. We’ll deploy the API to the cloud so that it can be used as a third-party API in other applications, like a chat app or anywhere else you might want to embed predictive text. If you want to skip to the end and deploy it yourself, you can find the full source code in our examples repo.
Python will be the base language because the machine learning (ML) ecosystem around Python is fantastic, enabling us to use tools like TensorFlow, Keras and NumPy. TensorFlow acts as a kind of ML toolbox. There are plenty of options, but we should use the right tool for the job.
Using a bidirectional long short-term memory recurrent neural network (Bi-LSTM) is ideal for our predictive text problem. This type of neural network, apart from having a very long name, also has the unique capability to store both long- and short-term context. We’ll need short-term memory to store the previous words in the sentence and long-term memory to store the context of how these words have been used in previous sentences.
Step 1 – Set up the Project
We’ll use the following tools:
- Pipenv for simplified dependency management
- The Nitric CLI for simple cloud backend infrastructure
- (optional) Your choice of an AWS, Google Cloud Platform (GCP) or Microsoft Azure account.
Start by creating a new project for our API.
nitric new prediction-api python-starter
Then open the project in your editor of choice and resolve dependencies using Pipenv.
pipenv install --dev
Step 2 – Prepare the Data Set
Project Gutenberg provides the “Pride and Prejudice” text, so you can download the file from there, or you can use the precleaned data from our example repo, which is the recommended approach. This text file will form the basis of our training data and will give our predictions a Jane Austen spin.
Before we begin training our model, we want to make sure we explore and preprocess the training data to clean it up for quality training. Looking through the “Pride and Prejudice” text, we find that Project Gutenberg adds a header and a footer to the data. There are also volume headers, chapter headings, punctuation and contractions that we’ll remove. We’ll also convert all the numbers to words — “8” to “eight.” This cleanup allows our training data to be as versatile as possible, so predictions will be more cohesive.
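The example repo ships the cleaned file ready to go, but as a rough sketch, the cleanup might look something like the following. The Gutenberg marker strings, file names and the num2words package (for the digit-to-word conversion) are my own choices here, not part of the original example:

```python
import re

from num2words import num2words  # one option for digit-to-word conversion: pipenv install num2words

# Illustrative Gutenberg markers; the exact strings vary by release.
START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"
END_MARKER = "*** END OF THE PROJECT GUTENBERG EBOOK"

def clean_text(raw: str) -> str:
    # Strip the Project Gutenberg header and footer.
    start = raw.find(START_MARKER)
    end = raw.find(END_MARKER)
    if start != -1 and end != -1:
        raw = raw[raw.index("\n", start) + 1 : end]

    text = raw.lower()

    # Remove volume and chapter headings, e.g. "chapter 12".
    text = re.sub(r"(volume|chapter)\s+[ivxlc\d]+\.?", " ", text)

    # Expand common contractions (extend this map as needed).
    for pattern, replacement in [("can't", "cannot"), ("won't", "will not"), ("n't", " not")]:
        text = text.replace(pattern, replacement)

    # Convert numbers to words, e.g. "8" -> "eight".
    text = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)

    # Drop remaining punctuation and normalize whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

with open("pride-and-prejudice.txt", encoding="utf-8") as f:
    cleaned = clean_text(f.read())

with open("clean.txt", "w", encoding="utf-8") as f:
    f.write(cleaned)
```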
Now that we have our cleaned data, we’ll tokenize the data so it can be processed by the model. To tokenize the data, we’ll use Keras’ preprocessing module, so we’ll need to install the Keras module.
pipenv install keras==2.15.0
We can then create and fit the tokenizer to the text. We’ll initialize the out-of-vocabulary (OOV) token as <oov>. After it’s fit to the text, we’ll save it so we can use it later.
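A minimal version of that, assuming the cleaned text was saved as clean.txt and using pickle for the save, looks like this:

```python
import pickle

from keras.preprocessing.text import Tokenizer

# Read the cleaned training text produced in the previous step.
with open("clean.txt", encoding="utf-8") as f:
    corpus = f.read()

# Create the tokenizer with an out-of-vocabulary token and fit it to the text.
tokenizer = Tokenizer(oov_token="<oov>")
tokenizer.fit_on_texts([corpus])

# Save the fitted tokenizer so the training and API code can reuse it.
with open("tokenizer.pickle", "wb") as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
```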
Now we’re ready to start training our model.
Step 3 – Train the Model
To train the model, we’ll use a Bi-LSTM. This type of recurrent neural network is ideal for this problem since it enables the neural network to store the context of the previous words in the sentence.
Start by loading the tokenizer we created in the preprocessing stage.
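Assuming it was pickled as tokenizer.pickle in the previous step:

```python
import pickle

# Load the tokenizer saved during preprocessing.
with open("tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)
```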
We’ll then create the input sequences to train our model. This works by taking every six-word sequence in the text: the first five words form the input and the sixth is the word to predict. First, add NumPy as a dependency.
pipenv install numpy
Then we’ll write the function to create the input sequences from the data.
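Here’s a sketch of that function, continuing with the tokenizer and cleaned text from the previous steps:

```python
import numpy as np

SEQUENCE_LENGTH = 6  # five input words plus the next word as the label

# Read the cleaned text prepared in step 2.
with open("clean.txt", encoding="utf-8") as f:
    corpus = f.read()

def create_input_sequences(text: str) -> np.ndarray:
    # Convert the whole text to a flat list of token numbers.
    token_list = tokenizer.texts_to_sequences([text])[0]
    # Slide a six-word window across the text.
    return np.array([
        token_list[i : i + SEQUENCE_LENGTH]
        for i in range(len(token_list) - SEQUENCE_LENGTH + 1)
    ])

input_sequences = create_input_sequences(corpus)
```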
We’ll then split the input sequences into labels, training and testing data.
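One simple way to do this is to slice off the last word of each window as the label and hold back a slice of the data for testing. I’m keeping the labels as integers here so they can pair with a sparse loss at compile time:

```python
# The first five words of each window are the input; the sixth is the label.
xs = input_sequences[:, :-1]
labels = input_sequences[:, -1]

# Hold back 10% of the sequences for testing.
split = int(len(xs) * 0.9)
x_train, x_test = xs[:split], xs[split:]
y_train, y_test = labels[:split], labels[split:]
```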
The next part is compiling and fitting the model. We’ll pass in the data we split into training and testing sets. We can use the model checkpoint callback, which will save the best iteration of our model at each epoch. To optimize our training, we’ll also add an adaptive moment estimation (Adam) optimizer and a reduce-learning-rate-on-plateau callback.
Then we’ll add layers to the sequential model.
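The layer sizes below (a 100-dimensional embedding and 128 LSTM units) are illustrative choices, not prescribed values:

```python
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

vocab_size = len(tokenizer.word_index) + 1  # +1 because token numbers start at 1

model = Sequential([
    # Map each token number to a dense 100-dimensional vector.
    Embedding(vocab_size, 100, input_length=SEQUENCE_LENGTH - 1),
    # Read the five-word context in both directions.
    Bidirectional(LSTM(128)),
    # Output one probability per word in the vocabulary.
    Dense(vocab_size, activation="softmax"),
])
```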
Finally, we can put it all together and then compile the model using the training data.
With all the pieces defined, we can train our model on the cleaned data.
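Putting the callbacks, optimizer and loss together, the training step might look like this (epoch and batch sizes are, again, illustrative):

```python
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

# Keep the best model seen so far, saved at each epoch as model.keras.
checkpoint = ModelCheckpoint("model.keras", monitor="val_loss", save_best_only=True)
# Halve the learning rate when validation loss stops improving.
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2)

# A sparse loss pairs with the integer labels from the split step.
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=50,
    batch_size=128,
    callbacks=[checkpoint, reduce_lr],
)
```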
The model checkpoint save callback will save the model as model.keras. We’ll then be able to load the model when we create our API.
Step 4 – Write the Text Prediction Function
We’re ready to start predicting text. Starting with the hello.py file, we’ll first write functions to load the model and tokenizer.
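Assuming the file names from the earlier steps:

```python
import pickle

from keras.models import load_model

def get_model():
    # Load the best checkpoint saved during training.
    return load_model("model.keras")

def get_tokenizer():
    # Load the tokenizer fitted during preprocessing.
    with open("tokenizer.pickle", "rb") as handle:
        return pickle.load(handle)

model = get_model()
tokenizer = get_tokenizer()
```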
We will then write a function to predict the next three most likely words. This uses the tokenizer to create the same kind of token list that was used to train the model. We can then get a prediction across the whole vocabulary, which we’ll reduce down to the three most likely tokens. Finally, we’ll get the actual words by looking the tokens up in the tokenizer’s word index, which is in the form { "word": token_num }, such as { "the": 1, "and": 2 }; the predictions we receive will be an array of the token numbers.
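A sketch of that function, with the five-word input window matching what the model was trained on:

```python
import numpy as np

from keras.utils import pad_sequences

def predict_next_words(prompt: str, top_n: int = 3) -> list[str]:
    # Tokenize the prompt exactly as the training data was tokenized.
    token_list = tokenizer.texts_to_sequences([prompt.lower()])[0]
    # Pad or trim to the five-word input window the model expects.
    padded = pad_sequences([token_list], maxlen=5, padding="pre")

    # Probabilities over the whole vocabulary; keep the top three tokens.
    probabilities = model.predict(padded)[0]
    top_tokens = np.argsort(probabilities)[-top_n:][::-1]

    # Invert the { "word": token_num } index to map tokens back to words.
    index_to_word = {index: word for word, index in tokenizer.word_index.items()}
    return [index_to_word[t] for t in top_tokens if t in index_to_word]
```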
Step 5 – Create the API
Using the predictive text function, we can create our API. I will be using the Nitric framework for this, as it makes deploying our API very straightforward and gives us the choice of which cloud we want to use at the end.
First, we will import the necessary modules for the Nitric SDK.
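To the best of my knowledge of the current SDK, that’s:

```python
from nitric.resources import api
from nitric.application import Nitric
from nitric.context import HttpContext
```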
We’ll then define the API and our first route.
Within this function block, we want to define the code that will be run on a request. We’ll accept the prompt to predict from via the query parameters, which means requests are in the form: /predictions?prompt=.
Now that we have extracted the prompt from the user, we can pass this into the model for prediction. This will produce the three most likely next words and return them to the user.
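Putting the route and handler together, a sketch of the whole thing might look like the following. The exact shape of the query and response helpers can differ between SDK versions, so treat this as an outline and check the Nitric docs:

```python
main_api = api("main")

@main_api.get("/predictions")
async def get_predictions(ctx: HttpContext):
    # Pull the prompt from the query string: /predictions?prompt=...
    prompt = ctx.req.query.get("prompt")
    if isinstance(prompt, list):  # query values may arrive as lists
        prompt = prompt[0]

    if not prompt:
        ctx.res.status = 400
        ctx.res.body = "A 'prompt' query parameter is required"
        return

    # Return the three most likely next words.
    ctx.res.body = {"predictions": predict_next_words(prompt)}

Nitric.run()
```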
That’s all there is to it. To test the function locally, we’ll start the Nitric server.
nitric start
You can then make a request to the API using any HTTP client. Given the prompt “What should I”, it returns the most likely responses: “have”, “think” and “say”.
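For example, with the local server running (nitric start prints the exact local endpoint and port):

curl "http://localhost:4001/predictions?prompt=what+should+i"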
You will find that the predictions focus heavily on family and weddings. This shows that the training data, courtesy of Jane Austen, has a big effect on the type of predictions produced. You can see these themes in the examples below, where we start with a two-word sentence and see what the predictive text produces.
Step 6 – Deploy to the Cloud
You can deploy to your cloud to enable use of the API by other projects. First, set up your credentials and any other cloud-specific configuration.
Create your stack. This is an environment configuration file for the cloud provider where your project will be deployed. For this project, I used Google Cloud; however, it will work perfectly well if you prefer AWS or Azure.
nitric stack new
This project will run as expected with a default memory configuration of 512MB. However, to get instant predictions, we’ll amend the memory to be 1GB. This just means adding some config to the newly created stack file.
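For a Google Cloud stack, my understanding is that the memory setting sits under a cloudrun section of the stack file (AWS stacks use a lambda section instead); treat the exact keys as an assumption and check the Nitric provider docs:

```yaml
# Fragment added to the stack file generated by `nitric stack new`.
config:
  default:
    cloudrun:
      memory: 1024
```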
You can then deploy using the following command:
nitric up
When deployment is finished, you’ll get an endpoint so you can test your API in the cloud.
If you’re just testing for now, you can tear down the stack with the following command:
nitric down
If you want to learn more about using Nitric to quickly deploy Python and other language applications to your cloud, check out the session that Anmol Krishan Sachdeva and Kingsley Madikaegbu of Google are leading at Open Source Summit North America: “Thinking Beyond IaC: an OSS Approach to Cloud Agnostic Infra. Management Using Infra. from Code (IfC).”