Module 3: Adding Data Science to a Web Application
Module Overview
You have your application, you have your data - now it's time for science! Let's use what we've learned throughout the program to add some useful intelligent functionality to our web application.
Learning Objectives
- Add a machine learning model to our web server that generates predictions when passed the appropriate inputs
- Add routes to our app that will listen for POST HTTP requests (form submissions) and respond accordingly
- Display appropriate messages on the screen after user actions including error messages when invalid actions are taken
Objective 01 - Run and report simple online analysis of data from the user or an API
Overview
Online analysis refers to running whatever data science model/algorithm in real-time, in response to a user request. This has the advantage of processing and possibly training directly on user data, but the disadvantage of added compute cost (and thus possibly latency) at request time. It's still a useful paradigm, especially for descriptive statistics and other simple/inexpensive techniques.
At long last, we have a web application with some data in it - let's do some Data Science!
“Online analysis” just refers to running the analytical code (e.g. dataframe wrangling, model fitting, etc.) in the same application/methods as the web app itself. So this just means writing code like you're used to writing in a notebook (importing and using pandas and sklearn), but in functions that get called by the web application.
This makes it easy to reason about, and ensures that it's running with the latest data. But if your model is computationally intensive, online analysis is not ideal, as you may block the response and page loading as it waits for your model.
For our immediate purposes, we'll stick with a simple and efficient model (logistic regression), and online analysis will yield useful and reasonably efficient results.
Follow Along
We're going to run logistic regression on the classic example data - the iris dataset. But now, on the web!
- Run pipenv install scikit-learn in your project directory. You probably already have scikit-learn locally, but it's important for your project to have the dependency specified for later deployment.
- Wherever your app and routes are specified, add the following route:
@app.route('/iris')
def iris():
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(random_state=0, solver='lbfgs',
                             multi_class='multinomial').fit(X, y)
    return str(clf.predict(X[:2, :]))
Start your server, and visit http://127.0.0.1:5000/iris. You should see [0 0] as the response, reflecting the class predictions for the first two rows of the Iris data based on logistic regression.
Challenge
- Add a route that reports the overall goodness of fit of the logistic regression (e.g. clf.score(X, y)).
- Add a route that takes the input data to be predicted as a parameter, and returns class labels (or even full probabilities).
- Refactor! You should probably centralize the actual scikit-learn logic in another file, and have your routing file just import and use what it needs.
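One way the refactor and the score route could fit together - a sketch only, assuming a Flask app; the function name fit_model and the /iris/score route are illustrative, not from the lecture:

```python
# Sketch: assumes Flask and scikit-learn are installed.
from flask import Flask
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def fit_model():
    """Fit the classifier once at startup; in a real refactor this would
    live in its own module that the routing file imports from."""
    X, y = load_iris(return_X_y=True)
    # multinomial is already the default behavior with the lbfgs solver
    clf = LogisticRegression(random_state=0, solver='lbfgs').fit(X, y)
    return clf, X, y


app = Flask(__name__)
clf, X, y = fit_model()  # fit once, reuse across requests


@app.route('/iris/score')
def iris_score():
    # Mean accuracy on the training data - the clf.score(X, y) challenge
    return str(clf.score(X, y))
```

Fitting once at import time (rather than inside the route) avoids retraining on every request, which is the main cost of the fully online approach.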
Objective 02 - Run a more complicated offline model and serialize the results for online use
Overview
Offline analysis refers to running whatever data science model/algorithm ahead of time, before the application is built/deployed. The trained model is then loaded into the application, or otherwise used to inform the predictions or behavior at runtime (online, in response to user requests). More expensive ML approaches essentially must be trained offline, with only inference (running the model for a prediction) happening in real-time.
As you know, in Data Science we have lots of ~toys~ tools - different models, tests, feature engineering, data processing, and so forth. Doing all of these “online” - live in realtime as the user makes a request - is not always realistic.
Instead, the approach is to run your pipeline and train your model “offline” - which just means not as part of the web application itself (in terms of routes and responses). Then, serialize the results - save them in some portable format - and make them available to the online application for real-time use (generally inference, making predictions rather than training).
Follow Along
There are many ways to save and reuse a model - the simplest (for a regression) would be to extract its coefficients and program them into the live application. One step up is to use pickle. pickle is a module in the Python standard library (so it's always available wherever Python is), and it allows you to “dump” and “load” arbitrary (in-memory) Python objects to a bytes representation. That representation is convenient to add to a database or otherwise save and load in the web application (encode it, e.g. with base64, if you need actual text).
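Since pickle.dumps returns bytes rather than text, a minimal sketch of making the result safe for a text database column (base64 is one common choice; the variable names are illustrative):

```python
import base64
import pickle

obj = {'coef': [1.5, -0.2]}  # stand-in for a fitted model object
raw = pickle.dumps(obj)      # bytes, not text
text = base64.b64encode(raw).decode('ascii')  # safe for a text/varchar column
restored = pickle.loads(base64.b64decode(text))
print(restored == obj)  # True
```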
To use pickle, you need only two functions - dumps (“dump to a string”) and loads (“load from a string”):
>>> import pickle
>>> d = {'An arbitrary': 'Python Object'}
>>> d_pickled = pickle.dumps(d)
>>> d_pickled
b'\x80\x03}q\x00X\x0c\x00\x00\x00An arbitraryq\x01X\r\x00\x00\x00Python Objectq\x02s.'
>>> d_unpickled = pickle.loads(d_pickled)
>>> d_unpickled
{'An arbitrary': 'Python Object'}
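The same pattern applies to a fitted model. A sketch (assuming scikit-learn is installed) of serializing a trained classifier “offline” and loading it back for inference only:

```python
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# "Offline" step: train the model and serialize the fitted object
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, solver='lbfgs').fit(X, y)
model_bytes = pickle.dumps(clf)

# "Online" step: load it back and use it for inference only
clf_restored = pickle.loads(model_bytes)
print(clf_restored.predict(X[:2]))  # matches the original model's predictions
```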
Challenge
- The above example shows how to save and load an arbitrary Python object with pickle. Reproduce the example, but with your own Python object (specifically one representing a statistical model of some sort).
- Persist the string into a file or database, and write code to automatically load it in a fresh session.
- See if you can integrate the above into your web application!
Guided Project
Open guided-project.md in the GitHub repository to follow along with the guided project.
Module Assignment
Reproduce the lecture tasks (logistic regression fitting, predicting, returning) in a REPL/notebook with different real data, incorporate predictive code in the application, and add forms for user interaction with the predictive model.
Assignment Solution Video
Additional Resources
Data Science Tools
- Scikit-Learn Linear Models
- Scikit-Learn StandardScaler
- Scikit-Learn Model Persistence
- Python Pickle Module