Module 3: Adding Data Science to a Web Application

Module Overview

You have your application, you have your data - now it's time for science! Let's use what we've learned throughout the program to add some useful intelligent functionality to our web application.

Learning Objectives

Objective 01 - Run and report simple online analysis of data from the user or an API

Overview

Online analysis refers to running a data science model or algorithm in real time, in response to a user request. This has the advantage of processing (and possibly training) directly on user data, but the disadvantage of added compute cost - and thus possibly latency - at request time. It's still a useful paradigm, especially for descriptive statistics and other simple, inexpensive techniques.

At long last, we have a web application with some data in it - let's do some Data Science!

“Online analysis” just refers to running the analytical code (e.g. dataframe wrangling, model fitting, etc.) in the same application/methods as the web app itself. So this just means writing code like you're used to writing in a notebook (importing and using pandas and sklearn), but in functions that get called by the web application.

This makes the code easy to reason about, and ensures that it's running with the latest data. But if your model is computationally intensive, online analysis is not ideal - the response (and thus the page load) may be blocked while the server waits for your model.

For our immediate purposes, we'll stick with a simple and efficient model (logistic regression), and online analysis will yield useful and reasonably efficient results.

Follow Along

We're going to run logistic regression on the classic example dataset - iris. But now, on the web!

@app.route('/iris')
def iris():
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(random_state=0, solver='lbfgs',
                             multi_class='multinomial').fit(X, y)

    return str(clf.predict(X[:2, :]))

Start your server and visit http://127.0.0.1:5000/iris. You should see [0 0] as the response - the classes the logistic regression predicts for the first two rows of the iris data.
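Predicting on the first two training rows is a fixed demo; a natural next step is to predict on values the user supplies (say, from a query string). The sketch below is not part of the lesson code - the function name predict_species and the comma-separated input format are illustrative choices - but it shows the same model responding to arbitrary input. Note that the model is trained once at import time, so each request only pays for inference:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train once at import time so each web request only runs inference
iris = load_iris()
X, y = iris.data, iris.target
clf = LogisticRegression(random_state=0, max_iter=200).fit(X, y)

def predict_species(measurements: str) -> str:
    """Parse 'sepal_len,sepal_wid,petal_len,petal_wid' (e.g. pulled
    from a query string) and return the predicted species name."""
    features = [float(value) for value in measurements.split(',')]
    class_index = clf.predict([features])[0]
    return str(iris.target_names[class_index])
```

In a Flask route, you would call predict_species with a value from request.args and return the resulting string as the response.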

Challenge

Objective 02 - Run a more complicated offline model and serialize the results for online use

Required Resources

Overview

Offline analysis refers to running whatever data science model/algorithm ahead of time, before the application is built/deployed. The trained model is then loaded into the application, or otherwise used to inform the predictions or behavior at runtime (online, in response to user requests). More expensive ML approaches essentially must be trained offline, with only inference (running the model for a prediction) happening in real-time.

As you know, in Data Science we have lots of ~toys~ tools - different models, tests, feature engineering, data processing, and so forth. Doing all of this “online” - live, in real time, as the user makes a request - is not always realistic.

Instead, the approach is to run your pipeline and train your model “offline” - which just means not as part of the web application itself (in terms of routes and responses). Then, serialize the results - save them in some portable format - and make them available to the online application for real-time use (generally inference, making predictions rather than training).
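The train-offline/load-online split described above can be sketched as a small one-off script - here, as an assumption for illustration, the model is pickled to a file named iris_model.pkl, which the web application would read once at startup:

```python
# A hypothetical offline script, run before (or outside of) deployment
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# "Offline": run the full (potentially expensive) training step
X, y = load_iris(return_X_y=True)
model = LogisticRegression(random_state=0, max_iter=200).fit(X, y)

# Serialize the fitted model to disk in a portable format
with open('iris_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# "Online": the web application loads the model once at startup,
# then only runs inference per request - no training at request time
with open('iris_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

predictions = loaded_model.predict(X[:2])
```

The key property is that the expensive .fit() call happens outside the request/response cycle; the application only ever calls .predict().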

Follow Along

There are many ways to save and reuse a model - the simplest (for a regression) would be to extract its coefficients and hard-code them into the live application. One step up is to use pickle.
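To make the coefficient-extraction idea concrete, here is a sketch (predict_class is an illustrative helper, not part of the lesson) showing that a fitted logistic regression really is just two arrays of numbers, and that a prediction can be reproduced from them with a dot product and an argmax:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=200).fit(X, y)

# The fitted "results" are just these two arrays, which could be
# copied directly into the live application's source code
coefficients = clf.coef_      # shape (n_classes, n_features)
intercepts = clf.intercept_   # shape (n_classes,)

def predict_class(features):
    """Reproduce clf.predict() for one sample using only the raw numbers."""
    scores = coefficients @ np.asarray(features) + intercepts
    return int(np.argmax(scores))
```

This works for linear models, but pickling generalizes to models whose internals are not so easily transcribed.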

pickle is a module in the Python standard library (so it's always available wherever Python is), and it allows you to “dump” and “load” arbitrary (in-memory) Python objects. Note that dumps and loads actually work with a bytes representation (see the b'...' prefix below), not a text string - bytes are what you'd store in a binary database column or otherwise save and load in the web application. (The related dump and load functions write to and read from a file object instead.)

To use the in-memory form, you need only two functions - dumps (“dump to bytes”) and loads (“load from bytes”):

>>> import pickle
>>> d = {'An arbitrary': 'Python Object'}
>>> d_pickled = pickle.dumps(d)
>>> d_pickled
b'\x80\x03}q\x00X\x0c\x00\x00\x00An arbitraryq\x01X\r\x00\x00\x00Python Objectq\x02s.'
>>> d_unpickled = pickle.loads(d_pickled)
>>> d_unpickled
{'An arbitrary': 'Python Object'}
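Putting the two objectives together: since a fitted scikit-learn model is itself an ordinary Python object, the same dumps/loads pair serializes it too. This sketch (the variable names are illustrative) shows the bytes blob that could be stored in a database, and inference after rehydrating it:

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# "Offline": fit the model ahead of time
X, y = load_iris(return_X_y=True)
model = LogisticRegression(random_state=0, max_iter=200).fit(X, y)

# dumps() returns a bytes blob, suitable for a binary database column
model_bytes = pickle.dumps(model)

# "Online": rehydrate the model and run inference only
restored = pickle.loads(model_bytes)
predictions = restored.predict(X[:2])
```

One caveat worth knowing: unpickling executes code, so only load pickles you created yourself, and load them with the same (or compatible) library versions used to create them.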

Challenge

Guided Project

Open guided-project.md in the GitHub repository to follow along with the guided project.

Module Assignment

Reproduce the lecture tasks (fitting, predicting with, and returning results from a logistic regression) in a REPL/notebook with different real data, incorporate the predictive code into the application, and add forms so users can interact with the predictive model.

Assignment Solution Video

Additional Resources

Data Science Tools

Flask Integration

API Development

User Interface Design