MyPy IPython Example

This is a more comprehensive example, showing how type checking can help with data analysis.

Let’s suppose we have a web API for sentiment analysis. When we post to http://sentiment-analyzer.example.com/analyze, we get back either {"sentiment": "positive"} or {"sentiment": "negative"}. In our example, we will imagine that this is a usage-based paid API: we pay some amount of money per 1,000 requests.

[4]:
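# With a real service, we would use an actual HTTP session:
# from requests import Session
# SESSION = Session()

# For this example, a mock stands in for the session, so no requests are sent.
from unittest import mock
SESSION = mock.MagicMock()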

We also have a database with people’s reviews that we want to analyze for sentiment. We will abstract the database away and assume we already have the data we need in memory.

Since we pay money for API usage, we mostly debug on a sample of the data. Once we see that it works, we run it on the full data. In real life, the sample might be 1,000 elements, and the full data 1,000,000.

In our little example, for pedagogical reasons, the sample has two items and the full data has three – only one more.

[2]:
SAMPLE_DATA = [
    {"name": "Jane Doe", "review": "I liked it", "product_id": 5},
    {"name": "Huan Liu", "review": "it sucked", "product_id": 7},
]
FULL_DATA = SAMPLE_DATA + [
    {"name": "Denzel Brown", "review": "ok I guess", "product_id": 2},
]
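
Since the rest of this example is about type checking, it may help to spell out the shape of these records. The following is only a sketch: the `Review` type is a hypothetical name that the notebook itself does not define, and the annotations are one possible way to describe the data.

```python
from typing import List, TypedDict


class Review(TypedDict):
    """Hypothetical record type; the name is an assumption for illustration."""
    name: str
    review: str
    product_id: int


SAMPLE_DATA: List[Review] = [
    {"name": "Jane Doe", "review": "I liked it", "product_id": 5},
    {"name": "Huan Liu", "review": "it sucked", "product_id": 7},
]
FULL_DATA: List[Review] = SAMPLE_DATA + [
    {"name": "Denzel Brown", "review": "ok I guess", "product_id": 2},
]
```

With annotations like these, a type checker has enough information to flag a misspelled key such as `datum["reveiw"]` before any paid request is made.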

Here is the wrapper code to call the sentiment analyzer:

[3]:
def is_positive(text):
    results = SESSION.post("http://sentiment-analyzer.example.com/analyze", json=dict(text=text))
    return results.json()["sentiment"] == "positive"

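Unfortunately, even on our small sample, this call sometimes hung for a long time. That is easy enough to fix: we add a 3-second timeout and a little retry loop that tries up to three times. While we are at it, we rename the function to sentiment and have it return 1 or -1 instead of a boolean.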

[4]:
def sentiment(text):
    for i in range(3):
        try:
            results = SESSION.post("http://sentiment-analyzer.example.com/analyze",
                                   json=dict(text=text), timeout=3)
        except OSError:
            continue
        else:
            return 1 if results.json()["sentiment"] == "positive" else -1
[5]:
SESSION.post.side_effect = [
    mock.MagicMock(**{"json.return_value": dict(sentiment=sentiment)})
    for sentiment in ["positive", "negative"]
]
[6]:
average_sentiment = sum(sentiment(datum["review"]) for datum in SAMPLE_DATA)
print(average_sentiment)
0
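
The result is 0 because the mocked `post` hands back one canned response per call. A list assigned to `side_effect` is consumed in order, and any exception instance in the list is raised instead of being returned. The standalone sketch below illustrates this behavior; the names `fake_session` and `labels` are illustrative only and do not appear elsewhere in the example.

```python
from unittest import mock

fake_session = mock.MagicMock()
fake_session.post.side_effect = [
    mock.MagicMock(**{"json.return_value": {"sentiment": "positive"}}),
    OSError("simulated timeout"),  # raised on the second call
    mock.MagicMock(**{"json.return_value": {"sentiment": "negative"}}),
]

labels = []
for _ in range(3):
    try:
        response = fake_session.post("http://sentiment-analyzer.example.com/analyze")
        labels.append(response.json()["sentiment"])
    except OSError:
        labels.append("error")

print(labels)  # ['positive', 'error', 'negative']
```

With one positive response (+1) and one negative response (-1), the total is 0. Note that the expression adds the scores up; it does not divide by the number of reviews.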

Looks good! It even handles errors:

[7]:
import random
side_effect = [
    mock.MagicMock(**{"json.return_value": dict(sentiment=sentiment)})
    for sentiment in ["positive", "negative"]
] + [OSError("woops too long")] * 2
random.shuffle(side_effect)
SESSION.post.side_effect = side_effect
[8]:
average_sentiment = sum(sentiment(datum["review"]) for datum in SAMPLE_DATA)
print(average_sentiment)
0

Looks good! Let’s wrap it in a function:

[9]:
def get_average_sentiment(data):
    return sum(sentiment(datum["review"]) for datum in data)
[10]:
SESSION.post.side_effect = [
    mock.MagicMock(**{"json.return_value": dict(sentiment=sentiment)})
    for sentiment in ["positive", "negative"]
]
get_average_sentiment(SAMPLE_DATA)
[10]:
0

But on the full data, a request sometimes fails three times in a row. What happens then?

[11]:
SESSION.post.side_effect = [
    mock.MagicMock(**{"json.return_value": dict(sentiment=sentiment)})
    for sentiment in ["positive", "negative"]
] + [OSError("woops too long")] * 4
[12]:
get_average_sentiment(FULL_DATA)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-a5751932427c> in <module>
----> 1 get_average_sentiment(FULL_DATA)

<ipython-input-9-bfb1266b4732> in get_average_sentiment(data)
      1 def get_average_sentiment(data):
----> 2     return sum(sentiment(datum["review"]) for datum in data)

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
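
The `NoneType` comes from `sentiment` itself: when all three attempts raise `OSError`, the loop ends without reaching a `return`, and a Python function that falls off its end returns `None`. A minimal reproduction, using a hypothetical `always_fails` helper in place of the API call:

```python
def always_fails():
    """Stand-in for a request that times out on every attempt (hypothetical)."""
    raise OSError("simulated timeout")


def score_with_retries():
    for _ in range(3):
        try:
            always_fails()
        except OSError:
            continue
        else:
            return 1
    # No return here: falling off the end yields None implicitly.


result = score_with_retries()
print(result)  # None

message = ""
try:
    sum([1, -1, result])
except TypeError as exc:
    message = str(exc)
print(message)  # unsupported operand type(s) for +: 'int' and 'NoneType'
```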

Whoops! Too bad. The fix is simple: it is rare for a request to fail three times, so we can just return 0 in that case. It is not going to change the average too much.

[13]:
def sentiment(text):
    for i in range(3):
        try:
            results = SESSION.post("http://sentiment-analyzer.example.com/analyze",
                                   json=dict(text=text), timeout=3)
        except OSError:
            continue
        else:
            return 1 if results.json()["sentiment"] == "positive" else -1
    return 0

We are done.

Too bad that, to get the corrected sentiments, we have to use the API again… for all elements. Oh, well. Too bad about the usage-based cost.

[14]:
SESSION.post.side_effect = [
    mock.MagicMock(**{"json.return_value": dict(sentiment=sentiment)})
    for sentiment in ["positive", "negative"]
] + [OSError("woops too long")] * 4
get_average_sentiment(FULL_DATA)
[14]:
0

What if this could all have been avoided?

[1]:
%load_ext mypy_ipython
[5]:
def sentiment(text: str) -> int:
    for i in range(3):
        try:
            results = SESSION.post("http://sentiment-analyzer.example.com/analyze",
                                   json=dict(text=text), timeout=3)
        except OSError:
            continue
        else:
            return 1 if results.json()["sentiment"] == "positive" else -1
[6]:
%mypy
note: In function "sentiment":
    def sentiment(text: str) -> int:
error: Missing return statement
Found 1 error in 1 file (checked 1 source file)
Type checking failed
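
One way to satisfy the checker is to make the fall-through case explicit. The following is a sketch under stated assumptions, not the notebook's own resolution: `SESSION` is mocked as in the rest of the example, `URL` is just a local constant for the endpoint, and `mean_sentiment` is a hypothetical helper. This variant returns `Optional[int]`, so a type checker can remind callers to handle the failure case, and the helper skips failed reviews rather than counting them as neutral.

```python
from typing import Iterable, Optional
from unittest import mock

SESSION = mock.MagicMock()  # stand-in for requests.Session()
URL = "http://sentiment-analyzer.example.com/analyze"


def sentiment(text: str) -> Optional[int]:
    for _ in range(3):
        try:
            results = SESSION.post(URL, json=dict(text=text), timeout=3)
        except OSError:
            continue
        else:
            return 1 if results.json()["sentiment"] == "positive" else -1
    return None  # explicit: all three attempts failed


def mean_sentiment(texts: Iterable[str]) -> float:
    """Average the scores, skipping reviews whose requests all failed."""
    scores = [s for s in map(sentiment, texts) if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


# Two successful responses, then three failures for the third review.
SESSION.post.side_effect = [
    mock.MagicMock(**{"json.return_value": dict(sentiment=label)})
    for label in ["positive", "negative"]
] + [OSError("simulated timeout")] * 3

result = mean_sentiment(["I liked it", "it sucked", "ok I guess"])
print(result)  # 0.0
```

Keeping the `-> int` annotation and adding a final `return 0` would also clear the "Missing return statement" error. Either way, the decision about what a triple failure means is made before the paid run on the full data, not after it.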