Module 4: Software Testing, Documentation, and Licensing

Module Overview

You've written your code using OOP; it runs and works, it's in a package, and you can even run it in a reproducible container. You're done, right? Nope!

To effectively share your code in a way that will be lastingly useful, you also need to test and document. These are not just "overhead" tasks - they are a core part of software engineering, and code that lacks these properties is essentially impossible to maintain or build on in the long term.

Last but not least, you need to choose an appropriate license for your code and make sure you understand the licenses of your dependencies and the ecosystem in general. You don't need a law degree, but there are significant differences between licenses you should understand, even ones that all count as "open source."

Learning Objectives

Objective 01 - Understand the purpose of and approaches to software testing, and write basic unit tests

Overview

When you say “the code works”, what do you really mean? Generally you mean that, when the system/function is run with given input, the output/behavior is as expected (and hopefully documented).

Software testing formally specifies this, and provides a framework for automatically verifying that code really passes the tests. This helps you avoid regressions - no, not the statistical models, but rather the literal meaning of “going backwards” with your code (introducing bugs/errors). You may have already seen this, and already run software tests - in your coding challenges!

The simplest possible test requires the simplest possible piece of code - a unit. What is a unit?

In many cases it is just a function (or, if that function is a member of a class, remember it is called a method). For larger or more complicated code, you may write several unit tests for different cases of the same function call, each passing in different arguments and checking for the expected output.

Unit tests are the most basic, well, unit of testing. There are more sophisticated tests - integration tests combine modules into a group and test their joint behavior, and end-to-end tests simulate an entire user flow/interaction. These larger tests are arguably more effective at catching tricky bugs - you can have 100% unit test coverage and still miss things if you never test the pieces together.

But there is one important advantage to unit tests, and a reason to not neglect them entirely - they may not catch every bug, but they do force you to think of your code in units. A good unit test requires good code to test, and so you may find yourself refactoring your code in order to make it more testable. Embrace this! It's one of the biggest advantages of proper software testing.
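As a rough sketch of what "thinking in units" can look like, suppose some column-cleaning logic used to be buried inside a larger data-loading routine. Pulling it out into its own small function makes it easy to test each case separately. The function clean_column_name and its behavior are hypothetical, invented for illustration; the tests use the unittest package introduced in Follow Along below.

```python
import unittest


def clean_column_name(name):
    """Hypothetical helper: normalize a column name (trim, lowercase, underscores)."""
    return name.strip().lower().replace(' ', '_')


class TestCleanColumnName(unittest.TestCase):
    # One small test per case of the same function call

    def test_lowercases(self):
        self.assertEqual(clean_column_name('Price'), 'price')

    def test_strips_surrounding_whitespace(self):
        self.assertEqual(clean_column_name('  price  '), 'price')

    def test_replaces_inner_spaces(self):
        self.assertEqual(clean_column_name('Unit Price'), 'unit_price')

    def test_rejects_non_strings(self):
        # Integers have no .strip(), so the call raises AttributeError
        with self.assertRaises(AttributeError):
            clean_column_name(42)
```

If this were saved to a file, one way to run it would be python -m unittest followed by the file name. Each test names the case it covers, so a failure points straight at the behavior that broke.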

Follow Along

To write unit tests, we will use unittest, a package included in the Python standard library (i.e. no special installation needed).

Consider the following example:

import unittest

class TestStringMethods(unittest.TestCase):

    def test_upper(self):
        self.assertEqual('foo'.upper(), 'FOO')

    def test_isupper(self):
        self.assertTrue('FOO'.isupper())
        self.assertFalse('Foo'.isupper())

    def test_split(self):
        s = 'hello world'
        self.assertEqual(s.split(), ['hello', 'world'])
        # check that s.split fails when the separator is not a string
        with self.assertRaises(TypeError):
            s.split(2)

if __name__ == '__main__':
    unittest.main()

This tests basic string methods (i.e. the “units” being tested are already written and built-in to Python). The overall test case is a class, inheriting from unittest.TestCase, and the methods in the class are specific tests.

Save the code to a file string_tests.py, and execute with python string_tests.py. You should see:

...
----------------------------------------------------------------------
Ran 3 tests in 0.000s

OK

What happened? By default, unittest says very little when tests pass - each dot on the first line stands for one passing test, and OK means all of them passed. Try running again in verbose mode with python string_tests.py -v:

test_isupper (__main__.TestStringMethods) ... ok
test_split (__main__.TestStringMethods) ... ok
test_upper (__main__.TestStringMethods) ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.000s

OK

Now you see it checking off each individual test and saying that it passed.

What does it look like when a test fails? It's noisier!

Python string methods are pretty well-written, so to get a failing test let's introduce a bug into the test itself:

    def test_upper(self):
        self.assertEqual('foo'.upper(), 'FO')  # bug in test!

Now run python string_tests.py again:

..F
======================================================================
FAIL: test_upper (__main__.TestStringMethods)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "string_tests.py", line 6, in test_upper
    self.assertEqual('foo'.upper(), 'FO')  # bug in test!
AssertionError: 'FOO' != 'FO'
- FOO
?   -
+ FO


----------------------------------------------------------------------
Ran 3 tests in 0.001s

FAILED (failures=1)

Failure! Specifically, the assertion self.assertEqual('foo'.upper(), 'FO') failed, because 'foo'.upper() is equal to 'FOO', not 'FO'.

The above is the minimal viable approach to testing basic expected code behavior - you can make unit tests more complicated, with setUp() and tearDown() methods in the class to do setup/cleanup behavior before/after tests. You can also use unittest.mock to make “fake” objects, e.g. a fake database connection, so that a unit test can run without hitting real systems.
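To make setUp(), tearDown(), and unittest.mock a little more concrete, here is a minimal sketch. The count_rows function, the table name, and the shape of the connection object (an execute() method returning a cursor with fetchone()) are all assumptions made up for this example, not part of any particular database library.

```python
import unittest
from unittest import mock


def count_rows(connection, table):
    """Hypothetical helper: ask a database connection how many rows a table has."""
    cursor = connection.execute(f'SELECT COUNT(*) FROM {table}')
    return cursor.fetchone()[0]


class TestCountRows(unittest.TestCase):

    def setUp(self):
        # Runs before every test: build a fake connection instead of a real one
        self.connection = mock.Mock()
        self.connection.execute.return_value.fetchone.return_value = (42,)

    def tearDown(self):
        # Runs after every test: nothing real to clean up here, shown for completeness
        self.connection = None

    def test_returns_count(self):
        self.assertEqual(count_rows(self.connection, 'users'), 42)

    def test_runs_expected_query(self):
        count_rows(self.connection, 'users')
        # The mock records how it was called, so the query itself can be checked
        self.connection.execute.assert_called_once_with('SELECT COUNT(*) FROM users')
```

Because the connection is a Mock, these tests never touch a real database - they only check that count_rows asks the right question and handles the answer correctly.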

Advanced testing is beyond the scope for today, and in general is unlikely to be your core responsibility as a data scientist (there exist specific software test engineers). But it's good to be aware of its existence, and that a codebase is only as healthy as its test suite.

Challenge

Write a unit test for one of the functions or methods (pop quiz - what's the difference?) in your bloomdata package.

Objective 02 - Read and write quality comments, documentation, and READMEs

Overview

“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?” - Brian Kernighan in “The Elements of Programming Style”

Every topic we've covered this week helps us address the above dilemma, but documentation plays a particularly key role. It's where you write those things that the code itself doesn't quite say, but you had to think through to be able to come up with it. Your coworkers (and future you) will thank you!

You read more code than you write - and the more experience you get and the higher your position, the more that is true. So when you write code, you should always remember that it's not good enough for it to just run. Obviously that matters, but it's also important that your code can be read - that is, understood by another human, be it a coworker, or just you but in the future.

Code that runs assigns and moves the bytes that the computer wants and understands - one of the biggest challenges of coding is “thinking like a computer” and writing code that does what it should. But your human “mental state” when writing that code is extremely temporary - it requires great focus to write complicated code, and you'll quickly forget the details within weeks if not days.

How do we send information to the future? We write it down! Comments, Pydoc, and READMEs are how we (as Python developers) save our human mental state, and share it with whoever works with our code in the future.

Follow Along

There are two sorts of comments in Python:

Inline comment

# This is an inline comment

Multi-line comment

"""
Multi
Line
Comment
"""

Both are useful, but serve different purposes - an inline comment is a brief annotation, to indicate the meaning (or “trick”) of a line or maybe a few lines of code.

Multi-line comments are technically just string literals - a string that isn't assigned to anything is evaluated and discarded, so it has no effect on the program. When such a string is the first statement in a module, class, or function, it is called a docstring, and Python stores it (in the object's __doc__ attribute) rather than discarding it. Docstrings are documentation, and can describe details such as the argument and return types of a function, links to resources, etc. They're what the help() function grabs to explain things in the REPL, and there's a tool named pydoc that extracts docstrings and generates HTML (this is how a lot of code documentation sites are built!).
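Here is a small sketch showing both kinds of comment on one function. The function celsius_to_fahrenheit is a made-up example, and the Args/Returns layout is just one common docstring convention, not the only acceptable one.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from Celsius to Fahrenheit.

    Args:
        celsius (float): Temperature in degrees Celsius.

    Returns:
        float: The same temperature in degrees Fahrenheit.
    """
    # Inline comment: scale by 9/5 first, then shift by 32
    return celsius * 9 / 5 + 32


# The docstring is stored on the function object, which is where help() finds it
print(celsius_to_fahrenheit.__doc__.splitlines()[0])
# prints: Convert a temperature from Celsius to Fahrenheit.
```

Calling help(celsius_to_fahrenheit) in the REPL displays the same docstring, and running python -m pydoc -w on a module name writes an HTML page built from its docstrings.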

Challenge

Go back to your bloomdata project, and make sure you've added comments as appropriate. In particular, have docstrings wherever PEP 8 suggests (PEP 257 covers docstring conventions in more detail) - then build/import your module and try the help() function on your code. For an extra goal - try building HTML documentation with pydoc - you can even put it in a docs/ directory and push to GitHub to deploy with GitHub Pages!

Objective 03 - Recognize major open-source licenses and their significance for personal and professional use


Overview

Code rules the world - binaries (compiled code) are just a by-product. So the rules and laws around who gets to see and use what code in what situations matter quite a bit. We don't all have to be lawyers, but we should all be informed coding citizens.

You may think that writing code is a relatively new thing - but lawyers have been doing it for centuries. Legal code, though not quite as deterministic as computer code, has long been an important part of human history.

One area it has been applied is to the idea of “intellectual property” - the ownership of ideas themselves. Ideas are powerful, and can have real economic and other forms of value. In many areas, intellectual property is straightforward (if occasionally controversial) - the creator of an idea (or whoever pays them) “owns” the idea, as represented by things like copyright and patents.

But in software, a remarkably different situation has emerged - the open source movement, as embodied by idealists such as Richard Stallman, has had tremendous impact on software development, technology, and the world. The idea that code should be available, while pragmatic from a development perspective, has created an ecosystem where arguably the majority of high quality software in the world is available not just in binary but source code form.

Follow Along

There are two major “schools” of open source licenses - GPL and BSD/MIT. The GPL (GNU General Public License, which came out of Richard Stallman's GNU Project) is the more “aggressive” of the two - it takes the stance that source code shouldn't just be available, but that people who distribute software built on GPL code should also make their own source code available under the same terms. This is referred to as “copyleft”, and some consider it “infectious”:

"Linux is a cancer that attaches itself in an intellectual property sense to everything it touches."
--Steve Ballmer, Microsoft

BSD/MIT-flavored licenses take a different approach - they put code out there, and the main conditions they impose on anyone who uses it are that the original author isn't liable for bad things (i.e. you can't sue them), and that you include the license text and copyright notice with your code, acknowledging the original author (but you don't have to release your own code the same way).

Which is the right approach? That's the realm of philosophy - but both have major proponents and users. Linux (as Ballmer bemoaned) is the single most influential GPL-licensed software, though there is an entire GPL toolchain (much of what you run on a command line). BSD (the operating system family) is, unsurprisingly, BSD-licensed, and the MIT license is also widely used (for instance by BloomTech for the project code we share with you).

Both approaches have an important commonality - rather than forsake copyright (as with Public Domain), they use copyright to claim certain, but limited, rights. The main practical difference is that BSD/MIT code is more “business-friendly” than GPL (indeed, macOS is built partly on BSD code). There exist finer differences between licenses, and if you're interested you should read more at the Open Source Initiative.
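One practical starting point for understanding the licenses of your dependencies is to look at what each installed package declares about itself. The sketch below uses the standard library's importlib.metadata; the helper name installed_licenses is made up for this example. It assumes packages fill in their License, License-Expression, or license classifier metadata, which not every package does, so treat the output as a hint rather than a legal answer.

```python
from importlib import metadata


def installed_licenses():
    """Return (name, declared license, license classifiers) per installed package."""
    rows = []
    for dist in metadata.distributions():
        meta = dist.metadata
        if meta is None:  # skip distributions whose metadata cannot be read
            continue
        name = meta.get('Name') or 'unknown'
        # Newer packages may declare 'License-Expression'; older ones use 'License'
        declared = (meta.get('License-Expression') or meta.get('License') or '').strip()
        # Some packages paste the whole license text here, so keep only the first line
        first_line = declared.splitlines()[0] if declared else ''
        classifiers = [
            c for c in (meta.get_all('Classifier') or [])
            if c.startswith('License ::')
        ]
        rows.append((name, first_line, classifiers))
    return sorted(rows, key=lambda row: row[0].lower())


for name, declared, classifiers in installed_licenses():
    print(f'{name}: {declared or "(no License field)"} {classifiers}')
```

Anything that shows up as GPL-family deserves a closer read of the actual license text before you distribute code that builds on it.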

Challenge

Read the MIT License, Line by Line, and write a summary of it with a target audience of a fellow code-writing coworker.

Guided Project

In this guided project, we'll learn how to write effective tests and documentation, and how to choose appropriate licenses for our code. Open guided-project.md in the GitHub repository below to follow along with the guided project.

Module Assignment

For this assignment, you'll implement software testing, documentation, and licensing practices to create professional-quality code that follows industry standards.
