How do you test more complicated functions? - python

I'm a total amateur/hobbyist developer trying to learn more about testing the software I write. While I understand the core concept of testing, as the functions get more complicated, I feel as though it's a rabbit hole of variations, outcomes, conditions, etc. For example...
The function below reads files from a directory into a Pandas DataFrame. A few column adjustments are made before the data is passed to a different function that ultimately imports the data into our database.
I've already coded a test for the convert_date_string function. But what about this entire function as a whole - how do I write a test for it? In my mind, much of the Pandas library is already tested - so making sure core functionality there works with my setup seems like a waste. But maybe it isn't. Or maybe this is a refactoring question, to break this down into smaller parts?
Anyway, here is the code... any insight would be appreciated!
import glob

import pandas as pd

# config, convert_date_string and insert_data_into_database are
# defined elsewhere in the project.

def process_file(import_id=None):
    all_files = glob.glob(config.IMPORT_DIRECTORY + "*.txt")
    if len(all_files) == 0:
        return []
    import_data = (pd.read_csv(f, sep='~', encoding='latin-1',
                               warn_bad_lines=True, error_bad_lines=False,
                               low_memory=False) for f in all_files)
    data = pd.concat(import_data, ignore_index=True, sort=False)
    data.columns = [col.lower() for col in data.columns]
    data = data.where(pd.notnull(data), None)
    data['import_id'] = import_id
    data['date'] = data['date'].apply(convert_date_string)
    insert_data_into_database(data=data, table='sales')
    return all_files

There are mainly two kinds of tests - proper unit tests, and integration tests.
Unit tests, as the name implies, test "units" of your program (functions, classes...) in isolation (without considering how they interact with other units). This of course requires that those units can be tested in isolation. For example, a pure function (a function that computes a result from its inputs, where the result depends only on the inputs and will always be the same for the same inputs, and which doesn't have any side effects) is very easy to test, while a function that reads data from a hardcoded path on your filesystem, makes HTTP requests to a hardcoded URL and updates a database (whose connection data are also hardcoded) is almost impossible to test in isolation (and actually almost impossible to test at all).
So the first point is to write your code with testability in mind: favour small, focused units with a single clear responsibility and as few dependencies as possible (preferably taking their dependencies as arguments so you can pass a mock instead). This is of course a bit of a platonic ideal, but it's still a worthy goal. As a last resort, when you cannot get rid of dependencies or parameterize them, you can use a package like mock that will replace your dependencies with bogus objects exposing a similar interface.
Integration testing is about testing whole subsystems from a much higher level - for example for a website project, you may want to test that if you submit the "contact" form an email is sent to a given address and that the data are also stored in the database. You obviously want to do so with a disposable test database and a disposable test mailbox.
The function you posted is possibly doing a bit too much - it reads files, builds a pandas DataFrame, applies some processing, and stores things in a database. You may want to try to factor it into more functions - one to get the file list, one to collect data from the files, one to process the data, etc. (you already have the one storing the data in the database) - and rewrite your process_file (which is actually doing more than processing) to call those functions. This will make it easier to test each part in isolation. Once done with this, you can use mock to test process_file and check that it calls the other functions with the expected arguments, or run it against a test directory and a test database and check the results in the database; a sketch of this refactoring is below.
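To make that concrete, here is a minimal sketch of one possible refactoring and a mock-based test for the orchestrating function. The helper names (list_import_files, load_data, prepare_data) are hypothetical, and the database writer is passed in as an argument so the test can hand it a Mock instead of a real connection:

import glob
from unittest.mock import Mock

import pandas as pd

def list_import_files(directory):
    return glob.glob(directory + "*.txt")

def load_data(files):
    frames = (pd.read_csv(f, sep='~', encoding='latin-1') for f in files)
    return pd.concat(frames, ignore_index=True, sort=False)

def prepare_data(data, import_id):
    data = data.copy()
    data.columns = [col.lower() for col in data.columns]
    data['import_id'] = import_id
    return data

def process_files(directory, import_id=None, insert=None):
    files = list_import_files(directory)
    if not files:
        return []
    insert(data=prepare_data(load_data(files), import_id), table='sales')
    return files

def test_process_files_inserts_prepared_data(tmp_path):
    # tmp_path is pytest's built-in temporary-directory fixture.
    (tmp_path / "a.txt").write_text("Date~Amount\n2020-01-01~10\n")
    fake_insert = Mock()
    files = process_files(str(tmp_path) + "/", import_id=7, insert=fake_insert)
    assert len(files) == 1
    passed = fake_insert.call_args.kwargs['data']
    assert list(passed.columns) == ['date', 'amount', 'import_id']

Each helper can now also be unit-tested on its own with small in-memory DataFrames, without touching the filesystem or the database.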

In general, I wouldn't go down the road of testing pandas or any other dependency. The way I see it, as long as I make sure that a package I use is well developed and well supported, writing tests for it would be redundant. Pandas is a very well supported package.
As to your question about the specific function, and your interest in testing in general, I highly recommend checking out the Hypothesis Python package (you're in luck - it's currently Python-only). It generates test inputs for you, including all kinds of edge cases.
An example from their docs:
from hypothesis import given
from hypothesis.strategies import text

@given(text())
def test_decode_inverts_encode(s):
    assert decode(encode(s)) == s
Here you tell it that the function needs to receive text as input, and the package will run it multiple times with different values that satisfy that criterion. It will also try all kinds of edge cases.
It can do much more than that once you start using it.
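For a self-contained version you can actually run, here is the same idea with a hypothetical encode/decode pair standing in for whatever round trip your own code performs:

from hypothesis import given
from hypothesis.strategies import text

# Hypothetical round-trip pair so the example runs as-is.
def encode(s):
    return s.encode('utf-8')

def decode(b):
    return b.decode('utf-8')

@given(text())
def test_decode_inverts_encode(s):
    # Hypothesis calls this many times with generated strings,
    # deliberately including nasty edge cases like the empty string.
    assert decode(encode(s)) == s

Run it with pytest; if Hypothesis finds a failing input, it will shrink it to a minimal counterexample and report it.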

Related

Unit Testing Functions without Return Value in Python

My final project in CS50 is a salary slip generator in PDF format. I have these functions, but I don't know how to test them.
create_pdf() - function that opens my data file (.xlsx), iterates over its data, and puts it into variables which are then passed to fpdf to write into the PDF file. This function generates as many PDFs as there are records in the data file.
merge_pdf() - function that merges all the previously generated PDFs into one PDF. For this one I might try to check whether it outputs the merged PDF or not, but it's still not quite clear to me how to implement that.
get_print_date() - I created this function just for the sake of adding extra functions to my project, hoping that I could test it. It takes datetime.now() and returns the string value of the current date and time. But how can I assert the return value if it changes over time?
Mine is a generic answer, regardless of the language used.
Generally, when I have to test a method or function that has side effects or does not return any data, I identify the basic functions called within it and mock them.
These are core features that I assume are working and do not need to be tested further, such as:
File management;
Access to the database;
etc.
I therefore suggest you find a library that lets you mock the services used within your functions, and adjust the architecture of your software accordingly.
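For example, for your get_print_date() question, you can freeze the clock with unittest.mock so the return value becomes deterministic. A minimal sketch, assuming the function lives in a hypothetical module payroll.py that does "from datetime import datetime" and formats with '%Y-%m-%d %H:%M':

from datetime import datetime
from unittest.mock import patch

import payroll  # hypothetical module under test

@patch('payroll.datetime')
def test_get_print_date(mock_datetime):
    # Freeze "now" so the assertion no longer depends on the real clock.
    mock_datetime.now.return_value = datetime(2020, 1, 15, 9, 30)
    assert payroll.get_print_date() == '2020-01-15 09:30'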
I hope I was clear.

Creating data for Python tests

I have written a module in Python that reads a couple of tables from a database using the pd.read_sql method, performs some operations on the data, and writes the results back to the same database using the pd.to_sql method.
Now, I need to write unit tests for the operations involved in the above-mentioned module. As an example, one test would check whether the dataframe obtained from the database is empty, another would check whether the data types are correct, etc. For such tests, how do I create sample data that reflects these errors (such as an empty data frame or an incorrect data type)? For other modules that do not read/write from a database, I created a single sample data file (in CSV), read the data, make the necessary manipulations, and test the different functions. For the module related to database operations, how do I (and more importantly, where do I) create sample data?
I was hoping to make a local data file (as I did for testing other modules) and then read it using the read_sql method, but that does not seem possible. Creating a local database using PostgreSQL etc. might be possible, but such tests could not be deployed to clients without requiring them to create the same local databases.
Am I thinking of the problem correctly or missing something?
Thank you
You're thinking about the problem in the right way. Unit tests should not rely on the existence of a database, as that makes them slower, more difficult to set up, and more fragile.
There are (at least) three approaches to the challenge you're describing:
The first, and probably the best one in your case, is to leave read_sql and to_sql out of the tested code. Your code should consist of a 'core' function that accepts a data frame and produces another data frame. You can unit-test this core function using local CSV files, or whatever other data you prefer. In production, you'll have another, very simple, function that just reads data using read_sql, passes it to the 'core' function, gets the result, and writes it using to_sql. You won't be unit-testing this wrapper function - but it's a really simple function and you should be fine (a sketch of this approach follows the three options below).
Use SQLite. The tested function gets a database connection string. In prod, that would be a 'real' database. During your tests, it'll be a lightweight SQLite database that you can keep in your source control or create as part of the test.
The last option, and the most sophisticated one, is to monkey-patch read_sql and to_sql in your test. I think it's overkill in this case. Here's how one can do it:
import pandas as pd

def my_func(sql, con):
    print("I'm here!")
    return "some dummy dataframe"

pd.read_sql = my_func
pd.read_sql("select something ...", "dummy_con")

Please help understand the use case of unit testing

I have 2 scripts.
The 1st script runs a load test using Locust, collects the output in dict format, then passes it to the 2nd script.
The 2nd script accepts a dict as input from script 1, parses it, creates a JSON payload, and sends the data to an API endpoint, which stores it in some DB.
The application starts running from the 1st script and all the functionality is working well. I have never worked on unit testing before. My question here is:
What can be tested here using unit testing, in order to maintain proper standards when building an application?
script-1 (suppose Locust is already running):
from locust import events
# send_to_db is imported from the 2nd script

def on_slave_report(data):
    send_to_db(data)

events.slave_report += on_slave_report
script-2:
import json

import requests

# url, headers and bla are defined elsewhere in the script

def send_to_db(data):
    send_it(take_only_needed(data))

def take_only_needed(data):
    needed = data['stats']
    payload = json.dumps({'stats': needed, 'bla': bla})
    return payload

def send_it(payload):
    requests.request("POST", url, data=payload, headers=headers)
For the two functions send_to_db and send_it, unit testing does not make much sense: both functions consist only of interactions with other components/functions. Since unit testing aims at finding those bugs which can be found in the isolated units, there are no bugs here for unit testing to find. Bugs in such interaction-dominated code lie more in the following areas: are you calling the right functions, in the right order, with the right values for the parameters? Are the parameters in the right order? Are the results/return values delivered in the expected way and format? Answers to all these questions cannot be found in the isolated code, but only in code where the respective components truly interact - and that is integration testing rather than unit testing.
The only function in your script-2 that makes sense to unit-test is take_only_needed: this function performs actual computations. It also has the nice property that its only dependencies (probably) do not cause testing problems and thus probably don't need mocking.
Conclusion: perform unit testing for take_only_needed (a sketch follows); for the others, skip unit testing and cover them during integration testing.
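A minimal sketch of such a test, assuming the quoted 'stats' key from the corrected snippet above and treating bla as some module-level constant:

import json

def test_take_only_needed_keeps_stats():
    data = {'stats': {'requests': 42, 'failures': 0}}
    payload = take_only_needed(data)
    decoded = json.loads(payload)
    # The needed part survives the round trip through json.dumps.
    assert decoded['stats'] == {'requests': 42, 'failures': 0}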

Python - Why we should use mock to do test?

I am very new to Python and I have seen many projects on GitHub using Mock in their tests, but I don't understand why.
When we use mock, we construct a Mock object with a specific return_value. I don't truly understand why we do this. I know sometimes it is difficult to build the resources we need, but what is the point of constructing an object/function with a predetermined return value?
Mock can help to write unit tests.
In unit tests, you want to test a small portion of your implementation. For example, as small as one function or one class.
In moderately large software, these small parts depend on each other, or sometimes there are external dependencies: you open files, do syscalls, or get external data in some other way.
While writing a directed unit test for a small portion of your code, you do not want to spend time setting up everything else around it (the files, syscalls, external data). Mock comes to your help there. With mock, you can make the other dependencies of your code behave exactly as you like. This way you can focus on testing your intended implementation.
Coming to the mock with the return value: say you want to test func_a, and func_a calls func_b. func_b does a lot of funky processing to calculate its return value, for example talking to an external service, doing a bunch of syscalls, or some other expensive operation. Since you are testing func_a, you only care about the possible return values of func_b (so that func_a can use them). In this scenario you would mock func_b and set its return value explicitly. This can really simplify your test complexity.
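A minimal sketch of that scenario (mymodule and the function bodies are hypothetical):

from unittest.mock import patch

import mymodule  # hypothetical module containing func_a and func_b

# mymodule.py is assumed to look like:
#     def func_b():
#         ...  # expensive external call
#     def func_a():
#         return func_b() + 1

@patch('mymodule.func_b', return_value=41)
def test_func_a(mock_func_b):
    # func_b never actually runs; the mock returns 41 immediately.
    assert mymodule.func_a() == 42
    mock_func_b.assert_called_once()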

How do I compare two nested data structures for unittesting?

For those who know Perl, I'm looking for something similar to Test::Deep::is_deeply() in Python.
In Python's unittest I can conveniently compare nested data structures already, if I expect them to be equal:
self.assertEqual(os.walk('some_path'),
                 my.walk('some_path'),
                 "compare os.walk with my own implementation")
However, in the desired test, the order of files in the respective sublists of the os.walk tuples should be of no concern.
If it was just this one test it would be ok to code an easy solution. But I envision several tests on differently structured nested data. And I am hoping for a general solution.
I checked Python's own unittest documentation, looked at pyUnit, and at nose and its plugins. Active maintenance would also be an important criterion for choosing one.
The ultimate goal for me would be to have a set of descriptive types like UnorderedIterable, SubsetOf, SupersetOf, etc which can be called to describe a nested data structure, and then use that description to compare two actual sets of data.
In the os.walk example I'd like something like:
comparison = OrderedIterable(
    OrderedIterable(
        str,
        UnorderedIterable(),
        UnorderedIterable()
    )
)
The above describes the kind of data structure that list(os.walk()) would return. For comparison of data A and data B in a unit test, the current path names would be cast into a str(), and the dir and file lists would be compared ignoring the order with:
self.assertDeep(A, B, comparison, msg)
Is there anything out there? Or is it such a trivial task that people write their own? I feel comfortable doing it, but I don't want to reinvent the wheel, and especially would not want to code the full orthogonal set of types, tests for those, etc. In short, I wouldn't publish it, and thus the next person would have to rewrite it again...
Python Deep seems to be a project to reimplement Perl's Test::Deep. It is written by the author of Test::Deep himself. Last development happened in early 2016.
Update (2018/Aug): the latest release (2016/Feb) is located on PyPI (Deep).
I have done some Python 3 porting work on GitHub.
Not a solution, but the currently implemented workaround to solve the particular example listed in the question:
os_walk = list(os.walk('some_path'))
dt_walk = list(my.walk('some_path'))
self.assertEqual(len(dt_walk), len(os_walk), "walk() same length")
for ((osw, osw_dirs, osw_files), (dt, dt_dirs, dt_files)) in zip(os_walk, dt_walk):
    self.assertEqual(dt, osw, "walk() currentdir")
    self.assertSameElements(dt_dirs, osw_dirs, "walk() dirlist")
    self.assertSameElements(dt_files, osw_files, "walk() fileList")
As we can see from this example implementation, that's quite a bit of code. As we can also see, Python's unittest has most of the ingredients required.
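One way to avoid repeating that boilerplate is to fold it into a small helper (hypothetical, using only unittest's own assertions; assertCountEqual is the Python 3 spelling of the order-insensitive check):

def assert_walk_equal(testcase, walk_a, walk_b):
    # Compare two os.walk()-shaped iterables, ignoring dir/file order.
    walk_a, walk_b = list(walk_a), list(walk_b)
    testcase.assertEqual(len(walk_a), len(walk_b), "walk() same length")
    for (dir_a, dirs_a, files_a), (dir_b, dirs_b, files_b) in zip(walk_a, walk_b):
        testcase.assertEqual(dir_a, dir_b, "walk() currentdir")
        testcase.assertCountEqual(dirs_a, dirs_b, "walk() dirlist")
        testcase.assertCountEqual(files_a, files_b, "walk() fileList")

Inside a TestCase, the check then shrinks to a single line: assert_walk_equal(self, my.walk('some_path'), os.walk('some_path')).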
