I have written a module in Python that reads a couple of tables from a database using pd.read_sql method, performs some operations on the data, and writes the results back to the same database using pd.to_sql method.
Now, I need to write unit tests for the operations involved in the above-mentioned module. As an example, one of the tests would check if the dataframe obtained from the database is empty, another one would check if the data types are correct, etc. For such tests, how do I create sample data that reflects these errors (such as an empty data frame, or an incorrect data type)? For other modules that do not read/write from a database, I created a single sample data file (in CSV), read the data, made the necessary manipulations and tested different functions. For the module related to database operations, how do I (and more importantly, where do I) create sample data?
I was hoping to make a local data file (as I did for testing other modules) and then read it using the read_sql method, but that does not seem possible. Creating a local database using PostgreSQL etc. might be possible, but such tests cannot be deployed to clients without requiring them to create the same local databases.
Am I thinking of the problem correctly or missing something?
Thank you
You're thinking about the problem in the right way. Unit tests should not rely on the existence of a database, as that makes them slower, more difficult to set up, and more fragile.
There are (at least) three approaches to the challenge you're describing:
The first, and probably the best one in your case, is to leave read_sql and to_sql out of the tested code. Your code should consist of a 'core' function that accepts a data frame and produces another data frame. You can unit-test this core function using local CSV files, or whatever other data you prefer. In production, you'll have another, very simple, function that just reads data using read_sql, passes it to the 'core' function, gets the result, and writes it using to_sql. You won't be unit-testing this wrapper function, but it's a really simple function and you should be fine.
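For illustration, a minimal sketch of what that split might look like (the table, columns, and function names here are placeholders, not anything from the actual module):

import pandas as pd

def transform(df):
    # Pure 'core' logic: easy to unit-test with a DataFrame built from a
    # CSV file or constructed inline in the test.
    out = df.copy()
    out["total"] = out["price"] * out["quantity"]
    return out

def run_job(con):
    # Thin, untested wrapper that only does the I/O.
    df = pd.read_sql("SELECT * FROM sales", con)
    transform(df).to_sql("sales_summary", con, if_exists="replace", index=False)

def test_transform_handles_empty_frame():
    empty = pd.DataFrame(columns=["price", "quantity"])
    assert transform(empty).empty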
The second option is to use sqlite. The tested function gets a database connection string: in prod, that would be a 'real' database; during your tests, it'll be a lightweight sqlite database that you can keep in your source control or create as part of the test.
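If you go that route, an in-memory SQLite database is particularly convenient because it needs no files and no setup. A rough sketch (table and column names are made up):

import sqlite3
import pandas as pd

def test_roundtrip_with_sqlite():
    con = sqlite3.connect(":memory:")  # disappears when the connection closes
    seed = pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})
    seed.to_sql("source_table", con, index=False)

    df = pd.read_sql("SELECT * FROM source_table", con)
    assert not df.empty
    assert df["value"].dtype == "float64"
    con.close()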
The last option, and the most sophisticated one, is to monkey-patch read_sql and to_sql in your test. I think it's overkill in this case, but here's how one could do it:
import pandas as pd

def my_func(sql, con):
    print("I'm here!")
    return "some dummy dataframe"

# Replace the real read_sql with the dummy for the duration of the test.
pd.read_sql = my_func
pd.read_sql("select something ...", "dummy_con")
I'm trying to export all data connected to a User instance to a CSV file. In order to do so, I need to get it from the DB first. Using something like
data = SomeModel.objects.filter(owner=user)
on every model possible seems to be very inefficient, so I want to use prefetch_related(). My question is, is there any way to prefetch all different model's instances with FK pointing at my User, at once?
Actually, you don't need to "prefetch everything" in order to create a CSV file – or, anything else – and you really don't want to. Python's CSV support is of course designed to work "row by row," and that's what you want to do here: in a loop, read one row at a time from the database and write it one row at a time to the file.
Remember that Django is lazy. Functions like filter() specify what the filtration is going to be, but things really don't start happening until you start to iterate over the actual collection. That's when Django will build the query, submit it to the SQL engine, and start retrieving the data that's returned ... one row at a time.
Let the SQL engine, Python and the operating system take care of "efficiency." They're really good at that sort of thing.
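As a rough sketch of that row-by-row approach, reusing the SomeModel/owner filter from the question (the exported field names are placeholders):

import csv

def export_user_data(user, path):
    # iterator() streams rows from the database instead of loading them all,
    # and csv.writer writes them out one at a time.
    qs = SomeModel.objects.filter(owner=user).iterator()
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "amount"])  # header row (hypothetical fields)
        for obj in qs:
            writer.writerow([obj.name, obj.amount])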
I'm a total amateur/hobbyist developer trying to learn more about testing the software I write. While I understand the core concept of testing, as the functions get more complicated, I feel as though it's a rabbit hole of variations, outcomes, conditions, etc. For example...
The function below reads files from a directory into a pandas DataFrame. A few column adjustments are made before the data is passed to a different function that ultimately imports the data into our database.
I've already coded a test for the convert_date_string function. But what about this entire function as a whole - how do I write a test for it? In my mind, much of the pandas library is already tested - thus making sure core functionality there works with my setup seems like a waste. But maybe it isn't. Or maybe this is a refactoring question, to break this down into smaller parts?
Anyway, here is the code... any insight would be appreciated!
def process_file(import_id=None):
    all_files = glob.glob(config.IMPORT_DIRECTORY + "*.txt")

    if len(all_files) == 0:
        return []

    import_data = (pd.read_csv(f, sep='~', encoding='latin-1',
                               warn_bad_lines=True, error_bad_lines=False,
                               low_memory=False) for f in all_files)

    data = pd.concat(import_data, ignore_index=True, sort=False)
    data.columns = [col.lower() for col in data.columns]
    data = data.where((pd.notnull(data)), None)
    data['import_id'] = import_id
    data['date'] = data['date'].apply(lambda x: convert_date_string(x))

    insert_data_into_database(data=data, table='sales')
    return all_files
There are mainly two kinds of tests: proper unit tests, and integration tests.
Unit tests, as the name implies, test "units" of your program (functions, classes...) in isolation, without considering how they interact with other units. This of course requires that those units can be tested in isolation. For example, a pure function (a function that computes a result from its inputs, where the result depends only on the inputs and is always the same for the same inputs, and which doesn't have any side effects) is very easy to test, while a function that reads data from a hardcoded path on your filesystem, makes HTTP requests to a hardcoded URL and updates a database (whose connection data are also hardcoded) is almost impossible to test in isolation (and actually almost impossible to test at all).
So the first point is to write your code with testability in mind: favour small, focused units with a single clear responsibility and as few dependencies as possible, preferably taking their dependencies as arguments so you can pass a mock instead. This is of course a bit of a platonic ideal, but it's still a worthy goal. As a last resort, when you cannot get rid of dependencies or parameterize them, you can use a package like mock that will replace your dependencies with bogus objects having a similar interface.
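As a tiny illustration of that last point (every name here is made up), a function that takes its dependency as an argument can be tested with a stand-in instead of a real database or network call:

from unittest import mock

def load_and_count(fetch_rows):
    # fetch_rows is injected, so a test can pass a fake instead of a DB call
    rows = fetch_rows()
    return len(rows)

def test_load_and_count():
    fake_fetch = mock.Mock(return_value=[{"id": 1}, {"id": 2}])
    assert load_and_count(fake_fetch) == 2
    fake_fetch.assert_called_once()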
Integration testing is about testing whole subsystems from a much higher level - for example for a website project, you may want to test that if you submit the "contact" form an email is sent to a given address and that the data are also stored in the database. You obviously want to do so with a disposable test database and a disposable test mailbox.
The function you posted is possibly doing a bit too much: it reads files, builds a pandas dataframe, applies some processing, and stores things in a database. You may want to try to factor it into more functions (one to get the file list, one to collect data from the files, one to process the data, etc.; you already have the one that stores the data in the database) and rewrite your process_file (which is actually doing more than processing) to call those functions. This will make it easier to test each part in isolation. Once done with this, you can use mock to test the process_file function and check that it calls the other functions with the expected arguments, or run it against a test directory and a test database and check the results in the database.
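A rough sketch of that factoring might look something like this (keeping your existing helpers; this isn't meant as the definitive split):

def list_import_files():
    return glob.glob(config.IMPORT_DIRECTORY + "*.txt")

def load_files(files):
    frames = (pd.read_csv(f, sep='~', encoding='latin-1',
                          warn_bad_lines=True, error_bad_lines=False,
                          low_memory=False) for f in files)
    return pd.concat(frames, ignore_index=True, sort=False)

def clean_data(data, import_id):
    data.columns = [col.lower() for col in data.columns]
    data = data.where(pd.notnull(data), None)
    data['import_id'] = import_id
    data['date'] = data['date'].apply(convert_date_string)
    return data

def process_file(import_id=None):
    files = list_import_files()
    if not files:
        return []
    data = clean_data(load_files(files), import_id)
    insert_data_into_database(data=data, table='sales')
    return files

And a mock-based test of the "no files" path could then look like this ("mymodule" is a placeholder for wherever process_file lives):

from unittest import mock

@mock.patch("mymodule.insert_data_into_database")
@mock.patch("mymodule.list_import_files", return_value=[])
def test_process_file_with_no_files(mock_list, mock_insert):
    assert process_file(import_id=1) == []
    mock_insert.assert_not_called()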
In general, I wouldn't go down the road of testing pandas or any other dependency. The way I see it, if I've made sure that a package I use is well developed and well supported, then writing tests for it would be redundant, and pandas is a very well supported package.
As to your question about the specific function, and your interest in testing in general, I highly recommend checking out the Hypothesis Python package (you're in luck, it's currently only for Python). It provides mock data and generates edge cases for testing purposes.
An example from their docs:
from hypothesis import given
from hypothesis.strategies import text

@given(text())
def test_decode_inverts_encode(s):
    assert decode(encode(s)) == s
Here you tell it that the function needs to receive text as input, and the package will run the test multiple times with different values that meet that criterion. It will also try all kinds of edge cases. And it can do much more once you put it to use.
I have a SQLAlchemy-based tool for selectively copying data between two different databases for testing purposes. I use the merge() function to take model objects from one session and store them in another session. I'd like to be able to store the source objects in some intermediate form and then merge() them at some later point in time.
It seems like there are a few options to accomplish this:
Exporting DELETE/INSERT SQL statements. Seems pretty straightforward, I think I can get SQLAlchemy to give me the INSERT statements, and maybe even the DELETEs.
Exporting the data to a SQLite database file with the same (or similar) schema, which could then be read in as a source at a later point in time.
Serializing the data in some manner and then reading them back into memory for the merge. I don't know if SQLAlchemy has something like this built-in or not. I'm not sure what the challenges would be in rolling this myself.
Has anyone tackled this problem before? If so, what was your solution?
EDIT: I found a tool built on top of SQLAlchemy called dataset that provides the freeze functionality I'm looking for, but there seems to be no corresponding thaw functionality for restoring the data.
I haven't used it before, but the dogpile caching techniques described in the documentation might be what you want. This allows you to query to and from a cache using the SQLAlchemy API:
http://docs.sqlalchemy.org/en/rel_0_9/orm/examples.html#module-examples.dogpile_caching
I am implementing an import tool (Django 1.6) that takes a potentially very large CSV file, validates it and, depending on user confirmation, imports it or not. Given the potentially large file size, the processing of the file is done via flowy (a Python wrapper over Amazon's SWF). Each import job is saved in a table in the DB, and the workflow, which is quite simple and consists of only one activity, basically calls a method that runs the import and saves all necessary information about the processing of the file in the job's record in the database.
The tricky thing is: We now have to make this import atomic. Either all records are saved or none. But one of the things saved in the import table is the progress of the import, which is calculated based on the position of the file reader:
progress = (raw_data.tell() * 100.0) / filesize
And this progress is used by an AJAX progress bar widget on the client side. So simply adding @transaction.atomic to the method that loops through the file and imports the rows is not a solution, because the progress will only be saved on commit.
The CSV files only contain one type of record and affect a single table. If I could somehow do a transaction only on this table, leaving the job table free for me to update the progress column, it would be ideal. But from what I've found so far it seems impossible. The only solution I could think of so far is opening a new thread and a new database connection inside it every time I need to update the progress. But I keep wondering… will this even work? Isn't there a simpler solution?
One simple approach would be to use the READ UNCOMMITTED transaction isolation level. That could allow dirty reads, which would allow your other processes to see the progress even though the transaction hasn't been committed. However, whether this works or not will be database-dependent. (I'm not familiar with MySQL, but this wouldn't work in PostgreSQL because READ UNCOMMITTED works the same way as READ COMMITTED.)
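If you want to experiment with that, the statement itself is plain SQL and could be issued on the connection used by the process that reads the progress; a sketch (MySQL syntax, and again, PostgreSQL will accept it but still behave like READ COMMITTED):

from django.db import connections

cursor = connections['default'].cursor()
cursor.execute("SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED")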
Regarding your proposed solution, you don't necessarily need a new thread, you really just need a fresh connection to the database. One way to do that in Django might be to take advantage of the multiple database support. I'm imagining something like this:
As described in the documentation, add a new entry to DATABASES with a different name, but the same setup as default. From Django's perspective we are using multiple databases, even though we in fact just want to get multiple connections to the same database.
When it's time to update the progress, do something like:
JobData.objects.using('second_db').filter(id=5).update(progress=0.5)
That should take place in its own autocommitted transaction, allowing the progress to be seen by your web server.
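For reference, the DATABASES entry might look roughly like this; the engine and credentials are placeholders for whatever your default already uses:

# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydb',
        'USER': 'myuser',
        'PASSWORD': 'secret',
        'HOST': 'localhost',
    },
    # Same settings under a second alias, so Django opens a separate connection.
    'second_db': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydb',
        'USER': 'myuser',
        'PASSWORD': 'secret',
        'HOST': 'localhost',
    },
}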
Now, does this work? I honestly don't know, I've never tried anything like it!
I have data that is best represented by a tree. Serializing the structure makes the most sense, because I don't want to sort it every time, and it would allow me to make persistent modifications to the data.
On the other hand, this tree is going to be accessed from different processes on different machines, so I'm worried about the details of reading and writing. Basic searches didn't yield very much on the topic.
If two users simultaneously attempt to revive the tree and read from it, can they both be served at once, or does one arbitrarily happen first?
If two users have the tree open (assuming they can) and one makes an edit, does the other see the change implemented? (I assume they don't because they each received what amounts to a copy of the original data.)
If two users alter the object and close it at the same time, again, does one come first, or is an attempt made to make both changes simultaneously?
I was thinking of making a queue of changes to be applied to the tree, and then having the tree execute them in the order of submission. I thought I would ask what my problems are before trying to solve any of them.
Without trying it out I'm fairly sure the answer is:
They can both be served at once; however, if one user is reading while the other is writing, the reading user may get strange results.
Probably not. Once the tree has been read from the file into memory, the other user will not see the first user's edits. If the tree hasn't been read from the file yet, the change will still be picked up when it is read.
Both changes will be made simultaneously and the file will likely be corrupted.
Also, you mentioned shelve. From the shelve documentation:
The shelve module does not support concurrent read/write access to shelved objects. (Multiple simultaneous read accesses are safe.) When a program has a shelf open for writing, no other program should have it open for reading or writing. Unix file locking can be used to solve this, but this differs across Unix versions and requires knowledge about the database implementation used.
Personally, at this point, you may want to look into using a simple key-value store like Redis with some kind of optimistic locking.
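For what it's worth, a rough sketch of that optimistic-locking pattern with redis-py, storing the whole tree as JSON under a single key (the key name and tree structure are made up):

import json
import redis
from redis.exceptions import WatchError

r = redis.Redis()

def update_tree(key, mutate):
    # mutate takes the current tree and returns the modified tree
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(key)                      # track concurrent writes to the key
                tree = json.loads(pipe.get(key) or "{}")
                new_tree = mutate(tree)
                pipe.multi()
                pipe.set(key, json.dumps(new_tree))
                pipe.execute()                       # raises WatchError if the key changed
                return new_tree
            except WatchError:
                continue                             # someone else wrote first; retry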
You might try klepto, which provides a dictionary interface to a SQL database (using sqlalchemy under the covers). If you choose to persist your data to a mysql, postgresql, or other available database (aside from sqlite), then you can have two or more people, or two or more threads/processes, access the data simultaneously and let the database manage the concurrent reads and writes. Using klepto with a database backend will perform under concurrent access as well as if you were accessing the database directly.

If you don't want to use a database backend, klepto can write to disk as well. There is some potential for conflict when writing to disk, even though klepto uses a "copy-on-write, then replace" strategy that minimizes concurrency conflicts when working with files on disk; with a file (or directory) backend, your issues 1-2-3 are still handled thanks to that strategy. Additionally, klepto can use an in-memory caching layer that enables fast access, where loads/dumps from the on-disk (or database) backend are done either on demand or when the in-memory cache reaches a user-determined size.
To be specific: (1) both are served at the same time. (2) if one user makes an edit, the other user sees the change -- however that change may be 'delayed' if the second user is using an in-memory caching layer. (3) multiple simultaneous writes are not a problem, due to klepto letting NFS or the sql database handle the "copy-on-write, then replace" changes.
The dictionary interface for klepto.archives is also available in decorator form that provides LRU caching (and LFU and others), so if you have a function that generates/accesses the data, hooking up the archive is really easy: you get memoization with an on-disk or database backend.
With klepto, you can pick from several different serialization methods to encode your data. You can have klepto cast data to a string, or use a hashing algorithm (like md5), or use a pickler (like json, pickle, or dill).
You can get klepto here: https://github.com/uqfoundation/klepto
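I haven't verified this against the current klepto API, so treat it as a rough sketch of the dictionary-style usage described above rather than a definitive example (the file name and keys are made up):

from klepto.archives import file_archive   # or sql_archive for a database backend

arch = file_archive('tree.pkl', cached=True)   # in-memory cache over an on-disk file
arch['root'] = {'left': 1, 'right': 2}         # hypothetical tree node
arch.dump()    # flush the in-memory cache to disk
arch.load()    # pull entries from disk back into the cache
print(arch['root'])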