Communicating with the outside world from within an atomic database transaction - python

I am implementing an import tool (Django 1.6) that takes a potentially very large CSV file, validates it and, depending on user confirmation, imports it or not. Given the potentially large file size, the processing of the file is done via flowy (a Python wrapper over Amazon's SWF). Each import job is saved in a table in the DB, and the workflow, which is quite simple and consists of only one activity, basically calls a method that runs the import and saves all necessary information about the processing of the file in the job's record in the database.
The tricky thing is: We now have to make this import atomic. Either all records are saved or none. But one of the things saved in the import table is the progress of the import, which is calculated based on the position of the file reader:
progress = (raw_data.tell() * 100.0) / filesize
And this progress is used by an AJAX progress bar widget on the client side. So simply adding @transaction.atomic to the method that loops through the file and imports the rows is not a solution, because the progress will only be saved on commit.
The CSV files only contain one type of record and affect a single table. If I could somehow do a transaction only on this table, leaving the job table free for me to update the progress column, it would be ideal. But from what I've found so far it seems impossible. The only solution I could think of so far is opening a new thread and a new database connection inside it every time I need to update the progress. But I keep wondering… will this even work? Isn't there a simpler solution?

One simple approach would be to use the READ UNCOMMITTED transaction isolation level. That could allow dirty reads, which would let your other processes see the progress even though the transaction hasn't been committed. However, whether this works or not will be database-dependent. (I'm not familiar with MySQL, but this wouldn't work in PostgreSQL because READ UNCOMMITTED works the same way as READ COMMITTED.)
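If the project happens to be on MySQL, a rough, untested sketch of what the reading side could look like: the view behind the AJAX progress bar switches its own connection to dirty reads before querying the job row. JobData is taken from the example further down; the app path and field names are placeholders.

import json

from django.db import connection
from django.http import HttpResponse

from myapp.models import JobData   # placeholder app path

def job_progress(request, job_id):
    # switch this connection to dirty reads (MySQL syntax), so rows written
    # inside the still-open import transaction become visible to this query
    cursor = connection.cursor()
    cursor.execute("SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED")
    job = JobData.objects.get(id=job_id)
    return HttpResponse(json.dumps({'progress': job.progress}),
                        content_type='application/json')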
Regarding your proposed solution: you don't necessarily need a new thread, you really just need a fresh connection to the database. One way to do that in Django might be to take advantage of the multiple database support. I'm imagining something like this:
As described in the documentation, add a new entry to DATABASES with a different name, but the same setup as default. From Django's perspective we are using multiple databases, even though we in fact just want to get multiple connections to the same database.
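For instance, a sketch of what the settings might look like; the engine, database name and credentials are placeholders, and the point is simply that both aliases hold identical connection details:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'import_app',
        'USER': 'import_app',
        'PASSWORD': 'secret',
        'HOST': 'localhost',
    },
    # same physical database, but Django will open a separate connection for it
    'second_db': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'import_app',
        'USER': 'import_app',
        'PASSWORD': 'secret',
        'HOST': 'localhost',
    },
}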
When it's time to update the progress, do something like:
JobData.objects.using('second_db').filter(id=5).update(progress=0.5)
That should take place in its own autocommitted transaction, allowing the progress to be seen by your web server.
Now, does this work? I honestly don't know, I've never tried anything like it!

Related

Creating data for Python tests

I have written a module in Python that reads a couple of tables from a database using the pd.read_sql method, performs some operations on the data, and writes the results back to the same database using the pd.to_sql method.
Now, I need to write unit tests for the operations involved in the above-mentioned module. As an example, one of the tests would check if the dataframe obtained from the database is empty, another one would check if the data types are correct, etc. For such tests, how do I create sample data that reflects these errors (such as an empty data frame or an incorrect data type)? For other modules that do not read/write from a database, I created a single sample data file (in CSV), read the data, made the necessary manipulations and tested the different functions. For the module related to database operations, how do I (and more importantly where do I) create sample data?
I was hoping to make a local data file (as I did for testing other modules) and then read it using the read_sql method, but that does not seem possible. Creating a local database using PostgreSQL etc. might be possible, but such tests cannot be deployed to clients without requiring them to create the same local databases.
Am I thinking of the problem correctly or missing something?
Thank you
You're thinking about the problem in the right way. Unit tests should not rely on the existence of a database, as that makes them slower, more difficult to set up, and more fragile.
There are (at least) three approaches to the challenge you're describing:
The first, and probably the best one in your case, is to leave read_sql and to_sql out of the tested code. Your code should consist of a 'core' function that accepts a data frame and produces another data frame. You can unit-test this core function using local CSV files, or whatever other data you prefer. In production, you'll have another, very simple, function that just reads data using read_sql, passes it to the 'core' function, gets the result, and writes it using to_sql. You won't be unit-testing this wrapper function - but it's a really simple function and you should be fine.
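For example, a rough sketch of that split; the column names and the transformation are made up for illustration:

import pandas as pd

def transform(df):
    # the 'core' function: pure DataFrame in, DataFrame out -- this is what gets unit-tested
    out = df.copy()
    out['total'] = out['price'] * out['quantity']
    return out

def run_report(con):
    # the thin wrapper that is not unit-tested: all database I/O lives here
    df = pd.read_sql('SELECT * FROM orders', con)
    transform(df).to_sql('orders_processed', con, if_exists='replace', index=False)

def test_transform_handles_empty_frame():
    # a unit test only needs a hand-built frame, no database at all
    result = transform(pd.DataFrame({'price': [], 'quantity': []}))
    assert result.empty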
Use sqlite. The tested function gets a database connection string. In prod, that would be a 'real' database. During your tests, it'll be a lightweight sqlite database that you can keep in your source control or create as part of the test.
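A sketch of that idea with an in-memory SQLite database built through SQLAlchemy; the table and column names are again invented:

import pandas as pd
from sqlalchemy import create_engine

def test_roundtrip_with_sqlite():
    engine = create_engine('sqlite:///:memory:')   # throwaway test database
    pd.DataFrame({'price': [1.0, 2.0], 'quantity': [3, 4]}).to_sql(
        'orders', engine, index=False)
    df = pd.read_sql('SELECT * FROM orders', engine)
    assert not df.empty
    assert list(df.columns) == ['price', 'quantity']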
The last option, and the most sophisticated one, is to monkey-patch read_sql and to_sql in your test. I think it's overkill in this case. Here's how one can do it:
import pandas as pd

def my_func(sql, con):
    # stand-in that replaces pd.read_sql for the duration of the test
    print("I'm here!")
    return "some dummy dataframe"

pd.read_sql = my_func
pd.read_sql("select something ...", "dummy_con")

Django - Best way to create snapshots of objects

I am currently working on a Django 2+ project involving a blockchain, and I want to make copies of some of my objects' states into that blockchain.
Basically, I have a model (say "contract") that has a list of several "signature" objects.
I want to make a snapshot of that contract, with the signatures. What I am basically doing is taking the contract at some point in time (when it's created for example) and building a JSON from it.
My problem is: I want to update that snapshot anytime a signature is added/updated/deleted, and each time the contract is modified.
The intuitive solution would be to override each "delete", "create", "update" of each of the models involved in that snapshot, and pray that all of them are implemented right and that I didn't forget any. But I think that this is not scalable at all, hard to debug and hard to maintain.
I have thought of a solution that might be more centralized: using a periodic job to get the last update date of my object, compare it to the date of my snapshot, and update the snapshot if necessary.
However with that solution, I can identify changes when objects are modified or created, but not when they are deleted.
So, this is my big question mark: how, with Django, can you identify deletions in relationships, without any prior context, just by looking at the current database state? Is there a Django module to record deleted objects? What are your thoughts on my issue?
I think that, as I understand your problem, what you need is Django's signals framework, which lets you listen for model-level changes (saves and deletes made through the ORM) and, when they are detected (and if all the desired conditions are met), execute certain commands in your application (including ones that write back to the database).
This is the most recent documentation:
https://docs.djangoproject.com/en/3.1/topics/signals/
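For example, a sketch with post_save and post_delete receivers; the model names, app paths and the rebuild_snapshot() helper are hypothetical. One caveat worth knowing: signals are not sent for queryset.update() or raw SQL, so bulk edits would still slip through.

from django.db.models.signals import post_save, post_delete
from django.dispatch import receiver

from contracts.models import Contract, Signature   # hypothetical app and models
from contracts.snapshots import rebuild_snapshot   # hypothetical helper that builds the JSON

@receiver(post_save, sender=Signature)
@receiver(post_delete, sender=Signature)
def signature_changed(sender, instance, **kwargs):
    # any added/updated/deleted signature refreshes its contract's snapshot
    rebuild_snapshot(instance.contract)

@receiver(post_save, sender=Contract)
def contract_changed(sender, instance, **kwargs):
    rebuild_snapshot(instance)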

How to Improve a report processing time (Django/MySQL)?

I'm looking for ideas on how to improve a report that takes up to 30 minutes to process on the server. I'm currently working with Django and MySQL, but if there is a solution that requires changing the language or SQL database I'm open to it.
The report I'm talking about reads multiple Excel files and inserts all the rows from those files into a table (the report table) of somewhere between 12K and 15K records; the table has around 50 columns. This part doesn't take that much time.
Once I have all the records in the report table I start applying multiple phases of business logic, so I end up having something like this:
def create_report():
    business_logic_1()
    business_logic_2()
    business_logic_3()
    business_logic_4()
Each business_logic_X function does something very similar: it starts by doing a ReportModel.objects.all() and then applies multiple calculations, like checking dates, quantities, etc., and updates the record. Since it's a 12K-record table, it quickly starts adding time to the complete report.
The reason I'm running the functions separately instead of doing all the processing in one pass is that the logic from the first function needs to be completed before the logic in the next functions can work (e.g. the first function finds all related records and applies the same status to all of them).
The first thing that I know could be optimized is somehow caching the objects.all() result instead of calling it in each function, but I'm not sure how to pass it to the next function without saving the records first.
I already optimized the report a bit by using update_fields in the save() calls of those functions, and that saved a bit of time.
My question is, is there a better approach to this kind of problem? Is Django/MySQL the right stack for this?
What takes time is the business logic that you're doing in Django: it makes several round trips between the database and the application.
It sounds like there are several tables involved, so I would suggest that you write your query in raw SQL and, once you have the results, pull only what you actually need into the application.
The ORM has a raw() method that you can use. Or you could drop down to an even lower level and interface with your database directly.
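A sketch of both variants; the table, column and status values are made up for the example:

from django.db import connection
from reports.models import ReportModel   # hypothetical app path

# Option 1: Manager.raw() still returns ReportModel instances,
# but the database does the filtering up front instead of objects.all()
late_rows = ReportModel.objects.raw(
    "SELECT id, due_date, status FROM reports_reportmodel WHERE due_date < NOW()")

# Option 2: push a whole business-logic pass into one UPDATE statement
with connection.cursor() as cursor:
    cursor.execute(
        "UPDATE reports_reportmodel SET status = 'late' WHERE due_date < NOW()")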
Without seeing more of what you do, I can't give any more specific advice.

Continually process data from a Postgres database - what approach to take?

Have a question about what sort of approach to take on a process I am trying to structure. Working with PostgreSQL and Python.
Scenario:
I have two databases A and B.
B is a processed version of A.
Data continually streams into A, which needs to be processed in a certain way (using multi-processing) and is then stored in B.
Each new row in A needs to be processed only once.
So:
streamofdata ===> [database A] ----> process ----> [database B]
Database A is fairly large (40 GB) and growing. My question is about determining which data is new and has not yet been processed and put into B. What is the best way to determine which rows still have to be processed?
Matching primary keys each time against what has not yet been processed is not the way to go, I am guessing.
So let's say new rows 120 to 130 come into database A over some time period, and my last processed row was 119. Is it a correct approach to look at the id (the primary key) of the last processed row, 119, and say that anything beyond that should now be processed?
Also wondering whether anyone has any further resources on this sort of 'realtime' processing of data. Not exactly sure what I am looking for technically speaking.
Well, there are a few ways you could handle this problem. As a reminder, the process you are describing is basically re-implementing a form of database replication, so you may want to familiarize yourself with the various popular replication options out there for Postgres and how they work, particularly Slony might be of interest to you. You didn't specify what sort of database "database B" is, so I'll assume it's a separate PostgreSQL instance, though that assumption won't change a whole lot about the decisions below other than ruling out some canned solutions like Slony.
1. Set up a FOR EACH ROW trigger on the important table(s) you have in database A which need to be replicated. Your trigger would take each new row INSERTed (and/or UPDATEd, DELETEd, if you need to catch those) in those tables and send them off to database B appropriately. You mentioned using Python, so just a reminder that you can certainly write these trigger functions in PL/Python if that makes life easy for you, i.e. you should hopefully be able to more-or-less easily tweak your existing code so that it runs inside the database as a PL/Python trigger function.
2. If you read up on Slony, you might have noticed that proposal #1 is very similar to how Slony works -- consider whether it would be easy or helpful for you to have Slony take over the replication of the necessary tables from database A to database B, then if you need to further move/transform the data into other tables inside database B, you might do that with triggers on those tables in database B.
3. Set up a trigger or RULE which will send out a NOTIFY with a payload indicating the row which has changed. Your code will LISTEN for these notifications and know immediately which rows have changed (there is a small sketch of this after the list). The psycopg2 adapter has good support for LISTEN and NOTIFY. N.B. you will need to exercise some care to handle the case where your listener code has crashed or got disconnected from the database or otherwise missed some notifications.
4. In case you have control over the code streaming data into database A, you could have that code take over the job of replicating its new data into database B.
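A small sketch of the listening side of option 3, assuming psycopg2 and a trigger that does NOTIFY on a channel named row_changed with the new row's primary key as the payload (the channel name and payload format are invented):

import select
import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=database_a")
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

cur = conn.cursor()
cur.execute("LISTEN row_changed;")

while True:
    # wait until the connection's socket is readable, then drain notifications
    if select.select([conn], [], [], 5) == ([], [], []):
        continue   # timeout: a good moment to re-check for rows missed while disconnected
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        print("row to process:", notify.payload)   # hand off to the worker pool here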

Can serialized objects be accessed simultaneously by different processes, and how do they behave if so?

I have data that is best represented by a tree. Serializing the structure makes the most sense, because I don't want to sort it every time, and it would allow me to make persistent modifications to the data.
On the other hand, this tree is going to be accessed from different processes on different machines, so I'm worried about the details of reading and writing. Basic searches didn't yield very much on the topic.
1. If two users simultaneously attempt to revive the tree and read from it, can they both be served at once, or does one arbitrarily happen first?
2. If two users have the tree open (assuming they can) and one makes an edit, does the other see the change implemented? (I assume they don't because they each received what amounts to a copy of the original data.)
3. If two users alter the object and close it at the same time, again, does one come first, or is an attempt made to make both changes simultaneously?
I was thinking of making a queue of changes to be applied to the tree, and then having the tree execute them in the order of submission. I thought I would ask what my problems are before trying to solve any of them.
Without trying it out I'm fairly sure the answer is:
1. They can both be served at once; however, if one user is reading while the other is writing, the reading user may get strange results.
2. Probably not. Once the tree has been read from the file into memory, the other user will not see the first user's edits. If the tree hasn't been read from the file yet, then the change will still be picked up.
3. Both changes will be made simultaneously and the file will likely be corrupted.
Also, you mentioned shelve. From the shelve documentation:
The shelve module does not support concurrent read/write access to shelved objects. (Multiple simultaneous read accesses are safe.) When a program has a shelf open for writing, no other program should have it open for reading or writing. Unix file locking can be used to solve this, but this differs across Unix versions and requires knowledge about the database implementation used.
Personally, at this point, you may want to look into using a simple key-value store like Redis with some kind of optimistic locking.
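For what it's worth, a sketch of what optimistic locking looks like with redis-py; the 'tree' key and the pickled structure are placeholders:

import pickle
import redis

r = redis.Redis()

def update_tree(mutate):
    # retry loop: WATCH makes the transaction fail if another client
    # writes 'tree' between our read and our write
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch('tree')
                raw = pipe.get('tree')             # immediate-mode read while watching
                tree = pickle.loads(raw) if raw else {}
                mutate(tree)                       # apply this client's change in memory
                pipe.multi()
                pipe.set('tree', pickle.dumps(tree))
                pipe.execute()                     # raises WatchError if 'tree' changed
                return
            except redis.WatchError:
                continue                           # lost the race -- reload and retry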
You might try klepto, which provides a dictionary interface to a sql database (using sqlalchemy under the covers). If you choose to persist your data to a mysql, postgresql, or other available database (aside from sqlite), then you can have two or more people access the data simultaneously or have two threads/processes access the database tables -- and have the database manage the concurrent read-writes. Using klepto with a database backend will perform under concurrent access as well as if you were accessing the database directly. If you don't want to use a database backend, klepto can write to disk as well -- however there is some potential for conflict when writing to disk -- even though klepto uses a "copy-on-write, then replace" strategy that minimizes concurrency conflicts when working with files on disk. When working with a file (or directory) backend, your issues 1-2-3 are still handled due to the strategy klepto employs for saving writes to disk. Additionally, klepto can use an in-memory caching layer that enables fast access, where loads/dumps from the on-disk (or database) backend are done either on-demand or when the in-memory cache reaches a user-determined size.
To be specific: (1) both are served at the same time. (2) if one user makes an edit, the other user sees the change -- however that change may be 'delayed' if the second user is using an in-memory caching layer. (3) multiple simultaneous writes are not a problem, due to klepto letting NFS or the sql database handle the "copy-on-write, then replace" changes.
The dictionary interface for klepto.archives is also available in a decorator form that provides LRU caching (and LFU and others), so if you have a function that is generating/accessing the data, hooking up the archive is really easy -- you get memoization with an on-disk or database backend.
With klepto, you can pick from several different serialization methods to encode your data. You can have klepto cast data to a string, use a hashing algorithm (like md5), or use a pickler (like json, pickle, or dill).
You can get klepto here: https://github.com/uqfoundation/klepto
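A minimal sketch of what using klepto's file backend might look like, going off its README; the archive name and keyword arguments may differ between versions, so treat this as a starting point rather than a reference:

from klepto.archives import file_archive

# a pickled file on disk acting as a shared dictionary
tree = file_archive('tree.pkl', serialized=True, cached=True)
tree['root'] = {'children': []}   # edits land in the in-memory cache first
tree.dump()                       # push the cache out to the file on disk
tree.load()                       # pull other writers' changes back in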
