How to work with queues for file processing in Python?

I'm currently writing an application in Python that mainly works with files. The algorithm works like this:
A user submits a file through an API, imagine a POST request with the file and some data.
Then the program works with the file and extracts some conclusions.
After that, those conclusions are stored in a DB.
Then the user is able to query the DB and ask for the conclusions.
Since users can submit files through the API simultaneously from many systems, and processing a file may take some time, I want to explore a way to implement a work queue such that:
Only one file is processed at a time, so when a user submits a file, it is put into a work queue and has to wait before entering the "processing function".
How can I do that? Any reference or tutorial?
Thanks

Check out Celery; there are many good tutorials online.
It works with workers, so it also doesn't block your API from listening for new requests.
It can also give you the option to process multiple files concurrently if you want.
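For illustration, a minimal sketch of that setup, assuming Celery with a Redis broker; the broker URL and the analyze_file/save_to_db helpers are placeholders for your own code:

```python
# Minimal sketch assuming Celery with a Redis broker; analyze_file() and
# save_to_db() are placeholders for your own processing and storage code.
from celery import Celery

app = Celery("fileproc", broker="redis://localhost:6379/0")

# worker_concurrency=1 gives the "one file at a time" behaviour asked for;
# raise it later if you want several files processed in parallel.
app.conf.worker_concurrency = 1

@app.task
def process_file(path, metadata):
    conclusions = analyze_file(path, metadata)   # your processing function
    save_to_db(conclusions)                      # store results for later queries

# In the API view, enqueue instead of processing inline:
#   process_file.delay("/tmp/upload-123.csv", {"user": 42})
```

Setting worker_concurrency to 1 (or starting the worker with -c 1) enforces strictly sequential processing; bump it up later if parallel processing turns out to be fine.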

Related

Can Flask or Django handle concurrent tasks?

What I'm trying to accomplish:
I have a sensor that is constantly reading in data. I need to print this data to a UI whenever new data appears. While that task is running, the user should be able to write data to the sensor. Ideally, both of these tasks would happen at the same time. Currently, I have the program written using Flask, but if Django (or a third party) would be better suited, I would be willing to make the switch. Note: this website will never be deployed, so no need to worry about that. The only user will be me, running the program from my laptop.
I have spent a lot of time researching Flask async functions and coroutines; however, I have not seen any clear indication of whether something like this is possible.
Not looking for a line-by-line solution. Rather, a way (async, threading, etc.) to set up the code such that the aforementioned tasks are possible. All help is appreciated, thanks.
I'm a Django guy, so I'll throw out what I think could be possible.
There's a @start_new_thread decorator commonly used with Django that can be put on any function so it runs in a thread.
You could make a view, POST to it with JavaScript/Ajax, and start a thread that communicates with the sensor using the POSTed data.
You could also make a threaded function that reads from the sensor.
It could be a management command, or a 'start' button that POSTs to a view which then starts the thread.
Note: you need locks or some other logic so the two threads don't conflict when reading/writing.
Maybe it's a single thread that reads/writes to the sensor, and on each loop it checks whether there's anything to write (existence + contents of a file? Maybe a DB entry?).
As for the UI, let's say it's a webpage. Your best bet would be WebSockets, but since you're the only one who will ever use it, you could just write some JavaScript/Ajax that pings a view every x seconds and displays the new data on the page.
Note: that gives you roughly the same effect as WebSockets, just by polling every x seconds instead of holding a persistent connection.
The common thread here is JavaScript/Ajax, so the page doesn't need to refresh and you can constantly see the data coming in.
You can probably do all of this in Flask if you find a similar threading ability and add some JavaScript to the frontend.
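A rough, framework-agnostic sketch of the single-thread-plus-lock idea; read_sensor()/write_sensor() and the 1-second poll interval are assumptions standing in for your hardware code:

```python
# One background thread owns the sensor; a lock keeps reads and writes
# from colliding. read_sensor()/write_sensor() are placeholders.
import threading
import time

lock = threading.Lock()
latest_reading = None          # what the Ajax-polled view returns
pending_writes = []            # what the POST view appends to

def sensor_loop():
    global latest_reading
    while True:
        with lock:                          # avoid read/write collisions
            if pending_writes:
                write_sensor(pending_writes.pop(0))
            latest_reading = read_sensor()
        time.sleep(1)

# Start it once (e.g. from a management command or a 'start' view):
threading.Thread(target=sensor_loop, daemon=True).start()
```

The views then only touch latest_reading and pending_writes under the same lock, which is the "single thread that reads/writes each loop" variant described above.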
Hopefully you find some of this useful, and idk why Stack Overflow hates these types of questions... They're literally fine

Python Flask: how to avoid multiple files being accessed at the same time (mutual exclusion on files)?

I am currently developing a data processing web server (Linux) using Python Flask.
The general workflow is:
Get an input file from the user (handled by Flask)
Flask passes this input file to a Java program
The Java program processes the input file and saves the outputs (multiple files) on the server
Flask calls another Python script which processes these outputs to get the final result and returns it to the client
The problem is: between step 3 and step 4 there are some intermediate files. This would not have been a problem at all for a local program, but on a server, when more than one client accesses it at the same time, a client could get unexpected results generated from input provided by another user.
The way I see it, this is a mutual exclusion problem on file access. I have had mutual exclusion problems with threads before and solved some of them using locks (like synchronized in Java and Lock in Python), but I am not sure what to do when it comes to files instead of threads.
It occurred to me that maybe I can spawn different copies of the files for different clients. But as I understand it, HTTP is stateless, so you can't really know who is accessing the server. I don't want to add a login system and a user database just for this, as I sense there is a much simpler and better way to resolve the problem.
I have been looking for a good solution for days but haven't found an ideal one, so I am looking for some advice here. Any suggestions will be highly appreciated. If you can suggest a viable solution, please feel free to provide your name so I can add you to the acknowledgements of the digital and paper publications about this tool when it's published.
As a systems kind of person, I suggest something like this:
https://docs.python.org/3/library/fcntl.html#fcntl.lockf
This is how I would solve it; there are many ways to approach this problem, and which one is best is of course up for debate.
Assume the output file is where the conflict happens.
Lock that file and keep polling until the resource is released (the user has to wait), so that only one user accesses the file at a time. Poll with time.sleep for 2-3 seconds inside a try/except; with the lock held on the output file, the next user's process only goes through once the resource is released.
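A minimal sketch of that lock-and-poll idea using fcntl.lockf as linked above; the path, the work() callback and the retry delay are assumptions to adapt to your intermediate/output files:

```python
# Take an exclusive lock on the shared output file; if someone else holds it,
# sleep and retry. Linux/Unix only (fcntl), which matches the question.
import fcntl
import time

def process_with_lock(path, work, retry_delay=2):
    while True:
        with open(path, "a") as f:
            try:
                # Try an exclusive, non-blocking lock on the output file.
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                time.sleep(retry_delay)   # someone else holds it, poll again
                continue
            try:
                return work(f)            # only one request gets here at a time
            finally:
                fcntl.lockf(f, fcntl.LOCK_UN)
```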
Another easy way is to dump the data into an RDS like MySQL or Postgres; it handles all the file-access headaches caused by concurrent requests (i.e. put the output in a DB instead of a file).

How to handle processing of large file on GAE?

I'm looking for a powerful and fast way to handle processing of large files in Google App Engine.
It works as follows (simplified workflow at the end):
The customer sends a CSV file that our server will treat, line by line.
Once the file is uploaded, an entry is added in the NDB datastore Uploads with the CSV name, the file path (to Google Storage) and some basic information. Then a task called "pre-processing" is created.
The pre-processing task loops over all the lines of the CSV file (could be millions) and adds an NDB entry to the UploadEntries model for each line, with the CSV id, the line, the data to extract/treat, and boolean indicators of whether the line has started and finished processing ("is_treating", "is_done").
Once the pre-processing task has ended, it notifies the client that "XXX lines will be processed".
A call to Uploads.next() is made (a rough sketch follows this workflow). The next method will:
Search for UploadEntries that have is_treating and is_done at false,
Add a task in a Redis datastore for the next line found. (Redis is used because the work here is done on servers not managed by Google.)
Also create a new entry in the task Process-healthcheck (this task runs after 5 minutes and checks that step 7 has been correctly executed; if not, it considers that the Redis/outside server has failed and does the same as step 7, but with an "error" result instead).
Then it updates UploadEntries.is_treating to True for that entry.
The outside server processes the data and returns the results by making a POST request to an endpoint on my server.
That endpoint updates the UploadEntries entry in the datastore (including "is_treating" and "is_done") and calls Uploads.next() to start the next line.
In Uploads.next, when searching for the next entries returns nothing, I consider the file fully treated, and I call the post-process task, which rebuilds the CSV with the treated data and returns it to the customer.
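To make step 5 concrete, here is roughly what that query-and-mark step looks like; the model and field names come from the description above, everything else (JsonProperty, the helper name, the default limit) is illustrative only:

```python
# Hypothetical sketch of the Uploads.next() query, based on the workflow above.
from google.appengine.ext import ndb

class UploadEntries(ndb.Model):
    upload_id = ndb.IntegerProperty()
    line = ndb.IntegerProperty()
    data = ndb.JsonProperty()
    is_treating = ndb.BooleanProperty(default=False)
    is_done = ndb.BooleanProperty(default=False)

def next_entries(limit=5):
    """Fetch up to `limit` lines that are neither being treated nor done."""
    entries = UploadEntries.query(
        UploadEntries.is_treating == False,
        UploadEntries.is_done == False,
    ).fetch(limit)
    for entry in entries:
        entry.is_treating = True
    ndb.put_multi(entries)        # mark them before dispatching to Redis
    return entries
```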
Here are a few things to keep in mind:
The servers that do the real work are outside of Google App Engine; that's why I had to come up with Redis.
The current way of doing things gives me flexibility in the number of entries processed in parallel: in step 5, the Uploads.next() method takes a limit argument that lets me search for n lines to process in parallel. It can be 1, 5, 20, 50.
I can't just add all the lines from the pre-processing task directly to Redis, because then the next customer would have to wait for the first file to finish processing, and this would pile up and take too long.
But this system has various issues, and that's why I'm turning to you for help:
Sometimes this system is so fast that the Datastore is not yet updated correctly, and when Uploads.next() is called, the entries returned are already being processed (it's just that entry.is_treating = True has not yet been pushed to the database).
Redis or my server (I don't really know which) sometimes loses the task, or the POST request after processing is never made, so the task never goes to is_done = True. That's why I had to implement the healthcheck system, to ensure the line is correctly treated no matter what. This has a double advantage: the name of that task contains the CSV id and the line number, making it unique per file. If the Datastore is not up to date and the same task is run twice, creating the healthcheck fails because a task with the same name already exists, which tells me there is a concurrency issue, so I ignore that task because it means the Datastore is not yet up to date.
I initially thought about running the file in one independent process, line by line, but that has the big disadvantage of not being able to run multiple lines in parallel. Moreover, Google limits a task run to 24 hours for dedicated targets (not default), and when the file is really big, it can run for longer than 24 hours.
For information, if it helps, I'm using Python.
And to simplify, here's the workflow I'm trying to achieve in the best way possible:
Process a large file, running multiple parallel processes, one per line.
Send the work to an outside server using Redis. Once done, that outside server returns the result via a POST request to the main server.
The main server then updates the information about that line and moves on to the next line.
I'd really appreciate it if someone had a better way of doing this. I really believe I'm not the first to do this kind of work, and I'm pretty sure I'm not doing it correctly.
(I believe Stack Overflow is the best Stack Exchange site for this kind of question since it's an algorithm question, but it's also possible I didn't see a better one. If so, I'm sorry about that.)
The servers that does the real work are outside of Google AppEngine
Have you considered using Google Cloud Dataflow for processing large files instead?
It is a managed service that will handle the file splitting and processing for you.
Based on initial thoughts, here is an outline process:
Users upload files directly to Google Cloud Storage, using signed URLs or the Blobstore API.
A request from App Engine launches a small Compute Engine instance that initiates a blocking request (BlockingDataflowPipelineRunner) to launch the Dataflow task. (I'm afraid it needs to be a Compute Engine instance because of sandbox and blocking I/O issues.)
When the Dataflow task is finished, the Compute Engine instance is unblocked and posts a message to Pub/Sub.
The Pub/Sub message invokes a webhook on the App Engine service that changes the task's state from 'in progress' to 'complete' so the user can fetch their results.
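For step 3, the notification from the Compute Engine instance could look roughly like this, assuming the google-cloud-pubsub client library; the project id, topic name and upload_id attribute are placeholders:

```python
# Runs on the Compute Engine instance once the blocking Dataflow pipeline
# returns; publishes a "done" message that triggers the App Engine webhook.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "dataflow-done")  # placeholders

def notify_done(upload_id):
    future = publisher.publish(topic_path, b"done", upload_id=str(upload_id))
    future.result()  # block until Pub/Sub has accepted the message
```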

Python - Gathering statistics from different scripts

I've written a program in Python that takes data (any kind), organizes it, and sends it to a remote service for visualization. The API just needs a configuration file that specifies what types of data it should expect, and then it's ready to go.
I then have lots of different programs that all report some form of statistics during their execution. What I want is to connect them to my API without having to modify the programs themselves, and this is proving tricky.
It's possible to just instantiate an API object in each class and then call the methods it provides; however, this is too much of a hassle - you should be able to use it without having the API classes in your own repository.
One solution we came up with was to simply print all of the output to a file and have the API listen to that file - once it detected something new, it would read it and send it off. However, this is too slow for our needs, as some of the programs produce vast amounts of data very quickly.
Is there any other solution to my problem?
Thanks!

Suggestions for a daemon that accepts zip files for processing

I'm looking to write a daemon that:
reads a message from a queue (SQS, RabbitMQ, whatever...) containing a path to a zip file
updates a record in the database saying something like "this job is processing"
reads the aforementioned archive's contents and inserts a row into a database, with information culled from the file metadata, for each file found
copies each file to S3
deletes the zip file
marks the job as "complete"
reads the next message in the queue, repeat
This should run as a service, initiated by a message queued when someone uploads a file via the web frontend. The uploader doesn't need to see the results immediately, but the upload should be processed in the background fairly expediently.
I'm fluent with Python, so the very first thing that comes to mind is writing a simple server with Twisted to handle each request and carry out the process mentioned above. But I've never written anything like this that would run in a multi-user context. It's not going to service hundreds of uploads per minute or hour, but it would be nice if it could handle several at a time, reasonably. I'm also not terribly familiar with writing multi-threaded applications and dealing with issues like blocking.
How have people solved this in the past? What are some other approaches I could take?
Thanks in advance for any help and discussion!
I've used Beanstalkd as a queueing daemon to very good effect (some near-time processing and image resizing - over 2 million jobs so far in the last few weeks). Throw a message into the queue with the zip filename (maybe from a specific directory) - I serialise a command and parameters as JSON - and when you reserve the message in your worker client, no one else can get it unless you allow it to time out (at which point it goes back to the queue to be picked up again).
The rest is the unzipping and uploading to S3, for which there are other libraries.
If you want to handle several zip files at once, run as many worker processes as you want.
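A sketch of that reserve/delete loop using the beanstalkc client; the host/port, tube name, JSON message shape and the handle_zip helper are all assumptions:

```python
# Worker process: reserve a job (nobody else can take it while reserved),
# do the work, then delete it; on failure, release it back for a retry.
import json
import beanstalkc

queue = beanstalkc.Connection(host="localhost", port=11300)
queue.watch("zip-jobs")

while True:
    job = queue.reserve()                 # blocks until a message is available
    msg = json.loads(job.body)            # e.g. {"zip_path": "...", "job_id": 7}
    try:
        handle_zip(msg["zip_path"])       # unzip, insert rows, copy to S3, clean up
        job.delete()                      # done: remove it from the queue
    except Exception:
        job.release(delay=30)             # put it back to retry later
```

Running several copies of this script is the "as many worker processes as you want" part.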
I would avoid doing anything multi-threaded and instead use the queue and the database to synchronize as many worker processes as you care to start up.
For this application, I think Twisted or any framework for creating server applications is going to be overkill.
Keep it simple. The Python script starts up, checks the queue, does some work, checks the queue again. If you want a proper background daemon, you might want to just make sure you detach from the terminal as described here: How do you create a daemon in Python?
Add some logging, and maybe a try/except block to email failures to you.
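In the spirit of that "keep it simple" approach, a bare-bones loop might look like this; fetch_next_job() and process() are placeholders for your queue read and zip handling:

```python
# Single-process worker: poll the queue, do the work, log failures, repeat.
import logging
import time

logging.basicConfig(filename="worker.log", level=logging.INFO)

def run_forever(poll_interval=5):
    while True:
        job = fetch_next_job()          # e.g. read one message from SQS/RabbitMQ
        if job is None:
            time.sleep(poll_interval)   # nothing queued, back off briefly
            continue
        try:
            process(job)                # unzip, record rows, copy to S3, clean up
        except Exception:
            logging.exception("job %r failed", job)
```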
I opted to use a combination of Celery (http://ask.github.com/celery/introduction.html), RabbitMQ, and a simple Django view to handle uploads. The workflow looks like this:
The Django view accepts and stores the upload.
A Celery task is dispatched to process the upload; all work is done inside the task.
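Roughly the shape of that setup; the names (process_zip, handle_upload, save_upload) are illustrative, not the actual code:

```python
# tasks.py - the Celery task that does all the heavy lifting
from celery import shared_task

@shared_task
def process_zip(upload_id):
    # open the stored zip, record its contents in the DB, push files to S3,
    # then mark the job "complete"
    ...

# views.py - the Django view only stores the upload and enqueues the task
from django.http import JsonResponse
from .tasks import process_zip

def handle_upload(request):
    upload = save_upload(request.FILES["archive"])   # placeholder model helper
    process_zip.delay(upload.id)                     # all real work happens in the task
    return JsonResponse({"status": "queued", "id": upload.id})
```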
