I'm looking to write a daemon that:
reads a message from a queue (SQS, RabbitMQ, whatever...) containing a path to a zip file
updates a record in the database saying something like "this job is processing"
reads the archive's contents and, for each file found, inserts a row into the database with information culled from the file's metadata
copies each file to S3
deletes the zip file
marks the job as "complete"
reads the next message in the queue, and repeats
This should run as a service, initiated by a message queued when someone uploads a file via the web frontend. The uploader doesn't need to see the results immediately, but the upload should be processed in the background fairly promptly.
I'm fluent with Python, so the first thing that comes to mind is writing a simple server with Twisted to handle each request and carry out the process described above. But I've never written anything like this that would run in a multi-user context. It's not going to service hundreds of uploads per minute or hour, but it would be nice if it could handle several at a time, within reason. I'm also not terribly familiar with writing multi-threaded applications and dealing with issues like blocking.
How have people solved this in the past? What are some other approaches I could take?
Thanks in advance for any help and discussion!
I've used Beanstalkd as a queueing daemon to very good effect (some near-time processing and image resizing - over 2 million so far in the last few weeks). Throw a message into the queue with the zip filename (maybe from a specific directory) [I serialise a command and parameters in JSON], and when you reserve the message in your worker-client, no one else can get it, unless you allow it to time out (when it goes back to the queue to be picked up).
The rest is the unzipping and uploading to S3, for which there are other libraries.
If you want to handle several zip files at once, run as many worker processes as you want.
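As a rough sketch of what each worker's loop might look like with the beanstalkc client (the tube name and the handle_zip helper are assumptions, not part of the original answer):

import json
import beanstalkc

def handle_zip(path):
    # placeholder for the real work: unzip, insert DB rows, copy files to S3
    ...

# Connect to the beanstalkd daemon (default port 11300) and watch a tube.
queue = beanstalkc.Connection(host='localhost', port=11300)
queue.watch('zip-jobs')  # 'zip-jobs' is a hypothetical tube name

while True:
    job = queue.reserve()                # blocks until a job is available; no one else can take it
    try:
        payload = json.loads(job.body)   # e.g. {"command": "process_zip", "path": "/uploads/foo.zip"}
        handle_zip(payload['path'])
        job.delete()                     # done, remove it from the queue for good
    except Exception:
        job.release(delay=30)            # put it back so a worker can retry later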
I would avoid doing anything multi-threaded and instead use the queue and the database to synchronize as many worker processes as you care to start up.
For this application I think twisted or any framework for creating server applications is going to be overkill.
Keep it simple. Python script starts up, checks the queue, does some work, checks the queue again. If you want a proper background daemon you might want to just make sure you detach from the terminal as described here: How do you create a daemon in Python?
Add some logging, maybe a try/except block to email out failures to you.
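For instance, a bare-bones polling loop against SQS could look roughly like this (boto3, the queue URL, and the message format are assumptions; the answer doesn't prescribe a specific queue):

import json
import logging
import time

import boto3

logging.basicConfig(filename='worker.log', level=logging.INFO)
sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/zip-jobs'  # hypothetical

def process(path):
    # placeholder: unzip, record metadata, upload to S3, mark the job complete
    ...

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get('Messages', []):
        try:
            body = json.loads(msg['Body'])
            process(body['zip_path'])
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])
        except Exception:
            logging.exception('failed to process message %s', msg.get('MessageId'))
    time.sleep(1)  # small pause when the queue is empty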
I opted to use a combination of Celery (http://ask.github.com/celery/introduction.html), RabbitMQ, and a simple Django view to handle uploads. The workflow looks like this:
The Django view accepts and stores the upload.
A Celery Task is dispatched to process the upload. All work is done inside the Task.
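A condensed sketch of that wiring might look like the following (the task, model, and field names are illustrative, not the poster's actual code):

# tasks.py
from celery import shared_task

@shared_task
def process_upload(upload_id):
    # look up the Upload row, unzip the archive, record per-file metadata,
    # push files to S3, then mark the job complete
    ...

# views.py
from django.http import JsonResponse
from .models import Upload          # assumed model with a FileField named 'archive'
from .tasks import process_upload

def upload_view(request):
    upload = Upload.objects.create(archive=request.FILES['archive'])
    process_upload.delay(upload.id)  # dispatch to the Celery worker via RabbitMQ
    return JsonResponse({'status': 'queued', 'id': upload.id})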
Related
I'm currently writing an application in Python that mainly works with files. The algorithm works like this:
A user submits a file through an API, imagine a post request with the file and some data.
Then the program works with the file and extracts some conclusions.
After that, those conclusions are stored inside a DB.
Then the user is able to query the db and ask for conclusions.
A user can submit a file through the API, this can happen simultaneously for many users on many systems, and processing a file may take some time. So I want to explore implementing a work queue such that:
Only one file can be processed at a time: when a user submits a file, it is placed in a work queue and has to wait its turn before entering the "processing function".
How can I do that, any reference or tutorial?
Thanks
Check out Celery; there are many good tutorials online.
It works with workers, so it doesn't block your API from listening for requests.
It also gives you the option to process multiple files concurrently if you want.
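If you do want the "one file at a time" behaviour from the question, one way to get it with Celery (this is a sketch, not something the answer prescribes; the broker URL and task body are assumptions) is to define the processing as a task and run a single worker:

# tasks.py
from celery import Celery

app = Celery('files', broker='amqp://localhost')  # assumed RabbitMQ broker

@app.task
def analyse_file(path):
    # open the file, extract the conclusions, store them in the DB
    ...

The API view then just calls analyse_file.delay(saved_path) and returns immediately; starting the worker with "celery -A tasks worker --concurrency=1" means jobs queue up and are processed strictly one at a time, while a higher concurrency lets several files be processed at once.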
I am currently developing a data processing web server (Linux) using Python Flask.
The general work flow is:
Get an input file from the user (handled by python flask)
Flask passes this input file to a java program
Java program processes this input file, saves the outputs (multiple files) on the server.
Flask calls another Python script which processes these outputs to get the final result and returns the result to the client.
The problem is that between step 3 and step 4 there are intermediate files. This would not have been a problem at all if this were a local program, but as a server program, when more than one client accesses it at the same time, a client could get unexpected results generated from input provided by another user who is using the web program at the same moment.
The way I see it, this is a mutual exclusion problem on file access. I have dealt with mutual exclusion problems on threads before, solving some of them with thread locks such as synchronized blocks in Java and Lock in Python, but I am not sure what to do when it comes to files instead of threads.
It occurred to me that maybe I can spawn different copies of the files for different clients. But as I understand it, HTTP is stateless, so you can't really know who is accessing the server. I don't want to add a login system and a user database just for this, as I sense there is a much simpler and better way to resolve the problem.
I have been looking for a good solution these days but haven't found an ideal one so I am looking for some advice here. Any suggestions will be highly appreciated. If you can suggest a viable solution, please feel free to provide me with your name so I can add you to the thank list of digital and paper publications about this tool when it's published.
As a systems kind of person, I suggest something like this:
https://docs.python.org/3/library/fcntl.html#fcntl.lockf
This is how I would solve it. There are many ways to solve this problem, and which one is best is of course up for debate.
Assume the output file is where the conflict happens.
You lock the file and keep polling until the resource is released (the user has to wait), so you force one user to access the file at a time. Poll with time.sleep for around 2-3 seconds, and add a try/except. With the lock held on the output file, only once the resource is released will the next user's process pass through normally.
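A minimal sketch of that idea with fcntl.lockf (the file name and the retry interval are assumptions):

import fcntl
import time

def write_output_exclusively(data, path='output.dat'):
    # Keep trying until we can take an exclusive lock on the output file.
    while True:
        with open(path, 'a') as f:
            try:
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking attempt
            except OSError:
                time.sleep(2)              # someone else holds the lock; poll again shortly
                continue
            f.write(data)                  # we hold the lock: safe to write
            fcntl.lockf(f, fcntl.LOCK_UN)  # release so the next request can proceed
            return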
Another easy way is to dump the data into an RDS like MySQL or Postgres; it will handle all the file-access nightmares caused by concurrent requests (i.e., put the output file in a DB).
I have a computationally intensive program doing calculations that I intend to parallelise. It is written in Python and I hope to use the multiprocessing module. I would like some help understanding what I would need to do to have one program, run from my laptop, controlling the entire process.
I have two options in terms of what computers I can use. One is computers which I can access through ssh user@comp1.com from the terminal (I'm not sure how to access them through Python) and run the instances there, although I'd like a more programmatic way to get to them than that. It seems that if I ran a remote-manager type of application it would work?
The second option I was thinking of is utilising AWS EC2 servers (I think that is what I need). I found boto, which I've never used but which seems to provide an interface for controlling AWS. I feel I would then need something to actually distribute jobs on AWS, probably along the lines of option 1 (?). I'm a bit in the dark here.
EDIT:
To give you an idea of how parallelisable it is:
res = []
for param in Parameters:
    res.append(FunctionA(param))
Parameters2 = FunctionB(res)

res2 = []
for param in Parameters2:
    res2.append(FunctionC(param))
return res, res2
So the two loops are basically where I can send off many param values to be run in parallel and I know how to recombine them to create res as long as I know which param they came from. Then I need to group them all together to get Parameters2 and then the second part is again parallelisable.
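For reference, on a single machine those two loops map almost directly onto multiprocessing.Pool.map (FunctionA, FunctionB, and FunctionC stand in for the real functions above; pool.map preserves input order, so each result still matches its param):

from multiprocessing import Pool

def run(Parameters):
    with Pool() as pool:                             # one worker per CPU core by default
        res = pool.map(FunctionA, Parameters)        # first parallel loop
        Parameters2 = FunctionB(res)                 # serial recombination step
        res2 = pool.map(FunctionC, Parameters2)      # second parallel loop
    return res, res2

Distributing the same structure across several machines (ssh hosts or EC2 instances) is what the answers below address.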
You would want to use the multiprocessing module only if you want the processes to share data in memory. That is something I would recommend ONLY if you absolutely have to have shared memory due to performance considerations. Python multiprocessing applications are non-trivial to write and debug.
If you are doing something like the distributed.net or SETI@home projects, where even though the tasks are computationally intensive they are reasonably isolated, you can follow this process:
Create a master application that would break down the large task into smaller computation chunks (assuming that the task can be broken down and the results then can be combined centrally).
Create Python code that takes a task from the server (perhaps as a file or some other one-time communication with instructions on what to do), and run multiple copies of these Python processes.
These Python processes will work independently of each other, process data, and then return the results to the master process for collation.
You could run these processes on AWS single-core instances if you wanted, or use your laptop to run as many copies as you have cores to spare.
EDIT: Based on the updated question
So your master process will create files (or some other data structures) that hold the parameter info, one for each set of params to process. These files will be stored in a shared folder called needed-work.
Each Python worker (on an AWS instance) will watch the needed-work shared folder for available files to work on (or wait on a socket for the master process to assign a file to it).
The Python process that takes on a file that needs work will work on it and store the result in a separate shared folder, with the parameter as part of the file structure.
The master process will look at the files in the work-done folder, process these files, and generate the combined response.
This whole solution could be implemented with sockets as well, where workers listen on sockets for the master to assign work to them, and the master waits on a socket for the workers to submit responses.
The file-based approach requires a way for workers to make sure the work they pick up is not also taken by another worker. This could be handled by having separate work folders for each worker, with the master process deciding when a worker needs more work.
Workers could delete files that they pick up from the work folder and master process could keep a watch on when a folder is empty and add more work files to it.
Again, it is more elegant to do this using sockets if you are comfortable with that.
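As a rough illustration of the file-based variant (the folder names, the JSON format, and the compute placeholder are assumptions; the rename trick is one simple way to keep two workers from claiming the same file):

import glob
import json
import os
import time

NEEDED = '/shared/needed-work'   # master drops one JSON file per parameter set here
DONE = '/shared/work-done'       # workers write results here under the same base name

def compute(params):
    # placeholder for the real per-parameter work
    ...

def worker_loop():
    while True:
        for path in glob.glob(os.path.join(NEEDED, '*.json')):
            try:
                # crude claim: only one worker's rename can succeed for a given file
                os.rename(path, path + '.claimed')
            except OSError:
                continue
            with open(path + '.claimed') as f:
                params = json.load(f)
            result = compute(params)
            with open(os.path.join(DONE, os.path.basename(path)), 'w') as f:
                json.dump(result, f)
        time.sleep(1)            # nothing to do; poll again shortly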
This seems like a simple question, but I am having trouble finding the answer.
I am making a web app which would require the constant running of a task.
I'll use sites like Pingdom or Twitterfeed as an analogy. As you may know, Pingdom checks uptime, so it is constantly checking websites to see if they are up, and Twitterfeed checks RSS feeds to see if they've changed and then tweets the changes. I too need to run a simple script that cycles through URLs in a database and performs an action on them.
My question is: how should I implement this? I am familiar with cron, currently using it to do my server backups. Would this be the way to go?
I know how to make a Python script which runs indefinitely, starting back at the beginning with the next URL in the database when I'm done. Should I just run that on the server? How will I know it is always running and doesn't crash or something?
I hope this question makes sense and I hope I am not repeating someone else or anything.
Thank you,
Sam
Edit: To be clear, I need the task to run constantly. As in, check URL 1 in the database, check URL 2, check URL 3, and when it reaches the last one, go right back to the beginning. Thanks!
If you need a task to run repeatedly and it can be run from the command line, that's what cron is ideal for.
I don't see any downsides to this approach.
Update:
Okay, I now see the issue somewhat differently. There are several solutions:
Run the cron task at set intervals, letting it process some of the data each run and the rest on subsequent runs; use PID files/the database/semaphores to avoid parallel processes (a minimal PID-lockfile sketch follows below).
Update the processes that insert/update data in the database, so the information is processed as it is inserted/updated.
Write a daemon process which resides in memory and checks the data in real time.
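For the first option, a minimal PID-lockfile guard at the top of the cron script could look like this (the lock path is an assumption):

import os
import sys

LOCK = '/tmp/url-checker.pid'    # hypothetical lock file location

def already_running():
    try:
        with open(LOCK) as f:
            pid = int(f.read().strip())
        os.kill(pid, 0)          # signal 0 only checks whether the process exists
        return True
    except (ValueError, OSError):
        return False             # no lock file, junk contents, or a stale PID

if already_running():
    sys.exit(0)                  # a previous cron run is still working; skip this one

with open(LOCK, 'w') as f:
    f.write(str(os.getpid()))
try:
    pass                         # ... check each URL in the database here ...
finally:
    os.remove(LOCK)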
cron would definitely be a way to go with this, as well as any other task scheduler you may prefer.
The main point is found in the title to your question:
Run a repeating task for a web app
The background task and the web application should be kept separate. They can share code, they can share access to a database, but they should be separate and discrete application contexts. (Consider them as separate UIs accessing the same back-end logic.)
The main reason for this is because web applications and background processes are architecturally very different and aren't meant to be mixed. Consider the structure of a web application being held within a web server (Apache, IIS, etc.). When is the application "running"? When it is "on"? It's not really a running task. It's a service waiting for input (requests) to handle and generate output (responses) and then go back to waiting.
Web applications are for responding to requests. Scheduled tasks or daemon jobs are for running repeated processes in the background. Keeping the two separate will make your management of the two a lot easier.
My question is: which python framework should I use to build my server?
Notes:
This server talks HTTP with its clients: GET and POST (via pyAMF)
Clients "submit" "tasks" for processing and, then, sometime later, retrieve the associated "task_result"
Submit and retrieve might be separated by days (different HTTP connections)
The "task" is a lump of XML describing a problem to be solved, and a "task_result" is a lump of XML describing an answer.
When a server gets a "task", it queues it for processing
The server manages this queue and, when tasks get to the top, organises that they are processed.
The processing is performed by a long-running (15 minutes?) external program (via subprocess) which is fed the task XML and produces a "task_result" lump of XML, which the server picks up and stores (for later client retrieval).
It serves a couple of basic HTML pages showing the queue and processing status (admin purposes only).
I've experimented with twisted.web, using SQLite as the database and threads to handle the long running processes.
But I can't help feeling that I'm missing a simpler solution. Am I? If you were faced with this, what technology mix would you use?
I'd recommend using an existing message queue. There are many to choose from (see below), and they vary in complexity and robustness.
Also, avoid threads: let your processing tasks run in a different process (why do they have to run in the webserver?)
By using an existing message queue, you only need to worry about producing messages (in your webserver) and consuming them (in your long running tasks). As your system grows you'll be able to scale up by just adding webservers and consumers, and worry less about your queuing infrastructure.
Some popular python implementations of message queues:
http://code.google.com/p/stomper/
http://code.google.com/p/pyactivemq/
http://xph.us/software/beanstalkd/
I'd suggest the following. (Since it's what we're doing.)
A simple WSGI server (wsgiref or werkzeug). The HTTP requests coming in will naturally form a queue. No further queueing needed. You get a request, you spawn the subprocess as a child and wait for it to finish. A simple list of children is about all you need.
I used a modification of the main "serve forever" loop in wsgiref to periodically poll all of the children to see how they're doing.
A simple SQLite database can track request status. Even this may be overkill, because your XML inputs and results can just lie around in the file system.
That's it. Queueing and threads don't really enter into it. A single long-running external process is too complex to coordinate. It's simplest if each request is a separate, stand-alone, child process.
If you get immense bursts of requests, you might want a simple governor to prevent creating thousands of children. The governor could be a simple queue, built using a list with append() and pop(). Every request goes in, but only requests that fit within some "max number of children" limit are taken out.
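A skeletal version of that pattern might look like this (the solver command, the save_task_xml helper, and the paths are placeholders; for brevity it rejects requests when full rather than holding them in a list):

import subprocess
from wsgiref.simple_server import make_server

MAX_CHILDREN = 4
children = []                    # (Popen, task_id) pairs we poll periodically

def save_task_xml(environ):
    # placeholder: write the incoming task XML to disk and return its id
    ...

def reap_children():
    # drop any child that has finished; its XML result is already on disk
    children[:] = [(p, tid) for p, tid in children if p.poll() is None]

def app(environ, start_response):
    reap_children()
    if environ['REQUEST_METHOD'] == 'POST' and len(children) < MAX_CHILDREN:
        task_id = save_task_xml(environ)
        proc = subprocess.Popen(['solver', 'tasks/%s.xml' % task_id])  # long-running external program
        children.append((proc, task_id))
        start_response('202 Accepted', [('Content-Type', 'text/plain')])
        return [b'queued']
    start_response('503 Service Unavailable', [('Content-Type', 'text/plain')])
    return [b'busy, try again later']

if __name__ == '__main__':
    make_server('', 8000, app).serve_forever()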
My reaction is to suggest Twisted, but you've already looked at this. Still, I stick by my answer. Without knowing your personal pain points, I can at least share some things that helped me reduce almost all of the deferred madness that arises when you have several dependent, blocking actions to perform for a client.
Inline callbacks (lightly documented here: http://twistedmatrix.com/documents/8.2.0/api/twisted.internet.defer.html) provide a means to make long chains of deferreds much more readable (to the point of looking like straight-line code). There is an excellent example of the complexity reduction this affords here: http://blog.mekk.waw.pl/archives/14-Twisted-inlineCallbacks-and-deferredGenerator.html
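For example, a chain of dependent deferred steps reads almost like straight-line code (the store_task/run_solver/save_result names are stand-ins, stubbed here so the sketch is self-contained):

from twisted.internet import defer

def store_task(xml):
    return defer.succeed(1)              # stub standing in for a real DB insert

def run_solver(job_id):
    return defer.succeed('<result/>')    # stub standing in for the external program

def save_result(job_id, result):
    return defer.succeed(None)

@defer.inlineCallbacks
def handle_task(task_xml):
    job_id = yield store_task(task_xml)  # each of these yields waits on a Deferred
    result = yield run_solver(job_id)
    yield save_result(job_id, result)
    defer.returnValue(job_id)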
You don't always have to get your bulk processing to integrate nicely with Twisted. Sometimes it is easier to break a large piece of your program off into a stand-alone, easily testable/tweakable/implementable command line tool and have Twisted invoke this tool in another process. Twisted's ProcessProtocol provides a fairly flexible way of launching and interacting with external helper programs. Furthermore, if you suddenly decide you want to cloudify your application, it is not all that big of a deal to use a ProcessProtocol to simply run your bulk processing on a remote server (random EC2 instances perhaps) via ssh, assuming you have the keys setup already.
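A bare-bones ProcessProtocol for launching such a helper tool might look like this ('bulk_tool' and its argument are placeholders for your own command line program):

from twisted.internet import protocol, reactor

class BulkToolProtocol(protocol.ProcessProtocol):
    def __init__(self):
        self.output = []

    def outReceived(self, data):
        self.output.append(data)         # collect whatever the tool prints to stdout

    def processEnded(self, reason):
        print('tool finished:', reason.value)
        reactor.stop()

reactor.spawnProcess(BulkToolProtocol(), 'bulk_tool',
                     args=['bulk_tool', 'input.xml'])
reactor.run()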
You can have a look at Celery.
It seems any python web framework will suit your needs. I work with a similar system on a daily basis and I can tell you, your solution with threads and SQLite for queue storage is about as simple as you're going to get.
Assuming order doesn't matter in your queue, threads should be acceptable. It's important to make sure you don't create race conditions with your queues or, for example, have two jobs of the same type running simultaneously. If that is a concern, I'd suggest a single-threaded application that works through the items in the queue one by one.