Python Service File Caching Apache Race Condition

I am writing a python service (pyamf) through which a user can access images. All images are stored on a central server. The python services will be running on satellite machines which have network access to server. The service should work as follows:
1. Check locally to see if the file exists; if so, use it.
2. Check locally to see if the file is currently being transferred from the server (file.part exists and its size is changing). If so, wait for the download to finish, then use the file.
3. If the file does not exist and is not being downloaded, download the file via urlretrieve.
The problem is with Apache's multiple threads. Threads are reaching the file presence check at the same time and therefore they all think the file needs to be downloaded. Needless to say, this is not good.
What is the right way to handle this race condition?
Thanks!

I'm guessing it's either a threaded or a forked Apache, but the effect would be the same since they are accessing a remote resource.
This problem is sometimes called the "dog pile" problem, and it's one of the issues addressed by the Beaker caching library (http://beaker.groovie.org). It provides a system by which you can create a callable that "creates" a new cached value, in this case a URL corresponding to some image that is fetched, if a value doesn't already exist. Locking is used so that concurrent threads or processes wait for the single process elected as the "creator" to finish what it's doing. Beaker will use lockfiles if configured on a Unix-like, multi-process-oriented system, or mutexes on a Windows system.
I'm the original author of Beaker's guts along with Ben Bangert who packaged it up for usage with Pylons.
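The lockfile idea described above can be sketched by hand with the standard library. This is not Beaker's API; it is a minimal illustration that assumes a Unix-like system (it uses `fcntl.flock`), and the names `get_cached_file` and `fetch` are hypothetical. The caller supplies `fetch(dest_path)`, which could wrap `urlretrieve`.

```python
import fcntl
import os


def get_cached_file(path, fetch):
    """Return `path`, calling `fetch(part_path)` at most once per missing file.

    `fetch` is a caller-supplied function (hypothetical name) that writes the
    downloaded bytes to the path it is given.
    """
    # Fast path: the file is already cached locally.
    if os.path.exists(path):
        return path

    lock_path = path + ".lock"
    with open(lock_path, "w") as lock_file:
        # Blocks until this thread/process is the only lock holder.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            # Re-check under the lock: another worker may have finished
            # the download while we were waiting.
            if not os.path.exists(path):
                part_path = path + ".part"
                fetch(part_path)
                # Atomic rename, so readers never see a half-written file.
                os.replace(part_path, path)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return path
```

The important part is the second existence check made while holding the lock: workers that lose the race wait on the lock, then find the finished file instead of starting their own download.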

Related

Database Management using Python

I want to create a SQLite3 database using python and wanted to know what is the best method of managing it? When I say managing, I mean how to ensure that there is a backup if/when there is some kind of corruption. Is there any way to retrieve it or is it better to have a backup of your whole database? What is a best practice?
TIA!

Based on comments, the database is currently planned to be SQLite3. That narrows down the scope considerably. Essentially, as far as operating system issues (processes, data file access, etc.) are concerned, the database is simply a file created and updated by the Python program directly. This means any backup/restore is normally done by a full copy of the database file(s), either to another location on the same machine (which handles corruption due to program problems but not hardware problems, OS crashes, etc.) or over a network to another local machine or a remote/cloud server of some sort.
If your application needs to run continuously, then you may want to manage backup within the application. If it runs periodically (specific times or via some external trigger) then you can have the backups run on a regular basis outside the application (which, for SQLite3, simply means a script to copy the appropriate file(s)).
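For the in-application case, one possible sketch is shown here. Instead of a raw file copy it uses `sqlite3.Connection.backup` (available in Python 3.7+), which produces a consistent copy even while the source database is open, and then runs `PRAGMA integrity_check` on the copy. The function name and paths are illustrative assumptions.

```python
import sqlite3


def backup_database(source_path, backup_path):
    """Copy a SQLite database to `backup_path`; return True if the copy is sound."""
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(backup_path)
    try:
        # Online backup: copies all pages of the source into the target.
        source.backup(target)
        # integrity_check returns a single row containing 'ok' when no
        # corruption is detected in the copied database.
        (verdict,) = target.execute("PRAGMA integrity_check").fetchone()
    finally:
        target.close()
        source.close()
    return verdict == "ok"
```

A scheduler (cron, or a timer thread inside the application) could call this with a dated `backup_path`, and the resulting file can then be shipped to another machine.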

Python flask how to avoid multi file access at the same time (mutual exclusion on files)?

I am currently developing a data processing web server (Linux) using Python Flask.
The general workflow is:
1. Get an input file from the user (handled by Flask).
2. Flask passes this input file to a Java program.
3. The Java program processes this input file and saves the outputs (multiple files) on the server.
4. Flask calls another Python script, which processes these outputs to get the final result and returns the result to the client.
The problem is: between step 3 and step 4 there exist some intermediate files. This would not be a problem at all if this were a local program, but as a server program, when more than one client accesses it, a client could get an unexpected result generated from input provided by another user who is using the web program at the same time.
As I see it, this is a kind of mutual exclusion problem on file access. I have dealt with mutual exclusion problems on threads before and solved some of them using thread locks, such as synchronization in Java and locks in Python, but I am not sure what to do when it comes to files instead of threads.
It occurred to me that maybe I can spawn different copies of the files for different clients. But as I understand it, HTTP is stateless, so you can't really know who is accessing the server. I don't want to add a login system and a user database to achieve this, as I sense there is a much simpler and better way to resolve this problem.
I have been looking for a good solution these days but haven't found an ideal one so I am looking for some advice here. Any suggestions will be highly appreciated. If you can suggest a viable solution, please feel free to provide me with your name so I can add you to the thank list of digital and paper publications about this tool when it's published.

As a systems person, I would suggest something like this:
https://docs.python.org/3/library/fcntl.html#fcntl.lockf
This is how I would solve it. There are many ways to solve this problem, and which one is best is open to debate.
Assume the output file is where the conflict happens.
Lock the file and keep polling (for example, time.sleep for 2-3 seconds inside a try/except) until the resource is released, so the user has to wait. This forces one user at a time to access the file: the lock is held on the output file only, and once it is released the next user's process passes through normally.
Another easy way is to store the data in a relational database such as MySQL or Postgres, which will handle all the file-access nightmares caused by concurrent requests (put the output file in a DB).
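The lock-and-poll approach could look roughly like the sketch below. It assumes a Unix-like server; `locked_file`, `poll_interval`, and `timeout` are hypothetical names, not part of Flask or `fcntl`. Note that `fcntl.lockf` locks are held per process, so this coordinates separate worker processes rather than threads inside a single process.

```python
import fcntl
import time
from contextlib import contextmanager


@contextmanager
def locked_file(path, poll_interval=2.0, timeout=60.0):
    """Open `path` and hold an exclusive lock on it for the `with` body."""
    deadline = time.monotonic() + timeout
    with open(path, "a+") as handle:
        while True:
            try:
                # Non-blocking attempt; raises OSError if another process
                # currently holds the lock.
                fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except OSError:
                if time.monotonic() >= deadline:
                    raise TimeoutError("could not lock " + path)
                time.sleep(poll_interval)  # wait, then poll again
        try:
            yield handle
        finally:
            # Flush buffered writes before letting the next process in.
            handle.flush()
            fcntl.lockf(handle, fcntl.LOCK_UN)
```

A request handler would wrap steps 3 and 4 in `with locked_file(output_path): ...`, so a second request waits until the first one has finished with the intermediate files.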

Are Heroku instances persistent? (Or, can I use dict/array as a cache?)

So my friend told me that instances on Heroku are persistent (I'm not sure if the vocab is right, but he implied that all users share the same instance).
So, if I have app.py, and an instance runs it, then all users share that instance. That means we can use a dict as a temporary cache for storing small things for faster response time.
So for example, if I'm serving an API, I can maybe define a cache like this and then use it.
How true is that? I tried looking this up, but could not find anything.
I deployed the linked API to heroku on 1 dyno, and with just a few requests per second, it was taking over 100 seconds to serve it. So my understanding is that the cache wasn't working. (It might be useful to note here that majority of time was due to request queueing, according to new relic.)

The Heroku Devcenter has several articles about the Heroku architecture.
Processes don't share memory. Moreover, your code is compiled into a slug and optimized for distribution to the dyno manager. In simple words, it means you don't even know which machine will execute your code. Theoretically, 5 users hitting your app may be routed to 5 different machines and processes.
Last but not least, keep in mind that if your app has only a single web dyno running, that web dyno will sleep. You have to have more than one web dyno to prevent web dynos from sleeping. When the dyno enters sleep mode, the memory is released and you will lose all the data in memory.
This means that your approach will not work.
Generally speaking, in Heroku you should use external storages. For example, you can use the Memcached add-on and store your cache information in Memcached.
Also note you should not use the file system as cache. Not only because it's slower than Memcached, but also because the Cedar stack file system should be considered ephemeral.
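The point that processes don't share memory can be reproduced locally. The sketch below is not Heroku-specific: it assumes a Unix-like system (it uses the `fork` start method), and `CACHE`, `handle_request`, and `simulate_two_workers` are illustrative names. Each worker process fills its own copy of the module-level dict, and neither the other worker nor the parent sees the entry.

```python
import multiprocessing

# Module-level dict used as a naive in-process cache.
CACHE = {}


def handle_request(key, conn):
    """Pretend to serve a request: cache a value, then report this process's view."""
    CACHE[key] = "computed"
    conn.send(dict(CACHE))
    conn.close()


def simulate_two_workers():
    """Run two requests in separate processes and collect each cache view."""
    ctx = multiprocessing.get_context("fork")  # assumption: Unix-like OS
    views = []
    for key in ("a", "b"):
        parent_conn, child_conn = ctx.Pipe()
        worker = ctx.Process(target=handle_request, args=(key, child_conn))
        worker.start()
        views.append(parent_conn.recv())
        worker.join()
    # The parent's dict was never modified by the children.
    return views, dict(CACHE)
```

An external store such as Memcached avoids this because every process talks to the same server instead of keeping a private copy.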

Coordinating distributed Python processes using queuing or REST web service

Server A has a process that exports n database tables as flat files. Server B contains a utility that loads the flat files into a DW appliance database.
A process runs on server A that exports and compresses about 50-75 tables. Each time a table is exported and a file produced, a .flag file is also generated.
Server B has a bash process that repeatedly checks for each .flag file produced by server A. It does this by connecting to A and checking for the existence of a file. If the flag file exists, Server B will scp the file from Server A, uncompress it, and load it into an analytics database. If the file doesn't yet exist, it will sleep for n seconds and try again. This process is repeated for each table/file that Server B expects to be found on Server A. The process executes serially, processing a single file at a time.
Additionally: The process that runs on Server A cannot 'push' the file to Server B. Because of file-size and geographic concerns, Server A cannot load the flat file into the DW Appliance.
I find this process to be cumbersome, and it just so happens to be up for a rewrite/revamp. I'm proposing a messaging-based solution. I initially thought this would be a good candidate for RabbitMQ (or the like) where
Server A would write a file, compress it and then produce a message for a queue.
Server B would subscribe to the queue and would process files named in the message body.
I feel that a messaging-based approach would not only save time as it would eliminate the check-wait-repeat cycle for each table, but also permit us to run processes in parallel (as there are no dependencies).
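The produce/consume shape of this idea can be sketched in a single process with the standard library's `queue.Queue` standing in for the broker. This is not RabbitMQ's API; `run_pipeline`, `load`, and `workers` are hypothetical names, and `load` stands in for the transfer, uncompress, and load step.

```python
import queue
import threading


def run_pipeline(filenames, load, workers=4):
    """Publish one message per file and let `workers` consumers process them."""
    messages = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def consumer():
        while True:
            name = messages.get()  # blocks until a message arrives
            if name is None:  # sentinel: no more work for this consumer
                break
            outcome = load(name)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=consumer) for _ in range(workers)]
    for thread in threads:
        thread.start()

    # Producer side: one message per exported file, no flag-file polling.
    for name in filenames:
        messages.put(name)
    # One sentinel per consumer so every thread shuts down.
    for _ in threads:
        messages.put(None)
    for thread in threads:
        thread.join()
    return results
```

Because consumers block on `get()`, there is no check-wait-repeat loop, and independent files are processed in parallel. Results arrive in completion order, not submission order.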
I showed my team a proof-of-concept using RabbitMQ and they were all receptive to using messaging. A number of them quickly identified other opportunities where we would benefit from message-based processing. One such area would be populating our DW dimensions in real time rather than through batch.
It then occurred to me that an MQ-based solution might be overkill given the low volume (50-75 tasks), especially since our operations team would have to install RabbitMQ (and its dependencies, including Erlang), and it would introduce new administration headaches.
I then realized this could be made simpler with a REST-based solution. Server A could produce a file and then make an HTTP call to a simple (web.py) web service on Server B. Server B could then initiate the transfer-and-load process based on the URL that is called. Given the time that it takes to transfer, uncompress, and load each file, I would likely use Python's multiprocessing to create a subprocess that loads each file.
I'm thinking that the REST-based solution is ideal given the fact that it's simpler. In my opinion, using an MQ would be more appropriate for higher-volume tasks, but we're only talking (for now) 50-75 operations with potentially more to come.
Would REST-based be a good solution given my requirements and volume? Are there other frameworks or OSS products that already do this? I'm looking to add messaging without creating other administration and development headaches.

Message brokers such as Rabbit contain practical solutions for a number of problems:
multiple producers and consumers are supported without risk of duplication of messages
atomicity and unit-of-work logic provide transactional integrity, preventing duplication and loss of messages in the event of failure
horizontal scaling--most mature brokers can be clustered so that a single queue exists on multiple machines
no-rendezvous messaging--it is not necessary for sender and receiver to be running at the same time, so one can be brought down for maintenance without affecting the other
preservation of FIFO order
Depending on the particular web service platform you are considering, you may find that you need some of these features and must implement them yourself if not using a broker. The web service protocols and formats such as HTTP, SOAP, JSON, etc. do not solve these problems for you.
In my previous job the project management passed on using message brokers early on, but later the team ended up implementing quick-and-dirty logic meant to solve some of the same issues as above in our web service architecture. We had less time to provide business value because we were fixing so many concurrency and error-recovery issues.
So while a message broker may seem on its face like a heavyweight solution, and may actually be more than you need right now, it does have a lot of benefits that you may need later without yet realizing it.

As wberry alluded to, a REST or web-hook based solution can be functional but will not be very tolerant to failure. Paying the operations price up front for messaging will pay long term dividends as you will find additional problems which are a natural fit for the messaging model.
Regarding other OSS options: if you are considering stream-based processing in addition to this specific use case, I would recommend taking a look at Apache Kafka. Kafka provides similar messaging semantics to RabbitMQ, but is tightly focused on processing message streams (not to mention that it has been battle tested in production at LinkedIn).

Is there a better way to serve the results of an expensive, blocking python process over HTTP?

We have a web service which serves small, arbitrary segments of a fixed inventory of larger MP3 files. The MP3 files are generated on-the-fly by a python application. The model is, make a GET request to a URL specifying which segments you want, get an audio/mpeg stream in response. This is an expensive process.
We're using Nginx as the front-end request handler. Nginx takes care of caching responses for common requests.
We initially tried using Tornado on the back-end to handle requests from Nginx. As you would expect, the blocking MP3 operation kept Tornado from doing its thing (asynchronous I/O). So, we went multithreaded, which solved the blocking problem, and performed quite well. However, it introduced a subtle race condition (under real world load) that we haven't been able to diagnose or reproduce yet. The race condition corrupts our MP3 output.
So we decided to set our application up as a simple WSGI handler behind Apache/mod_wsgi (still w/ Nginx up front). This eliminates the blocking issue and the race condition, but creates a cascading load (i.e. Apache creates too many processes) on the server under real world conditions. We're working on tuning Apache/mod_wsgi right now, but are still at a trial-and-error phase. (Update: we've switched back to Tornado. See below.)
Finally, the question: are we missing anything? Is there a better way to serve CPU-expensive resources over HTTP?
Update: Thanks to Graham's informed article, I'm pretty sure this is an Apache tuning problem. In the meantime, we've gone back to using Tornado and are trying to resolve the data-corruption issue.
For those who were so quick to throw more iron at the problem, Tornado and a bit of multi-threading (despite the data integrity problem introduced by threading) handles the load acceptably on a small (single core) Amazon EC2 instance.

Have you tried Spawning? It is a WSGI server with a flexible assortment of threading modes.

Are you making the mistake of using embedded mode of Apache/mod_wsgi? Read:
http://blog.dscpl.com.au/2009/03/load-spikes-and-excessive-memory-usage.html
Ensure you use daemon mode if using Apache/mod_wsgi.

You might consider a queuing system with AJAX notification methods.
Whenever there is a request for your expensive resource, and that resource needs to be generated, add that request to the queue (if it's not already there). That queuing operation should return an ID of an object that you can query to get its status.
Next you have to write a background service that spins up worker threads. These workers simply dequeue the request, generate the data, then save the data's location in the request object.
The webpage can make AJAX calls to your server to find out the progress of the generation and to give a link to the file once it's available.
This is how LARGE media sites work - those that have to deal with video in particular. It might be overkill for your MP3 work however.
Alternatively, look into running a couple of machines to distribute the load. Your threads on Apache will still block, but at least you won't consume resources on the web server.
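The queue-and-poll idea in this answer might be sketched as follows, using a thread pool as the background service. `JobRegistry`, `enqueue`, `status`, and `generate` are hypothetical names; `generate(request_key)` stands in for the expensive MP3 step and is assumed to return the location of the generated file.

```python
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor, wait


class JobRegistry:
    """Queue expensive requests once and expose a pollable status per job ID."""

    def __init__(self, generate, max_workers=2):
        self._generate = generate
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._ids_by_request = {}  # request key -> job ID (deduplication)
        self._futures = {}  # job ID -> Future for the running work

    def enqueue(self, request_key):
        """Queue the request unless it is already queued; return its job ID."""
        with self._lock:
            job_id = self._ids_by_request.get(request_key)
            if job_id is None:
                job_id = uuid.uuid4().hex
                self._ids_by_request[request_key] = job_id
                self._futures[job_id] = self._pool.submit(
                    self._generate, request_key
                )
            return job_id

    def status(self, job_id):
        """What a polling (AJAX) endpoint would return for this job."""
        future = self._futures[job_id]
        if not future.done():
            return {"state": "pending"}
        if future.exception() is not None:
            return {"state": "failed"}
        return {"state": "done", "location": future.result()}

    def wait(self, job_id, timeout=None):
        """Block until the job finishes (handy for tests and scripts)."""
        wait([self._futures[job_id]], timeout=timeout)
```

A web handler would call `enqueue` when a segment is requested and return the job ID; the page then polls an endpoint backed by `status` until it reports a location.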

Please define "cascading load", as it has no common meaning.
Your most likely problem is going to be if you're running too many Apache processes.
For a load like this, make sure you're using the prefork mpm, and make sure you're limiting yourself to an appropriate number of processes (no less than one per CPU, no more than two).

It looks like you are doing things right, just lacking CPU power: can you determine what the CPU load is while generating these MP3s?
I think the next thing you have to do is add more hardware to render the MP3s on other machines. Either that, or find a way to deliver pre-rendered MP3s (maybe you can cache some of your media?).
BTW, scaling for the web was the theme of a keynote lecture by Jacob Kaplan-Moss at PyCon Brasil this year, and it is far from being a closed problem. The stack of technologies one needs to handle is quite impressive (I could not find an online copy of the presentation, though; sorry for that).
