Our Python application needs to run on new hardware with write limitations.
I was asked to reduce writes as much as possible.
After lowering the verbosity and reducing the number of log messages, I was thinking of creating a cache mechanism that buffers log messages and writes them once a day in one transaction.
How would I do that?
I was thinking of inheriting and overriding the _log() method, but I'm not sure how to write "one" message out of all the buffered messages while making the log file still appear to contain many separate messages.
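One possible way to get this without touching _log() is a custom handler built on the standard library's logging.handlers.BufferingHandler. The sketch below keeps records in memory and, on flush(), formats each one on its own line and appends the whole batch in a single write call. The class name DailyBatchHandler and the capacity value are made up for illustration, and the once-a-day trigger (a timer thread or scheduled job calling flush()) is assumed to live elsewhere. Buffered messages are lost if the process dies before a flush.

```python
import logging
import logging.handlers


class DailyBatchHandler(logging.handlers.BufferingHandler):
    """Hypothetical handler: buffer records in RAM, append them in one batch."""

    def __init__(self, path, capacity=100_000):
        # A full buffer (capacity records) also triggers a flush.
        super().__init__(capacity)
        self.path = path

    def flush(self):
        self.acquire()
        try:
            if not self.buffer:
                return
            # One line per record, so the file still looks like a normal log.
            text = "".join(self.format(record) + "\n" for record in self.buffer)
            with open(self.path, "a", encoding="utf-8") as log_file:
                log_file.write(text)
            self.buffer.clear()
        finally:
            self.release()
```

Attach it with logger.addHandler(DailyBatchHandler("app.log")) and call handler.flush() from whatever runs once a day. Timestamps in the file are the original record times, because they are taken from each record when it is formatted.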
Related
I am currently developing a data-processing web server (Linux) using Python Flask.
The general work flow is:
1. Get an input file from the user (handled by Python Flask).
2. Flask passes this input file to a Java program.
3. The Java program processes this input file and saves the outputs (multiple files) on the server.
4. Flask calls another Python script, which processes these outputs to get the final result and returns the result to the client.
The problem is: between step 3 and step 4 there are some intermediate files. This would not be a problem at all for a local program, but on a server, when more than one client accesses the program, a client could get an unexpected result generated from input provided by another user who is using the web program at the same time.
As I see it, this is a kind of mutual exclusion problem on file access. I have had mutual exclusion problems with threads before and solved some of them using thread locks, such as synchronization in Java and locks in Python, but I am not sure what to do when it comes to files instead of threads.
It occurred to me that maybe I can spawn different copies of the files for different clients. But as I understand it, HTTP is stateless, so you can't really know who is accessing the server. I don't want to add a login system and a user database for this purpose, as I sense there is a much simpler and better way to resolve this problem.
I have been looking for a good solution these days but haven't found an ideal one so I am looking for some advice here. Any suggestions will be highly appreciated. If you can suggest a viable solution, please feel free to provide me with your name so I can add you to the thank list of digital and paper publications about this tool when it's published.
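The "different copies of files" idea from the question does not need to know who the client is: each request can simply get its own scratch directory. A minimal sketch, assuming the Java step and the post-processing script can both be pointed at an arbitrary working directory. The names process_upload and run_pipeline are hypothetical; run_pipeline stands in for steps 2 to 4.

```python
import os
import tempfile


def process_upload(input_bytes, run_pipeline):
    """Run one request inside its own temporary directory.

    run_pipeline(input_path, workdir) is a hypothetical callable that runs
    the Java program and the post-processing script, keeping all
    intermediate files inside workdir.
    """
    with tempfile.TemporaryDirectory(prefix="job-") as workdir:
        input_path = os.path.join(workdir, "input.dat")
        with open(input_path, "wb") as input_file:
            input_file.write(input_bytes)
        # The directory and everything in it is removed when the block exits.
        return run_pipeline(input_path, workdir)
```

Because every request works in a directory with a unique name, two concurrent requests never see each other's intermediate files, and no locking is needed.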
As a systems person, I suggest something like this:
https://docs.python.org/3/library/fcntl.html#fcntl.lockf
This is how I would solve it. There are many ways to solve this problem, and which one is best is of course up for debate.
Assume the output file is where the conflict happens.
You lock the file and keep polling until the resource is released (the user needs to wait), so you force one user to access the file at a time. Poll with time.sleep every 2-3 seconds, inside a try/except. The lock is held on the output file only; when the resource is released, the next user's process passes through normally.
Another easy way is to put the data in a relational database such as MySQL or PostgreSQL, which will handle the file-access nightmare caused by concurrent requests (store the output files in the DB).
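The lock-and-poll approach described above could look roughly like this. It is a sketch only: with_file_lock, the poll interval and the timeout are illustrative names and values, not part of fcntl. Note that fcntl is Unix-only, the lock is advisory (every writer has to use it), and lockf locks belong to the process, so this serializes separate worker processes rather than threads inside one process.

```python
import fcntl
import time


def with_file_lock(lock_path, action, poll_seconds=2.0, timeout=60.0):
    """Run action() while holding an exclusive lock on lock_path."""
    deadline = time.monotonic() + timeout
    with open(lock_path, "w") as lock_file:
        while True:
            try:
                # Non-blocking attempt; raises OSError if another process holds it.
                fcntl.lockf(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except OSError:
                if time.monotonic() >= deadline:
                    raise TimeoutError("could not lock " + lock_path)
                time.sleep(poll_seconds)
        try:
            return action()
        finally:
            fcntl.lockf(lock_file, fcntl.LOCK_UN)
```

A request handler would wrap steps 3 and 4 in the action callable, so only one process at a time touches the shared output files.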
I'm working on a Django web app. The app includes messages that will self-delete after a certain amount of time. I'm using timezone.now() as the sent time, and the user inputs a timedelta for how long the message should be displayed. I'm checking whether the message should delete itself by checking if the current time is after the sent time plus the timedelta. Will this place a heavy load on the server? How frequently will it automatically check? Is there a way that I can tell it to check once a minute (or otherwise set the frequency)?
Thanks
How frequently will it automatically check?
Who is "it"? If you mean "the Django process", then it will NOT check anything by itself. You will have to use either a cron job or some async queue to take care of removing "dead" messages.
Is there a way that I can tell it to check once a minute (or otherwise set the frequency)?
Well yes, cf. above. Cron jobs are the simplest solution. Async queues (like Celery) are much more heavyweight, but if you have a lot of "out-of-band" processing (processes you want to launch from the request/response cycle BUT execute outside of it) then they are the way to go.
Will this place a heavy load on the server?
It's totally impossible to answer this. It depends on your exact models, the way you write the "check & clean" code, and, of course, data volumes. But with either a cron job or an async queue this won't run within the Django server process(es), and it can even be run on another server as long as that server can access the database. In other words, the load will be mostly on the database (and on the server running the process too, of course, but given your problem description a simple SQL delete query should be enough).
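A minimal sketch of such a "check & clean" job, using the standard-library sqlite3 module as a stand-in for the real database. The messages table and its sent_at and ttl_seconds columns (both in seconds) are assumptions for illustration; in a Django project the same delete would more likely go through the ORM, for example inside a management command that cron runs once a minute.

```python
import sqlite3
import time


def delete_expired(conn, now=None):
    """Delete every message whose display time is over; return how many."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "DELETE FROM messages WHERE sent_at + ttl_seconds <= ?",
            (now,),
        )
    return cursor.rowcount
```

The whole check is one DELETE statement, so the cost is mostly whatever the database needs to find the expired rows.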
I have two classes in my Python program and one of them is a thread. Is it a bad idea to have both classes open the same log file and write to it?
Is there any good approach to write to the same log file for two classes which are running at the same time?
This is a classical concurrency issue. You need to ensure that you exactly control what is happening. Regarding log files, the easiest solution might be to have a queue collecting log messages from various places (from different threads or even processes) and then have one entity that pops messages from that queue and writes them to the log file. This way, at least single messages stay self-contained.
The operating system does not prevent message mix up if you write to the file from different unsynchronized entities. Hence, if you do not explicitly control what should happen in which order, you might end up with corrupted messages in that file, even if things seem to work most of the time.
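The queue idea above maps fairly directly onto logging.handlers.QueueHandler and QueueListener from the standard library (Python 3). A minimal sketch, in which start_queue_logging and the logger name are illustrative only:

```python
import logging
import logging.handlers
import queue
import threading


def start_queue_logging(path, name="shared"):
    """Route all records for one logger through a queue to a single writer."""
    log_queue = queue.Queue()
    file_handler = logging.FileHandler(path)
    file_handler.setFormatter(logging.Formatter("%(threadName)s %(message)s"))
    # The listener's background thread is the only code that writes the file.
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Producers only put records on the queue; they never touch the file.
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger, listener
```

Both classes call logger.info(...) as usual, and listener.stop() at shutdown drains whatever is still queued.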
Use the python logging module. It handles the gory details for you.
As long as you control which class is reading from and writing to the file, and ensure that only one of them can write to it at a time, you should be fine. Every time you switch, reread the file.
Look into using lock to ensure that both classes are not accessing the file at the same time.
Using Google App Engine, Python 2.7, threadsafe:true, webapp2.
I would like to include all logging.XXX() messages in my API responses, so I need an efficient way to collect up all the log messages that occur during the scope of a request. I also want to operate in threadsafe:true, so I need to be careful to get only the right log messages.
Currently, my strategy is to add a logging.Handler at the start of my webapp2 dispatch method, and then remove it at the end. To collect logs only for my thread, I instantiate the logging.Handler with the name of the current thread; the handler will simply throw out log records that are from a different thread. I am using thread name and not thread ID because I was getting some unexpected results on dev_appserver when using the ID.
Questions:
Is it efficient to constantly be adding/removing logging.Handler objects in this fashion? I.e., every request will add, then remove, a Handler. Is this "cheap"?
Is this the best way to get only the logging messages for my request? My big assumption is that each request gets its own thread, and that thread name will actually select the right items.
Am I fundamentally misunderstanding Python logging? Perhaps I should only have a single additional Handler added once at the "module-level" statically, and my dispatch should do something lighter.
Any advice is appreciated. I don't have a good understanding of what Python (and specifically App Engine Python) does under the hood with respect to logging. Obviously, this is eminently possible because the App Engine Log Viewer does exactly the same thing: it displays all the log messages for that request. In fact, if I could piggyback on that somehow, that would be even better. It absolutely needs to be super-cheap though - i.e., an RPC call is not going to cut it.
I can add some code if that will help.
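For reference, the add-then-remove strategy described above can be sketched with plain standard-library logging. This is written for Python 3 rather than the App Engine 2.7 runtime, it filters on thread name as in the question, and the names ThreadLogCollector and collect_logs are made up for the example.

```python
import logging
import threading
from contextlib import contextmanager


class ThreadLogCollector(logging.Handler):
    """Keep only the records emitted by the thread that created the handler."""

    def __init__(self):
        super().__init__()
        self.thread_name = threading.current_thread().name
        self.messages = []

    def emit(self, record):
        # Records from other threads are silently dropped.
        if record.threadName == self.thread_name:
            self.messages.append(self.format(record))


@contextmanager
def collect_logs(logger=None):
    """Attach a collector for the duration of one request, then remove it."""
    logger = logger or logging.getLogger()
    handler = ThreadLogCollector()
    logger.addHandler(handler)
    try:
        yield handler.messages
    finally:
        logger.removeHandler(handler)
```

A dispatch method would wrap the request in "with collect_logs() as messages:" and copy messages into the response. This only isolates requests correctly if each concurrent request runs on a thread with a distinct name.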
I found lots of goodness here:
from google.appengine.api import logservice
entries = logservice.logs_buffer().parse_logs()
I'm looking to write a daemon that:
reads a message from a queue (sqs, rabbit-mq, whatever ...) containing a path to a zip file
updates a record in the database saying something like "this job is processing"
reads the aforementioned archive's contents and inserts a row into a database w/ information culled from file meta data for each file found
duplicates each file to s3
deletes the zip file
marks the job as "complete"
read next message in queue, repeat
This should run as a service and be initiated by a message queued when someone uploads a file via the web frontend. The uploader doesn't need to see the results immediately, but the upload should be processed in the background fairly expediently.
I'm fluent in Python, so the very first thing that comes to mind is writing a simple server with Twisted to handle each request and carry out the process mentioned above. But I've never written anything like this that would run in a multi-user context. It's not going to service hundreds of uploads per minute or hour, but it'd be nice if it could handle several at a time, reasonably. I'm also not terribly familiar with writing multi-threaded applications and dealing with issues like blocking.
How have people solved this in the past? What are some other approaches I could take?
Thanks in advance for any help and discussion!
I've used Beanstalkd as a queueing daemon to very good effect (some near-time processing and image resizing - over 2 million so far in the last few weeks). Throw a message into the queue with the zip filename (maybe from a specific directory) [I serialise a command and parameters in JSON], and when you reserve the message in your worker-client, no one else can get it, unless you allow it to time out (when it goes back to the queue to be picked up).
The rest is the unzipping and uploading to S3, for which there are other libraries.
If you want to handle several zip files at once, run as many worker processes as you want.
I would avoid doing anything multi-threaded and instead use the queue and the database to synchronize as many worker processes as you care to start up.
For this application I think twisted or any framework for creating server applications is going to be overkill.
Keep it simple. Python script starts up, checks the queue, does some work, checks the queue again. If you want a proper background daemon you might want to just make sure you detach from the terminal as described here: How do you create a daemon in Python?
Add some logging, maybe a try/except block to email out failures to you.
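The "check the queue, do some work, check the queue again" loop could be sketched like this. Everything concrete here is a stand-in: queue.Queue replaces the real message queue, sqlite3 replaces the real database, the jobs and files tables are invented for the example, and upload_to_s3 is a hypothetical placeholder where a real S3 client call would go.

```python
import logging
import os
import queue
import sqlite3
import zipfile


def create_schema(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS jobs (path TEXT PRIMARY KEY, status TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS files (job_path TEXT, name TEXT, size INTEGER)")


def upload_to_s3(name, data):
    """Hypothetical placeholder; a real worker would call an S3 client here."""


def process_job(conn, zip_path, upload=upload_to_s3):
    with conn:
        conn.execute("INSERT OR REPLACE INTO jobs VALUES (?, 'processing')", (zip_path,))
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            with conn:  # one row of metadata per file in the archive
                conn.execute(
                    "INSERT INTO files VALUES (?, ?, ?)",
                    (zip_path, info.filename, info.file_size),
                )
            upload(info.filename, archive.read(info))
    os.remove(zip_path)
    with conn:
        conn.execute("UPDATE jobs SET status = 'complete' WHERE path = ?", (zip_path,))


def run_worker(conn, jobs):
    """Process zip paths from jobs until a None sentinel arrives."""
    while True:
        zip_path = jobs.get()
        if zip_path is None:
            break
        try:
            process_job(conn, zip_path)
        except Exception:
            logging.exception("job failed: %s", zip_path)
            with conn:
                conn.execute("UPDATE jobs SET status = 'failed' WHERE path = ?", (zip_path,))
```

To handle several uploads at once, start several copies of this script as separate processes against the same queue and database; none of them needs threads.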
I opted to use a combination of Celery (http://ask.github.com/celery/introduction.html), RabbitMQ, and a simple Django view to handle uploads. The workflow looks like this:
django view accepts, stores upload
a celery Task is dispatched to process the upload. all work is done inside the Task.