In a Django application I am working on, I have just added the ability to archive a number of files (around 50 MB in total) into a zip file. Currently, I am doing it something like this:
get files to zip
zip all files
send HTML response
Obviously, this causes a long wait on step two while the files are being compressed. What can I do to make this process a whole lot better for the user? A progress bar would be best, but even a static page saying 'please wait' or whatever would help.
Any thoughts and ideas would be loved.
You should keep in mind that showing a progress bar may not be a good idea, since you can hit timeouts or make your server suffer under lots of simultaneous polling requests.
Put the zipping task in a queue and have it notify the user somehow - by e-mail, for instance - that the process has finished.
Take a look at django-lineup
Your code will look pretty much like:
from lineup import registry
from lineup import _debug
def create_archive(queue_id, queue):
    queue.set_param("zip_link", _create_archive(resource=queue.context_object, user=queue.user))
    return queue

def create_archive_callback(queue_id, queue):
    _send_email_notification(subject=queue.get_param("zip_link"), user=queue.user)
    return queue

registry.register_job('create_archive', create_archive, callback=create_archive_callback)
In your views, create queued tasks by:
from lineup.factory import JobFactory
j = JobFactory()
j.create_job(self, 'create_archive', request.user, your_resource_object_containing_files_to_zip, { 'extra_param': 'value' })
Then run your queue processor (probably inside of a screen session):
./manage.py run_queue
Oh, and on the subject, you might also be interested in estimating zip file creation time. I got pretty slick answers there.
Fun fact: You might be able to use a progress bar to trick users into thinking that things are going faster than they really are.
http://www.chrisharrison.net/projects/progressbars/index.html
You could use a 'log file' to keep track of the zipped files and of how many files still remain.
The procedure would be like this:
Count the number of files and write it to a text file, in a format like totalfiles.filesprocessed
Every time you zip a file, update the log file
So, if you have to zip 3 files, the log file will grow as:
3.0 -> begin, no files processed yet
3.1 -> 1 file of 3 processed, 33% complete
3.2 -> 2 files of 3 processed, 66% complete
3.3 -> 3 files of 3 processed, 100% complete
Then, with a simple AJAX function (on an interval), check the log file every second.
In Python, opening, reading and writing such a small file should be very quick, but it may cause trouble if many users are doing it at the same time; you'll obviously need to create a log file for each request, perhaps with a random name, and delete it after the task is completed.
One problem could be that, to let the AJAX poller read the log file, you'll need to open and close the file handle in Python every time you update it.
Eventually, for a more accurate progress meter, you could even use the file sizes instead of the number of files as the parameter.
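A minimal sketch of the log-file idea described above (the function names are illustrative; the totalfiles.filesprocessed format is the one from the steps):

```python
import os
import zipfile

def zip_with_progress(paths, zip_path, progress_path):
    """Zip `paths` into `zip_path`, recording 'total.done' in `progress_path`."""
    total = len(paths)
    with zipfile.ZipFile(zip_path, "w") as zf:
        for done, path in enumerate(paths, start=1):
            zf.write(path, arcname=os.path.basename(path))
            # Rewrite the progress file after each member is added.
            with open(progress_path, "w") as p:
                p.write("%d.%d" % (total, done))
```

The AJAX endpoint then only needs to read the progress file, split on the dot, and compute done/total.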
Better than a static page, show a JavaScript dialog (using Shadowbox, jQuery UI or some custom method) with a throbber (you can get one at http://www.ajaxload.info/). You can also show the throbber in your page, without dialogs. Most users only want to know their action is being handled, and can live without reliable progress information ("Please wait, this could take some time...").
jQuery UI also has a progress bar API. You could make periodic AJAX queries to a dedicated page on your website to get a progress report and update the progress bar accordingly. Depending on how often the archiving is run, how many users can trigger it, and how you authenticate your users, this could be quite hard.
Related
I have an anchor tag that hits a route which generates a report in a new tab. I am lazy-loading the report specs because I don't want to keep copies of my data both in the original place and on the report object. But collecting that data takes 10-20 seconds.
from flask import render_template
@app.route('/report/')
@app.route('/report/<id>')
def report(id=None):
    report_specs = function_that_takes_20_seconds(id)
    return render_template('report.html', report_specs=report_specs)
I'm wondering what I can do so that the server responds immediately with a spinner and then when function_that_takes_20_seconds is done, load the report.
You are right: an HTTP view is not the place for a long-running task.
You need to think about your system architecture: what you can prepare outside the view, and what computation must happen in real time inside a view.
The usual solutions include adding asynchronous behavior and processing your data in a separate process. Often people use task queues like Celery for this.
Prepare data in a scheduled process which runs, e.g., every 10 minutes
Cache results by storing them in a database
HTTP view always returns the last cached version
Alternatively, make your view do an AJAX call via JavaScript ("the spinner approach"). This requires some basic JavaScript skills. However, it doesn't make the results appear to the end user any faster - it's just user-experience smoke and mirrors.
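The spinner approach boils down to: start the slow function in the background, return the spinner page immediately, and let the page poll a status endpoint until a result is ready. A framework-independent sketch of that pattern (the in-memory job store and function names are illustrative, not Flask API; a real deployment would use Celery or similar rather than threads):

```python
import threading
import uuid

jobs = {}  # job_id -> result (None while still running)

def start_job(slow_fn, *args):
    """Kick off slow_fn in a background thread and return a job id to poll."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = None

    def run():
        jobs[job_id] = slow_fn(*args)

    threading.Thread(target=run).start()
    return job_id

def poll_job(job_id):
    """What the AJAX status endpoint would return as JSON."""
    result = jobs.get(job_id)
    return {"done": result is not None, "result": result}
```

The /report view would call start_job and render the spinner template with the job id; a second route would return poll_job(job_id), and the page's JavaScript swaps in the report once "done" is true.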
I do not want to lose my sets if Windows is about to shut down/restart/log off/sleep. Is it possible to save them before shutdown? Or is there an alternative way to save the information without worrying it will get lost on Windows shutdown? JSON, CSV, DB? Anything?
import pickle

s = {1, 2, 3, 4}
with open("s.pick", "wb") as f:  # pickle the set to a file when the PC is about to shut down
    pickle.dump(s, f)
I do not want to lose my sets if windows is about to shutdown/restart/log off/sleep, Is it possible to save it before shutdown?
Yes, if you've built an app with a message loop, you can receive the WM_QUERYENDSESSION message. If you want to have a GUI, most GUI libraries will probably wrap this up in their own way. If you don't need a GUI, your simplest solution is probably to use PyWin32. Somewhere in the docs there's a tutorial on creating a hidden window and writing a simple message loop. Just do that on the main thread, and do your real work on a background thread, and signal your background thread when a WM_QUERYENDSESSION message comes in.
Or, much more simply, as Evgeny Prokurat suggests, just use SetConsoleCtrlHandler (again through PyWin32). This can also catch ^C, ^BREAK, and the user closing your console, as well as the logoff and shutdown messages that WM_QUERYENDSESSION catches. More importantly, it doesn't require a message loop, so if you don't have any other need for one, it's a lot simpler.
Or is there an alternative way to save the information without worrying it will get lost on Windows shutdown? JSON, CSV, DB? Anything?
The file format isn't going to magically solve anything. However, a database can give you two advantages.
First, you can shrink the window for data loss by writing as often as possible. But with most file formats, that means rewriting the whole file each time, which will be very slow. The solution is to stream appends to a simpler "journal" file, pack that into the real file less often, and look for a leftover journal at every launch. You can do that manually, but a database will usually do it for you automatically.
Second, if you get killed in the middle of a write, you end up with half a file. You can solve that with the atomic-write trick (write a temporary file, then replace the old file with it), but this is hard to get right on Windows, especially with Python 2.x (see Getting atomic writes right), and again, a database will usually do it for you.
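The stdlib sqlite3 module gives you both properties for free: each committed transaction is journaled and atomic, so a crash mid-write never leaves half a file. A minimal sketch for persisting the set (table and function names are just for illustration):

```python
import sqlite3

def save_set(db_path, values):
    """Replace the stored set with `values` in one atomic transaction."""
    con = sqlite3.connect(db_path)
    with con:  # commits on success, rolls back if interrupted
        con.execute("CREATE TABLE IF NOT EXISTS items (v INTEGER PRIMARY KEY)")
        con.execute("DELETE FROM items")
        con.executemany("INSERT INTO items VALUES (?)", [(v,) for v in values])
    con.close()

def load_set(db_path):
    """Read the set back; returns whatever the last committed save left."""
    con = sqlite3.connect(db_path)
    try:
        return {row[0] for row in con.execute("SELECT v FROM items")}
    finally:
        con.close()
```

With this in place you can simply call save_set after every change and stop worrying about shutdown timing at all.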
The "right" way to do this is to create a new window class with a msgproc that dispatches to your handler on WM_QUERYENDSESSION. Just as MFC makes this easier than raw Win32 API code, win32ui (which wraps MFC) makes this easier than win32api/win32gui (which wraps raw Win32 API). And you can find lots of samples for that (e.g., a quick search for "pywin32 msgproc example" turned up examples like this, and searches for "python win32ui" and similar terms worked just as well).
However, in this case, you don't have a window that you want to act like a normal window, so it may be easier to go right to the low level and write a quick-and-dirty message loop. Unfortunately, that's a lot harder to find sample code for; you basically have to search the native APIs for C sample code (like Creating a Message Loop at MSDN), then figure out how to translate that to Python with the PyWin32 documentation. Less than ideal, especially if you don't know C, but not that hard. Here's an example to get you started:
import sys
import threading

import win32api
import win32con
import win32gui

def msgloop():
    while True:
        msg = win32gui.GetMessage(None, 0, 0)
        if msg and msg.message == win32con.WM_QUERYENDSESSION:
            handle_shutdown()
        win32api.TranslateMessage(msg)
        win32api.DispatchMessage(msg)
        if msg and msg.message == win32con.WM_QUIT:
            return msg.wparam

worker = threading.Thread(target=real_program)
worker.start()
exitcode = msgloop()
worker.join()
sys.exit(exitcode)
I haven't shown the "how to create a minimal hidden window" part, or how to signal the worker to stop with, e.g., a threading.Condition, because there are a lot more (and easier-to-find) good samples for those parts; this is the tricky part to find.
You can detect Windows shutdown/log off with win32api.SetConsoleCtrlHandler.
There is a good example: How To Catch "Kill" Events with Python.
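SetConsoleCtrlHandler is Windows-only. For the narrower goal of saving the set before the process dies, the portable stdlib equivalent is a signal handler - note this catches SIGTERM/Ctrl-C style termination, not Windows logoff/shutdown events, so it is a complement to the PyWin32 approaches above, not a replacement:

```python
import pickle
import signal

DATA = {1, 2, 3, 4}  # the set we don't want to lose

def save_on_exit(signum, frame):
    # Persist the set the moment a termination signal arrives.
    with open("s.pick", "wb") as f:
        pickle.dump(DATA, f)

signal.signal(signal.SIGTERM, save_on_exit)
```

After saving, the handler can re-raise the signal or call sys.exit() so the process still terminates normally.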
So I have a BlobstoreUploadHandler class that uses put_async and wait, like so:
x = Model.put_async()
x.wait()
then it passes some data up front to JavaScript, so that the user is redirected to the class serving their file upload. It does this like so:
redirecthref = '%s/serve/%s' % (self.request.host_url, Model.uploadid)
self.response.headers['Content-Type'] = 'application/json'
obj = {'success': True, 'redirect': redirecthref}
self.response.write(json.dumps(obj))
This all works well and good; however, it takes a CRAZY amount of time for this redirect to happen - we're talking minutes - and while the file is uploading, the page is completely frozen. I've noticed I am able to access the link that JavaScript would redirect to even while the upload is happening and the page is frozen. So my question is: what strategies can I pursue to make the redirect happen right when the URL becomes available? Is this what the 'callback' parameter of put_async is for, or is this where I want to look into url_fetch?
I'm pretty new to this, and any and all help is appreciated. Thanks!
UPDATE:
So I've figured out that the upload is slow for several reasons:
I should be using put() rather than put_async(), which I've found does speed up the upload time; however, something is breaking and it's giving me a 500 error that looks like:
POST http://example.com/_ah/upload/AMmfu6au6zY86nSUjPMzMmUqHuxKmdTw1YSvtf04vXFDs-…tpemOdVfHKwEB30OuXov69ZQ9cXY/ALBNUaYAAAAAU-giHjHTXes0sCaJD55FiZxidjdpFTmX/ 500 (Internal Server Error)
It still uploads both the resources, but the redirect does not work. I believe this is happening on the created upload_url, which is created using
upload_url = blobstore.create_upload_url('/upload')
All that aside, even using put() instead of put_async(), the wait() method still takes an exorbitant amount of time.
If I remove the x.wait(), the upload will still happen, but the redirect gives me:
IndexError: list index out of range
this error is thrown on the following line of my /serve class Handler
qry = Model.query(Model.uploadid == param).fetch(1)[0]
So in short, I believe the fastest way to serve an entity after upload is to take out x.wait() and instead wrap the query in a try/except, so that it keeps retrying to serve the page until it no longer gets an IndexError.
Like I said, I'm pretty new to this, so actually making this happen is a little over my skill level; thus any thoughts or comments are greatly appreciated, and I am always happy to offer more in the way of code or explanation. Thanks!
Async calls are for sending something to the background when you don't REALLY care about when it finishes. Seems to me you are looking for a put.
By definition, a put_async isn't meant to finish fast. It pushes the work back for when your instance has time for it. You're looking for a put, I think. It'll freeze your application the same way your wait is doing, but instead of waiting a LONG time for the async call to finish, it'll start working on it right away.
as said in the async documentation (https://developers.google.com/appengine/docs/java/datastore/async):
However, if your application needs the result of the get() plus the result of a Query to render the response, and if the get() and the Query don't have any data dependencies, then waiting until the get() completes to initiate the Query is a waste of time.
That doesn't seem to be what you're doing. You're using an async call in a purely synchronous way. It WILL take longer to complete than a simple put. Unless there is some reason to let the put take longer, you shouldn't use async.
Looking back, I wanted to circle back on this since I solved it shortly after posting. What I discovered was that there was no real way to speed up the upload, other than using put instead of put_async, of course.
But there was a trick to access the blob in my redirect URL, other than through Model.uploadid, which was not guaranteed to be consistently uploaded by the time the redirect occurred.
The solution was to simply access the blob using the .key() method of my upload object and pass that into the redirecthref, instead of Model.uploadid:
redirecthref = '%s/serve/%s' % (self.request.host_url, self.get_uploads('my_upload_object')[0].key())
Not sure why the .key() lookup seemed to bypass the whole upload process, but this worked for me.
Thanks,
I need to read an Apache log file in real time from the server, and if some string is found, an e-mail has to be sent. I have adapted the code found here to read the log file. Next, how do I send this e-mail? Do I have to issue a sleep command? Please advise.
Note: since this is real time, after sending the e-mail the Python program has to resume reading the log file. This process continues.
import time
import os

# open the file
filename = '/var/log/apache2/access.log'
file = open(filename, 'r')

while 1:
    where = file.tell()
    line = file.readline()
    if not line:
        time.sleep(1)
        file.seek(where)
    else:
        if 'MyTerm' in line:
            print line
Well, if you want it to stay real time and not get stuck sending mails, you could start a separate thread to send the e-mail. Here is how to use threads in Python (thread and threading):
http://www.tutorialspoint.com/python/python_multithreading.htm
Next, you can easily send an e-mail in Python using smtplib. Here is another example from the same website (which I use, and it is pretty good):
http://www.tutorialspoint.com/python/python_sending_email.htm
You need this to keep the log-reading thread as fast as possible and to be sure it won't block on mailing.
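A minimal smtplib sketch of the sending side (Python 3 shown; the addresses and SMTP host are placeholders you would replace), with message building split out so it can be tested separately from the network call:

```python
import smtplib
from email.message import EmailMessage

def build_alert(line):
    """Build the notification e-mail for a matched log line."""
    msg = EmailMessage()
    msg["Subject"] = "'MyTerm' found in access.log"
    msg["From"] = "monitor@example.com"  # placeholder sender
    msg["To"] = "admin@example.com"      # placeholder recipient
    msg.set_content(line)
    return msg

def send_alert(line, smtp_host="localhost"):
    # Kept separate from the reading loop so it can run in its own thread.
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(build_alert(line))
```

From the reading loop you would then start threading.Thread(target=send_alert, args=(line,)).start() instead of sending inline.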
Now some pitfalls you have to take care of:
You must be careful not to start too many threads. For instance, suppose you are parsing the log every second (let's just assume that for the moment), but sending an e-mail takes 10 seconds. It is easy to see (this is an exaggerated example, of course) that you will start many threads and exhaust the available resources. I don't know how often the string you are expecting will appear, but it is a scenario you must consider.
Again, depending on the workload, you could implement a streaming algorithm and avoid e-mails entirely. I don't know if it applies in your case, but I want to mention this scenario too.
You can create a queue, put a certain number of messages in it, and send them together, thus avoiding sending many mails at once (again, assuming you don't need to trigger an alarm for every single occurrence of your target string).
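A sketch of that batching idea, assuming one alert mail per N matches is acceptable (the send_batch callback is a placeholder for your actual mailing code):

```python
import threading

class AlertBatcher:
    """Collect matched lines and flush them in batches of `batch_size`."""

    def __init__(self, send_batch, batch_size=10):
        self.send_batch = send_batch  # called with a list of lines
        self.batch_size = batch_size
        self.pending = []
        self.lock = threading.Lock()  # add() may be called from several threads

    def add(self, line):
        with self.lock:
            self.pending.append(line)
            if len(self.pending) >= self.batch_size:
                batch, self.pending = self.pending, []
            else:
                return
        self.send_batch(batch)  # one mail for the whole batch
```

In the reading loop you would call batcher.add(line) for each match; a timer-based flush could be added so a partial batch is not held forever.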
UPDATE
If you really want to polish this program, you can avoid polling by triggering on file-modification events instead. This way you avoid sleep entirely, and each time something is appended to the file, your Python code is called and can parse the new content and send the e-mail if required. Take a look at watchdog:
http://pythonhosted.org/watchdog/
and this:
python pyinotify to monitor the specified suffix files in a dir
https://github.com/seb-m/pyinotify
This is probably a truly basic thing that I'm simply having an odd time figuring out in a Python 2.5 app.
I have a process that will take roughly an hour to complete, so I made a backend. To that end, I have a backend.yaml that has something like the following:
- name: mybackend
  options: dynamic
  start: /path/to/script.py
(The script is just raw computation. There's no notion of an active web session anywhere.)
On toy data, this works just fine.
This used to be public, so I would navigate to the page, the script would start, and it would time out after about a minute (the HTTP timeout plus the 30 s shutdown grace period, I assume). I figured this was a browser issue, so I repeated the same thing with a cron job. No dice. I switched to using a push queue and adding a targeted task, since on paper it looks like it would wait for 10 minutes. Same thing.
All three time out after about a minute, which means I'm not decoupling the request from the backend like I believe I am.
I'm assuming that I need to write a proper handler for the backend to do the work, but I don't exactly know how to write the handler/webapp2 route. Do I handle _ah/start/ or make a new endpoint for the backend? How do I handle the subdomain? It still seems like the wrong thing to do (I'm sticking a long process directly into a request of sorts), but I'm at a loss otherwise.
So the root cause ended up being doing the following in the script itself:
models = MyModel.all()
for model in models:
    # Magic happens
I was basically taking it for granted that the query would automatically batch my Query.all() over many entities, but it was dying at about the 1000th entity. I originally wrote that the script was computation only because I completely ignored the fact that the reads can fail.
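The failure mode above - an unbatched iteration dying around the 1000-entity mark - can be avoided by fetching in explicit pages and resuming from a cursor (on App Engine's old db API that's q.fetch(limit) plus q.cursor()/q.with_cursor()). The general pattern, sketched datastore-independently so the cursor mechanics are visible (fetch_page stands in for whatever your datastore's paging call is):

```python
def iterate_in_batches(fetch_page, batch_size=500):
    """Yield every entity, one page at a time.

    `fetch_page(cursor, limit)` must return (entities, next_cursor),
    with next_cursor=None when exhausted - the shape of a typical
    datastore cursor API.
    """
    cursor = None
    while True:
        entities, cursor = fetch_page(cursor, batch_size)
        for entity in entities:
            yield entity
        if cursor is None:
            return
```

Each page is a separate datastore call, so no single read has to survive the whole dataset; a failed page can also be retried from its cursor.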
The actual solution to the problem we wanted to solve ended up being "use the map-reduce library", since we were trying to look at each model for analysis.