I would like to run jobs, but as they may be long, I would like to know how far they have been processed during their execution. That is, the executor would regularly return its progress, without ending the job it is executing.
I have tried to do this with APScheduler, but it seems the scheduler can only receive event messages like EVENT_JOB_EXECUTED or EVENT_JOB_ERROR.
Is it possible to get information from an executor while it is executing a job?
Thanks in advance!
There is, I think, no particular support for this within APScheduler. This requirement has come up for me many times, and the best solution will depend on exactly what you need. Some possibilities:
Job status dictionary
The simplest solution would be to use a plain python dictionary. Make the key the job's key, and the value whatever status information you require. This solution works best if you only have one copy of each job running concurrently (max_instances=1), of course. If you need some structure to your status information, I'm a fan of namedtuples for this. Then, you either keep the dictionary as an evil global variable or pass it into each job function.
There are some drawbacks, though. The status information will stay in the dictionary forever, unless you delete it. If you delete it at the end of the job, you don't get to read a 'job complete' status, and otherwise you have to make sure that whatever is monitoring the status definitely checks and clears every job. This of course isn't a big deal if you have a reasonable sized set of jobs/keys.
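A minimal sketch of the dictionary approach (the job name and status fields here are invented; in a real job you'd do the actual work between updates):

```python
from collections import namedtuple

JobStatus = namedtuple('JobStatus', ['done', 'total'])

# one evil-global status board: job id -> JobStatus
job_status = {}

def long_job(job_id, items):
    total = len(items)
    for i, item in enumerate(items, start=1):
        # ... do the real work on `item` here ...
        job_status[job_id] = JobStatus(done=i, total=total)
    # note: the entry stays here until something clears it

long_job('nightly-import', range(3))
print(job_status['nightly-import'])  # -> JobStatus(done=3, total=3)
```

Whatever monitors progress just reads `job_status[job_id]` from another thread; since the updates replace the whole value atomically, readers always see a consistent snapshot.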
Custom dict
If you need some extra functions, you can do as above, but subclass dict (or UserDict or MutableMapping, depending on what you want).
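For example, a UserDict subclass could clear an entry once its final status has been read, which addresses the "who clears completed jobs" drawback from above (all names here are invented):

```python
from collections import UserDict

class StatusBoard(UserDict):
    """Status dict that can drop an entry once its final status is read."""
    def pop_if_done(self, key):
        status = self.data.get(key)
        if status is not None and status.get('done'):
            del self.data[key]
        return status

board = StatusBoard()
board['job-1'] = {'done': False, 'progress': '10/133'}
board['job-1'] = {'done': True, 'progress': '133/133'}
final = board.pop_if_done('job-1')  # returns the final status and clears it
print(final, 'job-1' in board)      # -> {'done': True, 'progress': '133/133'} False
```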
Memcached
If you've got a memcached server you can use, storing the status reports in memcached works great, since they can expire automatically and they should be globally accessible to your application. One probably-minor drawback is that the status information could be evicted from the memcached server if it runs out of memory, so you can't guarantee that the information will be available.
A more major drawback is that this does require you to have a memcached server available. If you might or might not have one available, you can use dogpile.cache and choose the backend that's appropriate at the time.
Something else
Pieter's comment about using a callback function is worth taking note of. If you know what kind of status information you'll need, but you're not sure how you'll end up storing or using it, passing a wrapper to your jobs will make it easy to use a different backend later.
As always, though, be wary of over-engineering your solution. If all you want is a report that says "20/133 items processed", a simple dictionary is probably enough.
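To illustrate the callback idea (all names invented): jobs receive only a `report` callable, so the storage behind it can be swapped out later without touching any job code.

```python
def make_reporter(store):
    def report(job_id, **status):
        store[job_id] = status   # swap this body for memcached/Redis later
    return report

statuses = {}
report = make_reporter(statuses)

def long_job(report):
    total = 133
    for done in range(1, total + 1):
        # ... process one item ...
        report('import', done=done, total=total)

long_job(report)
print(statuses['import'])  # -> {'done': 133, 'total': 133}
```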
Related
I have code that contains a variable which I want to change manually, whenever I want, without stopping or pausing the main loop (e.g. with input()). I can't find a library that lets me set it manually at runtime, or access the RAM to change that value.
For now I have set up a file watcher that reads the parameters every minute, but I presume this is an inefficient way to do it.
I guess you just want to expose an API. You did it with files, which works but is less common. You could use common best practices such as:
An HTTP web server. You can set one up quickly with Flask/bottle.
gRPC
A pub/sub mechanism - Redis, Kafka (more complicated, and requires another storage solution - the DB itself).
There are tons of other solutions, but you get the idea. I hope that's what you're looking for.
With any of those solutions you can manually trigger your endpoint and change whatever you want in your application.
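A stdlib-only sketch of the HTTP approach (Flask or bottle would make this shorter); the parameter name `threshold` and the `/set` endpoint are made up for illustration:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

params = {'threshold': 10}      # read by the main loop
params_lock = threading.Lock()

class ParamHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path == '/set':
            with params_lock:
                for key, values in parse_qs(url.query).items():
                    params[key] = int(values[0])
            self.send_response(200)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):   # keep the demo quiet
        pass

# port 0 = pick any free port; serve in the background, main loop keeps running
server = HTTPServer(('127.0.0.1', 0), ParamHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
```

An operator can then change the value from a shell while the loop runs, e.g. `curl 'http://127.0.0.1:<port>/set?threshold=42'`.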
In my Bottle app running on pythonanywhere, I want objects to be persisted between requests.
If I write something like this:
from bottle import route, SimpleTemplate

X = {'count': 0}

@route('/count')
def count():
    X['count'] += 1
    tpl = SimpleTemplate('Hello {{count}}!')
    return tpl.render(count=X['count'])
The count increments, meaning that X persists between requests.
I am currently running this on pythonanywhere, which is a managed service where I have no control over the web server (nginx I presume?) threading, load balancing (if any) etc...
My question is: is this a coincidence, because it's only using one thread while under the minimal load of my own testing?
More generally, at which point will this stop working? E.g. I have more than one thread/socket/instance/load-balanced server etc...?
Beyond that, what are my best options to make something like this work (sticking with Bottle), even if I have to move to a bare-bones server?
Here's what Bottle docs have to say about their request object:
A thread-safe instance of LocalRequest. If accessed from within a request callback, this instance always refers to the current request (even on a multi-threaded server).
But I don't fully understand what that means, or where global variables like the one I used stand with regards to multi-threading.
TL;DR: You'll probably want to use an external database to store your state.
If your application is tiny, and you're planning to always have exactly one server process running, then your current approach can work; "all" you need to do is acquire a lock around every (!) access to the shared state (the dict X in your sample code). (I put "all" in scare quotes there because it's likely to become more complicated than it sounds at first.)
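A minimal sketch of what that locking looks like, using the sample dict from the question (under CPython's GIL a bare `+=` on a dict entry rarely corrupts anything, but the lock makes the read-modify-write explicitly safe on any threaded server):

```python
import threading

X = {'count': 0}
X_lock = threading.Lock()

def increment_count():
    # every reader and writer of X must take the same lock
    with X_lock:
        X['count'] += 1
        return X['count']

print(increment_count())  # -> 1
```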
But, since you're asking about multithreading, I'll assume that your application is more than a toy, meaning that you plan to receive substantial traffic and/or want to handle multiple requests concurrently. In this case, you'll want multiple processes, which means that your approach (storing state in memory) cannot work. Memory is not shared across processes. The (general) way to share state across processes is to store the state externally, e.g. in a database.
Are you familiar with Redis? That'd be on my short list of candidates.
I got the answers by contacting PythonAnywhere support, who had this to say:
When you run a website on a free PythonAnywhere account, just one process handles all of your requests -- so a global variable like the one you use there will be fine. But as soon as you want to scale up, and get (say) a hacker account, then you'll have multiple processes (note, not threads) -- and of course each one will have its own global variables, so things will go wrong.
So that part deals with the PythonAnywhere specifics on why it works, and when it would stop working on there.
The answer to the second part, about how to share variables between multiple Bottle processes, I also got from their support (most helpful!) once they understood that a database would not work well in this situation.
Different processes cannot, of course, share variables, and the most viable solution would be to:
write your own kind of caching server to handle keeping stuff in memory [...] You'd have one process that ran all of the time, and web API requests would access it somehow (an internal REST API?). It could maintain stuff in memory [...]
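The stdlib can actually serve as this kind of broker: multiprocessing.managers.BaseManager hosts an object in one long-running server process and lets any other process connect to it over a socket. A sketch (the authkey and names are placeholders, and the server runs in a background thread here just for the demo):

```python
import threading
from multiprocessing.managers import BaseManager

shared = {}  # this dict lives in the server process only

class StoreServer(BaseManager):
    pass

StoreServer.register('get_store', callable=lambda: shared)

mgr = StoreServer(address=('127.0.0.1', 0), authkey=b'change-me')
server = mgr.get_server()
threading.Thread(target=server.serve_forever, daemon=True).start()

# Any other process would do the following, using the server's address:
class StoreClient(BaseManager):
    pass

StoreClient.register('get_store')  # no callable: clients only consume it

client = StoreClient(address=server.address, authkey=b'change-me')
client.connect()
store = client.get_store()     # a proxy; method calls cross the socket

store.update({'count': 1})     # proxies expose the dict's public methods
print(store.get('count'))      # -> 1, fetched back from the server process
```

Note that the proxy exposes the dict's public methods (`get`, `update`, `items`, ...) but not `store['count']`-style subscription, which is why the example uses `update` and `get`.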
Ps: I didn't expect other replies to tell me to store state in a database, I figured that the fact I'm asking this means I have a good reason not to use a database, apologies for time wasted!
The problem is still on the drawing board so far, so I can go for another better suited approach. The situation is like this:
We create a queue of n processes, each of which executes independently of the other tasks in the queue. They do not share any resources. However, we noticed that sometimes (depending on queue parameters) process k's behaviour might depend on the existence of a flag specific to process k+1. This flag is to be set in a DynamoDB table, and therefore the execution can fail.
What I am currently looking for is a way to put some sort of waiters/suspenders in my tasks/workers, so that they poll until the flag is set in the DynamoDB table and meanwhile let the other subprocesses take up the CPU.
The setting of this boolean value is done a little early in the processes themselves. The dependent part of the process comes much later.
So we went ahead with creating n separate processes instead of having a suspender. This is not the ideal approach, but for the time being it solves the issue at hand.
I'd still love a better method to achieve the same.
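One such method is a small polling helper with exponential backoff. The DynamoDB read is left as a callable so the sketch stays library-agnostic (with boto3 it would wrap a `get_item` call on the table); the demo below uses a fake check that succeeds on the third attempt:

```python
import time

def wait_for_flag(check, timeout=300, initial_delay=1, max_delay=30):
    """Poll check() until it returns True; sleep (yielding the CPU) between tries."""
    delay = initial_delay
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(delay)                 # other processes get the CPU here
        delay = min(delay * 2, max_delay) # exponential backoff
    return False

# Demo with a fake check standing in for the DynamoDB lookup:
calls = {'n': 0}
def fake_check():
    calls['n'] += 1
    return calls['n'] >= 3

ok = wait_for_flag(fake_check, timeout=10, initial_delay=0.01)
print(ok, calls['n'])  # -> True 3
```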
I have a requirement like
The first time I run the process, I need to set a=1
For every subsequent run of the same process, I need to set a=2
Is it possible to maintain a cache that tells the process it is being run for the second (or a later) time?
I don't want another physical file to be created in my directory structure.
I searched the internet, but only found caches that live within the process itself.
Thanks in advance
The ways to preserve data between totally separate executions of a process are:
Saving a file.
Handing the data off to another process such as a Memcached or Redis instance, or a database, which will keep the data in memory and/or write it to disk somewhere.
Recording the data in some other, more unusual way such as changing the environment of the running operating system, printing out the data or otherwise displaying it so that the human operator can keep track of it, or something like that.
When you use the word 'cache' and state that you do not wish to write the data to disk, the first thing that comes to mind is memcached or some other in-memory cache. But any file-based solution will certainly be less complex than setting up and maintaining an in-memory key-value store.
Which solution you choose depends in part on what 'second time' means. Second time ever? Second time ever on a given computer? Second time since reboot? Since manual reset? Different methods of recording data are suited to different storage requirements.
If your application is really just a=1 versus a=2 using a file is as good as anything, otherwise consult here: http://docs.python.org/2/library/persistence.html for other persistence methods.
Data cached inside the process dies along with the process. You'll have to cache this info elsewhere since you want it to persist longer than the process lives. A file seems reasonable.
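To make the file-based suggestion concrete (the marker path below is a throwaway temp path; a real application would pick a stable location for it):

```python
import os
import tempfile

def first_run_value(marker_path):
    """Return 1 the first time this is called for marker_path, 2 afterwards."""
    if os.path.exists(marker_path):
        return 2
    with open(marker_path, 'w'):
        pass  # create an empty marker file to record that we've run
    return 1

marker = os.path.join(tempfile.mkdtemp(), 'has_run')
a = first_run_value(marker)   # -> 1 on the very first run
a = first_run_value(marker)   # -> 2 from then on
```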
I have separate processes working on Redis keys: one of them updates very quickly, while the second can take a while, but it still needs to happen.
I'm looking to implement a real lock in redis, not just a watch that fails the following actions if the key is modified.
Is there a way to check for a watch on a key or something like that?