There probably are such applications, but I couldn't find them, so I'm asking here.
I'd like a dynamic Python script scheduler.
It runs one script from beginning to end, then reads the next script in the queue and executes it.
I'd like to dynamically add new Python scripts to the queue (and probably delete them as well).
If I list the jobs, it should show me all the jobs that have been executed and those that are still in the queue.
Do you know any program that provides such functionality?
I know about the Load Sharing Facility (LSF), but I don't need to distribute jobs to a cluster;
I just need to queue jobs on my machine...
If you need an in-process scheduler, you can try APScheduler, which is quite simple to use and thoroughly documented. You could even build a custom scheduler program based on APScheduler, communicating with a minimal GUI through an SQL database.
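For illustration, a minimal in-process setup with APScheduler 3.x might look like the sketch below; the run_next_job function and the 10-second interval are assumptions, not something the library requires:

from apscheduler.schedulers.blocking import BlockingScheduler

def run_next_job():
    # placeholder: pop the next script from your queue and execute it here
    print("checking the queue...")

scheduler = BlockingScheduler()
scheduler.add_job(run_next_job, 'interval', seconds=10)  # poll the queue every 10 seconds
scheduler.start()  # blocks and keeps firing the job until interrupted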
If you are looking for very basic functionality, you could also have a program poll a chosen directory (the pending jobs pool) for, e.g., batch files. Each time it finds one, it moves it to another directory (the running jobs pool), runs it, and finally moves it to a final directory (the executed jobs pool). Adding a job is as simple as adding a batch file, monitoring the queue consists of viewing the pending jobs directory's contents, and so on.
Such a script can, IMHO, take less time to write than learning to use an advanced scheduler (leaving aside APScheduler).
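A rough sketch of such a polling script, assuming three directories named pending, running and done and a five-second poll interval (all of these are illustrative choices):

import os
import shutil
import subprocess
import time

PENDING, RUNNING, DONE = 'pending', 'running', 'done'

while True:
    for name in sorted(os.listdir(PENDING)):
        run_path = os.path.join(RUNNING, name)
        shutil.move(os.path.join(PENDING, name), run_path)   # claim the job
        subprocess.call(['python', run_path])                # run it from start to finish
        shutil.move(run_path, os.path.join(DONE, name))      # archive the finished job
    time.sleep(5)                                            # poll interval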
I have used a lot of distributed task packages, like Celery and python-rq; they all depend on an external service such as Redis, RabbitMQ and so on.
But usually I don't need a queue service; in other words, I don't want to install Redis or any other non-Python service on my VPS (also to simplify the environment).
I should say that it is good to split the producer and the worker into different processes (two code files). Using multiprocessing.Queue means putting everything into one file and writing a lot of additional code to catch Ctrl+C, handle the exit and save the currently enqueued tasks. This does not happen with Celery or python-rq: even if you stop the workers and producers, the tasks remain saved in the queue.
I'd like to use a local queue (one that works with just pip install xxx), such as a disk-based queue.
After some searching, I only found queuelib (a collection of persistent, disk-based queues), but sadly it doesn't support access from multiple processes.
Check out the Luigi package.
It allows you to define multiple tasks with all their dependencies and then request a run with particular parameters.
It runs a so-called scheduler, which can be a local one or a "production" one in the form of a web service.
Luigi has a concept of finished tasks, marked by the presence of some sort of target. The target can be a file on the local filesystem, on AWS S3, etc. Once the target exists, the task is considered completed. Luigi takes care of creating the target files atomically; that is, the file is first created in a temporary location and only moved to its final one after it is really complete. This prevents half-finished targets.
If you stop processing in the middle, the next start will check for existing completed targets and process only the tasks which are not completed yet.
There is no extra service to install; the command sets up the local or "production" scheduler on its own if one does not exist yet.
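For illustration, a minimal Luigi task could look like the sketch below; the RunScript class, the '.done' marker target and job1.py are invented names for the example:

import subprocess
import luigi

class RunScript(luigi.Task):
    script = luigi.Parameter()  # path of the script to run

    def output(self):
        # the task counts as completed once this target exists
        return luigi.LocalTarget(self.script + '.done')

    def run(self):
        subprocess.check_call(['python', self.script])
        with self.output().open('w') as marker:  # written atomically by Luigi
            marker.write('done')

if __name__ == '__main__':
    luigi.build([RunScript(script='job1.py')], local_scheduler=True)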
I need to run specific manage.py commands on an EC2 instance every X minutes. For example: python manage.py some_command.
I have looked up django-chronograph. Following the instructions, I've added chronograph to my settings.py but on runserver it keeps telling me No module named chronograph.
Is there something I'm missing to get this running? And after running how do I get manage.py commands to run using chronograph?
Edit: It's installed in the EC2 instance's virtualenv.
I would suggest you configure cron to run your command at specific times/intervals.
First, install it by running pip install django-chronograph.
I would say handle this through cron, but if you don't want to use cron, then:
Make sure you installed the module in the virtualenv (With easy_install, pip, or any other way that Amazon EC2 allows). After that you might want to look up the threading module documentation:
Python 2 threading module documentation
Python 3 threading module documentation
The purpose of using threading will be to have the following structure:
A "control" thread, which will use the chronograph module and do the time measurements, and putting the new work to do in an "input queue" on each scheduled time, for the worker threads (which will be active already) to process, or just trigger each worker thread (make it active) at the time you want to trigger each execution. In the first case you'll be taking advantage of parallel threads to do a big chunk of work and minimize io wait times, but since the work is in a queue, the workers will process one at a time. Meaning if you schedule two things too close together and the previous element is still being processed, the new item will have to wait (Depending on your programming logic and amount of worker threads some workers might start processing the new item, but is a bit more complex logic).
In the second case your control thread will actually trigger the start of a new thread (or group of threads) each time you want to trigger a scheduled action. If there's big data to process you might need to spawn a new queue for each task to process and create a group of worker threads for it for each task, but if the data is not that big then you can just get away with having the worker process just one data package and be done once execution is done and you get a result. Either way this method will allow you to schedule tasks without limitation on how close they can be, since new independent worker threads will be created for them every time.
Finally, you might want to create an "output queue" and output thread, to store and process (or output, or anything else you want to do with it...) the results of each worker threads.
The control thread will be basically trying to imitate cron in its logic, triggering actions at certain times depending on how it was configured.
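A minimal sketch of the first variant (one input queue plus a fixed pool of worker threads), using Python 3 naming; the four-thread pool, the 60-second interval and the dummy task are assumptions for illustration:

import queue
import threading
import time

work_queue = queue.Queue()

def worker():
    while True:
        task = work_queue.get()   # block until the control thread enqueues work
        try:
            task()                # run the scheduled callable
        finally:
            work_queue.task_done()

def control(interval=60):
    while True:
        work_queue.put(lambda: print("running scheduled command"))  # dummy task
        time.sleep(interval)      # crude cron-like timing

for _ in range(4):                # fixed pool of worker threads
    threading.Thread(target=worker, daemon=True).start()

control()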
There's also a multiprocessing module in Python which works with processes instead and takes advantage of true multi-core hardware, but I don't think you'll really need it in this case, unless you see performance issues caused by CPU-bound work.
If you need any clarification, help, examples, just let me know.
I have a computationally intensive program doing calculations that I intend to parallelise. It is written in Python and I hope to use the multiprocessing module. I would like some help understanding what I would need to do to have one program, run from my laptop, control the entire process.
I have two options in terms of which computers I can use. One is computers which I can access through ssh user@comp1.com from the terminal (not sure how to access them through Python) and then run an instance there, although I'd like a more programmatic way to get to them than that. It seems that if I ran a remote-manager-type application it would work?
The second option I was thinking of is utilising AWS EC2 servers (I think that is what I need). I found boto, which I've never used but which seems to provide an interface to control the AWS system. I feel that I would then need something to actually distribute jobs on AWS, probably similarly to option 1 (?). I'm a bit in the dark here.
EDIT:
To give you an idea of how parallelisable it is:
res = []
for param in Parameters:
    res.append(FunctionA(param))
Parameters2 = FunctionB(res)
res2 = []
for param in Parameters2:
    res2.append(FunctionC(param))
return res, res2
So the two loops are basically where I can send off many param values to be run in parallel and I know how to recombine them to create res as long as I know which param they came from. Then I need to group them all together to get Parameters2 and then the second part is again parallelisable.
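Just to sketch the local, single-machine version of this with multiprocessing (FunctionA, FunctionB, FunctionC and Parameters are the ones from the snippet above and must be importable/picklable; the pool size defaults to the number of cores):

from multiprocessing import Pool

def compute(Parameters):
    with Pool() as pool:
        res = pool.map(FunctionA, Parameters)     # first parallel loop; results keep input order
        Parameters2 = FunctionB(res)              # sequential recombination step
        res2 = pool.map(FunctionC, Parameters2)   # second parallel loop
    return res, res2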
You would want to use the multiprocessing module only if you want the processes to share data in memory. That is something I would recommend ONLY if you absolutely have to have shared memory due to performance considerations. Python multiprocessing applications are non-trivial to write and debug.
If you are doing something like the distributed.net or SETI@home projects, where even though the tasks are computationally intensive they are reasonably isolated, you can follow this process:
Create a master application that breaks the large task down into smaller computation chunks (assuming that the task can be broken down and the results then combined centrally).
Create Python code that takes a task from the server (perhaps as a file or some other one-time communication with instructions on what to do), and run multiple copies of these Python processes.
These Python processes will work independently of each other, process data, and then return the results to the master process for collation.
You could run these processes on AWS single-core instances if you wanted, or use your laptop to run as many copies as you have cores to spare.
EDIT: Based on the updated question
So your master process will create files (or some other data structures) that contain the parameter info, as many files as you have params to process. These files will be stored in a shared folder called needed-work.
Each Python worker (on an AWS instance) will look at the needed-work shared folder for available files to work on (or wait on a socket for the master process to assign a file to it).
The Python process that takes on a file that needs work will work on it and store the result in a separate shared folder, with the parameter as part of the file structure.
The master process will look at the files in the work-done folder, process these files and generate the combined response.
This whole solution could be implemented with sockets as well, where the workers listen on sockets for the master to assign work to them, and the master waits on a socket for the workers to submit responses.
The file-based approach would require a way for the workers to make sure that the work they pick up is not also taken by another worker. This could be solved by having separate work folders for each worker, with the master process deciding when a worker needs more work.
Workers could delete the files that they pick up from the work folder, and the master process could keep watch on when a folder is empty and add more work files to it.
Again, it is more elegant to do this with sockets if you are comfortable with that.
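A very rough sketch of the file-based variant under the assumptions above: folder names needed-work and work-done, a rename-based claim so two workers don't pick up the same file, and a placeholder process() function standing in for the real computation:

import os
import time

NEEDED, DONE = 'needed-work', 'work-done'

def process(param):
    # placeholder for the real computation
    return param.upper()

while True:
    for name in os.listdir(NEEDED):
        if name.endswith('.claimed'):
            continue                      # already claimed by some worker
        src = os.path.join(NEEDED, name)
        claimed = src + '.claimed'
        try:
            os.rename(src, claimed)       # claim the file; fails if another worker was faster
        except OSError:
            continue
        with open(claimed) as f:
            param = f.read().strip()
        with open(os.path.join(DONE, name), 'w') as out:
            out.write(str(process(param)))
        os.remove(claimed)                # clean up the claimed input file
    time.sleep(1)                         # poll interval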
I have a web scraper (command-line scripts) written in Python that runs on 4-5 Amazon EC2 instances.
What I do is place a copy of these Python scripts on the EC2 servers and run them.
So the next time I change the program I have to do it for all the copies.
So you can see the problem of redundancy, management and monitoring.
So, to reduce the redundancy and for easy management, I want to place the code on a separate server from which it can be executed on the other EC2 servers, and also monitor these Python programs and the logs they create, through a Django/web interface situated on this server.
There are at least two issues you're dealing with:
monitoring of execution of the scraping tasks
deployment of code to multiple servers
and each of them requires a different solution.
In general I would recommend using a task queue for this kind of assignment (I have tried Celery running on Amazon EC2 and was very pleased with it).
One advantage of the task queue is that it abstracts the definition of the task from the worker which actually performs it. So you send the tasks to the queue, and then a variable number of workers (servers with multiple workers) process those tasks by asking for them one at a time. Each worker, when idle, connects to the queue and asks for some work. If it receives a task, it starts processing it. Then it may send the results back, ask for another task, and so on.
This means that the number of workers can change over time, and they will process the tasks from the queue automatically until there are no more tasks to process. A nice use case for this is Amazon's Spot instances, which can greatly reduce the cost. Just send your tasks to the queue, create X spot requests and watch the servers process your tasks. You don't really need to care about servers going up and down at any moment because the price went above your bid. That's nice, isn't it?
Now, this implicitly takes care of monitoring - because celery has tools for monitoring the queue and processing, it can even be integrated with django using django-celery.
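As a tiny sketch of what the worker side can look like with Celery (assuming the code lives in a file called scraper.py; the broker URL and the scrape task body are placeholders you would fill in):

from celery import Celery

app = Celery('scraper', broker='amqp://guest@localhost//')

@app.task
def scrape(url):
    # the real scraping logic goes here; the return value becomes the task result
    return url

The producers then only call scrape.delay('http://example.com/...'), and any machine running celery -A scraper worker will pick the task up.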
When it comes to deployment of code to multiple servers, Celery doesn't support that. The reasons behind this are of a different nature; see e.g. this discussion. One of them might be that it's just difficult to implement.
I think it's possible to live without it, but if you really care, I think there's a relatively simple DIY solution. Put your code under VCS (I recommend Git) and check for updates on a regular basis. If there's an update, run a bash script which kills your workers, makes all the updates and starts the workers again so that they can process more tasks. Given Celery's ability to handle failure, this should work just fine.
I need a framework which will allow me to do the following:
Allow tasks to be defined dynamically (I'll read an external configuration file and create the tasks/jobs; a task = spawning an external command, for instance)
Provide a way of specifying dependencies on existing tasks (e.g. task A will be run after task B is finished)
Be able to run tasks in parallel in multiple processes if the execution order allows it (i.e. no task interdependencies)
Allow a task to depend on some external event (don't know exactly how to describe this, but some tasks finish and they will produce results after a while, like a background running job; I need to specify some of the tasks to depend on this background-job-completed event)
Undo/Rollback support: if one task fails, try to undo everything that has been executed before (I don't expect this to be implemented in any framework, but I guess it's worth asking...)
So, obviously, this looks more or less like a build system, but I don't seem to be able to find something that will allow me to dynamically create tasks; most things I've seen already have the tasks defined in the "Makefile".
Any ideas?
I've been doing a little more research and I've stumbled upon doit, which provides the core functionality I need without being overkill (not saying that Celery wouldn't have done the job, but doit does it better for my use case).
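For reference, a doit task file (dodo.py) roughly corresponding to the "task A runs after task B" requirement could look like the sketch below; the echo actions are stand-ins for the real commands:

def task_taskB():
    return {'actions': ['echo running task B']}

def task_taskA():
    return {
        'actions': ['echo running task A'],
        'task_dep': ['taskB'],   # taskA runs only after taskB has finished
    }

Running doit executes both tasks in dependency order; independent tasks can also be run in parallel with doit's -n/--process option.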
Another option is to use make.
Write a Makefile manually or let a Python script write it (see the sketch after this list)
use meaningful intermediate output file stages
Run make, which will then invoke the processes. Each process would be a Python (build) script with parameters that tell it which files to work on and what task to do.
parallel execution is supported with -j
it also deletes output files if tasks fail
This circumvents some of the python parallelisation problems (GIL, serialisation).
Obviously only straightforward on *nix platforms.
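To illustrate the "let a Python script write it" part, here is a sketch that generates a Makefile with one rule per parameter; the in_*/out_* file names and the process.py helper are invented for the example:

params = ['a', 'b', 'c']

with open('Makefile', 'w') as mk:
    mk.write('all: %s\n\n' % ' '.join('out_%s.txt' % p for p in params))
    for p in params:
        mk.write('out_%s.txt: in_%s.txt\n' % (p, p))
        mk.write('\tpython process.py in_%s.txt out_%s.txt\n\n' % (p, p))

Running make -j4 afterwards executes up to four of the process.py invocations in parallel.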
AFAIK, there is no such framework in Python which does exactly what you describe. So your options are either to build something on your own or to bend some of your requirements and model them using an existing tool. Which smells like Celery.
You may have a Celery task which reads a configuration file containing some Python functions' source code, then use exec or eval to execute them.
Celery provides a way to define subtasks (dependencies between tasks), so if you are aware of your dependencies, you can model them accordingly.
Provided that you know the execution order of your tasks you can route them to as many worker machines as you want.
You can periodically poll this background job's result and then start your tasks that are dependent on it.
Undo/Rollback: this might be tricky and depends on what you want to undo; results? state?
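A small sketch of modelling the "task A runs after task B" dependency with Celery signatures and chain (the broker URL and the task bodies are placeholders):

from celery import Celery, chain

app = Celery('pipeline', broker='amqp://guest@localhost//')

@app.task
def task_b():
    return 'B finished'

@app.task
def task_a(previous_result):
    return 'A ran after: %s' % previous_result

# task_a only starts once task_b has completed, receiving its result
chain(task_b.s(), task_a.s()).delay()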