Watchfiles runtime path changes - python

Playing around with the new Rust-backed Python package watchfiles. I have a use case which requires modifying the list of paths being watched at runtime. We would like to provide our users a service to set up a path to be watched for events that trigger jobs to run on the server. E.g. watch "/path/to/ftp/location" for files added and run "my_file_mover_task" when an event occurs.
Does anyone have a good pattern for modifying the list of watched paths at runtime? All of the examples in the docs seem to focus on a known list of paths to watch at startup.
We've considered a few approaches:
Restart the watcher every time the path list is modified. I think we run the risk of missing events which occur while the watcher is being reinitialized. Maybe some A/B changeover logic could work, but then we run the risk of duplicating events.
Use awatch to create an async watcher for each path to be watched. We could keep a list or dictionary of these watchers, which would allow us to stop/start watching on a per-path basis at runtime (see the sketch below). I believe awatch uses anyio to spawn a thread to support the asyncio API. We will potentially be watching thousands of paths, though, so I'm concerned that we may spawn way too many threads.
We have also considered the watchdog library as an alternative, but it seems to run into the same issues.
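For the per-path awatch approach, a minimal sketch might look like the following. It assumes awatch's documented stop_event parameter accepts an asyncio.Event; the WatchManager name and the on_change callback are our own inventions, not part of watchfiles:

import asyncio
from watchfiles import awatch

class WatchManager:
    def __init__(self, on_change):
        self.on_change = on_change   # async callback(path, changes)
        self._watchers = {}          # path -> (task, stop_event)

    async def _run(self, path, stop_event):
        # awatch yields sets of (Change, path) tuples until stop_event is set
        async for changes in awatch(path, stop_event=stop_event):
            await self.on_change(path, changes)

    def add(self, path):
        if path in self._watchers:
            return
        stop_event = asyncio.Event()
        task = asyncio.create_task(self._run(path, stop_event))
        self._watchers[path] = (task, stop_event)

    def remove(self, path):
        task, stop_event = self._watchers.pop(path)
        stop_event.set()  # lets awatch exit cleanly instead of cancelling the task

This sidesteps the restart/changeover problem, but doesn't address the thread-count concern when scaling to thousands of paths.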


Amazon AWS - python for beginners

I have a computationally intensive program doing calculations that I intend to parallelise. It is written in Python and I hope to use the multiprocessing module. I would like some help with understanding what I would need to do to have one program run from my laptop controlling the entire process.
I have two options in terms of what computers I can use. One is computers which I can access through ssh user@comp1.com from the terminal (not sure how to access them through Python) and then run the instance there, although I'd like a more programmatic way to get to them than that. It seems that if I ran a remote-manager-type application it would work?
The second option I was thinking of is utilising AWS EC2 servers (I think that is what I need). I found boto, which I've never used but which seems to provide an interface to control the AWS system. I feel that I would then need something to actually distribute jobs on AWS, probably similarly to option 1(?). I'm a bit in the dark here.
EDIT:
To give you an idea of how parallelisable it is:
def run_pipeline(Parameters):  # the return below implies a function context
    res = []
    for param in Parameters:
        res.append(FunctionA(param))
    Parameters2 = FunctionB(res)
    res2 = []
    for param in Parameters2:
        res2.append(FunctionC(param))
    return res, res2
So the two loops are basically where I can send off many param values to be run in parallel and I know how to recombine them to create res as long as I know which param they came from. Then I need to group them all together to get Parameters2 and then the second part is again parallelisable.
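(For illustration, here is a minimal local version of exactly that structure using the standard multiprocessing.Pool. FunctionA/B/C are the placeholders from the snippet above; for pickling they need to be importable top-level functions:)

from multiprocessing import Pool

def run_pipeline(Parameters):
    pool = Pool()  # defaults to one worker process per core
    # map preserves input order, so each result lines up with its param
    res = pool.map(FunctionA, Parameters)
    Parameters2 = FunctionB(res)  # serial recombination step
    res2 = pool.map(FunctionC, Parameters2)
    pool.close()
    pool.join()
    return res, res2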
You would want to use the multiprocessing module only if you want the processes to share data in memory. That is something I would recommend ONLY if you absolutely have to have shared memory due to performance considerations. Python multiprocessing applications are non-trivial to write and debug.
If you are doing something like the distributed.net or SETI@home projects, where even though the tasks are computationally intensive they are reasonably isolated, you can follow this process:
Create a master application that would break down the large task into smaller computation chunks (assuming that the task can be broken down and the results then can be combined centrally).
Create Python code that would take the task from the server (perhaps as a file or some other one-time communication with instructions on what to do) and run multiple copies of these Python processes
These python processes will work independently from each other, process data and then return the results to the master process for collation of results.
You could run these processes on AWS single-core instances if you wanted, or use your laptop to run as many copies as you have cores to spare.
EDIT: Based on the updated question
So your master process will create files (or some other data structures) that will have the parameter info in them, as many files as you have params to process. These files will be stored in a shared folder called needed-work.
Each python worker (on AWS instances) will look at the needed-work shared folder, looking for available files to work on (or wait on a socket for the master process to assign the file to them).
The Python process that takes on a file that needs work will work on it and store the result in a separate shared folder, with the parameter as part of the file structure.
The master process will look at the files in the work-done folder, process these files and generate the combined response
This whole solution could be implemented with sockets as well, where workers will listen on sockets for the master to assign work to them, and the master will wait on a socket for the workers to submit responses.
The file-based approach would require a way for the workers to make sure that the work they pick up is not taken on by another worker. This could be fixed by having separate work folders for each worker, with the master process deciding when there needs to be more work for a worker.
Workers could delete files that they pick up from the work folder and master process could keep a watch on when a folder is empty and add more work files to it.
Again, it is more elegant to do this with sockets if you are comfortable with that; a sketch of the file-based variant follows below.
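A minimal sketch of the file-based hand-off (the folder names needed-work/work-done come from the description above; the JSON job format and the atomic-rename claim step are assumptions for illustration):

import json
import os
import time

NEEDED = "needed-work"
DONE = "work-done"

def master(params):
    # one file per parameter, as described above
    for i, param in enumerate(params):
        with open(os.path.join(NEEDED, "job-%04d.json" % i), "w") as f:
            json.dump(param, f)

def worker(worker_id, func):
    while True:
        for name in os.listdir(NEEDED):
            src = os.path.join(NEEDED, name)
            claimed = src + "." + worker_id
            try:
                os.rename(src, claimed)  # atomic on POSIX: only one worker wins
            except OSError:
                continue                 # another worker claimed it first
            with open(claimed) as f:
                param = json.load(f)
            result = func(param)
            with open(os.path.join(DONE, name), "w") as f:
                json.dump({"param": param, "result": result}, f)
            os.remove(claimed)
        time.sleep(1)  # poll interval

The rename-to-claim step is what prevents two workers from picking up the same file without needing per-worker folders.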

SaltStack: Manage and Query a Tally/Threshold via events and salt-call?

I have over 100 web server instances running a PHP application using APC, and we occasionally (on the order of once per week across the entire fleet) see a corruption in one of the caches which results in a distinctive error log message.
Once this occurs, the application is dead on that node and any transactions routed to it will fail.
I've written a simple wrapper around tail -F which can spot the pattern any time it appears in the log file and evaluate a shell command (using bash eval) to react. I have this using the salt-call command from SaltStack to trigger processing of a custom module which shuts down the nginx server, warms (refreshes) the cache, and, of course, restarts the web server. (Actually I have two forms of this wrapper, bash and Python.)
This is fine, and the frequency of events is such that it's unlikely to be an issue. However, my boss is, quite reasonably, concerned about a common-mode failure pattern ... that the regular expression might appear in too many of these logs at once and take down the entire site.
My first thought would be to wrap my salt-call in a Redis check (we already have a Redis infrastructure used for caching and certain other data structures). That would be implemented as an integer with an expiration. The check would call INCR, check the result, and sleep if more than N were returned (or if the Redis server were unreachable). If the result were below the threshold then salt-call would be dispatched, and a decrement would be called after the server is back up and running. (Expiration of the Redis key would kill off any stale increments after perhaps a day or even a few hours ... our alerting system will already have notified us of down servers and our response time is more than adequate for such time frames.)
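That check might look something like this minimal sketch (the key name, threshold, TTL, and the mymodule.restart_web salt-call target are all illustrative assumptions):

import subprocess
import redis

r = redis.Redis()
KEY = "apc-remediations-in-flight"  # hypothetical key name
LIMIT = 5                            # N: max concurrent remediations

def try_remediate():
    try:
        count = r.incr(KEY)
        r.expire(KEY, 3600)  # stale increments die off after an hour
    except redis.RedisError:
        return False         # Redis unreachable: sleep and retry, don't restart
    if count > LIMIT:
        r.decr(KEY)
        return False         # too many nodes already restarting
    try:
        subprocess.check_call(["salt-call", "mymodule.restart_web"])
    finally:
        r.decr(KEY)          # release the slot once the server is back up
    return True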
However, I was reading about the SaltStack event handling features and wondering if it would be better to use that instead. (Advantage: the nodes don't have the redis-cli command tool nor the Python Redis libraries, but, obviously, salt-call is already there with its requisite support.) So using something in Salt would minimize the need to add additional packages and dependencies to these systems. (Alternatively I could just write all the Redis handling as a separate PHP command-line utility and have my shell script call that.)
Is there a HOWTO for writing simple SaltStack modules? The docs seem to plunge deeply into reference details without any orientation. Even some suggestions about which terms to search on would be helpful (because their use of terms like pillars, grains, minions, and so on seems somewhat opaque).
The main doc for writing a Salt module is here: http://docs.saltstack.com/en/latest/ref/modules/index.html
There are many modules shipped with Salt that might be helpful for inspiration. You can find them here: https://github.com/saltstack/salt/tree/develop/salt/modules
One thing to keep in mind is that the Salt Minion doesn't do anything unless you tell it to do something. So you could create a module that checks for the error pattern you mention, but you'd need to add it to the Salt Scheduler or cron to make sure it gets run frequently.
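For orientation, an execution module is just a Python file dropped into your file roots (e.g. /srv/salt/_modules) and synced to the minions with saltutil.sync_modules. A hypothetical minimal sketch (the module name, log path, and pattern string are all assumptions):

# /srv/salt/_modules/apccheck.py
import os

LOG = "/var/log/php/error.log"      # wherever your distinctive message lands
PATTERN = "apc cache corruption"    # the distinctive error string

def check(lines=200):
    '''
    Return True if the corruption pattern appears in the log tail.
    CLI example: salt-call apccheck.check
    '''
    if not os.path.exists(LOG):
        return False
    with open(LOG) as f:
        tail = f.readlines()[-lines:]
    return any(PATTERN in line for line in tail)

You would then run salt-call apccheck.check from the scheduler or cron, as mentioned above, and react when it returns True.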
If you need more help you'll find helpful people on IRC in #salt on freenode.

Python framework for task execution and dependency handling

I need a framework which will allow me to do the following:
Allow tasks to be defined dynamically (I'll read an external configuration file and create the tasks/jobs; a task = spawning an external command, for instance)
Provide a way of specifying dependencies on existing tasks (e.g. task A will be run after task B is finished)
Be able to run tasks in parallel in multiple processes if the execution order allows it (i.e. no task interdependencies)
Allow a task to depend on some external event (I don't know exactly how to describe this, but some tasks finish and will produce results after a while, like a background-running job; I need to specify some of the tasks to depend on this background-job-completed event)
Undo/Rollback support: if one task fails, try to undo everything that has been executed before (I don't expect this to be implemented in any framework, but I guess it's worth asking...)
So, obviously, this looks more or less like a build system, but I don't seem to be able to find something that will allow me to dynamically create tasks; most things I've seen already have the tasks defined in the "Makefile".
Any ideas?
I've been doing a little more research and I've stumbled upon doit, which provides the core functionality I need without being overkill (not saying that Celery wouldn't have done the job, but this does it better for my use case).
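For the record, a hypothetical dodo.py sketch of the dynamic-task part (the jobs.cfg format, one shell command per line, is an assumption):

def task_convert():
    # dynamically generate one sub-task per line of the config file
    with open("jobs.cfg") as f:
        commands = [line.strip() for line in f if line.strip()]
    for i, cmd in enumerate(commands):
        yield {
            "name": "job%d" % i,
            "actions": [cmd],   # each action spawns an external command
        }

Dependencies between tasks go in a task_dep entry of the yielded dict, and doit -n 4 runs independent tasks in parallel.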
Another option is to use make.
Write a Makefile manually or let a Python script write it (see the sketch after this list)
use meaningful intermediate output file stages
Run make, which should then invoke the processes. Each process would be a Python (build) script with parameters that tell it which files to work on and what task to do.
parallel execution is supported with -j
it also deletes output files if tasks fail
This circumvents some of the Python parallelisation problems (GIL, serialisation).
Obviously only straightforward on *nix platforms.
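A hypothetical sketch of the generate-then-make approach (the file names and the build.py command-line interface are assumptions):

jobs = ["a", "b", "c"]  # would come from the external configuration file

with open("Makefile", "w") as mk:
    results = " ".join("out/%s.res" % j for j in jobs)
    mk.write("all: combined.res\n\n")
    mk.write("combined.res: %s\n" % results)
    mk.write("\tpython build.py combine $^ -o $@\n\n")  # recipes need a leading tab
    for j in jobs:
        mk.write("out/%s.res: in/%s.dat\n" % (j, j))
        mk.write("\tpython build.py process $< -o $@\n\n")

Running make -j8 then executes independent rules in parallel, with make tracking which outputs are already up to date.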
AFAIK, there is no framework in Python which does exactly what you describe. So your options include either building something of your own or hacking some bits of your requirements and modelling them with an existing tool. Which smells like Celery.
You may have a Celery task which reads a configuration file containing some Python functions' source code, then use exec or eval to run them.
Celery provides a way to define subtasks (dependencies between tasks), so if you are aware of your dependencies, you can model them accordingly.
Provided that you know the execution order of your tasks you can route them to as many worker machines as you want.
You can periodically poll this background job's result and then start your tasks that are dependent on it.
Undo/Rollback: this might be tricky and depends on what you want to undo; results? state?
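A minimal canvas sketch of the dependency part (tasks.py and the task names are assumptions; both would be defined with @app.task):

from celery import chain
from tasks import task_a, task_b  # hypothetical @app.task functions

# task_b starts only after task_a finishes, receiving its result;
# independent chains are distributed across however many workers you run.
workflow = chain(task_a.s("config.ini"), task_b.s())
workflow.apply_async()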

How do I run long term (infinite) Python processes?

I've recently started experimenting with using Python for web development. So far I've had some success using Apache with mod_wsgi and the Django web framework for Python 2.7. However, I have run into some issues with having processes constantly running, updating information and such.
I have written a script I call "daemonManager.py" that can start and stop all or individual Python update loops (should I call them daemons?). It does that by forking, then loading the module for the specific functions it should run, and starting an infinite loop. It saves a PID file in /var/run to keep track of the process. So far so good. The problems I've encountered are:
Now and then one of the processes will just quit. I check ps in the morning and the process is just gone. No errors were logged (I'm using the logging module), and I'm covering every exception I can think of and logging them. Also, I don't think these quitting processes have anything to do with my code, because all my processes run completely different code and exit at pretty similar intervals. I could be wrong of course. Is it normal for Python processes to just die after they've run for days/weeks? How should I tackle this problem? Should I write another daemon that periodically checks if the other daemons are still running? What if that daemon stops? I'm at a loss on how to handle this.
How can I programmatically know if a process is still running or not? I'm saving the PID files in /var/run and checking if the PID file is there to determine whether or not the process is running. But if the process just dies of unexpected causes, the PID file will remain. I therefore have to delete these files every time a process crashes (a couple of times per week), which sort of defeats the purpose. I guess I could check if a process is running at the PID in the file, but what if another process has started and was assigned the PID of the dead process? My daemon would think that the process is running fine even if it's long dead. Again I'm at a loss just how to deal with this.
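(For reference, the usual liveness check is to send signal 0 to the PID from the file, though as noted above it still can't fully rule out PID reuse. A minimal sketch:)

import errno
import os

def pid_running(pidfile):
    # signal 0 checks existence/permissions without delivering a signal
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (IOError, ValueError):
        return False
    try:
        os.kill(pid, 0)
    except OSError as e:
        # EPERM: a process exists but isn't ours; ESRCH: no such process
        return e.errno == errno.EPERM
    return True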
I'll accept any useful answer on how best to run infinite Python processes, hopefully one that also sheds some light on the problems above.
I'm using Apache 2.2.14 on an Ubuntu machine.
My Python version is 2.7.2
I'll open by stating that this is one way to manage a long-running process (LRP) -- not the de facto way by any stretch.
In my experience, the best possible product comes from concentrating on the specific problem you're dealing with, while delegating supporting tech to other libraries. In this case, I'm referring to the act of backgrounding processes (the art of the double fork), monitoring, and log redirection.
My favorite solution is http://supervisord.org/
Using a system like supervisord, you basically write a conventional Python script that performs a task while stuck in an "infinite" loop.
#!/usr/bin/python
import sys
import time

def main_loop():
    while 1:
        # do your stuff...
        time.sleep(0.1)

if __name__ == '__main__':
    try:
        main_loop()
    except KeyboardInterrupt:
        print >> sys.stderr, '\nExiting by user request.\n'
        sys.exit(0)
Writing your script this way makes it simple and convenient to develop and debug (you can easily start/stop it in a terminal, watching the log output as events unfold). When it comes time to throw into production, you simply define a supervisor config that calls your script (here's the full example for defining a "program", much of which is optional: http://supervisord.org/configuration.html#program-x-section-example).
Supervisor has a bunch of configuration options so I won't enumerate them, but I will say that it specifically solves the problems you describe:
Backgrounding/Daemonizing
PID tracking (can be configured to restart a process should it terminate unexpectedly)
Log redirection: log normally in your script (use a stream handler if you're using the logging module, rather than printing) and let supervisor redirect the output to a file for you.
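A minimal program section might look like this hypothetical sketch (the program name and paths are assumptions; autorestart is the option that handles unexpected deaths):

; /etc/supervisor/conf.d/updater.conf
[program:updater]
command=/usr/bin/python /opt/daemons/updater.py
autostart=true
autorestart=true                         ; respawn if the process dies
stdout_logfile=/var/log/updater.out.log  ; supervisor handles redirection
stderr_logfile=/var/log/updater.err.log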
You should consider Python processes as able to run "forever" assuming you don't have any memory leaks in your program, the Python interpreter, or any of the Python libraries / modules that you are using. (Even in the face of memory leaks, you might be able to run forever if you have sufficient swap space on a 64-bit machine. Decades, if not centuries, should be doable. I've had Python processes survive just fine for nearly two years on limited hardware -- before the hardware needed to be moved.)
Ensuring programs restart when they die used to be very simple back when Linux distributions used SysV-style init -- you just add a new line to the /etc/inittab and init(8) would spawn your program at boot and re-spawn it if it dies. (I know of no mechanism to replicate this functionality with the new upstart init-replacement that many distributions are using these days. I'm not saying it is impossible, I just don't know how to do it.)
But even the init(8) mechanism of years gone by wasn't as flexible as some would have liked. The daemontools package by DJB is one example of process control-and-monitoring tools intended to keep daemons living forever. The Linux-HA suite provides another similar tool, though it might provide too much "extra" functionality to be justified for this task. monit is another option.
I assume you are running Unix/Linux, but you don't really say. I have no direct advice on your issue, so I don't expect this to be the "right" answer to this question. But there is something to explore here.
First, if your daemons are crashing, you should fix that. Only programs with bugs should crash. Perhaps you should launch them under a debugger and see what happens when they crash (if that's possible). Do you have any trace logging in these processes? If not, add it. That might help diagnose your crash.
Second, are your daemons providing services (opening pipes and waiting for requests) or are they performing periodic cleanup? If they are periodic cleanup processes you should use cron to launch them periodically rather than have them run in an infinite loop. Cron processes should be preferred over daemon processes. Similarly, if they are services that open ports and service requests, have you considered making them work with INETD? Again, a single daemon (inetd) should be preferred to a bunch of daemon processes.
Third, saving a PID in a file is not very effective, as you've discovered. Perhaps a shared IPC, like a semaphore, would work better. I don't have any details here though.
Fourth, sometimes I need stuff to run in the context of the website. I use a cron process that calls wget with a maintenance URL. You set a special cookie and include the cookie info with the wget command line. If the special cookie doesn't exist, return 403 rather than performing the maintenance process. The other benefit is that database logins and other environmental concerns are avoided, since the same code that serves normal web pages also serves the maintenance process.
Hope that gives you ideas. I think avoiding daemons if you can is the best place to start. If you can run your python within mod_wsgi that saves you having to support multiple "environments". Debugging a process that fails after running for days at a time is just brutal.

Advice: Python Framework Server/Worker Queue management (not Website)

I am looking for some advice/opinions on which Python framework to use in an implementation of multiple 'Worker' PCs co-ordinated from a central Queue Manager.
For completeness, the 'Worker' PCs will be running Audio Conversion routines (which I do not need advice on, and have standalone code that works).
The Audio conversion takes a long time, and I need to co-ordinate an arbitrary number of the 'Workers' from a central location, handing them conversion tasks (such as where to get the source files, or where to ask for the job configuration) with them reporting back some additional info, such as the runtime of the converted audio etc.
At present, I have a script that makes a webservice call to get the 'configuration' for a conversion task, based on source files located on the worker already (we manually copy the source files to the worker, and that triggers a conversion routine). I want to change this, so that we can distribute conversion tasks ("Oy you, process this: xxx") based on availability, and in an ideal world, based on pending tasks too.
There is a chance that Workers can go offline mid-conversion (but this is not likely).
All the workers are Windows-based; the co-ordinator can be Windows or Linux.
I have (in my initial searches) come across the following - and I know that some are cross-dependent:
Celery (with RabbitMQ)
Twisted
Django
Using a framework, rather than home-brewing, seems to make more sense to me right now. I have a limited timeframe in which to develop this functional extension.
An additional consideration would be using a framework that is compatible with PyQt/PySide so that I can write a simple UI to display queue status etc.
I appreciate that the specifics above are a little vague, and I hope that someone can offer me a pointer or two.
Again: I am looking for general advice on which Python framework to investigate further, for developing a Server/Worker 'Queue management' solution, for non-web activities (this is why Django didn't seem the right fit).
How about using Pyro? It gives you remote-object capability, and you just need a client script to coordinate the work.
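A hypothetical Pyro4 sketch of the manager side (the class name, port, and job format are assumptions; each Windows worker would connect with Pyro4.Proxy and poll get_job):

import Pyro4

@Pyro4.expose
class QueueManager(object):
    def __init__(self):
        self.pending = []   # conversion jobs waiting for a worker
        self.results = {}   # job -> reported runtime of the converted audio

    def add_job(self, job):
        self.pending.append(job)

    def get_job(self):
        # workers call this remotely; None means nothing is queued
        return self.pending.pop(0) if self.pending else None

    def report(self, job, runtime):
        self.results[job] = runtime

daemon = Pyro4.Daemon(host="0.0.0.0", port=9090)
uri = daemon.register(QueueManager(), objectId="queue")
print("Manager listening at %s" % uri)  # PYRO:queue@<host>:9090
daemon.requestLoop()

A worker would then do mgr = Pyro4.Proxy("PYRO:queue@manager-host:9090") and loop on mgr.get_job() / mgr.report(...).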
