I've written a program in Python that takes data (any kind), organizes it, and sends it to a remote service for visualization. The API just needs a configuration file that specifies what types of data it should expect, and then it's ready to go.
I then have lots of different programs that all report some form of statistics during their execution. What I want to do is connect them to my API without having to modify the programs themselves, and this is proving tricky.
It's possible to just instantiate an API object in each class and then call the methods it provides; however, this is too much of a hassle - you should be able to use it without having the API classes in your own repository.
One solution we came up with was to simply print all of the output into a file and have the API listen to this file - once it detected something new, it would read it and send it off. However, this is too slow for our needs, as some of the programs produce vast amounts of data very quickly.
Is there any other solution to my problem?
Thanks!
I have code that contains a variable that I want to change manually, whenever I want, without stopping the main loop or pausing it (with input()). I can't find a library that allows me to set the value manually at runtime, or to access the process's memory to change it.
For now I have set up a file watcher that reads the parameters every minute, but I presume this is an inefficient way to do it.
I guess you just want to expose an API. You did it with files, which works but is less common. You can use common best practices such as:
HTTP web-server. You can do it quickly with Flask/bottle.
gRPC
pub/sub mechanism - Redis, Kafka (more complicated, requires another storage solution - the DB itself).
I guess there are tons of other solutions, but you get the idea. I hope that's what you're looking for.
With those solutions you can manually trigger your endpoint and change whatever you want in your application.
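To give a feel for the Flask option, here is a minimal sketch (the endpoint and parameter names are made up): the app runs in a daemon thread so your main loop keeps going, and one endpoint overwrites a shared value.

import threading
import time

from flask import Flask, request

app = Flask(__name__)
params = {"threshold": 1.0}  # the value you want to change while the program runs

@app.route("/set", methods=["POST"])
def set_param():
    # overwrite the shared parameter with whatever the client sent
    params["threshold"] = float(request.form["threshold"])
    return "ok\n"

# run the web server in a background thread so it doesn't block the main loop
threading.Thread(target=lambda: app.run(port=5000), daemon=True).start()

while True:  # your existing main loop
    print("current threshold:", params["threshold"])
    time.sleep(5)

You can then change the value from another terminal with, for example, curl -X POST -d "threshold=2.5" http://localhost:5000/set.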
I am currently developing a data processing web server (Linux) using Python Flask.
The general workflow is:
1. Get an input file from the user (handled by Python Flask)
2. Flask passes this input file to a Java program
3. The Java program processes this input file and saves the outputs (multiple files) on the server.
4. Flask calls another Python script which processes these outputs to get the final result and returns it to the client.
The problem is that between step 3 and step 4 there exist some intermediate files. This would not have been a problem at all if this were a local program, but as a server program, when more than one client accesses it at the same time, a client could get unexpected results generated from input provided by another user who is using the web program simultaneously.
The way I see it, this is a kind of mutual exclusion problem on file access. I have had mutual exclusion problems with threads before, and solved some of them using thread locks, such as synchronization in Java and locks in Python, but I am not sure what to do when it comes to files instead of threads.
It occurred to me that maybe I can spawn different copies of the files for different clients. But as I understand it, HTTP is stateless, so you can't really know who is accessing the server. I don't want to add a login system and a user database to achieve this, as I sense there is a much simpler and better way to resolve this problem.
I have been looking for a good solution these days but haven't found an ideal one so I am looking for some advice here. Any suggestions will be highly appreciated. If you can suggest a viable solution, please feel free to provide me with your name so I can add you to the thank list of digital and paper publications about this tool when it's published.
As a systems kind of person, I suggest something like this:
https://docs.python.org/3/library/fcntl.html#fcntl.lockf
This is how I would solve it; there are many ways to solve this problem, and it is of course up for debate which one is best.
Assume the output file is where the conflict happens.
You lock the file and keep polling until the resource is released (the user needs to wait), so you force one user to access the file at a time. Poll with time.sleep for 2-3 seconds inside a try/except, holding the lock on the output file; only when the resource is released will the next user's process pass through normally.
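A rough sketch of that idea, assuming the conflict is on a single shared output file (the path and the work() callback are placeholders):

import fcntl
import time

def run_exclusively(output_path, work):
    with open(output_path, "a") as f:
        # keep polling until we get an exclusive lock on the output file
        while True:
            try:
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except OSError:
                time.sleep(2)  # another request holds the lock; wait and retry
        try:
            work(f)  # run the Java step / write the outputs while we hold the lock
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)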
Another easy way is to dump the data into an RDS like MySQL or Postgres; it will handle all of the file-access nightmares that come from concurrent requests (put the output file in a DB).
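As a rough illustration of the database route (using the stdlib sqlite3 module as a stand-in for MySQL/Postgres, with an invented table layout):

import sqlite3

def store_output(job_id, data):
    # each request writes its own row, so concurrent clients no longer
    # trample one shared output file
    conn = sqlite3.connect("results.db")
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS outputs (job_id TEXT, payload BLOB)"
        )
        conn.execute("INSERT INTO outputs VALUES (?, ?)", (job_id, data))
    conn.close()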
I've created a Python (and C, but the "controlling" part is Python) program for carrying out Bayesian inversion using Markov chain Monte Carlo methods. Unfortunately, MCMC can take several days to run. Part of my research is in reducing the time, but we can only reduce so much.
I'm running it on a headless CentOS 7 machine using nohup, since maintaining a connection and receiving prints for several days is not practical. However, I would like to be able to check in on my program occasionally to see how many iterations it's done, how many proposals have been accepted, whether it's out of burn-in, etc.
Is there something I can use to interact with the python process to get this info?
I would recommend SAWs (Scientific Application Web server). It creates a thread in your process to handle HTTP requests. The variables of interest are returned to the client in JSON format.
For the variables managed by the Python runtime, write them into a (JSON) file on the hard disk (or any shared memory) and use SimpleHTTPServer to serve it. The SAWs web interface on the client side can still be used, as long as you follow the JSON format of SAWs.
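For the second option, a minimal sketch (the field names are just examples): have the sampler rewrite a small status file every so often, and serve that directory with a separate SimpleHTTPServer / python -m http.server process.

import json
import os

def write_status(path, iteration, accepted, in_burn_in):
    status = {
        "iteration": iteration,
        "accepted_proposals": accepted,
        "in_burn_in": in_burn_in,
    }
    # write to a temp file and rename, so a reader never sees a half-written file
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(status, f)
    os.replace(tmp, path)

# inside the MCMC loop, e.g. every 1000 iterations:
# write_status("status.json", it, n_accepted, it < burn_in)

Running python -m http.server 8000 in that directory then lets you check http://your-host:8000/status.json from anywhere with curl or a browser.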
I am looking for some advice/opinions of which Python Framework to use in an implementation of multiple 'Worker' PCs co-ordinated from a central Queue Manager.
For completeness, the 'Worker' PCs will be running Audio Conversion routines (which I do not need advice on, and have standalone code that works).
The Audio conversion takes a long time, and I need to co-ordinate an arbitrary number of the 'Workers' from a central location, handing them conversion tasks (such as where to get the source files, or where to ask for the job configuration) with them reporting back some additional info, such as the runtime of the converted audio etc.
At present, I have a script that makes a webservice call to get the 'configuration' for a conversion task, based on source files located on the worker already (we manually copy the source files to the worker, and that triggers a conversion routine). I want to change this, so that we can distribute conversion tasks ("Oy you, process this: xxx") based on availability, and in an ideal world, based on pending tasks too.
There is a chance that Workers can go offline mid-conversion (but this is not likely).
All the workers are Windows based; the co-ordinator can be Windows or Linux.
I have (in my initial searches) come across the following - and I know that some are cross-dependent:
Celery (with RabbitMQ)
Twisted
Django
Using a framework, rather than home-brewing, seems to make more sense to me right now. I have a limited timeframe in which to develop this functional extension.
An additional consideration would be using a framework that is compatible with PyQt/PySide, so that I can write a simple UI to display queue status etc.
I appreciate that the specifics above are a little vague, and I hope that someone can offer me a pointer or two.
Again: I am looking for general advice on which Python framework to investigate further for developing a Server/Worker 'Queue management' solution for non-web activities (this is why Django didn't seem the right fit).
How about using Pyro? It gives you remote object capability, and you just need a client script to coordinate the work.
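To sketch what that might look like with Pyro4 (class, method, and host names are invented), each Windows worker exposes a remote object and the coordinator calls it like a local one:

# worker.py - run on each Windows box
import Pyro4

@Pyro4.expose
class AudioWorker(object):
    def convert(self, source_path, config):
        # call your existing standalone conversion routine here and
        # return whatever metadata the coordinator needs
        return {"source": source_path, "runtime_seconds": 123.4}

# bind to this worker's network name so the coordinator can reach it
daemon = Pyro4.Daemon(host="worker1.example.local")
uri = daemon.register(AudioWorker)
print("worker ready at", uri)
daemon.requestLoop()

# coordinator side (a name server could replace raw URIs):
# worker = Pyro4.Proxy(uri_of_an_idle_worker)
# info = worker.convert(r"\\fileserver\jobs\song.wav", {"bitrate": 192})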
Clarification: As per some of the comments, I should clarify that this is intended as a simple framework to allow execution of programs that are naturally parallel (so-called embarrassingly parallel programs). It isn't, and never will be, a solution for tasks which require communication or synchronisation between processes.
I've been looking for a simple process-based parallel programming environment in Python that can execute a function on multiple CPUs on a cluster, with the major criterion being that it needs to be able to execute unmodified Python code. The closest I found was Parallel Python, but pp does some pretty funky things, which can cause the code to not be executed in the correct context (with the appropriate modules imported etc).
I finally got tired of searching, so I decided to write my own. What I came up with is actually quite simple. The problem is, I'm not sure if what I've come up with is simple because I've failed to think of a lot of things. Here's what my program does:
I have a job server which hands out jobs to nodes in the cluster.
The jobs are handed out to servers listening on nodes by passing a dictionary that looks like this:
{
    'moduleName': 'some_module',
    'funcName': 'someFunction',
    'localVars': {'someVar': someVal, ...},
    'globalVars': {'someOtherVar': someOtherVal, ...},
    'modulePath': '/a/path/to/a/directory',
    'customPathHasPriority': aBoolean,
    'args': (arg1, arg2, ...),
    'kwargs': {'kw1': val1, 'kw2': val2, ...}
}
moduleName and funcName are mandatory, and the others are optional.
A node server takes this dictionary and does:
sys.path.append(modulePath)
globals()[moduleName] = __import__(moduleName, localVars, globalVars)
returnVal = globals()[moduleName].__dict__[funcName](*args, **kwargs)
On getting the return value, the server then sends it back to the job server which puts it into a thread-safe queue.
When the last job returns, the job server writes the output to a file and quits.
I'm sure there are niggles that need to be worked out, but is there anything obviously wrong with this approach? At first glance it seems robust, requiring only that the nodes have access to the filesystem(s) containing the .py file and the dependencies. Using __import__ has the advantage that the code in the module is automatically run, so the function should execute in the correct context.
Any suggestions or criticism would be greatly appreciated.
EDIT: I should mention that I've got the code-execution bit working, but the server and job server have yet to be written.
I have actually written something that probably satisfies your needs: jug. If it does not solve your problems, I promise you I'll fix any bugs you find.
The architecture is slightly different: workers all run the same code, but they effectively generate a similar dictionary and ask the central backend "has this been run?". If not, they run it (there is a locking mechanism too). The backend can simply be the filesystem if you are on an NFS system.
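A minimal jug sketch (the function is just a placeholder) to show the shape of it: every node runs the same script with jug execute, and the shared backend (e.g. a directory on NFS) makes sure each task runs only once.

# jobs.py - run "jug execute jobs.py" on every node
from jug import TaskGenerator

@TaskGenerator
def process(value):
    # whatever expensive, embarrassingly parallel work you need
    return value * value

results = [process(i) for i in range(1000)]

jug status jobs.py shows how many tasks have finished while the workers run.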
I myself have been tinkering with batch image manipulation across my computers, and my biggest problem was the fact that some things don't easily or natively pickle and transmit across the network.
For example, pygame's Surfaces don't pickle. These I have to convert to strings by saving them into StringIO objects and then dumping them across the network.
If the data you are transmitting (e.g. your arguments) can be transmitted without fear, you should not have that many problems with network data.
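For example, one way to make a pygame Surface transmittable (this uses pygame.image.tostring/fromstring rather than the StringIO route, and the helper names are made up):

import pickle
import pygame

def surface_to_payload(surface):
    # Surfaces are not picklable, so ship the raw pixel bytes plus the size
    return pickle.dumps({
        "pixels": pygame.image.tostring(surface, "RGBA"),
        "size": surface.get_size(),
    })

def payload_to_surface(payload):
    data = pickle.loads(payload)
    return pygame.image.fromstring(data["pixels"], data["size"], "RGBA")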
Another thing comes to mind: what do you plan to do if a computer suddenly "disappears" while doing a task, or while returning the data? Do you have a plan for re-sending tasks?