I'm not super familiar with asyncio, but I was hoping there would be some easy way to use asyncio.Queue to push log messages to a queue instead of writing them to disk, and then have a worker on a thread wait for these queue events and write them to disk when resources are available. This seems widely necessary, as logging is a huge bottleneck in a lot of code but isn't always needed in real time. Are there any pre-existing packages for this, or can anyone with more experience write a short example script to get me started? NOTE: This needs to interface with existing code, so having it all packaged in a class would probably be preferred.
It's handled in the standard library in recent Python versions (3.2+): logging.handlers.QueueHandler and QueueListener. See this post for information, and the official documentation. This functionality predates asyncio, and so doesn't use it (and doesn't especially need to).
For Python 2.7, you can use the logutils package, which provides equivalent functionality.
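A minimal sketch of that standard-library approach (QueueHandler plus QueueListener), wrapped in a class as the question requests; the class, logger, and file names here are placeholders of my own:

import logging
import logging.handlers
import queue

class AsyncFileLogger:
    # Handlers attached to the logger only enqueue records; a background
    # thread (QueueListener) does the actual disk writes.
    def __init__(self, name, path):
        self._queue = queue.Queue(-1)  # unbounded, so callers never block
        self.logger = logging.getLogger(name)
        self.logger.setLevel(logging.INFO)
        self.logger.addHandler(logging.handlers.QueueHandler(self._queue))
        self._listener = logging.handlers.QueueListener(
            self._queue, logging.FileHandler(path))
        self._listener.start()

    def stop(self):
        # Flushes any queued records and joins the background thread.
        self._listener.stop()

# Usage:
log = AsyncFileLogger("app", "app.log")
log.logger.info("handled off the calling thread")
log.stop()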
Related
I have a running CLI application in Python that uses threads to execute some workers. Now I am writing a GUI using Electron for this application. For simple requests/responses I am using gRPC to communicate between the Python application and the GUI.
I am, however, struggling to find a proper publishing mechanism to push data to the GUI: gRPC's integrated streaming won't work since it uses generators; as already mentioned, my longer, blocking tasks are executed using threads (subclasses of threading.Thread), and I'd like to emit certain events (e.g., progress) from within those threads.
Then I found Flask's SocketIO implementation, which is, however, blocking and thus not really suited for what I have in mind - I'd again have to run two processes (Flask and my CLI application)...
Another package I've found is websockets, but I can't get my head around how I could implement the producer() function that its documentation mentions in the patterns section.
My last idea would be to deploy a broker-based message system like Redis, or simply fall back to the brokerless zmq, which is a bit of a hassle to set up for the GUI application.
So the simple question:
Is there any easy framework that lets me create a server "task" in Python that I can pass messages to for publishing?
For anyone struggling with concurrency in Python:
No, there isn't any simple framework. IMHO Python's concurrency handling is a bit of a mess (compared to languages like Go, where concurrency is built in). There are multiple major packages implementing it, one of them asyncio, but most of them are mutually incompatible. I've ended up using a solution similar to the one proposed in this question.
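For what it's worth, here is a minimal sketch of the kind of publish server the question asks for, bridging worker threads into an asyncio websockets server via loop.call_soon_threadsafe. The class and method names are my own, and the handler signature assumes an older websockets release where handlers receive (websocket, path):

import asyncio
import threading
import websockets  # pip install websockets

class PushServer:
    # Runs a WebSocket broadcast server on its own thread; any thread
    # may call publish() to push a message to all connected clients.
    def __init__(self, host="localhost", port=8765):
        self.loop = asyncio.new_event_loop()
        self.clients = set()
        threading.Thread(target=self._run, args=(host, port), daemon=True).start()

    def _run(self, host, port):
        asyncio.set_event_loop(self.loop)
        self.loop.run_until_complete(websockets.serve(self._handler, host, port))
        self.loop.run_forever()

    async def _handler(self, websocket, path):
        self.clients.add(websocket)
        try:
            await websocket.wait_closed()  # keep the client registered until it disconnects
        finally:
            self.clients.discard(websocket)

    def publish(self, message):
        # Thread-safe: schedules the broadcast on the server's event loop.
        self.loop.call_soon_threadsafe(self._broadcast, message)

    def _broadcast(self, message):
        for ws in list(self.clients):
            self.loop.create_task(ws.send(message))

# From any worker thread:
# server = PushServer()
# server.publish("progress: 42%")

The Electron GUI would then simply open a WebSocket to localhost:8765 and listen for messages.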
I wrote an extension in C which uses threads. In order to stay cross-platform, I used Apache Portable Runtime wrappers around the platform-specific functions related to parallelism. However, installing such a package will be really painful for Windows users. Another concern is that I don't really need the entire APR library, only the part that deals with threads.
Before I started working on this project, I considered different libraries for this task, and when looking into Python's implementation of threads, all the exported API I could find deals with the GIL. In principle, I could create thread objects and have them run C functions to do the work, but I'm wondering if that makes sense. Do Python threads map to underlying OS threads (in the case of Linux, the pthreads library), or are they basically a prototype for asyncio, where they don't do any work in parallel (and maybe only wait in parallel)? The only exported API I found is the set of functions related to PyThreadState. I can see Python wrappers for pthreads and NT threads in the source code, but they don't seem to be available to extensions. Or am I missing something?
It could make sense to wrap your C code with Python threads, as long as your long-running tasks in C don't access Python objects and can therefore release the GIL. And to answer the mapping question directly: CPython threads are real OS threads (pthreads on Linux, native threads on Windows), not an asyncio-style prototype; the GIL only prevents them from executing Python bytecode concurrently, not from running in parallel while it is released:
Calling system I/O functions is the most common use case for releasing the GIL, but it can also be useful before calling long-running computations which don't need access to Python objects
https://docs.python.org/3/c-api/init.html#thread-state-and-the-global-interpreter-lock
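As a concrete illustration: ctypes releases the GIL around every foreign call made through CDLL, so ordinary threading.Thread workers can run long C computations genuinely in parallel, without a hand-written extension. The library and function names below are hypothetical:

import ctypes
import threading

lib = ctypes.CDLL("./libwork.so")  # hypothetical shared library
lib.crunch.argtypes = [ctypes.c_int]
lib.crunch.restype = ctypes.c_int

# Each call into lib.crunch drops the GIL for its duration, so these
# four threads can occupy four cores at once.
threads = [threading.Thread(target=lib.crunch, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()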
I'm looking to set up a distributed system where there are compute/worker machines running resource-heavy Python 3 code, and there is a single web server that serves the results of the Python computation to clients. I would very much like to write the web server in Node.js.
I've looked into using an RPC framework—specifically, this question led me to ZeroRPC, but it's not compatible with Python 3 (the main issue is that it requires gevent, which isn't that close to a Python 3 version yet). There doesn't seem to be another viable option for Python–Node.js RPC as far as I can tell.
In light of that, I'm open to using something other than RPC, especially since I've read that the RPC strategy hides too much from the programmer.
I'm also open to using a different language for the web server if that really makes more sense; for example, it may be much simpler from a development point of view to just use Python for the server too.
How can I achieve this?
You have a few options here.
First, it sounds like you like ZeroRPC, and your only problem is that it depends on gevent, which is not 3.x-ready yet.
Well, gevent is close to 3.x-ready. There are a few forks of it that people are testing and even using, which you can see on issue #38. As of mid-September 2014 the one that seems to be getting the most traction is Michal Mazurek's. If you're lucky, you can just do this:
pip3 install git+https://github.com/MichalMazurek/gevent
pip3 install ZeroRPC
Or, if ZeroRPC has metadata that says it's Python 2-only, you can install it from its repo the same way as gevent.
The downside is that none of the gevent 3.x forks are quite battle-tested yet, which is why none of them have been accepted upstream and released. But if you're not in a huge hurry and are willing to take a risk, there's a pretty good chance you can start with a fork today and switch to the final version when it's released, hopefully before you've reached 1.0 yourself.
Second, ZeroRPC is certainly not the only RPC library available for either Python or Node. And most of them have a similar kind of interface for exposing methods over RPC. And, while you may ultimately need something like ZeroMQ for scalability or deployment reasons, you can probably use something simpler and more widespread like JSON-RPC over HTTP—which has a dozen or more Python and Node implementations—for early development, then switch to ZeroRPC later.
Third, RPC isn't exactly complicated, and binding methods to RPCs the way most libraries do isn't that hard. Making it asynchronous can be tricky, but again, for early development you can just use an easy but nonscalable solution—creating a thread for each request—and switch to something else later. (Of course that solution is only easy if your service is stateless; otherwise you're just eliminating all of your async problems and replacing them with race condition problems…)
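To make the thread-per-request point concrete, here is a toy JSON-RPC-over-HTTP endpoint using only the standard library (ThreadingHTTPServer, Python 3.7+, spawns a thread per request; earlier versions can mix in socketserver.ThreadingMixIn). The exposed add method is a placeholder for real compute functions; this is an early-development sketch, not the scalable solution:

import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

METHODS = {
    "add": lambda a, b: a + b,  # placeholder; register real methods here
}

class RPCHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        req = json.loads(self.rfile.read(length))
        try:
            result = METHODS[req["method"]](*req.get("params", []))
            resp = {"jsonrpc": "2.0", "id": req.get("id"), "result": result}
        except Exception as exc:
            resp = {"jsonrpc": "2.0", "id": req.get("id"),
                    "error": {"code": -32603, "message": str(exc)}}
        body = json.dumps(resp).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

ThreadingHTTPServer(("", 8000), RPCHandler).serve_forever()

A Node client can call it with any HTTP library by POSTing a body like {"jsonrpc": "2.0", "id": 1, "method": "add", "params": [1, 2]}.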
ZeroMQ offers several transport classes; the first two below are best suited to a heterogeneous RPC layer:
ipc://
tcp://
pgm://
epgm://
inproc://
ZeroMQ has bindings for both languages, so it will definitely serve your projected needs as well.
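For illustration, a minimal pyzmq REP endpoint over the tcp:// transport that a Node REQ socket could talk to; the port and payload shape are arbitrary choices of mine:

import zmq  # pip install pyzmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REP)
sock.bind("tcp://*:5555")  # a Node.js REQ socket connects to this address

while True:
    request = sock.recv_json()   # e.g. {"method": "add", "params": [1, 2]}
    # Dispatch to your Python 3 compute code here; echoed back as a stub.
    sock.send_json({"result": request})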
We have a collection of Unix scripts (and/or Python modules) that each perform a long-running task. I would like to provide a web interface for them that does the following:
Asks for relevant data to pass into scripts.
Allows for starting/stopping/killing them.
Allows for monitoring the progress and/or other information provided by the scripts.
Possibly some kind of logging (although the scripts already do logging).
I do know how to write a server that does this (e.g. by using Python's built-in HTTP server/JSON), but doing this properly is non-trivial and I do not want to reinvent the wheel.
Are there any existing solutions that allow for maintaining asynchronous server-side tasks?
Django is great for writing web applications, and the subprocess module (subprocess.Popen and .communicate()) is great for executing shell scripts. You can give it stdin, stdout, and stderr streams for communication if you want.
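A minimal sketch of that pattern (the script name and arguments are placeholders). Note that communicate() blocks until the script exits; for live progress you would read proc.stdout incrementally, and proc.terminate()/proc.kill() cover the stopping/killing requirement:

import subprocess

# Launch a long-running script, feed it input, and capture its output.
proc = subprocess.Popen(
    ["./long_task.sh", "--input", "data.csv"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

out, err = proc.communicate(input=b"start\n")
print(proc.returncode, out.decode(), err.decode())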
Answering my own question, I recently saw the announcement of Celery 1.0, which seems to do much of what I am looking for.
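A minimal sketch of the Celery pattern, written against the modern API rather than the 1.0-era one; the broker URL and task body are placeholders:

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def run_script(path, *args):
    # Runs one of the long-running scripts in a worker process.
    import subprocess
    return subprocess.run([path, *args], capture_output=True).returncode

# The web layer enqueues work; a separate "celery worker" process runs it:
# result = run_script.delay("./long_task.sh", "--input", "data.csv")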
I would use SGE (Sun Grid Engine), but I think it could be overkill for your needs...
Even though Python and Ruby have one kernel thread per interpreter thread, they have a global interpreter lock (GIL) that is used to protect potentially shared data structures, and this inhibits multi-processor execution. Even though the portions of those languages that are written in C or C++ can be free-threaded, that's not possible with pure interpreted code unless you use multiple processes. What's the best way to achieve this? Using FastCGI? Creating a cluster or a farm of virtualized servers? Using their Java equivalents, JRuby and Jython?
I'm not totally sure which problem you want to solve, but if you deploy your Python/Django application via an Apache prefork MPM using mod_python, Apache will start several worker processes to handle different requests.
If a single request needs so many resources that you want to use multiple cores, have a look at pyprocessing. But I don't think that would be wise.
The 'standard' way to do this with Rails is to run a "pack" of Mongrel instances (i.e., 4 copies of the Rails application) and then use Apache or nginx or some other piece of software to sit in front of them and act as a load balancer.
This is probably how it's done with other Ruby frameworks such as Merb, but I haven't used those personally.
The OS will take care of running each Mongrel on its own CPU.
If you install mod_rails (aka Phusion Passenger), it will start and stop multiple copies of the Rails process for you as well, so it will end up spreading the load across multiple CPUs/cores in a similar way.
Use an interface that runs each response in a separate interpreter, such as mod_wsgi for Python. This lets multi-threading be used without encountering the GIL.
EDIT: Apparently, mod_wsgi no longer supports multiple interpreters per process, because many extension modules were never implemented to work correctly in sub-interpreters. It still supports running requests in separate processes, FastCGI-style, so that's apparently the current accepted solution.
In Python and Ruby, the only way to use multiple cores is to spawn new (heavyweight) processes.
The Java counterparts inherit the capabilities of the Java platform, so you can simply use Java threads. That is, for example, one reason why Java application servers like GlassFish are sometimes (often) used for Ruby on Rails applications.
For Python, the PyProcessing project allows you to program with processes much like you would use threads. It is included in the standard library of the recently released 2.6 version as multiprocessing. The module has many features for establishing and controlling access to shared data structures (queues, pipes, etc.) and supports common idioms (e.g., managers and worker pools).
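A short sketch of the worker-pool idiom from that module; the workload is a stand-in for real CPU-bound code:

from multiprocessing import Pool

def crunch(n):
    # CPU-bound work; each call runs in its own process, so the parent
    # interpreter's GIL is not a bottleneck.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        print(pool.map(crunch, [10**6] * 8))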