I work a lot with texts in Python, but I'm fairly new to the language and don't yet know how to employ multi-threading in it.
My use case is the following:
A single producer P (database/XML) generates a set of texts T_s.
Each text in T_s can be processed independently. The processed texts compose the set T_p.
The resulting set is written to a text file/XML/database by a single thread S.
Data volumes are huge, so the processing cannot keep anything except the current item in memory.
I would organize the process as follows:
The producer puts the texts into a queue Q_s.
A manager takes texts from the queue and distributes them among a set of workers.
Each worker puts its processed text into Q_p.
A sink process reads processed texts from Q_p and persists them.
Beyond all that, the producer should be able to signal to the manager and the sink that it has finished reading the input data source.
To summarize: I've learned so far that there is a nice library/solution for each of the typical tasks in Python. Is there one for this task?
Due to the nature of CPython (see the GIL), you will need to use multiple processes rather than threads if your tasks are CPU-bound rather than I/O-bound. Python comes with the multiprocessing module, which has everything you need to get the job done. Specifically, it has pools and process-safe queues.
In your case, you need an input queue and an output queue that you pass to each worker; the workers asynchronously read from the input queue and write to the output queue. The single-threaded producer and consumer just operate on their respective queues, keeping only what's necessary in memory. The only potential quirk is that the order of the outputs may not match the order of the inputs.
Note: you can communicate status with the JoinableQueue class.
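A minimal sketch of that queue layout. It uses multiprocessing.dummy (a thread-backed drop-in with the same API as multiprocessing) so it runs as-is anywhere; for CPU-bound work you would swap in multiprocessing itself and guard the entry point with if __name__ == "__main__". The names worker/run_pipeline and the .upper() "processing" are placeholders:

```python
import multiprocessing.dummy as mp  # thread-backed, same API as multiprocessing

SENTINEL = None  # signals "no more input"

def worker(q_in, q_out):
    # Pull raw texts until the sentinel arrives; push processed results.
    for text in iter(q_in.get, SENTINEL):
        q_out.put(text.upper())      # stand-in for the real processing
    q_out.put(SENTINEL)              # tell the sink this worker is done

def run_pipeline(texts, n_workers=4):
    q_in, q_out = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(q_in, q_out))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for text in texts:               # the producer's role
        q_in.put(text)
    for _ in workers:                # one sentinel per worker
        q_in.put(SENTINEL)
    results, done = [], 0
    while done < n_workers:          # the sink's role
        item = q_out.get()
        if item is SENTINEL:
            done += 1
        else:
            results.append(item)
    for w in workers:
        w.join()
    return results
```

Note how end-of-input is communicated by sending one sentinel per worker, and each worker forwards a sentinel to the sink, so the sink knows when everything has drained.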
I recently heard of a feature in Python 3.7+ where asyncio introduced "tasks", which people refer to as background tasks. So that's my first question:
Do these tasks really run in background?
Also, when comparing asyncio tasks to threads in Python, we know that Python has a GIL, so neither is truly parallel. I know the difference in core structure, i.e. asyncio tasks run in an event loop inside a single thread, while Python threads are real OS threads. But when it comes to speed, neither runs in parallel.
We can call them concurrent instead. So the second question is:
Which of these two would be faster?
A few things I've learned about memory consumption:
Threads consume a fair amount of memory, since each thread needs its own stack. With async code, all the code shares one stack, which is kept small because it is continuously unwound between tasks.
Threads are OS structures and therefore require more memory for the platform to support. There is no such problem with asynchronous tasks.
References:
What does asyncio.create_task() do?
How does asyncio actually work?
Coming to my last question:
When should you use asyncio tasks rather than threads? (This question came to mind since we can even fire an async task from sync code.)
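To make the "do tasks really run in the background?" question concrete, here is a small sketch: both tasks are scheduled on the same thread's event loop and interleave only at await points, so "background" means "scheduled", not "running in parallel":

```python
import asyncio

async def job(name, delay, log):
    log.append(f"{name} start")
    await asyncio.sleep(delay)   # the only point where control is yielded
    log.append(f"{name} end")

async def main():
    log = []
    # Both tasks live on the SAME thread's event loop.
    task_a = asyncio.create_task(job("a", 0.02, log))
    task_b = asyncio.create_task(job("b", 0.01, log))
    await asyncio.gather(task_a, task_b)
    return log

log = asyncio.run(main())
```

Task "a" starts first because it was scheduled first, but "b" finishes first because its await completes sooner; the interleaving is entirely cooperative.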
I am very new to the concept of threading. I was going over content on this site about threading and came across the claim that tasks that spend much of their time waiting for external events are generally good candidates for threading. May I know why this statement is true?
Threading allows for efficient CPU usage. Tasks that spend a lot of time waiting for other events to finish can be put to sleep (i.e. temporarily stopped) with threading.
By putting a thread to sleep, the CPU it was running on becomes free to execute other tasks until the thread is woken up.
The ability to sleep and wake up allows:
(1) Faster computation without much overhead
(2) A reduction in wasted computational resources
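A small sketch of that effect: five threads each "wait" 50 ms (standing in for a blocking network or disk call), and because sleeping threads release the CPU (and the GIL), the total wall time is close to one wait, not five:

```python
import threading
import time

def fetch(results, i):
    time.sleep(0.05)          # stands in for a blocking network/disk call
    results[i] = i * i        # a trivial "result" once the wait is over

results = {}
start = time.perf_counter()
threads = [threading.Thread(target=fetch, args=(results, i)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
# Five 0.05 s waits overlap, so elapsed is close to 0.05 s, not 0.25 s.
```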
Alternative viewpoint:
I don't know about Python specifically, but in many other programming languages/libraries, there will be some kind of "async I/O" mechanism, or "select" mechanism, or "event-driven" paradigm that enables a single-threaded program to actively do one thing while simultaneously waiting for multiple other things to happen.
The problem that threads solve comes from the fact that each independent source of input/events typically drives some independent "activity," and each of those activities has its own separate state. In an async/event-driven programming style, the state of each activity must be explicitly represented in some set of variables: When the program receives an event that drives activity X, it has to be able to examine those variables so that it can "pick up where it left off" last time it was working on activity X.
With threads, part or all of the state of activity X can be implicit in the X thread's context (that is, in the value of its program counter, in its registers, and in the contents of its call stack). The code for a thread that handles one particular activity can look a lot like the pure-procedural code that we all first learned to write as rank beginners, and is much more familiar-looking than any system of "event handlers" and explicit state variables.
The downside of using multiple threads is that the familiar look and feel of the code can lull us into a false sense of security: we can easily overlook the possibility of deadlocks, race conditions, and the other hazards to which multi-threading exposes us. Multi-threaded code can be easy to read, but it can be much harder to write without making subtle mistakes that are hard to catch in testing.
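A toy illustration of the explicit-state, event-driven style described above (the event format here is invented for the example): each activity's progress lives in a state table that the single dispatch loop consults to "pick up where it left off":

```python
def run_events(events):
    # Explicit-state, event-driven style: one loop handles interleaved
    # events for many activities by consulting a per-activity state table.
    state = {}       # activity -> chunks received so far
    finished = {}    # activity -> fully assembled result
    for activity, payload in events:
        chunks = state.setdefault(activity, [])
        if payload is None:              # end-of-activity event
            finished[activity] = "".join(chunks)
        else:
            chunks.append(payload)
    return finished
```

With one thread per activity, the `chunks` list would simply be a local variable on that thread's stack; here it must be tracked explicitly because events for different activities arrive interleaved.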
I'm posting a more concise version of my question here; the original got flagged for being too broad.
I'm looking for a way, either native python or a framework, which will allow me to do the following:
Publish a webservice which an end customer can call like any other standard webservice (using curl, postman, requests, etc.)
This webservice will be accepting gigabytes (perhaps 10s of GB) of data per call.
While this data is being transmitted, I'd like to break it into chunks and spin off separate threads and/or processes to simultaneously work with it (my processing is complex but each chunk will be independent and self-contained)
Doing this will allow my logic to run in parallel with the data upload across the internet, avoiding all the wasted time while the data is merely being transmitted.
It will also prevent the gigabytes/tens of GB from being loaded entirely into RAM before my logic even begins.
Original Question:
I'm trying to build a web service (in Python) which can accept potentially tens of gigabytes of data and process it. I don't want the payload to be completely received and built into an in-memory object before being passed to my logic, because a) this would use a ton of memory, and b) the processing would be pretty slow, and I'd love to have a processing thread working on chunks of the data while the rest is still being received asynchronously.
I believe I need some sort of streaming solution for this, but I'm having trouble finding a Python solution that handles this case. Most things I've found are about streaming the output (not an issue for me). It also seems that WSGI has design issues with streaming request data.
Is there a best practice for this sort of issue which I'm missing? And/or, is there a solution that I haven't found?
Edit: Since a couple of people asked, here's an example of the sort of data I'd be looking at. Basically I'm working with lists of sentences, which may be millions of sentences long. But each sentence (or group of sentences, for ease) is a separate processing task. Originally I had planned on receiving this as a json array like:
{"sentences": [
  "here's a sentence",
  "here's another sentence",
  "I'm also a sentence"
]}
For this modification I'm thinking it would just be newline-delimited sentences, since I don't really need the JSON structure. So in my head, my solution would be: I get a constant stream of characters, and whenever I hit a newline character, I split off the previous sentence and pass it to a worker thread or thread pool for processing. I could also work in groups of many sentences to avoid having a ton of threads going at once. But the main thing is that while the main thread is receiving this character stream, it periodically splits off tasks so other threads can start processing.
Second Edit: I've had a few thoughts on how to process the data. I can't give many specific details as it's proprietary, but I could either store the sentences as they come in into Elasticsearch or some other database and have an async process work on that data, or (ideally) I'd just work with the sentences in batches, in memory. Order is important, and not dropping any sentences is also important. The inputs will be coming from customers over the internet, so that's why I'm trying to avoid a message-queue-like process: I don't want the overhead of a new call for each sentence.
Ideally, the customer of the webservice doesn't have to do anything particularly special other than do the normal POST request with a gigantic body, and all this special logic is server-side. My customers won't be expert software engineers so while a webservice call is perfectly within their wheelhouse, handling a more complex message queue process or something along those lines isn't something I want to impose on them.
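The plan described above (read the body as a stream, split on newlines, hand batches to a thread pool while the upload is still arriving) could look roughly like this. handle_stream and process_batch are hypothetical names; in a real WSGI/Flask app the stream argument would be environ["wsgi.input"] or request.stream, while here a BytesIO stands in so the sketch runs on its own:

```python
import io
from concurrent.futures import ThreadPoolExecutor

def process_batch(sentences):
    return [s.upper() for s in sentences]    # placeholder for the real logic

def handle_stream(stream, pool, batch_size=2):
    # Read the file-like request body line by line, dispatching batches to
    # worker threads while the rest of the upload is still arriving.
    futures, batch = [], []
    for raw_line in stream:
        sentence = raw_line.decode("utf-8").rstrip("\n")
        if sentence:
            batch.append(sentence)
        if len(batch) >= batch_size:
            futures.append(pool.submit(process_batch, list(batch)))
            batch.clear()
    if batch:                                # flush the final partial batch
        futures.append(pool.submit(process_batch, batch))
    return [f.result() for f in futures]     # order follows submission order
```

Because futures are collected in submission order, the results preserve input order even though the batches are processed concurrently, which matters given the ordering requirement above.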
Unless you share a little more about the type of data, the processing, or the other constraints your problem has, it's going to be very difficult to provide advice more tailored than pointing you to a couple of resources.
... Here is my attempt, hope it helps!
It seems like what you need is the following:
A message-passing or streaming system to deliver/receive the data
Optionally, an asynchronous task queue to fire up different processing tasks on the data
or even a custom data processing pipeline system
Messaging vs Streaming
Examples: RabbitMQ, Kombu (per #abolotnov's comment), Apache Kafka (and python ports), Faust
The main differences between messaging and streaming can vary on the system/definition/who you ask, but in general:
- messaging: a "simple" system that will take care of sending/receiving single messages between two processes
- streaming adds functionality like the ability to "replay", send mini-batches of groups of messages, process rolling windows, etc.
Messaging systems may also implement broadcasting (sending a message to all receivers) and publish/subscribe scenarios, which come in handy if you don't want your publisher (the creator of the data) to keep track of whom to send the data to (the subscribers), or, alternatively, don't want your subscribers to keep track of whom to get the data from and when.
Asynchronous task queue
Examples: Celery, RQ, Taskmaster
This will basically help you assign a set of tasks that may be the smaller chunks of the main processing you are intending to do, and then make sure these tasks get performed whenever new data pops up.
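Celery and RQ implement this pattern across processes or machines (with persistence and retries); as a single-process sketch of the same idea, with names invented for the example:

```python
import queue
import threading

def start_workers(n, handler, results):
    # A tiny in-process task queue: tasks "pop up" on q and are performed
    # by whichever worker thread is free.
    q = queue.Queue()

    def loop():
        while True:
            task = q.get()
            if task is None:             # shutdown signal
                q.task_done()
                return
            results.append(handler(task))
            q.task_done()

    workers = [threading.Thread(target=loop, daemon=True) for _ in range(n)]
    for w in workers:
        w.start()
    return q, workers

results = []
q, workers = start_workers(3, lambda x: x * 2, results)
for i in range(5):                       # "new data popping up"
    q.put(i)
q.join()                                 # wait until every task is handled
for _ in workers:
    q.put(None)
for w in workers:
    w.join()
```

A dedicated task-queue library adds what this sketch lacks: surviving restarts, distributing work across machines, and retrying failed tasks.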
Custom Data Processing Systems
I mainly have one in mind: Dask (official tutorial repo)
This is a system built very much for what it seems you have on your hands: large amounts of information emerging from some source (which may or may not be fully under your control) that needs to flow through a set of processing steps in order to be consumable by some other process (or stored).
Dask is kind of a combination of the previous options, in that you define a computation graph (or task graph) with data sources and computation nodes that connect, where some nodes may depend on others. Later, depending on the system you deploy it on, you can choose synchronous execution or different types of asynchronous execution for the tasks, while keeping this run-time implementation detail separate from the actual tasks to be performed. This means you could develop on your computer and later decide to deploy the same pipeline on a cluster, changing only the "settings" of this run-time implementation.
Additionally, Dask basically mimics numpy / pandas / PySpark or whatever data-processing framework you may already be using, so the syntax will be virtually the same (in almost every case).
I have a server/client model using the SocketServer module. The server's job is to receive test names from the clients and launch the tests.
Each test is launched using the subprocess module.
I would like the server to keep answering clients, with any new jobs stacked in a list or queue and launched one after the other. The only restriction I have is that the server should not launch a test while the currently running one has not yet completed.
Thanks
You can use the multiprocessing module to start new processes. On the server side, you would keep a variable referring to the currently running process. Your SocketServer can keep running, accepting requests, and storing them in a list. Every second (or however often you like), in another thread, you would check whether the current process is dead by calling is_alive(). If it is, simply run the next test in the list.
A better way: in that third (checking) thread, call .join() on the process, so the next line of code runs only once the current process has died. That way you don't have to keep polling every second, and it is more efficient.
What you might want to do is:
Get the test name in the server socket and put it in a Queue
In a separate thread, read test names from the Queue one by one
Execute the process and wait for it to end using communicate()
Keep polling the Queue for new tests; repeat steps 2 and 3 while test names are available
Meanwhile, the server continues receiving test names and putting them in the Queue
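Those steps might be sketched like this (a toy Python subprocess stands in for the real test command, and the test names are invented for the example):

```python
import queue
import subprocess
import sys
import threading

test_queue = queue.Queue()
finished = []

def runner():
    # Pops queued test names one at a time; communicate() blocks until the
    # current test process exits, so two tests never run simultaneously.
    while True:
        name = test_queue.get()
        if name is None:                 # shutdown signal
            return
        proc = subprocess.Popen(
            [sys.executable, "-c", f"print('running {name}')"],
            stdout=subprocess.PIPE, text=True)
        out, _ = proc.communicate()
        finished.append((name, out.strip()))

t = threading.Thread(target=runner)
t.start()
for name in ["t1", "t2"]:                # the server thread would do this
    test_queue.put(name)
test_queue.put(None)
t.join()
```

In the real server, the SocketServer handler would only ever call test_queue.put(name), so it stays responsive while the runner thread serializes the tests.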
I am using Python 2.7.6 and the threading module.
I am fairly new to Python threading. I am trying to write a program that reads files from a filesystem and stores some hashes in my database. There are a lot of files, and I would like to do it with threads, like one thread for every folder that starts with 'a', one for every folder that starts with 'b', and so on. Since I want to use a database connection in each thread, I don't want to spawn 26 threads at once. So I would like to have 10 threads running, and whenever one of them finishes, start a new one.
The main program should hold a list of threads with a specified maximum number of threads (e.g. 10)
The main program should start 10 threads
The main program should be notified when a thread finishes
If a thread has finished, start a new one
And so on, until the job is done and every thread has finished
I am not quite sure how the main program has to look like. How can I manage this list of threads without a big overhead?
I'd like to point out that Python doesn't handle multi-threading well: as you may (or may not) know, Python comes with a Global Interpreter Lock (GIL) that doesn't allow real parallelism. Only one thread executes Python bytecode at a time (although the execution won't look sequential, thanks to your machine's scheduler).
Take a look here for more information : http://www.dabeaz.com/python/UnderstandingGIL.pdf
That said, if you still want to do it this way, take a look at semaphores: every thread has to acquire the semaphore, and if you initialize it to 10, only 10 threads at a time will be able to hold it.
https://docs.python.org/2/library/threading.html#threading.Semaphore
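A sketch of the semaphore approach (written for Python 3; the sleep stands in for hashing a folder's files, and the active/peak counters exist only to demonstrate the limit): 26 threads are started at once, but the BoundedSemaphore ensures at most 10 are doing work at any moment:

```python
import threading
import time

limit = threading.BoundedSemaphore(10)   # at most 10 threads inside at once
lock = threading.Lock()
active = 0
peak = 0

def process_folder(letter):
    global active, peak
    with limit:                          # blocks while 10 threads hold it
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)                 # stands in for hashing a folder
        with lock:
            active -= 1

threads = [threading.Thread(target=process_folder, args=(chr(c),))
           for c in range(ord("a"), ord("z") + 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

This way the main program never has to track which thread finished: as soon as one releases the semaphore, the next waiting thread proceeds on its own.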
Hope it helps