I'm working on a small program that collects certain data and processes it. Currently, the program runs constantly on my server, storing the data to the disk. Every once in a while I run another program that reads the stored data, processes it, sorts it, saves it to a new location, and clears the old data files.
I've never learned about threads, but it SOUNDS like this is a good place to use them? If threading works the way I think it does, I could set up a queue to hold the data, and have a separate thread that could pull data from the queue and process it as it's ready. If the queue is full, thread1 could sleep for a bit. If it's empty, thread2 could sleep for a bit
That would reduce disk writing, get rid of disk reading, and make the data collection run side-by-side with the data processing to save time.
Is any of this accurate? I'm a senior CS student and threads have never once come up (Surely that's a little odd?). I would appreciate any tips/knowledge/advice with using threads, as well as if this is the correct solution to my "problem".
Thanks!
That does sound like a situation where some form of parallelism might be useful. However, since this is Python, you might not want to actually use threads. Python, in the standard implementation, has something called the Global Interpreter Lock. Effectively, to allow the garbage collector to work, only one thread of a Python program can actually be running Python code at any time (modules written directly in C, or external operations such as disk IO or database queries, are not "running Python code" for this purpose, though you will have called them from Python).
Because of this, threading in Python is generally only a good idea if your Python code spends significant amounts of time waiting on responses from non-Python parts of the program, or external sources. If the data collection or processing is being done outside Python (collecting from a database or website, processing in numpy, etc.), that might be reasonable. If your code isn't in this situation often enough, your program ends up wasting more time switching between threads than it gains (because if two threads are both in Python code it still only runs one at a time)
If not, you should try the multiprocessing module instead. This is also a generally safer model, as the only things that can be shared between processes in multiprocessing are the things you explicitly share (while threads share all state, potentially allowing one thread to break another because you forgot to lock something).
Alternately, you might use subprocess. Effectively, this would be having your first program intermittently restart the second one every time it finishes a batch of data.
Related
I'm not a Python programmer and i rarely work with linux but im forced to use it for a project. The project is fairly straightforward, one task constantly gathers information as a single often updating numpy float32 value in a class, the other task that is also running constantly needs to occasionally grab the data in that variable asynchronously but not in very time critical way, and entirely on linux. My default for such tasks is the creation of a thread but after doing some research it appears as if python threading might not be the best solution for this from what i'm reading.
So my question is this, do I use multithreading, multiprocessing, concurrent.futures, asyncio, or (just thinking out loud here) some stdin triggering / stdout reading trickery or something similar to DDS on linux that I don't know about, on 2 seperate running scripts?
Just to append, both tasks do a lot of IO, task 1 does a lot of USB IO, the other task does a bit serial and file IO. I'm not sure if this is useful. I also think the importance of resetting the data once pulled and having as little downtime in task 1 as possible should be stated. Having 2 programs talk via a file probably won't satisfy this.
Any help would be appreciated, this has proven to be a difficult use case to google for.
Threading will probably work fine, a lot of the problems with the BKL (big kernel lock) are overhyped. You just need to make sure both threads provide active opportunities for the scheduler to switch contexts on a regular basis, typically by calling sleep(0). Most threads run in a loop and if the loop body is fairly short then calling sleep(0) at the top or bottom of it on every iteration is usually enough. If the loop body is long you might want to put a few more in along the way. It’s just a hint to the scheduler that this would be a good time to switch if other threads want to run.
Have you considered a double ended queue? You write to one end and grab data from the other end. Then with your multi-threading you could write with one thread and read with the other:
https://docs.python.org/3/library/collections.html#collections.deque
Quoting the documentation: "Deques are a generalization of stacks and queues (the name is pronounced “deck” and is short for “double-ended queue”). Deques support thread-safe, memory efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction."
In order to get a speed boost for my python program, should I spawn a separate thread or a separate process for logging? My program uses a lot of logging and I am not sure if threading is suitable because of GIL. A lot of resources seem to suggest that it should be fine for I/O. I think that logging is I/O but I am not sure what does "should be fine" mean for most of the resources out there. I just need speed.
Before you start trying to optimize a program, there are some things you should do.
To start with, you should profile you programs. You could e.g. use line_profiler.
If it turns out that your software spends a considerable amount of time logging, there are two easy options.
Set the loglevel in production code so that no or few(er) messages are logged. There will still be some overhead left, but it should be much reduced.
Use mechanical means (like sed or grep) to completely remove the logging calls from the production code. If this doesn't improve the speed/throughput of your program, logging wasn't an issue.
If neither of those is suitable and logging is a significant fraction of your programs time you can try to implement thread or process based logging.
If you want to use threading for logging, you will besically need a list and a lock. The function that is called from the main thread to do logging grabs the lock, appends the text to log to the list and releases the lock. The second thread waits for the lock, grabs the lock, pops a couple of items from the list, releases the lock and writes the items to a file. Since the GIL makes sure that only one thread at a time is running Python bytecode, this will reduce the performance of your program somewhat; part of its time is spent running the bytecode from the logging thread.
Using multiprocessing is slightly different, in that you probably want to use e.g. a Queue to send logging messages from the main process to the logging process. The logging process takes items from the Queue and writes them to disk. This means that the time spent writing the logging actions to disk is spent in a different program. But there is some overhead associated with using a Queue as well.
You would have to measure to see which method uses less time in your program.
I am going by these assumptions:
You already determine that it is your logging that is bottlenecking your program.
You have a good reason why you are logging what you are logging.
The perceived slowness is most likely due to the success or failure acknowledgement from the logging action. To avoid this "command queuing" make the calls to a separate process asynchonously and skipping the callback. This may end up consuming more resources, but this will alleviate the backlog in your main program.
Nodejs handles this naturally or you can roll your own python listener. Since this will be a separate process. You can redirect the logging feature of your other programs to this one. You can even have a separate machine to handle this workload.
Does the presence of python GIL imply that in python multi threading the same operation is not so different from repeating it in a single thread?.
For example, If I need to upload two files, what is the advantage of doing them in two threads instead of uploading them one after another?.
I tried a big math operation in both ways. But they seem to take almost equal time to complete.
This seems to be unclear to me. Can someone help me on this?.
Thanks.
Python's threads get a slightly worse rap than they deserve. There are three (well, 2.5) cases where they actually get you benefits:
If non-Python code (e.g. a C library, the kernel, etc.) is running, other Python threads can continue executing. It's only pure Python code that can't run in two threads at once. So if you're doing disk or network I/O, threads can indeed buy you something, as most of the time is spent outside of Python itself.
The GIL is not actually part of Python, it's an implementation detail of CPython (the "reference" implementation that the core Python devs work on, and that you usually get if you just run "python" on your Linux box or something.
Jython, IronPython, and any other reimplementations of Python generally do not have a GIL, and multiple pure-Python threads can execute simultaneously.
The 0.5 case: Even if you're entirely pure-Python and see little or no performance benefit from threading, some problems are really convenient in terms of developer time and difficulty to solve with threads. This depends in part on the developer, too, of course.
It really depends on the library you're using. The GIL is meant to prevent Python objects and its internal data structures to be changed at the same time. If you're doing an upload, the library you use to do the actual upload might release the GIL while it's waiting for the actual HTTP request to complete (I would assume that is the case with the HTTP modules in the standard library, but I didn't check).
As a side note, if you really want to have things running in parallel, just use multiple processes. It will save you a lot of trouble and you'll end up with better code (more robust, more scalable, and most probably better structured).
It depends on the native code module that's executing. Native modules can release the GIL and then go off and do their own thing allowing another thread to lock the GIL. The GIL is normally held while code, both python and native, are operating on python objects. If you want more detail you'll probably need to go and read quite a bit about it. :)
See:
What is a global interpreter lock (GIL)? and Thread State and the Global Interpreter Lock
Multithreading is a concept where two are more tasks need be completed simultaneously, for example, I have word processor in this application there are N numbers of a parallel task have to work. Like listening to keyboard, formatting input text, sending a formatted text to display unit. In this context with sequential processing, it is time-consuming and one task has to wait till the next task completion. So we put these tasks in threads and simultaneously complete the task. Three threads are always up and waiting for the inputs to arrive, then take that input and produce the output simultaneously.
So multi-threading works faster if we have multi-core and processors. But in reality with single processors, threads will work one after the other, but we feel it's executing with greater speed, Actually, one instruction executes at a time and a processor can execute billions of instructions at a time. So the computer creates illusion that multi-task or thread working parallel. It just an illusion.
Disclaimer: I know this question will annoy some people because it's vague, theoretical, and has little code.
I have a AWS Lambda function in Python which reads a file of denormalized records off S3, formats its contents correctly, and then uploads that to DynamoDB with a batch write. It all works as advertised. I then tried to break up the uploading part of this pipeline into threads with the hope of more efficiently utilizing DynamoDBs write capacity. However, the multithread version is slower by about 50%. Since the code is very long I have included pseudocode.
NUM_THREADS = 4
for every line in the file:
Add line to list of lines
if we've read enough lines for a single thread:
Create thread that uploads list of lines
thread.start()
clear list of lines.
for every thread started:
thread.join()
Important notes and possible sources of the problem I've checked so far:
When testing this locally using DynamoDB Local, threading does make my program run faster.
If instead I use only 1 thread, or even if I use multiple threads but I join the thread right after I start it (effectively single threaded), the program completes much quicker. With 1 thread ~30s, multi thread ~45s.
I have no shared memory between threads, no locks, etc.
I have tried creating new DynamoDB connections for each thread and sharing one connection instead, with no effect.
I have confirmed that adding more threads does not overwhelm the write capacity of DynamoDB, since it makes the same number of batch write requests and I don't have more unprocessed items throughout execution than with a single thread.
Threading should improve the execution time since the program is network bound, even though Python threads do not really run on multiple cores.
I have tried reading the entire file first, and then spawning all the threads, thinking that perhaps it's better to not interrupt the disk IO, but to no effect.
I have tried both the Thread library as well as the Process library.
Again I know this question is very theoretical so it's probably hard to see the source of the issue, but is there some Lambda quirk I'm not aware of? Is there something I else I can try to help diagnose the issue? Any help is appreciated.
Nate, have you completely ruled out a problem on the Dynamodb end? The total number of write requests may be the same, but the number per second would be different with a multi-thread.
The console has some useful graphs to show if your writes (or batch writes) are being throttled at all. If you don't have the right 'back off, retry' logic in your Lambda function, Lambda will just try and try again and your problem gets worse.
One other thing, which might have been obvious to you (but not me!). I was under the impression that batch_writes saved you money on the capacity planning front. (That 200 writes in batches of 20 would only cost you 10 write units, for example. I could have sworn I heard an AWS guy mention this in a presentation, but that's beside the point.)
In fact the batch_writes save you some time, but nothing economically.
One last thought: I'd bet that Lambda processing time is cheaper than upping your Dynamodb write capacity. If you're in no particular rush for Lambda to finish, why not let it run its course on single-thread?
Good luck!
Turns out that the threading is faster, but only when the file reached a certain file size. I was originally work on a file size of about 1/2 MG. With a 10 MG file, the threaded version came out about 50% faster. Still unsure why it wouldn't work with the smaller file, maybe it just needs time to get a'cooking, you know what I mean? Computers are moody things.
As a backdrop I have good experience with python and dynamoDB along with using python's multiprocessing library. Since your file size was fairly small it may have been the setup time of the process that confused you about performance. If you haven't already, use python multiprocessing pools and use map or imap depending on your use case if you need to communicate any data back to the main thread. Using a pool is the darn simpliest way to run multiple processes in python. If you need your application to run faster as a priority you may want to look into using golang concurrency and you could always build the code into binary to use from within python. Cheers.
I am aware that this question is rather high-level and may be vague. Please ask if you need any more details and I will try to edit.
I am using QuickFix with Python bindings to consume high-throughput market data from circa 30 markets simultaneously. Most of computing the work is done in separate CPUs via the multiprocessing module. These parallel processes are spawned by the main process on startup. If I wish to interact with the market in any way via QuickFix, I have to do this within the main process, thus any commands (to enter orders, for example) which come from the child processes must be piped (via an mp.Queue object we will call Q) to the main process before execution.
This raises the problem of monitoring Q, which must be done within the main process. I cannot use Q.get(), since this method blocks and my entire main process will hang until something shows up in Q. In order to decrease latency, I must check Q frequently, on the order of 50 times per second. I have been using the apscheduler to do this, but I keep getting Warning errors stating that the runtime was missed. These errors are a serious issue because they prevent me from easily viewing important information.
I have therefore refactored my application to use the code posted by MestreLion as an answer to this question. This is working for me because it starts a new thread from the main process, and it does not print error messages. However, I am worried that this will cause nasty problems down the road.
I am aware of the Global Interpreter Lock in python (this is why I used the multiprocessing module to begin with), but I don't really understand it. Owing to the high-frequency nature of my application, I do not know if the Q monitoring thread and the main process consuming lots of incoming messages will compete for resources and slow each other down.
My questions:
Am I likely to run into trouble in this scenario?
If not, can I add more monitoring threads using the present approach and still be okay? There are at least two other things I would like to monitor at high frequency.
Thanks.
#MestreLion's solution that you've linked creates 50 threads per second in your case.
All you need is a single thread to consume the queue without blocking the rest of the main process:
import threading
def consume(queue, sentinel=None):
for item in iter(queue.get, sentinel):
pass_to_quickfix(item)
threading.Thread(target=consume, args=[queue], daemon=True).start()
GIL may or may not matter for performance in this case. Measure it.
Without knowing your scenario, it's difficult to say anything specific. Your question suggests, that the threads are waiting most of the time via get, so GIL is not a problem. Interprocess communication may result in problems much earlier. There you can think of switching to another protocol, using some kind of TCP-sockets. Then you can write the scheduler more efficient with select instead of threads, as threads are also slow and resource consuming. select is a system function, that allows to monitor many socket-connection at once, therefore it scales incredibly efficient with the amount of connections and needs nearly no CPU-power for monitoring.