Slower execution of AWS Lambda batch-writes to DynamoDB with multiple threads - python

Disclaimer: I know this question will annoy some people because it's vague, theoretical, and has little code.
I have an AWS Lambda function in Python which reads a file of denormalized records off S3, formats its contents correctly, and then uploads that to DynamoDB with a batch write. It all works as advertised. I then tried to break up the uploading part of this pipeline into threads, hoping to use DynamoDB's write capacity more efficiently. However, the multithreaded version is about 50% slower. Since the actual code is very long, I have included a simplified sketch of the upload loop below.
import threading

NUM_THREADS = 4
lines_per_thread = total_lines // NUM_THREADS   # "enough lines for a single thread"
threads = []
lines = []
for line in input_file:                  # the denormalized records read off S3
    lines.append(line)
    if len(lines) >= lines_per_thread:
        # upload_lines is the existing helper that does the DynamoDB batch write
        t = threading.Thread(target=upload_lines, args=(lines,))
        t.start()
        threads.append(t)
        lines = []                       # clear the list of lines

for t in threads:
    t.join()                             # wait for every thread started
Important notes and possible sources of the problem I've checked so far:
When testing this locally using DynamoDB Local, threading does make my program run faster.
If instead I use only 1 thread, or even if I use multiple threads but join each thread right after I start it (effectively single-threaded), the program completes much more quickly: with 1 thread ~30s, multithreaded ~45s.
I have no shared memory between threads, no locks, etc.
I have tried both creating a new DynamoDB connection for each thread and sharing one connection across all threads, with no effect either way.
I have confirmed that adding more threads does not overwhelm the write capacity of DynamoDB, since it makes the same number of batch write requests and I don't have more unprocessed items throughout execution than with a single thread.
Threading should improve the execution time since the program is network bound, even though Python threads do not really run on multiple cores.
I have tried reading the entire file first, and then spawning all the threads, thinking that perhaps it's better to not interrupt the disk IO, but to no effect.
I have tried both the threading library and the multiprocessing library.
Again, I know this question is very theoretical, so it's probably hard to see the source of the issue, but is there some Lambda quirk I'm not aware of? Is there something else I can try to help diagnose the issue? Any help is appreciated.

Nate, have you completely ruled out a problem on the DynamoDB end? The total number of write requests may be the same, but the number per second would be different with multiple threads.
The console has some useful graphs to show if your writes (or batch writes) are being throttled at all. If you don't have the right 'back off, retry' logic in your Lambda function, Lambda will just try and try again and your problem gets worse.
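For reference, the kind of back-off and retry logic meant here might look roughly like the sketch below. This is only an illustration assuming boto3, with items already in DynamoDB attribute-value form; the function name and retry limits are made up for the example.
import time
import boto3

dynamodb = boto3.client("dynamodb")

def batch_write_with_backoff(table_name, items, max_retries=5):
    """Batch-write items, retrying any UnprocessedItems with exponential backoff."""
    request_items = {table_name: [{"PutRequest": {"Item": item}} for item in items]}
    for attempt in range(max_retries):
        response = dynamodb.batch_write_item(RequestItems=request_items)
        unprocessed = response.get("UnprocessedItems", {})
        if not unprocessed:
            return
        request_items = unprocessed          # retry only the throttled items
        time.sleep(0.1 * (2 ** attempt))     # back off before trying again
    raise RuntimeError("Some items were still unprocessed after retries")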
One other thing, which might have been obvious to you (but not me!). I was under the impression that batch writes saved you money on the capacity-planning front. (That 200 writes in batches of 20 would only cost you 10 write units, for example. I could have sworn I heard an AWS guy mention this in a presentation, but that's beside the point.)
In fact, batch writes save you some time, but nothing economically.
One last thought: I'd bet that Lambda processing time is cheaper than upping your Dynamodb write capacity. If you're in no particular rush for Lambda to finish, why not let it run its course on single-thread?
Good luck!

Turns out that the threading is faster, but only once the file reaches a certain size. I was originally working with a file of about 0.5 MB. With a 10 MB file, the threaded version came out about 50% faster. Still unsure why it wouldn't work with the smaller file; maybe it just needs time to get a'cooking, you know what I mean? Computers are moody things.

As a backdrop, I have good experience with Python and DynamoDB along with Python's multiprocessing library. Since your file size was fairly small, it may have been the setup time of the processes that confused you about performance. If you haven't already, use Python multiprocessing pools, and use map or imap depending on your use case if you need to communicate any data back to the main process. Using a pool is the simplest way to run multiple processes in Python. If making your application run faster is the priority, you may want to look into Go's concurrency; you could always build that code into a binary and call it from Python. Cheers.
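A minimal sketch of the pool idea, assuming Python 3 and a hypothetical upload_batch helper that does the DynamoDB batch write for one chunk of lines (note that multiprocessing pools may not work inside the Lambda runtime itself, so treat this as a local/EC2 sketch):
from multiprocessing import Pool

def upload_batch(lines):
    """Hypothetical worker: formats one chunk of lines and batch-writes it to DynamoDB."""
    # ... format the lines and call the batch write here ...
    return len(lines)

if __name__ == "__main__":
    with open("records.txt") as f:          # illustrative input file
        all_lines = f.readlines()

    # split the input into chunks of 100 lines per worker task
    chunks = [all_lines[i:i + 100] for i in range(0, len(all_lines), 100)]

    with Pool(processes=4) as pool:
        # imap yields results back to the main process as workers finish
        for written in pool.imap(upload_batch, chunks):
            print("wrote", written, "records")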

Related

Python3 help, 2 concurrently running tasks, one needs data from the other

I'm not a Python programmer and I rarely work with Linux, but I'm forced to use it for a project. The project is fairly straightforward: one task constantly gathers information as a single, often-updating numpy float32 value in a class; the other task, which also runs constantly, needs to occasionally grab the data in that variable asynchronously, but not in a very time-critical way, and entirely on Linux. My default for such tasks is to create a thread, but after doing some research it appears that Python threading might not be the best solution for this, from what I'm reading.
So my question is this: do I use multithreading, multiprocessing, concurrent.futures, asyncio, or (just thinking out loud here) some stdin-triggering / stdout-reading trickery, or something similar to DDS on Linux that I don't know about, on 2 separate running scripts?
Just to append: both tasks do a lot of IO. Task 1 does a lot of USB IO; the other task does a bit of serial and file IO. I'm not sure if this is useful. I should also state the importance of resetting the data once pulled and having as little downtime in task 1 as possible. Having 2 programs talk via a file probably won't satisfy this.
Any help would be appreciated, this has proven to be a difficult use case to google for.
Threading will probably work fine; a lot of the problems with the GIL (global interpreter lock) are overhyped. You just need to make sure both threads provide active opportunities for the scheduler to switch contexts on a regular basis, typically by calling sleep(0). Most threads run in a loop, and if the loop body is fairly short, then calling sleep(0) at the top or bottom of it on every iteration is usually enough. If the loop body is long, you might want to put a few more in along the way. It's just a hint to the scheduler that this would be a good time to switch if other threads want to run.
Have you considered a double-ended queue? You write to one end and grab data from the other end. Then, with your multithreading, you could write with one thread and read with the other:
https://docs.python.org/3/library/collections.html#collections.deque
Quoting the documentation: "Deques are a generalization of stacks and queues (the name is pronounced “deck” and is short for “double-ended queue”). Deques support thread-safe, memory efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction."
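A minimal sketch of that setup, with read_usb_value and handle_value as hypothetical stand-ins for the two tasks:
import threading
import time
from collections import deque

values = deque()   # appends and pops from either end are thread-safe

def collector():
    """Task 1: constantly gathers values and appends them to one end."""
    while True:
        values.append(read_usb_value())      # hypothetical data source

def consumer():
    """Task 2: occasionally drains whatever has accumulated at the other end."""
    while True:
        while values:
            handle_value(values.popleft())   # hypothetical processing
        time.sleep(0.5)                      # not time-critical, so poll occasionally

threading.Thread(target=collector, daemon=True).start()
consumer()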

In Python, should I use threading for this?

I'm working on a small program that collects certain data and processes it. Currently, the program runs constantly on my server, storing the data to the disk. Every once in a while I run another program that reads the stored data, processes it, sorts it, saves it to a new location, and clears the old data files.
I've never learned about threads, but it SOUNDS like this is a good place to use them? If threading works the way I think it does, I could set up a queue to hold the data, and have a separate thread that could pull data from the queue and process it as it's ready. If the queue is full, thread1 could sleep for a bit. If it's empty, thread2 could sleep for a bit.
That would reduce disk writing, get rid of disk reading, and make the data collection run side-by-side with the data processing to save time.
Is any of this accurate? I'm a senior CS student and threads have never once come up (surely that's a little odd?). I would appreciate any tips/knowledge/advice on using threads, as well as whether this is the correct solution to my "problem".
Thanks!
That does sound like a situation where some form of parallelism might be useful. However, since this is Python, you might not want to actually use threads. Python, in the standard implementation, has something called the Global Interpreter Lock. Effectively, to allow the garbage collector to work, only one thread of a Python program can actually be running Python code at any time (modules written directly in C, or external operations such as disk IO or database queries, are not "running Python code" for this purpose, though you will have called them from Python).
Because of this, threading in Python is generally only a good idea if your Python code spends significant amounts of time waiting on responses from non-Python parts of the program or from external sources. If the data collection or processing is being done outside Python (collecting from a database or website, processing in numpy, etc.), that might be reasonable. If your code isn't in this situation often enough, your program ends up wasting more time switching between threads than it gains (because if two threads are both in Python code, it still only runs one at a time).
If not, you should try the multiprocessing module instead. This is also a generally safer model, as the only things that can be shared between processes in multiprocessing are the things you explicitly share (while threads share all state, potentially allowing one thread to break another because you forgot to lock something).
Alternately, you might use subprocess. Effectively, this would be having your first program intermittently restart the second one every time it finishes a batch of data.
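A rough sketch of that subprocess variant, where collect_batch and process_data.py are hypothetical stand-ins for the collector step and the second program:
import subprocess
import sys

while True:
    collect_batch()   # hypothetical: gather the next batch of data and store it to disk
    # hand the stored batch to the second program and wait for it to finish
    subprocess.check_call([sys.executable, "process_data.py"])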

Python threading or multiprocessing for my 'tool'

I have created a script that:
Imports a list of IPs from a .txt file (around 5K)
Connects to a REST API and performs a query based on each IP (web logs for each IP)
Data is returned from the API and some calculations are done on the data
Results of the calculations are written to a .csv
At the moment it's really slow, as it takes one IP at a time, does everything, and then goes to the next IP.
I may be wrong, but from my understanding, with threading or multiprocessing I could have 3-4 threads each doing an IP, which would increase the speed of the tool by a huge margin. Is my understanding correct, and if it is, should I be looking at threading or multiprocessing for my task?
Any help would be amazing.
Random info: running Python 2.7.5 on Win7 with plenty of resources.
multiprocessing is definitely the way to go. You could start a process that reads the IPs and puts them in a multiprocessing.Queue, and then make a few processes (depending on available resources) that read from this queue, connect to the API and make the requests. These requests will then be made in parallel, and if the API can handle them, your program should finish faster. If the calculations are complex and time-consuming, the output from the API can be put into another Queue, from which other processes you start can read it, do the calculations and store the results. You may have to start a collector process to collect the final outputs.
You can find some sample code for such problems in this Stack Overflow question. If you require further explanation or sample code, let me know in the comments.
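A minimal sketch of that queue pipeline, with query_api as a hypothetical stand-in for the REST call plus calculations:
import multiprocessing as mp

def worker(ip_queue, result_queue):
    """Pull IPs off the queue, query the API, push results back."""
    while True:
        ip = ip_queue.get()
        if ip is None:                        # sentinel: no more work
            break
        result_queue.put(query_api(ip))       # query_api is hypothetical

if __name__ == "__main__":
    ip_queue = mp.Queue()
    result_queue = mp.Queue()

    with open("ips.txt") as f:
        ips = [line.strip() for line in f if line.strip()]

    workers = [mp.Process(target=worker, args=(ip_queue, result_queue))
               for _ in range(4)]
    for w in workers:
        w.start()

    for ip in ips:
        ip_queue.put(ip)
    for _ in workers:
        ip_queue.put(None)                    # one sentinel per worker

    results = [result_queue.get() for _ in ips]   # one result per IP
    for w in workers:
        w.join()
    # results can now be written out to the .csv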
With multiprocessing, a primitive way to do this would be to chunk the file into 5 equal pieces and give them to 5 different processes that write their results to 5 different files; when all processes are done, you merge the results (sketched at the end of this answer).
You can have the same logic with Python threads without much complication, and it probably won't make any difference, since the bottleneck is probably the API. So in the end it does not really matter which approach you choose here.
There are two things to consider though:
Using threads, you are not really using multiple CPUs, hence you have "wasted resources".
Using multiprocessing will use multiple processors, but it is heavier on start-up ... so you will benefit from never stopping the script and keeping the processes alive if the script needs to run very often.
Since the information you gave about the scenario where you use this script (or rather, program) is limited, it is really hard to say which is the better approach.
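A rough sketch of the chunk-and-merge variant described above, where process_chunk and query_api again stand in for the API query and calculations:
import multiprocessing as mp

def process_chunk(chunk, out_path):
    """Query the API for each IP in the chunk and write a partial CSV."""
    with open(out_path, "w") as out:
        for ip in chunk:
            out.write(query_api(ip) + "\n")    # query_api is hypothetical

if __name__ == "__main__":
    with open("ips.txt") as f:
        ips = [line.strip() for line in f if line.strip()]

    n = 5
    size = (len(ips) + n - 1) // n             # ceiling division into n chunks
    jobs = []
    for i in range(n):
        p = mp.Process(target=process_chunk,
                       args=(ips[i * size:(i + 1) * size], "part%d.csv" % i))
        p.start()
        jobs.append(p)
    for p in jobs:
        p.join()

    with open("results.csv", "w") as merged:   # merge the partial files
        for i in range(n):
            with open("part%d.csv" % i) as part:
                merged.write(part.read())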

Multithreading With Very Large Number of Threads

I'm working on simulating a mesh network with a large number of nodes. The nodes pass data between different master nodes throughout the network.
Each master comes live once a second to receive the information, but the slave nodes don't know when the master is up or not, so when they have information to send, they try and do so every 5 ms for 1 second to make sure they can find the master.
Running this on a regular computer with 1600 nodes results in 1600 threads and the performance is extremely bad.
What is a good approach to handling the threading so each node acts as if it is running on its own thread?
In case it matters, I'm building the simulation in Python 2.7, but I'm open to changing to something else if that makes sense.
For one, are you really using regular, default Python threads available in the default Python 2.7 interpreter (CPython), and is all of your code in Python? If so, you are probably not actually using multiple CPU cores because of the global interpreter lock CPython has (see https://wiki.python.org/moin/GlobalInterpreterLock). You could maybe try running your code under Jython, just to check if performance will be better.
You should probably rethink your application architecture and switch to manually scheduling events instead of using threads, or maybe try using something like greenlets (https://stackoverflow.com/a/15596277/1488821), but that would probably mean less precise timings because of lack of parallelism.
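A minimal sketch of the manual event-scheduling idea, using a heap as the event queue and a simulated clock; try_send_to_master is a hypothetical stand-in for whatever a slave node does on each attempt:
import heapq

events = []    # (time, sequence, callback, args) entries keep the heap ordered and stable
counter = 0

def schedule(t, callback, *args):
    """Schedule callback(t, *args) to run at simulated time t (seconds)."""
    global counter
    heapq.heappush(events, (t, counter, callback, args))
    counter += 1

def run(until):
    """Process events in time order until the simulated clock passes 'until'."""
    while events and events[0][0] <= until:
        t, _, callback, args = heapq.heappop(events)
        callback(t, *args)

def slave_retry(now, node_id, deadline):
    """A slave retries its send every 5 ms until it succeeds or the deadline passes."""
    if try_send_to_master(node_id):            # hypothetical send attempt
        return
    if now + 0.005 <= deadline:
        schedule(now + 0.005, slave_retry, node_id, deadline)

for node_id in range(1600):
    schedule(0.0, slave_retry, node_id, 1.0)
run(until=1.0)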
To me, 1600 threads sounds like a lot, but not excessive given that it's a simulation; in a production application that many threads would probably not be appropriate.
A standard machine should have no trouble handling 1600 threads. As for the OS, this article could provide you with some insights.
When it comes to your code, a Python script is not a native application but an interpreted script, and as such it will require more CPU resources to execute.
I suggest you try implementing the simulation in C or C++ instead, which will produce a native application that should execute more efficiently.
Do not use threading for that. If sticking to Python, let the nodes perform their actions one by one. If the performance you get doing so is OK, you will not have to use C/C++. If the actions each node performs are simple, that may work. Anyway, there is no reason to use threads in Python here at all: Python threads are useful mostly for keeping blocking I/O from blocking your program, not for utilizing multiple CPU cores.
If you want to really use parallel processing and to write your nodes as if they were really separated and exchanging only using messages, you may use Erlang (http://www.erlang.org/). It is a functional language very well suited for executing parallel processes and making them exchange messages. Erlang processes do not map to OS threads, and you may create thousands of them. However, Erlang is a purely functional language and may seem extremely strange if you have never used such languages. And it also is not very fast, so, like Python, it is unlikely to handle 1600 actions every 5ms unless the actions are rather simple.
Finally, if you do not get the desired performance using Python or Erlang, you may move to C or C++. However, still do not use 1600 threads. In fact, using threads to gain performance is reasonable only if the number of threads does not dramatically exceed the number of CPU cores. A reactor pattern (with several reactor threads) is what you may need in that case (http://en.wikipedia.org/wiki/Reactor_pattern). There is an excellent implementation of the reactor pattern in the boost.asio library. It is explained here: http://www.gamedev.net/blog/950/entry-2249317-a-guide-to-getting-started-with-boostasio/
Some random thoughts here:
I did rather well with several hundred threads working like this in Java; it can be done with the right language. (But I haven't tried this in Python.)
In any language, you could run the master node code in one thread; just have it loop continuously, running the code for each master in each cycle. You'll lose the benefits of multiple cores that way, though. On the other hand, you'll lose the problems of multithreading, too. (You could have, say, 4 such threads, utilizing the cores but getting the multithreading headaches back. It'll keep the thread-overhead down, too, but then there's blocking...)
One big problem I had was threads blocking each other. Enabling 100 threads to call the same method on the same object at the same time without waiting for each other requires a bit of thought and even research. I found my multithreading program at first often used only 25% of a 4-core CPU even when running flat out. This might be one reason you're running slow.
Don't have your slave nodes repeat sending data. The master nodes should come alive in response to data coming in, or have some way of storing it until they do come alive, or some combination.
It does pay to have more threads than cores. Once you have two threads, they can block each other (and will if they share any data). If you have code to run that won't block, you want to run it in its own thread so it won't be waiting for code that does block to unblock and finish. I found once I had a few threads, they started to multiply like crazy--hence my hundreds-of-threads program. Even when 100 threads block at one spot despite all my brilliance, there's plenty of other threads to keep the cores busy!

How to manage program turnaround time?

I have just written a script that is intended to be run 24/7 to update some files. However, if it takes 3 minutes to update one file, then it would take 300 minutes to update 100 files.
Is it possible to run n instances of the script to manage n separate files to speed up the turnaround time?
Yes, it is possible. Use the multiprocessing module to start several concurrent processes. This has the advantage that you do not run into the problems the Global Interpreter Lock causes for threads, as explained in the manual page. The manual page includes all the examples you will need to make your script execute in parallel. Of course this works best if the processes do not have to interact, which your example suggests.
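A minimal sketch of that approach, with update_file as a hypothetical stand-in for the 3-minute update of a single file:
from multiprocessing import Pool

def update_file(path):
    """Hypothetical stand-in for the 3-minute update of a single file."""
    pass

if __name__ == "__main__":
    files = ["file%03d.dat" % i for i in range(100)]   # illustrative file names
    pool = Pool(processes=4)        # run 4 updates at a time
    pool.map(update_file, files)    # blocks until every file has been updated
    pool.close()
    pool.join()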
I suggest you first find out if there is any way to reduce the 3 minutes in a single thread.
The method I use to discover speedup opportunities is demonstrated here.
That will also tell you if you are purely I/O bound.
If you are completely I/O bound, and all files are on a single disk, parallelism won't help.
In that case, possibly storing the files on a solid-state drive would help.
On the other hand, if you are CPU bound, parallelism will help, as #hochl said.
Regardless, find the speedup opportunities and fix them.
I've never seen any good-size program that didn't have one or several of them.
That will give you one speedup factor, and parallelism will give you another, and the total speedup will be the product of those two factors.
