I have a script with a main for loop that repeats about 15k times. In this loop it queries a local MySQL database and does a SVN update on a local repository. I placed the SVN repository in a RAMdisk as before most of the time seemed to be spent reading/writing to disk.
Now I have a script that runs at basically the same speed but CPU utilization for that script never goes over 10%.
ProcessExplorer shows that mysqld is also not taking almost any CPU time or reading/writing a lot to disk.
What steps would you take to figure out where the bottleneck is?
Doing SQL queries in a for loop 15k times is a bottleneck in every language..
Is there any reason you query every time again ? If you do a single query before the for loop and then loop over the resultset and the SVN part, you will see a dramatic increase in speed.
But I doubt that you will get a higher CPU usage. The reason is that you are not doing calculations, but mostly IO.
Btw, you can't measure that in mysqld cpu usage, as it's in the actual code not complexity of the queries, but their count and the latency of the server engine to answer. So you will see only very short, not expensive queries, that do sum up in time, though.
Profile your Python code. That will show you how long each function/method call takes. If that's the method call querying the MySQL database, you'll have a clue where to look. But it also may be something else. In any case, profiling is the usual approach to solve such problems.
It is "well known", so to speak, that svn update waits up to a whole second after it has finished running, so that file modification timestamps get "in the past" (since many filesystems don't have a timestamp granularity finer than one second). You can find more information about it by Googling for "svn sleep_for_timestamps".
I don't have any obvious solution to suggest. If this is really performance critical you could either: 1) not update as often as you are doing 2) try to use a lower-level Subversion API (good luck).
Related
Disclaimer: I know this question will annoy some people because it's vague, theoretical, and has little code.
I have a AWS Lambda function in Python which reads a file of denormalized records off S3, formats its contents correctly, and then uploads that to DynamoDB with a batch write. It all works as advertised. I then tried to break up the uploading part of this pipeline into threads with the hope of more efficiently utilizing DynamoDBs write capacity. However, the multithread version is slower by about 50%. Since the code is very long I have included pseudocode.
NUM_THREADS = 4
for every line in the file:
Add line to list of lines
if we've read enough lines for a single thread:
Create thread that uploads list of lines
thread.start()
clear list of lines.
for every thread started:
thread.join()
Important notes and possible sources of the problem I've checked so far:
When testing this locally using DynamoDB Local, threading does make my program run faster.
If instead I use only 1 thread, or even if I use multiple threads but I join the thread right after I start it (effectively single threaded), the program completes much quicker. With 1 thread ~30s, multi thread ~45s.
I have no shared memory between threads, no locks, etc.
I have tried creating new DynamoDB connections for each thread and sharing one connection instead, with no effect.
I have confirmed that adding more threads does not overwhelm the write capacity of DynamoDB, since it makes the same number of batch write requests and I don't have more unprocessed items throughout execution than with a single thread.
Threading should improve the execution time since the program is network bound, even though Python threads do not really run on multiple cores.
I have tried reading the entire file first, and then spawning all the threads, thinking that perhaps it's better to not interrupt the disk IO, but to no effect.
I have tried both the Thread library as well as the Process library.
Again I know this question is very theoretical so it's probably hard to see the source of the issue, but is there some Lambda quirk I'm not aware of? Is there something I else I can try to help diagnose the issue? Any help is appreciated.
Nate, have you completely ruled out a problem on the Dynamodb end? The total number of write requests may be the same, but the number per second would be different with a multi-thread.
The console has some useful graphs to show if your writes (or batch writes) are being throttled at all. If you don't have the right 'back off, retry' logic in your Lambda function, Lambda will just try and try again and your problem gets worse.
One other thing, which might have been obvious to you (but not me!). I was under the impression that batch_writes saved you money on the capacity planning front. (That 200 writes in batches of 20 would only cost you 10 write units, for example. I could have sworn I heard an AWS guy mention this in a presentation, but that's beside the point.)
In fact the batch_writes save you some time, but nothing economically.
One last thought: I'd bet that Lambda processing time is cheaper than upping your Dynamodb write capacity. If you're in no particular rush for Lambda to finish, why not let it run its course on single-thread?
Good luck!
Turns out that the threading is faster, but only when the file reached a certain file size. I was originally work on a file size of about 1/2 MG. With a 10 MG file, the threaded version came out about 50% faster. Still unsure why it wouldn't work with the smaller file, maybe it just needs time to get a'cooking, you know what I mean? Computers are moody things.
As a backdrop I have good experience with python and dynamoDB along with using python's multiprocessing library. Since your file size was fairly small it may have been the setup time of the process that confused you about performance. If you haven't already, use python multiprocessing pools and use map or imap depending on your use case if you need to communicate any data back to the main thread. Using a pool is the darn simpliest way to run multiple processes in python. If you need your application to run faster as a priority you may want to look into using golang concurrency and you could always build the code into binary to use from within python. Cheers.
I've just added pytest to an existing Django project - all unit tests are using Django's unittest subclasses, etc. We use an SQLite in-memory database for tests.
manage.py test takes rougly 80 seconds on our test suite
py.test takes the same
py.test -n1 (or -n4, or anything like that) takes roughly 1280 seconds.
I would expect an overhead to support the distribution, but obviously with -n4, it should be approximately 3-4 times faster on a large-ish test suite.
Findings...
I've so far traced the issue down to database access. The tests run quickly until they first hit the database, but at the first .save() call on a Django model, that test will be incredibly slow.
After some profiling on the workers, it looks like they are spending a lot of time waiting on locks, but I have no idea if that's a reliable finding or not.
I wondered if there was some sort of locking on the database, it was suggested to me that the in-memory SQLite database might be a memory mapped file, and that locking might be happening on that between the workers, but apparently each call to open an in-memory database with SQLite will return a completely separate instance.
As it stands, I've probably spent 5+ hours on this so far, and spoken at length to colleagues and others about this, and not yet found the issue. I have not been able to reproduce on a separate codebase.
What kind of things might cause this?
What more could I do to further track down the issue?
Thanks in advance for any ideas!
I am retrieving data from a remote web application and I store it in a local sqlite DB. From time to time I perform a commit() call on the connection, which results in committing about 50 inserts to 8 tables. There are about 1000000 records in most tables. I don't use explicit BEGIN and END commands.
I measured commit() call time and got 100-150 ms. But during the commit, my PC freezes for ~5-15 seconds. The INSERTs themselves are naive (i.e. one execute() call per insert) but are performed fast enough (their rate is limited by the speed of records retrieval, anyways, which is fairly low). I'm using Arch Linux x64 on a PC with AMD FX 6200 CPU, 8 GB RAM and a SATA HDD, Python 3.4.1, sqlite 3.8.4.3.
Does anyone have an idea why this could happen? I guess it has something to do with HDD caching. If so, is there something I could optimize?
UPD: switched to WAL and synchronous=1, no improvements.
UPD2: I have seriously underestimated number of INSERTs per commit. I measured it using sqlite3.Connection's total_changes property, and it appears there are 30000-60000 changes per commit. Is it possible to optimize inserts, or maybe it is about time to switch to postgres?
If the call itself is quick enough, as you say, it surely sounds like an IO problem. You could use tools such as iotop to check this maybe. If possible I would suggest that you divide the inserts into smaller and more frequent ones instead of large chunks. If that is not possible you should consider investing in an SSD disk instead of traditional harddisk, due to normally quicker write speeds.
There sure could be system parameters to investigate. You should at least make sure you mount your disk with noatime and nodiratime flags. You could also try data=writebackas parameter. See the following for more details:
https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
I have just written a script that is intended to be run 24/7 to update some files. However, if it takes 3 minutes to update one file, then it would take 300 minutes to update 100 files.
Is it possible to run n instances of the script to manage n separate files to speed up the turnaround time?
Yes it is possible. Use the multiprocessing module to start several concurrent processes. This has the advantage that you do not run into problems because of the Global Interpreter Lock and threads as is explained in the manual page. The manual page includes all the examples you will need to make your script execute in parallel. Of course this works best if the processes do not have to interact, which your example suggests.
I suggest you first find out if there is any way to reduce the 3 minutes in a single thread.
The method I use to discover speedup opportunities is demonstrated here.
That will also tell you if you are purely I/O bound.
If you are completely I/O bound, and all files are on a single disk, parallelism won't help.
In that case, possibly storing the files on a solid-state drive would help.
On the other hand, if you are CPU bound, parallelism will help, as #hochl said.
Regardless, find the speedup opportunities and fix them.
I've never seen any good-size program that didn't have one or several of them.
That will give you one speedup factor, and parallelism will give you another, and the total speedup will be the product of those two factors.
I have a number of records in the database I want to process. Basically, I want to run several regex substitution over tokens of the text string rows and at the end, and write them back to the database.
I wish to know whether does multiprocessing speeds up the time required to do such tasks.
I did a
multiprocessing.cpu_count
and it returns 8. I have tried something like
process = []
for i in range(4):
if i == 3:
limit = resultsSize - (3 * division)
else:
limit = division
#limit and offset indicates the subset of records the function would fetch in the db
p = Process(target=sub_table.processR,args=(limit,offset,i,))
p.start()
process.append(p)
offset += division + 1
for po in process:
po.join()
but apparently, the time taken is higher than the time required to run a single thread. Why is this so? Can someone please enlighten is this a suitable case or what am i doing wrong here?
Why is this so?
Can someone please enlighten in what cases does multiprocessing gives better performances?
Here's one trick.
Multiprocessing only helps when your bottleneck is a resource that's not shared.
A shared resource (like a database) will be pulled in 8 different directions, which has little real benefit.
To find a non-shared resource, you must have independent objects. Like a list that's already in memory.
If you want to work from a database, you need to get 8 things started which then do no more database work. So, a central query that distributes work to separate processors can sometimes be beneficial.
Or 8 different files. Note that the file system -- as a whole -- is a shared resource and some kinds of file access are involve sharing something like a disk drive or a directory.
Or a pipeline of 8 smaller steps. The standard unix pipeline trick query | process1 | process2 | process3 >file works better than almost anything else because each stage in the pipeline is completely independent.
Here's the other trick.
Your computer system (OS, devices, database, network, etc.) is so complex that simplistic theories won't explain performance at all. You need to (a) take several measurements and (b) try several different algorithms until you understand all the degrees of freedom.
A question like "Can someone please enlighten in what cases does multiprocessing gives better performances?" doesn't have a simple answer.
In order to have a simple answer, you'd need a much, much simpler operating system. Fewer devices. No database and no network, for example. Since your OS is complex, there's no simple answer to your question.
Here are a couple of questions:
In your processR function, does it slurp a large number of records from the database at one time, or is it fetching 1 row at a time? (Each row fetch will be very costly, performance wise.)
It may not work for your specific application, but since you are processing "everything", using database will likely be slower than a flat file. Databases are optimised for logical queries, not seqential processing. In your case, can you export the whole table column to a CSV file, process it, and then re-import the results?
Hope this helps.
In general, multicpu or multicore processing help most when your problem is CPU bound (i.e., spends most of its time with the CPU running as fast as it can).
From your description, you have an IO bound problem: It takes forever to get data from disk to the CPU (which is idle) and then the CPU operation is very fast (because it is so simple).
Thus, accelerating the CPU operation does not make a very big difference overall.