Multithreaded MD5 Checksum in Python

I have a Python script that recursively walks a specified directory and checksums each file it finds. It then writes a log file which lists all file paths and their md5 checksums.
Sequentially, this takes a long time for 50,000 files at 15 MB each. However, my computer has far more resources available than it's actually using. How can I adjust my approach so that the script uses more resources and executes faster?
For example, could I split my file list into thirds and run a thread for each, giving me roughly a 3x speedup?
I'm not very comfortable with threading, and I hope someone wouldn't mind whipping up an example for my case.
Here's the code for my sequential md5 loop:
for (root, dirs, files) in os.walk(root_path):
    for filename in files:
        file_path = root + "/" + filename
        md5_pairs.append([file_path, md5file(file_path, 128)])
Thanks for your help in advance!

For this kind of work, I think multiprocessing.Pool would give you fewer surprises -
check the examples and docs at http://docs.python.org/library/multiprocessing.html
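As a hedged sketch of what the Pool approach could look like (written for current Python 3; root_path is a placeholder, and the question's md5file() helper is replaced here by a plain hashlib loop):

import hashlib
import os
from multiprocessing import Pool

def md5_of_file(file_path, block_size=128 * 1024):
    """Hash one file in chunks and return (path, hex digest)."""
    md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(block_size), b""):
            md5.update(chunk)
    return file_path, md5.hexdigest()

if __name__ == "__main__":
    root_path = "/some/root"                      # placeholder
    file_paths = [os.path.join(root, name)
                  for root, dirs, files in os.walk(root_path)
                  for name in files]
    with Pool() as pool:                          # defaults to one worker per CPU
        md5_pairs = pool.map(md5_of_file, file_paths)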

If you're going to use threads, you need to first start your threads and have them poll work off a Queue.Queue instance. Then, in your main thread, run through the for-loop you have, but instead of calling md5file(...), push all the arguments onto the Queue.Queue. Threading / Queue in Python has an example, but look at the docs as well: http://docs.python.org/library/queue.html
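A rough sketch of that pattern in current Python (the queue module was called Queue in Python 2, which the linked docs refer to); root_path and the stand-in md5file() below are placeholders:

import hashlib
import os
import queue
import threading

def md5file(file_path, block_size):
    # stand-in for the question's md5file() helper
    md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(block_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

work_queue = queue.Queue()
md5_pairs = []
pairs_lock = threading.Lock()

def worker():
    while True:
        file_path = work_queue.get()
        if file_path is None:              # sentinel: no more work
            break
        digest = md5file(file_path, 128)
        with pairs_lock:
            md5_pairs.append([file_path, digest])

num_threads = 4
threads = [threading.Thread(target=worker) for _ in range(num_threads)]
for t in threads:
    t.start()

root_path = "/some/root"                   # placeholder
for root, dirs, files in os.walk(root_path):
    for filename in files:
        work_queue.put(os.path.join(root, filename))

for _ in threads:
    work_queue.put(None)                   # one sentinel per worker thread
for t in threads:
    t.join()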

Threads would not be very helpful due to the GIL (Global Interpreter Lock): your application would never execute more than one call to the md5.update function at the same time. I would instead keep trying to optimize and improve a process-pool approach.

Go embarrassingly parallel and start a process for a chunk of files. We do this on clusters. You can have dozens or hundreds of processes each md5ing a few dozen files. At that point, disk IO will be your bottleneck.
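As a hedged illustration of the chunk-per-process idea, each worker below checksums its own slice of the file list and writes a partial log that you would concatenate afterwards; the root path, process count and log names are placeholders:

import hashlib
import os
from multiprocessing import Process

def md5_of_file(path, block_size=128 * 1024):
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

def checksum_chunk(paths, log_path):
    # each process writes its own partial log file
    with open(log_path, "w") as log:
        for path in paths:
            log.write("%s %s\n" % (path, md5_of_file(path)))

if __name__ == "__main__":
    root_path = "/some/root"               # placeholder
    all_files = [os.path.join(root, name)
                 for root, dirs, files in os.walk(root_path)
                 for name in files]
    num_procs = 8                          # tune to your machine; disk IO is the limit
    chunks = [all_files[i::num_procs] for i in range(num_procs)]
    procs = [Process(target=checksum_chunk, args=(chunk, "md5_%d.log" % i))
             for i, chunk in enumerate(chunks)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()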

Related

How to increase speed of reading excel file using python file object?

I am processing around 2,800 Excel files using a Python file object, which takes a long time to read; because of that, my tool takes 5 hours to execute. I want to know whether there is any way to make reading the Excel files faster.
Code for reading the Excel files:
import os

path = os.getcwd()
folder = path + "\\input"
files = os.listdir(folder)
for file in files:
    _input = folder + '\\' + file
    f = open(_input)
    data = f.read()
Try executing the processing of each Excel file in parallel with the others; have a look at:
Multiprocessing
Threading
Fundamentally, there are two things you can do: either speed up the processing of each file, or process multiple files simultaneously. The best solution depends on why it is taking so long. You could start by checking whether the processing that happens on each file is as fast as it can be.
As for processing in parallel:
If a Python program is taking a long time to run because it's waiting for files to be read and written, it can help to use threading. This will allow one thread to process one file while another thread is waiting for its data to be read or written. Whether or not this will help depends on many factors. If the processing itself accounts for most of the time, it won't help. If file IO accounts for most of the time, it might help. Reading multiple files in parallel won't be faster than reading them sequentially if the hard drive is already serving them as fast as it can. Essentially, threading (in Python) only helps if the computer switches back and forth between waiting for the CPU to finish processing, and then waiting for the hard drive to write, and then waiting for the hard drive to read, etcetera. This is because of the Global Interpreter Lock in Python.
To work around the GIL, we need to use multi-processing, where Python actually launches multiple separate processes. This allows it to use more CPU resources, which can dramatically speed things up. It doesn't come for free, however. Each process takes a lot longer to start up than each thread, and they can't really share much in the way of resources so they will use more memory. Whether or not it's worth it depends on the task at hand.
The easiest (in my opinion) way to use multiple threads or processes in parallel is to use the concurrent.futures library. Assuming we have some function that we want to run on each file:
def process_file(file_path):
    pass  # do stuff
Then we can run this sequentially:
for file_name in some_list_of_files:
    process_file(file_name)
... or in parallel either via threads:
import concurrent.futures

number_of_threads = 4
with concurrent.futures.ThreadPoolExecutor(number_of_threads) as executor:
    for file_name in some_array_of_files:
        executor.submit(process_file, file_name)
    executor.shutdown()  # wait for every submitted task to finish
print("all done!")
Or with multiprocessing:
if __name__ == "__main__":
    number_of_processes = 4
    with concurrent.futures.ProcessPoolExecutor(number_of_processes) as executor:
        for file_name in some_array_of_files:
            executor.submit(process_file, file_name)
        executor.shutdown()  # wait for every submitted task to finish
    print("All done!")
We need the if __name__ == "__main__" bit because the processes that we spin up will actually import the Python file (but the name won't be "__main__"), so we need to stop them from recursively redoing the same work.
Which is faster will depend entirely on the actual work that needs doing. Sometimes it's faster to just do it sequentially in the main thread like in "normal" code.

Python processing items from list/queue and saving progress

If I have 10+ million little tasks to process in Python (converting images or the like), how can I create a queue and save my progress in case of a crash during processing? To be clear: how can I save progress, or stop the process whenever I want, and continue processing from the last point?
Also, how do I deal with multiple threads in that case?
In general, the question is how to save progress on processed data to a file. The issue is that with a huge number of very small files, saving a file after each iteration would take longer than the processing itself...
Thanks!
(sorry for my English if it's not clear)
First of all, I would suggest not going for multithreading; use multiprocessing instead. Due to the GIL, multiple threads in Python do not give you any real parallelism for computation-intensive tasks.
To solve the problem of saving results, use the following sequence:
1. Get the names of all the files in a list and divide the list into chunks.
2. Assign each process one chunk.
3. Append the names of processed files after every 1000 steps to some file (say monitor.txt) on disk (assuming that you can afford to re-process up to 1000 files in case of a failure).
4. In case of failure, skip all the files which are saved in monitor.txt for each process.
5. You can have monitor_1.txt, monitor_2.txt, ... for each process, so you will not have to read the whole file for each process.
The following gist might help you; you just need to add code for the 4th point (a rough sketch of the whole sequence is shown after the link).
https://gist.github.com/rishibarve/ccab04b9d53c0106c6c3f690089d0229
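A rough sketch of that sequence (the process_file() stub, the chunking and the monitor file names below are placeholders, not code from the gist):

import os
from multiprocessing import Process

def process_file(path):
    pass  # placeholder for the real per-file work

def worker(chunk, monitor_path, checkpoint_every=1000):
    done = set()
    if os.path.exists(monitor_path):
        with open(monitor_path) as f:
            done = set(line.strip() for line in f)   # skip already-processed files
    processed = []
    for path in chunk:
        if path in done:
            continue
        process_file(path)
        processed.append(path)
        if len(processed) >= checkpoint_every:       # append progress in batches
            with open(monitor_path, "a") as f:
                f.write("\n".join(processed) + "\n")
            processed = []
    if processed:
        with open(monitor_path, "a") as f:
            f.write("\n".join(processed) + "\n")

if __name__ == "__main__":
    all_files = ["a.png", "b.png"]                   # placeholder list of tasks
    num_procs = 4
    chunks = [all_files[i::num_procs] for i in range(num_procs)]
    procs = [Process(target=worker, args=(chunk, "monitor_%d.txt" % i))
             for i, chunk in enumerate(chunks)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()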
I/O operations like saving files are always relatively slow. If you have to process a large batch of files, you will be stuck with a long I/O time regardless of the number of threads you use.
The easiest approach is to use multithreading rather than multiprocessing, and let the OS's scheduler figure it all out. The docs have a good explanation of how to set up threads. A simple example would be:
from threading import Thread

def process_data(file_name):
    # does the processing
    print(f'processed {file_name}')

if __name__ == '__main__':
    file_names = ['file_1', 'file_2']
    threads = [Thread(target=process_data, args=(file_name,)) for file_name in file_names]
    # here you start all the threads
    for thread in threads:
        thread.start()
    # here you wait for all the threads to finish
    for thread in threads:
        thread.join()
One solution that might be faster is to create a separate process that does the I/O. Then you use a multiprocessing.Queue to queue the processed files from the data-processing side, and let the I/O process pick these up and handle them one after the other.
This way the I/O never has to rest, which will be close to optimal. I don't know if this will yield a big advantage over the threading-based solution, but as is generally the case with concurrency, the best way to find out is to do some benchmarks with your own application.
One issue to watch out for is that if the data processing is much faster, then the Queue can grow very big. This might have a performance impact, depending on your system amongst other things. A quick workaround is to pause the data processing if the queue gets too large.
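A minimal sketch of that layout, with a hypothetical process_item() producing results and save_result() doing the slow writes; the bounded queue is one way to implement the "pause if the queue gets too large" workaround:

from multiprocessing import Process, Queue

def process_item(item):
    return item                    # placeholder for the real processing

def save_result(result):
    pass                           # placeholder for the slow write

def writer(queue):
    while True:
        result = queue.get()
        if result is None:         # sentinel: processing is finished
            break
        save_result(result)

if __name__ == '__main__':
    items = range(100)             # placeholder work list
    queue = Queue(maxsize=1000)    # bounded, so processing pauses if I/O falls behind
    io_proc = Process(target=writer, args=(queue,))
    io_proc.start()
    for item in items:
        queue.put(process_item(item))   # blocks while the queue is full
    queue.put(None)
    io_proc.join()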
Remember to write all multiprocessing code in Python in a script with the
if __name__ == '__main__':
# mp code
guard, and be aware that some IDEs don't play nice with concurrent Python code. The safe bet is to test your code by executing it from a terminal.

Slower execution of AWS Lambda batch-writes to DynamoDB with multiple threads

Disclaimer: I know this question will annoy some people because it's vague, theoretical, and has little code.
I have an AWS Lambda function in Python which reads a file of denormalized records off S3, formats its contents correctly, and then uploads that to DynamoDB with a batch write. It all works as advertised. I then tried to break up the uploading part of this pipeline into threads with the hope of more efficiently utilizing DynamoDB's write capacity. However, the multithread version is slower by about 50%. Since the code is very long, I have included pseudocode.
NUM_THREADS = 4
for every line in the file:
    Add line to list of lines
    if we've read enough lines for a single thread:
        Create thread that uploads list of lines
        thread.start()
        clear list of lines.
for every thread started:
    thread.join()
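For concreteness, a hedged sketch of what that pseudocode might look like with boto3 and a thread pool (the table name, the format_record() stub and the chunk size are hypothetical, and the thread pool stands in for the hand-rolled thread management described above):

import concurrent.futures
import boto3

LINES_PER_THREAD = 500                       # hypothetical chunk size
NUM_THREADS = 4

def format_record(line):
    return {"id": line.strip()}              # hypothetical: map one line to one item

def upload_lines(lines):
    table = boto3.resource("dynamodb").Table("my-table")   # hypothetical table name
    with table.batch_writer() as batch:                    # handles batching and retries
        for line in lines:
            batch.put_item(Item=format_record(line))

def upload_file(lines):
    with concurrent.futures.ThreadPoolExecutor(NUM_THREADS) as executor:
        chunk = []
        for line in lines:
            chunk.append(line)
            if len(chunk) == LINES_PER_THREAD:
                executor.submit(upload_lines, chunk)
                chunk = []
        if chunk:
            executor.submit(upload_lines, chunk)
    # leaving the with-block waits for all uploads, like the thread.join() loop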
Important notes and possible sources of the problem I've checked so far:
When testing this locally using DynamoDB Local, threading does make my program run faster.
If instead I use only 1 thread, or even if I use multiple threads but I join the thread right after I start it (effectively single threaded), the program completes much quicker. With 1 thread ~30s, multi thread ~45s.
I have no shared memory between threads, no locks, etc.
I have tried creating new DynamoDB connections for each thread and sharing one connection instead, with no effect.
I have confirmed that adding more threads does not overwhelm the write capacity of DynamoDB, since it makes the same number of batch write requests and I don't have more unprocessed items throughout execution than with a single thread.
Threading should improve the execution time since the program is network bound, even though Python threads do not really run on multiple cores.
I have tried reading the entire file first, and then spawning all the threads, thinking that perhaps it's better to not interrupt the disk IO, but to no effect.
I have tried both the Thread library as well as the Process library.
Again I know this question is very theoretical so it's probably hard to see the source of the issue, but is there some Lambda quirk I'm not aware of? Is there something I else I can try to help diagnose the issue? Any help is appreciated.
Nate, have you completely ruled out a problem on the DynamoDB end? The total number of write requests may be the same, but the number per second would be different with multiple threads.
The console has some useful graphs to show if your writes (or batch writes) are being throttled at all. If you don't have the right 'back off, retry' logic in your Lambda function, Lambda will just try and try again and your problem gets worse.
One other thing, which might have been obvious to you (but not me!). I was under the impression that batch_writes saved you money on the capacity planning front. (That 200 writes in batches of 20 would only cost you 10 write units, for example. I could have sworn I heard an AWS guy mention this in a presentation, but that's beside the point.)
In fact the batch_writes save you some time, but nothing economically.
One last thought: I'd bet that Lambda processing time is cheaper than upping your Dynamodb write capacity. If you're in no particular rush for Lambda to finish, why not let it run its course on single-thread?
Good luck!
Turns out that the threading is faster, but only once the file reaches a certain size. I was originally working on a file of about 1/2 MB. With a 10 MB file, the threaded version came out about 50% faster. Still unsure why it wouldn't work with the smaller file, maybe it just needs time to get a'cooking, you know what I mean? Computers are moody things.
As a backdrop, I have good experience with Python and DynamoDB, along with using Python's multiprocessing library. Since your file size was fairly small, it may have been the setup time of the processes that confused you about performance. If you haven't already, use Python multiprocessing pools, and use map or imap depending on your use case if you need to communicate any data back to the main process. Using a pool is the darn simplest way to run multiple processes in Python. If you need your application to run faster as a priority, you may want to look into using Go's concurrency, and you could always build that code into a binary to use from within Python. Cheers.
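As a generic illustration of the pool-plus-imap suggestion (outside of any Lambda specifics; upload_chunk() and the chunk list are placeholders):

from multiprocessing import Pool

def upload_chunk(lines):
    # placeholder for the real batch-write logic
    return len(lines)

if __name__ == "__main__":
    chunks = [["line 1", "line 2"], ["line 3", "line 4"]]   # placeholder chunks
    with Pool() as pool:
        # imap streams each chunk's result back to the main process as it finishes
        for written in pool.imap(upload_chunk, chunks):
            print("uploaded", written, "records")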

Multiple threads reading from single folder on Linux

My project needs multiple threads reading files from the same folder. This folder has incoming files, and each file should only be processed by one of those threads. Later, this file-reading thread deletes the file after processing it.
EDIT after the first answer: I don't want a single thread in charge of reading filenames and feeding those names to other threads so that they can read them.
Is there any efficient way of achieving this in Python?
You should probably use the Queue module. From the docs:
The Queue module implements multi-producer, multi-consumer queues. It is especially useful in threaded programming when information must be exchanged safely between multiple threads.
I would use a FIFO approach, with a thread in charge of checking for inbound files and queuing them, and a number of workers processing them. A LIFO approach or an approach in which priority is assigned with a custom method are also supported by the module.
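A minimal sketch of that layout in current Python (the queue module was named Queue in Python 2); the folder path, the one-second poll and process_file() are placeholders, and here the main thread plays the role of the queuing thread:

import os
import queue
import threading
import time

incoming = "/path/to/incoming"            # placeholder folder
work_queue = queue.Queue()

def process_file(path):
    pass                                  # placeholder for the real work

def worker():
    while True:
        path = work_queue.get()
        process_file(path)
        os.remove(path)                   # the thread that processed it deletes it

for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

seen = set()
while True:                               # main thread checks for inbound files
    for name in os.listdir(incoming):
        path = os.path.join(incoming, name)
        if path not in seen:
            seen.add(path)
            work_queue.put(path)          # only one thread ever queues a file
    time.sleep(1)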
EDIT: If you don't want to use the Queue module and you are on a *nix system, you could use fcntl.lockf instead. An alternative is opening the files with os.open('filename', os.O_EXLOCK).
Depending on how often you perform this operation, you might find it less performant than using Queue, as you will have to account for race conditions (i.e. you might acquire the name of the file to open, but the file might get locked by another thread before you get a chance to open it, throwing an exception that you will have to trap). Queue is there for a reason! ;)
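A hedged sketch of that locking alternative (process_file() is a placeholder; the existence check after locking is one possible way to handle the race described above):

import fcntl
import os

def process_file(f):
    pass                                               # placeholder for the real work

def try_claim_and_process(path):
    try:
        f = open(path, "rb")
    except FileNotFoundError:
        return False                                   # another thread already removed it
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)  # raises OSError if already claimed
    except OSError:
        f.close()
        return False
    try:
        if not os.path.exists(path):                   # claimed and deleted before we locked
            return False
        process_file(f)
        os.remove(path)                                # delete while still holding the lock
        return True
    finally:
        f.close()                                      # closing also releases the lock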
EDIT2: Comments in this and other questions are bringing up the problem of simultaneous disk access to different files and the consequent performance hit. I was thinking that task_done would have been used for preventing this, but reading others' comments it occurred to me that instead of queuing file names, one could queue the files' content directly. This second alternative would work only for a limited number of queued files of limited size, given that RAM would fill up rather quickly otherwise.
I don't know whether RAID and other parallel disk configurations would already take care of reading one file per disk rather than bouncing back and forth between two files on both disks.
HTH!
If you want multiple threads to read several files in parallel directly from the same folder, then I must disappoint you. Reading in parallel from a single disk is not a viable option. A single disk needs to spin and seek the next location to be read. If you're reading with multiple threads, you are just bouncing the disk around between seeks and the performance is much worse than a simple sequential read.
Just stick to mac's advice and use a single thread for reading.

Is using a giant OR regex inefficient in Python?

Simple question: is using a giant OR regex inefficient in Python? I am building a script to search for bad files. I have a source file that contains 50 or so "signatures" so far. The list is in the form of:
Djfsid
LJflsdflsdf
fjlsdlf
fsdf
.
.
.
There are no real "consistencies", so optimizing the list by removing "duplicates" or checking whether one entry is a substring of another won't do much.
I basically want to os.walk down a directory, open each file, check for the signatures, close it and move on.
To speed things up, I will break the list up into 50/N different sublists, where N is the number of cores, and have a thread do work on a few entries of the list.
Would using a giant regex re.search('(entry1|entry2|entry3....|entryK)', FILE_CONTENTS) or a giant for i in xrange(0,NUM_SUBENTRIES)... if subentry[i] in FILE_CONTENTS... loop be better?
Also, is this a good way to multithread? This is Unix, so multiple threads can work on the same file at the same time. Will disk access basically bottleneck me to the point where multithreading is useless?
Also is this a good way to multithread?
Not really.
Will disk access basically bottleneck me to the point where multithreading is useless?
Correct.
You might want to look closely at multiprocessing.
A worker Process should do the os.walk and put the file names into a Queue.
A pool of worker Process instances. Each will get a file name from the Queue, open it, check the signature and enqueue results into a "good" Queue and a "bad" Queue. Create as many of these as it takes to make the CPU 100% busy.
Another worker Process instance can dequeue good entries and log them.
Another worker Process instance can dequeue bad entries and delete or rename or whatever is supposed to happen. This can interfere with the os.walk. A possibility is to log these into a "do this next" file which is processed after the os.walk is finished.
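A rough sketch of that layout (the signatures, root path and logging are placeholders; the signature check here uses a single compiled alternation built with re.escape, as asked about in the question, and each scanner sends one sentinel to each result queue when it is done):

import os
import re
from multiprocessing import Process, Queue, cpu_count

SIGNATURES = [b"Djfsid", b"LJflsdflsdf", b"fjlsdlf"]            # placeholder signatures
PATTERN = re.compile(b"|".join(re.escape(s) for s in SIGNATURES))

def walker(root, file_queue, num_workers):
    # one worker process does the os.walk and feeds file names to the scanners
    for dirpath, dirs, files in os.walk(root):
        for name in files:
            file_queue.put(os.path.join(dirpath, name))
    for _ in range(num_workers):
        file_queue.put(None)                                    # one sentinel per scanner

def scanner(file_queue, good_queue, bad_queue):
    while True:
        path = file_queue.get()
        if path is None:
            good_queue.put(None)
            bad_queue.put(None)
            break
        with open(path, "rb") as f:
            contents = f.read()
        (bad_queue if PATTERN.search(contents) else good_queue).put(path)

def logger(result_queue, log_path, num_workers):
    # dequeues results and logs them; deferring deletes/renames avoids fighting os.walk
    finished = 0
    with open(log_path, "w") as log:
        while finished < num_workers:
            path = result_queue.get()
            if path is None:
                finished += 1
            else:
                log.write(path + "\n")

if __name__ == "__main__":
    num_workers = cpu_count()
    file_queue, good_queue, bad_queue = Queue(), Queue(), Queue()
    procs = [Process(target=walker, args=("/some/root", file_queue, num_workers)),
             Process(target=logger, args=(good_queue, "good.log", num_workers)),
             Process(target=logger, args=(bad_queue, "bad.log", num_workers))]
    procs += [Process(target=scanner, args=(file_queue, good_queue, bad_queue))
              for _ in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()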
It would depend on the machine you are using. If you use the machine's maximum capacity, it will slow down, of course. I think the best way to find out is to try.
Don't worry about optimisation.
50 data points is tiny compared to what your computer can manage, so you'll probably waste a lot of your time, and make your program more complicated.
