How to increase the speed of reading Excel files using a Python file object?

I am processing around 2,800 Excel files using a Python file object, and reading them is slow enough that my tool takes 5 hours to execute, so I want to know whether there is any way to make reading the Excel files faster.
The file-reading code:
import os

path = os.getcwd()
folder = os.path.join(path, "input")
files = os.listdir(folder)
for file in files:
    _input = os.path.join(folder, file)
    with open(_input) as f:
        data = f.read()

Try executing the processing of each Excel file in parallel with the others; have a look at:
Multiprocessing
Threading

Fundamentally, there are two things you can do: speed up the processing of each file, or process multiple files simultaneously. The best solution depends on why it is taking so long. You could start by checking whether the processing that happens on each file is as fast as it can be.
As for processing in parallel:
If a Python program is taking a long time to run because it's waiting for files to be read and written, it can help to use threading. This will allow one thread to process one file while another thread is waiting for its data to be read or written. Whether or not this will help depends on many factors. If the processing itself accounts for most of the time, it won't help. If file IO accounts for most of the time, it might help. Reading multiple files in parallel won't be faster than reading them sequentially if the hard drive is already serving them as fast as it can. Essentially, threading (in Python) only helps if the computer switches back and forth between waiting for the CPU to finish processing, and then waiting for the hard drive to write, and then waiting for the hard drive to read, etcetera. This is because of the Global Interpreter Lock in Python.
To work around the GIL, we need to use multi-processing, where Python actually launches multiple separate processes. This allows it to use more CPU resources, which can dramatically speed things up. It doesn't come for free, however. Each process takes a lot longer to start up than each thread, and they can't really share much in the way of resources so they will use more memory. Whether or not it's worth it depends on the task at hand.
The easiest (in my opinion) way to use multiple threads or processes in parallel is with the concurrent.futures library. Assuming we have some function that we want to run on each file:
def process_file(file_path):
    pass  # do stuff
Then we can run this sequentially:
for file_name in some_list_of_files:
    process_file(file_name)
... or in parallel either via threads:
import concurrent.futures

number_of_threads = 4
with concurrent.futures.ThreadPoolExecutor(number_of_threads) as executor:
    for file_name in some_list_of_files:
        executor.submit(process_file, file_name)
# leaving the with-block shuts the executor down and waits for all tasks
print("all done!")
Or with multiprocessing:
if __name__ == "__main__":
    number_of_processes = 4
    with concurrent.futures.ProcessPoolExecutor(number_of_processes) as executor:
        for file_name in some_list_of_files:
            executor.submit(process_file, file_name)
    # again, the with-block waits for all tasks before continuing
    print("All done!")
We need the if __name__ == "__main__" bit because the processes that we spin up will actually import the Python file (but the name won't be "__main__"), so we need to stop them from recursively redoing the same work.
Which is faster will depend entirely on the actual work that needs doing. Sometimes it's faster to just do it sequentially in the main thread like in "normal" code.

Related

Python processing items from list/queue and saving progress

If I have about 10+ million small tasks to process in Python (converting images and so on), how can I create a queue and save progress in case the processing crashes? To be clear: how can I save progress, or stop the processing whenever I want, and continue processing from the last point?
Also, how should I deal with multiple threads in that case?
In general, the question is how to save progress on processed data to a file. The issue is that with a huge number of very small files, saving the progress file after each iteration would take longer than the processing itself.
Thanks!
First of all, I would suggest not going for multi-threading; use multi-processing instead. Because of the GIL, multiple threads do not run in parallel in Python when the task is computation-intensive.
To solve the problem of saving results, use the following sequence:
1. Get the names of all the files in a list and divide the list into chunks.
2. Assign each process one chunk.
3. Every 1000 steps, append the names of the processed files to some file on disk (say monitor.txt), assuming you can afford to re-process up to 1000 files in case of failure.
4. In case of failure, have each process skip all the files that are already listed in its monitor.txt.
5. You can have monitor_1.txt, monitor_2.txt, ... for each process, so that no process has to read a monitor file shared by all of them.
The following gist might help you; you just need to add code for the 4th point, roughly as sketched below.
https://gist.github.com/rishibarve/ccab04b9d53c0106c6c3f690089d0229
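For points 3 and 4, a rough sketch of the checkpoint-and-skip idea, assuming one monitor file per worker process; process_file() and the task list here are placeholders for your own work:
import os
from multiprocessing import Pool

def process_file(name):
    pass  # stand-in for the real per-file work

def process_chunk(args):
    worker_id, chunk = args
    monitor = "monitor_%d.txt" % worker_id
    # Names this worker already finished in a previous (crashed) run.
    done = set()
    if os.path.exists(monitor):
        with open(monitor) as f:
            done = set(line.strip() for line in f)
    batch = []
    for name in chunk:
        if name in done:
            continue  # skip work finished before the crash
        process_file(name)
        batch.append(name)
        if len(batch) >= 1000:  # checkpoint every 1000 files
            with open(monitor, "a") as f:
                f.write("\n".join(batch) + "\n")
            batch = []
    if batch:  # flush the remainder
        with open(monitor, "a") as f:
            f.write("\n".join(batch) + "\n")

if __name__ == "__main__":
    file_names = ["task_%06d" % i for i in range(10000)]  # placeholder list
    n_workers = 4
    chunks = [(i, file_names[i::n_workers]) for i in range(n_workers)]
    with Pool(n_workers) as pool:
        pool.map(process_chunk, chunks)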
I/O operations like saving files are always relatively slow. If you have to process a large batch of files, you will be stuck with a long I/O time regardless of the number of threads you use.
The easiest approach is to use multithreading rather than multiprocessing, and let the OS's scheduler figure it all out. The docs have a good explanation of how to set up threads. A simple example would be
from threading import Thread

def process_data(file_name):
    # does the processing
    print(f'processed {file_name}')

if __name__ == '__main__':
    file_names = ['file_1', 'file_2']
    threads = [Thread(target=process_data, args=(file_name,)) for file_name in file_names]
    # here you start all the threads
    for thread in threads:
        thread.start()
    # here you wait for all threads to finish
    for thread in threads:
        thread.join()
One solution that might be faster is to create a separate process that does the I/O. You then use a multiprocessing.Queue to queue the results coming out of the data-processing side, and let the I/O process pick them up and write them out one after the other.
This way the I/O never has to rest, which will be close to optimal. I don't know if this will yield a big advantage over the threading based solution, but as is generally the case with concurrency, the best way to find out is to do some benchmarks with your own application.
One issue to watch out for is that if the data processing is much faster than the I/O, the queue can grow very large. This might have a performance impact, depending on your system among other things. A quick workaround is to pause the data processing if the queue gets too large.
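A rough sketch of that split, assuming the processing returns one string of output per file; the bounded queue also gives you the back-pressure mentioned above, since put() blocks once the queue is full:
from multiprocessing import Process, Queue

def writer(queue, out_path):
    # Dedicated I/O process: writes results one after the other.
    with open(out_path, 'w') as out:
        while True:
            item = queue.get()
            if item is None:  # sentinel: no more work
                break
            out.write(item + '\n')

def process_data(file_name):
    return f'processed {file_name}'  # stand-in for the real processing

if __name__ == '__main__':
    file_names = ['file_1', 'file_2']
    # maxsize bounds the queue, so the processing side pauses automatically
    # (put() blocks) whenever it gets too far ahead of the I/O.
    queue = Queue(maxsize=100)
    io_proc = Process(target=writer, args=(queue, 'results.txt'))
    io_proc.start()
    for name in file_names:
        queue.put(process_data(name))
    queue.put(None)  # tell the writer we are done
    io_proc.join()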
Remember to write all multiprocessing code in Python in a script with the
if __name__ == '__main__':
# mp code
guard, and be aware that some IDEs don't play nice with concurrent Python code. The safe bet is to test your code by executing it from a terminal.

What is the best way to load multiple files into memory in parallel using python 3.6?

I have 6 large files, each containing a dictionary object that I saved to disk with pickle. Loading all of them in sequential order takes about 600 seconds. I want to start loading all of them at the same time to speed up the process. Assuming they all have the same size, I would hope to load them in 100 seconds instead. I used multiprocessing and apply_async to load each of them separately, but it runs just like the sequential version. This is the code I used, and it doesn't work as intended.
The code is for 3 of these files, but it would be the same for six of them. I put the 3rd file on another hard disk to make sure the I/O is not the limiting factor.
def loadMaps():
    start = timeit.default_timer()
    procs = []
    pool = Pool(3)
    pool.apply_async(load1(),)
    pool.apply_async(load2(),)
    pool.apply_async(load3(),)
    pool.close()
    pool.join()
    stop = timeit.default_timer()
    print('loadFiles takes in %.1f seconds' % (stop - start))
If your code is primarily limited by IO and the files are on multiple disks, you might be able to speed it up using threads:
import concurrent.futures
import pickle

def read_one(fname):
    with open(fname, 'rb') as f:
        return pickle.load(f)

def read_parallel(file_names):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(read_one, f) for f in file_names]
        return [fut.result() for fut in futures]
The GIL will not force the IO operations to run serially, because Python consistently releases it when doing IO.
Several remarks on alternatives:
multiprocessing is unlikely to help because, while it guarantees to do its work in multiple processes (and therefore free of the GIL), it also requires the content to be transferred between the subprocess and the main process, which takes additional time.
asyncio will not help you at all because it doesn't natively support asynchronous file system access (and neither do the popular OSes). While it can emulate it with threads, the effect is the same as the code above, only with much more ceremony.
Neither option will speed up loading the six files by a factor of six. Consider that at least some of the time is spent creating the dictionaries, which is serialized by the GIL. If you want to really speed up startup, a better approach is not to create the whole dictionary upfront at all, but to switch to an in-file database, possibly using a dictionary only to cache access to its content.
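As an illustration of that last suggestion, one possible direction is the standard-library shelve module, which keeps the mapping on disk and only loads the values you actually ask for; the file name and key below are made up:
import shelve

# One-time conversion: copy the entries of a pickled dict into a shelf on disk.
with shelve.open('map1.db') as db:
    db['some_key'] = {'example': 'value'}  # db[k] = v for each item of the old dict

# At run time, open the shelf and look up keys lazily instead of loading
# the whole dictionary into memory at startup.
with shelve.open('map1.db') as db:
    print(db.get('some_key'))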

What's the Best Way to Schedule and Manage Multiple Processes in Python 3

I'm working on a project in Python 3 that involves reading lines from a text file, manipulating those lines in some way, and then writing the results of said manipulation into another text file. Implementing that flow in a serial way is trivial.
However, running every step serially takes a long time (I'm working on text files that are several hundred megabytes/several gigabytes in size). I thought about breaking up the process into multiple, actual system processes. Based on the recommended best practices, I'm going to use Python's multiprocessing library.
Ideally, there should be one and only one Process to read from and write to the text files. The manipulation part, however, is where I'm running into issues.
When the "reader process" reads a line from the initial text file, it places that line in a Queue. The "manipulation processes" then pull from that line from the Queue, do their thing, then put the result into yet another Queue, which the "writer process" then takes and writes to another text file. As it stands right now, the manipulation processes simply check to see if the "reader Queue" has data in it, and if it does, they get() the data from the Queue and do their thing. However, those processes may be running before the reader process runs, thus causing the program to stall.
What, in your opinions, would be the "Best Way" to schedule the processes in such a way so the manipulation processes won't run until the reader process has put data into the Queue, and vice-versa with the writer process? I considered firing off custom signals, but I'm not sure if that's the most appropriate way forward. Any help will be greatly appreciated!
If I were you, I would separate the task of dividing your file into tractable chunks from the compute-intensive manipulation part. If that is not possible (for example, if lines are not independent for some reason), then you might have to do a purely serial implementation anyway.
Once you have N chunks in separate files, you can just start your serial manipulation script N times, once for each chunk. Afterwards, combine the output back into one file. If you do it this way, no queue is needed and you will save yourself some work.
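A sketch of that chunk-and-run-N-copies idea, assuming the input has already been split into chunk_0.txt ... chunk_3.txt (for example with the Unix split command) and using a trivial stand-in for the real manipulation:
from multiprocessing import Pool

def process_chunk(paths):
    in_path, out_path = paths
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            dst.write(line.upper())  # placeholder manipulation

if __name__ == '__main__':
    jobs = [(f'chunk_{i}.txt', f'out_{i}.txt') for i in range(4)]
    with Pool(len(jobs)) as pool:
        pool.map(process_chunk, jobs)
    # Combine the outputs back into one file, in order.
    with open('result.txt', 'w') as result:
        for _, out_path in jobs:
            with open(out_path) as part:
                result.write(part.read())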
You're describing a task queue. Celery is a task queue: http://www.celeryproject.org/

Multithreaded MD5 Checksum in Python

I have a python script that recursively walks a specified directory, and checksums each file it finds. It then writes a log file which lists all file paths and their md5 checksums.
Sequentially, this takes a long time for 50,000 files at 15 MB each. However, my computer has far more resources available than it's actually using. How can I adjust my approach so that the script uses more resources and executes faster?
For example, could I split my file list into thirds and run a thread for each, giving me a 3x runtime?
I'm not very comfortable with threading, and I hope someone wouldn't mind whipping up an example for my case.
Here's the code for my sequential md5 loop:
for (root, dirs, files) in os.walk(root_path):
    for filename in files:
        file_path = root + "/" + filename
        md5_pairs.append([file_path, md5file(file_path, 128)])
Thanks for your help in advance!
For this kind of work, I think multiprocessing.Pool would give you fewer surprises -
check the examples and docs at http://docs.python.org/library/multiprocessing.html
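For example, a minimal multiprocessing.Pool version might look like this; root_path, the block size, and the md5 helper are stand-ins for the original md5file() and directory:
import hashlib
import os
from multiprocessing import Pool

def md5_of_file(file_path, block_size=128 * 1024):
    md5 = hashlib.md5()
    with open(file_path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            md5.update(block)
    return [file_path, md5.hexdigest()]

if __name__ == '__main__':
    root_path = '.'  # illustrative root directory
    all_files = [os.path.join(root, name)
                 for root, dirs, files in os.walk(root_path)
                 for name in files]
    with Pool() as pool:  # defaults to one worker process per CPU
        md5_pairs = pool.map(md5_of_file, all_files)
    for path, digest in md5_pairs:
        print(digest, path)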
If you're going to use threads, you need to first initiate your threads and have them poll work off a Queue.Queue instance. Then in your main thread, run through the for-loop you have, but instead of calling md5file(..), push all the arguments on the Queue.Queue. Threading / Queue in Python has an example, but look at the docs as well: http://docs.python.org/library/queue.html
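A thread-and-queue sketch of what that describes, using the Python 3 queue module (Queue.Queue in Python 2); the hashing helper and the thread count are illustrative:
import hashlib
import os
import queue
import threading

def md5_of_file(file_path, block_size=128 * 1024):
    md5 = hashlib.md5()
    with open(file_path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            md5.update(block)
    return [file_path, md5.hexdigest()]

def worker(work_queue, md5_pairs):
    while True:
        file_path = work_queue.get()
        if file_path is None:  # sentinel: shut this worker down
            break
        md5_pairs.append(md5_of_file(file_path))
        work_queue.task_done()

if __name__ == '__main__':
    work_queue = queue.Queue()
    md5_pairs = []  # appending to a list is safe under the GIL
    threads = [threading.Thread(target=worker, args=(work_queue, md5_pairs))
               for _ in range(3)]
    for t in threads:
        t.start()
    for root, dirs, files in os.walk('.'):
        for filename in files:
            work_queue.put(os.path.join(root, filename))
    for _ in threads:
        work_queue.put(None)  # one sentinel per worker
    for t in threads:
        t.join()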
Threads would not be very helpful due to the GIL (Global Interpreter Lock): your application would never execute more than one call to the md5.update function at the same time. I would continue trying to improve your process pool approach.
Go embarrassingly parallel and start a process for a chunk of files. We do this on clusters. You can have dozens or hundreds of processes each md5ing a few dozen files. At that point, disk IO will be your bottleneck.

Python multiprocessing question

I'm trying to think of the best way to code 2 processes that have to run in parallel. I'm not even sure if multiprocessing is the preferred module.
I am generating a lot of data over a long period of time with a dataCollector, but I would like to check the data periodically with a dataChecker while the dataCollector keeps running. In my mind there are 2 significant moments: one, the point at which the dataCollector dumps a file and begins writing another, which is also when the dataChecker should start analyzing the dumped file; and two, the point at which the dataChecker finishes and begins waiting for the dataCollector again.
Can someone suggest a general outline with the multiprocessing module? Should I be using a different module? Thanks
Why would you use any module at all? This is simple to do by having two separate processes that start at the same time. The dataChecker would list all files in a directory, count them, and sleep for a short time (several seconds or more). Then it would do it again and if the number of files changes, it opens the new ones, reads them and processes them.
The synchronisation of the two processes would be done entirely through mailboxes, implemented as a directory with files in it. A message counts as received only once the dataCollector has started writing the next one, so the dataChecker never reads a half-written file.
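A toy sketch of the dataChecker side of that scheme; the directory name, the polling interval, and the analysis step are all stand-ins:
import os
import time

WATCH_DIR = 'dumps'  # directory the dataCollector writes finished files into

def check_file(path):
    print('checking', path)  # stand-in for the real analysis

def data_checker(poll_seconds=5):
    os.makedirs(WATCH_DIR, exist_ok=True)
    seen = set()
    while True:
        current = set(os.listdir(WATCH_DIR))
        # Any file that appeared since the last poll is a new "message".
        # In practice, only pick a dump up once the collector has moved on
        # to writing the next one (or has renamed it from a temporary name).
        for name in sorted(current - seen):
            check_file(os.path.join(WATCH_DIR, name))
        seen = current
        time.sleep(poll_seconds)

if __name__ == '__main__':
    data_checker()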
