I am trying to download zip files that each contain a csv file.
On one hand I have more than 3000 URLs, and therefore 3000 files. The code below took between 1 and 2 hours to run. The total size of the zip files is 40 GB; unzipped, 230 GB.
On the other hand there is also another set of URLs in the hundreds of thousands. Looking at how long it took to process the previous set of URLs, is there something I can do to improve this code?
Should I make it all in one function?
I have the possibility to run this on a Spark cluster.
# URLs are in a list called links.
# raw_zip_filepaths is a list built from ls on the folder of raw zip files.
import requests
import zipfile

def download_zip_files(x, base_url, filepath):
    r = requests.get(x)
    status_code = r.status_code
    filepath = str(x.replace(base_url, filepath))
    with open(filepath, "wb") as file:
        file.write(r.content)

def extract_zip_files(x, basepath, exportpath):
    path = str(basepath + x[1])
    with zipfile.ZipFile(path, "r") as zip_ref:
        zip_ref.extractall(exportpath)

list(map(lambda x: download_zip_files(x, base_url, filepath), links))
list(map(lambda x: extract_zip_files(x, basepath, exportpath), raw_zip_filepaths))
What you need is multithreading. A quick Google definition is as follows:
Multithreading is a CPU (central processing unit) feature that allows two or more instruction threads to execute independently while sharing the same processor resources. A thread is a self-contained sequence of instructions that can execute in parallel with other threads that are part of the same root process.
Most of the time, the programs we write execute in a single-threaded manner (same as yours): each instruction in the program executes in sequence, which is slow in a case like yours. To run a program in parallel through multiple threads, see the following examples.
Single thread
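The original snippet is not reproduced here; a minimal sketch, assuming a hypothetical print_words() helper, might look like this:

import time

def print_words(word, count):
    for _ in range(count):
        print(word)
        time.sleep(0.5)

# Single-threaded: the second call only starts after the first one has finished.
print_words("hello", 3)
print_words("world", 3)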
Multi-thread approach
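And the same sketch run on two threads, so both calls make progress at the same time:

import threading
import time

def print_words(word, count):
    for _ in range(count):
        print(word)
        time.sleep(0.5)

# Multi-threaded: both calls run in parallel with different parameters.
t1 = threading.Thread(target=print_words, args=("hello", 3))
t2 = threading.Thread(target=print_words, args=("world", 3))
t1.start()
t2.start()
t1.join()
t2.join()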
Please consider the output of both single-threaded and multi-threaded examples.
In the multi-threaded example, the program executes in parallel through threads. In simple words, the print_words() function runs twice in parallel (at the same time) with different parameters.
Let's come to your example:
You can divide your URL list into multiple URL lists and give each thread one list of URLs. See the following example, which is just pseudocode; you should adapt it yourself.
import threading
import requests

# Divide the URL list using any splitting function you like; for this simple
# example the split is written out by hand.
url_list = ['url1', 'url2', 'url3', 'url4']

# The list above divided into two lists.
url_list_1 = ['url1', 'url2']
url_list_2 = ['url3', 'url4']

def download_zip_files(x, base_url, filepath):
    r = requests.get(x)
    status_code = r.status_code
    filepath = str(x.replace(base_url, filepath))
    with open(filepath, "wb") as file:
        file.write(r.content)

def start_loop_download_zip(links):
    list(map(lambda x: download_zip_files(x, base_url, filepath), links))

t1 = threading.Thread(target=start_loop_download_zip, args=(url_list_1,))
t2 = threading.Thread(target=start_loop_download_zip, args=(url_list_2,))

# starting thread 1
t1.start()
# starting thread 2
t2.start()

# wait until thread 1 has completely executed
t1.join()
# wait until thread 2 has completely executed
t2.join()
Because downloading is I/O-bound rather than CPU-bound, this approach should significantly reduce your processing time.
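As a sketch of an alternative that avoids splitting the list by hand, concurrent.futures.ThreadPoolExecutor (mentioned further below) can manage the thread pool for you; download_zip_files, base_url, filepath and links are assumed to be the same names as in the code above:

from concurrent.futures import ThreadPoolExecutor

# A thread pool hands URLs to a fixed number of worker threads; tune max_workers.
with ThreadPoolExecutor(max_workers=16) as executor:
    list(executor.map(lambda url: download_zip_files(url, base_url, filepath), links))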
Related
I have a Python script that does two things: 1) it downloads a large file by making an API call, and 2) it preprocesses that large file. I want to use multiprocessing to run my script. Each individual part (1 and 2) takes quite long. Everything happens in memory due to the large size of the files, so ideally a single core would do both (1) and (2) consecutively.

I have a large number of cores available (100+), but I can only have 4 API calls running at the same time (a limitation set by the API developers). So what I want to do is spawn 4 cores that start downloading by making an API call, and as soon as one of those cores is done downloading and starts preprocessing, I want a new core to start the whole process as well. That way there are always 4 cores downloading, and as many cores as needed doing the preprocessing. I do not know, however, how to have a new core spawn as soon as another core is finished with the first part of the script.
My actual code is way too complex to just dump here, but let's say I have the following two functions:
import requests

def make_api_call(val):
    """Function that does part 1): makes an API call, stores the result in
    memory and returns a large satellite GeoTIFF.
    """
    large_image = requests.get(val)
    return large_image

def preprocess_large_image(large_image):
    """Function that does part 2): preprocesses a large image and returns the
    relevant data.
    """
    results = preprocess(large_image)
    return results
How can I then make sure that, as soon as a single core/process is finished with make_api_call and starts preprocess_large_image, another core spawns and starts the entire process as well? That way there are always 4 images downloading side by side. Thank you in advance for the help!
This is a perfect application for a multiprocessing.Semaphore (or, for safety, a BoundedSemaphore)! Basically you put a lock around the API-call part of the process, but let up to 4 worker processes hold the lock at any given time. For various reasons, things like Lock, Semaphore, Queue, etc. all need to be passed at the creation of a Pool, rather than when a method like map or imap is called. This is done by specifying an initialization function in the pool constructor.
import multiprocessing as mp

def api_call(arg):
    # Stub: make the API call and return the downloaded data.
    return foo

def process_data(foo):
    # Stub: preprocess the downloaded data.
    return "done"

def map_func(arg):
    global semaphore
    with semaphore:          # at most 4 workers can be inside this block at once
        foo = api_call(arg)
    return process_data(foo)

def init_pool(s):
    global semaphore
    semaphore = s

if __name__ == "__main__":
    s = mp.BoundedSemaphore(4)  # max concurrent API calls
    # n_workers should be large enough that a free worker is always waiting on semaphore.acquire()
    n_workers = 8
    with mp.Pool(n_workers, init_pool, (s,)) as p:
        for result in p.imap(map_func, arglist):
            print(result)
If both the downloading (part 1) and the conversion (part 2) take long, there is not much reason to do everything in memory.
Keep in mind that networking is generally slower than disk operations.
So I would suggest to use two pools, saving the downloaded files to disk, and send file names to workers.
The first Pool is created with four workers and does the downloading. The worker saves the image to a file and returns the filename. With this Pool you use the imap_unordered method, because that starts yielding values as soon as they become available.
The second Pool does the image processing. It gets fed by apply_async, which returns an AsyncResult object.
We need to save those to keep track of when all the conversions are finished.
Note that map or imap_unordered are not suitable here because they require a ready-made iterable.
import multiprocessing
import time

import requests

def download(url):
    large_image = requests.get(url)
    filename = url_to_filename(url)  # you need to write this
    with open(filename, "wb") as imgf:
        imgf.write(large_image.content)
    return filename

def process_image(name):
    with open(name, "rb") as f:
        large_image = f.read()
    # File processing goes here.
    with open(name, "wb") as f:
        f.write(large_image)
    return name

dlp = multiprocessing.Pool(processes=4)
# Default pool size is os.cpu_count(); that might be too much.
imgp = multiprocessing.Pool(processes=20)

urllist = ['http://foo', 'http://bar']  # et cetera
in_progress = []
for name in dlp.imap_unordered(download, urllist):
    in_progress.append(imgp.apply_async(process_image, (name,)))

# Wait for the conversions to finish.
while in_progress:
    finished = []
    for res in in_progress:
        if res.ready():
            finished.append(res)
    for f in finished:
        in_progress.remove(f)
        print(f"Finished processing '{f.get()}'.")
    time.sleep(0.1)
I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp

jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)

def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = file.split('.')[0]
        materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1

for process in process_list:
    process.join()
Everything in there relating to multiprocessing was cobbled together from a collection of Google searches and articles, so I wouldn't be surprised if it isn't remotely correct. For example, the i variable is a dirty attempt to keep the results in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As you can read in other answers, processes don't share memory, so you can't set a value directly in materials. The function has to use return to send the result back to the main process, which has to wait for the result and collect it.
It can be simpler with Pool. It doesn't need to use a queue manually, it should return results in the same order as the data in all_jsons, and you can set how many processes run at the same time so it won't block the CPU for other processes on the system.
But it can't use tqdm.
I couldn't test it, but it could be something like this:
import os
import json
from multiprocessing import Pool

# --- functions ---

def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for main process
    all_jsons = os.listdir(folder)

    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)

    for item in materials:
        print(item)
BTW: other modules worth a look are concurrent.futures, joblib, and ray.
Going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need and append it to a target file in ndjson/jsonlines format: instead of objects inside a JSON array [{},{}...], you have one separate object per line.
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
spawn N workers with a manifest of files to process and the output file to write to. You don't even need multiprocessing; you could use a tool like rush to parallelize.
each worker parses its data and generates the output dict
each worker opens the output file with the append flag, dumps the data and flushes immediately:
with open(out_file, 'a') as fp:
    print(json.dumps(data), file=fp, flush=True)
Flushing ensures that, as long as your data is smaller than the buffer size in your kernel (usually several MB), your different processes won't stomp on each other and corrupt the writes. If they do conflict, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert them to a regular JSON array if needed using jq. To be honest, just embrace jsonlines. It's a far better data format for long lists of objects, since you don't have to parse the whole thing in memory.
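A minimal sketch of such a worker, assuming the same process_dict() helper as in the question and a hypothetical list of file paths handed to each worker:

import json
import os

def worker(file_list, out_file):
    # Each worker appends one JSON object per line to the shared output file.
    for path in file_list:
        with open(path) as f:
            data = json.load(f)
        properties = process_dict(data, {})
        properties['name'] = os.path.basename(path).split('.')[0]
        with open(out_file, 'a') as fp:
            print(json.dumps(properties), file=fp, flush=True)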
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function put the results into the queue, while your main code waits for them to magically appear in the queue.
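A minimal sketch of that pattern, loosely based on the question's asyncJSONs() (the names and the fake parsing step are illustrative, not the asker's exact code):

import multiprocessing as mp

def worker(filename, queue):
    # Put (filename, result) on the queue instead of writing to a global list.
    properties = {"name": filename.split('.')[0]}  # real parsing goes here
    queue.put((filename, properties))

if __name__ == "__main__":
    queue = mp.Queue()
    files = ["a.json", "b.json"]
    processes = [mp.Process(target=worker, args=(f, queue)) for f in files]
    for p in processes:
        p.start()
    results = [queue.get() for _ in files]  # blocks until every worker has reported
    for p in processes:
        p.join()
    print(results)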
Also PLEASE read the instructions in the multiprocessing docs about main. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in a
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() is a "best practice".
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.
I have a big text file that needs to be processed. I first read all the text into a list and then use a ThreadPoolExecutor to start multiple threads to process it. The two functions called in process_text() are not listed here: is_channel() and get_relations().
I am on a Mac and my observations show that it doesn't really speed up the processing (a CPU with 8 cores, only 15% CPU used). If there is a performance bottleneck in either is_channel or get_relations, then multithreading won't help much. Is that the reason for the lack of performance gain? Should I try multiprocessing instead of multithreading to speed it up?
import itertools
from concurrent.futures import ThreadPoolExecutor

# channel, channel_text, non_channel_text, is_channel() and get_relations()
# are defined elsewhere in the full script.

def process_file(file_name):
    all_lines = []
    with open(file_name, 'r', encoding='utf8') as f:
        for index, line in enumerate(f):
            line = line.strip()
            all_lines.append(line)

    # Classify text
    all_results = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        for index, result in enumerate(executor.map(process_text, all_lines, itertools.repeat(channel))):
            all_results.append(result)

    for index, entities_relations_list in enumerate(all_results):
        pass  # print out results

def process_text(text, channel):
    global channel_text
    global non_channel_text

    is_right_channel = is_channel(text, channel)
    entities = ()
    relations = None
    entities_relations_list = set()
    entities_relations_list.add((entities, relations))
    if is_right_channel:
        channel_text += 1
        entities_relations_list = get_relations(text, channel)
        return (text, entities_relations_list, is_right_channel)
    non_channel_text += 1
    return (text, entities_relations_list, is_right_channel)
The first thing that should be done is finding out how much time it takes to:
Read the file in memory (T1)
Do all processing (T2)
Printing result (T3)
The third point (printing), if you are really doing it, can slow things down. It's fine as long as you are not printing to a terminal and are just piping the output to a file or something else.
Based on timings, we'll get to know:
T1 >> T2 => IO bound
T2 >> T1 => CPU bound
T1 and T2 are close => Neither.
By x >> y I mean x is significantly greater than y.
Based on the above and the file size, you can try a few approaches:
Threading based
Even this can be done in two ways; which one works faster can be found out by benchmarking/looking at the timings again.
Approach-1 (T1 >> T2 or even when T1 and T2 are similar)
Run the code that reads the file in a thread of its own and let it push the lines to a queue instead of a list.
This thread inserts a None at the end when it is done reading from the file. This is important to tell the workers that they can stop.
Now run the processing workers and pass them the queue.
The workers keep reading from the queue in a loop and processing the results. Similar to the reader thread, these workers put their results in a queue.
Once a worker encounters a None, it stops its loop and re-inserts the None into the queue (so that the other workers can stop themselves).
The printing part can again be done in a thread.
The above is an example of a single producer and multiple consumer threads; a minimal sketch is given below.
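A sketch of Approach-1, under the assumption that process_text(), channel and file_name are the ones from the question; the printing thread is left out for brevity:

import queue
import threading

def reader(file_name, line_q):
    with open(file_name, 'r', encoding='utf8') as f:
        for line in f:
            line_q.put(line.strip())
    line_q.put(None)                # sentinel: tells the workers to stop

def worker(line_q, result_q):
    while True:
        line = line_q.get()
        if line is None:
            line_q.put(None)        # re-insert so the other workers stop too
            break
        result_q.put(process_text(line, channel))

line_q, result_q = queue.Queue(maxsize=1000), queue.Queue()
n_workers = 10
threads = [threading.Thread(target=reader, args=(file_name, line_q))]
threads += [threading.Thread(target=worker, args=(line_q, result_q)) for _ in range(n_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()

results = []
while not result_q.empty():
    results.append(result_q.get())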
Approach-2 (This is just another way of doing what the code snippet in the question already does)
Read the entire file into a list.
Divide the list into index ranges based on no. of threads.
Example: if the file has 100 lines in total and we use 10 threads
then 0-9, 10-19, .... 90-99 are the index ranges
Pass the complete list and these index ranges to the threads so that each one processes its own set. Since you are not modifying the original list, this works.
This approach can give better results than running a worker for each individual line; see the sketch after this list.
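A sketch of Approach-2, again assuming the question's process_text() and channel, with all_lines already read into memory:

import threading

def worker(all_lines, start, end, results):
    # Each thread fills its own slice of the shared results list.
    for i in range(start, end):
        results[i] = process_text(all_lines[i], channel)

n_threads = 10
results = [None] * len(all_lines)
chunk = (len(all_lines) + n_threads - 1) // n_threads
threads = []
for t in range(n_threads):
    start, end = t * chunk, min((t + 1) * chunk, len(all_lines))
    threads.append(threading.Thread(target=worker, args=(all_lines, start, end, results)))
for t in threads:
    t.start()
for t in threads:
    t.join()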
Multiprocessing based
(CPU bound)
Split the file into multiple files before processing.
Run a new process for each file.
Each process gets the path of the file it should read and process.
This requires an additional step of combining all results/files at the end.
The process creation part can be done from within Python using the multiprocessing module,
or from a driver script that spawns a Python process for each file, like a shell script. A sketch of the multiprocessing variant is given below.
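A sketch of the multiprocessing variant, assuming the big file has already been split into hypothetical chunk files and that process_text() and channel are importable at module level as in the question:

import multiprocessing

def process_chunk(path):
    results = []
    with open(path, 'r', encoding='utf8') as f:
        for line in f:
            results.append(process_text(line.strip(), channel))
    return results

if __name__ == '__main__':
    chunk_paths = ['chunk_00.txt', 'chunk_01.txt', 'chunk_02.txt']  # produced by the split step
    with multiprocessing.Pool(processes=len(chunk_paths)) as pool:
        all_results = pool.map(process_chunk, chunk_paths)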
Just by looking at the code, it seems to be CPU bound. Hence, I would prefer multiprocessing for this. I have used both approaches in practice:
Multiprocessing: when processing huge text files (GBs) stored on disk (like what you are doing).
Threading (Approach-1): when reading from multiple databases, as that is more IO-bound than CPU-bound (I used multiple producer and multiple consumer threads).
I have about 4 input text files that I want to read and write all of them into one separate file.
I use two threads so it runs faster!
Here are my questions and code in Python:
1 - Does each thread have its own version of variables such as "lines" inside the function "writeInFile"?
2 - Since I copied some parts of the code from Tutorialspoint, I don't understand what "while 1: pass" in the last line does. Can you explain? Link to the main code: http://www.tutorialspoint.com/python/python_multithreading.htm
3 - Does it matter what delay I put for the threads?
4 - If I have about 400 input text files and want to do some operations on them before writing all of them into a separate file, how many threads can I use?
5 - Assuming I use 10 threads, is it better to have the inputs in different folders (10 folders with 40 input text files each) and have each thread work on one folder, OR to use what I have already done in the code below, where each thread reads one of the 400 input text files if it has not been read before by another thread?
import glob
import thread
import time
import timeit

# This list tracks which files in the folder have already been read by one
# thread so the other thread doesn't read them again.
processedFiles = []

# Function run by the threads
def writeInFile(threadName, delay):
    for file in glob.glob("*.txt"):
        if file not in processedFiles:
            processedFiles.append(file)
            f = open(file, "r")
            lines = f.readlines()
            f.close()
            time.sleep(delay)
            # open the file to write in
            f = open('myfile', 'a')
            f.write("%s \n" % lines)
            f.close()
            print "%s: %s" % (threadName, time.ctime(time.time()))

# Create two threads as follows
try:
    f = open('myfile', 'r+')
    f.truncate()
    start = timeit.default_timer()
    thread.start_new_thread(writeInFile, ("Thread-1", 0, ))
    thread.start_new_thread(writeInFile, ("Thread-2", 0, ))
    stop = timeit.default_timer()
    print stop - start
except:
    print "Error: unable to start thread"

while 1:
    pass
Yes. Each of the local variables is on the thread's stack and is not shared between threads.
This loop allows the parent thread to wait for each of the child threads to finish and exit before termination of the program. The actual construct you should use to handle this is join, not a while loop; see the sketch below and "what is the use of join() in python threading".
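A minimal sketch of that, reusing the question's writeInFile; note that this uses the threading module (not the old thread module) and replaces both thread.start_new_thread and the "while 1: pass" loop:

import threading

t1 = threading.Thread(target=writeInFile, args=("Thread-1", 0))
t2 = threading.Thread(target=writeInFile, args=("Thread-2", 0))
t1.start()
t2.start()
t1.join()   # block until Thread-1 has finished
t2.join()   # block until Thread-2 has finished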
In practice, yes, especially if the threads are writing to a common set of files (e.g., both thread 1 and thread 2 reading/writing the same file). Depending on the hardware, the size of the files, and the amount of data you're trying to write, different delays may make your program feel more responsive to the user. The best bet is to start with a simple value and adjust it as you see the program work in a real-world setting.
While you can technically use as many threads as you want, you generally won’t get any performance benefits over 1 thread per core per CPU.
Different folders won’t matter as much for only 400 files. If you’re talking about 4,000,000 files, than it might matter for instances when you want to do ls on those directories. What will matter for performance is whether each thread is working on it's own file or whether two or more threads might be operating on the same file.
General thought: while it is a more advanced architecture, you may want to try to learn/use celery for these types of tasks in a production environment http://www.celeryproject.org/.
I am totally new to multiprocessing. I am trying to change my code in order to run parts of it simultaneously.
I have a huge list for which I have to call an API for each node. Since the API calls are independent, I don't need the result of the first one in order to proceed to the second one. So I have this code:
import json

def xmlpart1(id):
    # ..call the api..
    # ..retrieve the xml..
    # ..find the part of the xml I want..
    return xml_part1

def xmlpart2(id):
    # ..call the api..
    # ..retrieve the xml..
    # ..find the part of the xml I want..
    return xml_part2

def main(index):
    mylist = [[..,..],[..,..],[..,..],[..,..]]  # A huge list of lists with the ids I need for calling the APIs
    myL = mylist[index]
    mydic = {}
    for i in myL:
        flag1 = xmlpart1(i)
        flag2 = xmlpart2(i)
        mydic[flag1] = flag2
    root = "myfilename %s.json" % (str(index))
    with open(root, "wb") as f:
        json.dump(mydic, f)

from multiprocessing import Pool

if __name__ == '__main__':
    Pool().map(main, [0, 1, 2, 3])
After a few suggestions from here and from the chat, I ended up with this code. The problem is still there. I ran the script at 9:50. At 10:25 the first file, "myfilename 0.json", appeared in my folder. Now it is 11:25 and none of the other files have appeared. The sublists have equal length and they do the same thing, so they should take approximately the same time.
This is something more suited to the multiprocessing.Pool() class.
Here's a simple example:
from multiprocessing import Pool

def job(args):
    """Your job function"""

Pool().map(job, inputs)
Where:
inputs is your list of inputs. Each input gets passed to job and processed in a separate process.
You get the results back as a list when all jobs have completed.
multiprocessing.Pool().map is just like the Python builtin map(), but it sets up a process pool of workers for you and passes each input to the given function.
See the docs for more details: http://docs.python.org/2/library/multiprocessing.html