I'm new to python multiprocessing. I'm trying to read excel file with enormous file size. When reading big files it takes a long time and I just found out about Multiprocessing It is the technology that allows your program to run in parallel by using multiple CPU cores at the same time. It is used to significantly speed up your program I'm trying to use it but when I read it via Pool, it returns error no such file or directory but without pool (just call the function with the parameter) it doesn't raise error.
import os
import pandas as pd
import multiprocessing as mp
def read_csv(filename):
return pd.read_csv(filename)
def main():
myfile = "C:\\excel_processor\\1_wht_escrow_verified_id -1.csv"
pool = mp.Pool(2)
mydf = pool.map(read_csv, myfile)
if __name__ == '__main__':
main()
Related
This program returns the resolution of the video but since I need for a large scale project I need multiprocessing. I have tried using and parallel processing using a different function but that would just run it multiple times not making it efficent I am posting the entire code. Can you help me create a main process that takes all cores.
import os
from tkinter.filedialog import askdirectory
from moviepy.editor import VideoFileClip
if __name__ == "__main__":
dire = askdirectory()
d = dire[:]
print(dire)
death = os.listdir(dire)
print(death)
for i in death: #multiprocess this loop
dire = d
dire += f"/{i}"
v = VideoFileClip(dire)
print(f"{i}: {v.size}")
This code works fine but I need help with creating a main process(uses all cores) for the for loop alone. can you excuse the variables names I was angry at multiprocessing. Also if you have tips on making the code efficient I would appreciate it.
You are, I suppose, assuming that every file in the directory is a video clip. I am assuming that processing the video clip is an I/O bound "process" for which threading is appropriate. Here I have rather arbitrarily crated a thread pool size of 20 threads this way:
MAX_WORKERS = 20 # never more than this
N_WORKERS = min(MAX_WORKERS, len(death))
You would have to experiment with how large MAX_WORKERS could be before performance degrades. This might be a low number not because your system cannot support lots of threads but because concurrent access to multiple files on your disk that may be spread across the medium may be inefficient.
import os
from tkinter.filedialog import askdirectory
from moviepy.editor import VideoFileClip
from concurrent.futures import ThreadPoolExecutor as Executor
from functools import partial
def process_video(parent_dir, file):
v = VideoFileClip(f"{parent_dir}/{file}")
print(f"{file}: {v.size}")
if __name__ == "__main__":
dire = askdirectory()
print(dire)
death = os.listdir(dire)
print(death)
worker = partial(process_video, dire)
MAX_WORKERS = 20 # never more than this
N_WORKERS = min(MAX_WORKERS, len(death))
with Executor(max_workers=N_WORKERS) as executor:
results = executor.map(worker, death) # results is a list: [None, None, ...]
Update
According to #Reishin, moviepy results in executing the ffmpeg executable and thus ends up creating a process in which the work is being done. So there us no point in also using multiprocessing here.
moviepy is just an wrapper around ffmpeg and designed to edit clips thus working with one file at time - the performance is quite poor. Invoking each time the new process for a number of files is time consuming. At the end, need of multiple processes might be a result of choice of wrong lib.
I'd like to recommend to use pyAV lib instead, which provides direct py binding for ffmpeg and good performance:
import av
import os
from tkinter.filedialog import askdirectory
import multiprocessing
from concurrent.futures import ThreadPoolExecutor as Executor
MAX_WORKERS = int(multiprocessing.cpu_count() * 1.5)
def get_video_resolution(path):
container = None
try:
container = av.open(path)
frame = next(container.decode(video=0))
return path, f"{frame.width}x{frame.height}"
finally:
if container:
container.close()
def files_to_proccess():
video_dir = askdirectory()
return (full_file_path for f in os.listdir(video_dir) if (full_file_path := os.path.join(video_dir, f)) and not os.path.isdir(full_file_path))
def main():
for f in files_to_proccess():
print(f"{os.path.basename(f)}: {get_video_resolution(f)[1]}")
def main_multi_threaded():
with Executor(max_workers=MAX_WORKERS) as executor:
for path, resolution in executor.map(get_video_resolution, files_to_proccess()):
print(f"{os.path.basename(path)}: {resolution}")
if __name__ == "__main__":
#main()
main_multi_threaded()
Above are single and multi-threaded implementation, with optimal parallelism setting (in case if multithreading is something absolute required)
I want to read and process a file by using multiprocessing with low memory consumption, high throughput (sentence/s), and - especially important - ordered results.
I was wondering whether we can use linecache's getline for this purpose. The following code reads a file, hopefully in parallel, and executes some function on the lines that are gathered in the subprocess. Here I opted for running some tokenisation on the files with spaCy.
import datetime
from multiprocessing import Pool, current_process
from os import cpu_count
from pathlib import Path
from functools import partial
from linecache import getline
import spacy
class Processor:
def __init__(self, spacy_model='en_core_web_sm', batch_size=2048):
self.nlp = spacy.load(spacy_model, disable=['ner', 'textcat'])
self.batch_size = batch_size
#staticmethod
def get_n_lines(pfin):
with pfin.open(encoding='utf-8') as fhin:
for line_idx, _ in enumerate(fhin, 1):
pass
return line_idx
def process_file(self, fin):
pfin = Path(fin).resolve()
total_lines = self.get_n_lines(pfin)
start_time = datetime.datetime.now()
procfunc = partial(self.process_batch, pfin)
with Pool(cpu_count() - 1) as pool:
# map the starting indexex to the processes
for _ in pool.imap(procfunc, range(0, total_lines+1, self.batch_size)):
pass
print('done', (datetime.datetime.now() - start_time).total_seconds())
def process_batch(self, pfin, start):
lines = [getline(str(pfin), i) for i in range(start, start+self.batch_size)]
# Parse text with spaCy
docs = list(self.nlp.pipe(lines))
# Chop into sentences
spacy_sents = [str(sent) for doc in docs for sent in doc.sents]
return str(current_process()), spacy_sents
if __name__ == '__main__':
fn = r'data/train.tok.low.en'
proc = Processor()
proc.process_file(fn)
I found that on my work laptop, running with 3 active cores on a file of 140K sentences the duration is 261 seconds. When running with a single core (n_workers=1), the processing time is 431 seconds. I am not sure how to interpret this difference but I guess it comes down to the question: does linecache.getline allow for concurrent reading? Parallel execution s faster, but considering getline expects a file name (rather than a file object), I expect it to have to open the file every time and as such blocking access for other processes. Is this assumption correct because parallel execution still seems much faster? Is there a better way to read files fast and in parallel whilst also keeping the results ordered?
You don't need linecache, and it doesn't help.
First, you don't need any special tricks to read the same file simultaneously from multiple processes. You can just do it. It'll work.
Second, linecache loads a whole file immediately as soon as a single line is requested from that file. You're not splitting the work of reading the file at all. You're doing more I/O than if you just had the parent process read the file and let the workers inherit the data. If you're getting any speedup from multiprocessing, it's probably due to parallelizing the NLP work, not the file reading.
Third, linecache is designed to support the traceback module, and it does a lot of stuff that doesn't make sense for a general-purpose file reading module, including searching the import path for a file if it doesn't find the file in the current directory.
I want to fetch whois data from a txt file with 50000 urls. It's working but takes at least 20 minutes. What can i do to improve performance of this
import whois
from concurrent.futures import ThreadPoolExecutor
import threading
import time
pool = ThreadPoolExecutor(max_workers=2500)
def query(domain):
while True:
try:
w = whois.whois(domain)
fwrite = open("whoISSuccess.txt", "a")
fwrite.write('\n{0} : {1}'.format(w.domain, w.expiration_date))
fwrite.close()
except:
time.sleep(3)
continue
else:
break
with open('urls.txt') as f:
for line in f:
lines = line.rstrip("\n\r")
pool.submit(query, lines)
pool.shutdown(wait=True)
You can do two things to improve the speed:
Use multiprocessing, rather than threading as python Threads do not really run in parallel, while processes will be managed the OS and truly run in parallel.
Secondly, have each process write to its own file, e.g. <url>.txt, as having all processes write to the same file will cause lock contention on the file writing, which significantly slows your program. After all processes have completed you can then aggregate all files to a single one if this is a critical requirement. Alternatively, you can just keep the whois result in memory and then write it out to a file in the end.
Is there a good way to download a lot of files en masse using python? This code is speedy enough for downloading about 100 or so files. But I need to download 300,000 files. Obviously they are all very small files (or I wouldn't be downloading 300,000 of them :) ) so the real bottleneck seems to be this loop. Does anyone have any thoughts? Maybe using MPI or threading?
Do I just have to live with the bottle neck? Or is there a faster way, maybe not even using python?
(I included the full beginning of the code just for completeness sake)
from __future__ import division
import pandas as pd
import numpy as np
import urllib2
import os
import linecache
#we start with a huge file of urls
data= pd.read_csv("edgar.csv")
datatemp2=data[data['form'].str.contains("14A")]
datatemp3=data[data['form'].str.contains("14C")]
#data2 is the cut-down file
data2=datatemp2.append(datatemp3)
flist=np.array(data2['filename'])
print len(flist)
print flist
###below we have a script to download all of the files in the data2 database
###here you will need to create a new directory named edgar14A14C in your CWD
original=os.getcwd().copy()
os.chdir(str(os.getcwd())+str('/edgar14A14C'))
for i in xrange(len(flist)):
url = "ftp://ftp.sec.gov/"+str(flist[i])
file_name = str(url.split('/')[-1])
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
f.write(u.read())
f.close()
print i
The usual pattern with multiprocessing is to create a job() function that takes arguments and performs some potentially CPU bound work.
Example: (based on your code)
from multiprocessing import Pool
def job(url):
file_name = str(url.split('/')[-1])
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
f.write(u.read())
f.close()
pool = Pool()
urls = ["ftp://ftp.sec.gov/{0:s}".format(f) for f in flist]
pool.map(job, urls)
This will do a number of things:
Create a multiprocessing pool and set of workers as you have CPU(s) or CPU Core(s)
Create a list of inputs to the job() function.
Map the list of inputs urls to job() and wait for all jobs to complete.
Python's multiprocessing.Pool.map will take care of splitting up your input across the no. of workers in the pool.
Another useful neat little thing I've done for this kind of work is to use progress like this:
from multiprocessing import Pool
from progress.bar import Bar
def job(input):
# do some work
pool = Pool()
inputs = range(100)
bar = Bar('Processing', max=len(inputs))
for i in pool.imap(job, inputs):
bar.next()
bar.finish()
This gives you a nice progress bar on your console as your jobs are progressing so you have some idea of progress and eta, etc.
I also find the requests library very useful here and a much nicer set of API(s) for dealing with web resources and downloading of content.
I'm using multiprocessing in Python for parallelizing.
I'm trying to parallelize the process on chunks of data read from an excel file using pandas.
I'm new to multiprocessing and parallel processing. During implementation on simple code,
import time;
import os;
from multiprocessing import Process
import pandas as pd
print os.getpid();
df = pd.read_csv('train.csv', sep=',',usecols=["POLYLINE"],iterator=True,chunksize=2);
print "hello";
def my_function(chunk):
print chunk;
count = 0;
processes = [];
for chunk in df:
if __name__ == '__main__':
p = Process(target=my_function,args=(chunk,));
processes.append(p);
if(count==4):
break;
count = count + 1;
The print "hello" is being executed multiple times, I'm guessing the individual process created should work on the target rather than main code.
Can anyone suggest me where I'm wrong.
The way that multiprocessing works is create a new process and then import the file with the target function. Since your outermost scope has print statements, it will get executed once for every process.
By the way you should use a Pool instead of Processes directly. Here's a cleaned up example:
import os
import time
from multiprocessing import Pool
import pandas as pd
NUM_PROCESSES = 4
def process_chunk(chunk):
# do something
return chunk
if __name__ == '__main__':
df = pd.read_csv('train.csv', sep=',', usecols=["POLYLINE"], iterator=True, chunksize=2)
pool = Pool(NUM_PROCESSES)
for result in pool.map(process_chunk, df):
print result
Using multiprocessing is probably not going to speed up reading data from disk, since disk access is much slower than e.g. RAM access or calculations. And the different pieces of the file will end up in different processes.
Using mmap could help speed up data access.
If you do a read-only mmap of the data file before starting e.g. a Pool.map, each worker could read its own slice of data from the shared memory mapped file and process it.