I am trying to use multiprocessing to handle CSV files in excess of 2 GB. The problem is that the input is only being consumed in one process while the others seem to be idling.
The following recreates the problem I am encountering. Is it possible to use multiprocessing with an iterator? Consuming the full input into memory is not desired.
import csv
import multiprocessing
import time
def something(row):
    # print row[0]
    # pass
    return row

def main():
    start = time.time()
    i = open("input.csv")
    reader = csv.reader(i, delimiter='\t')
    print reader.next()

    p = multiprocessing.Pool(16)
    print "Starting processes"
    j = p.imap(something, reader, chunksize=10000)
    count = 1
    while j:
        print j.next()

    print time.time() - start

if __name__ == '__main__':
    main()
I think you are confusing "processes" with "processors".
Your program is definitely spawning multiple processes at the same time, as you can verify in the system or resource monitor while your program is running. How many processors or CPU cores are being used depends mainly on the OS, and has a lot to do with how CPU intensive the task you are delegating to each process is.
Make a little modification to your something function to introduce a sleep time that simulates the work being done in the function:
def something(row):
    time.sleep(.4)
    return row
Now, first run your function sequentially on each row in your file, and notice that each result comes one by one, every 400 ms.
def main():
    with open("input.csv") as i:
        reader = csv.reader(i)
        print(next(reader))

        # SEQUENTIALLY:
        for row in reader:
            result = something(row)
            print(result)
Now try with the pool of workers. Keep it at a low number, say 4 workers, and you will see that results still arrive at roughly 400 ms intervals, but in groups of 4 (or roughly the number of workers in the pool):
def main():
    with open("input.csv") as i:
        reader = csv.reader(i)
        print(next(reader))

        # IN PARALLEL
        print("Starting processes")
        p = multiprocessing.Pool(4)
        results = p.imap(something, reader)
        for result in results:
            print(result)  # results arrive in bursts of ~4, one per worker
While running in parallel, check the system monitor and look for how many "python" processes are being executed. It should be one plus the number of workers.
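If you would rather check this from inside the script than from the system monitor, multiprocessing.active_children() lists the live worker processes. A minimal sketch, assuming the same 4-worker pool as above:

import multiprocessing

if __name__ == '__main__':
    p = multiprocessing.Pool(4)
    # the pool workers show up as children of the main process
    print(len(multiprocessing.active_children()))  # typically 4
    p.close()
    p.join()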
I hope this explanation is useful.
Related
I have this code that reads and processes a text file line by line. The problem is my text file has between 1.5 and 2 billion lines, and it is taking forever. Is there a way to process over 100 million lines at the same time?
from cryptotools.BTC.HD import check, WORDS

with open("input.txt", "r") as a_file:
    for line in a_file:
        stripped_line = line.strip()
        for word in WORDS:
            mnemonic = stripped_line.format(x=word)
            if check(mnemonic):
                print(mnemonic)
                with open("print.txt", "a") as i:
                    i.write(mnemonic)
                    i.write("\n")
Input file has the following sample lines:
gloom document {x} stomach uncover peasant sock minor decide special roast rural
happy seven {x} gown rally tennis yard patrol confirm actress pledge luggage
tattoo time {x} other horn motor symbol dice update outer fiction sign
govern wire {x} pill valid matter tomato scheme girl garbage action pulp
To process 100 million lines at once you would need 100 million threads. A more practical way to improve the speed of your code is to split the work among a (much smaller) number of threads.
Because the read and write operations on the file are not asynchronous, you are better off reading the whole file at the beginning of your program and writing the processed data out at the end. In the code below I assume you do not care about the order in which you write the file out. If order is important, you could use a dictionary whose keys are the positional values of the lines being processed by a specific thread and sort accordingly at the end (a sketch of that variant follows the first code block below).
import concurrent.futures as cf

N_THREADS = 20
result = []

def doWork(data):
    for line in data:
        # do what you have to do
        result.append(mnemonic)

m_input = open("input.txt", "r")
lines = [line for line in m_input]
# the data for the threads will be here
# as a list of rows for each thread
m_data = {i: [] for i in range(0, N_THREADS)}
for l, n in zip(lines, range(0, len(lines))):
    m_data[n % N_THREADS].append(l)
'''
If you have to trim the number of threads uncomment these lines
m_data = {k: v for k, v in m_data.items() if len(v) != 0}
N_THREADS = N_THREADS if len(m_data) > N_THREADS else len(m_data)
if N_THREADS == 0:
    exit()
'''
with cf.ThreadPoolExecutor(max_workers=N_THREADS) as tp:
    for d in m_data.keys():
        tp.submit(doWork, m_data[d])

# work done
output = open("print.txt", "w")
for item in result:
    output.write(f"{item}\n")
output.close()
Change the number of threads as you find most efficient.
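For the order-preserving variant mentioned above, a rough sketch is to key each result by the original line index and sort before writing. The actual check/WORDS processing is elided here just like in the snippet above, so the stored value is only a placeholder:

import concurrent.futures as cf

N_THREADS = 20
results = {}  # original line index -> processed result

def doWork(chunk):
    # chunk is a list of (index, line) pairs
    for idx, line in chunk:
        # do what you have to do with `line` here
        results[idx] = line.strip()  # placeholder for the real result

with open("input.txt", "r") as m_input:
    lines = list(enumerate(m_input))

chunks = {i: [] for i in range(N_THREADS)}
for idx, line in lines:
    chunks[idx % N_THREADS].append((idx, line))

with cf.ThreadPoolExecutor(max_workers=N_THREADS) as tp:
    for c in chunks.values():
        tp.submit(doWork, c)

with open("print.txt", "w") as output:
    for idx in sorted(results):
        output.write(f"{results[idx]}\n")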
Edit (with memory optimizations):
The code above, while being extremely fast, uses a significant amount of memory because it loads the whole file into memory and then works on it.
You then have two options:
split your file into multiple smaller files: from my testing (see below) with a test file of ~10 million lines, the program actually ran extremely fast, but used up to 1.3 GB of RAM (a small splitting sketch follows this list).
use the code below, where I load one line at a time and hand it to a thread that works on that line and then pushes the result to a thread whose only responsibility is to write to the file. This way the memory usage drops significantly, but the execution time rises.
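For the first option, here is a small splitting sketch; the 1,000,000-lines-per-chunk figure and the input_part_*.txt names are assumptions for illustration, not something I benchmarked:

LINES_PER_CHUNK = 1000000  # assumption: tune this to what fits in your memory

def split_file(path):
    chunk_idx, line_count, out = 0, 0, None
    with open(path, "r") as infile:
        for line in infile:
            if out is None or line_count == LINES_PER_CHUNK:
                if out:
                    out.close()
                out = open(f"input_part_{chunk_idx}.txt", "w")
                chunk_idx += 1
                line_count = 0
            out.write(line)
            line_count += 1
    if out:
        out.close()

split_file("input.txt")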
The code below reads a single line from the file (the one with 10 million lines, approximately ~500 MB) and then sends that data to a class that manages a fixed number of threads. Currently I spawn a new thread every time one finishes; it could actually be more efficient to always reuse the same threads and use a queue for each thread. Then I spawn a writer thread whose only job is to write to the out.txt file that will contain the result. In my testing I only read the text file and write the same lines to another file.
What I found out is the following (using the 10-million-line file):
Original code: it took 14.20630669593811 seconds and used 1.301 GB (average usage) of RAM and 10% CPU
Updated code: it took 1230.4356942176819 seconds and used 4.3 MB (average usage) of RAM and 10% CPU, with the internal parameters as in the code below.
The timed results were obtained using the same number of threads for both programs.
From those results it is evident that the memory-optimized code runs significantly slower while using far less RAM. You can tune the internal parameters, such as the number of threads or the maximum queue size, to improve the performance, keeping in mind that this will affect memory usage. After a lot of tests, I would suggest splitting the file into multiple subfiles that fit in your memory and running the original version of the code (see above), because the tradeoff between time and memory is simply not justified in my opinion.
Here is the code I used to optimize memory consumption (and yes, it is way more complex than the one above XD, and maybe more than it needs to be). Keep in mind that it is NOT optimized in any significant way as far as thread management goes; one suggestion would be to always reuse the same threads and use multiple queues to pass data to them (a rough sketch of that layout follows this answer):
from threading import Thread
import time
import os
import queue

MAX_Q_SIZE = 100000
m_queue = queue.Queue(maxsize=MAX_Q_SIZE)
end_thread = object()
written_lines = 0  # counter updated by writer()

def doWork(data):
    # do your work here, before
    # checking if the queue is full,
    # otherwise when you finish the
    # queue might be full again
    while m_queue.full():
        time.sleep(0.1)
    m_queue.put(data)

def writer():
    # check if the file exists, or create it
    try:
        out = open("out.txt", "r")
        out.close()
    except FileNotFoundError:
        out = open("out.txt", "w")
        out.close()
    out = open("out.txt", "w")
    _end = False
    while True:
        if m_queue.qsize() == 0:
            if _end:
                break
            continue
        try:
            item = m_queue.get()
            if item is end_thread:
                out.close()
                _end = True
                break
            global written_lines
            written_lines += 1
            out.write(item)
        except:
            break

class Spawner:
    def __init__(self, max_threads):
        self.max_threads = max_threads
        self.current_threads = [None] * max_threads
        self.active_threads = 0
        self.writer = Thread(target=writer)
        self.writer.start()

    def sendWork(self, data):
        m_thread = Thread(target=doWork, args=(data,))
        replace_at = -1
        if self.active_threads >= self.max_threads:
            # wait for at least 1 thread to finish
            while True:
                for index in range(self.max_threads):
                    if self.current_threads[index].is_alive():
                        pass
                    else:
                        self.current_threads[index] = None
                        self.active_threads -= 1
                        replace_at = index
                        break
                if replace_at != -1:
                    break
                # else: no threads have finished, keep waiting
        if replace_at == -1:
            # only if len(current_threads) < max_threads
            for i in range(len(self.current_threads)):
                if self.current_threads[i] is None:
                    replace_at = i
                    break
        self.current_threads[replace_at] = m_thread
        self.active_threads += 1
        m_thread.start()

    def waitEnd(self):
        for t in self.current_threads:
            if t is not None and t.is_alive():
                t.join()
                self.active_threads -= 1
        while True:
            if m_queue.qsize() == MAX_Q_SIZE:
                time.sleep(0.1)
                continue
            m_queue.put(end_thread)
            break
        if self.writer.is_alive():
            self.writer.join()

start_time = time.time()
spawner = Spawner(50)
with open("input.txt", "r") as infile:
    for line in infile:
        spawner.sendWork(line)
spawner.waitEnd()
print("--- %s seconds ---" % (time.time() - start_time))
You can remove the timing prints; I left them just for reference to show how I measured how long the program took to run. Below you find the screenshots of the execution of the two programs, taken from the task manager.
Memory optimized version:
Original version (I forgot to expand the terminal process when taking the screenshot; anyway, the memory usage of the terminal's subprocesses is negligible compared to the one used by the program, and the 1.3 GB of RAM is accurate):
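As for the fixed-threads suggestion above, a rough sketch of that layout could look like the following; the worker body is a placeholder and the queue sizes are arbitrary, so treat it as a starting point rather than a tested drop-in:

from threading import Thread
import itertools
import queue

N_WORKERS = 4
out_queue = queue.Queue(maxsize=100000)
END = object()

def worker(in_q):
    # each worker drains only its own queue, so the threads are reused
    while True:
        line = in_q.get()
        if line is END:
            break
        out_queue.put(line)  # placeholder for the real per-line processing

def writer():
    with open("out.txt", "w") as out:
        while True:
            item = out_queue.get()
            if item is END:
                break
            out.write(item)

in_queues = [queue.Queue(maxsize=10000) for _ in range(N_WORKERS)]
workers = [Thread(target=worker, args=(q,)) for q in in_queues]
writer_thread = Thread(target=writer)
for t in workers + [writer_thread]:
    t.start()

with open("input.txt", "r") as infile:
    for line, q in zip(infile, itertools.cycle(in_queues)):
        q.put(line)

for q in in_queues:
    q.put(END)
for t in workers:
    t.join()
out_queue.put(END)
writer_thread.join()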
Let us consider the following code where I calculate the factorial of 4 really large numbers, saving each output to a separate .txt file (out_mp_{idx}.txt). I use multiprocessing (4 processes) to reduce the computation time. Though this works fine, I want to output all 4 results to one file.
One way is to open each of the (4) files I generate (from the code below) and append them to a new file, but that's not my choice (below is just a simplistic version of my code; I have too many files to handle, which defeats the purpose of time-saving via multiprocessing). Is there a better way to automate things such that the results from the processes are all dumped/appended to a single file? Also, in my case the returned results from each process could be several lines, so how would we avoid an open-file conflict when the results are being appended to the output file by one process while a second process returns its answer and wants to open/access the same file?
As an alternative, I tried the imap route, but that's not as computationally efficient as the code below. Something like this SO post.
from multiprocessing import Process
import os
import time

tic = time.time()

def factorial(n, idx):  # function to calculate the factorial
    num = 1
    while n >= 1:
        num *= n
        n = n - 1
    with open(f'out_mp_{idx}.txt', 'w') as f0:  # saving output to a separate file
        f0.writelines(str(num))

def My_prog():
    jobs = []
    N = [10000, 20000, 40000, 50000]  # numbers for which the factorial is desired
    n_procs = 4
    # executing multiple processes
    for i in range(n_procs):
        p = Process(target=factorial, args=(N[i], i))
        jobs.append(p)
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
    print(f'Exec. Time:{time.time()-tic} [s]')

if __name__ == '__main__':
    My_prog()
You can do this:
1) Create a Queue
   a) manager = Manager()
   b) data_queue = manager.Queue()
   c) put all the data in this queue.
2) Create a thread and start it before the multiprocessing part
   a) create a function which waits on data_queue.
Something like:
def fun():
    while True:
        data = data_queue.get()
        if isinstance(data, Sentinel):
            break
        # write data to the file
3) Remember to send a Sentinel object through the queue after all the processes are done.
You can also make this thread a daemon thread and skip the sentinel part.
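Put together, a minimal sketch of that pattern applied to the factorial example above (the writer thread and the None sentinel are my own naming, not from the original answer):

from multiprocessing import Process, Manager
from threading import Thread

def factorial(n, q):
    num = 1
    while n >= 1:
        num *= n
        n = n - 1
    q.put(str(num))  # result goes to the shared queue instead of a per-process file

def writer(q):
    # single consumer: no open-file conflicts between the processes
    with open('out_mp.txt', 'w') as f0:
        while True:
            item = q.get()
            if item is None:  # sentinel: all processes are done
                break
            f0.write(item + '\n')

if __name__ == '__main__':
    manager = Manager()
    data_queue = manager.Queue()
    writer_thread = Thread(target=writer, args=(data_queue,))
    writer_thread.start()

    jobs = [Process(target=factorial, args=(n, data_queue)) for n in (10000, 20000, 40000, 50000)]
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()

    data_queue.put(None)  # send the sentinel only after all processes have joined
    writer_thread.join()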
I am pulling .8 million records in one go (this is a one-time process) from MongoDB using pymongo and performing some operations on them.
My code looks like the below.
procs = []
for rec in cursor:  # cursor has .8 million rows
    print cnt
    cnt = cnt + 1
    url = rec['urlk']
    mkptid = rec['mkptid']
    cii = rec['cii']
    #self.process_single_layer(url, mkptid, cii)
    proc = Process(target=self.process_single_layer, args=(url, mkptid, cii))
    procs.append(proc)
    proc.start()

# complete the processes
for proc in procs:
    proc.join()
process_single_layer is a function which basically downloads URLs from the cloud and stores them locally.
Now the problem is that the downloading process is slow, as it has to hit a URL. And since the number of records is huge, processing 1k rows takes 6 minutes.
To reduce the time I wanted to implement multiprocessing. But it is hard to see any difference with the above code.
Please suggest how I can improve the performance in this scenario.
First of all you need to count all the rows in your file and then spawn a fixed number of processes (ideally matching the number of your processor cores), to which you feed via queues (one for each process) a number of rows equal to the division total_number_of_rows / number_of_cores. The idea behind this approach is that you split the processing of those rows between multiple processes, hence achieving parallelism.
A way to find out the number of cores dynamically is by doing:
import multiprocessing as mp
cores_count = mp.cpu_count()
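As a minimal sketch of that split, purely for illustration (the total count is the .8 million figure from the question, assumed to be known up front):

import multiprocessing as mp

cores_count = mp.cpu_count()
total_rows = 800000  # assumption: .8 million records, known in advance
rows_per_process = -(-total_rows // cores_count)  # ceiling division
print(cores_count, rows_per_process)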
A slight improvement, which avoids the initial row count, is to distribute the rows cyclically: create the list of queues and then apply a cycle iterator over it.
A full example:
import queue
import multiprocessing as mp
import itertools as itools

cores_count = mp.cpu_count()

def dosomething(q):
    while True:
        try:
            row = q.get(timeout=5)
        except queue.Empty:
            break
        # ..do some processing here with the row
        pass

if __name__ == '__main__':
    processes = []
    queues = []
    # spawn the processes
    for i in range(cores_count):
        q = mp.Queue()
        queues.append(q)
        proc = mp.Process(target=dosomething, args=(q,))
        processes.append(proc)
        proc.start()

    queues_cycle = itools.cycle(queues)
    for row in cursor:  # `cursor` is the pymongo cursor from the question
        q = next(queues_cycle)
        q.put(row)

    # do the join after feeding all the queues
    for p in processes:
        p.join()
It's easier to use a pool in this scenario.
Queues are not necessary as you don't need to communicate between your spawned processes. We can use the Pool.map to distribute the workload.
Pool.imap or Pool.imap_unordered might be faster with a larger chunk size (Ref: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.imap). You can also use Pool.starmap if you want to get rid of the tuple unpacking (a sketch of that variant follows the example below).
from multiprocessing import Pool

def process_single_layer(data):
    # unpack the tuple and do the processing
    url, mkptid, cii = data
    return "downloaded" + url

def get_urls():
    # replace this code: iterate over the cursor and yield the necessary data as a tuple
    for rec in range(8):
        url = "url:" + str(rec)
        mkptid = "mkptid:" + str(rec)
        cii = "cii:" + str(rec)
        yield (url, mkptid, cii)

# you can come up with a suitable process count based on the number of CPUs.
with Pool(processes=4) as pool:
    print(pool.map(process_single_layer, get_urls()))
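A minimal sketch of the starmap variant mentioned above, reusing the Pool import and the get_urls generator from the example; the only difference is that the worker takes three separate arguments instead of one tuple:

def process_single_layer_args(url, mkptid, cii):
    # same work as above, but starmap unpacks the tuple for us
    return "downloaded" + url

with Pool(processes=4) as pool:
    print(pool.starmap(process_single_layer_args, get_urls()))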
Please bear with me as this is a bit of a contrived example of my real application. Suppose I have a list of numbers and I wanted to add a single number to each number in the list using multiple (2) processes. I can do something like this:
import multiprocessing

my_list = list(range(100))
my_number = 5
data_line = [{'list_num': i, 'my_num': my_number} for i in my_list]

def worker(data):
    return data['list_num'] + data['my_num']

pool = multiprocessing.Pool(processes=2)
pool_output = pool.map(worker, data_line)
pool.close()
pool.join()
Now however, there's a wrinkle to my problem. Suppose that I wanted to alternate adding two numbers (instead of just adding one). So around half the time I want to add my_number1, and the other half of the time I want to add my_number2. It doesn't matter which number gets added to which item on the list. However, the one requirement is that I don't want to be adding the same number simultaneously across the different processes. What this boils down to essentially (I think) is that I want to use the first number on Process 1 and the second number on Process 2 exclusively, so that the processes are never simultaneously adding the same number. So something like:
my_num1 = 5
my_num2 = 100
data_line = [{'list_num': i, 'my_num1': my_num1, 'my_num2': my_num2} for i in my_list]

def worker(data):
    # if in Process 1:
    return data['list_num'] + data['my_num1']
    # if in Process 2:
    return data['list_num'] + data['my_num2']
    # and so forth
Is there an easy way to specify specific inputs per process? Is there another way to think about this problem?
multiprocessing.Pool allows you to run an initializer function which is executed in each worker process before the actual given function is run.
You can use it together with a global variable to let your function know which process it is running in.
You probably want to control the initial number each process will get. You can use a Queue to tell the processes which number to pick up.
This solution is not optimal but it works.
import multiprocessing

process_number = None

def initializer(queue):
    global process_number
    process_number = queue.get()  # atomic get of the process index

def function(value):
    print("I'm process %s" % process_number)
    return value[process_number]

def main():
    queue = multiprocessing.Queue()
    for index in range(multiprocessing.cpu_count()):
        queue.put(index)

    pool = multiprocessing.Pool(initializer=initializer, initargs=[queue])

    tasks = [{0: 'Process-0', 1: 'Process-1', 2: 'Process-2'}, ...]
    print(pool.map(function, tasks))
My PC is a dual core; as you can see, only Process-0 and Process-1 show up.
I'm process 0
I'm process 0
I'm process 1
I'm process 0
I'm process 1
...
['Process-0', 'Process-0', 'Process-1', 'Process-0', ... ]
I have a situation where I have to read and write the list simultaneously.
It seems the code starts to read only after it completes writing all the elements to the list. What I want is for the code to keep adding elements at one end while I keep processing the first 10 elements at the same time.
import csv
from multiprocessing import Pool

testlist = []
with open('some.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        testlist.append(row)

def render(small):
    # do some stuff
    pass

while len(testlist) > 0:
    pool = Pool(processes=10)
    small = testlist[:10]
    del testlist[:10]
    pool.map_async(render, small)
    pool.close()
    pool.join()
You need a queue that is shared between processes. One process adds to the queue, the other processes from the queue.
Here is a simplified example:
from multiprocessing import Process, Queue

def put(queue):
    # writes items to the queue
    queue.put('something')

def get(queue):
    # gets items from the queue to work on
    while True:
        item = queue.get()
        # do something with item

if __name__ == '__main__':
    queue = Queue()
    getter_process = Process(target=get, args=((queue),))
    getter_process.daemon = True
    getter_process.start()
    put(queue)  # Send something to the queue
    getter_process.join()  # Wait for the getter to finish
If you want to only process 10 things at a time, you can limit the queue size to 10. This means the "writer" cannot add anything more while the queue already has 10 items waiting to be processed.
By default, the queue has no bounds/limits. The documentation is a good place to start for more on queues.
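Bounding the queue is just a matter of passing maxsize when creating it; a minimal sketch:

from multiprocessing import Queue

queue = Queue(maxsize=10)  # put() blocks once 10 items are waiting to be processed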
You can do it like this:
x = []
y = [1, 2, 3, 4, 5, 6, 7, ...]
for i in y:
    x.append(i)
    if len(x) < 10:
        print x
    else:
        print x[:10]
        x = x[10:]  # drop the 10 elements just processed
PS: assuming y is an infinite stream