Readline and threading - python

So I run the code below, and when I use queue.qsize() after I run it, there are still 450,000 or so items in the queue, implying most lines of the text file were not read. Any idea what is going on here?
from Queue import Queue
from threading import Thread
lines = 660918 #int(str.split(os.popen('wc -l HGDP_FinalReport_Forward.txt').read())[0]) -1
queue = Queue()
File = 'HGDP_FinalReport_Forward.txt'
num_threads =10
short_file = open(File)
class worker(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue
    def run(self):
        while True:
            try:
                self.queue.get()
                i = short_file.readline()
                self.queue.task_done() # signal to the queue that the task is done
            except:
                break

## This is where I should make the call to the threads
def main():
    for i in range(num_threads):
        worker(queue).start()
    queue.join()
    for i in range(lines): # put the range of the number of lines in the .txt file
        queue.put(i)

main()

It's hard to know exactly what you're trying to do here, but if each line can be processed independently, multiprocessing is a much simpler choice that will take care of all the synchronization for you. An added bonus is that you don't have to know the number of lines in advance.
Basically,
import multiprocessing

def process(line):
    return len(line) # or whatever

pool = multiprocessing.Pool(10)

with open(path) as lines:
    results = pool.map(process, lines)
Or, if you're just trying to get some kind of aggregate result from the lines, you can use reduce to lower memory usage.
import operator

with open(path) as lines:
    result = reduce(operator.add, pool.map(process, lines))
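If memory really matters, though, pool.imap (which yields results lazily rather than building the whole result list the way pool.map does) may be a better fit for the reduce. A minimal sketch, assuming the same process function and an already-defined path:

import multiprocessing
import operator
from functools import reduce # available on Python 2.6+ and required on Python 3

def process(line):
    return len(line) # or whatever

pool = multiprocessing.Pool(10)

with open(path) as lines:
    # imap yields results one at a time, so reduce consumes them as they arrive
    result = reduce(operator.add, pool.imap(process, lines, chunksize=1000))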

So I tried doing this, but I'm getting a bit confused, because I need to pass a single line each time, and that isn't what the code seems to be doing:
import multiprocessing as mp

File = 'HGDP_FinalReport_Forward.txt'
#short_file = open(File)
test = []

def pro(temp_line):
    temp_line = temp_line.strip().split()
    return len(temp_line)

if __name__ == "__main__":
    with open("HGDP_FinalReport_Forward.txt") as lines:
        pool = mp.Pool(processes=10)
        t = pool.map(pro, lines)
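For reference, pool.map does end up passing one line at a time here: iterating over an open file yields individual lines, and the worker is called once per line. A minimal sketch with a hypothetical three-line file standing in for HGDP_FinalReport_Forward.txt:

import multiprocessing as mp

def pro(temp_line):
    temp_line = temp_line.strip().split()
    return len(temp_line)

if __name__ == "__main__":
    # hypothetical tiny file used only for illustration
    with open("tiny_example.txt", "w") as out:
        out.write("a b c\nd e\nf\n")
    with open("tiny_example.txt") as lines:
        pool = mp.Pool(processes=2)
        print(pool.map(pro, lines)) # -> [3, 2, 1], one call to pro per line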

Related

Python multiprocessing possible deadlock with two queue as producer-consumer pattern?

I'm wondering if there can be a sort of deadlock in the following code. I have to read each element of a database (about 1 million items), process it, then collect the results in a single file.
I've parallelized the execution with multiprocessing, using two Queues and three types of processes:
Reader: Main process which reads the database and adds the read items in a task_queue
Worker: Pool of processes. Each worker gets an item from task_queue, processes the item, saves the results in an intermediate file stored in item_name/item_name.txt and puts the item_name in a completed_queue
Writer: Process which gets an item_name from completed_queue, gets the intermediate result from item_name/item_name.txt and writes it in results.txt
from multiprocessing import Pool, Process, Queue

class Computation():
    def __init__(self, K):
        self.task_queue = Queue()
        self.completed_queue = Queue()
        self.n_cpus = K

    def reader(self,):
        with open(db, "r") as db:
            ... # Read an item
            self.task_queue.put(item)

    def worker(self,):
        while True:
            item = self.task_queue.get(True)
            if item == "STOP":
                break
            self.process_item(item)

    def writer_process(self,):
        while True:
            f = self.completed_queue.get(True)
            if f == "DONE":
                break
            self.write_f(f)

    def run(self,):
        pool = Pool(n_cpus, self.worker, args=())
        writer = Process(target=self.writer_process, args=())
        writer.start()
        self.reader()
        pool.close()
        pool.join()
        self.completed_queue.put("DONE")
        writer.join()
The code works, but it seems that sometimes the writer or the pool stops working (or they are very slow). Is a deadlock possible in this scenario?
There are a couple of issues with your code. First, by using the queues as you are, you are in effect creating your own process pool and have no need for the multiprocessing.Pool class at all. You are using a pool initializer as the actual pool worker, which is a bit of a misuse of that class; you would be better off just using regular Process instances (my opinion, anyway).
Second, although it is well and good that you are putting the message DONE to the writer_process to signal it to terminate, you have not done the same for the self.n_cpus worker processes, which are looking for "STOP" messages; the reader function therefore needs to put self.n_cpus STOP messages on the task queue:
from multiprocessing import Process, Queue

class Computation():
    def __init__(self, K):
        self.task_queue = Queue()
        self.completed_queue = Queue()
        self.n_cpus = K

    def reader(self,):
        with open(db, "r") as db:
            ... # Read an item
            self.task_queue.put(item)
        # signal to the worker processes to terminate:
        for _ in range(self.n_cpus):
            self.task_queue.put('STOP')

    def worker(self,):
        while True:
            item = self.task_queue.get(True)
            if item == "STOP":
                break
            self.process_item(item)

    def writer_process(self,):
        while True:
            f = self.completed_queue.get(True)
            if f == "DONE":
                break
            self.write_f(f)

    def run(self):
        processes = [Process(target=self.worker) for _ in range(self.n_cpus)]
        for p in processes:
            p.start()
        writer = Process(target=self.writer_process, args=())
        writer.start()
        self.reader()
        for p in processes:
            p.join()
        self.completed_queue.put("DONE")
        writer.join()
Personally, instead of using 'STOP' and 'DONE' as the sentinel messages, I would use None, assuming that is not a valid actual message. I have tested the above code where reader just processed strings in a list and self.process_item(item) simply appended ' done' to each of those strings and put the modified string on the completed_queue, and where self.write_f in the writer_process was replaced with a print call. I did not see any problems with the code as is.
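For illustration, a minimal sketch of that None-sentinel variant, assuming None is never a real item or result; only the shutdown signalling and the two receiving loops change:

    def reader(self,):
        with open(db, "r") as db:
            ... # Read an item
            self.task_queue.put(item)
        # None as the sentinel: tell each worker to terminate
        for _ in range(self.n_cpus):
            self.task_queue.put(None)

    def worker(self,):
        # iter(get, None) keeps yielding items until the None sentinel arrives
        for item in iter(self.task_queue.get, None):
            self.process_item(item)

    def writer_process(self,):
        for f in iter(self.completed_queue.get, None):
            self.write_f(f)

run() would then put None (rather than "DONE") on the completed_queue after the workers have been joined.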
Update to use a Managed Queue
Disclaimer: I have had no experience using mpi4py and have no idea how the queue proxies would get distributed across different computers. The above code may not be sufficient, as suggested by the following article, How to share multiprocessing queue object between multiple computers. However, that code is creating instances of Queue.Queue (that code is Python 2 code) and not the proxies that are returned by a multiprocessing.SyncManager. The documentation on this is very poor. Try the change below to see if it works better (it will be slower).
Because the queues are now proxies returned by manager.Queue(), I have had to rearrange the code a bit; the queues are now passed explicitly as arguments to the process functions:
from multiprocessing import Process, Manager

class Computation():
    def __init__(self, K):
        self.n_cpus = K

    def reader(self, task_queue):
        with open(db, "r") as db:
            ... # Read an item
            task_queue.put(item)
        # signal to the worker processes to terminate:
        for _ in range(self.n_cpus):
            task_queue.put('STOP')

    def worker(self, task_queue, completed_queue):
        while True:
            item = task_queue.get(True)
            if item == "STOP":
                break
            self.process_item(item)

    def writer_process(self, completed_queue):
        while True:
            f = completed_queue.get(True)
            if f == "DONE":
                break
            self.write_f(f)

    def run(self):
        with Manager() as manager:
            task_queue = manager.Queue()
            completed_queue = manager.Queue()
            processes = [Process(target=self.worker, args=(task_queue, completed_queue)) for _ in range(self.n_cpus)]
            for p in processes:
                p.start()
            writer = Process(target=self.writer_process, args=(completed_queue,))
            writer.start()
            self.reader(task_queue)
            for p in processes:
                p.join()
            completed_queue.put("DONE")
            writer.join()

How to use multiprocessing apply_async pool in a while loop correctly

I need to use a pool to asynchronously parse results coming from an extraction method and send those results to a write queue.
I have tried this, but it seems to just run iteratively... one process after the other.
process_pool = Pool(processes=30, maxtasksperchild=1)
results = []
while True:
    filepath = read_queue.get(True)
    if filepath is None:
        break
    res = process_pool.apply_async(func=process.run, args=(filepath, final_path), callback=write_queue.put)
    results.append(res)

for result in results:
    result.wait()

process_pool.close()
process_pool.join()
I have also tried just waiting on each result, but that does the same thing as the above:
process_pool = Pool(processes=30, maxtasksperchild=1)
while True:
    filepath = read_queue.get(True)
    if filepath is None:
        break
    res = process_pool.apply_async(func=process.run, args=(filepath, final_path), callback=write_queue.put)
    res.wait()

process_pool.close()
process_pool.join()
I also tried just scheduling processes and letting the pool block itself if it's out of workers to spawn:
process_pool = Pool(processes=30, maxtasksperchild=1)
while True:
    filepath = read_queue.get(True)
    if filepath is None:
        break
    process_pool.apply_async(func=process.run, args=(filepath, final_path), callback=write_queue.put)

process_pool.close()
process_pool.join()
This doesn't work either: it just runs through the loop over and over without actually running any sort of function, and I'm not sure why. It seems I have to do something with the AsyncResult for the pool to actually schedule the task.
I need it to work like this:
When there is a result waiting in the queue, spawn a new process in the pool with that specific argument from the queue.
On callback, put that processed result in the write queue.
However, I can't seem to get it to work asynchronously correctly. It will only work iteratively, because I have to do something with the result (a .get, a .wait, whatever) to actually get the task to schedule properly.
# write.py
import bz2
from pathlib import Path
from multiprocessing import Process, Queue
from queue import Empty

def write(p_list):
    outfile = Path('outfile.txt.bz2')
    for data in p_list:
        if Path.exists(outfile):
            mode = 'ab'
        else:
            mode = 'wb'
        with bz2.open(filename=outfile, mode=mode, compresslevel=9) as output:
            temp = (str(data) + '\n').encode('utf-8')
            output.write(temp)
    print('JSON files written', flush=True)

class Write(Process):
    def __init__(self, write_queue: Queue):
        Process.__init__(self)
        self.write_queue = write_queue

    def run(self):
        while True:
            try:
                p_list = self.write_queue.get(True, 900)
            except Empty:
                continue
            if p_list is None:
                break
            write(p_list)
-
# process.py
import time

def parse(data: int):
    global json_list
    time.sleep(.1) # simulate parsing the json
    json_list.append(data)

def read(data: int):
    time.sleep(.1)
    parse(data)

def run(data: int):
    global json_list
    json_list = []
    read(data)
    return json_list

if __name__ == '__main__':
    global output_path, json_list
-
# main.py
from multiprocessing import Pool, Queue
import process
from write import Write

if __name__ == '__main__':
    read_queue = Queue()
    write_queue = Queue()
    write = Write(write_queue=write_queue)
    write.daemon = True
    write.start()
    for i in range(0, 1000000):
        read_queue.put(i)
    read_queue.put(None)
    process_pool = Pool(processes=30, maxtasksperchild=1)
    while True:
        data = read_queue.get(True)
        if data is None:
            break
        res = process_pool.apply_async(func=process.run, args=(data,), callback=write_queue.put)
    write_queue.put(None)
    process_pool.close()
    process_pool.join()
    write.join()
    print('process done')
So, the problem is that there is no problem. I'm just stupid. If you define maxtasksperchild=1, the tasks schedule very quickly and it just looks like nothing is happening (or maybe I'm the only one who thought that).
Here's a reasonable way to use an asynchronous process pool correctly within a while loop with maxtasksperchild=1:
import multiprocessing
from multiprocessing import TimeoutError
import time

if __name__ == '__main__':
    def func(elem):
        time.sleep(0.5)
        return elem

    def callback(elem):
        # do something with processed data
        pass

    num_processes = 30
    queue = multiprocessing.Queue()
    for i in range(0, 10000):
        queue.put(i)
    queue.put(None) # sentinel so the while loop below can terminate

    process_pool = multiprocessing.Pool(processes=num_processes, maxtasksperchild=1)
    results = []
    while True:
        data = queue.get(True)
        if data is None:
            break
        res = process_pool.apply_async(func=func, args=(data,), callback=callback)
        results.append(res)

    flag = False
    for i, res in enumerate(results):
        try:
            res.get(600) # raises TimeoutError if the task has not finished in time
            # do some logging
            results[i] = None
        except TimeoutError:
            flag = True
            # do some logging

    process_pool.close()
    if flag:
        process_pool.terminate()
    process_pool.join()
    # done!

Python multiprocessing for dataset preparation

I'm looking for faster ways to prepare my dataset for a machine-learning task. I found that the multiprocessing library might be helpful. However, because I'm a newbie with multiprocessing, I couldn't find a proper way.
I first wrote some code like the below:
class DatasetReader:
    def __init__(self):
        self.data_list = Read_Data_from_file
        self.data = []

    def _ready_data(self, ex, idx):
        # Some complex functions that takes several minutes

    def _dataset_creator(self, queue):
        for idx, ex in enumerate(self.data_list):
            queue.put(self._ready_data(ex, idx))

    def _dataset_consumer(self, queue):
        total_mem = 0.0
        t = tqdm(range(self.num_data), total=self.num_data, desc='Building Dataset ', bar_format='{desc}:{percentage:3.0f}% ({n_fmt}/{total_fmt}) [{elapsed}<{remaining},{rate_fmt}{postfix}]')
        for idx in t:
            ins = queue.get()
            self.data.append(ins)
            gc.collect()

    def _build_dataset(self):
        queue = Queue()
        creator = Process(target=self._dataset_creator, args=(queue,))
        consumer = Process(target=self._dataset_consumer, args=(queue,))
        creator.start()
        consumer.start()
        queue.close()
        queue.join_thread()
        creator.join()
        consumer.join()
However, in my opinion, because _dataset_creator processes the data (via _ready_data) in a serial manner, this would not be helpful for reducing the time consumption.
So, I modified the code to generate multiple processes, each of which processes one datum:
class DatasetReader:
    def __init__(self):
        self.data_list = Read_Data_from_file
        self.data = []

    def _ready_data(self, ex, idx):
        # Some complex functions that takes several minutes

    def _dataset_creator(self, ex, idx, queue):
        queue.put(self._ready_data(ex, idx))

    def _dataset_consumer(self, queue):
        total_mem = 0.0
        t = tqdm(range(self.num_data), total=self.num_data, desc='Building Dataset ', bar_format='{desc}:{percentage:3.0f}% ({n_fmt}/{total_fmt}) [{elapsed}<{remaining},{rate_fmt}{postfix}]')
        for idx in t:
            ins = queue.get()
            self.data.append(ins)
            gc.collect()

    def _build_dataset(self):
        queue = Queue()
        for idx, ex in enumerate(self.data_list):
            p = Process(target=self._dataset_creator, args=(ex, idx, queue,))
            p.start()
        consumer = Process(target=self._dataset_consumer, args=(queue,))
        consumer.start()
        queue.close()
        queue.join_thread()
        consumer.join()
However, this returns me errors:
Process Process-18:
Traceback ~~~
RuntimeError: can't start new thread
Traceback ~~~
OSError: [Errno 12] Cannot allocate memory
Could you help me to process complex data in a parallel way?
EDIT 1:
Thanks to @tdelaney, I was able to reduce the time consumption by generating self.num_worker processes (16 in my experiment):
def _dataset_creator(self, pid, queue):
    for idx, ex in list(enumerate(self.data_list))[pid::self.num_worker]:
        queue.put(self._ready_data(ex, idx))

def _dataset_consumer(self, queue):
    t = tqdm(range(self.num_data), total=self.num_data, desc='Building Dataset ', bar_format='{desc}:{percentage:3.0f}% ({n_fmt}/{total_fmt}) [{elapsed}<{remaining},{rate_fmt}{postfix}]')
    for _ in t:
        ins = queue.get()
        self.data[ins['idx']] = ins

def _build_dataset(self):
    queue = Queue()
    procs = []
    for pid in range(self.num_worker):
        p = Process(target=self._dataset_creator, args=(pid, queue,))
        procs.append(p)
        p.start()
    consumer = Process(target=self._dataset_consumer, args=(queue,))
    consumer.start()
    queue.close()
    queue.join_thread()
    for p in procs:
        p.join()
    consumer.join()
I'm trying to sketch out what a solution with a multiprocessing pool would look like. I got rid of the consumer process completely because it looks like the parent process is just waiting anyway (and needs the data eventually) so it can be the consumer. So, I set up a pool and use imap_unordered to handle passing the data to the worker.
I guessed that the data processing doesn't really need the DatasetReader at all and moved it out to its own function. On Windows, either the entire DatasetReader object is serialized and sent to the subprocess (including data you don't want) or the child's version of the object is incomplete and may crash when you try to use it.
Either way, changes made to a DatasetReader object in the child processes aren't seen in the parent. This can be unexpected if the parent depends on updated state in that object. It's best to severely bracket what's happening in subprocesses, in my opinion.
from multiprocessing import Pool, get_start_method, cpu_count
from tqdm import tqdm

# moved out of the class (assuming it is not class dependent) so that
# the entire DatasetReader object isn't pickled and sent to
# the child on spawning systems like Microsoft Windows
def _ready_data(idx_ex):
    idx, ex = idx_ex
    # Some complex functions that take several minutes
    result = complex_functions(ex)
    return (idx, result)

class DatasetReader:
    def __init__(self):
        self.data_list = Read_Data_from_file
        self.data = [None] * len(self.data_list)

    def _ready_data_fork(self, idx):
        # on forking systems, call the worker with the object's own data
        return _ready_data((idx, self.data_list[idx]))

    def run(self):
        t = tqdm(range(self.num_data), total=self.num_data, desc='Building Dataset ',
                 bar_format='{desc}:{percentage:3.0f}% ({n_fmt}/{total_fmt}) '
                            '[{elapsed}<{remaining},{rate_fmt}{postfix}]')
        pool = Pool(min(cpu_count(), len(self.data_list)))
        if get_start_method() == 'fork':
            # on forking systems, self.data_list already exists in the child,
            # so we only pass the index
            result_iter = pool.imap_unordered(self._ready_data_fork,
                                              range(len(self.data_list)),
                                              chunksize=1)
        else:
            # on spawning systems, we need to pass the data itself
            result_iter = pool.imap_unordered(_ready_data,
                                              enumerate(self.data_list),
                                              chunksize=1)
        for idx, result in result_iter:
            t.update(1)
            self.data[idx] = result
        pool.close()
        pool.join()

python how to set a thread limit?

I was wondering how I could limit something like this to use only 10 threads at a time:
with open("data.txt") as f:
for line in f:
lines = line.rstrip("\n\r")
t1 = Thread(target=Checker, args=("company"))
t1.start()
Use Python's ThreadPoolExecutor with the max_workers argument set to 10.
Something like this:
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=10)
with open("data.txt") as f:
    for line in f:
        lines = line.rstrip("\n\r")
        pool.submit(Checker, "company")
pool.shutdown(wait=True)
The pool will automatically allocate threads as needed, capping the number of threads at 10. The first argument to pool.submit() is the function itself; the remaining comma-separated arguments are passed through to it.
pool.shutdown(wait=True) waits for all threads to complete execution.
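If the return values of Checker are needed, submit returns Future objects that can be collected; a minimal sketch, using a hypothetical Checker that just measures the line:

from concurrent.futures import ThreadPoolExecutor, as_completed

def Checker(company): # hypothetical stand-in for the real Checker
    return len(company)

with open("data.txt") as f, ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(Checker, line.rstrip("\n\r")) for line in f]
    for future in as_completed(futures):
        print(future.result())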
Use the ThreadPoolExecutor and tell it that you want 10 threads.
import concurrent.futures

def your_function_processing_one_line(line):
    pass # your computations

with open("data.txt") as f:
    with concurrent.futures.ThreadPoolExecutor(10) as executor:
        result = executor.map(your_function_processing_one_line, [line for line in f])
...and you will have all the results in result.
I wrote this nested loop to cap the number of threads at a variable limit.
This code relies on a preset array of commands to process.
I have borrowed some elements from other answers for the thread launch.
import os, sys, datetime, logging, thread, threading, time
from random import randint

# set number of threads
threadcount = 20

# alltests is an array of test data
numbertests = len(alltests)
testcounter = numbertests

# run tests
for test in alltests:
    # launch worker thread
    def worker():
        """thread worker function"""
        os.system(command)
        return
    threads = []
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()
    testcounter -= 1
    # cap the threads if over limit
    while threading.active_count() >= threadcount:
        threads = threading.active_count()
        string = "Excessive threads, pausing 5 secs - " + str(threads)
        print(string)
        logging.info(string)
        time.sleep(5)

# monitor for threads winding down
while threading.active_count() != 1:
    threads = threading.active_count()
    string = "Active threads running - " + str(threads)
    print(string)
    logging.info(string)
    time.sleep(5)
(For both Python 2.6+ and Python 3.)
Use the ThreadPool from the multiprocessing module:
from multiprocessing.pool import ThreadPool
The only thing is that it is not well documented...
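A minimal sketch of how it could be applied to the question's loop, again with a hypothetical Checker standing in for the real one:

from multiprocessing.pool import ThreadPool

def Checker(company): # hypothetical stand-in for the real Checker
    return len(company)

with open("data.txt") as f:
    stripped = [line.rstrip("\n\r") for line in f]

pool = ThreadPool(10) # at most 10 threads at a time
results = pool.map(Checker, stripped)
pool.close()
pool.join()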

python multithreading join causes hang

I'm using the threading module in Python to do some tests on I/O bound processing.
Basically, I am simply reading a file, line by line, and writing it out concurrently.
I put the reading and writing loops in separate threads and use a Queue to pass data between them:
q = Queue()
rt = ReadThread(ds)
wt = WriteThread(outBand)
rt.start()
wt.start()
If I run it as above, it works fine, but the interpreter crashes at the end of execution. (Any ideas why?)
If I add:
rt.join()
wt.join()
at the end, the interpreter simply hangs. Any ideas why?
The code for the ReadThread and WriteThread classes is as follows:
class ReadThread(threading.Thread):
    def __init__(self, ds):
        threading.Thread.__init__(self)
        self.ds = ds # The raster datasource to read from
    def run(self):
        reader(self.ds)

class WriteThread(threading.Thread):
    def __init__(self, ds):
        threading.Thread.__init__(self)
        self.ds = ds # The raster datasource to write to
    def run(self):
        writer(self.ds)

def reader(ds):
    """Reads data from raster, starting with a chunk for three lines then removing/adding a row for the remainder"""
    data = read_lines(ds)
    q.put(data[1, :]) # add to the queue
    for i in np.arange(3, ds.RasterYSize):
        data = np.delete(data, 0, 0)
        data = np.vstack([data, read_lines(ds, int(i), 1)])
        q.put(data[1, :]) # put the relevant data on the queue

def writer(ds):
    """ Writes data from the queue to a raster file """
    i = 0
    while True:
        arr = q.get()
        ds.WriteArray(np.atleast_2d(arr), xoff=0, yoff=i)
        i += 1
Calling q.get() will block indefinitely if your Queue is empty.
You can try using get_nowait(), but you have to make sure that by the time you get to the writer function there is something in the Queue.
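For illustration, a minimal sketch of what a non-blocking writer built on get_nowait() might look like; it assumes the question's q, rt (the ReadThread instance), np, and ds, and Python 3 module names:

from queue import Empty
import time

def writer(ds):
    """Drain the queue without blocking forever."""
    i = 0
    while True:
        try:
            arr = q.get_nowait() # raises queue.Empty instead of blocking
        except Empty:
            if not rt.is_alive() and q.empty(): # reader done and queue drained
                break
            time.sleep(0.1) # queue momentarily empty; wait and retry
            continue
        ds.WriteArray(np.atleast_2d(arr), xoff=0, yoff=i)
        i += 1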
wt.join() waits for the thread to finish, which it never does because of the infinite loop around q.get() in writer. To make it finish, add
q.put(None)
as the last line of reader, and change writer to
def writer(ds):
    """ Writes data from the queue to a raster file """
    for i, arr in enumerate(iter(q.get, None)):
        ds.WriteArray(np.atleast_2d(arr), xoff=0, yoff=i)
iter(q.get, None) yields values from q until q.get returns None. I added enumerate just for the sake of simplifying the code further.
