Multithreaded read and write of a file using Python

So, I have a 9000-line CSV file. I have read it and stored it in a dictionary of lists keyed by a string m. What I want to do is loop over every item in list[m] and pass it to a function processItem(item). processItem returns a string in CSV-like format. My aim is to write the result of processItem for every item in the list to a file. Is there a way to do this with multiple threads?
I think I should divide the list into N sub-lists and then process those sub-lists in separate threads. Each thread would return the strings processed from its sub-list, the results would be merged, and finally the merged output would be written to a file. How do I implement that?

This is a perfect use case for the multiprocessing module and its Pool() class (be aware that the threading module will not give you a speedup here, because of the GIL).
You have to apply a function to each element of your list, so this can easily be parallelized.
from multiprocessing import Pool

with Pool() as p:
    processed = p.map(processItem, lst)
If you are using Python 2, Pool() cannot be used as a context manager, but you can use it like this:
p = Pool()
processed = p.map(processItem, lst)
Your function processItem() will be called for each element of lst, and the results will form a new list processed (order is preserved).
Pool() spawns as many worker processes as your CPU has cores, and each worker picks up a new task as soon as it finishes the previous one, until every element has been processed.
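Putting the pieces together for the question above, a minimal sketch (assuming the parsed rows have been flattened into a single list called items, that processItem returns one CSV-formatted line per item, and that output.csv is the target file; those names are assumptions, not from the question):

from multiprocessing import Pool

if __name__ == '__main__':
    # processItem runs in parallel across worker processes; result order matches input order.
    with Pool() as p:
        processed = p.map(processItem, items)
    # Merge the processed strings and write them out in one go.
    with open('output.csv', 'w') as f:
        f.write('\n'.join(processed) + '\n')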

Related

Multiprocessing items in a list - Python: to avoid clashes, would a queue be required?

I have a list of 18k ids and want to perform 3 functions on each, i.e.:
fetch_data_from_db(),
data_clean(),
push_data_back_to_db()
For each of the 18k records, I want to perform 1 and 2 in sequence and dump the output to files; 3 can be kicked off at a later stage.
To make it finish faster, I am trying to multiprocess by writing a wrapper function around (1, 2) using the multiprocessing pool.
from multiprocessing import pool
import time

list_of_18k = [1,2,3,4,5....]

def func(id):
    fetch_data_from_db(id)
    data_clean(id)

if __name__ == '__main__':
    p = pool.Pool()
    res = p.map(func, list_of_18k)
    p.close()
    p.join()
Question is, if we simply run it this way, will map automatically distribute the 18k ids across the processes (cores) so that no process reads the same list item twice? Or is it necessary to take a lock around list_of_18k, or to add some queue/pipe?

Multiprocessing storing read-only string-array for all processes

I am trying to create workers for a task that involves reading a lot of files and analyzing them.
I want something like this:
list_of_unique_keys_from_csv_file = [] # About 200mb array (10m rows)
# a list of uniquekeys for comparing inside worker processes to a set of flatfiles
I need more workers because it is going very slowly doing the comparison with one process (10 minutes per file).
I have another set of flat files that I compare the CSV file to, to see if unique keys exist. This seems like a map-reduce type of problem.
main.py:
def worker_process(directory_glob_of_flat_files, list_of_unique_keys_from_csv_file):
    # Do some parallel comparisons, "if not in" type stuff.
    # Generate an array of lines of text like:
    # "this item_x was not detected in CSV list (from current_flatfile)"
    if current_item not in list_of_unique_keys_from_csv_file:
        all_lines_this_worker_generated.append(sometext + current_item)
    return all_lines_this_worker_generated

def main():
    all_results = []
    pool = Pool(processes=6)
    partitioned_flat_files = []  # divide files from glob by 6
    results = pool.starmap(worker_process, partitioned_flat_files, {{{{i wanna pass in my read-only parameter}}}})
    pool.close()
    pool.join()
    all_results.extend(results)
    resulting_file.write(all_results)
I am using both a Linux and a Windows environment, so perhaps I need something cross-platform compatible (the whole fork() discussion).
Main question: do I need some sort of Pipe or Queue? I can't seem to find good examples of how to transfer around a big read-only string array, with a copy for each worker process.
You can just split your read-only parameters and then pass them in. The multiprocessing module is cross-platform compatible, so don't worry about it.
Every process, including a sub-process, has its own resources, which means that no matter how you pass parameters to it, it keeps a copy of the original instead of sharing it. In this simple case, when you pass the parameters from the main process to the sub-processes, Pool automatically makes a copy of your variables. Because the sub-processes only have copies of the originals, modifications cannot be shared back; that does not matter here, since your variables are read-only.
But be careful with your code: for starmap you need to wrap the parameters for each call into an iterable of argument tuples, for example:
from multiprocessing import Pool

def add(a, b):
    return a + b

pool = Pool()
results = pool.starmap(add, [(1, 2), (3, 4)])
print(results)
# [3, 7]
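Applied to the question above, one way to pair each partition of flat files with the read-only key list is to build the argument tuples yourself before calling starmap (a sketch that reuses the variable names from the question):

from multiprocessing import Pool

if __name__ == '__main__':
    # Each tuple becomes one call: worker_process(partition, list_of_unique_keys_from_csv_file).
    # Every worker process receives its own copy of the (read-only) key list.
    args = [(partition, list_of_unique_keys_from_csv_file)
            for partition in partitioned_flat_files]
    pool = Pool(processes=6)
    results = pool.starmap(worker_process, args)
    pool.close()
    pool.join()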

Python multi function multithreading with threading.Thread? (variable number of threads)

I'm trying to start a variable number of threads to compute the results of functions for one of my automated trading modules. I have about 14 functions, all of which are computationally expensive. I've been calculating each function sequentially, but that takes around 3 minutes to complete, and since my platform is high frequency I need to cut that computation time down to 1 minute or less.
I've read up on multiprocessing and multithreading, but I can't find a solution that fits my need.
What I'm trying to do is define "n" number of threads to use, then divide my list of functions into "n" groups, then compute each group of functions in a separate thread. Essentially:
import numpy as np

functionList = [func1, func2, func3, func4]
outputList = [func1out, func2out, func3out, func4out]
argsList = [func1args, func2args, func3args, func4args]

# number of threads
n = 3

functionSplit = np.array_split(np.array(functionList), n)
outputSplit = np.array_split(np.array(outputList), n)
argSplit = np.array_split(np.array(argsList), n)
Now I'd like to start "n" separate threads, each processing the functions according to the split lists. Then I'd like to name the output of each function according to outputList and build a master dict of the outputs from every function. I will then loop through the output dict and create a dataframe with column ID numbers according to the information in each column (I already have this part worked out, I just need the multithreading).
Is there any way to do something like this? I've been looking into creating a subclass of threading.Thread and passing the functions, output names, and arguments into the run() method, but I don't know how to name and return the results of the functions from each thread, nor how to call the functions in a list with their corresponding arguments.
The reason that I'm doing this is to discover the optimum thread number balance between computational efficiency and time. Like I said, this will be integrated into a high frequency trading platform I'm developing where time is my major constraint!
Any ideas?
You can use the multiprocessing library like below:
import multiprocessing

def callfns(fnList, argList, outList, d):
    # Call each function with its arguments and store the result in the
    # shared dict under its output name (this assumes each entry of argList
    # is a tuple of arguments and outList holds the output names).
    for i in range(len(fnList)):
        d[outList[i]] = fnList[i](*argList[i])

...

manager = multiprocessing.Manager()
d = manager.dict()
processes = []
for i in range(len(functionSplit)):
    process = multiprocessing.Process(target=callfns, args=(functionSplit[i], argSplit[i], outputSplit[i], d))
    processes.append(process)

for j in processes:
    j.start()

for j in processes:
    j.join()

# use d here
You can use a server process to share the dictionary between these processes. To interact with the server process you need a Manager; the dictionary d is created in the server process by manager.dict(). Once all the processes have joined back into the main process, you can use the dictionary d.
I hope this helps you solve your problem.
You should use multiprocessing instead of threading for cpu bound tasks.
Manually creating and managing processes can be difficult and requires more effort. Do check out concurrent.futures and try ProcessPoolExecutor for maintaining a pool of processes; you can submit tasks to it and retrieve results.
The Pool.map method from the multiprocessing module takes a function and an iterable and processes them in chunks in parallel to compute faster: the iterable is broken into separate chunks, the chunks are passed to the function in separate processes, and the results are then put back together.
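A minimal sketch of the concurrent.futures approach, applied to the functionList/argsList setup from the question (run_one is a hypothetical helper; each entry of argsList is assumed to be a tuple of arguments, and the functions must be defined at module level so they can be pickled):

from concurrent.futures import ProcessPoolExecutor

def run_one(fn, args):
    # Unpack one (function, args) pair and run it in a worker process.
    return fn(*args)

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=3) as executor:
        futures = [executor.submit(run_one, fn, args)
                   for fn, args in zip(functionList, argsList)]
        # Collect the results into a dict keyed by the names in outputList.
        results = {name: future.result()
                   for name, future in zip(outputList, futures)}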

Optimization for Python code

I have a small function (see below) that returns a list of names mapped from a list of integers (e.g. [1,2,3,4]) which can be up to a thousand elements long.
This function can potentially get called tens of thousands of times at a time and I want to know if I can do anything to make it run faster.
The graph_hash is a large hash that maps keys to sets of length 1000 or less. I am iterating over a set and mapping the values to names and returning a list. The u.get_name_from_id() queries an sqlite database.
Any thoughts to optimize any part of this function?
def get_neighbors(pid):
    names = []
    for p in graph_hash[pid]:
        names.append(u.get_name_from_id(p))
    return names
Caching and multithreading are only going to get you so far; you should create a new method that retrieves multiple names from the database in bulk with a single query (for example SELECT ... WHERE id IN (...)), instead of one query per id.
Something like names = u.get_names_from_ids(graph_hash[pid]).
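A possible sketch of such a bulk lookup with sqlite3, assuming a connection conn and a table named names with columns id and name (the table and column names are assumptions about the schema, which the question does not show):

def get_names_from_ids(conn, ids):
    ids = list(ids)
    # One parameterised query instead of one query per id.
    # For very large id sets, chunk this to stay under SQLite's parameter limit.
    placeholders = ','.join('?' * len(ids))
    rows = conn.execute(
        'SELECT id, name FROM names WHERE id IN (%s)' % placeholders,
        ids).fetchall()
    by_id = dict(rows)
    # Preserve the order of the requested ids.
    return [by_id.get(i) for i in ids]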
You're hitting the database sequentially here:
for p in graph_hash[pid]:
    names.append(u.get_name_from_id(p))
I would recommend doing it concurrently using threads. Something like this should get you started:
import threading
import Queue  # "queue" in Python 3

def load_stuff(queue, p):
    queue.put(u.get_name_from_id(p))

def get_neighbors(pid):
    names = Queue.Queue()
    # we'll keep track of the threads with this list
    threads = []
    for p in graph_hash[pid]:
        thread = threading.Thread(target=load_stuff, args=(names, p))
        threads.append(thread)
        # start the thread
        thread.start()
    # wait for them to finish before you return your Queue
    for thread in threads:
        thread.join()
    return names
You can turn the Queue back into a list with [item for item in names.queue] if needed.
The idea is that each database call blocks until it is done, but you can run multiple SELECT statements against a database without locking. So you should use threads or some other concurrency approach to avoid waiting unnecessarily.
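A more compact variant of the same idea, using concurrent.futures.ThreadPoolExecutor instead of managing threads and a queue by hand (a sketch, assuming u.get_name_from_id is safe to call from multiple threads):

from concurrent.futures import ThreadPoolExecutor

def get_neighbors(pid):
    # Each database lookup runs in a worker thread; map preserves input order.
    with ThreadPoolExecutor(max_workers=8) as executor:
        return list(executor.map(u.get_name_from_id, graph_hash[pid]))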
I would recommend using a deque instead of a list if you are doing thousands of appends, so names should be names = deque() (from collections import deque).
A list comprehension is a start (similar to #cricket_007's generator suggestion), but you are limited by function calls:
def get_neighbors(pid):
    return [u.get_name_from_id(p) for p in graph_hash[pid]]
As #salparadise suggested, consider memoization to speed up get_name_from_id().
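A minimal sketch of that memoization with functools.lru_cache, assuming the ids are hashable and the id-to-name mapping does not change while the program runs:

from functools import lru_cache

@lru_cache(maxsize=None)
def cached_name(p):
    # Each distinct id hits the database only once; repeats come from the cache.
    return u.get_name_from_id(p)

def get_neighbors(pid):
    return [cached_name(p) for p in graph_hash[pid]]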

Python and multiprocessing example

I am totally new to multiprocessing. I am trying to change my code in order to run parts of it simultaneously.
I have a huge list, and I have to call an API for each node. Since the API calls are independent, I don't need the result of the first one in order to proceed to the second one. So, I have this code:
def xmlpart1(id):
    ..call the api..
    ..retrieve the xml..
    ..find the part of xml I want..
    return xml_part1

def xmlpart2(id):
    ..call the api..
    ..retrieve the xml..
    ..find the part of xml I want..
    return xml_part2

def main(index):
    mylist = [[..,..],[..,..],[..,..],[..,...]]  # A huge list of lists with the ids I need for calling the APIs
    myL = mylist[index]
    mydic = {}
    for i in myL:
        flag1 = xmlpart1(i)
        flag2 = xmlpart2(i)
        mydic[flag1] = flag2
    root = "myfilename %s.json" % (str(index))
    with open(root, "wb") as f:
        json.dump(mydic, f)

from multiprocessing import Pool

if __name__ == '__main__':
    Pool().map(main, [0, 1, 2, 3])
After a few suggestions from here and from the chat, I ended up with this code. The problem is still there. I ran the script at 9:50. At 10:25 the first file, "myfilename 0.json", appeared in my folder. Now it is 11:25 and none of the other files has appeared. The sublists have equal length and do the same thing, so they should take approximately the same time.
This is something more suited to the multiprocessing.Pool() class.
Here's a simple example:
from multiprocessing import Pool

def job(args):
    """Your job function"""

Pool().map(job, inputs)
Where:
inputs is your list of inputs. Each input gets passed to job and processed in a separate process.
You get the results back as a list when all jobs have completed.
multiprocessing.Pool().map is just like the Python builtin map() but sets up a process pool of workers for you and passes each input to the given function.
See the docs for more details: http://docs.python.org/2/library/multiprocessing.html
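For instance, a small self-contained sketch of the same pattern, showing that the results come back as an ordered list (job and inputs here are illustrative stand-ins, not from the question):

from multiprocessing import Pool

def job(x):
    # A stand-in job function that just squares its input.
    return x * x

if __name__ == '__main__':
    inputs = [1, 2, 3, 4]
    p = Pool()
    print(p.map(job, inputs))  # [1, 4, 9, 16]
    p.close()
    p.join()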
