I have a zip file with many .dat files. To each of them I aim to apply some function that outputs two results, and I want to save the results of that function and the time it takes into three lists. The order matters. Here is the code to do it with no parallel computing:
result_1 = []
result_2 = []
runtimes = []
args_function = 'some args' # Always the same
with zipfile.ZipFile(zip_file, "r") as zip_ref:
    for name in sorted(zip_ref.namelist()):
        data = np.loadtxt(zip_ref.open(name))
        start_time = time.time()
        a, b = function(data, args_function)
        runtimes.append(time.time() - start_time)
        result_1.append(a)
        result_2.append(b)
This seems embarrassingly parallel to me, so I did:
result_1 = []
result_2 = []
runtimes = []
args_function = 'some args' # Always the same

def compute_paralel(name, zip_ref):
    data = np.loadtxt(zip_ref.open(name))
    start_time = time.time()
    a, b = function(data, args_function)
    runtimes.append(time.time() - start_time)
    result_1.append(a)
    result_2.append(b)

with zipfile.ZipFile(zip_file, "r") as zip_ref:
    Parallel(n_jobs=-1)(delayed(compute_paralel)(name, zip_ref) for name in sorted(zip_ref.namelist()))
But this raises the following error: pickle.PicklingError: Could not pickle the task to send it to the workers. Therefore I'm not really sure what to do... Any ideas?
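A common way out of this (my sketch, not the original poster's code) is to stop sharing the open ZipFile and the module-level result lists with the workers: pass only the archive path and the member name, reopen the archive inside the worker, and return the values instead of appending to lists that each worker would only see a copy of. joblib's Parallel returns outputs in the same order as the inputs, so the ordering requirement is preserved. A minimal sketch, assuming function, args_function and zip_file are defined as in the question:

import time
import zipfile

import numpy as np
from joblib import Parallel, delayed

def compute_one(zip_path, name, args_function):
    # Each worker reopens the archive itself, so nothing unpicklable is sent to it.
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        data = np.loadtxt(zip_ref.open(name))
    start_time = time.time()
    a, b = function(data, args_function)
    return a, b, time.time() - start_time

with zipfile.ZipFile(zip_file, "r") as zip_ref:
    names = sorted(zip_ref.namelist())

# Results come back in the same order as 'names', so the three lists stay aligned.
out = Parallel(n_jobs=-1)(delayed(compute_one)(zip_file, name, args_function) for name in names)
result_1, result_2, runtimes = (list(x) for x in zip(*out))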
I have a JSON file with some values that must be extracted and processed, and the processed values then added back to the file. To do this I use multiprocessing, and although it is a priori synchronized, I get race conditions.
The function that is called just rescales a rating from the 0-5 range to the 0-100 range.
Thanks in advance!
import json
import multiprocessing

n = 1000
maxRating = 5
percentage = 100
inputfile = 'rawData.JSON'
outfile = 'processedData.JSON'

# load data into dictionary "data"
with open(inputfile) as f:
    data = json.load(f)

# create an empty dictionary that will contain the new information
results = {}

def saver(init, end, q, l):
    for num in range(init, end):
        l.acquire()
        rating = data["bars"][num]["rating"]
        ratioRating = (percentage * rating) / maxRating
        results["ratingP"] = ratioRating
        print(ratioRating)
        # put data in queue
        q.put(results)
        l.release()

# main function
if __name__ == '__main__':
    i = 0
    cores = 4
    q = multiprocessing.Queue()
    lock = multiprocessing.Lock()

    if cores > 1:  # parallel
        for i in range(cores):
            init = (i * n) / cores
            fin = ((i + 1) * n) / cores
            p = multiprocessing.Process(target=saver, args=(init, fin, q, lock)).start()
        for i in range(n):
            data["bars"][i].update(q.get())  # update "data" dictionary adding new processed data
    else:  # sequential
        saver(0, n, q)
        for l in range(n):
            data["bars"][l].update(q.get())  # update "data" dictionary adding new processed data

    # write the updated JSON file with the added processed data
    with open(outfile, 'w') as outfile:
        json.dump(data, outfile)
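One way to avoid the race condition entirely (my sketch, not the poster's code) is to stop sharing the results dict, the queue and the lock across processes: have each worker return its rescaled value and let the parent process apply all updates to data. A minimal sketch with multiprocessing.Pool, assuming the same rawData.JSON layout with a "bars" list of dicts that each contain a "rating" key:

import json
from multiprocessing import Pool

maxRating = 5
percentage = 100

def rescale(rating):
    # Pure function, no shared state, so no locks are needed.
    return (percentage * rating) / maxRating

if __name__ == '__main__':
    with open('rawData.JSON') as f:
        data = json.load(f)

    ratings = [bar["rating"] for bar in data["bars"]]

    with Pool(4) as pool:
        # Pool.map preserves order, so result i belongs to data["bars"][i].
        scaled = pool.map(rescale, ratings)

    for bar, ratingP in zip(data["bars"], scaled):
        bar["ratingP"] = ratingP

    with open('processedData.JSON', 'w') as out:
        json.dump(data, out)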
I am reading from a huge file (232MB) line by line.
First, I match each line against a regular expression.
Then, for each line, I write to a different city.txt file under the 'report' directory, according to the city name in that line. However, this process takes a while. I am wondering if there is any way to speed it up?
Example of the input file (columns separated by a \t):
2015-02-03 19:20 Sane Diebgo Music 692.08 Cash
I have actually tested the code both writing to the different files and not writing to them (simply processing the large file and building the two dicts), and the time difference is huge: 80% of the time is spent writing to the different files.
def processFile(file):
    pattern = re.compile(r"(\d{4}-\d{2}-\d{2})\t(\d{2}:\d{2})\t(.+)\t(.+)\t(\d+\.\d+|\d+)\t(\w+)\n")
    f = open(file)
    total_sale = 0
    city_dict = dict()
    categories_dict = dict()
    os.makedirs("report", exist_ok = True)
    for line in f:
        valid_entry = pattern.search(line)
        if valid_entry == None:
            print("Invalid entry: '{}'".format(line.strip()))
            continue
        else:
            entry_sale = float(valid_entry.group(5))
            total_sale += entry_sale
            city_dict.update({valid_entry.group(3) : city_dict.get(valid_entry.group(3), 0) + entry_sale})
            categories_dict.update({valid_entry.group(4) : categories_dict.get(valid_entry.group(4), 0) + entry_sale})
            filename = "report/" + valid_entry.group(3) + ".txt"
            if os.path.exists(filename):
                city_file = open(filename, "a")
                city_file.write(valid_entry.group(0))
                city_file.close()
            else:
                city_file = open(filename, "w")
                city_file.write(valid_entry.group(0))
                city_file.close()
    f.close()
    return (city_dict, categories_dict, total_sale)
The dictionary lookups and updates could be improved by using defaultdict:
from collections import defaultdict
city_dict = defaultdict(float)
categories_dict = defaultdict(float)
...
city = valid_entry.group(3)
category = valid_entry.group(4)
...
city_dict[city] += entry_sale
categories_dict[category] += entry_sale
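Since the question notes that about 80% of the time goes into the per-line open/write/close of the city files, a complementary change (my suggestion, not part of the original answer) is to buffer the matched lines per city and write each report file once after the loop, at the cost of holding those lines in memory (reasonable for a 232 MB input). A sketch combining this with the defaultdict idea:

import os
import re
from collections import defaultdict

def processFile(file):
    pattern = re.compile(r"(\d{4}-\d{2}-\d{2})\t(\d{2}:\d{2})\t(.+)\t(.+)\t(\d+\.\d+|\d+)\t(\w+)\n")
    total_sale = 0
    city_dict = defaultdict(float)
    categories_dict = defaultdict(float)
    city_lines = defaultdict(list)  # buffer lines per city instead of opening a file per line
    os.makedirs("report", exist_ok=True)

    with open(file) as f:
        for line in f:
            valid_entry = pattern.search(line)
            if valid_entry is None:
                print("Invalid entry: '{}'".format(line.strip()))
                continue
            city = valid_entry.group(3)
            category = valid_entry.group(4)
            entry_sale = float(valid_entry.group(5))
            total_sale += entry_sale
            city_dict[city] += entry_sale
            categories_dict[category] += entry_sale
            city_lines[city].append(valid_entry.group(0))

    # One open/write/close per city instead of one per line.
    for city, lines in city_lines.items():
        with open(os.path.join("report", city + ".txt"), "w") as city_file:
            city_file.writelines(lines)

    return (city_dict, categories_dict, total_sale)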
My code compares two vectors stored as dictionaries (two pickle files) and saves the result into a pickle file too. This works, but very slowly: for one comparison result I wait about 7:20 minutes. Because I have a lot of videos (exactly 2033), this program would run for about 10 days. That is too long. How can I speed up my code for Python 2.7?
import math
import csv
import pickle
from itertools import izip

global_ddc_file = 'E:/global_ddc.p'
io = 'E:/AV-Datensatz'
v_source = ''

def dot_product(v1, v2):
    return sum(map(lambda x: x[0] * x[1], izip(v1, v2)))  # izip('ABCD', 'xy') --> Ax By

def cosine_measure(v1, v2):
    prod = dot_product(v1, v2)
    len1 = math.sqrt(dot_product(v1, v1))
    len2 = math.sqrt(dot_product(v2, v2))
    if (len1 * len2) <> 0:
        out = prod / (len1 * len2)
    else:
        out = 0
    return out

def findSource(v):
    v_id = "/" + v[0].lstrip("<http://av.tib.eu/resource/video").rstrip(">")
    v_source = io + v_id
    v_file = v_source + '/vector.p'
    source = [v_id, v_source, v_file]
    return source

def getVector(v, vectorCol):
    with open(v, 'rb') as f:
        try:
            vector_v = pickle.load(f)
        except:
            print 'file couldnt be loaded'
    tf_idf = []
    tf_idf = [vec[1][vectorCol] for vec in vector_v]
    return tf_idf

def compareVectors(v1, v2, vectorCol):
    v1_source = findSource(v1)
    v2_source = findSource(v2)
    V1 = getVector(v1_source[2], vectorCol)
    V2 = getVector(v2_source[2], vectorCol)
    sim = [v1_source[0], v2_source[0], cosine_measure(V1, V2)]
    return sim

# with open('videos_av_portal_cc_3.0_nur2bspStanford.csv', 'rb') as dataIn:
with open('videos_av_portal_cc_3.0_vollstaendig.csv', 'rb') as dataIn:
# with open('videos_av_portal_cc_3.0.csv', 'rb') as dataIn:
    try:
        reader = csv.reader(dataIn)
        v_source = []
        for row in reader:
            v_source.append(findSource(row))
        # print v_source
        for one in v_source:
            print one[1]
            compVec = []
            for another in v_source:
                if one <> another:
                    compVec.append(compareVectors(one, another, 3))
            compVec_sort = sorted(compVec, key=lambda cosim: cosim[2], reverse=True)
            # save vector file for each video
            with open(one[1] + '/compare.p', 'wb') as f:
                pickle.dump(compVec_sort, f)
    finally:
        dataIn.close()
Split the code into parts:
1. Load the dictionaries into vectors.
2. Compare the two dictionaries using multiprocessing (multiprocess example).
3. Launch processes simultaneously according to memory availability and end each process after 8 minutes; then update the third dictionary.
4. Relaunch the process on the next set of data, follow step 3, and continue until the end of the dictionary.
This should reduce the total turnaround time.
Let me know if you need code; a rough sketch of the idea follows below.
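A minimal sketch of that idea (mine, not the answerer's code): parallelize at the per-video level with a multiprocessing.Pool, so each worker builds and saves the compare.p list for one video. It reuses getVector, cosine_measure and the [v_id, v_dir, v_file] entries produced by findSource from the question, assumes those functions are defined at module level so the workers can use them, and loads the vector of the current video only once instead of once per pair:

import pickle
from multiprocessing import Pool

def compare_one(one, all_sources, vectorCol=3):
    # Build and save the sorted similarity list for a single video.
    V1 = getVector(one[2], vectorCol)
    compVec = []
    for other in all_sources:
        if other[0] != one[0]:
            compVec.append([one[0], other[0],
                            cosine_measure(V1, getVector(other[2], vectorCol))])
    compVec.sort(key=lambda cosim: cosim[2], reverse=True)
    with open(one[1] + '/compare.p', 'wb') as f:
        pickle.dump(compVec, f)
    return one[0]

def compare_all(v_source, n_workers=4):
    pool = Pool(n_workers)
    try:
        jobs = [pool.apply_async(compare_one, (one, v_source)) for one in v_source]
        for job in jobs:
            job.get()  # re-raises any exception from a worker
    finally:
        pool.close()
        pool.join()

Loading all vectors once up front (step 1 of the answer) and passing plain lists of numbers to the workers would cut the repeated pickle loads even further.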
I have a working Python script that, in a simplified way, works as follows:
A = open("A", 'r')
B = open("B", 'r')
C = open("C", 'w')
for lineA in A:
    part1, part2, part3 = lineA.split(' ')
    for lineB in B:
        if part2 in lineB:
            C.write(lineB)
I want to check whether a section of each line of file A exists in file B. If so, I write that whole line from file B to a new file C.
The process is somewhat time consuming the way I have designed it (1 - I still consider myself a noob with Python, 2 - there are at least 4 IF statements running inside the main FOR loop), and now that I have started to use input files around 200x larger than before, I am getting times of around 5 hours per input file.
I have tried to use multiprocessing but I can't seem to get it to work.
Initially I tried some simple code inside my main() function, without any significant improvement and definitely without using more than one CPU:
p = Process(target=multi_thread, args=(arg1,arg2,arg3))
p.start()
p.join()
Then I tried the jobs approach:
jobs = []
for i in range(4):
    p = Process(target='myfunc')
    jobs.append(p)
    p.start()
    p.join()
And a pool example I found here in the forums, to which I added a return statement to my main function:
def multiproc(arg1, arg2, arg3):
    (...)
    return lineB  # example of return statement

def main():
    pool = Pool(4)
    with open('file.txt', 'w') as map_file:
        # chunk the work into batches of 4 lines at a time
        results = pool.map(multi_thread, map_file, 4)

if __name__ == "__main__":
    main()
The jobs approach actually created the file and then restarted the whole process from scratch three more times. The last one gives me the following error:
io.UnsupportedOperation: not readable
And I also suppose that my return statement is breaking my loop...
Any suggestions to enable multiprocessing for this piece of code, or to improve its neatness?
Thanks!
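For reference, the pattern that pool example aims at (my sketch, not the poster's code) is to map the worker over a readable input, collect the returned lines in the parent, and do all the writing there; opening 'file.txt' in 'w' mode and iterating over it is what raises io.UnsupportedOperation. A sketch with a hypothetical worker find_matches() that handles one line of file A (in practice you would also read B into memory once, much like the dictionary version in EDIT2 below):

from multiprocessing import Pool

def find_matches(lineA):
    # Hypothetical worker: return every line of B that contains part2 of lineA.
    part1, part2, part3 = lineA.split(' ')
    matches = []
    with open('B', 'r') as B:
        for lineB in B:
            if part2 in lineB:
                matches.append(lineB)
    return matches

def main():
    with open('A', 'r') as A:
        linesA = A.readlines()

    with Pool(4) as pool:
        # chunksize=4: hand each worker 4 lines of A at a time
        results = pool.map(find_matches, linesA, 4)

    # All writing happens in the parent, so workers never touch the output file.
    with open('C', 'w') as C:
        for matches in results:
            C.writelines(matches)

if __name__ == "__main__":
    main()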
EDIT:
As requested, here is the full messy code:
#!/usr/bin/python3
__author__ = 'daniel'

import os
import re
from multiprocessing import Process
from multiprocessing import Pool
import time

start_time = time.time()

def multi_thread(filePath, datasetFolder, mapFileDataset):
    fout = open('outdude.txt', 'w')
    cwd = os.getcwd()
    cwdgen, sep, id = filePath.rpartition('/')
    dataset = datasetFolder.rsplit("/", 1)
    dataset = dataset[1]
    ## Create file
    for i in os.listdir(cwd):
        if ".ped" in i:
            sample_id, sep, rest = i.partition('.ped')
    for i in os.listdir(cwd):
        if sample_id + '.pileupgatk' in i and dataset in i:
            pileup4map = open(i, 'r')
            snpcounter = sum(1 for _ in pileup4map) - 1
            pileup4map.seek(0)
            mapout = open(sample_id + '.map', 'w')
            counter = 1
            for line in pileup4map:
                if counter <= snpcounter:
                    mapFileData = open(datasetFolder + '/' + mapFileDataset, 'r')
                    line = line.rstrip()
                    chro, coord, refb, rbase, qual = line.split(' ')
                    chrom = chro.strip("chr")
                    counter += 1
                    for ligna in mapFileData:
                        if coord in ligna:
                            k = re.compile(r'(?=%s )' % coord, re.I)
                            lookAhead = k.search(ligna)
                            k = re.compile(r'(?<= %s)' % coord, re.I)
                            lookBehind = k.search(ligna)
                            if lookAhead and lookBehind != None:
                                lignaChrom = ligna[:2].rstrip(' ')
                                if chrom == lignaChrom:
                                    lignaOut = ligna.rstrip()
                                    mapout.write(lignaOut + '\n')
                                    ## For POOL
                                    return lignaOut
                                else:
                                    pass
                            else:
                                pass
                        else:
                            pass
            mapout.close()

def main():
    # Multiproc
    # p = Process(target=multi_thread, args=('/home/full_karyo.fa', '/home/haak15', 'dataPP.map'))
    # p.start()
    # p.join()
    # print("--- %s seconds ---" % (time.time() - start_time))

    # Jobs
    # jobs = []
    # for i in range(4):
    #     p = Process(target=multi_thread, args=('/home/full_karyo.fa', '/home/haak15', 'dataPP.map'))
    #     jobs.append(p)
    #     p.start()
    #     p.join()

    # Pool
    pool = Pool(4)
    with open('file.txt', 'w') as map_file:
        # chunk the work into batches of 4 lines at a time
        results = pool.map(multi_thread, map_file, 4)
    print(results)
    print("--- %s seconds ---" % (time.time() - start_time))

if __name__ == "__main__":
    main()
EDIT2:
Following Robert E's and TheBigC's advice I rewrote my code, and it is now 13x faster and not as confusing. As TheBigC pointed out, I used a dictionary approach that is not as I/O-hungry as the previous one. I am happy enough with the speed, so I will leave multiprocessing aside for now. Thanks for the comments!
if makemap == True:
    ## Dictionary method - 13X faster
    for i in os.listdir(cwd):
        if ".ped" in i:
            sample_id, sep, rest = i.partition('.ped')
    for i in os.listdir(cwd):
        if sample_id + '.pileupgatk' in i and dataset in i:
            print("\n\t> Creating MAP file from sample: " + sample_id)
            pileup4map = open(i, 'r')
            snpcounter = sum(1 for _ in pileup4map) - 1
            pileup4map.seek(0)
            counter = 1
            piledic = {}
            for line in pileup4map:
                if counter <= snpcounter:
                    line = line.rstrip()
                    # chr21 43805965 G G G
                    chro, coord, refb, rbase, qual = line.split(' ')
                    chrom = chro.strip("chr")
                    piledic[chrom, coord] = int(counter)
                    counter += 1
            pileup4map.close()

            mapFileData = open(datasetFolder + '/' + mapFileDataset, 'r')
            mapDic = {}
            counterM = 1
            for ligna in mapFileData:
                # 22 Affx-19821577 0.737773 50950707 A G
                chroMap, ident, prob, posMap, bas1, bas2 = ligna.split()
                mapDic[chroMap, posMap] = int(counterM)
                counterM += 1

            listOfmatches = []
            for item in piledic:
                if item in mapDic:
                    listOfmatches.append(mapDic[item])
            listOfmatches.sort()

            mapWrite = open(sample_id + ".map", 'w')
            mapFileData.seek(0)
            lineCounter = 1
            for lignagain in mapFileData:
                if lineCounter in listOfmatches:
                    mapWrite.write(lignagain)
                lineCounter += 1
            mapWrite.close()
            mapFileData.close()
I have written two versions of a program that parses a log file and returns the number of strings matching a given regex. The single-threaded version returns the correct output:
Number of Orders ('ORDER'): 1108
Number of Replacements ('REPLACE'): 742
Number of Orders and Replacements: 1850
Time to process: 5.018553
The multithreaded program, however, returns erroneous values:
Number of Orders ('ORDER'): 1579
Number of Replacements ('REPLACE'): 1108
Number of Orders and Replacements: 2687
Time to process: 2.783091
The time can vary (it should be faster for the multithreaded one), but I can't figure out why the values for orders and replacements differ between the two versions.
Here is the multithreaded version:
import re
import time
import sys
import threading
import Queue

class PythonLogParser:
    queue = Queue.Queue()

    class FileParseThread(threading.Thread):
        def __init__(self, parsefcn, f, startind, endind, olist):
            threading.Thread.__init__(self)
            self.parsefcn = parsefcn
            self.startind = startind
            self.endind = endind
            self.olist = olist
            self.f = f

        def run(self):
            self.parsefcn(self.f, self.startind, self.endind, self.olist)

    def __init__(self, filename):
        assert(len(filename) != 0)
        self.filename = filename
        self.start = 0
        self.end = 0

    def open_file(self):
        f = None
        try:
            f = open(self.filename)
        except IOError as e:
            print 'Unable to open file:', e.message
        return f

    def count_orders_from(self, f, starting, ending, offset_list):
        f.seek(offset_list[starting])
        order_pattern = re.compile(r'.*(IN:)(\s)*(ORDER).*(ord_type)*')
        replace_pattern = re.compile(r'.*(IN:)(\s)*(REPLACE).*(ord_type)*')
        order_count = replace_count = 0
        for line in f:
            if order_pattern.match(line) != None:
                order_count += 1  # = order_count + 1
            if replace_pattern.match(line) != None:
                replace_count += 1  # = replace_count + 1
        # return (order_count, replace_count, order_count+replace_count)
        self.queue.put((order_count, replace_count, order_count + replace_count))

    def get_file_data(self):
        offset_list = []
        offset = 0
        num_lines = 0
        f = 0
        try:
            f = open(self.filename)
            for line in f:
                num_lines += 1
                offset_list.append(offset)
                offset += len(line)
            f.close()
        finally:
            f.close()
        return (num_lines, offset_list)

    def count_orders(self):
        self.start = time.clock()
        num_lines, offset_list = self.get_file_data()
        start_t1 = 0
        end_t1 = num_lines / 2
        start_t2 = end_t1 + 1
        f = open(self.filename)
        t1 = self.FileParseThread(self.count_orders_from, f, start_t1, end_t1, offset_list)
        self.count_orders_from(f, start_t2, num_lines, offset_list)
        t1.start()
        self.end = time.clock()
        tup1 = self.queue.get()
        tup2 = self.queue.get()
        order_count1, replace_count1, sum1 = tup1
        order_count2, replace_count2, sum2 = tup2
        print 'Number of Orders (\'ORDER\'): {0}\n'\
              'Number of Replacements (\'REPLACE\'): {1}\n'\
              'Number of Orders and Replacements: {2}\n'\
              'Time to process: {3}\n'.format(order_count1 + order_count2, \
                                              replace_count1 + replace_count2, \
                                              sum1 + sum2, \
                                              self.end - self.start)
        f.close()

def test2():
    p = PythonLogParser('../../20150708.aggregate.log')
    p.count_orders()

def main():
    test2()

main()
The idea is that since the file is large, each thread will read half the file. t1 reads the first half and the main thread reads the second. The main thread then adds together the results from both iterations and displays them.
My suspicion is that somehow the order_count and replace_count in count_orders_from are being modified between threads rather than starting at 0 for each thread, but I'm not sure since I don't see why separate calls to a method from 2 separate threads would modify the same variables.
The error was occurring because, even though in theory the threads were parsing individual halves, what was in fact happening is that one thread parsed up to the halfway point while the other parsed the full file, so items were double-counted. The error was fixed by adding a linecount variable to count_orders_from, in order to check whether the reader has reached the line it is supposed to stop at.
def count_orders_from(self, f, starting, ending, offset_list):
    f.seek(offset_list[starting])
    order_pattern = re.compile(r'.*(IN:)(\s)*(ORDER).*(ord_type)*')
    replace_pattern = re.compile(r'.*(IN:)(\s)*(REPLACE).*(ord_type)*')
    order_count = replace_count = linecount = 0
    for line in f:
        if order_pattern.match(line) != None:
            order_count += 1  # = order_count + 1
        if replace_pattern.match(line) != None:
            replace_count += 1  # = replace_count + 1
        if linecount == ending:
            break
        linecount += 1
    self.queue.put((order_count, replace_count, order_count + replace_count))
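One further tweak worth considering (my suggestion, not part of the fix above): let each call open its own file handle, so the two readers never share a file position and the shared f no longer has to be passed into FileParseThread. A sketch of the same counting logic with that change; the call sites in count_orders and FileParseThread would drop the f argument:

def count_orders_from(self, starting, ending, offset_list):
    order_pattern = re.compile(r'.*(IN:)(\s)*(ORDER).*(ord_type)*')
    replace_pattern = re.compile(r'.*(IN:)(\s)*(REPLACE).*(ord_type)*')
    order_count = replace_count = 0
    # A private handle per call: no shared file position between the threads.
    with open(self.filename) as f:
        f.seek(offset_list[starting])
        for linecount, line in enumerate(f, starting):
            if linecount > ending:
                break
            if order_pattern.match(line) is not None:
                order_count += 1
            if replace_pattern.match(line) is not None:
                replace_count += 1
    self.queue.put((order_count, replace_count, order_count + replace_count))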