Python: pickle error with multiprocessing

As suggested below, I have changed my code to use Pool instead. I've also simplified my functions and included all my code below. However, now I'm getting a different error: NameError: global name 'split_files' is not defined
What I want to do is pass the actual file chunk into the parse_csv_chunk function but I'm not sure how to do that.
import csv
from itertools import islice
from collections import deque
import time
import math
import multiprocessing as mp
import os
import sys
import tempfile
csv_filename = 'test.csv'
def parse_csv_chunk(files_index):
    global split_files
    print files_index
    print len(split_files)
    return 1

def split(infilename, num_chunks):
    READ_BUFFER = 2**13
    in_file_size = os.path.getsize(infilename)
    print 'Original file size:', in_file_size
    chunk_size = in_file_size // num_chunks
    print 'Target chunk size:', chunk_size
    print 'Target number of chunks:', num_chunks
    files = []
    with open(infilename, 'rb', READ_BUFFER) as infile:
        infile.next()
        infile.next()
        infile.next()
        for _ in xrange(num_chunks):
            temp_file = tempfile.TemporaryFile()
            while temp_file.tell() < chunk_size:
                try:
                    #write 3 lines before checking if still < chunk_size
                    #this is done to improve performance
                    #the result is that each chunk will not be exactly the same size
                    temp_file.write(infile.next())
                    temp_file.write(infile.next())
                    temp_file.write(infile.next())
                #end of original file
                except StopIteration:
                    break
            #rewind each chunk
            temp_file.seek(0)
            files.append(temp_file)
    return files

if __name__ == '__main__':
    start = time.time()
    num_chunks = mp.cpu_count()
    split_files = split(csv_filename, num_chunks)
    print 'Number of files after splitting: ', len(split_files)
    pool = mp.Pool(processes=num_chunks)
    results = [pool.apply_async(parse_csv_chunk, args=(x,)) for x in range(num_chunks)]
    output = [p.get() for p in results]
    print output
I'm trying to split a csv file into parts and have each part processed by one of my CPU cores. This is what I have so far:
import csv
from itertools import islice
from collections import deque
import time
import math
import multiprocessing as mp
import os
import sys
import tempfile

def parse_csv_chunk(infile):
    #code here
    return

def split(infilename, num_chunks):
    #code here
    return files

def get_header_indices(infilename):
    #code here
    return

if __name__ == '__main__':
    start = time.time() #start measuring performance
    num_chunks = mp.cpu_count() #record number of CPU cores
    files = split(csv_filename, num_chunks) #split csv file into a number of chunks equal to the CPU core count and store as a list
    print 'number of files after splitting: ', len(files)
    get_header_indices(csv_filename) #get headers of csv file
    print headers_list
    processes = [mp.Process(target=parse_csv_chunk,
                            args=ifile) for ifile in enumerate(files)] #create a process for each file chunk
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    end = time.time()
    print "Execution time: %.2f" % (end - start) #display performance
There seems to be a problem at the line 'p.start()'. I see a lot of output on the console, which eventually indicates an error:
pickle.PicklingError: Can't pickle <built-in method write of file object at 0x0222EAC8>: it's not found as __main__.write
I did not include the code for the functions I called as they are quite long, but I can if needed. I'm wondering if I'm using multiprocessing correctly.

First off, is there a reason you are not using a Pool and the imap method of the Pool?
Second, it's very hard to tell any specifics without seeing your code, especially since the error points to parts of the code that are not provided.
However, it looks like you are using multiprocessing correctly from what you have provided -- and it's a serialization problem.
Note that if you use dill, you can serialize the write method.
>>> import dill
>>>
>>> f = open('foo.bar', 'w')
>>> dill.dumps(f.write)
'\x80\x02cdill.dill\n_get_attr\nq\x00cdill.dill\n_create_filehandle\nq\x01(U\x07foo.barq\x02U\x01wq\x03K\x00\x89c__builtin__\nopen\nq\x04\x89K\x00U\x00q\x05tq\x06Rq\x07U\x05writeq\x08\x86q\tRq\n.'
Most versions of multiprocessing use cPickle (or a version of pickle that is built in C), and while dill can inject its types into the Python version of pickle, it can't do so in the C equivalent.
There is a dill-activated fork of multiprocessing, so you might try that; if it's purely a pickling problem, you should get past it with multiprocess.
See: https://github.com/uqfoundation/multiprocess.
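If you go that route, multiprocess is intended as a drop-in replacement with the same API, so (assuming it is installed) the switch is usually just the import, e.g.:
import multiprocess as mp  # drop-in for: import multiprocessing as mp
pool = mp.Pool(processes=4)  # the rest of the Pool/apply_async code stays the same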
EDIT (after OP update): The global declaration in your helper function isn't going to play well with pickle. Why not just use a payload function (like split) that reads a portion of the file and returns the contents, or writes to the target file? Don't return a list of files. I know they are TemporaryFile objects, but unless you use dill (and even then it's touchy) you can't pickle a file. If you absolutely have to, return the file name, not the file, and don't use a TemporaryFile. pickle will choke trying to pass the file. So, you should refactor your code, or as I suggested earlier, try to see if you can bypass serialization issues by using multiprocess (which uses dill).
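A rough sketch of that refactor (the names split_to_paths and chunk_path are made up here for illustration, not a drop-in fix): have the splitter write named temporary files and return their paths, so only plain strings cross the process boundary and each worker opens its own chunk:
def split_to_paths(infilename, num_chunks):
    #same chunking logic as split(), but write to
    #tempfile.NamedTemporaryFile(delete=False) and collect tmp.name
    paths = []
    #...
    return paths

def parse_csv_chunk(chunk_path):
    with open(chunk_path, 'rb') as chunk:
        for line in chunk:
            pass #parse each line of the chunk here
    return 1

if __name__ == '__main__':
    chunk_paths = split_to_paths(csv_filename, mp.cpu_count())
    pool = mp.Pool(processes=len(chunk_paths))
    results = [pool.apply_async(parse_csv_chunk, args=(p,)) for p in chunk_paths]
    print [r.get() for r in results]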

Related

Python Multithreading issue file manipulation

I've been trying to get the hang of multithreading in Python. However, whenever I attempt to make it do something that might be considered useful, I run into issues.
In this case, I have 300 PDF files. For simplicity, we'll assume that each PDF only has a unique number on it (say 1 to 300). I'm trying to make Python open the file, grab the text from it, and then use that text to rename the file accordingly.
The non-multithreaded version I made works great. But it's a bit slow, and I thought I'd see if I could speed it up. However, this version finds the very first file, renames it correctly, and then throws an error saying:
FileNotFoundError: [Errno 2] No such file or directory: './pdfPages/1006941.pdf'
This is basically telling me that it can't find a file by that name. The reason it can't is that the file has already been renamed. And in my head that tells me I've probably messed something up with this loop and/or with multithreading in general.
Any help would be appreciated.
Source:
import PyPDF2
import os
from os import listdir
from os.path import isfile, join
from PyPDF2 import PdfFileWriter, PdfFileReader
from multiprocessing.dummy import Pool as ThreadPool

# Global
i = 0

def readPDF(allFiles):
    print(allFiles)
    global i
    while i < l:
        i += 1
        pdf_file = open(path+allFiles, 'rb')
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        number_of_pages = read_pdf.getNumPages()
        page = read_pdf.getPage(0)
        page_content = page.extractText()
        pdf_file.close()
        Text = str(page_content.encode('utf-8')).strip("b").strip("'")
        os.rename(path+allFiles, path+pre+"-"+Text+".PDF")

pre = "77"
path = "./pdfPages/"
included_extensions = ['pdf','PDF']
allFiles = [f for f in listdir(path) if any(f.endswith(ext) for ext in included_extensions)] # Get all files in current directory
l = len(allFiles)

pool = ThreadPool(4)
doThings = pool.map(readPDF, allFiles)
pool.close()
pool.join()
Yes, you have, in fact, messed up with the loop as you say. The loop should not be there at all. This is implicitly handled by the pool.map(...) that ensures that each function call will receive a unique file name from your list to work with. You should not do any other looping.
I have updated your code below by removing the loop and making some other changes (minor, but still improvements, I think):
# Removed a number of imports
import PyPDF2
import os
from multiprocessing.dummy import Pool as ThreadPool

# Removed the no-longer-needed global variable

def readPDF(allFiles):
    # The while loop is not needed, as pool.map will distribute the different
    # files to different workers anyway
    print(allFiles)
    pdf_file = open(path+allFiles, 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    pdf_file.close()
    Text = str(page_content.encode('utf-8')).strip("b").strip("'")
    os.rename(path+allFiles, path+pre+"-"+Text+".PDF")

pre = "77"
path = "./pdfPages/"
included_extensions = ('pdf', 'PDF')  # Tuple instead of list; a tuple allows
                                      # for a simpler f.endswith check
allFiles = [f for f in os.listdir(path) if f.endswith(included_extensions)]

pool = ThreadPool(4)
doThings = pool.map(readPDF, allFiles)
# doThings will be a list of "None"s since readPDF returns nothing
pool.close()
pool.join()
Thus, the global variable and the counter are not needed, since all of that is handled implicitly. But even with these changes, it is not at all certain that this will speed up your execution much. Most likely, the bulk of your program's time is spent waiting for the disk. In that case, even if you have multiple threads, they will still have to wait for the main resource, i.e., the hard drive. But to know for certain, you have to test.
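A minimal way to test is to time each version separately (on a fresh copy of the files, since readPDF renames them), for example around the pool in the threaded version:
import time

start = time.time()
pool = ThreadPool(4)
pool.map(readPDF, allFiles)
pool.close()
pool.join()
print("Renamed %d files in %.2f s" % (len(allFiles), time.time() - start))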

tempfile is not accessible when using subprocess.Popen

When I run the following script, I get the error "Command line argument error: Argument "query". File is not accessible". I'm using Python 3.4.2.
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import subprocess
import tempfile
import sys

def main():
    # read name file and put all identifications into a list
    infile_I = open('OTU_name.txt','r')
    name = infile_I.read().split('>')
    infile_I.close()
    # extract sequence segments to a temporary file one at a time
    for i in name:
        i = i.replace('\n','')
        for j in SeqIO.parse("GemSIM_OTU_ids.fa","fasta"):
            if str(i) == str(j.id):
                f = tempfile.NamedTemporaryFile()
                record = j.seq
                f.write(bytes(str(record),'UTF-8'))
                f.seek(0)
                f = f.read().decode()
                Result = subprocess.Popen(['blastn','-remote','-db','chromosome','-query',f,'-out',str(i)],stdout=subprocess.PIPE)
                output = Result.communicate()[0]

if __name__ == '__main__': main()
f = tempfile.NamedTemporaryFile() returns a file-like object which you're trying to provide as a command line argument. Instead, you want the actual filename, which is available via its .name attribute. (I'm also somewhat confused about why you're creating a tempfile, writing to it, seeking back to position 0, and then replacing your tempfile f object with the contents of the file; I suspect you don't want that replacement.) Use f.name for your query:
Result = subprocess.Popen(['blastn','-remote','-db','chromosome','-query',f.name,'-out',str(i)],stdout=subprocess.PIPE)
Also, there are some convenient wrapper functions around subprocess.Popen, such as subprocess.check_output, which are more explicit about your intent and could be used here instead.
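For example, a sketch using check_output with the same arguments as above; it waits for the command to finish, returns its stdout, and raises CalledProcessError on a non-zero exit status:
output = subprocess.check_output(['blastn', '-remote', '-db', 'chromosome',
                                  '-query', f.name, '-out', str(i)])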

python multiprocessing shared Counter , pickling error

I am using multiprocessing to process some very large files.
I can count occurrences of a particular string using a collections.Counter that is shared between the processes via a multiprocessing.BaseManager subclass.
Although I can share the Counter and seemingly pickle it, it does not seem to be pickled properly. I can, however, copy the dictionary to a new dictionary that I can pickle.
I am trying to understand how to avoid the "copy" of the shared counter before pickling it.
Here is my (pseudo)code:
from multiprocessing.managers import BaseManager
from collections import Counter

class MyManager(BaseManager):
    pass

MyManager.register('Counter', Counter)

def main(glob_pattern):
    # function that processes files
    def worker_process(files_split_to_allow_naive_parallelization, mycounterdict):
        # code that loops through files
        for line in file:
            # code that processes line
            my_line_items = line.split()
            index_for_read = (my_line_items[0], my_line_items[6])
            mycounterdict.update((index_for_read,))

    manager = MyManager()
    manager.start()
    mycounterdict = manager.Counter()

    # code to get glob files, split them with unix shell split and then chunk them
    for i in range(NUM_PROCS):
        p = multiprocessing.Process(target=worker_process, args=(all_index_file_tuples[chunksize * i:chunksize * (i + 1)], mycounterdict))
        procs.append(p)
        p.start()

    # Now we "join" the processes
    for p in procs:
        p.join()

    # This is the part I have trouble with
    # This yields a pickled file that fails with an error
    pickle.dump(mycounterdict, open("Combined_count_gives_error.p", "wb"))

    # This however works
    # How can I avoid doing it this way?
    mycopydict = Counter()
    mycopydict.update(mycounterdict.items())
    pickle.dump(mycopydict, open("Combined_count_that_works.p", "wb"))
When I try to load the "pickled" file that gives an error (it is always a smaller, fixed size), I get an error that does not make sense.
How do I pickle the shared dict without going through the creation of a fresh dict as in the pseudocode above?
>>> p = pickle.load(open("Combined_count_gives_error.p"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1378, in load
    return Unpickler(file).load()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 880, in load_eof
    raise EOFError
EOFError
There are several problems with your code. First of all, you are not guaranteed to close the file if you leave it dangling. Secondly, mycounterdict is not an actual Counter but a proxy over it; pickle it and you will run into many problems, as it is unpicklable outside this process. However, you do not need to copy with update either: .copy() makes a new Counter copy of it.
Thus you should use
with open("out.p", "wb") as f:
    pickle.dump(mycounterdict.copy(), f)
As for whether this is a good pattern, the answer is no. Instead of using a shared counter, you should count separately in each process, which gives much simpler code:
from multiprocessing import Pool
from collections import Counter
import pickle

def calculate(file):
    counts = Counter()
    ...
    return counts

pool = Pool(processes=NPROCS)

counts = Counter()
for result in pool.map(calculate, files):
    counts += result

with open("out.p", "wb") as f:
    pickle.dump(counts, f)

How to define an input for python multiprocessing function to take all the files in a directory?

This question may sound basic, because I do not know that much about multiprocessing; I am just learning.
I have python code which processes a bunch of files in a directory.
with Pool(processes=cores) as pp:
    pp.map(function, list)
Here is my code:
path = '/data/personal'
print("Running with PID: %d" % getpid())

psl = PublicSuffixList()
d = defaultdict(set)
start = time()

#
files_list = glob(path)
for filename in files:
    print(filename)
    f = open(filename, 'r')
    for n, line in enumerate(f):
        line = line[:-1]
        ip, reversed_domain_1 = line.split('|')
        reversed_domain_2 = reversed_domain_1.split('.')
        reversed_domain_3 = list(reversed(reversed_domain_2))
        domain = ('.'.join(reversed_domain_3))
        domain = psl.get_public_suffix(domain)
        d[ip].add(domain)

###
for ip, domains in d.iteritems():
    for domain in domains:
        print(ip, domain)
How can I convert it to be done in a multiprocessing pool?
You can process each file in a separate process like this:
from os import getpid
from collections import defaultdict
from glob import glob
from multiprocessing import Pool
from time import time
from functools import partial

path = '/data/personal'
print("Running with PID: %d" % getpid())

def process_file(psl, filename):
    print(filename)
    f = open(filename, 'r')
    for n, line in enumerate(f):
        line = line[:-1]
        ip, reversed_domain_1 = line.split('|')
        reversed_domain_2 = reversed_domain_1.split('.')
        reversed_domain_3 = list(reversed(reversed_domain_2))
        domain = ('.'.join(reversed_domain_3))
        domain = psl.get_public_suffix(domain)
    return ip, domain

if __name__ == "__main__":
    psl = PublicSuffixList()
    d = defaultdict(set)
    start = time()
    files_list = glob(path)
    pp = Pool(processes=cores)
    func = partial(process_file, psl)
    results = pp.imap_unordered(func, files_list)
    for ip, domain in results:
        d[ip].add(domain)
    pp.close()
    pp.join()
    for ip, domains in d.iteritems():
        for domain in domains:
            print(ip, domain)
Note that the defaultdict is populated in the parent process, because you can't actually share the same defaultdict between multiple processes without using a multiprocessing.Manager. You could do that here if you wanted, but I don't think it's necessary. Instead, as soon as any child has a result available, we add it to the defaultdict in the parent. Using imap_unordered instead of map enables us to receive results on demand, rather than having to wait for all of them to be ready. The only other notable thing is the use of partial to enable passing psl to all the child processes, in addition to an item from files_list, with imap_unordered.
One important note here: using multiprocessing for this kind of operation may not actually improve performance. A lot of the work you're doing here is reading from disk, which can't be sped up via multiple processes; your hard drive can only do one read operation at a time. Getting read requests for different files from a bunch of processes at once can actually slow things down compared to doing them sequentially, because the disk potentially has to constantly switch to different areas of the physical platter to read a new line from each file. Now, it's possible that the CPU-bound work you're doing with each line is expensive enough to dominate that I/O time, in which case you will see a speed boost.

How do I Convert Python Dict to JSON in a Multi-Threaded Fashion

I have a number of large files with many thousands of lines in python dict format. I'm converting them with json.dumps to json strings.
import json
import ast

mydict = open('input', 'r')
output = open('output.json', "a")

for line in mydict:
    line = ast.literal_eval(line)
    line = json.dumps(line)
    output.write(line)
    output.write("\n")
This works flawlessly, however, it does so in a single threaded fashion. Is there an easy way to utilize the remaining cores in my system to speed things up?
Edit:
Based on the suggestions I've started here with the multiprocessing library:
import os
import json
import ast
from multiprocessing import Process, Pool

mydict = open('twosec.in', 'r')

def info(title):
    print title
    print 'module name:', __name__
    print 'parent process: ', os.getppid()
    print 'process id:', os.getpid()

def converter(name):
    info('converter function')
    output = open('twosec.out', "a")
    for line in mydict:
        line = ast.literal_eval(line)
        line = json.dumps(line)
        output.write(line)
        output.write("\n")

if __name__ == '__main__':
    info('main line')
    p = Process(target=converter, args=(mydict))
    p.start()
    p.join()
I don't quite understand where Pool comes into play, can you explain more?
I don't know of an easy way for you to get a speedup from multithreading, but if any sort of speedup is really what you want then I would recommend trying the ujson package instead of json. It has produced very significant speedups for me, basically for free. Use it the same way you would use the regular json package.
http://pypi.python.org/pypi/ujson/
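A minimal sketch, assuming ujson is installed (pip install ujson): it is intended as a drop-in replacement, so only the import in your loop needs to change.
import ujson as json  # same dumps/loads API as the stdlib json module

line = json.dumps({'example': 1})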
Wrap the code above in a function that takes as its single argument a filename and that writes the json to an output file.
Then create a Pool object from the multiprocessing module, and use Pool.map() to apply your function in parallel to the list of all files. This will automagically use all cores on your CPU, and because it uses multiple processes instead of threads, you won't run into the global interpreter lock.
Edit: Change the main portion of your program like so:
if __name__ == '__main__':
    files = ['first.in', 'second.in', 'third.in'] # et cetera
    info('main line')
    p = Pool()
    p.map(convertor, files)
    p.close()
Of course you should also change convertor() to derive the output name from the input name!
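A minimal sketch of what the reworked function could look like (the derived output name '<input>.out' is just an assumption for illustration):
import json
import ast

def convertor(infilename):
    outfilename = infilename + '.out'  # derive output name from input name
    with open(infilename, 'r') as infile, open(outfilename, 'w') as output:
        for line in infile:
            line = ast.literal_eval(line)  # parse the Python-dict line
            output.write(json.dumps(line) + "\n")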
Below is a complete example of a program that converts DICOM files into PNG format, using the ImageMagick convert program:
"Convert DICOM files to PNG format, remove blank areas."
import os
import sys # voor argv.
import subprocess
from multiprocessing import Pool, Lock
def checkfor(args):
try:
subprocess.check_output(args, stderr=subprocess.STDOUT)
except CalledProcessError:
print "Required program '{}' not found! exiting.".format(progname)
sys.exit(1)
def processfile(fname):
size = '1574x2048'
args = ['convert', fname, '-units', 'PixelsPerInch', '-density', '300',
'-crop', size+'+232+0', '-page', size+'+0+0', fname+'.png']
rv = subprocess.call(args)
globallock.acquire()
if rv != 0:
print "Error '{}' when processing file '{}'.".format(rv, fname)
else:
print "File '{}' processed.".format(fname)
globallock.release()
## This is the main program ##
if __name__ == '__main__':
if len(sys.argv) == 1:
path, binary = os.path.split(sys.argv[0])
print "Usage: {} [file ...]".format(binary)
sys.exit(0)
checkfor('convert')
globallock = Lock()
p = Pool()
p.map(processfile, sys.argv[1:])
p.close()
