I am using multiprocessing to process some very large files.
I can count occurrences of a particular string using a collections.Counter that is shared between the processes via a multiprocessing.BaseManager subclass.
Although I can share the Counter and it appears to pickle, it does not get pickled properly. However, I can copy the dictionary into a new dictionary, which then pickles fine.
I am trying to understand how to avoid making a "copy" of the shared counter before pickling it.
Here is my code (pseudocode):
import multiprocessing
import pickle
from multiprocessing.managers import BaseManager
from collections import Counter

class MyManager(BaseManager):
    pass

MyManager.register('Counter', Counter)

def worker_process(files_split_to_allow_naive_parallelization, mycounterdict):
    # code that loops through files
    for file in files_split_to_allow_naive_parallelization:
        for line in file:
            # code that processes line
            my_line_items = line.split()
            index_for_read = (my_line_items[0], my_line_items[6])
            mycounterdict.update((index_for_read,))

def main(glob_pattern):
    # function that processes files
    manager = MyManager()
    manager.start()
    mycounterdict = manager.Counter()

    # code to get glob files, split them with unix shell split and then chunk them
    # (NUM_PROCS, chunksize and all_index_file_tuples come from that elided code)
    procs = []
    for i in range(NUM_PROCS):
        p = multiprocessing.Process(target=worker_process,
                                    args=(all_index_file_tuples[chunksize * i:chunksize * (i + 1)], mycounterdict))
        procs.append(p)
        p.start()

    # Now we "join" the processes
    for p in procs:
        p.join()

    # This is the part I have trouble with
    # This yields a pickled file that fails with an error
    pickle.dump(mycounterdict, open("Combined_count_gives_error.p", "wb"))

    # This however works
    # How can I avoid doing it this way?
    mycopydict = Counter()
    mycopydict.update(mycounterdict.items())
    pickle.dump(mycopydict, open("Combined_count_that_works.p", "wb"))
When I try to load the pickled file that gives the error (the file is always a smaller, fixed size), I get an error that does not make sense.
How do I pickle the shared dict without going through the creation of a fresh dict, as in the pseudocode above?
>>> p = pickle.load(open("Combined_count_gives_error.p"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1378, in load
return Unpickler(file).load()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 880, in load_eof
raise EOFError
EOFError
There are several problems with your code. First of all, you are not guaranteed to close the file if you leave it dangling. Secondly, mycounterdict is not an actual Counter but a proxy over it; if you pickle it you will run into many problems, because it is unpicklable outside this process. However, you do not need to copy it with update either: .copy() makes a new Counter copy of it.
Thus you should use
with open("out.p", "wb") as f:
pickle.dump(mycounterdict.copy(), f)
As for whether this is a good pattern, the answer is no. Instead of using a shared counter, you should count separately in each process, which results in much simpler code:
from multiprocessing import Pool
from collections import Counter
import pickle

def calculate(file):
    counts = Counter()
    ...
    return counts

pool = Pool(processes=NPROCS)
counts = Counter()
for result in pool.map(calculate, files):
    counts += result

with open("out.p", "wb") as f:
    pickle.dump(counts, f)
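For context, here is a minimal sketch of how that pattern might be driven end to end; the glob pattern, the process count and the per-line counting inside calculate are my assumptions for illustration, not part of the original answer:

import pickle
from glob import glob
from collections import Counter
from multiprocessing import Pool

def calculate(filename):
    # count (first, seventh) whitespace-separated fields per line, as in the question
    counts = Counter()
    with open(filename) as fh:
        for line in fh:
            items = line.split()
            counts[(items[0], items[6])] += 1
    return counts

if __name__ == '__main__':
    files = glob('*.txt')   # hypothetical input files
    NPROCS = 4              # hypothetical process count
    with Pool(processes=NPROCS) as pool:
        totals = sum(pool.map(calculate, files), Counter())
    with open("out.p", "wb") as f:
        pickle.dump(totals, f)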
Related
I need to read a file in a thread and store the data from the file in a variable defined outside the thread, at module level in the script. I am passing this variable to the thread as an argument. If I am not reading a file, and just accessing the variable, everything works fine. But if I am reading a file, then the variable (a dictionary here) comes up empty. I am not sure why that is the case. Please let me know why this might be happening and what I can do about it. Thanks.
(I am using python 3.9.6 on Windows 10)
import threading
import time
import os

def thread_function(filename: str, line_dict: dict, lock: threading.Lock):
    lock.acquire(blocking=True, timeout=5.8)
    with open(filename, 'r') as f:
        for id, line in enumerate(f):
            line_dict['line{}'.format(id)] = line
    lock.release()
    print('[from thread] Keys in Dictionary: \n{}'.format(line_dict.keys()))

line_dict = {}
lock = threading.Lock()

if __name__ == '__main__':
    t = threading.Thread(target=thread_function, args=('./lines.txt', line_dict, lock))
    t.start()
    print('[from main] Keys in Dictionary: \n{}'.format(line_dict.keys()))
    t.join()
I have a tarfile containing bz2-compressed files. I want to apply the function clean_file to each of the bz2 files, and collate the results. In series, this is easy with a loop:
import pandas as pd
import json
import os
import bz2
import itertools
import datetime
import tarfile
from multiprocessing import Pool

def clean_file(member):
    if '.bz2' in str(member):
        f = tr.extractfile(member)
        with bz2.open(f, "rt") as bzinput:
            dicts = []
            for i, line in enumerate(bzinput):
                line = line.replace('"name"}', '"name":" "}')
                dat = json.loads(line)
                dicts.append(dat)
        bzinput.close()
        f.close()
        del f, bzinput
        processed = dicts[0]
        return processed
    else:
        pass

# Open tar file and get contents (members)
tr = tarfile.open('data.tar')
members = tr.getmembers()
num_files = len(members)

# Apply the clean_file function in series
i = 0
processed_files = []
for m in members:
    processed_files.append(clean_file(m))
    i += 1
    print('done ' + str(i) + '/' + str(num_files))
However, I need to be able to do this in parallel. The method I'm trying uses Pool like so:
# Apply the clean_file function in parallel
if __name__ == '__main__':
    with Pool(2) as p:
        processed_files = list(p.map(clean_file, members))
But this returns an OSError:
Traceback (most recent call last):
File "/Users/johnfoley/opt/anaconda3/envs/racing_env/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "parse_data.py", line 19, in clean_file
for i, line in enumerate(bzinput):
File "/Users/johnfoley/opt/anaconda3/envs/racing_env/lib/python3.6/bz2.py", line 195, in read1
return self._buffer.read1(size)
File "/Users/johnfoley/opt/anaconda3/envs/racing_env/lib/python3.6/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/Users/johnfoley/opt/anaconda3/envs/racing_env/lib/python3.6/_compression.py", line 103, in read
data = self._decompressor.decompress(rawblock, size)
OSError: Invalid data stream
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "parse_data.py", line 53, in <module>
processed_files = list(tqdm.tqdm(p.imap(clean_file, members), total=num_files))
File "/Users/johnfoley/opt/anaconda3/envs/racing_env/lib/python3.6/site-packages/tqdm/std.py", line 1167, in __iter__
for obj in iterable:
File "/Users/johnfoley/opt/anaconda3/envs/racing_env/lib/python3.6/multiprocessing/pool.py", line 735, in next
raise value
OSError: Invalid data stream
So I guess this way isn't properly accessing the files from within data.tar or something. How can I apply the function in parallel?
I'm guessing this will work with any tar archive containing bz2 files but here's my data to reproduce the error:
https://github.com/johnf1004/reproduce_tar_error
You didn't specify what platform you are running on, but I suspect that it is Windows because you have ...
if __name__ == '__main__':
    main()
... which would be required for code that creates processes on platforms that use the OS spawn method for creating new processes. But that also means that when a new process is created (e.g. all the processes in the process pool you are creating), each process begins by re-executing the source program from the very top. This means that the following code is being executed by each pool process:
tr = tarfile.open('data.tar')
members = tr.getmembers()
num_files = len(members)
However, I don't see why this would in itself cause an error, but I can't be sure. The problem may be, however, that this code executes after your worker function clean_file is called, and so tr has not been set. If this code preceded clean_file it might work, but this is just a guess. Certainly, extracting the members with members = tr.getmembers() in each pool process is wasteful. Each process needs to open the tar file, ideally just once.
But what is clear is that the stacktrace you published does not match your code. You show:
Traceback (most recent call last):
File "parse_data.py", line 53, in <module>
processed_files = list(tqdm.tqdm(p.imap(clean_file, members), total=num_files))
Yet your code does not have any reference to tqdm, nor does it use the method imap. It becomes more difficult to analyze what your problem actually is when the code you post doesn't quite match the code that produces the exception.
On the off-chance you are running on a Mac, which might be using fork to create new processes, this can be problematic when the main process has created multiple threads (which you don't necessarily see; they may, for example, be created by the tarfile module) and you then create a new process. I have therefore specified code to ensure that spawn is used to create new processes. Anyway, the following code should work; it also introduces a few optimizations. If it doesn't, please post a new stacktrace.
import pandas as pd
import json
import os
import bz2
import itertools
import datetime
import tarfile
from multiprocessing import get_context

def open_tar():
    # open once for each process in the pool
    global tr
    tr = tarfile.open('data.tar')

def clean_file(member):
    f = tr.extractfile(member)
    with bz2.open(f, "rt") as bzinput:
        for line in bzinput:
            line = line.replace('"name"}', '"name":" "}')
            dat = json.loads(line)
            # since you are returning just the first occurrence:
            return dat

def main():
    with tarfile.open('data.tar') as tr:
        members = tr.getmembers()
    # just pick members where '.bz2' is in member:
    filtered_members = filter(lambda member: '.bz2' in str(member), members)
    ctx = get_context('spawn')
    # open tar file just once for each process in the pool:
    with ctx.Pool(initializer=open_tar) as pool:
        processed_files = pool.map(clean_file, filtered_members)
        print(processed_files)

# required for when processes are created using spawn:
if __name__ == '__main__':
    main()
It seems some race condition was happening.
Opening the tar file separately in every child process solves the issue:
import json
import bz2
import tarfile
import logging
from multiprocessing import Pool

def clean_file(member):
    if '.bz2' not in str(member):
        return
    try:
        with tarfile.open('data.tar') as tr:
            with tr.extractfile(member) as bz2_file:
                with bz2.open(bz2_file, "rt") as bzinput:
                    dicts = []
                    for i, line in enumerate(bzinput):
                        line = line.replace('"name"}', '"name":" "}')
                        dat = json.loads(line)
                        dicts.append(dat)
                    return dicts[0]
    except Exception:
        logging.exception(f"Error while processing {member}")

def process_serial():
    tr = tarfile.open('data.tar')
    members = tr.getmembers()
    processed_files = []
    for i, member in enumerate(members):
        processed_files.append(clean_file(member))
        print(f'done {i}/{len(members)}')

def process_parallel():
    tr = tarfile.open('data.tar')
    members = tr.getmembers()
    with Pool() as pool:
        processed_files = pool.map(clean_file, members)
        print(processed_files)

def main():
    process_parallel()

if __name__ == '__main__':
    main()
EDIT:
Note that another way to solve this problem is to just use the spawn start method:
multiprocessing.set_start_method('spawn')
By doing this, we prevent the child processes from inheriting the parent's open file handles; each child starts with a fresh interpreter state instead.
Under the default "fork" start method, the file handles of parent and child share the same underlying file descriptors and offsets, so concurrent reads can interfere with each other.
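For reference, a minimal sketch of where that call would go; the placement inside the __main__ guard is my assumption, and set_start_method should be called only once, before any pools or processes are created:

import multiprocessing

def main():
    # build the Pool and map clean_file over the tar members, as in the code above
    ...

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')  # must run before any Pool/Process is created
    main()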
As suggested below, I have changed my code to use Pool instead. I've also simplified my functions and included all my code below. However, now I'm getting a different error: NameError: global name 'split_files' is not defined
What I want to do is pass the actual file chunk into the parse_csv_chunk function but I'm not sure how to do that.
import csv
from itertools import islice
from collections import deque
import time
import math
import multiprocessing as mp
import os
import sys
import tempfile

csv_filename = 'test.csv'

def parse_csv_chunk(files_index):
    global split_files
    print files_index
    print len(split_files)
    return 1

def split(infilename, num_chunks):
    READ_BUFFER = 2**13
    in_file_size = os.path.getsize(infilename)
    print 'Original file size:', in_file_size
    chunk_size = in_file_size // num_chunks
    print 'Target chunk size:', chunk_size
    print 'Target number of chunks:', num_chunks
    files = []
    with open(infilename, 'rb', READ_BUFFER) as infile:
        infile.next()
        infile.next()
        infile.next()
        for _ in xrange(num_chunks):
            temp_file = tempfile.TemporaryFile()
            while temp_file.tell() < chunk_size:
                try:
                    # write 3 lines before checking if still < chunk_size
                    # this is done to improve performance
                    # the result is that each chunk will not be exactly the same size
                    temp_file.write(infile.next())
                    temp_file.write(infile.next())
                    temp_file.write(infile.next())
                # end of original file
                except StopIteration:
                    break
            # rewind each chunk
            temp_file.seek(0)
            files.append(temp_file)
    return files

if __name__ == '__main__':
    start = time.time()
    num_chunks = mp.cpu_count()
    split_files = split(csv_filename, num_chunks)
    print 'Number of files after splitting: ', len(split_files)
    pool = mp.Pool(processes=num_chunks)
    results = [pool.apply_async(parse_csv_chunk, args=(x,)) for x in range(num_chunks)]
    output = [p.get() for p in results]
    print output
I'm trying to split up a csv file into parts and have them processed by each of my CPU's cores. This is what I have so far:
import csv
from itertools import islice
from collections import deque
import time
import math
import multiprocessing as mp
import os
import sys
import tempfile

def parse_csv_chunk(infile):
    #code here
    return

def split(infilename, num_chunks):
    #code here
    return files

def get_header_indices(infilename):
    #code here
    return

if __name__ == '__main__':
    start = time.time()  # start measuring performance
    num_chunks = mp.cpu_count()  # record number of CPU cores
    files = split(csv_filename, num_chunks)  # split csv file into as many chunks as there are CPU cores and store as list
    print 'number of files after splitting: ', len(files)
    get_header_indices(csv_filename)  # get headers of csv file
    print headers_list
    processes = [mp.Process(target=parse_csv_chunk,
                            args=ifile) for ifile in enumerate(files)]  # create a list of processes for each file chunk
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    end = time.time()
    print "Execution time: %.2f" % (end - start)  # display performance
There seems to be a problem at the line 'p.start()'. I see a lot of output on the console, which eventually indicates an error:
pickle.PicklingError: Can't pickle <built-in method write of file object at 0x02
22EAC8>: it's not found as __main__.write
I did not include the code for the functions I called as they are quite long, but I can if needed. I'm wondering if I'm using multiprocessing correctly.
First off, is there a reason you are not using a Pool and the imap method of the Pool?
Second, it's very hard to tell any specifics without seeing your code, especially since the error points to parts of the code that are not provided.
However, it looks like you are using multiprocessing correctly from what you have provided -- and it's a serialization problem.
Note that if you use dill, you can serialize the write method.
>>> import dill
>>>
>>> f = open('foo.bar', 'w')
>>> dill.dumps(f.write)
'\x80\x02cdill.dill\n_get_attr\nq\x00cdill.dill\n_create_filehandle\nq\x01(U\x07foo.barq\x02U\x01wq\x03K\x00\x89c__builtin__\nopen\nq\x04\x89K\x00U\x00q\x05tq\x06Rq\x07U\x05writeq\x08\x86q\tRq\n.'
Most versions of multiprocessing use cPickle (or a version of pickle that is built in C), and while dill can inject its types into the Python version of pickle, it can't do so in the C equivalent.
There is a dill-activated fork of multiprocessing -- so you might try that, as if it's purely a pickling problem, then you should get past it with multiprocess.
See: https://github.com/uqfoundation/multiprocess.
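For illustration, a minimal sketch of swapping in the multiprocess package (assuming it has been installed with pip install multiprocess; the worker below is a placeholder, not your parse_csv_chunk):

from multiprocess import Pool  # dill-based drop-in replacement for multiprocessing

def work(chunk_text):
    # placeholder payload: count the lines in a chunk of text
    return chunk_text.count('\n')

if __name__ == '__main__':
    chunks = ['a\nb\n', 'c\n']  # hypothetical list of text chunks
    with Pool(processes=2) as pool:
        print(pool.map(work, chunks))  # -> [2, 1]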
EDIT (after OP update): The global declaration in your helper function isn't going to play well with pickle. Why not just use a payload function (like split) that reads a portion of the file and returns the contents, or writes to the target file? Don't return a list of files. I know they are TemporaryFile objects, but unless you use dill (and even then it's touchy) you can't pickle a file. If you absolutely have to, return the file name, not the file, and don't use a TemporaryFile. pickle will choke trying to pass the file. So, you should refactor your code, or as I suggested earlier, try to see if you can bypass serialization issues by using multiprocess (which uses dill).
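A rough sketch of that last suggestion, returning the names of ordinary named temporary files instead of open TemporaryFile objects; the chunking-by-lines and the placeholder payload are my simplifications, not the original split logic:

import os
import tempfile
from multiprocessing import Pool

def split_to_named_files(infilename, num_chunks):
    # write each chunk to a named temp file and return the paths (plain picklable strings)
    with open(infilename, 'r') as infile:
        lines = infile.readlines()
    chunk_len = max(1, len(lines) // num_chunks)
    paths = []
    for i in range(0, len(lines), chunk_len):
        fd, path = tempfile.mkstemp(suffix='.csv')
        with os.fdopen(fd, 'w') as out:
            out.writelines(lines[i:i + chunk_len])
        paths.append(path)
    return paths

def parse_csv_chunk(path):
    # each worker re-opens its chunk by name
    with open(path, 'r') as f:
        return sum(1 for _ in f)  # placeholder payload: count lines

if __name__ == '__main__':
    paths = split_to_named_files('test.csv', 4)
    with Pool(processes=4) as pool:
        print(pool.map(parse_csv_chunk, paths))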
This question may sound basic because I do not know that much about multiprocessing; I am just learning.
I have python code which processes a bunch of files in a directory.
with Pool(processes=cores) as pp:
    pp.map(function, list)
Here is my code:
path = '/data/personal'
print("Running with PID: %d" % getpid())
psl = PublicSuffixList()
d = defaultdict(set)
start = time()
#
files_list = glob(path)
for filename in files:
    print(filename)
    f = open(filename, 'r')
    for n, line in enumerate(f):
        line = line[:-1]
        ip, reversed_domain_1 = line.split('|')
        reversed_domain_2 = reversed_domain_1.split('.')
        reversed_domain_3 = list(reversed(reversed_domain_2))
        domain = ('.'.join(reversed_domain_3))
        domain = psl.get_public_suffix(domain)
        d[ip].add(domain)
###
for ip, domains in d.iteritems():
    for domain in domains:
        print(ip, domain)
How can I convert it to be done in a multiprocessing pool?
You can process each file in a separate process like this:
from os import getpid
from collections import defaultdict
from glob import glob
from multiprocessing import Pool
from time import time
from functools import partial

path = '/data/personal'
print("Running with PID: %d" % getpid())

def process_file(psl, filename):
    print(filename)
    f = open(filename, 'r')
    for n, line in enumerate(f):
        line = line[:-1]
        ip, reversed_domain_1 = line.split('|')
        reversed_domain_2 = reversed_domain_1.split('.')
        reversed_domain_3 = list(reversed(reversed_domain_2))
        domain = ('.'.join(reversed_domain_3))
        domain = psl.get_public_suffix(domain)
        return ip, domain

if __name__ == "__main__":
    psl = PublicSuffixList()
    d = defaultdict(set)
    start = time()
    files_list = glob(path)
    pp = Pool(processes=cores)
    func = partial(process_file, psl)
    results = pp.imap_unordered(func, files_list)
    for ip, domain in results:
        d[ip].add(domain)
    pp.close()
    pp.join()
    for ip, domains in d.iteritems():
        for domain in domains:
            print(ip, domain)
Note that the defaultdict is populated in the parent process, because you can't actually share the same defaultdict between multiple processes without using a multiprocessing.Manager. You could do that here if you wanted, but I don't think it's necessary. Instead, as soon as any child has a result available, we add it to the defaultdict in the parent. Using imap_unordered instead of map enables us to receive results on demand, rather than having to wait for all of them to be ready. The only other notable thing is the use of partial to enable passing the psl object to all the child processes in addition to an item from files_list with imap_unordered.
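If you did want the shared-dictionary variant mentioned above, a minimal sketch with multiprocessing.Manager could look like this; the worker is a placeholder, and the proxy adds IPC overhead, which is why the answer avoids it:

from functools import partial
from multiprocessing import Manager, Pool

def record(shared_d, filename):
    # placeholder worker: store the length of the filename under the filename key
    shared_d[filename] = len(filename)

if __name__ == '__main__':
    with Manager() as manager:
        shared_d = manager.dict()  # proxy dict shared between processes
        with Pool(processes=2) as pool:
            pool.map(partial(record, shared_d), ['a.log', 'bb.log'])
        print(dict(shared_d))  # -> {'a.log': 5, 'bb.log': 6}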
One important note here: using multiprocessing for this kind of operation may not actually improve performance. A lot of the work you're doing here is reading from disk, which can't be sped up via multiple processes; your hard drive can only do one read operation at a time. Getting read requests for different files from a bunch of processes at once can actually slow things down compared to doing them sequentially, because the disk potentially has to constantly switch between different areas of the physical platter to read a new line from each file. Now, it's possible that the CPU-bound work you're doing with each line is expensive enough to dominate that I/O time, in which case you will see a speed boost.
I am trying to load a JSON file as part of the mapper function, but it returns "No such file in directory" although the file exists.
I am already opening a file and parsing through its lines. But I want to compare some of its values to a second JSON file.
from mrjob.job import MRJob
import json
import nltk
import re

WORD_RE = re.compile(r"\b[\w']+\b")
sentimentfile = open('sentiment_word_list_stemmed.json')

def mapper(self, _, line):
    stemmer = nltk.PorterStemmer()
    stems = json.loads(sentimentfile)
    line = line.strip()
    # each line is a json line
    data = json.loads(line)
    form = data.get('type', None)
    if form == 'review':
        bs_id = data.get('business_id', None)
        text = data['text']
        stars = data['stars']
        words = WORD_RE.findall(text)
        for word in words:
            w = stemmer.stem(word)
            senti = stems.get[w]
            if senti:
                yield (bs_id, (senti, 1))
You should not be opening a file in the mapper function at all. You only need to pass the file in as STDIN or as the first argument for the mapper to pick it up. Do it like this:
python mrjob_program.py sentiment_word_list_stemmed.json > output
OR
python mrjob_program.py < sentiment_word_list_stemmed.json > output
Either one will work. It says that there is no such file or directory because these mappers are not able to see the file that you are specifying. The mappers are designed to run on remote machines. Even if you wanted to read from a file in the mapper, you would need to copy the file that you are passing to all machines in the cluster, which doesn't really make sense for this example. You can actually specify a DEFAULT_INPUT_PROTOCOL so that the mapper knows which type of input you are using as well.
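For reference, a rough sketch of a job class that sets an input protocol; the class name is made up, and depending on your mrjob version the attribute is INPUT_PROTOCOL or the older DEFAULT_INPUT_PROTOCOL, so check the docs for your release:

from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol

class MRReviewCount(MRJob):
    # each input line is decoded from JSON before mapper() is called
    INPUT_PROTOCOL = JSONValueProtocol

    def mapper(self, _, data):
        # 'data' is already a parsed dict here, so no json.loads is needed
        if data.get('type') == 'review':
            yield data.get('business_id'), 1

if __name__ == '__main__':
    MRReviewCount.run()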
Here is a talk on the subject that will help:
http://blip.tv/pycon-us-videos-2009-2010-2011/pycon-2011-mrjob-distributed-computing-for-everyone-4898987/
You are using the json.loads() function, while passing in an open file. Use json.load() instead (note, no s).
stems = json.load(sentimentfile)
You do need to re-open the file every time you call your mapper() function, so it is better to just store the filename globally:
sentimentfile = 'sentiment_word_list_stemmed.json'

def mapper(self, _, line):
    stemmer = nltk.PorterStemmer()
    stems = json.load(open(sentimentfile))
Last but not least, you should use an absolute path to the filename, and not rely on the current working directory being correct.
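A small sketch of one way to build that absolute path relative to the script itself; the filename is the one from the question, and resolving it next to the script is my assumption about where the JSON lives:

import os

# resolve the JSON file next to this script, regardless of the current working directory
_HERE = os.path.dirname(os.path.abspath(__file__))
sentimentfile = os.path.join(_HERE, 'sentiment_word_list_stemmed.json')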