I have a tarfile containing bz2-compressed files. I want to apply the function clean_file to each of the bz2 files, and collate the results. In series, this is easy with a loop:
import pandas as pd
import json
import os
import bz2
import itertools
import datetime
import tarfile
from multiprocessing import Pool

def clean_file(member):
    if '.bz2' in str(member):
        f = tr.extractfile(member)
        with bz2.open(f, "rt") as bzinput:
            dicts = []
            for i, line in enumerate(bzinput):
                line = line.replace('"name"}', '"name":" "}')
                dat = json.loads(line)
                dicts.append(dat)
        bzinput.close()
        f.close()
        del f, bzinput
        processed = dicts[0]
        return processed
    else:
        pass

# Open tar file and get contents (members)
tr = tarfile.open('data.tar')
members = tr.getmembers()
num_files = len(members)

# Apply the clean_file function in series
i = 0
processed_files = []
for m in members:
    processed_files.append(clean_file(m))
    i += 1
    print('done ' + str(i) + '/' + str(num_files))
However, I need to be able to do this in parallel. The method I'm trying uses Pool like so:
# Apply the clean_file function in parallel
if __name__ == '__main__':
    with Pool(2) as p:
        processed_files = list(p.map(clean_file, members))
But this returns an OSError:
Traceback (most recent call last):
  File "/Users/johnfoley/opt/anaconda3/envs/racing_env/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "parse_data.py", line 19, in clean_file
    for i, line in enumerate(bzinput):
  File "/Users/johnfoley/opt/anaconda3/envs/racing_env/lib/python3.6/bz2.py", line 195, in read1
    return self._buffer.read1(size)
  File "/Users/johnfoley/opt/anaconda3/envs/racing_env/lib/python3.6/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/Users/johnfoley/opt/anaconda3/envs/racing_env/lib/python3.6/_compression.py", line 103, in read
    data = self._decompressor.decompress(rawblock, size)
OSError: Invalid data stream
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "parse_data.py", line 53, in <module>
    processed_files = list(tqdm.tqdm(p.imap(clean_file, members), total=num_files))
  File "/Users/johnfoley/opt/anaconda3/envs/racing_env/lib/python3.6/site-packages/tqdm/std.py", line 1167, in __iter__
    for obj in iterable:
  File "/Users/johnfoley/opt/anaconda3/envs/racing_env/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
OSError: Invalid data stream
So I guess this way isn't properly accessing the files from within data.tar or something. How can I apply the function in parallel?
I'm guessing this will work with any tar archive containing bz2 files but here's my data to reproduce the error:
https://github.com/johnf1004/reproduce_tar_error
You didn't specify what platform you are running on but I suspect that it is Windows because you have ...
if __name__ == '__main__':
    main()
... which would be required for code that creates processes on platforms that use OS function spawn for creating new processes. But that also means that when a new process is created (e.g. all the processes in the process pool you are creating), each process begins by re-executing the source program from the very top of the program. This means that the following code is being executed by each pool process:
tr = tarfile.open('data.tar')
members = tr.getmembers()
num_files = len(members)
However, I don't see why this would in itself cause an error, but I can't be sure. The problem may be that this module-level code runs after your worker function clean_file has already been called, so tr has not yet been set. If this code preceded clean_file it might work, but this is just a guess. Certainly extracting the members with members = tr.getmembers() in each pool process is wasteful; each process needs to open the tar file, but ideally just once.
But what is clear is that the stacktrace you published does not match your code. You show:
Traceback (most recent call last):
  File "parse_data.py", line 53, in <module>
    processed_files = list(tqdm.tqdm(p.imap(clean_file, members), total=num_files))
Yet your code has no reference to tqdm and never calls the imap method. It becomes much more difficult to analyze what your problem actually is when the code you post doesn't quite match the code that produced the exception.
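For reference, the stack trace implies that your actual parallel call looks something like the following. This is a reconstruction from the traceback, not code you posted; clean_file, members and num_files are the names from your question:

import tqdm
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(2) as p:
        processed_files = list(tqdm.tqdm(p.imap(clean_file, members), total=num_files))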
On the off-chance you are running on a Mac, which might be using fork to create new processes: fork can be problematic when the main process has created threads before forking (threads you don't necessarily see; the tarfile module may create them, for example). For that reason I have written the code below so that spawn is used to create new processes. Anyway, the following code should work. It also introduces a few optimizations. If it doesn't work, please post a new stack trace.
import pandas as pd
import json
import os
import bz2
import itertools
import datetime
import tarfile
from multiprocessing import get_context

def open_tar():
    # open once for each process in the pool
    global tr
    tr = tarfile.open('data.tar')

def clean_file(member):
    f = tr.extractfile(member)
    with bz2.open(f, "rt") as bzinput:
        for line in bzinput:
            line = line.replace('"name"}', '"name":" "}')
            dat = json.loads(line)
            # since you are returning just the first occurrence:
            return dat

def main():
    with tarfile.open('data.tar') as tr:
        members = tr.getmembers()
    # just pick members where '.bz2' is in the member name:
    filtered_members = filter(lambda member: '.bz2' in str(member), members)
    ctx = get_context('spawn')
    # open the tar file just once for each process in the pool:
    with ctx.Pool(initializer=open_tar) as pool:
        processed_files = pool.map(clean_file, filtered_members)
        print(processed_files)

# required for when processes are created using spawn:
if __name__ == '__main__':
    main()
It seems some race condition was happening.
Opening the tar file separately in every child process solves the issue:
import json
import bz2
import tarfile
import logging
from multiprocessing import Pool

def clean_file(member):
    if '.bz2' not in str(member):
        return
    try:
        with tarfile.open('data.tar') as tr:
            with tr.extractfile(member) as bz2_file:
                with bz2.open(bz2_file, "rt") as bzinput:
                    dicts = []
                    for i, line in enumerate(bzinput):
                        line = line.replace('"name"}', '"name":" "}')
                        dat = json.loads(line)
                        dicts.append(dat)
                    return dicts[0]
    except Exception:
        logging.exception(f"Error while processing {member}")

def process_serial():
    tr = tarfile.open('data.tar')
    members = tr.getmembers()
    processed_files = []
    for i, member in enumerate(members):
        processed_files.append(clean_file(member))
        print(f'done {i}/{len(members)}')

def process_parallel():
    tr = tarfile.open('data.tar')
    members = tr.getmembers()
    with Pool() as pool:
        processed_files = pool.map(clean_file, members)
        print(processed_files)

def main():
    process_parallel()

if __name__ == '__main__':
    main()
EDIT:
Note that another way to solve this problem is to just use the spawn start method:
multiprocessing.set_start_method('spawn')
By doing this, each child process gets its own, freshly opened file handles instead of inheriting the parent's. Under the default "fork" start method, the file handles of parent and child refer to the same open file description and therefore share the same offsets.
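For reference, here is a minimal, self-contained sketch of where that call belongs: it must run in the main module, under the __main__ guard, and before any Pool or Process is created (the work function here is just a placeholder):

import multiprocessing

def work(x):
    return x * x

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')  # must run before any Pool is created
    with multiprocessing.Pool(2) as pool:
        print(pool.map(work, range(4)))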
The goal is to read a log file in real time, line by line (standard generator stuff), but the catch is that the file name changes at various intervals. The name change can't be helped (it is dictated by the application, which appends a time string), and the name is changed when the log file size reaches ~2MB (guesstimate).
My approach was to create a file getter function that got the file (or new file) and then passed that to the generator. I thought that when the file changed names I would get a 'File not found' error, but what my test showed is that the file name change is prevented entirely because 'another program is using this file'. The name change must be allowed, and this reader code cannot interfere with the application's logging process at all.
import os
import time
import fnmatch

directory = '\\foo\\'

def fileGenerator(logFile):
    """ Run a line generator """
    logFile.seek(0, 2)
    while True:
        line = logFile.readline()
        if not line:
            time.sleep(0.1)
            continue
        yield line

def fileGetter():
    """ Get the Logging File """
    matchedFiles = []
    for afile in os.listdir(directory):
        if fnmatch.fnmatch(afile, 'amc_*.txt'):
            matchedFiles.append(afile)
    if len(matchedFiles) == 1:
        # There was exactly one matching file found, send it to the generator
        return os.path.join(directory, matchedFiles[0])
    else:
        # There either wasn't a file found or there were many matches
        # Error out and stop the process... critical error
        pass

if __name__ == '__main__':
    filePath = fileGetter()
    try:
        logFile = open(filePath, "r")
    except Exception as e:
        # Catch the file not found and go back to the file path getter
        # Send the file back to the generator
        print e
    if logFile:
        loglines = fileGenerator(logFile)
        for line in loglines:
            # handle the line
            print line,
If you can't hold the file open while waiting for new content to be written to it, I suggest saving the file position you were last at and closing the file before you sleep, and then reopening the file and seeking to that point afterwards. You could also investigate filesystem notification systems if you care about spotting file additions or renames immediately.
from time import sleep  # needed for the sleep() call below

def log_reader():
    filename = "does_not_exist"
    filepos = 0
    while True:
        try:
            file = open(filename)
        except FileNotFoundError:
            filename = fileGetter()
            # if renamed files start empty, set filepos to zero here!
            continue
        file.seek(filepos)
        while True:
            line = file.readline()
            if not line:
                filepos = file.tell()
                file.close()
                sleep(0.1)  # you may want to test different sleep lengths to avoid FS thrash
                break
            yield line
The opening and closing of the file may stress out your filesystem if you do it too much, so I'd suggest sleeping longer than your previous code did (but you may want to test to see how well your OS handles it if you care about how responsive your log reader is).
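For completeness, a minimal driver for the generator above; it assumes fileGetter from your question is defined in the same module:

if __name__ == '__main__':
    for line in log_reader():
        print(line, end='')  # handle the line here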
As suggested below, I have changed my code to use Pool instead. I've also simplified my functions and included all my code below. However, now I'm getting a different error: NameError: global name 'split_files' is not defined
What I want to do is pass the actual file chunk into the parse_csv_chunk function but I'm not sure how to do that.
import csv
from itertools import islice
from collections import deque
import time
import math
import multiprocessing as mp
import os
import sys
import tempfile

csv_filename = 'test.csv'

def parse_csv_chunk(files_index):
    global split_files
    print files_index
    print len(split_files)
    return 1

def split(infilename, num_chunks):
    READ_BUFFER = 2**13
    in_file_size = os.path.getsize(infilename)
    print 'Original file size:', in_file_size
    chunk_size = in_file_size // num_chunks
    print 'Target chunk size:', chunk_size
    print 'Target number of chunks:', num_chunks
    files = []
    with open(infilename, 'rb', READ_BUFFER) as infile:
        infile.next()
        infile.next()
        infile.next()
        for _ in xrange(num_chunks):
            temp_file = tempfile.TemporaryFile()
            while temp_file.tell() < chunk_size:
                try:
                    #write 3 lines before checking if still < chunk_size
                    #this is done to improve performance
                    #the result is that each chunk will not be exactly the same size
                    temp_file.write(infile.next())
                    temp_file.write(infile.next())
                    temp_file.write(infile.next())
                #end of original file
                except StopIteration:
                    break
            #rewind each chunk
            temp_file.seek(0)
            files.append(temp_file)
    return files

if __name__ == '__main__':
    start = time.time()
    num_chunks = mp.cpu_count()
    split_files = split(csv_filename, num_chunks)
    print 'Number of files after splitting: ', len(split_files)
    pool = mp.Pool(processes=num_chunks)
    results = [pool.apply_async(parse_csv_chunk, args=(x,)) for x in range(num_chunks)]
    output = [p.get() for p in results]
    print output
I'm trying to split up a csv file into parts and have them processed by each of my CPU's cores. This is what I have so far:
import csv
from itertools import islice
from collections import deque
import time
import math
import multiprocessing as mp
import os
import sys
import tempfile

def parse_csv_chunk(infile):
    #code here
    return

def split(infilename, num_chunks):
    #code here
    return files

def get_header_indices(infilename):
    #code here
    return

if __name__ == '__main__':
    start = time.time()  #start measuring performance
    num_chunks = mp.cpu_count()  #record number of CPU cores
    files = split(csv_filename, num_chunks)  #split csv file into a number equal of CPU cores and store as list
    print 'number of files after splitting: ', len(files)
    get_header_indices(csv_filename)  #get headers of csv file
    print headers_list
    processes = [mp.Process(target=parse_csv_chunk,
                            args=ifile) for ifile in enumerate(files)]  #create a list of processes for each file chunk
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    end = time.time()
    print "Execution time: %.2f" % (end - start)  #display performance
There seems to be a problem at the line 'p.start()'. I see a lot of output on the console, which eventually indicates an error:
pickle.PicklingError: Can't pickle <built-in method write of file object at 0x0222EAC8>: it's not found as __main__.write
I did not include the code for the functions I called as they are quite long, but I can if needed. I'm wondering if I'm using multiprocessing correctly.
First off, is there a reason you are not using a Pool and the Pool's imap method?
Second, it's very hard to tell any specifics without seeing your code, especially since the error points to parts of the code that are not provided.
However, it looks like you are using multiprocessing correctly from what you have provided -- and it's a serialization problem.
Note that if you use dill, you can serialize the write method.
>>> import dill
>>>
>>> f = open('foo.bar', 'w')
>>> dill.dumps(f.write)
'\x80\x02cdill.dill\n_get_attr\nq\x00cdill.dill\n_create_filehandle\nq\x01(U\x07foo.barq\x02U\x01wq\x03K\x00\x89c__builtin__\nopen\nq\x04\x89K\x00U\x00q\x05tq\x06Rq\x07U\x05writeq\x08\x86q\tRq\n.'
Most versions of multiprocessing use cPickle (or a version of pickle that is built in C), and while dill can inject its types into the pure-Python version of pickle, it can't do so in the C equivalent.
There is a dill-activated fork of multiprocessing -- so you might try that; if it's purely a pickling problem, you should get past it with multiprocess.
See: https://github.com/uqfoundation/multiprocess.
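If you do try it, the swap is usually just the import, since multiprocess mirrors the multiprocessing API. A rough sketch (assuming pip install multiprocess; the work function is only a placeholder):

from multiprocess import Pool  # drop-in replacement; serializes with dill

def work(x):
    return x * 2

if __name__ == '__main__':
    pool = Pool(4)
    print(pool.map(work, range(4)))
    pool.close()
    pool.join()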
EDIT (after OP update): The global declaration in your helper function isn't going to play well with pickle. Why not just use a payload function (like split) that reads a portion of the file and returns the contents, or writes to the target file? Don't return a list of files. I know they are TemporaryFile objects, but unless you use dill (and even then it's touchy) you can't pickle a file. If you absolutely have to, return the file name, not the file, and don't use a TemporaryFile. pickle will choke trying to pass the file. So, you should refactor your code, or as I suggested earlier, try to see if you can bypass serialization issues by using multiprocess (which uses dill).
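Here is a rough sketch of that refactor, where each worker receives only a picklable (start, stop) description of its slice of lines and opens the CSV itself. The row-counting payload and the test.csv name are placeholders, not your actual processing:

import csv
import multiprocessing as mp
from itertools import islice

CSV_FILENAME = 'test.csv'  # placeholder, matching the question

def parse_csv_chunk(line_range):
    # each worker opens the file itself, so nothing unpicklable crosses the process boundary
    start, stop = line_range
    with open(CSV_FILENAME, 'r') as f:
        reader = csv.reader(islice(f, start, stop))
        return sum(1 for _ in reader)  # e.g. count the rows in this slice

if __name__ == '__main__':
    with open(CSV_FILENAME) as f:
        total_lines = sum(1 for _ in f)
    num_chunks = mp.cpu_count()
    step = total_lines // num_chunks + 1
    ranges = [(i, i + step) for i in range(0, total_lines, step)]
    pool = mp.Pool(processes=num_chunks)
    print(pool.map(parse_csv_chunk, ranges))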
I'm using the python module subprocess to call a program and redirect the possible std error to a specific file with the following command:
with open("std.err","w") as err:
subprocess.call(["exec"],stderr=err)
I want the "std.err" file to be created only if there are errors, but with the command above, the code creates an empty file even when there are no errors.
How can I make Python create the file only if it's not empty?
I can check after execution whether the file is empty and remove it in that case, but I was looking for a "cleaner" way.
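For comparison, the check-and-remove approach described above only takes a few lines (a sketch; "exec" is the placeholder command from the question):

import os
import subprocess

with open("std.err", "w") as err:
    subprocess.call(["exec"], stderr=err)
if os.path.getsize("std.err") == 0:
    os.remove("std.err")  # drop the file if nothing was written to it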
You could use Popen, checking stderr:
from subprocess import Popen, PIPE

proc = Popen(["EXEC"], stderr=PIPE, stdout=PIPE, universal_newlines=True)
out, err = proc.communicate()

if err:
    with open("std.err", "w") as f:
        f.write(err)
On a side note, if you care about the return code you should use check_call; you could combine it with a NamedTemporaryFile:
import subprocess
from tempfile import NamedTemporaryFile
from os import stat, remove
from shutil import move

try:
    with NamedTemporaryFile(dir=".", delete=False) as err:
        subprocess.check_call(["exec"], stderr=err)
except (subprocess.CalledProcessError, OSError) as e:
    print(e)

if stat(err.name).st_size != 0:
    move(err.name, "std.err")
else:
    remove(err.name)
You can create your own context manager to handle the cleanup for you -- you can't really do what you're describing here, which boils down to asking how you can see into the future. Something like this (with better error handling, etc.):
import os
from contextlib import contextmanager

@contextmanager
def maybeFile(fileName):
    # open the file
    f = open(fileName, "w")
    # yield the file to be used by the block of code inside the with statement
    yield f
    # the block is over, do our cleanup.
    f.flush()
    # if nothing was written, remember that we need to delete the file.
    needsCleanup = f.tell() == 0
    f.close()
    if needsCleanup:
        os.remove(fileName)
...and then something like:
with maybeFile("myFileName.txt") as f:
    import random
    if random.random() < 0.5:
        f.write("There should be a file left behind!\n")
will either leave behind a file with a single line of text in it, or will leave nothing behind.
When I run the following script, the error "Command line argument error: Argument "query". File is not accessible" occurs. I'm using Python 3.4.2.
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import subprocess
import tempfile
import sys

def main():
    # read name file and put all identifications into a list
    infile_I = open('OTU_name.txt', 'r')
    name = infile_I.read().split('>')
    infile_I.close()
    # extract sequence segments to a temporary file one at a time
    for i in name:
        i = i.replace('\n', '')
        for j in SeqIO.parse("GemSIM_OTU_ids.fa", "fasta"):
            if str(i) == str(j.id):
                f = tempfile.NamedTemporaryFile()
                record = j.seq
                f.write(bytes(str(record), 'UTF-8'))
                f.seek(0)
                f = f.read().decode()
                Result = subprocess.Popen(['blastn', '-remote', '-db', 'chromosome', '-query', f, '-out', str(i)], stdout=subprocess.PIPE)
                output = Result.communicate()[0]

if __name__ == '__main__':
    main()
f = tempfile.NamedTemporaryFile() returns a file-like object, which you're trying to provide as a command line argument. Instead, you want the actual filename, which is available via its .name attribute. I'm somewhat confused, though, about why you're creating a tempfile, writing to it, seeking back to position 0, and then replacing your tempfile object f with the contents of the file. I suspect you don't want to do that replacement and should use f.name for your query:
Result = subprocess.Popen(['blastn','-remote','-db','chromosome','-query',f.name,'-out',str(i)],stdout=subprocess.PIPE)
Also, there's some convenient wrapper functions around subprocess.Popen such as subprocess.check_output which are also somewhat more explicit as to your intent which could be used here instead.
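For instance, a small sketch of the check_output variant, reusing f.name and i from the loop above and leaving the blastn arguments exactly as in the question:

import subprocess

# raises CalledProcessError on a non-zero exit status instead of silently continuing
output = subprocess.check_output(
    ['blastn', '-remote', '-db', 'chromosome', '-query', f.name, '-out', str(i)])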
I am multiprocessing to process some very large files.
I can count occurrences of a particular string using the collections.Counter collection that is shared between the processes using a multiprocessing.BaseManager subclass.
Although I can share the Counter and seemingly pickle it, it does not seem to be pickled properly. I can, however, copy the dictionary to a new dictionary that I can pickle.
I am trying to understand how to avoid the "copy" of the shared counter before pickling it.
Here is my (pseudocode):
import multiprocessing
import pickle
from multiprocessing.managers import BaseManager
from collections import Counter

class MyManager(BaseManager):
    pass

MyManager.register('Counter', Counter)

def main(glob_pattern):
    # function that processes files
    def worker_process(files_split_to_allow_naive_parallelization, mycounterdict):
        # code that loops through files
        for line in file:
            # code that processes line
            my_line_items = line.split()
            index_for_read = (my_line_items[0], my_line_items[6])
            mycounterdict.update((index_for_read,))

    manager = MyManager()
    manager.start()
    mycounterdict = manager.Counter()

    # code to get glob files, split them with unix shell split and then chunk them
    for i in range(NUM_PROCS):
        p = multiprocessing.Process(target=worker_process,
                                    args=(all_index_file_tuples[chunksize * i:chunksize * (i + 1)], mycounterdict))
        procs.append(p)
        p.start()

    # Now we "join" the processes
    for p in procs:
        p.join()

    # This is the part I have trouble with
    # This yields a pickled file that fails with an error
    pickle.dump(mycounterdict, open("Combined_count_gives_error.p", "wb"))

    # This however works
    # How can I avoid doing it this way?
    mycopydict = Counter()
    mycopydict.update(mycounterdict.items())
    pickle.dump(mycopydict, open("Combined_count_that_works.p", "wb"))
When I try to load the "pickled" file that produces the error (and which is always a smaller, fixed size), I get an error that does not make sense.
How do I pickle the shared dict without creating a fresh dict as in the pseudocode above?
>>> p = pickle.load(open("Combined_count_gives_error.p"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1378, in load
    return Unpickler(file).load()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 880, in load_eof
    raise EOFError
EOFError
There are several problems with your code. First of all, you are not guaranteed to close the file if you leave it dangling the way pickle.dump(..., open(...)) does. Secondly, mycounterdict is not an actual Counter but a proxy over it; pickle it and you will run into many problems, as it is unpicklable outside this process. However, you do not need to copy it with update either: .copy() makes a new Counter copy of it.
Thus you should use
with open("out.p", "wb") as f:
pickle.dump(mycounterdict.copy(), f)
As for whether this is a good pattern, the answer is no. Instead of using a shared counter you should count separately in each process, which gives much simpler code:
from multiprocessing import Pool
from collections import Counter
import pickle

def calculate(file):
    counts = Counter()
    ...
    return counts

pool = Pool(processes=NPROCS)
counts = Counter()
for result in pool.map(calculate, files):
    counts += result

with open("out.p", "wb") as f:
    pickle.dump(counts, f)
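For concreteness, here is one way calculate could be fleshed out, based on the counting done by worker_process in the question's pseudocode (the column indices are taken from that pseudocode; everything else is an assumption):

def calculate(file):
    counts = Counter()
    with open(file) as fh:
        for line in fh:
            items = line.split()
            # same key as the question's index_for_read
            counts[(items[0], items[6])] += 1
    return counts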