I'm trying to read raw data from a zipfile. The structure of that file is:
zipfile
    data
        Spectral0.data
        Spectral1.data
        Spectral[...].data
        Spectral300.data
    Header
The goal is to read all Spectral[...].data files into a 2D numpy array, where Spectral0.data would be the first column. The single-threaded approach takes a lot of time, since reading one .data file takes a few seconds.
import zipfile
import numpy as np

spectralData = np.zeros(shape=(dimY, dimX), dtype=np.int16)

archive = zipfile.ZipFile(path, 'r')
for file in range(fileCount):
    spectralDataRaw = archive.read('data/Spectral' + str(file) + '.data')
    spectralData[:, file] = np.frombuffer(spectralDataRaw, np.short)
And I thought using multiprocessing could speed up the process, so I read some tutorials on how to set up a multiprocessing procedure. This is what I came up with:
import zipfile
import numpy as np
import multiprocessing
from joblib import Parallel, delayed

archive = zipfile.ZipFile(path, 'r')
numCores = multiprocessing.cpu_count()

def testMult(file):
    spectralDataRaw = archive.read('data/Spectral' + str(file) + '.data')
    return np.frombuffer(spectralDataRaw, np.short)

output = Parallel(n_jobs=numCores)(delayed(testMult)(file) for file in range(fileCount))
output = np.flipud(np.rot90(np.array(output), 1, axes=(0, 2)))
Using this approach I get the following error:
_RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\externals\loky\backend\queues.py", line 153, in _feed
obj_ = dumps(obj, reducers=reducers)
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 271, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 264, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\externals\cloudpickle\cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle '_io.BufferedReader' object
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<ipython-input-94-c4b007eea8e2>", line 8, in <module>
output = Parallel(n_jobs=numCores)(delayed(testMult)(file)for file in range(fileCount))
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\parallel.py", line 1061, in __call__
self.retrieve()
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\parallel.py", line 940, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\site-packages\joblib\_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\concurrent\futures\_base.py", line 432, in result
return self.__get_result()
File "C:\ProgramData\Anaconda3\envs\devEnv2\lib\concurrent\futures\_base.py", line 388, in __get_result
raise self._exception
PicklingError: Could not pickle the task to send it to the workers.
My question is: how do I set up this parallelization correctly? I've read that zipfile is not thread-safe, and therefore I might need a different approach to read the zip content into memory (RAM). I would rather not read the whole zipfile into memory, since the file can be quite large.
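One idea I had is to open the archive separately inside each worker, so that no file handle has to be pickled at all. A rough, untested sketch (path and fileCount defined as above):

import zipfile
import numpy as np
import multiprocessing
from joblib import Parallel, delayed

def readSpectral(path, file):
    # each worker opens its own ZipFile handle, so nothing unpicklable
    # needs to be sent to the worker processes
    with zipfile.ZipFile(path, 'r') as archive:
        raw = archive.read('data/Spectral' + str(file) + '.data')
    return np.frombuffer(raw, np.short)

numCores = multiprocessing.cpu_count()
columns = Parallel(n_jobs=numCores)(
    delayed(readSpectral)(path, file) for file in range(fileCount))
# stack the 1D arrays as columns, Spectral0.data first
spectralData = np.column_stack(columns)

I'm not sure whether reopening the archive in every task is efficient, though.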
I also thought about using from numba import njit, prange, but the problem there is that zipfile is not supported by numba.
What else could I do to make this work?
Related
I have a function that converts a docx to html, and a large docx file to be converted.
The problem is that this function is part of a bigger program and the converted html is parsed afterwards, so I cannot afford to use another converter without impacting the rest of the code (which is not wanted). I'm running Python 2.7.13 on a 32-bit install; changing to 64-bit is also not desired.
This is the function:
import logging

import ooxml
from ooxml import serialize

def trasnformDocxtoHtml(inputFile, outputFile):
    logging.basicConfig(filename='ooxml.log', level=logging.INFO)
    dfile = ooxml.read_from_file(inputFile)
    with open(outputFile, 'w') as htmlFile:
        htmlFile.write(serialize.serialize(dfile.document))
and here's the error:
>>> import library
>>> library.trasnformDocxtoHtml(r'large_file.docx', 'output.html')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "library.py", line 9, in trasnformDocxtoHtml
dfile = ooxml.read_from_file(inputFile)
File "C:\Python27\lib\site-packages\ooxml\__init__.py", line 52, in read_from_file
dfile.parse()
File "C:\Python27\lib\site-packages\ooxml\docxfile.py", line 46, in parse
self._doc = parse_from_file(self)
File "C:\Python27\lib\site-packages\ooxml\parse.py", line 655, in parse_from_file
document = parse_document(doc_content)
File "C:\Python27\lib\site-packages\ooxml\parse.py", line 463, in parse_document
document.elements.append(parse_table(document, elem))
File "C:\Python27\lib\site-packages\ooxml\parse.py", line 436, in parse_table
for p in tc.xpath('./w:p', namespaces=NAMESPACES):
File "src\lxml\etree.pyx", line 1583, in lxml.etree._Element.xpath
MemoryError
no mem for new parser
MemoryError
Could I somehow increase the buffer memory in python? Or fix the function without impacting the html output format?
I am having issues getting basic functions of the nptdms module working.
First, I am just trying to open a TDMS file and print the contents of specific channels within specific groups.
I'm using Python 2.7 and the nptdms quick start here.
Following this, I will be writing these specific pieces of data into a new TDMS file. My ultimate goal is to be able to take a set of source files, open each, and write (append) to a new file. The source data files contain far more information than is needed, so I am breaking out the specifics into their own file.
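Roughly what I'm aiming for in the end is something like this (untested sketch based on the quick start; the paths and the group/channel names are just placeholders):

from nptdms import TdmsFile, TdmsWriter, ChannelObject

source_files = [r"C:\data\source1.tdms", r"C:\data\source2.tdms"]
with TdmsWriter(r"C:\data\combined.tdms") as tdms_writer:
    for path in source_files:
        tdms_file = TdmsFile(path)
        # pull only the channel of interest out of each source file
        channel = tdms_file.object('101', 'Payload_1')
        tdms_writer.write_segment([ChannelObject('101', 'Payload_1', channel.data)])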
The problem I have is that I cannot get past a basic error.
When running this code, I get:
Traceback (most recent call last):
File "PullTDMSdataIntoNewFile.py", line 27, in <module>
tdms_file = TdmsFile(r"C:\\Users\daniel.worts\Desktop\this_is_my_tdms_file.tdms","r")
File "C:\Anaconda2\lib\site-packages\nptdms\tdms.py", line 94, in __init__
self._read_segments(f)
File "C:\Anaconda2\lib\site-packages\nptdms\tdms.py", line 119, in _read_segments
object._initialise_data(memmap_dir=self.memmap_dir)
File "C:\Anaconda2\lib\site-packages\nptdms\tdms.py", line 709, in _initialise_data
mode='w+b', prefix="nptdms_", dir=memmap_dir)
File "C:\Anaconda2\lib\tempfile.py", line 475, in NamedTemporaryFile
(fd, name) = _mkstemp_inner(dir, prefix, suffix, flags)
File "C:\Anaconda2\lib\tempfile.py", line 244, in _mkstemp_inner
fd = _os.open(file, flags, 0600)
OSError: [Errno 2] No such file or directory: 'r\\nptdms_yjfyam'
Here is my code:
from nptdms import TdmsFile
import numpy as np
import pandas as pd
#set Tdms file path
tdms_file = TdmsFile(r"C:\\Users\daniel.worts\Desktop\this_is_my_tdms_file.tdms","r")
# set variable for TDMS groups
group_nameone = '101'
group_nametwo = '752'
# set objects for TDMS channels
channel_dataone = tdms_file.object(group_nameone, 'Payload_1')
channel_datatwo = tdms_file.object(group_nametwo, 'Payload_2')
# set data from channels
data_dataone = channel_dataone.data
data_datatwo = channel_datatwo.data
print data_dataone
print data_datatwo
Big thanks to anyone who may have encountered this before and can help point to what I am missing.
Best,
- Dan
edit:
Solved the read data issue by removing the 'r' argument that followed the file path.
Now I am having another error I can't trace when trying to write.
from nptdms import TdmsFile, TdmsWriter, RootObject, GroupObject, ChannelObject
import numpy as np
import pandas as pd
newfilepath = r"C:\\Users\daniel.worts\Desktop\Mined.tdms"
datetimegroup101_channel_object = ChannelObject('101', DateTime, data_datetimegroup101)
with TdmsWriter(newfilepath) as tdms_writer:
    tdms_writer.write_segment([datetimegroup101_channel_object])
Returns error:
Traceback (most recent call last):
File "PullTDMSdataIntoNewFile.py", line 82, in <module>
tdms_writer.write_segment([datetimegroup101_channel_object])
File "C:\Anaconda2\lib\site-packages\nptdms\writer.py", line 68, in write_segment
segment = TdmsSegment(objects)
File "C:\Anaconda2\lib\site-packages\nptdms\writer.py", line 88, in __init__
paths = set(obj.path for obj in objects)
File "C:\Anaconda2\lib\site-packages\nptdms\writer.py", line 88, in <genexpr>
paths = set(obj.path for obj in objects)
File "C:\Anaconda2\lib\site-packages\nptdms\writer.py", line 254, in path
self.channel.replace("'", "''"))
AttributeError: 'TdmsObject' object has no attribute 'replace'
I am getting a dill/pickle memory error when loading a serialized object file. I am not quite sure what is happening, and I am unsure how to fix it.
When I call:
stat_bundle = train_batch_iterator(clf, TOTAL_TRAINED_EVENTS)
The code traces to the train_batch_iterator function in which it loads a serialized object and trains the classifier with the data within the object. This is the code:
import glob

import dill
import numpy as np

def train_batch_iterator(clf, tte):
    plot_data = []  # initialize plot data array
    for file in glob.glob('./SerializedData/Batch8172015_19999/*'):
        with open(file, 'rb') as stream:
            minibatch_train = dill.load(stream)
        # train incrementally on this batch and record progress
        clf.partial_fit(minibatch_train.data[1], minibatch_train.target,
                        classes=np.array([11, 111]))
        tte += len(minibatch_train.target)
        plot_data.append((test_batch_iterator(clf), tte))
    return plot_data
Here is the error:
Traceback (most recent call last):
File "LArSoftSGD-version2.0.py", line 154, in <module>
stat_bundle = train_batch_iterator(clf, TOTAL_TRAINED_EVENTS)
File "LArSoftSGD-version2.0.py", line 118, in train_batch_iterator
minibatch_train = dill.load(stream)
File "/home/jdoe/.local/lib/python3.4/site-packages/dill/dill.py", line 199, in load
obj = pik.load()
File "/home/jdoe/.local/lib/python3.4/pickle.py", line 1038, in load
dispatch[key[0]](self)
File "/home/jdoe/.local/lib/python3.4/pickle.py", line 1184, in load_binbytes
self.append(self.read(len))
File "/home/jdoe/.local/lib/python3.4/pickle.py", line 237, in read
return self.file_read(n)
MemoryError
I have no idea what could be going wrong. The error seems to be in the line minibatch_train = dill.load(stream), and the only thing I can think of is that the serialized data file is too large; however, the file is exactly 1161 MB, which doesn't seem big enough to cause a memory error.
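In case it's relevant, here's a quick check I can run to confirm whether the interpreter is a 64-bit build (as I understand it, a 32-bit process could run out of address space with a file this size):

import sys
import platform

print(platform.architecture()[0])  # e.g. '64bit'
print(sys.maxsize > 2**32)         # True on a 64-bit build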
Does anybody know what might be going wrong?
I have a situation in Python where, in a runnable object that is running together with other processes, the following happens. If the code is simply:
f = open(filename, "rb")
f.close()
There is no error, but when the code changes to the following, introducing pickle in the middle, it throws a FileNotFoundError:
f = open(filename, "rb")
object = pickle.load(f)
f.close()
I don't understand why, if the file exists, pickle would throw such an error. The complete trace of the error is:
task = pickle.load(f)
File "/usr/lib/python3.4/multiprocessing/managers.py", line 852, in RebuildProxy
return func(token, serializer, incref=incref, **kwds)
File "/usr/lib/python3.4/multiprocessing/managers.py", line 706, in __init__
self._incref()
File "/usr/lib/python3.4/multiprocessing/managers.py", line 756, in _incref
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/lib/python3.4/multiprocessing/connection.py", line 495, in Client
c = SocketClient(address)
File "/usr/lib/python3.4/multiprocessing/connection.py", line 624, in SocketClient
s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory
While I'd say that @TesselatingHeckler provided the answer, this comment doesn't really fit in the comments section… so it's an extension of that answer.
You can't pickle multiprocessing Queue and Pipe objects, and indeed, some pickled objects only fail on load.
I have built a fork of multiprocessing that allows most objects to be pickled. Primarily, what was done was to replace pickle with a much more robust serializer (dill). The fork is available as part of the pathos package (on github). I've tried to serialize Pipes before, and it works; it also works on a python socket and on a python Queue. However, it apparently still doesn't work on a multiprocessing Queue.
>>> from processing import Pipe
>>> p = Pipe()
>>>
>>> import dill
>>> dill.loads(dill.dumps(p))
(Connection(handle=12), Connection(handle=14))
>>>
>>> from socket import socket
>>> s = socket()
>>> from Queue import Queue as que
>>> w = que()
>>> dill.loads(dill.dumps(s))
<socket._socketobject object at 0x10dae18a0>
>>> dill.loads(dill.dumps(w))
<Queue.Queue instance at 0x10db49f38>
>>>
>>> from processing import Queue
>>> q = Queue()
>>> dill.loads(dill.dumps(q))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.2.dev-py2.7.egg/dill/dill.py", line 180, in dumps
dump(obj, file, protocol, byref, file_mode, safeio)
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.2.dev-py2.7.egg/dill/dill.py", line 173, in dump
pik.dump(obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
File "/Users/mmckerns/lib/python2.7/site-packages/processing/queue.py", line 62, in __getstate__
assertSpawning(self)
File "/Users/mmckerns/lib/python2.7/site-packages/processing/forking.py", line 24, in assertSpawning
'processes through inheritance' % type(self).__name__)
RuntimeError: Queue objects should only be shared between processes through inheritance
It seems that the __getstate__ method for a multiprocessing Queue is hardwired to throw an error. If you had this buried inside a class, it might not be triggered until the class instance is being reconstituted from the pickle.
Trying to pickle a multiprocessing Queue with pickle should also give you the above error.
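For example, a minimal reproduction with the standard library multiprocessing (rather than the older processing package used above) looks like this; the module names differ slightly, but the hardwired check is the same:

import pickle
from multiprocessing import Queue

q = Queue()
try:
    # Queue.__getstate__ calls assert_spawning, which refuses to serialize it
    pickle.dumps(q)
except RuntimeError as e:
    print(e)  # Queue objects should only be shared between processes through inheritance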
I have a problem in Python. I'm using SciPy, specifically scipy.io, to load a .mat file. The .mat file was created using MATLAB.
import os
import scipy.io as sio

listOfFiles = os.listdir(loadpathTrain)
for f in listOfFiles:
    fullPath = loadpathTrain + '/' + f
    mat_contents = sio.loadmat(fullPath)
    print fullPath
Here's the error:
Traceback (most recent call last):
File "tryRankNet.py", line 1112, in <module>
demo()
File "tryRankNet.py", line 645, in demo
mat_contents = sio.loadmat(fullPath)
File "/usr/lib/python2.6/dist-packages/scipy/io/matlab/mio.py", line 111, in loadmat
matfile_dict = MR.get_variables()
File "/usr/lib/python2.6/dist-packages/scipy/io/matlab/miobase.py", line 356, in get_variables
getter = self.matrix_getter_factory()
File "/usr/lib/python2.6/dist-packages/scipy/io/matlab/mio5.py", line 602, in matrix_getter_factory
return self._array_reader.matrix_getter_factory()
File "/usr/lib/python2.6/dist-packages/scipy/io/matlab/mio5.py", line 274, in matrix_getter_factory
tag = self.read_dtype(self.dtypes['tag_full'])
File "/usr/lib/python2.6/dist-packages/scipy/io/matlab/miobase.py", line 171, in read_dtype
order='F')
TypeError: buffer is too small for requested array
The whole thing is in a loop, and I checked the size of the data in the file where it gives the error by loading it interactively in IDLE.
The size is (9,521), which is not at all huge. I tried to find out whether I'm supposed to clear the buffer after each iteration of the loop, but I could not find anything.
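If it helps, here's a small variation of the loop I could use to report exactly which file fails and how large it is on disk before re-raising (rough sketch):

import os
import scipy.io as sio

for f in os.listdir(loadpathTrain):
    fullPath = loadpathTrain + '/' + f
    try:
        mat_contents = sio.loadmat(fullPath)
    except TypeError as err:
        print 'loadmat failed on %s (%d bytes): %s' % (fullPath, os.path.getsize(fullPath), err)
        raise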
Any help would be appreciated.
Thanks.