_pickle in Python 3 doesn't work for saving large data

I am trying to use _pickle to save data to disk, but when calling _pickle.dump I got an error:
OverflowError: cannot serialize a bytes object larger than 4 GiB
Is this a hard limitation of _pickle (cPickle in Python 2)?

Not anymore as of Python 3.4, which has PEP 3154 and pickle protocol 4.
https://www.python.org/dev/peps/pep-3154/
But you need to say you want to use version 4 of the protocol:
https://docs.python.org/3/library/pickle.html
pickle.dump(d, open("file", 'wb'), protocol=4)

Yes, this is a hard-coded limit; from the save_bytes function in CPython's _pickle.c:
else if (size <= 0xffffffffL) {
    // ...
}
else {
    PyErr_SetString(PyExc_OverflowError,
                    "cannot serialize a bytes object larger than 4 GiB");
    return -1;    /* string too large */
}
The protocol uses 4 bytes to write the size of the object to disk, which means it can only track sizes of up to 2^32 == 4 GiB.
If you can break up the bytes object into multiple objects, each smaller than 4GB, you can still save the data to a pickle, of course.
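For example, one way to stay under the limit with an older protocol is to pickle a list of smaller chunks instead of a single huge bytes object (a rough sketch; the chunk size and function names are only illustrative):
import pickle

CHUNK = 2**31  # 2 GiB per piece, well under the 4 GiB limit

def dump_in_chunks(big_bytes, path):
    # Split the data into < 4 GiB pieces so each pickled bytes object is allowed
    chunks = [big_bytes[i:i + CHUNK] for i in range(0, len(big_bytes), CHUNK)]
    with open(path, 'wb') as f:
        pickle.dump(chunks, f)

def load_chunks(path):
    # Read the list of chunks back and re-join them
    with open(path, 'rb') as f:
        return b''.join(pickle.load(f))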

There are great answers above for why pickle doesn't work.
But protocol 4 still doesn't exist for Python 2.7, which is a problem
if you are still on Python 2.7 and want to support large
files, especially NumPy (NumPy arrays over 4 GB fail).
You can use OC serialization, which has been updated to work for data over
4 GB. There is a Python C extension module available from:
http://www.picklingtools.com/Downloads
Take a look at the Documentation:
http://www.picklingtools.com/html/faq.html#python-c-extension-modules-new-as-of-picklingtools-1-6-0-and-1-3-3
But here's a quick summary: there are ocdumps and ocloads, very much like
pickle's dumps and loads:
from pyocser import ocdumps, ocloads
ser = ocdumps(pyobject)       # serialize pyobject into the string ser
pyobject = ocloads(ser)       # deserialize from the string ser back into pyobject
The OC serialization is 1.5-2x faster and also works with C++ (if you are mixing languages). It works with all built-in types, but not classes
(partly because it is cross-language and it's hard to build C++ classes
from Python).

Related

Trouble saving numpy array to MATLAB-readable file

I have an image sequence as a numpy array:
Mov, shape (15916, 480, 768), dtype = int16
I've tried using Mov.tofile(filename)
This saves the array and I can load it again in Python and view the images.
In MATLAB, the images are corrupted after about 3000 frames.
Using the following also works, but has the same problem when I retrieve the images in MATLAB:
fp = np.memmap(sbxpath, dtype='int16', mode='w+', shape=Mov.shape)
fp[:,:,:] = Mov[:,:,:]
If I use:
mv['mov'] = Mov
sio.savemat(sbxpath, mv)
I get the following error:
OverflowError: Python int too large to convert to C long
What am I doing wrong?
I'm sorry for this, because it is a beginner's problem. Python saves variables as integers or floats depending on how they are initialized, while MATLAB defaults to 8-byte doubles. My MATLAB script expected doubles, but my Python script was outputting all kinds of variable types, so naturally things got messed up.
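A tiny illustration of the dtype pitfall described above (a sketch; checking dtypes explicitly before handing data to MATLAB avoids the mismatch):
import numpy as np

# The dtype follows how the values are initialized, which is easy to overlook.
print(np.array([1, 2, 3]).dtype)                     # integer type on most platforms
print(np.array([1.0, 2, 3]).dtype)                   # float64, i.e. a MATLAB double
print(np.array([1, 2, 3], dtype=np.float64).dtype)   # or make the type explicit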

Generating TFRecord format data from C++

I'm trying to use the TFRecord format to record data from C++ and then use it in Python to feed a TensorFlow model.
TL;DR: Simply serializing proto messages into a stream doesn't satisfy the .tfrecord format requirements of the Python TFRecordDataset class. Is there an equivalent of the Python TFRecordWriter in C++ (either in TensorFlow or in the Google Protobuf libraries) to generate proper .tfrecord data?
Details:
The simplified C++ code looks like this:
tensorflow::Example sample;
sample.mutable_features()->mutable_feature()->operator[]("a").mutable_float_list()->add_value(1.0);
std::ofstream out;
out.open("cpp_example.tfrecord", std::ios::out | std::ios::binary);
sample.SerializeToOstream(&out);
In Python, to create a TensorFlow dataset I'm trying to use TFRecordDataset, but apparently it expects extra header/footer information in the .tfrecord file (rather than a simple list of serialized proto messages):
import tensorflow as tf
tfrecord_dataset = tf.data.TFRecordDataset(filenames="cpp_example.tfrecord")
next(tfrecord_dataset.as_numpy_iterator())
output:
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0 [Op:IteratorGetNext]
Note that there is nothing wrong with the recorded binary file, as the following code prints valid output:
import tensorflow as tf
p = open("cpp_example.tfrecord", "rb")
example = tf.train.Example.FromString(p.read())
output:
features {
  feature {
    key: "a"
    value {
      float_list {
        value: 1.0
      }
    }
  }
}
By comparing the binary output generated by my C++ example with output generated by the Python TFRecordWriter, I observed additional header and footer bytes in the content. Unfortunately, what these extra bytes represent is an implementation detail (probably compression type and some extra info), and I couldn't track it deeper than some class in the Python libraries that just exposes the interface from _pywrap_tfe.so.
There was this advice saying that .tfrecord is just normal Google protobuf data. Maybe I'm missing where to find a protobuf data writer (other than serializing proto messages into an output stream)?
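For reference, those extra bytes are the TFRecord record framing: each record is a little-endian uint64 length, a masked CRC32C of the length, the payload, and a masked CRC32C of the payload. A minimal Python sketch that walks this framing while ignoring the checksums:
import struct

def iter_tfrecord_payloads(path):
    # Each record: uint64 length (little-endian), 4-byte masked CRC32C of the
    # length, `length` bytes of payload, 4-byte masked CRC32C of the payload.
    # The CRCs are skipped here, so this is for inspection only, not validation.
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            (length,) = struct.unpack("<Q", header)
            f.read(4)                  # masked CRC of the length
            payload = f.read(length)   # the serialized tf.train.Example
            f.read(4)                  # masked CRC of the payload
            yield payload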
It turns out the tensorflow::io::RecordWriter class of the TensorFlow C++ library does the job.
#include <tensorflow/core/lib/io/record_writer.h>
#include <tensorflow/core/platform/default/posix_file_system.h>
#include <tensorflow/core/example/example.pb.h>
// ...
// Create WritableFile and instantiate RecordWriter.
tensorflow::PosixFileSystem posixFileSystem;
std::unique_ptr<tensorflow::WritableFile> writableFile;
posixFileSystem.NewWritableFile("cpp_example.tfrecord", &writableFile);
tensorflow::io::RecordWriter recordWriter(writableFile.get(), tensorflow::io::RecordWriterOptions::CreateRecordWriterOptions(""));
// ...
tensorflow::Example sample;
// ...
// Serialize proto message into a buffer and record in tfrecord format.
std::string buffer;
sample.SerializeToString(&buffer);
recordWriter.WriteRecord(buffer);
It would be helpful if this class were referenced somewhere in the TFRecord documentation.
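On the Python side, a file written this way should now load with the same TFRecordDataset snippet from the question (a quick check, assuming TensorFlow 2.x eager execution):
import tensorflow as tf

# This raised DataLossError on the raw proto dump, but works once the records
# carry the framing written by RecordWriter.
dataset = tf.data.TFRecordDataset(filenames="cpp_example.tfrecord")
for raw_record in dataset:
    example = tf.train.Example.FromString(raw_record.numpy())
    print(example)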

Best way to design a C-like struct in Python

First of all, I just started Python, yet I really tried hard to find what fits for me. The thing I am going to do is a simple file system for Linux, but to tell the truth I'm not even sure whether it is achievable with Python. So I need a bit of help here.
I tried both a class structure and named tuples (one at a time, to see which fits), and I decided classes would be better for me. The thing is, I couldn't read byte by byte because the size of my class was 888 while in C it was 44 (I used sys.getsizeof() for that). It will be easier to understand what I want to achieve with some code below.
For this structure
struct sb {
    int inode_bitmap;
    int data_bitmap[10];
};
I used
# SUPER BLOCK
class sb(object):
    __slots__ = ['inode_bitmap', 'data_bitmap']  # reduce RAM usage
    def __init__(bruh, inode_bitmap, data_bitmap):
        bruh.inode_bitmap = inode_bitmap
        bruh.data_bitmap = [None] * 10  # definition of the array
Everything was fine until I read it back:
FILE * fin = fopen("simplefs.bin", "r");
struct inode slash;
fseek(fin, sizeof(struct sb), SEEK_SET);
fread(&slash,sizeof(slash),1,fin);
fin = open("simplefs.bin", "rb")
slash = inode
print("pos:", fin.tell())
contents = fin.read(sys.getsizeof(sb))
print(contents)
The actual file size was something like 4800 bytes, but when I read it this way the size was approximately 318.
I am well aware that Python is not C, but I am just experimenting to see whether this is achievable.
You cannot design a class, read/write it to a file, and expect it to be binary-identical to the C struct. If you want to parse binary data, there is the struct module, which allows you to interpret the data you have read as ints, floats and a dozen other formats. You still have to write the format strings manually. In your particular case:
import struct

with open('datafile.dat', 'rb') as fin:
    raw_data = fin.read()

data = struct.unpack_from('11I', raw_data)  # 11 integers
inode_bitmap = data[0]
data_bitmap = data[1:]
Or something along those lines...
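For completeness, the write side is symmetric with struct.pack (a sketch using the same 11-integer layout; the values here are made up):
import struct

inode_bitmap = 0b1011       # made-up example values
data_bitmap = [0] * 10

# 11 unsigned 4-byte ints -> 44 bytes, matching sizeof(struct sb) in the C code
with open('datafile.dat', 'wb') as fout:
    fout.write(struct.pack('11I', inode_bitmap, *data_bitmap))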

Converting float32 to bit-equivalent int32

The Pillow module in Python insists on opening a 32-bit/pixel TIFF file I have as if the pixels were of type float32, whereas I believe the correct interpretation is unsigned int32. If I go ahead and load the data into a 640x512 array of type float32, how can I retype it as uint32 while preserving the underlying binary representation?
In Fortran and C, it's easy to have pointers or data structures of a different type pointing to the same block of physical memory, so that the raw memory contents can easily be interpreted according to whatever type I want. Is there an equivalent procedure in Python?
Sample follows (note that I have no information about compression etc.; the file in question was extracted by a commercial software program from a proprietary file format):
import numpy as np
from PIL import Image
infile = "20181016_071207_367_R.tif"
im = Image.open(infile)
data = np.array(im.getdata())
print(data)
[ -9.99117374 -10.36103535 -9.80696869 ... -18.41988373 -18.35027885
-18.69905663]
Assuming you have im.mode originally equal to F, you can force Pillow to re-load the same data under a different mode (a very unusual desire indeed) in a somewhat hackish way like this:
imnew = im.convert(mode='I')
imnew.frombytes(im.tobytes())
More generally (outside the context of PIL), whenever you encounter the need to deal with raw memory representation in Python, you should usually rely on numpy or Python's built-in memoryview class with the struct module.
Here is an example of reinterpreting an array of numpy float32 as int32:
import numpy as np

a = np.array([1.0, 2.0, 3.0], dtype='float32')
a_as_int32 = a.view('int32')
Here is an example of doing the same using memoryview:
import struct

# Create a memory buffer
b = bytearray(4*3)
# Write three floats
struct.pack_into('fff', b, 0, *[1.0, 2.0, 3.0])
# View the same memory as three ints
mem_as_ints = memoryview(b).cast('I')
The answer, in this case, is that Pillow is loading the image with the correct type (float32) as specified in the image exported from the thermal camera. There is no need to cast the image to integer, and doing so would give an incorrect result.

Is there a faster way to copy from a bytearray to a mmap slice in Python?

I am writing code for an addon to XBMC that copies an image provided in a bytearray to a slice of a mmap object. Using Kern's line profiler, the bottleneck in my code is when I copy the bytearray into the mmap object at the appropriate location. In essence:
length = 1920 * 1080 * 4
mymmap = mmap.mmap(0, length + 11, 'Name', mmap.ACCESS_WRITE)
image = capture.getImage() # This returns a bytearray that is 4*1920*1080 in size
mymmap[11:(11 + length)] = str(image) # This is the bottleneck according to the line profiler
I cannot change the data types of either the image or mmap. XBMC provides the image as a bytearray and the mmap interface was designed by a developer who won't change the implementation. It is NOT being used to write to a file, however - it was his way of getting the data out of XBMC and into his C++ application. I recognize that he could write an interface using ctypes that might handle this better, but he is not interested in further development. The python implementation in XBMC is 2.7.
I looked at the possibility of using ctypes (in a self-contained way within Python) with memmove, but I can't quite figure out how to convert the bytearray and the mmap slice into C structures that can be used with memmove, and I don't know whether that would be any faster. Any advice on a fast way to move these bytes between these two data types?
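A rough sketch of that ctypes/memmove idea (untested; it assumes the mmap object exposes a writable buffer, and the sizes and 'Name' tag mirror the question's snippet):
import ctypes
import mmap

length = 1920 * 1080 * 4
mymmap = mmap.mmap(0, length + 11, 'Name', mmap.ACCESS_WRITE)
image = bytearray(length)  # stand-in for capture.getImage()

# Wrap both buffers as ctypes arrays (no copy), then memmove the image into
# the mmap at offset 11, avoiding the intermediate copy made by str(image).
src = (ctypes.c_char * length).from_buffer(image)
dst = (ctypes.c_char * (length + 11)).from_buffer(mymmap)
ctypes.memmove(ctypes.addressof(dst) + 11, src, length)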
If the slice assignment to the mmap object is your bottleneck, I don't think anything can be done to improve the performance of your code. All the assignment does internally is call memcpy.
