pandas.DataFrame.load/save between python2 and python3: pickle protocol issues - python

I haven't figured out how to pickle load/save between Python 2 and 3 with pandas DataFrames. There is a 'protocol' option in the pickler that I've played with unsuccessfully, but I'm hoping someone has a quick idea for me to try. Here is the code to get the error:
python2.7
>>> import pandas; from pylab import *
>>> a = pandas.DataFrame(randn(10,10))
>>> a.save('a2')
>>> a = pandas.DataFrame.load('a2')
>>> a = pandas.DataFrame.load('a3')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/generic.py", line 30, in load
return com.load(path)
File "/usr/local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/common.py", line 1107, in load
return pickle.load(f)
ValueError: unsupported pickle protocol: 3
python3
>>> import pandas; from pylab import *
>>> a = pandas.DataFrame(randn(10,10))
>>> a.save('a3')
>>> a = pandas.DataFrame.load('a3')
>>> a = pandas.DataFrame.load('a2')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.3/site-packages/pandas-0.10.1-py3.3-linux-x86_64.egg/pandas/core/generic.py", line 30, in load
return com.load(path)
File "/usr/local/lib/python3.3/site-packages/pandas-0.10.1-py3.3-linux-x86_64.egg/pandas/core/common.py", line 1107, in load
return pickle.load(f)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf4 in position 0: ordinal not in range(128)
Maybe expecting pickle to work between Python versions is a bit optimistic?

I had the same problem. You can change the protocol of the dataframe pickle file with the following function in python3:
import pickle
def change_pickle_protocol(filepath,protocol=2):
    with open(filepath,'rb') as f:
        obj = pickle.load(f)
    with open(filepath,'wb') as f:
        pickle.dump(obj,f,protocol=protocol)
Then you should be able to open it in python2 no problem.
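To check that the rewrite did what you wanted, you can look at the first two bytes of the file: pickles written with protocol 2 or higher start with the PROTO opcode (0x80) followed by the protocol number. Below is a minimal sketch under that assumption. The helper pickle_protocol_of and the demo file name are made up for illustration, and a plain dict stands in for the DataFrame.

```python
import os
import pickle
import tempfile

def change_pickle_protocol(filepath, protocol=2):
    """Re-save the pickle at filepath using the given protocol."""
    with open(filepath, 'rb') as f:
        obj = pickle.load(f)
    with open(filepath, 'wb') as f:
        pickle.dump(obj, f, protocol=protocol)

def pickle_protocol_of(filepath):
    """Hypothetical helper: protocol number of a protocol >= 2 pickle, else None."""
    with open(filepath, 'rb') as f:
        header = f.read(2)
    # Protocol 2+ pickles begin with the PROTO opcode (0x80) and a version byte.
    if len(header) == 2 and header[0:1] == b'\x80':
        return header[1]
    return None

# Demo: a plain dict stands in for the DataFrame.
path = os.path.join(tempfile.mkdtemp(), 'demo.pkl')
with open(path, 'wb') as f:
    pickle.dump({'a': [1, 2, 3]}, f, protocol=3)

before = pickle_protocol_of(path)
change_pickle_protocol(path, protocol=2)
after = pickle_protocol_of(path)
print(before, after)
```

Protocols 0 and 1 have no such header, which is why the helper returns None for them.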

If you use pandas.DataFrame.to_pickle(), you can modify the pandas source code as follows so that the pickle protocol can be set:
1) In source file /pandas/io/pickle.py (before modification copy the original file as /pandas/io/pickle.py.ori) search for the following lines:
def to_pickle(obj, path):
    pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
Change these lines to:
def to_pickle(obj, path, protocol=pkl.HIGHEST_PROTOCOL):
    pkl.dump(obj, f, protocol=protocol)
2) In source file /pandas/core/generic.py (before modification copy the original file as /pandas/core/generic.py.ori) search for the following lines:
def to_pickle(self, path):
    return to_pickle(self, path)
Change these lines to:
def to_pickle(self, path, protocol=None):
    return to_pickle(self, path, protocol)
3) Restart your Python kernel if it is running, then save your DataFrame using any available pickle protocol (0, 1, 2, 3, 4):
# Python 2.x can read this
df.to_pickle('my_dataframe.pck', protocol=2)
# no protocol given: None is passed through, so pickle falls back to its
# default protocol (3 on Python 3.0-3.7), which Python 2.x cannot read
df.to_pickle('my_dataframe.pck')
4) After upgrading pandas, repeat steps 1 and 2.
5) (optional) Ask the developers to add this capability to official releases (because your code will throw an exception in any other Python environment without these changes)
Nice day!

You can override the highest protocol available for the pickle package:
import pickle as pkl
import pandas as pd
if __name__ == '__main__':
    # this constant is defined in pickle.py in the pickle package:
    pkl.HIGHEST_PROTOCOL = 2
    # 'foo.pkl' was saved in pickle protocol 4
    df = pd.read_pickle(r"C:\temp\foo.pkl")
    # 'foo_protocol_2' will be saved in pickle protocol 2
    # and can be read in pandas with Python 2
    df.to_pickle(r"C:\temp\foo_protocol_2.pkl")
This is definitely not an elegant solution but it does the work without changing pandas code directly.
UPDATE: I found that newer versions of pandas allow you to specify the pickle protocol in the .to_pickle function:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_pickle.html
DataFrame.to_pickle(path, compression='infer', protocol=4)
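The answers above deal with the "unsupported pickle protocol: 3" error. The other error in the question, the UnicodeDecodeError raised when Python 3 loads the Python 2 file, comes from Python 3 decoding Python 2 byte strings as ASCII by default. pickle.load and pickle.loads accept an encoding argument for this case. The sketch below reproduces the error with a tiny hand-written protocol 2 pickle (an assumption made purely for illustration, not a real DataFrame file). I have not verified that a DataFrame saved by pandas 0.10.1 loads cleanly this way.

```python
import pickle

# Hand-written protocol 2 pickle of the Python 2 byte string '\xf4':
# PROTO 2, SHORT_BINSTRING of length 1, BINPUT 0, STOP.
py2_pickle = b'\x80\x02U\x01\xf4q\x00.'

# Default behaviour on Python 3: the byte string is decoded as ASCII and fails.
try:
    pickle.loads(py2_pickle)
    default_error = None
except UnicodeDecodeError as exc:
    default_error = exc

# latin1 maps every byte to a code point, so decoding cannot fail.
as_text = pickle.loads(py2_pickle, encoding='latin1')

# encoding='bytes' keeps Python 2 str objects as bytes instead of decoding them.
as_bytes = pickle.loads(py2_pickle, encoding='bytes')

print(type(default_error).__name__, repr(as_text), repr(as_bytes))
```

With a file on disk the same argument applies: pickle.load(f, encoding='latin1').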

Related

Unable to read dicom file with Python3 and pydicom

I am trying to read a DICOM file with Python 3 and the pydicom library. For some DICOM files, I can't read the data correctly and get error messages when I try to print the result of pydicom.dcmread.
However, the same files work fine with Python 2. I checked the meta information and compared it with other DICOM files that can be processed, but I didn't find any difference between them.
import pydicom
ds = pydicom.dcmread("xxxxx.dicom")
print(ds)
Traceback (most recent call last):
File "generate_train_data.py", line 387, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "generate_train_data.py", line 371, in main
create_ann()
File "generate_train_data.py", line 368, in create_ann
ds_ann_dir, case_name, merge_channel=False)
File "generate_train_data.py", line 290, in process_dcm_set
all_dcms, dcm_truth_infos = convert_dicoms(dcm_list, zs)
File "generate_train_data.py", line 179, in convert_dicoms
instance_num, pixel_spacing, img_np = extract_info(dcm_path)
File "generate_train_data.py", line 147, in extract_info
print(ds)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2277-2279: ordinal not in range(128)
Has anyone come across the same problem?
Can you give an example for such a dicom file? When running the pydicom example with python 3.7 it's working perfectly:
import matplotlib.pyplot as plt
import pydicom
from pydicom.data import get_testdata_files
filename = get_testdata_files("CT_small.dcm")[0]
ds = pydicom.dcmread(filename)
plt.imshow(ds.pixel_array, cmap=plt.cm.bone)
It's also working with the sample dicom files from the Medical Image Samples.
I believe the cause of the problem is that Python cannot work out how to print a string containing a non-ASCII character because of the locale setting (for me it only happened with Python 3 on CentOS 7.6 Linux, printing to a terminal window on macOS). You can run the locale command to see the current settings. Mine started out with everything set to "C". After I set the LANG environment variable to en_US.UTF-8, printing worked for me.
In csh this is done using
setenv LANG en_US.UTF-8
In bash use:
export LANG=en_US.UTF-8
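If you want to confirm the diagnosis from inside Python before changing the environment, something along these lines may help. It is only a sketch: describe_output_encoding and ascii_safe are names made up here, and the sample string is an assumption standing in for a DICOM value.

```python
import locale
import sys

def describe_output_encoding():
    """Report the encodings Python will use when printing text."""
    return {
        'stdout': getattr(sys.stdout, 'encoding', None),
        'preferred': locale.getpreferredencoding(False),
    }

def ascii_safe(text):
    """Return text with non-ASCII characters replaced by backslash escapes."""
    return text.encode('ascii', errors='backslashreplace').decode('ascii')

# Under a "C" locale these encodings often come out as ASCII, which is what
# leads to the UnicodeEncodeError when print() meets a non-ASCII character.
print(describe_output_encoding())

# '\u00b5' is the micro sign; the escaped form prints under any locale.
print(ascii_safe('\u00b5-map'))
```

The escaped output is a workaround for inspecting values. Setting LANG as described above is still the actual fix.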
My problem resulted from having 'µ' in the Series Description element. The file was an attenuation map from a SPECT reconstruction on a Siemens scanner. I used the following Python code to help figure out the problem.
#! /usr/bin/env python3
import pydicom as dicom
from sys import exit, argv
def myprint(ds, indent=0):
    """Go through all items in the dataset and print them with custom format
    Modelled after Dataset._pretty_str()
    """
    dont_print = ['Pixel Data', 'File Meta Information Version']
    indent_string = " " * indent
    next_indent_string = " " * (indent + 1)
    for data_element in ds:
        if data_element.VR == "SQ": # a sequence
            print(indent_string, data_element.name)
            for sequence_item in data_element.value:
                myprint(sequence_item, indent + 1)
                print(next_indent_string + "---------")
        else:
            if data_element.name in dont_print:
                print("""<item not printed -- in the "don't print" list>""")
            else:
                repr_value = repr(data_element.value)
                if len(repr_value) > 50:
                    repr_value = repr_value[:50] + "..."
                try:
                    print("{0:s} {1:s} = {2:s}".format(indent_string,
                                                       data_element.name,
                                                       repr_value))
                except:
                    print(data_element.name,'****Error printing value')
for f in argv[1:]:
    ds = dicom.dcmread(f)
    myprint(ds, indent=1)
This is based on an existing myprint example function.
The code tries to print out all the data items. It catches exceptions and prints "****Error printing value" when there is an error.

pyth error [rtf to xml/html]

I am trying to convert RTF to XML/xhtml using python 3.6.1.
Python Code: https://github.com/brendonh/pyth/blob/master/examples/reading/rtf15.py
import sys
import os.path
from pyth.plugins.rtf15.reader import Rtf15Reader
from pyth.plugins.xhtml.writer import XHTMLWriter
if len(sys.argv) > 1:
    filename = sys.argv[1]
else:
    filename = os.path.normpath(os.path.join(os.path.dirname(__file__),'../../tests/rtfs/sample.rtf'))
doc = Rtf15Reader.read(open(filename, "rb"))
print(XHTMLWriter.write(doc, pretty=True).read())
Error:
Traceback (most recent call last):
File "C:\xx\file1.py", line 14, in <module>
from pyth.plugins.rtf15.reader import Rtf15Reader
File "C:\Python 3.6.1\lib\site-packages\pyth\plugins\rtf15\reader.py", line 594
match = re.match(ur'HYPERLINK "(.*)"', destination)
^
SyntaxError: invalid syntax
May I know how to solve the syntax issue?
Thank you.
Please check the link:
https://pypi.python.org/pypi/pyth/0.6.0
The pyth package only supports Python 2.x; it does not work on Python 3.x.
PS:
In the sample code, the line
print XHTMLWriter.write(doc, pretty=True).read()
is Python 2.x syntax, not Python 3.x. Please check.
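The SyntaxError in the traceback points at a ur'...' string literal. Python 2 accepts that prefix, but Python 3 does not (it accepts u'...' and r'...' separately, not the combination). You can check this without installing pyth. The compiles helper below is a made-up name for illustration, and the source strings are modelled on the line shown in the traceback.

```python
def compiles(source):
    """Return True if source is syntactically valid for the running interpreter."""
    try:
        compile(source, '<example>', 'exec')
        return True
    except SyntaxError:
        return False

# The line from reader.py, reduced to the string literal that matters.
old_style = "pattern = ur'HYPERLINK \"(.*)\"'"
# The same literal with the Python 2-only 'u' dropped.
new_style = "pattern = r'HYPERLINK \"(.*)\"'"

print(compiles(old_style))  # False on Python 3
print(compiles(new_style))  # True on Python 3
```

Patching that one literal would only get you past this particular error. The package as a whole targets Python 2.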

Python cPickle unable to load an OCR model library

I have just installed the ocropus OCR package with all its dependencies on my Windows 7 machine (I am using 32-bit Python 2.7). It seems to work fine, except that I cannot load the default OCR model, en-default.pyrnn.gz, and I get a traceback. I am using the following command:
python ocropus-rpred -m en-default.pyrnn.gz book\0001\*.png
here is the error
INFO: #inputs47
# loading object /usr/local/share/ocropus/en-default.pyrnn.gz
Traceback (most recent call last):
File "ocropus-rpred" line 109, in <module>
network = ocrolib.load_object(args.model,verbose=1)
File "C:\anaconda32\lib\site-packages\ocrolib\common.py", line 513, in load_object
return unpickler.load()
EOFError
I have checked that the file is not empty, double-checked that the binary mode flags ("wb" and "rb") are set, and converted the newlines of common.py using dos2unix. I am still unable to solve this problem. If anyone has experienced similar issues, kindly share.
import cPickle
import gzip
def save_object(fname,obj,zip=0):
    if zip==0 and fname.endswith(".gz"):
        zip = 1
    if zip>0:
        # with gzip.GzipFile(fname,"wb") as stream:
        with os.popen("gzip -9 > '%s'"%fname,"wb") as stream:
            cPickle.dump(obj,stream,2)
    else:
        with open(fname,"wb") as stream:
            cPickle.dump(obj,stream,2)
def unpickle_find_global(mname,cname):
    if mname=="lstm.lstm":
        return getattr(lstm,cname)
    if not mname in sys.modules.keys():
        exec "import "+mname
    return getattr(sys.modules[mname],cname)
def load_object(fname,zip=0,nofind=0,verbose=0):
    """Loads an object from disk. By default, this handles zipped files
    and searches in the usual places for OCRopus. It also handles some
    class names that have changed."""
    if not nofind:
        fname = ocropus_find_file(fname)
    if verbose:
        print "# loading object",fname
    if zip==0 and fname.endswith(".gz"):
        zip = 1
    if zip>0:
        # with gzip.GzipFile(fname,"rb") as stream:
        with os.popen("gunzip < '%s'"%fname,"rb") as stream:
            unpickler = cPickle.Unpickler(stream)
            unpickler.find_global = unpickle_find_global
            return unpickler.load()
    else:
        with open(fname,"rb") as stream:
            unpickler = cPickle.Unpickler(stream)
            unpickler.find_global = unpickle_find_global
            return unpickler.load()
UPDATE: Hi, please note that I have used Python's native gzip, and it is working fine. Thank you for pointing that out. Here is the correct syntax that is working on Windows: {with gzip.GzipFile(fname,"rb") as stream:}
Your use of gunzip (in the load_object function) is incorrect. Unless passed the -c argument, gunzip writes the decompressed data to a new file, not to its stdout (which is what you seem to be attempting to do).
As a result, it doesn't write anything to its stdout, and your stream variable contains no data, hence the EOFError.
A quick fix is to change your gunzip command line to give it the -c argument.
More info here: http://linux.die.net/man/1/gzip
That said, why are you even shelling out to gunzip to decompress your data? Python's built-in gzip module should handle that without problems.
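For reference, here is roughly what the load/save pair looks like when it relies on the gzip module instead of an external gzip/gunzip process. This is a simplified sketch, not the ocrolib code: it is written for Python 3 (pickle rather than cPickle), it leaves out the find_global hook and ocropus_find_file, and the file name and sample object are made up.

```python
import gzip
import os
import pickle
import tempfile

def save_object(fname, obj):
    """Pickle obj to fname, gzip-compressing when the name ends in .gz."""
    opener = gzip.open if fname.endswith('.gz') else open
    with opener(fname, 'wb') as stream:
        pickle.dump(obj, stream, 2)

def load_object(fname):
    """Load a pickled object, transparently decompressing .gz files."""
    opener = gzip.open if fname.endswith('.gz') else open
    with opener(fname, 'rb') as stream:
        return pickle.load(stream)

# Round trip with a stand-in object; no external gzip binary is needed,
# so this behaves the same on Windows and Linux.
path = os.path.join(tempfile.mkdtemp(), 'model.pyrnn.gz')
save_object(path, {'weights': [0.1, 0.2]})
restored = load_object(path)
print(restored)
```

Because nothing is shelled out, there is no dependency on gzip or gunzip being installed or on how the shell handles quoting.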

Muscle alignment in python

I have a problem with printing my output from muscle aligning in python. My code is:
from Bio.Align.Applications import MuscleCommandline
from StringIO import StringIO
from Bio import AlignIO
def align_v1 (Fasta):
    muscle_cline = MuscleCommandline(input="hiv_protease_sequences_w_wt.fasta")
    stdout, stderr = muscle_cline()
    MultipleSeqAlignment = AlignIO.read(StringIO(stdout), "fasta")
    print MultipleSeqAlignment
Any help?
It would be nice to know what error you received, but the following should solve your problem:
from Bio.Align.Applications import MuscleCommandline
from StringIO import StringIO
from Bio import AlignIO
muscle_exe = r"C:\muscle3.8.31_i86win32.exe" #specify the location of your muscle exe file
input_sequences = "hiv_protease_sequences_w_wt.fasta"
output_alignment = "output_alignment.fasta"
def align_v1 (Fasta):
    muscle_cline = MuscleCommandline(muscle_exe, input=Fasta, out=output_alignment)
    stdout, stderr = muscle_cline()
    MultipleSeqAlignment = AlignIO.read(output_alignment, "fasta")
    print MultipleSeqAlignment
align_v1(input_sequences)
In my case I received a ValueError:
>>> AlignIO.read(StringIO(stdout), "fasta")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\WinPython-64bit-3.3.2.3\python-3.3.2.amd64\lib\site-packages\Bio\AlignIO\__init__.py", line 427, in read
raise ValueError("No records found in handle")
ValueError: No records found in handle
This could be avoided by saving the output and reopening with AlignIO.read.
I also received a FileNotFoundError that could be avoided by specifying the location of the muscle exe file. eg:
muscle_exe = r"C:\muscle3.8.31_i86win32.exe"
The instructions for this are shown in help(MuscleCommandline), but this is not currently in the Biopython tutorial page.
Finally, I am assuming you want to run the command using different input sequences, so I modified the function to the format “function_name(input_file).”
I used python 3.3. Hopefully the code above is for python 2.x as in your original post. For python 3.x, change "from StringIO import StringIO" to "from io import StringIO" and of course “print MultipleSeqAlignment” to “print(MultipleSeqAlignment)”.

python os.sys.stdin.buffer.read failed if given buffer length

import os
s = os.sys.stdin.buffer.read(1024*32)
failed with
D:\Projects\pytools>python t1.py
Traceback (most recent call last):
File "t1.py", line 2, in <module>
s = os.sys.stdin.buffer.read(1024*32)
OSError: [Errno 12] Not enough space
but if given buflen = 1024*32-1, it works:
import os
s = os.sys.stdin.buffer.read(1024*32-1)
If you run python t1.py, the process blocks and waits for input, as expected.
Why does Python 3.3 have a 1024*32-1 buffer length limitation? Is it system-dependent, or the same on all systems? How can we remove this limitation?
BTW: I am using Windows 7 with the 32-bit version of Python 3.3.
We start by looking at the source of os module here, where line 26 reads
import sys, errno
This tells us that os.sys is just a reference to the standard sys module.
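You can confirm this from the interpreter without reading any source. A quick check:

```python
import os
import sys

# os imports sys at module level, so os.sys is the very same module object.
same_module = os.sys is sys
# Consequently os.sys.stdin is just sys.stdin.
same_stream = os.sys.stdin is sys.stdin
print(same_module, same_stream)
```

So the question's code is equivalent to calling sys.stdin.buffer.read(1024*32) directly.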
Then we head over to the source of the sys module, where in line 1593 we find the following comment (thankfully someone put it there...):
/* stdin/stdout/stderr are now set by pythonrun.c */
Then we go to the pythonrun.c file, where we meet the following code in line 1086:
std = create_stdio(iomod, fd, 0, "<stdin>", encoding, errors);
and this on line 1091:
PySys_SetObject("stdin", std);
Then we look for definition of create_stdio() function which we find in line 910. We look for the return value of this function which is on line 999 and looks like this:
return stream;
Now we have to find out what the stream is. It's the return value of function _PyObject_CallMethodId() called in line 984.
I hope you see the flow - try to follow from here.
