Unable to load a previously dumped pickle file in Python

The algorithm I use is quite heavy and has three stages, so I used pickle to dump everything between the stages in order to test each stage separately.
Although the first dump always works fine, the second one behaves as if it were size dependent: it works for a smaller dataset but not for a somewhat larger one. (The same also happens with a heatmap I try to create, but that's a different question.) The dumped file is about 10 MB, so it's nothing really large.
The dump that creates the problem contains a whole class, which in turn contains methods, dictionaries, lists and variables.
I actually tried dumping both from inside and outside the class, but both failed.
The code I'm using looks like this:
data = pickle.load(open("./data/tmp/data.pck", 'rb')) #Reads from the previous stage dump and works fine.
dataEvol = data.evol_detect(prevTimeslots, xLablNum) #Export the class to dataEvol
dataEvolPck = open("./data/tmp/dataEvol.pck", "wb") #open works fine
pickle.dump(dataEvol, dataEvolPck, protocol = 2) #dump works fine
dataEvolPck.close()
and even tried this:
dataPck = open("./data/tmp/dataFull.pck", "wb")
pickle.dump(self, dataPck, protocol=2) #self here is the dataEvol in the previous part of code
dataPck.close()
The problem appears when I try to load the class using this part:
dataEvol = pickle.load(open("./data/tmp/dataEvol.pck", 'rb'))
The error at hand is:
Traceback (most recent call last):
File "<pyshell#0>", line 1, in <module>
dataEvol = pickle.load(open("./data/tmp/dataEvol.pck", 'rb'))
ValueError: itemsize cannot be zero
Any ideas?
I'm using Python 3.3 on a 64-bit Windows 7 computer. Please forgive me if I'm missing anything essential, as this is my first question.
Answer:
The problem was an empty numpy string in one of the dictionaries. Thanks Janne!!!

It is a NumPy bug that was fixed recently in this pull request. To reproduce it, try:
import pickle
import numpy as np
pickle.loads(pickle.dumps(np.string_('')))
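Until you are on a NumPy version that includes that fix, one possible workaround is to replace any empty NumPy strings with plain Python objects before dumping. A rough sketch of that idea (the dictionary and key names here are only illustrative):
import pickle
import numpy as np

d = {"label": np.string_(''), "count": 3}

# Convert empty NumPy byte strings to plain bytes before pickling, since
# unpickling np.string_('') triggers "ValueError: itemsize cannot be zero"
# on affected NumPy versions.
safe = {k: (bytes(v) if isinstance(v, np.bytes_) and len(v) == 0 else v)
        for k, v in d.items()}

data = pickle.dumps(safe, protocol=2)
print(pickle.loads(data))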

Related

Python read pickle protocol 4 error: STACK_GLOBAL requires str

In Python 3.7.5 on Ubuntu 18.04, reading a pickle (protocol 4) gives an error.
Sample code:
import pickle as pkl
file = open("sample.pkl", "rb")
data = pkl.load(file)
Error:
UnpicklingError                           Traceback (most recent call last)
----> 1 data = pickle.load(file)
UnpicklingError: STACK_GLOBAL requires str
Reading from the same file object solves the problem.
Reading the file with pandas gives the same problem.
I also had this error; it turned out I was opening a NumPy file with pickle. ;)
It turns out this is a known issue. There is an issue page on GitHub.
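If that is your situation (a .npy file being opened with pickle), the file should be read with NumPy instead. A minimal sketch, assuming the file was written with numpy.save and may contain Python objects:
import numpy as np

# np.load understands the .npy header; allow_pickle is only needed if the
# array contains Python objects.
data = np.load("sample.pkl", allow_pickle=True)
print(type(data))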
I had this problem and just added .pckl to the end of the file name.
My problem was that I was trying to pickle and un-pickle across different Python environments; watch out to make sure your pickle versions match!
Perhaps this will be the solution to this error for someone.
I needed to load a NumPy array but was calling:
torch.load(file)
When I loaded the array that way, this error appeared. All that is needed is to load it with NumPy and turn the array into a tensor.
For example:
result = torch.from_numpy(np.load(file))

Error "buffer is too small for requested array" when modify a fits file with python

I'm trying to create a new FITS file from an initial template.fits.
This template.fits has a table in extension 1 with 3915 rows, while my new file must have more than 50000 rows.
The relevant part of the code is the following:
hdulist = fits.open('/Users/Martina/Desktop/Ubuntu_Condivisa/Post_Doc_IAPS/ASTRI/ASTRI_scienceTools/Astrisim_MC/template.fits')
hdu0=hdulist[0]
hdu0.writeto(out_pile+'.fits', clobber=True)
hdu1=hdulist[1]
hdu1.header['NAXIS2'] = na
hdu1.header['ONTIME'] = tsec
hdu1.header['LIVETIME'] = tsec
hdu1.writeto(out_pile+'.fits', clobber=True)
hdu1_data=hdu1.data
for j in range(na-1):
    hdu1_data[j+1][1]=j+1
    hdu1_data[j+1][3]=t[j]+0.
    hdu1_data[j+1][7]=ra[j]
    hdu1_data[j+1][8]=dec[j]
    hdu1_data[j+1][21]=enetot[j]
hdu1.writeto(out_pile+'.fits', clobber=True)
When I try to fill the new table (the last part of the code), the error is the following:
Traceback (most recent call last):
File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\utils\decorators.py", line 734, in __get__
return obj.__dict__[self._key]
KeyError: 'data'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "Astrisim_MC_4.py", line 340, in
hdu1_data=hdu1.data
File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\utils\decorators.py", line 736, in __get__
val = self.fget(obj)
File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\io\fits\hdu\table.py", line 404, in data
data = self._get_tbdata()
File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\io\fits\hdu\table.py", line 171, in _get_tbdata
self._data_offset)
File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\io\fits\hdu\base.py", line 478, in _get_raw_data
return self._file.readarray(offset=offset, dtype=code, shape=shape)
File "C:\Users\Martina\AppData\Local\Programs\Python\Python36\lib\site-packages\astropy\io\fits\file.py", line 279, in readarray
buffer=self._mmap)
TypeError: buffer is too small for requested array
I tried to vary the number of rows and the code works correctly up to 3969 rows.
How can I solve the problem?
Thank you very much in advance,
cheers!
Martina
Your initial problem is where you did this:
hdu1.header['NAXIS2'] = na
It is a natural thing to think you might be able to do, but you actually should not. In general, when working with astropy.io.fits, one should almost never manually mess with keywords in the FITS header that describe the structure of the data itself. This stems in part from the design of FITS itself, which mixes these structural keywords in with metadata keywords, and in part from a design issue with astropy.io.fits, in that it lets you manipulate these keywords at all, or that it doesn't more tightly tie the data to them. I wrote about this issue at more length here: https://github.com/astropy/astropy/issues/3836 but never got around to adding more explanation of this to the documentation.
Basically the way you can think about it is that when a FITS file is opened, its header is first read and parsed into a Header object containing all the header keywords. Some book-keeping is also done to keep track of how much data is in the file after the header. Then when you access the data of the HDU the header keywords are used to determine what the type and shape of the data is. So by doing something like
hdu1.header['NAXIS2'] = na
hdu1_data = hdu1.data
this isn't somehow growing the data in the file. Instead, it's just confusing it into thinking there are more rows of data in the file than there actually are, hence the error "buffer is too small for requested array". The "buffer" it's referring to in this case is the rest of the data in the file, and you're requesting that it read an array that's longer than the data in the file.
The fact that it allows you to break this at all is a bug in Astropy, IMO. When the file is first opened it should save away all the correct structural keywords in the background, so that the data can still be loaded properly even if the user accidentally modifies these keywords (or perhaps the user should be completely prevented from modifying these keywords directly).
That's a long way to explain where you went wrong, but maybe it will help you better understand how the library works.
As to your actual question, I think @Evert's advice is good: use the higher-level and easier-to-work-with astropy.table to create a new table that's the size you need, and then copy the existing table into the new one. You can open the FITS table directly as a Table object with Table.read. I think you can also copy the FITS metadata over, but I'm not sure exactly the best way to do that.
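A rough, untested sketch of that Table-based approach, assuming the table lives in extension 1 of the template (the output file name is only illustrative):
import numpy as np
from astropy.table import Table

# Read the existing table from extension 1 of the template.
old = Table.read('template.fits', hdu=1)

# Allocate a structured array with the same columns but the desired length,
# and copy the existing rows into the top of it.
n_new = 50000
data = np.zeros(n_new, dtype=old.dtype)
data[:len(old)] = old.as_array()

# Wrap it in a Table (optionally carrying over the metadata) and write it out;
# the structural FITS keywords are generated automatically.
new = Table(data, meta=old.meta)
new.write('out_file.fits', overwrite=True)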
One other minor comment, unrelated to your main question: when working with arrays you don't have to (and in fact shouldn't) use for loops to perform vectorizable operations.
For example, since this is just looping over array indices:
for j in range(na-1):
    hdu1_data[j+1][1]=j+1
    hdu1_data[j+1][3]=t[j]+0.
    hdu1_data[j+1][7]=ra[j]
    hdu1_data[j+1][8]=dec[j]
    hdu1_data[j+1][21]=enetot[j]
you can write operations like this as whole-column assignments:
hdu1_data.field(1)[:] = np.arange(na)
hdu1_data.field(3)[:] = t + 0.
hdu1_data.field(7)[:] = ra
and so on (I'm not sure why you were using j+1, because that skips the first row, but the point still stands). This assumes, of course, that the array being updated (hdu1_data, in this case) already has na rows, which is why you need to grow or concatenate your array first if it's not already that size.

PySpark serialization EOFError

I am reading in a CSV as a Spark DataFrame and performing machine learning operations on it. I keep getting a Python serialization EOFError. Any idea why? I thought it might be a memory issue (i.e. the file exceeding available RAM), but drastically reducing the size of the DataFrame didn't prevent the EOF error.
Toy code and error below.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import RandomForestClassifier

#set spark context
conf = SparkConf().setMaster("local").setAppName("MyApp")
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
#read in 500mb csv as DataFrame
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
inferschema='true').load('myfile.csv')
#get dataframe into machine learning format
r_formula = RFormula(formula = "outcome ~ .")
mldf = r_formula.fit(df).transform(df)
#fit random forest model
rf = RandomForestClassifier(numTrees = 3, maxDepth = 2)
model = rf.fit(mldf)
result = model.transform(mldf).head()
Running the above code with spark-submit on a single node repeatedly throws the following error, even if the size of the DataFrame is reduced prior to fitting the model (e.g. tinydf = df.sample(False, 0.00001)):
Traceback (most recent call last):
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/serializers.py", line 545, in read_int
    raise EOFError
EOFError
The error appears to happen in the PySpark read_int function, the code for which is as follows (from the Spark site):
def read_int(stream):
    length = stream.read(4)
    if not length:
        raise EOFError
    return struct.unpack("!i", length)[0]
This would mean that when reading 4 bytes from the stream, if 0 bytes are read, an EOFError is raised. The Python docs are here.
I have faced the same issue and don't know how to debug it. It seems to cause the executor thread to get stuck and never return anything.
Have you checked to see where in your code the EOFError is arising?
My guess would be that it's coming as you attempt to define df, since that's the only place in your code where the file is actually being read:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
inferschema='true').load('myfile.csv')
At every point after this line, your code is working with the variable df, not the file itself, so it would seem likely that this line is generating the error.
A simple way to test if this is the case would be to comment out the rest of your code, and/or to place a line like this right after the line above:
print(df.count())
Another way would be to use a try block, like:
try:
    df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
        inferschema='true').load('myfile.csv')
except:
    print("Failed to load file into df!")
If it turns out that that line is the one generating the EOFError, then you're never getting the dataframes in the first place, so attempting to reduce them won't make a difference.
If that is the line generating the error, two possibilities come to mind:
Your code is opening one or both of the .csv files earlier on and isn't closing them prior to this line. If so, simply close them before this code.
There's something wrong with the .csv files themselves. Try loading them outside of this code, and see if you can get them into memory properly in the first place, using something like csv.reader, and manipulate them in ways you'd expect.
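For example, a minimal sanity check along those lines, assuming the same 'myfile.csv' used above:
import csv

# Read the file with the stdlib csv module, completely outside of Spark,
# to confirm the file itself can be parsed.
with open('myfile.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    n_rows = sum(1 for _ in reader)

print(header)
print("data rows:", n_rows)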

Loading previously saved JModelica result-file

I have the following question:
I am loading a JModelica model and simulating it easily by doing:
from pymodelica import compile_fmu
from pyfmi import load_fmu
model = load_fmu(SOME_FMU);
res=model.simulate();
Everything works fine, and it even saves a resulting .txt file. Now, the problem with this .txt is that I have not found any functionality so far within the JModelica Python packages to load such a .txt result file again later on into a result object (like the one returned by simulate()) so I can easily extract the previously saved data.
Implementing that by hand is of course possible, but I find it quite nasty and just wanted to ask if anyone knows of a method that does the job of loading that JModelica-format result file into a result object for me.
Thanks!!!!
The functionality that you need is located in the io module:
from pyfmi.common.io import ResultDymolaTextual
res = ResultDymolaTextual("MyResult.txt")
var = res.get_variable_data("MyVariable")
var.x #Trajectory
var.t #Corresponding time vector
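As a small usage example (matplotlib is assumed here just for illustration), you could plot the reloaded trajectory like this:
import matplotlib.pyplot as plt
from pyfmi.common.io import ResultDymolaTextual

# Reload the stored result and plot one variable against time.
res = ResultDymolaTextual("MyResult.txt")
var = res.get_variable_data("MyVariable")

plt.plot(var.t, var.x)
plt.xlabel("time [s]")
plt.ylabel("MyVariable")
plt.show()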

large csv file makes segmentation fault for numpy.genfromtxt

I'd really like to create a numpy array from a csv file; however, I'm having issues when the file is ~50k lines long (like the MNIST training set). The file I'm trying to import looks something like this:
0.0,0.0,0.0,0.5,0.34,0.24,0.0,0.0,0.0
0.0,0.0,0.0,0.4,0.34,0.2,0.34,0.0,0.0
0.0,0.0,0.0,0.34,0.43,0.44,0.0,0.0,0.0
0.0,0.0,0.0,0.23,0.64,0.4,0.0,0.0,0.0
It works fine for something that's 10k lines long, like the validation set:
import numpy as np
csv = np.genfromtxt("MNIST_valid_set_data.csv",delimiter = ",")
If I do the same with the training data (the larger file), I get a C-style segmentation fault. Does anyone know any better ways besides breaking the file up and then piecing it back together?
The end result is that I'd like to pickle the arrays into a similar mnist.pkl.gz file but I can't do that if I can't read in the data.
Any help would be greatly appreciated.
I think you really want to track down the actual problem and solve it, rather than just work around it, because I'll bet you have other problems with your NumPy installation that you're going to have to deal with eventually.
But, since you asked for a workaround that's better than manually splitting the files, reading them, and merging them, here are two:
First, you can split the files programmatically and dynamically, instead of manually. This avoids wasting a lot of your own human effort, and also saves the disk space needed for those copies, even though it's conceptually the same thing you already know how to do.
As the genfromtxt docs make clear, the fname argument can be a pathname, a file object (opened in 'rb' mode), or just a generator of lines (as bytes). Of course a file object is itself a generator of lines, but so is, say, an islice of a file object, or a group from a grouper. So:
import numpy as np
from more_itertools import grouper
def getfrombigtxt(fname, *args, **kwargs):
    with open(fname, 'rb') as f:
        return np.vstack(np.genfromtxt(group, *args, **kwargs)
                         for group in grouper(f, 5000, b''))
If you don't want to install more_itertools, you can also just copy the 2-line grouper implementation from the Recipes section of the itertools docs, or even inline it, zipping the iterators straight into your code.
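For reference, the grouper recipe from the itertools docs looks roughly like this:
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # Collect data into fixed-length chunks or blocks:
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)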
Alternatively, you can parse the CSV file with the stdlib's csv module instead of with NumPy:
import csv
import numpy as np
def getfrombigtxt(fname, delimiter=','):
    with open(fname, 'r') as f:  # note text mode, not binary
        rows = (list(map(float, row)) for row in csv.reader(f, delimiter=delimiter))
        return np.vstack(rows)
This is obviously going to be a lot slower… but if we're talking about turning 50ms of processing into 1000ms, and you only do it once, who cares?
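Either version is then called the same way you were calling genfromtxt; for example (the training-set file name here is just a guess based on the validation-set name above):
train = getfrombigtxt("MNIST_train_set_data.csv", delimiter=",")
print(train.shape)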
