PyTables ValueError on string column with newer pandas - python

Problem writing pandas dataframe (timeseries) to HDF5 using pytables/tstables:
import pandas
import tables
import tstables
# example dataframe
valfloat = [512.3, 918.8]
valstr = ['abc','cba']
tstamp = [1445464064, 1445464013]
df = pandas.DataFrame(data = zip(valfloat, valstr, tstamp), columns = ['colfloat', 'colstr', 'timestamp'])
df.set_index(pandas.to_datetime(df['timestamp'].astype(int), unit='s'), inplace=True)
df.index = df.index.tz_localize('UTC')
colsel = ['colfloat', 'colstr']
dftoadd = df[colsel].sort_index()
# try string conversion from object-type (no type mixing here ?)
##dftoadd.loc[:,'colstr'] = dftoadd['colstr'].map(str)
h5fname = 'df.h5'
# class to use as tstable description
class TsExample(tables.IsDescription):
timestamp = tables.Int64Col(pos=0)
colfloat = tables.Float64Col(pos=1)
colstr = tables.StringCol(itemsize=8, pos=2)
# create new time series
h5f = tables.open_file(h5fname, 'a')
ts = h5f.create_ts('/','example',TsExample)
# append to HDF5
ts.append(dftoadd, convert_strings=True)
# save data and close file
h5f.flush()
h5f.close()
Exception:
ValueError: rows parameter cannot be converted into a recarray object
compliant with table tstables.tstable.TsTable instance at ...
The error was: cannot view Object as non-Object type
While this particular error happens with TsTables, the code chunk responsible for it is identical to PyTables try-section here.
The error is happening after I upgraded pandas to 0.17.0; the same code was running error-free with 0.16.2.
NOTE: if a string column is excluded then everything works fine, so this problem must be related to string-column type representation in the dataframe.
The issue could be related to this question. Is there some conversion required for 'colstr' column of the dataframe that I am missing?

This is not going to work with a newer pandas as the index is timezone aware, see here
You can:
convert to a type PyTables understands, this would require localizing
use HDFStore to write the frame
Note that what you are doing is the reason HDFStore exists in the first place, to make reading/writing pyTables friendly for pandas objects. Doing this 'manually' is full of pitfalls.

Related

pytables and pandas string padding question

I've created a dataset using hdf5cpp library with a fixed size string (requirement). However when loading with pytables or pandas the strings are always represented like:
b'test\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff
The string value of 'test' with the padding after it. Does anyone know a way to suppress or not show this padding data? I really just want 'test' shown. I realise this may be correct behaviour.
My hdf5cpp setup for strings:
strType = H5Tcopy(H5T_C_S1);
status = H5Tset_size(strType, 36);
H5Tset_strpad(strType, H5T_STR_NULLTERM);
I can't help with your C Code. It is possible to work with padded strings in Pytables. I can read data written by a C application that creates a struct array of mixed types, including padded strings. (Note: there was an issue related to copying a NumPy struct array with padding. It was fixed in 3.5.0. Read this for details: PyTables GitHub Pull 720.)
Here is an example that shows proper string handling with a file created by PyTables. Maybe it will help you investigate your problem. Checking the dataset's properties would be a good start.
import tables as tb
import numpy as np
arr = np.empty((10), 'S10')
arr[0]='test'
arr[1]='one'
arr[2]='two'
arr[3]='three'
with tb.File('SO_63184571.h5','w') as h5f:
ds = h5f.create_array('/', 'testdata', obj=arr)
print (ds.atom)
for i in range(4):
print (ds[i])
print (ds[i].decode('utf-8'))
Example below added to demonstrate compound dataset with int and fixed string. This is called a Table in PyTables (Arrays always contain homogeneous values). This can be done a number of ways. I show the 2 methods I prefer:
Create a record array and reference with the description= or
obj= parameter. This is useful when already have all of your data AND it will fit in memory.
Create a record array dtype and reference with the description=
parameter. Then add the data with the .append() method. This is
useful when all of your data will NOT fit in memory, OR you need to add data to an existing table.
Code below:
recarr_dtype = np.dtype(
{ 'names': ['ints', 'strs' ],
'formats': [int, 'S10'] } )
a = np.arange(5)
b = np.array(['a', 'b', 'c', 'd', 'e'])
recarr = np.rec.fromarrays((a, b), dtype=recarr_dtype)
with tb.File('SO_63184571.h5','w') as h5f:
ds1 = h5f.create_table('/', 'compound_data1', description=recarr)
for i in range(5):
print (ds1[i]['ints'], ds1[i]['strs'].decode('utf-8'))
ds2 = h5f.create_table('/', 'compound_data2', description=recarr_dtype)
ds2.append(recarr)
for i in range(5):
print (ds2[i]['ints'], ds2[i]['strs'].decode('utf-8'))

How to save a pandas DataFrame with custom types using pyarrow and parquet

I want to save a pandas DataFrame to parquet, but I have some unsupported types in it (for example bson ObjectIds).
Throughout the examples we use:
import pandas as pd
import pyarrow as pa
Here's a minimal example to show the situation:
df = pd.DataFrame(
[
{'name': 'alice', 'oid': ObjectId('5e9992543bfddb58073803e7')},
{'name': 'bob', 'oid': ObjectId('5e9992543bfddb58073803e8')},
]
)
df.to_parquet('some_path')
And we get:
ArrowInvalid: ('Could not convert 5e9992543bfddb58073803e7 with type ObjectId: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column oid with type object')
I tried to follow this reference: https://arrow.apache.org/docs/python/extending_types.html
Thus I wrote the following type extension:
class ObjectIdType(pa.ExtensionType):
def __init__(self):
pa.ExtensionType.__init__(self, pa.binary(12), "my_package.objectid")
def __arrow_ext_serialize__(self):
# since we don't have a parametrized type, we don't need extra
# metadata to be deserialized
return b''
#classmethod
def __arrow_ext_deserialize__(self, storage_type, serialized):
# return an instance of this subclass given the serialized
# metadata.
return ObjectId()
And was able to get a working pyarray for my oid column:
values = df['oid']
storage_array = pa.array(values.map(lambda oid: oid.binary), type=pa.binary(12))
pa.ExtensionArray.from_storage(objectid_type, storage_array)
Now where I’m stuck, and cannot find any good solution on the internet, is how to save my df to parquet, letting it interpret which column needs which Extension. I might change columns in the future, and I have several different types that need this treatment.
How can I simply create parquet file from dataframes and restore them while transparently converting the types ?
I tried to create a pyarrow.Table object, and append columns to it after preprocessing, but it doesn’t work as table.append_column takes binary columns and not pyarrow.Arrays, plus the whole isinstance thing looks like a terrible solution.
table = pa.Table.from_pandas(pd.DataFrame())
for col, values in test_df.iteritems():
if isinstance(values.iloc[0], ObjectId):
arr = pa.array(
values.map(lambda oid: oid.binary), type=pa.binary(12)
)
elif isinstance(values.iloc[0], ...):
...
else:
arr = pa.array(values)
table.append_column(arr, col) # FAILS (wrong type)
Pseudocode of the ideal solution:
parquetize(df, path, my_custom_types_conversions)
# ...
new_df = unparquetize(path, my_custom_types_conversions)
assert df.equals(new_df) # types have been correctly restored
I’m getting lost in pyarrow’s doc on if I should use ExtensionType, serialization or other things to write these functions. Any pointer would be appreciated.
Side note, I do not need parquet at all means, the main issue is to being able to save and restore dataframes with custom types quickly and space efficiently. I tried a solution based on jsonifying and gziping the dataframe, but it was too slow.
I think it is probably because the 'ObjectId' is not a defined keyword in python hence it is throwing up this exception in type conversion.
I tried the example you provided and tried by casting the oid values as string type during dataframe creation and it worked.
Check below the steps:
df = pd.DataFrame(
[
{'name': 'alice', 'oid': "ObjectId('5e9992543bfddb58073803e7')"},
{'name': 'bob', 'oid': "ObjectId('5e9992543bfddb58073803e8')"},
]
)
df.to_parquet('parquet_file.parquet')
df1 = pd.read_parquet('parquet_file.parquet',engine='pyarrow')
df1
output:
name oid
0 alice ObjectId('5e9992543bfddb58073803e7')
1 bob ObjectId('5e9992543bfddb58073803e8')
You could write a method that reads the column names and types and outputs a new DF with the columns converted to compatible types, using a switch-case pattern to choose what type to convert column to (or whether to leave it as is).

Error when trying to save hdf5 row where one column is a string and the other is an array of floats

I have two column, one is a string, and the other is a numpy array of floats
a = 'this is string'
b = np.array([-2.355, 1.957, 1.266, -6.913])
I would like to store them in a row as separate columns in a hdf5 file. For that I am using pandas
hdf_key = 'hdf_key'
store5 = pd.HDFStore('file.h5')
z = pd.DataFrame(
{
'string': [a],
'array': [b]
})
store5.append(hdf_key, z, index=False)
store5.close()
However, I get this error
TypeError: Cannot serialize the column [array] because
its data contents are [mixed] object dtype
Is there a way to store this to h5? If so, how? If not, what's the best way to store this sort of data?
I can't help you with pandas, but can show you how do this with pytables.
Basically you create a table referencing either a numpy recarray or a dtype that defines the mixed datatypes.
Below is a super simple example to show how to create a table with 1 string and 4 floats. Then it adds rows of data to the table.
It shows 2 different methods to add data:
1. A list of tuples (1 tuple for each row) - see append_list
2. A numpy recarray (with dtype matching the table definition) -
see simple_recarr in the for loop
To get the rest of the arguments for create_table(), read the Pytables documentation. It's very helpful, and should answer additional questions. Link below:
Pytables Users's Guide
import tables as tb
import numpy as np
with tb.open_file('SO_55943319.h5', 'w') as h5f:
my_dtype = np.dtype([('A','S16'),('b',float),('c',float),('d',float),('e',float)])
dset = h5f.create_table(h5f.root, 'table_data', description=my_dtype)
# Append one row using a list:
append_list = [('test string', -2.355, 1.957, 1.266, -6.913)]
dset.append(append_list)
simple_recarr = np.recarray((1,),dtype=my_dtype)
for i in range(5):
simple_recarr['A']='string_' + str(i)
simple_recarr['b']=2.0*i
simple_recarr['c']=3.0*i
simple_recarr['d']=4.0*i
simple_recarr['e']=5.0*i
dset.append(simple_recarr)
print ('done')

Read Shapefiles into Dataframe

I have a shapefile that I would like to convert into dataframe in Python 3.7. I have tried the following codes:
import pandas as pd
import shapefile
sf_path = r'data/shapefile'
sf = shapefile.Reader(sf_path, encoding = 'Shift-JIS')
fields = [x[0] for x in sf.fields][1:]
records = sf.records()
shps = [s.points for s in sf.shapes()]
sf_df = pd.DataFrame(columns = fields, data = records)
But I got this error message saying
TypeError: Expected list, got _Record
So how should I convert the list to _Record or is there a way around it? I have tried GeoPandas too, but had some trouble installing it. Thanks!
def read_shapefile(sf_shape):
"""
Read a shapefile into a Pandas dataframe with a 'coords'
column holding the geometry information. This uses the pyshp
package
"""
fields = [x[0] for x in sf_shape.fields][1:]
records = [y[:] for y in sf_shape.records()]
#records = sf_shape.records()
shps = [s.points for s in sf_shape.shapes()]
df = pd.DataFrame(columns=fields, data=records)
df = df.assign(coords=shps)
return df
I had the same problem and this is because the .shp file has a kind of key field in each record and when converting to dataframe, a list is expected and only that field is found, test changing:
records = [y[:] for y in sf.records()]
I hope this works!

Store ndarray in a PyTable (and how to define the Col()-type)

TL;DR: I have a PyTable with a float32 Col and get an error when writing a numpy-float32-array into it. (How) can I store a numpy-array (float32) in the Column of a PyTables table?
I'm new to PyTables - following a recommendation of TFtables (a lib to use HDF5 in Tensorflow), I'm using it to store all my HDF5 data (currently distributed in batches in several files with each three datasets) within a table in a single HDF5 file. Datasets are
'data' : (n_elements, 1024, 1024, 4)#float32
'label' : (n_elements, 1024, 1024, 1)#uint8
'weights' : (n_elements, 1024, 1024, 1)#float32
where the n_elements are distributed over several files that I want to merge into one now (to allow unordered access).
So when I build my table, I figured each dataset represents a column. I built everything in a generic way that allows to do this for an arbitrary number of datasets:
# gets dtypes (and shapes) of the dsets (accessed by dset_keys = ['data', 'label', 'weights']
dtypes, shapes = _determine_shape(hdf5_files, dset_keys)
# to dynamically generate a table, I'm using a dict (not a class as in the PyTables tutorials)
# the dict is (conform with the doc): { 'col_name' : Col()-class-descendent }
table_description = {dset_keys[i]: tables.Col.from_dtype(dtypes[i]) for i in range(len(dset_keys))}
# create a file, a group-node and attach a table to it
h5file = tables.open_file(destination_file, mode="w", title="merged")
group = h5file.create_group("/", 'main', 'Node for data table')
table = h5file.create_table(group, 'data_table', table_description, "Collected data with %s" % (str(val_keys)))
The dtypes that I get for each dsets (read with h5py) are obviously the ones of the numpy arrays (ndarray) that reading the dset returns: float32 or uint8. So the Col()-types are Float32Col an UInt8Col. I naively assumed that I can now write a float32-array into this col, but filling in data with:
dummy_data = np.zeros([1024,1024,3], float32) # normally data read from other files
sample = table.row
sample['data'] = dummy_data
results in TypeError: invalid type (<class 'numpy.ndarray'>) for column ``data``. So now I feel stupid for assuming I'd be able to write an array in there, BUT there are no "ArrayCol()" types offered, neither are there any hints in the PyTables doc as to whether or how it is possible to write an array into a column. How do I do this?
There are "shape" arguments in the Col() class and it's descendents, so it should be possible, otherwise what are these for?!
I know it's a bit late, but I think the answer to your problem lies in the shape parameter for Float32Col.
Here's how it's used in the documentation:
from tables import *
from numpy import *
# Describe a particle record
class Particle(IsDescription):
name = StringCol(itemsize=16) # 16-character string
lati = Int32Col() # integer
longi = Int32Col() # integer
pressure = Float32Col(shape=(2,3)) # array of floats (single-precision)
temperature = Float64Col(shape=(2,3)) # array of doubles (double-precision)
# Open a file in "w"rite mode
fileh = open_file("tutorial2.h5", mode = "w")
# Get the HDF5 root group
root = fileh.root
# Create the groups:
for groupname in ("Particles", "Events"):
group = fileh.create_group(root, groupname)
# Now, create and fill the tables in Particles group
gparticles = root.Particles
# Create 3 new tables
for tablename in ("TParticle1", "TParticle2", "TParticle3"):
# Create a table
table = fileh.create_table("/Particles", tablename, Particle, "Particles: "+tablename)
# Get the record object associated with the table:
particle = table.row
# Fill the table with 257 particles
for i in xrange(257):
# First, assign the values to the Particle record
particle['name'] = 'Particle: %6d' % (i)
particle['lati'] = i
particle['longi'] = 10 - i
########### Detectable errors start here. Play with them!
particle['pressure'] = array(i*arange(2*3)).reshape((2,4)) # Incorrect
#particle['pressure'] = array(i*arange(2*3)).reshape((2,3)) # Correct
########### End of errors
particle['temperature'] = (i**2) # Broadcasting
# This injects the Record values
particle.append()
# Flush the table buffers
table.flush()
Here's the link to the part of the documentation I'm referring to
https://www.pytables.org/usersguide/tutorials.html
Edit: I just saw that the tables.Col.from_type(type, shape) allows using the precision of a type (float32 instead of float alone). The rest stays the same (takes a string and shape).
The factory function tables.Col.from_kind(kind, shape) can be used to construct a Col-Type that supports ndarrays. What "kind" is and how to use this isn't documented anywhere I found; however with trial and error I found that allowed "kind"s are strings of basic datatypes. I.e.: 'float', 'uint', ... without the precision (NOT 'float64')
Since I get numpy.dtypes from h5py reading a dataset (dset.dtype), these have to be cast to str and the precision needs to be removed.
In the end the relevant lines look like this:
# get key, dtype and shapes of elements per dataset from the datasource files
val_keys, dtypes, element_shapes = _get_dtypes(datasources, element_axis=element_axis)
# for storing arrays in columns apparently one has to use "kind"
# "kind" cannot be created with dtype but only a string representing
# the dtype w/o precision, e.g. 'float' or 'uint'
dtypes_kind = [''.join(i for i in str(dtype) if not i.isdigit()) for dtype in dtypes]
# create table description as dictionary
description = {val_keys[i]: tables.Col.from_kind(dtypes_kind[i], shape=element_shapes[i]) for i in range(len(val_keys))}
Then writing data into the table finally works as suggested:
sample = table.row
sample[key] = my_array
Since it all felt a bit "hacky" and isn't documented well, I am still wondering, whether this is not an intended use for PyTables and would leave this question open for abit to see if s.o. knows more about this...

Categories