I have a dict containing several pandas DataFrames (identified by keys); any suggestion on how to effectively serialize it (and cleanly load it back)? Here is the structure (a pprint display output). Each dict['method_x_']['meas_x_'] is a pandas DataFrame. The goal is to save the DataFrames for further plotting with some specific plotting options.
{'method1':
{'meas1':
config1 config2
0 0.193647 0.204673
1 0.251833 0.284560
2 0.227573 0.220327,
'meas2':
config1 config2
0 0.172787 0.147287
1 0.061560 0.094000
2 0.045133 0.034760},
'method2':
{ 'meas1':
config1 config2
0 0.193647 0.204673
1 0.251833 0.284560
2 0.227573 0.220327,
'meas2':
config1 config2
0 0.172787 0.147287
1 0.061560 0.094000
2 0.045133 0.034760}}
Use pickle.dump(s) and pickle.load(s). It actually works. Pandas DataFrames also have their own methods df.to_pickle("filename") / pd.read_pickle("filename") that you can use to serialize a single DataFrame...
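For example, a minimal round trip for one DataFrame (the file name here is arbitrary, and I'm assuming the nested dict from the question is called all_df, as in the snippet below):
import pandas as pd

df = all_df['method1']['meas1']          # one DataFrame from the nested dict
df.to_pickle('meas1.pkl')                # serialize it to disk
restored = pd.read_pickle('meas1.pkl')   # load it back unchanged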
In my particular use case, I tried to do a simple pickle.dump(all_df, open("all_df.p", "wb"))
And while it loaded properly with all_df = pickle.load(open("all_df.p", "rb")),
when I restarted my Jupyter environment I would get UnpicklingError: invalid load key, '\xef'.
One of the methods described here states that we can use HDF5 (PyTables) to do the job. From their docs:
HDFStore is a dict-like object which reads and writes pandas
But it seems to be picky about the tables version that you use. I got mine to work after a pip install --upgrade tables and a runtime restart.
If you need an overall idea of how to use it:
import pandas as pd

# consider all_df as a dict of DataFrames
with pd.HDFStore('df_store.h5') as df_store:
    for i in all_df.keys():
        df_store[i] = all_df[i]
You should have a df_store.h5 file that you can convert back using the reverse process:
new_all_df = dict()
with pd.HDFStore('df_store.h5') as df_store:
    for i in df_store.keys():
        new_all_df[i] = df_store[i]
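One caveat for the nested dict in the question: HDFStore keys are flat paths, so the method/meas levels need to be flattened into a single key. A minimal sketch, assuming the two-level structure shown above (the '/' separator is my choice):
import pandas as pd

# write: one store key per (method, meas) pair, e.g. 'method1/meas1'
with pd.HDFStore('df_store.h5') as df_store:
    for method, measures in all_df.items():
        for meas, frame in measures.items():
            df_store[f'{method}/{meas}'] = frame

# read: rebuild the nested dict from the flat keys
new_all_df = {}
with pd.HDFStore('df_store.h5') as df_store:
    for key in df_store.keys():            # keys look like '/method1/meas1'
        method, meas = key.strip('/').split('/')
        new_all_df.setdefault(method, {})[meas] = df_store[key]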
I have a TensorFlow Dataset that is a list of file names, and a Pandas dataframe that contains metadata for each file.
filename_ds = tf.data.Dataset.list_files(path + "/*.bmp")
metadata_df = pandas.read_csv(path + "/metadata.csv")
File names contain an idx that references a row in the metadata dataframe, like "3_data.bmp" where 3 is the idx. I hoped to call filename_ds.map(combine_data).
It appears to be not as simple as parsing the file name and doing a dataframe lookup. The following fails because filename is a Tensor, and since I'm running this inside a Dataset.map() call the code is traced in graph mode, so eager-only methods like .numpy() are unavailable and I cannot get a string value from the filename to do my regex and df lookup.
def combine_data(filename):
    idx = re.findall(r"(\d+)_data\.bmp", filename)[0]
    val = metadata_df.loc[metadata_df["idx"] == idx]["test-col"]
    ...
New to Tensorflow, and I suspect I'm going about this in an odd way. What would be the correct way to go about this? I could list my files and concatenate a dataset for each file, but I'm wondering if I'm just missing the "Tensorflow way" of doing it.
One way to iterate is through as_numpy_iterator():
dataset_list = list(filename_ds.as_numpy_iterator())
for each_file in dataset_list:
    file_name = each_file.decode('utf-8')   # this will contain the abs path, e.g. /user/me/so/file_1.png
    try:
        idx = re.findall(r"(\d+).*\.png", file_name)[0]   # pattern changed for my case
        print(f"File:{file_name}, idx:{idx}")
    except IndexError:                       # no idx found in this file name
        print("Exception==>")
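If you would rather keep the lookup inside Dataset.map(), one option is to wrap the Python code in tf.py_function, which gives eager access to the filename tensor. A minimal sketch, not tested against your data, assuming metadata_df["idx"] holds integers and "test-col" holds floats:
import re
import tensorflow as tf

def lookup_metadata(filename):
    # inside tf.py_function the argument is an eager tensor, so .numpy() is available
    name = filename.numpy().decode('utf-8')
    idx = int(re.findall(r"(\d+)_data\.bmp", name)[0])
    row = metadata_df.loc[metadata_df["idx"] == idx]
    return float(row["test-col"].iloc[0])

def combine_data(filename):
    # tf.py_function runs the pandas lookup eagerly inside the graph
    val = tf.py_function(lookup_metadata, [filename], Tout=tf.float32)
    return filename, val

combined_ds = filename_ds.map(combine_data)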
I want to save a pandas DataFrame to parquet, but I have some unsupported types in it (for example bson ObjectIds).
Throughout the examples we use:
import pandas as pd
import pyarrow as pa
from bson import ObjectId
Here's a minimal example to show the situation:
df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': ObjectId('5e9992543bfddb58073803e7')},
        {'name': 'bob', 'oid': ObjectId('5e9992543bfddb58073803e8')},
    ]
)
df.to_parquet('some_path')
And we get:
ArrowInvalid: ('Could not convert 5e9992543bfddb58073803e7 with type ObjectId: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column oid with type object')
I tried to follow this reference: https://arrow.apache.org/docs/python/extending_types.html
Thus I wrote the following type extension:
class ObjectIdType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.binary(12), "my_package.objectid")

    def __arrow_ext_serialize__(self):
        # since we don't have a parametrized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # return an instance of this subclass given the serialized metadata
        return cls()
And was able to get a working pyarrow array for my oid column:
objectid_type = ObjectIdType()
values = df['oid']
storage_array = pa.array(values.map(lambda oid: oid.binary), type=pa.binary(12))
pa.ExtensionArray.from_storage(objectid_type, storage_array)
Now where I’m stuck, and cannot find any good solution on the internet, is how to save my df to parquet, letting it interpret which column needs which Extension. I might change columns in the future, and I have several different types that need this treatment.
How can I simply create parquet files from dataframes and restore them while transparently converting the types?
I tried to create a pyarrow.Table object, and append columns to it after preprocessing, but it doesn’t work as table.append_column takes binary columns and not pyarrow.Arrays, plus the whole isinstance thing looks like a terrible solution.
table = pa.Table.from_pandas(pd.DataFrame())
for col, values in test_df.items():   # .iteritems() in older pandas
    if isinstance(values.iloc[0], ObjectId):
        arr = pa.array(
            values.map(lambda oid: oid.binary), type=pa.binary(12)
        )
    elif isinstance(values.iloc[0], ...):
        ...
    else:
        arr = pa.array(values)
    table.append_column(arr, col)  # FAILS (wrong type)
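For what it's worth, here is a sketch of the same column-by-column idea assembled with pa.Table.from_arrays and written via pyarrow.parquet, instead of appending to an empty table; it still relies on the isinstance dispatch I would like to avoid:
import pyarrow as pa
import pyarrow.parquet as pq

# build one pyarrow array per column, converting ObjectIds to 12-byte binary
arrays, names = [], []
for col, values in test_df.items():
    if isinstance(values.iloc[0], ObjectId):
        arr = pa.array(values.map(lambda oid: oid.binary), type=pa.binary(12))
    else:
        arr = pa.array(values)
    arrays.append(arr)
    names.append(col)

table = pa.Table.from_arrays(arrays, names=names)
pq.write_table(table, 'some_path.parquet')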
Pseudocode of the ideal solution:
parquetize(df, path, my_custom_types_conversions)
# ...
new_df = unparquetize(path, my_custom_types_conversions)
assert df.equals(new_df) # types have been correctly restored
I'm getting lost in pyarrow's docs on whether I should use ExtensionType, serialization or something else to write these functions. Any pointer would be appreciated.
Side note: I do not need parquet by all means; the main issue is being able to save and restore dataframes with custom types quickly and space-efficiently. I tried a solution based on jsonifying and gzipping the dataframe, but it was too slow.
I think it is probably because ObjectId is not a type that pyarrow knows how to convert, hence the exception during type inference.
I tried the example you provided, casting the oid values to strings during dataframe creation, and it worked.
The steps are below:
df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': "ObjectId('5e9992543bfddb58073803e7')"},
        {'name': 'bob', 'oid': "ObjectId('5e9992543bfddb58073803e8')"},
    ]
)
df.to_parquet('parquet_file.parquet')
df1 = pd.read_parquet('parquet_file.parquet', engine='pyarrow')
df1
output:
name oid
0 alice ObjectId('5e9992543bfddb58073803e7')
1 bob ObjectId('5e9992543bfddb58073803e8')
You could write a method that reads the column names and types and outputs a new DF with the columns converted to compatible types, using a switch-case pattern to choose what type to convert each column to (or whether to leave it as is), as sketched below.
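A minimal sketch of such a helper; the names (CONVERTERS, to_parquet_compatible) and the choice to store ObjectIds as strings are my own assumptions, not from the answer:
import pandas as pd
from bson import ObjectId

# map a Python type to a function that converts a column of that type
# into something pyarrow/parquet can store
CONVERTERS = {
    ObjectId: lambda s: s.map(str),   # store ObjectIds as their string form
}

def to_parquet_compatible(df):
    """Return a copy of df with unsupported column types converted."""
    out = df.copy()
    for col in out.columns:
        if out[col].empty:
            continue
        sample = out[col].iloc[0]
        for typ, convert in CONVERTERS.items():
            if isinstance(sample, typ):
                out[col] = convert(out[col])
                break
    return out

# usage:
# to_parquet_compatible(df).to_parquet('parquet_file.parquet')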
I have a large data set (1.3 billion rows) that I want to visualize with Vaex. Since the data set was very big in CSV (around 130 GB in 520 separate files), I merged the files into an HDF5 file with the pandas dataframe.to_hdf function (format: table, appended for each CSV file). If I use the pandas.read_hdf function to load a slice of data, there is no problem.
x y z
0 -8274.591528 36.053843 24.766887
1 -8273.229203 34.853409 21.883050
2 -8289.577896 15.326737 26.041516
3 -8279.589741 27.798428 26.222326
4 -8272.836821 37.035071 24.795912
... ... ... ...
995 -8258.567634 3.581020 23.955874
996 -8270.526953 4.373765 24.381293
997 -8287.429578 1.674278 25.838418
998 -8250.624879 4.884777 21.815401
999 -8287.115655 1.100695 25.931318
1000 rows × 3 columns
This is what it looks like; I can access any column I want, and the shape is (1000, 3) as it should be. However, when I try to load the HDF5 file using the vaex.open function:
# table
0 '(0, [-8274.59152784, 36.05384262, 24.7668...
1 '(1, [-8273.22920299, 34.85340869, 21.8830...
2 '(2, [-8289.5778959 , 15.32673748, 26.0415...
3 '(3, [-8279.58974054, 27.79842822, 26.2223...
4 '(4, [-8272.83682085, 37.0350707 , 24.7959...
... ...
1,322,286,736 '(2792371, [-6781.56835851, 2229.30828904, -6...
1,322,286,737 '(2792372, [-6781.71119626, 2228.78749838, -6...
1,322,286,738 '(2792373, [-6779.3251589 , 2227.46826613, -6...
1,322,286,739 '(2792374, [-6777.26078082, 2229.49535808, -6...
1,322,286,740 '(2792375, [-6782.81758335, 2228.87820639, -6...
This is what I'm getting. The shape is (1322286741, 1) and the only column is 'table'. When I try to access the Vaex-imported HDF as galacto[0]:
[(0, [-8274.59152784, 36.05384262, 24.76688728])]
In the pandas-imported data these are the x, y, z columns for the first row. When I tried to inspect the data in another program, it also gave an error saying no data was found. So I think the problem is that pandas appends to the HDF5 file row by row and that layout doesn't work in other programs. Is there a way I can fix this issue?
HDF5 is as flexible as, say, JSON and XML, in that you can store data in any way you want. Vaex has its own way of storing the data (you can check the structure with the h5ls utility; it's very simple) that does not align with how Pandas/PyTables stores it.
Vaex stores each column as a single contiguous array, which is optimal if you don't work with all columns, and makes it easy to memory-map to a (real) numpy array. PyTables stores the rows (at least those of the same type) next to each other, meaning that if you calculate the mean of the x column, you effectively go over all the data.
Since PyTables HDF5 is probably already much faster to read than CSV, I suggest you do the following (not tested, but it should get the point across):
import vaex
import pandas as pd
import glob

# make sure the directory 'vaex' exists
for filename in glob.glob("pandas/*.hdf5"):  # assuming your files live there
    pdf = pd.read_hdf(filename)
    df = vaex.from_pandas(pdf)  # now df is a vaex dataframe
    df.export(filename.replace("pandas", "vaex"), progress=True)  # same data in vaex' format

df = vaex.open("vaex/*.hdf5")  # it will be concatenated
# don't access df.x.values since it's not a 'real' numpy array, but
# a lazily concatenated column, so it would need a memory copy.
# If you need that, you can optionally do (and for extra performance)
# df.export("big.hdf5", progress=True)
# df_single = vaex.open("big.hdf5")
# df_single.x.values  # this should reference the original data on disk (no mem copy)
I would like to read an RData file in Python and, more importantly, be able to manage it, adding a date and an index.
According to R, the file looks like:
[1] 51.42683 55.16056 51.55766 56.49496 60.35126 60.00867 59.86904 60.33833 60.14559 64.40926 71.08281
[12] 73.65758 69.71637 76.85003 67.86899 72.48499 78.47557 94.64443 89.55312 81.55625 90.06554 65.46467
[23] 84.79299 86.40392 90.09126 94.63728 81.17445 69.41700 71.15074 70.79933 79.15242 65.02803 58.99836
[34] 56.32638 50.73658 48.88498 54.27198 53.77287 55.77409 59.09940 55.26362 60.29990 51.63972 51.89953
I also applied the summary tool in R and got:
summary(file)
Min. 1st Qu. Median Mean 3rd Qu. Max.
32.33 45.94 51.60 54.03 60.92 108.03
As a consequence, I have no header or index column.
I would like to import it into Python. I have tried both pyreadr and rpy2, but in both cases, although I can read the file, I am not able to transform it into a pandas DataFrame. Applying, for example, pyreadr as:
import pyreadr
result = pyreadr.read_r('AdapArimaX_AheadDummy_Hour_1.RData')
print(result.keys())
df1 = result["df1"]
With result.keys() I get:
odict_keys(['file'])
and an error with the second command.
I think it's because I have no headers in the original files. Could the problem be with the packages or with the original file?
Thanks
This is the solution that I have found:
result = pyreadr.read_r(fn)
a = list(result.keys())    # the object is stored under the key 'file', not 'df1'
df1 = result[a[0]]
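Since the question also asks about adding a date index, a possible follow-up; the start date and hourly frequency below are assumptions, adjust them to your data:
import pandas as pd
import pyreadr

result = pyreadr.read_r('AdapArimaX_AheadDummy_Hour_1.RData')
df1 = result[list(result.keys())[0]]   # first (and only) object in the file

# attach a datetime index; start date and frequency are placeholders
df1.index = pd.date_range(start='2020-01-01', periods=len(df1), freq='H')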
Problem writing pandas dataframe (timeseries) to HDF5 using pytables/tstables:
import pandas
import tables
import tstables
# example dataframe
valfloat = [512.3, 918.8]
valstr = ['abc','cba']
tstamp = [1445464064, 1445464013]
df = pandas.DataFrame(data = zip(valfloat, valstr, tstamp), columns = ['colfloat', 'colstr', 'timestamp'])
df.set_index(pandas.to_datetime(df['timestamp'].astype(int), unit='s'), inplace=True)
df.index = df.index.tz_localize('UTC')
colsel = ['colfloat', 'colstr']
dftoadd = df[colsel].sort_index()
# try string conversion from object-type (no type mixing here ?)
##dftoadd.loc[:,'colstr'] = dftoadd['colstr'].map(str)
h5fname = 'df.h5'
# class to use as tstable description
class TsExample(tables.IsDescription):
    timestamp = tables.Int64Col(pos=0)
    colfloat = tables.Float64Col(pos=1)
    colstr = tables.StringCol(itemsize=8, pos=2)
# create new time series
h5f = tables.open_file(h5fname, 'a')
ts = h5f.create_ts('/','example',TsExample)
# append to HDF5
ts.append(dftoadd, convert_strings=True)
# save data and close file
h5f.flush()
h5f.close()
Exception:
ValueError: rows parameter cannot be converted into a recarray object
compliant with table tstables.tstable.TsTable instance at ...
The error was: cannot view Object as non-Object type
While this particular error happens with TsTables, the code chunk responsible for it is identical to the PyTables try-section here.
The error is happening after I upgraded pandas to 0.17.0; the same code was running error-free with 0.16.2.
NOTE: if a string column is excluded then everything works fine, so this problem must be related to string-column type representation in the dataframe.
The issue could be related to this question. Is there some conversion required for 'colstr' column of the dataframe that I am missing?
This is not going to work with newer pandas versions, as the index is timezone-aware; see here.
You can:
- convert to a type PyTables understands; this would require localizing
- use HDFStore to write the frame
Note that what you are doing is the reason HDFStore exists in the first place: to make reading/writing PyTables friendly for pandas objects. Doing this 'manually' is full of pitfalls.
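For completeness, a minimal sketch of the HDFStore route with the dataframe from the question above (not the original poster's code; format='table' is one choice, 'fixed' would also work):
import pandas as pd

# write the timezone-aware frame; HDFStore handles the pandas-to-PyTables conversion
with pd.HDFStore('df.h5') as store:
    store.put('example', dftoadd, format='table')

# read it back
with pd.HDFStore('df.h5') as store:
    restored = store['example']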