Input a list of arrays of numbers in h5py - python

I was trying to input a list of numeric arrays into an HDF5 file using h5py. Consider for example:
f = h5py.File('tester.hdf5','w')
b = [[1,2],[1,2,3]]
f.create_dataset('data', data=b)
The create_dataset call throws an error:
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
So I am assuming HDF5 doesn't support this.
Just as you can store a list of strings by using a special datatype, is there a way to store a list of numeric arrays as well?
H5py store list of list of strings
If not, what other suitable ways are there to store a list like this so that I can read it back later?
Thanks for the help in advance

You could split your list of lists into separate datasets and store them individually:
import h5py

my_list = [[1,2],[1,2,3]]
f = h5py.File('tester.hdf5','w')
grp = f.create_group('list_of_lists')
for i, lst in enumerate(my_list):
    grp.create_dataset(str(i), data=lst)
After doing so, you can slice through your datasets as before, with a small variation:
In[1]: grp[str(0)][:].tolist()
Out[1]: [1, 2]
In[2]: grp[str(1)][:].tolist()
Out[2]: [1, 2, 3]
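As a side note, the question's hunch about a special datatype is right: h5py also exposes a variable-length numeric type via h5py.vlen_dtype, which lets the ragged lists live in a single dataset. A minimal sketch (file and dataset names here are my own):

```python
import h5py
import numpy as np

# Variable-length dtype: each element of the dataset is its own
# 1-D int32 sequence, so rows may have different lengths.
vlen_int = h5py.vlen_dtype(np.dtype('int32'))

with h5py.File('tester_vlen.hdf5', 'w') as f:
    dset = f.create_dataset('ragged', shape=(2,), dtype=vlen_int)
    dset[0] = [1, 2]
    dset[1] = [1, 2, 3]

with h5py.File('tester_vlen.hdf5', 'r') as f:
    row0 = f['ragged'][0].tolist()
    row1 = f['ragged'][1].tolist()
print(row0)  # [1, 2]
print(row1)  # [1, 2, 3]
```

On older h5py versions the equivalent spelling is h5py.special_dtype(vlen=...).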

Related

How to save a list in a pandas dataframe cell to a HDF5 table format?

I have a dataframe that I want to save in the appendable format to a hdf5 file. The dataframe looks like this:
column1
0 [0, 1, 2, 3, 4]
And the code that replicates the issue is:
import pandas as pd
test = pd.DataFrame({"column1":[list(range(0,5))]})
test.to_hdf('test','testgroup',format="table")
Unfortunately, it returns this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-65-c2dbeaca15df> in <module>
1 test = pd.DataFrame({"column1":[list(range(0,5))]})
----> 2 test.to_hdf('test','testgroup',format="table")
/usr/local/lib/python3.7/dist-packages/pandas/io/pytables.py in _maybe_convert_for_string_atom(name, block, existing_col, min_itemsize, nan_rep, encoding, errors, columns)
4979 error_column_label = columns[i] if len(columns) > i else f"No.{i}"
4980 raise TypeError(
-> 4981 f"Cannot serialize the column [{error_column_label}]\n"
4982 f"because its data contents are not [string] but "
4983 f"[{inferred_type}] object dtype"
TypeError: Cannot serialize the column [column1]
because its data contents are not [string] but [mixed] object dtype
I am aware that I can save each value in a separate column. This does not help my extended use case, as there might be variable length lists.
I know I could convert the list to a string and then recreate it based on the string, but if I start converting each column to string, I might as well use a text format, like csv, instead of a binary one like hdf5.
Is there a standard way of saving lists into hdf5 table format?
Python Lists present a challenge when writing to HDF5 because they may contain different types. For example, this is a perfectly valid list: [1, 'two', 3.0]. Also, if I understand your Pandas 'column1' dataframe, it may contain different length lists. There is no (simple) way to represent this as an HDF5 dataset.
[That's why you got the [mixed] object dtype message. The conversion of the dataframe creates an intermediate object that is written as a dataset. The dtype of the converted list data is "O" (object), and HDF5 doesn't support this type.]
However, all is not lost. If we can make some assumptions about your data, we can wrangle it into an HDF5 dataset. Assumptions: 1) all df list entities are the same type (int in this case), and 2) all df lists are the same length. (We can handle different length lists, but it is more complicated.) Also, you will need to use a different package to write the HDF5 data (either PyTables or h5py). PyTables is the underlying package for Pandas HDF5 support, and h5py is widely used. The choice is yours.
Before I post the code, here is an outline of the process:
Create a NumPy record array (aka recarray) from the dataframe
Define the desired type and shape for the HDF5 dataset (as an Atom for PyTables, or a dtype for h5py).
Create the dataset with the Atom/dtype definition above (could be done on 1 line, but it's easier to read this way).
Loop over rows of the recarray (from Step 1), and write data to rows of the dataset. This converts each list to the equivalent array.
Code to create recarray (adds 2 rows to your dataframe):
import pandas as pd
test = pd.DataFrame({"column1":[list(range(0,5)), list(range(10,15)), list(range(100,105))]})
# create recarray from the dataframe (use index='column1' to only get that column)
rec_arr = test.to_records(index=False)
PyTables specific code to export data:
import tables as tb

with tb.File('74489101_tb.h5', 'w') as h5f:
    # define "atom" with type and shape of column1 data
    df_atom = tb.Atom.from_type('int32', shape=(len(rec_arr[0]['column1']),))
    # create the dataset
    test = h5f.create_array('/', 'test', shape=rec_arr.shape, atom=df_atom)
    # loop over recarray and populate dataset
    for i in range(rec_arr.shape[0]):
        test[i] = rec_arr[i]['column1']
    print(test[:])
h5py specific code to export data:
import h5py

with h5py.File('74489101_h5py.h5', 'w') as h5f:
    # dtype with type and shape of column1 data
    df_dt = (int, (len(rec_arr[0]['column1']),))
    # create the dataset
    test = h5f.create_dataset('test', shape=rec_arr.shape, dtype=df_dt)
    # loop over recarray and populate dataset
    for i in range(rec_arr.shape[0]):
        test[i] = rec_arr[i]['column1']
    print(test[:])
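To close the loop, here is a self-contained sketch of the round trip (hypothetical file name 'roundtrip.h5'): the equal-length lists become rows of a plain 2-D dataset, and reading it back rebuilds a DataFrame of lists.

```python
import h5py
import pandas as pd

test = pd.DataFrame({"column1": [list(range(0, 5)),
                                 list(range(10, 15)),
                                 list(range(100, 105))]})

# Equal-length lists convert cleanly to a 2-D int dataset.
with h5py.File('roundtrip.h5', 'w') as h5f:
    h5f.create_dataset('test', data=list(test['column1']))

# Read it back and rebuild a column of lists.
with h5py.File('roundtrip.h5', 'r') as h5f:
    loaded = h5f['test'][:]          # 2-D int array, one row per list

test_back = pd.DataFrame({"column1": loaded.tolist()})
print(test_back['column1'].tolist())
```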

How to index from a list nested in an array?

I have a variable which returns this:
array(list([0, 1, 2]), dtype=object)
How do I index from this? Everything I have tried throws an error.
For reference, some code that would produce this variable.
import xarray as xr

x = xr.DataArray(
    [[0, 1, 2],
     [3, 4]]
)
x
I guess before anyone asks, I am trying to test whether xarray's DataArray is a suitable way for me to store session-based data containing multiple recordings saved as vectors/1D arrays, where each recording/array can vary in length. That is why the DataArray doesn't have even dimensions.
Thanks
I used the code you gave to create the x variable, and was able to retrieve the lists using the following code:
for arr in x:
    print(arr.item())
Basically, you have to call .item() on that array to retrieve the inner list.
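For completeness, a small sketch of going one level deeper, i.e. indexing inside the inner list. The object array is built explicitly here, since recent NumPy versions refuse to create ragged arrays implicitly:

```python
import numpy as np
import xarray as xr

# Build the ragged object array explicitly, then wrap it.
arr = np.empty(2, dtype=object)
arr[0], arr[1] = [0, 1, 2], [3, 4]
x = xr.DataArray(arr)

inner = x[1].item()   # unwrap the 0-d element -> the list [3, 4]
print(inner)          # [3, 4]
print(inner[0])       # 3
```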

Adding ndarray into dataframe and then back to ndarray

I have an ndarray which looks like this:
x
I wanted to add this to an existing dataframe so that I could export it as a csv, then use that csv in a separate python script, pull the ndarray back out, and carry out some analysis, mainly so that I don't have one really long python script.
To add it to a dataframe I've done the following:
data["StandardisedFeatures"] = x.tolist()
This looks ok to me. However, in my next script, when I try to pull the data back out as an array, it doesn't come back the same: each entry is wrapped in single quotes and treated as a string:
data['StandardisedFeatures'].to_numpy()
I've tried astype(float) but it doesn't seem to work, can anyone suggest a way to fix this?
Thanks.
If your list objects in a DataFrame have become strings during processing (this happens sometimes), you can use the eval or ast.literal_eval functions to convert them back from strings to lists, and use map to do it for every element.
Here is an example which will give you an idea of how to deal with this:
import pandas as pd
import numpy as np
dic = {"a": [1,2,3], "b":[4,5,6], "c": [[1,2,3], [4,5,6], [1,2,3]]}
df = pd.DataFrame(dic)
print("DataFrame:", df, sep="\n", end="\n\n")
print("Column of list to numpy:", df.c.to_numpy(), sep="\n", end="\n\n")
temp = df.c.astype(str).to_numpy()
print("Since your list objects have somehow become str objects while working with df:", temp, sep="\n", end="\n\n")
print("Magic for what you want:", np.array(list(map(eval, temp))), sep="\n", end="\n\n")
Output:
DataFrame:
   a  b          c
0  1  4  [1, 2, 3]
1  2  5  [4, 5, 6]
2  3  6  [1, 2, 3]

Column of list to numpy:
[list([1, 2, 3]) list([4, 5, 6]) list([1, 2, 3])]

Since your list objects have somehow become str objects while working with df:
['[1, 2, 3]' '[4, 5, 6]' '[1, 2, 3]']

Magic for what you want:
[[1 2 3]
 [4 5 6]
 [1 2 3]]
Note: I have used eval in the example only because more people are familiar with it. You should prefer using ast.literal_eval instead whenever you need eval. This SO post nicely explains why you should do this.
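For reference, a minimal sketch of the safer variant with ast.literal_eval, using the same string data as the example output above:

```python
import ast
import numpy as np

# Strings as they come back from the DataFrame round trip.
temp = np.array(['[1, 2, 3]', '[4, 5, 6]', '[1, 2, 3]'])

# literal_eval parses each string back into a list without
# executing arbitrary code the way eval would.
restored = np.array([ast.literal_eval(s) for s in temp])
print(restored.tolist())  # [[1, 2, 3], [4, 5, 6], [1, 2, 3]]
```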
Perhaps an alternative and simpler way of solving this issue is to use numpy.save and numpy.load functions. Then you can save the array as a numpy array object and load it again in the next script directly as a numpy array:
import numpy as np
x = np.array([[1, 2], [3, 4]])
# Save the array in the working directory as "x.npy" (extension is automatically inserted)
np.save("x", x)
# Load "x.npy" as a numpy array
x_loaded = np.load("x.npy")
You can save objects of any type in a DataFrame.
You retain their type, but they will be classified as "object" in the pandas.DataFrame.info().
Example: save lists
df = pd.DataFrame(dict(my_list=[[1,2,3,4], [1,2,3,4]]))
print(type(df.loc[0, 'my_list']))
# Prints: <class 'list'>
This is useful if you use your objects directly with pandas.DataFrame.apply().
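For instance, a quick sketch of working on the stored lists with apply:

```python
import pandas as pd

df = pd.DataFrame(dict(my_list=[[1, 2, 3, 4], [5, 6]]))

# apply() hands each stored list to the function unchanged.
lengths = df['my_list'].apply(len)
totals = df['my_list'].apply(sum)
print(lengths.tolist())  # [4, 2]
print(totals.tolist())   # [10, 11]
```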

Numpy array of dtype object has vastly different values for sys.getsizeof() and nbytes

I have a sample dataset of names. It is a csv file with 2 columns, each 200 lines long. Both columns contain random names. I have the following code to load the csv file into a pandas Dataframe, convert the dataframe into a numpy array, then convert the numpy array into a standard python list. The code is as follows:
import sys
import pandas as pd

x_df = pd.read_csv("names.csv")
x_np = x_df.to_numpy()
x_list = x_np.tolist()
print("Pandas dataframe:")
print('Using sys.getsizeof(): {}'.format(sys.getsizeof(x_df)))
print('Using pandas_df.memory_usage(): {}'.format(x_df.memory_usage(index=True, deep=True).sum()))
print('\nNumpy ndarray (dtype: {}):'.format(x_np.dtype))
print('Using sys.getsizeof(): {}'.format(sys.getsizeof(x_np)))
print('Using ndarray.nbytes: {}'.format(x_np.nbytes))
total_mem = 0
for row in x_np:
for name in row:
total_mem += sys.getsizeof(name)
print('Using sys.getsizeof() on each element in np array: {}'.format(total_mem))
print('\nStandard list:')
print('Using sys.getsizeof(): {}'.format(sys.getsizeof(x_list)))
total_mem = sum([sys.getsizeof(x) for sublist in x_list for x in sublist])
print('Using sys.getsizeof() on each element in list: {}'.format(total_mem))
The output of this code is as follows:
Pandas dataframe:
Using sys.getsizeof(): 25337
Using pandas_df.memory_usage(): 25305
Numpy ndarray (dtype: object):
Using sys.getsizeof(): 112
Using ndarray.nbytes: 3200
Using sys.getsizeof() on each element in np array: 21977
Standard list:
Using sys.getsizeof(): 1672
Using sys.getsizeof() on each element in list: 21977
I think I understand, for the standard python list, why sys.getsizeof() is such a small value compared to using sys.getsizeof() on each element of that list - using it on the list overall just shows the list object, which contains references to elements of the list.
Does this same logic apply to the numpy array? Why exactly is the value of nbytes on the array so small compared to the list? Does numpy have excellent memory management, or does the numpy array consist of references, not the actual objects? If the numpy array consists of references, not the actual objects, does this apply to all dtypes? Or just the object dtype?
A dataframe containing strings will be object dtype.
400 elements * 8 bytes per pointer is 3200, the array's nbytes. The 112 is just the size of the array object itself (shape, strides, etc.), not the data buffer; it is apparently a view of the array that x_df is using to store its references.
Pandas data storage is more complicated than numpy's, but apparently if the dtype is uniform across columns, it does use a 2d ndarray. I don't know how getsizeof and memory_usage (with those parameters) work, though the numbers suggest they measure the same thing.
Your enumeration size suggests that the string elements are on the average 6-7 bytes long. That seems small for unicode, but you haven't told us about those 'random names'.
Enumerating the list gives the same total as enumerating the numpy array. As for 1672 being so much smaller than 3200: getsizeof on the outer list only counts its 200 row pointers (roughly 200 * 8 bytes plus list overhead), not the 400 element pointers held by the inner row lists, whereas the 2d array's nbytes counts all 400.
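The pointer interpretation is easy to check: an object array's nbytes grows by one pointer per element no matter how large the referenced objects are. A sketch (the 8-byte itemsize assumes a 64-bit build):

```python
import sys
import numpy as np

arr = np.array(['a', 'a much, much longer random name'], dtype=object)

# nbytes counts only the element pointers...
print(arr.nbytes)        # size * itemsize (16 on a 64-bit build)
# ...while the referenced strings are far larger in total.
print(sum(sys.getsizeof(s) for s in arr))
```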

How to access random indices from h5 data set?

I have some h5 data that I want to sample from by using some randomly generated indices. However, if the indices are out of increasing order, then the effort fails. Is it possible to select indices, that have been generated randomly, from h5 data sets?
Here is a MWE citing the error:
import h5py
import numpy as np

arr = np.random.random(50).reshape(10,5)
with h5py.File('example1.h5', 'w') as h5fw:
    h5fw.create_dataset('data', data=arr)

random_subset = h5py.File('example1.h5', 'r')['data'][[3, 1]]
# TypeError: Indexing elements must be in increasing order
I could sort the indices, but then we lose the randomness component.
As hpaulj mentioned, random indices aren't a problem for numpy arrays in memory. So, yes, it's possible to select data with randomly generated indices from h5 datasets once they are read into numpy arrays. The key is having sufficient memory to hold the entire dataset. The code below shows how to do this:
#random_subset = h5py.File('example1.h5', 'r')['data'][[3, 1]]
arr = h5py.File('example1.h5', 'r')['data'][:]
random_subset = arr[[3,1]]
A potential solution is to pre-sort the desired indices as follows:
idx = np.sort([3,1])
random_subset = h5py.File('example1.h5', 'r')['data'][idx]
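If the dataset is too large to read whole, a middle ground is possible: read with sorted indices (which HDF5 accepts), then invert the sort so the original random order is restored. A sketch, reusing the question's file:

```python
import h5py
import numpy as np

arr = np.random.random(50).reshape(10, 5)
with h5py.File('example1.h5', 'w') as h5fw:
    h5fw.create_dataset('data', data=arr)

idx = np.array([3, 1, 7])            # random order we want to keep
order = np.argsort(idx)              # positions that sort idx

with h5py.File('example1.h5', 'r') as h5fr:
    subset_sorted = h5fr['data'][np.sort(idx)]   # increasing order: OK

# Invert the sort to recover the rows in the original [3, 1, 7] order.
subset = subset_sorted[np.argsort(order)]
```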
