I have a large data set (1.3 billion rows) that I want to visualize with Vaex. Since the data set was very big in CSV (around 130 GB across 520 separate files), I merged it into a single HDF5 file with the pandas DataFrame.to_hdf function (format: table, appending for each CSV file). If I use the pandas.read_hdf function to load a slice of the data, there is no problem:
x y z
0 -8274.591528 36.053843 24.766887
1 -8273.229203 34.853409 21.883050
2 -8289.577896 15.326737 26.041516
3 -8279.589741 27.798428 26.222326
4 -8272.836821 37.035071 24.795912
... ... ... ...
995 -8258.567634 3.581020 23.955874
996 -8270.526953 4.373765 24.381293
997 -8287.429578 1.674278 25.838418
998 -8250.624879 4.884777 21.815401
999 -8287.115655 1.100695 25.931318
1000 rows × 3 columns
This is how it looks: I can access any column I want, and the shape is (1000, 3) as it should be. However, when I try to load the HDF5 file using the vaex.open function:
# table
0 '(0, [-8274.59152784, 36.05384262, 24.7668...
1 '(1, [-8273.22920299, 34.85340869, 21.8830...
2 '(2, [-8289.5778959 , 15.32673748, 26.0415...
3 '(3, [-8279.58974054, 27.79842822, 26.2223...
4 '(4, [-8272.83682085, 37.0350707 , 24.7959...
... ...
1,322,286,736 '(2792371, [-6781.56835851, 2229.30828904, -6...
1,322,286,737 '(2792372, [-6781.71119626, 2228.78749838, -6...
1,322,286,738 '(2792373, [-6779.3251589 , 2227.46826613, -6...
1,322,286,739 '(2792374, [-6777.26078082, 2229.49535808, -6...
1,322,286,740 '(2792375, [-6782.81758335, 2228.87820639, -6...
This is what I'm getting. The shape is (1322286741, 1) and the only column is 'table'. When I try to access the first row of the Vaex-imported HDF5 file with galacto[0]:
[(0, [-8274.59152784, 36.05384262, 24.76688728])]
In the pandas-imported data these are the x, y, z columns of the first row. When I tried to inspect the data in another program, it also gave an error saying no data was found. So I think the problem is that pandas appends to the HDF5 file row by row, and that layout doesn't work in other programs. Is there a way I can fix this issue?
HDF5 is as flexible as, say, JSON and XML, in that you can store data in any way you want. Vaex has its own way of laying out the data (you can check the structure with the h5ls utility, it's very simple) that does not align with how pandas/PyTables stores it.
Vaex stores each column as a single contiguous array, which is optimal if you don't work with all columns, and makes it easy to memory-map to a (real) numpy array. PyTables stores the rows next to each other (at least rows of the same type). That means if you calculate the mean of the x column, you effectively read over all the data.
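If you want to see the difference yourself, here is a small h5py sketch that prints every dataset in a file (h5py is my assumption here, the answer itself only mentions the h5ls command-line utility; the file names are placeholders):

import h5py

def show_layout(path):
    """Print every dataset in an HDF5 file with its shape and dtype."""
    def visitor(name, node):
        if isinstance(node, h5py.Dataset):
            print(f"{path}: /{name}  shape={node.shape}  dtype={node.dtype}")
    with h5py.File(path, "r") as f:
        f.visititems(visitor)

show_layout("pandas_written.hdf5")  # expect a single compound 'table' dataset
show_layout("vaex_written.hdf5")    # expect one flat array per column (x, y, z)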
Since the PyTables HDF5 files are probably already much faster to read than CSV, I suggest you do the following (not tested, but it should get the point across):
import vaex
import pandas as pd
import glob

# make sure the directory "vaex" exists
for filename in glob.glob("pandas/*.hdf5"):  # assuming your files live there
    pdf = pd.read_hdf(filename)
    df = vaex.from_pandas(pdf)  # now df is a vaex dataframe
    df.export(filename.replace("pandas", "vaex"), progress=True)  # same data in vaex' own format

df = vaex.open("vaex/*.hdf5")  # it will be concatenated
# don't access df.x.values since it's not a 'real' numpy array, but
# a lazily concatenated column, so it would need to memory copy.
# If you need that, you can optionally do (and for extra performance)
# df.export("big.hdf5", progress=True)
# df_single = vaex.open("big.hdf5")
# df_single.x.values  # this should reference the original data on disk (no mem copy)
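As a quick sanity check after the conversion (a hedged sketch, not tested on your data; it assumes the exported files ended up under vaex/ and kept the x, y, z column names):

import vaex

df = vaex.open("vaex/*.hdf5")          # lazily concatenated dataframe
print(len(df), df.get_column_names())  # expect the full row count and ['x', 'y', 'z']
print(df.mean(df.x))                   # streams over the memory-mapped column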
Related
I am trying to get a Python array from a soapy binary file. The binary file is 6 GB and has 12 columns with an unknown number of rows; it looks like this:
ss2017-03-17, 13:18:25, 88000000.0, 90560000.0, 426666.666667, 647168, -98.6323, -98.7576, -97.3716, -98.3133, -98.8829, -98.9333
ss2017-03-17, 13:18:25, 90560000.0, 93120000.0, 426666.666667, 647168, -95.7163, -96.2564, -97.01, -98.1281, -90.701, -88.0872
ss2017-03-17, 13:18:25, 93120000.0, 95680000.0, 426666.666667, 647168, -99.0242, -91.3061, -91.9134, -85.4561, -86.0053, -97.8411
ss2017-03-17, 13:18:26, 95680000.0, 98240000.0, 426666.666667, 647168, -94.2324, -83.7932, -78.3108, -82.033, -89.1212, -97.4499
After running:
f = np.fromfile(open('filename','rb'))
print(f.ndim)
I got a one-dimensional array.
How can I read this binary file and get an array with 12 elements per row?
You can reshape your data like this:
f.reshape(-1, 12)
I'd just use -1 for the row dimension; this infers the number of rows from the total size.
f = np.fromfile(open('filename','rb'))
f = f.reshape(-1, 12)
print(f.ndim)
Note that this will fail if the size of the array is not a multiple of 12, for whatever reason. I'd suggest checking the shape first if this matters to you.
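If you want that check, a small guard could look like this (a sketch; 'filename' is a placeholder, as in the snippets above):

import numpy as np

f = np.fromfile(open('filename', 'rb'))
if f.size % 12 != 0:
    raise ValueError("file does not contain a whole number of 12-column rows: %d values" % f.size)
f = f.reshape(-1, 12)
print(f.shape)  # (number_of_rows, 12)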
I have two columns: one is a string, and the other is a numpy array of floats.
a = 'this is string'
b = np.array([-2.355, 1.957, 1.266, -6.913])
I would like to store them in a row as separate columns in an HDF5 file. For that I am using pandas:
hdf_key = 'hdf_key'
store5 = pd.HDFStore('file.h5')
z = pd.DataFrame(
    {
        'string': [a],
        'array': [b]
    })
store5.append(hdf_key, z, index=False)
store5.close()
However, I get this error
TypeError: Cannot serialize the column [array] because
its data contents are [mixed] object dtype
Is there a way to store this to h5? If so, how? If not, what's the best way to store this sort of data?
I can't help you with pandas, but I can show you how to do this with PyTables.
Basically you create a table referencing either a numpy recarray or a dtype that defines the mixed data types.
Below is a super simple example showing how to create a table with 1 string and 4 floats, and then add rows of data to it.
It shows 2 different methods to add data:
1. A list of tuples (1 tuple for each row) - see append_list
2. A numpy recarray (with a dtype matching the table definition) - see simple_recarr in the for loop
To get the rest of the arguments for create_table(), read the PyTables documentation. It's very helpful and should answer additional questions. Link below:
PyTables User's Guide
import tables as tb
import numpy as np

with tb.open_file('SO_55943319.h5', 'w') as h5f:
    my_dtype = np.dtype([('A', 'S16'), ('b', float), ('c', float), ('d', float), ('e', float)])
    dset = h5f.create_table(h5f.root, 'table_data', description=my_dtype)
    # Append one row using a list:
    append_list = [('test string', -2.355, 1.957, 1.266, -6.913)]
    dset.append(append_list)
    simple_recarr = np.recarray((1,), dtype=my_dtype)
    for i in range(5):
        simple_recarr['A'] = 'string_' + str(i)
        simple_recarr['b'] = 2.0 * i
        simple_recarr['c'] = 3.0 * i
        simple_recarr['d'] = 4.0 * i
        simple_recarr['e'] = 5.0 * i
        dset.append(simple_recarr)
print('done')
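For completeness, reading the table back is symmetric; a short sketch using the same file and node names as above:

import tables as tb

with tb.open_file('SO_55943319.h5', 'r') as h5f:
    table = h5f.root.table_data
    data = table.read()             # a numpy structured array
    print(data['A'])                # the string column
    print(data['b'], data['e'])     # two of the float columns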
I want to use Dask to read in a large file of atom coordinates at multiple time steps. The format is called XYZ file, and it looks like this:
3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
The first line contains the number of atoms, the second line is just a comment.
After that, the atoms are listed with their names and positions.
After all atoms are listed, the same is repeated for the next time step.
I would now like to load such a trajectory via dask.dataframe.read_csv.
However, I could not figure out how to skip the periodically occurring lines containing the atom count and the comment. Is this actually possible?
Edit:
Reading this format into a Pandas Dataframe is possible via:
atom_nr = 3

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

pd.read_csv(xyz_filename, skiprows=skip, delim_whitespace=True,
            header=None)
But it looks like dask.dataframe.read_csv does not support passing a function to skiprows.
Edit 2:
MRocklin's answer works! Just for completeness, I write down the full code I used.
from io import BytesIO

import pandas as pd
import dask.bytes
import dask.dataframe
import dask.delayed

atom_nr = ...
filename = ...

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

def pandaread(data_in_bytes):
    pseudo_file = BytesIO(data_in_bytes[0])
    return pd.read_csv(pseudo_file, skiprows=skip, delim_whitespace=True,
                       header=None)

bts = dask.bytes.read_bytes(filename, delimiter=f"{atom_nr}\ntimestep".encode())
dfs = dask.delayed(pandaread)(bts)
sol = dask.dataframe.from_delayed(dfs)
sol.compute()
The only remaining question is: How do I tell dask to only compute the first n frames? At the moment it seems the full trajectory is read.
Short answer
No, dask.dataframe.read_csv does not offer this kind of functionality (pandas.read_csv does accept a callable for skiprows, as your edit shows, but the dask version does not).
Long answer
If you can write code to convert some of this data into a pandas dataframe, then you can probably do this on your own with moderate effort using
dask.bytes.read_bytes
dask.dataframe.from_delayed
In general this might look something like the following:
from dask.bytes import read_bytes
import dask
import dask.dataframe as dd

sample, blocks = read_bytes('filenames.*.txt', delimiter=b'...', blocksize=2**27)
dfs = [dask.delayed(load_pandas_from_bytes)(v) for v in sum(blocks, [])]  # blocks is a list per file
df = dd.from_delayed(dfs)
Each of the dfs corresponds to roughly blocksize bytes of your data (extended up to the next delimiter). You can control how fine you want your partitions to be with this blocksize. If you want, you can also select only a few of these dfs objects to get a smaller portion of your data:
dfs = dfs[:5] # only the first five blocks of `blocksize` data
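Applied to the follow-up question about only computing the first n frames, a rough, untested sketch (it reuses the dfs list built in the sketch above; n_blocks is just an illustrative name). Note that a block corresponds to a byte range, not an exact number of time steps, so this bounds how much of the file is read rather than selecting exactly n frames.

import dask.dataframe as dd

n_blocks = 5                        # hypothetical: how many leading blocks to keep
partial = dd.from_delayed(dfs[:n_blocks])
print(partial.compute())            # only the selected blocks are read and parsed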
I have a script which produces a 15x1096 array of data using:
np.savetxt("model_concentrations.csv", model_con, header=','.join(sources), delimiter=",")
Each of the 15 rows corresponds to a source of emissions, while each column is 1 day over 3 years. If at all possible I would like to have a 'header' in column 1 which states the emission source. When I use the option header='source1,source2,...', these labels get placed in the first row (as expected), i.e.:
2per 3rd_pvd 3rd_unpvd 4rai_rd 4rai_yd 5rmo 6hea
2.44E+00 2.12E+00 1.76E+00 1.33E+00 6.15E-01 3.26E-01 2.29E+00 ...
1.13E-01 4.21E-02 3.79E-02 2.05E-02 1.51E-02 2.29E-02 2.36E-01 ...
My question is: is there a way to put these labels in the first column instead, so the csv appears like this:
2per 7.77E+00 8.48E-01 ...
3rd_pvd 1.86E-01 3.62E-02 ...
3rd_unpvd 1.04E+00 2.65E-01 ...
4rai_rd 8.68E-02 2.88E-02 ...
4rai_yd 1.94E-01 8.58E-02 ...
5rmo 7.71E-01 1.17E-01 ...
6hea 1.07E+01 2.71E+00 ...
...
Labels for rows and columns are one of the main reasons for the existence of pandas.
import pandas as pd
# Assemble your source labels in a list
sources = ['2per', '3rd_pvd', '3rd_unpvd', '4rai_rd',
'4rai_yd', '5rmo', '6hea', ...]
# Create a pandas DataFrame wrapping your numpy array
df = pd.DataFrame(model_con, index=sources)
# Saving it to a .csv file writes the index too
df.to_csv('model_concentrations.csv', header=None)
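To get the row labels back when you reload the file, something like this should work (a small sketch; it assumes the csv was written exactly as above, i.e. with the index but without a header row):

import pandas as pd

df = pd.read_csv('model_concentrations.csv', index_col=0, header=None)
print(df.loc['2per'])  # the 1096 daily values for one source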
I have a dict containing several pandas DataFrames (identified by keys); any suggestion on how to effectively serialize (and cleanly load) it? Here is the structure (a pprint display output). Each dict['method_x_']['meas_x_'] is a pandas DataFrame. The goal is to save the DataFrames for further plotting with some specific plotting options.
{'method1':
{'meas1':
config1 config2
0 0.193647 0.204673
1 0.251833 0.284560
2 0.227573 0.220327,
'meas2':
config1 config2
0 0.172787 0.147287
1 0.061560 0.094000
2 0.045133 0.034760},
'method2':
{ 'meas1':
config1 config2
0 0.193647 0.204673
1 0.251833 0.284560
2 0.227573 0.220327,
'meas2':
config1 config2
0 0.172787 0.147287
1 0.061560 0.094000
2 0.045133 0.034760}}
Use pickle.dump(s) and pickle.load(s). It actually works. pandas DataFrames also have their own method df.to_pickle("filename") that you can use to serialize a single DataFrame...
In my particular use case, I tried to do a simple pickle.dump(all_df, open("all_df.p","wb")).
And while it loaded properly with all_df = pickle.load(open("all_df.p","rb")),
when I restarted my Jupyter environment I would get UnpicklingError: invalid load key, '\xef'.
One of the methods described here states that we can use HDF5 (PyTables) to do the job. From their docs:
HDFStore is a dict-like object which reads and writes pandas
But it seems to be picky about the tables version that you use. I got mine to work after a pip install --upgrade tables and a runtime restart.
If you need an overall idea of how to use it:
# consider all_df as a dict of dataframes
with pd.HDFStore('df_store.h5') as df_store:
    for i in all_df.keys():
        df_store[i] = all_df[i]
You should have a df_store.h5 file that you can convert back using the reverse process:
new_all_df = dict()
with pd.HDFStore('df_store.h5') as df_store:
    for i in df_store.keys():
        new_all_df[i] = df_store[i]
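Since your all_df is actually nested (method -> measurement -> DataFrame), one way to adapt this is to flatten the keys when storing and rebuild the nesting when loading. A hedged sketch (the 'method/meas' key scheme is just one choice; note that HDFStore.keys() returns keys with a leading '/'):

import pandas as pd

# write: one entry per (method, measurement) pair
with pd.HDFStore('df_store.h5') as df_store:
    for method, measurements in all_df.items():
        for meas, df in measurements.items():
            df_store[f'{method}/{meas}'] = df

# read back into the same nested-dict shape
new_all_df = {}
with pd.HDFStore('df_store.h5') as df_store:
    for key in df_store.keys():                 # e.g. '/method1/meas1'
        method, meas = key.strip('/').split('/')
        new_all_df.setdefault(method, {})[meas] = df_store[key]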