Does Parquet support storing multiple data frames of different widths (numbers of columns) in a single file? In HDF5, for example, it is possible to store multiple such data frames and access them by key. From my reading so far it looks like Parquet does not support this, so the alternative would be storing multiple Parquet files in the file system. I have a rather large number (say 10,000) of relatively small frames (~1-5 MB each) to process, so I'm not sure whether this could become a concern?
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

dfs = []
df1 = pd.DataFrame(data={"A": [1, 2, 3], "B": [4, 5, 6]},
                   columns=["A", "B"])
df2 = pd.DataFrame(data={"X": [1, 2], "Y": [3, 4], "Z": [5, 6]},
                   columns=["X", "Y", "Z"])
dfs.append(df1)
dfs.append(df2)

for i in range(2):
    table1 = pa.Table.from_pandas(dfs[i])
    pq.write_table(table1, "my_parq_" + str(i) + ".parquet")
No, this is not possible as Parquet files have a single schema. They also normally don't appear as single files but as multiple files in a directory, with all files sharing the same schema. This enables tools to read these files as if they were one, either fully into local RAM, distributed over multiple nodes, or to evaluate an (SQL) query on them.
Parquet will also be able to store these data frames efficiently even at this small size, so it should be a suitable serialization format for your use case. In contrast to HDF5, Parquet is only a serialization format for tabular data. As mentioned in your question, HDF5 also supports file-system-like key-value access. Since you have a large number of files and this might be problematic for the underlying filesystem, you should look for a replacement for this layer. Possible approaches first serialize the DataFrame to Parquet in memory and then store it in a key-value container; this could either be a simple ZIP archive or a real key-value store such as LevelDB.
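As a rough sketch of that in-memory-plus-key-value approach (the archive name and frame keys here are just placeholders), each DataFrame can be serialized to a Parquet buffer and stored under its own key in a ZIP archive:

import io
import zipfile

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

frames = {"frame_a": pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}),
          "frame_b": pd.DataFrame({"X": [1, 2], "Y": [3, 4], "Z": [5, 6]})}

# Write: serialize each DataFrame to Parquet in memory, then store it under a key.
with zipfile.ZipFile("frames.zip", "w") as archive:
    for key, df in frames.items():
        buf = io.BytesIO()
        pq.write_table(pa.Table.from_pandas(df), buf)
        archive.writestr(f"{key}.parquet", buf.getvalue())

# Read: look up a single frame by key without touching the others.
with zipfile.ZipFile("frames.zip", "r") as archive:
    with archive.open("frame_b.parquet") as f:
        df_b = pq.read_table(io.BytesIO(f.read())).to_pandas()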
I am trying to read multiple parquet files with selected columns into one Pandas dataframe. The parquet files don't all share the same columns. I tried adding a filters argument to pd.read_parquet(), but it doesn't seem to work when reading multiple files. How can I make this work?
from pathlib import Path
import pandas as pd

data_dir = Path('dir/to/parquet/files')

full_df = pd.concat(
    pd.read_parquet(parquet_file)
    for parquet_file in data_dir.glob('*.parquet')
)

full_df = pd.concat(
    pd.read_parquet(parquet_file, filters=[('name', 'address', 'email')])
    for parquet_file in data_dir.glob('*.parquet')
)
Reading from multiple files is well supported. However, if your schemas are different then it is a bit trickier. Pyarrow currently defaults to using the schema of the first file it finds in a dataset. This is to avoid the up-front cost of inspecting the schema of every file in a large dataset.
Arrow-C++ has the capability to override this and scan every file but this is not yet exposed in pyarrow. However, if you know the unified schema ahead of time you can supply it and you will get the behavior you want. You will need to use the datasets module directly to do this as specifying a schema is not part of pyarrow.parquet.read_table (which is what is called by pandas.read_parquet).
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pandas as pd
import tempfile

tab1 = pa.Table.from_pydict({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
tab2 = pa.Table.from_pydict({'b': ['a', 'b', 'c'], 'c': [True, False, True]})
unified_schema = pa.unify_schemas([tab1.schema, tab2.schema])

with tempfile.TemporaryDirectory() as dataset_dir:
    pq.write_table(tab1, f'{dataset_dir}/one.parquet')
    pq.write_table(tab2, f'{dataset_dir}/two.parquet')

    print('Basic read of directory will use schema from first file')
    print(pd.read_parquet(dataset_dir))
    print()

    print('You can specify the unified schema if you know it')
    dataset = ds.dataset(dataset_dir, schema=unified_schema)
    print(dataset.to_table().to_pandas())
    print()

    print('The columns option will limit which columns are returned from read_parquet')
    print(pd.read_parquet(dataset_dir, columns=['b']))
    print()

    print('The columns option can be used when specifying a schema as well')
    dataset = ds.dataset(dataset_dir, schema=unified_schema)
    print(dataset.to_table(columns=['b', 'c']).to_pandas())
If you don't know the unified schema ahead of time you can create it by inspecting all the files yourself:
# You could also use glob here or whatever tool you want to
# get the list of files in your dataset
dataset = ds.dataset(dataset_dir)
schemas = [pq.read_schema(dataset_file) for dataset_file in dataset.files]
print(pa.unify_schemas(schemas))
Since this might be expensive (especially if working with a remote filesystem) you may want to save off the unified schema in its own file (saving a parquet file or Arrow IPC file with 0 batches is usually sufficient) instead of recalculating it every time.
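For example, a minimal sketch of caching the schema (the file name schema.arrow is just a placeholder) could write an Arrow IPC file containing zero batches and read the schema back on later runs:

import pyarrow as pa
import pyarrow.ipc as ipc

# Reuse the per-file schemas gathered above.
unified_schema = pa.unify_schemas(schemas)

# Persist only the schema: an Arrow IPC file with zero record batches.
with ipc.new_file('schema.arrow', unified_schema):
    pass

# Later runs can load the cached schema instead of inspecting every file.
cached_schema = ipc.open_file('schema.arrow').schema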
I have some instrument data saved in HDF5 format as multiple 2-D arrays along with the measurement time. As shown in the attached figures below, d1 and d2 are two independent files that the instrument recorded at different times. They have the same data variables, and the only difference is the length of phony_dim_0, which represents the total number of data points and varies with measurement time.
These files need to be loaded into specific software provided by the instrument company to obtain meaningful results. I want to merge multiple files with Python xarray while keeping their original format, and then load the single merged file into the software.
Here is my attempt:
import os

import numpy as np
import xarray

files = os.listdir("DATA_PATH")
d1 = xarray.open_dataset(files[0])
d2 = xarray.open_dataset(files[1])

## copy a new one to save the merged data array.
d0 = d1
vars_ = [c for c in d1]
for var in vars_:
    d0[var].values = np.vstack([d1[var], d2[var]])
The error looks like this:
replacement data must match the Variable's shape. replacement data has shape (761, 200); Variable has shape (441, 200)
I thought about two solutions for this problem:
expanding the dimension length to the total length of all merged files.
creating a new empty dataset in the same format as d1 and d2.
However, I still could not figure out the function to achieve that. Any comments or suggestions would be appreciated.
Supplemental information
dataset example [d1],[d2]
I'm not familiar with xarray, so I can't help with your code. However, you don't need xarray to copy HDF5 data; h5py is designed to work nicely with HDF5 data as NumPy arrays, and is all you need to merge the data.
A note about Xarray. It uses different nomenclature than HDF5 and h5py. Xarray refers to the files as 'datasets', and calls the HDF5 datasets 'data variables'. HDF5/h5py nomenclature is more frequently used, so I am going to use it for the rest of my post.
There are some things to consider when merging datasets across 2 or more HDF5 files. They are:
Consistency of the data schema (which you have checked).
Consistency of attributes. If datasets have different attribute names or values, the merge process gets a lot more complicated! (Yours appear to be consistent.)
It's preferable to create resizable datasets in the merged file. This simplifies the process, as you don't need to know the total size when you initially create the dataset. Better yet, you can add more data later (if/when you have more files).
I looked at your files. You have 8 HDF5 datasets in each file. One nice thing: the datasets are resizable. That simplifies the merge process. Also, although your datasets have a lot of attributes, they appear to be common in both files. That also simplifies the process.
The code below goes through the following steps to merge the data.
Open the new merge file for writing
Open the first data file (read-only)
Loop thru all data sets
a. use the group copy function to copy the dataset (data plus maxshape parameters, and attribute names and values).
Open the second data file (read-only)
Loop thru all data sets and do the following:
a. get the size of the 2 datasets (existing and to be added)
b. increase the size of HDF5 dataset with .resize() method
c. write values from dataset to end of existing dataset
At the end it loops thru all 3 files and prints shape and maxshape for all datasets (for visual comparison).
Code below:
import h5py

files = ['211008_778183_m.h5', '211008_778624_m.h5', 'merged_.h5']

# Create the merge file:
with h5py.File('merged_.h5', 'w') as h5fw:

    # Open first HDF5 file and copy each dataset.
    # Will use maxshape and attributes from existing dataset.
    with h5py.File(files[0], 'r') as h5fr:
        for ds in h5fr.keys():
            h5fw.copy(h5fr[ds], h5fw, name=ds)

    # Open second HDF5 file and copy data from each dataset.
    # Resizes existing dataset as needed to hold new data.
    with h5py.File(files[1], 'r') as h5fr:
        for ds in h5fr.keys():
            ds_a0 = h5fw[ds].shape[0]
            add_a0 = h5fr[ds].shape[0]
            h5fw[ds].resize(ds_a0 + add_a0, axis=0)
            h5fw[ds][ds_a0:] = h5fr[ds][:]

# Compare shape and maxshape of all datasets in all 3 files.
for fname in files:
    print(f'Working on file: {fname}')
    with h5py.File(fname, 'r') as h5f:
        for ds, h5obj in h5f.items():
            print(f'for: {ds}; shape={h5obj.shape}, maxshape={h5obj.maxshape}')
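The same pattern extends to any number of source files. As a rough sketch (assuming all files share the same datasets, which only grow along axis 0; the data/*.h5 glob and merged_all.h5 name are just placeholders for your own file list), seed the merged file from the first source and append the rest in a loop:

import glob

import h5py

data_files = sorted(glob.glob('data/*.h5'))

with h5py.File('merged_all.h5', 'w') as h5fw:
    # Seed the merged file with the first source (keeps maxshape and attributes).
    with h5py.File(data_files[0], 'r') as h5fr:
        for ds in h5fr.keys():
            h5fw.copy(h5fr[ds], h5fw, name=ds)
    # Append every remaining source along axis 0.
    for fname in data_files[1:]:
        with h5py.File(fname, 'r') as h5fr:
            for ds in h5fr.keys():
                n_old = h5fw[ds].shape[0]
                n_new = h5fr[ds].shape[0]
                h5fw[ds].resize(n_old + n_new, axis=0)
                h5fw[ds][n_old:] = h5fr[ds][:]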
I am working jointly with vaex and dask for some analysis. In the first part of the analysis I do some processing with dask.dataframe, and my intention is to export the dataframe I computed into something vaex reads. I want to export the data into a memory-mappable format, like hdf or arrow.
dask allows exports into hdf and parquet files. Vaex allows imports as hdf and arrow. Both allow exports and imports as csv files, but I want to avoid that.
So far I have the following options (and problems):
If I export to an hdf5 file, it cannot be imported by vaex, since dask writes the file in a row format but vaex reads it in a column format (https://vaex.readthedocs.io/en/latest/faq.html).
I can export the data into parquet files, but I don't know how to read them from vaex. I've seen an answer on SO that transforms the files into an arrow table, but this requires the table to be loaded into memory, which I can't do because the table is too large to fit.
I can of course do an export into a csv and load it in chunks into vaex, then export it into a column-format hdf, but I don't think that should be the purpose of two modules for big objects.
Is there any option I am missing and that would be compatible to "bridge" the two modules without either loading the full table into memory, or having to read/write the dataset twice?
In order to open parquet files with vaex you should use vaex.open, and the extension of your file must be parquet.
Generate Data
import os

import numpy as np
import pandas as pd

fldr = "test"
os.makedirs(fldr, exist_ok=True)

n = 1_000
for i in range(10):
    fn = f"{fldr}/file{i}.parquet"
    df = pd.DataFrame(np.random.randn(n, 2), columns=["a", "b"])
    df["key"] = np.random.randint(0, high=100, size=n)
    df.to_parquet(fn, index=False)
Example: aggregation and save with dask
import dask.dataframe as dd

df = dd.read_parquet(fldr)
grp = df.groupby("key").sum()
grp.to_parquet("output")
Read with vaex
import vaex

df = vaex.open("output/part.0.parquet")
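If dask writes more than one part file, vaex.open should also accept a glob pattern (support may depend on your vaex version), so all parts can be opened as a single dataframe:

import vaex

# Open every part file written by dask as one vaex dataframe.
df = vaex.open("output/part.*.parquet")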
I recently had to take a dataframe and prepare it to output to an Excel file. However, I didn't want to save it to the local system, but rather pass the prepared data to a separate function that saves to the cloud based on a URI. After searching through a number of ExcelWriter examples, I couldn't find what I was looking for.
The goal is to take the dataframe, e.g.:
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
And temporarily store it as bytes in a variable, e.g.:
processed_data = <bytes representing the excel output>
The solution I came up with is provided in the answers and hopefully will help someone else. Would love to see others' solutions as well!
Update #2 - Example Use Case
In my case, I created an io module that allows you to use URIs to specify different cloud destinations. For example, "paths" starting with gs:// get sent to Google Storage (using gsutil-like syntax). I process the data as my first step, and then pass that processed data to a "save" function, which itself filters to determine the right path.
df.to_csv() actually works with no path and automatically returns a string (at least in recent versions), so this is my solution to allow to_excel() to do the same.
Works like the common examples, but instead of specifying the file in ExcelWriter, it uses the standard library's BytesIO to store in a variable (processed_data):
from io import BytesIO

import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6]
})

output = BytesIO()
writer = pd.ExcelWriter(output)
df.to_excel(writer)  # plus any **kwargs
writer.close()  # writer.save() was removed in newer pandas; close() finalizes the workbook
processed_data = output.getvalue()
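From here, processed_data is just bytes, so it can be handed off to whatever storage layer you use; for example (save_to_cloud and the gs:// URI below are hypothetical placeholders for your own io module):

# Hand the in-memory workbook to a hypothetical cloud-save helper...
# save_to_cloud("gs://my-bucket/report.xlsx", processed_data)

# ...or write it locally to verify the bytes form a valid Excel file.
with open("report.xlsx", "wb") as f:
    f.write(processed_data)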
I have a calculation that creates an excel spreadsheet using xlsxwriter to show results. It would be useful to sort the table after knowing the results.
One solution would be to create a separate data structure in Python, sort the data structure, and use xlsxwriter later, but it is not very elegant and requires a lot of data type handling.
I cannot find a way to sort the structures in the xlsxwriter module.
Can anybody help with the internal data structure of that module? Can that be sorted before writing it to disk?
Another solution would be to reopen the file, sort the contents, and close it again?
import xlsxwriter

workbook = xlsxwriter.Workbook("Trial.xlsx")
worksheet = workbook.add_worksheet("first")
worksheet.write_number(0, 1, 2)
worksheet.write_number(0, 2, 1)
# worksheet.sort(...)  # <-- something like this is what I am looking for
workbook.close()
Can anybody help with the internal data structure of that module? Can that be sorted before writing it to disk?
I am the author of the module and the short answer is that this can't or shouldn't be done.
It is possible to sort worksheet data in Excel at runtime but that isn't part of the file specification so it can't be done with XlsxWriter.
One solution would be to create a separate data structure in Python, sort the data structure, and use xlsxwriter later, but it is not very elegant and requires a lot of data type handling.
That sounds like a reasonable solution to me.
You should process your data before writing it to a Workbook as it is not easily possible to manipulate the data once in the spreadsheet.
The following example would write a column of numbers unsorted:
import xlsxwriter

with xlsxwriter.Workbook("Trial.xlsx") as workbook:
    worksheet = workbook.add_worksheet("first")

    data = [5, 2, 7, 3, 8, 1]

    for rowy, value in enumerate(data):
        worksheet.write_number(rowy, 0, value)  # use column 0
But if you first sort the data as follows:
import xlsxwriter

with xlsxwriter.Workbook("Trial.xlsx") as workbook:
    worksheet = workbook.add_worksheet("first")

    data = sorted([5, 2, 7, 3, 8, 1])

    for rowy, value in enumerate(data):
        worksheet.write_number(rowy, 0, value)  # use column 0
You would get something like: