How do I store custom metadata to a ParquetDataset using pyarrow?
For example, if I create a Parquet dataset using Dask
import dask
dask.datasets.timeseries().to_parquet('temp.parq')
I can then read it using pyarrow
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('temp.parq')
However, the same method I would use for writing metadata for a single parquet file (outlined in How to write Parquet metadata with pyarrow?) does not work for a ParquetDataset, since there is no replace_schema_metadata function or similar.
I think I would probably like to write a custom _common_metadata file, since the metadata I'd like to store pertains to the whole dataset. I imagine the procedure would be something like this:
meta = pq.read_metadata('temp.parq/_common_metadata')
custom_metadata = { b'type': b'mydataset' }
merged_metadata = { **custom_metadata, **meta.metadata }
# TODO: Construct FileMetaData object with merged_metadata
new_meta.write_metadata_file('temp.parq/_common_metadata')
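One possible way to fill in the TODO (a hedged sketch; it relies on pyarrow's write_metadata and Schema.with_metadata and may need adjusting for your pyarrow version): rebuild the Arrow schema carried by the metadata with the merged key/value pairs and write a fresh _common_metadata file from it.
import pyarrow.parquet as pq
meta = pq.read_metadata('temp.parq/_common_metadata')
custom_metadata = { b'type': b'mydataset' }
merged_metadata = { **custom_metadata, **(meta.metadata or {}) }
# Attach the merged key/value metadata to the Arrow schema and
# write a new _common_metadata file from that schema
schema = meta.schema.to_arrow_schema().with_metadata(merged_metadata)
pq.write_metadata(schema, 'temp.parq/_common_metadata')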
One possibility (that does not directly answer the question) is to use dask.
import dask
# Sample data
df = dask.datasets.timeseries()
df.to_parquet('test.parq', custom_metadata={'mymeta': 'myvalue'})
Dask does this by writing the metadata to all the files in the directory, including _common_metadata and _metadata.
from pathlib import Path
import pyarrow.parquet as pq
files = Path('test.parq').glob('*')
all([b'mymeta' in pq.ParquetFile(file).metadata.metadata for file in files])
# True
I have a multipart, partitioned parquet dataset on S3. Each partition contains multiple parquet files. The code below narrows in on a single partition, which contains around 30 parquet files. When I use scan_parquet on an S3 address that includes a *.parquet wildcard, it only looks at the first file in the partition. I verified this with the count of customers: it has the count from just the first file in the partition. Is there a way that it can scan across all the files?
import polars as pl
s3_loc = "s3://some_bucket/some_parquet/some_partion=123/*.parquet"
df = pl.scan_parquet(s3_loc)
cus_count = df.select(pl.count('customers')).collect()
If I leave off the *.parquet from the s3 address then I get the following error.
exceptions.ArrowErrorException: ExternalFormat("File out of specification: A parquet file must containt a header and footer with at least 12 bytes")
From the user guide section on multiple files, it looks like doing so requires a loop creating many lazy dataframes that you then combine together.
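A rough sketch of that loop (hedged: it assumes scan_parquet accepts individual s3:// paths, as in the question, and reuses the bucket/prefix names from above):
import polars as pl
import s3fs
fs = s3fs.S3FileSystem()
# glob returns keys without the scheme, so add "s3://" back
files = fs.glob("some_bucket/some_parquet/some_partion=123/*.parquet")
lazy_dfs = [pl.scan_parquet(f"s3://{f}") for f in files]
# combine the per-file LazyFrames into a single lazy query
lazy_df = pl.concat(lazy_dfs)
cus_count = lazy_df.select(pl.count('customers')).collect()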
Another approach is to use the scan_ds function which takes a pyarrow dataset object.
import polars as pl
import s3fs
import pyarrow.dataset as ds
fs = s3fs.S3FileSystem()
# you can also make a file system with anything fsspec supports
# S3FileSystem is just a wrapper for fsspec
s3_loc = "s3://some_bucket/some_parquet/some_partion=123"
myds = ds.dataset(s3_loc, filesystem=fs)
lazy_df = pl.scan_ds(myds)
cus_count = lazy_df.select(pl.count('customers')).collect()
I have Python code like this; it converts a CSV file to a Parquet file.
import pandas as pd
import pyarrow.parquet as pq
df = pd.read_csv('27.csv')
print(df)
df.to_parquet('27.parquet')
da = pd.read_parquet('27.parquet')
metadata = pq.read_metadata('27.parquet')
print(metadata)
print(da.head(10))
The result is at https://imgur.com/a/Mhw3sot; the parquet file version is too high.
I want to change the version to parquet-cpp version 1.5.1-SNAPSHOT (an older version).
How do I do it? Where do I set the file version?
27.csv is below:
Temp,Flow,site_no,datetime,Conductance,Precipitation,GageHeight
11.0,16200,09380000,2018-06-27 00:00,669,0.00,9.97
10.9,16000,09380000,2018-06-27 00:15,668,0.00,9.93
10.9,15700,09380000,2018-06-27 00:30,668,0.00,9.88
10.8,15400,09380000,2018-06-27 00:45,672,0.00,9.82
10.8,15100,09380000,2018-06-27 01:00,672,0.00,9.77
10.8,14700,09380000,2018-06-27 01:15,672,0.00,9.68
10.7,14300,09380000,2018-06-27 01:30,673,0.00,9.61
10.7,13900,09380000,2018-06-27 01:45,672,0.00,9.53
10.7,13600,09380000,2018-06-27 02:00,671,0.00,9.46
10.6,13200,09380000,2018-06-27 02:15,672,0.00,9.38
11.0,16200,09380000,2018-06-27 00:00,669,0.00,9.97
10.9,16000,09380000,2018-06-27 00:15,668,0.00,9.93
10.9,15700,09380000,2018-06-27 00:30,668,0.00,9.88
10.8,15400,09380000,2018-06-27 00:45,672,0.00,9.82
10.8,15100,09380000,2018-06-27 01:00,672,0.00,9.77
10.8,14700,09380000,2018-06-27 01:15,672,0.00,9.68
10.7,14300,09380000,2018-06-27 01:30,673,0.00,9.61
10.7,13900,09380000,2018-06-27 01:45,672,0.00,9.53
10.7,14300,09380000,2018-06-27 01:30,673,0.00,9.61
As far as I know, the Parquet format in Teradata has some limits, so writing directly with pyarrow.parquet may run into problems.
How do I write a parquet file that Teradata can read (the file format limits have something to do with it)? Has anyone done this before?
Parquet Format Limitations in Teradata:
1. The READ_NOS table operator does not support Parquet. However, READ_NOS can be used to view the Parquet schema, using RETURNTYPE('NOSREAD_PARQUET_SCHEMA'). This is helpful in creating the foreign table when you do not know the schema of your Parquet data beforehand.
2. Certain complex data types are not supported, including STRUCT, MAP, LIST, and ENUM.
3. Because support for the STRUCT data type is not available, nested Parquet object stores cannot be processed by Native Object Store.
The files I am trying are here:
https://ufile.io/f/wi1k9
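One knob worth trying (a hedged sketch, not from the original post, and untested against Teradata): pyarrow's write_table accepts a version argument that controls the Parquet format version, and pandas forwards extra keyword arguments to the engine, so you can request the older '1.0' format for broader compatibility.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.read_csv('27.csv')
# Option 1: let pandas forward the kwarg to the pyarrow engine
df.to_parquet('27.parquet', engine='pyarrow', version='1.0')
# Option 2: call pyarrow directly
table = pa.Table.from_pandas(df)
pq.write_table(table, '27.parquet', version='1.0')
Whether this satisfies Teradata's reader is worth testing; the version option governs the Parquet format version, not the created_by string shown in the screenshot.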
I am aware that this can be done in R as follows
ds <- open_dataset("nyc-taxi/csv/2019", format = "csv",
partitioning = "month")
But is there a way to do this in Python? I tried these, but it seems like that's not an option:
from pyarrow import csv
table = csv.read_csv("*.csv")
import os
from pyarrow import csv
path = os.getcwd()
table = csv.read_csv(path)
table
Is there a way to make it happen in Python?
Yes, you can do this with pyarrow as well, similarly to R, using the pyarrow.dataset submodule (the pyarrow.csv submodule only exposes functionality for dealing with single csv files).
Example code:
import pyarrow.dataset as ds
dataset = ds.dataset("nyc-taxi/csv/2019", format="csv", partitioning=["month"])
table = dataset.to_table()
And then in the to_table() method you can specify row/column filters.
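For example (a hedged sketch; the passenger_count column and the month value are assumptions about the nyc-taxi data, and the filter assumes the month partition values parse as integers):
import pyarrow.dataset as ds
dataset = ds.dataset("nyc-taxi/csv/2019", format="csv", partitioning=["month"])
# Read only selected columns, and only the rows of one partition
table = dataset.to_table(
    columns=["passenger_count", "month"],
    filter=ds.field("month") == 6,
)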
From my read of the docs, luigi is designed to work with text files or raw binaries as Targets. I am trying to build a luigi workflow for an existing processing pipeline that uses HDF5 files (for their many advantages) using h5py on a regular file system. Some tasks in this workflow do not create a whole new file, but rather add new datasets to an existing HDF file. Using h5py I would read a dataset with:
hdf = h5py.File('filepath','r')
hdf['internal/path/to/dataset'][...]
write a dataset with:
hdf['internal/path/to/dataset'] = np.array(data)
and test if a dataset in the HDF file exists with this line:
'internal/path/to/dataset' in hdf
My question is, is there a way to adapt luigi to work with these types of files?
My read of the luigi docs makes me think I may be able to either subclass luigi.format.Format or perhaps subclass LocalTarget and make a custom 'open' method, but I can't find any examples of how to implement this. Many thanks for any suggestions!
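For the LocalTarget idea above, here is a rough, untested sketch (the HDF5DatasetTarget class and its behaviour are my own, not part of luigi's API): subclass LocalTarget so the target only counts as existing when the dataset is present inside the HDF5 file.
import h5py
import luigi

class HDF5DatasetTarget(luigi.LocalTarget):
    def __init__(self, path, dataset):
        super().__init__(path)
        self.dataset = dataset  # internal path, e.g. 'internal/path/to/dataset'

    def exists(self):
        # The output exists only if the file exists and contains the dataset
        if not super().exists():
            return False
        with h5py.File(self.path, 'r') as hdf:
            return self.dataset in hdf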
d6tflow has a HDF5 pandas implementation and can easily be extended to save data other than pandas dataframes.
import d6tflow
from d6tflow.tasks.h5 import TaskH5Pandas
import pandas as pd

class Task1(TaskH5Pandas):
    def run(self):
        df = pd.DataFrame({'a': range(10)})
        self.save(df)

class Task2(d6tflow.tasks.TaskCachePandas):
    def requires(self):
        return Task1()
    def run(self):
        df = self.input().load()
        # use dataframe from HDF5

d6tflow.run([Task2])
See https://d6tflow.readthedocs.io/en/latest/targets.html#writing-your-own-targets for how to extend it with your own targets.
I am trying to write a pandas dataframe to the parquet file format (introduced in the most recent pandas version, 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with new data. What am I missing?
the write syntax is
df.to_parquet(path, mode='append')
the read syntax is
pd.read_parquet(path)
Looks like it's possible to append row groups to an already existing parquet file using fastparquet. This is quite a unique feature, since most libraries don't have this implementation.
Below is from the pandas docs:
DataFrame.to_parquet(path, engine='auto', compression='snappy', index=None, partition_cols=None, **kwargs)
We have to pass in both engine and **kwargs:
engine : {'auto', 'pyarrow', 'fastparquet'}
**kwargs : Additional arguments passed to the parquet library.
Here, the **kwargs we need to pass is append=True (from fastparquet).
import pandas as pd
import os.path

file_path = "D:\\dev\\output.parquet"
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

if not os.path.isfile(file_path):
    df.to_parquet(file_path, engine='fastparquet')
else:
    df.to_parquet(file_path, engine='fastparquet', append=True)
If append is set to True and the file does not exist, then you will see the error below:
AttributeError: 'ParquetFile' object has no attribute 'fmd'
Running the above script 3 times writes the dataframe into the parquet file three times. If I inspect the metadata, I can see that this resulted in 3 row groups.
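To check the row-group count yourself (a small sketch, not from the original answer; it reuses file_path from the script above):
import pyarrow.parquet as pq
# after running the script three times, the file holds three row groups
print(pq.ParquetFile(file_path).metadata.num_row_groups)  # 3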
Note:
Append could be inefficient if you write too many small row groups. The typically recommended size of a row group is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small row groups: compression will work better, since compression operates within a row group only, and there will be less overhead spent on storing statistics, since each row group stores its own statistics.
To append, do this:
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
dataframe = pd.read_csv('content.csv')
output = "/Users/myTable.parquet"
# Create a parquet table from your dataframe
table = pa.Table.from_pandas(dataframe)
# Write direct to your parquet file
pq.write_to_dataset(table, root_path=output)
This will automatically append to your table.
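To read everything back (a small follow-up sketch, not part of the original answer; it reuses the output path from the code above), point pandas at the dataset directory, since each write_to_dataset call adds another file under it:
import pandas as pd
combined = pd.read_parquet(output)  # loads every file under the dataset directory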
I used the AWS Data Wrangler (awswrangler) library. It works like a charm.
Below are the reference docs:
https://aws-data-wrangler.readthedocs.io/en/latest/stubs/awswrangler.s3.to_parquet.html
I read from a Kinesis stream and used the kinesis-python library to consume the messages and write to S3. I have not included the JSON processing logic, as this post deals with the problem of being unable to append data to S3. This was executed in an AWS SageMaker Jupyter notebook.
Below is the sample code I used:
!pip install awswrangler
import awswrangler as wr
import pandas as pd
evet_data=pd.DataFrame({'a': [a], 'b':[b],'c':[c],'d':[d],'e': [e],'f':[f],'g': [g]},columns=['a','b','c','d','e','f','g'])
#print(evet_data)
s3_path="s3://<your bucker>/table/temp/<your folder name>/e="+e+"/f="+str(f)
try:
    wr.s3.to_parquet(
        df=evet_data,
        path=s3_path,
        dataset=True,
        partition_cols=['e', 'f'],
        mode="append",
        database="wat_q4_stg",
        table="raw_data_v3",
        catalog_versioning=True  # Optional
    )
    print("write successful")
except Exception as e:
    print(str(e))
I'm ready to help with any clarifications. In a few other posts I have read about reading the data and overwriting it again, but as the data gets larger that will slow down the process; it is inefficient.
There is no append mode in pandas.to_parquet(). What you can do instead is read the existing file, change it, and write back to it overwriting it.
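A minimal sketch of that read-modify-overwrite approach, assuming df holds the new rows and path points at the existing file as in the question:
import pandas as pd
existing = pd.read_parquet(path)
combined = pd.concat([existing, df], ignore_index=True)
combined.to_parquet(path)  # overwrites the file with the old rows plus the new ones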
Use the fastparquet write function
from fastparquet import write
write(file_name, df, append=True)
The file must already exist as I understand it.
API is available here (for now at least): https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
Pandas to_parquet() can handle both single files and directories with multiple files in them. Pandas will silently overwrite the file if it is already there. To append to a parquet dataset, just add a new file to the same parquet directory.
import os
import datetime
import pandas as pd

os.makedirs(path, exist_ok=True)
# write append (replace the naming logic with what works for you)
filename = f'{datetime.datetime.utcnow().timestamp()}.parquet'
df.to_parquet(os.path.join(path, filename))
# read
pd.read_parquet(path)