Polars scan s3 multi-part parquet files - python

I have a multipart partitioned parquet dataset on s3. Each partition contains multiple parquet files. The code below narrows in on a single partition, which may contain somewhere around 30 parquet files. When I use scan_parquet on an s3 address that includes a *.parquet wildcard, it only looks at the first file in the partition. I verified this with the count of customers: it has the count from just the first file in the partition. Is there a way that it can scan across files?
import polars as pl
s3_loc = "s3://some_bucket/some_parquet/some_partion=123/*.parquet"
df = pl.scan_parquet(s3_loc)
cus_count = df.select(pl.count('customers')).collect()
If I leave off the *.parquet from the s3 address then I get the following error.
exceptions.ArrowErrorException: ExternalFormat("File out of specification: A parquet file must containt a header and footer with at least 12 bytes")

From the user guide on multiple files, it looks like scanning across files requires a loop that creates many lazy dfs which you then combine together.
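A rough sketch of that loop-and-combine approach (assuming s3fs for listing the files; pl.concat then combines the per-file lazy frames):
import polars as pl
import s3fs

fs = s3fs.S3FileSystem()
# list every parquet file under the partition
keys = fs.glob("some_bucket/some_parquet/some_partion=123/*.parquet")
# one lazy frame per file, combined into a single lazy frame
lazy_dfs = [pl.scan_parquet(f"s3://{key}") for key in keys]
df = pl.concat(lazy_dfs)
cus_count = df.select(pl.count('customers')).collect()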
Another approach is to use the scan_ds function which takes a pyarrow dataset object.
import polars as pl
import s3fs
import pyarrow.dataset as ds
fs = s3fs.S3FileSystem()
# you can also make a file system with anything fsspec supports
# S3FileSystem is just a wrapper for fsspec
s3_loc = "s3://some_bucket/some_parquet/some_partion=123"
myds = ds.dataset(s3_loc, filesystem=fs)
lazy_df = pl.scan_ds(myds)
cus_count = lazy_df.select(pl.count('customers')).collect()

Related

Read many parquet files from S3 to pandas dataframe

I've been researching this topic for a few days now and have yet to come up with a working solution. Apologies if this question is repetitive (although I have checked for similar questions and have not quite found the right one).
I have an s3 bucket with about 150 parquet files in it. I have been searching for a dynamic way to bring all of these files into one dataframe (it can be multiple dataframes, if that is more computationally efficient). If all of these parquets were appended to one dataframe, it would be a very large amount of data, so if the solution is simply that I require more computing power, please do let me know. I ultimately stumbled across awswrangler, and am using the code below, which has been running as expected:
df = wr.s3.read_parquet(path="s3://my-s3-data/folder1/subfolder1/subfolder2/", dataset=True, columns=df_cols, chunked=True)
This code returns a generator object, which I am not sure how to get into a dataframe. I have tried solutions from the linked pages (below) and have gotten various errors such as invalid filepath and length mismatch.
https://newbedev.com/create-a-pandas-dataframe-from-generator
https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.s3.read_parquet.html
Create a pandas DataFrame from generator?
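For reference, each item that the chunked generator yields is itself a pandas DataFrame, so one way to materialize the result (a sketch, assuming everything fits in memory and reusing the path and df_cols from above) is to concatenate the chunks:
import awswrangler as wr
import pandas as pd

chunks = wr.s3.read_parquet(
    path="s3://my-s3-data/folder1/subfolder1/subfolder2/",
    dataset=True,
    columns=df_cols,
    chunked=True,
)
# each chunk is a pandas DataFrame; concatenate them into one
df = pd.concat(chunks, ignore_index=True)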
Another solution I tried was from https://www.py4u.net/discuss/140245:
import s3fs
import pyarrow.parquet as pq
fs = s3fs.S3FileSystem()
bucket = "cortex-grm-pdm-auto-design-data"
path = "s3://my-bucket/folder1/subfolder1/subfolder2/"
# Python 3.6 or later
p_dataset = pq.ParquetDataset(
    f"s3://my-bucket/folder1/subfolder1/subfolder2/",
    filesystem=fs
)
df = p_dataset.read().to_pandas()
which resulted in an error "'AioClientCreator' object has no attribute '_register_lazy_block_unknown_fips_pseudo_regions'"
Lastly, I also tried the many-parquet solution from https://newbedev.com/how-to-read-a-list-of-parquet-files-from-s3-as-a-pandas-dataframe-using-pyarrow:
# Read multiple parquets from a folder on S3 generated by spark
import boto3
import pandas as pd

def pd_read_s3_multiple_parquets(filepath, bucket, s3=None,
                                 s3_client=None, verbose=False, **args):
    if not filepath.endswith('/'):
        filepath = filepath + '/'  # Add '/' to the end
    if s3_client is None:
        s3_client = boto3.client('s3')
    if s3 is None:
        s3 = boto3.resource('s3')
    s3_keys = [item.key for item in s3.Bucket(bucket).objects.filter(Prefix=filepath)
               if item.key.endswith('.parquet')]
    if not s3_keys:
        print('No parquet found in', bucket, filepath)
    elif verbose:
        print('Load parquets:')
        for p in s3_keys:
            print(p)
    # pd_read_s3_parquet is the single-file reader helper defined in the linked answer
    dfs = [pd_read_s3_parquet(key, bucket=bucket, s3_client=s3_client, **args)
           for key in s3_keys]
    return pd.concat(dfs, ignore_index=True)

df = pd_read_s3_multiple_parquets('path/to/folder', 'my_bucket')
This one returned "No parquet found in" the path (which I am certain is wrong; the parquets are all there when I visit the actual s3), as well as the error "no objects to concatenate".
Any guidance you can provide is greatly appreciated! Again, apologies for any repetitiveness in my question. Thank you in advance.
AWS Data Wrangler works seamlessly; I have used it.
Install via pip or conda.
Reading multiple parquet files is a one-liner: see example below.
Creds are automatically read from your environment variables.
# this is running on my laptop
import numpy as np
import pandas as pd
import awswrangler as wr
# assume multiple parquet files in 's3://mybucket/etc/etc/'
s3_bucket_uri = 's3://mybucket/etc/etc/'
df = wr.s3.read_parquet(path=s3_bucket_uri)
# df is a pandas DataFrame
AWS docs with examples that include your use case are here:
https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.s3.read_parquet.html

How to store custom Parquet Dataset metadata with pyarrow?

How do I store custom metadata to a ParquetDataset using pyarrow?
For example, if I create a Parquet dataset using Dask
import dask
dask.datasets.timeseries().to_parquet('temp.parq')
I can then read it using pyarrow
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('temp.parq')
However, the same method I would use for writing metadata for a single parquet file (outlined in How to write Parquet metadata with pyarrow?) does not work for a ParquetDataset, since there is no replace_schema_metadata function or similar.
I think I would probably like to write a custom _common_metadata file, as the metadata I'd like to store pertains to the whole dataset. I imagine the procedure would be something like:
meta = pq.read_metadata('temp.parq/_common_metadata')
custom_metadata = { b'type': b'mydataset' }
merged_metadata = { **custom_metadata, **meta.metadata }
# TODO: Construct FileMetaData object with merged_metadata
new_meta.write_metadata_file('temp.parq/_common_metadata')
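One hedged way to fill in that TODO with pyarrow is to go through the Arrow schema: read the schema from _common_metadata, attach the merged key/value metadata with Schema.with_metadata, and rewrite the file with pq.write_metadata (a sketch, assuming the temp.parq dataset from above):
import pyarrow.parquet as pq

# read the existing common metadata and convert it to an Arrow schema
meta = pq.read_metadata('temp.parq/_common_metadata')
schema = meta.schema.to_arrow_schema()

custom_metadata = {b'type': b'mydataset'}
merged_metadata = {**custom_metadata, **(schema.metadata or {})}

# attach the merged metadata and rewrite the _common_metadata file
pq.write_metadata(schema.with_metadata(merged_metadata), 'temp.parq/_common_metadata')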
One possibility (that does not directly answer the question) is to use dask.
import dask
# Sample data
df = dask.datasets.timeseries()
df.to_parquet('test.parq', custom_metadata={'mymeta': 'myvalue'})
Dask does this by writing the metadata to all the files in the directory, including _common_metadata and _metadata.
from pathlib import Path
import pyarrow.parquet as pq
files = Path('test.parq').glob('*')
all([b'mymeta' in pq.ParquetFile(file).metadata.metadata for file in files])
# True

Is there a way to also set AWS metadata values when saving a data frame as a parquet file to S3 using df.to_parquet?

When I put an object in Python, I can set the metadata at the same time. Example:
self.s3_client.put_object(
    Bucket=self._bucket,
    Key=key,
    Body=body,
    ContentEncoding=self._compression,
    ContentType="application/json",
    ContentLanguage="en-US",
    Metadata={'other-key': 'value'}
)
It seems like neither pyarrow nor fastparquet lets me pass those particular keywords, despite the pandas documentation saying that extra keywords are passed on.
This saves the data how I want it to, but I can't seem to attach the metadata with any syntax that I try.
df.to_parquet(s3_path, compression='gzip')
If there was an easy way to compress the parquet and convert it to a bytestream, I could use put_object directly. I would rather not write the file twice (either locally and then transferring to AWS, or twice on AWS).
Ok. Found it quicker than I thought.
import io

import boto3
import pandas as pd

# read in data to df
df = pd.read_csv('file.csv')

body = io.BytesIO()
df.to_parquet(
    path=body,
    compression="gzip",
    engine="pyarrow",
)

s3_client = boto3.client('s3')
bucket = 'MY_BUCKET'
key = 'prefix/key'
s3_client.put_object(
    Bucket=bucket,
    Key=key,
    Body=body.getvalue(),
    ContentEncoding='gzip',
    ContentType="application/x-parquet",
    ContentLanguage="en-US",
    Metadata={'user-key': 'value'},
)
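To confirm the metadata was attached, a quick check with head_object (reusing the bucket, key, and client from above):
resp = s3_client.head_object(Bucket=bucket, Key=key)
print(resp['Metadata'])  # expected: {'user-key': 'value'}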

pandas write dataframe to parquet format with append

I am trying to write a pandas dataframe to parquet file format (introduced in the most recent pandas version, 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with new data. What am I missing?
the write syntax is
df.to_parquet(path, mode='append')
the read syntax is
pd.read_parquet(path)
Looks like it's possible to append row groups to an already existing parquet file using fastparquet. This is quite a unique feature, since most libraries don't implement it.
Below is from pandas doc:
DataFrame.to_parquet(path, engine='auto', compression='snappy', index=None, partition_cols=None, **kwargs)
We have to pass in both engine and **kwargs:
engine: {'auto', 'pyarrow', 'fastparquet'}
**kwargs - Additional arguments passed to the parquet library.
**kwargs - here we need to pass append=True (from fastparquet).
import pandas as pd
import os.path
file_path = "D:\\dev\\output.parquet"
df = pd.DataFrame(data={'col1': [1, 2,], 'col2': [3, 4]})
if not os.path.isfile(file_path):
    df.to_parquet(file_path, engine='fastparquet')
else:
    df.to_parquet(file_path, engine='fastparquet', append=True)
If append is set to True and the file does not exist, then you will see the error below:
AttributeError: 'ParquetFile' object has no attribute 'fmd'
Running the above script 3 times, I end up with three copies of the data in the parquet file.
If I inspect the metadata, I can see that this resulted in 3 row groups.
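One way to do that inspection (a sketch with pyarrow, assuming the same file_path):
import pyarrow.parquet as pq

meta = pq.ParquetFile(file_path).metadata
print(meta.num_row_groups)  # 3 after running the script three times
print(meta.num_rows)        # 6 rows in total (2 rows appended per run)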
Note:
Append could be inefficient if you write too many small row groups. The typically recommended size of a row group is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small row groups. Compression will work better, since compression operates within a row group only. There will also be less overhead spent on storing statistics, since each row group stores its own statistics.
To append, do this:
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
dataframe = pd.read_csv('content.csv')
output = "/Users/myTable.parquet"
# Create a parquet table from your dataframe
table = pa.Table.from_pandas(dataframe)
# Write direct to your parquet file
pq.write_to_dataset(table , root_path=output)
This will automatically append to your table.
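As a quick sanity check (a sketch, assuming the same output path), reading the directory back should return the rows from every call:
combined = pq.read_table(output).to_pandas()
print(len(combined))  # grows by len(dataframe) on each run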
I used the AWS Wrangler library. It works like a charm.
Below are the reference docs
https://aws-data-wrangler.readthedocs.io/en/latest/stubs/awswrangler.s3.to_parquet.html
I read from a Kinesis stream and used the kinesis-python library to consume the messages and write to s3. The JSON processing logic is not included, as this post deals with the problem of being unable to append data to s3. This was executed in an AWS SageMaker Jupyter notebook.
Below is the sample code I used:
!pip install awswrangler
import awswrangler as wr
import pandas as pd
evet_data=pd.DataFrame({'a': [a], 'b':[b],'c':[c],'d':[d],'e': [e],'f':[f],'g': [g]},columns=['a','b','c','d','e','f','g'])
#print(evet_data)
s3_path="s3://<your bucket>/table/temp/<your folder name>/e="+e+"/f="+str(f)
try:
    wr.s3.to_parquet(
        df=evet_data,
        path=s3_path,
        dataset=True,
        partition_cols=['e', 'f'],
        mode="append",
        database="wat_q4_stg",
        table="raw_data_v3",
        catalog_versioning=True  # Optional
    )
    print("write successful")
except Exception as e:
    print(str(e))
Happy to clarify anything. In a few other posts I have read suggestions to read the existing data back and overwrite it again, but as the data gets larger that slows down the process; it is inefficient.
There is no append mode in pandas.to_parquet(). What you can do instead is read the existing file, add the new data, and write it back, overwriting the original.
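A minimal sketch of that read-modify-overwrite pattern (the path and the new_rows frame are hypothetical):
import pandas as pd

existing = pd.read_parquet('output.parquet')
combined = pd.concat([existing, new_rows], ignore_index=True)  # new_rows: the rows to append
combined.to_parquet('output.parquet')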
Use the fastparquet write function
from fastparquet import write
write(file_name, df, append=True)
The file must already exist as I understand it.
API is available here (for now at least): https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
Pandas to_parquet() can handle both single files and directories with multiple files in them. Pandas will silently overwrite the file if it is already there. To append to a parquet dataset, just add a new file to the same parquet directory.
import datetime
import os

import pandas as pd

os.makedirs(path, exist_ok=True)
# write append (replace the naming logic with what works for you)
filename = f'{datetime.datetime.utcnow().timestamp()}.parquet'
df.to_parquet(os.path.join(path, filename))
# read
pd.read_parquet(path)

Ingesting 150 csv's into one data source

Hello, I am completely new to handling big data and comfortable in Python.
I have 150 CSVs, each of size 70 MB, which I have to integrate into one source to pull out basic stats like unique counts, unique names and so on.
Could anyone suggest how I should go about it?
I came across a package 'pyelasticsearch' in Python; how feasible is it for me to use in Enthought Canopy?
Suggestions needed!
Try to use the pandas package.
Reading a single csv would be:
import pandas as pd
df = pd.read_csv('filelocation.csv')
In the case of multiple files, just concat them. Let's say ls is a list of file locations; then:
df = pd.concat([pd.read_csv(f) for f in ls])
And then to write them out as a single file, do:
df.to_csv('output.csv')
Of course, all of this is valid for in-memory operations (70 MB x 150 = ~10.5 GB of RAM). If that's not possible, consider building an incremental process or using dask dataframes.
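If the in-memory route does not fit, a sketch of the dask option (assuming the CSVs share a schema; the folder path and the 'name' column are placeholders):
import dask.dataframe as dd

# lazily reads all 150 files without loading them into memory at once
ddf = dd.read_csv('path/to/csvs/*.csv')
# unique count of a column, computed out of core
print(ddf['name'].nunique().compute())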
One option, if you are in AWS:
Step 1 - move the data to S3 (AWS native file storage)
Step 2 - create a table for each data structure in Redshift
Step 3 - run the COPY command to move the data from S3 to Redshift (AWS native DW)
The COPY command loads data in bulk and can pick up files by name pattern.
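A hedged sketch of Step 3 from Python with psycopg2 (the connection details, table, bucket prefix, and IAM role are all placeholders):
import psycopg2

conn = psycopg2.connect(host='my-cluster.example.redshift.amazonaws.com', port=5439,
                        dbname='mydb', user='myuser', password='mypassword')
with conn.cursor() as cur:
    # bulk-load every CSV object under the prefix into the target table
    cur.execute("""
        COPY my_table
        FROM 's3://my-bucket/csv-prefix/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        FORMAT AS CSV
        IGNOREHEADER 1;
    """)
conn.commit()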
