Read multiple parquet files with selected columns into one Pandas dataframe - python

I am trying to read multiple parquet files with selected columns into one Pandas dataframe. The parquet files don't all share the same columns. I tried adding a filters argument to pd.read_parquet(), but it doesn't seem to work when reading multiple files. How can I make this work?
from pathlib import Path
import pandas as pd
data_dir = Path('dir/to/parquet/files')
full_df = pd.concat(
    pd.read_parquet(parquet_file)
    for parquet_file in data_dir.glob('*.parquet')
)

full_df = pd.concat(
    pd.read_parquet(parquet_file, filters=[('name', 'address', 'email')])
    for parquet_file in data_dir.glob('*.parquet')
)

Reading from multiple files is well supported. However, if your schemas differ then it is a bit trickier. Pyarrow currently defaults to using the schema of the first file it finds in a dataset, to avoid the up-front cost of inspecting the schema of every file in a large dataset.
Arrow C++ has the capability to override this and scan every file, but this is not yet exposed in pyarrow. However, if you know the unified schema ahead of time you can supply it and you will get the behavior you want. You will need to use the pyarrow.dataset module directly to do this, as specifying a schema is not part of pyarrow.parquet.read_table (which is what pandas.read_parquet calls).
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pandas as pd
import tempfile

tab1 = pa.Table.from_pydict({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
tab2 = pa.Table.from_pydict({'b': ['a', 'b', 'c'], 'c': [True, False, True]})
unified_schema = pa.unify_schemas([tab1.schema, tab2.schema])

with tempfile.TemporaryDirectory() as dataset_dir:
    pq.write_table(tab1, f'{dataset_dir}/one.parquet')
    pq.write_table(tab2, f'{dataset_dir}/two.parquet')

    print('Basic read of directory will use schema from first file')
    print(pd.read_parquet(dataset_dir))
    print()

    print('You can specify the unified schema if you know it')
    dataset = ds.dataset(dataset_dir, schema=unified_schema)
    print(dataset.to_table().to_pandas())
    print()

    print('The columns option will limit which columns are returned from read_parquet')
    print(pd.read_parquet(dataset_dir, columns=['b']))
    print()

    print('The columns option can be used when specifying a schema as well')
    dataset = ds.dataset(dataset_dir, schema=unified_schema)
    print(dataset.to_table(columns=['b', 'c']).to_pandas())
If you don't know the unified schema ahead of time you can create it by inspecting all the files yourself:
# You could also use glob here or whatever tool you want to
# get the list of files in your dataset
dataset = ds.dataset(dataset_dir)
schemas = [pq.read_schema(dataset_file) for dataset_file in dataset.files]
print(pa.unify_schemas(schemas))
Since this might be expensive (especially if working with a remote filesystem) you may want to save off the unified schema in its own file (saving a parquet file or Arrow IPC file with 0 batches is usually sufficient) instead of recalculating it every time.
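A minimal sketch of saving the schema that way (the file name dataset_schema.parquet is arbitrary, and unified_schema is the schema computed above):
import pyarrow as pa
import pyarrow.parquet as pq

# a 0-row table that carries only the schema
empty = pa.Table.from_batches([], schema=unified_schema)
pq.write_table(empty, 'dataset_schema.parquet')

# later, recover the schema without touching the data files
saved_schema = pq.read_schema('dataset_schema.parquet')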

Related

How to split a large .csv file using dask?

I am trying to use dask in order to split a huge tab-delimited file into smaller chunks on an AWS Batch array of 100,000 cores.
In AWS Batch each core has a unique environment variable AWS_BATCH_JOB_ARRAY_INDEX ranging from 0 to 99,999 (which is copied into the idx variable in the snippet below). Thus, I am trying to use the following code:
import os
import dask.dataframe as dd
idx = int(os.environ["AWS_BATCH_JOB_ARRAY_INDEX"])
df = dd.read_csv(f"s3://main-bucket/workdir/huge_file.tsv", sep='\t')
df = df.repartition(npartitions=100_000)
df = df.partitions[idx]
df = df.persist() # this call isn't needed before calling to df.to_csv (see comment by Sultan)
df = df.compute() # this call isn't needed before calling to df.to_csv (see comment by Sultan)
df.to_csv(f"/tmp/split_{idx}.tsv", sep="\t", index=False)
print(idx, df.shape, df.head(5))
Do I need to call persist and/or compute before calling df.to_csv?
When I have to split a big file into multiple smaller ones, I simply run the following code.
Read and repartition
import dask.dataframe as dd
df = dd.read_csv("file.csv")
df = df.repartition(npartitions=100)
Save to csv
o = df.to_csv("out_csv/part_*.csv", index=False)
Save to parquet
o = df.to_parquet("out_parquet/")
Here you can use write_metadata_file=False if you want to avoid metadata.
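For example, the same call without the _metadata file (write_metadata_file is an option of dask's to_parquet):
o = df.to_parquet("out_parquet/", write_metadata_file=False)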
A few notes:
I don't think you really need persist or compute, since you can save directly to disk. When you run into problems like memory errors, it is safer to save to disk rather than compute.
I found using parquet format at least 3x faster than csv when it's time to write.
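Applied to the original AWS Batch snippet, a rough sketch that skips persist/compute entirely might look like this (single_file=True is an assumption, used so each task produces exactly one output file):
import os
import dask.dataframe as dd

idx = int(os.environ["AWS_BATCH_JOB_ARRAY_INDEX"])
df = dd.read_csv("s3://main-bucket/workdir/huge_file.tsv", sep="\t")
df = df.repartition(npartitions=100_000)
# select the one partition this task is responsible for and write it straight to disk
df.partitions[idx].to_csv(f"/tmp/split_{idx}.tsv", sep="\t", index=False, single_file=True)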

Reading DataFrames saved as parquet with pyarrow, save filenames in columns

I want to read a folder full of parquet files that contain pandas DataFrames. In addition to the data that I'm reading I want to store the filenames from which the data is read in the column "file_origin". In pandas I am able to do it like this:
import pandas as pd
from pathlib import Path
data_dir = Path("path_of_folder_with_files")
df = pd.concat(
    pd.read_parquet(parquet_file).assign(file_origin=parquet_file.name)
    for parquet_file in data_dir.glob("*")
)
Unfortunately this is quite slow. Is there a similar way to do this with pyarrow (or any other efficient package)?
import pyarrow.parquet as pq
table = pq.read_table(data_dir, use_threads=True)
df = table.to_pandas()
You could implement it using arrow instead of pandas:
import pyarrow as pa
import pyarrow.parquet as pq

batches = []
for file_name in data_dir.glob("*"):
    table = pq.read_table(file_name)
    # Path objects must be converted to str before building the Arrow string array
    table = table.append_column("file_name", pa.array([file_name.name] * len(table), pa.string()))
    batches.extend(table.to_batches())
full_table = pa.Table.from_batches(batches)
I don't expect it to be significantly faster, unless you have a lot of strings and objects in your table (which are slow in pandas).

How do you batch add column headers to all CSV files in a directory and save those files back out?

I have hundreds of CSV files without headers that get exported from some software I use. The number of columns and the exact column headers may vary between batches, but never varies within batches.
I am learning Pandas and I need some help to put together a very simple notebook that loads all the CSV files in a directory, and adds the column headers I choose to all the files in that directory and saves them as the same CSV files (same names) but now with headers included in the file.
As I said, certain batches will vary in the number of columns that need headers and what the headers will be so it would be nice to preserve the ability to change the headers at will.
I have the following code and it works great with one file. How do I loop over all files in the directory, add the same headers, and save the files?
import pandas as pd
df_csv = pd.read_csv('/Users/F/Desktop/FPython/File1.csv', names=['A', 'B', 'C'])
df_csv.to_csv('/Users/F/Desktop/FPython/File1.csv', index=False)
Try using the pandas module, specifically the read_csv and to_csv methods. This way you can give the imported dataframe the headers you need as column names and then save the modified dataframe back to CSV.
You can use the glob module to iterate over all the .csv files in your folder:
import glob
import pandas as pd

files = glob.glob('./*.csv')

def manipulate_headers(df):
    # set whatever column names fit the current batch
    return df.set_axis(['A', 'B', 'C', 'D', 'E', 'F'], axis=1)

for file_name in files:
    df = pd.read_csv(file_name, header=None)  # the files have no header row yet
    df = manipulate_headers(df)
    df.to_csv(file_name, index=False)
Here manipulate_headers() is your own method for handling the header data and changing the column names; I just provide one possible manipulation that sets new column names.
Note:
I recommend saving the modified files in a new folder or under new file names, so you always keep the original files as a backup in case something goes wrong.
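A minimal sketch of that safer variant, writing into a separate output folder instead of overwriting (the folder name with_headers is arbitrary; files and manipulate_headers are from above):
import os

os.makedirs('with_headers', exist_ok=True)
for file_name in files:
    df = manipulate_headers(pd.read_csv(file_name, header=None))
    # keep the original file untouched and write the headed copy into the new folder
    df.to_csv(os.path.join('with_headers', os.path.basename(file_name)), index=False)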

Dask dataframes: reading multiple files & storing filename in column

I regularly use dask.dataframe to read multiple files, as so:
import dask.dataframe as dd
df = dd.read_csv('*.csv')
However, the origin of each row, i.e. which file the data was read from, seems to be forever lost.
Is there a way to add this as a column, e.g. df.loc[:100, 'partition'] = 'file1.csv' if file1.csv is the first file and contains 100 rows? This would be applied to each "partition" / file that is read into the dataframe, when compute is triggered as part of a workflow.
The idea is that different logic can then be applied depending on the source.
Dask functions read_csv, read_table, and read_fwf now include a parameter include_path_column:
include_path_column : bool or str, optional
Whether or not to include the path to each particular file. If True a new column is added to the dataframe called path. If str, sets new column name. Default is False.
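For example, a minimal sketch using that parameter (the column name source_file is arbitrary):
import dask.dataframe as dd

ddf = dd.read_csv('*.csv', include_path_column='source_file')  # adds a column holding each row's file path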
Assuming you have or can make a file_list list that has the file path of each csv file, and each individual file fits in RAM (you mentioned 100 rows), then this should work:
import pandas as pd
import dask.dataframe as dd
from dask import delayed
def read_and_label_csv(filename):
    # reads each csv file to a pandas.DataFrame
    df_csv = pd.read_csv(filename)
    df_csv['partition'] = filename.split('\\')[-1]
    return df_csv
# create a list of functions ready to return a pandas.DataFrame
dfs = [delayed(read_and_label_csv)(fname) for fname in file_list]
# using delayed, assemble the pandas.DataFrames into a dask.DataFrame
ddf = dd.from_delayed(dfs)
With some customization, of course. If your csv files are bigger than RAM, then a concatenation of dask.DataFrames is probably the way to go.
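A rough sketch of that bigger-than-RAM variant, reusing file_list and the path handling from above (the blocksize value is arbitrary):
ddfs = [
    dd.read_csv(fname, blocksize="64MB").assign(partition=fname.split('\\')[-1])
    for fname in file_list
]
ddf = dd.concat(ddfs)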

pandas write dataframe to parquet format with append

I am trying to write a pandas dataframe to the parquet file format (introduced in pandas 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with new data. What am I missing?
the write syntax is
df.to_parquet(path, mode='append')
the read syntax is
pd.read_parquet(path)
It looks like it's possible to append row groups to an already existing parquet file using fastparquet. This is quite a unique feature, since most libraries don't implement it.
Below is from the pandas docs:
DataFrame.to_parquet(path, engine='auto', compression='snappy', index=None, partition_cols=None, **kwargs)
We have to pass in both engine and **kwargs:
engine : {'auto', 'pyarrow', 'fastparquet'}
**kwargs : additional arguments passed to the parquet library.
The **kwargs we need to pass here is append=True (from fastparquet).
import pandas as pd
import os.path

file_path = "D:\\dev\\output.parquet"
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

if not os.path.isfile(file_path):
    df.to_parquet(file_path, engine='fastparquet')
else:
    df.to_parquet(file_path, engine='fastparquet', append=True)
If append is set to True and the file does not exist, you will see the error below:
AttributeError: 'ParquetFile' object has no attribute 'fmd'
Running the above script 3 times appends the rows three times to the parquet file. If I inspect the metadata, I can see that this resulted in 3 row groups.
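If you want to verify that yourself, a quick sketch using pyarrow to inspect the metadata (file_path is from the snippet above; fastparquet has a similar API):
import pyarrow.parquet as pq

print(pq.ParquetFile(file_path).metadata.num_row_groups)  # prints 3 after three runs of the script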
Note:
Append could be inefficient if you write too many small row groups. The typically recommended size of a row group is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small row groups: compression will work better, since compression operates within a row group only, and there will be less overhead spent on storing statistics, since each row group stores its own statistics.
To append, do this:
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
dataframe = pd.read_csv('content.csv')
output = "/Users/myTable.parquet"
# Create a parquet table from your dataframe
table = pa.Table.from_pandas(dataframe)
# Write to the dataset directory; each call adds a new file under root_path
pq.write_to_dataset(table, root_path=output)
This will effectively append to your table: each call adds another file to the dataset directory.
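To read the whole dataset back into pandas afterwards, a one-line sketch (output is the root_path used above):
df_all = pq.read_table(output).to_pandas()  # reads every file written under the dataset directory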
I used the AWS Data Wrangler library. It works like a charm.
Below are the reference docs:
https://aws-data-wrangler.readthedocs.io/en/latest/stubs/awswrangler.s3.to_parquet.html
I read from a Kinesis stream and used the kinesis-python library to consume the messages and write them to S3. The JSON processing logic is not included, as this post deals with the problem of being unable to append data to S3. The code was executed in an AWS SageMaker Jupyter notebook.
Below is the sample code I used:
!pip install awswrangler
import awswrangler as wr
import pandas as pd
event_data = pd.DataFrame(
    {'a': [a], 'b': [b], 'c': [c], 'd': [d], 'e': [e], 'f': [f], 'g': [g]},
    columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
# print(event_data)
s3_path = "s3://<your bucket>/table/temp/<your folder name>/e=" + e + "/f=" + str(f)
try:
    wr.s3.to_parquet(
        df=event_data,
        path=s3_path,
        dataset=True,
        partition_cols=['e', 'f'],
        mode="append",
        database="wat_q4_stg",
        table="raw_data_v3",
        catalog_versioning=True  # Optional
    )
    print("write successful")
except Exception as err:  # renamed so the exception doesn't shadow the value `e` used above
    print(str(err))
I'm happy to help with any clarifications. In a few other posts I have read suggestions to read the data back and overwrite it again, but as the data gets larger that slows down the process and is inefficient.
There is no append mode in pandas.DataFrame.to_parquet(). What you can do instead is read the existing file, append your new data to it in memory, and write it back, overwriting the file.
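A minimal sketch of that read-and-rewrite approach (path and new_rows are placeholders; it rereads everything on every append, so it gets slow as the file grows):
import pandas as pd

existing = pd.read_parquet(path)
pd.concat([existing, new_rows], ignore_index=True).to_parquet(path)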
Use the fastparquet write function
from fastparquet import write
write(file_name, df, append=True)
The file must already exist, as I understand it; a guarded sketch follows below.
API is available here (for now at least): https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
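A small sketch that guards the first write, since append=True fails when the file does not exist yet (file_name and df are placeholders):
import os.path
from fastparquet import write

write(file_name, df, append=os.path.isfile(file_name))  # append only if the file is already there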
Pandas to_parquet() can handle both single files and directories with multiple files in them. Pandas will silently overwrite the file if it is already there. To append to a parquet dataset, just add a new file to the same parquet directory.
import os
import datetime
import pandas as pd

# `path` is the directory holding the parquet dataset and `df` the dataframe to add
os.makedirs(path, exist_ok=True)

# write append (replace the naming logic with what works for you)
filename = f'{datetime.datetime.utcnow().timestamp()}.parquet'
df.to_parquet(os.path.join(path, filename))

# read
pd.read_parquet(path)
