Pandas and FastParquet read a single partition - python

I have a miserably long-running job to read in a dataset that has a natural, logical partition on US state. I have saved it as a partitioned parquet dataset from pandas using fastparquet (via DataFrame.to_parquet).
I want my buddy to be able to read in just a single partition (state) from the parquet folder that's created. read_parquet doesn't seem to have a filter option. Any thoughts?

Try using either dask or the pyarrow parquet reader. Filtering via pandas has worked for me with both.
How to read parquet file with a condition using pyarrow in Python
pip install pyarrow
pip install "dask[complete]"
import pyarrow.parquet as pq
import dask.dataframe as dd
import pandas as pd

# Option 1: lazy, filterable read with dask
path = ""
dask_df = dd.read_parquet(path, columns=["col1", "col2"], engine="pyarrow")
dask_filter_df = dask_df[dask_df.col1 == "filter here"]

# Option 2: read everything with pyarrow, then filter in pandas
path = ""
parquet_pandas_df = pq.ParquetDataset(path).read_pandas().to_pandas()
pandas_filter_df = parquet_pandas_df[parquet_pandas_df.col1 == "filter here"]
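If the dataset was written with a hive-style partition layout (directories like state=CA/), two more direct options are worth trying. This is a minimal sketch, assuming a partition column named "state" and the pyarrow engine; exact filters support depends on your pandas/pyarrow versions:
import pandas as pd

path = ""  # root folder of the partitioned dataset

# Option A: push the partition filter down to the engine
df_one_state = pd.read_parquet(path, engine="pyarrow", filters=[("state", "==", "CA")])

# Option B: point read_parquet at the single partition directory, e.g. <path>/state=CA/
# (the "state" column itself is then not included in the resulting frame)
df_one_state_dir = pd.read_parquet(path + "/state=CA/")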

Related

Reading DataFrames saved as parquet with pyarrow, save filenames in columns

I want to read a folder full of parquet files that contain pandas DataFrames. In addition to the data that I'm reading I want to store the filenames from which the data is read in the column "file_origin". In pandas I am able to do it like this:
import pandas as pd
from pathlib import Path
data_dir = Path("path_of_folder_with_files")
df = pd.concat(
    pd.read_parquet(parquet_file).assign(file_origin=parquet_file.name)
    for parquet_file in data_dir.glob("*")
)
Unfortunately this is quite slow. Is there a similar way to do this with pyarrow (or any other efficient package)?
import pyarrow.parquet as pq

# reads every file in the folder in one call, but without a file_origin column
table = pq.read_table(data_dir, use_threads=True)
df = table.to_pandas()
You could implement it using arrow instead of pandas:
import pyarrow as pa
import pyarrow.parquet as pq

batches = []
for file_name in data_dir.glob("*"):
    table = pq.read_table(file_name)
    # attach the source file name as a constant string column
    table = table.append_column("file_origin", pa.array([str(file_name)] * len(table), pa.string()))
    batches.extend(table.to_batches())
df = pa.Table.from_batches(batches).to_pandas()
I don't expect it to be significantly faster, unless you have a lot of strings and objects in your table (which are slow in pandas).
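Another option, sketched here under the assumption that the folder is a flat directory of parquet files, is the pyarrow.dataset API, which exposes one fragment per file so the path can be attached before concatenating:
import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset(data_dir, format="parquet")
tables = []
for fragment in dataset.get_fragments():
    t = fragment.to_table()
    # fragment.path is the file the rows came from
    t = t.append_column("file_origin", pa.array([fragment.path] * len(t), pa.string()))
    tables.append(t)
df = pa.concat_tables(tables).to_pandas()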

Python - read parquet data from a variable

I am reading a parquet file and transforming it into dataframe.
from fastparquet import ParquetFile
pf = ParquetFile('file.parquet')
df = pf.to_pandas()
Is there a way to read a parquet file from a variable (one that was previously read and now holds the parquet data)?
Thanks.
In pandas there is a method to deal with parquet. Here is a reference to the docs. Something like this:
import pandas as pd
pd.read_parquet('file.parquet')
should work. Also please read this post for engine selection.
You can also read a file from a variable using pandas.read_parquet with the following code. I tested this with the pyarrow backend, but it should also work with the fastparquet backend.
import pandas as pd
import io
with open("file.parquet", "rb") as f:
    data = f.read()  # raw parquet bytes held in a variable
buf = io.BytesIO(data)  # wrap the bytes in a file-like object
df = pd.read_parquet(buf)
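If you prefer to stay at the pyarrow level, the same in-memory bytes can be wrapped in a BufferReader instead; a minimal sketch, assuming pyarrow is installed:
import pyarrow as pa
import pyarrow.parquet as pq

# 'data' is the bytes object read above
table = pq.read_table(pa.BufferReader(data))
df = table.to_pandas()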

Pandas : Reading first n rows from parquet file?

I have a parquet file and I want to read first n rows from the file into a pandas data frame.
What I tried:
df = pd.read_parquet(path= 'filepath', nrows = 10)
It did not work and gave me error:
TypeError: read_table() got an unexpected keyword argument 'nrows'
I did try the skiprows argument as well but that also gave me same error.
Alternatively, I can read the complete parquet file and filter the first n rows, but that will require more computations which I want to avoid.
Is there any way to achieve it?
The accepted answer is out of date. It is now possible to read only the first few rows of a parquet file into pandas, though it is a bit messy and backend-dependent.
To read using PyArrow as the backend, follow below:
from pyarrow.parquet import ParquetFile
import pyarrow as pa
pf = ParquetFile('file_name.pq')
first_ten_rows = next(pf.iter_batches(batch_size=10))  # reads a single batch, not the whole file
df = pa.Table.from_batches([first_ten_rows]).to_pandas()
Change batch_size=10 to however many rows you want to read in.
After exploring around and getting in touch with the pandas dev team, the conclusion is that pandas does not support an nrows or skiprows argument when reading a parquet file.
The reason is that pandas uses the pyarrow or fastparquet engine to process parquet files, and pyarrow has no support for reading a file partially or skipping rows (not sure about fastparquet). Below is the link to the issue on the pandas GitHub for discussion.
https://github.com/pandas-dev/pandas/issues/24511
As an alternative you can use the S3 Select functionality from AWS SDK for pandas (awswrangler), as proposed by Abdel Jaidi in this answer.
pip install awswrangler
import awswrangler as wr

df = wr.s3.select_query(
    sql="SELECT * FROM s3object s limit 5",
    path="s3://filepath",
    input_serialization="Parquet",
    input_serialization_params={},
    use_threads=True,
)
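For completeness, pyarrow can also read a single row group rather than the whole file, which is the coarsest built-in form of partial reading; a small sketch, assuming the file has at least one row group:
import pyarrow.parquet as pq

pf = pq.ParquetFile('file_name.pq')
df = pf.read_row_group(0).to_pandas().head(10)  # only the first row group is loaded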
Parquet is a column-oriented storage format, designed for reading whole columns efficiently, so having to load at least a full row group just to access one row is expected.
Using a pyarrow dataset scanner:
import pyarrow.dataset as ds

n = 10
src_path = "/parquet/path"
df = ds.dataset(src_path).scanner().head(n).to_pandas()
The most straightforward option for me seems to be the dask library:
import dask.dataframe as dd
df = dd.read_parquet(path='filepath').head(10)
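Since the question mentions both engines, here is a rough fastparquet-side sketch as well; it assumes ParquetFile.iter_row_groups yields one pandas DataFrame per row group, so only the first row group is materialised:
from fastparquet import ParquetFile

pf = ParquetFile('file_name.pq')
first_row_group = next(pf.iter_row_groups())  # DataFrame for the first row group only
df = first_row_group.head(10)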

How to write a partitioned Parquet file using Pandas

I'm trying to write a Pandas dataframe to a partitioned file:
df.to_parquet('output.parquet', engine='pyarrow', partition_cols = ['partone', 'partwo'])
TypeError: __cinit__() got an unexpected keyword argument 'partition_cols'
From the documentation I expected that partition_cols would be passed as kwargs to the pyarrow library. How can a partitioned file be written to local disk using pandas?
Pandas DataFrame.to_parquet is a thin wrapper over table = pa.Table.from_pandas(...) and pq.write_table(table, ...) (see pandas.parquet.py#L120), and pq.write_table does not support writing partitioned datasets. You should use pq.write_to_dataset instead.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(yourData)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(
    table,
    root_path='output.parquet',
    partition_cols=['partone', 'partwo'],
)
For more info, see pyarrow documentation.
In general, I would always use the PyArrow API directly when reading / writing parquet files, since the Pandas wrapper is rather limited in what it can do.
First make sure that you have a reasonably recent version of pandas and pyarrow:
pyenv shell 3.8.2
python -m venv venv
source venv/bin/activate
pip install pandas pyarrow
pip freeze | grep pandas # pandas==1.2.3
pip freeze | grep pyarrow # pyarrow==3.0.0
Then you can use partition_cols to produce the partitioned parquet files:
import pandas as pd

# example dataframe with 3 rows and columns year, month, day, value
df = pd.DataFrame(data={'year': [2020, 2020, 2021],
                        'month': [1, 12, 2],
                        'day': [1, 31, 28],
                        'value': [1000, 2000, 3000]})
df.to_parquet('./mydf', partition_cols=['year', 'month', 'day'])
This produces:
mydf/year=2020/month=1/day=1/6f0258e6c48a48dbb56cae0494adf659.parquet
mydf/year=2020/month=12/day=31/cf8a45116d8441668c3a397b816cd5f3.parquet
mydf/year=2021/month=2/day=28/7f9ba3f37cb9417a8689290d3f5f9e6e.parquet
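To confirm the layout round-trips, the partitioned directory can be read back as a single frame; a minimal sketch (the partition columns year/month/day are reconstructed from the directory names, typically as categoricals):
import pandas as pd

df_back = pd.read_parquet('./mydf')  # partition columns come back from the directory names
print(df_back.head())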
You need to update to pandas version 0.24 or above; the partition_cols functionality was added in that version.

How do I use pandas.read_csv on Google Cloud ML?

I'm trying to deploy a training script on Google Cloud ML. Of course, I've uploaded my datasets (CSV files) to a bucket on GCS.
I used to import my data with read_csv from pandas, but it doesn't seem to work with a GCS path.
How should I proceed (I would like to keep using pandas)?
import pandas as pd
data = pd.read_csv("gs://bucket/folder/file.csv")
output :
ERROR 2018-02-01 18:43:34 +0100 master-replica-0 IOError: File gs://bucket/folder/file.csv does not exist
You will need to use file_io from tensorflow.python.lib.io to do that, as demonstrated below:
from tensorflow.python.lib.io import file_io
from io import StringIO  # pandas.compat.StringIO was removed in newer pandas
from pandas import read_csv

# read a csv file from google cloud storage
def read_data(gcs_path):
    file_stream = file_io.FileIO(gcs_path, mode='r')
    csv_data = read_csv(StringIO(file_stream.read()))
    return csv_data
Now call the above function
gcs_path = 'gs://bucket/folder/file.csv'  # change according to your bucket, folder and file name
df = read_data(gcs_path)
# print(df.head()) # displays top 5 rows including headers as default
Pandas does not have native GCS support. There are two alternatives:
1. copy the file to the VM using gsutil cli
2. use the TensorFlow file_io library to open the file, and pass the file object to pd.read_csv(). Please refer to the detailed answer here.
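Note that more recent pandas versions can read gs:// paths directly once the gcsfs package is installed; a minimal sketch, assuming gcsfs is available in your environment and credentials are configured:
pip install gcsfs
import pandas as pd

# pandas hands gs:// URLs off to gcsfs under the hood
data = pd.read_csv("gs://bucket/folder/file.csv")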
You could also use Dask to extract and then load the data into, let's say, a Jupyter Notebook running on GCP.
Make sure you have Dask installed.
conda install dask #conda
pip install dask[complete] #pip
import dask.dataframe as dd  # import

dataframe = dd.read_csv('gs://bucket/datafile.csv')  # read a single CSV file
dataframe2 = dd.read_csv('gs://bucket/path/*.csv')  # read multiple CSV files with a glob
This is all you need to load the data.
You can filter and manipulate data with Pandas syntax now.
dataframe['z'] = dataframe.x + dataframe.y  # lazy column arithmetic
dataframe_pd = dataframe.compute()  # materialise the result as a pandas DataFrame
