I'm trying to write a Pandas dataframe to a partitioned file:
df.to_parquet('output.parquet', engine='pyarrow', partition_cols = ['partone', 'partwo'])
TypeError: __cinit__() got an unexpected keyword argument 'partition_cols'
From the documentation I expected that partition_cols would be passed as kwargs to the pyarrow library. How can a partitioned file be written to local disk using pandas?
Pandas DataFrame.to_parquet is a thin wrapper over table = pa.Table.from_pandas(...) and pq.write_table(table, ...) (see pandas.parquet.py#L120), and pq.write_table does not support writing partitioned datasets. You should use pq.write_to_dataset instead.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.DataFrame(yourData)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(
    table,
    root_path='output.parquet',
    partition_cols=['partone', 'parttwo'],
)
For more info, see the pyarrow documentation.
In general, I would always use the PyArrow API directly when reading/writing parquet files, since the Pandas wrapper is rather limited in what it can do.
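For example, to read such a partitioned dataset back with the PyArrow API (a minimal sketch, reusing the root_path from above):
import pyarrow.parquet as pq
# reads every partition and restores partone/parttwo as columns
table = pq.read_table('output.parquet')
df = table.to_pandas()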
First make sure that you have a reasonably recent version of pandas and pyarrow:
pyenv shell 3.8.2
python -m venv venv
source venv/bin/activate
pip install pandas pyarrow
pip freeze | grep pandas # pandas==1.2.3
pip freeze | grep pyarrow # pyarrow==3.0.0
Then you can use partition_cols to produce the partitioned parquet files:
import pandas as pd
# example dataframe with 3 rows and columns year,month,day,value
df = pd.DataFrame(data={'year': [2020, 2020, 2021],
                        'month': [1, 12, 2],
                        'day': [1, 31, 28],
                        'value': [1000, 2000, 3000]})
df.to_parquet('./mydf', partition_cols=['year', 'month', 'day'])
This produces:
mydf/year=2020/month=1/day=1/6f0258e6c48a48dbb56cae0494adf659.parquet
mydf/year=2020/month=12/day=31/cf8a45116d8441668c3a397b816cd5f3.parquet
mydf/year=2021/month=2/day=28/7f9ba3f37cb9417a8689290d3f5f9e6e.parquet
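To read it back, pd.read_parquet accepts the dataset directory directly (a quick sketch):
import pandas as pd
# the partition columns year/month/day are reconstructed from the directory names
df = pd.read_parquet('./mydf')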
You need to update to Pandas version 0.24 or above; support for partition_cols was added in that version.
Related
I have a miserably long-running job to read in a dataset that has a natural, logical partition on US state. I have saved it as a partitioned parquet dataset from pandas using fastparquet (via df.to_parquet).
I want my buddy to be able to read in just a single partition (state) from the parquet folder that's created. read_parquet doesn't have a filtering ability. Any thoughts?
Try using either dask or the pyarrow parquet reader; filtering via pandas has worked for me.
How to read parquet file with a condition using pyarrow in Python
RUN pip install pyarrow
RUN pip install "dask[complete]"
import pyarrow.parquet as pq
import dask.dataframe as dd
import pandas as pd

# Option 1: filter lazily with dask
path = ""
dask_df = dd.read_parquet(path, columns=["col1", "col2"], engine="pyarrow")
dask_filter_df = dask_df[dask_df.col1 == "filter here"]

# Option 2: load with pyarrow, then filter with pandas
path = ""
parquet_pandas_df = pq.ParquetDataset(path).read_pandas().to_pandas()
pandas_filter_df = parquet_pandas_df[parquet_pandas_df.col1 == "filter here"]
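If the dataset is hive-partitioned (e.g. on state), you can also push the filter down so only the matching partition is read. A minimal sketch, assuming a partition column named state and the pyarrow engine:
import pandas as pd
# only files under state=CA are read; filters is forwarded to pyarrow
df = pd.read_parquet("path/to/dataset", engine="pyarrow", filters=[("state", "=", "CA")])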
Previously, I saved multiple columns of a dataset into one HDF file. The procedure can be outlined as follows:
import pandas as pd
from pandas import HDFStore, DataFrame
from pandas import read_hdf
hdf = HDFStore("FILE.h5")
feature = ['var1','var2']
## noted that the original dataframe is huge, and thus fake dataframe was generated as example.
for k in range(0, len(feature), 1):
    df = {'A':['1','2','3','4'], 'B':[4,5,6,7]}
    df = pd.DataFrame(df)
    hdf.put(feature[k], df, format='table', encoding="utf-8")
Then, I can read the file 'FILE.h5' by simply using
df = pd.read_hdf("./FILE.h5,'var1',encoding = 'utf-8')
It always worked well until I have upgraded my Python environment from 2.7 to 3.7.
For now with Python 3.7 and Pandas 0.24.2, the HDF file could not be correctly read. The error shows like:
df = pd.read_hdf("./FILE.h5,'var1',encoding = 'utf-8')
>>> ...
~/anaconda3/lib/python3.7/codecs.py in getdecoder(encoding)
961
962 """
--> 963 return lookup(encoding).decode
964
965 def getincrementalencoder(encoding):
TypeError: lookup() argument must be str, not numpy.bytes_
PS
I have read a GitHub issue that was similar to my situation, but it did not fix my problem. I then turned to the h5py package for dealing with HDF5-format files, but it was not as convenient as pandas.
Any advice or methods would be highly appreciated!
I think you have hit a prior bug in pandas (since you were using version 0.13). GitHub issues 12304 and 11126 indicate that there's a bug in read_hdf when you attempt to pass encodings in versions under 0.17.
Is upgrading to a modern version of pandas an option since you are already on 3.7?
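If so, a minimal check after upgrading (a sketch; whether it clears the lookup error depends on how the old file stored its encoding):
import pandas as pd
# recent pandas versions handle the stored encoding themselves
df = pd.read_hdf("FILE.h5", "var1")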
I have a parquet file and I want to read first n rows from the file into a pandas data frame.
What I tried:
df = pd.read_parquet(path= 'filepath', nrows = 10)
It did not work and gave me error:
TypeError: read_table() got an unexpected keyword argument 'nrows'
I did try the skiprows argument as well but that also gave me same error.
Alternatively, I can read the complete parquet file and filter the first n rows, but that will require more computations which I want to avoid.
Is there any way to achieve it?
The accepted answer is out of date. It is now possible to read only the first few rows of a parquet file into pandas, though it is a bit messy and backend-dependent.
To read using PyArrow as the backend, do the following:
from pyarrow.parquet import ParquetFile
import pyarrow as pa
pf = ParquetFile('file_name.pq')
first_ten_rows = next(pf.iter_batches(batch_size = 10))
df = pa.Table.from_batches([first_ten_rows]).to_pandas()
Change the line batch_size = 10 to match however many rows you want to read in.
After exploring around and getting in touch with the pandas dev team, the bottom line is that pandas does not support the arguments nrows or skiprows while reading a parquet file.
The reason is that pandas uses the pyarrow or fastparquet engines to process parquet files, and pyarrow has no support for reading a file partially or skipping rows (I'm not sure about fastparquet). Below is the link to the pandas GitHub issue with the discussion.
https://github.com/pandas-dev/pandas/issues/24511
As an alternative you can use S3 Select functionality from AWS SDK for pandas as proposed by Abdel Jaidi in this answer.
pip install awswrangler
import awswrangler as wr
df = wr.s3.select_query(
sql="SELECT * FROM s3object s limit 5",
path="s3://filepath",
input_serialization="Parquet",
input_serialization_params={},
use_threads=True,
)
Parquet is a column-oriented storage format, designed for reading columns rather than individual rows, so it's normal to load the whole file just to access one row.
Using pyarrow dataset scanner:
import pyarrow.dataset as ds

n = 10
src_path = "/parquet/path"
df = ds.dataset(src_path).scanner().head(n).to_pandas()
The most straightforward option for me seems to be using the dask library:
import dask.dataframe as dd
df = dd.read_parquet(path= 'filepath').head(10)
I'm trying to deploy a training script on Google Cloud ML. Of course, I've uploaded my datasets (CSV files) in a bucket on GCS.
I used to import my data with read_csv from pandas, but it doesn't seem to work with a GCS path.
How should I proceed (I would like to keep using pandas)?
import pandas as pd
data = pd.read_csv("gs://bucket/folder/file.csv")
Output:
ERROR 2018-02-01 18:43:34 +0100 master-replica-0 IOError: File gs://bucket/folder/file.csv does not exist
You will need to use file_io from tensorflow.python.lib.io to do that, as demonstrated below:
from tensorflow.python.lib.io import file_io
from pandas.compat import StringIO
from pandas import read_csv
# read csv file from google cloud storage
def read_data(gcs_path):
    file_stream = file_io.FileIO(gcs_path, mode='r')
    csv_data = read_csv(StringIO(file_stream.read()))
    return csv_data
Now call the above function
gcs_path = 'gs://bucket/folder/file.csv' # change path according to your bucket, folder and path
df = read_data(gcs_path)
# print(df.head()) # displays top 5 rows including headers as default
Pandas does not have native GCS support. There are two alternatives:
1. copy the file to the VM using gsutil cli
2. use the TensorFlow file_io library to open the file, and pass the file object to pd.read_csv(). Please refer to the detailed answer here.
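For the first alternative, a quick sketch (using the path from the question):
gsutil cp gs://bucket/folder/file.csv .
and then, locally:
import pandas as pd
data = pd.read_csv("file.csv")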
You could also use Dask to extract and then load the data into, let's say, a Jupyter Notebook running on GCP.
Make sure you have Dask installed.
conda install dask            # conda
pip install "dask[complete]"  # pip
import dask.dataframe as dd   # import
dataframe = dd.read_csv('gs://bucket/datafile.csv')  # read a single CSV file
dataframe2 = dd.read_csv('gs://bucket/path/*.csv')   # read multiple CSV files
This is all you need to load the data.
You can filter and manipulate data with Pandas syntax now.
dataframe['z'] = dataframe.x + dataframe.y  # lazy, Pandas-style transformation
dataframe_pd = dataframe.compute()          # materialize into a pandas DataFrame
The Python library pandas can read Excel spreadsheets and convert them to a pandas.DataFrame with the pandas.read_excel(file) command. Under the hood, it uses the xlrd library, which does not support ods files.
Is there an equivalent of pandas.read_excel for ods files? If not, how can I do the same for an Open Document Formatted spreadsheet (ods file)? ODF is used by LibreOffice and OpenOffice.
This is available natively in pandas 0.25. As long as you have odfpy installed (conda install odfpy or pip install odfpy), you can do:
pd.read_excel("the_document.ods", engine="odf")
You can read ODF (Open Document Format .ods) documents in Python using the following modules:
odfpy / read-ods-with-odfpy
ezodf
pyexcel / pyexcel-ods
py-odftools
simpleodspy
Using ezodf, a simple ODS-to-DataFrame converter could look like this:
import pandas as pd
import ezodf
doc = ezodf.opendoc('some_odf_spreadsheet.ods')
print("Spreadsheet contains %d sheet(s)." % len(doc.sheets))
for sheet in doc.sheets:
    print("-" * 40)
    print("   Sheet name : '%s'" % sheet.name)
    print("Size of Sheet : (rows=%d, cols=%d)" % (sheet.nrows(), sheet.ncols()))

# convert the first sheet to a pandas.DataFrame
sheet = doc.sheets[0]
df_dict = {}
for i, row in enumerate(sheet.rows()):
    # row is a list of cells
    # assume the header is on the first row
    if i == 0:
        # columns as lists in a dictionary
        df_dict = {cell.value: [] for cell in row}
        # create index for the column headers
        col_index = {j: cell.value for j, cell in enumerate(row)}
        continue
    for j, cell in enumerate(row):
        # use header instead of column index
        df_dict[col_index[j]].append(cell.value)

# and convert to a DataFrame
df = pd.DataFrame(df_dict)
P.S.
ODF spreadsheet (*.ods file) support was requested on the pandas issue tracker: https://github.com/pydata/pandas/issues/2311. It was not implemented when this answer was written (see the pandas 0.25 answer above for native support).
ezodf was used in the unfinished PR9070 to implement ODF support in pandas. That PR is now closed (read the PR for a technical discussion), but it is still available as an experimental feature in this pandas fork.
There are also some brute-force methods for reading directly from the XML code (here).
Here is a quick and dirty hack which uses ezodf module:
import pandas as pd
import ezodf
def read_ods(filename, sheet_no=0, header=0):
    tab = ezodf.opendoc(filename=filename).sheets[sheet_no]
    return pd.DataFrame({col[header].value: [x.value for x in col[header+1:]]
                         for col in tab.columns()})
Test:
In [92]: df = read_ods(filename='fn.ods')
In [93]: df
Out[93]:
     a    b    c
0  1.0  2.0  3.0
1  4.0  5.0  6.0
2  7.0  8.0  9.0
NOTES:
all other useful parameters like header, skiprows, index_col, and parse_cols are NOT implemented in this function; feel free to update this answer if you want to implement them
ezodf depends on lxml; make sure you have it installed
Pandas now supports .ods files. You must install the odfpy module first; then it will work like a normal .xls file.
conda install -c conda-forge odfpy
Then:
pd.read_excel('FILE_NAME.ods', engine='odf')
Edit: Happily, the rest of this answer is now out of date, if you can update to a recent Pandas version.
If you'd still like to work from a Pandas version of your data, and update it from ODS only when needed, read on.
It seems the answer is No!
And I would characterize the tools for reading ODS as still ragged.
If you're on POSIX, maybe the strategy of exporting to xlsx on the fly before using Pandas' very nice importing tools for xlsx is an option:
unoconv -f xlsx -o tmp.xlsx myODSfile.ods
Altogether, my code looks like:
import pandas as pd
import os
if fileOlderThan('tmp.xlsx', 'myODSfile.ods'):
    os.system('unoconv -f xlsx -o tmp.xlsx myODSfile.ods')

xl_file = pd.ExcelFile('tmp.xlsx')
dfs = {sheet_name: xl_file.parse(sheet_name)
       for sheet_name in xl_file.sheet_names}
df = dfs['Sheet1']
Here fileOlderThan() is a function (see http://github.com/cpbl/cpblUtilities) which returns true if tmp.xlsx does not exist or is older than the .ods file.
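If you don't want that dependency, a minimal stand-in for fileOlderThan (my own sketch, not the version from that repository) could be:
import os
def fileOlderThan(target, source):
    # True if `target` does not exist or was last modified before `source`
    return (not os.path.exists(target)
            or os.path.getmtime(target) < os.path.getmtime(source))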
Another option: read-ods-with-odfpy. This module takes an OpenDocument Spreadsheet as input, and returns a list, out of which a DataFrame can be created.
If you only have a few .ods files to read, I would just open them in OpenOffice and save them as Excel files. If you have a lot of files, you could use the unoconv command in Linux to convert the .ods files to .xls programmatically (with bash).
Then it's really easy to read it in with pd.read_excel('filename.xls')
I've had good luck with pandas read_clipboard.
Select cells and then copy from Excel or OpenDocument.
In Python, run the following:
import pandas as pd
data = pd.read_clipboard()
Pandas will do a good job based on the cells copied.
Some responses have pointed out that odfpy or other external packages are needed to get this functionality, but note that recent versions of Pandas (current is 1.1, August 2020) support the ODS format in functions like pd.ExcelWriter() and pd.read_excel(). You only need to specify the proper engine, "odf", to be able to work with OpenDocument file formats (.odf, .ods, .odt).
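For example, a minimal round trip with the odf engine (the file name is just an illustration):
import pandas as pd
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
# write an .ods file
with pd.ExcelWriter("example.ods", engine="odf") as writer:
    df.to_excel(writer, index=False)
# read it back
df2 = pd.read_excel("example.ods", engine="odf")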
Based heavily on the answer by davidovitch (thank you), I have put together a package that reads in a .ods file and returns a DataFrame. It's not a full implementation in pandas itself, such as his PR, but it provides a simple read_ods function that does the job.
You can install it with pip install pandas_ods_reader. It's also possible to specify whether the file contains a header row or not, and to specify custom column names.
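A usage sketch (the file name is an example; the sheet can be addressed by 1-based index or by name):
from pandas_ods_reader import read_ods
# first sheet, with a header row
df = read_ods("the_document.ods", 1)
# by sheet name, providing custom column names
df = read_ods("the_document.ods", "Sheet1", columns=["a", "b", "c"])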
There is support for reading Excel files in Pandas (both xls and xlsx), see the read_excel command. You can use OpenOffice to save the spreadsheet as xlsx. The conversion can also be done automatically on the command line, apparently, using the convert-to command line parameter.
Reading the data from xlsx avoids some of the issues (date formats, number formats, unicode) that you may run into when you convert to CSV first.
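With LibreOffice, for instance, the headless conversion looks roughly like this (file name illustrative):
soffice --headless --convert-to xlsx the_document.ods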
If possible, save as CSV from the spreadsheet application and then use pandas.read_csv(). IIRC, an 'ods' spreadsheet file is actually a zip archive of XML files, which also contain quite a bit of formatting information. So, if it's about tabular data, extract this raw data first to an intermediate file (CSV, in this case), which you can then parse with other programs, such as Python/pandas.