This is the same question as here, but the accepted answer does not work for me.
Attempt:
I try to save a Dask dataframe in Parquet format and read it with Spark.
Issue: the timestamp column cannot be interpreted by pyspark.
What I have done:
I try to save a Dask dataframe to HDFS as Parquet using
import dask.dataframe as dd
dd.to_parquet(ddf_param_logs, 'hdfs:///user/<myuser>/<filename>', engine='pyarrow', flavor='spark')
Then I read the file with pyspark:
sdf = spark.read.parquet('hdfs:///user/<myuser>/<filename>')
sdf.show()
>>> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://nameservice1/user/<user>/<filename>/part.0.parquet. Column: [utc_timestamp], Expected: bigint, Found: INT96
But if I save the dataframe with
dd.to_parquet(ddf_param_logs, 'hdfs:///user/<myuser>/<filename>', engine='pyarrow', use_deprecated_int96_timestamps=True)
the utc_timestamp column contains the timestamp information in Unix format (1578642290403000).
This is my environment:
dask==2.9.0
dask-core==2.9.0
pandas==0.23.4
pyarrow==0.15.1
pyspark==2.4.3
The INT96 type was explicitly included in order to allow compatibility with Spark, which chose not to use the standard time type defined by the Parquet spec. Unfortunately, it seems that they have changed again, and no longer use their own previous standard, nor the Parquet one.
If you could find out what type Spark wants here and post an issue to the Dask repo, it would be appreciated. You would want to output data from Spark containing time columns, and see what format it ends up as.
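A minimal sketch of that experiment, assuming a local-mode session so the output lands on the driver's filesystem; the path and column name are illustrative:
import datetime
import glob
import pyarrow.parquet as pq

# Write a one-row timestamp column from Spark to a local path.
sdf = spark.createDataFrame(
    [(datetime.datetime(2020, 1, 10, 7, 4, 50),)], ["utc_timestamp"]
)
sdf.write.mode("overwrite").parquet("file:///tmp/ts_probe")

# Inspect the physical Parquet type Spark actually wrote (INT96, INT64, ...).
part = glob.glob("/tmp/ts_probe/part-*.parquet")[0]
print(pq.ParquetFile(part).schema)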
Did you also try the fastparquet backend?
Related
I have Python code like this; it converts a CSV file to a Parquet file.
import pandas as pd
import pyarrow.parquet as pq
df = pd.read_csv('27.csv')
print(df)
df.to_parquet('27.parquet')
da = pd.read_parquet('27.parquet')
metadata = pq.read_metadata('27.parquet')
print(metadata)
print(da.head(10))
The result: https://imgur.com/a/Mhw3sot shows that the Parquet file version is too high.
I want to change the version to parquet-cpp version 1.5.1-SNAPSHOT (an older version).
How do I do it? Where do I set the file version?
27.csv is below:
Temp,Flow,site_no,datetime,Conductance,Precipitation,GageHeight
11.0,16200,09380000,2018-06-27 00:00,669,0.00,9.97
10.9,16000,09380000,2018-06-27 00:15,668,0.00,9.93
10.9,15700,09380000,2018-06-27 00:30,668,0.00,9.88
10.8,15400,09380000,2018-06-27 00:45,672,0.00,9.82
10.8,15100,09380000,2018-06-27 01:00,672,0.00,9.77
10.8,14700,09380000,2018-06-27 01:15,672,0.00,9.68
10.7,14300,09380000,2018-06-27 01:30,673,0.00,9.61
10.7,13900,09380000,2018-06-27 01:45,672,0.00,9.53
10.7,13600,09380000,2018-06-27 02:00,671,0.00,9.46
10.6,13200,09380000,2018-06-27 02:15,672,0.00,9.38
11.0,16200,09380000,2018-06-27 00:00,669,0.00,9.97
10.9,16000,09380000,2018-06-27 00:15,668,0.00,9.93
10.9,15700,09380000,2018-06-27 00:30,668,0.00,9.88
10.8,15400,09380000,2018-06-27 00:45,672,0.00,9.82
10.8,15100,09380000,2018-06-27 01:00,672,0.00,9.77
10.8,14700,09380000,2018-06-27 01:15,672,0.00,9.68
10.7,14300,09380000,2018-06-27 01:30,673,0.00,9.61
10.7,13900,09380000,2018-06-27 01:45,672,0.00,9.53
10.7,14300,09380000,2018-06-27 01:30,673,0.00,9.61
As far as I know, the Parquet format in Teradata has some limits.
Directly using pyarrow.parquet seems to cause problems.
How do I write a Parquet file for Teradata to read? (The file format limits seem to be involved.) Has anyone done this before?
Parquet Format Limitations in Teradata
1. The READ_NOS table operator does not support Parquet. However, READ_NOS can be used to view the Parquet schema, using RETURNTYPE('NOSREAD_PARQUET_SCHEMA'). This is helpful in creating the foreign table when you do not know the schema of your Parquet data beforehand.
2. Certain complex data types are not supported, including STRUCT, MAP, LIST, and ENUM.
3. Because support for the STRUCT data type is not available, nested Parquet object stores cannot be processed by Native Object Store.
Files which I tried:
https://ufile.io/f/wi1k9
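As far as I know, the created_by writer string comes from the installed pyarrow build and can only change by installing a different pyarrow; what you can set explicitly is the Parquet format version, via the version argument of pyarrow.parquet.write_table. A minimal sketch, with an illustrative output file name:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv('27.csv')
table = pa.Table.from_pandas(df)
# version controls the Parquet format version written into the file;
# '1.0' restricts the writer to the older, more widely supported types.
pq.write_table(table, '27_v1.parquet', version='1.0')
print(pq.read_metadata('27_v1.parquet'))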
I use:
Python 3.7
SAS v7.1 Enterprise
I want to export some data (from a library) from SAS to CSV. After that I want to import this CSV into a pandas DataFrame and use it.
I have a problem: when I export data from SAS with this code:
proc export data=LIB.NAME
outfile='path\to\export\file.csv'
dbms=csv
replace;
run;
Every column was exported correctly except the column with dates. In SAS I see something like:
06NOV2018
16APR2018
and so on. In the CSV it looks the same. But if I import this CSV into a DataFrame, Python unfortunately sees the date column as object/string instead of a date type.
So here is my question: how can I export a whole library from SAS to CSV with the correct column types (especially the date column)? Maybe I should convert something before the export? Please help me with this; I'm new to SAS and I just want to get the data out of it and use it in Python.
Before you answer, keep in mind that I had already tried the pandas read_sas function, but it raised this exception:
df1 = pd.read_sas(path)
ValueError: Unexpected non-zero end_of_first_byte Exception ignored
in: 'pandas.io.sas._sas.Parser.process_byte_array_with_data' Traceback
(most recent call last): File "pandas\io\sas\sas.pyx", line 31, in
pandas.io.sas._sas.rle_decompress
I added a fillna call and got the same error:
df = pd.DataFrame.fillna((pd.read_sas(path)), value="")
I tried the sas7bdat module in Python, but I got the same error.
Then I tried the sas7bdat_converter module, but the CSV has the same values in the date column, so the dtype problem comes back after converting the CSV to a DataFrame.
Do you have any suggestions? I've spent two days trying to figure it out, without any positive results.
Regarding the read_sas error, a GitHub issue has been reported but closed for lack of a reproducible example. However, I can easily import SAS data files into pandas using .sas7bdat files generated from SAS 9.4 Base (possibly the v7.1 Enterprise is the issue).
In any case, consider using the parse_dates argument of read_csv, as it can convert your DDMMMYYYY date format to datetime during import. No change is needed to your SAS-exported dataset.
sas_df = pd.read_csv(r"path\to\export\file.csv", parse_dates = ['DATE_COLUMN'])
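If parse_dates does not infer that format on its own, an explicit conversion is a possible fallback (the column name DATE_COLUMN is the same assumed name as above); the '%d%b%Y' format matches values like 06NOV2018:
import pandas as pd

sas_df = pd.read_csv(r"path\to\export\file.csv")
# Parse the SAS DATE9.-style strings explicitly; unparseable values become NaT.
sas_df['DATE_COLUMN'] = pd.to_datetime(sas_df['DATE_COLUMN'], format='%d%b%Y', errors='coerce')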
I'm working with pandas and with Spark dataframes. The dataframes are always very big (> 20 GB) and the standard Spark functions are not sufficient for those sizes. Currently I'm converting my pandas dataframe to a Spark dataframe like this:
dataframe = spark.createDataFrame(pandas_dataframe)
I do that transformation because with Spark it is very easy to write dataframes to HDFS:
dataframe.write.parquet(output_uri, mode="overwrite", compression="snappy")
But the transformation fails for dataframes that are bigger than 2 GB.
If I transform a spark dataframe to pandas I can use pyarrow:
import pyarrow as pa
import pyarrow.parquet as pq

# temporarily write the Spark dataframe to HDFS
dataframe.write.parquet(path, mode="overwrite", compression="snappy")
# open an HDFS connection using pyarrow
hdfs = pa.hdfs.connect("default", 0)
# read the Parquet dataset
parquet = pq.ParquetDataset(path_hdfs, filesystem=hdfs)
table = parquet.read(nthreads=4)
# convert the table to pandas
pandas = table.to_pandas(nthreads=4)
# delete the temp files
hdfs.delete(path, recursive=True)
This is a fast conversion from Spark to pandas and it also works for dataframes bigger than 2 GB. But I could not yet find a way to do it the other way around, meaning having a pandas dataframe which I transform to Spark with the help of pyarrow. The problem is that I really can't find out how to write a pandas dataframe to HDFS.
My pandas version: 0.19.0
Meaning having a pandas dataframe which I transform to spark with the help of pyarrow.
pyarrow.Table.from_pandas is the function you're looking for:
Table.from_pandas(type cls, df, bool timestamps_to_ms=False, Schema schema=None, bool preserve_index=True)
Convert pandas.DataFrame to an Arrow Table
import pyarrow as pa
pdf = ... # type: pandas.core.frame.DataFrame
adf = pa.Table.from_pandas(pdf) # type: pyarrow.lib.Table
The result can be written directly to Parquet / HDFS without passing data via Spark:
import pyarrow.parquet as pq
fs = pa.hdfs.connect()
with fs.open(path, "wb") as fw:
pq.write_table(adf, fw)
See also
@WesMcKinney's answer on reading Parquet files from HDFS using PyArrow.
Reading and Writing the Apache Parquet Format in the pyarrow documentation.
Native Hadoop file system (HDFS) connectivity in Python
Spark notes:
Furthermore, since Spark 2.3 (current master) Arrow is supported directly in createDataFrame (SPARK-20791: Use Apache Arrow to improve Spark createDataFrame from pandas.DataFrame). It uses SparkContext.defaultParallelism to compute the number of chunks, so you can easily control the size of individual batches.
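A minimal sketch of enabling that Arrow-backed conversion, assuming a Spark 2.3+ SparkSession named spark (the configuration key is from the Spark 2.x docs):
# Enable Arrow-backed conversion in createDataFrame (Spark >= 2.3).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
dataframe = spark.createDataFrame(pandas_dataframe)  # now converted via Arrow record batches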
Finally, defaultParallelism can be used to control the number of partitions generated using the standard _convert_from_pandas, effectively reducing the size of the slices to something more manageable.
Unfortunately these are unlikely to resolve your current memory problems. Both depend on parallelize and therefore store all data in the memory of the driver node. Switching to Arrow or adjusting the configuration can only speed up the process or address block-size limitations.
In practice I don't see any reason to switch to Spark here, as long as you use a local pandas DataFrame as the input. The most severe bottleneck in this scenario is the driver's network I/O, and distributing the data won't address that.
From https://issues.apache.org/jira/browse/SPARK-6235 ("Support for parallelizing R data.frame larger than 2GB"), that issue is resolved.
From https://pandas.pydata.org/pandas-docs/stable/r_interface.html ("Converting DataFrames into R objects"), you can convert a pandas dataframe to an R data.frame.
So perhaps the transformation pandas -> R -> Spark -> HDFS?
Another way is to convert your pandas dataframe to a Spark dataframe (using pyspark) and save it to HDFS with a save command.
Example:
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

df = pd.read_csv("data/as/foo.csv")
df[['Col1', 'Col2']] = df[['Col1', 'Col2']].astype(str)
conf = SparkConf()
sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(df)
Here astype changes the type of your columns from object to string. This saves you from an exception that would otherwise be raised because Spark cannot figure out the pandas type object. But make sure these columns really should be strings.
Now, to save your df in HDFS:
sdf.write.csv('mycsv.csv')
A hack could be to create N pandas dataframes (each less than 2 GB) from the big one by horizontal partitioning, create N different Spark dataframes from them, and then merge (union) them into a final one to write to HDFS. I am assuming that your master machine is powerful, but that you also have a cluster available on which you are running Spark.
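A rough sketch of that chunk-and-union approach; the number of chunks is an illustrative choice and would need to be sized so each chunk stays comfortably under the 2 GB limit:
from functools import reduce
import numpy as np
from pyspark.sql import DataFrame

n_chunks = 16  # illustrative; pick so each chunk is well below 2 GB
chunks = np.array_split(pandas_dataframe, n_chunks)
spark_chunks = [spark.createDataFrame(chunk) for chunk in chunks]
sdf = reduce(DataFrame.union, spark_chunks)
sdf.write.parquet(output_uri, mode="overwrite", compression="snappy")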
I would like to preface by saying I am very new to Spark. I have a working program in pandas that I need to run on Spark. I am using Databricks to do this. After initializing 'sqlContext' and 'sc', I load in a CSV file and create a Spark dataframe. After doing this, I then convert this dataframe into a pandas dataframe, for which I have already written code to do what I need to do.
Objective: I need to load in a CSV file and identify the data types and return the data types of each and every column. The tricky part is that dates come in a variety of formats, for which I have written (with help from this community) regular expressions to match. I do this for every data type. At the end, I convert the columns to the correct type and print each column type.
After successfully loading my Pandas dataframe in, I am getting this error: "TypeError: to_numeric() got an unexpected keyword argument 'downcast' "
The code I am running that triggered this:
# Changing the column data types
if len(int_count) == len(str_count):
    df[lst[col]] = pd.to_numeric(df[lst[col]], errors='coerce', downcast='integer')
if len(float_count) == len(str_count):
    df[lst[col]] = pd.to_numeric(df[lst[col]], errors='coerce', downcast='float')
if len(boolean_count) == len(str_count):
    df[lst[col]] = df[lst[col]].astype('bool')
if len(date_count) == len(str_count):
    df[lst[col]] = pd.to_datetime(df[lst[col]], errors='coerce')
'lst' holds the column headers and 'col' is a variable I use to iterate through them. This code worked perfectly when running in PyCharm. I'm not sure why I am getting this error on Spark.
Any help would be great!
From your comments:
I have tried to load the initial data directly into pandas df but it has consistently thrown me an error, saying the file doesn't exist, which is why I have had to convert it after loading it into Spark.
So, my answer has nothing to do with Spark, only with uploading data to Databricks Cloud (Community Edition), which seems to be your real issue here.
After initializing a cluster and uploading a file user_info.csv, the upload screen shows the actual path for our uploaded file.
Now, in a Databricks notebook, if you try to use this exact path with pandas, you'll get a File does not exist error:
import pandas as pd
pandas_df = pd.read_csv("/FileStore/tables/1zpotrjo1499779563504/user_info.csv")
[...]
IOError: File /FileStore/tables/1zpotrjo1499779563504/user_info.csv does not exist
because, as the instructions clearly mention, in that case (i.e. files you want loaded directly in pandas or R instead of Spark) you need to prepend the file path with /dbfs:
pandas_df = pd.read_csv("/dbfs/FileStore/tables/1zpotrjo1499779563504/user_info.csv") # works OK
pandas_df.head() # works OK
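For contrast, a short sketch of how the same uploaded file is addressed from Spark versus from pandas in a Databricks notebook (the /FileStore path is the illustrative one from above):
import pandas as pd

# Spark reads straight from the DBFS path...
spark_df = spark.read.csv("/FileStore/tables/1zpotrjo1499779563504/user_info.csv", header=True)

# ...while local libraries such as pandas go through the /dbfs mount point.
pandas_df = pd.read_csv("/dbfs/FileStore/tables/1zpotrjo1499779563504/user_info.csv")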