How to write a Pandas dataframe to HDFS [duplicate] - python

I'm working with pandas and with Spark dataframes. The dataframes are always very big (> 20 GB) and the standard Spark functions are not sufficient for those sizes. Currently I'm converting my pandas dataframe to a Spark dataframe like this:
dataframe = spark.createDataFrame(pandas_dataframe)
I do that conversion because with Spark, writing dataframes to HDFS is very easy:
dataframe.write.parquet(output_uri, mode="overwrite", compression="snappy")
But the conversion fails for dataframes bigger than 2 GB.
If I transform a Spark dataframe to pandas, I can use pyarrow:
import pyarrow as pa
import pyarrow.parquet as pq

# temporarily write the spark dataframe to hdfs
dataframe.write.parquet(path, mode="overwrite", compression="snappy")
# open an hdfs connection using pyarrow (pa)
hdfs = pa.hdfs.connect("default", 0)
# read the parquet dataset (pyarrow.parquet imported as pq)
parquet = pq.ParquetDataset(path_hdfs, filesystem=hdfs)
table = parquet.read(nthreads=4)
# convert the arrow table to pandas
pandas = table.to_pandas(nthreads=4)
# delete the temp files
hdfs.delete(path, recursive=True)
This is a fast conversion from Spark to pandas, and it also works for dataframes bigger than 2 GB. I have not yet found a way to do it the other way around, meaning having a pandas dataframe which I transform to Spark with the help of pyarrow. The problem is that I really can't find out how to write a pandas dataframe to HDFS.
My pandas version: 0.19.0

Meaning having a pandas dataframe which I transform to spark with the help of pyarrow.
pyarrow.Table.from_pandas is the function you're looking for:
Table.from_pandas(type cls, df, bool timestamps_to_ms=False, Schema schema=None, bool preserve_index=True)
Convert pandas.DataFrame to an Arrow Table
import pyarrow as pa
pdf = ... # type: pandas.core.frame.DataFrame
adf = pa.Table.from_pandas(pdf) # type: pyarrow.lib.Table
The result can be written directly to Parquet / HDFS without passing data via Spark:
import pyarrow.parquet as pq
fs = pa.hdfs.connect()
with fs.open(path, "wb") as fw:
    pq.write_table(adf, fw)
See also
@WesMcKinney's answer on how to read Parquet files from HDFS using PyArrow.
Reading and Writing the Apache Parquet Format in the pyarrow documentation.
Native Hadoop file system (HDFS) connectivity in Python
Spark notes:
Furthermore, since Spark 2.3 (the current master at the time of writing), Arrow is supported directly in createDataFrame (SPARK-20791 - Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame). It uses SparkContext.defaultParallelism to compute the number of chunks, so you can easily control the size of individual batches.
Finally, defaultParallelism can be used to control the number of partitions generated by the standard _convert_from_pandas, effectively reducing the size of the slices to something more manageable.
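For illustration, a minimal sketch of the Arrow-backed path (assuming Spark >= 2.3 with pyarrow installed on the driver; pandas_dataframe and output_uri are the variables from the question):

# minimal sketch, assuming Spark >= 2.3 and pyarrow available on the driver
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# use Arrow to speed up pandas <-> Spark conversions (Spark 2.x name of the setting)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

sdf = spark.createDataFrame(pandas_dataframe)   # pandas_dataframe as in the question
sdf.write.parquet(output_uri, mode="overwrite", compression="snappy")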
Unfortunately, these are unlikely to resolve your current memory problems. Both depend on parallelize, and therefore store all data in the memory of the driver node. Switching to Arrow or adjusting the configuration can only speed up the process or address block size limitations.
In practice I don't see any reason to switch to Spark here, as long as you use a local pandas DataFrame as the input. The most severe bottleneck in this scenario is the driver's network I/O, and distributing data won't address that.

From https://issues.apache.org/jira/browse/SPARK-6235 (Support for parallelizing R data.frame larger than 2GB), which is resolved, and from https://pandas.pydata.org/pandas-docs/stable/r_interface.html (Converting DataFrames into R objects), you can convert a pandas dataframe to an R data.frame.
So perhaps the transformation pandas -> R -> Spark -> HDFS?

One other way is to convert your pandas dataframe to a Spark dataframe (using pyspark) and save it to HDFS with the write command.
example
import pandas as pd
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

df = pd.read_csv("data/as/foo.csv")
df[['Col1', 'Col2']] = df[['Col1', 'Col2']].astype(str)
conf = SparkConf()
sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(df)
Here astype changes the type of your columns from object to string. This saves you from an otherwise raised exception, since Spark cannot figure out the pandas type object. But make sure these columns really contain strings.
Now to save your df in hdfs:
sdf.write.csv('mycsv.csv')

A hack could be to create N pandas dataframes (each less than 2 GB) by horizontally partitioning the big one, create N different Spark dataframes, and then merge (union) them into a final one to write into HDFS. I am assuming that your master machine is powerful, but that you also have a cluster available in which you are running Spark.
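A rough, untested sketch of that idea (it reuses pandas_dataframe, spark and output_uri from the question; the chunk count is purely illustrative):

import numpy as np
from functools import reduce

# split the big pandas DataFrame row-wise into chunks small enough to convert
n_chunks = 16  # illustrative; pick it so each chunk stays well under 2 GB
chunks = np.array_split(pandas_dataframe, n_chunks)

# convert each chunk separately and union the resulting Spark dataframes
spark_chunks = [spark.createDataFrame(chunk) for chunk in chunks]
sdf = reduce(lambda a, b: a.union(b), spark_chunks)

sdf.write.parquet(output_uri, mode="overwrite", compression="snappy")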

Related

How to read txt files and convert into data frame using Apache Beam Python?

My txt file contains stock market data without any delimiter. So I have to transform it into structured columns and convert it into a data frame using an Apache Beam Python pipeline. Any help would be appreciated.
I have made a brief compilation on how to ingest data with apache_beam.dataframe. Below you can see some of the functions that Beam's dataframe API supports; as you can see, they are similar to the ones used in pandas. That's because Apache Beam implements a _ReadFromPandas transform that makes use of the pandas library to get data from your sources into a Beam dataframe.
apache_beam.dataframe.io
read_csv: Emulates read_csv from pandas, but as a Beam PTransform. Use this to get a deferred Beam dataframe representing the contents of the file (csv stands for comma-separated values).
df = p | beam.dataframe.io.read_csv(…)
read_fwf: Emulates pandas read_fwf (fwf stands for fixed-width formatted text). You can find more detailed information in this article; a full pipeline sketch using read_fwf is included at the end of this answer.
data = "mydata.txt" # content --> "MyFixedText12345"
read_fwf(data, colspecs=[(0, 11), (11, None)], header=None)
# expected dataframe = [[MyFixedText, 12345]]
read_json: Emulates pandas read_json. Parses a JSON-structured file to create the dataframe.
#json structure
{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}
read_html: Emulates pandas read_html. Parses HTML tables into a list of dataframes.
As mentioned in the comments, you will have to define the structure of your data so that it can fit into one of the options apache_beam has for dataframes. Since Beam makes use of pandas, you can find more detailed information in the pandas IO tools documentation; keep in mind that, according to the official Beam documentation, not all pandas dataframe options are supported by Beam.
For more info about beam dataframes, check this link
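As a rough, untested sketch of how the read_fwf option could be wired into a pipeline for delimiter-less stock data (the file name, column widths and column names below are assumptions, not part of the original answer):

import apache_beam as beam
from apache_beam.dataframe.io import read_fwf
from apache_beam.dataframe.convert import to_pcollection

with beam.Pipeline() as p:
    # deferred Beam dataframe; colspecs must match your fixed-width record layout
    df = p | read_fwf(
        "mydata.txt",                      # hypothetical input file
        colspecs=[(0, 10), (10, 16)],      # hypothetical field widths
        names=["symbol", "price"],         # hypothetical column names
        header=None,
    )
    # convert back to a plain PCollection if you need regular Beam transforms
    rows = to_pcollection(df)
    rows | beam.Map(print)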

Parquet compatibility with Dask/Pandas and Pyspark

This is the same question as here, but the accepted answer does not work for me.
Attempt:
I try to save a Dask dataframe in Parquet format and read it with Spark.
Issue: the timestamp column cannot be interpreted by pyspark.
What I have done:
I try to save the Dask dataframe to HDFS as Parquet using
import dask.dataframe as dd
dd.to_parquet(ddf_param_logs, 'hdfs:///user/<myuser>/<filename>', engine='pyarrow', flavor='spark')
Then I read the file with pyspark:
sdf = spark.read.parquet('hdfs:///user/<myuser>/<filename>')
sdf.show()
>>> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://nameservice1/user/<user>/<filename>/part.0.parquet. Column: [utc_timestamp], Expected: bigint, Found: INT96
But if I save the dataframe with
dd.to_parquet(ddf_param_logs, 'hdfs:///user/<myuser>/<filename>', engine='pyarrow', use_deprecated_int96_timestamps=True)
the utc_timestamp column contains the timestamp information as a raw Unix-epoch integer (1578642290403000).
This is my environment:
dask==2.9.0
dask-core==2.9.0
pandas==0.23.4
pyarrow==0.15.1
pyspark==2.4.3
The INT96 type was explicitly included in order to allow compatibility with Spark, which chose not to use the standard time type defined by the Parquet spec. Unfortunately, it seems that they have changed again, and no longer use their own previous standard, nor the Parquet one.
If you could find out what type Spark wants here and post an issue to the dask repo, it would be appreciated. You would want to output data from Spark containing time columns, and see what format it ends up as.
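For example, an untested sketch of how to check what Spark actually wrote (the file path is just a placeholder):

import pyarrow.parquet as pq

# inspect one part file produced by Spark to see how the time column is stored
pf = pq.ParquetFile("spark_output/part-00000.parquet")  # placeholder path
print(pf.schema)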
Did you also try the fastparquet backend?
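If you do try it, a hypothetical variant of the call from the question might look like this; times='int96' is fastparquet's option for Spark-style INT96 timestamps, and should be treated as a suggestion to verify rather than a confirmed fix:

import dask.dataframe as dd

# same call as in the question, but with the fastparquet engine (assumption, not tested)
dd.to_parquet(ddf_param_logs, 'hdfs:///user/<myuser>/<filename>',
              engine='fastparquet',
              times='int96')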

Read a large csv as a Pandas DataFrame faster

I have a csv that I am reading into a Pandas DataFrame but it takes about 35 minutes to read. The csv is approximately 120 GB. I found a module called cudf that allows a GPU DataFrame however it is only for Linux. Is there something similar for Windows?
chunk_list = []
combined_array = pd.DataFrame()
for chunk in tqdm(pd.read_csv('\\large_array.csv', header=None,
                              low_memory=False, error_bad_lines=False, chunksize=10000)):
    print(' --- Complete')
    chunk_list.append(chunk)
array = pd.concat(chunk_list)
print(array)
You can also look at dask.dataframe if you really want to read it into a pandas-like dataframe API.
For reading CSVs, this will parallelize your IO task across multiple cores and nodes. It will probably also alleviate memory pressure by scaling across nodes, since with a 120 GB CSV you will probably be memory bound too.
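A minimal sketch of that route, assuming the same header-less CSV layout as in the question (the final compute() only makes sense if the result actually fits in memory):

import dask.dataframe as dd

# lazily build a partitioned dataframe; reading is parallelized per partition
ddf = dd.read_csv('large_array.csv', header=None)

# either keep working with the dask dataframe, or materialize it if it fits in RAM
df = ddf.compute()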
Another good alternative might be using arrow.
Do you have a GPU? If yes, please look at BlazingSQL, the GPU SQL engine in a Python package.
This article describes querying a terabyte with BlazingSQL, and BlazingSQL supports reading from CSV.
After you get a GPU dataframe, convert it to a pandas dataframe with
# from cuDF DataFrame to pandas DataFrame
df = gdf.to_pandas()
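If you only need to load the CSV rather than query it with SQL, a minimal cuDF-only sketch (assuming a supported GPU and a Linux/WSL environment, which is the limitation mentioned in the question) could look like this:

import cudf

# read the CSV directly into a GPU dataframe, then move it to pandas if needed
gdf = cudf.read_csv('large_array.csv', header=None)
df = gdf.to_pandas()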

Writing a big Spark Dataframe into a csv file

I'm using Spark 2.3 and I need to save a Spark Dataframe into a csv file, and I'm looking for a better way to do it. Looking over related/similar questions, I found this one, but I need something more specific:
If the DataFrame is too big, how can I avoid using pandas? I used the toCSV() function (code below) and it produced an Out Of Memory error (could not allocate memory).
Is directly writing to a csv using file I/O a better way? Can it preserve the separators?
Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv') will cause the header to be written in each file and when the files are merged, it will have headers in the middle. Am I wrong?
Is using Spark write and then hadoop getmerge better than using coalesce from a performance point of view?
def toCSV(spark_df, n=None, save_csv=None, csv_sep=',', csv_quote='"'):
    """get spark_df from hadoop and save to a csv file

    Parameters
    ----------
    spark_df: incoming dataframe
    n: number of rows to get
    save_csv=None: filename for exported csv

    Returns
    -------
    """
    # use the more robust method
    # set temp names
    tmpfilename = save_csv or (wfu.random_filename() + '.csv')
    tmpfoldername = wfu.random_filename()
    print(n)
    # write sparkdf to hadoop, get n rows if specified
    if n:
        spark_df.limit(n).write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
    else:
        spark_df.write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
    # get merged file from hadoop
    HDFSUtil.getmerge(tmpfoldername, tmpfilename)
    HDFSUtil.rmdir(tmpfoldername)
    # read into pandas df, remove tmp csv file
    pd_df = pd.read_csv(tmpfilename, names=spark_df.columns, sep=csv_sep, quotechar=csv_quote)
    os.remove(tmpfilename)
    # re-write the csv file with header!
    if save_csv is not None:
        pd_df.to_csv(save_csv, sep=csv_sep, quotechar=csv_quote)
If the DataFrame is too big, how can I avoid using Pandas?
You can just save the file to HDFS or S3 or whichever distributed storage you have.
Is directly writing to a csv using file I/O a better way? Can it preserve the separators?
If you mean by that saving the file to local storage - it will still cause an OOM exception, since you will need to move all the data into memory on the local machine to do it.
Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv') will cause the header to be written in each file and when the files are merged, it will have headers in the middle. Am I wrong?
In this case you will have only 1 file (since you do coalesce(1)), so you don't need to care about headers. Instead, you should care about memory on the executors - you might get OOM on the executor, since all the data will be moved to that executor.
Using spark write and then hadoop getmerge is better than using coalesce from the point of performance?
Definitely better (but don't use coalesce()). Spark will efficiently write data to storage, then HDFS will replicate the data, and after that getmerge will be able to efficiently read data from the nodes and merge it.
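A rough sketch of that flow (the output path is a placeholder; the merge step runs outside Spark with the standard HDFS CLI):

# write many part files in parallel from the executors
df.write.option("header", "false").csv("hdfs:///tmp/my_output")  # placeholder path

# then, outside Spark, merge the part files into a single local csv:
#   hadoop fs -getmerge /tmp/my_output merged.csv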
We used the Databricks spark-csv library. It works fine:
df.save("com.databricks.spark.csv", SaveMode.Overwrite, Map("delimiter" -> delim, "nullValue" -> "-", "path" -> tempFPath))
Library (Maven dependency):
<!-- spark df to csv -->
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>1.3.0</version>
</dependency>
