toPandas() using pyarrow returns empty dataframe - python

I have a Spark dataframe with 5 million rows and 250 columns. When I do a toPandas() conversion of this dataframe with "spark.sql.execution.arrow.enabled" set to "true", it returns an empty dataframe with just the columns.
With pyarrow disabled I get the error below:
Py4JJavaError: An error occurred while calling o124.collectToPython. : java.lang.OutOfMemoryError: GC overhead limit exceeded
Is there any way to perform this operation by increasing some sort of memory allocation?
I couldn't find any online resources for this except https://issues.apache.org/jira/browse/SPARK-28881, which is not that helpful.

OK, this problem is related to memory: while converting a Spark dataframe to a pandas dataframe, Spark (Py4J) has to go through a collect, which consumes a lot of memory. So what I advise you to do is to reconfigure the memory when creating the Spark session. Here is an example:
from pyspark import SparkContext
SparkContext.setSystemProperty('spark.executor.memory', '16g')
sc = SparkContext("local", "stack_over_flow")
Proceed with sc (the SparkContext) or with a SparkSession, as you like.
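For reference, a SparkSession-based sketch of the same idea; the memory values are illustrative, and since toPandas() collects the data to the driver, spark.driver.memory is usually the setting that matters most here:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stack_over_flow")
         .config("spark.driver.memory", "16g")      # illustrative sizes; tune to your machine
         .config("spark.executor.memory", "16g")
         .config("spark.sql.execution.arrow.enabled", "true")
         .getOrCreate())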
If this does not work, it may be a version conflict, so please check these options:
- Set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 wherever you are using Python (see the sketch after this list).
- Downgrade to pyarrow < 0.15.0 for now.
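A minimal sketch of the first option, assuming you control how the session is created (spark.executorEnv.* is the standard Spark mechanism for propagating an environment variable to the executors):
import os
from pyspark.sql import SparkSession

# Set the flag for the driver-side Python process before Spark starts
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

# Propagate the same flag to the executor Python workers
spark = (SparkSession.builder
         .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
         .getOrCreate())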
If you can share your script, it will be clearer.

Related

How to use ODBC connection for pyspark.pandas

In the following Python code I can successfully connect to an MS Azure SQL DB using an ODBC connection, and can load data into an Azure SQL table using pandas' DataFrame method to_sql(...). But when I use pyspark.pandas instead, the to_sql(...) method fails, stating that no such method is supported. I know the pandas API on Spark has reached about 97% coverage. But I was wondering if there is an alternate method of achieving the same while still using ODBC.
Question: In the following code sample, how can we use ODBC connection for pyspark.pandas for connecting to Azure SQL db and load a dataframe into a SQL table?
import sqlalchemy as sq
#import pandas as pd
import pyspark.pandas as ps
import datetime
data_df = ps.read_csv('/dbfs/FileStore/tables/myDataFile.csv', low_memory=False, quotechar='"', header='infer')
.......
data_df.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False,
               dtype={'OrderID': sq.VARCHAR(10),
                      'Name': sq.VARCHAR(50),
                      'OrderDate': sq.DATETIME()})
Ref: Pandas API on Spark and this
UPDATE: The data file is about 6.5 GB with 150 columns and 15 million records, so pandas cannot handle it and, as expected, gives an OOM (out of memory) error.
I noticed you were appending the data to the table, so this workaround came to mind:
break the pyspark.pandas DataFrame into chunks, export each chunk to pandas, and append each chunk from there.
import numpy as np

n = 20                                 # number of chunks
list_dfs = np.array_split(data_df, n)  # split into n row-wise chunks
for chunk in list_dfs:
    pdf = chunk.to_pandas()
    pdf.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False)
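If np.array_split does not accept a pandas-on-Spark frame in your environment, a slice-based variant of the same idea should work (iloc with plain slices is supported by pandas-on-Spark); the chunk count of 20 is just the same rough target as above:
chunk_size = (len(data_df) // 20) + 1   # roughly 20 chunks
for start in range(0, len(data_df), chunk_size):
    pdf = data_df.iloc[start:start + chunk_size].to_pandas()
    pdf.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False)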
As per the official pyspark.pandas documentation by Apache Spark, there is no method available in this module that can load a DataFrame into a SQL table.
Please see all provided methods here.
As an alternative approach, there are some similar asks discussed in these SO threads, which might be helpful:
How to write to a Spark SQL table from a Panda data frame using PySpark?
How can I convert a pyspark.sql.dataframe.DataFrame back to a sql table in databricks notebook
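If a JDBC driver for Azure SQL is an acceptable substitute for ODBC, another option (not from the threads above, just a common pattern, and it assumes the MS SQL JDBC driver is available on the cluster) is to convert the pandas-on-Spark DataFrame to a plain Spark DataFrame and use Spark's built-in JDBC writer. The URL, user and password below are placeholders:
# Placeholder connection details for illustration only
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"

(data_df.to_spark()              # pandas-on-Spark -> Spark DataFrame
    .write.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "CustomerOrderTable")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save())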

How to write a Pandas dataframe to HDFS [duplicate]

I'm working with pandas and with Spark dataframes. The dataframes are always very big (> 20 GB) and the standard Spark functions are not sufficient for those sizes. Currently I'm converting my pandas dataframe to a Spark dataframe like this:
dataframe = spark.createDataFrame(pandas_dataframe)
I do that transformation because with Spark, writing dataframes to HDFS is very easy:
dataframe.write.parquet(output_uri, mode="overwrite", compression="snappy")
But the transformation is failing for dataframes which are bigger than 2 GB.
If I transform a spark dataframe to pandas I can use pyarrow:
import pyarrow as pa
import pyarrow.parquet as pq

# temporarily write the spark dataframe to hdfs
dataframe.write.parquet(path, mode="overwrite", compression="snappy")
# open an hdfs connection using pyarrow
hdfs = pa.hdfs.connect("default", 0)
# read the parquet files
parquet = pq.ParquetDataset(path_hdfs, filesystem=hdfs)
table = parquet.read(nthreads=4)
# transform the table to pandas
pandas = table.to_pandas(nthreads=4)
# delete the temp files
hdfs.delete(path, recursive=True)
This is a fast conversion from Spark to pandas and it also works for dataframes bigger than 2 GB. I could not yet find a way to do it the other way around, meaning taking a pandas dataframe and transforming it to Spark with the help of pyarrow. The problem is that I really can't find how to write a pandas dataframe to HDFS.
My pandas version: 0.19.0
Meaning having a pandas dataframe which I transform to spark with the help of pyarrow.
pyarrow.Table.from_pandas is the function you're looking for:
Table.from_pandas(type cls, df, bool timestamps_to_ms=False, Schema schema=None, bool preserve_index=True)
Convert pandas.DataFrame to an Arrow Table
import pyarrow as pa
pdf = ... # type: pandas.core.frame.DataFrame
adf = pa.Table.from_pandas(pdf) # type: pyarrow.lib.Table
The result can be written directly to Parquet / HDFS without passing data via Spark:
import pyarrow.parquet as pq

fs = pa.hdfs.connect()
with fs.open(path, "wb") as fw:
    pq.write_table(adf, fw)
See also
@WesMcKinney's answer on reading parquet files from HDFS using PyArrow.
Reading and Writing the Apache Parquet Format in the pyarrow documentation.
Native Hadoop file system (HDFS) connectivity in Python
Spark notes:
Furthermore, since Spark 2.3 (current master), Arrow is supported directly in createDataFrame (SPARK-20791 - Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame). It uses SparkContext.defaultParallelism to compute the number of chunks, so you can easily control the size of individual batches.
Finally, defaultParallelism can be used to control the number of partitions generated by the standard _convert_from_pandas, effectively reducing the size of the slices to something more manageable.
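To illustrate the Spark 2.3 note above, a minimal sketch of the Arrow-backed path (in Spark 3.x the key is renamed to spark.sql.execution.arrow.pyspark.enabled):
# Enable Arrow-based conversion between pandas and Spark (Spark 2.3+)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

sdf = spark.createDataFrame(pandas_dataframe)   # pandas -> Spark using Arrow record batches
sdf.write.parquet(output_uri, mode="overwrite", compression="snappy")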
Unfortunately these are unlikely to resolve your current memory problems. Both depend on parallelize, and therefore store all the data in the memory of the driver node. Switching to Arrow or adjusting the configuration can only speed up the process or address block-size limitations.
In practice I don't see any reason to switch to Spark here, as long as you use a local pandas DataFrame as the input. The most severe bottleneck in this scenario is the driver's network I/O, and distributing the data won't address that.
From https://issues.apache.org/jira/browse/SPARK-6235
Support for parallelizing R data.frame larger than 2GB
is resolved.
From https://pandas.pydata.org/pandas-docs/stable/r_interface.html
Converting DataFrames into R objects
you can convert a pandas dataframe to an R data.frame
So perhaps the transformation pandas -> R -> Spark -> hdfs?
One other way is to convert your pandas dataframe to a Spark dataframe (using pyspark) and save it to HDFS with a write command.
Example:
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

df = pd.read_csv("data/as/foo.csv")
df[['Col1', 'Col2']] = df[['Col1', 'Col2']].astype(str)
conf = SparkConf()
sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(df)
Here astype changes the type of your columns from object to string. This saves you from an otherwise raised exception, as Spark can't figure out the pandas type object. Just make sure these columns really are of string type.
Now, to save your df to HDFS:
sdf.write.csv('mycsv.csv')
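Since the goal in the question is Parquet on HDFS, the same writer shown earlier (using the question's output_uri) should apply here too:
sdf.write.parquet(output_uri, mode="overwrite", compression="snappy")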
A hack could be to create N pandas dataframes (each less than 2 GB, i.e. horizontal partitioning) from the big one, create N different Spark dataframes, and then merge (union) them into a final one to write to HDFS, as in the sketch below. I am assuming that your master machine is powerful, but that you also have a cluster available in which you are running Spark.
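A rough sketch of that hack, assuming Spark 2.x (where DataFrame.union is available) and chunks sized so each createDataFrame call stays under the 2 GB limit; the chunk count of 10 is arbitrary:
import numpy as np
from functools import reduce

chunks = np.array_split(pandas_dataframe, 10)             # N horizontal partitions, each < 2 GB
spark_chunks = [spark.createDataFrame(c) for c in chunks]
final_df = reduce(lambda a, b: a.union(b), spark_chunks)  # merge back into one Spark dataframe
final_df.write.parquet(output_uri, mode="overwrite", compression="snappy")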

Spark memory leak when overwriting dataframe variable

I'm running into a memory leak in the Spark driver that I can't seem to figure out. My guess is it has something to do with trying to overwrite a DataFrame variable, but I can't find any documentation or other issues like this.
This is on Spark 2.1.0 (PySpark).
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Spark Leak") \
    .getOrCreate()

sc = spark.sparkContext
sqlContext = SQLContext.getOrCreate(sc)

items = 5000000
data = [str(x) for x in range(1, items)]
df = sqlContext.createDataFrame(data, StringType())
print(df.count())

for x in range(0, items):
    sub_df = sqlContext.createDataFrame([str(x)], StringType())
    df = df.subtract(sub_df)
    print(df.count())
This will continue to run until the driver runs out of memory then dies.
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.net.SocketInputStream.read(SocketInputStream.java:224)
at org.apache.spark.api.python.PythonAccumulatorV2.merge(PythonRDD.scala:917)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1089)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1081)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1081)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1184)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1717)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1675)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1664)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
17/05/25 16:55:40 ERROR DAGScheduler: Failed to update accumulators for task 13
java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.spark.api.python.PythonAccumulatorV2.merge(PythonRDD.scala:915)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1089)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1081)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1081)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1184)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1717)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1675)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1664)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
...
If anything, I'd think memory should shrink, since items are being removed from the DataFrame, but that is not the case.
Am I not understanding how Spark assigns DataFrames to Python variables, or something?
I've also tried assigning the df.subtract result to a new temporary variable, unpersisting df, assigning the temporary variable to df, and then unpersisting the temporary variable, but that has the same issues.
The fundamental problem here seems to be understanding what exactly a DataFrame is (this applies to Spark RDDs as well). A local DataFrame object effectively describes a computation which is to be performed when some action is executed on that object.
As a result it is a recursive structure which captures all of its dependencies, so the execution plan effectively grows with each iteration. While Spark provides tools, like checkpointing, which can be used to address this problem and cut the lineage, the code in question doesn't make much sense in the first place.
Distributed data structures available in Spark are designed for high-latency, IO-intensive jobs. Parallelizing individual objects and executing millions of Spark jobs on millions of distributed objects just cannot work well.
Furthermore, Spark is not designed for efficient single-item operations. Each subtract is O(N), making the whole process O(N²) and effectively useless on any large dataset.
While the problem itself is trivial, a correct way to do it would be something like this:
items = 5000000
df1 = spark.range(items).selectExpr("cast(id as string)")
df2 = spark.range(items).selectExpr("cast(id as string)")
df1.subtract(df2)
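If you really do need to reassign a DataFrame variable in a loop, a sketch of the checkpointing approach mentioned above (DataFrame.checkpoint is available since Spark 2.1; the checkpoint directory is a placeholder, and this does not fix the O(N²) cost of per-item subtracts):
sc.setCheckpointDir("/tmp/spark-checkpoints")   # placeholder directory

for x in range(0, items):
    sub_df = sqlContext.createDataFrame([str(x)], StringType())
    df = df.subtract(sub_df)
    if x % 100 == 0:
        df = df.checkpoint()   # materialize the data and truncate the growing lineage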

Spark 1.4 increase maxResultSize memory

I am using Spark 1.4 for my research and struggling with the memory settings. My machine has 16 GB of memory, so no problem there since the size of my file is only 300 MB. However, when I try to convert a Spark RDD to a pandas dataframe using the toPandas() function, I receive the following error:
serialized results of 9 tasks (1096.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
I tried to fix this by changing the Spark config file and am still getting the same error. I've heard that this is a problem with Spark 1.4 and am wondering if you know how to solve this. Any help is much appreciated.
You can set spark.driver.maxResultSize parameter in the SparkConf object:
from pyspark import SparkConf, SparkContext

# In Jupyter you have to stop the current context first
sc.stop()

# Create a new config
conf = (SparkConf()
        .set("spark.driver.maxResultSize", "2g"))

# Create a new context
sc = SparkContext(conf=conf)
You should probably create a new SQLContext as well:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
From the command line, such as with pyspark, --conf spark.driver.maxResultSize=3g can also be used to increase the max result size.
Tuning spark.driver.maxResultSize is good practice considering the running environment. However, it is not the solution to your problem, as the amount of data may change from time to time. As @Zia-Kayani mentioned, it is better to collect data wisely. So if you have a DataFrame df, then you can call df.rdd and do all the magic stuff on the cluster, not in the driver. However, if you need to collect the data, I would suggest:
Do not turn on spark.sql.parquet.binaryAsString. String objects take more space
Use spark.rdd.compress to compress RDDs when you collect them
Try to collect it using pagination (code in Scala, from another answer: Scala: How to get a range of rows in a dataframe):
var df1 = df           // df needs to be a var for the reassignment below
var count = df.count()
val limit = 50
while (count > 0) {
  df1 = df.limit(limit)
  df1.show()           // will print 50, next 50, etc. rows
  df = df.except(df1)
  count = count - limit
}
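A rough PySpark equivalent of that pagination loop, for reference; it mirrors the Scala above, including its inefficiency, with subtract playing the role of except:
count = df.count()
limit = 50
while count > 0:
    page = df.limit(limit)
    page.show()              # or page.collect() to process this slice on the driver
    df = df.subtract(page)   # drop the rows already shown
    count -= limit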
Looks like you are collecting the RDD, so it will definitely collect all the data to the driver node; that's why you are facing this issue. You should avoid collecting data if it is not required; if it is necessary, then specify spark.driver.maxResultSize. There are two ways of defining this variable:
1 - Create the Spark config by setting this variable:
conf.set("spark.driver.maxResultSize", "3g")
2 - Or set this variable in the spark-defaults.conf file present in the conf folder of Spark, like:
spark.driver.maxResultSize 3g
and restart Spark.
While starting the job or the shell, you can use
--conf spark.driver.maxResultSize=0
to remove the limit entirely (0 means unlimited).
There is also a Spark bug
https://issues.apache.org/jira/browse/SPARK-12837
that gives the same error
serialized results of X tasks (Y MB) is bigger than spark.driver.maxResultSize
even though you may not be pulling data to the driver explicitly.
SPARK-12837 addresses a Spark bug where, prior to Spark 2, accumulators/broadcast variables were pulled to the driver unnecessarily, causing this problem.
You can set spark.driver.maxResultSize to 2GB when you start the pyspark shell:
pyspark --conf "spark.driver.maxResultSize=2g"
This allows 2 GB for spark.driver.maxResultSize.
