How to load a huge CSV into a PySpark DataFrame?

I'm trying to load a huge genomic dataset (2504 lines and 14848614 columns) into a PySpark DataFrame, but without success. I'm getting java.lang.OutOfMemoryError: Java heap space. I thought the main idea of using Spark was exactly its independence from memory... (I'm a newbie at it, so please bear with me :)
This is my code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.driver.memory", "6G").getOrCreate()
file_location = "1kGp3_chr3_6_10.raw"
file_type = "csv"
infer_schema = "true"
first_row_is_header = "true"
delimiter = "\t"
max_cols = 15000000 # 14848614 variants loaded
data = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .option("maxColumns", max_cols) \
    .load(file_location)
I know we can set the StorageLevel with, for example, df.persist(StorageLevel.DISK_ONLY), but this is possible only after you successfully load the file into a DataFrame, isn't it? (Not sure if I'm missing something.)
Here's the error:
...
Py4JJavaError: An error occurred while calling o33.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
...
Thanks!
EDIT/UPDATE:
I forgot to mention the size of the CSV: 70G.
Here's another attempt which resulted in a different error:
I tried with a smaller dataset (2504 lines and 3992219 columns; file size: 19G) and increased "spark.driver.memory" to "12G".
After about 35 min of running the load method, I got:
Py4JJavaError: An error occurred while calling o33.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 54 tasks (1033.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)

Your error is telling you the problem: you don't have enough memory.
The value of PySpark is not independence from memory but its speed (because it works in RAM), the ability to persist certain data or operations, and the ability to leverage multiple machines.
So, solutions:
1) If possible, devote more RAM.
2) Depending on the size of your CSV file, you may or may not be able to fit it into memory on a laptop or desktop. If it does not fit, you may need to move it to something like a cloud instance for reasons of speed or cost. Even there you may not find a single machine large enough to hold the whole thing in memory (though, to be frank, the dataset would have to be pretty large, considering that Amazon's current maximum for a single memory-optimized instance (u-24tb1.metal) is 24,576 GiB).
And there you see the true power of PySpark: the ability to load truly giant datasets into RAM and process them across multiple machines.
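A rough back-of-the-envelope check shows why the heap fills up. The sketch below multiplies the row and column counts from the question by a few assumed cell widths. The widths are illustrative guesses, not measurements, and the JVM's real in-memory layout will differ.

```python
ROWS = 2504          # lines in the file, from the question
COLS = 14_848_614    # columns in the file, from the question


def dense_size_gib(rows: int, cols: int, bytes_per_cell: int) -> float:
    """Size in GiB of a dense rows x cols table at a fixed cell width."""
    return rows * cols * bytes_per_cell / 2**30


cells = ROWS * COLS
print(f"cells: {cells:,}")
for width in (1, 4, 8):  # assumed cell widths, not measured
    print(f"{width} byte(s) per cell: {dense_size_gib(ROWS, COLS, width):,.1f} GiB")
```

Even at one byte per cell, the estimate is well above the 6G and 12G driver heaps tried in the question.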

Related

Why am I getting a BufferHolder error in Databricks even though my cluster size is 8GB?

Hey, I am getting this error when I run my PySpark code in Databricks.
Caused by: java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 8 because the size after growing exceeds size limitation 2147483632
Can anyone please tell me why I am getting this error and how I can resolve it? The error occurs when I split the huge JSON file into two parts.
Here is the Python code.
from pyspark.sql.functions import explode, col
import itertools
# Read the JSON file from Databricks storage
df_json = spark.read.option("multiline","true").json("/mnt/BigData_JSONFiles/2022-10_040_05C0_in-network-rates_2_of_2.json")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
df_network=df_json.select(df_json.columns[1:])
df_version=df_json.select(df_json.columns[:1])
df_network.write.format("json").save("/mnt/BigData_JSONFiles/2022-10_040_05C0_in-network-rates_2_of_2_detail.json")
df_version.write.format("json").save("/mnt/BigData_JSONFiles/2022-10_040_05C0_in-network-rates_2_of_2_header.json")
display(df_network)
display(df_version)
Thank you :)
This happens because the size of one of your columns exceeds the 2GB limit. Details are in the article below.
https://learn.microsoft.com/en-us/azure/databricks/kb/sql/cannot-grow-bufferholder-exceeds-size
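To get a feel for which field is the heavy one, a small sample record can be inspected with the standard library. This is only a sketch: the helper `oversized_fields` and the sample record are hypothetical, and the serialized JSON size is just a proxy for the size of Spark's internal row buffer. The limit is the number from the error message, which equals 2**31 - 16.

```python
import json

# Limit quoted in the error message above: 2**31 - 16 bytes.
BUFFER_HOLDER_LIMIT = 2_147_483_632


def oversized_fields(record: dict, limit: int = BUFFER_HOLDER_LIMIT) -> dict:
    """Return {field: serialized size in bytes} for top-level fields above `limit`."""
    sizes = {key: len(json.dumps(value).encode("utf-8")) for key, value in record.items()}
    return {key: size for key, size in sizes.items() if size > limit}


# Hypothetical miniature record; a tiny limit is used so the check triggers.
sample = {"version": "1.0", "in_network": [{"x": 1}] * 100}
print(oversized_fields(sample, limit=50))
```

With the real limit, nothing in the miniature record is flagged. The point is only to rank top-level fields by size before deciding how to split the file.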

Spark Repartition Issue

Good day everyone,
I'm working on a project where I'm running an ETL process over millions of data records with the aid of Spark (2.4.4) and PySpark.
We fetch huge compressed CSV files from an S3 bucket in AWS, convert them into Spark DataFrames, call the repartition() method, and write each piece as parquet to lighten and speed up the process:
for file in files:
    if not self.__exists_parquet_in_s3(self.config['aws.output.folder'] + '/' + file, '.parquet'):
        # Run the parquet converter
        print('**** Processing %s ****' % file)
        # TODO: number of repartition variable
        df = SparkUtils.get_df_from_s3(self.spark_session, file, self.config['aws.bucket']).repartition(94)
        s3folderpath = 's3a://' + self.config['aws.bucket'] + \
            '/' + self.config['aws.output.folder'] + \
            '/%s' % file + '/'
        print('Writing down process')
        df.write.format('parquet').mode('append').save(
            '%s' % s3folderpath)
        print('**** Saving %s completed ****' % file)
        df.unpersist()
    else:
        print('Parquet files already exist!')
As a first step, this code checks whether the parquet files already exist in the S3 bucket; if they do not, it runs all the transformations for that file.
Now, let's get to the point. This pipeline works fine with every CSV file except one, which is identical to the others except for being much heavier, even after the repartition and conversion to parquet (29 MB x 94 parts vs 900 kB x 32 parts).
This causes a bottleneck after some time during the process (which is divided into identical cycles, where the number of cycles equals the number of partitions), and a Java heap space error is raised after several warnings:
WARN TaskSetManager: Stage X contains a task of very large size (x KB). The maximum recommended size is 100 KB.
[Screenshots of the warnings, parts 1 and 2, not included]
The most logical solution would be to further increase the repartition parameter to lower the weight of each parquet file, BUT it does not allow me to create more than 94 partitions. After some time during the for cycle mentioned above, it raises this error:
ERROR FileFormatWriter: Aborting job 8fc9c89f-dccd-400c-af6f-dfb312df0c72.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: HGC6JTRN5VT5ERRR, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method., S3 Extended Request ID: 7VBu4mUEmiAWkjLoRPruTiCY3IhpK40t+lg77HDNC0lTtc8h2Zi1K1XGSnJhjwnbLagN/kS+TpQ=
Or also:
[Screenshot of the second issue type, showing the warning, not included]
What I noticed is that I can under-partition the files relative to the original value: I can use 16 as the parameter instead of 94 and it runs fine, but if I increase it above 94, the original value, it won't work.
Remember that this pipeline works perfectly through to the end with other (lighter) CSV files; the only variable here seems to be the input file (its size in particular), which seems to make it stop after some time. If you need any other details, please let me know. I'd be extremely glad for any help. Thank you everyone in advance.
I'm not sure what the logic in your SparkUtils is, but based on the code and log you provided, this doesn't look related to your resources or partitioning. It may be caused by the connection between your Spark application and S3:
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403
403: your login doesn't have access to the bucket/file you are trying to read or write. The Hadoop documentation on authentication lists several cases that cause this error: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#Authentication_Failure. Since you mentioned that you see this error during the loop rather than at the beginning of your job, please check the running time of your Spark job, as well as the IAM and session authentication, since it may be caused by session expiration (1 hour by default, depending on how your DevOps team set it up). For details, see: https://docs.aws.amazon.com/singlesignon/latest/userguide/howtosessionduration.html.

toPandas() using pyarrow returns empty dataframe

I have a Spark DataFrame with 5 million rows and 250 columns. When I call toPandas() on this DataFrame with "spark.sql.execution.arrow.enabled" set to "true", it returns an empty DataFrame with just the columns.
With pyarrow disabled I get the error below:
Py4JJavaError: An error occurred while calling o124.collectToPython. : java.lang.OutOfMemoryError: GC overhead limit exceeded
Is there any way to perform this operation by increasing some sort of memory allocation?
I couldn't find any online resources for this except https://issues.apache.org/jira/browse/SPARK-28881, which is not that helpful.
OK, this problem is related to memory: when converting a Spark DataFrame to a pandas DataFrame, Spark (via Py4J) has to go through collect(), which consumes a lot of memory. So what I advise is to reconfigure the memory when creating the Spark session. Here is an example:
from pyspark import SparkContext
SparkContext.setSystemProperty('spark.executor.memory', '16g')
sc = SparkContext("local", "stack_over_flow")
Proceed with sc (the Spark context) or with a Spark session, as you like. Note that collect() gathers the data on the driver, so spark.driver.memory may need raising as well.
If this does not work, it may be a version conflict, so please check these options:
- Set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 in the environment where you run Python
- Downgrade to pyarrow < 0.15.0 for now.
If you can share your script, it will be clearer.
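Before picking a memory setting, it may help to estimate how much data toPandas() has to move. The sketch below uses the row and column counts from the question and assumes every column is a 64-bit number. That width is a guess, and string columns would be larger. It also shows how many rows would fit into a 1 GiB batch if the conversion were done in pieces.

```python
ROWS = 5_000_000     # from the question
COLS = 250           # from the question
BYTES_PER_CELL = 8   # assumption: every column is a 64-bit number


def rows_per_batch(budget_bytes: int, cols: int, bytes_per_cell: int) -> int:
    """How many full rows fit into a given memory budget."""
    return budget_bytes // (cols * bytes_per_cell)


total_bytes = ROWS * COLS * BYTES_PER_CELL
batch = rows_per_batch(2**30, COLS, BYTES_PER_CELL)
print(f"whole table: {total_bytes / 2**30:.1f} GiB")
print(f"rows per 1 GiB batch: {batch:,}")
```

Under these assumptions the raw values alone come to several GiB, before any JVM or Python object overhead, which is consistent with the GC overhead error.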

pyspark toPandas Error?

I have a messy and very big data set consisting of Chinese characters, numbers, strings, dates, etc. After I did some cleaning using PySpark and tried to turn it into a pandas DataFrame, it raised this error:
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--NotebookApp.iopub_data_rate_limit.
17/06/06 18:48:54 WARN TaskSetManager: Lost task 8.0 in stage 13.0 (TID 393, localhost): TaskKilled (killed intentionally)
Above the error, it outputs some of my original data. It's very long, so I only post part of it.
I have checked my cleaned data. All column types are int or double. Why does it still output my old data?
Try launching Jupyter Notebook with an increased 'iopub_data_rate_limit':
jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000
Source: https://github.com/jupyter/notebook/issues/2287
The best way is to put this in your jupyterhub_config.py file:
c.Spawner.args = ['--NotebookApp.iopub_data_rate_limit=1000000000']

Spark 1.4 increase maxResultSize memory

I am using Spark 1.4 for my research and struggling with the memory settings. My machine has 16GB of memory, so there should be no problem there, since my file is only 300MB. However, when I try to convert a Spark RDD to a pandas DataFrame using the toPandas() function, I receive the following error:
serialized results of 9 tasks (1096.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
I tried to fix this by changing the spark-config file, but I am still getting the same error. I've heard that this is a problem with Spark 1.4 and am wondering if you know how to solve it. Any help is much appreciated.
You can set the spark.driver.maxResultSize parameter in the SparkConf object:
from pyspark import SparkConf, SparkContext
# In Jupyter you have to stop the current context first
sc.stop()
# Create new config
conf = (SparkConf()
.set("spark.driver.maxResultSize", "2g"))
# Create new context
sc = SparkContext(conf=conf)
You should probably create a new SQLContext as well:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
From the command line, such as with pyspark, --conf spark.driver.maxResultSize=3g can also be used to increase the max result size.
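When choosing a value, it can help to read the two numbers out of the error message first. The sketch below does that with a regular expression written against the message format quoted in the question. The helper names and the headroom factor of 2 are arbitrary choices for illustration, not anything Spark prescribes.

```python
import math
import re

# Matches the message format quoted in the question.
PATTERN = re.compile(
    r"results of (\d+) tasks \(([\d.]+) MB\) is bigger than "
    r"spark\.driver\.maxResultSize \(([\d.]+) MB\)"
)


def parse_max_result_error(message: str):
    """Return (tasks, result_mb, limit_mb), or None if the text does not match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


def suggested_limit_gb(result_mb: float, headroom: float = 2.0) -> int:
    """Result size times an arbitrary headroom factor, rounded up to whole GiB."""
    return math.ceil(result_mb * headroom / 1024)


msg = ("serialized results of 9 tasks (1096.9 MB) is bigger than "
       "spark.driver.maxResultSize (1024.0 MB)")
tasks, result_mb, limit_mb = parse_max_result_error(msg)
print(tasks, result_mb, limit_mb, f"{suggested_limit_gb(result_mb)}g")
```

The job is aborted as soon as the limit is crossed, so the reported size is only a lower bound on what the full result would need.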
Tuning spark.driver.maxResultSize is good practice given the running environment. However, it is not the solution to your problem, as the amount of data may change over time. As @Zia-Kayani mentioned, it is better to collect data wisely. So if you have a DataFrame df, you can call df.rdd and do all the magic stuff on the cluster, not in the driver. However, if you need to collect the data, I would suggest:
Do not turn on spark.sql.parquet.binaryAsString. String objects take more space.
Use spark.rdd.compress to compress RDDs when you collect them.
Try to collect it using pagination. (Code in Scala, adapted from another answer: "Scala: How to get a range of rows in a dataframe".)
var remaining = df
var count = remaining.count()
val limit = 50
while (count > 0) {
  val page = remaining.limit(limit)
  page.show() // prints 50 rows, then the next 50, etc.
  remaining = remaining.except(page) // note: except also drops duplicate rows
  count -= limit
}
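The same pagination idea can be sketched in plain Python: consume rows from any iterator in fixed-size pages so that only one page is held at a time. The helper `paginate` is hypothetical. In PySpark the row source could be something like `df.toLocalIterator()`, but here a simple range stands in for it so the sketch runs on its own.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def paginate(rows: Iterable[T], page_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `page_size` items from `rows`."""
    iterator = iter(rows)
    while True:
        page = list(islice(iterator, page_size))
        if not page:
            return
        yield page


# A range stands in for the stream of rows.
page_lengths = [len(page) for page in paginate(range(120), 50)]
print(page_lengths)
```

Unlike the limit/except loop, this reads the source once and does not need a count up front.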
It looks like you are collecting the RDD, so it will definitely bring all the data to the driver node; that's why you are facing this issue.
Avoid collecting data from an RDD if it is not required; if it is necessary, then set spark.driver.maxResultSize. There are two ways of defining this variable:
1 - Set it on the Spark config:
conf.set("spark.driver.maxResultSize", "3g")
2 - Or set it in the spark-defaults.conf file in Spark's conf folder, like
spark.driver.maxResultSize 3g
and restart Spark.
When starting the job or terminal, you can use
--conf spark.driver.maxResultSize="0"
to remove the limit (0 means unlimited).
There is also a Spark bug
https://issues.apache.org/jira/browse/SPARK-12837
that gives the same error
serialized results of X tasks (Y MB) is bigger than spark.driver.maxResultSize
even though you may not be pulling data to the driver explicitly.
SPARK-12837 addresses a Spark bug in which, prior to Spark 2, accumulators/broadcast variables were pulled to the driver unnecessarily, causing this problem.
You can set spark.driver.maxResultSize to 2GB when you start the pyspark shell:
pyspark --conf "spark.driver.maxResultSize=2g"
This allows up to 2GB for spark.driver.maxResultSize.
