Spark DataFrame: saveAsTable LeaseExpiredException - python

In my spark application I am saving a dataframe as a parquet file as follows,
comp_df.write.mode("overwrite").saveAsTable("cdr_step1", format="parquet", path="/data/intermediate_data/cdr_step1/")
If my dataframe size is small this works fine. But as the dataset size increases I am getting the following error.
I looked this issue up on the internet, and in most places people resolve it by changing their code design. In my case I have only a single write operation and I don't understand what I need to change.
17/02/02 13:22:56 ERROR datasources.DefaultWriterContainer: Job job_201702021228_0000 aborted.
17/02/02 13:22:56 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
17/02/02 13:22:56 WARN scheduler.TaskSetManager: Lost task 1979.0 in stage 3.0 (TID 1984, slv3.cdh-prod.xxxx.com, executor 86): org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /data/intermediate_data/cdr_step1/_temporary/0/_temporary/attempt_201702021322_0003_m_001979_0/part-r-01979-9fe33b7c-0b14-4e63-8e96-6e83aabbe807.gz.parquet (inode 2144221): File does not exist. Holder DFSClient_NONMAPREDUCE_-1523564925_148 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3635)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3438)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3294)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:679)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:214)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:489)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
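A commonly reported trigger for a LeaseExpiredException on a _temporary attempt path is two task attempts racing on the same output file, for example when speculative execution retries a slow writer while the first attempt still holds the file open. A hedged sketch of the same write with speculation explicitly disabled (this is a guess at the cause, not a confirmed fix, and comp_df is the DataFrame from the question):

from pyspark.sql import SparkSession

# Hedged sketch: build (or submit) the job with speculative execution disabled so
# only one attempt writes each _temporary part file. Equivalent to passing
# --conf spark.speculation=false to spark-submit.
spark = SparkSession.builder \
    .config("spark.speculation", "false") \
    .getOrCreate()

# Same write as in the question; comp_df is the DataFrame built earlier in the app.
comp_df.write.mode("overwrite").saveAsTable(
    "cdr_step1", format="parquet", path="/data/intermediate_data/cdr_step1/")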

Related

Spark Repartition Issue

Good day everyone,
I'm working on a project where I'm running an ETL process over millions of data records with the aid of Spark (2.4.4) and PySpark.
We're fetching huge compressed CSV files from an S3 bucket in AWS, converting them into Spark DataFrames, repartitioning them with the repartition() method, and converting each piece to parquet to lighten and speed up the process:
for file in files:
    if not self.__exists_parquet_in_s3(self.config['aws.output.folder'] + '/' + file, '.parquet'):
        # Run the parquet converter
        print('**** Processing %s ****' % file)
        # TODO: number of repartition variable
        df = SparkUtils.get_df_from_s3(self.spark_session, file, self.config['aws.bucket']).repartition(94)
        s3folderpath = 's3a://' + self.config['aws.bucket'] + \
                       '/' + self.config['aws.output.folder'] + \
                       '/%s' % file + '/'
        print('Writing down process')
        df.write.format('parquet').mode('append').save('%s' % s3folderpath)
        print('**** Saving %s completed ****' % file)
        df.unpersist()
    else:
        print('Parquet files already exist!')
As a first step, this code checks inside the S3 bucket whether the parquet files already exist; if not, it enters the for loop and runs all the transformations.
Now, to the point. This pipeline works fine with every CSV file except one, which is identical to the others except that it is much heavier, even after repartitioning and conversion to parquet (29 MB x 94 parts vs 900 kB x 32 parts).
This causes a bottleneck after some time during the process (which is divided into identical cycles, one per partition), eventually raising a Java heap space error after several warnings like:
WARN TaskSetManager: Stage X contains a task of very large size (x KB). The maximum recommended size is 100 KB.
(The original post included two screenshots of these warnings; they are not reproduced here.)
The most logical solution would be to increase the repartition parameter further to lower the size of each parquet file, BUT it does not allow me to create more than 94 partitions: after some time during the for loop mentioned above, it raises this error:
ERROR FileFormatWriter: Aborting job 8fc9c89f-dccd-400c-af6f-dfb312df0c72.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: HGC6JTRN5VT5ERRR, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method., S3 Extended Request ID: 7VBu4mUEmiAWkjLoRPruTiCY3IhpK40t+lg77HDNC0lTtc8h2Zi1K1XGSnJhjwnbLagN/kS+TpQ=
Or, alternatively, a second type of error appears (shown in the original post as a screenshot that also contains a warning; not reproduced here).
What I noticed is that I can under-partition relative to the original value: I can use 16 as the parameter instead of 94 and it runs fine, but if I increase it above 94, the original value, it won't work.
Remember that this pipeline works perfectly end-to-end with the other (lighter) CSV files; the only variable here seems to be the input file (its size in particular), which makes it stop after some time. If you need any other detail please let me know; I'd be extremely glad for your help. Thank you all in advance.
I'm not sure what the logic in your SparkUtils is, but based on the code and log you provided, this doesn't look related to your resources or partitioning; it may be caused by the connection between your Spark application and S3:
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403
A 403 means your credentials don't have access to the bucket/file you are trying to read/write. The Hadoop documentation on S3A authentication lists several cases that can cause this error: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#Authentication_Failure. Since you see this error during the loop rather than at the beginning of your job, check how long the Spark job has been running, and also the IAM and session authentication: it may be caused by session expiration (default 1 hour, depending on how your DevOps team set it up). Details: https://docs.aws.amazon.com/singlesignon/latest/userguide/howtosessionduration.html.
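If it does turn out to be expiring session credentials, one hedged way to wire refreshed STS credentials into the S3A connector (the environment-variable names below are just the standard AWS ones, used here as placeholders) looks roughly like this:

import os
from pyspark.sql import SparkSession

# Hedged sketch: hand (refreshed) temporary credentials to the S3A filesystem.
# How you actually obtain fresh STS credentials depends on your IAM setup.
access_key = os.environ["AWS_ACCESS_KEY_ID"]
secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]
session_token = os.environ["AWS_SESSION_TOKEN"]

spark = (SparkSession.builder
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
         .config("spark.hadoop.fs.s3a.access.key", access_key)
         .config("spark.hadoop.fs.s3a.secret.key", secret_key)
         .config("spark.hadoop.fs.s3a.session.token", session_token)
         .getOrCreate())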

How to load a huge CSV to a Pyspark DataFrame?

I'm trying to load a huge genomic dataset (2504 lines and 14848614 columns) into a PySpark DataFrame, with no success. I'm getting java.lang.OutOfMemoryError: Java heap space. I thought the main point of using Spark was precisely independence from memory... (I'm a newbie at it. Please bear with me :)
This is my code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.driver.memory", "6G").getOrCreate()
file_location = "1kGp3_chr3_6_10.raw"
file_type = "csv"
infer_schema = "true"
first_row_is_header = "true"
delimiter = "\t"
max_cols = 15000000 # 14848614 variants loaded
data = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .option("maxColumns", max_cols) \
    .load(file_location)
I know we can set the StorageLevel with, for example, df.persist(StorageLevel.DISK_ONLY), but that is only possible after you successfully load the file into a DataFrame, isn't it? (Not sure if I'm missing something.)
Here's the error:
...
Py4JJavaError: An error occurred while calling o33.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
...
Thanks!
EDIT/UPDATE:
I forgot to mention the size of the CSV: 70G.
Here's another attempt which resulted in a different error:
I tried with a smaller dataset (2504 lines and 3992219 columns; file size: 19G) and increased the memory setting to spark.driver.memory = "12G".
After about 35 min running the load method, I got:
Py4JJavaError: An error occurred while calling o33.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 54 tasks (1033.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
Your error is telling you the problem - you don't have enough memory.
The value of PySpark is not independence from memory, but speed (because it uses RAM), the ability to persist certain data or operations, and the ability to leverage multiple machines.
So, solutions -
1) If possible, devote more RAM (a minimal config sketch follows at the end of this answer).
2) Depending on the size of your CSV file, you may or may not be able to fit it into memory on a laptop or desktop. In that case, you may need to move this onto something like a cloud instance, for reasons of speed or cost. Even there you may not find a single machine large enough to fit the whole thing in memory (though frankly it would have to be pretty large, considering Amazon's current max for a single memory-optimized instance (u-24tb1.metal) is 24,576 GiB).
And that is where you see the true power of PySpark: the ability to load truly giant datasets into RAM and spread the work across multiple machines.
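As a hedged illustration of option 1 (the second error above also names spark.driver.maxResultSize, which can be raised the same way), the session could be configured roughly like this; the exact values depend on what your machine can actually spare:

from pyspark.sql import SparkSession

# Hedged sketch: more driver heap, and a higher cap on results collected to the driver.
spark = (SparkSession.builder
         .config("spark.driver.memory", "12G")
         .config("spark.driver.maxResultSize", "4G")
         .getOrCreate())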

pyspark toPandas Error?

I have a messy and very big data set consisting of Chinese characters, numbers, strings, dates, etc. After doing some cleaning with PySpark, I want to turn it into a pandas DataFrame, but it raises this error:
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--NotebookApp.iopub_data_rate_limit.
17/06/06 18:48:54 WARN TaskSetManager: Lost task 8.0 in stage 13.0 (TID 393, localhost): TaskKilled (killed intentionally)
Above the error, it also outputs some of my original data. It's very long, so I've only posted part of it.
I have checked my cleaned data: all column types are int or double. Why does it still output my old data?
Try launching the Jupyter notebook with an increased 'iopub_data_rate_limit', like so:
jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000
Source: https://github.com/jupyter/notebook/issues/2287
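If you'd rather not pass the flag on every launch, the same setting can, as far as I know, also live in your per-user Jupyter config file:

# In ~/.jupyter/jupyter_notebook_config.py (create it with: jupyter notebook --generate-config)
c.NotebookApp.iopub_data_rate_limit = 10000000000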
The best way is to put this in your jupyterhub_config.py file:
c.Spawner.args = ['--NotebookApp.iopub_data_rate_limit=1000000000']

java.lang.OutOfMemoryError: Unable to acquire 100 bytes of memory, got 0

I'm invoking Pyspark with Spark 2.0 in local mode with the following command:
pyspark --executor-memory 4g --driver-memory 4g
The input dataframe is read from a TSV file and has 580 K rows x 28 columns. I'm doing a few operations on the dataframe and then trying to export it to a TSV file, which is when I get this error.
df.coalesce(1).write.save("sample.tsv",format = "csv",header = 'true', delimiter = '\t')
Any pointers on how to get rid of this error? I can easily display the df or count the rows.
The output dataframe is 3100 rows with 23 columns
Error:
Job aborted due to stage failure: Task 0 in stage 70.0 failed 1 times, most recent failure: Lost task 0.0 in stage 70.0 (TID 1073, localhost): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Unable to acquire 100 bytes of memory, got 0
at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:129)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.fetchNextRow(WindowExec.scala:300)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.<init>(WindowExec.scala:309)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15.apply(WindowExec.scala:289)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15.apply(WindowExec.scala:288)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
... 8 more
Driver stacktrace:
I believe that the cause of this problem is coalesce(): although it avoids a full shuffle (as repartition would do), it has to shrink the data into the requested number of partitions.
Here, you are requesting all the data to fit into one partition, thus one task (and only one task) has to work with all the data, which may cause its container to suffer from memory limitations.
So, either ask for more partitions than 1, or avoid coalesce() in this case.
Otherwise, you could try the solutions provided in the links below, for increasing your memory configurations:
Spark java.lang.OutOfMemoryError: Java heap space
Spark runs out of memory when grouping by key
In my case, replacing coalesce(1) with repartition(1) worked.
The problem for me was indeed coalesce().
What I did was export the file without coalesce(), writing it as parquet with df.write.parquet("testP"), then read the file back and export that with coalesce(1).
Hopefully it works for you as well.
As was stated in other answers, use repartition(1) instead of coalesce(1). The reason is that repartition(1) will ensure that upstream processing is done in parallel (multiple tasks/partitions), rather than on only one executor.
To quote the Dataset.coalesce() Spark docs:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition(1) instead. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
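A hedged sketch of that suggestion applied to the export in the question (assuming Spark 2.x's built-in CSV writer; the output is a directory containing a single part file):

# Hedged sketch: keep upstream stages parallel, then shuffle down to a single output file.
(df.repartition(1)
   .write
   .mode("overwrite")
   .option("header", "true")
   .option("sep", "\t")   # tab-separated, mirroring the original intent
   .csv("sample_tsv"))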
In my case the driver was smaller than the workers. Issue was resolved by making the driver larger.

Python -> Py4j -> Spark -> Cassandra

I would like to run a simple Spark row-count job on a test Cassandra table with only four rows, just to verify that everything works.
I can quickly get this working from Java:
JavaSparkContext sc = new JavaSparkContext(conf);
SparkContextJavaFunctions sparkContextJavaFunctions = CassandraJavaUtil.javaFunctions(sc);
CassandraJavaRDD<CassandraRow> table = sparkContextJavaFunctions.cassandraTable("demo", "playlists");
long count = table.count();
Now I'd like to get the same thing working in Python. The Spark distribution comes with a set of unbundled PySpark source code for using Spark from Python. It uses a library called py4j to launch a Java server and marshal Java commands through a TCP gateway, and I'm using that gateway directly to get this working.
I specify the following extra jars to the Java SparkSubmit host via the --driver-class-path option:
spark-cassandra-connector-java_2.11-1.2.0-rc1.jar
spark-cassandra-connector_2.11-1.2.0-rc1.jar
cassandra-thrift-2.1.3.jar
cassandra-clientutil-2.1.3.jar
cassandra-driver-core-2.1.5.jar
libthrift-0.9.2.jar
joda-convert-1.2.jar
joda-time-2.3.jar
Here is the core Python code to do the row count test:
from pyspark.java_gateway import launch_gateway

jvm_gateway = launch_gateway()
# `conf` below is a JVM SparkConf object built earlier (not shown in this excerpt)
sc = jvm_gateway.jvm.org.apache.spark.api.java.JavaSparkContext(conf)
spark_cass_functions = jvm_gateway.jvm.com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions(sc)
table = spark_cass_functions.cassandraTable("demo", "playlists")
On this last line, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o5.cassandraTable.
: com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.connection.conf.factory is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.auth.conf.factory is not a valid Spark Cassandra Connector variable.
No likely matches found.
at com.datastax.spark.connector.util.ConfigCheck$.checkConfig(ConfigCheck.scala:38)
at com.datastax.spark.connector.rdd.CassandraRDD.<init>(CassandraRDD.scala:18)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.<init>(CassandraTableScanRDD.scala:59)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$.apply(CassandraTableScanRDD.scala:182)
at com.datastax.spark.connector.japi.SparkContextJavaFunctions.cassandraTable(SparkContextJavaFunctions.java:88)
at com.datastax.spark.connector.japi.SparkContextJavaFunctions.cassandraTable(SparkContextJavaFunctions.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Clearly, there is some configuration or setup issue. I'm not sure how to reasonably debug or investigate or what I could try. Can anyone with more Cassandra/Python/Spark expertise provide some advice? Thank you!
EDIT: A coworker had set up a spark-defaults.conf file that was the root cause of this. I don't fully understand why it caused problems from Python and not from Java, but it doesn't matter. I didn't want that conf file, and removing it resolved my issue.
That is a known bug in the Spark Cassandra Connector in 1.2.0-rc1 and 1.2.0-rc2; it will be fixed in rc3.
Relevant Tickets
https://datastax-oss.atlassian.net/browse/SPARKC-102
https://datastax-oss.atlassian.net/browse/SPARKC-105
https://datastax-oss.atlassian.net/browse/SPARKC-108
You could always try out pyspark_cassandra. It's built against 1.2.0-rc3 and probably is a lot easier when working with Cassandra in pyspark.
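For what it's worth, with later connector releases (this is an assumption about versions newer than the 1.2.0-rc line discussed here) the usual route from Python is the DataFrame API rather than the raw py4j gateway; a hedged sketch of the same row count:

from pyspark.sql import SparkSession

# Hedged sketch: count rows of demo.playlists through the Cassandra data source.
# Assumes the spark-cassandra-connector jar is on the classpath; the host below
# is a placeholder for your cluster's contact point.
spark = (SparkSession.builder
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

count = (spark.read
         .format("org.apache.spark.sql.cassandra")
         .options(keyspace="demo", table="playlists")
         .load()
         .count())
print(count)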
