Spark Repartition Issue - python

Good day everyone,
I'm working with a project where I'm running an ETL process over millions of data records with the aid of Spark (2.4.4) and PySpark.
We're fetching huge compressed CSV files from an S3 bucket in AWS, converting them into Spark DataFrames, using the repartition() method, and converting each piece to parquet to lighten and speed up the process:
for file in files:
    if not self.__exists_parquet_in_s3(self.config['aws.output.folder'] + '/' + file, '.parquet'):
        # Run the parquet converter
        print('**** Processing %s ****' % file)
        # TODO: number of repartition variable
        df = SparkUtils.get_df_from_s3(self.spark_session, file, self.config['aws.bucket']).repartition(94)
        s3folderpath = 's3a://' + self.config['aws.bucket'] + \
            '/' + self.config['aws.output.folder'] + \
            '/%s' % file + '/'
        print('Writing down process')
        df.write.format('parquet').mode('append').save(
            '%s' % s3folderpath)
        print('**** Saving %s completed ****' % file)
        df.unpersist()
    else:
        print('Parquet files already exist!')
So as a first step, this piece of code checks whether the parquet files already exist in the S3 bucket; if they don't, it enters the for cycle and runs all the transformations.
Now, let's get to the point. This pipeline works fine with every CSV file except one, which is identical to the others except for being much heavier, even after the repartition and conversion to parquet (29 MB x 94 parts vs 900 kB x 32 parts).
This is causing a bottleneck after some time during the process (which is divided into identical cycles, where the number of cycles is equal to the number of repartitions made) raising a java heap memory space issue after several Warnings:
WARN TaskSetManager: Stage X contains a task of very large size (x KB). The maximum recommended size is 100 KB. (Also see pics below)
[Screenshot: Part 1]
[Screenshot: Part 2]
The most logical solution would be to increase the repartition parameter further to lower the weight of each parquet file, BUT it does not let me create more than 94 partitions: after some time during the for cycle (mentioned above) it raises this error:
ERROR FileFormatWriter: Aborting job 8fc9c89f-dccd-400c-af6f-dfb312df0c72.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: HGC6JTRN5VT5ERRR, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method., S3 Extended Request ID: 7VBu4mUEmiAWkjLoRPruTiCY3IhpK40t+lg77HDNC0lTtc8h2Zi1K1XGSnJhjwnbLagN/kS+TpQ=
Or also:
[Screenshot: second issue type, notice the warning]
What I noticed is that I can under-partition the files relative to the original value: I can use 16 as the parameter instead of 94 and it will run fine, but if I increase it above 94, the original value, it won't work.
Remember this pipeline is perfectly working until the end with other (lighter) CSV files, the only variable here seems to be the input file (size in particular) which seems to make it stop after some time. If you need any other detail please let me know, I'll be extremely glad if you help me with this. Thank you everyone in advance.
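The sizing arithmetic behind the question (29 MB x 94 parts for the heavy file) can be sketched in plain Python. This is only an illustration: the helper name partitions_for_target and the 8 MB target are assumptions, not values from the pipeline, and it assumes the output is spread roughly evenly across partitions. It does not address the 403 error raised above 94 partitions.

```python
def partitions_for_target(total_bytes: int, target_bytes: int) -> int:
    """Smallest partition count that keeps each part at or below target_bytes.

    Hypothetical helper for illustration; assumes data is spread evenly.
    """
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    if total_bytes <= 0:
        return 1
    # Integer ceiling division avoids floating point rounding.
    return (total_bytes + target_bytes - 1) // target_bytes


MB = 1000 * 1000
heavy_total = 29 * MB * 94  # figures quoted in the question

# Assumed target of 8 MB per part (not a value from the original post).
print(partitions_for_target(heavy_total, 8 * MB))
```

The result could then replace the hard-coded 94 in repartition(), which is what the TODO in the loop hints at.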

I'm not sure what the logic in your SparkUtils is, but based on the code and log you provided, this doesn't look related to your resources or partitioning. It may be caused by the connection between your Spark application and S3:
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403
403: your login doesn't have access to the bucket/file you are trying to read/write. The Hadoop documentation on authentication lists several cases that cause this error: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#Authentication_Failure. Since you mentioned that you see this error during the loop but not at the beginning of your job, please check the running time of your Spark job, and also the IAM and session authentication, as it may be caused by session expiration (1 hour by default, depending on how your DevOps team set it up). For details, see: https://docs.aws.amazon.com/singlesignon/latest/userguide/howtosessionduration.html.

Related

How to load a huge CSV to a Pyspark DataFrame?

I'm trying to load a huge genomic dataset (2504 lines and 14848614 columns) to a PySpark DataFrame, but no success. I'm getting java.lang.OutOfMemoryError: Java heap space. I thought the main idea of using spark was exactly the independency of memory... (I'm newbie on it. Please, bear with me :)
This is my code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.driver.memory", "6G").getOrCreate()
file_location = "1kGp3_chr3_6_10.raw"
file_type = "csv"
infer_schema = "true"
first_row_is_header = "true"
delimiter = "\t"
max_cols = 15000000 # 14848614 variants loaded
data = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .option("maxColumns", max_cols) \
    .load(file_location)
I know we can set the StorageLevel with, for example, df.persist(StorageLevel.DISK_ONLY), but this is possible only after you successfully load the file into a DataFrame, isn't it? (not sure if I'm missing something)
Here's the error:
...
Py4JJavaError: An error occurred while calling o33.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
...
Thanks!
EDIT/UPDATE:
I forgot to mention the size of the CSV: 70G.
Here's another attempt which resulted in a different error:
I tried with a smaller dataset (2504 lines and 3992219 columns. File size: 19G), and increased memory to "spark.driver.memory", "12G".
After about 35 min running the load method, I got:
Py4JJavaError: An error occurred while calling o33.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 54 tasks (1033.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
Your error is telling you the problem - you don't have enough memory.
The value in using PySpark is not independence from memory but its speed (because it uses RAM), the ability to have certain data or operations persist, and the ability to leverage multiple machines.
So, solutions -
1) If possible, devote more RAM.
2) Depending on the size of your CSV file, you may or may not be able to fit it into memory on a laptop or desktop. In that case, you may need to put this into something like a cloud instance for reasons of speed or cost. Even there you may not find a single machine large enough to fit the whole thing in memory (though to be frank that would be pretty large, considering Amazon's current max for a single memory-optimized (u-24tb1.metal) instance is 24,576 GiB).
And there you see the true power of PySpark: the ability to load truly giant datasets into RAM and run across multiple machines.
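To put the numbers from the question in perspective, here is a rough back-of-the-envelope calculation. It assumes that "70G" means 70 * 10^9 bytes and that all rows are about equally long; the function name is made up for illustration.

```python
def describe_wide_csv(rows: int, cols: int, file_bytes: int) -> dict:
    """Rough size statistics for a very wide delimited file (illustrative only)."""
    cells = rows * cols
    return {
        "cells": cells,
        "avg_row_mb": file_bytes / rows / 1e6,
        "avg_bytes_per_cell": file_bytes / cells,
    }


# Figures from the question: 2504 lines, 14848614 columns, 70G on disk
# (assumed here to be 70 * 10**9 bytes).
stats = describe_wide_csv(rows=2504, cols=14_848_614, file_bytes=70 * 10**9)
print(stats)
```

Under these assumptions each line is roughly 28 MB of text, and the 6G of driver memory configured in the question is far below the size of the file, which fits the advice above to devote more RAM.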

Trying to split a large TSV file on S3 w/ Lambda

The Goal: I have a data generating process which creates one large TSV file on S3 (somewhere between 30-40 GB in size). Because of some data processing I want to do on it, it's easier to have it in many smaller files (~1 GB in size or smaller). Unfortunately I don't have a lot of ability to change the original data generating process to partition out the files at creation, so I'm trying to create a simple Lambda to do it for me. My attempt is below:
import json
import boto3
import codecs
def lambda_handler(event, context):
    s3 = boto3.client('s3')
    read_bucket_name = 'some-bucket'
    write_bucket_name = 'some-other-bucket'
    original_key = 'some-s3-file-key'
    obj = s3.get_object(Bucket=read_bucket_name, Key=original_key)
    lines = []
    line_count = 0
    file_count = 0
    MAX_LINE_COUNT = 500000
    def create_split_name(file_count):
        return f'{original_key}-{file_count}'
    def create_body(lines):
        return ''.join(lines)
    for ln in codecs.getreader('utf-8')(obj['Body']):
        if line_count > MAX_LINE_COUNT:
            key = create_split_name(file_count)
            s3.put_object(
                Bucket=write_bucket_name,
                Key=key,
                Body=create_body(lines)
            )
            lines = []
            line_count = 0
            file_count += 1
        lines.append(ln)
        line_count += 1
    if len(lines) > 0:
        file_count += 1
        key = create_split_name(file_count)
        s3.put_object(
            Bucket=write_bucket_name,
            Key=key,
            Body=create_body(lines)
        )
    return {
        'statusCode': 200,
        'body': { 'file_count': file_count }
    }
This functionally works, which is great, but the problem is that on sufficiently large files it can't finish in the 15 min run window of an AWS Lambda. So my questions are these:
Can this code be optimized in any appreciable way to reduce run time (I'm not an expert on profiling lambda code)?
Will porting this to a compiled language provide any real benefit to run time?
Are there other utilities within AWS that can solve this problem? (A quick note here, I know that I could spin up an EC2 server to do this for me but ideally I'm trying to find a serverless solution)
UPDATE: Another option I have tried is to not split up the file but to tell different Lambda jobs to simply read different parts of the same file using Range.
I can try to read a file by doing
obj = s3.get_object(Bucket='cradle-smorgasbord-drop', Key=key, Range=bytes_range)
lines = [line for line in codecs.getreader('utf-8')(obj['Body'])]
However on an approximately 30 GB file I had bytes_range=0-49999999 which is only the first 50 MB and the download is taking way longer than I would think that it should for that amount of data (I actually haven't even seen it finish yet)
To avoid hitting the limit of 15 minutes for the execution of AWS Lambda functions you have to ensure that you only read as much data from S3 as you can process in 15 minutes or less.
How much data from S3 you can process in 15 minutes or less depends on your function logic and the CPU and network performance of the AWS Lambda function. The available CPU performance of AWS Lambda functions scales with memory provided to the AWS Lambda function. From the AWS Lambda documentation:
Lambda allocates CPU power linearly in proportion to the amount of memory configured. At 1,792 MB, a function has the equivalent of one full vCPU (one vCPU-second of credits per second).
So as a first step you could try to increase the provided memory to see if that improves the amount of data your function can process in 15 minutes.
Increasing the CPU performance for AWS Lambda functions might already solve your problem for now, but it doesn't scale well in case you have to process larger files in future.
Fortunately there is a solution for that: When reading objects from S3 you don't have to read the whole object at once, but you can use range requests to only read a part of the object. To do that all you have to do is to specify the range you want to read when calling get_object(). From the boto3 documentation for get_object():
Range (string) -- Downloads the specified range bytes of an object. For more information about the HTTP Range header, go to http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.
In your case instead of triggering your AWS Lambda function once per object in S3 to process, you'd trigger it multiple times for the same objects, but to process different chunks of that object. Depending on how you invoke your function you might need another AWS Lambda function to examine the size of the objects in S3 to process (using head_object()) and trigger your actual Lambda function once for each chunk of data.
While you need that additional chunking logic, you wouldn't need to split the read data in your original AWS Lambda function anymore, as you could simply ensure that each chunk has a size of 1GB and only the data belonging to the chunk is read thanks to the range request. As you'd invoke a separate AWS Lambda function for each chunk you'd also parallelize your currently sequential logic, resulting in faster execution.
Finally you could drastically decrease the amount of memory consumed by your AWS Lambda function, by not reading the whole data into memory, but using streaming instead.
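The chunking logic described above can be sketched as a small helper that turns an object size into Range values. This is a minimal sketch, not a complete solution: the function name byte_ranges is hypothetical, and plain byte ranges can cut a line in two at a chunk boundary, which each worker would still have to handle.

```python
from typing import List


def byte_ranges(object_size: int, chunk_size: int) -> List[str]:
    """Split [0, object_size) into HTTP Range values of at most chunk_size bytes."""
    if object_size < 0 or chunk_size <= 0:
        raise ValueError("object_size must be >= 0 and chunk_size must be > 0")
    ranges = []
    for start in range(0, object_size, chunk_size):
        # The end position of an HTTP byte range is inclusive.
        end = min(start + chunk_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges


# Example: a 2.5 GB object split into chunks of at most 1 GB.
example = byte_ranges(2_500_000_000, 1_000_000_000)
print(example)

# With boto3, the size could come from head_object() and each value could be
# passed as the Range argument of get_object(), for example:
#   size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
#   obj = s3.get_object(Bucket=bucket, Key=key, Range=example[0])
```

Each Range value would then be handed to a separate Lambda invocation, so that no single invocation has to read the whole object.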

Python "random" MemoryError

I would like to understand what is happening with a MemoryError that seems to occur more or less randomly.
I'm running a Python 3 program under Docker on an Azure VM (2 CPUs & 7 GB RAM).
To keep it simple, the program deals with binary files that are read by a specific library (there's no problem there), then I merge them by pairs of files and finally insert the data into a database.
The dataset that I get after the merge (and before the db insert) is a Pandas dataframe that contains around 2.8M rows and 36 columns.
For the insertion into the database, I'm using a REST API that obliges me to insert the file in chunks.
Before that, I'm transforming the dataframe into a StringIO buffer using this function:
# static method from Utils class
@staticmethod
def df_to_buffer(my_df):
    count_row, count_col = my_df.shape
    buffer = io.StringIO()  # creating an empty buffer
    my_df.to_csv(buffer, index=False)  # filling that buffer
    LOGGER.info('Current data contains %d rows and %d columns, for a total '
                'buffer size of %d bytes.', count_row, count_col, buffer.tell())
    buffer.seek(0)  # set to the start of the stream
    return buffer
So in my "main" program the behaviour is :
# transform the dataframe to a StringIO buffer
file_data = Utils.df_to_buffer(file_df)
buffer_chunk_size = 32000000 #32MB
while True:
    data = file_data.read(buffer_chunk_size)
    if data:
        ...
        # do the insert stuff
        ...
    else:
        # whole file has been loaded
        break
# loop is over, close the buffer before processing a new file
file_data.close()
The problem :
Sometimes I am able to insert 2 or 3 files in a row. Sometimes a MemoryError occurs at a random moment (but always when it's about to insert a new file).
The error occurs at the first iteration of a file insert (never in the middle of the file). It specifically crashes on the line that does the chunked read, file_data.read(buffer_chunk_size).
I'm monitoring the memory during the process (using htop): it never goes higher than 5.5 GB, and notably, when the crash occurs, it is running at around 3.5 GB of used memory...
Any information or advice would be appreciated,
thanks. :)
EDIT
I was able to debug and kind of identify the problem but did not solve it yet.
It occurs when I read the StringIO buffer by chunk. The data chunk increases the RAM consumption a lot, as it is a big str that contains 32000000 characters of the file.
I tried to reduce it from 32000000 to 16000000. I was able to insert some files, but after some time the MemoryError occurs again... I'm trying to reduce it to 8000000 right now.
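One possible alternative, which is not something tried in the post above, is to build each chunk directly from the rows instead of first serializing the whole dataframe into one buffer and then slicing it. The sketch below uses only the standard library; the name csv_chunks is hypothetical, and whether this removes the MemoryError in the setup described is not established.

```python
import csv
import io
from typing import Iterable, Iterator, Sequence


def csv_chunks(header: Sequence[str], rows: Iterable[Sequence],
               max_chars: int) -> Iterator[str]:
    """Yield CSV text in pieces of roughly max_chars characters.

    Only one piece is held in memory at a time. The header is written
    once, at the start of the first piece.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # writerow() returns what the underlying write() returns: the number of
    # characters written to the StringIO.
    written = writer.writerow(header)
    for row in rows:
        written += writer.writerow(row)
        if written >= max_chars:
            yield buffer.getvalue()
            buffer = io.StringIO()
            writer = csv.writer(buffer)
            written = 0
    if written:
        yield buffer.getvalue()


# Tiny example with a deliberately small chunk size.
chunks = list(csv_chunks(["a", "b"], [(1, 2), (3, 4), (5, 6)], max_chars=8))
print(chunks)
```

With pandas, the header could come from list(my_df.columns) and the rows from my_df.itertuples(index=False, name=None). This is an assumption about how it would be wired in, and the formatting of some values (for example missing values) may differ from what to_csv() produces.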

Spark coalesce vs collect, which one is faster?

I am using pyspark to process 50Gb data using AWS EMR with ~15 m4.large cores.
Each row of the data contains some information at a specific time on a day. I am using the following for loop to extract and aggregate information for every hour. Finally I union the data, as I want my result to save in one csv file.
# daily_df is a empty pyspark DataFrame
for hour in range(24):
    hourly_df = df.filter(hourFilter("Time")).groupby("Animal").agg(mean("weights"), sum("is_male"))
    daily_df = daily_df.union(hourly_df)
As far as I know, I have to do the following to force the pyspark.sql.DataFrame object to save to 1 csv file (approx 1Mb) instead of 100+ files:
daily_df.coalesce(1).write.csv("some_local.csv")
It took about 70 min to finish this process, and I am wondering if I can make it faster by using the collect() method, like this:
daily_df_pandas = daily_df.collect()
daily_df_pandas.to_csv("some_local.csv")
Both coalesce(1) and collect are pretty bad in general but with expected output size around 1MB it doesn't really matter. It simply shouldn't be a bottleneck here.
One simple improvement is to drop loop -> filter -> union and perform a single aggregation:
df.groupby(hour("Time"), col("Animal")).agg(mean("weights"), sum("is_male"))
If that's not enough, then most likely the issue here is configuration (a good place to start could be adjusting spark.sql.shuffle.partitions, if you don't do that already).
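As a plain-Python illustration (not Spark code) of why a single aggregation on (hour, animal) gives the same result as loop -> filter -> union, here is a small sketch. The sample rows and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (hour, animal, weight, is_male)
rows = [
    (0, "cat", 4.0, 1),
    (0, "cat", 6.0, 0),
    (0, "dog", 20.0, 1),
    (1, "cat", 5.0, 1),
]


def aggregate_with_loop(rows):
    """Loop -> filter -> union: scans all rows once per hour (24 passes)."""
    result = {}
    for hour in range(24):
        groups = defaultdict(list)
        for h, animal, weight, is_male in rows:
            if h == hour:
                groups[animal].append((weight, is_male))
        for animal, values in groups.items():
            mean_weight = sum(w for w, _ in values) / len(values)
            males = sum(m for _, m in values)
            result[(hour, animal)] = (mean_weight, males)
    return result


def aggregate_single_pass(rows):
    """Single group-by on (hour, animal): one pass over the rows."""
    acc = defaultdict(lambda: [0.0, 0, 0])  # [weight sum, row count, male count]
    for h, animal, weight, is_male in rows:
        entry = acc[(h, animal)]
        entry[0] += weight
        entry[1] += 1
        entry[2] += is_male
    return {key: (total / n, males) for key, (total, n, males) in acc.items()}


print(aggregate_single_pass(rows))
```

Both functions return the same mapping, but the second reads the data once instead of 24 times, which is the point of replacing the loop with one groupby.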
To save as a single file, these are the options:
Option 1 :
coalesce(1) (minimum shuffle of data over the network), repartition(1), or collect may work for small datasets, but with large datasets they may not perform as expected, since all data will be moved to one partition on one node.
Option 1 would be fine if a single executor has more RAM available than the driver.
Option 2 :
Another option would be FileUtil.copyMerge(), which merges the outputs into a single file, as in the code snippet below.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}
Option 3 :
After getting the part files, you can use the hadoop fs -getmerge command like this:
hadoop fs -getmerge /tmp/demo.csv /localmachine/tmp/demo.csv
Now you have to decide, based on your requirements, which one is safer/faster.
Also, have a look at "Dataframe save after join is creating numerous part files".

Is this a good way to export all entities of a type to a csv file?

I have millions of entities of a particular type that I would like to export to a csv file. The following code writes entities in batches of 1000 to a blob while keeping the blob open and deferring the next batch to the task queue. When there are no more entities to be fetched, the blob is finalized. This seems to work for most of my local testing, but I wanted to know:
If I am missing any gotchas or corner cases before running it on my production data and incurring $s for datastore reads.
If the deadline is exceeded or the memory runs out while the batch is being written to the blob, this code defaults to the start of the current batch when running the task again, which may cause a lot of duplication. Any suggestions to fix that?
def entities_to_csv(entity_type,blob_file_name='',cursor='',batch_size=1000):
    more = True
    next_curs = None
    q = entity_type.query()
    results,next_curs,more = q.fetch_page(batch_size,start_cursor=Cursor.from_websafe_string(cursor))
    if results:
        try:
            if not blob_file_name:
                blob_file_name = files.blobstore.create(mime_type='text/csv',_blob_uploaded_filename='%s.csv' % entity_type.__name__)
            rows = [e.to_dict() for e in results]
            with files.open(blob_file_name, 'a') as f:
                writer = csv.DictWriter(f,restval='',extrasaction='ignore',fieldnames=results[0].keys())
                writer.writerows(rows)
            if more:
                deferred.defer(entity_type,blob_file_name,next_curs.to_websafe_string())
            else:
                files.finalize(blob_file_name)
        except DeadlineExceededError:
            deferred.defer(entity_type,blob_file_name,cursor)
Later in the code, something like:
deferred.defer(entities_to_csv,Song)
The problem with your current solution is that your memory use will increase with every write you perform to the blobstore. The blobstore is immutable and writes all the data at once from memory.
You need to run the job on a backend that can hold all the records in memory: define a backend in your application and call defer with _target='<backend name>'.
Check out this Google I/O video, pretty much describes what you want to do using MapReduce, starting at around the 23:15 mark in the video. The code you want is at 27:19
https://developers.google.com/events/io/sessions/gooio2012/307/
