Trying to split a large TSV file on S3 w/ Lambda - python

The Goal: I have a data-generating process which creates one large TSV file on S3 (somewhere between 30 and 40 GB in size). Because of some data processing I want to do on it, it's easier to have it in many smaller files (~1 GB in size or smaller). Unfortunately I don't have much ability to change the original data-generating process to partition out the files at creation, so I'm trying to create a simple Lambda to do it for me; my attempt is below.
import json
import boto3
import codecs

def lambda_handler(event, context):
    s3 = boto3.client('s3')

    read_bucket_name = 'some-bucket'
    write_bucket_name = 'some-other-bucket'
    original_key = 'some-s3-file-key'

    obj = s3.get_object(Bucket=read_bucket_name, Key=original_key)

    lines = []
    line_count = 0
    file_count = 0
    MAX_LINE_COUNT = 500000

    def create_split_name(file_count):
        return f'{original_key}-{file_count}'

    def create_body(lines):
        return ''.join(lines)

    for ln in codecs.getreader('utf-8')(obj['Body']):
        if line_count > MAX_LINE_COUNT:
            key = create_split_name(file_count)
            s3.put_object(
                Bucket=write_bucket_name,
                Key=key,
                Body=create_body(lines)
            )
            lines = []
            line_count = 0
            file_count += 1

        lines.append(ln)
        line_count += 1

    if len(lines) > 0:
        file_count += 1
        key = create_split_name(file_count)
        s3.put_object(
            Bucket=write_bucket_name,
            Key=key,
            Body=create_body(lines)
        )

    return {
        'statusCode': 200,
        'body': {'file_count': file_count}
    }
This functionally works, which is great, but the problem is that on sufficiently large files it can't finish within the 15-minute run window of an AWS Lambda. So my questions are these:
Can this code be optimized in any appreciable way to reduce run time (I'm not an expert on profiling lambda code)?
Will porting this to a compiled language provide any real benefit to run time?
Are there other utilities within AWS that can solve this problem? (A quick note here, I know that I could spin up an EC2 server to do this for me but ideally I'm trying to find a serverless solution)
UPDATE: Another option I have tried is to not split up the file, but to tell different Lambda jobs to simply read different parts of the same file using Range.
I can try to read a file by doing
obj = s3.get_object(Bucket='cradle-smorgasbord-drop', Key=key, Range=bytes_range)
lines = [line for line in codecs.getreader('utf-8')(obj['Body'])]
However, on an approximately 30 GB file I used bytes_range=0-49999999, which is only the first 50 MB, and the download is taking far longer than I would expect for that amount of data (I actually haven't seen it finish yet).

To avoid hitting the limit of 15 minutes for the execution of AWS Lambda functions you have to ensure that you only read as much data from S3 as you can process in 15 minutes or less.
How much data from S3 you can process in 15 minutes or less depends on your function logic and the CPU and network performance of the AWS Lambda function. The available CPU performance of AWS Lambda functions scales with memory provided to the AWS Lambda function. From the AWS Lambda documentation:
Lambda allocates CPU power linearly in proportion to the amount of memory configured. At 1,792 MB, a function has the equivalent of one full vCPU (one vCPU-second of credits per second).
So as a first step you could try to increase the provided memory to see if that improves the amount of data your function can process in 15 minutes.
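For illustration, here is a minimal sketch (the function name is a placeholder) of bumping the memory setting with boto3; the same change can of course be made in the console or via the CLI:

import boto3

lambda_client = boto3.client('lambda')

# More memory also means proportionally more CPU for the function.
lambda_client.update_function_configuration(
    FunctionName='tsv-splitter',   # placeholder function name
    MemorySize=3008,               # illustrative value, in MB
)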
Increasing the CPU performance for AWS Lambda functions might already solve your problem for now, but it doesn't scale well in case you have to process larger files in future.
Fortunately there is a solution for that: When reading objects from S3 you don't have to read the whole object at once, but you can use range requests to only read a part of the object. To do that all you have to do is to specify the range you want to read when calling get_object(). From the boto3 documentation for get_object():
Range (string) -- Downloads the specified range bytes of an object. For more information about the HTTP Range header, go to http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.
In your case instead of triggering your AWS Lambda function once per object in S3 to process, you'd trigger it multiple times for the same objects, but to process different chunks of that object. Depending on how you invoke your function you might need another AWS Lambda function to examine the size of the objects in S3 to process (using head_object()) and trigger your actual Lambda function once for each chunk of data.
While you need that additional chunking logic, you wouldn't need to split the read data in your original AWS Lambda function anymore, as you could simply ensure that each chunk has a size of 1GB and only the data belonging to the chunk is read thanks to the range request. As you'd invoke a separate AWS Lambda function for each chunk you'd also parallelize your currently sequential logic, resulting in faster execution.
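For illustration, here is a minimal sketch of such a chunking dispatcher, assuming a worker Lambda named 'split-worker' and placeholder bucket/key names; it uses head_object() to get the object size and then asynchronously invokes the worker once per roughly 1 GB byte range:

import json
import boto3

s3 = boto3.client('s3')
lambda_client = boto3.client('lambda')

CHUNK_SIZE = 1024 ** 3  # roughly 1 GB per worker invocation

def dispatch_handler(event, context):
    bucket = 'some-bucket'       # placeholder
    key = 'some-s3-file-key'     # placeholder
    size = s3.head_object(Bucket=bucket, Key=key)['ContentLength']

    start = 0
    while start < size:
        end = min(start + CHUNK_SIZE, size) - 1
        lambda_client.invoke(
            FunctionName='split-worker',   # placeholder worker Lambda
            InvocationType='Event',        # asynchronous invocation
            Payload=json.dumps({'bucket': bucket, 'key': key,
                                'range': 'bytes=%d-%d' % (start, end)}),
        )
        start = end + 1

    return {'statusCode': 200}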
Finally you could drastically decrease the amount of memory consumed by your AWS Lambda function, by not reading the whole data into memory, but using streaming instead.
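As a rough sketch of what the streaming side could look like (placeholder names again, and note that fixed byte ranges can cut a line in half at chunk boundaries, which this sketch ignores), the worker Lambda can pass the ranged response body straight to upload_fileobj(), which streams it to the destination bucket in parts instead of holding the whole chunk in memory:

import boto3

s3 = boto3.client('s3')

def worker_handler(event, context):
    # event carries 'bucket', 'key' and 'range' from the dispatcher sketch above
    obj = s3.get_object(Bucket=event['bucket'], Key=event['key'],
                        Range=event['range'])

    # StreamingBody is file-like, so it can be streamed to S3 in multipart
    # chunks without ever materializing the full ~1 GB range in memory.
    s3.upload_fileobj(obj['Body'],
                      'some-other-bucket',                       # placeholder
                      '%s-%s' % (event['key'], event['range']))

    return {'statusCode': 200}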

Related

Spark Repartition Issue

Good day everyone,
I'm working on a project where I'm running an ETL process over millions of data records with the aid of Spark (2.4.4) and PySpark.
We're fetching huge compressed CSV files from an S3 bucket in AWS, converting them into Spark DataFrames with the repartition() method, and converting each piece to Parquet to lighten and speed up the process:
for file in files:
    if not self.__exists_parquet_in_s3(self.config['aws.output.folder'] + '/' + file, '.parquet'):
        # Run the parquet converter
        print('**** Processing %s ****' % file)
        # TODO: number of repartition variable
        df = SparkUtils.get_df_from_s3(self.spark_session, file, self.config['aws.bucket']).repartition(94)
        s3folderpath = 's3a://' + self.config['aws.bucket'] + \
                       '/' + self.config['aws.output.folder'] + \
                       '/%s' % file + '/'
        print('Writing down process')
        df.write.format('parquet').mode('append').save(
            '%s' % s3folderpath)
        print('**** Saving %s completed ****' % file)
        df.unpersist()
    else:
        print('Parquet files already exist!')
So as a first step, this piece of code checks inside the S3 bucket whether the Parquet files already exist; if not, it enters the for loop and runs all the transformations.
Now, let's get to the point. I have this pipeline which works fine with every CSV file except one, which is identical to the others except for being much heavier, also after the repartition and conversion to Parquet (29 MB x 94 parts vs 900 kB x 32 parts).
This is causing a bottleneck after some time during the process (which is divided into identical cycles, where the number of cycles is equal to the number of repartitions made), raising a Java heap space issue after several warnings like:
WARN TaskSetManager: Stage X contains a task of very large size (x KB). The maximum recommended size is 100 KB.
The most logical solution would be to further increase the repartition parameter to lower the weight of each Parquet file, BUT it does not allow me to create more than 94 partitions; after some time during the for loop mentioned above, it raises this error:
ERROR FileFormatWriter: Aborting job 8fc9c89f-dccd-400c-af6f-dfb312df0c72.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: HGC6JTRN5VT5ERRR, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method., S3 Extended Request ID: 7VBu4mUEmiAWkjLoRPruTiCY3IhpK40t+lg77HDNC0lTtc8h2Zi1K1XGSnJhjwnbLagN/kS+TpQ=
Or also a second type of issue, indicated by a warning (shown in a screenshot in the original question).
What I noticed is that I can under-partition relative to the original value: I can use 16 as the parameter instead of 94 and it will run fine, but if I increase it above 94, the original value, it won't work.
Remember that this pipeline works perfectly through to the end with the other (lighter) CSV files; the only variable here seems to be the input file (its size in particular), which seems to make it stop after some time. If you need any other details please let me know; I'd be extremely glad if you could help me with this. Thank you everyone in advance.
I'm not sure what the logic in your SparkUtils is, but based on the code and log you provided, it doesn't look related to your resources or partitioning; it may be caused by the connection between your Spark application and S3:
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403
A 403 means your credentials don't have access to the bucket/file you are trying to read or write. The Hadoop documentation on S3A authentication lists several cases that can cause this error: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#Authentication_Failure. Since you see this error during the loop but not at the beginning of your job, check the running time of your Spark job, as well as the IAM and session authentication, as it may be caused by session expiration (default 1 hour, depending on how your DevOps team set it up); for details see: https://docs.aws.amazon.com/singlesignon/latest/userguide/howtosessionduration.html.
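If session expiration does turn out to be the cause, one option (a sketch only, with placeholder names, and assuming the job runs on EMR/EC2 with an instance profile attached) is to point the S3A connector at the instance-profile credentials, which are refreshed automatically, instead of a short-lived session token:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('etl-parquet-conversion')   # placeholder app name
    .config('spark.hadoop.fs.s3a.aws.credentials.provider',
            'com.amazonaws.auth.InstanceProfileCredentialsProvider')
    .getOrCreate()
)

# With auto-refreshing instance-profile credentials, a long loop over many
# files is less likely to hit SignatureDoesNotMatch / 403 errors mid-job.
df = spark.read.csv('s3a://some-bucket/some-huge-file.csv', header=True)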

Saving to the same parquet file in parallel using dask leading to ArrowInvalid

I am doing some simulations where I compute some quantities for several time steps. For each time step I want to save a Parquet file where each line corresponds to a simulation. It looks like this:
def simulation():
    nsim = 3
    timesteps = [1, 2]
    data = {}  # initialization not shown here
    for i in range(nsim):
        compute_stuff()
        for j in timesteps:
            data[str(j)] = compute_some_other_stuff()
    return data
Once I have my dict data containing the results of my simulation (as NumPy arrays), I transform it into dask DataFrame objects and then save them to file using the .to_parquet() method, as follows:
import pandas as pd
from dask.dataframe import from_pandas

def save(data):
    for i in data.keys():
        data[i] = pd.DataFrame(data[i], bins=...)  # elided arguments kept as in the question
        df = from_pandas(data[i], npartitions=2)
        df.to_parquet(datafolder + i + "/", engine="pyarrow", append=True, ignore_divisions=True)
When I use this code only once it works perfectly; the struggle arises when I try to run it in parallel. Using dask I do:
from dask import delayed, compute
from dask.distributed import Client

client = Client(n_workers=10, processes=True)

def f(n):  # takes the index argument seen in the error log below
    data = simulation()
    save(data)

to_compute = [delayed(f)(n) for n in range(20)]
compute(to_compute)
The behaviour of this last portion of code is quite random. At some point this happens:
distributed.worker - WARNING - Compute Failed
Function: f
args: (4)
kwargs: {}
Exception: "ArrowInvalid('Parquet file size is 0 bytes')"
....
distributed.worker - WARNING - Compute Failed
Function: f
args: (12)
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
I think these errors are due to the fact that two processes try to write to the same Parquet file at the same time, and this is not handled well (as it can be with a text file). I already tried switching to PySpark / Koalas without much success. Is there a better way to save the results along a simulation (in case of a crash / wall time on a cluster)?
You are making a classic dask mistake of invoking the dask API from within functions that are themselves delayed. The error indicates that things are happening in parallel (which is what dask does!) which are not expected to change during processing. Specifically, a file is clearly being edited by one task while another one is reading it (not sure which).
What you probably want to do is use concat on the dataframe pieces and then make a single call to to_parquet.
Note that it seems all of your data is actually held in the client, and you are using from_pandas. This seems like a bad idea, since you are missing out on one of dask's biggest features: only loading data when needed. You should, instead, load your data inside delayed functions or dask dataframe API calls.
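A minimal sketch of that pattern, with a hypothetical simulate_one() standing in for the simulation code: each delayed task returns a pandas DataFrame, dask assembles the pieces into one dask DataFrame, and a single to_parquet() call performs the write, so no two tasks ever touch the same file.

import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=10, processes=True)

@dask.delayed
def simulate_one(n):
    # hypothetical stand-in: run one simulation and return its rows
    return pd.DataFrame({'sim_id': [n], 'value': [n * 0.5]})

parts = [simulate_one(n) for n in range(20)]
ddf = dd.from_delayed(parts)                   # one dask DataFrame from all pieces
ddf.to_parquet('results/', engine='pyarrow')   # single, coordinated Parquet write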

Spark coalesce vs collect, which one is faster?

I am using PySpark to process 50 GB of data using AWS EMR with ~15 m4.large cores.
Each row of the data contains some information at a specific time on a day. I am using the following for loop to extract and aggregate information for every hour. Finally I union the data, as I want my result saved in one CSV file.
# daily_df is an empty pyspark DataFrame
for hour in range(24):
    hourly_df = df.filter(hourFilter("Time")).groupby("Animal").agg(mean("weights"), sum("is_male"))
    daily_df = daily_df.union(hourly_df)
To my knowledge, I have to perform the following to force the pyspark.sql.DataFrame object to save to one CSV file (approx 1 MB) instead of 100+ files:
daily_df.coalesce(1).write.csv("some_local.csv")
It took about 70 minutes to finish this process, and I am wondering if I can make it faster by using the collect() method, like this:
daily_df_pandas = daily_df.collect()
daily_df_pandas.to_csv("some_local.csv")
Both coalesce(1) and collect are pretty bad in general but with expected output size around 1MB it doesn't really matter. It simply shouldn't be a bottleneck here.
One simple improvement is to drop loop -> filter -> union and perform a single aggregation:
df.groupby(hour("Time"), col("Animal")).agg(mean("weights"), sum("is_male"))
If that's not enough then most likely the issue here is configuration (the good place to start could be adjusting spark.sql.shuffle.partitions if you don't do that already).
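For reference, a minimal sketch of the single-aggregation approach with the necessary imports (column names come from the question; the shuffle-partition value is just an illustrative starting point, not a recommendation):

from pyspark.sql.functions import hour, col, mean, sum as sum_

spark.conf.set('spark.sql.shuffle.partitions', '200')   # illustrative value, tune for your cluster

daily_df = (
    df.groupBy(hour('Time').alias('hour'), col('Animal'))
      .agg(mean('weights').alias('avg_weight'), sum_('is_male').alias('male_count'))
)
daily_df.coalesce(1).write.csv('some_local.csv', header=True)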
To save as a single file, these are the options:
Option 1:
coalesce(1) (minimum data shuffled over the network) or repartition(1) or collect may work for small datasets, but for large datasets it may not perform as expected, since all the data will be moved to one partition on one node.
Option 1 would be fine if a single executor has more RAM available than the driver.
Option 2:
Another option would be FileUtil.copyMerge(), to merge the outputs into a single file as in the code snippet below.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._

def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}
Option 3:
After getting the part files you can use the HDFS getmerge command, like this:
hadoop fs -getmerge /tmp/demo.csv /localmachine/tmp/demo.csv
Now you have to decide, based on your requirements, which one is safer/faster.
Also, you can have a look at "Dataframe save after join is creating numerous part files".

Why does this usage of py-leveldb's WriteBatch cause a memory leak?

So I'm writing a Python script for indexing the Bitcoin blockchain by addresses, using a leveldb database (py-leveldb), and it keeps eating more and more memory until it crashes. I've replicated the behaviour in the code example below. When I run the code it continues to use more and more memory until it's exhausted the available RAM on my system and the process is either killed or throws "std::bad_alloc".
Am I doing something wrong? I keep writing to the batch object and commit it every once in a while, but the memory usage keeps increasing even though I commit the data in the WriteBatch object. I even delete the WriteBatch object after committing it, so as far as I can see it can't be this that is causing the memory leak.
Is my code using WriteBatch in a wrong way or is there a memory leak in py-leveldb?
The code requires py-leveldb to run, get it from here: https://pypi.python.org/pypi/leveldb
WARNING: RUNNING THIS CODE WILL EXHAUST YOUR MEMORY IF IT RUNS LONG ENOUGH. DO NOT RUN IT ON A CRITICAL SYSTEM. Also, it will write the data to a folder in the same folder as the script runs in; on my system this folder contains about 1.5 GB worth of database files before memory is exhausted (it ends up consuming over 3 GB of RAM).
Here's the code:
import leveldb, random, string

RANDOM_DB_NAME = "db-DetmREnTrKjd"
KEYLEN = 10
VALLEN = 30
num_keys = 1000
iterations = 100000000
commit_every = 1000000

leveldb.DestroyDB(RANDOM_DB_NAME)
db = leveldb.LevelDB(RANDOM_DB_NAME)
batch = leveldb.WriteBatch()

# generate a random list of keys to be used
key_list = [''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(KEYLEN)) for i in range(0, num_keys)]

for k in xrange(iterations):
    # select a random key from the key list
    key_index = random.randrange(0, 1000)
    key = key_list[key_index]
    try:
        prev_val = db.Get(key)
    except KeyError:
        prev_val = ""
    random_val = ''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(VALLEN))
    # write the current random value plus any value that might already be there
    batch.Put(key, prev_val + random_val)
    if k % commit_every == 0:
        print "Comitting batch %d/%d..." % (k / commit_every, iterations / commit_every)
        db.Write(batch, sync=True)
        del batch
        batch = leveldb.WriteBatch()

db.Write(batch, sync=True)
You should really try Plyvel instead. See https://plyvel.readthedocs.org/. It has way cleaner code, more features, more speed, and a lot more tests. I've used it for bulk writing to quite large databases (20+ GB) without any issues.
(Full disclosure: I'm the author.)
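For instance, a minimal sketch of the same write-batch pattern in Plyvel (the database path and values are placeholders):

import plyvel

db = plyvel.DB('db-plyvel-test', create_if_missing=True)

# write_batch() used as a context manager writes everything when the block
# exits, so only one batch's worth of data is buffered at a time.
with db.write_batch() as wb:
    for i in range(1000):
        wb.put(('key-%d' % i).encode(), b'some value')

db.close()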
I use http://code.google.com/p/leveldb-py/
I don't have enough information to participate in a Python leveldb driver bake-off, but I love the simplicity of leveldb-py. It is a single Python file using ctypes. I've used it to store documents across about 3 million keys, storing about 10 GB, and never noticed memory problems.
To your actual problem:
You may try working with the batch size.
Your code, using leveldb-py and doing a put for every key, worked fine on my system using less than 20MB of memory.
I take from here (http://ayende.com/blog/161412/reviewing-leveldb-part-iii-writebatch-isnt-what-you-think-it-is) that there are quite a few memory copies going on under the hood in leveldb.
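As a minimal sketch of the batch-size idea, reusing only the py-leveldb calls that appear in the question (the database name, key/value sizes and iteration counts are placeholders): committing every 1,000 puts instead of every 1,000,000 keeps the in-memory WriteBatch small.

import leveldb, random, string

db = leveldb.LevelDB('db-batch-size-test')
batch = leveldb.WriteBatch()
commit_every = 1000   # much smaller batches than in the original code

for k in xrange(100000):
    key = ''.join(random.choice(string.ascii_uppercase) for _ in range(10))
    val = ''.join(random.choice(string.ascii_lowercase) for _ in range(30))
    batch.Put(key, val)
    if k % commit_every == 0:
        db.Write(batch, sync=True)
        batch = leveldb.WriteBatch()

db.Write(batch, sync=True)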

Is this a good way to export all entities of a type to a csv file?

I have millions of entities of a particular type that I would like to export to a CSV file. The following code writes entities in batches of 1000 to a blob while keeping the blob open and deferring the next batch to the task queue. When there are no more entities to be fetched, the blob is finalized. This seems to work for most of my local testing, but I wanted to know:
Whether I am missing any gotchas or corner cases before running it on my production data and incurring $s for datastore reads.
If the deadline is exceeded or the memory runs out while a batch is being written to the blob, this code defaults to the start of the current batch when running the task again, which may cause a lot of duplication. Any suggestions to fix that?
def entities_to_csv(entity_type, blob_file_name='', cursor='', batch_size=1000):
    more = True
    next_curs = None
    q = entity_type.query()
    results, next_curs, more = q.fetch_page(batch_size, start_cursor=Cursor.from_websafe_string(cursor))
    if results:
        try:
            if not blob_file_name:
                blob_file_name = files.blobstore.create(mime_type='text/csv', _blob_uploaded_filename='%s.csv' % entity_type.__name__)
            rows = [e.to_dict() for e in results]
            with files.open(blob_file_name, 'a') as f:
                writer = csv.DictWriter(f, restval='', extrasaction='ignore', fieldnames=results[0].keys())
                writer.writerows(rows)
            if more:
                deferred.defer(entities_to_csv, entity_type, blob_file_name, next_curs.to_websafe_string())
            else:
                files.finalize(blob_file_name)
        except DeadlineExceededError:
            deferred.defer(entities_to_csv, entity_type, blob_file_name, cursor)
Later in the code, something like:
deferred.defer(entities_to_csv,Song)
The problem with your current solution is that your memory usage will increase with every write you perform to the blobstore. The blobstore is immutable and writes all the data at once from memory.
You need to run the job on a backend that can hold all the records in memory: define a backend in your application and call defer with _target='<backend name>', as sketched below.
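A minimal sketch of that call, assuming a backend named 'csv-export' has been declared in backends.yaml (the name is a placeholder):

from google.appengine.ext import deferred

# _target routes the deferred task to the named backend instance
deferred.defer(entities_to_csv, Song, _target='csv-export')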
Check out this Google I/O video; it pretty much describes what you want to do using MapReduce, starting at around the 23:15 mark in the video. The code you want is at 27:19:
https://developers.google.com/events/io/sessions/gooio2012/307/
