I have a bunch of CSV files compressed as one zip on S3. I only need to process one CSV file inside the zip using an AWS Lambda function.
import boto3
from io import BytesIO
from zipfile import ZipFile

BUCKET = 'my-bucket'
s3_rsc = boto3.resource('s3')

def zip_stream(zip_f='app.zip', bkt=BUCKET, rsc=s3_rsc):
    obj = rsc.Object(
        bucket_name=bkt,
        key=zip_f
    )
    return ZipFile(BytesIO(obj.get()['Body'].read()))

zip_obj = zip_stream()
csv_dat = zip_obj.read('one.csv')
The above snippet works well with small test zip files; however, it fails with a MemoryError once the zip file size exceeds 0.5 GB.
Error Message
{ "errorMessage": "", "errorType": "MemoryError", "stackTrace":
[
" File "/var/task/lambda_function.py", line 12, in handler\n all_files = files_in_zip()\n",
" File "/var/task/lambda_function.py", line 36, in files_in_zip\n zippo = zip_stream()\n",
" File "/var/task/lambda_function.py", line 32, in zip_stream\n return ZipFile(BytesIO(obj.get()['Body'].read()))\n",
" File "/var/runtime/botocore/response.py", line 77, in read\n chunk = self._raw_stream.read(amt)\n",
" File "/var/runtime/urllib3/response.py", line 515, in read\n data = self._fp.read() if not fp_closed else b""\n",
" File "/var/lang/lib/python3.8/http/client.py", line 468, in read\n s = self._safe_read(self.length)\n",
" File "/var/lang/lib/python3.8/http/client.py", line 609, in _safe_read\n data = self.fp.read(amt)\n" ] }
Is there an option to stream/lazyload the zipfile to mitigate memory issues?
Note - I also referred to an older post (How can I use boto to stream a file out of Amazon S3 to Rackspace Cloudfiles?) which spoke about streaming a file, but not a zip.
Depending on your exact needs, you can use smart-open to handle the reading of the zip file. If you can fit the CSV data in RAM in your Lambda, it's fairly straightforward to call it directly:
import csv
import zipfile
from io import TextIOWrapper, BytesIO
from smart_open import smart_open

def lambda_handler(event, context):
    # Simple test: just calculate the sum of the first column of a CSV file in a zip file
    total_sum, row_count = 0, 0
    # Use smart_open to handle the byte range requests for us
    with smart_open("s3://example-bucket/many_csvs.zip", "rb") as f:
        # Wrap that in a zip file handler
        zip = zipfile.ZipFile(f)
        # Open a specific CSV file in the zip file
        zf = zip.open("data_101.csv")
        # Read all of the data into memory, and prepare a text IO wrapper to read it row by row
        text = TextIOWrapper(BytesIO(zf.read()))
        # And finally, use Python's csv library to parse the CSV format
        cr = csv.reader(text)
        # Skip the header row
        next(cr)
        # Just loop through each row and add the first column
        for row in cr:
            total_sum += int(row[0])
            row_count += 1
        # And output the results
        print(f"Sum of {row_count} rows for col 0: {total_sum}")
I tested this with a 1 GB zip file containing hundreds of CSV files. The CSV file I picked was around 12 MB uncompressed, or 100,000 rows, so it fit comfortably into RAM in the Lambda environment, even when limited to 128 MB of RAM.
If your CSV file can't be loaded into memory at once like this, you'll need to take care to load it in sections, buffering the reads so you don't waste time reading it line by line and forcing smart-open to fetch many tiny chunks at a time; a sketch of that buffered variant follows.
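A minimal sketch of that buffered variant, reusing the example bucket, member name, and first-column sum from the code above (so the same assumptions apply): instead of reading the whole member with zf.read(), the zip member is wrapped directly so rows are decoded as they stream through.

import csv
import zipfile
from io import TextIOWrapper
from smart_open import smart_open

total_sum, row_count = 0, 0
with smart_open("s3://example-bucket/many_csvs.zip", "rb") as f:
    zf = zipfile.ZipFile(f).open("data_101.csv")
    # ZipExtFile is a buffered binary stream, so TextIOWrapper can decode it lazily,
    # row by row, instead of holding the whole uncompressed CSV in memory
    cr = csv.reader(TextIOWrapper(zf, encoding="utf-8"))
    next(cr)  # skip the header row
    for row in cr:
        total_sum += int(row[0])
        row_count += 1
print(f"Sum of {row_count} rows for col 0: {total_sum}")

This keeps only a row's worth of text in memory at a time, at the cost of more, smaller range requests against S3, which is the trade-off mentioned above.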
Related
There is a problem with my code that I have not been able to solve for a while now.
I'm trying to convert a tar.gz-compressed CSV file to Parquet. The file itself, when uncompressed, is about 700 MB. The processing is done on a memory-restricted system, so I have to process the file in batches.
I figured out how to read the tar.gz as a stream, extract the file I need and use pyarrow's open_csv() to read batches. From here, I want to save the data to a parquet file by writing in batches.
This is where the problem appears. The file has lots of columns that don't have any values. But once in a while a single value appears around line 500,000 or so, so pyarrow does not infer the dtype properly. Most of these columns therefore end up with dtype null. My idea is to modify the schema and cast these columns to string, so any value is valid. Modifying the schema works fine, but when I run the code, I get this error.
Traceback (most recent call last):
File "b:\snippets\tar_converter.py", line 38, in <module>
batch = reader.read_next_batch()
File "pyarrow\ipc.pxi", line 682, in pyarrow.lib.RecordBatchReader.read_next_batch
File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: In CSV column #49: CSV conversion error to null: invalid value '0.0000'
Line 38 is this one:
batch = reader.read_next_batch()
Does anyone have any idea how to enforce the schema on the batches so the conversion succeeds?
Here is my code.
import io
import os
import tarfile
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv as csv
import logging

srcs = list()
path = "C:\\data"
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith("tar.gz"):
            srcs.append(os.path.join(root, name))

for source_file_name in srcs:
    file_name: str = source_file_name.replace(".tar.gz", "")
    target_file_name: str = source_file_name.replace(".tar.gz", ".parquet")
    clean_file_name: str = os.path.basename(source_file_name.replace(".tar.gz", ""))

    # download CSV file, preserving folder structure
    logging.info(f"Processing '{source_file_name}'.")
    with io.open(source_file_name, "rb") as file_obj_in:
        # unpack all files to temp_path
        file_obj_in.seek(0)
        with tarfile.open(fileobj=file_obj_in, mode="r") as tf:
            file_obj = tf.extractfile(f"{clean_file_name}.csv")
            file_obj.seek(0)
            reader = csv.open_csv(file_obj, read_options=csv.ReadOptions(block_size=25*1024*1024))
            schema = reader.schema
            null_cols = list()
            for index, entry in enumerate(schema.types):
                if entry.equals(pa.null()):
                    schema = schema.set(index, schema.field(index).with_type(pa.string()))
                    null_cols.append(index)
            with pq.ParquetWriter(target_file_name, schema) as writer:
                while True:
                    try:
                        batch = reader.read_next_batch()
                        table = pa.Table.from_batches(batches=[batch]).cast(target_schema=schema)
                        batch = table.to_batches()[0]
                        writer.write_batch(batch)
                    except StopIteration:
                        break
Also, I could leave out the cast from this part:
batch = reader.read_next_batch()
table = pa.Table.from_batches(batches=[batch]).cast(target_schema=schema)
batch = table.to_batches()[0]
But then the error is like this (shortened), showing that the schema change works at least.
Traceback (most recent call last):
File "b:\snippets\tar_converter.py", line 39, in <module>
writer.write_batch(batch)
File "C:\Users\me\AppData\Roaming\Python\Python39\site-packages\pyarrow\parquet\__init__.py", line 981, in write_batch
self.write_table(table, row_group_size)
File "C:\Users\me\AppData\Roaming\Python\Python39\site-packages\pyarrow\parquet\__init__.py", line 1004, in write_table
raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
ACCOUNT_NAME: string
BOOK_VALUE: double
ESTIMATED_TO_REALISE: double
VAT_PAYABLE_ID: null
VAT_RECEIVABLE_ID: null
MONTHLY_AMOUNT_EFFECTIVE_DATE: null
vs.
file:
ACCOUNT_NAME: string
BOOK_VALUE: double
ESTIMATED_TO_REALISE: double
VAT_PAYABLE_ID: string
VAT_RECEIVABLE_ID: string
MONTHLY_AMOUNT_EFFECTIVE_DATE: string
Thank you!
So I think I figured it out. Wanted to post it for those who have similar issues.
Also, thanks to all who had a look and helped!
I solved this with a workaround: reading the file twice.
In the first pass I only open the stream to get the schema. Then I convert the null columns to string and close the stream (this is important if you reuse the same variable name). After that you read the file again, but now pass the modified schema to the reader via ConvertOptions (column_types). Thanks to @0x26res, whose comment gave me the idea.
# get initial schema by reading one batch
initial_reader = csv.open_csv(file_obj, read_options=csv.ReadOptions(block_size=16*1024*1024))
schema = initial_reader.schema
for index, entry in enumerate(schema.types):
    if entry.equals(pa.null()):
        schema = schema.set(index, schema.field(index).with_type(pa.string()))

# now use the modified schema for the reader
# must close the old stream first, otherwise wrong data is loaded
file_obj.close()
file_obj = tf.extractfile(f"{file_name}.csv")
file_obj.seek(0)
reader = csv.open_csv(file_obj,
                      read_options=csv.ReadOptions(block_size=16*1024*1024),
                      convert_options=csv.ConvertOptions(column_types=schema))
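For completeness, a minimal sketch of how the write loop from the question simplifies once the schema is enforced at read time; this reuses the names pq, target_file_name, and reader from the code above, so the same assumptions apply.

# Sketch: with column_types enforced via ConvertOptions, the batches already carry the
# desired schema, so the intermediate Table cast from the question is no longer needed.
with pq.ParquetWriter(target_file_name, reader.schema) as writer:
    while True:
        try:
            writer.write_batch(reader.read_next_batch())
        except StopIteration:
            break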
I'm having difficulty with the following code (which is simplified from a larger application I'm working on in Python).
from io import StringIO
import gzip

jsonString = 'JSON encoded string here created by a previous process in the application'

out = StringIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
    f.write(str.encode(jsonString))

# Write the file once finished rather than streaming it - uncomment the next line to see file locally.
with open("out_" + currenttimestamp + ".json.gz", "a", encoding="utf-8") as f:
    f.write(out.getvalue())
When this runs I get the following error:
File "d:\Development\AWS\TwitterCompetitionsStreaming.py", line 61, in on_status
with gzip.GzipFile(fileobj=out, mode="w") as f:
File "C:\Python38\lib\gzip.py", line 204, in __init__
self._write_gzip_header(compresslevel)
File "C:\Python38\lib\gzip.py", line 232, in _write_gzip_header
self.fileobj.write(b'\037\213') # magic header
TypeError: string argument expected, got 'bytes'
What I want to do is create a JSON file and gzip it in memory before saving the gzipped file to the filesystem (Windows). I know I've gone about this the wrong way and could do with a pointer. Many thanks in advance.
You have to use bytes everywhere when working with gzip, instead of strings and text. First, use BytesIO instead of StringIO. Second, the mode should be 'wb' for bytes instead of 'w' (the latter is for text), and similarly 'ab' instead of 'a' when appending; the 'b' character means "bytes". Full corrected code below:
from io import BytesIO
import gzip

jsonString = 'JSON encoded string here created by a previous process in the application'

out = BytesIO()
with gzip.GzipFile(fileobj=out, mode='wb') as f:
    f.write(str.encode(jsonString))

currenttimestamp = '2021-01-29'

# Write the file once finished rather than streaming it - uncomment the next line to see file locally.
with open("out_" + currenttimestamp + ".json.gz", "wb") as f:
    f.write(out.getvalue())
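As a quick sanity check (a sketch, not part of the original answer), decompressing the in-memory buffer should give back the original text, which confirms the bytes round trip:

# Optional verification: decompress the BytesIO buffer and compare with the input string
with gzip.GzipFile(fileobj=BytesIO(out.getvalue()), mode='rb') as f:
    assert f.read().decode('utf-8') == jsonString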
I am trying to download a dataset from https://datasets.imdbws.com/title.principals.tsv.gz, decompress the contents in my code itself (Python) and write the resulting file(s) to disk.
To do so I am using the following code snippet.
results = requests.get(config[sourceFiles]['url'])
with open(config[sourceFiles]['downloadLocation'] + config[sourceFiles]['downloadFileName'], 'wb') as f_out:
    print(config[sourceFiles]['downloadFileName'] + " starting download")
    f_out.write(gzip.decompress(results.content))
    print(config[sourceFiles]['downloadFileName'] + " downloaded successfully")
This code works fine for most compressed files; however, for larger files it gives the following error message.
File "C:\Users\****\AppData\Local\Programs\Python\Python37-32\lib\gzip.py", line 532, in decompress
return f.read()
File "C:\Users\****\AppData\Local\Programs\Python\Python37-32\lib\gzip.py", line 276, in read
return self._buffer.read(size)
File "C:\Users\****\AppData\Local\Programs\Python\Python37-32\lib\gzip.py", line 471, in read
uncompress = self._decompressor.decompress(buf, size)
MemoryError
Is there a way to accomplish this without having to download the compressed file to disk first and decompress it from there to get the actual data?
You can use a streaming request coupled with zlib:
import zlib
import requests

url = 'https://datasets.imdbws.com/title.principals.tsv.gz'
result = requests.get(url, stream=True)

f_out = open("result.txt", "wb")
chunk_size = 1024 * 1024
d = zlib.decompressobj(zlib.MAX_WBITS | 32)

for chunk in result.iter_content(chunk_size):
    buffer = d.decompress(chunk)
    f_out.write(buffer)

buffer = d.flush()
f_out.write(buffer)
f_out.close()
This snippet reads the data chunk by chunk and feeds it to zlib which can handle data streams.
Depending on your connection speed and CPU/disk performance, you can test various chunk sizes; a rough way to compare a few is sketched below.
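A hedged sketch of such a comparison, reusing the URL and output file name from the snippet above; the timings will of course vary with your network and hardware, and each pass re-downloads the file.

import time
import zlib
import requests

url = 'https://datasets.imdbws.com/title.principals.tsv.gz'
for chunk_size in (256 * 1024, 1024 * 1024, 4 * 1024 * 1024):
    d = zlib.decompressobj(zlib.MAX_WBITS | 32)  # fresh decompressor per run
    start = time.time()
    with requests.get(url, stream=True) as result, open("result.txt", "wb") as f_out:
        for chunk in result.iter_content(chunk_size):
            f_out.write(d.decompress(chunk))
        f_out.write(d.flush())
    print(f"chunk_size={chunk_size}: {time.time() - start:.1f}s")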
I've got a zip file in an S3 bucket called 'My_Bucket'. The file key is 'MY_FILE.ZIP'.
The extracted file name is 'MY_FILE_FULL_NAME.CSV'.
I would like to get the file from the S3 bucket, extract it and iterate it.
As the job will be done by Lambda function - I would like to extract the file in memory (stream).
I started to write the below:
import zipfile
import boto3
import io

s3 = boto3.resource("s3")
bucket = s3.Bucket('My_Bucket')
obj = bucket.Object('YYY.zip')

with io.BytesIO(obj.get()["Body"].read()) as tf:
    tf.seek(0)
    # How should I continue ???
I need the part from unzipping through opening the file and reading it line by line.
Thanks in advance.
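Not an authoritative answer, just a minimal sketch of one way to continue from that point, assuming the key 'MY_FILE.ZIP' and the member name 'MY_FILE_FULL_NAME.CSV' from the question text, and that the whole archive fits in the Lambda's memory:

import csv
import io
import zipfile
import boto3

s3 = boto3.resource("s3")
obj = s3.Bucket('My_Bucket').Object('MY_FILE.ZIP')

with io.BytesIO(obj.get()["Body"].read()) as tf:
    tf.seek(0)
    with zipfile.ZipFile(tf, mode="r") as zipf:
        # Open the CSV member as a stream and iterate it line by line
        with zipf.open('MY_FILE_FULL_NAME.CSV') as member:
            for row in csv.reader(io.TextIOWrapper(member, encoding="utf-8")):
                print(row)  # replace with the real per-row processing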
I'm trying to import a large JSON file from Amazon S3 into AWS RDS-PostgreSQL using Python, but these errors occurred:
Traceback (most recent call last):
File "my_code.py", line 67, in
file_content = obj['Body'].read().decode('utf-8').splitlines(True)
File "/home/user/asd-to-qwe/fgh-to-hjk/env/local/lib/python3.6/site-packages/botocore/response.py", line 76, in read
chunk = self._raw_stream.read(amt)
File "/home/user/asd-to-qwe/fgh-to-hjk/env/local/lib/python3.6/site-packages/botocore/vendored/requests/packages/urllib3/response.py", line 239, in read
data = self._fp.read()
File "/usr/lib64/python3.6/http/client.py", line 462, in read
s = self._safe_read(self.length)
File "/usr/lib64/python3.6/http/client.py", line 617, in _safe_read
return b"".join(s)
MemoryError
# my_code.py
import sys
import boto3
import psycopg2
import zipfile
import io
import json

s3 = boto3.client('s3', aws_access_key_id=<aws_access_key_id>, aws_secret_access_key=<aws_secret_access_key>)
connection = psycopg2.connect(host=<host>, dbname=<dbname>, user=<user>, password=<password>)
cursor = connection.cursor()

bucket = sys.argv[1]
key = sys.argv[2]
obj = s3.get_object(Bucket=bucket, Key=key)

def insert_query(data):
    query = """
        INSERT INTO data_table
        SELECT
            (src.test->>'url')::varchar, (src.test->>'id')::bigint,
            (src.test->>'external_id')::bigint, (src.test->>'via')::jsonb
        FROM (SELECT CAST(%s AS JSONB) AS test) src
    """
    cursor.execute(query, (json.dumps(data),))

if key.endswith('.zip'):
    zip_files = obj['Body'].read()
    with io.BytesIO(zip_files) as zf:
        zf.seek(0)
        with zipfile.ZipFile(zf, mode='r') as z:
            for filename in z.namelist():
                with z.open(filename) as f:
                    for line in f:
                        insert_query(json.loads(line.decode('utf-8')))

if key.endswith('.json'):
    file_content = obj['Body'].read().decode('utf-8').splitlines(True)
    for line in file_content:
        insert_query(json.loads(line))

connection.commit()
connection.close()
Are there any solutions to these problems? Any help would do, thank you so much!
Significant savings can be had by not slurping your whole input file into memory as a list of lines.
Specifically, these lines are terrible for memory usage: they involve a peak memory footprint of a bytes object the size of your whole file, plus a list of lines holding the complete contents of the file as well:
file_content = obj['Body'].read().decode('utf-8').splitlines(True)
for line in file_content:
For a 1 GB ASCII text file with 5 million lines, on 64 bit Python 3.3+, that's a peak memory requirement of roughly 2.3 GB for just the bytes object, the list, and the individual strs in the list. A program that needs 2.3x as much RAM as the size of the files it processes won't scale to large files.
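To see where the rough 2.3 GB figure comes from, here is a small back-of-the-envelope sketch (assuming 64-bit CPython with compact ASCII strings; exact per-object overheads vary slightly by version):

import sys

line = "x" * 200                      # a typical ~200-byte ASCII line (1 GB / 5 million lines)
print(sys.getsizeof(line.encode()))   # bytes object: ~33 bytes of overhead + the data
print(sys.getsizeof(line))            # str object:   ~49 bytes of overhead + the data
# Peak: ~1 GB bytes object + ~1 GB of str data + 5M * ~49 B str overhead (~0.24 GB)
# + 5M * 8 B list pointers (~0.04 GB)  ->  roughly 2.3 GB before the bytes object is freed.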
To fix, change that original code to:
file_content = io.TextIOWrapper(obj['Body'], encoding='utf-8')
for line in file_content:
Given that obj['Body'] appears to be usable for lazy streaming, this should remove both copies of the complete file data from memory. Using TextIOWrapper means obj['Body'] is lazily read and decoded in chunks (of a few KB at a time), and the lines are iterated lazily as well; this reduces memory demands to a small, largely fixed amount (the peak memory cost depends on the length of the longest line), regardless of file size.
Update:
It looks like StreamingBody doesn't implement the io.BufferedIOBase ABC. It does have its own documented API though, that can be used for a similar purpose. If you can't make the TextIOWrapper do the work for you (it's much more efficient and simple if it can be made to work), an alternative would be to do:
file_content = (line.decode('utf-8') for line in obj['Body'].iter_lines())
for line in file_content:
Unlike using TextIOWrapper, it doesn't benefit from bulk decoding of blocks (each line is decoded individually), but otherwise it should still achieve the same benefits in terms of reduced memory usage.