Pyarrow/Parquet - Cast all null columns to string during batch processing - python

There is a problem with my code that I have not been able to solve for a while now.
I'm trying to convert a tar.gz-compressed CSV file to Parquet. The file itself, when uncompressed, is about 700 MB. The processing is done on a memory-restricted system, so I have to process the file in batches.
I figured out how to read the tar.gz as a stream, extract the file I need, and use pyarrow's open_csv() to read batches. From there, I want to save the data to a Parquet file by writing in batches.
This is where the problem appears. The file has lots of columns that don't have any values. But once in a while a single value appears, say around line 500,000, so pyarrow does not infer the dtype properly. Most of the columns are therefore of dtype null. My idea is to modify the schema and cast these columns to string, so any value is valid. Modifying the schema works fine, but when I run the code, I get this error:
Traceback (most recent call last):
File "b:\snippets\tar_converter.py", line 38, in <module>
batch = reader.read_next_batch()
File "pyarrow\ipc.pxi", line 682, in pyarrow.lib.RecordBatchReader.read_next_batch
File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: In CSV column #49: CSV conversion error to null: invalid value '0.0000'
Line 38 is this one:
batch = reader.read_next_batch()
Does anyone have any idea how to enforce the schema on the batches so the write succeeds?
Here is my code.
import io
import os
import tarfile
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv as csv
import logging

srcs = list()
path = "C:\\data"
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith("tar.gz"):
            srcs.append(os.path.join(root, name))

for source_file_name in srcs:
    file_name: str = source_file_name.replace(".tar.gz", "")
    target_file_name: str = source_file_name.replace(".tar.gz", ".parquet")
    clean_file_name: str = os.path.basename(source_file_name.replace(".tar.gz", ""))
    # download CSV file, preserving folder structure
    logging.info(f"Processing '{source_file_name}'.")
    with io.open(source_file_name, "rb") as file_obj_in:
        # unpack all files to temp_path
        file_obj_in.seek(0)
        with tarfile.open(fileobj=file_obj_in, mode="r") as tf:
            file_obj = tf.extractfile(f"{clean_file_name}.csv")
            file_obj.seek(0)
            reader = csv.open_csv(file_obj, read_options=csv.ReadOptions(block_size=25*1024*1024))
            schema = reader.schema
            null_cols = list()
            for index, entry in enumerate(schema.types):
                if entry.equals(pa.null()):
                    schema = schema.set(index, schema.field(index).with_type(pa.string()))
                    null_cols.append(index)
            with pq.ParquetWriter(target_file_name, schema) as writer:
                while True:
                    try:
                        batch = reader.read_next_batch()
                        table = pa.Table.from_batches(batches=[batch]).cast(target_schema=schema)
                        batch = table.to_batches()[0]
                        writer.write_batch(batch)
                    except StopIteration:
                        break
Also, I could leave out this part:
batch = reader.read_next_batch()
table = pa.Table.from_batches(batches=[batch]).cast(target_schema=schema)
batch = table.to_batches()[0]
But then the error looks like this (shortened), which at least shows that the schema change works.
Traceback (most recent call last):
File "b:\snippets\tar_converter.py", line 39, in <module>
writer.write_batch(batch)
File "C:\Users\me\AppData\Roaming\Python\Python39\site-packages\pyarrow\parquet\__init__.py", line 981, in write_batch
self.write_table(table, row_group_size)
File "C:\Users\me\AppData\Roaming\Python\Python39\site-packages\pyarrow\parquet\__init__.py", line 1004, in write_table
raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
ACCOUNT_NAME: string
BOOK_VALUE: double
ESTIMATED_TO_REALISE: double
VAT_PAYABLE_ID: null
VAT_RECEIVABLE_ID: null
MONTHLY_AMOUNT_EFFECTIVE_DATE: null
vs.
file:
ACCOUNT_NAME: string
BOOK_VALUE: double
ESTIMATED_TO_REALISE: double
VAT_PAYABLE_ID: string
VAT_RECEIVABLE_ID: string
MONTHLY_AMOUNT_EFFECTIVE_DATE: string
Thank you!

So I think I figured it out. Wanted to post it for those who have similar issues.
Also, thanks to all who had a look and helped!
I worked around this by reading the file twice.
In the first pass I only open the stream to get the schema. Then I convert the null columns to string and close the stream (this is important if you reuse the same variable name). After that, I read the file again, but this time I pass the modified schema to the reader via ConvertOptions(column_types=...). Thanks to #0x26res, whose comment gave me the idea.
# get initial schema by reading one batch
initial_reader = csv.open_csv(file_obj, read_options=csv.ReadOptions(block_size=16*1024*1024))
schema = initial_reader.schema
for index, entry in enumerate(schema.types):
    if entry.equals(pa.null()):
        schema = schema.set(index, schema.field(index).with_type(pa.string()))
# now use the modified schema for the reader
# must close the old reader first, otherwise wrong data is loaded
file_obj.close()
file_obj = tf.extractfile(f"{file_name}.csv")
file_obj.seek(0)
reader = csv.open_csv(file_obj,
                      read_options=csv.ReadOptions(block_size=16*1024*1024),
                      convert_options=csv.ConvertOptions(column_types=schema))
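With that in place, the write loop from the question can stay as it was, minus the cast, since the batches now arrive with the modified schema. A minimal sketch (untested, variable names follow the snippets above):
with pq.ParquetWriter(target_file_name, schema) as writer:
    while True:
        try:
            # batches already match the writer schema, no cast needed
            batch = reader.read_next_batch()
            writer.write_batch(batch)
        except StopIteration:
            break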

Related

Stream huge zip files on S3 using Lambda and boto3

I have a bunch of CSV files compressed as one zip on S3. I only need to process one CSV file inside the zip using an AWS Lambda function:
import boto3
from io import BytesIO
from zipfile import ZipFile

BUCKET = 'my-bucket'
s3_rsc = boto3.resource('s3')

def zip_stream(zip_f='app.zip', bkt=BUCKET, rsc=s3_rsc):
    obj = rsc.Object(
        bucket_name=bkt,
        key=zip_f
    )
    return ZipFile(BytesIO(obj.get()['Body'].read()))

zip_obj = zip_stream()
csv_dat = zip_obj.read('one.csv')
The above snippet works well with test zip files; however, it fails with a MemoryError if the zip file size exceeds 0.5 GB.
Error Message
{ "errorMessage": "", "errorType": "MemoryError", "stackTrace":
[
" File "/var/task/lambda_function.py", line 12, in handler\n all_files = files_in_zip()\n",
" File "/var/task/lambda_function.py", line 36, in files_in_zip\n zippo = zip_stream()\n",
" File "/var/task/lambda_function.py", line 32, in zip_stream\n return ZipFile(BytesIO(obj.get()['Body'].read()))\n",
" File "/var/runtime/botocore/response.py", line 77, in read\n chunk = self._raw_stream.read(amt)\n",
" File "/var/runtime/urllib3/response.py", line 515, in read\n data = self._fp.read() if not fp_closed else b""\n",
" File "/var/lang/lib/python3.8/http/client.py", line 468, in read\n s = self._safe_read(self.length)\n",
" File "/var/lang/lib/python3.8/http/client.py", line 609, in _safe_read\n data = self.fp.read(amt)\n" ] }
Is there an option to stream/lazy-load the zip file to mitigate the memory issues?
Note - I also referred to an old post (How can I use boto to stream a file out of Amazon S3 to Rackspace Cloudfiles?) which covered streaming a file, but not a zip.
Depending on your exact needs, you can use smart-open to handle the reading of the zip file. If you can fit the CSV data in RAM in your Lambda, it's fairly straightforward to call it directly:
import csv
import zipfile
from io import TextIOWrapper, BytesIO

from smart_open import smart_open

def lambda_handler(event, context):
    # Simple test, just calculate the sum of the first column of a CSV file in a Zip file
    total_sum, row_count = 0, 0

    # Use smart open to handle the byte range requests for us
    with smart_open("s3://example-bucket/many_csvs.zip", "rb") as f:
        # Wrap that in a zip file handler
        zip = zipfile.ZipFile(f)
        # Open a specific CSV file in the zip file
        zf = zip.open("data_101.csv")
        # Read all of the data into memory, and prepare a text IO wrapper to read it row by row
        text = TextIOWrapper(BytesIO(zf.read()))
        # And finally, use python's csv library to parse the csv format
        cr = csv.reader(text)
        # Skip the header row
        next(cr)
        # Just loop through each row and add the first column
        for row in cr:
            total_sum += int(row[0])
            row_count += 1

    # And output the results
    print(f"Sum {row_count} rows for col 0: {total_sum}")
I tested this with a 1 GB zip file containing hundreds of CSV files. The CSV file I picked was around 12 MB uncompressed, or 100,000 rows, so it fit nicely into RAM in the Lambda environment, even when limited to 128 MB of RAM.
If your CSV file can't be loaded at once like this, you'll need to take care to load it in sections, buffering the reads so you don't waste time reading it line-by-line and forcing smart-open to load small chunks at a time.
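As a rough sketch of that streamed approach (untested; the bucket, key, and member names are just the placeholders from the example above), you can let zipfile hand you a file-like object for the member and iterate it row by row instead of calling zf.read():
import csv
import io
import zipfile

from smart_open import smart_open

with smart_open("s3://example-bucket/many_csvs.zip", "rb") as f:
    with zipfile.ZipFile(f) as z:
        # z.open() returns a streaming file-like object; the member is not read up front
        with z.open("data_101.csv") as member:
            text = io.TextIOWrapper(member, encoding="utf-8")
            reader = csv.reader(text)
            next(reader)  # skip the header row
            for row in reader:
                ...  # process one row at a time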

How to convert csv to json with python on amazon lambda?

I have a Lambda function which takes a CSV file that was uploaded to a bucket, converts it to JSON, and saves it to another bucket. Here is my code:
import json
import os
import boto3
import csv

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        file_key = record['s3']['object']['key']
        s3 = boto3.client('s3')
        csvfile = s3.get_object(Bucket=bucket, Key=file_key)
        csvcontent = csvfile['Body'].read().split(b'\n')
        data = []
        csv_file = csv.DictReader(csvcontent)
        print(csv_file)
        data = list(csv_file)
        os.chdir('/tmp')
        JSON_PATH = file_key[6:] + ".json"
        print(data)
        with open(JSON_PATH, 'w') as output:
            json.dump(data, output)
        bucket_name = 'xxx'
        s3.upload_file(JSON_PATH, bucket_name, JSON_PATH)
The problem is that although when I test this locally on my machine the file can be converted to json, when I run the lambda function I get the following error:
[ERROR] Error: iterator should return strings, not bytes (did you open the file in text mode?)
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 19, in lambda_handler
data = list(csv_file)
File "/var/lang/lib/python3.7/csv.py", line 111, in __next__
self.fieldnames
File "/var/lang/lib/python3.7/csv.py", line 98, in fieldnames
self._fieldnames = next(self.reader)
Can someone help me understand why this happens? I have been trying to find a solution for a while and I don't understand what the problem is. I appreciate any help you can provide.
The result of read() on the body from s3.get_object() is bytes, not strings. csv.DictReader() expects strings instead of bytes, and that's why it is failing.
You can decode the result of read() into strings using the decode() function with the correct encoding. The following would be a fix:
change this
csvcontent = csvfile['Body'].read().split(b'\n')
to this
csvcontent = csvfile['Body'].read().decode('utf-8')
A good way to debug these problems is to use the type() function to check what type your variable is. In your case, you can easily spot the problem by trying print(type(csvcontent)) - it would show that csvcontent is indeed of type bytes.
Just a small tweak to make it work right:
csvcontent = csvfile['Body'].read().decode().split('\n')
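Equivalently (just a sketch building on the handler above), you can wrap the decoded text in io.StringIO so DictReader handles the line splitting itself:
import io

body_text = csvfile['Body'].read().decode('utf-8')
csv_file = csv.DictReader(io.StringIO(body_text))  # file-like object, splits lines itself
data = list(csv_file)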

MemoryError when Using the read() Method in Reading a Large Size of JSON file from Amazon S3

I'm trying to import a large JSON file from Amazon S3 into AWS RDS-PostgreSQL using Python, but these errors occurred:
Traceback (most recent call last):
File "my_code.py", line 67, in
file_content = obj['Body'].read().decode('utf-8').splitlines(True)
File "/home/user/asd-to-qwe/fgh-to-hjk/env/local/lib/python3.6/site-packages/botocore/response.py", line 76, in read
chunk = self._raw_stream.read(amt)
File "/home/user/asd-to-qwe/fgh-to-hjk/env/local/lib/python3.6/site-packages/botocore/vendored/requests/packages/urllib3/response.py", line 239, in read
data = self._fp.read()
File "/usr/lib64/python3.6/http/client.py", line 462, in read
s = self._safe_read(self.length)
File "/usr/lib64/python3.6/http/client.py", line 617, in _safe_read
return b"".join(s)
MemoryError
# my_code.py
import sys
import boto3
import psycopg2
import zipfile
import io
import json

s3 = boto3.client('s3', aws_access_key_id=<aws_access_key_id>, aws_secret_access_key=<aws_secret_access_key>)
connection = psycopg2.connect(host=<host>, dbname=<dbname>, user=<user>, password=<password>)
cursor = connection.cursor()

bucket = sys.argv[1]
key = sys.argv[2]
obj = s3.get_object(Bucket=bucket, Key=key)

def insert_query(data):
    query = """
        INSERT INTO data_table
        SELECT
            (src.test->>'url')::varchar, (src.test->>'id')::bigint,
            (src.test->>'external_id')::bigint, (src.test->>'via')::jsonb
        FROM (SELECT CAST(%s AS JSONB) AS test) src
    """
    cursor.execute(query, (json.dumps(data),))

if key.endswith('.zip'):
    zip_files = obj['Body'].read()
    with io.BytesIO(zip_files) as zf:
        zf.seek(0)
        with zipfile.ZipFile(zf, mode='r') as z:
            for filename in z.namelist():
                with z.open(filename) as f:
                    for line in f:
                        insert_query(json.loads(line.decode('utf-8')))

if key.endswith('.json'):
    file_content = obj['Body'].read().decode('utf-8').splitlines(True)
    for line in file_content:
        insert_query(json.loads(line))

connection.commit()
connection.close()
Are there any solutions to these problems? Any help would do, thank you so much!
A significant saving can be had by avoiding slurping your whole input file into memory as a list of lines.
Specifically, these lines are terrible for memory usage: they involve a peak memory usage of a bytes object the size of your whole file, plus a list of lines holding the complete contents of the file again:
file_content = obj['Body'].read().decode('utf-8').splitlines(True)
for line in file_content:
For a 1 GB ASCII text file with 5 million lines, on 64 bit Python 3.3+, that's a peak memory requirement of roughly 2.3 GB for just the bytes object, the list, and the individual strs in the list. A program that needs 2.3x as much RAM as the size of the files it processes won't scale to large files.
To fix, change that original code to:
file_content = io.TextIOWrapper(obj['Body'], encoding='utf-8')
for line in file_content:
Given that obj['Body'] appears to be usable for lazy streaming this should remove both copies of the complete file data from memory. Using TextIOWrapper means obj['Body'] is lazily read and decoded in chunks (of a few KB at a time), and the lines are iterated lazily as well; this reduces memory demands to a small, largely fixed amount (the peak memory cost would depend on the length of the longest line), regardless of file size.
Update:
It looks like StreamingBody doesn't implement the io.BufferedIOBase ABC. It does have its own documented API though, that can be used for a similar purpose. If you can't make the TextIOWrapper do the work for you (it's much more efficient and simple if it can be made to work), an alternative would be to do:
file_content = (line.decode('utf-8') for line in obj['Body'].iter_lines())
for line in file_content:
Unlike using TextIOWrapper, it doesn't benefit from bulk decoding of blocks (each line is decoded individually), but otherwise it should still achieve the same benefits in terms of reduced memory usage.
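Putting that into the context of the question's code, the .json branch could look roughly like this (a sketch, reusing obj and insert_query from the code above):
if key.endswith('.json'):
    # Stream the body line by line instead of reading it all at once
    for raw_line in obj['Body'].iter_lines():
        line = raw_line.decode('utf-8')
        if line.strip():  # skip blank lines
            insert_query(json.loads(line))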

Many-record upload to postgres

I have a series of .csv files with some data, and I want a Python script to open them all, do some preprocessing, and upload the processed data to my postgres database.
I have it mostly complete, but my upload step isn't working. I'm sure it's something simple that I'm missing, but I just can't find it. I'd appreciate any help you can provide.
Here's the code:
import psycopg2
import sys
from os import listdir
from os.path import isfile, join
import csv
import re
import io

try:
    con = db_connect("dbname = '[redacted]' user = '[redacted]' password = '[redacted]' host = '[redacted]'")
except:
    print("Can't connect to database.")
    sys.exit(1)
cur = con.cursor()

upload_file = io.StringIO()
file_list = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in file_list:
    id_match = re.search(r'.*-(\d+)\.csv', file)
    if id_match:
        id = id_match.group(1)
        file_name = format(id_match.group())
        with open(mypath+file_name) as fh:
            id_reader = csv.reader(fh)
            next(id_reader, None)  # Skip the header row
            for row in id_reader:
                [stuff goes here to get desired values from file]
                if upload_file.getvalue() != '': upload_file.write('\n')
                upload_file.write('{0}\t{1}\t{2}'.format(id, [val1], [val2]))

print(upload_file.getvalue())  # prints output that looks like I expect it to
                               # with thousands of rows that seem to have the right values in the right fields
cur.copy_from(upload_file, '[my_table]', sep='\t', columns=('id', 'col_1', 'col_2'))
con.commit()

if con:
    con.close()
This runs without error, but a select query in psql still shows no records in the table. What am I missing?
Edit: I ended up giving up and writing it to a temporary file, and then uploading the file. This worked without any trouble...I'd obviously rather not have the temporary file though, so I'm happy to have suggestions if someone sees the problem.
When you write to an io.StringIO (or any other file) object, the file pointer remains at the position of the last character written. So, when you do
f = io.StringIO()
f.write('1\t2\t3\n')
s = f.readline()
the file pointer stays at the end of the file and s contains an empty string.
To read (not getvalue) the contents, you must reposition the file pointer to the beginning, e.g. use seek(0)
upload_file.seek(0)
cur.copy_from(upload_file, '[my_table]', columns = ('id', 'col_1', 'col_2'))
This allows copy_from to read from the beginning and import all the lines in your upload_file.
Don't forget that you read and keep all the files in memory, which might work for a single small import, but may become a problem when doing large imports or multiple imports in parallel.
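If that becomes an issue, one option is to do a COPY per input file so the in-memory buffer stays small. A rough sketch along the lines of the question's loop (build_rows is a hypothetical stand-in for the row-extraction logic):
for file in file_list:
    buf = io.StringIO()
    for id, val1, val2 in build_rows(join(mypath, file)):  # hypothetical helper
        buf.write('{0}\t{1}\t{2}\n'.format(id, val1, val2))
    buf.seek(0)  # rewind before COPY
    cur.copy_from(buf, '[my_table]', sep='\t', columns=('id', 'col_1', 'col_2'))
con.commit()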

Changing complex JSON files to .csv

I've downloaded a JSON file with lots of data about football players and I want to get at the data in a .csv. I'm a newb at most of this!
You can find the raw file here: https://raw.githubusercontent.com/llimllib/fantasypl_stats/8ba3e796fc3e73c43921da44d4344c08ce1d7031/data/players.1440000356.json
In the past I've used the following Python code (I think!) in the command prompt to export some of the data to a .csv:
import csv
import json

json_data = open("file.json")
data = json.load(json_data)
f = csv.writer(open("fix_hists.csv", "wb+"))
arr = []
for i in data:
    fh = data[i]["fixture_history"]
    array = fh["all"]
    for j in array:
        try:
            j.insert(0, str(data[i]["first_name"]))
        except:
            j.insert(0, 'error')
        try:
            j.insert(1, data[i]["web_name"])
        except:
            j.insert(1, 'error')
        try:
            f.writerow(j)
        except:
            f.writerow(['error', 'error'])
json_data.close()
Sadly, when I do this now in the command prompt, I get the following error:
Traceback (most recent call last):
  File "fix_hist.py", line 12, in <module>
    fh = data[i]["fixture_history"]
TypeError: list indices must be integers, not str
Can this be fixed, or is there another way I can grab some of the data and convert it to .csv? Specifically the fixture history, and then 'first_name', 'type_name', etc.
I'd advise using pandas.
Pandas has a function for parsing JSON files: pd.read_json().
docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
This will read the JSON file directly into a DataFrame.
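For example, a minimal sketch (this assumes the file is a JSON object keyed by player id, as the question's loop over data[i] suggests; the column names follow the question, and orient may need adjusting to the actual structure):
import pandas as pd

# One row per player; the dict keys become the index
df = pd.read_json("players.json", orient="index")

# Write a few columns of interest to CSV
df[["first_name", "web_name", "type_name"]].to_csv("players.csv", index=False)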
There's an improper tab on the for loop on line 11. If you update your code to look like the following, it should run without errors:
import csv
import json

json_data = open("players.json")
data = json.load(json_data)
f = csv.writer(open("fix_hists.csv", "wb+"))
arr = []
for i in data:
    fh = data[i]["fixture_history"]
    array = fh["all"]
    for j in array:
        try:
            j.insert(0, str(data[i]["first_name"]))
        except:
            j.insert(0, 'error')
        try:
            j.insert(1, data[i]["web_name"])
        except:
            j.insert(1, 'error')
        try:
            f.writerow(j)
        except:
            f.writerow(['error', 'error'])
json_data.close()
Make sure that you've named the JSON file players.json to match the name on line 4, and that the JSON file and this Python file are in the same directory. You can run the Python file in a development environment like PyCharm, or you can cd to the directory in a terminal/command prompt window and run it with python fileName.py. It will create a CSV file in that directory called fix_hists.csv.
