I have a Lambda function that attempts to take a CSV file that was uploaded to a bucket, convert it to JSON, and save it to another bucket. Here is my code:
import json
import os
import boto3
import csv

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        file_key = record['s3']['object']['key']
        s3 = boto3.client('s3')
        csvfile = s3.get_object(Bucket=bucket, Key=file_key)
        csvcontent = csvfile['Body'].read().split(b'\n')
        data = []
        csv_file = csv.DictReader(csvcontent)
        print(csv_file)
        data = list(csv_file)
        os.chdir('/tmp')
        JSON_PATH = file_key[6:] + ".json"
        print(data)
        with open(JSON_PATH, 'w') as output:
            json.dump(data, output)
        bucket_name = 'xxx'
        s3.upload_file(JSON_PATH, bucket_name, JSON_PATH)
The problem is that although when I test this locally on my machine the file can be converted to json, when I run the lambda function I get the following error:
[ERROR] Error: iterator should return strings, not bytes (did you open the file in text mode?)
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 19, in lambda_handler
data = list(csv_file)
File "/var/lang/lib/python3.7/csv.py", line 111, in __next__
self.fieldnames
File "/var/lang/lib/python3.7/csv.py", line 98, in fieldnames
self._fieldnames = next(self.reader)
Can someone help me understand why this happens? I have been trying to find a solution for a while and I don't understand what the problem is. I appreciate any help you can provide.
The result of read() in s3.get_object() is bytes, not strings. csv.DictReader() expects strings instead of bytes, and that's why it is failing.
You can decode the result of read() into strings using the decode() function with the correct encoding. The following would be a fix:
change this
csvcontent = csvfile['Body'].read().split(b'\n')
to this
csvcontent = csvfile['Body'].read().decode('utf-8')
A good way to debug these problems is to use the type() function to check what type your variable is. In your case, you can easily find the problem by trying print(type(csvcontent)) - it would show that csvcontent is indeed of type bytes.
Just a small tweak to make it work right:
csvcontent = csvfile['Body'].read().decode().split('\n')
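For completeness, here is a rough sketch of the whole handler with that fix applied. The bucket name 'xxx' and the file_key[6:] slicing are kept from the question, and the file is written straight into /tmp instead of chdir-ing there, so treat this as an untested sketch:

import csv
import json
import os
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        file_key = record['s3']['object']['key']
        csv_object = s3.get_object(Bucket=bucket, Key=file_key)
        # decode() turns the bytes returned by S3 into text, so DictReader gets strings
        lines = csv_object['Body'].read().decode('utf-8').splitlines()
        data = list(csv.DictReader(lines))
        # write to /tmp directly instead of changing the working directory
        json_key = file_key[6:] + ".json"
        json_path = os.path.join('/tmp', json_key)
        with open(json_path, 'w') as output:
            json.dump(data, output)
        s3.upload_file(json_path, 'xxx', json_key)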
JSON file looks like this:
{"Clear":"Pass","Email":"noname#email.com","ID":1234}
There are hundreds of json files with different email values, which is why I need a script to run against all files.
I need to extract the value associated with the Email attribute, which is noname#email.com.
I tried using import json but I'm getting a decoder error:
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Script looks like this:
import json
json_data = json.loads("file.json")
print(json_data["Email"])
Thanks!
According to the docs, json.loads() takes a str, bytes or bytearray as argument. So if you want to load a json file this way, you should pass the content of the file instead of its path.
import json
file = open("file.json", "r") # Opens file.json in read mode
file_data = file.read()
json_data = json.loads(file_data)
file.close() # Remember to close the file after using it
You can also use json.load(), which takes a file object as argument:
import json
file = open("file.json", "r")
json_data = json.load(file)
file.close()
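Either way, it is usually cleaner to use a with block so the file is closed automatically:

import json

with open("file.json", "r") as file:
    json_data = json.load(file)

print(json_data["Email"])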
Your script needs to open the file to get a file handle; then we can read the JSON.
This sample contains code that can read the JSON file. To simulate this, it uses a string that is identical to the data coming from the file.
import json

# this is to read from the real json file
# file_name = 'email.json'
# with open(file_name, 'r') as f_obj:
#     json_data = json.load(f_obj)

# this is a string that equals the result from reading the json file
json_data = '{"Clear":"Pass","Email":"noname#email.com","ID":1234}'
json_data = json.loads(json_data)
print(json_data["Email"])
result: noname#email.com
import json

with open("file.json", 'r') as f:
    file_content = f.read()
    # convert json to python dict
    tmp = json.loads(file_content)
    email = tmp["Email"]
As already pointed out in previous comments, json.loads() takes the contents of a file rather than a file path.
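Since you mention there are hundreds of these files, here is a rough sketch that applies the same idea to every .json file in a folder. The folder name json_files is just an assumption; adjust the glob pattern to your layout:

import glob
import json

emails = []
for path in glob.glob("json_files/*.json"):
    with open(path, "r") as f:
        data = json.load(f)
    emails.append(data["Email"])

print(emails)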
There is a problem with my code that I have not been able to solve for a while now.
I'm trying to convert a tar.gz-compressed CSV file to Parquet. The file itself, when uncompressed, is about 700 MB. The processing is done in a memory-restricted system, so I have to process the file in batches.
I figured out how to read the tar.gz as a stream, extract the file I need and use pyarrow's open_csv() to read batches. From here, I want to save the data to a parquet file by writing in batches.
This is where the problem appears. The file itself has lots of columns that don't have any values. But once in a while a single value appears around line 500,000 or so, which means pyarrow does not infer the dtype properly. Most of the columns are therefore of dtype null. My idea is to modify the schema and cast these columns to string, so any value is valid. Modifying the schema works fine, but when I run the code, I get this error.
Traceback (most recent call last):
File "b:\snippets\tar_converter.py", line 38, in <module>
batch = reader.read_next_batch()
File "pyarrow\ipc.pxi", line 682, in pyarrow.lib.RecordBatchReader.read_next_batch
File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: In CSV column #49: CSV conversion error to null: invalid value '0.0000'
Line 38 is this one:
batch = reader.read_next_batch()
Does anyone have any idea how to enforce the schema on the batches so the conversion does not fail?
Here is my code.
import io
import os
import tarfile
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv as csv
import logging

srcs = list()
path = "C:\\data"
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith("tar.gz"):
            srcs.append(os.path.join(root, name))

for source_file_name in srcs:
    file_name: str = source_file_name.replace(".tar.gz", "")
    target_file_name: str = source_file_name.replace(".tar.gz", ".parquet")
    clean_file_name: str = os.path.basename(source_file_name.replace(".tar.gz", ""))
    # download CSV file, preserving folder structure
    logging.info(f"Processing '{source_file_name}'.")
    with io.open(source_file_name, "rb") as file_obj_in:
        # unpack all files to temp_path
        file_obj_in.seek(0)
        with tarfile.open(fileobj=file_obj_in, mode="r") as tf:
            file_obj = tf.extractfile(f"{clean_file_name}.csv")
            file_obj.seek(0)
            reader = csv.open_csv(file_obj, read_options=csv.ReadOptions(block_size=25*1024*1024))
            schema = reader.schema
            null_cols = list()
            for index, entry in enumerate(schema.types):
                if entry.equals(pa.null()):
                    schema = schema.set(index, schema.field(index).with_type(pa.string()))
                    null_cols.append(index)
            with pq.ParquetWriter(target_file_name, schema) as writer:
                while True:
                    try:
                        batch = reader.read_next_batch()
                        table = pa.Table.from_batches(batches=[batch]).cast(target_schema=schema)
                        batch = table.to_batches()[0]
                        writer.write_batch(batch)
                    except StopIteration:
                        break
Also, I could leave out this part:
batch = reader.read_next_batch()
table = pa.Table.from_batches(batches=[batch]).cast(target_schema=schema)
batch = table.to_batches()[0]
But then the error is like this (shortened), showing that the schema change works at least.
Traceback (most recent call last):
File "b:\snippets\tar_converter.py", line 39, in <module>
writer.write_batch(batch)
File "C:\Users\me\AppData\Roaming\Python\Python39\site-packages\pyarrow\parquet\__init__.py", line 981, in write_batch
self.write_table(table, row_group_size)
File "C:\Users\me\AppData\Roaming\Python\Python39\site-packages\pyarrow\parquet\__init__.py", line 1004, in write_table
raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
ACCOUNT_NAME: string
BOOK_VALUE: double
ESTIMATED_TO_REALISE: double
VAT_PAYABLE_ID: null
VAT_RECEIVABLE_ID: null
MONTHLY_AMOUNT_EFFECTIVE_DATE: null
vs.
file:
ACCOUNT_NAME: string
BOOK_VALUE: double
ESTIMATED_TO_REALISE: double
VAT_PAYABLE_ID: string
VAT_RECEIVABLE_ID: string
MONTHLY_AMOUNT_EFFECTIVE_DATE: string
Thank you!
So I think I figured it out. Wanted to post it for those who have similar issues.
Also, thanks to all who had a look and helped!
I worked around this by reading the file twice.
In the first run I only read the first batch into the stream to get the schema. Then I converted the null columns to string and closed the stream (this is important if you use the same variable name). After this you read the file again, but now pass the modified schema via ConvertOptions to the reader. Thanks to #0x26res, whose comment gave me the idea.
# get initial schema by reading one batch
initial_reader = csv.open_csv(file_obj, read_options=csv.ReadOptions(block_size=16*1024*1024))
schema = initial_reader.schema
for index, entry in enumerate(schema.types):
    if entry.equals(pa.null()):
        schema = schema.set(index, schema.field(index).with_type(pa.string()))

# now use the modified schema for the reader
# must close the old reader first, otherwise wrong data is loaded
file_obj.close()
file_obj = tf.extractfile(f"{file_name}.csv")
file_obj.seek(0)
reader = csv.open_csv(file_obj,
                      read_options=csv.ReadOptions(block_size=16*1024*1024),
                      convert_options=csv.ConvertOptions(column_types=schema))
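As a side note, ConvertOptions(column_types=...) also accepts a plain mapping of column names to types, so an alternative (untested sketch) would be to pass only the columns that were inferred as null instead of the whole schema:

# build a name -> type mapping for just the null columns
column_overrides = {
    field.name: pa.string()
    for field in initial_reader.schema
    if field.type.equals(pa.null())
}
reader = csv.open_csv(file_obj,
                      read_options=csv.ReadOptions(block_size=16*1024*1024),
                      convert_options=csv.ConvertOptions(column_types=column_overrides))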
I'm having difficulty with the following code (which is simplified from a larger application I'm working on in Python).
from io import StringIO
import gzip

jsonString = 'JSON encoded string here created by a previous process in the application'

out = StringIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
    f.write(str.encode(jsonString))

# Write the file once finished rather than streaming it - uncomment the next line to see file locally.
with open("out_" + currenttimestamp + ".json.gz", "a", encoding="utf-8") as f:
    f.write(out.getvalue())
When this runs I get the following error:
File "d:\Development\AWS\TwitterCompetitionsStreaming.py", line 61, in on_status
with gzip.GzipFile(fileobj=out, mode="w") as f:
File "C:\Python38\lib\gzip.py", line 204, in __init__
self._write_gzip_header(compresslevel)
File "C:\Python38\lib\gzip.py", line 232, in _write_gzip_header
self.fileobj.write(b'\037\213') # magic header
TypeError: string argument expected, got 'bytes'
PS ignore the rubbish indenting here...I know it doesn't look right.
What I want to do is create a JSON file and gzip it in place in memory before saving the gzipped file to the filesystem (Windows). I know I've gone about this the wrong way and could do with a pointer. Many thanks in advance.
You have to use bytes everywhere when working with gzip, instead of strings and text. First, use BytesIO instead of StringIO. Second, the mode should be 'wb' for bytes instead of 'w' (the latter is for text), and similarly 'ab' instead of 'a' when appending; the 'b' character means "bytes". Full corrected code below:
from io import BytesIO
import gzip

jsonString = 'JSON encoded string here created by a previous process in the application'

out = BytesIO()
with gzip.GzipFile(fileobj=out, mode='wb') as f:
    f.write(str.encode(jsonString))

currenttimestamp = '2021-01-29'

# Write the file once finished rather than streaming it - uncomment the next line to see file locally.
with open("out_" + currenttimestamp + ".json.gz", "wb") as f:
    f.write(out.getvalue())
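If you want to sanity-check the result, you can read the file back with gzip.open in text mode and compare it to the original string, for example:

import gzip

# read the gzipped file back and confirm the JSON string round-trips
with gzip.open("out_" + currenttimestamp + ".json.gz", "rt", encoding="utf-8") as f:
    assert f.read() == jsonString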
In AWS, I'm trying to save a file to S3 in Python using a Lambda function. While this works on my local computer, I am unable to get it to work in Lambda. I've been working on this problem for most of the day and would appreciate help. Thank you.
def pdfToTable(PDFfilename, apiKey, fileExt, bucket, key):

    # parsing a PDF using an API
    fileData = (PDFfilename, open(PDFfilename, "rb"))
    files = {"f": fileData}
    postUrl = "https://pdftables.com/api?key={0}&format={1}".format(apiKey, fileExt)
    response = requests.post(postUrl, files=files)
    response.raise_for_status()

    # this code is probably the problem!
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('transportation.manifests.parsed')
    with open('/tmp/output2.csv', 'rb') as data:
        data.write(response.content)
        key = 'csv/' + key
        bucket.upload_fileobj(data, key)

    # FYI, on my own computer, this saves the file
    with open('output.csv', "wb") as f:
        f.write(response.content)
In S3, there is a bucket transportation.manifests.parsed containing the folder csv where the file should be saved.
The type of response.content is bytes.
From AWS, the error from the current set-up above is [Errno 2] No such file or directory: '/tmp/output2.csv': FileNotFoundError. In fact, my goal is to save the file to the csv folder under a unique name, so tmp/output2.csv might not be the best approach. Any guidance?
In addition, I've tried to use wb and w instead of rb also to no avail. The error with wb is Input <_io.BufferedWriter name='/tmp/output2.csv'> of type: <class '_io.BufferedWriter'> is not supported. The documentation suggests that using 'rb' is the recommended usage, but I do not understand why that would be the case.
Also, I've tried s3_client.put_object(Key=key, Body=response.content, Bucket=bucket) but receive An error occurred (404) when calling the HeadObject operation: Not Found.
Assuming Python 3.6. The way I usually do this is to wrap the bytes content in a BytesIO wrapper to create a file-like object. And, per the boto3 docs, you can use the transfer manager for a managed transfer:
from io import BytesIO
import boto3
s3 = boto3.client('s3')
fileobj = BytesIO(response.content)
s3.upload_fileobj(fileobj, 'mybucket', 'mykey')
If that doesn't work I'd double check all IAM permissions are correct.
You have a writable stream that you're asking boto3 to use as a readable stream, which won't work.
Write the file (in binary mode, since response.content is bytes), and then simply use bucket.upload_file() afterwards, like so:
s3 = boto3.resource('s3')
bucket = s3.Bucket('transportation.manifests.parsed')
with open('/tmp/output2.csv', 'wb') as data:
    data.write(response.content)

key = 'csv/' + key
bucket.upload_file('/tmp/output2.csv', key)
I'm copying a file from S3 to Cloudfiles, and I would like to avoid writing the file to disk. The Python-Cloudfiles library has an object.stream() call that looks to be what I need, but I can't find an equivalent call in boto. I'm hoping that I would be able to do something like:
shutil.copyfileobj(s3Object.stream(),rsObject.stream())
Is this possible with boto (or I suppose any other s3 library)?
Other answers in this thread are related to boto, but S3.Object is not iterable anymore in boto3. So, the following DOES NOT WORK; it produces a TypeError: 's3.Object' object is not iterable error message:
s3 = boto3.session.Session(profile_name=my_profile).resource('s3')
s3_obj = s3.Object(bucket_name=my_bucket, key=my_key)
with io.FileIO('sample.txt', 'w') as file:
    for i in s3_obj:
        file.write(i)
In boto3, the contents of the object are available at S3.Object.get()['Body'], which is an iterable since version 1.9.68 but wasn't previously. Thus the following will work for the latest versions of boto3 but not earlier ones:
body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    for i in body:
        file.write(i)
So, an alternative for older boto3 versions is to use the read method, but this loads the WHOLE S3 object into memory, which is not always an option when dealing with large files:
body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    file.write(body.read())
But the read method allows you to pass in the amt parameter, specifying the number of bytes we want to read from the underlying stream. This method can be called repeatedly until the whole stream has been read:
body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    while file.write(body.read(amt=512)):
        pass
Digging into the botocore.response.StreamingBody code, one realizes that the underlying stream is also available, so we could iterate as follows:
body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    for b in body._raw_stream:
        file.write(b)
While googling I've also seen some links that could be of use, but I haven't tried them:
WrappedStreamingBody
Another related thread
An issue in the boto3 GitHub repo requesting that StreamingBody be a proper stream - which has been closed!
The Key object in boto, which represents an object in S3, can be used like an iterator, so you should be able to do something like this:
>>> import boto
>>> c = boto.connect_s3()
>>> bucket = c.lookup('garnaat_pub')
>>> key = bucket.lookup('Scan1.jpg')
>>> for bytes in key:
... write bytes to output stream
Or, as in the case of your example, you could do:
>>> shutil.copyfileobj(key, rsObject.stream())
I figure at least some of the people seeing this question will be like me, and will want a way to stream a file from boto line by line (or comma by comma, or any other delimiter). Here's a simple way to do that:
def getS3ResultsAsIterator(self, aws_access_info, key, prefix):
    s3_conn = S3Connection(**aws_access_info)
    bucket_obj = s3_conn.get_bucket(key)
    # go through the list of files in the key
    for f in bucket_obj.list(prefix=prefix):
        unfinished_line = ''
        for byte in f:
            byte = unfinished_line + byte
            # split on whatever, or use a regex with re.split()
            lines = byte.split('\n')
            unfinished_line = lines.pop()
            for line in lines:
                yield line
        # don't drop the last line if the file doesn't end with a newline
        if unfinished_line:
            yield unfinished_line
#garnaat's answer above is still great and 100% true. Hopefully mine still helps someone out.
Botocore's StreamingBody has an iter_lines() method:
https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody.iter_lines
So:
import boto3
s3r = boto3.resource('s3')
iterator = s3r.Object(bucket, key).get()['Body'].iter_lines()
for line in iterator:
    print(line)
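Note that iter_lines() yields bytes, so decode each line if you need text, for example:

for line in s3r.Object(bucket, key).get()['Body'].iter_lines():
    print(line.decode('utf-8'))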
This is my solution for wrapping the streaming body:
import io
import boto3

class S3ObjectInterator(io.RawIOBase):
    def __init__(self, bucket, key):
        """Initialize with S3 bucket and key names"""
        self.s3c = boto3.client('s3')
        self.obj_stream = self.s3c.get_object(Bucket=bucket, Key=key)['Body']

    def read(self, n=-1):
        """Read from the stream"""
        return self.obj_stream.read() if n == -1 else self.obj_stream.read(n)
Example usage:
obj_stream = S3ObjectInterator(bucket, key)
for line in obj_stream:
    print(line)