Reading JSON file from S3 with AWS Lambda - python

First of all, I'm new to AWS so I apologize if the question is very simple or not explained properly.
I'm trying to read a JSON file stored in an S3 bucket with an AWS Lambda function.
My main problem is that I am completely unable to extract the information from it.
This is my code:
import json
import boto3

def lambda_handler(event, context):
    BUCKET = 'BUCKET'
    KEY = 'KEY.json'
    client = boto3.client('s3')
    result = client.get_object(Bucket=BUCKET, Key=KEY)
    # Read the object
    text = result['Body'].read()  # .decode('utf-8')
    # convert to string
    text_str = str(text)
    text_str = text_str.replace('\r\n', '')
    print(text_str)
If I use decode('utf-8'), I get: "errorMessage": "'utf-8' codec can't decode byte 0xba in position 976: invalid start byte".
If I don't, I get the JSON file, but like this:
'{\r\n "id": 0,\r\n "uid": "uid",\r\n "name": "User",\r\n "last": "Candidate"}'
I am stuck here, because the .replace doesn't work and I don't know how to access what I get as a standard JSON object.
Thank you in advance.
Update: it looks like the main problem is that I have non-ASCII characters like 'á' in the JSON file. Now I get something like this (I only show part of the JSON):
'{"id": 0,"uid": "uid","name": "User","last": "Candidate"}'
I've tried ast.literal_eval to get rid of the quotes and access the dictionary, and also json.dumps to try to avoid the problem with the non-ASCII characters, but nothing has worked.
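One thing worth trying (a minimal sketch, assuming the stray 0xba byte means the file was saved in a Latin-1/Windows-1252 style encoding rather than UTF-8): decode with a fallback encoding and let json.loads do the parsing, instead of cleaning the string by hand.

import json
import boto3

def lambda_handler(event, context):
    client = boto3.client('s3')
    result = client.get_object(Bucket='BUCKET', Key='KEY.json')
    raw = result['Body'].read()
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError:
        # latin-1 maps every byte value, so this decode never fails
        text = raw.decode('latin-1')
    data = json.loads(text)  # the \r\n whitespace is handled for free
    print(data['name'])      # access fields like any other dict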

Related

S3 InvalidDigest when calling the PutObject operation [duplicate]

I have tried to upload an XML file to S3 using boto3. As recommended by Amazon, I would like to send a Base64-encoded MD5-128 bit digest (Content-MD5) of the data.
https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.put
My Code:
with open(file, 'rb') as tempfile:
    body = tempfile.read()
hash_object = hashlib.md5(body)
base64_md5 = base64.encodebytes(hash_object.digest())
response = s3.Object(self.bucket, self.key + file).put(
    Body=body.decode(self.encoding),
    ACL='private',
    Metadata=metadata,
    ContentType=self.content_type,
    ContentEncoding=self.encoding,
    ContentMD5=str(base64_md5)
)
When I try this, str(base64_md5) creates a string like 'b'ZpL06Osuws3qFQJ8ktdBOw==\n''.
In this case, I get this error message:
An error occurred (InvalidDigest) when calling the PutObject operation: The Content-MD5 you specified was invalid.
For test purposes I copied only the value without the b'' prefix: 'ZpL06Osuws3qFQJ8ktdBOw==\n'
Then I get this error message:
botocore.exceptions.HTTPClientError: An HTTP Client raised and unhandled exception: Invalid header value b'hvUe19qHj7rMbwOWVPEv6Q==\n'
Can anyone help me figure out how to upload a file to S3 with a valid Content-MD5?
Thanks,
Oliver
Starting with @Isaac Fife's example, stripping it down to identify what's required vs. what's not, and including imports and such to make it a fully reproducible example:
(the only change you need to make is to use your own bucket name)
import base64
import hashlib
import boto3

contents = "hello world!"
md = hashlib.md5(contents.encode('utf-8')).digest()
contents_md5 = base64.b64encode(md).decode('utf-8')

boto3.client('s3').put_object(
    Bucket="mybucket",
    Key="test",
    Body=contents,
    ContentMD5=contents_md5
)
Learnings: first, the MD5 you are trying to generate will NOT look like what an upload returns. S3 actually needs a base64 version of the raw digest; md.hexdigest() returns hex, and hex is base16, which is not base64.
(Python 3.7)
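To make the base16-vs-base64 distinction concrete, a quick sketch:

import base64
import hashlib

digest = hashlib.md5(b"hello world!").digest()   # 16 raw bytes
print(hashlib.md5(b"hello world!").hexdigest())  # 32 hex chars (base16) - not what ContentMD5 wants
print(base64.b64encode(digest).decode('utf-8'))  # 24-char base64 string - what ContentMD5 expects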
Took me hours to figure this out because the only error you get is "The Content-MD5 you specified was invalid." Super useful for debugging... Anyway, here is the code I used to actually get the file to upload correctly before refactoring.
json_results = json_converter.convert_to_json(result)
json_results_utf8 = json_results.encode('utf-8')
content_md5 = md5.get_content_md5(json_results_utf8)
content_md5_string = content_md5.decode('utf-8')
metadata = {
    "md5chksum": content_md5_string
}
s3 = boto3.resource('s3', config=Config(signature_version='s3v4'))
obj = s3.Object(bucket, 'filename.json')
obj.put(
    Body=json_results_utf8,
    ContentMD5=content_md5_string,
    ServerSideEncryption='aws:kms',
    Metadata=metadata,
    SSEKMSKeyId=key_id)
and the hashing
def get_content_md5(data):
    digest = hashlib.md5(data).digest()
    return base64.b64encode(digest)
The hard part for me was figuring out what encoding you need at each step in the process, since I wasn't very familiar with how strings are stored in Python at the time.
get_content_md5 takes a UTF-8 bytes-like object only, and returns the same. But to pass the MD5 hash to AWS, it needs to be a string; you have to decode it before you give it to ContentMD5.
Pro-tip: Body, on the other hand, needs to be given bytes or a seekable object. If you pass a seekable object, make sure you seek(0) to the beginning of the file before you pass it to AWS, or the MD5 will not match. For that reason, using bytes is less error-prone, imo.
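A minimal sketch of that seekable-object pitfall (payload.json and mybucket are made-up names):

import base64
import hashlib
import boto3

with open('payload.json', 'rb') as f:
    digest = hashlib.md5(f.read()).digest()  # reading consumes the stream
    content_md5 = base64.b64encode(digest).decode('utf-8')
    f.seek(0)  # rewind, or S3 receives an empty body and the MD5 won't match
    boto3.client('s3').put_object(
        Bucket='mybucket',
        Key='payload.json',
        Body=f,
        ContentMD5=content_md5,
    )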

Storing gzipped data in Redis

I have a huge Python dictionary that I want to save to the Redis cache and then have an API handler return this dictionary straight from the cache.
I'm using gzip to compress the stringified dict before storing it in the cache:
transformed_object = {...big dictionary}
byte_object = BytesIO()
data = json.dumps(transformed_object)
with gzip.GzipFile(fileobj=byte_object, mode="w") as f:
    f.write(data.encode())
final_data = byte_object.getvalue()
I write this to Redis cache
context.redis.set(COMPLETE_GZIPPED_CACHE, final_data)
I have an API handler where I want to return the gzipped data
cache_list = redis.get(COMPLETE_GZIPPED_CACHE)
self.finish(
    {
        "status": True,
        "cache_list": cache_list,
        "updated_at": datetime.datetime.now(),
    }
)
The problem is that I'm getting the error below:
TypeError: Object of type 'bytes' is not JSON serializable
Do I need to decode the bytes back to a string before returning them to the frontend? Ideally I would like the frontend to handle the decoding.
Is there a better way to do this?
Figured it out from other posts. I wrote a function like this and opted to use zlib:
import base64
import json
import zlib

def convert_to_gzip_format(data):
    stringified_object = json.dumps(data).encode("utf-8")
    compressed_file = zlib.compress(stringified_object)
    base64_string = base64.b64encode(compressed_file).decode("ascii")
    return base64_string
This saves it as an ASCII string to Redis. I then use pako.js in the frontend to decode the above into a readable string.
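For completeness, a sketch of how the handler from the question changes (COMPLETE_GZIPPED_CACHE and self.finish are as above): the stored value is now an ASCII string, so after decoding Redis's bytes it is JSON-serializable.

cache_list = redis.get(COMPLETE_GZIPPED_CACHE).decode("ascii")
self.finish(
    {
        "status": True,
        "cache_list": cache_list,
        "updated_at": datetime.datetime.now(),
    }
)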

Read JSON data from UTF-8 encoded byte string

I have a script that sends a UTF-8 encoded JSON byte string to a socket (a GitHub project: https://github.com/alios/raildriver). Now I'm writing the Python script that needs to read the incoming data. Right now I can receive the data and print it to the terminal, using the following script: https://www.binarytides.com/code-telnet-client-sockets-python/
Output:
data = '{"Current": 117.42609405517578, "Accelerometer": -5.394751071929932, "SpeedometerKPH": 67.12493133544922, "Ammeter": 117.3575210571289, "Amp": 117.35590362548828, "Acceleration": -0.03285316377878189, "TractiveEffort": -5.394751071929932, "Effort": 48.72163772583008, "RawTargetDistance": 3993.927734375, "TargetDistanceBar": 0.9777777791023254, "TargetDistanceDigits100": -1.0, "TargetDistanceDigits1000": -1.0}'
The problem is that I can't find out how to read the JSON object. For example, how do I read "Ammeter" and assign its value 117.3575210571289 to a new variable?
All the data is being received in the variable data
The code I have right now:
decodedjson = data.decode('utf-8')
dumpedjson = json.dumps(decodedjson)
loadedjson = json.loads(dumpedjson)
Can you please help me?
You are encoding to JSON and then decoding again. Simply don't encode; remove the second line:
decodedjson = data.decode('utf-8')
loadedjson = json.loads(decodedjson)
If you are using Python 3.6 or newer, you don't actually have to decode from UTF-8, as the json.loads() function knows how to deal with UTF-encoded JSON data directly. The same applies to Python 2:
loadedjson = json.loads(data)
Demo using Python 3.7:
>>> data = b'{"Current": 117.42609405517578, "Accelerometer": -5.394751071929932, "SpeedometerKPH": 67.12493133544922, "Ammeter": 117.3575210571289, "Amp": 117.35590362548828, "Acceleration": -0.03285316377878189, "TractiveEffort": -5.394751071929932, "Effort": 48.72163772583008, "RawTargetDistance": 3993.927734375, "TargetDistanceBar": 0.9777777791023254, "TargetDistanceDigits100": -1.0, "TargetDistanceDigits1000": -1.0}'
>>> loadedjson = json.loads(data)
>>> loadedjson['Ammeter']
117.3575210571289

Key given by lambda S3 event cannot be used when containing non-ASCII characters

I have a Python lambda script that shrinks images as they are uploaded to S3. When the uploaded filename contains non-ASCII characters (Hebrew in my case), I cannot get the object (Forbidden as if the file doesn't exist).
Here's (some of) my code:
s3_client = boto3.client('s3')

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        s3_client.download_file(bucket, key, "/tmp/somefile")
This raises An error occurred (403) when calling the HeadObject operation: Forbidden: ClientError. I also see in the log that the key contains characters like %D7%92.
Following the web I also tried to unquote the key according to some sources (http://blog.rackspace.com/the-devnull-s3-bucket-hacking-with-aws-lambda-and-python/) like so, with no luck:
key = urllib.unquote_plus(record['s3']['object']['key'])
Same error, although this time the log states that I'm trying to retrieve a key with characters like this: פ×קס×.
Note that this script is verified to work on English keys, and the tests were done on keys with no spaces.
# This worked for me
import urllib.parse

encodedStr = 'My+name+is+Tarak'
urllib.parse.unquote_plus(encodedStr)  # -> "My name is Tarak"
I had a similar problem. I solved it by adding an encode before doing the unquote:
key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode("utf8"))
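In Python 3 the same fix, applied to the handler from the question, would look like this (a sketch; unquote_plus moved to urllib.parse in Python 3):

import urllib.parse
import boto3

s3_client = boto3.client('s3')

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # S3 event notifications URL-encode object keys; decode before use
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        s3_client.download_file(bucket, key, '/tmp/somefile')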

Trying to extract JSON data, getting "expected string or buffer"

So I'm experimenting with JSON a bit, and this is the code I've got so far:
import json
from utorrent.client import UTorrentClient

uTorrent = UTorrentClient("xxxx", "xxxx", "xxxx")
data = uTorrent.list()
torrents = json.loads(data)["torrents"]
for torrent in torrents:
    print torrent[0]   # hash
    print torrent[2]   # name
    print torrent[21]  # status
    print torrent[26]  # folder
The typical JSON output can be viewed here. But I'm getting an "expected string or buffer" error. Anyone with any pointers?
The point of the above code is to print out the hash, name, etc. for each torrent found in the list provided by uTorrent.
Did you try using load instead of loads? I was having the same problem and I realized there's a difference.
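For reference, the difference in one small sketch (the payload here is made up): json.loads() parses a string, while json.load() parses a file-like object with a .read() method, which is what some client libraries return instead of a string.

import json
from io import StringIO

payload = '{"torrents": [["hash", 201, "name"]]}'

print(json.loads(payload)["torrents"])           # loads: takes a str (or bytes)
print(json.load(StringIO(payload))["torrents"])  # load: takes a file-like object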
