With boto3, you can read the contents of an object in S3, given a bucket name and a key, like this (assuming a preliminary import boto3):
s3 = boto3.resource('s3')
content = s3.Object(BUCKET_NAME, S3_KEY).get()['Body'].read()
This returns the entire content at once. The specific file I need to fetch happens to be a collection of dictionary-like objects, one per line, so it is not a single JSON document. Instead of reading it all in one go, I'd like to stream it as a file object and read it line by line; I cannot find a way to do this other than downloading the file locally first:
s3 = boto3.resource('s3')
bucket = s3.Bucket(BUCKET_NAME)
filename = 'my-file'
bucket.download_file(S3_KEY, filename)
f = open('my-file')
What I'm asking is whether it's possible to get this kind of line-by-line access to the file without having to download it locally first.
I found .splitlines() worked for me...
txt_file = s3.Object(bucket, file).get()['Body'].read().decode('utf-8').splitlines()
Without the .splitlines() the whole blob of text was returned, and trying to iterate over it yielded individual characters rather than lines. With .splitlines(), iteration by line was achievable.
In my example here I iterate through the lines and split each one so it can be built up into a dict.
txt_file = s3.Object(bucket, file).get()['Body'].read().decode('utf-8').splitlines()
for line in txt_file:
    arr = line.split()
    print(arr)
You can also take advantage of StreamingBody's iter_lines method:
for line in s3.Object(bucket, file).get()['Body'].iter_lines():
    decoded_line = line.decode('utf-8')  # if decoding is needed
That would consume less memory than reading the whole file at once and then splitting it.
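Since the original file is a set of dictionary-like objects, one per line, each decoded line can then be parsed on its own. A minimal sketch, assuming each line is a Python-style dict literal (use json.loads instead if the lines are strict JSON); BUCKET_NAME and S3_KEY are the same placeholders as in the question:
import ast
import boto3

s3 = boto3.resource('s3')
records = []
for line in s3.Object(BUCKET_NAME, S3_KEY).get()['Body'].iter_lines():
    text = line.decode('utf-8')
    if not text.strip():
        continue  # skip blank lines
    records.append(ast.literal_eval(text))  # parse one dict-like object per line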
The following comment from kooshiwoosh to a similar question provides a nice answer:
from io import TextIOWrapper
from gzip import GzipFile
...
# get StreamingBody from botocore.response
response = s3.get_object(Bucket=bucket, Key=key)
# if gzipped
gzipped = GzipFile(None, 'rb', fileobj=response['Body'])
data = TextIOWrapper(gzipped)
for line in data:
    # process line
This will read just the first bytes_to_read bytes of the object, by passing a byte count to read():
bytes_to_read = 512
content = s3.Object(BUCKET_NAME, S3_KEY).get()['Body'].read(bytes_to_read)
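If the goal is to work through the whole object without holding it in memory at once, the same read() call can be repeated until the stream is exhausted; a sketch, reusing the BUCKET_NAME and S3_KEY placeholders:
import boto3

s3 = boto3.resource('s3')
body = s3.Object(BUCKET_NAME, S3_KEY).get()['Body']

chunk_size = 512
while True:
    chunk = body.read(chunk_size)
    if not chunk:
        break  # end of stream
    # process the chunk here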
This works for me:
json_object = s3.get_object(Bucket=bucket, Key=json_file_name)
json_file_reader = json_object['Body'].read()
content = json.loads(json_file_reader)
You can now also use the download_fileobj function. Here's an example for a CSV file:
import boto3
import csv

bucket = 'my_bucket'
file_key = 'my_key/file.csv'
output_file_path = 'output.csv'

s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket)

# Dump binary in append mode
with open(output_file_path, 'ab') as file_object:
    bucket.download_fileobj(
        Key=file_key,
        Fileobj=file_object,
    )

# Read your file as usual
with open(output_file_path, 'r') as csvfile:
    lines = csv.reader(csvfile)
    for line in lines:
        doWhatEver(line[0])
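If you'd rather not touch the local disk at all, download_fileobj also accepts an in-memory buffer; a sketch under the same bucket/key assumptions as above:
import csv
import io
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my_bucket')

buffer = io.BytesIO()
bucket.download_fileobj(Key='my_key/file.csv', Fileobj=buffer)
buffer.seek(0)  # rewind before reading

for row in csv.reader(io.TextIOWrapper(buffer, encoding='utf-8')):
    print(row[0])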
I am using the Python 3 code below to read from an S3 bucket, extract data and write it to a new file in the same bucket. But the write operation is not working and Medicaid_Provider_ID_.txt ends up with zero rows. Any clue?
import logging
import boto3

s3 = boto3.client("s3")

data = s3.get_object(Bucket='mmis.request.file', Key='MEIPASS_FISCAL_TRANS_ONE_RECORD.TXT')
file_lines = data['Body'].iter_lines()
next(file_lines)

new = []
id = 1
for line in file_lines:
    line_split = line.decode().split(',')
    MEDICAID_PROVIDER_ID = line_split[0]
    REASON_CODE = line_split[1]
    with open("Medicaid_Provider_ID_.txt", "w") as f:
        f.writelines(MEDICAID_PROVIDER_ID)
        f.close()
    id += 1

new = s3.put_object(Bucket='mmis.request.file', Key='Medicaid_Provider_ID_.txt')
This line of code is recreating your file every single time the code runs:
with open("Medicaid_Provider_ID_.txt","w") as f:
You should open/create the file once, then iterate over all the rows in the file, then close the file when you are done. Like so:
import logging
import boto3

s3 = boto3.client("s3")

data = s3.get_object(Bucket='mmis.request.file', Key='MEIPASS_FISCAL_TRANS_ONE_RECORD.TXT')
file_lines = data['Body'].iter_lines()
next(file_lines)

new = []
id = 1

# Open the file once
with open("Medicaid_Provider_ID_.txt", "w") as f:
    # Write each line of the file
    for line in file_lines:
        line_split = line.decode().split(',')
        MEDICAID_PROVIDER_ID = line_split[0]
        REASON_CODE = line_split[1]
        f.writelines(MEDICAID_PROVIDER_ID)
        id += 1

# Close the file (the with block already does this automatically)
f.close()

new = s3.put_object(Bucket='mmis.request.file', Key='Medicaid_Provider_ID_.txt')
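Note also that put_object as written creates an empty object in S3; to upload what was just written you would pass the file contents as the Body (a sketch, not part of the original answer):
with open("Medicaid_Provider_ID_.txt", "rb") as f:
    s3.put_object(Bucket='mmis.request.file', Key='Medicaid_Provider_ID_.txt', Body=f)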
I'm iterating through an Excel file that I'm pulling from S3 and want to append the results into one file. The data isn't large enough to exceed Lambda memory limits, so I'm collecting it in a variable and then writing that string out as a CSV file that I want to upload to S3. A variation of this code works perfectly when I run it locally; I'm not sure what's going wrong when I move it to AWS Lambda.
import csv
import io
import os
import tempfile

import boto3
import openpyxl
import urllib3

s3 = boto3.client('s3')
bucket = os.environ['S3_BUCKET']
http = urllib3.PoolManager()

def lambda_handler(event, context):
    file = readS3('example.xlsx')  # load file with Boto3
    latest_scan = openpyxl.load_workbook(io.BytesIO(file), data_only=True)
    sh = latest_scan.active
    a = []
    for row in sh['A']:
        r5 = http.request(
            'GET',
            'https://example.com/api/' + str(row.value),
            headers={
                'Accept': 'text/csv'
            }
        )
        a.append(r5.data.decode('utf-8'))
    s = ''.join(a)

    temp = tempfile.TemporaryFile(mode='w+', suffix='.csv')
    with open(temp, 'w', encoding="utf-8") as f:
        for line in s:
            f.write(line)
    temp.seek(0)
    s3.put_object(temp, Bucket=bucket, Key='test.csv')
    temp.close()
I'm getting:
"errorMessage": "expected str, bytes or os.PathLike object, not _io.TextIOWrapper",
"errorType": "TypeError",
"stackTrace": [
" File \"/var/task/lambda_function.py\", line in lambda_handler\n with open(temp,
'w', encoding=\"utf-8\") as f:\n"
]
tempfile.TemporaryFile() already opens the file and returns a file object, not a filename, so there is nothing to pass to open(). Just use the returned object directly as f:
with tempfile.TemporaryFile(mode='w+', suffix='.csv', encoding="utf-8") as f:
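Putting that together, the write/upload part of the handler might look something like this (a sketch; reading the text back and encoding it before put_object is an assumption, not part of the original answer):
with tempfile.TemporaryFile(mode='w+', suffix='.csv', encoding='utf-8') as f:
    f.write(s)   # s is the joined CSV text built above
    f.seek(0)    # rewind before reading the contents back
    s3.put_object(Bucket=bucket, Key='test.csv', Body=f.read().encode('utf-8'))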
I have a file folder of 1000+ json metadata files. I have created a list of the file paths and I'm trying to:
for each file path, read the json file
pull in only the key/value pairs I'm interested in
store it in a variable or save it in a way that I can insert into mongodb using pymongo
I have successfully listed the file paths into a variable and loaded ONE json doc (from one file path). The problem is that I need to do over a thousand, and I get an error when I try to loop over the list of file paths.
Here's what I've tried so far:
import pymongo
import json

filename = r"C:\Users\Documents\FileFolder\randomFile.docx.json"

with open(filename, "r", encoding="utf8") as f:
    json_doc = json.load(f)

new_jsonDoc = dict()
for key in {'Application-Name', 'Author', 'resourceName', 'Line-Count', 'Page-Count', 'Paragraph-Count', 'Word-Count'}:
    new_jsonDoc[key] = json_doc[0][key]
Sample output:
{'Application-Name': 'Microsoft Office Word',
'Author': 'Sample, John Q.',
'Character Count': '166964',
'Line-Count': '1391',
'Page-Count': '103',
'Paragraph-Count': '391',
'Word-Count': '29291',
'resourceName': 'randomFile.docx'}
Now when I add the loop:
for file in list_JsonFiles:  # this is the list of file paths created by os.walk
    # if I do a print(type(file)) here, file type is a string
    with open(file, "r") as f:
        # print(type(file)) = string, print(type(f)) = TextIOWrapper
        json_doc = json.loads(f)
        ### TypeError: the JSON object must be str, bytes or bytearray, not TextIOWrapper ###
How can I get my loop working? Is my approach wrong?
Figured the TypeError out:
for file in list_JsonFiles:
    with open(file, "r", encoding="utf8") as f:
        json_doc = json.load(f)
    new_jsonDoc = dict()
    for key in {'Application-Name', 'Author', 'resourceName', 'Line-Count', 'Page-Count', 'Paragraph-Count', 'Word-Count'}:
        if key in json_doc[0]:
            new_jsonDoc[key] = json_doc[0][key]
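To cover the last step (inserting into MongoDB with pymongo), here is one possible sketch; the connection string, database name and collection name are made-up placeholders:
import json
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")  # placeholder connection string
collection = client["metadata_db"]["documents"]             # placeholder db/collection names

keys = {'Application-Name', 'Author', 'resourceName', 'Line-Count',
        'Page-Count', 'Paragraph-Count', 'Word-Count'}

docs = []
for file in list_JsonFiles:
    with open(file, "r", encoding="utf8") as f:
        json_doc = json.load(f)
    docs.append({key: json_doc[0][key] for key in keys if key in json_doc[0]})

if docs:
    collection.insert_many(docs)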
I have a text file which is dropped into an s3 bucket (bucket_name_1). I would like to use AWS Lambda to remove the unwanted headers and footers in the file and write the result to another s3 bucket (bucket_name_2).
Sample of the file:
UNWANTED HEADER
UNWANTED HEADER
Date|FirstName|Surname|Age|
1/21/2020|JOHN|SMITH|45|
1/21/2020|EMMA|BROWN|29|
1/21/2020|FRANK|WILSON|37|
...
UNWANTED FOOTER
So far I have a Lambda which will read the file in:
import boto3

s3 = boto3.resource('s3')
client = boto3.client('s3')

def lambda_handler(event, context):
    bucket_name_1 = event['Records'][0]['s3']['bucket']['name']
    bucket_name_2 = 'output-bucket'
    key = event['Records'][0]['s3']['object']['key']
    obj = s3.Object(bucket_name_1, key)
    body = obj.get()['Body'].read()
    print(body)
I would recommend:
Download the file to /tmp/ using download_file()
Manipulate the file, or copy the desired lines to an 'output file'
Upload the resulting file to S3 using upload_file()
It would be something like this:
import boto3

def lambda_handler(event, context):
    s3_client = boto3.client('s3')

    bucket_in = event['Records'][0]['s3']['bucket']['name']
    bucket_out = 'output-bucket'
    key = event['Records'][0]['s3']['object']['key']

    filename_in = '/tmp/in.txt'
    filename_out = '/tmp/out.txt'

    # Download file
    s3_client.download_file(bucket_in, key, filename_in)

    # Remove headers and footers
    with open(filename_in, 'r') as file_in:
        with open(filename_out, 'w') as file_out:
            for line in file_in:
                # Put logic here for including/excluding lines from source file
                file_out.write(line)

    # Upload output file
    s3_client.upload_file(filename_out, bucket_out, key)
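For the sample file shown above, one possible way to fill in the 'logic here' placeholder in the handler is to keep only the pipe-delimited rows; this assumes the unwanted header and footer lines never contain a | character:
# Keep only the pipe-delimited data rows
with open(filename_in, 'r') as file_in:
    with open(filename_out, 'w') as file_out:
        for line in file_in:
            if '|' in line:
                file_out.write(line)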
I'm trying to read a gzip file from S3 - the "native" format of the file is a csv. Ultimately, after uncompressing the file, I'd like to be able to "see" the content so I can count the number of lines in the csv and keep track of that count.
My "basic" attempts are here - still just trying to print the contents of the file. This attempt just tells me that there is no such file or directory...
I know I'm also probably erroneously thinking the unzipped csv file will be in json format - but that's the next "issue" once I get to read the unzipped contents...
[Errno 2] No such file or directory: 'SMSUsageReports/eu-west-1/2018/01/02/001.csv.gz'
import gzip
import boto3
import json

s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
bucket = s3.Bucket('snssmsreports')

for obj in bucket.objects.filter(Prefix='SMSUsageReports/eu-west-1/2018/01/02'):
    json_object = s3_client.get_object(Bucket=bucket.name, Key=obj.key)
    file_name = obj.key
    obj = bucket.Object(file_name)
    file_body = obj.get()["Body"].read()
    # gzip stuff here
    f = gzip.open(file_name, 'rb')
    file_content = f.read()
    #print file_content
    #jsonFileReader = json_object['Body'].read()
    jsonDict = json.loads(file_content)
    #table = dynamodb.Table('SNS')
    #table.put_item(Item=jsonDict)
    print('{0}:{1}'.format(bucket.name, obj.key))
    print(jsonDict)
OK, so I updated my code as follows:
import zipfile
import gzip
import boto3
import io
import json
import pandas as pd

s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
bucket = s3.Bucket('snssmsreports')

for obj in bucket.objects.filter(Prefix='SMSUsageReports/eu-west-1/2018/01/02'):
    json_object = s3_client.get_object(Bucket=bucket.name, Key=obj.key)
    file_name = obj.key
    obj = bucket.Object(file_name)
    s3_client.download_file(bucket.name, file_name, '../../tmp/file.gz')
    gzip_name = '../../tmp/file.gz'
    # gzip stuff here
    with gzip.open(gzip_name, 'rb') as f:
        file_content = f.read()
    str_file = str(file_content)
    csvfile = open('../../tmp/testfile.csv', 'w')
    csvfile.write(str_file)
    csvfile.close()
    #table = dynamodb.Table('SNS')
    #table.put_item(Item=jsonDict)
    #pandas csv reader
    df1 = pd.read_csv('../../tmp/testfile.csv')
    print(df1)
    #print('{0}:{1}'.format(bucket.name, obj.key))
    #print(file_content)
This does not throw any errors anymore, but the output only has one row and 135 columns, so pandas is not liking the actual content of the csv - or is my conversion with str() not the right way to do it?
OK, the issue was how the file was opened for writing - to write bytes I had to open the file as wb...
csvfile = open('../../tmp/testfile.csv','wb')
csvfile.write(file_content)
csvfile.close()
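An alternative worth noting: pandas can decompress the gzipped bytes in memory, which avoids the temporary files entirely; a sketch, assuming the same bucket and prefix as above:
import io
import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('snssmsreports')

for obj in bucket.objects.filter(Prefix='SMSUsageReports/eu-west-1/2018/01/02'):
    file_body = obj.get()['Body'].read()  # gzipped bytes
    df = pd.read_csv(io.BytesIO(file_body), compression='gzip')
    print('{0}: {1} rows'.format(obj.key, len(df)))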