I have a file-like object received through an HTTP response, and I want to save it directly as a CSV in S3. I tried using the S3 bucket and its path, but I am facing an error saying no such file was found.
Can someone help me with the code here? r5 is the URL response and rr = r5.raw holds the response data object. This object has to be saved as a CSV into S3 directly.
Docs with an example of uploading an existing file: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html
If you have the file content as a string, use BytesIO to make it a file-like object:
import io
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket(bucket_name)  # bucket_name is the name of your bucket
stream = io.BytesIO(my_str_data.encode())  # boto3 only accepts bytes, so we need to encode
filename = "filename_here"
bucket.upload_fileobj(stream, filename)
Since you said you already have a file-like object, make sure it is a bytes file-like object and just put it in place of stream.
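For the raw HTTP response case from the question, here is a minimal sketch (the URL, bucket name, and key are placeholders; upload_fileobj only needs an object with a read() method that returns bytes, which the raw response stream provides):
import boto3
import requests

r5 = requests.get("https://example.com/data.csv", stream=True)  # placeholder URL
rr = r5.raw  # bytes file-like object
rr.decode_content = True  # optional: let urllib3 undo any transfer compression

s3 = boto3.resource("s3")
s3.Bucket("my-bucket").upload_fileobj(rr, "path/to/file.csv")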
I have the following Python function to write the given content to a bucket in Cloud Storage:
import gzip
from google.cloud import storage

def upload_to_cloud_storage(json):
    """Write to Cloud Storage."""
    # The contents to upload as a JSON string.
    contents = json

    storage_client = storage.Client()

    # Path and name of the file to upload (file doesn't yet exist).
    destination = "path/to/name.json.gz"

    # Gzip the contents before uploading
    with gzip.open(destination, "wb") as f:
        f.write(contents.encode("utf-8"))

    # Bucket
    my_bucket = storage_client.bucket('my_bucket')

    # Blob (content)
    blob = my_bucket.blob(destination)
    blob.content_encoding = 'gzip'

    # Write to storage
    blob.upload_from_string(contents, content_type='application/json')
However, I receive an error when running the function:
FileNotFoundError: [Errno 2] No such file or directory: 'path/to/name.json.gz'
Highlighting this line as the cause:
with gzip.open(destination, "wb") as f:
I can confirm that the bucket and path both exist although the file itself is new and to be written.
I can also confirm that removing the Gzipping part sees the file successfully written to Cloud Storage.
How can I gzip a new file and upload to Cloud Storage?
Other answers I've used for reference:
https://stackoverflow.com/a/54769937
https://stackoverflow.com/a/67995040
Although @David's answer wasn't complete at the time I solved my problem, it got me on the right track. Here's what I ended up using, along with the explanations I picked up along the way.
import gzip
from google.cloud import storage
from google.cloud.storage import fileio

def upload_to_cloud_storage(json_string):
    """Gzip and write to Cloud Storage."""
    storage_client = storage.Client()
    bucket = storage_client.bucket('my_bucket')

    # Filename (include path)
    blob = bucket.blob('path/to/file.json')

    # Set blob metadata for decompressive transcoding
    blob.content_encoding = 'gzip'
    blob.content_type = 'application/json'

    writer = fileio.BlobWriter(blob)

    # Must write as bytes
    gz = gzip.GzipFile(fileobj=writer, mode="wb")

    # When writing as bytes we must encode our JSON string.
    gz.write(json_string.encode('utf-8'))

    # Close connections
    gz.close()
    writer.close()
We use the GzipFile() class instead of the convenience method (gzip.compress()) so that we can pass in the mode. When trying to write using w or wt you will receive the error:
TypeError: memoryview: a bytes-like object is required, not 'str'
So we must write in binary mode (wb), which also gives us the .write() method. When doing so, however, we need to encode our JSON string, which can be done with str.encode() using utf-8. Failing to do this will also result in the same error.
Finally, I wanted to enable decompressive transcoding, where the requester (a browser in my case) receives the uncompressed version of the file when it is requested. To enable this, google.cloud.storage.blob allows you to set some metadata, including content_type and content_encoding, so we can follow best practices.
This writes the in-memory JSON object to your chosen destination in Cloud Storage in compressed form, and it is decompressed on the fly (without the requester needing to download a gzip archive).
Thanks also to @JohnHanley for the troubleshooting advice.
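As a quick check, a sketch reusing the bucket and blob path from the code above (with the client's default download options, gzip-encoded objects should come back decompressed):
from google.cloud import storage

client = storage.Client()
check_blob = client.bucket('my_bucket').blob('path/to/file.json')
print(check_blob.download_as_text())  # should print the original JSON string, decompressed on the fly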
The best solution is not to write the gzip to a file at all, but to compress and stream directly to GCS.
import gzip
from google.cloud import storage
from google.cloud.storage import fileio

storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
blob = bucket.blob('my_object')
writer = fileio.BlobWriter(blob)
gz = gzip.GzipFile(fileobj=writer, mode="w")  # use "wb" if bytes
gz.write(contents)
gz.close()
writer.close()
In AWS, I'm trying to save a file to S3 in Python using a Lambda function. While this works on my local computer, I am unable to get it to work in Lambda. I've been working on this problem for most of the day and would appreciate help. Thank you.
import boto3
import requests

def pdfToTable(PDFfilename, apiKey, fileExt, bucket, key):

    # parsing a PDF using an API
    fileData = (PDFfilename, open(PDFfilename, "rb"))
    files = {"f": fileData}
    postUrl = "https://pdftables.com/api?key={0}&format={1}".format(apiKey, fileExt)
    response = requests.post(postUrl, files=files)
    response.raise_for_status()

    # this code is probably the problem!
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('transportation.manifests.parsed')
    with open('/tmp/output2.csv', 'rb') as data:
        data.write(response.content)
        key = 'csv/' + key
        bucket.upload_fileobj(data, key)

    # FYI, on my own computer, this saves the file
    with open('output.csv', "wb") as f:
        f.write(response.content)
In S3, there is a bucket transportation.manifests.parsed containing the folder csv where the file should be saved.
The type of response.content is bytes.
From AWS, the error from the current set-up above is [Errno 2] No such file or directory: '/tmp/output2.csv': FileNotFoundError. In fact, my goal is to save the file to the csv folder under a unique name, so tmp/output2.csv might not be the best approach. Any guidance?
In addition, I've tried to use wb and w instead of rb also to no avail. The error with wb is Input <_io.BufferedWriter name='/tmp/output2.csv'> of type: <class '_io.BufferedWriter'> is not supported. The documentation suggests that using 'rb' is the recommended usage, but I do not understand why that would be the case.
Also, I've tried s3_client.put_object(Key=key, Body=response.content, Bucket=bucket) but receive An error occurred (404) when calling the HeadObject operation: Not Found.
Assuming Python 3.6. The way I usually do this is to wrap the bytes content in a BytesIO wrapper to create a file-like object. And, per the boto3 docs, you can use the transfer manager for a managed transfer:
from io import BytesIO
import boto3
s3 = boto3.client('s3')
fileobj = BytesIO(response.content)
s3.upload_fileobj(fileobj, 'mybucket', 'mykey')
If that doesn't work I'd double check all IAM permissions are correct.
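If you would rather skip the file-like wrapper entirely, put_object accepts the bytes directly (a sketch assuming the bucket name and key variable from the question):
import boto3

s3 = boto3.client('s3')
s3.put_object(Bucket='transportation.manifests.parsed', Key='csv/' + key, Body=response.content)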
You have a writable stream that you're asking boto3 to use as a readable stream which won't work.
Write the file, and then simply use bucket.upload_file() afterwards, like so:
s3 = boto3.resource('s3')
bucket = s3.Bucket('transportation.manifests.parsed')

with open('/tmp/output2.csv', 'wb') as data:
    data.write(response.content)

key = 'csv/' + key
bucket.upload_file('/tmp/output2.csv', key)
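If you also want the unique name mentioned in the question, one option (a sketch; the uuid-based naming is my assumption, not part of the original code) is to build the key before uploading:
import uuid

key = 'csv/{}.csv'.format(uuid.uuid4())
bucket.upload_file('/tmp/output2.csv', key)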
I feel kind of stupid right now. I have been reading numerous docs and Stack Overflow questions, but I can't get it right.
I have a file on Google Cloud Storage. It is in a bucket 'test_bucket'. Inside this bucket there is a folder, 'temp_files_folder', which contains two files, one .txt file named 'test.txt' and one .csv file named 'test.csv'. There are two files simply because I tried using both, but the result is the same either way.
The content in the files is
hej
san
and I am hoping to read it into Python the same way I would locally with
textfile = open("/file_path/test.txt", 'r')
times = textfile.read().splitlines()
textfile.close()
print(times)
which gives
['hej', 'san']
I have tried using
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('test_bucket')
blob = bucket.get_blob('temp_files_folder/test.txt')
print(blob.download_as_string)
but it gives the output
<bound method Blob.download_as_string of <Blob: test_bucket, temp_files_folder/test.txt>>
How can I get the actual string(s) in the file?
download_as_string is a method; you need to call it.
print(blob.download_as_string())
More likely, you want to assign it to a variable so that you download it once and can then print it and do whatever else you want with it:
downloaded_blob = blob.download_as_string()
print(downloaded_blob)
do_something_else(downloaded_blob)
The download_as_string() method will read in the content as bytes.
Below is an example that processes a .csv file.
import csv
from io import StringIO
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket(YOUR_BUCKET_NAME)
blob = bucket.blob(YOUR_FILE_NAME)

blob = blob.download_as_string()
blob = blob.decode('utf-8')  # transform bytes to string here
blob = StringIO(blob)  # wrap the string in a file-like object

names = csv.reader(blob)  # then use the csv library to read the content
for name in names:
    print(f"First Name: {name[0]}")
According to the documentation (https://googleapis.dev/python/storage/latest/blobs.html), as of the time of writing (2021/08) the download_as_string method is a deprecated alias for the download_as_bytes method, which, as suggested by the name, returns a bytes object.
You can instead use the download_as_text method to return a str object.
For instance, to download the file MYFILE from bucket MYBUCKET and store it as a utf-8 encoded string:
from google.cloud.storage import Client
client = Client()
bucket = client.get_bucket(MYBUCKET)
blob = bucket.get_blob(MYFILE)
downloaded_file = blob.download_as_text(encoding="utf-8")
You can then also use this to read different file formats. For json, replace the last line with:
import json
downloaded_json_file = json.loads(blob.download_as_text(encoding="utf-8"))
For yaml files, replace the last line with:
import yaml
downloaded_yaml_file = yaml.safe_load(blob.download_as_text(encoding="utf-8"))
DON'T USE: blob.download_as_string()
USE: blob.download_as_text()
blob.download_as_text() does indeed return a string.
blob.download_as_string() is deprecated and returns a bytes object instead of a string object.
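A minimal side-by-side sketch (bucket and file names are placeholders):
from google.cloud import storage

client = storage.Client()
blob = client.bucket('my_bucket').blob('my_file.txt')

text = blob.download_as_text()   # returns str
data = blob.download_as_bytes()  # returns bytes; download_as_string() is a deprecated alias for this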
This works when reading a .docx / text file:
from google.cloud import storage
# create storage client
storage_client = storage.Client.from_service_account_json('**PATH OF JSON FILE**')
bucket = storage_client.get_bucket('**BUCKET NAME**')
# get bucket data as blob
blob = bucket.blob('**SPECIFYING THE DOCX FILENAME**')
downloaded_blob = blob.download_as_string()
downloaded_blob = downloaded_blob.decode("utf-8")
print(downloaded_blob)
I have code that fetches an AWS S3 object. How do I read this StreamingBody with Python's csv.DictReader?
import boto3, csv
session = boto3.session.Session(aws_access_key_id=<>, aws_secret_access_key=<>, region_name=<>)
s3_resource = session.resource('s3')
s3_object = s3_resource.Object(<bucket>, <key>)
streaming_body = s3_object.get()['Body']
#csv.DictReader(???)
The code would be something like this:
import boto3
import csv
# get a handle on s3
s3 = boto3.resource(u's3')
# get a handle on the bucket that holds your file
bucket = s3.Bucket(u'bucket-name')
# get a handle on the object you want (i.e. your file)
obj = bucket.Object(key=u'test.csv')
# get the object
response = obj.get()
# read the contents of the file and split it into a list of lines
# for python 2:
lines = response[u'Body'].read().split()
# for python 3 you need to decode the incoming bytes:
lines = response['Body'].read().decode('utf-8').split()
# now iterate over those lines
for row in csv.DictReader(lines):
    # here you get a sequence of dicts
    # do whatever you want with each line here
    print(row)
You can compact this a bit in actual code, but I tried to keep it step-by-step to show the object hierarchy with boto3.
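For reference, a compacted version of the same steps (same placeholder bucket and key; this sketch uses splitlines() so lines containing spaces stay intact):
import boto3
import csv

lines = boto3.resource('s3').Object('bucket-name', 'test.csv').get()['Body'].read().decode('utf-8').splitlines()
for row in csv.DictReader(lines):
    print(row)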
Edit Per your comment about avoiding reading the entire file into memory: I haven't run into that requirement so can't speak authoritatively, but I would try wrapping the stream so I could get a text file-like iterator. For example, you could use the codecs library to replace the csv parsing section above with something like:
for row in csv.DictReader(codecs.getreader('utf-8')(response[u'Body'])):
    print(row)
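Putting that together with the streaming_body variable from the question, a self-contained sketch (codecs.getreader only needs the stream's read() method, so the whole file is never loaded into memory):
import codecs
import csv

reader = csv.DictReader(codecs.getreader('utf-8')(streaming_body))
for row in reader:
    print(row)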
I am using a Flask app to receive a multipart/form-data request with an uploaded file (a video, in this example).
I don't want to save the file in the local directory because this app will be running on a server, and saving it will slow things down.
I am trying to use the file object created by the Flask request.files[''] method, but it doesn't seem to be working.
Here is that portion of the code:
@bp.route('/video_upload', methods=['POST'])
def VideoUploadHandler():
    form = request.form
    video_file = request.files['video_data']
    if video_file:
        s3 = boto3.client('s3')
        s3.upload_file(video_file.read(), S3_BUCKET, 'video.mp4')
    return json.dumps('DynamoDB failure')
This returns an error:
TypeError: must be encoded string without NULL bytes, not str
on the line:
s3.upload_file(video_file.read(), S3_BUCKET, 'video.mp4')
I did get this to work by first saving the file and then accessing that saved file, so it's not an issue with catching the request file. This works:
video_file.save(form['video_id']+".mp4")
s3.upload_file(form['video_id']+".mp4", S3_BUCKET, form['video_id']+".mp4")
What would be the best method to handle this file data in memory and pass it to the s3.upload_file() method? I am using the boto3 methods here, and I am only finding examples with the filename used in the first parameter, so I'm not sure how to process this correctly using the file in memory. Thanks!
First you need to be able to access the raw data sent to Flask. This is not as easy as it seems, since you're reading a form. To be able to read the raw stream you can use flask.request.stream, which behaves similarly to StringIO. The trick here is that you cannot touch request.form or request.files, because accessing those attributes will load the whole stream into memory or into a file.
You'll need some extra work to extract the right part of the stream (which unfortunately I cannot help you with because it depends on how your form is made, but I'll let you experiment with this).
Finally, you can use the set_contents_from_file function from boto, since upload_file does not seem to deal with file-like objects (StringIO and such).
Example code:
import boto
import boto.s3.connection
from boto.s3.key import Key

@bp.route('/video_upload', methods=['POST'])
def VideoUploadHandler():
    # form = request.form <- Don't do that
    # video_file = request.files['video_data'] <- Don't do that either
    video_file_and_metadata = request.stream  # This is a file-like object which does not only contain your video file

    # This is what you need to implement
    video_title, video_stream = extract_title_stream(video_file_and_metadata)

    # Then, upload to the bucket (classic boto, to match set_contents_from_file above)
    conn = boto.connect_s3()
    bucket = conn.create_bucket(bucket_name, location=boto.s3.connection.Location.DEFAULT)
    k = Key(bucket)
    k.key = video_title
    k.set_contents_from_file(video_stream)
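As an aside, and not part of the boto example above: if streaming the raw request is not a hard requirement, newer boto3 versions offer upload_fileobj, which accepts a file-like object directly, so the Flask FileStorage stream could be passed without saving to disk (a sketch reusing the names from the question; note Werkzeug may still buffer large uploads to a temporary file while parsing the form):
import boto3

video_file = request.files['video_data']  # werkzeug FileStorage
s3 = boto3.client('s3')
s3.upload_fileobj(video_file.stream, S3_BUCKET, 'video.mp4')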