How to access dataset from S3 bucket in Python?

I have a Jupyter Notebook and I want to access a dataset that is in an S3 bucket (it is publicly accessible).
response = s3.list_objects_v2(Bucket='sagemaker-eu-central-1-261218592922')
for content in response['Contents']:
    obj_dict = s3.get_object(Bucket='sagemaker-eu-central-1-261218592922', Key=content['Key'])
    print(obj_dict)
I am using a boto3 client (s3). The code above goes through the contents of the bucket, but how does one access the contents of each file?

Without knowing what the files contain or what you want to do with them afterwards, I can only offer a generic solution:
response = s3.list_objects_v2(Bucket='sagemaker-eu-central-1-261218592922')
for content in response['Contents']:
    obj_dict = s3.get_object(Bucket='sagemaker-eu-central-1-261218592922', Key=content['Key'])
    # 'Body' is a streaming object: read() returns bytes, decode() turns them into a str
    contents = obj_dict['Body'].read().decode('utf-8')
More details on this can be found at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.get_object
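Two details are worth keeping in mind here: `list_objects_v2` returns at most 1,000 keys per call, and a publicly readable bucket can be read with an unsigned (anonymous) client. The following is a minimal sketch of both points, assuming every object is UTF-8 text. The helper names `iter_object_texts` and `make_anonymous_client` are hypothetical and not part of boto3.

```python
def iter_object_texts(s3_client, bucket, encoding="utf-8"):
    """Yield (key, text) for every object in the bucket, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is missing from a page that holds no keys.
        for item in page.get("Contents", []):
            obj = s3_client.get_object(Bucket=bucket, Key=item["Key"])
            yield item["Key"], obj["Body"].read().decode(encoding)


def make_anonymous_client():
    """Build an unsigned S3 client (assumption: the bucket allows public reads)."""
    # Imported lazily so the helper above can be used with any client object.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    return boto3.client("s3", config=Config(signature_version=UNSIGNED))
```

Usage would look like `for key, text in iter_object_texts(make_anonymous_client(), 'sagemaker-eu-central-1-261218592922'): ...`. For binary objects, drop the `decode` step and work with the bytes directly.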

Related

How to list all the contents in a given key on Amazon S3 using Apache Libcloud?

The code to list contents in S3 using boto3 is known:
self.s3_client = boto3.client(
    u's3',
    aws_access_key_id=config.AWS_ACCESS_KEY_ID,
    aws_secret_access_key=config.AWS_SECRET_ACCESS_KEY,
    region_name=config.region_name,
    config=Config(signature_version='s3v4')
)
versions = self.s3_client.list_objects(Bucket=self.bucket_name, Prefix=self.package_s3_version_key)
However, I need to list contents on S3 using libcloud. I could not find it in the documentation.

If you are just looking for all the contents of a specific bucket:
from libcloud.storage.types import Provider
from libcloud.storage.providers import get_driver
client = get_driver(Provider.S3)  # get_driver returns the driver class for the provider
s3 = client(aws_id, aws_secret)
container = s3.get_container(container_name='name')
objects = s3.list_container_objects(container)
s3.download_object(objects[0], '/path/to/download')
The resulting objects list contains every key in that bucket, with its filename, byte size, and metadata. To download one, call the download_object method on s3 with the full libcloud Object and your destination file path.
If you'd rather get the objects of all buckets, call list_containers (with no parameters) instead of get_container and list the objects of each container it returns.
Information for all driver methods: https://libcloud.readthedocs.io/en/latest/storage/api.html
Short examples specific to s3: https://libcloud.readthedocs.io/en/latest/storage/drivers/s3.html
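The "all objects of all buckets" variant can be sketched as a loop over `list_containers()`. This is a minimal sketch; the helper name `list_all_objects` is hypothetical, and it expects an already constructed libcloud storage driver such as the `s3` object above.

```python
def list_all_objects(driver):
    """Return {container name: [object names]} for every visible container."""
    listing = {}
    for container in driver.list_containers():
        objects = driver.list_container_objects(container)
        listing[container.name] = [obj.name for obj in objects]
    return listing
```

Called as `list_all_objects(s3)`, this issues one listing request per container, so it may be slow for accounts with many buckets.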

GCS - Read a text file from Google Cloud Storage directly into python

I feel kind of stupid right now. I have been reading a lot of documentation and Stack Overflow questions, but I can't get it right.
I have a file on Google Cloud Storage. It is in a bucket 'test_bucket'. Inside this bucket there is a folder, 'temp_files_folder', which contains two files, one .txt file named 'test.txt' and one .csv file named 'test.csv'. There are two files simply because I tried using both, but the result is the same either way.
The content in the files is
hej
san
and I am hoping to read it into Python the same way I would locally with
textfile = open("/file_path/test.txt", 'r')
times = textfile.read().splitlines()
textfile.close()
print(times)
which gives
['hej', 'san']
I have tried using
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('test_bucket')
blob = bucket.get_blob('temp_files_folder/test.txt')
print(blob.download_as_string)
but it gives the output
<bound method Blob.download_as_string of <Blob: test_bucket, temp_files_folder/test.txt>>
How can I get the actual string(s) in the file?

download_as_string is a method; you need to call it.
print(blob.download_as_string())
More likely, you want to assign it to a variable so that you download it once and can then print it and do whatever else you want with it:
downloaded_blob = blob.download_as_string()
print(downloaded_blob)
do_something_else(downloaded_blob)

The method 'download_as_string()' reads the content in as bytes.
Below is an example that processes a .csv file.
import csv
from io import StringIO
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket(YOUR_BUCKET_NAME)
blob = bucket.blob(YOUR_FILE_NAME)
blob = blob.download_as_string()
blob = blob.decode('utf-8')  # transform bytes to string here
blob = StringIO(blob)  # wrap the string in a file-like object for csv.reader
names = csv.reader(blob)  # then use the csv library to read the content
for name in names:
    print(f"First Name: {name[0]}")

According to the documentation (https://googleapis.dev/python/storage/latest/blobs.html), as of the time of writing (2021/08), the download_as_string method is a deprecated alias for the download_as_bytes method which, as the name suggests, returns a bytes object.
You can instead use the download_as_text method to return a str object.
For instance, to download the file MYFILE from bucket MYBUCKET and store it as a string decoded from UTF-8:
from google.cloud.storage import Client
client = Client()
bucket = client.get_bucket(MYBUCKET)
blob = bucket.get_blob(MYFILE)
downloaded_file = blob.download_as_text(encoding="utf-8")
You can then also use this to read different file formats. For JSON, replace the last line with
import json
downloaded_json_file = json.loads(blob.download_as_text(encoding="utf-8"))
For YAML files, replace the last line with:
import yaml
downloaded_yaml_file = yaml.safe_load(blob.download_as_text(encoding="utf-8"))

DON'T USE: blob.download_as_string()
USE: blob.download_as_text()
blob.download_as_text() does indeed return a string.
blob.download_as_string() is deprecated and returns a bytes object instead of a string object.
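To get the same `['hej', 'san']` list as with a local file, the text returned by `download_as_text` can be split into lines. This is a minimal sketch; the helper name `blob_lines` is hypothetical, and it assumes the blob holds UTF-8 text and that the installed google-cloud-storage version provides `download_as_text`.

```python
def blob_lines(blob, encoding="utf-8"):
    """Return the blob's content as a list of lines, like file.read().splitlines()."""
    return blob.download_as_text(encoding=encoding).splitlines()


# Usage with the bucket layout from the question (requires credentials):
#   from google.cloud import storage
#   blob = storage.Client().get_bucket('test_bucket').get_blob('temp_files_folder/test.txt')
#   print(blob_lines(blob))
```

Because the helper only calls `download_as_text`, it can be exercised with a stand-in object in tests without touching the network.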

This works when reading a plain-text file. (A .docx file is a ZIP archive, so decoding its bytes as UTF-8 will generally fail; keep the raw bytes and use a DOCX parser for those.)
from google.cloud import storage
# create storage client
storage_client = storage.Client.from_service_account_json('**PATH OF JSON FILE**')
bucket = storage_client.get_bucket('**BUCKET NAME**')
# get bucket data as blob
blob = bucket.blob('**FILE NAME**')
downloaded_blob = blob.download_as_string()
downloaded_blob = downloaded_blob.decode("utf-8")
print(downloaded_blob)

Uploading files to S3 bucket - Python Django

I want to put a string (which is an XML response) into a file and then upload it to an Amazon S3 bucket. The following is my function in Python 3:
def excess_data(user, resume):
    res_data = BytesIO(bytes(resume, "UTF-8"))
    file_name = 'name_of_file/{}_resume.xml'.format(user.id)
    s3 = S3Storage(bucket="Name of bucket", endpoint='s3-us-west-2.amazonaws.com')
    response = s3.save(file_name, res_data)
    url = s3.url(file_name)
    print(url)
    print(response)
    return get_bytes_from_url(url)
However, when I run this script I keep getting the error - AttributeError: Unable to determine the file's size. Any idea how to resolve this?
Thanks in advance!

I am using s3boto (requires python-boto) in my Django projects to connect to S3.
With boto it is easy to do what you want:
>>> from boto.s3.key import Key
>>> k = Key(bucket)
>>> k.key = 'foobar'
>>> k.set_contents_from_string('This is a test of S3')
See the Storing Data section of the boto documentation.
I hope that helps with your problem.
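With boto3 (the successor to boto), a string can also be uploaded as bytes through `put_object`, so no file object whose size has to be determined is involved. This is a minimal sketch under that assumption; the helper name `upload_xml_string` is hypothetical and the client is passed in by the caller.

```python
def upload_xml_string(s3_client, bucket, key, xml_text):
    """Upload a string as an S3 object and return the number of bytes sent."""
    body = xml_text.encode("utf-8")  # bytes have a known length
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="application/xml",
    )
    return len(body)
```

In the question's function this could be called as `upload_xml_string(boto3.client('s3'), 'Name of bucket', file_name, resume)`, where the bucket name is the placeholder from the question.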

Flask - Handling Form File & Upload to AWS S3 without Saving to File

I am using a Flask app to receive a multipart/form-data request with an uploaded file (a video, in this example).
I don't want to save the file in the local directory because this app will be running on a server, and saving it will slow things down.
I am trying to use the file object created by the Flask request.files[''] method, but it doesn't seem to be working.
Here is that portion of the code:
@bp.route('/video_upload', methods=['POST'])
def VideoUploadHandler():
    form = request.form
    video_file = request.files['video_data']
    if video_file:
        s3 = boto3.client('s3')
        s3.upload_file(video_file.read(), S3_BUCKET, 'video.mp4')
    return json.dumps('DynamoDB failure')
This returns an error:
TypeError: must be encoded string without NULL bytes, not str
on the line:
s3.upload_file(video_file.read(), S3_BUCKET, 'video.mp4')
I did get this to work by first saving the file and then accessing that saved file, so it's not an issue with catching the request file. This works:
video_file.save(form['video_id']+".mp4")
s3.upload_file(form['video_id']+".mp4", S3_BUCKET, form['video_id']+".mp4")
What would be the best method to handle this file data in memory and pass it to the s3.upload_file() method? I am using the boto3 methods here, and I am only finding examples with the filename used in the first parameter, so I'm not sure how to process this correctly using the file in memory. Thanks!

First you need to be able to access the raw data sent to Flask. This is not as easy as it seems, since you're reading a form. To be able to read the raw stream you can use flask.request.stream, which behaves similarly to StringIO. The trick here is that you cannot access request.form or request.files, because accessing those attributes will load the whole stream into memory or into a file.
You'll need some extra work to extract the right part of the stream (which unfortunately I cannot help you with because it depends on how your form is made, but I'll let you experiment with this).
Finally you can use the set_contents_from_file function from boto, since upload_file does not seem to deal with file-like objects (StringIO and such).
Example code:
import boto
import boto.s3.connection
from boto.s3.key import Key
from flask import request
@bp.route('/video_upload', methods=['POST'])
def VideoUploadHandler():
    # form = request.form <- Don't do that
    # video_file = request.files['video_data'] <- Don't do that either
    video_file_and_metadata = request.stream  # This is a file-like object which does not only contain your video file
    # This is what you need to implement
    video_title, video_stream = extract_title_stream(video_file_and_metadata)
    # Then, upload to the bucket (using boto throughout, not boto3)
    conn = boto.connect_s3()
    bucket = conn.create_bucket(bucket_name, location=boto.s3.connection.Location.DEFAULT)
    k = Key(bucket)
    k.key = video_title
    k.set_contents_from_file(video_stream)  # takes a file-like object, not a filename
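If staying with boto3 is preferred, its S3 client also offers `upload_fileobj`, which accepts a binary file-like object instead of a filename. This is a minimal sketch; the helper name `upload_stream` is hypothetical, and it assumes that letting Flask buffer the upload (as `request.files` does) is acceptable.

```python
import io


def upload_stream(s3_client, fileobj, bucket, key):
    """Send a binary file-like object to S3 without saving it to a local path."""
    # upload_fileobj reads from the object in chunks; it must be opened in binary mode.
    s3_client.upload_fileobj(fileobj, bucket, key)


# Possible use inside the Flask view from the question (names taken from there):
#   video_file = request.files['video_data']
#   upload_stream(boto3.client('s3'), video_file.stream, S3_BUCKET, 'video.mp4')

# Any in-memory binary buffer works the same way:
example_buffer = io.BytesIO(b"\x00\x01 raw video bytes")
```

The original `TypeError` came from passing the file's bytes to `upload_file`, which expects a path string as its first argument.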

Python Django : Creating file object in memory without actually creating a file

I have an endpoint where I want to collect the response data and dump it into a file on S3 like this - https://stackoverflow.com/a/18731115/4824482
This is how I was trying to do it -
file_obj = open('/some/path/log.csv', 'w+')
file_obj.write(request.POST['data'])
and then passing file_obj to the S3 related code as in the above link.
The problem is that I don't have permissions to create a file on the server. Is there any way I can create a file object just in memory and then pass it to the S3 code?

This is probably a duplicate of the question "How to upload a file to S3 without creating a temporary local file". You will find the best suggestions in the answers to that question.
In short, the answer is the code below:
from boto.s3.key import Key
k = Key(bucket)
k.key = 'yourkey'
k.set_contents_from_string(request.POST['data'])

Try tempfile: https://docs.python.org/2/library/tempfile.html
import tempfile
f = tempfile.TemporaryFile()
f.write(request.POST['data'])
f.seek(0)  # rewind so the upload code reads from the beginning
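If no file may be created at all, the standard library's `io.BytesIO` provides a file object that lives purely in memory. This is a minimal sketch; `make_in_memory_file` is a hypothetical helper, and the CSV text stands in for `request.POST['data']`.

```python
import io


def make_in_memory_file(text, encoding="utf-8"):
    """Write text into a binary file-like object held entirely in memory."""
    buffer = io.BytesIO()
    buffer.write(text.encode(encoding))
    buffer.seek(0)  # rewind so the consumer reads from the beginning
    return buffer


file_obj = make_in_memory_file("id,name\n1,alice\n")
print(file_obj.read())
```

The returned object supports `read()`, `seek()`, and `tell()`, so it can usually be passed wherever upload code expects an open binary file.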
