Uploading files to S3 bucket - Python Django

I want to put a string (which is an XML response) into a file and then upload it to an Amazon S3 bucket. The following is my function, in Python 3:
from io import BytesIO
# S3Storage import path assumes the django-s3-storage package
from django_s3_storage.storage import S3Storage

def excess_data(user, resume):
    res_data = BytesIO(bytes(resume, "UTF-8"))
    file_name = 'name_of_file/{}_resume.xml'.format(user.id)
    s3 = S3Storage(bucket="Name of bucket", endpoint='s3-us-west-2.amazonaws.com')
    response = s3.save(file_name, res_data)
    url = s3.url(file_name)
    print(url)
    print(response)
    return get_bytes_from_url(url)
However, when I run this script I keep getting the error - AttributeError: Unable to determine the file's size. Any idea how to resolve this?
Thanks in advance!
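A common cause of this error is handing the storage backend a raw BytesIO, which carries no size information. One possible fix (a sketch, not verified against this exact backend) is to wrap the bytes in Django's ContentFile, which does expose a size attribute:

from django.core.files.base import ContentFile

# ContentFile wraps raw bytes and reports .size, which the backend
# needs to determine the upload size
res_data = ContentFile(bytes(resume, "UTF-8"))
response = s3.save(file_name, res_data)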

I am using s3boto (requires python-boto) in my Django projects to connect with S3.
With boto it is easy to do what you want:
>>> import boto
>>> from boto.s3.key import Key
>>> conn = boto.connect_s3()  # reads credentials from the environment or boto config
>>> bucket = conn.get_bucket('your-bucket-name')
>>> k = Key(bucket)
>>> k.key = 'foobar'
>>> k.set_contents_from_string('This is a test of S3')
See the boto documentation, in particular the section on Storing Data.
I hope that it helps with your problem.
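If you are on boto3 instead, the equivalent upload is just as short (a sketch; the bucket name is a placeholder):

import boto3

s3 = boto3.client('s3')
# put_object writes the given bytes directly under the given key
s3.put_object(Bucket='your-bucket-name', Key='foobar', Body=b'This is a test of S3')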

How to access dataset from S3 bucket in Python?

I have a Jupyter Notebook and I want to access a dataset that is in an S3 bucket (it is publicly accessible).
response = s3.list_objects_v2(Bucket='sagemaker-eu-central-1-261218592922')
for content in response['Contents']:
    obj_dict = s3.get_object(Bucket='sagemaker-eu-central-1-261218592922', Key=content['Key'])
    print(obj_dict)
I am using a boto3 client (s3). OK, so I go through the contents of the bucket with the code above, but how does one access the contents of the file?
Without knowing what the contents of the file are, or what you want to do afterwards, I can only offer a generic solution:
response = s3.list_objects_v2(Bucket='sagemaker-eu-central-1-261218592922')
for content in response['Contents']:
    obj_dict = s3.get_object(Bucket='sagemaker-eu-central-1-261218592922', Key=content['Key'])
    contents = obj_dict['Body'].read().decode('utf-8')
More details on this can be found at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.get_object
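Once you have the decoded string, hand it to whatever parser matches your data; for example, if the objects hold JSON (a sketch, since the question doesn't say what format the files are in):

import json

# parse the decoded body produced by the loop above
record = json.loads(contents)
print(record)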

How to read XML file into Sagemaker Notebook Instance? [duplicate]

I want to write a Python script that will read and write files from S3 using their URLs, e.g. 's3:/mybucket/file'. It would need to run locally and in the cloud without any code changes. Is there a way to do this?
Edit: There are some good suggestions here but what I really want is something that allows me to do this:
myfile = open("s3://mybucket/file", "r")
and then use that file object like any other file object. That would be really cool. I might just write something like this for myself if it doesn't exist. I could build that abstraction layer on simples3 or boto.
For opening, it should be as simple as:
import urllib

opener = urllib.URLopener()  # Python 2 API; see the Python 3 sketch below
myurl = "https://s3.amazonaws.com/skyl/fake.xyz"
myfile = opener.open(myurl)
This will work with s3 if the file is public.
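On Python 3 the same idea is usually written with urllib.request.urlopen (a sketch, reusing the placeholder URL from above):

from urllib.request import urlopen

myurl = "https://s3.amazonaws.com/skyl/fake.xyz"
myfile = urlopen(myurl)  # returns a file-like object
data = myfile.read()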
To write a file using boto, it goes a little something like this:
from boto.s3.connection import S3Connection

conn = S3Connection(AWS_KEY, AWS_SECRET)    # your AWS credentials
bucket = conn.get_bucket(BUCKET)            # name of the target bucket
destination = bucket.new_key()
destination.name = filename                 # key under which to store the file
destination.set_contents_from_file(myfile)
destination.make_public()
lemme know if this works for you :)
Here's how they do it in awscli:

def find_bucket_key(s3_path):
    """
    This is a helper function that given an s3 path such that the path is of
    the form: bucket/key
    It will return the bucket and the key represented by the s3 path
    """
    s3_components = s3_path.split('/')
    bucket = s3_components[0]
    s3_key = ""
    if len(s3_components) > 1:
        s3_key = '/'.join(s3_components[1:])
    return bucket, s3_key

def split_s3_bucket_key(s3_path):
    """Split s3 path into bucket and key prefix.
    This will also handle the s3:// prefix.
    :return: Tuple of ('bucketname', 'keyname')
    """
    if s3_path.startswith('s3://'):
        s3_path = s3_path[5:]
    return find_bucket_key(s3_path)
Which you could just use with code like this:

from awscli.customizations.s3.utils import split_s3_bucket_key
import boto3

client = boto3.client('s3')
bucket_name, key_name = split_s3_bucket_key(
    's3://example-bucket-name/path/to/example.txt')
response = client.get_object(Bucket=bucket_name, Key=key_name)
This doesn't address the goal of interacting with an s3 key as a file-like object, but it's a step in that direction.
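For reading, you can get most of the way to a file-like object by combining that helper with get_object, since the body boto3 returns already supports read() (a sketch; s3_open is a hypothetical helper name, not part of any library):

import boto3
from awscli.customizations.s3.utils import split_s3_bucket_key

def s3_open(s3_url):
    """Return a read-only, file-like object for an s3:// URL."""
    bucket_name, key_name = split_s3_bucket_key(s3_url)
    response = boto3.client('s3').get_object(Bucket=bucket_name, Key=key_name)
    return response['Body']  # the botocore StreamingBody supports read()

myfile = s3_open('s3://example-bucket-name/path/to/example.txt')
print(myfile.read())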
I haven't seen anything that works directly with S3 URLs, but you could use an S3 access library (simples3 looks decent) and some simple string manipulation:

>>> url = "s3:/bucket/path/"
>>> _, path = url.split(":", 1)
>>> path = path.lstrip("/")
>>> bucket, path = path.split("/", 1)
>>> bucket
'bucket'
>>> path
'path/'
Try s3fs.
The first example in the docs:
>>> import s3fs
>>> fs = s3fs.S3FileSystem(anon=True)
>>> fs.ls('my-bucket')
['my-file.txt']
>>> with fs.open('my-bucket/my-file.txt', 'rb') as f:
... print(f.read())
b'Hello, world'
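Writing works the same way, assuming your AWS credentials are configured (anonymous access is read-only); a sketch:

>>> fs = s3fs.S3FileSystem()
>>> with fs.open('my-bucket/new-file.txt', 'wb') as f:
...     f.write(b'Hello from s3fs')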
You can use the boto Python API for accessing S3 from Python. It's a good library. After you install boto, the following sample program will work for you:

>>> import boto
>>> from boto.s3.key import Key
>>> b = boto.connect_s3().get_bucket('yourbucket')
>>> k = Key(b)
>>> k.key = 'yourfile'
>>> k.set_contents_from_filename('yourfile.txt')

You can find more information here: http://boto.cloudhackers.com/s3_tut.html#storing-data
http://s3tools.org/s3cmd works pretty well and supports the s3:// form of the URL structure you want. It does the business on Linux and Windows. If you need a native API to call from within a Python program then http://code.google.com/p/boto/ is a better choice.

Load JSON from s3 inside aws glue pyspark job

I am trying to retrieve a JSON file from an S3 bucket inside a Glue PySpark script.
I am running this function in the job inside AWS Glue:
def run(spark):
    s3_bucket_path = 's3://bucket/data/file.gz'
    df = spark.read.json(s3_bucket_path)
    df.show()
After this I am getting:
AnalysisException: u'Path does not exist: s3://bucket/data/file.gz;'
I searched for this issue and did not find anything similar enough to infer where the issue is. I think there might be permission issues accessing the bucket, but then the error message should be different.
You can try this:

import json
import boto3

s3 = boto3.client("s3", region_name="us-west-2",
                  aws_access_key_id="", aws_secret_access_key="")
jsonFile = s3.get_object(Bucket=bucket, Key=key)
jsonObject = json.load(jsonFile["Body"])

where Key is the full path to your file in the bucket; then use this jsonObject in spark.read.json(jsonObject).
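Note that spark.read.json() expects a path or an RDD of JSON strings rather than a parsed Python object, so one way to turn the fetched text into a DataFrame is (a sketch):

# re-fetch the object and read its body as text
jsonText = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
df = spark.read.json(spark.sparkContext.parallelize([jsonText]))
df.show()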

Writing a file to S3 using Lambda in Python with AWS

In AWS, I'm trying to save a file to S3 in Python using a Lambda function. While this works on my local computer, I am unable to get it to work in Lambda. I've been working on this problem for most of the day and would appreciate help. Thank you.
import boto3
import requests

def pdfToTable(PDFfilename, apiKey, fileExt, bucket, key):

    # parsing a PDF using an API
    fileData = (PDFfilename, open(PDFfilename, "rb"))
    files = {"f": fileData}
    postUrl = "https://pdftables.com/api?key={0}&format={1}".format(apiKey, fileExt)
    response = requests.post(postUrl, files=files)
    response.raise_for_status()

    # this code is probably the problem!
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('transportation.manifests.parsed')
    with open('/tmp/output2.csv', 'rb') as data:
        data.write(response.content)
        key = 'csv/' + key
        bucket.upload_fileobj(data, key)

    # FYI, on my own computer, this saves the file
    with open('output.csv', "wb") as f:
        f.write(response.content)
In S3, there is a bucket transportation.manifests.parsed containing the folder csv where the file should be saved.
The type of response.content is bytes.
From AWS, the error from the current set-up above is [Errno 2] No such file or directory: '/tmp/output2.csv': FileNotFoundError. In fact, my goal is to save the file to the csv folder under a unique name, so tmp/output2.csv might not be the best approach. Any guidance?
In addition, I've tried wb and w instead of rb, also to no avail. The error with wb is Input <_io.BufferedWriter name='/tmp/output2.csv'> of type: <class '_io.BufferedWriter'> is not supported. The documentation suggests that 'rb' is the recommended usage, but I do not understand why that would be the case.
Also, I've tried s3_client.put_object(Key=key, Body=response.content, Bucket=bucket) but receive An error occurred (404) when calling the HeadObject operation: Not Found.
Assuming Python 3.6: the way I usually do this is to wrap the bytes content in a BytesIO wrapper to create a file-like object. And, per the boto3 docs, you can use the transfer manager for a managed transfer:
from io import BytesIO
import boto3
s3 = boto3.client('s3')
fileobj = BytesIO(response.content)
s3.upload_fileobj(fileobj, 'mybucket', 'mykey')
If that doesn't work I'd double check all IAM permissions are correct.
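If you need more control over the managed transfer (multipart threshold, concurrency), you can pass a TransferConfig (a sketch; the value shown is an arbitrary example):

from boto3.s3.transfer import TransferConfig

config = TransferConfig(multipart_threshold=8 * 1024 * 1024)  # 8 MB, arbitrary
s3.upload_fileobj(fileobj, 'mybucket', 'mykey', Config=config)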
You have a writable stream that you're asking boto3 to use as a readable stream, which won't work.
Write the file first, and then use bucket.upload_file() afterwards, like so:

s3 = boto3.resource('s3')
bucket = s3.Bucket('transportation.manifests.parsed')
with open('/tmp/output2.csv', 'wb') as data:  # 'wb', since response.content is bytes
    data.write(response.content)

key = 'csv/' + key
bucket.upload_file('/tmp/output2.csv', key)

Python Django : Creating file object in memory without actually creating a file

I have an endpoint where I want to collect the response data and dump it into a file on S3 like this - https://stackoverflow.com/a/18731115/4824482
This is how I was trying to do it -
file_obj = open('/some/path/log.csv', 'w+')
file_obj.write(request.POST['data'])
and then passing file_obj to the S3 related code as in the above link.
The problem is that I don't have permissions to create a file on the server. Is there any way I can create a file object just in memory and then pass it to the S3 code?
That's probably a duplicate of the question How to upload a file to S3 without creating a temporary local file. You'll find the best suggestions by checking out the answers to that question.
In short, the answer is the code below:
from boto.s3.key import Key
k = Key(bucket)
k.key = 'yourkey'
k.set_contents_from_string(request.POST['data'])
Try tempfile: https://docs.python.org/2/library/tempfile.html

import tempfile

f = tempfile.TemporaryFile()
f.write(request.POST['data'])
f.seek(0)  # rewind before handing the file object to the S3 code
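Another purely in-memory option is io.StringIO, which never touches the filesystem at all (a sketch):

from io import StringIO

file_obj = StringIO()                # in-memory text file, nothing written to disk
file_obj.write(request.POST['data'])
file_obj.seek(0)                     # rewind so the S3 code reads from the beginning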
