Amazon AWS boto S3 Changing file contents - python

I'm trying to upload an mp4 file to an Amazon S3 server, and it uploads. When I download it, the file is exactly the same size, but bytes 42, 43 and 44 have changed.
Looking that up, those bytes fall inside the mdat atom: http://www.file-recovery.com/mp4-signature-format.htm
I tried changing the MIME type to various values and even changed the extension to exe. No luck.
When I try to play back the video, it does not work.
I'm using the boto python framework to do this.
Any ideas?
# Uploads a file to a bucket
def upload_file_to_bucket(self, file_path_on_disk, file_name_in_bucket, bucket_name):
    self.remove_file_from_bucket(file_name_in_bucket, bucket_name)
    if os.path.exists(file_path_on_disk):
        print "Uploading " + file_path_on_disk + " to " + file_name_in_bucket + " in bucket " + bucket_name
        bucket = self.get_connection().get_bucket(bucket_name)
        k = bucket.new_key(file_name_in_bucket)
        k.set_contents_from_filename(file_path_on_disk, headers={'Content-Type': 'application/octet-stream', 'Body': 'data'})  # also tried {'Content-Disposition': 'attachment', 'Content-Type': 'video/mp4'} and {'Content-Type': 'application/octet-stream'}
        k.set_acl('public-read')
        print "Uploaded file to amazon server " + bucket_name
    else:
        print "File does not exist so cannot upload " + file_path_on_disk
Edit:
Looking further, it seems there is a lot more corruption than that. A dodgy load balancer?
Is there a way to ensure the file is uploaded correctly? At the moment, it's always wrong.
EDIT:
In case anyone else encounters this: it was due to the file not being fully written to disk before being uploaded.
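For reference, a minimal sketch of making sure the data is flushed to disk before calling the upload helper above (video_bytes and uploader are hypothetical placeholders, not part of the original code):

import os

with open(file_path_on_disk, 'wb') as f:
    f.write(video_bytes)    # placeholder for the code that produces the mp4
    f.flush()               # push Python's internal buffer to the OS
    os.fsync(f.fileno())    # ask the OS to write the data to disk
# only hand the file to boto after the handle above is closed
uploader.upload_file_to_bucket(file_path_on_disk, file_name_in_bucket, bucket_name)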

The right content type is video/mp4. You can set it like this:
data = open('file.mp4', 'rb')
bucket.put_object(Key='file.mp4', Body=data, ContentType='video/mp4')
Make sure the type is set correctly after the upload: in the S3 console, navigate to your file and check Properties / Metadata. Note that video/mp4 is not in the drop-down list of available content types in the AWS console, but you can force it by typing the value you want.
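If you'd rather check the stored metadata programmatically than in the console, a small boto3 sketch (bucket and key names are placeholders):

import boto3

s3 = boto3.client('s3')
# head_object returns the stored metadata without downloading the body
info = s3.head_object(Bucket='my-bucket', Key='file.mp4')
print(info['ContentType'])    # should print video/mp4
print(info['ContentLength'])  # size of the stored object in bytes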

Related

Uploading image string to Google Drive using pydrive

I need to upload an image string (as the one you get from requests.get(url).content) to google drive using the PyDrive package. I checked a similar question but the answer accepted there was to save it in a temporary file on a local drive and then upload that.
However, I cannot do that because of local storage and permission restrictions.
The accepted answer was previously to use SetContentString(image_string.decode('utf-8')) since
SetContentString requires a parameter of type str not bytes.
However the error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte came up, as in the comments on that answer.
Is there any way to do this without using a temporary file, using PIL/BytesIO/anything that can convert it to be uploaded correctly as a string or somehow using PIL manipulated as an image and uploaded using SetContentFile()?
A basic example of what I'm trying to do is:
img_content = requests.get('https://i.imgur.com/A5gIh7W.jpeg').content
file = drive.CreateFile({...})
file.SetContentString(img_content.decode('utf-8'))
file.Upload()
Looking at the pydrive documentation (Upload and update file content), it says the following.
Managing file content is as easy as managing file metadata. You can set file content with either SetContentFile(filename) or SetContentString(content) and call Upload() just as you did to upload or update file metadata.
I also searched for a method for directly uploading binary data to Google Drive, but I couldn't find one, so I think pydrive may not have such a method. In this answer, I therefore propose uploading the binary data with the requests module; the access token is retrieved from pydrive's authorization flow. The sample script is as follows.
Sample script:
from pydrive.auth import GoogleAuth
import io
import json
import requests

url = 'https://i.imgur.com/A5gIh7W.jpeg'  # Please set the direct link of the image file.
filename = 'sample file'  # Please set the filename on Google Drive.
folder_id = 'root'  # Please set the folder ID. The file is put in this folder.

gauth = GoogleAuth()
gauth.LocalWebserverAuth()

metadata = {
    "name": filename,
    "parents": [folder_id]
}
files = {
    'data': ('metadata', json.dumps(metadata), 'application/json'),
    'file': io.BytesIO(requests.get(url).content)
}
r = requests.post(
    "https://www.googleapis.com/upload/drive/v3/files?uploadType=multipart",
    headers={"Authorization": "Bearer " + gauth.credentials.access_token},
    files=files
)
print(r.text)
Note:
This script assumes that your URL is a direct link to the image file. Please be careful about this.
In this case, uploadType=multipart is used. The official documentation says the following. Ref
Use this upload type to quickly transfer a small file (5 MB or less) and metadata that describes the file, in a single request. To perform a multipart upload, refer to Perform a multipart upload.
When you want to upload large files, please use a resumable upload instead; a rough sketch follows the references below. Ref
References:
Upload and update file content of pydrive
Upload file data of Drive API
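As a rough sketch of that resumable flow (my own addition, not pydrive code; the two-step request shape follows the Drive v3 documentation, and the URL and metadata are the same placeholders as above):

import json
import requests
from pydrive.auth import GoogleAuth

gauth = GoogleAuth()
gauth.LocalWebserverAuth()
token = gauth.credentials.access_token

# Step 1: start a resumable session by sending only the metadata.
start = requests.post(
    "https://www.googleapis.com/upload/drive/v3/files?uploadType=resumable",
    headers={
        "Authorization": "Bearer " + token,
        "Content-Type": "application/json; charset=UTF-8",
    },
    data=json.dumps({"name": "sample file", "parents": ["root"]}),
)
session_url = start.headers["Location"]  # upload URL for this session

# Step 2: send the binary data to the session URL.
data = requests.get('https://i.imgur.com/A5gIh7W.jpeg').content
r = requests.put(session_url, data=data)
print(r.text)

For very large files you would send the data in chunks with Content-Range headers instead of a single PUT.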

GCP Python Cloud Function : Reading a Plain text file from Cloud Storage

The Cloud Function is triggered once a file gets uploaded to Cloud Storage.
My File Name : PubSubMessage.
Inside Text : Hi, this this the first message
from google.cloud import storage

storage_client = storage.Client()

def hello_gcs(event, context):
    file = event
    bucket = storage_client.get_bucket(file['bucket'])
    blob = bucket.blob(file['name'])
    contents = blob.download_as_string()
    print('contents: {}'.format(contents))
    decodedstring = contents.decode(encoding="utf-8", errors="ignore")
    print('decodedstring: \n{}'.format(decodedstring))
The printed decodedstring looks like this:
------WebKitFormBoundaryAWAKqDaYZB3fJBhx
Content-Disposition: form-data; name="file"; filename="PubSubMessage.txt"
Content-Type: text/plain
Hi, this this the first line.
Hi ,this is the second line.
hi this is the space after.
------WebKitFormBoundaryAWAKqDaYZB3fJBhx--
My requirements.txt file:
google-cloud-storage
requests==2.20.0
requests-toolbelt==0.9.1
How do I get the actual string inside the file, "Hi, I am the first message....."?
What is the best possible way to get the text from a file?
TIA
The string you read from Google Storage is a string representation of a multipart form. It contains not only the uploaded file contents but also some metadata. The same kind of request may be used to represent more than one file and/or form fields along with a file.
To access the file contents you want, you can use a library which supports that, such as requests-toolbelt. Check out this SO answer for an example. You'll need the Content-Type header, which includes the boundary, or to manually parse the boundary just from the content, if you absolutely must.
EDIT: from your answer, it seems that the Content-Type header was available in the Storage Metadata in Google Storage, which is a common scenario. For future readers of this answer, the specifics of where to read this header from will depend on your particular case.
Since this library is present in PyPI (the Python Package Index), you can use it even in Cloud Functions by specifying it as a dependency in the requirements.txt file.
The code below will print the actual text present inside the file.
from requests_toolbelt.multipart import decoder
from google.cloud import storage

storage_client = storage.Client()

def hello_gcs(event, context):
    file = event
    bucket = storage_client.bucket(file['bucket'])
    #print('Bucket Name : {}'.format(file['bucket']))
    #print('Object Name : {}'.format(file['name']))
    #print('Bucket Object : {}'.format(bucket))
    blob = bucket.get_blob(file['name'])
    #print('Blob Object : {}'.format(blob))
    contentType = blob.content_type
    print('Blob ContentType: {}'.format(contentType))

    # To download the file as a bytes object
    content = blob.download_as_string()
    print('content: {}'.format(content))

    for part in decoder.MultipartDecoder(content, contentType).parts:
        print(part.text)
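If you also need the filename of the uploaded part, each part exposes its MIME headers. A hedged sketch (requests-toolbelt keys the header dict with bytes):

for part in decoder.MultipartDecoder(content, contentType).parts:
    disposition = part.headers.get(b'Content-Disposition', b'').decode()
    print(disposition)  # e.g. form-data; name="file"; filename="PubSubMessage.txt"
    print(part.text)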

Writing a file to S3 using Lambda in Python with AWS

In AWS, I'm trying to save a file to S3 in Python using a Lambda function. While this works on my local computer, I am unable to get it to work in Lambda. I've been working on this problem for most of the day and would appreciate help. Thank you.
def pdfToTable(PDFfilename, apiKey, fileExt, bucket, key):

    # parsing a PDF using an API
    fileData = (PDFfilename, open(PDFfilename, "rb"))
    files = {"f": fileData}
    postUrl = "https://pdftables.com/api?key={0}&format={1}".format(apiKey, fileExt)
    response = requests.post(postUrl, files=files)
    response.raise_for_status()

    # this code is probably the problem!
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('transportation.manifests.parsed')
    with open('/tmp/output2.csv', 'rb') as data:
        data.write(response.content)
        key = 'csv/' + key
        bucket.upload_fileobj(data, key)

    # FYI, on my own computer, this saves the file
    with open('output.csv', "wb") as f:
        f.write(response.content)
In S3, there is a bucket transportation.manifests.parsed containing the folder csv where the file should be saved.
The type of response.content is bytes.
From AWS, the error from the current set-up above is [Errno 2] No such file or directory: '/tmp/output2.csv': FileNotFoundError. In fact, my goal is to save the file to the csv folder under a unique name, so tmp/output2.csv might not be the best approach. Any guidance?
In addition, I've tried to use wb and w instead of rb also to no avail. The error with wb is Input <_io.BufferedWriter name='/tmp/output2.csv'> of type: <class '_io.BufferedWriter'> is not supported. The documentation suggests that using 'rb' is the recommended usage, but I do not understand why that would be the case.
Also, I've tried s3_client.put_object(Key=key, Body=response.content, Bucket=bucket) but receive An error occurred (404) when calling the HeadObject operation: Not Found.
Assuming Python 3.6: the way I usually do this is to wrap the bytes content in a BytesIO wrapper to create a file-like object. Then, per the boto3 docs, you can use the transfer manager for a managed transfer:
from io import BytesIO
import boto3
s3 = boto3.client('s3')
fileobj = BytesIO(response.content)
s3.upload_fileobj(fileobj, 'mybucket', 'mykey')
If that doesn't work I'd double check all IAM permissions are correct.
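If you also want S3 to record a content type on the object, upload_fileobj accepts an ExtraArgs dict. A small sketch (bucket, key and type are placeholders, and a fresh BytesIO is used because the stream above has already been read):

s3.upload_fileobj(
    BytesIO(response.content),
    'mybucket',
    'csv/mykey.csv',
    ExtraArgs={'ContentType': 'text/csv'}  # stored as the object's Content-Type
)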
You have a writable stream that you're asking boto3 to use as a readable stream, which won't work.
Write the file, and then simply use bucket.upload_file() afterwards, like so:
s3 = boto3.resource('s3')
bucket = s3.Bucket('transportation.manifests.parsed')

# open in binary mode, since response.content is bytes
with open('/tmp/output2.csv', 'wb') as data:
    data.write(response.content)

key = 'csv/' + key
bucket.upload_file('/tmp/output2.csv', key)
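If you'd rather avoid the temp file, put_object (which you already tried) also accepts the bytes directly. A sketch, assuming the bucket name and key are passed as plain strings:

import boto3

s3_client = boto3.client('s3')
s3_client.put_object(
    Bucket='transportation.manifests.parsed',  # bucket name as a plain string
    Key='csv/' + key,
    Body=response.content  # bytes are accepted directly as the Body
)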

Google Picker and download the selected file

I have managed to integrate the google Picker. Now I have the FileId and the access_token. I need to download the selected file in my back end which is in Python.
I have followed the documentation on Google Developers, but if I use the Python library then I have to authenticate the user again, which is not suitable in my case.
Please enlighten me on whether I can download the file in any format.
Thanks in advance.
This is a very simple sample script for downloading files using the access token and file ID, without the Google client library. As you said, this sample assumes regular files (images and videos), not Google Docs files. The flow is as follows.
Flow :
Retrieve filename and mimeType.
Create a filename from the retrieved filename and mimeType. If the filename on Google Drive doesn't have an extension, this script adds one based on the mimeType and saves under that name. If the filename already has an extension, the original filename is used.
Download the file and save it under the created filename. If you want to save the file to a specific directory, please set that yourself.
Sample script :
import mimetypes
import os.path
import requests

accessToken = "### access token ###"
fileId = "### file id ###"

fileInf = requests.get(
    "https://www.googleapis.com/drive/v3/files/" + fileId,
    headers={"Authorization": "Bearer " + accessToken},
)
filename = fileInf.json()["name"]
temp, ext = os.path.splitext(filename)
filename = filename if ext != "" else filename + mimetypes.guess_extension(fileInf.json()["mimeType"])

r = requests.get(
    "https://www.googleapis.com/drive/v3/files/" + fileId + "?alt=media",
    headers={"Authorization": "Bearer " + accessToken},
)
with open(filename, "wb") as f:
    f.write(r.content)
If this was not helpful for you, I'm sorry.
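One extra note not covered by the script above: if the selected file is a Google Docs-type file, the ?alt=media request returns no content; those files have to be exported instead. A hedged sketch reusing the variables above, exporting to PDF:

r = requests.get(
    "https://www.googleapis.com/drive/v3/files/" + fileId + "/export?mimeType=application/pdf",
    headers={"Authorization": "Bearer " + accessToken},
)
with open(temp + ".pdf", "wb") as f:
    f.write(r.content)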

How to download files with Box API & Python

I currently have the upload portion of my code working; how would I go about converting this into a program that will download the respective files from the Box folder?
This is the upload program:
import requests
import json

# the user access token
access_token = 'UfUNeHhv4gIxFCn5WEXHgBJwfG8gHT2o'
# the name of the file as you want it to appear in box
dst_filename = 'box_file'
# the actual file path
src_directory = 'C:\Python\cache\\'
# the name of the file to be transferred
src_filename = 'Wildlife.wmv'
# the id of the folder you want to upload to
parent_id = '0'

counter = 1
for counter in range(1, 6):
    src_file = (src_directory + src_filename + '-' + str(counter))
    print(src_file)
    box_filename = (dst_filename + '-' + str(counter))
    headers = {'Authorization': 'Bearer {0}'.format(access_token)}
    url = 'https://upload.box.com/api/2.0/files/content'
    # open(src_file,'rb') - opens the source file with the buffered reader
    files = {'filename': (box_filename, open(src_file, 'rb'))}
    data = {"parent_id": parent_id}
    response = requests.post(url, data=data, files=files, headers=headers)
    #file_info = response.json()
    #print(file_info)
    print(response)
    print(url, data, files, headers)
    counter = counter + 1
This is the sample curl request that the Box API documentation gives for downloading files.
curl -L https://api.box.com/2.0/files/FILE_ID/content \
-H "Authorization: Bearer ACCESS_TOKEN" \
-o FILE_PATH/file_name.txt
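Since the upload script already uses requests, that curl call translates roughly to the following (the file ID and output filename are placeholders; access_token is the one from the upload script):

import requests

file_id = '123456789'  # placeholder: ID of the file to download
url = 'https://api.box.com/2.0/files/{0}/content'.format(file_id)
headers = {'Authorization': 'Bearer {0}'.format(access_token)}

# The content endpoint redirects to the real download URL;
# requests follows the redirect automatically (curl needs -L for this).
response = requests.get(url, headers=headers)
with open('downloaded_file.wmv', 'wb') as f:
    f.write(response.content)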
Part two of this question: Is there a way to alter this program (and the download program) to process all of the files within a folder no matter what the name of the file is?
I am new to programming, so please forgive my lack of skills/knowledge in this area.
Assuming you are getting your authorization correct, you can download a file by adding a few lines to your existing code.
This will copy the data from the Box file to a local file, named FileFromBox.xls here:
with open('FileFromBox.xls', 'wb') as open_file:
    client.file('FileId_of_box_file').download_to(open_file)
I know this was asked long ago, but I believe many people are still searching for a way to do it.
Please check the Box SDK for more details.
I'm using OAuth 2.0 with a Custom App; you can create the credentials from the developer console.
Here's the code.
from boxsdk import OAuth2, Client
#from boxsdk import Folder

auth = OAuth2(
    client_id='fbxxxxxxxxxxxxxxxxxxxxxxxxxxxxx9',
    client_secret='bPxxxxxxxxxxxxxxxxxxxxxxxxx4Or',
    access_token='QExxxxxxxxxxxxxxxxxxxxxxxxxxwt',
)
client = Client(auth)

root_folder = client.root_folder().get()
items = root_folder.get_items()
for item in items:
    print('{0} {1} is named "{2}"'.format(item.type.capitalize(), item.id, item.name))
    if item.type == 'file':  # folders can't be passed to client.file()
        with open(item.name, 'wb') as open_file:
            client.file(item.id).download_to(open_file)
Hope this will help you. Thanks to the Python boxsdk 2.0.0 Doc.
I would suggest looking at the Box SDK.
As you can see in their docs, after authenticating with your client you only need to run the following line:
client.file(file_id='SOME_FILE_ID').content()
There is more information in the Box SDK docs. If this does not satisfy your needs because you want to build your own Box client, then please wait for someone else to give a specific response to your problem. Thanks.
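For example, to write that content to disk (the file ID and output name are placeholders):

data = client.file(file_id='SOME_FILE_ID').content()  # returns the file's bytes
with open('downloaded_file', 'wb') as f:
    f.write(data)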
You can also download files and folders into a zip, like below:
name = 'test'
file = client.file('466239504569')
folder = client.folder('466239504580')
items = [file, folder]
output_file = open('test.zip', 'wb')
status = client.download_zip(name, items, output_file)
print('The status of the zip download is {0}'.format(status['state']))
