Reading part of a file in S3 using Boto - python

I am trying to read a 700 MB file stored in S3. However, I only require the bytes from locations 73 to 1024.
I tried to find a usable solution but failed to. It would be a great help if someone could help me out.

S3 supports GET requests using the 'Range' HTTP header, which is what you're after.
To specify a Range request in boto, just add a headers dictionary specifying the 'Range' key for the bytes you are interested in. Adapted from Mitchell Garnaat's response:
import boto
s3 = boto.connect_s3()
bucket = s3.lookup('mybucket')
key = bucket.lookup('mykey')
your_bytes = key.get_contents_as_string(headers={'Range' : 'bytes=73-1024'})

boto3 version, from https://github.com/boto/boto3/issues/1236:
import boto3

obj = boto3.resource('s3').Object('mybucket', 'mykey')
stream = obj.get(Range='bytes=32-64')['Body']
print(stream.read())

Please have a look at the Python script below:
import boto3

region = 'us-east-1'            # define your region here
bucket_name = 'test'            # define bucket
key = 'objkey'                  # s3 object key
bytes_range = 'bytes=73-1024'

client = boto3.client('s3', region_name=region)
resp = client.get_object(Bucket=bucket_name, Key=key, Range=bytes_range)
data = resp['Body'].read()
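One extra note, not from the original answers: the Range header is inclusive on both ends, so 'bytes=73-1024' returns 1024 - 73 + 1 = 952 bytes. A quick sanity check against the response metadata might look like this:
# Sketch: verify how many bytes an inclusive Range request actually returned.
import boto3

client = boto3.client('s3')
resp = client.get_object(Bucket='mybucket', Key='mykey', Range='bytes=73-1024')
data = resp['Body'].read()

print(resp['ContentRange'])   # e.g. "bytes 73-1024/734003200"
print(resp['ContentLength'])  # 952
assert len(data) == 1024 - 73 + 1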

Related

Boto3 S3 Paginator not returning filtered results

I'm trying to list the items in my S3 bucket from the last few months. I'm able to get results from the normal paginator (page_iterator), but the filtered_iterator isn't yielding anything when I iterate over it.
I've been referencing the documentation here. I've double-checked my filter string both on the JMESPath site and with the AWS CLI, and it works in both places. I'm at a loss at this point as to what I need to do.
Current Code:
client = boto3.client('s3', region_name='us-west-2')
paginator = client.get_paginator('list_objects_v2')
operation_parameters = {'Bucket': self.bucket_name,
                        'Prefix': file_path_prefix}
page_iterator = paginator.paginate(**operation_parameters)
filtered_iterator = page_iterator.search("Contents[?LastModified>='2022-10-31'][]")
for key_data in filtered_iterator:
    print('page2', key_data)
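One hedged guess, not stated in the original post: boto3 returns LastModified as a datetime object, so comparing it directly against the string '2022-10-31' in the JMESPath expression may never match. Converting it with to_string() and comparing against a JSON-quoted date is the usual workaround, sketched below:
# Sketch (assumed fix): render LastModified as a JSON string before comparing.
# to_string() yields a quoted value, so the right-hand side needs embedded quotes.
filtered_iterator = page_iterator.search(
    "Contents[?to_string(LastModified) >= '\"2022-10-31\"'][]"
)
for key_data in filtered_iterator:
    print('filtered', key_data)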

Generate presigned url for uploading file to google storage using python

I want to upload an image from the front end to Google Cloud Storage using JavaScript AJAX functionality. I need a presigned URL that the server would generate, which would give my frontend the authorization to upload a blob.
How can I generate a presigned URL when working on my local machine?
Previously, for AWS S3, I would do:
pp = s3.generate_presigned_post(
    Bucket=settings.S3_BUCKET_NAME,
    Key='folder1/' + file_name,
    ExpiresIn=20  # seconds
)
When generating a signed URL for a user to just view a file stored on Google Cloud Storage, I do:
bucket = settings.CLIENT.bucket(settings.BUCKET_NAME)
blob_name = 'folder/img1.jpg'
blob = bucket.blob(blob_name)
url = blob.generate_signed_url(
    version='v4',
    expiration=datetime.timedelta(minutes=1),
    method='GET')
Spent $100 on Google support and 2 weeks of my time to finally find a solution.
client = storage.Client()  # works on App Engine standard without any credentials requirements
But if you want to use the generate_signed_url() function, then you need a service account JSON key.
Every App Engine standard app has a default service account (you can find it under IAM/Service Accounts). Create a key for that default service account and download the key ('sv_key.json') in JSON format. Store that key in your Django project right next to the app.yaml file. Then do the following:
import datetime

from google.cloud import storage

CLIENT = storage.Client.from_service_account_json('sv_key.json')
bucket = CLIENT.bucket('bucket_name_1')
blob = bucket.blob('img1.jpg')  # name of file to be saved/uploaded to storage
pp = blob.generate_signed_url(
    version='v4',
    expiration=datetime.timedelta(minutes=1),
    method='POST')
This will work on your local machine and on GAE standard. When you deploy your app to GAE, sv_key.json also gets deployed with the Django project, and hence it works.
Hope it helps you.
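A small follow-up that is not part of the original answer: once the signed URL exists, the upload itself is just an HTTP request against it. A minimal sketch using the requests library and a URL signed with method='PUT' (a simpler variant than the POST/resumable flow above) might look like this:
# Sketch: upload a local file to a V4 signed URL generated with method='PUT'.
import datetime
import requests
from google.cloud import storage

client = storage.Client.from_service_account_json('sv_key.json')
blob = client.bucket('bucket_name_1').blob('img1.jpg')

upload_url = blob.generate_signed_url(
    version='v4',
    expiration=datetime.timedelta(minutes=1),
    method='PUT',
    content_type='image/jpeg',  # must match the header sent below
)

with open('img1.jpg', 'rb') as f:
    resp = requests.put(upload_url, data=f, headers={'Content-Type': 'image/jpeg'})
print(resp.status_code)  # 200 on success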
Editing my answer as I didn't understand the problem you were facing.
Taking a look at the comments thread in the question, as @Nick Shebanov stated, there is one possibility to accomplish what you're trying to do when using GAE with the flex environment.
I have been trying to do the same with the GAE Standard environment with no luck so far. At this point, I would recommend opening a feature request in the public issue tracker so this gets implemented somehow.
Create a service account private key and store it in Secret Manager (SM).
In settings.py, retrieve that key from Secret Manager and store it in a constant - SV_ACCOUNT_KEY.
Override the Client() class func from_service_account_json() to take the JSON key content instead of a path to a JSON file. This way we don't have to have a JSON file in our file system (locally, in Cloud Build or on GAE); we can just get the private key contents from SM anytime, anywhere.
settings.py
secret = SecretManager()
SV_ACCOUNT_KEY = secret.access_secret_data('SV_ACCOUNT_KEY')
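The SecretManager helper and its access_secret_data method above are the author's own wrappers and are not shown; a minimal sketch of what such a helper might look like on top of google-cloud-secret-manager (the project id and the method name are assumptions) could be:
# Sketch of a hypothetical SecretManager helper built on google-cloud-secret-manager.
from google.cloud import secretmanager

class SecretManager:
    def __init__(self, project_id='my-project'):
        self.project_id = project_id
        self.client = secretmanager.SecretManagerServiceClient()

    def access_secret_data(self, secret_id, version='latest'):
        name = f"projects/{self.project_id}/secrets/{secret_id}/versions/{version}"
        response = self.client.access_secret_version(request={"name": name})
        return response.payload.data.decode("UTF-8")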
signed_url_mixin.py
import datetime
import json
from django.conf import settings
from google.cloud.storage.client import Client
from google.oauth2 import service_account
class CustomClient(Client):
    @classmethod
    def from_service_account_json(cls, json_credentials_path, *args, **kwargs):
        """
        Copying everything from the base func (from_service_account_json).
        Instead of passing a json file for the private key, we pass the
        private key json contents directly (since we cannot save a file on GAE).
        Since the base class is not written to be extended this way, we cannot
        just override part of it; we have to rewrite the entire func.
        """
        if "credentials" in kwargs:
            raise TypeError("credentials must not be in keyword arguments")
        credentials_info = json.loads(json_credentials_path)
        credentials = service_account.Credentials.from_service_account_info(
            credentials_info
        )
        if cls._SET_PROJECT:
            if "project" not in kwargs:
                kwargs["project"] = credentials_info.get("project_id")
        kwargs["credentials"] = credentials
        return cls(*args, **kwargs)


class _SignedUrlMixin:
    bucket_name = settings.BUCKET_NAME
    CLIENT = CustomClient.from_service_account_json(settings.SV_ACCOUNT_KEY)
    exp_min = 4  # expire minutes

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bucket = self.CLIENT.bucket(self.bucket_name)

    def _signed_url(self, file_name, method):
        blob = self.bucket.blob(file_name)
        signed_url = blob.generate_signed_url(
            version='v4',
            expiration=datetime.timedelta(minutes=self.exp_min),
            method=method
        )
        return signed_url


class GetSignedUrlMixin(_SignedUrlMixin):
    """
    A GET url to view a file on Cloud Storage.
    """

    def get_signed_url(self, file_name):
        """
        :param file_name: name of file to be retrieved from Cloud Storage,
            e.g. xyz/f1.pdf
        :return: GET signed url
        """
        method = 'GET'
        return self._signed_url(file_name, method)


class PutSignedUrlMixin(_SignedUrlMixin):
    """
    A PUT url to make a PUT request to upload a file to Cloud Storage.
    """

    def put_signed_url(self, file_name):
        """
        :param file_name: xyz/f1.pdf
        """
        method = 'PUT'
        return self._signed_url(file_name, method)
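A brief usage sketch that is not part of the original answer: any Django class-based view can mix these in; the view name and query parameter below are made up for illustration:
# Hypothetical Django view using the mixins above to hand out an upload URL.
from django.http import JsonResponse
from django.views import View


class UploadUrlView(PutSignedUrlMixin, View):
    def get(self, request):
        file_name = request.GET.get('file_name', 'uploads/img1.jpg')
        return JsonResponse({'upload_url': self.put_signed_url(file_name)})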

How to read data from Azure's CosmosDB in python

I have a trial account with Azure and have uploaded some JSON files into CosmosDB. I am creating a python program to review the data but I am having trouble doing so. This is the code I have so far:
import pydocumentdb.documents as documents
import pydocumentdb.document_client as document_client
import pydocumentdb.errors as errors
url = 'https://ronyazrak.documents.azure.com:443/'
key = '' # primary key
# Initialize the Python DocumentDB client
client = document_client.DocumentClient(url, {'masterKey': key})
collection_link = '/dbs/test1/colls/test1'
collection = client.ReadCollection(collection_link)
result_iterable = client.QueryDocuments(collection)
query = { 'query': 'SELECT * FROM server s' }
I read somewhere that the key should be my primary key, which I can find under Keys in my Azure account. I have filled the key string with my primary key (left empty here just for privacy purposes).
I also read somewhere that the collection_link should be '/dbs/test1/colls/test1' if my data is in a collection named 'test1'.
My code gets an error at the function client.ReadCollection().
This is the error I get: "pydocumentdb.errors.HTTPFailure: Status code: 401
{"code":"Unauthorized","message":"The input authorization token can't serve the request. Please check that the expected payload is built as per the protocol, and check the key being used. Server used the following payload to sign: 'get\ncolls\ndbs/test1/colls/test1\nmon, 29 may 2017 19:47:28 gmt\n\n'\r\nActivityId: 03e13e74-8db4-4661-837a-f8d81a2804cc"}"
Once this error is fixed, what is there left to do? I want to get the JSON files as a big dictionary so that I can review the data.
Am I on the right path? Am I approaching this the wrong way? How can I read the documents that are in my database? Thanks.
According to your error information, it seems to be caused by authentication failing with your key, as the official explanation here states.
So please check your key, but I think the key point is that pydocumentdb is being used incorrectly. The ids of a Database, Collection & Document are different from their links. The APIs ReadCollection and QueryDocuments need to be passed the related link: you need to retrieve every resource in Azure Cosmos DB via its resource link, not its resource id.
According to your description, I think you want to list all documents under the collection id path /dbs/test1/colls/test1. For reference, here is my sample code:
from pydocumentdb import document_client
uri = 'https://ronyazrak.documents.azure.com:443/'
key = '<your-primary-key>'
client = document_client.DocumentClient(uri, {'masterKey': key})
db_id = 'test1'
db_query = "select * from r where r.id = '{0}'".format(db_id)
db = list(client.QueryDatabases(db_query))[0]
db_link = db['_self']
coll_id = 'test1'
coll_query = "select * from r where r.id = '{0}'".format(coll_id)
coll = list(client.QueryCollections(db_link, coll_query))[0]
coll_link = coll['_self']
docs = client.ReadDocuments(coll_link)
print(list(docs))
Please see the details of the DocumentDB Python SDK here.
For those using azure-cosmos, the current library (as of 2019): I opened a doc bug and provided a sample on GitHub.
Sample
from azure.cosmos import cosmos_client
import json

CONFIG = {
    "ENDPOINT": "ENDPOINT_FROM_YOUR_COSMOS_ACCOUNT",
    "PRIMARYKEY": "KEY_FROM_YOUR_COSMOS_ACCOUNT",
    "DATABASE": "DATABASE_ID",    # Prolly looks more like a name to you
    "CONTAINER": "YOUR_CONTAINER_ID"  # Prolly looks more like a name to you
}

CONTAINER_LINK = f"dbs/{CONFIG['DATABASE']}/colls/{CONFIG['CONTAINER']}"
FEEDOPTIONS = {}
FEEDOPTIONS["enableCrossPartitionQuery"] = True
# There is also a partitionKey Feed Option, but I was unable to figure out how to use it.

QUERY = {
    "query": f"SELECT * from c"
}

# Initialize the Cosmos client
client = cosmos_client.CosmosClient(
    url_connection=CONFIG["ENDPOINT"], auth={"masterKey": CONFIG["PRIMARYKEY"]}
)

# Query for some data
results = client.QueryItems(CONTAINER_LINK, QUERY, FEEDOPTIONS)

# Look at your data
print(list(results))

# You can also use the list as JSON
json.dumps(list(results), indent=4)
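Not from the original answers, but for readers on the current azure-cosmos 4.x SDK (the database and container names below are placeholders), the same read can be sketched as:
# Sketch using the azure-cosmos 4.x API instead of the legacy QueryItems call.
from azure.cosmos import CosmosClient

client = CosmosClient("ENDPOINT_FROM_YOUR_COSMOS_ACCOUNT",
                      credential="KEY_FROM_YOUR_COSMOS_ACCOUNT")
container = client.get_database_client("mydb").get_container_client("mycontainer")

items = container.query_items(
    query="SELECT * FROM c",
    enable_cross_partition_query=True,
)
print(list(items))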

python aws put_bucket_policy() MalformedPolicy

How does one use put_bucket_policy()? It throws a MalformedPolicy error even when I try to pass in an existing valid policy:
import boto3
client = boto3.client('s3')
dict_policy = client.get_bucket_policy(Bucket = 'my_bucket')
str_policy = str(dict_policy)
response = client.put_bucket_policy(Bucket = 'my_bucket', Policy = str_policy)
* error message: *
botocore.exceptions.ClientError: An error occurred (MalformedPolicy) when calling the PutBucketPolicy operation: This policy contains invalid Json
That's because applying str to a dict doesn't turn it into valid JSON; use json.dumps instead. Note also that get_bucket_policy returns the policy document as a JSON string under the 'Policy' key of its response, so that is the part to round-trip:
import boto3
import json

client = boto3.client('s3')
response = client.get_bucket_policy(Bucket='my_bucket')
# The policy document itself is the JSON string under 'Policy';
# the rest of the response (ResponseMetadata, etc.) must not be sent back.
policy = json.loads(response['Policy'])
# ... modify the policy dict here if needed ...
str_policy = json.dumps(policy)
response = client.put_bucket_policy(Bucket='my_bucket', Policy=str_policy)
The current boto3 API doesn't have a function to APPEND to a bucket policy, i.e. to add further items/elements/attributes. You need to load and manipulate the JSON yourself: e.g. write a script that loads the policy into a dict, appends to the "Statement" list, then uses policy.put to replace the whole policy. Without the original statement id, the user policy will simply be appended. HOWEVER, there is no way to tell whether a later user policy will override the rules of an earlier one.
For example:
import boto3
import json

s3_conn = boto3.resource('s3')
bucket_policy = s3_conn.BucketPolicy('bucket_name')
policy = json.loads(bucket_policy.policy)    # current policy document as a dict
user_policy = { "Effect": "Allow", ... }     # the statement to add (elided here)
policy['Statement'].append(user_policy)      # append() mutates the list in place
bucket_policy.put(Policy=json.dumps(policy))
The user doesn't need to know the old policy in the process.

Creating Signed URLs for Amazon CloudFront

Short version: How do I make signed URLs "on-demand" to mimic Nginx's X-Accel-Redirect behavior (i.e. protecting downloads) with Amazon CloudFront/S3 using Python.
I've got a Django server up and running with an Nginx front-end. I've been getting hammered with requests to it and recently had to install it as a Tornado WSGI application to prevent it from crashing in FastCGI mode.
Now my server is getting bogged down (i.e. most of its bandwidth is being used up) by too many requests for media, so I've been looking into CDNs, and I believe Amazon CloudFront/S3 would be the proper solution for me.
I've been using Nginx's X-Accel-Redirect header to protect the files from unauthorized downloading, but I don't have that ability with CloudFront/S3--however they do offer signed URLs. I'm no Python expert by far and definitely don't know how to create a Signed URL properly, so I was hoping someone would have a link for how to make these URLs "on-demand" or would be willing to explain how to here, it would be greatly appreciated.
Also, is this the proper solution, even? I'm not too familiar with CDNs, is there a CDN that would be better suited for this?
Amazon CloudFront Signed URLs work differently than Amazon S3 signed URLs. CloudFront uses RSA signatures based on a separate CloudFront keypair which you have to set up in your Amazon Account Credentials page. Here's some code to actually generate a time-limited URL in Python using the M2Crypto library:
Create a keypair for CloudFront
I think the only way to do this is through Amazon's web site. Go into your AWS "Account" page and click on the "Security Credentials" link. Click on the "Key Pairs" tab then click "Create a New Key Pair". This will generate a new key pair for you and automatically download a private key file (pk-xxxxxxxxx.pem). Keep the key file safe and private. Also note down the "Key Pair ID" from amazon as we will need it in the next step.
Generate some URLs in Python
As of boto version 2.0 there does not seem to be any support for generating signed CloudFront URLs. Python does not include RSA encryption routines in the standard library so we will have to use an additional library. I've used M2Crypto in this example.
For a non-streaming distribution, you must use the full cloudfront URL as the resource, however for streaming we only use the object name of the video file. See the code below for a full example of generating a URL which only lasts for 5 minutes.
This code is based loosely on the PHP example code provided by Amazon in the CloudFront documentation.
from M2Crypto import EVP
import base64
import time

def aws_url_base64_encode(msg):
    msg_base64 = base64.b64encode(msg)
    msg_base64 = msg_base64.replace('+', '-')
    msg_base64 = msg_base64.replace('=', '_')
    msg_base64 = msg_base64.replace('/', '~')
    return msg_base64

def sign_string(message, priv_key_string):
    key = EVP.load_key_string(priv_key_string)
    key.reset_context(md='sha1')
    key.sign_init()
    key.sign_update(message)
    signature = key.sign_final()
    return signature

def create_url(url, encoded_signature, key_pair_id, expires):
    signed_url = "%(url)s?Expires=%(expires)s&Signature=%(encoded_signature)s&Key-Pair-Id=%(key_pair_id)s" % {
        'url': url,
        'expires': expires,
        'encoded_signature': encoded_signature,
        'key_pair_id': key_pair_id,
    }
    return signed_url

def get_canned_policy_url(url, priv_key_string, key_pair_id, expires):
    # we manually construct this policy string to ensure formatting matches signature
    canned_policy = '{"Statement":[{"Resource":"%(url)s","Condition":{"DateLessThan":{"AWS:EpochTime":%(expires)s}}}]}' % {'url': url, 'expires': expires}
    # sign the non-encoded policy
    signature = sign_string(canned_policy, priv_key_string)
    # now base64 encode the signature (URL safe as well)
    encoded_signature = aws_url_base64_encode(signature)
    # combine these into a full url
    signed_url = create_url(url, encoded_signature, key_pair_id, expires)
    return signed_url

def encode_query_param(resource):
    enc = resource
    enc = enc.replace('?', '%3F')
    enc = enc.replace('=', '%3D')
    enc = enc.replace('&', '%26')
    return enc

# Set parameters for URL
key_pair_id = "APKAIAZVIO4BQ"        # from the AWS accounts CloudFront tab
priv_key_file = "cloudfront-pk.pem"  # your private keypair file
# Use the FULL URL for non-streaming:
resource = "http://34254534.cloudfront.net/video.mp4"
# resource = 'video.mp4'  # your resource (just object name for streaming videos)
expires = int(time.time()) + 300  # 5 min

# Create the signed URL
priv_key_string = open(priv_key_file).read()
signed_url = get_canned_policy_url(resource, priv_key_string, key_pair_id, expires)
print(signed_url)

# Flash player doesn't like query params so encode them if you're using a streaming distribution
# enc_url = encode_query_param(signed_url)
# print(enc_url)
Make sure that you set up your distribution with a TrustedSigners parameter set to the account holding your keypair (or "Self" if it's your own account)
See Getting started with secure AWS CloudFront streaming with Python for a fully worked example on setting this up for streaming with Python
This feature is now supported in botocore, which is the underlying library of Boto3, the latest official AWS SDK for Python. (The following sample requires installation of the rsa package, but you can use other RSA packages too; just define your own "normalized RSA signer".)
The usage looks like this:
from datetime import datetime

from botocore.signers import CloudFrontSigner

# First you create a cloudfront signer based on a normalized RSA signer::
import rsa

def rsa_signer(message):
    private_key = open('private_key.pem', 'r').read()
    return rsa.sign(
        message,
        rsa.PrivateKey.load_pkcs1(private_key.encode('utf8')),
        'SHA-1')  # CloudFront requires SHA-1 hash

cf_signer = CloudFrontSigner(key_id, rsa_signer)

# To sign with a canned policy::
signed_url = cf_signer.generate_presigned_url(
    url, date_less_than=datetime(2015, 12, 1))

# To sign with a custom policy::
signed_url = cf_signer.generate_presigned_url(url, policy=my_policy)
Disclaimer: I am the author of that PR.
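A side note, not part of the original answer: if you would rather avoid the rsa package, an equivalent "normalized RSA signer" can be built with the cryptography library instead; a minimal sketch, assuming the same private_key.pem file, might look like this:
# Sketch: RSA signer for CloudFrontSigner based on the cryptography package.
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

def rsa_signer(message):
    with open('private_key.pem', 'rb') as key_file:
        private_key = serialization.load_pem_private_key(key_file.read(), password=None)
    # CloudFront expects a PKCS#1 v1.5 signature over a SHA-1 digest
    return private_key.sign(message, padding.PKCS1v15(), hashes.SHA1())
It plugs into CloudFrontSigner exactly like the rsa-based version above.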
As many have commented already, the initially accepted answer doesn't actually apply to Amazon CloudFront, insofar as Serving Private Content through CloudFront requires the use of dedicated CloudFront signed URLs. Accordingly, secretmike's answer was correct, but it is meanwhile outdated after he himself took the time and added support for generating signed URLs for CloudFront (thanks much for this!).
boto now supports a dedicated create_signed_url method, and the former binary dependency M2Crypto has recently been replaced with a pure-Python RSA implementation as well; see Don't use M2Crypto for cloudfront URL signing.
As is increasingly common, one can find one or more good usage examples within the related unit tests (see test_signed_urls.py), for example test_canned_policy(self) - see setUp(self) for the referenced variables self.pk_id and self.pk_str (obviously you'll need your own keys):
def test_canned_policy(self):
    """
    Generate signed url from the Example Canned Policy in Amazon's
    documentation.
    """
    url = "http://d604721fxaaqy9.cloudfront.net/horizon.jpg?large=yes&license=yes"
    expire_time = 1258237200
    expected_url = "http://example.com/"  # replaced for brevity
    signed_url = self.dist.create_signed_url(
        url, self.pk_id, expire_time, private_key_string=self.pk_str)
    # self.assertEqual(expected_url, signed_url)
This is what I use to create a policy so that I can give access to multiple files with the same "signature":
import json
import rsa
import time

from base64 import b64encode

url = "http://your_domain/*"
expires = int(time.time() + 3600)

pem = """-----BEGIN RSA PRIVATE KEY-----
...
-----END RSA PRIVATE KEY-----"""

key_pair_id = 'ABX....'

policy = {}
policy['Statement'] = [{}]
policy['Statement'][0]['Resource'] = url
policy['Statement'][0]['Condition'] = {}
policy['Statement'][0]['Condition']['DateLessThan'] = {}
policy['Statement'][0]['Condition']['DateLessThan']['AWS:EpochTime'] = expires
policy = json.dumps(policy)

private_key = rsa.PrivateKey.load_pkcs1(pem.encode('utf-8'))
signature = b64encode(rsa.sign(policy.encode('utf-8'), private_key, 'SHA-1')).decode('utf-8')

print('?Policy=%s&Signature=%s&Key-Pair-Id=%s' % (b64encode(policy.encode('utf-8')).decode('utf-8'),
                                                  signature,
                                                  key_pair_id))
I can use it for all files under http://your_domain/* for example:
http://your_domain/image2.png?Policy...
http://your_domain/image2.png?Policy...
http://your_domain/file1.json?Policy...
secretmike's answer works, but it is better to use rsa instead of M2Crypto.
I used boto which uses rsa.
import time

import boto
from boto.cloudfront import CloudFrontConnection
from boto.cloudfront.distribution import Distribution

expire_time = int(time.time() + 3000)
conn = CloudFrontConnection('ACCESS_KEY_ID', 'SECRET_ACCESS_KEY')

## enter the id or domain name to select a distribution
distribution = Distribution(connection=conn, config=None, domain_name='', id='',
                            last_modified_time=None, status='')

signed_url = distribution.create_signed_url(
    url='YOUR_URL',
    keypair_id='YOUR_KEYPAIR_ID_example-APKAIAZVIO4BQ',
    expire_time=expire_time,
    private_key_file="YOUR_PRIVATE_KEY_FILE_LOCATION")
Use the boto documentation
I found a simple solution that doesn't require changing the s3.generate_url call:
just select 'Yes, Update bucket policy' in your CloudFront config.
After that, change the URL from:
https://xxxx.s3.amazonaws.com/hello.png?Signature=sss&Expires=1585008320&AWSAccessKeyId=kkk
to
https://yyy.cloudfront.net/hello.png?Signature=sss&Expires=1585008320&AWSAccessKeyId=kkk
where yyy.cloudfront.net is your CloudFront domain.
Refer to: https://aws.amazon.com/blogs/developer/accessing-private-content-in-amazon-cloudfront/
