Move/copy data from one folder to another on AWS S3 - python

I am looking for all the methods for moving/copying data from one folder to another in an AWS S3 bucket.
Method 1: Via the AWS CLI (easiest)
Download and install awscli on your instance (I am using the Windows 64-bit installer here), run "aws configure" to fill in your credentials and configuration, and then run this single command from cmd:
aws s3 cp s3://from-source/ s3://to-destination/ --recursive
Here cp copies the objects and --recursive applies the copy to everything under the prefix. Use aws s3 mv instead of cp if you want to move the files rather than copy them.
Method 2: Via the AWS CLI from Python
import os
import awscli

if os.environ.get('LC_CTYPE', '') == 'UTF-8':
    os.environ['LC_CTYPE'] = 'en_US.UTF-8'

from awscli.clidriver import create_clidriver

driver = create_clidriver()
driver.main('s3 mv s3://staging/AwsTesting/research/ s3://staging/AwsTesting/research_archive/ --recursive'.split())
This also worked perfectly for me.
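If you want the script to fail loudly when the command does not succeed, driver.main() returns the CLI's exit code (at least in awscli v1); a small sketch building on the snippet above:

rc = driver.main('s3 mv s3://staging/AwsTesting/research/ '
                 's3://staging/AwsTesting/research_archive/ --recursive'.split())
if rc != 0:  # 0 means the command succeeded
    raise RuntimeError(f'aws s3 mv exited with code {rc}')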
Method 3: Via boto3 in Python
import boto3

s3 = boto3.resource('s3')

copy_source = {
    'Bucket': 'staging',
    'Key': '/AwsTesting/research/'
}
s3.meta.client.copy(copy_source, 'staging', '/AwsTesting/research_archive/')
My understanding was that the 'Key' for a bucket is just the folder prefix, so I passed the folder path here.
Error:
Invalid bucket name "s3://staging": Bucket name must match the regex "^[a-zA-Z0-9.-_]{1,255}$"
I even changed it to the plain bucket name "staging", but with no success.
How should I understand bucket connectivity via boto3 and the concept of a key?

import boto3

s3 = boto3.resource('s3')

# The bucket name must be the bare name (no "s3://") and the Key must be a
# full object key; S3 has no real folder objects to copy.
copy_source = {
    'Bucket': 'staging',
    'Key': 'AwsTesting/research/filename.csv'
}
s3.meta.client.copy(copy_source, 'staging', 'AwsTesting/')
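This copies a single object. To copy (or move) everything under a prefix, list the keys and copy them one by one; a minimal sketch, assuming the bucket and prefixes from the question:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('staging')                # bucket name only, no "s3://"
src_prefix = 'AwsTesting/research/'          # no leading slash
dst_prefix = 'AwsTesting/research_archive/'

for obj in bucket.objects.filter(Prefix=src_prefix):
    new_key = dst_prefix + obj.key[len(src_prefix):]
    bucket.copy({'Bucket': bucket.name, 'Key': obj.key}, new_key)
    # obj.delete()  # uncomment to turn the copy into a move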

An alternative to using cp with the CLI is sync - https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
aws s3 sync s3://mybucket s3://mybucket2
It will essentially do the same thing; by default, sync only copies objects that are new or have changed, so re-running it is cheaper than cp.

Use the following snippet, which works for me.
def s3candidateavtarcopy(old, new):
    # Assumes s3 (a boto3 S3 client), s3_resource (a boto3 S3 resource) and
    # s3_candidate_bucket (the bucket name) are defined at module level, and
    # that ClientError comes from botocore.exceptions.
    try:
        response = s3.list_objects_v2(Bucket=s3_candidate_bucket, Prefix=old)
        keycount = response['KeyCount']
        if keycount > 0:
            for key in response['Contents']:
                file = key['Key']
                try:
                    output = file.split(old)
                    newfile = new + output[1]
                    input_source = {'Bucket': s3_candidate_bucket, 'Key': file}
                    s3_resource.Object(s3_candidate_bucket, newfile).copy_from(CopySource=input_source)
                except ClientError as e:
                    print(e.response['Error']['Message'])
                else:
                    print('Success')
        else:
            print('No matching records')
    except ClientError as e:
        print(e.response['Error']['Message'])
    else:
        print('Operation completed')
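For reference, a hypothetical setup and call (the bucket name and prefixes here are placeholders, not part of the original snippet):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
s3_resource = boto3.resource('s3')
s3_candidate_bucket = 'staging'   # placeholder bucket name

s3candidateavtarcopy('AwsTesting/research/', 'AwsTesting/research_archive/')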

Related

Downloading only AWS S3 object file names and image URL in CSV Format

I have files hosted in an AWS S3 bucket, and I need all the S3 object URLs in a CSV file.
Please suggest how to do this.
You can get all S3 object URLs by using the AWS SDK for S3. First, read all the items in the bucket. You can use Python code similar to this Java code (you can port the logic):
ListObjectsRequest listObjects = ListObjectsRequest
        .builder()
        .bucket(bucketName)
        .build();

ListObjectsResponse res = s3.listObjects(listObjects);
List<S3Object> objects = res.contents();

for (ListIterator iterVals = objects.listIterator(); iterVals.hasNext(); ) {
    S3Object myValue = (S3Object) iterVals.next();
    System.out.print("\n The name of the key is " + myValue.key());
}
Then iterate through the list and get each key as shown above. For each object, you can build the URL using code similar to this (again Java, which you can port to Python):
GetUrlRequest request = GetUrlRequest.builder()
        .bucket(bucketName)
        .key(keyName)
        .build();

URL url = s3.utilities().getUrl(request);
System.out.println("The URL for " + keyName + " is " + url.toString());
Put each URL value into a collection and then write the collection out to a CSV. That is how you achieve your use case.
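Since the question asks for Python, here is a rough boto3 port of the same logic. It is a sketch only: it assumes the objects are publicly readable so the plain virtual-hosted-style URL works, and the bucket name and output file name are placeholders.

import csv
import boto3
from urllib.parse import quote

bucket_name = 'my-bucket'   # placeholder
s3 = boto3.client('s3')

# us-east-1 reports its location as None
region = s3.get_bucket_location(Bucket=bucket_name)['LocationConstraint'] or 'us-east-1'

rows = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket_name):
    for obj in page.get('Contents', []):
        key = obj['Key']
        url = f"https://{bucket_name}.s3.{region}.amazonaws.com/{quote(key)}"
        rows.append((key, url))

with open('object_urls.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['key', 'url'])
    writer.writerows(rows)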

Uploading a Video to Azure Media Services with Python SDKs

I am currently looking for a way to upload a video to Azure Media Services (AMS v3) via the Python SDKs. I have followed its instructions and am able to connect to AMS successfully.
Example
credentials = AdalAuthentication(
    context.acquire_token_with_client_credentials,
    RESOURCE,
    CLIENT,
    KEY)

client = AzureMediaServices(credentials, SUBSCRIPTION_ID)  # Successful
I can also successfully get the details of all the videos uploaded via the portal:
for data in client.assets.list(RESOUCE_GROUP_NAME, ACCOUNT_NAME).get(0):
    print(f'Asset_name: {data.name}, file_name: {data.description}')

# Asset_name: 4f904060-d15c-4880-8c5a-xxxxxxxx, file_name: 夢想全紀錄.mp4
# Asset_name: 8f2e5e36-d043-4182-9634-xxxxxxxx, file_name: an552Qb_460svvp9.webm
# Asset_name: aef495c1-a3dd-49bb-8e3e-xxxxxxxx, file_name: world_war_2.webm
# Asset_name: b53d8152-6ecd-41a2-a59e-xxxxxxxx, file_name: an552Qb_460svvp9.webm - Media Encoder Standard encoded
However, when I tried to use the following method, it failed, since I have no idea what to pass as parameters (link to the Python SDKs):
create_or_update(resource_group_name, account_name, asset_name,
                 parameters, custom_headers=None, raw=False, **operation_config)
Therefore, I would like to ask the following questions (everything is done via the Python SDKs):
What kind of parameters does it expect?
Can a video be uploaded directly to AMS or it should be uploaded to Blob Storage first?
Should an Asset contain only one video, or are multiple files fine?
The documentation for the REST version of that method is at https://learn.microsoft.com/en-us/rest/api/media/assets/createorupdate. This is effectively the same as the Python parameters.
Videos are stored in Azure Storage for Media Services. This is true for input assets, the assets that are encoded, and any streamed content. It all is in Storage but accessed by Media Services. You do need to create an asset in Media Services which creates the Storage container. Once the Storage container exists you upload via the Storage APIs to that Media Services created container.
Technically multiple files are fine, but there are a number of issues with doing so that you may not expect. I'd recommend using 1 input video = 1 Media Services asset. On the encoding output side there will be more than one file in the asset: encoding output contains one or more videos, manifests, and metadata files.
I have found a way to work around this using the Python SDKs and REST; however, I am not quite sure it's the proper approach.
Log in to Azure Media Services and Blob Storage via the Python packages:
import adal
from msrestazure.azure_active_directory import AdalAuthentication
from msrestazure.azure_cloud import AZURE_PUBLIC_CLOUD
from azure.mgmt.media import AzureMediaServices
from azure.mgmt.media.models import MediaService
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
Create Assets for the original file and the encoded one by passing these parameters. Example of the original-file Asset creation:
asset_name = 'asset-myvideo'

asset_properties = {
    'properties': {
        'description': 'Original File Description',
        'storageAccountName': "storage-account-name"
    }
}

client.assets.create_or_update(RESOUCE_GROUP_NAME, ACCOUNT_NAME, asset_name, asset_properties)
Upload a video to the Blob Storage container derived from the created original asset:
# Get the Blob Storage container backing the asset
current_container = [data.container for data in client.assets.list(RESOUCE_GROUP_NAME, ACCOUNT_NAME).get(0) if data.name == asset_name][0]

file_name = "myvideo.mp4"
# blob_service_client is assumed to have been created earlier,
# e.g. via BlobServiceClient.from_connection_string(...)
blob_client = blob_service_client.get_blob_client(container=current_container, blob=file_name)

with open('original_video.mp4', 'rb') as data:
    blob_client.upload_blob(data)

print(f'Video uploaded to {current_container}')
And after that, I do Transform, Job, and Streaming Locator to get the video Streaming Link successfully.
I was able to get this to work with the newer python SDK. The python documentation is mostly missing, so I constructed this mainly from the python SDK source code and the C# examples.
azure-storage-blob==12.3.1
azure-mgmt-media==2.1.0
azure-mgmt-resource==9.0.0
adal~=1.2.2
msrestazure~=0.6.3
0) Import a lot of stuff
from azure.mgmt.media.models import Asset, Transform, Job, \
    BuiltInStandardEncoderPreset, TransformOutput, \
    JobInputAsset, JobOutputAsset, AssetContainerSas, AssetContainerPermission, \
    StreamingLocator, StreamingEndpoint
import adal
from msrestazure.azure_active_directory import AdalAuthentication
from msrestazure.azure_cloud import AZURE_PUBLIC_CLOUD
from azure.mgmt.media import AzureMediaServices
from azure.storage.blob import BlobServiceClient, ContainerClient
import datetime as dt
import time

LOGIN_ENDPOINT = AZURE_PUBLIC_CLOUD.endpoints.active_directory
RESOURCE = AZURE_PUBLIC_CLOUD.endpoints.active_directory_resource_id

# AzureSettings is a custom NamedTuple
1) Log in to AMS:
def get_ams_client(settings: AzureSettings) -> AzureMediaServices:
    context = adal.AuthenticationContext(LOGIN_ENDPOINT + '/' +
                                         settings.AZURE_MEDIA_TENANT_ID)
    credentials = AdalAuthentication(
        context.acquire_token_with_client_credentials,
        RESOURCE,
        settings.AZURE_MEDIA_CLIENT_ID,
        settings.AZURE_MEDIA_SECRET
    )
    return AzureMediaServices(credentials, settings.AZURE_SUBSCRIPTION_ID)
2) Create an input and output asset
# create_or_update_asset is a small helper (not shown in the post),
# presumably wrapping client.assets.create_or_update
input_asset = create_or_update_asset(
    input_asset_name, "My Input Asset", client, azure_settings)
output_asset = create_or_update_asset(
    output_asset_name, "My Output Asset", client, azure_settings)
3) Get the container name (most documentation refers to BlockBlobService, which seems to have been removed from the SDK):
def get_container_name(client: AzureMediaServices, asset_name: str, settings: AzureSettings):
    expiry_time = dt.datetime.now(dt.timezone.utc) + dt.timedelta(hours=4)
    container_list: AssetContainerSas = client.assets.list_container_sas(
        resource_group_name=settings.AZURE_MEDIA_RESOURCE_GROUP_NAME,
        account_name=settings.AZURE_MEDIA_ACCOUNT_NAME,
        asset_name=asset_name,
        permissions=AssetContainerPermission.read_write,
        expiry_time=expiry_time
    )
    sas_uri: str = container_list.asset_container_sas_urls[0]
    container_client: ContainerClient = ContainerClient.from_container_url(sas_uri)
    return container_client.container_name
4) Upload a file to the input asset container:
def upload_file_to_asset_container(
        container: str, local_file, uploaded_file_name, settings: AzureSettings):
    blob_service_client = BlobServiceClient.from_connection_string(settings.AZURE_MEDIA_STORAGE_CONNECTION_STRING)
    blob_client = blob_service_client.get_blob_client(container=container, blob=uploaded_file_name)
    with open(local_file, 'rb') as data:
        blob_client.upload_blob(data)
5) Create a transform (in my case, using the adaptive streaming preset):
def get_or_create_transform(
        client: AzureMediaServices,
        transform_name: str,
        settings: AzureSettings):
    transform_output = TransformOutput(preset=BuiltInStandardEncoderPreset(preset_name="AdaptiveStreaming"))
    transform: Transform = client.transforms.create_or_update(
        resource_group_name=settings.AZURE_MEDIA_RESOURCE_GROUP_NAME,
        account_name=settings.AZURE_MEDIA_ACCOUNT_NAME,
        transform_name=transform_name,
        outputs=[transform_output]
    )
    return transform
6) Submit the job:
def submit_job(
        client: AzureMediaServices,
        settings: AzureSettings,
        input_asset: Asset,
        output_asset: Asset,
        transform_name: str,
        correlation_data: dict) -> Job:
    job_input = JobInputAsset(asset_name=input_asset.name)
    job_outputs = [JobOutputAsset(asset_name=output_asset.name)]
    job: Job = client.jobs.create(
        resource_group_name=settings.AZURE_MEDIA_RESOURCE_GROUP_NAME,
        account_name=settings.AZURE_MEDIA_ACCOUNT_NAME,
        job_name=f"test_job_{UNIQUENESS}",  # UNIQUENESS is a unique suffix defined elsewhere
        transform_name=transform_name,
        parameters=Job(input=job_input,
                       outputs=job_outputs,
                       correlation_data=correlation_data)
    )
    return job
7) Then I get the URLs after Event Grid has told me the job is done:
# side-effect warning: this starts the streaming endpoint $$$
def get_urls(client: AzureMediaServices, output_asset_name: str,
             locator_name: str, settings: AzureSettings):
    try:
        locator: StreamingLocator = client.streaming_locators.create(
            resource_group_name=settings.AZURE_MEDIA_RESOURCE_GROUP_NAME,
            account_name=settings.AZURE_MEDIA_ACCOUNT_NAME,
            streaming_locator_name=locator_name,
            parameters=StreamingLocator(
                asset_name=output_asset_name,
                streaming_policy_name="Predefined_ClearStreamingOnly"
            )
        )
    except Exception as ex:
        print("ignoring existing")

    streaming_endpoint: StreamingEndpoint = client.streaming_endpoints.get(
        resource_group_name=settings.AZURE_MEDIA_RESOURCE_GROUP_NAME,
        account_name=settings.AZURE_MEDIA_ACCOUNT_NAME,
        streaming_endpoint_name="default")

    if streaming_endpoint:
        if streaming_endpoint.resource_state != "Running":
            client.streaming_endpoints.start(
                resource_group_name=settings.AZURE_MEDIA_RESOURCE_GROUP_NAME,
                account_name=settings.AZURE_MEDIA_ACCOUNT_NAME,
                streaming_endpoint_name="default"
            )

    paths = client.streaming_locators.list_paths(
        resource_group_name=settings.AZURE_MEDIA_RESOURCE_GROUP_NAME,
        account_name=settings.AZURE_MEDIA_ACCOUNT_NAME,
        streaming_locator_name=locator_name
    )

    return [f"https://{streaming_endpoint.host_name}{path.paths[0]}" for path in paths.streaming_paths]

mock boto3 response for downloading file from S3

I've got code that downloads a file from an S3 bucket using boto3.
# foo.py
import boto3

def dl(src_f, dest_f):
    s3 = boto3.resource('s3')
    s3.Bucket('mybucket').download_file(src_f, dest_f)
I'd now like to write a unit test for dl() using pytest and by mocking the interaction with AWS using the stubber available in botocore.
@pytest.fixture
def s3_client():
    yield boto3.client("s3")

from foo import dl

def test_dl(s3_client):
    with Stubber(s3_client) as stubber:
        params = {"Bucket": ANY, "Key": ANY}
        response = {"Body": "lorem"}
        stubber.add_response(SOME_OBJ, response, params)
        dl('bucket_file.txt', 'tmp/bucket_file.txt')
        assert os.path.isfile('tmp/bucket_file.txt')
I'm not sure about the right approach for this. How do I add bucket_file.txt to the stubbed response? What object do I need to add_response() to (shown as SOME_OBJ)?
Have you considered using moto?
Your code could look the same way as it is right now:
# foo.py
import boto3

def dl(src_f, dest_f):
    s3 = boto3.resource('s3')
    s3.Bucket('mybucket').download_file(src_f, dest_f)
and the test:
import boto3
import os
from moto import mock_s3
from foo import dl

@mock_s3
def test_dl():
    s3 = boto3.client('s3', region_name='us-east-1')
    # We need to create the bucket since this is all in Moto's 'virtual' AWS account
    s3.create_bucket(Bucket='mybucket')
    s3.put_object(Bucket='mybucket', Key='bucket_file.txt', Body='')
    dl('bucket_file.txt', 'bucket_file.txt')
    assert os.path.isfile('bucket_file.txt')
The intention of the code becomes a bit more obvious, since you simply work with S3 as usual, except that there is no real S3 behind the method calls.
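If you prefer to keep the pytest-fixture style from the question, the same moto setup can be wrapped in a fixture; a sketch, assuming moto is installed and dl is imported from foo:

import boto3
import pytest
from moto import mock_s3
from foo import dl

@pytest.fixture
def s3_bucket():
    # moto mocks can also be used as context managers
    with mock_s3():
        s3 = boto3.client('s3', region_name='us-east-1')
        s3.create_bucket(Bucket='mybucket')
        s3.put_object(Bucket='mybucket', Key='bucket_file.txt', Body='lorem')
        yield s3

def test_dl(s3_bucket, tmp_path):
    dest = tmp_path / 'bucket_file.txt'
    dl('bucket_file.txt', str(dest))
    assert dest.is_file()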

how do I test methods using boto3 with moto

I am writing test cases for a quick class to find/fetch keys from S3, using boto3. I have used moto in the past to test boto (not boto3) code, but am trying to move to boto3 with this project and am running into an issue:
class TestS3Actor(unittest.TestCase):
    @mock_s3
    def setUp(self):
        self.bucket_name = 'test_bucket_01'
        self.key_name = 'stats_com/fake_fake/test.json'
        self.key_contents = 'This is test data.'
        s3 = boto3.session.Session().resource('s3')
        s3.create_bucket(Bucket=self.bucket_name)
        s3.Object(self.bucket_name, self.key_name).put(Body=self.key_contents)
error:
...
File "/Library/Python/2.7/site-packages/botocore/vendored/requests/packages/urllib3/connectionpool.py", line 344, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
File "/Library/Python/2.7/site-packages/botocore/vendored/requests/packages/urllib3/connectionpool.py", line 314, in _raise_timeout
if 'timed out' in str(err) or 'did not complete (read)' in str(err): # Python 2.6
TypeError: __str__ returned non-string (type WantWriteError)
botocore.hooks: DEBUG: Event needs-retry.s3.CreateBucket: calling handler <botocore.retryhandler.RetryHandler object at 0x10ce75310>
It looks like moto is not mocking out the boto3 call correctly - how do I make that work?
What worked for me is setting up the environment with boto before running my mocked tests with boto3.
Here's a working snippet:
import unittest
import boto
from boto.s3.key import Key
from moto import mock_s3
import boto3


class TestS3Actor(unittest.TestCase):
    mock_s3 = mock_s3()

    def setUp(self):
        self.mock_s3.start()
        self.location = "eu-west-1"
        self.bucket_name = 'test_bucket_01'
        self.key_name = 'stats_com/fake_fake/test.json'
        self.key_contents = 'This is test data.'

        s3 = boto.connect_s3()
        bucket = s3.create_bucket(self.bucket_name, location=self.location)
        k = Key(bucket)
        k.key = self.key_name
        k.set_contents_from_string(self.key_contents)

    def tearDown(self):
        self.mock_s3.stop()

    def test_s3_boto3(self):
        s3 = boto3.resource('s3', region_name=self.location)
        bucket = s3.Bucket(self.bucket_name)
        assert bucket.name == self.bucket_name

        # retrieve already setup keys
        keys = list(bucket.objects.filter(Prefix=self.key_name))
        assert len(keys) == 1
        assert keys[0].key == self.key_name

        # update key
        s3.Object(self.bucket_name, self.key_name).put(Body='new')
        key = s3.Object(self.bucket_name, self.key_name).get()
        assert 'new' == key['Body'].read()
When run with py.test test.py you get the following output:
collected 1 items
test.py .
========================================================================================= 1 passed in 2.22 seconds =========================================================================================
According to this information, it looks like streaming upload to s3 using Boto3 S3 Put is not yet supported.
In my case, I used following to successfully upload an object to a bucket:
s3.Object(self.s3_bucket_name, self.s3_key).put(Body=open("file_to_upload", 'rb'))
where "file_to_upload" is your local file to be uploaded to s3 bucket. For your test case, you can just create a temporary file to check this functionality:
test_file = open("test_file.json", "w")
test_file.write("some test contents")
test_file.close()

s3.Object(self.s3_bucket_name, self.s3_key).put(Body=open("test_file.json", 'rb'))

How to create a signed cloudfront URL with Python?

I would like to know how to create a signed URL for CloudFront. The current working solution is unsecured, and I would like to switch the system to secure URLs.
I have tried using Boto 2.5.2 and Django 1.4
Is there a working example on how to use the boto.cloudfront.distribution.create_signed_url method? or any other solution that works?
I have tried the following code using the BOTO 2.5.2 API
def get_signed_url():
    import boto, time, pprint
    from boto import cloudfront
    from boto.cloudfront import distribution

    AWS_ACCESS_KEY_ID = 'YOUR_AWS_ACCESS_KEY_ID'
    AWS_SECRET_ACCESS_KEY = 'YOUR_AWS_SECRET_ACCESS_KEY'
    KEYPAIR_ID = 'YOUR_KEYPAIR_ID'
    KEYPAIR_FILE = 'YOUR_FULL_PATH_TO_FILE.pem'
    CF_DISTRIBUTION_ID = 'E1V7I3IOVHUU02'

    my_connection = boto.cloudfront.CloudFrontConnection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    distros = my_connection.get_all_streaming_distributions()
    oai = my_connection.create_origin_access_identity('my_oai', 'An OAI for testing')
    distribution_config = my_connection.get_streaming_distribution_config(CF_DISTRIBUTION_ID)
    distribution_info = my_connection.get_streaming_distribution_info(CF_DISTRIBUTION_ID)

    my_distro = boto.cloudfront.distribution.Distribution(connection=my_connection, config=distribution_config, domain_name=distribution_info.domain_name, id=CF_DISTRIBUTION_ID, last_modified_time=None, status='Active')

    s3 = boto.connect_s3()
    BUCKET_NAME = "YOUR_S3_BUCKET_NAME"
    bucket = s3.get_bucket(BUCKET_NAME)
    object_name = "FULL_URL_TO_MP4_EXCLUDING_S3_URL_DOMAIN_NAME EG( my/path/video.mp4)"
    key = bucket.get_key(object_name)
    key.add_user_grant("READ", oai.s3_user_id)

    SECS = 8000
    OBJECT_URL = 'FULL_S3_URL_TO_FILE.mp4'
    my_signed_url = my_distro.create_signed_url(OBJECT_URL, KEYPAIR_ID, expire_time=time.time() + SECS, valid_after_time=None, ip_address=None, policy_url=None, private_key_file=KEYPAIR_FILE, private_key_string=KEYPAIR_ID)
Everything seems fine until the method create_signed_url. It returns an error.
Exception Value: Only specify the private_key_file or the private_key_string not both
Omit the private_key_string:
my_signed_url = my_distro.create_signed_url(OBJECT_URL, KEYPAIR_ID,
                                            expire_time=time.time() + SECS,
                                            private_key_file=KEYPAIR_FILE)
That parameter is used to pass the actual contents of the private key file, as a string. The comments in the source explain that only one of private_key_file or private_key_string should be passed.
You can also omit all the kwargs which are set to None, since None is the default.
