Copy files between S3 buckets using Python

I have an Excel sheet with metadata containing 3 fields (path, folder_structure, filename):
path: the path of the source file in the S3 source bucket
folder_structure: the new folder structure that needs to be created in the target bucket
filename: the name the file should be renamed to after copying to the target bucket
The code below works with a Windows source folder: it creates the target folders and copies the data into them. I need to modify it to read from an S3 bucket and load into another S3 bucket.
code:
import pandas as pd
import os, shutil
from pathlib import Path

data = pd.read_excel('c:\data\sample_requirement.xlsx', engine='openpyxl')
root_dir = 'source'

for rec in range(len(data)):
    # Replace the '|' symbol with a backslash
    dire = data['folder_structure'][rec].replace('|', '\\')
    # Append the folder structure to the root directory
    directory = root_dir + '\\' + dire
    # Check if the path exists; if it exists -> skip, else -> create it
    if not os.path.exists(directory):
        os.makedirs(directory)
    # Path from the Excel sheet
    path = data['path'][rec]
    # Filename to rename to
    filename = data['filename'][rec]
    if not os.path.isfile(directory + filename):
        # Copy the file to the created path
        shutil.copy(path, directory)
        # Rename the file
        try:
            os.rename(directory + os.path.basename(path), directory + filename)
        except FileExistsError as e:
            print('File name already exists')

How about this? Just add the snippet below to your code, replacing the source and destination bucket names:
import boto3

s3 = boto3.resource('s3')
copy_source = {
    'Bucket': 'yoursourcebucket',
    'Key': 'yourkey'
}
s3.meta.client.copy(copy_source, 'nameofdestinationbucket', 'destinationkey')
It's good practice to follow the docs to understand the details of the code. Also note that there may be many other ways to perform the same operation, for example using the AWS CLI: https://stackoverflow.com/a/32526487/13126651
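Putting the pieces above together for the original Excel-driven requirement, here is a minimal sketch (not a drop-in solution): it assumes the Excel columns path, folder_structure and filename from the question, that path holds the source object key, that folder_structure uses '|' as a separator, and that the bucket names are placeholders.
import boto3
import pandas as pd

s3 = boto3.client('s3')
SOURCE_BUCKET = 'your-source-bucket'  # placeholder
TARGET_BUCKET = 'your-target-bucket'  # placeholder

data = pd.read_excel('sample_requirement.xlsx', engine='openpyxl')

for rec in range(len(data)):
    # Source object key taken from the 'path' column
    source_key = data['path'][rec]
    # Build the destination prefix from the '|'-separated folder structure
    prefix = data['folder_structure'][rec].replace('|', '/')
    # Destination key = new folder structure + new filename
    target_key = prefix + '/' + data['filename'][rec]
    # S3 has no real folders, so copying to the new key is enough to
    # "create" the folder structure in the target bucket
    s3.copy({'Bucket': SOURCE_BUCKET, 'Key': source_key}, TARGET_BUCKET, target_key)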

Copy one file from one bucket to another:
s3 = boto3.client("s3")
s3.copy({"Bucket": SOURCE_BUCKET, "Key": SOURCE_KEY}, DESTINATION_BUCKET, DESTINATION_KEY)
Copy up to 1000 files from one bucket to another:
s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket=SOURCE_BUCKET,
    Prefix=SOURCE_PREFIX,
)  # Warning: not handling pagination -> will truncate after 1000 keys
for file in response['Contents']:
    s3.copy(
        {"Bucket": SOURCE_BUCKET, "Key": file['Key']},
        DESTINATION_BUCKET,
        "/".join([
            DESTINATION_PREFIX,
            file['Key'],
        ]),
    )
Copy more than 1000 files:
properly handle pagination when calling list_objects_v2
loop while response['IsTruncated'] is true
pass response['NextContinuationToken'] as the ContinuationToken argument of the next call
A sketch of that loop is shown below.
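A minimal sketch of that pagination loop, assuming the same SOURCE_BUCKET, SOURCE_PREFIX, DESTINATION_BUCKET and DESTINATION_PREFIX placeholders as in the snippets above:
import boto3

s3 = boto3.client("s3")

continuation_token = None
while True:
    kwargs = {"Bucket": SOURCE_BUCKET, "Prefix": SOURCE_PREFIX}
    if continuation_token:
        kwargs["ContinuationToken"] = continuation_token
    response = s3.list_objects_v2(**kwargs)
    # Copy every key in this page of results
    for obj in response.get("Contents", []):
        s3.copy(
            {"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            DESTINATION_BUCKET,
            "/".join([DESTINATION_PREFIX, obj["Key"]]),
        )
    if not response.get("IsTruncated"):  # no more pages
        break
    continuation_token = response["NextContinuationToken"]
Alternatively, client.get_paginator('list_objects_v2') handles the continuation token for you, as in the download example further down.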

Related

Download Entire Content of a subfolder in a S3 bucket

I have a bucket in S3 called "sample-data". Inside the bucket I have folders labelled "A" to "Z".
Inside each alphabetical folder there are more files and folders. What is the fastest way to download an alphabetical folder and all its content?
For example --> sample-data/a/foo.txt, more_files/foo1.txt
In the above example, the bucket sample-data contains a folder called a, which contains foo.txt and a folder called more_files, which contains foo1.txt.
I know how to download a single file. For instance, if I wanted foo.txt I would do the following.
s3 = boto3.client('s3')
s3.download_file("sample-data", "a/foo.txt", "foo.txt")
However, I am wondering if I can download the folder called a and all of its contents entirely? Any help would be appreciated.
I think your best bet would be the awscli
aws s3 cp --recursive s3://mybucket/your_folder_named_a path/to/your/destination
From the docs:
--recursive (boolean) Command is performed on all files or objects under the specified directory or prefix.
EDIT:
To do this with boto3 try this:
import os
import errno
import boto3

client = boto3.client('s3')

def assert_dir_exists(path):
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise

def download_dir(bucket, path, target):
    # Handle missing / at end of prefix
    if not path.endswith('/'):
        path += '/'
    paginator = client.get_paginator('list_objects_v2')
    for result in paginator.paginate(Bucket=bucket, Prefix=path):
        # Download each file individually
        for key in result['Contents']:
            # Calculate relative path
            rel_path = key['Key'][len(path):]
            # Skip paths ending in /
            if not key['Key'].endswith('/'):
                local_file_path = os.path.join(target, rel_path)
                # Make sure directories exist
                local_file_dir = os.path.dirname(local_file_path)
                assert_dir_exists(local_file_dir)
                client.download_file(bucket, key['Key'], local_file_path)

download_dir('your_bucket', 'your_folder', 'destination')
You list all the objects in the folder you want to download, then iterate over the files and download each one.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket=BUCKET,
    Prefix='DIR1/DIR2',
)
The response is a dict. The key that contains the list of file names is "Contents".
Here is more information:
list all files in a bucket
boto3 documentation
I am not sure if this is the fastest solution, but it can help you.
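A minimal sketch of the download step, assuming the BUCKET name and the 'DIR1/DIR2' prefix from the listing above and writing each object under a local 'destination' directory (the local directory name is just an illustration):
import os
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=BUCKET, Prefix='DIR1/DIR2')

for obj in response.get('Contents', []):
    key = obj['Key']
    if key.endswith('/'):
        continue  # skip "folder" placeholder objects
    # Recreate the key's directory structure locally before downloading
    local_path = os.path.join('destination', key)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    s3.download_file(BUCKET, key, local_path)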

Copy files from one AWS S3 bucket/folder to another AWS S3 bucket/folder and keep the deepest sub-folder name, using Python on Databricks

My question is related to the previous one, copy files from one AWS/S3 bucket to another bucket on databricks.
I created a new thread because this question is different from the previous one.
This post AWS S3 copy files and folders between two buckets does not help me.
I need to copy some files from one AWS S3 bucket/folder to another AWS S3 bucket/folder with Python on Databricks.
My source S3 bucket/folder is like:
source_s3_bucket
    folder_name1
        folder_name2
            folder_name3
                folder_name4
                    deepest_folder_name
                        file1
                        file2
                        ....
                        file11500
The destination s3 bucket/folder:
destination_s3_bucket
    dest_folder_name1
        dest_folder_name2
            dest_folder_name3
                deepest_folder_name (this folder name must be exactly the same as the source "deepest_folder_name")
                    file1
                    file2
                    ....
                    file11500
Also, the "dest_folder_nameX" folders are all different from the source ones, and the depth of the source and destination folders is also different. But the deepest folder name in the source bucket must be kept in the destination bucket.
All files must be copied exactly and keep the same names.
I have tried the following Python 3 code:
import boto3

s3 = boto3.client('s3')
s3_resource = boto3.resource('s3')

for key in s3.list_objects(Bucket=source_bucket, Prefix=source_prefix)['Contents']:
    files = key['Key']
    copy_source = {'Bucket': source_bucket, 'Key': files}
    s3_resource.meta.client.copy(CopySource=copy_source, Bucket=dest_bucket, Key=dest_prefix)
But no files are copied to the destination folder, and how can I keep the "deepest_folder_name"?
UPDATE
"The deepest folder" means that I have to keep that layer's sub-folders' names and copy them and the files located in them to the destination.
for example,
in source bucket:
folder_name_abc
folder_name_dfr
folder_name_typ # this folder names must be kept
file1
file2
In destination bucket:
folder_name_typ # this folder names must be exactly same as the source
file1
file2
thanks
The tricky part is manipulating the 'path' portion of the object Keys.
You could use something like this:
import boto3

s3_client = boto3.client('s3')

SOURCE_BUCKET = 'bucket1'
SOURCE_PREFIX = 'folder_name_abc/folder_name_dfr/'  # Where is the folder located? (Leave blank if root level, include slash at end if a prefix is specified)
FOLDER_TO_COPY = 'folder_name_typ'
DESTINATION_BUCKET = 'bucket2'
DESTINATION_PREFIX = ''  # (Leave blank if root level, include slash at end if a prefix is specified)

# List objects in the source directory
bucket_listing = s3_client.list_objects_v2(Bucket=SOURCE_BUCKET, Prefix=f'{SOURCE_PREFIX}{FOLDER_TO_COPY}/')

for object in bucket_listing['Contents']:
    print(f"Copying from {object['Key']} to {DESTINATION_PREFIX + object['Key'][len(SOURCE_PREFIX):]}")
    s3_client.copy_object(
        CopySource={'Bucket': SOURCE_BUCKET, 'Key': object['Key']},
        Bucket=DESTINATION_BUCKET,
        Key=DESTINATION_PREFIX + object['Key'][len(SOURCE_PREFIX):]  # Remove source prefix, add destination prefix
    )

Python - download entire directory from Google Cloud Storage

The following page
https://googlecloudplatform.github.io/google-cloud-python/latest/storage/blobs.html
lists all the API calls that can be used for Python & Google Cloud Storage. Even the "official" samples on GitHub
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/storage/cloud-client/snippets.py
don't have a related example.
Finally, downloading a directory with the same method used for downloading files gives the error
Error: [Errno 21] Is a directory:
You just have to first list all the files in a directory and then download them one by one:
from google.cloud import storage

bucket_name = 'your-bucket-name'
prefix = 'your-bucket-directory/'
dl_dir = 'your-local-directory/'

storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name=bucket_name)
blobs = bucket.list_blobs(prefix=prefix)  # Get list of files
for blob in blobs:
    filename = blob.name.replace('/', '_')
    blob.download_to_filename(dl_dir + filename)  # Download
blob.name includes the entire directory structure + filename, so if you want the same file name as in the bucket, you might want to extract it first (instead of replacing / with _)
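For example, a small sketch of that extraction, reusing dl_dir and the blobs listing from the snippet above and keeping only the basename of each blob:
import os

for blob in blobs:
    filename = os.path.basename(blob.name)  # keep only the file name, drop the directory part
    blob.download_to_filename(dl_dir + filename)
Note that two blobs with the same basename in different folders would overwrite each other; the next answer shows how to keep the directory structure instead.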
If you want to keep the same directory structure without renaming and also create nested folders, here is a solution for Python 3.5+ based on ksbg's answer:
from pathlib import Path
from google.cloud import storage

bucket_name = 'your-bucket-name'
prefix = 'your-bucket-directory/'
dl_dir = 'your-local-directory/'

storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name=bucket_name)
blobs = bucket.list_blobs(prefix=prefix)  # Get list of files
for blob in blobs:
    if blob.name.endswith("/"):
        continue
    file_split = blob.name.split("/")
    directory = "/".join(file_split[0:-1])
    Path(directory).mkdir(parents=True, exist_ok=True)
    blob.download_to_filename(blob.name)
Let's say we want to download FINALFOLDER from the storage path gs://TEST_BUCKET_NAME/FOLDER1/FOLDER2/FINALFOLDER.
After downloading, the final path will look like: D:\my_blob_data\FINALFOLDER
from os import makedirs
from os.path import join, isdir, isfile, basename
from google.cloud import storage

# If your environment is authenticated, the default config will be picked up
storage_client = storage.Client()  # comment this line if you want to use a service account
# Uncomment the line below if you have a service account json
# storage_client = storage.Client.from_service_account_json('creds/sa.json')

bucket_name = 'TEST_BUCKET_NAME'
prefix = 'FOLDER2'
dst_path = 'D:\\my_blob_data'

if not isdir(dst_path):
    makedirs(dst_path)

bucket = storage_client.bucket(bucket_name=bucket_name)
blobs = bucket.list_blobs(prefix=prefix)  # Get list of files
for blob in blobs:
    blob_name = blob.name
    dst_file_name = blob_name.replace('FOLDER1/FOLDER2', dst_path)  # .replace('FOLDER1/FOLDER2', 'D:\\my_blob_data')
    # Extract the final directory and create it in the destination path if it does not exist
    dst_dir = dst_file_name.replace('/' + basename(dst_file_name), '')
    if not isdir(dst_dir):
        makedirs(dst_dir)
    # Download the blob object
    blob.download_to_filename(dst_file_name)
Using the TensorFlow gfile package, here is a recursive function.
root_dir is the GCS parent folder.
local_base_dir is the parent folder created locally.
import os
import tensorflow as tf

def copy_recursively(root_dir, local_base_dir):
    if tf.io.gfile.exists(local_base_dir):
        tf.io.gfile.rmtree(local_base_dir)
    tf.io.gfile.mkdir(local_base_dir)

    file_list = tf.io.gfile.glob(root_dir + '/**')
    for item in file_list:
        if not tf.io.gfile.isdir(item):
            fname = item.rsplit('/', 1)[-1]
            if not fname.startswith('.'):
                tf.io.gfile.copy(item,
                                 os.path.join(local_base_dir, fname),
                                 overwrite=False)
        else:
            child_dir = item.rsplit('/', 1)[-1]
            full_dir_path = os.path.join(local_base_dir, child_dir)
            print(f"Setting up child directory: {full_dir_path}")
            copy_recursively(item, full_dir_path)

root_dir = 'gs://.../.../..'
local_base_dir = root_dir.rsplit('/', 1)[-1]

copy_recursively(root_dir, local_base_dir)
Refer to this link: https://medium.com/#sandeepsinh/multiple-file-download-form-google-cloud-storage-using-python-and-gcs-api-1dbcab23c44
1 - Add your credential JSON
2 - List bucket items
3 - Download
import logging
import os
from google.cloud import storage

global table_id
global bucket_name

logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.DEBUG)

bucket_name = 'mybucket'
table_id = 'shakespeare'
storage_client = storage.Client.from_service_account_json('/google-cloud/keyfile/service_account.json')
# The "folder" where the files you want to download are
folder = '/google-cloud/download/{}'.format(table_id)
delimiter = '/'
bucket = storage_client.get_bucket(bucket_name)
blobs = bucket.list_blobs(prefix=table_id, delimiter=delimiter)  # List all objects that satisfy the filter

# Download the files to a destination
def download_to_local():
    logging.info('File download started... Wait for the job to complete.')
    # Create this folder locally if it does not exist
    if not os.path.exists(folder):
        os.makedirs(folder)
    # Iterate through the blobs one by one using the API
    for blob in blobs:
        logging.info('Blob: {}'.format(blob.name))
        destination_uri = '{}/{}'.format(folder, blob.name)
        blob.download_to_filename(destination_uri)
        logging.info('Exported {} to {}'.format(blob.name, destination_uri))

if __name__ == '__main__':
    download_to_local()

Remove absolute path while uploading file on S3

I am trying to upload a file to S3 in my bucket, using the following code, which is working absolutely fine.
#!/usr/bin/python
import os
import boto
import boto.s3.connection
from boto.s3.key import Key
from boto.s3.connection import S3Connection
from datetime import datetime

try:
    conn = boto.s3.connect_to_region('us-east-1',
        aws_access_key_id = 'AKXXXXXXXXXXXXA',
        aws_secret_access_key = 'cXXXXXXXXXXXXXXXXXXXXXXXXX2',
        calling_format = boto.s3.connection.OrdinaryCallingFormat(),
        )
    print conn
    filename = '/home/rahul/GitDjangopostgres/AWSNewView/company/ExtraPaymentDocuments/Argus_ATM_e_surveillance/Gujarat/Ahmedabad/P3DCAM04/cecb834f-ae85-49e3-b8a1-a6nC^U)3GcZ)M62d643aa7-d047-498c-bf59-8__Invoice (7).pdf'
    bucket = conn.get_bucket('bucketName', validate=False)
    key_name = filename
    print "file to upload", key_name
    secure_https_url = 'https://{host}/{bucket}{key}'.format(
        host=conn.server_name(),
        bucket='bucketName',
        key=key_name)
    print "secure_https_url", secure_https_url
    k = bucket.new_key(key_name)
    mp = k.set_contents_from_filename(key_name)
    print "File uploaded successfully"
except Exception, e:
    print str(e)
    print "error"
Now the problem is that, since my filename is '/home/rahul/GitDjangopostgres/AWSNewView/company/ExtraPaymentDocuments/Argus_ATM_e_surveillance/Gujarat/Ahmedabad/P3DCAM04/cecb834f-ae85-49e3-b8a1-a6nC^U)3GcZ)M62d643aa7-d047-498c-bf59-8__Invoice (7).pdf', it creates a hierarchical structure in the bucket when storing my file, so I get the file path https://s3.amazonaws.com/bucketName/home/rahul/GitDjangopostgres/AWSNewView/company/ExtraPaymentDocuments/Argus_ATM_e_surveillance/Gujarat/Ahmedabad/P3DCAM04/cecb834f-ae85-49e3-b8a1-a6nC^U)3GcZ)M62d643aa7-d047-498c-bf59-8__Invoice (7).pdf. I want to change this hierarchy to https://s3.amazonaws.com/bucketName/ExtraPaymentDocuments/Argus_ATM_e_surveillance/Gujarat/Ahmedabad/P3DCAM04/cecb834f-ae85-49e3-b8a1-a6nC^U)3GcZ)M62d643aa7-d047-498c-bf59-8__Invoice (7).pdf. Is there any option to do this with boto, or should I handle it in Python, given that uploading a file to S3 requires the absolute path of the file? I am using Django and this is my Celery task.
The function set_contents_from_filename(key_name) takes whatever the key name is and puts it into S3 as-is. If your key name contains any /, it will create a hierarchical structure. For your situation I suggest you create two paths. One is your base local path, which contains all the files that you want to upload to S3 (use os.path.join for building it). The other is the AWS path, which is the hierarchy you want to create in your S3 bucket. For example, you can declare the AWS path as:
/ExtraPaymentDocuments/Argus_ATM_e_surveillance/Gujarat/Ahmedabad/P3DCAM04
Then you append the local filename to the parameter key_name, which you pass to the function new_key.
Boto will work fine here.
Example:
Create the S3 path that you want in your S3 storage, and also add a base local path which contains all the files that you want to upload. The filename has to be appended to form a full file path. The new_key function will create a key, i.e. the path under which your file is stored. The set_contents_from_filename function takes a local file path and stores the file in S3 under the key (path) created above.
k = bucket.new_key(s3_path + filename)
mp = k.set_contents_from_filename(base_local_path + filename)
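The same idea with boto3, as an illustrative sketch rather than the answer's boto code: keep only the part of the local path you want as the S3 key (the paths and bucket name below are placeholders).
import os
import boto3

s3 = boto3.client('s3')

local_path = '/home/rahul/some/local/dir/Invoice (7).pdf'  # hypothetical local file
bucket = 'bucketName'  # placeholder
# Keep only the desired prefix plus the file's base name as the S3 key
key = 'ExtraPaymentDocuments/Argus_ATM_e_surveillance/' + os.path.basename(local_path)

# upload_file reads from the local absolute path but stores the object under the short key
s3.upload_file(local_path, bucket, key)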

Download S3 Files with Boto

I am trying to set up an app where users can download their files stored in an S3 bucket. I am able to set up my bucket and get the correct file, but it won't download, giving me this error: No such file or directory: 'media/user_1/imageName.jpg'. Any idea why? This seems like a relatively easy problem, but I can't quite seem to get it. I can delete an image properly, so it is able to identify the correct image.
Here's my views.py
def download(request, project_id=None):
    conn = S3Connection('AWS_BUCKET_KEY', 'AWS_SECRET_KEY')
    b = Bucket(conn, 'BUCKET_NAME')
    k = Key(b)
    instance = get_object_or_404(Project, id=project_id)
    k.key = 'media/' + str(instance.image)
    k.get_contents_to_filename(str(k.key))
    return redirect("/dashboard/")
The problem is that you are downloading to a local directory that doesn't exist (media/user1). You need to either:
Create the directory on the local machine first (a sketch for this option is shown below)
Just use the filename rather than a full path
Use the full path, but replace slashes (/) with another character -- this will ensure uniqueness of the filename without having to create directories
The last option could be achieved via:
k.get_contents_to_filename(str(k.key).replace('/', '_'))
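For the first option, a minimal sketch that reuses the Key object k from the view above: create the missing local directory before writing the file.
import os

local_path = str(k.key)  # e.g. 'media/user_1/imageName.jpg'
local_dir = os.path.dirname(local_path)
if local_dir and not os.path.exists(local_dir):
    os.makedirs(local_dir)
k.get_contents_to_filename(local_path)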
See also: Boto3 to download all files from a S3 Bucket
Downloading files using boto3 is very simple; configure your AWS credentials at the system level before using this code.
import boto3

client = boto3.client('s3')

# If your bucket name is mybucket and the file path is test/abc.txt,
# then Bucket='mybucket' and Prefix='test'
resp = client.list_objects_v2(Bucket="<your bucket name>", Prefix="<prefix of the s3 folder>")
for obj in resp['Contents']:
    key = obj['Key']
    # To read the S3 file contents as a string
    response = client.get_object(Bucket="<your bucket name>", Key=key)
    print(response['Body'].read().decode('utf-8'))
    # To download the file locally
    client.download_file('<your bucket name>', key, key.replace('test', ''))
The replace is there so the file is saved locally under the S3 file name; if you don't replace, it will try to save it as 'test/abc.txt'.
import os
import boto3
import json

s3 = boto3.resource('s3', aws_access_key_id="AKIAxxxxxxxxxxxxJWB",
                    aws_secret_access_key="LV0+vsaxxxxxxxxxxxxxxxxxxxxxry0/LjxZkN")
my_bucket = s3.Bucket('s3testing')

# Download files into the current directory
for s3_object in my_bucket.objects.all():
    # Need to split s3_object.key into path and file name, else it will give a "file not found" error
    path, filename = os.path.split(s3_object.key)
    my_bucket.download_file(s3_object.key, filename)
