Airflow : Download latest file from S3 with Wildcard - python

Requirement: download the latest (i.e., most recent) file from s3.
Sample files in s3:
bucketname/2020/09/reporting_2020_09_20200902000335.zip
bucketname/2020/09/reporting_2020_09_20200901000027.zip
When I pass the s3_src_key as /2020/09/reporting_2020_09_20200902, the following doesn't work.
Code:
with tempfile.NamedTemporaryFile('r') as f_source, tempfile.NamedTemporaryFile('w') as f_target:
    s3_client.download_file(self.s3_src_bucket, self.s3_src_key, f_source.name)
The below works fine:
import os
import boto3

bucket = 'bucketname'
key = '/2020/09/reporting_2020_09_20200902'
s3_resource = boto3.resource('s3')
my_bucket = s3_resource.Bucket(bucket)
objects = my_bucket.objects.filter(Prefix=key)
for obj in objects:
    path, filename = os.path.split(obj.key)
    my_bucket.download_file(obj.key, filename)
I need help with how to use a wildcard like this in Airflow.

You can list the objects that match a given prefix with the Python SDK, but then you'll need to write code that decides which one of them is the latest.
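For example, a minimal sketch along these lines (the bucket and prefix names are taken from the question; list_objects_v2 returns each object's LastModified timestamp, which you can sort on; inside an Airflow operator you would typically get the boto3 client from an S3Hook instead of creating it directly):
import boto3

def download_latest(bucket, prefix, local_path):
    """Download the most recently modified object under the given prefix."""
    s3_client = boto3.client('s3')
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    contents = response.get('Contents', [])
    if not contents:
        raise ValueError("No objects found under prefix %r" % prefix)
    # Pick the newest object by its LastModified timestamp.
    latest = max(contents, key=lambda obj: obj['LastModified'])
    s3_client.download_file(bucket, latest['Key'], local_path)
    return latest['Key']

# Usage with the sample keys above (note: no leading slash in the prefix):
# download_latest('bucketname', '2020/09/reporting_2020_09_', '/tmp/latest.zip')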

Related

Copy files between two GCS buckets which are partitioned by date

I have a requirement to copy files between two buckets, detailed below:
Bucket A / folder A is the source inbound box for daily files, which are created as f1_abc_20210304_000. I want to scan folder A (10 files arrive every day), pick up the latest file, and copy it to Bucket B / folder B / FILE name (i.e. from the 10 files) / 2021/03/04, dropping the files into the 04 folder.
Any suggestion on how I should proceed with the design?
Thanks
RG
Did you want to do this copy task using Airflow?
If yes, Airflow provides the GCSToGCSOperator.
One approach is to use the client libraries; in the example below I'm using the Python client library for Google Cloud Storage.
move.py
from google.cloud import storage
from google.oauth2 import service_account
import os

# as mentioned at https://cloud.google.com/docs/authentication/production
key_path = "credentials.json"
credentials = service_account.Credentials.from_service_account_file(key_path)
storage_client = storage.Client(credentials=credentials)

bucket_name = "source-bucket-id"
destination_bucket_name = "destination-bucket-id"
source_bucket = storage_client.bucket(bucket_name)

# prefix 'original_data' is the folder where I store the data
array_blobs = source_bucket.list_blobs(prefix='original_data')

filtered_dict = []
for blob in array_blobs:
    if str(blob.name).endswith('.csv'):
        # add additional logic to handle the files you want to ingest
        filtered_dict.append({'name': blob.name, 'time': blob.time_created})

orderedlist = sorted(filtered_dict, key=lambda d: d['time'], reverse=True)
latestblob = orderedlist[0]['name']

# prefix 'destination_data' is the folder where I want to move the data
destination_blob_name = "destination_data/{}".format(os.path.basename(latestblob))

source_blob = source_bucket.blob(latestblob)
destination_bucket = storage_client.bucket(destination_bucket_name)
blob_copy = source_bucket.copy_blob(source_blob, destination_bucket, destination_blob_name)

print(
    "Blob {} in bucket {} copied to blob {} in bucket {}.".format(
        source_blob.name,
        source_bucket.name,
        blob_copy.name,
        destination_bucket.name,
    )
)
For a bit of context on the code: I used the Google Cloud Storage Python client to log in, list the files under my source folder original_data inside the bucket source-bucket-id, and collect the relevant ones (you can modify the pick-up logic by adding your own criteria to fit your situation). After that, I pick the latest file based on creation time and use its name to copy it into destination-bucket-id. As a note, the destination_blob_name variable includes both the folder where I want to place the file and the final filename.
UPDATE: I missed the airflow tag. In that case you should use the operator that comes with the Google provider, which is GCSToGCSOperator. The parameters to pass can be obtained using a Python task and passed to your operator. It will work like this:
from airflow.decorators import task
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

@task(task_id="get_gcs_params", multiple_outputs=True)
def get_gcs_params(**kwargs):
    date = kwargs["next_ds"]
    # logic should be as displayed on move.py
    # ...
    return {"source_objects": source, "destination_object": destination}

gcs_params = get_gcs_params()

copy_file = GCSToGCSOperator(
    task_id='copy_single_file',
    source_bucket='data',
    source_objects=gcs_params['source_objects'],
    destination_bucket='data_backup',
    destination_object=gcs_params['destination_object'],
    gcp_conn_id=google_cloud_conn_id
)
For additional guidance you can check the Cloud Storage examples list. I used the "Copy an object between buckets" example for guidance.

Mock download file from s3 with actual file

I would like to write a test that mocks the download of a file from s3 and replaces it locally with an actual file that exists on my machine. I took inspiration from this post. The idea is the following:
from moto import mock_s3
import boto3

def dl(src_f, dest_f):
    s3 = boto3.resource('s3')
    s3.Bucket('fake_bucket').download_file(src_f, dest_f)

@mock_s3
def _create_and_mock_bucket():
    # Create fake bucket and mock it
    bucket = "fake_bucket"
    # We need to create the bucket since this is all in Moto's 'virtual' AWS account
    file_path = "some_real_file.txt"
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket=bucket)
    s3.put_object(Bucket=bucket, Key=file_path, Body="")
    dl(file_path, 'some_other_real_file.txt')

_create_and_mock_bucket()
Now some_other_real_file.txt exists, but it is not a copy of some_real_file.txt. Any idea on how to do that?
If 'some_real_file.txt' already exists on your system, you should use upload_file instead:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_file
For your example:
file_path = "some_real_file.txt"
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket=bucket)
s3_resource = boto3.resource('s3')
s3_resource.meta.client.upload_file(file_path, bucket, file_path)
Your code currently creates an empty file in S3 (since Body=""), and that is exactly what is being downloaded to 'some_other_real_file.txt'.
Notice that, if you change the Body-parameter to have some text in it, that exact content will be downloaded to 'some_other_real_file.txt'.
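Putting it together, the whole mocked test could look roughly like this (a sketch reusing the dl helper and file names from the question; the final comparison is only there to show that the content now round-trips):
from moto import mock_s3
import boto3

def dl(src_f, dest_f):
    s3 = boto3.resource('s3')
    s3.Bucket('fake_bucket').download_file(src_f, dest_f)

@mock_s3
def _create_and_mock_bucket():
    bucket = "fake_bucket"
    file_path = "some_real_file.txt"
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket=bucket)
    # Upload the real local file instead of an empty Body
    s3.upload_file(file_path, bucket, file_path)
    dl(file_path, "some_other_real_file.txt")
    # The downloaded file should now match the original
    with open(file_path) as src, open("some_other_real_file.txt") as dst:
        assert src.read() == dst.read()

_create_and_mock_bucket()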

How to read XML file into Sagemaker Notebook Instance? [duplicate]

I want to write a Python script that will read and write files from s3 using their url's, eg:'s3:/mybucket/file'. It would need to run locally and in the cloud without any code changes. Is there a way to do this?
Edit: There are some good suggestions here but what I really want is something that allows me to do this:
myfile = open("s3://mybucket/file", "r")
and then use that file object like any other file object. That would be really cool. I might just write something like this for myself if it doesn't exist. I could build that abstraction layer on simples3 or boto.
For opening, it should be as simple as:
import urllib
opener = urllib.URLopener()
myurl = "https://s3.amazonaws.com/skyl/fake.xyz"
myfile = opener.open(myurl)
This will work with s3 if the file is public.
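Note that urllib.URLopener is the old Python 2 interface (it was deprecated in Python 3); on Python 3, a rough equivalent for a public object would be:
from urllib.request import urlopen

myurl = "https://s3.amazonaws.com/skyl/fake.xyz"
with urlopen(myurl) as response:
    myfile = response.read()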
To write a file using boto, it goes a little something like this:
from boto.s3.connection import S3Connection
conn = S3Connection(AWS_KEY, AWS_SECRET)
bucket = conn.get_bucket(BUCKET)
destination = bucket.new_key()
destination.name = filename
destination.set_contents_from_file(myfile)
destination.make_public()
lemme know if this works for you :)
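As a side note, the snippet above uses the old boto library; if you're on boto3 instead, a roughly equivalent upload (the bucket, key, and local file names here are placeholders) would be:
import boto3

s3 = boto3.resource('s3')
obj = s3.Object('mybucket', 'path/to/destination.txt')  # placeholder bucket/key

with open('yourfile.txt', 'rb') as myfile:  # placeholder local file
    obj.upload_fileobj(myfile)

# roughly equivalent to make_public() in the boto example above
obj.Acl().put(ACL='public-read')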
Here's how they do it in awscli:
def find_bucket_key(s3_path):
    """
    This is a helper function that given an s3 path such that the path is of
    the form: bucket/key
    It will return the bucket and the key represented by the s3 path
    """
    s3_components = s3_path.split('/')
    bucket = s3_components[0]
    s3_key = ""
    if len(s3_components) > 1:
        s3_key = '/'.join(s3_components[1:])
    return bucket, s3_key

def split_s3_bucket_key(s3_path):
    """Split s3 path into bucket and key prefix.
    This will also handle the s3:// prefix.
    :return: Tuple of ('bucketname', 'keyname')
    """
    if s3_path.startswith('s3://'):
        s3_path = s3_path[5:]
    return find_bucket_key(s3_path)
Which you could just use with code like this:
from awscli.customizations.s3.utils import split_s3_bucket_key
import boto3

client = boto3.client('s3')
bucket_name, key_name = split_s3_bucket_key(
    's3://example-bucket-name/path/to/example.txt')
response = client.get_object(Bucket=bucket_name, Key=key_name)
This doesn't address the goal of interacting with an s3 key as a file like object but it's a step in that direction.
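If you do want something closer to the open("s3://...") interface the question asks for, a small read-only wrapper along these lines (just a sketch, reusing the same split_s3_bucket_key helper) gets most of the way there:
import io
import boto3
from awscli.customizations.s3.utils import split_s3_bucket_key

def open_s3_url(url):
    """Return a seekable, file-like object for an s3:// URL (read-only)."""
    client = boto3.client('s3')
    bucket_name, key_name = split_s3_bucket_key(url)
    body = client.get_object(Bucket=bucket_name, Key=key_name)['Body']
    # Body is a StreamingBody (readable but not seekable), so buffer it in memory.
    return io.BytesIO(body.read())

# myfile = open_s3_url('s3://example-bucket-name/path/to/example.txt')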
I haven't seen something that would work directly with S3 URLs, but you could use an S3 access library (simples3 looks decent) and some simple string manipulation:
>>> url = "s3:/bucket/path/"
>>> _, path = url.split(":", 1)
>>> path = path.lstrip("/")
>>> bucket, path = path.split("/", 1)
>>> bucket
'bucket'
>>> path
'path/'
Try s3fs
First example on the docs:
>>> import s3fs
>>> fs = s3fs.S3FileSystem(anon=True)
>>> fs.ls('my-bucket')
['my-file.txt']
>>> with fs.open('my-bucket/my-file.txt', 'rb') as f:
... print(f.read())
b'Hello, world'
You can use the Boto Python API for accessing S3 from Python. It's a good library. After you install Boto, the following sample program will work for you:
>>> from boto.s3.connection import S3Connection
>>> from boto.s3.key import Key
>>> conn = S3Connection(AWS_KEY, AWS_SECRET)
>>> b = conn.get_bucket('yourbucket')  # bucket name is a placeholder
>>> k = Key(b)
>>> k.key = 'yourfile'
>>> k.set_contents_from_filename('yourfile.txt')
You can find more information here: http://boto.cloudhackers.com/s3_tut.html#storing-data
http://s3tools.org/s3cmd works pretty well and supports the s3:// form of the URL structure you want. It works on Linux and Windows. If you need a native API to call from within a Python program, then http://code.google.com/p/boto/ is a better choice.

ValueError when downloading a file from an S3 bucket using boto3?

I am successfully downloading an image file to my local computer from my S3 bucket using the following:
import os
import boto3
import botocore
files = ['images/dog_picture.png']
bucket = 'animals'
s3 = boto3.resource('s3')
for file in files:
    s3.Bucket(bucket).download_file(file, os.path.basename(file))
However, when I try to specify the directory to which the image should be saved on my local machine as is done in the docs:
s3.Bucket(bucket).download_file(file, os.path.basename(file), '/home/user/storage/new_image.png')
I get:
ValueError: Invalid extra_args key '/home/user/storage/new_image.png', must be one of: VersionId, SSECustomerAlgorithm, SSECustomerKey, SSECustomerKeyMD5, RequestPayer
I must be doing something wrong but I'm following the example in the docs. Can someone help me specify a local directory?
Looking at the docs, you're providing an extra parameter.
import boto3
s3 = boto3.resource('s3')
s3.Bucket('mybucket').download_file('hello.txt', '/tmp/hello.txt')
From the docs, hello.txt is the name of the object in the bucket and /tmp/hello.txt is the path on your device, so the correct way would be:
s3.Bucket(bucket).download_file(file, '/home/user/storage/new_image.png')

How to list all the contents in a given key on Amazon S3 using Apache Libcloud?

The code to list contents in S3 using boto3 is known:
self.s3_client = boto3.client(
    u's3',
    aws_access_key_id=config.AWS_ACCESS_KEY_ID,
    aws_secret_access_key=config.AWS_SECRET_ACCESS_KEY,
    region_name=config.region_name,
    config=Config(signature_version='s3v4')
)
versions = self.s3_client.list_objects(Bucket=self.bucket_name, Prefix=self.package_s3_version_key)
However, I need to list contents on S3 using libcloud. I could not find it in the documentation.
If you are just looking for all the contents for a specific bucket:
from libcloud.storage.types import Provider
from libcloud.storage.providers import get_driver

client = get_driver(Provider.S3)
s3 = client(aws_id, aws_secret)

container = s3.get_container(container_name='name')
objects = s3.list_container_objects(container)
s3.download_object(objects[0], '/path/to/download')
The resulting objects will contain a list of all the keys in that bucket, with filename, byte size, and metadata. To download, call the download_object method on s3 with the full libcloud Object and your file path.
If you'd rather get all objects of all buckets, change get_container to list_containers with no parameters.
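For example, continuing from the driver instance above (a small sketch):
# walk every bucket instead of a single named container
all_objects = []
for container in s3.list_containers():
    all_objects.extend(s3.list_container_objects(container))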
Information for all driver methods: https://libcloud.readthedocs.io/en/latest/storage/api.html
Short examples specific to s3: https://libcloud.readthedocs.io/en/latest/storage/drivers/s3.html
