Mock download file from s3 with actual file - python

I would like to write a test to mock the download of a function from s3 and replace it locally with an actual file that exists of my machine. I took inspiration from this post. The idea is the following:
from moto import mock_s3
import boto3
def dl(src_f, dest_f):
s3 = boto3.resource('s3')
s3.Bucket('fake_bucket').download_file(src_f, dest_f)
#mock_s3
def _create_and_mock_bucket():
# Create fake bucket and mock it
bucket = "fake_bucket"
# We need to create the bucket since this is all in Moto's 'virtual' AWS account
file_path = "some_real_file.txt"
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket=bucket)
s3.put_object(Bucket=bucket, Key=file_path, Body="")
dl(file_path, 'some_other_real_file.txt')
_create_and_mock_bucket()
Now some_other_real_file.txt exists, but it is not a copy of some_real_file.txt. Any idea on how to do that?

If 'some_real_file.txt' already exists on your system, you should use upload_file instead:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_file
For your example:
file_path = "some_real_file.txt"
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket=bucket)
s3_resource = boto3.resource('s3')
s3_resource.meta.client.upload_file(file_path, bucket, file_path)
Your code currently creates an empty file in S3 (since Body=""), and that is exactly what is being downloaded to 'some_other_real_file.txt'.
Notice that, if you change the Body-parameter to have some text in it, that exact content will be downloaded to 'some_other_real_file.txt'.

Related

AWS Lambda: How to read CSV files in S3 bucket then upload it to another S3 bucket?

I'm doing a project, where I read files from the S3 bucket and to get rid of all NA values then upload them to the different S3 bucket. I've been watching a Lambda tutorial and example codes, but I have a hard time understanding how it really works.
My goal is to read any file in the S3 bucket and using the Lambda function, I drop all the NA values, then upload them to a different S3 bucket. But I don't really understand what is going on. I read the documentation, but it wasn't very helpful for me to understand.
How can I make the below code to read CSV files from the S3 bucket, then drop all NA values, then upload them to the new S3 bucket?
import json
import os
import boto3
import csv
def lambda_handler(event, context):
for record in event['Records']:
bucket = record['s3']['bucket']['name']
file_key = record['s3']['object']['key']
s3 = boto3.client('s3')
csv_file = s3.get_object(Bucket=bucket, Key=file_key)
csv_content = csv_file['Body'].read().split(b'\n')
csv_data = csv.DictReader(csv_content)
Any links to the documentation, or video and advice will be appreciated.
Uploading files
def upload_file(file_name, bucket, object_name=None):
"""Upload a file to an S3 bucket
:param file_name: File to upload
:param bucket: Bucket to upload to
:param object_name: S3 object name. If not specified then file_name is used
:return: True if file was uploaded, else False
"""
# If S3 object_name was not specified, use file_name
if object_name is None:
object_name = file_name
# Upload the file
s3_client = boto3.client('s3')
try:
response = s3_client.upload_file(file_name, bucket, object_name)
except ClientError as e:
logging.error(e)
return False
return True
s3 download_file
import boto3
s3 = boto3.resource('s3')
s3.meta.client.download_file('mybucket', 'hello.txt', '/tmp/hello.txt')
Now you simply put these calls in any way you want to and process your csv files and then how you process and upload to s3 in efficiency that would be a completely different topic.
There are plenty of answere her ein this post How to upload a file to directory in S3 bucket using boto
You can check this one as well if curious, gives some idea how to process larger files.
Step 4: Create the Lambda function that splits input data

Airflow : Download latest file from S3 with Wildcard

Requirement: To download the latest file i.e., current file from s3
Sample file in s3
bucketname/2020/09/reporting_2020_09_20200902000335.zip
bucketname/2020/09/reporting_2020_09_20200901000027.zip
When I pass the s3_src_key as /2020/09/reporting_2020_09_20200902 doesn't work for below one
Code:
with tempfile.NamedTemporaryFile('r') as f_source, tempfile.NamedTemporaryFile('w') as f_target:
s3_client.download_file(self.s3_src_bucket, self.s3_src_key, f_source.name)
Below one works fine
import os
bucket = 'bucketname'
key = '/2020/09/reporting_2020_09_20200902'
s3_resource = boto3.resource('s3')
my_bucket = s3_resource.Bucket(bucket)
objects = my_bucket.objects.filter(Prefix=key)
for obj in objects:
path, filename = os.path.split(obj.key)
my_bucket.download_file(obj.key, filename)
I need help how to use wildcard in Airflow
You can list objects that match a given pattern, but then you'll need to write code that decides which one of them is the latest.
Here's the Python SDK function you'll need

How to upload a data frame on AWS as CSV directly in python?

i tried boto3 but no luck
import boto3
from botocore.exceptions import NoCredentialsError
ACCESS_KEY = 'access_key'
SECRET_KEY = 'secret_key'
def upload_to_aws(local_file, bucket, s3_file):
s3 = boto3.client('s3', aws_access_key_id=ACCESS_KEY,
aws_secret_access_key=SECRET_KEY)
try:
s3.upload_file(local_file, bucket, s3_file)
print("Upload Successful")
return True
except FileNotFoundError:
print("The file was not found")
return False
except NoCredentialsError:
print("Credentials not available")
return False
uploaded = upload_to_aws('local_file', 'information-arch', 's3_file_name')
print("Done! with the uploud")
Hussein,
According to the boto3Documentation, you should upload your upload like this:
upload_file(Filename, Bucket, Key, ExtraArgs=None, Callback=None, Config=None)
Example:
import boto3
s3 = boto3.resource('s3')
s3.meta.client.upload_file('/tmp/hello.txt', 'mybucket', 'hello.txt')
Parameters
Filename (str) -- The path to the file to upload.
Bucket (str) -- The name of the bucket to upload to.
Key (str) -- The name of the key to upload to.
So on your upload_to_aws function called to pass the parameter like this way.
Thanks
You can copy your dataframe directly to s3 like this:
Let's say you have a dataframe called df. You can use the to_csv option specifying your s3 path.
It will directly save the csv file on S3.
This works with pandas versions >= 0.24
df.to_csv(s3_path, index=False)
From pandas docs:
pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas.

AWS Lambda - copy object to another S3 location

I'd like to write a lambda python code to move files to the same S3 bucket.
[same S3 bucket]
/location-as-is/file.jpg
[same S3 bucket]
/location-to-be/file.jpg
How can I do that?
Thank you.
In order to get this to work you will need a few things. First is the lambda code itself. You should be able to use the python sdk boto3 to make the call to copy. Here is an example how to copy your file:
import json
import boto3
s3 = boto3.resource('s3')
def lambda_handler(event, context):
my_bucket = "example-bucket"
current_object_key = "fileA/keyA.jpg"
new_object_key = "fileB/keyB.jpg"
copy_source = {
'Bucket': my_bucket,
'Key': current_object_key
}
s3.meta.client.copy(copy_source, my_bucket, new_object_key)
You will also need to make sure you lambda execution role has proper s3 read and write permissions and that your s3 bucket policy is configured to allow your lambda role to access it.
You can use boto for this purpose, as below:
import boto
c = boto.connect_s3()
src_buc = c.get_bucket('Source_Bucket')
sink_buc = c.get_bucket('Sink_Bucket')
and then you can iterate over all your keys to copy the content:
for k in src_buc.list():
# copy to sink
sink_buc.copy_key(k.key.name, src_buc.name, k.key.name)

AWS uploading file into wrong bucket

I am using AWS Sagemaker and trying to upload a data folder into S3 from Sagemaker. I am trying to do is to upload my data into the s3_train_data directory (the directory exists in S3). However, it wouldn't upload it in that bucket, but in a default Bucket that has been created, and in turn creates a new folder directory with the S3_train_data variables.
code to input in directory
import os
import sagemaker
from sagemaker import get_execution_role
sagemaker_session = sagemaker.Session()
role = get_execution_role()
bucket = <bucket name>
prefix = <folders1/folders2>
key = <input>
s3_train_data = 's3://{}/{}/{}/'.format(bucket, prefix, key)
#path 'data' is the folder in the Jupyter Instance, contains all the training data
inputs = sagemaker_session.upload_data(path= 'data', key_prefix= s3_train_data)
Is the problem in the code or more in how I created the notebook?
You could look at the Sample notebooks, how to upload the data S3 bucket
There have many ways. I am just giving you hints to answer.
And you forgot create a boto3 session to access the S3 bucket
It is one of the ways to do it.
import os
import urllib.request
import boto3
def download(url):
filename = url.split("/")[-1]
if not os.path.exists(filename):
urllib.request.urlretrieve(url, filename)
def upload_to_s3(channel, file):
s3 = boto3.resource('s3')
data = open(file, "rb")
key = channel + '/' + file
s3.Bucket(bucket).put_object(Key=key, Body=data)
# caltech-256
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec')
upload_to_s3('train', 'caltech-256-60-train.rec')
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec')
upload_to_s3('validation', 'caltech-256-60-val.rec')
link : https://buildcustom.notebook.us-east-2.sagemaker.aws/notebooks/sample-notebooks/introduction_to_amazon_algorithms/imageclassification_caltech/Image-classification-fulltraining.ipynb
Another way to do it.
bucket = '<your_s3_bucket_name_here>'# enter your s3 bucket where you will copy data and model artifacts
prefix = 'sagemaker/breast_cancer_prediction' # place to upload training files within the bucket
# do some processing then prepare to push the data.
f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, train_X.astype('float32'), train_y.astype('float32'))
f.seek(0)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', train_file)).upload_fileobj(f)
Link : https://buildcustom.notebook.us-east-2.sagemaker.aws/notebooks/sample-notebooks/introduction_to_applying_machine_learning/breast_cancer_prediction/Breast%20Cancer%20Prediction.ipynb
Youtube link : https://www.youtube.com/watch?v=-YiHPIGyFGo - how to pull the data in S3 bucket.

Categories