Python boto force connect to s3 - python

Can anyone help me with connecting to Amazon S3? I have the following problem:
I want to check whether the provided credentials are valid at the point where the connection is created. For example:
import boto
boto.connect_s3(access_key, wrong_secret_key)
I want this to raise an error at this step, when a bad key is provided.
I know we can catch the error later, when connecting to a specific bucket, but I want to catch it at this earlier step.
Thanks.

The connect_s3 function does not actually make any requests to S3. It simply creates and configures the S3Connection object and returns it. If you want to validate the credentials, you will have to perform some S3 operation with the connection. For example, you could try to do a get_bucket:
import boto
s3 = boto.connect_s3(...)
s3.get_bucket('mybucket')
This will actually make a round trip to the S3 service and, if there is a problem with your credentials, you will see an error.
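If you want the check to live right next to the connection call, here is a small sketch (assuming access_key and secret_key hold the credentials you want to verify) that wraps a cheap request in a try/except:
import boto
from boto.exception import S3ResponseError

conn = boto.connect_s3(access_key, secret_key)
try:
    # Any authenticated request forces a round trip; listing buckets is a
    # cheap way to surface bad credentials immediately.
    conn.get_all_buckets()
except S3ResponseError as e:
    if e.status in (401, 403):
        print("Credentials rejected: %s" % e.error_code)
    else:
        raise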

Related

Automate File loading from s3 to snowflake

New JSON files are dumped into an S3 bucket daily. I have to create a solution that picks up the latest file when it arrives, parses the JSON, and loads it into the Snowflake data warehouse. Could someone please share your thoughts on how we can achieve this?
There are a number of ways to do this depending on your needs. I would suggest creating an S3 event to trigger a Lambda function.
https://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
Another option may be to create an SQS message when the file lands on S3 and have an EC2 instance poll the queue and process it as necessary.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/sqs-example-long-polling.html
Edit: here is a more detailed explanation of how to create events from S3 and trigger Lambda functions. The documentation is provided by Snowflake:
https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe-rest-lambda.html
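As a rough sketch of the Lambda route, assuming the snowflake-connector-python package is bundled with the function and that the stage @json_stage and table raw_json (both illustrative names) already exist in Snowflake:
import snowflake.connector

def handler(event, context):
    # The S3 ObjectCreated event tells us which file just landed.
    key = event['Records'][0]['s3']['object']['key']

    conn = snowflake.connector.connect(
        account='my_account',        # illustrative connection details
        user='loader',
        password='***',
        warehouse='load_wh',
        database='raw_db',
        schema='public',
    )
    try:
        # Load only the file that triggered the event from the external stage.
        conn.cursor().execute(
            "COPY INTO raw_json FROM @json_stage/%s FILE_FORMAT = (TYPE = 'JSON')" % key
        )
    finally:
        conn.close()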
Look into Snowpipe; it lets you do that within the system, making it (possibly) much easier.
There are some aspects to consider, such as whether it is batch or streaming data, whether you want to retry loading the file if it contains bad data or the wrong format, and whether you want to make it a generic process able to handle different file formats/types (csv/json) and stages.
In our case we have built a generic S3-to-Snowflake load using Python and Luigi, and also implemented the same using SSIS, but for csv/txt files only.
In my case, I have a Python script which gets information about the bucket with boto.
Once I detect a change, I call the Snowpipe REST endpoint insertFiles.
Phasing:
Detect the S3 change
Get the S3 object path
Parse the content and transform it to CSV in S3 (same bucket, or another one Snowpipe can connect to)
Call the Snowpipe REST API
What you need:
Create a user with a public key
Create your stage on Snowflake with AWS credentials in order to access S3
Create your pipe on Snowflake with your user role
Sign a JWT
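For the last step (signing the JWT and calling the REST API), the snowflake-ingest Python package can do the heavy lifting for you. A minimal sketch, where the account, host, user, pipe name, key path and staged file path are all placeholders:
from snowflake.ingest import SimpleIngestManager, StagedFile

# Private key of the Snowflake user whose public key was registered above.
with open('/path/to/rsa_key.p8') as f:
    private_key = f.read()

ingest_manager = SimpleIngestManager(
    account='my_account',
    host='my_account.snowflakecomputing.com',
    user='PIPE_USER',
    pipe='MYDB.PUBLIC.MY_PIPE',          # fully qualified pipe name
    private_key=private_key,
)

# Paths are relative to the stage the pipe reads from.
response = ingest_manager.ingest_files([StagedFile('2019/01/data.csv', None)])
print(response)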
I also tried with a Talend job with TOS BigData.
Hope it helps.

Reset hadoop aws keys to upload to another s3 bucket under different username

Sorry for the horrible question title, but here is my scenario:
I have a PySpark Databricks notebook in which I am loading other notebooks.
One of these notebooks sets some Redshift configuration for reading data from Redshift (some temp S3 buckets). I cannot change any of this configuration.
Under this configuration, both of these return True. This is useful in step number 5.
sc._jsc.hadoopConfiguration().get("fs.s3n.awsAccessKeyId") == None
sc._jsc.hadoopConfiguration().get("fs.s3n.awsSecretAccessKey") == None
I have an Apache Spark model which I need to store in my own S3 bucket, which is a different bucket than the one configured for Redshift.
I am pickling other objects and storing them in AWS using boto3, and that works properly, but I don't think we can pickle Spark models like other objects. So I have to use the model's save method with an S3 URL, and for that I am setting the AWS credentials like this, which works (as long as no one else on the same cluster is messing with the AWS configuration):
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId",
AWS_ACCESS_KEY_ID)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET_ACCESS_KEY)
After I save this model, I also need to read other data from Redshift, and that is where it fails with the following error. My guess is that Redshift's S3 configuration was changed by the code above.
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 1844.0 failed 4 times, most recent failure: Lost task
0.3 in stage 1844.0 (TID 63816, 10.0.63.188, executor 3): com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service:
Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID:
3219CD268DEE5F53; S3 Extended Request ID:
rZ5/zi2B+AsGuKT0iW1ATUyh9xw7YAt9RULoE33WxTaHWUWqHzi1+0sRMumxnnNgTvNED30Nj4o=), S3 Extended Request ID:
rZ5/zi2B+AsGuKT0iW1ATUyh9xw7YAt9RULoE33WxTaHWUWqHzi1+0sRMumxnnNgTvNED30Nj4o=
Now my question is: why am I not able to read the data again? How can I reset Redshift's S3 configuration back to the way it was before I set it explicitly, once the model has been saved to S3?
What I also don't understand is that the AWS values were initially None, yet when I try to reset them to None myself it returns an error saying
The value of property fs.s3n.awsAccessKeyId must not be null
Right now I am considering a workaround in which I save the model locally on Databricks, zip it, and upload the zip to S3, but that is just a patch. I would like to do it in a proper manner.
Sorry for using the quote box for code; the multiline code formatting was not working for some reason.
Thank you in advance!
Re-import the notebook that sets up the Redshift connectivity, or find where it is set and copy that code.
If you don't have privileges to modify the notebooks you are importing, then I'd guess you don't have privileges to set roles on the cluster. If you use roles, you don't need AWS keys at all.
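If re-running that notebook is not an option, another approach (not from the answer above, just a sketch) is to capture the Hadoop S3 properties before overriding them and put them back afterwards. It assumes the cluster's Hadoop version exposes Configuration.unset(), and that model, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and the bucket path are your own names:
hadoop_conf = sc._jsc.hadoopConfiguration()

# Remember whatever was configured before (may be None under the Redshift setup).
orig_access = hadoop_conf.get("fs.s3n.awsAccessKeyId")
orig_secret = hadoop_conf.get("fs.s3n.awsSecretAccessKey")

hadoop_conf.set("fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY_ID)
hadoop_conf.set("fs.s3n.awsSecretAccessKey", AWS_SECRET_ACCESS_KEY)
model.save(sc, "s3n://my-model-bucket/path/to/model")   # illustrative path

# Put the configuration back; unset() removes the property entirely,
# which is the closest equivalent of the original "None".
if orig_access is None:
    hadoop_conf.unset("fs.s3n.awsAccessKeyId")
    hadoop_conf.unset("fs.s3n.awsSecretAccessKey")
else:
    hadoop_conf.set("fs.s3n.awsAccessKeyId", orig_access)
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", orig_secret)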

parallell copy of buckets/keys from boto3 or boto api between 2 different accounts/connections

I want to copy keys between buckets in 2 different accounts using the boto3 API.
In boto3, I executed the following code and the copy worked:
source = boto3.client('s3')
destination = boto3.client('s3')
destination.put_object(Bucket='bucket', Key='key', Body=source.get_object(Bucket='bucket', Key='key')['Body'].read())
Basically I am fetching data from GET and pasting that with PUT in another account.
Along similar lines with the boto API, I have done the following:
from boto.s3.connection import S3Connection
from boto.s3.key import Key
source = S3Connection()
source_bucket = source.get_bucket('bucket')
source_key = Key(source_bucket, key_name)
destination = S3Connection()
destination_bucket = destination.get_bucket('bucket')
dist_key = Key(destination_bucket, source_key.key)
dist_key.set_contents_from_string(source_key.get_contents_as_string())
The above code achieves the purpose of copying any type of data.
But the speed is really very slow: it takes around 15-20 seconds to copy 1 GB of data, and I have to copy more than 100 GB.
I tried Python multithreading, where each thread performs one copy operation. The performance was worse, taking about 30 seconds per 1 GB; I suspect the GIL might be the issue here.
I tried multiprocessing and got the same result as a single process, i.e. 15-20 seconds per 1 GB file.
I am using a very high-end server with 48 cores and 128 GB RAM, and the network speed in my environment is 10 Gbps.
Most search results describe copying data between buckets in the same account, not across accounts. Can anyone please guide me here? Is my approach wrong? Does anyone have a better solution?
Yes, it is the wrong approach.
You shouldn't download the file. You are using AWS infrastructure, so you should make use of the efficient AWS backend calls to do the work. Your approach is wasting resources.
boto3.client.copy will do the job better than this.
In addition, you didn't describe what you are trying to achieve (e.g. is this some sort of replication requirement?).
With a proper understanding of your own needs, it is possible that you don't even need a server to do the job: S3 bucket event triggers, Lambda, etc. can all execute the copy job without a server.
To copy files between two different AWS accounts, you can check out this link: Copy S3 object between AWS account
Note:
S3 is a huge virtual object store for everyone, which is why bucket names MUST be globally unique. This also means the S3 "controller" can do a lot of fancy work similar to a file server, e.g. replicating, copying, and moving files in the backend, without involving network traffic.
As long as you set up the proper IAM permissions/policies for the destination bucket, objects can move across buckets without an additional server.
This is almost like a file server. Users can copy files to each other without "download/upload"; instead, one just creates a folder with write permission for all, and copying a file from another user is done entirely within the file server, with the fastest raw disk I/O performance. You don't need a powerful instance or a high-performance network when you use the backend S3 copy API.
Your method is like attempting to FTP-download a file from a user on the same file server, which creates unwanted network traffic.
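As a sketch of the server-side approach (bucket names and key are placeholders), boto3's managed copy asks S3 to copy the object internally instead of streaming it through your machine; for a cross-account copy you either grant one set of credentials access to both buckets, or pass a separate SourceClient:
import boto3

s3 = boto3.client('s3')   # credentials that can write to the destination bucket

copy_source = {'Bucket': 'source-bucket', 'Key': 'path/to/object'}

# S3 performs the copy on the backend; no object data flows through this host.
# If the source requires different credentials, pass SourceClient=<other client>.
s3.copy(copy_source, 'destination-bucket', 'path/to/object')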
You should check out the TransferManager in boto3. It will automatically handle the threading of multipart uploads in an efficient way. See the docs for more detail.
Basically, you just have to use the upload_file method and the TransferManager will take care of the rest.
import boto3
# Get the service client
s3 = boto3.client('s3')
# Upload tmp.txt to bucket-name at key-name
s3.upload_file("tmp.txt", "bucket-name", "key-name")

How to use ddbmock with dynamodb-mapper?

Can someone please explain how to set up dynamodb_mapper (together with boto?) to use ddbmock with sqlite backend as Amazon DynamoDB-replacement for functional testing purposes?
Right now, I have tried out "plain" boto and managed to get it working with ddbmock (with sqlite) by starting the ddbmock server locally and connecting with boto like this:
db = connect_boto_network(host='127.0.0.1', port=6543)
...and then I use the db object for all operations against the database. However, dynamodb_mapper gets a db connection this way:
conn = ConnectionBorg()
As I understand it, this uses boto's default way of connecting to (the real) DynamoDB. So basically I'm wondering whether there is a (preferred?) way to get ConnectionBorg() to connect to my local ddbmock server, as I've done with boto above? Thanks for any suggestions.
Library Mode
In library mode rather than server mode:
import boto
from ddbmock import config
from ddbmock import connect_boto_patch
# switch to sqlite backend
config.STORAGE_ENGINE_NAME = 'sqlite'
# define the database path. defaults to 'dynamo.db'
config.STORAGE_SQLITE_FILE = '/tmp/my_database.sqlite'
# Wire-up boto and ddbmock together
db = connect_boto_patch()
Any access to the DynamoDB service via boto will then use ddbmock under the hood.
Server Mode
If you still want to use ddbmock in server mode, I would try changing ConnectionBorg._shared_state['_region'] at the very beginning of the test setup code:
from boto.regioninfo import RegionInfo
from dynamodb_mapper.model import ConnectionBorg
ConnectionBorg._shared_state['_region'] = RegionInfo(name='ddbmock', endpoint="localhost:6543")
As far as I understand, any access to DynamoDB via any ConnectionBorg instance after those lines will use the ddbmock entry point.
That said, I've never tested it. I'll make sure the authors of ddbmock give an update on this.

boto issue with IAM role

I'm trying to use AWS' recently announced "IAM roles for EC2" feature, which lets security credentials automatically get delivered to EC2 instances. (see http://aws.amazon.com/about-aws/whats-new/2012/06/11/Announcing-IAM-Roles-for-EC2-instances/).
I've set up an instance with an IAM role as described. I can also get (seemingly) proper access key / credentials with curl.
However, boto fails to do a simple call like "get_all_buckets", even though I've turned on ALL S3 permissions for the role.
The error I get is "The AWS Access Key Id you provided does not exist in our records"
However, the access key listed in the error matches the one I get from curl.
Here is the failing script, run on an EC2 instance with an IAM role attached that gives all S3 permissions:
import urllib2
import ast
from boto.s3.connection import S3Connection
resp=urllib2.urlopen('http://169.254.169.254/latest/meta-data/iam/security-credentials/DatabaseApp').read()
resp=ast.literal_eval(resp)
print "access:" + resp['AccessKeyId']
print "secret:" + resp['SecretAccessKey']
conn = S3Connection(resp['AccessKeyId'], resp['SecretAccessKey'])
rs= conn.get_all_buckets()
If you are using boto 2.5.1 or later it's actually much easier than this. Boto will automatically find the credentials in the instance metadata for you and use them as long as no other credentials are found in environment variables or in a boto config file. So, you should be able to simply do this on the EC2 instance:
>>> import boto
>>> c = boto.connect_s3()
>>> rs = c.get_all_buckets()
The reason your manual approach is failing is that the credentials associated with the IAM role are temporary session credentials, consisting of an access_key, a secret_key, and a security_token, and you need to supply all three of those values to the S3Connection constructor.
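If you do want to build the connection by hand from the metadata response, the missing piece is the session token. A sketch along the lines of the original script (the role name DatabaseApp comes from the question):
import json
import urllib2
from boto.s3.connection import S3Connection

resp = urllib2.urlopen(
    'http://169.254.169.254/latest/meta-data/iam/security-credentials/DatabaseApp'
).read()
creds = json.loads(resp)

# Temporary role credentials are only accepted together with their session token.
conn = S3Connection(
    creds['AccessKeyId'],
    creds['SecretAccessKey'],
    security_token=creds['Token'],
)
rs = conn.get_all_buckets()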
I don't know if this answer will help anyone, but I was getting the same error and had to solve my problem a little differently.
First, my Amazon instance did not have any IAM roles. I thought I could just use the access key and the secret key, but I kept getting this error with only those two keys. I read that I needed a security token as well, but I didn't have one because I didn't have any IAM roles. This is what I did to correct the issue:
Create an IAM role with AmazonS3FullAccess permissions.
Start a new instance and attach my newly created role.
Even after doing this it still didn't work. I also had to connect to the proper region with the code below:
import boto.s3.connection
conn = boto.s3.connect_to_region('your-region')
conn.get_all_buckets()
