I've got a Python script that Works On My Machine (OSX, Python 2.7.13, boto3 1.4.4) but won't work for my colleague (Windows 7, otherwise same).
The authentication seems to work, and we can both use S3's list_objects_v2 and get_object. However, when he tries to upload with put_object, it times out. Here is a full log; the upload starts at line 45.
I've tried using his credentials and it works. He's tried uploading a tiny file: it works when it's only a few bytes, but even a few kilobytes is too big for it. We've even tried it on another Windows machine on another internet connection, with no luck.
My upload code is pretty simple:
with open("tmp_build.zip", "r") as zip_to_upload:
    upload_response = s3.put_object(Bucket=target_bucket, Body=zip_to_upload, Key=build_type+".zip")
The Key resolves to test.zip in our runs, and the file is about 15 MB.
Why is it failing on Windows? What more debug info can I give you?
Using inspiration from this https://github.com/boto/boto3/issues/870 issue, I added .read() to my Body parameter, and lo it works.
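For completeness, the working call now looks roughly like this (the .read() is what actually fixed it; switching to binary mode "rb" is an extra precaution I'd suggest on Windows rather than something we confirmed was required):

with open("tmp_build.zip", "rb") as zip_to_upload:
    upload_response = s3.put_object(
        Bucket=target_bucket,
        Body=zip_to_upload.read(),  # pass the file contents as bytes instead of the file object
        Key=build_type + ".zip",
    )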
Might be network issues. Are you on the same network?
Are you able to upload it using the AWS CLI? Try the following:
aws s3 cp my-file.txt s3://my-s3-bucket/data/ --debug
I would also consider adding a few retries to the upload; that might give you more information on the error at hand. Most of the time these are sporadic network-related issues.
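If you want to rule out transient failures, here is a minimal sketch of raising boto3's built-in retry count when creating the client (the value 10 is just an example):

import boto3
from botocore.config import Config

# example only: bump the number of retry attempts for the S3 client
s3 = boto3.client('s3', config=Config(retries={'max_attempts': 10}))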
Related
One CSV file is uploaded to cloud storage every day around 0200 hrs, but sometimes, due to a job failure or a system crash, the file upload happens very late. So I want to create a Cloud Function that can trigger my Python BQ load script whenever the file is uploaded to the storage.
file_name : seller_data_{date}
bucket name : sale_bucket/
The question lacks enough description of the desired use case and of any issues the OP has faced. However, here are a few possible approaches that you might choose from, depending on the use case.
The simple way: Cloud Functions with Storage trigger.
This is probably the simplest and most efficient way of running a Python function whenever a file gets uploaded to your bucket.
The most basic tutorial is this.
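As a rough illustration, a background function with a storage trigger can be as small as the sketch below (the function name, the seller_data_ prefix check, and the run_bq_load stub are placeholders for your own BQ load script):

# main.py - sketch of a storage-triggered background Cloud Function
def run_bq_load(bucket_name, file_name):
    # placeholder for the OP's existing BigQuery load logic
    print('would load gs://{}/{} into BigQuery'.format(bucket_name, file_name))

def on_file_uploaded(event, context):
    """Runs whenever an object is finalized in the bucket."""
    bucket_name = event['bucket']   # e.g. "sale_bucket"
    file_name = event['name']       # e.g. "seller_data_20210622"

    if file_name.startswith('seller_data_'):
        run_bq_load(bucket_name, file_name)

Deployed with something like gcloud functions deploy on_file_uploaded --runtime python39 --trigger-resource sale_bucket --trigger-event google.storage.object.finalize (flags as I remember them for 1st-gen functions; double-check against the tutorial above).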
The hard way: App Engine with a few tricks.
Have a basic Flask application hosted on GAE (Standard or Flex), with an endpoint specifically built to handle this check for the file's existence, download the object, manipulate it, and then do something with it.
This route acts as a custom HTTP-triggered function: the request could come from a simple curl call, a visit from the browser, a Pub/Sub push, or even another Cloud Function. Once it receives a GET (or POST) request, it downloads the object into the /tmp dir, processes it, and then does something with it.
The small benefit of GAE over CF is that you can set a minimum of one instance to stay alive at all times, which means you will not have cold starts or risk the request timing out before the job is done.
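A bare-bones sketch of such an endpoint (bucket/object names and the processing step are placeholders):

# Sketch of a Flask endpoint on GAE that pulls an object into /tmp
from flask import Flask
from google.cloud import storage

app = Flask(__name__)
client = storage.Client()

@app.route('/process', methods=['GET', 'POST'])
def process_file():
    blob = client.bucket('sale_bucket').blob('seller_data_latest.csv')   # placeholder names
    local_path = '/tmp/seller_data_latest.csv'
    blob.download_to_filename(local_path)   # /tmp is the writable directory on GAE/CF

    # ... manipulate the file, load it into BQ, etc. ...

    return 'done', 200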
The brutal/overkill way: Cloud Run.
Similar approach to App Engine, but with Cloud Run you'll also need to work with a Dockerfile, keep in mind that Cloud Run will scale down to zero when there's no usage, and deal with other minor things that apply to building any application on Cloud Run.
########################################
For all the above approaches, some additional things you might want to achieve are the same:
a) Downloading the object and doing some processing on it:
You will have to download it to the /tmp directory, as that's the directory both GAE and CF use for temporary files. Cloud Run is a bit different here, but let's not go deep into it as it's overkill by itself.
However, keep in mind that if your file is large you might run into high memory usage.
And ALWAYS clean that directory after you have finished with the file. Also, when opening a file, always use with open ... as it will make sure not to keep file handles open.
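For example, something along these lines (paths are illustrative):

import os

local_path = '/tmp/seller_data.csv'
try:
    # with open(...) guarantees the handle is closed even if processing blows up
    with open(local_path) as f:
        for line in f:
            pass   # ... process each line ...
finally:
    # always clean /tmp when done; on CF it is an in-memory filesystem
    if os.path.exists(local_path):
        os.remove(local_path)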
b) Downloading the latest object in the bucket:
This is a bit tricky and it needs some extra custom code. There are many ways to achieve it, but the one I use (always paying close attention to memory usage, though) is this: upon creation of the object I upload to the bucket, I get the current time and use a regex to transform it into something like results_22_6.
What happens now is that once I list the objects from my other script, they are already listed in ascending order, so the last element in the list is the latest object.
So basically what I do then is check whether the filename I have in /tmp is the same as the name of the last object in the bucket's listing. If yes, do nothing; if no, delete the old one and download the latest one from the bucket.
This might not be optimal, but for me it's kinda preferable.
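A rough sketch of that check (bucket name and paths are just examples; the listing comes back sorted by object name, so with a time-based naming scheme the last element is the newest):

import os
from google.cloud import storage

client = storage.Client()
blobs = list(client.list_blobs('sale_bucket'))   # placeholder bucket name
latest = blobs[-1]                               # last element = latest object

local_path = '/tmp/' + latest.name
if not os.path.exists(local_path):
    # an older file may still be sitting in /tmp - clear it out first
    for old in os.listdir('/tmp'):
        old_path = os.path.join('/tmp', old)
        if os.path.isfile(old_path):
            os.remove(old_path)
    latest.download_to_filename(local_path)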
Is there a way to develop a Python AWS Lambda on a local machine which can read and write data from an S3 bucket? I can get this to run on a Lambda in the AWS web console with the following code with no problems.
import json
import datetime
import boto3

# S3 client (not shown in the original snippet) used by get_data_from_s3_file below
s3 = boto3.client('s3')

def lambda_handler(event, context):
    # TODO implement
    now = datetime.datetime.now()
    cur_day = "{}{}{}".format(now.strftime('%d'), now.strftime('%m'), now.year)
    print(cur_day)
    my_contents = get_data_from_s3_file('myBucket', 'myFile')
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }

def get_data_from_s3_file(bucket, my_file):
    """Read the contents of the file as a string and split by lines"""
    my_data = s3.get_object(Bucket=bucket, Key=my_file)
    my_text = my_data['Body'].read().decode('utf-8').split('\n')
    return my_text
The issue is that this is a terrible environment to write and debug code in, so I would like to do it on my own local machine. I have set up the AWS CLI and installed a tool called 'python-lambda-local' that lets you run Lambda code in a local environment, as shown here.
pip install python-lambda-local
python-lambda-local -l lib/ -f lambda_handler -t 5 pythonLambdaLocalTest.py event.json
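(For reference, the event.json I pass in is just a minimal placeholder, since the handler above never reads the event; the contents here are only an example.)

{
    "key1": "value1"
}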
The file 'pythonLambdaLocalTest.py' contains the same code that I ran on AWS from the console; the advantage here is that I can use an IDE such as Visual Studio Code to write it. If I run it without calling 'get_data_from_s3_file', the code seems to run fine on the local machine and 'cur_day' is printed to the console. However, if I run the full script and try to connect to the bucket, I get the following error:
raise EndpointConnectionError(endpoint_url=request.url, error=e)
botocore.exceptions.EndpointConnectionError: Could not connect to the
endpoint URL: "https://myBucket/myfile"
Does anyone have a method to connect to S3 from the local machine? I'm sure there must be a way using the AWS CLI or the Serverless Application Model (SAM), but I can't find any guides that are complete enough to follow.
I've also tried downloading the .yaml file from the console and putting it in the local directory and running:
sam local start-api -t pythonLambdaLocalTest.yaml
and I get:
2019-01-21 16:56:30 Found credentials in shared credentials file: ~/.aws/credentials
Error: Template does not have any APIs connected to Lambda functions
This suggests that potentially an API could connect my local machine to the AWS S3 bucket, but I have very little experience in setting this kind of thing up and am struggling with the jargon. Any help on getting this approach to run would be great. I've recently started using Docker, so an approach using that would also be welcome.
I've also tried the approach described here and can see my Lambda functions listed in Visual Studio Code, but I can't seem to see or edit any of the code and there is no obvious link to do so; most of the support seems to be around Node.js and my Lambdas are Python.
I also realise that Cloud9 is an option, but it appears to require a running EC2 instance, which I would rather not pay for.
I have tried a lot of approaches but there doesn't seem to be any complete guides. Any help highly appreciated!
While running the following code:
import boto3
BUCKET = 'bwd-plfb'
s3 = boto3.client('s3', use_ssl=False)
resp = s3.list_objects_v2(Bucket=BUCKET)
s3.download_file(BUCKET,'20171018/OK/OK_All.zip','test.zip')
I'm getting the following error:
botocore.exceptions.ClientError: An error occurred
(SignatureDoesNotMatch) when calling the GetObject operation: The request
signature we calculated does not match the signature you provided. Check
your key and signing method.
What I've tried so far:
Double checking Access key ID and Secret access key configured in aws cli (Running aws configure in command prompt) - They're correct.
Trying to list bucket objects using boto3 - it worked successfully. The problem seems to be occurring when trying to download files.
Using a Chrome plugin to browse bucket contents and download files (chrome plugin) - it works successfully.
The interesting thing is that downloading works for some files but not all. I downloaded a file which had previously worked 20 times in a row to see if the error was intermittent; it worked all 20 times. I did the same thing for a file which had not previously worked, and it did not download on any of the 20 attempts.
I saw some other posts on Stack Overflow saying the API key & access key may be incorrect. However, I don't believe that to be the case, since I was able to list objects and download files (ones which did & did not work through boto3) using the Chrome S3 plugin.
Does anyone have any suggestions on what might be the issue here?
Thank You
This error occurs when you use a wrong/invalid secret key for S3.
I encountered the error when my path was not correct. I had a double slash // in my path; removing one of the slashes fixed the error.
I have encountered this myself. I download on a regular basis about 10 files daily from S3. I noticed that if the file is too large (~8MB), I get the SignatureDoesNotMatch error only for that file, but not the other files which are small in size.
I then tried to use the shell "aws s3 cp" CLI command and got the same result. My co-worker suggested using "aws s3api get-object", which now works 100% of the time. However, I can't find the equivalent python script, so I'm stuck running the shell script. (s3.download_file or s3.download_fileobj don't work either.)
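For anyone looking for a Python stand-in for the s3api call, the closest equivalent I'm aware of is a plain get_object with the body written out manually (the bucket, key and output path below are placeholders, and I can't promise it sidesteps the signature problem the way the CLI did for me):

import boto3

s3 = boto3.client('s3')

# roughly what "aws s3api get-object" does: one GetObject call,
# then write the streamed body to a local file ourselves
resp = s3.get_object(Bucket='my-bucket', Key='20171018/OK/OK_All.zip')
with open('test.zip', 'wb') as f:
    f.write(resp['Body'].read())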
I want to copy keys between buckets in 2 different accounts using the boto3 APIs.
In boto3, I executed the following code and the copy worked
source = boto3.client('s3')
destination = boto3.client('s3')
obj = source.get_object(Bucket='bucket', Key='key')
destination.put_object(Bucket='bucket', Key='key', Body=obj['Body'].read())
Basically I am fetching data with GET and putting it with PUT into another account.
Along similar lines, in the boto API I have done the following:
from boto.s3.connection import S3Connection
from boto.s3.key import Key

source = S3Connection()
source_bucket = source.get_bucket('bucket')
source_key = Key(source_bucket, key_name)

destination = S3Connection()
destination_bucket = destination.get_bucket('bucket')
dist_key = Key(destination_bucket, source_key.key)
dist_key.set_contents_from_string(source_key.get_contents_as_string())
The above code achieves the purpose of copying any type of data.
But the speed is really slow: it takes around 15-20 seconds to copy 1GB of data, and I have to copy 100GB plus.
I tried Python multithreading, wherein each thread performs a copy operation. The performance was bad, as it took 30 seconds to copy 1GB. I suspect the GIL might be the issue here.
I tried multiprocessing and I am getting the same result as with a single process, i.e. 15-20 seconds for a 1GB file.
I am using a very high-end server with 48 cores and 128GB RAM. The network speed in my environment is 10 Gbps.
Most of the search results talk about copying data between buckets in the same account and not across accounts. Can anyone please guide me here? Is my approach wrong? Does anyone have a better solution?
Yes, it is the wrong approach.
You shouldn't download the file. You are using AWS infrastructure, so you should make use of the efficient AWS backend call to do the work. Your approach is wasting resources.
boto3.client.copy will do the job better than this.
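For example, something like the sketch below lets S3 copy the object in the backend (bucket and key names are placeholders; for a cross-account copy the client's credentials need read access to the source and write access to the destination, or you can pass a separate SourceClient):

import boto3

s3 = boto3.client('s3')

copy_source = {'Bucket': 'source-bucket', 'Key': 'path/to/key'}

# managed server-side copy: nothing is downloaded to your machine,
# and large objects are handled as multipart copies automatically
s3.copy(copy_source, 'destination-bucket', 'path/to/key')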
In addition, you didn't describe what you are trying to achieve (e.g. is this some sort of replication requirement?).
With a proper understanding of your own needs, it is possible that you don't even need a server to do the job: S3 bucket event triggers, Lambda, etc. can all execute the copy job without a server.
To copy files between two different AWS accounts, you can check out this link: Copy S3 object between AWS account
Note:
S3 is a huge virtual object store for everyone; that's why the bucket name MUST be unique. This also means the S3 "controller" can do a lot of fancy work similar to a file server, e.g. replication, copy, and move files in the backend, without involving network traffic.
As long as you set up the proper IAM permissions/policies for the destination bucket, objects can move across buckets without an additional server.
This is much like a file server: users can copy files to each other without a "download/upload" round trip; one just creates a folder with write permission for all, and the copy from another user is done entirely within the file server, at raw disk I/O speed. You don't need a powerful instance or a high-performance network to use the backend S3 copy API.
Your method is similar to attempting an FTP download of a file from a user on the same file server, which creates unwanted network traffic.
You should check out the TransferManager in boto3. It will automatically handle the threading of multipart uploads in an efficient way. See the docs for more detail.
Basically you just have to use the upload_file method and TransferManager will take care of the rest.
import boto3
# Get the service client
s3 = boto3.client('s3')
# Upload tmp.txt to bucket-name at key-name
s3.upload_file("tmp.txt", "bucket-name", "key-name")
I need to copy all keys from '/old/dir/' to '/new/dir/' in an amazon S3 bucket.
I came up with this script (quick hack):
import boto
s3 = boto.connect_s3()
thebucket = s3.get_bucket("bucketname")
keys = thebucket.list('/old/dir')
for k in keys:
    newkeyname = '/new/dir' + k.name.partition('/old/dir')[2]
    print 'new key name:', newkeyname
    thebucket.copy_key(newkeyname, k.bucket.name, k.name)
For now it is working, but it is much slower than what I can do manually in the graphical management console by just copy/pasting with the mouse. Very frustrating, and there are lots of keys to copy...
Do you know of any quicker method? Thanks.
Edit: maybe I can do this with concurrent copy processes. I'm not really familiar with boto's copy-key methods or with how many concurrent requests I can send to Amazon.
Edit 2: I'm currently learning Python multiprocessing. Let's see if I can send 50 copy operations simultaneously...
Edit 3: I tried 30 concurrent copies using the Python multiprocessing module. The copy was much faster than within the console and less error prone. There is a new issue with large files (>5GB): boto raises an exception. I need to debug this before posting the updated script.
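Roughly what my concurrent version looks like (simplified sketch: each worker opens its own connection since boto connections don't share well across processes, and the pool size of 30 is just what I tried):

import boto
from multiprocessing import Pool

def copy_one(key_name):
    # each process gets its own S3 connection
    s3 = boto.connect_s3()
    bucket = s3.get_bucket("bucketname")
    newkeyname = '/new/dir' + key_name.partition('/old/dir')[2]
    bucket.copy_key(newkeyname, "bucketname", key_name)
    return newkeyname

if __name__ == '__main__':
    s3 = boto.connect_s3()
    names = [k.name for k in s3.get_bucket("bucketname").list('/old/dir')]
    pool = Pool(30)
    for copied in pool.map(copy_one, names):
        print(copied)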
Regarding your issue with files over 5GB: S3 doesn't support uploading files over 5GB using the PUT method, which is what boto tries to do (see the boto source and the Amazon S3 documentation).
Unfortunately I'm not sure how you can get around this, apart from downloading the object and re-uploading it as a multi-part upload. I don't think boto supports a multi-part copy operation yet (if such a thing even exists).