I'm writing Python code to put files into an S3 bucket.
I'm connecting using an access key and ID.
I also need to work out how to get a temporary key and ID to connect and do the same.
I've seen various options like assume_role, API calls, etc.
But it's pretty confusing to me.
My requirement:
My Python code (running on my machine) should generate a temporary key and ID to make an S3 connection (using boto).
Please help.
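For reference, a minimal sketch of one way to do this with boto3 and STS; the role ARN, bucket, and file names below are placeholders, and get_session_token can be used instead of assume_role if you only need a session under your own user:

import boto3

# Use your long-lived access key/ID (from the usual boto config) to call STS.
sts = boto3.client("sts")

# Ask STS to assume a role and hand back temporary credentials.
response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/s3-upload-role",  # hypothetical role ARN
    RoleSessionName="temp-s3-session",
)
creds = response["Credentials"]

# Build an S3 client from the temporary key, secret, and session token.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
s3.upload_file("local-file.txt", "my-bucket", "remote-key.txt")  # hypothetical names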
I'm trying to do the following:
When I upload a file to my S3 storage, the Lambda picks up this JSON file and converts it into a CSV file.
How can I specify in the Lambda code which file it must pick?
Example of my code running locally:
import pandas as pd
df = pd.read_json('movies.json')
df.to_csv('csv-movies.csv')
In this example I provide the name of the file... but how can I manage that in a Lambda?
I think I don't understand how Lambda works... could you give me an example?
Lambda spins up execution environments to handle your requests. When it initialises these environments, it'll pull the code you uploaded, and execute it when invoked.
Execution environments have a concept of ephemeral (temporary) storage with a default size of 512 MB.
Lambda doesn't have access to your files in S3 by default. You'd first need to download your file from S3 using something like the AWS SDK for Python. You can store it in the /tmp directory to make use of the ephemeral storage I mentioned earlier.
Once you've downloaded the file using the SDK, you can interact with it as you would if you were running this locally, like in your example.
On the flip side, you'd also need to use the SDK to upload the CSV back to S3 if you want to keep it beyond the lifecycle of that execution environment.
Something else you might want to explore in future is reading that file into memory and doing away with storing it in ephemeral storage altogether.
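For illustration, a minimal sketch of that flow, assuming boto3 and pandas are available in the runtime (e.g. via a layer) and with the bucket and key names hard-coded as placeholders:

import boto3
import pandas as pd

s3 = boto3.client("s3")

def lambda_handler(event, context):
    bucket, key = "my-bucket", "movies.json"  # placeholders for illustration

    # Download the JSON into the ephemeral /tmp storage.
    s3.download_file(bucket, key, "/tmp/movies.json")

    # Convert it exactly as you would locally.
    df = pd.read_json("/tmp/movies.json")
    df.to_csv("/tmp/csv-movies.csv", index=False)

    # Upload the result back to S3 so it outlives the execution environment.
    s3.upload_file("/tmp/csv-movies.csv", bucket, "csv-movies.csv")

To skip /tmp entirely, you could instead read the object body into memory with get_object and pass it straight to pandas.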
To achieve this you will need to use S3 as the event source for your Lambda. AWS provides a useful tutorial for this, with some sample Python code to assist you; you can view it here.
To break it down slightly further and answer how you get the name of the file: the Lambda handler will look similar to the following:
def lambda_handler(event, context):
What is important here is the event object. When your event source is the S3 bucket, you will be given the name of the bucket and the S3 key of the object, which is effectively the path to the file in the bucket. With this information you can decide whether you want to download the file from that path. If you do, you can use the S3 get_object() API call as shown in the tutorial.
Once the file is downloaded it can be used like any other file on your local machine, so you can then proceed to convert the JSON to a CSV. Once it is converted you will presumably want to put it back in S3; for that you can use the S3 put_object() call and reuse the information in the event object to specify the path.
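For example, a rough sketch of a handler along those lines; the JSON-to-CSV step is borrowed from the question, and the trigger setup itself follows the AWS tutorial:

import io
import urllib.parse

import boto3
import pandas as pd

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # The S3 event record tells you which bucket and key triggered this invocation.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    # Fetch the JSON object and convert it in memory.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    df = pd.read_json(io.BytesIO(body))

    # Write the CSV back to the bucket, swapping the extension on the same key.
    csv_key = key.rsplit(".", 1)[0] + ".csv"
    s3.put_object(Bucket=bucket, Key=csv_key, Body=df.to_csv(index=False))

If the input and output share a bucket, filter the event notification on the .json suffix so the generated CSV doesn't re-trigger the function.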
I have a Python scraper that I run periodically on my free-tier AWS EC2 instance using cron. It outputs a CSV file every day containing around 4,000-5,000 rows with 8 columns. I have been ssh-ing into the instance from my home Ubuntu machine and adding the new data to a SQLite database, which I can then use to extract the data I want.
Now I would like to try the free-tier AWS RDS MySQL database so I can keep the database in the cloud and pull data from it from the terminal on my home PC. I have searched around and found no direct tutorial on how this could be done. It would be great if anyone who has done this could give me a conceptual idea of the steps I would need to take. Ideally I would like to automate the updating of the database as soon as my EC2 instance produces a new CSV. I can do all the de-duping once the table is in the AWS MySQL database.
Any advice or links to tutorials are most welcome. As I said, I have searched quite a bit for guides but haven't found anything on this. Perhaps the concept is completely wrong and there is an entirely different way of doing it that I am not seeing?
The problem is that you don't have access to the RDS filesystem, so you cannot upload the CSV there (or import it from a local file on the server).
Modify your Python scraper to connect to the database directly and insert the data there.
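For illustration, a minimal sketch of that idea using the pymysql driver; the RDS endpoint, credentials, table, and file names are placeholders, and any MySQL client library would work:

import csv
import pymysql

conn = pymysql.connect(
    host="mydb.xxxxxxxx.us-east-1.rds.amazonaws.com",  # placeholder RDS endpoint
    user="admin",
    password="secret",
    database="scraper",
)
try:
    with conn.cursor() as cur, open("daily_scrape.csv") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            # Eight placeholders to match the eight columns in the scraped CSV.
            cur.execute(
                "INSERT INTO scraped_data VALUES (%s, %s, %s, %s, %s, %s, %s, %s)",
                row,
            )
    conn.commit()
finally:
    conn.close()

Run something like this from cron right after the scraper finishes, and the de-duping can then be done with SQL on the RDS side.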
Did you consider using AWS Lambda to run your scraper?
Take a look at this AWS tutorial, which will help you configure a Lambda function to access an Amazon RDS database.
I want to copy keys between buckets in 2 different accounts using the boto3 API.
In boto3, I executed the following code and the copy worked:
source = boto3.client('s3')
destination = boto3.client('s3')
# bucket and key names are placeholders
obj = source.get_object(Bucket='source-bucket', Key='key')
destination.put_object(Bucket='destination-bucket', Key='key', Body=obj['Body'].read())
Basically I am fetching the data with GET and writing it with PUT into the other account.
Along similar lines, with the boto API I have done the following:
from boto.s3.connection import S3Connection
from boto.s3.key import Key

source = S3Connection()
source_bucket = source.get_bucket('bucket')
source_key = Key(source_bucket, key_name)
destination = S3Connection()
destination_bucket = destination.get_bucket('bucket')
dist_key = Key(destination_bucket, source_key.key)
dist_key.set_contents_from_string(source_key.get_contents_as_string())
The above code achieves the purpose of copying any type of data.
But the speed is really very slow: it takes around 15-20 seconds to copy 1 GB, and I have to copy 100 GB plus.
I tried Python multithreading, where each thread does one copy operation. The performance was worse, at about 30 seconds per 1 GB. I suspect the GIL might be the issue here.
I tried multiprocessing and I am getting the same result as a single process, i.e. 15-20 seconds per 1 GB file.
I am using a very high-end server with 48 cores and 128 GB RAM. The network speed in my environment is 10 Gbps.
Most of the search results talk about copying data between buckets in the same account, not across accounts. Can anyone please guide me here? Is my approach wrong? Does anyone have a better solution?
Yes, it is the wrong approach.
You shouldn't download the file. You are using AWS infrastructure, so you should make use of the efficient AWS backend calls to do the work. Your approach is wasting resources.
boto3.client.copy will do the job better than this.
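For example, a minimal sketch of the server-side copy, assuming the client's credentials can read the source bucket and write to the destination (bucket and key names are placeholders):

import boto3

s3 = boto3.client("s3")

copy_source = {"Bucket": "source-bucket", "Key": "path/to/object"}
# copy() performs the transfer inside AWS (switching to multipart copy for
# large objects), so the data never flows through your own server.
s3.copy(copy_source, "destination-bucket", "path/to/object")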
In addition, you didn't describe what you are trying to achieve (e.g. is this some sort of replication requirement?).
With a proper understanding of your own needs, it is possible that you don't even need a server to do the job: S3 bucket event triggers, Lambda, etc. can all execute the copy without a server.
To copy files between two different AWS accounts, you can check out this link: Copy S3 object between AWS account.
Note:
S3 is a huge virtual object store for everyone, which is why bucket names MUST be unique. This also means the S3 "controller" can do a lot of fancy work similar to a file server, e.g. replicate, copy, and move files in the backend, without involving network traffic.
As long as you set up the proper IAM permissions/policies for the destination bucket, objects can move across buckets without an additional server.
This is much like a file server. Users can copy files to each other without a "download/upload"; instead, one just creates a folder with write permission for all, and the copy from another user is done entirely within the file server, at raw disk I/O speed. You don't need a powerful instance or a high-performance network when you use the backend S3 copy API.
Your method is similar to attempting an FTP download of a file from a user on the same file server, which creates unwanted network traffic.
You should check out the TransferManager in boto3. It will automatically handle the threading of multipart uploads in an efficient way. See the docs for more detail.
Basically you just have to use the upload_file method, and the TransferManager will take care of the rest.
import boto3
# Get the service client
s3 = boto3.client('s3')
# Upload tmp.txt to bucket-name at key-name
s3.upload_file("tmp.txt", "bucket-name", "key-name")
I need to copy all keys from '/old/dir/' to '/new/dir/' in an Amazon S3 bucket.
I came up with this script (quick hack):
import boto
s3 = boto.connect_s3()
thebucket = s3.get_bucket("bucketname")
keys = thebucket.list('/old/dir')
for k in keys:
    newkeyname = '/new/dir' + k.name.partition('/old/dir')[2]
    print 'new key name:', newkeyname
    thebucket.copy_key(newkeyname, k.bucket.name, k.name)
For now it is working, but it is much slower than what I can do manually in the graphical management console by just copy/pasting with the mouse. Very frustrating, and there are lots of keys to copy...
Do you know any quicker method? Thanks.
Edit: maybe I can do this with concurrent copy processes. I'm not really familiar with boto's key-copy methods or with how many concurrent requests I can send to Amazon.
Edit 2: I'm currently learning Python multiprocessing. Let's see if I can send 50 copy operations simultaneously...
Edit 3: I tried 30 concurrent copies using the Python multiprocessing module. The copy was much faster than through the console and less error prone. There is a new issue with large files (>5 GB): boto raises an exception. I need to debug this before posting the updated script.
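For reference, a rough sketch of the kind of concurrent copy described in Edit 3; this is not the poster's actual script, and the pool size and bucket name are placeholders:

import boto
from multiprocessing import Pool

def copy_one(args):
    src_name, new_name = args
    # Each worker opens its own connection; boto connections shouldn't be
    # shared across processes.
    bucket = boto.connect_s3().get_bucket("bucketname")
    bucket.copy_key(new_name, bucket.name, src_name)

if __name__ == "__main__":
    bucket = boto.connect_s3().get_bucket("bucketname")
    jobs = [(k.name, '/new/dir' + k.name.partition('/old/dir')[2])
            for k in bucket.list('/old/dir')]
    pool = Pool(30)  # 30 concurrent copies, as in Edit 3
    pool.map(copy_one, jobs)
    pool.close()
    pool.join()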
Regarding your issue with files over 5 GB: S3 doesn't support uploading objects over 5 GB with a single PUT, which is what boto tries to do (see the boto source and the Amazon S3 documentation).
Unfortunately I'm not sure how you can get around this, apart from downloading the object and re-uploading it as a multipart upload. I don't think boto supports a multipart copy operation yet (if such a thing even exists).
How do you deploy your code to your servers? I am using Fabric and Python, and I would like a more automated way of pulling code from the repository through the use of public keys, but without any ops work or manual intervention to set up those keys.
Are you storing them in the code as text, or in a database and generating the key file on the fly? Any other opinions on this one?
This is what ssh-copy-id is for. It deploys your public key onto a machine for you. Key management isn't something I'd suggest putting into code/VCS. Each user needs to setup their keys so that the local ssh client knows to use them. We use Fabric as well, but it only uses the key that the ssh config is already telling it to.
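For what it's worth, a minimal sketch of that setup with Fabric 2.x (the host and path are placeholders); the connection simply uses whatever key your local ssh config or agent already provides, so nothing key-related lives in the code:

from fabric import Connection

def deploy():
    # Fabric/Paramiko pick up the key from your ssh config or agent; no key
    # material is stored in the code or the repository.
    conn = Connection("deploy@app.example.com")  # hypothetical host
    with conn.cd("/srv/app"):                    # hypothetical deploy path
        conn.run("git pull")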