I am an absolute beginner in AWS. I have created a key and an instance. The Python script I want to run in the EC2 environment needs to loop through around 80,000 filings, tokenize the sentences in them, and use these sentences for some unsupervised learning.
This might be a duplicate, but I can't find a way to copy these filings to the EC2 environment and run the Python script there. I am also not sure how to use boto3. I am using macOS. I am just looking for any way to speed things up. Thank you so much!
Here's what I tried recently:
Create the bucket and make it publicly accessible.
Create the role and add the HTTP option.
Upload all the files and make sure they are publicly accessible.
Get the HTTP link of the S3 file.
Connect to the instance through PuTTY.
Use wget to copy the file onto the EC2 instance.
If your files are in zip format, a single copy is enough to move all the files onto the instance.
Here's one way that might help:
create a simple IAM role that allows S3 access to the bucket holding your files
apply that IAM role to the running EC2 instance (or launch a new instance with the IAM role)
install the awscli on the EC2 instance
SSH to the instance and sync the S3 files to the EC2 instance using aws s3 sync
run your app
I'm assuming you've launched the EC2 instance with enough disk space to hold the files.
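Since the question also asks about boto3, here is a minimal sketch of the same download step in Python instead of aws s3 sync. It assumes the IAM role is already attached to the instance, so boto3 picks up credentials on its own. The function names and the bucket and directory names in the usage line are hypothetical.

```python
import os


def local_path_for_key(key, dest_dir):
    """Map an S3 object key such as 'filings/2019/a.txt' to a path under dest_dir."""
    return os.path.join(dest_dir, *key.split("/"))


def download_bucket(bucket, dest_dir, prefix=""):
    """Download every object under prefix into dest_dir (run this on the instance)."""
    import boto3  # imported here so the helper above works without boto3 installed

    s3 = boto3.client("s3")  # credentials come from the instance's IAM role
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):  # skip zero-byte "folder" placeholder objects
                continue
            target = local_path_for_key(key, dest_dir)
            os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
            s3.download_file(bucket, key, target)
```

Usage would look like download_bucket("my-filings-bucket", "/home/ec2-user/filings"). Unlike aws s3 sync, this sketch downloads everything each time and does not skip files that are already present.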
Related
I am new to the AWS CLI, and finally have my commands working through Git Bash with:
aws s3 ls --no-verify-ssl
I am now trying to run the same commands from Python.
I need to be able to do the following tasks in AWS s3 from Python:
Copy hundreds of local folders to the s3 bucket.
Update existing folders on the s3 bucket with changes made on local versions.
List contents of the s3 bucket.
In reading similar posts here, I see that --no-verify-ssl means there is a bigger problem, however using it is the way our network people have set things up, and I have no control over that. This is the flag they require to be used to allow access to the AWS CLI.
I have tried using boto3 and running the Python command there, but I get an authentication error because I don't know how to pass the --no-verify-ssl flag from Python.
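For what it's worth, boto3.client accepts a verify argument, and passing verify=False turns off certificate verification in roughly the way --no-verify-ssl does for the CLI. The sketch below assumes that is what the network setup needs. The function names are hypothetical, and upload_folder copies every file rather than only changed ones, so it is not a true sync.

```python
import os


def s3_key_for(path, local_root, key_prefix=""):
    """Turn a local file path into an S3 key relative to local_root."""
    relative = os.path.relpath(path, local_root).replace(os.sep, "/")
    return key_prefix + relative


def make_client():
    import boto3  # imported lazily so s3_key_for can be used without boto3

    # verify=False disables TLS certificate checks (assumption: this mirrors
    # what --no-verify-ssl does on the command line)
    return boto3.client("s3", verify=False)


def upload_folder(local_root, bucket, key_prefix=""):
    """Upload every file under local_root, keeping the folder layout."""
    s3 = make_client()
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            s3.upload_file(path, bucket, s3_key_for(path, local_root, key_prefix))


def list_bucket(bucket, prefix=""):
    """Return all object keys in the bucket under prefix."""
    s3 = make_client()
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

The verify argument can also be given the path to a CA bundle file instead of False, which may be an option if the network team can supply their certificate.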
I have code on AWS EC2. Right now, it accepts input and output files from S3. It's an inefficient process: I have to upload the input file to S3, copy it from S3 to EC2, run the program, copy the output files from EC2 to S3, then download them locally.
Is there a way to run the code on ec2 and accept a local file as input and then have the output saved on my local machine?
It appears that your scenario is:
Some software on an Amazon EC2 instance is used to process data on the local disk
You are manually transferring that data to/from the instance via Amazon S3
An Amazon EC2 instance is just like any other computer. It runs the same operating system and the same software as you would on a server in your company. However, it does benefit from being in the cloud in that it has easy access to other services (such as Amazon S3) and resources can be turned off to save expense.
Optimize current process
In sticking with the current process, you could improve it with some simple automation:
Upload your data to Amazon S3 via an AWS Command-Line Interface (CLI) command, such as: aws s3 cp file.txt s3://my-bucket/input/
Execute a script on the EC2 instance that will:
Download the file, eg: aws s3 cp s3://my-bucket/input/file.txt .
Process the file
Copy the results to S3, eg: aws s3 cp file.txt s3://my-bucket/output/
Download the results to your own computer, eg: aws s3 cp s3://my-bucket/output/file.txt .
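The steps above could be scripted along these lines. This is only a sketch: it shells out to the same aws s3 cp commands, assumes the AWS CLI is installed and configured wherever each part runs, and process is a hypothetical callable standing in for your program.

```python
import subprocess


def s3_cp_command(source, destination):
    """Build the argument list for: aws s3 cp <source> <destination>."""
    return ["aws", "s3", "cp", source, destination]


def run(command):
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(command, check=True)


def process_on_instance(bucket, name, process):
    """Meant to run on the EC2 instance: download, process, upload."""
    run(s3_cp_command(f"s3://{bucket}/input/{name}", name))
    result = process(name)  # hypothetical: returns the output file's name
    run(s3_cp_command(result, f"s3://{bucket}/output/"))


def upload_input(bucket, name):
    """Meant to run on your own computer before processing."""
    run(s3_cp_command(name, f"s3://{bucket}/input/"))


def download_output(bucket, name):
    """Meant to run on your own computer after processing."""
    run(s3_cp_command(f"s3://{bucket}/output/{name}", "."))
```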
Use scp to copy files
Assuming that you are connecting to a Linux instance, you could automate via:
Use scp to copy the file to the EC2 instance (which is very similar to the SSH command)
Use ssh with a [remote command](https://malcontentcomics.com/systemsboy/2006/07/send-remote-commands-via-ssh.html) parameter to trigger the remote process
Use scp to copy the file down once complete
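As a rough sketch, the three scp/ssh steps could be driven from a local Python script like this. The key file, host, remote command, and file names are all placeholders you would supply, and the function names are hypothetical.

```python
import subprocess


def scp_command(key_file, source, destination):
    """Build: scp -i <key_file> <source> <destination>."""
    return ["scp", "-i", key_file, source, destination]


def ssh_command(key_file, host, remote_command):
    """Build: ssh -i <key_file> <host> <remote_command>."""
    return ["ssh", "-i", key_file, host, remote_command]


def round_trip(key_file, host, local_input, remote_command, remote_output, local_output):
    """Copy the input up, run the remote program, copy the output back."""
    # "host:" with nothing after the colon means the remote home directory
    subprocess.run(scp_command(key_file, local_input, f"{host}:"), check=True)
    subprocess.run(ssh_command(key_file, host, remote_command), check=True)
    subprocess.run(
        scp_command(key_file, f"{host}:{remote_output}", local_output), check=True
    )
```

Here host would be something like ec2-user@ followed by the instance's public DNS name, and check=True stops the sequence as soon as any step fails.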
Re-architect to use AWS Lambda
If the job that runs on the data is suitable for being run as an AWS Lambda function, then the flow would be:
Upload the data to Amazon S3
This automatically triggers the Lambda function, which processes the data and stores the result
Download the result from Amazon S3
Please note that an AWS Lambda function runs for a maximum of 15 minutes and has 512 MB of temporary disk space by default. (This can be expanded by using Amazon EFS if needed.)
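A minimal sketch of such a Lambda function is below. It assumes the function's role can read and write the bucket, that inputs arrive under an input/ prefix and results go under output/, and that process is a placeholder for the real work. All of those names are assumptions.

```python
import os
from urllib.parse import unquote_plus


def objects_in_event(event):
    """Extract (bucket, key) pairs from an S3 event notification."""
    return [
        # object keys arrive URL-encoded in S3 events, so decode them
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in event.get("Records", [])
    ]


def process(in_path, out_path):
    """Placeholder processing step: copy the file with its text upper-cased."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.write(src.read().upper())


def lambda_handler(event, context):
    import boto3  # available in the Lambda Python runtime

    s3 = boto3.client("s3")
    for bucket, key in objects_in_event(event):
        name = os.path.basename(key)
        in_path = os.path.join("/tmp", name)  # /tmp is Lambda's scratch space
        out_path = os.path.join("/tmp", "out-" + name)
        s3.download_file(bucket, key, in_path)
        process(in_path, out_path)
        s3.upload_file(out_path, bucket, "output/" + name)
```

If the results are written to the same bucket, the S3 trigger should be limited to the input/ prefix; otherwise each uploaded result would trigger the function again.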
Something in-between
There are other ways to upload/download data, such as running a web server on the EC2 instance and interacting via a web browser, or using AWS Systems Manager Run Command to trigger the process on the EC2 instance. Such a choice would be based on how much you are permitted to modify what is running on the instance and your technical capabilities.
@John Rotenstein We have solved the problem of loading 60MB+ models into Lambdas by attaching AWS EFS volumes via VPC. It also solves the problem with large libs such as Tensorflow, opencv, etc. Basically, Lambda layers almost become redundant and you can really sit back and relax. This saved us days, if not weeks, of tweaking, building and cherry-picking library components from source, allowing us to concentrate on the real problem. It beats loading from S3 every time too. The EFS approach would require an EC2 instance, obviously.
Any ideas how I can automatically send some files (mainly Tensorflow models) after training in Google AI Platform to another compute instance or my local machine? I would like to run something like this in my trainer: os.system('scp -r ./file1 user@host:/path/to/folder'). Of course I don’t need to use scp. It’s just an example. Is there such a possibility in Google? There is no problem transferring files from the job to Google Cloud Storage like this: os.system('gsutil cp ./example_file gs://my_bucket/path/'). However, when I try, for example, os.system('gcloud compute scp ./example_file my_instance:/path/') to transfer data from my AI Platform job to another instance, I get "Your platform does not support SSH". Any ideas how I can do this?
UPDATE
Maybe there is a way to automatically download all the files in a chosen Google Cloud Storage folder? For instance, I would upload data from my job instance to the Google Cloud Storage folder, and my other instance would automatically detect changes and download all the new files.
UPDATE2
I found gsutil rsync, but I am not sure whether it can run constantly in the background. At this point the only solution that comes to my mind is to use a cron job in the backend and run gsutil rsync, for example, every 10 minutes. But it doesn't seem to be an optimal solution. Maybe there is a built-in tool or another better idea?
The gsutil rsync command makes the contents under the destination the same as the contents under the source, by copying any missing files/objects (or those whose data has changed), and (if the -d option is specified) deleting any extra files/objects. The source must specify a directory, bucket, or bucket subdirectory. However, the command does not run in the background.
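Since gsutil rsync runs once and exits, one workaround is the periodic approach mentioned in the question, either with cron or with a small loop like the sketch below. It assumes gsutil is installed on the receiving machine; the function names and the 10-minute default are illustrative only.

```python
import subprocess
import time


def rsync_command(source, destination, delete=False):
    """Build: gsutil -m rsync -r [-d] <source> <destination>."""
    command = ["gsutil", "-m", "rsync", "-r"]
    if delete:
        command.append("-d")  # also delete extra files at the destination
    return command + [source, destination]


def sync_periodically(source, destination, interval_seconds=600, max_runs=None,
                      run=subprocess.run, sleep=time.sleep):
    """Run gsutil rsync every interval_seconds; loop forever if max_runs is None."""
    runs = 0
    while max_runs is None or runs < max_runs:
        run(rsync_command(source, destination), check=True)
        runs += 1
        sleep(interval_seconds)
```

For example, sync_periodically("gs://my_bucket/path", "./models") would pull new files into ./models every 10 minutes. The run and sleep parameters exist only so the loop can be exercised without calling gsutil.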
Remember that the Notebook you're using is in fact a VM running JupyterLab. Based on that, you could run the rsync command once Tensorflow has finished creating the files, and sync them with a directory on another instance, trying something like:
import os
# Replace "ip" with the address of the target instance
os.system("rsync -avrz Tensorflow/outputs/filename root@ip:Tensorflow/outputs/file")
I suggest taking a look at the rsync documentation to learn all the options available for that command.
I'm looking to create an AWS system with one master EC2 instance which can create other instances.
So far, I have managed to write Python scripts that use boto to create EC2 instances.
The script works fine on my own computer, but when I try to deploy it using Amazon BeanStalk with Django (Python 3.4 included), the script doesn't work. I can't configure the AWS CLI (and so boto) through SSH because the only user I can access is ec2-user, and the web server runs as another user.
I could simply handwrite my access ID key and password on the python file but that would not be secure. What can I do to solve this problem?
I also discovered AWS cloudformation today, is it a better idea to create new instances with that rather than with the boto function run?
This sounds like an AWS credentials question, not specifically a "create ec2 instances" question. The answer is to assign the appropriate AWS permissions to the EC2 instance via an IAM role. Then your boto/boto3 code and/or the AWS CLI running on that instance will have permissions to make the necessary AWS API calls without having an access key and secret key stored in your code.
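As a rough illustration, once the role is attached, the boto3 code needs no keys at all. The sketch below assumes the role allows ec2:RunInstances. The function names, default instance type, and region are placeholders.

```python
def launch_params(image_id, instance_type="t2.micro", count=1):
    """Build the keyword arguments for EC2 run_instances."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
    }


def launch_instances(image_id, region, **kwargs):
    """Launch instances using credentials supplied by the instance's IAM role."""
    import boto3  # imported lazily so launch_params works without boto3

    # No access key or secret key here: boto3 finds the role's temporary
    # credentials automatically when running on the instance.
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**launch_params(image_id, **kwargs))
    return [instance["InstanceId"] for instance in response["Instances"]]
```

Because the credentials belong to the instance rather than to a Linux user, the same code should work whether it runs as ec2-user or as the web server's user.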
I have an application in which I need to zip folders hosted on S3. The zipping process will be triggered from the model's save method. The Django app is running on an EC2 instance. Any thoughts or leads on this?
I tried django_storages but haven't got a breakthrough
From my understanding, you can't zip files directly on S3. You would have to download the files you want to zip, zip them up, then upload the zipped file. I've done something similar before and used s3cmd to keep a local synced copy of my S3 bucket, and since you're on an EC2 instance, network speed and latency will be pretty good.
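A minimal sketch of the download, zip, upload approach using boto3 and the standard zipfile module follows (the answer itself used s3cmd, which this swaps out). It assumes the instance has credentials for the bucket, and the function names are hypothetical.

```python
import os
import tempfile
import zipfile


def zip_directory(source_dir, zip_path):
    """Write every file under source_dir into a zip at zip_path."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for dirpath, _dirnames, filenames in os.walk(source_dir):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # store paths relative to source_dir inside the archive
                archive.write(path, os.path.relpath(path, source_dir))


def zip_s3_folder(bucket, prefix, zip_key):
    """Download everything under prefix, zip it locally, upload the zip."""
    import boto3  # imported lazily so zip_directory works without boto3

    s3 = boto3.client("s3")
    with tempfile.TemporaryDirectory() as work_dir:
        download_dir = os.path.join(work_dir, "files")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                key = obj["Key"]
                relative = key[len(prefix):].lstrip("/")
                if not relative or key.endswith("/"):
                    continue  # skip the prefix itself and "folder" placeholders
                target = os.path.join(download_dir, *relative.split("/"))
                os.makedirs(os.path.dirname(target), exist_ok=True)
                s3.download_file(bucket, key, target)
        zip_path = os.path.join(work_dir, "archive.zip")
        zip_directory(download_dir, zip_path)
        s3.upload_file(zip_path, bucket, zip_key)
```

Calling zip_s3_folder from a model's save method would block the request while files download, so for large folders it may be better to hand the work to a background task.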