I am trying to download some data from our S3 server and I am not able to create the session.
I am running the following code:
session = boto3.Session(
    aws_access_key_id="###########",
    aws_secret_access_key="###########",
)
s3 = session.resource('s3')
bucket = s3.Bucket('########')
file_names = []
but it spits out the following error:
DataNotFoundError: Unable to load data for: sdk-default-configuration
These are my imports:
import pandas as pd
import mysql.connector
import boto3
import s3fs
import botocore
import os
My installed versions are boto3 1.20.44 and botocore 1.23.44.
I have tried installing different versions of boto3 and botocore with no success...
The problem appears to be in your session constructor:
boto3.Session(aws_access_key_id=a, aws_secret_access_key=b)
It should instead read as follows, per the documentation:
boto3.session.Session(aws_access_key_id=a, aws_secret_access_key=b)
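For reference, a minimal sketch of that form, reusing the masked placeholders from the question; the listing loop at the end is just a quick check that the session works:
import boto3

session = boto3.session.Session(
    aws_access_key_id="###########",
    aws_secret_access_key="###########",
)
s3 = session.resource('s3')
bucket = s3.Bucket('########')
for obj in bucket.objects.all():
    print(obj.key)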
So I'm trying to insert some basic .csv files directly from an S3 bucket into Elasticsearch. Each time a .csv is dropped into the S3 bucket it will trigger my Lambda, which will feed the data from the .csv into Elasticsearch. Here's what I've got so far:
import json
import os
import logging
import boto3
from datetime import datetime
import re
import csv
from aws_requests_auth.aws_auth import AWSRequestsAuth
from elasticsearch import RequestsHttpConnection, helpers, Elasticsearch
from core_libs.dynamodbtypescasters import DynamodbTypesCaster
from core_libs.streams import EsStream, EsClient
credentials = boto3.Session().get_credentials()
AWS_REGION = 'eu-west-3'
HOST = MY_PERSONAL_COMPANY_HOST
ES_SERVER = f"https://{HOST}"
AWS_ACCESS_KEY = credentials.access_key
AWS_SECRET_ACCESS_KEY = credentials.secret_key
AWS_SESSION_TOKEN = credentials.token
s3 = boto3.client('s3')
def lambda_handler(event, context):
    awsauth = AWSRequestsAuth(
        aws_access_key=AWS_ACCESS_KEY,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
        aws_token=AWS_SESSION_TOKEN,
        aws_host=HOST,
        aws_region=AWS_REGION,
        aws_service='es',
    )
    # iterate over the S3 records in the triggering event
    for record in event['Records']:
        BUCKET_NAME = record['s3']['bucket']['name']
        SOURCE_PATH = record['s3']['object']['key']
        SOURCE_FILE = SOURCE_PATH.split('/')[-1]
        obj = s3.get_object(Bucket=BUCKET_NAME, Key=SOURCE_PATH)
        body = obj['Body'].read()
        lines = body.splitlines()
        for line in lines:
            print(line)
And this is where I'm stuck. I don't know whether I should use the bulk API, whether I can just insert a JSON version of my .csv as it is, or how to do either.
Not really an answer to your specific question.
However, I am wondering why you are not using an SQS setup with Filebeat (recommended, out-of-the-box functionality) and an ingest pipeline with a CSV processor?
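If you do want to stay Lambda-only, here is a minimal sketch of the bulk route, using csv.DictReader and elasticsearch.helpers.bulk; the helper function and index name are placeholders, not part of the original code:
import csv
import io
from elasticsearch import Elasticsearch, RequestsHttpConnection, helpers

def bulk_index_csv(body, awsauth, host, index_name='my-csv-index'):
    # body: the raw bytes returned by s3.get_object()['Body'].read()
    # awsauth: the AWSRequestsAuth object built in lambda_handler
    # index_name: placeholder; use whatever index you want to write to
    es = Elasticsearch(
        hosts=[{'host': host, 'port': 443}],
        http_auth=awsauth,
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection,
    )
    reader = csv.DictReader(io.StringIO(body.decode('utf-8')))
    actions = ({'_index': index_name, '_source': row} for row in reader)
    helpers.bulk(es, actions)
Each CSV row becomes one JSON document, so there is no need to convert the file by hand before indexing.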
I am trying to access Gremlin via AWS Glue with PySpark as the runtime. As gremlinpython is an external library, I downloaded the .whl file and placed it in AWS S3. It then asked for "aenum", so I did the same; then isodate was required. So I just wanted to know if there is any single package I can use instead of having separate modules.
Below is the sample script I am testing initially, with all modules, to keep it simple.
import boto3
import os
import sys
import site
import json
import pandas as pd
#from setuptools.command import easy_install
from importlib import reload
from io import StringIO
s3 = boto3.client('s3')
#dir_path = os.path.dirname(os.path.realpath(__file__))
#os.path.dirname(sys.modules['__main__'].__file__)
#install_path = os.environ['GLUE_INSTALLATION']
#easy_install.main( ["--install-dir", install_path, "gremlinpython"] )
#(site)
from gremlin_python import statics
from gremlin_python.structure.graph import Graph
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.strategies import *
from gremlin_python.process.traversal import T, Column
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
The required libraries are listed below; after installing them there are no more errors related to missing modules.
tornado-6.0.4-cp35-cp35m-win32.whl
isodate-0.6.0-py2.py3-none-any.whl
aenum-2.2.4-py3-none-any.whl
gremlinpython-3.4.8-py2.py3-none-any.whl
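Once those wheels resolve, a minimal connection sketch along these lines should work; the wss:// endpoint below is a placeholder for your own Gremlin/Neptune server:
from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# placeholder endpoint - replace with your own Gremlin/Neptune address
conn = DriverRemoteConnection('wss://your-gremlin-endpoint:8182/gremlin', 'g')
g = Graph().traversal().withRemote(conn)
print(g.V().limit(5).toList())
conn.close()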
I'm trying to download my CIFAR-10 data that is in S3 to train on it in AWS SageMaker.
I'm using this code to load the data:
import pickle
import s3fs

fs = s3fs.S3FileSystem()

def unpickle(file):
    dict = pickle.load(file, encoding='bytes')
    return dict

with fs.open(f's3://bucket_name/data_batch_1') as f:
    data = unpickle(f)
I'm getting the error "EOFError: Ran out of input" on the unpickle function. I assume the "file" is empty, but I tried different ways to get the data from my bucket, and can't seem to get it right.
Unless you have granted the appropriate permissions in IAM for the user to have access to the S3 bucket, the easiest fix is to grant public access, i.e. make sure all of the "Block public access" options on the bucket are unchecked.
Then, using boto3 is an option for importing the dataset from S3 into SageMaker. Here is an example:
import boto3
import botocore
import pandas as pd
from sagemaker import get_execution_role
role = get_execution_role()
bucket = 'databucketname'
data_key = 'datasetname.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
train_df = pd.read_csv(data_location)
Hope this helps.
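If the file is a pickled CIFAR-10 batch rather than a CSV, the same boto3 client can hand the raw bytes to pickle. A minimal sketch; the bucket and key names are placeholders:
import pickle
import boto3

s3 = boto3.client('s3')
# placeholder bucket/key - substitute your own
obj = s3.get_object(Bucket='bucket_name', Key='data_batch_1')
batch = pickle.loads(obj['Body'].read(), encoding='bytes')
print(batch.keys())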
How to download the last uploaded files from S3
This code will get the last updated files in my S3 bucket. I just need to download them all at once.
Code:
import os
import boto3
from datetime import datetime, timedelta, timezone

aws_access_key_id = '***'
aws_secret_access_key = '***'

client = boto3.client(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)

now = datetime.now(timezone.utc)
files = client.list_objects_v2(Bucket='mybuycket')['Contents']
This will help you:
import boto3
from datetime import datetime

currentDate = datetime.now().strftime('%d/%m/%Y')

session = boto3.Session(
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)
print("Creating S3 session...\n\n")
s3 = session.resource('s3')

bucket_name = 'mybuycket'  # same bucket used in the question
for file_ in files:
    # list_objects_v2 returns plain dicts, so use the 'LastModified' and 'Key' entries
    filedate = file_['LastModified'].strftime('%d/%m/%Y')
    if filedate == currentDate:
        print(file_['Key'])
        s3.Bucket(bucket_name).download_file(file_['Key'], "path/to/save/file" + file_['Key'])
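Alternatively, the resource API lists ObjectSummary objects with .key and .last_modified attributes, so the client call from the question is not needed. A short sketch reusing the bucket name from the question:
import boto3
from datetime import datetime, timezone

s3 = boto3.resource('s3')
bucket = s3.Bucket('mybuycket')  # bucket name from the question

today = datetime.now(timezone.utc).date()
for obj in bucket.objects.all():
    if obj.last_modified.date() == today:
        bucket.download_file(obj.key, "path/to/save/file" + obj.key)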
I am using the code below to read an Excel file from Amazon S3 using the Python xlrd and urllib modules, but I am getting a Forbidden access error. I know it's because I am not passing the AWS Access Key and AWS Secret Access Key. I looked around for a way to pass the keys as a parameter with urllib but couldn't find an example.
import urllib.request
import xlrd
url = 'https://s3.amazonaws.com/bucket1/final.xlsx'
filecontent = urllib.request.urlopen(url).read()
workbook = xlrd.open_workbook(file_contents=filecontent)
worksheet = workbook.sheet_by_name(SheetName)
How can I read the Excel file from S3 using the Python xlrd module?
This can be done using the boto API:
import boto
import boto.s3.connection
from boto.s3.key import Key
import sys
import pandas as pd

try:
    conn = boto.connect_s3(aws_access_key_id=your_access_key, aws_secret_access_key=your_secret_key)
    bucket = conn.get_bucket('your_bucket')
    print("connected to AWS/S3")
except Exception as e:
    print("unable to connect to S3 - please check credentials")
    print(e)
    sys.exit(1)

destFileName = "/tmp/myFile.xlsx"
k = Key(bucket, "path_to_file_on_s3/sourceFile.xlsx")
k.get_contents_to_filename(destFileName)

df = pd.read_excel(destFileName, sheet_name='Sheet1')
print(df.head())
Use boto3:
import xlrd
import boto3

s3 = boto3.client('s3')

# bucket is your bucket name and filename is the object key of the Excel file
response = s3.list_objects_v2(Bucket=bucket)

s3_object = s3.get_object(Bucket=bucket, Key=filename)
body = s3_object['Body'].read()
book = xlrd.open_workbook(file_contents=body, on_demand=True)
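If you would rather keep the urllib flow from the original question, one further option (a sketch, not part of the answers above) is to have boto3 generate a presigned URL and open that with urllib; the bucket and key below are taken from the question's URL:
import urllib.request
import xlrd
import boto3

s3 = boto3.client('s3')  # picks up credentials from the environment or instance profile
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'bucket1', 'Key': 'final.xlsx'},
    ExpiresIn=3600,
)
filecontent = urllib.request.urlopen(url).read()
workbook = xlrd.open_workbook(file_contents=filecontent)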