I'd like to copy data from an S3 directory to the Amazon Elasticsearch Service. I've tried following the guide, but unfortunately the part I'm looking for is missing: I don't know what the Lambda function itself should look like (all the guide says about it is: "Place your application source code in the eslambda folder."). I'd like ES to auto-index the files.
Currently I'm trying:
for record in event['Records']:
    bucket = record['s3']['bucket']['name']
    key = urllib.unquote_plus(record['s3']['object']['key'])
    index_name = event.get('index_name', key.split('/')[0])
    object = s3_client.Object(bucket, key)
    data = object.get()['Body'].read()
    helpers.bulk(es, data, chunk_size=100)
But I get a massive error stating:
elasticsearch.exceptions.RequestError: TransportError(400, u'action_request_validation_exception', u'Validation Failed: 1: index is missing;2: type is missing;3: index is missing;4: type is missing;5: index is missing;6: type is missing;7: ...
Could anyone explain how I can set things up so that my data gets moved from S3 to ES, where it gets auto-mapped and auto-indexed? Apparently it's possible, as mentioned in the references here and here.
While mappings can be assigned automatically in Elasticsearch, indexes are not generated on their own. You have to specify the index name and type in the POST request; if that index does not exist, Elasticsearch will create it automatically.
Based on your error, it looks like you're not passing an index and type.
For example, here's a simple POST request that adds a record to the index MyIndex and type MyType, creating both if they did not already exist:
curl -XPOST 'example.com:9200/MyIndex/MyType/' \
-d '{"name":"john", "tags" : ["red", "blue"]}'
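Applied to your Lambda code, a minimal sketch might look like the following. It assumes each S3 object holds one JSON document per line, that es and s3_client are already set up as in your snippet, and that 'my-type' is a placeholder type name:
import json
import urllib

from elasticsearch import helpers

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.unquote_plus(record['s3']['object']['key'])
        index_name = event.get('index_name', key.split('/')[0])
        body = s3_client.Object(bucket, key).get()['Body'].read()
        # Wrap each line in an action that names the index and type,
        # which is exactly what the validation error says is missing.
        actions = ({'_index': index_name,
                    '_type': 'my-type',
                    '_source': json.loads(line)}
                   for line in body.splitlines() if line.strip())
        helpers.bulk(es, actions, chunk_size=100)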
I wrote a script to download a CSV file from S3 and then transfer the data to ES.
It makes an S3 client using boto3 and downloads the file from S3.
It makes an ES client to connect to Elasticsearch.
It opens the CSV file and uses the helpers module from elasticsearch to insert the file's contents into Elasticsearch.
main.py
import boto3
from elasticsearch import helpers, Elasticsearch
import csv
import os
from config import *
# S3
Downloaded_Filename = os.path.basename(Prefix)
s3 = boto3.client('s3', aws_access_key_id=awsaccesskey, aws_secret_access_key=awssecretkey, region_name=awsregion)
s3.download_file(Bucket, Prefix, Downloaded_Filename)

# ES
ES_index = Downloaded_Filename.split(".")[0]
ES_client = Elasticsearch([ES_host], http_auth=(ES_user, ES_password), port=ES_port)

# S3 to ES
with open(Downloaded_Filename) as f:
    reader = csv.DictReader(f)
    helpers.bulk(ES_client, reader, index=ES_index, doc_type='my-type')
config.py
awsaccesskey = ""
awssecretkey = ""
awsregion = "us-east-1"
Bucket = ""
Prefix = ''
ES_host = "localhost"
ES_port = "9200"
ES_user = "elastic"
ES_password = "changeme"
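Once main.py has run, a quick sanity check (a hedged sketch reusing ES_client and ES_index from above) is to refresh the index and count the documents:
# force a refresh so freshly indexed docs are visible, then count them
ES_client.indices.refresh(index=ES_index)
print(ES_client.count(index=ES_index))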
I have files hosted in an AWS S3 bucket, and I need all the S3 object URLs in a CSV file. Please suggest how to do this.
You can get all S3 object URLs by using the AWS SDK. First, read all the items in the bucket. The following Java code (AWS SDK for Java v2) shows the logic, which you can port to Python:
ListObjectsRequest listObjects = ListObjectsRequest
        .builder()
        .bucket(bucketName)
        .build();

ListObjectsResponse res = s3.listObjects(listObjects);
List<S3Object> objects = res.contents();

for (ListIterator iterVals = objects.listIterator(); iterVals.hasNext(); ) {
    S3Object myValue = (S3Object) iterVals.next();
    System.out.print("\n The name of the key is " + myValue.key());
}
Then iterate through the list and get each key, as shown above. For each object, you can get the URL with code like this (again portable to Python):
GetUrlRequest request = GetUrlRequest.builder()
        .bucket(bucketName)
        .key(keyName)
        .build();

URL url = s3.utilities().getUrl(request);
System.out.println("The URL for " + keyName + " is " + url.toString());
Put each URL value into a collection and then write the collection out to a CSV. That is how you achieve your use case.
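If you'd rather stay in Python, a rough boto3 equivalent of the same logic could look like this (the bucket name and output path are placeholders, and the virtual-hosted URL format assumes the default AWS partition):
import csv
import boto3

s3 = boto3.client('s3')
bucket_name = 'mybucket'  # placeholder

with open('object_urls.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['key', 'url'])
    # paginate in case the bucket holds more than 1000 objects
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name):
        for obj in page.get('Contents', []):
            url = 'https://%s.s3.amazonaws.com/%s' % (bucket_name, obj['Key'])
            writer.writerow([obj['Key'], url])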
I'm trying to use the sample provided by Microsoft to connect to an Azure Storage table using Python. The code below fails because tablestorageaccount is not found. What am I missing? I installed the azure package, but it still complains that the module is not found.
import azure.common
from azure.storage import CloudStorageAccount
from tablestorageaccount import TableStorageAccount
print('Azure Table Storage samples for Python')
# Create the storage account object and specify its credentials
# to either point to the local Emulator or your Azure subscription
if IS_EMULATED:
    account = TableStorageAccount(is_emulated=True)
else:
    account_connection_string = STORAGE_CONNECTION_STRING
    # Split into key=value pairs removing empties, then split the pairs into a dict
    config = dict(s.split('=', 1) for s in account_connection_string.split(';') if s)

    # Authentication
    account_name = config.get('AccountName')
    account_key = config.get('AccountKey')

    # Basic URL Configuration
    endpoint_suffix = config.get('EndpointSuffix')
    if endpoint_suffix is None:
        table_endpoint = config.get('TableEndpoint')
        table_prefix = '.table.'
        start_index = table_endpoint.find(table_prefix)
        end_index = table_endpoint.endswith(':') and len(table_endpoint) or table_endpoint.rfind(':')
        endpoint_suffix = table_endpoint[start_index + len(table_prefix):end_index]

    account = TableStorageAccount(account_name=account_name, connection_string=account_connection_string, endpoint_suffix=endpoint_suffix)
I found the source sample code; tablestorageaccount.py is a custom module shipped with the sample (not part of the azure package), and it is just used to return a TableService. If you already have the storage connection string and want to run a quick test, you can connect to the table directly.
Sample:
from azure.storage.table import TableService, Entity
account_connection_string = 'DefaultEndpointsProtocol=https;AccountName=account name;AccountKey=account key;EndpointSuffix=core.windows.net'
tableservice = TableService(connection_string=account_connection_string)
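From there, a quick smoke test could be inserting an entity and reading it back (a hedged sketch; the table name and entity values are made up):
# create a table, insert an entity, then read it back
tableservice.create_table('testtable')
tableservice.insert_entity('testtable', {'PartitionKey': 'p1', 'RowKey': 'r1', 'text': 'hello'})
entity = tableservice.get_entity('testtable', 'p1', 'r1')
print(entity.text)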
Also, you could refer to the newer SDK to connect to tables. Here is the official tutorial: Get started with Azure Table storage.
I am new to Python and Firebase, and I am trying to flatten my Firebase database.
I have a database in this format
Each cat has thousands of records in it. All I want is to fetch the cat names and put them in an array; for example, I want the output to be ['cat1', 'cat2', ...]
I was using this tutorial
http://ozgur.github.io/python-firebase/
from firebase import firebase
firebase = firebase.FirebaseApplication('https://your_storage.firebaseio.com', None)
result = firebase.get('/Data', None)
The problem with the above code is that it will attempt to fetch all the data under Data. How can I fetch only the "cats"?
If you want to get the values inside the cats as columns, try using Pyrebase. Install it with pip install pyrebase at a cmd or Anaconda prompt (the latter is preferable if you haven't added pip or Python to your environment paths). After installing:
import pyrebase

config = {"apiKey": yourapikey,
          "authDomain": yourapidomain,
          "databaseURL": yourdatabaseurl,
          "storageBucket": yourstoragebucket,
          "serviceAccount": yourserviceaccount
          }
Note: you can find all the information above in your Firebase console:
https://console.firebase.google.com/project/ >>> your project >>> click on the "</>" icon labeled "Add Firebase to your web app".
Back to the code...
Make a neat definition so you can store it in a .py file:
def connect_firebase():
    # add a way to encrypt these; I'm a starter myself and don't know how
    username = "usernameyoucreatedatfirebase"
    password = "passwordforaboveuser"

    firebase = pyrebase.initialize_app(config)
    auth = firebase.auth()

    # authenticate a user > TODO: figure out how not to leave this hardcoded
    user = auth.sign_in_with_email_and_password(username, password)

    # user['idToken']
    # on Pyrebase's GitHub the author says the token expires every hour, so it needs to be refreshed
    user = auth.refresh(user['refreshToken'])

    # set database
    db = firebase.database()
    return db
OK, now save this into a neat .py file.
Next, in your new notebook or main .py file, import this new .py file, which we'll call auth.py from now on...
from auth import *
import pandas as pd

# assign the connection to a variable
db = connect_firebase()

# and now the hard/easy part that took me a while to figure out:
# notice the value inside .child(); it should be the parent node holding all the cat keys
values = db.child('cats').get()

# to put everything into a dataframe you'll need .val()
data = pd.DataFrame(values.val())
And that's it. Use print(data.head()) to check whether the values and columns are where you expect them to be.
Firebase Realtime Database is one big JSON tree:
"when you fetch data at a location in your database, you also retrieve all of its child nodes."
The best practice is to denormalize your data, creating multiple locations (nodes) for the same data:
"Many times you can denormalize the data by using a query to retrieve a subset of the data."
In your case, you may create a second node named "categories" where you list "only" the category names.
/cat1
/...
/cat2
/...
/cat3
/...
/cat4
/...
/categories
/cat1
/cat2
/cat3
/cat4
In this scenario you can use the update() method to write to more than one location at the same time.
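With Pyrebase, for example, a hedged sketch of such a fan-out write could look like this (the paths and payload are invented for illustration, reusing the db handle from the earlier answer):
# one update() call writes the cat's data and registers its name
# under /categories at the same time
db.update({
    'cats/cat5/name': 'cat5',
    'categories/cat5': True,
})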
I was exploring the Pyrebase documentation. According to it, we can extract only the keys at a given path:
To return just the keys at a particular path use the shallow() method.
all_user_ids = db.child("users").shallow().get()
In your case, it'll be something like:
firebase = pyrebase.initialize_app(config)
db = firebase.database()
allCats = db.child("data").shallow().get()
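To end up with the array you described, one more hedged step should do it, since .val() on a shallow() result behaves like a dict keyed by the child names:
# e.g. ['cat1', 'cat2', ...]
cat_names = list(allCats.val())
print(cat_names)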
Let me know if it doesn't help.
I am using the boto client to download and upload my files to S3, and to do a whole bunch of other things like copying from one folder key to another, etc. The problem arises when I try to copy a key whose size is 0 bytes. The code that I use to copy is below:
# Get the connection to the bucket
conn = boto.connect_s3(AWS_KEY, SECRET_KEY)
bucket = conn.get_bucket('mybucket')
# bucket.name is the name of my bucket
# candidate is the source key
destination_key = "destination/path/on/s3"
candidate = "the/file/to/copy"
# now copy the key
bucket.copy_key(destination_key, bucket.name, candidate) # --> This throws an exception
# just in case, see if the key ended up in the destination.
copied_key = bucket.lookup(destination_key)
The exception that I get is:
S3ResponseError: 404 Not Found
<Error><Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message>
<Key>the/file/to/copy</Key><RequestId>ABC123</RequestId><HostId>XYZ123</HostId>
</Error>
Now, I have verified that the key in fact exists by logging into the AWS console and navigating to the source key location; the key is there, and the console shows that its size is 0 (there are cases in my application where I may end up with empty files, but I need them on S3).
So upload works fine, and boto uploads the key without any issue, but when I attempt to copy it, I get the error that the key does not exist.
So is there any other logic that I should be using to copy such keys? Any help in this regard would be appreciated.
Make sure you include the bucket of the source key. It should be something like bucket/path/to/file/to/copy.
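A minimal sketch of that suggestion, reusing the placeholder paths from the question (this mirrors the claim above rather than verified boto behavior):
# per the suggestion: prefix the source key with its bucket name when copying
candidate = bucket.name + "/the/file/to/copy"
bucket.copy_key(destination_key, bucket.name, candidate)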
Try this:
from boto.s3.key import Key
download_path = '/tmp/dest_test.jpg'
bucket_key = Key(bucket)
bucket_key.key = file_key # e.g. images/source_test.jpg
bucket_key.get_contents_to_filename(download_path)
I'm using boto and CloudFormation to orchestrate a few resources.
To create templates for CloudFormation, I'm reading a JSON file from my local disk and building a JSON string to pass as the template_body parameter:
try:
    fileObj = open(filename, 'r')
    json_data = json.loads(fileObj.read())
    return json_data
except IOError as e:
    print e
    exit()
And my CloudFormation connection and stack creation go like this:
cfnConnectObj = cfn.connection.CloudFormationConnection(aws_access_key_id=aKey, aws_secret_access_key=sKey, is_secure=True, debug=2, path='/', validate_certs=True, region=region[3])  # connection object for the CloudFormation service
stackID = cfnConnectObj.create_stack('demodrupal', template_body=templateJson, template_url=None, parameters=[], notification_arns=[], disable_rollback=False, timeout_in_minutes=None, capabilities=['CAPABILITY_IAM'], tags=None)
I'm getting this boto error: [ERROR]:{"Error":{"Code":"ValidationError","Message":"Template format error: JSON not well-formed. (line 1, column 3)","Type":"Sender"}
Why this error? I have used json.loads, but it still says the JSON is not well formed. Is there anything I'm missing?
Please enlighten me. (I'm new to Python and boto.)
json.loads takes JSON and converts it into a Python object. If you already have a JSON file, you can just pass its contents (the raw string) directly to the service. Alternatively, you can load the JSON into Python, make any adjustments there, and then use json.dumps to get well-formed JSON back. Your error comes from passing the parsed Python object as template_body: its string representation is not valid JSON.
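For instance, a minimal sketch under the same boto setup as the question (filename and the stack name are the question's placeholders):
import json

# Option 1: pass the raw file contents (already a JSON string) straight through.
with open(filename) as f:
    template_body = f.read()

# Option 2: parse, adjust in Python, then serialize back to a string.
with open(filename) as f:
    template = json.load(f)
# ... tweak the template dict here ...
template_body = json.dumps(template)

stackID = cfnConnectObj.create_stack('demodrupal', template_body=template_body, capabilities=['CAPABILITY_IAM'])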