I am working on AWS Elasticsearch using Python. I have a JSON file with 3 fields
("cat1", "Cat2", "cat3"), and each row is separated by \n,
for example cat1: food, cat2: wine, cat3: lunch, etc.
from requests_aws4auth import AWS4Auth
import boto3
import requests
payload = {
    "settings": {
        "number_of_shards": 10,
        "number_of_replicas": 5
    },
    "mappings": {
        "Categoryall": {
            "properties": {
                "cat1": {
                    "type": "string"
                },
                "Cat2": {
                    "type": "string"
                },
                "cat3": {
                    "type": "string"
                }
            }
        }
    }
}
r = requests.put(url, auth=awsauth, json=payload)
I created the schema/mapping for the index as shown above, but I don't know how to populate the index.
I am thinking of looping over the JSON file and calling a POST request to insert each record, but I don't have an idea how to proceed.
I want to create the index and bulk upload this file into it. Any suggestion would be appreciated.
Take a look at the Elasticsearch Bulk API.
Basically, you need to create a bulk request body and POST it to your "https://{elastic-endpoint}/_bulk" URL.
The following example shows a bulk request to insert 3 JSON records into your index called "my_index":
{ "index" : { "_index" : "my_index", "_type" : "_doc", "_id" : "1" } }
{ "cat1" : "food 1", "cat2": "wine 1", "cat3": "lunch 1" }
{ "index" : { "_index" : "my_index", "_type" : "_doc", "_id" : "2" } }
{ "cat1" : "food 2", "cat2": "wine 2", "cat3": "lunch 2" }
{ "index" : { "_index" : "my_index", "_type" : "_doc", "_id" : "3" } }
{ "cat1" : "food 3", "cat2": "wine 3", "cat3": "lunch 3" }
where each JSON record is represented by 2 JSON objects.
So if you write your bulk request body into a file called post-data.txt, then you can post it from Python with something like this:
with open('post-data.txt', 'rb') as payload:
    r = requests.post('https://your-elastic-endpoint/_bulk', auth=awsauth,
                      data=payload, ... add more params)
Alternatively, you can try Python elasticsearch bulk helpers.
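For example, here is a rough sketch of the helpers approach (untested; the region, endpoint, the file name "data.json" and the index name "my_index" are placeholders you will need to replace). It reads your newline-delimited JSON file and indexes every line with helpers.bulk:
import json

import boto3
from elasticsearch import Elasticsearch, RequestsHttpConnection, helpers
from requests_aws4auth import AWS4Auth

region = "us-east-1"  # placeholder region
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, "es",
                   session_token=credentials.token)

es = Elasticsearch(
    hosts=[{"host": "your-elastic-endpoint", "port": 443}],  # placeholder endpoint
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

def generate_actions(path):
    # Each line of the file is one JSON record with cat1/Cat2/cat3.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield {
                    "_index": "my_index",
                    "_type": "Categoryall",  # the mapping type from your payload above
                    "_source": json.loads(line),
                }

helpers.bulk(es, generate_actions("data.json"))  # "data.json" is a placeholder file name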
Related
So I am running Elasticsearch and Kibana locally on ports 9200 and 5601 respectively. I am attempting to process a JSONL file into Elasticsearch documents and apply an analyzer to some of the fields.
This is the body:
body = {
    "mappings": {
        "testdoc": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "stop"
                },
                "content": {
                    "type": "text",
                    "analyzer": "stop"
                }
            }
        }
    }
}
I then create a new index (and I am deleting the index between tests so I know it's not that)
es.indices.create("testindex", body=body)
I then parse my JSONL object into documents and upload them to Elasticsearch using
helpers.bulk(es, documents, index="textindex", doc_type="testdoc")
Finally, I query like this:
q = {"query": {"match_all": {}}}
print(es.search(index="testindex", body=q))
My result, for a sample sentence like "The quick brown fox" is unchanged when I'd expect it to be 'quick brown fox'.
When I run the same query in Kibana I also see it not working
GET /testindex/_search
{
  "query": {
    "match_all": {}
  }
}
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 10,
"max_score" : 4.128039,
"hits" : [
{
"_index" : "testindex",
"_type" : "textdocument",
"_id" : "6bfkb4EBWF89_POuykkO",
"_score" : 4.128039,
"_source" : {
"title" : "The fastest fox",
"body" : "The fastest fox is also the brownest fox. They jump over lazy dogs."
}
}
]
}
}
Now I do this query:
POST /testindex/_analyze
{
"field": "title",
"text": "The quick brown fox"
}
I get this response:
{
"tokens" : [
{
"token" : "quick",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 1
},
{
"token" : "brown",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 2
},
{
"token" : "fox",
"start_offset" : 16,
"end_offset" : 19,
"type" : "word",
"position" : 3
}
]
}
Which is what I would expect.
But conversely if I do
POST testindex/testdoc
{
"title":"The fastest fox",
"content":"Test the a an migglybiggly",
"published":"2015-09-15T10:17:53Z"
}
and then search for 'migglybiggly', the content field of the returned document has not dropped its stop words.
I am really at a loss as to what I am doing wrong here. I'm fairly new to elasticsearch and this is really dragging me down.
Thanks in advance!
Edit:
If I run
GET /testindex/_mapping
I see
{
"textindex" : {
"mappings" : {
"testdoc" : {
"properties" : {
"title" : {
"type" : "text",
"analyzer" : "stop"
},
"content" : {
"type" : "text",
"analyzer" : "stop"
}
}
}
}
}
}
So, to me, it looks like the mapping is getting uploaded correctly, so I don't think it's that?
This is expected behavior: when you execute a query, the response gives you back your original content (_source), not the analyzed fields.
The analyzer controls how Elasticsearch tokenizes a field into the inverted index; it does not change your actual content. The same analyzer is applied at query time as well, so when you send a query it will use the stop analyzer, remove the stopwords, and search for the remaining terms in the inverted index.
The POST /testindex/_analyze API shows how your original content is analyzed/tokenized before being stored in the inverted index. It does not change your original document.
So when you run a match_all query, it simply returns all documents from Elasticsearch, with _source holding the original document content.
You can use a match query to match on a specific field instead of match_all, which returns every document in the index (10 by default):
{
  "query": {
    "match": {
      "title": "The quick brown fox"
    }
  }
}
Here, you can try queries like quick brown fox or The quick, etc.
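For a quick check from Python, here is a minimal sketch (index and field names are taken from your question; reading hits.total as a plain number assumes a pre-7.x cluster, which matches the responses above) showing that the stop analyzer affects matching rather than the stored _source:
from elasticsearch import Elasticsearch

es = Elasticsearch()  # local cluster on 9200, as in the question

# "fox" survives the stop analyzer, so this should return hits,
# and _source still contains the original, un-analyzed text.
res = es.search(index="testindex", body={"query": {"match": {"title": "fox"}}})
print(res["hits"]["total"])                        # > 0
print(res["hits"]["hits"][0]["_source"]["title"])  # e.g. "The fastest fox"

# "the" is removed by the stop analyzer at query time, so nothing can match.
res = es.search(index="testindex", body={"query": {"match": {"title": "the"}}})
print(res["hits"]["total"])                        # 0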
Hope this clears up your understanding.
Is there a way to configure the elasticsearch analyzer so that it is possible to get unique error messages in different scenarios?
1."...July 2020 23:00:00.674z... same message....."
2. slight changes in the string :
message1: "....message_details.. (unknown error 20004)
message2: "....message_details.. (unknown error 278945)
OR
message1:"....a::::: message_details ...."
message2:"....a:f23ed:fff:ff:: message_details ...."
The above two messages are the same apart from the character difference.
Here is the query :
GET log_stash_2020.06.16/_search
{
"query": {
"bool": {
"must": [
{
"match_phrase": {
"message": "Error"
}
},
{
"match_phrase": {
"type": "lab_id"
}
}
]
}
},
"aggs": {
"log_message": {
"significant_text": {
"field": "message",
"filter_duplicate_text": "true"
}
}
},
"size": 1000
}
I have added the sample log file.
{
"_index" : "logstash_2020.06.16",
"_type" : "doc",
"_id" : "################",
"_score" : 1.0,
"_source" : {
"logsource" : "router_id",
"timestamp" : "Jun 15 20:00:00",
"program" : "some_program",
"host" : "#############",
"priority" : "27",
"#timestamp" : "2020-06-16T00:00:01.020Z",
"type" : "lab_id",
"pid" : "####",
"message" : ": ############### send failed with error: ENOENT -- Item not found (No error: 0)",
"#version" : "1"
}
}
{
"_index" : "logstash_2020.06.16",
"_type" : "doc",
"_id" : "################",
"_score" : 1.0,
"_source" : {
"host" : "################",
"#timestamp" : "2020-06-16T00:00:02.274Z",
"type" : "####",
"tags" : [
"_grokparsefailure"
],
"message" : "################:Jun 15 20:00:18.908 EDT: mediasvr[2546]: %MEDIASVR-MEDIASVR-4-PARTITION_USAGE_ALERT : High disk usage alert : host ##### exceeded 100% \n",
"#version" : "1"
}
}
Is there a way to do it in Python? (If Elasticsearch does not have the above-mentioned functionality.)
You can use the Elasticsearch Python client like so:
from elasticsearch import Elasticsearch
es = Elasticsearch(...)
resp = es.search(index="log_stash_2020.06.16", body={<dsl query>})
print(resp)
where <dsl query> is whatever query you want to run, like the one you gave in the question.
<disclosure: I'm the maintainer of the Elasticsearch client and employed by Elastic>
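For example, a sketch plugging in the query and aggregation from the question (the connection details are placeholders; fill in your own hosts and auth):
from elasticsearch import Elasticsearch

es = Elasticsearch()  # placeholder connection; add your hosts/auth here

query = {
    "query": {
        "bool": {
            "must": [
                {"match_phrase": {"message": "Error"}},
                {"match_phrase": {"type": "lab_id"}},
            ]
        }
    },
    "aggs": {
        "log_message": {
            "significant_text": {"field": "message", "filter_duplicate_text": True}
        }
    },
    "size": 1000,
}

resp = es.search(index="log_stash_2020.06.16", body=query)
for bucket in resp["aggregations"]["log_message"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])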
I'm exploring Celery for my work currently and I'm trying to set up an Elasticsearch backend. Is there any way to send the resulting value as a dictionary/JSON, not as text? Then results in Elasticsearch would be shown correctly and the nested type could be used.
Automatic mapping created by Celery:
{
"celery" : {
"mappings" : {
"backend" : {
"properties" : {
"#timestamp" : {
"type" : "date"
},
"result" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
I've tried to create my own mapping with a nested field, but it resulted in elasticsearch.exceptions.RequestError: RequestError(400, 'mapper_parsing_exception', 'object mapping for [result] tried to parse field [result] as object, but found a concrete value')
UPDATE
The result is already encoded in JSON, and inside the Elasticsearch wrapper the JSON string is saved inside a dictionary. Adding json.loads(result) as a quick fix actually helps.
After the quick fix, a new mapping appeared:
{
"celery" : {
"mappings" : {
"backend" : {
"properties" : {
"#timestamp" : {
"type" : "date"
},
"result" : {
"properties" : {
"date_done" : {
"type" : "date"
},
"result" : {
"type" : "long"
},
"status" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"task_id" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
}
}
Updated Kibana view:
Is there any way to disable serialization of results in Celery?
I could add a pull request that unpacks the JSON just for Elasticsearch, but it looks like a hack.
Since v4.0 the default result_serializer is json, so you should have results in JSON format anyway. Maybe your configuration uses something else? In that case I suggest you remove that setting (if you use Celery >= 4.0) and you should enjoy results in JSON format. I prefer msgpack, but on the other hand I do not use Elasticsearch for Celery results...
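If your configuration does override the serializer, a minimal sketch of the relevant settings might look like this (the broker URL is an assumption; the index "celery" and doc type "backend" in the backend URL match the mapping shown above):
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # broker URL is an assumption

app.conf.update(
    # Elasticsearch result backend: index "celery", doc type "backend"
    result_backend="elasticsearch://localhost:9200/celery/backend",
    result_serializer="json",  # the default since Celery 4.0
)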
I'm trying to use the bulk API from Elasticsearch, and I see that this can be done using the following request, which is special because what is given as "data" is not one proper JSON document, but JSON objects delimited by \n.
curl -XPOST 'localhost:9200/_bulk?pretty' -H 'Content-Type: application/json' -d '
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }
'
My question is: how can I perform such a request within Python? The authors of Elasticsearch suggest not pretty-printing the JSON, but I'm not sure what that means (see https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html)
I know that this is a valid python request
import requests
import json
data = json.dumps({"field":"value"})
r = requests.post("localhost:9200/_bulk?pretty", data=data)
But what do I do if the JSON is \n-delimited?
What this really is, is a set of individual JSON documents joined together with newlines. So you could do something like this:
data = [
    { "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } },
    { "field1" : "value1" },
    { "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } },
    { "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } },
    { "field1" : "value3" },
    { "update" : { "_id" : "1", "_type" : "type1", "_index" : "test" } },
    { "doc" : { "field2" : "value2" } }
]

data_to_post = '\n'.join(json.dumps(d) for d in data)
r = requests.post("localhost:9200/_bulk?pretty", data=data_to_post)
However, as pointed out in the comments, the Elasticsearch Python client is likely to be more useful.
As a follow-up to Daniel's answer above, I had to add an additional '\n' to the end of data_to_post and add a {"Content-Type": "application/x-ndjson"} header to get it to work in Elasticsearch 6.3.
data_to_post = '\n'.join(json.dumps(d) for d in data) + "\n"
headers = {"Content-Type": "application/x-ndjson"}
r = requests.post("http://localhost:9200/_bulk?pretty", data=data_to_post, headers=headers)
Otherwise, I will receive the error:
"The bulk request must be terminated by a newline [\\n]"
You can use python ndjson library to do it.
https://pypi.org/project/ndjson/
It contains JSONEncoder and JSONDecoder classes for easy use with other libraries, such as requests:
import ndjson
import requests
response = requests.get('https://example.com/api/data')
items = response.json(cls=ndjson.Decoder)
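For the bulk use case it is the encoder side that matters: dump the action/document pairs to an NDJSON string and POST it. A rough sketch (the endpoint, index and type names are placeholders; the trailing newline and the application/x-ndjson header are the same requirements mentioned in the answer above):
import ndjson
import requests

actions = [
    {"index": {"_index": "test", "_type": "type1", "_id": "1"}},
    {"field1": "value1"},
    {"index": {"_index": "test", "_type": "type1", "_id": "2"}},
    {"field1": "value2"},
]

# ndjson.dumps joins the objects with newlines; the bulk API additionally
# requires the body to end with a newline, so append one explicitly.
payload = ndjson.dumps(actions) + "\n"

r = requests.post(
    "http://localhost:9200/_bulk?pretty",
    data=payload,
    headers={"Content-Type": "application/x-ndjson"},
)
print(r.json())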
I am trying to work with Python Elasticsearch version 1.1.0, on the master branch. It seems it will create an index, but there are issues with retrieving autocomplete results when using a suggestion field.
Below are the basic Python functions to create an index and then add a song to it; finally, we query it through the curl request at the very bottom.
Unfortunately it fails with the error:
"reason" : "BroadcastShardOperationFailedException[[music][2] ]; nested: ElasticsearchException[failed to execute suggest]; nested: ElasticsearchException[Field [suggest] is not a completion suggest field]; "
} ]'
The functions I am using to create the index and add a song are below:
conn = Elasticsearch()

def mapping():
    return """{
        "song" : {
            "properties" : {
                "name" : { "type" : "string" },
                "suggest" : {
                    "type" : "completion",
                    "index_analyzer" : "simple",
                    "search_analyzer" : "simple",
                    "payloads" : true
                }
            }
        }
    }"""

def createMapping():
    settings = mapping()
    conn.indices.create(index="music", body=settings)

def addSong():
    body = """{
        "name" : "Nevermind",
        "suggest" : {
            "input": [ "Nevermind", "Nirvana" ],
            "output": "Nirvana - Nevermind",
            "payload" : { "artistId" : 2321 },
            "weight" : 34
        }
    }"""
    res = conn.index(body=body, index="music", doc_type="song", id=1)
Curl request:
curl -X POST 'localhost:9200/music/_suggest?pretty' -d '{
"song-suggest" : {
"text" : "n",
"completion" : {
"field" : "suggest"
}
}
}'
When you use the create index API, you have to wrap your mappings in a top-level "mappings" object:
def createMapping():
    settings = """{"mappings": %s}""" % mapping()
    conn.indices.create(index="music", body=settings)
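Equivalently, you could build the body as a plain Python dict instead of formatting strings; a sketch of the same fix, with the mapping content unchanged from the question:
def createMapping():
    body = {
        "mappings": {
            "song": {
                "properties": {
                    "name": {"type": "string"},
                    "suggest": {
                        "type": "completion",
                        "index_analyzer": "simple",
                        "search_analyzer": "simple",
                        "payloads": True,
                    },
                }
            }
        }
    }
    conn.indices.create(index="music", body=body)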