So I am running Elasticsearch and Kibana locally on ports 9200 and 5601 respectively. I am attempting to process a JSONL file into Elasticsearch documents and apply an analyzer to some of the fields.
This is the body:
body = {
    "mappings": {
        "testdoc": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "stop"
                },
                "content": {
                    "type": "text",
                    "analyzer": "stop"
                }
            }
        }
    }
}
I then create a new index (and I am deleting the index between tests so I know it's not that)
es.indices.create("testindex", body=body)
I then parse my JSONL file into documents and upload them to Elasticsearch using
helpers.bulk(es, documents, index="testindex", doc_type="testdoc")
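(The parsing step itself isn't shown here; a minimal sketch of it, assuming one JSON object per line and a placeholder path data.jsonl, would be:)

import json

# One JSON object per line of the JSONL file; the path is a placeholder.
with open("data.jsonl") as f:
    documents = [json.loads(line) for line in f if line.strip()]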
Finally, I query like this:
q = {"query": {"match_all": {}}}
print(es.search(index="testindex", body=q))
My result, for a sample sentence like "The quick brown fox", is unchanged, when I'd expect it to be "quick brown fox".
When I run the same query in Kibana I see the same behavior:
GET /testindex/_search
{
  "query": {
    "match_all": {}
  }
}
Response:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 10,
    "max_score" : 4.128039,
    "hits" : [
      {
        "_index" : "testindex",
        "_type" : "textdocument",
        "_id" : "6bfkb4EBWF89_POuykkO",
        "_score" : 4.128039,
        "_source" : {
          "title" : "The fastest fox",
          "body" : "The fastest fox is also the brownest fox. They jump over lazy dogs."
        }
      }
    ]
  }
}
Now I do this query:
POST /testindex/_analyze
{
  "field": "title",
  "text": "The quick brown fox"
}
I get this response:
{
  "tokens" : [
    {
      "token" : "quick",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "fox",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 3
    }
  ]
}
Which is what I would expect.
But conversely if I do
POST testindex/testdoc
{
  "title": "The fastest fox",
  "content": "Test the a an migglybiggly",
  "published": "2015-09-15T10:17:53Z"
}
and then search for 'migglybiggly', the content field of the returned document has not dropped its stop words.
I am really at a loss as to what I am doing wrong here. I'm fairly new to Elasticsearch and this is really dragging me down.
Thanks in advance!
Edit:
If I run
GET /testindex/_mapping
I see
{
  "testindex" : {
    "mappings" : {
      "testdoc" : {
        "properties" : {
          "title" : {
            "type" : "text",
            "analyzer" : "stop"
          },
          "content" : {
            "type" : "text",
            "analyzer" : "stop"
          }
        }
      }
    }
  }
}
So, to me, it looks like the mapping is getting uploaded correctly, so I don't think it's that?
This is expected behavior: when you execute a query, the response gives you your original content (_source), not the analyzed field.
The analyzer controls how Elasticsearch indexes a field into the inverted index; it is not for changing your actual content. The same analyzer is applied at query time as well, so when you pass a query it will use the stop analyzer, remove the stopwords, and search the inverted index.
The POST /testindex/_analyze API shows how your original content is analyzed / tokenized and stored in the inverted index. It does not change your original document.
So when you run a match_all query, it just gets all the documents from Elasticsearch and returns them with _source, which holds the original document content.
You can use a match query for matching on a specific field instead of match_all, as match_all will give you all the documents from the index (10 by default).
{
  "query": {
    "match": {
      "title": "The quick brown fox"
    }
  }
}
Here, you can try queries like "quick brown fox" or "The quick", etc.
Hope this clears up your understanding.
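To see this from the Python client, here is a rough sketch (same index and field names as above; the exact response shape varies by Elasticsearch version):

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes the local node on port 9200, as in the question

# "the" is a stopword, so it was never written to the inverted index:
# this match query should return no hits.
print(es.search(index="testindex", body={"query": {"match": {"title": "the"}}}))

# "quick" was indexed, so this matches the document, but _source in the
# response still contains the original, un-analyzed "The quick brown fox".
print(es.search(index="testindex", body={"query": {"match": {"title": "quick"}}}))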
Related
My data is:
POST index_name/_doc/1
{
  "created_date" : "2023-02-09T13:21:41.632492",
  "created_by" : "hats",
  "data" : [
    {
      "name" : "west cost",
      "document_id" : "1"
    },
    {
      "name" : "mist cost",
      "document_id" : "2"
    },
    {
      "name" : "cost",
      "document_id" : "3"
    }
  ]
}
I used query_string to search:
GET index_name/_search
{
  "query": {
    "query_string": {
      "default_field": "data.name",
      "query": "*t cost"
    }
  }
}
The expected result was:
west cost, mist cost
but the output was:
west cost, mist cost, cost
I have tried many search queries but still couldn't find a solution.
Which search query handles the space? I need to search for values with a similar pattern in the field.
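One direction to explore (a sketch, not a definitive answer): run the wildcard against the un-analyzed data.name.keyword sub-field, assuming the default dynamic mapping created one. There "west cost" is stored as a single term, so the pattern "*t cost" can match it (and "mist cost") while plain "cost" cannot:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Assumes a data.name.keyword sub-field exists; the wildcard is matched
# against the whole value, so the space stays inside the term.
resp = es.search(index="index_name", body={
    "query": {
        "wildcard": {"data.name.keyword": {"value": "*t cost"}}
    }
})
print([hit["_source"] for hit in resp["hits"]["hits"]])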
I'm currently exploring Celery for my work and I'm trying to set up an Elasticsearch backend. Is there any way to send the resulting value as a dictionary/JSON, not as text, so that results in Elasticsearch are shown correctly and the nested type can be used?
Automatic mapping created by Celery:
{
  "celery" : {
    "mappings" : {
      "backend" : {
        "properties" : {
          "#timestamp" : {
            "type" : "date"
          },
          "result" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}
I've tried to create my own mapping with a nested field, but it resulted in an elasticsearch.exceptions.RequestError: RequestError(400, 'mapper_parsing_exception', 'object mapping for [result] tried to parse field [result] as object, but found a concrete value')
UPDATE
The result is already encoded as JSON, and inside the Elasticsearch wrapper the JSON string is saved inside a dictionary. Adding json.loads(result) as a quick fix actually helps.
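To illustrate that quick fix (a toy sketch, not the actual Celery internals; the example values are made up):

import json

# What gets stored today: the whole Celery result serialized as a JSON string,
# which Elasticsearch can only map as plain text.
raw = '{"task_id": "abc", "status": "SUCCESS", "result": 42, "date_done": "2019-01-01T00:00:00"}'

# Decoding it first hands Elasticsearch a real object, so it can build the
# nested result.* mapping shown below.
doc = json.loads(raw)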
After the quick fix, a new mapping appeared:
{
  "celery" : {
    "mappings" : {
      "backend" : {
        "properties" : {
          "#timestamp" : {
            "type" : "date"
          },
          "result" : {
            "properties" : {
              "date_done" : {
                "type" : "date"
              },
              "result" : {
                "type" : "long"
              },
              "status" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              },
              "task_id" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
Updated Kibana view:
Is there any way to disable serialization of results in Celery?
I could add a pull-request with unpacking JSON, just for Elasticsearch, but it looks like a hack.
Since v4.0 the default result_serializer is json, so you should have results in JSON format anyway. Maybe your configuration uses something else? In that case I suggest you remove that setting (if you use Celery >= 4.0) and you should get results in JSON format. I prefer msgpack, but on the other hand I do not use Elasticsearch for Celery results...
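For reference, a minimal configuration sketch (the broker URL, index name, and doc type are placeholders; result_serializer is set only to make the default explicit):

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # broker is a placeholder

# json is already the default result serializer since Celery 4.0.
app.conf.result_serializer = "json"

# Elasticsearch result backend URL format: elasticsearch://host:port/index/doc_type
app.conf.result_backend = "elasticsearch://localhost:9200/celery/backend"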
I am working on AWS Elasticsearch using Python. I have a JSON file with 3 fields
("cat1", "Cat2", "cat3"); each row is separated by \n.
For example: cat1: food, cat2: wine, cat3: lunch, etc.
from requests_aws4auth import AWS4Auth
import boto3
import requests

payload = {
    "settings": {
        "number_of_shards": 10,
        "number_of_replicas": 5
    },
    "mappings": {
        "Categoryall": {
            "properties": {
                "cat1": {
                    "type": "string"
                },
                "Cat2": {
                    "type": "string"
                },
                "cat3": {
                    "type": "string"
                }
            }
        }
    }
}

r = requests.put(url, auth=awsauth, json=payload)
I created the schema/mapping for the index as shown above, but I don't know how to populate the index.
I am thinking of writing a for loop over the JSON file and calling a POST request for each record, but I don't have an idea how to proceed.
I want to create the index and bulk upload this file into it. Any suggestion would be appreciated.
Take a look at the Elasticsearch Bulk API.
Basically, you need to create a bulk request body and post it to your "https://{elastic-endpoint}/_bulk" URL.
The following example shows a bulk request to insert 3 JSON records into your index called "my_index":
{ "index" : { "_index" : "my_index", "_type" : "_doc", "_id" : "1" } }
{ "cat1" : "food 1", "cat2": "wine 1", "cat3": "lunch 1" }
{ "index" : { "_index" : "my_index", "_type" : "_doc", "_id" : "2" } }
{ "cat1" : "food 2", "cat2": "wine 2", "cat3": "lunch 2" }
{ "index" : { "_index" : "my_index", "_type" : "_doc", "_id" : "3" } }
{ "cat1" : "food 3", "cat2": "wine 3", "cat3": "lunch 3" }
where each JSON record is represented by 2 JSON objects (an action line followed by the document itself).
So if you write your bulk request body into a file called post-data.txt, then you can post it from Python with something like this:
with open('post-data.txt', 'rb') as payload:
    r = requests.post('https://your-elastic-endpoint/_bulk', auth=awsauth,
                      data=payload, ... add more params)
Alternatively, you can try Python elasticsearch bulk helpers.
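A rough sketch of that bulk-helpers approach, reusing the awsauth object from the question (the endpoint, index name, and file name are placeholders):

import json
from elasticsearch import Elasticsearch, RequestsHttpConnection, helpers

# Endpoint is a placeholder; awsauth comes from the snippet in the question.
es = Elasticsearch(
    hosts=[{"host": "your-elastic-endpoint", "port": 443}],
    http_auth=awsauth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

def actions(path):
    # One JSON record per line -> one index action per line.
    with open(path) as f:
        for line in f:
            if line.strip():
                yield {"_index": "my_index", "_type": "_doc", "_source": json.loads(line)}

helpers.bulk(es, actions("categories.jsonl"))  # file name is a placeholder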
First time exploring MongoDB and I've run into a pickle.
Assuming I have a table/collection called inventory.
This collection in turn has documents that look like:
{
    "book" : "Harry Potter",
    "users" : {
        "Read_it" : {
            "John" : <personal number>,
            "Elise" : <personal number>
        },
        "Currently_reading" : { ... }
    }
}
Now the dictionary "Read_it" can become quite large, and I'm limited by the amount of memory the querying client has, so I would like to somehow limit the number of returned items and perhaps page them.
This is a call I found in the docs; I'm not sure how to convert it into what I need.
db.inventory.find( { "book": "Harry Potter" }, { item: 1, qty: 500 } )
Skipping the second parameter to find() gives me a result in the form of a complete dictionary, which works as long as the "Read_it" document/container doesn't grow too big.
One solution would be to flatten the structure, but that isn't optimal in terms of other aspects of this project.
Is it possible to work with find() here, or is there another function that can do this better?
You seem to be asking about projecting only specific elements of a nested structure.
Consider your document example (revised for use):
{
    "book" : "Harry Potter",
    "users" : {
        "Read_it" : {
            "John" : 1,
            "Elise" : 2
        },
        "Currently_reading" : {
            "Peter": 1
        },
        "More_information": 5
    }
}
Then just issue a query as follows:
db.collection.find(
    { "book": "Harry Potter" },
    {
        "book": 1,
        "users.Currently_reading": 1,
        "users.More_information": 1
    }
)
This returns the result with just the fields specified:
{
    "_id" : ObjectId("5573b2beb67e246aba2b4b71"),
    "book" : "Harry Potter",
    "users" : {
        "Currently_reading" : {
            "Peter" : 1
        },
        "More_information" : 5
    }
}
Not entirely sure, but that might not be supported in all MongoDB versions. Works in 3.X though. If you find it is not supported then do this instead:
db.collection.aggregate([
    { "$match": { "book": "Harry Potter" } },
    { "$project": {
        "book": 1,
        "users": {
            "Currently_Reading": "$users.Currently_reading",
            "More_information": "$users.More_information"
        }
    }}
])
The $project stage of the .aggregate() method allows you to manipulate the returned document quite freely, so you don't even need to keep the same structure to return nested results, and you could reshape the result further if needed.
I would also strongly suggest using arrays of sub-documents with named properties rather than nested dictionaries, since that form is much easier to query and filter than your current structure allows.
Addition to address the unclear part of the question
As mentioned, it is better to use arrays rather than keys to represent the nested data. So if your intent is actually just to restrict the "Read_it" items to a number of entries, then your data is best modelled like this:
{
    "book" : "Harry Potter",
    "users" : {
        "Read_it" : [
            { "username": "John", "id": 1 },
            { "username": "Elise", "id": 2 }
        ],
        "Currently_reading" : [
            { "username": "Peter", "id": 3 }
        ],
        "More_information": 5
    }
}
Then you can do a query to limit the number of items in "Read_it" using $slice:
db.collection.find(
    { "book": "Harry Potter" },
    { "users.Read_it": { "$slice": 1 } }
)
Which returns:
{
    "_id" : ObjectId("5574118012ae33005f1fca17"),
    "book" : "Harry Potter",
    "users" : {
        "Read_it" : [
            {
                "username" : "John",
                "id" : 1
            }
        ],
        "Currently_reading" : [
            {
                "username" : "Peter",
                "id" : 3
            }
        ],
        "More_information" : 5
    }
}
Alternate options use the projection positional $ operator or even the aggregation framework for multiple matches in the array. But there are already many answers here that show you how to do that.
MongoDB noob here...
When I do db.students.find().pretty() in the shell, I get a long list from my collection... like so:
{
    "_id" : 19,
    "name" : "Gisela Levin",
    "scores" : [
        {
            "type" : "exam",
            "score" : 44.51211101958831
        },
        {
            "type" : "quiz",
            "score" : 0.6578497966368002
        },
        {
            "type" : "homework",
            "score" : 93.36341655949683
        },
        {
            "type" : "homework",
            "score" : 49.43132782777443
        }
    ]
}
Now I've got over 100 of these... I need to run the following on each of them:
lowest_hw_score = db.students.aggregate(
    // Initial document match (uses index, if a suitable one is available)
    { $match: {
        _id : 0
    }},
    // Expand the scores array into a stream of documents
    { $unwind: '$scores' },
    // Filter to 'homework' scores
    { $match: {
        'scores.type': 'homework'
    }},
    // Sort in ascending order (lowest score first)
    { $sort: {
        'scores.score': 1
    }},
    { $limit: 1 }
)
So I can run something like this on each result:
for item in lowest_hw_score:
    print item
Right now "lowest_hw_score" works on only one item. I want to run this on all items in the collection... how do I do this?
> db.students.aggregate(
    { $match: { 'scores.type': 'homework' } },
    { $unwind: "$scores" },
    { $match: { "scores.type": "homework" } },
    { $group: {
        _id : "$_id",
        maxScore : { $max : "$scores.score" },
        minScore : { $min : "$scores.score" }
    }
});
You don't really need the first $match, but if "scores.type" is indexed, it means the index can be used before unwinding the scores. (I don't believe Mongo can use the index after the $unwind.)
Result:
{
    "result" : [
        {
            "_id" : 19,
            "maxScore" : 93.36341655949683,
            "minScore" : 49.43132782777443
        }
    ],
    "ok" : 1
}
Edit: tested and updated in mongo shell
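Since the question iterates the result from Python, here is a rough PyMongo sketch of the same pipeline over the whole collection (the connection string and database name are placeholders):

from pymongo import MongoClient

# Connection string and database name are placeholders.
db = MongoClient("mongodb://localhost:27017")["school"]

pipeline = [
    {"$match": {"scores.type": "homework"}},   # optional, lets an index help
    {"$unwind": "$scores"},
    {"$match": {"scores.type": "homework"}},
    {"$group": {
        "_id": "$_id",
        "minScore": {"$min": "$scores.score"},
        "maxScore": {"$max": "$scores.score"},
    }},
]

# One result document per student, with that student's min and max homework score.
for doc in db.students.aggregate(pipeline):
    print(doc["_id"], doc["minScore"], doc["maxScore"])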