Partial search is not working on multiple fields.
Data: - "Sales inquiries generated".
{
  "query_string": {
    "fields": ["name", "title", "description", "subject"],
    "query": search_data + "*"
  }
}
Case 1: When I pass the search data as "inquiri" it works fine, but when I pass "inquirie" it does not.
Case 2: When I pass "sale" it works fine, but when I pass "sales" it does not.
Case 3: When I pass "generat" it works fine, but when I pass "generate" it does not.
I defined my fields this way (with elasticsearch_dsl):
from elasticsearch_dsl import Keyword, Text, analyzer

text_analyzer = analyzer("text_analyzer", tokenizer="standard", filter=["lowercase", "stop", "snowball"])
name = Text(analyzer=text_analyzer, fields={"raw": Keyword()})
title = Text(analyzer=text_analyzer, fields={"raw": Keyword()})
subject = Text(analyzer=text_analyzer, fields={"raw": Keyword()})
What is the issue in my code? Any help would be much appreciated!
Thanks in advance.
This is happening because of the snowball token filter, which stems the words; please refer to the official snowball documentation for more info.
I created the same analyzer with your settings to see the tokens generated for your text, since in the end a search only matches when the indexed tokens match the search-term tokens.
Elasticsearch provides nice REST APIs, so you can easily reproduce the issue.
Create the index with your settings:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"snowball",
"stop"
]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Once the index is created, you can use the _analyze API to see the generated tokens for your text:
POST http://{{hostname}}:{{port}}/{{index-name}}/_analyze
{
"analyzer": "my_analyzer",
"text": "Sales inquiries generated"
}
The response:
{
"tokens": [
{
"token": "sale",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "inquiri",
"start_offset": 6,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "generat",
"start_offset": 16,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 2
}
]
}
You can see the generated tokens are sale, inquiri and generat, which is exactly what your working search terms match; the longer forms (sales, inquirie, generate) never appear in the index, so those wildcard prefixes match nothing. If you need the original, unstemmed values, query the raw (keyword) sub-field of your text fields instead of the analyzed part.
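If the goal is to match full words such as sales or inquiries rather than stemmed prefixes, one option is to drop the trailing wildcard and let a multi_match query analyze the search text with the same analyzer as the fields, so the query terms are stemmed exactly like the indexed text. A minimal sketch with the Python client; the host, index name and search_data value are placeholders, not taken from the question:

from elasticsearch import Elasticsearch

client = Elasticsearch(["http://localhost:9200"])  # placeholder host
search_data = "inquiries"  # full word, no trailing "*"

# multi_match analyzes the query text per field, so "inquiries" is
# stemmed to "inquiri" and matches the token stored in the index.
response = client.search(
    index="my-index",  # placeholder index name
    body={
        "query": {
            "multi_match": {
                "query": search_data,
                "fields": ["name", "title", "description", "subject"]
            }
        }
    }
)
print(response["hits"]["total"])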
Related
I am not able to understand how to implement an Elasticsearch query together with a synonym table. With a general query I don't have any search problems, but incorporating synonyms has become an issue for me.
es.search(index='data_inex', body={
    "query": {
        "match": {"inex": "tren"}
    },
    "settings": {
        "filter": {
            "synonym": {
                "type": "synonym",
                "lenient": True,
                "synonyms": ["foo, baz", "tren, hut"]
            }
        }
    }
})
Also, is it possible to use a file instead of this array?
Check the synonym token filter documentation.
You can configure a synonyms file as well:
PUT /test_index
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [ "synonym" ]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "analysis/synonym.txt" // <======== location of synonym file
}
}
}
}
}
}
Please note:
Changes in the synonyms file will not be reflected in documents indexed before the change; those documents need to be re-indexed.
You cannot change the mapping (including the analyzer) of an existing field. If you want to change the mapping of existing documents, you need to reindex them into another index with the updated mapping.
The search request body does not support "settings".
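Putting it together, the synonym filter belongs in the index settings at creation time and is wired into the field's analyzer; the search request then contains only the query. A rough sketch with the Python client: the host is a placeholder, the filter and analyzer names are made up, and the mapping shape assumes a recent Elasticsearch version without mapping types.

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

# Synonyms live in the index settings, not in the search request body.
es.indices.create(
    index="data_inex",
    body={
        "settings": {
            "analysis": {
                "filter": {
                    "my_synonyms": {  # hypothetical filter name
                        "type": "synonym",
                        "lenient": True,
                        "synonyms": ["foo, baz", "tren, hut"]
                    }
                },
                "analyzer": {
                    "synonym_analyzer": {  # hypothetical analyzer name
                        "tokenizer": "whitespace",
                        "filter": ["lowercase", "my_synonyms"]
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "inex": {"type": "text", "analyzer": "synonym_analyzer"}
            }
        }
    }
)

# The search body then only carries the query.
es.search(index="data_inex", body={"query": {"match": {"inex": "tren"}}})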
I am trying to extract a specific part of the JSON but I keep on getting errors.
I am interested in the following sections:
"field": "tag",
"value": "Wian",
I can extract the entire filter section using:
for i in range(0, values_num):
    dedata[i]['filter']
But if I try to filter beyond that point I just get errors.
Could someone please assist me with this?
Here is the JSON output style:
{
"mod_time": 1594631137499,
"description": "",
"id": 82,
"name": "Wian",
"include_custom_devices": true,
"dynamic": true,
"field": null,
"value": null,
"filter": {
"rules": [
{
"field": "tag",
"operand": {
"value": "Wian",
"is_regex": false
},
"operator": "~"
}
],
"operator": "and"
}
}
You are probably trying to access the data in rules, but since it is an array you have to access it explicitly, e.g. by taking the [0] index.
You could simply use .get('<name>') as shown below:
dedata[i]['filter']['rules'][0].get('field')
Likewise for value:
dedata[i]['filter']['rules'][0]['operand'].get('value')
Alternatively, comment out the for loop and try it without the loop and the [i] index to see if that works.
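For completeness, here is a small self-contained sketch of the same idea. The raw string below is a trimmed stand-in for the response shown in the question, and dedata is assumed to be a list of such objects, which is why the question indexes it with [i]:

import json

# Trimmed stand-in for the API response from the question.
raw = """
[
  {
    "name": "Wian",
    "filter": {
      "rules": [
        {"field": "tag", "operand": {"value": "Wian", "is_regex": false}, "operator": "~"}
      ],
      "operator": "and"
    }
  }
]
"""

dedata = json.loads(raw)

for item in dedata:
    rule = item["filter"]["rules"][0]    # "rules" is a list, hence the [0]
    print(rule.get("field"))             # -> tag
    print(rule["operand"].get("value"))  # -> Wian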
The MediaWiki API is able to find ID for an item with the request URL:
/w/api.php?action=query&format=json&prop=pageprops&titles=skype&formatversion=2&ppprop=wikibase_item
The result is:
{
"batchcomplete": true,
"query": {
"normalized": [
{
"fromencoded": false,
"from": "skype",
"to": "Skype"
}
],
"pages": [
{
"pageid": 424589,
"ns": 0,
"title": "Skype",
"pageprops": {
"wikibase_item": "Q40984"
}
}
]
}
}
However, it does not work well when querying about a property, e.g., developer P178. The result is Q409857 rather than the desired P178:
{
"batchcomplete": true,
"query": {
"normalized": [
{
"fromencoded": false,
"from": "developer",
"to": "Developer"
}
],
"pages": [
{
"pageid": 179684,
"ns": 0,
"title": "Developer",
"pageprops": {
"wikibase_item": "Q409857"
}
}
]
}
}
Is there any way to get the ID for an entity which could be an item, a property or even a lexeme?
You could use the search API on Wikidata.
For example, to find properties with the name "developer" inside, use
https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=developer&srnamespace=120
120 is the property namespace. To find a lexeme, use srnamespace=146.
Note that this API guesses your language and adapts the results accordingly. If you are not in an English-speaking locale, the above example may therefore fail.
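For reference, a sketch of the same call from Python with the requests library, using exactly the parameters from the URL above; the response shape (query.search with a title per hit) is the standard list=search format:

import requests

# Same parameters as the URL above; srnamespace=120 restricts results to properties.
params = {
    "action": "query",
    "list": "search",
    "srsearch": "developer",
    "srnamespace": 120,
    "format": "json",
}
resp = requests.get("https://www.wikidata.org/w/api.php", params=params)
resp.raise_for_status()

for hit in resp.json()["query"]["search"]:
    print(hit["title"])  # titles in namespace 120 look like "Property:P178"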
I have some code to query specific strings in a field message as below:
"message": "Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html"
Here is my code:
from elasticsearch import Elasticsearch
import json
client = Elasticsearch(['http://192.168.1.114:9200'])
response = client.search(
    index="squidlog-2017.10.29",
    body={
        "query": {
            "match": {
                "message": 'GET'
            }
        }
    }
)

for hit in response['hits']['hits']:
    print(json.dumps(hit['_source'], indent=4, sort_keys=True))
When I query for a specific string such as GET with the template above, everything is OK. But when I want to query for something from the URL in the message, I don't get anything back, as with the following query:
body={
    "query": {
        "match": {
            "message": 'pravda'
        }
    }
}
Is there a problem with the slashes in my message when I query? Any advice would be appreciated. Thanks.
You might consider using a different tokenizer, which will make the desired search possible. But first let me explain why your query does not return a result in the second case.
standard analyzer and tokenizer
By default, the standard analyzer uses the standard tokenizer, which keeps the domain name as a single token rather than splitting it on the dots. You can try different analyzers and tokenizers with the _analyze endpoint, like this:
GET _analyze
{
"text": "Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html"
}
The response is the list of tokens that Elasticsearch will use to represent this string when searching. Here it is:
{
"tokens": [
{
"token": "oct",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
}, ...
{
"token": "http",
"start_offset": 59,
"end_offset": 63,
"type": "<ALPHANUM>",
"position": 11
},
{
"token": "www.pravda.ru",
"start_offset": 66,
"end_offset": 79,
"type": "<ALPHANUM>",
"position": 12
},
{
"token": "science",
"start_offset": 80,
"end_offset": 87,
"type": "<ALPHANUM>",
"position": 13
}, ...
]
}
As you can see, "pravda" is not in the list of tokens, hence you cannot search for it. You can only search for the tokens that your analyzer emits.
Note that "pravda" is part of the domain name, which is a analyzed as a separate token: "www.pravda.ru".
lowercase tokenizer
If you use a different tokenizer, for instance the lowercase tokenizer, it will emit pravda as a token and it will be possible to search for it:
GET _analyze
{
"tokenizer" : "lowercase",
"text": "Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html"
}
And the list of tokens:
{
"tokens": [
{
"token": "oct",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
}, ...
{
"token": "http",
"start_offset": 59,
"end_offset": 63,
"type": "word",
"position": 4
},
{
"token": "www",
"start_offset": 66,
"end_offset": 69,
"type": "word",
"position": 5
},
{
"token": "pravda",
"start_offset": 70,
"end_offset": 76,
"type": "word",
"position": 6
},
{
"token": "ru",
"start_offset": 77,
"end_offset": 79,
"type": "word",
"position": 7
},
{
"token": "science",
"start_offset": 80,
"end_offset": 87,
"type": "word",
"position": 8
}, ...
]
}
How to define analyzer before indexing?
To be able to search for such tokens, you have to analyze the field differently at index time. That means defining a different mapping with a different analyzer, as in this example:
PUT yet_another_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "lowercase"
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"message": {
"type": "text",
"fields": {
"lowercased": {
"type": "text",
"analyzer": "my_custom_analyzer"
}
}
}
}
}
}
}
Here we first define a custom analyzer with the desired tokenizer, and then tell Elasticsearch to index our message field twice via the fields feature: implicitly with the default analyzer, and explicitly with my_custom_analyzer.
Now we can query for the desired token. A request against the original field returns no hits:
POST yet_another_index/my_type/_search
{
"query": {
"match": {
"message": "pravda"
}
}
}
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
But the query to the message.lowercased will succeed:
POST yet_another_index/my_type/_search
{
"query": {
"match": {
"message.lowercased": "pravda"
}
}
}
"hits": {
"total": 1,
"max_score": 0.25316024,
"hits": [
{
"_index": "yet_another_index",
"_type": "my_type",
"_id": "AV9u1qZmB9pi5Gaw0rj1",
"_score": 0.25316024,
"_source": {
"message": "Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html"
}
}
]
}
There are plenty of options; this solution addresses the example you provided. Check out the different analyzers and tokenizers to find the one that suits you best.
Hope that helps!
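To tie this together, here is a sketch of the same flow from the Python client used in the question. It assumes the yet_another_index mapping above has already been created; the doc_type argument matches the older Elasticsearch versions this example targets and is dropped in recent releases:

from elasticsearch import Elasticsearch

client = Elasticsearch(['http://192.168.1.114:9200'])

log_line = ("Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET "
            "http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html")

# Index one document, refresh so it becomes searchable, then query the sub-field.
client.index(index="yet_another_index", doc_type="my_type", body={"message": log_line})
client.indices.refresh(index="yet_another_index")

hits = client.search(
    index="yet_another_index",
    body={"query": {"match": {"message.lowercased": "pravda"}}}
)["hits"]["hits"]
print(len(hits))  # expect 1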
I needed partial search on my website. Initially I used EdgeNgramField directly and it didn't work as expected, so I switched to a custom search engine with custom analyzers. I am using Django-haystack.
'settings': {
"analysis": {
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "lowercase",
"filter": ["haystack_ngram"]
},
"edgengram_analyzer": {
"type": "custom",
"tokenizer": "lowercase",
"filter": ["haystack_edgengram"]
},
"suggest_analyzer": {
"type":"custom",
"tokenizer":"standard",
"filter":[
"standard",
"lowercase",
"asciifolding"
]
},
},
"tokenizer": {
"haystack_ngram_tokenizer": {
"type": "nGram",
"min_gram": 3,
"max_gram": 15,
},
"haystack_edgengram_tokenizer": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 15,
"side": "front"
}
},
"filter": {
"haystack_ngram": {
"type": "nGram",
"min_gram": 3,
"max_gram": 15
},
"haystack_edgengram": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 15
}
}
}
}
I used edgengram_analyzer for indexing and suggest_analyzer for searching. This worked to some extent, but it doesn't work for numbers: for example, when 30 is entered it doesn't find 303, and the same goes for words that mix letters and numbers. So I searched various sites.
They suggested using the standard or whitespace tokenizer together with the haystack_edgengram filter. But it didn't work at all; numbers aside, partial search no longer worked even for plain letters. The settings after the suggestion:
'settings': {
"analysis": {
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "lowercase",
"filter": ["haystack_ngram"]
},
"edgengram_analyzer": {
"type": "custom",
"tokenizer": "whitepsace",
"filter": ["haystack_edgengram"]
},
"suggest_analyzer": {
"type":"custom",
"tokenizer":"standard",
"filter":[
"standard",
"lowercase",
"asciifolding"
]
},
},
"filter": {
"haystack_ngram": {
"type": "nGram",
"min_gram": 3,
"max_gram": 15
},
"haystack_edgengram": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 15
}
}
}
}
Does anything other than the lowercase tokenizer work with django-haystack, or is the haystack_edgengram filter just not working for me? To my knowledge it should work like this: given 2 Lazy Dog as the supplied text, the whitespace tokenizer should produce the tokens [2, Lazy, Dog], and applying the haystack_edgengram filter should then generate [2, la, laz, lazy, do, dog]. It is not working like this. Did I do something wrong?
My requirement is, for example, that for the text 2 Lazy Dog, a search for 2 Laz should work.
Edit:
My assumption is that the lowercase tokenizer works properly, but for the text above it omits the 2 and creates the tokens [lazy, dog]. Why don't the standard or whitespace tokenizers work?
In an ngram filter you define min_gram, which is the minimum length of the created tokens. In your case '2' has length 1, so it is ignored by the ngram filters.
The easiest way to fix this is to change min_gram to 1. A slightly more complicated way is to combine a standard analyzer for matching whole keywords (useful for shorter terms) with an ngram analyzer for partial matching (for longer terms), perhaps via a bool query; see the sketch below.
You can also keep the ngrams starting from 1 character but require at least 3 letters in your search box before sending the query to Elasticsearch.
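The combined approach mentioned above could look roughly like the following bool query, which prefers whole-word matches but still falls back to edge-ngram partial matches. This is only a sketch: the field names text and text.edgengram and the index name are assumptions about how such a multi-field might be mapped, not part of the question.

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host
term = "2 Laz"

query = {
    "query": {
        "bool": {
            "should": [
                # hypothetical sub-fields: "text" uses a standard analyzer,
                # "text.edgengram" uses the edge-ngram analyzer
                {"match": {"text": {"query": term, "boost": 2}}},
                {"match": {"text.edgengram": {"query": term}}}
            ],
            "minimum_should_match": 1
        }
    }
}
results = es.search(index="haystack", body=query)  # index name is an assumption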
I found the answer myself, with help from #jgr's suggestion:
ELASTICSEARCH_INDEX_SETTINGS = {
"settings": {
"analysis": {
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["haystack_ngram"]
},
"edgengram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["haystack_edgengram","lowercase"]
},
"suggest_analyzer": {
"type":"custom",
"tokenizer":"standard",
"filter":[
"lowercase"
]
}
},
"filter": {
"haystack_ngram": {
"type": "nGram",
"min_gram": 1,
"max_gram": 15
},
"haystack_edgengram": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 15
}
}
}
}
}
ELASTICSEARCH_DEFAULT_ANALYZER = "suggest_analyzer"
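With min_gram set to 1 and the standard tokenizer, the edge-ngram analyzer should now also emit tokens for the numeric term. A quick way to verify this is the _analyze API; a sketch, assuming the index that django-haystack created is called haystack:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

tokens = es.indices.analyze(
    index="haystack",  # whatever index name django-haystack created
    body={"analyzer": "edgengram_analyzer", "text": "2 Lazy Dog"}
)
print([t["token"] for t in tokens["tokens"]])
# expected something like: ['2', 'l', 'la', 'laz', 'lazy', 'd', 'do', 'dog']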