Analyzer to ignore accents and plural/singular in Elasticsearch - python

I am working on ignoring accents and plural/singular forms when I make a search query. I copied the Spanish analyzer from https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html and kept only the stemmer.
You can check my code in Python (I bulk-load the data from a CSV later):
settings={
"settings": {
"analysis": {
"filter": {
"spanish_stemmer": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"rebuilt_spanish": {
"tokenizer": "standard",
"filter": [
"lowercase",
"spanish_stemmer"
]
}
}
}
}
}
es.indices.create(index="activities", body=settings)
However, when I run a GET query from Insomnia for geometrico, geométrico, geométricos, or geometricos, I get 0 results, even though there is a document with the title Cuerpos geométricos. It should match, since I want accents and plural/singular forms to make no difference. Any ideas?
The GET query I do:
{
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "geométricos",
"fields": [
"Descripcion",
"Nombre",
"Tags"
],
"analyzer":"rebuilt_spanish"
}
}
}
}
}

You will need to add the ASCII folding token filter to your token filters (see the official documentation). Your analyzer should look like this:
Analyzer:
"analysis": {
"filter": {
"spanish_stemmer": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"rebuilt_spanish": {
"tokenizer": "standard",
"filter": [
"asciifolding", // ASCII folding token filter
"lowercase",
"spanish_stemmer"
]
}
}
}
}
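As a rough Python sketch of the same idea, the settings from the question can be rebuilt with asciifolding added. One thing the snippets above do not show is a mapping: the analyzer given in multi_match only applies to the query text, so this sketch assumes the three queried fields are also mapped as text with rebuilt_spanish, so that the indexed tokens are folded and stemmed the same way. The helper name build_settings and the typeless Elasticsearch 7 mapping format are assumptions, not part of the original post.

```python
import json


def build_settings(field_names):
    """Build index settings plus a mapping that applies the analyzer (hypothetical helper)."""
    analysis = {
        "filter": {
            "spanish_stemmer": {"type": "stemmer", "language": "light_spanish"}
        },
        "analyzer": {
            "rebuilt_spanish": {
                "tokenizer": "standard",
                # asciifolding turns "é" into "e"; the stemmer then handles plurals
                "filter": ["asciifolding", "lowercase", "spanish_stemmer"],
            }
        },
    }
    # Assumption: every searched field is a text field using the custom analyzer.
    properties = {
        name: {"type": "text", "analyzer": "rebuilt_spanish"} for name in field_names
    }
    return {"settings": {"analysis": analysis}, "mappings": {"properties": properties}}


settings = build_settings(["Descripcion", "Nombre", "Tags"])
print(json.dumps(settings, indent=2))
# With a client, as in the question:
# es.indices.create(index="activities", body=settings)
```

An existing index would need to be recreated and the data bulk-loaded again for the new analyzer to take effect on stored documents.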

Related

Elasticsearch: highlight on query terms, not filter terms?

Say I have this:
search_object = {
'query': {
'bool' : {
'must' : {
'simple_query_string' : {
'query': search_text,
'fields': [ 'french_no_accents', 'def_no_accents', ],
},
},
'filter' : [
{ 'term' : { 'def_no_accents' : 'court', }, },
{ 'term' : { 'def_no_accents' : 'bridge', }, },
],
},
},
'highlight': {
'encoder': 'html',
'fields': {
'french_no_accents': {},
'def_no_accents': {},
},
'number_of_fragments' : 0,
},
}
Whatever search string I enter as search_text, its constituent terms are highlighted, but so are "court" and "bridge". I don't want "court" or "bridge" to be highlighted.
I've tried putting the "highlight" key-value in a different spot in the structure... nothing seems to work (i.e. syntax exception thrown).
More generally, is there a formal grammar anywhere specifying what you can and can't do with ES (v7) queries?
You could add a highlight query to limit what should and shouldn't get highlighted:
{
"query": {
"bool": {
"must": {
"simple_query_string": {
"query": "abc",
"fields": [
"french_no_accents",
"def_no_accents"
]
}
},
"filter": [
{ "term": { "def_no_accents": "court" } },
{ "term": { "def_no_accents": "bridge" } }
]
}
},
"highlight": {
"encoder": "html",
"fields": {
"*_no_accents": { <--
"highlight_query": {
"simple_query_string": {
"query": "abc",
"fields": [ "french_no_accents", "def_no_accents" ]
}
}
}
},
"number_of_fragments": 0
}
}
I've used a wildcard for the two fields (*_no_accents). If that matches unwanted fields too, you'll need to duplicate the highlight query on two separate, non-wildcard highlight fields like you originally had, though I can't think of a scenario where that would happen, since your simple_query_string query targets two concrete fields.
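For the Python dictionary from the question, the non-wildcard variant could be sketched as below. The function name build_search and its parameters are hypothetical; the point is only that the same simple_query_string is reused as highlight_query on each field, so the filter terms are left out of the highlighting.

```python
import copy


def build_search(search_text, fields, filter_terms):
    """Build a search body whose highlighting only follows the text query (hypothetical helper)."""
    text_query = {
        "simple_query_string": {"query": search_text, "fields": list(fields)}
    }
    return {
        "query": {
            "bool": {
                "must": text_query,
                # Filter terms restrict the results but should not be highlighted.
                "filter": [{"term": {field: value}} for field, value in filter_terms],
            }
        },
        "highlight": {
            "encoder": "html",
            "number_of_fragments": 0,
            # One highlight_query per concrete field instead of a wildcard.
            "fields": {
                field: {"highlight_query": copy.deepcopy(text_query)}
                for field in fields
            },
        },
    }


search_object = build_search(
    "abc",
    ["french_no_accents", "def_no_accents"],
    [("def_no_accents", "court"), ("def_no_accents", "bridge")],
)
```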
As to:
More generally, is there a formal grammar anywhere specifying what you can and can't do with ES (v7) queries?
what exactly are you looking for?

ElasticSearch: Retrieve field and its normalization

I want to retrieve a field as well as its normalized version from Elasticsearch.
Here's my index definition and data
PUT normalizersample
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1,
"refresh_interval": "60s",
"analysis": {
"normalizer": {
"my_normalizer": {
"filter": [
"lowercase",
"german_normalization",
"asciifolding"
],
"type": "custom"
}
}
}
},
"mappings": {
"_source": {
"enabled": true
},
"properties": {
"myField": {
"type": "text",
"store": true,
"fields": {
"keyword": {
"type": "keyword",
"store": true
},
"normalized": {
"type": "keyword",
"store": true,
"normalizer": "my_normalizer"
}
}
}
}
}
}
POST normalizersample/_doc/1
{
"myField": ["Andreas", "Ämdreas", "Anders"]
}
My first approach was to use scripted fields like
GET /normalizersample/_search
{
"size": 100,
"query": {
"match_all": {}
},
"script_fields": {
"keyword": {
"script": "doc['myField.keyword']"
},
"normalized": {
"script": "doc['myField.normalized']"
}
}
}
However, since myField is an array, this returns two lists of strings per ES document, and each list is sorted alphabetically. Because normalization changes the sort order, the entries at the same position in the two lists do not necessarily correspond to each other.
"hits" : [
{
"_index" : "normalizersample",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"fields" : {
"de" : [
"amdreas",
"anders",
"andreas"
],
"keyword" : [
"Anders",
"Andreas",
"Ämdreas"
]
}
}
]
Instead, I would like to retrieve [(Andreas, andreas), (Ämdreas, amdreas), (Anders, anders)] or a similar format where I can match every entry to its normalization.
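The mismatch can be reproduced outside Elasticsearch with a small Python sketch. The normalize function here is only a rough stand-in for my_normalizer (lowercasing plus stripping diacritics); it is an assumption for illustration and does not reproduce everything german_normalization does.

```python
import unicodedata


def normalize(value):
    """Rough client-side stand-in for my_normalizer: lowercase and strip diacritics."""
    decomposed = unicodedata.normalize("NFKD", value.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


source_values = ["Andreas", "Ämdreas", "Anders"]

# Each list is sorted independently, which is what breaks the correspondence.
keyword_values = sorted(source_values)
normalized_values = sorted(normalize(v) for v in source_values)
misaligned = list(zip(keyword_values, normalized_values))

# Pairing each value with its own normalization keeps the entries aligned.
aligned = [(v, normalize(v)) for v in source_values]

print(misaligned)
print(aligned)
```

Zipping the two sorted lists pairs "Anders" with "amdreas", while pairing each original value with its own normalization gives the desired tuples.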
The only way I found was to call Term Vectors on both fields since they contain a position field, but this seems like a huge overhead to me. (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html)
Is there a simpler way to retrieve tuples with the keyword and the normalized field?
Thanks a lot!

Elasticsearch - Boosting an individual term if it appears in the fields

I have the following search query that returns the documents containing the word "apple", "mango" or "strawberry". Now I want to boost the score of a document whenever the word "cake" or "chips" (or both) appears in it. Neither word is required, but whenever one appears in the "title" or "body" field, the score should be boosted so that documents containing "cake" or "chips" are ranked higher.
res = es.search(index='fruits', body={
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "(apple) OR (mango) OR (strawberry)"
}
},
{
"bool": {
"must_not": [{
"match_phrase": {
"body": "Don't match this phrase."
}
}
]
}
}
]
},
"match": {
"query": "(cake) OR (chips)",
"boost": 2
}
}
}
})
Any help would be greatly appreciated!
Just include the values you would want to be boosted in a should clause as shown in the below query:
Query:
POST <your_index_name>/_search
{
"query":{
"bool":{
"must":[
{
"query_string":{
"query":"(apple) OR (mango) OR (strawberry)"
}
},
{
"bool":{
"must_not":[
{
"match_phrase":{
"body":"Don't match this phrase."
}
}
]
}
}
],
"should":[ <----- Add this
{
"query_string":{
"query":"cake OR chips",
"fields": ["title","body"], <----- Specify fields
"boost":10 <----- Boost Field
}
}
]
}
}
}
Alternately, you can push your must_not clause to a level above in the query.
Updated Query:
POST <your_index_name>/_search
{
"query":{
"bool":{
"must":[
{
"query_string":{
"query":"(apple) OR (mango) OR (strawberry)"
}
}
],
"should":[
{
"query_string":{
"query":"cake OR chips",
"fields": ["title","body"],
"boost":10
}
}
],
"must_not":[ <----- Note this
{
"match_phrase":{
"body":"Don't match this phrase."
}
}
]
}
}
}
Basically, should acts as a logical OR, while must acts as a logical AND.
Documents that also match the should clause get a higher relevance score and move up in the results, while documents that match only the must clause come back with a lower score.
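Since the question builds the request in Python, the updated query could be assembled roughly like this. The helper build_boosted_query and its parameters are hypothetical, and the es.search call is left commented out because it needs a running cluster.

```python
def build_boosted_query(required, boosted, fields, excluded_phrase, boost=10):
    """Build a bool query: must for required terms, should for boosted ones (hypothetical helper)."""
    return {
        "query": {
            "bool": {
                # At least one of the required terms has to match.
                "must": [
                    {"query_string": {"query": " OR ".join(f"({t})" for t in required)}}
                ],
                # Optional terms: matching them only raises the score.
                "should": [
                    {
                        "query_string": {
                            "query": " OR ".join(boosted),
                            "fields": list(fields),
                            "boost": boost,
                        }
                    }
                ],
                "must_not": [{"match_phrase": {"body": excluded_phrase}}],
            }
        }
    }


body = build_boosted_query(
    required=["apple", "mango", "strawberry"],
    boosted=["cake", "chips"],
    fields=["title", "body"],
    excluded_phrase="Don't match this phrase.",
)
# res = es.search(index='fruits', body=body)
```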
Hope this helps!

OR in an Elasticsearch filter

I want to query my Elasticsearch index (using a Python library) and filter out some of the documents. Since I don't need a score, I'm using only the filter and must_not keywords:
{
"_source": ["entities"],
"query": {
"bool": {
"must_not": [
{"exists": {"field": "retweeted_status"}}
],
"filter": [
{"match": {"entities.urls.display_url": "blabla.com"}},
{"match": {"entities.urls.display_url": "blibli.com"}}]
}
}
}
This is the query I wrote, but the problem is that the clauses inside the same filter are apparently combined with an AND. I would like them to be combined with an OR. How can I change my query to get all the documents that contain "blibli.com" OR "blabla.com"?
You can nest a bool inside another bool, so you can write the query like this:
{
"query": {
"bool": {
"must_not": [
{
"exists": {
"field": "retweeted_status"
}
}
],
"filter": [
{
"bool": {
"should": [
{
"match": {
"entities.urls.display_url": "blabla.com"
}
},
{
"match": {
"entities.urls.display_url": "blibli.com"
}
}
]
}
}
]
}
}
}
Tested on ES 5.3; you can use the Explain API to check whether this also works in your version of Elasticsearch.
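A minimal Python sketch of the nested bool, for use with a Python client, might look like this. The function name build_url_filter is hypothetical, and minimum_should_match is set explicitly here as an assumption to make the "at least one URL" intent visible; it is not part of the query above.

```python
def build_url_filter(display_urls):
    """Build a non-scoring query matching any of the given URLs (hypothetical helper)."""
    return {
        "_source": ["entities"],
        "query": {
            "bool": {
                "must_not": [{"exists": {"field": "retweeted_status"}}],
                "filter": [
                    {
                        "bool": {
                            # should clauses inside one bool are ORed together
                            "should": [
                                {"match": {"entities.urls.display_url": url}}
                                for url in display_urls
                            ],
                            "minimum_should_match": 1,
                        }
                    }
                ],
            }
        },
    }


query_body = build_url_filter(["blabla.com", "blibli.com"])
```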

Elasticsearch match multiple fields

I have recently started using Elasticsearch in a website. The scenario is that I have to search for a string in a field. So, if the field is named title, then my search query was:
"query" :{"match": {"title": my_query_string}}.
But now I need to add another field to it, let's say category. So I need to find the documents that have category: some_category and title: my_query_string. I tried multi_match, but it does not give me the result I am looking for. I am looking into query filters now, but is there a way of adding two fields with such criteria to my match query?
GET indice/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"title": "title"
}
},
{
"match": {
"category": "category"
}
}
]
}
}
}
Replace should with must if desired.
Ok, so I think that what you need is something like this:
"query": {
"filtered": {
"query": {
"match": {
"title": YOUR_QUERY_STRING
}
},
"filter": {
"term": {
"category": YOUR_CATEGORY
}
}
}
}
If your category field is analyzed, then you will need to use match instead of term in the filter.
"query": {
"filtered": {
"query": {
"bool": {
"should": [
{"match": {"title": "bold title"}},
{"match": {"body": "nice body"}}
]
}
},
"filter": {
"term": {
"category": "xxx"
}
}
}
}
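Note that the filtered query used in the two snippets above was deprecated in Elasticsearch 2.0 and removed in 5.0. On newer versions, the same intent can be sketched with a bool query that has a must clause for the scored title match and a filter clause for the category. The helper name build_title_in_category is hypothetical, and the term filter assumes category is not analyzed, as noted above.

```python
def build_title_in_category(query_string, category):
    """Build a bool query: scored title match, non-scoring category filter (hypothetical helper)."""
    return {
        "query": {
            "bool": {
                # Contributes to the relevance score.
                "must": [{"match": {"title": query_string}}],
                # Only restricts the result set; use match instead of term
                # if the category field is analyzed.
                "filter": [{"term": {"category": category}}],
            }
        }
    }


search_body = build_title_in_category("bold title", "xxx")
```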
