I have some code that queries for specific strings in a message field that looks like this:
"message": "Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html"
Here is my code:
from elasticsearch import Elasticsearch
import json

client = Elasticsearch(['http://192.168.1.114:9200'])

response = client.search(
    index="squidlog-2017.10.29",
    body={
        "query": {
            "match": {
                "message": "GET"
            }
        }
    }
)

for hit in response['hits']['hits']:
    print(json.dumps(hit['_source'], indent=4, sort_keys=True))
When I query for a specific string such as GET with the template above, everything is OK. But when I query for something from the URL in the message, I don't get anything back, for example with the following query:
body={
    "query": {
        "match": {
            "message": "pravda"
        }
    }
}
Is there a problem with the slashes in my message when I query? Any advice would be appreciated. Thanks.
You might consider using a different tokenizer, which will make the desired search possible. But first, let me explain why your query returns no results in the second case.
standard analyzer and tokenizer
By default, the standard analyzer uses the standard tokenizer, which keeps the domain name as a single token instead of splitting it on dots. You can try different analyzers and tokenizers with the _analyze endpoint, like this:
GET _analyze
{
"text": "Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html"
}
The response is the list of tokens that Elasticsearch will use to represent this string when searching. Here it is:
{
"tokens": [
{
"token": "oct",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
}, ...
{
"token": "http",
"start_offset": 59,
"end_offset": 63,
"type": "<ALPHANUM>",
"position": 11
},
{
"token": "www.pravda.ru",
"start_offset": 66,
"end_offset": 79,
"type": "<ALPHANUM>",
"position": 12
},
{
"token": "science",
"start_offset": 80,
"end_offset": 87,
"type": "<ALPHANUM>",
"position": 13
}, ...
]
}
As you can see, "pravda" is not in the list of tokens, hence you cannot search for it. You can only search for the tokens that your analyzer emits.
Note that "pravda" is part of the domain name, which is analyzed as a single token: "www.pravda.ru".
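For illustration, a match query for the whole emitted token does find the document. A minimal sketch reusing the client and index from the question:

from elasticsearch import Elasticsearch
import json

client = Elasticsearch(['http://192.168.1.114:9200'])

# "www.pravda.ru" is exactly one of the tokens the standard analyzer emitted,
# so this query should return the log line, while a query for "pravda" does not.
response = client.search(
    index="squidlog-2017.10.29",
    body={
        "query": {
            "match": {
                "message": "www.pravda.ru"
            }
        }
    }
)

for hit in response['hits']['hits']:
    print(json.dumps(hit['_source'], indent=4, sort_keys=True))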
lowercase tokenizer
If you use a different tokenizer, for instance the lowercase tokenizer, it will emit pravda as a token and it will be possible to search for it:
GET _analyze
{
"tokenizer" : "lowercase",
"text": "Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html"
}
And the list of tokens:
{
"tokens": [
{
"token": "oct",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
}, ...
{
"token": "http",
"start_offset": 59,
"end_offset": 63,
"type": "word",
"position": 4
},
{
"token": "www",
"start_offset": 66,
"end_offset": 69,
"type": "word",
"position": 5
},
{
"token": "pravda",
"start_offset": 70,
"end_offset": 76,
"type": "word",
"position": 6
},
{
"token": "ru",
"start_offset": 77,
"end_offset": 79,
"type": "word",
"position": 7
},
{
"token": "science",
"start_offset": 80,
"end_offset": 87,
"type": "word",
"position": 8
}, ...
]
}
How to define analyzer before indexing?
To be able to search for such tokens, you have to analyze the field differently at index time, which means defining a mapping with a different analyzer. For example:
PUT yet_another_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "lowercase"
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"message": {
"type": "text",
"fields": {
"lowercased": {
"type": "text",
"analyzer": "my_custom_analyzer"
}
}
}
}
}
}
}
Here, we first define a custom analyzer with the desired tokenizer, and then tell Elasticsearch to index our message field twice via the fields feature: implicitly with the default analyzer, and explicitly with my_custom_analyzer.
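Since the question uses the Python client, the same setup can also be done from Python. A minimal sketch (the index name, type name and node address are simply the ones from the examples above; a mapping with a type name like this assumes an Elasticsearch version before 7.x):

from elasticsearch import Elasticsearch

client = Elasticsearch(['http://192.168.1.114:9200'])

# Create the index with the custom analyzer and the multi-field mapping above.
client.indices.create(
    index="yet_another_index",
    body={
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_custom_analyzer": {
                        "type": "custom",
                        "tokenizer": "lowercase"
                    }
                }
            }
        },
        "mappings": {
            "my_type": {
                "properties": {
                    "message": {
                        "type": "text",
                        "fields": {
                            "lowercased": {
                                "type": "text",
                                "analyzer": "my_custom_analyzer"
                            }
                        }
                    }
                }
            }
        }
    }
)

# Index one of the squid log lines so there is something to search.
client.index(
    index="yet_another_index",
    doc_type="my_type",
    body={
        "message": "Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET "
                   "http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html"
    },
    refresh=True
)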
Now we are able to query for the desired token. A request against the original field returns no hits:
POST yet_another_index/my_type/_search
{
"query": {
"match": {
"message": "pravda"
}
}
}
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
But a query against message.lowercased will succeed:
POST yet_another_index/my_type/_search
{
"query": {
"match": {
"message.lowercased": "pravda"
}
}
}
"hits": {
"total": 1,
"max_score": 0.25316024,
"hits": [
{
"_index": "yet_another_index",
"_type": "my_type",
"_id": "AV9u1qZmB9pi5Gaw0rj1",
"_score": 0.25316024,
"_source": {
"message": "Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html"
}
}
]
}
There are plenty of options; this solution covers the example you provided. Check out the different analyzers and tokenizers to find the one that suits you best.
Hope that helps!
Related
I have this document in MongoDB:
{
"_id": {
"$oid": "62644af0368cb0a46d7c2a95"
},
"insertionData": "23/04/2022 19:50:50",
"ipfsMetadata": {
"Name": "data.json",
"Hash": "Qmb3FWgyJHzJA7WCBX1phgkV93GiEQ9UDWUYffDqUCbe7E",
"Size": "431"
},
"metadata": {
"sessionDate": "20220415 17:42:55",
"dataSender": "user345",
"data": {
"height": "180",
"weight": "80"
},
"addtionalInformation": [
{
"name": "poolsize",
"value": "30m"
},
{
"name": "swimStyle",
"value": "mariposa"
},
{
"name": "modality",
"value": "swim"
},
{
"name": "gender-title",
"value": "schoolA"
}
]
},
"fileId": {
"$numberLong": "4"
}
}
I want to update a document inside the nested array, for instance the entry whose name is gender-title. It currently has the value schoolA and I want to change it to adultos, as in the body below. I pass the fileId number as a parameter in the request:
Request: localhost/sessionUpdate/4
Body:
{
"name": "gender-title",
"value": "adultos"
}
Flask code:

@app.route('/sessionUpdate/<string:a>', methods=['PUT'])
def sessionUpdate(a):
    datas = request.json
    r = str(datas['name'])
    r2 = str(datas['value'])
    print(r, r2)
    r3 = collection.update_one({'fileId': a, 'metadata.addtionalInformation': r},
                               {'$set': {'metadata.addtionalInformation.$.value': r2}})
    return str(r3), 200
I'm getting the 200, but the document doesn't update with the new value.
Since you are using the positional operator $ to work with your array, make sure your selection query targets an array element. You can see in the query below that it targets the metadata.addtionalInformation array with the condition name: "gender-title".
db.collection.update({
"fileId": 4,
"metadata.addtionalInformation.name": "gender-title"
},
{
"$set": {
"metadata.addtionalInformation.$.value": "junior"
}
})
Here is the Mongo playground for your reference.
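Applied back to the Flask handler from the question, the fix might look roughly like this; a sketch only, assuming fileId is stored as a number (so the route parameter needs converting to int) and that collection is the same PyMongo collection used above:

@app.route('/sessionUpdate/<string:a>', methods=['PUT'])
def sessionUpdate(a):
    datas = request.json
    name = str(datas['name'])
    value = str(datas['value'])

    # Select the array element by its "name" so the positional operator $
    # knows which entry of metadata.addtionalInformation to update.
    result = collection.update_one(
        {'fileId': int(a), 'metadata.addtionalInformation.name': name},
        {'$set': {'metadata.addtionalInformation.$.value': value}}
    )
    return str(result.modified_count), 200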
I have 10k+ records in Elasticsearch. One of the fields (dept) holds data in the form of an array, e.g. the records are:
{
"username": "tom",
"dept": [
"cust_service",
"sales_rpr",
"store_in",
],
"location": "NY"
}
{
"username": "adam",
"dept": [
"cust_opr",
"floor_in",
"mg_cust_opr",
],
"location": "MA"
}
.
.
.
I want to do autocomplete on the dept field; if a user searches for cus it should return
["cust_service", "cust_opr", "mg_cust_opr"]
with the best match at the top.
I have made this query:
query = {
"_source": [],
"size": 0,
"min_score": 0.5,
"query": {
"bool": {
"must": [
{
"wildcard": {
"dept": {
"value": "*cus*"
}
}
}
],
"filter": [],
"should": [],
"must_not": []
}
},
"aggs": {
"auto_complete": {
"terms": {
"field": f"dept.raw",
"size": 20,
"order": {"max_score": 'desc'}
},
"aggs": {
"max_score": {
"avg": {"script": "_score"}
}
}
}
}
}
It is not giving ["cust_service", "cust_opr", "mg_cust_opr"]; instead it gives other results that are irrelevant to the search key (cus). But when the field is just a string instead of an array, it gives the expected result.
How do I solve this problem?
Thanks in advance!
Partial search is not working on multiple fields.
Data: "Sales inquiries generated".
{
"query_string": {
"fields": ["name", "title", "description", "subject"],
"query": search_data+"*"
}
}
Case 1: When I pass the search data as "inquiri" it works fine, but when I pass "inquirie" it does not work.
Case 2: When I pass the search data as "sale" it works fine, but when I pass "sales" it does not work.
Case 3: When I pass the search data as "generat" it works fine, but when I pass "generate" it does not work.
I defined my fields this way:
from elasticsearch_dsl import analyzer, Keyword, Text

text_analyzer = analyzer("text_analyzer", tokenizer="standard", filter=["lowercase", "stop", "snowball"])

name = Text(analyzer=text_analyzer, fields={"raw": Keyword()})
title = Text(analyzer=text_analyzer, fields={"raw": Keyword()})
subject = Text(analyzer=text_analyzer, fields={"raw": Keyword()})
What is the issue in my code? Any help would be much appreciated!
Thanks in advance.
This is happening due to the use of the snowball token filter, which stems the words; please refer to the official snowball documentation for more info.
I created the same analyzer with your settings to see the tokens generated for your text, since in the end a search matches only when the index tokens match the search-term tokens.
Elasticsearch provides nice REST APIs, so you can easily reproduce the issue.
Create the index with your settings:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"snowball",
"stop"
]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Once the index is created, you can use the _analyze API to see the tokens generated for your text:
POST http://{{hostname}}:{{port}}/{{index-name}}/_analyze
{
"analyzer": "my_analyzer",
"text": "Sales inquiries generated"
}
{
"tokens": [
{
"token": "sale",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "inquiri",
"start_offset": 6,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "generat",
"start_offset": 16,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 2
}
]
}
You can see that all the generated tokens are the stemmed forms ("sale", "inquiri", "generat"). A search only returns results when the search term matches one of these stored tokens, which is why the stemmed variants work while the full words do not. If you need the exact, unstemmed words to match, query the raw keyword sub-field of your text fields instead of the analyzed part.
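As a further illustration (an assumption on my part, not something stated in the answer above): a plain match query analyzes the search text with the field's analyzer, so "sales", "inquiries" or "generated" get stemmed to the stored tokens and match again, whereas a trailing-wildcard query_string term is typically not analyzed that way:

from elasticsearch import Elasticsearch

client = Elasticsearch()  # assumed local cluster; adjust the host as needed

# "sales" is stemmed to "sale" by my_analyzer at search time,
# so it matches the stored token "sale".
response = client.search(
    index="your-index",  # hypothetical index name
    body={
        "query": {
            "match": {
                "name": "sales"
            }
        }
    }
)
print(response['hits']['total'])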
I have to search for keywords in one field and an exact match in a different field. I have tried something but it does not seem to work at all.
I tried giving the full article text along with the author, exactly as I put them in AWS Elasticsearch, but it still won't retrieve anything.
import json
import requests

query = json.dumps({
    "query": {
        "bool": {
            "must": {
                "match": {
                    "article": "man killed kim jones"
                }
            },
            "filter": {
                "term": {
                    "author": "Barbara Boyer"
                }
            }
        }
    }
})

response = requests.get("<url-ES-domain>/data/_search", headers=headers, data=query)
response.json()
Mapping details
{
"mappings": {
"article": {
"full_name": "article",
"mapping": {
"article": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
This is the keyword search on the article field. Even if I give the full article text exactly as it is in the ES index, it still won't return any hits.
Try it like this:
Mappings
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
},
"normalizer": {
"lc_normalizer": {
"type": "custom",
"filter": ["lowercase"]
}
}
},
"max_ngram_diff": 20
},
"mappings": {
"properties": {
"article": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"author": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256,
"normalizer": "lc_normalizer"
}
}
},
"key": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"publication": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"title": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
Query
{
"query": {
"bool": {
"must" : {
"multi_match" : {
"query": "man killed kim jones",
"fields": [ "article", "title" ]
}
},
"filter": {
"term": {
"author.keyword": "Maddie Hanna"
}
}
}
}
}
The above query matches and returns the document you have added to the index.
When you are searching for a multi-word exact match, I suggest you use the match_phrase query. By default, Elasticsearch also creates a keyword sub-field for text fields.
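For example, a match_phrase version of the article clause could look like this (a sketch only, reusing the field and text from the question):

# match_phrase requires the words to appear together and in this order
# in the article field, rather than matching them independently.
query = {
    "query": {
        "match_phrase": {
            "article": "man killed kim jones"
        }
    }
}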
Note: You can try these things using Kibana UI provided by the elastic team. It will save a lot of time.
I use the phonetic analysis plugin for Elasticsearch to do some string matching based on phonetic transformations.
My problem is: how do I get the phonetic transformations computed by Elasticsearch back in the result of a query?
First, I create an index with a metaphone transformation:
request_body = {
'settings': {
'index': {
'analysis': {
'analyzer': {
'metaphone_analyzer': {
'tokenizer':
'standard',
'filter': [
'ascii_folding_filter', 'lowercase',
'metaphone_filter'
]
}
},
'filter': {
'metaphone_filter': {
'type': 'phonetic',
'encoder': 'metaphone',
'replace': False
},
'ascii_folding_filter': {
'type': 'asciifolding',
'preserve_original': True
}
}
}
}
},
'mappings': {
'person_name': {
'properties': {
'full_name': {
'type': 'text',
'fields': {
'metaphone_field': {
'type': 'text',
'analyzer': 'metaphone_analyzer'
}
}
}
}
}
}
}
res = es.indices.create(index="my_index", body=request_body)
Then, I add some data:
# Add some data
names = [{
"full_name": "John Doe"
}, {
"full_name": "Bob Alice"
}, {
"full_name": "Foo Bar"
}]
for name in names:
res = es.index(index="my_index",
doc_type='person_name',
body=name,
refresh=True)
And finally, I query a name:
es.search(index="my_index",
body={
"size": 5,
"query": {
"multi_match": {
"query": "Jon Doe",
"fields": "*_field"
}
}
})
Search returns:
{
'took': 1,
'timed_out': False,
'_shards': {
'total': 5,
'successful': 5,
'skipped': 0,
'failed': 0
},
'hits': {
'total':
1,
'max_score':
0.77749264,
'hits': [{
'_index': 'my_index',
'_type': 'person_name',
'_id': 'AWwYjl4Mqo63y_hLp5Yl',
'_score': 0.77749264,
'_source': {
'full_name': 'John Doe'
}
}]
}
}
In the search results I would like to get the phonetic transformations of the names computed by Elasticsearch (and also of the query string, but that is less important).
I know that I could use the explain API, but I would like to avoid a second request, and moreover the explain API seems a little "overkill" for what I want to achieve.
Thanks!
It doesn't look like an easy thing to implement in a single Elasticsearch query, but you could try the analyze API and scripted fields with fielddata enabled, and term vectors might also come in handy. Here's how.
Retrieve tokens from an arbitrary query
The analyze API is a great tool if you want to understand exactly how Elasticsearch tokenizes your query.
Using your mapping you could do, for example:
GET myindex/_analyze
{
"analyzer": "metaphone_analyzer",
"text": "John Doe"
}
And get something like this as a result:
{
"tokens": [
{
"token": "JN",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "john",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "T",
"start_offset": 5,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "doe",
"start_offset": 5,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
}
]
}
This is technically a different query, but still might be useful.
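If you prefer to stay in the Python client used in the question, the equivalent call might look like this (a sketch; it assumes the my_index index and the metaphone_analyzer defined in the question):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Ask Elasticsearch how the metaphone analyzer tokenizes an arbitrary string.
result = es.indices.analyze(
    index="my_index",
    body={
        "analyzer": "metaphone_analyzer",
        "text": "John Doe"
    }
)
print([t["token"] for t in result["tokens"]])  # e.g. ['JN', 'john', 'T', 'doe']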
Retrieve tokens from a field of a document
In theory, we could try to retrieve the very same tokens that the analyze API returned in the previous section from the documents matched by our query.
In practice Elasticsearch will not store the tokens of a text field it has just analyzed: fielddata is disabled by default. We need to enable it:
PUT /myindex
{
"mappings": {
"person_name": {
"properties": {
"full_name": {
"fields": {
"metaphone_field": {
"type": "text",
"analyzer": "metaphone_analyzer",
"fielddata": true
}
},
"type": "text"
}
}
}
},
"settings": {
...
}
}
Now, we can use scripted fields to ask Elasticsearch to return those tokens.
The query might look like this:
POST myindex/_search
{
"script_fields": {
"my tokens": {
"script": {
"lang": "painless",
"source": "doc[params.field].values",
"params": {
"field": "full_name.metaphone_field"
}
}
}
}
}
And the response would look like this:
{
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "myindex",
"_type": "person_name",
"_id": "123",
"_score": 1,
"fields": {
"my tokens": [
"JN",
"T",
"doe",
"john"
]
}
}
]
}
}
As you can see, the very same tokens (but in random order).
Can we also retrieve information about where these tokens are located in the document?
Retrieving tokens with their positions
Term vectors may help. To use them we don't actually need fielddata enabled. We can look up the term vectors for a document:
GET myindex/person_name/123/_termvectors
{
"fields" : ["full_name.metaphone_field"],
"offsets" : true,
"positions" : true
}
This would return something like this:
{
"_index": "myindex",
"_type": "person_name",
"_id": "123",
"_version": 1,
"found": true,
"took": 1,
"term_vectors": {
"full_name.metaphone_field": {
"field_statistics": {
"sum_doc_freq": 4,
"doc_count": 1,
"sum_ttf": 4
},
"terms": {
"JN": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 4
}
]
},
"T": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 5,
"end_offset": 8
}
]
},
"doe": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 5,
"end_offset": 8
}
]
},
"john": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 4
}
]
}
}
}
}
}
This gives us a way to get the tokens of a document's field exactly as the analyzer produced them.
Unfortunately, to my knowledge there is no way to combine these three queries into a single one. Also, fielddata should be used with caution since it can consume a lot of memory.
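If you want to fetch the term vectors from Python rather than via the REST call above, the client exposes the same endpoint; a sketch, using the document id from the example (in practice you would use the _id of your own document):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Same request as the REST example: term vectors for one document's field.
tv = es.termvectors(
    index="myindex",
    doc_type="person_name",
    id="123",
    fields=["full_name.metaphone_field"],
    positions=True,
    offsets=True
)
print(tv["term_vectors"]["full_name.metaphone_field"]["terms"].keys())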
Hope this helps!