Searching for space-separated words in Elasticsearch - Python

My data is:
POST index_name/_doc/1
{
"created_date" : "2023-02-09T13:21:41.632492",
"created_by" : "hats",
"data" : [
{
"name" : "west cost",
"document_id" : "1"
},
{
"name" : "mist cost",
"document_id" : "2"
},
{
"name" : "cost",
"document_id" : "3"
}
]
}
I used query_string to search:
GET index_name/_search
{
"query": {
"query_string": {
"default_field": "data.name",
"query": "*t cost"
}
}
}
expected result was:
west cost, mist cost
but the output was:
west cost, mist cost, cost
I have tried many search queries but still couldn't find a solution.
Which search query handles the space? I need to search for similarly patterned values in the field.

Related

Elasticsearch analyzers not working on queries

So I am running Elasticsearch and Kibana locally on ports 9200 and 5601 respectively. I am attempting to process a JSONL file into Elasticsearch documents and apply an analyzer to some of the fields.
This is the body:
body = {
"mappings": {
"testdoc": {
"properties": {
"title": {
"type": "text",
"analyzer": "stop"
},
"content": {
"type": "text",
"analyzer": "stop"
}
}
}
}
}
I then create a new index (and I am deleting the index between tests so I know it's not that)
es.indices.create("testindex", body=body)
I then parse my JSONL objects into documents and upload them to Elasticsearch using
helpers.bulk(es, documents, index="textindex", doc_type="testdoc")
Finally I query like this:
q = {"query": {"match_all": {}}}
print(es.search(index="testindex", body=q))
My result for a sample sentence like "The quick brown fox" is unchanged, when I'd expect it to be 'quick brown fox'.
When I run the same query in Kibana, I also see it not working:
GET /testindex/_search
{
"query": {
"match_all": {}
}
}
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 10,
"max_score" : 4.128039,
"hits" : [
{
"_index" : "testindex",
"_type" : "textdocument",
"_id" : "6bfkb4EBWF89_POuykkO",
"_score" : 4.128039,
"_source" : {
"title" : "The fastest fox",
"body" : "The fastest fox is also the brownest fox. They jump over lazy dogs."
}
}
]
}
}
Now I do this query:
POST /testindex/_analyze
{
"field": "title",
"text": "The quick brown fox"
}
I get this response:
{
"tokens" : [
{
"token" : "quick",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 1
},
{
"token" : "brown",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 2
},
{
"token" : "fox",
"start_offset" : 16,
"end_offset" : 19,
"type" : "word",
"position" : 3
}
]
}
Which is what I would expect.
But conversely if I do
POST testindex/testdoc
{
"title":"The fastest fox",
"content":"Test the a an migglybiggly",
"published":"2015-09-15T10:17:53Z"
}
and then search for 'migglybiggly', the content field of the returned document has not dropped its stop words.
I am really at a loss as to what I am doing wrong here. I'm fairly new to elasticsearch and this is really dragging me down.
Thanks in advance!
Edit:
If I run
GET /testindex/_mapping
I see
{
"textindex" : {
"mappings" : {
"testdoc" : {
"properties" : {
"title" : {
"type" : "text",
"analyzer" : "stop"
},
"content" : {
"type" : "text",
"analyzer" : "stop"
}
}
}
}
}
}
So, to me, it looks like the mapping is getting uploaded correctly, so I don't think it's that?
This is expected behavior: when you run a query, the response contains your original content (_source), not the analyzed field.
The analyzer controls how Elasticsearch indexes a field into the inverted index; it does not change your actual content. The same analyzer is applied at query time as well, so when you pass a query, the stop analyzer removes the stop words from it before searching the inverted index.
The POST /testindex/_analyze API shows how your original content is analyzed/tokenized and stored in the inverted index. It does not change your original document.
So when you run a match_all query, it simply fetches all the documents from Elasticsearch along with _source, which holds the original document content, and returns that as the response.
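To make the distinction concrete, here is a minimal toy model in Python. It is not how Elasticsearch is actually implemented: the stop-word list is an abbreviated assumption, and ToyIndex and stop_analyze are hypothetical names used only for illustration. It sketches the idea that analysis only affects the inverted index, while a search still returns the stored source unchanged.

```python
import re

# Abbreviated stop-word list, assumed for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to"}


def stop_analyze(text):
    """Rough stand-in for the stop analyzer: lowercase, split on non-letters, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


class ToyIndex:
    """Hypothetical in-memory index: stores the original source and a separate inverted index."""

    def __init__(self):
        self.sources = {}   # doc_id -> original document (like _source)
        self.inverted = {}  # token -> set of doc_ids

    def index(self, doc_id, source, field):
        self.sources[doc_id] = source  # stored untouched
        for token in stop_analyze(source[field]):
            self.inverted.setdefault(token, set()).add(doc_id)

    def match(self, query):
        # The query text goes through the same analyzer as the indexed text.
        ids = set()
        for token in stop_analyze(query):
            ids |= self.inverted.get(token, set())
        return [self.sources[i] for i in sorted(ids)]


toy = ToyIndex()
toy.index(1, {"title": "The quick brown fox"}, "title")
hits = toy.match("the fox")
```

In this sketch toy.inverted has no entry for "the", yet the hit still carries the original title, which mirrors what the search responses above show.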
You can use a match query to match on a specific field instead of match_all, since match_all returns every document in the index (10 at a time by default).
{
"query": {
"match": {
"title": "The quick brown fox"
}
}
}
Here, you can try queries such as quick brown fox or The quick.
I hope this clears things up.

Match entire statement in Elastic Search

I am trying to match a whole statement when querying Elasticsearch, but I can't get it right.
"query": {
"match_phrase": {"description": query_tokens}
}
I tried with Air Conditioning, Air Conditioner, but it gave me results like a general match query.
How can I fetch only documents that match the complete statement?
Solution
Store these statements (the description field) in a keyword field.
Why it does not work
The match_phrase query is analyzed, which means the same analyzer that was used at index time is used at query time to create the search tokens. By default, a text field uses the standard analyzer, which breaks the text into tokens on whitespace and on special characters such as the comma.
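The difference can be sketched in plain Python. This is only an approximation under stated assumptions: standard_tokens is a simplified stand-in for the standard analyzer (lowercased word tokens), and phrase_matches and keyword_matches are hypothetical helpers, not Elasticsearch APIs.

```python
import re


def standard_tokens(text):
    """Simplified stand-in for the standard analyzer: lowercase word tokens, punctuation dropped."""
    return re.findall(r"\w+", text.lower())


def phrase_matches(doc_text, query_text):
    """Analyzed phrase match: the query tokens must appear consecutively in the document tokens."""
    doc, query = standard_tokens(doc_text), standard_tokens(query_text)
    return any(doc[i:i + len(query)] == query for i in range(len(doc) - len(query) + 1))


def keyword_matches(doc_text, query_text):
    """Keyword-style match: the whole stored value must equal the query exactly."""
    return doc_text == query_text


description = "Air Conditioning, Air Conditioner"

partial_on_text = phrase_matches(description, "Air Conditioning")     # partial phrase matches
partial_on_keyword = keyword_matches(description, "Air Conditioning") # partial value does not
full_on_keyword = keyword_matches(description, description)           # full statement matches
```

Under these assumptions, an analyzed phrase search accepts any consecutive run of tokens, while the keyword comparison accepts only the complete statement.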
Statement from Elasticsearch doc
The match_phrase query analyzes the text and creates a phrase query
out of the analyzed text. For example:
Index Def
{
"mappings": {
"properties": {
"description":
{
"type" : "keyword"
}
}
}
}
Index sample doc
{
"description" : "Air Conditioning, Air Conditioner"
}
Search query
{
"query": {
"match_phrase" : {
"description" : {
"query" : "Air Conditioning, Air Conditioner"
}
}
}
}
Search result
"hits": [
{
"_index": "match",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"description": "Air Conditioning, Air Conditioner"
}
}
]

limit results of child/sub document when using find on master document

This is my first time exploring MongoDB and I've got myself into a pickle.
Assume I have a table/collection called inventory.
This collection in turn has documents that look like:
{
"book" : "Harry Potter",
"users" : {
"Read_it" : {
"John" : <personal number>,
"Elise" : <personal number>
},
"Currently_reading" : { ... }
}
}
Now the dictionary "Read_it" can become quite large, and I'm limited by the amount of memory the querying client has, so I would like to somehow limit the number of returned items and perhaps page through them.
This is a function I found in the docs, not sure how to convert this into what I need.
db.inventory.find( { "book": "Harry Potter" }, { item: 1, qty: 500 } )
Skipping the second parameter to find() gives me the result as a complete dictionary, which works as long as the "Read_it" document/container doesn't grow too big.
One solution would be to pull back the structure so it becomes more flat, but that isn't optimal in terms of other aspects of this project.
Is it possible to work with find() here, or is there another function that can do this better?
You seem to be asking about projecting only specific elements of a nested structure.
Consider your document example (revised for use):
{
"book" : "Harry Potter",
"users" : {
"Read_it" : {
"John" : 1,
"Elise" : 2
},
"Currently_reading" : {
"Peter": 1
},
"More_information": 5
}
}
Then just issue as follows:
db.collection.find(
{ "book": "Harry Potter" },
{
"book": 1,
"users.Currently_reading": 1,
"users.More_information": 1
}
)
Returns the result with just the fields specified:
{
"_id" : ObjectId("5573b2beb67e246aba2b4b71"),
"book" : "Harry Potter",
"users" : {
"Currently_reading" : {
"Peter" : 1
},
"More_information" : 5
}
}
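As a rough illustration of what an inclusion projection with dotted paths keeps, here is a small pure-Python sketch. The project function is a hypothetical helper, not part of MongoDB or PyMongo, and it ignores details such as the automatic inclusion of _id and array handling.

```python
def project(doc, paths):
    """Keep only the given dotted paths of a nested dict (toy model of an inclusion projection)."""
    out = {}
    for path in paths:
        keys = path.split(".")
        value = doc
        for key in keys:
            if not isinstance(value, dict) or key not in value:
                break  # path does not exist in this document: skip it
            value = value[key]
        else:
            # Rebuild the nesting in the output and copy the value across.
            target = out
            for key in keys[:-1]:
                target = target.setdefault(key, {})
            target[keys[-1]] = value
    return out


book = {
    "book": "Harry Potter",
    "users": {
        "Read_it": {"John": 1, "Elise": 2},
        "Currently_reading": {"Peter": 1},
        "More_information": 5,
    },
}

projected = project(book, ["book", "users.Currently_reading", "users.More_information"])
```

The potentially large "Read_it" sub-document is simply never copied into the result, which is the effect the projection has on what the server sends back.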
I'm not entirely sure, but that might not be supported in all MongoDB versions. It works in 3.x, though. If you find it is not supported, do this instead:
db.collection.aggregate([
{ "$match": { "book": "Harry Potter" } },
{ "$project": {
"book": 1,
"users": {
"Currently_reading": "$users.Currently_reading",
"More_information": "$users.More_information"
}
}}
])
The $project option of the .aggregate() method allows you to manipulate the document returned quite freely. So you don't even need to keep the same structure to return nested results and could change the result further if needed.
I would also strongly suggest using arrays with properties of sub-documents rather than nested dictionaries since that form is much easier to query and filter results than your current structure allows.
Addendum, since the question was unclear
As mentioned, it is better to use arrays rather than keys to represent the nested data. So if your intent is to actually just restrict the "Read_it" items to a number of entries then your data is best modelled as such:
{
"book" : "Harry Potter",
"users" : {
"Read_it" : [
{ "username": "John", "id": 1 },
{ "username": "Elise", "id": 2 }
],
"Currently_reading" : [
{ "username": "Peter", "id": 3 }
],
"More_information": 5
}
}
Then you can do a query to limit the number of items in "Read_it" using $slice :
db.collection.find(
{ "book": "Harry Potter" },
{ "users.Read_it": { "$slice": 1 } }
)
Which returns:
{
"_id" : ObjectId("5574118012ae33005f1fca17"),
"book" : "Harry Potter",
"users" : {
"Read_it" : [
{
"username" : "John",
"id" : 1
}
],
"Currently_reading" : [
{
"username" : "Peter",
"id" : 3
}
],
"More_information" : 5
}
}
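Since the question also mentions paging, the two-element form of $slice, [skip, limit], may be useful. The helper below is a minimal sketch: read_it_page_projection and apply_slice are hypothetical names, and apply_slice only imitates on a local list what the server would return for non-negative skip values.

```python
def read_it_page_projection(page, page_size):
    """Build a projection that returns one page of users.Read_it using $slice: [skip, limit]."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return {"users.Read_it": {"$slice": [page * page_size, page_size]}}


def apply_slice(items, skip, limit):
    """Local imitation of $slice: [skip, limit] for non-negative skip."""
    return items[skip:skip + limit]


read_it = [
    {"username": "John", "id": 1},
    {"username": "Elise", "id": 2},
]

projection = read_it_page_projection(page=2, page_size=10)
second_entry = apply_slice(read_it, skip=1, limit=1)
```

With PyMongo, such a dictionary could be passed as the projection argument, for example collection.find({"book": "Harry Potter"}, read_it_page_projection(0, 100)), where collection is assumed to be an existing Collection object.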
Alternate options use the projection positional $ operator or even the aggregation framework for multiple matches in the array. But there are already many answers here that show you how to do that.

mongodb python , quick pipeline code check

I am a beginner with MongoDB and I have an assignment to write pipeline code. My goal is to find which region in India has the largest number of cities with longitude between 75 and 80. I hope somebody can point out my misconceptions and/or mistakes; the code is very short, so I am sure the pros will spot them right away.
Here is my code; the data structure is shown below it:
pipeline = [
{"$match" : {"lon": {"$gte":75, "$lte" : 80}},
{'country' : 'India'}},
{ '$unwind' : '$isPartOf'},
{ "$group":
{
"_id": "$name",
"count" :{"$sum":{"cityname":"$name"}} }},
{"$sort": {"count": -1}},
{"$limit": 1}
]
{
"_id" : ObjectId("52fe1d364b5ab856eea75ebc"),
"elevation" : 1855,
"name" : "Kud",
"country" : "India",
"lon" : 75.28,
"lat" : 33.08,
"isPartOf" : [
"Jammu and Kashmir",
"Udhampur district"
],
"timeZone" : [
"Indian Standard Time"
],
"population" : 1140
}
The following pipeline will give you the desired result. The first $match stage uses a standard MongoDB query to keep only the documents (cities) whose longitude is between 75 and 80 and whose country field is India. Since each document represents a city, the $unwind stage on isPartOf deconstructs that array field and outputs one document per element, with the array replaced by the element value. Thus each input document produces n output documents, where n is the number of array elements. This is what makes the following $group stage work: grouping on the unwound region and applying the $sum accumulator counts the cities per region. The remaining stages reshape the result by introducing the replacement fields Region and NumberOfCities, sort the documents in descending order, and return the top document, which is the region with the largest number of cities:
pipeline = [
{
"$match": {
"lon": {"$gte": 75, "$lte": 80},
"country": "India"
}
},
{
"$unwind": "$isPartOf"
},
{
"$group": {
"_id": "$isPartOf",
"count": {
"$sum": 1
}
}
},
{
"$project": {
"_id": 0,
"Region": "$_id",
"NumberOfCities": "$count"
}
},
{
"$sort": {"NumberOfCities": -1}
},
{ "$limit": 1 }
]
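To see what these stages compute without a database, the same logic can be imitated in plain Python. This is only a sketch: top_region is a hypothetical helper, and apart from Kud (taken from the question) the city documents are made up for illustration.

```python
from collections import Counter

# Kud comes from the question; the other entries are invented sample data.
cities = [
    {"name": "Kud", "country": "India", "lon": 75.28,
     "isPartOf": ["Jammu and Kashmir", "Udhampur district"]},
    {"name": "Hypothetical City A", "country": "India", "lon": 76.5,
     "isPartOf": ["Jammu and Kashmir"]},
    {"name": "Hypothetical City B", "country": "India", "lon": 91.0,
     "isPartOf": ["Assam"]},
    {"name": "Hypothetical City C", "country": "Nepal", "lon": 77.0,
     "isPartOf": ["Elsewhere"]},
]


def top_region(docs, lon_min=75, lon_max=80, country="India"):
    # $match: keep cities in the country whose longitude is in range
    matched = [
        d for d in docs
        if d.get("country") == country and "lon" in d and lon_min <= d["lon"] <= lon_max
    ]
    # $unwind + $group with {"$sum": 1}: one count per (city, region) pair
    counts = Counter(region for d in matched for region in d.get("isPartOf", []))
    if not counts:
        return None
    # $sort descending + $limit 1, then rename the fields like the $project stage
    region, number = counts.most_common(1)[0]
    return {"Region": region, "NumberOfCities": number}


result = top_region(cities)
```

For real data the aggregation pipeline should still be preferred, since it runs on the server instead of pulling every city into the client.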
There are some syntax and logical errors in your pipeline.
{"$match" : {"lon": {"$gte":75, "$lte" : 80}},
{'country' : 'India'}},
The syntax here is wrong; you should just use a comma to separate key-value pairs in $match.
"_id": "$name",
You are grouping based on city name and not on the region.
{"$sum":{"cityname":"$name"}}
You need to pass the $sum operator a numeric value, or an expression that resolves to one. {"cityname":"$name"} is not numeric, so it will be ignored.
The correct pipeline would be:
[
{"$match" : {"lon": {"$gte":75,"$lte" : 80},'country' : 'India'}},
{ '$unwind' : '$isPartOf'},
{ "$group":
{
"_id": "$isPartOf",
"count" :{"$sum":1}
}
},
{"$sort": {"count": -1}},
{"$limit": 1}
]
If you also want all the cities in that region that satisfy your condition, you can add "cities": {'$push': '$name'} to the $group stage.

how to aggregate on each item in collection in mongoDB

MongoDB noob here...
when I do db.students.find().pretty() in the shell I get a long list from my collection...like so..
{
"_id" : 19,
"name" : "Gisela Levin",
"scores" : [
{
"type" : "exam",
"score" : 44.51211101958831
},
{
"type" : "quiz",
"score" : 0.6578497966368002
},
{
"type" : "homework",
"score" : 93.36341655949683
},
{
"type" : "homework",
"score" : 49.43132782777443
}
]
}
Now I've got a little over 100 of these, and I need to run the following on each of them:
lowest_hw_score =
db.students.aggregate(
// Initial document match (uses index, if a suitable one is available)
{ $match: {
_id : 0
}},
// Expand the scores array into a stream of documents
{ $unwind: '$scores' },
// Filter to 'homework' scores
{ $match: {
'scores.type': 'homework'
}},
// Sort in ascending order so the lowest score comes first
{ $sort: {
'scores.score': 1
}},
{ $limit: 1}
)
So I can run something like this on each result:
for item in lowest_hw_score:
print(item)
Right now lowest_hw_score works on only one item. I want to run this on all items in the collection. How do I do this?
> db.students.aggregate(
{ $match : { 'scores.type': 'homework' } },
{ $unwind: "$scores" },
{ $match:{"scores.type":"homework"} },
{ $group: {
_id : "$_id",
maxScore : { $max : "$scores.score"},
minScore: { $min:"$scores.score"}
}
});
You don't really need the first $match, but if "scores.type" is indexed, the index can be used before the scores are unwound. (I don't believe MongoDB would be able to use the index after the $unwind.)
Result:
{
"result" : [
{
"_id" : 19,
"maxScore" : 93.36341655949683,
"minScore" : 49.43132782777443
}
],
"ok" : 1
}
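For comparison, the same per-student grouping can be imitated in plain Python on documents that are already loaded. This is a sketch only: homework_extremes is a hypothetical helper, and the single student below is the sample document from the question.

```python
def homework_extremes(students):
    """Per student, compute the highest and lowest homework score (imitates $unwind/$match/$group)."""
    results = []
    for student in students:
        homework = [
            s["score"] for s in student.get("scores", [])
            if s.get("type") == "homework"
        ]
        if not homework:
            continue  # like $match, students without homework scores drop out
        results.append({
            "_id": student["_id"],
            "maxScore": max(homework),
            "minScore": min(homework),
        })
    return results


students = [
    {
        "_id": 19,
        "name": "Gisela Levin",
        "scores": [
            {"type": "exam", "score": 44.51211101958831},
            {"type": "quiz", "score": 0.6578497966368002},
            {"type": "homework", "score": 93.36341655949683},
            {"type": "homework", "score": 49.43132782777443},
        ],
    }
]

extremes = homework_extremes(students)
for item in extremes:
    print(item)
```

The loop at the end is the per-item iteration the question asks for; with the aggregation above, the grouping happens on the server and the client only iterates over the grouped results.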
Edit: tested and updated in mongo shell
