I am using Elasticsearch on a large indexed database. One of my queries needs to match both an integer value and a string, such as:
s = Search(using=es, index="index1").extra(size=500) \
.query("match_phrase", name={"query": "john".casefold()})\
.query("match", age="46")
This will search for a data record that contains "John white" and "46". However, if the age does not match exactly, I would like to get a record that contains "John white" and the age closest to "46" (assuming such records exist; otherwise it should return nothing).
The above query however only returns records of age EXACTLY "46".
A similar question already exists on SO: how to find the nearest / closest number using Query DSL in elasticsearch
But I am not sure how to incorporate the JSON in my query since I am using specific python modules.
A case in point is the fact that I can use fuzziness on a string. But I think fuzziness on an integer is not possible in the same manner in elasticsearch.
I would recommend using script based sorting to accomplish this, as described here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-sort.html#_script_based_sorting
Working under the assumption that you're only matching on the first name - if you want to match the name exactly, I'd recommend using a filter based match. I used three different 'users' in the index, defined as follows:
POST index1/_doc
{
"name": "John White",
"age": 46
}
POST index1/_doc
{
"name": "John White",
"age": 40
}
POST index1/_doc
{
"name": "John Black",
"age": 47
}
I find it easier to write something a little more complex like this using Kibana's Dev Tools for testing, and then convert it to the Python Elasticsearch DSL compatible format - so in Kibana, I ultimately came up with the following:
GET index1/_search
{
"query": {
"match_phrase": {
"name": {
"query": "john"
}
}
},
"sort": {
"_script": {
"type": "number",
"script": {
"lang": "painless",
"source": "Math.abs(doc['age'].value - params.target_age)",
"params": {
"target_age": 46
}
},
"order": "asc"
}
}
}
Note that using the absolute value of the difference gives you the closest value in either direction (i.e. younger or older). Some tweaks might be necessary if your requirements are different. To target a different age, simply change the target_age parameter.
Once tested and validated, converting to Python Elasticsearch DSL is quite easy - you can use the 'Auto Indent' function to flatten the complexity of the sort and drop it right into your existing statement.
s = Search(using=es, index="index1").extra(size=500) \
.query("match_phrase", name={"query": "john".casefold()}) \
.sort({"_script":{"type":"number","script":{"lang":"painless","source": \
"Math.abs(doc['age'].value - params.target_age)", \
"params":{"target_age":46}},"order":"asc"}})
Executing this returns the expected response:
<Response: [<Hit(index1/_doc/VR3e7WkBsHIsqLp6vfx_): {'name': 'John White', 'age': 46}>, <Hit(index1/_doc/Vx3f7WkBsHIsqLp6DPxM): {'name': 'John Black', 'age': 47}>, <Hit(index1/_doc/Vh3e7WkBsHIsqLp6yfxd): {'name': 'John White', 'age': 40}>]>
However, as you indicated you want the closest value, I'd recommend changing the size parameter to 1.
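If the target value changes from query to query, the sort clause can be built by a small helper. This is only a sketch: the names closest_number_sort and rank_locally are hypothetical, and rank_locally is just a plain-Python stand-in for the ordering that the Painless script produces, run against the three sample documents above.

```python
def closest_number_sort(field, target):
    """Build a script-based sort clause ordering hits by |field - target|.

    Hypothetical helper; the clause mirrors the Kibana query above.
    """
    return {
        "_script": {
            "type": "number",
            "script": {
                "lang": "painless",
                "source": f"Math.abs(doc['{field}'].value - params.target)",
                "params": {"target": target},
            },
            "order": "asc",
        }
    }


def rank_locally(records, field, target):
    """Plain-Python equivalent of the ordering, useful for sanity checks."""
    return sorted(records, key=lambda rec: abs(rec[field] - target))


people = [
    {"name": "John White", "age": 46},
    {"name": "John White", "age": 40},
    {"name": "John Black", "age": 47},
]

ranked_ages = [p["age"] for p in rank_locally(people, "age", 46)]
print(ranked_ages)  # [46, 47, 40]

sort_clause = closest_number_sort("age", 46)
# With the Search object from the answer, this could be used as:
#   s = s.sort(sort_clause)
```

The local ranking gives the same order as the response shown above (46, then 47, then 40).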
Let's say I have a dictionary:
episode = {
"translations": [{
"language": {
"code": "de"
},
"title": "German"
}, {
"language": {
"code": "en"
},
"title": "English"
}, {
"language": {
"code": "fr"
},
"title": "French"
}]
};
I would like to get specifically the entry that matches a given language code. I could walk through the entire list of translations using the following code:
for translation in episode['translations']:
    if translation['language']['code'] == 'fr':
        language = translation
        break
But that seems a bit excessive, and a waste of resources. Is there a better way of doing this, without having to walk through the entire array?
If the data is stored in a list, then the only way to extract entries based on a condition is to iterate over them. In the snippet you provide, the implicit assumption behind using break is that there is a unique entry of interest (or perhaps that only the first match is of interest).
For more optimal queries of this data, it should be transformed to a different structure. For example, it's possible to convert it to a pandas dataframe or convert the data to a dictionary where keys are translation['language']['code'] (so look-ups become O(1)).
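A minimal sketch of the dictionary approach, assuming each language code appears at most once (if a code is repeated, the last entry wins). The name translations_by_code is made up for illustration.

```python
episode = {
    "translations": [
        {"language": {"code": "de"}, "title": "German"},
        {"language": {"code": "en"}, "title": "English"},
        {"language": {"code": "fr"}, "title": "French"},
    ]
}

# One pass over the list builds the index; every later look-up is O(1).
translations_by_code = {
    t["language"]["code"]: t for t in episode["translations"]
}

french = translations_by_code.get("fr")       # the matching entry
missing = translations_by_code.get("es")      # None when the code is absent
print(french["title"])  # French
```

Building the index still walks the list once, so this only pays off when you look up more than one code.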
Short of modifying how you structured your data*, I don't see a way that doesn't involve traversing the whole dictionary. That being said, there are a few more elegant ways to do it, although elegance is highly subjective:
filter(lambda x: 'fr' in x['language'].values(), episode['translations'])
would give you an iterable that contains all the entries in your list of translations that have the required language code. Calling next on it would give you the first one, for instance.
Edit:
* what SultanOrazbayev proposed in their answer is one such way to modify your data structure.
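For the "calling next" part, here is a small sketch. The function name first_translation is hypothetical; passing None as the second argument to next avoids a StopIteration when nothing matches.

```python
episode = {
    "translations": [
        {"language": {"code": "de"}, "title": "German"},
        {"language": {"code": "en"}, "title": "English"},
        {"language": {"code": "fr"}, "title": "French"},
    ]
}


def first_translation(episode, code):
    """Return the first translation with the given code, or None."""
    matches = filter(
        lambda t: t["language"]["code"] == code,
        episode["translations"],
    )
    # filter is lazy, so iteration stops at the first match.
    return next(matches, None)


print(first_translation(episode, "fr"))  # {'language': {'code': 'fr'}, 'title': 'French'}
print(first_translation(episode, "es"))  # None
```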
A list comprehension is neater although not really different from the original code except that it allows for more than one dictionary entry having a particular code. For example:
episode = {
"translations": [{
"language": {
"code": "de"
},
"title": "German"
}, {
"language": {
"code": "en"
},
"title": "English"
}, {
"language": {
"code": "fr"
},
"title": "French"
}]
}
list_ = [d for d in episode['translations'] if d['language']['code'] == 'en']
print(list_)
Output:
[{'language': {'code': 'en'}, 'title': 'English'}]
Background
I have data stored in the following format
{
"player_id": "VU3R5HNTAGMK",
"markers": {
"BICF2P964092": "GC",
"BICF2G630653981": "CG",
"BICF2P483996": "CT",
"BICF2S23452916": "CG",
"chr26_19147949": "TC",
}
}
You can imagine I have data stored for multiple players; each has a unique player_id and a varying number of markers with different marker values.
In the above case, a marker is BICF2P964092 and its marker value is GC.
I am trying to query my MongoDB collection in various ways. One obvious way is by player_id. To do that, I do the following: col.find({"player_id": "VU3R5HNTAGMK"})
Another thing I want to do is get the value of a specific marker for a specific player. For that, I can do the following: col.find({"player_id": "VU3R5HNTAGMK"}, {'markers.BICF2P964092'})
ISSUE
I also want to be able to get the values of multiple markers for a specific player, and I am not able to do so. I have tried the following with no luck.
col.find({"player_id": "VU3R5HNTAGMK"},{'markers': {'$in': ["BICF2P964092", "chr26_19147949"]}})
col.find({"player_id": "VU3R5HNTAGMK"}, {'markers.BICF2P964092'}, {'markers.chr26_19147949'})
I would really appreciate it if someone could help me write a query that returns multiple marker values for the specified markers and player_id.
You can simply do the following
col.find({"player_id": "VU3R5HNTAGMK"}, {"markers." + m: 1 for m in ["BICF2P964092", "BICF2G630653981"]})
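To make it clearer what that dict comprehension produces, here is a sketch that builds the projection separately. The helper name marker_projection is hypothetical; the commented find call assumes col is your pymongo collection.

```python
def marker_projection(marker_names):
    """Build a projection that includes only the named markers."""
    return {"markers." + name: 1 for name in marker_names}


projection = marker_projection(["BICF2P964092", "chr26_19147949"])
print(projection)
# {'markers.BICF2P964092': 1, 'markers.chr26_19147949': 1}

# Assuming col is a pymongo collection:
#   col.find({"player_id": "VU3R5HNTAGMK"}, projection)
```

Note that _id is still returned unless you add "_id": 0 to the projection.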
As you've tagged this pymongo, you might be best off processing the marker values in Python after the find, e.g.
docs = col.find({"player_id": "VU3R5HNTAGMK"})
for doc in docs:
    for marker, value in doc.get('markers').items():
        if marker in ["BICF2P964092", "chr26_19147949"]:
            print(marker, value)
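If you would rather collect the values than print them, the same filtering can be wrapped in a function. This is a sketch with a hypothetical name, pick_markers, run here on a plain dict shaped like the document in the question instead of a live cursor.

```python
def pick_markers(doc, wanted):
    """Return only the wanted markers that the document actually has."""
    markers = doc.get("markers", {})
    return {name: markers[name] for name in wanted if name in markers}


doc = {
    "player_id": "VU3R5HNTAGMK",
    "markers": {
        "BICF2P964092": "GC",
        "BICF2G630653981": "CG",
        "BICF2P483996": "CT",
        "BICF2S23452916": "CG",
        "chr26_19147949": "TC",
    },
}

selected = pick_markers(doc, ["BICF2P964092", "chr26_19147949", "not_a_marker"])
print(selected)  # {'BICF2P964092': 'GC', 'chr26_19147949': 'TC'}
```

Markers that are missing from the document are silently skipped, and a document without a markers key yields an empty dict.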
@Belly Buster's solution is good if you want to handle this in Python.
But, there is a way to completely handle this on the MongoDB side using Aggregation.
You can combine $objectToArray, $filter, and $arrayToObject operators in $project stage.
collection.aggregate([
    {
        "$match": {
            "player_id": "VU3R5HNTAGMK"  # <-- All your match conditions
        }
    },
    {
        "$project": {
            "player_id": 1,  # All the other keys which you want to project
            "markers": {
                "$arrayToObject": {
                    "$filter": {
                        "input": {
                            "$objectToArray": "$markers"
                        },
                        "as": "elem",
                        "cond": {
                            "$in": [
                                "$$elem.k",
                                [
                                    # <-- List of key names you want to project
                                    "BICF2G630653981",
                                    "BICF2P483996"
                                ]
                            ]
                        },
                    },
                },
            },
        }
    },
])
Note: You have to use MongoDB version >= 3.4.4 for this aggregation query to work.
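If the three operators are unfamiliar, this plain-Python sketch mimics what each one does to a single document. It is only an illustration of the data flow, not how MongoDB implements it, and the function names are made up. It also leaves out _id, which the real pipeline returns by default.

```python
def object_to_array(obj):
    """Like $objectToArray: {"a": 1} -> [{"k": "a", "v": 1}]."""
    return [{"k": key, "v": value} for key, value in obj.items()]


def array_to_object(pairs):
    """Like $arrayToObject: [{"k": "a", "v": 1}] -> {"a": 1}."""
    return {pair["k"]: pair["v"] for pair in pairs}


def project_markers(doc, wanted):
    """Mimic the $project stage: keep player_id and the wanted markers."""
    pairs = object_to_array(doc["markers"])
    kept = [pair for pair in pairs if pair["k"] in wanted]  # like $filter
    return {"player_id": doc["player_id"], "markers": array_to_object(kept)}


doc = {
    "player_id": "VU3R5HNTAGMK",
    "markers": {
        "BICF2P964092": "GC",
        "BICF2G630653981": "CG",
        "BICF2P483996": "CT",
        "BICF2S23452916": "CG",
        "chr26_19147949": "TC",
    },
}

result = project_markers(doc, ["BICF2G630653981", "BICF2P483996"])
print(result)
# {'player_id': 'VU3R5HNTAGMK', 'markers': {'BICF2G630653981': 'CG', 'BICF2P483996': 'CT'}}
```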
I have JSON output as follows:
{
"service": [{
"name": ["Production"],
"id": ["256212"]
}, {
"name": ["Non-Production"],
"id": ["256213"]
}]
}
I wish to find all ID's where the pair contains "Non-Production" as a name.
I was thinking along the lines of running a loop to check, something like this:
data = json.load(urllib2.urlopen(URL))
for key, value in data.iteritems():
    if "Non-Production" in key[value]: print key[value]
However, I can't seem to get the name and ID from the "service" tree, it returns:
if "Non-Production" in key[value]: print key[value]
TypeError: string indices must be integers
Assumptions:
The JSON is in a fixed format, this can't be changed
I do not have root access, and am unable to install any additional packages
Essentially the goal is to obtain a list of ID's of non production "services" in the most optimal way.
Here you go:
data = {
"service": [
{"name": ["Production"],
"id": ["256212"]
},
{"name": ["Non-Production"],
"id": ["256213"]}
]
}
for item in data["service"]:
    if "Non-Production" in item["name"]:
        print(item["id"])
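Since the goal is a list of IDs, here is a variant that collects them in a flat list instead of printing. It is a sketch using only the standard library; the function name non_production_ids is hypothetical, and .get is used so that missing keys give an empty result and not a KeyError.

```python
def non_production_ids(data):
    """Collect the IDs of every service whose name list has "Non-Production"."""
    ids = []
    for item in data.get("service", []):
        if "Non-Production" in item.get("name", []):
            # "id" is itself a list in this JSON, so extend, don't append.
            ids.extend(item.get("id", []))
    return ids


data = {
    "service": [
        {"name": ["Production"], "id": ["256212"]},
        {"name": ["Non-Production"], "id": ["256213"]},
    ]
}

print(non_production_ids(data))  # ['256213']
print(non_production_ids({}))    # []
```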
Whenever I see JSON, I think about functional programming! Anyone else?
I think it is a better idea to use functions like concat or flat, filter, reduce, etc.
E.g., a one-liner:
[s.get('id', [0])[0] for s in filter(lambda srv: "Non-Production" in srv.get('name', []), data.get('service', []))]
EDIT:
I updated the code: even if data = {}, the result will be [], an empty ID list.
I have a JSON response from an API that looks like this:
{
"meta": {
"code": 200
},
"data": {
"username": "luxury_mpan",
"bio": "Recruitment Agents👑👑👑👑\nThe most powerful manufacturers,\nwe have the best quality.\n📱Wechat:13255996580💜💜\n📱Whatsapp:+8618820784535",
"website": "",
"profile_picture": "https://scontent.cdninstagram.com/t51.2885-19/10895140_395629273936966_528329141_a.jpg",
"full_name": "Mpan",
"counts": {
"media": 17774,
"followed_by": 7982,
"follows": 7264
},
"id": "1552277710"
}
}
I want to fetch the values of "media", "followed_by" and "follows" and store them in three different lists, as shown in the code below:
for r in range(1,5):
    var=r,st.cell(row=r,column=3).value
    xy=var[1]
    ij=str(xy)
    myopener=Myopener()
    url=myopener.open('https://api.instagram.com/v1/users/'+ij+'/?access_token=641567093.1fb234f.a0ffbe574e844e1c818145097050cf33')
    beta=json.load(url)
    for item in beta['data']:
        list1.append(item['media'])
        list2.append(item['followed_by'])
        list3.append(item['follows'])
When I run it, it shows the error TypeError: string indices must be integers
How would my loop change in order to fetch the above mentioned values?
Also, out of curiosity: is there any way to extract the WhatsApp number from the "bio" key in the data dictionary?
I have referred questions similar to this and still did not get my answer. Please help!
beta['data'] is a dictionary object. When you iterate over it with for item in beta['data'], the values taken by item will be the keys of the dictionary: "username", "bio", etc.
So then when you ask for, e.g., item['media'] it's like asking for "username"['media'], which of course doesn't make any sense.
It isn't quite clear what it is that you want: is it just the stuff inside counts? If so, then instead of for item in beta['data']: you could just say item = beta['data']['counts'], and then item['media'] etc. will be the values you want.
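Assuming it is the stuff inside counts that you want, a minimal sketch would look like this. The beta dict here is a trimmed copy of your response, standing in for the result of json.load(url).

```python
# Trimmed copy of the API response from the question.
beta = {
    "meta": {"code": 200},
    "data": {
        "username": "luxury_mpan",
        "full_name": "Mpan",
        "counts": {"media": 17774, "followed_by": 7982, "follows": 7264},
        "id": "1552277710",
    },
}

# Iterating over a dict yields its keys (strings), which is why
# item['media'] raised "string indices must be integers".
keys = list(beta["data"])
print(keys)  # ['username', 'full_name', 'counts', 'id']

list1, list2, list3 = [], [], []

# Go straight to the nested dict; no inner loop is needed.
counts = beta["data"]["counts"]
list1.append(counts["media"])
list2.append(counts["followed_by"])
list3.append(counts["follows"])

print(list1, list2, list3)  # [17774] [7982] [7264]
```

Inside your outer loop, the three append calls would run once per user, so each list grows by one value per request.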
As to your secondary question: I suggest looking into regular expressions.
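As a starting point for the regular-expression route, here is a sketch that assumes the bio always writes the number as "Whatsapp:" followed by an optional "+" and digits, as in your sample. Other bios may format it differently, so treat the pattern as a guess. The function name is made up.

```python
import re

bio = (
    "Recruitment Agents👑👑👑👑\nThe most powerful manufacturers,\n"
    "we have the best quality.\n📱Wechat:13255996580💜💜\n"
    "📱Whatsapp:+8618820784535"
)

# Assumed format: "whatsapp", optional spaces around ":", optional "+", digits.
WHATSAPP_PATTERN = re.compile(r"whatsapp\s*:\s*(\+?\d+)", re.IGNORECASE)


def find_whatsapp_number(text):
    """Return the first WhatsApp number found in text, or None."""
    match = WHATSAPP_PATTERN.search(text)
    return match.group(1) if match else None


print(find_whatsapp_number(bio))           # +8618820784535
print(find_whatsapp_number("no numbers"))  # None
```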
I'm trying to use $set to create an array/list/collection (not sure which is proper terminology), and I'm not sure how to do it. For example:
I have a document inserted into my database that looks like this:
"_id": (unique, auto-generated id)
"Grade": Sophomore
I want to insert a collection/list/array using update. So, basically I want this:
"_id": (unique, auto-generated id)
"Grade": Sophomore
"Information"{
"Class_Info": [
{"Class_Name": "Math"}
]
What I've been doing so far is using .update and dot notation. So, what I was trying to do was use $set like this:
collection.update({'_id': unique ID}, {'$set': {'Information.Class_Info.Class_Name': 'Math'}})
However, what that is doing is making Class_Info a document and not a list/collection/array, so it's doing:
"_id": (unique id)
"Grade": Sophomore
"Information"{
"Class_Info": {
"Class_Name": "Math"
}
How do I specify that I want Class_Info to be a list? If for some reason I absolutely cannot use $set to do this, it is very important that I can use dot notation because of the way the rest of my program works, so if I'm supposed to use something other than $set, can it use dot notation to specify where to insert the list? (I know $push is another option, but it doesn't use dot notation, so I can't really use it in my case.)
Thanks!
If you want to do it with only one instruction, starting from NOT having any key created yet, this is the only way to do it ($set will never create an array that's not explicit, like {$set: {"somekey": [] }}):
db.test.update(
{ _id: "(unique id)" },
{ $push: {
"Information.Class_Info": { "Class_Name": "Math" }
}}
)
This query does the trick: it pushes the object you need to a non-existing key, Information.Class_Info, which is thereby created as an array. This is the only working solution that uses a single instruction and dot notation.
There is a way to do it with one instruction, $set and dot notation, as follows:
db.test.updateOne(
{ _id: "my-unique-id" },
{ $set: {
"Information.Class_Info": [ { "Class_Name": "Math" } ]
}}
)
There is also a way to do it with two instructions and the array index in the dot notation, allowing you to use similar statements to add more array elements:
db.test.updateOne(
{ _id: "my-unique-id" },
{ $set: { "Information.Class_Info": [] }}
)
db.test.updateOne(
{ _id: "my-unique-id" },
{ $set: {
"Information.Class_Info.0": { "Class_Name": "Math" },
"Information.Class_Info.1": { "Class_AltName": "Mathematics" }
}}
)
Deviating from these options has interesting failure modes:
If you try to combine the second option into a single updateOne() call, which is usually possible, MongoDB will complain that "Updating the path 'Information.Class_Info.0' would create a conflict at 'Information.Class_Info'"
If you try to use the dot notation with the array index ("Information.Class_Info.0.Class_Name": "Math") but without creating an empty array first, then MongoDB will create an object with numeric keys ("0", "1", …). It really refuses to create an array except when told explicitly using […] (as also mentioned in the answer by @Maximiliano).
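Since the question uses Python, here is a sketch of how the update documents from both answers could be built there. The helper names push_item and set_array are hypothetical, and the commented update_one call assumes collection is a pymongo collection and doc_id is the document's _id.

```python
def push_item(path, item):
    """Update document that appends item to the array at the dotted path.

    $push creates the array if the path does not exist yet.
    """
    return {"$push": {path: item}}


def set_array(path, items):
    """Update document that sets the dotted path to an explicit array."""
    return {"$set": {path: list(items)}}


push_update = push_item("Information.Class_Info", {"Class_Name": "Math"})
set_update = set_array("Information.Class_Info", [{"Class_Name": "Math"}])

print(push_update)
# {'$push': {'Information.Class_Info': {'Class_Name': 'Math'}}}
print(set_update)
# {'$set': {'Information.Class_Info': [{'Class_Name': 'Math'}]}}

# Assuming collection is a pymongo collection and doc_id is the _id:
#   collection.update_one({"_id": doc_id}, push_update)
```

On a document that has no Information key yet, either update should leave Class_Info as a one-element array. The difference is that $push appends on later calls while $set replaces the whole array.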