Searching letter by letter in Elasticsearch - Python

I am using Elasticsearch with Python as the client. I want to query through a list of companies. Say the Company field values are:
Gokl
Normn
Nerth
Scenario 1 (using elasticsearch-dsl Python):
s = Search(using=client, index="index-test") \
    .query("match", Company="N")
So when I put "N" in the match query I don't get Normn or Nerth. I think it's probably because of word-based tokenization.
Scenario 2 (using elasticsearch-dsl Python):
s = Search(using=client, index="index-test") \
    .query("match", Company="Normn")
When I enter Normn I get the output correctly. So how can I make the search work when I enter the letter N, as in scenario 1 above?

I think you are looking for a prefix search. I don't know the Python syntax, but the direct query would look like this:
GET index-test/_search
{
  "query": {
    "prefix": {
      "company": {
        "value": "N"
      }
    }
  }
}
See here for more info.
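In elasticsearch-dsl the same query can be expressed with the "prefix" query name. Below is a minimal sketch (stdlib only) of the request body such a query produces; the commented dsl call is the assumed Python equivalent, with the field name Company taken from the question:

```python
def prefix_query(field, value):
    """Build the raw request body for a prefix query."""
    return {"query": {"prefix": {field: {"value": value}}}}

# Assumed elasticsearch-dsl equivalent (with `client` already connected):
# s = Search(using=client, index="index-test").query("prefix", Company="N")

body = prefix_query("Company", "N")
```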

If I understand correctly, you need to query companies starting with a specific letter. In that case you can use this query:
{
  "query": {
    "regexp": {
      "Company": "n.*"
    }
  }
}
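Note that an Elasticsearch regexp query is anchored to the whole term, much like Python's re.fullmatch, and with the standard analyzer the indexed terms are lowercased, which is why the pattern starts with a lowercase "n". A quick local sketch of how the pattern behaves:

```python
import re

pattern = "n.*"  # same pattern as in the regexp query above

# Terms produced by the standard analyzer are lowercased,
# so "Normn" and "Nerth" are indexed as "normn" and "nerth".
assert re.fullmatch(pattern, "normn")
assert re.fullmatch(pattern, "nerth")
assert re.fullmatch(pattern, "gokl") is None
```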

Please read about the query types here.
For this case, you can use the code below:
s = Search(using=client, index="index-test") \
    .query("match_phrase_prefix", Company="N")
You can use a multi_match query for Company and another field like this:
s = Search(using=client, index="index-test") \
    .query("multi_match", query="N", fields=['Company', 'Another_field'], type='phrase_prefix')

Related

Get second level domain with regexp in elasticsearch

I want to search through an index in my Elasticsearch database for domains containing a given second-level domain (SLD), but it returns None.
Here is what I've done so far:
sld = "smth"
query = client.search(
    index="x",
    body={
        "query": {
            "regexp": {
                "site_domain.keyword": fr"*\.{sld}\.*"
            }
        }
    }
)
EDIT: I think the problem is with the regex I wrote.
Any help would be appreciated.
TL;DR
GET /so_regex_url/_search
{
  "query": {
    "regexp": {
      "site_domain": ".*api\\.[a-z]+\\.[a-z]+"
    }
  }
}
This regex will match api.google.com but won't match google.com.
You should watch out for reserved characters such as `.`; they require a proper escape sequence.
To understand
First, let's talk about the pattern you are looking for.
You want to match every URL that has a given subdomain.
1. Check that the subdomain string exists in the URL
Something like .*<subdomain>.* will work; .* means any character in any quantity.
2. Check that it is a subdomain
A subdomain in a URL looks like <subdomain>.<domain>.<top level domain>.
You need to make sure that your subdomain has a . between both the domain and the top-level domain.
Something like .*<subdomain>.*\.[a-z]+\.[a-z]+ will work. [a-z]+ means at least one character from a to z, and because . has a special meaning you need to escape it with \.
This will match https://<subdomain>.google.com, but won't match https://<subdomain>.com.
/!\ This is a naive implementation.
https://<subdomain>.1234.com won't match, as neither 1, 2, ... exists in [a-z].
3. Create the Elastic DSL query
I am performing the request on the text field, not the keyword; this keeps my example leaner but it works the same way.
GET /so_regex_url/_search
{
  "query": {
    "regexp": {
      "site_domain": ".*api\\.[a-z]+\\.[a-z]+"
    }
  }
}
You may have noticed the \\. As explained in the thread, it is needed because the payload travels in JSON, which requires its own level of escaping.
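The escaping and the anchoring can be sanity-checked locally with Python's re module, which is close enough to Lucene's regexp syntax for this pattern (a sketch, not Elasticsearch itself):

```python
import re

# The ES regexp query is anchored to the whole value, like re.fullmatch.
# In Python source a single backslash escapes the dot; the double
# backslash in the JSON above is just JSON-level escaping.
pattern = r".*api\.[a-z]+\.[a-z]+"

assert re.fullmatch(pattern, "api.google.com")
assert re.fullmatch(pattern, "google.com") is None
```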
4. Python implementation
I imagine it should be:
sld = "smth"
query = client.search(
    index="x",
    body={
        "query": {
            "regexp": {
                # A raw string keeps the single backslash; the client's JSON
                # serialization turns it into the \\ seen in the query above.
                "site_domain.keyword": fr".*{sld}\.[a-z]+\.[a-z]+"
            }
        }
    }
)

MongoDB (PyMongo) Pagination with distinct not giving consistent result

I am trying to achieve pagination with distinct using PyMongo.
I have records like:
{
    name: string,
    roll: integer,
    address: string,
    ...
}
I only want the name for each record. Names can be duplicated, so I want distinct names with pagination.
result = collection.aggregate([
    {'$sort': {"name": 1}},
    {'$group': {"_id": "$name"}},
    {'$skip': skip},
    {'$limit': limit}
])
The problem is that with this query I get a different result for the same page number each time.
I looked into this answer:
Distinct() command used with skip() and limit()
but it didn't help in my case.
How do I resolve this?
Thanks in advance!
I've tried sorting after the group and it seems to solve the problem. $group does not guarantee any output order, so a $sort placed before it is effectively discarded; sorting on _id afterwards makes each page deterministic:
db.collection.aggregate([
    {"$group": {"_id": "$name"}},
    {"$sort": {"_id": 1}},
    {"$skip": 0},
    {"$limit": 1}
])
try it here
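In plain Python the corrected pipeline amounts to deduplicate, then sort, then slice, which is why the page contents are now stable across calls. A sketch of that logic (not a pymongo call):

```python
def distinct_page(names, skip, limit):
    """Mirror of: $group on name, then $sort on _id, $skip, $limit."""
    return sorted(set(names))[skip:skip + limit]

names = ["bob", "alice", "bob", "carol"]
assert distinct_page(names, 0, 2) == ["alice", "bob"]
assert distinct_page(names, 2, 2) == ["carol"]
```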

Is there a way to find record in mongo by matching field string with an array of values

I have the below record:
{
    "title": "Kim floral jacquard minidress",
    "designer": "Rotate Birger Christensen"
}
How can I find a record in the collection using an array of values? For example, with the array below, the record should be selected because the "title" field contains the value "floral".
['floral', 'dresses']
The query I am using below doesn't work. :(
queryParam = ['floral', 'dresses']
def get_query(queryParam, gender):
    query = {
        "gender": gender
    }
    if len(queryParam) != 0:
        query["title"] = {"$in": queryParam}
    return query
products_query = get_query(queryParam, gender)
products = mongo.db.products.find(products_query)
To add to the previous answer, there's a little more to do to get this to work in PyMongo. You have to use re.compile() to get the regex search to work:
import re
queryParam = [re.compile('floral'), re.compile('dresses')]
Alternatively you could use this approach which removes the need for the $in operator:
import re
queryParam = [re.compile('floral|dresses')]
And once you've done that, you don't even need re.compile:
queryParam = 'floral|dress'
...
query = {"title": {"$regex": queryParam}}
Take your pick.
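Either variant can be sanity-checked locally before hitting the collection, using the document title from the question:

```python
import re

title = "Kim floral jacquard minidress"

# $in with several compiled patterns: the document matches if ANY pattern hits.
patterns = [re.compile("floral"), re.compile("dresses")]
assert any(p.search(title) for p in patterns)

# Single alternation pattern: one regex does the same job.
assert re.search("floral|dresses", title)
```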
You need to do a regex search along with the $in operator:
db.collectionName.find({ title: { $in: [ /floral/, /dresses/ ] } })

How to return only aggregation results not hits in elasticsearch query dsl

I am writing a query DSL in Python using http://elasticsearch-dsl.readthedocs.io
and I have the following code:
search.aggs.bucket('per_ts', 'terms', field='ts')\
    .bucket('load_time', 'percentiles', field='total_req', percents=[99])
response = search.execute()
This works fine, but it also returns hits, and I don't want hits.
In curl query mode I can get what I want by setting "size": 0 in:
GET /twitter/tweet/_search
{
  "size": 0,
  "aggregations": {
    "my_agg": {
      "terms": {
        "field": "text"
      }
    }
  }
}
I couldn't find a way to use size = 0 in the query DSL.
Referring to the code of elasticsearch-dsl-py/search.py here:
s = Search().query(...).extra(from_=0, size=25)
This statement should work.

How to use mapreduce in mongodb?

I have the following code in Python:
from pymongo import Connection
import bson

c = Connection()
db = c.twitter
ids = db.users_from_united_states.distinct("user.id")
for i in ids:
    count = db.users_from_united_states.find({"user.id": i}).count()
    for u in db.users_from_united_states.find({"user.id": i, "tweets_text": {"$size": count}}).limit(1):
        db.my_usa_fitness_network.insert(u)
I need to get all the users and find, for each user, the record where the number of tweets_text entries is equal to the number of times that user appears in the collection (meaning that this document contains ALL the tweets the user posted).
Then I need to save it in another collection, or just group it in the same collection.
When I run this code it gives me fewer documents than the number of ids.
I saw something about mapReduce but I just can't figure out how to use it in my case.
I tried to run other code directly in mongodb but it didn't work at all:
var ids = db.users_from_united_states.distinct("user.id")
for (i = 0; i < ids.length; i++) {
    var count = db.users_from_united_states.find({"user.id": ids[i]}).count()
    db.users_from_united_states.find({"user.id": ids[i], "tweets_text": {$size: count}}).limit(1).forEach(function(doc) { db.my_usa_fitness_network.insert(doc) })
}
Can you help me please? I have a huge project and I need help. Thank you.
[
    {
        "$group": {
            "_id": "$user.id",
            "my_fitness_data": {"$push": "$text"}
        }
    },
    {
        "$project": {
            "UserId": "$_id",
            "TweetsCount": {"$size": "$my_fitness_data"},
            "Tweets": "$my_fitness_data"
        }
    }
]
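This pipeline replaces the per-id find loop entirely: one $group collects every user's tweets, and $project derives the count. In plain Python the grouping logic looks like this (a sketch of what $group/$push/$project compute, not a pymongo call):

```python
def group_tweets(docs):
    """Mirror of the $group + $project stages above."""
    groups = {}
    for doc in docs:
        # $group by user.id, $push each text value
        groups.setdefault(doc["user"]["id"], []).append(doc["text"])
    # $project: rename _id, add $size of the pushed array
    return [
        {"UserId": uid, "TweetsCount": len(tweets), "Tweets": tweets}
        for uid, tweets in groups.items()
    ]

docs = [
    {"user": {"id": 1}, "text": "run"},
    {"user": {"id": 1}, "text": "lift"},
    {"user": {"id": 2}, "text": "swim"},
]
out = {g["UserId"]: g for g in group_tweets(docs)}
assert out[1]["TweetsCount"] == 2
assert out[2]["Tweets"] == ["swim"]
```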
