Get second level domain with regexp in elasticsearch - python

I want to search through an index in my Elasticsearch database for domains containing a given second-level domain (SLD), but the query returns nothing.
Here is what I've done so far:
sld = "smth"
query = client.search(
index = "x",
body = {
"query": {
"regexp": {
"site_domain.keyword": fr"*\.{sld}\.*"
}
}
}
)
EDIT:
I think the problem is with the regex I wrote.
Any help would be appreciated.

TLDR;
GET /so_regex_url/_search
{
  "query": {
    "regexp": {
      "site_domain": ".*api\\.[a-z]+\\.[a-z]+"
    }
  }
}
This regex will match api.google.com but not google.com.
Watch out for reserved characters such as the dot (.): it requires a proper escape sequence.
To understand
First, let's talk about the pattern you are looking for.
You want to match every URL that has a given subdomain.
1. Check the subdomain string exists in the URL
Something like .*<subdomain>.* will work: .* means any character, in any quantity.
2. Check it is a subdomain
A subdomain in a URL looks like <subdomain>.<domain>.<top level domain>.
You need to make sure there is a . between the subdomain, the domain, and the top-level domain.
Something like .*<subdomain>.*\.[a-z]+\.[a-z]+ will work: [a-z]+ means at least one character between a and z, and because . has a special meaning you need to escape it with \.
This will match https://<subdomain>.google.com, but not https://<subdomain>.com.
/!\ This is a naive implementation.
https://<subdomain>.1234.com won't match, as neither 1, 2, ... exists in [a-z].
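As a quick local sanity check (an illustrative sketch, not part of the original answer: the sld value and URLs are made up, and Python's re is only an approximation of Lucene's regexp syntax), a slightly wider character class such as [a-z0-9-]+ also covers digits and hyphens in domain labels:
import re

sld = "api"  # hypothetical subdomain to look for
# [a-z0-9-]+ also accepts digits and hyphens, unlike the naive [a-z]+
pattern = re.compile(rf".*{re.escape(sld)}\.[a-z0-9-]+\.[a-z]+")

for url in ["https://api.google.com", "https://api.1234.com", "https://google.com"]:
    print(url, bool(pattern.fullmatch(url)))
# https://api.google.com True
# https://api.1234.com True
# https://google.com False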
3. Create the Elastic DSL query
I am performing the request on the text field, not the keyword; this keeps my example leaner but it works the same way.
GET /so_regex_url/_search
{
  "query": {
    "regexp": {
      "site_domain": ".*api\\.[a-z]+\\.[a-z]+"
    }
  }
}
You may have noticed the \\: as explained in the thread, the payload travels in a JSON document, so the backslash itself also needs to be escaped.
4. Python implementation
I imagine it should be
sld = "smth"
query = client.search(
index = "x",
body = {
"query": {
"regexp": {
"site_domain.keyword": `.*{sld}\\.[a-z]+\\.[a-z]+`
}
}
}
)
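For completeness, here is a minimal end-to-end sketch (assuming elasticsearch-py, a local node, and an index named x with a site_domain field; all names are illustrative):
from elasticsearch import Elasticsearch

client = Elasticsearch(["http://localhost:9200"])  # assumed host

sld = "smth"
response = client.search(
    index="x",
    body={
        "query": {
            "regexp": {
                # a raw f-string avoids doubling the backslashes by hand
                "site_domain.keyword": rf".*{sld}\.[a-z]+\.[a-z]+"
            }
        }
    }
)

for hit in response["hits"]["hits"]:
    print(hit["_source"]["site_domain"])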

Related

Is there a way to find record in mongo by matching field string with an array of values

I have the below record
{
    "title": "Kim floral jacquard minidress",
    "designer": "Rotate Birger Christensen"
}
How can I find a record in the collection using an array of values? For example, with the array below, the record should be selected because the "title" field contains the value "floral".
['floral', 'dresses']
The query I am using below doesn't work. :(
queryParam = ['floral', 'dresses']

def get_query(queryParam, gender):
    query = {
        "gender": gender
    }
    if len(queryParam) != 0:
        query["title"] = {"$in": queryParam}
    return query

products_query = get_query(queryParam, gender)
products = mongo.db.products.find(products_query)
To add to the previous answer, there's a little bit more to do to get this to work in pymongo. You have to use re.compile() to get the regex search to work:
import re
queryParam = [re.compile('floral'), re.compile('dresses')]
Alternatively you could use this approach which removes the need for the $in operator:
import re
queryParam = [re.compile('floral|dresses')]
And once you've done that you don't even need to use re.compile:
queryParam = 'floral|dress'
...
query = {"title": {"$regex": queryParam}}
Take your pick.
You need to do a regex search along with the $in operator:
db.collectionName.find( { title: { $in: [ /floral/, /dresses/ ] } })
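Putting the pieces together, a minimal pymongo sketch (the connection string and db/collection names are illustrative assumptions):
import re
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection
collection = client["shop"]["products"]            # assumed db/collection names

queryParam = ['floral', 'dresses']

# Compile each term so $in performs a regex (substring) match
# instead of an exact string comparison.
patterns = [re.compile(term) for term in queryParam]

for product in collection.find({"title": {"$in": patterns}}):
    print(product["title"])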

How to remove the first and last portion of a string in Python?

How can I cut from such a string (JSON) everything before and including the first [ and everything after and including the last ] with Python?
{
    "Customers": [
        {
            "cID": "w2-502952",
            "soldToId": "34124"
        },
        ...
        ...
    ],
    "status": {
        "success": true,
        "message": "Customers: 560",
        "ErrorCode": ""
    }
}
I want to have at least only
{
    "cID": "w2-502952",
    "soldToId": "34124"
}
...
...
String manipulation is not the way to do this. You should parse your JSON into Python and extract the relevant data using normal data structure access.
import json

obj = json.loads(data)
relevant_data = obj["Customers"]
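A minimal self-contained sketch using the sample document from the question (the inline string stands in for however the JSON is actually loaded):
import json

data = '''
{
    "Customers": [
        {"cID": "w2-502952", "soldToId": "34124"}
    ],
    "status": {"success": true, "message": "Customers: 560", "ErrorCode": ""}
}
'''

obj = json.loads(data)
for customer in obj["Customers"]:
    print(customer["cID"], customer["soldToId"])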
In addition to Daniel Rosman's answer, if you want all the lists from the JSON:
result = []
obj = json.loads(data)
for value in obj.values():
    if isinstance(value, list):
        result.extend(value)
While I agree that Daniel's answer is the absolute best way to go, if you must use string splitting, you can try .find()
string = #however you are loading this json text into a string
start = string.find('[') + 1
end = string.rfind(']')
customers = string[start:end]
print(customers)
The output will be everything between the first [ and the last ] braces.
If you really want to do this via string manipulation (which I don't recommend), you can do it this way:
start = s.find('[') + 1
finish = s.rfind(']')
inner = s[start:finish]
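For example, on a shortened version of the question's sample (a sketch; s holds the raw JSON text):
s = '{"Customers": [{"cID": "w2-502952", "soldToId": "34124"}], "status": {}}'

start = s.find('[') + 1
finish = s.rfind(']')
print(s[start:finish])  # {"cID": "w2-502952", "soldToId": "34124"}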

Python Proper JSON Format

I need to post data to a REST API. One field, incident_type, needs to be passed in the JSON format below (it must include the square brackets, not just curly braces):
"incident_type_ids": [{
"name": "Phishing - General"
}],
When I try to force this in my code, it doesn't come out quite right. There are usually some extra quote-escapes (e.g. output: "incident_type_ids": "[\\"{ name : Phishing - General }\\"]"), and I realized that was because I was double-encoding the JSON data in the incident_type variable to forcibly add the brackets (in line 6, which has since been commented out):
#incident variables
name = 'Incident Name 2'
description = 'This is the description'
corpID = 'id'
incident_type = '{ name : Phishing - General }'
#incident_type = json.dumps([incident_type])
incident_owner = 'Security Operations Center'
payload = {
    'name': name,
    'discovered_date': '0',
    'owner_id': incident_owner,
    'description': description,
    'exposure_individual_name': corpID,
    'incident_type_ids': incident_type
}
body = json.dumps(payload)
create = s.post(url, data=body, headers=headers, verify=False)
However, since I commented out that line, I can't get incident_type in the format I need (with brackets).
So, my question is: how can I get the incident_type variable into the proper format in the final payload?
Input I manually got to work using the product's interactive REST API:
{
    "name": "Incident Name 2",
    "incident_type_ids": [{
        "name": "Phishing - General"
    }],
    "description": "This is the description",
    "discovered_date": "0",
    "exposure_individual_name": "id",
    "owner_id": "Security Operations Center"
}
I figure my approach is wrong and I'd appreciate any help. I'm new to Python so I'm expecting this is a beginner's mistake.
Thanks for your help.
JSON square brackets are for arrays, which correspond to Python lists. JSON curly braces are for objects, which correspond to Python dictionaries.
So you need to create a list containing a dictionary, then convert that to JSON.
incident_type = [{"name": "Phishing - General"}]
incident_owner = 'Security Operations Center'
payload = {
    'name': name,
    'discovered_date': '0',
    'owner_id': incident_owner,
    'description': description,
    'exposure_individual_name': corpID,
    'incident_type_ids': incident_type
}
body = json.dumps(payload)
It's only slightly coincidental that the Python syntax for this is similar to the JSON syntax.
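A quick check of what json.dumps produces for the nested structure (illustrative, trimmed to the one field in question):
import json

payload = {'incident_type_ids': [{"name": "Phishing - General"}]}
print(json.dumps(payload))
# {"incident_type_ids": [{"name": "Phishing - General"}]}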

Searching letter by letter in Elasticsearch

I am using Elasticsearch with a Python client. I want to query through a list of companies. Say the Company field values are
Gokl
Normn
Nerth
Scenario 1 (using elasticsearch-dsl for Python)
s = Search(using=client, index="index-test") \
    .query("match", Company="N")
So when I put N in the match query I don't get Normn or Nerth. I think it's probably because of word-based tokenization.
Scenario 2 (using elasticsearch-dsl for Python)
s = Search(using=client, index="index-test") \
    .query("match", Company="Normn")
When I enter Normn I get the output correctly. So how can I make the search work when I enter the letter N, as in scenario 1 above?
I think you are looking for a prefix search. I don't know the python syntax but the direct query would look like this:
GET index-test/_search
{
  "query": {
    "prefix": {
      "company": {
        "value": "N"
      }
    }
  }
}
See here for more info.
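In elasticsearch-dsl the Python equivalent should be along these lines (a sketch, assuming the same client and index as in the question):
s = Search(using=client, index="index-test") \
    .query("prefix", Company="N")
response = s.execute()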
If I understand correctly, you need to query companies starting with a specific letter.
In this case you can use this query:
{
  "query": {
    "regexp": {
      "Company": "n.*"
    }
  }
}
Please read about the query types here.
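The elasticsearch-dsl equivalent of this regexp query would presumably be (a sketch, same assumptions as above):
s = Search(using=client, index="index-test") \
    .query("regexp", Company="n.*")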
For this case, you can use the code below:
s = Search(using=client, index="index-test") \
    .query("match_phrase_prefix", Company="N")
You can use a multi-match query for Company and another field like this:
s = Search(using=client, index="index-test") \
    .query("multi_match", query="N", fields=['Company', 'Another_field'], type='phrase_prefix')

aggregate a field in elasticsearch-dsl using python

Can someone tell me how to write Python statements that will aggregate (sum and count) stuff about my documents?
SCRIPT
from datetime import datetime
from elasticsearch_dsl import DocType, String, Date, Integer
from elasticsearch_dsl.connections import connections
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

# Define a default Elasticsearch client
client = connections.create_connection(hosts=['http://blahblahblah:9200'])

s = Search(using=client, index="attendance")
s = s.execute()

for tag in s.aggregations.per_tag.buckets:
    print(tag.key)
OUTPUT
File "/Library/Python/2.7/site-packages/elasticsearch_dsl/utils.py", line 106, in __getattr__
'%r object has no attribute %r' % (self.__class__.__name__, attr_name))
AttributeError: 'Response' object has no attribute 'aggregations'
What is causing this? Is the "aggregations" keyword wrong? Is there some other package I need to import? If a document in the "attendance" index has a field called emailAddress, how would I count which documents have a value for that field?
First of all, I notice now that what I wrote here actually has no aggregations defined. The documentation on how to use this was not very readable for me. Using what I wrote above, I'll expand. I'm changing the index name to make for a nicer example.
from datetime import datetime
from elasticsearch_dsl import DocType, String, Date, Integer
from elasticsearch_dsl.connections import connections
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

# Define a default Elasticsearch client
client = connections.create_connection(hosts=['http://blahblahblah:9200'])

s = Search(using=client, index="airbnb", doc_type="sleep_overs")

# invalid! You haven't defined an aggregation.
#for tag in s.aggregations.per_tag.buckets:
#    print(tag.key)

# Let's make an aggregation.
# 'by_house' is a name you choose, 'terms' is a keyword for the type of aggregator,
# 'field' is also a keyword, and 'house_number' is a field in our ES index.
# Aggregations must be defined before the search is executed.
s.aggs.bucket('by_house', 'terms', field='house_number', size=0)
Above we're creating 1 bucket per house number, so the name of the bucket will be the house number. ElasticSearch (ES) will always give a document count of documents fitting into that bucket. size=0 means to give us all results, since ES has a default setting to return only 10 results (or whatever your dev set it up to do).
# This runs the query.
s = s.execute()

# Let's see what's in our results.
print(s.aggregations.by_house.doc_count)
print(s.hits.total)
print(s.aggregations.by_house.buckets)

for item in s.aggregations.by_house.buckets:
    print(item.doc_count)
My mistake before was thinking an Elastic Search query had aggregations by default. You define them yourself, then execute them. Then your response can be split by the aggregators you mentioned.
The cURL for the above should look like the following.
NOTE: I use SENSE, an ElasticSearch plugin/extension/add-on for Google Chrome. In SENSE you can use // to comment things out.
POST /airbnb/sleep_overs/_search
{
  // the size 0 here actually means to not return any hits, just the aggregation part of the result
  "size": 0,
  "aggs": {
    "by_house": {
      "terms": {
        // the size 0 here means to return all results, not just the default 10 results
        "field": "house_number",
        "size": 0
      }
    }
  }
}
Work-around: someone on the DSL project's GitHub told me to forget translating and just use this method. It's simpler, and you can write the tough stuff in cURL. That's why I call it a work-around.
# Define a default Elasticsearch client
client = connections.create_connection(hosts=['http://blahblahblah:9200'])
s = Search(using=client, index="airbnb", doc_type="sleep_overs")

# See how simple it is: we just paste the cURL body here.
body = {
    "size": 0,
    "aggs": {
        "by_house": {
            "terms": {
                "field": "house_number",
                "size": 0
            }
        }
    }
}

# from_dict creates a new Search object, so re-attach the index and doc_type.
s = Search.from_dict(body)
s = s.index("airbnb")
s = s.doc_type("sleep_overs")
t = s.execute()

for item in t.aggregations.by_house.buckets:
    # item.key will be the house number
    print(item.key, item.doc_count)
Hope this helps. I now design everything in cURL, then use Python statements to peel away at the results to get what I want. This helps for aggregations with multiple levels (sub-aggregations).
I do not have the rep to comment yet, but wanted to make a small fix to Matthew's comment on VISQL's answer regarding from_dict. If you want to maintain the search properties, use update_from_dict rather than from_dict.
According to the docs, from_dict creates a new Search object, but update_from_dict will modify it in place, which is what you want if the Search already has properties such as index, using, etc.
So you would want to declare the query body before the search and then create the search like this:
query_body = {
    "size": 0,
    "aggs": {
        "by_house": {
            "terms": {
                "field": "house_number",
                "size": 0
            }
        }
    }
}

s = Search(using=client, index="airbnb", doc_type="sleep_overs").update_from_dict(query_body)
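From there, execution and bucket access work the same way as in the earlier examples (sketch):
response = s.execute()
for item in response.aggregations.by_house.buckets:
    print(item.key, item.doc_count)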
