PyMongo - selecting sub-documents from a collection by regex - Python

Let's take, for example, the following documents:
{
    '_id': '0',
    'docs': [
        {'value': 'abcd', 'key': '1234'},
        {'value': 'abef', 'key': '5678'}
    ]
}
{
    '_id': '1',
    'docs': [
        {'value': 'wxyz', 'key': '1234'},
        {'value': 'abgh', 'key': '5678'}
    ]
}
I want to select only the sub-documents under the 'docs' list whose 'value' contains the string 'ab'. What I'm expecting to get is the following documents:
{
    '_id': '0',
    'docs': [
        {'value': 'abcd', 'key': '1234'},
        {'value': 'abef', 'key': '5678'}
    ]
}
{
    '_id': '1',
    'docs': [
        {'value': 'abgh', 'key': '5678'}
    ]
}
Thus, filtering out the unmatched sub-documents.

You need an aggregation pipeline that matches each subdocument separately, then re-joins the matching subdocuments into arrays:
from pprint import pprint
from bson import Regex

regex = Regex(r'ab')
pprint(list(col.aggregate([{
    '$unwind': '$docs'
}, {
    '$match': {'docs.value': regex}
}, {
    '$group': {
        '_id': '$_id',
        'docs': {'$push': '$docs'}
    }
}])))
I assume "col" is a variable pointing to your PyMongo Collection object. This outputs:
[{u'_id': u'1',
u'docs': [{u'key': u'5678', u'value': u'abgh'}]},
{u'_id': u'0',
u'docs': [{u'key': u'1234', u'value': u'abcd'},
{u'key': u'5678', u'value': u'abef'}]}]
The "r" prefix to the string makes it a Python "raw" string to avoid any trouble with regex code. In this case the regex is just "ab" so the "r" prefix isn't necessary, but it's good practice now so you don't make a mistake in the future.

Related

Return specific values from nested Elastic Object

I have to preface this with the fact that I'm working with the Elasticsearch module, which returns an elastic_transport.ObjectApiResponse. My problem is that I need to select specific keys from this json/dictionary-looking log. The indices come from different sources and thus contain different key/value pairs. The values I need to select are ip, port, username, rule_name, severity, and risk_score. The problem is that they have different key names and each dictionary is vastly different from the others, but they all contain those values. After that, I'll throw them into a Pandas dataframe and create a table with those values. Should a value be missing, I'll fill it with a '-'.
So my question is: how can I iterate over these nested objects that are neither ordered nor standardized? Any help is appreciated. Below is a sample of the data.
{
'took': 11,
'timed_out': False,
'_shards': {
'total': 17,
'successful': 17,
'skipped': 0, 'failed': 0
},
'hits': {
'total': {'value': 58, 'relation': 'eq'},
'max_score': 0.0,
'hits': [
{
'_index': '.siem-signals-default-000017',
'_type': '_doc',
'_id': 'abcd1234',
'_score': 0.0,
'_source': {
'#timestamp': '2023-02-09T15:24:09.368Z',
'process': {'pid': 668, 'executable': 'C:\\Windows\\System32\\lsass.exe', 'name': 'lsass.exe'},
'ecs': {'version': '1.10.0'},
'winlog': {
'computer_name': 'SRVDC1',
'User': 'John.Smith',
'api': 'wineventlog',
'keywords': ['Audit Failure']
},
'source': {'domain': 'SRVDC1', 'ip': '10.17.13.118', 'port': 42548}},
'rule': {'id': 'aaabbb', 'actions': [], 'interval': '2m', 'name': 'More Than 3 Failed Login Attempts Within 1 Hour '}
},
{
'_index': '.siem-signals-default-000017',
'_type': '_doc',
'_id': 'abc123',
'_score': 0.0,
'_source': {
'#timestamp': '2023-02-09T15:24:09.369Z',
'log': {'level': 'information'},
'user': {
'id': 'S-1-0-0',
'name': 'John.Smith',
'domain': 'ACME'
},
'related': {
'port': '42554',
'ip': '10.17.13.118'
},
'logon': {'id': '0x3e7', 'type': 'Network', 'failure': {'sub_status': 'User logon with misspelled or bad password'}},
'meta': {'risk_score': 46, 'severity': 'medium'}}},
{
'_index': '.siem-signals-default-000017',
'_type': '_doc',
'_id': 'zzzzz',
'_score': 0.0,
'_source': {
'source': {
'port': '56489',
'ip': '10.18.13.101'
},
'observer': {
'type': 'firewall',
'name': 'pfSense',
'serial_number': 'xoxo',
'product': 'Supermicro',
'ip': '10.7.3.253'
},
'process': {'name': 'filterlog', 'pid': '45005'},
'tags': ['firewall', 'IP_Private_Source', 'IP_Private_Destination'],
'destination': {'service': 'microsoft-ds', 'port': '445', 'ip': '10.250.0.64'},
'log': {'risk_score': 73, 'severity': 'high'},
'rule':{'name': 'Logstash Firewall (NetBIOS and SMB Vulnerability)'}}}]}}
Expected Output
The sample below is possible only when the logs have the same standard structure.
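One possible approach, sketched under the assumption that "response" holds the es.search(...) result and that the candidate dotted paths below (read off the sample hits) roughly cover your sources: map each desired column to a list of paths and fall back to '-' when none of them resolve:
import pandas as pd

# Candidate dotted paths (taken from the sample hits above) for each output
# column, tried in order. These paths are assumptions and will likely need
# extending for other index sources.
CANDIDATES = {
    'ip': ['source.ip', 'related.ip'],
    'port': ['source.port', 'related.port'],
    'username': ['winlog.User', 'user.name'],
    'rule_name': ['rule.name'],
    'severity': ['meta.severity', 'log.severity'],
    'risk_score': ['meta.risk_score', 'log.risk_score'],
}

def get_path(d, path):
    # Follow a dotted path through nested dicts; return None if any key is missing.
    for key in path.split('.'):
        if not isinstance(d, dict) or key not in d:
            return None
        d = d[key]
    return d

def extract(source):
    # Build one table row, falling back to '-' when no candidate path resolves.
    row = {}
    for column, paths in CANDIDATES.items():
        value = '-'
        for path in paths:
            found = get_path(source, path)
            if found is not None:
                value = found
                break
        row[column] = value
    return row

# 'response' is assumed to be the dict-like result of es.search(...)
rows = [extract(hit['_source']) for hit in response['hits']['hits']]
df = pd.DataFrame(rows)
Since every index nests these values differently, the CANDIDATES table is the part you would have to keep extending as new sources appear.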

Python: Convert specific value in dictionary from string to int

I want to convert a specific value in a dictionary from a string to an int.
This is the code that builds the dictionaries:
for r in repo_tags:
    name, tag = r.split(":")
    image = {"repo": name, "tag": tag, "id": img_id[1][:12], "full_name": r}
    print(image["repo"])
    images.append(image)
    print(image)
This is the output:
{'repo': 'python', 'tag': 'latest', 'id': '254d4a8a8f31', 'full_name': 'python:latest'}
nginx
{'repo': 'nginx', 'tag': 'latest', 'id': '35c43ace9216', 'full_name': 'nginx:latest'}
hello-world
{'repo': 'hello-world', 'tag': 'latest', 'id': 'bf756fb1ae65', 'full_name': 'hello-world:latest'}
The id is a string, but I want it to be an int so I can make it a primary key in the database.
Is it possible to change only this value to an int?
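One hedged option: the ids in your output look like 12-character hexadecimal Docker image ID prefixes, so if that assumption holds, int() with base 16 will convert them:
# Assuming every id is a hex string, convert it in place after building the dicts.
for image in images:
    image["id"] = int(image["id"], 16)
Keep in mind that 12 hex digits is up to 48 bits, so the database column needs to be a big-integer type, and truncated ids are not guaranteed to stay unique.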

Partial word search not working in elasticsearch (elasticsearch-py) using mongo-connector

Currently I've indexed my MongoDB collection into Elasticsearch running in a Docker container. I am able to query a document by its exact name, but Elasticsearch is unable to match the query if it is only part of the name. Here is an example:
>>> es = Elasticsearch('0.0.0.0:9200')
>>> es.indices.get_alias('*')
{'mongodb_meta': {'aliases': {}}, 'sigstore': {'aliases': {}}, 'my-index': {'aliases': {}}}
>>> x = es.search(index='sigstore', body={'query': {'match': {'name': 'KEGG_GLYCOLYSIS_GLUCONEOGENESIS'}}})
>>> x
{'took': 198, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 1, 'relation': 'eq'}, 'max_score': 8.062855, 'hits': [{'_index': 'sigstore', '_type': 'sigs', '_id': '5d66c23228144432307c2c49', '_score': 8.062855, '_source': {'id': 1, 'name': 'KEGG_GLYCOLYSIS_GLUCONEOGENESIS', 'description': 'http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_GLYCOLYSIS_GLUCONEOGENESIS', 'members': ['ACSS2', 'GCK', 'PGK2', 'PGK1', 'PDHB', 'PDHA1', 'PDHA2', 'PGM2', 'TPI1', 'ACSS1', 'FBP1', 'ADH1B', 'HK2', 'ADH1C', 'HK1', 'HK3', 'ADH4', 'PGAM2', 'ADH5', 'PGAM1', 'ADH1A', 'ALDOC', 'ALDH7A1', 'LDHAL6B', 'PKLR', 'LDHAL6A', 'ENO1', 'PKM2', 'PFKP', 'BPGM', 'PCK2', 'PCK1', 'ALDH1B1', 'ALDH2', 'ALDH3A1', 'AKR1A1', 'FBP2', 'PFKM', 'PFKL', 'LDHC', 'GAPDH', 'ENO3', 'ENO2', 'PGAM4', 'ADH7', 'ADH6', 'LDHB', 'ALDH1A3', 'ALDH3B1', 'ALDH3B2', 'ALDH9A1', 'ALDH3A2', 'GALM', 'ALDOA', 'DLD', 'DLAT', 'ALDOB', 'G6PC2', 'LDHA', 'G6PC', 'PGM1', 'GPI'], 'user': 'naji.taleb#medimmune.com', 'type': 'public', 'level1': 'test', 'level2': 'test2', 'time': '08-28-2019 14:03:29 EDT-0400', 'source': 'File', 'mapped': [''], 'notmapped': [''], 'organism': 'human'}}]}}
When using the full name of the document, Elasticsearch is able to query it successfully. But this is what happens when I attempt to search part of the name or use a wildcard:
>>> x = es.search(index='sigstore', body={'query': {'match': {'name': 'KEGG'}}})
>>> x
{'took': 17, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 0, 'relation': 'eq'}, 'max_score': None, 'hits': []}}
>>> x = es.search(index='sigstore', body={'query': {'match': {'name': 'KEGG*'}}})
>>> x
{'took': 3, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 0, 'relation': 'eq'}, 'max_score': None, 'hits': []}}
In addition to the default index settings I also tried making an index that allows the use of the nGram tokenizer to enable me to do partial search, but that also didn't work. These are the settings I used for that index:
{
"sigstore": {
"aliases": {},
"mappings": {},
"settings": {
"index": {
"max_ngram_diff": "99",
"number_of_shards": "1",
"provided_name": "sigstore",
"creation_date": "1579200699718",
"analysis": {
"filter": {
"substring": {
"type": "nGram",
"min_gram": "1",
"max_gram": "20"
}
},
"analyzer": {
"str_index_analyzer": {
"filter": [
"lowercase",
"substring"
],
"tokenizer": "keyword"
},
"str_search_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
},
"number_of_replicas": "1",
"uuid": "3nf915U6T9maLdSiJozvGA",
"version": {
"created": "7050199"
}
}
}
}
}
and this is the corresponding Python command that created it:
es.indices.create(index='sigstore', body={
    "mappings": {},
    "settings": {
        'index': {
            "analysis": {
                "analyzer": {
                    "str_search_analyzer": {"tokenizer": "keyword", "filter": ["lowercase"]},
                    "str_index_analyzer": {"tokenizer": "keyword", "filter": ["lowercase", "substring"]}
                },
                "filter": {
                    "substring": {"type": "nGram", "min_gram": 1, "max_gram": 20}
                }
            }
        },
        'max_ngram_diff': '99'
    }
})
I use mongo-connector as the pipeline between my mongoDB collection and elasticsearch. This is the command I use to start it:
mongo-connector -m mongodb://username:password#xx.xx.xxx.xx:27017/?authSource=admin -t elasticsearch:9200 -d elastic2_doc_manager -n sigstore.sigs
I'm unsure as to why my elasticsearch is unable to get a partial match, and wondering if there is some setting I'm missing or if there's some crucial mistake I've made somewhere. Thanks for reading.
Versions
MongoDB 4.0.10
elasticsearch==7.1.0
elastic2-doc-manager[elastic5]
Updated after checking your gist:
You need to apply the mapping to your field as described in the docs; cf. the first link I shared in the comments.
You need to do it after applying the settings on your index; in your gist that's line 11.
Something like:
PUT /your_index/_mapping
{
    "properties": {
        "name": {
            "type": "keyword",
            "ignore_above": 256,
            "fields": {
                "str_search_analyzer": {
                    "type": "text",
                    "analyzer": "str_search_analyzer"
                }
            }
        }
    }
}
After you set the mapping, you need to apply it to your existing documents using update_by_query:
https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-update-by-query.html
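From Python that could look something like this (a minimal sketch, assuming the index is named 'sigstore' as above and "es" is the client from the question):
# Reindex the existing documents in place so they pick up the new sub-field
# mapping; match_all touches every document in the index.
es.update_by_query(index='sigstore', body={'query': {'match_all': {}}})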
So you can keep using a term search on your field name, since it will be indexed with a keyword mapping (exact match), and use the sub-field name.str_search_analyzer to match part of the word.
your_keyword = 'KEGG_GLYCOLYSIS_GLUCONEOGENESIS'  # or 'KEGG*'
x = es.search(index='sigstore', body={'query': {'bool': {'should': [
    {'term': {'name': your_keyword}},
    {'match': {'name.str_search_analyzer': your_keyword}}
]}}})

How to separate dictionaries that are not comma-separated?

I have a text file which contains dictionaries that are not comma-separated, in the following format:
{} {} {}
Example
{
'header': 'sdf',
'meta': {
'searchId': {
'searchId': 1234
},
'timestamp': 1234,
'attachments': [
'ABC'
],
'xmlData': {
'release': None,
'version': None,
}
}
{
'header': 'sdf',
'timestamp': 14,
'attachments': [
'ABC'
],
'xmlData': {
'release': None,
'version': None,
}
}
These dictionaries may contain nested dictionaries. I want to read this file and turn it into a list of dictionaries i.e. in the format [{},{},{}]
Example
[{
'header': 'sdf',
'meta': {
'searchId': {
'searchId': 1234
},
'timestamp': 1234,
'attachments': [
'ABC'
],
'xmlData': {
'release': None,
'version': None,
}
},
{
'header': 'sdf',
'timestamp': 14,
'attachments': [
'ABC'
],
'xmlData': {
'release': None,
'version': None,
}
}]
Can someone suggest a way to do this?
Thanks
My two other answers assume that the dicts in your data file are on separate lines so that each dict can be parsed as valid Python statements. If that is not the case, however, you can use lib2to3 and modify the Python grammar in Grammar.txt so that a simple statement (denoted by simple_stmt in the grammar file) does not have to end with a newline character:
from lib2to3 import fixer_base, refactor, pygram, pgen2
from io import StringIO
from functools import partialmethod

with open(pygram._GRAMMAR_FILE) as file:
    grammar = StringIO(''.join(
        line.replace(' NEWLINE', '') if line.startswith('simple_stmt:') else line
        for line in file
    ))
pgen2.pgen.ParserGenerator.__init__ = partialmethod(pgen2.pgen.ParserGenerator.__init__, stream=grammar)
pygram.python_grammar = pgen2.pgen.generate_grammar()
and look for atom nodes at the top level (whose parent node does not have a parent) instead:
class ScrapeAtoms(fixer_base.BaseFix):
    PATTERN = "atom"

    def __init__(self, *args):
        super().__init__(*args)
        self.nodes = []

    def transform(self, node, results):
        if not node.parent.parent:
            self.nodes.append(node)
        return node

class Refactor(refactor.RefactoringTool):
    def get_fixers(self):
        self.scraper = ScrapeAtoms(None, None)
        return [self.scraper], []

    def get_result(self):
        return '[%s]\n' % ',\n'.join(str(node).rstrip() for node in self.scraper.nodes)
so that:
s = '''{'a': {1: 2}}{'b': 2}{
'c': 3
}{'d': 4}'''
refactor = Refactor(None)
refactor.refactor_string(s, '')
print(refactor.get_result())
outputs:
[{'a': {1: 2}},
{'b': 2},
{
'c': 3
},
{'d': 4}]
Demo: https://repl.it/#blhsing/CompleteStarchyFactorial
As others have stated in the comments, this isn't JSON data. You merely have multiple string representations of dicts pretty-printed to the file in succession, and you're also missing a closing bracket in the first one.
So I suggest looping through the file and building a string for each dict; then you can use ast.literal_eval to parse the string into a dict. Something like this:
from ast import literal_eval

current = ''
data = []
with open('filename.txt') as f:
    for line in f:
        if line.startswith('{'):
            current = line
        elif line.startswith('}'):
            data.append(literal_eval(current + line))
        else:
            current += line
Results in data (using pprint):
[{'header': 'sdf',
'meta': {'attachments': ['ABC'],
'searchId': {'searchId': 1234},
'timestamp': 1234,
'xmlData': {'release': None, 'version': None}}},
{'attachments': ['ABC'],
'header': 'sdf',
'timestamp': 14,
'xmlData': {'release': None, 'version': None}}]
After this you should overwrite the data, and never use this format for serialization again. This is why there are libraries for this.
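For example, a minimal sketch of that last step (the 'filename.json' name is just an assumption), writing the parsed dicts back out as real JSON so they can be reloaded later with a single json.load call:
import json

# 'data' is the list of dicts built by the loop above.
with open('filename.json', 'w') as f:
    json.dump(data, f, indent=2)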
Since each dict in the file is a valid Python statement, a more robust solution would be to use lib2to3 to parse the file as Python code and extract the statement nodes, so that you can enclose them in square brackets, separated by commas:
from lib2to3 import fixer_base, refactor

class ScrapeStatements(fixer_base.BaseFix):
    PATTERN = "simple_stmt"

    def __init__(self, *args):
        super().__init__(*args)
        self.nodes = []

    def transform(self, node, results):
        self.nodes.append(node)
        return node

class Refactor(refactor.RefactoringTool):
    def get_fixers(self):
        self.scraper = ScrapeStatements(None, None)
        return [self.scraper], []

    def get_result(self):
        return '[%s]\n' % ',\n'.join(str(node).rstrip() for node in self.scraper.nodes)
so that:
s = '''{
'header': 'sdf',
'meta': {
'searchId': {
'searchId': 1234
},
'timestamp': 1234,
'attachments': [
'ABC'
],
'xmlData': {
'release': None,
'version': None,
}
}
}
{
'header': 'sdf',
'timestamp': 14,
'attachments': [
'ABC'
],
'xmlData': {
'release': None,
'version': None,
}
}
'''
refactor = Refactor(None)
refactor.refactor_string(s, '')
print(refactor.get_result())
outputs:
[{
'header': 'sdf',
'meta': {
'searchId': {
'searchId': 1234
},
'timestamp': 1234,
'attachments': [
'ABC'
],
'xmlData': {
'release': None,
'version': None,
}
}
},
{
'header': 'sdf',
'timestamp': 14,
'attachments': [
'ABC'
],
'xmlData': {
'release': None,
'version': None,
}
}]
If all the dicts in the file are on separate lines as they are in your sample input, then each dict by itself is a valid Python statement, so you can use ast.parse to parse the file into an abstract syntax tree, look for the expression nodes (of type Expr), and build a new Expression node with a List node to hold all the aforementioned Expr nodes. The new Expression node can then be compiled and evaluated as an actual Python list of dicts, so that given your sample input data in variable s:
import ast
tree = ast.parse(s)
exprs = [node.value for node in ast.walk(tree) if isinstance(node, ast.Expr)]
new = ast.Expression(body=ast.List(elts=exprs, ctx=ast.Load()))
ast.fix_missing_locations(new)
lst = eval(compile(new, '', 'eval'))
lst would become:
[{'header': 'sdf',
'meta': {'searchId': {'searchId': 1234},
'timestamp': 1234,
'attachments': ['ABC'],
'xmlData': {'release': None, 'version': None}}},
{'header': 'sdf',
'timestamp': 14,
'attachments': ['ABC'],
'xmlData': {'release': None, 'version': None}}]
Demo: https://repl.it/#blhsing/FocusedCylindricalTypes

Loop through multidimensional JSON (Python)

I have a JSON with the following structure:
{
'count': 93,
'apps' : [
{
'last_modified_at': '2016-10-21T12:20:26Z',
'frequency_caps': [],
'ios': {
'enabled': True,
'push_enabled': False,
'app_store_id': 'bbb',
'connection_type': 'certificate',
'sdk_api_secret': '--'
},
'organization_id': '--',
'name': '---',
'app_id': 27,
'control_group_percentage': 0,
'created_by': {
'user_id': 'abc',
'user_name': 'def'
},
'created_at': '2016-09-28T11:41:24Z',
'web': {}
}, {
'last_modified_at': '2016-10-12T08:58:57Z',
'frequency_caps': [],
'ios': {
'enabled': True,
'push_enabled': True,
'app_store_id': '386304604',
'connection_type': 'certificate',
'sdk_api_secret': '---',
'push_expiry': '2018-01-14T08:24:09Z'
},
'organization_id': '---',
'name': '---',
'app_id': 87,
'control_group_percentage': 0,
'created_by': {
'user_id': '----',
'user_name': '---'
},
'created_at': '2016-10-12T08:58:57Z',
'web': {}
}
]
}
It's a JSON object with two key-value pairs. The second pair's value is a list of more JSON objects.
For me that is too much information, and I want a JSON like this:
{
'apps' : [
{
'name': 'Appname',
'app_id' : 1234,
'organization_id' : 'Blablabla'
},
{
'name': 'Appname2',
'app_id' : 5678,
'organization_id' : 'Some other Organization'
}
]
}
I want a JSON that only contains one key ("apps") and its value, which would be a list of JSON objects that each have only three key-value pairs.
I am thankful for any advice.
Thank you for your help!
@bishakh-ghosh I don't think you need to use the input JSON as a string. It can be used directly as a dictionary (thus avoiding ast).
A more concise way:
# your original json
input_ = { 'count': 93, ... }
And here are the steps:
Define what keys you want to keep
slice_keys = ['name', 'app_id', 'organization_id']
Define the new dictionary as a slice on the slice_keys
dict(apps=[{key:value for key,value in d.items() if key in slice_keys} for d in input_['apps']])
And that's it.
That should yield the JSON formatted as you want, e.g.:
{
'apps':
[
{'app_id': 27, 'name': '---', 'organization_id': '--'},
{'app_id': 87, 'name': '---', 'organization_id': '---'}
]
}
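Putting those steps together, a minimal self-contained sketch (input_ is assumed to hold the already-parsed dictionary from the question; json.dumps here is only for printing):
import json

slice_keys = ['name', 'app_id', 'organization_id']

# Keep only the whitelisted keys from each app entry.
result = {
    'apps': [
        {key: value for key, value in app.items() if key in slice_keys}
        for app in input_['apps']
    ]
}

print(json.dumps(result, indent=2))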
This might be what you are looking for:
import ast
import json
json_str = """{
'count': 93,
'apps' : [
{
'last_modified_at': '2016-10-21T12:20:26Z',
'frequency_caps': [],
'ios': {
'enabled': True,
'push_enabled': False,
'app_store_id': 'bbb',
'connection_type': 'certificate',
'sdk_api_secret': '--'
},
'organization_id': '--',
'name': '---',
'app_id': 27,
'control_group_percentage': 0,
'created_by': {
'user_id': 'abc',
'user_name': 'def'
},
'created_at': '2016-09-28T11:41:24Z',
'web': {}
}, {
'last_modified_at': '2016-10-12T08:58:57Z',
'frequency_caps': [],
'ios': {
'enabled': True,
'push_enabled': True,
'app_store_id': '386304604',
'connection_type': 'certificate',
'sdk_api_secret': '---',
'push_expiry': '2018-01-14T08:24:09Z'
},
'organization_id': '---',
'name': '---',
'app_id': 87,
'control_group_percentage': 0,
'created_by': {
'user_id': '----',
'user_name': '---'
},
'created_at': '2016-10-12T08:58:57Z',
'web': {}
}
]
}"""
json_dict = ast.literal_eval(json_str)
new_dict = {}
app_list = []
for appdata in json_dict['apps']:
    appdata_dict = {}
    appdata_dict['name'] = appdata['name']
    appdata_dict['app_id'] = appdata['app_id']
    appdata_dict['organization_id'] = appdata['organization_id']
    app_list.append(appdata_dict)
new_dict['apps'] = app_list
new_json_str = json.dumps(new_dict)
print(new_json_str) # This is your resulting json string
