I'm trying to automatically parse an existing BIND configuration, consisting of multiple zone definitions like this one:
zone "domain.com" {
type slave;
file "sec/domain.com";
masters {
11.22.33.44;
55.66.77.88;
};
allow-transfer {
"acl1";
"acl2";
};
};
Note that the number of elements in masters and in allow-transfer may differ. I tried splitting this with re.split() and failed horribly because of the nested curly braces.
My goal is a dictionary for each of these entries.
Thanks in advance for any help!
This should do the trick, where 'st' is a string of all your zone definitions:
import re

zone_def = re.split('zone', st)
big_dict = {}
for zone in zone_def:
    if len(zone) > 0:
        zone_name = re.search(r'(".*?")', zone)
        # brace-delimited blocks such as masters and allow-transfer
        sub_dicts = re.finditer(r'([\w]+) ({.*?})', zone, re.DOTALL)
        big_dict[zone_name.group(1)] = {}
        for sub_dict in sub_dicts:
            big_dict[zone_name.group(1)][sub_dict.group(1)] = sub_dict.group(2).replace(' ', '')
        # simple one-line settings such as type and file
        sub_types = re.finditer(r'([\w]+) (.*?);', zone)
        for sub_type in sub_types:
            big_dict[zone_name.group(1)][sub_type.group(1)] = sub_type.group(2)
big_dict will then contain a dictionary of zone definitions. Each zone definition has the domain as its key, and every key/value inside a zone definition is a string.
This is the output for the above example:
{'"domain.com"': {'transfer': '{\n"acl1";\n"acl2";\n}', 'masters': '{\n11.22.33.44;\n55.66.77.88;\n}', 'type': 'slave', 'file': '"sec/domain.com"'}}
And this is the output if you were to have a second identical zone, with key "sssss.com".
{'"sssss.com"': {'transfer': '{\n"acl1";\n"acl2";\n}', 'masters': '{\n11.22.33.44;\n55.66.77.88;\n}', 'type': 'slave', 'file': '"sec/domain.com"'},'"domain.com"': {'transfer': '{\n"acl1";\n"acl2";\n}', 'masters': '{\n11.22.33.44;\n55.66.77.88;\n}', 'type': 'slave', 'file': '"sec/domain.com"'}}
You will have to do some further stripping to make it more readable.
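For example, a minimal cleanup pass could turn each brace-block string into a list of its entries (a sketch, assuming the big_dict built above):

# turn each brace-delimited block string into a list of its entries
for zone, settings in big_dict.items():
    for key, value in settings.items():
        if value.startswith('{'):
            settings[key] = [v.strip().strip('"')
                             for v in value.strip('{}').split(';')
                             if v.strip()]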
Another way is to (install and) use the regex module instead of the re module. The problem is that the re module is unable to deal with an arbitrary level of nested brackets:
#!/usr/bin/python
import regex

data = '''zone "domain.com" {
    type slave;
    file "sec/domain.com";
    masters {
        11.22.33.44; { toto { pouet } glups };
        55.66.77.88;
    };
    allow-transfer {
        "acl1";
        "acl2";
    };
}; '''
pattern = r'''(?V1xi)
(?:
    \G(?<!^)
  |
    zone \s (?<zone> "[^"]+" ) \s* {
) \s*
(?<key> \S+ ) \s+
(?<value> (?: ({ (?> [^{}]++ | (?4) )* }) | "[^"]+" | \w+ ) ; )
'''

matches = regex.finditer(pattern, data)

for m in matches:
    if m.group("zone"):
        print("\n" + m.group("zone"))
    print(m.group("key") + "\t" + m.group("value"))
You can find more information about this module by following this link: https://pypi.python.org/pypi/regex
I want to search through an Elasticsearch index for domains that contain a given second-level domain (sld), but the query returns nothing.
Here is what I've done so far:
sld = "smth"
query = client.search(
index = "x",
body = {
"query": {
"regexp": {
"site_domain.keyword": fr"*\.{sld}\.*"
}
}
}
)
EDIT:
I think the problem is with the regex I wrote
Any help would be appreciated.
TL;DR:
GET /so_regex_url/_search
{
  "query": {
    "regexp": {
      "site_domain": ".*api\\.[a-z]+\\.[a-z]+"
    }
  }
}
This regex will match api.google.com but won't match google.com.
You should watch out for reserved characters such as .
They require a proper escape sequence.
To understand
First, let's talk about the pattern you are looking for.
You want to match every URL that has a given subdomain.
1. Check that the subdomain string exists in the URL
Something like .*<subdomain>.* will work. .* means any character, in any quantity.
2. Check it is a subdomain
A subdomain in a URL looks like <subdomain>.<domain>.<top level domain>.
You need to make sure that your subdomain is followed by a . between the domain and the top-level domain.
Something like .*<subdomain>.*\.[a-z]+\.[a-z]+ will work: [a-z]+ means at least one character from a to z, and because . has a special meaning you need to escape it with \.
This will match https://<subdomain>.google.com, but won't match https://<subdomain>.com.
/!\ This is a naive implementation.
https://<subdomain>.1234.com won't match, as digits such as 1 and 2 do not exist in [a-z].
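As a quick sanity check of that pattern logic, here is a local sketch using Python's re module (an assumption on my part: the Elasticsearch regexp dialect behaves the same for this simple case, apart from being anchored by default):

import re

# the naive pattern from above, tested locally with re.search
pattern = re.compile(r".*api\.[a-z]+\.[a-z]+")
print(bool(pattern.search("https://api.google.com")))  # True
print(bool(pattern.search("https://api.com")))         # False
print(bool(pattern.search("https://api.1234.com")))    # False: digits are not in [a-z]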
3. Create Elastic DSL
I am performing the request on the text field, not the keyword; this keeps my example leaner but it works the same way.
GET /so_regex_url/_search
{
  "query": {
    "regexp": {
      "site_domain": ".*api\\.[a-z]+\\.[a-z]+"
    }
  }
}
You may have noticed the \\. It is there because the payload travels inside a JSON document, which adds its own layer of escaping, so the backslash itself must be escaped.
4. Python implementation
I imagine it should be
sld = "smth"
query = client.search(
index = "x",
body = {
"query": {
"regexp": {
"site_domain.keyword": `.*{sld}\\.[a-z]+\\.[a-z]+`
}
}
}
)
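A quick way to check the escaping (a sketch; the json.dumps call here just mirrors the serialization the client performs when it sends the body):

sld = "smth"
rendered = f".*{sld}\\.[a-z]+\\.[a-z]+"
print(rendered)              # .*smth\.[a-z]+\.[a-z]+  (what the regexp engine sees)

import json
print(json.dumps(rendered))  # ".*smth\\.[a-z]+\\.[a-z]+"  (what travels in the JSON payload)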
I'm trying to create a re pattern in Python to extract this pattern of text:
contentId: '2301ae56-3b9c-4653-963b-2ad84d06ba08'
contentId: 'a887526b-ff19-4409-91ff-e1679e418922'
The content ID is 36 characters long and is a mix of lowercase letters and numbers, with dashes at positions 8, 13, 18, and 23 (zero-indexed).
Any help with this would be much appreciated as I just can't seem to get the results right now.
r1 = re.findall(r'^[a-zA-Z0-9~##$^*()_+=[\]{}|\\,.?: -]*{36}$',f.read())
print(r1)
Below is the file I'm trying to pull from
Object.defineProperty(e, '__esModule', { value: !0 }), e.default = void 0;
var t = r(d[0])(r(d[1])), n = r(d[0])(r(d[2])), o = r(d[0])(r(d[3])), c = r(d[0])(r(d[4])), l = r(d[0])(r(d[5])), u = function (t) {
    return [
        {
            contentId: '2301ae56-3b9c-4653-963b-2ad84d06ba08',
            prettyId: 'super',
            style: { height: 0.5 * t }
        },
        {
            contentId: 'a887526b-ff19-4409-91ff-e1679e418922',
            prettyId: 'zap',
            style: { height: t }
        }
    ];
},
Is there a typo in the regex in your question? *{36} after the bracket ] that closes the character group causes an error: multiple repeat. Did you mean r'^[a-zA-Z0-9~##$^*()_+=[\]{}|\\,.?: -]{36}$'?
Fixing that, you get no results because ^ anchors the match to the start of the line, and $ to the end of the line, so you'd only get results if this pattern was alone on a single line.
Removing these anchors, we get lots of matches because it matches any string of those characters that is 36-long:
r1 = re.findall(r'[a-zA-Z0-9~##$^*()_+=[\]{}|\\,.?: -]{36}',t)
r1: ['var t = r(d[0])(r(d[1])), n = r(d[0]',
')(r(d[2])), o = r(d[0])(r(d[3])), c ',
'= r(d[0])(r(d[4])), l = r(d[0])(r(d[',
'2301ae56-3b9c-4653-963b-2ad84d06ba08',
' style: { height: 0.5',
'a887526b-ff19-4409-91ff-e1679e418922',
' style: { height: t }']
To only match your ids, only look for alphanumeric characters or dashes.
r1 = re.findall(r'[a-zA-Z0-9\-]{36}',t)
r1: ['2301ae56-3b9c-4653-963b-2ad84d06ba08',
'a887526b-ff19-4409-91ff-e1679e418922']
To make it even more specific, you could specify the positions of the dashes:
r1 = re.findall(r'[a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{12}', t, re.IGNORECASE)
r1: ['2301ae56-3b9c-4653-963b-2ad84d06ba08',
'a887526b-ff19-4409-91ff-e1679e418922']
Specifying the re.IGNORECASE flag removes the need to look for both upper- and lower-case characters.
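If the surrounding text were noisier, you could even anchor on the contentId: label itself (a hypothetical variant of my own, not needed for this particular file):

# capture only UUIDs that directly follow a "contentId:" label, so that
# unrelated 36-character strings elsewhere in the file are ignored
r1 = re.findall(r"contentId: '([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})'", t)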
Note:
You should read the file into a variable and reuse that variable if you're going to use its contents more than once, since f.read() returns nothing after the first .read() unless you f.seek(0) first.
To avoid creating a new file on disk with those contents, I just defined
t = """Object.defineProperty(e, '__esModule', { value: !0 }), e.default = void 0;
var t = r(d[0])(r(d[1])), n = r(d[0])(r(d[2])), o = r(d[0])(r(d[3])), c = r(d[0])(r(d[4])), l = r(d[0])(r(d[5])), u = function (t) {
return [
{
contentId: '2301ae56-3b9c-4653-963b-2ad84d06ba08',
prettyId: 'super',
style: { height: 0.5 * t }
},
{
contentId: 'a887526b-ff19-4409-91ff-e1679e418922',
prettyId: 'zap',
style: { height: t }
}
];
},"""
and used t in place of f.read() from your question.
With json.dumps(some_dict, indent=4, sort_keys=True) in my code:
I get something like this:
{
    "a": {
        "x": 1,
        "y": 2
    },
    "b": {
        "z": 3,
        "w": 4
    }
}
But I want something like this:
{
    "a":
    {
        "x": 1,
        "y": 2
    },
    "b":
    {
        "z": 3,
        "w": 4
    }
}
How can I force each opening curly brace to appear at the beginning of a new separate line?
Do I have to write my own JSON serializer, or is there a special argument that I can use when calling json.dumps?
You can use a regular expression replacement on the result.
better_json = re.sub(r'^((\s*)".*?":)\s*([\[{])', r'\1\n\2\3', json, flags=re.MULTILINE)
The first capture group matches everything up to the : after the property name, the second capture group matches the whitespace before the property name, and the third capture group captures the { or [ before the object or array. The whitespace is then copied after the newline, so that the indentation will match properly.
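For instance, a quick check of the substitution on a small dict (a sketch):

import json
import re

j = json.dumps({"a": {"x": 1, "y": 2}}, indent=4, sort_keys=True)
better_json = re.sub(r'^((\s*)".*?":)\s*([\[{])', r'\1\n\2\3', j, flags=re.MULTILINE)
print(better_json)
# {
#     "a":
#     {
#         "x": 1,
#         "y": 2
#     }
# }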
Building on Barmar's excellent answer, here's a more complete demo showing how you can convert and customize your JSON in Python:
import json
import re

# JSONifies dict and saves it to file
def save(data, filename):
    with open(filename, "w") as write_file:
        write_file.write(jsonify(data))

# Converts Python dict to a JSON string. Indents with tabs and puts opening
# braces on their own line.
def jsonify(data):
    default_json = json.dumps(data, indent = '\t')
    better_json = re.sub(
        r'^((\s*)".*?":)\s*([\[{])',
        r'\1\n\2\3',
        default_json,
        flags=re.MULTILINE
    )
    return better_json

# Sample data for demo
data = {
    "president":
    {
        "name": "Zaphod Beeblebrox",
        "species": "Betelgeusian"
    }
}
filename = 'test.json'

# Demo
print("Here's your pretty JSON:")
print(jsonify(data))
print()
print('Saving to file:', filename)
save(data, filename)
I have a non-standard "JSON" file to parse. Each item is semicolon-separated instead of comma-separated. I can't simply replace ; with , because some values might contain ;, e.g. "hello; world". How can I parse this into the same structure that JSON would normally be parsed into?
{
    "client" : "someone";
    "server" : ["s1"; "s2"];
    "timestamp" : 1000000;
    "content" : "hello; world";
    ...
}
Use the Python tokenize module to transform the text stream into one with commas instead of semicolons. The Python tokenizer is happy to handle JSON input, semicolons included: it presents strings as whole tokens, and 'raw' semicolons appear in the stream as single token.OP tokens for you to replace:
import tokenize
import json

corrected = []

with open('semi.json', 'r') as semi:
    for token in tokenize.generate_tokens(semi.readline):
        if token[0] == tokenize.OP and token[1] == ';':
            # a bare semicolon: emit a comma instead
            corrected.append(',')
        else:
            corrected.append(token[1])

data = json.loads(''.join(corrected))
This assumes that the format becomes valid JSON once you've replaced the semicolons with commas; e.g. no trailing commas before a closing ] or } allowed, although you could even track the last comma added and remove it again if the next non-newline token is a closing brace.
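Here is a sketch of that trailing-comma idea (my own addition; it leans on the fact that a comma directly before a closing brace or bracket is invalid JSON anyway, so dropping one there is always safe):

import tokenize
import json

corrected = []

with open('semi.json', 'r') as semi:
    for token in tokenize.generate_tokens(semi.readline):
        if token[0] == tokenize.OP and token[1] == ';':
            corrected.append(',')
        else:
            if token[1] in ('}', ']'):
                # drop any comma (and stray newline tokens) emitted just
                # before this closing brace or bracket
                while corrected and corrected[-1] in (',', '\n'):
                    corrected.pop()
            corrected.append(token[1])

data = json.loads(''.join(corrected))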
Demo:
>>> import tokenize
>>> import json
>>> with open('semi.json', 'w') as out:
...     _ = out.write('''\
... {
...     "client" : "someone";
...     "server" : ["s1"; "s2"];
...     "timestamp" : 1000000;
...     "content" : "hello; world"
... }
... ''')
...
>>> corrected = []
>>> with open('semi.json', 'r') as semi:
...     for token in tokenize.generate_tokens(semi.readline):
...         if token[0] == tokenize.OP and token[1] == ';':
...             corrected.append(',')
...         else:
...             corrected.append(token[1])
...
>>> print(''.join(corrected))
{
"client":"someone",
"server":["s1","s2"],
"timestamp":1000000,
"content":"hello; world"
}
>>> json.loads(''.join(corrected))
{'client': 'someone', 'server': ['s1', 's2'], 'timestamp': 1000000, 'content': 'hello; world'}
Inter-token whitespace was dropped, but could be re-instated by paying attention to the tokenize.NL tokens and the (lineno, start) and (lineno, end) position tuples that are part of each token. Since the whitespace around the tokens doesn't matter to a JSON parser, I've not bothered with this.
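If you do want the whitespace preserved, here is a sketch using tokenize.untokenize (my own addition; it works because the replacement ',' has the same length as ';', so the recorded token positions stay valid):

import io
import tokenize

def fix_semicolons(text):
    # replace ';' OP tokens in place, keeping the full position info so
    # that untokenize() can reproduce the original layout
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(text).readline):
        if tok.type == tokenize.OP and tok.string == ';':
            tok = tok._replace(string=',')
        tokens.append(tok)
    return tokenize.untokenize(tokens)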
You can do some odd things and get it (probably) right.
Because JSON strings cannot contain control characters such as \t, you can replace every ; with \t, (a tab plus a comma); the file will then parse correctly as long as your JSON parser is able to load non-strict JSON (such as Python's).
Afterwards, you only need to dump your data back to JSON, replace those \t, sequences back to ;, and load the result with a normal JSON parser to finally get the correct object.
Some sample code in Python:
data = '''{
    "client" : "someone";
    "server" : ["s1"; "s2"];
    "timestamp" : 1000000;
    "content" : "hello; world"
}'''

import json

# parse leniently, with tab+comma markers in place of the semicolons
dec = json.JSONDecoder(strict=False).decode(data.replace(';', '\t,'))
# dump back to strict JSON; each tab is now escaped as the two characters \t
enc = json.dumps(dec)
# swap the markers back to semicolons and load the final, correct object
out = json.loads(enc.replace('\\t,', ';'))
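A quick check of the round trip (the dict ordering shown assumes Python 3.7+):

print(out)
# {'client': 'someone', 'server': ['s1', 's2'], 'timestamp': 1000000, 'content': 'hello; world'}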
Using a simple character state machine, you can convert this text back to valid JSON. The basic thing we need to handle is to determine the current "state" (whether we are escaping a character, in a string, list, dictionary, etc), and replace ';' by ',' when in a certain state.
I don't know if this is the proper way to write it; there is probably a way to make it shorter, but I don't have enough programming skills to make an optimal version of this.
I tried to comment as much as I could:
def filter_characters(text):
    # we use this dictionary to match opening/closing tokens
    STATES = {
        '"': '"', "'": "'",
        "{": "}", "[": "]"
    }
    # these two variables represent the current state of the parser
    escaping = False
    state = list()

    # we iterate through each character
    for c in text:
        if escaping:
            # if we are currently escaping, no special treatment
            escaping = False
        elif c == "\\":
            # character is a backslash, set the escaping flag for the next character
            escaping = True
        elif state and c == state[-1]:
            # character is the expected closing token, update state
            state.pop()
        elif state and state[-1] in '"\'':
            # we are inside a string, other characters have no special meaning
            pass
        elif c in STATES:
            # character is a known opening token, update state
            state.append(STATES[c])
        elif c == ';' and state and state[-1] in ('}', ']'):
            # this is the delimiter we want to change
            c = ','
        yield c

    assert not state, "unexpected end of file"


def filter_text(text):
    return ''.join(filter_characters(text))
Testing with:
{
    "client" : "someone";
    "server" : ["s1"; "s2"];
    "timestamp" : 1000000;
    "content" : "hello; world";
    ...
}
Returns:
{
    "client" : "someone",
    "server" : ["s1", "s2"],
    "timestamp" : 1000000,
    "content" : "hello; world",
    ...
}
Pyparsing makes it easy to write a string transformer. Write an expression for the string to be changed, and add a parse action (a parse-time callback) to replace the matched text with what you want. If you need to avoid some cases (like quoted strings or comments), then include them in the scanner, but just leave them unchanged. Then, to actually transform the string, call scanner.transformString.
(It wasn't clear from your example whether you might have a ';' after the last element in one of your bracketed lists, so I added a term to suppress these, since a trailing ',' in a bracketed list is also invalid JSON.)
sample = """
{
"client" : "someone";
"server" : ["s1"; "s2"];
"timestamp" : 1000000;
"content" : "hello; world";
}"""
from pyparsing import Literal, replaceWith, Suppress, FollowedBy, quotedString
import json
SEMI = Literal(";")
repl_semi = SEMI.setParseAction(replaceWith(','))
term_semi = Suppress(SEMI + FollowedBy('}'))
qs = quotedString
scanner = (qs | term_semi | repl_semi)
fixed = scanner.transformString(sample)
print(fixed)
print(json.loads(fixed))
prints:
{
    "client" : "someone",
    "server" : ["s1", "s2"],
    "timestamp" : 1000000,
    "content" : "hello; world"}
{'client': 'someone', 'server': ['s1', 's2'], 'timestamp': 1000000, 'content': 'hello; world'}
I am struggling with regular expressions. I'm having problems getting my head wrapped around similar text nested within larger text. Perhaps you can help me unclutter my thinking.
Here is an example test string:
message msgName { stuff { innerStuff } } \n message mn2 { junk }
I want to pull out term (e.g., msgName, mn2) and what follows until the next message, to get a list like this:
msgName
{ stuff { innerStuff } }
mn2
{ junk }
I am having trouble with matching too greedily or too non-greedily: I want to retain the inner brackets but still split apart the higher-level messages.
Here is one program:
import re

text = 'message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }'
messagePattern = re.compile('message (.*?) {(.*)}', re.DOTALL)
messageList = messagePattern.findall(text)
print("messages:\n")
count = 0
for message, msgDef in messageList:
    count = count + 1
    print(count)
    print(message)
    print(msgDef)
It produces:
messages:
1
msgName
stuff { innerStuff } more stuff }
message mn2 { junk
Here is my next attempt, which makes the inner part non-greedy:
import re

text = 'message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }'
messagePattern = re.compile('message (.*?) {(.*?)}', re.DOTALL)
messageList = messagePattern.findall(text)
print("messages:\n")
count = 0
for message, msgDef in messageList:
    count = count + 1
    print(count)
    print(message)
    print(msgDef)
It produces:
messages:
1
msgName
stuff { innerStuff
2
mn2
junk
So, I lose } more stuff }
I've really run into a mental block on this. Could someone point me in the right direction? I'm failing to deal with text in nested brackets. A suggestion for a working regular expression, or a simpler example of dealing with nested, similar text, would be helpful.
If you can use PyPi regex module, you can leverage its subroutine call support:
>>> import regex
>>> reg = regex.compile(r"(\w+)\s*({(?>[^{}]++|(?2))*})")
>>> s = "message msgName { stuff { innerStuff } } \n message mn2 { junk }"
>>> print(reg.findall(s))
[('msgName', '{ stuff { innerStuff } }'), ('mn2', '{ junk }')]
The regex - (\w+)\s*({(?>[^{}]++|(?2))*}) - matches:
(\w+) - Group 1 matching 1 or more alphanumeric / underscore characters
\s* - 0+ whitespace(s)
({(?>[^{}]++|(?2))*}) - Group 2 matching a {, followed by non-{} characters or another balanced {...} (the (?2) subroutine call recurses the whole Group 2 subpattern), zero or more times, and then a closing }.
If there is only one nesting level, re can be used, too, with
(\w+)\s*{[^{}]*(?:{[^{}]*}[^{}]*)*}
(\w+) - Group 1 matching word characters
\s* - 0+ whitespaces
{ - opening brace
[^{}]* - 0+ characters other than { and }
(?:{[^{}]*}[^{}]*)* - 0+ sequences of:
    { - opening brace
    [^{}]* - 0+ characters other than { and }
    } - closing brace
    [^{}]* - 0+ characters other than { and }
} - closing brace
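For instance, a quick check of the one-level pattern with the stdlib re module (a sketch using the sample string from the question):

import re

s = "message msgName { stuff { innerStuff } more stuff } \n message mn2 { junk }"
# one nesting level only, which is enough for this sample
print(re.findall(r'(\w+)\s*({[^{}]*(?:{[^{}]*}[^{}]*)*})', s))
# [('msgName', '{ stuff { innerStuff } more stuff }'), ('mn2', '{ junk }')]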