Capture property names - Python
I'm scanning a ".twig" (PHP template) file and trying to capture the property names of an object.
The twig file contains lines (strings) like these:
{{ product.id }}
{{ product.parentProductId }}
{{ product.countdown.startDate | date('Y/m/d H:i:s') }}
{{ product.countdown.endDate | date('Y/m/d H:i:s') }}
{{ product.countdown.expireDate | date('Y/m/d H:i:s') }}
{{ product.primaryImage.originalUrl }}
{{ product.image(1).originalUrl }}
{{ product.image(1).thumbUrl }}
{{ product.priceWithTax(preferences.default_currency) | money }}
The things I want to capture are:
.id
.parentProductId
.countdown
.startDate
.endDate
.expireDate
.primaryImage
.originalUrl
.image(1)
.originalUrl
.thumbUrl
.priceWithTax(preferences.default_currency)
Basically, I'm trying to figure out the properties of the product object. I have the following pattern, but it doesn't capture chained properties. For example,
"{{.+?product(\.[a-zA-Z]+(?:\(.+?\)){,1})++.+?}}" captures only .startDate, but it should capture both .countdown and .startDate seperately. Is this not possible, or am I missing something?
(See the pattern on regex101.)
I could capture it as a whole (.countdown.startDate) with "{{.+?product((?:\.[a-zA-Z]+(?:\(.+?\)){,1})+).+?}}" and later check/split it, but this sounds troublesome.
If you want to handle it with a single regex, you might want to use the PyPI regex module:
import regex
s = """{{ product.id }}
{{ product.parentProductId }}
{{ product.countdown.startDate | date('Y/m/d H:i:s') }}
{{ product.primaryImage.originalUrl }}
{{ product.image(1).originalUrl }}
{{ product.priceWithTax(preferences.default_currency) | money }}"""
rx = r'{{[^{}]*product(\.[a-zA-Z]+(?:\([^()]+\))?)*[^{}]*}}'
l = [m.captures(1) for m in regex.finditer(rx, s)]
print([item for sublist in l for item in sublist])
# => ['.id', '.parentProductId', '.countdown', '.startDate', '.primaryImage', '.originalUrl', '.image(1)', '.originalUrl', '.priceWithTax(preferences.default_currency)']
See the Python demo
The {{[^{}]*product(\.[a-zA-Z]+(?:\([^()]+\))?)*[^{}]*}} regex will match
{{ - a {{ substring
[^{}]* - 0+ chars other than { and }
product - the substring product
(\.[a-zA-Z]+(?:\([^()]+\))?)* - Capturing group 1: zero or more sequences of
\. - a dot
[a-zA-Z]+ - 1+ ASCII letters
(?:\([^()]+\))? - an optional sequence of (, 1+ chars other than ( and ) and then )
[^{}]* - 0+ chars other than { and }
}} - a }} substring.
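If it helps to see the difference in isolation, here is a minimal sketch (using one of the sample lines from the question) contrasting the regex module's captures(), which remembers every repetition of a quantified group, with re's group(), which only keeps the last repetition:

import re
import regex

s = "{{ product.countdown.startDate | date('Y/m/d H:i:s') }}"
p = r'{{[^{}]*product(\.[a-zA-Z]+(?:\([^()]+\))?)*[^{}]*}}'

# regex keeps every repetition of capturing group 1...
print(regex.search(p, s).captures(1))  # => ['.countdown', '.startDate']
# ...while re only keeps the last repetition
print(re.search(p, s).group(1))        # => '.startDate'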
If you are limited to re, you will need to capture all the properties into one capturing group (wrap this (\.[a-zA-Z]+(?:\([^()]+\))?)* with (...)) and then run a regex-based post-processing step to split by the . characters that are not inside parentheses:
import re

# s is the same sample string as in the first snippet
rx = r'{{[^{}]*product((?:\.[a-zA-Z]+(?:\([^()]+\))?)*)[^{}]*}}'
l = re.findall(rx, s)
res = []
for m in l:
    res.extend([".{}".format(n) for n in filter(None, re.split(r'\.(?![^()]*\))', m))])
print(res)
# => ['.id', '.parentProductId', '.countdown', '.startDate', '.primaryImage', '.originalUrl', '.image(1)', '.originalUrl', '.priceWithTax(preferences.default_currency)']
See this Python demo
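To see why the lookahead in that re.split call matters, here is the split on its own; a plain split on every dot would also break .priceWithTax(preferences.default_currency) apart at the dot inside the parentheses:

import re

print(re.split(r'\.(?![^()]*\))', '.countdown.startDate'))
# => ['', 'countdown', 'startDate']
print(re.split(r'\.(?![^()]*\))', '.priceWithTax(preferences.default_currency)'))
# => ['', 'priceWithTax(preferences.default_currency)']

The leading empty strings are what the filter(None, ...) in the snippet above removes.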
Try this one; it captures everything in your requirement:
^{{ product(\..*?[(][^\d\/]+[)]).*?}}|^{{ product(\..*?)(\..*?)?(?= )
Demo and explanation at regex101.
I've decided to stick with re (instead of regex, as suggested by Victor), and this is what I ended up with:
import re, json

file = open("test.twig", "r", encoding="utf-8")
content = file.read()
file.close()

patterns = {
    "template": r"{{[^{}]*product((?:\.[a-zA-Z]+(?:\([^()]+\))?)*)[^{}]*}}",
    "prop":     r"^[^\.]+$",                  # .id
    "subprop":  r"^[^\.()]+(\.[^\.]+)+$",     # .countdown.startDate
    "itemprop": r"^[^\.]+\(\d+\)\.[^\.]+$",   # .image(1).originalUrl
    "method":   r"^[^\.]+\(.+\)$",            # .priceWithTax(preferences.default_currency)
}

temp_re = re.compile(patterns["template"])
matches = temp_re.findall(content)

product = {}
for match in matches:
    match = match[1:]  # drop the leading dot
    if re.match(patterns["prop"], match):
        product[match] = match
    elif re.match(patterns["subprop"], match):
        match = match.split(".")
        if match[0] not in product:
            product[match[0]] = []
        if match[1] not in product[match[0]]:
            product[match[0]].append(match[1])
    elif re.match(patterns["itemprop"], match):
        match = match.split(".")
        array = re.sub(r"\(\d+\)", "(i)", match[0])  # raw string avoids an invalid-escape warning
        if array not in product:
            product[array] = []
        if match[1] not in product[array]:
            product[array].append(match[1])
    elif re.match(patterns["method"], match):
        product[match] = match

props = json.dumps(product, indent=4)
print(props)
Example output:
{
    "id": "id",
    "parentProductId": "parentProductId",
    "countdown": [
        "startDate",
        "endDate",
        "expireDate"
    ],
    "primaryImage": [
        "originalUrl"
    ],
    "image(i)": [
        "originalUrl",
        "thumbUrl"
    ],
    "priceWithTax(preferences.default_currency)": "priceWithTax(preferences.default_currency)"
}
Related
Removing different string patterns from Pandas column
I have the following column which consists of email subject headers:

Subject
EXT || Transport enquiry
EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry
EXT || FW: Model - Jan
SV: [EXTERNAL] Calculations

What I want to achieve is:

Subject
Transport enquiry
0001 || Copy of enquiry
Model - Jan
Calculations

and for this I am using the below code, which only takes into account the first regular expression that I am passing and ignores the rest:

def clean_subject_prelim(text):
    text = re.sub(r'^EXT \|\| $', '', text)
    text = re.sub(r'EXT \|\| RE: EXTERNAL: RE:', '', text)
    text = re.sub(r'EXT \|\| FW:', '', text)
    text = re.sub(r'^SV: \[EXTERNAL]$', '', text)
    return text

df['subject_clean'] = df['Subject'].apply(lambda x: clean_subject_prelim(x))

Why is this not working? What am I missing here?
You can use

pattern = r"""(?mx)                               # MULTILINE and VERBOSE modes on
^                                                 # start of line
(?:                                               # non-capturing group start
    EXT\s*\|\|\s*(?:RE:\s*EXTERNAL:\s*RE:|FW:)?   # EXT || or EXT || RE: EXTERNAL: RE: or EXT || FW:
  |                                               # or
    SV:\s*\[EXTERNAL]                             # SV: [EXTERNAL]
)                                                 # non-capturing group end
\s*                                               # zero or more whitespaces
"""
df['subject_clean'] = df['Subject'].str.replace(pattern, '', regex=True)

See the regex demo. Since re.X ((?x)) is used, you should escape literal spaces and # chars, or just use \s* or \s+ to match zero or more / one or more whitespaces.
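For reference, here is a minimal runnable sketch of this answer against the four sample subjects from the question (the compact pattern is a one-line equivalent of the verbose one above):

import pandas as pd

df = pd.DataFrame({"Subject": [
    "EXT || Transport enquiry",
    "EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry",
    "EXT || FW: Model - Jan",
    "SV: [EXTERNAL] Calculations",
]})
pattern = r"^(?:EXT\s*\|\|\s*(?:RE:\s*EXTERNAL:\s*RE:|FW:)?|SV:\s*\[EXTERNAL])\s*"
df["subject_clean"] = df["Subject"].str.replace(pattern, "", regex=True)
print(df["subject_clean"].tolist())
# => ['Transport enquiry', '0001 || Copy of enquiry', 'Model - Jan', 'Calculations']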
Get rid of the $ sign in the first expression and reorder some of the regex substitutions. Like this:

import pandas as pd
import re

def clean_subject_prelim(text):
    text = re.sub(r'EXT \|\| RE: EXTERNAL: RE:', '', text)
    text = re.sub(r'EXT \|\| FW:', '', text)
    text = re.sub(r'^EXT \|\|', '', text)
    text = re.sub(r'^SV: \[EXTERNAL]', '', text)
    return text

data = {"Subject": [
    "EXT || Transport enquiry",
    "EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry",
    "EXT || FW: Model - Jan",
    "SV: [EXTERNAL] Calculations"]}
df = pd.DataFrame(data)
df['subject_clean'] = df['Subject'].apply(lambda x: clean_subject_prelim(x))
Get data with boundaries using regex
I would like to get the labels and data from this function using regex. I have tried using this:

pattern = re.compile(r'/blabels: ],/b')
print(pattern)
result = soup.find("script", text=pattern)

But I get None using boundaries. This is the soup:

<script>
Chart.defaults.LineWithLine = Chart.defaults.line;
new Chart(document.getElementById("chart-overall-mentions"), {
    type: 'LineWithLine',
    data: {
        labels: [1637005508000,1637006108000,1637006708000,1637007308000,1637007908000,1637008508000,1637009108000,1637009708000,1637010308000,1637010908000,1637011508000,1637012108000,1637012708000,1637013308000,1637013908000,1637014508000,1637015108000,1637015708000,1637016308000,1637016908000,1637017508000,1637018108000,1637018708000,1637019308000,1637019908000,1637020508000,1637021108000,1637021708000,1637022308000,1637022908000,1637023508000,1637024108000,1637024708000,1637025308000,1637025908000,1637026508000,1637027108000,1637027708000,1637028308000,1637028908000,1637029508000,1637030108000,1637030708000,1637031308000,1637031908000,1637032508000,1637033108000,1637033708000,1637034308000,1637034908000,1637035508000,1637036108000,1637036708000,1637037308000,1637037908000,1637038508000,1637039108000,1637039708000,1637040308000,1637040908000,1637041508000,1637042108000,1637042708000,1637043308000,1637043908000,1637044508000,1637045108000,1637045708000,1637046308000,1637046908000,1637047508000,1637048108000,1637048708000,1637049308000,1637049908000,1637050508000,1637051108000,1637051708000,1637052308000,1637052908000,1637053508000,1637054108000,1637054708000,1637055308000,1637055908000,1637056508000,1637057108000,1637057708000,1637058308000,1637058908000,1637059508000,1637060108000,1637060708000,1637061308000,1637061908000,1637062508000,1637063108000,1637063708000,1637064308000,1637064908000,1637065508000,1637066108000,1637066708000,1637067308000,1637067908000,1637068508000,1637069108000,1637069708000,1637070308000,1637070908000,1637071508000,1637072108000,1637072708000,1637073308000,1637073908000,1637074508000,1637075108000,1637075708000,1637076308000,1637076908000,1637077508000,1637078108000,1637078708000,1637079308000,1637079908000,1637080508000,1637081108000,1637081708000,1637082308000,1637082908000,1637083508000,1637084108000,1637084708000,1637085308000,1637085908000,1637086508000,1637087108000,1637087708000,1637088308000,1637088908000,1637089508000,1637090108000,1637090708000,1637091308000],
        datasets: [{
            data: [13,10,20,26,21,23,24,21,24,35,25,31,42,24,24,20,23,22,17,23,30,11,16,20,9,10,22,10,19,16,15,16,17,19,10,20,24,14,19,15,13,9,13,17,20,16,15,21,18,25,15,14,16,15,16,14,14,21,10,9,5,9,9,13,14,9,9,18,15,11,11,6,12,14,19,17,16,11,20,14,21,13,15,12,14,10,20,16,25,17,17,11,23,11,13,11,19,10,17,19,10,20,22,19,19,27,28,18,20,22,18,16,17,18,14,17,19,18,20,11,13,20,15,15,18,14,13,14,14,11,19,14,14,11,11,15,26,12,15,15,11,4,3,6],
            pointRadius: 0,
            borderColor: "#666",
            fill: true,
            yAxisID:'yAxis1'
        },
        ]
    },
    options: {
        tooltips: {
            mode: 'index',
            bodyFontSize: 18,
            intersect: false,
            titleFontSize: 16,
        },
        .
        .
        .
</script>
Here is how you can do that: get the script tag (you can use a regex, too, if that is the only way to obtain that node), then run a regex search against the node text/string to get your final output. You can use

import re

# Get the script node with text matching your pattern
item = soup.find("script", text=re.compile(r'\blabels:\s*\['))

match = re.search(r'\blabels:\s*\[([^][]*)]', item.string)
if match:
    labels = map(int, match.group(1).split(','))

Output:

>>> print(list(labels))
[1637005508000, 1637006108000, 1637006708000, 1637007308000, 1637007908000, 1637008508000, 1637009108000, 1637009708000, 1637010308000, 1637010908000, 1637011508000, 1637012108000, 1637012708000, 1637013308000, 1637013908000, 1637014508000, 1637015108000, 1637015708000, 1637016308000, 1637016908000, 1637017508000, 1637018108000, 1637018708000, 1637019308000, 1637019908000, 1637020508000, 1637021108000, 1637021708000, 1637022308000, 1637022908000, 1637023508000, 1637024108000, 1637024708000, 1637025308000, 1637025908000, 1637026508000, 1637027108000, 1637027708000, 1637028308000, 1637028908000, 1637029508000, 1637030108000, 1637030708000, 1637031308000, 1637031908000, 1637032508000, 1637033108000, 1637033708000, 1637034308000, 1637034908000, 1637035508000, 1637036108000, 1637036708000, 1637037308000, 1637037908000, 1637038508000, 1637039108000, 1637039708000, 1637040308000, 1637040908000, 1637041508000, 1637042108000, 1637042708000, 1637043308000, 1637043908000, 1637044508000, 1637045108000, 1637045708000, 1637046308000, 1637046908000, 1637047508000, 1637048108000, 1637048708000, 1637049308000, 1637049908000, 1637050508000, 1637051108000, 1637051708000, 1637052308000, 1637052908000, 1637053508000, 1637054108000, 1637054708000, 1637055308000, 1637055908000, 1637056508000, 1637057108000, 1637057708000, 1637058308000, 1637058908000, 1637059508000, 1637060108000, 1637060708000, 1637061308000, 1637061908000, 1637062508000, 1637063108000, 1637063708000, 1637064308000, 1637064908000, 1637065508000, 1637066108000, 1637066708000, 1637067308000, 1637067908000, 1637068508000, 1637069108000, 1637069708000, 1637070308000, 1637070908000, 1637071508000, 1637072108000, 1637072708000, 1637073308000, 1637073908000, 1637074508000, 1637075108000, 1637075708000, 1637076308000, 1637076908000, 1637077508000, 1637078108000, 1637078708000, 1637079308000, 1637079908000, 1637080508000, 1637081108000, 1637081708000, 1637082308000, 1637082908000, 1637083508000, 1637084108000, 1637084708000, 1637085308000, 1637085908000, 1637086508000, 1637087108000, 1637087708000, 1637088308000, 1637088908000, 1637089508000, 1637090108000, 1637090708000, 1637091308000]

Once the node is obtained, the \blabels:\s*\[([^][]*)] regex searches for

\b - a word boundary
labels: - a fixed string
\s* - zero or more whitespaces
\[ - a [ char
([^][]*) - Group 1 (this is what you will need to split with a comma later): any zero or more chars other than ] and [
] - a ] char.
How can I leave whitespaces in nestedExpr pyparsing
I have a wiki text like this:

data = """
{{hello}}
{{hello world}}
{{hello much { }}
{{a {{b}}}}
{{a td { } {{inner}} }}
"""

and I want to extract the macros inside it. A macro is a text enclosed between {{ and }}, so I tried using nestedExpr:

from pyparsing import *
import pprint

def getMacroCandidates(txt):
    candidates = []

    def nestedExpr(opener="(", closer=")", content=None, ignoreExpr=quotedString.copy()):
        if opener == closer:
            raise ValueError("opening and closing strings cannot be the same")
        if content is None:
            if isinstance(opener, str) and isinstance(closer, str):
                if ignoreExpr is not None:
                    content = (Combine(OneOrMore(~ignoreExpr
                                                 + ~Literal(opener)
                                                 + ~Literal(closer)
                                                 + CharsNotIn(ParserElement.DEFAULT_WHITE_CHARS, exact=1))
                               ).setParseAction(lambda t: t[0]))
        ret = Forward()
        ret <<= Group(opener + ZeroOrMore(ignoreExpr | ret | content) + closer)
        ret.setName('nested %s%s expression' % (opener, closer))
        return ret

    # use {}'s for nested lists
    macro = nestedExpr("{{", "}}")
    # print(((nestedItems + stringEnd).parseString(data).asList()))
    for toks, preloc, nextloc in macro.scanString(data):
        print(toks)
    return candidates

getMacroCandidates(data)

This gives me the tokens, but with the whitespace removed:

[['{{', 'hello', '}}']]
[['{{', 'hello', 'world', '}}']]
[['{{', 'hello', 'much', '{', '}}']]
[['{{', 'a', ['{{', 'b', '}}'], '}}']]
[['{{', 'a', 'td', '{', '}', ['{{', 'inner', '}}'], '}}']]
You can use plain string replacements (data is the wiki text from the question):

import shlex

data1 = data.replace("{{", '"')
data2 = data1.replace("}}", '"')
data3 = data2.replace("}", " ")
data4 = data3.replace("{", " ")
data5 = ' '.join(data4.split())
print(shlex.split(data5.replace("\n", " ")))

Output

This returns you all the tokens, with braces removed and extra line spaces collapsed:

['hello', 'hello world', 'hello much ', 'a b', 'a td inner ']

PS: This can be made into a single expression; multiple expressions are used here for readability.
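If the goal is specifically to keep each macro's internal whitespace while staying with pyparsing, one possible sketch (assuming a reasonably recent pyparsing, where originalTextFor is available) wraps the nested expression so that the raw matched slice of the input is returned instead of the tokenized pieces:

from pyparsing import nestedExpr, originalTextFor

data = """
{{hello}}
{{hello world}}
{{hello much { }}
{{a {{b}}}}
{{a td { } {{inner}} }}
"""

# originalTextFor yields the exact text span that matched, whitespace included
macro = originalTextFor(nestedExpr("{{", "}}"))
for toks, start, end in macro.scanString(data):
    print(repr(toks[0]))
# => '{{hello}}', '{{hello world}}', '{{hello much { }}', '{{a {{b}}}}', '{{a td { } {{inner}} }}'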
ParseError parsing empty valued xml doc
There are lots of articles pertaining to parsing XML with ElementTree. I've gone through a bunch of them and read through the docs, but I can't come up with a solution that works for me. I'm trying to supplement info that's created by another app in an nfo file, but I need to preserve the conventions in the file. Here is an example of how the file is laid out:

<title>
    <name>Test Name</name>
    <alt name />
    <file local="C:\file\file1.doc" type="word">http://filestore/file1.doc</file>
    <file local="" type="excel">http://filestore/file2.xls</file>
    <file local="C:\file\file3.xls" type="excel" />
    <file local="" type="ppt" />
</title>

Note: elements are not closed properly, e.g. <alt name /> should be <alt name></alt name>.

This is what I'm running:

import xml.etree.ElementTree as ET
tree = ET.parse('file.nfo')
root = tree.getroot()

The error I'm getting is:

xml.etree.ElementTree.ParseError: not well-formed (invalid token):

I've tried:

myparser = ET.XMLParser(encoding='UTF-8')
tree = ET.parse('file.nfo', myparser)

Also tried xmlparser and opening with codecs, but I'm pretty sure it's the formatting. I'm guessing the immediate issue is a non-escaped >, but I suspect ET needs proper opening/closing tags. I'm sure I could open this file and go through it with regex, but I was hoping to use ElementTree. The end goal is to have the details from the nfo as a dictionary that looks like:

dict = {'title': [{'name': 'Test Name',
                   'alt name': '',
                   'file': [{'local': 'C:\file\file1.doc', 'type': 'word', 'url': 'http://filestore/file1.doc'},
                            {'local': '', 'type': 'excel', 'url': 'http://filestore/file2.xls'},
                            {'local': 'C:\file\file3.xls', 'type': 'excel', 'url': ''},
                            {'local': '', 'type': 'ppt', 'url': ''}]
                   }]}

I'm sure there is a better (more Pythonic) way to do this, but I'm pretty new to Python. Any help would be appreciated.

EDIT: I'm also trying to avoid using 3rd-party libraries if possible.
So I ended up creating a custom parser of sorts; it's not ideal, but it works. It was suggested to me that lxml and html.parser may parse malformed XML better, but I just went with this. I'm also still very interested in any feedback, whether on this or on using any other method.

import re

def merge_dicts(*dict_args):
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

def make_dict(str_arg, op):
    result = {}
    result = dict(s.split(op) for s in str_arg.split(","))
    return result

'''
Samples
lst = r' <name>Test Name</name>'
lst = r' <alt name />'
lst = r' <file local="C:\file\file1.doc" type="word">http://filestore/file1.doc</file>'
lst = r' <file local="" type="excel">http://filestore/file2.xls</file>'
lst = r' <file local="C:\file\file3.xls" type="excel" />'
lst = r' <file local="" type="ppt" />'
'''

def match_pattern(file_str):
    # <description>desc blah</description>
    pattern1 = r'''(?x)
        ^
        \s*                                       # cut leading whitespace
        (?P<whole_thing>
        < (?P<tag_open> (\w+?|\w*\s\w+?)+) \b     # word boundary, so we can
        >                                         # skip attributes
        (?P<tag_body> .+? )                       # insides
        </ (?P<tag_close> (\w+?|\w*\s\w+?)+) >    # closing tag, nothing interesting
        )
        $'''

    # <alt name />
    pattern2 = r'''(?x)
        ^
        \s*
        (?P<whole_thing>
        < (?P<tag_open> (\w+?|\w*\s\w+?)+) \b
        \s/>
        )
        $'''

    # <file local="C:\file\file1.doc" type="word">http://filestore/file1.doc</file>
    pattern3 = r'''(?x)
        ^
        \s*
        (?P<whole_thing>
        < (?P<tag_open> (\w+?|\w*\s\w+!=?)+) \b
        \s (?P<tag_attrib1> (\w*\=.*?))           # 1st attribute
        \s (?P<tag_attrib2> (\w*\=.*))            # 2nd attribute
        .*? >
        (?P<tag_body> .+? )
        </ (?P<tag_close> (\w+?|\w*\s\w+?)+) >
        )
        $'''

    # <file local="" type="ppt" />
    pattern4 = r'''(?x)
        ^
        \s*
        (?P<whole_thing>
        < (?P<tag_open> (\w+?|\w*\s\w+!=?)+) \b
        \s (?P<tag_attrib1> (\w*\=.*?))           # 1st attribute
        \s (?P<tag_attrib2> (\w*\=.*))            # 2nd attribute
        \s/>
        )
        $'''

    pat_str = 'pattern'
    pat_val = 1
    return_dict = {}
    while pat_val <= 4:
        pattern = pat_str + str(pat_val)
        matchObj = re.match(eval(pattern), file_str, re.L | re.M)
        if matchObj:
            # for k, v in matchObj.groupdict().items():
            #     print('matchObj.group({!r}) == {!r}'.format(k, v))
            if pat_val == 1:
                body = matchObj.group('tag_body')
                return_dict = {matchObj.group('tag_open'): body}
            elif pat_val == 2:
                return_dict = {matchObj.group('tag_open'): ''}
            elif pat_val == 3:
                attr1 = make_dict(matchObj.group('tag_attrib1'), '=')
                attr2 = make_dict(matchObj.group('tag_attrib2'), '=')
                body = {'url': matchObj.group('tag_body')}
                attrib = merge_dicts(attr1, attr2, body)
                return_dict = {matchObj.group('tag_open'): attrib}
            elif pat_val == 4:
                attr1 = make_dict(matchObj.group('tag_attrib1'), '=')
                attr2 = make_dict(matchObj.group('tag_attrib2'), '=')
                body = {'url': ''}
                attrib = merge_dicts(attr1, attr2, body)
                return_dict = {matchObj.group('tag_open'): attrib}
            return return_dict
        else:
            pat_val = pat_val + 1
            if pat_val > 4:
                print("No match!!")

# print(match_pattern(lst))

def in_file(file):
    result = {}
    with open(file, "r") as file:
        data = file.read().splitlines()
    for d in data:
        if data.index(d) == 0 or data.index(d) == len(data) - 1:
            if data.index(d) == 0:
                print(re.sub('<|/|>', '', d))
        elif d:
            lst = []
            dct = {}
            if 'file' in match_pattern(d).keys():
                for i in match_pattern(d).items():
                    if 'file' in result.keys():
                        lst = result['file']
                        lst.append(i[1])
                        dct = {i[0]: lst}
                        result = merge_dicts(result, dct)
                        # print(result['file'])
                    else:
                        dct = {i[0]: [i[1]]}
                        result = merge_dicts(result, dct)
            else:
                result = merge_dicts(result, match_pattern(d))
                print('else', match_pattern(d))
    return result

print(in_file('C:\\test.nfo'))

NOTE: I dropped the topmost dictionary from the original post.
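For completeness, the lxml route mentioned above (ruled out in this case as a third-party dependency) would look roughly like the sketch below. recover=True asks libxml2 to keep parsing past errors; whether it salvages every quirk here, such as the space in <alt name />, would need testing against the real file:

from lxml import etree

# recover=True makes the parser tolerate malformed markup instead of raising
parser = etree.XMLParser(recover=True)
tree = etree.parse('file.nfo', parser)
for elem in tree.getroot().iter():
    print(elem.tag, elem.attrib, elem.text)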
Python regex sub confusion
There are four keywords: title, blog, tags, state. Excess keyword occurrences are being removed from their respective matches. Example:

blog: blog state title tags and

returns

state title tags and

instead of

blog state title tags and

The sub function should be matching .+ after it sees blog:, so I don't know why it treats blog as an exception to .+

Regex:

re.sub(r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s).+(\n|$))', matcher, a)

Code:

def n15():
    import re
    a = """blog: blog: fooblog
state: private
title: this is atitle bun
and text"""
    kwargs = {}

    def matcher(string):
        v = string.group(1).replace(string.group(2), '').replace(string.group(3), '').replace(string.group(4), '').replace(string.group(5), '')
        if string.group(3) == 'title':
            kwargs['title'] = v
        elif string.group(3) == 'blog':
            kwargs['blog_url'] = v
        elif string.group(3) == 'tags':
            kwargs['comma_separated_tags'] = v
        elif string.group(3) == 'state':
            kwargs['post_state'] = v
        return ''

    a = re.sub(r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s).+(\n|$))', matcher, a)
    a = a.replace('\n', '<br />')
    a = a.replace('\r', '')
    a = a.replace('"', r'\"')
    a = '<p>' + a + '</p>'
    kwargs['body'] = a
    print kwargs

Output:

{'body': '<p>and text</p>', 'post_state': 'private', 'blog_url': 'foo', 'title': 'this is a bun'}

Edit: Desired output:

{'body': '<p>and text</p>', 'post_state': 'private', 'blog_url': 'fooblog', 'title': 'this is atitle bun'}
replace(string.group(3), '') is replacing all occurrences of 'blog' with ''. Rather than trying to replace all the other parts of the matched string, which will be hard to get right, I suggest capturing the string you actually want in the original match:

r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s)(.+)(\n|$))'

which has () around the .+ to capture that part of the string; then use v = match.group(5) at the start of matcher.
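A minimal runnable sketch of that suggestion (simplified to a single blog: keyword so the doubled-keyword wrinkle from the question doesn't muddy the output; the dictionary of key names mirrors the original matcher):

import re

a = """blog: fooblog
state: private
title: this is atitle bun
and text"""
kwargs = {}

def matcher(m):
    # group(5) is the captured value; no replace() chain needed
    names = {'title': 'title', 'blog': 'blog_url',
             'tags': 'comma_separated_tags', 'state': 'post_state'}
    kwargs[names[m.group(3)]] = m.group(5)
    return ''

a = re.sub(r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s)(.+)(\n|$))', matcher, a)
print(kwargs)
# => {'blog_url': 'fooblog', 'post_state': 'private', 'title': 'this is atitle bun'}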