Capture property names - python

I'm scanning a ".twig" (PHP template) file and trying to capture the property names of an object.
The twig file contains lines (strings) like these:
{{ product.id }}
{{ product.parentProductId }}
{{ product.countdown.startDate | date('Y/m/d H:i:s') }}
{{ product.countdown.endDate | date('Y/m/d H:i:s') }}
{{ product.countdown.expireDate | date('Y/m/d H:i:s') }}
{{ product.primaryImage.originalUrl }}
{{ product.image(1).originalUrl }}
{{ product.image(1).thumbUrl }}
{{ product.priceWithTax(preferences.default_currency) | money }}
The things I want to capture are:
.id
.parentProductId
.countdown
.startDate
.endDate
.expireDate
.primaryImage
.originalUrl
.image(1)
.originalUrl
.thumbUrl
.priceWithTax(preferences.default_currency)
Basically, I'm trying to figure out the properties of the product object. I have the following pattern, but it doesn't capture chained properties. For example,
"{{.+?product(\.[a-zA-Z]+(?:\(.+?\)){,1})++.+?}}" captures only .startDate, but it should capture both .countdown and .startDate seperately. Is this not possible, or am I missing something?
regex101
I could capture ("{{.+?product((?:\.[a-zA-Z]+(?:\(.+?\)){,1})+).+?}}") it as a whole (.countdown.startDate) and later check/split it, but this sounds troublesome.

If you want to handle it with a single regex, you might want to use the PyPI regex module:
import regex
s = """{{ product.id }}
{{ product.parentProductId }}
{{ product.countdown.startDate | date('Y/m/d H:i:s') }}
{{ product.primaryImage.originalUrl }}
{{ product.image(1).originalUrl }}
{{ product.priceWithTax(preferences.default_currency) | money }}"""
rx = r'{{[^{}]*product(\.[a-zA-Z]+(?:\([^()]+\))?)*[^{}]*}}'
# regex (unlike re) exposes every repetition of a group via .captures()
l = [m.captures(1) for m in regex.finditer(rx, s)]
print([item for sublist in l for item in sublist])  # flatten the per-match lists
# => ['.id', '.parentProductId', '.countdown', '.startDate', '.primaryImage', '.originalUrl', '.image(1)', '.originalUrl', '.priceWithTax(preferences.default_currency)']
See the Python demo
The {{[^{}]*product(\.[a-zA-Z]+(?:\([^()]+\))?)*[^{}]*}} regex will match
{{ - {{ substring
[^{}]* - 0+ chars other than { and }
product - the substring product
(\.[a-zA-Z]+(?:\([^()]+\))?)* - Capturing group 1: zero or more sequences of
\. - a dot
[a-zA-Z]+ - 1+ ASCII letters
(?:\([^()]+\))? - an optional sequence of (, 1+ chars other than ( and ) and then )
[^{}]* - 0+ chars other than { and }
}} - a }} substring.
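The core difference between re and regex here: with re, a repeated capturing group only keeps its last repetition, while regex retains all of them via .captures(). A minimal illustration (the simplified pattern below is just for demonstration):
import re
import regex
m = re.search(r'(?:(\.\w+))+', 'product.countdown.startDate')
print(m.group(1))      # '.startDate' - re only keeps the last repetition
m2 = regex.search(r'(?:(\.\w+))+', 'product.countdown.startDate')
print(m2.captures(1))  # ['.countdown', '.startDate'] - regex keeps them all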
If you are limited to re, you will need to capture all the properties into one capturing group (wrap the (\.[a-zA-Z]+(?:\([^()]+\))?)* part in an outer (...)) and then post-process each match, splitting on . characters that are not inside parentheses:
import re
rx = r'{{[^{}]*product((?:\.[a-zA-Z]+(?:\([^()]+\))?)*)[^{}]*}}'
l = re.findall(rx, s)
res = []
for m in l:
    # split on dots outside parentheses, drop the empty first chunk, restore the dot
    res.extend([".{}".format(n) for n in filter(None, re.split(r'\.(?![^()]*\))', m))])
print(res)
# => ['.id', '.parentProductId', '.countdown', '.startDate', '.primaryImage', '.originalUrl', '.image(1)', '.originalUrl', '.priceWithTax(preferences.default_currency)']
See this Python demo
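To see why the \.(?![^()]*\)) split is safe for method calls, here is a quick check on the trickiest value (an illustration of the technique, not part of the original demo):
import re
print(re.split(r'\.(?![^()]*\))', '.priceWithTax(preferences.default_currency)'))
# => ['', 'priceWithTax(preferences.default_currency)']
# the dot inside (...) is not split on; filter(None, ...) drops the empty first chunk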

Try this one; it captures everything in your requirement:
^{{ product(\..*?[(][^\d\/]+[)]).*?}}|^{{ product(\..*?)(\..*?)?(?= )
Demo and explanation at regex101.

I've decided to stick to re (instead of regex, as suggested by Victor) and this is what I ended up with:
import re, json

with open("test.twig", "r", encoding="utf-8") as file:
    content = file.read()

patterns = {
    "template": r"{{[^{}]*product((?:\.[a-zA-Z]+(?:\([^()]+\))?)*)[^{}]*}}",
    "prop": r"^[^\.]+$",                     # .id
    "subprop": r"^[^\.()]+(\.[^\.]+)+$",     # .countdown.startDate
    "itemprop": r"^[^\.]+\(\d+\)\.[^\.]+$",  # .image(1).originalUrl
    "method": r"^[^\.]+\(.+\)$",             # .priceWithTax(preferences.default_currency)
}

temp_re = re.compile(patterns["template"])
matches = temp_re.findall(content)
product = {}
for match in matches:
    match = match[1:]  # strip the leading dot
    if re.match(patterns["prop"], match):
        product[match] = match
    elif re.match(patterns["subprop"], match):
        match = match.split(".")
        if match[0] not in product:
            product[match[0]] = []
        if match[1] not in product[match[0]]:
            product[match[0]].append(match[1])
    elif re.match(patterns["itemprop"], match):
        match = match.split(".")
        array = re.sub(r"\(\d+\)", "(i)", match[0])  # normalize the index: image(1) -> image(i)
        if array not in product:
            product[array] = []
        if match[1] not in product[array]:
            product[array].append(match[1])
    elif re.match(patterns["method"], match):
        product[match] = match

props = json.dumps(product, indent=4)
print(props)
Example output:
{
    "id": "id",
    "parentProductId": "parentProductId",
    "countdown": [
        "startDate",
        "endDate",
        "expireDate"
    ],
    "primaryImage": [
        "originalUrl"
    ],
    "image(i)": [
        "originalUrl",
        "thumbUrl"
    ],
    "priceWithTax(preferences.default_currency)": "priceWithTax(preferences.default_currency)"
}


Removing different string patterns from Pandas column

I have the following column which consists of email subject headers:
Subject
EXT || Transport enquiry
EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry
EXT || FW: Model - Jan
SV: [EXTERNAL] Calculations
What I want to achieve is:
Subject
Transport enquiry
0001 || Copy of enquiry
Model - Jan
Calculations
and for this I am using the code below, which only applies the first regular expression I pass and ignores the rest:
def clean_subject_prelim(text):
    text = re.sub(r'^EXT \|\| $', '', text)
    text = re.sub(r'EXT \|\| RE: EXTERNAL: RE:', '', text)
    text = re.sub(r'EXT \|\| FW:', '', text)
    text = re.sub(r'^SV: \[EXTERNAL]$', '', text)
    return text
df['subject_clean'] = df['Subject'].apply(lambda x: clean_subject_prelim(x))
Why this is not working, what am I missing here?
You can use
pattern = r"""(?mx) # MULTILINE mode on
^ # start of string
(?: # non-capturing group start
EXT\s*\|\|\s*(?:RE:\s*EXTERNAL:\s*RE:|FW:)? # EXT || or EXT || RE: EXTERNAL: RE: or EXT || FW:
| # or
SV:\s*\[EXTERNAL]# SV: [EXTERNAL]
) # non-capturing group end
\s* # zero or more whitespaces
"""
df['subject_clean'] = df['Subject'].str.replace(pattern', '', regex=True)
See the regex demo.
Since re.X ((?x)) is used, literal spaces and # chars must be escaped, or you can simply use \s* / \s+ to match zero or more / one or more whitespace chars.
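A quick end-to-end check against the four sample subjects (a sketch reusing the pattern defined above):
import pandas as pd
df = pd.DataFrame({"Subject": [
    "EXT || Transport enquiry",
    "EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry",
    "EXT || FW: Model - Jan",
    "SV: [EXTERNAL] Calculations",
]})
df['subject_clean'] = df['Subject'].str.replace(pattern, '', regex=True)
print(df['subject_clean'].tolist())
# => ['Transport enquiry', '0001 || Copy of enquiry', 'Model - Jan', 'Calculations']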
Get rid of the $ sign in the first expression and reorder some of the expressions, like this:
import pandas as pd
import re

def clean_subject_prelim(text):
    text = re.sub(r'EXT \|\| RE: EXTERNAL: RE:', '', text)
    text = re.sub(r'EXT \|\| FW:', '', text)
    text = re.sub(r'^EXT \|\|', '', text)
    text = re.sub(r'^SV: \[EXTERNAL]', '', text)
    return text

data = {"Subject": [
    "EXT || Transport enquiry",
    "EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry",
    "EXT || FW: Model - Jan",
    "SV: [EXTERNAL] Calculations"]}
df = pd.DataFrame(data)
df['subject_clean'] = df['Subject'].apply(lambda x: clean_subject_prelim(x))
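Note that these expressions do not consume the whitespace that follows the removed prefix, so the cleaned values keep a leading space; a final .str.strip() (my addition, not in the original answer) tidies that up:
print(df['subject_clean'].tolist())
# => [' Transport enquiry', ' 0001 || Copy of enquiry', ' Model - Jan', ' Calculations']
df['subject_clean'] = df['subject_clean'].str.strip()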

Get data with boundaries using regex

I would like to get the labels and data from this script using regex. I have tried this:
pattern = re.compile(r'/blabels: ],/b')
print(pattern)
result = soup.find("script", text=pattern)
But I get None when using boundaries.
This is the soup:
<script>
Chart.defaults.LineWithLine = Chart.defaults.line;
new Chart(document.getElementById("chart-overall-mentions"), {
type: 'LineWithLine',
data: {
labels: [1637005508000,1637006108000,1637006708000,1637007308000,1637007908000,1637008508000,1637009108000,1637009708000,1637010308000,1637010908000,1637011508000,1637012108000,1637012708000,1637013308000,1637013908000,1637014508000,1637015108000,1637015708000,1637016308000,1637016908000,1637017508000,1637018108000,1637018708000,1637019308000,1637019908000,1637020508000,1637021108000,1637021708000,1637022308000,1637022908000,1637023508000,1637024108000,1637024708000,1637025308000,1637025908000,1637026508000,1637027108000,1637027708000,1637028308000,1637028908000,1637029508000,1637030108000,1637030708000,1637031308000,1637031908000,1637032508000,1637033108000,1637033708000,1637034308000,1637034908000,1637035508000,1637036108000,1637036708000,1637037308000,1637037908000,1637038508000,1637039108000,1637039708000,1637040308000,1637040908000,1637041508000,1637042108000,1637042708000,1637043308000,1637043908000,1637044508000,1637045108000,1637045708000,1637046308000,1637046908000,1637047508000,1637048108000,1637048708000,1637049308000,1637049908000,1637050508000,1637051108000,1637051708000,1637052308000,1637052908000,1637053508000,1637054108000,1637054708000,1637055308000,1637055908000,1637056508000,1637057108000,1637057708000,1637058308000,1637058908000,1637059508000,1637060108000,1637060708000,1637061308000,1637061908000,1637062508000,1637063108000,1637063708000,1637064308000,1637064908000,1637065508000,1637066108000,1637066708000,1637067308000,1637067908000,1637068508000,1637069108000,1637069708000,1637070308000,1637070908000,1637071508000,1637072108000,1637072708000,1637073308000,1637073908000,1637074508000,1637075108000,1637075708000,1637076308000,1637076908000,1637077508000,1637078108000,1637078708000,1637079308000,1637079908000,1637080508000,1637081108000,1637081708000,1637082308000,1637082908000,1637083508000,1637084108000,1637084708000,1637085308000,1637085908000,1637086508000,1637087108000,1637087708000,1637088308000,1637088908000,1637089508000,1637090108000,1637090708000,1637091308000],
datasets: [{
data: [13,10,20,26,21,23,24,21,24,35,25,31,42,24,24,20,23,22,17,23,30,11,16,20,9,10,22,10,19,16,15,16,17,19,10,20,24,14,19,15,13,9,13,17,20,16,15,21,18,25,15,14,16,15,16,14,14,21,10,9,5,9,9,13,14,9,9,18,15,11,11,6,12,14,19,17,16,11,20,14,21,13,15,12,14,10,20,16,25,17,17,11,23,11,13,11,19,10,17,19,10,20,22,19,19,27,28,18,20,22,18,16,17,18,14,17,19,18,20,11,13,20,15,15,18,14,13,14,14,11,19,14,14,11,11,15,26,12,15,15,11,4,3,6],
pointRadius: 0,
borderColor: "#666",
fill: true,
yAxisID:'yAxis1'
},
]
},
options: {
tooltips: {
mode: 'index',
bodyFontSize: 18,
intersect: false,
titleFontSize: 16,
},
.
.
.
</script>
Here is how you can do that:
Get the script tag - you can use a regex, too, if that is the only way to obtain that node
Then run a regex search against the node text/string to get your final output.
You can use
import re

# Get the script node with text matching your pattern
item = soup.find("script", text=re.compile(r'\blabels:\s*\['))
match = re.search(r'\blabels:\s*\[([^][]*)]', item.string)
if match:
    labels = map(int, match.group(1).split(','))
Output:
>>> print(list(labels))
[1637005508000, 1637006108000, 1637006708000, 1637007308000, 1637007908000, 1637008508000, 1637009108000, 1637009708000, 1637010308000, 1637010908000, 1637011508000, 1637012108000, 1637012708000, 1637013308000, 1637013908000, 1637014508000, 1637015108000, 1637015708000, 1637016308000, 1637016908000, 1637017508000, 1637018108000, 1637018708000, 1637019308000, 1637019908000, 1637020508000, 1637021108000, 1637021708000, 1637022308000, 1637022908000, 1637023508000, 1637024108000, 1637024708000, 1637025308000, 1637025908000, 1637026508000, 1637027108000, 1637027708000, 1637028308000, 1637028908000, 1637029508000, 1637030108000, 1637030708000, 1637031308000, 1637031908000, 1637032508000, 1637033108000, 1637033708000, 1637034308000, 1637034908000, 1637035508000, 1637036108000, 1637036708000, 1637037308000, 1637037908000, 1637038508000, 1637039108000, 1637039708000, 1637040308000, 1637040908000, 1637041508000, 1637042108000, 1637042708000, 1637043308000, 1637043908000, 1637044508000, 1637045108000, 1637045708000, 1637046308000, 1637046908000, 1637047508000, 1637048108000, 1637048708000, 1637049308000, 1637049908000, 1637050508000, 1637051108000, 1637051708000, 1637052308000, 1637052908000, 1637053508000, 1637054108000, 1637054708000, 1637055308000, 1637055908000, 1637056508000, 1637057108000, 1637057708000, 1637058308000, 1637058908000, 1637059508000, 1637060108000, 1637060708000, 1637061308000, 1637061908000, 1637062508000, 1637063108000, 1637063708000, 1637064308000, 1637064908000, 1637065508000, 1637066108000, 1637066708000, 1637067308000, 1637067908000, 1637068508000, 1637069108000, 1637069708000, 1637070308000, 1637070908000, 1637071508000, 1637072108000, 1637072708000, 1637073308000, 1637073908000, 1637074508000, 1637075108000, 1637075708000, 1637076308000, 1637076908000, 1637077508000, 1637078108000, 1637078708000, 1637079308000, 1637079908000, 1637080508000, 1637081108000, 1637081708000, 1637082308000, 1637082908000, 1637083508000, 1637084108000, 1637084708000, 1637085308000, 1637085908000, 1637086508000, 1637087108000, 1637087708000, 1637088308000, 1637088908000, 1637089508000, 1637090108000, 1637090708000, 1637091308000]
Once the node is obtained, the \blabels:\s*\[([^][]*)] regex searches for
\b - a word boundary
labels: - a fixed string
\s* - zero or more whitespaces
\[ - a [ char
([^][]*) - Group 1 (this is what you will need to split with a comma later): any zero or more chars other than ] and [
] - a ] char.
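The same approach extracts the datasets' data: array; a sketch, assuming the values are all integers as in the snippet above:
data_match = re.search(r'\bdata:\s*\[([^][]*)]', item.string)
if data_match:
    data = [int(n) for n in data_match.group(1).split(',')]
    print(data[:5])  # => [13, 10, 20, 26, 21]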

How can I leave whitespaces in nestedExpr pyparsing

I have wiki text like this:
data = """
{{hello}}
{{hello world}}
{{hello much { }}
{{a {{b}}}}
{{a
td {
}
{{inner}}
}}
"""
and I want to extract the macros inside it.
A macro is text enclosed between {{ and }}.
So I tried using nestedExpr:
from pyparsing import *
import pprint

def getMacroCandidates(txt):
    candidates = []

    def nestedExpr(opener="(", closer=")", content=None, ignoreExpr=quotedString.copy()):
        if opener == closer:
            raise ValueError("opening and closing strings cannot be the same")
        if content is None:
            if isinstance(opener, str) and isinstance(closer, str):
                if ignoreExpr is not None:
                    content = (Combine(OneOrMore(~ignoreExpr +
                               ~Literal(opener) + ~Literal(closer) +
                               CharsNotIn(ParserElement.DEFAULT_WHITE_CHARS, exact=1))
                               ).setParseAction(lambda t: t[0]))
        ret = Forward()
        ret <<= Group(opener + ZeroOrMore(ignoreExpr | ret | content) + closer)
        ret.setName('nested %s%s expression' % (opener, closer))
        return ret

    # use {}'s for nested lists
    macro = nestedExpr("{{", "}}")
    # print(( (nestedItems+stringEnd).parseString(data).asList() ))
    for toks, preloc, nextloc in macro.scanString(txt):  # scan the argument, not the global data
        print(toks)
    return candidates
data = """
{{hello}}
{{hello world}}
{{hello much { }}
{{a {{b}}}}
{{a
td {
}
{{inner}}
}}
"""
getMacroCandidates(data)
which gives me the tokens, with the whitespace removed:
[['{{', 'hello', '}}']]
[['{{', 'hello', 'world', '}}']]
[['{{', 'hello', 'much', '{', '}}']]
[['{{', 'a', ['{{', 'b', '}}'], '}}']]
[['{{', 'a', 'td', '{', '}', ['{{', 'inner', '}}'], '}}']]
You can do it with plain string replaces:
data = """
{{hello}}
{{hello world}}
{{hello much { }}
{{a {{b}}}}
{{a
td {
}
{{inner}}
}}
"""
import shlex

data1 = data.replace("{{", '"')
data2 = data1.replace("}}", '"')
data3 = data2.replace("}", " ")
data4 = data3.replace("{", " ")
data5 = ' '.join(data4.split())
print(shlex.split(data5.replace("\n", " ")))
Output - all the tokens, with the braces removed and the whitespace collapsed:
['hello', 'hello world', 'hello much ', 'a b', 'a td inner ']
PS: This could be collapsed into a single expression; multiple statements are used here for readability.
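If you want to stay within pyparsing, originalTextFor is worth a look: it returns the original slice of the input, whitespace included, for whatever expression it wraps. A minimal sketch, assuming the same data string as above:
from pyparsing import nestedExpr, originalTextFor

macro = originalTextFor(nestedExpr("{{", "}}"))
for toks, start, end in macro.scanString(data):
    print(repr(toks[0]))  # e.g. '{{hello world}}' comes back with its space intact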

ParseError parsing empty valued xml doc

There are lots of articles pertaining to parsing XML with ElementTree. I've gone through a bunch of them and read through the docs, but I can't come up with a solution that works for me. I'm trying to supplement info that's created by another app in an nfo file, but I need to preserve the conventions in the file.
Here is an example of how the file is laid out
<title>
<name>Test Name</name>
<alt name />
<file local="C:\file\file1.doc" type="word">http://filestore/file1.doc</file>
<file local="" type="excel">http://filestore/file2.xls</file>
<file local="C:\file\file3.xls" type="excel" />
<file local="" type="ppt" />
</title>
Note: elements are not closed properly, e.g.
<alt name /> should be <alt name></alt name>
This is what I'm running...
import xml.etree.ElementTree as ET
tree = ET.parse('file.nfo')
root = tree.getroot()
The error I'm getting is...
xml.etree.ElementTree.ParseError: not well-formed (invalid token):
I've tried...
myparser = ET.XMLParser(encoding='UTF-8')
tree = ET.parse('file.nfo', myparser)
Also tried xmlparser and opening with codecs, but I'm pretty sure it's the formatting. I'm guessing the immediate issue is the unescaped >, but I suspect ET needs proper opening/closing tags?
I'm sure I could open this file and go through it with regex, but I was hoping to use ElementTree.
The end goal is to have the details from the nfo as a dictionary that looks like...
dict = {'title': [{'name': 'Test Name',
                   'alt name': '',
                   'file': [{'local': 'C:\file\file1.doc', 'type': 'word', 'url': 'http://filestore/file1.doc'},
                            {'local': '', 'type': 'excel', 'url': 'http://filestore/file2.xls'},
                            {'local': 'C:\file\file3.xls', 'type': 'excel', 'url': ''},
                            {'local': '', 'type': 'ppt', 'url': ''}]
                   }]}
I'm sure there is a better (more Pythonic) way to do this, but I'm pretty new to Python.
Any help would be appreciated.
EDIT: I'm also trying to avoid using 3rd-party libraries if possible.
So I ended up creating a custom parser of sorts; it's not ideal, but it works. It was suggested to me that lxml and html.parser may parse malformed XML better, but I just went with this.
I'm also still very interested in any feedback, whether it be on this or on using any other method.
import re

def merge_dicts(*dict_args):
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

def make_dict(str_arg, op):
    result = dict(s.split(op) for s in str_arg.split(","))
    return result
'''
Samples
lst = r' <name>Test Name</name>'
lst = r' <alt name />'
lst = r' <file local="C:\file\file1.doc" type="word">http://filestore/file1.doc</file>'
lst = r' <file local="" type="excel">http://filestore/file2.xls</file>'
lst = r' <file local="C:\file\file3.xls" type="excel" />'
lst = r' <file local="" type="ppt" />'
'''
def match_pattern(file_str):
    # <description>desc blah</description>
    pattern1 = r'''(?x)
        ^
        \s*                                     # cut leading whitespace
        (?P<whole_thing>
        < (?P<tag_open> (\w+?|\w*\s\w+?)+) \b   # word boundary, so we can
        >                                       # skip attributes
        (?P<tag_body> .+? )                     # insides
        </ (?P<tag_close> (\w+?|\w*\s\w+?)+) >  # closing tag, nothing interesting
        )
        $'''
    # <alt name />
    pattern2 = r'''(?x)
        ^
        \s*
        (?P<whole_thing>
        < (?P<tag_open> (\w+?|\w*\s\w+?)+) \b
        \s/>
        )
        $'''
    # <file local="C:\file\file1.doc" type="word">http://filestore/file1.doc</file>
    pattern3 = r'''(?x)
        ^
        \s*
        (?P<whole_thing>
        < (?P<tag_open> (\w+?|\w*\s\w+!=?)+) \b
        \s
        (?P<tag_attrib1> (\w*\=.*?))            # 1st attribute
        \s
        (?P<tag_attrib2> (\w*\=.*))             # 2nd attribute
        .*? >
        (?P<tag_body> .+? )
        </ (?P<tag_close> (\w+?|\w*\s\w+?)+) >
        )
        $'''
    # <file local="" type="ppt" />
    pattern4 = r'''(?x)
        ^
        \s*
        (?P<whole_thing>
        < (?P<tag_open> (\w+?|\w*\s\w+!=?)+) \b
        \s
        (?P<tag_attrib1> (\w*\=.*?))            # 1st attribute
        \s
        (?P<tag_attrib2> (\w*\=.*))             # 2nd attribute
        \s/>
        )
        $'''
    pat_str = 'pattern'
    pat_val = 1
    return_dict = {}
    while pat_val <= 4:
        pattern = pat_str + str(pat_val)
        # re.L dropped: LOCALE is only valid for bytes patterns in Python 3
        matchObj = re.match(eval(pattern), file_str, re.M)
        if matchObj:
            # for k, v in matchObj.groupdict().items():
            #     print('matchObj.group({!r}) == {!r}'.format(k, v))
            if pat_val == 1:
                body = matchObj.group('tag_body')
                return_dict = {matchObj.group('tag_open'): body}
            elif pat_val == 2:
                return_dict = {matchObj.group('tag_open'): ''}
            elif pat_val == 3:
                attr1 = make_dict(matchObj.group('tag_attrib1'), '=')
                attr2 = make_dict(matchObj.group('tag_attrib2'), '=')
                body = {'url': matchObj.group('tag_body')}
                attrib = merge_dicts(attr1, attr2, body)
                return_dict = {matchObj.group('tag_open'): attrib}
            elif pat_val == 4:
                attr1 = make_dict(matchObj.group('tag_attrib1'), '=')
                attr2 = make_dict(matchObj.group('tag_attrib2'), '=')
                body = {'url': ''}
                attrib = merge_dicts(attr1, attr2, body)
                return_dict = {matchObj.group('tag_open'): attrib}
            return return_dict
        else:
            pat_val = pat_val + 1
            if pat_val > 4:
                print("No match!!")

# print(match_pattern(lst))
def in_file(file):
    result = {}
    with open(file, "r") as file:
        data = file.read().splitlines()
    for d in data:
        if data.index(d) == 0 or data.index(d) == len(data) - 1:
            if data.index(d) == 0:
                print(re.sub('<|/|>', '', d))
        elif d:
            lst = []
            dct = {}
            if 'file' in match_pattern(d).keys():
                for i in match_pattern(d).items():
                    if 'file' in result.keys():
                        lst = result['file']
                        lst.append(i[1])
                        dct = {i[0]: lst}
                        result = merge_dicts(result, dct)
                        # print(result['file'])
                    else:
                        dct = {i[0]: [i[1]]}
                        result = merge_dicts(result, dct)
            else:
                result = merge_dicts(result, match_pattern(d))
                print('else', match_pattern(d))
    return result

print(in_file('C:\\test.nfo'))
NOTE: I dropped the topmost dictionary from the original post.

Python regex sub confusion

There are four keywords: title, blog, tags, state
Excess keyword occurrences are being removed from their respective matches.
Example: the line blog: blog state title tags and comes back as state title tags and instead of blog state title tags and.
The sub function should be matching .+ after it sees blog:, so I don't know why it treats blog as an exception to .+.
Regex:
re.sub(r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s).+(\n|$))', matcher, a)
Code:
def n15():
    import re
    a = """blog: blog: fooblog
state: private
title: this is atitle bun
and text"""
    kwargs = {}
    def matcher(string):
        v = string.group(1).replace(string.group(2), '').replace(string.group(3), '').replace(string.group(4), '').replace(string.group(5), '')
        if string.group(3) == 'title':
            kwargs['title'] = v
        elif string.group(3) == 'blog':
            kwargs['blog_url'] = v
        elif string.group(3) == 'tags':
            kwargs['comma_separated_tags'] = v
        elif string.group(3) == 'state':
            kwargs['post_state'] = v
        return ''
    a = re.sub(r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s).+(\n|$))', matcher, a)
    a = a.replace('\n', '<br />')
    a = a.replace('\r', '')
    a = a.replace('"', r'\"')
    a = '<p>' + a + '</p>'
    kwargs['body'] = a
    print kwargs
Output:
{'body': '<p>and text</p>', 'post_state': 'private', 'blog_url': 'foo', 'title': 'this is a bun'}
Edit:
Desired Output:
{'body': '<p>and text</p>', 'post_state': 'private', 'blog_url': 'fooblog', 'title': 'this is atitle bun'}
replace(string.group(3), '')
is replacing all occurrences of 'blog' with ''.
Rather than trying to replace all the other parts of the matched string, which will be hard to get right, I suggest capturing the string you actually want in the original match:
r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s)(.+)(\n|$))'
which has () around the .+ to capture that part of the string, then
v = string.group(5)
at the start of matcher (note that the old group(5), the trailing (\n|$), becomes group(6)).
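Putting it together, a sketch of the fixed flow (names follow the original; the duplicated-keyword line is simplified here):
import re

kwargs = {}

def matcher(string):
    v = string.group(5)  # the value, captured directly instead of reconstructed
    key = {'title': 'title', 'blog': 'blog_url',
           'tags': 'comma_separated_tags', 'state': 'post_state'}[string.group(3)]
    kwargs[key] = v
    return ''

a = """blog: fooblog
state: private
title: this is a bun
and text"""
a = re.sub(r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s)(.+)(\n|$))', matcher, a)
print(kwargs)  # kwargs now holds blog_url, post_state and title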
