There are four keywords: title, blog, tags, state.
When a keyword also occurs inside the value that follows it, those extra occurrences are removed from the captured value.
Example:
blog: blog state title tags and
returns
state title tags and
instead of
blog state title tags and
The sub function should match .+ once it has seen blog:, so I don't understand why blog is treated as an exception to .+
Regex:
re.sub(r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s).+(\n|$))', matcher, a)
Code:
def n15():
    import re
    a = """blog: blog: fooblog
state: private
title: this is atitle bun
and text"""
    kwargs = {}
    def matcher(string):
        v = string.group(1).replace(string.group(2), '').replace(string.group(3), '').replace(string.group(4), '').replace(string.group(5), '')
        if string.group(3) == 'title':
            kwargs['title'] = v
        elif string.group(3) == 'blog':
            kwargs['blog_url'] = v
        elif string.group(3) == 'tags':
            kwargs['comma_separated_tags'] = v
        elif string.group(3) == 'state':
            kwargs['post_state'] = v
        return ''
    a = re.sub(r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s).+(\n|$))', matcher, a)
    a = a.replace('\n', '<br />')
    a = a.replace('\r', '')
    a = a.replace('"', r'\"')
    a = '<p>' + a + '</p>'
    kwargs['body'] = a
    print(kwargs)
Output:
{'body': '<p>and text</p>', 'post_state': 'private', 'blog_url': 'foo', 'title': 'this is a bun'}
Edit:
Desired Output:
{'body': '<p>and text</p>', 'post_state': 'private', 'blog_url': 'fooblog', 'title': 'this is atitle bun'}
string.group(1).replace(string.group(3), '')
removes every occurrence of the keyword from the matched text, not just the leading one, so 'fooblog' becomes 'foo' and 'atitle' becomes 'a'.
Rather than trying to strip all the other parts out of the matched string, which is hard to get right, I suggest capturing the text you actually want in the original match:
r'((^|\n|\s|\b)(title|blog|tags|state)(\:\s)(.+)(\n|$))'
This has () around the .+ to capture that part of the string. Then use
v = string.group(5)
at the start of matcher.
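A minimal sketch of that idea in Python 3. The shortened sample text and the KEYWORD_TO_KWARG lookup table (standing in for the if/elif chain) are assumptions for illustration, not part of the original post:

```python
import re

# Hypothetical lookup table replacing the if/elif chain from the question.
KEYWORD_TO_KWARG = {
    'title': 'title',
    'blog': 'blog_url',
    'tags': 'comma_separated_tags',
    'state': 'post_state',
}

# Group 1 is the keyword, group 2 is the value; everything else is non-capturing.
PATTERN = re.compile(r'(?:^|\n|\s|\b)(title|blog|tags|state):\s(.+)(?:\n|$)')

text = """blog: fooblog
state: private
title: this is atitle bun
and text"""

kwargs = {}

def matcher(match):
    # Store the captured value directly instead of stripping pieces off the match.
    kwargs[KEYWORD_TO_KWARG[match.group(1)]] = match.group(2)
    return ''

body = PATTERN.sub(matcher, text)
print(kwargs)
print(body)
```

Because the value is captured rather than reconstructed, keywords that appear inside it ('fooblog', 'atitle') are left alone. With the doubled prefix from the question's sample (blog: blog: fooblog), the captured value would be 'blog: fooblog', since .+ takes everything after the first 'blog: '.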
Related
I have a long dictionary which looks like this:
name = 'Barack.'
name_last = 'Obama!'
street_name = "President Streeet?"
list_of_slot_names = {'name':name, 'name_last':name_last, 'street_name':street_name}
I want to remove the punctuation from every slot (name, name_last, ...).
I could do it this way:
import string
name = name.translate(str.maketrans('', '', string.punctuation))
name_last = name_last.translate(str.maketrans('', '', string.punctuation))
street_name = street_name.translate(str.maketrans('', '', string.punctuation))
Do you know a shorter (more compact) way to write this?
Result:
>>> print(name, name_last, street_name)
Barack Obama President Streeet
Use a loop / dictionary comprehension
{k: v.translate(str.maketrans('', '', string.punctuation)) for k, v in list_of_slot_names.items()}
You can either assign the result back to list_of_slot_names to overwrite the existing values, or assign it to a new variable.
If you assigned it back, you can then print the cleaned values via
print(*list_of_slot_names.values())
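Put together as a runnable sketch (the names slots, table and cleaned are illustrative; building the translation table once avoids recreating it for every value):

```python
import string

slots = {
    'name': 'Barack.',
    'name_last': 'Obama!',
    'street_name': 'President Streeet?',
}

# One translation table that deletes every ASCII punctuation character.
table = str.maketrans('', '', string.punctuation)

# New dictionary with the same keys and cleaned values.
cleaned = {key: value.translate(table) for key, value in slots.items()}

print(*cleaned.values())  # Barack Obama President Streeet
```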
name = 'Barack.'
name_last = 'Obama!'
empty_slot = None
street_name = "President Streeet?"
print(*[str_.strip('.?!') for str_ in (name, name_last, empty_slot, street_name) if str_ is not None])
-> Barack Obama President Streeet
This only strips punctuation from the ends of each string. If you also want to remove it from the middle, do this:
import re
name = 'Barack.'
name_last = 'Obama!'
empty_slot = None
street_name = "President Streeet?"
print([re.sub('[.?!]+',"",str_) for str_ in (name, name_last, empty_slot, street_name) if str_ is not None])
import re, string
s = 'hell:o? wor!d.'
# re.escape keeps characters such as ], \ and ^ literal inside the character class
clean = re.sub(f"[{re.escape(string.punctuation)}]", "", s)
print(clean)
Output:
hello world
I'm trying to remove the trademark symbol (™), but only when it is not part of a longer sequence. For instance, I might have ’, which is a badly encoded quotation mark ('). In that case I don't want to remove the trademark symbol, because that would break the pattern I'm using to replace xx™ with a quotation mark.
dict = {}
chars = {
    '\xe2\x84\xa2': '', # ™
    '\xe2\x80\x99': "'", # ’
}
def stats_change(char, number):
    if dict.has_key(char):
        dict[char] = dict[char]+number
    else:
        dict[char] = number # Add new entry
def replace_chars(match):
    char = match.group(0)
    stats_change(char,1)
    return chars[char]
i, nmatches = re.subn("(\\" + '|\\'.join(chars.keys()) + ")", replace_chars, i)
count_matches += nmatches
Input: foo™ oof
Output: foo oof
Input: o’f oof
Output: o'f oof
Any suggestions ?
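One possible approach, sketched in Python 3 with str rather than byte strings: list the longer mis-encoded sequence ahead of the bare trademark sign in the alternation, so the regex consumes the whole sequence as a unit before a lone ™ is ever considered. The names REPLACEMENTS, stats and clean are illustrative and not from the question, and the sketch assumes the text has already been decoded to str.

```python
import re
from collections import Counter

# Escapes keep the source ASCII-only:
# "\u00e2\u20ac\u2122" is the mis-encoded quote (a-circumflex, euro sign, trademark sign),
# "\u2122" is a bare trademark sign.
REPLACEMENTS = {
    "\u00e2\u20ac\u2122": "'",
    "\u2122": "",
}

# Longest keys first, so the three-character sequence wins over the bare sign.
PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(REPLACEMENTS, key=len, reverse=True))
)

stats = Counter()  # how often each sequence was replaced

def replace_chars(match):
    stats[match.group(0)] += 1
    return REPLACEMENTS[match.group(0)]

def clean(text):
    # Returns (new_text, number_of_replacements), like re.subn.
    return PATTERN.subn(replace_chars, text)

print(clean("foo\u2122 oof"))
print(clean("o\u00e2\u20ac\u2122f oof"))
```

The first call drops the lone trademark sign; the second turns the three-character sequence into a single apostrophe and leaves nothing behind.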
I'm trying to parse the item names and their corresponding values from the snippet below. The dt tags hold the names and the dd tags hold the values. A few dt tags have no corresponding dd, so not every name has a value. When a name has no value, I want to keep it in the result with a blank value.
These are the elements I would like to scrape data from:
content="""
<div class="movie_middle">
<dl>
<dt>Genres:</dt>
<dt>Resolution:</dt>
<dd>1920*1080</dd>
<dt>Size:</dt>
<dd>1.60G</dd>
<dt>Quality:</dt>
<dd>1080p</dd>
<dt>Frame Rate:</dt>
<dd>23.976 fps</dd>
<dt>Language:</dt>
</dl>
</div>
"""
Here is what I've tried:
from bs4 import BeautifulSoup
soup = BeautifulSoup(content,"lxml")
title = [item.text for item in soup.select(".movie_middle dt")]
result = [item.text for item in soup.select(".movie_middle dd")]
vault = dict(zip(title,result))
print(vault)
It gives me messy results (wrong pairs):
{'Genres:': '1920*1080', 'Resolution:': '1.60G', 'Size:': '1080p', 'Quality:': '23.976 fps'}
My expected result:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p','Frame Rate:':'23.976 fps','Language:':''}
Any help on fixing the issue will be highly appreciated.
You can loop through the elements inside dl. If the current element is dt and the next element is dd, then store the value as the next element, else set the value as empty string.
dl = soup.select('.movie_middle dl')[0]
elems = dl.find_all() # Returns the list of dt and dd
data = {}
for i, el in enumerate(elems):
    if el.name == 'dt':
        key = el.text.replace(':', '')
        # check if the next element is a `dd`
        if i < len(elems) - 1 and elems[i+1].name == 'dd':
            data[key] = elems[i+1].text
        else:
            data[key] = ''
You can use BeautifulSoup to parse the dl structure, and then write a function to create the dictionary:
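from bs4 import BeautifulSoup as soup
import re
def parse_result(d):
    # d is the list of '<dt>...</dt>' and '<dd>...</dd>' lines inside the <dl>
    while d:
        a, *_d = d
        if _d:
            if re.findall('<dt', a) and re.findall('<dd', _d[0]):
                # a name followed by its value; [4:-5] strips the surrounding tags
                yield [a[4:-5], _d[0][4:-5]]
                d = _d[1:]
            else:
                # a name followed by another name: no value
                yield [a[4:-5], '']
                d = _d
        else:
            # last line is a name without a value
            yield [a[4:-5], '']
            d = []
print(dict(parse_result(list(filter(None, str(soup(content, 'html.parser').find('dl')).split('\n')))[1:-1])))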
Output:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p', 'Frame Rate:': '23.976 fps', 'Language:': ''}
For a slightly longer, although cleaner solution, you can create a decorator to strip the HTML tags of the output, thus removing the need for the extra string slicing in the main parse_result function:
def strip_tags(f):
    def wrapper(data):
        # [4:-5] strips '<dt>'/'<dd>' and the matching closing tag
        return {a[4:-5]:b[4:-5] for a, b in f(data)}
    return wrapper
@strip_tags
def parse_result(d):
    while d:
        a, *_d = d
        if _d:
            if re.findall('<dt', a) and re.findall('<dd', _d[0]):
                yield [a, _d[0]]
                d = _d[1:]
            else:
                yield [a, '']
                d = _d
        else:
            yield [a, '']
            d = []
print(parse_result(list(filter(None, str(soup(content, 'html.parser').find('dl')).split('\n')))[1:-1]))
Output:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p', 'Frame Rate:': '23.976 fps', 'Language:': ''}
from collections import defaultdict
test = soup.text.split('\n')
d = defaultdict(list)
for i in range(len(test)):
    if (':' in test[i]) and (':' not in test[i+1]):
        d[test[i]] = test[i+1]
    elif ':' in test[i]:
        d[test[i]] = ''
d
defaultdict(list,
{'Frame Rate:': '23.976 fps',
'Genres:': '',
'Language:': '',
'Quality:': '1080p',
'Resolution:': '1920*1080',
'Size:': '1.60G'})
The logic here is that you know that every key will have a colon. Knowing this, you can write an if else statement to capture the unique combinations, whether that is key followed by key or key followed by value
Edit:
In case you wanted to clean your keys, below replaces the : in each one:
d1 = { x.replace(':', ''): d[x] for x in d.keys() }
d1
{'Frame Rate': '23.976 fps',
'Genres': '',
'Language': '',
'Quality': '1080p',
'Resolution': '1920*1080',
'Size': '1.60G'}
The problem is that empty elements are not present. Since there is no hierarchy between the <dt> and the <dd>, I'm afraid you'll have to craft the dictionary yourself.
vault = {}
category = None
for item in soup.find("dl").findChildren():
    if item.name == "dt":
        # register the name right away with an empty value,
        # so a <dt> without a following <dd> is still kept
        category = item.text
        vault[category] = ""
    elif item.name == "dd" and category is not None:
        vault[category] = item.text
        category = None
Basically this code iterates over the child elements of the <dl> and fills the vault dictionary with the values.
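The same "remember the last dt, fill it in when a dd arrives" idea can be sketched without BeautifulSoup, using the standard library's html.parser instead. This is a swapped-in technique, not what the answers above use; the class name DefinitionListParser and the shortened content sample are assumptions for illustration.

```python
from html.parser import HTMLParser

class DefinitionListParser(HTMLParser):
    """Collects <dt>/<dd> text into a dict; a <dt> without a <dd> maps to ''."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._current_tag = None  # 'dt' or 'dd' while inside one of them
        self._last_key = None     # most recent <dt> still waiting for a value

    def handle_starttag(self, tag, attrs):
        if tag in ('dt', 'dd'):
            self._current_tag = tag

    def handle_endtag(self, tag):
        if tag in ('dt', 'dd'):
            self._current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current_tag is None:
            return
        if self._current_tag == 'dt':
            # Register the name with an empty value until a <dd> shows up.
            self._last_key = text
            self.pairs[text] = ''
        elif self._last_key is not None:
            self.pairs[self._last_key] = text
            self._last_key = None

content = """
<dl>
<dt>Genres:</dt>
<dt>Resolution:</dt>
<dd>1920*1080</dd>
<dt>Size:</dt>
<dd>1.60G</dd>
<dt>Language:</dt>
</dl>
"""

parser = DefinitionListParser()
parser.feed(content)
parser.close()
print(parser.pairs)
```

A completely empty <dt></dt> produces no text and would be skipped by this sketch, so it only fits input where every name has some text.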
I have a list of strings in python and if an element of the list contains the word "parthipan" I should print a message. But the below script is not working
import re
a = ["paul Parthipan","paul","sdds","sdsdd"]
last_name = "Parthipan"
my_regex = r"(?mis){0}".format(re.escape(last_name))
if my_regex in a:
    print "matched"
The first element of the list contains the word "parthipan", so it should print the message.
If you want to do this with a regexp, you can't use the in operator. Use re.search() instead. It works on a single string, though, not on a whole list, so loop over the elements:
for elt in a:
    if re.search(my_regex, elt):
        print "Matched"
        break # stop looking
Or in a more functional style:
if any(re.search(my_regex, elt) for elt in a):
    print "Matched"
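For reference, a Python 3 sketch of the same check. It swaps the inline (?mis) flags for re.IGNORECASE, which is the only flag that matters for this input; the names names, pattern and matched are illustrative.

```python
import re

names = ["paul Parthipan", "paul", "sdds", "sdsdd"]
last_name = "parthipan"

# re.escape guards against regex metacharacters in the name;
# re.IGNORECASE makes "parthipan" match "Parthipan".
pattern = re.compile(re.escape(last_name), re.IGNORECASE)

# any() stops at the first element where search() finds a match.
matched = any(pattern.search(name) for name in names)

if matched:
    print("Matched")
```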
You don't need a regex for this; simply use any().
>>> a = ["paul Parthipan","paul","sdds","sdsdd"]
>>> last_name = "Parthipan".lower()
>>> if any(last_name in name.lower() for name in a):
...     print("Matched")
...
Matched
Why not:
a = ["paul Parthipan","paul","sdds","sdsdd"]
last_name = "Parthipan"
if any(last_name in ai for ai in a):
    print "matched"
Also, what is this part for?
...
import re
my_regex = r"(?mis){0}".format(re.escape(last_name))
...
EDIT:
I can't see why you need a regex here. It would help if you gave some real input and output. This small example could also be done this way:
a = ["paul Parthipan","paul","sdds","sdsdd",'Mala_Koala','Czarna,Pala']
last_name = "Parthipan"
names=[]
breakers=[' ','_',',']
for ai in a:
    for b in breakers:
        if b in ai:
            names.append(ai.split(b))
full_names=[ai for ai in names if len(ai)==2]
last_names=[ai[1] for ai in full_names]
if any(last_name in ai for ai in last_names):
    print("matched")
But if the regex part is really needed: the in operator can never find the pattern string '(?mis)Parthipan' inside 'Parthipan'. The simplest option is the reverse direction, checking 'Parthipan' in '(?mis)Parthipan', like here:
import re
a = ["paul Parthipan","paul","sdds","sdsdd",'Mala_Koala','Czarna,Pala']
last_name = "Parthipan"
names=[]
breakers=[' ','_',',']
for ai in a:
    for b in breakers:
        if b in ai:
            names.append(ai.split(b))
full_names=[ai for ai in names if len(ai)==2]
last_names=[r"(?mis){0}".format(re.escape(ai[1])) for ai in full_names]
print(last_names)
if any(last_name in ai for ai in last_names):
    print("matched")
EDIT:
With a regex you have a few possibilities:
import re
a = ["paul Parthipan","paul","sdds","sdsdd",'jony-Parthipan','koala_Parthipan','Parthipan']
lastName = "Parthipan"
myRegex = r"(?mis){0}".format(re.escape(lastName))
strA=';'.join(a)
se = re.search(myRegex, strA)
ma = re.match(myRegex, strA)
fa = re.findall(myRegex, strA)
fi=[i.group() for i in re.finditer(myRegex, strA, flags=0)]
se = '' if se is None else se.group()
ma = '' if ma is None else ma.group()
print(se, 'match' if any(se) else 'no match')
print(ma, 'match' if any(ma) else 'no match')
print(fa, 'match' if any(fa) else 'no match')
print(fi, 'match' if any(fi) else 'no match')
Output. re.match finds nothing because it only matches at the beginning of the string; re.search is the simplest fit when you only need to know whether the name occurs:
Parthipan match
no match
['Parthipan', 'Parthipan', 'Parthipan', 'Parthipan'] match
['Parthipan', 'Parthipan', 'Parthipan', 'Parthipan'] match
There are lots of articles about parsing XML with ElementTree. I've gone through a bunch of them and read through the docs, but I can't come up with a solution that works for me. I'm trying to supplement info that's created by another app in an nfo file, but I need to preserve the conventions in the file.
Here is an example of how the file is laid out
<title>
<name>Test Name</name>
<alt name />
<file local="C:\file\file1.doc" type="word">http://filestore/file1.doc</file>
<file local="" type="excel">http://filestore/file2.xls</file>
<file local="C:\file\file3.xls" type="excel" />
<file local="" type="ppt" />
</title>
Note: Elements are not closed properly e.g...
<alt name /> should be <alt name></alt name>
This is what I'm running...
import xml.etree.ElementTree as ET
tree = ET.parse('file.nfo')
root = tree.getroot()
The error I'm getting is...
xml.etree.ElementTree.ParseError: not well-formed (invalid token):
I've tried...
myparser = ET.XMLParser(encoding='UTF-8')
tree = ET.parse('file.nfo', myparser)
I also tried xmlparser and opening the file with codecs, but I'm pretty sure the problem is the formatting. I'm guessing the immediate issue is a non-escaped >, but I suspect ET also needs explicit opening/closing tags?
I'm sure I could open this file and go through it with regex, but I was hoping to use ElementTree.
The end goal is to have the details from the nfo as a dictionary that looks like...
dict = {'title': [{'name': 'Test Name',
'alt name': '',
'file': [{'local': 'C:\file\file1.doc', 'type': 'word', 'url': 'http://filestore/file1.doc'},
{'local': '', 'type': 'excel', 'url': 'http://filestore/file2.xls'},
{'local': 'C:\file\file3.xls', 'type': 'excel', 'url': ''},
{'local': '', 'type': 'ppt', 'url': ''}]
}]}
I'm sure there is a better (more pythonic) way to do this but I'm pretty new to python.
Any help would be appreciated
EDIT: I'm also trying to avoid using 3rd party libraries if possible
So I ended up creating a custom parser of sorts. It's not ideal, but it works. It was suggested to me that lxml and html.parser may parse malformed XML better, but I just went with this.
I'm also still very interested in any feedback, whether on this or on any other method.
import re
def merge_dicts(*dict_args):
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result
def make_dict(str_arg, op):
    result = {}
    result = dict(s.split(op) for s in str_arg.split(","))
    return result
'''
Samples
lst = r' <name>Test Name</name>'
lst = r' <alt name />'
lst = r' <file local="C:\file\file1.doc" type="word">http://filestore/file1.doc</file>'
lst = r' <file local="" type="excel">http://filestore/file2.xls</file>'
lst = r' <file local="C:\file\file3.xls" type="excel" />'
lst = r' <file local="" type="ppt" />'
'''
def match_pattern(file_str):
    #<description>desc blah</description>'
    pattern1 = r'''(?x)
        ^
        \s* # cut leading whitespace
        (?P<whole_thing>
        < (?P<tag_open> (\w+?|\w*\s\w+?)+) \b # word boundary, so we can
        > # skip attributes
        (?P<tag_body> .+? ) # insides
        </ (?P<tag_close> (\w+?|\w*\s\w+?)+) > # closing tag, nothing interesting
        )
        $'''
    #<alt name />
    pattern2 = r'''(?x)
        ^
        \s*
        (?P<whole_thing>
        < (?P<tag_open> (\w+?|\w*\s\w+?)+) \b
        \s/>
        )
        $'''
    #<file local="C:\file\file1.doc" type="word">http://filestore/file1.doc</file>'
    pattern3 = r'''(?x)
        ^
        \s*
        (?P<whole_thing>
        < (?P<tag_open> (\w+?|\w*\s\w+!=?)+) \b
        \s
        (?P<tag_attrib1> (\w*\=.*?)) # 1st attribute
        \s
        (?P<tag_attrib2> (\w*\=.*)) # 2nd attribute
        .*? >
        (?P<tag_body> .+? )
        </ (?P<tag_close> (\w+?|\w*\s\w+?)+) >
        )
        $'''
    #<file local="" type="ppt" />
    pattern4 = r'''(?x)
        ^
        \s*
        (?P<whole_thing>
        < (?P<tag_open> (\w+?|\w*\s\w+!=?)+) \b
        \s
        (?P<tag_attrib1> (\w*\=.*?)) # 1st attribute
        \s
        (?P<tag_attrib2> (\w*\=.*)) # 2nd attribute
        \s/>
        )
        $'''
    # try the patterns in order; a tuple replaces the eval() lookup by name
    patterns = (pattern1, pattern2, pattern3, pattern4)
    for pat_val, pattern in enumerate(patterns, start=1):
        # re.L is rejected for str patterns on Python 3, so only re.M is passed
        matchObj = re.match(pattern, file_str, re.M)
        if not matchObj:
            continue
        #for k, v in matchObj.groupdict().items():
        #    print('matchObj.group({!r}) == {!r}'.format(k, v))
        if pat_val == 1:
            return {matchObj.group('tag_open'): matchObj.group('tag_body')}
        if pat_val == 2:
            return {matchObj.group('tag_open'): ''}
        # patterns 3 and 4 carry two attributes; only pattern 3 has a body
        attr1 = make_dict(matchObj.group('tag_attrib1'), '=')
        attr2 = make_dict(matchObj.group('tag_attrib2'), '=')
        body = {'url': matchObj.group('tag_body') if pat_val == 3 else ''}
        return {matchObj.group('tag_open'): merge_dicts(attr1, attr2, body)}
    print("No match!!")
    # return an empty dict so callers can still use .keys() and .items()
    return {}
#print(match_pattern(lst))
def in_file(file):
    result = {}
    with open(file, "r") as file:
        data = (file.read().splitlines())
    for d in data:
        if data.index(d) == 0 or data.index(d) == len(data)-1:
            if data.index(d) == 0:
                print(re.sub('<|/|>', '', d))
        elif d:
            lst = []
            dct = {}
            if 'file' in match_pattern(d).keys():
                for i in match_pattern(d).items():
                    if 'file' in result.keys():
                        lst = result['file']
                        lst.append(i[1])
                        dct = {i[0]: lst}
                        result = merge_dicts(result, dct)
                        #print(result['file'])
                    else:
                        dct = {i[0]: [i[1]]}
                        result = merge_dicts(result, dct)
            else:
                result = merge_dicts(result, match_pattern(d))
                print('else', match_pattern(d))
    return result
print(in_file('C:\\test.nfo'))
NOTE: I dropped the top most dictionary from the original post
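An alternative sketch that stays with ElementTree and the standard library: repair the one construct that is not well-formed XML (a tag name containing a space, as in <alt name />) with a small regex, then parse normally. It assumes that such tag names have exactly one space and no attributes, and that real tag names contain no underscore; the names raw, fixed and result are illustrative.

```python
import re
import xml.etree.ElementTree as ET

raw = r'''<title>
    <name>Test Name</name>
    <alt name />
    <file local="C:\file\file1.doc" type="word">http://filestore/file1.doc</file>
    <file local="" type="excel">http://filestore/file2.xls</file>
    <file local="C:\file\file3.xls" type="excel" />
    <file local="" type="ppt" />
</title>'''

# Turn <alt name /> into <alt_name />. Tags with attributes are untouched,
# because the second word would have to be followed directly by '>' or '/>'.
fixed = re.sub(r'<(/?)(\w+) (\w+)(\s*/?)>', r'<\1\2_\3\4>', raw)

root = ET.fromstring(fixed)

result = {'file': []}
for child in root:
    if child.tag == 'file':
        entry = dict(child.attrib)      # local, type
        entry['url'] = child.text or ''  # element text, '' for self-closed tags
        result['file'].append(entry)
    else:
        # Undo the underscore so the key matches the original tag name.
        result[child.tag.replace('_', ' ')] = child.text or ''

print(result)
```

Like the custom parser above, this drops the topmost title dictionary and returns only its contents.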