Python: Replacing strings [duplicate] - python

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 9 years ago.
I'm iterating through pages and I'd like to modify lines containing
<span class="font16"></span>
How can I correct the code below?
text = re.sub(r'<span class="font(.*)"></span><span', r'<span class="font\1"> </span><span', text)

The pattern .* is greedy: it matches as much as it can, up to the end of the line, and backtracks only as far as needed for the rest of the pattern to match. The captured group can therefore look like this:
16"></span>....
which isn't what you want. Use a pattern that stops at the first " (a " cannot appear inside an attribute value that is itself quoted with "):
r'<span class="font([^"]+)"></span><span'
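To make the difference concrete, here is a minimal sketch comparing the two patterns. The sample line is hypothetical (the original page content is not shown), and the variable names are only for illustration.

```python
import re

# Hypothetical sample line: two empty font spans and one non-empty span.
text = ('<span class="font16"></span><span class="font12">a</span>'
        '<span class="font9"></span><span>b</span>')

greedy_pattern = r'<span class="font(.*)"></span><span'
fixed_pattern = r'<span class="font([^"]+)"></span><span'
replacement = r'<span class="font\1"> </span><span'

# The greedy group runs on to the last possible '"></span><span'.
greedy_group = re.search(greedy_pattern, text).group(1)

# The negated class cannot cross a quote, so it captures only the font size.
fixed_group = re.search(fixed_pattern, text).group(1)

# With the fixed pattern, every empty span gets a space inserted.
fixed = re.sub(fixed_pattern, replacement, text)

print(greedy_group)
print(fixed_group)
print(fixed)
```

Because the pattern also consumes the following `<span`, two empty spans that sit directly next to each other would not both be replaced in a single pass; that case is outside what this sketch covers.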

Related

Regex not finding contents of multi-line XML tag with Python [duplicate]

This question already has answers here:
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
(10 answers)
How to parse XML and get instances of a particular node attribute?
(19 answers)
Closed 24 days ago.
Cannot get multiline XML tag with line break and tabs with regex
Data I am parsing for the contents of all tags: [image]
I am using
for line in lines:
if re.search('\n.*\n</data>
I'm getting no results.
I tried using \t, \s and \n because there are three tabs before each line and a line break after each.
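One likely cause, assuming lines comes from iterating over the file line by line: each element then holds a single line, so a pattern that needs two line breaks can never match. The sketch below searches the whole text instead. The sample data is hypothetical, since the original was only shown as an image, and only the data tag name is taken from the question.

```python
import re

# Hypothetical sample: the <data> layout is an assumption.
text = "<data>\n\t\t\tfirst line\n\t\t\tsecond line\n</data>\n"

# Line by line: each element ends with at most one newline,
# so a pattern containing two '\n' finds nothing.
lines = text.splitlines(keepends=True)
per_line_hits = [line for line in lines if re.search(r'\n.*\n</data>', line)]

# Whole text: re.DOTALL lets '.' cross line breaks,
# and '.*?' stops at the first closing tag.
contents = re.findall(r'<data>(.*?)</data>', text, flags=re.DOTALL)

print(per_line_hits)
print(contents)
```

For anything beyond a quick extraction, an XML parser is usually the more robust choice, as the linked duplicate questions suggest.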

Matching an empty paragraph at the end of HTML text [duplicate]

This question already has answers here:
What is the difference between re.search and re.match?
(9 answers)
Closed 2 years ago.
I have written the following pattern to match an empty paragraph at the end of HTML:
https://regex101.com/r/6TNgUV/1
But when I try the following Python code, the result is None
html_desc = '</span><p></p><p></p>'
res = re.match('(<p>){1}(\s)*(<br>|<br\/>){0,9}(\s)*(<\/p>){1}(\s)*$', html_desc)
# returns None
I am not able to understand the issue.
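re.match only tries to match at the very beginning of the string. Your HTML string starts with </span>, not <p>, so the pattern cannot match there and re.match returns None. Use re.search() instead of re.match(); it scans the string for the first position where the pattern matches.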
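A short sketch of the difference, using the string and pattern from the question (written as a raw string here; the variable names are only for illustration):

```python
import re

html_desc = '</span><p></p><p></p>'
pattern = r'(<p>){1}(\s)*(<br>|<br\/>){0,9}(\s)*(<\/p>){1}(\s)*$'

# re.match is anchored at index 0, where the string has '</span>'.
at_start = re.match(pattern, html_desc)

# re.search tries every start position until the pattern fits.
anywhere = re.search(pattern, html_desc)

print(at_start)
print(anywhere.group(0), anywhere.start())
```

The trailing $ in the pattern is what restricts the result to the last empty paragraph rather than the first one.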

Python regex find everything between substring and first space [duplicate]

This question already has answers here:
Python non-greedy regexes
(7 answers)
Closed 2 years ago.
I have a string string = "radios label="Does the command/question above meet the Rules to the left?" name="tq_utt_test" validates="required" gold="true" aggregation="agg"" and I want to be able to extract the substring within the "name". So in this case I want to extract "tq_utt_test" because it is the substring inside name.
I've tried the regex re.findall('name=(.*)\s', string), which I thought would extract everything after the substring name= and before the first space. But it actually returned "tq_utt_test" validates="required" gold="true". So it seems to return everything between name= and the last space, instead of everything between name= and the first space.
Is there a way to tweak this regex so that it returns everything after name= and before the first space?
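.* is greedy, so it runs to the last space it can reach. Use a negated character class so the group cannot contain a space: re.findall(r'name=([^ ]*)\s', string)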
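As a sketch, the negated class and the non-greedy quantifier from the linked duplicate give the same result on the string from the question. The third pattern, which drops the surrounding quotes, is an extra suggestion and not part of the original answer.

```python
import re

string = ('radios label="Does the command/question above meet the Rules to the left?" '
          'name="tq_utt_test" validates="required" gold="true" aggregation="agg"')

# Negated class: the group cannot contain a space.
negated = re.findall(r'name=([^ ]*)\s', string)

# Non-greedy quantifier: the group stops at the first whitespace.
lazy = re.findall(r'name=(.*?)\s', string)

# Assumed variant: match the quotes outside the group to get the bare value.
unquoted = re.findall(r'name="([^"]*)"', string)

print(negated, lazy, unquoted)
```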

Finding latex class name using python regex [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 4 years ago.
I want to use python regex to find the document class in a latex document.
A latex file contains \documentclass{myclass} somewhere near the top. I want to find myclass using regex.
This is what I've tried so far:
latex_text = "blank /documentclass{myclass} words, more text /documentclassdoc{11} more words"
s=re.search(r'/documentclass{(?P<class_name>.*)}', latex_text)
It matches: myclass} words, more text /documentclassdoc{11
How can I change it to match only myclass? It should also stop searching after it finds a match, as the document can get quite long.
I know the file should only have one documentclass, but I want to handle the case where there is more than 1 as well.
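import re
latex_text = "blank /documentclass{myclass} words, more text /documentclassdoc{11} more words"
# .*? is non-greedy, so the group stops at the first closing brace;
# group(1) returns only the captured class name, and re.search stops at the first match.
print(re.search(r'/documentclass\{(.*?)\}', latex_text).group(1))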
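The test string above uses a forward slash, while a real LaTeX file uses a backslash. A minimal sketch for that case follows; the helper name find_document_class and the handling of an optional [options] argument are assumptions for illustration, not part of the original answer.

```python
import re

# Assumption: the class may be preceded by an optional [options] argument.
DOCCLASS = re.compile(r'\\documentclass(?:\[[^\]]*\])?\{([^}]*)\}')

def find_document_class(latex_text):
    """Return the first document class name, or None if none is found."""
    match = DOCCLASS.search(latex_text)  # search() returns at the first match
    return match.group(1) if match else None

sample = r"\documentclass[12pt]{myclass} text \documentclassdoc{11} more"
print(find_document_class(sample))
print(find_document_class(r"only \documentclassdoc{11} here"))
```

Using [^}]* instead of .*? means the group can never run past a closing brace, and returning None avoids an AttributeError when the file has no \documentclass line.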

regex : ignore several downstream xml tags [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 7 years ago.
I need to extract the content of an XML document using only regex, while ignoring certain subtags.
The input looks like this:
<firstTag>k</firstTag><secondTag>jkjk</secondTag>
<ignoreTag><subIgnoreTag>j</subIgnoreTage>...</ignoreTag>
<ignoreTag><subIgnoreTag>j</subIgnoreTage>...</ignoreTag>
<thirdTage>3<thirdTag>...
I would like to get the following:
<firstTag>k</firstTag><secondTag>jkjk</secondTag>
<thirdTage>3<thirdTag>...
I've tried this:
(?P<test>.*)<ignoreTag>
to see if I can get the first part at least, but it is only ignoring the last occurrence of ignoreTag...
import re
xml = """<firstTag>k</firstTag><secondTag>jkjk</secondTag>
<ignoreTag><subIgnoreTag>j</subIgnoreTage>...</ignoreTag>
<ignoreTag><subIgnoreTag>j</subIgnoreTage>...</ignoreTag>
<thirdTage>3<thirdTag>"""
print(re.sub("<ignoreTag>.*?</ignoreTag>\n?", '', xml))
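The pattern above relies on each ignoreTag element sitting on a single line, because . does not match a line break by default. If an element can span several lines, a sketch along these lines may help; the helper name strip_ignored and the sample input are hypothetical.

```python
import re

def strip_ignored(xml_text, tag="ignoreTag"):
    """Remove every <tag>...</tag> element, even one spanning several lines."""
    pattern = r"<{0}>.*?</{0}>\n?".format(re.escape(tag))
    # re.DOTALL lets '.' match newlines; '.*?' stops at the first closing tag.
    return re.sub(pattern, "", xml_text, flags=re.DOTALL)

# Hypothetical input with a multi-line element to drop.
sample = "<a>1</a>\n<ignoreTag>\n  <sub>j</sub>\n</ignoreTag>\n<b>2</b>"
print(strip_ignored(sample))
```

This still assumes the ignored tag is never nested inside itself; a regex cannot handle that case reliably.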
