python regular expression replace - python

I'm trying to change a string that contains substrings such as
the</span></p>
<p><span class=font7>currency
to
the currency
The line break is CRLF.
The words before and after the code change. I only want to replace if the second word starts with a lower case letter. The only thing that changes in the code is the digit after 'font'
I tried:
p = re.compile('</span></p>\r\n<p><span class=font\d>([a-z])')
res = p.sub(' \1', data)
but this isn't working
How should I fix this?

Use a lookahead assertion.
p = re.compile(r'</span></p>\r\n<p><span class=font\d>(?=[a-z])')
res = p.sub(' ', data)
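A minimal runnable sketch of the two variants, assuming `data` holds the sample from the question. It shows the lookahead version next to the original capture-group idea; the latter only works once the replacement is a raw string, because in a normal string literal `'\1'` is the control character `\x01` rather than a backreference.

```python
import re

# Sample input assumed from the question (CRLF between the two tags).
data = "the</span></p>\r\n<p><span class=font7>currency"

# Lookahead version: the lowercase letter is checked but not consumed.
lookahead = re.compile(r"</span></p>\r\n<p><span class=font\d>(?=[a-z])")
res_lookahead = lookahead.sub(" ", data)

# Capture-group version: the raw string keeps \1 as a backreference.
capture = re.compile(r"</span></p>\r\n<p><span class=font\d>([a-z])")
res_capture = capture.sub(r" \1", data)

print(res_lookahead)  # the currency
print(res_capture)    # the currency
```

If the second word starts with an upper case letter, neither pattern matches and the string is left unchanged.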

I think you should use the flag re.DOTALL, which makes the dot match any character, including line breaks.
So the first line of your code would become:
p = re.compile(r'</span></p>..<p><span class=font\d>([a-z])', re.DOTALL)
(note the two unescaped dots in place of the CRLF line break).
There is also re.MULTILINE; every time I have a problem like this, one of those two ends up solving it.
Hope it helps.

This :
result = re.sub("(?si)(.*?)</?[A-Z][A-Z0-9]*[^>]*>.*</?[A-Z][A-Z0-9]*[^>]*>(.*)", r"\1 \2", subject)
Applied to :
the</span></p>
<p><span class=font7>currency
Produces :
the currency
However, I would strongly advise against using regex with xml/html/xhtml. This generic regex removes all elements and captures any text before and after them in groups 1 and 2.

Related

Python replace between two chars (no split function)

I am currently investigating a problem where I want to replace something in a string.
For example. I have the following string:
'123.49, 19.30, 02\n'
I only want the first two numbers, like '123.49, 19.30'. The split function is not an option, because I have a lot of data, some with and some without the last number.
I tried something like this:
import re as regex
#result = regex.match(', (.*)\n', string)
result = re.search(', (.*)\\n', string)
print(result.group(1))
This is not working. Can someone help me?
Thanks in advance
You could do something like this:
import re
reg = r'(\d+\.\d+), (\d+\.\d+).*'
match = re.search(reg, your_text)
if match:
    first_num = match.group(1)
    second_num = match.group(2)
Alternatively, add the ^ anchor at the beginning to make sure you only ever take the first two numbers.
import re
string = '123.49, 19.30, 02\n'
pattern = re.compile(r'^(\d*\.?\d*), (\d*\.?\d*)')
result = re.findall(pattern, string)
result
Output:
[('123.49', '19.30')]
In the code you are using import re as regex. If you do that, you would have to use regex.search instead of re.search.
But in this case you can just use re.
If you use , (.*) you would capture all after the first occurrence of , and you are not taking digits into account.
If you want the first 2 numbers as stated in the question, '123.49, 19.30', separated by commas, you can match them without using capture groups:
\b\d+\.\d+,\s*\d+\.\d+\b
Or matching 1 or more repetitions preceded by a comma:
\b\d+\.\d+(?:,\s*\d+\.\d+)+\b
As re.search can also return None, you can first check if there is a result (no need to run re.search twice)
import re
regex = r"\b\d+\.\d+(?:,\s*\d+\.\d+)+\b"
s = "123.49, 19.30, 02"
match = re.search(regex, s)
if match:
    print(match.group())
Output
123.49, 19.30
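As a small sketch of how this could be applied to many lines, some with and some without a third value: the helper name `first_two` and the second sample line are assumptions for illustration, not part of the original question.

```python
import re

# Two numbers (decimal part optional) at the start of the line, comma separated.
FIRST_TWO = re.compile(r"^\s*(\d+(?:\.\d+)?),\s*(\d+(?:\.\d+)?)")

def first_two(line):
    """Return 'a, b' for the first two numbers, or None if the line does not fit."""
    m = FIRST_TWO.match(line)
    return ", ".join(m.groups()) if m else None

print(first_two("123.49, 19.30, 02\n"))  # 123.49, 19.30
print(first_two("123.49, 19.30\n"))      # 123.49, 19.30
```

Anything after the second number is simply ignored, so no special case is needed for the optional last field.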

Regex matching: Case insensitive German words with spaces (Python)

I have a problem where I want to match any number of German words inside [] brackets, ignoring case. The expression should only match spaces and words, nothing else, i.e. no punctuation marks or parentheses.
E.g :
The expression ['über das thema schreibt'] should be matched with ['Über', 'das', 'Thema', 'schreibt']
I have one list with items of the former order and another with the latter order, as long as the words are same, they both should match.
The code I tried with is -
regex = re.findall('[(a-zA-Z_äöüÄÖÜß\s+)]', str(term))
or
re.findall('[(\S\s+)]', str(term))
But they are not working. Kindly help me find a solution
In the simplest form, \w+ works for finding words (it needs the Unicode flag for non-ASCII chars), but since you want them to be within the square brackets (and quotes, I assume) you'd need something a bit more complex:
\[(['\"])((\w+\s?)+)\1\]
\[ and \] are used to match the square brackets
['\"] matches either quote and the \1 makes sure the same quote is on the other end
\w+ captures 1 word. The \s? is for an optional space.
The whole string is in the second group which you can split to get the list
import re
text = "['über das thema schreibt']"
regex = re.compile(r"\[(['\"])((\w+\s?)+)['\"]\]", flags=re.U)
match = regex.match(text)
if match:
    print(match.group(2).split())
(slight edit as \1 did not seem to work in the terminal for me)
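A hedged sketch of why the backreference may have failed: in a normal (non-raw) string literal, `"\1"` is the control character `\x01`, so the regex engine never sees a backreference. With a raw string the original `\1` form should work. The names `pattern` and `words` are illustrative only.

```python
import re

text = "['über das thema schreibt']"

# Raw string: \1 reaches the regex engine as a backreference to the opening quote.
pattern = re.compile(r"\[(['\"])((?:\w+\s?)+)\1\]")

match = pattern.match(text)
words = match.group(2).split() if match else []
print(words)  # ['über', 'das', 'thema', 'schreibt']
```

In Python 3, str patterns are Unicode-aware by default, so `\w` matches `ü` without an extra flag.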
I found the easiest solution to it:
def same_words(list1, list2):
    for a, b in zip(list1, list2):
        reg_a = re.findall(r'[(\w\s+)]', str(a).lower())
        reg_b = re.findall(r'[(\w\s+)]', str(b).lower())
        if reg_a == reg_b:
            return True
        else:
            return False
Updated based on comments to match each word. This simply ignores spaces, single quotes and square braces
import re
text = "['über das thema schreibt']"
re.findall("([a-zA-Z_äöüÄÖÜß]+)", str(text))
# ['über', 'das', 'thema', 'schreibt']
If you are solving a case sensitivity issue, add the regex flag re.IGNORECASE,
like
re.findall(r'[(\S\s+)]', str(term), re.IGNORECASE)
You might need to consider converting them to unicode, if it did not help.
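One possible way to put the pieces together, sketched under the assumption that the two inputs look like the lists in the question: extract only letter runs and compare them after case folding. The helper name `words_of` is hypothetical.

```python
import re

def words_of(term):
    """Return the case-folded words in term, ignoring brackets, quotes and punctuation."""
    # [^\W\d_]+ matches runs of Unicode letters only (no digits, no underscore).
    return [w.casefold() for w in re.findall(r"[^\W\d_]+", str(term))]

a = ['über das thema schreibt']
b = ['Über', 'das', 'Thema', 'schreibt']

print(words_of(a) == words_of(b))  # True
```

`casefold()` is used instead of `lower()` because it also normalizes `ß` to `ss`, which matters when one side is written in capitals.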

Find something that does not match a pattern at the beginning of line

I am using regex in Python to find something at the beginning of a line, before a colon, that does not match the pattern "SCENE". The text looks like this
SCENE:xxxxxxdd\nAQW:xxxxxdd\nSCENE:xxxxxdf\nCER:dddd.ddd\nddd\nDYU:ddddd\nddd\nd\nEOI:ddd\n.
I need to find AQW, CER, DYU, EOI in this case.
I have tried
findall(r"^(?!SCENE)[^:]*", text, re.M)
I get AQW and CER, but I get ddd\nDYU instead of DYU, and ddd\nd\nEOI instead of EOI.
How could I get exactly AQW,CER,DYU,EOI?
You can try this to break your string into substrings and search within those:
import re
line = "SCENE:xxxxxxdd\nAQW:xxxxxdd\nSCENE:xxxxxdf\nCER:dddd.ddd\nddd\nDYU:ddddd\nddd\nd\nEOI:ddd\n."
lines = re.split("\\n([A-Z])", line)
lines = [a+b for a,b in zip(lines[1::2], lines[2::2])]
for line in lines:
    if re.match(r"^(?!SCENE)[^:]*", line):
        print(line.split(":")[0])
result is:
AQW
CER
DYU
EOI
I assume this answer is not the best in terms of performance.
This could probably be simplified more, and I'm assuming that \n in your example string is a literal newline character.
This should match all of your use cases. It starts by looking for a run of capital letters that isn't SCENE directly preceding a :, then it matches every character after the colon that is not followed by a newline and another label ending in :. The final . consumes the last character, which was otherwise not matched because it is directly followed by the negative lookahead; there is probably a way around that.
findall( r"([A-Z]+(?<!SCENE):(?:[\s\S](?!\n[A-Z]+:))+.)", text )
https://regex101.com/r/NwdUcR/2
EDIT: I realize the above may not be exactly what you're looking for. If you're looking to match just the letters before the colon you can use this:
findall( r"([A-Z]+(?<!SCENE)):", text )
I use
findall(r"\n(?!SCENE)(.+?):", text)
which works. The point is that I did not realize I can use parentheses to select what I would like to see in the result.
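A runnable sketch of the same capture-group idea, here anchored with `^` and `re.M` instead of a literal `\n`; the variable names are illustrative. Excluding `\n` from the character class is what keeps continuation lines such as `ddd` out of the result.

```python
import re

text = ("SCENE:xxxxxxdd\nAQW:xxxxxdd\nSCENE:xxxxxdf\nCER:dddd.ddd\nddd\n"
        "DYU:ddddd\nddd\nd\nEOI:ddd\n.")

# ^ with re.M anchors at every line start; [^:\n]+ cannot cross a newline,
# so only text on the same line as the colon is captured.
labels = re.findall(r"^(?!SCENE)([^:\n]+):", text, re.M)
print(labels)  # ['AQW', 'CER', 'DYU', 'EOI']
```

Unlike the `\n` version, this one also considers the very first line of the text, which has no newline in front of it.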
You don't really need regex in this case.
Here is a solution using plain and simple str.split().
s = 'SCENE:xxxxxxdd\nAQW:xxxxxdd\nSCENE:xxxxxdf\nCER:dddd.ddd\nddd\nDYU:ddddd\nddd\nd\nEOI:ddd\n.'
matches = [m.split('\n')[-1] for m in s.split(':') if 'SCENE' not in m]
>>> matches
['AQW', 'CER', 'DYU', 'EOI', '.']
If you want to exclude the last '.', you can use matches = [m.split('\n')[-1] for m in s.split(':') if (('SCENE' not in m) and (m[-1] != '.'))]
or simply matches = matches[:-1]

how to use python regex find matched string?

For the string "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']", I want to find "#..'...'" parts like "#id~'objectnavigator-card-list'" or "#class~'outbound-alert-settings'". But when I use the regex ((#.+)\~(\'.*?\')), it finds "#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings'". How should I modify the regex to find the strings successfully?
Use non-capturing groups and non-greedy quantifiers for the inner parentheses, and match anything except the terminating character, e.g.:
re.findall(r"((?:#[^\~]+)\~(?:\'[^\]]*?\'))", test)
On your test string returns:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
Limit the characters you want to match between the quotes to not match the quote:
>>> re.findall(r'#[a-z]+~\'[-a-z]*\'', x)
I find it's much easier to look for only the characters I know are going to be in a matching section rather than omitting characters from more permissive matches.
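A short runnable sketch of that idea, assuming the test string from the question is stored in `x`. The original `.+` was greedy and ran across the first closing quote; a negated character class avoids that without any non-greedy quantifier.

```python
import re

x = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"

# \w+ covers the attribute name; [^']* stops at the first closing quote.
found = re.findall(r"#\w+~'[^']*'", x)
print(found)
```

This variant accepts any characters except a single quote inside the quotes, which is slightly more permissive than `[-a-z]*`.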
For your current test string's input you can try this pattern:
import re
a = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"
# find everything which begins by '#' and neglect ']'
regex = re.compile(r'(#[^\]]+)')
strings = re.findall(regex, a)
# Or simply:
# strings = re.findall('(#[^\\]]+)', a)
print(strings)
Output:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]

Find and extract two substrings from string

I have some strings (in fact they are lines read from a file). The lines are just copied to some other file, but some of them are "special" and need a different treatment.
These lines have the following syntax:
someText[SUBSTRING1=SUBSTRING2]someMoreText
So, what I want is: When I have a line on which this "mask" can be applied, I want to store SUBSTRING1 and SUBSTRING2 into variables. The braces and the = shall be stripped.
I guess this consists of several tasks:
Decide if a line contains this mask
If yes, get the positions of the substrings
Extract the substrings
I'm sure this is an easy task with regex; however, I'm not used to it. I can write a huge monster function using string manipulation, but I guess this is not the "Python Way" to do this.
Any suggestions on this?
re.search() returns None if it doesn't find a match. \w matches an alphanumeric character or underscore, + means 1 or more. Parentheses indicate the capturing groups.
import re
s = """
bla bla
someText[SUBSTRING1=SUBSTRING2]someMoreText"""
results = {}
for line_num, line in enumerate(s.split('\n')):
    m = re.search(r'\[(\w+)=(\w+)\]', line)
    if m:
        # group(0) is the whole match; the two substrings are groups 1 and 2
        results.update({line_num: {'first': m.group(1), 'second': m.group(2)}})
print(results)
^[^\[\]]*\[([^\]\[=]*)=([^\]\[=]*)\][^\]\[]*$
You can try this. Group 1 and Group 2 hold the two strings you want. See demo.
https://regex101.com/r/pT4tM5/26
import re
p = re.compile(r'^[^\[\]]*\[([^\]\[=]*)=([^\]\[=]*)\][^\]\[]*$', re.MULTILINE)
test_str = "someText[SUBSTRING1=SUBSTRING2]someMoreText\nsomeText[SUBSTRING1=SUBSTRING2someMoreText\nsomeText[SUBSTRING1=SUBSTRING2]someMoreText"
re.findall(p, test_str)
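The three tasks from the question (detect the mask, locate the substrings, extract them) collapse into a single search. A minimal sketch, where the helper name `parse_line` and the sample lines are assumptions for illustration:

```python
import re

# One bracketed KEY=VALUE pair; neither part may contain brackets or '='.
MASK = re.compile(r"\[([^\[\]=]*)=([^\[\]=]*)\]")

def parse_line(line):
    """Return (substring1, substring2) for a special line, or None for a normal one."""
    m = MASK.search(line)
    if m is None:
        return None
    return m.group(1), m.group(2)

for line in ["bla bla", "someText[SUBSTRING1=SUBSTRING2]someMoreText"]:
    print(repr(line), "->", parse_line(line))
```

Lines for which `parse_line` returns None can be copied to the output file unchanged; the others get the different treatment.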
