How to escape special regex characters in a string? - python

I normally use re.findall(p, text) to match a pattern, but now I have run into a problem:
I just want p to be matched as a plain string, not as a regex.
For example, p may contain '+' or '*', and I don't want these characters to have their special regex meanings. In other words, I want p to be matched character by character.
In this case p is not known to me in advance, so I can't add '\' to it by hand to escape the special characters.

You can use re.escape:
>>> p = 'foo+*bar'
>>> import re
>>> re.escape(p)
'foo\\+\\*bar'
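To tie this back to the re.findall call in the question, here is a minimal sketch that escapes p before using it as the pattern. The sample text is made up for illustration.

```python
import re

p = 'foo+*bar'                        # literal text that contains regex metacharacters
text = 'blablafoo+*bar123 foo+*bar'   # hypothetical sample text

# re.escape backslash-escapes the metacharacters, so findall matches p literally
matches = re.findall(re.escape(p), text)
print(matches)  # ['foo+*bar', 'foo+*bar']
```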
Or just use string operations to check if p is inside another string:
>>> p in 'blablafoo+*bar123'
True
>>> 'foo+*bar foo+*bar'.count(p)
2
By the way, re.escape is mainly useful if you want to embed p into a larger regex:
>>> re.match(r'\d.*{}.*\d'.format(re.escape(p)), '1 foo+*bar 2')
<_sre.SRE_Match object at 0x7f11e83a31d0>

If you don't need a regex, and just want to test if the pattern is a substring of the string, use:
if pattern in string:
If you want to test at the start or end of the string:
if string.startswith(pattern): # or .endswith(pattern)
See the string methods section of the docs for other string methods.
If you need to know all locations of a substring in a string, use str.find:
offsets = []
offset = string.find(pattern, 0)
while offset != -1:
    offsets.append(offset)
    # start from after the location of the previous match
    offset = string.find(pattern, offset + 1)
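The same loop can be wrapped in a small helper so it is easy to reuse. This is only a sketch; the name find_all is made up here and is not a built-in string method.

```python
def find_all(string, pattern):
    """Return the start offset of every occurrence of pattern in string."""
    offsets = []
    offset = string.find(pattern)
    while offset != -1:
        offsets.append(offset)
        # stepping forward by one character means overlapping matches are found too
        offset = string.find(pattern, offset + 1)
    return offsets

print(find_all('foo+*bar foo+*bar', 'foo+*bar'))  # [0, 9]
print(find_all('aaaa', 'aa'))                     # [0, 1, 2]
```

If you do not want overlapping matches, advance by len(pattern) instead of 1.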

You can use .find on strings. This returns the index of the first occurrence of the "needle" string (or -1 if it's not found), e.g.:
>>> a = 'test string 1+2*3'
>>> a.find('str')
5
>>> a.find('not there')
-1
>>> a.find('1+2*')
12

Related

Replace a substring with two different strings depending on where they are

Let's say I have a string with some asterisks:
myvar = "this is an *italicized* substring"
I want to replace *italicized* with {i}italicized{/i} for the project I'm working on, txt2rpy, but I'm not sure how to have two different substrings being replaced depending on what order they come in.
You can use a regular expression to substitute the pattern as a whole:
re.sub(r'\*(.*?)\*', r'{i}\1{/i}', myvar)
In the regexp:
\* matches a literal * (used twice)
(.*?) matches any number of any (non-newline) characters, as few as possible - it is also in a capture group
In the replacement:
{i} and {/i} are literals
\1 means to put what was in the first (and in this case, only) capture group
This gives:
>>> import re
>>> myvar = "this is an *italicized* substring"
>>> print(re.sub(r'\*(.*?)\*', r'{i}\1{/i}', myvar))
this is an {i}italicized{/i} substring
If you have more than one occurrence of the pattern, that will work also:
myvar = "this is an *italicized* substring, and here is *another* one"
will give
this is an {i}italicized{/i} substring, and here is {i}another{/i} one
You can use re.sub with capture groups for that:
import re
txt = "this is an *italicized* substring"
res = re.sub(r"\*([^*]+)\*", r"{i}\g<1>{/i}", txt)
will have res as:
this is an {i}italicized{/i} substring
This pattern is pretty basic: it matches a literal *, then one or more characters that are not asterisks, then another literal *. The main point here is that we use a capture group to catch the word part.
Then we simply substitute the full match with the word we saved (accessed by \g<1>) surrounded with your wanted characters.
Loop over the string and keep a counter of the asterisks seen so far: replace the odd-numbered asterisks with the first tag and the even-numbered ones with the second tag.
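The counter idea could look roughly like this. It is a sketch, not a tested converter: the function name alternate_tags and its parameters are made up for illustration, and it assumes the asterisks come in balanced pairs.

```python
def alternate_tags(text, marker="*", open_tag="{i}", close_tag="{/i}"):
    parts = []
    count = 0  # number of markers seen so far
    for ch in text:
        if ch == marker:
            count += 1
            # odd-numbered markers open the tag, even-numbered ones close it
            parts.append(open_tag if count % 2 == 1 else close_tag)
        else:
            parts.append(ch)
    return "".join(parts)

print(alternate_tags("this is an *italicized* substring"))
# this is an {i}italicized{/i} substring
```

With an odd number of asterisks the last tag is left unclosed, so real input may need a check on the final count.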
You could use a for loop:
myvar = "this is an *italicized* substring"
positions = []
for x in range(len(myvar)):
    if myvar[x] == "*":
        positions.append(x)
# text before the first *, the text between the two *s, then everything after the second *
inAsteriks = myvar[:positions[0]] + "{i}" + myvar[positions[0]+1:positions[1]] + "{/i}" + myvar[positions[1]+1:]

Extract string with specific format

I'm novice to Python and I am trying to extract a string from another string with specific format, for example:
I have original string: -
--#$_ABC1234-XX12X
I need to extract exactly the string ABC1234 (must include three first characters and followed by four digits).
You can use the curly brace repetition quantifiers {} to match exactly three alphabetic characters and exactly four numeric characters:
>>> from re import search
>>>
>>> string = '---#$_ABC1234-XX12X'
>>> match = search(r'[a-zA-Z]{3}\d{4}', string)
>>> match
<_sre.SRE_Match object; span=(6, 13), match='ABC1234'>
>>> match.group(0) # Use this to get the string that was matched.
'ABC1234'
Explanation of regex:
[a-zA-Z]: Match any letter, upper case or lower case...
{3}: Exactly three times. And...
\d: Any digit character...
{4}: Exactly four times.
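If you need this in more than one place, the pattern can be compiled once and wrapped in a function that also handles the no-match case. A minimal sketch; CODE_RE and extract_code are names invented here.

```python
import re

# three ASCII letters followed by four digits, as in the explanation above
CODE_RE = re.compile(r'[a-zA-Z]{3}\d{4}')

def extract_code(text):
    """Return the first letters-plus-digits code in text, or None if there is none."""
    match = CODE_RE.search(text)
    return match.group(0) if match else None

print(extract_code('---#$_ABC1234-XX12X'))  # ABC1234
print(extract_code('no code here'))         # None
```

Checking for None matters because search returns None when nothing matches, and calling .group() on that raises AttributeError.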
You can make use of the re module in Python:
import re
# string holds the original text, e.g. '---#$_ABC1234-XX12X'
matcher = re.search(r'(?P<matched_string>[a-zA-Z]{3}\d{4})', string)
needed_string = matcher.groupdict()['matched_string']
needed_string will be your desired output.
For the re module refer to: https://docs.python.org/3.4/library/re.html
If you know the exact position of the substring, you can use something like this:
>>> var = "--#$_ABC1234-XX12X"
>>> newstring = var[5:12]
>>> newstring
'ABC1234'
Python strings support slice notation.

Why does the Python regex ".*PATTERN*" match "XXPATTERXX"?

Suppose I want to find "PATTERN" in a string, where "PATTERN" could be anywhere in the string. My first try was *PATTERN*, but this generates an error saying that there is "nothing to repeat", which I can accept so I tried .*PATTERN*. This regex does however not give the expected result, see below
import re
p = re.compile(".*PATTERN*")
s = "XXPATTERXX"
if p.match(s):
    print(s + " match with '.*PATTERN*'")
The result is
XXPATTERXX match with '.*PATTERN*'
Why does "PATTER" match?
Note: I know that I could use .*PATTERN.* to get the expected result, but I am curious to find out why the asterisk on its own fails to give that result.
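Your pattern ends in N*, which matches 0 or more N characters, so PATTER followed by no N at all still matches. The pattern also doesn't say anything about what comes after those N characters, so the trailing XX is simply ignored.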
You could add $ to the pattern to anchor to the end of the input string to disallow the XX:
>>> import re
>>> p = re.compile(".*PATTERN*$")
>>> p.match("XXPATTERXX") is None
True
>>> p.match("XXPATTER") is None
False
>>> p.match("XXPATTER")
<_sre.SRE_Match object at 0x1004627e8>
You may want to look into the different types of anchor. \b may also fit your needs; it matches word boundaries (so between a \w and \W class character, or between \W and \w), or you could use negative look-ahead and look-behinds to disallow other characters around your PATTERN string.
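A quick way to see the difference is to try a few patterns side by side. The sample strings other than XXPATTERXX are made up for illustration.

```python
import re

s = "XXPATTERXX"

# N* means "zero or more N", so PATTER followed by no N still matches
print(bool(re.match(r".*PATTERN*", s)))                  # True

# re.search scans the whole string, so no leading .* is needed
print(bool(re.search(r"PATTERN", s)))                    # False
print(bool(re.search(r"PATTERN", "XXPATTERNXX")))        # True

# \b requires a word boundary on each side of PATTERN
print(bool(re.search(r"\bPATTERN\b", "XXPATTERNXX")))    # False
print(bool(re.search(r"\bPATTERN\b", "XX PATTERN XX")))  # True
```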

Replacing Certain Parts of a String Python

I cannot seem to solve this. I have many different strings, and they are always different. I need to replace the ends of them though, but they are always different lengths. Here is an example of a couple of strings:
string1 = "thisisnumber1(111)"
string2 = "itsraining(22252)"
string3 = "fluffydog(3)"
Now when I print these out it will of course print the following:
thisisnumber1(111)
itsraining(22252)
fluffydog(3)
What I would like it to print though is the following:
thisisnumber1
itsraining
fluffydog
I would like it to remove the part in the parentheses for each string, but I do not know how, since the lengths are always changing. Thank You
You can use str.rsplit for this:
>>> string1 = "thisisnumber1(111)"
>>> string2 = "itsraining(22252)"
>>> string3 = "fluffydog(3)"
>>>
>>> string1.rsplit("(")
['thisisnumber1', '111)']
>>> string1.rsplit("(")[0]
'thisisnumber1'
>>>
>>> string2.rsplit("(")
['itsraining', '22252)']
>>> string2.rsplit("(")[0]
'itsraining'
>>>
>>> string3.rsplit("(")
['fluffydog', '3)']
>>> string3.rsplit("(")[0]
'fluffydog'
>>>
str.rsplit splits the string from right-to-left rather than left-to-right like str.split. So, we split the string from right-to-left on ( and then retrieve the element at index 0 (the first element). This will be everything before the (...) at the end of each string.
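One thing worth knowing: without a maxsplit argument, rsplit splits at every "(", so the right-to-left behaviour only shows up once you limit the number of splits. A small sketch with a made-up string that contains two sets of parentheses:

```python
s = "fluffy(dog)(3)"  # hypothetical input, not from the question

print(s.rsplit("("))        # ['fluffy', 'dog)', '3)']
print(s.rsplit("(", 1))     # ['fluffy(dog)', '3)']
print(s.rsplit("(", 1)[0])  # fluffy(dog)
```

Passing 1 as maxsplit cuts only at the last "(", which keeps any earlier parentheses in the name intact.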
Your other option is to use regular expressions, which can give you more precise control over what you want to get.
import re
regex = r"(.+)\(\d+\)"
print(re.match(regex, string1).groups()[0])  # prints thisisnumber1
print(re.match(regex, string2).groups()[0])  # prints itsraining
print(re.match(regex, string3).groups()[0])  # prints fluffydog
Breakdown of what's happening:
regex = r"(.+)\(\d+\)" is the regular expression, the formula for the string you're trying to find
.+ means match 1 or more character of any kind except newline
\d+ means match 1 or more digit
\( and \) are the "(" and ")" characters
putting .+ in parentheses puts that string sequence in a group, meaning that group of characters is one that you want to be able to access later on. We don't put the sequence \(\d+\) in a group because we don't care about those characters.
re.match(regex, string1).groups() gives every substring in string1 that was part of a group. Since you only want 1 substring, you just access the 0th element.
There's a nice tutorial on regular expressions on Tutorials Point if you want to learn more.
Since you say in a comment:
"all that will be in the parentheses will be numbers"
so you'll always have digits between your parens, I'd recommend taking a look at removing them with the regular expression module:
import re
string1 = "thisisnumber1(111)"
string2 = "itsraining(22252)"
string3 = "fluffydog(3)"
strings = string1, string2, string3
for s in strings:
    s_replaced = re.sub(
        r'''
        \(    # must escape the parens, since these are special characters in regex
        \d+   # one or more digits, 0-9
        \)
        ''',  # this regular expression will be replaced by the next argument
        '',   # replace the above with an empty string
        s,    # the string we're modifying
        flags=re.VERBOSE)  # verbose flag allows us to comment regex clearly; must be passed as flags=, since the 4th positional argument is count
    print(s_replaced)
prints:
thisisnumber1
itsraining
fluffydog
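If the number should only be removed when it sits at the very end of the string, the pattern can be anchored with $. A minimal sketch; the function name strip_trailing_number and the second test string are made up here.

```python
import re

def strip_trailing_number(s):
    # \(\d+\) followed by $ only matches a parenthesized number at the end of s
    return re.sub(r'\(\d+\)$', '', s)

print(strip_trailing_number("thisisnumber1(111)"))  # thisisnumber1
print(strip_trailing_number("a(1)b(2)"))            # a(1)b
```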

Regex related to * and + in python

I am new to Python. I don't understand the behaviour of these two programs.
import re
sub="dear"
pat="[aeiou]+"
m=re.search(pat,sub)
print(m.group())
This prints "ea"
import re
sub="dear"
pat="[aeiou]*"
m=re.search(pat,sub)
print(m.group())
This doesn't print anything.
I know + matches 1 or more occurrences and * matches 0 or more occurrences. I am expecting it to print "ea" in both programs, but it doesn't.
Why does this happen?
"This doesn't print anything."
Not exactly. It prints an empty string, which of course you didn't notice because it's not visible. Try using this code instead:
l = re.findall(pat, sub)
print(l)
this will print:
['', 'ea', '', '']
Why this behaviour?
This is because when you use the * quantifier, as in [aeiou]*, the pattern also matches an empty string before every non-matching character, plus the empty string at the end. So, for your string dear, it matches like this:
*d*ea*r* // * where the pattern matches.
All the *'s denote the position of your matches.
d doesn't match the pattern. So match is the empty string before it.
ea matches the pattern. So next match is ea.
r doesn't match the pattern. So the match is empty string before r.
The last empty string is the empty string after r.
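You can see where each of those matches sits by printing the span of every match. This is just a sketch using re.finditer on the same pattern and string; the spans in the comment are what Python 3.7 and later report.

```python
import re

# collect (span, matched text) for every match of the pattern in "dear"
matches = [(m.span(), m.group()) for m in re.finditer(r"[aeiou]*", "dear")]
for span, text in matches:
    print(span, repr(text))
# (0, 0) ''
# (1, 3) 'ea'
# (3, 3) ''
# (4, 4) ''
```

A span whose start and end are equal is an empty match at that position.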
Using [aeiou]*, the pattern matches at the very beginning of the string. You can confirm that using MatchObject.start:
>>> import re
>>> sub="dear"
>>> pat="[aeiou]*"
>>> m=re.search(pat,sub)
>>> m.start()
0
>>> m.end()
0
>>> m.group()
''
+ matches at least one of the character or group before it. [aeiou]+ will thus match at least one of a, e, i, o or u (vowels).
The regex will look everywhere in the string to find the minimum 1 vowel it's looking for and does what you expect it to (it will relentlessly try to get the condition satisfied).
* however means at least 0, which also means it can match nothing. So when the regex engine starts looking for a match at the beginning of the string, it finds no vowel at d, but zero vowels already satisfies the pattern, so it returns an empty match right there, and that is the result you get.
If you had used the string ear, note that you would have ea as match.
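Here is a short sketch that puts the three cases next to each other, using the same patterns as in the question:

```python
import re

star = r"[aeiou]*"
plus = r"[aeiou]+"

# * is satisfied by zero vowels, so search stops at position 0, before the d
print(repr(re.search(star, "dear").group()))  # ''

# with "ear" the string starts with vowels, so * grabs them
print(repr(re.search(star, "ear").group()))   # 'ea'

# + needs at least one vowel, so search moves past the d
print(repr(re.search(plus, "dear").group()))  # 'ea'
```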
