How can I substitute a regex only once in Python? - python

So right now, re.sub does this:
>>> re.sub(r"DELETE THIS\d+ ", "", "I want to DELETE THIS472 go to DON'T DELETE THIS847 the supermarket")
"I want to go to DON'T the supermarket"
I want it to instead delete only the first instance of "DELETE THISXXX," where XXX is a number, so that the result is
"I want to go to DON'T DELETE THIS847 the supermarket"
The XXX is a number that varies, and so I actually do need a regex. How can I accomplish this?

As the documentation for re.sub(pattern, repl, string, count=0, flags=0) describes, you can pass the count argument:
re.sub(pattern, repl, string[, count, flags])
With a count of 1, only the first match is replaced.

From http://docs.python.org/library/re#re.sub:
The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced. Empty matches for the pattern are replaced only when not adjacent to a previous match, so sub('x*', '-', 'abc') returns '-a-b-c-'.

The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer.
re.sub(pattern, repl, string, count=0, flags=0)
Set count = 1 to only replace the first instance.
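As a minimal sketch of the count argument applied to the question's string: the pattern below (with \d+ for the varying number and a trailing space so no double space is left behind) is an assumption about what the asker wants removed, and the variable names are illustrative only.

```python
import re

text = "I want to DELETE THIS472 go to DON'T DELETE THIS847 the supermarket"

# \d+ matches the varying number; count=1 stops after the first replacement.
once = re.sub(r"DELETE THIS\d+ ", "", text, count=1)

# With the default count=0, every occurrence is replaced.
every = re.sub(r"DELETE THIS\d+ ", "", text)

print(once)
print(every)
```

Passing count by keyword keeps the call readable and avoids confusing it with the flags argument.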

I think your phrasing, "first instance," caused everyone else to answer in the direction of count, but if you meant that you want to delete a phrase only if it fully matches a phrase you seek, then first you have to define what you mean by a "phrase", e.g. non-lower-case characters:
DON'T DELETE THIS
In which case, you can do something like this:
(?<![^a-z]+)\s+DELETE THIS\s+(?![^a-z]+)
Python's re module only accepts fixed-width lookbehind assertions, so remove the first + when using this pattern there.

You can use str.replace() for this when the text to remove is a fixed string:
In [9]: strs="I want to DELETE THIS go to DON'T DELETE THIS the supermarket"
In [10]: strs.replace("DELETE THIS","",1) # here 1 is the count
Out[10]: "I want to go to DON'T DELETE THIS the supermarket"

Related

Check for very specified numbers padding

I am trying to check a list of items in my scene to see whether their names end in a 3-digit (version) padding, e.g. test_model_001. Items that do will pass, and items that do not will be handled by a certain function.
Suppose if my list of items is as follows:
test_model_01
test_romeo_005
test_charlie_rig
I tried and used the following code:
import re
eg_list = ['test_model_01', 'test_romeo_005', 'test_charlie_rig']
for item in eg_list:
    mo = re.sub('.*?([0-9]*)$', r'\1', item)
    print(mo)
It returns 01 and 005, whereas I was hoping to get just 005. How do I make it check for exactly 3 digits of padding? Also, is it possible to include the underscore in the check? Is this the best way?
You can use {3} to ask for exactly 3 consecutive digits and prepend an underscore:
eg_list = ['test_model_01', 'test_romeo_005', 'test_charlie_rig']
for item in eg_list:
    match = re.search(r'_([0-9]{3})$', item)
    if match:
        print(match.group(1))
This would print 005 only.
The asterisk after the [0-9] specification means that you are expecting any number of occurrences of the digits 0-9, including none. Technically this expression matches test_charlie_rig as well. You can test that out here http://pythex.org/
Replacing the asterisk with a {3} says that you want 3 digits.
.*?([0-9]{3})$
If you know your format will be close to the examples you showed, you can be a bit more explicit with the regex pattern to prevent even more accidental matches:
^.+_(\d{3})$
for item in eg_list:
    if re.match(r".*_\d{3}$", item):
        print(item.split('_')[-1])
This matches anything which ends in:
_ an underscore, \d a digit, {3} three of them, and $ the end of the line.
When printing the item, we split it on _ underscores and take the last value, index [-1].
The reason .*?([0-9]*)$ doesn't work is because [0-9]* matches 0 or more times, so it can match nothing. This means it will also match .*?$, which will match any string.
See the example on regex101.com
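To make the difference concrete, here is a small sketch comparing the loose [0-9]* check with the exact three-digit check from the answers above. The names loose, strict, padded and unpadded are illustrative, not from the original posts.

```python
import re

eg_list = ['test_model_01', 'test_romeo_005', 'test_charlie_rig']

# [0-9]* may match zero digits, so every name "matches" at its end.
loose = [bool(re.search(r'[0-9]*$', name)) for name in eg_list]

# Exactly three digits, preceded by an underscore, at the end of the name.
strict = re.compile(r'_([0-9]{3})$')
padded = [name for name in eg_list if strict.search(name)]
unpadded = [name for name in eg_list if not strict.search(name)]

print(loose)
print(padded)
print(unpadded)
```

The unpadded list is the set of items that would be handed to the follow-up function mentioned in the question.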
I usually don't like regex unless needed. This should work and be more readable.
def name_validator(name, padding_count=3):
    number = name.split("_")[-1]
    if number.isdigit() and number == number.zfill(padding_count):
        return True
    return False
name_validator("test_model_01") # Returns False
name_validator("test_romeo_005") # Returns True
name_validator("test_charlie_rig") # Returns False

Python regular expression to replace everything but specific words

I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print(re.sub(x, '_', s))
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since ^ only negates inside brackets [], and a bracket expression matches single characters rather than words, this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.
Not quite as easy as it first appears, since the only simple "not" in REs is ^ inside [ ], which only matches one character (as you found). Here is my solution:
import re
def subit(m):
    stuff, word = m.groups()
    return ("_" * len(stuff)) + word
s = 'I am going home now, thank you.' # string to modify
print(re.sub(r'(.+?)(going|you|$)', subit, s))
Gives:
_____going_________________you_
To explain. The RE itself (I always use raw strings) matches one or more of any character (.+) but is non-greedy (?). This is captured in the first parentheses group (the brackets). That is followed by either "going" or "you" or the end-of-line ($).
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. The first group we just need the length of, since we are replacing each character with an underscore. The returned string is substituted for that matching the pattern.
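The same callback idea can be wrapped in a reusable helper. This is only a sketch: mask_except and its parameters are hypothetical names, and it reuses the pattern from the answer above with the keep-words escaped via re.escape.

```python
import re

def mask_except(text, keep_words, mask="_"):
    # Build an alternation of the literal words to keep.
    keep = "|".join(re.escape(word) for word in keep_words)
    pattern = re.compile(r"(.+?)(%s|$)" % keep)

    def _mask(m):
        # Group 1 is the text to hide, group 2 the kept word (or '' at the end).
        stuff, word = m.groups()
        return mask * len(stuff) + word

    return pattern.sub(_mask, text)

result = mask_except("I am going home now, thank you.", ["going", "you"])
print(result)
```

The replacement function receives one match object per match, so the number of underscores always equals the length of the hidden text.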
Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea: regex engines (most are traditional NFAs) analyze strings character by character, so to exclude whole words with a negative lookahead you need to define the allowed matches as words, using word boundaries. And because sub replaces each match with the replacement string, you can't just pass '_': a stretch like "I am" is matched as three separate pieces ('I', ' ', 'am') and would become only three underscores. Instead, pass a function as the second argument of sub and multiply '_' by the length of the matched string.

Extract substring using python re.match

I have a string as
sg_ts_feature_name_01_some_xyz
In this, I want to extract the two words that come after the pattern sg_ts, with the underscore separating them.
It must be,
feature_name
This regex,
st = 'sg_ts_my_feature_01'
a = re.match('sg_ts_([a-zA-Z_]*)_*', st)
print(a.group())
returns,
sg_ts_my_feature_
whereas I expect,
my_feature
The problem is that you are asking for the whole match, not just the capture group. From the manual:
group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group.
and you asked for a.group() which is equivalent to a.group(0) which is the whole match. Asking for a.group(1) will give you only the capture group in the parentheses.
You can ask for the group surrounded by the parentheses, 'a.group(1)', which returns
'my_feature_'
In addition, if your string is always in this form, you could also anchor the pattern with the end-of-string character $ and make the inner match lazy instead of greedy (so it doesn't swallow the _).
a = re.match(r'sg_ts_([a-zA-Z_]*?)[_0-9]*$', st)
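A short sketch of the difference between group(0) and group(1), plus one possible pattern for both strings in the question. The two_words pattern is an assumption: it reads "two words" as two runs of letters joined by a single underscore.

```python
import re

st = 'sg_ts_my_feature_01'
m = re.match(r'sg_ts_([a-zA-Z_]*)_*', st)

whole = m.group(0)     # the entire match, same as m.group()
captured = m.group(1)  # only the parenthesized part

# Assumed reading of "two words": letters, one underscore, letters.
two_words = re.compile(r'sg_ts_([a-zA-Z]+_[a-zA-Z]+)')
first = two_words.match(st).group(1)
second = two_words.match('sg_ts_feature_name_01_some_xyz').group(1)

print(whole, captured, first, second)
```

Because the character class in two_words excludes the underscore, the capture stops after the second word instead of swallowing the trailing _.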

Regex match even number of letters

I need to match an expression in Python with regular expressions that only matches even number of letter occurrences. For example:
AAA # no match
AA # match
fsfaAAasdf # match
sAfA # match
sdAAewAsA # match
AeAiA # no match
An even number of As SHOULD match.
Try this regular expression:
^[^A]*((AA)+[^A]*)*$
And if the As don’t need to be consecutive:
^[^A]*(A[^A]*A[^A]*)*$
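As a quick check, the two patterns above can be run against the six strings from the question. The variable names are illustrative; note that only the second pattern accepts sAfA and sdAAewAsA, where the As are not all in consecutive pairs.

```python
import re

# As must appear in consecutive runs of even length.
consecutive_pairs = re.compile(r'^[^A]*((AA)+[^A]*)*$')
# The total number of As must be even; they may appear anywhere.
even_total = re.compile(r'^[^A]*(A[^A]*A[^A]*)*$')

cases = ['AAA', 'AA', 'fsfaAAasdf', 'sAfA', 'sdAAewAsA', 'AeAiA']
results = {
    s: (bool(consecutive_pairs.match(s)), bool(even_total.match(s)))
    for s in cases
}

for s, flags in results.items():
    print(s, flags)
```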
This searches for a block with an odd number of A's. If it finds one, the string should be rejected:
(?<!A)A(AA)*(?!A)
If I understand correctly, the Python code should look like:
if re.search("(?<!A)A(AA)*(?!A)", "AeAAi"):
    print("fail")
Why work so hard coming up with a hard to read pattern? Just search for all occurrences of the pattern and count how many you find.
len(re.findall("A", "AbcAbcAbcA")) % 2 == 0
That should be instantly understandable by all experienced programmers, whereas a pattern like "(?
Simple is better.
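The counting approach can be sketched as a small helper. has_even_count is a hypothetical name, and it assumes that zero occurrences count as even; str.count is used here instead of re.findall because the target is a single literal character.

```python
def has_even_count(text, letter="A"):
    # str.count returns the number of non-overlapping occurrences.
    return text.count(letter) % 2 == 0

checks = {s: has_even_count(s) for s in ["AbcAbcAbcA", "sAfA", "AeAiA", "AAA"]}
print(checks)
```

If at least two occurrences are required, add a check that the count is non-zero.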
'A*' means match any number of A's. Even 0.
Here's how to match a string with an even number of a's, upper or lower:
re.compile(r'''
^
[^a]*
(
(
a[^a]*
){2}
# if there must be at least 2 (not just 0), change the
# '*' on the following line to '+'
)*
$
''',re.IGNORECASE|re.VERBOSE)
You probably are using a as an example. If you want to match a specific character other than a, replace a with %s and then insert
[...]
$
'''%( other_char, other_char, other_char )
[...]
'*' means 0 or more occurrences
'AA' should do the trick.
The question is whether you want the thing to match 'AAA'. If not, you would have to do something like:
r = re.compile('(^|[^A])(AA)+(?!A)')
r.search(p)
That matches an even (and only even) number of consecutive 'A's.
Now if you want to match 'if there is any even number of subsequent letters', this would do the trick:
re.compile(r'(.)\1')
However, this wouldn't exclude the 'odd' occurrences. But it is not clear from your question if you really want that.
Update:
This works for your test cases:
re.compile('^([^A]*)AA([^A]|AA)*$')
First of all, note that /A*/ matches the empty string.
Secondly, there are some things that you just can't do with regular expressions. This'll be a lot easier if you just walk through the string and count up all occurrences of the letter you're looking for.
A* means match "A" zero or more times.
For an even number of "A", try: (AA)+
It's impossible to count arbitrarily using regular expressions, for example to make sure that you have matching parentheses. To count you need 'memory', which requires something at least as strong as a pushdown automaton, although in this case you can use the regular expression that @Gumbo provided.
The suggestion to use finditer is the best workaround for the general case.

Check String for / against Characters in Python

I need to be able to tell the difference between a string that can contain letters and numbers, and a string that can contain numbers, colons and hyphens.
>>> def checkString(s):
...     pattern = r'[-:0-9]'
...     if re.search(pattern, s):
...         print("Matches pattern.")
...     else:
...         print("Does not match pattern.")
# 3 numbers separated by colons: 12, 24 and minus 14
>>> s1 = "12:24:-14"
# String containing letters and string containing letters/numbers.
>>> s2 = "hello"
>>> s3 = "hello2"
When I run the checkString method on each of the above strings:
>>> checkString(s1)
Matches pattern.
>>> checkString(s2)
Does not match pattern.
>>> checkString(s3)
Matches pattern.
s3 is the only one that doesn't do what I want. I'd like to be able to create a regex that allows numbers, colons and hyphens, but excludes EVERYTHING else (or just alphabetical characters). Can anyone point me in the right direction?
EDIT:
Therefore, I need a regex that would accept:
229 // number
187:657 //two numbers
187:678:-765 // two pos and 1 neg numbers
and decline:
Car //characters
Car2 //characters and numbers
You need to match the whole string, not a single character as you do at the moment:
>>> re.search('^[-:0-9]+$', "12:24:-14")
<_sre.SRE_Match object at 0x01013758>
>>> re.search('^[-:0-9]+$', "hello")
>>> re.search('^[-:0-9]+$', "hello2")
To explain regex:
within square brackets (character class): match digits 0 to 9, hyphen and colon, only once.
+ is a quantifier, that indicates that preceding expression should be matched as many times as possible but at least once.
^ and $ match start and end of the string. For one-line strings they're equivalent to \A and \Z.
This way you restrict the content of the whole string to be at least one character long and to contain any permutation of characters from the character class. What you were doing beforehand was searching for a single character from the character class within the subject string. This is why s3, which contains a digit, matched.
SilentGhost's answer is pretty good, but take note that it would also match strings like "---::::" with no digits at all.
I think you're looking for something like this:
'^(-?\d+:)*-?\d+$'
^ Matches the beginning of the line.
(-?\d+:)* Possible - sign, at least one digit, a colon. That whole pattern 0 or many times.
-?\d+ Then the pattern again, at least once, without the colon
$ The end of the line
This will better match the strings you describe.
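A brief sketch running this stricter pattern against the strings from the question. SIGNED_NUMBERS and is_number_list are illustrative names, not part of the original answers.

```python
import re

# Optional minus sign and digits, repeated with colons in between.
SIGNED_NUMBERS = re.compile(r'^(-?\d+:)*-?\d+$')

def is_number_list(s):
    return SIGNED_NUMBERS.match(s) is not None

accepted = ["229", "187:657", "187:678:-765", "12:24:-14"]
rejected = ["Car", "Car2", "---::::", ""]

for s in accepted + rejected:
    print(repr(s), is_number_list(s))
```

Unlike the plain character-class version, this rejects strings such as "---::::" that contain no digits at all.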
pattern = r'\A([^-:0-9]+|[A-Za-z0-9])\Z'
Your regular expression is almost fine; you just need to make it match the whole string. Also, as a commenter pointed out, you don't really need a raw string (the r prefix on the string) in this case. Voila:
def checkString(s):
    if re.match('[-:0-9]+$', s):
        print("Matches pattern.")
    else:
        print("Does not match pattern.")
The '+' means "match one or more of the previous expression". (This will make checkString report no match on an empty string. If you want an empty string to match, change the '+' to a '*'.) The '$' means "match the end of the string".
re.match means "the string must match the regular expression starting at the first character"; re.search means "the regular expression can match a sequence anywhere inside the string".
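The difference is easy to see with the question's hello2 string. This is a small sketch with illustrative variable names; re.fullmatch (available since Python 3.4) is shown as an alternative that needs no explicit $ anchor.

```python
import re

pattern = '[-:0-9]+$'

# re.match must succeed starting at index 0, so the letters block it.
m_match = re.match(pattern, 'hello2')

# re.search may start anywhere, so it finds the trailing digit.
m_search = re.search(pattern, 'hello2')

# re.fullmatch requires the whole string to match the pattern.
m_full = re.fullmatch('[-:0-9]+', '12:24:-14')

print(m_match, m_search.group(), m_full.group())
```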
Also, if you like premature optimization (and who doesn't!), note that re.match has to look the pattern up on every call. The re module caches recently compiled patterns, so the saving is small, but this version compiles the regular expression explicitly, only once:
__checkString_re = re.compile('[-:0-9]+$')
def checkString(s):
    if __checkString_re.match(s):
        print("Matches pattern.")
    else:
        print("Does not match pattern.")
