How to ignore \n in regular expressions in python?

How to ignore \n in regular expressions in python? - python

So i have a regex telling if a number is integer.
regex = '^(0|[1-9][0-9]*)$'
import re
bool(re.search(regex, '42\n'))
returns True, and it is not supposed to?
Where does the problem come from ?

From the documentation:
'$'
Matches the end of the string or just before the newline at the end of the string
Try \Z instead.
Also, any time you find yourself writing a regular expression that starts with ^ or \A and ends with $ or \Z, if your intent is to only match the entire string, you should probably use re.fullmatch() instead of re.search() (and omit the boundary markers from the regex). Or if you're using a version of Python that's too old to have re.fullmatch(), (you really need to upgrade but) you can use re.match() and omit the beginning-of-string boundary marker.

regex ahould be regex = '\b^(0|[1-9][0-9]*)$\b'

The regex in the question matches ->start of line, numbers and end of line. And the given string matches that, thats why it is returning true. If you want it to return False when there is a number present, you can use "!" to indicate NOT.
Refer https://docs.python.org/2/library/re.html
regex = '!(0|[1-9][0-9]*)$'
bool(re.search(regex, '42\n')) => (Returns false)

Yeah, that $ matching one \n before the end is kind of trap/inconsistency. Check out my list of regex traps for python: http://www.cofoh.com/advanced-regex-tutorial-python/traps

Related

Regex search fail when input has line breaks [duplicate]

This question already has an answer here:
Why is Python Regex Wildcard only matching newLine
(1 answer)
Closed 1 year ago.
The following regular expression is not returning any match:
import re
regex = '.*match.*fail.*'
pattern = re.compile(regex)
text = '\ntestmatch\ntestfail'
match = pattern.search(text)
I managed to solve the problem by changing text to repr(text) or setting text as a raw string with r'\ntestmatch\ntestfail', but I'm not sure if these are the best approaches. What is the best way to solve this problem?

Using repr or raw string on a target string is a bad idea!
By doing that newline characters are treated as literal '\n'.
This is likely to cause unexpected behavior on other test cases.
The real problem is that . matches any character EXCEPT newline.
If you want to match everything, replace . with [\s\S].
This means "whitespace or not whitespace" = "anything".
Using other character groups like [\w\W] also works,
and it is more efficient for adding exception just for newline.
One more thing, it is a good practice to use raw string in pattern string(not match target).
This will eliminate the need to escape every characters that has special meaning in normal python strings.

You could add it as an or, but make sure you \ in the regex string, so regex actually gets the \n and not a actual newline.
Something like this:
regex = '.*match(.|\\n)*fail.*'
This would match anything from the last \n to match, then any mix or number of \n until testfail. You can change this how you want, but the idea is the same. Put what you want into a grouping, and then use | as an or.
On the left is what this regex pattern matched from your example.

How to replace '..' and '?.' with single periods and question marks in pandas? df['column'].str.replace not working

This is a follow up to this SO post which gives a solution to replace text in a string column
How to replace text in a column of a Pandas dataframe?
df['range'] = df['range'].str.replace(',','-')
However, this doesn't seem to work with double periods or a question mark followed by a period
testList = ['this is a.. test stence', 'for which is ?. was a time']
testDf = pd.DataFrame(testList, columns=['strings'])
testDf['strings'].str.replace('..', '.').head()
results in
0 ...........e
1 .............
Name: strings, dtype: object
and
testDf['strings'].str.replace('?.', '?').head()
results in
error: nothing to repeat at position 0

Add regex=False parameter, because as you can see in the docs, regex it's by default True:
-regex bool, default True
Determines if assumes the passed-in pattern is a regular expression:
If True, assumes the passed-in pattern is a regular expression.
And ? . are special characters in regular expressions.
So, one way to do it without regex will be this double replacing:
testDf['strings'].str.replace('..', '.',regex=False).str.replace('?.', '?',regex=False)
Output:
strings
0 this is a. test stence
1 for which is ? was a time

Replace using regular expression. In this case, replace any sepcial character '.' followed immediately by white space. This is abit curly, I advice you go with #Mark Reed answer.
testDf.replace(regex=r'([.](?=\s))', value=r'')
strings
0 this is a. test stence
1 for which is ? was a time

str.replace() works with a Regex where . is a special character which denotes "any" character. If you want a literal dot, you need to escape it: "\.". Same for other special Regex characters like ?.

First, be aware that the Pandas replace method is different from the standard Python one, which operates only on fixed strings. The Pandas one can behave as either the regular string.replace or re.sub (the regular-expression substitute method), depending on the value of a flag, and the default is to act like re.sub. So you need to treat your first argument as a regular expression. That means you do have to change the string, but it also has the benefit of allowing you to do both substitutions in a single call.
A regular expression isn't a string to be searched for literally, but a pattern that acts as instructions telling Python what to look for. Most characters just ask Python to match themselves, but some are special, and both . and ? happen to be in the special category.
The easiest thing to do is to use a character class to match either . or ? followed by a period, and remember which one it was so that it can be included in the replacement, just without the following period. That looks like this:
testDF.replace(regex=r'([.?])\.', value=r'\1')
The [.?] means "match either a period or a question mark"; since they're inside the [...], those normally-special characters don't need to be escaped. The parentheses around the square brackets tell Python to remember which of those two characters is the one it actually found. The next thing that has to be there in order to match is the period you're trying to get rid of, which has to be escaped with a backslash because this one's not inside [...].
In the replacement, the special sequence \1 means "whatever you found that matched the pattern between the first set of parentheses", so that's either the period or question mark. Since that's the entire replacement, the following period is removed.
Now, you'll notice I used raw strings (r'...') for both; that keeps Python from doing its own interpretation of the backslashes before replace can. If the replacement were just '\1' without the r it would replace them with character code 1 (control-A) instead of the first matched group.

To replace both the ? and . at the same time you can separate by | (the regex OR operator).
testDf['strings'].str.replace('\?.|\..', '.')
Prefix the .. with a \, because you need to escape as . is a regex character:
testDf['strings'].str.replace('\..', '.')
You can do the same with the ?, which is another regex character.
testDf['strings'].str.replace('\?.', '.')

re.fullmatch equivalent in pandas text handling [duplicate]

I'm trying to check if a string is a number, so the regex "\d+" seemed good. However that regex also fits "78.46.92.168:8000" for some reason, which I do not want, a little bit of code:
class Foo():
_rex = re.compile("\d+")
def bar(self, string):
m = _rex.match(string)
if m != None:
doStuff()
And doStuff() is called when the ip adress is entered. I'm kind of confused, how does "." or ":" match "\d"?

\d+ matches any positive number of digits within your string, so it matches the first 78 and succeeds.
Use ^\d+$.
Or, even better: "78.46.92.168:8000".isdigit()

There are a couple of options in Python to match an entire input with a regex.
Python 2 and 3
In Python 2 and 3, you may use
re.match(r'\d+$') # re.match anchors the match at the start of the string, so $ is what remains to add
or - to avoid matching before the final \n in the string:
re.match(r'\d+\Z') # \Z will only match at the very end of the string
Or the same as above with re.search method requiring the use of ^ / \A start-of-string anchor as it does not anchor the match at the start of the string:
re.search(r'^\d+$')
re.search(r'\A\d+\Z')
Note that \A is an unambiguous string start anchor, its behavior cannot be redefined with any modifiers (re.M / re.MULTILINE can only redefine the ^ and $ behavior).
Python 3
All those cases described in the above section and one more useful method, re.fullmatch (also present in the PyPi regex module):
If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
So, after you compile the regex, just use the appropriate method:
_rex = re.compile("\d+")
if _rex.fullmatch(s):
doStuff()

re.match() always matches from the start of the string (unlike re.search()) but allows the match to end before the end of the string.
Therefore, you need an anchor: _rex.match(r"\d+$") would work.
To be more explicit, you could also use _rex.match(r"^\d+$") (which is redundant) or just drop re.match() altogether and just use _rex.search(r"^\d+$").

\Z matches the end of the string while $ matches the end of the string or just before the newline at the end of the string, and exhibits different behaviour in re.MULTILINE. See the syntax documentation for detailed information.
>>> s="1234\n"
>>> re.search("^\d+\Z",s)
>>> s="1234"
>>> re.search("^\d+\Z",s)
<_sre.SRE_Match object at 0xb762ed40>

Change it from \d+ to ^\d+$

Using anchors in python regex to get exact match

I need to validate a version number consisting of 'v' plus positive int, and nothing else
eg "v4", "v1004"
I have
import re
pattern = "\Av(?=\d+)\W"
m = re.match(pattern, "v303")
if m is None:
print "noMatch"
else:
print "match"
But this doesn't work! Removing the \A and \W will match for v303 but will also match for v30G, for example
Thanks

Pretty straightforward. First, put anchors on your pattern:
"^patternhere$"
Now, let's put together the pattern:
"^v\d+$"
That should do it.

I think you may want \b (word boundary) rather than \A (start of string) and \W (non word character), also you don't need to use lookahead (the (?=...)).
Try: "\bv(\d+)" if you need to capture the int, "\bv\d+" if you don't.
Edit: You probably want to use raw string syntax for Python regexes, r"\bv\d+\b", since "\b" is a backspace character in a regular string.
Edit 2: Since + is "greedy", no trailing \b is necessary or desired.

Simply use
\bv\d+\b
Or enclosed it with ^\bv\d+\b$
to match it entirely..

Python regular expression to match # followed by 0-7 followed by ##

I would like to intercept string starting with \*#\*
followed by a number between 0 and 7
and ending with: ##
so something like \*#\*0##
but I could not find a regex for this

Assuming you want to allow only one # before and two after, I'd do it like this:
r'^(\#{1}([0-7])\#{2})'
It's important to note that Alex's regex will also match things like
###7######
########1###
which may or may not matter.
My regex above matches a string starting with #[0-7]## and ignores the end of the string. You could tack a $ onto the end if you wanted it to match only if that's the entire line.
The first backreference gives you the entire #<number>## string and the second backreference gives you the number inside the #.

None of the above examples are taking into account the *#*
^\*#\*[0-7]##$
Pass : *#*7##
Fail : *#*22324324##
Fail : *#3232#
The ^ character will match the start of the string, \* will match a single asterisk, the # characters do not need to be escape in this example, and finally the [0-7] will only match a single character between 0 and 7.

r'\#[0-7]\#\#'

The regular expression should be like ^#[0-7]##$

As I understand the question, the simplest regular expression you need is:
rex= re.compile(r'^\*#\*([0-7])##$')
The {1} constructs are redundant.
After doing rex.match (or rex.search, but it's not necessary here), .group(1) of the match object contains the digit given.
EDIT: The whole matched string is always available as match.group(0). If all you need is the complete string, drop any parentheses in the regular expression:
rex= re.compile(r'^\*#\*[0-7]##$')

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.