how do I include a boolean AND within a regex? - python

Is there a way to get a single regex to satisfy this condition?
I am looking for a "word" that has three letters from the set MBDPI, in any order,
but which MUST contain an I.
i.e.
re.match("[MBDPI]{3}", foo) and "I" in foo
This gives the correct result (in Python, using the re module), but can I get it from a single regex?
>>> for foo in ("MBI", "MIB", "BIM", "BMI", "IBM", "IMB", "MBD"):
...     print foo,
...     print re.match("[MBDPI]{3}", foo) and "I" in foo
MBI True
MIB True
BIM True
BMI True
IBM True
IMB True
MBD False
With regex I know I can use | as a boolean OR operator, but is there a boolean AND equivalent?
Or maybe I need some kind of lookahead or lookbehind?

You can fake boolean AND by using lookaheads. According to http://www.regular-expressions.info/lookaround2.html, this will work for your case:
"\b(?=[MBDPI]{3}\b)\w*I\w*"

with regex I know I can use | as a boolean OR operator, but is there a boolean AND equivalent?
A and B = not ( not A or not B) = (?![^A]|[^B])
A and B being expressions that actually may have members in common.

"Or" is about the only thing you can do:
\b(I[MBDPI]{2}|[MBDPI]I[MBDPI]|[MBDPI]{2}I)\b
The \b character matches a zero-width word boundary. This ensures you match something that is exactly three characters long.
You're otherwise running into the limits of what a regular language can do.
An alternative is to match:
\b[MBDPI]{3}\b
capture that group and then look for an I.
Edit: for the sake of having a complete answer, I'll adapt Jens' answer that uses Testing The Same Part of a String for More Than One Requirement:
\b(?=[MBDPI]{3}\b)\w*I\w*
with the word boundary checks to ensure it's only three characters long.
This is a more advanced solution and applicable in more situations, but I'd generally favour whichever is easier to read (the "or" version, imho).
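To illustrate, a minimal check of the alternation version against some sample words (my own test strings):
import re
pattern = re.compile(r"\b(I[MBDPI]{2}|[MBDPI]I[MBDPI]|[MBDPI]{2}I)\b")
for foo in ("MBI", "IBM", "BIM", "MBD", "MBDI"):
    print(foo, bool(pattern.search(foo)))  # MBD has no I, MBDI is four letters long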

You could use lookahead to see if an I is present:
(?=[MBDPI]{0,2}I)[MBDPI]{3}

Related

How to match a numeric range out of 23:59?

I have possible strings in the format of:
x:y
where x & y can have multiple digits. I want to match the opposite of 23:59, meaning that x must be > 23 or y must be > 59; how do I write that pattern? My intention is that if a string x:y does not look like a time, e.g. 08:23, I want to exclude it. Note that the string could be:
8:23 OR
08:23
Both refer to 8:23am. I have to match the opposite of 23:59, since my program's logic works that way. The following pattern seems to match 0<x<=23 & 0<y<=59:
([0-1][1-9]|2[0-4]):[0-5][0-9]
How do I match the opposite of this, assuming the above regex is correct?
One way using dateutil.parser:
import dateutil.parser as dparser

def is_time(str_):
    try:
        dparser.parse(str_, fuzzy=True)
        return True
    except ValueError:
        return False

times = ["8:23", "08:23", "28:23", "23:61"]
for t in times:
    print(t, is_time(t))
Output:
8:23 True
08:23 True
28:23 False # Wrong hour
23:61 False # Wrong min
Lookaheads might come to the rescue here:
\b(?!23:59)([0-1][0-9]|2[0-3]):[0-5][0-9]\b
The negative lookahead at the very start of the pattern (?!23:59) excludes 23:59, and the rest of the pattern allows all other hours:minutes.
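A quick sanity check of that pattern (a sketch; note that it expects two-digit hours):
import re
pattern = re.compile(r"\b(?!23:59)([0-1][0-9]|2[0-3]):[0-5][0-9]\b")
for s in ("23:59", "23:58", "08:23", "00:00"):
    print(s, bool(pattern.fullmatch(s)))  # only 23:59 is rejected here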
(((2[4-9]|[3-9][0-9]):\d\d)|(\d\d:([6-9][0-9])))
This matches an hour of 24-99 followed by any two digits, or any two-digit hour followed by minutes of 60-99.
I think the correct Regex for what you want is
([01]?[0-9]|2[0-3]):[0-5][0-9]
and for the opposite of the whole thing you can do the following, negating the accepted set.
(?!([01]?[0-9]|2[0-3]):[0-5][0-9])
?! = Negative lookahead.
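For example, anchoring that negative lookahead so the whole string has to look like a time (the anchors are my addition), you can flag everything that is not a valid time:
import re
not_a_time = re.compile(r'^(?!([01]?[0-9]|2[0-3]):[0-5][0-9]$)')
for s in ("8:23", "08:23", "28:23", "23:61"):
    print(s, bool(not_a_time.match(s)))  # True means "not a valid time"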
My intention is that, if a string x:y is not like a time format, i.e.
08:23, I want to exclude it.
This to me seems like you just want to attempt to match the two formats and if no match, discard it.
Try this:
https://regex101.com/r/pQGNyj/1
Expression:
^([0-1][0-9]|[2][0-3]):([0-5][0-9])|((?<!\d)[1]*[0-9]|[2][0-3]):([0-5][0-9])$
It might be what you're after...
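If that is indeed what you want, a simpler sketch of the "match the valid format and discard everything else" idea (using my own, shorter pattern rather than the expression above) could be:
import re
valid_time = re.compile(r'([01]?\d|2[0-3]):[0-5]\d')
for s in ("8:23", "08:23", "28:23", "23:61"):
    print(s, bool(valid_time.fullmatch(s)))  # keep the True ones, discard the rest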

How to test whether a string contains only letters and numbers?

I am trying to test whether a string contains only letters or numbers. The following statement should return False, but it doesn't. What am I doing wrong?
bool(re.match('[A-Z\d]', '2ae12'))
Just use the string method isalnum(), it does exactly what you want.
While not regex, you can use the very concise str.isalnum():
s = "sdfsdfq34sd"
print(s.isalnum())
Output:
True
However, if you do want a pure regex solution:
import re
if re.findall('^[a-zA-Z0-9]+$', s):
    pass  # the string contains only letters and digits
Using a dataframe solution, courtesy of @Wen:
import pandas as pd

df = pd.DataFrame({'col1': ["sdfsdfq34sd", "sdfsdfq###34sd", "sdfsdf!q34sd", "sdfsdfq34s#d"]})
df.col1.apply(lambda x: x.isalnum())
Pandas answer: Consider this df
col
0 2ae12
1 2912
2 da2ae12
3 %2ae12
4 #^%6f
5 &^$*
You can select the rows that contain only letters or numbers using
df[~df.col.str.contains('(\W+)')]
You get
col
0 2ae12
1 2912
2 da2ae12
If you just want a boolean column, use
~df.col.str.contains('(\W+)')
0 True
1 True
2 True
3 False
4 False
5 False
If you are looking to return True if the string is either all digits or all letters, you can do:
for case in ('abcdefg','12345','2ae12'):
print case, case.isalpha() or case.isdigit()
Prints:
abcdefg True
12345 True
2ae12 False
If you want the same logic with a regex, you would do:
import re
for case in ('abcdefg','12345','2ae12'):
print case, bool(re.search(r'^(?:[a-zA-Z]+|\d+)$', case))
Your regex is only matching a single character at the start of the string; it never requires that the whole string consists of allowed characters.
If you really want to use a regex, here's how I would do it:
import re

def isalphanum(test_str):
    # \Z anchors the match at the end of the string so trailing junk is rejected
    alphanum_re = re.compile(r"[0-9A-Z]+\Z", re.I)
    return bool(alphanum_re.match(test_str))
Let's focus on the alphanum regex. I compiled it with a raw string literal, indicated by the 'r' prefix. This type of string won't escape characters when a backslash is present, meaning r"\n" is interpreted as a backslash and an N instead of a newline. This is helpful when writing regexes, and certain text editors will even change the syntax highlighting of an r-string to highlight features in the regex to help you out. The re.I flag ignores the case of the test string, so [A-Z] will match A through Z in either upper or lower case.
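For example, with the isalphanum function above:
print(isalphanum("2ae12"))    # True
print(isalphanum("2ae 12"))   # False, the space is rejected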
The simpler, Zen of Python solution is to invoke the string's isalnum method:
test_str = "abc123"
test_str.isalnum()
You need to check whether the string is made up of either letters or digits!
import re
df['some_column'].apply(lambda x: bool(re.match(r'^(?:[A-Za-z]+|\d+)$', x)))
As dawg has suggested you can also use isalpha and isdigit,
df['some_column'].str.isalpha() | df['some_column'].str.isdigit()

How to improve the performance of this regular expression?

Consider the regular expression
^(?:\s*(?:[\%\#].*)?\n)*\s*function\s
It is intended to match Octave/MATLAB script files that start with a function definition.
However, the performance of this regular expression is incredibly slow, and I'm not entirely sure why. For example, if I try evaluating it in Python,
>>> import re, time
>>> r = re.compile(r"^(?:\s*(?:[\%\#].*)?\n)*\s*function\s")
>>> t0=time.time(); r.match("\n"*15); print(time.time()-t0)
0.0178489685059
>>> t0=time.time(); r.match("\n"*20); print(time.time()-t0)
0.532235860825
>>> t0=time.time(); r.match("\n"*25); print(time.time()-t0)
17.1298530102
In English, that last line is saying that my regular expression takes 17 seconds to evaluate on a simple string containing 25 newline characters!
What is it about my regex that is making it so slow, and what could I do to fix it?
EDIT: To clarify, I would like my regex to match the following string containing comments:
# Hello world
function abc
including any amount of whitespace, but not
x = 10
function abc
because then the string does not start with "function". Note that comments can start with either "%" or with "#".
Replace your \s with [\t\f ] so it doesn't also consume newlines. This only needs to be done inside the repeated non-capturing group, which becomes (?:[\t\f ]*(?:[\%\#].*)?\n).
The problem is that you have three greedy consumers that all match '\n' (\s*, (...\n)* and again \s*).
In your last timing example, before giving up, the engine tries every way of splitting the 25 newlines between those consumers: strings a, b and c (one per consumer) plus a leftover e such that a+b+c+e == 25*'\n', where any of them may be empty.
On top of that, the middle consumer is itself a repeated group, so each b can be split across its iterations in many different ways too. The number of combinations grows exponentially with the number of newlines, which is exactly the catastrophic backtracking you're seeing.
By the way regex101 is a great site to try out regular expressions. They automatically break up expressions and explain their parts and they even provide a debugger.
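Here is a rough sketch of that rewrite and a way to time it yourself (the exact pattern is my adaptation of the suggestion above, so double-check it against your real inputs):
import re, time

# \s inside the repeated group is replaced with [\t\f ], so only the final \n
# of each iteration can consume a newline; the trailing \s* is left as-is.
r = re.compile(r"^(?:[\t\f ]*(?:[%#].*)?\n)*\s*function\s")

t0 = time.time()
print(r.match("\n" * 25))                              # no match, but returns quickly
print(time.time() - t0)
print(bool(r.match("# Hello world\nfunction abc\n")))  # True
print(bool(r.match("x = 10\nfunction abc\n")))         # False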
To speed it up, you can use this regex:
p = re.compile(r"^\s*function\s", re.MULTILINE)
Since you're not actually capturing the lines starting with # or % anyway, you can use MULTILINE mode and start matching from the line where the function keyword is found.

Python Regex match or potential match

Question:
How do I use Python's regular expression module (re) to determine if a match has been made, or that a potential match could be made?
Details:
I want a regex pattern which searches for a pattern of words in the correct order, regardless of what's between them. I want a function which returns Yes if found, Maybe if a match could still be found, or No if no match can be found. We are looking for the pattern One|....|Two|....|Three; here are some examples. (Note: the names, their count and their order are not important; all I care about is the three words One, Two and Three, and the acceptable words in between are John, Malkovich, Stamos and Travolta.)
Returns YES:
One|John|Malkovich|Two|John|Stamos|Three|John|Travolta
Returns YES:
One|John|Two|John|Three|John
Returns YES:
One|Two|Three
Returns MAYBE:
One|Two
Returns MAYBE:
One
Returns NO:
Three|Two|One
I understand the examples are not airtight, so here is what I have for the regex to get YES:
if re.match('One\|(John\||Malkovich\||Stamos\||Travolta\|)*Two\|(John\||Malkovich\||Stamos\||Travolta\|)*Three\|(John\||Malkovich\||Stamos\||Travolta\|)*', 'One|John|Malkovich|Two|John|Stamos|Three|John|Travolta') is not None:
    return 'Yes'
Obviously if the pattern is Three|Two|One the above will fail, and we can return No, but how do I check for the Maybe case? I thought about nesting the parentheses, like so (note, not tested)
if re.match('One\|((John\||Malkovich\||Stamos\||Travolta\|)*Two(\|(John\||Malkovich\||Stamos\||Travolta\|)*Three\|(John\||Malkovich\||Stamos\||Travolta\|)*)*)*', 'One|John|Malkovich|Two|John|Stamos|Three|John|Travolta') is not None:
    return 'Yes'
But I don't think that will do what I want it to do.
More Details:
I am not actually looking for Travoltas and Malkovichs (shocking, I know). I am matching against inotify patterns such as IN_MOVE, IN_CREATE and IN_OPEN. I am logging them and getting hundreds of them, then I go in and look for a particular pattern such as IN_ACCESS...IN_OPEN...IN_MODIFY, but in some cases I don't want an IN_DELETE after the IN_OPEN and in others I do. I'm essentially pattern matching with inotify to detect when text editors go wild and try to crush programmers' souls by doing a temporary-file-swap-save instead of just modifying the file. I don't want to free up those logs instantly, but I only want to hold on to them for as long as necessary. Maybe means don't erase the logs, Yes means do something and then erase the log, and No means don't do anything but still erase the logs. As I will have multiple rules for each program (i.e. vim vs. gedit vs. emacs), I wanted to use a regular expression, which would be more human-readable and easier to write than building a massive tree or, as user Joel suggested, just going over the words with a loop.
I wouldn't use a regex for this. But it's definitely possible:
import re

regex = re.compile(
    r"""^               # Start of string
    (?:                 # Match...
     (?:                # one of the following:
      One()             # One (use empty capturing group to indicate match)
      |                 # or
      \1Two()           # Two if One has matched previously
      |                 # or
      \1\2Three()       # Three if One and Two have matched previously
      |                 # or
      John              # any of the other strings
      |                 # etc.
      Malkovich
      |
      Stamos
      |
      Travolta
     )                  # End of alternation
     \|?                # followed by optional separator
    )*                  # any number of repeats
    $                   # until the end of the string.""",
    re.VERBOSE)
Now you can check for YES and MAYBE by checking if you get a match at all:
>>> yes = regex.match("One|John|Malkovich|Two|John|Stamos|Three|John|Travolta")
>>> yes
<_sre.SRE_Match object at 0x0000000001F90620>
>>> maybe = regex.match("One|John|Malkovich|Two|John|Stamos")
>>> maybe
<_sre.SRE_Match object at 0x0000000001F904F0>
And you can differentiate between YES and MAYBE by checking whether all of the groups have participated in the match (i.e. are not None):
>>> yes.groups()
('', '', '')
>>> maybe.groups()
('', '', None)
And if the regex doesn't match at all, that's a NO for you:
>>> no = regex.match("Three|Two|One")
>>> no is None
True
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski
Perhaps an algorithm like this would be more appropriate. Here is some pseudocode.
matchlist.current = matchlist.first()
for each word in input
    if word = matchlist.current
        matchlist.current = matchlist.next() // assuming next returns null if at end of list
    else if not allowedlist.contains(word)
        return 'No'
if matchlist.current = null // we hit the end of the list
    return 'Yes'
return 'Maybe'
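A rough Python translation of that pseudocode (my own sketch; the variable names are made up for illustration):
def check(words, required, allowed):
    idx = 0  # position in the list of required words
    for word in words:
        if idx < len(required) and word == required[idx]:
            idx += 1            # found the next required word
        elif word not in allowed:
            return 'No'         # unexpected word, no match possible
    return 'Yes' if idx == len(required) else 'Maybe'

required = ['One', 'Two', 'Three']
fillers = {'John', 'Malkovich', 'Stamos', 'Travolta'}
print(check('One|John|Two|John|Three|John'.split('|'), required, fillers))  # Yes
print(check('One|Two'.split('|'), required, fillers))                       # Maybe
print(check('Three|Two|One'.split('|'), required, fillers))                 # No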

Regular expression to confirm whether a string is a valid Python identifier?

I have the following definition for an Identifier:
Identifier --> letter { letter | digit }
Basically I have an identifier function that gets a string from a file and tests it to make sure that it's a valid identifier as defined above.
I've tried this:
if re.match('\w+(\w\d)?', i):
    return True
else:
    return False
but when I run my program, every time it meets an integer it thinks it's a valid identifier.
For example
c = 0 ;
it prints c as a valid identifier, which is fine, but it also prints 0 as a valid identifier.
What am I doing wrong here?
This question was asked 10 years ago, when Python 2 was still dominant. As many comments over the last decade demonstrated, my answer needed a serious update, starting with a big heads-up:
No single regex will properly match all (and only) valid Python identifiers. It didn't for Python 2, it doesn't for Python 3.
The reasons are:
As @JoeCondron pointed out, Python reserved keywords such as True, if and return are not valid identifiers, and regexes alone are unable to handle this, so additional filtering is required.
Python 3 allows non-ASCII letters and numbers in an identifier, but the Unicode categories of letters and numbers accepted by the lexical parser for a valid identifier do not match the categories matched by \d, \w and \W in the re module, as demonstrated in @martineau's counter-example and explained in great detail by @Hatshepsut's amazing research.
While we could try to solve the first issue using keyword.iskeyword(), as @Alexander Huszagh suggested, and work around the other by limiting to ASCII-only identifiers, why bother using a regex at all?
As Hatshepsut said:
str.isidentifier() works
Just use it, problem solved.
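If you also need to reject reserved words, a tiny sketch combining the two suggestions above:
import keyword

def is_valid_identifier(s):
    return s.isidentifier() and not keyword.iskeyword(s)

print(is_valid_identifier("foo_bar"))  # True
print(is_valid_identifier("return"))   # False, reserved keyword
print(is_valid_identifier("1abc"))     # False, starts with a digit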
As requested by the question, my original 2012 answer presents a regular expression based on Python 2's official definition of an identifier:
identifier ::= (letter|"_") (letter | digit | "_")*
Which can be expressed by the regular expression:
^[^\d\W]\w*\Z
Example:
import re
identifier = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)
tests = [ "a", "a1", "_a1", "1a", "aa$%#%", "aa bb", "aa_bb", "aa\n" ]
for test in tests:
    result = re.match(identifier, test)
    print("%r\t= %s" % (test, (result is not None)))
Result:
'a' = True
'a1' = True
'_a1' = True
'1a' = False
'aa$%#%' = False
'aa bb' = False
'aa_bb' = True
'aa\n' = False
str.isidentifier() works. The regex answers incorrectly fail to match some valid Python identifiers and incorrectly match some invalid ones.
str.isidentifier() Return true if the string is a valid identifier
according to the language definition, section Identifiers and
keywords.
Use keyword.iskeyword() to test for reserved identifiers such as def
and class.
@martineau's comment gives the example of '℘᧚' where the regex solutions fail.
>>> '℘᧚'.isidentifier()
True
>>> import re
>>> bool(re.search(r'^[^\d\W]\w*\Z', '℘᧚'))
False
Why does this happen?
Let's define the set of code points that match the given regular expression, and the set that satisfies str.isidentifier.
import re
import unicodedata
chars = {chr(i) for i in range(0x10ffff) if re.fullmatch(r'^[^\d\W]\w*\Z', chr(i))}
identifiers = {chr(i) for i in range(0x10ffff) if chr(i).isidentifier()}
How many regex matches are not identifiers?
In [26]: len(chars - identifiers)
Out[26]: 698
How many identifiers are not regex matches?
In [27]: len(identifiers - chars)
Out[27]: 4
Interesting -- which ones?
In [37]: {(c, unicodedata.name(c), unicodedata.category(c)) for c in identifiers - chars}
Out[37]:
set([
('\u1885', 'MONGOLIAN LETTER ALI GALI BALUDA', 'Mn'),
('\u1886', 'MONGOLIAN LETTER ALI GALI THREE BALUDA', 'Mn'),
('℘', 'SCRIPT CAPITAL P', 'Sm'),
('℮', 'ESTIMATED SYMBOL', 'So'),
])
What's different about these two sets?
They have different Unicode "General Category" values.
In [31]: {unicodedata.category(c) for c in chars - identifiers}
Out[31]: set(['Lm', 'Lo', 'No'])
From wikipedia, that's Letter, modifier; Letter, other; Number, other. This is consistent with the re docs, since \d is only decimal digits:
\d Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])
What about the other way?
In [32]: {unicodedata.category(c) for c in identifiers - chars}
Out[32]: set(['Mn', 'Sm', 'So'])
That's Mark, nonspacing; Symbol, math; Symbol, other.
Where is this all documented?
In the Python Language Reference
In PEP 3131 - Supporting non-ascii identifiers
Where is it implemented?
https://github.com/python/cpython/commit/47383403a0a11259acb640406a8efc38981d2255
I still want a regular expression
Look at the regex module on PyPI.
This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality.
It includes filters for "General Category".
For Python 3, you need to handle Unicode letters and digits, so if that's a concern, this should work for you:
re_ident = re.compile(r"^[^\d\W]\w*$", re.UNICODE)
[^\d\W] matches a character that is not a digit and not "not a word character", which translates to "a letter or an underscore".
\w matches digits as well as letters and the underscore. Try ^[_a-zA-Z]\w*$
Works like a charm: r'[^\d\W][\w\d]+'
The question is about regex, so my answer may look off-topic. The point is that a regex is simply not the right approach.
Interested in getting the problematic characters?
Using str.isidentifier, one can perform the check character by character, prefixing each character after the first with, say, an underscore to avoid false positives such as bare digits. How could a name be valid if one of its (prefixed) components is not? E.g.
def checker(str_: str) -> 'set[str]':
    return {
        c for i, c in enumerate(str_)
        if not (f'_{c}' if i else c).isidentifier()
    }
>>> checker('℘3᧚₂')
{'₂'}
This solution also deals with disallowed first characters, such as digits or ᧚. See:
>>> checker('᧚℘3₂')
{'₂', '᧚'}
>>> checker('3᧚℘₂')
{'3', '₂'}
>>> checker("a$%##%\n")
{'#', '\n', '$', '%'}
This could be improved, since it neither checks for reserved names nor says anything about why ᧚ is sometimes problematic whereas ₂ always is... but here is my regex-free approach.
My answer in your terms:
if not checker(i):
    return True
else:
    return False
which could be contracted into
return not checker(i)
