I have the following definition for an Identifier:
Identifier --> letter{ letter| digit}
Basically I have an identifier function that gets a string from a file and tests it to make sure that it's a valid identifier as defined above.
I've tried this:
if re.match('\w+(\w\d)?', i):
return True
else:
return False
but when I run my program every time it meets an integer it thinks that it's a valid identifier.
For example
c = 0 ;
it prints c as a valid identifier which is fine, but it also prints 0 as a valid identifer.
What am I doing wrong here?
Question was made 10 years ago, when Python 2 was still dominant. As many comments in the last decade demonstrated, my answer needed a serious update, starting with a big heads up:
No single regex will properly match all (and only) valid Python identifiers. It didn't for Python 2, it doesn't for Python 3.
The reasons are:
As #JoeCondron pointed out, Python reserved keywords such as True, if, return, are not valid identifiers, and regexes alone are unable to handle this, so additional filtering is required.
Python 3 allows non-ascii letters and numbers in an identifier, but the Unicode categories of letters and numbers accepted by the lexical parser for a valid identifier do not match the same categories of \d, \w, \W in the re module, as demonstrated in #martineau's counter-example and explained in great detail by #Hatshepsut's amazing research.
While we could try to solve the first issue using keyword.iskeyword(), as #Alexander Huszagh suggested, and workaround the other by limiting to ascii-only identifiers, why bother using a regex at all?
As Hatshepsut said:
str.isidentifier() works
Just use it, problem solved.
As requested by the question, my original 2012 answer presents a regular expression based on the Python's 2 official definition of an identifier:
identifier ::= (letter|"_") (letter | digit | "_")*
Which can be expressed by the regular expression:
^[^\d\W]\w*\Z
Example:
import re
identifier = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)
tests = [ "a", "a1", "_a1", "1a", "aa$%#%", "aa bb", "aa_bb", "aa\n" ]
for test in tests:
result = re.match(identifier, test)
print("%r\t= %s" % (test, (result is not None)))
Result:
'a' = True
'a1' = True
'_a1' = True
'1a' = False
'aa$%#%' = False
'aa bb' = False
'aa_bb' = True
'aa\n' = False
str.isidentifier() works. The regex answers incorrectly fail to match some valid python identifiers and incorrectly match some invalid ones.
str.isidentifier() Return true if the string is a valid identifier
according to the language definition, section Identifiers and
keywords.
Use keyword.iskeyword() to test for reserved identifiers such as def
and class.
#martineau's comment gives the example of '℘᧚' where the regex solutions fail.
>>> '℘᧚'.isidentifier()
True
>>> import re
>>> bool(re.search(r'^[^\d\W]\w*\Z', '℘᧚'))
False
Why does this happen?
Lets define the sets of code points that match the given regular expression, and the set that match str.isidentifier.
import re
import unicodedata
chars = {chr(i) for i in range(0x10ffff) if re.fullmatch(r'^[^\d\W]\w*\Z', chr(i))}
identifiers = {chr(i) for i in range(0x10ffff) if chr(i).isidentifier()}
How many regex matches are not identifiers?
In [26]: len(chars - identifiers)
Out[26]: 698
How many identifiers are not regex matches?
In [27]: len(identifiers - chars)
Out[27]: 4
Interesting -- which ones?
In [37]: {(c, unicodedata.name(c), unicodedata.category(c)) for c in identifiers - chars}
Out[37]:
set([
('\u1885', 'MONGOLIAN LETTER ALI GALI BALUDA', 'Mn'),
('\u1886', 'MONGOLIAN LETTER ALI GALI THREE BALUDA', 'Mn'),
('℘', 'SCRIPT CAPITAL P', 'Sm'),
('℮', 'ESTIMATED SYMBOL', 'So'),
])
What's different about these two sets?
They have different Unicode "General Category" values.
In [31]: {unicodedata.category(c) for c in chars - identifiers}
Out[31]: set(['Lm', 'Lo', 'No'])
From wikipedia, that's Letter, modifier; Letter, other; Number, other. This is consistent with the re docs, since \d is only decimal digits:
\d Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])
What about the other way?
In [32]: {unicodedata.category(c) for c in identifiers - chars}
Out[32]: set(['Mn', 'Sm', 'So'])
That's Mark, nonspacing; Symbol, math; Symbol, other.
Where is this all documented?
In the Python Language Reference
In PEP 3131 - Supporting non-ascii identifiers
Where is it implemented?
https://github.com/python/cpython/commit/47383403a0a11259acb640406a8efc38981d2255
I still want a regular expression
Look at the regex module on PyPI.
This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality.
It includes filters for "General Category".
For Python 3, you need to handle Unicode letters and digits. So if that's a concern, you should get along with this:
re_ident = re.compile(r"^[^\d\W]\w*$", re.UNICODE)
[^\d\W] matches a character that is not a digit and not "not alphanumeric" which translates to "a character that is a letter or underscore".
\w matches digits and characters. Try ^[_a-zA-Z]\w*$
Works like a charm: r'[^\d\W][\w\d]+'
The question is about regex, so my answer may look out of subject. The point is that regex is simply not the right approach.
Interested in getting the problematic characters ?
Using str.isidentifier, one can perform the check character by character, prefixing them with, say, an underscore to avoid false positive such as digits and so on... How could a name be valid if one of its (prefixed) component is not (?) E.g.
def checker(str_: str) -> 'set[str]':
return {
c for i, c in enumerate(str_)
if not (f'_{c}' if i else c).isidentifier()
}
>>> checker('℘3᧚₂')
{'₂'}
Which solution deals with unauthorised first characters, such as digits or e.g. ᧚. See
>>> checker('᧚℘3₂')
{'₂', '᧚'}
>>> checker('3᧚℘₂')
{'3', '₂'}
>>> checker("a$%##%\n")
{'#', '#', '\n', '$', '%'}
To be improved, since it does check neither for reserved names, nor tells anything about why ᧚ is sometime problematic, whereas ₂ always is... but here is my without-regex approach.
My answer in your terms:
if not checker(i):
return True
else:
return False
which could be contracted into
return not checker(i)
Related
I am trying to test either a string contains only alphabets or numbers. Following statement should return false but it doesn't return. What am I doing wrong?
bool(re.match('[A-Z\d]', '2ae12'))
Just use the string method isalnum(), it does exactly what you want.
While not regex, you can use the very concise str.isalnum():
s = "sdfsdfq34sd"
print(s.isalnum())
Output:
True
However, if you do want a pure regex solution:
import re
if re.findall('^[a-zA-Z0-9]+$', s):
pass #string just contains letters and digits
Using a dataframe solution, courtesy of #Wen:
df.col1.apply(lambda x : x.isalnum())
df=pd.DataFrame( {'col1':["sdfsdfq34sd","sdfsdfq###34sd","sdfsdf!q34sd","sdfsdfq34s#d"]})
Pandas answer: Consider this df
col
0 2ae12
1 2912
2 da2ae12
3 %2ae12
4 #^%6f
5 &^$*
You can select the rows that contain only alphabets or numbers using
df[~df.col.str.contains('(\W+)')]
You get
col
0 2ae12
1 2912
2 da2ae12
If you just want a boolean column, use
~df.col.str.contains('(\W+)')
0 True
1 True
2 True
3 False
4 False
5 False
If you are looking to return True if the string is either all digits or all letters, you can do:
for case in ('abcdefg','12345','2ae12'):
print case, case.isalpha() or case.isdigit()
Prints:
abcdefg True
12345 True
2ae12 False
If you want the same logic with a regex, you would do:
import re
for case in ('abcdefg','12345','2ae12'):
print case, bool(re.search(r'^(?:[a-zA-Z]+|\d+)$', case))
You regex is only matching one character, and I think the \d is being treated as an escaped D instead of the set of all integer characters.
If you really want to use a regex here's how I would do it;
def isalphanum(test_str):
alphanum_re = re.compile(r"[0-9A-Z]+", re.I)
return bool(alphanum_re.match(test_str)
Let's focus on the alphanum regex. I compiled it with a raw literal, indicated by the string with an 'r' next to it. This type of string won't escape certain characters when a slash is present, meaning r"\n" is interpreted as a slash and an N instead of a newline. This is helpful when using regexs, and certain text editors will even change the syntax highlighting of an R string to highlight features in the regex to help you out. The re.I flag ignores the case of the test string, so [A-Z] will match A through Z in either upper or lower case.
The simpler, Zen of Python solution involves invoking the isalnum method of the string;
test_str = "abc123"
test_str.isalnum()
You need to check is the string is made up of either alphabets or digits!
import re
bool(re.match('^[A-Za-z]+|\d+$', df['some_column'].str))
As dawg has suggested you can also use isalpha and isdigit,
df['some_column'].str.isalpha() or df['some_column'].str.isdigit()
This question already has answers here:
Test if string ONLY contains given characters [duplicate]
(7 answers)
Closed 8 years ago.
Forgive the simplistic question, but I've read through the SO questions and the Python documentation and still haven't been able to figure this out.
How can I create a Python regex to test whether a string contains ANY but ONLY the A, U, G and C characters? The string can contain either one or all of those characters, but if it contains any other characters, I'd like the regex to fail.
I tried:
>>> re.match(r"[AUGC]", "AUGGAC")
<_sre.SRE_Match object at 0x104ca1850>
But adding an X on to the end of the string still works, which is not what I expected:
>>> re.match(r"[AUGC]", "AUGGACX")
<_sre.SRE_Match object at 0x104ca1850>
Thanks in advance.
You need the regex to consume the whole string (or fail, if it can't). re.match implicitly adds an anchor at the start of the string, you need to add one to the end:
re.match(r"[AUGC]+$", string_to_check)
Also note the +, which repeatedly matches your character set (since, again, the point is to consume the whole string)
if the value is the only characters in the string, you can do the following:
>>> r = re.compile(r'^[AUGC]+$')
>>> r.match("AUGGAC")
<_sre.SRE_Match object at 0x10ee166b0>
>>> r.match("AUGGACX")
>>>
then if you want your regex to match the empty string as well, you can do:
>>> r = re.compile(r'^[AUGC]*$')
>>> r.match("")
<_sre.SRE_Match object at 0x10ee16718>
>>> r.match("AUGGAC")
<_sre.SRE_Match object at 0x10ee166b0>
>>> r.match("AUGGACX")
Here's a description of what the first regexp does:
Walk through it
Use ^[AUCG]*$; this will match against the entire string.
Or, if there has to be at least one letter, ^[AUCG]+$ — ^ and $ stand for beginning of string and end of string respectively; * and + stand for zero or more and one or more respectively.
This is purely about regular expressions and not specific to Python really.
You are actually really close. What you have just tests for a single character that A or U or G or C.
What you want is to match a string that has one or more letters that are all A or U or G or C, you can accomplish this by adding the plus modifier to your regular expression.
re.match(r"^[AUGC]+$", "AUGGAC")
Additionally, adding $ at the end marks the end of string, you can optionally use ^ at the front to match the beginning of the string.
Just check to see if there is anything other than "AUGC" in there:
if re.search('[^AUGC]', string_to_check):
#fail
You can add a check to make sure the string is not empty in the same statement:
if not string_to_check or re.search('[^AUGC]', string_to_check):
#fail
No real need to use a regex:
>>> good = 'AUGGCUA'
>>> bad = 'AUGHACUA'
>>> all([c in 'AUGC' for c in good])
True
>>> all([c in 'AUGC' for c in bad])
False
I know you're asking about regular expressions but I though it was worth mentioning set. To establish whether your string only contains A U G or C, you could do this:
>>> input = "AUCGCUAGCGAU"
>>> s = set("AUGC")
>>> set(input) <= s
True
>>> bad = "ASNMSA"
>>> set(bad) <= s
False
edit: thanks to #roippi for spotting my mistake, <= should be used, not ==.
Instead of using <=, the method issubset can be used:
>>> set("AUGAUG").issubset(s)
True
if all characters in the string input are in the set s, then issubset will return True.
From: https://docs.python.org/2/library/re.html
Characters that are not within a range can be matched by complementing the set.
If the first character of the set is '^', all the characters that are not in the set will be matched.
For example, [^5] will match any character except '5', and [^^] will match any character except '^'.
^ has no special meaning if it’s not the first character in the set.
So you could do [^AUGC] and if it matches that then reject it, else keep it.
I'm running search below Idle, in Python 2.7 in a Windows Bus. 64 bit environment.
According to RegexBuddy, the search pattern ('patternalphaonly') should not produce a match against a string of digits.
I looked at "http://docs.python.org/howto/regex.html", but did not see anything there that would explain why the search and match appear to be successful in finding something matching the pattern.
Does anyone know what I'm doing wrong, or misunderstanding?
>>> import re
>>> numberstring = '3534543234543'
>>> patternalphaonly = re.compile('[a-zA-Z]*')
>>> result = patternalphaonly.search(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
>>> result = patternalphaonly.match(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
Thanks
The star operator (*) indicates zero or more repetitions. Your string has zero repetitions of an English alphabet letter because it is entirely numbers, which is perfectly valid when using the star (repeat zero times). Instead use the + operator, which signifies one or more repetitions. Example:
>>> n = "3534543234543"
>>> r1 = re.compile("[a-zA-Z]*")
>>> r1.match(n)
<_sre.SRE_Match object at 0x07D85720>
>>> r2 = re.compile("[a-zA-Z]+") #using the + operator to make sure we have at least one letter
>>> r2.match(n)
Helpful link on repetition operators.
Everything eldarerathis says is true. However, with a variable named: 'patternalphaonly' I would assume that the author wants to verify that a string is composed of alpha chars only. If this is true then I would add additional end-of-string anchors to the regex like so:
patternalphaonly = re.compile('^[a-zA-Z]+$')
result = patternalphaonly.search(numberstring)
Or, better yet, since this will only ever match at the beginning of the string, use the preferred match method:
patternalphaonly = re.compile('[a-zA-Z]+$')
result = patternalphaonly.match(numberstring)
(Which, as John Machin has pointed out, is evidently faster for some as-yet unexplained reason.)
I am close but I am not sure what to do with the restuling match object. If I do
p = re.search('[/#.* /]', str)
I'll get any words that start with # and end up with a space. This is what I want. However this returns a Match object that I dont' know what to do with. What's the most computationally efficient way of finding and returning a string which is prefixed with a #?
For example,
"Hi there #guy"
After doing the proper calculations, I would be returned
guy
The following regular expression do what you need:
import re
s = "Hi there #guy"
p = re.search(r'#(\w+)', s)
print p.group(1)
It will also work for the following string formats:
s = "Hi there #guy " # notice the trailing space
s = "Hi there #guy," # notice the trailing comma
s = "Hi there #guy and" # notice the next word
s = "Hi there #guy22" # notice the trailing numbers
s = "Hi there #22guy" # notice the leading numbers
That regex does not do what you think it does.
s = "Hi there #guy"
p = re.search(r'#([^ ]+)', s) # this is the regex you described
print p.group(1) # first thing matched inside of ( .. )
But as usually with regex, there are tons of examples that break this, for example if the text is s = "Hi there #guy, what's with the comma?" the result would be guy,.
So you really need to think about every possible thing you want and don't want to match. r'#([a-zA-Z]+)' might be a good starting point, it literally only matches letters (a .. z, no unicode etc).
p.group(0) should return guy. If you want to find out what function an object has, you can use the dir(p) method to find out. This will return a list of attributes and methods that are available for that object instance.
As it's evident from the answers so far regex is the most efficient solution for your problem. Answers differ slightly regarding what you allow to be followed by the #:
[^ ] anything but space
\w in python-2.x is equivalent to [A-Za-z0-9_], in py3k is locale dependent
If you have better idea what characters might be included in the user name you might adjust your regex to reflect that, e.g., only lower case ascii letters, would be:
[a-z]
NB: I skipped quantifiers for simplicity.
(?<=#)\w+
will match a word if it's preceded by a # (without adding it to the match, a so-called positive lookbehind). This will match "words" that are composed of letters, numbers, and/or underscore; if you don't want those, use (?<=#)[^\W\d_]+
In Python:
>>> strg = "Hi there #guy!"
>>> p = re.search(r'(?<=#)\w+', strg)
>>> p.group()
'guy'
You say: """If I do p = re.search('[/#.* /]', str) I'll get any words that start with # and end up with a space."" But this is incorrect -- that pattern is a character class which will match ONE character in the set #/.* and space. Note: there's a redundant second / in the pattern.
For example:
>>> re.findall('[/#.* /]', 'xxx#foo x/x.x*x xxxx')
['#', ' ', '/', '.', '*', ' ']
>>>
You say that you want "guy" returned from "Hi there #guy" but that conflicts with "and end up with a space".
Please edit your question to include what you really want/need to match.
Is there a way to get single regex to satisfy this condition??
I am looking for a "word" that has three letters from the set MBIPI, any order,
but MUST contain an I.
ie.
re.match("[MBDPI]{3}", foo) and "I" in foo
So this is the correct result (in python using the re module), but can I get this from a single regex?
>>> for foo in ("MBI", "MIB", "BIM", "BMI", "IBM", "IMB", "MBD"):
... print foo,
... print re.match("[MBDPI]{3}", foo) and "I" in foo
MBI True
MIB True
BIM True
BMI True
IBM True
IMB True
MBD False
with regex I know I can use | as a boolean OR operator, but is there a boolean AND equivalent?
or maybe I need some forward or backward lookup?
You can fake boolean AND by using lookaheads. According to http://www.regular-expressions.info/lookaround2.html, this will work for your case:
"\b(?=[MBDPI]{3}\b)\w*I\w*"
with regex I know I can use | as a boolean OR operator, but is there a boolean AND equivalent?
A and B = not ( not A or not B) = (?![^A]|[^B])
A and B being expressions that actually may have members in common.
Or is about the only thing you can do:
\b(I[MBDPI]{2}|[MBDPI]I[MBDPI]|[MBDPI]{2}I)\b
The \b character matches a zero-width word boundary. This ensures you match something that is exactly three characters long.
You're otherwise running into the limits to what a regular language can do.
An alternative is to match:
\b[MBDPI]{3}\b
capture that group and then look for an I.
Edit: for the sake of having a complete answer, I'll adapt Jens' answer that uses Testing The Same Part of a String for More Than One Requirement:
\b(?=[MBDPI]{3}\b)\w*I\w*
with the word boundary checks to ensure it's only three characters long.
This is a bit more of an advanced solution and applicable in more situations but I'd generally favour what's easier to read (being the "or" version imho).
You could use lookahead to see if an I is present:
(?=[MBDPI]{0,2}I)[MBDPI]{3}