How to test a string that only contains alphabets and numbers? - python

I am trying to test whether a string contains either only letters or only digits. The following statement should return False, but it doesn't. What am I doing wrong?
bool(re.match('[A-Z\d]', '2ae12'))

Just use the string method isalnum(), it does exactly what you want.

While not regex, you can use the very concise str.isalnum():
s = "sdfsdfq34sd"
print(s.isalnum())
Output:
True
However, if you do want a pure regex solution:
import re
if re.findall('^[a-zA-Z0-9]+$', s):
    pass  # string contains only letters and digits
Using a dataframe solution, courtesy of @Wen:
import pandas as pd
df = pd.DataFrame({'col1': ["sdfsdfq34sd", "sdfsdfq###34sd", "sdfsdf!q34sd", "sdfsdfq34s#d"]})
df.col1.apply(lambda x: x.isalnum())

Pandas answer: Consider this df
col
0 2ae12
1 2912
2 da2ae12
3 %2ae12
4 #^%6f
5 &^$*
You can select the rows that contain only letters or digits (note that \w also accepts the underscore) using
df[~df.col.str.contains(r'\W')]
You get
col
0 2ae12
1 2912
2 da2ae12
If you just want a boolean column, use
~df.col.str.contains(r'\W')
0 True
1 True
2 True
3 False
4 False
5 False

If you want to return True when the string is either all digits or all letters, you can do:
for case in ('abcdefg', '12345', '2ae12'):
    print(case, case.isalpha() or case.isdigit())
Prints:
abcdefg True
12345 True
2ae12 False
If you want the same logic with a regex, you would do:
import re
for case in ('abcdefg', '12345', '2ae12'):
    print(case, bool(re.search(r'^(?:[a-zA-Z]+|\d+)$', case)))

Your regex only matches a single character at the start of the string, and the leading 2 already satisfies \d, so re.match succeeds. (The \d itself is not the problem: Python leaves unrecognized escapes such as \d untouched in an ordinary string literal, although a raw string is the safer habit.)
If you really want to use a regex, here's how I would do it:
import re
def isalphanum(test_str):
    alphanum_re = re.compile(r"[0-9A-Z]+", re.I)
    # fullmatch requires the whole string to match, not just a prefix
    return bool(alphanum_re.fullmatch(test_str))
Let's focus on the alphanum regex. I compiled it from a raw string literal, indicated by the r prefix. In a raw string a backslash is not treated as the start of an escape sequence, so r"\n" is a backslash followed by the letter n rather than a newline. This is helpful when writing regexes, and some text editors even switch their syntax highlighting inside raw strings to pick out regex features. The re.I flag makes matching case-insensitive, so [A-Z] matches A through Z in either upper or lower case.
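To make the anchoring point concrete, here is a minimal sketch. The helper name is_ascii_alnum is made up for illustration, and the character class spells out both cases instead of relying on re.I:

```python
import re

# re.match only anchors at the start of the string, so a one-character
# class succeeds as soon as the first character fits.
starts_ok = re.match(r'[A-Z\d]', '2ae12') is not None
print(starts_ok)

def is_ascii_alnum(text):
    """Return True if text is non-empty and made only of ASCII letters and digits."""
    # fullmatch succeeds only when the pattern consumes the entire string.
    return re.fullmatch(r'[0-9A-Za-z]+', text) is not None

print(is_ascii_alnum('2ae12'))
print(is_ascii_alnum('2ae12!'))
```

The first check passes purely because of the leading 2, which is why the original expression never returned False for that input.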
The simpler, Zen of Python solution involves invoking the isalnum method of the string;
test_str = "abc123"
test_str.isalnum()
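One caveat worth illustrating: str.isalnum() is Unicode-aware, so it accepts letters and digits outside ASCII. If only ASCII should pass, a small wrapper like the sketch below may help. The name is_ascii_alnum_str is an assumption for this example, and str.isascii() requires Python 3.7 or later:

```python
def is_ascii_alnum_str(text):
    """Return True if text is non-empty, pure ASCII, and alphanumeric."""
    # isascii() rejects anything outside the ASCII range;
    # isalnum() rejects punctuation, whitespace and the empty string.
    return text.isascii() and text.isalnum()

print('caf\u00e9'.isalnum())               # accented letter counts as alphanumeric
print(is_ascii_alnum_str('caf\u00e9'))     # rejected by the ASCII check
print(is_ascii_alnum_str('abc123'))
```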

You need to check whether the string is made up of either only letters or only digits:
df['some_column'].str.match(r'^(?:[A-Za-z]+|\d+)$')
As dawg has suggested, you can also use isalpha and isdigit:
df['some_column'].str.isalpha() | df['some_column'].str.isdigit()

Related

Python string regular expression

I need to do a string compare to see if 2 strings are equal, like:
>>> x = 'a1h3c'
>>> x == 'a__c'
>>> True
independent of the 3 characters in middle of the string.
You need to use anchors.
>>> import re
>>> x = 'a1h3c'
>>> pattern = re.compile(r'^a.*c$')
>>> pattern.match(x) is not None
True
This checks that the first and last characters are a and c, and it doesn't care about the characters in between.
If you want exactly three characters in the middle, you could use this:
>>> pattern = re.compile(r'^a...c$')
>>> pattern.match(x) is not None
True
Note that the end-of-line anchor $ is important: without it, a...c would also match afoocbarbuz.
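As a small sketch of the anchoring behaviour described above (the variable names are only illustrative), fullmatch() can stand in for the explicit ^ and $ anchors, and it is also stricter about a trailing newline:

```python
import re

pattern = re.compile(r'a...c')

# match() anchors only at the start, so trailing text is accepted.
prefix_only = pattern.match('afoocbarbuz') is not None
# fullmatch() requires the pattern to cover the whole string.
whole_string = pattern.fullmatch('afoocbarbuz') is not None
exact = pattern.fullmatch('a1h3c') is not None

# $ also matches just before a final newline; fullmatch() does not allow that.
dollar_newline = re.match(r'^a...c$', 'a1h3c\n') is not None
full_newline = pattern.fullmatch('a1h3c\n') is not None

print(prefix_only, whole_string, exact, dollar_newline, full_newline)
```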
Your problem could be solved with string indexing, but if you want an intro to regex, here ya go.
import re
your_match_object = re.match(pattern, string)
The pattern in your case would be
pattern = re.compile("a...c")  # the dot matches any character except a newline
From here, you can check whether your string fits this pattern with
print(pattern.match("a1h3c") is not None)
https://docs.python.org/2/howto/regex.html
https://docs.python.org/2/library/re.html#search-vs-match
if str1[0] == str2[0]:
    pass  # do something
You can repeat this check for as many positions as you like.
This is indexing. Here we get the first character; to get the last character, use [-1].
I'll also mention that with indexing the string can be of any length, as long as you know the position relative to the beginning or the end of the string.
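A minimal sketch of the indexing approach, assuming the goal is "same length, same first and last character". The function name same_ends and the three-underscore template are made up for this example:

```python
def same_ends(candidate, template):
    """Return True if both strings have the same length and share
    their first and last characters; the middle is ignored."""
    if not candidate or len(candidate) != len(template):
        return False
    # Index 0 is the first character, index -1 the last one.
    return candidate[0] == template[0] and candidate[-1] == template[-1]

print(same_ends('a1h3c', 'a___c'))
print(same_ends('a1h3d', 'a___c'))
```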

How to check if a string only contains letters?

I'm trying to check if a string only contains letters, not digits or symbols.
For example:
>>> only_letters("hello")
True
>>> only_letters("he7lo")
False
Simple:
if string.isalpha():
    print("It's all letters")
str.isalpha() is only true if all characters in the string are letters:
Return true if all characters in the string are alphabetic and there is at least one character, false otherwise.
Demo:
>>> 'hello'.isalpha()
True
>>> '42hello'.isalpha()
False
>>> 'hel lo'.isalpha()
False
The str.isalpha() function works, i.e.
if my_string.isalpha():
    print('it is letters')
For people finding this question via Google who might want to know if a string contains only a subset of all letters, I recommend using regexes:
import re
def only_letters(tested_string):
    match = re.match("^[ABCDEFGHJKLM]*$", tested_string)
    return match is not None
You can leverage regular expressions.
>>> import re
>>> pattern = re.compile("^[a-zA-Z]+$")
>>> pattern.match("hello")
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> pattern.match("hel7lo")
>>>
The match() method will return a Match object if a match is found. Otherwise it will return None.
An easier approach is to use the .isalpha() method
>>> "Hello".isalpha()
True
>>> "Hel7lo".isalpha()
False
isalpha() returns True if there is at least one character in the string and all the characters in the string are letters.
Actually, we now live in the globalized world of the 21st century, and people no longer communicate using ASCII only, so when answering a question about "is it letters only" you need to take letters from non-ASCII alphabets into account as well. Python has a pretty cool unicodedata library which, among other things, allows categorization of Unicode characters:
unicodedata.category('陳')
'Lo'
unicodedata.category('A')
'Lu'
unicodedata.category('1')
'Nd'
unicodedata.category('a')
'Ll'
The categories and their abbreviations are defined in the Unicode standard. From here you can quite easily come up with a function like this:
import unicodedata
def only_letters(s):
    for c in s:
        cat = unicodedata.category(c)
        if cat not in ('Ll', 'Lu', 'Lo'):
            return False
    return True
And then:
only_letters('Bzdrężyło')
True
only_letters('He7lo')
False
As you can see the whitelisted categories can be quite easily controlled by the tuple inside the function. See this article for a more detailed discussion.
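To illustrate how the whitelist can be adjusted, here is a sketch that turns the tuple into a parameter. The allowed argument is an addition for this example, not part of the answer above:

```python
import unicodedata

def only_letters(text, allowed=('Ll', 'Lu', 'Lo')):
    """Return True if every character's Unicode general category is in `allowed`.
    Note that an empty string passes, because there is nothing to reject."""
    return all(unicodedata.category(ch) in allowed for ch in text)

polish = 'Bzdr\u0119\u017cy\u0142o'     # 'Bzdrężyło' written with escapes
titlecase = '\u01c5'                     # a titlecase letter, category 'Lt'

print(only_letters(polish))
print(only_letters('He7lo'))
print(only_letters(titlecase))
print(only_letters(titlecase, allowed=('Ll', 'Lu', 'Lo', 'Lt', 'Lm')))
```

Widening the tuple to include 'Lt' and 'Lm' accepts titlecase and modifier letters as well, which the default three categories reject.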
The string.isalpha() function will work for you.
See http://www.tutorialspoint.com/python/string_isalpha.htm
Looks like people are saying to use str.isalpha.
This is the one line function to check if all characters are letters.
def only_letters(string):
    return all(letter.isalpha() for letter in string)
all accepts an iterable of booleans, and returns True iff all of the booleans are True.
More generally, all returns True if the objects in your iterable would be considered True. These would be considered False
0
None
Empty data structures (ie: len(list) == 0)
False. (duh)
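One consequence of how all works is worth a quick sketch: all() over an empty iterable is True, so the one-liner accepts the empty string, whereas str.isalpha() does not. The function names below are made up for this comparison:

```python
def only_letters_all(text):
    # all() of an empty iterable is True, so '' is accepted here.
    return all(ch.isalpha() for ch in text)

def only_letters_strict(text):
    # bool(text) is False for '', which mirrors str.isalpha() on empty input.
    return bool(text) and all(ch.isalpha() for ch in text)

print(only_letters_all(''), ''.isalpha(), only_letters_strict(''))
print(only_letters_all('hello'), only_letters_strict('hello'))
```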
(1) Use str.isalpha() when you print the string.
(2) See the program below for reference:
s = "this"  # no spaces or digits in this string
print(s.isalpha())  # prints True
s = "this is 2"
print(s.isalpha())  # prints False
Note: I checked the above example on Ubuntu.
A pretty simple solution I came up with: (Python 3)
def only_letters(tested_string):
    for letter in tested_string:
        if letter not in "abcdefghijklmnopqrstuvwxyz":
            return False
    return True
You can add a space in the string you are checking against if you want spaces to be allowed.
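A variation on the same idea, sketched with a set of allowed characters built from the standard string module. The names ALLOWED and only_allowed are assumptions for this example, and uppercase letters plus the space are included here by choice:

```python
import string

# Allowed characters: ASCII letters in both cases, plus the space.
ALLOWED = set(string.ascii_letters + ' ')

def only_allowed(text, allowed=ALLOWED):
    """Return True if every character of text is in the allowed set."""
    # set(text) <= allowed is a subset test.
    return set(text) <= allowed

print(only_allowed('Hello World'))
print(only_allowed('he7lo'))
```

As with the loop version, an empty string passes the check, so add an explicit length test if that matters.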

applying a filter on a string in python

I have a user typing in his username and I only want valid strings to pass through, meaning only characters in [a-zA-Z0-9]. I am pretty new to python and unsure of the syntax.
Here's an example of what I want in code, which is to check through the username and return False upon an illegal character:
def _checkInput(input):
    for char in input:
        if !(char in [a-zA-Z0-9]):
            return False
    return True
Thanks!
There is a method in string called isalnum. It does what you are trying to achieve.
In [7]: 'ab123fd'.isalnum()
Out[7]: True
In [8]: 'ab123fd **'.isalnum()
Out[8]: False
You need isalnum:
>>> name = raw_input('Enter your name: ')
Enter your name: foo_bar
>>> name.isalnum()
False
>>> name = raw_input('Enter your name: ')
Enter your name: foobar
>>> name.isalnum()
True
Python strings have lots of useful methods for doing this sort of check, such as:
str.isalnum()
str.isalpha()
str.isdigit()
str.islower()
str.istitle()
str.isupper()
What you need is str.isalnum() which returns true if all characters in the string are alphanumeric and there is at least one character.
>>> 'hello1'.isalnum()
True
>>> 'hello 1'.isalnum()
False
>>> 'hello!'.isalnum()
False
>>> ''.isalnum()
False
As the example above shows, letters and numbers are considered alphanumeric, but spaces and punctuation marks are not.
Also note that, contrary to what would be mathematically pure, the empty string is not considered alphanumeric. However, in most cases this is actually what you need, and certainly what you need in your case, as a user name of length zero does not make much sense.
That's very close to being Python:
import string
def _checkInput(input):
    for c in input:
        if not (c in string.ascii_letters or c in string.digits):
            return False
    return True
This can also be solved with regular expressions, but the above is perhaps clearer and less complex.
You can easily check input strings using regular expressions:
>>> import re
>>> s = getinput()
>>> if not re.match(r'^[a-zA-Z0-9]+$', s):
...     print("bad input")
Use * instead of + if the empty string is valid input too.
Using isalnum, as suggested in other answers, is a nice approach too, but with regular expressions you can easily adjust your check in case the requirements for input get more complex.
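As a sketch of how the regex approach adapts when requirements grow, suppose (purely as an assumed rule for illustration) that a username must start with a letter, may contain letters, digits and underscores, and must be 3 to 16 characters long:

```python
import re

# Hypothetical rule: first character is a letter, followed by 2 to 15
# letters, digits or underscores (3 to 16 characters in total).
USERNAME_RE = re.compile(r'[A-Za-z][A-Za-z0-9_]{2,15}')

def is_valid_username(name):
    """Return True if name satisfies the assumed username rule."""
    # fullmatch makes the pattern cover the whole string.
    return USERNAME_RE.fullmatch(name) is not None

for candidate in ('foo_bar', 'fo', '1foo', 'foo bar'):
    print(candidate, is_valid_username(candidate))
```

Expressing the same rule with isalnum alone would need several extra checks, which is where a pattern starts to pay off.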
One way to achieve this is to use the regular expression module of Python. It is a standard library.
import re
_pmatcher = re.compile(r'[0-9a-zA-Z]*$')
def _checkInput(input):
    return _pmatcher.match(input)
The r in front of the string is not a typo, it is to treat the string as raw, which you may want rather than typing escape characters.
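A short sketch of what the raw prefix changes in practice. The sample text is arbitrary:

```python
import re

# In a normal string "\n" is one newline character; in a raw string it
# is two characters, a backslash followed by n.
normal_len = len("\n")
raw_len = len(r"\n")

# "\b" in a normal string is a backspace character, so the regex engine
# never sees a word boundary; the raw string passes \b through unchanged.
without_raw = re.search("\bcat\b", "a cat sat")
with_raw = re.search(r"\bcat\b", "a cat sat")

print(normal_len, raw_len, without_raw, with_raw is not None)
```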
You can refer to this Python 2.7 Documents (or your chosen version of Python)
You may need in the future to verify that all the characters of a string are present in a particular list.
Without regex, it is possible like that:
ch = 'becdi30!&'
okchars = 'abcdefghijk012345,;:!&-'
print(all(c in okchars for c in ch))
result
True

Regular expression to confirm whether a string is a valid Python identifier?

I have the following definition for an Identifier:
Identifier --> letter{ letter| digit}
Basically I have an identifier function that gets a string from a file and tests it to make sure that it's a valid identifier as defined above.
I've tried this:
if re.match('\w+(\w\d)?', i):
    return True
else:
    return False
but when I run my program every time it meets an integer it thinks that it's a valid identifier.
For example
c = 0 ;
it prints c as a valid identifier which is fine, but it also prints 0 as a valid identifer.
What am I doing wrong here?
This question was asked 10 years ago, when Python 2 was still dominant. As many comments in the last decade demonstrated, my answer needed a serious update, starting with a big heads up:
No single regex will properly match all (and only) valid Python identifiers. It didn't for Python 2, it doesn't for Python 3.
The reasons are:
As @JoeCondron pointed out, Python reserved keywords such as True, if, return, are not valid identifiers, and regexes alone are unable to handle this, so additional filtering is required.
Python 3 allows non-ASCII letters and numbers in an identifier, but the Unicode categories of letters and numbers accepted by the lexical parser for a valid identifier do not match the categories covered by \d, \w, \W in the re module, as demonstrated in @martineau's counter-example and explained in great detail by @Hatshepsut's amazing research.
While we could try to solve the first issue using keyword.iskeyword(), as @Alexander Huszagh suggested, and work around the other by limiting ourselves to ASCII-only identifiers, why bother using a regex at all?
As Hatshepsut said:
str.isidentifier() works
Just use it, problem solved.
As requested by the question, my original 2012 answer presents a regular expression based on Python 2's official definition of an identifier:
identifier ::= (letter|"_") (letter | digit | "_")*
Which can be expressed by the regular expression:
^[^\d\W]\w*\Z
Example:
import re
identifier = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)
tests = [ "a", "a1", "_a1", "1a", "aa$%#%", "aa bb", "aa_bb", "aa\n" ]
for test in tests:
    result = re.match(identifier, test)
    print("%r\t= %s" % (test, (result is not None)))
Result:
'a' = True
'a1' = True
'_a1' = True
'1a' = False
'aa$%#%' = False
'aa bb' = False
'aa_bb' = True
'aa\n' = False
str.isidentifier() works. The regex answers incorrectly fail to match some valid python identifiers and incorrectly match some invalid ones.
str.isidentifier() Return true if the string is a valid identifier
according to the language definition, section Identifiers and
keywords.
Use keyword.iskeyword() to test for reserved identifiers such as def
and class.
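Putting the two documented checks together, a minimal sketch could look like this. The function name is_usable_identifier is made up for the example:

```python
import keyword

def is_usable_identifier(name):
    """Return True if name is a syntactically valid identifier
    and is not a reserved keyword."""
    # str.isidentifier() follows the language definition;
    # keyword.iskeyword() filters out names such as 'class' or 'True'.
    return name.isidentifier() and not keyword.iskeyword(name)

for candidate in ('c', '0', '_a1', 'class', 'aa bb'):
    print(repr(candidate), is_usable_identifier(candidate))
```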
@martineau's comment gives the example of '℘᧚' where the regex solutions fail.
>>> '℘᧚'.isidentifier()
True
>>> import re
>>> bool(re.search(r'^[^\d\W]\w*\Z', '℘᧚'))
False
Why does this happen?
Let's define the set of code points that match the given regular expression, and the set that match str.isidentifier.
import re
import unicodedata
chars = {chr(i) for i in range(0x10ffff) if re.fullmatch(r'^[^\d\W]\w*\Z', chr(i))}
identifiers = {chr(i) for i in range(0x10ffff) if chr(i).isidentifier()}
How many regex matches are not identifiers?
In [26]: len(chars - identifiers)
Out[26]: 698
How many identifiers are not regex matches?
In [27]: len(identifiers - chars)
Out[27]: 4
Interesting -- which ones?
In [37]: {(c, unicodedata.name(c), unicodedata.category(c)) for c in identifiers - chars}
Out[37]:
set([
('\u1885', 'MONGOLIAN LETTER ALI GALI BALUDA', 'Mn'),
('\u1886', 'MONGOLIAN LETTER ALI GALI THREE BALUDA', 'Mn'),
('℘', 'SCRIPT CAPITAL P', 'Sm'),
('℮', 'ESTIMATED SYMBOL', 'So'),
])
What's different about these two sets?
They have different Unicode "General Category" values.
In [31]: {unicodedata.category(c) for c in chars - identifiers}
Out[31]: set(['Lm', 'Lo', 'No'])
From Wikipedia, that's Letter, modifier; Letter, other; Number, other. This is consistent with the re docs, since \d is only decimal digits:
\d Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])
What about the other way?
In [32]: {unicodedata.category(c) for c in identifiers - chars}
Out[32]: set(['Mn', 'Sm', 'So'])
That's Mark, nonspacing; Symbol, math; Symbol, other.
Where is this all documented?
In the Python Language Reference
In PEP 3131 - Supporting non-ascii identifiers
Where is it implemented?
https://github.com/python/cpython/commit/47383403a0a11259acb640406a8efc38981d2255
I still want a regular expression
Look at the regex module on PyPI.
This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality.
It includes filters for "General Category".
For Python 3, you need to handle Unicode letters and digits. So if that's a concern, you should get along with this:
re_ident = re.compile(r"^[^\d\W]\w*$", re.UNICODE)
[^\d\W] matches a character that is not a digit and not "not alphanumeric" which translates to "a character that is a letter or underscore".
\w matches letters, digits and the underscore. Try ^[_a-zA-Z]\w*$
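Works like a charm: r'[^\d\W][\w\d]+' (though as written it is unanchored and requires at least two characters, so a single-letter name such as c is rejected).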
The question is about regex, so my answer may look off topic. The point is that regex is simply not the right approach.
Interested in getting the problematic characters?
Using str.isidentifier, one can perform the check character by character, prefixing every character after the first with, say, an underscore, so that characters that are only invalid in first position (digits and so on) are not flagged. After all, how could a name be valid if one of its (prefixed) components is not? E.g.
def checker(str_: str) -> 'set[str]':
    return {
        c for i, c in enumerate(str_)
        if not (f'_{c}' if i else c).isidentifier()
    }
>>> checker('℘3᧚₂')
{'₂'}
This also deals with disallowed first characters, such as digits or e.g. ᧚. See
>>> checker('᧚℘3₂')
{'₂', '᧚'}
>>> checker('3᧚℘₂')
{'3', '₂'}
>>> checker("a$%##%\n")
{'#', '\n', '$', '%'}
This could be improved, since it neither checks for reserved names nor says anything about why ᧚ is sometimes problematic whereas ₂ always is... but here is my without-regex approach.
My answer in your terms:
if not checker(i):
    return True
else:
    return False
which could be contracted into
return not checker(i)

how do I include a boolean AND within a regex?

Is there a way to get a single regex to satisfy this condition?
I am looking for a "word" that has three letters from the set MBDPI, in any order,
but MUST contain an I.
ie.
re.match("[MBDPI]{3}", foo) and "I" in foo
So this is the correct result (in python using the re module), but can I get this from a single regex?
>>> for foo in ("MBI", "MIB", "BIM", "BMI", "IBM", "IMB", "MBD"):
...     print(foo, bool(re.match("[MBDPI]{3}", foo) and "I" in foo))
MBI True
MIB True
BIM True
BMI True
IBM True
IMB True
MBD False
With regex I know I can use | as a boolean OR operator, but is there a boolean AND equivalent?
Or maybe I need some lookahead or lookbehind?
You can fake boolean AND by using lookaheads. According to http://www.regular-expressions.info/lookaround2.html, this will work for your case:
"\b(?=[MBDPI]{3}\b)\w*I\w*"
with regex I know I can use | as a boolean OR operator, but is there a boolean AND equivalent?
A and B = not ( not A or not B) = (?![^A]|[^B])
A and B being expressions that actually may have members in common.
Or is about the only thing you can do:
\b(I[MBDPI]{2}|[MBDPI]I[MBDPI]|[MBDPI]{2}I)\b
The \b character matches a zero-width word boundary. This ensures you match something that is exactly three characters long.
You're otherwise running into the limits to what a regular language can do.
An alternative is to match:
\b[MBDPI]{3}\b
capture that group and then look for an I.
Edit: for the sake of having a complete answer, I'll adapt Jens' answer that uses Testing The Same Part of a String for More Than One Requirement:
\b(?=[MBDPI]{3}\b)\w*I\w*
with the word boundary checks to ensure it's only three characters long.
This is a bit more of an advanced solution and applicable in more situations but I'd generally favour what's easier to read (being the "or" version imho).
You could use lookahead to see if an I is present:
(?=[MBDPI]{0,2}I)[MBDPI]{3}
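Here is a minimal sketch of that lookahead pattern applied to the sample words from the question. The helper name three_with_i is made up, and fullmatch is used so that exactly three characters are required:

```python
import re

# The lookahead asserts that an I occurs within the first three characters;
# the rest of the pattern then consumes exactly three letters from the set.
PATTERN = re.compile(r'(?=[MBDPI]{0,2}I)[MBDPI]{3}')

def three_with_i(word):
    """Return True if word is three letters from MBDPI and contains an I."""
    return PATTERN.fullmatch(word) is not None

for word in ("MBI", "MIB", "BIM", "BMI", "IBM", "IMB", "MBD"):
    print(word, three_with_i(word))
```

The lookahead consumes no input, which is what lets both conditions apply to the same three characters.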
