String may only contain A, U, G or C [duplicate] - python

This question already has answers here:
Test if string ONLY contains given characters [duplicate]
(7 answers)
Closed 8 years ago.
Forgive the simplistic question, but I've read through the SO questions and the Python documentation and still haven't been able to figure this out.
How can I create a Python regex to test whether a string contains ANY but ONLY the A, U, G and C characters? The string can contain either one or all of those characters, but if it contains any other characters, I'd like the regex to fail.
I tried:
>>> re.match(r"[AUGC]", "AUGGAC")
<_sre.SRE_Match object at 0x104ca1850>
But adding an X on to the end of the string still works, which is not what I expected:
>>> re.match(r"[AUGC]", "AUGGACX")
<_sre.SRE_Match object at 0x104ca1850>
Thanks in advance.

You need the regex to consume the whole string (or fail, if it can't). re.match implicitly adds an anchor at the start of the string, you need to add one to the end:
re.match(r"[AUGC]+$", string_to_check)
Also note the +, which repeatedly matches your character set (since, again, the point is to consume the whole string)

if the value is the only characters in the string, you can do the following:
>>> r = re.compile(r'^[AUGC]+$')
>>> r.match("AUGGAC")
<_sre.SRE_Match object at 0x10ee166b0>
>>> r.match("AUGGACX")
>>>
then if you want your regex to match the empty string as well, you can do:
>>> r = re.compile(r'^[AUGC]*$')
>>> r.match("")
<_sre.SRE_Match object at 0x10ee16718>
>>> r.match("AUGGAC")
<_sre.SRE_Match object at 0x10ee166b0>
>>> r.match("AUGGACX")
Here's a description of what the first regexp does:
Walk through it

Use ^[AUCG]*$; this will match against the entire string.
Or, if there has to be at least one letter, ^[AUCG]+$ — ^ and $ stand for beginning of string and end of string respectively; * and + stand for zero or more and one or more respectively.
This is purely about regular expressions and not specific to Python really.

You are actually really close. What you have just tests for a single character that A or U or G or C.
What you want is to match a string that has one or more letters that are all A or U or G or C, you can accomplish this by adding the plus modifier to your regular expression.
re.match(r"^[AUGC]+$", "AUGGAC")
Additionally, adding $ at the end marks the end of string, you can optionally use ^ at the front to match the beginning of the string.

Just check to see if there is anything other than "AUGC" in there:
if re.search('[^AUGC]', string_to_check):
#fail
You can add a check to make sure the string is not empty in the same statement:
if not string_to_check or re.search('[^AUGC]', string_to_check):
#fail

No real need to use a regex:
>>> good = 'AUGGCUA'
>>> bad = 'AUGHACUA'
>>> all([c in 'AUGC' for c in good])
True
>>> all([c in 'AUGC' for c in bad])
False

I know you're asking about regular expressions but I though it was worth mentioning set. To establish whether your string only contains A U G or C, you could do this:
>>> input = "AUCGCUAGCGAU"
>>> s = set("AUGC")
>>> set(input) <= s
True
>>> bad = "ASNMSA"
>>> set(bad) <= s
False
edit: thanks to #roippi for spotting my mistake, <= should be used, not ==.
Instead of using <=, the method issubset can be used:
>>> set("AUGAUG").issubset(s)
True
if all characters in the string input are in the set s, then issubset will return True.

From: https://docs.python.org/2/library/re.html
Characters that are not within a range can be matched by complementing the set.
If the first character of the set is '^', all the characters that are not in the set will be matched.
For example, [^5] will match any character except '5', and [^^] will match any character except '^'.
^ has no special meaning if it’s not the first character in the set.
So you could do [^AUGC] and if it matches that then reject it, else keep it.

Related

Python - an extremely odd behavior of function lstrip [duplicate]

This question already has answers here:
Python string.strip stripping too many characters [duplicate]
(3 answers)
Closed 6 years ago.
I have encountered a very odd behavior of built-in function lstrip.
I will explain with a few examples:
print 'BT_NAME_PREFIX=MUV'.lstrip('BT_NAME_PREFIX=') # UV
print 'BT_NAME_PREFIX=NUV'.lstrip('BT_NAME_PREFIX=') # UV
print 'BT_NAME_PREFIX=PUV'.lstrip('BT_NAME_PREFIX=') # UV
print 'BT_NAME_PREFIX=SUV'.lstrip('BT_NAME_PREFIX=') # SUV
print 'BT_NAME_PREFIX=mUV'.lstrip('BT_NAME_PREFIX=') # mUV
As you can see, the function trims one additional character sometimes.
I tried to model the problem, and noticed that it persisted if I:
Changed BT_NAME_PREFIX to BT_NAME_PREFIY
Changed BT_NAME_PREFIX to BT_NAME_PREFIZ
Changed BT_NAME_PREFIX to BT_NAME_PREF
Further attempts have made it even more weird:
print 'BT_NAME=MUV'.lstrip('BT_NAME=') # UV
print 'BT_NAME=NUV'.lstrip('BT_NAME=') # UV
print 'BT_NAME=PUV'.lstrip('BT_NAME=') # PUV - different than before!!!
print 'BT_NAME=SUV'.lstrip('BT_NAME=') # SUV
print 'BT_NAME=mUV'.lstrip('BT_NAME=') # mUV
Could someone please explain what on earth is going on here?
I know I might as well just use array-slicing, but I would still like to understand this.
Thanks
You're misunderstanding how lstrip works. It treats the characters you pass in as a bag and it strips characters that are in the bag until it finds a character that isn't in the bag.
Consider:
'abc'.lstrip('ba') # 'c'
It is not removing a substring from the start of the string. To do that, you need something like:
if s.startswith(prefix):
s = s[len(prefix):]
e.g.:
>>> s = 'foobar'
>>> prefix = 'foo'
>>> if s.startswith(prefix):
... s = s[len(prefix):]
...
>>> s
'bar'
Or, I suppose you could use a regular expression:
>>> s = 'foobar'
>>> import re
>>> re.sub('^foo', '', s)
'bar'
The argument given to lstrip is a list of things to remove from the left of a string, on a character by character basis. The phrase is not considered, only the characters themselves.
S.lstrip([chars]) -> string or unicode
Return a copy of the string S with leading whitespace removed. If
chars is given and not None, remove characters in chars instead. If
chars is unicode, S will be converted to unicode before stripping
You could solve this in a flexible way using regular expressions (the re module):
>>> import re
>>> re.sub('^BT_NAME_PREFIX=', '', 'BT_NAME_PREFIX=MUV')
MUV

Python string regular expression

I need to do a string compare to see if 2 strings are equal, like:
>>> x = 'a1h3c'
>>> x == 'a__c'
>>> True
independent of the 3 characters in middle of the string.
You need to use anchors.
>>> import re
>>> x = 'a1h3c'
>>> pattern = re.compile(r'^a.*c$')
>>> pattern.match(x) != None
True
This would check for the first and last char to be a and c . And it won't care about the chars present at the middle.
If you want to check for exactly three chars to be present at the middle then you could use this,
>>> pattern = re.compile(r'^a...c$')
>>> pattern.match(x) != None
True
Note that end of the line anchor $ is important , without $, a...c would match afoocbarbuz.
Your problem could be solved with string indexing, but if you want an intro to regex, here ya go.
import re
your_match_object = re.match(pattern,string)
the pattern in your case would be
pattern = re.compile("a...c") # the dot denotes any char but a newline
from here, you can see if your string fits this pattern with
print pattern.match("a1h3c") != None
https://docs.python.org/2/howto/regex.html
https://docs.python.org/2/library/re.html#search-vs-match
if str1[0] == str2[0]:
# do something.
You can repeat this statement as many times as you like.
This is slicing. We're getting the first value. To get the last value, use [-1].
I'll also mention, that with slicing, the string can be of any size, as long as you know the relative position from the beginning or the end of the string.

python re, find expression containing an optional group

I have a regular expression that can have either from:
(src://path/to/foldernames canhave spaces/file.xzy)
(src://path/to/foldernames canhave spaces/file.xzy "optional string")
These expressions occur within a much longer string (they are not individual strings). I am having trouble matching both expressions when using re.search or re.findall (as there may be multiple expression in the string).
It's straightforward enough to match either individually but how can I go about matching either case so that two groups are returned, the first with src://path/... and the second with the optional string if it exists or None if not?
I am thinking that I need to somehow specify OR groups---for instance, consider:
The pattern \((.*)( ".*")\) matches the second instance but not the first because it does not contain "...".
r = re.search(r'\((.*)( ".*")\)', '(src://path/to/foldernames canhave spaces/file.xzy)'
r.groups() # Nothing found
AttributeError: 'NoneType' object has no attribute 'groups'
While \((.*)( ".*")?\) matches the first group but does not individually identify the "optional string" as a group in the second instance.
r = re.search(r'\((.*)( ".*")?\)', '(src://path/to/foldernames canhave spaces/file.xzy "optional string")')
r.groups()
('src://path/to/foldernames canhave spaces/file.xzy "optional string"', None)
Any thoughts, ye' masters of expressions (of the regular variety)?
The simplest way is to make the first * non-greedy:
>>> import re
>>> string = "(src://path/to/foldernames canhave spaces/file.xzy)"
>>> string2 = \
... '(src://path/to/foldernames canhave spaces/file.xzy "optional string")'
>>> re.findall(r'\((.*?)( ".*")?\)', string2)
[('src://path/to/foldernames canhave spaces/file.xzy', ' "optional string"')]
>>> re.findall(r'\((.*?)( ".*")?\)', string)
[('src://path/to/foldernames canhave spaces/file.xzy', '')]
Since " aren't usually allowed to appear in file names, you can simply exclude them from the first group:
r = re.search(r'\(([^"]*)( ".*")?\)', input)
This is generally the preferred alternative to ungreedy repetition, because tends to be a lot more efficient. If your file names can actually contain quotes for some reason, then ungreedy repetition (as in agf's answer) is your best bet.

Python regular expression; why do the search & match appear to find alpha chars in a number string?

I'm running search below Idle, in Python 2.7 in a Windows Bus. 64 bit environment.
According to RegexBuddy, the search pattern ('patternalphaonly') should not produce a match against a string of digits.
I looked at "http://docs.python.org/howto/regex.html", but did not see anything there that would explain why the search and match appear to be successful in finding something matching the pattern.
Does anyone know what I'm doing wrong, or misunderstanding?
>>> import re
>>> numberstring = '3534543234543'
>>> patternalphaonly = re.compile('[a-zA-Z]*')
>>> result = patternalphaonly.search(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
>>> result = patternalphaonly.match(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
Thanks
The star operator (*) indicates zero or more repetitions. Your string has zero repetitions of an English alphabet letter because it is entirely numbers, which is perfectly valid when using the star (repeat zero times). Instead use the + operator, which signifies one or more repetitions. Example:
>>> n = "3534543234543"
>>> r1 = re.compile("[a-zA-Z]*")
>>> r1.match(n)
<_sre.SRE_Match object at 0x07D85720>
>>> r2 = re.compile("[a-zA-Z]+") #using the + operator to make sure we have at least one letter
>>> r2.match(n)
Helpful link on repetition operators.
Everything eldarerathis says is true. However, with a variable named: 'patternalphaonly' I would assume that the author wants to verify that a string is composed of alpha chars only. If this is true then I would add additional end-of-string anchors to the regex like so:
patternalphaonly = re.compile('^[a-zA-Z]+$')
result = patternalphaonly.search(numberstring)
Or, better yet, since this will only ever match at the beginning of the string, use the preferred match method:
patternalphaonly = re.compile('[a-zA-Z]+$')
result = patternalphaonly.match(numberstring)
(Which, as John Machin has pointed out, is evidently faster for some as-yet unexplained reason.)

Regular expression to confirm whether a string is a valid Python identifier?

I have the following definition for an Identifier:
Identifier --> letter{ letter| digit}
Basically I have an identifier function that gets a string from a file and tests it to make sure that it's a valid identifier as defined above.
I've tried this:
if re.match('\w+(\w\d)?', i):
return True
else:
return False
but when I run my program every time it meets an integer it thinks that it's a valid identifier.
For example
c = 0 ;
it prints c as a valid identifier which is fine, but it also prints 0 as a valid identifer.
What am I doing wrong here?
Question was made 10 years ago, when Python 2 was still dominant. As many comments in the last decade demonstrated, my answer needed a serious update, starting with a big heads up:
No single regex will properly match all (and only) valid Python identifiers. It didn't for Python 2, it doesn't for Python 3.
The reasons are:
As #JoeCondron pointed out, Python reserved keywords such as True, if, return, are not valid identifiers, and regexes alone are unable to handle this, so additional filtering is required.
Python 3 allows non-ascii letters and numbers in an identifier, but the Unicode categories of letters and numbers accepted by the lexical parser for a valid identifier do not match the same categories of \d, \w, \W in the re module, as demonstrated in #martineau's counter-example and explained in great detail by #Hatshepsut's amazing research.
While we could try to solve the first issue using keyword.iskeyword(), as #Alexander Huszagh suggested, and workaround the other by limiting to ascii-only identifiers, why bother using a regex at all?
As Hatshepsut said:
str.isidentifier() works
Just use it, problem solved.
As requested by the question, my original 2012 answer presents a regular expression based on the Python's 2 official definition of an identifier:
identifier ::= (letter|"_") (letter | digit | "_")*
Which can be expressed by the regular expression:
^[^\d\W]\w*\Z
Example:
import re
identifier = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)
tests = [ "a", "a1", "_a1", "1a", "aa$%#%", "aa bb", "aa_bb", "aa\n" ]
for test in tests:
result = re.match(identifier, test)
print("%r\t= %s" % (test, (result is not None)))
Result:
'a' = True
'a1' = True
'_a1' = True
'1a' = False
'aa$%#%' = False
'aa bb' = False
'aa_bb' = True
'aa\n' = False
str.isidentifier() works. The regex answers incorrectly fail to match some valid python identifiers and incorrectly match some invalid ones.
str.isidentifier() Return true if the string is a valid identifier
according to the language definition, section Identifiers and
keywords.
Use keyword.iskeyword() to test for reserved identifiers such as def
and class.
#martineau's comment gives the example of '℘᧚' where the regex solutions fail.
>>> '℘᧚'.isidentifier()
True
>>> import re
>>> bool(re.search(r'^[^\d\W]\w*\Z', '℘᧚'))
False
Why does this happen?
Lets define the sets of code points that match the given regular expression, and the set that match str.isidentifier.
import re
import unicodedata
chars = {chr(i) for i in range(0x10ffff) if re.fullmatch(r'^[^\d\W]\w*\Z', chr(i))}
identifiers = {chr(i) for i in range(0x10ffff) if chr(i).isidentifier()}
How many regex matches are not identifiers?
In [26]: len(chars - identifiers)
Out[26]: 698
How many identifiers are not regex matches?
In [27]: len(identifiers - chars)
Out[27]: 4
Interesting -- which ones?
In [37]: {(c, unicodedata.name(c), unicodedata.category(c)) for c in identifiers - chars}
Out[37]:
set([
('\u1885', 'MONGOLIAN LETTER ALI GALI BALUDA', 'Mn'),
('\u1886', 'MONGOLIAN LETTER ALI GALI THREE BALUDA', 'Mn'),
('℘', 'SCRIPT CAPITAL P', 'Sm'),
('℮', 'ESTIMATED SYMBOL', 'So'),
])
What's different about these two sets?
They have different Unicode "General Category" values.
In [31]: {unicodedata.category(c) for c in chars - identifiers}
Out[31]: set(['Lm', 'Lo', 'No'])
From wikipedia, that's Letter, modifier; Letter, other; Number, other. This is consistent with the re docs, since \d is only decimal digits:
\d Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])
What about the other way?
In [32]: {unicodedata.category(c) for c in identifiers - chars}
Out[32]: set(['Mn', 'Sm', 'So'])
That's Mark, nonspacing; Symbol, math; Symbol, other.
Where is this all documented?
In the Python Language Reference
In PEP 3131 - Supporting non-ascii identifiers
Where is it implemented?
https://github.com/python/cpython/commit/47383403a0a11259acb640406a8efc38981d2255
I still want a regular expression
Look at the regex module on PyPI.
This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality.
It includes filters for "General Category".
For Python 3, you need to handle Unicode letters and digits. So if that's a concern, you should get along with this:
re_ident = re.compile(r"^[^\d\W]\w*$", re.UNICODE)
[^\d\W] matches a character that is not a digit and not "not alphanumeric" which translates to "a character that is a letter or underscore".
\w matches digits and characters. Try ^[_a-zA-Z]\w*$
Works like a charm: r'[^\d\W][\w\d]+'
The question is about regex, so my answer may look out of subject. The point is that regex is simply not the right approach.
Interested in getting the problematic characters ?
Using str.isidentifier, one can perform the check character by character, prefixing them with, say, an underscore to avoid false positive such as digits and so on... How could a name be valid if one of its (prefixed) component is not (?) E.g.
def checker(str_: str) -> 'set[str]':
return {
c for i, c in enumerate(str_)
if not (f'_{c}' if i else c).isidentifier()
}
>>> checker('℘3᧚₂')
{'₂'}
Which solution deals with unauthorised first characters, such as digits or e.g. ᧚. See
>>> checker('᧚℘3₂')
{'₂', '᧚'}
>>> checker('3᧚℘₂')
{'3', '₂'}
>>> checker("a$%##%\n")
{'#', '#', '\n', '$', '%'}
To be improved, since it does check neither for reserved names, nor tells anything about why ᧚ is sometime problematic, whereas ₂ always is... but here is my without-regex approach.
My answer in your terms:
if not checker(i):
return True
else:
return False
which could be contracted into
return not checker(i)

Categories