python re, find expression containing an optional group - python

I have an expression that can take either of these forms:
(src://path/to/foldernames canhave spaces/file.xzy)
(src://path/to/foldernames canhave spaces/file.xzy "optional string")
These expressions occur within a much longer string (they are not individual strings). I am having trouble matching both forms when using re.search or re.findall (there may be multiple expressions in the string).
It's straightforward enough to match either individually but how can I go about matching either case so that two groups are returned, the first with src://path/... and the second with the optional string if it exists or None if not?
I think I need to specify alternative (OR) groups somehow. For instance, consider:
The pattern \((.*)( ".*")\) matches the second instance but not the first because it does not contain "...".
r = re.search(r'\((.*)( ".*")\)', '(src://path/to/foldernames canhave spaces/file.xzy)')
r.groups() # Nothing found
AttributeError: 'NoneType' object has no attribute 'groups'
While \((.*)( ".*")?\) matches the first instance but does not capture the "optional string" as a separate group in the second instance.
r = re.search(r'\((.*)( ".*")?\)', '(src://path/to/foldernames canhave spaces/file.xzy "optional string")')
r.groups()
('src://path/to/foldernames canhave spaces/file.xzy "optional string"', None)
Any thoughts, ye' masters of expressions (of the regular variety)?

The simplest way is to make the first * non-greedy:
>>> import re
>>> string = "(src://path/to/foldernames canhave spaces/file.xzy)"
>>> string2 = \
... '(src://path/to/foldernames canhave spaces/file.xzy "optional string")'
>>> re.findall(r'\((.*?)( ".*")?\)', string2)
[('src://path/to/foldernames canhave spaces/file.xzy', ' "optional string"')]
>>> re.findall(r'\((.*?)( ".*")?\)', string)
[('src://path/to/foldernames canhave spaces/file.xzy', '')]

Since " aren't usually allowed to appear in file names, you can simply exclude them from the first group:
r = re.search(r'\(([^"]*)( ".*")?\)', input)
This is generally the preferred alternative to ungreedy repetition, because it tends to be a lot more efficient. If your file names can actually contain quotes for some reason, then ungreedy repetition (as in agf's answer) is your best bet.
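For the case the question describes (several expressions inside one longer string, with None when the optional part is absent), the two suggestions above can be combined. The following is only a sketch, not either answerer's exact code. It assumes paths contain no double quotes or parentheses and that the optional string contains no double quotes; find_sources and the sample text are hypothetical names made up for illustration.

```python
import re

# Group 1: the src:// path (lazy, and excluding quotes and parentheses).
# Group 2: the optional quoted string, captured without its quotes.
PATTERN = re.compile(r'\((src://[^"()]*?)(?: "([^"]*)")?\)')


def find_sources(text):
    """Return a (path, optional_string_or_None) tuple for every expression."""
    # finditer + groups() reports an unmatched optional group as None,
    # whereas findall would report it as an empty string.
    return [m.groups() for m in PATTERN.finditer(text)]


text = ('intro (src://path/to/foldernames canhave spaces/file.xzy) middle '
        '(src://other dir/file2.xzy "optional string") end')
print(find_sources(text))
```

Using finditer rather than findall is the design choice that yields None instead of '' for the missing group.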

Related

How to escape null characters .i.e [' '] while using regex split function? [duplicate]

I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']
I'm no Python expert but maybe you could just remove the empty strings from your list?
str_list = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
time_info = list(filter(None, str_list))  # list() is needed on Python 3, where filter returns an iterator
Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
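A minimal sketch of the named-group variant follows. The group names start and end are illustrative choices (the pattern above uses groupA and groupB), and parsing the values with the format '%Y%m%dT%H%M%S' is an assumption about what the timestamps mean, based only on how they look.

```python
import re
from datetime import datetime

f = '000014_L_20111007T084734-20111008T023142.txt'

# Named groups let the result be read as a dict instead of a tuple.
match = re.search(r'[LU]_(?P<start>\w+)-(?P<end>\w+)\.', f)
stamps = match.groupdict()
print(stamps)

# Assumed timestamp layout: YYYYMMDD, a literal T, then HHMMSS.
parsed = {name: datetime.strptime(value, '%Y%m%dT%H%M%S')
          for name, value in stamps.items()}
print(parsed['start'], parsed['end'])
```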
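If the timestamps are always after the second _ then you can use str.split after slicing off the extension. (Avoid strs.strip(".txt") for this: strip removes any of the characters '.', 't' and 'x' from both ends rather than the literal suffix, so it only works by luck on some names.)
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs[:-len(".txt")].split("_", 2)[-1].split("-")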
['20111007T084734', '20111008T023142']
Since this came up on Google, and for completeness: try using re.findall as an alternative!
This does require a little rethinking, but it still returns a list of matches like split does. This makes it a nice drop-in replacement for some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer and doesn't use re.split(). It does solve the underlying issue: your list of matches suddenly has zero-length strings in it and you don't want that.
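The answer above gives no pattern, so here is one possible sketch of the re.findall approach. Both patterns are assumptions about the file-name layout seen in the question (eight digits, a T, six digits), not something the answer itself specifies.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Variant 1: describe the timestamps themselves.
time_info = re.findall(r'\d{8}T\d{6}', f)
print(time_info)

# Variant 2: use lookarounds, so the delimiters are required around each
# match but are not included in the result.
time_info_lookaround = re.findall(r'(?<=[_-])[0-9T]+(?=[-.])', f)
print(time_info_lookaround)
```

Unlike re.split, neither variant produces empty strings, because only the wanted text is matched.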
>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

simple regex pattern not matching [duplicate]

>>> import re
>>> s = 'this is a test'
>>> reg1 = re.compile('test$')
>>> match1 = reg1.match(s)
>>> print match1
None
In Kiki, that pattern matches the test at the end of s. What am I missing? (I tried re.compile(r'test$') as well.)
Use
match1 = reg1.search(s)
instead. The match function only matches at the start of the string ... see the documentation here:
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
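The difference can be seen side by side in a short sketch. It is written for Python 3, where print is a function; re.fullmatch is included as a third option and needs Python 3.4 or later.

```python
import re

s = 'this is a test'
reg1 = re.compile(r'test$')

at_start = reg1.match(s)    # only tries to match at position 0
anywhere = reg1.search(s)   # scans forward through the string
whole = re.fullmatch(r'this is a test', s)  # must cover the entire string

print(at_start)                           # None
print(anywhere.group(), anywhere.span())  # test (10, 14)
print(whole is not None)                  # True
```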
Your regex does not match the full string. You can use search instead as Useless mentioned, or you can change your regex to match the full string:
'^this is a test$'
Or somewhat harder to read but somewhat less useless:
'^t[^t]*test$'
It depends on what you're trying to do.
This happens because the match method returns None if it cannot find the expected pattern at the start of the string; if it finds the pattern, it returns an object of type _sre.SRE_Match.
So, if you want a Boolean (True or False) result from match, you must check whether the result is None.
You could check whether a text matches like this:
import re

string_to_evaluate = "Your text that needs to be examined"
expected_pattern = "pattern"
if re.match(expected_pattern, string_to_evaluate) is not None:
    print("The text is as you expected!")
else:
    print("The text is not as you expected!")

String may only contain A, U, G or C [duplicate]

Forgive the simplistic question, but I've read through the SO questions and the Python documentation and still haven't been able to figure this out.
How can I create a Python regex to test whether a string contains ANY but ONLY the A, U, G and C characters? The string can contain either one or all of those characters, but if it contains any other characters, I'd like the regex to fail.
I tried:
>>> re.match(r"[AUGC]", "AUGGAC")
<_sre.SRE_Match object at 0x104ca1850>
But adding an X on to the end of the string still works, which is not what I expected:
>>> re.match(r"[AUGC]", "AUGGACX")
<_sre.SRE_Match object at 0x104ca1850>
Thanks in advance.
You need the regex to consume the whole string (or fail, if it can't). re.match implicitly anchors at the start of the string, so you only need to add an anchor at the end:
re.match(r"[AUGC]+$", string_to_check)
Also note the +, which repeatedly matches your character set (since, again, the point is to consume the whole string).
If those are the only characters allowed in the string, you can do the following:
>>> r = re.compile(r'^[AUGC]+$')
>>> r.match("AUGGAC")
<_sre.SRE_Match object at 0x10ee166b0>
>>> r.match("AUGGACX")
>>>
Then, if you want your regex to match the empty string as well, you can do:
>>> r = re.compile(r'^[AUGC]*$')
>>> r.match("")
<_sre.SRE_Match object at 0x10ee16718>
>>> r.match("AUGGAC")
<_sre.SRE_Match object at 0x10ee166b0>
>>> r.match("AUGGACX")
Use ^[AUCG]*$; this will match against the entire string.
Or, if there has to be at least one letter, use ^[AUCG]+$. Here ^ and $ stand for beginning of string and end of string respectively; * and + stand for zero or more and one or more respectively.
This is purely about regular expressions and not specific to Python really.
You are actually really close. What you have just tests for a single character that is A or U or G or C.
What you want is to match a string that has one or more letters that are all A or U or G or C. You can accomplish this by adding the plus modifier to your regular expression.
re.match(r"^[AUGC]+$", "AUGGAC")
Additionally, adding $ at the end marks the end of the string. You can optionally use ^ at the front to match the beginning of the string.
Just check to see if there is anything other than "AUGC" in there:
if re.search('[^AUGC]', string_to_check):
    pass  # fail
You can add a check to make sure the string is not empty in the same statement:
if not string_to_check or re.search('[^AUGC]', string_to_check):
    pass  # fail
No real need to use a regex:
>>> good = 'AUGGCUA'
>>> bad = 'AUGHACUA'
>>> all([c in 'AUGC' for c in good])
True
>>> all([c in 'AUGC' for c in bad])
False
I know you're asking about regular expressions but I thought it was worth mentioning set. To establish whether your string only contains A, U, G or C, you could do this:
>>> input = "AUCGCUAGCGAU"
>>> s = set("AUGC")
>>> set(input) <= s
True
>>> bad = "ASNMSA"
>>> set(bad) <= s
False
Edit: thanks to @roippi for spotting my mistake; <= should be used, not ==.
Instead of using <=, the method issubset can be used:
>>> set("AUGAUG").issubset(s)
True
If all characters in the string input are in the set s, then issubset will return True.
From: https://docs.python.org/2/library/re.html
Characters that are not within a range can be matched by complementing the set.
If the first character of the set is '^', all the characters that are not in the set will be matched.
For example, [^5] will match any character except '5', and [^^] will match any character except '^'.
^ has no special meaning if it’s not the first character in the set.
So you could do [^AUGC] and if it matches that then reject it, else keep it.
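The regex and set approaches from the answers above can be compared in one sketch. The function names is_rna_regex and is_rna_set are hypothetical, and re.fullmatch assumes Python 3.4 or later. One detail worth knowing: $ also matches just before a trailing newline, so a pattern ending in $ accepts "AUG\n", while fullmatch does not.

```python
import re

ALLOWED = set("AUGC")
PATTERN = re.compile(r'[AUGC]+')


def is_rna_regex(seq):
    """True if seq is non-empty and consists only of A, U, G, C."""
    return PATTERN.fullmatch(seq) is not None


def is_rna_set(seq):
    """Same check without a regex: non-empty and a subset of ALLOWED."""
    return bool(seq) and set(seq) <= ALLOWED


for candidate in ["AUGGAC", "AUGGACX", "", "AUG\n"]:
    print(repr(candidate), is_rna_regex(candidate), is_rna_set(candidate))

# The $ anchor tolerates one trailing newline; fullmatch does not.
print(re.match(r'[AUGC]+$', "AUG\n") is not None)
```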

python regular expression substitute

I need to find the value of "taxid" in a large number of strings similar to the one given below. For this particular string, the "taxid" value is '9606'. I need to discard everything else. The "taxid" may appear anywhere in the text, but will always be followed by a ":" and then a number.
score:0.86|taxid:9606(Human)|intact:EBI-999900
How do I write a regular expression for this in Python?
>>> import re
>>> s = 'score:0.86|taxid:9606(Human)|intact:EBI-999900'
>>> re.search(r'taxid:(\d+)', s).group(1)
'9606'
If there are multiple taxids, use re.findall, which returns a list of all matches:
>>> re.findall(r'taxid:(\d+)', s)
['9606']
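for line in lines:
    # Captures the whole field after "taxid:", e.g. '9606(Human)', and
    # requires a | on both sides of the field.
    match = re.match(r".*\|taxid:([^|]+)\|.*", line)
    if match:  # skip lines where the pattern is not found
        print(match.groups())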
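For a large number of strings, the search-based answer can be wrapped in a small helper. This is a sketch only: extract_taxid is a hypothetical name, and the second and third sample lines are made up to show a taxid at the start of a line and a line without one.

```python
import re

# \b keeps the pattern from matching inside a longer key that ends in "taxid".
TAXID = re.compile(r'\btaxid:(\d+)')


def extract_taxid(line):
    """Return the digits after 'taxid:' as a string, or None if absent."""
    m = TAXID.search(line)
    return m.group(1) if m else None


lines = [
    'score:0.86|taxid:9606(Human)|intact:EBI-999900',
    'taxid:10090(Mouse)|score:0.41',   # made-up sample line
    'score:0.12|intact:EBI-000000',    # made-up sample line, no taxid
]
print([extract_taxid(line) for line in lines])
```

Because search scans the whole line, the field can sit anywhere, including at the very start or end where no surrounding | exists.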

Python regular expression; why do the search & match appear to find alpha chars in a number string?

I'm running the search below in IDLE, with Python 2.7 in a Windows Bus. 64-bit environment.
According to RegexBuddy, the search pattern ('patternalphaonly') should not produce a match against a string of digits.
I looked at "http://docs.python.org/howto/regex.html", but did not see anything there that would explain why the search and match appear to be successful in finding something matching the pattern.
Does anyone know what I'm doing wrong, or misunderstanding?
>>> import re
>>> numberstring = '3534543234543'
>>> patternalphaonly = re.compile('[a-zA-Z]*')
>>> result = patternalphaonly.search(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
>>> result = patternalphaonly.match(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
Thanks
The star operator (*) indicates zero or more repetitions. Your string has zero repetitions of an English alphabet letter because it is entirely numbers, which is perfectly valid when using the star (repeat zero times). Instead use the + operator, which signifies one or more repetitions. Example:
>>> n = "3534543234543"
>>> r1 = re.compile("[a-zA-Z]*")
>>> r1.match(n)
<_sre.SRE_Match object at 0x07D85720>
>>> r2 = re.compile("[a-zA-Z]+") #using the + operator to make sure we have at least one letter
>>> r2.match(n)
Everything eldarerathis says is true. However, with a variable named 'patternalphaonly', I would assume that the author wants to verify that a string is composed of alpha chars only. If so, I would add start- and end-of-string anchors to the regex like so:
patternalphaonly = re.compile('^[a-zA-Z]+$')
result = patternalphaonly.search(numberstring)
Or, better yet, since this will only ever match at the beginning of the string, use the preferred match method:
patternalphaonly = re.compile('[a-zA-Z]+$')
result = patternalphaonly.match(numberstring)
(Which, as John Machin has pointed out, is evidently faster for some as-yet unexplained reason.)
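To see what the original pattern actually matched, it helps to inspect the match object rather than just print it. The sketch below is written for Python 3; the names star, plus and alpha_only are illustrative.

```python
import re

numberstring = '3534543234543'
star = re.compile(r'[a-zA-Z]*')
plus = re.compile(r'[a-zA-Z]+')

# The * version succeeds, but only with an empty match at position 0.
m = star.search(numberstring)
print(repr(m.group()), m.span())   # '' (0, 0)

# The + version needs at least one letter, so it finds nothing.
print(plus.search(numberstring))   # None

# Anchoring the end (match already anchors the start) checks the whole string.
alpha_only = re.compile(r'[a-zA-Z]+$')
print(alpha_only.match('abc') is not None)      # True
print(alpha_only.match('abc123') is not None)   # False
```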
