Python: Regular expressions for getting a repeating set of numbers

I'm working with a file that is a GenBank entry (similar to this).
My goal is to extract the numbers in the CDS line, e.g.:
CDS join(1200..1401,3490..4302)
but my regex should also be able to extract the numbers from multiple lines, like this:
CDS join(1200..1401,1550..1613,1900..2010,2200..2250,
2300..2660,2800..2999,3100..3333)
I'm using this regular expression:
import re
match=re.compile(r'\w+\D+\W*(\d+)\D*')
result=match.findall(line)
print(result)
This gives me the correct numbers but also numbers from the rest of the file, like
gene complement(3300..4037)
So how can I change my regex to get only the numbers from the CDS entry?
I can only use regex for this.
I'm going to use the numbers to print the coding part of the base sequence.

You could use the heavily improved regex module by Matthew Barnett (which provides the \G functionality). With this, you could come up with the following code:
import regex as re
rx = re.compile(r"""
(?:
CDS\s+join\( # look for CDS, followed by whitespace and join(
| # OR
(?!\A)\G # make sure it's not the start of the string and \G
[.,\s]+ # followed by ., or whitespace
)
(\d+) # capture these digits
""", re.VERBOSE)
string = """
CDS join(1200..1401,1550..1613,1900..2010,2200..2250,
2300..2660,2800..2999,3100..3333)
"""
numbers = rx.findall(string)
print(numbers)
# ['1200', '1401', '1550', '1613', '1900', '2010', '2200', '2250', '2300', '2660', '2800', '2999', '3100', '3333']
\G makes sure the regex engine looks for the next match at the end of the last match.
See a demo on regex101.com (in PHP as the emulator does not provide the same functionality for Python [it uses the original re module]).
A far inferior solution (if you are only allowed to use the re module) would be to use lookarounds:
(?<=[(.,\s])(\d+)(?=[,.)])
(?<=) is a positive lookbehind, while (?=) is a positive lookahead, see a demo for this approach on regex101.com. Be aware though there might be a couple of false positives.
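As a rough check of the lookaround pattern with the standard re module, the sketch below runs it on a hypothetical two-line sample (the sample text is an assumption, not taken from a real file). It suggests where the false positives come from: the numbers on the gene line are picked up as well.

```python
import re

# Hypothetical sample: one CDS line plus a gene line that should be ignored.
sample = "CDS join(1200..1401,3490..4302)\ngene complement(3300..4037)"

# Digits preceded by '(', '.', ',' or whitespace and followed by ',', '.' or ')'.
lookaround = re.compile(r'(?<=[(.,\s])(\d+)(?=[,.)])')

found = lookaround.findall(sample)
print(found)
# ['1200', '1401', '3490', '4302', '3300', '4037']
```

The last two values come from the gene line, so this pattern alone cannot tell CDS ranges apart from other ranges.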

The following re pattern might work:
>>> match = re.compile(r'\s+CDS\s+\w+\([^\)]*\)')
But you'll need to call findall on the whole text body, not just a line at a time.
You can use parentheses just to grab out the numbers:
>>> match = re.compile(r'\s+CDS\s+\w+\(([^\)]*)\)')
>>> match.findall(stuff)
['1200..1401,3490..4302']  # numbers only
Let me know if that achieves what you want!
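One possible way to go from the captured range text to the individual numbers, using only the re module, is a second findall on each captured block. This is a sketch under assumptions: the text variable is a hypothetical excerpt standing in for the file contents, and the two-step approach is an illustration rather than something the answer itself spells out.

```python
import re

# Hypothetical excerpt standing in for the GenBank file contents.
text = """
     gene            complement(3300..4037)
     CDS             join(1200..1401,1550..1613,1900..2010,2200..2250,
                     2300..2660,2800..2999,3100..3333)
"""

# Step 1: capture everything between the parentheses of each CDS entry.
# [^)]* also matches newlines, so multi-line entries are covered.
cds_block = re.compile(r'\s+CDS\s+\w+\(([^)]*)\)')

# Step 2: pull the individual numbers out of each captured block.
cds_numbers = [re.findall(r'\d+', block) for block in cds_block.findall(text)]
print(cds_numbers)
```

Because the first pattern requires the literal CDS keyword, the numbers on the gene line are never seen by the second step.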

Related

Extracting the inside of an expression using REGEX

I currently have this regular expression that I use to match the result of an SQL query: [^\\n]+(?=\\r\\n\\r\\n\(1 rows affected\)). However, it is not working as intended....
'\r\n----------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------\r\nCS: GPS
on Date.
\r\n\r\n(1 rows affected)\r\n'
What I get from the expression above is Date, whereas I want to match CS: GPS on Date. It's fine if there are leading and trailing spaces; that's nothing Python's strip() can't handle. How do I change my regular expression so that the match is done properly?
Thanks in advance.
Edit: The Python version I am using is Python 3.6
You get your current match because the character class [^\\n]+ matches 1+ times any char except \ or n.
Then the positive lookahead asserts what is on the right is \r\n\r\n(1 rows affected) which results in matching Date.
See https://regex101.com/r/wDzq8l/1
You could use a non greedy .+? in a capturing group and match what follows instead of using a positive lookahead.
In the code use re.DOTALL to let the dot match a newline.
-\\r\\n(.+?) ?\\r\\n\\r\\n\(\d+ rows affected\)
Regex demo
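A minimal sketch of that suggestion in Python follows. The raw string below is a hypothetical reconstruction of the query output (a single row of dashes is assumed), and the pattern is written as a raw string, so the backslashes are single rather than doubled.

```python
import re

# Hypothetical reconstruction of the query output from the question.
raw = '\r\n' + '-' * 40 + '\r\nCS: GPS on Date. \r\n\r\n(1 rows affected)\r\n'

# Lazy capture group instead of a lookahead; re.DOTALL lets '.' cross newlines.
pattern = re.compile(r'-\r\n(.+?) ?\r\n\r\n\(\d+ rows affected\)', re.DOTALL)

m = pattern.search(raw)
result = m.group(1) if m else None
print(result)
# CS: GPS on Date.
```

The optional space before the final line breaks keeps a trailing space out of the captured group.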
An expression along these lines might also work:
-{5,}\s*([A-Za-z][^.]+\.)
It captures the text that follows the dashes, up to and including the first period.
Demo
Test
import re
regex = r'-{5,}\s*([A-Za-z][^.]+\.)'
string = '''
----------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------
CS: GPS
on Date.
\r\n\r\n(1 rows affected)\r\n
'''
print(re.findall(regex, string, re.DOTALL))
Output
['CS: GPS\non Date.']
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also follow this link to see how it matches against some sample inputs.

Regular expressions: distinguish strings including/excluding a given word

I'm working in Python and trying to handle statsmodels' GLM output. I'm relatively new to regular expressions.
I have strings such as
string_1 = "C(State)[T.Kansas]"
string_2 = "C(State, Treatment('Alaska'))[T.Kansas]"
I wrote the following regex:
pattern = re.compile(r'C\((.+?)\)\[T\.(.+?)\]')
print(pattern.search(string_1).group(1))
#State
print(pattern.search(string_2).group(1))
#State, Treatment('Alaska')
So both of these strings match the pattern. But we want to get State in both cases. Basically, we want to get rid of everything after the comma (including the comma itself) inside the first pair of parentheses.
How can we distinguish the string_2 pattern from string_1's and extract only State without , Treatment?
You can add an optional non-capturing group instead of just allowing all characters:
pattern = re.compile(r'C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]')
(?:...) groups the contents together without capturing it. The trailing ? makes the group optional.
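A short check of the optional non-capturing group against the two strings from the question might look like this (the variable name groups is only illustrative):

```python
import re

pattern = re.compile(r'C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]')

string_1 = "C(State)[T.Kansas]"
string_2 = "C(State, Treatment('Alaska'))[T.Kansas]"

# The optional group absorbs ", Treatment('Alaska')" when it is present,
# so group 1 holds only the variable name in both cases.
groups = [pattern.search(s).groups() for s in (string_1, string_2)]
print(groups)
# [('State', 'Kansas'), ('State', 'Kansas')]
```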
You may use this regex using negative character classes:
C\((\w+)[^[]*\[T\.([^]]+)\]
RegEx Demo

Python regex number by looking behind

I am extracting numbers from strings in this format:
AB1234
AC1234
AD1234
As you can see, A is always there and the second character is anything except ". I wrote the code below to extract the number.
re.search(r'(?<=A[^"])\d*',input)
But I encountered an error.
look-behind requires fixed-width pattern
So is there any convenient way to extract the numbers? For now, the only way I know is to search twice. Thanks in advance.
Note that A stands for a pattern; in fact, A is a word in a long string.
The regex in your example works, so I'm guessing your actual pattern has variable width character matches (*, +, etc). Unfortunately, regex look behinds do not support those. What I can suggest as an alternative, is to use a capture group and extract the matching string -
m = re.search(r'A\D+(\d+)', s)
if m:
    r = m.group(1)
Details
A # your word
\D+ # anything that is not a digit
( # capture group
\d+ # 1 or more digits
)
If you want to take care of double quotes, you can make a slight modification to the regular expression by including a character class -
r'A[^\d"]+(\d+)'
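Wrapped in a small helper, the capture-group approach could be tried on the sample strings like this. The function name extract_number is hypothetical, and the literal A stands in for the real word:

```python
import re

def extract_number(s):
    """Return the digits that follow 'A' plus some non-digit, non-quote characters."""
    # 'A' stands in for the real word; [^\d"]+ skips characters that are
    # neither digits nor double quotes; (\d+) captures the number itself.
    m = re.search(r'A[^\d"]+(\d+)', s)
    return m.group(1) if m else None

numbers = [extract_number(s) for s in ('AB1234', 'AC1234', 'AD1234')]
print(numbers)
# ['1234', '1234', '1234']
```

A string such as A"1234 yields None, because the double quote is excluded by the character class.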
Try using this regex instead:
re.search(r'(?=A[^"]\d*)\d*',input)

Odd behavior on negative look behind in python

I am trying to do a re.split using a regex that is utilizing look-behinds. I want to split on newlines that aren't preceded by a \r. To complicate things, I also do NOT want to split on a \n if it's preceded by a certain substring: XYZ.
I can solve my problem by installing the regex module which lets me do variable width groups in my look behind. I'm trying to avoid installing anything, however.
My working regex looks like:
regex.split("(?<!(?:\r|XYZ))\n", s)
And an example string:
s = "DATA1\nDA\r\n \r\n \r\nTA2\nDA\r\nTA3\nDAXYZ\nTA4\nDATA5"
Which when split would look like:
['DATA1', 'DA\r\n \r\n \r\nTA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']
My closest non-working expression without the regex module:
re.split("(?<!(?:..\r|XYZ))\n", s)
But this split results in:
['DATA1', 'DA\r\n \r', ' \r', 'TA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']
And this I don't understand. From what I understand about look behinds, this last expression should work. Any idea how to accomplish this with the base re module?
You can use:
>>> re.split(r"(?<!\r)(?<!XYZ)\n", s)
['DATA1', 'DA\r\n \r\n \r\nTA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']
Here we have broken your lookbehind assertions into two assertions:
(?<!\r) # previous char is not \r
(?<!XYZ) # previous text is not XYZ
Python's re engine won't allow (?<!(?:\r|XYZ)) as a lookbehind because its alternatives have different widths; it raises this error:
error: look-behind requires fixed-width pattern
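Both points can be checked with the standard re module. In this sketch the flag name combined_ok is just illustrative: the combined lookbehind fails to compile, while the two separate lookbehinds split the sample string as intended.

```python
import re

s = "DATA1\nDA\r\n \r\n \r\nTA2\nDA\r\nTA3\nDAXYZ\nTA4\nDATA5"

# Alternatives of different widths inside one lookbehind are rejected by re.
try:
    re.compile(r"(?<!(?:\r|XYZ))\n")
    combined_ok = True
except re.error:
    combined_ok = False

# Two fixed-width lookbehinds, each checked independently at the same position.
parts = re.split(r"(?<!\r)(?<!XYZ)\n", s)

print(combined_ok)
print(parts)
```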
You could use re.findall
>>> s = "DATA1\nDA\r\n \r\n \r\nTA2\nDA\r\nTA3\nDAXYZ\nTA4\nDATA5"
>>> re.findall(r'(?:(?:XYZ|\r)\n|.)+', s)
['DATA1', 'DA\r\n \r\n \r\nTA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']
Explanation:
(?:(?:XYZ|\r)\n|.)+ first tries to match XYZ\n or \r\n at the current position. If the text there is neither of those, the alternation falls through to the . branch, which matches any character except a line break. The + after the non-capturing group repeats the whole alternation one or more times.

Python regex for line of digits and optional dash+digits. Why not matching?

I have been struggling with this for an embarrassingly long time so I have come here for help.
I want to match all strings that have a number followed by an optional dash followed by more numbers.
Example:
#Match
1
34-1
2-5-2
15-2-3-309-1
# Don't match
1--
--
#$#%^#$##
dafadf
10-asdf-1
-12-1-
I started with this regex (one or more digits, followed optionally by a dash and one and more digits):
\d+(-\d+)*
That didn't work. Then I tried parenthesizing around the \d:
(\d)+(-(\d)+)*
That didn't work either. Can anybody help me out?
You can use:
^(\d+(?:$|(?:-\d+)+))
See it work here.
Or, the Debuggex version of the same regex:
^(\d+(?:$|(?:-\d+)+))
Debuggex Demo
Perhaps even a better alternative since it is anchored on both ends:
^(\d+(?:-\d+)*)$
Debuggex Demo
Make sure that you use the right flags and re method:
import re
tgt='''
#Match
1
34-1
2-5-2
15-2-3-309-1
# Don't match
1--
--
#$#%^#$##
dafadf
10-asdf-1
-12-1-
'''
print(re.findall(r'^(\d+(?:-\d+)*)$', tgt, re.M))
# ['1', '34-1', '2-5-2', '15-2-3-309-1']
Here's a regex I constructed that covers all your positive test cases; the ruleset is python:
^(?=\d)([-\d]+)*(?<=\d)$
Debuggex Demo
Basically, there's a lookahead to make sure the string starts with a digit and a lookbehind to make sure it ends with a digit, and the capturing group in between consists strictly of digits and hyphens.
This should do it:
^((?:\d+(?:-|$))+)$
Working regex example:
http://regex101.com/r/sD0oL7
Your original regex seems to work fine for the example inputs you've given, with one caveat: you need to either use the line-begin (^) and line-end ($) anchors or use full-string matching instead of a search, which implicitly anchors your regex at both ends (i.e. re.fullmatch() vs. re.search() in Python; note that re.match() only anchors at the start).
The other examples all work fine, but the ^$ is what's really doing it.
Cheers.
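To illustrate the anchoring point, here is a small sketch that runs the original, unanchored pattern over the question's test strings, once with re.search and once with re.fullmatch. The list names are only illustrative.

```python
import re

candidates = ['1', '34-1', '2-5-2', '15-2-3-309-1',
              '1--', '--', '#$#%^#$##', 'dafadf', '10-asdf-1', '-12-1-']

# The original pattern from the question, without any anchors.
pattern = re.compile(r'\d+(-\d+)*')

# search() accepts any string that contains a matching substring.
loose = [c for c in candidates if pattern.search(c)]
# fullmatch() requires the whole string to match.
strict = [c for c in candidates if pattern.fullmatch(c)]

print(loose)
# ['1', '34-1', '2-5-2', '15-2-3-309-1', '1--', '10-asdf-1', '-12-1-']
print(strict)
# ['1', '34-1', '2-5-2', '15-2-3-309-1']
```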
