Not getting intended results with "either/or" character in python regex - python

I'm trying to match some fairly simple text but am having trouble with the "|" character. The text is:
"TF0876 some text Y N 2.31 - 0.01\n TF9788 more text N Y - 2.3 -\n TF1626"
and I want to extract two items using re.findall:
"TF0876 some text Y N 2.31" and
"TF9788 more text N Y -"
The code I thought would work is:
mat = re.compile(r"TF\d{4}.*?[Y|N] [Y|N] [-|\d\.\d*]",flags=re.DOTALL)
test2 = re.findall(mat,text)
print(test2)
However, this gives me the following list:
['TF0876 some text Y N 2', 'TF9788 more text N Y -']
For some reason, the first match stops at the "2" rather than at "2.31", which is what I want. If I type 2.31 literally instead of \d\.\d*, it still matches only up to the "2". In fact, whatever I type, I only seem to get one character from either side of the "|". I don't understand this; the regex HOWTO says that the expression Crow|Servo will match "Crow" or "Servo", but nothing smaller (such as "Cro"). In my case the opposite seems to be happening, so I clearly don't understand something and would be grateful for help.
Thanks.

The problem lies in your compiled pattern. Try changing it to:
mat = re.compile(r"TF\d{4}.*?[YN] [YN] [-\d\.]*",flags=re.DOTALL)
You do not need the "|" inside "[]". Square brackets already denote a set of characters, any one of which may match, so a "|" inside them is treated as a literal character.
A second option is to use groups, with "()" in place of "[]". Which one to use depends on what exactly you want to match; both work on your example text.

The problem is that you are using brackets [] instead of parentheses () to separate subgroups. Try this:
import re
text = "TF0876 some text Y N 2.31 - 0.01\n TF9788 more text N Y - 2.3 -\n TF1626"
mat = re.compile(r"TF\d{4}.*?(?:Y|N) (?:Y|N) (?:-|\d\.\d*)",flags=re.DOTALL)
test2 = re.findall(mat, text)
print(test2)
# ['TF0876 some text Y N 2.31', 'TF9788 more text N Y -']
Here the ?: bits are just so subgroups are not captured. Note that (?:Y|N) is basically the same as simply [YN].
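To see why the original pattern stopped at "2", it may help to compare a character class with a non-capturing group directly. The following is only an illustrative sketch; the short sample strings are made up for the demonstration and are not from the question.

```python
import re

# Inside [...], "|" is a literal character, so [Y|N] matches "Y", "|" or "N".
pipe_demo = re.findall(r"[Y|N]", "Y|N")
print(pipe_demo)            # ['Y', '|', 'N']

# A character class always consumes exactly one character,
# so [-|\d\.\d*] can never match the whole of "2.31".
one_char = re.match(r"[-|\d\.\d*]", "2.31")
print(one_char.group())     # 2

# A group with "|" chooses between whole sub-patterns instead.
whole = re.match(r"(?:-|\d\.\d*)", "2.31")
print(whole.group())        # 2.31
```

Note that \d\.\d* expects exactly one digit before the dot; \d+\.\d* would be needed for values such as 12.5.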

Related

Python Regex Search Returns Positive Non-Integer Number <1 As Empty String

I am using Python's re module to search a string; an example of a string of interest is "*MBps 2.57".
I am using the following code:
temp_string = re.search('MBps, \d*\.?\d*', line)
if (temp_string != None):
    temp_number = re.split(' ', temp_string.group(), 1)
I want to find instances where MBps is > 0, then take that number and process it.
The code works fine as long as the number after MBps is > 1. For example, if it's 'MBps 182.57', the RegEx object when converted to string shows 'MBps, 182.57'.
However, when the number after MBps is <1, for example, if it's 'MBps 0.31', then RegEx object returned shows 'MBps' but no number. It's just an empty string following the first match.
I have tried different regex matching strategies (re.match, re.findall), but none seemed to work correctly. In the regex101 testing site, it showed the regex expression working but I can't get Python regex module to match the behavior.
Any ideas on why it's happening and how to correct it?
Thanks
I would use re.findall here:
import re
inp = "The first speed is 3.14 MBps and the second is 5.43 MBps"
matches = re.findall(r'\b(\d+(?:\.\d+)?) MBps\b', inp)
print(matches)
This prints:
['3.14', '5.43']
OK, I found a way to make this work.
I changed the code to:
temp_string = re.search('MBps, [0-9\.]+', line)
if (temp_string != None):
    temp_number = re.split(' ', temp_string.group(), 1)
That worked to capture all the numbers. I think being explicit in Regex matching rather than just \d+ or \d* makes this work better.
Thanks
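Another way to approach this is to let a capture group return just the number, which avoids the re.split step. This is a minimal sketch that assumes the number follows "MBps" after whitespace, as in the example "*MBps 2.57"; if the real lines contain a comma after "MBps" (as the pattern in the question suggests), the pattern would need adjusting. The sample line is made up.

```python
import re

line = "*MBps 0.31"  # sample line modelled on the question (an assumption)

# \s+ tolerates any run of whitespace after "MBps"; the capture group
# holds only the number, so no re.split is needed afterwards.
match = re.search(r"MBps\s+(\d+(?:\.\d+)?)", line)
if match is not None:
    speed = float(match.group(1))
    if speed > 0:
        print(speed)  # 0.31
```

Requiring at least one digit with \d+ also means the pattern cannot match an empty number, which \d*\.?\d* allows.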

Extracting Numbers from Formatted Strings with Unusual Delimiters in Python

How can I get the numbers in a formatted string like the following in Python? It has a mixed combination of delimiters such as tab, parenthesis, cm, space, and #.
I used the following code but it does not split the numbers correctly.
s = "1.0000e+036 (1.2365e-004,6.3265e+003cm) (2.3659e-002, 2.3659e-002#)"
parts = re.split('\s|(?<!\d)[,.](?!\d)', s)
print(parts)
['1.0000e+036', '(1.2365e-004,6.3265e+003cm)', '(2.3659e-002,', '2.3659e-002#)']
I am trying to extract:
[1.0000e+036, 1.2365e-004, 6.3265e+003, 2.3659e-002, 2.3659e-002]
Could someone kindly help?
Update:
I tried the following regular expression, which fails to split the numbers with positive exponents:
s = "1.0000e+036 (1.2365e-004,6.3265e+003cm) (2.3659e-002, 2.3659e-002#)"
match_number = re.compile('-?\ *[0-9]+\.?[0-9]*(?:[Ee]\ *-?\ *[0-9]+)?')
final_list = [float(x) for x in re.findall(match_number, s)]
print(final_list)
[1.0, 36.0, 0.00012365, 6.3265, 3.0, 0.023659, 0.023659]
As can be seen, the first number is 1e36 which was parsed as two numbers 1.0 and 36.0.
You don't need to treat those items as delimiters. Rather, all you appear to need is a regex that extracts every float in the line (including exponential notation) and simply ignores the remaining characters. Comprehensive patterns for numeric literals are readily available online with a simple search.
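As a rough sketch of that idea: the attempt in the update accepts only "-" as the exponent sign, which appears to be why 1.0000e+036 was split into 1.0 and 36.0. The pattern below is one possible float pattern that allows either sign; it is not the only way to write it, and the name FLOAT_RE is just illustrative.

```python
import re

s = "1.0000e+036 (1.2365e-004,6.3265e+003cm) (2.3659e-002, 2.3659e-002#)"

# Optional sign, digits with an optional fraction (or a bare fraction),
# then an optional exponent whose sign may be "+" or "-".
FLOAT_RE = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")

# findall returns the matched text because the groups are non-capturing.
numbers = [float(x) for x in FLOAT_RE.findall(s)]
print(numbers)
```

Everything that is not part of a number (parentheses, commas, "cm", "#") is simply skipped, so no delimiter list is needed.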

Printing substrings' patterns from a string in Python

The input to this problem is a string with a specific form. For example, if s is the string, then the input can be s='3(a)2(b)' or s='3(aa)2(bbb)' or s='4(aaaa)'. The output should be a string in which each substring inside the brackets is repeated as many times as the number that precedes it.
For example,
Input ='3(a)2(b)'
Output='aaabb'
Input='4(aaa)'
Output='aaaaaaaaaaaa'
and similarly for other inputs. The program should print an empty string for wrong or invalid inputs.
This is what I've tried so far
s='3(aa)2(b)'
p=''
q=''
for i in range(0,len(s)):
    #print(s[i],end='')
    if s[i]=='(':
        k=int(s[i-1])
        while(s[i+1]!=')'):
            p+=(s[i+1])
            i+=1
    if s[i]==')':
        q+=k*p
print(q)
Can anyone tell what's wrong with my code?
A oneliner would be:
''.join(int(y[0])*y[1] for y in (x.split('(') for x in Input.split(')')[:-1]))
It works like this. We take the input, and split on the close paren
In [1]: Input ='3(a)2(b)'
In [2]: a = Input.split(')')[:-1]
In [3]: a
Out[3]: ['3(a', '2(b']
This gives us the integer, character pairs we're looking for, but we need to get rid of the open paren, so for each x in a, we split on the open paren to get a two-element list where the first element is the int (still as a string) and the second is the character. You'll see this in b
In [4]: b = [x.split('(') for x in a]
In [5]: b
Out[5]: [['3', 'a'], ['2', 'b']]
So for each element in b, we need to cast the first element as an integer with int() and multiply by the character.
In [6]: c = [int(y[0])*y[1] for y in b]
In [7]: c
Out[7]: ['aaa', 'bb']
Now we join on the empty string to combine them into one string with
In [8]: ''.join(c)
Out[8]: 'aaabb'
Try this:
import re
s = '3(a)2(b)'
a = re.findall(r'[\d]+', s)      # all the counts, as strings
b = re.findall(r'[a-zA-Z]+', s)  # all the letter runs
c = ''
for i, j in zip(a, b):
    c += int(i) * j
print(c)
Here is how you could do it:
Step 1: Simple case, getting the data out of a really simple template
Let's assume your template string is 3(a). That's the simplest case I could think of. We'll need to extract pieces of information from that string. The first one is the count of chars that will have to be rendered. The second is the char that has to be rendered.
This is a case where regular expressions are well suited (hence the use of the re module from Python's standard library).
I won't do a full course on regex; you'll have to do that on your own. However, I'll quickly explain the steps I used. So, count (the variable that holds the number of times we should render the char) is a digit (or several). Hence our first capturing group will be something like (\d+). Then we have a char to extract that is enclosed in parentheses, hence \((\w+)\) (I actually allow several chars to be rendered at once). So, if we put them together, we get (\d+)\((\w+)\).
Applied to our case, a straight forward use of the re module is:
import re
# Our template
template = '3(a)'
# Run the regex
match = re.search(r'(\d+)\((\w+)\)', template)
if match:
    # Get the count from the first capturing group
    count = int(match.group(1))
    # Get the string to render from the second capturing group
    string = match.group(2)
    # Print the string as many times as the count says
    print(count * string)
Output:
aaa
Yeah!
Step 2: Full case, with several templates
Okay, we know how to do it for 1 template, how to do the same for several, for instance 3(a)4(b)? Well... How would we do it "by hand"? We'd read the full template from left to right and apply each template one by one. Then this is what we'll do with python!
Luckily for us, the re module has a function just for that: finditer. It does exactly what we described above.
So, we'll do something like:
import re
# Our template
template = '3(a)4(b)'
# Iterate through found templates
for match in re.finditer(r'(\d+)\((\w+)\)', template):
    # Get the count from the first capturing group
    count = int(match.group(1))
    # Get the string to render from the second capturing group
    string = match.group(2)
    print(count * string)
Output:
aaa
bbbb
Okay... All that remains is to combine those pieces. We can collect the result of each step in a list and then join the items of that list at the end, no?
Let's do it!
import re
template = '3(a)4(b)'
parts = []
for match in re.finditer(r'(\d+)\((\w+)\)', template):
    parts.append(int(match.group(1)) * match.group(2))
print(''.join(parts))
Output:
aaabbbb
Yeah!
Step 3: Final step, optimization
Because we can always do better, we won't stop. for loops are cool. But what I love (it's personal) about python is that there is so much stuff you can actually just write with one line! Is it the case here? Well yes :).
First we can remove the for loop and the append using a list comprehension:
parts = [int(match.group(1)) * match.group(2) for match in re.finditer(r'(\d+)\((\w+)\)', template)]
rendered = ''.join(parts)
Finally, let's merge the two lines (building parts, then joining them) into a single expression:
import re
template = '3(a)4(b)'
rendered = ''.join(
    int(match.group(1)) * match.group(2)
    for match in re.finditer(r'(\d+)\((\w+)\)', template))
print(rendered)
Output:
aaabbbb
Yeah! Still the same output :).
Hope it helped!
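The question also asks for an empty string on wrong or invalid inputs, which the steps above do not cover. One possible way to add that, as a sketch only: validate the whole template with re.fullmatch before expanding it. The function name render is hypothetical, and treating \w+ as the allowed content of the brackets is an assumption (it also accepts digits and underscores).

```python
import re

# One "count(text)" unit; \d+ also allows multi-digit counts such as 12(ab).
TOKEN = r"(\d+)\((\w+)\)"

def render(template):
    """Expand e.g. '3(a)2(b)' to 'aaabb'; return '' for malformed input."""
    # fullmatch checks that the whole string is a sequence of valid units.
    if not re.fullmatch(r"(?:\d+\(\w+\))+", template):
        return ""
    # findall yields (count, text) tuples because TOKEN has two groups.
    return "".join(int(n) * text for n, text in re.findall(TOKEN, template))

print(render("3(a)2(b)"))   # aaabb
print(render("4(aaa)"))     # aaaaaaaaaaaa
print(render("3(a"))        # prints an empty line
```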
The value of 'p' should be refreshed after each iteration.
s='1(aaa)2(bb)'
p=''
q=''
i=0
while i<len(s):
    if s[i]=='(':
        k=int(s[i-1])
        p=''
        while(s[i+1]!=')'):
            p+=(s[i+1])
            i+=1
    if s[i]==')':
        q+=k*p
    i+=1
print(q)
The code was not behaving the way I wanted. The problem was the placement of 'p', the variable that accumulates the substring inside the parentheses. It was never reset, so it kept growing across groups. Initialising 'p' inside the 'if' block does the job.
s='2(aa)2(bb)'
q=''
for i in range(0,len(s)):
    if s[i]=='(':
        k=int(s[i-1])
        p=''
        while(s[i+1]!=')'):
            #print(i,'first time')
            p+=s[i+1]
            i+=1
        q+=p*k
        #print(i,'second time')
print(q)
What you want is not really to print substrings; the real purpose is most likely to generate text from a regular expression or from commands.
You can write a parameterized function to read the template, or use something like this:
The python library rstr has the function xeger() to do what you need by using random strings and only returning ones that match:
Example
Install with pip install rstr
In [1]: from __future__ import print_function
In [2]: import rstr
In [3]: for dummy in range(10):
...: print(rstr.xeger(r"(a|b)[cd]{2}\1"))
...:
acca
bddb
adda
bdcb
bccb
bcdb
adca
bccb
bccb
acda
Warning
For complex re patterns this might take a long time to generate any matches.

Python - how to substitute a substring using regex with n occurrences

I have a string with many occurrences of a single pattern, like
a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
and I have another string like
b = 'rerTTTytu'
I want to substitute the entire second string having as a reference the 'QQQ' and the 'TTT', and I want to find in this case 3 different results:
'ererTTTytuohnQQQjkhjhnmQQQlkj'
'eresQQQutnrerTTTytujhnmQQQlkj'
'eresQQQutnohnQQQjkhjrerTTTytu'
I've tried using re.sub
re.sub('\w{3}QQQ\w{3}' ,b,a)
but I obtain only the first one, and I don't know how to get the other two solutions.
Edit: As you requested, the two characters surrounding 'QQQ' will be replaced as well now.
I don't know if this is the most elegant or simplest solution for the problem, but it works:
import re
a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
b = 'rerTTTytu'
# Find the start of every ??QQQ?? in a, where ? is any non-whitespace character
matches = [x.start() for x in re.finditer(r'\S{2}QQQ\S{2}', a)]
# For each start index, replace only the ??QQQ?? found there with b
results = [a[:idx] + re.sub(r'\S{2}QQQ\S{2}', b, a[idx:], count=1) for idx in matches]
print(results)
Output
['errerTTTytunohnQQQjkhjhnmQQQlkj',
'eresQQQutnorerTTTytuhjhnmQQQlkj',
'eresQQQutnohnQQQjkhjhrerTTTytuj']
Since you didn't specify the output format, I just put it in a list.
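A variation on the same idea, offered as a sketch rather than a definitive answer: each match object already knows where it starts and ends, so b can be spliced in by slicing, without calling re.sub a second time. This version uses the \w{3}QQQ\w{3} pattern from the question, which reproduces the three results the question asks for.

```python
import re

a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
b = 'rerTTTytu'

# The pattern from the question: three word characters on each side of QQQ.
pattern = re.compile(r'\w{3}QQQ\w{3}')

# For every match, splice b in between the text before and after that match.
results = [a[:m.start()] + b + a[m.end():] for m in pattern.finditer(a)]
print(results)
```

Note that finditer only returns non-overlapping matches, so if two QQQ occurrences sit closer together than the surrounding context requires, one of them would be skipped.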

regular expression replace (if pattern found replace symbol for symbol)

I have several lines of text (RNA sequences), and I want to build a matrix of character conservation, because the lines are aligned by similarity.
But I have several gaps (-), and long runs of them (e.g. more than 100 '-' in a row) actually mean that a whole structure is missing. When that happens, I want to replace the run with the same number of dots (a different symbol, to make the distinction).
I thought I can do this with regular expression, but I am not able to replace only the pattern, or when I do so, I replace everything but with the incorrect number of dots.
My code looks like this:
with alnfile as f_in:
    if re.search('-{100,}', elem,):
        elem = re.sub('-{100,}','.', elem, ) #failed alternative*len(m.groups(x)), elem)
    print(len(elem)) # check if I am keeping the length of my sequence
    print(elem[0:100]) # check the start
    f1.write(elem)
if my file is:
ONE ----(*100)atgtgca----(*20)
I am getting:
ONE ..(*100)atgtgca----(*20)
With my other attempt, every gap became dots, and I get:
ONE ....(*100)atgtgca....(*20)
WHAT I NEED:
ONE ....(*100)atgtgca----(*20)
I know that I am missing something, but I cannot figure it out. Is there a flag or something similar that would allow exactly this change?
You could try the following:
import re
data = "ONE " + "-" * 100 + "atgtgca" + "-" * 20
print(re.sub(r'-{100,}', lambda x: '.' * len(x.group(0)), data))
This would display:
ONE ....................................................................................................atgtgca--------------------
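If the same replacement has to be applied to many lines, it could be wrapped in a small helper. This is only a sketch: the function name mask_long_gaps and the min_len parameter are made up for illustration, and the length check mirrors the one in the question.

```python
import re

def mask_long_gaps(seq, min_len=100):
    """Replace each run of at least min_len '-' with the same number of '.'."""
    pattern = re.compile(r"-{%d,}" % min_len)
    # The callable receives each match and returns a same-length replacement.
    return pattern.sub(lambda m: "." * len(m.group(0)), seq)

data = "ONE " + "-" * 100 + "atgtgca" + "-" * 20
masked = mask_long_gaps(data)
print(len(masked) == len(data))  # True, the sequence length is preserved
```

Because the replacement is computed per match, shorter gaps (such as the trailing 20 dashes) are left untouched.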
