Why does this regex not match the second binary gap?

Trying a solution for the problem listed here in python, I thought I'd try a nice little regex to capture the maximum "binary gap" (chains of zeroes in the binary representation of a number).
The function I wrote for the problem is below:
import re

def solution(N):
    max_gap = 0
    binary_N = format(N, 'b')
    matches = re.findall(r'1(0+)1', binary_N)
    for element in matches:
        if len(element) > max_gap:
            max_gap = len(element)
    return max_gap
And it works pretty well. However... for some reason, it does not match the second set of zeroes in 10000010000000001 (binary representation of 66561). The 9 zeroes don't appear in the list of matches so it must be a problem with the regex - but I can't see where it is as it matches every other example given!

The same bit can't be included in two matches. Your regex matches a 1 followed by one or more 0s and ends with another 1. Once the first match has been found you are left with 0000000001 which doesn't start with a 1 so isn't matched by your regex.
As mentioned by @JoachimIsaksson, if you want to match both sets of 0s, you can use a lookahead so that the final 1 is checked but isn't consumed by the match: r'1(0+)(?=1)'.
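A minimal sketch of the question's function using that lookahead. The name max_binary_gap is only illustrative and not from the original post; the logic is otherwise the same as the asker's solution.

```python
import re

def max_binary_gap(n):
    """Length of the longest run of zeroes enclosed by ones in n's binary form."""
    binary = format(n, "b")
    # The lookahead (?=1) checks for the closing 1 without consuming it,
    # so that same 1 can open the next gap.
    gaps = re.findall(r"1(0+)(?=1)", binary)
    return max((len(gap) for gap in gaps), default=0)

print(max_binary_gap(66561))  # 10000010000000001 -> 9
```

With the original pattern r'1(0+)1', the same input yields only the first gap of five zeroes, because the middle 1 is consumed by the first match.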

Related

Finding matching motifs on sequence and their positions

I am trying to find matching motifs in a sequence, along with the position of each motif, and then output that to a FASTA file. The code below shows that the motif [L**L*L] is present in the sequence: when I run it, it returns "YES", but I do not know where the motif is positioned.
The ** inside the square brackets shows that any amino acid is permitted there.
This is the code I used to check whether the motif is present in the sequence, and it worked because it returned "YES".
peptide1= "MKFSNEVVHKSMNITEDCSALTGALLKYSTDKSNMNFETLYRDAAVESPQHEVSNESGSTLKEHDYFGLSEVSSSNSSSGKQPEKCCREELNLNESATTLQLGPPAAVKPSGHADGADAHDEGAGPENPAKRPAHHMQQESLADGRKAAAEMGSFKIQRKNILEEFRAMKAQAHMTKSPKPVHTMQHNMHASFSGAQMAFGGAKNNGVKRVFSEAVGGNHIAASGVGVGVREGNDDVSRCEEMNGTEQLDLKVHLPKGMGMARMAPVSGGQNGSAWRNLSFDNMQGPLNPFFRKSLVSKMPVPDGGDSSANASNDCANRKGMVASPSVQPPPAQNQTVGWPPVKNFNKMNTPAPPASTPARACPSVQRKGASTSSSGNLVKIYMDGVPFGRKVDLKTNDSYDKLYSMLEDMFQQYISGQYCGGRSSSSGESHWVASSRKLNFLEGSEYVLIYEDHEGDSMLVGDVPWELFVNAVKRLRIMKGSEQVNLAPKNADPTKVQVAVG"
if re.search(r"L*L*L", peptide1):
    print("YES")
else:
    print("NO")
The code that I wrote to find the position is below, but when I run it, it says invalid syntax. Could you please assist? I have no clue whether I am on the right track, as I am still new to the field and to Python.
for position in range(len(s)):
    if peptide[position:].startswith(r"L*L*L"):
        print(position+1)
I am expecting to see the positions of these motifs identified; for example, the output should state that the motif is found at positions [2, 10] or whatever the actual numbers are. These are just random positions that I chose, since I don't know where the motif is located.

You can use re.finditer() to search for multiple regex pattern matches within a string. Your peptide1 example does not contain an "L*L*L" motif, so I designated a random simple string as a demo.
import re
simple_demo_string = "ABCLXLYLZLABC" # use a simple string to demonstrate code
The demo string contains two overlapping motifs. Normally, regex matches do not account for overlap.
Example 1
simple_regex = "L.L.L" # in regex, periods are match-any wildcards
for x in re.finditer(simple_regex, simple_demo_string):
    print( x.start(), x.end(), x.group() )
# Output: 3 8 LXLYL
However, if you use a capturing group inside a lookahead, you'll be able to get everything even if there's overlap.
Example 2
lookahead_regex = "(?=(L.L.L))"
for x in re.finditer(lookahead_regex, simple_demo_string):
    # note - x.end() becomes same as x.start() due to lookahead
    # but can be corrected by simply adding length of match
    print( x.start(), x.start()+len(x.group(1)), x.group(1) )
# Output: 3 8 LXLYL
#         5 10 LYLZL
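The lookahead approach can be wrapped in a small helper that reports 1-based positions, as the question asked for. This is only a sketch: find_motif_positions is a hypothetical name, and the motif is passed in as a regex where "." stands for any residue. Note that in a regex "*" is a quantifier rather than a wildcard, so the question's pattern L*L*L matches any single L, which is why it printed "YES".

```python
import re

def find_motif_positions(sequence, motif):
    """Return (1-based start, matched text) for every occurrence, overlaps included."""
    # Wrapping the motif in a lookahead stops the regex engine from
    # consuming characters, so overlapping occurrences are all reported.
    lookahead = re.compile(f"(?=({motif}))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]

print(find_motif_positions("ABCLXLYLZLABC", "L.L.L"))
# [(4, 'LXLYL'), (6, 'LYLZL')]
```

For the motif written as [L**L*L] in the question, the corresponding regex would presumably be "L..L.L".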

Replace a substring with defined region and follow up variable region in Python

I have a seemingly simple problem that for the life of me is just outside my reach of understanding. What I mean by that is that I can come up with many complex ways to attempt this, but there must be an easy way.
What I am trying to do is find and replace a substring in a string, but the catch is that it is based on a mix of a defined region and then variable regions based on length.
Here is an example:
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC' and I want to replace AATCGATCGTA with <span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>
So in this example the first part will always be constant AATCGA and will be used to locate the region to replace. This is then followed by a "spacer", in this case a single character but could be more than one and needs to be specified, and finally the last bit that will follow the "tail", in this case four characters, but could also be more or less. A set-up in this case would be:
to_find = 'AATCGA'
spacer = 'T' #Variable based on number and not on the character
tail = 'CGTA' #Variable based on number and not on the character
With this information I need to do something like:
new_seq = sequence.replace(f'{to_find}{len(spacer)}{len(tail)}', f'<span color="blue">{to_find}</span><span>{spacer}</span><span color="green">{tail}</span>')
print(new_seq)
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
But the spacer could be 3 characters from the end of to_find and it may vary, the same with the tail section. Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start.
Any help would be much appreciated!

I'm not quite sure I understand you fully. Nevertheless, you don't seem to be too far off. Just use regex.
import re
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC'
expected_new_seq = '<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC'
to_find = 'AATCGA'
spacer = 'T' # Variable based on number and not on the character
tail = 'CGTA' # Variable based on number and not on the character
# In this case, the pattern is (AATCGA)(.{1})(.{4})
# It matches "AATCGA" that is followed by 1 character and then 4 characters.
# AATCGA is captured in group 1, then the next unknown character is captured
# in group 2, and the next 4 unknown characters are captured in group 3
# (the brackets create capturing groups).
pattern = f'({to_find})(.{{{len(spacer)}}})(.{{{len(tail)}}})'
# \1 refers to capture group 1 (to_find), \2 refers to capture group 2 (spacer),
# and \3 refers to capture group 3 (tail).
# The replacement does not need to be an f-string, but making it a raw string
# means we don't need to escape the backslashes.
repl = r'<span color="blue">\1</span><span>\2</span><span color="green">\3</span>'
new_seq = re.sub(pattern, repl, sequence)
print(new_seq)
print(new_seq == expected_new_seq)
Output:
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
True
Have a play around with it here (also includes interactive explanation): https://regex101.com/r/2mshrI/1
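Since the spacer and tail are defined by length only, the same idea can be sketched as a function that takes the lengths directly. The name highlight_region and its parameters are illustrative assumptions, not part of the original answer; re.escape is added so that to_find is always treated as literal text.

```python
import re

def highlight_region(sequence, to_find, spacer_len, tail_len):
    """Wrap to_find, the next spacer_len chars and the next tail_len chars in spans."""
    # Doubled braces in an f-string produce literal braces, so with
    # spacer_len=1 and tail_len=4 the pattern is (AATCGA)(.{1})(.{4}).
    pattern = f"({re.escape(to_find)})(.{{{spacer_len}}})(.{{{tail_len}}})"
    repl = (r'<span color="blue">\1</span>'
            r'<span>\2</span>'
            r'<span color="green">\3</span>')
    return re.sub(pattern, repl, sequence)

sequence = "AATCGATCGTATATCTGCGTAGACTCTGTGCATGC"
print(highlight_region(sequence, "AATCGA", 1, 4))
```

Changing spacer_len and tail_len moves the boundaries without touching the pattern itself.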
"Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start."
How do you know when to replace it in reverse instead of forward? After all, all you're doing is matching a short string followed or preceded by n characters. I imagine you'd get matches in both directions, so which replacement do you carry out? Please provide more examples: longer input with expected output.

Extract only percentage information from text in python using regex

I'm trying to extract only valid percentage information and eliminate any incorrect representation from a string using regular expression in python. The function should work like this,
For,
0-100% = TRUE
0.12% = TRUE
23.1245467% = TRUE
9999% = FALSE
8937.2435% = FALSE
7.% = FALSE
I have checked a few solutions in stack overflow which only extract 0-100%. I have tried the following solutions,
('(\s100|[123456789][0-9]|[0-9])(\.\d+)+%')
'(\s100|\s\d{1,2})(\.\d+)+%'
'(\s100|\s\d[0-99])(\.\d+)+%'
All of these work for every other case except 0-99% (gives FALSE) and 12411.23526% (gives TRUE). The reason for the space is that I want to extract only two-digit numbers.

Figured it out. The problem lay in the '+' in the expression '(\.\d+)+', which should have been '(\.\d+)*'. The first expression requires a decimal part on every percentage value, whereas the second makes it optional. My final version is given below.
'\s(100|(\d{1,2}(\.\d+)*))%'
You can replace \s with ^ for percentage values at the beginning of a sentence. Also, the versions in my question accepted decimal values for 100, which is an invalid percentage value.

I would not rely on regex alone - it is not meant to filter ranges in the first place.
Better look for candidates in your string and analyze them programmatically afterwards, like so:
import re
string = """
some gibberish in here 0-100% = TRUE
some gibberish in here 0.12% = TRUE
some gibberish in here 23.1245467% = TRUE
some gibberish in here 9999% = FALSE
some gibberish in here 8937.2435% = FALSE
some gibberish in here 7.% = FALSE
"""
numbers = []
# look for -, a digit, a dot ending with a digit and a percentage sign
rx = r'[-\d.]+\d%'
# loop over the results
for match in re.finditer(rx, string):
    interval = match.group(0).split('-')
    for number in interval:
        if 0 <= float(number.strip('%')) <= 100:
            numbers.append(number)
print(numbers)
# ['0', '100%', '0.12%', '23.1245467%']

Considering all the possibilities, the following regex works.
If you ignore the ?: (which just marks a non-capturing group), the regex is not that intimidating.
Regex: ^(?:(?:\d{1,2}(?:\.\d+)?\-)?(?:(?:\d{1,2}(?:\.\d+)?)|100))%$
Explanation:
(?:\d{1,2}(?:\.\d+)?\-)? matches the lower limit, if there is one (as in 0-100%), with an optional decimal part.
(?:(?:\d{1,2}(?:\.\d+)?)|100) matches the upper limit, or the single number when there is no range: either one or two digits with an optional decimal part, or exactly 100.
Another version of the same regex, for matching such occurrences within a longer string, removes the anchors ^ and $ and instead checks for a non-digit (or the start of the string) before the match.
Regex: (?<=\D|^)(?:(?:\d{1,2}(?:\.\d+)?\-)?(?:(?:\d{1,2}(?:\.\d+)?)|100))%
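A rough Python sketch of both variants. The names are illustrative, not from the answer. One adaptation is assumed here: Python's built-in re module requires lookbehinds to have a fixed width, so (?<=\D|^) is likely to be rejected there; the negative lookbehind (?<!\d) expresses the same condition (not preceded by a digit) and is used instead.

```python
import re

# Same body as the answer's regex; fullmatch() takes the place of ^ and $.
PERCENT = r"(?:\d{1,2}(?:\.\d+)?-)?(?:\d{1,2}(?:\.\d+)?|100)%"
PERCENT_WHOLE = re.compile(PERCENT)
# (?<!\d) stands in for (?<=\D|^): the match must not be preceded by a digit.
PERCENT_IN_TEXT = re.compile(r"(?<!\d)" + PERCENT)

def is_valid_percentage(s):
    """True if the whole string is a percentage (or range) within 0-100%."""
    return PERCENT_WHOLE.fullmatch(s) is not None

for sample in ["0-100%", "0.12%", "23.1245467%", "9999%", "8937.2435%", "7.%"]:
    print(sample, is_valid_percentage(sample))

text = "rates: 0-100%, 0.12%, 23.1245467%, 9999%, 7.%"
print(PERCENT_IN_TEXT.findall(text))
# ['0-100%', '0.12%', '23.1245467%']
```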

Regular expression match numbers in Python

I have a list of numbers from the interval (0;1]. For example:
0.235
0.4
1.00
0.533
1
I need to append some new numbers to the list. To check correctness of new numbers, I need to write regex.
First I wrote a simple regex, [0|1\.]{2}\d+, but it ignores one condition: if the integer part is 1, the fractional part may contain only zeros (zero or more of them).
So I tried to use lookahead assertions to emulate an if-else condition: (?([0\.]{2})\d+|[0]+), but it isn't working. Where is my mistake? How can I check that none of the numbers is greater than 1?

Better than regex is to try to convert the string to a float and check whether it is in the range:
def convert(s):
    f = float(s)
    if not 0. < f <= 1.:
        raise ValueError()
    return f
This function returns a float in the interval (0, 1], or raises a ValueError if the string is not a valid float or the value falls outside that interval.

So explaining my comment from above:
The Regex you Want should be:
"1 maybe followed by only 0's" OR "0 followed by a dot then some more numbers, which aren't all zeroes"
Breaking it down like this makes it easier to write.
For the first part "1 maybe followed by only 0's":
^1(\.0+)?$
This is fairly straightforward: "1" followed by (\.0+) zero or one times, where (\.0+) is "." followed by one or more "0"s.
And for the second part
^0\.(?!0+$)\d+$
This is a bit trickier. It is "0." followed by a lookahead "(?!0+$)". What this means is that if "0+$" (= "0" one or more times before the end of the string) is found it won't match. After that check you have "\d+$", which is digits, one or more times.
Combining these with an or you get:
^1(\.0+)?$|^0\.(?!0+$)\d+$
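A short sketch of how the combined pattern could be used to validate new entries. The function name is_in_unit_interval is an assumption for illustration. re.fullmatch is used so that a trailing newline is not silently accepted by $.

```python
import re

# The combined pattern from above, unchanged.
UNIT_INTERVAL = re.compile(r"^1(\.0+)?$|^0\.(?!0+$)\d+$")

def is_in_unit_interval(s):
    """True if s is a decimal string for a number in the interval (0, 1]."""
    return UNIT_INTERVAL.fullmatch(s) is not None

for candidate in ["0.235", "0.4", "1.00", "0.533", "1", "0.000", "1.01", "2"]:
    print(candidate, is_in_unit_interval(candidate))
```

The first five candidates are the examples from the question and are accepted; "0.000", "1.01" and "2" are rejected.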

Finding the recurring pattern

Let's say I have a number with a recurring pattern, i.e. there exists a string of digits that repeat themselves in order to make the number in question. For example, such a number might be 1234123412341234, created by repeating the digits 1234.
What I would like to do, is find the pattern that repeats itself to create the number. Therefore, given 1234123412341234, I would like to compute 1234 (and maybe 4, to indicate that 1234 is repeated 4 times to create 1234123412341234)
I know that I could do this:
def findPattern(num):
    num = str(num)
    for i in range(1, len(num)):
        patt = num[:i]
        if len(num) % len(patt):
            continue
        if patt * (len(num) // len(patt)) == num:
            return patt, len(num) // len(patt)
However, this seems a little too hacky. I figured I could use itertools.cycle to compare two cycles for equality, which doesn't really pan out:
In [25]: c1 = itertools.cycle(list(range(4)))
In [26]: c2 = itertools.cycle(list(range(4)))
In [27]: c1==c2
Out[27]: False
Is there a better way to compute this? (I'd be open to a regex, but I have no idea how to apply it there, which is why I didn't include it in my attempts)
EDIT:
I don't necessarily know that the number has a repeating pattern, so I have to return None if there isn't one.
Right now, I'm only concerned with detecting numbers/strings that are made up entirely of a repeating pattern. However, later on, I'll likely also be interested in finding patterns that start after a few characters:
magic_function(78961234123412341234)
would return 1234 as the pattern, 4 as the number of times it is repeated, and 4 as the first index in the input where the pattern first presents itself

(.+?)\1+
Try this and grab the capture group.
import re
p = re.compile(r'(.+?)\1+')
test_str = u"1234123412341234"
print(re.findall(p, test_str))  # ['1234']
Add anchors and flag Multiline if you want the regex to fail on 12341234123123, which should return None.
^(.+?)\1+$
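A minimal sketch of the anchored version, returning the repeating unit and the repeat count, or None when the input is not made up entirely of repeats. The name find_pattern is illustrative; re.fullmatch plays the role of the ^ and $ anchors.

```python
import re

def find_pattern(num):
    """Return (unit, repeats) if str(num) is one unit repeated 2+ times, else None."""
    s = str(num)
    # The lazy (.+?) tries the shortest candidate unit first; \1+ must then
    # cover the rest of the string with further copies of that unit.
    m = re.fullmatch(r"(.+?)\1+", s)
    if m is None:
        return None
    unit = m.group(1)
    return unit, len(s) // len(unit)

print(find_pattern(1234123412341234))  # ('1234', 4)
print(find_pattern(12341234123123))    # None
```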

One way to find a recurring pattern and number of times repeated is to use this pattern:
(.+?)(?=\1+$|$)
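with the g (global) option, i.e. collecting all matches, as re.findall does in Python.
It will return the repeated pattern once per repetition, so the number of matches is the number of times it is repeated.
A non-repeated input (a failure) will return only one match.
A repeated pattern will return two or more matches (the number of times it is repeated).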
