New to regex -- unexpected results in for loop - python

I'm not sure if this is a problem in my understanding of regex modules, or a silly mistake I'm making in my for loop.
I have a list of numbers that look like this:
4; 94
3; 92
1; 53
etc.
I made a regex pattern to match just the last two digits of the string:
'^.*\s([0-9]+)$'
This works when I take each element of the list one at a time.
However, when I try to do it in a for loop:
for i in xData:
    if re.findall('^.*\s([0-9]+)$', i):
        print i
The output is simply the entire string instead of just the last two digits.
I'm sure I'm missing something very simple here but if someone could point me in the right direction that would be great. Thanks.

You are printing the whole string, i. If you wanted to print the output of re.findall(), then store the result and print that result:
for i in xData:
    results = re.findall('^.*\s([0-9]+)$', i)
    if results:
        print results
I don't think that re.findall() is the right method here, since your lines contain just the one set of digits. Use re.search() to get a match object, and if the match object is not None, take the first group data:
for i in xData:
    match = re.search('^.*\s([0-9]+)$', i)
    if match:
        print match.group(1)
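As a minimal sketch of the re.search() approach, assuming the three sample values from the question (the helper name trailing_numbers is made up for illustration and is not part of the original code):

```python
import re

# Sample data copied from the question.
x_data = ["4; 94", "3; 92", "1; 53"]

# Compile once; group 1 is the run of digits at the end of the line.
pattern = re.compile(r'^.*\s([0-9]+)$')

def trailing_numbers(lines):
    """Return the captured trailing digits for every line that matches."""
    found = []
    for line in lines:
        match = pattern.search(line)
        if match:  # search() returns None when nothing matches
            found.append(match.group(1))
    return found

print(trailing_numbers(x_data))
```

Printing match.group(1) rather than the loop variable is the only change that matters here; the pattern itself is the one from the question.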

I might be missing something here, but if all you're looking to do is get the last 2 characters, could you use the below?
for i in xData:
    print(i[-2:])
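The slice only works while the number is exactly two characters long. If that is not guaranteed, one regex-free alternative is to split once on whitespace from the right. This is a sketch that assumes every line is non-empty and ends with the number; last_field is a hypothetical helper name:

```python
def last_field(line):
    # rsplit(None, 1) splits at most once, from the right,
    # on any run of whitespace, so the last piece is the number.
    return line.rsplit(None, 1)[-1]

for line in ["4; 94", "3; 92", "7; 105"]:
    print(last_field(line))
```

For "7; 105" the slice line[-2:] would give "05", while the split keeps all three digits.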

Related

Why doesn't replace() change all occurrences?

I have the following code:
dna = "TGCGAGAAGGGGCGATCATGGAGATCTACTATCCTCTCGGGGTATGGTGGGGTTGAGA"
print(dna.count("GAGA"))
dna = dna.replace("GAGA", "AGAG")
print(dna.count("GAGA"))
Replace does not replace all occurrences. Could somebody help me understand why this happens?
It replaces all occurrences. That might lead to new occurrences (look at your replacement string!).
I'd say, logically, all is fine.
You could repeat this replace while dna.count("GAGA") > 0, but that doesn't sound like what you should be doing. (I bet you really just want to do one round of replacement to simulate something specific happening. Not a genetics expert at all, though.)
It did make all replacements (that's what .replace() does in Python unless specified otherwise), but some of these replacements inadvertently introduced new instances of GAGA. Take the beginning of your string:
TGCGAGAA
There's GAGA at indices 3-6. If you replace that with AGAG, you get
TGCAGAGA
So the trailing GAG of that AGAG, together with the subsequent A that was already there before, forms a new GAGA.
Replacements do not occur "until exhausted"; str.replace() makes a single left-to-right pass over the original string and does not rescan the text it has just inserted.
Consider the following from your string:
>>> a = "TGCGAGAA"
>>> a.replace("GAGA", "AGAG")
'TGCAGAGA'
>>>
The replacement does not happen again, since the original string did not match GAGA in that location.
If you want to do the replacement until no match is found, you can wrap it in a loop:
>>> while a.count("GAGA") > 0:  # you probably don't want to use count here if the string is long because of performance considerations
...     a = a.replace("GAGA", "AGAG")
...
>>> a
'TGCAAGAG'
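The loop above can be wrapped in a small function. This is only a sketch: replace_until_gone and its max_rounds guard are my own additions (the guard is there because, for an arbitrary pair of strings, repeated replacement is not guaranteed to finish), and it uses the `in` operator instead of count() since only presence matters:

```python
def replace_until_gone(text, old, new, max_rounds=1000):
    """Apply text.replace(old, new) repeatedly until old no longer occurs."""
    rounds = 0
    while old in text:  # a membership test is enough; no need to count
        if rounds >= max_rounds:
            raise RuntimeError("replacement did not settle")
        text = text.replace(old, new)
        rounds += 1
    return text

single_pass = "TGCGAGAA".replace("GAGA", "AGAG")
settled = replace_until_gone("TGCGAGAA", "GAGA", "AGAG")
print(single_pass)
print(settled)
```

The first line printed still contains GAGA (created by the replacement itself); the second does not.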

Test multiple substrings against a string

If I have a list of strings:
matches = [ 'string1', 'anotherstring', 'astringystring' ]
And I have another string that I want to test:
teststring = 'thestring1'
And I want to test each string, and if any match, do something. I have:
match = 0
for matchstring in matches:
    if matchstring in teststring:
        match = 1
if not match:
    continue
This is in a loop, so we just go around again if we don't get a match (I can reverse this logic of course and do something if it matches), but the code looks clumsy and not pythonic, if easy to follow.
I am thinking there is a better way to do this, but I don't grok python as well as I would like. Is there a better approach?
Note the "duplicate" is the opposite question (though the approach in the answers is the same).
You could use any here
Code:
if any(matchstring in teststring for matchstring in matches):
    print "Matched"
Notes:
In the generator expression, for matchstring in matches iterates over each string in matches.
The test matchstring in teststring checks whether that string occurs in the string being tested.
any returns True as soon as the expression yields a true value, without testing the remaining strings.
If you want to know what the first match was you can use next:
match = next((match for match in matches if match in teststring), None)
You have to pass None as the second parameter if you don't want it to raise an exception when nothing matches. It will use the value as the default, so match will be None if nothing is found.
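Putting any and next side by side on the data from the question (a small sketch; the variable names found_any, first_match and no_match are mine):

```python
matches = ['string1', 'anotherstring', 'astringystring']
teststring = 'thestring1'

# True as soon as one candidate is a substring of teststring.
found_any = any(m in teststring for m in matches)

# The first candidate that is a substring, or None if there is none.
first_match = next((m for m in matches if m in teststring), None)

# Same call against a string that contains none of the candidates.
no_match = next((m for m in matches if m in 'unrelated'), None)

print(found_any)
print(first_match)
print(no_match)
```

Without the None default, the last call would raise StopIteration instead of returning None.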
How about you try this:
len([ x for x in b if ((a in x) or (x in a)) ]) > 0
I've updated the answer to check the substring both ways. You can pick and choose or modify as you see fit but I think the basics should be pretty clear.

String splitting in python by finding non-zero character

I want to do the following split:
input: 0x0000007c9226fc output: 7c9226fc
input: 0x000000007c90e8ab output: 7c90e8ab
input: 0x000000007c9220fc output: 7c9220fc
I use the following line of code to do this but it does not work!
split = element.rpartition('0')
I got these outputs which are wrong!
input: 0x000000007c90e8ab output: e8ab
input: 0x000000007c9220fc output: fc
what is the fastest way to do this kind of split?
The only idea for me right now is to make a loop and perform checking but it is a little time consuming.
I should mention that the number of zeros in input is not fixed.
Each string can be converted to an integer using int() with a base of 16. Then convert back to a string.
for s in '0x000000007c9226fc', '0x000000007c90e8ab', '0x000000007c9220fc':
    print '%x' % int(s, 16)
Output
7c9226fc
7c90e8ab
7c9220fc
input[2:].lstrip('0')
That should do it. The [2:] skips over the leading 0x (which I assume is always there), then the lstrip('0') removes all the zeros from the left side.
In fact, we can use lstrip's ability to remove more than one leading character to simplify:
input.lstrip('x0')
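One thing to keep in mind is that lstrip('x0') treats its argument as a set of characters, not as a prefix, so it behaves differently from the int() route when the value is zero. A short comparison sketch (via_int and via_lstrip are hypothetical helper names):

```python
def via_int(s):
    # Parse as a base-16 integer, then format back without prefix or padding.
    return format(int(s, 16), 'x')

def via_lstrip(s):
    # Strip every leading character that is either 'x' or '0'.
    return s.lstrip('x0')

for s in ['0x000000007c9226fc', '0x0000']:
    print(repr(via_int(s)), repr(via_lstrip(s)))
```

Both give 7c9226fc for the first input; for an all-zero value the integer route gives '0' while the strip route leaves an empty string.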
format is handy for this:
>>> print '{:x}'.format(0x000000007c90e8ab)
7c90e8ab
>>> print '{:x}'.format(0x000000007c9220fc)
7c9220fc
In this particular case you can just do
your_input[10:]
You'll most likely want to properly parse this; your idea of splitting on separation of non-zero does not seem safe at all.
Seems to be the XY problem.
If the number of characters in a string is constant then you can use the following code.
input = "0x000000007c9226fc"
output = input[10:]
Also, you are using rpartition, which the documentation defines as
str.rpartition(sep)
Split the string at the last occurrence of sep, and return a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing two empty strings, followed by the string itself.
Since your input can contain multiple 0s and rpartition splits only at the last occurrence, your code cuts the string at the final 0 instead of just after the leading zeros.
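To see this concretely, here is what rpartition returns for one of the inputs from the question (a small sketch; the tuple is unpacked into three names of my choosing):

```python
s = '0x000000007c90e8ab'

# rpartition splits at the LAST '0' in the string.
before, sep, after = s.rpartition('0')

print(before)
print(sep)
print(after)
```

The last 0 sits in the middle of 7c90e8ab, so the part after it is only e8ab, which is the wrong output reported in the question.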
A regular expression for 0x followed by one or more zeros is (0x[0]+); replace whatever it matches with an empty string.
import re
st="0x000007c922433434000fc"
reg='(0x[0]+)'
rep=re.sub(reg, '',st)
print rep
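A slightly stricter variant, as a sketch under the assumption that the 0x prefix is always at the very start of the string: anchoring with ^ keeps the substitution from touching a 0x00 sequence later in the text, and 0* also handles a value with no leading zeros. The function name strip_hex_padding is made up for this example:

```python
import re

def strip_hex_padding(s):
    # ^ anchors to the start; 0* allows zero or more padding zeros.
    return re.sub(r'^0x0*', '', s)

print(strip_hex_padding('0x000007c922433434000fc'))
print(strip_hex_padding('0x7c'))
```

With the unanchored (0x[0]+) pattern the second input would be left unchanged, because that pattern requires at least one zero after the x.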

regular expressions to extract phone numbers

I am new to regular expressions and I am trying to write a pattern for phone numbers, in order to identify and extract them. My question can be summarized with the following simple example:
First I try to identify whether the string contains something like (+34), which should be optional:
prefixsrch = re.compile(r'(\(?\+34\)?)?')
that I test in the following string in the following way:
line0 = "(+34)"
print prefixsrch.findall(line0)
which yields the result:
['(+34)','']
My first question is: why does it find two occurrences of the pattern? I guess this is related to the prefix being optional, but I do not completely understand it. Anyway, now for my main question.
If we do a similar thing searching for a pattern of 9 digits we get the same:
numsrch = re.compile(r'\d{9}')
line1 = "971756754"
print numsrch.findall(line1)
yields something like:
['971756754']
which is fine. Now what I want to do is identify a 9 digits number, preceded or not, by (+34). So to my understanding I should do something like:
phonesrch = re.compile(r'(\(?\+34\)?)?\d{9}')
If I test it in the following strings...
line0 = "(+34)971756754"
line1 = "971756754"
print phonesrch.findall(line0)
print phonesrch.findall(line1)
this is, to my surprise, what I get:
['(+34)']
['']
What I was expecting to get is ['(+34)971756754'] and ['971756754']. Does anybody have any insight into this? Thank you very much in advance.
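Your capturing group is wrong. When a pattern contains a capturing group, re.findall() returns the contents of the group rather than the whole match. Put the country code in a non-capturing group and wrap the entire expression in a capturing group: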
>>> line0 = "(+34)971756754"
>>> line1 = "971756754"
>>> re.findall(r'((?:\(?\+34\)?)?\d{9})',line0)
['(+34)971756754']
>>> re.findall(r'((?:\(?\+34\)?)?\d{9})',line1)
['971756754']
My first question is: why does it find two occurrences of the pattern?
This is because ? matches 0 or 1 repetitions, so an empty string is also a valid match. After matching (+34), findall finds one more, empty, match at the end of the string.
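Both effects can be reproduced in a few lines. This sketch reuses the patterns from the question and adds one variant with a non-capturing group and no outer group (the names optional_prefix, captured and whole are mine):

```python
import re

optional_prefix = re.compile(r'(\(?\+34\)?)?')      # everything is optional
captured = re.compile(r'(\(?\+34\)?)?\d{9}')        # group around the prefix only
whole = re.compile(r'(?:\(?\+34\)?)?\d{9}')         # no capturing group at all

# The optional pattern matches "(+34)" and then the empty string at the end.
print(optional_prefix.findall("(+34)"))

# With one capturing group, findall reports the group, not the full match.
print(captured.findall("(+34)971756754"))
print(captured.findall("971756754"))

# Without capturing groups, findall reports the full match.
print(whole.findall("(+34)971756754"))
print(whole.findall("971756754"))
```

Dropping the outer capturing group is equivalent here to wrapping the whole expression in one, since findall falls back to the full match when the pattern has no groups.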

How to select only certain Substrings

From a string, say dna = 'ATAGGGATAGGGAGAGAGCGATCGAGCTAG',
I got substrings, say dna.format = 'ATAGGGATAG','GGGAGAGAG'.
I only want to print the substrings whose length is divisible by 3.
How do I do that? I'm using modulo but it's not working!
import re
if mydna = 'ATAGGGATAGGGAGAGAGCAGATCGAGCTAG'
print re.findall("ATA"(.*?)"AGA" , mydna)
if len(mydna)%3 == 0
print mydna
corrected code
import re
mydna = 'ATAGGGATAGGGAGAGAGCAGATCGAGCTAG'
re.findall("ATA"(.*?)"AGA" , mydna.format)
if len(mydna.format)%3 == 0:
print mydna.format
This still doesn't give me the substrings with length divisible by three. Any idea what's wrong?
I'm expecting only substrings whose length is divisible by three to be printed.
To include overlapping substrings, I have the following lengthy version. The idea is to find all starting and ending marks and calculate the distance between them.
mydna = 'ATAGGGATAGGGAGAGAGCAGATCGAGCTAG'
[mydna[start.start():end.start()+3] for start in re.finditer('(?=ATA)',mydna) for end in re.finditer('(?=AGA)',mydna) if end.start()>start.start() and (end.start()-start.start())%3 == 0]
['ATAGGGATAGGGAGA', 'ATAGGGAGA']
Show all substrings, including overlapping ones:
[mydna[start.start():end.start()+3] for start in re.finditer('(?=ATA)',mydna) for end in re.finditer('(?=AGA)',mydna) if end.start()>start.start()]
['ATAGGGATAGGGAGA', 'ATAGGGATAGGGAGAGA', 'ATAGGGATAGGGAGAGAGCAGA', 'ATAGGGAGA', 'ATAGGGAGAGA', 'ATAGGGAGAGAGCAGA']
You can also use a regular expression for that:
re.findall('ATA((...)*?)AGA', mydna)
The inner parentheses match three letters at a time, so the text between ATA and AGA always has a length divisible by 3.
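Because that pattern has two capturing groups, findall returns a tuple per match. A sketch of the same idea with the inner group made non-capturing, so that only the text between the markers comes back (with_groups and inner_only are names chosen for this example):

```python
import re

mydna = 'ATAGGGATAGGGAGAGAGCAGATCGAGCTAG'

# Two capturing groups: each result is (whole inner text, last triple).
with_groups = re.findall('ATA((...)*?)AGA', mydna)

# (?:...) groups without capturing: each result is just the inner text.
inner_only = re.findall('ATA((?:...)*?)AGA', mydna)

print(with_groups)
print(inner_only)
```

Note that findall reports non-overlapping matches only, so this finds fewer candidates than the finditer version with lookaheads.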
Using modulo is the correct procedure. If it's not working, you're doing it wrong. Please provide an example of your code so it can be debugged.
re.findall() returns a list of matching strings. You need to iterate over them and apply the modulo test to the length of each one to achieve what you want.
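A minimal sketch of that filter, assuming the markers ATA and AGA from the question and a non-greedy match between them (codon_aligned is a hypothetical helper name, and the second test string is made up to show a rejected candidate):

```python
import re

def codon_aligned(dna):
    """Return the texts between ATA and AGA whose length is a multiple of 3."""
    candidates = re.findall('ATA(.*?)AGA', dna)
    # Apply the modulo test to each candidate, not to the whole input string.
    return [s for s in candidates if len(s) % 3 == 0]

print(codon_aligned('ATAGGGATAGGGAGAGAGCAGATCGAGCTAG'))
print(codon_aligned('ATAGGAGA'))
```

The code in the question tested len(mydna), the length of the full sequence, which is why the modulo check never filtered anything.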
