Python 2.7 Regex; Finding varying number of expressions - python

I am working on a bioinformatics project and am currently trying to split a certain string containing locations on a chromosome.
Example of a few strings, which go by the name "location":
NC_000023.11:g.154532082
NC_000023.11:g.154532058_154532060
NC_000023.11:g.154532046
What I would like returned looks like:
([154532082])
([154532058], [154532060])
([154532046])
I can not think of a regex that normally captures only the first number, and when present, separately captures the second number, without creating a second group, as with:
re.findall(":g.(\d*)_?(\d*)", location)
which gives:
([154532082], [])
([154532058], [154532060])
([154532046], [])
or
re.findall(":g.(\d*)", location), re.findall("\d_(\d*)", location)
which gives:
[(154532082), ()]
[(154532058), (154532060)]
[(154532046), ()]
Is there any expression that would solve this? Or should I just remove the empty lists after finding them the way I do?

Here is what you could do:
[re.search("(?<=:g.)(\d*)_?(\d*)", item).group() for item in location.split("\n")]
What I did here was to make a list comprehension to do everything in a single line. Going by parts:
for item in location.split("\n")
This iterates over a list built from the location string, where I split the string in all the line breaks. Now the for loop will iterate over every part of the string between the line breaks. Each of these parts is now called 'item'.
re.search("(?<=:g.)(\d*)_?(\d*)", item).group()
Here I use a positive lookbehind assertion (the ?<=:g. part), which requires ':g.' to appear immediately before the match but leaves it out of the result. Everything after it that fits the rest of the pattern is matched. As for group(), it returns the matched text from the re.search() call (for example '154532058_154532060'); it does not print anything by itself.
Read the Python documentation on regex, it helps a lot:
https://docs.python.org/2/library/re.html
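The lookbehind approach above returns the whole match as one string, so a range such as 154532058_154532060 still comes back joined. A minimal sketch of one way to get the numbers separately, without empty placeholders, is to match the digits after ':g.' and split on the underscore. The helper name positions and the variable locations are made up for this illustration; the sample lines come from the question.

```python
import re

# Sample lines taken from the question.
locations = """NC_000023.11:g.154532082
NC_000023.11:g.154532058_154532060
NC_000023.11:g.154532046"""


def positions(line):
    """Return the position numbers that follow ':g.' as a list of strings."""
    # Lookbehind for a literal ':g.' (dot escaped), then one or more
    # underscore-separated runs of digits.
    match = re.search(r"(?<=:g\.)\d+(?:_\d+)*", line)
    return match.group().split("_") if match else []


result = [positions(line) for line in locations.splitlines()]
print(result)
# [['154532082'], ['154532058', '154532060'], ['154532046']]
```

Because the split only produces as many items as there are numbers, single positions give a one-element list and no filtering of empty groups is needed.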

Related

How to append character to specific part of item in list - Python

I have the following list Novar:
["'population_3000|road_class_3_3000'", "'population_3000|road_class_3_3000|trafBuf25'", "'population_3000|road_class_3_3000|trafBuf25|population_1000'"]
How can I append a character (e.g. $) to a specific part of the items in the list, for instance to road_class_3_3000, so that the upgraded list becomes:
["'population_3000|road_class_3_3000$'", "'population_3000|road_class_3_3000$|trafBuf25'", "'population_3000|road_class_3_3000$|trafBuf25|population_1000'"]
Most similar questions on Stack Overflow seem to focus on manipulating the item itself, rather than a part of the item.
Therefore, applying the following code:
if (item == "road_class_3_3000"):
item.append("$")
Would be of no use since road_class_3_3000 is part of the items "'population_3000|road_class_3_3000'", "'population_3000|road_class_3_3000|trafBuf25'" and "'population_3000|road_class_3_3000|trafBuf25|population_1000'"
You can use the re module (part of the standard library) for this task, as follows:
import re
novar = ["'population_3000|road_class_3_3000'", "'population_3000|road_class_3_3000|trafBuf25'", "'population_3000|road_class_3_3000|trafBuf25|population_1000'"]
novar2 = [re.sub(r'(?<=road_class_3_3000)', '$', i) for i in novar]
print(novar2)
output
["'population_3000|road_class_3_3000$'", "'population_3000|road_class_3_3000$|trafBuf25'", "'population_3000|road_class_3_3000$|trafBuf25|population_1000'"]
The feature I used is called a positive lookbehind, which is a kind of zero-length assertion. The pattern matches the zero-length position immediately after road_class_3_3000, and re.sub inserts the $ character there.
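As a rough sketch of the same idea with the target held in a variable, the pattern can be built with re.escape so that any special characters in the target are treated literally. For a fixed literal target, plain str.replace appears to give the same result. The names target, after_target, with_regex and with_replace are chosen for this example only.

```python
import re

novar = [
    "'population_3000|road_class_3_3000'",
    "'population_3000|road_class_3_3000|trafBuf25'",
    "'population_3000|road_class_3_3000|trafBuf25|population_1000'",
]
target = "road_class_3_3000"  # illustrative variable name

# Zero-width lookbehind: matches the empty position right after the target.
after_target = re.compile("(?<=" + re.escape(target) + ")")
with_regex = [after_target.sub("$", item) for item in novar]

# Plain string replacement: swap the target for the target plus '$'.
with_replace = [item.replace(target, target + "$") for item in novar]

print(with_regex == with_replace)
```

Both variants also match the target when it is a prefix of a longer name (for example road_class_3_30000), so a stricter pattern would be needed if such names can occur.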

Regex to remove strings from list that do not match given prefix

I have a string that includes multiple comma-separated lists of values, always embedded between <mks:Field name="MyField"> and </mks:Field>.
For example:
<mks:Field name="MyField">X001_ABC</mks:Field><mks:Field name="AnotherField">X002_XYZ</mks:Field><mks:Field name="MyField"></mks:Field><mks:Field name="MyField">X000_Test1,X000_Test2</mks:Field><mks:Field name="MyField">X001_ABC,X000_Test1</mks:Field><mks:Field name="MyField">X000_Test1,X000_Test2,X002_XYZ</mks:Field>
In this example I have the following values to work with:
X001_ABC
(empty)
X000_Test1,X000_Test2
X001_ABC,X000_Test1
X000_Test1,X000_Test2,X002_XYZ
Now I want to remove all the values that do not start with the prefix "X000_", including any needless commas, so that my result looks like this:
<mks:Field name="MyField"></mks:Field><mks:Field name="AnotherField">X002_XYZ</mks:Field><mks:Field name="MyField"></mks:Field><mks:Field name="MyField">X000_Test1,X000_Test2</mks:Field><mks:Field name="MyField">X000_Test1</mks:Field><mks:Field name="MyField">X000_Test1,X000_Test2</mks:Field>
I have tried the following regex, but it does not work properly when a field contains only a single value that does not match. I also do not want to have to change the regex whenever a new value with my prefix is introduced (e.g. X000_Test3).
Search: (?<=name="MyField">)[^<>](?:.*?(X000_Test1,X000_Test2|X000_Test1|X000_Test2))?.*?(?=</mks:Field>)
Replace: \1
This gives me the following result that does not match the expected output:
<mks:Field name="MyField">X000_Test1,X000_Test2</mks:Field><mks:Field name="MyField">X000_Test1</mks:Field><mks:Field name="MyField">X000_Test2</mks:Field>
Unfortunately I cannot simply parse the string with something else - I only have the option of a regex search/replace in this case.
Thank you in advance, any help would be appreciated.
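If you are using JavaScript, use this:
const prefix = 'X000';
const pattern = new RegExp(`((?<=>)|,)((?!${prefix}|[>\<,]).)*(,|(?=\<))`, 'g');
For any other language, use this:
'/((?<=>)|,)((?!X000|[>\<,]).)*(,|(?=\<))/';
Here X000 is the prefix you want to keep; replace each match with an empty string. Note that this pattern is not restricted to MyField, so it also strips non-matching values from other fields such as AnotherField.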
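For comparison, where a scripting language is available (which the asker says is not their situation), the filtering can be sketched in Python with a replacement callback. This is an illustration under that assumption, not a drop-in search/replace pair. The function name keep_prefixed and its parameters are hypothetical.

```python
import re


def keep_prefixed(xml, field="MyField", prefix="X000_"):
    """Keep only comma-separated values starting with prefix in the given field."""
    # Group 1: the opening tag of the chosen field; group 2: its raw content.
    pattern = re.compile(
        r'(<mks:Field name="' + re.escape(field) + r'">)([^<]*)(?=</mks:Field>)'
    )

    def repl(match):
        kept = [v for v in match.group(2).split(",") if v.startswith(prefix)]
        return match.group(1) + ",".join(kept)

    return pattern.sub(repl, xml)


sample = (
    '<mks:Field name="MyField">X001_ABC</mks:Field>'
    '<mks:Field name="AnotherField">X002_XYZ</mks:Field>'
    '<mks:Field name="MyField">X001_ABC,X000_Test1</mks:Field>'
)
print(keep_prefixed(sample))
```

Splitting and re-joining the values avoids having to reason about leading, trailing, or doubled commas in the pattern itself, and anchoring on the field name leaves other fields untouched.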

Python regex matching words with repeating consonant

First off, this is homework. (I couldn't use a tag in the title and nothing showed up in the tag list at the bottom for homework, so please let me know if I should EDIT something else regarding this matter).
So I have been reading through the python docs and scavenging SO, finding several solutions that are close to what I want, but not exact.
I have a dictionary which I read in to a string:
a
aa
aabbaa
...
z
We are practicing various regex patterns on this data.
The specific problem here is to return a list of words which match the pattern, NOT tuples with the groups within each match.
For example:
Given a subset of this dictionary like:
someword
sommmmmeword
someworddddd
sooooomeword
I want to return:
['sommmmmeword', 'someworddddd']
NOT:
[('sommmmmeword', 'mmmmm', ...), ...] # or any other variant
EDIT:
My reasoning behind the above example, is that I want to see how I can avoid making a second pass over the results. That is instead of saying:
res = re.match(re.compile(r'pattern'), dictionary)
return [r[0] for r in res]
I specifically want a mechanism where I can just use:
return re.match(re.compile(r'pattern'), dictionary)
I know that may sound silly, but I am doing this to really dig into regex. I mention this at the bottom.
This is what I have tried:
# learned about back refs
r'\b([b-z&&[^eiou]])\1+\b' -> # nothing
# back refs were weird, I want to match something N times
r'\b[b-z&&[^eiou]]{2}\b' -> # nothing
Somewhere in testing I noticed a pattern returning things like '\nsomeword'. I couldn't figure out what it was but if I find the pattern again I will include it here for completeness.
# Maybe the \b word markers don't work how I think?
r'.*[b-z&&[^eiou]]{2}' -> # still nothing
# Okay lets just try to match something in between anything
r'.*[b-z&&[^eiou]].*' -> # nope
# Since its words, maybe I should be more explicit.
r'[a-z]*[b-z&&[^eiou]][a-z]*' -> # still nope
# Decided to go back to grouping.
r'([b-z&&[^eiou]])(\1)' # I realize set difference may be the issue
# I saw someone (on SO) use set difference claiming it works
# but I gave up on it...
# OKAY getting close
r'(([b-df-hj-np-tv-xz])(\2))' -> [('ll', 'l', 'l'), ...]
# Trying the previous ones without set difference
r'\b(.*(?:[b-df-hj-np-tv-xz]{3}).*)\b' -> # returned everything (all words)
# Here I realize I need a non-greedy leading pattern (.* -> .*?)
r'\b(.*?(?:[b-df-hj-np-tv-xz]{3}).*)\b' -> # still everything
# Maybe I need the comma in {3,} to get anything 3 or more
r'\b(.*?(?:[b-df-hj-np-tv-xz]{3,}).*)\b' -> # still everything
# okay I'll try a 1 line test just in case
r'\b(.*?([b-df-hj-np-tv-xz])(\2{3,}).*)\b'
# Using 'asdfdffff' -> [('asdfdffff', 'f', 'fff')]
# Using dictionary -> [] # WAIT WHAT?!
How does this last one work? Maybe there are no words with 3+ repeating consonants? I'm using /usr/share/dict/cracklib-small on my school's server, which is about 50,000 words I think.
I am still working on this but any advice would be awesome.
One thing I find curious is that you can not back reference a non-capturing group. If I want to output only the full word, I use (?:...) to avoid capture, but then I can not back reference. Obviously I could leave the captures, loop over the results and filter out the extra stuff, but I absolutely want to figure this out using ONLY regex!
Perhaps there is a way to do the non-capture, but still allow back reference? Or maybe there is an entirely different expression I haven't tested yet.
Here are some points to consider:
Use re.findall to get all the results, not re.match (that only searches for 1 match and only at the string start).
[b-z&&[^eiou]] is Java/ICU regex syntax (character class intersection); it is not supported by Python's re module. In Python, you can either redefine the ranges to skip the vowels, or use (?![eiou])[b-z].
To avoid "extra" values in tuples with re.findall, do not use capturing groups. If you need backreferences, use re.finditer instead of re.findall and access .group() of each match.
Coming back to the question, how you can use a backreference and still get the whole match, here is a working demo:
import re
s = """someword
sommmmmeword
someworddddd
sooooomeword"""
res = [x.group() for x in re.finditer(r"\w*([b-df-hj-np-tv-xz])\1\w*", s)]
print(res)
# => ['sommmmmeword', 'someworddddd']
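The same approach can be sketched with the lookahead-based consonant class mentioned in the points above, here requiring the same consonant three or more times in a row. The extra word 'hello' and the variable names are added for illustration only. Note that (?![eiou])[b-z] still includes 'y', unlike the explicit ranges used in the demo.

```python
import re

words = """someword
sommmmmeword
someworddddd
sooooomeword
hello"""

# One consonant: any letter b-z that is not e, i, o or u ('a' is outside b-z).
consonant = r"(?![eiou])[b-z]"

# Capture one consonant, then require at least two more copies of it (\1{2,}).
pattern = re.compile(r"\b\w*(" + consonant + r")\1{2,}\w*\b")

# finditer yields match objects, so .group() gives the whole word
# even though the pattern contains a capturing group.
triples = [m.group() for m in pattern.finditer(words)]
print(triples)
# ['sommmmmeword', 'someworddddd']
```

Since \w never matches a newline, each match stays within a single line of the word list.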

regex extraction 2 groups resulting only in one match

New to regex.
Consider you have the following text structure:
"hello_1:45||hello_2:67||bye_1:45||bye_5:89||.....|| bye_last:100" and so on
I want to build a dictionary out of it taking the string value as a key, and the decimal number as the dict value.
I was trying to check my concept using this nice tool
I wrote my regex expression:
(\w+):(\d+)
And got only one match ->the first in the string : hello_1:45
I tried also something like:
.*(\w+):(\d+).*
But also not good, any ideas?
You should use the g (global) modifier to get all the matches rather than stopping at the first one. In Python, the re.findall function returns all the matches.
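A short sketch of the re.findall route, using the pattern from the question: with two capturing groups, findall returns a list of (key, value) tuples that can be turned into a dictionary directly. The variable names pairs and result are chosen for this example, and converting the values with int is an assumption about the desired type.

```python
import re

s = "hello_1:45||hello_2:67||bye_1:45||bye_5:89"

# With two groups, findall returns one (key, value) tuple per match.
pairs = re.findall(r"(\w+):(\d+)", s)

# Build the dictionary, converting each value to an integer.
result = {key: int(value) for key, value in pairs}
print(result)
```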
You can also achieve this using only the split function.
s = "hello_1:45||hello_2:67||bye_1:45||bye_5:89"
print({i.split(':')[0]: i.split(':')[1] for i in s.split('||')})
Try this if you want to convert the value part to int:
print({i.split(':')[0]: int(i.split(':')[1]) for i in s.split('||')})
or to float:
print({i.split(':')[0]: float(i.split(':')[1]) for i in s.split('||')})

Remove duplicates from a list in Python

I have a python script which parses an xml file and then gives me the required information. My output looks like this, and is 100% correct:
output = ['77:275,77:424,77:425,77:426,77:427,77:412,77:413,77:414,77:412,77:413,77:414,77:412,77:413,77:414,77:412,77:413,77:414,77:431,77:432,77:433,77:435,77:467,77:470,77:471,77:484,77:485,77:475,77:476,77:437,77:438,77:439,77:440,77:442,77:443,77:444,77:445,77:446,77:447,77:449,77:450,77:451,77:454,77:455,77:456,77:305,77:309,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:317,77:321,77:346,77:349,77:350,77:351,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:362,77:367,77:369,77:374,77:370,77:372,77:373,77:387,77:388,77:389,77:392,77:393,77:394,77:328,77:283,77:284,77:285,77:288,77:289,77:290,77:292,']
It is all fine, but I want to remove the duplicate entries within an element, as in the case above. I tried using OrderedDict or just a simple list(set(output)), but obviously neither worked. Does anyone have a tip for me on how to solve this problem?
You have one element in a list. If you expected it to be treated as separate elements, you need to explicitly split it.
You could split the string on the ',' comma character into a list with str.split():
separate_elements = output[0].split(',')
after which you can use set() (unordered) or OrderedDict (maintaining order) and re-join the string if you still need just the one string object:
','.join(set(separate_elements))
You can put that back into a list with just one element, but there is little point if all you ever handle is that one string.
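Putting the steps together, an order-preserving version might look like the sketch below. It uses a shortened sample of the data from the question, and the variable names elements, unique and deduped are made up here. The string in the question ends with a comma, so splitting yields a trailing empty item, which this sketch drops.

```python
from collections import OrderedDict

# Shortened sample in the same shape as the question's data.
output = ['77:275,77:424,77:412,77:413,77:412,77:413,77:431,']

# Split the single string and drop the empty item left by the trailing comma.
elements = [e for e in output[0].split(',') if e]

# OrderedDict.fromkeys keeps the first occurrence of each key, in order.
unique = list(OrderedDict.fromkeys(elements))

# Re-join into the original one-element list shape.
deduped = [','.join(unique)]
print(deduped)
# ['77:275,77:424,77:412,77:413,77:431']
```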
