This question already has answers here:
regex to match a word with unique (non-repeating) characters
(3 answers)
Closed 4 years ago.
I need to match all uppercase letters in a string, but without duplicates of the same letter. In Python I've been using:
from re import compile
regex = compile('[A-Z]')
variables = regex.findall('(B or P) and (P or not Q)')
but that returns ['B', 'P', 'P', 'Q'], and I need ['B', 'P', 'Q'].
Thanks in advance!
You can use negative lookahead with a backreference to avoid matching duplicates:
import re
re.findall(r'([A-Z])(?!.*\1.*$)', '(B or P) and (P or not Q)')
This returns:
['B', 'P', 'Q']
And if order matters do:
print(sorted(set(variables),key=variables.index))
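Or if you have the more_itertools package (unique_everseen returns an iterator, so wrap it in list):
from more_itertools import unique_everseen as u
print(list(u(variables)))
Or, on Python 3.7+, where dicts are guaranteed to preserve insertion order (CPython 3.6 already does so as an implementation detail):
print(list({}.fromkeys(variables)))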
Or OrderedDict:
from collections import OrderedDict
print(list(OrderedDict.fromkeys(variables)))
All of these produce:
['B', 'P', 'Q']
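The regex approach and the deduplication approach differ in one subtle way: the negative lookahead keeps the last occurrence of each letter, while the dict-based deduplication keeps the first. Here is a small sketch of that difference, using a made-up input string (not from the question) in which the two orders diverge:

```python
import re

text = 'P and (B or P)'  # hypothetical input, chosen so the orders differ

# Lookahead version: a letter matches only if it does not occur again later,
# so each letter is reported at its LAST position.
last_seen = re.findall(r'([A-Z])(?!.*\1)', text)

# Dedup version: dict keys keep insertion order (Python 3.7+),
# so each letter is reported at its FIRST position.
first_seen = list(dict.fromkeys(re.findall(r'[A-Z]', text)))

print(last_seen)   # ['B', 'P']
print(first_seen)  # ['P', 'B']
```

For the string in the question both give ['B', 'P', 'Q'], so the difference only matters when the output order is significant.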
This question already has answers here:
How to use regex to find all overlapping matches
(5 answers)
Closed 2 years ago.
After searching for a while, I could only find how to match repetitions of a specific subpattern. Is there a way to find three or more repetitions of any subpattern?
For example:
re.findall(<the_regex>, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
→ ['a', 'b', 'x', 'aaabbbxxx_']
re.findall(<the_regex>, 'lalala luuluuluul')
→ ['la', 'luu', 'uul']
I apologize in advance if this is a duplicate and would be grateful to be redirected to the original question.
This lookahead-based regex may not give exactly the output shown in the question, but it comes very close.
r'(?=(.+)\1\1)'
Code:
>>> reg = re.compile(r'(?=(.+)\1\1)')
>>> reg.findall('aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['aaabbbxxx_', 'b', 'x', 'a', 'b', 'x', 'a', 'b', 'x']
>>> reg.findall('lalala luuluuluul')
['la', 'luu', 'uul']
RegEx Details:
Since the whole regex is a lookahead, it doesn't consume any characters: a lookahead is a zero-width match. This lets us return overlapping matches from the input.
Because the pattern contains a capture group, findall returns only that group.
(?=: Start lookahead
(.+): Match 1 or more of any character (greedy) and capture in group #1
\1\1: Match 2 more occurrences of group #1 using the back-reference \1 twice
): End lookahead
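To make the zero-width behaviour concrete, here is a minimal sketch that uses finditer instead of findall, so the position of each match is visible alongside the captured unit:

```python
import re

pattern = re.compile(r'(?=(.+)\1\1)')

# Each match is zero-width (start == end); the repeated unit is in group 1.
hits = [(m.start(), m.end(), m.group(1))
        for m in pattern.finditer('lalala luuluuluul')]
print(hits)  # [(0, 0, 'la'), (7, 7, 'luu'), (8, 8, 'uul')]
```

The matches at positions 7 and 8 overlap in the input, which is only possible because the lookahead itself consumes nothing.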
re.findall() won't find overlapping matches. But you can find the non-overlapping matches using a capture group followed by a positive lookahead that matches a back-reference to that group.
>>> import re
>>> regex = r'(.+)(?=\1{2})'
>>> re.findall(regex, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['aaabbbxxx_', 'a', 'b', 'x', 'a', 'b', 'x']
>>> re.findall(regex, 'lalala luuluuluul')
['la', 'luu']
This will find the longest matches; if you change (.+) to (.+?) you'll get the shortest matches at each point.
>>> regex = r'(.+?)(?=\1{2})'
>>> re.findall(regex, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['a', 'b', 'x', 'a', 'b', 'x', 'a', 'b', 'x']
It is not possible without defining the subpattern first.
Anyway, if the subpattern is just <any_alphanumeric>, then re.findall(<the_regex>, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_') would produce something like this :
['a', 'b', 'x', 'aa', 'ab', 'bb', 'bx', 'xx', 'x_', 'aaa', 'aaab', 'aaabb', ....]
i.e., every alphanumeric combination that is repeated three times, which is a lot of combinations, not just ['a', 'b', 'x', 'aaabbbxxx_'].
This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
I get this:
import re;
print re.findall(r"q(?=u)", "qqq queen quick qeel")
> ['q', 'q'] # for queen and quick
But I don't get this:
import re;
print re.findall(r"q(?!=u)", "qqq queen quick qeel")
> ['q', 'q', 'q', 'q', 'q', 'q'] # every q matches
I expected only 4 qs to match because the negative lookahead should see that in the word qeel for example, the letter after the q is not u.
What gives?
It should be:
import re
print(re.findall(r"q(?!u)", "qqq queen quick qeel"))
# ---^---
# ['q', 'q', 'q', 'q']
Without the =, that is. With (?!=u) you are asserting that the q is not followed by the literal text =u, which is true for all of your qs here. In general, a positive lookahead is written (?=...) whereas a negative one is just (?!...).
Sidenote: the ; is not needed at the end of a line unless you want to put several statements on one line, which is not considered "Pythonic" but is perfectly valid:
import re; print(re.findall(r"q(?!u)", "qqq queen quick qeel"))
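As a quick check of the explanation above, the sketch below runs all three patterns against the string from the question and then shows the one case that (?!=u) actually rejects. The second test string is made up for illustration.

```python
import re

text = "qqq queen quick qeel"

followed_by_u = re.findall(r"q(?=u)", text)          # positive lookahead
not_followed_by_u = re.findall(r"q(?!u)", text)      # negative lookahead
not_followed_by_eq_u = re.findall(r"q(?!=u)", text)  # negative lookahead for the literal "=u"

print(len(followed_by_u), len(not_followed_by_u), len(not_followed_by_eq_u))  # 2 4 6

# (?!=u) only rejects a q that is followed by the two characters "=u":
eq_demo = re.findall(r"q(?!=u)", "q=u qu q")
print(eq_demo)  # ['q', 'q']
```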
I'm interested in creating a program that will search for a certain string (henceforth string A) in a large library of other strings. Basically, if string A exists in the library it is discarded, and the next string is checked against the library. The program would then give me a final list of strings that do not exist as substrings within the large library. I was able to make a program that finds EXACT matches, but I need to add a module that lets the substring search allow partial matches; namely, one or two mismatched characters would be alright. The list of string A's (all 4^7 seven-letter strings over a, t, g, c) has difficulties with highly diverse libraries.
My initial thought was to use regex and perhaps a Hamming distance algorithm to find all those partial matches. My first attempt is meant to put a "?" wildcard into each position (1-7) of the string A in question, but I can only get it into the first position. The wildcard would then let me search for partial matches of that string A. If this is the wrong way to approach the problem, I'd gladly change it. I used fnmatch, as suggested in another question. This is what I have so far:
from Bio import SeqIO
import fnmatch
import random
import itertools
#Define a splitting string algorithm
def split_by_n(seq,n):
    while seq:
        yield seq[:n]
        seq = seq[n:]
#Import all combinations/permutations from fasta file, 4^7
my_combinations = []
fasta_sequences = SeqIO.parse(open("Combinations/base_combinations_7.fasta"),'fasta')
for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    x = sequence.lower()
    my_combinations.append(x)
primer = "tgatgag"
final = []
#List to make wildcard permutations
wildCard = ['?']
i = list(split_by_n(primer, 1))
for letter in i:
    wildCard.append(letter)
del wildCard[1]
final.append(''.join(wildCard))
#Search for wildcard permutation
for entry in final:
    filtered = fnmatch.filter(my_combinations, entry)
This is my desired output:
primer = "tgatgag"
['?', 'g', 'a', 't', 'g', 'a', 'g']
['t', '?', 'a', 't', 'g', 'a', 'g']
['t', 'g', '?', 't', 'g', 'a', 'g']
['t', 'g', 'a', '?', 'g', 'a', 'g']
['t', 'g', 'a', 't', '?', 'a', 'g']
['t', 'g', 'a', 't', 'g', '?', 'g']
['t', 'g', 'a', 't', 'g', 'a', '?']
['agatgag', 'tgatgag', 'cgatgag', 'ggatgag']
['taatgag', 'ttatgag', 'tcatgag', 'tgatgag']
['tgatgag', 'tgttgag', 'tgctgag', 'tggtgag']
['tgaagag', 'tgatgag', 'tgacgag', 'tgaggag']
['tgataag', 'tgattag', 'tgatcag', 'tgatgag']
['tgatgag', 'tgatgtg', 'tgatgcg', 'tgatggg']
['tgatgaa', 'tgatgat', 'tgatgac', 'tgatgag']
Here's an example solution for 2 element replacement:
import itertools
primer = 'cattagc'
bases = ['a','c','g','t']
# this is the generator for all possible ordered index pairs
p = itertools.permutations(range(len(primer)), 2)
# this is the list of all possible (unordered) base pair combinations
c = list(itertools.combinations_with_replacement(bases, 2))
results = []
for i1, i2 in p:
    for c1, c2 in c:
        temp = list(primer)
        temp[i1], temp[i2] = c1, c2
        results.append(''.join(temp))
This will create all possible replacements for substituting any two elements of the original primer. Note that results contains duplicates (for example, the same base placed at both positions is generated once for each index order), so wrap it in set() if you need unique strings.
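For the single-wildcard case shown in the question's desired output, one possible sketch is to build each pattern by slicing the primer around the wildcard position and then filter with fnmatch. The library below is a stand-in generated with itertools.product and is only an assumption for illustration; in the question it would be the sequences read from the FASTA file.

```python
import fnmatch
from itertools import product

primer = "tgatgag"

# Stand-in for the real library: all 4^7 strings over a, t, c, g.
library = [''.join(p) for p in product("atcg", repeat=len(primer))]

# One pattern per position, with '?' replacing the character at index i.
patterns = [primer[:i] + '?' + primer[i + 1:] for i in range(len(primer))]

# fnmatch treats '?' as "exactly one character".
matches = [fnmatch.filter(library, pattern) for pattern in patterns]

for pattern, found in zip(patterns, matches):
    print(pattern, found)
```

Slicing avoids the problem in the original loop, where the wildcard list is built only once and so the '?' never moves past the first position.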
This question already has answers here:
RegExp match repeated characters
(6 answers)
Closed 8 years ago.
I have a string:
s = 'bubble'
How can I use a regular expression to get a list like:
['b', 'u', 'bb', 'l', 'e']
I want to filter single as well as double occurrence of a letter.
This should do it:
import re
[m.group(0) for m in re.finditer('(.)\\1*',s)]
For 'bubbles' this returns:
['b', 'u', 'bb', 'l', 'e', 's']
For 'bubblesssss' this returns:
['b', 'u', 'bb', 'l', 'e', 'sssss']
You really have two questions. The first question is how to split the string, the second is how to filter.
The splitting takes advantage of back-references in a pattern. In this case we'll construct a pattern that will find one or two occurrences of a letter, then construct a list from the search results. The \1 in the code block refers to the first parenthesized expression.
import re
pattern = re.compile(r'(.)\1?')
s = "bubble"
result = [x.group() for x in pattern.finditer(s)]
print(result)
To filter the list stored in result you could use a list comprehension that filters on length.
filtered_result = [x for x in result if len(x) == 2]
print(filtered_result)
You could just get the set of duplications directly by tweaking the regular expression.
pattern2 = re.compile(r'(.)\1')
result2 = [x.group() for x in pattern2.finditer(s)]
print(result2)
The output from running the above is:
['b', 'u', 'bb', 'l', 'e']
['bb']
['bb']
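The two answers above use slightly different quantifiers, and the difference only shows up on runs longer than two characters. A minimal sketch comparing them on the longer example string, with itertools.groupby added as a regex-free alternative (groupby is not part of either answer):

```python
import re
from itertools import groupby

s = 'bubblesssss'

# \1* takes the whole run; \1? takes at most two characters per match.
runs_any_length = [m.group() for m in re.finditer(r'(.)\1*', s)]
runs_max_two = [m.group() for m in re.finditer(r'(.)\1?', s)]

# groupby groups consecutive equal characters, like the \1* pattern.
runs_groupby = [''.join(group) for _, group in groupby(s)]

print(runs_any_length)  # ['b', 'u', 'bb', 'l', 'e', 'sssss']
print(runs_max_two)     # ['b', 'u', 'bb', 'l', 'e', 'ss', 'ss', 's']
print(runs_groupby)     # ['b', 'u', 'bb', 'l', 'e', 'sssss']
```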
I have the following and have flattened the list via this documentation
>>> wordlist = ['cat','dog','rabbit']
>>> letterlist = [lt for wd in wordlist for lt in wd]
>>> print(letterlist)
['c', 'a', 't', 'd', 'o', 'g', 'r', 'a', 'b', 'b', 'i', 't']
Can the list comprehension be extended to remove duplicate characters? The desired result is the following (in any order):
['a', 'c', 'b', 'd', 'g', 'i', 'o', 'r', 't']
I can convert to a set and then back to a list but I'd prefer to keep it as a list.
Easiest is to use a set comprehension instead of a list comp:
letterlist = {lt for wd in wordlist for lt in wd}
All I did was replace the square brackets with curly braces. This works in Python 2.7 and up.
For Python 2.6 and earlier, you'd use the set() callable with a generator expression instead:
letterlist = set(lt for wd in wordlist for lt in wd)
Last, but not least, you can replace the comprehension syntax altogether with itertools.chain.from_iterable(), which chains the strings together and treats them as one long sequence; you give it a sequence of sequences and it gives you back one long iterable:
from itertools import chain
letterlist = set(chain.from_iterable(wordlist))
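Since the question asks for a list in the end, here is a short sketch of two ways to get one: dict.fromkeys keeps the first-seen order (guaranteed on Python 3.7+), while sorted(set(...)) gives alphabetical order. The variable names are illustrative only.

```python
from itertools import chain

wordlist = ['cat', 'dog', 'rabbit']

# Unique letters in order of first appearance (dicts keep insertion order).
unique_in_order = list(dict.fromkeys(chain.from_iterable(wordlist)))

# Unique letters in alphabetical order.
unique_sorted = sorted(set(chain.from_iterable(wordlist)))

print(unique_in_order)  # ['c', 'a', 't', 'd', 'o', 'g', 'r', 'b', 'i']
print(unique_sorted)    # ['a', 'b', 'c', 'd', 'g', 'i', 'o', 'r', 't']
```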
Sets are an easy way to get unique elements from an iterable. To flatten a list of lists, itertools.chain provides a handy way to do that.
>>> from itertools import chain
>>> set(chain.from_iterable(['cat','dog','rabbit']))
{'a', 'b', 'c', 'd', 'g', 'i', 'o', 'r', 't'}
I think a set comprehension should be used:
wordlist = ['cat','dog','rabbit']
letterlist = {lt for wd in wordlist for lt in wd}
print(letterlist)
This works only in Python 2.7 and higher.
For earlier versions, use set() instead of {}:
wordlist = ['cat','dog','rabbit']
letterlist = set(lt for wd in wordlist for lt in wd)
print(letterlist)