Partial Substring Matching in Python - python

I'm interested in creating a program that will search for a certain string (known henceforth as string A) in a large library of other strings. Basically, if string A existed in the library it would be discarded and another string's existence would be checked for within the library. The program would then give me a final list of strings that did not exist as substrings within the large library. I was able to make a program that finds EXACT matches, but I need to add an additional module that allows the sub-string search to allow for partial matches. Namely, one or two of the sub-string characters would be alright. The list of string A's (which are all permutations of a,t,g,c in a 7-letter string 4^7 different ones) has difficulties with highly diverse libraries.
My initial thought was to use regex and perhaps a hamming distance algorithm to find all those partial matches. Basically this first attempt allows me to put a "?" or wildcard into all positions of the string A in question (1-7), but I can only get it into the first position. The wildcard would then allow me to search for partial matches of the particular string A in question. If this the wrong way to approach this problem, I'd gladly change it up. I used fnmatch as per suggestion on another question This is what I have so far:
from Bio import SeqIO
import fnmatch
import random
import itertools
#Define a splitting string algorithm
def split_by_n(seq,n):
while seq:
yield seq[:n]
seq = seq[n:]
#Import all combinations/permutations from fasta fille, 4^7
my_combinations = []
fasta_sequences = SeqIO.parse(open("Combinations/base_combinations_7.fasta"),'fasta')
for fasta in fasta_sequences:
name, sequence = fasta.id, str(fasta.seq)
x = sequence.lower()
my_combinations.append(x)
primer = "tgatgag"
final = []
#List to make wildcard permutations
wildCard = ['?']
i = list(split_by_n(primer, 1))
for letter in i:
wildCard.append(letter)
del wildCard[1]
final.append(''.join(wildCard))
#Search for wildcard permutation
for entry in final:
filtered = fnmatch.filter(my_combinations, entry)
This is my desired output:
primer = "tgatgag"
['?', 'g', 'a', 't', 'g', 'a', 'g']
['t', '?', 'a', 't', 'g', 'a', 'g']
['t', 'g', '?', 't', 'g', 'a', 'g']
['t', 'g', 'a', '?', 'g', 'a', 'g']
['t', 'g', 'a', 't', '?', 'a', 'g']
['t', 'g', 'a', 't', 'g', '?', 'g']
['t', 'g', 'a', 't', 'g', 'a', '?']
['agatgag', 'tgatgag', 'cgatgag', 'ggatgag']
['taatgag', 'ttatgag', 'tcatgag', 'tgatgag']
['tgatgag', 'tgttgag', 'tgctgag', 'tggtgag']
['tgaagag', 'tgatgag', 'tgacgag', 'tgaggag']
['tgataag', 'tgattag', 'tgatcag', 'tgatgag']
['tgatgag', 'tgatgtg', 'tgatgcg', 'tgatggg']
['tgatgaa', 'tgatgat', 'tgatgac', 'tgatgag']

Here's an example solution for 2 element replacement:
primer = 'cattagc'
bases = ['a','c','g','t']
# this is the generator for all possible index combinations
p = itertools.permutations(range(len(primer)), 2)
# this is the list of all possible base pair combinations
c = list(itertools.combinations_with_replacement(bases, 2))
results = []
for i1, i2 in p:
for c1, c2 in c:
temp = list(primer)
temp[i1], temp[i2] = c1, c2
results.append(''.join(temp))
This will create all possible replacements for subing out any two elements of the original primer.

Related

Is it possible to take an unknown amount of lists, and get every possible combination of those lists in the original order? [duplicate]

This question already has answers here:
How to get the cartesian product of multiple lists
(17 answers)
Closed 1 year ago.
I am attempting a cipher script where each encoded character could be one of multiple letters,
ex: BC = a w k.
the encoding part of the script was simple, however i'm running into a problem while trying to decode sentences. because each encoded character looks like BC i have to split the entire string into 2's, then I have to check the key file to see which letters are possible from each character,
so could look like TVVQBTTV and the lists would look like:
['G', 'H', 'P', 'R', 'T', 'Y']
['G', 'H', 'P', 'R', 'T', 'Y', 'E', 'L', 'Z', '.']
['G', 'H', 'P', 'R', 'T', 'Y', 'E', 'L', 'Z', '.', 'A', 'F', 'M', 'O', 'S', 'W', 'Z']
['G', 'H', 'P', 'R', 'T', 'Y', 'E', 'L', 'Z', '.', 'A', 'F', 'M', 'O', 'S', 'W', 'Z', 'G', 'H', 'P', 'R', 'T', 'Y']
my goal is to print out every possible combination (ex: GGGG, GGGH, GGGP, ect) in the console so the person receiving the message has to look through all of the combinations to find the right one. this is an attempt to make the cipher harder to break. however, because the amount of lists grow with the amount of characters in the sentence, so 'Hello, how are you?' could look like: TVVQVQVQBTVPBWVOBTTBTVVQDDVOBW
and there are to many lists for it to put in this box without it looking thoroughly messy. so is there any way to do this?
I'm a bit confused why your 4 lists extend the previous lists, rather than being 4 separate lists mapping to the possible values of TV, VQ, BT, and TV. In other words I don't get why the possible values of the key 'VQ' include the possible values of 'TV', and so on. But as the question is written:
def get_list_of_possible_vals(key):
# check key file
return ['a','b','c','d'] # just an example; list of characters the given key
# could map to
encoded_list = ['TV','VQ','VQ','VQ','BT','VP'] # input data
decoded_list = []
list_of_lists = []
for enc in encoded_list:
decoded_list += get_list_of_possible_vals(enc)
list_of_lists.append(decoded_list)
# If you want the lists of decoded values to be separate rather than stacking,
# do decoded_list.append() instead
# Cartesian product of the lists
for element in itertools.product(*list_of_lists):
print(''.join(element))

How to find double occurrence of a letter in a word [duplicate]

This question already has answers here:
RegExp match repeated characters
(6 answers)
Closed 8 years ago.
I have string :-
s = 'bubble'
how to use regular expression to get a list like:
['b', 'u', 'bb', 'l', 'e']
I want to filter single as well as double occurrence of a letter.
This should do it:
import re
[m.group(0) for m in re.finditer('(.)\\1*',s)]
For 'bubbles' this returns:
['b', 'u', 'bb', 'l', 'e', 's']
For 'bubblesssss' this returns:
['b', 'u', 'bb', 'l', 'e', 'sssss']
You really have two questions. The first question is how to split the list, the second is how to filter.
The splitting takes advantage of back references in a pattern. In this case we'll construct a pattern the will find one or two occurrences of a letter then construct a list from the search results. The \1 in the code block refers to the first parenthesized expression.
import re
pattern = re.compile(r'(.)\1?')
s = "bubble"
result = [x.group() for x in pattern.finditer(s)]
print(result)
To filter the list stored in result you could use a list comprehension that filters on length.
filtered_result = [x for x in result if len(x) == 2]
print(filtered_result)
You could just get the set of duplications directly by tweaking the regular expression.
pattern2 = re.compile(r'(.)\1')
result2 = [x.group() for x in pattern2.finditer(s)]
print(result2)
The output from running the above is:
['b', 'u', 'bb', 'l', 'e']
['bb']
['bb']

Flatten a list of strings to characters and then de-dupe the new list

I have the following and have flattened the list via this documentation
>>> wordlist = ['cat','dog','rabbit']
>>> letterlist = [lt for wd in wordlist for lt in wd]
>>> print(letterlist)
['c', 'a', 't', 'd', 'o', 'g', 'r', 'a', 'b', 'b', 'i', 't']
Can the list comprehension be extended to remove duplicate characters. The desired result is the following (in any order):
['a', 'c', 'b', 'd', 'g', 'i', 'o', 'r', 't']
I can convert to a set and then back to a list but I'd prefer to keep it as a list.
Easiest is to use a set comprehension instead of a list comp:
letterlist = {lt for wd in wordlist for lt in wd}
All I did was replace the square brackets with curly braces. This works in Python 2.7 and up.
For Python 2.6 and earlier, you'd use the set() callable with a generator expression instead:
letterlist = set(lt for wd in wordlist for lt in wd)
Last, but not least, you can replace the comprehension syntax altogether by producing the letters from all sequences by chaining the strings together, treat them all like one long sequence, with itertools.chain.from_iterable(); you give that a sequence of sequences and it'll give you back one long sequence:
from itertools import chain
letterlist = set(chain.from_iterable(wordlist))
Sets are an easy way to get unique elements from an iterable. To flatten a list of lists, itertools.chain provides a handy way to do that.
from itertools import chain
>>> set(chain.from_iterable(['cat','dog','rabbit'])
{'a', 'b', 'c', 'd', 'g', 'i', 'o', 'r', 't'}
I think set comprehension should be used
wordlist = ['cat','dog','rabbit']
letterlist = {lt for wd in wordlist for lt in wd}
print(letterlist)
this will work only in python 2.7 and higher
for previous versions use set instead of {}
wordlist = ['cat','dog','rabbit']
letterlist = set(lt for wd in wordlist for lt in wd)
print(letterlist)

How to print each letter in a string only once

Hello everyone I have a python question.
I'm trying to print each letter in the given string only once.
How do I do this using a for loop and sort the letters from a to z?
Heres what I have;
import string
sentence_str = ("No punctuation should be attached to a word in your list,
e.g., end. Not a correct word, but end is.")
letter_str = sentence_str
letter_str = letter_str.lower()
badchar_str = string.punctuation + string.whitespace
Alist = []
for i in badchar_str:
letter_str = letter_str.replace(i,'')
letter_str = list(letter_str)
letter_str.sort()
for i in letter_str:
Alist.append(i)
print(Alist))
Answer I get:
['a']
['a', 'a']
['a', 'a', 'a']
['a', 'a', 'a', 'a']
['a', 'a', 'a', 'a', 'a']
['a', 'a', 'a', 'a', 'a', 'b']
['a', 'a', 'a', 'a', 'a', 'b', 'b']
['a', 'a', 'a', 'a', 'a', 'b', 'b', 'c']....
I need:
['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'l', 'n', 'o', 'p', 'r', 's', 't', 'u', 'w', 'y']
no errors...
Just check if the letter is not already in your array before appending it:
for i in letter_str:
if not(i in Alist):
Alist.append(i)
print(Alist))
or alternatively use the Set data structure that Python provides instead of an array. Sets do not allow duplicates.
aSet = set(letter_str)
Using itertools ifilter which you can say has an implicit for-loop:
In [20]: a=[i for i in itertools.ifilter(lambda x: x.isalpha(), sentence_str.lower())]
In [21]: set(a)
Out[21]:
set(['a',
'c',
'b',
'e',
'd',
'g',
'i',
'h',
'l',
'o',
'n',
'p',
's',
'r',
'u',
't',
'w',
'y'])
Malvolio correctly states that the answer should be as simple as possible. For that we use python's set type which takes care of the issue of uniqueness in the most efficient and simple way possible.
However, his answer does not deal with removing punctuation and spacing. Furthermore, all answers as well as the code in the question do that pretty inefficiently(loop through badchar_str and replace in the original string).
The best(ie, simplest and most efficient as well as idiomatic python) way to find all unique letters in the sentence is this:
import string
sentence_str = ("No punctuation should be attached to a word in your list,
e.g., end. Not a correct word, but end is.")
bad_chars = set(string.punctuation + string.whitespace)
unique_letters = set(sentence_str.lower()) - bad_chars
If you want them to be sorted, simply replace the last line with:
unique_letters = sorted(set(sentence_str.lower()) - bad_chars)
If the order in which you want to print doesn't matter you can use:
sentence_str = ("No punctuation should be attached to a word in your list,
e.g., end. Not a correct word, but end is.")
badchar_str = string.punctuation + string.whitespace
for i in badchar_str:
letter_str = letter_str.replace(i,'')
print(set(sentence_str))
Or if you want to print in sorted order you could convert it back to list and use sort() and then print.
First principles, Clarice. Simplicity.
list(set(sentence_str))
You can use set() for remove duplicate characters and sorted():
import string
sentence_str = "No punctuation should be attached to a word in your list, e.g., end. Not a correct word, but end is."
letter_str = sentence_str
letter_str = letter_str.lower()
badchar_str = string.punctuation + string.whitespace
for i in badchar_str:
letter_str = letter_str.replace(i,'')
characters = list(letter_str);
print sorted(set(characters))

Is that a tag list or something else?

I am new to NLP and NLTK, and I want to find ambiguous words, meaning words with at least n different tags. I have this method, but the output is more than confusing.
Code:
def MostAmbiguousWords(words, n):
# wordsUniqeTags holds a list of uniqe tags that have been observed for a given word
wordsUniqeTags = {}
for (w,t) in words:
if wordsUniqeTags.has_key(w):
wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)
else:
wordsUniqeTags[w] = set([t])
# Starting to count
res = []
for w in wordsUniqeTags:
if len(wordsUniqeTags[w]) >= n:
res.append((w, wordsUniqeTags[w]))
return res
MostAmbiguousWords(brown.tagged_words(), 13)
Output:
[("what's", set(['C', 'B', 'E', 'D', 'H', 'WDT+BEZ', '-', 'N', 'T', 'W', 'V', 'Z', '+'])),
("who's", set(['C', 'B', 'E', 'WPS+BEZ', 'H', '+', '-', 'N', 'P', 'S', 'W', 'V', 'Z'])),
("that's", set(['C', 'B', 'E', 'D', 'H', '+', '-', 'N', 'DT+BEZ', 'P', 'S', 'T', 'W', 'V', 'Z'])),
('that', set(['C', 'D', 'I', 'H', '-', 'L', 'O', 'N', 'Q', 'P', 'S', 'T', 'W', 'CS']))]
Now I have no idea what B,C,Q, ect. could represent. So, my questions:
What are these?
What do they mean? (In case they are tags)
I think they are not tags, because who and whats don't have the WH tag indicating "wh question words".
I'll be happy if someone could post a link that includes a mapping of all possible tags and their meaning.
It looks like you have a typo. In this line:
wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)
you should have set([t]) (not set(t)), like you do in the else case.
This explains the behavior you're seeing because t is a string and set(t) is making a set out of each character in the string. What you want is set([t]) which makes a set that has t as its element.
>>> t = 'WHQ'
>>> set(t)
set(['Q', 'H', 'W']) # bad
>>> set([t])
set(['WHQ']) # good
By the way, you can correct the problem and simplify things by just changing that line to:
wordsUniqeTags[w].add(t)
But, really, you should make use of the setdefault method on dict and list comprehension syntax to improve the method overall. So try this instead:
def most_ambiguous_words(words, n):
# wordsUniqeTags holds a list of uniqe tags that have been observed for a given word
wordsUniqeTags = {}
for (w,t) in words:
wordsUniqeTags.setdefault(w, set()).add(t)
# Starting to count
return [(word,tags) for word,tags in wordsUniqeTags.iteritems() if len(tags) >= n]
You are splitting your POS tags into single characters in this line:
wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)
set('AT') results in set(['A', 'T']).
How about making use of the Counter and defaultdict functionality in the collections module?
from collection import defaultdict, Counter
def most_ambiguous_words(words, n):
counts = defaultdict(Counter)
for (word,tag) in words:
counts[word][tag] += 1
return [(w, counts[w].keys()) for w in counts if len(counts[word]) > n]

Categories