I have a large (50k-100k) set of strings mystrings. Some of the strings in mystrings may be exact substrings of others, and I would like to collapse these (discard the substring and only keep the longest). Right now I'm using a naive method, which has O(N^2) complexity.
unique_strings = set()
for s in sorted(mystrings, key=len, reverse=True):
    keep = True
    for us in unique_strings:
        if s in us:
            keep = False
            break
    if keep:
        unique_strings.add(s)
Which data structures or algorithms would make this task easier and not require O(N^2) operations? Libraries are OK, but I need to stay pure Python.
Finding a substring in a set():
name = set()
name.add('Victoria Stuart') ## add single element
name.update(('Carmine Wilson', 'Jazz', 'Georgio')) ## add multiple elements
name
{'Jazz', 'Georgio', 'Carmine Wilson', 'Victoria Stuart'}
me = 'Victoria'
if str(name).find(me):
    print('{} in {}'.format(me, name))
# Victoria in {'Jazz', 'Georgio', 'Carmine Wilson', 'Victoria Stuart'}
That's pretty easy -- but somewhat problematic, if you want to return the matching string:
for item in name:
    if item.find(me):
        print(item)
'''
Jazz
Georgio
Carmine Wilson
'''
print(str(name).find(me))
# 39 ## character offset for match (i.e., not a string)
As you can see, that loop prints every item except the one we want: str.find() returns -1 (which is truthy) when there is no match and 0 (which is falsy) when the match starts at position 0, so the matching string is exactly the one that gets skipped.
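A simpler check that sidesteps the find() pitfall is the plain substring operator:
for item in name:
    if me in item:
        print(item)
# Victoria Stuart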
It's probably better and easier to use a regex (regular expression):
import re
for item in name:
    if re.match(me, item):
        full_name = item
        print(item)
# Victoria Stuart

print(full_name)
# Victoria Stuart
for item in name:
    if re.search(me, item):
        print(item)
# Victoria Stuart
From the Python docs:
search() vs. match()
Python offers two different primitive operations based on regular
expressions: re.match() checks for a match only at the beginning of
the string, while re.search() checks for a match anywhere in the
string ...
A naive approach:
1. sort strings by length, longest first # `O(N*log_N)`
2. foreach string: # O(N)
3. insert each suffix into tree structure: first letter -> root, and so on.
# O(L) or O(L^2) depending on string slice implementation, L: string length
4. if inserting the entire string (the longest suffix) creates a new
leaf node, keep it!
O[N*(log_N + L)] or O[N*(log_N + L^2)]
This is probably far from optimal, but it should be significantly better than O(N^2) for large N (number of strings) and small L (average string length); a rough pure-Python sketch of the idea follows.
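For illustration, the trie idea above could look roughly like this in pure Python, using nested dicts as the tree (a sketch only; collapse_substrings and the dict layout are illustrative choices, not a tuned implementation):
def collapse_substrings(mystrings):
    root = {}  # nested-dict trie holding every suffix of every kept string
    unique_strings = set()
    for s in sorted(mystrings, key=len, reverse=True):
        # walk the whole string through the trie; if we never fall off,
        # s is a prefix of some stored suffix, i.e. a substring of a kept string
        node, is_substring = root, True
        for ch in s:
            if ch not in node:
                is_substring = False
                break
            node = node[ch]
        if not is_substring:
            unique_strings.add(s)
            # insert every suffix of s so shorter strings can later be found at any offset
            for i in range(len(s)):
                node = root
                for ch in s[i:]:
                    node = node.setdefault(ch, {})
    return unique_strings
Memory grows with the number of suffix characters inserted (roughly L^2 per kept string), which matches the complexity sketched above.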
You could also iterate through the strings in descending order by length and add all substrings of each string to a set, and only keep those strings that are not in the set. The algorithmic big O should be the same as for the worst case above (O[N*(log_N + L^2)]), but the implementation is much simpler:
seen_strings, keep_strings = set(), set()
for s in sorted(mystrings, key=len, reverse=True):
    if s not in seen_strings:
        keep_strings.add(s)
        l = len(s)
        # add every proper substring of s (including its suffixes) to the seen set
        for start in range(l):
            for end in range(start + 1, l + 1):
                if end - start < l:
                    seen_strings.add(s[start:end])
In the meantime I came up with this approach:
from Bio.trie import trie

unique_strings = set()
suffix_tree = trie()
for s in sorted(mystrings, key=len, reverse=True):
    if suffix_tree.with_prefix(s) == []:
        unique_strings.add(s)
        for i in range(len(s)):
            suffix_tree[s[i:]] = 1
The good: runtime went from ≈15 minutes to ≈20 seconds for the data set I was working with. The bad: it introduces Biopython as a dependency, which is neither lightweight nor pure Python (as I originally asked).
You can presort the strings and create a dictionary that maps strings to positions in the sorted list. Then you can loop over the list of strings (O(N)) and their suffixes (O(L)), setting to None the entries that are found in the position dict (O(1) dict lookup and O(1) list update). In total this has O(N*L) complexity, where L is the average string length.
strings = sorted(mystrings, key=len, reverse=True)
index_map = {s: i for i, s in enumerate(strings)}
unique = set()
for i, s in enumerate(strings):
    if s is None:
        continue
    unique.add(s)
    for k in range(1, len(s)):
        try:
            index = index_map[s[k:]]
        except KeyError:
            pass
        else:
            if strings[index] is None:
                break
            strings[index] = None
Testing on the following sample data gives a speedup factor of about 21:
import random
from string import ascii_lowercase
mystrings = [''.join(random.choices(ascii_lowercase, k=random.randint(1, 10)))
             for __ in range(1000)]
mystrings = set(mystrings)
Related
I've got a list of strings, for example: ['Lion','Rabbit','Sea Otter','Monkey','Eagle','Rat']
I'm trying to find out the total number of possible combinations of these items, where item order matters and the total string length, when all strings are concatenated with comma separators, is less than a given length.
So, for max total string length 14, I would need to count combinations such as (not exhaustive list):
Lion
Rabbit
Eagle,Lion
Lion,Eagle
Lion,Eagle,Rat
Eagle,Lion,Rat
Sea Otter,Lion
etc...
but it would not include combinations where the total string length is more than the 14 character limit, such as Sea Otter,Monkey
I know for this pretty limited sample it wouldn't be that hard to manually calculate or determine with a few nested loops, but the actual use case will be a list of a couple hundred strings and a much longer character limit, meaning the number of nested iterations to write manually would be extremely confusing...
I tried to work through writing this via Python's itertools, but keep getting lost as none of the examples I'm finding come close enough to what I'm needing, especially with the limited character length (not limited number of items) and the need to allow repeated combinations in different orders.
Any help getting started would be great.
You can use itertools.combinations in a list comprehension that creates tuples of at most 3 items (any more items will by definition be more than 14 characters combined), while filtering by total string length:
import itertools

lst = ['Lion','Rabbit','Sea Otter','Monkey','Eagle','Rat']
sorted_lst = sorted(lst, key=len)

# find the max number of items
for n, i in enumerate(sorted_lst):
    if len(','.join(sorted_lst[:n+1])) > 14:
        items_limit = n+1
        break

[x for l in range(1, items_limit) for x in itertools.combinations(lst, l) if len(','.join(x)) < 15]
PS. use itertools.permutations if you need permutations (as in your sample output); your question text asks about combinations.
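For the order-sensitive case, that would be the same comprehension with permutations swapped in (a small variation on the snippet above):
[x for l in range(1, items_limit) for x in itertools.permutations(lst, l) if len(','.join(x)) < 15]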
You can use a recursive generator function:
s, m_s = ['Lion','Rabbit','Sea Otter','Monkey','Eagle','Rat'], 14

def get_combos(d, c = []):
    yield ','.join(c)  # yield back the running (valid) combination
    for i in range(len(d)):
        if d[i] not in c and len(','.join(c+[d[i]])) <= m_s:
            yield from get_combos(d[:i]+d[i+1:], c+[d[i]])  # found a new valid combination
        if len(d[i]) <= m_s:
            yield from get_combos(d[:i]+d[i+1:], [d[i]])  # ignore the running combo and start over with a single string
        yield from get_combos(d[:i]+d[i+1:], c)  # skip the string at the current iteration of the `for` loop and keep the running combination

vals = [v for v in set(get_combos(s)) if v]  # drop the empty combination
print(vals)
Output:
['Rat,Rabbit', 'Lion,Rabbit', 'Eagle', 'Eagle,Lion,Rat', 'Monkey', 'Rabbit,Lion', 'Rat,Eagle', 'Sea Otter', 'Rat,Lion,Eagle', 'Monkey,Rat', 'Rabbit,Monkey', 'Sea Otter,Rat', 'Rabbit', 'Lion,Sea Otter', 'Rabbit,Eagle', 'Rat,Eagle,Lion', 'Rat,Sea Otter', 'Lion,Monkey', 'Eagle,Lion', 'Eagle,Rat', 'Lion,Eagle,Rat', 'Rat', 'Lion,Rat,Eagle', 'Eagle,Rabbit', 'Rat,Lion', 'Monkey,Eagle', 'Lion,Eagle', 'Eagle,Monkey', 'Monkey,Lion', 'Rat,Monkey', 'Sea Otter,Lion', 'Rabbit,Rat', 'Monkey,Rabbit', 'Eagle,Rat,Lion', 'Lion,Rat', 'Lion']
I have a question that might sound like something already asked, but in reality I can't find a really good answer for it.
Every day I have a list with a few thousand strings in it. I also know that this list will always contain exactly one item containing the word "other".
For example, one day I may have:
a = ['mark', 'george', .... , '...other...', 'matt', 'lisa', ... ]
another day I may get:
a = ['karen','chris','lucas', ............................., '...other']
As you can see the position of the item containing the substring "other" is random.
My goal is to get as fast as possible the index of the item containing the substring 'other'.
I found other answers here where most people suggest list comprehensions or loops to look for it, for example: Finding a substring within a list in Python and Check if a Python list item contains a string inside another string.
They don't work for me because they are too slow.
Also, other solutions suggest using "any" to simply check whether "other" is contained in the list, but I need the index, not a boolean value.
I believe regex might be a good potential solution, even though I'm having a hard time figuring out how.
So far I simply managed to do the following:
# any_other_value_available will tell me extremely quickly if 'other' is contained in list.
any_other_value_available = 'other' in str(list_unique_keys_in_dict).lower()
From here, I don't quite know what to do. Any suggestions? Thank you.
Methods Explored
1. Generator Method
next(i for i,v in enumerate(test_strings) if 'other' in v)
2. List Comprehension Method
[i for i,v in enumerate(test_strings) if 'other' in v]
3. Using Index with Generator (suggested by #HeapOverflow)
test_strings.index(next(v for v in test_strings if 'other' in v))
4. Regular Expression with Generator
re_pattern = re.compile('.*other.*')
next(test_strings.index(x) for x in test_strings if re_pattern.search(x))
Conclusion
Index Method had the fastest time (method suggested by #HeapOverflow in comments).
Test Code
Using Perfplot which uses timeit
import random
import string
import re
import perfplot

def random_string(N):
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(N))

def create_strings(length):
    M = length // 2
    random_strings = [random_string(5) for _ in range(length)]
    front = ['...other...'] + random_strings
    middle = random_strings[:M] + ['...other...'] + random_strings[M:]
    end_ = random_strings + ['...other...']
    return front, middle, end_

def search_list_comprehension(test_strings):
    return [i for i, v in enumerate(test_strings) if 'other' in v][0]

def search_generator(test_strings):
    return next(i for i, v in enumerate(test_strings) if 'other' in v)

def search_index(test_strings):
    return test_strings.index(next(v for v in test_strings if 'other' in v))

def search_regex(test_strings):
    re_pattern = re.compile('.*other.*')
    return next(test_strings.index(x) for x in test_strings if re_pattern.search(x))

# Each benchmark is run with '...other...' placed at the front, middle and end of a random list of strings.
out = perfplot.bench(
    setup=lambda n: create_strings(n),  # create front, middle, end strings of length n
    kernels=[
        lambda a: [search_list_comprehension(x) for x in a],
        lambda a: [search_generator(x) for x in a],
        lambda a: [search_index(x) for x in a],
        lambda a: [search_regex(x) for x in a],
    ],
    labels=["list_comp", "generator", "index", "regex"],
    n_range=[2 ** k for k in range(15)],
    xlabel="list length",
    # More optional arguments with their default values:
    # title=None,
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    # equality_check=numpy.allclose,  # set to None to disable "correctness" assertion
    # automatic_order=True,
    # colors=None,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
)
out.show()
print(out)
Results
length list regex list_comp generator index
1.0 10199.0 3699.0 4199.0 3899.0
2.0 11399.0 3899.0 4300.0 4199.0
4.0 13099.0 4300.0 4599.0 4300.0
8.0 16300.0 5299.0 5099.0 4800.0
16.0 22399.0 7199.0 5999.0 5699.0
32.0 34900.0 10799.0 7799.0 7499.0
64.0 59300.0 18599.0 11799.0 11200.0
128.0 108599.0 33899.0 19299.0 18500.0
256.0 205899.0 64699.0 34699.0 33099.0
512.0 403000.0 138199.0 69099.0 62499.0
1024.0 798900.0 285600.0 142599.0 120900.0
2048.0 1599999.0 582999.0 288699.0 239299.0
4096.0 3191899.0 1179200.0 583599.0 478899.0
8192.0 6332699.0 2356400.0 1176399.0 953500.0
16384.0 12779600.0 4731100.0 2339099.0 1897100.0
If you are looking for a substring, regular expressions are a good way to find it.
In your case you are looking for all substrings that contain 'other'.
As you have already mentioned, there is no special order of the elements in the list, so the search for your desired element is linear; it would be even if the list were kept in order.
A regular expression that might describe your search is query='.*other.*'.
From the documentation:
. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
* Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
With .* before and after other there can be 0 or more repetitions of any character.
For example
import re
list_of_variables = ['rossum', 'python', '..other..', 'random']
query = '.*other.*'
indices = [list_of_variables.index(x) for x in list_of_variables if re.search(query, x)]
This will return a list of the indices of the elements that match your query.
In this example indices will be [2], since '..other..' is the third element in the list.
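A small variation (reusing the same query) lets enumerate supply the index directly and avoids the repeated list.index() scans:
indices = [i for i, x in enumerate(list_of_variables) if re.search(query, x)]
# [2]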
Given a string str and a list of variable-length prefixes p, I want to find all possible prefixes found at the start of str, allowing for up to k mismatches and wildcards (dot character) in str.
I only want to search at the beginning of the string and need to do this efficiently for len(p) <= 1000; k <= 5 and millions of strs.
So for example:
str = 'abc.efghijklmnop'
p = ['abc', 'xxx', 'xbc', 'abcxx', 'abcxxx']
k = 1
result = ['abc', 'xbc', 'abcxx'] #but not 'xxx', 'abcxxx'
Is there an efficient algorithm for this, ideally with a python implementation already available?
My current idea would be to walk through str character by character and keep a running tally of each prefix's mismatch count.
At each step, I would calculate a new list of candidates which is the list of prefixes that do not have too many mismatches.
If I reach the end of a prefix it gets added to the returned list.
So something like this:
def find_prefixes_with_mismatches(str, p, k):
    p_with_end = [prefix+'$' for prefix in p]
    candidates = list(range(len(p)))
    mismatches = [0 for _ in candidates]
    result = []
    for char_ix in range(len(str)):
        # at each iteration we build a new set of candidates
        new_candidates = []
        for prefix_ix in candidates:
            # have we reached the end?
            if p_with_end[prefix_ix][char_ix] == '$':
                # then this is a match
                result.append(p[prefix_ix])
                # do not add to new_candidates
            else:
                # do we have a mismatch?
                if str[char_ix] != p_with_end[prefix_ix][char_ix] and str[char_ix] != '.' and p_with_end[prefix_ix][char_ix] != '.':
                    mismatches[prefix_ix] += 1
                    # only add to new_candidates if the count is still not > k
                    if mismatches[prefix_ix] <= k:
                        new_candidates.append(prefix_ix)
                else:
                    # if not, this remains a candidate
                    new_candidates.append(prefix_ix)
        # update candidates
        candidates = new_candidates
    return result
But I'm not sure if this will be any more efficient than simply searching one prefix after the other, since it requires rebuilding this list of candidates at every step.
I do not know of something that does exactly this.
But if I were to write it, I'd try constructing a trie of all possible decision points, with an attached vector of all states you wound up in. You would then take each string, walk the trie until you hit a final matched node, then return the precompiled vector of results.
If you've got a lot of prefixes and have set k large, that trie may be very big. But if you're amortizing creating it against running it on millions of strings, it may be worthwhile.
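To make the idea a bit more concrete, here is a rough sketch of a plain dict trie of the prefixes, walked against the start of one string with a mismatch budget. It is not the precompiled decision-point trie described above (every string still does its own walk), and all names here are illustrative, but it does share work between prefixes with common beginnings:
def build_trie(prefixes):
    # nested-dict trie; assumes '$' never occurs in the prefixes themselves
    root = {}
    for p in prefixes:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node['$'] = p  # mark a complete prefix and remember which one it was
    return root

def match_prefixes(s, trie, k):
    # each state is (trie node, position in s, mismatches so far)
    results, stack = [], [(trie, 0, 0)]
    while stack:
        node, pos, mism = stack.pop()
        if '$' in node:
            results.append(node['$'])
        if pos >= len(s):
            continue
        for ch, child in node.items():
            if ch == '$':
                continue
            # '.' in the text is a wildcard; any other differing character costs a mismatch
            cost = 0 if s[pos] in (ch, '.') else 1
            if mism + cost <= k:
                stack.append((child, pos + 1, mism + cost))
    return results

# With the example from the question this should give ['abc', 'xbc', 'abcxx'] in some order:
# match_prefixes('abc.efghijklmnop', build_trie(['abc', 'xxx', 'xbc', 'abcxx', 'abcxxx']), 1)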
I have a string containing letters and numbers like this -
12345A6789B12345C
How can I get a list that looks like this?
[12345A, 6789B, 12345C]
>>> my_string = '12345A6789B12345C'
>>> import re
>>> re.findall(r'\d*\w', my_string)
['12345A', '6789B', '12345C']
For the sake of completeness, non-regex solution:
data = "12345A6789B12345C"
result = [""]
for char in data:
    result[-1] += char
    if char.isalpha():
        result.append("")

if not result[-1]:
    result.pop()
print(result)
# ['12345A', '6789B', '12345C']
This should be faster for smaller strings, but if you're working with huge data, go with regex: once the pattern is compiled and warmed up, the actual splitting happens on the 'fast' C side.
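For reference, a precompiled-regex version of the same split, reusing the pattern from the regex answer above, would look roughly like this (split_with_regex is just an illustrative name):
import re

pattern = re.compile(r'\d*\w')  # compile once, reuse for every string

def split_with_regex(data):
    return pattern.findall(data)

print(split_with_regex("12345A6789B12345C"))
# ['12345A', '6789B', '12345C']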
You could build this with a generator, too. The approach below keeps track of start and end indices of each slice, yielding a generator of strings. You'll have to cast it to list to use it as one, though (splitonalpha(some_string)[-1] will fail, since generators aren't indexable)
def splitonalpha(s):
    start = 0
    for end, ch in enumerate(s, start=1):
        if ch.isalpha():
            yield s[start:end]
            start = end
list(splitonalpha("12345A6789B12345C"))
# ['12345A', '6789B', '12345C']
I need to write a script which will loop over a list of sequences, find shared motifs between them (it is possible that multiple solutions exist for different motifs) and print the motif that is shared by all sequences.
In the below example
chains = ['GATTACA', 'TAGACCA', 'ATACA']
the AT is one of the shared motifs. I'll be thankful for any solution of such task including usage of BioPython functions.
Recently I made a script that loops over the same set, takes the shortest sequence as the reference, and then tries to find this reference sequence at each position of the other chains. But I really don't know how to find shared motifs without defining a reference.
# reference
xz = " ".join(chains)
ref = min(xz.split(), key=len)

# LOOKING FOR THE MOTIFS
for chain in chains:
    for i in range(len(chain)):
        if chain == ref:
            pass
        elif ref not in chain:
            print "%s has not been found in the %s" % (ref, chain)
            break
        elif chain[i:].startswith(ref):
            print "%s has been detected in %s in the %d position" % (ref, chain, i+1)
This is only a quick idea. You will have to improve it, because it searches almost the entire space. I hope it helps.
def cut_into_parts(chain, n):
    # all substrings of chain that have length n
    return [chain[x:x+n] for x in range(0, len(chain) - n + 1)]

def cut_chains(chains, n):
    rlist = []
    for chain in chains:
        rlist.extend(cut_into_parts(chain, n))
    return rlist

def is_str_common(candidate, chains):
    # is candidate a substring of every chain?
    return all(candidate in chain for chain in chains)

def find_best_common(chains):
    # try the longest possible length first, so the first hit is a longest shared motif
    for n in reversed(range(1, min(len(c) for c in chains) + 1)):
        for candidate in cut_chains(chains, n):
            if is_str_common(candidate, chains):
                return candidate
    return ''
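For example, applied to the chains from the question, the cleaned-up functions above return one of the longest shared substrings ('CA' would be an equally long answer here):
chains = ['GATTACA', 'TAGACCA', 'ATACA']
print(find_best_common(chains))
# TA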
The simplest approach starts with the realization that the longest common substring can not be longer than the shortest string we're looking at. It should also be obvious that if we start with the longest possible candidate and only examine shorter candidates after eliminating longer ones, then we can stop as soon as we find a common substring.
So, we begin by sorting the DNA strings by length. We'll refer to the length of the shortest one as l. Then the procedure is to test its substrings, beginning with the single substring of length l, and then the two substrings of length l-1, and so forth, until a match is found and we return it.
from Bio import SeqIO

def get_all_substrings(iterable):
    s = tuple(iterable)
    seen = set()
    # longest substrings first, so the first common one found is a longest one
    for size in range(len(s), 1, -1):
        for index in range(len(s) + 1 - size):
            substring = iterable[index:index+size]
            if substring not in seen:
                seen.add(substring)
                yield substring

def main(input_file, return_all=False):
    substrings = []
    records = list(SeqIO.parse(open(input_file), 'fasta'))
    records = sorted(records, key=lambda record: len(str(record.seq)))
    first, rest = records[0], records[1:]
    rest_sequences = [str(record.seq) for record in rest]
    for substring in get_all_substrings(str(first.seq)):
        if all(substring in seq for seq in rest_sequences):
            if return_all:
                substrings.append(substring)
            else:
                return substring
    return substrings
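The same idea also works directly on the in-memory chains list from the question, without a FASTA file, by reusing get_all_substrings from above (a quick sketch):
chains = ['GATTACA', 'TAGACCA', 'ATACA']
shortest, *rest = sorted(chains, key=len)
longest_shared = next((sub for sub in get_all_substrings(shortest)
                       if all(sub in seq for seq in rest)), '')
print(longest_shared)
# TA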