Finding common string in list and displaying them


I am trying to create a function compare(lst1, lst2) that compares two lists element by element, returns every common element in a new list, and shows what percentage of the elements match. All the elements in the lists are strings. For example, the function should return:
lst1 = AAAAABBBBBCCCCCDDDD
lst2 = ABCABCABCABCABCABCA
common strand = AxxAxxxBxxxCxxCxxxx
similarity = 25%
The parts of the list which are not similar will simply be returned as x.
I am having trouble in completing this function without the python set and zip method. I am not allowed to use them for this task and I have to achieve this using while and for loops. Kindly guide me as to how I can achieve this.

This is what I came up with.
lst1 = 'AAAAABBBBBCCCCCDDDD'
lst2 = 'ABCABCABCABCABCABCA'
common_strand = ''
score = 0
for i in range(len(lst1)):
    if lst1[i] == lst2[i]:
        common_strand = common_strand + str(lst1[i])
        score += 1
    else:
        common_strand = common_strand + 'x'
print('Common Strand: ', common_strand)
print('Similarity Score: ', score/len(lst1))
Output:
Common Strand: AxxAxxxBxxxCxxCxxxx
Similarity Score: 0.2631578947368421

You have two strings A and B. Strings are ordered sequences of characters.
Suppose both A and B have equal length (the same number of characters). Choose some position i < len(A), len(B) (remember Python sequences are 0-indexed). Your problem statement requires:
1. If character i in A is identical to character i in B, yield that character
2. Otherwise, yield some placeholder to denote the mismatch
How do you find the ith character in some string A? Take a look at Python's string methods. Remember: strings are sequences of characters, so Python strings also implement several sequence-specific operations.
If len(A) != len(B), you need to decide what to do when position i exists in one string but lies past the end of the other. You might represent these positions with the same placeholder as in (2).
If you know how to iterate the result of zip, you know how to use for loops. All you need is a way to iterate over the sequence of indices. Check out the language built-in functions.
Finally, for your measure of similarity: if you've compared n characters and found that N <= n are mismatched, you can define 1 - (N / n) as your measure of similarity. This works well for equally-long strings (for two strings with different lengths, you're always going to be calculating the proportion relative to the longer string).
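Putting those hints together, a minimal sketch might look like the following. The function name compare comes from the question; the placeholder parameter and the choice to treat positions past the end of the shorter string as mismatches are assumptions, not requirements stated in the task.

```python
def compare(a, b, placeholder="x"):
    """Compare two strings position by position without zip() or set().

    Returns the common strand and the fraction of matching positions,
    measured against the longer string (an assumption for unequal lengths).
    """
    longest = max(len(a), len(b))
    strand = ""
    matches = 0
    for i in range(longest):
        # Position i must exist in both strings before the characters are compared.
        if i < len(a) and i < len(b) and a[i] == b[i]:
            strand += a[i]
            matches += 1
        else:
            strand += placeholder
    similarity = matches / longest if longest else 0.0
    return strand, similarity


strand, similarity = compare("AAAAABBBBBCCCCCDDDD", "ABCABCABCABCABCABCA")
print("Common Strand:", strand)
print("Similarity: {:.0%}".format(similarity))
```

For the two strings in the question, 5 of the 19 positions match, so the similarity comes out at about 26%.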

Related

Finding consecutive numbers from multiple lists in python

Consider, for example, that I have 3 python lists of numbers, like this:
a = [1,5,7]
b = [2,6,8]
c = [4,9]
I need to be able to check if there are consecutive numbers from these lists, one number from each list, and return true if there are.
In the above example, 7 from list a, 8 from list b and 9 from list c are consecutive, so the returned value should be true. This should be extendable to any number of lists (the number of lists is not known in advance, because they are created on the fly based on prior conditions).
Also, a value in one list is not present in any other list. For example, list a above contains the element '1', so '1' is not present in any other list.
Is there a way to accomplish this? It seems simple, yet too complex. I am a python newbie, and have been trying all sorts of loops but not even getting close to what I am looking for.
Looking for suggestions. Thanks in advance.
UPDATE: Here is the context for this question.
I am trying to implement a 'phrase search' in a sentence (which is part of a much bigger task).
Here is an example.
The sentence is:
My friend is my colleague.
I have created an index, which is a dictionary having the word as the key and a list of its positions as the value. So for the above sentence, I get:
{
    'My': [0,3],
    'friend': [1],
    'is': [2],
    'colleague': [4]
}
I need to search for the phrase 'friend is my' in the above sentence.
So I am trying to do something like this:
First get the positions of words in the phrase from the dictionary, to get:
{
    'My': [0,3],
    'friend': [1],
    'is': [2],
}
Then check if the words in my phrase have consecutive positions, which goes back to my original question of finding consecutive numbers in different lists.
Since 'friend' is in position 1, 'is' is in position 2, and 'my' is in position 3, I should be able to conclude that the given sentence contains my phrase.

Can you assume
lists are sorted?
O(n) memory usage is acceptable?
As a start, you could merge the lists and then check for consecutive elements. This isn't a complete solution because it would match consecutive elements that all appear in a single list (see comments).
from itertools import chain, pairwise

# from https://docs.python.org/3/library/itertools.html#itertools-recipes
def triplewise(iterable):
    "Return overlapping triplets from an iterable"
    # triplewise('ABCDEFG') --> ABC BCD CDE DEF EFG
    for (a, _), (b, c) in pairwise(pairwise(iterable)):
        yield a, b, c

# Each positional argument is one list of integers.
def consecutive_numbers_in_list(*lists: list[int]) -> bool:
    big_list = sorted(chain(*lists))
    for first, second, third in triplewise(big_list):
        if (first + 1) == second == (third - 1):
            return True
    return False

consecutive_numbers_in_list(a, b, c)
# True
Note that itertools.pairwise requires Python 3.10 or later.
If the lists are sorted but you need constant memory, then you can use an n pointer approach in which you have a pointer to the first element of each list, then advance the lowest pointer on each iteration and keep track of the last three values seen at all times.
Ultimately, your question doesn't make that much sense, in that this doesn't seem like a typical programming task. If you are a newbie to programming, you can ask what you are trying to accomplish, instead of how to implement your candidate solution, and we might be able to suggest a better method overall. See https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
UPDATE
You are implementing phrase search. So an additional requirement, compared to the original question, is that the first list contain the first index of the sequence, the second list contain the second index of the sequence, etc. (As I assume that "friend my is" is not an acceptable search result for the query "my friend is".)
Pseudocode:
for each index i in the first list (j = 1):
    for each list j from the 2nd list to the nth list:
        check whether i + j - 1 appears in list j
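The pseudocode above can be sketched in Python roughly as follows. The function name has_consecutive_run is hypothetical, and converting the later lists to sets for fast membership tests is a design choice of this sketch, not something the question requires.

```python
def has_consecutive_run(*position_lists):
    """Return True if some value p in the first list has p + 1 in the
    second list, p + 2 in the third list, and so on."""
    if not position_lists:
        return False
    # Sets give constant-time membership tests for the later lists.
    later_sets = [set(positions) for positions in position_lists[1:]]
    for start in position_lists[0]:
        if all(start + offset in later
               for offset, later in enumerate(later_sets, start=1)):
            return True
    return False


# Lists from the original question: 7, 8, 9 are consecutive.
print(has_consecutive_run([1, 5, 7], [2, 6, 8], [4, 9]))
# Phrase 'friend is my': positions [1], [2], [0, 3].
print(has_consecutive_run([1], [2], [0, 3]))
```

Because the lists are checked in order, swapping the order of the words in the phrase changes the answer, which is what a phrase search needs.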
Depending on the characteristics of your data, you may find there are easier or more efficient approaches:
You can find all the documents matching n of the search terms in the phrase, then do exact substring matching in each document.
If the search terms have a relatively short maximum token length, you can add n-grams to your search index.
This is a very general problem; you can look at implementations in popular search engines like ElasticSearch.

Finding overlapping bits in a set of binary strings

I am working on a project in which I need to generate several identifiers for combinatorial pooling of different molecules. To do so, I assign each molecule an n-bit string (where n is the number of pools I have. In this case, 79 pools) and each string has 4 "on" bits (4 bits equal to 1) corresponding to which pools that molecule will appear in. Next, I want to pare down the number of strings such that no two molecules appear in the same pool more than twice (in other words, the greatest number of overlapping bits between two strings can be no greater than 2).
To do this, I: 1) compiled a list of all n-bit strings with k "on" bits, 2) generated a list of lists where each element is a list of indices where the bit is on using re.finditer and 3) iterate through the list to compare strings, adding only strings that meet my criteria into my final list of strings.
The code I use to compare strings:
drug_strings = [] #To store suitable strings for combinatorial pooling rules
class badString(Exception): pass
for k in range(len(strings)):
    bit_current = bit_list[k]
    try:
        for bits in bit_list[:k]:
            intersect = set.intersection(set(bit_current),set(bits))
            if len(intersect) > 2:
                raise badString() #pass on to next iteration if string has overlaps in previous set
        drug_strings.append(strings[k])
    except badString:
        pass
However, this code takes forever to run. I am running this with n=79-bit strings with k=4 "on" bits per string (~1.5M possible strings) so I assume that the long runtime is because I am comparing each string to every previous string. Is there an alternative/smarter way to go about doing this? An algorithm that would work faster and be more robust?
EDIT: I realized that the simpler way to approach this problem instead of identifying the entire subset of strings that would be suitable for my project was to just randomly sample the larger set of n-bit strings with k "on" bits, store only the strings that fit my criteria, and then once I have an appropriate amount of suitable strings, simply take as many as I need from those. New code is as follows:
my_strings = []
my_bits = []
for k in range(2000):
    random = np.random.randint(0, len(strings_77))
    string = strings_77.pop(random)
    bits = [m.start()+1 for m in re.finditer('1',string)]
    if all(len(set(bits) & set(my_bit)) <= 2
           for my_bit in my_bits[:k]):
        my_strings.append(string)
        my_bits.append(bits)
Now I only have to compare against strings I've already pulled (at most 1999 previous strings instead of up to 1 million). It runs much more quickly this way. Thanks for the help!

Raising exceptions is expensive: an exception object with its traceback has to be created and the stack has to be unwound. (Entering a try block is cheap in CPython by comparison; the cost is paid when an exception is actually raised.)
What you really want is to check that every intersection has at most two elements, and then append. There is no need for exceptions.
for k in range(len(strings)):
    bit_current = bit_list[k]
    if all(len(set(bit_current) & set(bits)) <= 2
           for bits in bit_list[:k]):
        drug_strings.append(strings[k])
Also, instead of having to look up the strings and bit_list index, you can iterate over all the parts you need at the same time. You still need the index for the bit_list slice:
for index, (drug_string, bit_current) in enumerate(zip(strings, bit_list)):
    # Slice with the loop variable `index`, not the old `k`.
    if all(len(set(bit_current) & set(bits)) <= 2
           for bits in bit_list[:index]):
        drug_strings.append(drug_string)
You can also avoid re-creating the bit_current set for every comparison:
for index, (drug_string, bit_current) in enumerate(zip(strings, bit_list)):
    bit_set = set(bit_current)  # built once per string
    if all(len(bit_set & set(bits)) <= 2
           for bits in bit_list[:index]):
        drug_strings.append(drug_string)

Some minor things that I would improve in your code that may be causing some overhead:
Move set(bit_current) outside the inner loop.
Remove the raise/except part.
Since you have the condition if len(intersect) > 2:, you could implement the intersection yourself so that it stops as soon as that condition is met. That way you avoid unnecessary computation.
So the code would become:
for k in range(len(strings)):
    bit_current = set(bit_list[k])
    intersect = []
    for bits in bit_list[:k]:
        intersect = []
        b = set(bits)
        for i in bit_current:
            if i in b:
                intersect.append(i)
                if len(intersect) > 2:
                    break
        if len(intersect) > 2:
            break
    if len(intersect) <= 2:
        drug_strings.append(strings[k])
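Another option, not taken from the answers above, is to skip the index lists and sets entirely: treat each n-bit string as an integer and count the shared "on" bits with a bitwise AND. The sketch below keeps the same rule as the question's loop (each string is compared against all earlier strings). The names overlap and select_strings are hypothetical, and no speed-up has been measured.

```python
def overlap(mask_a, mask_b):
    """Number of positions where both bit masks have a 1."""
    return bin(mask_a & mask_b).count("1")


def select_strings(strings, max_overlap=2):
    """Keep each string that shares at most max_overlap 'on' bits
    with every string that comes before it in the input."""
    masks = [int(s, 2) for s in strings]  # '0101' -> 5
    selected = []
    for k, mask in enumerate(masks):
        # all() stops at the first earlier string that overlaps too much.
        if all(overlap(mask, previous) <= max_overlap
               for previous in masks[:k]):
            selected.append(strings[k])
    return selected


print(select_strings(["1111000", "1110100", "0001111", "1100011"]))
```

This is still quadratic in the number of strings, so it only reduces the constant factor per comparison; the random sampling described in the question's edit is what cuts the number of comparisons.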

Need to reduce the number of recursive calls in this function

The problem is: given a string S and an integer k < len(S), we need to find the highest string in dictionary order that can be obtained by removing any k characters while maintaining the relative order of the remaining characters.
This is what I have so far:
def allPossibleCombinations(k,s,strings):
    if k == 0:
        strings.append(s)
        return strings
    for i in range(len(s)):
        new_str = s[:i]+s[i+1:]
        strings = allPossibleCombinations(k-1, new_str, strings)
    return strings
def stringReduction(k, s):
    strings = []
    combs = allPossibleCombinations(k,s, strings)
    return sorted(combs)[-1]
This is working for a few test cases but it says that I have too many recursive calls for other testcases. I don't know the testcases.

This should get you started -
from itertools import combinations
def all_possible_combinations(k = 0, s = ""):
    yield from combinations(s, len(s) - k)
Now for a given k=2, and s="abcde", we show all combinations of s with k characters removed -
for c in all_possible_combinations(2, "abcde"):
    print("".join(c))
# abc
# abd
# abe
# acd
# ace
# ade
# bcd
# bce
# bde
# cde

it says that I have too many recursive calls for other testcases
I'm surprised that it failed on recursive calls before it failed on taking too long to come up with an answer. The recursion depth is the same as k, so k would have had to reach 1000 for default Python to choke on it. However, your code takes 4 minutes to solve what appears to be a simple example:
print(stringReduction(8, "dermosynovitis"))
The amount of time is a function of k and string length. The problem as I see it, recursively, is this code:
for i in range(len(s)):
    new_str = s[:i]+s[i+1:]
    strings = allPossibleCombinations(k-1, new_str, strings)
Once we've removed the first character say, and done all the combinations without it, there's nothing stopping the recursive call that drops out the second character from again removing the first character and trying all the combinations. We're (re)testing too many strings!
The basic problem is that you need to prune (i.e. avoid) strings as you test, rather than generate all possibilities and test them. If a candidate's first letter is less than that of the best string you've seen so far, no manipulation of the remaining characters in that candidate is going to improve it.
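One way to take the pruning idea to its limit is a single greedy pass with a stack. Note that this is a different technique from the recursive code in the question, not a patch to it. Whenever the next character is larger than the last character kept so far, and removals are still available, dropping that last character can only improve the result. The name string_reduction is hypothetical.

```python
def string_reduction(k, s):
    """Largest string (in dictionary order) obtainable from s by removing
    exactly k characters while keeping the remaining ones in order."""
    kept = []       # characters kept so far, used as a stack
    to_remove = k   # removals still available
    for ch in s:
        # A larger character should replace smaller ones kept before it.
        while to_remove and kept and kept[-1] < ch:
            kept.pop()
            to_remove -= 1
        kept.append(ch)
    # If removals are left over, the kept characters are non-increasing,
    # so the cheapest characters to drop are at the end.
    if to_remove:
        kept = kept[:len(kept) - to_remove]
    return "".join(kept)


print(string_reduction(2, "abcde"))
print(string_reduction(8, "dermosynovitis"))
```

Each character is pushed once and popped at most once, so the pass is linear in the length of the string and involves no recursion.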

need to decrease the run time of my program

I had a question where I had to find the contiguous substrings of a string whose first and last letters are the same. I tried doing it, but the runtime exceeded the time limit for several test cases. I tried using map in place of a for loop, but I have no idea what to do about the nested for loop. Can anyone please help me decrease the runtime of this program?
n = int(raw_input())
string = str(raw_input())
def get_substrings(string):
    length = len(string)
    list = []
    for i in range(length):
        for j in range(i,length):
            list.append(string[i:j + 1])
    return list
substrings = get_substrings(string)
contiguous = filter(lambda x: (x[0] == x[len(x) - 1]), substrings)
print len(contiguous)

If I understand the question properly (please let me know if that's not the case), try this:
I'm not sure this will speed up the runtime, but I believe this algorithm may, especially for longer strings, since it eliminates the nested loop. Iterate through the string once, storing the index (position) of each character in a data structure with constant-time lookup (a hashmap, or an array if set up properly). When finished, you should have a data structure storing all the locations of every character. Using this you can easily retrieve the substrings.
Example:
codingisfun
Take the letter i, for example. After doing what I said above, you look it up in the hashmap and see that it occurs at indices 3 and 6, meaning you can do something like substring(3, 6) to get it.
Not the best code, but it seems reasonable as a starting point. You may be able to eliminate a loop with some creative thinking:
import string
import itertools
my_string = 'helloilovetocode'
mappings = dict()
for index, char in enumerate(my_string):
    if not mappings.has_key(char):
        mappings[char] = list()
    mappings[char].append(index)
    print char
for char in mappings:
    if len(mappings[char]) > 1:
        for subset in itertools.combinations(mappings[char], 2):
            print my_string[subset[0]:(subset[1]+1)]

The problem is that your code is far too inefficient in terms of algorithmic complexity.
Here's an alternative (a cleaner but slightly slower version of soliman's I believe)
import collections
def index_str(s):
    """
    returns the indices characters show up at
    """
    indices = collections.defaultdict(list)
    for index, char in enumerate(s):
        indices[char].append(index)
    return indices
def get_substrings(s):
    indices = index_str(s)
    for key, index_lst in indices.items():
        num_indices = len(index_lst)
        for i in range(num_indices):
            for j in range(i, num_indices):
                yield s[index_lst[i]: index_lst[j] + 1]
The algorithmic problem with your solution is that you blindly check each possible substring, when you can easily determine the actual pairs in a single, linear-time pass. If you only want the count, that can be determined easily in O(MN) time for a string of length N with M unique characters (given the number of occurrences of a char, you can mathematically figure out how many substrings there are). Of course, in the worst case (all chars are the same), your code will have the same complexity as ours, but in the average case yours is much worse, since you have a nested for loop (n^2 time).
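To make the counting remark concrete: a character that occurs c times can start and end c * (c + 1) / 2 substrings (every pair of its positions, plus each single-character substring). A minimal sketch, with the hypothetical name count_same_ends:

```python
from collections import Counter


def count_same_ends(s):
    """Count contiguous substrings whose first and last characters match."""
    occurrences = Counter(s)  # character -> number of occurrences
    # c * (c - 1) / 2 pairs of positions, plus c single-character substrings.
    return sum(c * (c + 1) // 2 for c in occurrences.values())


print(count_same_ends("aba"))
```

For "aba" the qualifying substrings are "a", "b", "a" and "aba", so the count is 4. No substring is ever built, so the work is a single pass over the string plus one term per distinct character.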

Rosalind: overlap graphs

I have come across a problem on Rosalind that I think I have solved correctly, yet I get told my answer is incorrect. The problem can be found here: http://rosalind.info/problems/grph/
It's basic graph theory, more specifically it deals with returning an adjacency list of overlapping DNA strings.
"For a collection of strings and a positive integer k, the overlap graph for the strings is a directed graph O_k in which each string is represented by a node, and string s is connected to string t with a directed edge when there is a length k suffix of s that matches a length k prefix of t, as long as s≠t; we demand s≠t to prevent directed loops in the overlap graph (although directed cycles may be present).
Given: A collection of DNA strings in FASTA format having total length at most 10 kbp.
Return: The adjacency list corresponding to O_3. You may return edges in any order."
So, if you've got:
Rosalind_0498
AAATAAA
Rosalind_2391
AAATTTT
Rosalind_2323
TTTTCCC
Rosalind_0442
AAATCCC
Rosalind_5013
GGGTGGG
you must return:
Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323
My python code, after having parsed the FASTA file containing the DNA strings, is as follows:
listTitle = []
listContent = []
#SPLIT is the parsed list of DNA strings
#here i create two new lists, one (listTitle) containing the four numbers identifying a particular string, and the second (listContent) containing the actual strings ('>Rosalind_' has been removed, because it is what I split the file with)
i = 0
while i < len(SPLIT):
    curr = SPLIT[i]
    title = curr[0:4:1]
    listTitle.append(title)
    content = curr[4::1]
    listContent.append(content)
    i+=1
start = []
end = []
#now I create two new lists, one containing the first three chars of the string and the second containing the last three chars, a particular string's index will be the same in both lists, as well as in the title list
for item in listContent:
    start.append(item[0:3:1])
    end.append(item[len(item)-3:len(item):1])
list = []
#then I iterate through both lists, checking if the suffix and prefix are equal, but not originating from the same string, and append their titles to a last list
p=0
while p<len(end):
    iterator=0
    while iterator<len(start):
        if p!=iterator:
            if end[p] == start[iterator]:
                one=listTitle[p]
                two=listTitle[iterator]
                list.append(one)
                list.append(two)
        iterator+=1
    p+=1
#finally I print the list in the format that they require for the answer
listInc=0
while listInc < len(list):
    print "Rosalind_"+list[listInc]+' '+"Rosalind_"+list[listInc+1]
    listInc+=2
Where am I going wrong? Sorry that the code is a bit tedious, I have had very little training in python

I'm not sure what is wrong with your code, but here is an approach that might be considered more "pythonic".
I'll suppose that you've read your data into a dictionary mapping names to DNA strings:
{'Rosalind_0442': 'AAATCCC',
'Rosalind_0498': 'AAATAAA',
'Rosalind_2323': 'TTTTCCC',
'Rosalind_2391': 'AAATTTT',
'Rosalind_5013': 'GGGTGGG'}
We define a simple function that checks whether a string s1 has a k-suffix matching the k-prefix of a string s2:
def is_k_overlap(s1, s2, k):
    return s1[-k:] == s2[:k]
Then we look at all combinations of DNA sequences to find those that match. This is made easy by itertools.combinations:
import itertools
def k_edges(data, k):
    edges = []
    for u,v in itertools.combinations(data, 2):
        u_dna, v_dna = data[u], data[v]
        if is_k_overlap(u_dna, v_dna, k):
            edges.append((u,v))
        if is_k_overlap(v_dna, u_dna, k):
            edges.append((v,u))
    return edges
For example, on the data above we get:
>>> k_edges(data, 3)
[('Rosalind_2391', 'Rosalind_2323'),
('Rosalind_0498', 'Rosalind_2391'),
('Rosalind_0498', 'Rosalind_0442')]
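For completeness, here is a hedged sketch of how the dictionary assumed above could be built from FASTA text. The helper names parse_fasta and overlap_edges are hypothetical, and the input is assumed to use standard ">name" header lines. Stripping each line and joining multi-line sequences keeps stray newline characters out of the prefixes and suffixes; whether that is what went wrong in the question's code is only a guess.

```python
from itertools import permutations


def parse_fasta(text):
    """Map each record name to its sequence, with line breaks removed."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequences may span several lines
    return records


def overlap_edges(records, k=3):
    """Directed edges (u, v) where the k-suffix of u equals the k-prefix of v."""
    return [(u, v) for u, v in permutations(records, 2)
            if records[u][-k:] == records[v][:k]]


sample = """>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG"""

for u, v in overlap_edges(parse_fasta(sample)):
    print(u, v)
```

itertools.permutations yields every ordered pair of distinct names, so both directions are tested without the two separate checks used in k_edges.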
