How do I find what is similar in multiple strings in python?

How do I find what is similar in multiple strings in python? - python

I want to get what is similar in a few strings. For example, I have 6 strings:
HELLO3456
helf04g
hell0r
h31l0
I want to get what is similar in these strings, for example in this case I would like it to tell me something like:
h is always at the start
That example is pretty simple and I can figure that out in my head but with something like:
61TvA2dNwxNxmWziZxKzR5aO9tFD00Nj
pHHlgpFt8Ka3Stb5UlTxcaEwciOeF2QM
fW9K4luEx65RscfUiPDakiqp15jiK5f6
17xz7MYEBoXLPoi8RdqbgkPwTV2T2H0y
Jvt0B5uZIDPJ5pbCqMo12CqD7pdnMSEd
n7voYT0TVVzZGVSLaQNRnnkkWgVqxA3b
it's not that easy. I have seen and tried:
Find the similarity metric between two strings
how to find similarity between many strings and plot it
to name a few but they're all not what I'm looking for. They give a value of how similar they are and I need to know what is similar in them.
I want to know if this is even possible and if so, how I can do it. Thank you in advance.

Minimal Solution
You are on the correct soltuion path with the difflib Library. I just picked the first two examples from your question to create a minimal Solution.
from difflib import SequenceMatcher
a = "61TvA2dNwxNxmWziZxKzR5aO9tFD00Nj"
b = "pHHlgpFt8Ka3Stb5UlTxcaEwciOeF2QM"
Sequencer = SequenceMatcher(None, a, b)
print(Sequencer.ratio())
matches = Sequencer.get_matching_blocks()
print(matches)
for match in matches:
idx_a = match.a
idx_b = match.b
if not (idx_a == len(a) or idx_b == len(b)):
print(30*'-' + 'Found Match' + 30*'-')
print('found at idx {} of str "a" and at idx {} of str "b" the value {}'.format(idx_a, idx_b, a[idx_a]))
Output:
0.0625
[Match(a=2, b=18, size=1), Match(a=5, b=29, size=1), Match(a=32, b=32, size=0)]
------------------------------Found Match------------------------------
found at idx 2 of str "a" and at idx 18 of str "b" the value T
------------------------------Found Match------------------------------
found at idx 5 of str "a" and at idx 29 of str "b" the value 2
Explanation
I just used the ratio() to see if any similarity is existing. The function get_matching_blocks() returns a list with all matches in your string sequence. My minimal Solution doesn't care for same position, but this should be an easy fix with checking the indices. In the Situation that the return value of ratio() is rqual to 0.0 the matcher does not generate an empty list. The list contains always a match for the end of Sequence. I worked around with checking against length of the sequence with the matching idices. Another solution is to use only matches with a size > 0, as shown below:
if match.size > 0:
...
My Example also doesn't handle matches with size > 1. I think you will figure out to handle this problem ;)

I think this should be your desire solution. I have added "a" at the start of every string because otherwise there is no similarity in the strings you mentioned.
lst = ["A61TvA2dNwxNxmWziZxKzR5aO9tFD00Nj","apHHlgpFt8Ka3Stb5UlTxcaEwciOeF2QM","afW9K4luEx65RscfUiPDakiqp15jiK5f6","a17xz7MYEBoXLPoi8RdqbgkPwTV2T2H0y", "aJvt0B5uZIDPJ5pbCqMo12CqD7pdnMSEd","an7voYT0TVVzZGVSLaQNRnnkkWgVqxA3b"]
total_strings = len(lst)
string_length = len(lst[0])
for i in range(total_strings):
lst[i] = lst[i].lower()
for i in range(string_length):
flag = 0
lst_char = lst[total_strings-1][i]
for j in range(total_strings-1):
if lst[j][i] == lst_char:
flag = 1
continue
else:
flag = 0
break
if flag == 1:
print(lst[total_strings-1][i]+" is always at position "+str(i))

Related

The longest prefix that is also suffix of two lists

So I have two lists:
def function(w,w2): # => this is how I want to define my function (no more inputs than this 2 lists)
I want to know the biggest prefix of w which is also suffix of w2.
How can I do this only with logic (without importing anything)

I can try and help get you started on this problem, but it sort of sounds like a homework question so I won't give you a complete answer (per these guidelines).
If I were you I'd start with a small case and build up from there. Lets start with:
w = "ab"
w2 = "ba"
The function for this might look like:
def function(w,w2):
prefix = ""
# Does the first letter of w equal the last letter of w2?
if w[0] == w2[-1]:
prefix += w[0]
# What about the second letter?
if w[1] == w2[-2]:
prefix += w[1]
return prefix
Then when you run print(function(w,w2)) you get ab.
This code should work for 2 letter words, but what if the words are longer? This is when we would introduce a loop.
def function(w,w2):
prefix = ""
for i in range(0, len(w)):
if w[i] == w2[(i+1)*-1]:
prefix+= w[i]
else:
return prefix
return prefix
Hopefully this code will offer a good starting place for you! One issue with what I have written is what if w2 is shorter than w. Then you will get an index error! There are a few ways to solve this, but one way is to make sure that w is always the shorter word. Best of luck, and feel free to DM me if you have other questions.

A simple iterative approach could be:
Start from the longest possible prefix (i.e. all of w), and test it against a w2 suffix of the same length.
If they match, you can return it immediately, since it must be the longest possible match.
If they don't match, shorten it by one, and repeat.
If you never find a match, the answer is an empty string.
In code, this looks like:
>>> def function(w, w2):
... for i in range(len(w), 0, -1):
... if w[:i] == w2[-i:]:
... return w[:i]
... return ''
...
>>> function("asdfasdf", "qwertyasdf")
'asdf'
The slice operator (w[:i] for a prefix of length i, w2[-i:] for a suffix of length i) gracefully handles mismatched lengths by just giving you a shorter string if i is out of the range of the given string (which means they won't match, so the iteration is forced to continue until the lengths do match).
>>> function("aaaaaba", "ba")
'a'
>>> function("a", "abbbaababaa")
'a'

Given 2 strings, return number of positions where the two strings contain the same length 2 substring

here is my code:
def string_match(a, b):
count = 0
if len(a) < 2 or len(b) < 2:
return 0
for i in range(len(a)):
if a[i:i+2] == b[i:i+2]:
count = count + 1
return count
And here are the results:
Correct me if I am wrong but, I see that it didn't work probably because the two string lengths are the same. If I were to change the for loop statement to:
for i in range(len(a)-1):
then it would work for all cases provided. But can someone explain to me why adding the -1 makes it work? Perhaps I'm comprehending how the for loop works in this case. And can someone tell me a more optimal way to write this because this is probably really bad code. Thank you!

But can someone explain to me why adding the -1 makes it work?
Observe:
test = 'food'
i = len(test) - 1
test[i:i+2] # produces 'd'
Using len(a) as your bound means that len(a) - 1 will be used as an i value, and therefore a slice is taken at the end of a that would extend past the end. In Python, such slices succeed, but produce fewer characters.

String slicing can return strings that are shorter than requested. In your first failing example that checks "abc" against "abc", in the third iteration of the for loop, both a[i:i+2] and b[i:i+2] are equal to "c", and therefore count is incremented.
Using range(len(a)-1) ensures that your loop stops before it gets to a slice that would be just one letter long.

Since the strings may be of different lengths, you want to iterate only up to the end of the shortest one. In addition, you're accessing i+2, so you only want i to iterate up to the index before the last item (otherwise you might get a false positive at the end of the string by going off the end and getting a single-character string).
def string_match(a: str, b: str) -> int:
return len([
a[i:i+2]
for i in range(min(len(a), len(b)) - 1)
if a[i:i+2] == b[i:i+2]
])
(You could also do this counting with a sum, but this makes it easy to get the actual matches as well!)

You can use this :
def string_match(a, b):
if len(a) < 2 or len(b) < 0:
return 0
subs = [a[i:i+2] for i in range(len(a)-1)]
occurence = list(map(lambda x: x in b, subs))
return occurence.count(True)

determine if list is periodic python

I am curious to find out a function to check if a given list is periodic or not and return the periodic elements. lists are not loaded rather their elements are generated and added on the fly, if this note will make the algorithm easier anyhow.
For example, if the input to the function is [1,2,1,2,1,2,1,2], the output shall be (1,2).
I am looking for some tips and hints on the easier methods to achieve this.
Thanks in advance,

This problem can be solved with the Knuth-Morris-Pratt algorithm for string matching. Please get familiar with the way the fail-links are calculated before you proceed.
Lets consider the list as something like a sequence of values (like a String). Let the size of the list/sequence is n.
Then, you can:
Find the length of the longest proper prefix of your list which is also a suffix. Let the length of the longest proper prefix suffix be len.
If n is divisible by n - len, then the list is periodic and the period is of size len. In this case you can print the first len values.
More info:
GeeksForGeeks article.
Knuth-Morris-Pratt algorithm

NOTE: the original question had python and python-3.x tags, they were edited not by OP, that's why my answer is in python.
I use itertools.cycle and zip to determine if the list is k-periodic for a given k, then just iterate all possible k values (up to half the length of the list).
try this:
from itertools import cycle
def is_k_periodic(lst, k):
if len(lst) < k // 2: # we want the returned part to repaet at least twice... otherwise every list is periodic (1 period of its full self)
return False
return all(x == y for x, y in zip(lst, cycle(lst[:k])))
def is_periodic(lst):
for k in range(1, (len(lst) // 2) + 1):
if is_k_periodic(lst, k):
return tuple(lst[:k])
return None
print(is_periodic([1, 2, 1, 2, 1, 2, 1, 2]))
Output:
(1, 2)

Thank you all for answering my question. Neverthelss, I came up with an implementation that suits my needs.
I will share it here with you looking forward your inputs to optimize it for better performance.
The algorithm is:
assume the input list is periodic.
initialize a pattern list.
go over the list up to its half, for each element i in this first half:
add the element to the pattern list.
check if the pattern is matched throughout the list.
if it matches, declare success and return the pattern list.
else break and start the loop again adding the next element to the pattern list.
If a pattern list is found, check the last k elements of the list where k is len(list) - len(list) modulo the length of the pattern list, if so, return the pattern list, else declare failure.
The code in python:
def check_pattern(nums):
p = []
i = 0
pattern = True
while i < len(nums)//2:
p.append(nums[i])
for j in range(0, len(nums)-(len(nums) % len(p)), len(p)):
if nums[j:j+len(p)] != p:
pattern = False
break
else:
pattern = True
# print(nums[-(len(nums) % len(p)):], p[:(len(nums) % len(p))])
if pattern and nums[-(len(nums) % len(p)) if (len(nums) % len(p)) > 0 else -len(p):] ==\
p[:(len(nums) % len(p)) if (len(nums) % len(p)) > 0 else len(p)]:
return p
i += 1
return 0
This algorithm might be inefficient in terms of performance but it checks the list even if the last elements did not form a complete period.
Any hints or suggestions are highly appreciated.
Thanks in advance,,,

Let L the list. Classic method is: use your favorite algorithm to search the second occurence of the sublist L in the list L+L. If the list is present at index k, then the period is L[:k]:
L L
1 2 1 2 1 2 1 2 | 1 2 1 2 1 2 1 2
1 2 1 2 1 2 1 2
(This is conceptually identical to #KonstantinYovkov's answer). In Python: example with strings (because Python has no builtin sublist search method):
>>> L = "12121212"
>>> k = (L+L).find(L, 1) # skip the first occurrence
>>> L[:k]
'12'
But:
>>> L = "12121"
>>> k = (L+L).find(L, 1)
>>> L[:k] # k is None => return the whole list
'12121'

Finding regular expression with at least one repetition of each letter

From any *.fasta DNA sequence (only 'ACTG' characters) I must find all sequences which contain at least one repetition of each letter.
For examle from sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iteration on each letter).
I have no clue how to do that in pyhton 2.7. I was trying with regular expressions but it was not searching for every variants.
How can I achive that?

You could find all substrings of length 4+, and then down select from those to find only the shortest possible combinations that contain one of each letter:
s = 'AAGTCCTAG'
def get_shortest(s):
l, b = len(s), set('ATCG')
options = [s[i:j+1] for i in range(l) for j in range(i,l) if (j+1)-i > 3]
return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]
print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']

This is another way you can do it. Maybe not as fast and nice as chrisz answere. But maybe a little simpler to read and understand for beginners.
DNA='AAGTCCTAG'
toSave=[]
for i in range(len(DNA)):
letters=['A','G','T','C']
j=i
seq=[]
while len(letters)>0 and j<(len(DNA)):
seq.append(DNA[j])
try:
letters.remove(DNA[j])
except:
pass
j+=1
if len(letters)==0:
toSave.append(seq)
print(toSave)

Since the substring you are looking for may be of about any length, a LIFO queue seems to work. Append each letter at a time, check if there are at least one of each letters. If found return it. Then remove letters at the front and keep checking until no longer valid.
def find_agtc_seq(seq_in):
chars = 'AGTC'
cur_str = []
for ch in seq_in:
cur_str.append(ch)
while all(map(cur_str.count,chars)):
yield("".join(cur_str))
cur_str.pop(0)
seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG

I really wanted to create a short answer for this, so this is what I came up with!
See code in use here
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)
while c <= len(s):
x,c = s[:c],c+1
if all(l in x for l in d):
print(x)
s,c = s[1:],len(d)
It works as follows:
c is set to the length of the string of characters we are ensuring exist in the string (d = ACGT)
The while loop iterates over each possible substring of s such that c is smaller than the length of s.
This works by increasing c by 1 upon each iteration of the while loop.
If every character in our string d (ACGT) exist in the substring, we print the result, reset c to its default value and slice the string by 1 character from the start.
The loop continues until the string s is shorter than d
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead (see code in use here):
s = 'AAGTCCTAG'
d = 'ACGT'
c,r = len(d),[]
while c <= len(s):
x,c = s[:c],c+1
if all(l in x for l in d):
r.append(x)
s,c = s[1:],len(d)
print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']

If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find repeated sequences.
from itertools import groupby
import numpy as np
def find_repeats(input_list, n_repeats):
flagged_items = []
for item in input_list:
# Create itertools.groupby object
groups = groupby(str(item))
# Create list of tuples: (digit, number of repeats)
result = [(label, sum(1 for _ in group)) for label, group in groups]
# Extract just number of repeats
char_lens = np.array([x[1] for x in result])
# Append to flagged items
if any(char_lens >= n_repeats):
flagged_items.append(item)
# Return flagged items
return flagged_items
#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2) # Returns ['aatcg', 'ctagg']

Realizing if there is a pattern in a string (Does not need to start at index 0, could be any length)

Coding a program to detect a n-length pattern in a string, even without knowing where the pattern starts, could be easily done by creating a list of n-length substrings and check if starting at one point there are same items or the rest of the list. Without any piece of information other than the string to check through, is the only way to recognize the pattern is to brute-force through all lengths and check or is there a more efficient algorithm?
(I'm just a beginner in Python, so this may be easy to code... )
Current code that only suits checking for starting at index 0:
def search(s):
match=s[0]+s[1]
while (match != s) and (match[0] != match[-1]):
for matchLen in range(len(match),len(s)-1):
letter = s[matchLen]
if letter == match[-1]:
match += s[len(match)]
break
if match == s:
return None
else:
return match[:-1]

You can use re.findall(r'(.{2,})\1+', string). The parentheses creates a capture group that is later backreferenced by \1. The . matches any character (except for line breaks). The {2,} requires the pattern to be at least two characters long (otherwise strings like ss would be considered a pattern). Finally the + requires that pattern to repeat 1 or more times (in addition to the first time that it occurred inside the capture group). You can see it working in action.

Pattern is a far too vague term, but assuming you mean some string repeating itself, the regexp (?P<pat>.+)(?P=pat) will work.

Given a string what you could do is -
You start with length = 1, and take two pointer variables i and j which you shall use to traverse the string.
Set i = 0 and j = i+length
if str[i]==str[j]:
i++,j++ // till j not equal to length of string
else:
length = length + 1
//increase length by 1 and start the algorithm over from i = 0
Take the example abcdeabcde :
In this we see
Initially i = 0, j = 1 ,
but str[0]!=str[1] i.e. a!=b,
Then we get length = 2 i.e., i = 0,j = 2
but str[0]!=str[2] i.e. a!=c,
Continuing in the same fashion,
We see when length = 5 and i = 0 and j = 5,
str[0]==str[5]
and thus you can see that i and j increment till j is equal to string length.
And you have your answer that is the pattern length. It may not seem obvious but i would suggest you dry-run this algorithm over some of your test cases and let me know the results.

You can use re.findall() to find all matches:
import re
s = "somethingabcdeabcdeabcdeabcdeabcdeelseabcdeabcdeabcde"
li = re.findall(r'abcde',s)
print(li)
Output:
['abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde']

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How do I find what is similar in multiple strings in python? - python

Related

The longest prefix that is also suffix of two lists

Given 2 strings, return number of positions where the two strings contain the same length 2 substring

determine if list is periodic python

Finding regular expression with at least one repetition of each letter

Realizing if there is a pattern in a string (Does not need to start at index 0, could be any length)

Categories

Resources