How to generate a set of similar strings in Python

I am wondering how to generate a set of similar strings based on Levenshtein distance (string edit distance). Ideally, I would like to pass in a source string (i.e. the string from which the similar strings are generated), the number of strings to generate, and a threshold, such that the similarity between the strings in the generated set is greater than the threshold. Which Python package(s) should I use to achieve that? Or any ideas on how to implement this?

I think you can approach the problem from the other direction (in reverse).
Given a string, say sittin.
Given a threshold (edit distance), say k.
Then you apply combinations of different "edits" in up to k steps.
For example, let's say k = 2, and assume the allowed edit modes are:
delete one character
add one character
substitute one character with another one.
Then the logic is something like the pseudocode below:
input = 'sittin'
for num in 1 ... n:  # suppose you want to have n strings generated
    my_input_ = input
    # suppose the edit distance should be smaller than or equal to k,
    # but greater than or equal to one
    for i in 1 ... randint(1, k):
        pick a random edit mode from (delete, add, substitute)
        do it! and update my_input_
If you need to stick with a pre-defined dictionary, that adds some complexity but it is still doable. In this case, every edit must produce a valid word.
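The pseudocode above can be turned into a short Python sketch. The names `random_edit` and `similar_strings`, the lowercase default alphabet, and the `seed` parameter are assumptions made for illustration; they are not part of the original answer. Note that a substitution may pick the same character by chance and that duplicates are possible, so this is a starting point rather than a finished generator.

```python
import random
import string


def random_edit(text, alphabet, rng):
    """Apply one random delete, add, or substitute edit to text."""
    mode = rng.choice(("delete", "add", "substitute"))
    if mode == "add" or not text:
        # An empty string can only grow, so fall back to "add".
        pos = rng.randint(0, len(text))
        return text[:pos] + rng.choice(alphabet) + text[pos:]
    pos = rng.randrange(len(text))
    if mode == "delete":
        return text[:pos] + text[pos + 1:]
    # Substitute; the new character may equal the old one by chance.
    return text[:pos] + rng.choice(alphabet) + text[pos + 1:]


def similar_strings(source, n, k, alphabet=string.ascii_lowercase, seed=None):
    """Return n strings, each produced by 1..k random edits of source."""
    rng = random.Random(seed)
    results = []
    for _ in range(n):
        candidate = source
        for _ in range(rng.randint(1, k)):
            candidate = random_edit(candidate, alphabet, rng)
        results.append(candidate)
    return results


print(similar_strings("sittin", n=5, k=2, seed=0))
```

Because each result is at most k edits away from the source, its Levenshtein distance to the source is at most k, although it can be smaller when edits cancel out.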

Borrowing heavily from the pseudocode in @greeness's answer, I thought I would include the code I used to do this for DNA sequences.
This may not be your exact use case but I think it should be easily adaptable.
import random
dna = set(["A", "C", "G", "T"])
class Sequence(str):
    def mutate(self, d, n):
        # collect n distinct strings (self included), each within d edits of self
        mutants = set([self])
        while len(mutants) < n:
            k = random.randint(1, d)
            for _ in range(k):
                mutant_type = random.choice(["d", "s", "i"])
                if mutant_type == "i":
                    mutants.add(self.insertion(k))
                elif mutant_type == "d":
                    mutants.add(self.deletion(k))
                elif mutant_type == "s":
                    mutants.add(self.substitute(k))
                # stop as soon as we have enough, so exactly n strings are returned
                if len(mutants) >= n:
                    break
        return list(mutants)
    def deletion(self, n):
        # delete n randomly chosen characters
        if n >= len(self):
            return ""
        chars = list(self)
        i = 0
        while i < n:
            idx = random.choice(range(len(chars)))
            del chars[idx]
            i += 1
        return "".join(chars)
    def insertion(self, n):
        # insert n random bases at random positions (including the end)
        chars = list(self)
        i = 0
        while i < n:
            idx = random.randint(0, len(chars))
            new_base = random.choice(list(dna))
            chars.insert(idx, new_base)
            i += 1
        return "".join(chars)
    def substitute(self, n):
        # replace up to n distinct positions with a different base
        idxs = random.sample(range(len(self)), min(n, len(self)))
        chars = list(self)
        for i in idxs:
            new_base = random.choice(list(dna.difference(chars[i])))
            chars[i] = new_base
        return "".join(chars)
To use this, you can do the following:
s = Sequence("AAAAA")
d = 2  # max edit distance
n = 5  # number of strings in result
s.mutate(d, n)
# e.g. ['AAA', 'GACAAAA', 'AAAAA', 'CAGAA', 'AACAAAA'] (the result is random)
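Since the question is phrased in terms of Levenshtein distance, it can help to check the generated strings against the threshold directly. Below is a minimal sketch of the standard dynamic-programming edit distance; the helper names `levenshtein` and `within_distance` are illustrative assumptions and do not come from either answer.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def within_distance(source, candidates, d):
    """Keep only the candidates whose edit distance to source is at most d."""
    return [c for c in candidates if levenshtein(source, c) <= d]


sample = ['AAA', 'GACAAAA', 'AAAAA', 'CAGAA', 'AACAAAA']
print(within_distance("AAAAA", sample, 2))  # all five strings pass
```

Filtering like this is also a simple way to enforce the threshold if the generator is later changed to chain different edit types.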

Related

How can I count sequences that meet these constraints?

I am trying to count permutations of a sequence of I and O symbols, representing e.g. people entering (I for "in") and leaving (O for "out") a room. For a given number n of I symbols, there should be exactly as many O symbols, giving a total length of 2*n for the sequence. Also, at any point in a valid permutation, the number of O symbols must be less than or equal to the number of I symbols (since it is not possible for someone to leave the room when it is empty).
Additionally, I have some initial prefix of I and O symbols, representing people who previously entered or left the room. The output should only count sequences starting with that prefix.
For example, for n=1 and an initial state of '', the result should be 1 since the only valid sequence is IO; for n=3 and an initial state of II, the possible permutations are
IIIOOO
IIOIOO
IIOOIO
for a result of 3. (There are five ways for three people to enter and leave the room, but the other two involve the first person leaving immediately.)
I'm guessing the simplest way to solve this is using itertools.permutations. This is my code so far:
import sys
from itertools import permutations
n = int(input())  # actual length will be 2*n
string = input()
I_COUNT = string.count("I")
O_COUNT = string.count("O")
if string[0] != "I":
    sys.exit()
if O_COUNT > I_COUNT:
    sys.exit()
perms = [''.join(p) for p in permutations(string)]
print(perms)
The goal is to generate the permutations of whatever remains of the string and append them to the user's input. How can I append the remaining part of the string to the user's input and get the count of valid permutations?

from functools import cache
@cache
def count_permutations(ins: int, outs: int):
    # ins and outs are the remaining number of ins and outs to process
    assert outs >= ins
    if ins == 0:
        # Can do nothing but output "outs"
        return 1
    elif outs == ins:
        # Your next output needs to be an I else you become unbalanced
        return count_permutations(ins - 1, outs)
    else:
        # Your next output can either be an I or an O
        return count_permutations(ins - 1, outs) + count_permutations(ins, outs - 1)
If, say, you have a total of 5 Is and 5 Os, and you've already output one I, then you want: count_permutations(4, 5).
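To connect this to the prefix from the question, one possible wrapper converts (n, prefix) into the remaining counts. This is a sketch: `count_with_prefix` is a hypothetical helper name, the recurrence is repeated so the block runs on its own, and any symbol other than I is assumed to be an O.

```python
from functools import cache


@cache
def count_permutations(ins, outs):
    # Same recurrence as above: ins/outs are the symbols still to be written.
    if ins == 0:
        return 1
    if outs == ins:
        return count_permutations(ins - 1, outs)
    return count_permutations(ins - 1, outs) + count_permutations(ins, outs - 1)


def count_with_prefix(n, prefix):
    """Count valid length-2n sequences that start with prefix (hypothetical helper)."""
    in_room = 0
    for symbol in prefix:
        in_room += 1 if symbol == "I" else -1
        if in_room < 0:
            return 0  # someone left an empty room: the prefix is invalid
    ins_left = n - prefix.count("I")
    outs_left = n - prefix.count("O")
    if ins_left < 0:
        return 0  # more than n people entered
    return count_permutations(ins_left, outs_left)


print(count_with_prefix(3, "II"))  # 3, matching the example in the question
```

Returning 0 for an invalid prefix is a design choice; raising an exception would be equally reasonable.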

I'm guessing the simplest way to solve this is using itertools.permutations
Sadly, this will not be very helpful. The problem is that itertools.permutations does not care about the value of the elements it's permuting; it treats them as all distinct regardless. So if you have 6 input elements, and ask for length-6 permutations, you will get 720 results, even if all the inputs are the same.
itertools.combinations has the opposite issue; it doesn't distinguish any elements. When it selects some elements, it only puts those elements in the order they initially appeared. So if you have 6 input elements and ask for length-6 combinations, you will get 1 result - the original sequence.
Presumably what you wanted to do is generate all the distinct ways of arranging the Is and Os, then take out the invalid ones, then count what remains. This is possible, and the itertools library can help with the first step, but it is not straightforward.
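For illustration, here is one way the generate-then-filter idea could look. It assumes the distinct arrangements are produced by choosing which positions hold an I via itertools.combinations; the function names are made up for this sketch, and the approach is only practical for small n.

```python
from itertools import combinations


def arrangements(n):
    """Yield every distinct string with n 'I' symbols and n 'O' symbols."""
    for in_positions in combinations(range(2 * n), n):
        chosen = set(in_positions)
        yield "".join("I" if i in chosen else "O" for i in range(2 * n))


def is_valid(seq):
    """True if nobody ever leaves an empty room and the room ends empty."""
    occupancy = 0
    for symbol in seq:
        occupancy += 1 if symbol == "I" else -1
        if occupancy < 0:
            return False
    return occupancy == 0


def count_valid(n, prefix=""):
    """Brute-force count of valid sequences that start with prefix."""
    return sum(1 for seq in arrangements(n)
               if seq.startswith(prefix) and is_valid(seq))


print(count_valid(3, "II"))  # 3
print(count_valid(3))        # 5
```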
It will be simpler to use a recursive algorithm directly. The general approach is as follows:
At any given time, we care about how many people are in the room and how many people must still enter. To handle the prefix, we simply count how many people are in the room right now, and subtract that from the total number of people in order to determine how many must still enter. I leave the input handling as an exercise.
To determine that count, we count up the ways that involve the next action being I (someone comes in), plus the ways that involve the next action being O (someone leaves).
If everyone has entered, there is only one way forward: everyone must leave, one at a time. This is a base case.
Otherwise, it is definitely possible for someone to come in. We recursively count the ways for everyone else to enter after that; in the recursive call, there is one more person in the room, and one fewer person who must still enter.
If there are still people who have to enter, and there is also someone in the room right now, then it is also possible for someone to leave first. We recursively count the ways for others to enter after that; in the recursive call, there is one fewer person in the room, and the same number who must still enter.
This translates into code fairly directly:
def ways_to_enter(currently_in, waiting):
    if waiting == 0:
        return 1
    result = ways_to_enter(currently_in + 1, waiting - 1)
    if currently_in > 0:
        result += ways_to_enter(currently_in - 1, waiting)
    return result
Some testing:
>>> ways_to_enter(0, 1) # n = 1, prefix = ''
1
>>> ways_to_enter(2, 1) # n = 3, prefix = 'II'; OR e.g. n = 4, prefix = 'IIOI'
3
>>> ways_to_enter(0, 3) # n = 3, prefix = ''
5
>>> ways_to_enter(0, 14) # takes less than a second on my machine
2674440
We can improve the performance for larger values by decorating the function with functools.cache (lru_cache prior to 3.9), which will memoize results of the previous recursive calls. The more purpose-built approach is to use dynamic programming techniques: in this case, we would initialize 2-dimensional storage for the results of ways_to_enter(x, y), and compute those values one at a time, in such a way that the values needed for the "recursive calls" have already been done earlier in the process.
That direct approach would look something like:
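def ways_to_enter(currently_in, waiting):
    # initialize storage: row w reads column c + 1 of row w - 1, so we need
    # a column for every occupancy from 0 up to currently_in + waiting
    max_in = currently_in + waiting
    results = [[0] * (max_in + 1) for _ in range(waiting + 1)]
    # We will iterate with `waiting` as the major axis.
    for w in range(waiting + 1):
        # with w people still waiting, at most max_in - w can be in the room
        for c in range(max_in - w + 1):
            if w == 0:
                value = 1
            else:
                value = results[w - 1][c + 1]
                if c > 0:
                    value += results[w][c - 1]
            results[w][c] = value
    return results[waiting][currently_in]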
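For comparison, the memoized variant mentioned earlier only needs a decorator on the recursive function. This is a minimal sketch that assumes Python 3.9+ for functools.cache; on older versions, functools.lru_cache(maxsize=None) plays the same role.

```python
from functools import cache


@cache
def ways_to_enter(currently_in, waiting):
    # Identical recurrence; the decorator stores each (currently_in, waiting)
    # result so it is computed only once.
    if waiting == 0:
        return 1
    result = ways_to_enter(currently_in + 1, waiting - 1)
    if currently_in > 0:
        result += ways_to_enter(currently_in - 1, waiting)
    return result


print(ways_to_enter(0, 14))  # 2674440
```

The recursion depth still grows with the number of people, so very large inputs may hit Python's recursion limit, which is one reason to prefer the table-based version.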

The product() function from itertools will allow you to generate all the possible sequences of 'I' and 'O' for a given length.
From that list, you can filter by the sequences that start with the user-supplied start_seq.
From that list, you can filter by the sequences that are valid, given your rules of the number and order of the 'I's and 'O's:
from itertools import product
def is_valid(seq):
    '''Evaluates a sequence of I's and O's following the rules that:
    - there cannot be more outs than ins
    - the ins and outs must be balanced
    '''
    _in, _out = 0, 0
    for x in seq:
        if x == 'I':
            _in += 1
        else:
            _out += 1
        if (_out > _in) or (_in > len(seq)/2):
            return False
    return True
# User inputs...
start_seq = 'II'
assert start_seq[0] != 'O', 'Starting sequence cannot start with an OUT.'
n = 3
total_len = n*2
assert len(start_seq) < total_len, 'Starting sequence is at least as big as total number, nothing to iterate.'
# Calculate all possible sequences that are total_len long, as tuples of 'I' and 'O'
seq_tuples = product('IO', repeat=total_len)
# Convert tuples to strings, e.g., `('I', 'O', 'I')` to `'IOI'`
sequences = [''.join(seq_tpl) for seq_tpl in seq_tuples]
# Filter for sequences that start correctly
sequences = [seq for seq in sequences if seq.startswith(start_seq)]
# Filter for valid sequences
sequences = [seq for seq in sequences if is_valid(seq)]
print(sequences)
and I get:
['IIIOOO', 'IIOIOO', 'IIOOIO']

Not very elegant perhaps but this certainly seems to fulfil the brief:
from itertools import permutations
def isvalid(start, p):
    # the candidate must begin with the given prefix
    for c1, c2 in zip(start, p):
        if c1 != c2:
            return 0
    # the running occupancy must never go negative
    n = 0
    for c in p:
        if c == 'O':
            if (n := n - 1) < 0:
                return 0
        else:
            n += 1
    return 1
def calc(n, i):
    # pad the prefix with the remaining Is and Os, then test every distinct ordering
    s = i + 'I' * (n - i.count('I'))
    s += 'O' * (n * 2 - len(s))
    return sum(isvalid(i, p) for p in set(permutations(s)))
print(calc(3, 'II'))
print(calc(3, 'IO'))
print(calc(3, 'I'))
print(calc(3, ''))
Output:
3
2
5
5

def solve(string, n):
    countI = string.count('I')
    if countI == n:
        # everyone has entered; only Os can follow
        return 1
    countO = string.count('O')
    if countO > countI:
        # someone left an empty room
        return 0
    k = solve(string + 'O', n)
    h = solve(string + 'I', n)
    return k + h
n = int(input())
string = input()
print(solve(string, n))

This is a dynamic programming problem.
Given the number of in and out operations remaining, we do one of the following:
If we're out of either ins or outs, we can only use operations of the other type. There is only one possible assignment.
If we have an equal number of ins and outs remaining, we must use an in operation according to the constraints of the problem.
Finally, if more outs than ins remain, we can perform either operation. The answer, then, is the sum of the number of sequences if we choose to use an in operation plus the number of sequences if we choose to use an out operation.
This runs in O(n^2) time, although in practice the following code snippet can be made faster using a 2D-list rather than the cache decorator (I've used @cache in this case to make the recurrence easier to understand).
from functools import cache
@cache
def find_permutation_count(in_remaining, out_remaining):
    if in_remaining == 0 or out_remaining == 0:
        return 1
    elif in_remaining == out_remaining:
        return find_permutation_count(in_remaining - 1, out_remaining)
    else:
        return find_permutation_count(in_remaining - 1, out_remaining) + find_permutation_count(in_remaining, out_remaining - 1)
print(find_permutation_count(3, 3)) # prints 5
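The 2D-list variant mentioned above might look like the following sketch. The function name `find_permutation_count_table` is an assumption for illustration, and the code assumes out_remaining >= in_remaining, as in any state the recurrence can reach.

```python
def find_permutation_count_table(in_remaining, out_remaining):
    # table[i][o] = number of valid completions with i ins and o outs left.
    table = [[0] * (out_remaining + 1) for _ in range(in_remaining + 1)]
    for i in range(in_remaining + 1):
        # Only states with o >= i are reachable, so skip the rest.
        for o in range(i, out_remaining + 1):
            if i == 0:
                table[i][o] = 1                  # only outs remain
            elif i == o:
                table[i][o] = table[i - 1][o]    # room is empty: must use an in
            else:
                table[i][o] = table[i - 1][o] + table[i][o - 1]
    return table[in_remaining][out_remaining]


print(find_permutation_count_table(3, 3))  # 5
print(find_permutation_count_table(4, 5))  # 42
```

Filling the table row by row guarantees that both cells a state depends on have already been computed, and it avoids deep recursion for large inputs.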

The number of such permutations of length 2n is given by the n'th Catalan number. Wikipedia gives a formula for Catalan numbers in terms of central binomial coefficients:
from math import comb
def count_permutations(n):
    return comb(2*n, n) // (n+1)
for i in range(1, 10):
    print(i, count_permutations(i))
# 1 1
# 2 2
# 3 5
# 4 14
# 5 42
# 6 132
# 7 429
# 8 1430
# 9 4862

Python recursion to split string by sliding window

Recently, I faced an interesting coding task that involves splitting a string into all possible lists of substrings, given a limit k on the number of parts.
For example:
s = "iamfoobar"
k = 4 # the max number of the items on a list after the split
The string s can be split into the following combinations:
[
["i", "a", "m", "foobar"],
["ia", "m", "f", "oobar"],
["iam", "f", "o", "obar"]
# etc
]
I tried to figure out how to do that with a quick recursive function, but I cannot get it to work.
I have tried this, but it didn't seem to work:
def sliding(s, k):
    if len(s) < k:
        return []
    else:
        for i in range(0, k):
            return [s[i:i+1]] + sliding(s[i+1:len(s) - i], k)
print(sliding("iamfoobar", 4))
And I only got this:
['i', 'a', 'm', 'f', 'o', 'o']

Your first main problem is that although you use a loop, you return on its first iteration, so you only ever produce a single list. No matter how much you fix everything around it, your output will never match what you expect, because it will always be a single list.
Second, on the recursive call you start with s[i:i+1] but according to your example you want all prefixes, so something like s[:i] is more suitable.
Additionally, in the recursive call you never reduce k, which is the natural recursive step.
Lastly, your stop condition seems wrong also. As above, if the natural step is reducing k, the natural stop would be if k == 1 then return [[s]]. This is because the only way to split the string to 1 part is the string itself...
The important thing is to keep in mind your final output format and think how that can work in your step. In this case you want to return a list of all possible permutations as lists. So in case of k == 1, you simply return a list of a single list of the string.
Now as the step, you want to take a different prefix each time, and add to it all permutations from the call of the rest of the string with k-1. All in all the code can be something like this:
def splt(s, k):
    if k == 1:  # base case - stop condition
        return [[s]]
    res = []
    # loop over all prefixes
    for i in range(1, len(s)-k+2):
        for tmp in splt(s[i:], k-1):
            # add to prefix all permutations of k-1 parts of the rest of s
            res.append([s[:i]] + tmp)
    return res
You can test it on some inputs and see how it works.
If you are not restricted to recursion, another approach is to use itertools.combinations. You can use that to create all combinations of indexes inside the string to split it into k parts, and then simply concatenate those parts and put them in a list. A raw version is something like:
from itertools import combinations
def splt(s, k):
    res = []
    for indexes in combinations(range(1, len(s)), k-1):
        indexes = [0] + list(indexes) + [len(s)]  # add the edges to k-1 indexes to create k parts
        res.append([s[start:end] for start, end in zip(indexes[:-1], indexes[1:])])  # slice out the k parts
    return res
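As a quick sanity check, the two versions can be compared side by side. In this sketch they are renamed `split_recursive` and `split_by_cuts` (names chosen here only so both can live in one file). The number of results should equal the number of ways to choose k-1 cut points among the len(s)-1 gaps.

```python
from itertools import combinations
from math import comb


def split_recursive(s, k):
    """Recursive version: pick a first part, then split the rest into k-1 parts."""
    if k == 1:
        return [[s]]
    res = []
    for i in range(1, len(s) - k + 2):
        for tail in split_recursive(s[i:], k - 1):
            res.append([s[:i]] + tail)
    return res


def split_by_cuts(s, k):
    """itertools version: choose k-1 cut positions inside the string."""
    res = []
    for cuts in combinations(range(1, len(s)), k - 1):
        bounds = [0, *cuts, len(s)]
        res.append([s[a:b] for a, b in zip(bounds, bounds[1:])])
    return res


s, k = "iamfoobar", 4
assert split_recursive(s, k) == split_by_cuts(s, k)
print(len(split_recursive(s, k)), comb(len(s) - 1, k - 1))  # 56 56
```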

The main issue in your implementation is that your loop does not do what it is supposed to do: it returns the first result instead of appending the results.
Here's an example of an implementation:
def sliding(s, k):
    # If there are not enough characters or k is below 1,
    # there is no combination possible
    if len(s) < k or k < 1:
        return []
    # If k is one, we return a list containing all the combinations,
    # which is a single list containing the string
    if k == 1:
        return [[s]]
    results = []
    # Iterate through all the possible values for the first value
    for i in range(1, len(s) - k + 2):
        first_value = s[:i]
        # Append the result of the sub call to the first values
        for sub_result in sliding(s[i:], k - 1):
            results.append([first_value] + sub_result)
    return results
print(sliding("iamfoobar", 4))

I need limit for particular characters in Python itertools

How can I get all combinations with limits for particular characters? For now I only have one limit that applies to all characters, but I want the character "Q" to appear 4 times in every combination. Is that possible with my code?
I use itertools combinations_with_replacement:
from itertools import combinations_with_replacement
import collections
def combine(arr, s):
    return [x for x in combinations_with_replacement(symbols, s) if max(collections.Counter(x).values()) <= 3]
symbols = "LNhkPepm3684th"
max_length = 10
set = 10
print(combine(symbols, set))

I notice that your symbols collection contains the letter "h" twice. I'm not sure whether your "must appear 0 or 1 or 2 times, but no more" restriction applies individually to each h, or whether it applies to all "h"es collectively. In other words, is "LLLLLNNNNhh3684hh" a legal result? The "first h" appears twice, and the "second h" appears twice, and so there are four instances of "h" total.
Here's an approach that works if all symbols are individually restricted and "LLLLLNNNNhh3684hh" is a legal result. It works on the principle that any combination of a sequence can be uniquely represented as a list of numbers indicating how many times the element at that index appears in the combination.
def restricted_sum(n, s, restrictions):
    """
    Restricted sum problem. Find each list that sums up to a certain number, and obeys restrictions regarding its size and contents.
    input:
    n -- an integer. Indicates the length of the result.
    s -- an integer. Indicates the sum of the result.
    restrictions -- a list of tuples. Indicates the minimum and maximum of each corresponding element in the result.
    yields:
    result -- A tuple of non-negative integers, satisfying the requirements:
        len(result) == n
        sum(result) == s
        for i in range(len(result)):
            restrictions[i][0] <= result[i] <= restrictions[i][1]
    """
    if n == 0:
        if s == 0:
            yield ()
            return
        else:
            return
    else:
        if sum(t[0] for t in restrictions) > s: return
        if sum(t[1] for t in restrictions) < s: return
        l, r = restrictions[0]
        for amt in range(l, r+1):
            for rest in restricted_sum(n-1, s-amt, restrictions[1:]):
                yield (amt,) + rest
def combine(characters, size, character_restrictions):
    assert len(characters) == len(set(characters))  # only works for character sets with no duplicates
    n = len(characters)
    s = size
    restrictions = tuple(character_restrictions[c] for c in characters)
    for seq in restricted_sum(n, s, restrictions):
        yield "".join(c*i for i, c in zip(seq, characters))
symbols = "LNhkPepm3684th"
character_restrictions = {}
# most symbols can appear 0-2 times
for c in symbols:
    character_restrictions[c] = (0, 2)
# these characters must appear an exact number of times
limits = {"L": 5, "N": 4}
for c, amt in limits.items():
    character_restrictions[c] = (amt, amt)
for result in combine(symbols, 17, character_restrictions):
    print(result)
Result:
LLLLLNNNN8844tthh
LLLLLNNNN6844tthh
LLLLLNNNN6884tthh
LLLLLNNNN68844thh
LLLLLNNNN68844tth
... 23,462 more values go here...
LLLLLNNNNhh3684hh
... 4,847 more values go here...
LLLLLNNNNhhkkPPe6
LLLLLNNNNhhkkPPe3
LLLLLNNNNhhkkPPem
LLLLLNNNNhhkkPPep
LLLLLNNNNhhkkPPee

Add a dictionary that specifies the limit for each character, and use that instead of 3 in your condition. You can use .get() with a default value so you don't have to specify all the limits.
limits = {'Q': 4, 'A': 2}
def combine(arr, s):
    # keep a combination only if every character stays within its own limit
    return [x for x in combinations_with_replacement(arr, s)
            if all(count <= limits.get(char, 3) for char, count in collections.Counter(x).items())]
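If "Q" has to appear exactly 4 times rather than at most 4 times, the same filtering idea can be extended with an equality check. The sketch below is a brute-force illustration; the function name `combine_with_limits` and its `exact` and `default_max` parameters are assumptions, not part of the answer above.

```python
from collections import Counter
from itertools import combinations_with_replacement


def combine_with_limits(symbols, size, exact, default_max=3):
    """Yield combinations where each symbol in `exact` appears exactly that
    many times and every other symbol appears at most default_max times."""
    alphabet = sorted(set(symbols))  # drop duplicate symbols such as a second "h"
    for combo in combinations_with_replacement(alphabet, size):
        counts = Counter(combo)
        if any(counts[c] != k for c, k in exact.items()):
            continue  # an exact-count requirement is not met
        if any(n > default_max for c, n in counts.items() if c not in exact):
            continue  # some other symbol appears too often
        yield "".join(combo)


print(list(combine_with_limits("QAB", 6, exact={"Q": 4}, default_max=2)))
# ['AAQQQQ', 'ABQQQQ', 'BBQQQQ']
```

This still enumerates every combination before filtering, so for long alphabets a count-based generator like the restricted-sum approach scales better.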

Python Hamming distance rewrite countless for cycles into recursion

I have written code that generates the strings at Hamming distance n from a given binary string, but I am not able to rewrite it as a simple recursive function. There are several sequences (edit: actually only one, the length change) in the for-loop logic, but I don't know how to write it recursively (the input for the function is a string and a distance (int), but in my code the distance is represented by the number of nested for loops). Could you please help me?
(e.g. for string '00100' and distance 4, code returns ['11010', '11001', '11111', '10011', '01011'],
for string '00100' and distance 3, code returns ['11000', '11110', '11101', '10010', '10001', '10111', '01010', '01001', '01111', '00011'])
def change(string, i):
    if string[i] == '1':
        return string[:i] + '0' + string[i+1:]
    else:
        return string[:i] + '1' + string[i+1:]  # '0' on input
def hamming_distance(number):
    array = []
    for i in range(len(number)-3):  # change first bit
        a = number
        a = change(a, i)  # change bit on index i
        for j in range(i+1, len(number)-2):  # change second bit
            b = a
            b = change(b, j)
            for k in range(j+1, len(number)-1):  # change third bit
                c = b
                c = change(c, k)
                for l in range(k+1, len(number)):  # change fourth bit
                    d = c
                    d = change(d, l)
                    array.append(d)
    return array
print(hamming_distance('00100'))
Thank you!

Very briefly, you have three base cases:
len(string) == 0: # return; you've made all the needed changes
dist == 0 # return; no more changes to make
len(string) == dist # change all bits and return (no choice remaining)
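... and two recursion cases, with and without the change (here change stands for the recursive function being written, not the bit-flipping helper from the question):
ham1 = [str(1 - int(string[0])) + alter
        for alter in change(string[1:], dist - 1)]
ham2 = [string[0] + alter for alter in change(string[1:], dist)]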
From each call, you return a list of strings that are dist from the input string. On each return, you have to append the initial character to each item in that list.
Is that clear?
CLARIFICATION
The above approach also generates only those that change the string. "Without" the change refers to only the first character. For instance, given input string="000", dist=2, the algorithm will carry out two operations:
'1' + change("00", 2-1) # for each returned string, "10" and "01"
'0' + change("00", 2) # for the only returned string, "11"
Those two ham lines go in the recursion part of your routine. Are you familiar with the structure of such a function? It consists of base cases and recursion cases.
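Putting the base cases and the two recursion cases together, a complete function could look like the sketch below. The name `hamming_neighbors` is chosen here to avoid clashing with the question's `change` helper, and the exact form of the base cases is one possible reading of the outline above.

```python
def hamming_neighbors(string, dist):
    """Return all binary strings at Hamming distance exactly dist from string."""
    if dist == 0:
        return [string]  # no more changes to make
    if len(string) < dist:
        return []        # not enough bits left to flip
    if len(string) == dist:
        # No choice remaining: every bit must be flipped.
        return ["".join("1" if ch == "0" else "0" for ch in string)]
    flipped = "1" if string[0] == "0" else "0"
    # Case 1: flip the first bit, then make dist - 1 changes in the rest.
    with_change = [flipped + rest
                   for rest in hamming_neighbors(string[1:], dist - 1)]
    # Case 2: keep the first bit, then make all dist changes in the rest.
    without_change = [string[0] + rest
                      for rest in hamming_neighbors(string[1:], dist)]
    return with_change + without_change


print(hamming_neighbors("00100", 4))
# ['11010', '11001', '11111', '10011', '01011']
```

With this ordering of the two cases, the results come out in the same order as the nested loops in the question.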

How to find number of ways that the integers 1,2,3 can add up to n?

Given a set of integers 1,2, and 3, find the number of ways that these can add up to n. (The order matters, i.e. say n is 5. 1+2+1+1 and 2+1+1+1 are two distinct solutions)
My solution involves splitting n into a list of 1s so if n = 5, A = [1,1,1,1,1]. And I will generate more sublists recursively from each list by adding adjacent numbers. So A will generate 4 more lists: [2,1,1,1], [1,2,1,1], [1,1,2,1],[1,1,1,2], and each of these lists will generate further sublists until it reaches a terminating case like [3,2] or [2,3]
Here is my proposed solution (in Python)
ways = []
def check_terminating(A, n):
    # check for terminating case
    for i in range(len(A)-1):
        if A[i] + A[i+1] <= 3:
            return False  # means still can compute
    return True
def count_ways(n, A=[]):
    if A in ways:
        # check if already computed; if yes then don't compute
        return True
    if A not in ways:  # check for duplicates
        ways.append(A)  # global ways
    if check_terminating(A, n):
        return True  # end of the tree
    for i in range(len(A)-1):
        # for each index i,
        # combine with the next element and form a new list
        total = A[i] + A[i+1]
        print(total)
        if total <= 3:
            # form new list and compute
            newA = A[:i] + [total] + A[i+2:]
            count_ways(n, newA)
            # recursive call
# main
n = 5
A = [1 for _ in range(n)]
count_ways(5, A)
print("No. of ways for n = {} is {}".format(n, len(ways)))
May I know if I'm on the right track, and if so, is there any way to make this code more efficient?
Please note that this is not a coin change problem. In coin change, order of occurrence is not important. In my problem, 1+2+1+1 is different from 1+1+1+2 but in coin change, both are same. Please don't post coin change solutions for this answer.
Edit: My code is working but I would like to know if there are better solutions. Thank you for all your help :)

The recurrence relation is F(n+3)=F(n+2)+F(n+1)+F(n) with F(0)=1, F(-1)=F(-2)=0. These are the tribonacci numbers (a variant of the Fibonacci numbers).
It's possible to write an easy O(n) solution:
def count_ways(n):
    a, b, c = 1, 0, 0
    for _ in range(n):
        a, b, c = a+b+c, a, b
    return a
It's harder, but possible, to compute the result in relatively few arithmetic operations:
def count_ways(n):
    A = 3**(n+3)
    P = A**3-A**2-A-1
    return pow(A, n+3, P) % A
for i in range(20):
    print(i, count_ways(i))

The idea that you describe sounds right. It is easy to write a recursive function that produces the correct answer, but slowly.
You can then make it faster by memoizing the answer. Just keep a dictionary of answers that you've already calculated. In your recursive function look at whether you have a precalculated answer. If so, return it. If not, calculate it, save that answer in the dictionary, then return the answer.
That version should run quickly.
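A minimal sketch of that memoized recursion is shown below. The `memo` dictionary parameter is an illustrative choice; functools.cache would work just as well.

```python
def count_ways(n, memo=None):
    """Number of ordered ways to write n as a sum of 1s, 2s, and 3s."""
    if memo is None:
        memo = {}
    if n < 0:
        return 0   # overshot the target: not a valid sum
    if n == 0:
        return 1   # exactly one way to make 0: the empty sum
    if n not in memo:
        # The last term is 1, 2, or 3; count the ways to make what is left.
        memo[n] = sum(count_ways(n - step, memo) for step in (1, 2, 3))
    return memo[n]


print([count_ways(i) for i in range(6)])  # [1, 1, 2, 4, 7, 13]
```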

An O(n) method is possible:
def countways(n):
    A = [1, 1, 2]
    while len(A) <= n:
        A.append(A[-1] + A[-2] + A[-3])
    return A[n]
The idea is that we can work out how many ways of making a sequence with n by considering each choice (1,2,3) for the last partition size.
e.g. to count choices for (1,1,1,1) consider:
choices for (1,1,1) followed by a 1
choices for (1,1) followed by a 2
choices for (1) followed by a 3
If you need the results (instead of just the count) you can adapt this approach as follows:
cache = {}
def countwaysb(n):
    if n < 0:
        return []
    if n == 0:
        return [[]]
    if n in cache:
        return cache[n]
    A = []
    for last in range(1, 4):
        for B in countwaysb(n-last):
            A.append(B + [last])
    cache[n] = A
    return A
