Generating all possible combinations of characters in a string - python

Say I have a string list:
li = ['a', 'b', 'c']
I would like to construct a new list such that each entry of the new list is a concatenation of a selection of 3 entries in the original list. Note that each entry can be chosen repeatedly:
new_li=['abc', 'acb', 'bac', 'bca', 'cab', 'cba', 'aab', 'aac',....'aaa', 'bbb', 'ccc']
The brute-force way is to construct a 3-fold nested for loop and insert each 3-combination into the new list. I was wondering if there is a more Pythonic way to deal with that? Thanks.
Update:
Later I will convert the new list into a set, so the order does not matter anyway.

This looks like a job for itertools.product.
import itertools
def foo(l):
    yield from itertools.product(*([l] * 3))
for x in foo('abc'):
    print(''.join(x))
aaa
aab
aac
aba
abb
abc
aca
acb
acc
baa
bab
bac
bba
bbb
bbc
bca
bcb
bcc
caa
cab
cac
cba
cbb
cbc
cca
ccb
ccc
yield from is available from Python 3.3 onward. For older versions, yield within a loop:
def foo(l):
    for i in itertools.product(*([l] * 3)):
        yield i

The best way to get all combinations with repetition (the Cartesian product of the list with itself) is to use itertools.product with the repeat argument. Here repeat is len(li), which happens to equal the desired length of 3 (that's where it differs from the other answer):
from itertools import product
li = ['a', 'b', 'c']
for comb in product(li, repeat=len(li)):
    print(''.join(comb))
or if you want the result as list:
>>> combs = [''.join(comb) for comb in product(li, repeat=len(li))]
>>> combs
['aaa', 'aab', 'aac', 'aba', 'abb', 'abc', 'aca', 'acb', 'acc', 'baa',
'bab', 'bac', 'bba', 'bbb', 'bbc', 'bca', 'bcb', 'bcc', 'caa', 'cab',
'cac', 'cba', 'cbb', 'cbc', 'cca', 'ccb', 'ccc']
It's a bit cleaner to use the repeat argument than to multiply and unpack the list you have manually.
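Since the question mentions converting the result to a set, here is a minimal sketch that builds the set directly and takes the string length as a separate parameter, so it does not rely on the list having exactly three entries. The function name strings_of_length is made up for illustration.

```python
from itertools import product

def strings_of_length(chars, length):
    """Return the set of all strings of `length` characters drawn from `chars`.

    Characters may repeat, so the set has len(chars) ** length entries
    when the characters in `chars` are distinct.
    """
    return {"".join(combo) for combo in product(chars, repeat=length)}

new_li = strings_of_length(['a', 'b', 'c'], 3)
print(len(new_li))      # 27 strings in total
print('abc' in new_li)  # True
```

Passing length explicitly also covers cases such as 3-character strings over a 4-character alphabet.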

An alternate approach using list comprehension:
li = ['a', 'b', 'c']
new_li = [a+b+c for a in li for b in li for c in li]

import itertools
repeat = int(input("Enter length: "))
def password():
    def foo(l):
        yield from itertools.product(*([l] * repeat))
    for x in foo('abcdefghijklmnopqrstuvwxyz'):
        # you could also use string.ascii_lowercase or ["a","b","c"]
        print(''.join(x))
password()

I'll show you a way to do this without any libraries so that you can understand the logic behind how to achieve it.
First, we need to understand how to achieve all combinations mathematically.
Let's take a look at the pattern of every possible combination of characters ranging from a-b with a length of '1'.
a
b
Not much to see but from what we can see, there is one set of each character in the list. Let's increase our string length to '2' and see what pattern emerges.
aa
ab
ba
bb
So looking at this pattern, we see a new column has been added. The far right column is the same as in the first example, with only 1 of each character per set, but the set is repeated this time. The column on the far left has 2 of each character. Could it be that for every new column added, one more of each character is added? Let's find out by increasing the string length to '3'.
aaa
aab
aba
abb
baa
bab
bba
bbb
We can see the two columns on the right have stayed the same and the new column on the left has 4 of each character! Not what we were expecting. So the number of characters doesn't increase by 1 for each column. Instead, if you notice the pattern, it is actually increasing by powers of 2.
The first column with only '1' set of characters : 2 ^ 0 = 1
The second column with '2' sets of characters : 2 ^ 1 = 2
The third column with '4' sets of characters : 2 ^ 2 = 4
So the answer here is that, with each new column added, the number of each character in the column is determined by a power of its position, with the first column on the right being x ^ 0, then x ^ 1, then x ^ 2... and so on.
But what is x? In the example I gave x = 2. But is it always 2? Let's take a look.
I will now give an example of each possible combination of characters from range a-c
aa
ab
ac
ba
bb
bc
ca
cb
cc
If we count how many characters are in the first column on the right, there is still only one of each character every time it loops. This is because the very first column on the right will always be equal to x ^ 0, and anything to the power of 0 is always 1. But if we look at the second column, we see 3 of each character for every loop. So if x ^ 1 is for the second column, then x = 3. From the first example with a range of a-b (range of 2) to the second example with a range of a-c (range of 3), it seems that x is always the number of characters used in your combinations.
With this first pattern recognised, we can start building a function that can identify what each column should represent. If we want to build every combination of characters from range a-b with a string length of 3, then we need a function that understands that the sets of characters in the columns will be as follows: [4, 2, 1].
Now create a function that finds how many of each character should be in each column by returning a list of numbers that represent the number of characters per set in a column, based on its position. We do this using powers.
Remember that if we use a range of characters from a-b (2), then each column should have a total of x ^ y characters for each set, where x represents the number of characters being used and y represents the column position, where the very first column on the right is column number 0.
Example:
A combination of characters ranging from ['a', 'b'] with a string length of 3 will have a total of 4 a's and b's in the far left column for each set, a total of 2 a's and b's in the next for each set and a total of 1 a's and b's in the last for each set.
To return a list with this total number of characters respective to their columns as so [4, 2, 1] we can do this
def getCharPower(stringLength, charRange):
    charpowers = []
    for x in range(0, stringLength):
        charpowers.append(len(charRange)**(stringLength - x - 1))
    return charpowers
With the above function - if we want to create every possible combination of characters that range from a-b (2) and have a string length of 4, like so
aaaa
aaab
aaba
aabb
abaa
abab
abba
abbb
baaa
baab
baba
babb
bbaa
bbab
bbba
bbbb
which have a total set of (8) a's and b's, (4) a's and b's, (2) a's and b's, and (1) a's and b's, then we want to return a list of [8, 4, 2, 1]. The stringLength is 4 and our charRange is ['a', 'b'] and the result from our function is [8, 4, 2, 1].
So now all we have to do is print out each character x number of times, depending on the value of its column placement from our returned list.
In order to do this, though, we need to find out how many times each set is printed in its column. Take a look at the first column on the right of the previous combination example. Although a and b are only printed once per set, the set loops and prints the same thing 7 more times (8 in total). If the string were only 3 characters in length, it would loop a total of 4 times.
The reason is that the length of our strings determines how many combinations there will be in total. The formula for working this out is x ^ y = a, where x equals our range of characters, y equals the length of the string and a equals the total number of combinations that are possible within those specifications.
So to finalise this problem, our solution is to figure out
How many characters in each set go into each column
How many times to repeat each set in each column
The first point has already been solved with our previously created function.
The second point can be solved by finding out how many combinations there are in total by calculating len(charRange) ^ stringLength. Then, running through a loop, we keep adding sets of characters until a (the total number of possible combinations) has been reached in that column. Run that for each column and you have your result.
Here is the function that solves this
def Generator(stringLength, charRange):
    workbench = []
    results = []
    charpowers = getCharPower(stringLength, charRange)
    for x in range(0, stringLength):
        while len(workbench) < len(charRange)**stringLength:
            for char in charRange:
                for z in range(0, charpowers[x]):
                    workbench.append(char)
        results.append(workbench)
        workbench = []
    results = ["".join(result) for result in list(zip(*results))]
    return results
That function will return every possible combination of the characters you provide, at the string length you provide.
A much simpler way of approaching this problem would be to just run one for loop per character position.
So to create every possible combination of characters ranging from a-b with a length of 2
characters = ['a', 'b']
for charone in characters:
    for chartwo in characters:
        print(charone+chartwo)
Although this is a lot simpler, it is limited. This code only prints every combination with a length of 2. To create longer strings, we would have to manually add another for loop each time we wanted to change the length. The functions I provided before this code, however, work for whatever string length you give them, which makes them fully adaptable and a good way to solve this problem yourself without any libraries.
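The column pattern described above (the column at position j from the right changes character every x ^ j rows) is the same as counting in base x. As a hedged alternative to the two functions above, here is a minimal sketch that computes each character directly from the row number; the name count_in_base is made up for illustration.

```python
def count_in_base(chars, length):
    """Build every string of `length` characters from `chars` without libraries.

    Row k of the output is k written in base len(chars), with the digits
    replaced by characters. Column j (0 = far right) therefore repeats each
    character len(chars) ** j times before moving on to the next one.
    """
    n = len(chars)
    results = []
    for k in range(n ** length):
        # Walk the columns from left (highest power) to right (power 0).
        word = [chars[(k // n ** j) % n] for j in reversed(range(length))]
        results.append("".join(word))
    return results

print(count_in_base(['a', 'b'], 2))  # ['aa', 'ab', 'ba', 'bb']
```

This produces the rows in the same order as the listings above and works for any string length without adding loops.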

Related

Insert elements in front of specific list elements

I have pandas data frame with two columns:
sentence - fo n bar
annotations [B-inv, B-inv, O, I-acc, O, B-com, I-com, I-com]
I want to insert additional 'O' elements in the annotations list in front of each annotation starting with 'B', which will look like this:
[O, B-inv, O, B-inv, O, I-acc, O, O, B-com, I-com, I-com]
Then I want to insert an additional whitespace into the sentence in front of each character whose index matches a 'B' annotation in the initial annotation list, that is, in front of the characters at indices [0, 1, 5]. The resulting sentence is:
' f o n  bar'
Maybe to make it more visibly appealing I should represent it this way:
Initial sentence:
Ind  Sentence char  Annot
0    f              B-inv
1    o              B-inv
2    whitespace     O
3    n              I-acc
4    whitespace     O
5    b              B-com
6    a              I-com
7    r              I-com
End sentence:
Ind  Sentence char  Annot
0    whitespace     O
1    f              B-inv
2    whitespace     O
3    o              B-inv
4    whitespace     O
5    n              I-acc
6    whitespace     O
7    whitespace     O
8    b              B-com
9    a              I-com
10   r              I-com
Updated answer (list comprehension)
from itertools import chain
annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']
sent = list('fo n bar')
annot, sent = list(map(lambda l: list(chain(*l)), list(zip(*[(['O', a], [' ', s]) if a.startswith('B') else ([a], [s]) for a,s in zip(annot, sent)]))))
print(annot)
print(''.join(sent))
chain from itertools allows you to chain together a list of lists to form a single list. The rest is some clumsy use of zip together with argument unpacking (the * prefix on arguments) to get it in one line. map is only used to apply the same operation to both lists.
But a more readable version, so you can also follow the steps better, could be:
# find where in the annotations the element starts with 'B'
loc = [a.startswith('B') for a in annot]
# Use this locator to add an element and Merge the list of lists with `chain`
annot = list(chain.from_iterable([['O', a] if l else [a] for a,l in zip(annot, loc)]))
sent = ''.join(chain.from_iterable([[' ', a] if l else [a] for a,l in zip(sent, loc)])) # same on sentence
Note that this version does not use map, since each list is processed separately, and there is less zipping and casting to lists. It is therefore most probably a much cleaner, and hence preferred, solution.
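For comparison, the same transformation can be written as a plain loop, which may be easier to follow than either comprehension. This is only a sketch; the function name insert_before_b is made up for illustration.

```python
def insert_before_b(annotations, sentence):
    """Insert 'O' / ' ' in front of every position whose annotation starts with 'B'."""
    new_annotations, new_chars = [], []
    for tag, char in zip(annotations, sentence):
        if tag.startswith('B'):
            # Extra annotation and extra whitespace go in front of the 'B' item.
            new_annotations.append('O')
            new_chars.append(' ')
        new_annotations.append(tag)
        new_chars.append(char)
    return new_annotations, ''.join(new_chars)

annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']
new_annot, new_sent = insert_before_b(annot, 'fo n bar')
print(new_annot)
print(repr(new_sent))
```

Because both lists are built in a single pass, the annotation and sentence stay aligned by construction.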
Old answer (pandas)
I am not sure it is the most convenient to do this on a DataFrame. It might be easier on a simple list, before converting to a DataFrame.
But anyway, here is a way through it, assuming you don't really have meaningful indices in your DataFrame (so that indices are simply the integer count of each row).
The trick is to use .str string functions, such as startswith in this case, to find matching strings in the column of interest. You can then loop over the matching indices ([0, 1, 5] in the example) and insert the row with the whitespace and 'O' data at a dummy location (a half index, e.g. 0.5 to place the row before row 1). Sorting by index with .sort_index() then rearranges all rows in the way you want.
import numpy as np
import pandas as pd
annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']
sent = list('fo n bar')
df = pd.DataFrame({'sent':sent, 'annot':annot})
idx = np.argwhere(df.annot.str.startswith('B').values) # find rows where annotations start with 'B'
for i in idx.ravel(): # Loop over the indices before which we want to insert a new row
    df.loc[i-0.5] = [' ', 'O'] # made up indices so that the subsequent sorting will place the row where you want it
df.sort_index().reset_index(drop=True) # this will output the new DataFrame

Common substring in list of strings

I ran into a problem while trying to solve a task where, given some strings and their lengths, you need to find their common substring. My code for the part that loops through the list and then through each word in it is this:
num_of_cases = int(input())
for i in range(1, num_of_cases+1):
    if __name__ == '__main__':
        len_of_str = list(map(int, input().split()))
        len_of_virus = int(input())
        strings = []
        def string(strings, len_of_str):
            len_of_list = len(len_of_str)
            for i in range(1, len_of_list+1):
                strings.append(input())
        lst_of_subs = []
        virus_index = []
        def substr(strings, len_of_virus):
            for word in strings:
                for i in range(len(len_of_str)):
                    leng = word[i:len_of_virus]
                    lst_of_subs.append(leng)
                    virus_index.append(i)
        print(string(strings, len_of_str))
        print(substr(strings, len_of_virus))
And it prints the following given the strings: ananasso, associazione, tassonomia, massone
['anan', 'nan', 'an', 'n', 'asso', 'sso', 'so', 'o', 'tass', 'ass', 'ss', 's', 'mass', 'ass', 'ss', 's']
It seems that the end index doesn't increase, although I tried writing len_of_virus += 1 at the end of the loop.
sample input:
1
8 12 10 7
4
ananasso
associazione
tassonomia
massone
where the first line is the number of cases, the second line gives the lengths of the strings, the third is the length of the virus (the common substring), and then come the given strings that I should loop through.
expected output:
Case #1: 4 0 1 1
where the four numbers are the starting indexes of the common substring. (I don't think the code for printing matters for this particular problem.)
What should I do? Please help!
The problem, besides defining functions in odd places and relying on their side effects in ways that aren't really encouraged, is here:
for i in range(len(len_of_str)):
    leng = word[i:len_of_virus]
i increases on each iteration, but len_of_virus stays the same, so you are effectively doing
word[0:4] #when len_of_virus=4
word[1:4]
word[2:4]
word[3:4]
...
That is where 'anan', 'nan', 'an', 'n' come from for the first word "ananasso", and the same goes for the others:
>>> word="ananasso"
>>> len_of_virus = 4
>>> for i in range(len(word)):
        word[i:len_of_virus]
'anan'
'nan'
'an'
'n'
''
''
''
''
>>>
You can fix it by moving the upper end by i, but that leaves the same problem at the other end:
>>> for i in range(len(word)):
        word[i:len_of_virus+i]
'anan'
'nana'
'anas'
'nass'
'asso'
'sso'
'so'
'o'
>>>
So a simple adjustment to the range solves the problem:
>>> for i in range(len(word)-len_of_virus+1):
        word[i:len_of_virus+i]
'anan'
'nana'
'anas'
'nass'
'asso'
>>>
Now that the substring part is done, the rest is also easy
>>> def substring(text,size):
        return [text[i:i+size] for i in range(len(text)-size+1)]
>>> def find_common(lst_text,size):
        subs = [set(substring(x,size)) for x in lst_text]
        return set.intersection(*subs)
>>> test="""ananasso
associazione
tassonomia
massone""".split()
>>> find_common(test,4)
{'asso'}
>>>
To find the part common to all the strings in our list we can use sets: first we put all the substrings of each word into a set, and finally we intersect them all.
The rest is just printing it to your liking:
>>> virus = find_common(test,4).pop()
>>> print("case 1:",*[x.index(virus) for x in test])
case 1: 4 0 1 1
>>>
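Putting the pieces together for the input format shown in the question, here is a minimal sketch that parses the whole input text and produces the "Case #n: ..." lines. The names find_virus_positions and solve are made up for illustration, and the sketch assumes every case really does contain a common substring of the requested length.

```python
def find_virus_positions(strings, size):
    """Return the start index, in each string, of a shared substring of length `size`."""
    first = strings[0]
    for i in range(len(first) - size + 1):
        candidate = first[i:i + size]
        if all(candidate in s for s in strings):
            return [s.index(candidate) for s in strings]
    return []  # assumption: the task guarantees a match, so this is a fallback

def solve(text):
    """Parse the input described in the question and return the output lines."""
    lines = iter(text.strip().splitlines())
    num_cases = int(next(lines))
    output = []
    for case in range(1, num_cases + 1):
        lengths = next(lines).split()   # one length per string
        size = int(next(lines))         # length of the common substring
        strings = [next(lines).strip() for _ in lengths]
        positions = find_virus_positions(strings, size)
        output.append("Case #{}: {}".format(case, " ".join(map(str, positions))))
    return output

sample = """1
8 12 10 7
4
ananasso
associazione
tassonomia
massone"""
print("\n".join(solve(sample)))  # Case #1: 4 0 1 1
```

Taking the input as a string rather than calling input() keeps the parsing separate from the search and makes the function easy to test.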
First extract all the substrings of the given size from the shortest string. Then select the first of these substrings that is present in all of the strings. Finally output the position of this common substring in each of the strings:
def commonSubs(strings,size):
    base = min(strings,key=len) # shortest string
    subs = [base[i:i+size] for i in range(len(base)-size+1)] # all substrings
    cs = next(ss for ss in subs if all(ss in s for s in strings)) # first common
    return [s.index(cs) for s in strings] # indexes of common substring
output:
S = ["ananasso", "associazione", "tassonomia", "massone"]
print(commonSubs(S,4))
[4, 0, 1, 1]
You could also use a recursive approach:
def commonSubs(strings,size,i=0):
    sub = strings[0][i:i+size]
    if all(sub in s for s in strings):
        return [s.index(sub) for s in strings]
    return commonSubs(strings,size,i+1)
from suffix_trees import STree
STree.STree(["come have some apple pies",
'apple pie available',
'i love apple pie haha']).lcs()
The simplest way is to use STree; lcs() returns the longest common substring of the given strings.

Generate all substrings of a given string

I want to generate all the possible substrings from a given string without redundant values as follows:
input: 'abba'
output: 'a','b','ab','ba','abb','bba'
Here is my code
s='abba'
for i in range (0,len(s)):
    for j in range (i+1,len(s)):
        print(s[i:j])
My output is 'a','ab','abb','b','bb','b'
As you can see from the output 'b' is repeated, and 'bba' does not exist.
I want to know and learn the right logic to produce all unique substrings.
Fixing the indexing a bit
s='abba'
for i in range (0,len(s)):
    for j in range (i,len(s)):
        print(s[i:(j+1)])
yields the following output
a
ab
abb
abba
b
bb
bba
b
ba
a
Basically, the indexing fix takes into account that
'abba'[3:3] produces just zero-length string ''
but
'abba'[3:4] produces string 'a' which has length one.
You can remove the duplicates by using set(), as follows:
s='abba'
ss = set()
for i in range (0,len(s)):
    for j in range (i,len(s)):
        ss.add(s[i:(j+1)])
print(sorted(ss))
Then you will have the following result ['a', 'ab', 'abb', 'abba', 'b', 'ba', 'bb', 'bba'].
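The same idea fits in a single set comprehension. This is a minimal sketch with a made-up function name, unique_substrings; it lets the end index run one past the last character instead of adding 1 inside the slice.

```python
def unique_substrings(s):
    """Return the set of all non-empty substrings of `s`, without duplicates."""
    # j is the exclusive end index, so it runs from i + 1 up to len(s) inclusive.
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

print(sorted(unique_substrings('abba')))
# ['a', 'ab', 'abb', 'abba', 'b', 'ba', 'bb', 'bba']
```

Sets are unordered, so sort the result (for example by length, then alphabetically) if a particular order is needed.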

to create a population of "A" and "B" alphabets with 50% frequency of each

I am writing this code to generate a txt file that should contain a total of 600 letters consisting only of A (50%) and B (50%), distributed randomly. How should I do that? I am new to coding, please help.
q=[A, B]
popsize=500
def gen_pop(A):
    population=[]
    B=A
    while len(population)<popsize:
        random.shuffle(B)
    print A
    print B
gen_pop(q)
Create a list of 300 A's and 300 B's, then shuffle it with random.shuffle:
>>> from random import shuffle
>>> mylist = ['A'] * 300 + ['B'] * 300
>>> shuffle(mylist)
>>> mylist
['B', 'B', 'A', 'B', 'A', 'A', ... 'A', 'B', 'A']
The best answer may depend on what exactly you want your distribution to be.
If you want exactly half your letters to be A and half to be B, use the technique from jabaldoneodo's answer, and build the sequence first, then shuffle it:
import random
result = ["A"]*300 + ["B"]*300
random.shuffle(result)
If on the other hand, you want each value to be selected independently of the others, with a 50% chance of it being A and a 50% chance of it being B, picking the number of each ahead of time will be inappropriate. Instead, you can use random.choice to pick from your alphabet in a list comprehension:
import random
alphabet = "AB"
result = [random.choice(alphabet) for _ in range(600)]
Using this method, the number of As (and Bs, for that matter) will follow a binomial distribution (approximately normal), with a mean of 300. The same technique also works for larger alphabets.
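Neither snippet above writes the txt file the question asks for. Here is a minimal sketch of the exact 50/50 variant that also saves the result; the function name write_population and the file name population.txt are assumptions made for illustration.

```python
import random

def write_population(path, size=600, alphabet="AB"):
    """Write `size` letters to `path`, with every letter of `alphabet` equally frequent."""
    per_letter, remainder = divmod(size, len(alphabet))
    if remainder:
        raise ValueError("size must be divisible by the number of letters")
    # Build the exact counts first, then shuffle to randomise the order.
    letters = [letter for letter in alphabet for _ in range(per_letter)]
    random.shuffle(letters)
    with open(path, "w") as handle:
        handle.write("".join(letters))
    return letters

letters = write_population("population.txt")
print(letters.count("A"), letters.count("B"))  # 300 300
```

To get independently chosen letters instead, the list could be built with random.choice as shown above and written out the same way.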
Generate an array of 300 A's and 300 B's, and then shuffle it using a shuffling algorithm such as Fisher-Yates.

Python: comparing a few thousand strings. Any fast alternatives for comparison?

I have a set of around 6 000 packets which for comparison purposes I represent as strings (first 28 bytes) to compare against just as many packets, which I also represent as strings of 28 bytes.
I have to match each packet of one set with all of the other. Matchings are always unique.
I found that comparing strings takes a bit of time. Is there any way to speed up the process?
EDIT1: I would rather not permute the string elements, because I always make sure that the ordering between the packet list and the corresponding string list is preserved.
EDIT2: Here's my implementation:
list1, list2 # list of packets (no duplicates present in each list!)
listOfStrings1, listOfStrings2 # corresponding list of strings. Ordering is preserved.
alreadyMatchedlist2Indices = []
for list1Index in xrange(len(listOfStrings1)):
    stringToMatch = listOfStrings1[list1Index]
    matchinglist2Indices = [i for i, list2Str in enumerate(listOfStrings2)
                            if list2Str == stringToMatch and i not in alreadyMatchedlist2Indices]
    if not matchinglist2Indices:
        tmpUnmatched.append(list1Index)
    elif len(matchinglist2Indices) == 1:
        tmpMatched.append([list1Index, matchinglist2Indices[0]])
        alreadyMatchedlist2Indices.append(matchinglist2Indices[0])
    else:
        list2Index = matchinglist2Indices[0] #taking first matching element anyway
        tmpMatched.append([list1Index, list2Index])
        alreadyMatchedlist2Indices.append(list2Index)
(Here I'm assuming you're taking each string one by one and comparing it to all the others.)
I suggest sorting your lists of strings and comparing neighboring strings. This should have a runtime of O(n log n).
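One way to read this suggestion, sketched under the assumption that neither list contains duplicates (as the question states): sort the indices of each list by their string, then walk both sorted orders in step. Sorting indices rather than the strings themselves leaves the original ordering untouched, which the question asks for. The name match_sorted is made up for illustration.

```python
def match_sorted(a, b):
    """Return sorted (index_in_a, index_in_b) pairs for strings present in both lists."""
    # Sort index lists, not the data, so the original ordering is preserved.
    order_a = sorted(range(len(a)), key=a.__getitem__)
    order_b = sorted(range(len(b)), key=b.__getitem__)
    matches = []
    i = j = 0
    while i < len(order_a) and j < len(order_b):
        x, y = a[order_a[i]], b[order_b[j]]
        if x == y:
            matches.append((order_a[i], order_b[j]))
            i += 1
            j += 1
        elif x < y:
            i += 1  # x cannot appear later in b's sorted order
        else:
            j += 1  # y cannot appear later in a's sorted order
    return sorted(matches)

print(match_sorted(['a', 'b', 'c'], ['c', 'd', 'e']))  # [(2, 0)]
```

The two sorts dominate the cost, so the whole thing is O(n log n) instead of the quadratic scan in the question.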
Here's a simple linear time approach -- at least if I understand your question correctly:
>>> def get_matches(a, b):
...     reverse_map = {x:i for i, x in enumerate(b)}
...     return [(i, reverse_map[x]) for i, x in enumerate(a) if x in reverse_map]
...
>>> get_matches(['a', 'b', 'c'], ['c', 'd', 'e'])
[(2, 0)]
This accepts two sequences of strings, a and b, and returns a list of matches represented as tuples of indices into a and b. This is O(n + m) where m and n are the lengths of a and b.
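The code in the question also records which packets had no match. Here is a minimal sketch of the same dictionary lookup extended to return both lists; the name match_packets is made up, and it assumes, as the question states, that neither list contains duplicates (with duplicates in the second list, the dictionary would keep only the last index).

```python
def match_packets(strings1, strings2):
    """Return (matched, unmatched) in the shape used by the question's code.

    matched   -- list of [index_in_strings1, index_in_strings2] pairs
    unmatched -- list of indices into strings1 that have no partner
    """
    index_in_2 = {s: i for i, s in enumerate(strings2)}  # string -> position
    matched, unmatched = [], []
    for i, s in enumerate(strings1):
        j = index_in_2.get(s)
        if j is None:
            unmatched.append(i)
        else:
            matched.append([i, j])
    return matched, unmatched

print(match_packets(['a', 'b', 'c'], ['c', 'd', 'e']))  # ([[2, 0]], [0, 1])
```

Each dictionary lookup takes constant time on average, so the total work stays linear in the sizes of the two lists.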
What's wrong with:
matches = [packet for packet in list1 if packet in list2]
