Generate all unique k-subsequences - python

I am trying to write a Python (at least initially) function to generate all subsequences of some length k (where k > 0). Since I only need unique subsequences, I am storing both the subsequences and partial subsequences in sets. The following, adapted from a colleague, is the best I could come up with. It seems...overly complex...and like I should be able to abuse itertools, or recursion, to do what I want to do. Can anyone do better?
from typing import Set, Tuple

def subsequences(string: str, k: int) -> Set[Tuple[str, ...]]:
    if len(string) < k:
        return set()
    start = tuple(string[:k])
    result = {start}
    prev_state = [start]
    curr_state = set()
    for s in string[k:]:
        for p in prev_state:
            for i in range(k):
                new = p[:i] + p[i + 1 :] + (s,)
                curr_state.add(new)
        result.update(curr_state)
        prev_state = list(curr_state)
        curr_state.clear()
    return result
(For context, I am interested in induction of k-strictly piecewise languages, an efficiently learnable subclass of the regular languages, and the grammar can be characterized by all licit k-subsequences.
Ultimately I am also thinking about doing this in C++, where std::make_tuple isn't quite as powerful as Python tuple.)

You want the set of r-combinations of n items (without replacement), which has at most (n choose r) elements.
Given
import itertools as it
import more_itertools as mit
Code
Option 1 - itertools.combinations
set(it.combinations("foo", 2))
# {('f', 'o'), ('o', 'o')}
set(it.combinations("foobar", 3))
# {('b', 'a', 'r'),
# ('f', 'a', 'r'),
# ('f', 'b', 'a'),
# ('f', 'b', 'r'),
# ('f', 'o', 'a'),
# ('f', 'o', 'b'),
# ('f', 'o', 'o'),
# ('f', 'o', 'r'),
# ('o', 'a', 'r'),
# ('o', 'b', 'a'),
# ('o', 'b', 'r'),
# ('o', 'o', 'a'),
# ('o', 'o', 'b'),
# ('o', 'o', 'r')}
Option 2 - more_itertools.distinct_combinations
list(mit.distinct_combinations("foo", 2))
# [('f', 'o'), ('o', 'o')]
list(mit.distinct_combinations("foobar", 3))
# [('f', 'o', 'o'),
# ('f', 'o', 'b'),
# ('f', 'o', 'a'),
# ('f', 'o', 'r'),
# ('f', 'b', 'a'),
# ('f', 'b', 'r'),
# ('f', 'a', 'r'),
# ('o', 'o', 'b'),
# ('o', 'o', 'a'),
# ('o', 'o', 'r'),
# ('o', 'b', 'a'),
# ('o', 'b', 'r'),
# ('o', 'a', 'r'),
# ('b', 'a', 'r')]
Both options yield the same (unordered) output. However:
Option 1 builds every combination, duplicates included, and relies on set() to discard the repeats
Option 2 does not compute the duplicate intermediates in the first place
Install more_itertools via > pip install more_itertools.
See also a rough implementation of itertools.combinations written in Python.
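For comparison with the function in the question, Option 1 can be wrapped behind the same signature. This is only a sketch: reusing the name subsequences and the type hints from the question is my choice, not something the libraries require.

```python
from itertools import combinations
from typing import Set, Tuple


def subsequences(string: str, k: int) -> Set[Tuple[str, ...]]:
    """Return the set of unique length-k subsequences of string."""
    # combinations() keeps elements in their original order, so every
    # tuple it yields is a subsequence; set() then drops the repeats.
    # If k > len(string), combinations() yields nothing and the set is empty.
    return set(combinations(string, k))


print(subsequences("foo", 2))
```

Note that this still builds all (n choose k) tuples before deduplicating, so it is the simple option rather than the frugal one.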

Related

Delete Tuple Pairs if they are the same

I have a problem with tuples in python. I have the following tuple list:
gamma2 = [[('p', 'u'), ('r', 'w')], [('p', 'w'), ('r', 'u')], [('r', 'u'), ('p', 'w')],[('r', 'w'), ('p', 'u')]]
Now, the parts [('p', 'u'), ('r', 'w')] and [('r', 'w'), ('p', 'u')] are the same for me and also [('p', 'w'), ('r', 'u')] and [('r', 'u'), ('p', 'w')].
So I want to delete one of these double entries in my list, but I don't know how.
I've tried with hash tables and set, but the problem is, that this tuple pair is not the same for the hash table and it will be added by gamma2.add().
So do you have an idea?
You can try using tuple and set:
gamma2 = [[('p', 'u'), ('r', 'w')], [('p', 'w'), ('r', 'u')], [('r', 'u'), ('p', 'w')],[('r', 'w'), ('p', 'u')]]
set([tuple(set(x)) for x in gamma2])
It is more reliable to use sorted instead of set inside the comprehension, because two equal sets are not guaranteed to iterate in the same order (thanks #rockikz):
set([tuple(sorted(x)) for x in gamma2])
A third solution is to use frozenset:
set([frozenset(x) for x in gamma2])
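The first version gives a result like this (the order of elements may vary, and the frozenset version holds frozensets rather than tuples):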
{(('p', 'w'), ('r', 'u')), (('r', 'w'), ('p', 'u'))}
How it works:
the inner set (or sorted) puts each pair list into a canonical form, so that the same items in a different order compare equal
tuple is only there to make that canonical form hashable, so it can go into the outer set
the outer set keeps only the unique values
If you want lists back in the result, convert each entry:
[list(y) for y in set([tuple(set(x)) for x in gamma2])]
will give you
[[('r', 'w'), ('p', 'u')], [('p', 'w'), ('r', 'u')]]
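If the original order of the entries matters, a loop with a "seen" set may be easier to reason about than nested set calls. The sketch below is one way to do it; the function name unique_pairs is made up for illustration, and it assumes that items repeated inside one entry do not need to be told apart (frozenset collapses them).

```python
def unique_pairs(groups):
    """Drop groups that contain the same items in a different order."""
    seen = set()
    result = []
    for group in groups:
        key = frozenset(group)  # order-insensitive, hashable key
        if key not in seen:
            seen.add(key)
            result.append(group)  # keep the first-seen ordering
    return result


gamma2 = [[('p', 'u'), ('r', 'w')], [('p', 'w'), ('r', 'u')],
          [('r', 'u'), ('p', 'w')], [('r', 'w'), ('p', 'u')]]
print(unique_pairs(gamma2))
# [[('p', 'u'), ('r', 'w')], [('p', 'w'), ('r', 'u')]]
```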

make a spark rdd from tuples list and use groupByKey

I have a list of tuples like below
ls=[('c', 's'),('c', 'm'), ('c', 'p'), ('h', 'bi'), ('h', 'vi'), ('n', 'l'), ('n', 'nc')]
I would like to use pyspark and groupByKey to produce:
nc=[['c','s', 'm', 'p'], ['h','bi','vi'], ['n','l', 'nc']]
I don't know how to make a Spark RDD and use groupByKey.
I tried:
tem=ls.groupByKey()
'list' object has no attribute 'groupByKey'
You are getting that error because your object is a list and not an rdd. Python lists do not have a groupByKey() method (as the error states).
You can first convert your list to an rdd using sc.parallelize:
myrdd = sc.parallelize(ls)
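# groupByKey() yields iterable objects for the values; mapValues(list) turns them into plain lists
nc = myrdd.groupByKey().mapValues(list).collect()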
print(nc)
#[('c',['s', 'm', 'p']), ('h',['bi','vi']), ('n',['l', 'nc'])]
This returns a list of tuples where the first element is the key and the second element is a list of the values. If you wanted to flatten these tuples, you can use itertools.chain.from_iterable:
from itertools import chain
nc = [tuple(chain.from_iterable(v)) for v in nc]
print(nc)
#[('c', 's', 'm', 'p'), ('h', 'bi', 'vi'), ('n', 'l', 'nc')]
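However, you can avoid Spark completely and achieve the desired result using itertools.groupby. Note that groupby only groups consecutive items, so the list must already be ordered by key, as it is here: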
from itertools import groupby, chain
ls=[('c', 's'),('c', 'm'), ('c', 'p'), ('h', 'bi'), ('h', 'vi'), ('n', 'l'), ('n', 'nc')]
nc = [
    (key,) + tuple(chain.from_iterable(g[1:] for g in list(group)))
    for key, group in groupby(ls, key=lambda x: x[0])
]
print(nc)
#[('c', 's', 'm', 'p'), ('h', 'bi', 'vi'), ('n', 'l', 'nc')]
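If the input is not already ordered by key, a plain dictionary of lists avoids the need to sort first. This is a sketch rather than part of the answer above; the unsorted ls below is a made-up variation of the list in the question, and the final sorted() is only there to make the output order predictable.

```python
from collections import defaultdict

ls = [('n', 'l'), ('c', 's'), ('h', 'bi'), ('c', 'm'),
      ('n', 'nc'), ('c', 'p'), ('h', 'vi')]  # deliberately unsorted

grouped = defaultdict(list)
for key, value in ls:
    grouped[key].append(value)  # values keep their original relative order

nc = [[key] + values for key, values in sorted(grouped.items())]
print(nc)
# [['c', 's', 'm', 'p'], ['h', 'bi', 'vi'], ['n', 'l', 'nc']]
```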
As pault mentioned, the problem here is that Spark operates on specialised parallelized datasets, such as an RDD. To get the exact format you're after using groupByKey you'll need to do some funky stuff with lists:
ls = sc.parallelize(ls)
tem=ls.groupByKey().map(lambda x: ([x[0]] + list(x[1]))).collect()
print(tem)
#[['h', 'bi', 'vi'], ['c', 's', 'm', 'p'], ['n', 'l', 'nc']]
However, it's generally best to avoid groupByKey, as it can result in a large amount of shuffling. This problem could also be solved with reduceByKey:
ls=[('c', 's'),('c', 'm'), ('c', 'p'), ('h', 'bi'), ('h', 'vi'), ('n', 'l'), ('n', 'nc')]
ls = sc.parallelize(ls)
tem=ls.map(lambda x: (x[0], [x[1]])).reduceByKey(lambda x,y: x + y).collect()
print(tem)
This will scale more effectively, but note that RDD operations can start to look a little cryptic when you need to manipulate list structure.

How to generate permutations without generating repeating results but with a fixed amount of characters Python [duplicate]

This question already has answers here:
Generate permutations of list with repeated elements
(6 answers)
Closed 6 years ago.
I am trying to figure out a way to generate all permutations possible of a string that has a couple repeating characters but without generating repeated tuples.
Right now I am using itertools.permutations(). It works but I need to remove repetition and I cannot use set() to remove the repetition.
What kind of results am I expecting? For example, I want all the permutations of DDRR. The problem with itertools.permutations() is that I get DDRR four times, because itertools treats the two Ds as if they were different, and likewise the two Rs.
With list(itertools.permutations('DDRR')) I get:
[('D', 'D', 'R', 'R'), ('D', 'D', 'R', 'R'), ('D', 'R', 'D', 'R'), ('D', 'R', 'R', 'D'), ('D', 'R', 'D', 'R'), ('D', 'R', 'R', 'D'), ('D', 'D', 'R', 'R'), ('D', 'D', 'R', 'R'), ('D', 'R', 'D', 'R'), ('D', 'R', 'R', 'D'), ('D', 'R', 'D', 'R'), ('D', 'R', 'R', 'D'), ('R', 'D', 'D', 'R'), ('R', 'D', 'R', 'D'), ('R', 'D', 'D', 'R'), ('R', 'D', 'R', 'D'), ('R', 'R', 'D', 'D'), ('R', 'R', 'D', 'D'), ('R', 'D', 'D', 'R'), ('R', 'D', 'R', 'D'), ('R', 'D', 'D', 'R'), ('R', 'D', 'R', 'D'), ('R', 'R', 'D', 'D'), ('R', 'R', 'D', 'D')]
The ideal result I want is:
[('D', 'R', 'R', 'D'), ('R', 'D', 'R', 'D'), ('R', 'R', 'D', 'D'), ('D', 'R', 'D', 'R'), ('D', 'D', 'R', 'R'), ('R', 'D', 'D', 'R')]
If your string contains a lot of repeated characters, then you can use a combinations-based algorithm to generate your permutations.
Basically, this works by choosing a letter and finding all the places where the duplicates of that letter can go. With each of those possibilities, you find all the places where the next letter goes, and so on.
Code:
from collections import Counter
from itertools import combinations

def perms_without_reps(s):
    partitions = list(Counter(s).items())
    k = len(partitions)
    def _helper(idxset, i):
        if len(idxset) == 0:
            yield ()
            return
        for pos in combinations(idxset, partitions[i][1]):
            for res in _helper(idxset - set(pos), i+1):
                yield (pos,) + res

    n = len(s)
    for poses in _helper(set(range(n)), 0):
        out = [None] * n
        for i, pos in enumerate(poses):
            for idx in pos:
                out[idx] = partitions[i][0]
        yield out
Run it like so:
for p in perms_without_reps('DDRR'):
    print(p)
Two important notes:
This doesn't generate output sorted in any particular way. If you want sorted output, add a partitions.sort() before k =, replace _helper(idxset - set(pos), i+1) with _helper(sorted(set(idxset) - set(pos)), i+1) and replace _helper(set(range(n)), 0) with _helper(list(range(n)), 0). This will make the function somewhat slower.
This function works very well if you have a large, unbalanced number of repeats. For example, any permutation-based method will just take forever on the input 'A'*100 + 'B'*2 (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABB), whereas this method will finish nearly instantly with the 5151 unique permutations.
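To sanity-check counts like the 5151 above without generating anything, the number of distinct permutations of a string is n! divided by the factorial of each letter's count. A small sketch (the function name count_unique_perms is made up for illustration):

```python
from collections import Counter
from math import factorial


def count_unique_perms(s):
    """Number of distinct permutations of s: n! / (c1! * c2! * ...)."""
    total = factorial(len(s))
    for count in Counter(s).values():
        total //= factorial(count)  # each division is exact
    return total


print(count_unique_perms('DDRR'))               # 6
print(count_unique_perms('A' * 100 + 'B' * 2))  # 5151
```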
Generally every permutation algorithm is supposed to generate n!/(n-r)! outcomes, but you can implement one that checks for repetition, which should be fun.
Let's see if this helps.
def __permutate(self, objectToPermute):
    in_list = []
    for i in range(self.__cur, len(objectToPermute)):
        in_list.append(self.__swap(self.__cur, i, objectToPermute))
    ''' At initial call, self.__cur = 0
    and self.__permSize would be the r in the permutation formula '''
    if self.__cur < self.__permSize - 1:
        self.__cur += 1
        for obj in in_list:
            self.__permutate(obj)
        self.__cur -= 1
    if self.__cur == self.__permSize -1:
        for obj in in_list:
            #this does the job
            if self.norepeat and obj in self.__perm_obj:
                continue
            self.__perm_obj.append(obj[:self.__permSize])
As you may have noticed, I pulled this out of a class I wrote long ago; it's a permutation algorithm I like to call the Melon algorithm (don't mind that).
This part of the code simply permutes an object recursively. A swap function is used because it was present in the class, but you can easily write your own. Now to the main point: to avoid repetition, all you have to do is set the attribute self.norepeat = True, and every repeated object will be skipped. If you need the class in full, I'll be glad to share it.
I'd appreciate some feedback, so I know whether what I'm saying makes sense.
This would be a standard way to create permutations of a set with duplicate elements: count the number of occurrences of each element (using e.g. an associative array or dictionary), loop through the elements in the dictionary and each time append the element to the permutation, and recurse with the rest of the dictionary. It's never going to be fast for very long input arrays, though; but then nothing probably will.
(Code example in JavaScript; I don't speak Python; translating should be easy enough.)
function permute(set) {
    var alphabet = {};
    for (var s in set) {
        if (!alphabet[set[s]]) alphabet[set[s]] = 1;
        else ++alphabet[set[s]];
    }                               // alphabet = {'a':5, 'b':2, 'r':2, 'c':1, 'd':1}
    perm("", set.length);

    function perm(part, level) {
        for (var a in alphabet) {   // every letter in alphabet
            if (alphabet[a]) {      // one or more of this letter available
                --alphabet[a];      // decrement letter count
                if (level > 1) perm(part + a, level - 1);  // recurse with rest
                else document.write(part + a + "<br>");    // deepest recursion level
                ++alphabet[a];      // restore letter count
            }
        }
    }
}
permute(['a','b','r','a','c','a','d','a','b','r','a']); // 83,160 unique permutations
                                                        // instead of 39,916,800 non-
                                                        // unique ones plus filtering
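Since the question is about Python, here is one possible translation of the same counting idea. It is a sketch, not the answerer's code: the name unique_permutations is made up, and it yields tuples from a generator instead of writing to the page.

```python
from collections import Counter


def unique_permutations(items):
    """Yield each distinct permutation of items exactly once, as a tuple."""
    counts = Counter(items)  # how many of each element are still unused
    n = len(items)

    def perm(part):
        if len(part) == n:            # deepest recursion level
            yield tuple(part)
            return
        for element in counts:        # every distinct element
            if counts[element]:       # one or more still available
                counts[element] -= 1  # use one copy
                part.append(element)
                yield from perm(part)
                part.pop()
                counts[element] += 1  # restore the count

    yield from perm([])


for p in unique_permutations('DDRR'):
    print(p)
```

For 'DDRR' this prints six tuples, and for the 'abracadabra' letters it produces the same 83,160 permutations as the JavaScript version.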

What is wrong with this webapp2 code?

I am writing a webapp in python to convert a string by rot13 such as rot13(a)=n , rot13(b)=o and so on. But the code is only working for letters after m. From letters a to m there is no change. What am I doing wrong, here is my code:
import webapp2
import cgi

form = """<form method="post">
Type a string to convert:
<br>
<textarea name="text">%(text)s</textarea>
<br>
<input type="submit">
</form> """

class MainHandler(webapp2.RequestHandler):
    def write_form(self, text=""):
        self.response.out.write(form % {"text": cgi.escape(text)})

    def get(self):
        self.write_form()

    def post(self):
        new_text = self.request.get('text')
        text = ""
        for x in range(len(new_text)):
            text += convert(new_text[x])
        self.write_form(text)

def convert(t):
    for (i, o) in (('a', 'n'),
                   ('b', 'o'),
                   ('c', 'p'),
                   ('d', 'q'),
                   ('e', 'r'),
                   ('f', 's'),
                   ('g', 't'),
                   ('h', 'u'),
                   ('i', 'v'),
                   ('j', 'w'),
                   ('k', 'x'),
                   ('l', 'y'),
                   ('m', 'z'),
                   ('n', 'a'),
                   ('o', 'b'),
                   ('p', 'c'),
                   ('q', 'd'),
                   ('r', 'e'),
                   ('s', 'f'),
                   ('t', 'g'),
                   ('u', 'h'),
                   ('v', 'i'),
                   ('w', 'j'),
                   ('x', 'k'),
                   ('y', 'l'),
                   ('z', 'm')):
        t = t.replace(i, o)
    return t

app = webapp2.WSGIApplication([
    ('/', MainHandler)
], debug=True)
When I place the letters n to z above a, then a to m give the correct result.
The issue is in the convert() function. Let's take a simple example to understand it.
Take the string 'an' and try to convert it.
First we get a from the tuple of tuples and replace it with n in the string, so it becomes 'nn'. After many misses we reach n in the tuple of tuples and again call replace on the whole string, and this time we get 'aa'. As you can see, each replace acts on the complete string, not just on the part that has not been converted yet. This is the basic issue in your code (at least the issue you mention).
To fix this, Python already provides str.translate() (used with a maketrans table) to do what you are trying to do; you should use that instead. Example -
Python 2.x -
def convert(t):
    from string import maketrans
    tt = maketrans('abcdefghijklmnopqrstuvwxyz','nopqrstuvwxyzabcdefghijklm')
    t = t.translate(tt)
    return t
For Python 3.x , you should use str.maketrans() instead of string.maketrans() .
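A minimal sketch of the Python 3 variant might look like this. The handling of upper-case letters and the name ROT13_TABLE are my additions, not part of the code above.

```python
import string

# Build the table once; it maps each letter to the one 13 places further on.
_LOWER = string.ascii_lowercase
_UPPER = string.ascii_uppercase
ROT13_TABLE = str.maketrans(
    _LOWER + _UPPER,
    _LOWER[13:] + _LOWER[:13] + _UPPER[13:] + _UPPER[:13],
)


def convert(t):
    """Apply rot13 to t in a single pass, leaving other characters alone."""
    return t.translate(ROT13_TABLE)


print(convert('an'))     # na
print(convert('Hello'))  # Uryyb
```

Because translate() looks up every character in the original string exactly once, a letter that has already been converted is never converted back, which is what went wrong with the chained replace() calls.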

replacing values in a nested tuple with the whole tuple

Okay,
I am working on a linguistic prover, and I have series of tuples that represent statements or expressions.
Sometimes, I end up with an embedded "and" statement, and I am trying to "bubble" it up to the surface.
I want to take a tuple that looks like this:
('pred', ('and', 'a', 'b'), 'x')
or, for a more simple example:
( ('and', 'a', 'b'), 'x')
and I want to separate out the ands into two statements such that the top one results in:
('and', ('pred', 'a', 'x',), ('pred', 'b', 'x') )
and the bottom one in:
('and', ('a', 'x'), ('b', 'x') )
I've tried a lot of things, but it always turns out to be quite ugly code. And I am having problems if there are more nested tuples such as:
('not', ('p', ('and', 'a', 'b'), 'x') )
which I want to result in
('not', ('and', ('p', 'a', 'x',), ('p', 'b', 'x') ) )
So basically, the problem is trying to replace a nested tuple with the value of the entire tuple, but the nested one modified. It's very ugly. :(
I'm not super duper python fluent so it gets very convoluted with lots of for loops that I know shouldn't be there. :(
Any help is much appreciated!
This recursive approach seems to work.
def recursive_bubble_ands_up(expr):
    """ Bubble all 'and's in the expression up one level, no matter how nested.
    """
    # if the expression is just a single thing, like 'a', just return it.
    if is_atomic(expr):
        return expr
    # if it has an 'and' in one of its subexpressions
    # (but the subexpression isn't just the 'and' operator itself)
    # rewrite it to bubble the and up
    and_clauses = [('and' in subexpr and not is_atomic(subexpr))
                   for subexpr in expr]
    if any(and_clauses):
        first_and_clause = and_clauses.index(True)
        expr_before_and = expr[:first_and_clause]
        expr_after_and = expr[first_and_clause+1:]
        and_parts = expr[first_and_clause][1:]
        expr = ('and',) + tuple([expr_before_and + (and_part,) + expr_after_and
                                 for and_part in and_parts])
    # apply recursive_bubble_ands_up to all the elements and return result
    return tuple([recursive_bubble_ands_up(subexpr) for subexpr in expr])

def is_atomic(expr):
    """ Return True if expr is an undividable component
    (operator or value, like 'and' or 'a'). """
    # not sure how this should be implemented in the real case,
    # if you're not really just working on strings
    return isinstance(expr, str)
Works on all your examples:
>>> tmp.recursive_bubble_ands_up(('pred', ('and', 'a', 'b'), 'x'))
('and', ('pred', 'a', 'x'), ('pred', 'b', 'x'))
>>> tmp.recursive_bubble_ands_up(( ('and', 'a', 'b'), 'x'))
('and', ('a', 'x'), ('b', 'x'))
>>> tmp.recursive_bubble_ands_up(('not', ('p', ('and', 'a', 'b'), 'x') ))
('not', ('and', ('p', 'a', 'x'), ('p', 'b', 'x')))
Note that this isn't aware of any other "special" operators, like not - as I said in my comment, I'm not sure what it should do with that. But it should give you something to start with.
Edit: Oh, oops, I just realized this only performs a single "bubble-up" operation, for example:
>>> tmp.recursive_bubble_ands_up(((('and', 'a', 'b'), 'x'), 'y' ))
(('and', ('a', 'x'), ('b', 'x')), 'y')
>>> tmp.recursive_bubble_ands_up((('and', ('a', 'x'), ('b', 'x')), 'y'))
('and', (('a', 'x'), 'y'), (('b', 'x'), 'y'))
So what you really want is probably to apply it in a while loop until the output is identical to the input, if you want your 'and' to bubble up from however many levels, like this:
def repeat_bubble_until_finished(expr):
    """ Repeat recursive_bubble_ands_up until there's no change
    (i.e. until all possible bubbling has been done).
    """
    while True:
        old_expr = expr
        expr = recursive_bubble_ands_up(old_expr)
        if expr == old_expr:
            break
    return expr
On the other hand, doing that shows that actually my program breaks your 'not' example, because it bubbles the 'and' ahead of the 'not', which you said you wanted left alone:
>>> tmp.recursive_bubble_ands_up(('not', ('p', ('and', 'a', 'b'), 'x')))
('not', ('and', ('p', 'a', 'x'), ('p', 'b', 'x')))
>>> tmp.repeat_bubble_until_finished(('not', ('p', ('and', 'a', 'b'), 'x')))
('and', ('not', ('p', 'a', 'x')), ('not', ('p', 'b', 'x')))
So I suppose you'd have to build in a special case for 'not' into recursive_bubble_ands_up, or just apply your not-handling function before running mine, and insert it before recursive_bubble_ands_up in repeat_bubble_until_finished so they're applied in alternation.
All right, I really should sleep now.
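One way to build in the special case for 'not' is to process the expression bottom-up and simply refuse to lift an 'and' through certain operators. The following is only a sketch under my own assumptions: the function name bubble_ands and the blockers parameter are made up, atoms are assumed to be plain strings, and an 'and' that is already a direct child of another 'and' is left where it is.

```python
def is_atomic(expr):
    """Return True for an undividable component such as 'and' or 'a'."""
    return isinstance(expr, str)


def bubble_ands(expr, blockers=("not",)):
    """Lift nested 'and' tuples as far up as possible, but never through
    a tuple whose first element is listed in blockers."""
    if is_atomic(expr) or not expr:
        return expr
    # Normalise the children first, so any 'and' below has already risen.
    expr = tuple(bubble_ands(sub, blockers) for sub in expr)
    head = expr[0]
    if is_atomic(head) and (head in blockers or head == 'and'):
        return expr  # nothing is lifted through 'not' or out of an 'and'
    for i, sub in enumerate(expr):
        if not is_atomic(sub) and sub and sub[0] == 'and':
            # Distribute the surrounding tuple over each part of the 'and'.
            lifted = ('and',) + tuple(expr[:i] + (part,) + expr[i + 1:]
                                      for part in sub[1:])
            # The pieces may contain further 'and's, so process the result again.
            return bubble_ands(lifted, blockers)
    return expr


print(bubble_ands(('not', ('p', ('and', 'a', 'b'), 'x'))))
# ('not', ('and', ('p', 'a', 'x'), ('p', 'b', 'x')))
print(bubble_ands(((('and', 'a', 'b'), 'x'), 'y')))
# ('and', (('a', 'x'), 'y'), (('b', 'x'), 'y'))
```

Because the children are handled before their parent, a single call does the work of the repeat-until-unchanged loop, and the 'not' example comes out in the form asked for in the question.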
