I am trying to write a Python (at least initially) function to generate all subsequences of some length k (where k > 0). Since I only need unique subsequences, I am storing both the subsequences and partial subsequences in sets. The following, adapted from a colleague, is the best I could come up with. It seems...overly complex...and like I should be able to abuse itertools, or recursion, to do what I want to do. Can anyone do better?
from typing import Set, Tuple

def subsequences(string: str, k: int) -> Set[Tuple[str, ...]]:
    if len(string) < k:
        return set()
    start = tuple(string[:k])
    result = {start}
    prev_state = [start]
    curr_state = set()
    for s in string[k:]:
        for p in prev_state:
            for i in range(k):
                new = p[:i] + p[i + 1:] + (s,)
                curr_state.add(new)
        result.update(curr_state)
        prev_state = list(curr_state)
        curr_state.clear()
    return result
(For context, I am interested in induction of k-strictly piecewise languages, an efficiently learnable subclass of the regular languages, and the grammar can be characterized by all licit k-subsequences.
Ultimately I am also thinking about doing this in C++, where std::make_tuple isn't quite as powerful as Python tuple.)
You want the set of r-combinations of n items (without replacement), of which there are at most (n choose r) distinct ones.
Given
import itertools as it
import more_itertools as mit
Code
Option 1 - itertools.combinations
set(it.combinations("foo", 2))
# {('f', 'o'), ('o', 'o')}
set(it.combinations("foobar", 3))
# {('b', 'a', 'r'),
# ('f', 'a', 'r'),
# ('f', 'b', 'a'),
# ('f', 'b', 'r'),
# ('f', 'o', 'a'),
# ('f', 'o', 'b'),
# ('f', 'o', 'o'),
# ('f', 'o', 'r'),
# ('o', 'a', 'r'),
# ('o', 'b', 'a'),
# ('o', 'b', 'r'),
# ('o', 'o', 'a'),
# ('o', 'o', 'b'),
# ('o', 'o', 'r')}
Option 2 - more_itertools.distinct_combinations
list(mit.distinct_combinations("foo", 2))
# [('f', 'o'), ('o', 'o')]
list(mit.distinct_combinations("foobar", 3))
# [('f', 'o', 'o'),
# ('f', 'o', 'b'),
# ('f', 'o', 'a'),
# ('f', 'o', 'r'),
# ('f', 'b', 'a'),
# ('f', 'b', 'r'),
# ('f', 'a', 'r'),
# ('o', 'o', 'b'),
# ('o', 'o', 'a'),
# ('o', 'o', 'r'),
# ('o', 'b', 'a'),
# ('o', 'b', 'r'),
# ('o', 'a', 'r'),
# ('b', 'a', 'r')]
Both options yield the same (unordered) output. However:
Option 1 computes every combination (duplicates included) and relies on set to discard the repeats
Option 2 never generates the duplicate intermediates in the first place
Install more_itertools via > pip install more_itertools.
See also a rough implementation of itertools.combinations written in Python.
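For reference, here is a minimal sketch (my addition, not part of the original answer) that wraps Option 1 in the question's own signature:
import itertools as it
from typing import Set, Tuple

def subsequences(string: str, k: int) -> Set[Tuple[str, ...]]:
    # set() collapses duplicate k-tuples that arise from repeated characters
    return set(it.combinations(string, k))

subsequences("foo", 2)
# {('f', 'o'), ('o', 'o')}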
I have a list of tuples like below
ls=[('c', 's'),('c', 'm'), ('c', 'p'), ('h', 'bi'), ('h', 'vi'), ('n', 'l'), ('n', 'nc')]
I would like to use pyspark and groupByKey to produce:
nc=[['c', 's', 'm', 'p'], ['h', 'bi', 'vi'], ['n', 'l', 'nc']]
I don't know how to make a Spark RDD and use groupByKey.
I tried:
tem=ls.groupByKey()
'list' object has no attribute 'groupByKey'
You are getting that error because your object is a Python list, not an RDD, and lists do not have a groupByKey() method (as the error states).
You can first convert your list to an RDD using sc.parallelize:
myrdd = sc.parallelize(ls)
nc = myrdd.groupByKey().mapValues(list).collect()  # mapValues(list) turns each grouped ResultIterable into a plain list
print(nc)
#[('c',['s', 'm', 'p']), ('h',['bi','vi']), ('n',['l', 'nc'])]
This returns a list of tuples where the first element is the key and the second element is a list of the values. If you wanted to flatten these tuples, you can use itertools.chain.from_iterable:
from itertools import chain
nc = [tuple(chain.from_iterable(v)) for v in nc]
print(nc)
#[('c', 's', 'm', 'p'), ('h', 'bi', 'vi'), ('n', 'l', 'nc')]
However, you can avoid Spark entirely and achieve the desired result with itertools.groupby:
from itertools import groupby, chain
ls=[('c', 's'),('c', 'm'), ('c', 'p'), ('h', 'bi'), ('h', 'vi'), ('n', 'l'), ('n', 'nc')]
nc = [
    (key,) + tuple(chain.from_iterable(g[1:] for g in list(group)))
    for key, group in groupby(ls, key=lambda x: x[0])
]
print(nc)
#[('c', 's', 'm', 'p'), ('h', 'bi', 'vi'), ('n', 'l', 'nc')]
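One caveat: itertools.groupby only groups consecutive keys, so it relies on ls already being ordered by key. If that is not guaranteed, a collections.defaultdict version (my addition, not part of the original answer) avoids the sorting requirement:
from collections import defaultdict

groups = defaultdict(list)
for key, value in ls:
    groups[key].append(value)

nc = [(key,) + tuple(values) for key, values in groups.items()]
print(nc)
#[('c', 's', 'm', 'p'), ('h', 'bi', 'vi'), ('n', 'l', 'nc')]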
As pault mentioned, the problem here is that Spark operates on specialised parallelized datasets, such as an RDD. To get the exact format you're after using groupByKey, you'll need to do some funky stuff with lists:
ls = sc.parallelize(ls)
tem=ls.groupByKey().map(lambda x: ([x[0]] + list(x[1]))).collect()
print(tem)
#[['h', 'bi', 'vi'], ['c', 's', 'm', 'p'], ['n', 'l', 'nc']]
However, it's generally best to avoid groupByKey because it shuffles all of the values across the cluster. This problem can also be solved with reduceByKey:
ls=[('c', 's'),('c', 'm'), ('c', 'p'), ('h', 'bi'), ('h', 'vi'), ('n', 'l'), ('n', 'nc')]
ls = sc.parallelize(ls)
tem=ls.map(lambda x: (x[0], [x[1]])).reduceByKey(lambda x,y: x + y).collect()
print(tem)
This will scale more effectively, but note that RDD operations can start to look a little cryptic when you need to manipulate list structure.
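If you need the exact list-of-lists shape from the question rather than (key, values) tuples, one more map can be chained on before collecting. A sketch, assuming the same parallelized ls as above:
tem = (ls.map(lambda x: (x[0], [x[1]]))
         .reduceByKey(lambda x, y: x + y)
         .map(lambda x: [x[0]] + x[1])   # prepend the key to its list of values
         .collect())
print(tem)
#[['c', 's', 'm', 'p'], ['h', 'bi', 'vi'], ['n', 'l', 'nc']] (element order may differ)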
I have the list below (in fact it's longer but it's just to give the idea):
[[('P', 0.3178082191780822, 1750.0, 12.5),
('C', 0.8191780821917808, 1800.0, 332.80000000000001),
('P', 0.3178082191780822, 1325.0, 1.95),
('P', 0.14520547945205478, 1550.0, 1.0),
('C', 1.8136986301369864, 1900.0, 305.56999999999999),
('P', 0.3178082191780822, 1700.0, 9.9000000000000004),
('P', 0.14520547945205478, 2010.0, 18.949999999999999)]]
where each tuple refers to (option_type, time_to_maturity, strike, option_price).
I have to perform a double integration over the times to maturity and the strikes, so for each distinct time_to_maturity (the second element of each tuple) I would like to select the corresponding strike values (the third element of each tuple). What I want to obtain is a list containing the times to maturity and, for each one, a list of the strikes associated with it (a time to maturity is associated with several strikes, but generally the opposite does not hold). Is there a way to do that?
EDIT
This is one of the 10 lists, from which I would like to remove the 'P' tuples that have the same strike as a consecutive 'C' tuple:
(0.8328767123287671, [('P', 1200.0, 7.75), ('P', 1300.0, 11.199999999999999), ('P', 1400.0, 15.5), ('P', 1500.0, 21.600000000000001), ('C', 1500.0, 590.14999999999998), ('P', 1550.0, 24.75), ('P', 1575.0, 26.0), ('C', 1575.0, 522.0), ('P', 1600.0, 29.100000000000001), ('P', 1650.0, 33.5), ('P', 1675.0, 35.899999999999999), ('P', 1700.0, 39.700000000000003), ('P', 1725.0, 42.600000000000001), ('P', 1800.0, 53.0), ('P', 1850.0, 62.100000000000001), ('P', 1875.0, 67.5), ('P', 1900.0, 72.700000000000003), ('C', 1900.0, 243.09999999999999), ('P', 1950.0, 84.900000000000006), ('C', 1975.0, 189.30000000000001), ('P', 2000.0, 98.0), ('C', 2000.0, 171.0), ('C', 2050.0, 139.09999999999999), ('C', 2075.0, 122.59999999999999), ('P', 2075.0, 126.0), ('C', 2100.0, 108.0), ('P', 2100.0, 133.0), ('C', 2150.0, 81.400000000000006), ('C', 2200.0, 57.700000000000003), ('C', 2250.0, 39.0), ('P', 2250.0, 217.59999999999999), ('C', 2300.0, 24.350000000000001), ('P', 2300.0, 253.40000000000001), ('C', 2350.0, 14.35), ('C', 2375.0, 11.0), ('C', 2400.0, 8.0), ('C', 2500.0, 2.5499999999999998), ('P', 2500.0, 427.85000000000002)])
If I understand correctly, you want to group your records by time_to_maturity, so why not use itertools.groupby? This requires you to sort but to be able to integrate you have to sort anyway, so I guess that's ok.
import itertools as it
import operator as op
data, = [[('P', 0.3178082191780822, 1750.0, 12.5),
('C', 0.8191780821917808, 1800.0, 332.80000000000001),
('P', 0.3178082191780822, 1325.0, 1.95),
('P', 0.14520547945205478, 1550.0, 1.0),
('C', 1.8136986301369864, 1900.0, 305.56999999999999),
('P', 0.3178082191780822, 1700.0, 9.9000000000000004),
('P', 0.14520547945205478, 2010.0, 18.949999999999999)]]
# sort records ignoring 0th column
ds = sorted(data, key=op.itemgetter(slice(1, None)))
# group by 1st column
gr = it.groupby(ds, op.itemgetter(1))
# cut the first two entries from each record in each group
# the 1st entry is redundant with key, and the 0th I don't know what
# it's good for. To retain it use vi[:1] + vi[2:] instead of just vi[2:]
gr = [(k, [vi[2:] for vi in v]) for k, v in gr]
print(gr)
Prints:
[(0.14520547945205478, [(1550.0, 1.0), (2010.0, 18.95)]), (0.3178082191780822, [(1325.0, 1.95), (1700.0, 9.9), (1750.0, 12.5)]), (0.8191780821917808, [(1800.0, 332.8)]), (1.8136986301369864, [(1900.0, 305.57)])]
Note that as it stands this drops the 'P'/'C' column, but that is easily remedied should you need to retain it; see the comment in the code.
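From gr it is then a short step to the two collections the question asks for (the maturities and, per maturity, the strikes). A small sketch using the result above:
maturities = [k for k, rows in gr]
strikes_per_maturity = [[strike for strike, price in rows] for k, rows in gr]
print(maturities)
# [0.14520547945205478, 0.3178082191780822, 0.8191780821917808, 1.8136986301369864]
print(strikes_per_maturity)
# [[1550.0, 2010.0], [1325.0, 1700.0, 1750.0], [1800.0], [1900.0]]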
You can use a list comprehension to extract a dimension:
time_to_maturity_list = [time_to_maturity for option_type, time_to_maturity, strike, option_price in my_list]
strikes_list = [strike for option_type, time_to_maturity, strike, option_price in my_list]
This is very readable, but it does mean looping over the list twice. An alternative is to create two empty lists and append to them in a single for loop:
time_to_maturity_list = []
strike_list = []
for option_type, time_to_maturity, strike, option_price in my_list:
    time_to_maturity_list.append(time_to_maturity)
    strike_list.append(strike)
NOTE: my_list here is a single flat list of tuples. Since your data is a list of lists, either take its first element or concatenate all of the inner lists first (for example with a nested for loop, as sketched below).
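A minimal flattening sketch, assuming data is the outer list of lists from the question (the name is mine):
my_list = [row for sub in data for row in sub]  # concatenate the inner lists
time_to_maturity_list = [t for _, t, _, _ in my_list]
strike_list = [strike for _, _, strike, _ in my_list]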
I am writing a webapp in Python to convert a string with rot13, such that rot13(a) = n, rot13(b) = o, and so on. But the code is only working for letters after m; for letters a to m there is no change. What am I doing wrong? Here is my code:
import webapp2
import cgi
form = """<form method="post">
Type a string to convert:
<br>
<textarea name="text">%(text)s</textarea>
<br>
<input type="submit">
</form> """
class MainHandler(webapp2.RequestHandler):
    def write_form(self, text=""):
        self.response.out.write(form % {"text": cgi.escape(text)})

    def get(self):
        self.write_form()

    def post(self):
        new_text = self.request.get('text')
        text = ""
        for x in range(len(new_text)):
            text += convert(new_text[x])
        self.write_form(text)
def convert(t):
    for (i, o) in (('a', 'n'),
                   ('b', 'o'),
                   ('c', 'p'),
                   ('d', 'q'),
                   ('e', 'r'),
                   ('f', 's'),
                   ('g', 't'),
                   ('h', 'u'),
                   ('i', 'v'),
                   ('j', 'w'),
                   ('k', 'x'),
                   ('l', 'y'),
                   ('m', 'z'),
                   ('n', 'a'),
                   ('o', 'b'),
                   ('p', 'c'),
                   ('q', 'd'),
                   ('r', 'e'),
                   ('s', 'f'),
                   ('t', 'g'),
                   ('u', 'h'),
                   ('v', 'i'),
                   ('w', 'j'),
                   ('x', 'k'),
                   ('y', 'l'),
                   ('z', 'm')):
        t = t.replace(i, o)
    return t
app = webapp2.WSGIApplication([
    ('/', MainHandler)
], debug=True)
When I place the letters n to z above a, then a to m give the correct result.
The issue is in the convert() method. Let's take a simple example to understand it.
Let's try to convert the string 'an'.
First we reach ('a', 'n') in the tuple of tuples and replace 'a' with 'n' in the string, so it becomes 'nn'. Then, after a lot of misses, we reach ('n', 'a') and do another replace on the whole string, and this time we get 'aa'. As you can see, each replace runs over the complete string, including the characters that have already been converted. This is the basic issue in your code (at least the issue you mention).
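A minimal illustration of that double replacement:
s = 'an'
s = s.replace('a', 'n')   # 'nn'  -- both characters are now 'n'
s = s.replace('n', 'a')   # 'aa'  -- the first replacement is undone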
To fix this, Python already provides str.translate() (together with maketrans) to do exactly what you are trying to do; you should use that instead. Example:
Python 2.x -
def convert(t):
    from string import maketrans
    tt = maketrans('abcdefghijklmnopqrstuvwxyz', 'nopqrstuvwxyzabcdefghijklm')
    t = t.translate(tt)
    return t
For Python 3.x, you should use str.maketrans() instead of string.maketrans(), as sketched below.
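A minimal sketch of the Python 3 version:
def convert(t):
    # str.maketrans builds the same translation table that string.maketrans built in Python 2
    tt = str.maketrans('abcdefghijklmnopqrstuvwxyz', 'nopqrstuvwxyzabcdefghijklm')
    return t.translate(tt)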
Okay,
I am working on a linguistic prover, and I have a series of tuples that represent statements or expressions.
Sometimes, I end up with an embedded "and" statement, and I am trying to "bubble" it up to the surface.
I want to take a tuple that looks like this:
('pred', ('and', 'a', 'b'), 'x')
or, for a more simple example:
( ('and', 'a', 'b'), 'x')
and I want to separate out the ands into two statements such that the top one results in:
('and', ('pred', 'a', 'x',), ('pred', 'b', 'x') )
and the bottom one in:
('and', ('a', 'x'), ('b', 'x') )
I've tried a lot of things, but it always turns out to be quite ugly code. And I run into problems when there are more deeply nested tuples, such as:
('not', ('p', ('and', 'a', 'b'), 'x') )
which I want to result in
('not', ('and', ('p', 'a', 'x',), ('p', 'b', 'x') ) )
So basically, the problem is replacing a nested tuple with a version of the entire tuple in which that nested part has been modified. It's very ugly. :(
I'm not super duper fluent in Python, so it gets very convoluted, with lots of for loops that I know shouldn't be there. :(
Any help is much appreciated!
This recursive approach seems to work.
def recursive_bubble_ands_up(expr):
    """ Bubble all 'and's in the expression up one level, no matter how nested.
    """
    # if the expression is just a single thing, like 'a', just return it.
    if is_atomic(expr):
        return expr
    # if it has an 'and' in one of its subexpressions
    # (but the subexpression isn't just the 'and' operator itself)
    # rewrite it to bubble the and up
    and_clauses = [('and' in subexpr and not is_atomic(subexpr))
                   for subexpr in expr]
    if any(and_clauses):
        first_and_clause = and_clauses.index(True)
        expr_before_and = expr[:first_and_clause]
        expr_after_and = expr[first_and_clause + 1:]
        and_parts = expr[first_and_clause][1:]
        expr = ('and',) + tuple([expr_before_and + (and_part,) + expr_after_and
                                 for and_part in and_parts])
    # apply recursive_bubble_ands_up to all the elements and return result
    return tuple([recursive_bubble_ands_up(subexpr) for subexpr in expr])

def is_atomic(expr):
    """ Return True if expr is an undividable component
    (operator or value, like 'and' or 'a'). """
    # not sure how this should be implemented in the real case,
    # if you're not really just working on strings
    return isinstance(expr, str)
Works on all your examples:
>>> tmp.recursive_bubble_ands_up(('pred', ('and', 'a', 'b'), 'x'))
('and', ('pred', 'a', 'x'), ('pred', 'b', 'x'))
>>> tmp.recursive_bubble_ands_up(( ('and', 'a', 'b'), 'x'))
('and', ('a', 'x'), ('b', 'x'))
>>> tmp.recursive_bubble_ands_up(('not', ('p', ('and', 'a', 'b'), 'x') ))
('not', ('and', ('p', 'a', 'x'), ('p', 'b', 'x')))
Note that this isn't aware of any other "special" operators, like not - as I said in my comment, I'm not sure what it should do with that. But it should give you something to start with.
Edit: Oh, oops, I just realized this only performs a single "bubble-up" operation, for example:
>>> tmp.recursive_bubble_ands_up(((('and', 'a', 'b'), 'x'), 'y' ))
(('and', ('a', 'x'), ('b', 'x')), 'y')
>>> tmp.recursive_bubble_ands_up((('and', ('a', 'x'), ('b', 'x')), 'y'))
('and', (('a', 'x'), 'y'), (('b', 'x'), 'y'))
So what you really want is probably to apply it in a while loop until the output is identical to the input, if you want your 'and' to bubble up from however many levels, like this:
def repeat_bubble_until_finished(expr):
    """ Repeat recursive_bubble_ands_up until there's no change
    (i.e. until all possible bubbling has been done).
    """
    while True:
        old_expr = expr
        expr = recursive_bubble_ands_up(old_expr)
        if expr == old_expr:
            break
    return expr
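For example (my check, not in the original answer), the doubly nested case from above now bubbles all the way up:
>>> repeat_bubble_until_finished(((('and', 'a', 'b'), 'x'), 'y'))
('and', (('a', 'x'), 'y'), (('b', 'x'), 'y'))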
On the other hand, doing that shows that actually my program breaks your 'not' example, because it bubbles the 'and' ahead of the 'not', which you said you wanted left alone:
>>> tmp.recursive_bubble_ands_up(('not', ('p', ('and', 'a', 'b'), 'x')))
('not', ('and', ('p', 'a', 'x'), ('p', 'b', 'x')))
>>> tmp.repeat_bubble_until_finished(('not', ('p', ('and', 'a', 'b'), 'x')))
('and', ('not', ('p', 'a', 'x')), ('not', ('p', 'b', 'x')))
So I suppose you'd have to build in a special case for 'not' into recursive_bubble_ands_up, or just apply your not-handling function before running mine, and insert it before recursive_bubble_ands_up in repeat_bubble_until_finished so they're applied in alternation.
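One possible shape for that special case, as a rough sketch only: it assumes any 'not' appears only at the head of the outermost tuple, as in your example, and it reuses is_atomic and repeat_bubble_until_finished from above.
def bubble_ands_stopping_at_not(expr):
    # Bubble 'and's up inside each operand of a 'not', but never past the 'not' itself.
    if is_atomic(expr):
        return expr
    if expr[0] == 'not':
        return ('not',) + tuple(bubble_ands_stopping_at_not(sub) for sub in expr[1:])
    return repeat_bubble_until_finished(expr)

>>> bubble_ands_stopping_at_not(('not', ('p', ('and', 'a', 'b'), 'x')))
('not', ('and', ('p', 'a', 'x'), ('p', 'b', 'x')))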
All right, I really should sleep now.