make a spark rdd from tuples list and use groupByKey - python

I have a list of tuples like below
ls=[('c', 's'),('c', 'm'), ('c', 'p'), ('h', 'bi'), ('h', 'vi'), ('n', 'l'), ('n', 'nc')]
I would like to use pyspark and groupByKey to produce:
nc=[['c','s', 'm', 'p'], ['h','bi','vi'], ['n','l', 'nc']]
I don't know how to make a Spark RDD and use groupByKey.
I tried:
tem=ls.groupByKey()
'list' object has no attribute 'groupByKey'

You are getting that error because your object is a list and not an rdd. Python lists do not have a groupByKey() method (as the error states).
You can first convert your list to an rdd using sc.parallelize:
myrdd = sc.parallelize(ls)
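nc = myrdd.groupByKey().mapValues(list).collect()
print(nc)
#[('c', ['s', 'm', 'p']), ('h', ['bi', 'vi']), ('n', ['l', 'nc'])]
This returns a list of tuples where the first element is the key and the second element is a list of the values. The mapValues(list) step is there because groupByKey on its own yields an iterable object per key rather than a plain list, and the order of the keys in the result may vary. If you want to flatten these tuples, you can use itertools.chain.from_iterable: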
from itertools import chain
nc = [tuple(chain.from_iterable(v)) for v in nc]
print(nc)
#[('c', 's', 'm', 'p'), ('h', 'bi', 'vi'), ('n', 'l', 'nc')]
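However, you can avoid Spark completely and achieve the desired result using itertools.groupby (this relies on the list already being ordered by key, because groupby only groups adjacent items):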
from itertools import groupby, chain
ls=[('c', 's'),('c', 'm'), ('c', 'p'), ('h', 'bi'), ('h', 'vi'), ('n', 'l'), ('n', 'nc')]
nc = [
    (key,) + tuple(chain.from_iterable(g[1:] for g in list(group)))
    for key, group in groupby(ls, key=lambda x: x[0])
]
print(nc)
#[('c', 's', 'm', 'p'), ('h', 'bi', 'vi'), ('n', 'l', 'nc')]
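Because groupby only merges adjacent items, the snippet above depends on ls being ordered by key. A minimal sketch of an order-independent alternative, assuming a plain dict-based grouping is acceptable (the helper name group_pairs is made up for illustration):

```python
from collections import defaultdict

def group_pairs(pairs):
    """Group (key, value) pairs into [key, value1, value2, ...] lists."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # dicts keep insertion order, so keys come out in first-seen order
    return [[key] + values for key, values in grouped.items()]

# Deliberately unsorted input: groupby alone would split 'c' into several groups.
ls = [('c', 's'), ('h', 'bi'), ('c', 'm'), ('n', 'l'),
      ('h', 'vi'), ('c', 'p'), ('n', 'nc')]
nc = group_pairs(ls)
print(nc)
```

This also produces inner lists rather than tuples, which matches the format asked for in the question.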

As pault mentioned, the problem here is that Spark operates on specialised parallelized datasets, such as an RDD. To get the exact format you're after using groupByKey you'll need to do some funky stuff with lists:
ls = sc.parallelize(ls)
tem=ls.groupByKey().map(lambda x: ([x[0]] + list(x[1]))).collect()
print(tem)
#[['h', 'bi', 'vi'], ['c', 's', 'm', 'p'], ['n', 'l', 'nc']]
However, it's generally best to avoid groupByKey, because it shuffles every value across the cluster before any combining happens. This problem could also be solved with reduceByKey, which merges values within each partition first:
ls=[('c', 's'),('c', 'm'), ('c', 'p'), ('h', 'bi'), ('h', 'vi'), ('n', 'l'), ('n', 'nc')]
ls = sc.parallelize(ls)
tem=ls.map(lambda x: (x[0], [x[1]])).reduceByKey(lambda x,y: x + y).collect()
print(tem)
This will scale more effectively, but note that RDD operations can start to look a little cryptic when you need to manipulate list structure.
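To see the shape of the reduceByKey result without a cluster, here is a plain-Python stand-in. The function reduce_by_key is a hypothetical local helper, not a Spark API, and real Spark may return the keys in a different order. The last step shows one way to reach the flat lists requested in the question:

```python
def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey: fold values that share a key with func."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return list(acc.items())

ls = [('c', 's'), ('c', 'm'), ('c', 'p'), ('h', 'bi'),
      ('h', 'vi'), ('n', 'l'), ('n', 'nc')]

# Same two steps as the Spark version: wrap each value in a list, then concatenate.
tem = reduce_by_key(((k, [v]) for k, v in ls), lambda x, y: x + y)
print(tem)

# Prepend each key to its values to get one flat list per key.
flat = [[key] + values for key, values in tem]
print(flat)
```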

Related

Select first item in each list

Here is my list:
[(('A', 'B'), ('C', 'D')), (('E', 'F'), ('G', 'H'))]
Basically, I'd like to get:
[('A', 'C'), ('E', 'G')]
So, I'd like to select first elements from the lowest-level lists and build mid-level lists with them.
====================================================
Additional explanation below:
I could just zip them by
list(zip([w[0][0] for w in list1], [w[1][0] for w in list1]))
But later I'd like to add a condition: the second elements in the lowest level lists must be 'B' and 'D' respectively, so the final outcome should be:
[('A', 'C')] # ('E', 'G') must be filtered out
I'm a beginner, but can't find the case anywhere... Would be grateful for help.
I'd do it the following way (using the name list1 so the built-in list is not shadowed):
list1 = [(('A', 'B'), ('C', 'D')), (('E', 'F'), ('G', 'H'))]
out = []
for i in list1:
    # collect the first element of each inner tuple
    listAux = []
    for j in i:
        listAux.append(j[0])
    out.append((listAux[0], listAux[1]))
print(out)
I hope that's what you're looking for.
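The loop above does not yet cover the extra condition from the question. A short sketch of one way to add it, assuming the condition is always "the second elements must equal a given tuple" (the name wanted_seconds is introduced here for illustration):

```python
list1 = [(('A', 'B'), ('C', 'D')), (('E', 'F'), ('G', 'H'))]

# First element of every inner pair, one tuple per row.
firsts = [tuple(pair[0] for pair in row) for row in list1]
print(firsts)

# Keep a row only if its second elements match the required values.
wanted_seconds = ('B', 'D')
filtered = [
    tuple(pair[0] for pair in row)
    for row in list1
    if tuple(pair[1] for pair in row) == wanted_seconds
]
print(filtered)
```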

Generate all unique k-subsequences

I am trying to write a Python (at least initially) function to generate all subsequences of some length k (where k > 0). Since I only need unique subsequences, I am storing both the subsequences and partial subsequences in sets. The following, adapted from a colleague, is the best I could come up with. It seems...overly complex...and like I should be able to abuse itertools, or recursion, to do what I want to do. Can anyone do better?
from typing import Set, Tuple

def subsequences(string: str, k: int) -> Set[Tuple[str, ...]]:
    if len(string) < k:
        return set()
    start = tuple(string[:k])
    result = {start}
    prev_state = [start]
    curr_state = set()
    for s in string[k:]:
        for p in prev_state:
            for i in range(k):
                new = p[:i] + p[i + 1 :] + (s,)
                curr_state.add(new)
        result.update(curr_state)
        prev_state = list(curr_state)
        curr_state.clear()
    return result
(For context, I am interested in induction of k-strictly piecewise languages, an efficiently learnable subclass of the regular languages, and the grammar can be characterized by all licit k-subsequences.
Ultimately I am also thinking about doing this in C++, where std::make_tuple isn't quite as powerful as Python tuple.)
You want the set of distinct r-combinations of n items (without replacement), which has at most (n choose r) elements.
Given
import itertools as it
import more_itertools as mit
Code
Option 1 - itertools.combinations
set(it.combinations("foo", 2))
# {('f', 'o'), ('o', 'o')}
set(it.combinations("foobar", 3))
# {('b', 'a', 'r'),
# ('f', 'a', 'r'),
# ('f', 'b', 'a'),
# ('f', 'b', 'r'),
# ('f', 'o', 'a'),
# ('f', 'o', 'b'),
# ('f', 'o', 'o'),
# ('f', 'o', 'r'),
# ('o', 'a', 'r'),
# ('o', 'b', 'a'),
# ('o', 'b', 'r'),
# ('o', 'o', 'a'),
# ('o', 'o', 'b'),
# ('o', 'o', 'r')}
Option 2 - more_itertools.distinct_combinations
list(mit.distinct_combinations("foo", 2))
# [('f', 'o'), ('o', 'o')]
list(mit.distinct_combinations("foobar", 3))
# [('f', 'o', 'o'),
# ('f', 'o', 'b'),
# ('f', 'o', 'a'),
# ('f', 'o', 'r'),
# ('f', 'b', 'a'),
# ('f', 'b', 'r'),
# ('f', 'a', 'r'),
# ('o', 'o', 'b'),
# ('o', 'o', 'a'),
# ('o', 'o', 'r'),
# ('o', 'b', 'a'),
# ('o', 'b', 'r'),
# ('o', 'a', 'r'),
# ('b', 'a', 'r')]
Both options yield the same (unordered) output. However:
Option 1 generates every combination, duplicates included, and relies on set to discard the repeats
Option 2 does not compute duplicate intermediates
Install more_itertools with pip install more_itertools.
See also a rough implementation of itertools.combinations written in Python.
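If installing more_itertools is not an option, the idea behind Option 2 can be sketched with plain recursion. This is not the library's implementation, only an illustration under the assumption that the items are hashable; the function name distinct_subsequences is made up here:

```python
from itertools import combinations

def distinct_subsequences(seq, k):
    """Yield each distinct length-k subsequence of seq exactly once."""
    def extend(start, prefix):
        if len(prefix) == k:
            yield prefix
            return
        seen = set()  # items already tried at this position
        # stop early enough to leave room for the remaining items
        for i in range(start, len(seq) - (k - len(prefix)) + 1):
            item = seq[i]
            if item in seen:
                continue  # a later duplicate cannot produce anything new
            seen.add(item)
            yield from extend(i + 1, prefix + (item,))
    yield from extend(0, ())

print(list(distinct_subsequences("foo", 2)))
# Cross-check against Option 1 on the longer example.
print(set(distinct_subsequences("foobar", 3)) == set(combinations("foobar", 3)))
```

Skipping an item that was already tried at the same position is what avoids generating duplicates, instead of filtering them out afterwards.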

Delete Tuple Pairs if they are the same

I have a problem with tuples in python. I have the following tuple list:
gamma2 = [[('p', 'u'), ('r', 'w')], [('p', 'w'), ('r', 'u')], [('r', 'u'), ('p', 'w')],[('r', 'w'), ('p', 'u')]]
Now, the parts [('p', 'u'), ('r', 'w')] and [('r', 'w'), ('p', 'u')] are the same for me and also [('p', 'w'), ('r', 'u')] and [('r', 'u'), ('p', 'w')].
So I want to delete one of these double entries in my list, but I don't know how.
I've tried with hash tables and set, but the problem is, that this tuple pair is not the same for the hash table and it will be added by gamma2.add().
So do you have an idea?
You can try using tuple and set:
gamma2 = [[('p', 'u'), ('r', 'w')], [('p', 'w'), ('r', 'u')], [('r', 'u'), ('p', 'w')],[('r', 'w'), ('p', 'u')]]
set([tuple(set(x)) for x in gamma2])
In some cases it is better to use sorted instead of the inner set, because the iteration order of a set is not guaranteed (thanks #rockikz):
set([tuple(sorted(x)) for x in gamma2])
A third solution is to use frozenset:
set([frozenset(x) for x in gamma2])
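The tuple-based versions give a result like this (the order of the elements may differ, and the frozenset version returns frozensets instead of tuples):
{(('p', 'w'), ('r', 'u')), (('r', 'w'), ('p', 'u'))}
How it works:
the inner set (or sorted, or frozenset) normalises each pair, so the order of its two tuples no longer matters
tuple is only there to make the normalised pair hashable, so it can go into the outer set
the outer set keeps only the unique values
If you want a list of lists back, as in the input, convert the result: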
[list(y) for y in set([tuple(set(x)) for x in gamma2])]
will give you
[[('r', 'w'), ('p', 'u')], [('p', 'w'), ('r', 'u')]]
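The set-based versions do not keep the original order of gamma2. A minimal sketch of an order-preserving variant, assuming the first occurrence of each pair is the one to keep (the names seen and unique are introduced here):

```python
gamma2 = [[('p', 'u'), ('r', 'w')], [('p', 'w'), ('r', 'u')],
          [('r', 'u'), ('p', 'w')], [('r', 'w'), ('p', 'u')]]

seen = set()
unique = []
for pair in gamma2:
    key = frozenset(pair)  # ignores the order of the two tuples
    if key not in seen:
        seen.add(key)
        unique.append(pair)  # keep the pair exactly as first written

print(unique)
```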

How to get all permutations of string as list of strings (instead of list of tuples)?

The goal was to create a list of all possible combinations of certain letters in a word... Which is fine, except it now ends up as a list of tuples with too many quotes and commas.
import itertools
mainword = input(str("Enter a word: "))
n_word = int((len(mainword)))
outp = (list(itertools.permutations(mainword,n_word)))
What I want:
[yes, yse, eys, esy, sye, sey]
What I'm getting:
[('y', 'e', 's'), ('y', 's', 'e'), ('e', 'y', 's'), ('e', 's', 'y'), ('s', 'y', 'e'), ('s', 'e', 'y')]
Looks to me I just need to remove all the brackets, quotes, and commas.
I've tried:
def remove(old_list, val):
    new_list = []
    for items in old_list:
        if items!=val:
            new_list.append(items)
    return new_list
    print(new_list)
where I just run the function a few times. But it doesn't work.
You can recombine those tuples with a comprehension like:
Code:
new_list = [''.join(d) for d in old_list]
Test Code:
data = [
('y', 'e', 's'), ('y', 's', 'e'), ('e', 'y', 's'),
('e', 's', 'y'), ('s', 'y', 'e'), ('s', 'e', 'y')
]
data_new = [''.join(d) for d in data]
print(data_new)
Results:
['yes', 'yse', 'eys', 'esy', 'sye', 'sey']
You need to call str.join() on each tuple of characters to convert it back to a single string. Your code can be simplified with a list comprehension:
>>> from itertools import permutations
>>> word = 'yes'
>>> [''.join(w) for w in permutations(word)]
['yes', 'yse', 'eys', 'esy', 'sye', 'sey']
OR you may also use map() to get the desired result as:
>>> list(map(''.join, permutations(word)))
['yes', 'yse', 'eys', 'esy', 'sye', 'sey']
You can use the join function. The code below works for a word of any length:
import itertools
mainword = input(str("Enter a word: "))
n_word = int((len(mainword)))
outp = (list(itertools.permutations(mainword,n_word)))
# join every tuple, however many permutations there are
for i in range(len(outp)):
    outp[i]=''.join(outp[i])
print(outp)
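One thing none of the snippets above handles: if the word contains a repeated letter, permutations yields the same string more than once. A small sketch of dropping the repeats while keeping the order, assuming that is the desired behaviour for such words (the example word 'aab' is made up):

```python
from itertools import permutations

word = 'aab'

# Joining alone keeps duplicates, because the two 'a's count as different items.
joined = [''.join(p) for p in permutations(word)]
print(joined)

# dict.fromkeys keeps the first occurrence of each string, in order.
unique_words = list(dict.fromkeys(joined))
print(unique_words)
```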

replacing values in a nested tuple with the whole tuple

Okay,
I am working on a linguistic prover, and I have series of tuples that represent statements or expressions.
Sometimes, I end up with an embedded "and" statement, and I am trying to "bubble" it up to the surface.
I want to take a tuple that looks like this:
('pred', ('and', 'a', 'b'), 'x')
or, for a more simple example:
( ('and', 'a', 'b'), 'x')
and I want to separate out the ands into two statements such that the top one results in:
('and', ('pred', 'a', 'x',), ('pred', 'b', 'x') )
and the bottom one in:
('and', ('a', 'x'), ('b', 'x') )
I've tried a lot of things, but it always turns out to be quite ugly code. And I am having problems if there are more nested tuples such as:
('not', ('p', ('and', 'a', 'b'), 'x') )
which I want to result in
('not', ('and', ('p', 'a', 'x',), ('p', 'b', 'x') ) )
So basically, the problem is trying to replace a nested tuple with the value of the entire tuple, but the nested one modified. It's very ugly. :(
I'm not super duper python fluent so it gets very convoluted with lots of for loops that I know shouldn't be there. :(
Any help is much appreciated!
This recursive approach seems to work.
def recursive_bubble_ands_up(expr):
    """ Bubble all 'and's in the expression up one level, no matter how nested.
    """
    # if the expression is just a single thing, like 'a', just return it.
    if is_atomic(expr):
        return expr
    # if it has an 'and' in one of its subexpressions
    # (but the subexpression isn't just the 'and' operator itself)
    # rewrite it to bubble the and up
    and_clauses = [('and' in subexpr and not is_atomic(subexpr))
                   for subexpr in expr]
    if any(and_clauses):
        first_and_clause = and_clauses.index(True)
        expr_before_and = expr[:first_and_clause]
        expr_after_and = expr[first_and_clause+1:]
        and_parts = expr[first_and_clause][1:]
        expr = ('and',) + tuple([expr_before_and + (and_part,) + expr_after_and
                                 for and_part in and_parts])
    # apply recursive_bubble_ands_up to all the elements and return result
    return tuple([recursive_bubble_ands_up(subexpr) for subexpr in expr])

def is_atomic(expr):
    """ Return True if expr is an undividable component
    (operator or value, like 'and' or 'a'). """
    # not sure how this should be implemented in the real case,
    # if you're not really just working on strings
    return isinstance(expr, str)
Works on all your examples:
>>> tmp.recursive_bubble_ands_up(('pred', ('and', 'a', 'b'), 'x'))
('and', ('pred', 'a', 'x'), ('pred', 'b', 'x'))
>>> tmp.recursive_bubble_ands_up(( ('and', 'a', 'b'), 'x'))
('and', ('a', 'x'), ('b', 'x'))
>>> tmp.recursive_bubble_ands_up(('not', ('p', ('and', 'a', 'b'), 'x') ))
('not', ('and', ('p', 'a', 'x'), ('p', 'b', 'x')))
Note that this isn't aware of any other "special" operators, like not - as I said in my comment, I'm not sure what it should do with that. But it should give you something to start with.
Edit: Oh, oops, I just realized this only performs a single "bubble-up" operation, for example:
>>> tmp.recursive_bubble_ands_up(((('and', 'a', 'b'), 'x'), 'y' ))
(('and', ('a', 'x'), ('b', 'x')), 'y')
>>> tmp.recursive_bubble_ands_up((('and', ('a', 'x'), ('b', 'x')), 'y'))
('and', (('a', 'x'), 'y'), (('b', 'x'), 'y'))
So what you really want is probably to apply it in a while loop until the output is identical to the input, if you want your 'and' to bubble up from however many levels, like this:
def repeat_bubble_until_finished(expr):
    """ Repeat recursive_bubble_ands_up until there's no change
    (i.e. until all possible bubbling has been done).
    """
    while True:
        old_expr = expr
        expr = recursive_bubble_ands_up(old_expr)
        if expr == old_expr:
            break
    return expr
On the other hand, doing that shows that actually my program breaks your 'not' example, because it bubbles the 'and' ahead of the 'not', which you said you wanted left alone:
>>> tmp.recursive_bubble_ands_up(('not', ('p', ('and', 'a', 'b'), 'x')))
('not', ('and', ('p', 'a', 'x'), ('p', 'b', 'x')))
>>> tmp.repeat_bubble_until_finished(('not', ('p', ('and', 'a', 'b'), 'x')))
('and', ('not', ('p', 'a', 'x')), ('not', ('p', 'b', 'x')))
So I suppose you'd have to build in a special case for 'not' into recursive_bubble_ands_up, or just apply your not-handling function before running mine, and insert it before recursive_bubble_ands_up in repeat_bubble_until_finished so they're applied in alternation.
All right, I really should sleep now.
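One possible shape for the 'not' special case mentioned above, as a rough sketch rather than a finished prover component. It works bottom-up, so a single call reaches the fixed point, and it refuses to lift an 'and' out of any operator listed in blocked. The names bubble_ands, is_and and blocked are invented for this sketch, and leaving an expression that is itself an 'and' untouched is an assumption about the desired behaviour:

```python
def is_atomic(expr):
    """Operators and values are plain strings in this sketch."""
    return isinstance(expr, str)

def is_and(expr):
    return isinstance(expr, tuple) and len(expr) > 0 and expr[0] == 'and'

def bubble_ands(expr, blocked=('not',)):
    """Lift nested 'and's as far up as possible, but never through `blocked` operators."""
    if is_atomic(expr):
        return expr
    # Normalise the children first, so any 'and' below is already at their top.
    expr = tuple(bubble_ands(sub, blocked) for sub in expr)
    # Do not distribute over blocked operators, nor over 'and' itself.
    if expr and is_atomic(expr[0]) and expr[0] in blocked:
        return expr
    if is_and(expr):
        return expr
    for i, sub in enumerate(expr):
        if is_and(sub):
            # Build one copy of expr per conjunct, then keep lifting.
            return ('and',) + tuple(
                bubble_ands(expr[:i] + (part,) + expr[i + 1:], blocked)
                for part in sub[1:]
            )
    return expr

print(bubble_ands(('not', ('p', ('and', 'a', 'b'), 'x'))))
print(bubble_ands(((('and', 'a', 'b'), 'x'), 'y')))
```

Passing blocked=() would let the 'and' pass through 'not' as well, which is the behaviour of repeat_bubble_until_finished on the 'not' example.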
