How to make generator work in spark mapPartitions()? - python

I am trying to use mapPartitions in Spark to process a large text corpus.
Let's say we have some half-processed data that looks like this:
text_1 = [['A', 'B', 'C', 'D', 'E'],
['F', 'E', 'G', 'A', 'B'],
['D', 'E', 'H', 'A', 'B'],
['A', 'B', 'C', 'F', 'E'],
['A', 'B', 'C', 'J', 'E'],
['E', 'H', 'A', 'B', 'C'],
['E', 'G', 'A', 'B', 'C'],
['C', 'F', 'E', 'G', 'A'],
['C', 'D', 'E', 'H', 'A'],
['C', 'J', 'E', 'H', 'A'],
['H', 'A', 'B', 'C', 'F'],
['H', 'A', 'B', 'C', 'J'],
['B', 'C', 'F', 'E', 'G'],
['B', 'C', 'D', 'E', 'H'],
['B', 'C', 'F', 'E', 'K'],
['B', 'C', 'J', 'E', 'H'],
['G', 'A', 'B', 'C', 'F'],
['J', 'E', 'H', 'A', 'B']]
Each letter is a word. I also have a vocabulary:
V = ['D','F','G','C','J','K']
text_1RDD = sc.parallelize(text_1)
and I want to run the following in spark:
filtered_lists = text_1RDD.mapPartitions(partitions)
filtered_lists.collect()
I have this function:
def partitions(list_of_lists, vc):
    for w in vc:
        iterator = []
        for sub_list in list_of_lists:
            if w in sub_list:
                iterator.append(sub_list)
        yield (w, len(iterator))
If I run it like this:
c = partitions(text_1,V)
for item in c:
    print(item)
it returns the correct counts:
('D', 4)
('F', 7)
('G', 5)
('C', 15)
('J', 5)
('K', 1)
However, I have no idea how to run it in spark:
filtered_lists = text_1RDD.mapPartitions(partitions)
filtered_lists.collect()
mapPartitions passes the function just one argument (an iterator over the partition), so this generates a lot of errors when run in Spark...
But even if I hard-code the vocabulary inside the function:
def partitionsV(list_of_lists):
    vc = ['D','F','G','C','J','K']
    for w in vc:
        iterator = []
        for sub_list in list_of_lists:
            if w in sub_list:
                iterator.append(sub_list)
        yield (w, len(iterator))
I get this:
filtered_lists = text_1RDD.mapPartitions(partitionsV)
filtered_lists.collect()
output:
[('D', 2),
('F', 0),
('G', 0),
('C', 0),
('J', 0),
('K', 0),
('D', 0),
('F', 0),
('G', 0),
('C', 0),
('J', 0),
('K', 0),
('D', 1),
('F', 0),
('G', 0),
('C', 0),
('J', 0),
('K', 0),
('D', 1),
('F', 0),
('G', 0),
('C', 0),
('J', 0),
('K', 0)]
Obviously, the generator didn't work as expected. I am totally stuck.
I am very new to Spark. I would be grateful if someone could explain what is going on here...

That's yet another word-count problem, and mapPartitions is not the tool for the job:
from operator import add
v = set(['D','F','G','C','J','K'])
result = text_1RDD.flatMap(v.intersection).map(lambda x: (x, 1)).reduceByKey(add)
And the result is:
for x in result.sortByKey().collect():
    print(x)
('C', 15)
('D', 4)
('F', 7)
('G', 5)
('J', 5)
('K', 1)
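A likely explanation for the zeros in the question, offered as a sketch rather than a full diagnosis: mapPartitions hands the function an iterator over one partition, and an iterator can be traversed only once. In partitionsV the inner loop drains it while handling the first word ('D'), so every later word finds nothing left to count. If mapPartitions is still wanted, a single pass over the partition avoids this. The example below simulates partitions with plain Python iterators so that it runs without Spark; the name count_partition and the two sample partitions are assumptions made for illustration.

```python
from collections import Counter

V = ['D', 'F', 'G', 'C', 'J', 'K']

def count_partition(rows, vocabulary=frozenset(V)):
    """Count, in one pass, how many rows of a partition contain each vocabulary word."""
    counts = Counter()
    for row in rows:  # rows may be a one-shot iterator, so it is read only once
        counts.update(vocabulary.intersection(row))
    return iter(counts.items())

# Two one-shot iterators stand in for the partitions Spark would pass in.
part_a = iter([['A', 'B', 'C', 'D', 'E'], ['F', 'E', 'G', 'A', 'B']])
part_b = iter([['D', 'E', 'H', 'A', 'B'], ['A', 'B', 'C', 'F', 'E']])

# Merge the per-partition pairs, which is the job reduceByKey(add) does in Spark.
total = Counter()
for part in (part_a, part_b):
    for word, n in count_partition(part):
        total[word] += n

print(sorted(total.items()))  # [('C', 2), ('D', 2), ('F', 2), ('G', 1)]
```

With an RDD this would presumably be written as text_1RDD.mapPartitions(count_partition).reduceByKey(add), with add imported from operator as in the answer above; that line is not exercised here.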

Related

Filter List of Tuples to Exclude from Another List of Tuples which Contains

(Using Python3)
I have a list of tuples, (of strings)
have = [
('a', 'b', 'c', 'd'), ('a', 'b', 'c', 'e'), ('a', 'b', 'c', 'f'), ('a', 'b', 'c', 'g'), ('a', 'b', 'd', 'e'),
('a', 'b', 'd', 'f'), ('a', 'b', 'd', 'g'), ('a', 'b', 'e', 'f'), ('a', 'b', 'e', 'g'), ('a', 'b', 'f', 'g'),
('a', 'c', 'd', 'e'), ('a', 'c', 'd', 'f'), ('a', 'c', 'd', 'g'), ('a', 'c', 'e', 'f'), ('a', 'c', 'e', 'g'),
('a', 'c', 'f', 'g'), ('a', 'd', 'e', 'f'), ('a', 'd', 'e', 'g'), ('a', 'd', 'f', 'g'), ('a', 'e', 'f', 'g'),
('b', 'c', 'd', 'e'), ('b', 'c', 'd', 'f'), ('b', 'c', 'd', 'g'), ('b', 'c', 'e', 'f'), ('b', 'c', 'e', 'g'),
('b', 'c', 'f', 'g'), ('b', 'd', 'e', 'f'), ('b', 'd', 'e', 'g'), ('b', 'd', 'f', 'g'), ('b', 'e', 'f', 'g'),
('c', 'd', 'e', 'f'), ('c', 'd', 'e', 'g'), ('c', 'd', 'f', 'g'), ('c', 'e', 'f', 'g'), ('d', 'e', 'f', 'g')
]
I also have a list of tuples (also strings) which I want to "exclude"
exclude = [('a', 'd'), ('b', 'c')]
I'm trying to find an efficient way to remove any element in have that contains both the elements in each exclude tuple. Ordering does not matter.
My goal is to return something like this:
[
('a', 'b', 'e', 'f'), ('a', 'b', 'e', 'g'), ('a', 'b', 'f', 'g'), ('a', 'c', 'e', 'f'), ('a', 'c', 'e', 'g'),
('a', 'c', 'f', 'g'), ('a', 'e', 'f', 'g'), ('b', 'd', 'e', 'f'), ('b', 'd', 'e', 'g'), ('b', 'd', 'f', 'g'),
('b', 'e', 'f', 'g'), ('c', 'd', 'e', 'f'), ('c', 'd', 'e', 'g'), ('c', 'd', 'f', 'g'), ('c', 'e', 'f', 'g'),
('d', 'e', 'f', 'g')
]
You could convert the exclude tuples to sets and then, for each element of have, check that none of the excluded sets is a subset of it:
excludeSet = [set(e) for e in exclude]
filteredHave = [h for h in have if not any(e.issubset(h) for e in excludeSet)]
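As a quick check of the subset approach, here is a self-contained sketch. It assumes that have is every 4-letter combination of 'a' to 'g', which is what the list in the question appears to contain; the function name drop_excluded is made up for this example.

```python
from itertools import combinations

def drop_excluded(candidates, exclude):
    """Keep candidates that do not contain every member of any exclude tuple."""
    exclude_sets = [frozenset(e) for e in exclude]
    return [c for c in candidates
            if not any(ex.issubset(c) for ex in exclude_sets)]

# Assumption: rebuilds the question's list instead of typing out all 35 tuples.
have = list(combinations('abcdefg', 4))
exclude = [('a', 'd'), ('b', 'c')]

kept = drop_excluded(have, exclude)
print(len(have), len(kept))  # 35 16
print(kept[0], kept[-1])     # ('a', 'b', 'e', 'f') ('d', 'e', 'f', 'g')
```

Testing issubset directly, rather than relying on the truthiness of the set, also behaves sensibly if an exclude tuple happens to be empty.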

Creating combination on list values

I have a requirement where, say, I have a list:
lis = ['a','b','c','d','e','f']
I now have to create combinations of them, e.g.:
l1: [a], [b,c,d,e,f]
l2: [b], [a,c,d,e,f]
.
.
l10: [a,b,c], [d,e,f]
.
l11: [a,b,c,d], [e,f]
Repeated pairs of left and right nodes should be removed.
E.g. I don't need both of these lists:
l1: [b,c], [a,d,e,f]
l2: [a,d,e,f], [b,c]
since they are the same.
The pseudocode I have in mind is:
for length = 1, I will take one element from the list and club the others
similarly for length = 2, I will take two elements and club the others
up to length = len(list) - 1, I will do the clubbing
and then later remove the duplicates.
Is there a better solution I could try?
This may not be optimal, but it is very straightforward:
from itertools import chain, combinations
def power_set(iterable):
    """Return every non-empty combination of at most half of the elements."""
    source = list(iterable)
    return chain.from_iterable(combinations(source, r) for r in range(1, len(source) // 2 + 1))
def complement(source, universe):
    return tuple(set(universe) - set(source))
lst = ['a', 'b', 'c', 'd', 'e', 'f']
result = set(frozenset({si, complement(si, lst)}) for si in power_set(lst))
for s, c in result:
    print(s, c)
Output
('a', 'd', 'e') ('f', 'c', 'b')
('f', 'a', 'c', 'b') ('d', 'e')
('b', 'e') ('d', 'f', 'a', 'c')
('a', 'b', 'f') ('d', 'e', 'c')
('e', 'd', 'a', 'f', 'b') ('c',)
('c', 'f') ('d', 'a', 'e', 'b')
('d', 'f') ('a', 'e', 'b', 'c')
('d',) ('e', 'c', 'a', 'f', 'b')
('f', 'a', 'e', 'c') ('b', 'd')
('e', 'c', 'd', 'a', 'b') ('f',)
('b', 'c', 'd') ('f', 'a', 'e')
('a', 'b', 'e') ('d', 'f', 'c')
('b', 'c') ('d', 'f', 'a', 'e')
('f', 'a', 'b') ('c', 'd', 'e')
('d', 'e', 'b', 'c') ('a', 'f')
('c', 'd', 'f') ('a', 'e', 'b')
('e', 'c', 'd', 'f', 'b') ('a',)
('a', 'c') ('d', 'f', 'e', 'b')
('f', 'e', 'c') ('a', 'b', 'd')
('a', 'd') ('f', 'e', 'b', 'c')
('b', 'c', 'e') ('d', 'f', 'a')
('a', 'c', 'e') ('d', 'f', 'b')
('d', 'e', 'f') ('a', 'c', 'b')
('a', 'c', 'd') ('f', 'e', 'b')
('d', 'f', 'e', 'c') ('a', 'b')
('f', 'a', 'e', 'b') ('c', 'd')
('d', 'a', 'c') ('b', 'e', 'f')
('a', 'e') ('d', 'f', 'c', 'b')
('a', 'b', 'c') ('d', 'f', 'e')
('a', 'd', 'f') ('e', 'b', 'c')
('d', 'e', 'b') ('a', 'c', 'f')
('c', 'd', 'a', 'f', 'b') ('e',)
('b',) ('e', 'c', 'd', 'a', 'f')
('e', 'f') ('d', 'a', 'c', 'b')
('d', 'c', 'b') ('a', 'e', 'f')
('b', 'f') ('d', 'a', 'e', 'c')
('d', 'a', 'e') ('b', 'c', 'f')
('b', 'd', 'e') ('f', 'a', 'c')
('a', 'e', 'c') ('b', 'd', 'f')
('c', 'e') ('d', 'f', 'a', 'b')
('d', 'a', 'b') ('c', 'e', 'f')
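Note that the output above lists some splits twice, for example ('a', 'd', 'e') with ('f', 'c', 'b') and again ('d', 'a', 'e') with ('b', 'c', 'f'). This is presumably because complement builds its tuple from a set, whose order is arbitrary, so two tuples holding the same letters can compare unequal. A minimal sketch that keys each split on frozensets instead, assuming the list elements are distinct (two_way_splits is a hypothetical name):

```python
from itertools import combinations

def two_way_splits(items):
    """Return each unordered split of items into two non-empty groups exactly once."""
    items = list(items)
    seen = set()
    splits = []
    for r in range(1, len(items) // 2 + 1):
        for left in combinations(items, r):
            right = tuple(x for x in items if x not in left)
            # Order-independent key: {left, right} equals {right, left}.
            key = frozenset((frozenset(left), frozenset(right)))
            if key not in seen:
                seen.add(key)
                splits.append((left, right))
    return splits

splits = two_way_splits(['a', 'b', 'c', 'd', 'e', 'f'])
print(len(splits))  # 31
```

For six distinct elements this gives 6 + 15 + 10 = 31 splits, since each even split of size three would otherwise be counted from both sides.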

Zip lists of tuples

I'm working with data from files, and I have already zipped every column with its info. Now I want to combine info from other files (which I have zipped too), and I don't know how to unzip everything and put it together.
EDIT:
I have a couple of zip objects:
l1 = [('a', 'b'), ('c', 'd')] # list(zippedl1)
l2 = [('e', 'f'), ('g', 'h')] # list(zippedl2)
l3 = [('i', 'j'), ('k', 'm')] # list(zippedl3)
and I want to unzip them like this:
unzipped = [('a', 'c', 'e', 'g', 'i', 'k'), ('b', 'd', 'f', 'h', 'j', 'm')]
I would rather not convert the zipped structures to lists, for memory reasons. I searched and didn't find anything that helps. I hope you can help me, please!
[sorry about my bad english]
I believe you want to zip an unpacked chain:
# Leaving these as zip objects as per your edit
l1 = zip(('a', 'c'), ('b', 'd'))
l2 = zip(('e', 'g'), ('f', 'h'))
l3 = zip(('i', 'k'), ('j', 'm'))
unzipped = [('a', 'c', 'e', 'g', 'i', 'k'), ('b', 'd', 'f', 'h', 'j', 'm')]
You can simply do
from itertools import chain
result = list(zip(*chain(l1, l2, l3)))
# You can also skip list creation if all you need to do is iterate over result:
# for x in zip(*chain(l1, l2, l3)):
#     print(x)
print(result)
print(result == unzipped)
This prints:
[('a', 'c', 'e', 'g', 'i', 'k'), ('b', 'd', 'f', 'h', 'j', 'm')]
True
You need to concatenate the lists first:
>>> l1 = [('a', 'b'), ('c', 'd')]
>>> l2 = [('e', 'f'), ('g', 'h')]
>>> l3 = [('i', 'j'), ('k', 'm')]
>>> list(zip(*(l1 + l2 + l3)))
[('a', 'c', 'e', 'g', 'i', 'k'), ('b', 'd', 'f', 'h', 'j', 'm')]
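For an arbitrary number of zip objects, the same idea can be wrapped in a small helper. This is only a sketch; combine_columns is a hypothetical name. One caveat regarding the memory concern in the question: argument unpacking with * reads every row before zip is called, so the rows are all held in memory at that moment even when the inputs are lazy.

```python
from itertools import chain

def combine_columns(*zipped_sources):
    """Chain row iterables end to end, then transpose rows into columns."""
    rows = chain.from_iterable(zipped_sources)  # lazy: sources are read one after another
    # The * unpacking consumes `rows` completely before zip starts working.
    return list(zip(*rows))

l1 = zip(('a', 'c'), ('b', 'd'))
l2 = zip(('e', 'g'), ('f', 'h'))
l3 = zip(('i', 'k'), ('j', 'm'))

columns = combine_columns(l1, l2, l3)
print(columns)
# [('a', 'c', 'e', 'g', 'i', 'k'), ('b', 'd', 'f', 'h', 'j', 'm')]
```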

Printing values in row order from dictionary/ list

Could someone show me how to get the values from a dictionary in row order, starting from the second row? That is, get the first value from every row; when the rows are finished, move on to the second value from every row, and so on until all the values have been collected (when there are no more columns).
E.g. here is a table:
('E', 'K', 'Y') # <- don't get the values from the first row
('B', 'C', 'B') # start getting values from this (second row)
('C', 'B', 'F')
('F', 'C', 'A')
('C', 'C', 'C')
('B', 'C', 'B')
('E', 'B', 'F')
('B', 'B', 'F')
('D', 'A', 'A')
('A', 'D', 'F')
The table above should print the values:
BCFCBEBDACBCCCBBADBFACBFFAF
I am creating an encoder and decoder program and am stuck on printing the values in the correct order.
Thanks.
If you have all the tuples in a list, you can use zip and join:
>>> l=[('E', 'K', 'Y') ,('B', 'C', 'B') ,('C', 'B', 'F'),('F', 'C', 'A'),('C', 'C', 'C'),('B', 'C', 'B'),('E', 'B', 'F'),('B', 'B', 'F'),('D', 'A', 'A'),('A', 'D', 'F')]
>>> ''.join([''.join(i) for i in zip(*l[1:])])
'BCFCBEBDACBCCCBBADBFACBFFAF'
An alternative way, which gets a list of characters instead of a str:
print([v for r in zip(*l[1:]) for v in r])
#['B', 'C', 'F', 'C', 'B', 'E', 'B', 'D', 'A', 'C', 'B', 'C', 'C', 'C', 'B', 'B', 'A', 'D', 'B', 'F', 'A', 'C', 'B', 'F', 'F', 'A', 'F']
If the data structure is a list of tuples, meaning:
l =[('E', 'K', 'Y'),
('B', 'C', 'B'),
('C', 'B', 'F'),
('F', 'C', 'A'),
('C', 'C', 'C'),
('B', 'C', 'B'),
('E', 'B', 'F'),
('B', 'B', 'F'),
('D', 'A', 'A'),
('A', 'D', 'F')]
Something like this could solve the problem:
string = ""
for i in range(3):
    # start at 1 to skip the first row
    for element in range(1, len(l)):
        string+=l[element][i]
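The column-by-column reading can also be packaged as a function and checked against the string expected in the question. This is a sketch under the assumption that the table is a list of equal-length tuples; read_by_column and its skip parameter are names invented for the example.

```python
from itertools import chain

def read_by_column(rows, skip=1):
    """Join values column by column, ignoring the first `skip` rows."""
    columns = zip(*rows[skip:])  # zip stops at the shortest row
    return ''.join(chain.from_iterable(columns))

table = [
    ('E', 'K', 'Y'),  # first row, skipped
    ('B', 'C', 'B'),
    ('C', 'B', 'F'),
    ('F', 'C', 'A'),
    ('C', 'C', 'C'),
    ('B', 'C', 'B'),
    ('E', 'B', 'F'),
    ('B', 'B', 'F'),
    ('D', 'A', 'A'),
    ('A', 'D', 'F'),
]

encoded = read_by_column(table)
print(encoded)  # BCFCBEBDACBCCCBBADBFACBFFAF
```

Because zip truncates to the shortest row, rows of unequal length would silently lose values; itertools.zip_longest may be a better fit in that case.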

Trying to turn inner and outer tuples into inner and outer lists

Ok, so I've got information in the form of
(('A', 'B', 'C'), ('D', 'E', 'F'), ('H', 'I', 'J'))
and I would like to convert this to
[['A', 'B', 'C'], ['D', 'E', 'F'], ['H', 'I', 'J']]
What is the best/easiest way to do this?
List comprehension:
tpl = (('A', 'B', 'C'), ('D', 'E', 'F'), ('H', 'I', 'J'))
lst = [list(x) for x in tpl]
a = (('A', 'B', 'C'), ('D', 'E', 'F'), ('H', 'I', 'J'))
print(list(map(list, a)))
prints
[['A', 'B', 'C'], ['D', 'E', 'F'], ['H', 'I', 'J']]
>>> data = (('A', 'B', 'C'), ('D', 'E', 'F'), ('H', 'I', 'J'))
>>> [list(tup) for tup in data]
[['A', 'B', 'C'], ['D', 'E', 'F'], ['H', 'I', 'J']]
Here is a simple recursive solution for any number of nested tuples:
>>> tup = (('A', ('B', 'C')), ('D', 'E', 'F', ('H', 'I', 'J')))
>>> listify = lambda x: map(listify, x) if isinstance(x, tuple) else x
>>> listify(tup)
[['A', ['B', 'C']], ['D', 'E', 'F', ['H', 'I', 'J']]]
For Python 3 replace map(listify, x) with list(map(listify, x)).
If you know the structure is only two levels, try:
x = (('A', 'B', 'C'), ('D', 'E', 'F'), ('H', 'I', 'J'))
y = [ list(t) for t in x ]
If there might be deeper nesting, you'll want recursion, as in the recursive answer above.
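For reference, the recursive idea can be written as a named function that behaves the same on Python 3 without wrapping map in list. This is a minimal sketch that converts tuples only and leaves every other type untouched:

```python
def listify(obj):
    """Recursively convert tuples (at any depth) into lists; leave other values as-is."""
    if isinstance(obj, tuple):
        return [listify(item) for item in obj]
    return obj

nested = (('A', ('B', 'C')), ('D', 'E', 'F', ('H', 'I', 'J')))
converted = listify(nested)
print(converted)
# [['A', ['B', 'C']], ['D', 'E', 'F', ['H', 'I', 'J']]]
```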
