pyspark reduce key being a tuple values nested lists

My problem is the following: I am parsing user interactions, and each time an interaction is detected I emit ((user1,user2),((date1,0),(0,1))). The zeros are there to encode the direction of the interaction.
I cannot figure out why I cannot reduce this output with the following reduce function:
def myFunc2(x1,x2):
    return (min(x1[0][0],x2[0][0]),max(x1[0][0],x2[0][0]),min(x1[0][1],x2[0][1]),max(x1[0][1],x2[0][1]),x1[1][0]+x2[1][0],x1[1][1]+x2[1][1])
The output of my mapper (flatmap(myFunc)) is correct:
((7401899, 5678002), ((1403185440.0, 0), (1, 0)))
((82628194, 22251869), ((0, 1403185452.0), (0, 1)))
((2162276, 98056200), ((1403185451.0, 0), (1, 0)))
((0509420, 4827510), ((1403185449.0, 0), (1, 0)))
((7974923, 9235930), ((1403185450.0, 0), (1, 0)))
((250259, 6876774), ((0, 1403185450.0), (0, 1)))
((642369, 6876774), ((0, 1403185450.0), (0, 1)))
((82628194, 22251869), ((0, 1403185452.0), (0, 1)))
((2162276, 98056200), ((1403185451.0, 0), (1, 0)))
But running
lines.flatMap(myFunc) \
    .map(lambda x: (x[0], x[1])) \
    .reduceByKey(myFunc2)
gives me the error
    return (min(x1[0][0],x2[0][0]),max(x1[0][0],x2[0][0]),min(x1[0][1],x2[0][1]),max(x1[0][1],x2[0][1]),x1[1][0]+x2[1][0],x1[1][1]+x2[1][1])
TypeError: 'int' object has no attribute '__getitem__'
I guess I am messing something up in my keys, but I don't know what (I tried to recast the key to a tuple as said here, but I get the same error).
Any ideas? Thanks a lot.

Okay, I think the problem here is that you are indexing too deep in items that don't go as deep as you think.
Let's examine myFunc2
def myFunc2(x1,x2):
    return (min(x1[0][0],x2[0][0]),max(x1[0][0],x2[0][0]),min(x1[0][1],x2[0][1]),max(x1[0][1],x2[0][1]),x1[1][0]+x2[1][0],x1[1][1]+x2[1][1])
Given your question above, the input data will look like this:
((467401899, 485678002), ((1403185440.0, 0), (1, 0)))
Let's go ahead and assign that data row equal to a variable.
x = ((467401899, 485678002), ((1403185440.0, 0), (1, 0)))
What happens when we run x[0]? We get (467401899, 485678002). When we run x[1]? We get ((1403185440.0, 0), (1, 0)). That's what your map statement is doing, I believe.
Okay. That's clear.
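In your function myFunc2, you have two parameters, x1 and x2. reduceByKey drops the key and hands myFunc2 two values that share that key, so on the first call for a key both x1 and x2 look like x[1] above: x1 = ((1403185440.0, 0), (1, 0)), and x2 has the same shape.
Now let's examine just the first part of your return statement in your function.
min(x1[0][0], x2[0][0])
On that first call, x1[0] = (1403185440.0, 0) and x1[0][0] = 1403185440.0. Cool. But look at what myFunc2 returns: a flat 6-tuple such as (0, 0, 1403185452.0, 1403185452.0, 0, 2). As soon as a key has a third value (or already-reduced results from different partitions get merged), Spark passes that flat result back in as x1. Now, what's x1[0]? Well, that's 0. But wait! What's x1[0][0]? You're trying to get the zeroth index of the item at x1[0], but the item at x1[0] isn't a list or a tuple, it's just an int. And objects of <type 'int'> don't have a method called __getitem__.
To summarize: you're digging too deep into objects that are not nested that deeply. The function you pass to reduceByKey has to return something with the same shape as its inputs, because its output becomes its own input on the next call.
I think the fix is to make the shapes agree: either have your mapper emit flat 6-tuples and index them as x1[0], x1[1], and so on, or have myFunc2 return nested tuples like the ones it receives.
When I run the following with your myFunc2 unchanged, it works, but only because no key appears more than twice. Note that in the output the reduced keys come back as flat 6-tuples while the others keep the nested shape: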
a = sc.parallelize([((7401899, 5678002), ((1403185440.0, 0), (1, 0))),
                    ((82628194, 22251869), ((0, 1403185452.0), (0, 1))),
                    ((2162276, 98056200), ((1403185451.0, 0), (1, 0))),
                    ((1509420, 4827510), ((1403185449.0, 0), (1, 0))),
                    ((7974923, 9235930), ((1403185450.0, 0), (1, 0))),
                    ((250259, 6876774), ((0, 1403185450.0), (0, 1))),
                    ((642369, 6876774), ((0, 1403185450.0), (0, 1))),
                    ((82628194, 22251869), ((0, 1403185452.0), (0, 1))),
                    ((2162276, 98056200), ((1403185451.0, 0), (1, 0)))])
b = a.map(lambda x: (x[0], x[1])).reduceByKey(myFunc2)
b.collect()
[((1509420, 4827510), ((1403185449.0, 0), (1, 0))),
 ((2162276, 98056200), (1403185451.0, 1403185451.0, 0, 0, 2, 0)),
 ((7974923, 9235930), ((1403185450.0, 0), (1, 0))),
 ((7401899, 5678002), ((1403185440.0, 0), (1, 0))),
 ((642369, 6876774), ((0, 1403185450.0), (0, 1))),
 ((82628194, 22251869), (0, 0, 1403185452.0, 1403185452.0, 0, 2)),
 ((250259, 6876774), ((0, 1403185450.0), (0, 1)))]
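As a rough sketch of the shape issue, here is plain Python (no Spark) where functools.reduce stands in for what reduceByKey does to the values of a single key. The names my_func2, to_flat and merge_stats and the flat six-field layout are assumptions made for illustration, not something from the question. The only point is that a reducer's output has to be valid input to the same reducer.

```python
from functools import reduce

def my_func2(x1, x2):
    # The reducer from the question: expects nested input, returns a flat tuple.
    return (min(x1[0][0], x2[0][0]), max(x1[0][0], x2[0][0]),
            min(x1[0][1], x2[0][1]), max(x1[0][1], x2[0][1]),
            x1[1][0] + x2[1][0], x1[1][1] + x2[1][1])

def to_flat(value):
    # Hypothetical helper: reshape ((a, b), (c, d)) into the flat output layout.
    (a, b), (c, d) = value
    return (a, a, b, b, c, d)

def merge_stats(x1, x2):
    # Hypothetical shape-preserving reducer: flat in, flat out.
    return (min(x1[0], x2[0]), max(x1[1], x2[1]),
            min(x1[2], x2[2]), max(x1[3], x2[3]),
            x1[4] + x2[4], x1[5] + x2[5])

# Three values that share one key.
values = [((1403185440.0, 0), (1, 0)),
          ((1403185451.0, 0), (1, 0)),
          ((1403185449.0, 0), (1, 0))]

try:
    reduce(my_func2, values)
    failed = False
except TypeError:
    # The second call receives the flat result of the first call as x1.
    failed = True

merged = reduce(merge_stats, map(to_flat, values))
print(failed)
print(merged)
```

In Spark the same idea would probably look like rdd.mapValues(to_flat).reduceByKey(merge_stats), but I have only checked the plain Python version above.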

Related

Find common union groups among tuples in a set

I need help writing a function that:
takes a set of tuples as input
returns the number of groups of tuples, where tuples that share a number belong to the same group
Example 1:
# input:
{(0, 1), (3, 4), (0, 0), (1, 1), (3, 3), (2, 2), (1, 0)}
# expected output: 3
The expected output is 3, because:
(3,4) and (3,3) contain common numbers, so this counts as 1
(0, 1), (0, 0), (1, 1), and (1, 0) all count as 1
(2, 2) counts as 1
So, 1+1+1 = 3
Example 2:
# input:
{(0, 1), (2, 1), (0, 0), (1, 1), (0, 3), (2, 0), (0, 2), (1, 0), (1, 3)}
# expected output: 1
The expected output is 1, because all tuples are related to other tuples by containing numbers in common.

This may not be the most efficient algorithm for it, but it is simple and looks nice.
from functools import reduce
def unisets(iterables):
    def merge(fsets, fs):
        if not fs: return fsets
        unis = set(filter(fs.intersection, fsets))
        return {reduce(type(fs).union, unis, fs), *fsets-unis}
    return reduce(merge, map(frozenset, iterables), set())
us = unisets({(0,1), (3,4), (0,0), (1,1), (3,3), (2,2), (1,0)})
print(us) # {frozenset({3, 4}), frozenset({0, 1}), frozenset({2})}
print(len(us)) # 3
Features:
Input can be any kind of iterable, whose elements are iterables (any length, mixed types...)
Output is always a well-behaved set of frozensets.

This code works for me, but check it; there may be edge cases.
How about this solution?
def count_groups(marked):
    temp = set(marked)
    save = set()
    for pair in temp:
        if pair[1] in save or pair[0] in save:
            marked.remove(pair)
        else:
            save.add(pair[1])
            save.add(pair[0])
    return len(marked)
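One edge case worth checking: the loop above depends on iteration order. If (0, 1) and (2, 3) are seen before (1, 2), two groups have already been counted by the time the bridging pair shows up. It also removes elements from the set that was passed in. A possible order-independent alternative is a small union-find, sketched below. The name count_groups_uf is made up for this example, and the sketch assumes every element is a 2-tuple.

```python
def count_groups_uf(pairs):
    # Hypothetical helper: count connected groups with a small union-find.
    parent = {}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        # Merge the groups that contain a and b.
        parent[find(a)] = find(b)

    # Each distinct root is one group.
    return len({find(x) for x in parent})

print(count_groups_uf({(0, 1), (3, 4), (0, 0), (1, 1), (3, 3), (2, 2), (1, 0)}))
print(count_groups_uf([(0, 1), (2, 3), (1, 2)]))
```

The first call uses Example 1 from the question; the second is the bridging case described above, given as a list so the order is fixed.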

Generate grid of coordinate tuples

Assume a d-dimensional integer grid, containing n^d (n >= 1) points.
I am trying to write a function that takes the number of domain points n and the number of dimensions d and returns a set that contains all the coordinate points in the grid, as tuples.
Example: intGrid (n=2, dim=2) should return the set:
{(0,0), (0,1), (1,0), (1,1)}
Note: I cannot use numpy or any external imports.

Python has a good set of built-in modules that provide most of the basic functionality you will need to get things done.
One such module is itertools, where you will find all sorts of functions related to iteration and combinatorics. The perfect function for you is product, which you can use as below:
from itertools import product
def grid(n, dim):
    return set(product(range(n), repeat=dim))
print(grid(2, 2))
# {(0, 0), (0, 1), (1, 0), (1, 1)}
print(grid(2, 3))
# {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)}
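itertools is part of the standard library, so it should satisfy the "no external imports" constraint. If even standard-library imports are off the table, the same grid can be built with a plain loop. This is only a sketch, and grid_no_imports is a made-up name:

```python
def grid_no_imports(n, dim):
    # Hypothetical import-free variant: build the points one dimension at a time.
    points = {()}  # one empty tuple: the single point of a 0-dimensional grid
    for _ in range(dim):
        # Extend every partial point with each possible coordinate value.
        points = {p + (k,) for p in points for k in range(n)}
    return points

print(grid_no_imports(2, 2) == {(0, 0), (0, 1), (1, 0), (1, 1)})
print(len(grid_no_imports(3, 4)))
```

Each pass multiplies the number of points by n, so the result has n^d elements, as the question expects.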

How can I add a random binary info into current 'coordinate'? (Python)

This is part of the code I'm working on: (Using Python)
import random
pairs = [
    (0, 1),
    (1, 2),
    (2, 3),
    (3, 0), # I want to treat 0,1,2,3 as some 'coordinate' (or positional information)
]
alphas = [(random.choice([1, -1]) * random.uniform(5, 15), pairs[n]) for n in range(4)]
alphas.sort(reverse=True, key=lambda n: abs(n[0]))
A sample output looks like this:
[(13.747649802587832, (2, 3)),
(13.668274782626717, (1, 2)),
(-9.105374057105703, (0, 1)),
(-8.267840318934667, (3, 0))]
Now I'm wondering whether there is a way to give each of the coordinates 0,1,2,3 a random binary number. For example, if [0,1,2,3] maps to [0,1,1,0] (that is, each coordinate in the left list gets the corresponding random binary number in the right list, so coordinate 0 gets '0', coordinate 1 gets '1', and so on), then the desired output for the sample above looks like:
[(13.747649802587832, (1, 0)),
(13.668274782626717, (1, 1)),
(-9.105374057105703, (0, 1)),
(-8.267840318934667, (0, 0))]
Thanks!!

One way using dict:
d = dict(zip([0,1,2,3], [0,1,1,0]))
[(i, tuple(d[j] for j in c)) for i, c in alphas]
Output:
[(13.747649802587832, (1, 0)),
(13.668274782626717, (1, 1)),
(-9.105374057105703, (0, 1)),
(-8.267840318934667, (0, 0))]

You can create a function to convert each coordinate to the random binary value assigned to it. Using a dictionary within this function makes sense. Something like this should work, where output1 is the first sample output you provide and binary_code would be [0, 1, 1, 0] in your example:
import numpy as np
def convert2bin(original, binary_code):
    # Map each coordinate (its position in binary_code) to its binary value.
    binary_dict = {n: x for n, x in enumerate(binary_code)}
    return tuple(binary_dict[x] for x in original)
binary_code = np.random.randint(2, size=4)
[convert2bin(x[1], binary_code) for x in output1]
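For completeness, here is a sketch of the whole flow using only the standard random module, keeping the alpha value next to the converted pair. The names bits and encoded are made up for this example, and it assumes each coordinate gets an independent, uniformly random bit.

```python
import random

pairs = [(0, 1), (1, 2), (2, 3), (3, 0)]
alphas = [(random.choice([1, -1]) * random.uniform(5, 15), p) for p in pairs]
alphas.sort(reverse=True, key=lambda a: abs(a[0]))

# Draw one random bit per coordinate 0..3 (assumption: independent, uniform bits).
bits = {coord: random.randint(0, 1) for coord in range(4)}

# Replace each coordinate in the pair by its bit, keeping the alpha value.
encoded = [(value, tuple(bits[c] for c in pair)) for value, pair in alphas]
print(bits)
print(encoded)
```

The output changes on every run because both the alpha values and the bits are random.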

How to make this list of statements into a for loop

inp[0][0] = shadow[3][0]
inp[0][3] = shadow[0][0]
inp[3][3] = shadow[0][3]
inp[3][0] = shadow[3][3]
I want to turn this code into a for loop, because this is disgusting! I can't figure out how though.

You are basically picking two sets of coordinates from the series (0, 0), (0, 3), (3, 3), (3, 0), in a ring fashion. You can do so by iteration over that series with an index to use for the second point:
points = [(0, 0), (0, 3), (3, 3), (3, 0)]
for index, (x, y) in enumerate(points, -1):
    shadow_x, shadow_y = points[index]
    inp[x][y] = shadow[shadow_x][shadow_y]
By giving the enumerate() function a starting point of -1 we create an offset that'll find the right matching point in points.
You could also use the zip() function:
points = [(0, 0), (0, 3), (3, 3), (3, 0)]
for (x, y), (shadow_x, shadow_y) in zip(points, [points[-1]] + points):
    inp[x][y] = shadow[shadow_x][shadow_y]
Pick whichever you feel fits your use case best.
Demo (replacing the actual assignment with a print() statement to show what would be executed):
>>> points = [(0, 0), (0, 3), (3, 3), (3, 0)]
>>> for index, (x, y) in enumerate(points, -1):
...     shadow_x, shadow_y = points[index]
...     print(f"inp[{x}][{y}] = shadow[{shadow_x}][{shadow_y}]")
...
inp[0][0] = shadow[3][0]
inp[0][3] = shadow[0][0]
inp[3][3] = shadow[0][3]
inp[3][0] = shadow[3][3]
>>> for (x, y), (shadow_x, shadow_y) in zip(points, [points[-1]] + points):
...     print(f"inp[{x}][{y}] = shadow[{shadow_x}][{shadow_y}]")
...
inp[0][0] = shadow[3][0]
inp[0][3] = shadow[0][0]
inp[3][3] = shadow[0][3]
inp[3][0] = shadow[3][3]

RDD creation and variable binding

I have a very simple code:
def fun(x, n):
    return (x, n)
rdds = []
for i in range(2):
    rdd = sc.parallelize(range(5*i, 5*(i+1)))
    rdd = rdd.map(lambda x: fun(x, i))
    rdds.append(rdd)
a = sc.union(rdds)
print a.collect()
I had expected the output to be the following:
[(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
However, the output is the following:
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
This is bewildering, to say the least.
It seems that, due to lazy evaluation of RDDs, the value of i used to create the RDDs is the one it holds when collect() is called, which is 1 (from the last run of the for loop).
Now, both elements of the tuple are derived from i.
But it seems that for the first element of the tuple i takes the values 0 and 1, while for the second element of the tuple i always has the value 1.
Can somebody please explain what's happening?
Thanks.

Just change
    rdd = rdd.map(lambda x: fun(x, i))
to
    rdd = rdd.map(lambda x, i=i: (x, i))
This is plain Python behavior, not something specific to Spark: a default argument value is evaluated once, when the lambda is defined. See
https://docs.python.org/2.7/tutorial/controlflow.html#default-argument-values
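The same effect can be reproduced without Spark. The sketch below builds two lists of lambdas in a loop, one relying on the loop variable and one freezing it with a default argument. The names late and bound are made up for this example.

```python
# Late binding: each lambda looks up i only when it is called.
late = []
for i in range(2):
    late.append(lambda x: (x, i))

# Early binding: the default argument captures the current value of i.
bound = []
for i in range(2):
    bound.append(lambda x, i=i: (x, i))

late_result = [f(0) for f in late]
bound_result = [f(0) for f in bound]
print(late_result)   # [(0, 1), (0, 1)]
print(bound_result)  # [(0, 0), (0, 1)]
```

Both lambdas in late see i == 1 because the loop has already finished by the time they run, which mirrors what happens when collect() finally triggers the map.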

sc.parallelize() receives its argument eagerly: range(5*i, 5*(i+1)) is evaluated in the driver at the moment of the call, so both values of i (0 and 1) are used.
But in the case of rdd.map(), only the last value of i is used, when you call collect() later.
    rdd = sc.parallelize(range(5*i, 5*(i+1)))
    rdd = rdd.map(lambda x: fun(x, i))
Here rdd.map() won't transform the RDD; it just adds a step to the DAG (Directed Acyclic Graph), i.e. the lambda function is not yet applied to the elements of the RDD.
When you call collect(), the lambda function is finally called, but by that time i has a value of 1. If you reassign i = 10 before calling collect(), then that value of i will be used.
