PySpark: merging values in a nested list

I have a pair-RDD with the structure:
[(key, [(timestring, value)])]
Example:
[("key1", [("20161101", 23), ("20161101", 41), ("20161102", 66),...]),
("key2", [("20161101", 86), ("20161101", 9), ("20161102", 11),...])
...]
I want to process the list for each key, grouping by timestring and calculating the mean of all values with identical timestrings. The above example would become:
[("key1", [("20161101", 32), ..]),
("key2", [("20161101", 47.5),...])
...]
I'm struggling to find a solution using just PySpark methods in one step. Is it at all possible, or do I need some intermediate steps?

You can define a function:
from itertools import groupby
import numpy as np

def mapper(xs):
    return [(k, np.mean([v[1] for v in vs]))
            for k, vs in groupby(sorted(xs), lambda x: x[0])]
And apply it with mapValues:
rdd = sc.parallelize([
("key1", [("20161101", 23), ("20161101", 41), ("20161102", 66)]),
("key2", [("20161101", 86), ("20161101", 9), ("20161102", 11)])
])
rdd.mapValues(mapper)
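For reference, collecting the transformed RDD on the small sample above should yield the per-timestring means (a sketch; the exact float repr may differ between NumPy versions):
rdd.mapValues(mapper).collect()
# [('key1', [('20161101', 32.0), ('20161102', 66.0)]),
#  ('key2', [('20161101', 47.5), ('20161102', 11.0)])]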

Related

Group items in a list and calculate sums

I have a list with weekly figures and need to obtain the grouped totals by month.
The following code does the job, but there should be a more Pythonic way of doing it using the standard library.
The drawback of the code below is that the list needs to be in sorted order.
# Test data (not sorted)
sum_weekly = [('2020/01/05', 59), ('2020/01/19', 88), ('2020/01/26', 95), ('2020/02/02', 89),
              ('2020/02/09', 113), ('2020/02/16', 90), ('2020/02/23', 68), ('2020/03/01', 74), ('2020/03/08', 85),
              ('2020/04/19', 6), ('2020/04/26', 5), ('2020/05/03', 14),
              ('2020/05/10', 5), ('2020/05/17', 20), ('2020/05/24', 28), ('2020/03/15', 56), ('2020/03/29', 5), ('2020/04/12', 2)]

month = sum_weekly[0][0].split('/')[1]
count = 0
out = []
for item in sum_weekly:
    m_sel = item[0].split('/')[1]
    if m_sel != month:
        out.append((month, count))
        count = item[1]
    else:
        count += item[1]
    month = m_sel
out.append((month, count))
# monthly sums output as ('01', 242), ('02', 360), ('03', 220), ('04', 13), ('05', 67)
print(out)
You could use defaultdict to store the result instead of a list. The keys of the dictionary would be the months and you can simply add the values with the same month (key).
Possible implementation:
from collections import defaultdict

# Test data
sum_weekly = [('2020/01/05', 59), ('2020/01/19', 88), ('2020/01/26', 95), ('2020/02/02', 89),
              ('2020/02/09', 113), ('2020/02/16', 90), ('2020/02/23', 68), ('2020/03/01', 74), ('2020/03/08', 85),
              ('2020/03/15', 56), ('2020/03/29', 5), ('2020/04/12', 2), ('2020/04/19', 6), ('2020/04/26', 5),
              ('2020/05/03', 14), ('2020/05/10', 5), ('2020/05/17', 20), ('2020/05/24', 28)]

results = defaultdict(int)
for date, count in sum_weekly:  # tuple unpacking makes it clearer
    month = date.split('/')[1]
    # because results is a defaultdict(int), a missing key is created
    # automatically and initialized to zero
    results[month] += count
print(results)
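If you want the same (month, total) tuples as in the question, you can sort the dictionary items afterwards (a small follow-up to the code above):
print(sorted(results.items()))
# [('01', 242), ('02', 360), ('03', 220), ('04', 13), ('05', 67)]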
You can use itertools.groupby (it is part of the standard library); it does pretty much what you did under the hood (grouping together runs of elements for which the key function gives the same output). It can look like the following:
import itertools

def select_month(item):
    return item[0].split('/')[1]

def get_value(item):
    return item[1]

result = [(month, sum(map(get_value, group)))
          for month, group in itertools.groupby(sorted(sum_weekly), select_month)]
print(result)
Terse, but maybe not that pythonic:
import calendar, collections, datetime, functools

{calendar.month_name[i]: val
 for i, val in functools.reduce(lambda a, b: a + b,
                                [collections.Counter({datetime.datetime.strptime(time, '%Y/%m/%d').month: val})
                                 for time, val in sum_weekly]).items()}
A method using PySpark:
from pyspark import SparkContext
sc = SparkContext()
l = sc.parallelize(sum_weekly)
r = l.map(lambda x: (x[0].split("/")[1], x[1])).reduceByKey(lambda p, q: (p + q)).collect()
print(r) #[('04', 13), ('02', 360), ('01', 242), ('03', 220), ('05', 67)]
You can accomplish this with a pandas DataFrame: first isolate the month, then use groupby(...).sum().
import pandas as pd

sum_weekly = [('2020/01/05', 59), ('2020/01/19', 88), ('2020/01/26', 95), ('2020/02/02', 89),
              ('2020/02/09', 113), ('2020/02/16', 90), ('2020/02/23', 68), ('2020/03/01', 74),
              ('2020/03/08', 85), ('2020/04/19', 6), ('2020/04/26', 5), ('2020/05/03', 14),
              ('2020/05/10', 5), ('2020/05/17', 20), ('2020/05/24', 28), ('2020/03/15', 56),
              ('2020/03/29', 5), ('2020/04/12', 2)]
df = pd.DataFrame(sum_weekly)
df.columns = ['Date', 'Sum']
df['Month'] = df['Date'].str.split('/').str[1]
print(df.groupby('Month').sum())
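Depending on the pandas version, summing while the string Date column is still in the frame may warn or concatenate strings; selecting the numeric column first avoids that (a sketch based on the same df):
print(df.groupby('Month')['Sum'].sum())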

How to filter dictionary by value? [duplicate]

I have a dictionary in the format "site_name": (site_id, frequency):
d=[{'fpdownload2.macromedia.com': (1, 88),
'laposte.net': (2, 23),
'www.laposte.net': (3, 119),
'www.google.com': (4, 5441),
'match.rtbidder.net': (5, 84),
'x2.vindicosuite.com': (6, 37),
'rp.gwallet.com': (7, 88)}]
Is there a smart way to filter the dictionary d by value so that I keep only the entries where the frequency is less than 100? For example:
d=[{'fpdownload2.macromedia.com': (1, 88),
'laposte.net': (2, 23),
'match.rtbidder.net': (5, 84),
'x2.vindicosuite.com': (6, 37),
'rp.gwallet.com': (7, 88)}]
I don't want to use loops; I'm just looking for a smart and efficient solution...
You can use a dictionary comprehension with unpacking for a more Pythonic result:
d=[{'fpdownload2.macromedia.com': (1, 88),
'laposte.net': (2, 23),
'www.laposte.net': (3, 119),
'www.google.com': (4, 5441),
'match.rtbidder.net': (5, 84),
'x2.vindicosuite.com': (6, 37),
'rp.gwallet.com': (7, 88)}]
new_data = [{a:(b, c) for a, (b, c) in d[0].items() if c < 100}]
Output:
[{'laposte.net': (2, 23), 'fpdownload2.macromedia.com': (1, 88), 'match.rtbidder.net': (5, 84), 'x2.vindicosuite.com': (6, 37), 'rp.gwallet.com': (7, 88)}]
You can use a dictionary comprehension to do the filtering:
d = {
'fpdownload2.macromedia.com': (1, 88),
'laposte.net': (2, 23),
'www.laposte.net': (3, 119),
'www.google.com': (4, 5441),
'match.rtbidder.net': (5, 84),
'x2.vindicosuite.com': (6, 37),
'rp.gwallet.com': (7, 88),
}
d_filtered = {
k: v
for k, v in d.items()
if v[1] < 100
}
What you want is a dictionary comprehension. I'll show it with a different example:
d = {'spam': 120, 'eggs': 20, 'ham': 37, 'cheese': 101}
d = {key: value for key, value in d.items() if value >= 100}
If you don't already understand comprehensions, this probably looks like magic that you won't be able to maintain and debug, so I'll show you how to break it out into an explicit loop statement that you should be able to understand easily:
new_d = {}
for key, value in d.items():
    if value >= 100:
        new_d[key] = value
If you can't figure out how to turn that back into the comprehension, just use the statement version until you learn a bit more; it's a bit more verbose, but better to have code you can think through in your head.
Your problem is slightly more complicated, because the values aren't just a number but a tuple of two numbers (so you want to filter on value[1], not value). And because you have a list of one dict rather than just a dict (so you may need to do this for each dict in the list). And of course my filter test isn't the same as yours. But hopefully you can figure it out from here.
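Putting those pieces together for the data in the question, the adaptation might look like this (a sketch, reusing the list d of one dict from above):
# keep only entries whose frequency (second element of the value tuple) is below 100
filtered = [{site: (site_id, freq) for site, (site_id, freq) in inner.items() if freq < 100}
            for inner in d]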

PySpark Distinct List of Each of the Keys from an RDD

I'm sure this is simple, but I keep having issues. I have an RDD with key value pairs. I want a distinct list of just the keys. I'll share the code and examples. Thank you in advance!
RDD Example
>>> rdd4.take(3)
[[(u'11394071', 1), (u'11052103', 1), (u'11052101', 1)], [(u'11847272', 10), (u'999999', 1), (u'11847272', 10)], [(u'af1lowprm1704', 5), (u'am1prm17', 2), (u'af1highprm1704', 2)]]
Tried / Didn't Work
rdd4.distinct().keys()
rdd4.map(lambda x: tuple(sorted(x))).keys().distinct()
[(u'10972402', 1), (u'10716707', 1), (u'11165362', 1)]
Preferred Structure
[u'11394071', u'11052101', '999999', u'11847272', u'am1prm17', u'af1highprm1704']
You can for example:
rdd.flatMap(lambda x: x).keys().distinct()
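To materialize that as a plain Python list like the preferred structure above, you would additionally call collect() (a sketch, using the same rdd4):
rdd4.flatMap(lambda x: x).keys().distinct().collect()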
You can use flatMap to get the keys from inner tuples and then call distinct on the result RDD:
rdd = sc.parallelize([[(u'11394071', 1), (u'11052103', 1), (u'11052101', 1)], [(u'11847272', 10), (u'999999', 1), (u'11847272', 10)], [(u'af1lowprm1704', 5), (u'am1prm17', 2), (u'af1highprm1704', 2)]])
rdd.flatMap(lambda x: [k for k, _ in x]).distinct().collect()
# [u'999999', u'11394071', u'11847272', u'af1highprm1704', u'11052101', u'af1lowprm1704', u'am1prm17', u'11052103']
If you want just the distinct values from the key column, and you have a DataFrame, you can do:
df.select('k').distinct()
If you have only the RDD, you can do
rdd.map(lambda r: r[0]).distinct()
assuming that the key is your left column.

Sum values in tuple (values in dict)

I have a dictionary data that looks like this, with sample values:
defaultdict(<type 'list'>,
{(None, 2014): [(5, 1), (10, 2)],
(u'Middle', 2014): [(6, 2), (11, 3)],
(u'SouthWest', 2015): [(7,3), (12, 4)]})
I get this from collections.defaultdict(list) because my values have to be lists.
My goal is to get a new dictionary that will contain the sum values for every tuple with respect to their position in the tuple.
By running
out = {k:(sum(tup[0] for tup in v),sum(tup[1] for tup in v)) for k,v in data.items()}
I get
{(None, 2014): (15, 3), (u'Middle', 2014): (17, 5), (u'SouthWest', 2015): (19, 7)}
However, I don't know in advance how many items will be in every tuple, so using the sum(tup[0] for tup in v) with hard-coded indices is not an option. I know, however, how many integers will be in the tuple. This value is an integer and I get this along with the data dict. All tuples are always of the same length (in this example, of length 2).
How do I tell Python that I want the out dict to contain tuple of the size that matches the length I have to use?
I think you want the built-in zip function:
In [26]: {k: tuple(sum(x) for x in zip(*v)) for k, v in data.items()}
Out[26]:
{('SouthWest', 2015): (19, 7),
(None, 2014): (15, 3),
('Middle', 2014): (17, 5)}
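The trick is that zip(*v) transposes the list of tuples, so each sum runs over one tuple position no matter how many positions there are. A small illustration:
v = [(5, 1), (10, 2)]
list(zip(*v))                     # [(5, 10), (1, 2)]
tuple(sum(x) for x in zip(*v))    # (15, 3)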

Sort list by nested tuple values

Is there a better way to sort a list by nested tuple values than writing an itemgetter alternative that extracts the nested tuple value:
def deep_get(*idx):
    def g(t):
        for i in idx:
            t = t[i]
        return t
    return g
>>> l = [((2,1), 1),((1,3), 1),((3,6), 1),((4,5), 2)]
>>> sorted(l, key=deep_get(0,0))
[((1, 3), 1), ((2, 1), 1), ((3, 6), 1), ((4, 5), 2)]
>>> sorted(l, key=deep_get(0,1))
[((2, 1), 1), ((1, 3), 1), ((4, 5), 2), ((3, 6), 1)]
I thought about using compose, but that's not in the standard library:
sorted(l, key=compose(itemgetter(1), itemgetter(0)))
Is there something I missed in the libs that would make this code nicer?
The implementation should work reasonably with 100k items.
Context: I would like to sort a dictionary of items that form a histogram. The keys are tuples (a, b) and the value is the count. In the end the items should be sorted by count descending, then by a and b. An alternative is to flatten the tuple and use itemgetter directly, but this way a lot of tuples will be generated.
Yes, you could just use key=lambda x: x[0][1].
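For example, on the list above this gives the same ordering as deep_get(0, 1):
l = [((2, 1), 1), ((1, 3), 1), ((3, 6), 1), ((4, 5), 2)]
sorted(l, key=lambda x: x[0][1])
# [((2, 1), 1), ((1, 3), 1), ((4, 5), 2), ((3, 6), 1)]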
Your approach is quite good, given the data structure that you have.
Another approach would be to use another structure.
If you want speed, the de facto standard NumPy is the way to go. Its job is to efficiently handle large arrays. It even has some nice sorting routines for arrays like yours. Here is how you would write your sort over the counts and then over (a, b):
>>> import numpy
>>> arr = numpy.array([((2, 1), 1), ((1, 3), 1), ((3, 6), 1), ((4, 5), 2)],
...                   dtype=[('pos', [('a', int), ('b', int)]), ('count', int)])
>>> print(numpy.sort(arr, order=['count', 'pos']))
[((1, 3), 1) ((2, 1), 1) ((3, 6), 1) ((4, 5), 2)]
This is very fast (it's implemented in C).
If you want to stick with standard Python, a list containing (count, a, b) tuples would automatically get sorted in the way you want by Python (which uses lexicographic order on tuples).
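A minimal sketch of that idea (my adaptation, negating the count so a plain ascending sort yields count descending, then a, then b):
d = {(2, 1): 1, (1, 3): 1, (3, 6): 1, (4, 5): 2}   # hypothetical histogram
flat = sorted((-count, a, b) for (a, b), count in d.items())
result = [((a, b), -neg) for neg, a, b in flat]
# [((4, 5), 2), ((1, 3), 1), ((2, 1), 1), ((3, 6), 1)]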
I compared two similar solutions. The first one uses a simple lambda:
def sort_one(d):
    result = list(d.items())
    result.sort(key=lambda x: (-x[1], x[0]))
    return result
Note the minus on x[1], because you want the sort to be descending on count.
The second one takes advantage of the fact that sort in Python is stable. First, we sort by (a, b) (ascending). Then we sort by count, descending:
from operator import itemgetter

def sort_two(d):
    result = list(d.items())
    result.sort()
    result.sort(key=itemgetter(1), reverse=True)
    return result
The first one is 10-20% faster (both on small and large datasets), and both complete under 0.5sec on my Q6600 (one core used) for 100k items. So avoiding the creation of tuples doesn't seem to help much.
This might be a little faster version of your approach:
from functools import reduce

l = [((2, 1), 1), ((1, 3), 1), ((3, 6), 1), ((4, 5), 2)]

def deep_get(*idx):
    def g(t):
        return reduce(lambda t, i: t[i], idx, t)
    return g
>>> sorted(l, key=deep_get(0,1))
[((2, 1), 1), ((1, 3), 1), ((4, 5), 2), ((3, 6), 1)]
Which could be shortened to:
def deep_get(*idx):
    return lambda t: reduce(lambda t, i: t[i], idx, t)
or even just simply written-out:
sorted(l, key=lambda t: reduce(lambda t, i: t[i], (0,1), t))