Creating combinations of a value list with an existing key - PySpark - python

So my rdd consists of data looking like:
(k, [v1,v2,v3...])
I want to create a combination of all sets of two for the value part.
So the end map should look like:
(k1, (v1,v2))
(k1, (v1,v3))
(k1, (v2,v3))
I know that to get the value part, I would use something like
rdd.cartesian(rdd).filter(case (a, b) => a < b)
However, that operates on the entire RDD (right?), not just the value part. I am unsure how to arrive at my desired end; I suspect it's a groupBy.
Also, ultimately, I want to get to (k, v) pairs looking like
((k1,v1,v2),1)
I know how to get from what I am looking for to that, but maybe it's easier to go straight there?
Thanks.

I think Israel's answer is incomplete, so I go a step further.
import itertools

a = sc.parallelize([
    (1, [1, 2, 3, 4]),
    (2, [3, 4, 5, 6]),
    (3, [-1, 2, 3, 4])
])

def combinations(row):
    l = row[1]
    k = row[0]
    return [(k, v) for v in itertools.combinations(l, 2)]

# map then flatten; equivalent to a.flatMap(combinations)
a.map(combinations).flatMap(lambda x: x).take(3)
# [(1, (1, 2)), (1, (1, 3)), (1, (1, 4))]
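To reach the ((k1, v1, v2), 1) shape the question ultimately asks for, one more map should do it. A minimal sketch, assuming pairs is the (k, (v1, v2)) RDD produced above:
pairs = a.map(combinations).flatMap(lambda x: x)
counted = pairs.map(lambda kv: ((kv[0], kv[1][0], kv[1][1]), 1))
counted.take(3)
# [((1, 1, 2), 1), ((1, 1, 3), 1), ((1, 1, 4), 1)]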

Use itertools to create the combinations. Here is a demo:
import itertools
k, v1, v2, v3 = 'k1 v1 v2 v3'.split()
a = (k, [v1,v2,v3])
b = itertools.combinations(a[1], 2)
data = [(k, pair) for pair in b]
data will be:
[('k1', ('v1', 'v2')), ('k1', ('v1', 'v3')), ('k1', ('v2', 'v3'))]
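The same idea applies per record on an RDD with flatMap, which yields the flattened (k, pair) output in one step. A sketch, assuming rdd holds (k, [v1, v2, ...]) records:
import itertools
pairs = rdd.flatMap(lambda kv: [(kv[0], pair) for pair in itertools.combinations(kv[1], 2)])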

I have made this algorithm, but with higher numbers it looks like it doesn't work, or it's very slow. It will run on a big-data cluster (Cloudera), so I think I have to port the function to PySpark. Please give a hand if you can.
import pandas as pd
import itertools as itts

number_list = [10953, 10423, 10053]

def reducer(nums):
    def ranges(n):
        print(n)
        return range(n, -1, -1)
    num_list = list(map(ranges, nums))
    return list(itts.product(*num_list))

data = pd.DataFrame(reducer(number_list))
print(data)
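As a sketch of how this could be pushed into PySpark (assuming a SparkContext named sc): each range becomes its own RDD, and cartesian builds the product lazily across the cluster. Note that the full product here has over 10^12 rows, so materializing it anywhere is the real bottleneck, whatever the engine.
# hypothetical port: one RDD per range, product via cartesian
r0 = sc.parallelize(range(10953, -1, -1))
r1 = sc.parallelize(range(10423, -1, -1))
r2 = sc.parallelize(range(10053, -1, -1))
triples = (r0.cartesian(r1)
             .cartesian(r2)
             .map(lambda t: (t[0][0], t[0][1], t[1])))  # flatten ((a, b), c)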


Converting 2 list and one string to dictionary

P.S.: Thank you everybody, especially Matthias Fripp. I just reviewed the question; you are right, I made a mistake: the string is the value, not the key.
num=[1,2,3,4,5,6]
pow=[1,4,9,16,25,36]
s= ":subtraction"
dic={1:1 ,0:s , 2:4,2:s, 3:9,6:s, 4:16,12:s.......}
There is easy way to convert two list to dictionary :
newdic=dict(zip(list1,list2))
but for this problem no clue even with comprehension:
print({num[i]:pow[i] for i in range(len(num))})
As others have said, a dict cannot contain duplicate keys. You can fake duplicate keys with a little bit of tweaking. I used OrderedDict to keep the order of inserted keys:
from pprint import pprint
from collections import OrderedDict
num=[1,2,3,4,5,6]
pow=[1,4,9,16,25,36]
pprint(OrderedDict(sum([[[a, b], ['subtraction ({}-{}):'.format(a, b), a - b]] for a, b in zip(num, pow)], [])))
Prints:
OrderedDict([(1, 1),
             ('subtraction (1-1):', 0),
             (2, 4),
             ('subtraction (2-4):', -2),
             (3, 9),
             ('subtraction (3-9):', -6),
             (4, 16),
             ('subtraction (4-16):', -12),
             (5, 25),
             ('subtraction (5-25):', -20),
             (6, 36),
             ('subtraction (6-36):', -30)])
In principle, this would do what you want:
nums = [(n, p) for (n, p) in zip(num, pow)]
diffs = [('subtraction', p-n) for (n, p) in zip(num, pow)]
items = nums + diffs
dic = dict(items)
However, a dictionary cannot have multiple items with the same key, so each of your "subtraction" items will be replaced by the next one added to the dictionary, and you'll only get the last one. So you might prefer to work with the items list directly.
If you need the items list sorted as you've shown, that will take a little more work. Maybe something like this:
items = []
for n, p in zip(num, pow):
    items.append((n, p))
    items.append(('subtraction', p - n))
# the next line will drop most 'subtraction' entries, but on
# Python 3.7+, it will at least preserve the order (not possible
# with earlier versions of Python)
dic = dict(items)

Arrange elements with same count in alphabetical order

Python's collections.Counter.most_common(n) method returns the top n elements with their counts. However, if the counts for two elements are the same, how can I return the result sorted alphabetically?
For example, for a string like BBBAAACCD with n = 2, I want the "2-most common" result to be:
[('A', 3), ('B', 3), ('C', 2)]
and NOT:
[('B', 3), ('A', 3), ('C', 2)]
Notice that although A and B have the same frequency, A comes before B in the result since it comes first alphabetically.
How can I achieve that?
Although this question is already a bit old, I'd like to suggest a very simple solution: sort the input before creating the Counter object. Calling most_common(n) then returns the top n entries sorted in alphabetical order.
from collections import Counter

char_counter = Counter(sorted('ccccbbbbdaef'))
for char in char_counter.most_common(3):
    print(*char)
resulting in the output:
b 4
c 4
a 1
There are two issues here:
Include ties when taking the top n most common values (n refers to distinct counts).
Order any tied elements alphabetically.
None of the solutions thus far address the first issue. You can use a heap queue with the itertools unique_everseen recipe (also available in third-party libraries such as toolz.unique) to calculate the nth largest count.
Then use sorted with a custom key.
from collections import Counter
from heapq import nlargest
from toolz import unique

x = 'BBBAAACCD'
c = Counter(x)
n = 2
nth_largest = nlargest(n, unique(c.values()))[-1]

def sort_key(x):
    return -x[1], x[0]

gen = ((k, v) for k, v in c.items() if v >= nth_largest)
res = sorted(gen, key=sort_key)
# [('A', 3), ('B', 3), ('C', 2)]
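If pulling in toolz is not desirable, the nth largest count can also be computed with the standard library alone, e.g. (a sketch):
nth_largest = sorted(set(c.values()), reverse=True)[:n][-1]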
I would first sort your output array alphabetically and then sort it again by occurrences, which will keep the alphabetical order within each count:
from collections import Counter
alphabetic_sorted = sorted(Counter('BBBAAACCD').most_common(), key=lambda tup: tup[0])
final_sorted = sorted(alphabetic_sorted, key=lambda tup: tup[1], reverse=True)
print(final_sorted[:3])
Output:
[('A', 3), ('B', 3), ('C', 2)]
I would go for:
sorted(Counter('AAABBBCCD').most_common(), key=lambda t: (-t[1], t[0]))
This sorts by count descending (as they already are, which should be more performant) and then by name ascending within each equal-count group.
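For example, slicing the sorted result reproduces the expected output (a quick usage sketch):
from collections import Counter
res = sorted(Counter('AAABBBCCD').most_common(), key=lambda t: (-t[1], t[0]))
print(res[:3])
# [('A', 3), ('B', 3), ('C', 2)]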
This is one of the problems I got in an interview exam and failed to solve. I came home, slept for a while, and the solution came to mind.
from collections import Counter

def bags(items):
    cnt = Counter(items)
    print(cnt)
    order = sorted(cnt.most_common(2), key=lambda i: (i[1], i[0]), reverse=True)
    print(order)
    return order[0][0]

print(bags(['a', 'b', 'c', 'a', 'b']))
s = "BBBAAACCD"
p = [(i,s.count(i)) for i in sorted(set(s))]
**If you are okay with not using the Counter.
from collections import Counter
s = 'qqweertyuiopasdfghjklzxcvbnm'
s_list = list(s)
elements = Counter(s_list).most_common()
print(elements)
alphabet_sort = sorted(elements, key=lambda x: x[0])
print(alphabet_sort)
num_sort = sorted(alphabet_sort, key=lambda x: x[1], reverse=True)
print(num_sort)
If you need a slice:
print(num_sort[:3])
from collections import Counter
print(sorted(Counter('AAABBBCCD').most_common(3)))
This question seems to be a duplicate of: How to sort Counter by value? - python

Call Distinct on 'pyspark.resultiterable.ResultIterable'

I am writing some spark code and I have an RDD which looks like
[(4, <pyspark.resultiterable.ResultIterable at 0x9d32a4c>),
 (1, <pyspark.resultiterable.ResultIterable at 0x9d32cac>),
 (5, <pyspark.resultiterable.ResultIterable at 0x9d32bac>),
 (2, <pyspark.resultiterable.ResultIterable at 0x9d32acc>)]
What I need to do is call distinct on each pyspark.resultiterable.ResultIterable.
I tried this
def distinctHost(a, b):
    p = sc.parallelize(b)
    return (a, p.distinct())

mydata.map(lambda x: distinctHost(*x))
But I get an error:
Exception: It appears that you are attempting to reference
SparkContext from a broadcast variable, action, or transformation.
SparkContext can only be used on the driver, not in code that it run
on workers. For more information, see SPARK-5063.
The error is self-explanatory: I cannot use sc on the workers. But I need to find a way to convert the pyspark.resultiterable.ResultIterable to an RDD so that I can call distinct on it.
A straightforward approach is to use sets:
from numpy.random import choice, seed

seed(323)

keys = (4, 1, 5, 2)
hosts = [
    u'in24.inetnebr.com',
    u'ix-esc-ca2-07.ix.netcom.com',
    u'uplherc.upl.com',
    u'slppp6.intermind.net',
    u'piweba4y.prodigy.com'
]

pairs = sc.parallelize(list(zip(choice(keys, 20), choice(hosts, 20)))).groupByKey()
pairs.map(lambda kv: (kv[0], set(kv[1]))).take(3)
Result:
[(1, {u'ix-esc-ca2-07.ix.netcom.com', u'slppp6.intermind.net'}),
 (2,
  {u'in24.inetnebr.com',
   u'ix-esc-ca2-07.ix.netcom.com',
   u'slppp6.intermind.net',
   u'uplherc.upl.com'}),
 (4, {u'in24.inetnebr.com', u'piweba4y.prodigy.com', u'uplherc.upl.com'})]
If there is a particular reason for using rdd.distinct, you can try something like this:
def distinctHost(pairs, key):
    return (pairs
            .filter(lambda kv: kv[0] == key)
            .flatMap(lambda kv: kv[1])
            .distinct())

[(key, distinctHost(pairs, key).collect()) for key in pairs.keys().collect()]
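Alternatively, if the goal is just unique values per key, deduplicating before grouping avoids building large iterables at all. A sketch, where raw_pairs stands for the hypothetical un-grouped (key, host) RDD that pairs was built from:
# hypothetical: dedupe the raw (key, host) pairs first, then group
deduped = raw_pairs.distinct().groupByKey().mapValues(set)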

Sum second value in tuple for each given first value in tuples using Python

I'm working with a large set of records and need to sum a given field for each customer account to reach an overall account balance. While I can probably put the data in any reasonable form, I figured the easiest would be a list of tuples (cust_id, balance_contribution) as I process through each record. After the round of processing, I'd like to add up the second item for each cust_id, and I am trying to do it without looping through the data thousands of times.
As an example, the input data could look like:
[(1,125.50),(2,30.00),(1,24.50),(1,-25.00),(2,20.00)]
And I want the output to be something like this:
[(1,125.00),(2,50.00)]
I've read other questions where people just wanted to add up the second element of each tuple, using the form sum(i for i, j in a), but that doesn't separate the sums by the first element.
This discussion, python sum tuple list based on tuple first value, puts the values in a list assigned to each key (cust_id) in a dictionary. I suppose I could then figure out how to add up each of those lists?
Any thoughts on a better approach to this?
Thank you in advance.
import collections

def total(records):
    dct = collections.defaultdict(int)
    for cust_id, contrib in records:
        dct[cust_id] += contrib
    return dct.items()
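For the sample input this gives, as a usage sketch (wrapping in dict() since items() returns a view on Python 3):
records = [(1, 125.50), (2, 30.00), (1, 24.50), (1, -25.00), (2, 20.00)]
print(dict(total(records)))
# {1: 125.0, 2: 50.0}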
Would the following code be useful?
in_list = [(1, 125.50), (2, 30.00), (1, 24.50), (1, -25.00), (3, 20.00)]

totals = {}
for uid, x in in_list:
    if uid not in totals:
        totals[uid] = x
    else:
        totals[uid] += x

print(totals)
Output:
{1: 125.0, 2: 30.0, 3: 20.0}
People usually like one-liners in Python:
[(uk,sum([vv for kk,vv in data if kk==uk])) for uk in set([k for k,v in data])]
When
data=[(1,125.50),(2,30.00),(1,24.50),(1,-25.00),(3,20.00)]
The output is
[(1, 125.0), (2, 30.0), (3, 20.0)]
Here's an itertools solution:
from itertools import groupby
>>> x
[(1, 125.5), (2, 30.0), (1, 24.5), (1, -25.0), (2, 20.0)]
>>> sorted(x)
[(1, -25.0), (1, 24.5), (1, 125.5), (2, 20.0), (2, 30.0)]
>>> for a, b in groupby(sorted(x), key=lambda item: item[0]):
...     print(a, sum(item[1] for item in b))
1 125.0
2 50.0

itertools.groupby key func to produce groupings of zero and non-zero values

Does anyone have an idea how I would make use of the key func argument in the itertools.groupby function to group rows of data by zero and non-zero values?
For a simplified example:
from collections import namedtuple
from operator import attrgetter
from itertools import groupby
FakeRow = namedtuple('FakeRow', ['start_date_time', 'wear_sensor',
                                 'part_number', 'chip_count'])

data = [
    FakeRow(1, 1, '999-045', 0),
    FakeRow(2, 1, '999-045', 4),
    FakeRow(3, 1, '999-045', 3),
    FakeRow(3, 1, '999-047', 0),
    FakeRow(4, 1, '999-045', 0),
    FakeRow(5, 1, '999-047', 1),
]

# need to group by start_date_time first
unique_keys = []
groups = []
data = sorted(data, key=attrgetter('start_date_time'))

def my_key_func(row):
    '''Help itertools.groupby group by zeros, or by anything non-zero'''
    pass

# want to group by 'chip_count', but by zero vs. non-zero values
for k, g in groupby(data, key=my_key_func):
    groups.append(list(g))
    unique_keys.append(k)
The desired output would be:
groups == [
    [FakeRow(1, 1, '999-045', 0)],
    [FakeRow(2, 1, '999-045', 4), FakeRow(3, 1, '999-045', 3)],
    [FakeRow(3, 1, '999-047', 0), FakeRow(4, 1, '999-045', 0)],
    [FakeRow(5, 1, '999-047', 1)]
]
Thanks.
It should be as easy as looking at the boolean value of the fake row's chip_count:
def my_key_func(fakerow):
    return bool(fakerow.chip_count)
In this case, your unique_keys will be True or False, which is likely not what you want. You'd probably want to use a set and update it with the fakerow.chip_count values instead:
unique_keys = set()
for k, g in groupby(data, key=my_key_func):
    group = list(g)
    groups.append(group)
    unique_keys.update(fk.chip_count for fk in group)
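Putting the pieces together on the question's sample data reproduces the desired grouping; a quick check (as a sketch):
grouped = [list(g) for _, g in groupby(data, key=my_key_func)]
assert [len(g) for g in grouped] == [1, 2, 2, 1]  # matches the desired output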
