reduceByKey in Spark for adding tuples - Python

Consider an RDD with the dataset below, where 10000241 is the key and the rest are the values:
('10000241', ([0,0,1], [None,None,'RX']))
('10000241', ([0,2,0], [None,'RX','RX']))
('10000241', ([3,0,0], ['RX',None,None]))
pv1 = rdd.reduceByKey(lambda x, y: (
    addtup(x[0], y[0]),
    addtup(x[1], y[1]),
))

def addtup(t1, t2):
    j = ()
    for k, v in enumerate(t1):
        j = j + (t1[k] + t2[k],)
    return j
The final output I want is ('10000241', ((3,2,1), ('RX','RX','RX'))), but I get an error saying I can't add NoneType to NoneType, or NoneType to str. How can I overcome this issue?

If I understood you correctly, you want to sum the numbers in the first list elementwise and apply a logical or elementwise in the second? I think you should rewrite your function as follows:
def addtup(t1, t2):
    left = list(map(lambda x: sum(x), zip(t1[0], t2[0])))
    right = list(map(lambda x: x[0] or x[1], zip(t1[1], t2[1])))
    return (left, right)
Then you can use it like this:
rdd.reduceByKey(addtup)
Here is a demonstration:

import functools

data = (([0,0,1], [None,None,'RX']),
        ([0,2,0], [None,'RX','RX']),
        ([3,0,0], ['RX',None,None]))

functools.reduce(addtup, data)
#=> ([3, 2, 1], ['RX', 'RX', 'RX'])
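To tie this back to Spark, here is a minimal end-to-end sketch (assuming an existing SparkContext named sc; the variable names are illustrative):

rdd = sc.parallelize([
    ('10000241', ([0, 0, 1], [None, None, 'RX'])),
    ('10000241', ([0, 2, 0], [None, 'RX', 'RX'])),
    ('10000241', ([3, 0, 0], ['RX', None, None])),
])
pv1 = rdd.reduceByKey(addtup)  # addtup as defined above
print(pv1.collect())
#=> [('10000241', ([3, 2, 1], ['RX', 'RX', 'RX']))]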

Related

Pandas / Python: Groupby.apply() with function dictionary

I'm trying to implement something like this:
import pandas as pd

def RR(x):
    x['A'] = x['A'] + 1
    return x

def Locked(x):
    x['A'] = x['A'] + 2
    return x

func_mapper = {"RR": RR, "Locked": Locked}
df = pd.DataFrame({'A': [1, 1], 'LookupVal': ['RR', 'Locked'], 'ID': [1, 2]})
df = df.groupby("ID").apply(lambda x: func_mapper[x.LookupVal.first()](x))
The expected output for column A would be 2 and 3,
where x.LookupVal is a column of strings (it will have the same value within each groupby("ID")) that I want to pass as the key to the dictionary lookup.
Any suggestions on how to implement this?
Thanks!
Series.first is not what you think it is: it is meant for time series data and requires an offset parameter. I think you are confusing it with the groupby first method.
You can use iloc[0] to get the first value instead:
df.groupby("ID").apply(lambda x: func_mapper[x.LookupVal.iloc[0]](x))
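As a quick, self-contained check of that fix, here is a minimal sketch (note that some pandas versions may invoke the applied function twice on the first group to choose a code path, so the functions below work on a copy instead of mutating the group in place):

import pandas as pd

def RR(x):
    x = x.copy()  # avoid mutating the group in place
    x['A'] = x['A'] + 1
    return x

def Locked(x):
    x = x.copy()
    x['A'] = x['A'] + 2
    return x

func_mapper = {"RR": RR, "Locked": Locked}
df = pd.DataFrame({'A': [1, 1], 'LookupVal': ['RR', 'Locked'], 'ID': [1, 2]})
out = df.groupby("ID").apply(lambda x: func_mapper[x.LookupVal.iloc[0]](x))
print(out['A'].tolist())  #=> [2, 3]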

How to reduce on a list of tuples in Python

I have an array and I want to count the occurrences of each item in it.
I have managed to use a map function to produce a list of tuples:

def mapper(a):
    return (a, 1)

r = list(map(lambda a: mapper(a), arr))
# output example:
# (11817685, 1), (2014036792, 1), (2014047115, 1), (11817685, 1)
I'm expecting the reduce function to help me group the counts by the first number (the id) in each tuple. For example:

(11817685, 2), (2014036792, 1), (2014047115, 1)

I tried

cnt = reduce(lambda a, b: a + b, r)

and some other ways, but they all don't do the trick.
NOTE: Thanks for all the advice on other ways to solve the problem, but I'm just learning Python and how to implement map-reduce here. I have simplified my real business problem a lot to make it easy to understand, so please kindly show me a correct way of doing map-reduce.
You could use Counter:

from collections import Counter

arr = [11817685, 2014036792, 2014047115, 11817685]
counter = Counter(arr)
print(list(zip(counter.keys(), counter.values())))
EDIT:
As pointed out by @ShadowRanger, Counter has an items() method:

from collections import Counter

arr = [11817685, 2014036792, 2014047115, 11817685]
print(list(Counter(arr).items()))
Instead of using any external module you can use some simple logic and do it without any module:

track = {}
for intr in arr:
    if intr not in track:
        track[intr] = 1
    else:
        track[intr] += 1
For these types of list problems there is a pattern. Suppose you have a list:

a = [(2006,1), (2007,4), (2008,9), (2006,5)]

and you want to convert it to a dict with the first element of each tuple as the key and the second element as the value, something like:

{2008: [9], 2006: [5], 2007: [4]}

But there is a catch: some tuples, such as (2006,1) and (2006,5), have the same key but different values, and you want those values collected under a single key, so the expected output is:

{2008: [9], 2006: [1, 5], 2007: [4]}

For this type of problem, we first create a new dict and then follow this pattern:
if item[0] not in new_dict:
    new_dict[item[0]] = [item[1]]
else:
    new_dict[item[0]].append(item[1])

So we first check whether the key is already in the new dict; if it is, we append the value of the duplicate key to its list.
Full code:

a = [(2006,1), (2007,4), (2008,9), (2006,5)]
new_dict = {}
for item in a:
    if item[0] not in new_dict:
        new_dict[item[0]] = [item[1]]
    else:
        new_dict[item[0]].append(item[1])
print(new_dict)

Output:

{2008: [9], 2006: [1, 5], 2007: [4]}
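For what it's worth, the same grouping pattern can be written with collections.defaultdict, which supplies the empty list automatically; here is a sketch of the equivalent code:

from collections import defaultdict

a = [(2006, 1), (2007, 4), (2008, 9), (2006, 5)]
new_dict = defaultdict(list)
for key, value in a:
    new_dict[key].append(value)  # a missing key starts as an empty list
print(dict(new_dict))
#=> {2006: [1, 5], 2007: [4], 2008: [9]}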
After writing my answer to a different question, I remembered this post and thought it would be helpful to write a similar answer here.
Here is a way to use reduce on your list to get the desired output.
from functools import reduce  # reduce lives in functools on Python 3

arr = [11817685, 2014036792, 2014047115, 11817685]

def mapper(a):
    return (a, 1)

def reducer(x, y):
    # After the first step the accumulator x is already a dict
    if isinstance(x, dict):
        ykey, yval = y
        if ykey not in x:
            x[ykey] = yval
        else:
            x[ykey] += yval
        return x
    # First step: both x and y are (key, 1) tuples, so seed a dict from x
    else:
        xkey, xval = x
        ykey, yval = y
        a = {xkey: xval}
        if ykey in a:
            a[ykey] += yval
        else:
            a[ykey] = yval
        return a

mapred = reduce(reducer, map(mapper, arr))
print(list(mapred.items()))
Which prints (on Python 3, where dicts keep insertion order):

[(11817685, 2), (2014036792, 1), (2014047115, 1)]
Please see the linked answer for a more detailed explanation.
If all you need is cnt, then a dict would probably be better than a list of tuples here (if you need this format, just use dict.items).
The collections module has a useful data structure for this, a defaultdict.
from collections import defaultdict

cnt = defaultdict(int)  # a missing key gets the result of calling int(), i.e. 0
for key in arr:
    cnt[key] += 1  # a missing key is created with its default value first
# cnt_list = list(cnt.items())

How to join array based on position and datatype in Python?

I have a few arrays containing integers and strings. For example:
myarray1 = [1,2,3,"ab","cd",4]
myarray2 = [1,"a",2,3,"bc","cd","e",4]
I'm trying to combine only the strings that sit next to each other in each array. So I want the result to be:
newarray1= [1,2,3,"abcd",4]
newarray2= [1,"a",2,3,"bccde",4]
Does anyone know how to do this? Thank you!
The groupby breaks the list up into runs of strings and runs of non-strings. The conditional expression joins each run of strings into a single string and wraps it in a one-element tuple, while runs of non-strings pass through unchanged. The chain then flattens everything back into one list.
from itertools import groupby, chain

def joinstrings(iterable):
    # isinstance(elem, str) replaces Python 2's basestring check
    return list(chain.from_iterable(
        (''.join(group),) if key else group
        for key, group in
        groupby(iterable, key=lambda elem: isinstance(elem, str))))
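A quick usage check on the inputs from the question:

>>> joinstrings([1, 2, 3, "ab", "cd", 4])
[1, 2, 3, 'abcd', 4]
>>> joinstrings([1, "a", 2, 3, "bc", "cd", "e", 4])
[1, 'a', 2, 3, 'bccde', 4]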
>>> myarray1 = [1,2,3,"ab","cd",4]
>>> newarray1 = [myarray1[0]]
>>> for item in myarray1[1:]:
... if isinstance(item, str) and isinstance(newarray1[-1], str):
... newarray1[-1] = newarray1[-1] + item
... else:
... newarray1.append(item)
>>> newarray1
[1, 2, 3, 'abcd', 4]
The same idea as a one-liner (rewritten for Python 3, which dropped tuple parameters in lambdas):

from functools import reduce
import itertools

reduce(lambda x, g: x + ["".join(g[1])] if g[0] else x + list(g[1]),
       itertools.groupby(myarray1, lambda x: isinstance(x, str)), [])
a = [1, 2, 3, "ab", "cd", 4]
b = [1, "a", 2, 3, "bc", "cd", "e", 4]

def func(a):
    ret = []
    s = ""
    for x in a:
        if isinstance(x, str):
            s = s + x          # accumulate a run of adjacent strings
        else:
            if s:
                ret.append(s)  # flush the accumulated run
                s = ""
            ret.append(x)
    if s:                      # flush a trailing run of strings
        ret.append(s)
    return ret

print(func(a))  #=> [1, 2, 3, 'abcd', 4]
print(func(b))  #=> [1, 'a', 2, 3, 'bccde', 4]

Is there a way to make this code more terse?

Given a list of items and a map from predicate functions to "value" functions, the code below applies each "value" function to the items satisfying the corresponding predicate:
import re

my_re0 = re.compile(r'^([a-z]+)$')
my_re1 = re.compile(r'^([0-9]+)$')
my_map = [
    (my_re0.search, lambda x: x),
    (my_re1.search, lambda x: x),
]

for x in ['abc', '123', 'a1']:
    for p, f in my_map:
        v = p(x)
        if v:
            print(f(v.groups()))
            break
Is there a way to express the same with a single statement?
If I did not have to pass the value returned by the predicate to the "value" function, then I could do:

for x in ['abc', '123', 'a1']:
    print(next((f(x) for p, f in my_map if p(x)), None))
Can something similar be done for the code above? I know it may be better to keep the nested for loops, but I am just curious whether it is possible.
A bit less terse than Nate's ;-)

from itertools import product

comb = product(my_map, ['abc', '123', 'a1'])
mapped = ((p(x), f) for (p, f), x in comb)
groups = (f(v.groups()) for v, f in mapped if v)
print(next(groups), list(groups))  # first match and the rest of them
[f(v.groups()) for x in ['abc','123','a1'] for p, f in my_map for v in [p(x)] if v]
You said more terse, right? ;^)
Here is my version:

for x in ['abc', '123', 'a1']:
    print(next((f(v.groups()) for p, f in my_map for v in [p(x)] if v), None))

This version does not iterate over the whole my_map but stops as soon as the first successful mapping is found.

Python trim last and sort list

I have the list MC below:

MC = [('GGP', '4.653B'), ('JPM', '157.7B'), ('AIG', '24.316B'), ('RX', 'N/A'), ('PFE', '136.6B'), ('GGP', '4.653B'), ('MNKD', '672.3M'), ('ECLP', 'N/A'), ('WYE', 'N/A')]

def fn(number):
    divisors = {'B': 1, 'M': 1000}
    if number[-1] in divisors:
        return float(number[:-1]) / divisors[number[-1]]
    return number

map(fn, MC)
How do I strip the B and M suffixes with fn, and sort the list MC from high to low?
def fn(tup):
    number = tup[1]
    divisors = {'B': 1, 'M': 1000}
    if number[-1] in divisors:
        return (tup[0], float(number[:-1]) / divisors[number[-1]])
    else:
        return tup
The problem is that your original function was meant to run on a string representation of a number, but you were passing it a tuple. So just pull out element 1 of the tuple, then return a tuple consisting of element 0 and the transformed element 1 if it is transformable, or return the tuple unchanged otherwise.
Also, I stuck an else clause in there because I find it more readable; I don't know which is more efficient.
As far as sorting goes, use sorted with a key keyword argument, either:

MC = sorted(map(fn, MC), key=lambda x: x[0])

to sort by ticker, or:

MC = sorted(map(fn, MC), key=lambda x: x[1])

to sort by price. Pass reverse=True to sorted if you want it high to low:

MC = sorted(map(fn, MC), key=lambda x: x[1], reverse=True)
You can find other nifty sorting tips here: http://wiki.python.org/moin/HowTo/Sorting/
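Putting the pieces together: since fn leaves the 'N/A' entries as strings, sorting the converted list by value in Python 3 would raise a TypeError when comparing str to float, so this sketch of a full solution filters them out first (the variable names are illustrative):

converted = [fn(t) for t in MC]
numeric = [t for t in converted if isinstance(t[1], float)]  # drop 'N/A' entries
print(sorted(numeric, key=lambda x: x[1], reverse=True))
#=> [('JPM', 157.7), ('PFE', 136.6), ('AIG', 24.316),
#    ('GGP', 4.653), ('GGP', 4.653), ('MNKD', 0.6723)]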
