PySpark ReduceByKey - python

I have been trying to make it work for a while, but failed every time. I have 2 files. One has a list of names:
Name1
Name2
Name3
Name4
The other is a list of values associated with the names, for each day in the year over several years:
['0.1,0.2,0.3,0.4',
'0.5,0.6,0.7,0.8',
'10,1000,0.2,5000'
...]
The goal is to have output like:
Name1: [0.1, 0.5, 10]
Name2: [0.2, 0.6, 1000]
Name3: [0.3, 0.7, 0.2]
Name4: [0.4, 0.8, 5000]
And then plot a histogram for each. I wrote a mapper that creates a list of tuples, producing the following output (this is an RDD object):
[[('Name1', [0.1]), ('Name2', [0.2]), ('Name3', [0.3]), ('Name4', [0.4])],
 [('Name1', [0.5]), ('Name2', [0.6]), ('Name3', [0.7]), ('Name4', [0.8])],
 [('Name1', [10]), ('Name2', [1000]), ('Name3', [0.2]), ('Name4', [5000])]]
Now I need to concatenate all the values for each name into a single list, but every reduce-by-key that I attempted returns a wrong result.

You can simply loop through each sublist and build a dictionary from it using dict.setdefault(). Example -
>>> ll = [[('Name1', [0.1]), ('Name2', [0.2]), ('Name3', [0.3]), ('Name4', [0.4])],
...       [('Name1', [0.5]), ('Name2', [0.6]), ('Name3', [0.7]), ('Name4', [0.8])],
...       [('Name1', [10]), ('Name2', [1000]), ('Name3', [0.2]), ('Name4', [5000])]]
>>> d = {}
>>> for i in ll:
...     for tup in i:
...         d.setdefault(tup[0], []).extend(tup[1])
...
>>> import pprint
>>> pprint.pprint(d)
{'Name1': [0.1, 0.5, 10],
 'Name2': [0.2, 0.6, 1000],
 'Name3': [0.3, 0.7, 0.2],
 'Name4': [0.4, 0.8, 5000]}
For a PySpark RDD of (name, [value]) pairs, try a simple reduce function that concatenates the lists, such as -
func = lambda x, y: x + y
Then send this in to the reduceByKey method -
rdd.reduceByKey(func)
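For example, a minimal end-to-end sketch (it assumes a SparkContext named sc is available and that the mapper output has been flattened into a single RDD of pairs; the sample data below is illustrative):
# Sketch: one RDD of (name, [value]) pairs, assuming a SparkContext `sc`
pairs = sc.parallelize([
    ('Name1', [0.1]), ('Name2', [0.2]),
    ('Name1', [0.5]), ('Name2', [0.6]),
])
# List concatenation as the reduce function merges the per-key lists
merged = pairs.reduceByKey(lambda x, y: x + y)
print(merged.collect())
# [('Name1', [0.1, 0.5]), ('Name2', [0.2, 0.6])]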
Per comments, the OP actually has a list of RDDs (not a single RDD). In that case you can convert each RDD to a list by calling .collect(), apply the same logic as above, and then decide whether you want the result as a Python dictionary or as an RDD. If you want the latter, call dict.items() to get the key-value pairs and pass them to sc.parallelize. Example -
d = {}
for i in ll:
    c = i.collect()
    for tup in c:
        d.setdefault(tup[0], []).extend(tup[1])

rddobj = sc.parallelize(d.items())
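Alternatively (a sketch of my own, not part of the original answer), you can avoid collecting everything to the driver by unioning the RDDs and reducing inside Spark; sc.union is a standard SparkContext method, and rdd_list stands for the OP's list of RDDs:
combined = sc.union(rdd_list)  # one RDD of (name, [value]) pairs
rddobj = combined.reduceByKey(lambda x, y: x + y)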


How to create key and list of elements from column as values from dataframes

How to create python dictionary using the data below
Df1:

Id    mail-id
1     xyz#gm
1     ygzbb
2     Ghh
2     Hjkk
I want it as
{1:[xyz#gm,ygzbb], 2:[Ghh,Hjkk]}
Something like this?
data = [
    [1, "xyz#gm"],
    [1, "ygzbb"],
    [2, "Ghh"],
    [2, "Hjkk"],
]

dataDict = {}
for k, v in data:
    if k not in dataDict:
        dataDict[k] = []
    dataDict[k].append(v)

print(dataDict)
One option is to groupby the Id column and turn the mail-id into a list in a dictionary comprehension:
{k:v["mail-id"].values.tolist() for k,v in df.groupby("Id")}
One option is to iterate over the set version of the ids and check one by one:
>>> import pandas as pd
>>> df = pd.DataFrame({"Id": [1, 1, 2, 2], "mail-id": ["xyz#gm", "ygzbb", "Ghh", "Hjkk"]})
>>> _d = {}
>>> for x in set(df["Id"]):
...     _d.update({x: df[df["Id"] == x]["mail-id"].tolist()})
But it's much faster to use dictionary comprehension and builtin pandas DataFrame.groupby; a quick look from the Official Documentation:
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)
as @fsimonjetz pointed out, this code will be sufficient:
>>> df = pd.DataFrame({"Id":[1,1,2,2],"mail-id":["xyz#gm","ygzbb","Ghh","Hjkk"]})
>>> {k:v["mail-id"].values.tolist() for k,v in df.groupby("Id")}
You can do:
df.groupby('Id').agg(list).to_dict()['mail-id']
Output:
{1: ['xyz#gm', 'ygzbb'], 2: ['Ghh', 'Hjkk']}

Efficient way to replace values in a column starting from a list of pairs

I'm trying to replace duplicates in my data, and I'm looking for an efficient way to do that.
I have a df with 2 columns, idA and idB, like this:
idA idB
22 5
22 590
5 6000
This is a df with similarities.
I want to create a dictionary in which the key is the id, and the value is a list with all the devices linked to the key. Example:
d[5] = [22, 6000]
d[22] = [5, 590]
What I'm doing is the following:
ids = set(gigi_confirmed['idA'].unique()).union(set(gigi_confirmed['idB'].unique()))
dup_list = list(zip(A_confirmed, B_confirmed))
dict_dup = dict()
for j in ids:
    l1 = []
    for i in range(0, len(dup_list)):
        if j in dup_list[i]:
            l2 = list(dup_list[i])
            l2.remove(j)
            l1.append(l2[0])
    dict_dup[j] = l1
Is it possible to make it more efficiently?
I have to do some guessing here, because your question is not entirely clear, but the way I understand it, you want a dictionary that maps each id appearing in idA or idB to the list of ids found on the other side for that id.
If I understood your problem correctly, I would solve it by directly constructing a dictionary mapping ids to sets of ids.
idA = [22, 22, 5]
idB = [5, 590, 6000]

dict_dup = dict()
for a, b in zip(idA, idB):
    if a not in dict_dup:
        dict_dup[a] = set()
    dict_dup[a].add(b)
    if b not in dict_dup:
        dict_dup[b] = set()
    dict_dup[b].add(a)
After this runs, print(dict_dup) outputs
{22: {5, 590}, 5: {6000, 22}, 590: {22}, 6000: {5}}
which I think is the data structure you're looking for.
By using dicts and sets, this code is very efficient: it runs in time linear in the number of pairs.
Shorter code with defaultdict
You can also make this code a lot shorter by using a defaultdict instead of a regular dict, which will automatically create those empty sets when needed:
from collections import defaultdict

idA = [22, 22, 5]
idB = [5, 590, 6000]

dict_dup = defaultdict(set)
for a, b in zip(idA, idB):
    dict_dup[a].add(b)
    dict_dup[b].add(a)
The print statement produces slightly different output, but it's equivalent:
defaultdict(<class 'set'>, {22: {5, 590}, 5: {6000, 22}, 590: {22}, 6000: {5}})
This still contains the info you want, and is just as efficient as the first solution.
Putting it back in your data frame
Now, if you need to put this information back in your dataframe, you can use dict_dup to retrieve, for each row, the linked ids in constant time.
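For instance (a sketch, assuming the idA/idB column names from the question; the "linked" column name is made up):
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({"idA": [22, 22, 5], "idB": [5, 590, 6000]})

dict_dup = defaultdict(set)
for a, b in zip(df["idA"], df["idB"]):
    dict_dup[a].add(b)
    dict_dup[b].add(a)

# Each row's lookup is a constant-time dict access
df["linked"] = df["idA"].map(lambda x: sorted(dict_dup[x]))
print(df)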

Sum of duplicate values in 2d array

So, I'm sure similar questions have been asked before but I couldn't find quite what I need.
I have a program that outputs a 2D array like the one below:
arr = [[0.2, 3], [0.3, "End"], ...]
There may be more or less elements, but each is a 2-element array, where the first value is a float and the second can be a float or a string.
Both of those values may repeat. In each of those arrays, the second element takes on only a few possible values.
What I want to do is sum the first elements' value within the arrays that have the same value of the second element and output a similar array that does not have those duplicated values.
For example:
input = [[0.4, 1.5], [0.1, 1.5], [0.8, "End"], [0.05, "End"], [0.2, 3.5], [0.2, 3.5]]
output = [[0.5, 1.5], [0.4, 3.5], [0.85, "End"]]
I'd appreciate if the output array was sorted by this second element (floats ascending, strings at the end), although it's not necessary.
EDIT: Thanks for both answers; I've decided to use the one by Chris, because the code was more comprehensible to me, although groupby seems like a function designed to solved this very problem, so I'll try to read up on that, too.
UPDATE: The values of the floats are always positive by the nature of the task at hand, so I used negative values in place of the strings - a few if statements now check for those "encoded" negative values and replace them with strings again just before they're printed out, so sorting is easier.
You could use a dictionary to accumulate the sum of the first value, keyed by the second item.
To get the 'string' items to the end of the list, the sort key can map them to positive infinity, float('inf').
input_ = [[0.4, 1.5], [0.1, 1.5], [0.8, "End"], [0.05, "End"], [0.2, 3.5], [0.2, 3.5]]

d = dict()
for pair in input_:
    d[pair[1]] = d.get(pair[1], 0) + pair[0]

L = []
for k, v in d.items():
    L.append([v, k])

L.sort(key=lambda x: x[1] if type(x[1]) == float else float('inf'))
print(L)
This prints:
[[0.5, 1.5], [0.4, 3.5], [0.8500000000000001, 'End']]
You can try to play with itertools.groupby. Note that groupby only bundles consecutive items, so the input must already be grouped (e.g. sorted) by the key, as it is here:
import itertools

a = [[0.4, 1.5], [0.1, 1.5], [0.8, "End"], [0.05, "End"], [0.2, 3.5], [0.2, 3.5]]
out = [[sum(elt[0] for elt in val), key] for key, val in itertools.groupby(a, key=lambda elt: elt[1])]
# [[0.5, 1.5], [0.8500000000000001, 'End'], [0.4, 3.5]]
Explanation:
Group the 2D list according to the 2nd element of each sublist using itertools.groupby and its key parameter. We define the lambda key=lambda elt: elt[1] to group on the 2nd element:
for key, val in itertools.groupby(a, key=lambda elt: elt[1]):
    print(key, val)

# 1.5 <itertools._grouper object at 0x0000026AD1F6E160>
# End <itertools._grouper object at 0x0000026AD2104EF0>
# 3.5 <itertools._grouper object at 0x0000026AD1F6E160>
For each group, compute the sum of the first elements using the built-in function sum:
for key, val in itertools.groupby(a, key=lambda elt: elt[1]):
    print(sum(elt[0] for elt in val))

# 0.5
# 0.8500000000000001
# 0.4
Compute the desired output:
out = []
for key, val in itertools.groupby(a, key=lambda elt: elt[1]):
    out.append([sum(elt[0] for elt in val), key])

print(out)
# [[0.5, 1.5], [0.8500000000000001, 'End'], [0.4, 3.5]]
You also asked about sorting on the 2nd value, but it mixes strings and numbers, which is a problem: Python can't choose between a number and a string, because sorted objects must be mutually comparable, so a plain sort raises a TypeError.
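One workaround (my own suggestion, mirroring the float('inf') trick from the first answer) is a sort key that sends strings past every number:
out.sort(key=lambda row: row[1] if isinstance(row[1], float) else float('inf'))
# [[0.5, 1.5], [0.4, 3.5], [0.8500000000000001, 'End']]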

Python add values in one list according to names in another list

So, I've got 2 lists
veg_type = ['Urban', 'Urban', 'Forest', 'OpenForest', 'Arboretum']
veg_density = [0.5,0.6,0.1,0,0.9]
I want to add up the veg_density corresponding to the veg_type. So that means that Urban = 1.1 (This is 0.5+0.6)
Forest = 0.1
OpenForest = 0
Arboretum = 0.9
The lists are aligned by index: if Urban appears at position 0, its corresponding veg_density is also at position 0.
Also, I cannot assume that the elements in veg_type are confined to the above examples.
How do I go about solving this?
Using dictionaries (key-to-value maps) will allow you to solve your problem:
veg_type = ["Urban", "Urban", "Forest", "OpenForest", "Arboretum"]
veg_density = [0.5, 0.6, 0.1, 0, 0.9]
type_density = {} #Creates a new dictionary
if len(veg_type) == len(veg_density): #pointed out by #lalengua -- veg_type and veg_density need to have the same length
for i in range(len(veg_type)):
if veg_type[i] not in type_density: #If the veg_type isn't in the dictionary, add it
type_density[veg_type[i]] = 0
type_density[veg_type[i]] += veg_density[i]
This produces:
{'Urban': 1.1, 'Forest': 0.1, 'OpenForest': 0, 'Arboretum': 0.9}
With its values being accessed like so:
type_density['Urban'] = some_value
some_variable = type_density['Forest']  # double quotes can be used as well
A few things about dictionaries:
Dictionaries have keys which correspond to certain values
Keys are unique in a dictionary -- redefining a key will overwrite its value
Keys can be strings, numbers or objects -- anything that can be hashed
Keys need to be in the dictionary in order to apply operations to them
To predefine a dictionary (rather than having it empty), use the following: name = {key1 : value1, key2 : value2, key3 : value3, keyN : valueN}
You can read more about dictionary objects at the official python docs.
One-liner:
print({k: sum(veg_density[i] for i, val in enumerate(veg_type) if val == k) for k in dict.fromkeys(veg_type)})
Output:
{'Urban': 1.1, 'Forest': 0.1, 'OpenForest': 0, 'Arboretum': 0.9}
To explain:
make a dictionary whose keys are the distinct veg_type entries, using dict.fromkeys
in a dictionary comprehension, sum the matching values: iterate through the indexes and values of veg_type with enumerate, keep the positions whose value equals the key k, and add up the veg_density entries at those indexes
Or use pandas:
import pandas as pd
df = pd.DataFrame(list(zip(veg_type, veg_density)))
print(df.groupby(0)[1].sum().to_dict())
Output:
{'Arboretum': 0.90000000000000002, 'Forest': 0.10000000000000001, 'OpenForest': 0.0, 'Urban': 1.1000000000000001}
If you care about decimals:
df=pd.DataFrame(list(zip(veg_type,veg_density)))
print({k:float("%.2f"%v) for k,v in df.groupby(0)[1].sum().to_dict().items()})
Output:
{'Arboretum': 0.9, 'Forest': 0.1, 'OpenForest': 0.0, 'Urban': 1.1}
To explain:
create a DataFrame with pandas from list(zip(veg_type, veg_density)), so column 0 holds the types and column 1 the densities
then group by column 0, which collapses the duplicate types, and take the sum of column 1 within each group; .to_dict() turns the result into the desired mapping
Related:
pandas docs
Note:
Pandas is a library that has to be installed, not a default package
>>> from collections import defaultdict
>>> veg_type = ['Urban','Urban','Forest','OpenForest','Arboretum']
>>> veg_density = [0.5,0.6,0.1,0,0.9]
>>> sums = defaultdict(int)
>>> for name, value in zip(veg_type, veg_density):
...     sums[name] += value
...
>>> sums
defaultdict(<class 'int'>, {'Urban': 1.1, 'Forest': 0.1, 'OpenForest': 0, 'Arboretum': 0.9})
veg_type = ["Urban", "Urban", "Forest", "OpenForest", "Arboretum"]
veg_density = [0.5, 0.6, 0.1, 0, 0.9]
duos = zip(veg_type, veg_density)
result = {} #or dict.fromkeys(set(veg_type), 0)
for i in set(veg_type):
result[i] = sum([d for t, d in duos if t==i])
output:
{'Arboretum': 0.9, 'Forest': 0.1, 'OpenForest': 0, 'Urban': 1.1}
Version in a line:
veg_type = ["Urban", "Urban", "Forest", "OpenForest", "Arboretum"]
veg_density = [0.5, 0.6, 0.1, 0, 0.9]
{ e:sum([d for t, d in zip(veg_type, veg_density) if t==e]) for e in set(veg_type)}
output:
{'Arboretum': 0.9, 'Forest': 0.1, 'OpenForest': 0, 'Urban': 1.1}
You could zip the two lists together and then use groupby to populate your dictionary (note that groupby only merges consecutive equal keys, which is fine here because the duplicate types are adjacent). itemgetter can be replaced with lambda x: x[0]
from itertools import groupby
from operator import itemgetter

z = zip(veg_type, veg_density)
d = {}
for k, g in groupby(z, key=itemgetter(0)):
    d[k] = sum(i[1] for i in g)

# {'Urban': 1.1, 'Forest': 0.1, 'OpenForest': 0, 'Arboretum': 0.9}

matrix list comprehension mean

This is an offshoot of a previous question which started to snowball. If I have a matrix A and I want to use the mean/average of each row [1:] values to create another matrix B, but keep the row headings intact, how would I do this? I've included matrix A, my attempt at cobbling together a list comprehension, and the expected result.
from operator import sum,len
# matrix A with row headings and values
A = [('Apple',0.95,0.99,0.89,0.87,0.93),
     ('Bear',0.33,0.25,0.85,0.44,0.33),
     ('Crab',0.55,0.55,0.10,0.43,0.22)]
#List Comprehension
B = [(A[0],sum,A[1:]/len,A[1:]) for A in A]
Expected outcome
B = [('Apple', 0.926), ('Bear', 0.44), ('Crab', 0.37)]
Your list comprehension looks a little weird. You are using the same variable for the iterable and the item.
This approach seems to work:
def average(lst):
    return sum(lst) / len(lst)

B = [(a[0], average(a[1:])) for a in A]
I've created a function average for readability. It matches your expected values, so I think that's what you want, although your operator import suggests that I may be missing something.
Taking from @recursive and @Steven Rumbalski:
>>> def average(lst):
...     return sum(lst) / len(lst)
...
>>> A = {
...     'Apple': (0.95, 0.99, 0.89, 0.87, 0.93),
...     'Bear': (0.33, 0.25, 0.85, 0.44, 0.33),
...     'Crab': (0.55, 0.55, 0.10, 0.43, 0.22),
... }
>>>
>>> B = [{key: average(values)} for key, values in A.iteritems()]
>>> B
[{'Apple': 0.92599999999999993}, {'Bear': 0.44000000000000006}, {'Crab': 0.37}]
