matrix list comprehension mean - python

This is an offshoot of a previous question which started to snowball. If I have a matrix A and I want to use the mean of each row's values (from index 1 onward) to create another matrix B, while keeping the row headings intact, how would I do this? I've included matrix A, my attempt at cobbling together a list comprehension, and the expected result.
from operator import sum,len
# matrix A with row headings and values
A = [('Apple',0.95,0.99,0.89,0.87,0.93),
('Bear',0.33,0.25,0.85,0.44,0.33),
('Crab',0.55,0.55,0.10,0.43,0.22)]
#List Comprehension
B = [(A[0],sum,A[1:]/len,A[1:]) for A in A]
Expected outcome
B = [('Apple', 0.926), ('Bear', 0.44), ('Crab', 0.37)]

Your list comprehension looks a little weird. You are using the same variable for the iterable and the item.
This approach seems to work:
def average(lst):
    return sum(lst) / len(lst)

B = [(a[0], average(a[1:])) for a in A]
I've created a function average for readability. It matches your expected values, so I think that's what you want, although your attempt to import sum and len from operator suggests that I may be missing something.

Taking from @recursive and @Steven Rumbalski:
>>> def average(lst):
...     return sum(lst) / len(lst)
...
>>> A = {
...     'Apple': (0.95, 0.99, 0.89, 0.87, 0.93),
...     'Bear': (0.33, 0.25, 0.85, 0.44, 0.33),
...     'Crab': (0.55, 0.55, 0.10, 0.43, 0.22),
... }
>>>
>>> B = [{key: average(values)} for key, values in A.iteritems()]
>>> B
[{'Apple': 0.92599999999999993}, {'Bear': 0.44000000000000006}, {'Crab': 0.37}]
(On Python 3, use A.items() instead of A.iteritems().)
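For reference (not part of the original answers), Python 3's statistics module provides mean, which avoids the manual sum/len division; a minimal sketch using the question's data:

```python
import statistics

A = [('Apple', 0.95, 0.99, 0.89, 0.87, 0.93),
     ('Bear', 0.33, 0.25, 0.85, 0.44, 0.33),
     ('Crab', 0.55, 0.55, 0.10, 0.43, 0.22)]

# statistics.mean handles the sum/len division for us;
# round(..., 3) matches the precision of the expected outcome
B = [(row[0], round(statistics.mean(row[1:]), 3)) for row in A]
print(B)  # [('Apple', 0.926), ('Bear', 0.44), ('Crab', 0.37)]
```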


Sum of duplicate values in 2d array

So, I'm sure similar questions have been asked before but I couldn't find quite what I need.
I have a program that outputs a 2D array like the one below:
arr = [[0.2, 3], [0.3, "End"], ...]
There may be more or fewer elements, but each is a 2-element list whose first value is a float and whose second can be a float or a string.
Both of those values may repeat. In each of those arrays, the second element takes on only a few possible values.
What I want to do is sum the first elements across the sublists that share the same second element, and output a similar array without those duplicates.
For example:
input = [[0.4, 1.5], [0.1, 1.5], [0.8, "End"], [0.05, "End"], [0.2, 3.5], [0.2, 3.5]]
output = [[0.5, 1.5], [0.4, 3.5], [0.85, "End"]]
I'd appreciate it if the output array were sorted by this second element (floats ascending, strings at the end), although it's not necessary.
EDIT: Thanks for both answers; I've decided to use the one by Chris, because the code was more comprehensible to me, although groupby seems like a function designed to solved this very problem, so I'll try to read up on that, too.
UPDATE: The values of floats were always positive, by nature of the task at hand, so I used negative values to stop the usage of any strings - now I have a few if statements that check for those "encoded" negative values and replace them with strings again just before they're printed out, so sorting is now easier.
You could use a dictionary to accumulate the sum of the first value in the list keyed by the second item.
To get the 'string' items at the end of the list, their sort key can be set to positive infinity, float('inf').
input_ = [[0.4, 1.5], [0.1, 1.5], [0.8, "End"], [0.05, "End"], [0.2, 3.5], [0.2, 3.5]]
d = dict()
for pair in input_:
    d[pair[1]] = d.get(pair[1], 0) + pair[0]
L = []
for k, v in d.items():
    L.append([v, k])
L.sort(key=lambda x: x[1] if type(x[1]) == float else float('inf'))
print(L)
This prints:
[[0.5, 1.5], [0.4, 3.5], [0.8500000000000001, 'End']]
You can try to play with itertools.groupby (note that groupby only groups consecutive items, so sort the list by the same key first if equal values may not be adjacent):
import itertools
out = [[sum(elt[0] for elt in val), key] for key, val in itertools.groupby(a, key=lambda elt: elt[1])]
>>> [[0.5, 1.5], [0.8500000000000001, 'End'], [0.4, 3.5]]
Explanation:
Group the 2D list by the 2nd element of each sublist using itertools.groupby and the key parameter. We define key=lambda elt: elt[1] to group on the 2nd element:
for key, val in itertools.groupby(a, key=lambda elt: elt[1]):
    print(key, val)
# 1.5 <itertools._grouper object at 0x0000026AD1F6E160>
# End <itertools._grouper object at 0x0000026AD2104EF0>
# 3.5 <itertools._grouper object at 0x0000026AD1F6E160>
For each group, compute the sum using the built-in function sum:
for key, val in itertools.groupby(a, key=lambda elt: elt[1]):
    print(sum(elt[0] for elt in val))
# 0.5
# 0.8500000000000001
# 0.4
Compute the desired output:
out = []
for key, val in itertools.groupby(a, key=lambda elt: elt[1]):
    out.append([sum(elt[0] for elt in val), key])
print(out)
# [[0.5, 1.5], [0.8500000000000001, 'End'], [0.4, 3.5]]
As for sorting on the 2nd value: since it mixes strings and numbers, a plain sort fails because the objects are not mutually comparable; you need a sort key that maps strings to something comparable, such as the float('inf') trick in the other answer.
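Putting both points together, here is a sketch (not from either answer above) that sorts by the second element first, so groupby sees each key as a single run and the strings land at the end:

```python
from itertools import groupby

data = [[0.4, 1.5], [0.1, 1.5], [0.8, "End"], [0.05, "End"], [0.2, 3.5], [0.2, 3.5]]

# Map string labels to +inf so floats sort first and strings sort last;
# after sorting, equal second elements are adjacent, which groupby requires.
def sort_key(pair):
    return pair[1] if isinstance(pair[1], float) else float('inf')

data.sort(key=sort_key)
out = [[sum(first for first, _ in group), key]
       for key, group in groupby(data, key=lambda pair: pair[1])]
print(out)  # floats ascending, the string group at the end
```

This assumes at most one distinct string label; with several, the key would need a tiebreaker so different strings don't interleave.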

Python add values in one list according to names in another list

So, I've got 2 lists
veg_type = [Urban,Urban,Forest,OpenForest,Arboretum]
veg_density = [0.5,0.6,0.1,0,0.9]
I want to add up the veg_density corresponding to the veg_type. So that means that Urban = 1.1 (This is 0.5+0.6)
Forest = 0.1
OpenForest = 0
Arboretum = 0.9
The indexes of veg_density and veg_type line up: if Urban appears at position 0, its corresponding veg_density is also at position 0.
Also, I cannot assume that the elements in veg_type are confined to the above examples.
How do I go about solving this?
Using dictionaries (key-to-value maps) will allow you to solve your problem:
veg_type = ["Urban", "Urban", "Forest", "OpenForest", "Arboretum"]
veg_density = [0.5, 0.6, 0.1, 0, 0.9]
type_density = {}  # Creates a new dictionary
if len(veg_type) == len(veg_density):  # pointed out by @lalengua -- veg_type and veg_density need to have the same length
    for i in range(len(veg_type)):
        if veg_type[i] not in type_density:  # If the veg_type isn't in the dictionary, add it
            type_density[veg_type[i]] = 0
        type_density[veg_type[i]] += veg_density[i]
This produces:
{'Urban': 1.1, 'Forest': 0.1, 'OpenForest': 0, 'Arboretum': 0.9}
With its values being accessed like so:
type_density['Urban'] = ...  # assign some value
some_variable = type_density['Forest']  # double quotes can be used as well
A few things about dictionaries:
Dictionaries have keys which correspond to certain values
Keys are unique in a dictionary -- redefining a key will overwrite its value
Keys can be strings, numbers or objects -- anything that can be hashed
Keys need to be in the dictionary in order to apply operations to them
To predefine a dictionary (rather than having it empty), use the following: name = {key1: value1, key2: value2, key3: value3, keyN: valueN}
You can read more about dictionary objects at the official python docs.
One-liner:
print({k: sum(veg_density[i] for i, val in enumerate(veg_type) if val == k) for k in dict.fromkeys(veg_type)})
Output:
{'Urban': 1.1, 'Forest': 0.1, 'OpenForest': 0, 'Arboretum': 0.9}
To explain:
dict.fromkeys(veg_type) deduplicates the keys while keeping their order
the dictionary comprehension then, for each key k, sums every veg_density[i] whose matching veg_type entry equals k
Or use pandas:
import pandas as pd
df = pd.DataFrame(list(zip(veg_type, veg_density)))
print(df.groupby(0)[1].sum().to_dict())
Output:
{'Arboretum': 0.90000000000000002, 'Forest': 0.10000000000000001, 'OpenForest': 0.0, 'Urban': 1.1000000000000001}
If you care about decimals:
df = pd.DataFrame(list(zip(veg_type, veg_density)))
print({k: float("%.2f" % v) for k, v in df.groupby(0)[1].sum().to_dict().items()})
Output:
{'Arboretum': 0.9, 'Forest': 0.1, 'OpenForest': 0.0, 'Urban': 1.1}
To explain:
create a DataFrame from list(zip(..)) of veg_type and veg_density
then groupby column 0, which merges the duplicate types, and take the sum of column 1 within each group
Related:
pandas docs
Note:
Pandas is a library that has to be installed, not a default package
>>> from collections import defaultdict
>>> veg_type = ['Urban','Urban','Forest','OpenForest','Arboretum']
>>> veg_density = [0.5,0.6,0.1,0,0.9]
>>> sums = defaultdict(int)
>>> for name,value in zip(veg_type,veg_density):
...     sums[name] += value
...
>>> sums
defaultdict(<class 'int'>, {'Urban': 1.1, 'Forest': 0.1, 'OpenForest': 0, 'Arboretum': 0.9})
veg_type = ["Urban", "Urban", "Forest", "OpenForest", "Arboretum"]
veg_density = [0.5, 0.6, 0.1, 0, 0.9]
duos = list(zip(veg_type, veg_density))  # list() so it can be iterated more than once
result = {}  # or dict.fromkeys(set(veg_type), 0)
for i in set(veg_type):
    result[i] = sum(d for t, d in duos if t == i)
output:
{'Arboretum': 0.9, 'Forest': 0.1, 'OpenForest': 0, 'Urban': 1.1}
Version in a line:
veg_type = ["Urban", "Urban", "Forest", "OpenForest", "Arboretum"]
veg_density = [0.5, 0.6, 0.1, 0, 0.9]
{e: sum(d for t, d in zip(veg_type, veg_density) if t == e) for e in set(veg_type)}
output:
{'Arboretum': 0.9, 'Forest': 0.1, 'OpenForest': 0, 'Urban': 1.1}
You could zip the two lists together and then use groupby to populate your dictionary (groupby only groups consecutive runs, so sort by the key first if equal names may not be adjacent). itemgetter(0) can be replaced with lambda x: x[0].
from itertools import groupby
from operator import itemgetter
z = zip(veg_type, veg_density)
d = {}
for k, g in groupby(z, key=itemgetter(0)):
    d[k] = sum(i[1] for i in g)
# {'Urban': 1.1, 'Forest': 0.1, 'OpenForest': 0, 'Arboretum': 0.9}
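For completeness, collections.Counter (a dict subclass whose missing keys default to zero) offers another sketch of the same accumulation, equivalent to the defaultdict(int) answer above:

```python
from collections import Counter

veg_type = ["Urban", "Urban", "Forest", "OpenForest", "Arboretum"]
veg_density = [0.5, 0.6, 0.1, 0, 0.9]

# Missing keys start at 0, so we can accumulate float totals directly
totals = Counter()
for name, density in zip(veg_type, veg_density):
    totals[name] += density
print(dict(totals))
```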

Categorizing data based on value of integer using Python

I have these two variables:
instance = [0.45,6.54,19.0,3.34,2.34]
distance_tolerance = [5.00,10.00,20.00]
I'd like to take each value in instance and categorize it by whether it is less than each value in distance_tolerance, saving each category in a variable.
For example 0.45 < 5.00, so 0.45 goes into the first category; and so on for every value.
Expected result:
data5 = [0.45,3.34,2.34]
data10 = [0.45,6.54,3.34,2.34]
data20 = [0.45,6.54,19.0,3.34,2.34]
I need to do looping for this task since the real data is large.
What is the best way to perform this task? Thanks.
You can simply iterate through and append the values of instance which are lower than the value of distance_tolerance that you're on.
So for some element in distance_tolerance you could have a function called itemsLowerThan(value) which will return an array of the elements in instance that are lower than the value you pass.
For example:
instance = [0.45,6.54,19.0,3.34,2.34]
distance_tolerance = [5.00,10.00,20.00]
def itemsLowerThan(value):
    arr = []
    for item in instance:
        if item < value:
            arr.append(item)
    return arr

for tolerance in distance_tolerance:
    print(itemsLowerThan(tolerance))
Would give the output:
[0.45, 3.34, 2.34]
[0.45, 6.54, 3.34, 2.34]
[0.45, 6.54, 19.0, 3.34, 2.34]
You can use a nested list comprehension.
[[i for i in instance if i < tolerance] for tolerance in distance_tolerance]
Which is equivalent to:
[
[0.45, 3.34, 2.34],
[0.45, 6.54, 3.34, 2.34],
[0.45, 6.54, 19.0, 3.34, 2.34],
]
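Since the question notes the real data is large, one possible optimization (a sketch; note it returns each bucket in sorted order, not the original order) is to sort instance once and slice with bisect:

```python
from bisect import bisect_left

instance = [0.45, 6.54, 19.0, 3.34, 2.34]
distance_tolerance = [5.00, 10.00, 20.00]

# Sort once; each tolerance then costs a single O(log n) binary search
# instead of a full O(n) scan of instance.
ordered = sorted(instance)
buckets = {tol: ordered[:bisect_left(ordered, tol)] for tol in distance_tolerance}
print(buckets[5.00])   # [0.45, 2.34, 3.34]
```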

How to format a list when I print it

values = [[3.5689651969162908, 4.664618442892583, 3.338666695570425],
[6.293153787450157, 1.1285723419142026, 10.923859694586376],
[2.052506259736077, 3.5496423448584924, 9.995488620338277],
[9.41858935127928, 10.034233496516803, 7.070345442417161]]
def flatten(values):
    new_values = []
    for i in range(len(values)):
        for v in range(len(values[0])):
            new_values.append(values[i][v])
    return new_values
v = flatten(values)
print("A 2D list contains:")
print("{}".format(values))
print("The flattened version of the list is:")
print("{}".format(v))
I am flattening the 2D list to 1D, but I can't format it. I know v is a list, and I tried to use a for loop to print it, but I still can't get the result I want. I am wondering whether there are any ways to format the list. I want to print v with two decimal places, like this
[3.57, 4.66, 3.34, 6.29, 1.13, 10.92, 2.05, 3.55, 10.00, 9.42, 10.03, 7.07]
I am using the Eclipse and Python 3.0+.
You could use:
print(["{:.2f}".format(val) for val in v])
Note that you can flatten your list using itertools.chain:
import itertools
v = list(itertools.chain(*values))
I would use the built-in function round(), and while I was about it I would simplify your for loops:
def flatten(values):
    new_values = []
    for i in values:
        for v in i:
            new_values.append(round(v, 2))
    return new_values
How to flatten and transform the list in one line:
[round(x, 2) for b in values for x in b]
It returns the flattened list with each float rounded to two decimal places.
Once you have v you can use a list comprehension like:
formattedList = ["%.2f" % member for member in v]
output was as follows:
['3.57', '4.66', '3.34', '6.29', '1.13', '10.92', '2.05', '3.55', '10.00', '9.42', '10.03', '7.07']
Hope that helps!
You can first flatten the list (as described here) and then use round to solve this:
flat_list = [number for sublist in l for number in sublist]
# All numbers are in the same list now
print(flat_list)
[3.5689651969162908, 4.664618442892583, 3.338666695570425, 6.293153787450157, ..., 7.070345442417161]
rounded_list = [round(number, 2) for number in flat_list]
# The numbers are rounded to two decimals (but still floats)
print(rounded_list)
[3.57, 4.66, 3.34, 6.29, 1.13, 10.92, 2.05, 3.55, 10.0, 9.42, 10.03, 7.07]
This can be written shorter if we put the rounding directly into the list comprehension:
print([round(number, 2) for sublist in l for number in sublist])
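The flattening and rounding can also be combined with itertools.chain.from_iterable, which flattens one level of nesting without the * unpacking; a small sketch on the question's data:

```python
from itertools import chain

values = [[3.5689651969162908, 4.664618442892583, 3.338666695570425],
          [6.293153787450157, 1.1285723419142026, 10.923859694586376],
          [2.052506259736077, 3.5496423448584924, 9.995488620338277],
          [9.41858935127928, 10.034233496516803, 7.070345442417161]]

# chain.from_iterable lazily yields every number from every sublist
rounded = [round(x, 2) for x in chain.from_iterable(values)]
print(rounded)
```

Note that round() keeps floats, so 9.995… prints as 10.0, not 10.00; use string formatting if trailing zeros matter.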

PySpark ReduceByKey

I have been trying to make it work for a while, but failed every time. I have 2 files. One has a list of names:
Name1
Name2
Name3
Name4
The other is list of values associated with names for each day in the year over several years:
['0.1,0.2,0.3,0.4',
'0.5,0.6,0.7,0.8',
'10,1000,0.2,5000'
...]
The goal is to have the output like:
Name1: [0.1,0.5,10]
Name2: [0.2,0.6,1000]
Name3:[0.3,0.7,0.2]
Name4:[0.4,0.8,5000]
And then plot histogram for each. I wrote a mapper that creates a list of tuples that produces the following output (this is an RDD object):
[[('Name1', [0.1]),('Name2', [0.2]),('Name3', [0.3]),('Name4', [0.4])],
[('Name1', [0.5]),('Name2', [0.6]),('Name3', [0.7]),('Name4', [0.8])],
[('Name1', [10]),('Name2', [1000]),('Name3', [0.2]),('Name4', [5000])]]
Now I need to concatenate all values for each name into a single list, but every map-by-key approach I have attempted returns a wrong result.
You can simply loop through each and create a dictionary from it using dict.setdefault(). Example -
>>> import pprint
>>> ll = [[('Name1', [0.1]),('Name2', [0.2]),('Name3', [0.3]),('Name4', [0.4])],
... [('Name1', [0.5]),('Name2', [0.6]),('Name3', [0.7]),('Name4', [0.8])],
... [('Name1', [10]),('Name2', [1000]),('Name3', [0.2]),('Name4', [5000])]]
>>> d = {}
>>> for i in ll:
...     for tup in i:
...         d.setdefault(tup[0],[]).extend(tup[1])
...
>>> pprint.pprint(d)
{'Name1': [0.1, 0.5, 10],
 'Name2': [0.2, 0.6, 1000],
 'Name3': [0.3, 0.7, 0.2],
 'Name4': [0.4, 0.8, 5000]}
For a PySpark RDD of (name, [values]) pairs, try a simple reduce function such as -
func = lambda x, y: x + y
Since the values are lists, + concatenates them. Then pass this to the reduceByKey method -
rdd.reduceByKey(func)
Per comments, the OP actually has a list of RDD objects (not a single RDD); in that case you can convert each RDD to a list by calling .collect(), apply the same logic, and then decide whether you want the result as a Python dictionary or an RDD object. If you want the latter, call dict.items() to get the key-value pairs and pass them to sc.parallelize. Example -
d = {}
for i in ll:
    c = i.collect()
    for tup in c:
        d.setdefault(tup[0],[]).extend(tup[1])
rddobj = sc.parallelize(d.items())
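Outside Spark, the same names-to-columns pairing can be sketched in plain Python: parse each line of values and transpose with zip(*...). The names and lines below are the question's sample data:

```python
names = ["Name1", "Name2", "Name3", "Name4"]
lines = ['0.1,0.2,0.3,0.4',
         '0.5,0.6,0.7,0.8',
         '10,1000,0.2,5000']

# Parse each CSV line into floats, then transpose rows into per-name columns
rows = [[float(x) for x in line.split(',')] for line in lines]
result = {name: list(col) for name, col in zip(names, zip(*rows))}
print(result)  # {'Name1': [0.1, 0.5, 10.0], ...}
```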
