Adding a column to a pandas dataframe based on other columns - python

Problem description
Introductory remark: see below for the code.
Let's say we have a pandas dataframe consisting of 3 columns and 3 rows.
I'd like to add a 4th column called 'Max_LF' that will consist of an array. The value of each cell is determined by looking at the column 'Max_WD'. For the first row that would be 0.35, which is then searched for in the column 'WD', where 0.35 can be found at the third position. Therefore, the third value of the column 'LF' should be written into the column 'Max_LF'. If the value of 'Max_WD' occurs multiple times in 'WD', then all corresponding items of 'LF' should be written into 'Max_LF'.
Failed attempt
So far I made various attempts, first trying to retrieve the index of the 'Max_WD' item within 'WD'. After retrieving the index, the idea was to get the items of 'LF' via that index:
df4['temp_indices'] = [i for i, x in enumerate(df4['WD']) if x == df4['Max_WD']]
However, a ValueError occurred, since x == df4['Max_WD'] compares a single row's value against the entire column:
raise ValueError('Lengths must match to compare')
ValueError: Lengths must match to compare
This is what the example dataframe looks like:
df = pd.DataFrame(data={'LF': [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]] , 'WD': [[0.28, 0.34, 0.35, 0.18], [0.42, 0.45, 0.45, 0.18], [0.31, 0.21, 0.41, 0.41]], 'Max_WD': [0.35, 0.45, 0.41]})
The expected outcome should look like
df=pd.DataFrame(data={'LF': [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]] , 'WD': [[0.28, 0.34, 0.35, 0.18], [0.42, 0.45, 0.45, 0.18], [0.31, 0.21, 0.41, 0.41]], 'Max_WD': [0.35, 0.45, 0.41], 'Max_LF': [[3] ,[2,3], [3,4]]})

You could get it by simply using a lambda as follows (note this collects the 1-based positions of the matches, which coincide with the expected 'Max_LF' values here because 'LF' is [1, 2, 3, 4]):
df['Max_LF'] = df.apply(lambda x : [i + 1 for i, e in enumerate(x['WD']) if e == x['Max_WD']], axis=1)
The output is:
             LF  Max_WD                        WD  Max_LF
0  [1, 2, 3, 4]    0.35  [0.28, 0.34, 0.35, 0.18]     [3]
1  [1, 2, 3, 4]    0.45  [0.42, 0.45, 0.45, 0.18]  [2, 3]
2  [1, 2, 3, 4]    0.41  [0.31, 0.21, 0.41, 0.41]  [3, 4]

Thanks guys! With your help I was able to solve my problem.
Like Prince Francis suggested, I first did
df['temp'] = df.apply(lambda x : [i for i, e in enumerate(x['WD']) if e == x['Max_WD']], axis=1)
to get the indices of the 'Max_WD' value within 'WD'. In a second step I could then add the actual column 'Max_LF' by doing
df['Max_LF'] = df.apply(lambda x: [x['LF'][e] for e in x['temp']], axis=1)
Thanks a lot guys!
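For what it's worth, the two steps can also be folded into a single apply; a minimal sketch using the question's example frame:

```python
import pandas as pd

df = pd.DataFrame(data={'LF': [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]],
                        'WD': [[0.28, 0.34, 0.35, 0.18],
                               [0.42, 0.45, 0.45, 0.18],
                               [0.31, 0.21, 0.41, 0.41]],
                        'Max_WD': [0.35, 0.45, 0.41]})

# One apply that finds the matching indices and picks the 'LF' items in one go
df['Max_LF'] = df.apply(
    lambda x: [x['LF'][i] for i, e in enumerate(x['WD']) if e == x['Max_WD']],
    axis=1)
print(df['Max_LF'].tolist())  # [[3], [2, 3], [3, 4]]
```

This avoids keeping the intermediate 'temp' column around.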

You can achieve it by applying a function over axis 1.
For this, I recommend first converting the WD list into a pd.Series (or a numpy.ndarray) and then comparing all the values at once.
Assuming that you want a list of all the values higher than the threshold, you could use this:
>>> def get_max_wd(x):
...     wd = pd.Series(x.WD)
...     return list(wd[wd >= x.Max_WD])
...
>>> df.apply(get_max_wd, axis=1)
0 [0.35]
1 [0.45, 0.45]
2 [0.41, 0.41]
dtype: object
The result of the apply can then be assigned as a new column into the dataframe:
df['Max_LF'] = df.apply(get_max_wd, axis=1)
If what you are after is only the maximum value (see my comment above), you can use the max() method within the function.
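If instead the goal is the matching 'LF' entries rather than the 'WD' values themselves, the same all-at-once comparison can be used as a boolean mask; a sketch, reusing the question's frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'LF': [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]],
                        'WD': [[0.28, 0.34, 0.35, 0.18],
                               [0.42, 0.45, 0.45, 0.18],
                               [0.31, 0.21, 0.41, 0.41]],
                        'Max_WD': [0.35, 0.45, 0.41]})

def get_max_lf(x):
    # Boolean mask over WD, used to index the LF list
    mask = np.array(x.WD) == x.Max_WD
    return [int(v) for v in np.array(x.LF)[mask]]

df['Max_LF'] = df.apply(get_max_lf, axis=1)
print(df['Max_LF'].tolist())  # [[3], [2, 3], [3, 4]]
```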


Add list of numbers to an empty list using for loop in Python

I recently hit a wall with a somewhat simple thing, but no matter what I try, I am unable to solve it.
I created a small function that calculates some values and returns a list as an output value
def calc(file):
    # some calculation based on file
    return degradation  # a list
for example, for file "data1.txt"
degradation = [1,0.9,0.8,0.5]
and for file "data2.txt"
degradation = [1,0.8,0.6,0.2]
Since I have several files on which I want to apply calc(), I wanted to connect the results sideways into an array with len(degradation) rows and as many columns as there are files. I was planning to do it with a for loop.
For this specific case something like:
output = 1  , 1
         0.9, 0.8
         0.8, 0.6
         0.5, 0.2
I tried with pandas as well, but without success.
import numpy as np
arr2d = np.array([[1, 2, 3, 4]])
arr2d = np.append(arr2d, [[9, 8, 7, 6]], axis=0).T
expect an output something like this:
array([[1, 9],
[2, 8],
[3, 7],
[4, 6]])
You can use numpy.hstack() to achieve this.
Imagine you have data from the first two files from the first two iterations of the for loop.
data1.txt gives you
degradation1 = [1,0.9,0.8,0.5]
and data2.txt gives you
degradation2 = [1,0.8,0.6,0.2]
First, you have to convert both lists into lists of lists.
degradation1 = [[i] for i in degradation1]
degradation2 = [[i] for i in degradation2]
This gives the outputs,
print(degradation1)
print(degradation2)
[[1], [0.9], [0.8], [0.5]]
[[1], [0.8], [0.6], [0.2]]
Now you can stack the data using the numpy.hstack().
import numpy
stacked = numpy.hstack((degradation1, degradation2))
This gives the output
array([[1. , 1. ],
[0.9, 0.8],
[0.8, 0.6],
[0.5, 0.2]])
Imagine you have the file data3.txt during the 3rd iteration of the for loop and it gives
degradation3 = [1,0.3,0.6,0.4]
You can follow the same steps as above and stack it with stacked: convert to a list of lists, then stack.
degradation3 = [[i] for i in degradation3]
stacked = numpy.hstack((stacked, degradation3))
This gives you the output
array([[1. , 1. , 1. ],
[0.9, 0.8, 0.3],
[0.8, 0.6, 0.6],
[0.5, 0.2, 0.4]])
You can continue this for the whole loop.
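The whole loop can also be sketched with numpy.column_stack, which stacks 1-D lists as columns directly and skips the manual list-of-lists conversion; calc here is a stand-in for the question's function:

```python
import numpy as np

# Stand-in for the question's calc(); returns a degradation list per file
def calc(file):
    data = {'data1.txt': [1, 0.9, 0.8, 0.5],
            'data2.txt': [1, 0.8, 0.6, 0.2],
            'data3.txt': [1, 0.3, 0.6, 0.4]}
    return data[file]

files = ['data1.txt', 'data2.txt', 'data3.txt']
# Each calc() result becomes one column of the stacked array
stacked = np.column_stack([calc(f) for f in files])
print(stacked.shape)  # (4, 3)
```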
Assume my_lists is a list of your lists.
my_lists = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400]]
result = []
for _ in my_lists[0]:
    result.append([])
for l in my_lists:
    for i in range(len(result)):
        result[i].append(l[i])
for line in result:
    print(line)
The output would be
[1, 10, 100]
[2, 20, 200]
[3, 30, 300]
[4, 40, 400]
As you seem to want to work with lists:
## degradations as list
degradation1 = [1,0.8,0.6,0.2]
degradation2 = [1,0.9,0.8,0.5]
degradation3 = [0.7,0.9,0.8,0.5]
degradations = [degradation1, degradation2, degradation3]
## CORE OF THE ANSWER ##
degradationstransposed = [list(i) for i in zip(*degradations)]
print(degradationstransposed)
[[1, 1, 0.7], [0.8, 0.9, 0.9], [0.6, 0.8, 0.8], [0.2, 0.5, 0.5]]

Python: Group same integer values, then average

I have a huge array of data and I would like to do subgroups for the values for same integers and then take their average.
For example:
a = [0, 0.5, 1, 1.5, 2, 2.5]
I want to take sub groups as follows:
[0, 0.5] [1, 1.5] [2, 2.5]
... and then take the average and put all the averages in a new array.
Assuming you want to group by the number's integer value (so the number rounded down), something like this could work:
>>> import itertools
>>> a = [0, 0.5, 1, 1.5, 2, 2.5]
>>> groups = [list(g) for _, g in itertools.groupby(a, int)]
>>> groups
[[0, 0.5], [1, 1.5], [2, 2.5]]
Then averaging becomes:
>>> [sum(grp) / len(grp) for grp in groups]
[0.25, 1.25, 2.25]
This assumes a is already sorted, as in your example.
Ref: itertools.groupby, list comprehensions.
If you have no problem using additional libraries:
import pandas as pd
import numpy as np
a = [0, 0.5, 1, 1.5, 2, 2.5]
print(pd.Series(a).groupby(np.array(a, dtype=np.int32)).mean())
Gives:
0 0.25
1 1.25
2 2.25
dtype: float64
If you want an approach with dictionary, you can go ahead like this:
dic = {}
a = [0, 0.5, 1, 1.5, 2, 2.5]
for items in a:
    if int(items) not in dic:
        dic[int(items)] = []
    dic[int(items)].append(items)
print(dic)
for items in dic:
    dic[items] = sum(dic[items]) / len(dic[items])
print(dic)
You can use groupby to easily get that (you might need to sort the list first):
from itertools import groupby
from statistics import mean
a = [0, 0.5, 1, 1.5, 2, 2.5]
for k, group in groupby(a, key=int):
    print(mean(group))
Will give:
0.25
1.25
2.25
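One caveat worth showing: groupby only merges consecutive runs, so an unsorted input must be sorted first. A small sketch with the question's values shuffled:

```python
from itertools import groupby
from statistics import mean

a = [1, 1.5, 0, 0.5, 2, 2.5]  # same values as the question, but unsorted
# Sorting first guarantees equal-int values sit next to each other
averages = [mean(g) for _, g in groupby(sorted(a), key=int)]
print(averages)  # [0.25, 1.25, 2.25]
```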

Group list elements using pandas in python [duplicate]

This question already has answers here:
Group Python List Elements
(4 answers)
Closed 6 years ago.
I have a python list as follows:
my_list =
[[25, 1, 0.65],
[25, 3, 0.63],
[25, 2, 0.62],
[50, 3, 0.65],
[50, 2, 0.63],
[50, 1, 0.62]]
I want to order them according to this rule:
1 --> [0.65, 0.62] <--25, 50
2 --> [0.62, 0.63] <--25, 50
3 --> [0.63, 0.65] <--25, 50
So the expected result is as follows:
Result = [[0.65, 0.62],[0.62, 0.63],[0.63, 0.65]]
I tried as follows:
import pandas as pd
df = pd.DataFrame(my_list,columns=['a','b','c'])
res = df.groupby(['b', 'c']).get_group('c')
print res
ValueError: must supply a tuple to get_group with multiple grouping keys
How to do it guys?
Here is a pandas solution: sort the list by the first column, group by the second column, and convert the third column to a list. If you prefer the result to be a list of lists, use the tolist() method afterwards:
df = pd.DataFrame(my_list, columns=list('ABC'))
s = df.sort_values('A').groupby('B').C.apply(list)
#B
#1 [0.65, 0.62]
#2 [0.62, 0.63]
#3 [0.63, 0.65]
#Name: C, dtype: object
The above method obtains a pandas series:
To get a list of lists:
s.tolist()
# [[0.65000000000000002, 0.62], [0.62, 0.63], [0.63, 0.65000000000000002]]
To get a numpy array of lists:
s.values
# array([[0.65000000000000002, 0.62], [0.62, 0.63],
# [0.63, 0.65000000000000002]], dtype=object)
s.values[0]
# [0.65000000000000002, 0.62] # here each element in the array is still a list
To get a 2D array or a matrix, you can transform the data frame in a different way, i.e. pivot your original data frame to wide format and then convert it to a 2D array (as_matrix() has since been removed from pandas; to_numpy() is the current equivalent):
df.pivot(index='B', columns='A', values='C').to_numpy()
# array([[ 0.65, 0.62],
# [ 0.62, 0.63],
# [ 0.63, 0.65]])
Or:
np.array(s.tolist())
# array([[ 0.65, 0.62],
# [ 0.62, 0.63],
# [ 0.63, 0.65]])
Here is another way, since it seems in your question you were trying to use get_group():
g = [1, 2, 3]
result = []
for i in g:
    lst = df.groupby('b')['c'].get_group(i).tolist()
    result.append(lst)
print(result)
[[0.65, 0.62], [0.62, 0.63], [0.63, 0.65]]

Sum and aggregate over a list with equal values

I have a pair of lists of the same length; the first contains int values and the second contains float values. I wish to replace these with another pair of lists, possibly shorter but still of equal length, in which the first list contains only unique values and the second contains the corresponding sums. That is, if the i-th element of the new first list is x, and x appears at indices i_1, ..., i_k of the original first list, then the i-th element of the new second list should be the sum of the values at indices i_1, ..., i_k in the original second list.
An example will clarify.
Input:
([1, 2, 2, 1, 1, 3], [0.1, 0.2, 0.3, 0.4, 0.5, 1.0])
Output:
([1, 2, 3], [1.0, 0.5, 1.0])
I was trying to do some list comprehension trick here but failed. I can write a silly loop function for that, but I believe there should be something much nicer here.
Not a one-liner, but since you haven't posted your own solution, I'll suggest one using collections.OrderedDict:
>>> from collections import OrderedDict
>>> a, b = ([1, 2, 2, 1, 1, 3], [0.1, 0.2, 0.3, 0.4, 0.5, 1.0])
>>> d = OrderedDict()
>>> for k, v in zip(a, b):
...     d[k] = d.get(k, 0) + v
...
>>> d.keys(), d.values()
([1, 2, 3], [1.0, 0.5, 1.0])
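On Python 3.7+ a plain dict preserves insertion order too, so the same idea works without OrderedDict:

```python
a, b = [1, 2, 2, 1, 1, 3], [0.1, 0.2, 0.3, 0.4, 0.5, 1.0]
d = {}
for k, v in zip(a, b):
    # Accumulate a running sum per key; first-seen key order is kept
    d[k] = d.get(k, 0) + v
print(list(d.keys()), list(d.values()))  # [1, 2, 3] [1.0, 0.5, 1.0]
```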
Of course if order doesn't matter then it's better to use collections.defaultdict:
>>> from collections import defaultdict
>>> a, b = ([1, 'foo', 'foo', 1, 1, 3], [0.1, 0.2, 0.3, 0.4, 0.5, 1.0])
>>> d = defaultdict(int)
>>> for k, v in zip(a, b):
...     d[k] += v
...
>>> d.keys(), d.values()
([3, 1, 'foo'], [1.0, 1.0, 0.5])
One way to go is using pandas:
>>> import pandas as pd
>>> df = pd.DataFrame({'tag': [1, 2, 2, 1, 1, 3],
...                    'val': [0.1, 0.2, 0.3, 0.4, 0.5, 1.0]})
>>> df
tag val
0 1 0.1
1 2 0.2
2 2 0.3
3 1 0.4
4 1 0.5
5 3 1.0
>>> df.groupby('tag')['val'].aggregate('sum')
tag
1 1.0
2 0.5
3 1.0
Name: val, dtype: float64
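To get back the pair of lists the question asks for, the groupby result's index and values can be unpacked; a sketch:

```python
import pandas as pd

tags = [1, 2, 2, 1, 1, 3]
vals = [0.1, 0.2, 0.3, 0.4, 0.5, 1.0]
# Group the values Series by the raw tag list and sum each group
summed = pd.Series(vals).groupby(tags).sum()
keys, sums = summed.index.tolist(), summed.values.tolist()
print(keys, sums)  # [1, 2, 3] [1.0, 0.5, 1.0]
```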
Build a map with the keys:
la,lb = ([1, 2, 2, 1, 1, 3], [0.1, 0.2, 0.3, 0.4, 0.5, 1.0])
m = {k:0.0 for k in la}
And fill it with the summations:
for i in range(len(lb)):
    m[la[i]] += lb[i]
Finally, unpack the map into the two lists (on Python 3.7+ dict insertion order keeps the keys in first-seen order):
keys, values = zip(*m.items())

Sort a list based on a given distribution

Answering another question, I ended up with a problem that I believe I solved in a roundabout way; it could have been done better, but I was clueless.
There are two lists:
percent = [0.23, 0.27, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
optimal_partition is one of the integer partitions of the number 8 into 4 parts.
I would like to sort optimal_partition so that it matches the percentage distribution as closely as possible, meaning each element of the partition should match the corresponding percent magnitude as closely as possible.
So 3 -> 0.4, 2 -> 0.27 and 0.23, and 1 -> 0.1.
So the final result should be
[2, 2, 3, 1]
The way I ended up solving this was
>>> from operator import itemgetter
>>> percent = [0.23, 0.27, 0.4, 0.1]
>>> optimal_partition = [3, 2, 2, 1]
>>> optimal_partition_percent = zip(sorted(optimal_partition),
...                                 sorted(enumerate(percent),
...                                        key=itemgetter(1)))
>>> optimal_partition = [e for e, _ in sorted(optimal_partition_percent,
...                                           key=lambda e: e[1][0])]
>>> optimal_partition
[2, 2, 3, 1]
Can you suggest an easier way to solve this?
By easier I mean, without the need to implement multiple sorting, and storing and later rearranging based on index.
Couple of more examples:
percent = [0.25, 0.25, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
result = [2, 2, 3, 1]
percent = [0.2, 0.2, 0.4, 0.2]
optimal_partition = [3, 2, 2, 1]
result = [1, 2, 3, 2]
from numpy import take, argsort
take(opt, argsort(argsort(perc)[::-1]))
or without imports:
zip(*sorted(zip(sorted(range(len(perc)), key=perc.__getitem__)[::-1],opt)))[1]
#Test
l = [([0.23, 0.27, 0.4, 0.1], [3, 2, 2, 1]),
     ([0.25, 0.25, 0.4, 0.1], [3, 2, 2, 1]),
     ([0.2, 0.2, 0.4, 0.2], [3, 2, 2, 1])]
def f1(perc, opt):
    return take(opt, argsort(argsort(perc)[::-1]))
def f2(perc, opt):
    return zip(*sorted(zip(sorted(range(len(perc)),
                                  key=perc.__getitem__)[::-1], opt)))[1]
for i in l:
    perc, opt = i
    print f1(perc, opt), f2(perc, opt)
# output:
# [2 2 3 1] (2, 2, 3, 1)
# [2 2 3 1] (2, 2, 3, 1)
# [1 2 3 2] (1, 2, 3, 2)
Use the fact that the percentages sum to 1:
percent = [0.23, 0.27, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
total = sum(optimal_partition)
output = [total*i for i in percent]
Now you need to figure out a way to redistribute the fractional components somehow. Thinking out loud:
from operator import itemgetter
intermediate = [[i, int(v), v - int(v)] for i, v in enumerate(output)]
# Sort the list by the fractional component
s = sorted(intermediate, key=itemgetter(2))
# Now, distribute each item's fractional component to the rest, starting at the top:
for i, item in enumerate(s):
    fraction = item[2]
    # Go through the remaining items in reverse order
    for index in range(len(s) - 1, i, -1):
        this_fraction = s[index][2]
        if fraction + this_fraction >= 1:
            # increment this item by 1, clear the fraction, carry the remainder
            new_fraction = fraction + this_fraction - 1
            s[index][1] = s[index][1] + 1
            s[index][2] = 0
            fraction = new_fraction
        else:
            # just add the fraction to this element, clear the original element
            s[index][2] = s[index][2] + fraction
Now, I'm not sure I'd say that's "easier", and I haven't fully tested the redistribution logic, so treat that last section as a sketch rather than a finished solution. But it's a different approach.
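For completeness, this redistribution idea is essentially the largest-remainder method; here is a compact sketch of it (note that ties between equal fractional parts are broken by position, so tied inputs may come out assigned differently than the question's third example):

```python
import math

def apportion(percent, total):
    # Largest-remainder method: floor each raw share, then hand the
    # leftover units to the entries with the largest fractional parts
    raw = [p * total for p in percent]
    base = [math.floor(r) for r in raw]
    leftover = total - sum(base)
    # Indices ordered by descending fractional part
    order = sorted(range(len(raw)), key=lambda i: raw[i] - base[i], reverse=True)
    for i in order[:leftover]:
        base[i] += 1
    return base

print(apportion([0.23, 0.27, 0.4, 0.1], 8))  # [2, 2, 3, 1]
```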
