Let's say I have a DataFrame named df in pandas, as below:
id x y
1 10 A
2 12 B
3 10 B
4 4 C
5 9 A
6 15 A
7 6 B
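For reference, this frame can be built with:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                   'x': [10, 12, 10, 4, 9, 15, 6],
                   'y': ['A', 'B', 'B', 'C', 'A', 'A', 'B']})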
Now I would like to group the data by column y and get the mean of the 2 largest values of x in each group, which would look something like this:
y
A (10+15)/2 = 12.5
B (12 + 10)/2 = 11
C 4
If I try with df.groupby('y')['x'].nlargest(2), I get
y id
A 1 10
6 15
B 2 12
3 10
C 4 4
which is of type pandas.core.series.Series. So when I do df.groupby('y')['x'].nlargest(2).mean() I get the mean of all the numbers instead of 3 means, one for each group. In the end I would like to plot the results with the groups on the x axis and the means on the y axis, so I'm guessing I should get rid of the 'id' column as well?
Does anyone know how to solve this one? Thanks for the help!
df.groupby('y')['x'].nlargest(2).mean(level=0)
Out:
y
A 12.5
B 11.0
C 4.0
Name: x, dtype: float64
Note that this groups by 'y' twice (mean(level=0) is another groupby, but since it is done on an index it is faster). Depending on the number of groups, groupby.apply might be more efficient, as it requires grouping only once in this particular situation.
df.groupby('y')['x'].apply(lambda ser: ser.nlargest(2).mean())
Out:
y
A 12.5
B 11.0
C 4.0
Name: x, dtype: float64
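Note: the level argument of Series.mean was deprecated in later pandas releases and removed in pandas 2.0. On current versions, the same double grouping from the first snippet is written as an explicit second groupby on the index level:
df.groupby('y')['x'].nlargest(2).groupby(level=0).mean()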
Related
I found the following example online, which shows how to achieve essentially the SQL equivalent of PARTITION BY:
df['percent_of_points'] = df.groupby('team')['points'].transform(lambda x: x/x.sum())
#view updated DataFrame
print(df)
team points percent_of_points
0 A 30 0.352941
1 A 22 0.258824
2 A 19 0.223529
3 A 14 0.164706
4 B 14 0.191781
5 B 11 0.150685
6 B 20 0.273973
7 B 28 0.383562
I struggle to understand what the x refers to in the lambda function lambda x: x/x.sum(), because it appears to refer to an individual element when used as the numerator (i.e. x) but also appears to be a list of values when used as the denominator (i.e. x.sum()).
I think I am not thinking about this in the right way, or have a gap in my understanding of Python or pandas.
it appears to refer to an individual element when used as the
numerator i.e. 'x' but also appears to be a list of values when used
as a denominator i.e. x.sum()
It doesn't. x is a pd.Series whose length is the size of the group, and x / x.sum() is not a single value either; it is a pd.Series of the same size.
.transform will assign the values of this series to the corresponding values in that column from the group-by operation.
So, consider:
In [15]: df
Out[15]:
team points
0 A 30
1 A 22
2 A 19
3 A 14
4 B 14
5 B 11
6 B 20
7 B 28
In [16]: for k, g in df.groupby('team')['points']:
...: print(g)
...: print(g / g.sum())
...:
0 30
1 22
2 19
3 14
Name: points, dtype: int64
0 0.352941
1 0.258824
2 0.223529
3 0.164706
Name: points, dtype: float64
4 14
5 11
6 20
7 28
Name: points, dtype: int64
4 0.191781
5 0.150685
6 0.273973
7 0.383562
Name: points, dtype: float64
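Because .transform aligns each of these per-group series back to the original row positions, reassembling them gives exactly the percent_of_points column shown above:
In [17]: df.groupby('team')['points'].transform(lambda x: x / x.sum())
Out[17]:
0    0.352941
1    0.258824
2    0.223529
3    0.164706
4    0.191781
5    0.150685
6    0.273973
7    0.383562
Name: points, dtype: float64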
This is the second part of this question.
Suppose I have a dataframe df and I want to select x1 and x100 corresponding to the largest amount, grouped by group_id. If there are multiple rows with the largest amount, then I want to select the medians of x1 and x100.
df = pd.DataFrame({'group_id' : [1,1,1,2,2,3,3,3,3],
'amount' : [2,4,5,1,2,3,5,5,5],
'x1':[2,5,8,3,6,9,3,1,0],
'x100':[1,2,3,4,8,9,9,4,5]})
group_id amount x1 x100
0 1 2 2 1
1 1 4 5 2
2 1 5 8 3
3 2 1 3 4
4 2 2 6 8
5 3 3 9 9
6 3 5 3 9
7 3 5 1 4
8 3 5 0 5
So the desired output looks like this:
median_x1 median_x100
group_id
1 8.0 3.0
2 6.0 8.0
3 1.0 5.0
For only 2 columns (x1 and x100), I can simply add 1 line to @AndrejKesely's solution to the previous question, something like this:
out = df.groupby("group_id").apply(
    lambda x: pd.Series(
        {"median_x1": (d := x.loc[x["amount"] == x["amount"].max()])['x1'].median(),
         "median_x100": d["x100"].median()}
    )
)
How can I do this in a neat way that will work for 100 columns, i.e. x1, x2, up to x100? Ideally, I do not want to copy-paste one line 100 times, manually changing the column name in an editor...
Maybe something like this?
df.groupby('group_id').apply(
    lambda x: x[x['amount'] == x['amount'].max()]
              .drop(columns=['amount', 'group_id'])
              .median())
You can also use names of columns instead of .drop():
df.groupby('group_id').apply(
    lambda x: x.loc[x['amount'] == x['amount'].max(), ['x1', 'x100']].median())
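If you prefer to avoid apply altogether, a sketch of an alternative (not from the original answers): keep only each group's max-amount rows via transform, then aggregate once. This scales to any number of x columns, since none are named explicitly.
# rows whose amount equals their group's maximum
mask = df['amount'] == df.groupby('group_id')['amount'].transform('max')
out = (df[mask].drop(columns='amount')
               .groupby('group_id').median()
               .add_prefix('median_'))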
I have a df with values:
A B C D
0 1 2 3 2
1 2 3 3 9
2 5 3 6 6
3 3 6 7
4 6 7
5 2
df.shape is 6x4, say
df.iloc[:,1] pulls out the B column, but len(df.iloc[:,1]) is still 6.
How do I "reshape" df.iloc[:,1]? Which function can I use so that the output is the number of actual values in the column?
My expected output in this case is 3
You can use last_valid_index. Just note that since your series originally contained NaN values, which are considered float, your series will still be float even after filtering. You may wish to convert to int as a separate step.
# first convert dataframe to numeric
df = df.apply(pd.to_numeric, errors='coerce')
# extract column
B = df.iloc[:, 1]
# filter to the last valid value
B_filtered = B[:B.last_valid_index()]
print(B_filtered)
0 2.0
1 3.0
2 3.0
3 6.0
Name: B, dtype: float64
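If all you actually need is the count of real values rather than the filtered series itself, Series.count() returns the number of non-NaN entries directly (after the same pd.to_numeric conversion):
print(df.iloc[:, 1].count())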
You can also use a list comprehension, assuming the blanks are empty strings rather than NaN:
len([x for x in df.iloc[:,1] if x != ''])
I'm having trouble understanding how a function works:
""" the apply() method lets you apply an arbitrary function to the group
result. The function take a DataFrame and returns a Pandas object (a df or
series) or a scalar.
For example: normalize the first column by the sum of the second"""
def norm_by_data2(x):
# x is a DataFrame of group values
x['data1'] /= x['data2'].sum()
return x
print(df); print(df.groupby('key').apply(norm_by_data2))
(Excerpt from: "Python Data Science Handbook", Jake VanderPlas pp. 167)
Returns this:
key data1 data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9
key data1 data2
0 A 0.000000 5
1 B 0.142857 0
2 C 0.166667 3
3 A 0.375000 3
4 B 0.571429 7
5 C 0.416667 9
For me, the best way to understand how this works is by manually calculating the values.
Can someone explain how to manually arrive at the second value of the column 'data1': 0.142857?
It's 1/7, but where do these values come from?
Thanks!
I got it!!
The sum of column data2 for each group is:
A: 5 + 3 = 8
B: 0 + 7 = 7
C: 3 + 9 = 12
For example, to arrive at 0.142857, divide the data1 value 1 by the sum for group B (which is 7): 1/7 = 0.142857
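To check this with code, reconstructing the book's df from the printed output above:
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# per-group sums of data2: A -> 8, B -> 7, C -> 12
print(df.groupby('key')['data2'].sum())

# the second row has data1 = 1 and key 'B', so its normalized value is 1 / 7
print(1 / 7)  # 0.14285714285714285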
I want to find the difference between the rows prior to and following a specific row. Specifically, I have the following dataset:
Number of rows A
1 4
2 2
3 2
4 3
5 2
I should get the following data:
Number of rows A B
1 4 NaN (since there is no row before this row)
2 2 2 (4-2)
3 2 -1 (2-3)
4 3 0 (2-2)
5 2 NaN (since there is no row after this row)
As you can see, each row in column B equals the difference between the previous and following rows in column A. For example, the second row in column B equals the difference between the value in the first row of column A and the value in the third row of column A. IMPORTANT POINT: I do not need only the immediately previous and following rows; I need the difference between the rows 2 positions back and 2 positions forward. That is, the value in row 23 of column B would equal the difference between the value in row 21 of column A and the value in row 25 of column A. I used the immediately adjacent rows above for simplicity.
I hope I could explain it.
Seems like you need a centered rolling window. You can specify that with the arg center=True (passing raw=True hands each window to the lambda as a plain ndarray, so the positional indexing s[0] and s[-1] works on recent pandas versions):
>>> df.A.rolling(3, center=True).apply(lambda s: s[0] - s[-1], raw=True)
0 NaN
1 2.0
2 -1.0
3 0.0
4 NaN
Name: A, dtype: float64
This approach works for any window size. Notice that this is a centered window, so the size of the window has to be N+N+1 (where N is the number of lookback and lookforward rows, and you add 1 to account for the value in the middle). Thus, the general formula is
window = 2*N + 1
If you need 2 rows before and 2 after, then N = 2. If you need 5 and 5, then N = 5 (and window = 11), etc. The apply lambda stays the same.
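For example, with N = 2 (window = 5) on the same data, only the middle row has two valid neighbors on each side:
>>> df.A.rolling(5, center=True).apply(lambda s: s[0] - s[-1], raw=True)
0    NaN
1    NaN
2    2.0
3    NaN
4    NaN
Name: A, dtype: float64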
Let the series (i.e. DataFrame column) be s.
You want:
s.shift(1) - s.shift(-1)
You need to use .shift on the column (series) where you want to run your calculation.
With shift(1) you get the previous row, with shift(-1) you get the next row.
From there you need to calculate previous - next:
>>> s = pd.Series([4,2,2,3,2])
>>> s
0 4
1 2
2 2
3 3
4 2
dtype: int64
# previous
>>> s.shift(1)
0 NaN
1 4.0
2 2.0
3 2.0
4 3.0
dtype: float64
# next
>>> s.shift(-1)
0 2.0
1 2.0
2 3.0
3 2.0
4 NaN
dtype: float64
# previous - next
>>> s.shift(1)-s.shift(-1)
0 NaN
1 2.0
2 -1.0
3 0.0
4 NaN
dtype: float64
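For the "previous 2 / following 2" case from the question, the same idea with larger offsets:
>>> s.shift(2) - s.shift(-2)
0    NaN
1    NaN
2    2.0
3    NaN
4    NaN
dtype: float64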