pd.NamedAgg overwrites previous column values - python

This is the dataframe I used.
token name ltp change
0 12345.0 abc 2.0 NaN
1 12345.0 abc 5.0 1.500000
2 12345.0 abc 3.0 -0.400000
3 12345.0 abc 9.0 2.000000
4 12345.0 abc 5.0 -0.444444
5 12345.0 abc 16.0 2.200000
6 6789.0 xyz 1.0 NaN
7 6789.0 xyz 5.0 4.000000
8 6789.0 xyz 3.0 -0.400000
9 6789.0 xyz 13.0 3.333333
10 6789.0 xyz 9.0 -0.307692
11 6789.0 xyz 20.0 1.222222
While trying to solve this question, I encountered this weird behaviour of pd.NamedAgg.
#Worked as intended
df.groupby('name').agg(
    pos=pd.NamedAgg(column='change', aggfunc=lambda x: x.gt(0).sum()),
    neg=pd.NamedAgg(column='change', aggfunc=lambda x: x.lt(0).sum()))
# Output
pos neg
name
abc 3.0 2.0
xyz 3.0 2.0
When doing it over a specific column:
df.groupby('name')['change'].agg(
    pos=pd.NamedAgg(column='change', aggfunc=lambda x: x.gt(0).sum()),
    neg=pd.NamedAgg(column='change', aggfunc=lambda x: x.lt(0).sum()))
#Output
pos neg
name
abc 2.0 2.0
xyz 2.0 2.0
The pos column's values are overwritten with the neg column's values.
Another example below:
df.groupby('name')['change'].agg(
    pos=pd.NamedAgg(column='change', aggfunc=lambda x: x.gt(0).sum()),
    neg=pd.NamedAgg(column='change', aggfunc=lambda x: x.sum()))
#Output
pos neg
name
abc 4.855556 4.855556
xyz 7.847863 7.847863
Even weirder results:
df.groupby('name')['change'].agg(
    pos=pd.NamedAgg(column='change', aggfunc=lambda x: x.gt(0).sum()),
    neg=pd.NamedAgg(column='change', aggfunc=lambda x: x.sum()),
    max=pd.NamedAgg(column='ltp', aggfunc='max'))
# I'm applying on the Series `'change'` but I specified `column='ltp'`, which should
# raise `KeyError: "Column 'ltp' does not exist!"`, yet it produces results as follows:
pos neg max
name
abc 4.855556 4.855556 2.2
xyz 7.847863 7.847863 4.0
The same problem occurs when using it with a plain pd.Series:
s = pd.Series([1,1,2,2,3,3,4,5])
s.groupby(s.values).agg(one = pd.NamedAgg(column='new',aggfunc='sum'))
one
1 2
2 4
3 6
4 4
5 5
Shouldn't it raise a KeyError?
Some more weird results: the values of the one column are not overwritten when we use different column names.
s.groupby(s.values).agg(one=pd.NamedAgg(column='anything', aggfunc='sum'),
                        second=pd.NamedAgg(column='something', aggfunc='max'))
one second
1 2 1
2 4 2
3 6 3
4 4 4
5 5 5
Values are overwritten when we use the same column name in pd.NamedAgg:
s.groupby(s.values).agg(one=pd.NamedAgg(column='weird', aggfunc='sum'),
                        second=pd.NamedAgg(column='weird', aggfunc='max'))
one second # Values of column `one` are overwritten
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
My pandas version
pd.__version__
# '1.0.3'
From the pandas documentation:
Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.
In [82]: animals.groupby("kind").height.agg(
....: min_height='min',
....: max_height='max',
....: )
....:
Out[82]:
min_height max_height
kind
cat 9.1 9.5
dog 6.0 34.0
But I couldn't find why using it with column produces weird results.
UPDATE:
A bug report was filed by jezrael as GitHub issue #34380.
EDIT: This is a bug confirmed by pandas-dev, and it has been resolved in PR #30858, "BUG: aggregations were getting overwritten if they had the same name".

If columns are specified after the groupby, use the solution described in this paragraph of the documentation:
Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.
df = df.groupby('name')['change'].agg(pos=lambda x: x.gt(0).sum(),
                                      neg=lambda x: x.lt(0).sum())
print (df)
pos neg
name
abc 3.0 2.0
xyz 3.0 2.0
As for why using it with column produces weird results - I think it is a bug; instead of producing wrong output it should raise an error.
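If you need aggregations over more than one column (like the max of ltp) without running into this SeriesGroupBy behaviour, one option is to stay on the DataFrameGroupBy and use the ('column', aggfunc) tuple form of named aggregation. A minimal sketch, assuming df is the token/name/ltp/change frame from the question (the max_ltp label is my own):
# named aggregation with ('column', aggfunc) tuples on the DataFrameGroupBy
out = df.groupby('name').agg(
    pos=('change', lambda x: x.gt(0).sum()),   # count of positive changes
    neg=('change', lambda x: x.lt(0).sum()),   # count of negative changes
    max_ltp=('ltp', 'max'))                    # max ltp per group
print(out)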

Related

Calculate value in Column based on two other columns in Dataframe

I am trying to calculate the mean value for the two classes Physics and Math and include it in a separate column Mean. Furthermore, I only want to calculate the mean for rows where both classes have a grade. Building that filter works. The only thing that is not working is calculating the mean value: for the missing ones it works, but somehow it sets the rows where I already have a value to NaN, which is weird. The data looks like the following:
Date School Math Physics Mean Flag
01.01.2020 ABC 3 4 1
01.03.2020 ABC 2 3 1
01.05.2020 ABC 2 1 1.5 2
01.07.2020 ABC 2 1 1
01.08.2020 ABC 2 1 1.5 2
01.04.2020 ABC 2 3
01.06.2020 ABC 1 3
My code looks as the following:
import pandas as pd
path = 'School_grades.xlsx'
df = pd.read_excel(path)
df_copy = df.copy(deep=True)
df_copy['Date'] = pd.to_datetime(df_copy.Date)
df_copy = df_copy[(df_copy["Flag"] != 3)]
df_copy['Mean'] = ((df_copy['Math'] + df_copy['Physics'])/2).where(df_copy['Flag'] == 1)
print(df_copy)
My code produces the following, where the rows that already had a Mean are set to NaN:
Date School Math Physics Mean Flag
0 2020-01-01 ABC 3.0 4.0 3.5 1
1 2020-01-03 ABC 2.0 3.0 2.5 1
2 2020-01-05 ABC 2.0 1.0 NaN 2
3 2020-01-07 ABC 2.0 1.0 1.5 1
4 2020-01-08 ABC 2.0 1.0 NaN 2
But would rather expect something like this:
Date School Math Physics Mean Flag
0 2020-01-01 ABC 3.0 4.0 3.5 1
1 2020-01-03 ABC 2.0 3.0 2.5 1
2 2020-01-05 ABC 2.0 1.0 1.5 2
3 2020-01-07 ABC 2.0 1.0 1.5 1
4 2020-01-08 ABC 2.0 1.0 1.5 2
Your .where() call has no "else" argument, but it still returns a value for each row of the dataframe. This means it only keeps values where your condition is True and puts missing values where it is False, essentially throwing your previous results away.
There are multiple ways to solve this. One of them uses the numpy library.
np.where() takes a series of True/False values: where True, it uses the second argument; where False, it uses the third. Here we pass in the previous Mean values as that third argument.
import numpy as np
df_copy['Mean'] = np.where(df_copy['Flag'] == 1, ((df_copy['Math'] + df_copy['Physics'])/2), df_copy['Mean'])
You forgot to add the other parameter to pandas' .where():
>> df_copy['Mean'] = ((df_copy['Math'] + df_copy['Physics'])/2).where(df_copy['Flag'] == 1,df_copy['Mean'])
>> print(df_copy)
Date School Math Physics Mean Flag
0 01.01.2020 ABC 3.0 4.0 3.5 1
1 01.03.2020 ABC 2.0 3.0 2.5 1
2 01.05.2020 ABC 2.0 1.0 1.5 2
3 01.07.2020 ABC 2.0 1.0 1.5 1
4 01.08.2020 ABC 2.0 1.0 1.5 2
5 01.04.2020 ABC 2.0 NaN NaN 3
6 01.06.2020 ABC NaN 1.0 NaN 3
Use pandas.DataFrame.mean to calculate the average:
df_copy['Mean'] = df_copy[['Math','Physics']].mean(axis=1).where(df_copy.Flag == 1,df_copy['Mean'])
You can also use numpy.where:
import numpy as np
df_copy['Mean'] = np.where(df_copy.Flag == 1,df_copy[['Math','Physics']].mean(axis=1),df_copy['Mean'])

how to count positive and negative numbers of a column after applying groupby in pandas

I have the following dataframe:
token name ltp change
0 12345.0 abc 2.0 NaN
1 12345.0 abc 5.0 1.500000
2 12345.0 abc 3.0 -0.400000
3 12345.0 abc 9.0 2.000000
4 12345.0 abc 5.0 -0.444444
5 12345.0 abc 16.0 2.200000
6 6789.0 xyz 1.0 NaN
7 6789.0 xyz 5.0 4.000000
8 6789.0 xyz 3.0 -0.400000
9 6789.0 xyz 13.0 3.333333
10 6789.0 xyz 9.0 -0.307692
11 6789.0 xyz 20.0 1.222222
I need the count of positive and negative numbers for each category of the name column. In the above example:
abc: pos_count: 3, abc: neg_count: 2
xyz: pos_count: 3, xyz: neg_count: 2
count=df.groupby('name')['change'].count()
count
However, this gives me only the total count per group, not the positive and negative counts separately.
Use:
g = df.groupby('name')['change']
counts = g.agg(
    pos_count=lambda s: s.gt(0).sum(),
    neg_count=lambda s: s.lt(0).sum(),
    net_count=lambda s: s.gt(0).sum() - s.lt(0).sum()).astype(int)
Result:
# print(counts)
pos_count neg_count net_count
name
abc 3 2 1
xyz 3 2 1
Use np.sign with Series.map for a new column added by DataFrame.assign, and then count values with SeriesGroupBy.value_counts:
import numpy as np

count = (df.assign(type=np.sign(df['change']).map({1: 'pos_count', -1: 'neg_count'}))
           .groupby(df['name'])['type']
           .value_counts()
           .reset_index(name='count'))
print (count)
name type count
0 abc pos_count 3
1 abc neg_count 2
2 xyz pos_count 3
3 xyz neg_count 2
You can create a new column in df with the sign of change and group by name and sign:
import pandas as pd
import numpy as np
df['change_sign'] = np.sign(df['change'])
df.groupby(['name','change_sign']).count()
You can then pivot if you need the result in columns instead of rows.
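A minimal sketch of that pivot using unstack, assuming df is the token/name/ltp/change frame from the question (the pos_count/neg_count labels are my own):
import numpy as np

df['change_sign'] = np.sign(df['change'])
counts = (df.groupby(['name', 'change_sign'])['change']
            .count()                      # one row per (name, sign) pair
            .unstack(fill_value=0)        # move the sign level into columns
            .rename(columns={1.0: 'pos_count', -1.0: 'neg_count'}))
print(counts)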

Removing specific value from cell of dataframe and shifting the value towards left

I'm working with a pandas DataFrame that has unwanted data in some cells. I need to clear that data from those cells and shift the rest of the row left by one cell. I have tried a couple of things, but it is not working for me. Here is the example dataframe:
userId movieId ratings extra
0 1 500 3.5
1 1 600 4.5
2 1 www.abcd 700 2.0
3 2 1100 5.0
4 2 1200 4.0
5 3 600 4.5
6 4 600 5.0
7 4 1900 3.5
Expected Outcome:
userId movieId ratings extra
0 1 500 3.5
1 1 600 4.5
2 1 700 2.0
3 2 1100 5.0
4 2 1200 4.0
5 3 600 4.5
6 4 600 5.0
7 4 1900 3.5
I have tried the following code, but it raises an error:
raw = df[df['ratings'].str.contains('www')==True]
# Here I am trying to set the specific cell value to empty, but this line raises
# AttributeError: 'str' object has no attribute 'at'
df = df.at[raw, 'movieId'] = ' '
# code for shifting the cell values to the left
df.iloc[raw,2:-1] = df.iloc[raw,2:-1].shift(-1,axis=1)
You can shift values by a mask, but it is really important to match types: if the movieId column is filled with strings (because of at least one string value), it is necessary to convert it to numeric with to_numeric to avoid losing data due to the mismatched types:
m = df['movieId'].str.contains('www')
df['movieId'] = pd.to_numeric(df['movieId'], errors='coerce')
#if want shift only missing values rows
#m = df['movieId'].isna()
df[m] = df[m].shift(-1, axis=1)
df['userId'] = df['userId'].ffill()
df = df.drop('extra', axis=1)
print (df)
userId movieId ratings
0 1.0 500.0 3.5
1 1.0 600.0 4.5
2 1.0 700.0 2.0
3 2.0 1100.0 5.0
4 2.0 1200.0 4.0
5 3.0 600.0 4.5
6 4.0 600.0 5.0
7 4.0 1900.0 3.5
If you omit the conversion to numeric, you get a missing value:
m = df['movieId'].str.contains('www')
df[m] = df[m].shift(-1, axis=1)
df['userId'] = df['userId'].ffill()
df = df.drop('extra', axis=1)
print (df)
userId movieId ratings
0 1.0 500 3.5
1 1.0 600 4.5
2 1.0 NaN 2.0
3 2.0 1100 5.0
4 2.0 1200 4.0
5 3.0 600 4.5
6 4.0 600 5.0
7 4.0 1900 3.5
You can try this:
df['movieId'] = pd.to_numeric(df['movieId'], errors='coerce')
df = df.sort_values(by='movieId', ascending=True)

Custom expanding function with raw=False

Consider the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': np.arange(1, 5),
    'b': np.arange(1, 5) * 2,
    'c': np.arange(1, 5) * 3
})
a b c
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
I want to calculate the cumulative sum for each row across the columns:
def expanding_func(s):
    return s.sum()

df.expanding(1, axis=1).apply(expanding_func, raw=True)
# As expected:
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0
However, if I set raw=False, expanding_func no longer works:
df.expanding(1, axis=1).apply(expanding_func, raw=False)
ValueError: Length of passed values is 3, index implies 4
The documentation says expanding_func
Must produce a single value from an ndarray input if raw=True or a single value from a Series if raw=False.
And that is exactly what I was doing. Why did expanding_func fail when raw=False?
Note: this is only a contrived example. I want to know how to write a custom rolling function, not how to calculate the cumulative sum across columns.
It seems this is a bug with pandas.
If you do:
df.iloc[:3].expanding(1, axis=1).apply(expanding_func, raw=False)
It actually works. It seems that when the values are passed as a Series, pandas checks the number of returned columns against the number of rows of the dataframe for some reason (it should compare against the number of columns of the df).
A workaround is to transpose the df, apply your function, and transpose back, which seems to work. The bug only appears when axis is set to 1.
df.T.expanding(1, axis=0).apply(expanding_func, raw=False).T
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0
You don't need to set raw to True or False at all; just do it the simple way:
df.expanding(0, axis=1).apply(expanding_func)
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0

pandas aggregate dataframe returns only one column

Hi there.
I have a pandas DataFrame (df) like this:
foo id1 bar id2
0 8.0 1 NULL 1
1 5.0 1 NULL 1
2 3.0 1 NULL 1
3 4.0 1 1 2
4 7.0 1 3 2
5 9.0 1 4 3
6 5.0 1 2 3
7 7.0 1 3 1
...
I want to group by id1 and id2 and try to get the mean of foo and bar.
My code:
res = df.groupby(["id1","id2"])["foo","bar"].mean()
What I get is almost what I expect:
foo
id1 id2
1 1 5.750000
2 7.000000
2 1 3.500000
2 1.500000
3 1 6.000000
2 5.333333
The values in column "foo" are exactly the average values (means) that I am looking for but where is my column "bar"?
So if it were SQL, I would be looking for a result like:
"select avg(foo), avg(bar) from dataframe group by id1, id2;"
(Sorry for this, but I am more of an SQL person and new to pandas, and I need it now.)
What I alternatively tried:
groupedFrame = res.groupby(["id1","id2"])
aggrFrame = groupedFrame.aggregate(numpy.mean)
Which gives me exactly the same result, still missing column "bar".
Sites I read:
http://wesmckinney.com/blog/groupby-fu-improvements-in-grouping-and-aggregating-data-in-pandas/
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.aggregate.html
and the documentation for groupby, but I cannot post the link here.
What am I doing wrong? Thanks in advance.
The problem is that your column bar is not numeric, so the aggregation function omits it.
You can check the dtype of the omitted column - it is not numeric:
print (df['bar'].dtype)
object
See the documentation on automatic exclusion of nuisance columns.
The solution is to convert the string values to numeric before aggregating and, where that is not possible, insert NaNs with to_numeric and the parameter errors='coerce':
df['bar'] = pd.to_numeric(df['bar'], errors='coerce')
res = df.groupby(["id1","id2"])["foo","bar"].mean()
print (res)
foo bar
id1 id2
1 1 5.75 3.0
2 5.50 2.0
3 7.00 3.0
But if you have mixed data - numeric values together with strings - it is possible to use replace:
df['bar'] = df['bar'].replace("NULL", np.nan)
As stated earlier, you should replace your NULL values before taking the mean:
df.replace("NULL",-1).groupby(["id1","id2"])["foo","bar"].mean()
output
id1 id2 foo bar
1 1 5.75 3.0
1 2 5.5 2.0
1 3 7.0 3.0
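If you prefer to stay close to the SQL in the question, a minimal sketch using named aggregation after converting bar to numeric (the avg_foo/avg_bar labels are my own):
import pandas as pd

# assumes df is the foo/id1/bar/id2 frame from the question
df['bar'] = pd.to_numeric(df['bar'], errors='coerce')   # "NULL" strings become NaN
# roughly: select avg(foo), avg(bar) from df group by id1, id2
res = df.groupby(['id1', 'id2']).agg(avg_foo=('foo', 'mean'),
                                     avg_bar=('bar', 'mean'))
print(res)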
