For the following dataframe:
df = pd.DataFrame({'group':['a','a','b','b'], 'data':[5,10,100,30]},columns=['group', 'data'])
print(df)
group data
0 a 5
1 a 10
2 b 100
3 b 30
When grouping by the column, summing, and assigning the result to a new column, I get:
df['new'] = df.groupby('group')['data'].sum()
print(df)
group data new
0 a 5 NaN
1 a 10 NaN
2 b 100 NaN
3 b 30 NaN
However, if we reset df to the original data and move the group column to the index,
df.set_index('group', inplace=True)
print(df)
data
group
a 5
a 10
b 100
b 30
And then group and sum, we get:
df['new'] = df.groupby('group')['data'].sum()
print(df)
data new
group
a 5 15
a 10 15
b 100 130
b 30 130
Why does grouping by the column not set the values in the new column, while grouping on the index does?
Better here is to use GroupBy.transform, which returns a Series the same size as the original DataFrame, so the assignment works correctly:
df['new'] = df.groupby('group')['data'].transform('sum')
That is because when you assign a new Series, the values are aligned by index. If the indexes differ, you get NaNs:
print (df.groupby('group')['data'].sum())
group
a 15
b 130
Name: data, dtype: int64
The index values are different, so you get NaNs:
print (df.groupby('group')['data'].sum().index)
Index(['a', 'b'], dtype='object', name='group')
print (df.index)
RangeIndex(start=0, stop=4, step=1)
df.set_index('group', inplace=True)
print (df.groupby('group')['data'].sum())
group
a 15
b 130
Name: data, dtype: int64
Now the indexes can align, because the values match:
print (df.groupby('group')['data'].sum().index)
Index(['a', 'b'], dtype='object', name='group')
print (df.index)
Index(['a', 'a', 'b', 'b'], dtype='object', name='group')
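The alignment rule can also be seen in isolation with a minimal, self-contained sketch (toy column names, unrelated to the frames above):
import pandas as pd

df_toy = pd.DataFrame({'x': [1, 2, 3]})    # default RangeIndex 0, 1, 2
s = pd.Series([10, 20], index=[0, 5])      # only the label 0 exists in df_toy's index
df_toy['y'] = s                            # assignment aligns on index labels
print(df_toy)
   x     y
0  1  10.0
1  2   NaN
2  3   NaN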
You're not getting what you want because df.groupby('group')['data'].sum() returns an aggregated result with group as the index:
group
a 15
b 130
Name: data, dtype: int64
The indexes are clearly not aligned.
If you want this to work you'll have to use transform, which returns a Series with the transformed values and the same axis length as self:
df['new'] = df.groupby('group')['data'].transform('sum')
group data new
0 a 5 15
1 a 10 15
2 b 100 130
3 b 30 130
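Printing the intermediate result of transform makes the alignment explicit: it keeps the original RangeIndex and has the same length as df. A quick check, rebuilding the original four-row frame:
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'data': [5, 10, 100, 30]})
print(df.groupby('group')['data'].transform('sum'))
0     15
1     15
2    130
3    130
Name: data, dtype: int64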
Related
I am trying to group a dataset by name and find the monthly average, i.e. the sum of all the values for each name divided by the number of distinct months for that name.
For example,
name time values
A 2011-01-17 10
B 2011-02-17 20
A 2011-01-11 10
A 2011-03-17 30
B 2011-02-17 10
The expected result is
name monthly_avg
A 25
B 30
I have tried
data.groupby(['name'])['values'].mean().reset_index(name='Monthly Average')
but it gives the output below instead of my desired output above:
name Monthly Average
A 16.666667
B 15.000000
Convert time to datetimes first, then aggregate the sum per name and month with Grouper, and finally take the mean per the first level (name):
data['time'] = pd.to_datetime(data['time'])
df = (data.groupby(['name', pd.Grouper(freq='m', key='time')])['values'].sum()
.groupby(level=0)
.mean()
.reset_index(name='Monthly Average'))
print (df)
name Monthly Average
0 A 25
1 B 30
An alternative with month periods: change the Grouper to Series.dt.to_period:
data['time'] = pd.to_datetime(data['time'])
df = (data.groupby(['name', data['time'].dt.to_period('m')])['values']
.sum()
.groupby(level=0)
.mean()
.reset_index(name='Monthly Average'))
print (df)
name Monthly Average
0 A 25
1 B 30
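To see why this gives the expected result, it helps to look at the intermediate per-name, per-month sums (a quick check, assuming data is the question's frame with time already converted):
print(data.groupby(['name', data['time'].dt.to_period('M')])['values'].sum())
name  time
A     2011-01    20
      2011-03    30
B     2011-02    30
Name: values, dtype: int64
The mean per name is then (20 + 30) / 2 = 25 for A and 30 / 1 = 30 for B.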
Use:
df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'B'],
                   'time': ['2011-01-17', '2011-02-17', '2011-01-11', '2011-03-17', '2011-02-17'],
                   'values': [10, 20, 10, 30, 10]})
df['month'] = pd.to_datetime(df['time']).dt.month                # calendar month of each row
a1 = df.groupby('name')['month'].apply(lambda x: len(set(x)))    # number of distinct months per name
a2 = df.groupby('name')['values'].sum()                          # total values per name
a2 / a1
output:
name
A    25.0
B    30.0
dtype: float64
I'm looking to group my fields by date and get the mean of all the columns except a binary column, which I want to sum in order to get a count.
I know I can do this by:
newdf=df.groupby('date').agg({'var_a': 'mean', 'var_b': 'mean', 'var_c': 'mean', 'binary_var':'sum'})
But there are about 50 columns (other than the binary one) that I want to take the mean of, and I feel there must be a simpler, quicker way than writing 'column title': 'mean' for each of them. I've tried passing a list of column names, but when I put it in the agg function it says a list is an unhashable type.
Thanks!
Something like this might work -
df = pd.DataFrame({'a': ['a','a','b','b','b','b'], 'b': [10,20,30,40,20,10], 'c': [1,1,0,0,0,1], 'd': [20,30,10,15,34,10]})
df
a b c d
0 a 10 1 20
1 a 20 1 30
2 b 30 0 10
3 b 40 0 15
4 b 20 0 34
5 b 10 1 10
Assuming c is the binary variable column. Then,
cols = [val for val in df.columns if val not in ('a', 'c')]   # every column except the group key and the binary one
temp = pd.concat([df.groupby(['a'])[cols].mean(), df.groupby(['a'])['c'].sum()], axis=1).reset_index()
temp
a b d c
0 a 15.0 25.00 2
1 b 25.0 17.25 1
In general, I would build the agg dict automatically:
sum_cols = ['binary_var']
agg_dict = {col: 'sum' if col in sum_cols else 'mean'
for col in df.columns if col != 'date'}
df.groupby('date').agg(agg_dict)
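For example, on the toy frame from the previous answer (a sketch where 'c' plays the role of the binary column and 'a' the role of 'date'), the comprehension builds the full dict for you:
sum_cols = ['c']
agg_dict = {col: 'sum' if col in sum_cols else 'mean'
            for col in df.columns if col != 'a'}
# agg_dict == {'b': 'mean', 'c': 'sum', 'd': 'mean'}
df.groupby('a').agg(agg_dict)
      b  c      d
a
a  15.0  2  25.00
b  25.0  1  17.25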
I have a Pandas DataFrame with columns A, B, C, D, date. I want to filter out duplicates of A and B, keeping the row with the most recent value in date.
So if I have two rows that look like:
A B C D date
1 1 2 3 1/1/18
1 1 2 3 1/1/17
The correct output would be:
A B C D date
1 1 2 3 1/1/18
I can do this by looping through, but I'd like to use df.groupby(['A', 'B']) and then aggregate by taking the largest value for date in each group.
I tried:
df.groupby(['A', 'B']).agg(lambda x: x.iloc[x.date.argmax()])
But I get:
AttributeError: 'Series' object has no attribute 'date'
Any idea what I'm doing incorrectly?
Edit: Hmm if I do:
df.groupby(['A', 'B']).UPDATED_AT.max()
I get mostly what I want but I lose columns D and C...
You can do it with:
df.date=pd.to_datetime(df.date)
df.sort_values('date').drop_duplicates(['A','B'],keep='last')
A B C D date
0 1 1 2 3 2018-01-01
Try apply instead of agg: agg applies the lambda to each column of the group separately, so x inside the lambda is a plain Series with no 'date' column (and no .date attribute), which is why you get the AttributeError. apply passes each group as a whole DataFrame:
df['date'] = pd.to_datetime(df['date'])
df.groupby(['A', 'B']).apply(lambda g: g.loc[g['date'].idxmax()])
df = pd.DataFrame([[1, 1, 2, 3, '1/1/18'],
[1, 1, 2, 3, '1/1/17']],
columns=['A', 'B', 'C', 'D', 'date'])
Output:
A B C D date
0 1 1 2 3 1/1/18
1 1 1 2 3 1/1/17
Grouping and removing duplicates:
df.groupby(['A', 'B']).agg({'date': 'max'})
Output:
date
A B
1 1 1/1/18
This should work. It is more reliable if the 'date' column is converted to a datetime object first, since string dates do not always compare in chronological order.
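If you also need to keep columns C and D while taking the latest date per (A, B) pair, one option (a sketch, converting date to real datetimes first) is to select the rows of the per-group maxima with idxmax:
import pandas as pd

df = pd.DataFrame([[1, 1, 2, 3, '1/1/18'],
                   [1, 1, 2, 3, '1/1/17']],
                  columns=['A', 'B', 'C', 'D', 'date'])
df['date'] = pd.to_datetime(df['date'])                   # real datetimes, so the max is chronological
latest = df.loc[df.groupby(['A', 'B'])['date'].idxmax()]  # one full row per (A, B) pair
print(latest)
   A  B  C  D       date
0  1  1  2  3 2018-01-01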
Hello, I have the following DataFrame:
df =
ID Value
a 45
b 3
c 10
And another dataframe with the numeric ID of each value
df1 =
ID ID_n
a 3
b 35
c 0
d 7
e 1
I would like to have a new column in df with the numeric ID, so:
df =
ID Value ID_n
a 45 3
b 3 35
c 10 0
Thanks
Use pandas merge:
import pandas as pd
df1 = pd.DataFrame({
'ID': ['a', 'b', 'c'],
'Value': [45, 3, 10]
})
df2 = pd.DataFrame({
'ID': ['a', 'b', 'c', 'd', 'e'],
'ID_n': [3, 35, 0, 7, 1],
})
print(pd.merge(df1, df2, on="ID", how='left'))
output:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You could use join(), after setting ID as the index on both frames with df1.set_index('ID', inplace=True) and df2.set_index('ID', inplace=True):
In [14]: df1.join(df2)
Out[14]:
Value ID_n
ID
a 45 3
b 3 35
c 10 0
If you want the index to be numeric you could use reset_index():
In [17]: df1.join(df2).reset_index()
Out[17]:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You can do this in a single operation. join works on the index, which you don't appear to have set. Just set df's index to ID, join df1 after also setting its index to ID, and then reset the index to get back your original dataframe with the new column added.
>>> df.set_index('ID').join(df1.set_index('ID')).reset_index()
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
Also, because you don't do an inplace set_index on df or df1, their structure remains the same (i.e. you don't change their indexing).
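A quick check of that last point (a sketch, assuming df and df1 are the question's frames with ID as a regular column and a default RangeIndex):
result = df.set_index('ID').join(df1.set_index('ID')).reset_index()
print(df.columns.tolist())   # ['ID', 'Value'] - df itself is unchanged
print(df.index)              # RangeIndex(start=0, stop=3, step=1)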
I have a dataframe with some columns containing NaN. I'd like to drop the columns with a certain number of NaNs. For example, in the following code, I'd like to drop any column with 2 or more NaNs. In this case, column 'C' will be dropped and only 'A' and 'B' will be kept. How can I implement it?
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
dff.iloc[3,0] = np.nan
dff.iloc[6,1] = np.nan
dff.iloc[5:8,2] = np.nan
print(dff)
There is a thresh param for dropna: it keeps only the columns with at least that many non-NaN values, so to drop columns with 2 or more NaNs pass len(dff) - 1 as the threshold:
In [13]:
dff.dropna(thresh=len(dff) - 1, axis=1)
Out[13]:
A B
0 0.517199 -0.806304
1 -0.643074 0.229602
2 0.656728 0.535155
3 NaN -0.162345
4 -0.309663 -0.783539
5 1.244725 -0.274514
6 -0.254232 NaN
7 -1.242430 0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416
So the above drops any column that has fewer than len(dff) - 1 non-NaN values, i.e. any column with 2 or more NaNs.
You can use a conditional list comprehension:
>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
A B
0 -0.819004 0.919190
1 0.922164 0.088111
2 0.188150 0.847099
3 NaN -0.053563
4 1.327250 -0.376076
5 3.724980 0.292757
6 -0.319342 NaN
7 -1.051529 0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026
Here is a possible solution:
s = dff.isnull().apply(sum, axis=0) # count the number of nan in each column
print(s)
A 1
B 1
C 3
dtype: int64
for col in dff:
    if s[col] >= 2:
        del dff[col]
Or
for c in dff:
    if sum(dff[c].isnull()) >= 2:
        dff.drop(c, axis=1, inplace=True)
I recommend the drop method. This is an alternative solution:
dff.drop(dff.loc[:, dff.isnull().sum() >= 2].columns, axis=1)
Say you have to drop columns having more than 70% null values.
data.drop(data.loc[:, list((100 * (data.isnull().sum() / len(data.index)) > 70))].columns, axis=1)
Another approach, shown below, drops columns having more than a certain number of NaN values:
df = df.drop(columns=[x for x in df if df[x].isna().sum() > 5])
For dropping columns having more than a certain percentage of NaN values:
df = df.drop(columns=[x for x in df if round(df[x].isna().sum() / len(df) * 100, 2) > 20])
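As a quick check of the percentage variant, here it is on the dff frame from the question (10 rows, with 1, 1 and 3 NaNs in A, B and C) using a 20% threshold instead of 70%:
print((dff.isna().sum() / len(dff) * 100).round(2))
A    10.0
B    10.0
C    30.0
dtype: float64
dff = dff.drop(columns=[x for x in dff if round(dff[x].isna().sum() / len(dff) * 100, 2) > 20])
# only C exceeds 20% NaN, so A and B are kept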