How can I add the means of b and c to my dataframe? I tried a merge, but it didn't seem to work. So I want two extra columns, b_mean and c_mean, added to my dataframe with the results of df.groupby('date').mean()
DataFrame
a b c date
0 2 3 5 1
1 5 9 1 1
2 3 7 1 1
I have the following code
import pandas as pd
a = [{'date': 1,'a':2, 'b':3, 'c':5}, {'date':1, 'a':5, 'b':9, 'c':1}, {'date':1, 'a':3, 'b':7, 'c':1}]
df = pd.DataFrame(a)
x = df.groupby('date').mean()
Edit:
Desired output would be the following
df.groupby('date').mean() returns:
a b c
date
1 3.333333 6.333333 2.333333
My desired result would be the following data frame
a b c date a_mean b_mean
0 2 3 5 1 3.3333 6.3333
1 5 9 1 1 3.3333 6.3333
2 3 7 1 1 3.3333 6.3333
As @ayhan mentioned, you can use groupby().transform() for this. transform is like apply, but it uses the same index as the original dataframe instead of the unique values in the column(s) grouped on.
df['a_mean'] = df.groupby('date')['a'].transform('mean')
df['b_mean'] = df.groupby('date')['b'].transform('mean')
>>> df
a b c date a_mean b_mean
0 2 3 5 1 3.333333 6.333333
1 5 9 1 1 3.333333 6.333333
2 3 7 1 1 3.333333 6.333333
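The same transform can also be applied to several columns at once; a minimal sketch (the positional assignment via to_numpy() is my addition, not part of the original answer):
means = df.groupby('date')[['a', 'b']].transform('mean')
df[['a_mean', 'b_mean']] = means.to_numpy()  # assign by position to the new column names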
solution
Use join with a rsuffix parameter.
df.join(df.groupby('date').mean(), on='date', rsuffix='_mean')
a b c date a_mean b_mean c_mean
0 2 3 5 1 3.333333 6.333333 2.333333
1 5 9 1 1 3.333333 6.333333 2.333333
2 3 7 1 1 3.333333 6.333333 2.333333
We can limit it to just ['a', 'b']
df.join(df.groupby('date')[['a', 'b']].mean(), on='date', rsuffix='_mean')
a b c date a_mean b_mean
0 2 3 5 1 3.333333 6.333333
1 5 9 1 1 3.333333 6.333333
2 3 7 1 1 3.333333 6.333333
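Since the question mentions a merge that didn't work, roughly the same result can be had with merge as well (a sketch; add_suffix renames the aggregated columns before merging back on date):
means = df.groupby('date')[['a', 'b']].mean().add_suffix('_mean')
df.merge(means, left_on='date', right_index=True)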
extra credit
Not really answering your question... but I thought it was neat!
# note: this relies on the stacked describe() layout of older pandas;
# DataFrame.append was removed in pandas 2.0 (use pd.concat([d1, g]) there)
d1 = df.set_index('date', append=True).swaplevel(0, 1)
g = df.groupby('date').describe()
d1.append(g).sort_index()
a b c
date
1 0 2.000000 3.000000 5.000000
1 5.000000 9.000000 1.000000
2 3.000000 7.000000 1.000000
25% 2.500000 5.000000 1.000000
50% 3.000000 7.000000 1.000000
75% 4.000000 8.000000 3.000000
count 3.000000 3.000000 3.000000
max 5.000000 9.000000 5.000000
mean 3.333333 6.333333 2.333333
min 2.000000 3.000000 1.000000
std 1.527525 3.055050 2.309401
I'm assuming that you need the mean value of a column added as a new column in the dataframe. Please correct me if that's not the case.
You can achieve this by taking the group mean of each column and assigning it to a new column:
In [1]: import pandas as pd
In [2]: a = [{'date': 1,'a':2, 'b':3, 'c':5}, {'date':1, 'a':5, 'b':9, 'c':1}, {'date':1, 'a':3, 'b':7, 'c':1}]
In [3]: df = pd.DataFrame(a)
In [4]: for col in ['b','c']:
...: df[col+"_mean"] = df.groupby('date')[col].transform('mean')
In [5]: df
Out[5]:
a b c date b_mean c_mean
0 2 3 5 1 6.333333 2.333333
1 5 9 1 1 6.333333 2.333333
2 3 7 1 1 6.333333 2.333333
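The loop can also be collapsed into a single vectorized call; a sketch along the same lines:
means = df.groupby('date')[['b', 'c']].transform('mean').add_suffix('_mean')
df = pd.concat([df, means], axis=1)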
Related
I would like to divide all columns except the first by a specific column of a dataframe and add the results as new columns with new headers, but I'm stuck. Here is my approach (please be gentle, I started programming a month ago):
I have this example dataframe:
import numpy as np
import pandas as pd

np.random.seed(0)
data = pd.DataFrame(np.random.randint(1, 10, size=(100, 10)),
                    columns=list('ABCDEFGHIJ'))
Now I create a list of the columns and drop 'A' and 'J':
cols = list(data.drop(columns=['A', 'J']).columns)
Then I would like to divide columns B-I by column J. In this example that would be easy, since the names are single letters, but in reality the column names are longer (for example "Donaudampfschifffahrtkapitän"; there are really funny and long words in German). That's why I want to do it with the cols list.
data[[cols]] = data[[cols]].div(data['J'].values,axis=0)
However, I get this error:
KeyError: "None of [Index([('B', 'C', 'D', 'E', 'F', 'G', 'H', 'I')], dtype='object')] are in the [columns]"
What is wrong? Or does someone know an even better approach?
And how can I add the results with their specific names ('B/J', 'C/J', ..., 'I/J') to the dataframe?
Thx in advance!
You need to remove the [], cols is already a list:
data[cols] = data[cols].div(data['J'], axis=0)
NB: using values is also not needed, as pandas performs index alignment (and in any case you don't change the order of the rows here); see the small demo after the output below.
output:
A B C D E F G H I J
0 6 0.125000 0.500000 0.500000 1.000000 0.500000 0.750000 0.375000 0.625000 8
1 7 1.500000 1.500000 0.333333 1.166667 1.333333 1.333333 1.500000 0.333333 6
2 9 0.555556 0.444444 0.111111 0.444444 0.666667 0.111111 0.333333 0.444444 9
3 2 0.500000 0.500000 0.500000 1.000000 0.125000 0.250000 0.125000 0.625000 8
4 4 0.428571 1.142857 0.428571 0.142857 0.142857 0.714286 0.857143 0.857143 7
...
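As a small demo of that index alignment (a toy example, not from the original answer):
import pandas as pd

num = pd.Series([10, 20], index=[1, 0])
den = pd.Series([1, 2], index=[0, 1])
print(num.div(den))  # aligned by label: index 0 -> 20/1 = 20.0, index 1 -> 10/2 = 5.0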
as new columns
data2 = pd.concat([data, data[cols].div(data['J'], axis=0).add_suffix('/J')],
axis=1)
output:
A B C D E F G H I J B/J C/J D/J E/J \
0 6 1 4 4 8 4 6 3 5 8 0.125000 0.500000 0.500000 1.000000
1 7 9 9 2 7 8 8 9 2 6 1.500000 1.500000 0.333333 1.166667
2 9 5 4 1 4 6 1 3 4 9 0.555556 0.444444 0.111111 0.444444
3 2 4 4 4 8 1 2 1 5 8 0.500000 0.500000 0.500000 1.000000
4 4 3 8 3 1 1 5 6 6 7 0.428571 1.142857 0.428571 0.142857
F/J G/J H/J I/J
0 0.500000 0.750000 0.375000 0.625000
1 1.333333 1.333333 1.500000 0.333333
2 0.666667 0.111111 0.333333 0.444444
3 0.125000 0.250000 0.125000 0.625000
4 0.142857 0.714286 0.857143 0.857143
Because cols is already a list, remove the nested []:
data = pd.DataFrame(np.random.randint(1, 10, size=(100, 10)), columns=list('ABCDEFGHIJ'))

# you can drop directly from the column names; converting to a list is not necessary
cols = data.columns.drop(['A', 'J'])

# alternative solution
cols = data.columns.difference(['A', 'J'], sort=False)

data[cols] = data[cols].div(data['J'], axis=0)
print(data)
A B C D E F G H \
0 2 1.000000 0.200000 0.200000 0.400000 1.600000 1.200000 0.800000
1 2 0.428571 0.285714 0.857143 1.142857 0.142857 0.714286 0.142857
2 2 0.222222 0.444444 1.000000 0.111111 0.222222 0.222222 0.333333
3 2 1.500000 3.000000 0.500000 0.500000 3.500000 2.000000 3.000000
4 1 0.666667 1.333333 0.833333 0.166667 1.166667 0.500000 1.500000
.. .. ... ... ... ... ... ... ...
95 8 0.857143 1.142857 0.142857 1.000000 0.571429 0.142857 1.000000
96 1 5.000000 4.000000 8.000000 8.000000 2.000000 7.000000 3.000000
97 2 0.888889 0.222222 0.222222 0.666667 1.000000 0.333333 0.444444
98 7 2.333333 0.666667 3.000000 2.000000 0.666667 2.000000 1.333333
99 2 2.000000 6.000000 8.000000 5.000000 9.000000 5.000000 3.000000
I J
0 0.800000 5
1 1.000000 7
2 1.000000 9
3 1.000000 2
4 0.833333 6
.. ... ..
95 0.857143 7
96 3.000000 1
97 1.000000 9
98 1.000000 3
99 8.000000 1
[100 rows x 10 columns]
If you need to add new columns, use concat:
df = pd.concat([data, data[cols].div(data['J'], axis=0).add_suffix('/J')], axis=1)
I know how to compute the groupby mean or std, but now I want to compute both at the same time.
My code:
df =
a b c d
0 Apple 3 5 7
1 Banana 4 4 8
2 Cherry 7 1 3
3 Apple 3 4 7
xdf = df.groupby('a').agg([np.mean(),np.std()])
Present output:
TypeError: _mean_dispatcher() missing 1 required positional argument: 'a'
Try removing the () from the np functions, so they are passed as callables rather than called:
xdf = df.groupby("a").agg([np.mean, np.std])
print(xdf)
Prints:
b c d
mean std mean std mean std
a
Apple 3 0.0 4.5 0.707107 7 0.0
Banana 4 NaN 4.0 NaN 8 NaN
Cherry 7 NaN 1.0 NaN 3 NaN
EDIT: To "flatten" the column MultiIndex:
xdf = df.groupby("a").agg([np.mean, np.std])
xdf.columns = xdf.columns.map("_".join)
print(xdf)
Prints:
b_mean b_std c_mean c_std d_mean d_std
a
Apple 3 0.0 4.5 0.707107 7 0.0
Banana 4 NaN 4.0 NaN 8 NaN
Cherry 7 NaN 1.0 NaN 3 NaN
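On recent pandas versions, passing NumPy functions to agg may emit a deprecation warning; the string names are the safer spelling and give the same result (a sketch):
xdf = df.groupby("a").agg(["mean", "std"])
xdf.columns = xdf.columns.map("_".join)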
I am trying to subtract the minimum value of each column from all values in that column in a pandas dataframe. But when I use df.describe().min()[columnName] to get the minimum of a column, it returns the minimum correctly for all but the last column; for that one it seems to return the standard deviation instead. Here is an example:
import pandas as pd
import numpy as np
# create dictionary and dataframe
dfDict = {'A': [1,2,3,4], 'B':[2,4,6,8],'C': [3,5,7,9]}
df = pd.DataFrame.from_dict(dfDict)
print(df)
output:
A B C
0 1 2 3
1 2 4 5
2 3 6 7
3 4 8 9
When I print(df.describe()), the values look fine. Output:
A B C
count 4.000000 4.000000 4.000000
mean 2.500000 5.000000 6.000000
std 1.290994 2.581989 2.581989
min 1.000000 2.000000 3.000000
25% 1.750000 3.500000 4.500000
50% 2.500000 5.000000 6.000000
75% 3.250000 6.500000 7.500000
max 4.000000 8.000000 9.000000
But when I print(df.describe().min()), the value for C is not the minimum but rather the standard deviation. I get this output:
A 1.000000
B 2.000000
C 2.581989
dtype: float64
Ultimately, I want to subtract the minimum value of each column from all the values in that column. I tried doing so as follows:
iterColNames = df.columns.tolist()
for colName in iterColNames:
    df[colName] = df[colName] - df.describe().min()[colName]
This leads to good values for the first two columns but not the last one.
If I print(df) now, it gives me this output:
A B C
0 0.0 0.0 0.418011
1 1.0 2.0 2.418011
2 2.0 4.0 4.418011
3 3.0 6.0 6.418011
Where it should give me the following output instead:
A B C
0 0.0 0.0 0.0
1 1.0 2.0 2.0
2 2.0 4.0 4.0
3 3.0 6.0 6.0
This seems rather simple, but I am not sure what the reason for this problem is.
Appreciate your help!
print(df.describe().min())
will compute the column-wise minimum of the summary table that df.describe() returns, i.e. the smallest value among the count, mean, std, min and quantile rows, which is likely not what you want.
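You can see where the stray value comes from by taking the minimum of column C of the describe() table itself; its smallest entry is the std row, not the min row:
print(df.describe()['C'].min())  # min across the count/mean/std/min/quantile rows
2.581988897471611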
Instead, simply
>>> df.min()
A 1
B 2
C 3
will return column-wise minimums.
This will give you the result you are looking for:
df - df.min()
A B C
0 0 0 0
1 1 2 2
2 2 4 4
3 3 6 6
df.min() calculates the minimum of each column, and when you subtract it from your df, pandas subtracts it from every value in the corresponding column. No need to use for loops; try to avoid them when using pandas, since its vectorized operations are generally much faster.
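To make the contrast concrete, here are the two styles side by side (an illustrative sketch):
result = df - df.min()        # vectorized: one expression handles every column

result_loop = df.copy()       # explicit loop doing the same work column by column
for col in df.columns:
    result_loop[col] = df[col] - df[col].min()

assert result.equals(result_loop)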
As a supplement to the other answers, which are generally better solutions to your question:
If you want to select a specific row by index in a dataframe like df.describe(), you can use loc:
df.describe().loc['min']
Out:
A 1.0
B 2.0
C 3.0
Name: min, dtype: float64
To get your desired output
df - df.describe().loc['min']
Out:
A B C
0 0.0 0.0 0.0
1 1.0 2.0 2.0
2 2.0 4.0 4.0
3 3.0 6.0 6.0
I have a dataframe with dates, id's and values.
For example:
date id value
2016-08-28 A 1
2016-08-28 B 1
2016-08-29 C 2
2016-09-02 B 0
2016-09-03 A 3
2016-09-06 C 1
2017-01-15 B 2
2017-01-18 C 3
2017-01-18 A 2
I want to apply a rolling mean per id, starting one element after the first occurrence (so the first value of each id gets NaN), so that the result would be:
date id value rolling_mean
2016-08-28 A 1 NaN
2016-08-28 B 1 NaN
2016-08-29 C 2 NaN
2016-09-02 B 0 0.5
2016-09-03 A 3 2.0
2016-09-06 C 1 1.5
2017-01-15 B 2 1.0
2017-01-18 C 3 2.0
2017-01-18 A 2 2.5
The closest I've come to this was:
grouped = df.groupby(["id", "value"])
df["rolling_mean"] = grouped["value"].shift(1).rolling(window = 2).mean()
But this gives me the wrong values back, as it keeps the order of the remaining elements.
Any idea?
Thank you in advance,
You can just groupby id and use transform:
df['rolling_mean'] = df.groupby('id')['value'].transform(lambda x: x.rolling(2).mean())
Output:
date id value rolling_mean
0 2016-08-28 A 1 NaN
1 2016-08-28 B 1 NaN
2 2016-08-29 C 2 NaN
3 2016-09-02 B 0 0.5
4 2016-09-03 A 3 2.0
5 2016-09-06 C 1 1.5
6 2017-01-15 B 2 1.0
7 2017-01-18 C 3 2.0
8 2017-01-18 A 2 2.5
Fix your code by grouping by id only:
grouped = df.groupby(["id"])
df['rolling_mean'] = grouped["value"].rolling(window=2).mean().reset_index(level=0, drop=True)
df
Out[67]:
date id value rolling_mean
0 2016-08-28 A 1 NaN
1 2016-08-28 B 1 NaN
2 2016-08-29 C 2 NaN
3 2016-09-02 B 0 0.5
4 2016-09-03 A 3 2.0
5 2016-09-06 C 1 1.5
6 2017-01-15 B 2 1.0
7 2017-01-18 C 3 2.0
8 2017-01-18 A 2 2.5
Like this:
df['rolling_mean'] = df.groupby('id')['value'].rolling(2).mean().reset_index(0,drop=True).sort_index()
Output:
date id value rolling_mean
0 2016-08-28 A 1 nan
1 2016-08-28 B 1 nan
2 2016-08-29 C 2 nan
3 2016-09-02 B 0 0.50
4 2016-09-03 A 3 2.00
5 2016-09-06 C 1 1.50
6 2017-01-15 B 2 1.00
7 2017-01-18 C 3 2.00
8 2017-01-18 A 2 2.50
I am trying to build a simple function to fill the columns of a pandas dataframe with draws from a distribution, but it fails to fill the whole table (df still has NaNs after fillna ...).
def simple_impute_missing(df):
    from numpy.random import normal
    rnd_filled = pd.DataFrame({c: normal(df[c].mean(), df[c].std(), len(df))
                               for c in df.columns[3:]})
    filled_df = df.fillna(rnd_filled)
    return filled_df
But the returned df still has NaNs!
I have checked to make sure that rnd_filled is full and has the right shape.
What is going on?
I think you need to remove the [3:] from df.columns[3:] to select all columns of df.
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3],
                   'B': [4, 5, 6],
                   'C': [np.nan, 8, 9],
                   'D': [1, 3, np.nan],
                   'E': [5, np.nan, 6],
                   'F': [7, np.nan, 3]})
print (df)
A B C D E F
0 1.0 4 NaN 1.0 5.0 7.0
1 NaN 5 8.0 3.0 NaN NaN
2 3.0 6 9.0 NaN 6.0 3.0
from numpy.random import normal

rnd_filled = pd.DataFrame({c: normal(df[c].mean(), df[c].std(), len(df))
                           for c in df.columns})
filled_df = df.fillna(rnd_filled)
print (filled_df)
A B C D E F
0 1.000000 4 6.922458 1.000000 5.000000 7.000000
1 2.277218 5 8.000000 3.000000 5.714767 6.245759
2 3.000000 6 9.000000 0.119522 6.000000 3.000000
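If you also want the random fills to be reproducible, you could seed a generator first; a sketch assuming NumPy >= 1.17 for default_rng:
import numpy as np

rng = np.random.default_rng(0)  # fixed seed -> reproducible imputed values
rnd_filled = pd.DataFrame({c: rng.normal(df[c].mean(), df[c].std(), len(df))
                           for c in df.columns})
filled_df = df.fillna(rnd_filled)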