Here is my pandas DataFrame:
df = pd.DataFrame({
'location': ['USA','USA','USA','USA', 'France','France','France','France'],
'date':['2020-11-20','2020-11-21','2020-11-22','2020-11-23', '2020-11-20','2020-11-21','2020-11-22','2020-11-23'],
'dm':[5.,4.,2.,2.,17.,3.,3.,7.]
})
For a given location (hence the groupby), I want the rolling mean of dm over 2 days. If I use this:
df['rolling']=df.groupby('location').dm.rolling(2).mean().values
I obtain this incorrect result:
location date dm rolling
0 USA 2020-11-20 5.0 NaN
1 USA 2020-11-21 4.0 10.0
2 USA 2020-11-22 2.0 3.0
3 USA 2020-11-23 2.0 5.0
4 France 2020-11-20 17.0 NaN
5 France 2020-11-21 3.0 4.5
6 France 2020-11-22 3.0 3.0
7 France 2020-11-23 7.0 2.0
While it should be:
location date dm rolling
0 USA 2020-11-20 5.0 NaN
1 USA 2020-11-21 4.0 4.5
2 USA 2020-11-22 2.0 3.0
3 USA 2020-11-23 2.0 2.0
4 France 2020-11-20 17.0 NaN
5 France 2020-11-21 3.0 10.0
6 France 2020-11-22 3.0 3.0
7 France 2020-11-23 7.0 5.0
Two questions:
What is my syntax actually doing?
What is the correct way to proceed?
The problem is that groupby creates a new level of MultiIndex, so to match the original index values it is necessary to remove it with Series.reset_index(level=0, drop=True). If you use .values instead, there is no alignment by index, so the order comes out wrong, as in your example:
df['rolling']=df.groupby('location').dm.rolling(2).mean().reset_index(level=0, drop=True)
print (df)
location date dm rolling
0 USA 2020-11-20 5.0 NaN
1 USA 2020-11-21 4.0 4.5
2 USA 2020-11-22 2.0 3.0
3 USA 2020-11-23 2.0 2.0
4 France 2020-11-20 17.0 NaN
5 France 2020-11-21 3.0 10.0
6 France 2020-11-22 3.0 3.0
7 France 2020-11-23 7.0 5.0
Details:
print (df.groupby('location').dm.rolling(2).mean())
location
France 4 NaN
5 10.0
6 3.0
7 5.0
USA 0 NaN
1 4.5
2 3.0
3 2.0
Name: dm, dtype: float64
print (df.groupby('location').dm.rolling(2).mean().reset_index(level=0, drop=True))
4 NaN
5 10.0
6 3.0
7 5.0
0 NaN
1 4.5
2 3.0
3 2.0
Name: dm, dtype: float64
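Alternatively, GroupBy.transform returns a result already aligned to the original index, so no index manipulation is needed at all. A minimal sketch of the same computation:
df['rolling'] = df.groupby('location')['dm'].transform(lambda s: s.rolling(2).mean())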
I have a pandas data frame df like this.
date id eng math sci
2021-08-01 00:00:37 23 4.0 5.0 7.0
2021-08-01 00:05:37 23 4.0 4.0 5.0
2021-08-01 00:10:37 23 4.0 4.0 6.0
2021-08-01 00:15:38 23 4.0 4.0 5.0
2021-08-01 00:20:37 23 4.0 5.0 6.0
2021-08-01 00:25:37 23 4.0 5.0 7.0
... ... ... ... ...
2021-08-31 23:38:40 1995 4.0 4.0 5.0
2021-08-31 23:43:40 1995 4.0 4.0 4.0
2021-08-31 23:48:40 1995 4.0 5.0 5.0
2021-08-31 23:53:40 1995 4.0 4.0 4.0
2021-08-31 23:58:40 1995 4.0 5.0 7.0
1661089 rows × 4 columns
I want to remove the row with the maximum sci value in each hour.
In each hour, I would like to remove exactly 1 maximum sci value. If there are 2 maximum values in an hour, as in the case above, remove just the first row.
So the result should look like:
date id eng math sci
2021-08-01 00:05:37 23 4.0 4.0 5.0
2021-08-01 00:10:37 23 4.0 4.0 6.0
2021-08-01 00:15:38 23 4.0 4.0 5.0
2021-08-01 00:20:37 23 4.0 5.0 6.0
2021-08-01 00:25:37 23 4.0 5.0 7.0
... ... ... ... ...
2021-08-31 23:38:40 1995 4.0 4.0 5.0
2021-08-31 23:43:40 1995 4.0 4.0 4.0
2021-08-31 23:48:40 1995 4.0 5.0 5.0
2021-08-31 23:53:40 1995 4.0 4.0 4.0
My first attempt:
df_filtered = df.reset_index()
df_temp_max = (df_filtered.groupby(['id', pd.Grouper(key='date', freq='1H')])
.agg({'sci': 'max'})
.reset_index())
df_test_max = pd.Series(df_temp_max['sci'].values)
df_filtered.insert(5, 'sci_max', df_test_max, True)
I got:
date id eng math sci sci_max
0 2021-08-01 00:00:37 23 4.0 5.0 7.0 7.0
1 2021-08-01 00:05:37 23 4.0 4.0 5.0 7.0
2 2021-08-01 00:10:37 23 4.0 4.0 6.0 7.0
3 2021-08-01 00:15:38 23 4.0 4.0 5.0 7.0
4 2021-08-01 00:20:37 23 4.0 5.0 6.0 7.0
... ... ... ... ... ... ...
1661084 2021-08-31 23:38:40 1995 4.0 4.0 5.0 NaN
1661085 2021-08-31 23:43:40 1995 4.0 4.0 4.0 NaN
1661086 2021-08-31 23:48:40 1995 4.0 5.0 5.0 NaN
1661087 2021-08-31 23:53:40 1995 4.0 4.0 4.0 NaN
1661088 2021-08-31 23:58:40 1995 4.0 5.0 7.0 NaN
Of course, this is wrong: the aggregated series is shorter than the original frame, so inserting it by position leaves many NaN values at the end.
I tried using a for loop, but it took too much time, and when I removed a row I got an indexing error as well.
Could you help me to solve this problem, please? Thank you so much!
Use idxmax instead of max to get the index to remove per group:
idx = df.groupby(['id', pd.Grouper(key='date', freq='H')])['sci'].idxmax()
out = df.drop(idx)
Output:
>>> idx
id date
23 2021-08-01 00:00:00 0
1995 2021-08-31 23:00:00 10
Name: sci, dtype: int64
>>> out
date id eng math sci
1 2021-08-01 00:05:37 23 4.0 4.0 5.0
2 2021-08-01 00:10:37 23 4.0 4.0 6.0
3 2021-08-01 00:15:38 23 4.0 4.0 5.0
4 2021-08-01 00:20:37 23 4.0 5.0 6.0
5 2021-08-01 00:25:37 23 4.0 5.0 7.0
6 2021-08-31 23:38:40 1995 4.0 4.0 5.0
7 2021-08-31 23:43:40 1995 4.0 4.0 4.0
8 2021-08-31 23:48:40 1995 4.0 5.0 5.0
9 2021-08-31 23:53:40 1995 4.0 4.0 4.0
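Note that idxmax returns the label of the first occurrence of the maximum, so ties are resolved exactly as requested: only the first of the tied rows is dropped. A quick check on a toy Series:
s = pd.Series([7.0, 5.0, 7.0])
s.idxmax()  # 0 -- the first of the two tied maxima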
I'm in this situation.
My df is like this:
A B
0 0.0 2.0
1 3.0 4.0
2 NaN 1.0
3 2.0 NaN
4 NaN 1.0
5 4.8 NaN
6 NaN 1.0
and I want to apply this line of code:
df['A'] = df['B'].fillna(df['A'])
and I expect a workflow and final output like this:
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 NaN NaN
4 1.0 1.0
5 NaN NaN
6 1.0 1.0
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 2.0 NaN
4 1.0 1.0
5 4.8 NaN
6 1.0 1.0
but I receive this error:
TypeError: Unsupported type Series
probably because each time there is an NA it tries to fill it with the whole series rather than with the single element at the same index in column B.
I receive the same error with a syntax like this:
df['C'] = df['B'].fillna(df['A'])
so the problem does not seem to be that I'm first changing the values of A with those of B and then trying to fill the B NAs with the values of a column that is technically the same as B.
I'm in a Databricks environment and I'm working with koalas data frames, but they work like pandas ones.
Can you help me?
Another option
Suppose the following dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'State':[1,2,3,4,5,6, 7, 8, 9, 10],
'Sno Center': ["Guntur", "Nellore", "Visakhapatnam", "Biswanath", "Doom-Dooma", "Guntur", "Labac-Silchar", "Numaligarh", "Sibsagar", "Munger-Jamalpu"],
'Mar-21': [121, 118.8, 131.6, 123.7, 127.8, 125.9, 114.2, 114.2, 117.7, 117.7],
'Apr-21': [121.1, 118.3, 131.5, np.NaN, 128.2, 128.2, 115.4, 115.1, np.NaN, 118.3]})
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 118.8 118.3
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 NaN
4 5 Doom-Dooma 127.8 128.2
5 6 Guntur 125.9 128.2
6 7 Labac-Silchar 114.2 115.4
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 NaN
9 10 Munger-Jamalpu 117.7 118.3
Then:
df.loc[(df["Mar-21"].notnull()) & (df["Apr-21"].isna()), "Apr-21"] = df["Mar-21"]
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 118.8 118.3
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 123.7
4 5 Doom-Dooma 127.8 128.2
5 6 Guntur 125.9 128.2
6 7 Labac-Silchar 114.2 115.4
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 117.7
9 10 Munger-Jamalpu 117.7 118.3
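In plain pandas the same fill can be written more compactly with fillna (a sketch, assuming the same frame):
df['Apr-21'] = df['Apr-21'].fillna(df['Mar-21'])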
IIUC, try with max():
df['A']=df[['A','B']].max(axis=1)
output of df:
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 2.0 NaN
4 1.0 1.0
5 4.8 NaN
6 1.0 1.0
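In plain pandas, combine_first also gives the expected result, filling the NaNs of one series with the values of another at the same index (a sketch; whether koalas supports it depends on your version):
df['A'] = df['B'].combine_first(df['A'])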
In pandas, I would like to create columns that represent a season (e.g. a travel season) starting in November and ending in October of the next year.
This is my snippet:
from numpy import random
import pandas as pd
random.seed(0)
df = pd.DataFrame({
'date': pd.date_range('1990-01-01', freq='M', periods=12),
'travel_2016': random.randint(10, size=(12)),
'travel_2017': random.randint(10, size=(12)),
'travel_2018': random.randint(10, size=(12)),
'travel_2019': random.randint(10, size=(12)),
'travel_2020': random.randint(10, size=(12))})
df['month_date'] = df['date'].dt.strftime('%m')
df = df.drop(columns = ['date'])
I was trying the approach from "pandas groupby by customized year, e.g. a school year".
I failed after 'unpivoting' the table with both solutions there. It would be easier for me to keep the pivot table for future operations.
My desired output would be something like this:
season_2016/2017 season_2017/2018 season_2018/2019 season_2019/2020 month_date
0 8 7 7 4 11
1 0 1 4 8 12
2 1 4 5 9 01
3 8 3 5 7 02
4 4 7 8 3 03
5 6 8 4 4 04
6 5 8 3 1 05
7 7 0 1 1 06
8 1 2 1 3 07
9 8 9 7 5 08
10 7 7 7 8 09
11 9 1 4 0 10
Many thanks!
Your table is already formatted roughly as you want: you're basically shifting all the rows down by 2 and moving the 2 bottom rows up to the start, but shifted into the next year's column.
>>> year_ends = df.shift(-10)
>>> year_ends = year_ends.drop(columns=['month_date']).shift(axis='columns').join(year_ends['month_date'])
>>> year_ends
travel_2016 travel_2017 travel_2018 travel_2019 travel_2020 month_date
0 NaN 7.0 8.0 3.0 2.0 11
1 NaN 6.0 9.0 3.0 7.0 12
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN
The rest is pretty easy:
>>> seasons = df.shift(2).fillna(year_ends)
>>> seasons
travel_2016 travel_2017 travel_2018 travel_2019 travel_2020 month_date
0 NaN 7.0 8.0 3.0 2.0 11
1 NaN 6.0 9.0 3.0 7.0 12
2 5.0 8.0 4.0 3.0 2.0 01
3 0.0 8.0 3.0 7.0 0.0 02
4 3.0 1.0 0.0 0.0 0.0 03
5 3.0 6.0 3.0 1.0 4.0 04
6 7.0 7.0 5.0 9.0 5.0 05
7 9.0 7.0 0.0 9.0 5.0 06
8 3.0 8.0 2.0 0.0 6.0 07
9 5.0 1.0 3.0 4.0 8.0 08
10 2.0 5.0 8.0 7.0 4.0 09
11 4.0 9.0 1.0 3.0 1.0 10
Of course you should now rename the columns appropriately:
>>> seasons.rename(columns=lambda c: c if not c.startswith('travel_') else f"season_{int(c[7:]) - 1}/{c[7:]}")
season_2015/2016 season_2016/2017 season_2017/2018 season_2018/2019 season_2019/2020 month_date
0 NaN 7.0 8.0 3.0 2.0 11
1 NaN 6.0 9.0 3.0 7.0 12
2 5.0 8.0 4.0 3.0 2.0 01
3 0.0 8.0 3.0 7.0 0.0 02
4 3.0 1.0 0.0 0.0 0.0 03
5 3.0 6.0 3.0 1.0 4.0 04
6 7.0 7.0 5.0 9.0 5.0 05
7 9.0 7.0 0.0 9.0 5.0 06
8 3.0 8.0 2.0 0.0 6.0 07
9 5.0 1.0 3.0 4.0 8.0 08
10 2.0 5.0 8.0 7.0 4.0 09
11 4.0 9.0 1.0 3.0 1.0 10
Note that the first 2 values of season 2015/2016 are NaN, which makes sense, as those months were not in the initial dataframe.
An alternate way is to use datetime tools. This may be more generic:
>>> data = df.set_index('month_date').rename_axis('year', axis='columns').stack().reset_index(name='data')
>>> data.head()
month_date year data
0 01 travel_2016 5
1 01 travel_2017 8
2 01 travel_2018 4
3 01 travel_2019 3
4 01 travel_2020 2
>>> dates = data['year'].str[7:].str.cat(data['month_date']).transform(pd.to_datetime, format='%Y%m')
>>> dates.head()
0 2016-01-01
1 2017-01-01
2 2018-01-01
3 2019-01-01
4 2020-01-01
Name: year, dtype: datetime64[ns]
Then, as in the linked question, get the fiscal year starting in November:
>>> season = dates.dt.to_period('Q-OCT').dt.qyear.rename('season')
>>> seasonal_data = data.join(season).pivot(index='month_date', columns='season', values='data')
>>> seasonal_data.rename(columns=lambda c: f"season_{c - 1}/{c}", inplace=True)
>>> seasonal_data.reindex([*df['month_date'][-2:], *df['month_date'][:-2]]).reset_index()
season month_date season_2015/2016 season_2016/2017 season_2017/2018 season_2018/2019 season_2019/2020 season_2020/2021
0 11 NaN 7.0 8.0 3.0 2.0 4.0
1 12 NaN 6.0 9.0 3.0 7.0 9.0
2 01 5.0 8.0 4.0 3.0 2.0 NaN
3 02 0.0 8.0 3.0 7.0 0.0 NaN
4 03 3.0 1.0 0.0 0.0 0.0 NaN
5 04 3.0 6.0 3.0 1.0 4.0 NaN
6 05 7.0 7.0 5.0 9.0 5.0 NaN
7 06 9.0 7.0 0.0 9.0 5.0 NaN
8 07 3.0 8.0 2.0 0.0 6.0 NaN
9 08 5.0 1.0 3.0 4.0 8.0 NaN
10 09 2.0 5.0 8.0 7.0 4.0 NaN
11 10 4.0 9.0 1.0 3.0 1.0 NaN
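A quick sanity check of the Q-OCT trick: a fiscal year ending in October assigns November and December to the next qyear, which matches the desired season boundary:
>>> pd.Timestamp('2016-11-15').to_period('Q-OCT').qyear
2017
>>> pd.Timestamp('2016-10-15').to_period('Q-OCT').qyear
2016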
I'm trying to use pandas groupby, apply, where, and quantile to replace values that fall below the 50% quantile with NaN within each 'date' group, but it seems to be returning lists in the cells. How can I get these results in a new column after the column 'value'?
This is the code I have (any other approaches are welcome). It returns lists in cells:
In[0]: df.groupby('date')['value'].apply(lambda x: np.where(x<x.quantile(0.5),np.nan,x))
Out[0]:
date value
2019-12-23 [nan, nan, 3.0, 4.0, 5.0]
2014-08-13 [nan, nan, 3.0, 4.0, 5.0]
If I assign it to a new column, it returns NaN in the new column:
In[1]: df['new_value']= df.groupby('date')['value'].apply(lambda x: np.where(x<x.quantile(0.5),np.nan,x))
Out[1]:
date value new_value
0 2019-12-23 1.0 NaN
1 2019-12-23 2.0 NaN
2 2019-12-23 3.0 NaN
3 2019-12-23 4.0 NaN
4 2019-12-23 5.0 NaN
5 2014-08-13 1.0 NaN
6 2014-08-13 2.0 NaN
7 2014-08-13 3.0 NaN
8 2014-08-13 4.0 NaN
9 2014-08-13 5.0 NaN
I would like to get to this:
date value new_value
0 2019-12-23 1.0 NaN
1 2019-12-23 2.0 NaN
2 2019-12-23 3.0 3.0
3 2019-12-23 4.0 4.0
4 2019-12-23 5.0 5.0
5 2014-08-13 1.0 NaN
6 2014-08-13 2.0 NaN
7 2014-08-13 3.0 3.0
8 2014-08-13 4.0 4.0
9 2014-08-13 5.0 5.0
Instead of apply, you can use transform, which returns a result aligned to the original index:
df["new_value"] = df.groupby("date")["value"].transform(
lambda x: np.where(x < x.quantile(0.5), np.nan, x)
)
date value new_value
0 2019-12-23 1.0 NaN
1 2019-12-23 2.0 NaN
2 2019-12-23 3.0 3.0
3 2019-12-23 4.0 4.0
4 2019-12-23 5.0 5.0
5 2014-08-13 1.0 NaN
6 2014-08-13 2.0 NaN
7 2014-08-13 3.0 3.0
8 2014-08-13 4.0 4.0
9 2014-08-13 5.0 5.0
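Since the 0.5 quantile is just the median, the same thing can also be written without a lambda (a sketch under that assumption):
q = df.groupby("date")["value"].transform("median")
df["new_value"] = df["value"].where(df["value"] >= q)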
I have a dataframe of race results. I'd like to create a series that takes the last stage position and subtracts from it the average of all the stages before it. Here is a small slice of the df (it could have more stages, countries, and rows):
race_location stage1_position stage2_position stage3_position number_of_stages
AUS 2.0 2.0 NaN 2
AUS 1.0 5.0 NaN 2
AUS 3.0 4.0 NaN 2
AUS 4.0 8.0 NaN 2
AUS 10.0 6.0 NaN 2
AUS 9.0 7.0 NaN 2
FRA 23.0 1.0 10.0 3
FRA 6.0 12.0 24.0 3
FRA 14.0 11.0 14.0 3
FRA 18.0 10.0 1.0 3
FRA 15.0 14.0 4.0 3
USA 24.0 NaN NaN 1
USA 7.0 NaN NaN 1
USA 22.0 NaN NaN 1
USA 11.0 NaN NaN 1
USA 8.0 NaN NaN 1
USA 16.0 NaN NaN 1
USA 13.0 NaN NaN 1
USA 19.0 NaN NaN 1
USA 5.0 NaN NaN 1
USA 25.0 NaN NaN 1
The output would be
last_stage_minus_average
0
4
1
4
-4
-2
-2
15
1.5
-13
-10.5
0
0
0
0
0
0
0
0
0
0
0
This won't work, but I was thinking of something like this:
new_series = []
for country in country_list:
    num_stages = df.loc[df['race_location'] == country, 'number_of_stages']
    difference = (df.loc[df['race_location'] == country, num_stages]
                  - df.iloc[:, 0:num_stages - 1].mean(axis=1))
    new_series.append(difference)
I'm not sure how to go about doing this. Any help or direction would be amazing!
# use pandas apply to take the mean of the first n-1 stages and subtract it from the last stage
df.apply(lambda x: x.iloc[x.number_of_stages] - np.mean(x.iloc[1:x.number_of_stages]), axis=1).fillna(0)
Out[264]:
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64
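The same idea written as a named function, spelling out the positional assumption it relies on (the stage columns must occupy positions 1..number_of_stages, with race_location at position 0); a sketch, assuming numpy is available:
import numpy as np

def last_minus_prior_mean(row):
    # race_location is at position 0; the stages occupy positions 1..n
    n = int(row['number_of_stages'])
    last = row.iloc[n]
    prior = row.iloc[1:n]
    # a single-stage racer has no prior stages; return 0.0 as .fillna(0) does
    return last - np.mean(prior) if len(prior) else 0.0

df.apply(last_minus_prior_mean, axis=1)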
I'd use filter to get just the stage columns, then stack and groupby:
stages = df.filter(regex=r'^stage\d+.*')
stages.stack().groupby(level=0).apply(
lambda x: x.iloc[-1] - x.iloc[:-1].mean()
).fillna(0)
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64
How it works:
stack automatically drops the NaN values when converting to a Series.
After grouping by the first level of the new MultiIndex, position -1 is the last value within each group.
So we use a lambda that takes the mean of everything up to the last value, x.iloc[:-1].mean(),
and subtracts that from the last value, x.iloc[-1].
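A quick illustration of the stack behaviour this relies on, using a hypothetical 2-row frame (note that in recent pandas versions stack(future_stack=True) keeps NaNs instead of dropping them):
mini = pd.DataFrame({'s1': [2.0, 23.0], 's2': [float('nan'), 1.0]})
mini.stack()
# 0  s1     2.0
# 1  s1    23.0
#    s2     1.0
# dtype: float64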
subtracts that by the average of all the stages before that
It's not a big deal, but I'm just curious! Unlike your desired output, but following your description: if a racer finished only one race, shouldn't their result be inf or NaN instead of 0, to distinguish them from racers who have already done 2-3 races but whose last result is exactly equal to their average (like racer #1 vs racers #11-20)?
df_sp = df.filter(regex=r'^stage\d+.*')
df['last'] = df_sp.ffill(axis=1).iloc[:, -1]
df['mean'] = (df_sp.sum(axis=1) - df['last']) / (df['number_of_stages'] - 1)
print(df['last'] - df['mean'])
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
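If you do prefer 0 for the single-stage racers, as in the desired output, just append .fillna(0):
print((df['last'] - df['mean']).fillna(0))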