Rolling average on previous dates per group - python

I have the following dataset:
Name Loc Site Date Total
Alex Italy A 12.31.2020 30
Alex Italy B 12.31.2020 40
Alex Italy B 12.30.2020 100
Alex Italy A 12.30.2020 80
Alex France A 12.28.2020 10
Alex France B 12.28.2020 20
Alex France B 12.27.2020 10
For each row I want to add the average of Total on the previous day, per Name and Loc (i.e. grouping by Name, Loc and Date).
This is the outcome I'm looking for:
Name Loc Site Date Total Prv_Avg
Alex Italy A 12.31.2020 30 90
Alex Italy B 12.31.2020 40 90
Alex Italy B 12.30.2020 100 NULL
Alex Italy A 12.30.2020 80 NULL
Alex France A 12.28.2020 10 10
Alex France B 12.28.2020 20 10
Alex France B 12.27.2020 10 NULL
The NULLs are for rows where the previous date couldn't be found.
I've tried rolling but got mixed up with the index.
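For reference, a minimal sketch constructing the sample frame above (dates kept as strings, exactly as shown in the question):

import pandas as pd

df = pd.DataFrame({
    'Name':  ['Alex'] * 7,
    'Loc':   ['Italy', 'Italy', 'Italy', 'Italy', 'France', 'France', 'France'],
    'Site':  ['A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'Date':  ['12.31.2020', '12.31.2020', '12.30.2020', '12.30.2020',
              '12.28.2020', '12.28.2020', '12.27.2020'],
    'Total': [30, 40, 100, 80, 10, 20, 10],
})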

First aggregate the mean per the 3 columns, then add one day to the Date level of the MultiIndex so it lines up with the next day's rows, and finally use DataFrame.join:
df['Date'] = pd.to_datetime(df['Date'])
s = df.groupby(['Name','Loc','Date'])['Total'].mean().rename('Prv_Avg')
print (s)
Name  Loc     Date
Alex  France  2020-12-27    10.0
              2020-12-28    15.0
      Italy   2020-12-30    90.0
              2020-12-31    35.0
Name: Prv_Avg, dtype: float64
s = s.rename(lambda x: x + pd.Timedelta('1 day'), level=2)
print (s)
Name  Loc     Date
Alex  France  2020-12-28    10.0
              2020-12-29    15.0
      Italy   2020-12-31    90.0
              2021-01-01    35.0
Name: Prv_Avg, dtype: float64
df = df.join(s, on=['Name','Loc','Date'])
print (df)
Name Loc Site Date Total Prv_Avg
0 Alex Italy A 2020-12-31 30 90.0
1 Alex Italy B 2020-12-31 40 90.0
2 Alex Italy B 2020-12-30 100 NaN
3 Alex Italy A 2020-12-30 80 NaN
4 Alex France A 2020-12-28 10 10.0
5 Alex France B 2020-12-28 20 10.0
6 Alex France B 2020-12-27 10 NaN

Another possible solution:
import numpy as np

grp = ['Name', 'Loc', 'Date']
s = df.groupby(grp)['Total'].mean().shift().rename('Prv_Avg')
idx1 = s.index.get_level_values('Name').to_list()
idx2 = s.index.get_level_values('Loc').to_list()
df.merge(s.where((idx1 == np.roll(idx1, 1)) &
                 (idx2 == np.roll(idx2, 1))).reset_index(),
         on=grp)
Output:
Name Loc Site Date Total Prv_Avg
0 Alex Italy A 12.31.2020 30 90.0
1 Alex Italy B 12.31.2020 40 90.0
2 Alex Italy B 12.30.2020 100 NaN
3 Alex Italy A 12.30.2020 80 NaN
4 Alex France A 12.28.2020 10 10.0
5 Alex France B 12.28.2020 20 10.0
6 Alex France B 12.27.2020 10 NaN
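Note that this second approach (and the sketch below) takes the previous date present in the data within each Name/Loc group, not literally the previous calendar day. A minimal sketch of the same idea using a group-wise shift instead of the np.roll masking, assuming df['Date'] has already been converted with pd.to_datetime so the groupby sorts chronologically:

# mean Total per (Name, Loc, Date), sorted ascending by the groupby
daily = df.groupby(['Name', 'Loc', 'Date'])['Total'].mean()
# previous row's mean within each Name/Loc group (NaN for the first date)
prev = daily.groupby(level=['Name', 'Loc']).shift().rename('Prv_Avg')
# map back to the original rows
out = df.join(prev, on=['Name', 'Loc', 'Date'])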

Related

Missing value replacement using mode in pandas in a subgroup of a group

I have a data set as below. I need to group a subset of the rows and fill the missing values using the mode. Specifically, the missing values for Tom from UK need to be filled: group the Tom/UK rows, and within that group replace each missing value with the most frequent value. (The figures showing the grouping and the desired output are not reproduced here.)
Attaching the dataset:
Name location Value
Tom USA 20
Tom UK Nan
Tom USA Nan
Tom UK 20
Jack India Nan
Nihal Africa 30
Tom UK Nan
Tom UK 20
Tom UK 30
Tom UK 20
Tom UK 30
Sam UK 30
Sam UK 30
try:
df = df\
    .set_index(['Name', 'location'])\
    .fillna(
        df[df.Name.eq('Tom') & df.location.eq('UK')]\
            .groupby(['Name', 'location'])\
            .agg(pd.Series.mode)\
            .to_dict()
    )\
    .reset_index()
Output:
Name location Value
0 Tom USA 20
1 Tom UK 20
2 Tom USA NaN
3 Tom UK 20
4 Jack India NaN
5 Nihal Africa 30
6 Tom UK 20
7 Tom UK 20
8 Tom UK 30
9 Tom UK 20
10 Tom UK 30
11 Sam UK 30
12 Sam UK 30
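If you want to apply the same idea to every Name/location group rather than only Tom/UK, a minimal sketch using a group-wise transform (the fill_mode helper and the choice of the first mode on ties are assumptions, and the blanks are assumed to be real NaN values, not the string 'Nan'):

def fill_mode(s):
    # mode() may return several values (ties) or none (all-NaN group)
    m = s.mode()
    return s.fillna(m.iat[0]) if len(m) else s

df['Value'] = df.groupby(['Name', 'location'])['Value'].transform(fill_mode)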

Combine rows containing blanks with each other's data - Python [duplicate]

I have a pandas DataFrame as below:
id age gender country sales_year
1 None M India 2016
2 23 F India 2016
1 20 M India 2015
2 25 F India 2015
3 30 M India 2019
4 36 None India 2019
I want to group by id and take the latest row as per sales_year, with all non-null elements.
Expected output:
id age gender country sales_year
1 20 M India 2016
2 23 F India 2016
3 30 M India 2019
4 36 None India 2019
In pyspark,
df = df.withColumn('age', f.first('age', True).over(Window.partitionBy("id").orderBy(df.sales_year.desc())))
But I need the same solution in pandas.
EDIT:
This can be the case for all the columns, not just age. I need it to pick up the latest non-null data (where the id exists) for all the ids.
Use GroupBy.first:
df1 = df.groupby('id', as_index=False).first()
print (df1)
id age gender country sales_year
0 1 20.0 M India 2016
1 2 23.0 F India 2016
2 3 30.0 M India 2019
3 4 36.0 NaN India 2019
If column sales_year is not sorted:
df2 = df.sort_values('sales_year', ascending=False).groupby('id', as_index=False).first()
print (df2)
id age gender country sales_year
0 1 20.0 M India 2016
1 2 23.0 F India 2016
2 3 30.0 M India 2019
3 4 36.0 NaN India 2019
print(df.replace('None',np.NaN).groupby('id').first())
first replace the string 'None' with NaN (np.nan, so this needs import numpy as np)
next use groupby() to group by 'id'
finally take the first non-null value per column using first()
Use -
df.dropna(subset=['gender']).sort_values('sales_year', ascending=False).groupby('id')['age'].first()
Output
id
1 20
2 23
3 30
4 36
Name: age, dtype: object
Remove the ['age'] to get full rows -
df.dropna().sort_values('sales_year', ascending=False).groupby('id').first()
Output
age gender country sales_year
id
1 20 M India 2015
2 23 F India 2016
3 30 M India 2019
4 36 None India 2019
You can put the id back as a column with reset_index() -
df.dropna().sort_values('sales_year', ascending=False).groupby('id').first().reset_index()
Output
id age gender country sales_year
0 1 20 M India 2015
1 2 23 F India 2016
2 3 30 M India 2019
3 4 36 None India 2019
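Putting the pieces above together, a sketch that treats the string 'None' as missing, keeps the latest non-null value per column, and returns id as a column (assuming the blanks are stored as the string 'None'):

import numpy as np

out = (df.replace('None', np.nan)
         .sort_values('sales_year', ascending=False)
         .groupby('id', as_index=False)
         .first())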

How would I rank within a groupby object based on another row condition in Pandas? Example included

The dataframe below has 4 columns: runner_name,race_date, height_in_inches,top_ten_finish.
I want to groupby race_date, and if the runner finished in the top ten for that race_date, rank his height_in_inches among only the other runners who finished in the top ten for that race_date. How would I do this?
This is the original dataframe:
>>> import pandas as pd
>>> d = {"runner":['mike','paul','jim','dave','douglas'],
... "race_date":['2019-02-02','2019-02-02','2020-02-02','2020-02-01','2020-02-01'],
... "height_in_inches":[72,68,70,74,73],
... "top_ten_finish":["yes","yes","no","yes","no"]}
>>> df = pd.DataFrame(d)
>>> df
runner race_date height_in_inches top_ten_finish
0 mike 2019-02-02 72 yes
1 paul 2019-02-02 68 yes
2 jim 2020-02-02 70 no
3 dave 2020-02-01 74 yes
4 douglas 2020-02-01 73 no
>>>
and this is what I'd like the result to look like. Notice how if they didn't finish in the top 10 of a race, they get a value of 0 for that new column.
runner race_date height_in_inches top_ten_finish if_top_ten_height_rank
0 mike 2019-02-02 72 yes 1
1 paul 2019-02-02 68 yes 2
2 jim 2020-02-02 70 no 0
3 dave 2020-02-01 74 yes 1
4 douglas 2020-02-01 73 no 0
Thank you!
We can do groupby + filter with rank
df['rank']=df[df.top_ten_finish.eq('yes')].groupby('race_date')['height_in_inches'].rank(ascending=False)
df['rank'].fillna(0,inplace=True)
df
Out[87]:
runner race_date height_in_inches top_ten_finish rank
0 mike 2019-02-02 72 yes 1.0
1 paul 2019-02-02 68 yes 2.0
2 jim 2020-02-02 70 no 0.0
3 dave 2020-02-01 74 yes 1.0
4 douglas 2020-02-01 73 no 0.0
You can filter and rank on groupby() then assign back:
df['if_top_ten_height_rank'] = (df.loc[df['top_ten_finish']=='yes','height_in_inches']
                                  .groupby(df['race_date']).rank(ascending=False)
                                  .reindex(df.index, fill_value=0)
                                  .astype(int)
                                )
Output:
runner race_date height_in_inches top_ten_finish if_top_ten_height_rank
-- -------- ----------- ------------------ ---------------- ------------------------
0 mike 2019-02-02 72 yes 1
1 paul 2019-02-02 68 yes 2
2 jim 2020-02-02 70 no 0
3 dave 2020-02-01 74 yes 1
4 douglas 2020-02-01 73 no 0

How can I swap the values of a selective column randomly?

Suppose I have a large dataset (in CSV format) like the following:
Country Age Salary Purchased
0 France 44 72000 No
1 Spain 27 48000 Yes
2 Germany 30 54000 No
3 Spain 38 61000 No
4 Germany 40 45000 Yes
5 France 35 58000 Yes
6 Spain 75 52000 No
7 France 48 79000 Yes
8 Germany 50 83000 No
9 France 37 67000 Yes
Now how can I swap all the values of a selected column randomly? For example,
I want to shuffle all the values of the first column 'Country' randomly.
Looking for your suggestions. Thanks in advance!
Shuffle in-place using np.random.shuffle:
# <= 0.23
# np.random.shuffle(df['Country'].values)
# 0.24+
np.random.shuffle(df['Country'].to_numpy())
Or, assign back with np.random.choice:
df['Country'] = np.random.choice(df['Country'], len(df), replace=False)
permutation
np.random.seed([3, 1415])
df.assign(Country=df.Country.to_numpy()[np.random.permutation(len(df))])
Country Age Salary Purchased
0 France 44 72000 No
1 Germany 27 48000 Yes
2 France 30 54000 No
3 Spain 38 61000 No
4 France 40 45000 Yes
5 Spain 35 58000 Yes
6 Germany 75 52000 No
7 Spain 48 79000 Yes
8 Germany 50 83000 No
9 France 37 67000 Yes
sample
df.assign(Country=df.Country.sample(frac=1).to_numpy())
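If the shuffle needs to be reproducible, a small sketch using sample with a fixed random_state (the seed value 0 is arbitrary):

df['Country'] = df['Country'].sample(frac=1, random_state=0).to_numpy()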

New dataframe from grouping together two columns

I have a dataset that looks like the following.
Region_Name Date Average
London 1990Q1 105
London 1990Q1 118
... ... ...
London 2018Q1 157
I converted the date into quarters and wish to create a new dataframe with the matching quarters and region names grouped together, with the mean of Average.
What is the best way to accomplish such a task?
I have been looking at the groupby function but keep getting a traceback.
For example:
new_df = df.groupby(['Resion_Name','Date']).mean()
dict3={'Region_Name': ['London','Newyork','London','Newyork','London','London','Newyork','Newyork','Newyork','Newyork','London'],
'Date' : ['1990Q1','1990Q1','1990Q2','1990Q2','1991Q1','1991Q1','1991Q2','1992Q2','1993Q1','1993Q1','1994Q1'],
'Average': [34,56,45,67,23,89,12,45,67,34,67]}
df3=pd.DataFrame(dict3)
Now my df3 is as follows:
Region_Name Date Average
0 London 1990Q1 34
1 Newyork 1990Q1 56
2 London 1990Q2 45
3 Newyork 1990Q2 67
4 London 1991Q1 23
5 London 1991Q1 89
6 Newyork 1991Q2 12
7 Newyork 1992Q2 45
8 Newyork 1993Q1 67
9 Newyork 1993Q1 34
10 London 1994Q1 67
code looks as follows:
new_df = df3.groupby(['Region_Name','Date'])
new1=new_df['Average'].transform('mean')
Result of new1 (a Series aligned with df3's index):
print(new1)
0 34.0
1 56.0
2 45.0
3 67.0
4 56.0
5 56.0
6 12.0
7 45.0
8 50.5
9 50.5
10 67.0
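As an aside, the original attempt fails (KeyError) because 'Resion_Name' is misspelled. And if you want a new dataframe with one row per Region_Name/Date pair instead of keeping the original shape, a minimal sketch using mean instead of transform:

new_df = df3.groupby(['Region_Name', 'Date'], as_index=False)['Average'].mean()
print(new_df)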
