Applying .mean() to grouped data with a condition - python

I have a df that looks like this:
Day Country Type Product Cost
Mon US 1 a1 0
Mon US 2 a1 5
Mon US 3 a1 6
Mon CA 1 a1 8
Mon CA 2 a1 0
Mon CA 3 a1 1
I am trying to make it to this:
Day Country Type Product Cost Average
Mon US 1 a1 0 (5+6)/2
Mon US 2 a1 5 (5+6)/2
Mon US 3 a1 6 (5+6)/2
Mon CA 1 a1 8 (8+1)/2
Mon CA 2 a1 0 (8+1)/2
Mon CA 3 a1 1 (8+1)/2
The idea is to group by Country and Product and get the average Cost, but only using the rows where Cost is > 0.
What I've tried:
np.where(df['Cost']>0, df.loc[df.groupby(['Country','Product'])]['Cost'].mean())
But I get:
ValueError: Cannot index with multidimensional key
What is the best-practice way to apply built-in functions like .mean(), .max(), etc. to a grouped pandas DataFrame with a filter?

The first idea is to replace 0 with NaN and then use GroupBy.transform with 'mean'; missing values are omitted by default:
print (df.assign(new = df['Cost'].where(df['Cost'] > 0)))
Day Country Type Product Cost new
0 Mon US 1 a1 0 NaN
1 Mon US 2 a1 5 5.0
2 Mon US 3 a1 6 6.0
3 Mon CA 1 a1 8 8.0
4 Mon CA 2 a1 0 NaN
5 Mon CA 3 a1 1 1.0
df['Average'] = (df.assign(new = df['Cost'].where(df['Cost'] > 0))
                   .groupby(['Country','Product'])['new']
                   .transform('mean'))
print (df)
Day Country Type Product Cost Average
0 Mon US 1 a1 0 5.5
1 Mon US 2 a1 5 5.5
2 Mon US 3 a1 6 5.5
3 Mon CA 1 a1 8 4.5
4 Mon CA 2 a1 0 4.5
5 Mon CA 3 a1 1 4.5
Or first filter, aggregate the mean, and assign it back with DataFrame.join:
s = df[df["Cost"] > 0].groupby(['Country','Product'])['Cost'].mean().rename('Average')
df = df.join(s, on=['Country','Product'])
print (df)
Day Country Type Product Cost Average
0 Mon US 1 a1 0 5.5
1 Mon US 2 a1 5 5.5
2 Mon US 3 a1 6 5.5
3 Mon CA 1 a1 8 4.5
4 Mon CA 2 a1 0 4.5
5 Mon CA 3 a1 1 4.5

Try this:
df[df["Cost"] > 0].groupby(['Country','Product'])["Cost"].mean()
It keeps only the rows where Cost is greater than zero, groups them, and then takes the mean.
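Note that this returns a Series indexed by ('Country', 'Product') rather than a column aligned to the original rows. If you also want the per-row Average column from the question, one possible sketch (equivalent in spirit to the join approach above) is to merge the result back:
avg = (df[df["Cost"] > 0]
       .groupby(['Country', 'Product'], as_index=False)["Cost"]
       .mean()
       .rename(columns={"Cost": "Average"}))
df = df.merge(avg, on=['Country', 'Product'], how='left')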

Related

How to insert missing rows in multi-index

I have a dataframe that looks something like this
df = pd.DataFrame({'year':[23,23,23,23,23,23], 'month':[1,1,1,2,3,3], 'utility':['A','A','B','A','A','B'], 'state':['NY','NJ','CA','NJ','NY','CA']})
year month utility state
0 23 1 A NY
1 23 1 A NJ
2 23 1 B CA
3 23 2 A NJ
4 23 3 A NY
5 23 3 B CA
And I would like to create new rows for utilities-state combinations with missing months. So the new dataframe would look something like this
year month utility state
0 23 1 A NY
1 23 1 A NJ
2 23 1 B CA
3 23 2 A NY
4 23 2 A NJ
5 23 2 B CA
6 23 3 A NY
7 23 3 A NJ
8 23 3 B CA
I know that I can use a MultiIndex and then reindex, but using the from_product() method results in utility-state combinations not present in the original df (I do not want a utility A - CA combination, for example).
I thought about concatenating the utility and state columns and then getting the cartesian product from that, but I think there must be a simpler method.
One option is with DataFrame.complete from pyjanitor. For your data, you are basically doing a combination of (year, month) and (utility, state):
# pip install pyjanitor
import janitor

df.complete(('year', 'month'), ('utility', 'state'))
year month utility state
0 23 1 A NY
1 23 1 A NJ
2 23 1 B CA
3 23 2 A NY
4 23 2 A NJ
5 23 2 B CA
6 23 3 A NY
7 23 3 A NJ
8 23 3 B CA
@Timeless, undelete your code and I'll delete mine. You had a good start, and I edited your code to make it simpler.
A possible solution:
cols = ['utility', 'state']
d1 = df.drop_duplicates(cols)
d2 = df.drop_duplicates(['year', 'month'])
d2.assign(**{x: [d1[x].to_list()] * len(d2) for x in cols}).explode(cols)
Output:
year month utility state
0 23 1 A NY
0 23 1 A NJ
0 23 1 B CA
3 23 2 A NY
3 23 2 A NJ
3 23 2 B CA
4 23 3 A NY
4 23 3 A NJ
4 23 3 B CA
I was wondering whether a solution using numpy broadcasting would be possible, and it is:
cols1, cols2 = ['year', 'month'], ['utility', 'state']
(pd.DataFrame(
    np.vstack(
        np.concatenate(
            np.broadcast_arrays(
                df[cols1].drop_duplicates(cols1).values[:, None],
                df[cols2].drop_duplicates(cols2).values),
            axis=2)),
    columns=df.columns))
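For reference, a plain-pandas sketch of the same cartesian product, assuming pandas >= 1.2 (which added how='cross' to merge): take the unique (year, month) pairs and the unique (utility, state) pairs and cross-merge them.
ym = df[['year', 'month']].drop_duplicates()
us = df[['utility', 'state']].drop_duplicates()
out = ym.merge(us, how='cross')
Since merge preserves the row order of both inputs, the result already matches the ordering in the desired output.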

Check if answer in next row is different based on a condition python?

I have a dataframe like this:
country question year value
1 a1 2017 Y
1 a1 2018 Y
1 a1 2019 N
1 a2 2017 N
1 a2 2018 N
1 a2 2019 Y
2 a1 2017 Y
2 a1 2018 Nan
2 a1 2019 Y
2 a2 2017 Y
2 a2 2018 N
2 a2 2019 Y
3 a1 2017 Y
3 a1 2018 N
3 a1 2019 Y
3 a2 2017 Y
3 a2 2018 Y
3 a2 2019 Y
I want to find the rows where the value does not match the value from the previous year. I have tried using shift but it doesn't give me what I want.
This is how far I've gotten:
country = all_data['country']
question = all_data['question']
value = all_data['value']
for i in range(len(country) - 1):
    if country[i] == country[i + 1] and question[i] == question[i + 1]:
        ...
Thank you for any suggestions!!
Do a partial-key self-join (merge()).
Filter down to rows where the year differs by exactly 1 and the value is different.
Left-join that back to the full dataframe to construct the 'changed' column.
import io
import pandas as pd

df = pd.read_csv(io.StringIO(""" id question year value
1 a1 2017 Y
1 a1 2018 Y
1 a1 2019 N
1 a2 2017 N
1 a2 2018 N
1 a2 2019 Y
2 a1 2017 Y
2 a1 2018 Nan
2 a1 2019 Y
2 a2 2017 Y
2 a2 2018 N
2 a2 2019 Y
3 a1 2017 Y
3 a1 2018 N
3 a1 2019 Y
3 a2 2017 Y
3 a2 2018 Y
3 a2 2019 N"""), sep=r"\s+")
# do a "self-join" to find years where values are different to previous year
dfdiff = (df.merge(df, on=["id", "question"], suffixes=("", "_pre"))
            .loc[lambda dfa: dfa.year.eq(dfa.year_pre + 1)
                             & dfa.value.ne(dfa.value_pre)])
# from knowing id, q, year where different construct another column "changed"
(df.merge(dfdiff.loc[:, ["id", "question", "year"]].assign(changed=True),
          on=["id", "question", "year"], how="left")
   .assign(changed=lambda dfa: dfa.changed.fillna(False)))
id question year value changed
0 1 a1 2017 Y False
1 1 a1 2018 Y False
2 1 a1 2019 N True
3 1 a2 2017 N False
4 1 a2 2018 N False
5 1 a2 2019 Y True
6 2 a1 2017 Y False
7 2 a1 2018 Nan True
8 2 a1 2019 Y True
9 2 a2 2017 Y False
10 2 a2 2018 N True
11 2 a2 2019 Y True
12 3 a1 2017 Y False
13 3 a1 2018 N True
14 3 a1 2019 Y True
15 3 a2 2017 Y False
16 3 a2 2018 Y False
17 3 a2 2019 N True
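A shorter sketch of the same idea, assuming the frame is already sorted by id, question and year as above, is to compare each value with the previous row inside each (id, question) group via groupby + shift:
prev = df.groupby(["id", "question"])["value"].shift()
df["changed"] = prev.notna() & df["value"].ne(prev)
The prev.notna() term keeps the first year of each group at False, matching the merge-based result.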

Pandas dataframe groupby multiple years rolling stat

I have a pandas dataframe for which I'm trying to compute an expanding windowed aggregation after grouping by columns. The data structure is something like this:
df = pd.DataFrame([['A',1,2015,4],['A',1,2016,5],['A',1,2017,6],
                   ['B',1,2015,10],['B',1,2016,11],['B',1,2017,12],
                   ['A',1,2015,24],['A',1,2016,25],['A',1,2017,26],
                   ['B',1,2015,30],['B',1,2016,31],['B',1,2017,32],
                   ['A',2,2015,4],['A',2,2016,5],['A',2,2017,6],
                   ['B',2,2015,10],['B',2,2016,11],['B',2,2017,12]],
                  columns=['Typ','ID','Year','dat']).sort_values(by=['Typ','ID','Year'])
i.e.
Typ ID Year dat
0 A 1 2015 4
6 A 1 2015 24
1 A 1 2016 5
7 A 1 2016 25
2 A 1 2017 6
8 A 1 2017 26
12 A 2 2015 4
13 A 2 2016 5
14 A 2 2017 6
3 B 1 2015 10
9 B 1 2015 30
4 B 1 2016 11
10 B 1 2016 31
5 B 1 2017 12
11 B 1 2017 32
15 B 2 2015 10
16 B 2 2016 11
17 B 2 2017 12
In general, there is a completely varying number of years per Type-ID and rows per Type-ID-Year. I need to group this dataframe by the columns Type and ID, then compute an expanding windowed median & std of all observations by Year. I would like to get output results like this:
Typ ID Year median std
0 A 1 2015 14.0 14.14
1 A 1 2016 14.5 11.56
2 A 1 2017 15.0 10.99
3 A 2 2015 4.0 0
4 A 2 2016 4.5 0
5 A 2 2017 5.0 0
6 B 1 2015 20.0 14.14
7 B 1 2016 20.5 11.56
8 B 1 2017 21.0 10.99
9 B 2 2015 10.0 0
10 B 2 2016 10.5 0
11 B 2 2017 11.0 0
Hence, I want something like a groupby on ['Typ','ID','Year'], with the median & std for each Typ-ID-Year computed over all data with the same Typ-ID, cumulatively up to and including that Year.
How can I do this without manual iteration?
There's been no activity on this question, so I'll post the solution I found.
mn = df.groupby(by=['Typ','ID']).dat.expanding().median().reset_index().set_index('level_2')
mylast = lambda x: x.iloc[-1]
mn = mn.join(df['Year'])
mn = mn.groupby(by=['Typ','ID','Year']).agg(mylast).reset_index()
My solution follows this algorithm:
group the data, compute the windowed median, and get the original index back
with the original index back, get the year back from the original dataframe
group by the grouping columns, taking the last (in order) value for each
This gives the output desired. The same process can be followed for the standard deviation (or any other statistic desired).
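Following the same algorithm, a sketch that computes both statistics in a single pass (the 'level_2' label assumes the default name reset_index gives the within-group row index):
mylast = lambda x: x.iloc[-1]
stats = (df.groupby(['Typ', 'ID']).dat
           .expanding().agg(['median', 'std'])
           .reset_index().set_index('level_2')
           .join(df['Year']))
out = stats.groupby(['Typ', 'ID', 'Year']).agg(mylast).reset_index()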

Calculating distance to a row with a certain value

I am working with pandas on data in which maintenance work is done at each location. The maintenance is done every four years at each site. I want to find the years since the last maintenance action at each site. I am giving only two sites in the following example, but the original dataset has thousands of them. My data only covers the years 2014 through 2017.
Action = 0 means no action was performed that year; Action = 1 means some action was done. Measurement is a performance reading related to the effect of the action. The action can happen in any year. I know that if the action was performed in year Y, the previous maintenance was performed in year Y-4.
Site Year Action Measurement
A 2014 0 100
A 2015 0 150
A 2016 1 300
A 2017 0 80
B 2014 0 200
B 2015 1 250
B 2016 0 60
B 2017 0 110
Given this dataset; first, I want to have a temporary dataset like this:
Site Year Action Measurement Years_Since_Last_Action
A 2014 0 100 2
A 2015 0 150 3
A 2016 1 300 4
A 2017 0 80 1
B 2014 0 200 3
B 2015 1 250 4
B 2016 0 60 1
B 2017 0 110 2
Then, I want to have:
Years_Since_Last_Action Mean_Measurement
1 70
2 105
3 175
4 275
Thanks in advance!
For your first question:
s = df.loc[df.Action == 1, ['Site', 'Year']].set_index('Site')  # the action year for each site
df['Newyear'] = df.Site.map(s.Year)  # map that action year back onto every row
s1 = df.Year - df.Newyear
df['action since last year'] = np.where(s1 <= 0, s1 + 4, s1)  # wrap the 4-year cycle with np.where
df
Out[167]:
Site Year Action Measurement Newyear action since last year
0 A 2014 0 100 2016 2
1 A 2015 0 150 2016 3
2 A 2016 1 300 2016 4
3 A 2017 0 80 2016 1
4 B 2014 0 200 2015 3
5 B 2015 1 250 2015 4
6 B 2016 0 60 2015 1
7 B 2017 0 110 2015 2
For the second question:
df.groupby('action since last year').Measurement.mean()
Out[168]:
action since last year
1 70
2 105
3 175
4 275
Name: Measurement, dtype: int64
First, build the intermediate column using groupby, ffill/bfill and a little arithmetic.
v = (df.Year
       .where(df.Action.astype(bool))
       .groupby(df.Site)
       .ffill()
       .bfill()
       .sub(df.Year))
df['Years_Since_Last_Action'] = np.select([v > 0, v < 0], [4 - v, v.abs()], default=4)
df
Site Year Action Measurement Years_Since_Last_Action
0 A 2014 0 100 2.0
1 A 2015 0 150 3.0
2 A 2016 1 300 4.0
3 A 2017 0 80 1.0
4 B 2014 0 200 3.0
5 B 2015 1 250 4.0
6 B 2016 0 60 1.0
7 B 2017 0 110 2.0
Next,
df.groupby('Years_Since_Last_Action', as_index=False).Measurement.mean()
Years_Since_Last_Action Measurement
0 1.0 70
1 2.0 105
2 3.0 175
3 4.0 275
How about:
delta_year = df.loc[df.groupby("Site")["Action"].transform("idxmax"), "Year"].values
years_since = ((df.Year - delta_year) % 4).replace(0, 4)
df["Years_Since_Last_Action"] = years_since
out = df.groupby("Years_Since_Last_Action")["Measurement"].mean().reset_index()
out = out.rename(columns={"Measurement": "Mean_Measurement"})
which gives me
In [230]: df
Out[230]:
Site Year Action Measurement Years_Since_Last_Action
0 A 2014 0 100 2
1 A 2015 0 150 3
2 A 2016 1 300 4
3 A 2017 0 80 1
4 B 2014 0 200 3
5 B 2015 1 250 4
6 B 2016 0 60 1
7 B 2017 0 110 2
In [231]: out
Out[231]:
Years_Since_Last_Action Mean_Measurement
0 1 70
1 2 105
2 3 175
3 4 275
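This works because Action is 0/1, so transform("idxmax") gives every row the index of the (single) action row for its Site, and delta_year is that site's action year. For site A (action year 2016): (2014 - 2016) % 4 = 2, (2015 - 2016) % 4 = 3, (2016 - 2016) % 4 = 0 which replace(0, 4) turns into 4, and (2017 - 2016) % 4 = 1, matching the desired column.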

Why am I not able to drop values within columns on pandas using python3?

I have a DataFrame (df) with various columns. In this assignment I have to find the difference between summer gold medals and winter gold medals, relative to total medals, for each country using stats about the olympics.
I must only include countries which have at least one gold medal. I am trying to use dropna() to exclude the countries that do not have at least one gold medal. My current code:
def answer_three():
    df['medal_count'] = df['Gold'] - df['Gold.1']
    df['medal_count'].dropna()
    df['medal_dif'] = df['medal_count'] / df['Gold.2']
    df['medal_dif'].dropna()
    return df.head()
print (answer_three())
This results in the following output:
# Summer Gold Silver Bronze Total # Winter Gold.1 \
Afghanistan 13 0 0 2 2 0 0
Algeria 12 5 2 8 15 3 0
Argentina 23 18 24 28 70 18 0
Armenia 5 1 2 9 12 6 0
Australasia 2 3 4 5 12 0 0
Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 \
Afghanistan 0 0 0 13 0 0 2
Algeria 0 0 0 15 5 2 8
Argentina 0 0 0 41 18 24 28
Armenia 0 0 0 11 1 2 9
Australasia 0 0 0 2 3 4 5
Combined total ID medal_count medal_dif
Afghanistan 2 AFG 0 NaN
Algeria 15 ALG 5 1.0
Argentina 70 ARG 18 1.0
Armenia 12 ARM 1 1.0
Australasia 12 ANZ 3 1.0
I need to get rid of both the '0' values in "medal_count" and the NaN in "medal_dif".
I am also aware the maths/way I have written the code is probably incorrect to solve the question, but I think I need to start by dropping these values? Any help with any of the above is greatly appreciated.
You need to pass an axis, e.g. axis=1, into the drop function.
An axis of 0 means rows and 1 means columns; 0 is the default.
As you can see, the entire column is dropped for axis=1.
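For what it's worth, dropna() also returns a new object rather than modifying the frame in place, so calling it without assigning the result back has no effect. A rough sketch of filtering to countries with at least one gold medal (column meanings assumed from the printed output above: 'Gold' = summer golds, 'Gold.1' = winter golds, 'Gold.2' = total golds; adjust the mask to your assignment's exact requirement):
def answer_three():
    # keep only countries that won at least one gold medal anywhere
    only_gold = df[(df['Gold'] > 0) | (df['Gold.1'] > 0)].copy()
    # difference between summer and winter golds, relative to total golds
    only_gold['medal_dif'] = (only_gold['Gold'] - only_gold['Gold.1']) / only_gold['Gold.2']
    return only_gold['medal_dif']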
