Replace outliers with groupby average in multi-index dataframe - python

I have the following multi-index data frame, where ID and Year are part of the multi-index. Some numbers for the variable ROA are unreasonable, so I want to replace every ROA value that is larger than the 99th percentile of ROA in the overall data frame with the average of its company (and the same for everything smaller than the 1st percentile).
ID Year ROA
1 2016 1.5
1 2017 0.8
1 2018 NaN
2 2016 0.7
2 2017 0.8
2 2018 0.4
In a different thread I found the following approach (Replace values based on multiple conditions with groupby mean in Pandas):
mask = ((df['ROA'] > df['ROA'].quantile(0.99)) | (df['ROA'] < df['ROA'].quantile(0.01)))
df['ROA'] = np.where(~mask, df['ROA'], df.groupby('ID')['ROA'].transform('mean'))
However, this does not work for me. The maximum and minimum values of my data frame do not change. Does someone have an idea why this could be?
EDIT:
Alternatively, I thought of this function:
df_outliers = df[(df['ROA'] < df['ROA'].quantile(0.01)) |
                 (df['ROA'] > df['ROA'].quantile(0.99))]

for i in df_outliers.index:
    df.loc[(df.index.get_level_values('ID') == float(i[0])) &
           (df.index.get_level_values('Year') == float(i[1])), 'ROA'] = \
        float(df.query('ID == {} and Year != {}'.format(i[0], i[1])).ROA.mean())
However, here I run into the problem that some companies appear several times in df_outliers.index because their ROA is an outlier in several years. This defeats the purpose of the function: as currently written, it only excludes one outlier year from the calculation of the mean, not all of them.
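A possible way around this (a sketch, assuming ID and Year are index levels as shown above; not a verified fix): build the outlier mask once, blank out the flagged values, and compute each company's mean from the remaining values, so that every outlier year is excluded at once.
# Flag values outside the 1st-99th percentile range of the whole frame
mask = (df['ROA'] > df['ROA'].quantile(0.99)) | (df['ROA'] < df['ROA'].quantile(0.01))
# Per-company mean computed only from the non-outlier ROA values
clean_mean = df['ROA'].mask(mask).groupby(level='ID').transform('mean')
# Replace only the flagged values; everything else keeps its original ROA
df['ROA'] = df['ROA'].where(~mask, clean_mean)
Because the mean is taken from the masked series, a company with outliers in several years excludes all of those years from its replacement value.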

Related

Creating a new column based on entries from another column in Python

I'm new in Python and hope you guys can help me with the following:
I have a data frame that contains the daily demand of a certain product. However, the demand is shown cumulative over time. I want to create a column that shows the actual daily demand (see table below).
Current Data frame:
Day#   Cumulative Demand
1      10
2      15
3      38
4      44
5      53
What I want to achieve:
Day#   Cumulative Demand   Daily Demand
1      10                  10
2      15                  5
3      38                  23
4      44                  6
5      53                  9
Thank you!
Firstly, we need the values from the old column:
# My dataframe is called df
demand = df["Cumulative Demand"].tolist()
Then recalculate the data:
daily_demand = [demand[0]]
for i, d in enumerate(demand[1:]):
    daily_demand.append(d - demand[i])
Lastly, assign the data to a new column:
df["Daily Demand"] = daily_demand
Assuming what you shared above is representative of your actual data (one row per day, with the Day column sorted in ascending order), you can use shift() (please read what it does) and subtract the shifted cumulative demand from the cumulative demand itself. This gives you the actual daily demand.
To make sure it works, check that the cumulative sum of the new Daily Demand column, computed with cumsum(), matches the original Cumulative Demand.
import pandas as pd
# Calculate your Daily Demand column
df['Daily Demand'] = (df['Cumulative Demand'] - df['Cumulative Demand'].shift()).fillna(df['Cumulative Demand'][0])
# Check whether the cumulative sum of daily demands sum up to the Cumulative Demand
>>> all(df['Daily Demand'].cumsum() == df['Cumulative Demand'])
True
Will print back:
Day Cumulative Demand Daily Demand
0 1 10 10.0
1 2 15 5.0
2 3 38 23.0
3 4 44 6.0
4 5 53 9.0

Using groupby calculations in Pandas data frames

I am working on a geospatial project where I need to do some calculations between groups of data within a data frame. The data spans several years and is specific to the Local Authority District (LAD) code; each year has a numerical ID.
I need to be able to calculate the mean of a group of years within that data set, relative to the LAD code.
LAC LAN JAN FEB MAR APR MAY JUN ID
K04000001 ENGLAND AND WALES 56597 43555 49641 88049 52315 42577 5
E92000001 ENGLAND 53045 40806 46508 83504 49413 39885 5
I can use groupby to calculate the mean based on a LAC, but what I can't do is calculate the mean grouped by LAC for IDs 1:3, for example.
Which is more efficient: separating into separate dataframes stored in a dict, for example, or keeping everything in one dataframe and using an ID?
df.groupby('LAC').mean()
I come from a MATLAB background, so I'm just getting the hang of the best way to do things.
Secondly, once these operations are complete, I would like to do the following:
(mean of IDs 1:5 - mean of ID 6), using LAC as the key.
Sorry if I haven't explained this very well!
Edit: Expected output.
To be able to average a group of rows by specific ID for a given value of LAC.
For example:
Average monthly values for E92000001 rows with ID 3
LAC JAN FEB MAR APR MAY JUN ID
K04000001 56706 43653 49723 88153 52374 42624 5
K04000001 56597 43555 49641 88049 52315 42577 5
E92000001 49186 36947 42649 79645 45554 36026 5
E92000001 53045 40806 46508 83504 49413 39885 3
E92000001 68715 56476 62178 99174 65083 55555 4
E92000001 41075 28836 34538 71534 37443 27915 3
E92000001 54595 42356 48058 85054 50963 41435 1
Rows to be averaged:
E92000001 53045 40806 46508 83504 49413 39885 3
E92000001 41075 28836 34538 71534 37443 27915 3
Result
E92000001 47060 34821 40523 77519 43428 33900 3
edit: corrected error.
To match the update in your question. This will give you a dataframe with only one row for each ID-LAC combination, with the average of all the rows that had that index.
df.groupby(['ID', 'LAC']).mean()
I would start by setting the ID and LAC as the index:
df = df.set_index(['ID', 'LAC']).sort_index()
Now you can group by the LAC level of the index and get the mean for every month, or even each row's expanding average since the first year (on the numeric columns):
expanding_mean = (df.select_dtypes('number').groupby(level='LAC').cumsum()
                    .div(df.groupby(level='LAC').cumcount() + 1, axis=0))
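For the second part of the question, a possible sketch (my reading of "(mean of IDs 1:5 - mean of ID 6)", not part of the original answer), using the MultiIndex set above:
idx_id = df.index.get_level_values('ID')
# Mean per LAC over IDs 1 to 5, and over ID 6 alone
mean_1_to_5 = df[idx_id.isin(range(1, 6))].groupby(level='LAC').mean(numeric_only=True)
mean_6 = df[idx_id == 6].groupby(level='LAC').mean(numeric_only=True)
# (mean of IDs 1:5) - (mean of ID 6), keyed by LAC
diff = mean_1_to_5 - mean_6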

Rolling mean with varying window length in Python

I am working with NLSY79 data and I am trying to construct a 'smoothed' income variable that averages over a period of 4 years. Between 1979 and 1994, the NLSY conducted surveys annually, while after 1996 the survey was conducted biennially. This means that my smoothed income variable will average four observations prior to 1994 and only two after 1996.
I would like my smoothed income variable to satisfy the following criteria:
1) It should be an average of 4 income observations from 1979 to 1994 and only 2 from 1996 onward
2) The window should START from a given observation rather than be centered at it. Therefore, my smoothed income variable should tell me the average income over the four years starting from that date
3) It should ignore NaNs
It should, therefore, look like the following (note that I only computed values for 'smoothed income' that could be computed with the data I have provided.)
id year income 'smoothed income'
1 1979 20,000 21,250
1 1980 22,000
1 1981 21,000
1 1982 22,000
...
1 2014 34,000 34,500
1 2016 35,000
2 1979 28,000 28,333
2 1980 NaN
2 1981 28,000
2 1982 29,000
I am relatively new to dataframe manipulation with pandas, so here is what I have tried:
smooth = DATA.groupby('id')['income'].rolling(window=4, min_periods=1).mean()
DATA['smoothIncome'] = smooth.reset_index(level=0, drop=True)
This code accounts for NaNs, but otherwise does not accomplish objectives 1) and 2).
Any help would be much appreciated.
OK, I've modified the code provided by ansev to make it work; filling in NaNs was causing the problems.
Here's the modified code:
df.set_index('year').groupby('id').income.apply(lambda x: x.reindex(range(x.index.min(),x.index.max()+1))
.rolling(4, min_periods = 1).mean().shift(-3)).reset_index()
The only problem I have now is that the mean is not calculated when there are fewer than 4 years remaining (e.g. from 2014 onward, because my data goes until 2016). Is there a way of shortening the window length after 2014?
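One possible way to get a forward-looking window that shrinks near the end of each series (a sketch, not a verified answer from the thread): reverse each reindexed series, take an ordinary trailing rolling mean with min_periods=1, and reverse back. This keeps the NaN handling of rolling().mean() and assumes the same id/year/income columns as above.
import numpy as np
import pandas as pd

# Toy data in the same shape as the question (values are made up)
df = pd.DataFrame({'id': [1, 1, 1, 1, 1],
                   'year': [2010, 2011, 2012, 2014, 2016],
                   'income': [20000, 22000, np.nan, 30000, 35000]})

smoothed = (df.set_index('year')
              .groupby('id')['income']
              .apply(lambda x: x.reindex(range(x.index.min(), x.index.max() + 1))[::-1]
                                .rolling(4, min_periods=1).mean()[::-1])
              .reset_index())
Reindexing over the full year range keeps the 4-row window equal to 4 calendar years, and because each series is reversed before rolling, the last few years simply average whatever observations remain.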

Dataframe of the Top X Values of the Top Y Days - Pandas Groupby

I have data about three variables where I want to find the largest X values of one variable on a per day basis. Previously I wrote some code to find the hour where the max value of the day occurred, but now I want to add some options to find more max hours per day.
I've been able to find the Top X values per day for all the days, but I've gotten stuck on narrowing it down to the Top X Values from the Top X Days. I've included pictures detailing what the end result would hopefully look like.
[Image: Data]
[Image: Identified Top 2 Hours]
Code
df = pd.DataFrame(
    {'ID': ['ID_1'] * 24,
     'Year': [2018] * 24,
     'Month': [6] * 24,
     'Day': [12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14,
             15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17],
     'Hour': [19, 20, 21, 22, 11, 12, 13, 19, 19, 20, 21, 22,
              18, 19, 20, 21, 19, 20, 21, 23, 19, 20, 21, 22],
     'var_1': [0.83, 0.97, 0.69, 0.73, 0.66, 0.68, 0.78, 0.82, 1.05, 1.05, 1.08, 0.88,
               0.96, 0.81, 0.71, 0.88, 1.08, 1.02, 0.88, 0.79, 0.91, 0.91, 0.80, 0.96],
     'var_2': [47.90, 42.85, 67.37, 57.18, 66.13, 59.96, 52.63, 54.75, 32.54, 36.58, 36.99, 37.23,
               46.94, 52.80, 68.79, 50.84, 37.79, 43.54, 48.04, 38.01, 42.22, 47.13, 50.96, 44.19],
     'var_3': [99.02, 98.10, 98.99, 99.12, 98.78, 98.90, 99.09, 99.20, 99.22, 99.11, 99.18, 99.24,
               99.00, 98.90, 98.87, 99.07, 99.06, 98.86, 98.92, 99.32, 98.93, 98.97, 98.99, 99.21]})
# Get the top 2 var2 values each day
top_two_var2_each_day = df.groupby(['ID', 'Year', 'Month', 'Day'])['var_2'].nlargest(2)
top_two_var2_each_day = top_two_var2_each_day.reset_index()
# set level_4 index to the current index
top_two_var2_each_day = top_two_var2_each_day.set_index('level_4')
# use the index from top_two_var2_each_day to get the rows from df with the values of the other variables when the top 2 values occurred
top_2_all_vars = df[df.index.isin(top_two_var2_each_day.index)]
[Image: End Goal Result]
I figure the best way would be to average the two hours to identify which days have the largest average, then go back into the top_2_all_vars dataframe and grab the rows where those Days occur. I am unsure how to proceed.
mean_day = top_2_all_vars.groupby(['ID', 'Year', 'Month', 'Day'],as_index=False)['var_2'].mean()
top_2_day = mean_day.nlargest(2, 'var_2')
[Image: Final Dataframe]
This is the result I am trying to find. A dataframe consisting of the Top 2 values of var_2 from each of the Top 2 days.
Code I previously used to find the single largest value of each day, but I don't know how I would make it work for more than a single max per day
# For each ID and Day, Find the Hour where the Max Amount of var_2 occurred and save the index location
df_idx = df.groupby(['ID', 'Year', 'Month', 'Day',])['var_2'].transform(max) == df['var_2']
# Now the hour has been found, store the rows in a new dataframe based on the saved index location
top_var2_hour_of_each_day = df[df_idx]
Using groupbys may not be the best way to go about it, but I am open to anything.
This is one approach:
If your data spans multiple months, it's a lot harder to deal with when the month and the day are in different columns. So first I made a new column called 'Date' which just combines the month and the day.
df['Date'] = df['Month'].astype('str')+"-"+df['Day'].astype('str')
Next we need the top two values of var_2 per day, and then average them. So we can create a really simple function to find exactly that.
def topTwoMean(series):
    top = series.sort_values(ascending=False).iloc[0]
    second = series.sort_values(ascending=False).iloc[1]
    return (top + second) / 2
We then use our function, sort by the average of var_2 to get the highest 2 days, then save the dates to a list.
maxDates = df.groupby('Date').agg({'var_2': [topTwoMean]})\
.sort_values(by = ('var_2', 'topTwoMean'), ascending = False)\
.reset_index()['Date']\
.head(2)\
.to_list()
Finally we filter by the dates chosen above, then find the highest two of var_2 on those days.
df[df['Date'].isin(maxDates)]\
.groupby('Date')\
.apply(lambda x: x.sort_values('var_2', ascending = False).head(2))\
.reset_index(drop = True)
ID Year Month Day Hour var_1 var_2 var_3 Date
0 ID_1 2018 6 12 21 0.69 67.37 98.99 6-12
1 ID_1 2018 6 12 22 0.73 57.18 99.12 6-12
2 ID_1 2018 6 13 11 0.66 66.13 98.78 6-13
3 ID_1 2018 6 13 12 0.68 59.96 98.90 6-13
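For comparison, a more compact variant of the same idea (just a sketch using the df built in the question, not part of the original answer): let nlargest do the per-day ranking directly, without the helper function or the Date column.
# Average of the top two var_2 values per day
day_means = (df.groupby(['ID', 'Year', 'Month', 'Day'])['var_2']
               .apply(lambda s: s.nlargest(2).mean()))

# The two days with the highest such average
top_days = day_means.nlargest(2).index

# Top two var_2 rows (all columns) within each of those days
result = (df.set_index(['ID', 'Year', 'Month', 'Day'])
            .loc[list(top_days)]
            .sort_values('var_2', ascending=False)
            .groupby(level=['ID', 'Year', 'Month', 'Day'])
            .head(2)
            .sort_index()
            .reset_index())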

Incremental spend in 6 weeks for two groups using pandas

I have Excel data with the following information:
df.head()
User_id Group Week Spend Purchases Group
170309867 Test 2014-10-13 794.66 2 Test-NonRed
57954586 Test 2014-10-13 55.99 1 Test-Red
118068583 Test 2014-10-13 40.87 1 Test-NonRed
516478393 Test 2014-10-13 17.5 2 Test-NonRed
457873235 Test 2014-10-13 20.44 1 Test-Red
From the above information, I need to calculate the incremental spend in the six weeks for the total Test group (test-Red and test-NonRed) vs. control. I need it in absolute ($) and % terms.
I have tried pandas as,
df2= df.groupby(by=['Group','Week']).sum().abs().groupby(level=[0]).cumsum()
And I have the following result,
df2.head()
And then I calculated the sum for each group as,
df2.groupby(by=['group2']).sum()
df2.head()
I would like to have the incremental spend as an absolute value, which I tried with abs(), and I also need it as an absolute percentage. Any help would be much appreciated.
The expected result is the incremental spend over the six weeks for the total Test group (Test-Red and Test-NonRed) vs. Control, in absolute spend and then as a percentage. Something like this:
Group incremental_spend incremental_%
Control 11450175 #%
test-NonRed 50288158 #%
test-Red 12043938 #%
So my real questions:
1. Is the above approach the right way to calculate the incremental spend on Spend for the column Group over the 6 weeks in column Week?
2. Also, I need all my results in absolute counts and absolute %.
I think there are several problems here which make your question difficult to understand.
Vocabulary
What you describe as "Incremental spend" is just the sum.
What you do in two steps is the sum of the cumulative sum .cumsum().sum(), which is not right.
Also I am not sure whether you need abs, which gives the absolute value (abs(-1) gives 1) and will thus only have an effect if there are negative values in your data.
Unfortunately the sample dataset is not large enough to draw a conclusion.
Dataset
Your dataset has two Group columns with identical names, which is error-prone.
Missing information
You want to get the final values (sums) as a ratio (%), but you do not indicate what the reference value for this ratio is.
Is it the sum of Spend for the control group ?
Potential solution
>>> df # Sample dataframe with one entry as 'Control' group
Out[]:
User_id Group Week Spend Purchases Group.1
0 170309867 Test 2014-10-13 794.66 2 Test-NonRed
1 57954586 Test 2014-10-13 55.99 1 Test-Red
2 118068583 Test 2014-10-13 40.87 1 Test-NonRed
3 516478393 Test 2014-10-13 17.50 2 Control
4 457873235 Test 2014-10-13 20.44 1 Test-Red
df2 = pd.DataFrame(df.groupby('Group.1').Spend.sum()) # Get 'Spend' sum for each group
>>> df2
Out[]:
Spend
Group.1
Control 17.50
Test-NonRed 835.53
Test-Red 76.43
control_spend_total = df2.loc['Control'].values # Get total spend for 'Control' group
>>> control_spend_total
Out[]: array([ 17.5])
df2['Spend_%'] = df2.Spend / control_spend_total * 100 # Add 'Spend_ratio' column
>>> df2
Out[]:
Spend Spend_%
Group.1
Control 17.50 100.000000
Test-NonRed 835.53 4774.457143
Test-Red 76.43 436.742857
Does it look like what you want?
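If "incremental" is meant as the difference relative to the Control group rather than a ratio, a possible extension of the same df2 (my assumption; the question does not define the reference explicitly):
# Difference vs. the Control group's total spend, in $ and as a % of Control spend
df2['Incremental_$'] = df2['Spend'] - control_spend_total
df2['Incremental_%'] = df2['Incremental_$'] / control_spend_total * 100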
