Using groupby calculations in Pandas data frames - python

I am working on a geospatial project where I need to do some calculations between groups of data within a data frame. The data spans several years and is specific to the Local Authority District (LAD) code; each year has a numerical ID.
I need to be able to calculate the mean of a group of years within that data set relative to the LAD code.
LAC LAN JAN FEB MAR APR MAY JUN ID
K04000001 ENGLAND AND WALES 56597 43555 49641 88049 52315 42577 5
E92000001 ENGLAND 53045 40806 46508 83504 49413 39885 5
I can use groupby to calculate the mean based on a LAC, but what I can't do is calculate the mean grouped by LAC for IDs 1:3, for example.
What is more efficient: separating into separate dataframes stored in a dict, for example, or keeping everything in one dataframe and using an ID?
df.groupby('LAC').mean()
I come from a MATLAB background, so I am just getting the hang of the best way to do things.
Secondly, once these operations are complete, I would like to do the following:
(mean of IDs 1:5) - (mean of ID 6), using LAC as the key.
Sorry if I haven't explained this very well!
Edit: Expected output.
To be able to average a group of rows with a specific ID for a given value of LAC.
For example:
Average the monthly values for E92000001 rows with ID 3:
LAC JAN FEB MAR APR MAY JUN ID
K04000001 56706 43653 49723 88153 52374 42624 5
K04000001 56597 43555 49641 88049 52315 42577 5
E92000001 49186 36947 42649 79645 45554 36026 5
E92000001 53045 40806 46508 83504 49413 39885 3
E92000001 68715 56476 62178 99174 65083 55555 4
E92000001 41075 28836 34538 71534 37443 27915 3
E92000001 54595 42356 48058 85054 50963 41435 1
Rows to be averaged:
E92000001 53045 40806 46508 83504 49413 39885 3
E92000001 41075 28836 34538 71534 37443 27915 3
Result
E92000001 47060 34821 40523 77519 43428 33900 3
edit: corrected error.

To match the update in your question: this will give you a dataframe with only one row for each ID-LAC combination, containing the average of all the rows that had that index.
df.groupby(['ID', 'LAC']).mean()
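For the specific calculations you describe, a minimal sketch (not part of the original answer, and assuming ID is a numeric column) is to filter the IDs first and then group:
# Mean per LAC over IDs 1-3 only; numeric_only drops text columns such as LAN
mean_ids_1_to_3 = df[df['ID'].between(1, 3)].groupby('LAC').mean(numeric_only=True)
# (mean of IDs 1-5) minus (mean of ID 6), aligned on LAC
mean_ids_1_to_5 = df[df['ID'].between(1, 5)].groupby('LAC').mean(numeric_only=True)
mean_id_6 = df[df['ID'] == 6].groupby('LAC').mean(numeric_only=True)
difference = mean_ids_1_to_5 - mean_id_6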
I would start by setting the ID and LAC as the index and sorting it (note that set_index(..., inplace=True) returns None, so the inplace calls cannot be chained):
df = df.set_index(['ID', 'LAC']).sort_index()
Now you can group by the LAC level of the index and get the mean for every month, or even each row's expanding average since the first year:
month_cols = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN']
expanding_mean = df.groupby(level='LAC')[month_cols].cumsum().div(df.groupby(level='LAC').cumcount() + 1, axis=0)

Related

Replace outliers with groupby average in multi-index dataframe

I have the following multi-index data frame, where ID and Year are part of the multi-index. Some numbers for the variable ROA are unreasonable, so I want to replace every ROA value that is larger than the 99th percentile of ROA in the overall data frame with the average of its company (and do the same for everything smaller than the 1st percentile).
ID Year ROA
1 2016 1.5
1 2017 0.8
1 2018 NaN
2 2016 0.7
2 2017 0.8
2 2018 0.4
In a different thread I found the following approach (Replace values based on multiple conditions with groupby mean in Pandas):
mask = ((df['ROA'] > df['ROA'].quantile(0.99)) | (df['ROA'] < df['ROA'].quantile(0.01)))
df['ROA'] = np.where(~mask, df['ROA'], df.groupby('ID')['ROA'].transform('mean'))
However, this does not work for me. The maximum and minimum values of my data frame do not change. Does someone have an idea why this could be?
EDIT:
Alternatively, I thought of this function:
df_outliers = df[(df['ROA'] < df['ROA'].quantile(0.01)) | (df['ROA'] > df['ROA'].quantile(0.99))]
for i in df_outliers.index:
    df.loc[(df.index.get_level_values('ID') == float(i[0])) &
           (df.index.get_level_values('Year') == float(i[1])), 'ROA'] = \
        float(df.query('ID == {} and Year != {}'.format(i[0], i[1])).ROA.mean())
However, here I run into the problem that some companies appear several times in df_outliers.index because their ROA is an outlier in several years. This defeats the purpose of the function: as it stands, it only excludes one year from the calculation of the mean, not several.
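A rough sketch of one possible way around this (not a posted answer, and assuming the frame is indexed by ID and Year): mask the outliers to NaN first, so the per-company replacement mean is computed only from non-outlier values and repeated outliers within one company do not contaminate it.
lo, hi = df['ROA'].quantile([0.01, 0.99])
mask = (df['ROA'] < lo) | (df['ROA'] > hi)
# Mean per company computed only from the non-outlier values
clean_mean = df['ROA'].mask(mask).groupby(level='ID').transform('mean')
df['ROA'] = df['ROA'].where(~mask, clean_mean)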

Creating a new column based on entries from another column in Python

I'm new to Python and hope you guys can help me with the following:
I have a data frame that contains the daily demand of a certain product. However, the demand is shown cumulative over time. I want to create a column that shows the actual daily demand (see table below).
Current Data frame:
Day#   Cumulative Demand
1      10
2      15
3      38
4      44
5      53
What I want to achieve:
Day#   Cumulative Demand   Daily Demand
1      10                  10
2      15                  5
3      38                  23
4      44                  6
5      53                  9
Thank you!
Firstly, we need the data from the old column:
# My Dataframe is called df
demand = df["Cumulative Demand"].tolist()
Then recalculate the data
daily_demand = [demand[0]]
for i, d in enumerate(demand[1:]):
    daily_demand.append(d - demand[i])
Lastly, append the data as a new column:
df["Daily Demand"] = daily_demand
Assuming what you shared above is representative of your actual data, meaning you have one row per day and the Day column is sorted in ascending order:
You can use shift() (please read what it does) and subtract the shifted version of the cumulative demand from the cumulative demand itself. This gives you back the actual daily demand.
To make sure that it works, check whether the cumulative sum of the new Daily Demand column reproduces the Cumulative Demand column, using cumsum().
import pandas as pd
# Calculate your Daily Demand column
df['Daily Demand'] = (df['Cumulative Demand'] - df['Cumulative Demand'].shift()).fillna(df['Cumulative Demand'][0])
# Check whether the cumulative sum of daily demands sum up to the Cumulative Demand
>>> all(df['Daily Demand'].cumsum() == df['Cumulative Demand'])
True
Will print back:
Day Cumulative Demand Daily Demand
0 1 10 10.0
1 2 15 5.0
2 3 38 23.0
3 4 44 6.0
4 5 53 9.0
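As an aside (not part of the answer above), diff() is shorthand for the same shifted subtraction:
# Equivalent: diff() subtracts the previous row; the first row's NaN is
# filled with the first cumulative value
df['Daily Demand'] = df['Cumulative Demand'].diff().fillna(df['Cumulative Demand'].iloc[0])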

How to handle Datetime data with Pandas when grouping by

I have a question. I am dealing with a Datetime DataFrame in Pandas. I want to perform a count on a particular column and group by the month.
For example:
df.groupby(df.index.month)["count_interest"].count()
Assuming that I am analyzing data starting from December 2019, I get a result like this:
date
1 246
2 360
3 27
12 170
In reality, December 2019 is supposed to come first. What can I do? When I plot the frame grouped by month, December 2019 shows up last, which is chronologically incorrect.
You can try reindex:
df.groupby(df.index.month)["count_interest"].count().reindex([12,1,2,3])
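If hard-coding the month order is undesirable, an alternative sketch (assuming the index is a DatetimeIndex) is to group by the full year-month period, which sorts chronologically on its own:
# Grouping by year-month keeps December 2019 ahead of January 2020
df.groupby(df.index.to_period('M'))["count_interest"].count()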

Rolling mean with varying window length in Python

I am working with NLSY79 data and I am trying to construct a 'smoothed' income variable that averages over a period of 4 years. Between 1979 and 1994, the NLSY conducted surveys annually, while after 1996 the survey was conducted biennially. This means that my smoothed income variable will average four observations prior to 1994 and only two after 1996.
I would like my smoothed income variable to satisfy the following criteria:
1) It should be an average of 4 income observations from 1979 to 1994 and only 2 from 1996 onward
2) The window should START from a given observation rather than be centered at it. Therefore, my smoothed income variable should tell me the average income over the four years starting from that date
3) It should ignore NaNs
It should, therefore, look like the following (note that I only computed values for 'smoothed income' that could be computed with the data I have provided.)
id year income 'smoothed income'
1 1979 20,000 21,250
1 1980 22,000
1 1981 21,000
1 1982 22,000
...
1 2014 34,000 34,500
1 2016 35,000
2 1979 28,000 28,333
2 1980 NaN
2 1981 28,000
2 1982 29,000
I am relatively new to dataframe manipulation with pandas, so here is what I have tried:
smooth = DATA.groupby('id')['income'].rolling(window=4, min_periods=1).mean()
DATA['smoothIncome'] = smooth.reset_index(level=0, drop=True)
This code accounts for NaNs, but otherwise does not accomplish objectives 1) and 2).
Any help would be much appreciated
OK, I've modified the code provided by ansev to make it work. Filling in NaNs was causing the problems.
Here's the modified code:
df.set_index('year').groupby('id').income.apply(
    lambda x: x.reindex(range(x.index.min(), x.index.max() + 1))
               .rolling(4, min_periods=1).mean().shift(-3)
).reset_index()
The only problem I have now is that the mean is not calculated when there are fewer than 4 years remaining (e.g. from 2014 onward, because my data goes until 2016). Is there a way of shortening the window length after 2014?
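For that follow-up, one possible direction (a sketch assuming pandas 1.0 or later, not part of the accepted fix) is a forward-looking window via FixedForwardWindowIndexer, which shrinks naturally near the end of each id's series instead of producing NaN:
import pandas as pd

# Forward window of up to 4 yearly observations; min_periods=1 lets the
# window shrink at the end of each id's series and skips NaN incomes
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=4)
smoothed = (df.set_index('year')
              .groupby('id')['income']
              .apply(lambda x: x.reindex(range(x.index.min(), x.index.max() + 1))
                                .rolling(indexer, min_periods=1).mean())
              .reset_index())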

Is there a way to sample on a 'type' column, whilst keeping all ID's within that type in another column?

I'm splitting a dataframe into two; one to get the average over a period of time, and the other to use that average on. The dataframe looks similar to the following:
ID Type Num. Hours Month
2 black 10 Jan
2 black 12 Feb
2 black 15 March
3 red 7 Jan
3 red 10 Feb
The IDs each have 24 rows, spanning over 2 years. Different IDs can have the same Type or different Types.
I'd like the two split dataframes to contain the same number of distinct Types, whilst keeping all 24 rows together for each unique ID.
I've tried grouping by Type and ID, together and separately, but it seems to give me only a fraction of each ID's rows instead of keeping them together:
df1 = df.groupby('ID')['Type'].apply(lambda x: x.sample(frac=0.5))
or
df1 = df.groupby(['ID', 'Type']).apply(lambda x: x.sample(frac=0.5))
and afterwards of course I would use that index to get the second split dataframe from the original.
Neither has worked the way I require.
For the output, it should be two dataframes which do not share any IDs and which have an equal number of different Types.
So using something similar to the above, I would hopefully output a DataFrame which looks like this:
ID Type Num. Hours Month
2 black 10 Jan
2 black 12 Feb
2 black 15 March
5 yellow 17 Jan
5 yellow 21 Feb
Using that table would allow me to index on the original dataframe and give me a second table which outputs something similar to the following:
ID Type Num. Hours Month
4 black 10 Jan
4 black 12 Feb
4 black 15 March
6 yellow 22 Jan
6 yellow 27 Feb
sample takes a fraction but does not split the dataframe in two. Having obtained half of the samples, taking the other half is simple!
I am assuming your original line works as you want it to work for the first dataframe
df1 = df.groupby(['ID', 'Type']).apply(lambda x: x.sample(frac=0.5))
df2 = df[~df.index.isin(df1.index)]
Update
Based on comments; to randomly divide your ID's over two dataframes you can use the following:
import random
unique_ids = df.ID.unique()
random.shuffle(unique_ids)
id_set_1 = unique_ids[: len(unique_ids) // 2] # take first half of list
df1 = df[df.ID.isin(id_set_1)]
df2 = df[~df.ID.isin(id_set_1)]  # the remaining IDs form the second half
Beware that this could lead to two dataframes with very different sizes, depending on the number of types per ID!
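A sketch of a stratified variant (not part of the answer above, and assuming each ID maps to a single Type, as in the question): sample half of the IDs within each Type, so the two halves end up with a similar mix of Types while every row of a sampled ID stays together.
# One row per (ID, Type) pair, then sample half of the IDs within each Type
pairs = df[['ID', 'Type']].drop_duplicates()
sampled_ids = pairs.groupby('Type')['ID'].apply(lambda s: s.sample(frac=0.5)).tolist()

df1 = df[df.ID.isin(sampled_ids)]
df2 = df[~df.ID.isin(sampled_ids)]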
