Calculate which percentile a value belongs to in a dataframe - Python

I have a dataframe with several categorical variables, a year, and an expenditure column. I would like to calculate the mean of the expenditure variable by group, and then find what percentile this mean corresponds to in the entire dataset of expenditures. I would like to know this percentile value for the mean of each subgroup.
import pandas as pd
import numpy as np
from scipy import stats
Create dataframe:
df = pd.DataFrame([])
df['year'] = np.random.randint(low=2008, high=2013, size=100000)
df['dkg'] = np.random.choice([0, 1], size=100000)
df['fkg'] = np.random.choice([0, 1], size=100000)
df['expenditure'] = np.random.choice([0, 25230], size=100000)
Calculate mean by group:
means = df.groupby(['fkg','dkg', 'year']).mean()
print(means)
Calculate corresponding percentile value for each subgroup:
percentile = stats.groupby(['fkg','dkg', 'year']).percentileofscore(df[['expenditure']], means)
This last line is giving me an error message. What am I doing wrong?

You should provide the expected output, but if I guess correctly, you might want:
from scipy import stats
# each group holds exactly one row, so x.iloc[0] is that group's mean;
# the score is taken against the distribution of all 20 group means
means['percentile'] = (means.groupby(['fkg', 'dkg', 'year'])['expenditure']
    .apply(lambda x: stats.percentileofscore(means['expenditure'], x.iloc[0]))
)
output:
               expenditure  percentile
fkg dkg year
0   0   2008  12625.289560        50.0
        2009  12565.753031        40.0
        2010  12715.557991        85.0
        2011  12687.687264        75.0
        2012  12859.975976       100.0
    1   2008  12682.890173        70.0
        2009  12532.481603        35.0
        2010  12637.576059        55.0
        2011  12707.997609        80.0
        2012  12600.228337        45.0
1   0   2008  12672.511397        65.0
        2009  12510.373223        25.0
        2010  12273.914379         5.0
        2011  12423.146356        10.0
        2012  12482.421178        20.0
    1   2008  12447.964962        15.0
        2009  12518.209719        30.0
        2010  12778.573674        95.0
        2011  12736.721412        90.0
        2012  12669.934679        60.0
Or maybe you want a rank per pair of variables (e.g. fkg/dkg):
means['percentile'] = means.groupby(['fkg','dkg'])['expenditure'].rank(pct=True).mul(100)
output:
               expenditure  percentile
fkg dkg year
0   0   2008  12625.289560        40.0
        2009  12565.753031        20.0
        2010  12715.557991        80.0
        2011  12687.687264        60.0
        2012  12859.975976       100.0
    1   2008  12682.890173        80.0
        2009  12532.481603        20.0
        2010  12637.576059        60.0
        2011  12707.997609       100.0
        2012  12600.228337        40.0
1   0   2008  12672.511397       100.0
        2009  12510.373223        80.0
        2010  12273.914379        20.0
        2011  12423.146356        40.0
        2012  12482.421178        60.0
    1   2008  12447.964962        20.0
        2009  12518.209719        40.0
        2010  12778.573674       100.0
        2011  12736.721412        80.0
        2012  12669.934679        60.0
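Note that both snippets above rank each group mean among the other group means. If, as the question literally asks, you want the percentile of each mean within the entire expenditure column, a minimal sketch reusing df and means from above (the column name percentile_in_data is just for illustration):
# percentile of each group mean within the full expenditure distribution
means['percentile_in_data'] = means['expenditure'].apply(
    lambda m: stats.percentileofscore(df['expenditure'], m))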

Related

How to create new rows with missing years and populate them with current rows

I have a dataset that looks like this:
import pandas as pd

overflow_data = {'state': ['CA', 'CA', 'HI', 'HI', 'HI', 'NY', 'NY'],
                 'year': [2010, 2013, 2010, 2012, 2016, 2009, 2013],
                 'value': [1, 3, 1, 2, 3, 2, 5]}
df = pd.DataFrame(overflow_data)
Starting DataFrame:
  state  year  value
0    CA  2010      1
1    CA  2013      3
2    HI  2010      1
3    HI  2012      2
4    HI  2016      3
5    NY  2009      2
6    NY  2013      5
I would like to fill in the missing years for each state, and use the prior year's values for those years, so the table would look like this:
Expected output:
  state  year  value
0    CA  2010      1
1    CA  2011      1
2    CA  2012      1
3    CA  2013      3
4    HI  2010      1
5    HI  2011      1
6    HI  2012      2
7    HI  2013      2
8    HI  2014      2
9    HI  2015      2
10   HI  2016      3
11   NY  2009      2
12   NY  2010      2
13   NY  2011      2
14   NY  2012      2
15   NY  2013      5
I think you are looking for pivot and fill:
(df.pivot(index='year', columns='state', values='value')  # print this line alone to see what it does
   .ffill().bfill()        # fill the missing data within each state column
   .unstack()              # transform back to long form
   .reset_index(name='value')
)
Output:
state year value
0 CA 2009 1.0
1 CA 2010 1.0
2 CA 2012 1.0
3 CA 2013 3.0
4 CA 2016 3.0
5 HI 2009 1.0
6 HI 2010 1.0
7 HI 2012 2.0
8 HI 2013 2.0
9 HI 2016 3.0
10 NY 2009 2.0
11 NY 2010 2.0
12 NY 2012 2.0
13 NY 2013 5.0
14 NY 2016 5.0
Note: I just realized that the above is slightly different from what you are asking for. It only fills values for the years that already appear somewhere in the data; it does not resample each state to a continuous range of years.
For what you ask, we can resort to reindex with groupby:
import numpy as np

(df.set_index('year').groupby('state')
   .apply(lambda x: x.reindex(np.arange(x.index.min(), x.index.max() + 1)).ffill())
   .reset_index('state', drop=True)
   .reset_index()
)
Output:
year state value
0 2010 CA 1.0
1 2011 CA 1.0
2 2012 CA 1.0
3 2013 CA 3.0
4 2010 HI 1.0
5 2011 HI 1.0
6 2012 HI 2.0
7 2013 HI 2.0
8 2014 HI 2.0
9 2015 HI 2.0
10 2016 HI 3.0
11 2009 NY 2.0
12 2010 NY 2.0
13 2011 NY 2.0
14 2012 NY 2.0
15 2013 NY 5.0
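The same idea can also be spelled out without apply; a sketch assuming the df above (the frames/out names are only for illustration):
frames = []
for state, grp in df.groupby('state'):
    # one row per year in this state's observed range
    years = pd.DataFrame({'year': np.arange(grp['year'].min(), grp['year'].max() + 1)})
    merged = years.merge(grp, on='year', how='left')
    merged['state'] = state              # restore the group key on the inserted rows
    frames.append(merged.ffill())        # carry the prior year's value forward
out = pd.concat(frames, ignore_index=True)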

pandas DataFrame .stack(dropna=False) but keeping existing combinations of levels

My data looks like this
import numpy as np
import pandas as pd
# My Data
enroll_year = np.arange(2010, 2015)
grad_year = enroll_year + 4
n_students = [[100, 100, 110, 110, np.nan]]
df = pd.DataFrame(
    n_students,
    columns=pd.MultiIndex.from_arrays(
        [enroll_year, grad_year],
        names=['enroll_year', 'grad_year']))
print(df)
# enroll_year 2010 2011 2012 2013 2014
# grad_year 2014 2015 2016 2017 2018
# 0 100 100 110 110 NaN
What I am trying to do is stack the data: one index level for the year of enrollment, one for the year of graduation, and a column for the number of students, which should look like
# enroll_year grad_year n
# 2010 2014 100.0
# . . .
# . . .
# . . .
# 2014 2018 NaN
The data produced by .stack() is very close, but the missing record is dropped:
df1 = df.stack(['enroll_year', 'grad_year'])
df1.index = df1.index.droplevel(0)
print(df1)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# dtype: float64
So .stack(dropna=False) is tried next, but it expands the index to all combinations of enrollment and graduation years:
df2 = df.stack(['enroll_year', 'grad_year'], dropna=False)
df2.index = df2.index.droplevel(0)
print(df2)
# enroll_year  grad_year
# 2010         2014         100.0
#              2015           NaN
#              2016           NaN
#              2017           NaN
#              2018           NaN
# 2011         2014           NaN
#              2015         100.0
#              2016           NaN
#              2017           NaN
#              2018           NaN
# 2012         2014           NaN
#              2015           NaN
#              2016         110.0
#              2017           NaN
#              2018           NaN
# 2013         2014           NaN
#              2015           NaN
#              2016           NaN
#              2017         110.0
#              2018           NaN
# 2014         2014           NaN
#              2015           NaN
#              2016           NaN
#              2017           NaN
#              2018           NaN
# dtype: float64
And I need to subset df2 to get my desired data set.
# note: MultiIndex.labels was renamed to MultiIndex.codes in newer pandas
existing_combn = list(zip(
    df.columns.levels[0][df.columns.codes[0]],
    df.columns.levels[1][df.columns.codes[1]]))
df3 = df2.loc[existing_combn]
print(df3)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# 2014 2018 NaN
# dtype: float64
Although this only adds a few extra lines to my code, I wonder if there is a better and neater approach.
Use unstack with pd.DataFrame, then reset_index, drop the unnecessary column, and rename the value column:
pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Or:
df.unstack().reset_index(level=2, drop=True)
enroll_year grad_year
2010 2014 100.0
2011 2015 100.0
2012 2016 110.0
2013 2017 110.0
2014 2018 NaN
dtype: float64
Or:
df.unstack().reset_index(level=2, drop=True).reset_index().rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Explanation:
print(pd.DataFrame(df.unstack()))
                             0
enroll_year grad_year
2010        2014      0  100.0
2011        2015      0  100.0
2012        2016      0  110.0
2013        2017      0  110.0
2014        2018      0    NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1))
enroll_year grad_year 0
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'}))
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
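For completeness: on pandas 2.1 and later, the rewritten stack may give the desired result directly, since the new implementation keeps NaNs and only emits the column combinations that actually exist. A hedged sketch, assuming such a version:
# pandas >= 2.1 only: the new stack keeps NaN and existing combinations
df.stack(['enroll_year', 'grad_year'], future_stack=True).droplevel(0)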

Want MultiIndex for rows and columns with read_csv

My .csv file looks like:
Area When Year Month Tickets
City Day 2015 1 14
City Night 2015 1 5
Rural Day 2015 1 18
Rural Night 2015 1 21
Suburbs Day 2015 1 15
Suburbs Night 2015 1 21
City Day 2015 2 13
containing 75 rows in total. I want both a row MultiIndex and a column MultiIndex that looks like:
Area        City        Rural        Suburbs
When         Day  Night   Day  Night     Day  Night
Year Month
2015 1       5.0    3.0  22.0   11.0    13.0    2.0
     2      22.0    8.0   4.0   16.0     6.0   18.0
     3      26.0   25.0  22.0   23.0    22.0    2.0
2016 1      20.0   25.0  39.0   14.0     3.0   10.0
     2       4.0   14.0  16.0   26.0     1.0   24.0
     3      22.0   17.0   7.0   24.0    12.0   20.0
I've read the .read_csv doc at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
I can get the row multiindex with:
df2 = pd.read_csv(r'c:\Data\Tickets.csv', index_col=[2, 3])
I've tried:
df2 = pd.read_csv(r'c:\Data\Tickets.csv', index_col=[2, 3], header=[1, 3, 5])
thinking [1, 3, 5] fetches 'City', 'Rural', and 'Suburbs'. How do I get the desired column multiindex shown above?
Seems like you need to pivot_table with multiple indexes and multiple columns.
Start by just reading your csv plainly:
df = pd.read_csv('Tickets.csv')
Then
# pass values as a scalar, not a list, to avoid an extra 'Tickets' column level
df.pivot_table(index=['Year', 'Month'], columns=['Area', 'When'], values='Tickets')
With the input data you provided, you'd get
Area        City        Rural        Suburbs
When         Day  Night   Day  Night     Day  Night
Year Month
2015 1      14.0    5.0  18.0   21.0    15.0   21.0
     2      13.0    NaN   NaN    NaN     NaN    NaN
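An alternative sketch without pivot_table, assuming each (Year, Month, Area, When) combination appears at most once in the file (unstack raises on duplicates, whereas pivot_table would silently average them):
out = (df.set_index(['Year', 'Month', 'Area', 'When'])['Tickets']
         .unstack(['Area', 'When']))  # move Area/When from the row index into columns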

Write column backwards with condition in Python

I have the following df and want to write the Number column backwards (from the bottom up), overwriting other values where necessary. The condition is to always use the previous value unless the new value's difference from the old value is greater than 10%.
Date Number
2019 150
2018 NaN
2017 118
2016 NaN
2015 115
2014 107
2013 105
2012 NaN
2011 100
Because of the condition, the value in e.g. 2013 is equal to 100: 105 is not smaller than 90 and not greater than 110. The result would look like this:
Date Number
2019 150
2018 115
2017 115
2016 115
2015 115
2014 100
2013 100
2012 100
2011 100
You can reverse your column and then apply a function to update values. Finally reverse the column to the original order:
def get_val(x):
    global prev_num
    if x and x > prev_num * 1.1:   # NaN fails the comparison, so prev_num is kept
        prev_num = x
    return prev_num

prev_num = 0
df['number'] = df['number'][::-1].apply(get_val)[::-1]
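As a side note on the snippet above, the global can be avoided by keeping the running value in a closure; a minimal sketch of the same logic (make_get_val is a hypothetical helper name):
def make_get_val():
    state = {'prev': 0}                            # running 'previous' value
    def get_val(x):
        if x == x and x > state['prev'] * 1.1:     # x == x is False for NaN
            state['prev'] = x
        return state['prev']
    return get_val

df['number'] = df['number'][::-1].apply(make_get_val())[::-1]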
Just group by the blocks where the floor division by 10 changes (diff not equal to zero, then cumsum), and transform with min, i.e.:
df['x'] = df.groupby((df['number'].bfill()[::-1]//10).diff().ne(0).cumsum())['number'].transform(min)
date number x
0 2019 150.0 150.0
1 2018 NaN 115.0
2 2017 118.0 115.0
3 2016 NaN 115.0
4 2015 115.0 115.0
5 2014 107.0 100.0
6 2013 105.0 100.0
7 2012 NaN 100.0
8 2011 100.0 100.0
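To see what the grouping key does, print the intermediate pieces (a small illustration with the same df):
rev = df['number'].bfill()[::-1]                    # fill NaNs from the next valid value, then reverse so years ascend
print((rev // 10).tolist())                         # decade buckets: [10.0, 10.0, 10.0, 10.0, 11.0, 11.0, 11.0, 11.0, 15.0]
print((rev // 10).diff().ne(0).cumsum().tolist())   # block labels:   [1, 1, 1, 1, 2, 2, 2, 2, 3]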
Here is one way. It assumes the first value (100, at the bottom) is not NaN and that the original dataframe is sorted descending by year. If performance is an issue, the loop can be replaced with an accumulating one-pass approach (e.g. itertools.accumulate).
lst = df.sort_values('date')['number'].ffill().tolist()
for i in range(1, len(lst)):
    if abs(lst[i] - lst[i - 1]) / lst[i] <= 0.10:
        lst[i] = lst[i - 1]
df['number'] = list(reversed(lst))
# date number
# 0 2019 150.0
# 1 2018 115.0
# 2 2017 115.0
# 3 2016 115.0
# 4 2015 115.0
# 5 2014 100.0
# 6 2013 100.0
# 7 2012 100.0
# 8 2011 100.0

Recognizing missing values and return a list with these values

I am quite new to coding. I recently did a short pandas course at work, and part of it was to think of a project where we could improve something. I want to be able to recognise the missing values in a table from a CSV or Excel file and then make a list of these missing values.
An example of imported CSV file:
Company 2016 2015 2014 2013 2012 2011 2010
AAPL US 31 NaN 21.0 3.0 NaN 80.0 7
MSFT US 72 8.0 67.0 NaN 93.0 30.0 37
SNAP US 51 NaN NaN 7.0 33.0 16.0 44
FB US 49 56.0 33.0 97.0 NaN NaN 98
Into:
AAPL US, 2015, 2012
MSFT US, 2013
SNAP US, 2015, 2014
FB US, 2012, 2011
I understand how to count them etc., but I want to see a finalized list of some sort.
Thanks!
First set_index, check NaNs with isnull, and finally generate the lists with apply plus boolean filtering:
df = (df.set_index('Company')
        .isnull()
        .apply(lambda x: x.index[x].tolist(), axis=1)
        .reset_index(name='val'))
print (df)
Company val
0 AAPL US [2015, 2012]
1 MSFT US [2013]
2 SNAP US [2015, 2014]
3 FB US [2012, 2011]
Or if you want strings:
import numpy as np

df = df.set_index('Company')
s = np.where(df.isnull(), ['{}, '.format(x) for x in df.columns], '')
df = pd.Series([''.join(x).strip(', ') for x in s], index=df.index).reset_index(name='val')
print (df)
Company val
0 AAPL US 2015, 2012
1 MSFT US 2013
2 SNAP US 2015, 2014
3 FB US 2012, 2011
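A third sketch (assuming the same df) that may read more plainly: melt to long form, keep only the NaN rows, and collect the years per company. The melted/out names are just for illustration:
melted = df.melt(id_vars='Company', var_name='year')
out = (melted[melted['value'].isna()]
       .groupby('Company')['year']
       .agg(lambda y: ', '.join(map(str, y))))
print(out)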
