Find a quantile value and then add it back to the dataframe - python

I have one dataframe that looks like this:
ID | Name | pay
1 John 10
1 John 20
1 John 30
I want to find the 50th percentile of pay and then add that value back to the original df.
df = np.round(users.groupby('ID')['pay'].quantile(0.50), 3).reset_index()
ID | Name | pay | 50th percentile
1 John 10 20
1 John 20 20
1 John 30 20
I've seen where you can use transform and pass it a function, but I haven't seen an example using the quantile function, so I'm not sure how to pass the quantile function to transform.
I thought it would be something like this: df = df.groupby('ID')['pay'].transform(quantile(0.50))

You can pass positional arguments through groupby.transform to the function it calls:
import pandas as pd
import io
t = '''
ID Name pay
1 John 10
1 John 20
1 John 30
2 Doris 40
2 Doris 100'''
df = pd.read_csv(io.StringIO(t), sep=r'\s+')
df['median'] = df.groupby('ID').pay.transform('quantile', .5)
df
Output
ID Name pay median
0 1 John 10 20.0
1 1 John 20 20.0
2 1 John 30 20.0
3 2 Doris 40 70.0
4 2 Doris 100 70.0
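For comparison, the lambda form that the question was reaching for gives the same result. The sketch below rebuilds the sample frame inline (an assumption, so that the snippet runs on its own) and checks that the string form with a keyword argument and the lambda form agree:

```python
import pandas as pd

# Sample data rebuilt inline so the snippet is self-contained
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "Name": ["John", "John", "John", "Doris", "Doris"],
    "pay": [10, 20, 30, 40, 100],
})

grouped = df.groupby("ID")["pay"]

# String form: extra arguments are forwarded to the groupby quantile method
by_keyword = grouped.transform("quantile", q=0.5)

# Lambda form: each group's pay values arrive as a Series
by_lambda = grouped.transform(lambda s: s.quantile(0.5))

# Both are aligned to the original rows, so either can be assigned directly
df["median"] = by_lambda
print(df)
```

The string form is usually the faster of the two on large frames, since it avoids calling a Python function once per group.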

Related

Pandas comparing dataframes and changing column value based on number of similar rows in another dataframe

Suppose I have two dataframes:
df1:
Person Number Type
0 Kyle 12 Male
1 Jacob 15 Male
2 Jacob 15 Male
df2:
A much larger dataset with similar format except there is a count column that needs to increment based on df1
Person Number Type Count
0 Kyle 12 Male 0
1 Jacob 15 Male 0
3 Sally 43 Female 0
4 Mary 15 Female 5
What I am looking to do is increase the count column based on the number of occurrences of the same person in df1
Expected output for this example:
Person Number Type Count
0 Kyle 12 Male 1
1 Jacob 15 Male 2
3 Sally 43 Female 0
4 Mary 15 Female 5
Increase the count to 1 for Kyle because there is one instance, and to 2 for Jacob because there are two instances. Leave the values for Sally and Mary unchanged.
How do I do this? I have tried using .loc but I can't figure out how to account for two instances of the same row. Meaning that I can only get count to increase by one for Jacob even though there are two Jacobs in df1.
I have tried
df2.loc[df2['Person'].values == df1['Person'].values, 'Count'] += 1
However this does not account for duplicates.
df1 = df1.groupby(df1.columns.tolist()).size().to_frame('Count').reset_index()
df1 = df1.set_index(['Person','Number','Type'])
df2 = df2.set_index(['Person','Number','Type'])
df1.add(df2, fill_value=0).reset_index()
Or
df1 = df1.groupby(df1.columns.tolist()).size().to_frame('Count').reset_index()
df2.merge(df1, on=['Person','Number','Type'], how='left').set_index(['Person','Number','Type']).sum(axis=1).to_frame('Count').reset_index()
value_counts + Index alignment.
u = df2.set_index("Person")
u.assign(Count=df1["Person"].value_counts().add(u["Count"], fill_value=0))
Number Type Count
Person
Kyle 12 Male 1.0
Jacob 15 Male 2.0
Sally 43 Female 0.0
Mary 15 Female 5.0
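If the float Count column produced by the alignment is unwanted, one possible variation is to map the occurrence counts onto df2 and keep integers. This is only a sketch: the two frames are rebuilt inline, and, like the value_counts answer, it matches on Person alone rather than on all three columns.

```python
import pandas as pd

# Frames rebuilt from the question so the snippet is self-contained
df1 = pd.DataFrame({
    "Person": ["Kyle", "Jacob", "Jacob"],
    "Number": [12, 15, 15],
    "Type": ["Male", "Male", "Male"],
})
df2 = pd.DataFrame({
    "Person": ["Kyle", "Jacob", "Sally", "Mary"],
    "Number": [12, 15, 43, 15],
    "Type": ["Male", "Male", "Female", "Female"],
    "Count": [0, 0, 0, 5],
})

# How often each person occurs in df1 (a Series indexed by Person)
occurrences = df1["Person"].value_counts()

# Look up each df2 person; people absent from df1 get 0
increment = df2["Person"].map(occurrences).fillna(0).astype(int)

result = df2.copy()
result["Count"] = result["Count"] + increment
print(result)
```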

Pandas: Group by two parameters and sort by third parameter

I want to group my dataframe by two columns (Name and Budget) and then sort the aggregated results by a third parameter (Prio).
Name Budget Prio Quantity
peter A 2 12
B 1 123
joe A 3 34
B 1 51
C 2 43
I already checked this post, which was very helpful and leads to the following output. However, I cannot manage sorting by the third parameter (Prio).
df_agg = df.groupby(['Name','Budget','Prio']).agg({'Quantity':sum})
g = df_agg['Quantity'].groupby(level=0, group_keys=False)
res = g.apply(lambda x: x.sort_values(ascending=True))
I would now like to sort the prio in ascending order within each of the groups. To get something like:
Name Budget Prio Quantity
peter B 1 123
A 2 12
joe B 1 51
C 2 43
A 3 34
IIUC,
df.groupby(['Name','Budget','Prio']).agg({'Quantity':sum}).sort_values(['Name','Prio'])
Output:
Quantity
Name Budget Prio
joe B 1 51
C 2 43
A 3 34
peter B 1 123
A 2 12
If you want to sort on the index levels only (here Name descending, then Prio ascending), you can use sort_index:
(df.groupby(['Name','Budget','Prio'])
   .agg({'Quantity':'sum'})
   .sort_index(level=['Name', 'Prio'],
               ascending=[False, True])
)
Output:
Quantity
Name Budget Prio
peter B 1 123
A 2 12
joe B 1 51
C 2 43
A 3 34
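Another way to get the same ordering, sketched here on the question's data rebuilt as a flat frame (an assumption about how the original df looked before grouping), is to move the index levels back into columns, sort them as ordinary columns, and restore the index afterwards:

```python
import pandas as pd

# Question data as a flat frame, one row per (Name, Budget, Prio)
df = pd.DataFrame({
    "Name": ["peter", "peter", "joe", "joe", "joe"],
    "Budget": ["A", "B", "A", "B", "C"],
    "Prio": [2, 1, 3, 1, 2],
    "Quantity": [12, 123, 34, 51, 43],
})

agg = df.groupby(["Name", "Budget", "Prio"]).agg({"Quantity": "sum"})

# Sort as plain columns, then rebuild the three-level index
res = (
    agg.reset_index()
       .sort_values(["Name", "Prio"])
       .set_index(["Name", "Budget", "Prio"])
)
print(res)
```

Passing ascending=[False, True] to sort_values would put peter before joe, as in the desired output of the question.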

Get order of subgroups in pandas dataframe

I have a pandas dataframe that looks something like this:
df = pd.DataFrame({'Name' : ['Kate', 'John', 'Peter','Kate', 'John', 'Peter'],'Distance' : [23,16,32,15,31,26], 'Time' : [3,5,2,7,9,4]})
df
Distance Name Time
0 23 Kate 3
1 16 John 5
2 32 Peter 2
3 15 Kate 7
4 31 John 9
5 26 Peter 4
I want to add a column that tells me, for each Name, what's the order of the times.
I want something like this:
Order Distance Name Time
0 16 John 5
1 31 John 9
0 23 Kate 3
1 15 Kate 7
0 32 Peter 2
1 26 Peter 4
I can do it using a for loop:
# I did this just to create an empty data frame with the columns I want
df2 = df[df['Name'] == 'aaa'].reset_index().reset_index()
for name, row in df.groupby('Name').count().iterrows():
    table = df[df['Name'] == name].sort_values('Time').reset_index().reset_index()
    to_concat = [df2, table]
    df2 = pd.concat(to_concat)
df2.drop('index', axis=1, inplace=True)
df2.columns = ['Order', 'Distance', 'Name', 'Time']
df2
This works, but apart from being very unpythonic, it takes about half an hour to run on large tables (my actual table has about 50 thousand rows).
Can someone help me write this in a simpler way that runs faster?
I'm sorry if this has been answered somewhere, but I didn't really know how to search for it.
Use sort_values with cumcount:
df = df.sort_values(['Name','Time'])
df['Order'] = df.groupby('Name').cumcount()
print (df)
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
If you need Order as the first column, use insert:
df = df.sort_values(['Name','Time'])
df.insert(0, 'Order', df.groupby('Name').cumcount())
print (df)
Order Distance Name Time
1 0 16 John 5
4 1 31 John 9
0 0 23 Kate 3
3 1 15 Kate 7
2 0 32 Peter 2
5 1 26 Peter 4
In [67]: df = df.sort_values(['Name','Time']) \
             .assign(Order=lambda d: d.groupby('Name').cumcount())
In [68]: df
Out[68]:
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
PS I'm not sure this is the most elegant way to do this...
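If the rows should stay in their original order, a different technique from the ones above is a grouped rank instead of sort plus cumcount. A minimal sketch on the question's data, assuming ties in Time should be broken by order of appearance:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Kate", "John", "Peter", "Kate", "John", "Peter"],
    "Distance": [23, 16, 32, 15, 31, 26],
    "Time": [3, 5, 2, 7, 9, 4],
})

# rank(method="first") numbers the times 1, 2, ... within each Name,
# so subtracting 1 gives a zero-based order; the frame is not reordered
df["Order"] = df.groupby("Name")["Time"].rank(method="first").astype(int) - 1
print(df)
```

Sorting by Name and Order afterwards reproduces the layout of the desired output.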

How to make complex data cleaning in pandas

For example, I have a DataFrame as following.
lineNum id name Cname score
1 001 Jack Math 99
2 002 Jack English 110
3 003 Jack Chinese 90
4 003 Jack Chinese 90
5 004 Tom Math Nan
6 005 Tom English 75
7 006 Tom Chinese 85
As you can see, this data needs cleaning:
1) Delete the duplicated row (lines 3 and 4 are identical).
2) Deal with the unreasonable value. In line 2, Jack's English score is 110, which is over the maximum of 100. I want to set it to the mean English score of all students.
3) Deal with the NaN value. Tom's Math score is NaN. I want to replace it with the mean Math score of all students.
I can handle each requirement separately, but I don't know how to combine all three. Thanks!
Plan:
drop duplicates to start
use mask to turn scores greater than 100 into nulls
filter the dataframe to the non-null scores and group by Cname to get the means
map the means onto Cname and use them to fill the nulls
d = df.drop_duplicates(['id', 'name', 'Cname'])
s0 = d.score
s1 = s0.mask(s0 > 100)
m = s1.notnull()
d.assign(score=s1.fillna(d.Cname.map(d[m].groupby('Cname').score.mean())))
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
You can use:
import numpy as np
cols = ['id','name','Cname','score']
# remove rows duplicated across these columns
df = df.drop_duplicates(subset=cols)
# replace values > 100 with NaN
df.loc[df['score'] > 100, 'score'] = np.nan
# replace NaN with the per-subject mean over all students
df['score'] = df.groupby('Cname')['score'].transform(lambda x: x.fillna(x.mean()))
print (df)
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
Alternative solution with mask for NaN:
cols = ['id','name','Cname','score']
df = df.drop_duplicates(subset=cols)
df['score'] = df['score'].mask(df['score'] > 100)
df['score'] = df.groupby('Cname')['score'].transform(lambda x: x.fillna(x.mean()))
print (df)
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
You should consider `.apply(func)` if the data is not too big.
import pandas as pd
df = pd.read_table('sample.txt', delimiter=r'\s+', na_values='Nan') # Your sample data
df = df.set_index('lineNum').drop_duplicates()
def deal_with(x):
    # replace an out-of-range or missing score with the mean score
    # of the other rows for the same subject
    if (x['score'] > 100.) or (pd.isnull(x['score'])):
        df_ = df[df['id'] != x['id']]
        x['score'] = df_.loc[df_['Cname'] == x['Cname'], 'score'].mean()
    return x
print(df.apply(deal_with, axis=1))
id name Cname score
lineNum
1 1 Jack Math 99.0
2 2 Jack English 75.0
3 3 Jack Chinese 90.0
5 4 Tom Math 99.0
6 5 Tom English 75.0
7 6 Tom Chinese 85.0
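To try the three steps end to end without a sample.txt file, the frame can be built inline. The sketch below is one possible combination of the ideas above, not the method of any single answer; the inline data and the use of where plus a grouped mean are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with the missing score as a real NaN
df = pd.DataFrame({
    "lineNum": [1, 2, 3, 4, 5, 6, 7],
    "id": ["001", "002", "003", "003", "004", "005", "006"],
    "name": ["Jack", "Jack", "Jack", "Jack", "Tom", "Tom", "Tom"],
    "Cname": ["Math", "English", "Chinese", "Chinese", "Math", "English", "Chinese"],
    "score": [99, 110, 90, 90, np.nan, 75, 85],
})

# 1) drop rows that repeat the same id/name/Cname/score
clean = df.drop_duplicates(subset=["id", "name", "Cname", "score"]).copy()

# 2) keep only scores within range; anything above 100 becomes NaN
clean["score"] = clean["score"].where(clean["score"] <= 100)

# 3) fill every NaN with the mean of the remaining scores for that subject
subject_mean = clean.groupby("Cname")["score"].transform("mean")
clean["score"] = clean["score"].fillna(subject_mean)
print(clean)
```

Because the out-of-range score is turned into NaN before the means are computed, it does not distort the English mean that replaces it.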

Python Grouping Transpose

I have my data in a pandas dataframe
out[1]:
NAME STORE AMOUNT
0 GARY GAP 20
1 GARY GAP 10
2 GARY KROGER 15
3 ASHLEY FOREVER21 30
4 ASHLEY KROGER 10
5 MARK GAP 10
6 ROGER KROGER 30
I'm trying to group by name and sum the total amount spent, while also generating a column for each unique store in the dataframe.
Desired:
out[1]:
NAME GAP KROGER FOREVER21
0 GARY 30 15 0
1 ASHLEY 0 10 30
2 MARK 10 0 0
3 ROGER 0 30 0
Thanks for your help!
You need pivot_table:
df1 = df.pivot_table(index='NAME',
                     columns='STORE',
                     values='AMOUNT',
                     aggfunc='sum',
                     fill_value=0)
print (df1)
STORE FOREVER21 GAP KROGER
NAME
ASHLEY 30 0 10
GARY 0 30 15
MARK 0 10 0
ROGER 0 0 30
Alternative solution with aggregating by groupby and sum:
df1 = df.groupby(['NAME','STORE'])['AMOUNT'].sum().unstack(fill_value=0)
print (df1)
STORE FOREVER21 GAP KROGER
NAME
ASHLEY 30 0 10
GARY 0 30 15
MARK 0 10 0
ROGER 0 0 30
Finally, if you need NAME as a regular column and want to remove the column and index names:
print (df1.reset_index().rename_axis(None, axis=1).rename_axis(None))
NAME FOREVER21 GAP KROGER
0 ASHLEY 30 0 10
1 GARY 0 30 15
2 MARK 0 10 0
3 ROGER 0 0 30
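Both approaches return names and stores in sorted order, whereas the desired output in the question lists them in order of first appearance. One possible way to restore that order, sketched here with the question's data rebuilt inline, is to reindex the pivoted frame with the unique values of each column:

```python
import pandas as pd

df = pd.DataFrame({
    "NAME": ["GARY", "GARY", "GARY", "ASHLEY", "ASHLEY", "MARK", "ROGER"],
    "STORE": ["GAP", "GAP", "KROGER", "FOREVER21", "KROGER", "GAP", "KROGER"],
    "AMOUNT": [20, 10, 15, 30, 10, 10, 30],
})

wide = df.pivot_table(index="NAME", columns="STORE", values="AMOUNT",
                      aggfunc="sum", fill_value=0)

# unique() keeps order of first appearance, unlike the sorted pivot labels
ordered = wide.reindex(index=df["NAME"].unique(), columns=df["STORE"].unique())

# Turn NAME back into a column and drop the leftover "STORE" axis name
flat = ordered.reset_index().rename_axis(None, axis=1)
print(flat)
```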
