Writing own custom aggregation function for groupby - python

I have a Data Set that is available here
It gives us a DataFrame like
df=pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', sep='|')
df.head()
user_id age gender occupation zip_code
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213
I want to find the ratio of males to females in each occupation.
I have used the function below, but it does not feel like the most optimal approach.
df.groupby(['occupation', 'gender']).agg({'gender':'count'}).div(df.groupby('occupation').agg('count'), level='occupation')['gender']*100
That gives a result like:
occupation gender
administrator F 45.569620
M 54.430380
artist F 46.428571
M 53.571429
The above result is in a very different format; I want something like this (demo values):
occupation M:F
programmer 2:3
farmer 7:2
Can somebody please tell me how to write my own aggregation function?

Actually, pandas has a built-in value_counts(normalize=True) for computing normalized counts. Then you can massage the numbers a bit:
new_df = (df.groupby('occupation')['gender']
            .value_counts(normalize=True)    # normalized counts, e.g. 0.45
            .unstack('gender', fill_value=0)
            .round(2)                        # keep two decimal places
            .mul(100)                        # convert to percentages
            .astype(int)                     # drop the trailing .0
            .astype(str)                     # to strings, for concatenation
          )
new_df['F:M'] = new_df['F'] + ':' + new_df['M']
new_df.head()
Output:
gender F M F:M
occupation
administrator 46 54 46:54
artist 46 54 46:54
doctor 0 100 0:100
educator 27 73 27:73
engineer 3 97 3:97
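The same normalize/unstack pattern can be checked without hitting the network; a minimal sketch on a small hypothetical frame (the occupations and counts below are made up, not from the linked dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'occupation': ['writer', 'writer', 'writer', 'farmer'],
    'gender':     ['M', 'F', 'M', 'M'],
})

pct = (df.groupby('occupation')['gender']
         .value_counts(normalize=True)    # fraction of each gender per occupation
         .unstack('gender', fill_value=0) # genders become columns, missing ones 0
         .mul(100)                        # to percentages
         .round(0)
         .astype(int)
         .astype(str))
pct['F:M'] = pct['F'] + ':' + pct['M']
print(pct)
```

fill_value=0 matters here: an occupation with only one gender (like farmer above) would otherwise get NaN in the missing column.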

It is actually pretty easy. Every group produced by groupby is itself a DataFrame (a slice of the initial one), so you can apply your own function to process each partial DataFrame. Add print statements inside compute_gender_ratio to see what df looks like.
import pandas as pd

data = pd.read_csv(
    'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user',
    sep='|')

def compute_gender_ratio(df):
    gender_count = df['gender'].value_counts()
    return f"{gender_count.get('M', 0)}:{gender_count.get('F', 0)}"

result = data.groupby('occupation').apply(compute_gender_ratio)
result_df = result.to_frame(name='M:F')
result_df is:
M:F
occupation
administrator 43:36
artist 15:13
doctor 7:0
educator 69:26
engineer 65:2
entertainment 16:2
executive 29:3
healthcare 5:11
homemaker 1:6
lawyer 10:2
librarian 22:29
marketing 16:10
none 5:4
other 69:36
programmer 60:6
retired 13:1
salesman 9:3
scientist 28:3
student 136:60
technician 26:1
writer 26:19
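The same per-group computation can also be expressed with agg directly on the gender column, which avoids building a whole sub-DataFrame per group; a sketch on toy data (hypothetical occupations, not the linked dataset):

```python
import pandas as pd

data = pd.DataFrame({
    'occupation': ['writer', 'writer', 'writer', 'farmer'],
    'gender':     ['M', 'F', 'M', 'M'],
})

def ratio(genders):
    # genders is the Series of gender values for one occupation
    counts = genders.value_counts()
    return f"{counts.get('M', 0)}:{counts.get('F', 0)}"

result = data.groupby('occupation')['gender'].agg(ratio)
print(result)
```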

Does this work for you?
from fractions import Fraction

df_g = df.groupby(['occupation', 'gender']).count().user_id / df.groupby(['occupation']).count().user_id
df_g = df_g.reset_index()
df_g['ratio'] = df_g['user_id'].apply(lambda x: str(Fraction(x).limit_denominator()).replace('/', ':'))
Output
occupation gender user_id ratio
0 administrator F 0.455696 36:79
1 administrator M 0.544304 43:79
2 artist F 0.464286 13:28
3 artist M 0.535714 15:28
4 doctor M 1.000000 1
5 educator F 0.273684 26:95
6 educator M 0.726316 69:95
7 engineer F 0.029851 2:67
8 engineer M 0.970149 65:67
9 entertainment F 0.111111 1:9
10 entertainment M 0.888889 8:9
11 executive F 0.093750 3:32
12 executive M 0.906250 29:32
13 healthcare F 0.687500 11:16
14 healthcare M 0.312500 5:16
15 homemaker F 0.857143 6:7
16 homemaker M 0.142857 1:7
17 lawyer F 0.166667 1:6
18 lawyer M 0.833333 5:6
19 librarian F 0.568627 29:51
20 librarian M 0.431373 22:51
21 marketing F 0.384615 5:13
22 marketing M 0.615385 8:13
23 none F 0.444444 4:9
24 none M 0.555556 5:9
25 other F 0.342857 12:35
26 other M 0.657143 23:35
27 programmer F 0.090909 1:11
28 programmer M 0.909091 10:11
29 retired F 0.071429 1:14
30 retired M 0.928571 13:14
31 salesman F 0.250000 1:4
32 salesman M 0.750000 3:4
33 scientist F 0.096774 3:31
34 scientist M 0.903226 28:31
35 student F 0.306122 15:49
36 student M 0.693878 34:49
37 technician F 0.037037 1:27
38 technician M 0.962963 26:27
39 writer F 0.422222 19:45
40 writer M 0.577778 26:45
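The Fraction(...).limit_denominator() trick is what turns a float share back into a simple integer ratio; a quick standalone check on one of the values above:

```python
from fractions import Fraction

# 36/79 as a float is 0.455696...; limit_denominator recovers the
# closest simple fraction, giving back 36/79
ratio = str(Fraction(36 / 79).limit_denominator()).replace('/', ':')
print(ratio)
```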

Related

Calculate deviation after groupby - loop of ufunc does not support argument 0

I have data about electric cars in the USA and I am trying to calculate the standard deviation of range for each state. I already calculated the mean this way:
df = pd.read_csv('https://gist.githubusercontent.com/AlbertKozera/6396b4333d1a9222193e11401069ed9a/raw/ab8733a2135bcf61999bbcac4f92e0de5fd56794/Pojazdy%2520elektryczne%2520w%2520USA.csv')
for col in df.columns:
    df[col] = df[col].astype(str)
df['range'] = pd.to_numeric(df['range'])
.
.
.
df_avg_range = df.drop(columns = ['state', 'brand', 'model', 'year of production', 'type']).groupby('code', as_index=False)['range'].mean()
And here is my return after that:
code range
0 AK 154.553600
1 AL 156.959936
2 AR 153.950400
3 AZ 152.756000
4 CA 152.359200
5 CO 159.084800
6 CT 155.212000
7 DE 156.322400
8 FL 153.728000
9 GA 154.748800
10 HI 154.503200
11 IA 155.746400
12 ID 157.851200
13 IL 155.200800
14 IN 153.338400
15 KS 154.240000
16 KY 154.162400
17 LA 156.728800
18 MA 134.643200
19 MD 137.080800
20 ME 142.263200
21 MI 132.828000
22 MN 135.828000
23 MO 138.376000
24 MS 132.704000
25 MT 132.552000
26 NC 133.800000
27 ND 136.096800
28 NE 137.150400
29 NH 131.498400
30 NJ 137.760800
31 NM 133.325600
32 NV 137.522400
33 NY 137.476000
34 OH 137.784800
35 OK 134.277600
36 OR 134.504000
37 PA 141.052000
38 RI 137.572000
39 SC 143.348000
40 SD 141.189600
41 TN 139.981600
42 TX 139.233600
43 UT 138.615200
44 VA 141.334400
45 VT 143.104000
46 WA 137.880800
47 WI 143.916800
48 WV 141.008000
49 WY 147.109600
Now I am trying to calculate deviation in the same way:
df_dev_range = df.drop(columns = ['state', 'brand', 'model', 'year of production', 'type']).groupby('code', as_index=False)['range'].std()
And here is my error after that:
*** TypeError: loop of ufunc does not support argument 0 of type str which has no callable sqrt method
Can someone explain what I am doing wrong?
Try removing as_index=False from the groupby: with it, the standard deviation is applied to every column, including the string groupby column.
To retain the 0-49 index, use the syntax below:
df_dev_range = df.drop(columns = ['state', 'brand', 'model', 'year of production', 'type']).groupby('code',as_index=False).agg({'range':'std'})
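A self-contained check of this pattern (toy data with the same two column names, not the linked CSV): agg restricts the std computation to the numeric column, so the string group key is left alone:

```python
import pandas as pd

df = pd.DataFrame({
    'code':  ['AK', 'AK', 'AL', 'AL'],
    'range': [150.0, 160.0, 140.0, 150.0],
})

# std only touches 'range'; 'code' stays a plain (string) column
out = df.groupby('code', as_index=False).agg({'range': 'std'})
print(out)
```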

How to get the rolling mean and find the percentage of Male and Female in each occupation?

occupation gender number
administrator F 36
M 43
artist F 13
M 15
doctor M 7
educator F 26
M 69
How do I get the rolling mean of the first two columns and find the percentage of male (M) and female (F) in each occupation?
users = pd.read_table('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user',
sep='|', index_col='user_id')
users.head()
age gender occupation zip_code
user_id
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213
# count genders within each occupation
gender_ocup = users.groupby(['occupation', 'gender']).agg({'gender': 'count'})
# count rows for each occupation
occup_count = users.groupby(['occupation']).agg('count')
# divide gender_ocup by occup_count and multiply by 100
occup_gender = gender_ocup.div(occup_count, level="occupation") * 100
# select all rows of the 'gender' column
occup_gender.loc[:, 'gender']
Courtesy of:
https://github.com/guipsamora/pandas_exercises/blob/master/03_Grouping/Occupation/Exercises_with_solutions.ipynb
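An equivalent, more compact route uses pd.crosstab with normalize='index'; a sketch on hypothetical toy data (not the exercises notebook):

```python
import pandas as pd

users = pd.DataFrame({
    'occupation': ['technician', 'technician', 'other', 'other', 'writer'],
    'gender':     ['M', 'M', 'F', 'F', 'M'],
})

# normalize='index' makes each occupation row sum to 1 before scaling
pct = pd.crosstab(users['occupation'], users['gender'], normalize='index') * 100
print(pct)
```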

Combining Rows in a DataFrame

I have a DF that has the results of a NER classifier such as the following:
df =
s token pred tokenID
17 hakawati B-Loc 3
17 theatre L-Loc 3
17 jerusalem U-Loc 7
56 university B-Org 5
56 of I-Org 5
56 texas I-Org 5
56 here L-Org 6
...
5402 dwight B-Peop 1
5402 d. I-Peop 1
5402 eisenhower L-Peop 1
There are many other columns in this DataFrame that are not relevant. Now I want to group the tokens depending on their sentenceID (=s) and their predicted tags to combine them into a single entity:
df2 =
s token pred
17 hakawati theatre Location
17 jerusalem Location
56 university of texas here Organisation
...
5402 dwight d. eisenhower People
Normally I would do so with a line like
data_map = df.groupby(["s"], as_index=False, sort=False).agg(" ".join)
followed by a rename. However, since the data contains different kinds of tags (B/I/L - Loc/Org ...), I don't know exactly how to do it.
Any ideas are appreciated.
One solution, via a helper column:
df['pred_cat'] = df['pred'].str.split('-').str[-1]
res = df.groupby(['s', 'pred_cat'])['token']\
.apply(' '.join).reset_index()
print(res)
s pred_cat token
0 17 Loc hakawati theatre jerusalem
1 56 Org university of texas here
2 5402 Peop dwight d. eisenhower
Note this doesn't match exactly your desired output; there seems to be some data-specific treatment involved.
You could group by both s and tokenID and aggregate like so:
def aggregate(df):
    token = " ".join(df.token)
    pred = df.iloc[0].pred.split("-", 1)[1]
    return pd.Series({"token": token, "pred": pred})

df.groupby(["s", "tokenID"]).apply(aggregate)
# Output
token pred
s tokenID
17 3 hakawati theatre Loc
7 jerusalem Loc
56 5 university of texas Org
6 here Org
5402 1 dwight d. eisenhower Peop
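To match the long labels in the desired output (Location, Organisation, People), a mapping step can be bolted on afterwards; the short-to-long dictionary below is hypothetical, inferred from the desired output rather than stated in the question:

```python
import pandas as pd

# stand-in for the aggregated result above
res = pd.DataFrame({
    's':        [17, 56],
    'pred_cat': ['Loc', 'Org'],
    'token':    ['hakawati theatre', 'university of texas'],
})

# hypothetical short -> long label mapping
labels = {'Loc': 'Location', 'Org': 'Organisation', 'Peop': 'People'}
res['pred'] = res['pred_cat'].map(labels)
print(res)
```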

Finding the averages from columns

I'm using this txt file named Gradedata.txt and it looks like this:
Sarah K.,10,9,7,9,10,20,19,19,45,92
John M.,9,9,8,9,8,20,20,18,43,95
David R.,8,7,7,9,6,18,17,17,40,83
Joan A.,9,10,10,10,10,20,19,20,47,99
Nick J.,9,7,10,10,10,20,20,19,46,98
Vicki T.,7,7,8,9,9,17,18,19,44,88
I'm looking for the averages of each column. Each column has its own title (Homework #1, Homework #2, etc., in that order). What I am trying to produce should look exactly like this:
Homework #1 8.67
Homework #2 8.17
Homework #3 8.33
Homework #4 9.33
Homework #5 8.83
Quiz #1 19.17
Quiz #2 18.83
Quiz #3 18.67
Midterm #1 44.17
Final #1 92.50
Here is my attempt at accomplishing this task:
with open("GradeData.txt", "rtU") as f:
    columns = f.readline().strip().split(" ")
    numRows = 0
    sums = [0] * len(columns)
    for line in f:
        if not line.strip():
            continue
        values = line.split(" ")
        for i in xrange(len(values)):
            sums[i] += int(values[i])
        numRows += 1
    for index, summedRowValue in enumerate(sums):
        print columns[index], 1.0 * summedRowValue / numRows
I'm getting errors, and I also realize I have to label each assignment's average. I'd appreciate some help here.
numpy can chew this up in one line:
>>> np.loadtxt('Gradedata.txt', delimiter=',', usecols=range(1,11)).mean(axis=0)
array([ 8.66666667, 8.16666667, 8.33333333, 9.33333333,
8.83333333, 19.16666667, 18.83333333, 18.66666667,
44.16666667, 92.5 ])
Just transpose and use statistics.mean to get the average, skipping the first col:
import csv
from itertools import islice
from statistics import mean
with open("in.txt") as f:
    for col in islice(zip(*csv.reader(f)), 1, None):
        print(mean(map(float, col)))
Which will give you:
8.666666666666666
8.166666666666666
8.333333333333334
9.333333333333334
8.833333333333334
19.166666666666668
18.833333333333332
18.666666666666668
44.166666666666664
92.5
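The transpose-and-mean pattern can be verified without the file by feeding the reader a string buffer; a sketch with two hypothetical grade columns:

```python
import csv
import io
from itertools import islice
from statistics import mean

# two grade columns only, to keep the toy data short
rows = io.StringIO("Sarah K.,10,9\nJohn M.,9,9\n")

# zip(*reader) transposes rows to columns; islice skips the name column
averages = [mean(map(float, col))
            for col in islice(zip(*csv.reader(rows)), 1, None)]
print(averages)
```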
If the columns are actually named and you want to pair them:
import csv
from itertools import islice
from statistics import mean
with open("in.txt") as f:
    # get column names
    cols = next(f).split(",")
    # keys are column names, values are averages
    averages = [mean(map(float, col))
                for col in islice(zip(*csv.reader(f)), 1, None)]
    data = dict(zip(cols[1:], averages))
Or using pandas.read_csv:
import pandas as pd
df = pd.read_csv("in.txt",index_col=0,header=None)
print(df)
print(df.mean(axis=0))
1 2 3 4 5 6 7 8 9 10
0
Sarah K. 10 9 7 9 10 20 19 19 45 92
John M. 9 9 8 9 8 20 20 18 43 95
David R. 8 7 7 9 6 18 17 17 40 83
Joan A. 9 10 10 10 10 20 19 20 47 99
Nick J. 9 7 10 10 10 20 20 19 46 98
Vicki T. 7 7 8 9 9 17 18 19 44 88
1 8.666667
2 8.166667
3 8.333333
4 9.333333
5 8.833333
6 19.166667
7 18.833333
8 18.666667
9 44.166667
10 92.500000
dtype: float64

How to find the common values in a particular column in a particular table and display the intersected output

Table
Roll Class Country Rights CountryAcc
1 x IND 23 US
1 x1 IND 32 Ind
2 s US 12 US
3 q IRL 33 CA
4 a PAK 12 PAK
4 e PAK 12 IND
5 f US 21 CA
5 g US 31 PAK
6 h US 21 BAN
I want to display only those Rolls whose CountryAcc is never US or CA. For example, Roll 1 has one row with CountryAcc US, so I don't want its other row with CountryAcc Ind; the same goes for Roll 5, which has a row with CountryAcc CA. So my final output would be:
Roll Class Country Rights CountryAcc
4 a PAK 12 PAK
4 e PAK 12 IND
6 h US 21 BAN
I tried getting that output following way:
Home_Country = ['US', 'CA']
#First I saved two countries in a variable
Account_Other_Count = df.loc[~df.CountryAcc.isin(Home_Country)]
Account_Other_Count_Var = df.loc[~df.CountryAcc.isin(Home_Country)][['Roll']].values.ravel()
# Then I made two variables one with CountryAcc in US or CA and other variable with remaining and I got their Roll
Account_Home_Count = df.loc[df.CountryAcc.isin(Home_Country)]
Account_Home_Count_Var = df.loc[df.CountryAcc.isin(Home_Country)][['Roll']].values.ravel()
#Here I got the common Rolls
Common_ROLL = list(set(Account_Home_Count_Var).intersection(list(Account_Other_Count_Var)))
Final_Output = Account_Other_Count.loc[~Account_Other_Count.Roll.isin(Common_ROLL)]
Is there a better, more pandas-idiomatic or Pythonic way to do it?
One solution could be
In [37]: df.loc[~df['Roll'].isin(df.loc[df['CountryAcc'].isin(['US', 'CA']), 'Roll'])]
Out[37]:
Roll Class Country Rights CountryAcc
4 4 a PAK 12 PAK
5 4 e PAK 12 IND
8 6 h US 21 BAN
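The two-step logic behind that one-liner, spelled out on a cut-down toy frame (only the two relevant columns, hypothetical values mirroring the table above):

```python
import pandas as pd

df = pd.DataFrame({
    'Roll':       [1, 1, 4, 4, 6],
    'CountryAcc': ['US', 'Ind', 'PAK', 'IND', 'BAN'],
})

# Rolls that ever appear with a home-country account
home_rolls = df.loc[df['CountryAcc'].isin(['US', 'CA']), 'Roll']
# drop every row belonging to those Rolls, keeping the rest intact
out = df[~df['Roll'].isin(home_rolls)]
print(out)
```

Note that Roll 4 keeps both its rows, while Roll 1 is removed entirely because one of its rows is US.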
This is one way to do it:
sortdata = df[~df['CountryAcc'].isin(['US', 'CA'])].sort_index()
