I'm looking to get the sum of some values in a dataframe after it has been grouped.
Some sample data:
Race       officeID  CandidateId  total_votes  precinct
Mayor      10        705          20           Bell
Mayor      10        805          30           Bell
Treasurer  12        505          10           Bell
Treasurer  12        506          40           Bell
Treasurer  12        507          30           Bell
Mayor      10        705          50           Park
Mayor      10        805          10           Park
Treasurer  12        505          5            Park
Treasurer  12        506          13           Park
Treasurer  12        507          16           Park
To get the sum of the votes for each candidate, I can do:
cand_votes = df.groupby('CandidateId')['total_votes'].sum()
print(cand_votes)
CandidateId
505    15
506    53
507    46
705    70
805    40
To get total votes per office:
total_votes = df.groupby('officeID')['total_votes'].sum()
print(total_votes)
officeID
10    110
12    114
But what if I want to get the percentage of the vote each candidate got? Would I have to apply some sort of function on each data object? Ideally I would like the final data object to look like:
officeID  CandidateID  total_votes  vote_pct
10        705          70           .6363
10        805          40           .37
First, create a frame that has the votes by candidate and office.
gb = df.groupby(['officeID','CandidateId'], as_index=False)['total_votes'].sum()
Then with that, you can aggregate by office and use a transform (which returns like-indexed data) to calculate a percentage of the office total.
gb['vote_pct'] = gb['total_votes'] / gb.groupby('officeID')['total_votes'].transform('sum')
In [146]: gb
Out[146]:
   officeID  CandidateId  total_votes  vote_pct
0        10          705           70  0.636364
1        10          805           40  0.363636
2        12          505           15  0.131579
3        12          506           53  0.464912
4        12          507           46  0.403509
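Putting the two steps together, a self-contained sketch (rebuilding the sample data from the question) looks like this:

```python
import pandas as pd

# Rebuild the sample votes table from the question.
df = pd.DataFrame({
    "Race": ["Mayor", "Mayor", "Treasurer", "Treasurer", "Treasurer"] * 2,
    "officeID": [10, 10, 12, 12, 12] * 2,
    "CandidateId": [705, 805, 505, 506, 507] * 2,
    "total_votes": [20, 30, 10, 40, 30, 50, 10, 5, 13, 16],
    "precinct": ["Bell"] * 5 + ["Park"] * 5,
})

# Votes by office and candidate.
gb = df.groupby(["officeID", "CandidateId"], as_index=False)["total_votes"].sum()

# transform('sum') broadcasts each office's total back onto its rows,
# so the division lines up element-wise.
gb["vote_pct"] = gb["total_votes"] / gb.groupby("officeID")["total_votes"].transform("sum")
print(gb)
```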
Based on the data below, I want to create a histogram. How can I do this? The code below reads the data as a count of 1 for each record instead of taking the value from PATIENT_UNIQUE_COUNT:
df5 = df.groupby(["PANDEMIC", "PATIENT_AGE_GRP"]).agg(
    {"PATIENT_ID": pd.Series.nunique}
).rename(columns={"PATIENT_ID": "PATIENT_UNIQUE_COUNT"})
sns.histplot(x='PATIENT_AGE_GRP', data=df5, kde=True, hue='PANDEMIC')
plt.show()
                          PATIENT_UNIQUE_COUNT
PANDEMIC  PATIENT_AGE_GRP
AFTER     15-19                             14
          20-24                             21
          25-29                             58
          30-34                             90
          35-39                            156
          40-44                            194
          45-49                            266
          50-54                            369
          55-59                            535
          60-64                            660
          65-69                            829
          70-74                            823
          75-79                            713
          80-84                            657
          85-89                            576
          90+                              595
          <1                                 1
          NA                                 5
BEFORE    15-19                             13
          20-24                             14
          25-29                             41
          30-34                             56
          35-39                            144
          40-44                            179
          45-49                            279
          50-54                            466
          55-59                            758
          60-64                            873
          65-69                            929
          70-74                            890
          75-79                            860
          80-84                            789
          85-89                            726
          90+                              757
          NA                                11
The code reads each record as a count of 1 instead of taking the value from PATIENT_UNIQUE_COUNT. I would also like the histograms to be side by side for the BEFORE and AFTER pandemic flags.
So I have been trying to use pandas to create a DataFrame that reports the number of graduates working at jobs that require college degrees ('college_jobs') and at jobs that do not ('non_college_jobs').
Note: the DataFrame I am dealing with is named recent_grads.
I tried the following code:
df1 = recent_grads.groupby(['major_category']).college_jobs.non_college_jobs.sum()
or
df1 = recent_grads.groupby(['major_category']).recent_grads['college_jobs','non_college_jobs'].sum()
or
df1 = recent_grads.groupby(['major_category']).recent_grads['college_jobs'],['non_college_jobs'].sum()
None of them worked! What am I supposed to do? Can somebody give me a simple explanation of this? I have been trying to read through the pandas documentation and did not find the explanation I wanted.
here is the head of the dataframe:
   rank  major_code                                      major major_category  \
0     1        2419                      PETROLEUM ENGINEERING    Engineering
1     2        2416             MINING AND MINERAL ENGINEERING    Engineering
2     3        2415                  METALLURGICAL ENGINEERING    Engineering
3     4        2417  NAVAL ARCHITECTURE AND MARINE ENGINEERING    Engineering
4     5        2405                       CHEMICAL ENGINEERING    Engineering

   total  sample_size    men  women  sharewomen  employed  ...
0   2339           36   2057    282    0.120564      1976  ...
1    756            7    679     77    0.101852       640  ...
2    856            3    725    131    0.153037       648  ...
3   1258           16   1123    135    0.107313       758  ...
4  32260          289  21239  11021    0.341631     25694  ...

   part_time  full_time_year_round  unemployed  unemployment_rate  median  \
0        270                  1207          37           0.018381  110000
1        170                   388          85           0.117241   75000
2        133                   340          16           0.024096   73000
3        150                   692          40           0.050125   70000
4       5180                 16697        1672           0.061098   65000

   p25th   p75th  college_jobs  non_college_jobs  low_wage_jobs
0  95000  125000          1534               364            193
1  55000   90000           350               257             50
2  50000  105000           456               176              0
3  43000   80000           529               102              0
4  50000   75000         18314              4440            972

[5 rows x 21 columns]
You could filter the initial DataFrame down to the columns you're interested in and then perform the groupby and sum, as below:
recent_grads[['major_category', 'college_jobs', 'non_college_jobs']].groupby('major_category').sum()
Otherwise, if you skip the column filter and call .sum() directly on recent_grads.groupby('major_category'), it will be applied to every numeric column.
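A self-contained sketch with invented numbers, just to show the shape of the result:

```python
import pandas as pd

# Invented stand-in for recent_grads (only the relevant columns).
recent_grads = pd.DataFrame({
    "major_category": ["Engineering", "Engineering", "Business"],
    "college_jobs": [1534, 350, 1200],
    "non_college_jobs": [364, 257, 800],
})

df1 = (
    recent_grads[["major_category", "college_jobs", "non_college_jobs"]]
    .groupby("major_category")
    .sum()
)
print(df1)
```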
I have a dataframe with 7 variables:
         RACA   pca   pp  pcx  psc     lp     csc
0     BARBUDA  1915  470  150  140  87.65   91.41
1     BARBUDA  1345  305  100  110  79.32   98.28
2     BARBUDA  1185  295   80   85  62.19   83.12
3     BARBUDA  1755  385  120  130  80.65   90.01
4     BARBUDA  1570  325  120  120  77.96   87.99
5    CANELUDA  1640  365  110  115  81.38   87.26
6    CANELUDA  1960  525  135  145  89.21   99.37
7    CANELUDA  1715  410  100  120  79.35   99.84
8    CANELUDA  1615  380  100  110  76.32   99.27
9    CANELUDA  2230  500  165  160  90.22   99.56
10   CANELUDA  1570  400  105   95  85.24   83.95
11  COMERCIAL  1815  380  145   90  73.32   92.81
12  COMERCIAL  2475  345  180  140  71.77  105.64
13  COMERCIAL  1870  295  125  125  72.36   97.89
14  COMERCIAL  2435  565  185  160  73.24  107.39
15  COMERCIAL  1705  315  115  125  72.03   96.11
16  COMERCIAL  2220  495  165  150  87.63   96.89
17     PELOCO  1145  250   75   85  50.57   77.90
18     PELOCO   705   85   55   50  38.26   78.09
19     PELOCO  1140  195   80   75  66.15   96.35
20     PELOCO  1355  250   90   95  50.60   91.39
21     PELOCO  1095  220   80   80  53.03   84.57
22     PELOCO  1580  255  125  120  59.30   95.57
I want to fit a GLM for every dependent variable, pca:csc. In R it's quite simple, but I don't know how to get this working in Python. I tried to write a for loop and pass the column name to the formula (using R's paste, which doesn't exist in Python), but so far it hasn't worked:
for column in df:
    col = str(column)
    model = sm.formula.glm(paste(col, "~ RACA"), data=df).fit()
    print(model.summary())
I am using pandas and statsmodels:
import pandas as pd
import statsmodels.api as sm
I imagine it must be quite simple, but I sincerely haven't been able to figure it out yet.
I was able to figure out a solution. I don't know if it's the most efficient or elegant one, but it gives the results I wanted:
for column in df.loc[:, 'pca':'csc']:
    col = str(column)
    formula = col + " ~ RACA"
    model = sm.formula.glm(formula=formula, data=df).fit()
    print(model.summary())
I am open to suggestions on how I could improve this. Thank you!
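One possible tidy-up (a sketch, not the only way): build each formula with an f-string and keep the fitted models in a dict keyed by column name. The Q() quoting is a patsy feature that guards against column names that aren't valid Python identifiers; the data here is a small invented subset of the question's table:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Small invented subset of the question's data.
df = pd.DataFrame({
    "RACA": ["BARBUDA", "BARBUDA", "PELOCO", "PELOCO"],
    "pca": [1915, 1345, 1145, 705],
    "csc": [91.41, 98.28, 77.90, 78.09],
})

# Fit one Gaussian GLM per dependent variable.
models = {
    col: smf.glm(f"Q('{col}') ~ RACA", data=df).fit()
    for col in df.columns.drop("RACA")
}

for col, model in models.items():
    print(col, model.params.to_dict())
```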
I have a dataframe, and I want to pull the first index value as a string each time I sort the dataframe based on values.
What I want my function to do is pull the country name at the top of the list. In this example, it would pull 'United States' as a string. Because the country names are the index labels and not Series values, I can't just do summer_gold.iloc[0].
             # Summer  Gold  Silver  Bronze  Total  # Winter  Gold.1  Silver.1  Bronze.1  Total.1  # Games  Gold.2  Silver.2  Bronze.2  Combined total   ID
Afghanistan        13     0       0       2      2         0       0         0         0        0       13       0         0         2               2  AFG
Algeria            12     5       2       8     15         3       0         0         0        0       15       5         2         8              15  ALG
Argentina          23    18      24      28     70        18       0         0         0        0       41      18        24        28              70  ARG
Armenia             5     1       2       9     12         6       0         0         0        0       11       1         2         9              12  ARM
Australasia         2     3       4       5     12         0       0         0         0        0        2       3         4         5              12  ANZ
So if I were to sort based on number of Gold medals, I'd get a dataframe that looks like:
               # Summer  Gold  Silver  Bronze  Total  # Winter  Gold.1  \
United States        26   976     757     666   2399        22      96
Soviet Union          9   395     319     296   1010         9      78
Great Britain        27   236     272     272    780        22      10
France               27   202     223     246    671        22      31
China                 9   201     146     126    473        10      12

               Silver.1  Bronze.1  Total.1  # Games  Gold.2  Silver.2  \
United States       102        84      282       48    1072       859
Soviet Union         57        59      194       18     473       376
Great Britain         4        12       26       49     246       276
France               31        47      109       49     233       254
China                22        19       53       19     213       168

               Bronze.2  Combined total   ID
United States       750            2681  USA
Soviet Union        355            1204  URS
Great Britain       284             806  GBR
France              293             780  FRA
China               145             526  CHN
So far my overall code looks like:
def answer_one():
    summer_gold = df.sort_values('Gold', ascending=False)
    summer_gold = summer_gold.iloc[0]
    return summer_gold

answer_one()
Output:
# Summer            26
Gold               976
Silver             757
Bronze             666
Total             2399
# Winter            22
Gold.1              96
Silver.1           102
Bronze.1            84
Total.1            282
# Games             48
Gold.2            1072
Silver.2           859
Bronze.2           750
Combined total    2681
ID                 USA
Name: United States, dtype: object
I want an output of 'United States', in this case, or the name of whatever the country is at the top of my sorted dataframe.
After you've sorted your dataframe, you can access the first row's index label like:
df.index[0]
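A tiny sketch with a made-up slice of the medals table:

```python
import pandas as pd

# Made-up slice of the medals table; the country names are the index.
df = pd.DataFrame(
    {"Gold": [976, 395, 236]},
    index=["United States", "Soviet Union", "Great Britain"],
)

def answer_one():
    # Sort descending by Gold, then return the first index label (a str).
    return df.sort_values("Gold", ascending=False).index[0]

print(answer_one())  # -> United States
```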
I've noticed that for a DataFrame with a PeriodIndex, the month reverts to its native Int64 type upon a reset_index(), losing its freq attribute in the process. Is there any way to keep it as a Series of Periods?
For example:
In [42]: monthly
Out[42]:
                   qunits  expend
month   store upc
1992-12 1     21       83  248.17
              72        3   13.95
              78        2    6.28
              79        1    5.82
              85        5   28.10
              87        1    1.87
              88        6   11.76
...
1994-12 151   857      12   81.48
              858      23  116.15
              880       7   44.73
              881      13   25.05
              883      21   67.25
              884      44  190.56
              885      13   83.57
              887       1    4.55
becomes:
In [43]: monthly.reset_index()
Out[43]:
        month  store  upc  qunits  expend
0         275      1   21      83  248.17
1         275      1   72       3   13.95
2         275      1   78       2    6.28
3         275      1   79       1    5.82
4         275      1   85       5   28.10
5         275      1   87       1    1.87
6         275      1   88       6   11.76
7         275      1   89      21   41.16
...
500099    299    151  857      12   81.48
500100    299    151  858      23  116.15
500101    299    151  880       7   44.73
500102    299    151  881      13   25.05
500103    299    151  883      21   67.25
500104    299    151  884      44  190.56
500105    299    151  885      13   83.57
500106    299    151  887       1    4.55
Update 6/13/2014
It worked beautifully, but the end result I need is the PeriodIndex values passed on to a grouped DataFrame. I got it to work, but it seems to me that it could be done more compactly. I.e., my code is:
periods_index = monthly.index.get_level_values('month')
monthly.reset_index(inplace=True)
monthly.month = periods_index
grouped = monthly.groupby('month')
moments = pd.DataFrame(monthly.month.unique(), columns=['month'])
for month, group in grouped:
    moments.loc[moments.month == month, 'meanNo0'] = wmean(
        group[group.relative != 1].avExpend,
        np.log(group[group.relative != 1].relative),
    )
Any further suggestions?
How about this:
periods_index = monthly.index.get_level_values('month')
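To illustrate with a small invented frame: get_level_values keeps the Period dtype, so writing it back after reset_index leaves a true Period column. (Recent pandas versions preserve the dtype through reset_index on their own, so the explicit reassignment mainly matters on older versions.)

```python
import pandas as pd

# Small invented frame shaped like `monthly`.
idx = pd.MultiIndex.from_product(
    [pd.period_range("1992-12", periods=2, freq="M"), [1, 151]],
    names=["month", "store"],
)
monthly = pd.DataFrame({"expend": [248.17, 13.95, 81.48, 116.15]}, index=idx)

periods_index = monthly.index.get_level_values("month")  # stays Period-typed
flat = monthly.reset_index()
flat["month"] = periods_index
print(flat["month"].dtype)  # period[M]
```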