How to add additional text to matplotlib annotations - python

I have used seaborn's titanic dataset as a proxy for my very large dataset, and built the chart and the summary table below from it.
The following code runs without any errors:
import seaborn as sns
import pandas as pd
import numpy as np
sns.set_theme(style="darkgrid")
# Load the example Titanic dataset
df = sns.load_dataset("titanic")
# split fare into decile groups and order them
df['fare_grp'] = pd.qcut(df['fare'], q=10,labels=None, retbins=False, precision=0).astype(str)
df.groupby(['fare_grp'],dropna=False).size()
df['fare_grp_num'] = pd.qcut(df['fare'], q=10,labels=False, retbins=False, precision=0).astype(str)
df.groupby(['fare_grp_num'],dropna=False).size()
df['fare_ord_grp'] = df['fare_grp_num'] + ' ' +df['fare_grp']
df['fare_ord_grp']
# set variables
target = 'survived'
ydim = 'fare_ord_grp'
xdim = 'embark_town'
#del [result]
non_events = pd.DataFrame(df[df[target]==0].groupby([ydim,xdim],as_index=False, dropna=False)[target].count()).rename(columns={target: 'non_events'})
non_events[xdim]=non_events[xdim].replace(np.nan, 'Missing', regex=True)
non_events[ydim]=non_events[ydim].replace(np.nan, 'Missing', regex=True)
non_events_total = pd.DataFrame(df[df[target]==0].groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'non_events_total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
events = pd.DataFrame(df[df[target]==1].groupby([ydim,xdim],as_index=False, dropna=False)[target].count()).rename(columns={target: 'events'})
events[xdim]=events[xdim].replace(np.nan, 'Missing', regex=True)
events[ydim]=events[ydim].replace(np.nan, 'Missing', regex=True)
events_total = pd.DataFrame(df[df[target]==1].groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'events_total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
grand_total = pd.DataFrame(df.groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
grand_total=grand_total.merge(non_events_total, how='left', on=xdim).merge(events_total, how='left', on=xdim)
result = pd.merge(non_events, events, how="outer",on=[ydim,xdim])
result['total'] = result['non_events'].fillna(0) + result['events'].fillna(0)
result[xdim] = result[xdim].replace(np.nan, 'Missing', regex=True)
result = pd.merge(result, grand_total, how="left",on=[xdim])
result['survival rate %'] = round(result['events']/result['total']*100,2)
result['% event dist by xdim'] = round(result['events']/result['events_total_by_xdim']*100,2)
result['% non-event dist by xdim'] = round(result['non_events']/result['non_events_total_by_xdim']*100,2)
result['% total dist by xdim'] = round(result['total']/result['total_by_xdim']*100,2)
display(result)
value_name1 = "% dist by " + str(xdim)
dfl = pd.melt(result, id_vars=[ydim, xdim],value_vars =['% total dist by xdim'], var_name = 'Type',value_name=value_name1).drop(columns='Type')
dfl2 = dfl.pivot(index=ydim, columns=xdim, values=value_name1)
print(dfl2)
title1 = "% dist by " + str(xdim)
ax=dfl2.T.plot(kind='bar', stacked=True, rot=1, figsize=(8, 8), title=title1)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
ax.legend(bbox_to_anchor=(1.0, 1.0),title = 'Fare Range')
ax.set_ylabel('% Dist')
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x+width/2, y+height/2, '{:.0f}%'.format(height),
            horizontalalignment='center', verticalalignment='center')
It produces the following stacked percent bar chart, which shows the % of total distribution by embark town.
I also want to show the survival rate along with the % distribution in each block. For example, for Queenstown, fare range 1 (7.6, 7.9], the % total distribution is 56% and the survival rate is 37.21%, so I want the label to read (56%, 37.21%). I have not been able to figure this out; any suggestions are appreciated. Thanks.
Here is the output summary table for reference
     fare_ord_grp     embark_town  non_events  events  total  total_by_xdim  non_events_total_by_xdim  events_total_by_xdim  survival rate %  % event dist by xdim  % non-event dist by xdim  % total dist by xdim
0    0 (-0.1,7.6]     Cherbourg    22          7       29     168            75                        93                    24.14            7.53                  29.33                     17.26
1    0 (-0.1,7.6]     Queenstown   4           NaN     4      77             47                        30                    NaN              NaN                   8.51                      5.19
2    0 (-0.1,7.6]     Southampton  53          6       59     644            427                       217                   10.17            2.76                  12.41                     9.16
3    1 (7.6,7.9]      Queenstown   27          16      43     77             47                        30                    37.21            53.33                 57.45                     55.84
4    1 (7.6,7.9]      Southampton  34          10      44     644            427                       217                   22.73            4.61                  7.96                      6.83
5    2 (7.9,8]        Cherbourg    4           1       5      168            75                        93                    20               1.08                  5.33                      2.98
6    2 (7.9,8]        Southampton  83          13      96     644            427                       217                   13.54            5.99                  19.44                     14.91
7    3 (8.0,10.5]     Cherbourg    2           1       3      168            75                        93                    33.33            1.08                  2.67                      1.79
8    3 (8.0,10.5]     Queenstown   2           NaN     2      77             47                        30                    NaN              NaN                   4.26                      2.6
9    3 (8.0,10.5]     Southampton  56          17      73     644            427                       217                   23.29            7.83                  13.11                     11.34
10   4 (10.5,14.5]    Cherbourg    7           8       15     168            75                        93                    53.33            8.6                   9.33                      8.93
11   4 (10.5,14.5]    Queenstown   1           2       3      77             47                        30                    66.67            6.67                  2.13                      3.9
12   4 (10.5,14.5]    Southampton  40          26      66     644            427                       217                   39.39            11.98                 9.37                      10.25
13   5 (14.5,21.7]    Cherbourg    9           10      19     168            75                        93                    52.63            10.75                 12                        11.31
14   5 (14.5,21.7]    Queenstown   5           3       8      77             47                        30                    37.5             10                    10.64                     10.39
15   5 (14.5,21.7]    Southampton  37          24      61     644            427                       217                   39.34            11.06                 8.67                      9.47
16   6 (21.7,27]      Cherbourg    1           4       5      168            75                        93                    80               4.3                   1.33                      2.98
17   6 (21.7,27]      Queenstown   2           3       5      77             47                        30                    60               10                    4.26                      6.49
18   6 (21.7,27]      Southampton  40          39      79     644            427                       217                   49.37            17.97                 9.37                      12.27
19   7 (27.0,39.7]    Cherbourg    14          10      24     168            75                        93                    41.67            10.75                 18.67                     14.29
20   7 (27.0,39.7]    Queenstown   5           NaN     5      77             47                        30                    NaN              NaN                   10.64                     6.49
21   7 (27.0,39.7]    Southampton  38          24      62     644            427                       217                   38.71            11.06                 8.9                       9.63
22   8 (39.7,78]      Cherbourg    5           19      24     168            75                        93                    79.17            20.43                 6.67                      14.29
23   8 (39.7,78]      Southampton  37          28      65     644            427                       217                   43.08            12.9                  8.67                      10.09
24   9 (78.0,512.3]   Cherbourg    11          33      44     168            75                        93                    75               35.48                 14.67                     26.19
25   9 (78.0,512.3]   Queenstown   1           1       2      77             47                        30                    50               3.33                  2.13                      2.6
26   9 (78.0,512.3]   Southampton  9           30      39     644            427                       217                   76.92            13.82                 2.11                      6.06
27   2 (7.9,8]        Queenstown   NaN         5       5      77             47                        30                    100              16.67                 NaN                       6.49
28   9 (78.0,512.3]   Missing      NaN         2       2      2              NaN                       2                     100              100                   NaN                       100

dfl2.T is being plotted, but 'survival rate %' lives in result. As such, the indices of the values from dfl2.T do not correspond with 'survival rate %'.
Because the values in result['% total dist by xdim'] are not unique, we can't use a dict of matched key-values.
Create a corresponding pivoted DataFrame for 'survival rate %', and then flatten it. All of the values will be in the same order as the '% total dist by xdim' values from dfl2.T. As such, they can be indexed.
With respect to dfl2.T, the plot API plots in column order, which means .flatten(order='F') must be used to flatten the array in the correct order to be indexed.
# create a corresponding pivoted dataframe for survival rate %
dfl3 = pd.melt(result, id_vars=[ydim, xdim],value_vars =['survival rate %'], var_name = 'Type',value_name=value_name1).drop(columns='Type')
dfl4 = dfl3.pivot(index=ydim, columns=xdim, values=value_name1)
# flatten dfl4.T in column order
dfl4_flattened = dfl4.T.to_numpy().flatten(order='F')
for i, p in enumerate(ax.patches):
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    # only print values when height is not 0
    if height != 0:
        # create the text string
        text = f'{height:.0f}%, {dfl4_flattened[i]:.0f}%'
        # annotate the bar segments
        ax.text(x+width/2, y+height/2, text,
                horizontalalignment='center', verticalalignment='center')
Notes
Here we can see dfl2.T and dfl4.T
# dfl2.T
fare_ord_grp 0 (-0.1, 7.6] 1 (7.6, 7.9] 2 (7.9, 8.0] 3 (8.0, 10.5] 4 (10.5, 14.5] 5 (14.5, 21.7] 6 (21.7, 27.0] 7 (27.0, 39.7] 8 (39.7, 78.0] 9 (78.0, 512.3]
embark_town
Cherbourg 17.26 NaN 2.98 1.79 8.93 11.31 2.98 14.29 14.29 26.19
Missing NaN NaN NaN NaN NaN NaN NaN NaN NaN 100.00
Queenstown 5.19 55.84 6.49 2.60 3.90 10.39 6.49 6.49 NaN 2.60
Southampton 9.16 6.83 14.91 11.34 10.25 9.47 12.27 9.63 10.09 6.06
# dfl4.T
fare_ord_grp 0 (-0.1, 7.6] 1 (7.6, 7.9] 2 (7.9, 8.0] 3 (8.0, 10.5] 4 (10.5, 14.5] 5 (14.5, 21.7] 6 (21.7, 27.0] 7 (27.0, 39.7] 8 (39.7, 78.0] 9 (78.0, 512.3]
embark_town
Cherbourg 24.14 NaN 20.00 33.33 53.33 52.63 80.00 41.67 79.17 75.00
Missing NaN NaN NaN NaN NaN NaN NaN NaN NaN 100.00
Queenstown NaN 37.21 100.00 NaN 66.67 37.50 60.00 NaN NaN 50.00
Southampton 10.17 22.73 13.54 23.29 39.39 39.34 49.37 38.71 43.08 76.92
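To see why the column-major order matters, here is a quick toy check (made-up values, not the titanic data) of what flatten(order='F') does to an array shaped like dfl4.T, with two x-axis groups and three stacked segments:

```python
import numpy as np

# Toy stand-in for dfl4.T: 2 x-axis groups (rows) x 3 stacked segments (columns)
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

# Column-major flattening walks down each column first, which matches
# the order in which the plot API creates ax.patches (series by series)
print(arr.flatten(order='F'))  # [1 4 2 5 3 6]
```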

Related

Concatenate Dataframes on common values yields NaN values for non-matches [python]

I am trying to merge/concatenate two dataframes on a common column and match up all the corresponding values. However, while matching values receive corresponding values for that row, when there is no match, a NaN value is produced. I am using python for this. I will explain here in more detail.
I have this dataframe A:
ID Area Distance Height Temp
----------------------------------------------------
0 100 8.31 0 1.30 24.27
1 101 3.11 0 1.29 25.99
2 102 5.10 0 1.23 29.51
3 105 9.70 0 1.97 15.17
4 107 4.77 0 1.53 27.84
...
Each ID represents a different building footprint (polygon), with its recorded area, the height of the building, and the mean outdoor temperature recorded at the site of the building. The "Distance" column denotes the distance from the building at which the temperature was recorded, so onsite = 0 meters away.
And I have this dataframe B:
ID Temp Distance
---------------------------------
0 100 25.68 5
1 100 26.05 10
2 100 26.85 15
3 100 27.25 20
4 100 27.78 25
5 101 22.68 5
6 101 26.44 10
7 101 26.83 15
8 101 27.26 20
9 101 28.38 25
10 102 25.63 5
11 102 26.26 10
12 102 26.57 15
13 102 26.91 20
14 102 28.84 25
15 105 25.33 5
16 105 26.25 10
17 105 26.54 15
18 105 26.23 20
19 105 27.53 25
20 107 25.23 5
21 107 26.73 10
22 107 26.26 15
23 107 26.11 20
24 107 27.16 25
...
This shows for the same building IDs the temperatures recorded at different distances away from the building, and so for each building I want the recorded mean outdoor temperature 5 meters away, 10 meters away, 15 meters away, 20 meters away, and 25 meters away.
What I want to do is join dataframes A and B by the common "ID" column, producing a dataframe C that shows, for each ID, that building's temperature at distances 0, 5, 10, 15, 20, and 25. The catch is that for each building ID the Area and Height should remain the same, because of course the building's area and height do not change! And so I want to produce the following dataframe C:
ID Area Distance Height Temp
----------------------------------------------------
0 100 8.31 0 1.30 24.27
1 100 8.31 5 1.30 25.68
2 100 8.31 10 1.30 26.05
3 100 8.31 15 1.30 26.85
4 100 8.31 20 1.30 27.25
5 100 8.31 25 1.30 27.78
6 101 3.11 0 1.29 25.99
7 101 3.11 5 1.29 22.68
8 101 3.11 10 1.29 26.44
9 101 3.11 15 1.29 26.83
10 101 3.11 20 1.29 27.26
11 101 3.11 25 1.29 28.38
12 102 5.10 0 1.23 29.51
13 102 5.10 5 1.23 25.63
14 102 5.10 10 1.23 26.26
15 102 5.10 15 1.23 26.57
16 102 5.10 20 1.23 26.91
17 102 5.10 25 1.23 28.84
18 105 9.70 0 1.97 15.17
19 105 9.70 5 1.97 25.33
20 105 9.70 10 1.97 26.25
21 105 9.70 15 1.97 26.54
22 105 9.70 20 1.97 26.23
23 105 9.70 25 1.97 27.53
24 107 4.77 0 1.53 27.84
25 107 4.77 5 1.53 25.23
26 107 4.77 10 1.53 26.73
27 107 4.77 15 1.53 26.26
28 107 4.77 20 1.53 26.11
29 107 4.77 25 1.53 27.16
...
And so to obtain this I try the following, trying to concatenate dataframes A and B on the "ID" column, and then sorting the rows by "ID" and "Distance":
df_C = pd.concat([df_A, df_B]).sort_values(["ID", "Distance"]).reset_index(drop=True)
However this yields:
ID Area Distance Height Temp
----------------------------------------------------
0 100 8.31 0 1.30 24.27
1 100 NaN 5 NaN 25.68
2 100 NaN 10 NaN 26.05
3 100 NaN 15 NaN 26.85
4 100 NaN 20 NaN 27.25
5 100 NaN 25 NaN 27.78
6 101 3.11 0 1.29 25.99
7 101 NaN 5 NaN 22.68
8 101 NaN 10 NaN 26.44
9 101 NaN 15 NaN 26.83
10 101 NaN 20 NaN 27.26
11 101 NaN 25 NaN 28.38
12 102 5.10 0 1.23 29.51
13 102 NaN 5 NaN 25.63
14 102 NaN 10 NaN 26.26
15 102 NaN 15 NaN 26.57
16 102 NaN 20 NaN 26.91
17 102 NaN 25 NaN 28.84
18 105 9.70 0 1.97 15.17
19 105 NaN 5 NaN 25.33
20 105 NaN 10 NaN 26.25
21 105 NaN 15 NaN 26.54
22 105 NaN 20 NaN 26.23
23 105 NaN 25 NaN 27.53
24 107 4.77 0 1.53 27.84
25 107 NaN 5 NaN 25.23
26 107 NaN 10 NaN 26.73
27 107 NaN 15 NaN 26.26
28 107 NaN 20 NaN 26.11
29 107 NaN 25 NaN 27.16
...
And so it appears that the Area and Height values are not getting matched up because dataframe B does not contain the corresponding Area and Height values, so there is nothing to fill in when I concatenate the two dataframes. How can I fix this so that I get my intended dataframe C?
If you are sure that every ID appears in df_A with a distance of 0, and that the only NaNs are in Area and Height, then ffill will do it once the rows are sorted as you did:
df_C = df_C.ffill()
If you are not sure, then you can use groupby.transform with first and fillna
df_C = df_C.fillna(df_C.groupby('ID')[['Area', 'Height']].transform('first'))
Finally, another option is to add the Area and Height columns to df_B first, then concat:
df_C = pd.concat([
    df_A,
    df_B.merge(df_A[['ID', 'Area', 'Height']], on='ID', how='left')
]).sort_values(["ID", "Distance"]).reset_index(drop=True)
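To sanity-check that last approach, here is a runnable sketch on toy versions of the two frames (made-up rows, same column layout as the question):

```python
import pandas as pd

# Toy stand-ins for df_A (onsite rows) and df_B (offsite rows)
df_A = pd.DataFrame({'ID': [100, 101], 'Area': [8.31, 3.11],
                     'Distance': [0, 0], 'Height': [1.30, 1.29],
                     'Temp': [24.27, 25.99]})
df_B = pd.DataFrame({'ID': [100, 100, 101],
                     'Temp': [25.68, 26.05, 22.68],
                     'Distance': [5, 10, 5]})

# Attach the static Area/Height to df_B before concatenating,
# so no NaNs appear in those columns
df_C = pd.concat([
    df_A,
    df_B.merge(df_A[['ID', 'Area', 'Height']], on='ID', how='left')
]).sort_values(['ID', 'Distance']).reset_index(drop=True)
print(df_C)
```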

Why is my code only putting the last value in the iteration into my data frame team_perf?

team_perf only ends up containing the data for LAL, the last team in the list. It should include data for every team, not just the Lakers; that is why I am iterating through every team abbreviation:
team_abbreviations = ['NOP','BRK','OKC','NYK','DET','ATL','POR','CHI','MIL','TOR','IND',
'UTA','DEN','SAC','GSW','MIN','ORL','BOS','CHO',
'MEM','MIA','WAS','HOU','SAS','LAC','PHI','CLE','PHO','DAL','LAL']
for i in team_abbreviations:
    url = r'https://www.basketball-reference.com/teams/{0}/2022/gamelog-advanced/'.format(i)
    team_perf = pd.read_html(url)[0]
You need to append the result of each iteration to a list, then concatenate; as written, team_perf is overwritten on every pass of the loop:
import pandas as pd

team_abbreviations = ['NOP','BRK','OKC','NYK']

DF = []
for i in team_abbreviations:
    url = r'https://www.basketball-reference.com/teams/{0}/2022/gamelog-advanced/'.format(i)
    team_perf = pd.read_html(url)[0]
    DF.append(team_perf)
DF = pd.concat(DF)
which gives:
Unnamed: 0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 \
Rk G Date
0 1 1 2021-10-20
1 2 2 2021-10-22
2 3 3 2021-10-23
3 4 4 2021-10-25
4 5 5 2021-10-27
.. ... ... ...
48 45 45 2022-01-18
49 46 46 2022-01-20
50 47 47 2022-01-23
51 48 48 2022-01-24
52 49 49 2022-01-26
Unnamed: 3_level_0 Unnamed: 4_level_0 Unnamed: 5_level_0 \
Unnamed: 3_level_1 Opp W/L
0 NaN PHI L
1 # CHI L
2 # MIN L
3 # MIN W
4 NaN ATL L
.. ... ... ...
48 NaN MIN L
49 NaN NOP L
50 NaN LAC W
51 # CLE L
52 # MIA L
Unnamed: 6_level_0 Unnamed: 7_level_0 Advanced ... \
Tm Opp ORtg DRtg ...
0 97 117 98.6 119.0 ...
1 112 128 111.3 127.2 ...
2 89 96 86.2 93.0 ...
3 107 98 109.5 100.3 ...
4 99 102 107.0 110.2 ...
.. ... ... ... ... ...
48 110 112 111.4 113.4 ...
49 91 102 98.8 110.7 ...
50 110 102 115.9 107.5 ...
51 93 95 99.5 101.6 ...
52 96 110 105.4 120.8 ...
Unnamed: 18_level_0 Offensive Four Factors \
Unnamed: 18_level_1 eFG% TOV% ORB% FT/FGA
0 NaN .489 11.8 19.6 .065
1 NaN .569 14.5 20.5 .149
2 NaN .399 22.0 38.2 .202
3 NaN .506 14.9 33.3 .218
4 NaN .489 8.5 20.9 .086
.. ... ... ... ... ...
48 NaN .538 16.0 26.8 .300
49 NaN .435 13.9 30.4 .312
50 NaN .516 10.8 30.6 .176
51 NaN .488 10.6 22.0 .131
52 NaN .512 15.8 32.6 .133
Unnamed: 23_level_0 Defensive Four Factors
Unnamed: 23_level_1 eFG% TOV% DRB% FT/FGA
0 NaN .594 11.3 85.0 .188
1 NaN .607 12.3 75.0 .225
2 NaN .469 15.9 75.5 .063
3 NaN .428 9.7 80.4 .233
4 NaN .458 9.6 62.5 .146
.. ... ... ... ... ...
48 NaN .512 10.9 80.5 .358
49 NaN .572 12.3 91.9 .197
50 NaN .519 11.5 86.0 .253
51 NaN .512 17.5 73.9 .148
52 NaN .608 16.9 80.6 .270
[207 rows x 28 columns]
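As a side note, if you also need to know which team each row came from after concatenation, pd.concat accepts a dict and uses its keys as an outer index level. A small sketch with toy stand-in tables (made-up numbers, not scraped data):

```python
import pandas as pd

# Toy stand-ins for the scraped per-team tables
frames = {'NOP': pd.DataFrame({'Tm': [97, 112]}),
          'BRK': pd.DataFrame({'Tm': [89, 107]})}

# Passing a dict to pd.concat uses its keys as an outer index level,
# so each row remembers which team it came from
combined = pd.concat(frames, names=['team'])
print(combined)
```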

Python/Pandas/Excel Creating a 2D array from 3 columns

I have two questions actually. I have a dataframe like the one below. I need to split the first column into year and period, the same way a fixed-width split works in Excel. Based on the documentation, pandas str.split() can't do this; it needs a delimiting character.
Initial df:
Year/Period PurchDoc
0 FY19P01 162
1 FY19P02 148
2 FY19P03 133
3 FY19P04 157
4 FY19P05 152
5 FY19P06 176
6 FY19P07 123
7 FY19P08 143
8 FY19P09 161
9 FY19P10 177
10 FY19P11 152
11 FY19P12 175
12 FY20P01 203
13 FY20P02 157
14 FY20P03 206
15 FY20P04 247
16 FY20P05 182
17 FY20P06 141
18 FY20P07 205
19 FY20P08 194
Expected result:
Year Period PurchDoc
0 FY19 P01 162
1 FY19 P02 148
2 FY19 P03 133
3 FY19 P04 157
4 FY19 P05 152
5 FY19 P06 176
6 FY19 P07 123
7 FY19 P08 143
8 FY19 P09 161
9 FY19 P10 177
10 FY19 P11 152
11 FY19 P12 175
12 FY20 P01 203
13 FY20 P02 157
14 FY20 P03 206
15 FY20 P04 247
16 FY20 P05 182
17 FY20 P06 141
18 FY20 P07 205
19 FY20 P08 194
Second, I need to pivot the Period and PurchDoc columns so the table looks like this (well, as ints and no NaNs, but I can fix that):
Unnamed: 0 P01 P02 P03 P04 P05 P06 P07 P08 P09 P10 P11 P12
0 FY19 162 148 133 157 152.0 176.0 123.0 143.0 161.0 177.0 152.0 175.0
1 FY20 203 157 206 247 182.0 141.0 205.0 194.0 113.0 44.0 26.0 17.0
2 FY21 41 53 42 40 52.0 54.0 57.0 46.0 90.0 103.0 63.0 86.0
3 FY22 114 96 87 92 NaN NaN NaN NaN NaN NaN NaN NaN
Couldn't find anything remotely useful googling unfortunately, so I don't have any failed code to show.
df['Year'] = df['Year/Period'].str.slice(stop=4)
df['Period'] = df['Year/Period'].str.slice(start=4)
df.drop('Year/Period', axis=1, inplace=True)
df = df.pivot(values = 'PurchDoc', index = 'Year', columns = 'Period')
print(df)
output:
Period P01 P02 P03 P04 P05 P06 P07 P08 P09 P10 P11 P12
Year
FY19 162.0 148.0 133.0 157.0 152.0 176.0 123.0 143.0 161.0 177.0 152.0 175.0
FY20 203.0 157.0 206.0 247.0 182.0 141.0 205.0 194.0 NaN NaN NaN NaN
df[["Year", "Period"]] = df.apply(lambda x: (x["Year/Period"][:4], x["Year/Period"][4:]), result_type="expand", axis=1)
Then:
pd.pivot_table(df, columns="Period", index="Year", values="PurchDoc", aggfunc="sum")
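A third option (an alternative sketch, not from either answer above) is str.extract with a regex, which splits and validates the FYxxPxx pattern in one step before pivoting:

```python
import pandas as pd

# Small stand-in for the question's dataframe
df = pd.DataFrame({'Year/Period': ['FY19P01', 'FY19P02', 'FY20P01'],
                   'PurchDoc': [162, 148, 203]})

# Each capture group becomes a new column
df[['Year', 'Period']] = df['Year/Period'].str.extract(r'(FY\d{2})(P\d{2})')

# Pivot into the wide Year-by-Period layout
wide = df.pivot(index='Year', columns='Period', values='PurchDoc')
print(wide)
```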

.loc function for a specific label returning an empty data frame?

For context, I'm trying to filter out rows in my dataframe that only belong to the year 2021.
This is my script code:
test = all_SS_batting_columns.loc[all_SS_batting_columns['Year'] == '2021']
but it only returns:
Empty DataFrame
Columns: [index, Year, Age, Tm, Lg, G, PA, AB, R, H, 2B, 3B, HR, RBI, SB, CS, BB, SO, BA, OBP, SLG]
Index: []
Just in case, here's what my dataframe looks like:
index Year Age Tm Lg G ... CS BB SO BA OBP SLG
0 0 2016 23.0 CHW AL 99 ... 2 13 117 0.283 0.306 0.432
1 1 2017 24.0 CHW AL 146 ... 1 13 162 0.257 0.276 0.402
2 2 2018 25.0 CHW AL 153 ... 8 30 149 0.240 0.281 0.406
3 3 2019 26.0 CHW AL 123 ... 5 15 109 0.335 0.357 0.508
4 4 2020 27.0 CHW AL 49 ... 2 10 50 0.322 0.357 0.529
5 5 2021 28.0 CHW AL 88 ... 7 18 87 0.299 0.332 0.437
6 6 6 Yrs NaN NaN NaN 658 ... 25 99 674 0.283 0.312 0.442
7 7 162 Game Avg. NaN NaN NaN 162 ... 6 24 166 0.283 0.312 0.442
8 0 2019 21.0 TOR AL 46 ... 4 14 50 0.311 0.358 0.571
9 1 2020 22.0 TOR AL 29 ... 1 5 27 0.301 0.328 0.512
[10 rows x 21 columns]
I also tried duplicating the dataframe name to see all columns:
test = all_SS_batting_columns[all_SS_batting_columns.loc[all_SS_batting_columns['Year'] == '2021']]
You can use where instead of loc, though note that where is called with parentheses, keeps the original shape, and fills non-matching rows with NaN:
test = all_SS_batting_columns.where(all_SS_batting_columns['Year'] == '2021')
If you want to filter the DataFrame for Year 2021, there is no need for loc; plain boolean indexing works:
test = all_SS_batting_columns[all_SS_batting_columns['Year'] == '2021']
test
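One more thing worth checking if the comparison still comes back empty: the dtype of the Year column. If Year holds integers rather than strings, comparing against the string '2021' matches nothing. A minimal sketch with made-up rows:

```python
import pandas as pd

# Toy frame with an integer Year column
df = pd.DataFrame({'Year': [2020, 2021, 2021], 'G': [49, 88, 46]})

empty = df[df['Year'] == '2021']   # int column vs. string: no matches
matched = df[df['Year'] == 2021]   # compare against an int instead
print(len(empty), len(matched))
```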

iterating over loc on dataframes

I'm trying to extract data from a list of dataframes, pulling out specific row ranges. Each dataframe might not have the same data, so I have a list of possible index ranges that I would like to loop over with loc; e.g., from the code sample below I might want CIN to LAN, but on another dataframe the CIN row doesn't exist, so I will want DET to LAN or HOU to LAN.
so I was thinking putting them in a list and iterating over the list, i.e.
for df in dfs:
    ranges = [[df.loc["CIN":"LAN"]], [df.loc["DET":"LAN"]]]
    extracted_ranges = (i for i in ranges)
I'm not sure how you would iterate over a list and feed into loc, or perhaps .query().
df1 stint g ab r h X2b X3b hr rbi sb cs bb \
year team
2007 CIN 6 379 745 101 203 35 2 36 125.0 10.0 1.0 105
DET 5 301 1062 162 283 54 4 37 144.0 24.0 7.0 97
HOU 4 311 926 109 218 47 6 14 77.0 10.0 4.0 60
LAN 11 413 1021 153 293 61 3 36 154.0 7.0 5.0 114
NYN 13 622 1854 240 509 101 3 61 243.0 22.0 4.0 174
SFN 5 482 1305 198 337 67 6 40 171.0 26.0 7.0 235
TEX 2 198 729 115 200 40 4 28 115.0 21.0 4.0 73
TOR 4 459 1408 187 378 96 2 58 223.0 4.0 2.0 190
df2 so ibb hbp sh sf gidp
year team
2008 DET 176.0 3.0 10.0 4.0 8.0 28.0
HOU 212.0 3.0 9.0 16.0 6.0 17.0
LAN 141.0 8.0 9.0 3.0 8.0 29.0
NYN 310.0 24.0 23.0 18.0 15.0 48.0
SFN 188.0 51.0 8.0 16.0 6.0 41.0
TEX 140.0 4.0 5.0 2.0 8.0 16.0
TOR 265.0 16.0 12.0 4.0 16.0 38.0
Here is a solution:
import pandas as pd

# Prepare a list of ranges
ranges = [('CIN','LAN'), ('DET','LAN')]

# Declare an empty list of data frames and a list with the existing data frames
df_ranges = []
df_list = [df1, df2]

# Loop over multi-indices
for i, idx_range in enumerate(ranges):
    df = df_list[i]
    row1, row2 = idx_range
    df_ranges.append(df.loc[(slice(None), slice(row1, row2)), :])

# Print the extracted data
print('Extracted data:\n')
print(df_ranges)
Output:
[ stint g ab r h X2b X3b hr rbi sb cs bb
year team
2007 CIN 6 379 745 101 203 35 2 36 125 10 1 105
DET 5 301 1062 162 283 54 4 37 144 24 7 97
HOU 4 311 926 109 218 47 6 14 77 10 4 60
LAN 11 413 1021 153 293 61 3 36 154 7 5 114
so ibb hbp sh sf gidp
year team
2008 DET 176 3 10 4 8 28
HOU 212 3 9 16 6 17
LAN 141 8 9 3 8 29]
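Since the motivation was that some labels (like CIN) may be missing from a given frame, one hedged alternative sketch (toy frame, made-up values) is to fall back through the preferred start labels until one actually exists, rather than hard-coding one range per frame:

```python
import pandas as pd

# Toy frame with a (year, team) MultiIndex; CIN is deliberately absent
df = pd.DataFrame(
    {'g': [301, 311, 413]},
    index=pd.MultiIndex.from_product([[2007], ['DET', 'HOU', 'LAN']],
                                     names=['year', 'team']))

teams = df.index.get_level_values('team')

# Fall back through preferred start labels until one is present
for start in ['CIN', 'DET', 'HOU']:
    if start in teams:
        break

# Slice the team level from the chosen start label to LAN
subset = df.loc[(slice(None), slice(start, 'LAN')), :]
print(subset)
```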
