I am having trouble scraping a certain table from basketball-reference - python

On basketball-reference, there's a section of stats called Team Misc. Here's the URL to an example of it:
https://www.basketball-reference.com/teams/MIL/2023.html#all_per_minute-playoffs_per_minute
Anyway, I am using this code:
import requests
from bs4 import BeautifulSoup

link = "https://www.basketball-reference.com/teams/MIL/2023.html#all_per_minute-playoffs_per_minute"
page = requests.get(link)
soup = BeautifulSoup(page.content, 'html.parser')
table_text = soup.find(id="all_team_misc")
The last part of table_text looks like the following:
<tbody><tr ><th scope="row" class="left " data-stat="player" >Team</th><td class="center " data-stat="wins" >41</td><td class="center " data-stat="losses" >17</td><td class="center " data-stat="wins_pyth" >35</td><td class="center " data-stat="losses_pyth" >23</td><td class="center " data-stat="mov" >3.22</td><td class="center " data-stat="sos" >-0.11</td><td class="center " data-stat="srs" >3.12</td><td class="center " data-stat="off_rtg" >113.8</td><td class="center " data-stat="def_rtg" >110.6</td><td class="center " data-stat="pace" >100.0</td><td class="center " data-stat="fta_per_fga_pct" >.254</td><td class="center " data-stat="fg3a_per_fga_pct" >.444</td><td class="center " data-stat="efg_pct" >.543</td><td class="center " data-stat="tov_pct" >13.0</td><td class="center " data-stat="orb_pct" >25.8</td><td class="center " data-stat="ft_rate" >.187</td><td class="center " data-stat="opp_efg_pct" >.516</td><td class="center " data-stat="opp_tov_pct" >10.6</td><td class="center " data-stat="drb_pct" >78.0</td><td class="center " data-stat="opp_ft_rate" >.179</td><td class="center " data-stat="arena_name" >Fiserv Forum</td><td class="center " data-stat="attendance" >506,491</td></tr>
<tr ><th scope="row" class="left " data-stat="player" >Lg Rank</th><td class="center " data-stat="wins" >2</td><td class="center " data-stat="losses" >29</td><td class="center " data-stat="wins_pyth" >6</td><td class="center " data-stat="losses_pyth" >6</td><td class="center " data-stat="mov" >6</td><td class="center " data-stat="sos" >21</td><td class="center " data-stat="srs" >6</td><td class="center " data-stat="off_rtg" >20</td><td class="center " data-stat="def_rtg" >3</td><td class="center " data-stat="pace" >11</td><td class="center " data-stat="fta_per_fga_pct" >23</td><td class="center " data-stat="fg3a_per_fga_pct" >4</td><td class="center " data-stat="efg_pct" >14</td><td class="center " data-stat="tov_pct" >18</td><td class="center " data-stat="orb_pct" >8</td><td class="center " data-stat="ft_rate" >28</td><td class="center " data-stat="opp_efg_pct" >1</td><td class="center " data-stat="opp_tov_pct" >30</td><td class="center " data-stat="drb_pct" >4</td><td class="center " data-stat="opp_ft_rate" >2</td><td class="center iz" data-stat="arena_name" ></td><td class="center " data-stat="attendance" >20</td></tr>
How can I extract just numerical data from here?

Use pandas to extract the data without pain:
# pip install pandas lxml  (read_html needs an HTML parser)
import pandas as pd

dfs = pd.read_html('https://www.basketball-reference.com/teams/MIL/2023.html')
stats = dfs[2].dropna()  # the third table on the page; dropna() drops the totals row, which contains NaNs
Output:
>>> stats
Rk Player Age G GS MP FG FGA FG% 3P 3PA 3P% ... FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1.0 Brook Lopez 34.0 57 57.0 1729 315 621 0.507 104 278 0.374 ... 94 122 0.770 111 260 371 70 27 139 77 145 828
1 2.0 Giannis Antetokounmpo 28.0 47 47.0 1554 532 988 0.538 38 141 0.270 ... 394 610 0.646 109 463 572 254 36 37 187 159 1496
2 3.0 Jrue Holiday 32.0 47 45.0 1552 345 747 0.462 112 297 0.377 ... 112 129 0.868 60 186 246 336 61 17 144 90 914
3 4.0 Grayson Allen 27.0 54 52.0 1491 183 412 0.444 106 262 0.405 ... 94 104 0.904 46 138 184 130 44 10 59 86 566
4 5.0 Jevon Carter 27.0 58 31.0 1306 162 375 0.432 91 218 0.417 ... 22 27 0.815 25 124 149 149 57 25 63 118 437
5 6.0 Bobby Portis 27.0 47 14.0 1258 281 563 0.499 58 170 0.341 ... 58 70 0.829 120 353 473 85 22 11 59 81 678
6 7.0 Pat Connaughton 30.0 41 25.0 1043 124 314 0.395 85 240 0.354 ... 17 28 0.607 35 173 208 53 30 5 25 44 350
7 8.0 George Hill 36.0 35 0.0 668 59 132 0.447 23 74 0.311 ... 34 46 0.739 13 54 67 89 19 3 27 41 175
8 9.0 Wesley Matthews 36.0 39 0.0 608 39 118 0.331 31 95 0.326 ... 17 20 0.850 30 57 87 20 17 10 15 55 126
9 10.0 Jordan Nwora 24.0 38 3.0 597 76 197 0.386 40 102 0.392 ... 37 43 0.860 29 88 117 38 12 7 34 34 229
10 11.0 MarJon Beauchamp 22.0 38 9.0 563 80 201 0.398 38 118 0.322 ... 21 29 0.724 34 56 90 26 18 4 36 59 219
11 12.0 Joe Ingles 35.0 25 0.0 563 53 140 0.379 38 111 0.342 ... 11 13 0.846 8 64 72 81 14 3 29 40 155
12 13.0 Khris Middleton 31.0 17 7.0 364 83 197 0.421 24 83 0.289 ... 41 44 0.932 10 55 65 67 11 3 34 36 231
13 14.0 A.J. Green 23.0 26 0.0 244 42 92 0.457 33 77 0.429 ... 4 4 1.000 4 24 28 12 5 0 7 24 121
14 15.0 Sandro Mamukelashvili 23.0 24 0.0 217 19 58 0.328 7 32 0.219 ... 12 18 0.667 18 38 56 16 4 5 9 19 57
15 16.0 Serge Ibaka 33.0 16 0.0 185 26 54 0.481 6 18 0.333 ... 8 13 0.615 15 29 44 4 2 7 11 23 66
16 17.0 Thanasis Antetokounmpo 30.0 25 0.0 101 4 19 0.211 0 4 0.000 ... 4 8 0.500 8 17 25 4 2 3 7 12 12
[17 rows x 28 columns]
You can also export the result as a list of dicts with d = stats.to_dict('records').
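If you stay with BeautifulSoup instead, the cell values come back as strings such as '.254', '506,491', or 'Fiserv Forum'. A minimal sketch (the helper name is my own, not from any library) of keeping just the numeric ones:

```python
def to_number(text):
    """Return the cell's value as a float if it is numeric, else None.

    Handles thousands separators like '506,491' and bare decimals
    like '.254'; non-numeric cells such as 'Fiserv Forum' yield None.
    """
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None

# Sample cell strings as they would come out of td.get_text():
cells = [".254", "506,491", "Fiserv Forum", "3.22"]
numbers = [n for n in (to_number(c) for c in cells) if n is not None]
print(numbers)  # [0.254, 506491.0, 3.22]
```

You would apply it to each td's .get_text() pulled from the parsed table, assuming the table was actually found.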

Related

Python Pandas Conditional changes to column filling/filtering correctly

I am trying to change the data of a column based on a condition. However, it doesn't seem to apply the condition correctly and fills every value in the column with the change when it shouldn't. Here is the code:
uh['Age']= uh['Age']
uh['AgeStatus'] = uh['Age']
uh['AgeStatus'] = uh.loc[uh['AgeStatus'] > 25.0, 'AgeStatus'] = 'Veteran'
and it raises this TypeError:
TypeError: '>' not supported between instances of 'str' and 'float'
and the dataframe:
Year Age Tm Lg G PA ... BB SO BA OBP SLG AgeStatus
5 2021 28.0 CHW AL 88 391 ... 18 87 0.299 0.332 0.437 Veteran
2 2021 23.0 TOR AL 101 443 ... 29 90 0.296 0.348 0.487 Veteran
8 2021 28.0 BOS AL 97 409 ... 37 75 0.309 0.374 0.522 Veteran
6 2021 26.0 HOU AL 96 416 ... 53 80 0.272 0.368 0.476 Veteran
5 2021 27.0 ATL NL 105 431 ... 30 116 0.249 0.305 0.475 Veteran
2 2021 22.0 SDP NL 87 362 ... 43 102 0.292 0.373 0.651 Veteran
6 2021 28.0 WSN NL 96 420 ... 26 77 0.322 0.369 0.521 Veteran
[7 rows x 21 columns]
Really confused on what's causing this.
The problem is the chained assignment on your last line: uh['AgeStatus'] = uh.loc[...] = 'Veteran' assigns the scalar 'Veteran' to both targets, so the entire column becomes the string 'Veteran'; on the next run, the > comparison then fails against those strings, which is your TypeError. Assign through .loc alone:
uh.loc[uh['Age'] > 25.0, 'AgeStatus'] = 'Veteran'
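A minimal reproduction on toy data (my own made-up frame, not the asker's) showing the default-then-overwrite pattern:

```python
import pandas as pd

# Toy frame standing in for the asker's data.
uh = pd.DataFrame({"Age": [28.0, 23.0, 26.0]})
uh["AgeStatus"] = "Rookie"  # default label for every row
# Overwrite the label only where the condition holds.
uh.loc[uh["Age"] > 25.0, "AgeStatus"] = "Veteran"
print(uh["AgeStatus"].tolist())  # ['Veteran', 'Rookie', 'Veteran']
```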

Merge dataframes and merge also columns into a single column

I have a dataframe df1
index A B C D E
0 0 92 84
1 1 98 49
2 2 49 68
3 3 0 58
4 4 91 95
5 5 47 56 52 25 58
6 6 86 71 34 39 40
7 7 80 78 0 86 12
8 8 0 8 30 88 42
9 9 69 83 7 65 60
10 10 93 39 10 90 45
and also this data frame df2
index C D E F
0 0 27 95 51 45
1 1 99 33 92 67
2 2 68 37 29 65
3 3 99 25 48 40
4 4 33 74 55 66
5 13 65 76 19 62
I wish to get to the following outcome when merging df1 and df2
index A B C D E F
0 0 92 84 27 95 51 45
1 1 98 49 99 33 92 67
2 2 49 68 68 37 29 65
3 3 0 58 99 25 48 40
4 4 91 95 33 74 55 66
5 5 47 56 52 25 58
6 6 86 71 34 39 40
7 7 80 78 0 86 12
8 8 0 8 30 88 42
9 9 69 83 7 65 60
10 10 93 39 10 90 45
11 13 65 76 19 62
However, I keep getting this when using pd.merge():
import numpy as np

df_total = df1.merge(df2, how="outer", on="index", suffixes=(None, "_"))
df_total.replace(to_replace=np.nan, value=" ", inplace=True)
df_total
index A B C D E C_ D_ E_ F
0 0 92 84 27 95 51 45
1 1 98 49 99 33 92 67
2 2 49 68 68 37 29 65
3 3 0 58 99 25 48 40
4 4 91 95 33 74 55 66
5 5 47 56 52 25 58
6 6 86 71 34 39 40
7 7 80 78 0 86 12
8 8 0 8 30 88 42
9 9 69 83 7 65 60
10 10 93 39 10 90 45
11 13 65 76 19 62
Is there a way to get the desirable outcome using pd.merge or similar function?
Thanks
You can use .combine_first():
import numpy as np
import pandas as pd

# convert the empty cells ("") to NaNs
df1 = df1.replace("", np.nan)
df2 = df2.replace("", np.nan)
# set indices and combine the dataframes
df1 = df1.set_index("index")
print(df1.combine_first(df2.set_index("index")).reset_index().fillna(""))
Prints:
index A B C D E F
0 0 92.0 84.0 27.0 95.0 51.0 45.0
1 1 98.0 49.0 99.0 33.0 92.0 67.0
2 2 49.0 68.0 68.0 37.0 29.0 65.0
3 3 0.0 58.0 99.0 25.0 48.0 40.0
4 4 91.0 95.0 33.0 74.0 55.0 66.0
5 5 47.0 56.0 52.0 25.0 58.0
6 6 86.0 71.0 34.0 39.0 40.0
7 7 80.0 78.0 0.0 86.0 12.0
8 8 0.0 8.0 30.0 88.0 42.0
9 9 69.0 83.0 7.0 65.0 60.0
10 10 93.0 39.0 10.0 90.0 45.0
11 13 65.0 76.0 19.0 62.0
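The alignment behavior is easy to see on a tiny toy example (columns and values of my own choosing): combine_first keeps the caller's non-NaN values and fills its gaps from the other frame.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [92, 98], "C": [np.nan, 99.0]}, index=[0, 1])
df2 = pd.DataFrame({"C": [27.0, 50.0], "F": [45, 67]}, index=[0, 1])

# df1's NaN in C at index 0 is filled from df2; df1's own 99.0 wins at index 1.
out = df1.combine_first(df2)
print(out)
```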

How to add additional text to matplotlib annotations

I have used seaborn's Titanic dataset as a proxy for my very large dataset; the chart and data below are based on it.
The following code runs without any errors:
import seaborn as sns
import pandas as pd
import numpy as np
sns.set_theme(style="darkgrid")
# Load the example Titanic dataset
df = sns.load_dataset("titanic")
# split fare into decile groups and order them
df['fare_grp'] = pd.qcut(df['fare'], q=10,labels=None, retbins=False, precision=0).astype(str)
df.groupby(['fare_grp'],dropna=False).size()
df['fare_grp_num'] = pd.qcut(df['fare'], q=10,labels=False, retbins=False, precision=0).astype(str)
df.groupby(['fare_grp_num'],dropna=False).size()
df['fare_ord_grp'] = df['fare_grp_num'] + ' ' +df['fare_grp']
df['fare_ord_grp']
# set variables
target = 'survived'
ydim = 'fare_ord_grp'
xdim = 'embark_town'
#del [result]
non_events = pd.DataFrame(df[df[target]==0].groupby([ydim,xdim],as_index=False, dropna=False)[target].count()).rename(columns={target: 'non_events'})
non_events[xdim]=non_events[xdim].replace(np.nan, 'Missing', regex=True)
non_events[ydim]=non_events[ydim].replace(np.nan, 'Missing', regex=True)
non_events_total = pd.DataFrame(df[df[target]==0].groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'non_events_total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
events = pd.DataFrame(df[df[target]==1].groupby([ydim,xdim],as_index=False, dropna=False)[target].count()).rename(columns={target: 'events'})
events[xdim]=events[xdim].replace(np.nan, 'Missing', regex=True)
events[ydim]=events[ydim].replace(np.nan, 'Missing', regex=True)
events_total = pd.DataFrame(df[df[target]==1].groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'events_total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
grand_total = pd.DataFrame(df.groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
grand_total=grand_total.merge(non_events_total, how='left', on=xdim).merge(events_total, how='left', on=xdim)
result = pd.merge(non_events, events, how="outer",on=[ydim,xdim])
result['total'] = result['non_events'].fillna(0) + result['events'].fillna(0)
result[xdim] = result[xdim].replace(np.nan, 'Missing', regex=True)
result = pd.merge(result, grand_total, how="left",on=[xdim])
result['survival rate %'] = round(result['events']/result['total']*100,2)
result['% event dist by xdim'] = round(result['events']/result['events_total_by_xdim']*100,2)
result['% non-event dist by xdim'] = round(result['non_events']/result['non_events_total_by_xdim']*100,2)
result['% total dist by xdim'] = round(result['total']/result['total_by_xdim']*100,2)
display(result)
value_name1 = "% dist by " + str(xdim)
dfl = pd.melt(result, id_vars=[ydim, xdim],value_vars =['% total dist by xdim'], var_name = 'Type',value_name=value_name1).drop(columns='Type')
dfl2 = dfl.pivot(index=ydim, columns=xdim, values=value_name1)
print(dfl2)
title1 = "% dist by " + str(xdim)
ax=dfl2.T.plot(kind='bar', stacked=True, rot=1, figsize=(8, 8), title=title1)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
ax.legend(bbox_to_anchor=(1.0, 1.0),title = 'Fare Range')
ax.set_ylabel('% Dist')
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x + width/2, y + height/2, '{:.0f}%'.format(height), horizontalalignment='center', verticalalignment='center')
It produces the following stacked percent bar chart, which shows the % of total distribution by embark town.
I also want to show the survival rate along with the %distribution in each block. For example, for Queenstown, fare range 1 (7.6, 7.9], the % total distribution is 56%. I want to display the survival rate 37.21% as (56%, 37.21%). I am not able to figure it out. Kindly offer any suggestions. Thanks.
Here is the output summary table for reference
(columns: fare_ord_grp, embark_town, non_events, events, total, total_by_xdim, non_events_total_by_xdim, events_total_by_xdim, survival rate %, % event dist by xdim, % non-event dist by xdim, % total dist by xdim)
0   0 (-0.1,7.6]    Cherbourg    22   7    29  168  75   93   24.14  7.53   29.33  17.26
1   0 (-0.1,7.6]    Queenstown   4    NaN  4   77   47   30   NaN    NaN    8.51   5.19
2   0 (-0.1,7.6]    Southampton  53   6    59  644  427  217  10.17  2.76   12.41  9.16
3   1 (7.6,7.9]     Queenstown   27   16   43  77   47   30   37.21  53.33  57.45  55.84
4   1 (7.6,7.9]     Southampton  34   10   44  644  427  217  22.73  4.61   7.96   6.83
5   2 (7.9,8]       Cherbourg    4    1    5   168  75   93   20     1.08   5.33   2.98
6   2 (7.9,8]       Southampton  83   13   96  644  427  217  13.54  5.99   19.44  14.91
7   3 (8.0,10.5]    Cherbourg    2    1    3   168  75   93   33.33  1.08   2.67   1.79
8   3 (8.0,10.5]    Queenstown   2    NaN  2   77   47   30   NaN    NaN    4.26   2.6
9   3 (8.0,10.5]    Southampton  56   17   73  644  427  217  23.29  7.83   13.11  11.34
10  4 (10.5,14.5]   Cherbourg    7    8    15  168  75   93   53.33  8.6    9.33   8.93
11  4 (10.5,14.5]   Queenstown   1    2    3   77   47   30   66.67  6.67   2.13   3.9
12  4 (10.5,14.5]   Southampton  40   26   66  644  427  217  39.39  11.98  9.37   10.25
13  5 (14.5,21.7]   Cherbourg    9    10   19  168  75   93   52.63  10.75  12     11.31
14  5 (14.5,21.7]   Queenstown   5    3    8   77   47   30   37.5   10     10.64  10.39
15  5 (14.5,21.7]   Southampton  37   24   61  644  427  217  39.34  11.06  8.67   9.47
16  6 (21.7,27]     Cherbourg    1    4    5   168  75   93   80     4.3    1.33   2.98
17  6 (21.7,27]     Queenstown   2    3    5   77   47   30   60     10     4.26   6.49
18  6 (21.7,27]     Southampton  40   39   79  644  427  217  49.37  17.97  9.37   12.27
19  7 (27.0,39.7]   Cherbourg    14   10   24  168  75   93   41.67  10.75  18.67  14.29
20  7 (27.0,39.7]   Queenstown   5    NaN  5   77   47   30   NaN    NaN    10.64  6.49
21  7 (27.0,39.7]   Southampton  38   24   62  644  427  217  38.71  11.06  8.9    9.63
22  8 (39.7,78]     Cherbourg    5    19   24  168  75   93   79.17  20.43  6.67   14.29
23  8 (39.7,78]     Southampton  37   28   65  644  427  217  43.08  12.9   8.67   10.09
24  9 (78.0,512.3]  Cherbourg    11   33   44  168  75   93   75     35.48  14.67  26.19
25  9 (78.0,512.3]  Queenstown   1    1    2   77   47   30   50     3.33   2.13   2.6
26  9 (78.0,512.3]  Southampton  9    30   39  644  427  217  76.92  13.82  2.11   6.06
27  2 (7.9,8]       Queenstown   NaN  5    5   77   47   30   100    16.67  NaN    6.49
28  9 (78.0,512.3]  Missing      NaN  2    2   2    NaN  2    100    100    NaN    100
dfl2.T is being plotted, but 'survival rate %' lives in result, so the patch indices from the dfl2.T plot do not correspond to 'survival rate %' directly.
Because the values in result['% total dist by xdim'] are not unique, a dict of matched key-value pairs can't be used to look the rates up.
Instead, create a corresponding pivoted DataFrame for 'survival rate %' and then flatten it: its values will be in the same order as the '% total dist by xdim' values from dfl2.T, so they can be indexed positionally.
With respect to dfl2.T, the plot API draws in column order, which means .flatten(order='F') (column-major order) must be used to flatten the array so the indices line up.
# create a corresponding pivoted dataframe for survival rate %
dfl3 = pd.melt(result, id_vars=[ydim, xdim], value_vars=['survival rate %'], var_name='Type', value_name=value_name1).drop(columns='Type')
dfl4 = dfl3.pivot(index=ydim, columns=xdim, values=value_name1)
# flatten dfl4.T in column order
dfl4_flattened = dfl4.T.to_numpy().flatten(order='F')
for i, p in enumerate(ax.patches):
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    # only print values when height is not 0
    if height != 0:
        # create the text string
        text = f'{height:.0f}%, {dfl4_flattened[i]:.0f}%'
        # annotate the bar segments
        ax.text(x + width/2, y + height/2, text, horizontalalignment='center', verticalalignment='center')
Notes
Here we can see dfl2.T and dfl4.T
# dfl2.T
fare_ord_grp 0 (-0.1, 7.6] 1 (7.6, 7.9] 2 (7.9, 8.0] 3 (8.0, 10.5] 4 (10.5, 14.5] 5 (14.5, 21.7] 6 (21.7, 27.0] 7 (27.0, 39.7] 8 (39.7, 78.0] 9 (78.0, 512.3]
embark_town
Cherbourg 17.26 NaN 2.98 1.79 8.93 11.31 2.98 14.29 14.29 26.19
Missing NaN NaN NaN NaN NaN NaN NaN NaN NaN 100.00
Queenstown 5.19 55.84 6.49 2.60 3.90 10.39 6.49 6.49 NaN 2.60
Southampton 9.16 6.83 14.91 11.34 10.25 9.47 12.27 9.63 10.09 6.06
# dfl4.T
fare_ord_grp 0 (-0.1, 7.6] 1 (7.6, 7.9] 2 (7.9, 8.0] 3 (8.0, 10.5] 4 (10.5, 14.5] 5 (14.5, 21.7] 6 (21.7, 27.0] 7 (27.0, 39.7] 8 (39.7, 78.0] 9 (78.0, 512.3]
embark_town
Cherbourg 24.14 NaN 20.00 33.33 53.33 52.63 80.00 41.67 79.17 75.00
Missing NaN NaN NaN NaN NaN NaN NaN NaN NaN 100.00
Queenstown NaN 37.21 100.00 NaN 66.67 37.50 60.00 NaN NaN 50.00
Southampton 10.17 22.73 13.54 23.29 39.39 39.34 49.37 38.71 43.08 76.92
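The effect of order='F' can be checked in isolation (toy array, not the Titanic data): it walks the array column by column, which matches the order in which the stacked-bar patches are laid down.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
print(a.flatten(order='C'))  # row-major (default): [1 2 3 4 5 6]
print(a.flatten(order='F'))  # column-major:        [1 4 2 5 3 6]
```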

for loop has saved a list as a single element

I have the following code to extract data from a table, but because of the second for loop it saves all the data of a column as a single element of the list.
Is there a way to separate each element of the list below? Link for stat_table:
for table in stat_table:
    for cell in table.find_all('table'):
        stmat.append(cell.text)
        print(cell.text)
        count = count + 1
print(count)
print(stmat)
print(stmat[0])
This is the output, where all the data from the second loop is saved as a single element:
[' Sep 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 ', ' Max Avg Min 82 73.6 70 82 72.9 0 81 74.2 70 84 76.4
70 86 75.3 68 82 74.6 68 82 74.6 68 82 74.1 68 81 73.9 68 82 75.4 68 84
75.4 68 81 73.9 68 82 75.0 68 79 72.8 68 81 73.6 68 81 73.5 68 82 74.2 68
82 74.9 68 82 73.6 68 79 71.9 66 82 72.7 66 81 71.3 63 82 74.1 63 82 75.0
64 86 76.4 68 84 75.7 68 82 75.4 68 84 75.5 66 84 74.0 66 86 76.7 66 ',
' Max Avg Min 68 66.6 66 68 64.8 0 66 65.2 64 66 65.9 64 68 65.8 64 66 65.3
64 66 64.7 64 68 66.3 64 70 67.1 64 68 65.9 63 70 66.4 64 68 67.2 66 68
66.4 64 68 66.0 64 70 67.4 66 70 67.0 66 68 65.5 64 66 65.4 64 70 67.1 64
70 67.1 66 68 65.6 64 66 61.6 59 66 60.3 55 64 60.0 50 66 62.7 59 68 64.8
63 68 63.8 61 66 63.9 61 68 64.3 63 68 64.8 61 ', ' Max Avg Min 94 80.1 58
88 75.1 0 88 75.3 58 88 71.4 51 94 74.0 48 94 74.8 54 94 73.4 54 94 78.4
54 100 80.7 58 100 73.9 51 100 76.7 51 100 81.0 61 94 76.0 58 94 80.3 65
94 82.5 61 94 81.4 61 94 76.8 54 94 74.4 54 100 82.0 58 100 86.1 65 100
80.4 54 100 73.1 48 94 64.6 39 100 62.2 32 88 64.3 40 94 70.4 48 94 69.2
48 94 68.8 45 88 73.4 48 94 68.9 43 ', ' Max Avg Min 23 15.9 10 22 15.7 10
26 15.2 8 20 13.6 8 21 13.6 8 21 13.2 8 22 14.8 9 20 12.2 7 15 10.4 3
14 8.8 0 16 10.2 5 14 8.7 1 16 10.9 6 17 12.1 7 17 11.1 6 16 11.2 5 18
11.2 5 17 12.4 8 15 10.1 5 15 9.2 3 17 11.6 7 15 9.3 3 12 6.1 0 12 5.2
0 10 6.1 0 10 5.8 0 9 4.8 0 10 5.2 0 10 4.5 0 14 4.7 0 ', ' Max Avg Min
26.8 26.7 26.6 26.8 26.1 0.0 26.8 26.8 26.7 26.8 26.8 26.7 26.8 26.8 26.7 26.9
26.8 26.7 26.8 26.8 26.7 26.8 26.8 26.7 26.9 26.8 26.8 26.9 26.8 26.7 26.8 26.8
26.7 26.8 26.7 26.7 26.8 26.7 26.7 26.8 26.7 26.7 26.8 26.7 26.7 26.8 26.8 26.7
26.9 26.8 26.8 26.9 26.8 26.8 26.9 26.9 26.8 26.9 26.9 26.8 26.9 26.8 26.8 26.9
26.8 26.8 26.9 26.8 26.8 26.9 26.8 26.8 26.9 26.9 26.8 26.9 26.9 26.8 26.9 26.8
26.8 26.9 26.8 26.8 26.9 26.8 26.8 26.9 26.8 26.8 ', ' Total 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ']
This is the output of stmat[0], whereas I want stmat[0] to be 'Sep':
Sep 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Given the outputs you show, I'm guessing that for the first cell

cell.text == " Sep 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 "

i.e. each table's text comes back as one long whitespace-separated string.
So if you actually want individual values, you should probably do something like:

for table in stat_table:
    for cell in table.find_all('table'):
        cell_values = cell.text.split()  # no argument: split on any whitespace, drop empty strings
        stmat.extend(cell_values)
        count = count + len(cell_values)
print(count)
print(stmat)
print(stmat[0])
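Why splitting with no argument (rather than split(" ")) is the safer choice here: the scraped text contains runs of spaces, and the two behave differently, as a quick toy check shows.

```python
text = " Sep  1  2 "
# split(" ") splits on each single space, keeping empty strings for the runs:
print(text.split(" "))  # ['', 'Sep', '', '1', '', '2', '']
# split() splits on any whitespace run and drops the empties:
print(text.split())     # ['Sep', '1', '2']
```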

Grouping pandas dataframe based on common key

I have a file which I have parsed as a pandas DataFrame, and I want to group the values of column 2 by the corresponding element in column 3.
0 1 2 3 4
0 00B2 0 -67 39 1.13
1 00B2 85 -72 39 1.13
2 00B2 1 -67 86 1.13
3 00B2 2 -67 87 1.13
4 00B2 3 -67 88 1.13
5 00B2 91 -67 39 1.13
6 00B2 4 -67 246 1.13
7 00B2 5 -67 78 1.13
8 00B2 6 -67 10 1.13
9 00B2 7 -67 153 1.13
10 00B2 1 -67 38 1.13
11 00B2 8 -67 225 1.13
12 00B2 9 -67 135 1.13
13 00B2 10 -67 23 1.13
14 00B2 4 -67 38 1.13
15 00B2 11 -67 132 1.13
16 00B2 12 -71 214 1.13
17 00B2 13 -71 71 1.13
18 00B2 14 -71 215 1.13
19 00B2 8 -71 38 1.13
20 00B2 15 -71 249 1.13
21 00B2 16 -71 174 1.13
22 00B2 17 -71 196 1.13
23 00B2 18 -71 38 1.13
24 00B2 19 -71 252 1.13
25 00B2 20 -71 196 1.13
26 00B2 21 -71 39 1.13
27 00B2 22 -71 39 1.13
28 00B2 23 -71 252 1.13
29 00B2 24 -71 39 1.13
.. ... .. ... ... ...
I want the data that looks something like this
DF1:
-67 37
-72 37
-71 37
... ...
DF2:
-68 38
-67 38
-70 38
... ...
DF3:
-64 39
-63 39
-62 39
... ...
I have tried the following:
e1 = pd.DataFrame(e1)
print (e1)
group = e1[3][2] == "group"
print (e1[group])
This gets nowhere close to what I want, so how can I group the data according to my requirement?
I think you need to create a dictionary of Series by converting the groupby object to tuples and then a dict:
d = dict(tuple(df.groupby(3)[2]))
print (d[39])
0 -67
1 -72
5 -67
26 -71
27 -71
29 -71
Name: 2, dtype: int64
For DataFrame:
d1 = dict(tuple(df.groupby(3)))
print (d1[39])
0 1 2 3 4
0 00B2 0 -67 39 1.13
1 00B2 85 -72 39 1.13
5 00B2 91 -67 39 1.13
26 00B2 21 -71 39 1.13
27 00B2 22 -71 39 1.13
29 00B2 24 -71 39 1.13
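A self-contained toy version of the same trick (small made-up frame, not the asker's file): each key of the dict is a distinct value from column 3, and each value is the matching slice of column 2.

```python
import pandas as pd

# Integer column labels, mirroring the asker's header-less file.
df = pd.DataFrame({2: [-67, -72, -67, -71], 3: [39, 39, 86, 39]})

# One Series of column-2 values per distinct column-3 value.
d = dict(tuple(df.groupby(3)[2]))
print(d[39].tolist())  # [-67, -72, -71]
print(sorted(d))       # [39, 86]
```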
