I am new to Python and was trying to run a basic web scraper. My code looks like this:
import requests
import pandas as pd
x = requests.get('https://www.baseball-reference.com/players/p/penaje02.shtml')
dfs = pd.read_html(x.content)
print(dfs)
df = pd.DataFrame(dfs)
When printing dfs, it looks like this. I only want the second table.
[ Year Age Tm Lg G PA AB \
0 2018 20 HOU-min A- 36 156 136
1 2019 21 HOU-min A,A+ 109 473 409
2 2021 23 HOU-min AAA,Rk 37 160 145
3 2022 24 HOU AL 136 558 521
4 1 Yr 1 Yr 1 Yr 1 Yr 136 558 521
5 162 Game Avg. 162 Game Avg. 162 Game Avg. 162 Game Avg. 162 665 621
R H 2B ... OPS OPS+ TB GDP HBP SH SF IBB Pos \
0 22 34 5 ... 0.649 NaN 42 0 1 0 1 0 NaN
1 72 124 21 ... 0.825 NaN 180 4 11 0 6 0 NaN
2 25 43 5 ... 0.942 NaN 84 0 7 0 0 0 NaN
3 72 132 20 ... 0.715 101.0 222 5 6 1 5 0 *6/H
4 72 132 20 ... 0.715 101.0 222 5 6 1 5 0 NaN
5 86 157 24 ... 0.715 101.0 264 6 7 1 6 0 NaN
Awards
0 TRC · NYPL
1 DAV,FAY · MIDW,CARL
2 SKT,AST · AAAW,FCL
3 GG
4 NaN
5 NaN
[6 rows x 30 columns]]
However, I end up with the error "Must pass 2-d input. shape=(1, 6, 30)" after my last line. I have tried using df = dfs[1], but got the error "list index out of range". Is there any way I can turn dfs from a list into a dataframe?
What do you mean you only want the second table? There's only one table; it's 6 rows and 30 columns. The backslashes show up because whatever you're printing to isn't wide enough to display the dataframe without line wrapping; printed in a wider terminal, it is a single table.
The pd.read_html() function returns a List[DataFrame] so you first need to grab your dataframe from the list, and then you can subset it to get the columns you care about:
df = dfs[0]
columns = ['R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS', 'OPS+', 'TB', 'GDP', 'HBP', 'SH', 'SF', 'IBB', 'Pos']
print(df[columns])
Output:
R H 2B 3B HR RBI SB CS BB SO BA OBP SLG OPS OPS+ TB GDP HBP SH SF IBB Pos
0 22 34 5 0 1 10 3 0 18 19 0.250 0.340 0.309 0.649 NaN 42 0 1 0 1 0 NaN
1 72 124 21 7 7 54 20 10 47 90 0.303 0.385 0.440 0.825 NaN 180 4 11 0 6 0 NaN
2 25 43 5 3 10 21 6 1 8 41 0.297 0.363 0.579 0.942 NaN 84 0 7 0 0 0 NaN
3 72 132 20 2 22 63 11 2 22 135 0.253 0.289 0.426 0.715 101.0 222 5 6 1 5 0 *6/H
4 72 132 20 2 22 63 11 2 22 135 0.253 0.289 0.426 0.715 101.0 222 5 6 1 5 0 NaN
5 86 157 24 2 26 75 13 2 26 161 0.253 0.289 0.426 0.715 101.0 264 6 7 1 6 0 NaN
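If you want to guard against the "list index out of range" error in general, check how many tables were actually parsed before indexing into the list. A small sketch reusing dfs from above:
print(len(dfs))  # number of tables read_html() parsed from the page
df = dfs[0]      # here only one table was parsed, so dfs[1] raises IndexError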
Related
I have a dataframe 'df'. Using the validation data validData, I want to compute the response rate (Florence = 1/Yes) for rfm_aboveavg, the RFM combinations whose response rates are above the overall response rate. The response rate considers 0/No and 1/Yes, so it is given by rfm_crosstab[1] / rfm_crosstab['All'].
Using the results from the validation data, I want to display only the rows whose RFM combination also appears in the training-data output. How do I do this?
Data: 'df'
Seq# ID# Gender M R F FirstPurch ChildBks YouthBks CookBks ... ItalCook ItalAtlas ItalArt Florence Related Purchase Mcode Rcode Fcode Yes_Florence No_Florence
0 1 25 1 297 14 2 22 0 1 1 ... 0 0 0 0 0 5 4 2 0 1
1 2 29 0 128 8 2 10 0 0 0 ... 0 0 0 0 0 4 3 2 0 1
2 3 46 1 138 22 7 56 2 1 2 ... 1 0 0 0 2 4 4 3 0 1
3 4 47 1 228 2 1 2 0 0 0 ... 0 0 0 0 0 5 1 1 0 1
4 5 51 1 257 10 1 10 0 0 0 ... 0 0 0 0 0 5 3 1 0 1
My code: Crosstab for training data trainData
import pandas as pd
from sklearn.model_selection import train_test_split
trainData, validData = train_test_split(df, test_size=0.4, random_state=1)
# Response rate for training data as a whole
responseRate = (sum(trainData.Florence == 1) / sum(trainData.Florence == 0)) * 100
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
trainData['RFM'] = trainData['Mcode'].astype(str) + trainData['Rcode'].astype(str) + trainData['Fcode'].astype(str)
rfm_crosstab = pd.crosstab(index = [trainData['RFM']], columns = trainData['Florence'], margins = True)
rfm_crosstab['Percentage of 1/Yes'] = 100 * (rfm_crosstab[1] / rfm_crosstab['All'])
# RFM combinations response rates above the overall response
rfm_aboveavg = rfm_crosstab['Percentage of 1/Yes'] > responseRate
rfm_crosstab[rfm_aboveavg]
Output: Training data
Florence 0 1 All Percentage of 1/Yes
RFM
121 3 2 5 40.000000
131 9 1 10 10.000000
212 1 2 3 66.666667
221 6 3 9 33.333333
222 6 1 7 14.285714
313 2 1 3 33.333333
321 17 3 20 15.000000
322 20 4 24 16.666667
323 2 1 3 33.333333
341 61 10 71 14.084507
343 17 2 19 10.526316
411 12 3 15 20.000000
422 26 5 31 16.129032
423 32 8 40 20.000000
441 96 12 108 11.111111
511 19 4 23 17.391304
513 44 8 52 15.384615
521 24 5 29 17.241379
523 74 16 90 17.777778
533 177 28 205 13.658537
My code: Crosstab for validation data validData
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
validData['RFM'] = validData['Mcode'].astype(str) + validData['Rcode'].astype(str) + validData['Fcode'].astype(str)
rfm_crosstab1 = pd.crosstab(index = [validData['RFM']], columns = validData['Florence'], margins = True)
rfm_crosstab1['Percentage of 1/Yes'] = 100 * (rfm_crosstab1[1] / rfm_crosstab1['All'])
rfm_crosstab1
Output: Validation data
Florence 0 1 All Percentage of 1/Yes
RFM
131 3 1 4 25.000000
141 8 0 8 0.000000
211 2 1 3 33.333333
212 2 0 2 0.000000
213 0 1 1 100.000000
221 5 0 5 0.000000
222 2 0 2 0.000000
231 21 1 22 4.545455
232 3 0 3 0.000000
233 1 0 1 0.000000
241 11 1 12 8.333333
242 8 0 8 0.000000
243 2 0 2 0.000000
311 7 0 7 0.000000
312 8 0 8 0.000000
313 1 0 1 0.000000
321 12 0 12 0.000000
322 13 0 13 0.000000
323 4 1 5 20.000000
331 19 1 20 5.000000
332 25 2 27 7.407407
333 11 1 12 8.333333
341 36 2 38 5.263158
342 30 2 32 6.250000
343 12 0 12 0.000000
411 8 2 10 20.000000
412 7 0 7 0.000000
413 13 1 14 7.142857
421 21 2 23 8.695652
422 30 1 31 3.225806
423 26 1 27 3.703704
431 51 3 54 5.555556
432 42 7 49 14.285714
433 41 5 46 10.869565
441 68 2 70 2.857143
442 78 3 81 3.703704
443 70 5 75 6.666667
511 17 0 17 0.000000
512 13 1 14 7.142857
513 26 6 32 18.750000
521 19 1 20 5.000000
522 25 6 31 19.354839
523 50 6 56 10.714286
531 66 3 69 4.347826
532 65 3 68 4.411765
533 128 24 152 15.789474
541 86 7 93 7.526882
542 100 6 106 5.660377
543 178 17 195 8.717949
All 1474 126 1600 7.875000
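One way to display only the validation rows whose RFM combination also appears in the filtered training output is to intersect the two indexes. A minimal sketch using the names defined above:
# RFM combinations from the training crosstab with above-average response rates
above_avg_rfm = rfm_crosstab[rfm_aboveavg].index
# restrict the validation crosstab to those combinations
rfm_crosstab1.loc[rfm_crosstab1.index.intersection(above_avg_rfm)]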
I am newer to data science and am working on a project to analyze sports statistics. I have a dataset of hockey statistics for a group of players over multiple seasons. Players have anywhere between 1 and 12 rows representing their season statistics over however many seasons they've played.
Example:
Player Season Pos GP G A P +/- PIM P/GP ... PPG PPP SHG SHP OTG GWG S S% TOI/GP FOW%
0 Nathan MacKinnon 2022 1 65 32 56 88 22 42 1.35 ... 7 27 0 0 1 5 299 10.7 21.07 45.4
1 Nathan MacKinnon 2021 1 48 20 45 65 22 37 1.35 ... 8 25 0 0 0 2 206 9.7 20.37 48.5
2 Nathan MacKinnon 2020 1 69 35 58 93 13 12 1.35 ... 12 31 0 0 2 4 318 11.0 21.22 43.1
3 Nathan MacKinnon 2019 1 82 41 58 99 20 34 1.21 ... 12 37 0 0 1 6 365 11.2 22.08 43.7
4 Nathan MacKinnon 2018 1 74 39 58 97 11 55 1.31 ... 12 32 0 1 3 12 284 13.7 19.90 41.9
5 Nathan MacKinnon 2017 1 82 16 37 53 -14 16 0.65 ... 2 14 2 2 2 4 251 6.4 19.95 50.6
6 Nathan MacKinnon 2016 1 72 21 31 52 -4 20 0.72 ... 7 16 0 1 0 6 245 8.6 18.87 48.4
7 Nathan MacKinnon 2015 1 64 14 24 38 -7 34 0.59 ... 3 7 0 0 0 2 192 7.3 17.05 47.0
8 Nathan MacKinnon 2014 1 82 24 39 63 20 26 0.77 ... 8 17 0 0 0 5 241 10.0 17.35 42.9
9 J.T. Compher 2022 2 70 18 15 33 6 25 0.47 ... 4 6 1 1 0 0 102 17.7 16.32 51.4
10 J.T. Compher 2021 2 48 10 8 18 10 19 0.38 ... 1 2 0 0 0 2 47 21.3 14.22 45.9
11 J.T. Compher 2020 2 67 11 20 31 9 18 0.46 ... 1 5 0 3 1 3 106 10.4 16.75 47.7
12 J.T. Compher 2019 2 66 16 16 32 -8 31 0.48 ... 4 9 3 3 0 3 118 13.6 17.48 49.2
13 J.T. Compher 2018 2 69 13 10 23 -29 20 0.33 ... 4 7 2 2 2 3 131 9.9 16.00 45.1
14 J.T. Compher 2017 2 21 3 2 5 0 4 0.24 ... 1 1 0 0 0 1 30 10.0 14.93 47.6
15 Darren Helm 2022 1 68 7 8 15 -5 14 0.22 ... 0 0 1 2 0 1 93 7.5 10.55 44.2
16 Darren Helm 2021 1 47 3 5 8 -3 10 0.17 ... 0 0 0 0 0 0 83 3.6 14.68 66.7
17 Darren Helm 2020 1 68 9 7 16 -6 37 0.24 ... 0 0 1 2 0 0 102 8.8 13.73 53.6
18 Darren Helm 2019 1 61 7 10 17 -11 20 0.28 ... 0 0 1 4 0 0 107 6.5 14.57 44.4
19 Darren Helm 2018 1 75 13 18 31 3 39 0.41 ... 0 0 2 4 0 0 141 9.2 15.57 44.1
[sample of my dataset][1]
[1]: https://i.stack.imgur.com/7CsUd.png
If any player has played more than 6 seasons, I want to drop the row corresponding to Season 2021. This is because COVID drastically shortened the season and it is causing issues as I work with averages.
As you can see from the screenshot, Nathan MacKinnon has played 9 seasons. Across those 9 seasons, except for 2021, he plays in no fewer than 64 games. Due to the shortened season of 2021, he only got 48 games.
Removing Season 2021 results in an Average Games Played of 73.75.
Keeping Season 2021 in the data, the Average Games Played becomes 70.89.
While not drastic, it compounds into the other metrics as well.
I have been trying this for a little while now, but as I mentioned, I am new to this world and am struggling to figure out how to accomplish this.
I don't want to just completely drop ALL rows for 2021 across all players, though, as some players only have 1-5 years' worth of data, and for those players I need to use as much data as I can; removing 1 row from a player with only 2 seasons would also negatively skew their averages.
I would really appreciate some assistance from anyone more experienced than me!
This can be accomplished by using groupby and apply. For example:
edited_players = (players
.groupby("Player")
.apply(lambda subset: subset if len(subset) <= 6 else subset.query("Season != 2021"))
)
Round brackets for formatting purposes.
The combination of groupby and apply basically feeds a grouped subset of your dataframe to a function. So, first all the rows of Nathan MacKinnon will be used, then rows for J.T. Compher, then Darren Helm rows, etc.
The function used is an anonymous/lambda function which operates under the following logic: "if the dataframe subset that I receive has 6 or fewer rows, I'll return the subset unedited. Otherwise, I will filter out rows within that subset which have the value 2021 in the Season column".
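If you'd prefer to avoid groupby plus apply, a vectorized boolean filter gives the same result. This is just a sketch, assuming the same players frame and column names as above:
# number of seasons each player has, broadcast back onto every row
season_counts = players.groupby("Player")["Season"].transform("count")
# keep a row unless the player has more than 6 seasons AND the row is the 2021 season
edited_players = players[(season_counts <= 6) | (players["Season"] != 2021)]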
How can I merge and sum the columns with the same name?
The output should be a single column named Canada containing the sum of the 4 columns named Canada.
Country/Region Brazil Canada Canada Canada Canada
Week 1 0 3 0 0 0
Week 2 0 17 0 0 0
Week 3 0 21 0 0 0
Week 4 0 21 0 0 0
Week 5 0 23 0 0 0
Week 6 0 80 0 5 0
Week 7 0 194 0 20 0
Week 8 12 702 3 199 20
Week 9 182 2679 16 2395 260
Week 10 737 8711 80 17928 892
Week 11 1674 25497 153 48195 1597
Week 12 2923 46392 175 85563 2003
Week 13 4516 76095 182 122431 2180
Week 14 6002 105386 183 163539 2431
Week 15 6751 127713 189 210409 2995
Week 16 7081 147716 189 258188 3845
From its current state, this should give the outcome you're looking for:
df = df.set_index('Country/Region') # optional
df.groupby(df.columns, axis=1).sum() # Stolen from Scott Boston as it's a superior method.
Output:
index Brazil Canada
Country/Region
Week 1 0 3
Week 2 0 17
Week 3 0 21
Week 4 0 21
Week 5 0 23
Week 6 0 85
Week 7 0 214
Week 8 12 924
Week 9 182 5350
Week 10 737 27611
Week 11 1674 75442
Week 12 2923 134133
Week 13 4516 200888
Week 14 6002 271539
Week 15 6751 341306
Week 16 7081 409938
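As a side note, newer pandas versions (2.1+) deprecate grouping along axis=1. An equivalent spelling that avoids the warning goes through a transpose (assuming the Country/Region index has been set as above):
# group the transposed frame by its (duplicated) index labels, sum, and transpose back
df.T.groupby(level=0).sum().T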
I found your dataset interesting; here's how I would clean it up from step 1:
import numpy as np
import pandas as pd
df = pd.read_csv('file.csv')
df = df.set_index(['Province/State', 'Country/Region', 'Lat', 'Long']).stack().reset_index()
df.columns = ['Province/State', 'Country/Region', 'Lat', 'Long', 'date', 'value']
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df = df.pivot_table(index=df.index, columns='Country/Region', values='value', aggfunc=np.sum)
print(df)
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-22 0 0 0 0 0 ... 0 0 0 0 0
2020-01-23 0 0 0 0 0 ... 0 0 0 0 0
2020-01-24 0 0 0 0 0 ... 0 0 0 0 0
2020-01-25 0 0 0 0 0 ... 0 0 0 0 0
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
2020-07-30 36542 5197 29831 922 1109 ... 11548 10 1726 5555 3092
2020-07-31 36675 5276 30394 925 1148 ... 11837 10 1728 5963 3169
2020-08-01 36710 5396 30950 925 1164 ... 12160 10 1730 6228 3659
2020-08-02 36710 5519 31465 925 1199 ... 12297 10 1734 6347 3921
2020-08-03 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
If you now want to do weekly aggregations, it's as simple as:
print(df.resample('w').sum())
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
2020-02-02 0 0 0 0 0 ... 0 0 0 0 0
2020-02-09 0 0 0 0 0 ... 0 0 0 0 0
2020-02-16 0 0 0 0 0 ... 0 0 0 0 0
2020-02-23 0 0 0 0 0 ... 0 0 0 0 0
2020-03-01 7 0 6 0 0 ... 0 0 0 0 0
2020-03-08 10 0 85 7 0 ... 43 0 0 0 0
2020-03-15 57 160 195 7 0 ... 209 0 0 0 0
2020-03-22 175 464 705 409 5 ... 309 0 0 11 7
2020-03-29 632 1142 2537 1618 29 ... 559 0 0 113 31
2020-04-05 1783 2000 6875 2970 62 ... 1178 4 0 262 59
2020-04-12 3401 2864 11629 4057 128 ... 1847 30 3 279 84
2020-04-19 5838 3603 16062 4764 143 ... 2081 42 7 356 154
2020-04-26 8918 4606 21211 5087 174 ... 2353 42 7 541 200
2020-05-03 15149 5391 27943 5214 208 ... 2432 42 41 738 244
2020-05-10 25286 5871 36315 5265 274 ... 2607 42 203 1260 241
2020-05-17 39634 6321 45122 5317 327 ... 2632 42 632 3894 274
2020-05-24 61342 6798 54185 5332 402 ... 2869 45 1321 5991 354
2020-05-31 91885 7517 62849 5344 536 ... 3073 63 1932 7125 894
2020-06-07 126442 8378 68842 5868 609 ... 3221 63 3060 7623 1694
2020-06-14 159822 9689 74147 5967 827 ... 3396 63 4236 8836 2335
2020-06-21 191378 12463 79737 5981 1142 ... 4466 63 6322 9905 3089
2020-06-28 210487 15349 87615 5985 1522 ... 10242 70 7360 10512 3813
2020-07-05 224560 18707 102918 5985 2186 ... 21897 70 8450 11322 4426
2020-07-12 237087 22399 124588 5985 2940 ... 36949 70 9489 13002 6200
2020-07-19 245264 26845 149611 6098 4279 ... 52323 70 10855 16350 9058
2020-07-26 250970 31255 178605 6237 5919 ... 68154 70 11571 26749 14933
2020-08-02 255739 36370 208457 6429 7648 ... 80685 70 12023 38896 22241
2020-08-09 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
Try:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,(20,5)), columns=[*'ZAABC'])
df.groupby(df.columns, axis=1, sort=False).sum()
Output:
Z A B C
0 44 111 67 67
1 9 104 36 87
2 70 176 12 58
3 65 126 46 88
4 81 62 77 72
5 9 100 69 79
6 47 146 99 88
7 49 48 19 14
8 39 97 9 57
9 32 105 23 35
10 75 83 34 0
11 0 89 5 38
12 17 83 42 58
13 31 66 41 57
14 35 57 82 91
15 0 113 53 12
16 42 159 68 6
17 68 50 76 52
18 78 35 99 58
19 23 92 85 48
You can try a transpose and groupby, e.g. something similar to the below.
df_T = df.transpose()
df_T.groupby(df_T.index).sum().loc['Canada']
Here's a way to do it:
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
First we rename the columns starting with Canada by appending their integer position, which ensures they are no longer duplicates.
Then we use sum() to add across columns like Canada, put the result in a new column named Canada, and drop the columns that were originally named Canada.
Full test code is:
import pandas as pd
df = pd.DataFrame(
columns=[x.strip() for x in 'Brazil Canada Canada Canada Canada'.split()],
index=['Week ' + str(i) for i in range(1, 17)],
data=[[i] * 5 for i in range(1, 17)])
df.columns.names=['Country/Region']
print(df)
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
print(df)
Output:
Country/Region Brazil Canada Canada Canada Canada
Week 1 1 1 1 1 1
Week 2 2 2 2 2 2
Week 3 3 3 3 3 3
Week 4 4 4 4 4 4
Week 5 5 5 5 5 5
Week 6 6 6 6 6 6
Week 7 7 7 7 7 7
Week 8 8 8 8 8 8
Week 9 9 9 9 9 9
Week 10 10 10 10 10 10
Week 11 11 11 11 11 11
Week 12 12 12 12 12 12
Week 13 13 13 13 13 13
Week 14 14 14 14 14 14
Week 15 15 15 15 15 15
Week 16 16 16 16 16 16
Brazil Canada
Week 1 1 4
Week 2 2 8
Week 3 3 12
Week 4 4 16
Week 5 5 20
Week 6 6 24
Week 7 7 28
Week 8 8 32
Week 9 9 36
Week 10 10 40
Week 11 11 44
Week 12 12 48
Week 13 13 52
Week 14 14 56
Week 15 15 60
Week 16 16 64
I am trying to append every column from one row onto another row. I want to do this for every row, but some rows will not have any values. Take a look at my code; it will make this clearer:
Here is my data
date day_of_week day_of_month day_of_year month_of_year
5/1/2017 0 1 121 5
5/2/2017 1 2 122 5
5/3/2017 2 3 123 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5
5/9/2017 1 9 129 5
5/10/2017 2 10 130 5
5/11/2017 3 11 131 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5
5/16/2017 1 16 136 5
5/17/2017 2 17 137 5
5/18/2017 3 18 138 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5
5/24/2017 2 24 144 5
5/25/2017 3 25 145 5
5/26/2017 4 26 146 5
Here is my current code:
def GetNextDayMarketData(row, dataframe):
    if row['next_calendarday'] is pd.NaT:
        return
    key = row['next_calendarday'].strftime("%Y-%m-%d")
    nextrow = dataframe.loc[key]
    for index, val in nextrow.iteritems():
        if index != "next_calendarday":
            dataframe.loc[row.name, index + '_nextday'] = val

s = df_md['date'].shift(-1)
df_md['next_calendarday'] = s.mask(s.dt.dayofweek.diff().lt(0))
df_md.set_index('date', inplace=True)
df_md.apply(lambda row: GetNextDayMarketData(row, df_md), axis=1)
This works, but it's so slow it might as well not work. Here is what the result should look like; you can see that the values from the next row have been added to the previous row. The kicker is that it's the next calendar date and not just the next row in the sequence. If a row does not have an entry for the next calendar date, it will simply be blank.
Here is the expected result in CSV:
date day_of_week day_of_month day_of_year month_of_year next_workingday day_of_week_nextday day_of_month_nextday day_of_year_nextday month_of_year_nextday
5/1/2017 0 1 121 5 5/2/2017 1 2 122 5
5/2/2017 1 2 122 5 5/3/2017 2 3 123 5
5/3/2017 2 3 123 5 5/4/2017 3 4 124 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5 5/9/2017 1 9 129 5
5/9/2017 1 9 129 5 5/10/2017 2 10 130 5
5/10/2017 2 10 130 5 5/11/2017 3 11 131 5
5/11/2017 3 11 131 5 5/12/2017 4 12 132 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5 5/16/2017 1 16 136 5
5/16/2017 1 16 136 5 5/17/2017 2 17 137 5
5/17/2017 2 17 137 5 5/18/2017 3 18 138 5
5/18/2017 3 18 138 5 5/19/2017 4 19 139 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5 5/24/2017 2 24 144 5
5/24/2017 2 24 144 5 5/25/2017 3 25 145 5
5/25/2017 3 25 145 5 5/26/2017 4 26 146 5
5/26/2017 4 26 146 5
5/30/2017 1 30 150 5
Use DataFrame.join, then remove the column next_calendarday_nextday:
df = df.set_index('date')
df = (df.join(df, on='next_calendarday', rsuffix='_nextday')
.drop('next_calendarday_nextday', axis=1))
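For reference, here is a self-contained sketch of that join on a few made-up rows shaped like the question's data (column names are taken from the question; the values are illustrative):
import pandas as pd
# toy frame with the same layout as the question's data
df_md = pd.DataFrame({
    'date': pd.to_datetime(['2017-05-01', '2017-05-02', '2017-05-03', '2017-05-04']),
    'day_of_week': [0, 1, 2, 3],
    'day_of_month': [1, 2, 3, 4],
})
s = df_md['date'].shift(-1)
df_md['next_calendarday'] = s.mask(s.dt.dayofweek.diff().lt(0))
df_md = df_md.set_index('date')
# one vectorized self-join keyed on next_calendarday against the date index
df_md = (df_md.join(df_md, on='next_calendarday', rsuffix='_nextday')
              .drop('next_calendarday_nextday', axis=1))
print(df_md)
Because the join is keyed on the next_calendarday values against the date index, each row picks up its next calendar date's columns in one shot instead of a per-row apply.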
I'm scraping National Hockey League (NHL) data for multiple seasons from this URL:
https://www.hockey-reference.com/leagues/NHL_2018_skaters.html
I'm only getting a few instances here and have tried moving my dict statements throughout the for loops. I've also tried utilizing solutions I found on other posts with no luck. Any help is appreciated. Thank you!
import requests
from bs4 import BeautifulSoup
import pandas as pd
dict={}
for i in range (2010,2020):
    year = str(i)
    source = requests.get('https://www.hockey-reference.com/leagues/NHL_'+year+'_skaters.html').text
    soup = BeautifulSoup(source, features='lxml')
    # identifying table in html
    table = soup.find('table', id="stats")
    # grabbing <tr> tags in html
    rows = table.findAll("tr")
    # creating passable values for each "stat" in td tag
    data_stats = [
        "player",
        "age",
        "team_id",
        "pos",
        "games_played",
        "goals",
        "assists",
        "points",
        "plus_minus",
        "pen_min",
        "ps",
        "goals_ev",
        "goals_pp",
        "goals_sh",
        "goals_gw",
        "assists_ev",
        "assists_pp",
        "assists_sh",
        "shots",
        "shot_pct",
        "time_on_ice",
        "time_on_ice_avg",
        "blocks",
        "hits",
        "faceoff_wins",
        "faceoff_losses",
        "faceoff_percentage"
    ]
    for rownum in rows:
        # grabbing player name and using as key
        filter = { "data-stat":'player' }
        cell = rows[3].findAll("td", filter)
        nameval = cell[0].string
        list = []
        for data in data_stats:
            # iterating through data_stat to grab values
            filter = { "data-stat":data }
            cell = rows[3].findAll("td", filter)
            value = cell[0].string
            list.append(value)
        dict[nameval] = list
        dict[nameval].append(year)
# conversion to numeric values and creating dataframe
columns = [
"player",
"age",
"team_id",
"pos",
"games_played",
"goals",
"assists",
"points",
"plus_minus",
"pen_min",
"ps",
"goals_ev",
"goals_pp",
"goals_sh",
"goals_gw",
"assists_ev",
"assists_pp",
"assists_sh",
"shots",
"shot_pct",
"time_on_ice",
"time_on_ice_avg",
"blocks",
"hits",
"faceoff_wins",
"faceoff_losses",
"faceoff_percentage",
"year"
]
df = pd.DataFrame.from_dict(dict,orient='index',columns=columns)
cols = df.columns.drop(['player','team_id','pos','year'])
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
print(df)
Output
Craig Adams Craig Adams 32 ... 43.9 2010
Luke Adam Luke Adam 22 ... 100.0 2013
Justin Abdelkader Justin Abdelkader 29 ... 29.4 2017
Will Acton Will Acton 27 ... 50.0 2015
Noel Acciari Noel Acciari 24 ... 44.1 2016
Pontus Aberg Pontus Aberg 25 ... 10.5 2019
[6 rows x 28 columns]
I'd just use pandas' .read_html(); it does the hard work of parsing tables for you (it uses BeautifulSoup under the hood).
Code:
import pandas as pd
result = pd.DataFrame()
for i in range (2010,2020):
    print(i)
    year = str(i)
    url = 'https://www.hockey-reference.com/leagues/NHL_'+year+'_skaters.html'
    #source = requests.get('https://www.hockey-reference.com/leagues/NHL_'+year+'_skaters.html').text
    df = pd.read_html(url,header=1)[0]
    df['year'] = year
    result = result.append(df, sort=False)
result = result[~result['Age'].str.contains("Age")]
result = result.reset_index(drop=True)
You can then save to file with result.to_csv('filename.csv',index=False)
Output:
print (result)
Rk Player Age Tm Pos GP ... BLK HIT FOW FOL FO% year
0 1 Justin Abdelkader 22 DET LW 50 ... 20 152 148 170 46.5 2010
1 2 Craig Adams 32 PIT RW 82 ... 58 193 243 311 43.9 2010
2 3 Maxim Afinogenov 30 ATL RW 82 ... 21 32 1 2 33.3 2010
3 4 Andrew Alberts 28 TOT D 76 ... 88 216 0 1 0.0 2010
4 4 Andrew Alberts 28 CAR D 62 ... 67 172 0 0 NaN 2010
5 4 Andrew Alberts 28 VAN D 14 ... 21 44 0 1 0.0 2010
6 5 Daniel Alfredsson 37 OTT RW 70 ... 36 41 14 25 35.9 2010
7 6 Bryan Allen 29 FLA D 74 ... 137 120 0 0 NaN 2010
8 7 Cody Almond 20 MIN C 7 ... 5 7 18 12 60.0 2010
9 8 Karl Alzner 21 WSH D 21 ... 21 15 0 0 NaN 2010
10 9 Artem Anisimov 21 NYR C 82 ... 41 45 310 380 44.9 2010
11 10 Nik Antropov 29 ATL C 76 ... 35 82 481 627 43.4 2010
12 11 Colby Armstrong 27 ATL RW 79 ... 29 74 10 10 50.0 2010
13 12 Derek Armstrong 36 STL C 6 ... 0 4 7 8 46.7 2010
14 13 Jason Arnott 35 NSH C 63 ... 17 24 526 551 48.8 2010
15 14 Dean Arsene 29 EDM D 13 ... 13 18 0 0 NaN 2010
16 15 Evgeny Artyukhin 26 TOT RW 54 ... 10 127 1 1 50.0 2010
17 15 Evgeny Artyukhin 26 ANA RW 37 ... 8 90 0 1 0.0 2010
18 15 Evgeny Artyukhin 26 ATL RW 17 ... 2 37 1 0 100.0 2010
19 16 Arron Asham 31 PHI RW 72 ... 16 92 2 11 15.4 2010
20 17 Adrian Aucoin 36 PHX D 82 ... 67 131 1 0 100.0 2010
21 18 Keith Aucoin 31 WSH C 9 ... 0 2 31 25 55.4 2010
22 19 Sean Avery 29 NYR C 69 ... 17 145 4 10 28.6 2010
23 20 David Backes 25 STL RW 79 ... 60 266 504 561 47.3 2010
24 21 Mikael Backlund 20 CGY C 23 ... 4 12 100 86 53.8 2010
25 22 Nicklas Backstrom 22 WSH C 82 ... 61 90 657 660 49.9 2010
26 23 Josh Bailey 20 NYI C 73 ... 36 67 171 255 40.1 2010
27 24 Keith Ballard 27 FLA D 82 ... 201 156 0 0 NaN 2010
28 25 Krys Barch 29 DAL RW 63 ... 13 120 0 3 0.0 2010
29 26 Cam Barker 23 TOT D 70 ... 53 75 0 0 NaN 2010
... ... .. ... .. .. ... ... ... ... ... ... ...
10251 885 Chris Wideman 29 TOT D 25 ... 26 35 0 0 NaN 2019
10252 885 Chris Wideman 29 OTT D 19 ... 25 26 0 0 NaN 2019
10253 885 Chris Wideman 29 EDM D 5 ... 1 7 0 0 NaN 2019
10254 885 Chris Wideman 29 FLA D 1 ... 0 2 0 0 NaN 2019
10255 886 Justin Williams 37 CAR RW 82 ... 32 55 92 150 38.0 2019
10256 887 Colin Wilson 29 COL C 65 ... 31 55 20 32 38.5 2019
10257 888 Garrett Wilson 27 PIT LW 50 ... 16 114 3 4 42.9 2019
10258 889 Scott Wilson 26 BUF C 15 ... 2 29 1 2 33.3 2019
10259 890 Tom Wilson 24 WSH RW 63 ... 52 200 29 24 54.7 2019
10260 891 Luke Witkowski 28 DET D 34 ... 27 67 0 0 NaN 2019
10261 892 Christian Wolanin 23 OTT D 30 ... 31 11 0 0 NaN 2019
10262 893 Miles Wood 23 NJD LW 63 ... 27 97 0 2 0.0 2019
10263 894 Egor Yakovlev 27 NJD D 25 ... 22 12 0 0 NaN 2019
10264 895 Kailer Yamamoto 20 EDM RW 17 ... 11 18 0 0 NaN 2019
10265 896 Keith Yandle 32 FLA D 82 ... 76 47 0 0 NaN 2019
10266 897 Pavel Zacha 21 NJD C 61 ... 24 68 348 364 48.9 2019
10267 898 Filip Zadina 19 DET RW 9 ... 3 6 3 3 50.0 2019
10268 899 Nikita Zadorov 23 COL D 70 ... 67 228 0 0 NaN 2019
10269 900 Nikita Zaitsev 27 TOR D 81 ... 151 139 0 0 NaN 2019
10270 901 Travis Zajac 33 NJD C 80 ... 38 66 841 605 58.2 2019
10271 902 Jakub Zboril 21 BOS D 2 ... 0 3 0 0 NaN 2019
10272 903 Mika Zibanejad 25 NYR C 82 ... 66 134 830 842 49.6 2019
10273 904 Mats Zuccarello 31 TOT LW 48 ... 43 57 10 20 33.3 2019
10274 904 Mats Zuccarello 31 NYR LW 46 ... 42 57 10 20 33.3 2019
10275 904 Mats Zuccarello 31 DAL LW 2 ... 1 0 0 0 NaN 2019
10276 905 Jason Zucker 27 MIN LW 81 ... 38 87 2 11 15.4 2019
10277 906 Valentin Zykov 23 TOT LW 28 ... 6 26 2 7 22.2 2019
10278 906 Valentin Zykov 23 CAR LW 13 ... 2 6 2 6 25.0 2019
10279 906 Valentin Zykov 23 VEG LW 10 ... 3 18 0 1 0.0 2019
10280 906 Valentin Zykov 23 EDM LW 5 ... 1 2 0 0 NaN 2019
[10281 rows x 29 columns]
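One caveat: DataFrame.append was removed in pandas 2.0, so on current versions the same loop can collect the per-year frames in a list and concatenate once at the end. A sketch of the equivalent:
import pandas as pd
frames = []
for year in map(str, range(2010, 2020)):
    url = 'https://www.hockey-reference.com/leagues/NHL_' + year + '_skaters.html'
    # read_html() returns a list of tables; the skater stats table is the first one
    frames.append(pd.read_html(url, header=1)[0].assign(year=year))
result = pd.concat(frames, sort=False)
result = result[~result['Age'].str.contains("Age")]  # drop the repeated header rows
result = result.reset_index(drop=True)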
Scraping heavily formatted tables is positively painful with Beautiful Soup (not to bash Beautiful Soup; it's wonderful for several use cases). There's a bit of a 'hack' I use for scraping data surrounded by dense markup, if you're willing to be a bit utilitarian about it:
1. Select entire table on web page
2. Copy + paste into Evernote (simplifies and reformats the HTML)
3. Copy + paste from Evernote to Excel or another spreadsheet software (removes the HTML)
4. Save as .csv
It isn't perfect. There will be blank lines in the CSV, but blank lines are easier and far less time-consuming to remove than such data is to scrape. Good luck!
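Once saved, the .csv loads straight into pandas and those blank lines can be dropped. A small sketch (the filename is hypothetical):
import pandas as pd
df = pd.read_csv('skaters_2018.csv')  # hypothetical name for the exported sheet
df = df.dropna(how='all')             # drop the blank lines left over from the paste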