I apologize, I know there are some quite similar questions; I went through them, but still couldn't work it out. It would be nice if someone could help me with this.
I want to find every character (and blank) except:
8-digit substrings (e.g. 20110101)
substrings such as 0.68G or 10.76B (1 or 2 digits, a dot, 2 digits, and 1 letter)
from the text:
b'STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN
PRCP SNDP FRSHTT\n486200 99999 20110101 79.3 24 74.5 24 1007.2 8
1006.2 8 6.6 24 2.2 24 7.0 999.9 87.8 74.1 0.00G 999.9 010000\n486200 99999 20110102 79.7 24 74.9 24 1007.8 8 1006.9 8 6.1 24 2.8 24 8.0
15.0 91.9 74.8 0.00G 999.9 010010\n486200 99999 20110103 77.5 24 73.6 24 1008.5 8 1007.6 8 6.0 24 2.8 24 6.0 999.9 83.7 73.4* 0.68G 999.9
010000\n486200 99999 20110104 81.2 24 75.0 24 1007.7 8 1006.8 8 6.3 24
3.0 24 5.1 999.9 89.6* 73.0 0.14G 999.9 010010\n486200 99999 20110105 79.7 24 74.8 24 1007.8 8 1006.8 8 7.0 24 2.4 24 6.0 999.9 87.8 73.0 0.57G 999.9 010000\n486200 99999 20110106 77.4 24 74.6 24 1008.8 8 1007.9 8 6.0 24 1.5 24 4.1 999.9 81.0 73.2 0.16G 999.9 010000\n486200 99999 20110107 77.7 24 75.0 24 1008.9
I came up with the regex (\d{8}|\d{1,2}\.\d{1,2}[ABCDEFG]), which finds both kinds of substrings.
I now need to 'negate' this. I tried several possibilities such as (?! ... ), but none of them seem to work.
My expected output is: 20110101 0.00G 20110102 0.00G 20110103 0.68G 20110104 89.6* 20110105 0.57G 20110106 0.16G 20110107
Do you have any suggestions?
You don't actually need to negate the pattern. Use the same regex with the re.findall function and join the resulting list items with a space character.
>>> s = '''STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN
PRCP SNDP FRSHTT\n486200 99999 20110101 79.3 24 74.5 24 1007.2 8
1006.2 8 6.6 24 2.2 24 7.0 999.9 87.8 74.1 0.00G 999.9 010000\n486200 99999 20110102 79.7 24 74.9 24 1007.8 8 1006.9 8 6.1 24 2.8 24 8.0
15.0 91.9 74.8 0.00G 999.9 010010\n486200 99999 20110103 77.5 24 73.6 24 1008.5 8 1007.6 8 6.0 24 2.8 24 6.0 999.9 83.7 73.4* 0.68G 999.9
010000\n486200 99999 20110104 81.2 24 75.0 24 1007.7 8 1006.8 8 6.3 24
3.0 24 5.1 999.9 89.6* 73.0 0.14G 999.9 010010\n486200 99999 20110105 79.7 24 74.8 24 1007.8 8 1006.8 8 7.0 24 2.4 24 6.0 999.9 87.8 73.0 0.57G 999.9 010000\n486200 99999 20110106 77.4 24 74.6 24 1008.8 8 1007.9 8 6.0 24 1.5 24 4.1 999.9 81.0 73.2 0.16G 999.9 010000\n486200 99999 20110107 77.7 24 75.0 24 1008.9'''
>>> ' '.join(re.findall(r'(\b\d{8}\b|\b\d{1,2}\.\d{1,2}[ABCDEFG])', s))
'20110101 0.00G 20110102 0.00G 20110103 0.68G 20110104 0.14G 20110105 0.57G 20110106 0.16G 20110107'
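The \b word boundaries in the pattern matter here: they stop \d{8} from matching eight digits inside a longer run of digits, so only standalone 8-digit tokens are returned.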
(?<!\d)\d{8}(?!\d)|\d{1,2}\.\d{2}[a-zA-Z]
Just use this with re.findall. See the demo:
https://www.regex101.com/r/rK5lU1/27
import re
p = re.compile(r'(?<!\d)\d{8}(?!\d)|\d{1,2}\.\d{2}[a-zA-Z]', re.MULTILINE | re.IGNORECASE)
test_str = "b'STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN PRCP SNDP FRSHTT\n486200 99999 20110101 79.3 24 74.5 24 1007.2 8 1006.2 8 6.6 24 2.2 24 7.0 999.9 87.8 74.1 0.00G 999.9 010000\n486200 99999 20110102 79.7 24 74.9 24 1007.8 8 1006.9 8 6.1 24 2.8 24 8.0 15.0 91.9 74.8 0.00G 999.9 010010\n486200 99999 20110103 77.5 24 73.6 24 1008.5 8 1007.6 8 6.0 24 2.8 24 6.0 999.9 83.7 73.4* 0.68G 999.9 010000\n486200 99999 20110104 81.2 24 75.0 24 1007.7 8 1006.8 8 6.3 24 3.0 24 5.1 999.9 89.6* 73.0 0.14G 999.9 010010\n486200 99999 20110105 79.7 24 74.8 24 1007.8 8 1006.8 8 7.0 24 2.4 24 6.0 999.9 87.8 73.0 0.57G 999.9 010000\n486200 99999 20110106 77.4 24 74.6 24 1008.8 8 1007.9 8 6.0 24 1.5 24 4.1 999.9 81.0 73.2 0.16G 999.9 010000\n486200 99999 20110107 77.7 24 75.0 24 1008.9\n"
re.findall(p, test_str)
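Joining the matches with a space, as in the first answer, should give the same result:
' '.join(re.findall(p, test_str))
# '20110101 0.00G 20110102 0.00G 20110103 0.68G 20110104 0.14G 20110105 0.57G 20110106 0.16G 20110107'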
How would I use bs4 to get the "Per Game Stats" table on this page and turn it into a pandas dataframe?
I have already tried
import requests
from bs4 import BeautifulSoup
url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
page = requests.get(url)
page
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
and am stuck from there.
Thanks.
Use pd.read_html:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table', id='per_game-team')
df = pd.read_html(str(table))[0]
The table you want has the id 'per_game-team'. Use the inspector from your browser's developer tools to find it.
Output:
>>> df.head(10)
Rk Team G MP ... BLK TOV PF PTS
0 1.0 Milwaukee Bucks* 72 240.7 ... 4.6 13.8 17.3 120.1
1 2.0 Brooklyn Nets* 72 241.7 ... 5.3 13.5 19.0 118.6
2 3.0 Washington Wizards* 72 241.7 ... 4.1 14.4 21.6 116.6
3 4.0 Utah Jazz* 72 241.0 ... 5.2 14.2 18.5 116.4
4 5.0 Portland Trail Blazers* 72 240.3 ... 5.0 11.1 18.9 116.1
5 6.0 Phoenix Suns* 72 242.8 ... 4.3 12.5 19.1 115.3
6 7.0 Indiana Pacers 72 242.4 ... 6.4 13.5 20.2 115.3
7 8.0 Denver Nuggets* 72 242.8 ... 4.5 13.5 19.1 115.1
8 9.0 New Orleans Pelicans 72 242.1 ... 4.4 14.6 18.0 114.6
9 10.0 Los Angeles Clippers* 72 240.0 ... 4.1 13.2 19.2 114.0
[10 rows x 25 columns]
pandas's .read_html() is the way to go here (it uses BeautifulSoup under the hood). And since it already handles the request for you, you can simplify the solution Corral provided to just:
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
df = pd.read_html(url, attrs = {'id': 'per_game-team'})[0]
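(pd.read_html always returns a list of DataFrames, even when attrs narrows it down to a single table, hence the [0].)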
But since you are specifically asking how to convert to a dataframe with bs4, I'll provide that solution too.
The basic logic/steps to do this are:
Get the table tag
From the table object, get the header names from the <th> tags under the <thead> tag
Iterate through the rows (<tr> tags) and get the <td> content from each row
Code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', {'id':'per_game-team'})
headers = [x.text for x in table.find('thead').find_all('th')]
data = []
table_body_rows = table.find('tbody').find_all('tr')
for row in table_body_rows:
    # the rank is in a <th> cell; the remaining stats are in <td> cells
    rank = [row.find('th').text]
    row_data = rank + [x.text for x in row.find_all('td')]
    data.append(row_data)
df = pd.DataFrame(data, columns=headers)
Output:
print(df)
Rk Team G MP FG ... STL BLK TOV PF PTS
0 1 Milwaukee Bucks* 72 240.7 44.7 ... 8.1 4.6 13.8 17.3 120.1
1 2 Brooklyn Nets* 72 241.7 43.1 ... 6.7 5.3 13.5 19.0 118.6
2 3 Washington Wizards* 72 241.7 43.2 ... 7.3 4.1 14.4 21.6 116.6
3 4 Utah Jazz* 72 241.0 41.3 ... 6.6 5.2 14.2 18.5 116.4
4 5 Portland Trail Blazers* 72 240.3 41.3 ... 6.9 5.0 11.1 18.9 116.1
5 6 Phoenix Suns* 72 242.8 43.3 ... 7.2 4.3 12.5 19.1 115.3
6 7 Indiana Pacers 72 242.4 43.3 ... 8.5 6.4 13.5 20.2 115.3
7 8 Denver Nuggets* 72 242.8 43.3 ... 8.1 4.5 13.5 19.1 115.1
8 9 New Orleans Pelicans 72 242.1 42.5 ... 7.6 4.4 14.6 18.0 114.6
9 10 Los Angeles Clippers* 72 240.0 41.8 ... 7.1 4.1 13.2 19.2 114.0
10 11 Atlanta Hawks* 72 241.7 40.8 ... 7.0 4.8 13.2 19.3 113.7
11 12 Sacramento Kings 72 240.3 42.6 ... 7.5 5.0 13.4 19.4 113.7
12 13 Golden State Warriors 72 240.3 41.3 ... 8.2 4.8 15.0 21.2 113.7
13 14 Philadelphia 76ers* 72 242.1 41.4 ... 9.1 6.2 14.4 20.2 113.6
14 15 Memphis Grizzlies* 72 241.7 42.8 ... 9.1 5.1 13.3 18.7 113.3
15 16 Boston Celtics* 72 241.4 41.5 ... 7.7 5.3 14.1 20.4 112.6
16 17 Dallas Mavericks* 72 240.3 41.1 ... 6.3 4.3 12.1 19.4 112.4
17 18 Minnesota Timberwolves 72 241.7 40.7 ... 8.8 5.5 14.3 20.9 112.1
18 19 Toronto Raptors 72 240.3 39.7 ... 8.6 5.4 13.2 21.2 111.3
19 20 San Antonio Spurs 72 242.8 41.9 ... 7.0 5.1 11.4 18.0 111.1
20 21 Chicago Bulls 72 241.4 42.2 ... 6.7 4.2 15.1 18.9 110.7
21 22 Los Angeles Lakers* 72 242.4 40.6 ... 7.8 5.4 15.2 19.1 109.5
22 23 Charlotte Hornets 72 241.0 39.9 ... 7.8 4.8 14.8 18.0 109.5
23 24 Houston Rockets 72 240.3 39.3 ... 7.6 5.0 14.7 19.5 108.8
24 25 Miami Heat* 72 241.4 39.2 ... 7.9 4.0 14.1 18.9 108.1
25 26 New York Knicks* 72 242.1 39.4 ... 7.0 5.1 12.9 20.5 107.0
26 27 Detroit Pistons 72 242.1 38.7 ... 7.4 5.2 14.9 20.5 106.6
27 28 Oklahoma City Thunder 72 241.0 38.8 ... 7.0 4.4 16.1 18.1 105.0
28 29 Orlando Magic 72 240.7 38.3 ... 6.9 4.4 12.8 17.2 104.0
29 30 Cleveland Cavaliers 72 242.1 38.6 ... 7.8 4.5 15.5 18.2 103.8
[30 rows x 25 columns]
CODE
import pandas
df = pandas.read_csv('biharpopulation.txt', delim_whitespace=True)
df.columns = ['SlNo','District','Total','Male','Female','Total','Male','Female','SC','ST','SC','ST']
DATA
SlNo District Total Male Female Total Male Female SC ST SC ST
1 Patna 729988 386991 342997 9236 5352 3884 15.5 0.2 38.6 68.7
2 Nalanda 473786 248246 225540 970 524 446 20.2 0.0 29.4 29.8
3 Bhojpur 343598 181372 162226 8337 4457 3880 15.3 0.4 39.1 46.7
4 Buxar 198014 104761 93253 8428 4573 3855 14.1 0.6 37.9 44.6
5 Rohtas 444333 233512 210821 25663 13479 12184 18.1 1.0 41.3 30.0
6 Kaimur 286291 151031 135260 35662 18639 17023 22.2 2.8 40.5 38.6
7 Gaya 1029675 529230 500445 2945 1526 1419 29.6 0.1 26.3 49.1
8 Jehanabad 174738 90485 84253 1019 530 489 18.9 0.07 32.6 32.4
9 Arawal 11479 57677 53802 294 179 115 18.8 0.04
10 Nawada 435975 223929 212046 2158 1123 1035 24.1 0.1 22.4 20.5
11 Aurangabad 472766 244761 228005 1640 865 775 23.5 0.1 35.7 49.7
Saran
12 Saran 389933 199772 190161 6667 3384 3283 12 0.2 33.6 48.5
13 Siwan 309013 153558 155455 13822 6856 6966 11.4 0.5 35.6 44.0
14 Gopalganj 267250 134796 132454 6157 2984 3173 12.4 0.3 32.1 37.8
15 Muzaffarpur 594577 308894 285683 3472 1789 1683 15.9 0.1 28.9 50.4
16 E. Champaran 514119 270968 243151 4812 2518 2294 13.0 0.1 20.6 34.3
17 W. Champaran 434714 228057 206657 44912 23135 21777 14.3 1.5 22.3 24.1
18 Sitamarhi 315646 166607 149039 1786 952 834 11.8 0.1 22.1 31.4
19 Sheohar 74391 39405 34986 64 35 29 14.4 0.0 16.9 38.8
20 Vaishali 562123 292711 269412 3068 1595 1473 20.7 0.1 29.4 29.9
21 Darbhanga 511125 266236 244889 841 467 374 15.5 0.0 24.7 49.5
22 Madhubani 481922 248774 233148 1260 647 613 13.5 0.0 22.2 35.8
23 Samastipur 628838 325101 303737 3362 2724 638 18.5 0.1 25.1 22.0
24 Munger 150947 80031 70916 18060 9297 8763 13.3 1.6 42.6 37.3
25 Begusarai 341173 177897 163276 1505 823 682 14.5 0.1 31.4 78.6
26 Shekhapura 103732 54327 49405 211 115 96 19.7 0.0 25.2 45.6
27 Lakhisarai 126575 65781 60794 5636 2918 2718 15.8 0.7 26.8 12.9
28 Jamui 242710 124538 118172 67357 34689 32668 17.4 4.8 24.5 26.7
The issue is with these 2 lines:
16 E. Champaran 514119 270968 243151 4812 2518 2294 13.0 0.1 20.6 34.3
17 W. Champaran 434714 228057 206657 44912 23135 21777 14.3 1.5 22.3 24.1
If you can somehow remove the space inside E. Champaran and W. Champaran (so they become E.Champaran and W.Champaran), then you can do this:
df = pd.read_csv('test.csv', sep=r'\s+', skip_blank_lines=True, skipinitialspace=True)
print(df)
SlNo District Total Male Female Total.1 Male.1 Female.1 SC ST SC.1 ST.1
0 1 Patna 729988 386991 342997 9236 5352 3884 15.5 0.20 38.6 68.7
1 2 Nalanda 473786 248246 225540 970 524 446 20.2 0.00 29.4 29.8
2 3 Bhojpur 343598 181372 162226 8337 4457 3880 15.3 0.40 39.1 46.7
3 4 Buxar 198014 104761 93253 8428 4573 3855 14.1 0.60 37.9 44.6
4 5 Rohtas 444333 233512 210821 25663 13479 12184 18.1 1.00 41.3 30.0
5 6 Kaimur 286291 151031 135260 35662 18639 17023 22.2 2.80 40.5 38.6
6 7 Gaya 1029675 529230 500445 2945 1526 1419 29.6 0.10 26.3 49.1
7 8 Jehanabad 174738 90485 84253 1019 530 489 18.9 0.07 32.6 32.4
8 9 Arawal 11479 57677 53802 294 179 115 18.8 0.04 NaN NaN
9 10 Nawada 435975 223929 212046 2158 1123 1035 24.1 0.10 22.4 20.5
10 11 Aurangabad 472766 244761 228005 1640 865 775 23.5 0.10 35.7 49.7
11 12 Saran 389933 199772 190161 6667 3384 3283 12.0 0.20 33.6 48.5
12 13 Siwan 309013 153558 155455 13822 6856 6966 11.4 0.50 35.6 44.0
13 14 Gopalganj 267250 134796 132454 6157 2984 3173 12.4 0.30 32.1 37.8
14 15 Muzaffarpur 594577 308894 285683 3472 1789 1683 15.9 0.10 28.9 50.4
15 16 E.Champaran 514119 270968 243151 4812 2518 2294 13.0 0.10 20.6 34.3
16 17 W.Champaran 434714 228057 206657 44912 23135 21777 14.3 1.50 22.3 24.1
17 18 Sitamarhi 315646 166607 149039 1786 952 834 11.8 0.10 22.1 31.4
18 19 Sheohar 74391 39405 34986 64 35 29 14.4 0.00 16.9 38.8
19 20 Vaishali 562123 292711 269412 3068 1595 1473 20.7 0.10 29.4 29.9
20 21 Darbhanga 511125 266236 244889 841 467 374 15.5 0.00 24.7 49.5
21 22 Madhubani 481922 248774 233148 1260 647 613 13.5 0.00 22.2 35.8
22 23 Samastipur 628838 325101 303737 3362 2724 638 18.5 0.10 25.1 22.0
23 24 Munger 150947 80031 70916 18060 9297 8763 13.3 1.60 42.6 37.3
24 25 Begusarai 341173 177897 163276 1505 823 682 14.5 0.10 31.4 78.6
25 26 Shekhapura 103732 54327 49405 211 115 96 19.7 0.00 25.2 45.6
26 27 Lakhisarai 126575 65781 60794 5636 2918 2718 15.8 0.70 26.8 12.9
27 28 Jamui 242710 124538 118172 67357 34689 32668 17.4 4.80 24.5 26.7
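For the 'remove the space' step itself, here is a minimal sketch (assuming biharpopulation.txt is the raw file shown above and that E. and W. are the only abbreviated name prefixes):
import re
import pandas as pd
# collapse 'E. Champaran' / 'W. Champaran' into single tokens before parsing
with open('biharpopulation.txt') as f:
    text = re.sub(r'\b([EW])\. +', r'\1.', f.read())
with open('biharpopulation_fixed.txt', 'w') as f:
    f.write(text)
df = pd.read_csv('biharpopulation_fixed.txt', sep=r'\s+', skip_blank_lines=True, skipinitialspace=True)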
Your problem is that the CSV is whitespace-delimited, but some of your district names also have whitespace in them. Luckily, none of the district names contain '\t' characters, so we can fix this:
df = pandas.read_csv('biharpopulation.txt', delimiter='\t')
Say I have a yearly cumulative dataframe as follows:
date v1 v2
0 2019-10 109.23 126.17
1 2019-09 108.90 121.07
2 2019-08 95.96 85.40
3 2019-07 91.30 82.92
4 2019-06 80.19 26.04
5 2019-05 65.98 18.58
6 2019-04 38.80 9.87
7 2019-03 3.01 2.51
8 2019-02 3.01 2.49
9 2018-12 221.31 249.87
10 2018-11 215.59 137.92
11 2018-10 195.16 110.69
12 2018-09 160.45 101.15
13 2018-08 124.70 75.57
14 2018-07 122.98 52.48
15 2018-06 73.46 34.82
16 2018-05 42.22 34.61
17 2018-04 9.94 28.52
18 2018-03 4.07 28.52
19 2018-02 2.04 21.84
Just wondering if it's possible to generate cum_v1 and cum_v2 for each year's data.
The logic of the calculation: the value of cum_v1 in 2019-10 is the value in 2019-10 minus the value in 2019-09 (109.23 - 108.90 = 0.33), and so on down to 2019-02, which keeps the same value for cum_v1 as v1; all values in 2019-01 are set to 0. The same logic applies for 2018.
The desired output will look like this:
date v1 cum_v1 v2 cum_v2
0 2019-10 109.23 0.33 126.17 5.10
1 2019-09 108.90 12.94 121.07 35.67
2 2019-08 95.96 4.66 85.40 2.48
3 2019-07 91.30 11.11 82.92 56.88
4 2019-06 80.19 14.21 26.04 7.46
5 2019-05 65.98 27.18 18.58 8.71
6 2019-04 38.80 35.79 9.87 7.36
7 2019-03 3.01 0.00 2.51 0.02
8 2019-02 3.01 3.01 2.49 2.49
9 2019-01 0 0 0 0
10 2018-12 221.31 5.72 249.87 111.95
11 2018-11 215.59 20.43 137.92 27.23
12 2018-10 195.16 34.71 110.69 9.54
13 2018-09 160.45 35.75 101.15 25.58
14 2018-08 124.70 1.72 75.57 23.09
15 2018-07 122.98 49.52 52.48 17.66
16 2018-06 73.46 31.24 34.82 0.21
17 2018-05 42.22 32.28 34.61 6.09
18 2018-04 9.94 5.87 28.52 0.00
19 2018-03 4.07 2.03 28.52 6.68
20 2018-02 2.04 2.04 21.84 21.84
21 2018-01 0 0 0 0
Using pandas GroupBy.diff, grouping on the year part of the date string:
df[['cum_v1', 'cum_v2']] = df.groupby(df['date'].str[:4]).diff(-1).fillna(df[['v1', 'v2']])
print(df)
Output:
date v1 v2 cum_v1 cum_v2
0 2019-10 109.23 126.17 0.33 5.10
1 2019-09 108.90 121.07 12.94 35.67
2 2019-08 95.96 85.40 4.66 2.48
3 2019-07 91.30 82.92 11.11 56.88
4 2019-06 80.19 26.04 14.21 7.46
5 2019-05 65.98 18.58 27.18 8.71
6 2019-04 38.80 9.87 35.79 7.36
7 2019-03 3.01 2.51 0.00 0.02
8 2019-02 3.01 2.49 3.01 2.49
9 2018-12 221.31 249.87 5.72 111.95
10 2018-11 215.59 137.92 20.43 27.23
11 2018-10 195.16 110.69 34.71 9.54
12 2018-09 160.45 101.15 35.75 25.58
13 2018-08 124.70 75.57 1.72 23.09
14 2018-07 122.98 52.48 49.52 17.66
15 2018-06 73.46 34.82 31.24 0.21
16 2018-05 42.22 34.61 32.28 6.09
17 2018-04 9.94 28.52 5.87 0.00
18 2018-03 4.07 28.52 2.03 6.68
19 2018-02 2.04 21.84 2.04 21.84
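The groupby key df['date'].str[:4] is just the year, so diff(-1) subtracts the next (earlier) row within each year; fillna then restores the original v1/v2 for the last row of each group (2019-02 and 2018-02), where diff yields NaN.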
Use DataFrameGroupBy.diff with Series.dt.year, passing the columns as a list; replace the resulting missing values with the originals via DataFrame.fillna, add prefixes with DataFrame.add_prefix, and finally join back to the original with DataFrame.join:
df['date'] = pd.to_datetime(df['date']).dt.to_period('m')
cols = ['v1','v2']
df = df.join(df.groupby(df['date'].dt.year)[cols].diff(-1).fillna(df[cols]).add_prefix('cum'))
print(df)
date v1 v2 cumv1 cumv2
0 2019-10 109.23 126.17 0.33 5.10
1 2019-09 108.90 121.07 12.94 35.67
2 2019-08 95.96 85.40 4.66 2.48
3 2019-07 91.30 82.92 11.11 56.88
4 2019-06 80.19 26.04 14.21 7.46
5 2019-05 65.98 18.58 27.18 8.71
6 2019-04 38.80 9.87 35.79 7.36
7 2019-03 3.01 2.51 0.00 0.02
8 2019-02 3.01 2.49 3.01 2.49
9 2018-12 221.31 249.87 5.72 111.95
10 2018-11 215.59 137.92 20.43 27.23
11 2018-10 195.16 110.69 34.71 9.54
12 2018-09 160.45 101.15 35.75 25.58
13 2018-08 124.70 75.57 1.72 23.09
14 2018-07 122.98 52.48 49.52 17.66
15 2018-06 73.46 34.82 31.24 0.21
16 2018-05 42.22 34.61 32.28 6.09
17 2018-04 9.94 28.52 5.87 0.00
18 2018-03 4.07 28.52 2.03 6.68
19 2018-02 2.04 21.84 2.04 21.84
EDIT: To also fill in missing months (like the all-zero 2019-01 row in the desired output), resample by month start first:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').resample('MS').sum()
cols = ['v1','v2']
df = (df.join(df.groupby(df.index.year)[cols].diff(-1).fillna(df[cols])
.add_prefix('cum')).to_period('m'))
print(df)
v1 v2 cumv1 cumv2
date
2018-02 2.04 21.84 -2.03 -6.68
2018-03 4.07 28.52 -5.87 0.00
2018-04 9.94 28.52 -32.28 -6.09
2018-05 42.22 34.61 -31.24 -0.21
2018-06 73.46 34.82 -49.52 -17.66
2018-07 122.98 52.48 -1.72 -23.09
2018-08 124.70 75.57 -35.75 -25.58
2018-09 160.45 101.15 -34.71 -9.54
2018-10 195.16 110.69 -20.43 -27.23
2018-11 215.59 137.92 -5.72 -111.95
2018-12 221.31 249.87 221.31 249.87
2019-01 0.00 0.00 -3.01 -2.49
2019-02 3.01 2.49 0.00 -0.02
2019-03 3.01 2.51 -35.79 -7.36
2019-04 38.80 9.87 -27.18 -8.71
2019-05 65.98 18.58 -14.21 -7.46
2019-06 80.19 26.04 -11.11 -56.88
2019-07 91.30 82.92 -4.66 -2.48
2019-08 95.96 85.40 -12.94 -35.67
2019-09 108.90 121.07 -0.33 -5.10
2019-10 109.23 126.17 109.23 126.17
I need to get the table from this page: https://stats.nba.com/teams/traditional/?sort=GP&dir=-1. From the html of the page one can see that the table is encoded in the descendants of the tag
<nba-stat-table filters="filters" ... >
<div class="nba-stat-table">
<div class="nba-stat-table__overflow" data-fixed="2" role="grid">
<table>
...
</nba-stat-table>
(I cannot add a screenshot since I am new to stackoverflow, but if you right click -> inspect element anywhere in the table you will see what I mean.)
I've tried a few different approaches, such as the first and second answers to the question How to extract tables from websites in Python, as well as the answers to pandas read_html ValueError: No tables found (trying the first solution gave me an error which is essentially this second question).
First try using pandas:
import requests
import pandas as pd
url = 'http://stats.nba.com/teams/traditional/?sort=GP&dir=-1'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
Or another try with BeautifulSoup:
import requests
from bs4 import BeautifulSoup
url = "https://stats.nba.com/teams/traditional/?sort=GP&dir=-1"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
stats_table = soup.find('nba-stat-table')
for child in stats_table.descendants:
    print(child)
For the first I got the 'pandas read_html ValueError: No tables found' error. For the second I didn't get any error, but nothing was printed either. I then tried to see what was actually happening by dumping the html to a file:
with open('html.txt', 'w') as fout:
    fout.write(str(page.content))
and/or:
with open('html.txt', 'w') as fout:
    fout.write(str(soup))
and in the text file, at the part of the html where the table should be, I get:
<nba-stat-table filters="filters"
ng-if="!isLoading && !noData"
options="options"
params="params"
rows="teamStats.rows"
template="teams/teams-traditional">
</nba-stat-table>
So it appears that I am not getting the descendants of this tag, which actually contain the information of the table. Does someone have a solution that obtains the whole html of the page so I can parse it, or an alternative way to obtain the table?
Here's what I try when attempting to scrape data. (By the way I LOVE scraping/working with sports data.)
1) Pandas pd.read_html(). (BeautifulSoup actually works under the hood here.) I like this method as it's easy and quick, and usually only requires a small amount of manipulation if it returns what I want. However, pd.read_html() only works if the data is within <table> tags in the html. Since there are no <table> tags here, it returns what you stated: "ValueError: No tables found". So good work on trying that first; it's the easiest method when it works.
2) The other "go to" method I use is to see if the data is pulled through XHR. Actually, this might be my first choice, as it gives you options for filtering what is returned, but it requires a little more (not much) investigative work to find the correct request url and query parameters. (This is the route I went for this solution.)
3) If the page is generated through javascript, sometimes you can find the data in json format within <script> tags using BeautifulSoup. This requires a bit more investigation to pull out the right <script> tag, then some string manipulation to get a valid json string that json.loads() can read (see the sketch after this list).
4a) Use BeautifulSoup to pull out the data elements if they are present in other tags and not rendered by javascript.
4b) Selenium is an option to let the page render first, then go into the html and parse with BeautifulSoup (in some cases, once Selenium has rendered the <table> tags, you could even use pd.read_html()). It's usually my last choice; not because it doesn't work or is bad, it's just slow and unnecessary if any of the above choices work.
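As a rough illustration of option 3 (everything here is hypothetical: the URL, the 'window.data =' marker, and the json shape all vary by site):
import json
import requests
from bs4 import BeautifulSoup
html = requests.get('https://example.com/page-with-embedded-json').text
soup = BeautifulSoup(html, 'html.parser')
# find the <script> tag whose text contains the data assignment (hypothetical marker)
script = soup.find('script', string=lambda s: s and 'window.data =' in s)
# strip the javascript wrapper so only the json literal remains
raw = script.string.split('window.data =', 1)[1].strip().rstrip(';')
data = json.loads(raw)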
So I went with option 2. Here's the code and output:
import requests
import pandas as pd
url = 'https://stats.nba.com/stats/leaguedashteamstats'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Mobile Safari/537.36'}
payload = {
'Conference': '',
'DateFrom': '',
'DateTo': '',
'Division': '',
'GameScope': '',
'GameSegment': '',
'LastNGames': '82',
'LeagueID': '00',
'Location': '',
'MeasureType': 'Base',
'Month': '0',
'OpponentTeamID': '0',
'Outcome': '',
'PORound': '0',
'PaceAdjust': 'N',
'PerMode': 'PerGame',
'Period': '0',
'PlayerExperience': '',
'PlayerPosition': '',
'PlusMinus': 'N',
'Rank': 'N',
'Season': '2019-20',
'SeasonSegment': '',
'SeasonType': 'Regular Season',
'ShotClockRange': '',
'StarterBench': '',
'TeamID': '0',
'TwoWay': '0',
'VsConference':'',
'VsDivision':'' }
jsonData = requests.get(url, headers=headers, params=payload).json()
df = pd.DataFrame(jsonData['resultSets'][0]['rowSet'], columns=jsonData['resultSets'][0]['headers'])
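One note: the browser-like User-Agent header matters here; stats.nba.com tends to hang or reject requests made with the default python-requests User-Agent.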
Output:
print (df.to_string())
TEAM_ID TEAM_NAME GP W L W_PCT MIN FGM FGA FG_PCT FG3M FG3A FG3_PCT FTM FTA FT_PCT OREB DREB REB AST TOV STL BLK BLKA PF PFD PTS PLUS_MINUS GP_RANK W_RANK L_RANK W_PCT_RANK MIN_RANK FGM_RANK FGA_RANK FG_PCT_RANK FG3M_RANK FG3A_RANK FG3_PCT_RANK FTM_RANK FTA_RANK FT_PCT_RANK OREB_RANK DREB_RANK REB_RANK AST_RANK TOV_RANK STL_RANK BLK_RANK BLKA_RANK PF_RANK PFD_RANK PTS_RANK PLUS_MINUS_RANK CFID CFPARAMS
0 1610612737 Atlanta Hawks 4 2 2 0.500 48.0 39.5 84.3 0.469 10.0 31.8 0.315 16.0 23.0 0.696 8.5 34.5 43.0 25.0 18.0 10.0 5.3 7.3 23.8 21.5 105.0 1.0 1 11 14 14 10 14 27 5 23 21 21 24 21 27 24 21 25 9 19 3 15 29 17 25 22 15 10 Atlanta Hawks
1 1610612738 Boston Celtics 3 2 1 0.667 48.0 39.0 97.3 0.401 11.7 35.0 0.333 18.0 26.3 0.684 14.7 33.0 47.7 21.0 11.3 9.3 6.3 5.7 25.0 29.3 107.7 5.0 19 11 4 11 10 18 3 28 13 12 18 19 12 28 2 25 13 22 1 7 7 19 20 1 16 10 10 Boston Celtics
2 1610612751 Brooklyn Nets 3 1 2 0.333 51.3 43.0 93.3 0.461 15.3 38.7 0.397 22.7 32.3 0.701 10.7 38.3 49.0 22.7 19.7 8.3 5.3 6.0 26.0 27.3 124.0 0.7 19 18 14 18 1 4 9 8 3 7 4 7 3 24 11 9 7 19 26 15 13 22 21 3 1 16 10 Brooklyn Nets
3 1610612766 Charlotte Hornets 4 1 3 0.250 48.0 38.3 86.5 0.442 14.8 36.8 0.401 14.3 19.8 0.722 10.0 31.5 41.5 24.3 19.3 5.3 4.0 6.5 22.5 21.8 105.5 -13.8 1 18 23 23 10 23 23 16 4 9 3 26 26 23 14 28 29 14 24 30 24 25 10 22 21 28 10 Charlotte Hornets
4 1610612741 Chicago Bulls 4 1 3 0.250 48.0 38.5 95.0 0.405 9.8 35.5 0.275 17.5 23.0 0.761 12.3 32.0 44.3 20.0 12.8 10.0 4.5 7.0 21.3 20.8 104.3 -6.0 1 18 23 23 10 20 6 27 24 11 29 20 21 15 7 26 24 26 2 3 20 27 7 26 24 23 10 Chicago Bulls
5 1610612739 Cleveland Cavaliers 3 1 2 0.333 48.0 39.3 89.3 0.440 10.7 34.7 0.308 13.0 18.7 0.696 10.7 38.0 48.7 20.7 15.7 6.3 4.0 4.7 19.0 19.3 102.3 -5.0 19 18 14 18 10 17 16 19 19 13 24 29 27 26 11 10 10 23 13 25 24 11 3 30 26 22 10 Cleveland Cavaliers
6 1610612742 Dallas Mavericks 4 3 1 0.750 48.0 39.5 86.8 0.455 12.8 40.8 0.313 23.0 31.0 0.742 9.8 36.0 45.8 24.0 13.0 6.8 5.0 2.8 19.3 27.0 114.8 4.0 1 1 4 4 10 14 21 10 8 5 22 5 4 19 17 15 21 15 3 19 19 1 4 7 10 12 10 Dallas Mavericks
7 1610612743 Denver Nuggets 4 3 1 0.750 49.3 37.3 90.5 0.412 11.5 31.8 0.362 19.8 24.3 0.814 13.0 35.5 48.5 22.0 14.3 8.0 5.5 4.5 22.8 23.8 105.8 3.3 1 1 4 4 4 27 13 25 14 21 11 13 20 7 4 19 11 20 7 16 11 9 13 14 20 13 10 Denver Nuggets
8 1610612765 Detroit Pistons 4 2 2 0.500 48.0 38.5 80.0 0.481 10.5 26.0 0.404 19.0 25.3 0.752 8.3 33.5 41.8 21.8 18.8 6.0 5.3 3.8 21.8 21.8 106.5 -3.0 1 11 14 14 10 20 29 3 20 28 2 15 15 17 26 24 28 21 21 27 15 4 9 22 18 21 10 Detroit Pistons
9 1610612744 Golden State Warriors 3 1 2 0.333 48.0 40.0 98.3 0.407 11.3 36.7 0.309 24.7 28.3 0.871 15.3 32.0 47.3 27.0 15.3 9.3 1.3 5.7 19.3 23.3 116.0 -12.0 19 18 14 18 10 10 2 26 15 10 23 3 8 2 1 26 14 4 10 7 30 19 5 17 9 27 10 Golden State Warriors
10 1610612745 Houston Rockets 3 2 1 0.667 48.0 38.3 91.3 0.420 13.0 45.7 0.285 28.0 34.0 0.824 9.3 38.0 47.3 24.3 15.7 6.3 5.3 5.0 23.7 28.0 117.7 0.3 19 11 4 11 10 22 11 23 6 3 27 1 1 5 21 10 14 12 13 25 13 15 15 2 8 18 10 Houston Rockets
11 1610612754 Indiana Pacers 3 0 3 0.000 48.0 39.7 90.0 0.441 8.0 23.3 0.343 13.7 16.7 0.820 9.7 29.3 39.0 24.3 13.3 8.7 4.3 5.3 23.7 19.7 101.0 -7.3 19 28 23 28 10 13 14 18 29 30 14 27 29 6 19 30 30 12 5 10 23 18 15 28 27 26 10 Indiana Pacers
12 1610612746 LA Clippers 4 3 1 0.750 48.0 43.0 82.8 0.520 13.0 32.0 0.406 22.5 28.5 0.789 8.3 34.0 42.3 25.0 17.0 8.5 5.5 3.3 26.3 25.5 121.5 9.0 1 1 4 4 10 4 28 1 6 19 1 8 7 11 26 22 26 9 18 11 11 2 23 10 3 3 10 LA Clippers
13 1610612747 Los Angeles Lakers 4 3 1 0.750 48.0 40.0 87.5 0.457 9.8 29.0 0.336 19.5 24.5 0.796 10.0 36.0 46.0 23.5 15.3 8.5 8.0 3.5 21.5 24.3 109.3 11.8 1 1 4 4 10 10 17 9 24 25 17 14 17 8 14 15 19 17 9 11 1 3 8 12 15 1 10 Los Angeles Lakers
14 1610612763 Memphis Grizzlies 4 1 3 0.250 49.3 39.5 95.3 0.415 9.0 32.0 0.281 19.0 24.5 0.776 11.3 36.5 47.8 24.8 18.8 9.0 6.5 7.0 27.0 23.8 107.0 -13.8 1 18 23 23 4 14 5 24 27 19 28 15 17 14 10 14 12 11 21 9 5 27 26 14 17 28 10 Memphis Grizzlies
15 1610612748 Miami Heat 4 3 1 0.750 49.3 40.3 86.0 0.468 12.8 32.3 0.395 24.8 33.8 0.733 9.8 39.0 48.8 23.8 22.5 8.5 6.5 4.8 27.0 27.3 118.0 8.0 1 1 4 4 4 9 25 6 8 17 5 2 2 20 17 6 9 16 30 11 5 12 26 6 7 6 10 Miami Heat
16 1610612749 Milwaukee Bucks 3 2 1 0.667 49.7 45.0 95.0 0.474 16.7 46.0 0.362 17.3 25.7 0.675 6.3 43.7 50.0 27.3 13.7 8.0 7.0 4.0 24.7 25.7 124.0 6.0 19 11 4 11 2 2 6 4 2 1 10 21 14 29 29 2 3 3 6 16 2 5 19 9 1 9 10 Milwaukee Bucks
17 1610612750 Minnesota Timberwolves 3 3 0 1.000 49.7 42.7 96.7 0.441 12.7 42.0 0.302 23.3 30.7 0.761 13.0 37.0 50.0 25.7 15.3 10.7 3.7 7.7 20.0 27.3 121.3 10.0 19 1 1 1 2 6 4 17 10 4 25 4 5 15 4 13 3 5 10 1 28 30 6 3 4 2 10 Minnesota Timberwolves
18 1610612740 New Orleans Pelicans 4 0 4 0.000 49.3 45.5 100.8 0.452 16.8 45.8 0.366 13.3 18.3 0.726 12.0 34.0 46.0 30.8 16.3 8.0 5.3 4.0 26.5 21.8 121.0 -7.3 1 28 29 28 4 1 1 13 1 2 8 28 28 21 8 22 19 1 17 16 15 5 25 22 5 24 10 New Orleans Pelicans
19 1610612752 New York Knicks 4 1 3 0.250 48.0 37.8 87.0 0.434 10.8 27.8 0.387 18.8 28.0 0.670 13.8 35.3 49.0 18.8 20.3 10.0 3.8 5.3 27.0 23.0 105.0 -7.3 1 18 23 23 10 24 19 20 17 27 6 17 10 30 3 20 7 27 27 3 27 17 26 18 22 24 10 New York Knicks
20 1610612760 Oklahoma City Thunder 4 1 3 0.250 48.0 37.5 84.5 0.444 10.8 29.3 0.368 17.3 24.8 0.697 9.5 40.3 49.8 18.8 18.5 6.8 4.5 4.8 23.5 22.8 103.0 1.8 1 18 23 23 10 25 26 15 17 23 7 22 16 25 20 3 5 27 20 19 20 12 14 19 25 14 10 Oklahoma City Thunder
21 1610612753 Orlando Magic 3 1 2 0.333 48.0 35.3 91.3 0.387 8.7 33.3 0.260 16.7 21.0 0.794 10.7 35.7 46.3 20.3 13.0 9.7 5.7 4.3 17.7 20.3 96.0 -1.3 19 18 14 18 10 28 11 30 28 16 30 23 25 9 11 18 17 24 3 6 9 8 1 27 29 20 10 Orlando Magic
22 1610612755 Philadelphia 76ers 3 3 0 1.000 48.0 38.7 86.7 0.446 10.3 34.7 0.298 22.0 30.3 0.725 10.0 39.7 49.7 25.3 20.3 10.7 7.0 4.0 29.7 27.3 109.7 7.3 19 1 1 1 10 19 22 14 22 13 26 11 6 22 14 5 6 7 29 1 2 5 29 3 14 7 10 Philadelphia 76ers
23 1610612756 Phoenix Suns 4 2 2 0.500 49.3 39.8 87.5 0.454 12.3 34.5 0.355 22.3 26.8 0.832 7.8 39.0 46.8 27.8 16.0 8.5 4.0 6.5 31.3 27.0 114.0 8.8 1 11 14 14 4 12 17 11 12 15 13 10 11 4 28 6 16 2 15 11 24 25 30 7 11 4 10 Phoenix Suns
24 1610612757 Portland Trail Blazers 4 2 2 0.500 48.0 41.5 89.8 0.462 9.3 28.3 0.327 21.0 24.5 0.857 8.5 37.8 46.3 17.0 15.5 6.8 5.3 4.5 26.3 22.5 113.3 0.3 1 11 14 14 10 7 15 7 26 26 20 12 17 3 24 12 18 30 12 19 15 9 23 20 12 19 10 Portland Trail Blazers
25 1610612758 Sacramento Kings 4 0 4 0.000 48.0 34.3 86.5 0.396 11.0 32.3 0.341 16.0 21.5 0.744 11.5 30.8 42.3 18.8 18.8 6.5 4.5 5.0 22.5 22.0 95.5 -19.5 1 28 29 28 10 30 23 29 16 17 15 24 24 18 9 29 26 27 21 23 20 15 10 21 30 30 10 Sacramento Kings
26 1610612759 San Antonio Spurs 3 3 0 1.000 48.0 44.3 92.0 0.482 8.0 23.7 0.338 22.3 28.3 0.788 12.7 38.7 51.3 25.3 16.0 5.7 7.0 5.7 18.7 24.7 119.0 4.7 19 1 1 1 10 3 10 2 29 29 16 9 8 12 6 8 2 7 15 29 2 19 2 11 6 11 10 San Antonio Spurs
27 1610612761 Toronto Raptors 4 3 1 0.750 49.3 37.5 87.0 0.431 14.3 39.3 0.363 22.8 25.8 0.883 9.3 44.3 53.5 22.8 20.3 6.8 5.8 6.3 24.3 24.0 112.0 8.8 1 1 4 4 4 25 19 22 5 6 9 6 13 1 22 1 1 18 27 19 8 24 18 13 13 4 10 Toronto Raptors
28 1610612762 Utah Jazz 4 3 1 0.750 48.0 35.0 77.3 0.453 10.5 29.3 0.359 18.3 23.0 0.793 5.5 39.8 45.3 20.3 19.5 6.5 3.3 4.8 26.0 23.5 98.8 7.3 1 1 4 4 10 29 30 12 20 23 12 18 21 10 30 4 22 25 25 23 29 12 21 16 28 8 10 Utah Jazz
29 1610612764 Washington Wizards 3 1 2 0.333 48.0 41.0 95.0 0.432 12.7 38.7 0.328 11.7 15.0 0.778 9.0 36.0 45.0 25.7 15.0 6.0 5.7 6.0 22.7 19.7 106.3 0.7 19 18 14 18 10 8 6 21 10 7 19 30 30 13 23 15 23 5 8 27 9 22 12 28 19 16 10 Washington Wizards
Using Selenium will be the best way to do it. Then you can get the whole content, which is rendered by javascript.
https://towardsdatascience.com/simple-web-scraping-with-pythons-selenium-4cedc52798cd
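A minimal sketch of that approach (assuming chromedriver is on your PATH; the crude sleep is just to give the javascript time to render):
import time
import pandas as pd
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://stats.nba.com/teams/traditional/?sort=GP&dir=-1')
time.sleep(5)  # crude wait for the angular app to render the table
html = driver.page_source
driver.quit()
df = pd.read_html(html)[-1]  # once rendered, the page does contain <table> tags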
df
Out[1]:
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
0 978.0 345 17.0 16.5 97 12.22 0 0 292.0 326.8 294.1
1 977.0 354 17.8 16.7 93 12.39 1 0 292.9 328.3 295.1
2 970.0 416 23.4 15.4 61 11.47 4 2 299.1 332.9 301.2
3 963.0 479 24.0 14.0 54 10.54 8 3 300.4 331.6 302.3
4 948.7 610 23.0 13.4 55 10.28 15 6 300.7 331.2 302.5
5 925.0 830 21.4 12.4 56 9.87 20 5 301.2 330.6 303.0
6 916.0 914 20.7 11.7 56 9.51 20 4 301.3 329.7 303.0
7 884.0 1219 18.2 9.2 56 8.31 60 4 301.8 326.7 303.3
8 853.1 1524 15.7 6.7 55 7.24 35 3 302.2 324.1 303.5
9 850.0 1555 15.4 6.4 55 7.14 20 2 302.3 323.9 303.6
10 822.8 1829 13.3 5.6 60 6.98 300 4 302.9 324.0 304.1
How do I interpolate the values of all the columns at specified PRES (pressure) values, say PRES=[950, 900, 875]? Is there an elegant, pandas-native way to do this?
The only way I can think of is to first insert an empty (all-NaN) row for each specified PRES value in a loop, then set PRES as the index and use pandas' native interpolate option:
df.interpolate(method='index', inplace=True)
Is there a more elegant solution?
Use your solution with no loop: reindex by the union of the original index values with the PRES list. This works only if all index values are unique:
PRES=[950, 900, 875]
df = df.set_index('PRES')
df = df.reindex(df.index.union(PRES)).sort_index(ascending=False).interpolate(method='index')
print (df)
HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
978.0 345.0 17.0 16.5 97.0 12.22 0.0 0.0 292.0 326.8 294.1
977.0 354.0 17.8 16.7 93.0 12.39 1.0 0.0 292.9 328.3 295.1
970.0 416.0 23.4 15.4 61.0 11.47 4.0 2.0 299.1 332.9 301.2
963.0 479.0 24.0 14.0 54.0 10.54 8.0 3.0 300.4 331.6 302.3
950.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
948.7 610.0 23.0 13.4 55.0 10.28 15.0 6.0 300.7 331.2 302.5
925.0 830.0 21.4 12.4 56.0 9.87 20.0 5.0 301.2 330.6 303.0
916.0 914.0 20.7 11.7 56.0 9.51 20.0 4.0 301.3 329.7 303.0
900.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
884.0 1219.0 18.2 9.2 56.0 8.31 60.0 4.0 301.8 326.7 303.3
875.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
853.1 1524.0 15.7 6.7 55.0 7.24 35.0 3.0 302.2 324.1 303.5
850.0 1555.0 15.4 6.4 55.0 7.14 20.0 2.0 302.3 323.9 303.6
822.8 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
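method='index' is the key part: it makes the interpolation use the actual PRES index values as x-coordinates instead of treating the rows as equally spaced.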
If the PRES column can contain non-unique values, use concat with sort_index instead:
PRES=[950, 900, 875]
df = df.set_index('PRES')
df = (pd.concat([df, pd.DataFrame(index=PRES)])
.sort_index(ascending=False)
.interpolate(method='index'))
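(reindex raises a 'cannot reindex from a duplicate axis' style error when the index contains duplicates, which is why the concat route is needed in that case.)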