Convert JSON data from Request into Pandas DataFrame - python

I'm trying to scrape some data from a web page and put it into a pandas DataFrame. I have tried and read many things, but I cannot get what I want: a DataFrame with all the data in separate columns and rows. My code is below.
import requests
import json
import pandas as pd
from pandas.io.json import json_normalize
r = requests.get('http://www.starcapital.de/test/Res_Stockmarketvaluation_FundamentalKZ_Tbl.php')
a = json.loads(r.text)
res = json_normalize(a)
##print(res)
df = pd.DataFrame(res)
print(df)
##df = pd.read_json(a)
##print(df)
pd.read_json(a) doesn't seem to work in any way.
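A likely reason pd.read_json(a) fails here is that a is already a parsed Python object (json.loads has run), while read_json expects JSON text, a path, or a file-like object. Below is a minimal sketch of both routes. The two-row payload is made up for illustration and is not the real response from the URL above.

```python
import json
from io import StringIO

import pandas as pd

# Made-up JSON text standing in for r.text (an assumption, not the real payload).
json_text = '[{"Country": "Russia", "CAPE": 5.9}, {"Country": "China", "CAPE": 12.8}]'

# Route 1: hand read_json the raw text, wrapped in a file-like object.
df_from_text = pd.read_json(StringIO(json_text))

# Route 2: parse first, then build the frame from the Python objects.
parsed = json.loads(json_text)
df_from_parsed = pd.DataFrame(parsed)

print(df_from_text)
print(df_from_parsed)
```

Either route should give the same two columns; which one fits depends on whether you still have the raw text or only the parsed object.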

Or, more simply:
import requests
import pandas as pd
r = requests.get('http://www.starcapital.de/test/Res_Stockmarketvaluation_FundamentalKZ_Tbl.php')
j = r.json()
df = pd.DataFrame.from_dict(j)

You can do it this way:
import requests
import pandas as pd
r = requests.get('http://www.starcapital.de/test/Res_Stockmarketvaluation_FundamentalKZ_Tbl.php')
j = r.json()
df = pd.DataFrame([[d['v'] for d in x['c']] for x in j['rows']],
                  columns=[d['label'] for d in j['cols']])
Result:
In [217]: df
Out[217]:
Country Weight CAPE PE PC PB PS DY RS 26W RS 52W Score
0 Russia 1.1 5.9 9.1 5.1 1.0 0.9 3.7 1.22 1.35 1.0
1 China 1.1 12.8 7.2 4.5 0.9 0.6 4.2 1.05 1.13 2.0
2 Italy 1.0 12.7 31.5 5.7 1.2 0.6 3.3 1.13 1.11 3.0
3 Austria 0.2 14.3 21.7 7.3 1.1 0.7 2.5 1.10 1.15 4.0
4 Norway 0.4 12.8 32.4 7.4 1.6 1.2 4.0 1.10 1.17 5.0
5 Hungary 0.0 12.5 49.8 7.5 1.4 0.7 2.3 1.12 1.19 6.0
6 Spain 1.2 11.7 24.7 7.0 1.4 1.2 3.7 1.08 1.11 7.0
7 Czech 0.0 8.9 13.6 6.1 1.3 1.0 6.7 1.03 1.05 8.0
8 Brazil 1.3 9.8 42.1 7.4 1.6 1.2 3.0 1.06 1.24 9.0
9 Portugal 0.1 11.3 29.0 4.8 1.5 0.7 3.9 1.05 1.06 10.0
.. ... ... ... ... ... ... ... ... ... ... ...
42 EMERGING MARKETS 13.5 14.0 16.0 8.8 1.6 1.3 2.9 1.04 1.11 NaN
43 DEVELOPED EUROPE 22.4 16.6 26.5 9.9 1.8 1.1 3.2 1.06 1.08 NaN
44 EMERGING EUROPE 1.7 8.6 10.9 5.8 1.1 0.8 3.4 1.13 1.20 NaN
45 EMERGING AMERICA 3.0 15.2 30.1 9.4 1.9 1.2 2.4 1.03 1.11 NaN
46 DEVELOPED ASIA-PACIFIC 17.7 NaN 17.7 8.8 1.3 0.9 2.5 1.03 1.09 NaN
47 EMERGING ASIA-PACIFIC 6.9 14.9 15.1 9.1 1.8 1.4 2.7 1.01 1.08 NaN
48 EMERGING AFRICA 0.8 NaN 16.5 10.6 2.0 1.4 3.8 1.06 1.12 NaN
49 MIDDLE EAST 1.3 NaN 13.7 11.8 1.5 1.8 3.9 1.06 1.10 NaN
50 BRIC 5.9 11.8 14.6 7.4 1.4 1.2 2.7 1.06 1.16 NaN
51 OTHER EMERGING MKT. 2.5 NaN 17.7 12.9 1.8 1.5 3.1 1.16 1.20 NaN
[52 rows x 11 columns]
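To make the shape of that response easier to see, here is a small offline sketch of the same idea. The dictionary below is a hand-written stand-in with the cols/label and rows/c/v layout that the answer's code relies on; the values are only a few illustrative cells, not the full payload.

```python
import pandas as pd

# Hand-written stand-in for r.json(); the key names follow the answer above,
# the values are illustrative only.
j = {
    "cols": [{"label": "Country"}, {"label": "CAPE"}, {"label": "Score"}],
    "rows": [
        {"c": [{"v": "Russia"}, {"v": 5.9}, {"v": 1.0}]},
        {"c": [{"v": "China"}, {"v": 12.8}, {"v": 2.0}]},
        {"c": [{"v": "BRIC"}, {"v": 11.8}, {"v": None}]},
    ],
}

# Column names come from the "label" of each entry in "cols".
columns = [col["label"] for col in j["cols"]]

# Each row holds a list of cells under "c"; the cell value sits under "v".
records = [[cell["v"] for cell in row["c"]] for row in j["rows"]]

df = pd.DataFrame(records, columns=columns)
print(df)
```

Missing values (None in the payload) end up as NaN in numeric columns, which matches the NaN entries in the Score column above.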

This is one step simpler than Justin's (already helpful) response: put .json() at the end of the r = requests.get line.
import requests
import pandas as pd
r = requests.get('http://www.starcapital.de/test/Res_Stockmarketvaluation_FundamentalKZ_Tbl.php').json()
df = pd.DataFrame.from_dict(r)

You may also want pd.json_normalize for when your data isn't exactly the way from_dict() expects.
For example:
data = [
    {
        "id": 1,
        "name": "Cole Volk",
        "fitness": {"height": 130, "weight": 60},
    },
    {"name": "Mark Reg", "fitness": {"height": 130, "weight": 60}},
    {
        "id": 2,
        "name": "Faye Raker",
        "fitness": {"height": 130, "weight": 60},
    },
]
pd.json_normalize(data, max_level=1)
    id        name  fitness.height  fitness.weight
0  1.0   Cole Volk             130              60
1  NaN    Mark Reg             130              60
2  2.0  Faye Raker             130              60

Related

Pandas groupby with sorting if conditions

I have the following dataframe df, which comes from a dataset:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PS/G
0 1 Stephen Curry PG 27 GSW 79 79 34.2 10.2 20.2 ... 0.908 0.9 4.6 5.4 6.7 2.1 0.2 3.3 2.0 30.1
1 2 James Harden SG 26 HOU 82 82 38.1 8.7 19.7 ... 0.860 0.8 5.3 6.1 7.5 1.7 0.6 4.6 2.8 29.0
2 3 Kevin Durant SF 27 OKC 72 72 35.8 9.7 19.2 ... 0.898 0.6 7.6 8.2 5.0 1.0 1.2 3.5 1.9 28.2
3 4 DeMarcus Cousins C 25 SAC 65 65 34.6 9.2 20.5 ... 0.718 2.4 9.1 11.5 3.3 1.6 1.4 3.8 3.6 26.9
4 5 LeBron James SF 31 CLE 76 76 35.6 9.7 18.6 ... 0.731 1.5 6.0 7.4 6.8 1.4 0.6 3.3 1.9 25.3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
471 472 Joe Harris SG 24 CLE 5 0 3.0 0.2 0.8 ... NaN 0.0 0.6 0.6 0.4 0.0 0.0 0.2 0.2 0.6
472 473 Bruno Caboclo SF 20 TOR 6 1 7.2 0.2 2.0 ... NaN 0.2 0.2 0.3 0.2 0.3 0.2 0.7 0.3 0.5
473 474 Sam Dekker SF 21 HOU 3 0 2.0 0.0 0.0 ... NaN 0.0 0.3 0.3 0.0 0.3 0.0 0.0 0.0 0.0
474 475 J.J. O'Brien SF 23 UTA 2 0 3.0 0.0 0.5 ... NaN 0.0 0.5 0.5 0.0 0.5 0.0 0.0 0.5 0.0
475 476 Nate Robinson PG 31 NOP 2 1 11.5 0.0 0.5 ... NaN 0.0 0.0 0.0 2.0 0.5 0.0 0.0 2.5 0.0
I need to group df by team (Tm) and find the best average scorer(s) per team (PS/G), ignoring rows where Tm is TOT. The result should be sorted descending by points per game, with ties broken by team name (Tm). If a team has multiple top scorers, list them all and sort them by player name ascending.
What I have done is the following:
grouped = df[df['Tm']!="TOT"].groupby('Tm')['PS/G'].max().sort_values(ascending=False)
And I am getting:
Tm
GSW 30.1
HOU 29.0
OKC 28.2
SAC 26.9
CLE 25.3
POR 25.1
NOP 24.3
TOR 23.5
IND 23.1
BOS 22.2
NYK 21.8
LAC 21.4
SAS 21.2
CHI 20.9
CHO 20.9
MIN 20.7
BRK 20.6
PHO 20.4
WAS 19.9
UTA 19.7
DEN 19.5
MIA 19.1
DET 18.8
DAL 18.3
ORL 18.2
MIL 18.2
LAL 17.6
PHI 17.5
ATL 17.1
MEM 16.6
Name: PS/G, dtype: float64
However, I also need to include the Player column in the result. So my first question is: how can I achieve that?
My second question is how to include these two requirements:
with ties broken by team name (Tm).
If there are multiple top scorers, list both and sort them by player name ascending.
I finally managed to figure it out with the following:
grouped1 = (
    df.loc[df[df['Tm'] != "TOT"].groupby(['Tm'])['PS/G'].idxmax()]
    .sort_values(by=['PS/G', 'Player'], ascending=[False, True])
    .reset_index()
)
grouped_final = grouped1[['Tm', 'Player', 'PS/G']]
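One caveat: idxmax returns a single row per team, so players tied for a team's top average would be dropped, and the sort above does not use Tm as a tie-breaker. A possible sketch that keeps ties, assuming the same column names (Player, Tm, PS/G) and using made-up players and numbers:

```python
import pandas as pd

# Made-up rows for illustration; only the column names match the question.
df = pd.DataFrame({
    "Player": ["Player B", "Player A", "Player C", "Player E",
               "Player D", "Player F", "Player G"],
    "Tm": ["GSW", "GSW", "HOU", "CHI", "CHI", "CHO", "TOT"],
    "PS/G": [30.1, 22.0, 29.0, 20.9, 20.9, 20.9, 35.0],
})

# Drop the combined "TOT" rows before grouping.
teams = df[df["Tm"] != "TOT"]

# Broadcast each team's maximum back onto its rows, then keep every row
# that equals that maximum (so tied top scorers all survive).
team_max = teams.groupby("Tm")["PS/G"].transform("max")
top = teams[teams["PS/G"] == team_max]

# Points descending, then team name, then player name ascending.
result = (
    top.sort_values(by=["PS/G", "Tm", "Player"], ascending=[False, True, True])
    [["Tm", "Player", "PS/G"]]
    .reset_index(drop=True)
)
print(result)
```

Here CHI keeps both of its tied players (sorted by name), and CHI is listed before CHO because the two teams share the same top average.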

Pandas Dataframe sorting a column of doubles from highest to lowest

I'm working on web scraping NBA stats and want to be able to sort by statistics like points, assists, and blocks.
I have my pandas dataframe and it prints players and statistics properly, including when sorted by an integer column such as age, as shown below.
[Image: example of dataframe sorted by age]
However, when I try to sort by points, it doesn't sort from the highest value to the lowest. Instead it sorts by the leading digit, e.g. from 9.9 down to 0, although there are clearly players with over 10.0 points per game.
[Image: example of dataframe sorted by points]
Are the numbers stored in the dataframe actually strings, and as a result the comparison of strings is causing this issue?
Here is the code I am running:
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

year = 2021
# URL of the page we will be scraping
url = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html".format(year)
# this is the HTML from the given URL
html = urlopen(url)
soup = BeautifulSoup(html, features="html.parser")
table = soup.find_all(class_="full_table")
head = soup.find(class_="thead")
headers_raw = [head.text for item in head][0]
headers = headers_raw.replace("\n", ",").split(",")[2:-1]
players = []
for i in range(len(table)):
    player = []
    for td in table[i].find_all("td"):
        player.append(td.text)
    players.append(player)
stats = pd.DataFrame(players, columns = headers)
sorted_by_points = stats.sort_values('PTS', ascending=False)
Yes, it sounds like the scores are strings. You can check with stats.dtypes (an attribute, not a method).
Yes, convert the PTS column to floats:
stats["PTS"] = pd.to_numeric(stats["PTS"])
sorted_by_points = stats.sort_values(by="PTS", ascending=False)
print(sorted_by_points)
Prints:
Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% eFG% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS
37 Bradley Beal SG 27 WAS 47 47 35.4 10.9 22.7 .483 2.2 6.5 .338 8.7 16.2 .541 .531 7.0 7.9 .897 1.2 3.6 4.8 4.8 1.1 0.3 3.3 2.4 31.1
112 Stephen Curry PG 32 GSW 49 49 34.0 10.2 20.8 .491 5.1 12.0 .427 5.1 8.8 .579 .614 5.5 6.0 .922 0.5 5.1 5.6 5.9 1.2 0.1 3.2 1.8 31.0
140 Joel Embiid C 26 PHI 38 38 32.2 9.4 18.2 .516 1.2 3.1 .379 8.2 15.1 .544 .548 10.1 11.8 .855 2.2 8.9 11.1 3.0 1.0 1.4 3.1 2.4 30.0
286 Damian Lillard PG 30 POR 52 52 35.9 8.8 20.1 .440 4.1 10.8 .379 4.8 9.3 .510 .541 6.9 7.5 .925 0.5 3.7 4.2 7.7 0.9 0.3 3.2 1.6 28.7
125 Luka Dončić PG 21 DAL 51 51 35.1 10.2 20.9 .486 3.0 8.3 .359 7.2 12.6 .569 .557 5.3 7.3 .727 0.8 7.1 7.9 8.7 1.0 0.7 4.3 2.3 28.6
11 Giannis Antetokounmpo PF 26 MIL 47 47 33.7 10.3 18.3 .565 1.1 3.7 .299 9.2 14.6 .632 .595 6.7 9.8 .683 1.8 9.5 11.2 6.1 1.1 1.3 3.7 2.7 28.4
276 Zach LaVine SG 25 CHI 53 53 35.2 9.8 19.4 .506 3.4 8.2 .416 6.4 11.2 .572 .594 4.4 5.2 .848 0.6 4.5 5.1 5.1 0.8 0.5 3.6 2.3 27.5
236 Kyrie Irving PG 28 BRK 41 41 35.1 10.4 20.4 .511 2.7 7.0 .389 7.7 13.5 .573 .577 3.7 4.0 .916 1.1 3.7 4.8 6.1 1.3 0.6 2.4 2.7 27.3
134 Kevin Durant PF 32 BRK 24 22 32.7 9.3 17.1 .543 2.5 5.5 .462 6.8 11.6 .581 .617 6.1 7.0 .870 0.4 6.3 6.7 5.2 0.6 1.3 3.7 2.0 27.3
511 Zion Williamson PF 20 NOP 52 52 33.1 10.3 16.8 .614 0.2 0.6 .300 10.1 16.2 .625 .619 6.0 8.6 .699 2.6 4.6 7.2 3.7 0.9 0.7 2.6 2.3 26.8
335 Donovan Mitchell PG 24 UTA 53 53 33.4 9.0 20.6 .438 3.4 8.7 .386 5.7 11.9 .476 .520 5.0 6.0 .845 0.9 3.5 4.4 5.2 1.0 0.3 2.8 2.2 26.4
253 Nikola Jokić C 25 DEN 56 56 35.2 10.2 18.1 .567 1.4 3.3 .422 8.8 14.8 .599 .605 4.2 4.9 .855 2.9 8.1 11.0 8.8 1.5 0.6 3.1 2.7 26.1
458 Jayson Tatum SF 22 BOS 51 51 35.6 9.4 20.4 .462 2.9 7.5 .388 6.5 12.9 .505 .534 4.2 4.8 .873 0.7 6.5 7.1 4.2 1.2 0.4 2.5 1.8 26.0
282 Kawhi Leonard SF 29 LAC 46 46 34.4 9.4 18.2 .516 2.0 5.0 .393 7.4 13.2 .562 .569 5.0 5.7 .878 1.1 5.5 6.7 5.1 1.7 0.4 2.0 1.7 25.7
57 Devin Booker SG 24 PHO 52 52 33.6 9.3 19.2 .485 2.0 5.8 .349 7.3 13.3 .545 .539 4.8 5.6 .862 0.4 3.8 4.1 4.5 0.9 0.3 3.2 2.8 25.5
243 LeBron James PG 36 LAL 41 41 33.9 9.5 18.4 .513 2.4 6.5 .368 7.1 12.0 .592 .578 4.1 5.8 .703 0.6 7.3 7.9 7.9 1.0 0.6 3.7 1.6 25.4
521 Trae Young PG 22 ATL 52 52 34.3 7.7 17.9 .431 2.3 6.5 .357 5.4 11.4 .472 .495 7.7 8.8 .871 0.6 3.3 3.9 9.5 0.8 0.2 4.3 1.9 25.4
156 De'Aaron Fox PG 23 SAC 56 56 35.2 9.2 19.2 .481 1.8 5.5 .323 7.4 13.6 .545 .527 5.1 7.1 .713 0.6 2.9 3.5 7.2 1.5 0.5 3.0 2.9 25.3
194 James Harden PG-SG 31 TOT 42 42 37.1 8.0 17.2 .463 2.8 7.8 .358 5.2 9.4 .549 .544 6.5 7.5 .870 0.8 7.2 8.0 10.9 1.2 0.7 4.1 2.3 25.2
475 Karl-Anthony Towns C 25 MIN 36 36 34.2 8.6 17.6 .491 2.4 6.1 .395 6.2 11.5 .542 .560 5.0 5.8 .874 2.6 8.2 10.8 4.6 0.7 1.4 3.2 3.6 24.7
71 Jaylen Brown SG 24 BOS 52 52 34.1 9.3 18.8 .493 2.7 6.8 .400 6.6 12.0 .546 .566 3.3 4.3 .754 1.2 4.6 5.8 3.4 1.2 0.6 2.7 2.9 24.6
437 Collin Sexton SG 22 CLE 47 47 35.7 8.8 18.4 .479 1.6 4.4 .367 7.2 14.0 .514 .523 4.9 6.0 .816 0.8 2.1 2.9 4.1 1.1 0.1 2.6 2.7 24.2
235 Brandon Ingram SF 23 NOP 52 52 34.7 8.5 18.2 .466 2.4 6.3 .382 6.1 12.0 .510 .532 4.8 5.4 .886 0.6 4.4 5.0 4.8 0.7 0.7 2.5 2.0 24.2
487 Nikola Vučević C 30 TOT 57 57 33.6 9.7 20.1 .485 2.6 6.3 .417 7.1 13.8 .516 .550 2.1 2.5 .836 2.0 9.3 11.3 3.8 1.0 0.7 1.7 1.9 24.1
170 Shai Gilgeous-Alexander SG 22 OKC 35 35 33.7 8.2 16.1 .508 2.0 4.9 .418 6.2 11.3 .547 .571 5.3 6.5 .808 0.5 4.2 4.7 5.9 0.8 0.7 3.0 2.0 23.7
407 Julius Randle PF 26 NYK 57 57 37.4 8.4 18.3 .461 2.1 5.2 .405 6.3 13.1 .483 .519 4.8 5.9 .802 1.3 9.2 10.5 6.1 1.0 0.3 3.5 3.2 23.7
167 Paul George SF 30 LAC 44 44 33.6 8.4 17.5 .480 3.3 7.5 .437 5.1 9.9 .513 .574 3.5 4.0 .886 0.9 5.4 6.3 5.5 1.2 0.5 3.2 2.4 23.6
315 CJ McCollum SG 29 POR 31 31 33.5 8.6 19.4 .444 3.8 9.6 .396 4.8 9.8 .492 .542 2.3 2.8 .818 0.6 3.3 3.9 4.7 1.1 0.5 1.3 1.9 23.4
114 Anthony Davis PF 27 LAL 23 23 32.8 8.9 16.7 .533 0.7 2.5 .293 8.1 14.1 .575 .555 4.0 5.7 .715 2.0 6.3 8.4 3.0 1.3 1.8 2.0 1.8 22.5
178 Jerami Grant SF 26 DET 50 50 34.3 7.4 17.3 .428 2.2 6.2 .354 5.2 11.2 .469 .491 5.4 6.3 .860 0.7 4.0 4.7 2.9 0.7 1.1 2.1 2.3 22.4
500 Russell Westbrook PG 32 WAS 49 49 35.4 8.3 18.9 .441 1.3 4.1 .312 7.1 14.8 .477 .475 4.0 6.3 .624 1.7 9.3 10.9 10.8 1.3 0.4 4.9 2.8 21.9
67 Malcolm Brogdon PG 28 IND 50 50 35.1 8.1 17.6 .459 2.7 6.7 .399 5.4 10.9 .496 .535 2.6 3.0 .859 1.1 3.9 5.0 6.0 0.9 0.2 2.0 2.0 21.4
80 Jimmy Butler SF 31 MIA 41 41 33.7 7.1 14.4 .495 0.5 2.0 .232 6.7 12.4 .537 .511 6.7 7.9 .852 1.9 5.3 7.2 7.2 2.1 0.4 2.1 1.4 21.4
346 Jamal Murray PG 23 DEN 48 48 35.5 7.9 16.5 .477 2.7 6.6 .408 5.2 9.9 .523 .559 2.8 3.2 .869 0.8 3.3 4.0 4.8 1.3 0.3 2.3 2.0 21.2
119 DeMar DeRozan PF 31 SAS 47 47 34.0 7.4 14.9 .495 0.4 1.4 .281 7.0 13.5 .517 .508 6.1 6.9 .880 0.7 3.6 4.3 7.2 0.9 0.3 1.9 2.0 21.2
517 Christian Wood C 25 HOU 34 34 31.9 8.2 15.4 .531 1.8 4.7 .385 6.4 10.7 .595 .590 2.9 4.6 .641 1.7 7.6 9.4 1.6 0.9 1.2 1.9 2.1 21.1
440 Pascal Siakam PF 26 TOR 46 46 35.6 7.5 16.5 .453 1.2 4.2 .282 6.3 12.3 .512 .489 4.7 5.5 .839 1.7 5.5 7.2 4.7 1.1 0.7 2.2 3.3 20.8
427 Terry Rozier SG 26 CHO 53 53 34.0 7.5 16.0 .468 3.4 8.3 .405 4.1 7.7 .536 .573 2.4 2.9 .824 0.6 3.7 4.2 3.8 1.3 0.4 1.9 1.8 20.7
492 John Wall PG 30 HOU 37 37 32.1 7.4 18.1 .405 2.1 6.3 .328 5.3 11.9 .446 .462 3.8 5.2 .734 0.5 2.8 3.2 6.8 1.1 0.8 3.5 1.2 20.6
201 Tobias Harris PF 28 PHI 49 49 33.3 7.9 15.2 .521 1.4 3.5 .407 6.5 11.7 .555 .568 3.2 3.6 .886 1.0 6.2 7.2 3.6 0.9 0.9 1.9 2.0 20.5
399 Kristaps Porziņģis C 25 DAL 37 37 31.4 7.7 16.4 .473 2.3 6.3 .359 5.5 10.0 .544 .542 2.8 3.2 .850 2.0 7.4 9.4 1.6 0.4 1.5 1.4 2.6 20.5
330 Khris Middleton SF 29 MIL 54 54 33.4 7.4 15.5 .479 2.2 5.1 .433 5.2 10.4 .502 .550 3.0 3.4 .886 0.8 5.2 6.1 5.6 1.1 0.2 2.8 2.4 20.1
430 Domantas Sabonis PF 24 IND 53 53 35.7 7.5 14.4 .520 0.8 2.6 .302 6.7 11.8 .568 .547 4.1 5.6 .731 2.5 9.1 11.6 6.0 1.1 0.5 3.4 3.4 19.9
371 Victor Oladipo SG 28 TOT 33 33 32.7 7.1 17.5 .408 2.4 7.2 .326 4.8 10.2 .466 .476 3.2 4.2 .754 0.4 4.5 4.8 4.6 1.4 0.4 2.5 2.5 19.8
207 Gordon Hayward SF 30 CHO 44 44 34.0 7.1 15.0 .473 1.9 4.7 .415 5.1 10.3 .499 .537 3.5 4.2 .843 0.8 5.0 5.9 4.1 1.2 0.3 2.1 1.7 19.6
38 Malik Beasley SG 24 MIN 37 36 32.8 7.1 16.2 .440 3.5 8.7 .399 3.7 7.5 .487 .547 1.8 2.2 .850 0.8 3.6 4.4 2.4 0.8 0.2 1.6 1.7 19.6
483 Fred VanVleet SG 26 TOR 45 45 36.1 6.4 16.4 .391 3.3 8.9 .366 3.1 7.4 .422 .491 3.4 3.9 .885 0.6 3.5 4.2 6.1 1.7 0.8 2.0 2.4 19.5
401 Norman Powell SG-SF 27 TOT 54 43 31.1 6.5 13.4 .486 2.6 6.2 .421 3.9 7.1 .543 .584 3.4 4.0 .860 0.6 2.5 3.1 1.8 1.2 0.3 1.8 2.4 19.1
429 D'Angelo Russell PG 24 MIN 28 19 27.7 6.7 15.6 .429 2.8 7.1 .399 3.9 8.6 .454 .519 2.8 3.4 .802 0.4 2.0 2.4 4.9 1.0 0.4 2.8 1.9 19.0
3 Bam Adebayo C 23 MIA 51 51 33.4 7.2 12.7 .566 0.0 0.2 .250 7.2 12.6 .570 .568 4.5 5.7 .803 2.3 7.0 9.3 5.2 1.0 1.1 2.8 2.3 19.0
339 Ja Morant PG 21 MEM 47 47 31.9 6.7 15.1 .444 1.0 3.6 .274 5.7 11.5 .497 .477 4.3 5.8 .739 0.9 2.7 3.6 7.3 0.8 0.2 3.1 1.4 18.7
155 Evan Fournier SF-SG 28 TOT 30 26 30.1 6.2 13.6 .457 2.8 7.0 .398 3.4 6.5 .520 .560 3.4 4.2 .795 0.2 2.6 2.7 3.4 1.1 0.4 1.9 2.2 18.6
284 Caris LeVert SG 26 TOT 32 24 30.7 7.0 16.2 .431 1.8 5.6 .315 5.2 10.6 .491 .485 2.6 3.2 .804 0.8 3.7 4.5 4.8 1.4 0.5 2.0 2.0 18.3
505 Andrew Wiggins PF 25 GSW 57 57 32.7 6.9 14.4 .476 2.0 5.2 .387 4.9 9.2 .527 .546 2.4 3.4 .697 1.0 3.7 4.7 2.3 0.9 0.9 1.8 2.2 18.2
135 Anthony Edwards SG 19 MIN 58 41 31.3 6.6 16.5 .397 2.2 6.9 .320 4.3 9.6 .453 .465 2.8 3.5 .788 0.8 3.6 4.4 2.7 1.1 0.4 2.1 1.7 18.1
101 John Collins PF 23 ATL 48 48 30.1 7.0 12.8 .545 1.3 3.3 .377 5.8 9.5 .604 .594 2.7 3.3 .840 2.1 5.6 7.6 1.4 0.5 1.0 1.2 3.3 18.0
176 Eric Gordon SG 32 HOU 27 13 29.2 5.9 13.6 .433 2.6 7.8 .329 3.3 5.8 .573 .527 3.5 4.2 .825 0.3 1.9 2.1 2.6 0.5 0.5 1.9 1.6 17.8
490 Kemba Walker PG 30 BOS 37 37 31.4 6.1 15.2 .401 2.8 8.0 .345 3.4 7.2 .464 .492 2.8 3.0 .937 0.3 3.6 3.9 5.1 1.1 0.3 2.1 1.4 17.8
...
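To see the difference in isolation, here is a tiny sketch with made-up values. It assumes the scraped cells arrive as text; the blank cell and errors="coerce" are additions for illustration, in case some rows have no value.

```python
import pandas as pd

# Made-up scraped values: every cell is text, one is blank.
stats = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "PTS": ["9.9", "31.1", "10.5", ""],
})

# Inspect the column types; text columns do not show up as float64.
print(stats.dtypes)

# Sorting text compares character by character, so "9.9" beats "31.1".
as_text = stats.sort_values("PTS", ascending=False)["PTS"].tolist()

# Convert to numbers; errors="coerce" turns unparseable cells into NaN.
stats["PTS"] = pd.to_numeric(stats["PTS"], errors="coerce")

# Now the sort is numeric, with NaN placed last by default.
as_numbers = stats.sort_values("PTS", ascending=False)["PTS"].tolist()

print(as_text)
print(as_numbers)
```

The same conversion can be applied to several stat columns at once if you loop over the column names you care about.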

Python, pd dataframe extract values based on condition raises error

I have the following dataframe of nba player stats:
print(self.df)
Name PTS REB AST \
(updated to: , 2020-02-24 19:39:00)
0 James Harden 35.2 6.4 7.4
1 Giannis Antetokounmpo 30.0 13.6 5.8
2 Trae Young 30.0 4.4 9.2
3 Bradley Beal 29.6 4.4 6.0
4 Damian Lillard 29.5 4.4 7.9
... ... ... ... ...
261 Jerome Robinson 3.1 1.7 1.1
262 Goga Bitadze 3.1 2.0 0.5
263 Javonte Green 3.0 1.7 0.5
264 Semi Ojeleye 2.9 1.9 0.5
265 Matthew Dellavedova 2.5 1.1 2.6
STL BLK FGM FGA FG% 3PM 3PA \
(updated to: , 2020-02-24 19:39:00)
0 1.7 1.0 10.1 23.1 43.9 4.6 12.8
1 1.1 1.1 11.1 20.1 55.2 1.5 4.8
2 1.2 0.1 9.3 20.8 44.5 3.5 9.5
3 1.1 0.4 10.1 22.2 45.3 2.6 8.0
4 1.0 0.3 9.4 20.4 46.0 3.9 10.0
... ... ... ... ... ... ... ...
261 0.3 0.2 1.2 3.5 34.1 0.5 1.7
262 0.1 0.7 1.3 2.6 48.2 0.1 0.6
263 0.5 0.1 1.2 2.3 51.1 0.1 0.6
264 0.3 0.1 1.0 2.4 39.5 0.5 1.5
265 0.3 0.0 0.9 2.7 32.3 0.2 1.4
3P% FTM FTA FT%
(updated to: , 2020-02-24 19:39:00)
0 35.9 10.4 12.0 86.8
1 31.1 6.4 10.4 61.5
2 37.4 7.9 9.3 85.5
3 32.0 6.9 8.1 84.4
4 39.3 6.8 7.7 88.9
... ... ... ... ...
261 29.5 0.3 0.4 57.1
262 15.4 0.5 0.7 69.0
263 26.1 0.6 0.9 63.9
264 35.0 0.5 0.5 88.9
265 15.9 0.5 0.6 89.3
[266 rows x 15 columns]
I'm trying to analyze some stats by narrowing the df down to the rows that are above the mean in two columns, but when I try to extract values based on that condition, I get the following error.
def get_stat(self):
    pts_fgm_df = self.df.head(n=120)
    rslt_df = pts_fgm_df.loc[pts_fgm_df['PTS'] > pts_fgm_df['PTS'].mean() & pts_fgm_df['FG%'] > pts_fgm_df.mean()]
    print(rslt_df)
TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
My eventual solution:
top_df = self.df.head(n=120)
mean_pts = top_df['PTS'].mean()
mean_fgp = top_df['FG%'].mean()
rslt_df = top_df[
    (top_df['PTS'] >= mean_pts) &
    (top_df['FG%'] >= mean_fgp)
]
return rslt_df
My problem was that the condition was hard to read the way I had written it. Because & binds more tightly than >, each comparison needs its own parentheses; I had also compared FG% against pts_fgm_df.mean() (the mean of every column) instead of the FG% mean.
So the solution is to first give every expression a variable name:
mean_pts = top_df['PTS'].mean()
mean_fgp = top_df['FG%'].mean()
pts = top_df['PTS']
fgp = top_df['FG%']
Then filter based on them, which makes missing brackets and similar mistakes much easier to see:
rslt_df = top_df[
    (pts >= mean_pts) &
    (fgp >= mean_fgp)
]
return rslt_df
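For anyone hitting the same TypeError, the underlying cause is operator precedence: Python evaluates & before >, so an unparenthesized a > b & c > d is read as a > (b & c) > d, and b & c here tries a bitwise AND between a float and a float column. A minimal sketch with made-up numbers (only the column names come from the question):

```python
import pandas as pd

# Made-up stats; only the column names follow the question.
top_df = pd.DataFrame({
    "PTS": [35.2, 30.0, 12.0, 8.0],
    "FG%": [43.9, 55.2, 50.0, 40.0],
})

mean_pts = top_df["PTS"].mean()
mean_fgp = top_df["FG%"].mean()

# Each comparison is wrapped in parentheses so that & combines two
# boolean Series instead of a float and a float column.
mask = (top_df["PTS"] > mean_pts) & (top_df["FG%"] > mean_fgp)

rslt_df = top_df[mask]
print(rslt_df)
```

Only the second row is above both means in this toy data, so it is the only one kept.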

How can I split columns with regex to move trailing CAPS into a separate column?

I'm trying to split a column using regex, but I can't get the split right. I want to take the trailing capital letters (runs of 2-4 capitals) and move them into a separate column. However, my attempt leaves only the 'Name' column populated while the 'Team' column is blank.
Here's my code:
import pandas as pd
url = "https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgAssists/dir/desc"
df = pd.read_html(url)[0].join(pd.read_html(url)[1])
df[['Name','Team']] = df['Name'].str.split('[A-Z]{2,4}', expand=True)
I want this:
print(df.head(5).to_string())
RK Name POS GP MIN PTS FGM FGA FG% 3PM 3PA 3P% FTM FTA FT% REB AST STL BLK TO DD2 TD3 PER
0 1 LeBron JamesLA SF 35 35.1 24.9 9.6 19.7 48.6 2.0 6.0 33.8 3.7 5.5 67.7 7.9 11.0 1.3 0.5 3.7 28 9 26.10
1 2 Ricky RubioPHX PG 30 32.0 13.6 4.9 11.9 41.3 1.2 3.7 31.8 2.6 3.1 83.7 4.6 9.3 1.3 0.2 2.5 12 1 16.40
2 3 Luka DoncicDAL SF 32 32.8 29.7 9.6 20.2 47.5 3.1 9.4 33.1 7.3 9.1 80.5 9.7 8.9 1.2 0.2 4.2 22 11 31.74
3 4 Ben SimmonsPHIL PG 36 35.4 14.9 6.1 10.8 56.3 0.1 0.1 40.0 2.7 4.6 59.0 7.5 8.6 2.2 0.7 3.6 19 3 19.49
4 5 Trae YoungATL PG 34 35.1 28.9 9.3 20.8 44.8 3.5 9.4 37.5 6.7 7.9 85.0 4.3 8.4 1.2 0.1 4.8 11 1 23.47
to become this:
print(df.head(5).to_string())
RK Name Team POS GP MIN PTS FGM FGA FG% 3PM 3PA 3P% FTM FTA FT% REB AST STL BLK TO DD2 TD3 PER
0 1 LeBron James LA SF 35 35.1 24.9 9.6 19.7 48.6 2.0 6.0 33.8 3.7 5.5 67.7 7.9 11.0 1.3 0.5 3.7 28 9 26.10
1 2 Ricky Rubio PHX PG 30 32.0 13.6 4.9 11.9 41.3 1.2 3.7 31.8 2.6 3.1 83.7 4.6 9.3 1.3 0.2 2.5 12 1 16.40
2 3 Luka Doncic DAL SF 32 32.8 29.7 9.6 20.2 47.5 3.1 9.4 33.1 7.3 9.1 80.5 9.7 8.9 1.2 0.2 4.2 22 11 31.74
3 4 Ben Simmons PHIL PG 36 35.4 14.9 6.1 10.8 56.3 0.1 0.1 40.0 2.7 4.6 59.0 7.5 8.6 2.2 0.7 3.6 19 3 19.49
4 5 Trae Young ATL PG 34 35.1 28.9 9.3 20.8 44.8 3.5 9.4 37.5 6.7 7.9 85.0 4.3 8.4 1.2 0.1 4.8 11 1 23.47
You may extract the data into two columns by using a regex like ^(.*?)([A-Z]+)$ or ^(.*[^A-Z])([A-Z]+)$:
df[['Name','Team']] = df['Name'].str.extract('^(.*?)([A-Z]+)$', expand=True)
This keeps everything up to the last character that is not an uppercase letter in the "Name" group, and the trailing uppercase letters in the "Team" group.
Details:
^ - start of string
(.*?) - capturing group 1: any zero or more chars other than line break chars, as few as possible
  or (.*[^A-Z]) - capturing group 1: any zero or more chars other than line break chars, as many as possible, up to the last char that is not an ASCII uppercase letter (granted the subsequent patterns match); note that this pattern implies there is at least 1 char before the last uppercase letters
([A-Z]+) - capturing group 2: one or more ASCII uppercase letters
$ - end of string
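The reason the original split left Team blank is that str.split discards whatever the pattern matches, so the team code is consumed as the delimiter; str.extract keeps the captured groups instead. A small offline sketch, using three names copied from the sample rows in the question:

```python
import pandas as pd

# Three values taken from the question's sample output.
df = pd.DataFrame({"Name": ["LeBron JamesLA", "Ricky RubioPHX", "Ben SimmonsPHIL"]})

# Group 1 is the name (lazy), group 2 is the trailing run of capitals.
df[["Name", "Team"]] = df["Name"].str.extract(r"^(.*?)([A-Z]+)$", expand=True)

print(df)
```

Rows that do not end in a capital letter would get NaN in both columns, so it may be worth checking for missing values after the extract.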
I have made a few alterations to your code; you will need to import the re package.
It's a bit manual, but I hope this will suffice. Have a great day!
import re

df_obj_skel = dict()
df_obj_skel['Name'] = list()
df_obj_skel['Team'] = list()
for index, row in df.iterrows():
    Name = row['Name']
    # trailing run of 2-4 capital letters, e.g. "LA" or "PHIL"
    Findings = re.search('[A-Z]{2,4}$', Name)
    # fall back to an empty team when nothing matches
    Refined_Team = Findings[0] if Findings else ''
    # strip the team code from the end of the name
    Refined_Name = re.sub(Refined_Team + "$", "", Name)
    df_obj_skel['Team'].append(Refined_Team)
    df_obj_skel['Name'].append(Refined_Name)
df_final = pd.DataFrame(df_obj_skel)
print(df_final)

beautifulsoup espn table, can't find the proper tag, pictures within

I am trying to scrape a table from the ESPN site, but I can't seem to find the right tag or class name to access it.
url="https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgAssists/dir/desc"
import requests
from bs4 import BeautifulSoup
headers={'User-Agent': 'Mozilla/5.0'}
response=requests.get(url,headers=headers)
soup=BeautifulSoup(response.content, 'html.parser')
soup.find_all('table',class_ ="ResponsiveTable ResponsiveTable--fixed-left mt4 Table2__title--remove-capitalization")
The code only gives me an empty list :(
If you have table tags, let pandas do the work for you. pd.read_html parses the page with lxml or BeautifulSoup under the hood.
import pandas as pd
url = "https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgAssists/dir/desc"
dfs = pd.read_html(url)
df = dfs[0].join(dfs[1])
df[['Name','Team']] = df['Name'].str.extract('^(.*?)([A-Z]+)$', expand=True)
Output:
print(df.head(5).to_string())
RK Name POS GP MIN PTS FGM FGA FG% 3PM 3PA 3P% FTM FTA FT% REB AST STL BLK TO DD2 TD3 PER Team
0 1 LeBron James SF 35 35.1 24.9 9.6 19.7 48.6 2.0 6.0 33.8 3.7 5.5 67.7 7.9 11.0 1.3 0.5 3.7 28 9 26.10 LAL
1 2 Ricky Rubio PG 30 32.0 13.6 4.9 11.9 41.3 1.2 3.7 31.8 2.6 3.1 83.7 4.6 9.3 1.3 0.2 2.5 12 1 16.40 PHX
2 3 Luka Doncic SF 32 32.8 29.7 9.6 20.2 47.5 3.1 9.4 33.1 7.3 9.1 80.5 9.7 8.9 1.2 0.2 4.2 22 11 31.74 DAL
3 4 Ben Simmons PG 36 35.4 14.9 6.1 10.8 56.3 0.1 0.1 40.0 2.7 4.6 59.0 7.5 8.6 2.2 0.7 3.6 19 3 19.49 PHI
4 5 Trae Young PG 34 35.1 28.9 9.3 20.8 44.8 3.5 9.4 37.5 6.7 7.9 85.0 4.3 8.4 1.2 0.1 4.8 11 1 23.47 ATL
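The dfs[0].join(dfs[1]) step works because the page delivers the name columns and the stat columns as two separate tables with the same row order, and join lines them up on the row index. A minimal sketch with two hand-built frames standing in for dfs[0] and dfs[1] (the values are a few cells copied from the output above):

```python
import pandas as pd

# Stand-ins for the two tables returned by pd.read_html.
left = pd.DataFrame({"RK": [1, 2], "Name": ["LeBron JamesLAL", "Ricky RubioPHX"]})
right = pd.DataFrame({"POS": ["SF", "PG"], "GP": [35, 30]})

# join aligns rows on the index, so row 0 of left meets row 0 of right.
df = left.join(right)

print(df)
```

If the two tables ever came back with different lengths or a shuffled index, the join would introduce NaN values, so comparing len(dfs[0]) and len(dfs[1]) first is a cheap sanity check.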
Why not just get the div with the flex class and then get the table of players?
import requests
from bs4 import BeautifulSoup
url="https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgAssists/dir/desc"
headers={'User-Agent': 'Mozilla/5.0'}
response=requests.get(url, headers=headers)
soup=BeautifulSoup(response.content, 'html.parser')
all_tables = soup.find('div', {'class':'flex'})
all_tables.find('table')  # the table with all player names
The tag you are selecting with
soup.find_all('table', class_="ResponsiveTable ResponsiveTable--fixed-left mt4 Table2__title--remove-capitalization")
should be 'section', not 'table':
soup.find_all('section', class_="ResponsiveTable ResponsiveTable--fixed-left mt4 Table2__title--remove-capitalization")
To get all data, you can use this example:
import requests
from bs4 import BeautifulSoup
url="https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgAssists/dir/desc"
headers={'User-Agent': 'Mozilla/5.0'}
response=requests.get(url,headers=headers)
soup=BeautifulSoup(response.content, 'html.parser')
for tr1, tr2 in zip(soup.select('table.Table.Table--align-right.Table--fixed.Table--fixed-left tr'),
                    soup.select('table.Table.Table--align-right.Table--fixed.Table--fixed-left ~ div tr')):
    data = tr1.select('td') + tr2.select('td')
    if not data:
        continue
    print('{:<25}'.format(data[1].get_text(strip=True, separator='-').split()[-1]), end=' ')
    for td in data[2:]:
        print('{:<6}'.format(td.get_text(strip=True)), end=' ')
    print()
Prints:
James-LAL SF 30 34.9 25.7 9.9 20.2 49.1 2.2 6.4 34.4 3.6 5.3 67.9 7.6 10.6 1.2 0.6 3.9 23 7 26.33
Rubio-PHX PG 25 31.8 13.8 5.0 12.2 41.0 1.1 3.7 30.1 2.7 3.2 84.8 4.8 9.2 1.2 0.2 2.6 11 1 16.30
Doncic-DAL SF 26 32.2 29.1 9.4 19.8 47.7 3.0 9.2 32.2 7.3 9.1 79.7 9.6 8.8 1.2 0.1 4.3 17 8 31.43
Simmons-PHI PG 32 34.9 14.3 5.9 10.4 56.3 0.1 0.2 40.0 2.5 4.3 58.3 7.0 8.6 2.2 0.6 3.7 15 2 18.92
Young-ATL PG 31 34.9 28.5 9.3 20.9 44.4 3.4 9.3 36.8 6.5 7.7 84.5 4.3 8.3 1.2 0.1 4.7 9 1 23.21
Graham-CHA PG 34 34.7 19.2 6.1 15.9 38.2 3.8 9.5 39.8 3.2 4.1 79.7 3.9 7.6 0.8 0.3 3.0 9 0 17.20
Brogdon-IND PG 26 31.4 18.3 6.6 14.5 45.2 1.4 4.3 33.3 3.8 4.0 93.3 4.5 7.6 0.9 0.2 2.7 7 0 20.31
Harden-HOU SG 31 37.6 38.1 11.1 24.5 45.2 5.1 13.8 37.2 10.9 12.4 87.5 5.8 7.5 1.9 0.7 4.7 9 0 31.72
Lillard-POR PG 30 36.7 26.9 8.4 19.0 44.3 3.4 9.4 35.8 6.6 7.4 89.6 4.2 7.5 1.0 0.4 2.9 6 0 24.42
Westbrook-HOU PG 28 35.3 24.1 8.9 20.9 42.6 1.2 5.1 23.8 5.1 6.5 79.1 8.1 7.1 1.5 0.4 4.4 12 6 18.68
VanVleet-TOR SG 26 36.3 18.1 5.9 14.5 40.5 2.4 6.6 36.8 3.9 4.5 87.2 3.9 7.0 2.0 0.2 2.6 5 0 16.82
Jokic-DEN C 30 31.3 17.6 7.0 14.4 48.5 1.3 4.1 30.6 2.4 3.0 82.0 10.0 6.8 1.0 0.6 2.5 17 6 23.01
...and so on.
You can also use the same API that the webpage uses to populate its table with player information. If you make a direct GET-request to that API (with the correct headers and query string), you will receive all the player information you could ever want in a JSON-compliant format.
The URL to the API, the relevant headers and query string GET-Parameters are all visible in Google Chrome's Network Log (most modern browsers have something equivalent). I was able to find them by applying a filter and retaining only XMLHttpRequest (XHR) resources, and then clicking the "Show More" button at the bottom of the table.
I've set the "limit" GET-Parameter to "3", because I was only interested in printing data pertaining to the first three players. Changing this string to "50", for example, will query the API for the first fifty players.
def main():
    import requests
    headers = {
        "accept": "application/json, text/plain, */*",
        "origin": "https://www.espn.com",
        "user-agent": "Mozilla/5.0"
    }
    params = {
        "region": "us",
        "lang": "en",
        "contentorigin": "espn",
        "isqualified": "true",
        "page": "1",
        "limit": "3",
        "sort": "offensive.avgAssists:desc"
    }
    base_url = "https://site.web.api.espn.com/apis/common/v3/sports/basketball/nba/statistics/byathlete"
    response = requests.get(base_url, headers=headers, params=params)
    response.raise_for_status()
    data = response.json()
    print(data["athletes"])
    return 0
if __name__ == "__main__":
    import sys
    sys.exit(main())
