I'm trying to use the basketball-reference API from Python with the requests and bs4 libraries.
from requests import get
from bs4 import BeautifulSoup
Here's a minimal example of what I'm trying to do:
# example request
r = get(f'https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fteams%2FMIL%2F2015.html&div=div_roster')
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find('table')
It all works well: I can then feed this table to pandas with read_html and get the data I need nicely packed into a dataframe.
The problem I have is the encoding.
In this particular request I got two NBA player names with weird characters: Ersan Ä°lyasova (Ersan İlyasova) and Jorge GutiÃ©rrez (Jorge Gutiérrez). In the current code they are interpreted as "Ersan Ä°lyasova" and "Jorge GutiÃ©rrez", which is obviously not what I want.
So the question is -- how do I fix it? This website seems to suggest they have the windows-1251 encoding, but I'm not sure how to use that information (in fact I'm not even sure if that's true).
I know I'm missing something fundamental here, as I'm a bit confused about how these encodings work and at which point they get "interpreted", so I'd be grateful for any help with this!
I'm not sure why you're using a format string there, and the question mixes up two separate things: you've copy/pasted the percent-encoded URL from the network traffic, and that URL quoting has nothing to do with character encoding. The snippet below should do it.
import pandas as pd
df = pd.read_html("https://www.basketball-reference.com/teams/MIL/2015.html")
print(df)
Output:
[ No. Player Pos ... Unnamed: 6 Exp College
0 34 Giannis Antetokounmpo SG ... gr 1 NaN
1 19 Jerryd Bayless PG ... us 6 Arizona
2 5 Michael Carter-Williams PG ... us 1 Syracuse
3 9 Jared Dudley SG ... us 7 Boston College
4 11 Tyler Ennis PG ... ca R Syracuse
5 13 Jorge Gutiérrez PG ... mx 1 California
6 31 John Henson C ... us 2 UNC
7 7 Ersan İlyasova PF ... tr 6 NaN
8 23 Chris Johnson SF ... us 2 Dayton
9 11 Brandon Knight PG ... us 3 Kentucky
10 5 Kendall Marshall PG ... us 2 UNC
11 6 Kenyon Martin PF ... us 14 Cincinnati
12 0 O.J. Mayo SG ... us 6 USC
13 22 Khris Middleton SF ... us 2 Texas A&M
14 3 Johnny O'Bryant PF ... us R LSU
15 27 Zaza Pachulia C ... ge 11 NaN
16 12 Jabari Parker PF ... us R Duke
17 21 Miles Plumlee C ... us 2 Duke
18 8 Larry Sanders C ... us 4 Virginia Commonwealth
19 6 Nate Wolters PG ... us 1 South Dakota State
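If you want to stay with requests + BeautifulSoup, the actual encoding problem is this: the response bytes are UTF-8, but when the Content-Type header carries no charset, requests falls back to ISO-8859-1 for r.text, which turns the UTF-8 bytes of İ (0xC4 0xB0) into "Ä°". A minimal sketch of two fixes, assuming the payload really is UTF-8 (check r.encoding and r.apparent_encoding to confirm):
import requests
from bs4 import BeautifulSoup

url = ('https://widgets.sports-reference.com/wg.fcgi'
       '?css=1&site=bbr&url=%2Fteams%2FMIL%2F2015.html&div=div_roster')
r = requests.get(url)

# Fix 1: override the guessed encoding before reading r.text.
r.encoding = 'utf-8'  # assumption: the bytes are actually UTF-8
soup = BeautifulSoup(r.text, 'html.parser')

# Fix 2: pass the raw bytes and name the encoding explicitly,
# rather than letting BeautifulSoup's detection guess.
soup = BeautifulSoup(r.content, 'html.parser', from_encoding='utf-8')

table = soup.find('table')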
I am trying to scrape a table on a webpage as part of an assignment using Python. I want to scrape all 618 records of the table, which are spread across 13 pages at the same URL. However, my program only scrapes the first page of the table. The URL is in my code below:
from bs4 import BeautifulSoup as bs
import requests as r
base_URL = 'https://www.nba.com/players'
def scrape_webpage(URL):
    player_names = []
    page = r.get(URL)
    print(f'{page.status_code}')
    soup = bs(page.content, 'html.parser')
    raw_player_names = soup.find_all('div', class_='flex flex-col lg:flex-row')
    for name in raw_player_names:
        player_names.append(name.get_text().strip())
    print(player_names)

scrape_webpage(base_URL)
The player data is embedded inside a <script> element in the page. You can decode it with this example:
import re
import json
import requests
import pandas as pd
url = "https://www.nba.com/players"
data = re.search(r'({"props":.*})', requests.get(url).text).group(0)
data = json.loads(data)
# uncomment to print all data:
# print(json.dumps(data, indent=4))
df = pd.DataFrame(data["props"]["pageProps"]["players"])
print(df.head().to_markdown())
Prints:
|    | PERSON_ID | PLAYER_LAST_NAME | PLAYER_FIRST_NAME | PLAYER_SLUG | TEAM_ID | TEAM_SLUG | IS_DEFUNCT | TEAM_CITY | TEAM_NAME | TEAM_ABBREVIATION | JERSEY_NUMBER | POSITION | HEIGHT | WEIGHT | COLLEGE | COUNTRY | DRAFT_YEAR | DRAFT_ROUND | DRAFT_NUMBER | ROSTER_STATUS | FROM_YEAR | TO_YEAR | PTS | REB | AST | STATS_TIMEFRAME | PLAYER_LAST_INITIAL | HISTORIC |
|----|-----------|------------------|-------------------|-------------|---------|-----------|------------|-----------|-----------|-------------------|---------------|----------|--------|--------|---------|---------|------------|-------------|--------------|---------------|-----------|---------|-----|-----|-----|-----------------|---------------------|----------|
| 0 | 1630173 | Achiuwa | Precious | precious-achiuwa | 1610612761 | raptors | 0 | Toronto | Raptors | TOR | 5 | F | 6-8 | 225 | Memphis | Nigeria | 2020 | 1 | 20 | 1 | 2020 | 2021 | 9.1 | 6.5 | 1.1 | Season | A | False |
| 1 | 203500 | Adams | Steven | steven-adams | 1610612763 | grizzlies | 0 | Memphis | Grizzlies | MEM | 4 | C | 6-11 | 265 | Pittsburgh | New Zealand | 2013 | 1 | 12 | 1 | 2013 | 2021 | 6.9 | 10 | 3.4 | Season | A | False |
| 2 | 1628389 | Adebayo | Bam | bam-adebayo | 1610612748 | heat | 0 | Miami | Heat | MIA | 13 | C-F | 6-9 | 255 | Kentucky | USA | 2017 | 1 | 14 | 1 | 2017 | 2021 | 19.1 | 10.1 | 3.4 | Season | A | False |
| 3 | 1630583 | Aldama | Santi | santi-aldama | 1610612763 | grizzlies | 0 | Memphis | Grizzlies | MEM | 7 | F-C | 6-11 | 215 | Loyola-Maryland | Spain | 2021 | 1 | 30 | 1 | 2021 | 2021 | 4.1 | 2.7 | 0.7 | Season | A | False |
| 4 | 200746 | Aldridge | LaMarcus | lamarcus-aldridge | 1610612751 | nets | 0 | Brooklyn | Nets | BKN | 21 | C-F | 6-11 | 250 | Texas-Austin | USA | 2006 | 1 | 2 | 1 | 2006 | 2021 | 12.9 | 5.5 | 0.9 | Season | A | False |
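Note that this blob should contain every player, not just the ones shown on the first page; the 13 pages are client-side pagination over the same payload, so there is nothing extra to crawl. The {"props": ...} shape looks like a Next.js __NEXT_DATA__ script, so if you'd rather not rely on a greedy regex, here is a sketch that targets the tag directly (assuming the page really does embed its state under that id):
import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.nba.com/players").text
soup = BeautifulSoup(html, "html.parser")

# Assumption: Next.js pages store their state in <script id="__NEXT_DATA__">.
tag = soup.find("script", id="__NEXT_DATA__")
data = json.loads(tag.string)

df = pd.DataFrame(data["props"]["pageProps"]["players"])
print(len(df))  # should report the full roster count, not one page's worth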
I'm preparing for a new job where I'll be receiving data submissions of varying quality; oftentimes dates/chars/etc. are combined nonsensically and must be separated before analysis. I'm thinking ahead about how this might be solved.
Using a fictitious example below, I combined region, rep, and product together.
file['combine'] = file['Region'] + file['Sales Rep'] + file['Product']
Shift Region Sales Rep Product Cost per Units Sold combine
0 3 East Shirlene Pencil 5 71 EastShirlenePencil
1 3 South Anderson Folder 17 69 SouthAndersonFolder
2 3 West Shelli Folder 17 185 WestShelliFolder
3 3 South Damion Binder 30 159 SouthDamionBinder
4 3 West Shirlene Stapler 25 41 WestShirleneStapler
Assuming no other data, the question is, how can the 'combine' column be split up?
Many thanks in advance!
If you want spaces between the strings, you can do:
df["combine"] = df[["Region", "Sales Rep", "Product"]].apply(" ".join, axis=1)
print(df)
Prints:
Shift Region Sales Rep Product Cost per Units Sold combine
0 3 East Shirlene Pencil 5 71 East Shirlene Pencil
1 3 South Anderson Folder 17 69 South Anderson Folder
2 3 West Shelli Folder 17 185 West Shelli Folder
3 3 South Damion Binder 30 159 South Damion Binder
4 3 West Shirlene Stapler 25 41 West Shirlene Stapler
Or, if you want to split the already combined string:
import re
df["separated"] = df["combine"].apply(lambda x: re.findall(r"[A-Z][^A-Z]*", x))
print(df)
Prints:
Shift Region Sales Rep Product Cost per Units Sold combine separated
0 3 East Shirlene Pencil 5 71 EastShirlenePencil [East, Shirlene, Pencil]
1 3 South Anderson Folder 17 69 SouthAndersonFolder [South, Anderson, Folder]
2 3 West Shelli Folder 17 185 WestShelliFolder [West, Shelli, Folder]
3 3 South Damion Binder 30 159 SouthDamionBinder [South, Damion, Binder]
4 3 West Shirlene Stapler 25 41 WestShirleneStapler [West, Shirlene, Stapler]
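One caveat: the regex assumes each field starts with exactly one capital letter, so a value like "McGee" would split into "Mc" and "Gee". Under the same assumption, the split can also be written with pandas' vectorized string accessor instead of apply:
df["separated"] = df["combine"].str.findall(r"[A-Z][^A-Z]*")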
Can I have a two-line caption on a pandas dataframe?
Create the dataframe with:
import pandas as pd

df = pd.DataFrame({'Name': ['John','Harry','Gary','Richard','Anna','Richard','Gary','Richard'], 'Age': [25,32,37,43,44,56,37,22], 'Zone': ['East','West','North','South','East','West','North','South']})
df = df.drop_duplicates('Name', keep='first')
df.style.set_caption("Team Members Per Zone")
which outputs:
Team Members Per Zone
Name Age Zone
0 John 25 East
1 Harry 32 West
4 Anna 44 East
6 Gary 37 North
7 Richard 22 South
However I'd like it to look like:
Team Members
Per Zone
Name Age Zone
0 John 25 East
1 Harry 32 West
4 Anna 44 East
6 Gary 37 North
7 Richard 22 South
Using an HTML break works for me in JupyterLab:
df.style.set_caption('This is line one <br> This is line two')
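This works because the Styler renders to HTML, where a literal newline in the caption collapses to a space; the <br> tag is what actually produces the second line. A quick way to inspect the emitted markup (a sketch, assuming pandas >= 1.3 for Styler.to_html):
styled = df.style.set_caption("Team Members <br> Per Zone")
# The caption is emitted as raw HTML, so the <br> survives into the output.
print(styled.to_html()[:300])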
Have you tried with \n? (Sorry, too low reputation to just comment.)
I'm trying to print all the fields that have England in them. The current code I have prints all the nationalities into a txt file for me, but I want just the England fields to print. The page I'm pulling from is https://www.premierleague.com/players
import requests
from bs4 import BeautifulSoup
r=requests.get("https://www.premierleague.com/players")
c=r.content
soup=BeautifulSoup(c, "html.parser")
players = open("playerslist.txt", "w+")
for playerCountry in soup.findAll("span", {"class":"playerCountry"}):
    players.write(playerCountry.text.strip())
    players.write("\n")
You just need to check whether the country is not 'England' and, if so, skip to the next item:
import requests
from bs4 import BeautifulSoup
r=requests.get("https://www.premierleague.com/players")
c=r.content
soup=BeautifulSoup(c, "html.parser")
players = open("playerslist.txt", "w+")
for playerCountry in soup.findAll("span", {"class":"playerCountry"}):
    if playerCountry.text.strip() != 'England':
        continue
    players.write(playerCountry.text.strip())
    players.write("\n")
Or, you could just use pandas.read_html() and a couple of lines of code. The same comparison works as a dataframe filter; here it is with the inverse condition (everyone except England):
import pandas as pd
df = pd.read_html("https://www.premierleague.com/players")[0]
print(df.loc[df['Nationality'] != 'England'])
Prints:
Player Position Nationality
2 Charlie Adam Midfielder Scotland
3 Adrián Goalkeeper Spain
4 Adrien Silva Midfielder Portugal
5 Ibrahim Afellay Midfielder Netherlands
6 Benik Afobe Forward The Democratic Republic Of Congo
7 Sergio Agüero Forward Argentina
9 Soufyan Ahannach Midfielder Netherlands
10 Ahmed Hegazi Defender Egypt
11 Nathan Aké Defender Netherlands
14 Toby Alderweireld Defender Belgium
15 Aleix García Midfielder Spain
17 Ali Gabr Defender Egypt
18 Allan Nyom Defender Cameroon
19 Allan Souza Midfielder Brazil
20 Joe Allen Midfielder Wales
22 Marcos Alonso Defender Spain
23 Paulo Alves Midfielder Portugal
24 Daniel Amartey Midfielder Ghana
25 Jordi Amat Defender Spain
27 Ethan Ampadu Defender Wales
28 Nordin Amrabat Forward Morocco
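And to match the question as asked (only the England rows), just flip the comparison:
print(df.loc[df['Nationality'] == 'England'])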
The following groupby on my pandas dataframe df:
grouped = df[(df['X'] == 'venture') & (df['company_code'].isin(['TDS','XYZ','UVW']))].groupby(['company_code','sector'])['X_sector'].count()
The output of this is as follows:
company_code sector
TDS Meta 404
Electrical 333
Mechanical 533
Agri 453
XYZ Sports 331
Electrical 354
Movies 375
Manufacturing 355
UVW Sports 505
Robotics 345
Movies 56
Health 3263
Manufacturing 456
Others 524
Name: X_sector, dtype: int64
What I want to get is the top three sectors within each company code.
What is the way to do it?
You will have to chain a groupby here. Consider this example:
import pandas as pd
import numpy as np
np.random.seed(111)
names = [
'Robert Baratheon',
'Jon Snow',
'Daenerys Targaryen',
'Theon Greyjoy',
'Tyrion Lannister'
]
df = pd.DataFrame({
'season': np.random.randint(1, 7, size=100),
'actor': np.random.choice(names, size=100),
'appearance': 1
})
s = df.groupby(['season','actor'])['appearance'].count()
print(s.sort_values(ascending=False).groupby('season').head(1)) # <-- head(3) for 3 values
Returns:
season actor
4 Daenerys Targaryen 7
6 Robert Baratheon 6
3 Robert Baratheon 6
5 Jon Snow 5
2 Theon Greyjoy 5
1 Jon Snow 4
Where s is (clipped at season 4):
season actor
1 Daenerys Targaryen 2
Jon Snow 4
Robert Baratheon 2
Theon Greyjoy 3
Tyrion Lannister 4
2 Daenerys Targaryen 4
Jon Snow 3
Robert Baratheon 1
Theon Greyjoy 5
Tyrion Lannister 3
3 Daenerys Targaryen 2
Jon Snow 1
Robert Baratheon 6
Theon Greyjoy 3
Tyrion Lannister 3
4 ...
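Applied to your data, the same chain would be (a sketch, assuming grouped is the series from your groupby above):
top3 = grouped.sort_values(ascending=False).groupby('company_code').head(3)
print(top3)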
Why complicate things when a one-liner will do:
Z = df.groupby('company_code')['sector'].value_counts().groupby(level=0).head(3).sort_values(ascending=False).to_frame('counts').reset_index()
Z
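This works because value_counts already sorts each company's sectors in descending order, so head(3) on the level-0 groupby picks the top three per company; to_frame and reset_index then flatten the result into a regular dataframe.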