I'm downloading football data with pandas' read_html function, but I'm struggling to clean the player names, which are full of accented characters. This is what I have so far:
import pandas as pd
from unidecode import unidecode
shooting = pd.read_html("https://widgets.sports-reference.com/wg.fcgi?css=1&site=fb&url=%2Fen%2Fcomps%2F9%2Fshooting%2FPremier-League-Stats&div=div_stats_shooting")
for idx, table in enumerate(shooting):
    print("***************************")
    print(idx)
    print(table)
    shooting = table

for col in [('Unnamed: 1_level_0', 'Player')]:
    shooting[col] = shooting[col].apply(unidecode)

shooting
shooting = table
#print(shooting.droplevel(1))
table.to_csv(r'C:\Users\khabs\OneDrive\Documents\Python Testing\shooting.csv', index=False, header=True)
print(shooting)
I think the issue is that the encoding is messed up before I even do the cleaning, but I'm really not sure.
Any help would be greatly appreciated!!
Just use the encoding parameter in pandas.read_html.
import pandas as pd
url = "https://widgets.sports-reference.com/wg.fcgi?css=1&site=fb&url=%2Fen%2Fcomps%2F9%2Fshooting%2FPremier-League-Stats&div=div_stats_shooting"
shooting = pd.read_html(url, header=1, encoding='utf8')[0]
However, I'm assuming that will not get you what you want, as there are extra HTML characters in the response returned by that widget.
Just go after the actual HTML instead. The table is inside HTML comments.
import requests
import pandas as pd

url = 'https://fbref.com/en/comps/9/shooting/Premier-League-Stats'
# Strip the comment markers so the table hidden inside HTML comments becomes parseable
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
shooting = pd.read_html(html, header=1)[-1]
# Drop the header rows that the site repeats within the table body
shooting = shooting[shooting['Rk'].ne('Rk')]
Output:
print(shooting.head(10))
Rk Player Nation Pos ... npxG/Sh G-xG np:G-xG Matches
0 1 Brenden Aaronson us USA FW,MF ... 0.03 -0.1 -0.1 Matches
1 2 Che Adams sct SCO FW ... 0.09 +1.6 +1.6 Matches
2 3 Tyler Adams us USA MF ... 0.01 0.0 0.0 Matches
3 4 Tosin Adarabioyo eng ENG DF ... NaN 0.0 0.0 Matches
4 5 Rayan Aït Nouri fr FRA DF ... 0.08 -0.1 -0.1 Matches
5 6 Nathan Aké nl NED DF ... 0.05 -0.2 -0.2 Matches
6 7 Thiago Alcántara es ESP MF ... NaN 0.0 0.0 Matches
7 8 Trent Alexander-Arnold eng ENG DF ... 0.05 -0.2 -0.2 Matches
8 9 Alisson br BRA GK ... NaN 0.0 0.0 Matches
9 10 Dele Alli eng ENG FW,MF ... NaN 0.0 0.0 Matches
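If you still want ASCII-only names, as in your original attempt, you can apply unidecode once the table has been parsed with the correct encoding; a minimal sketch using the shooting dataframe from above:

from unidecode import unidecode

# Transliterate accented characters, e.g. 'Nathan Aké' -> 'Nathan Ake'
shooting['Player'] = shooting['Player'].apply(unidecode)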
I have a little code for scraping info from fbref (link for the data: https://fbref.com/en/comps/9/stats/Premier-League-Stats) and it worked well, but now I have problems with some features (I've checked that the fields which don't work now are "player", "nationality", "position", "squad", "age", "birth_year"). I have also checked that the fields have the same names on the web as they used to. Any ideas/help to solve the problem?
Many Thanks!
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
import sys, getopt
import csv
def get_tables(url):
    res = requests.get(url)
    ## The next two lines get around the issue with comments breaking the parsing.
    comm = re.compile("<!--|-->")
    soup = BeautifulSoup(comm.sub("", res.text), 'lxml')
    all_tables = soup.findAll("tbody")
    team_table = all_tables[0]
    player_table = all_tables[1]
    return player_table, team_table

def get_frame(features, player_table):
    pre_df_player = dict()
    features_wanted_player = features
    rows_player = player_table.find_all('tr')
    for row in rows_player:
        if row.find('th', {"scope": "row"}) is not None:
            for f in features_wanted_player:
                cell = row.find("td", {"data-stat": f})
                a = cell.text.strip().encode()
                text = a.decode("utf-8")
                if text == '':
                    text = '0'
                if (f != 'player') & (f != 'nationality') & (f != 'position') & (f != 'squad') & (f != 'age') & (f != 'birth_year'):
                    text = float(text.replace(',', ''))
                if f in pre_df_player:
                    pre_df_player[f].append(text)
                else:
                    pre_df_player[f] = [text]
    df_player = pd.DataFrame.from_dict(pre_df_player)
    return df_player

stats = ["player","nationality","position","squad","age","birth_year","games","games_starts","minutes","goals","assists","pens_made","pens_att","cards_yellow","cards_red","goals_per90","assists_per90","goals_assists_per90","goals_pens_per90","goals_assists_pens_per90","xg","npxg","xa","xg_per90","xa_per90","xg_xa_per90","npxg_per90","npxg_xa_per90"]

def frame_for_category(category, top, end, features):
    url = top + category + end
    player_table, team_table = get_tables(url)
    df_player = get_frame(features, player_table)
    return df_player

top = 'https://fbref.com/en/comps/9/'
end = '/Premier-League-Stats'
df1 = frame_for_category('stats', top, end, stats)
df1
I suggest loading the table with pandas' read_html. There is a direct link to this table under Share & Export --> Embed this Table.
import pandas as pd
df = pd.read_html("https://widgets.sports-reference.com/wg.fcgi?css=1&site=fb&url=%2Fen%2Fcomps%2F9%2Fstats%2FPremier-League-Stats&div=div_stats_standard", header=1)
This outputs a list of dataframes; the table can be accessed as df[0]. Output of df[0].head():
   Rk               Player   Nation  Pos           Squad     Age  Born  MP  Starts   Min   90s  Gls  Ast  G-PK  PK  PKatt  CrdY  CrdR  Gls.1  Ast.1   G+A  G-PK.1  G+A-PK   xG  npxG   xA  npxG+xA  xG.1  xA.1  xG+xA  npxG.1  npxG+xA.1  Matches
0   1  Patrick van Aanholt   nl NED   DF  Crystal Palace  30-190  1990  16      15  1324  14.7    0    1     0   0      0     1     0      0   0.07  0.07       0    0.07  1.2   1.2  0.8        2  0.08  0.05   0.13    0.08       0.13  Matches
1   2        Tammy Abraham  eng ENG   FW         Chelsea  23-156  1997  20      12  1021  11.3    6    1     6   0      0     0     0   0.53   0.09  0.62    0.53    0.62  5.6   5.6  0.9      6.5  0.49  0.08   0.57    0.49       0.57  Matches
2   3            Che Adams  eng ENG   FW     Southampton  24-237  1996  26      22  1985  22.1    5    4     5   0      0     1     0   0.23   0.18  0.41    0.23    0.41  5.5   5.5  4.3      9.9  0.25   0.2   0.45    0.25       0.45  Matches
3   4     Tosin Adarabioyo  eng ENG   DF          Fulham  23-164  1997  23      23  2070    23    0    0     0   0      0     1     0      0      0     0       0       0    1     1  0.1      1.1  0.04  0.01   0.05    0.04       0.05  Matches
4   5               Adrián   es ESP   GK       Liverpool  34-063  1987   3       3   270     3    0    0     0   0      0     0     0      0      0     0       0       0    0     0    0        0     0     0      0       0          0  Matches
If you're only after the player stats, change player_table = all_tables[1] to player_table = all_tables[2], because currently you are feeding the team table into the get_frame function.
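A minimal sketch of the change inside get_tables, assuming the page layout described above:

all_tables = soup.findAll("tbody")
team_table = all_tables[0]
player_table = all_tables[2]  # index 1 now holds a team table, not the player rows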
I tried it and it worked fine after that.
I am scraping some NBA data with Python. I have the following script:
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import pandas as pd

def scrape_data():
    # URL
    url = "https://basketball-reference.com/leagues/NBA_2020_advanced.html"
    html = urlopen(url)
    soup = bs(html, 'html.parser')
    soup.findAll('tr', limit=2)
    headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
    headers = headers[1:]
    rows = soup.findAll('tr')[1:]
    player_stats = [[td.getText() for td in rows[i].findAll('td')] for i in range(len(rows))]
    stats = pd.DataFrame(player_stats, columns=headers)
    stats.head(10)
    return stats
Which returns this:
Player Pos Age Tm G ... OBPM DBPM BPM VORP
0 Steven Adams C 26 OKC 43 ... 1.6 3.3 4.9 2.0
1 Bam Adebayo PF 22 MIA 47 ... 1.2 3.8 5.0 2.8
2 LaMarcus Aldridge C 34 SAS 43 ... 1.7 0.6 2.4 1.6
3 Nickeil Alexander-Walker SG 21 NOP 38 ... -3.4 -2.3 -5.6 -0.4
4 Grayson Allen SG 24 MEM 30 ... -0.7 -2.8 -3.5 -0.2
.. ... .. .. ... .. ... .. ... ... ... ...
537 Thaddeus Young PF 31 CHI 49 ... -2.2 0.9 -1.3 0.2
538 Trae Young PG 21 ATL 44 ... 7.8 -2.3 5.5 2.9
539 Cody Zeller C 27 CHO 45 ... 0.0 -0.6 -0.6 0.4
540 Ante Žižić C 23 CLE 16 ... -2.3 -1.4 -3.6 -0.1
541 Ivica Zubac C 22 LAC 48 ... 0.4 2.3 2.7 1.0
I want to scrape a second URL, where the table is formatted exactly the same, and append the players' stats from that table to this one. The problem is that on the second URL there will be a few stats that are in both tables. I don't want to add these in again when I'm "merging" the two tables. How do I go about this?
You're doing a ton of work to put a <table> tag into a table. Let pandas do that for you (it uses BeautifulSoup under the hood). Then to merge, there are two ways you can do it:
1) Make the second dataframe keep only the columns that are not contained in the first (but keep the columns you will merge on).
2) Drop the columns from the second dataframe that are already in the first (again, making sure not to drop the columns you will merge on).
import pandas as pd

def scrape_data(url):
    stats = pd.read_html(url)[0]
    return stats

df1 = scrape_data("https://basketball-reference.com/leagues/NBA_2020_advanced.html")
df1 = df1[df1['Rk'] != 'Rk']
df2 = scrape_data("https://basketball-reference.com/leagues/NBA_2020_per_poss.html")
df2 = df2[df2['Rk'] != 'Rk']
uniqueCols = [col for col in df2.columns if col not in df1.columns]
# Below will do the same as the above line
#uniqueCols = list(df2.columns.difference(df1.columns))
df2 = df2[uniqueCols + ['Player', 'Tm']]
df = df1.merge(df2, how='left', on=['Player', 'Tm'])
OR
import pandas as pd

def scrape_data(url):
    stats = pd.read_html(url)[0]
    return stats

df1 = scrape_data("https://basketball-reference.com/leagues/NBA_2020_advanced.html")
df1 = df1[df1['Rk'] != 'Rk']
df2 = scrape_data("https://basketball-reference.com/leagues/NBA_2020_per_poss.html")
df2 = df2[df2['Rk'] != 'Rk']
dropCols = [col for col in df1.columns if col in df2.columns and col not in ['Player', 'Tm']]
df2 = df2.drop(dropCols, axis=1)
df = df1.merge(df2, how='left', on=['Player', 'Tm'])
I think you want to use drop_duplicates(). Here's a simplified example:
import pandas as pd
df = pd.DataFrame([["foo", "bar"],["foo2", "bar2"],["foo3", "bar3"]], columns=["first_column", "second_column"])
df2 = pd.DataFrame([["foo3", "bar4"],["foo4", "bar5"],["foo5", "bar6"]], columns=["first_column", "second_column"])
print(pd.concat([df, df2], ignore_index=True).drop_duplicates(subset="first_column"))
Output:
first_column second_column
0 foo bar
1 foo2 bar2
2 foo3 bar3
4 foo4 bar5
5 foo5 bar6
As you can see, the "foo3" row from the second dataframe gets filtered out because it is already contained in the first dataframe.
In your case you would use something like:
pd.concat([stats, stats2], ignore_index=True).drop_duplicates(subset="Player")
I have a dataset that includes 5 columns (excuse the formatting):
id         Price  Service  Rater Name  Cleanliness
401013357  5      3        A           1
401014972  2      1        A           5
401022510  3      4        B           2
401022510  5      1        C           9
401022510  3      1        D           4
401022510  2      2        E           2
I would like there to be only one row for each ID. Therefore, I need to create columns for each of the raters' names and rating categories (e.g. Rater Name Price, Rater Name Service, Rater Name Cleanliness), each in its own column.
I've explored groupby but cannot figure out how to manipulate these into new columns. Thank you!
Here's the code and data I'm actually using:
import requests
from pandas import DataFrame
import pandas as pd

linesinfo_url = 'https://api.collegefootballdata.com/lines?year=2018&seasonType=regular'
linesresp = requests.get(linesinfo_url)
dflines = DataFrame(linesresp.json())
# nested data in lines like game info
# setting game ID as index
dflines.set_index('id', inplace=True)
a = linesresp.json()
# defining a as the response to our get request for this data, in JSON format
buf = []
# i believe this creates a receptacle for nested data I'm extracting from json
for game in a:
    for line in game['lines']:
        game_dict = dict(id=game['id'])
        for cat in ('provider', 'spread', 'formattedSpread', 'overUnder'):
            game_dict[cat] = line[cat]
        buf.append(game_dict)
dflinestable = pd.DataFrame(buf)
dflinestable.set_index(['id', 'provider'])
From this, I get:
formattedSpread overUnder spread
id provider
401013357 consensus UMass -21 68.0 -21.0
401014972 consensus Rice -22.5 58.5 -22.5
401022510 Caesars Colorado State -17.5 57.5 -17.5
consensus Colorado State -17 57.5 -17.0
numberfire Colorado State -17 58.5 -17.0
teamrankings Colorado State -17 58.0 -17.0
401013437 numberfire Wyoming -5 47.0 5.0
teamrankings Wyoming -5 47.0 5.0
401020671 consensus Ball State -19.5 61.5 -19.5
401019470 Caesars UCF -22.5 NaN 22.5
consensus UCF -22.5 NaN 22.5
numberfire UCF -24 70.0 24.0
teamrankings UCF -24 70.0 24.0
401013328 numberfire Minnesota -21.5 47.0 -21.5
teamrankings Minnesota -21.5 49.0 -21.5
The outcome I am looking for is for each of the 4 different providers to have three columns each, so that it's caesars_formattedSpread, caesars_overUnder, Caesars spread, numberfire_formattedSpread, numberfire_overUnder, numberfire_spread, etc.
When I run unstack as suggested, I don't get what I expect. Instead I get:
formattedSpread 0 UMass -21
1 Rice -22.5
2 Colorado State -17.5
3 Colorado State -17
4 Colorado State -17
5 Colorado State -17
6 Wyoming -5
7 Wyoming -5
8 Ball State -19.5
9 UCF -22.5
10 UCF -22.5
11 UCF -24
12 UCF -24
* Edited, based on the edited question *
Given that your dataframe is df:
df = df.set_index(['id', 'Rater Name']) # Make it a Multi Index
df_unstacked = df.unstack()
The problem with your edited code is that you don't assign dflinestable.set_index(['id', 'provider']) to anything, so when you then use dflinestable.unstack(), you are unstacking the original dflinestable.
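Either fix works; a quick sketch of the two equivalent forms:

# Option 1: reassign the result
dflinestable = dflinestable.set_index(['id', 'provider'])

# Option 2: modify the dataframe in place
dflinestable.set_index(['id', 'provider'], inplace=True)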
So with your entire code, it should be:
import requests
import pandas as pd

linesinfo_url = 'https://api.collegefootballdata.com/lines?year=2018&seasonType=regular'
linesresp = requests.get(linesinfo_url)
dflines = pd.DataFrame(linesresp.json())
# nested data in lines like game info
# setting game ID as index
dflines.set_index('id', inplace=True)
a = linesresp.json()
# defining a as the response to our get request for this data, in JSON format
buf = []
# i believe this creates a receptacle for nested data I'm extracting from json
for game in a:
    for line in game['lines']:
        game_dict = dict(id=game['id'])
        for cat in ('provider', 'spread', 'formattedSpread', 'overUnder'):
            game_dict[cat] = line[cat]
        buf.append(game_dict)
dflinestable = pd.DataFrame(buf)
dflinestable.set_index(['id', 'provider'], inplace=True)  # Add inplace=True
dflinestable_unstacked = dflinestable.unstack()  # unstack (you could also reassign to the same df)
# Flatten columns to single level, in the order as described
dflinestable_unstacked.columns = [f'{j}_{i}' for i, j in dflinestable_unstacked.columns]
This will give you a DataFrame like (abbreviated):
Caesars_formattedSpread ... teamrankings_spread
id ...
401012246 Alabama -24 ... -23.5
401012247 Arkansas -34 ... NaN
401012248 Auburn -1 ... -1.5
401012249 NaN ... NaN
401012250 Georgia -44 ... NaN
I'm trying to retrieve financial information from reuters.com, especially the long term growth rates of companies. The element I want to scrape doesn't appear on all webpages; in my example, not for the ticker 'AMCR'. All scraped info should be appended to a list.
I've already figured out how to skip the element if it doesn't exist, but instead of "NaN" being appended in the place where it belongs, it ends up as the last element of the list.
import requests
from bs4 import BeautifulSoup
LTGRMean = []
tickers = ['MMM','AES','LLY','LOW','PWR','TSCO','YUM','ICE','FB','AAPL','AMCR','FLS','GOOGL','FB','MSFT']
This is the result I get:
Ticker LTGRMean
0 MMM 3.70
1 AES 9.00
2 LLY 10.42
3 LOW 13.97
4 PWR 12.53
5 TSCO 11.44
6 YUM 15.08
7 ICE 8.52
8 FB 19.07
9 AAPL 12.00
10 AMCR 19.04
11 FLS 16.14
12 GOOGL 19.07
13 FB 14.80
14 MSFT NaN
My custom text "not existing" isn't appearing either.
For AMCR, where Reuters doesn't provide any information, the growth rate of FLS (19.04) is set instead. As a result, all info is shifted up one index, and the NaN that should appear next to AMCR ends up next to MSFT.
The DataFrame stack() method pivots the columns into rows at level 1. Because each dict appended to LTGRMean holds a single ticker, the resulting DataFrame has one column per ticker that is empty everywhere except its own row; stack() drops those missing values, so each value stays aligned with its ticker.
import requests
from bs4 import BeautifulSoup
import pandas as pd

LTGRMean = []
tickers = ['MMM', 'AES', 'LLY', 'LOW', 'PWR', 'TSCO', 'YUM', 'ICE', 'FB', 'AAPL', 'AMCR', 'FLS', 'GOOGL', 'FB', 'MSFT']

for i in tickers:
    Test = requests.get('https://www.reuters.com/finance/stocks/financial-highlights/' + i)
    ReutSoup = BeautifulSoup(Test.content, 'html.parser')
    td = ReutSoup.find('td', string="LT Growth Rate (%)")
    my_dict = {}
    # validate td object is not None
    if td is not None:
        result = td.findNext('td').findNext('td').text
    else:
        result = "NaN"
    my_dict[i] = result
    LTGRMean.append(my_dict)

df = pd.DataFrame(LTGRMean)
print(df.stack())
Output:
0 MMM 3.70
1 AES 9.00
2 LLY 10.42
3 LOW 13.97
4 PWR 12.53
5 TSCO 11.44
6 YUM 15.08
7 ICE 8.52
8 FB 19.90
9 AAPL 12.00
10 AMCR NaN
11 FLS 19.04
12 GOOGL 16.14
13 FB 19.90
14 MSFT 14.80
dtype: object
How do I turn a table like this (a batting gamelogs table) into a CSV file using Python and BeautifulSoup?
I want the first header, where it says Rk, Gcar, Gtm, etc., and not any of the other headers within the table (the ones for each month of the season).
Here is the code I have so far:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv
def stir_the_soup():
    player_links = open('player_links.txt', 'r')
    player_ID_nums = open('player_ID_nums.txt', 'r')
    id_nums = [x.rstrip('\n') for x in player_ID_nums]
    idx = 0
    for url in player_links:
        print url
        soup = BeautifulSoup(urlopen(url), "lxml")
        p_type = ""
        if url[-12] == 'p':
            p_type = "pitching"
        elif url[-12] == 'b':
            p_type = "batting"
        table = soup.find(lambda tag: tag.name == 'table' and tag.has_attr('id') and tag['id'] == (p_type + "_gamelogs"))
        header = [[val.text.encode('utf8') for val in table.find_all('thead')]]
        rows = []
        for row in table.find_all('tr'):
            rows.append([val.text.encode('utf8') for val in row.find_all('th')])
            rows.append([val.text.encode('utf8') for val in row.find_all('td')])
        with open("%s.csv" % id_nums[idx], 'wb') as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(row for row in rows if row)
        idx += 1
    player_links.close()

if __name__ == "__main__":
    stir_the_soup()
The id_nums list contains all of the ID numbers for each player, used as the names for the separate CSV files.
For each row, the leftmost cell is a <th> tag and the rest of the row is <td> tags. In addition to the header, how do I put those into one row?
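For reference, a minimal sketch of one way to combine them, passing a list of tag names to a single find_all call so each row's <th> cell and <td> cells land in one list:

rows = []
for row in table.find_all('tr'):
    cells = [val.text.encode('utf8') for val in row.find_all(['th', 'td'])]
    if cells:
        rows.append(cells)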
This code gets you the big table of stats, which is what I think you want.
Make sure you have lxml, beautifulsoup4, and pandas installed.
import pandas as pd

df = pd.read_html(r'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010')
print(df[4])
Here is the output of the first 5 rows. You may need to clean it slightly, as I don't know what your exact end goal is:
df[4].head(5)
Rk Gcar Gtm Date Tm Unnamed: 5 Opp Rslt Inngs PA ... CS BA OBP SLG OPS BOP aLI WPA RE24 Pos
0 1 66 2 (1) Apr 6 ARI NaN SDP L,3-6 7-8 1 ... 0 1.000 1.000 1.000 2.000 9 .94 0.041 0.51 PH
1 2 67 3 Apr 7 ARI NaN SDP W,5-3 7-8 1 ... 0 .500 .500 .500 1.000 9 1.16 -0.062 -0.79 PH
2 3 68 4 Apr 9 ARI NaN PIT W,9-1 8-GF 1 ... 0 .667 .667 .667 1.333 2 .00 0.000 0.13 PH SS
3 4 69 5 Apr 10 ARI NaN PIT L,3-6 CG 4 ... 0 .500 .429 .500 .929 2 1.30 -0.040 -0.37 SS
4 5 70 7 (1) Apr 13 ARI # LAD L,5-9 6-6 1 ... 0 .429 .375 .429 .804 9 1.52 -0.034 -0.46 PH
To select certain columns within this DataFrame: df[4]['COLUMN_NAME_HERE'].head(5)
Example: df[4]['Gcar']
Also, if typing df[4] is getting annoying, you can always just switch to another dataframe: df2 = df[4]
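From there, writing the CSV the question asks for is a single call; for example (the filename is illustrative):

# Write the gamelog table to a CSV file
df[4].to_csv('abreuto01_batting_gamelogs.csv', index=False)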
import pandas as pd
from bs4 import BeautifulSoup
import urllib2

url = 'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010'
html = urllib2.urlopen(url)
bs = BeautifulSoup(html, 'lxml')
table = str(bs.find('table', {'id': 'batting_gamelogs'}))
dfs = pd.read_html(table)
This uses pandas, which is quite useful for stuff like this, and it puts the result in a reasonable format for further operations.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html