Turn an HTML table into a CSV file

Turn an HTML table into a CSV file - python

How do I turn a table like this--batting gamelogs table--into a CSV file using Python and BeautifulSoup?
I want the first header where it says Rk, Gcar, Gtm, etc. and not any of the other headers within the table (the ones for each month of the season).
Here is the code I have so far:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv
def stir_the_soup():
player_links = open('player_links.txt', 'r')
player_ID_nums = open('player_ID_nums.txt', 'r')
id_nums = [x.rstrip('\n') for x in player_ID_nums]
idx = 0
for url in player_links:
print url
soup = BeautifulSoup(urlopen(url), "lxml")
p_type = ""
if url[-12] == 'p':
p_type = "pitching"
elif url[-12] == 'b':
p_type = "batting"
table = soup.find(lambda tag: tag.name=='table' and tag.has_attr('id') and tag['id']== (p_type + "_gamelogs"))
header = [[val.text.encode('utf8') for val in table.find_all('thead')]]
rows = []
for row in table.find_all('tr'):
rows.append([val.text.encode('utf8') for val in row.find_all('th')])
rows.append([val.text.encode('utf8') for val in row.find_all('td')])
with open("%s.csv" % id_nums[idx], 'wb') as f:
writer = csv.writer(f)
writer.writerow(header)
writer.writerows(row for row in rows if row)
idx += 1
player_links.close()
if __name__ == "__main__":
stir_the_soup()
The id_nums list contains all of the id numbers for each player to use as the names for the separate CSV files.
For each row, the leftmost cell is a tag and the rest of the row is tags. In addition to the header how do I put that into one row?

this code gets you the big table of stats, which is what I think you want.
Make sure you have lxml, beautifulsoup4 and pandas installed.
df = pd.read_html(r'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010')
print(df[4])
Here is the output of first 5 rows. You may need to clean it slightly as I don't know what your exact endgoal is:
df[4].head(5)
Rk Gcar Gtm Date Tm Unnamed: 5 Opp Rslt Inngs PA ... CS BA OBP SLG OPS BOP aLI WPA RE24 Pos
0 1 66 2 (1) Apr 6 ARI NaN SDP L,3-6 7-8 1 ... 0 1.000 1.000 1.000 2.000 9 .94 0.041 0.51 PH
1 2 67 3 Apr 7 ARI NaN SDP W,5-3 7-8 1 ... 0 .500 .500 .500 1.000 9 1.16 -0.062 -0.79 PH
2 3 68 4 Apr 9 ARI NaN PIT W,9-1 8-GF 1 ... 0 .667 .667 .667 1.333 2 .00 0.000 0.13 PH SS
3 4 69 5 Apr 10 ARI NaN PIT L,3-6 CG 4 ... 0 .500 .429 .500 .929 2 1.30 -0.040 -0.37 SS
4 5 70 7 (1) Apr 13 ARI # LAD L,5-9 6-6 1 ... 0 .429 .375 .429 .804 9 1.52 -0.034 -0.46 PH
to select certain columns within this DataFrame: df[4]['COLUMN_NAME_HERE'].head(5)
Example: df[4]['Gcar']
Also, if doing df[4] is getting annoying you could always just switch to another dataframe df2=df[4]

import pandas as pd
from bs4 import BeautifulSoup
import urllib2
url = 'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010'
html=urllib2.urlopen(url)
bs = BeautifulSoup(html,'lxml')
table = str(bs.find('table',{'id':'batting_gamelogs'}))
dfs = pd.read_html(table)
This uses Pandas, which is pretty useful for stuff like this. It also puts it in a pretty reasonable format to do other operations on.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html

Related

Cleaning accented unicode characters with Pandas read_html function

I'm downloading football data with pandas read_html function, but not struggling to clean the player names with all the accented characters. This is what I have so far:
import pandas as pd
from unidecode import unidecode
shooting = pd.read_html("https://widgets.sports-reference.com/wg.fcgi?css=1&site=fb&url=%2Fen%2Fcomps%2F9%2Fshooting%2FPremier-League-Stats&div=div_stats_shooting")
for idx,table in enumerate(shooting):
print("***************************")
print(idx)
print(table)
shooting = table
for col in [('Unnamed: 1_level_0', 'Player')]:
shooting[col] = shooting[col].apply(unidecode)
shooting
shooting = table
#print(shooting.droplevel(1))
table.to_csv (r'C:\Users\khabs\OneDrive\Documents\Python Testing\shooting.csv', index = False, header=True)
print (shooting)
I think the issue is that the coding is messed before I even do the cleaning, but really not sure.
Any help would be greatly appreciated!!

Just use the encoding parameter in pandas.
import pandas as pd
url = "https://widgets.sports-reference.com/wg.fcgi?css=1&site=fb&url=%2Fen%2Fcomps%2F9%2Fshooting%2FPremier-League-Stats&div=div_stats_shooting"
shooting = pd.read_html(url, header=1, encoding='utf8')[0]
However, that (and I'm assuming) will not get you what you want, as there are extra html characters in the response returned from that widget.
Just go after the actual html. The table is within the comments.
import requests
import pandas as pd
url = 'https://fbref.com/en/comps/9/shooting/Premier-League-Stats'
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
shooting = pd.read_html(html, header=1)[-1]
shooting = shooting[shooting['Rk'].ne('Rk')]
Output:
print(shooting.head(10))
Rk Player Nation Pos ... npxG/Sh G-xG np:G-xG Matches
0 1 Brenden Aaronson us USA FW,MF ... 0.03 -0.1 -0.1 Matches
1 2 Che Adams sct SCO FW ... 0.09 +1.6 +1.6 Matches
2 3 Tyler Adams us USA MF ... 0.01 0.0 0.0 Matches
3 4 Tosin Adarabioyo eng ENG DF ... NaN 0.0 0.0 Matches
4 5 Rayan Aït Nouri fr FRA DF ... 0.08 -0.1 -0.1 Matches
5 6 Nathan Aké nl NED DF ... 0.05 -0.2 -0.2 Matches
6 7 Thiago Alcántara es ESP MF ... NaN 0.0 0.0 Matches
7 8 Trent Alexander-Arnold eng ENG DF ... 0.05 -0.2 -0.2 Matches
8 9 Alisson br BRA GK ... NaN 0.0 0.0 Matches
9 10 Dele Alli eng ENG FW,MF ... NaN 0.0 0.0 Matches

Webscraping A Table Using Python and BeautifulSoup

I'm learning on how to webscrape using Python since I'm a novice. Right now, I attempted to webscrape Euros 2020 stats from this website https://theanalyst.com/na/2021/06/euro-2020-player-stats. After running my initial code (see below) to gather the html from the webpage, I cannot locate the table tag and its data-table class. I can see the table and its data-table when I inspected the website, but it is not shown when I print out the page_soup.
from urllib.request import urlopen as uReq # Web client
from bs4 import BeautifulSoup as soup # HTML data structure
url_page = 'https://theanalyst.com/na/2021/06/euro-2020-player-stats'
# Open connection & download the html from the url
uClient = uReq(url_page)
# Parses html into a soup data structure
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
print(page_soup)

The table is loaded dynamically in JSON format via sending a GET request to:
https://dataviz.theanalyst.com/euro-2020-hub/player_stats_3_2020.json
Since we're dealing with JSON data, it's easier to use the requests library to get the data.
Here is an example using the pandas library to print the table into a DataFrame (you don't have to use the pandas library).
import pandas as pd
import requests
url = "https://dataviz.theanalyst.com/euro-2020-hub/player_stats_3_2020.json"
response = requests.get(url).json()
print(pd.json_normalize(response["data"]).to_string())
Output (truncated):
player_id team_id team_name player_first_name player_last_name player age position detailed_position mins_played np_shots np_sot np_goals np_xG op_chances_created op_assists op_xA op_passes op_pass_completion_rate tackles_won interceptions recoveries avg_carry_distance avg_carry_progress carry_w_shot carry_w_goal carry_w_chance_created carry_w_assist take_ons take_ons_success_rate goal_ending total_xG shot_ending team_badge
0 103955 114 England Raheem Sterling Raheem Sterling 26 Forward Second Striker 641 14 8 3 3.82 2 1 1.18 193 0.85 5 4 23 12.98 6.73 3 0 3 1 38 52.63 6 7.08 24 https://omo.akamai.opta.net/image.php?secure=true&h=omo.akamai.opta.net&sport=football&entity=team&description=badges&dimensions=150&id=114
1 56979 114 England Jordan Henderson Jordan Henderson 31 Midfielder Central Midfielder 150 1 1 1 0.32 0 0 0.06 111 0.88 0 1 11 7.83 0.49 0 0 0 0 3 66.67 0 0.00 0 https://omo.akamai.opta.net/image.php?secure=true&h=omo.akamai.opta.net&sport=football&entity=team&description=badges&dimensions=150&id=114
2 78830 114 England Harry Kane Harry Kane 27 Forward Striker 649 15 7 4 3.57 5 0 0.39 159 0.70 0 3 8 10.52 3.06 2 0 2 0 15 53.33 7 6.38 21 https://omo.akamai.opta.net/image.php?secure=true&h=omo.akamai.opta.net&sport=football&entity=team&description=badges&dimensions=150&id=114
3 58621 114 England Kyle Walker Kyle Walker 31 Defender Full Back 599 0 0 0 0.00 2 0 0.18 352 0.87 0 8 37 11.66 5.09 0 0 0 0 1 100.00 3 2.54 10 https://omo.akamai.opta.net/image.php?secure=true&h=omo.akamai.opta.net&sport=football&entity=team&description=badges&dimensions=150&id=114
The variable response is now a dictionary (dict) which you can access the keys/values. To view and prettify the data:
from pprint import pprint
print(type(response))
pprint(response)
Output (truncated):
<class 'dict'>
{'data': [{'age': 26,
'avg_carry_distance': 12.98,
'avg_carry_progress': 6.73,
'carry_w_assist': 1,
'carry_w_chance_created': 3,
'carry_w_goal': 0,
'carry_w_shot': 3,
'detailed_position': 'Second Striker',

AttributeError: 'NoneType' object has no attribute 'text' - BeautifulShop

I have a little code for scraping info from fbref (link for data: https://fbref.com/en/comps/9/stats/Premier-League-Stats) and it worked well but now I have some problems with some features (I've checked that the fields which don't work now are"player","nationality","position","squad","age","birth_year"). I have also checked that the fields have the same name in the web that it used to be. Any ideas/help to solve the problem?
Many Thanks!
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
import sys, getopt
import csv
def get_tables(url):
res = requests.get(url)
## The next two lines get around the issue with comments breaking the parsing.
comm = re.compile("<!--|-->")
soup = BeautifulSoup(comm.sub("",res.text),'lxml')
all_tables = soup.findAll("tbody")
team_table = all_tables[0]
player_table = all_tables[1]
return player_table, team_table
def get_frame(features, player_table):
pre_df_player = dict()
features_wanted_player = features
rows_player = player_table.find_all('tr')
for row in rows_player:
if(row.find('th',{"scope":"row"}) != None):
for f in features_wanted_player:
cell = row.find("td",{"data-stat": f})
a = cell.text.strip().encode()
text=a.decode("utf-8")
if(text == ''):
text = '0'
if((f!='player')&(f!='nationality')&(f!='position')&(f!='squad')&(f!='age')&(f!='birth_year')):
text = float(text.replace(',',''))
if f in pre_df_player:
pre_df_player[f].append(text)
else:
pre_df_player[f] = [text]
df_player = pd.DataFrame.from_dict(pre_df_player)
return df_player
stats = ["player","nationality","position","squad","age","birth_year","games","games_starts","minutes","goals","assists","pens_made","pens_att","cards_yellow","cards_red","goals_per90","assists_per90","goals_assists_per90","goals_pens_per90","goals_assists_pens_per90","xg","npxg","xa","xg_per90","xa_per90","xg_xa_per90","npxg_per90","npxg_xa_per90"]
def frame_for_category(category,top,end,features):
url = (top + category + end)
player_table, team_table = get_tables(url)
df_player = get_frame(features, player_table)
return df_player
top='https://fbref.com/en/comps/9/'
end='/Premier-League-Stats'
df1 = frame_for_category('stats',top,end,stats)
df1

I suggest loading the table with panda's read_html. There is a direct link to this table under Share & Export --> Embed this Table.
import pandas as pd
df = pd.read_html("https://widgets.sports-reference.com/wg.fcgi?css=1&site=fb&url=%2Fen%2Fcomps%2F9%2Fstats%2FPremier-League-Stats&div=div_stats_standard", header=1)
This outputs a list of dataframes, the table can be accessed as df[0]. Output df[0].head():
Rk
Player
Nation
Pos
Squad
Age
Born
MP
Starts
Min
90s
Gls
Ast
G-PK
PK
PKatt
CrdY
CrdR
Gls.1
Ast.1
G+A
G-PK.1
G+A-PK
xG
npxG
xA
npxG+xA
xG.1
xA.1
xG+xA
npxG.1
npxG+xA.1
Matches
0
1
Patrick van Aanholt
nl NED
DF
Crystal Palace
30-190
1990
16
15
1324
14.7
0
1
0
0
0
1
0
0
0.07
0.07
0
0.07
1.2
1.2
0.8
2
0.08
0.05
0.13
0.08
0.13
Matches
1
2
Tammy Abraham
eng ENG
FW
Chelsea
23-156
1997
20
12
1021
11.3
6
1
6
0
0
0
0
0.53
0.09
0.62
0.53
0.62
5.6
5.6
0.9
6.5
0.49
0.08
0.57
0.49
0.57
Matches
2
3
Che Adams
eng ENG
FW
Southampton
24-237
1996
26
22
1985
22.1
5
4
5
0
0
1
0
0.23
0.18
0.41
0.23
0.41
5.5
5.5
4.3
9.9
0.25
0.2
0.45
0.25
0.45
Matches
3
4
Tosin Adarabioyo
eng ENG
DF
Fulham
23-164
1997
23
23
2070
23
0
0
0
0
0
1
0
0
0
0
0
0
1
1
0.1
1.1
0.04
0.01
0.05
0.04
0.05
Matches
4
5
AdriÃ¡n
es ESP
GK
Liverpool
34-063
1987
3
3
270
3
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Matches

If you're only after the player stats, change player_table = all_tables[1] to player_table = all_tables[2], because now you are feeding team table into get_frame function.
I tried it and it worked fine after that.

How to scrape a non tabled list from wikipedia and create a datafram?

en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul
in the link above, there is an un-tabulated data for Istanbul Neighborhoods.
I want to fetch these Neighborhoods into a data frame by this code
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl="https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response=requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList=soup.findAll('a',{'class':"new"})
neighborhoods=[]
for item in tocList:
text = item.get_text()
neighborhoods.append(text)
df = pd.DataFrame(neighborhoods, columns=['Neighborhood'])
print(df)
and I got this output:
Neighborhood
0 Maden
1 Nizam
2 Anadolu
3 Arnavutköy İmrahor
4 Arnavutköy İslambey
... ...
705 Seyitnizam
706 Sümer
707 Telsiz
708 Veliefendi
709 Yeşiltepe
710 rows × 1 columns
but some data are not fetched, check the data below and compare to the output:
Adalar
Burgazada
Heybeliada
Kınalıada
Maden
Nizam
findall() is not fetching the Neighborhoods which referred as links, not class, i.e.
<ol><li>Burgazada</li>
<li>Heybeliada</li>
and can I develop the code into 2 columns, each 'Neighborhood' and its 'District'

Are you trying to fetch this list from Table of Contents ?
Please check if this solves your problem:
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl="https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response=requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList=soup.findAll('span',{'class':"toctext"})
districts=[]
blocked_words = ['Neighbourhoods by districts','Further reading', 'External links']
for item in tocList:
text = item.get_text()
if text not in blocked_words:
districts.append(text)
df = pd.DataFrame(districts, columns=['districts'])
print(df)
Output:
districts
0 Adalar
1 Arnavutköy
2 Ataşehir
3 Avcılar
4 Bağcılar
5 Bahçelievler
6 Bakırköy
7 Başakşehir
8 Bayrampaşa
9 Beşiktaş
10 Beykoz
11 Beylikdüzü
12 Beyoğlu
13 Büyükçekmece
14 Çatalca
15 Çekmeköy
16 Esenler
17 Esenyurt
18 Eyüp
19 Fatih
20 Gaziosmanpaşa
21 Güngören
22 Kadıköy
23 Kağıthane
24 Kartal
25 Küçükçekmece
26 Maltepe
27 Pendik
28 Sancaktepe
29 Sarıyer
30 Silivri
31 Sultanbeyli
32 Sultangazi
33 Şile
34 Şişli
35 Tuzla
36 Ümraniye
37 Üsküdar
38 Zeytinburnu

BeautifulSoup webscraping find_all( ): excluded element appended as last element

I'm trying to retrieve Financial Information from reuters.com, especially the Long Term Growth Rates of Companies. The element I want to scrape doesn't appear on all Webpages, in my example not for the Ticker 'AMCR'. All scraped info shall be appended to a list.
I've already figured out to exclude the element if it doesn't exist, but instead of appending it to the list in a place where it should be, the "NaN" is appended as the last element and not in a place where it should be.
import requests
from bs4 import BeautifulSoup
LTGRMean = []
tickers = ['MMM','AES','LLY','LOW','PWR','TSCO','YUM','ICE','FB','AAPL','AMCR','FLS','GOOGL','FB','MSFT']
Ticker LTGRMean
0 MMM 3.70
1 AES 9.00
2 LLY 10.42
3 LOW 13.97
4 PWR 12.53
5 TSCO 11.44
6 YUM 15.08
7 ICE 8.52
8 FB 19.07
9 AAPL 12.00
10 AMCR 19.04
11 FLS 16.14
12 GOOGL 19.07
13 FB 14.80
14 MSFT NaN
My individual text "not existing" isn't appearing.
Instead of for AMCR where Reuters doesn't provide any information, the Growth Rate of FLS (19.04) is set instead. So, as a result, all info is shifted up one index, where NaN should appear next to AMCR.

Stack() Function in dataframe stacks the column to rows at level 1.
import requests
from bs4 import BeautifulSoup
import pandas as pd
LTGRMean = []
tickers = ['MMM', 'AES', 'LLY', 'LOW', 'PWR', 'TSCO', 'YUM', 'ICE', 'FB', 'AAPL', 'AMCR', 'FLS', 'GOOGL', 'FB', 'MSFT']
for i in tickers:
Test = requests.get('https://www.reuters.com/finance/stocks/financial-highlights/' + i)
ReutSoup = BeautifulSoup(Test.content, 'html.parser')
td = ReutSoup.find('td', string="LT Growth Rate (%)")
my_dict = {}
#validate td object not none
if td is not None:
result = td.findNext('td').findNext('td').text
else:
result = "NaN"
my_dict[i] = result
LTGRMean.append(my_dict)
df = pd.DataFrame(LTGRMean)
print(df.stack())
O/P:
0 MMM 3.70
1 AES 9.00
2 LLY 10.42
3 LOW 13.97
4 PWR 12.53
5 TSCO 11.44
6 YUM 15.08
7 ICE 8.52
8 FB 19.90
9 AAPL 12.00
10 AMCR NaN
11 FLS 19.04
12 GOOGL 16.14
13 FB 19.90
14 MSFT 14.80
dtype: object

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Turn an HTML table into a CSV file - python

Related

Cleaning accented unicode characters with Pandas read_html function

Webscraping A Table Using Python and BeautifulSoup

AttributeError: 'NoneType' object has no attribute 'text' - BeautifulShop

How to scrape a non tabled list from wikipedia and create a datafram?

BeautifulSoup webscraping find_all( ): excluded element appended as last element

Categories

Resources