bs4: How to extract table "team-roster home" and "team-roster away" - python

I'm trying to get the team lineup for the home and away teams from here: http://www.veikkausliiga.com/tilastot/2016/veikkausliiga/ottelut/498251/kokoonpanot/.
I have tried using tr and td tags, as well as with id and class tags. I cannot get any information from the table.
import urllib
import urllib.request
from bs4 import BeautifulSoup
import os
def make_soup(url):
    thePage = urllib.request.urlopen(url)
    soupData = BeautifulSoup(thePage, "html.parser")
    return soupData
def makeTableData(url):
    matchDataSaved= ""
    soup = make_soup(url)
    table = soup.find("table",{"class":"team-rosters home"})
    return table
I expected the table rows to show, but the code outputs None. How can I get the players lineup?

I'm not sure what output format you want. Your code does work, in that it returns the element that has that table tag. You'd still need to go through it to get the tr and td tags.
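A minimal sketch of that row-by-row walk, run against made-up markup rather than the live page (the HTML snippet, the placeholder player names and the helper name roster_rows are assumptions for illustration). It uses a CSS selector, which matches each class separately, so the order of the class names does not matter. One thing possibly worth checking as well: the title of the question names the class team-roster, while the code searches for team-rosters; if the page uses the former, that mismatch alone would make find() return None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page
html = """
<table class="team-roster home">
  <tr><td>#1</td><td>Player A</td></tr>
  <tr><td>#2</td><td>Player B</td></tr>
</table>
<table class="team-roster away">
  <tr><td>#3</td><td>Player C</td></tr>
</table>
"""


def roster_rows(soup, side):
    """Return the cell texts of every row in the roster table for `side`."""
    table = soup.select_one(f"table.team-roster.{side}")
    if table is None:
        # Nothing matched: return an empty list instead of failing on None
        return []
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]


soup = BeautifulSoup(html, "html.parser")
home = roster_rows(soup, "home")
away = roster_rows(soup, "away")
print(home)
print(away)
```

Returning an empty list for a missing table keeps the caller from hitting an AttributeError on None, which is the failure the question ran into.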
The other option, since the data is in <table> tags, is to just use Pandas and then manipulate the dataframes to extract what you want, in the format you want. The result is probably not in its final form, but it gets you the data and you can work with it from there:
import pandas as pd
url = 'http://www.veikkausliiga.com/tilastot/2016/veikkausliiga/ottelut/498251/kokoonpanot/'
tables = pd.read_html(url)
# positions of the tables to combine
ix_list = [2,3,4,5,7,8,9,10]
# DataFrame.append was removed in pandas 2.0, so concatenate the selected tables instead
results = pd.concat([tables[ix] for ix in ix_list])
results = results.reset_index(drop=True)
Output:
print (results)
0 ... 3
0 #21 T. Dähne ... NaN
1 #15 V. Jalasto ... #3 Taiwo
2 #20 Vincent ... NaN
3 #7 N. Alho ... NaN
4 #25 D. Assis ... NaN
5 #9 B. Span ... #11 J. Ibrahim
6 #4 P. Sparrdal Mantilla ... #7 T. Wirtanen
7 #1 W. Viitala ... NaN
[8 rows x 4 columns]

Related

Struggling to grab data from baseball reference

I'm trying to grab the tables for all pitchers batting against found on this page.
I believe the problem lies with the data being behind a comment.
For the sake of the example, I'd like to find, say, Sandy Alcantara's home runs allowed.
import requests
from bs4 import BeautifulSoup as bs
url="https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page=requests.get(url)
soup=bs(page.content,"html.parser")
for tr in soup.find_all('tr'):
    td=tr.find_all('td')
    print(td)
This prints a lot of team data, but doesn't print the pitcher data.
How can I cleverly get it to print the pitcher data? Ideally, I'd have it in a list or something.
object[0]=Rk
object[1]=Name
object[4]=IP
object[13]=HR
The problem related to the extraction of the table content is that the table itself is stored inside a comment string.
After you have fetched your web page and loaded it into BeautifulSoup, you can solve this scraping issue with these steps:
1. Get the div with id = 'all_players_batting_pitching', which contains your table.
2. Extract the table from the comment using the decode_contents function, then reload the text into a new soup.
3. Extract each record of your table by looking for the tr tags, then each value by looking for the td tags, keeping a value only if it is at one of your indices [1, 4, 13] (these are positions [0, 3, 12] among the td cells in the code below).
4. Load your values into a pandas DataFrame, ready to be used.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
# fetching web page
url = "https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page = requests.get(url)
# extracting table from html
soup = bs(page.content,"html.parser")
table = soup.find(id = 'all_players_batting_pitching')
tab_text = table.decode_contents().split('--')[1].strip()
tab_soup = bs(tab_text,"html.parser")
# extracting records from table
records = []
for i, row in enumerate(tab_soup.find_all('tr')):
    record = [ele.text.strip() for j, ele in enumerate(row.find_all('td')) if j in [0, 3, 12]]
    if record != []:
        records.append([row.a['href']] + [i] + record)
# loading records into a DataFrame (column names taken from the output below)
df = pd.DataFrame(records, columns=['href', 'Rk', 'Name', 'IP', 'HR'])
print(df)
Output:
href Rk Name IP HR
0 /players/a/abbotco01.shtml 1 Cory Abbott 48.0 12
1 /players/a/abreual01.shtml 2 Albert Abreu 38.2 5
2 /players/a/abreual01.shtml 3 Albert Abreu 8.2 2
3 /players/a/abreual01.shtml 4 Albert Abreu 4.1 1
4 /players/a/abreual01.shtml 5 Albert Abreu 25.2 2
... ... ... ... ... ...
1063 /players/z/zastrro01.shtml 1106 Rob Zastryzny* 1.0 0
1064 /players/z/zastrro01.shtml 1107 Rob Zastryzny* 3.0 0
1065 /players/z/zerpaan01.shtml 1108 Angel Zerpa* 11.0 2
1066 /players/z/zeuchtj01.shtml 1109 T.J. Zeuch 10.2 5
1067 /players/z/zimmebr02.shtml 1110 Bruce Zimmermann* 73.2 21
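As a variation on the split('--') step, bs4 can also pick out comment nodes directly through its Comment class, which avoids depending on where the dashes fall. A small sketch on an invented, heavily reduced snippet (the markup below is a stand-in, not the real page, and the row layout is an assumption):

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical, reduced stand-in for the real page: the table sits inside an HTML comment
html = """
<div id="all_players_batting_pitching">
<!--
<table id="players_batting_pitching">
<tr><th>1</th><td>Cory Abbott</td><td>48.0</td><td>12</td></tr>
</table>
-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")
wrapper = soup.find(id="all_players_batting_pitching")

# Collect the comment nodes inside the wrapper div
comments = wrapper.find_all(string=lambda s: isinstance(s, Comment))

# Parse the text of the first comment as HTML in its own right
inner = BeautifulSoup(comments[0], "html.parser")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in inner.find_all("tr")
]
print(rows)
```

On the real page the wrapper may hold more than one comment, so checking each one for a table tag before parsing would be a sensible precaution.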

How to narrow down the soup.find result and output only relevant text?

Every football player's Wikipedia page has something called an "infobox" where the career is displayed.
My goal is to scrape only the highlighted data from the Wikipedia pages of football players.
I have gotten this far: I'm able to output the "infobox" segment of the player as text like this. But the only information I want is the highlighted part.
How do I narrow the result so I only get the highlighted text as my output?
If you think you might know the answer, please ask questions if necessary, because I find it hard to formulate my question well.
The infobox table is a succession of <tr></tr> tags.
Overall, we are looking for the <tr></tr> tag located immediately after the one whose text is "Seniorlag*".
You could do it like this:
import requests
from bs4 import BeautifulSoup
url = "https://sv.wikipedia.org/wiki/Lionel_Messi"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
infobox = soup.find('table', {'class': 'infobox'})
tr_tags = infobox.find_all('tr')
for tr in tr_tags:
    if tr.text == "Seniorlag*":
        # Search for the following tr tag
        next_tr = tr.find_next_sibling('tr')
        print(next_tr.text)
output
År2003–20042004–20052004–20212021–
Klubb Barcelona C Barcelona B Barcelona Paris Saint-Germain
SM (GM) 10 00(5)22 00(6)520 (474)39 0(13)
In addition to the approach of @Vincent Lagache, which answers the question well, you could also use CSS selectors to find your elements:
soup.select_one('tr:has(th:-soup-contains("Seniorlag")) + tr').text
Use a dict comprehension with stripped_strings to extract the strings:
{
    list(e.stripped_strings)[0]:list(e.stripped_strings)[1:]
    for e in soup.select('tr:has(th:-soup-contains("Seniorlag")) + tr table td')
}
This results in a dict that is already structured and can therefore be easily reused, for example to create a DataFrame:
{'År': ['2003–2004', '2004–2005', '2004–2021', '2021–'], 'Klubb': ['Barcelona C', 'Barcelona B', 'Barcelona', 'Paris Saint-Germain'], 'SM (GM)': ['10', '(5)', '22', '(6)', '520 (474)', '39', '(13)']}
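To try the selector without hitting Wikipedia, here is a small offline sketch. The markup is an invented, simplified imitation of the infobox (a header row followed by a row that wraps a nested table), so the real page may well differ; the :-soup-contains pseudo-class also needs a reasonably recent soupsieve.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified imitation of the infobox structure
html = """
<table class="infobox">
  <tr><th>Seniorlag*</th></tr>
  <tr><td><table><tr>
    <td>År<br>2003–2004<br>2004–2005</td>
    <td>Klubb<br>Barcelona C<br>Barcelona B</td>
  </tr></table></td></tr>
  <tr><th>Landslag</th></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# tr:has(th ...)  -> the row whose header cell mentions "Seniorlag"
# + tr            -> the row directly after it
# table td        -> the cells of the table nested inside that row
cells = soup.select('tr:has(th:-soup-contains("Seniorlag")) + tr table td')

career = {}
for td in cells:
    strings = list(td.stripped_strings)
    # The first string is the column label, the rest are its values
    career[strings[0]] = strings[1:]

print(career)
```

Calling stripped_strings once per cell, as done here, avoids walking each cell twice the way the one-line comprehension does.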
Example
This example also includes some pre- and postprocessing steps, such as decompose() to eliminate unwanted tags and splitting the column of tuples with pandas:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url='https://sv.wikipedia.org/wiki/Lionel_Messi'
# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# remove invisible elements so they don't end up in the strings
for hidden in soup.find_all(style='visibility:hidden;color:transparent;'):
    hidden.decompose()
d = {
    list(e.stripped_strings)[0]:list(e.stripped_strings)[1:]
    for e in soup.select('tr:has(th:-soup-contains("Seniorlag")) + tr table td')
}
# regroup the flat "SM (GM)" strings into pairs
d['SM (GM)'] = ' '.join(d['SM (GM)']).split()
d['SM (GM)'] = list(zip(d['SM (GM)'][0::2], d['SM (GM)'][1::2]))
df = pd.DataFrame(d)
df[['SM', 'GM']] = pd.DataFrame(df['SM (GM)'].tolist(), index=df.index)
df
Output:
          År                Klubb           SM (GM)   SM     GM
0  2003–2004          Barcelona C     ('10', '(5)')   10    (5)
1  2004–2005          Barcelona B     ('22', '(6)')   22    (6)
2  2004–2021            Barcelona  ('520', '(474)')  520  (474)
3      2021–  Paris Saint-Germain    ('39', '(13)')   39   (13)

How do I use Python and BeautifulSoup to scrape data from an html table?

If you look at this page, https://metals-api.com/currencies, there is an HTML table with 2 columns. I would like to extract all the rows from column 1 into a list/array. How do I go about this?
import requests
from bs4 import BeautifulSoup
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
with open('outpu2t.txt', 'w', encoding='utf-8') as f:
    f.write(soup.text)
To clarify, I am not looking to run fetch-price commands against these tickers; I'm trying to compile a list of tickers so I can add them to a dropdown menu for my app.
If I understand the question correctly, you can try the following example:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data=[]
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for code in soup.select('.table tbody tr td:nth-child(1)'):
    code =code.text
    data.append(code)
df=pd.DataFrame(data,columns=['code'])
#df.to_csv('code.csv',index=False)# to store data
print(df)
Output:
code
0 XAU
1 XAG
2 XPT
3 XPD
4 XCU
.. ...
209 LINK
210 XLM
211 ADA
212 BCH
213 LTC
[214 rows x 1 columns]
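Since the question asks for a plain list for a dropdown, the DataFrame step can be skipped. A short sketch on made-up markup (the snippet below only imitates a two-column table with the class "table"; the names in the second column are placeholders):

```python
from bs4 import BeautifulSoup

# Hypothetical two-column table standing in for the real page
html = """
<table class="table">
  <thead><tr><th>Code</th><th>Name</th></tr></thead>
  <tbody>
    <tr><td>XAU</td><td>Gold</td></tr>
    <tr><td>XAG</td><td>Silver</td></tr>
    <tr><td>XPT</td><td>Platinum</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# td:nth-child(1) keeps only the first cell of each body row
codes = [
    td.get_text(strip=True)
    for td in soup.select("table.table tbody tr td:nth-child(1)")
]
print(codes)
```

With the DataFrame from the answer above, df['code'].tolist() should give the same kind of list.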
I stand corrected. I initially just tried pd.read_html("https://metals-api.com/currencies"), which normally works; with a very slight workaround it still works just fine.
import pandas as pd
import requests
from io import StringIO
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
# wrap the HTML in StringIO: newer pandas versions deprecate passing literal HTML to read_html
df = pd.read_html(StringIO(page.text))[0]
print(df)
Output:
Code Name
0 XAU 1 Ounce of 24K Gold. Use Carat endpoint to dis...
1 XAG Silver
2 XPT Platinum
3 XPD Palladium
4 XCU Copper
.. ... ...
209 LINK Chainlink
210 XLM Stellar
211 ADA Cardano
212 BCH Bitcoin Cash
213 LTC Litecoin
[214 rows x 2 columns]

How to loop through scraped items and add them to a dictionary or pandas dataframe?

For a project I'm scraping data from futbin players and I would like to add that scraped data to a dict or pandas dataframe. I've been stuck for a couple of hours and would like some help if possible. I will put the code I have so far below. This piece of code only prints out the data, and from there I'm clueless about what to do.
Code:
from requests_html import HTMLSession
import requests
from bs4 import BeautifulSoup
import pandas as pd
urls = ['https://www.futbin.com/21/player/87/pele', 'https://www.futbin.com/21/player/27751/robert-lewandowski']
for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    info = soup.find('div', id='info_content')
    rows = info.find_all('td')
    for info in rows:
        print(info.text.strip())
The work you have already done to identify the table you want is good. From there:
1. Use read_html() to convert it to a dataframe.
2. Apply basic transforms to turn it into columns rather than key/value pairs.
3. Use a list comprehension to get the details of all wanted footballers.
import requests
from io import StringIO
from bs4 import BeautifulSoup
import pandas as pd
urls = ['https://www.futbin.com/21/player/87/pele', 'https://www.futbin.com/21/player/27751/robert-lewandowski']
def myhtml(url):
    # use BS4 to get table that has required data
    html = str(BeautifulSoup(requests.get(url).content, 'html.parser').find('div', id='info_content').find("table"))
    # read_html() returns a list, take first one, first column are attribute name, transpose to build DF
    # (StringIO because newer pandas versions deprecate passing literal HTML to read_html)
    return pd.read_html(StringIO(html))[0].set_index(0).T
df = pd.concat([myhtml(u) for u in urls])
Output (df, shown transposed here so that all 20 columns fit; both rows have index 1):
Name                                    Edson Arantes Nascimento  Robert Lewandowski
Club                                    FUT 21 ICONS              FC Bayern
Nation                                  Brazil                    Poland
League                                  Icons                     Bundesliga
Skills                                  5                         4
Weak Foot                               4                         4
Intl. Rep                               5                         4
Foot                                    Right                     Right
Height                                  173cm 5'8"                184cm 6'0"
Weight                                  70                        80
Revision                                Icon                      TOTY
Def. WR                                 Med                       Med
Att. WR                                 High                      High
Added on                                2020-09-10                2021-01-22
Origin                                  Prime                     TOTY
R.Face                                  nan                       nan
B.Type                                  Unique                    Unique
DOB                                     23-10-1940                nan
Robert Lewandowski FIFA 21 Career Mode  nan                       Robert Lewandowski FIFA 21 Career Mode
Age
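If the goal is a dict per player first, as the question mentions, the key/value rows can also be collected by hand and handed to pandas afterwards. A sketch under the assumption that each row of the info table holds a label cell and a value cell; the two HTML snippets and the helper name parse_info are made up for illustration, with only a couple of attributes each:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical, reduced stand-ins for the two player pages
PELE_HTML = """
<div id="info_content"><table>
  <tr><th>Name</th><td>Edson Arantes Nascimento</td></tr>
  <tr><th>Nation</th><td>Brazil</td></tr>
</table></div>
"""
LEWANDOWSKI_HTML = """
<div id="info_content"><table>
  <tr><th>Name</th><td>Robert Lewandowski</td></tr>
  <tr><th>Nation</th><td>Poland</td></tr>
</table></div>
"""


def parse_info(html):
    """Turn the label/value rows of the info table into one dict."""
    soup = BeautifulSoup(html, "html.parser")
    info = soup.find("div", id="info_content")
    record = {}
    for tr in info.find_all("tr"):
        cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
        if len(cells) == 2:  # skip rows that are not a label/value pair
            record[cells[0]] = cells[1]
    return record


# One dict per player, then one DataFrame row per dict
records = [parse_info(h) for h in (PELE_HTML, LEWANDOWSKI_HTML)]
df = pd.DataFrame(records)
print(df)
```

A list of dicts lets pandas line up the columns by key, so a player that lacks an attribute simply gets NaN in that column.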
I would do it with open() and write():
file = open("filename.txt", "w")
The "w" specifies the following:
"w" - Write - Opens a file for writing, creates the file if it does not exist
And then:
file.write(text_to_save)
file.close()
open() is a builtin, so no import is needed for this; import os.path only if you also need to build or check file paths:
import os.path

How to pull specific columns from a Wikipedia table using python/Beautiful Soup

I've really been stumped for a while on this.
Link to table = https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons
I want to pull the data in the columns highlighted in red below
And put it in a pandas dataframe like this
Here is my code
import urllib.request
url = "https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons"
page = urllib.request.urlopen(url)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "lxml")
# print(soup.prettify())
my_table = soup.find('table', {'class':'wikitable sortable'})
season = []
data = []
for row in my_table.find_all('tr'):
    s = row.find('th')
    season.append(s)
    d = row.find('td')
    data.append(d)
import pandas as pd
c = {'Season': season, 'Data': data}
df = pd.DataFrame(c)
df
Here's my output. I'm completely lost on how to get to the simple 5-column table above. Thanks
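You are almost there, though you don't really need BeautifulSoup for that; just pandas.
Try this:
import requests
import pandas as pd
from io import StringIO
url = "https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons"
resp = requests.get(url)
# StringIO because newer pandas versions deprecate passing literal HTML to read_html
tables = pd.read_html(StringIO(resp.text))
# keep columns 0, 2, 3, 4 and 5 of the third table on the page
target = tables[2].iloc[:,[0,2,3,4,5]]
target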
Output:
Season P W D L
Season League League League League
0 1886–87 NaN NaN NaN NaN
1 1888–89[9] 12 8 2 2
2 1889–90 22 9 2 11
etc. And you can take it from there.
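One possible next step is flattening the two-level header seen in the output. The sketch below does not fetch the page; it builds a tiny frame with the same kind of two-level columns by hand (the "Division" column and its placeholder values are invented so there is something to skip), then picks columns by position and keeps only the top header level.

```python
import pandas as pd

# Hand-built stand-in for the scraped table, with a two-level header
columns = pd.MultiIndex.from_tuples([
    ("Season", "Season"),
    ("Division", "League"),  # hypothetical column that we want to skip
    ("P", "League"),
    ("W", "League"),
    ("D", "League"),
    ("L", "League"),
])
df = pd.DataFrame(
    [
        ["1888–89", "(division)", 12, 8, 2, 2],
        ["1889–90", "(division)", 22, 9, 2, 11],
    ],
    columns=columns,
)

# Select columns by position, then keep only the first header level
target = df.iloc[:, [0, 2, 3, 4, 5]].copy()
target.columns = target.columns.get_level_values(0)
print(target)
```

Selecting by position is brittle if the page layout changes, so selecting by column name after flattening may be the safer choice for anything long-lived.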
