I am trying to build a dictionary whose keys can be used for later joins and whose values are lists. Below is my code, the output it currently produces, and the output I want. Can someone please help me get the desired output as a dictionary of lists? Note that the second row does not have a link; when this happens, can a placeholder such as "None" be stored in its place?
import requests
from bs4 import BeautifulSoup
from collections import defaultdict
html='<tr><td align="right">1</td><td align="left"><a href="http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=VictoriaAzarenka">Victoria Azarenka</a></td><td align="left">BLR</td><td align="left">1989-07-31</td></tr> <tr><td align="right">1146</td><td align="left">Brittany Lashway</td><td align="left">USA</td><td align="left">1994-04-06</td></tr>'
soup = BeautifulSoup(html,'lxml')
for cell in soup.find_all('td'):
    if cell.find('a', href=True):
        print(cell.find('a', href=True).attrs['href'])
        print(cell.find('a', href=True).text)
    else:
        print(cell.text)
'''
Output From Code:
1 --> Rank
http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=VictoriaAzarenka --> Website
Victoria Azarenka --> Name
BLR --> Country
1989-07-31 --> Birth Date
1146 --> Rank
Brittany Lashway --> Name
USA --> Country
1994-04-06 --> Birth Date
Desired Output: (Dictionary Table with List component)
{Key, [Rank, Website,Name, Country, Birth Date]}
Example:
{1, [1, http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=VictoriaAzarenka, Victoria Azarenka, BLR, 1989-07-31]}
{2, [1146, None, Brittany Lashway, USA, 1994-04-06]}
'''
You can do something like this using list and dict comprehension:
from bs4 import BeautifulSoup as bs
html='<tr><td align="right">1</td><td align="left"><a href="http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=VictoriaAzarenka">Victoria Azarenka</a></td><td align="left">BLR</td><td align="left">1989-07-31</td></tr> <tr><td align="right">1146</td><td align="left">Brittany Lashway</td><td align="left">USA</td><td align="left">1994-04-06</td></tr>'
# Generator that yields the link (if any) and the text of each cell
def find_link_or_text(a):
    for cell in a:
        if cell.find('a', href=True):
            yield cell.find('a', href=True).attrs['href']
            yield cell.find('a', href=True).text
        else:
            yield cell.text
# Parse data using BeautifulSoup
data = bs(html, 'lxml')
# Keep only the parsed <td> tags
parsed = data.find_all('td')
# Group the cells four at a time (one group per table row)
sub = [list(find_link_or_text(parsed[k:k+4])) for k in range(0, len(parsed), 4)]
# Key each sub-list by its position, from 1 to len(sub)
final = {key: value for key, value in zip(range(1, len(sub) +1), sub)}
print(final)
Output:
{1: ['1', 'http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=VictoriaAzarenka', 'Victoria Azarenka', 'BLR', '1989-07-31'], 2: ['1146', 'Brittany Lashway', 'USA', '1994-04-06']}
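Note that the output above has no placeholder for the missing link in the second row. A minimal sketch of one way to get the requested None, assuming every row has exactly four cells in the order rank, name, country, birth date (the helper name row_to_list and the variable players are made up for this illustration):

```python
from bs4 import BeautifulSoup

# Sample markup: the first row carries a link, the second does not.
HTML = (
    '<table>'
    '<tr><td align="right">1</td>'
    '<td align="left"><a href="http://www.tennisabstract.com/cgi-bin/'
    'wplayer.cgi?p=VictoriaAzarenka">Victoria Azarenka</a></td>'
    '<td align="left">BLR</td><td align="left">1989-07-31</td></tr>'
    '<tr><td align="right">1146</td><td align="left">Brittany Lashway</td>'
    '<td align="left">USA</td><td align="left">1994-04-06</td></tr>'
    '</table>'
)


def row_to_list(tr):
    """Return [rank, website, name, country, birth date] for one row."""
    # Assumption: each row has exactly four <td> cells in this order.
    rank, name, country, born = tr.find_all('td')
    link = name.find('a', href=True)
    return [
        rank.get_text(strip=True),
        link['href'] if link else None,  # placeholder when there is no link
        name.get_text(strip=True),
        country.get_text(strip=True),
        born.get_text(strip=True),
    ]


soup = BeautifulSoup(HTML, 'html.parser')
# Number the rows from 1 so the keys can be used for later joins.
players = {key: row_to_list(tr)
           for key, tr in enumerate(soup.find_all('tr'), start=1)}
print(players)
```

Working row by row, rather than slicing a flat list of cells, keeps every list the same length, so positions stay aligned even when a link is missing.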
Every football player's Wikipedia page has something called an "infobox" where the player's career is displayed.
My goal is to scrape only the highlighted data from the Wikipedia pages of football players.
So far I am able to output the "infobox" segment of the player as text, like this. But the only information I want is the highlighted part.
How do I narrow the result so I only get the highlighted text as my output?
If you think you might know the answer, please ask questions if necessary, because I find it hard to phrase my question well.
The infobox table is a succession of <tr></tr> tags.
Overall, we are looking for the <tr></tr> tag located immediately after the one whose text is "Seniorlag*".
You could do it like this:
import requests
from bs4 import BeautifulSoup
url = "https://sv.wikipedia.org/wiki/Lionel_Messi"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
infobox = soup.find('table', {'class': 'infobox'})
tr_tags = infobox.find_all('tr')
for tr in tr_tags:
    if tr.text == "Seniorlag*":
        # Search for the following tr tag
        next_tr = tr.find_next_sibling('tr')
        print(next_tr.text)
output
År2003–20042004–20052004–20212021–
Klubb Barcelona C Barcelona B Barcelona Paris Saint-Germain
SM (GM) 10 00(5)22 00(6)520 (474)39 0(13)
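The same idea can be tried offline. The sketch below uses a small, made-up infobox (the markup and values are assumptions for illustration, not the real Wikipedia page) to show how find_next_sibling() jumps from the "Seniorlag*" header row to the row that follows it:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified infobox; the real page is far larger.
HTML = """
<table class="infobox">
  <tr><th>Seniorlag*</th></tr>
  <tr><td>2003-2004 Barcelona C</td></tr>
  <tr><th>Landslag</th></tr>
  <tr><td>2005- Argentina</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, 'html.parser')
infobox = soup.find('table', {'class': 'infobox'})

career = None
for tr in infobox.find_all('tr'):
    # get_text(strip=True) ignores the whitespace around the header text
    if tr.get_text(strip=True) == 'Seniorlag*':
        # The next <tr> sibling holds the senior career data
        career = tr.find_next_sibling('tr').get_text(strip=True)
        break

print(career)
```

Comparing against get_text(strip=True) instead of tr.text is a small design choice that makes the match tolerant of newlines or spaces inside the row.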
In addition to the approach of @Vincent Lagache, which answers the question well, you could also use CSS selectors to find your elements:
soup.select_one('tr:has(th:-soup-contains("Seniorlag")) + tr').text
Invoke dict comprehension and stripped_strings to extract the strings:
{
    list(e.stripped_strings)[0]: list(e.stripped_strings)[1:]
    for e in soup.select('tr:has(th:-soup-contains("Seniorlag")) + tr table td')
}
This results in a dict that is already structured and can therefore be easily reused, for example to create a DataFrame:
{'År': ['2003–2004', '2004–2005', '2004–2021', '2021–'], 'Klubb': ['Barcelona C', 'Barcelona B', 'Barcelona', 'Paris Saint-Germain'], 'SM (GM)': ['10', '(5)', '22', '(6)', '520 (474)', '39', '(13)']}
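To see how the selector and stripped_strings work together without fetching the page, here is a small sketch on made-up markup (the nested table and its values are assumptions that only mimic the shape of the infobox). It assumes a Beautiful Soup installation whose bundled soupsieve supports :-soup-contains:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a header row followed by a row with a nested table.
HTML = """
<table class="infobox">
  <tr><th>Seniorlag*</th></tr>
  <tr><td>
    <table><tr>
      <td>Year<br>2003-2004<br>2004-2005</td>
      <td>Club<br>Barcelona C<br>Barcelona B</td>
    </tr></table>
  </td></tr>
</table>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# "+ tr" picks the row right after the one whose <th> mentions Seniorlag;
# "table td" then descends into the nested table's cells.
cells = soup.select('tr:has(th:-soup-contains("Seniorlag")) + tr table td')

career = {}
for cell in cells:
    # First string in the cell is the column label, the rest are values.
    label, *values = cell.stripped_strings
    career[label] = values

print(career)
```

Unpacking with label, *values walks each cell's strings only once, whereas the comprehension above builds the list twice per cell.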
Example
This example also includes some pre- and post-processing steps, such as decompose() to eliminate unwanted tags and using pandas to split a column of tuples:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url='https://sv.wikipedia.org/wiki/Lionel_Messi'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# Remove invisible padding elements before extracting the strings
for hidden in soup.find_all(style='visibility:hidden;color:transparent;'):
    hidden.decompose()
d = {
    list(e.stripped_strings)[0]: list(e.stripped_strings)[1:]
    for e in soup.select('tr:has(th:-soup-contains("Seniorlag")) + tr table td')
}
d['SM (GM)'] = ' '.join(d['SM (GM)']).split()
d['SM (GM)'] = list(zip(d['SM (GM)'][0::2], d['SM (GM)'][1::2]))
df = pd.DataFrame(d)
df[['SM', 'GM']] = pd.DataFrame(df['SM (GM)'].tolist(), index=df.index)
df
Output
          År                Klubb           SM (GM)   SM     GM
0  2003–2004          Barcelona C     ('10', '(5)')   10    (5)
1  2004–2005          Barcelona B     ('22', '(6)')   22    (6)
2  2004–2021            Barcelona  ('520', '(474)')  520  (474)
3      2021–  Paris Saint-Germain    ('39', '(13)')   39   (13)
I am trying to scrape the text of some elements in a table using requests and BeautifulSoup, specifically the country names and the 2-letter country codes from this website.
Here is my code, which I have progressively walked back:
import requests
import bs4
res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
for i in range(3):
    row = soup.find(f'#row{i} td')
    print(row)  # printing to check progress for now
I had hoped to go row-by-row and walk the tags to get the strings like so (over range 249). However, soup.find() doesn't appear to work, just prints blank lists. soup.select() however, works fine:
for i in range(3):
    row = soup.select(f'#row{i} td')
    print(row)
Why does soup.find() not work as expected here?
While .find() returns only the first occurrence of an element, .select() / .find_all() give you a ResultSet you can iterate over.
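The difference is easy to check on a tiny table. The sketch below uses made-up markup that only imitates the row ids from the question; it shows that find() treats its first argument as a tag name, so a CSS selector string matches nothing, while select() and select_one() do understand selectors:

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table using the same id scheme as the question.
HTML = """
<table id="countriesTable"><tbody>
  <tr id="row0"><td>1</td><td>Afghanistan</td><td>AF</td></tr>
  <tr id="row1"><td>2</td><td>Albania</td><td>AL</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# find() looks for a tag literally named "#row0 td", which does not exist.
as_tag_name = soup.find('#row0 td')

# select() takes a CSS selector and returns every match.
via_select = soup.select('#row0 td')

# select_one() takes a CSS selector and returns only the first match.
via_select_one = soup.select_one('#row0 td')

# The find()/find_all() equivalent: filter by tag name and attribute.
via_find = soup.find('tr', id='row0').find_all('td')

print(as_tag_name)
print([td.text for td in via_select])
print(via_select_one.text)
print([td.text for td in via_find])
```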
There are many ways to reach your goal, but the basic pattern is mostly the same: select the rows of the table and iterate over them.
In this first case I selected the table by its id and, close to your initial approach, the <tr> elements also by their id, using a CSS selector with [id^="row"], which matches an id attribute whose value starts with row. In addition, I used .stripped_strings to extract the text from the elements, stored it in a list, and picked the values by index:
for row in soup.select('#countriesTable tr[id^="row"]'):
    row = list(row.stripped_strings)
    print(row[2], row[3])
or, more precisely, selecting all <tr> in the <tbody> of the element with id countriesTable:
for row in soup.select('#countriesTable tbody tr'):
    row = list(row.stripped_strings)
    print(row[2], row[3])
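A small offline sketch of the [id^="row"] pattern, on made-up markup (the table contents, the "summary" row, and the column positions are assumptions; the real site has more columns, which is why the snippets above use indexes 2 and 3):

```python
from bs4 import BeautifulSoup

# Hypothetical table: two data rows plus one row whose id does not
# start with "row" and should therefore be skipped.
HTML = """
<table id="countriesTable">
  <tbody>
    <tr id="row0"><td>1</td><td>Afghanistan</td><td>AF</td></tr>
    <tr id="row1"><td>2</td><td>Albania</td><td>AL</td></tr>
    <tr id="summary"><td>2 countries</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(HTML, 'html.parser')

pairs = []
# [id^="row"] keeps only the <tr> tags whose id starts with "row".
for tr in soup.select('#countriesTable tr[id^="row"]'):
    strings = list(tr.stripped_strings)
    # In this sample, index 1 is the name and index 2 is the code.
    pairs.append((strings[1], strings[2]))

print(pairs)
```

One caveat with picking by index: stripped_strings skips empty cells, so the positions shift on rows where a cell has no text.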
...
An alternative, and in my opinion the best, way to scrape tables is pandas.read_html(), which can use BeautifulSoup under the hood and does most of the work for you:
import pandas as pd
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,:]
or to get only the two specific columns:
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,[1,2]]
             Name ISO 2
0     Afghanistan    AF
1   Åland Islands    AX
2         Albania    AL
3         Algeria    DZ
4  American Samoa    AS
5         Andorra    AD
...
find() expects its first argument to be the name of the tag you're searching for; it won't work with CSS selectors.
So you'll need:
row = soup.find('tr', { 'id': f"row{i}" })
To get the tr with the desired ID.
Then, to get the 2-letter country code, find the first a whose title is ISO 3166-1 alpha-2 code and take its .text:
iso = row.find('a', { 'title': 'ISO 3166-1 alpha-2 code' }).text
To get the full name, there is no class name to search for, so I'd take the td at index 2 and then look inside it for the span containing the country name:
name = row.findAll('td')[2].findAll('span')[2].text
Putting it all together gives:
import requests
import bs4
res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for i in range(3):
    row = soup.find('tr', { 'id': f"row{i}" })
    iso = row.find('a', { 'title': 'ISO 3166-1 alpha-2 code' }).text
    name = row.findAll('td')[2].findAll('span')[2].text
    print(name, iso)
Which outputs:
Afghanistan AF
Åland Islands AX
Albania AL
find_all() and select() return a list of elements, whereas find() and select_one() return only a single element.
import requests
import bs4
import pandas as pd
res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,'lxml')
data = []
for row in soup.select('.tablesorter.mark > tbody tr'):
    name = row.find("span", class_="sortkey").text
    country_code = row.select_one('td:nth-child(4)').text.replace('\n', '').strip()
    data.append({
        'name': name,
        'country_code': country_code})
df = pd.DataFrame(data)
print(df)
Output:
                  name country_code
0          afghanistan           AF
1        aland-islands           AX
2              albania           AL
3              algeria           DZ
4       american-samoa           AS
..                 ...          ...
244  wallis-and-futuna           WF
245     western-sahara           EH
246              yemen           YE
247             zambia           ZM
248           zimbabwe           ZW
[249 rows x 2 columns]
I'd like to zip some lists extracted from HTML. I use code like this:
html_link = 'https://www.pds.com.ph/index.html%3Fpage_id=3261.html'
html = requests.get(html_link).text
soup = BeautifulSoup(html, 'html.parser')
search = re.compile(r"March.+2021")
for td in soup.find_all('td', text=search):
    link = td.parent.select_one("td > a")
    if link:
        titles = link.text
        links = f"Link : 'https://www.pds.com.ph/{link['href']}"
        dates = td.text
for link, title, date in zip(links, titles, dates):
    dataframe = pd.DataFrame({'col1':title,'col2':link,'col3':date},index=[0])
    print(dataframe)
But the output is not what I expected:
col1 col2 col3
1 P L M
col1 col2 col3
1 D i a
...
What I EXPECT is:
Titles Links Dates
... ... ...
May I ask if the syntax is correct or what could I do to achieve that?
You can just pass the result from zip directly to pd.DataFrame, specifying the column names in a list:
df = pd.DataFrame(zip(titles, links, dates), columns=['Titles', 'Links', 'Dates'])
If you are trying to create a DataFrame from the extracted values, you need to store them in lists before calling zip:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
html_link = 'https://www.pds.com.ph/index.html%3Fpage_id=3261.html'
html = requests.get(html_link).text
soup = BeautifulSoup(html, 'html.parser')
search = re.compile(r"March.+2021")
titles = [] # to store extracted values in list
links = []
dates = []
for td in soup.find_all('td', text=search):
    link = td.parent.select_one("td > a")
    if link:
        titles.append(link.text)
        links.append(f"Link : 'https://www.pds.com.ph/{link['href']}")
        dates.append(td.text)
dataframe = pd.DataFrame(zip(titles, links, dates), columns=['Titles', 'Links', 'Dates'])
# or you can use
# dataframe = pd.DataFrame({'Titles': titles, 'Links': links, 'Dates': dates})
print(dataframe)
# Titles Links Dates
# 0 RCBC Lists PHP 17.87257 Billion ASEAN Sustaina... Link : 'https://www.pds.com.ph/index.html%3Fp=... March 31, 2021
# 1 Aboitiz Power Corporation Raises 8 Billion Fix... Link : 'https://www.pds.com.ph/index.html%3Fp=... March 16, 2021
# 2 Century Properties Group, Inc Returns to PDEx ... Link : 'https://www.pds.com.ph/index.html%3Fp=... March 1, 2021
# 3 PDS Group Celebrates 2020’s Top Performers in ... Link : 'https://www.pds.com.ph/index.html%3Fp=... March 29, 2021
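The one-letter cells in the question's output come from zipping strings: zip() iterates over whatever it is given, and iterating over a string yields single characters. A minimal pure-Python sketch with made-up values (the titles, links, and dates below are placeholders, not scraped data):

```python
# Zipping three plain strings pairs them up character by character.
title, link, date = "PDS Group", "Link : x", "March 29, 2021"
by_character = list(zip(title, link, date))[:2]
print(by_character)

# Zipping three lists pairs them up item by item, one tuple per record.
titles = ["Title A", "Title B"]
links = ["Link : a", "Link : b"]
dates = ["March 1, 2021", "March 2, 2021"]
by_record = list(zip(titles, links, dates))
print(by_record)
```

The second form is what pd.DataFrame(zip(titles, links, dates), columns=[...]) receives in the answer above: one tuple per row.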
I am trying to figure out how to print all tr elements from a table, but I can't quite get it working right.
Here is the link I am working with.
https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate
Here is my code.
import requests
from bs4 import BeautifulSoup
link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"
html = requests.get(link).text
# If you do not want to use requests, you can use urllib instead
# (the snippet above). It should not cause any issue.
soup = BeautifulSoup(html, "lxml")
res = soup.findAll("span", {"class": "fn"})
for r in res:
    print("Name: " + r.find('a').text)
table_body = soup.find('senators')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print(cols)
I am trying to print all tr elements from the table named 'senators'. Also, I am wondering if there is a way to click on links of senators, like 'Richard Shelby' which takes me to this:
https://en.wikipedia.org/wiki/Richard_Shelby
From each link, I want to grab the data under 'Assumed office'. In this case the value is: 'January 3, 2018'. So, ultimately, I want to end up with this:
Richard Shelby May 6, 1934 (age 84) Lawyer U.S. House
Alabama Senate January 3, 1987 2022
Assumed office: January 3, 2018
All I can get now is the name of each senator printed out.
In order to locate the "Senators" table, you can first find the corresponding "Senators" label and then get the first following table element:
soup.find(id='Senators').find_next("table")
Now, in order to get the data row by row, you have to account for the cells with a "rowspan", which stretch across multiple rows. You can either follow the approaches suggested in "What should I do when <tr> has rowspan", or use the implementation I provide below (not ideal, but it works in your case).
import copy
import requests
from bs4 import BeautifulSoup
link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"
with requests.Session() as session:
    html = session.get(link).text
    soup = BeautifulSoup(html, "lxml")
    senators_table = soup.find(id='Senators').find_next("table")
    headers = [td.get_text(strip=True) for td in senators_table.tr('th')]
    rows = senators_table.find_all('tr')
    # pre-process table to account for rowspan, TODO: extract into a function
    for row_index, tr in enumerate(rows):
        for cell_index, td in enumerate(tr('td')):
            if 'rowspan' in td.attrs:
                rowspan = int(td['rowspan'])
                del td.attrs['rowspan']
                # insert same td into subsequent rows
                for index in range(row_index + 1, row_index + rowspan):
                    try:
                        rows[index]('td')[cell_index].insert_after(copy.copy(td))
                    except IndexError:
                        continue
    # extracting the desired data
    rows = senators_table.find_all('tr')[1:]
    for row in rows:
        cells = [td.get_text(strip=True) for td in row('td')]
        print(dict(zip(headers, cells)))
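As an alternative illustration of the rowspan problem, here is a sketch of a different technique from the one above: instead of copying tags inside the tree, it builds a plain grid of strings and carries each spanning cell's text down into the rows it covers. The function name table_to_grid and the sample table are made up for this example, and colspan is deliberately ignored:

```python
from bs4 import BeautifulSoup


def table_to_grid(table):
    """Return the table as a list of rows, repeating rowspan cells."""
    grid = []
    pending = {}  # column index -> (rows still to fill, cell text)
    for tr in table.find_all('tr'):
        row = []
        cells = iter(tr.find_all(['td', 'th']))
        col = 0
        while True:
            if col in pending:
                # A cell from an earlier row still covers this column.
                remaining, text = pending[col]
                row.append(text)
                if remaining > 1:
                    pending[col] = (remaining - 1, text)
                else:
                    del pending[col]
                col += 1
                continue
            cell = next(cells, None)
            if cell is None:
                break
            text = cell.get_text(strip=True)
            span = int(cell.get('rowspan', 1))
            if span > 1:
                # Remember the text for the following span - 1 rows.
                pending[col] = (span - 1, text)
            row.append(text)
            col += 1
        grid.append(row)
    return grid


# Hypothetical sample with placeholder names, not the real Senate table.
HTML = """
<table>
  <tr><th>State</th><th>Senator</th></tr>
  <tr><td rowspan="2">Alabama</td><td>Senator A</td></tr>
  <tr><td>Senator B</td></tr>
  <tr><td rowspan="2">Alaska</td><td>Senator C</td></tr>
  <tr><td>Senator D</td></tr>
</table>
"""

grid = table_to_grid(BeautifulSoup(HTML, 'html.parser').find('table'))
headers, *records = grid
for record in records:
    print(dict(zip(headers, record)))
```

Because the grid holds only strings, the parsed tree is left untouched, which can matter if the same soup is reused later, for example to follow the links in each row.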
If you want to, then, follow the links to senator "profile" pages, you would first need to extract the link out of the appropriate cell in a row and then use session.get() to "navigate" to it, something along these lines:
senator_link = row.find_all('td')[3].a['href']
senator_link = urljoin(link, senator_link)
response = session.get(senator_link)
soup = BeautifulSoup(response.content, "lxml")
# TODO: parse
where urljoin is imported as:
from urllib.parse import urljoin
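For reference, a short sketch of what urljoin does with the kind of relative href Wikipedia uses. The href value here is an assumption based on the Richard Shelby URL quoted in the question:

```python
from urllib.parse import urljoin

link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"

# Assumed relative href, as it would appear in the table cell's <a> tag.
href = "/wiki/Richard_Shelby"

# A root-relative href replaces the whole path of the base URL.
senator_link = urljoin(link, href)
print(senator_link)

# An href that is already absolute is returned unchanged.
absolute = urljoin(link, "https://en.wikipedia.org/wiki/Richard_Shelby")
print(absolute)
```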
Also, FYI, one of the reasons to use requests.Session() here is to optimize making requests to the same host:
The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3’s connection pooling. So if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase
There is also another way to parse the tabular data: .read_html() from pandas. You could do:
import pandas as pd
df = pd.read_html(str(senators_table))[0]
print(df.head())
to get the desired table as a dataframe.
I have a simple 4x2 html table that contains information about a property.
I'm trying to extract the value 1972, which is under the column heading of Year Built. If I find all the tags td, how do I extract the index of the tag that contains the text Year Built?
Because once I find that index, I can just add 4 to get to the tag that contains the value 1972.
Here is the html:
<table>
<tbody>
<tr>
<td>Building</td>
<td>Type</td>
<td>Year Built</td>
<td>Sq. Ft.</td>
</tr>
<tr>
<td>R01</td>
<td>DWELL</td>
<td>1972</td>
<td>1166</td>
</tr>
</tbody>
</table>
For example I know that if my input is index 2 and my output is text of that tag Year Built, I can just do this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(myhtml, 'html.parser')
td_list = soup.find_all('td')
print(td_list[2].text)
But how do I use input of text Year Built to get output of index 2?
If your table has a fixed layout, it is better to use row and column indexes. Try this:
rows = soup.find("table").find("tbody").find_all("tr")
print(rows[1].find_all("td")[2].get_text())
Alternatively, if you just want to find the index of the tag containing "Year Built":
from bs4 import BeautifulSoup
soup = BeautifulSoup(myhtml, 'html.parser')
td_list = soup.find_all('td')
i = 0
for elem in td_list:
    if elem.text == 'Year Built':
        ind = i
    i += 1
print(td_list[ind].text)
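A compact sketch of the same lookup, using the table from the question. It assumes the first row holds the headings, so the value sits one full row (four cells here) after the heading's index; the variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

HTML = """
<table>
<tbody>
<tr>
<td>Building</td><td>Type</td><td>Year Built</td><td>Sq. Ft.</td>
</tr>
<tr>
<td>R01</td><td>DWELL</td><td>1972</td><td>1166</td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# Flatten every cell's text into one list, in document order.
cells = [td.get_text(strip=True) for td in soup.find_all('td')]

# list.index() returns the position of the first matching text.
index = cells.index('Year Built')

# The number of cells in the first row is the offset to the next row.
columns = len(soup.find('tr').find_all('td'))

year_built = cells[index + columns]
print(index, year_built)
```

list.index() raises ValueError when the heading is absent, which is usually easier to debug than a loop that silently leaves the index variable undefined.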
Convert it to a dict and get the value:
from bs4 import BeautifulSoup
table_data = [[cell.text for cell in row("td")] for row in BeautifulSoup(myhtml, 'html.parser')("tr")]
# Avoid naming the variable "dict", which would shadow the built-in
row_by_header = dict(zip(table_data[0], table_data[1]))
print(row_by_header['Year Built'])
Assuming your content is stored in a file named filename, please try:
In [3]: soup = BeautifulSoup(open("filename"), 'html.parser')
In [4]: print(soup.find_all('td')[2].string)
Year Built
Year Built