bs4 soup.select() vs. soup.find() - python

I am trying to scrape the text of some elements in a table using requests and BeautifulSoup, specifically the country names and the 2-letter country codes from this website.
Here is my code, which I have progressively walked back:
import requests
import bs4

res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)

for i in range(3):
    row = soup.find(f'#row{i} td')
    print(row)  # printing to check progress for now
I had hoped to go row-by-row and walk the tags to get the strings like so (over range(249)). However, soup.find() doesn't appear to work; it just prints None. soup.select(), however, works fine:
for i in range(3):
    row = soup.select(f'#row{i} td')
    print(row)
Why does soup.find() not work as expected here?

While .find() returns only the first occurrence of an element, .select() / .find_all() give you a ResultSet you can iterate.
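The difference is easy to see on a small, hypothetical snippet (not the original site's markup):

```python
import bs4

# Minimal hypothetical HTML for illustration
html = '<table><tr id="row0"><td>Afghanistan</td><td>AF</td></tr></table>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# .find() treats its first argument as a tag NAME, so a CSS selector
# string like '#row0 td' matches nothing and returns None
print(soup.find('#row0 td'))    # None

# .select() interprets the string as a CSS selector and returns a list
print(soup.select('#row0 td'))  # [<td>Afghanistan</td>, <td>AF</td>]

# .find() with keyword arguments is the non-CSS equivalent
print(soup.find('tr', id='row0').find('td'))  # <td>Afghanistan</td>
```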
There are many ways to reach your goal, but the basic pattern is mostly the same: select the rows of the table and iterate over them.
In this first case I selected the table by its id and, staying close to your initial approach, the <tr> elements also by their id, using the CSS selector [id^="row"], which matches any id attribute whose value starts with row. In addition I used .stripped_strings to extract the text from the elements, stored it in a list, and picked values by index:
for row in soup.select('#countriesTable tr[id^="row"]'):
    row = list(row.stripped_strings)
    print(row[2], row[3])
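What .stripped_strings yields can be checked on a tiny, hypothetical row (not the site's real markup):

```python
from bs4 import BeautifulSoup

# stripped_strings yields every text fragment in the subtree with
# surrounding whitespace trimmed, skipping whitespace-only fragments
html = '<tr id="row0"><td> 1 </td><td><a>Afghanistan</a></td><td>AF</td></tr>'
row = BeautifulSoup(html, 'html.parser')
print(list(row.stripped_strings))  # ['1', 'Afghanistan', 'AF']
```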
or, more precisely, selecting all <tr> in the <tbody> of the tag with id countriesTable:
for row in soup.select('#countriesTable tbody tr'):
    row = list(row.stripped_strings)
    print(row[2], row[3])
...
An alternative, and in my opinion the best, way to scrape tables is pandas.read_html(), which works with BeautifulSoup under the hood and does most of the work for you:
import pandas as pd
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,:]
or to get only the two specific columns:
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,[1,2]]
              Name ISO 2
0      Afghanistan    AF
1    Åland Islands    AX
2          Albania    AL
3          Algeria    DZ
4   American Samoa    AS
5          Andorra    AD
...

find() expects its first argument to be the name of the tag you're searching for; it won't work with CSS selectors.
So you'll need:
row = soup.find('tr', { 'id': f"row{i}" })
To get the tr with the desired ID.
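On a reduced, hypothetical snippet, the attribute-dict form of .find() and the CSS form of .select_one() land on the very same Tag object:

```python
from bs4 import BeautifulSoup

# Hypothetical row markup for illustration
html = ('<table><tr id="row0">'
        '<td><a title="ISO 3166-1 alpha-2 code">AF</a></td>'
        '</tr></table>')
soup = BeautifulSoup(html, 'html.parser')

# attribute-dict form understood by .find()
row_find = soup.find('tr', {'id': 'row0'})
# CSS-selector form understood by .select_one()
row_select = soup.select_one('tr#row0')

print(row_find is row_select)  # True: both point at the same Tag in the tree
```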
Then, to get the 2-letter country code, find the first a with title "ISO 3166-1 alpha-2 code" and get its .text:
iso = row.find('a', { 'title': 'ISO 3166-1 alpha-2 code' }).text
To get the full name, there is no class name to search for, so I'd take the third <td> and then search for the span containing the country name:
name = row.findAll('td')[2].findAll('span')[2].text
Putting it all together gives:
import requests
import bs4

res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

for i in range(3):
    row = soup.find('tr', {'id': f"row{i}"})
    iso = row.find('a', {'title': 'ISO 3166-1 alpha-2 code'}).text
    name = row.findAll('td')[2].findAll('span')[2].text
    print(name, iso)
Which outputs:
Afghanistan  AF
Åland Islands  AX
Albania  AL

find_all() and select() return a list of matching elements, but find() and select_one() return only a single element.
import requests
import bs4
import pandas as pd

res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')

data = []
for row in soup.select('.tablesorter.mark > tbody tr'):
    name = row.find("span", class_="sortkey").text
    country_code = row.select_one('td:nth-child(4)').text.replace('\n', '').strip()
    data.append({
        'name': name,
        'country_code': country_code})

df = pd.DataFrame(data)
print(df)
Output:
                  name country_code
0          afghanistan           AF
1        aland-islands           AX
2              albania           AL
3              algeria           DZ
4       american-samoa           AS
..                 ...          ...
244  wallis-and-futuna           WF
245     western-sahara           EH
246              yemen           YE
247             zambia           ZM
248           zimbabwe           ZW

[249 rows x 2 columns]


How to narrow down the soup.find result and output only relevant text?

Every football player's Wikipedia page has something called an "infobox" where the career is displayed.
My goal is to scrape only the highlighted data from the Wikipedia pages of football players.
I have gotten this far; I'm able to output the "infobox" segment of the player as text like this. But the only information I want is the highlighted part.
How do I narrow the result so I only get the highlighted text as my output?
If you feel like you might know the answer, please ask questions if necessary, because I find it hard to formulate my question well.
The infobox table is a succession of <tr></tr> tags.
Globally we are looking for the <tr></tr> tag located immediately after the one whose text is "Seniorlag*".
You could do it like this:
import requests
from bs4 import BeautifulSoup

url = "https://sv.wikipedia.org/wiki/Lionel_Messi"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

infobox = soup.find('table', {'class': 'infobox'})
tr_tags = infobox.find_all('tr')

for tr in tr_tags:
    if tr.text == "Seniorlag*":
        # Search for the following tr tag
        next_tr = tr.find_next_sibling('tr')
        print(next_tr.text)
output
År2003–20042004–20052004–20212021–
Klubb Barcelona C Barcelona B Barcelona Paris Saint-Germain
SM (GM) 10 00(5)22 00(6)520 (474)39 0(13)
Just in addition to the approach of @Vincent Lagache, which answers the question well, you could also use CSS selectors (more) to find your elements:
soup.select_one('tr:has(th:-soup-contains("Seniorlag")) + tr').text
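How this selector behaves can be seen on a reduced, hypothetical infobox (the real page has far more structure):

```python
from bs4 import BeautifulSoup

# Reduced, hypothetical infobox structure for illustration
html = '''<table>
<tr><th>Seniorlag*</th></tr>
<tr><td>Barcelona</td></tr>
</table>'''
soup = BeautifulSoup(html, 'html.parser')

# tr:has(th:-soup-contains("Seniorlag")) matches the header row,
# and "+ tr" steps to the immediately following sibling row
print(soup.select_one('tr:has(th:-soup-contains("Seniorlag")) + tr').text)
```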
Invoke a dict comprehension and stripped_strings to extract the strings:
{
    list(e.stripped_strings)[0]: list(e.stripped_strings)[1:]
    for e in soup.select('tr:has(th:-soup-contains("Seniorlag")) + tr table td')
}
This results in a dict that is already structured and can therefore be easily reused, for example to create a DataFrame:
{'År': ['2003–2004', '2004–2005', '2004–2021', '2021–'], 'Klubb': ['Barcelona C', 'Barcelona B', 'Barcelona', 'Paris Saint-Germain'], 'SM (GM)': ['10', '(5)', '22', '(6)', '520 (474)', '39', '(13)']}
Example
This example also includes some pre- and postprocessing steps, like decompose() to eliminate unwanted tags and splitting a column of tuples with pandas:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://sv.wikipedia.org/wiki/Lionel_Messi'
soup = BeautifulSoup(requests.get(url).text)

for hidden in soup.find_all(style='visibility:hidden;color:transparent;'):
    hidden.decompose()

d = {
    list(e.stripped_strings)[0]: list(e.stripped_strings)[1:]
    for e in soup.select('tr:has(th:-soup-contains("Seniorlag")) + tr table td')
}
d['SM (GM)'] = ' '.join(d['SM (GM)']).split()
d['SM (GM)'] = list(zip(d['SM (GM)'][0::2], d['SM (GM)'][1::2]))

df = pd.DataFrame(d)
df[['SM', 'GM']] = pd.DataFrame(df['SM (GM)'].tolist(), index=df.index)
df
Output
          År                Klubb           SM (GM)   SM     GM
0  2003–2004          Barcelona C     ('10', '(5)')   10    (5)
1  2004–2005          Barcelona B     ('22', '(6)')   22    (6)
2  2004–2021            Barcelona  ('520', '(474)')  520  (474)
3      2021–  Paris Saint-Germain    ('39', '(13)')   39   (13)

How to get different texts separately from one tag in BeautifulSoup?

I am trying to scrape Disney Pictures films data from this Wikipedia page: https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films
This is my code:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

url = "https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films"
page = requests.get(url).content
soup = bs(page, "html.parser")

tbodies = soup.find_all("tbody")
for tbody in tbodies:
    trs = tbody.find_all("tr")
    for tr in trs:
        tds = tr.find_all("td")
        for td in tds:
            print(td.text)
This is a screenshot of the inspect pane:
As you can see, the different texts I want to get (title, date and notes) are in this highlighted "td" tag.
I tried print(td[0].text) or print(td[2].text) at the end of my code, but it returns an error.
How can I print these three different texts separately?
P.S. I don't want to use pd.read_html(url)
To get the different texts separately, you can use CSS selectors instead of list slicing:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

url = "https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films"
page = requests.get(url).content
soup = bs(page, "html.parser")

t = []
d = []
n = []

title = [x.get_text(strip=True) for x in soup.select('.wikitable.sortable tbody tr td i a')]
#print(title)
t.extend(title)

date = [x.get_text(strip=True) for x in soup.select('.wikitable.sortable tbody tr td:nth-child(2)')]
d.extend(date)

notes = [x.get_text(strip=True) for x in soup.select('.wikitable.sortable tbody tr td:nth-child(3)')]
n.extend(notes)

df = pd.DataFrame(data=list(zip(t, d, n)), columns=['Title', 'Date', 'Note'])
print(df)
Output:
Title ...
0 Academy Award Review of Walt Disney Cartoons ... Anthology film. Distributed byUnited Artists.
1 Snow White and the Seven Dwarfs ... First film to be distributed byRKO Radio Pictu...
2 Pinocchio ...
3 Fantasia ... Anthology film
4 The Reluctant Dragon ... Fictionalized tour around the Disney studio
.. ... ...
...
531 The Return of the Rocketeer ... co-production withThese Pictures
532 Tower of Terror ...
533 Tron: Ares ... co-production withRideback
534 FC Barcelona ... co-production withPixar Animation Studios
535 Young Woman and the Sea ...
[536 rows x 3 columns]

How to grab all tr elements from a table and click on a link?

I am trying to figure out how to print all tr elements from a table, but I can't quite get it working right.
Here is the link I am working with.
https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate
Here is my code.
import requests
from bs4 import BeautifulSoup

link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"
html = requests.get(link).text
# urllib could be used here instead of requests; it should not cause any issue
soup = BeautifulSoup(html, "lxml")

res = soup.findAll("span", {"class": "fn"})
for r in res:
    print("Name: " + r.find('a').text)

table_body = soup.find('senators')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print(cols)
I am trying to print all tr elements from the table named 'senators'. Also, I am wondering if there is a way to click on links of senators, like 'Richard Shelby' which takes me to this:
https://en.wikipedia.org/wiki/Richard_Shelby
From each link, I want to grab the data under 'Assumed office'. In this case the value is: 'January 3, 2018'. So, ultimately, I want to end up with this:
Richard Shelby May 6, 1934 (age 84) Lawyer U.S. House
Alabama Senate January 3, 1987 2022
Assumed office: January 3, 2018
All I can get now is the name of each senator printed out.
In order to locate the "Senators" table, you can first find the corresponding "Senators" label and then get the first following table element:
soup.find(id='Senators').find_next("table")
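The find(id=...).find_next(...) pattern can be checked on a tiny, hypothetical page fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: a heading anchor followed later by its table
html = '''<h2><span id="Senators">Senators</span></h2>
<p>Some intro text</p>
<table><tr><th>Name</th></tr></table>'''
soup = BeautifulSoup(html, 'html.parser')

# find the element carrying the id, then walk forward in document
# order to the first <table> that follows it
table = soup.find(id='Senators').find_next('table')
print(table.th.text)  # Name
```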
Now, in order to get the data row by row, you would have to account for the cells with a "rowspan" which stretch across multiple rows. You can either follow the approaches suggested at What should I do when <tr> has rowspan, or the implementation I provide below (not ideal but works in your case).
import copy
import requests
from bs4 import BeautifulSoup

link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"
with requests.Session() as session:
    html = session.get(link).text

    soup = BeautifulSoup(html, "lxml")
    senators_table = soup.find(id='Senators').find_next("table")

    headers = [td.get_text(strip=True) for td in senators_table.tr('th')]
    rows = senators_table.find_all('tr')

    # pre-process table to account for rowspan, TODO: extract into a function
    for row_index, tr in enumerate(rows):
        for cell_index, td in enumerate(tr('td')):
            if 'rowspan' in td.attrs:
                rowspan = int(td['rowspan'])
                del td.attrs['rowspan']

                # insert same td into subsequent rows
                for index in range(row_index + 1, row_index + rowspan):
                    try:
                        rows[index]('td')[cell_index].insert_after(copy.copy(td))
                    except IndexError:
                        continue

    # extracting the desired data
    rows = senators_table.find_all('tr')[1:]
    for row in rows:
        cells = [td.get_text(strip=True) for td in row('td')]
        print(dict(zip(headers, cells)))
If you then want to follow the links to senator "profile" pages, you would first need to extract the link from the appropriate cell in a row and then use session.get() to "navigate" to it, something along these lines:
senator_link = row.find_all('td')[3].a['href']
senator_link = urljoin(link, senator_link)
response = session.get(senator_link)
soup = BeautifulSoup(response.content, "lxml")
# TODO: parse
where urljoin is imported as:
from urllib.parse import urljoin
Also, FYI, one of the reasons to use requests.Session() here is to optimize making requests to the same host:
The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3’s connection pooling. So if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase
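A minimal sketch of the Session pattern (no network call is made here; the User-Agent value is just an illustration):

```python
import requests

# One Session shares connection pooling, cookies and default headers
session = requests.Session()
session.headers.update({'User-Agent': 'my-scraper'})  # sent on every request

# Each call below would reuse the same pooled TCP connection, e.g.:
# response = session.get('https://en.wikipedia.org/wiki/Richard_Shelby')
print(session.headers['User-Agent'])  # my-scraper
```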
There is also another way to get the tabular data parsed: .read_html() from pandas. You could do:
import pandas as pd
df = pd.read_html(str(senators_table))[0]
print(df.head())
to get the desired table as a dataframe.

BeautifulSoup Parse Text after <b> and before </br>

I have this code trying to parse search results from a grant website (please find the URL in the code, I can't post the link yet until my rep is higher). I want the "Year" and "Award Amount" values that appear after <b> tags and before <br> tags.
Two questions:
1) Why is this only returning the 1st table?
2) Is there any way I can get the text that is after the <b> tags (i.e. the Year and Award Amount strings) and before the <br> tags (i.e. the actual values, such as 2015 and $100,000)?
Specifically:
<td valign="top">
    <b>Year: </b>2014<br>
    <b>Award Amount: </b>$84,907 </td>
Here is my script:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.ned.org/wp-content/themes/ned/search/grant-search.php?' \
      'organizationName=&region=ASIA&projectCountry=China&amount=&fromDate=&toDate=&' \
      'projectFocus%5B%5D=&search=&maxCount=25&orderBy=Year&start=1&sbmt=1'
r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, "html.parser")

tables = soup.find_all('table')

data = {
    'col_names': [],
    'info': [],
    'year_amount': []
}

index = 0
for table in tables:
    rows = table.find_all('tr')[1:]
    for row in rows:
        cols = row.find_all('td')
        data['col_names'].append(cols[0].get_text())
        data['info'].append(cols[1].get_text())
        try:
            data['year_amount'].append(cols[2].get_text())
        except IndexError:
            data['year_amount'].append(None)

    grant_df = pd.DataFrame(data)
    index += 1
    filename = 'grant ' + str(index) + '.csv'
    grant_df.to_csv(filename)
I would suggest approaching the table parsing in a different manner. All of the information is available in the first row of each table. So you can parse the text of the row like:
Code:
text = '\n'.join([x.strip() for x in rows[0].get_text().split('\n')
                  if x.strip()]).replace(':\n', ': ')
data_dict = {k.strip(): v.strip() for k, v in
             [x.split(':', 1) for x in text.split('\n')]}
How?:
This takes the text and:
- splits it on newlines
- removes any blank lines
- removes any leading/trailing space
- joins the lines back together into a single text
- joins any line ending in ':' with the next line
Then:
- splits the text again by newline
- splits each line by ':'
- strips any whitespace from the text on either side of the ':'
- uses the split text as key and value in a dict
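A worked example of those steps on a small sample string (hypothetical cell text, shaped like the grant rows):

```python
# Sample raw cell text, as get_text() might return it: labels and
# values on separate lines, surrounded by whitespace and blank lines
raw = '''
  Year:
  2014
  Award Amount:
  $84,907
'''

# strip, drop blanks, rejoin, and glue each "Label:" onto its value line
text = '\n'.join([x.strip() for x in raw.split('\n')
                  if x.strip()]).replace(':\n', ': ')

# split each "Label: value" line on the first ':' into a dict entry
data_dict = {k.strip(): v.strip() for k, v in
             [x.split(':', 1) for x in text.split('\n')]}
print(data_dict)  # {'Year': '2014', 'Award Amount': '$84,907'}
```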
Test Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.ned.org/wp-content/themes/ned/search/grant-search.php?' \
      'organizationName=&region=ASIA&projectCountry=China&amount=&' \
      'fromDate=&toDate=&projectFocus%5B%5D=&search=&maxCount=25&' \
      'orderBy=Year&start=1&sbmt=1'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

data = []
for table in soup.find_all('table'):
    rows = table.find_all('tr')
    text = '\n'.join([x.strip() for x in rows[0].get_text().split('\n')
                      if x.strip()]).replace(':\n', ': ')
    data_dict = {k.strip(): v.strip() for k, v in
                 [x.split(':', 1) for x in text.split('\n')]}
    if data_dict.get('Award Amount'):
        data.append(data_dict)

grant_df = pd.DataFrame(data)
print(grant_df.head())
Results:
Award Amount Description \
0 $84,907 To strengthen the capacity of China's rights d...
1 $204,973 To provide an effective forum for free express...
2 $48,000 To promote religious freedom in China. The org...
3 $89,000 To educate and train civil society activists o...
4 $65,000 To encourage greater public discussion, transp...
Organization Name Project Country Project Focus \
0 NaN Mainland China Rule of Law
1 Princeton China Initiative Mainland China Freedom of Information
2 NaN Mainland China Rule of Law
3 NaN Mainland China Democratic Ideas and Values
4 NaN Mainland China Rule of Law
Project Region Project Title Year
0 Asia Empowering the Chinese Legal Community 2014
1 Asia Supporting Free Expression and Open Debate for... 2014
2 Asia Religious Freedom, Rights Defense and Rule of ... 2014
3 Asia Education on Civil Society and Democratization 2014
4 Asia Promoting Democratic Policy Change in China 2014

python beautifulsoup dictionary table with list

I am trying to create a dictionary table with a key value, for later joining, that is associated with a list. Below is the code, the output the code produces, and the desired output. Can someone please help me achieve the desired output in dictionary-with-list form? Note that the second set does not have a link; when something like this occurs, can a value such as "None" be placed there?
import requests
from bs4 import BeautifulSoup
from collections import defaultdict

html='<tr><td align="right">1</td><td align="left">Victoria Azarenka</td><td align="left">BLR</td><td align="left">1989-07-31</td></tr> <tr><td align="right">1146</td><td align="left">Brittany Lashway</td><td align="left">USA</td><td align="left">1994-04-06</td></tr>'
soup = BeautifulSoup(html, 'lxml')

for cell in soup.find_all('td'):
    if cell.find('a', href=True):
        print(cell.find('a', href=True).attrs['href'])
        print(cell.find('a', href=True).text)
    else:
        print(cell.text)
'''
Output From Code:
1 --> Rank
http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=VictoriaAzarenka --> Website
Victoria Azarenka --> Name
BLR --> Country
1989-07-31 --> Birth Date
1146 --> Rank
Brittany Lashway --> Name
USA --> Country
1994-04-06 --> Birth Date
Desired Output: (Dictionary Table with List component)
{Key, [Rank, Website,Name, Country, Birth Date]}
Example:
{1, [1, http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=VictoriaAzarenka, Victoria Azarenka, BLR, 1989-07-31]}
{2, [1146, None, Brittany Lashway, USA, 1994-04-06]}
'''
You can do something like this using list and dict comprehensions:
from bs4 import BeautifulSoup as bs

html='<tr><td align="right">1</td><td align="left">Victoria Azarenka</td><td align="left">BLR</td><td align="left">1989-07-31</td></tr> <tr><td align="right">1146</td><td align="left">Brittany Lashway</td><td align="left">USA</td><td align="left">1994-04-06</td></tr>'

# Generator to find the desired text and links
def find_link_or_text(a):
    for cell in a:
        if cell.find('a', href=True):
            yield cell.find('a', href=True).attrs['href']
            yield cell.find('a', href=True).text
        else:
            yield cell.text

# Parse data using BeautifulSoup
data = bs(html, 'lxml')

# Return only the parsed data within td tags
parsed = data.find_all('td')

# Group elements into rows of 4 <td> cells
sub = [list(find_link_or_text(parsed[k:k+4])) for k in range(0, len(parsed), 4)]

# Put each sub-list under a key from 1 to len(sub)
final = {key: value for key, value in zip(range(1, len(sub) + 1), sub)}
print(final)
Output:
{1: ['1', 'http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=VictoriaAzarenka', 'Victoria Azarenka', 'BLR', '1989-07-31'], 2: ['1146', 'Brittany Lashway', 'USA', '1994-04-06']}
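The grouping step above is plain fixed-stride slicing; in isolation (sample cell values copied from the question):

```python
# Grouping a flat list of cells into fixed-size rows by slicing in strides of 4
cells = ['1', 'Victoria Azarenka', 'BLR', '1989-07-31',
         '1146', 'Brittany Lashway', 'USA', '1994-04-06']
rows = [cells[k:k + 4] for k in range(0, len(cells), 4)]
print(rows[0])    # ['1', 'Victoria Azarenka', 'BLR', '1989-07-31']
print(len(rows))  # 2
```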
