Cannot scrape the table for href links - python

I am trying to pull the href links from the table, which I later need to click one by one to access the data inside each link. But I cannot figure out a way to do it. I have tried find_all and keep getting a "ResultSet object has no attribute '%s'" error.
HTML (really long, so this is about a tenth of it):
<thead>
<tr class="sctablehead">
<th>Academic Program</th>
<th>Departments</th>
<th>Academic Level</th>
<th>College</th>
<th>Online</th>
<th>Degree Type</th>
</tr>
</thead>
<tbody>
<tr class="even firstrow"><td>Accountancy</td><td>Accounting</td><td>Graduate</td><td>BUS</td><td></td><td>MAC</td></tr>
<tr class="odd"><td>Accounting</td><td>Accounting</td><td>Undergraduate</td><td>BUS</td><td></td><td>BSB</td></tr>
<tr class="even"><td>Accounting</td><td>Accounting</td><td>Undergraduate</td><td>BUS</td><td></td><td>Minor</td></tr>
<tr class="odd"><td>Actuarial Science</td><td>Mathematics, Economics, Finance</td><td>Undergraduate</td><td>STEM</td><td></td><td>Minor</td></tr>
<tr class="even"><td>Adult Gerontology Acute Care Nurse Practitioner</td><td>Nursing</td><td>Graduate</td><td>HHS</td><td></td><td>PMC</td></tr>
<tr class="odd"><td>Advertising and Public Relations</td><td>Advertising</td><td>Undergraduate</td><td>BUS</td><td></td><td>BSB</td></tr>
<tr class="even"><td>Advertising Public Relations</td><td>Marketing</td><td>Undergraduate</td><td>BUS</td><td></td><td>Minor</td></tr>
<tr class="odd"><td>Aerospace Studies</td><td>Aerospace Studies</td><td>Undergraduate</td><td>HHS</td><td></td><td>Minor</td></tr>
<tr class="even"><td>Africana Studies</td><td>Africana Studies</td><td>Undergraduate</td><td>BCLASSE</td><td></td><td>Minor</td></tr>
... And so on
My Code:
r = requests.get(driver.current_url)
soup = bs(r.content, 'html.parser')
programs_table = soup.find_all('table', {"class":"sc_sctable tbl_degrees sorttable"})
for tr in programs_table.find_all('tr class'):
    for a in tr.find_all('a'):
        print(a['href'])

If your table is found properly (you have not provided the HTML for the table tag itself), then only a small change is needed:
r = requests.get(driver.current_url)
soup = bs(r.content, 'html.parser')
programs_table = soup.find('table', {"class": "sc_sctable tbl_degrees sorttable"})
for tr in programs_table.find_all('tr'):
    for a in tr.find_all('a'):
        print(a['href'])
In other words, use find (not find_all, which returns a ResultSet and is what raises the attribute error) to get the table, and use programs_table.find_all("tr") instead of programs_table.find_all("tr class"): "tr class" is not a valid tag name.
With this change I get results like the following:
/undergraduate/colleges-programs/college-business-administration/school-accounting-finance/bsba-in-accounting/
/undergraduate/colleges-programs/college-business-administration/school-accounting-finance/accounting-minor/
/undergraduate/colleges-programs/college-science-technology-engineering-mathematics/department-mathematics-statistics/actuarial-science-minor/
/graduate/graduate-programs/post-masters-adult-gero-acute-care-nurse-pract-certificate-program/
/undergraduate/colleges-programs/college-business-administration/department-marketing/advertising-public-relations/
/undergraduate/colleges-programs/college-business-administration/department-marketing/advertising-public-relations-minor/
/undergraduate/colleges-programs/college-health-human-services/aerospace-studies-program/
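The underlying error comes from calling find_all on a ResultSet rather than on a single Tag. A minimal sketch of the difference, using a trivial made-up table rather than the asker's page:

```python
from bs4 import BeautifulSoup

html = '<table class="t"><tr><td><a href="/x">x</a></td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

tables = soup.find_all('table')   # ResultSet: a list-like collection of Tags
table = soup.find('table')        # a single Tag

# Calling find_all on the ResultSet is what raises
# "ResultSet object has no attribute 'find_all'"
try:
    tables.find_all('tr')
    raised = False
except AttributeError:
    raised = True

# On the single Tag it works as expected:
links = [a['href'] for tr in table.find_all('tr') for a in tr.find_all('a')]
print(raised, links)  # True ['/x']
```

So whenever you intend to drill further into one element, fetch it with find (or index into the ResultSet) before chaining another search.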

First of all, you shouldn't use find_all to scrape a single tag unless you really want the result in a list. So, to get the table you should just do:
programs_table = soup.find('table', class_="sc_sctable")
Now, to get the inner <a> tags with href links, you can scrape the <td> tags that have an inner <a> tag:
tags_with_href = programs_table.tbody.find_all('td')
links = [each_tag.a['href'] for each_tag in tags_with_href if each_tag.a]
# -> ['/graduate/graduate-programs/master-accountancy/', ... ]
If you want absolute URLs instead of relative ones, you can define a base_url and prepend it to each relative URL:
base_url = '<base_url_of_website>'
links = [base_url + link for link in links]
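Instead of plain string concatenation, the standard library's urllib.parse.urljoin handles trailing slashes and leaves already-absolute links alone. A sketch with a hypothetical site root:

```python
from urllib.parse import urljoin

base_url = 'https://www.example.edu'  # hypothetical site root
links = [
    '/undergraduate/colleges-programs/accounting-minor/',  # relative
    'https://other.example.com/page',                      # already absolute
]

# urljoin resolves each link against the base, passing absolute URLs through
absolute = [urljoin(base_url, link) for link in links]
print(absolute)
```

This avoids accidental double slashes or double-prefixed URLs when the scraped hrefs are a mix of relative and absolute links.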

Related

Get contents of a particular row

I want to locate the 'td' whose text is 'xyz' so that I can find other attributes in the row. I only have 'xyz' and want to get the other elements in that row.
.
.
.
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>
.
.
.
I can get 'xyz' easily by using
required = soup.find('a', text='xyz')
print(required.text)
but I'm not able to locate the 'td', so I can't use find_next_siblings() to get the other columns.
Expected output:
xyz
address
phone number
With bs4 4.7.1+ you can combine the pseudo-classes :has and :contains to retrieve the row and the tds within it.
This bit targets the right a tag, if present, by its text:
a:contains("xyz")
You then retrieve the parent row (tr) containing that a tag:
tr:has(a:contains("xyz"))
And finally you use a descendant combinator and a td type selector to get all the tds within that row, with a list comprehension returning them as a list.
from bs4 import BeautifulSoup as bs
html = '''
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>
'''
soup = bs(html, 'lxml')
items = [item.text.strip() for item in soup.select('tr:has(a:contains("xyz")) td')]
print(items)
If you have a modern BeautifulSoup, you can use the CSS selector :contains, then traverse back up with the find_parent() method.
from bs4 import BeautifulSoup
s = '''
<tr>
<td>Other1</td>
<td>Other1</td>
<td>Other1</td>
</tr>
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>'''
soup = BeautifulSoup(s, 'lxml')
for td in soup.select_one('a:contains(xyz)').find_parent('tr').select('td'):
    print(td.text.strip())
Prints:
xyz
address
phone number
Replace your code with this:
from bs4 import BeautifulSoup
html = '''<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>'''
soup = BeautifulSoup(html, 'lxml')
required = soup.find('a', text='xyz')
print(required.text)
td = required.parent
siblingsArray = td.find_next_siblings()
for sibling in siblingsArray:
    print(sibling.text)
Output:
xyz
address
phone number
Here, .parent gets the immediate parent tag, and find_next_siblings() returns a list of the tags that follow it at the same level.
You can also use XPath via Selenium's find_elements_by_xpath() (driver.find_elements(By.XPATH, ...) in Selenium 4):
https://www.softwaretestingmaterial.com/how-to-locate-element-by-xpath-locator/
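To get a feel for XPath itself without spinning up a browser, the standard library's xml.etree.ElementTree supports a small XPath subset on well-formed markup. A sketch using the question's row:

```python
import xml.etree.ElementTree as ET

# The question's row, as well-formed markup
row = ET.fromstring(
    '<tr>'
    '<td><a>xyz</a></td>'
    '<td>address</td>'
    '<td>phone number</td>'
    '</tr>'
)

# ".//td" is XPath for "every td descendant of this element"
texts = [''.join(td.itertext()).strip() for td in row.findall('.//td')]
print(texts)  # ['xyz', 'address', 'phone number']
```

Note that ElementTree requires well-formed XML; for real-world HTML you would still run the XPath through Selenium or lxml.html.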

Parse the DOM like Javascript using BeautifulSoup

I have a sample HTML in a variable html_doc like this :
html_doc = """<table class="sample">
<tbody>
<tr class="title"><td colspan="2">Info</td></tr>
<tr>
<td class="light">Time</td>
<td>01/01/1970, 00:00:00</td>
</tr>
<td class="highlight">URL</td>
<td>https://test.com</td>
</tr>
</tbody>
</table>"""
Using JavaScript it's pretty straightforward to parse the DOM. But how can I grab ONLY the URL (https://test.com) and Time (01/01/1970, 00:00:00) into 2 different variables from the <td> tags above, given that there is no class name associated with them?
My test.py file
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
test = soup.find_all("td")
print(test)
You already got all the td elements, so you can iterate through them:
for td in soup.find_all('td'):
    if td.text.startswith('http'):
        print(td, td.text)
        # <td>https://test.com</td> https://test.com
If you want, you can be a bit less explicit by searching for the td element with the "highlight" class and taking its next sibling, but this is more error prone in case the DOM changes:
for td in soup.find_all('td', {'class': 'highlight'}):
    print(td.find_next_sibling())
    # <td>https://test.com</td>
You can try using regular expressions to get the URL:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html_doc, 'html.parser')
test = soup.find_all("td")
for tag in test:
    urls = re.match(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', tag.text)
    time = re.match(r'[0-9/:, ]+', tag.text)
    if urls is not None:
        print(urls.group(0))
    if time is not None:
        print(time.group(0))
Output
01/01/1970, 00:00:00
https://test.com
This is a very specific solution. If you need a general approach, Hari Krishnan's solution with a few tweaks might be more suitable.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
tds = []
for td in soup.find_all('td', {'class': ['highlight', 'light']}):
    tds.append(td.find_next_sibling().string)
time, link = tds
Building on DeepSpace's answer:
import re
from bs4 import BeautifulSoup
html_doc = """<table class="sample">
<tbody>
<tr class="title"><td colspan="2">Info</td></tr>
<tr>
<td class="light">Time</td>
<td>01/01/1970, 00:00:00</td>
</tr>
<td class="highlight">URL</td>
<td>https://test.com</td>
</tr>
</tbody>
</table>"""
datepattern = re.compile(r"\d{2}/\d{2}/\d{4}, \d{2}:\d{2}:\d{2}")
soup = BeautifulSoup(html_doc, 'html.parser')
for td in soup.find_all('td'):
    if td.text.startswith('http'):
        link = td.text
    elif datepattern.search(td.text):
        time = td.text
print(link, time)

Unable to grab a text from soup

I have an HTML as follows:
<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5(+1.33%)</span></td>
</tr>
</table>
I tried to extract 191.1 from the line containing td class="stoksPrice">191.1</td>:
soup = BeautifulSoup(html)
res = soup.find_all('stoksPrice')
print (res)
But result is [].
How can I find it?
There seem to be two issues:
First, your usage of find_all is invalid. Currently you're searching for a tag named stoksPrice, which is wrong, as your tags are table, tr, td, div and span. You need to change that to:
>>> res = soup.find_all(class_='stoksPrice')
to search for tags with that class.
Second, your HTML is malformed. The line with stoksPrice is:
</td>
td class="stoksPrice">191.1</td>
it should have been:
</td>
<td class="stoksPrice">191.1</td>
(Note the < before the td.)
Not sure if that was a copy error into Stack Overflow or whether the HTML is malformed at the source, but in the latter case it is not going to be easy to parse...
Since there are multiple tags having the same class, you can use CSS selectors to get an exact match.
html = '''<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
<td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5(+1.33%)</span></td>
</tr>
</table>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('td[class="stoksPrice"]').text)
# 191.1
Or, you could use a lambda with find to get the same result.
print(soup.find(lambda t: t.name == 'td' and t['class'] == ['stoksPrice']).text)
# 191.1
Note: BeautifulSoup converts multi-valued class attributes into lists. So the classes of the two td tags look like ['stoksPrice', 'realTimChange'] and ['stoksPrice'].
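That list conversion is easy to check directly. A small sketch with just the two td tags in question:

```python
from bs4 import BeautifulSoup

html = ('<td class="stoksPrice realTimChange"></td>'
        '<td class="stoksPrice">191.1</td>')
soup = BeautifulSoup(html, 'html.parser')

# Multi-valued class attributes come back as lists of strings
classes = [td['class'] for td in soup.find_all('td')]
print(classes)  # [['stoksPrice', 'realTimChange'], ['stoksPrice']]

# An exact match therefore compares against the whole list
price = soup.find(lambda t: t.name == 'td' and t.get('class') == ['stoksPrice'])
print(price.text)  # 191.1
```

This is why a plain class_='stoksPrice' search matches both tags: it checks for membership in the list, not for equality with it.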
Here is one way to do it using findAll.
Because all the other stoksPrice tags are empty, the only one with parseable text is the one holding the price. You can add a check with a try/except clause to test whether the text is a floating point number: if it is not, iteration continues, and if it is, it is printed.
res = soup.findAll("td", {"class": "stoksPrice"})
for r in res:
    try:
        t = float(r.text)
        print(t)
    except ValueError:
        pass
191.1

Extract data from an HTML tag which has specific attributes

<tr bgcolor="#FFFFFF">
<td class="tablecontent" scope="row" rowspan="1">
ZURICH AMERICAN INSURANCE COMPANY
</td>
<td class="tablecontent" scope="row" rowspan="1">
FARMERS GROUP INC (14523)
</td>
<td class="tablecontent" scope="row">
znaf
</td>
<td class="tablecontent" scope="row">
anhm
</td>
</tr>
I have an HTML document which contains multiple tr tags. I want to extract the href link from the first td and the data from the third td onwards under every tr tag. How can this be achieved?
You can find all tr elements, iterate over them, then do the context-specific searches for the inner td elements and get the first and the third:
for tr in soup.find_all('tr'):
    cells = tr.find_all('td')
    if len(cells) < 3:
        continue  # safety pillow
    link = cells[0].a['href']  # assuming every first td has an "a" element
    data = cells[2].get_text()
    print(link, data)
As a side note, and depending on what you are trying to accomplish with the HTML parsing, I usually find pandas.read_html() a great and convenient way to parse HTML tables into dataframes, which are quite convenient data structures to work with afterwards.
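As an illustration of that pandas route (a sketch with made-up table contents; pandas needs an HTML parser backend such as lxml or html5lib installed):

```python
import pandas as pd
from io import StringIO

html = '''<table>
<tr><th>Academic Program</th><th>Academic Level</th></tr>
<tr><td>Accounting</td><td>Undergraduate</td></tr>
<tr><td>Accountancy</td><td>Graduate</td></tr>
</table>'''

# read_html returns a list of DataFrames, one per table found
df = pd.read_html(StringIO(html))[0]
print(df.shape)  # (2, 2)
```

For the href-extraction case specifically, newer pandas versions (1.5+) also accept an extract_links argument to read_html that pulls anchor hrefs out of table cells alongside the text.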
You can use the CSS selector nth-of-type to navigate to the tds.
Here's a sample:
soup = BeautifulSoup(html, 'html.parser')
a = soup.select('td:nth-of-type(1) a')[0]
href = a['href']
td = soup.select("td:nth-of-type(3)")[0]
text = td.get_text(strip=True)

Deleting all content between brackets from a string using python

I am using Beautiful Soup to grab data from an HTML page, and after grabbing the data I am left with this:
<tr>
<td class="main rank">1</td>
<td class="main company"><a href="/colleges/williams-college/">
<img alt="" src="http://i.forbesimg.com/media/lists/colleges/williams-college_50x50.jpg">
<h3>Williams College</h3></img></a></td>
<td class="main">Massachusetts</td>
<td class="main">$61,850</td>
<td class="main">2,124</td>
</tr>
This is the beautifulsoup command I am using to get this:
html = open('collegelist.html')
test = BeautifulSoup(html)
soup = test.find_all('tr')
I now want to manipulate this text so that it outputs
1
Williams College
Massachusetts
$61,850
2,124
and I am having difficulty doing so for the entire document, which has about 700 of these entries. Any advice would be appreciated.
Just get the .text (or use get_text()) for every tr in the loop:
soup = BeautifulSoup(open('collegelist.html'))
for tr in soup.find_all('tr'):
    print(tr.text)  # or tr.get_text()
For the HTML you've provided it prints:
1
Williams College
Massachusetts
$61,850
2,124
Or use get_text() on every row and join the results:
soup = BeautifulSoup(html)
"".join([x.get_text() for x in soup.find_all('tr')])
