I want to locate 'td' where text is 'xyz' so that I can find other attributes in the row. I only have 'xyz' with me and want to get other elements in that row.
.
.
.
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>
.
.
.
I can get 'xyz' easily by using
required = soup.find_all('a', text='xyz')
print(required[0].text)
but I'm not able to locate 'td' so that I can use find_next_siblings() to get other columns.
Expected output:
xyz
address
phone number
With bs4 4.7.1+ you can combine the :has and :contains pseudo-classes to retrieve the row and the tds within it.
This bit targets the right a tag, if present, by its text:
a:contains("xyz")
You then retrieve the parent row (tr) that has this a tag:
tr:has(a:contains("xyz"))
Finally, use a descendant combinator and a td type selector to get all the tds within that row, collecting them with a list comprehension.
from bs4 import BeautifulSoup as bs
html = '''
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>
'''
soup = bs(html, 'lxml')
items = [item.text.strip() for item in soup.select('tr:has(a:contains("xyz")) td')]
print(items)
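Note that newer soupsieve releases bundled with current bs4 deprecate :contains in favour of :-soup-contains; assuming you are on such a version (an assumption about your install), the equivalent query would be:
# Same query with the non-deprecated pseudo-class (assumes soupsieve >= 2.1).
items = [item.text.strip() for item in soup.select('tr:has(a:-soup-contains("xyz")) td')]
print(items)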
If you have a modern BeautifulSoup, you can use the CSS selector :contains and then traverse back up with the find_parent() method.
from bs4 import BeautifulSoup
s = '''
<tr>
<td>Other1</td>
<td>Other1</td>
<td>Other1</td>
</tr>
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>'''
soup = BeautifulSoup(s, 'lxml')
for td in soup.select_one('a:contains(xyz)').find_parent('tr').select('td'):
    print(td.text.strip())
Prints:
xyz
address
phone number
Replace your code with this:
from bs4 import BeautifulSoup
html = '''<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>'''
soup = BeautifulSoup(html, 'lxml')
required = soup.find('a', text = 'xyz')
print(required.text)
td = required.parent
siblingsArray = td.find_next_siblings()
for siblings in siblingsArray:
    print(siblings.text)
Output:
xyz
address
phone number
Here .parent gets the immediate parent tag of the matched <a> (the td), and find_next_siblings() returns a list of the sibling tags that follow it.
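If the <a> ever sits more than one level deep inside the cell, a slightly more defensive variation on the code above (a small sketch reusing the same soup and required variables) is to walk up with find_parent('td'):
# Climb to the enclosing <td> regardless of nesting depth, then read the row.
td_cell = required.find_parent('td')
print([required.text] + [sib.text for sib in td_cell.find_next_siblings('td')])
# ['xyz', 'address', 'phone number']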
You can also use XPath, e.g. Selenium's find_elements_by_xpath() (note that BeautifulSoup itself does not evaluate XPath expressions).
https://www.softwaretestingmaterial.com/how-to-locate-element-by-xpath-locator/
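For completeness, a minimal sketch of the XPath route using lxml instead of Selenium (assuming the markup from the question is in a variable named html):
from lxml import html as lxml_html

# Wrap the fragment in a <table> so the HTML parser keeps the row structure.
tree = lxml_html.fromstring('<table>' + html + '</table>')
# Find the row whose <a> text is exactly 'xyz', then read all its cells.
row = tree.xpath("//tr[td/a[text()='xyz']]")[0]
print([cell.text_content().strip() for cell in row.xpath("./td")])
# ['xyz', 'address', 'phone number']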
I am trying to pull the href links from the table, which I later need to click one by one to access the data inside each link. But I cannot figure out a way to do it. I have tried find_all and have been getting a "ResultSet object has no attribute '%s'" error.
HTML: (Really long so this is a 1/10)
<thead>
<tr class="sctablehead">
<th>Academic Program</th>
<th>Departments</th>
<th>Academic Level</th>
<th>College</th>
<th>Online</th>
<th>Degree Type</th>
</tr>
</thead>
<tbody>
<tr class="even firstrow"><td>Accountancy</td><td>Accounting</td><td>Graduate</td><td>BUS</td><td></td><td>MAC</td></tr>
<tr class="odd"><td>Accounting</td><td>Accounting</td><td>Undergraduate</td><td>BUS</td><td></td><td>BSB</td></tr>
<tr class="even"><td>Accounting</td><td>Accounting</td><td>Undergraduate</td><td>BUS</td><td></td><td>Minor</td></tr>
<tr class="odd"><td>Actuarial Science</td><td>Mathematics, Economics, Finance</td><td>Undergraduate</td><td>STEM</td><td></td><td>Minor</td></tr>
<tr class="even"><td>Adult Gerontology Acute Care Nurse Practitioner</td><td>Nursing</td><td>Graduate</td><td>HHS</td><td></td><td>PMC</td></tr>
<tr class="odd"><td>Advertising and Public Relations</td><td>Advertising</td><td>Undergraduate</td><td>BUS</td><td></td><td>BSB</td></tr>
<tr class="even"><td>Advertising Public Relations</td><td>Marketing</td><td>Undergraduate</td><td>BUS</td><td></td><td>Minor</td></tr>
<tr class="odd"><td>Aerospace Studies</td><td>Aerospace Studies</td><td>Undergraduate</td><td>HHS</td><td></td><td>Minor</td></tr>
<tr class="even"><td>Africana Studies</td><td>Africana Studies</td><td>Undergraduate</td><td>BCLASSE</td><td></td><td>Minor</td></tr>
... And so on
My Code:
r = requests.get(driver.current_url)
soup = bs(r.content, 'html.parser')
programs_table = soup.find_all('table', {"class":"sc_sctable tbl_degrees sorttable"})
for tr in programs_table.find_all('tr class'):
    for a in tr.find_all('a'):
        print(a['href'])
If your table is found properly (you have not provided the HTML for that part), then you only need:
r = requests.get(driver.current_url)
soup = bs(r.content, 'html.parser')
programs_table = soup.find('table', {"class": "sc_sctable tbl_degrees sorttable"})
for tr in programs_table.find_all('tr'):
    for a in tr.find_all('a'):
        print(a['href'])
In other words, try programs_table.find_all("tr") instead of programs_table.find_all("tr class").
With that change I get results like the following:
/undergraduate/colleges-programs/college-business-administration/school-accounting-finance/bsba-in-accounting/
/undergraduate/colleges-programs/college-business-administration/school-accounting-finance/accounting-minor/
/undergraduate/colleges-programs/college-science-technology-engineering-mathematics/department-mathematics-statistics/actuarial-science-minor/
/graduate/graduate-programs/post-masters-adult-gero-acute-care-nurse-pract-certificate-program/
/undergraduate/colleges-programs/college-business-administration/department-marketing/advertising-public-relations/
/undergraduate/colleges-programs/college-business-administration/department-marketing/advertising-public-relations-minor/
/undergraduate/colleges-programs/college-health-human-services/aerospace-studies-program/
First of all, you shouldn't be using find_all to scrape one tag unless you really want it to be in a list. So, to get the table you should just do:
programs_table = soup.find('table', class_="sc_sctable")
Now, to get the inner <a> tags with href links, you can scrape the <td> tags that have an inner <a> tag:
tags_with_href = programs_table.tbody.find_all('td')
links = [each_tag.a['href'] for each_tag in tags_with_href if each_tag.a]
# -> ['/graduate/graduate-programs/master-accountancy/', ... ]
If you want absolute URLs instead of relative ones, you can define a base_url and prepend it to each relative URL:
base_url = '<base_url_of_website>'
links = [base_url + link for link in links]
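A slightly more robust variant (a small sketch; the base URL here is just a placeholder) is to let urllib.parse.urljoin handle the joining, since it copes with leading slashes and already-absolute links:
from urllib.parse import urljoin

base_url = 'https://www.example.edu/'  # placeholder for the real site root
links = [urljoin(base_url, link) for link in links]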
I am scraping a table on a website where I am only trying to return any row where the class is blank (Row 1 and 4)
<tr class>Row 1</tr>
<tr class="is-oos ">Row 2</tr>
<tr class="somethingelse">Row 3</tr>
<tr class>Row 4</tr>
(Note there is a trailing space at the end of the is-oos class.)
When I do soup.findAll('tr', class_=None) it matches all the rows. This is because Row 2 has the class ['is-oos', ''] due to the trailing space. Is there a simple way to do a soup.findAll() or soup.select() to match these rows?
Try class_="":
from bs4 import BeautifulSoup
html_doc = """<tr class>Row 1</tr>
<tr class="is-oos ">Row 2</tr>
<tr class="somethingelse">Row 3</tr>
<tr class>Row 4</tr>"""
soup = BeautifulSoup(html_doc, "html.parser")
print(*soup.find_all('tr', class_=""))
# Or to only get the text
print( '\n'.join(t.text for t in soup.find_all('tr', class_="")) )
Outputs:
<tr class="">Row 1</tr> <tr class="">Row 4</tr>
Row 1
Row 4
Edit: to only get what's in stock, we can check the attributes of the tag:
import requests
from bs4 import BeautifulSoup
URL = "https://gun.deals/search/apachesolr_search/736676037018"
soup = BeautifulSoup(requests.get(URL).text, "html.parser")
for tag in soup.find_all('tr'):
    if tag.attrs.get('class') == ['price-compare-table__oos-breaker', 'js-oos-breaker']:
        break
    print(tag.text.strip())
I have a sample HTML in a variable html_doc like this :
html_doc = """<table class="sample">
<tbody>
<tr class="title"><td colspan="2">Info</td></tr>
<tr>
<td class="light">Time</td>
<td>01/01/1970, 00:00:00</td>
</tr>
<td class="highlight">URL</td>
<td>https://test.com</td>
</tr>
</tbody>
</table>"""
Using JavaScript it's pretty straightforward if I want to parse the DOM. But if I want to grab ONLY the URL (https://test.com) and the Time (01/01/1970, 00:00:00) in two different variables from the <td> tags above, how can I do it when there is no class name associated with them?
My test.py file
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
test = soup.find_all("td")
print(test)
You already got all td elements. You can iterate through all of them:
for td in soup.find_all('td'):
    if td.text.startswith('http'):
        print(td, td.text)
        # <td>https://test.com</td> https://test.com
If you want, you can be a bit less explicit by searching for the td element with the "highlight" class and finding its next sibling, but this is more error-prone if the DOM changes:
for td in soup.find_all('td', {'class': 'highlight'}):
    print(td.find_next_sibling())
    # <td>https://test.com</td>
You can try using a regular expression to get the URL:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html_doc,'html.parser')
test = soup.find_all("td")
for tag in test:
    urls = re.match(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', tag.text)
    time = re.match(r'[0-9/:, ]+', tag.text)
    if urls is not None:
        print(urls.group(0))
    if time is not None:
        print(time.group(0))
Output
01/01/1970, 00:00:00
https://test.com
This is a very specific solution. If you need a general approach, Hari Krishnan's solution with a few tweaks might be more suitable.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
tds = []
for td in soup.find_all('td', {'class': ['highlight', 'light']}):
    tds.append(td.find_next_sibling().string)
time, link = tds
Building on @DeepSpace's answer:
import re
from bs4 import BeautifulSoup
html_doc = """<table class="sample">
<tbody>
<tr class="title"><td colspan="2">Info</td></tr>
<tr>
<td class="light">Time</td>
<td>01/01/1970, 00:00:00</td>
</tr>
<td class="highlight">URL</td>
<td>https://test.com</td>
</tr>
</tbody>
</table>"""
datepattern = re.compile(r"\d{2}/\d{2}/\d{4}, \d{2}:\d{2}:\d{2}")
soup = BeautifulSoup(html_doc,'html.parser')
for td in soup.find_all('td'):
    if td.text.startswith('http'):
        link = td.text
    elif datepattern.search(td.text):
        time = td.text
print(link, time)
I have an HTML as follows:
<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5(+1.33%)</span></td>
</tr>
</table>
I tried to extract 191.1 from a line containing td class="stoksPrice">191.1</td>.
soup = BeautifulSoup(html)
res = soup.find_all('stoksPrice')
print (res)
But result is [].
How can I find it?
There seem to be two issues:
The first is that your usage of find_all is invalid: you're currently searching for a tag named stoksPrice, which is wrong, as your tags are table, tr, td, div and span. You need to change that to:
>>> res = soup.find_all(class_='stoksPrice')
to search for tags with that class.
The second is that your HTML is malformed. The line with stoksPrice is:
</td>
td class="stoksPrice">191.1</td>
it should have been:
</td>
<td class="stoksPrice">191.1</td>
(Note the < before the td.)
Not sure if that was a copy error when pasting into Stack Overflow or whether the HTML is originally malformed, but markup like that is not going to be easy to parse ...
Since there are multiple tags having the same class, you can use CSS selectors to get an exact match.
from bs4 import BeautifulSoup as bs

html = '''<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
<td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5(+1.33%)</span></td>
</tr>
</table>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('td[class="stoksPrice"]').text)
# 191.1
Or, you could use lambda and find to get the same.
print(soup.find(lambda t: t.name == 'td' and t['class'] == ['stoksPrice']).text)
# 191.1
Note: BeautifulSoup converts multi-valued class attributes into lists. So, the classes of the two td tags look like ['stoksPrice'] and ['stoksPrice', 'realTimChange'].
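A quick standalone way to see that behaviour (a tiny sketch; html.parser is used here so the lone td is kept as-is):
from bs4 import BeautifulSoup

demo = BeautifulSoup('<td class="stoksPrice realTimChange">191.1</td>', 'html.parser')
print(demo.td['class'])                    # ['stoksPrice', 'realTimChange']
print(demo.td['class'] == ['stoksPrice'])  # False, so the exact-match check above works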
Here is one way to do it using findAll.
Because the other stoksPrice cell is empty, the only one that remains is the one with the price.
You can put in a check using a try/except clause to test whether the text is a floating-point number.
If it is not, the loop keeps iterating; if it is, it prints the value.
res = soup.findAll("td", {"class": "stoksPrice"})
for r in res:
    try:
        t = float(r.text)
        print(t)
    except ValueError:
        pass
191.1
I am trying to get the contents of an HTML table with BeautifulSoup.
When I get to the level of the cell, I need to get only the values that are not inside the strike tag:
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
so in the case above I would like to return only $0.41. I am using data.get_text(), but I do not know how to filter out the $0.45.
Any ideas on how to do it?
The other answers here will work as well; adding one more method: extract().
From the documentation:
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
You can use it like this (added one more <td> tag to show how it can be used in a loop):
html = '''
<td>
<strike>
$0.45
</strike>
<br/>
$0.41
</td>
<td>
<strike>
$0.12
</strike>
<br/>
$0.14
</td>
'''
soup = BeautifulSoup(html, 'html.parser')
for td in soup.find_all('td'):
    td.strike.extract()
    print(td.text.strip())
Output:
$0.41
$0.14
You can look at all the NavigableString children of the td tag and ignore all other elements:
import bs4

textData = ''.join(x for x in soup.find('td').children
                   if isinstance(x, bs4.element.NavigableString)).strip()
# '$0.41'
You can do the same in several ways. Here is one such way:
from bs4 import BeautifulSoup
content="""
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
"""
soup = BeautifulSoup(content,"lxml")
item = soup.find("td").contents[-1].strip()
print(item)
Output:
$0.41
You can do this in the following way
from bs4 import BeautifulSoup
h = '''
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
'''
soup = BeautifulSoup(h, 'lxml')
a = soup.find('td').get_text()
print(a.split('\n')[2].strip())
This splits the text on the newline characters and strips the whitespace from both ends.