Parse the DOM like JavaScript using BeautifulSoup - Python

I have a sample HTML document in a variable html_doc like this:
html_doc = """<table class="sample">
<tbody>
<tr class="title"><td colspan="2">Info</td></tr>
<tr>
<td class="light">Time</td>
<td>01/01/1970, 00:00:00</td>
</tr>
<td class="highlight">URL</td>
<td>https://test.com</td>
</tr>
</tbody>
</table>"""
Using JavaScript it's pretty straightforward to parse the DOM. But if I want to grab ONLY the URL (https://test.com) and the Time (01/01/1970, 00:00:00) into 2 different variables from the <td> tags above, how can I do it when there is no class name associated with them?
My test.py file:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
test = soup.find_all("td")
print(test)

You already got all td elements. You can iterate through all of them:
for td in soup.find_all('td'):
    if td.text.startswith('http'):
        print(td, td.text)
        # <td>https://test.com</td> https://test.com
If you want, you can be a bit less explicit by searching for the td element with the "highlight" class and finding its next sibling, but this is more error-prone in case the DOM changes:
for td in soup.find_all('td', {'class': 'highlight'}):
    print(td.find_next_sibling())
    # <td>https://test.com</td>

You can try using regular expressions to get the URL:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html_doc, 'html.parser')
test = soup.find_all("td")
for tag in test:
    urls = re.match(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', tag.text)
    time = re.match(r'[0-9/:, ]+', tag.text)
    if urls is not None:
        print(urls.group(0))
    if time is not None:
        print(time.group(0))
Output:
01/01/1970, 00:00:00
https://test.com

This is a very specific solution. If you need a general approach, Hari Krishnan's solution with a few tweaks might be more suitable.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
tds = []
for td in soup.find_all('td', {'class': ['highlight', 'light']}):
    tds.append(td.find_next_sibling().string)
time, link = tds
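A small variant of the same idea (a sketch, not part of the original answer): key each value by its label cell so the unpacking does not depend on the order in which the rows appear.
labels = {td.text: td.find_next_sibling().string
          for td in soup.find_all('td', {'class': ['highlight', 'light']})}
time, link = labels['Time'], labels['URL']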

Building on @DeepSpace's answer:
import re
from bs4 import BeautifulSoup
html_doc = """<table class="sample">
<tbody>
<tr class="title"><td colspan="2">Info</td></tr>
<tr>
<td class="light">Time</td>
<td>01/01/1970, 00:00:00</td>
</tr>
<td class="highlight">URL</td>
<td>https://test.com</td>
</tr>
</tbody>
</table>"""
datepattern = re.compile(r"\d{2}/\d{2}/\d{4}, \d{2}:\d{2}:\d{2}")
soup = BeautifulSoup(html_doc,'html.parser')
for td in soup.find_all('td'):
    if td.text.startswith('http'):
        link = td.text
    elif datepattern.search(td.text):
        time = td.text
print(link, time)

Related

Beautifulsoup Match Empty Class

I am scraping a table on a website and am only trying to return the rows where the class is blank (Rows 1 and 4):
<tr class>Row 1</tr>
<tr class="is-oos ">Row 2</tr>
<tr class="somethingelse">Row 3</tr>
<tr class>Row 4</tr>
(Note there is a trailing space at the end of the is-oos class.)
When I do soup.findAll('tr', class_=None) it matches all the rows. This is because Row 2 has the class ['is-oos', ''] due to the trailing space. Is there a simple way to do a soup.findAll() or soup.select() to match these rows?
Try class_="":
from bs4 import BeautifulSoup
html_doc = """<tr class>Row 1</tr>
<tr class="is-oos ">Row 2</tr>
<tr class="somethingelse">Row 3</tr>
<tr class>Row 4</tr>"""
soup = BeautifulSoup(html_doc, "html.parser")
print(*soup.find_all('tr', class_=""))
# Or to only get the text
print( '\n'.join(t.text for t in soup.find_all('tr', class_="")) )
Outputs:
<tr class="">Row 1</tr> <tr class="">Row 4</tr>
Row 1
Row 4
Edit: To only get what's in stock, we can check the attributes of the tag:
import requests
from bs4 import BeautifulSoup
URL = "https://gun.deals/search/apachesolr_search/736676037018"
soup = BeautifulSoup(requests.get(URL).text, "html.parser")
for tag in soup.find_all('tr'):
    if tag.attrs.get('class') == ['price-compare-table__oos-breaker', 'js-oos-breaker']:
        break
    print(tag.text.strip())

Get contents of a particular row

I want to locate the td whose text is 'xyz' so that I can find the other elements in that row. I only have 'xyz' to work with.
.
.
.
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>
.
.
.
I can get 'xyz' easily by using
required = soup.find('a', text = 'xyz')
print(required[0].text)
but I'm not able to locate 'td' so that I can use find_next_siblings() to get other columns.
Expected output:
xyz
address
phone number
With bs4 4.7.1+ you can combine the :has and :contains pseudo-classes to retrieve the row and the tds within it.
This bit targets the right a tag, if present, by its text:
a:contains("xyz")
You then retrieve the parent row (tr) that has this a tag:
tr:has(a:contains("xyz"))
Finally, use a descendant combinator and a td type selector to get all the tds within that row, with a list comprehension to return them as a list.
from bs4 import BeautifulSoup as bs
html = '''
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>
'''
soup = bs(html, 'lxml')
items = [item.text.strip() for item in soup.select('tr:has(a:contains("xyz")) td')]
print(items)
If you have a modern BeautifulSoup, you can use the CSS selector :contains, then traverse back up with the find_parent() method.
from bs4 import BeautifulSoup
s = '''
<tr>
<td>Other1</td>
<td>Other1</td>
<td>Other1</td>
</tr>
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>'''
soup = BeautifulSoup(s, 'lxml')
for td in soup.select_one('a:contains(xyz)').find_parent('tr').select('td'):
    print(td.text.strip())
Prints:
xyz
address
phone number
Replace your code with this:
from bs4 import BeautifulSoup
html = '''<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>'''
soup = BeautifulSoup(html, 'lxml')
required = soup.find('a', text = 'xyz')
print(required.text)
td = required.parent
siblingsArray = td.find_next_siblings()
for siblings in siblingsArray:
    print(siblings.text)
Output:
xyz
address
phone number
Here .parent gets the immediate parent tag, and find_next_siblings() returns a list of the next sibling tags.
You can also use XPath, for example Selenium's find_elements_by_xpath().
https://www.softwaretestingmaterial.com/how-to-locate-element-by-xpath-locator/
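For illustration, a minimal Selenium sketch of that XPath approach (the URL is a placeholder and a configured driver is assumed; the XPath picks the td containing the xyz link and then its following sibling cells):
from selenium import webdriver

driver = webdriver.Firefox()  # assumes a working geckodriver setup
driver.get("https://example.com/page-with-the-row")  # placeholder URL
# the <td> that contains <a>xyz</a>, then every <td> that follows it in the row
cells = driver.find_elements_by_xpath("//td[a[text()='xyz']]/following-sibling::td")
for cell in cells:
    print(cell.text)  # address, phone number
driver.quit()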

Scraping with specific criteria when similar classes used in html source

I am trying to scrape the 8 instances of x between the td tags in the following:
<th class="first"> Temperature </th>
<td> x </td> # repeated for 8 lines
There are, however, numerous elements on the page with <th class="first">. The only unique identifier is the string that follows, in this example Temperature.
I'm not sure what to add to the following code to create some kind of criterion that scrapes <th class="first"> only where Temperature (and other strings) follows:
for tag in soup.find_all("th", {"class":"first"}):
temps.append(tag.text)
Is it a matter of additional code (re.compile?) or should I use something else entirely?
Edit: The HTML of interest is below:
<tbody>
<tr>
<th class="first">Temperature</th>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
Edit: current code
from bs4 import BeautifulSoup as bs
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r'c:\program files\firefox\geckodriver.exe')
driver.get("http://www.bom.gov.au/places/nsw/sydney/forecast/detailed/")
html = driver.page_source
soup = bs(html, "lxml")
dates = []
for tag in soup.find_all("a", {"class":"toggle"}):
dates.append(tag.text)
temps = [item.text for item in soup.select('th.first:contains(Temperature) ~ td')]
print(dates)
print(temps)
This is easy with bs4 4.7.1, as you can use the :contains pseudo-class with the ~ general sibling combinator:
import requests
from bs4 import BeautifulSoup as bs
url = 'http://www.bom.gov.au/places/nsw/sydney/forecast/detailed'
r = requests.get(url)
soup = bs(r.content, 'lxml')
for table in soup.select('[summary*=Temperatures]'):
    print(table['summary'])  # day of reading
    tds = [item.text for item in table.select('.first:contains("Air temperature (°C)") ~ td')]  # readings
    print(tds)
You can get the hours of the readings with:
print([item.text.strip() for item in table.select('tr:nth-of-type(1) th')][1:-1])
For nicely formatted tables, add in pandas:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'http://www.bom.gov.au/places/nsw/sydney/forecast/detailed'
r = requests.get(url)
soup = bs(r.content, 'lxml')
for table in soup.select('[summary*=Temperatures]'):
    print(table['summary'])
    output = pd.read_html(str(table))[0]
    print(output)
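If you only want the temperature row out of each parsed table, a hedged extra step inside that loop could filter the DataFrame (this assumes the first column of the parsed table carries the row labels):
    temp_row = output[output.iloc[:, 0].astype(str).str.contains("temperature", case=False, na=False)]
    print(temp_row)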
If I understand correctly, try this:
from bs4 import BeautifulSoup
import re
s = '''
<tr>
<th class="first">Temperature</th>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
'''
soup = BeautifulSoup(s, "lxml")
[td.text for td in soup.find('th', string=re.compile("Temperature")).find_next_siblings()]
and you get:
['x', 'x', 'x', 'x', 'x', 'x', 'x', 'x']

Exclude tags with BeautifulSoup

I am trying to get the contents of an HTML table with BeautifulSoup.
When I get down to the level of the cell, I need to get only the values that are not between the <strike> tags:
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
So in the case above I would like to return only $0.41. I am using data.get_text(), but I do not know how to filter out the $0.45.
Any ideas on how to do it?
All the other solutions here will work. Adding one more method: extract().
From the documentation:
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
You can use it like this (added one more <td> tag to show how it can be used in a loop):
from bs4 import BeautifulSoup

html = '''
<td>
<strike>
$0.45
</strike>
<br/>
$0.41
</td>
<td>
<strike>
$0.12
</strike>
<br/>
$0.14
</td>
'''
soup = BeautifulSoup(html, 'html.parser')
for td in soup.find_all('td'):
    td.strike.extract()
    print(td.text.strip())
Output:
$0.41
$0.14
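An aside, not part of the original answer: if you don't need the removed tag afterwards, decompose() deletes the <strike> subtree in the same way without returning it:
for td in soup.find_all('td'):
    td.strike.decompose()  # remove the <strike> subtree entirely
    print(td.text.strip())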
You can look at all NavigableString children of the TD tag and ignore all other elements:
textData = ''.join(x for x in soup.find('td').children
                   if isinstance(x, bs4.element.NavigableString)).strip()
# '$0.41'
You can do the same in several ways. Here is one such way:
from bs4 import BeautifulSoup
content="""
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
"""
soup = BeautifulSoup(content,"lxml")
item = soup.find("td").contents[-1].strip()
print(item)
Output:
$0.41
You can do this in the following way
from bs4 import BeautifulSoup
h = '''
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
'''
soup = BeautifulSoup(h, 'lxml')
a = soup.find('td').get_text()
print(a.split('\n')[2].strip())
Split on the newline characters and strip the surrounding whitespace.

Extracting a row from a table at a URL

I want to download the EPS values for all years (under Annual Trends) from the link below:
http://www.bseindia.com/stock-share-price/stockreach_financials.aspx?scripcode=500180&expandable=0
I tried using Beautiful Soup as mentioned in the answer below:
Extracting table contents from html with python and BeautifulSoup
But I couldn't proceed past the code below. I feel I am very close to my answer. Any help will be greatly appreciated.
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen("http://www.bseindia.com/stock-share-price/stockreach_financials.aspx?scripcode=500180&expandable=0").read()
soup=BeautifulSoup(html)
table = soup.find('table',{'id' :'acr'})
#the below code wasn't working as I expected it to be
tr = table.find('tr', text='EPS')
I am open to using any other language to get this done
The text is in the td, not the tr, so find the td using the text and then call .parent to get the tr:
In [12]: table = soup.find('table',{'id' :'acr'})
In [13]: tr = table.find('td', text='EPS').parent
In [14]: print(tr)
<tr><td class="TTRow_left" style="padding-left: 30px;">EPS</td><td class="TTRow_right">48.80</td>
<td class="TTRow_right">42.10</td>
<td class="TTRow_right">35.50</td>
<td class="TTRow_right">28.50</td>
<td class="TTRow_right">22.10</td>
</tr>
In [15]: [td.text for td in tr.select("td + td")]
Out[15]: [u'48.80', u'42.10', u'35.50', u'28.50', u'22.10']
Which you will see exactly matches what is on the page.
Another approach would be to call find_next_siblings:
In [17]: tds = table.find('td', text='EPS').find_next_siblings("td")
In [18]: tds
Out[18]:
[<td class="TTRow_right">48.80</td>,
<td class="TTRow_right">42.10</td>,
<td class="TTRow_right">35.50</td>,
<td class="TTRow_right">28.50</td>,
<td class="TTRow_right">22.10</td>]
In [20]: [td.text for td in tds]
Out[20]: [u'48.80', u'42.10', u'35.50', u'28.50', u'22.10']
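For completeness, a hedged pandas sketch of the same extraction (this assumes the table with id "acr" is present in the static HTML, which may not hold if the page builds it with JavaScript, and that the first column holds the row labels):
import pandas as pd

url = ("http://www.bseindia.com/stock-share-price/stockreach_financials.aspx"
       "?scripcode=500180&expandable=0")
df = pd.read_html(url, attrs={"id": "acr"})[0]  # read_html returns a list of DataFrames
eps_row = df[df.iloc[:, 0] == "EPS"]            # keep only the EPS row
print(eps_row)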
