Beautifulsoup Match Empty Class - python

I am scraping a table on a website where I am only trying to return any row where the class is blank (Row 1 and 4)
<tr class>Row 1</tr>
<tr class="is-oos ">Row 2</tr>
<tr class="somethingelse">Row 3</tr>
<tr class>Row 4</tr>
(Note there is a trailing space at the end of the is-oos class.
When I do soup.findAll('tr', class_=None) it matches all the rows. This is because Row 2 has the class ['is-oos', ''] due to the trailing space. Is there a simple way to do a soup.findAll() or soup.select() to match these rows?

Try class_="":
from bs4 import BeautifulSoup
html_doc = """<tr class>Row 1</tr>
<tr class="is-oos ">Row 2</tr>
<tr class="somethingelse">Row 3</tr>
<tr class>Row 4</tr>"""
soup = BeautifulSoup(html_doc, "html.parser")
print(*soup.find_all('tr', class_=""))
# Or to only get the text
print( '\n'.join(t.text for t in soup.find_all('tr', class_="")) )
Outputs:
<tr class="">Row 1</tr> <tr class="">Row 4</tr>
Row 1
Row 4
Edit To only get what's in stock, we can check the attributes of the tag:
import requests
from bs4 import BeautifulSoup
URL = "https://gun.deals/search/apachesolr_search/736676037018"
soup = BeautifulSoup(requests.get(URL).text, "html.parser")
for tag in soup.find_all('tr'):
if tag.attrs.get('class') == ['price-compare-table__oos-breaker', 'js-oos-breaker']:
break
print(tag.text.strip())

Related

Beautifulsoup get text from table grid under located words' grid

I wanted to extract information from this table to a csv file, but only the number of grade and age without the "grade:" and "age:" part:
<table>
<tbody>
<tr>
<td><b>Grade:</b></td>
<td>11</td>
</tr>
<tr>
<td><b>Age:</b></td>
<td>15</td>
</tr>
</tbody>
</table>
Most of the tutorials I've find only shows how to parse all the tables into a csv file rather than parsing the next line of located words:
import csv
from bs4 import BeautifulSoup as bs
with open("1.html") as fp:
soup = bs(fp, 'html.parser')
tables = soup.find_all('table')
filename = "input.csv"
csv_writer = csv.writer(open(filename, 'w'))
for tr in soup.find_all("tr"):
data = []
for th in tr.find_all("th"):
data.append(th.text)
if data:
csv_writer.writerow(data)
continue
for td in tr.find_all("td"):
if td.a:
data.append(td.a.text.strip())
else:
data.append(td.text.strip())
if data:
csv_writer.writerow(data)
How should I do it? Thanks!
You can use the find_next() method to search for a <td> following a <b>:
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select("table tr > td > b"):
print(tag.find_next("td").text)

Get contents of a particular row

I want to locate 'td' where text is 'xyz' so that I can find other attributes in the row. I only have 'xyz' with me and want to get other elements in that row.
.
.
.
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>
.
.
.
I can get 'xyz' easily by using
required = soup.find('a', text = 'xyz')
print(required[0].text)
but I'm not able to locate 'td' so that I can use find_next_siblings() to get other columns.
Expected output:
xyz
address
phone number
With bs4 4.7.1 combine pseudo classes of :has and :contains to retrieve the row and tds within.
This bit targets the right a tag if present by its text
a:contains("xyz")
You then retrieve the parent row (tr) having this a tag
tr:has(a:contains("xyz"))
And finally use a descendant combinator and td type selector to get all the tds within that row. Using a list comprehension to return the list.
from bs4 import BeautifulSoup as bs
html = '''
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>
'''
soup = bs(html, 'lxml')
items = [item.text.strip() for item in soup.select('tr:has(a:contains("xyz")) td')]
print(items)
If you have modern BeautifulSoup, you can use CSS selector :contains. Then traverse back with find_parent() method.
from bs4 import BeautifulSoup
s = '''
<tr>
<td>Other1</td>
<td>Other1</td>
<td>Other1</td>
</tr>
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>'''
soup = BeautifulSoup(s, 'lxml')
for td in soup.select_one('a:contains(xyz)').find_parent('tr').select('td'):
print(td.text.strip())
Prints:
xyz
address
phone number
Replace your code with this:
from bs4 import BeautifulSoup
html = '''<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>'''
soup = BeautifulSoup(html, 'lxml')
required = soup.find('a', text = 'xyz')
print(required.text)
td = required.parent
siblingsArray = td.find_next_siblings()
for siblings in siblingsArray:
print(siblings.text)
O/P:
xyz
address
phone number
Where parent is Get immediate parent tag and find_next_siblings return list of next siblings tag.
You can use xpath. find_elements_by_xpath().
https://www.softwaretestingmaterial.com/how-to-locate-element-by-xpath-locator/

Scraping with specific criteria when similar classes used in html source

I am trying to scrape the 8 instances of x between td tags on the following
<th class="first"> Temperature </th>
<td> x </td> # repeated for 8 lines
There are however numerous classes on page that are <th class="first"> The only unique identifier is the string that follows first, in this example Temperature.
Not sure what to add to the following code I am using to create some kind of criteria to scrape for <th class="first"> where Temperature (and other strings follow)
for tag in soup.find_all("th", {"class":"first"}):
temps.append(tag.text)
Is it a matter of additional code (re.compile?) or should I use something else entirely?
Edit: Html of interest below
<tbody>
<tr>
<th class="first">Temperature</th>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
Edit: current code
from bs4 import BeautifulSoup as bs
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r'c:\program files\firefox\geckodriver.exe')
driver.get("http://www.bom.gov.au/places/nsw/sydney/forecast/detailed/")
html = driver.page_source
soup = bs(html, "lxml")
dates = []
for tag in soup.find_all("a", {"class":"toggle"}):
dates.append(tag.text)
temps = [item.text for item in soup.select('th.first:contains(Temperature) ~ td')]
print(dates)
print(temps)
This is easy with bs4 4.7.1. as you can use the :contains pseudo class with ~ general sibling combinator
import requests
from bs4 import BeautifulSoup as bs
url = 'http://www.bom.gov.au/places/nsw/sydney/forecast/detailed'
r = requests.get(url)
soup = bs(r.content, 'lxml')
for table in soup.select('[summary*=Temperatures]'):
print(table['summary']) #day of reading
tds = [item.text for item in table.select('.first:contains("Air temperature (°C)") ~ td')] #readings
print(tds)
You can get the hours of the readings with:
print([item.text.strip() for item in table.select('tr:nth-of-type(1) th')][1:-1])
Nicely formatted tables add in pandas:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'http://www.bom.gov.au/places/nsw/sydney/forecast/detailed'
r = requests.get(url)
soup = bs(r.content, 'lxml')
for table in soup.select('[summary*=Temperatures]'):
print(table['summary'])
output = pd.read_html(str(table))[0]
print(output)
If I understand correctly, try this:
from bs4 import BeautifulSoup
import re
s = '''
<tr>
<th class="first">Temperature</th>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
'''
soup = BeautifulSoup(s, "lxml")
[td.text for td in soup.find('th', string=re.compile("Temperature")).find_next_siblings()]
and you get:
['x', 'x', 'x', 'x', 'x', 'x', 'x', 'x']

Parse the DOM like Javascript using BeautifulSoup

I have a sample HTML in a variable html_doc like this :
html_doc = """<table class="sample">
<tbody>
<tr class="title"><td colspan="2">Info</td></tr>
<tr>
<td class="light">Time</td>
<td>01/01/1970, 00:00:00</td>
</tr>
<td class="highlight">URL</td>
<td>https://test.com</td>
</tr>
</tbody>
</table>"""
Using Javascript its pretty straightforward if I want to parse the DOM. But if I want to grab ONLY the URL (https://test.com) and Time (01/01/1970, 00:00:00) in 2 different variables from the <td> tag above, how can I do it if there is no class name associated with it.
My test.py file
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
test = soup.find_all("td")
print(test)
You already got all td elements. You can iterate through all of them:
for td in soup.find_all('td'):
if td.text.startswith('http'):
print(td, td.text)
# <td>https://test.com</td> https://test.com
If you want, you can be a bit less explicit by searching for the td element with "highlight" class and find the next sibling, but this is more error prone in case the DOM will change:
for td in soup.find_all('td', {'class': 'highlight'}):
print(td.find_next_sibling())
# <td>https://test.com</td>
You can try using regular expression to get the url
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html_doc,'html.parser')
test = soup.find_all("td")
for tag in test:
urls = re.match('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', tag.text)
time = re.match('[0-9/:, ]+',tag.text)
if urls!= None:
print(urls.group(0))
if time!= None:
print(time.group(0))
Output
01/01/1970, 00:00:00
https://test.com
This is a very specific solution. If you need a general approach, Hari Krishnan's solution with a few tweaks might be more suitable.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
tds = []
for td in soup.find_all('td', {'class': ['highlight', 'light']}):
tds.append(td.find_next_sibling().string)
time, link = tds
With the reference of #DeepSpace
import bs4, re
from bs4 import BeautifulSoup
html_doc = """<table class="sample">
<tbody>
<tr class="title"><td colspan="2">Info</td></tr>
<tr>
<td class="light">Time</td>
<td>01/01/1970, 00:00:00</td>
</tr>
<td class="highlight">URL</td>
<td>https://test.com</td>
</tr>
</tbody>
</table>"""
datepattern = re.compile("\d{2}/\d{2}/\d{4}, \d{2}:\d{2}:\d{2}")
soup = BeautifulSoup(html_doc,'html.parser')
for td in soup.find_all('td'):
if td.text.startswith('http'):
link = td.text
elif datepattern.search(td.text):
time = td.text
print(link, time)

exclude tags with beautifulsoup

I am trying to get the contents of a html table with beautifulsoup.
when I get to the level of the cell I need to get only the values that are not between the strike parameter
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
so in the case above I would like to return only $0.41. I am using data.get_text() but I do not know how to filter out the $0.45
any ideas on how to do it?
All the solutions above will work. Adding one more method: extract()
From the documentation:
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
You can use it like this (added one more <td> tag to show how it can be used in a loop):
html = '''
<td>
<strike>
$0.45
</strike>
<br/>
$0.41
</td>
<td>
<strike>
$0.12
</strike>
<br/>
$0.14
</td>
'''
soup = BeautifulSoup(html, 'html.parser')
for td in soup.find_all('td'):
td.strike.extract()
print(td.text.strip())
Output:
$0.41
$0.14
You can look at all NavigableString children of the TD tag and ignore all other elements:
textData = ''.join(x for x in soup.find('td').children \
if isinstance(x, bs4.element.NavigableString)).strip()
#'$0.41'
You can do the same in several ways. Here is one such way:
from bs4 import BeautifulSoup
content="""
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
"""
soup = BeautifulSoup(content,"lxml")
item = soup.find("td").contents[-1].strip()
print(item)
Output:
$0.41
You can do this in the following way
from bs4 import BeautifulSoup
h = '''
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
'''
soup = BeautifulSoup(h, 'lxml')
a = soup.find('td').get_text()
print(a.split('\n')[2].strip())
Split it with Enter and delete both spaces.

Categories