How to search for string with beautiful soup - python

I'm trying to parse the below URL. I would like to get the output of all the prices on this site. The first item would be £59.
I inspected the element and found out that the html looks as below. I believe the best way would be to search for a class 'sr_gs_rackrate_total' or alternatively for a title that starts with "Price for".
How can I do this in Beautiful Soup?
<strong class="price scarcity_color sr_gs_rackrate_price
anim_rack_rate
" title="Price for 1 night £59">
<b>
<span class="sr_gs_rackrate_total">Total: </span>
£59
</b>
</strong>
http://www.booking.com/searchresults.en-gb.html?label=gen173nr-17CAEoggJCAlhYSDNiBW5vcmVmaFCIAQGYAS64AQTIAQTYAQHoAQH4AQs&sid=1a43e0952558ac0ad0061d5b6523a7bc&dcid=1&checkin_monthday=23;checkin_year_month=2016-1;checkout_monthday=24;checkout_year_month=2016-1;&city=-2601889&class_interval=1&csflt=%7B%7D&dtdisc=0&group_adults=7&group_children=0&highlighted_hotels=1192837&hlrd=0&hp_sbox=1&hyb_red=0&inac=0&label_click=undef&nflt=ht_id%3D201%3B&nha_red=0&no_rooms=1&redirected_from_city=0&redirected_from_landmark=0&redirected_from_region=0&review_score_group=empty&room1=A%2CA%2CA%2CA%2CA%2CA%2CA&sb_price_type=total&score_min=0&si=ai%2Cco%2Cci%2Cre%2Cdi&ss=London&ss_all=0&ssafas=1&ssb=empty&sshis=0&ssne=London&ssne_untouched=London&order=price_for_two

Here is one way to do that:
from bs4 import BeautifulSoup
soup = BeautifulSoup(yourhtml, 'html.parser')
span = soup.find('span', {'class': 'sr_gs_rackrate_total'})
b = span.parent
b.span.extract()
b.text
In case there is more than one span with a price in it, use
for span in soup.find_all('span', {'class': 'sr_gs_rackrate_total'}):
    b = span.parent
    # remove the "Total: " label so only the price text remains
    b.span.extract()
    print(b.text.strip())
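The question also mentions matching on a title that starts with "Price for". A minimal sketch of that alternative, assuming the markup is exactly the snippet shown in the question; taking the last word of the title as the price is an assumption about how the title is formatted:

```python
from bs4 import BeautifulSoup

# Markup copied from the question; a real page would contain many such blocks.
html = '''<strong class="price scarcity_color sr_gs_rackrate_price
anim_rack_rate
" title="Price for 1 night £59">
<b>
<span class="sr_gs_rackrate_total">Total: </span>
£59
</b>
</strong>'''

soup = BeautifulSoup(html, "html.parser")

prices = []
# [title^="Price for"] selects elements whose title attribute starts with that text.
for strong in soup.select('strong[title^="Price for"]'):
    # Assumption: the price is the last whitespace-separated word of the title.
    prices.append(strong["title"].split()[-1])

print(prices)
```

This reads the price from the attribute, so it does not need to modify the tree with extract().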

Related

How to scrape last string of <p> tag element?

To start, Python is the first language I am learning.
I am scraping a website for rent prices across my city and I am using BeautifulSoup to get the price data, but I am unable to get the value of this tag.
Here is the tag:
<p><strong class="hidden show-mobile-inline">Monthly Rent: </strong>2,450 +</p>
Here is my code:
text = soup.find_all("div", {"class", "plan-group rent"})
for item in text:
    rent = item.find_all("p")
    for price in rent:
        print(price.string)
I also tried:
text = soup.find_all("div", {"class", "plan-group rent"})
for item in text:
    rent = item.find_all("p")
    for price in rent:
        items = price.find_all("strong")
        for item in items:
            print(item.string)
and that works to print out "Monthly Rent:" but I don't understand why I can't get the actual price. The above code shows me that the monthly rent is in the strong tag, which means that the p tag only contains the price which is what I want.
As mentioned by @kyrony, there are two children in your <p>. Because you select only the <strong>, you get just one of the two texts.
You could use different approaches, for example stripped_strings:
list(soup.p.stripped_strings)[-1]
or contents
soup.p.contents[-1]
or with recursive argument
soup.p.find(text=True, recursive=False)
Example
from bs4 import BeautifulSoup
html = '''<p><strong class="hidden show-mobile-inline">Monthly Rent: </strong>2,450 +</p>'''
soup = BeautifulSoup(html, 'html.parser')
soup.p.contents[-1]
Technically your content has two children
<p><strong class="hidden show-mobile-inline">Monthly Rent: </strong>2,450 +</p>
A strong tag
<strong class="hidden show-mobile-inline">Monthly Rent: </strong>
and a string
2,450 +
The .string attribute in Beautiful Soup only returns a value when a tag has exactly one child; your <p> has two, so it returns None. To get the second string, use the stripped_strings generator.
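The difference between .string and stripped_strings can be checked directly on the tag from the question. A small sketch; the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

html = '<p><strong class="hidden show-mobile-inline">Monthly Rent: </strong>2,450 +</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# <p> has two children (a <strong> tag and a text node), so .string is None.
whole = p.string
# <strong> has a single text child, so .string returns that text.
label = p.strong.string
# stripped_strings yields every text fragment under <p>, with whitespace trimmed.
parts = list(p.stripped_strings)
rent = parts[-1]

print(whole)
print(parts)
print(rent)
```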

BeautifulSoup - extracting text from multiple span elements w/o classes

So that's how HTML looks:
<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>
I need to extract detail2 & detail3.
But with this piece of code I only get detail1.
info = data.find("p", class_ = "details").span.text
How do I extract the needed items?
Thanks in advance!
Select your elements more specifically; in your case, all <span> siblings that follow the <span> with class number:
soup.select('span.number ~ span')
Example
from bs4 import BeautifulSoup
html='''<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>'''
soup = BeautifulSoup(html, 'html.parser')
[t.text for t in soup.select('span.number ~ span')]
Output
['detail2', 'detail3']
You can find all <span>s and do normal indexing:
from bs4 import BeautifulSoup
html_doc = """\
<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")
spans = soup.find("p", class_="details").find_all("span")
for s in spans[-2:]:
    print(s.text)
Prints:
detail2
detail3
Or CSS selectors:
spans = soup.select(".details span:nth-last-of-type(-n+2)")
for s in spans:
    print(s.text)
Prints:
detail2
detail3
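The sibling relationship can also be expressed without CSS, using find_next_siblings() on the span with class number. This is a sketch under the assumption that the wanted spans always come after that span, as in the sample HTML:

```python
from bs4 import BeautifulSoup

html_doc = """\
<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Locate the marker span, then collect every <span> sibling that follows it.
number = soup.find("span", class_="number")
details = [s.get_text() for s in number.find_next_siblings("span")]

print(details)
```

Unlike slicing with spans[-2:], this does not depend on how many detail spans follow the number.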

Get values from CSS span element with constantly changing values

I am trying to scrape a website that seems to use different values each time a particular span element appears. For example, the first few times the span element appears, it could be:
<span title="PM XX">PM XX</span>
<span title="Star Charterist">Star Charterist</span>
<span title="Elephant Trainer">Elephant Trainer</span>
I have tried the following, but I keep getting empty lists:
site = BeautifulSoup(link.text, "html.parser")
jobs_a = site.find_all("span title")
or
jobs_a = site.find_all("span", attrs="title")
or
jobs_a = site.find_all("span", attrs="title*")
Any suggestions?
I prefer using a CSS selector.
from bs4 import BeautifulSoup
data = '''\
<span title="PM XX">PM XX</span>
<span title="Star Charterist">Star Charterist</span>
<span title="Elephant Trainer">Elephant Trainer</span>
'''
soup = BeautifulSoup(data, 'html.parser')
for s in soup.select('span[title]'):
    print(f"{s.text=}\t{s.attrs['title']=}")
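For comparison with the find_all attempts in the question, the same match can be written by passing title=True, which matches any value of that attribute. A minimal sketch; the third span without a title is an added assumption, included only to show that it is skipped:

```python
from bs4 import BeautifulSoup

data = '''\
<span title="PM XX">PM XX</span>
<span title="Star Charterist">Star Charterist</span>
<span>no title here</span>
'''

soup = BeautifulSoup(data, "html.parser")

# title=True keeps only <span> tags that carry a title attribute, whatever its value.
titles = [s["title"] for s in soup.find_all("span", title=True)]

print(titles)
```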

How to extract exact information by span class using Beautiful Soup

This is my code and output for my price monitoring code:
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="result_0_name").get_text()
price = soup.find("span", class_ = "normal_price")
#converted_price = price[0:3]
print(price.get_text())
print(title.strip())
The output is as follows:
Starting at:
$0.70 USD
$0.67 USD
Operation Broken Fang Case
and the HTML of the page looks like this:
<span class="market_table_value normal_price">Starting at:<br/>
<span class="normal_price" data-currency="1" data-price="69">$0.69 USD</span>
<span class="sale_price">$0.66 USD</span>
</span>
As you can see, there is no ID, so I cannot use that. I only wish to display the 'normal_price' and not the other data in that span. Any ideas?
Just make the selection of the span more specific. For example, use the fact that it is a span nested directly inside another span:
soup.select_one('span > span.normal_price').get_text()
Example
from bs4 import BeautifulSoup
html='''
<span class="market_table_value normal_price">
Starting at:<br/>
<span class="normal_price" data-currency="1" data-price="69">$0.69 USD</span>
<span class="sale_price">$0.66 USD</span>
</span>
'''
soup = BeautifulSoup(html,'html.parser')
price = soup.select_one('span > span.normal_price').get_text()
price
Output
$0.69 USD
You can also try it like this:
from bs4 import BeautifulSoup
html ="""<span class="market_table_value normal_price">
Starting at:<br/>
<span class="normal_price" data-currency="1" data-price="69">$0.69 USD</span>
<span class="sale_price">$0.66 USD</span>
</span>"""
soup = BeautifulSoup(html, 'html.parser')
# match on the exact attribute value
price = soup.select_one('span[data-currency="1"]').get_text()
# or match on an attribute value that starts with "1"
price = soup.select_one('span[data-currency^="1"]').get_text()
print(price)  # $0.69 USD
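For price monitoring, a number is often more useful than the display text. The inner span also carries a data-price attribute, so a sketch like the following could read it directly; treating that value as cents is an assumption inferred from the sample (69 next to $0.69), not something the page documents:

```python
from bs4 import BeautifulSoup

html = """<span class="market_table_value normal_price">
Starting at:<br/>
<span class="normal_price" data-currency="1" data-price="69">$0.69 USD</span>
<span class="sale_price">$0.66 USD</span>
</span>"""

soup = BeautifulSoup(html, "html.parser")

# Only the inner span has data-price, so requiring the attribute skips the outer span.
tag = soup.select_one("span.normal_price[data-price]")
display_price = tag.get_text()

# Assumption: data-price holds the amount in cents.
price = int(tag["data-price"]) / 100

print(display_price, price)
```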

Parse HTML page to get contents of <p> and <b> tags

There are lots of HTML pages which are structured as a sequence of such groups:
<p>
<b> Keywords/Category:</b>
"keyword_a, keyword_b"
</p>
The addresses of these pages are like https://some.page.org/year/0001, https://some.page.org/year/0002, etc.
How can I extract the keywords separately from each of such pages? I've tried to use BeautifulSoup, but unsuccessfully. I've only written the program that prints titles of groups (between <b> and </b>).
from bs4 import BeautifulSoup
from urllib.request import urlopen
html_doc = urlopen('https://some.page.org/2018/1234').read()
soup = BeautifulSoup(html_doc, 'html.parser')
# print the absolute URL of every link on the page
for link in soup.find_all('a'):
    print('https://some.page.org' + link.get('href'))
# print the text of every <b> element (the group titles)
for node in soup.findAll('b'):
    print(''.join(node.findAll(text=True)))
I can't test this without knowing the actual source code format, but it seems you want the text value of the <p> tags:
for node in soup.findAll('p'):
    print(node.text)
    # or: keywords = node.text.split(', ')
    # print(keywords)
You need to split your string, which in this case is the URL, on /.
Then you can choose the chunks you want.
For example, if the URL is https://some.page.org/year/0001, use the split function to split it on the / sign.
This converts it to a list; pick the parts you need and convert them back to a string with the ''.join() method.
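The splitting described above can be sketched as follows. The URL is the placeholder from the question, and the variable names are illustrative:

```python
url = "https://some.page.org/year/0001"

# Splitting on "/" gives: ['https:', '', 'some.page.org', 'year', '0001']
parts = url.split("/")

# The last two chunks are the year segment and the page number.
year, page = parts[-2], parts[-1]

# Joining the first three chunks with "/" rebuilds the site root.
base = "/".join(parts[:3])

print(parts)
print(year, page)
print(base)
```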
There are different ways to parse the desired categories and keywords out of this kind of HTML structure, but here is one of the BeautifulSoup ways to do it:
find b elements whose text ends with a :
use .next_sibling to get to the following text node, which contains the keywords
Working example:
from bs4 import BeautifulSoup
data = """
<div>
<p>
<b> Category 1:</b>
"keyword_a, keyword_b"
</p>
<p>
<b> Category 2:</b>
"keyword_c, keyword_d"
</p>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for category in soup('b', text=lambda text: text and text.endswith(":")):
    keywords = category.next_sibling.strip('" \n').split(", ")
    print(category.get_text(strip=True), keywords)
Prints:
Category 1: ['keyword_a', 'keyword_b']
Category 2: ['keyword_c', 'keyword_d']
Assuming for each block
<p>
<b> Keywords/Category:</b>
"keyword_a, keyword_b"
</p>
you want to extract keyword_a and keyword_b for each Keywords/Category. So an example would be:
<p>
<b>Mammals</b>
"elephant, rhino"
</p>
<p>
<b>Birds</b>
"hummingbird, ostrich"
</p>
Once you have the HTML code, you can do:
from bs4 import BeautifulSoup
html = '''<p>
<b>Mammals</b>
"elephant, rhino"
</p>
<p>
<b>Birds</b>
"hummingbird, ostrich"
</p>'''
soup = BeautifulSoup(html, 'html.parser')
p_elements = soup.find_all('p')
for p_element in p_elements:
    # take the <b> inside this <p> and remove it, leaving only the keywords text
    b_element = p_element.find('b')
    b_element.extract()
    category = b_element.text.strip()
    keywords = p_element.text.strip()
    # drop the surrounding double quotes, then split on the comma
    keyword_a, keyword_b = keywords[1:-1].split(', ')
    print('Category:', category)
    print('Keyword A:', keyword_a)
    print('Keyword B:', keyword_b)
Which prints:
Category: Mammals
Keyword A: elephant
Keyword B: rhino
Category: Birds
Keyword A: hummingbird
Keyword B: ostrich
