Question about using BeautifulSoup to parse HTML - Python

I just got started learning to use BeautifulSoup in Python to parse HTML and have a very simple question. Somehow, I just couldn't get only Text 1 from the HTML below (stored in containers).
....
<div class="listA">
<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>
</div>
...
soup = BeautifulSoup(driver.page_source, 'html.parser')
containers = soup.findAll("div", {"class": "listA"})
datas = []
for data in containers:
    textspan = data.find("span")
    datas.append(textspan.text)
The output is as follows: Text1Text2Text3
Any advice on how to delimit them as well? Thanks and much appreciated!

If you just want Text 1, use this code:
import bs4
content = "<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>"
soup = bs4.BeautifulSoup(content, 'html.parser')
# soup('span') will give you
# [<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>, <span>Text 1</span>]
span_text = soup('span')
for e in span_text:
    if not e('span'):
        print(e.text)
Output:
Text 1
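Since you also asked how to delimit all three texts: get_text() accepts a separator argument, and stripped_strings yields each piece separately. A minimal sketch against the snippet from the question:
import bs4
content = "<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>"
soup = bs4.BeautifulSoup(content, 'html.parser')
outer = soup.find('span')
# Join each nested text with a separator instead of concatenating them
print(outer.get_text(separator=' | ', strip=True))  # Text 1 | Text 2 | Text 3
# Or collect them as a list of strings
print(list(outer.stripped_strings))  # ['Text 1', 'Text 2', 'Text 3']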

Another solution uses SimplifiedDoc, which does not depend on other third-party libraries and is lighter and faster, making it a good fit for beginners.
More examples can be found here.
from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>
'''
doc = SimplifiedDoc(html)
span = doc.span # Get the outermost span
first = span.span # Get the first span in span
print(first.text)
second = span.b
print(second.text)
third = second.next
print(third.text)
Result:
Text 1
Text 2
Text 3

Related

How to scrape the text within a div if there are multiple divs with the same class name

I am trying to get the number of chapters in this manga using BeautifulSoup, but the way it's contained is making it confusing:
[The Section](https://gyazo.com/c45fef82b0ce52dacd99d213538ab570)
I only want the Chapter number and not the content of the other divs. Currently I have (not the full code):
[The Website](https://www.anime-planet.com/manga/the-beginning-after-the-end)
chp = []
temp = soup.select('section.pure-g entryBar > div.pure-1 md-1-5')
for txt in temp:
    if "Ch" in txt.text:
        chp.append(txt.text)
How would I access the text within the first div?
Looking at the structure of the HTML, you can extract the text from the first <div> under class="entryBar":
import requests
from bs4 import BeautifulSoup
url = "https://www.anime-planet.com/manga/the-beginning-after-the-end"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
ch = soup.select_one(".entryBar > div").text.split()[-1].strip("+")
print(ch)
Prints:
159
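If you would rather keep your original approach of filtering on "Ch", note that compound classes in a CSS selector are joined with dots, not spaces. A sketch assuming the element's classes really are pure-g/entryBar and pure-1/md-1-5, as in your attempt:
chp = []
# "section.pure-g.entryBar" matches a section carrying both classes;
# "pure-g entryBar" (with a space) would be read as a descendant selector instead
for txt in soup.select("section.pure-g.entryBar > div.pure-1.md-1-5"):
    if "Ch" in txt.text:
        chp.append(txt.text.strip())
print(chp)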

Finding a specific span element with BeautifulSoup

I am trying to create a script to scrape price data from Udemy courses.
I'm struggling with navigating the HTML tree because the element I'm looking for is located inside multiple nested divs.
Here's the structure of the HTML element I'm trying to access:
What I tried:
response = requests.get(url)
print(response)
doc = BeautifulSoup(response.text, "html.parser")
parent_div = doc.find(class_="sidebar-container--purchase-section--17KRp")
print(parent_div.find_all("span"))
and even:
response = requests.get(url)
print(response)
doc = BeautifulSoup(response.text, "html.parser")
main = doc.find(class_="generic-purchase-section--main-cta-container--3xxeM")
title = main.select_one("div span span span")
Here’s the URL: https://www.udemy.com/course/the-complete-web-development-bootcamp/
I tried searching all the spans in the HTML, and the specific span I'm looking for doesn't appear, maybe because it's nested inside a div?
I would appreciate a little guidance!
The price is loaded by JavaScript, so it cannot be scraped with BeautifulSoup alone.
The data is loaded from an API endpoint which takes the course id of the course.
Course id of this course: 1565838
You can get the info directly from that endpoint like this:
import requests
course_id = '1565838'
url= f'https://www.udemy.com/api-2.0/course-landing-components/{course_id}/me/?components=price_text'
response = requests.get(url)
x = response.json()
print(x['price_text']['data']['pricing_result']['price'])
{'amount': 455.0, 'currency': 'INR', 'price_string': '₹455', 'currency_symbol': '₹'}
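The returned dictionary also carries a ready-made display string, so if you only need the formatted price you can read it off the same response:
pricing = x['price_text']['data']['pricing_result']['price']
print(pricing['price_string'])  # e.g. ₹455
print(pricing['amount'])        # e.g. 455.0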
I tried your first approach several times and it works more-or-less for me, although it has returned a different number of span elements on different attempts (10 is the usual number but I have seen as few as 1 on one occasion):
import requests
from bs4 import BeautifulSoup
url = 'https://www.udemy.com/course/the-complete-web-development-bootcamp/'
response = requests.get(url)
doc = BeautifulSoup(response.text, "html.parser")
parent_div = doc.find(class_="sidebar-container--purchase-section--17KRp")
spans = parent_div.find_all("span")
print(len(spans))
for span in spans:
    print(span)
Prints:
10
<span data-checked="checked" data-name="u872-accordion--3" data-type="radio" id="u872-accordion-panel--4" style="display:none"></span>
<span class="purchase-options--option-radio--1zjJ_ udlite-fake-toggle-input udlite-fake-toggle-radio udlite-fake-toggle-radio-small"></span>
<span class="udlite-accordion-panel-title">Subscribe</span>
<span>Try it free for 7 days</span>
<span class="udlite-text-xs purchase-section-container--cta-subscript--349MH">$29.99 per month after trial</span>
<span class="purchase-section-container--line--159eG"></span>
<span data-checked="" data-name="u872-accordion--3" data-type="radio" id="u872-accordion-panel--6" style="display:none"></span>
<span class="purchase-options--option-radio--1zjJ_ udlite-fake-toggle-input udlite-fake-toggle-radio udlite-fake-toggle-radio-small"></span>
<span class="udlite-accordion-panel-title">Buy Course</span>
<span class="money-back">30-Day Money-Back Guarantee</span>
As far as your second approach goes, your main div does not have that many nested span elements, so it is bound to fail. Try just one span element:
import requests
from bs4 import BeautifulSoup
url = 'https://www.udemy.com/course/the-complete-web-development-bootcamp/'
response = requests.get(url)
doc = BeautifulSoup(response.text, "html.parser")
main = doc.find(class_="generic-purchase-section--main-cta-container--3xxeM")
title = main.select_one("div span")
print(title)
Prints:
<span class="money-back">30-Day Money-Back Guarantee</span>

python nested Tags (beautiful Soup)

I used Beautiful Soup with Python to get data from a specific website, but I don't know how to get just one of these prices; I want the price per gram (g).
As shown below, this is the HTML code:
<div class="promoPrice margBottom7">16,000
L.L./200g<br/><span class="kiloPrice">79,999
L.L./Kg</span></div>
I use this code:
p_price = product.findAll("div", {"class": "promoPrice margBottom7"})[0].text
my result was:
16,000 L.L./200g 79,999 L.L./Kg
but i want to have:
16,000 L.L./200g
only
You will need to first decompose the span inside the div element:
from bs4 import BeautifulSoup
h = """
<div class="promoPrice margBottom7">16,000 L.L./200g<br/>
<span class="kiloPrice">79,999 L.L./Kg</span></div>
"""
soup = BeautifulSoup(h, "html.parser")
element = soup.find("div", {'class': 'promoPrice'})
element.span.decompose()
print(element.text)
#16,000 L.L./200g
Try using soup.select_one('div.promoPrice').contents[0]
from bs4 import BeautifulSoup
html = """<div class="promoPrice margBottom7">16,000 L.L./200g<br/>
<span class="kiloPrice">79,999 L.L./Kg</span></div>"""
soup = BeautifulSoup(html, features='html.parser')
# value = soup.select('div.promoPrice > span') # for 79,999 L.L./Kg
value = soup.select_one('div.promoPrice').contents[0]
print(value)
Prints
16,000 L.L./200g
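Another option, if you prefer not to index into contents, is to take the first stripped string from the div; this is just a small sketch on the same snippet:
from bs4 import BeautifulSoup
html = """<div class="promoPrice margBottom7">16,000 L.L./200g<br/>
<span class="kiloPrice">79,999 L.L./Kg</span></div>"""
soup = BeautifulSoup(html, 'html.parser')
# stripped_strings yields each text node with surrounding whitespace removed;
# the first one is the gram price, the second the kilo price
print(next(soup.select_one('div.promoPrice').stripped_strings))  # 16,000 L.L./200g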

Extract Paragraph Data without any tags

I have written a scraper that pulls data from a website, but unfortunately the data on the website is inconsistent: sometimes the paragraph is written using <p> tags and sometimes not (code snippet given below).
Is there any dynamic way of knowing that?
Part of the code that generates the error:
main_content = soup.findAll("div", {"class": "story-detail"})
content = ""
for div in main_content:
    links = div.findAll('p')
    for a in links:
        a = str(a).strip('<p>')
        a = str(a).strip('/>')
        a = str(a).strip('<')
        a = str(a).strip('<br>')
        content = content + a
You can get all the text via the text attribute. In that case you don't need to worry about the underlying structure.
Example:
>>> soup = Soup(first, 'html.parser')
>>> soup
<div class="story-detail">test</div>
>>> soup.find('div').text
'test'
>>> soup = Soup(second, 'html.parser')
>>> soup
<div class="story-detail">another <p>test</p></div>
>>> soup.find('div').text
'another test'
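Applied to your loop, a minimal sketch (assuming main_content is the findAll result from the question) would be:
main_content = soup.findAll("div", {"class": "story-detail"})
content = ""
for div in main_content:
    # get_text() flattens each div whether or not it contains <p> tags,
    # so no manual tag stripping is needed
    content = content + div.get_text(" ", strip=True)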
If what you are trying to accomplish is to strip all the <p> and </p> tags from your text, I would use regular expressions as follows:
import re

main_content = ["a div without p tags", "<p>a div with p tags</p>"]
for i in range(0, len(main_content)):
    main_content[i] = re.sub("<p>|</p>", "", main_content[i])

Python: How can I parse HTML and print specific output inside an HTML tag?

#!/usr/bin/env python
import requests, bs4
res = requests.get('https://betaunityapi.webrootcloudav.com/Docs/APIDoc/APIReference')
web_page = bs4.BeautifulSoup(res.text, "lxml")
for d in web_page.findAll("div", {"class": "actionColumnText"}):
    print(d)
Result:
<div class="actionColumnText">
/service/api/console/gsm/{gsmKey}/sites/{siteId}/endpoints/reactivate
</div>
<div class="actionColumnText">
Reactivates a list of endpoints, or all endpoints on a site. </div>
I am interested in seeing only the last line of output (Reactivates a list of endpoints, or all endpoints on a site), with the opening and closing div tags removed.
I am not interested in the line containing the href.
Any help is greatly appreciated.
In a simple case, you can just get the text:
for d in web_page.find_all("div", {"class": "actionColumnText"}):
    print(d.get_text())
And/or, if there is only a single element you want to find, you can get the last match by index:
d = web_page.find_all("div", {"class": "actionColumnText"})[-1]
print(d.get_text())
Or, you can also find div elements with a specific class which don't have an a child element:
def filter_divs(elm):
    return elm and elm.name == "div" and "actionColumnText" in elm.get("class", []) and elm.a is None

for d in web_page.find_all(filter_divs):
    print(d.get_text())
Or, in case of a single element:
web_page.find(filter_divs).get_text()
You can select the elements with a CSS selector and take the last one by index:
d = web_page.select("div.actionColumnText")[-1]
print(d.get_text())
If this text changes, you can use:
#!/usr/bin/env python
import requests, bs4
res = requests.get('https://betaunityapi.webrootcloudav.com/Docs/APIDoc/APIReference')
web_page = bs4.BeautifulSoup(res.text, "lxml")
# Take the last matching div and extract its text (a Tag has no split method)
yourText = web_page.findAll("div", {"class": "actionColumnText"})[-1]
yourText = yourText.get_text(strip=True)
