BeautifulSoup parsing issues some div not showing - python

I'm trying to parse this page: https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/
The problem is, in this element: https://gyazo.com/e544be64a41a121bdb0c0f71aef50692 ,
I want the div that contains the price. If you inspect the page, you can see the html code for this part, shows like this:
<div class="price">
<div class="price">
"thePrice"
<sup>93</sup>
</div>
</div>
However, whether I use page_soup = soup(my_html_page, 'html.parser'), page_soup = soup(my_html_page, 'lxml'), or page_soup = soup(my_html_page, 'html5lib'), I only get this for that part:
<div class="price"></div>
And that's it. I've been searching the internet for hours to figure out why that inner div doesn't get parsed.
I tried three different parsers, and none seems to get past the fact that the inner child shares the same class name as its parent, if that is the issue.

Hope this helps you.
from bs4 import BeautifulSoup
import requests
url = 'https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/'
html = BeautifulSoup(requests.get(url).content, 'html.parser')
prices = html.find_all("div", {"class": "price"})
for price in prices:
    print(price.text)
Output:
561€95
169€94
165€95
1 165€94
7 599€95
267€95
259€94
599€95
511€94
1 042€94
2 572€94
783€95
2 479€94
2 699€95
499€94
386€95
169€94
2 343€95
783€95
499€94
499€94
259€94
267€95
165€95
169€94
2 399€95
561€95
2 699€95
2 699€95
6 059€95
7 589€95
10 991€95
9 619€94
2 479€94
3 135€95
7 589€95
511€94
1 042€94
386€95
599€95
1 165€94
2 572€94
783€95
2 479€94
2 699€95
499€94
169€94
2 343€95
2 699€95
3 135€95
6 816€95
7 589€95
561€95
267€95
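The values printed above use the euro sign in place of the decimal separator and a space as the thousands separator. If you need numbers rather than strings, something like the following could work. This is only a sketch: parse_price is a hypothetical helper name, and it assumes every value follows the "euros€cents" pattern seen in the output.

```python
import re
from decimal import Decimal

def parse_price(text):
    """Convert a string such as '1 165€94' into Decimal('1165.94')."""
    # Drop every kind of whitespace, including non-breaking spaces.
    compact = re.sub(r"\s+", "", text)
    # Split on the euro sign: the left part is euros, the right part is cents.
    euros, _, cents = compact.partition("€")
    return Decimal(f"{euros}.{cents or '00'}")

samples = ["561€95", "1 165€94", "10 991€95"]
parsed = [parse_price(s) for s in samples]
print(parsed)
```

Decimal is used instead of float so that the cents are stored exactly.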

To scrape all prices where class="price", see this example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# Select all the 'price' classes
for tag in soup.select('div.price'):
    print(tag.text)
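To check offline that a nested div with the same class as its parent is not a problem for the parser, you can feed BeautifulSoup a small snippet. The markup below is an assumption modeled on the structure described in the question, not a copy of the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question's snippet.
html = """
<div class="price">
  <div class="price">561€<sup>95</sup></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Both the outer and the inner div match, so a plain search returns two tags.
all_matches = soup.select("div.price")

# The child combinator keeps only a div.price directly inside another div.price.
inner = soup.select_one("div.price > div.price")
euros = inner.find(string=True, recursive=False).strip()  # text before <sup>
cents = inner.sup.get_text()

print(len(all_matches), euros, cents)
```

If the inner div is present in the downloaded HTML, it will be found this way. If it is still missing, the server probably returned different markup than the browser shows.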

Related

Finding a specific span element with BeautifulSoup

I am trying to create a script to scrape price data from Udemy courses.
I'm struggling with navigating the HTML tree because the element I'm looking for is located inside multiple nested divs.
Here's the structure of the HTML element I'm trying to access:
What I tried:
response = requests.get(url)
print(response)
doc = BeautifulSoup(response.text, "html.parser")
parent_div = doc.find(class_="sidebar-container--purchase-section--17KRp")
print(parent_div.find_all("span"))
and even:
response = requests.get(url)
print(response)
doc = BeautifulSoup(response.text, "html.parser")
main = doc.find(class_="generic-purchase-section--main-cta-container--3xxeM")
title = main.select_one("div span span span")
Here’s the URL: https://www.udemy.com/course/the-complete-web-development-bootcamp/
I tried searching all the spans in the HTML, and the specific span I'm looking for doesn't appear. Maybe that's because it's nested inside a div?
I would appreciate a little guidance!
The price is loaded by JavaScript, so it is not in the HTML returned by requests and cannot be scraped with BeautifulSoup alone.
The data is loaded from an API endpoint that takes the course ID.
Course ID of this course: 1565838
You can get the info directly from that endpoint like this:
import requests
course_id = '1565838'
url = f'https://www.udemy.com/api-2.0/course-landing-components/{course_id}/me/?components=price_text'
response = requests.get(url)
x = response.json()
print(x['price_text']['data']['pricing_result']['price'])
{'amount': 455.0, 'currency': 'INR', 'price_string': '₹455', 'currency_symbol': '₹'}
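Since the JSON layout of that endpoint could change, or a key could be missing for some courses, it may be safer to walk the nested keys defensively instead of chaining subscripts. A minimal sketch follows. The sample payload is rebuilt from the output printed above, and extract_price is a hypothetical helper name.

```python
# Sample payload rebuilt from the printed output; the real response has more keys.
sample = {
    "price_text": {
        "data": {
            "pricing_result": {
                "price": {
                    "amount": 455.0,
                    "currency": "INR",
                    "price_string": "₹455",
                    "currency_symbol": "₹",
                }
            }
        }
    }
}

def extract_price(payload):
    """Follow the nested keys and return None if any level is missing."""
    node = payload
    for key in ("price_text", "data", "pricing_result", "price"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

price = extract_price(sample)
missing = extract_price({"price_text": {}})
print(price["amount"], price["currency"])
print(missing)
```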
I tried your first approach several times and it works more-or-less for me, although it has returned a different number of span elements on different attempts (10 is the usual number but I have seen as few as 1 on one occasion):
import requests
from bs4 import BeautifulSoup
url = 'https://www.udemy.com/course/the-complete-web-development-bootcamp/'
response = requests.get(url)
doc = BeautifulSoup(response.text, "html.parser")
parent_div = doc.find(class_="sidebar-container--purchase-section--17KRp")
spans = parent_div.find_all("span")
print(len(spans))
for span in spans:
    print(span)
Prints:
10
<span data-checked="checked" data-name="u872-accordion--3" data-type="radio" id="u872-accordion-panel--4" style="display:none"></span>
<span class="purchase-options--option-radio--1zjJ_ udlite-fake-toggle-input udlite-fake-toggle-radio udlite-fake-toggle-radio-small"></span>
<span class="udlite-accordion-panel-title">Subscribe</span>
<span>Try it free for 7 days</span>
<span class="udlite-text-xs purchase-section-container--cta-subscript--349MH">$29.99 per month after trial</span>
<span class="purchase-section-container--line--159eG"></span>
<span data-checked="" data-name="u872-accordion--3" data-type="radio" id="u872-accordion-panel--6" style="display:none"></span>
<span class="purchase-options--option-radio--1zjJ_ udlite-fake-toggle-input udlite-fake-toggle-radio udlite-fake-toggle-radio-small"></span>
<span class="udlite-accordion-panel-title">Buy Course</span>
<span class="money-back">30-Day Money-Back Guarantee</span>
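Because the page does not always return the same markup, doc.find(...) can return None on some attempts, and calling find_all on None raises an AttributeError. A small guard avoids that. This is a sketch run against inline HTML: the class name "purchase" and the helper name spans_in are made up for the example.

```python
from bs4 import BeautifulSoup

def spans_in(html, class_name):
    """Return the text of every span inside the first element with class_name."""
    doc = BeautifulSoup(html, "html.parser")
    parent = doc.find(class_=class_name)
    if parent is None:
        # The container was not in this response, so report no spans.
        return []
    return [span.get_text(strip=True) for span in parent.find_all("span")]

# Hypothetical responses: one with the container, one without it.
full = ('<div class="purchase"><span>Buy Course</span>'
        '<span class="money-back">30-Day Money-Back Guarantee</span></div>')
partial = '<div class="other"></div>'

print(spans_in(full, "purchase"))
print(spans_in(partial, "purchase"))
```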
As far as your second approach goes, your main div does not have that many nested span elements, so it is bound to fail. Try just one span element:
import requests
from bs4 import BeautifulSoup
url = 'https://www.udemy.com/course/the-complete-web-development-bootcamp/'
response = requests.get(url)
doc = BeautifulSoup(response.text, "html.parser")
main = doc.find(class_="generic-purchase-section--main-cta-container--3xxeM")
title = main.select_one("div span")
print(title)
Prints:
<span class="money-back">30-Day Money-Back Guarantee</span>

python nested Tags (beautiful Soup)

I used Beautiful Soup with Python to get data from a specific website.
The element contains two prices and I don't know how to get just one of them. I want the price per gram (g).
This is the HTML code:
<div class="promoPrice margBottom7">16,000
L.L./200g<br/><span class="kiloPrice">79,999
L.L./Kg</span></div>
I use this code:
p_price = product.findAll("div", {"class": "promoPrice margBottom7"})[0].text
My result was:
16,000 L.L./200g 79,999 L.L./Kg
But I want only:
16,000 L.L./200g
You will need to first decompose the span inside the div element:
from bs4 import BeautifulSoup
h = """
<div class="promoPrice margBottom7">16,000 L.L./200g<br/>
<span class="kiloPrice">79,999 L.L./Kg</span></div>
"""
soup = BeautifulSoup(h, "html.parser")
element = soup.find("div", {'class': 'promoPrice'})
element.span.decompose()
print(element.text)
#16,000 L.L./200g
Alternatively, take the first child node of the div with soup.select_one('div.promoPrice').contents[0], which leaves the tree unmodified:
from bs4 import BeautifulSoup
html = """<div class="promoPrice margBottom7">16,000 L.L./200g<br/>
<span class="kiloPrice">79,999 L.L./Kg</span></div>"""
soup = BeautifulSoup(html, features='html.parser')
# value = soup.select('div.promoPrice > span') # for 79,999 L.L./Kg
value = soup.select_one('div.promoPrice').contents[0]
print(value)
Prints
16,000 L.L./200g
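In the HTML from the question the price text contains a line break ("16,000" and "L.L./200g" are on separate lines), so the extracted string may need its whitespace normalized. A minimal sketch, using the markup as posted in the question:

```python
from bs4 import BeautifulSoup

# Markup as posted in the question, including the line breaks inside the text.
html = """<div class="promoPrice margBottom7">16,000
L.L./200g<br/><span class="kiloPrice">79,999
L.L./Kg</span></div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.select_one("div.promoPrice")

# First text node that is a direct child of the div (the span is not touched).
first_text = div.find(string=True, recursive=False)

# split() + join collapses newlines and repeated spaces into single spaces.
gram_price = " ".join(first_text.split())
kilo_price = " ".join(div.select_one("span.kiloPrice").get_text().split())

print(gram_price)
print(kilo_price)
```

Unlike decompose(), this keeps the span in the tree, so the per-kilo price is still available afterwards.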

using beatifulsoup4 to scrape a specific part of html code

I want to assign the 1.65 near the end of the HTML below to a variable. Currently my code prints "price-text". Any help getting it to print "1.65" instead would be great.
HTML code:
<div class="priceText_f71sibe"><span class="size14_f7opyze medium_f1wf24vo priceTextSize_frw9zm9" data-automation-id="price-text">1.65</span></div>
uClient.close()
page_soup = soup(page_html, "html.parser")
price_texts = page_soup.findAll("div",{"class":"priceText_f71sibe"})
price_text = price_texts[0]
a = price_text.span["data-automation-id"]
print(a)
The most commonly used is the .text property:
price_text.span.text
But there are other properties and methods that return the same text here:
price_text.span.text
price_text.span.string
price_text.span.getText()
price_text.span.get_text()
Documentation for method get_text()
Full working code
from bs4 import BeautifulSoup
html = '<div class="priceText_f71sibe"><span class="size14_f7opyze medium_f1wf24vo priceTextSize_frw9zm9" data-automation-id="price-text">1.65</span></div>'
soup = BeautifulSoup(html, "html.parser")
price_texts = soup.findAll("div",{"class":"priceText_f71sibe"})
price_text = price_texts[0]
a = price_text.span["data-automation-id"]
print(price_text.span.text)
print(price_text.span.string)
print(price_text.span.getText())
print(price_text.span.get_text())
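These four give the same result for this span, but .string behaves differently once a tag has more than one child. The sketch below shows the difference. The nested variant of the markup is made up for illustration and does not come from the page in the question.

```python
from bs4 import BeautifulSoup

# Span as in the question (class attribute shortened): a single text child.
simple = BeautifulSoup(
    '<span data-automation-id="price-text">1.65</span>', "html.parser"
).span

# Hypothetical variant where part of the price sits in a nested tag.
nested = BeautifulSoup(
    '<span data-automation-id="price-text">1.<b>65</b></span>', "html.parser"
).span

print(simple.string)      # the single text child
print(nested.string)      # None, because the span has two children
print(nested.get_text())  # text of all descendants joined together

# get_text() returns a str, so convert it if a number is needed.
price = float(nested.get_text())
print(price)
```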

How to use BeautifulSoup to get real-time stock price on website?

I am working on a project to get the real-time stock price from http://www.jpmhkwarrants.com/en_hk/market-statistics/underlying/underlying-terms/code/1. I have searched online and tried several ways to get the price, but they all fail. Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
def getStockPrice():
    url = "http://www.jpmhkwarrants.com/zh_hk/market-statistics/underlying/underlying-terms/code/1"
    r = urlopen(url)
    soup = BeautifulSoup(r.read(), 'lxml')
    price = soup.find(id="real_time_box").find("span", {"class": "price"})
    print(price)
The output is None. I suspect the price is filled in by a script, but I have no idea how to get it. Can this be solved with BeautifulSoup or another module?
If you view the page source, you will see HTML like this:
<div class="table detail">
.....
<div class="tl">即市走勢 <span class="description">前收市價</span>
.....
<td>買入價(延遲*)<span>82.15</span></td>
The span we want is the second match (index 1). Select it with:
price = soup.select('.table.detail td span')[1]
print(price.text)
Demo:
from bs4 import BeautifulSoup
from urllib.request import urlopen
def getStockPrice():
    url = "http://www.jpmhkwarrants.com/zh_hk/market-statistics/underlying/underlying-terms/code/1"
    r = urlopen(url)
    soup = BeautifulSoup(r.read(), 'html.parser')
    price = soup.select('.table.detail td span')[1]
    print(price.text)
getStockPrice()
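Selecting by position breaks if the page adds or removes a span before the one you want. One alternative is to look for the table cell by its label text and then read the span inside it. The sketch below uses a simplified, hypothetical fragment modeled on the page source quoted above (the "placeholder" row is made up), and find_value is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical fragment modeled on the quoted page source.
html = """
<div class="table detail">
  <div class="tl">即市走勢 <span class="description">前收市價</span></div>
  <table>
    <tr><td>placeholder<span>0.00</span></td></tr>
    <tr><td>買入價(延遲*)<span>82.15</span></td></tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def find_value(soup, label):
    """Return the span text of the first cell whose own text contains label."""
    for td in soup.select(".table.detail td"):
        own_text = td.find(string=True, recursive=False)
        if own_text and label in own_text and td.span is not None:
            return td.span.get_text(strip=True)
    return None

bid = find_value(soup, "買入價")
print(bid)
```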

Extracting anchor text from span class with BeautifulSoup

This is the html I am trying to scrape:
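<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic">cinematic</a>,
<a href="/tags/dissolve">dissolve</a>,
<a href="/tags/epic">epic</a>,
<a href="/tags/fly">fly</a>,
</span>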
I want to get the anchor text for each a href: cinematic, dissolve, epic, etc.
This is the code I have:
url = urllib2.urlopen("http://example.com")
content = url.read()
soup = BeautifulSoup(content)
links = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
for link in links:
    print link.find_all('a')['href']
If I do it with "link.find_all" I get error: TypeError: List indices must be integers, not str.
But if I do print link.find('a')['href'] I get the first one only.
How can I get all of them ?
You could do the following:
from bs4 import BeautifulSoup
content = '''
<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic">cinematic</a>,
<a href="/tags/dissolve">dissolve</a>,
<a href="/tags/epic">epic</a>,
<a href="/tags/fly">fly</a>,
</span>
'''
soup = BeautifulSoup(content, "html.parser")
spans = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
for span in spans:
    links = span.find_all('a')
    for link in links:
        print(link['href'])
Output
/tags/cinematic
/tags/dissolve
/tags/epic
/tags/fly
from bs4 import BeautifulSoup
html = """
<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic">cinematic</a>,
<a href="/tags/dissolve">dissolve</a>,
<a href="/tags/epic">epic</a>,
<a href="/tags/fly">fly</a>,
</span>
"""
soup = BeautifulSoup(html, "lxml")
spans = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
for span in spans:
    for link in span.find_all('a'):
        print(link.text, link['href'])
Another, more expensive, way could be:
from bs4 import BeautifulSoup
html = """
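<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic">cinematic</a>,
<a href="/tags/dissolve">dissolve</a>,
<a href="/tags/epic">epic</a>,
<a href="/tags/fly">fly</a>,
</span>
"""
soup = BeautifulSoup(html, "lxml")
links = soup.find_all("a")
for link in links:
    # Skip anchors whose parent is not the tags span
    if 'meta-attributes__attr-tags' not in link.parent.get('class', []):
        continue
    print(link.text, link['href'])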
link.find_all('a') returns a list of bs4 Tag objects, so you cannot index the list itself by 'href'; you need to index each link instead. So maybe this comes closer to your needs:
spans = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
for span in spans:
    for link in span.find_all('a'):
        print(link['href'])
You may avoid nested loops or any additional if checks inside a loop by using a CSS selector:
for link in soup.select(".meta-attributes__attr-tags a[href]"):
    print(link["href"], link.get_text())
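For completeness, here is a self-contained version of the CSS selector approach that can be run without a network request. The anchor tags in the sample HTML are an assumption: they are reconstructed from the /tags/... hrefs shown in the output of the earlier answer.

```python
from bs4 import BeautifulSoup

# Sample markup; the hrefs are reconstructed from the output shown earlier.
html = """
<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic">cinematic</a>,
<a href="/tags/dissolve">dissolve</a>,
<a href="/tags/epic">epic</a>,
<a href="/tags/fly">fly</a>,
</span>
"""
soup = BeautifulSoup(html, "html.parser")

# One selector: every <a> that has an href and sits inside the tags span.
pairs = [
    (link["href"], link.get_text())
    for link in soup.select(".meta-attributes__attr-tags a[href]")
]

for href, text in pairs:
    print(href, text)
```

The a[href] part skips anchors without an href attribute, so link["href"] cannot raise a KeyError.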
