BeautifulSoup not returning children of element [duplicate] - python

This question already has answers here:
Web scraping program cannot find element which I can see in the browser
(2 answers)
Closed 2 years ago.
I'm new to web scraping and have been using BeautifulSoup to scrape numbers off a gambling website. I'm trying to get the text of a certain element, but find() returns None.
Here is my code:
import requests
import bs4

r = requests.get('https://roobet.com/crash')
soup = bs4.BeautifulSoup(r.text, 'lxml')
crash = soup.find('div', class_='CrashHistory_2YtPR')
print(crash)
When I copied the content of my soup into a notepad and tried Ctrl+F to find the element, I could not find it.
The element I'm looking for sits inside the <div id="root"> element, and when I looked closer at the copied soup I saw that there was nothing inside <div id="root"> at all.
I don't understand what is happening. How can I get the element I'm looking for?

Right-click on the page and view its source. This is one sure way of knowing what the DOM looks like when the page first loads. If you do this for https://roobet.com/crash you will notice that the <body> is almost empty apart from some <script> elements.
This is because the body of the webpage is dynamically loaded using JavaScript, most likely with a framework such as React.
This is the reason BeautifulSoup is having trouble finding the element: it only ever sees that near-empty initial HTML.

Your website appears to be dynamically loaded, meaning it builds its content with JavaScript. You can test this by disabling JavaScript in your browser and reloading the page. To scrape this site, try using Selenium with ChromeDriver; other browsers work too, just look for their equivalent driver.
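A minimal sketch of that approach, assuming ChromeDriver is installed; the CrashHistory_2YtPR class comes from the question, and since obfuscated class names like this change between builds, a substring match is used:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://roobet.com/crash')

# Wait until JavaScript has rendered the element BeautifulSoup never saw
crash = WebDriverWait(driver, 15).until(
    EC.presence_of_element_located(
        (By.CSS_SELECTOR, 'div[class*="CrashHistory"]')))
print(crash.text)
driver.quit()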

Related

Python/ Beautiful Soup Data Displaying Issue [duplicate]

This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 1 year ago.
I am trying to pull some data from a website. When I check the data I pulled with BeautifulSoup (using print(soup) in the code below), it does not look right: it is different from what I see with view-source:URL, and I am unable to find the fields I am looking for.
Could you please help me to find a solution?
Website: https://www.wayfair.com/furniture/pdp/mercury-row-stalvey-contemporary-4725-wide-1-drawer-server-w003245064.html
Basically, I am trying to get the price of this product. I used the same code structure on other websites and it worked properly, but it is not working on Wayfair.
The second thing I have not found a solution for yet is the last line of my code (StyledBox-owpd5f-0 PriceV2__StyledPrice-sc-7ia31j-0 lkFBUo pl-Price-V2 pl-Price-V2--5000). Instead of the name of the product, is there a way to get only the price, like $389.99?
Thanks in advance!
This my code:
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.wayfair.com/furniture/pdp/mercury-row-stalvey-contemporary-4725-wide-1-drawer-server-w003245064.html')
soup = BeautifulSoup(html.text, "html.parser")
print(soup)
inps = soup.find("div", class_="SFPrice").find_all("input")
for inp in inps:
    print(inp.get("StyledBox-owpd5f-0 PriceV2__StyledPrice-sc-7ia31j-0 lkFBUo pl-Price-V2 pl-Price-V2--5000"))
Try with:
soup.findAll('div', {'class': 'SFPrice'})[0].getText()
Or, more simply:
inps = soup.findAll('div', {'class': 'SFPrice'})[0]
inps.getText()
Both return the price for that specific product.
Your example site is a client-side rendered page, and the original HTML data fetched doesn't include the searched-for elements (like the div with class "SFPrice").
Check out this question to learn how to scrape JavaScript-rendered pages with BeautifulSoup in combination with Selenium and PhantomJS, dryscrape, or other options.
Or you could also look at this guide.
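If you go the Selenium route, a sketch like this (assuming ChromeDriver) lets the page render first and then hands the finished HTML to BeautifulSoup:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.wayfair.com/furniture/pdp/mercury-row-stalvey-contemporary-4725-wide-1-drawer-server-w003245064.html')
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# Prints something like $389.99, provided the SFPrice class still matches
price = soup.find('div', {'class': 'SFPrice'})
print(price.get_text() if price else 'SFPrice div not found')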

Python Download Website HTML containing JS [duplicate]

This question already has answers here:
Using python Requests with javascript pages
(6 answers)
Closed 3 years ago.
I am attempting to download many dot-bracket notations of RNA sequences from a URL with Python.
This is one of the links I am using: https://rnacentral.org/rna/URS00003F07BD/9606. To navigate to what I want, you have to click on the '2D structure' button, and only then does the thing I am looking for appear in the Inspect Element tab, right below the occurrence of this tag:
<h4>Dot-bracket notation</h4>
When I use the get function from the requests package, the text and content fields do not contain that tag. Does anyone know how I can get the bracket notation item?
Here is my current code:
import requests
url = 'http://rnacentral.org/rna/URS00003F07BD/9606'
response = requests.get(url)
print(response.text)
The requests library does not render JS. You need to use a web browser-based solution like Selenium. I have listed the pseudo-code below (a runnable sketch follows it).
Use Selenium to load the page.
Then click the '2D structure' button using Selenium.
Wait for some time by adding a time.sleep().
Then read the page source using Selenium.
You should get what you want.
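A minimal sketch of that pseudo-code, assuming ChromeDriver is available; the XPath that finds the '2D structure' button is an assumption and may need adjusting against the live page:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://rnacentral.org/rna/URS00003F07BD/9606')

# Assumed locator: click whichever element carries the '2D structure' label
driver.find_element(By.XPATH, "//*[contains(text(), '2D structure')]").click()
time.sleep(5)  # crude wait for the structure panel to render

html = driver.page_source
# The dot-bracket notation should now sit below the <h4> heading
print('Dot-bracket notation' in html)
driver.quit()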

web scraping Beautiful soup [duplicate]

This question already has answers here:
Using python Requests with javascript pages
(6 answers)
Closed 4 years ago.
What I am trying to do is get the ingredients section from
https://www.walmart.com/ip/Nature-s-Recipe-Chicken-Wild-Salmon-Recipe-in-Broth-Dog-Food-2-75-oz/34199310
so what I did was:
import requests
from bs4 import BeautifulSoup
x = requests.get("https://www.walmart.com/ip/Nature-s-Recipe-Chicken-Wild-Salmon-Recipe-in-Broth-Dog-Food-2-75-oz/34199310")
soup = BeautifulSoup(x.content, "html.parser")
print(soup.find_all("p", {"class": "Ingredients"})[0])
But it throws an index out of range error, i.e. no element found, even though when I check the website the element does exist: p class="Ingredients".
Bad news: it looks like those elements are generated via JS. If you "view source" on that page, the elements aren't there, and that source is the HTML that requests is getting.
I would use something like Selenium to automate a browser and get the fully rendered HTML; then you can use BeautifulSoup to parse out the ingredients.
I personally find it very annoying when websites use JS to generate large amounts of content rather than to make the page more interactive. But what are you gonna do...
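For instance, a minimal sketch along those lines, assuming ChromeDriver is available:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.walmart.com/ip/Nature-s-Recipe-Chicken-Wild-Salmon-Recipe-in-Broth-Dog-Food-2-75-oz/34199310")

# Hand the rendered DOM, not the bare server response, to BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

hits = soup.find_all("p", {"class": "Ingredients"})
print(hits[0].get_text() if hits else "still not rendered")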

Fetch links having no href attribute : Selenium-Python

I am currently trying to crawl an entire website with a specified crawl depth using selenium-python. I started with Google and thought of moving forward by crawling it while simultaneously developing the code.
The way it works is: if the page is 'www.google.com' and has 15 links within it, once all the links are fetched they are stored in a dictionary with 'www.google.com' as the key and the list of 15 links as the value. Each of those 15 links is then taken from the dictionary and the crawling continues recursively.
The problem is that it moves forward via the href attribute of every link found on a page, but not every link has an href attribute.
For example: as it crawled and reached the My Account page, it found Help and Feedback in the footer, which has an outerHTML of <span role="button" tabindex="0" class="fK1S1c" jsname="ngKiOe">Help and Feedback</span>.
So what I am not sure about is what can be done in such a context, where a 'link' is driven by JavaScript/Ajax: it has no href, but opens a modal window/dialog box or the like.
You might need to find a design pattern for the links. For example, a link could live in an anchor tag or, as in your case, a span.
It depends on the design of the webpage and how the developers intended to mark up the HTML elements through attributes/identifiers.
For example, if the devs decided to give a common class value to all the links that do not use the anchor tag, it would be easy to identify all those elements.
You could also try writing a script to fetch all the elements with the expected tag name (for example, span) and try clicking on them, as sketched below. You can then check the backend response/log details: clicks that produce an additional response/log entry have extra code wired up behind them, which tells you the element is not static.
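A rough sketch of that idea, assuming the JS-driven 'links' are spans marked with role="button" as in the outerHTML above:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.google.com')

# Spans marked up as buttons are candidate JS-driven 'links' without an href
candidates = driver.find_elements(By.CSS_SELECTOR, 'span[role="button"]')
for span in candidates:
    print(span.text)  # inspect these before deciding which ones to click
driver.quit()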

How can Selenium (or BeautifulSoup) be used to access these hidden elements?

Here is an example page with pagination controlling dynamically loaded results.
http://www.rehabs.com/local/jacksonville-fl/
All that I presently know to try is:
curButton = 1
driver.find_element_by_css_selector('ul[class="pagination"]').find_elements_by_tag_name('li')[curButton].click()
Nothing seems to happen (likewise when trying to access and click the a tag, or to driver.get() the href of the a element).
Is there another way to access the hidden elements? For instance, when reading the HTML of the entire page, the elements of the other pagination pages are shown, but they are apparently inaccessible with BeautifulSoup.
Pagination was added for humans, so it can be clicked. Maybe you used the wrong XPath or CSS selector; check it.
Use this XPath:
//div[@id="listing-basic"]/article/div[@class="h3"]/a/@href
You can click on the pagination button using:
driver.find_elements_by_css_selector('.pagination li a')[1].click()
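For current Selenium releases, which dropped the find_element(s)_by_* helpers, a hedged equivalent that clicks the second pagination link and then collects the listing hrefs (selectors taken from this answer; the site may have changed since):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('http://www.rehabs.com/local/jacksonville-fl/')
wait = WebDriverWait(driver, 10)

wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.pagination li a')))
driver.find_elements(By.CSS_SELECTOR, '.pagination li a')[1].click()

# Wait for the next page of results, then pull each listing's link
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#listing-basic article')))
for a in driver.find_elements(By.CSS_SELECTOR, '#listing-basic article div.h3 a'):
    print(a.get_attribute('href'))
driver.quit()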
