web scraping Beautiful soup [duplicate] - python

This question already has answers here:
Using python Requests with javascript pages
(6 answers)
Closed 4 years ago.
I am trying to get the ingredients section from
https://www.walmart.com/ip/Nature-s-Recipe-Chicken-Wild-Salmon-Recipe-in-Broth-Dog-Food-2-75-oz/34199310
so this is what I did:
import requests
from bs4 import BeautifulSoup
x=requests.get("https://www.walmart.com/ip/Nature-s-Recipe-Chicken-Wild-Salmon-Recipe-in-Broth-Dog-Food-2-75-oz/34199310")
soup=BeautifulSoup(x.content, "html.parser")
print(soup.find_all("p",{"class":"Ingredients"})[0])
But it raises "index out of range", i.e. no element was found, even though the element 'p class="Ingredients"' does exist when I check the website.

Bad news: it looks like those elements are generated via JavaScript. If you "view source" on that page, the elements aren't there, and that source is the HTML that requests is getting.
I would use something like Selenium to automate a browser and get the fully rendered HTML; then you can use BeautifulSoup to parse out the ingredients.
I personally find it very annoying when websites use JS to generate large amounts of content rather than to make the page more interactive etc. But what are ya gonna do...

Related

Python/ Beautiful Soup Data Displaying Issue [duplicate]

This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 1 year ago.
I am trying to pull some data from a website. When I check the data that I pulled with BeautifulSoup (using print(soup) in the code below), it does not look right. It is different from what I see with view-source:URL, and I am unable to find the fields that I am looking for.
Could you please help me find a solution?
Website: https://www.wayfair.com/furniture/pdp/mercury-row-stalvey-contemporary-4725-wide-1-drawer-server-w003245064.html
Basically, I am trying to get the price of this product. I used the same code structure on other websites and it worked properly, but it does not work on Wayfair.
The second thing I have not found a solution for yet is the last line of my code (StyledBox-owpd5f-0 PriceV2__StyledPrice-sc-7ia31j-0 lkFBUo pl-Price-V2 pl-Price-V2--5000). Instead of the name of the product, is there a way to get only the price, like $389.99?
Thanks in advance!
This is my code:
import requests
from bs4 import BeautifulSoup
html = requests.get('https://www.wayfair.com/furniture/pdp/mercury-row-stalvey-contemporary-4725-wide-1-drawer-server-w003245064.html')
soup=BeautifulSoup(html.text,"html.parser")
print(soup)
inps=soup.find("div",class_="SFPrice").find_all("input")
for inp in inps:
    print(inp.get("StyledBox-owpd5f-0 PriceV2__StyledPrice-sc-7ia31j-0 lkFBUo pl-Price-V2 pl-Price-V2--5000"))
Try:
soup.find_all('div', {'class': 'SFPrice'})[0].get_text()
Or, split into two steps:
inps=soup.find_all('div', {'class': 'SFPrice'})[0]
inps.get_text()
Both return the price for that specific product.
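To keep only the price (for example $389.99) rather than all of the element's text, one option is a regular expression over the extracted text. This is a minimal sketch: extract_price is a hypothetical helper, the sample string is made up for illustration, and the pattern assumes US-style dollar amounts.

```python
import re

# Matches US-style dollar amounts such as $389.99 or $1,299.00 (assumed format).
PRICE_PATTERN = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")


def extract_price(text):
    """Return the first dollar amount found in text, or None if there is none."""
    match = PRICE_PATTERN.search(text)
    return match.group(0) if match else None


# Made-up sample of what get_text() might return for a price element.
sample_text = "Sale price $389.99 Free shipping"
print(extract_price(sample_text))
```

The helper returns None when no amount is present, so the caller can tell "element found but no price in it" apart from a successful match.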
Your example site is a client-side rendered page, and the HTML originally fetched doesn't include the elements you are searching for (like the div with class "SFPrice").
Check out this question to learn how to scrape JavaScript-rendered pages with BeautifulSoup in combination with Selenium and PhantomJS, dryscrape, or other options.
Or you could also look at this guide.

How can I *webscrape* if the source HTML doesn't contain the actual number? [duplicate]

This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 1 year ago.
Hi, I'm a total newbie to the computer programming world, so I might ask stupid questions.
I'm trying to build a web scraping tool using Python to scrape some statistics from the Korean Statistical Office (KOSIS). This is what I did, and it keeps returning an error saying "'NoneType' object has no attribute 'find'":
import csv
import requests
from bs4 import BeautifulSoup
url = "https://kosis.kr/statHtml/statHtml.do?orgId=101&tblId=DT_1K31002&conn_path=I2"
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")
data_rows = soup.find("table", attrs = {"id" : "mainTable"}).find("tbody").find_all("tr")
for row in data_rows:
    print(row.get_text())
I googled my problem and found out that the DOM in the browser is different from the actual HTML source. So I opened the view-source page (view-source:https://kosis.kr/statHtml/statHtml.do?orgId=101&tblId=DT_1K31002&conn_path=I2) and, since I don't know anything about HTML, ran it through codebeautify and found that the source code doesn't contain any of the numbers that I'm seeing. Can anyone explain what's happening? Thanks!
I recommend using Puppeteer for web scraping (it uses Google Chrome behind the scenes), because many web pages use JavaScript to manipulate the DOM after the HTML page loads. Therefore, the original DOM is not the same as the DOM once the page is fully loaded.
Here is a link that I found: https://rexben.medium.com/introduction-to-web-scraping-with-puppeteer-1465b89fcf0b

BeautifulSoup not returning children of element [duplicate]

This question already has answers here:
Web scraping program cannot find element which I can see in the browser
(2 answers)
Closed 2 years ago.
I'm new to web scraping and have been using BeautifulSoup to scrape numbers off a gambling website. I'm trying to get the text of a certain element, but I get None back.
Here is my code:
import requests
import bs4
r=requests.get('https://roobet.com/crash')
soup = bs4.BeautifulSoup(r.text,'lxml')
crash = soup.find('div', class_='CrashHistory_2YtPR')
print(crash)
When I copied the content of my soup into a notepad and tried Ctrl+F to find the element, I could not find it.
The element I'm looking for is in the <div id="root"> element and when I looked closer at the copied soup in the notepad I saw that there was nothing inside the <div id="root"> element.
I don't understand what is happening.
How can I get the element I'm looking for?
Right-click on the page and view source. This is one sure way of knowing what the HTML looks like when the page first loads. If you do this for the site https://roobet.com/crash you will notice that the <body> is almost empty apart from some <script> elements.
This is because the body of the webpage is loaded dynamically using JavaScript, most likely with a framework such as React.
This is why BeautifulSoup is having trouble finding the element.
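The "view source" check can also be done in code before reaching for a browser. The following is a rough sketch using only the standard library: static_html_has_class is a hypothetical helper, and both HTML strings are invented stand-ins (not the real content of the site) for a JavaScript shell page and a fully rendered page.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name that appears in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def static_html_has_class(html, class_name):
    """Return True if any tag in the raw HTML carries the given class."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return class_name in collector.classes


# Invented examples: what the server might send vs. what the browser shows.
js_shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
rendered = '<html><body><div id="root"><div class="CrashHistory_2YtPR">1.52x</div></div></body></html>'

print(static_html_has_class(js_shell, "CrashHistory_2YtPR"))
print(static_html_has_class(rendered, "CrashHistory_2YtPR"))
```

In practice you would pass the text of the response from requests.get(...) as html. If the class is missing there but visible in the browser's inspector, the element is being added by JavaScript after the page loads.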
Your website seems to be dynamically loaded, meaning it uses JavaScript and other components. You can test this by enabling/disabling JavaScript in your browser. To scrape this site, try using Selenium with ChromeDriver. You can also use other browsers; just look for their equivalent drivers.

Python Download Website HTML containing JS [duplicate]

This question already has answers here:
Using python Requests with javascript pages
(6 answers)
Closed 3 years ago.
I am attempting to download many dot-bracket notations of RNA sequences from a URL with Python.
This is one of the links I am using: https://rnacentral.org/rna/URS00003F07BD/9606. To navigate to what I want, you have to click on the '2D structure' button, and only then does the thing I am looking for (right below the occurrence of this tag)
<h4>Dot-bracket notation</h4>
appear in the Inspect Element tab.
When I use the get function from the requests package, the text and content fields of the response do not contain that tag. Does anyone know how I can get the bracket notation item?
Here is my current code:
import requests
url = 'http://rnacentral.org/rna/URS00003F07BD/9606'
response = requests.get(url)
print(response.text)
The Requests library does not render JavaScript. You need a browser-based solution like Selenium. I have listed the steps as pseudocode below.
Use Selenium to load the page.
Then click the '2D structure' button using Selenium.
Wait for the content to load, for example by adding a time.sleep().
Then read the page source using Selenium.
You should get what you want.
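The steps above could be sketched roughly as follows. Several things here are assumptions rather than facts about the site: that Selenium and a Chrome driver are installed, that the button can be located by its visible text with the XPath shown, and that the notation is the first text after the heading. The sketch also swaps the fixed time.sleep() for an explicit WebDriverWait, which waits only as long as needed. The names fetch_rendered_html and text_after_heading are hypothetical helpers, and the sample HTML is made up.

```python
from html.parser import HTMLParser


def fetch_rendered_html(url, timeout=15):
    """Load the page in Chrome, click '2D structure', return the rendered HTML."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)
        # ASSUMPTION: the button is a <button> or <a> containing this text.
        locator = (By.XPATH, "//button[contains(., '2D structure')]"
                             " | //a[contains(., '2D structure')]")
        wait.until(EC.element_to_be_clickable(locator)).click()
        # Wait until the heading from the question shows up in the DOM.
        heading = (By.XPATH, "//h4[contains(., 'Dot-bracket notation')]")
        wait.until(EC.presence_of_element_located(heading))
        return driver.page_source
    finally:
        driver.quit()


class _TextAfterHeading(HTMLParser):
    """Captures the first non-empty text that follows a matching <h4>."""

    def __init__(self, heading_text):
        super().__init__()
        self.heading_text = heading_text
        self.in_h4 = False
        self.heading_seen = False
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            self.in_h4 = True

    def handle_endtag(self, tag):
        if tag == "h4":
            self.in_h4 = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self.result is not None:
            return
        if self.in_h4:
            if text == self.heading_text:
                self.heading_seen = True
        elif self.heading_seen:
            self.result = text


def text_after_heading(html, heading_text):
    """Return the first text found after <h4>heading_text</h4>, or None."""
    parser = _TextAfterHeading(heading_text)
    parser.feed(html)
    parser.close()
    return parser.result


# Made-up fragment standing in for the rendered page source.
sample_html = "<div><h4>Dot-bracket notation</h4>\n<pre>((..))</pre></div>"
print(text_after_heading(sample_html, "Dot-bracket notation"))
```

With a working browser setup, the two helpers would be combined as text_after_heading(fetch_rendered_html(url), "Dot-bracket notation"). The locators will likely need adjusting after inspecting the actual page.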

Scraping dynamic content in a website [duplicate]

This question already has answers here:
Scrape a dynamic website [duplicate]
(8 answers)
Closed 6 months ago.
I need to scrape news announcements from this website, Link.
The announcements seem to be generated dynamically. They don't appear in the source. I usually use mechanize, but I assume it wouldn't work. What can I do for this? I'm OK with Python or Perl.
If the content is generated dynamically, you can use Windmill or Selenium to drive the browser and get the data once it's been rendered.
You can find an example here.
The polite option would be to ask the owners of the site if they have an API which allows you access to their news stories.
The less polite option would be to trace the HTTP transactions that take place while the page is loading and work out which one is the AJAX call which pulls in the data.
Looks like it's this one. But it looks like it might contain session data, so I don't know how long it will continue to work for.
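Once the AJAX call has been identified in the browser's network tab, it can often be requested directly, skipping the HTML page altogether. The following is only a sketch using the standard library: the endpoint URL in the comment and the JSON shape (an "items" list of objects with a "title" field) are hypothetical placeholders, since the real call depends on the site and may need session data.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, timeout=10):
    """Request a URL and decode the response body as JSON."""
    request = Request(url, headers={"Accept": "application/json",
                                    "User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))


def extract_titles(payload):
    """Pull the title of each item out of the (assumed) payload shape."""
    return [item["title"] for item in payload.get("items", []) if "title" in item]


# Hypothetical payload, shaped like what such an endpoint might return.
sample_payload = {"items": [{"title": "First announcement"},
                            {"title": "Second announcement"}]}
print(extract_titles(sample_payload))

# Real use would look like this (placeholder URL, not the site's actual endpoint):
# titles = extract_titles(fetch_json("https://example.com/api/news"))
```

Requesting the data endpoint directly is usually faster than driving a browser, but as noted above it can stop working if the call depends on session tokens.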
There's also WWW::Scripter ("For scripting web sites that have scripts"). I have never used it.
In Python you can use urllib.request (urllib2 in Python 2) to connect to a website and collect data. For example:
from urllib.request import urlopen  # Python 3; in Python 2 this was urllib2.urlopen
myUrl = "http://www.marketvectorsindices.com/#!News/List"
# Note: the "#!News/List" fragment is not sent to the server, so only the base page is fetched
inStream = urlopen(myUrl)
inStream.read(1024) # etc, in a while loop
# all your fun page parsing code (perhaps: from xml.dom.minidom import parse)
