This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 1 year ago.
I am trying to pull some data from a website. When I check the data I pulled with BeautifulSoup (using print(soup) in the code below), it does not look right. It is different from what I see with view-source:URL, and I am unable to find the fields I am looking for.
Could you please help me to find a solution?
Website: https://www.wayfair.com/furniture/pdp/mercury-row-stalvey-contemporary-4725-wide-1-drawer-server-w003245064.html
Basically, I am trying to get the price of this product. I used the same code structure on other websites and it worked properly, but it is not working on Wayfair.
The second thing I have not been able to solve yet is the last line of my code (StyledBox-owpd5f-0 PriceV2__StyledPrice-sc-7ia31j-0 lkFBUo pl-Price-V2 pl-Price-V2--5000). Instead of the name of the product, is there a way to get only the price, like $389.99?
Thanks in advance!
This is my code:
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.wayfair.com/furniture/pdp/mercury-row-stalvey-contemporary-4725-wide-1-drawer-server-w003245064.html')
soup = BeautifulSoup(html.text, "html.parser")
print(soup)
inps = soup.find("div", class_="SFPrice").find_all("input")
for inp in inps:
    print(inp.get("StyledBox-owpd5f-0 PriceV2__StyledPrice-sc-7ia31j-0 lkFBUo pl-Price-V2 pl-Price-V2--5000"))
Try with:
soup.findAll('div', {'class': 'SFPrice'})[0].getText()
Or, more simply:
inps=soup.findAll('div', {'class': 'SFPrice'})[0]
inps.getText()
Both return the price for that specific product.
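To also answer the follow-up about getting only the price like $389.99, a regular expression over the div's text will do it. A minimal sketch, assuming you already have the rendered HTML (from a browser or Selenium); the HTML string here is a simplified stand-in, not Wayfair's actual markup:

```python
import re
import bs4

# Stand-in for the *rendered* HTML; the real SFPrice div is more complex.
html = '<div class="SFPrice"><span>$389.99</span></div>'

soup = bs4.BeautifulSoup(html, "html.parser")
text = soup.find("div", class_="SFPrice").get_text()

# Pull just the dollar amount out of whatever text surrounds it
match = re.search(r"\$\d[\d,]*(?:\.\d{2})?", text)
price = match.group(0) if match else None
print(price)  # $389.99
```

The regex tolerates thousands separators and an optional cents part, so it keeps working if the text around the number changes.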
Your example site is a client-side rendered page, and the original HTML that is fetched doesn't include the elements you are searching for (like the div with class "SFPrice").
Check out this question to learn how to scrape JavaScript-rendered pages with BeautifulSoup in combination with Selenium, PhantomJS, dryscrape, or other options.
Or you could also look at this guide.
This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 1 year ago.
Hi, I'm a total newbie to the computer programming world, so I might ask stupid questions.
I'm trying to build a web scraping tool in Python to scrape some statistics from the Korean Statistical Office (KOSIS). This is how I did it, and it keeps returning an error saying "'NoneType' object has no attribute 'find'":
import csv
import requests
from bs4 import BeautifulSoup
url = "https://kosis.kr/statHtml/statHtml.do?orgId=101&tblId=DT_1K31002&conn_path=I2"
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")
data_rows = soup.find("table", attrs = {"id" : "mainTable"}).find("tbody").find_all("tr")
print(data_rows.get_text())
I googled my problem and found out that the DOM in the browser is different from the actual HTML source. So I went to the view-source page (view-source:https://kosis.kr/statHtml/statHtml.do?orgId=101&tblId=DT_1K31002&conn_path=I2) and, since I don't know anything about HTML, I ran it through codebeautify and found that the source code doesn't contain any of the numbers I'm seeing. Is there anyone who can teach me what's happening? Thanks!
I recommend using Puppeteer for web scraping (it drives Google Chrome behind the scenes), because many web pages use JavaScript to manipulate the DOM after the HTML page loads. The DOM you see in the browser is therefore not the same as the HTML you originally fetched.
Here is a link that I found: https://rexben.medium.com/introduction-to-web-scraping-with-puppeteer-1465b89fcf0b
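Once you have the rendered HTML from a headless browser (Puppeteer, Selenium, or similar), the original parsing code works, with one fix: get_text() belongs on each row, not on the list of rows. A sketch with an inline stand-in for the rendered page (the table contents here are illustrative, not real KOSIS data):

```python
import bs4

# Stand-in for the *rendered* HTML a headless browser would return;
# the real KOSIS table has many more rows and columns.
rendered = """
<table id="mainTable">
  <tbody>
    <tr><td>Seoul</td><td>9,736,027</td></tr>
    <tr><td>Busan</td><td>3,396,109</td></tr>
  </tbody>
</table>
"""

soup = bs4.BeautifulSoup(rendered, "html.parser")
rows = soup.find("table", attrs={"id": "mainTable"}).find("tbody").find_all("tr")

# find_all returns a list, so call get_text() per row (or per cell)
for row in rows:
    print([cell.get_text() for cell in row.find_all("td")])
```

Against the live site, `rendered` would be the page source taken from the browser after the table finishes loading.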
This question already has answers here:
Web scraping program cannot find element which I can see in the browser
(2 answers)
Closed 2 years ago.
I'm new to web scraping and have been using BeautifulSoup to scrape numbers off a gambling website. I'm trying to get the text of a certain element, but it returns None.
Here is my code:
import requests
import bs4

r = requests.get('https://roobet.com/crash')
soup = bs4.BeautifulSoup(r.text, 'lxml')
crash = soup.find('div', class_='CrashHistory_2YtPR')
print(crash)
When I copied the content of my soup into Notepad and tried Ctrl+F to find the element, I could not find it.
The element I'm looking for is inside the <div id="root"> element, and when I looked closer at the copied soup I saw that there was nothing inside <div id="root">.
I don't understand what is happening. How can I get the element I'm looking for?
Right-click on the page and view source. This is one sure way of knowing what the DOM looks like when the page first loads. If you do this for the site https://roobet.com/crash you will notice that the <body> is almost empty besides some <script> elements.
This is because the body of the webpage is dynamically loaded using JavaScript, most likely with a framework such as React.
This is the reason BeautifulSoup is having trouble finding the element.
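You can reproduce what's happening without a browser. A sketch where the HTML string stands in for what requests actually returns (the real response is longer, but the shape is the same: an empty #root plus scripts):

```python
import bs4

# Stand-in for requests.get('https://roobet.com/crash').text:
# an almost-empty body that JavaScript fills in after load.
raw = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

soup = bs4.BeautifulSoup(raw, "html.parser")

root = soup.find("div", id="root")
crash = soup.find("div", class_="CrashHistory_2YtPR")

print(root.get_text())  # empty string: nothing inside #root yet
print(crash)            # None: the element only exists after JS runs
```

This is exactly why the find in the question returns None: the element is created client-side and never appears in the fetched HTML.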
The website seems to be dynamically loaded, meaning it uses JavaScript to build the page (you can test this by disabling JavaScript in your browser and reloading). To scrape this site, try Selenium with ChromeDriver; other browsers work too, just look for their equivalent driver.
This question already has answers here:
scrape html generated by javascript with python
(5 answers)
Closed 4 years ago.
I am trying to scrape data from a website with Python 3. The website contains data on players from a champion-based FPS multiplayer game named "Paladins". I wanted to get the champion-based stats of a player as shown on the website. The problem I'm facing is that when I inspect the page with Chrome, I get this code, which contains a "table" tag and is clean, and I could scrape it easily:
INSPECTED CODE (my gist link)
But when I make the soup object, I get different code, and when I went to the page source it was the same as the soup; there was no table tag in the page source (you may check the page source for a better understanding).
Now how may I scrape champion-wise data from the website?
I am using requests and BeautifulSoup with Python 3:
import requests as req
import bs4
res = req.get('http://paladins.guru/profile/pc/Encrytedbro/champions')
soup = bs4.BeautifulSoup(res.text, "lxml")
soup.select('#root')
It is giving me this result: HERE'S A LINK TO MY GIST
And I have no idea how to get the data out of there.
I think you're getting the error because you are using the wrong parser; try html.parser instead of lxml:
soup = bs4.BeautifulSoup(res.text, "html.parser")
Hope this solves the problem.
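Worth checking either way: you can verify whether #root actually contains anything in the HTML requests fetched, independent of the parser. A sketch with an inline stand-in for res.text:

```python
import bs4

# Stand-in for res.text; the real page ships an empty #root that JS fills in.
raw = '<html><body><div id="root"></div></body></html>'

soup = bs4.BeautifulSoup(raw, "html.parser")
matches = soup.select('#root')

print(len(matches))                  # 1 -> the div itself is found
print(matches[0].decode_contents())  # '' -> but there is nothing inside it
```

If the div is present but empty like this, switching parsers won't help; the content is generated by JavaScript and you need a rendering tool (Selenium or similar) to see it.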
This question already has answers here:
Using python Requests with javascript pages
(6 answers)
Closed 4 years ago.
What I am trying to do is get the ingredients section from
https://www.walmart.com/ip/Nature-s-Recipe-Chicken-Wild-Salmon-Recipe-in-Broth-Dog-Food-2-75-oz/34199310
So what I did was:
import requests
from bs4 import BeautifulSoup

x = requests.get("https://www.walmart.com/ip/Nature-s-Recipe-Chicken-Wild-Salmon-Recipe-in-Broth-Dog-Food-2-75-oz/34199310")
soup = BeautifulSoup(x.content, "html.parser")
print(soup.find_all("p", {"class": "Ingredients"})[0])
But it's showing index out of range, i.e. no element was found, although when I check the website the element <p class="Ingredients"> does exist.
Bad news: it looks like those elements are generated via JS. If you "view source" on that page, the elements aren't there; that is the HTML that requests is getting.
I would use something like selenium to automate a browser to get the fully rendered html, then you can use beautifulsoup to parse out the ingredients.
I personally find it very annoying when websites use JS to generate large amounts of content rather than to make the page more interactive etc. But what are ya gonna do...
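Once Selenium (or any renderer) hands you the fully rendered HTML, the original find_all line works unchanged. A sketch where the HTML string is a simplified stand-in for the rendered page (the ingredients text is abbreviated, not Walmart's full copy):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source *after* the page finishes rendering.
rendered = """
<div>
  <h2>Ingredients</h2>
  <p class="Ingredients">Chicken, Salmon, Chicken Broth</p>
</div>
"""

soup = BeautifulSoup(rendered, "html.parser")
ingredients = soup.find_all("p", {"class": "Ingredients"})[0]
print(ingredients.get_text())
```

So the parsing code from the question was fine; it just needs to run against HTML that has already been rendered.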
This question already has answers here:
How to click on the first result on google using selenium python
(4 answers)
Closed 4 years ago.
I would like to always click on the first link on Google using a Selenium robot in Python.
So, for example, if I write this:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.google.com")
elem = driver.find_element_by_name("q")
elem.send_keys("Tennis")
I would like to click on the first link that appears on Google
Thank you for your help !!
The accepted solution didn't work for me. So, for anyone else who comes across this question: driver.find_element_by_tag_name("cite").click() with Python 3 worked for me. However, if you just want the link to the top search result, it is faster to use the requests and BeautifulSoup libraries as shown below.
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup
url = 'http://www.google.com/search?q=something'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
print(soup.find('cite').text)
The XPath that works as of Aug 2019 is
(//h3)[1]/../../a
i.e. find the first h3 tag, go up two levels to its grandparent, and take the first a there.
driver.find_element(By.XPATH, '(//h3)[1]/../../a').click()
Google will surely change something in the future and then another approach will be needed.
Old answer
The result links all used to have an h3 as the parent element, so you could use that:
driver.find_element(By.XPATH, '(//h3)[1]/a').click()
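The same traversal can be done without a browser once you have the HTML. A sketch using BeautifulSoup against a simplified stand-in for a Google result block (the real markup changes often, which is exactly why these selectors keep breaking):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one Google result; not Google's actual markup.
html = """
<div class="g">
  <a href="https://example.com/tennis"></a>
  <div><h3>Tennis - Example</h3></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Equivalent of the XPath (//h3)[1]/../../a :
# first h3, up two levels to its grandparent, then the first <a> underneath
h3 = soup.find("h3")
link = h3.parent.parent.find("a")
print(link["href"])
```

With Selenium you would then driver.get(link["href"]) instead of clicking, which avoids depending on Google's ever-shifting class names.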