Why does html from requests response deviate from dev tools? [closed] - python

I am trying to scrape the Houzz website.
In the browser dev tools I can see the HTML content, but when I scrape the page with BeautifulSoup, it returns something else mixed in with some of the HTML. I do not have much knowledge of this.
A small part of what I get is as follows:
</div><style data-styled="true" data-styled-version="5.2.1">.fzynIk.fzynIk{box-sizing:border-box;margin:0;overflow:hidden;}/*!sc*/
.eiQuKK.eiQuKK{box-sizing:border-box;margin:0;margin-bottom:4px;}/*!sc*/
.chJVzi.chJVzi{box-sizing:border-box;margin:0;margin-left:8px;}/*!sc*/
.kCIqph.kCIqph{box-sizing:border-box;margin:0;padding-top:32px;padding-bottom:32px;border-top:1px solid;border-color:#E6E6E6;}/*!sc*/
.dIRCmF.dIRCmF{box-sizing:border-box;margin:0;display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-box-pack:justify;-webkit-justify-content:space-between;-ms-flex-pack:justify;justify-content:space-between;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;margin-bottom:16px;}/*!sc*/
.kmAORk.kmAORk{box-sizing:border-box;margin:0;margin-bottom:24px;}/*!sc*/
.bPERLb.bPERLb{box-sizing:border-box;margin:0;margin-bottom:-8px;}/*!sc*/
What should I do with this? Is this not achievable with BeautifulSoup?

Developer Tools operate on the live browser DOM. What you see when inspecting the page is not the original HTML but a modified version, produced after the browser has cleaned up the markup and executed JavaScript.
Requests does not execute JavaScript, so the content can deviate slightly, but you can still scrape it. Just take a deeper look into your soup.
Example (project titles):
from bs4 import BeautifulSoup
import requests
url_news = "https://www.houzz.com.au/professionals/home-builders/turrell-building-pty-ltd-pfvwau-pf~1099128087"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
response = requests.get(url_news, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
[title.text for title in soup.select('#projects h3')]
Output
['Major Renovation & Master Wing',
'"The Italian Village" Private Residence',
'Country Classic',
'Residential Resort',
'Resort Style Extension, Stone and Timber',
'Old Northern Rd Estate']
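The CSS fragments shown in the question come from <style> blocks in the raw response; they sit alongside the real content rather than replacing it. As a minimal sketch of that idea, the snippet below uses the standard library's html.parser (swapped in for BeautifulSoup only so that it runs without extra packages) to collect text while skipping <style> and <script> contents. The sample markup and the names VisibleTextParser and visible_text are made up for illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping anything inside <style> or <script>."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the non-empty text chunks that are not CSS or JavaScript."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical markup that mimics the mix of content and styled-components CSS.
sample = (
    '<div id="projects"><h3>Country Classic</h3></div>'
    '<style data-styled="true">.fzynIk.fzynIk{margin:0;}/*!sc*/</style>'
    '<script>window.__DATA__ = {};</script>'
)
print(visible_text(sample))  # ['Country Classic']
```

With BeautifulSoup itself, a CSS selector such as the '#projects h3' used above achieves the same effect more directly, because it never touches the <style> elements.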

Related

Web scraping with BeautifulSoup returns no text even though it is in the HTML

I'm new to web scraping and to BeautifulSoup. I need help, as I don't understand why my code returns no text when there is text in the inspect view on the website.
Here is my simple code:
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.nummerplade.net/nummerplade/Dd97487.html")
soup = BeautifulSoup(source.text,"html.parser")
name = soup.find("span",id="debitorer_name1")
print(name)
The output of running my code is:
<span id="debitorer_name1"></span>
When I inspect the HTML on the website I can see the desired name I want to extract, but not when running my script. Can anyone help me solve this issue?
Thanks!
If you reload the site, the data appears in the right-hand pane only after some delay, which means the page loads it dynamically, so it will not be visible in the soup.
How to find the URL that serves the dynamic data:
Go to the Network tab, reload the site, and type the value you are looking for into the search box on the left; this gives you the URL of the request that returned it.
Then go to the Headers tab and copy the user-agent and referer values into your request headers. The endpoint returns its data as JSON, from which you can extract whatever you need.
import requests
headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36", "referer": "https://www.nummerplade.net/"}
res=requests.get("https://data3.nummerplade.net/bilbogen2.php?stelnr=salza2bt3nh162519",headers=headers)
Output:
'Sebastian Carl Schwabe'
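Once the endpoint returns JSON, the remaining step is to pull the wanted value out of the decoded structure. The sketch below is one way to do that when the nesting is not known in advance. It uses only the standard library, and the payload, the key name "name", and the helpers fetch_json and find_key are assumptions for illustration; the real response layout of the endpoint above is not shown in this thread.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, headers):
    """Fetch a URL with the given headers and decode the body as JSON."""
    request = Request(url, headers=headers)
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def find_key(obj, key):
    """Depth-first search for the first non-None value stored under `key`."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


# Hypothetical payload; inspect the real response to find the actual keys.
sample = json.loads('{"status": "ok", "data": {"owner": {"name": "Jane Doe"}}}')
print(find_key(sample, "name"))  # Jane Doe
```

In practice you would call fetch_json with the URL and headers copied from the Network tab and then look up the key you identified there.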

Cannot scrape Amazon data. BeautifulSoup returns empty HTML element [closed]

I am trying to scrape data from Amazon's deals page, but the code below returns an empty element: <div class="" id="slot-4"> </div>.
Why is this code not working on the Amazon website?
from bs4 import BeautifulSoup
import requests
url = "https://www.amazon.in/gp/goldbox/all-deals/?ie=UTF8&ref_=sv_gb_1"
HEADERS = ({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36'})
def getdata(url, HEADERS):
    webContent = requests.get(url, headers=HEADERS)
    htmlContent = webContent.content
    soup = BeautifulSoup(htmlContent, 'html.parser')
    # print(soup.prettify())
    return soup
def getdeals(soup):
    data = soup.find_all("div", {"id": 'slot-4'})
    print(data)
soup = getdata(url, HEADERS)
getdeals(soup)
This is my output:
[<div class="" id="slot-4"> </div>]
The HTML element <div class="" id="slot-4"> is empty in the response that requests receives.
You may want to target the deal-card attribute instead.
You should also verify whether the content is being loaded dynamically with JavaScript. If it is, you will need to use a tool like Selenium rather than BeautifulSoup alone.
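A quick way to check for dynamic loading is to look at the raw response and see whether the target element is missing, present but empty, or actually populated. The sketch below does this with the standard library's html.parser so that it runs without extra packages; the class ElementContentProbe and the function classify_element are hypothetical names, and the sample strings are made up.

```python
from html.parser import HTMLParser


class ElementContentProbe(HTMLParser):
    """Record whether the element with a given id has any children or text."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False
        self.has_content = False
        self._depth = 0  # > 0 while inside the target element

    def handle_starttag(self, tag, attrs):
        if self._depth > 0:
            # Any nested tag counts as content. Void tags such as <img> never
            # close, so depth tracking is only approximate after this point.
            self._depth += 1
            self.has_content = True
        elif not self.found and dict(attrs).get("id") == self.target_id:
            self.found = True
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self.has_content = True


def classify_element(html, target_id):
    """Return 'missing', 'empty', or 'populated' for the element with that id."""
    probe = ElementContentProbe(target_id)
    probe.feed(html)
    probe.close()
    if not probe.found:
        return "missing"
    return "populated" if probe.has_content else "empty"


print(classify_element('<div class="" id="slot-4"> </div>', "slot-4"))  # empty
```

An 'empty' or 'missing' result for an element that is clearly filled in the browser is a strong hint that JavaScript populates it after the page loads.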

Is stockx.com blocking webscraping? [closed]

I'm trying to web scrape from "https://stockx.com/" using bs4 but I get:
urllib.error.HTTPError: HTTP Error 403: Forbidden.
Is there any way I can fix this?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://stockx.com/"
uClient = uReq(my_url)
Passing a User-Agent header seems to solve the issue.
Try something like this:
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup
my_url = "https://stockx.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}
uClient = uReq(Request(url=my_url, headers=headers))
Be aware, though, that if the data you're trying to scrape is dynamic, bs4 won't be of much help. Consider using pyppeteer, Selenium, or a similar tool for that.
Use Scrapy; it will retry the request and follow redirects.
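For the plain urllib route, the User-Agent fix can be wrapped in a small helper that also reports HTTP errors such as the 403 from the question instead of raising. This is a minimal sketch; build_request, fetch_html, and the DEFAULT_USER_AGENT string are illustrative names and values, and a site may still block the request for other reasons.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Illustrative browser-like User-Agent string; any realistic value can be used.
DEFAULT_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Create a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Return the response body as bytes, or None if the server sends an HTTP error."""
    try:
        with urlopen(build_request(url), timeout=10) as response:
            return response.read()
    except HTTPError as error:
        print("Request failed with HTTP status", error.code)
        return None


req = build_request("https://stockx.com/")
# urllib stores header names capitalized, hence "User-agent" in the lookup.
print(req.get_header("User-agent"))
```

Note that the header name must be spelled "User-Agent" with a hyphen; a name with a space in it is not the header servers look for.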

scrape google resultstats with python [closed]

I would like to get the estimated number of results from Google for a keyword. I'm using Python 3.3 and trying to accomplish this with BeautifulSoup and urllib.request. This is my simple code so far:
def numResults():
    try:
        page_google = '''http://www.google.de/#output=search&sclient=psy-ab&q=pokerbonus&oq=pokerbonus&gs_l=hp.3..0i10l2j0i10i30l2.16503.18949.0.20819.10.9.0.1.1.0.413.2110.2-6j1j1.8.0....0...1c.1.19.psy-ab.FEBvxrgi0KU&pbx=1&bav=on.2,or.r_qf.&bvm=bv.48705608,d.Yms&'''
        req_google = Request(page_google)
        req_google.add_header('User Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
        html_google = urlopen(req_google).read()
        soup = BeautifulSoup(html_google)
        scounttext = soup.find('div', id='resultStats')
    except URLError as e:
        print(e)
    return scounttext
My problem is that my soup variable is somehow encoded, so I can't get any information out of it. I get back None because soup.find doesn't find anything.
What am I doing wrong, and how can I extract the result stats?
Many thanks!
If you haven't solved this problem yet: the reason BeautifulSoup fails to find anything appears to be that resultStats never appears in the soup. Your Request(page_google) returns only JavaScript, not the search results that the JavaScript loads dynamically. You can verify this by adding a
print(soup)
call to your code; you will see that the resultStats div doesn't appear.
The following code:
# Python 3 imports (the original answer used the Python 2 urllib2 module)
from urllib.parse import quote_plus
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
query = 'pokerbonus'
# Use the /search endpoint with a real query string instead of a '#' fragment
url = "http://www.google.de/search?q=%s" % quote_plus(query)
req_google = Request(url)
req_google.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
html_google = urlopen(req_google).read()
soup = BeautifulSoup(html_google, "html.parser")
scounttext = soup.find('div', id='resultStats')
print(scounttext)
Will print
<div class="sd" id="resultStats">Ungefähr 1.060.000 Ergebnisse</div>
Lastly, using a tool like Selenium Webdriver might be a better way to go about solving this, as Google does not allow bots to scrape search results.
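One detail worth spelling out: everything after '#' in a URL is a fragment, which the browser keeps to itself and never sends to the server, so the original URL effectively requested only the Google front page. The sketch below shows the difference with urllib.parse; the helper names split_query_and_fragment and build_search_url are made up for illustration.

```python
from urllib.parse import urlencode, urlsplit


def split_query_and_fragment(url):
    """Return the (query, fragment) parts of a URL."""
    parts = urlsplit(url)
    return parts.query, parts.fragment


def build_search_url(base, query):
    """Build a URL whose search term travels in the query string."""
    return base + "?" + urlencode({"q": query})


# Shortened version of the URL from the question: all parameters sit after '#'.
fragment_url = "http://www.google.de/#output=search&q=pokerbonus"
print(split_query_and_fragment(fragment_url))
# ('', 'output=search&q=pokerbonus')

print(build_search_url("http://www.google.de/search", "pokerbonus"))
# http://www.google.de/search?q=pokerbonus
```

The empty query part is why the server had no search term to act on, while the second URL carries the term where the server can see it.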

How to download youtube videos using a python script [closed]

I need to download videos from YouTube using a Python script. However, I am unable to get the URL of the video from the YouTube page.
For example, given the URL: http://www.youtube.com/watch?v=5qcmCUsw4EQ&feature=g-all-u&context=G2633db8FAAAAAAAAAAA
I need to download the video as FLV or any other format. I also need to be able to download it in multiple qualities.
I tried several scripts such as youtube-dl and quvi, but they all give errors and don't work. Please help; it would be deeply appreciated.
You need to parse the flashvars variable of the <embed> tag that contains the video. The variable names change over time, so some experimentation may be required to find the current ones. Roughly speaking, you'll want to use a library like mechanize to grab the HTML of the page and BeautifulSoup to parse it and extract the flashvars field of the <embed> element. Then look through the variables to figure out which one contains the video URL.
e.g.,
import urllib.parse
import mechanize
from bs4 import BeautifulSoup
# Assumed base URL; the original snippet used YOUTUBE_URL without defining it
YOUTUBE_URL = 'http://www.youtube.com/watch'
# get_video_url is a hypothetical wrapper; the original snippet was a bare function body
def get_video_url(vidId):
    br = mechanize.Browser()
    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    # Follow refresh 0, but don't hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    # User-Agent (this is cheating, ok?)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    br.open('%s?v=%s' % (YOUTUBE_URL, vidId))
    soup = BeautifulSoup(br.response().read(), 'html.parser')
    flashVars = urllib.parse.parse_qs(soup.find('embed').get('flashvars'))
    # Return the second '|'-separated field of the stream map (the video source URL)
    return flashVars['fmt_stream_map'][0].split('|')[1]
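The last two lines of that snippet are the part that needs experimentation, so it can help to try the parsing step on its own. The sketch below runs parse_qs on a made-up flashvars string; the sample value, the function name first_stream_url, and the 'format|url' layout of fmt_stream_map are assumptions based on the answer above, not a description of what YouTube serves today.

```python
from urllib.parse import parse_qs


def first_stream_url(flashvars):
    """Extract the URL from the first 'format|url' pair in fmt_stream_map."""
    fields = parse_qs(flashvars)  # maps each name to a list of decoded values
    stream_map = fields["fmt_stream_map"][0]
    return stream_map.split("|")[1]


# Hypothetical flashvars string; '%7C' is the URL-encoded '|' separator.
sample = "video_id=abc123&fmt_stream_map=34%7Chttp%3A%2F%2Fexample.com%2Fvideo.flv"
print(first_stream_url(sample))  # http://example.com/video.flv
```

Printing the full parse_qs result for a real page is a convenient way to look through the variables and find the one that currently holds the video URL.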
