Is stockx.com blocking webscraping? [closed] - python

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
Improve this question
I'm trying to web scrape from "https://stockx.com/" using bs4 but I get:
urllib.error.HTTPError: HTTP Error 403: Forbidden.
Is there any way I can fix this?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://stockx.com/"
uClient = uReq(my_url)

Passing a useragent header seems to solve the issue.
Try something like this:
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup
my_url = "https://stockx.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}
uClient = uReq(Request(url=my_url, headers=headers))
But do know that if the data you're trying to scrape is dynamic, bs4 wouldn't be of much help. consider using pyppeteer or selenium, etc.. for that.

Use scrapy, it'll try to request the site again and it'll follow redirects.

Related

Why does html from requests response deviate from dev tools? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 1 year ago.
Improve this question
I am trying to scrape houzz website
In browser dev tools it shows HTML content. But when I scrape it with beautifulsoup, it returns something else together with some of the html, I do not have much knowledge on this.
A little part of what I get is as follows.
</div><style data-styled="true" data-styled-version="5.2.1">.fzynIk.fzynIk{box-sizing:border-box;margin:0;overflow:hidden;}/*!sc*/
.eiQuKK.eiQuKK{box-sizing:border-box;margin:0;margin-bottom:4px;}/*!sc*/
.chJVzi.chJVzi{box-sizing:border-box;margin:0;margin-left:8px;}/*!sc*/
.kCIqph.kCIqph{box-sizing:border-box;margin:0;padding-top:32px;padding-bottom:32px;border-top:1px solid;border-color:#E6E6E6;}/*!sc*/
.dIRCmF.dIRCmF{box-sizing:border-box;margin:0;display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-box-pack:justify;-webkit-justify-content:space-between;-ms-flex-pack:justify;justify-content:space-between;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;margin-bottom:16px;}/*!sc*/
.kmAORk.kmAORk{box-sizing:border-box;margin:0;margin-bottom:24px;}/*!sc*/
.bPERLb.bPERLb{box-sizing:border-box;margin:0;margin-bottom:-8px;}/*!sc*/
What should I do with this? Is not this achievable with beautfulsoup?
Developer Tools operate on a live browser DOM, what you’ll see when inspecting the page source is not the original HTML, but a modified one after applying some browser clean up and executing JavaScript code.
Requests is not executing JavaScript so content can deviate slightly, but you can scrape - Just take a deeper look into your soup.
Example (project titles)
from bs4 import BeautifulSoup
import requests
url_news = " https://www.houzz.com.au/professionals/home-builders/turrell-building-pty-ltd-pfvwau-pf~1099128087"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
response = requests.get(url_news, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
[title.text for title in soup.select('#projects h3')]
Output
['Major Renovation & Master Wing',
'"The Italian Village" Private Residence',
'Country Classic',
'Residential Resort',
'Resort Style Extension, Stone and Timber',
'Old Northern Rd Estate']

BeautifulSoup and MechanicalSoup won't read website

I am dealing with BeautifulSoup and also trying it with MechanicalSoup and I have got it to load with other websites, but when I request that the website be requested it takes a long time and then never really gets it. Any ideas would be super helpful.
Here is the BeautifulSoup code that I am writing:
import urllib3
from bs4 import BeautifulSoup as soup
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/?bb=hy89sjv-mN24znkgE'
http = urllib3.PoolManager()
r = http.request('GET', url)
Here is the Mechanicalsoup code:
import mechanicalsoup
browser = mechanicalsoup.Browser()
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
page = browser.get(url)
page
What I am trying to do is gather data on different cities and apartments, so the url will change to have be 2-bedrooms and then 3-bedrooms then it will move to a different city and do the same thing there, so I really need this part to work.
Any help would be appreciated.
You see the same thing if you use curl or wget to fetch the page. My guess is they are using browser detection to try to prevent people from stealing their copyrighted information, as you are attempting to do. You can search for the User-Agent header to see how to pretend to be another browser.
import urllib3
import requests
from bs4 import BeautifulSoup as soup
headers = requests.utils.default_headers()
headers.update({
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
})
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
r = requests.get(url, headers=headers)
rContent = soup(r.content, 'lxml')
rContent
Just as Tim said, I needed to add headers to my code to ensure that it was being read as not from a bot.

Error when trying to web scraping with urllib.reques

I am trying to get the html of the following web: https://betway.es/es/sports/cpn/tennis/230 in order to get the matches' names and the odds
with the code in python:
from bs4 import BeautifulSoup
import urllib.request
url = 'https://betway.es/es/sports/cpn/tennis/230'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
soup = str(soup)
But when I run the code it throws the next exception: HTTPError: HTTP Error 403: Forbidden
I have seen that maybe with headers could be possible, but I am completely new with this module so no idea how to use them. Any advice? In addition, although I am able to download the url, I cannot find the odds, anyone knows what can be a reason?
I'm unfortunately part of a country blocked by this site.
But, using the requests package:
import requests as rq
from bs4 import BeautifulSoup as bs
url = 'https://betway.es/es/sports/cpn/tennis/230'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"}
page = rq.get(url, headers=headers)
You can find your headers in F12 -> Networks -> random line -> Headers Tab
It is, as a result, a partial answer.

How to avoid 403 problem using BeautifulSoup and headers?

I am using the combination of request and beautifulsoup to develop a web-scraping program in python.
Unfortunately, I got 403 problem (even using header).
Here my code:
from bs4 import BeautifulSoup
from requests import get
headers_m = ({'User-Agent':
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
sapo_m = "https://www.idealista.it/vendita-case/milano-milano/"
response_m = get(sapo_m, headers=headers_m)
This is not general python question. The site blocks such straightforward attempts of scraping, you need to find a set of headers (specific for this site) that will pass validation.
Regards,
Simply use Chrome as User-Agent.
from bs4 import BeautifulSoup
BeautifulSoup(requests.get("https://...", headers={"User-Agent": "Chrome"}).content, 'html.parser')

scrape google resultstats with python [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 7 years ago.
Improve this question
I would like to get the estimated results number from google for a keyword. Im using Python3.3 and try to accomplish this task with BeautifulSoup and urllib.request. This is my simple code so far
def numResults():
try:
page_google = '''http://www.google.de/#output=search&sclient=psy-ab&q=pokerbonus&oq=pokerbonus&gs_l=hp.3..0i10l2j0i10i30l2.16503.18949.0.20819.10.9.0.1.1.0.413.2110.2-6j1j1.8.0....0...1c.1.19.psy-ab.FEBvxrgi0KU&pbx=1&bav=on.2,or.r_qf.&bvm=bv.48705608,d.Yms&'''
req_google = Request(page_google)
req_google.add_header('User Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
html_google = urlopen(req_google).read()
soup = BeautifulSoup(html_google)
scounttext = soup.find('div', id='resultStats')
except URLError as e:
print(e)
return scounttext
My problem is that my soup variable is somehow encoded and that i cant get any information out of it. So i get back a None because soup.find doesnt work.
What am i doing wrong and how can i extract the wanted resultstats?
Many thanks!
If you haven't solved this problem yet, it looks like the reason BeautifulSoup is failing to find anything is that the resultStats never appear in the soup - your Request(page_google) is only returning JavaScript, not any search results that the JavaScript is dynamically loading in. You can verify this by adding a
print(soup)
command to your code and you will see that the resultStats div doesn't appear.
The following code:
import sys
from urllib2 import Request, urlopen
import urllib
from bs4 import BeautifulSoup
query = 'pokerbonus'
url = "http://www.google.de/search?q=%s" % urllib.quote_plus(query)
req_google = Request(url)
req_google.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
html_google = urlopen(req_google).read()
soup = BeautifulSoup(html_google)
scounttext = soup.find('div', id='resultStats')
print(scounttext)
Will print
<div class="sd" id="resultStats">Ungefähr 1.060.000 Ergebnisse</div>
Lastly, using a tool like Selenium Webdriver might be a better way to go about solving this, as Google does not allow bots to scrape search results.

Categories