Please look over my code. What am I doing wrong that makes my request's response come back empty? Any pointers?
URL in question (it should generate a results page in the browser):
https://www.ucr.gov/enforcement/343121222
But I cannot replicate it with Python requests. Why?
import requests

headers = {
    'Host': 'www.ucr.gov',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:105.0) Gecko/20100101 Firefox/105.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive'
}
data = {
    'scheme': 'https',
    'host': 'www.ucr.gov',
    'filename': '/enforcement/3431212'
}
url = "https://www.ucr.gov/enforcement/3431212"
result = requests.get(url, params=data, headers=headers)
print(result.status_code)
print(result.text)
The page at the link you provided is fully rendered on the client side using JavaScript. This means that you won't be able to obtain the same response using a simple HTTP request.
In this case a common solution is headless scraping: automating a headless browser so that the site's content is rendered exactly as it would be for a regular client. In Python this can be done with several libraries, including Selenium and Playwright (Pyppeteer is a Python port of the Node.js library Puppeteer).
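For illustration, a minimal headless-Selenium sketch of that idea (assuming Selenium 4+ and a matching ChromeDriver are installed; the fixed sleep is a crude placeholder for a proper wait):
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # use "--headless" on older Chrome versions

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.ucr.gov/enforcement/343121222")
    time.sleep(5)  # crude wait for the client-side JavaScript to render;
                   # an explicit WebDriverWait on a known element is more robust
    print(driver.page_source)  # should now contain the rendered results page
finally:
    driver.quit()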
Related
I am sending a request to a URL. I copied the request as cURL and converted it with a curl-to-Python tool, so all the headers are included, but my request is not working: I receive status code 403 when printing the status and error code 1020 in the HTML output. The code is:
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    # 'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
}

response = requests.get('https://v2.gcchmc.org/book-appointment/', headers=headers)

print(response.status_code)
print(response.cookies.get_dict())
with open("test.html", 'w') as f:
    f.write(response.text)
I do get cookies, but not the desired response. I know I can do it with Selenium, but I want to understand the reason behind this. Thanks in advance.
Note:
I have installed all the libraries requests depends on, at the same versions as on my computer, and it still does not work and throws a 403 error.
The site is protected by Cloudflare, which aims to block, among other things, unauthorized data scraping. From Cloudflare's article "What is data scraping?":
The process of web scraping is fairly simple, though the implementation can be complex. Web scraping occurs in 3 steps:
First the piece of code used to pull the information, which we call a scraper bot, sends an HTTP GET request to a specific website.
When the website responds, the scraper parses the HTML document for a specific pattern of data.
Once the data is extracted, it is converted into whatever specific format the scraper bot's author designed.
You can use urllib instead of requests; it seems to be able to deal with Cloudflare here:
import urllib.request

req = urllib.request.Request('https://v2.gcchmc.org/book-appointment/')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8')
req.add_header('Accept-Language', 'en-US,en;q=0.5')

r = urllib.request.urlopen(req).read().decode('utf-8')
with open("test.html", 'w', encoding="utf-8") as f:
    f.write(r)
It works on my machine, so I am not sure what the problem is.
However, when a request does not work, I often check whether it works using Playwright. Playwright drives a real browser and thus mimics your actual browser when visiting the page. It can be installed with pip install playwright. The first time you run it, it may raise an error telling you to install the browser drivers; just follow the instructions to do so.
With Playwright you can try the following:
from playwright.sync_api import sync_playwright

url = 'https://v2.gcchmc.org/book-appointment/'
ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/69.0.3497.100 Safari/537.36"
)

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page(user_agent=ua)
    page.goto(url)
    page.wait_for_timeout(1000)
    html = page.content()
    print(html)
A downside of Playwright is that it requires installing the Chromium (or another) browser. This may complicate deployment, since the browser cannot simply be added to requirements.txt and a container image is required.
Try running Burp Suite's Proxy to see all the headers and other data such as cookies. Then you can mimic the request from your Python code. That's what I always do.
Good luck!
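For illustration, a minimal sketch of replaying a request captured in the proxy with requests; all header and cookie values below are placeholders you would copy from the intercepted request:
import requests

# Placeholder values: copy the real ones from the request intercepted in Burp's Proxy.
headers = {
    "User-Agent": "Mozilla/5.0 ...",                       # hypothetical, copied from the capture
    "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
    "Referer": "https://example.com/",
}
cookies = {
    "session": "value-copied-from-proxy",                  # hypothetical cookie name/value
}

response = requests.get("https://example.com/target-page", headers=headers, cookies=cookies)
print(response.status_code)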
Had the same problem recently.
Using the JavaScript fetch API with Selenium-Profiles worked for me.
Example JavaScript:
fetch('http://example.com/movies.json')
    .then((response) => response.json())
    .then((data) => console.log(data));
Example Python with Selenium-Profiles:
import json

# "profile", "driver" and "post_data" come from the Selenium-Profiles setup (not shown here).
headers = {
    "accept": "application/json",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": profile["cdp"]["useragent"]["acceptLanguage"],
    "content-type": "application/json",
    # "cookie": cookie_str,  # optional
    "sec-ch-ua": "'Google Chrome';v='107', 'Chromium';v='107', 'Not=A?Brand';v='24'",
    "sec-ch-ua-mobile": "?0",  # "?1" for mobile
    "sec-ch-ua-platform": "'" + profile['cdp']['useragent']['userAgentMetadata']['platform'] + "'",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "user-agent": profile['cdp']['useragent']['userAgent']
}

answer = driver.requests.fetch("https://www.example.com/",
                               options={
                                   "body": json.dumps(post_data),
                                   "headers": headers,
                                   "method": "POST",
                                   "mode": "same-origin"
                               })
I don't know why this occurs, but I assume Cloudflare and others are able to detect whether a request is made with JavaScript from a real browser.
My Problem:
I want to scrape the following website: https://www.coches.net/segunda-mano/.
But every time I open it with Python Selenium, I get a message saying that they detected me as a bot.
How can I bypass this detection?
First I tried simple code with Selenium:
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Chrome('C:/Python38/chromedriver.exe')
URL = 'https://www.coches.net/segunda-mano/'
browser.get(URL)
Then I tried it with requests, but it doesn't work either.
from selenium import webdriver
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import requests

ua = UserAgent()
headers = {"User-Agent": ua.random}
URL = 'https://www.coches.net/segunda-mano/'
r = requests.get(URL, headers=headers)
print(r.status_code)
In this case I get 403, the status code stating that access to the URL is forbidden.
I don't know how to access this web page without getting blocked. I would be very grateful for your help. Thanks in advance.
Selenium is fairly easily detected, especially by all major anti-bot providers (Cloudflare, Akamai, etc).
Why?
Selenium, and most other major webdrivers, set a browser variable (that websites can read) called navigator.webdriver to true. You can check this yourself by opening your Google Chrome console and running console.log(navigator.webdriver). In a normal browser it will be false.
The User-Agent: every device has a "user agent" string identifying the client accessing the website. A headless Selenium User-Agent looks something like this: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/59.0.3071.115 Safari/537.36. Did you catch that? HeadlessChrome is included, which is another route of detection.
These are just two of the many ways a Selenium browser can be detected; I would highly recommend reading up on this and this as well.
And lastly, if you want an easy, drop-in solution that implements almost all of the concepts we've talked about, I'd suggest using undetected-chromedriver. This is an open source project that tries its best to keep your Selenium chromedriver looking human.
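A minimal sketch of that approach (assuming pip install undetected-chromedriver; the URL is the one from the question, and options may need tuning for your setup):
import undetected_chromedriver as uc

driver = uc.Chrome()  # patched ChromeDriver that hides common automation fingerprints
try:
    driver.get("https://www.coches.net/segunda-mano/")
    print(driver.page_source[:500])  # preview the rendered HTML
finally:
    driver.quit()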
I think your problem is not bot detection. You can't get the results from that page with plain requests because the page loads them via XHR requests behind the scenes. Normally you would use Selenium, Splash, etc., but that does not seem to be an option in this case.
However, if you dig into the page a bit you can find which URL is requested behind the scenes to display the results. I did that research and found this endpoint (https://ms-mt--api-web.spain.advgo.net/search); it returns JSON, which will ease your work in terms of parsing. Using Chrome dev tools I copied the request as cURL, mapped it to Python requests, and obtained this code:
import json
import requests

headers = {
    'authority': 'ms-mt--api-web.spain.advgo.net',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'accept': 'application/json, text/plain, */*',
    'x-adevinta-channel': 'web-desktop',
    'x-schibsted-tenant': 'coches',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'content-type': 'application/json;charset=UTF-8',
    'origin': 'https://www.coches.net',
    'sec-fetch-site': 'cross-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.coches.net/',
    'accept-language': 'en-US,en;q=0.9,es;q=0.8',
}

data = '{"pagination":{"page":1,"size":30},"sort":{"order":"desc","term":"relevance"},"filters":{"categories":{"category1Ids":[2500]},"offerTypeIds":[0,2,3,4,5],"isFinanced":false,"price":{"from":null,"to":null},"year":{"from":null,"to":null},"km":{"from":null,"to":null},"provinceIds":[],"fuelTypeIds":[],"bodyTypeIds":[],"doors":[],"seats":[],"transmissionTypeId":0,"hp":{"from":null,"to":null},"sellerTypeId":0,"hasWarranty":null,"isCertified":false,"luggageCapacity":{"from":null,"to":null},"contractId":0}}'

while True:
    response = requests.post('https://ms-mt--api-web.spain.advgo.net/search', headers=headers, data=data).json()
    # you should parse items here.
    print(response)
    if not response["items"]:
        break
    data_dict = json.loads(data)
    data_dict["pagination"]["page"] = data_dict["pagination"]["page"] + 1  # get the next page.
    data = json.dumps(data_dict)
Probably a lot of the headers and body fields are unnecessary; you can experiment (code and test) to trim them down.
Rotating proxies can be useful when scraping large amounts of data.
options = Options()
options.add_argument('--proxy-server=#ip:#port')
Then initialize the Chrome driver with the options object, as in the sketch below.
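A slightly fuller sketch of that idea (the proxy addresses are placeholders; each run picks one from a hypothetical pool):
import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Hypothetical proxy pool; replace with your own proxies.
proxies = ["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"]

options = Options()
options.add_argument("--proxy-server=%s" % random.choice(proxies))

driver = webdriver.Chrome(options=options)
driver.get("https://www.coches.net/segunda-mano/")
print(driver.title)
driver.quit()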
I'm making a GET request to fetch JSON. It works absolutely fine from any browser on any device, but not via Python requests:
import requests

url = 'https://angel.co/autocomplete/new_tags'
params = {'query': 'sci', 'tag_type': 'MarketTag'}
resp = requests.get(url, params=params)
resp.raise_for_status()
gives HTTPError: 403 Client Error: Forbidden for url: https://angel.co/autocomplete/new_tags?query=ab&tag_type=MarketTag
So I tried:
Python requests. 403 Forbidden - I not only tried using a User-Agent in the headers but also all the other headers that I found in the Request Headers section in Firefox for the JSON response, but still 403!
Python requests - 403 forbidden - despite setting `User-Agent` headers - by making the request through a Session object, I still get 403!
What can be the possible cause? Is there something else I could try using?
EDIT: Request headers (from the headers section of the JSON response in Firefox) that I used in the headers attribute:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 'Accept-Encoding': 'gzip, deflate, br',
 'Accept-Language': 'en-US,en;q=0.5',
 'Connection': 'keep-alive',
 'Host': 'angel.co',
 'If-None-Match': 'W/"5857a9eac987138be074e7bdd4537df8"',
 'TE': 'Trailers',
 'Upgrade-Insecure-Requests': '1',
 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0'}
If a GET request returns 403 Forbidden even after adding a user-agent to the headers, you may need to add more headers, like this:
headers = {
    'user-agent': "Mozilla/5.0 ...",
    'accept': 'text/html,application...',
    'referer': 'https://...',
}
r = requests.get(url, headers=headers)
In Chrome, request headers can be found under Network > Headers > Request Headers in the Developer Tools (press F12 to open them).
I assume the website detects when a request isn't sent from a browser (i.e., not made with JavaScript).
I had a similar issue recently, and this answer worked for me.
I'm trying to check whether a given @hotmail.com address is taken.
However, I'm not getting the response I would have gotten using Chrome developer tools.
#!/usr/bin/python
import urllib
import urllib2
import requests

cookies = {
    'MC0': '1449950274804',
    'mkt': 'en-US',
    'MSFPC': 'ID=a9b016cd39838248bbf321ea5ad1ecae&CS=1&LV=201512&V=1',
    'wlv': 'A|ekIL-d:s*cAHzDg.2+1+0+3',
    'HIC': '7c5d20284ecdbbaa||0|||',
    'wlxS': 'wpc=1&WebIM=1',
    'RVC': 'm=1&v=17.5.9510.1001&t=12/12/2015 20:37:45',
    'amcanary': '0',
    'CkTst': 'MX1449957709484',
    'LDH': '9',
    'wla42': 'KjEsN0M1RDIwMjg0RUNEQkJBQSwsLDAsLTEsLTE=',
    'LN': 'u9GMx1450021043143',
}

headers = {
    'Origin': 'https://signup.live.com',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8,ja;q=0.6',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36',
    'canary': 'aeIntzIq6OCS9qOE2KKP2G6Q7yCCPLAQVPIw0oy2Vksln3bbwVR9I8DcpfzC9RiCnNiJBw4YxtWsqJfnx0PeR9ovjRG+bF1jKkyPVWUTyuDTO5UkwRNNJFTIdeaClMgHtATSy+gI99ojsAKwuRFBMNbOgCwZIMCRCmky/voftX/63gjTqC9V5Ry/bECc2P66ouDZNC7TA/KN6tfsmszelEoSrmvU7LAKDoZnkhRQjpn6WYGxUzr5S+UYXExa32AY:1:3c',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Accept': 'application/json',
    'Referer': 'https://signup.live.com/signup?wa=wsignin1.0&rpsnv=12&ct=1450038320&rver=6.4.6456.0&wp=MBI_SSL_SHARED&wreply=https',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive',
}

data = {"signInName": "testfoobar1234@outlook.com", "uaid": "f1d115020fc94af6ba17e722277cdcb8", "performDisambigCheck": "true", "includeSuggestions": "true", "uiflvr": "1001", "scid": "100118", "hpgid": "200407"}

asdf = requests.post('https://signup.live.com/API/CheckAvailableSigninNames?wa=wsignin1.0&rpsnv=12&ct=1450038320&rver=6.4.6456.0&wp=MBI_SSL_SHARED&wreply=https', headers=headers, cookies=cookies, data=data)
print(asdf.json())
This is what Chrome gives me when checking testfoobar1234@hotmail.com, versus what my script gives me (the two responses differ; screenshots omitted).
If you want to connect to login.live.com from a Python script on your local machine, using the right credentials but cookies taken from your Chrome session -- it will not work.
It depends on what you want to do: read emails, send email, or just get contacts from the address book -- the script's logic will be different for each. For example, mail is available via the outlook.com system, while contacts live at people.live.com (and an API, as far as I remember).
If you want to emulate a login the way Chrome does, you need to:
Get and collect all cookies from the outlook.com main page, not forgetting all the redirects :) -- via your Python script.
Send a request with the collected cookies and credentials to login.live.com (outlook will redirect to it).
But from my experience, the latest Outlook version (both the regular and Outlook Preview systems) detects a suspicious login attempt about 90% of the time and sends you to a login-confirmation page (code or email). So you will end up with an unstable solution. Do you really want to do that?
If you just want to parse the JSON correctly, you need:
import json
data = json.loads(asdf.text)
print(data)
If you want to see how many actions the browser performs, just install Firebug, disable clearing of the "Network" panel, and watch how many requests are processed before you are logged in to your account.
But to see all the traffic, I suggest using Firefox + Firebug + Tamper Data.
Also, I think it will be quicker to use existing libraries like Selenium for browser emulation; a rough sketch follows.
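A rough, heavily simplified sketch of that idea (the element IDs below are hypothetical placeholders, not verified against the live signup page; inspect the page to find the real locators):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://signup.live.com/signup")
    driver.implicitly_wait(10)
    # "MemberName" and "iSignupAction" are placeholder IDs, not verified against the live page.
    driver.find_element(By.ID, "MemberName").send_keys("testfoobar1234@hotmail.com")
    driver.find_element(By.ID, "iSignupAction").click()
    # The availability message would then be read from the rendered page.
    print(driver.page_source[:500])
finally:
    driver.quit()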
I am trying to use httplib2 to log in to a web page. I am able to log in to the page by simply opening the following URL in a Chrome incognito window:
https://domain.com/auth?name=USERNAME&pw=PASSWORD
I tried the following code to emulate this login with httplib2:
from httplib2 import Http
h = Http(disable_ssl_certificate_validation=True)
resp, content = h.request('https://domain.com/auth?name=USERNAME&pw=PASSWORD')
Unfortunately, this request does not lead to a successful login.
I tried changing the request headers to match those provided by Chrome:
headers = {
    'Host': 'domain.com',
    'Connection': 'keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8'
}
resp, content = h.request('https://domain.com/auth?name=USERNAME&pw=PASSWORD', 'GET', headers=headers)
This changes the response slightly, but still does not lead to a successful login.
I tried inspecting the actual network traffic with Wireshark but since it's HTTPS and thus encrypted, I can't see the actual traffic.
Does anybody know what the difference in requests between Chrome and httplib2 could be? Maybe httplib2 changes some of my headers?
Following Games Brainiac's comment, I ended up simply using Python Requests instead of httplib2. The following requests code works out of the box:
import requests
session = requests.Session()
response = session.get('https://domain.com/auth?name=USERNAME&pw=PASSWORD')
Further requests with the same username/password can simply be performed on the Session object:
...
next_response = session.get('https://domain.com/someOtherPage')