I'm attempting to post the data of a pop-out form to a local web site. To do this, I'm emulating the request headers, data, and cookie information provided by the site. (Note: I have redacted my email and password from the code, for obvious reasons, but all other code remains the same.)
I have tried multiple permutations of the cookie, headers, request, data, etc. Additionally, I have verified the cookie and the expected headers and data in a network inspector. I am able to set a cookie easily using requests' sample code. I cannot explain why my code won't work on a live site, and I'd be very grateful for any assistance. Please see the following code for further details.
import requests
import robobrowser
import json
br = robobrowser.RoboBrowser(user_agent="Windows Chrome",history=True)
url = "http://posting.cityweekly.net/gyrobase/API/Login/CookieV2"
data ={"passwordChallengeResponse":"....._SYGwbDLkSyU5gYKGg",
"email": "<email>%40bu.edu",
"ttl":"129600",
"sessionOnly": "1"
}
headers = {
"Origin": "http://posting.cityweekly.net",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.8,ru;q=0.6",
"User-Agent": "Windows Chrome", #"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36",
"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
"Referer": "http://posting.cityweekly.net/utah/Events/AddEvent",
"X-Requested-With": "XMLHttpRequest",
"Connection": "keep-alive",
"Cache-Control": "max-age=0",
"Host":"posting.cityweekly.net"
}
cookie = {"Cookie": "__utma=25975215.1299783561.1416894918.1416894918.1416897574.2; __utmc=25975215; __utmz=25975215.1416894918.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __qca=P0-2083194243-1416894918675; __gads=ID=e3b24038c9228b00:T=1416894918:S=ALNI_MY7ewizuxK0oISnqPJWlLDAeKFMmw; _cb_ls=1; _chartbeat2=D6vh2H_ZbNJDycc-t.1416894962025.1416897589974.1; __utmb=25975215.3.10.1416897574; __utmt=1"}
r = br.session.get(url, data=json.dumps(data), cookies=cookie, headers=headers)
print r.headers
print [item for item in r.cookies.__dict__.items()]
Note that I print the cookies object and that the cookies attribute (a dictionary) is empty.
You need to perform a POST to log in to the site. Once you do that, I believe the cookies will have the correct values (not 100% sure on that). This post clarifies how to properly set cookies.
Note: I don't think you need to do the additional import of requests unless you're using it outside of RoboBrowser.
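A minimal sketch of the POST login suggested above, under a few assumptions. The helper names `cookie_header_to_dict` and `login` are hypothetical, and `session` is assumed to behave like a `requests.Session` (which is what `br.session` is used as in the question). Two details may matter: the `cookies` argument expects name/value pairs rather than a single `"Cookie"` key holding the whole header, and a dict passed as `data` is form-encoded by the library, which matches the `application/x-www-form-urlencoded` Content-Type better than `json.dumps(data)`.

```python
def cookie_header_to_dict(header_value):
    """Split a raw 'Cookie:' header string into a {name: value} dict."""
    cookies = {}
    for pair in header_value.split(";"):
        # Only split on the first '=' so values containing '=' stay intact.
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


def login(session, url, form_data, cookies):
    """POST the login form; `session` is assumed to act like requests.Session."""
    # Passing a dict as `data` lets the HTTP library form-encode the body.
    response = session.post(url, data=form_data, cookies=cookies)
    # Cookies set by the server accumulate on the session, not only on
    # the individual response object.
    return response, session.cookies
```

With requests this might be called as `login(requests.Session(), url, data, cookie_header_to_dict(raw_cookie_string))`. Since the library percent-encodes form values itself, the email would probably need a literal `@` instead of `%40` to avoid double encoding.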
Related
I am sending a request to a URL. I copied the request as cURL and ran it through a curl-to-Python converter, so all the headers are included, but my request is not working: I receive status code 403 when printing it and error code 1020 in the HTML output. The code is:
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
# 'Accept-Encoding': 'gzip, deflate, br',
'DNT': '1',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
}
response = requests.get('https://v2.gcchmc.org/book-appointment/', headers=headers)
print(response.status_code)
print(response.cookies.get_dict())
with open("test.html",'w') as f:
    f.write(response.text)
I also get cookies, but I am not getting the desired response. I know I can do it with Selenium, but I want to know the reason behind this. Thanks in advance.
Note:
I have installed requests and all of its dependencies at the same versions as on my computer, and it still does not work and returns a 403 error.
The site is protected by Cloudflare, which aims to block, among other things, unauthorized data scraping. From "What is data scraping?":
The process of web scraping is fairly simple, though the implementation can be complex. Web scraping occurs in 3 steps:
First the piece of code used to pull the information, which we call a scraper bot, sends an HTTP GET request to a specific website.
When the website responds, the scraper parses the HTML document for a specific pattern of data.
Once the data is extracted, it is converted into whatever specific format the scraper bot’s author designed.
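As a rough illustration of those three steps, here is a small sketch using only the standard library. The names `LinkCollector`, `fetch`, `extract_links`, and `to_json` are made up for this example, and collecting link targets is just one arbitrary "pattern of data"; a real scraper would look for whatever it needs.

```python
import json
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Step 2 helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url):
    """Step 1: send an HTTP GET request and return the body as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_links(html):
    """Step 2: parse the HTML document for a specific pattern (links)."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def to_json(links):
    """Step 3: convert the extracted data into the desired format."""
    return json.dumps({"links": links})
```

Chained together this would be `to_json(extract_links(fetch(url)))`. A protection layer such as Cloudflare typically intervenes at step 1, before any parsing happens.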
You can use urllib instead of requests; it seems to be able to deal with Cloudflare:
import urllib.request

# Build the request and send only a minimal set of browser-like headers.
req = urllib.request.Request('https://v2.gcchmc.org/book-appointment/')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8')
req.add_header('Accept-Language', 'en-US,en;q=0.5')
# Fetch the page and decode the body as UTF-8.
r = urllib.request.urlopen(req).read().decode('utf-8')
with open("test.html", 'w', encoding="utf-8") as f:
    f.write(r)
It works on my machine, so I am not sure what the problem is.
However, when a request does not work, I often check whether it works with Playwright. Playwright uses a browser driver and thus mimics your actual browser when visiting the page. It can be installed using pip install playwright. When you run it for the first time, it may raise an error telling you to install the drivers; just follow the instructions to do so.
With playwright you can try the following:
from playwright.sync_api import sync_playwright
url = 'https://v2.gcchmc.org/book-appointment/'
ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/69.0.3497.100 Safari/537.36"
)
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page(user_agent=ua)
    page.goto(url)
    page.wait_for_timeout(1000)
    html = page.content()
print(html)
A downside of Playwright is that it requires installing Chromium (or another browser). This may complicate deployment, as the browser cannot simply be added to requirements.txt and a container image may be required.
Try running Burp Suite's Proxy to see all the headers and other data like cookies. Then you could mimic the request with the Python module. That's what I always do.
Good luck!
Had the same problem recently.
Using the javascript fetch-api with Selenium-Profiles worked for me.
example js:
fetch('http://example.com/movies.json')
  .then((response) => response.json())
  .then((data) => console.log(data));
Example Python with Selenium-Profiles:
headers = {
"accept": "application/json",
"accept-encoding": "gzip, deflate, br",
"accept-language": profile["cdp"]["useragent"]["acceptLanguage"],
"content-type": "application/json",
# "cookie": cookie_str, # optional
"sec-ch-ua": "'Google Chrome';v='107', 'Chromium';v='107', 'Not=A?Brand';v='24'",
"sec-ch-ua-mobile": "?0", # "?1" for mobile
"sec-ch-ua-platform": "'" + profile['cdp']['useragent']['userAgentMetadata']['platform'] + "'",
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"user-agent": profile['cdp']['useragent']['userAgent']
}
answer = driver.requests.fetch("https://www.example.com/",
options={
"body": json.dumps(post_data),
"headers": headers,
"method":"POST",
"mode":"same-origin"
})
I don't know why this occurs, but I assume Cloudflare and others are able to detect whether a request is made with JavaScript.
I am trying to scrape ETFs from the website https://www.etf.com/channels. However, no matter what I try, it returns a 503 error when I try to access it. I've tried using different user agents as well as headers, but it still won't let me access it. Sometimes when I access the website in a browser, a page pops up that "checks if the connection is secure", so I assume they have things in place to stop scraping. I've seen others ask the same question, and the answer always says to add a user agent, but that didn't work for this site.
Scrapy
import scrapy

class BrandETFs(scrapy.Spider):
    name = "etfs"
    start_urls = ['https://www.etf.com/channels']
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.5",
        "Connection": "keep-alive",
        "Host": "www.etf.com",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "cross-site",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0"
    }
    custom_settings = {'DOWNLOAD_DELAY': 0.3, "CONCURRENT_REQUESTS": 4}

    def start_requests(self):
        url = self.start_urls[0]
        yield scrapy.Request(url=url)

    def parse(self, response):
        test = response.css('div.discovery-slat')
        yield {
            "test": test
        }
Requests
import requests
url = 'https://www.etf.com/channels'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
'Referer': 'https://google.com',
'Origin': 'https://www.etf.com'
}
r = requests.post(url, headers=headers)
r.raise_for_status()
Is there any way to get around these blocks and access the website?
Status 503 (Service Unavailable) is often seen in such cases. You are probably right in assuming that they have taken measures against scraping.
For the sake of completeness, they prohibit what you are attempting in their Terms of Service (No. 7g):
[...] You agree that you will not [...]
Use automated means, including spiders, robots, crawlers [...]
Technical point of view
The User-Agent in the header is just one of many things that you should consider when you try to hide the fact that you automated the requests you are sending.
Since you see a page that seems to verify that you are still/again a human, it is likely that they have figured out what is going on and have an eye on your IP. It might not be blacklisted (yet) because they notice changes whenever you try to access the page.
How did they find out? Based on your question and code, I guess it's just your IP, which did not change, in combination with:
Request rate: You have sent (too many) requests too quickly, i.e. faster than they consider a human would.
Periodic requests: Static delays between requests, so they see pretty regular timing on their side.
There are several other aspects that might or might not be monitored. However, using proxies (i.e. changing IP addresses) would be a step in the right direction.
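To make the two timing points concrete, here is a small sketch of spacing requests out with irregular pauses instead of a fixed delay. The helper name `jittered_delays` and the default values for `base` and `jitter` are assumptions picked for illustration; nothing is known about what timing this particular site tolerates.

```python
import random


def jittered_delays(count, base=2.0, jitter=1.5, rng=None):
    """Return `count` pauses in seconds, each between base and base + jitter.

    A fixed delay produces perfectly regular timing; adding a random
    component makes the gaps between requests uneven.
    """
    rng = rng or random.Random()
    return [base + rng.uniform(0.0, jitter) for _ in range(count)]
```

In a loop this could be used as `for url, pause in zip(urls, jittered_delays(len(urls))): time.sleep(pause)` before each request. In Scrapy, the `RANDOMIZE_DOWNLOAD_DELAY` setting already varies `DOWNLOAD_DELAY` in a similar way, so raising `DOWNLOAD_DELAY` itself may be the simpler change there.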
I'm trying to get the data from https://www.ecfr.gov/cgi-bin/ECFR?page=browse using the requests module in Python. Somehow I'm getting HTTP 403 Forbidden.
header = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9",
"Cache-Control": "max-age=0",
"Host": "httpbin.org",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-5ef3288f-10e678d0e55c0670c0807730"}
r = requests.get(url , headers= header)
I have also sent the request with a user agent and all the parameters from the headers info (which I see in the developer tools).
I have tried using free proxies, rotating user-agent headers, cookies, and everything else I can get my hands on. But somehow the website can still tell that these are not genuine browser requests.
In the HTML response, I see that the website is asking me to complete a captcha.
Is there any way I can skip that?
Inspecting the HTTP responses, I found traces of a Cloudflare server.
Cloudflare, with its ScrapeShield feature, is well known for its scrape protection and security levels.
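A minimal sketch of how such a trace could be checked in code. It assumes the usual markers Cloudflare adds to responses, a `Server: cloudflare` header and a `CF-RAY` header; the function name `looks_like_cloudflare` is hypothetical, and treating 403 and 503 as the "blocked" statuses is a simplification.

```python
def looks_like_cloudflare(status_code, headers):
    """Guess whether a blocked response came from Cloudflare.

    `headers` is any mapping of header names to values; names are
    compared case-insensitively.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    served_by_cf = (
        normalized.get("server", "").lower() == "cloudflare"
        or "cf-ray" in normalized
    )
    # 403 and 503 are the statuses that typically accompany a challenge page.
    blocked = status_code in (403, 503)
    return served_by_cf and blocked
```

With requests this could be called as `looks_like_cloudflare(r.status_code, r.headers)`. A `True` result only suggests that the block comes from the protection layer rather than from the site's own application.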
Is there any way I can skip that?
There are 2 ways out:
Plug in a captcha-solving service. That is not easy if you rely on pure Python code alone.
Leverage browser automation, making ScrapeShield think that a real user is browsing the website. It takes considerably more resources and time (including development time). See a scrape speed comparison table of Chromium headless instance automation vs bare HTTP requests.
I am trying to log in to a site called grailed.com and follow a certain product. The code below is what I have tried.
The code below succeeds in logging in with my credentials. However, whenever I try to follow a product (the id in the payload is the id of the product), the code runs without any errors but fails to follow the product. I am confused by this behavior. Is it a similar case to Instagram, which blocks any attempt to interact programmatically with its site and forces you to use its API? (grailed.com does not have a public API, AFAIK.)
I tried the following code (which looks exactly like the POST request sent when you follow a product on the site).
# headers/data defined here
r = requests.Session()
v = r.post("https://www.grailed.com/api/sign_in", json=data,headers = headers)
headers = {
'authority': 'www.grailed.com',
'method': 'POST',
"path": "/api/follows",
'scheme': 'https',
'accept': 'application/json',
'accept-encoding': 'gzip, deflate, br',
"content-type": "application/json",
"x-amplitude-id": "1547853919085",
"x-api-version": "application/grailed.api.v1",
"x-csrf-token": "9ph4VotTqyOBQzcUt8c3C5tJrFV7VlT9U5XrXdbt9/8G8I14mGllOMNGqGNYlkES/Z8OLfffIEJeRv9qydISIw==",
"origin": "https://www.grailed.com",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
payload = {
"id": "7917017"
}
b = r.post("https://www.grailed.com/api/follows",json = payload,headers = headers)
If the API is not designed to be public, you are most likely missing a CSRF token in your follow headers.
You have to find a CSRF token and add it to the /api/follows POST.
From a quick look at the code, this might be hard, as everything happens inside JavaScript.
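One place a token sometimes lives is a `<meta name="csrf-token" content="...">` tag in the page HTML (a common convention in some web frameworks). Whether grailed.com exposes it that way is an assumption, not something confirmed here; if the token is only produced inside JavaScript, this sketch will find nothing. The names `CsrfMetaParser` and `find_csrf_token` are hypothetical.

```python
from html.parser import HTMLParser


class CsrfMetaParser(HTMLParser):
    """Remembers the content of a <meta name="csrf-token"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and attr_map.get("name") == "csrf-token":
            self.token = attr_map.get("content")


def find_csrf_token(html):
    """Return the CSRF token found in `html`, or None if there is none."""
    parser = CsrfMetaParser()
    parser.feed(html)
    return parser.token
```

The idea would be to load a page with the same logged-in session, call `find_csrf_token(page.text)`, and send the result in the `x-csrf-token` header of the follow request instead of a value copied from an earlier browser session.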
https://open.spotify.com/search/results/cheval is the link that triggers various intermediary requests, one being the attempted request below.
When running the following request in Postman (Chrome plugin), response cookies (13) are shown but do not seem to exist when running this request in Python (response.cookies is empty). I have also tried using a session, but with the same result.
Update: although these cookies were retrieved after using Selenium (to log in/solve the captcha and transfer the login cookies to the session used for the following request), it is still unknown which variable(s) are required for the target cookies to be returned with that request.
How can those response cookies be retrieved (if at all) with Python?
url = "https://api.spotify.com/v1/search"
querystring = {"type":"album,artist,playlist,track","q":"cheval*","decorate_restrictions":"true","best_match":"true","limit":"50","anonymous":"false","market":"from_token"}
headers = {
'access-control-request-method': "GET",
'origin': "https://open.spotify.com",
'x-devtools-emulate-network-conditions-client-id': "0959BC056CD6303CAEC3E2E5D7796B72",
'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
'access-control-request-headers': "authorization",
'accept': "*/*",
'accept-encoding': "gzip, deflate, br",
'accept-language': "en-US,en;q=0.9",
'cache-control': "no-cache",
'postman-token': "253b0e50-7ef1-759a-f7f4-b09ede65e462"
}
response = requests.request("OPTIONS", url, headers=headers, params=querystring)
print(response.text)