I am trying to log into a website. When I look at print(g.text), I am not getting back the web page I expect; instead I get a Cloudflare page that says 'Checking your browser before accessing'.
import requests
import time

s = requests.Session()
# Visit the site first so the session picks up any cookies it sets
s.get('https://www.off---white.com/en/GB/')

headers = {'Referer': 'https://www.off---white.com/en/GB/login'}
payload = {
    'utf8': '✓',
    'authenticity_token': '',  # left empty; the login form normally supplies this token
    'spree_user[email]': 'EMAIL@gmail.com',
    'spree_user[password]': 'PASSWORD',
    'spree_user[remember_me]': '0',
    'commit': 'Login'
}

r = s.post('https://www.off---white.com/en/GB/login', data=payload, headers=headers)
print(r.status_code)

g = s.get('https://www.off---white.com/en/GB/account')
print(g.status_code)
print(g.text)
Why is this occurring when I have set up the session?
You might want to try this:
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper() # CloudScraper inherits from requests.Session
print(scraper.get("http://somesite.com").text)  # => "<!DOCTYPE html><html><head>..."
It does not require a Node.js dependency.
All credit goes to this PyPI page.
This happens because the page uses Cloudflare's anti-bot page (also known as IUAM, "I'm Under Attack Mode").
This check is quite difficult to bypass on your own, since Cloudflare changes its techniques periodically. Currently, it checks whether the client supports JavaScript, which can be spoofed.
I would recommend using the cfscrape module for bypassing this. To install it, use pip install cfscrape. You'll also need to install Node.js.
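For reference, a minimal sketch of basic cfscrape usage, following its documentation:
import cfscrape

# create_scraper() returns a requests.Session-like object that solves
# Cloudflare's JavaScript challenge transparently
scraper = cfscrape.create_scraper()
print(scraper.get("http://somesite.com").text)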
You can also pass an existing requests session into create_scraper() like so:
import cfscrape
import requests

session = requests.Session()
session.headers = ...  # set any custom headers here
scraper = cfscrape.create_scraper(sess=session)
I had the same problem because they implemented Cloudflare on the API; I solved it this way:
import cloudscraper
import json

scraper = cloudscraper.create_scraper()
r = scraper.get("MY API").text  # "MY API" is the Cloudflare-protected endpoint URL
y = json.loads(r)
print(y)
You can scrape any Cloudflare-protected page using this tool. Node.js is mandatory for the code to work correctly.
Download Node from this link https://nodejs.org/en/
import cfscrape  # pip install cfscrape

scraper = cfscrape.create_scraper()
res = scraper.get("https://www.example.com").text
print(res)
curl and httpx avoid this problem. But how?
I found that they work with HTTP/2 by default, while the requests library uses only HTTP/1.1.
So, for testing, I installed httpx with the h2 Python library (to support HTTP/2 requests), and it works if I do: httpx --http2 'https://some.url'.
So, the solution is to use a library that supports HTTP/2, for example httpx with h2.
It's not a complete solution, since it won't help to solve Cloudflare's anti-bot ("I'm Under Attack Mode", or IUAM) challenge.
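A minimal sketch of the same idea in Python, assuming httpx is installed with its HTTP/2 extra (pip install httpx[http2]):
import httpx

# http2=True negotiates HTTP/2 when the server supports it (requires the h2 package)
client = httpx.Client(http2=True)
response = client.get("https://some.url")
print(response.http_version)  # e.g. "HTTP/2"
print(response.status_code)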
I am trying to use proxies for my web scraping project, which I built with HTTPX.
However, when I set up my proxies I still got blocked, so I tested whether they actually work/get used. I bought some proxies from a professional seller, so they should work just fine.
I found a website which returns the IP from which I am making the request.
I tried to test the use of proxies like this:
import httpx
import requests

# Username:PW:Hostname
proxies = {"http://": "http://username:pw.io:hostname"}

# response = requests.get('http://ipinfo.io/json', proxies=proxies)
response = httpx.get('http://ipinfo.io/json', proxies=proxies)
print(response.text)
Both requests and httpx don't work for me, as the response always returns my real IP. How do I need to set up my proxies? Keep in mind that I actually want to use HTTPX and only used requests for debugging as well.
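For reference, the conventional proxy URL format is scheme://username:password@host:port, which differs from the mapping above. A minimal sketch with hypothetical credentials (note that newer httpx versions take proxy= or mounts= instead of proxies=):
import httpx

# hypothetical credentials and host; substitute your provider's values
proxies = {
    "http://": "http://username:password@proxyhost:8080",
    "https://": "http://username:password@proxyhost:8080",
}
response = httpx.get('http://ipinfo.io/json', proxies=proxies)
print(response.text)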
I want to make a GET request to a TikTok URL via Python, but it does not work.
Let's say we have a TikTok link from a mobile app – https://vt.tiktok.com/ZS81uRSRR/ – and I want to get its video_id, which is available in the canonical link. This is the canonical link for the provided TikTok: https://www.tiktok.com/@notorious_foodie/video/7169643841316834566?_r=1&_t=8XdwIuoJjkX&is_from_webapp=v1&item_id=7169643841316834566
video_id comes after /video/, for example in the link above video_id would be 7169643841316834566
When I open a mobile link in a browser on my laptop, it redirects me to the canonical link. I wanted to achieve the same behavior via code and managed to do it like so:
import requests

def get_canonical_url(url):
    return requests.get(url, timeout=5).url
It was working for a while but then started raising timeout errors every time. I managed to fix it by providing cookies: I made the request in Postman (where the GET request works), copied the cookies, and modified my function to accept them, and it started working again. It kept working with cookies for ~6 months, but last week it stopped again. I thought the cookies might have expired, but updating them didn't help.
This is the error I keep getting:
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.tiktok.com', port=443): Read timed out. (read timeout=5)
The weirdest thing is that I can make my desired request just fine via curl or via Postman.
Recap
So the problem is that my Python GET request never succeeds and I can't understand why. I tried using a VPN in case TikTok had banned my IP, and I also ran the request from some of my servers to try different locations, but none of my attempts worked.
Could you give me advice on how to debug this issue further, or any other ideas for getting the video_id out of a mobile TikTok link?
Method 1 - Using subprocess
Execute the curl command and capture its output; this takes ~0.5 seconds.
import subprocess
import re

# Run curl and capture the HTML of the redirect page
process_detail = subprocess.Popen(["curl", "https://vt.tiktok.com/ZS81uRSRR/"], stdout=subprocess.PIPE)
output = process_detail.communicate()[0].decode()
process_detail.kill()

# Extract the URL up to the first "?" (the canonical link without query parameters)
canonical_link = re.search(r"(?P<url>https?://[^\s]+)+\?", output).group("url")
print("Canonical link: ", canonical_link)
Method 2 - Using proxies
We need to use proxies. Here is a solution using free proxies, which we can scrape and apply dynamically with BeautifulSoup.
First install BeautifulSoup using pip install beautifulsoup4.
Solution:
from bs4 import BeautifulSoup
import requests

def scrap_now(url):
    print(f"<======================> Scraping Started <======================>")
    print(f"<======================> Getting proxy <======================>")
    source = requests.get('https://free-proxy-list.net/').text
    soup = BeautifulSoup(source, "html.parser")
    ips_container = soup.findAll("table", {"class": "table table-striped table-bordered"})
    ip_trs = ips_container[0].findAll('tr')
    for i in ip_trs[1:]:
        # The first two cells of each row hold the IP and the port
        proxy_ip = i.findAll('td')[0].text + ":" + i.findAll('td')[1].text
        try:
            proxy = {"https": proxy_ip}
            print(f"<======================> Trying with: {proxy_ip}<======================>")
            headers = {'User-Agent': 'Mozilla/5.0'}
            resp = requests.get(url, headers=headers, proxies=proxy, timeout=5)
            if resp.status_code == requests.codes.ok:
                print(f"<======================> Got Success with: {proxy_ip}<======================>")
                return resp.url
        except Exception as e:
            print(e)
            continue
    return ""

canonical_link = scrap_now("https://vt.tiktok.com/ZS81uRSRR/")
print("Canonical link: ", canonical_link)
Method - 3: Using Selenium
We can do this with Selenium as well. It will take almost 5 seconds.
First, install Selenium using pip install selenium==3.141.0 and webdriver-manager using pip install webdriver-manager, then execute the lines below:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.media_stream_mic": 1,
    "profile.default_content_setting_values.media_stream_camera": 1,
    "profile.default_content_setting_values.geolocation": 1,
    "profile.default_content_setting_values.notifications": 1,
    "credentials_enable_service": False,
    "profile.password_manager_enabled": False
})
options.add_argument('--headless')
options.add_experimental_option("excludeSwitches", ['enable-automation'])

browser = webdriver.Chrome(ChromeDriverManager(cache_valid_range=365).install(), options=options)
browser.get("https://vt.tiktok.com/ZS81uRSRR/")
print("Canonical link: ", browser.current_url)
Note: on the first run it will take a bit more time as it downloads the web driver automatically, but after that it will use the cache.
So I'm trying to scrape https://craft.co/tesla
When I visit it from the browser, it opens correctly. However, when I use Scrapy, it fetches the site, but when I view the response with
view(response)
it shows the Cloudflare site instead of the actual site.
How do I go about this?
Cloudflare changes its techniques periodically, but you can use a simple Python module to bypass its anti-bot page.
The module can be useful if you wish to scrape or crawl a website protected by Cloudflare. Cloudflare's anti-bot page currently just checks whether the client supports JavaScript, though it may add additional techniques in the future.
Because Cloudflare continually changes and hardens its protection page, cloudscraper requires a JavaScript engine/interpreter to solve the JavaScript challenges. This allows the script to easily impersonate a regular web browser without explicitly deobfuscating and parsing Cloudflare's JavaScript.
Any script using cloudscraper will sleep for ~5 seconds on its first visit to a site with Cloudflare anti-bot enabled, though no delay will occur after the first request.
[ https://pypi.python.org/pypi/cloudscraper/ ]
Please check this python module.
The simplest way to use cloudscraper is by calling create_scraper().
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper() # CloudScraper inherits from requests.Session
print(scraper.get("http://somesite.com").text) # => "<!DOCTYPE html><html><head>..."
Any requests made from this session object to websites protected by Cloudflare anti-bot will be handled automatically. Websites not using Cloudflare will be treated normally. You don't need to configure or call anything further, and you can effectively treat all websites as if they're not protected with anything.
You use cloudscraper exactly the same way you use Requests. cloudscraper works identically to a Requests Session object; instead of calling requests.get() or requests.post(), you call scraper.get() or scraper.post().
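For example, a POST with form data works the same way (hypothetical URL and fields):
import cloudscraper

# scraper behaves like a requests.Session, so data=, headers=, and cookies= all work
scraper = cloudscraper.create_scraper()
resp = scraper.post("http://somesite.com/login", data={"user": "name", "pass": "secret"})
print(resp.status_code)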
Use requests-HTML. You can use this code to avoid the block:
# pip install requests-html
from requests_html import HTMLSession

url = 'your url goes here'

s = HTMLSession()
s.headers['user-agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'

r = s.get(url)
r.html.render(timeout=8000)  # executes the page's JavaScript; downloads Chromium on first use

print(r.status_code)
print(r.content)
I'm trying to crawl a website using the requests library. However, the particular website I am trying to access (http://www.vi.nl/matchcenter/vandaag.shtml) has a very intrusive cookie statement.
I am trying to access the website as follows:
from bs4 import BeautifulSoup as soup
import requests
website = r"http://www.vi.nl/matchcenter/vandaag.shtml"
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"})
htmlsoup = soup(html.text, "html.parser")
This returns a web page that consists of just the cookie statement with a big button to accept. If you try accessing this page in a browser, you find that pressing the button redirects you to the requested page. How can I do this using requests?
I considered using mechanize.Browser but that seems a pretty roundabout way of doing it.
Try setting:
cookies = dict(BCPermissionLevel='PERSONAL')
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"}, cookies=cookies)
This will bypass the cookie consent page and land you straight on the requested page.
Note: you could find the above by analyzing the JavaScript code that runs on the cookie consent page; it is a bit obfuscated, but it should not be difficult. If you run into the same type of problem again, take a look at what kind of cookies the JavaScript code executed in the event handler sets.
I found this SO question which asks how to send cookies in a POST using requests. The accepted answer states that the latest build of Requests will build CookieJars for you from simple dictionaries. Below is the POC code included in the original answer.
import requests
cookie = {'enwiki_session': '17ab96bd8ffbe8ca58a78657a918558'}
r = requests.post('http://wikipedia.org', cookies=cookie)
import requests
from bs4 import BeautifulSoup

a = requests.Session()
soup = BeautifulSoup(a.get("https://www.facebook.com/").content)
payload = {
    "lsd": soup.find("input", {"name": "lsd"})["value"],
    "email": "my_email",
    "pass": "my_password",
    "persistent": "1",
    "default_persistent": "1",
    "timezone": "300",
    "lgnrnd": soup.find("input", {"name": "lgnrnd"})["value"],
    "lgndim": soup.find("input", {"name": "lgndim"})["value"],
    "lgnjs": soup.find("input", {"name": "lgnjs"})["value"],
    "locale": "en_US",
    "qsstamp": soup.find("input", {"name": "qsstamp"})["value"]
}
soup = BeautifulSoup(a.post("https://www.facebook.com/", data=payload).content)
print([i.text for i in soup.find_all("a")])
I'm playing around with requests and have read several threads here on SO about it, so I decided to try it out myself.
I am stumped by this line: "qsstamp": soup.find("input", {"name": "qsstamp"})["value"]
because it returns None, thereby causing an error.
However, looking at Chrome developer tools, this "qsstamp" is populated. What am I missing here?
The payload is everything shown in the form data in Chrome dev tools, so what is going on?
Using Firebug and searching for qsstamp gives matched results that direct here:
You can see: j.createHiddenInputs({qsstamp:u},v)
That means qsstamp is dynamically generated by JavaScript.
requests will not run JavaScript (it only fetches the page's HTML). You may want to use something like dryscrape, or an emulated browser like Selenium.
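As a minimal sketch of the Selenium route (the input name comes from the question; the rest is an assumption):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.facebook.com/")
# after the browser runs the page's JavaScript, the hidden input exists and can be read
qsstamp = driver.find_element(By.NAME, "qsstamp").get_attribute("value")
print(qsstamp)
driver.quit()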