I'm using BeautifulSoup + cloudscraper to scrape a site. The problem is that it works locally but not on a Heroku server.
It looks like when I launch the script on the Heroku server, JavaScript or cookies are not enabled. That's why cloudscraper can bypass Cloudflare locally but not on Heroku.
My code:
import requests
import cloudscraper
from bs4 import BeautifulSoup
session = requests.session()
scraper = cloudscraper.create_scraper(browser='chrome', sess=session)
contract_page = scraper.get(
    "https://bscscan.com/token/0x30e650783b4046c64dcf3b7b78854f3d4a87b058",
    headers={
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36",
        'Cache-Control': "no-cache",
    })
soupa = BeautifulSoup(contract_page.content, 'html.parser')
print(soupa)
tokenholders = soupa.find(id='ContentPlaceHolder1_tr_tokenHolders').get_text()
The print of soupa gives me this HTML page:
Does anyone have an idea how to enable JS or cookies from a Heroku server running the script, please?
After many tries I found a solution: to bypass it, we have to use a proxy to change the IP of the Heroku server.
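For illustration, here's a minimal sketch of that fix; the proxy URL below is a placeholder for whatever provider you use. Since cloudscraper subclasses requests.Session, the standard proxies mapping applies:

import requests
import cloudscraper

session = requests.session()
scraper = cloudscraper.create_scraper(browser='chrome', sess=session)
# Placeholder endpoint: substitute your proxy provider's host, port, and credentials.
scraper.proxies = {
    "http": "http://user:pass@proxyhost:8080",
    "https": "http://user:pass@proxyhost:8080",
}
contract_page = scraper.get("https://bscscan.com/token/0x30e650783b4046c64dcf3b7b78854f3d4a87b058")
print(contract_page.status_code)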
Related
I am trying to scrape https://www.carsireland.ie/search#q?%20scraper%20python=&toggle%5Bpoa%5D=false&page=1 (I had built a scraper, but then they did a total overhaul of their website). The new website has a new format and uses Cloudflare to provide the usual security. I have the following code, which returns a 403 error, specifically referencing this error page:
"https://www.cloudflare.com/5xx-error-landing"
The code which I have built so far is as follows:
from requests_html import HTMLSession
session = HTMLSession()
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    'referer': 'https://www.google.com/'
}
# url of search page
url = 'https://www.carsireland.ie/search#q?sortBy=vehicles_prod%2Fsort%2Fpoa%3Aasc%2Cupdated%3Adesc&page=1'
# create a session with the url
r = session.get(url, headers=header)
# render the url
data = r.html.render(sleep=1, timeout=20)
# Check the response
print(r.text)
I would really appreciate any help which could be provided to correct the Cloudflare issues I am having.
This problem can be fixed by simply changing the referer property in the header to the link you are going to scrape.
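For what it's worth, a sketch of that change applied to the code above; only the referer differs, now pointing at the page being scraped:

from requests_html import HTMLSession

session = HTMLSession()
url = 'https://www.carsireland.ie/search#q?sortBy=vehicles_prod%2Fsort%2Fpoa%3Aasc%2Cupdated%3Adesc&page=1'
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    'referer': url,  # referer now matches the target URL instead of google.com
}
r = session.get(url, headers=header)
print(r.status_code)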
Goal:
I am trying to scrape the HTML from this page: https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d=.
(note - I will eventually want to paginate and scrape all job listings from this page)
My issue:
I get a 503 error when I try to scrape the page using Python and Requests. I am working out of Google Colab.
Initial Code:
import requests
url = 'https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d='
response = requests.get(url)
print(response)
Attempted solutions:
Using 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
Implementing this code I found in another thread:
import requests
def getUrl(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    }
    res = requests.get(url, headers=headers)
    res.raise_for_status()

getUrl('https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d=')
I am able to access the website via my browser.
Is there anything else I can try?
Thank you
That page is protected by Cloudflare. There are some options for trying to bypass it; it seems that using cloudscraper works:
import cloudscraper
scraper = cloudscraper.create_scraper()
url = 'https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d='
response = scraper.get(url).text
print(response)
In order to use it, you'll need to install it:
pip install cloudscraper
The code isn't working. It gets a 403 error because the site is using Cloudflare.
When I use any local HTTP proxy (Burp Suite, Fiddler, etc.), I see the csrfToken. It works.
Why does it work when I use a local proxy?
import requests
from bs4 import BeautifulSoup
headerIstek = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19041",
    "Sec-Fetch-Site": "none",
    "Accept-Language": "tr-TR,tr;q=0.9,en-US;q=0.8,en;q=0.7"
}
istekLazim = {"ref":"","display_type":"popup","loc":""}
istekLogin = requests.get("https://www.example.com/join/login-popup/", headers=headerIstek, cookies={"ud_rule_vars":""}, params=istekLazim, verify=False)
soup = BeautifulSoup(istekLogin.text, "html.parser")
print(istekLogin.request.headers)
csrfToken = soup.find("input", {"name":"csrfmiddlewaretoken"})["value"]
print(csrfToken)
Cloudflare performs JavaScript checks on the browser and returns a session if the checks have been successful. If you want to run a one-off script to download stuff off of a CloudFlare protected server, add a session cookie from a previously validated session you obtained using your browser.
The session cookie is named __cfduid. You can get it by fetching a resource using your browser and then opening the developer tools and the network panel. Once you inspect the request, you can see the cookies your browser sent to the server.
Then you can use that cookie for requests using your script:
import requests

cookies = {
    # value copied from your own validated browser session (this one is just an example)
    "__cfduid": "xd0c0985ed80ffbc4dd29d1612168766",
}
# image_url is the URL of the protected resource you want to download
response = requests.get(image_url, cookies=cookies)
response.raise_for_status()
I want to scrape https://www.jdsports.it/ using BeautifulSoup, but I get Access Denied.
On my PC I don't have any problem accessing the site, and I'm using the same user agent as the Python program, but in the program the result is different; you can see the output below.
EDIT:
I think I need cookies to gain access to the site. How can I get them and use them to access the site with the Python program in order to scrape it?
The script works if I use "https://www.jdsports.com", which is the same site but for a different region.
Thanks!
import time
import requests
from bs4 import BeautifulSoup
import smtplib
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
url = 'https://www.jdsports.it/'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)  # note: the original soup.findAll.get_text() raises AttributeError (findAll is a method object); printing the soup shows the server's response
The output is:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.jdsports.it/" on this server.<p>
Reference #18.35657b5c.1589627513.36921df8
</p></body>
</html>
I suspected HTTP/2 at first, but wasn't able to get that working either. Perhaps you'll have more luck; here's an HTTP/2 starting point:
import asyncio
import httpx
import logging

logging.basicConfig(format='%(message)s', level=logging.DEBUG)

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
}
url = 'https://www.jdsports.it/'

async def f():
    client = httpx.AsyncClient(http2=True)
    # note: newer httpx releases renamed allow_redirects to follow_redirects
    r = await client.get(url, allow_redirects=True, headers=headers)
    print(r.text)
    await client.aclose()  # close the client and its connections explicitly

asyncio.run(f())
(Tested both on Windows and Linux.) Could this have something to do with TLS1.2? That's where I'd look next, as curl works.
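If TLS is indeed the culprit, one avenue to experiment with is mounting a custom SSL context on a requests session. This is only a sketch under that assumption; the cipher string is a knob to tweak, not a known fix for this site:

import ssl
import requests
from requests.adapters import HTTPAdapter

class TLSAdapter(HTTPAdapter):
    # Adapter that hands urllib3 a custom SSL context.
    def __init__(self, ssl_context=None, **kwargs):
        self.ssl_context = ssl_context
        super().__init__(**kwargs)

    def init_poolmanager(self, *args, **kwargs):
        kwargs['ssl_context'] = self.ssl_context
        return super().init_poolmanager(*args, **kwargs)

ctx = ssl.create_default_context()
ctx.set_ciphers('DEFAULT@SECLEVEL=1')  # assumption: a looser cipher selection to better mimic curl; tune as needed

session = requests.Session()
session.mount('https://', TLSAdapter(ssl_context=ctx))
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
r = session.get('https://www.jdsports.it/', headers=headers)
print(r.status_code)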
I have tried to log in to a Twitter account using the requests library, but I am getting a 400 response; it is not working. I used all the required payload parameters and headers, but I am still unable to figure out how to log in.
import requests
from bs4 import BeautifulSoup
payload={
"session[username_or_email]":"***************",
"session[password]":"*************",
"authenticity_token":"*************",
"ui_metrics":'{"rf":{"a4f2d7a8e3d9736f0815ae7b34692191bca9f114a7d5602c7758a3e6087b6b30":0,"ad92fc8b83fb5dec3f720f89a7f0fb415a26130516362f230b02251edd96a54a":0,"a011babb5c5df598f93bcc4a38dfad0276f69df36faff48eea95bac67cefeffe":0,"a75214752b7e90fd50725fce21cc26761ef3613173b0f8764d52c8b53f136bbf":0},"s":"mTArUSdNtTOm6WaGwNeRjMAU3EhNA3VGbFeCIZeEkjjLTAbccFDTJjcTEB2tQ9iuNJUzniFKyvhZNOGdH1LIwmi1YSMcFTOHu2Wi49yKvONv0obfg1dW27znR_C2n-ev2zMvN5166j1ccsxWKIheiWw-eHM7oXA54U40cWHvdCrunJJKj2INkTrcVph-y2fccu1m3hp31vngqBiL-XmeLWYiyZ-NYOmV8f5iXW9WWMvISTcSwzz9vd_n9-tLSKociT-1ap5ZVFWNUWIycSflj8WcOmmRFzq4kwa-NsS0FRp-DQ2FOkozhhhQi9HDvSODUlGsdQWBPkGKKtDWbtnj9gAAAWEty4Xv"}',
"scribe_log":"",
"redirect_after_login":"",
"authenticity_token":"******************",
"return_to_ssl":"",
"remember_me":"1",
"lang":""
}
headers={
"accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-encoding":"gzip, deflate, br",
"accept-language":"en-US,en;q=0.9",
"cache-control":"max-age=0",
"cookie":'moments_profile_moments_nav_tooltip_self=true; syndication_guest_id=v1%3A150345116906281638; eu_cn=1; kdt=QErLcBT9OjM5gjEznmsRcHlMTK6biDyAw4gfI5ro; remember_checked_on=1; _ga=GA1.2.1923324433.1496571570; tfw_exp=0; _gid=GA1.2.106381927.1516638134; __utma=43838368.1923324433.1496571570.1516764481.1516764481.1; __utmz=43838368.1516764481.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); lang=en; ct0=7ceea26f7fd3d186152512d26365cddf; _twitter_sess=BAh7CiIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCL8wyy1hAToMY3NyZl9p%250AZCIlNjJjODQ1MjZiZWQzOGUyODZlOWUxNmNkMWJhZTZjYjc6B2lkIiU4MmZm%250AYWQ3Mzc1OGFhNmJjOTIxZjlmOGEyMzk3MjE1NToJdXNlcmwrCQAAVbhKiEIN--32d967262e1de8852d20ace15ec93d87b9a902a8; personalization_id="v1_snKt6bqCONQsnFuE8EOZDA=="; guest_id=v1%3A151689245583269291; _gat=1; ads_prefs="HBERAAA="; twid="u=955475925457502208"; auth_token=50decb38f16f3c264f480b0cd1cc30a9bcce9f08',
"referer":"https://twitter.com/login",
"upgrade-insecure-requests":"1",
"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
res = requests.get("https://twitter.com/login",data=payload,headers=headers)
soup = BeautifulSoup(res.text,"html.parser")
print(res.status_code)
print(res.url)
for item in soup.find_all(class_="title"):
    print(item.text)
How do I log in to Twitter? Which parameters did I miss? Please help me out with this.
Note: I am not using APIs or a Selenium driver; I want to do it using the requests library. Thanks in advance.
You're using the GET method to access an auth endpoint; usually the POST method is used for such purposes. Try using requests.post instead of requests.get.
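A minimal sketch of that change, reusing the payload and headers dictionaries from the question (whether Twitter then accepts the login is a separate matter; this only corrects the HTTP method):

# POST the credentials instead of fetching the login page with GET
res = requests.post("https://twitter.com/login", data=payload, headers=headers)
print(res.status_code)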