How to avoid a bot detection and scrape a website using python?

How to avoid a bot detection and scrape a website using python? - python

My Problem:
I want to scrape the following website: https://www.coches.net/segunda-mano/.
But every time i open it with python selenium, i get the message, that they detected me as a bot.
How can i bypass this detection?
First i tried simple code with selenium:
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Chrome('C:/Python38/chromedriver.exe')
URL = 'https://www.coches.net/segunda-mano/'
browser.get(URL)
Then i tried it with request, but i doesn't work, too.
from selenium import webdriver
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import requests
ua = UserAgent()
headers = {"UserAgent":ua.random}
URL = 'https://www.coches.net/segunda-mano/'
r = requests.get(URL, headers = headers)
print(r.statuscode)
In this case i get the message 403 = Status code stating that access to the URL is prohibited.
Don't know how to get entry to this webpage without getting blocked. I would be very grateful for your help. Thanks in advance.

Selenium is fairly easily detected, especially by all major anti-bot providers (Cloudflare, Akamai, etc).
Why?
Selenium, and most other major webdrivers set a browser variable (that websites can access) called navigator.webdriver to true. You can check this yourself by heading to your Google Chrome console and running console.log(navigator.webdriver). If you're on a normal browser, it will be false.
The User-Agent, typically all devices have what is called a "user agent", this refers to the device accessing the website. Selenium's User-Agent looks something like this: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/59.0.3071.115 Safari/537.36. Did you catch that? HeadlessChrome is included, this is another route of detection.
These are just two of the multiple ways a Selenium browser can be detected, I would highly recommend reading up on this and this as well.
And lastly, if you want an easy, drop-in solution to bypass detection that implements almost all of these concepts we've talked about, I'd suggest using undetected-chromedriver. This is an open source project that tries it's best to keep your Selenium chromedriver looking human.

I think your problem is not bot detection. You can't use just requests to get the results from that page, because it makes XHR requests behind the scene. So you must use Selenium, splash, etc, but seems is not possible for this case.
However, if you research a bit in the page you can find which url is requested behind the scenes to display the resutls. I did that research and found this page(https://ms-mt--api-web.spain.advgo.net/search), it returns a json, so it will ease your work in terms of parsing. Using chrome dev tools I got the curl request and just map it to python requests and obtain this code:
import json
import requests
headers = {
'authority': 'ms-mt--api-web.spain.advgo.net',
'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
'accept': 'application/json, text/plain, */*',
'x-adevinta-channel': 'web-desktop',
'x-schibsted-tenant': 'coches',
'sec-ch-ua-mobile': '?0',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
'content-type': 'application/json;charset=UTF-8',
'origin': 'https://www.coches.net',
'sec-fetch-site': 'cross-site',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.coches.net/',
'accept-language': 'en-US,en;q=0.9,es;q=0.8',
}
data = '{"pagination":{"page":1,"size":30},"sort":{"order":"desc","term":"relevance"},"filters":{"categories":{"category1Ids":[2500]},"offerTypeIds":[0,2,3,4,5],"isFinanced":false,"price":{"from":null,"to":null},"year":{"from":null,"to":null},"km":{"from":null,"to":null},"provinceIds":[],"fuelTypeIds":[],"bodyTypeIds":[],"doors":[],"seats":[],"transmissionTypeId":0,"hp":{"from":null,"to":null},"sellerTypeId":0,"hasWarranty":null,"isCertified":false,"luggageCapacity":{"from":null,"to":null},"contractId":0}}'
while True:
response = requests.post('https://ms-mt--api-web.spain.advgo.net/search', headers=headers, data=data).json()
# you should parse items here.
print(response)
if not response["items"]:
break
data_dict = json.loads(data)
data_dict["pagination"]["page"] = data_dict["pagination"]["page"]+1 # get the next page.
data = json.dumps(data_dict)
Probably there are a lot of headers and body info that are unnecesary, you can code-and-test to improve it.

Proxy rotating can be useful if scraping large data
options = Options()
options.add_arguments('--proxy-server="#ip:#port"')
Then initialize chrome driver with options object

Related

python requests get aspx issue

Please look over my code. What am I doing wrong that my request response is empty? Any pointers?
URL in question: (should generate a results page)
https://www.ucr.gov/enforcement/343121222
But I cannot replicate it with python requests. Why?
import requests
headers = {'Host': 'www.ucr.gov',
'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:105.0) Gecko/20100101 Firefox/105.0',
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding' : 'gzip, deflate, br',
'Connection' : 'keep-alive'
}
data = {
'scheme': 'https;',
'host': 'www.ucr.gov',
'filename': '/enforcement/3431212'
}
url="https://www.ucr.gov/enforcement/3431212"
result = requests.get(url, params=data, headers=headers)
print(result.status_code)
print(result.text)

The page at the link you provided is fully rendered on the client side using JavaScript. This means that you won't be able to obtain the same response using a simple HTTP request.
In this case a common solution is performing something that is called Headless Scraping, which involves automating a headless browser in order to access a specific website content as a regular client would. In Python headless scraping can be implemented using several libraries, including Selenium and Puppeteer.

Trouble collecting different property ids from a webpage using the requests module

After clicking on the button 11.331 Treffer located at the top right corner within the filter of this webpage, I can see the result displayed on that page. I've created a script using the requests module to fetch the ID numbers of different properties from that page.
However, when I run the script, I get json.decoder.JSONDecodeError. If I copy the cookies from dev tools directly and paste them within the headers, I get the results accordingly.
I don't wish to copy cookies from dev tools every time I run the script, so I used Selenium to collect cookies from the landing page and supply them within headers to get the desired result, but I still get the same error.
I'm trying like:
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
start_url = 'https://www.immobilienscout24.de/'
link = 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?pagenumber=1'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
'referer': 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?enteredFrom=one_step_search',
'accept': 'application/json; charset=utf-8',
'x-requested-with': 'XMLHttpRequest'
}
def get_cookies():
with webdriver.Chrome() as driver:
driver.get(start_url)
time.sleep(10)
cookiejar = {c['name']:c['value'] for c in driver.get_cookies()}
return cookiejar
cookies = get_cookies()
cookie_string = "; ".join([f"{item}={val}" for item,val in cookies.items()])
with requests.Session() as s:
s.headers.update(headers)
s.headers['cookie'] = cookie_string
res = s.get(link)
container = res.json()['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']
for item in container:
try:
project_id = item['#id']
except KeyError: project_id = ""
print(project_id)
How can I scrape property ids from that webpage using the requests module?
EDIT:
The existence of the following portion within cookies is crucial, without which the script probably leads to that error I mentioned. However, selenium failed to include that portion within cookies.
reese84=3:/qdGO9he7ld4/8a35vlw8g==:+/xBfAtVPRKHBSJgzngTQw1ywoViUvmVKLws+f8Y6edDgM+3s0Xzo17NvfgPrx9Z/suRy7hee5xcEgo85V3LdGsIop9/29g1ib1JQ0pO3UHWrtn81MseS6G8KE6AF4SrWZ2t8eTr1SEogUmCkB1HNSqXT88sAZaEi+XSzUyAGqikVjEcLX9TeI+KN37QNr9Sl+oTaOPchSgS/IowPj83zvT471Ewabg8CAc6q8I9AJ8Zb9FfLqePweCM+QFKIw+ZUp5GR4TXxZVcWdipbIEAyv3kj2x9Xs1K1k+8aXmy9VES6rFvW1xOsAjLmXbg6REPBye+QcAgPUh/x79mBWktcWC/uQ5L2W2dBLBS4eM2+bpEBw5EHMfjq9bk9hnmmZuxPGALLKASeXBt5lUUwx7x+wtGcjyvB9ZSE6gI2VxFLYqncYmhKqoNzgwQY8wRThaEraiJF/039/vVMa2G3S38iwniiOGHsOxq6VTdnWJGgvJqUmpWfXzz6XQXWL2xcykAoj7LMqHF2tC0DQyInUmZ3T7zjPBV7mEMgZkDn0z272E=:qQHyFe1/pp8/BS4RHAtxftttcOYJH4oqG1mW0+aNXF4=;

I think another part of your problem is that the link is not json. It's an html document. Part of the html document does contains javascript that sets a js variable to a json object. You can't get that with res.json()
In theory, you could use selenium to go to the link and grab the contents of the IS24.resultList variable by executing javascript like this:
driver.get(link)
time.sleep(10)
result_list = json.loads( driver.execute_script("return window.IS24.resultList"))
In practice, I think they're really serious about blocking bots and I suspect convincing them you're not a bot might take more than spoofing a cookie. When I visit via Selenium I don't even get the recaptcha option that I get when visiting through a regular browser session with incognito mode.

Problem with web scraping - JavaSript on website is disabled

Hello,
I've been playing with discord bots (in Python) for a while now and I've come across a problem with scraping information on some websites that protect themselves from data collection by disabling javascript on their side so you can't get to their data.
I have already looked at many websites recommending changing in headers among other things, but it has not helped.
The next step was to use selenium, which returns me this information.
We're sorry but Hive-Engine Explorer doesn't work properly without JavaScript enabled. Please enable it to continue.
Code:
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://he.dtools.dev/richlist/BEE")
htmlSource = driver.page_source
print(htmlSource)
I also checked how it looks like on the browser side itself and as we can see after entering the page there is no way to see the html file
Image from website
My question is, is it possible to bypass such security measures? Unfortunately I wanted to download the information from the API but it is not possible in this case.

You don't need to run Selenium to get this data, the site uses a backend api to deliver the data which you can replicate easily in python:
import requests
import pandas as pd
import time
import json
token = 'BEE'
limit = 100
id_ = int(time.time())
headers = {
'accept':'application/json, text/plain, */*',
'content-type':'application/json',
'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.3'
}
url = 'https://api.hive-engine.com/rpc/contracts'
payload = {"jsonrpc":"2.0","id":id_,"method":"find","params":{"contract":"tokens","table":"balances","query":{"symbol":token},"offset":0,"limit":limit}}
resp = requests.post(url,headers=headers,data=json.dumps(payload)).json()
df= pd.DataFrame(resp['result'])
df.to_csv('HiveData.csv',index=False)
print('Saved to HiveData.csv')

Downloading a Text File from a Public Sharepoint Link using Requests in Python

I'm trying to automate the downloading of a text file from a shared link that was sent to me by email. The original link is to a folder containing two files but I got the direct download link of the file that I need which is:
https://'abc'-my.sharepoint.com/personal/gamma_'abc'/_layouts/15/download.aspx?UniqueId=a0db276e%2Ddf75%2D49b7%2Db671%2D1c49e365ef3f
When I enter the above url into a web browser I get the popup option to open or download the file. I'm trying to write some Python code to download the file automatically and this what I've come up with so far
import requests
url = "https://<abc>-my.sharepoint.com/personal/gamma_<abc>/_layouts/15" \
"/download.aspx?UniqueId=a0db276e%2Ddf75%2D49b7%2Db671%2D1c49e365ef3f "
hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.5',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'Connection': 'keep-alive'}
myfile = requests.get(url, headers=hdr)
open('c:/users/scott/onedrive/desktop/gamma.las', 'wb').write(myfile.content)
I originally tried without the user agent and when I opened gamma.las there was only 403 FORBIDDEN in the file. If I send the header too then the file contains HTML for what looks like a Microsoft login page, so I'm assuming that I'm missing some authentication step.
I have no affiliation with this organization - someone sent me this link via email for me to download a text file which works fine through the browser but not via Python. I don't log in to anything to get it as I have no username or password with this domain.
Am I able to do this using Requests? If not, am I able to use REST API without user credentials for this company's Sharepoint?

Not supplying your credentials most probably means you are (implicitly) using built-in windows authentication in your organization. Check out if this helps: Handling windows authentication while accessing url using requests
The python library mentioned there to handle built-in windows auth is requests-negotiate-sspi. Not sure, if it's going to work with federation (your website ends with ".sharepoint.com" meaning you are probably using federation as well), but may be worth trying.
So, I would try something like this (I doubt headers really matter in your case, but you could try adding them as well)
import requests
from requests_negotiate_sspi import HttpNegotiateAuth
url = ...
myfile = requests.get(url, auth=HttpNegotiateAuth())

How to get response from post with python

I'm trying to check if a current #hotmail.com address is taken.
However, I'm not getting the response I would have gotten using chrome developer tools.
#!/usr/bin/python
import urllib
import urllib2
import requests
cookies = {
'MC0': '1449950274804',
'mkt': 'en-US',
'MSFPC': 'ID=a9b016cd39838248bbf321ea5ad1ecae&CS=1&LV=201512&V=1',
'wlv': 'A|ekIL-d:s*cAHzDg.2+1+0+3',
'HIC': '7c5d20284ecdbbaa||0|||',
'wlxS': 'wpc=1&WebIM=1',
'RVC': 'm=1&v=17.5.9510.1001&t=12/12/2015 20:37:45',
'amcanary': '0',
'CkTst': 'MX1449957709484',
'LDH': '9',
'wla42': 'KjEsN0M1RDIwMjg0RUNEQkJBQSwsLDAsLTEsLTE=',
'LN': 'u9GMx1450021043143',
}
headers = {
'Origin': 'https://signup.live.com',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.8,ja;q=0.6',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36',
'canary': 'aeIntzIq6OCS9qOE2KKP2G6Q7yCCPLAQVPIw0oy2Vksln3bbwVR9I8DcpfzC9RiCnNiJBw4YxtWsqJfnx0PeR9ovjRG+bF1jKkyPVWUTyuDTO5UkwRNNJFTIdeaClMgHtATSy+gI99ojsAKwuRFBMNbOgCwZIMCRCmky/voftX/63gjTqC9V5Ry/bECc2P66ouDZNC7TA/KN6tfsmszelEoSrmvU7LAKDoZnkhRQjpn6WYGxUzr5S+UYXExa32AY:1:3c',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Accept': 'application/json',
'Referer': 'https://signup.live.com/signup?wa=wsignin1.0&rpsnv=12&ct=1450038320&rver=6.4.6456.0&wp=MBI_SSL_SHARED&wreply=https',
'X-Requested-With': 'XMLHttpRequest',
'Connection': 'keep-alive',
}
data = {"signInName":"testfoobar1234#outlook.com","uaid":"f1d115020fc94af6ba17e722277cdcb8","performDisambigCheck":"true","includeSuggestions":"true","uiflvr":"1001","scid":"100118","hpgid":"200407"}
asdf = requests.post('https://signup.live.com/API/CheckAvailableSigninNames?wa=wsignin1.0&rpsnv=12&ct=1450038320&rver=6.4.6456.0&wp=MBI_SSL_SHARED&wreply=https', headers=headers, cookies=cookies, data=data)
print(asdf.json())
This is what chrome gives me when checking testfoobar1234#hotmail.com:
This is what my script is giving me testfoobar1234#hotmail.com:

If you want to connect via python script on your local machine to login.live.com with right credentials but cookies from your Chrome -- it's will not work.
What you want to do: read emails, send email, or just get contacts from address book. Algorithms in script will be different. Example, Mails available via outlook.com system, contacts located in people.live.com (and API as I right remember).
If you want emulate login like Chrome do, you need:
Get and collect all cookies from outlook.com main page, don't forget about all redirects:) - via your python script
Send request with collected cookies and credentials, to login.live.com (outlook will redirect to it).
But, from my experience -- last Outlook version (regular and Outlook Preview systems) in 90% detects wrong attempt of login and send to you page with confirm login question (code or email). That way you will have unstable solution. Do you really want to do it?
If you just want to parse JSON right you need:
import json
data = json.loads(asdf.text)
print(data)
If you want to see, how much actions produced by browser, just install Firebug and disable cleaning "Network" panel, then see how many requests processed before you logged in into your account.
But, for see all traffic suggest to use Firefox + Firebug + Tamper Data.
And also, I think more quicker will be use exists libs like Selenium for browser emulation.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to avoid a bot detection and scrape a website using python? - python

Proxy rotating can be useful if scraping large data options = Options() options.add_arguments('--proxy-server="#ip:#port"') Then initialize chrome driver with options object

Related

python requests get aspx issue

Trouble collecting different property ids from a webpage using the requests module

Problem with web scraping - JavaSript on website is disabled

Downloading a Text File from a Public Sharepoint Link using Requests in Python

How to get response from post with python

Categories

Resources