I've been scraping some real estate pages and hit a wall with one particular website. It started while I was trying to scrape phone numbers hidden behind a JavaScript onclick event. I don't know much about JS, but from what I can tell this is somehow intertwined with displaying ads.
After some closer inspection I found this JSON data on each page:
"data": {
"advert": {
"...,
"phoneObj":[ {
"phone": "735", "phoneCode": "173-28-189-69-82-145-233-192-109-58-19-5-226-110-115-225-135-77-50-22-83-36-187-139-85-8-219-95-87-164-33-33-139-78-248-201"
}
}
Playing around with the Web Dev Tools I determined that this 'phoneCode' is used to fetch the real phone number by passing its value to a special API URL. I scraped the phoneCode, made another request to this special URL and..
Everything worked!
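Roughly, the flow looked like this (a simplified sketch of what my script does; the listing URL and the regex are approximations):

import re
import requests

session = requests.Session()

# Grab a listing page and pull the phoneCode out of the embedded JSON.
page = session.get("https://www.host.com/some-listing.html")
phone_code = re.search(r'"phoneCode":\s*"([\d-]+)"', page.text).group(1)

# Exchange the phoneCode for the real phone number via the API.
api_url = "https://www.host.com/frontera/api/item/owner/phone/" + phone_code
print(session.get(api_url).json())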
Unfortunately, after a few successful requests I started receiving 403 errors:
Access Denied
You don't have permission to access "http://www.host.com/frontera/api/item/owner/phone/173-28-189-69-82-145-233-192-109-58-19-5-226-110-115-225-135-77-50-22-83-36-187-139-85-8-219-95-87-164-33-33-139-78-248-201" on this server.
Reference #18.97645e68.1577009665.1a1da860
I'm not scraping those pages particularly fast, so I don't think I'm hitting a low rate limit. I opened a bunch of browser windows and clicked through those numbers manually without any problems whatsoever.
My first thought was that it has to do with proper sessions, so I immediately started tinkering with requests.Session() for cookie persistence and more realistic custom headers:
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'pl,en-US;q=0.7,en;q=0.3',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Host': 'www.host.com',
    'Pragma': 'no-cache'
}
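And roughly how I wire that into the session (the listing URL is a placeholder):

import requests

session = requests.Session()
session.headers.update(header)

# Visit a listing page first so the session collects any cookies,
# then call the phone API with the same session.
session.get("https://www.host.com/some-listing.html")
response = session.get("https://www.host.com/frontera/api/item/owner/phone/<phoneCode>")
print(response.status_code, response.text)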
But this doesn't help at all. Is there anything I can do to better pinpoint the problem?
Basically, the API is protected by AkamaiGHost, which is a well-known web application firewall.
If you browse the API via the browser, it will let you do whatever you want without any kind of block.
Once you call the endpoint from code, it will allow you a few requests, but as soon as it has analyzed your connection it will block your IP address, so you will need to use Tor.
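A minimal sketch of routing requests through Tor (this assumes a local Tor daemon listening on 127.0.0.1:9050 and the requests[socks] extra installed):

import requests

# socks5h:// makes DNS resolution go through Tor as well.
proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

# check.torproject.org reports whether the request arrived via Tor.
r = requests.get("https://check.torproject.org/", proxies=proxies)
print(r.status_code)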
And here's a solution that worked for me, reusing a browser session's cookies:
import requests
import time

# Note: the header value must not repeat the 'Cookie: ' prefix.
headers = {'Cookie': '_csrf=nnukVQ2___D88BVt_3GzfUDG; b4da1ddd423e4e8c32114620d61bbfb1=0bfd4455bf8032a1e5e62a2b2db2ca85; bm_sz=5167FDE07330E3099013E268D18AFB4A~YAAQHeF6XED5n3NuAQAAoV5mLQb4rWWWvSkll4rpV5LNOx8bW8lYWVGiw+SKyXZCLJjazkuhzG5U9GpJexyQW4kSullpdq9N7ImTrLC07uFTnoDvyWAcmsXiAtPxUSoxK5ZCSF9xrcr/ZSBrV4ZC3uHuLjryXNfrhaQdQdB1f5gFz+1STW38EPK8TDffNZg=; _abck=68E27962FFDDB95297DAEE8A49E6E038~0~YAAQZ3JlX9+7an5uAQAAYaHaLQOfSeBfhlHt3FQ0LuHf1ZoysOC4RkdU3rtfAAvweF3Ovx8R2/+0IdpC6JNOrX+W/f4QdPA6R31aTAVu/WdgxSNxL5HUQyMqQG1CV1NarTEIramfKxO8++LIpFwfZ2KanBQZbodULrgJAB69ID1tzBz+RMAeBUom/MXsMID9SxRy95qp1lQF+RBxl8t4XlZJF0+2FxqdYEDrlsl2RxO/yqxhn5Z/Xb13c9gnQMn2036VVMhcZVlA2i6n1XppqFyLBoymiMeswoejjYIjRsTexO0jNuvg1LTcRglGn9Umde2l0mls~-1~-1~-1; onap=16f2d666288x105b1c23-3-16f2dd863cfx6b41b522-5-1577024110; ldTd=true; _gcl_au=1.1.1963230701.1577014683; lqstatus=1577022670|16f2dd863cfx6b41b522|gre-9806; laquesis=gre-10591#b#gre-9806#b; laquesisff=; _ga=GA1.2.38751210.1577014684; _gid=GA1.2.543658842.1577014684; __gads=ID=9053a074379a671c-22d104a743a50087:T=1577014685:S=ALNI_MaJQJ-d6rutVuf7LnxOXmGIR03RoA; __gfp_64b=9c7FPwH7yWthkZRHZr3Y4Z0rh6LMYCLWXZPKlTLdINP.a7; PHPSESSID=4dea16ddf9b53cc9374c4e9033cf62a9; mobile_default=desktop; optimizelyEndUserId=oeu1577015678908r0.8964587898345969; lastCatType=101; cookieBarSeen=true; ak_bmsc=6029E960D640205FDEA262145E90C8B65F657267692E0000C972FF5DA7317F36~plNg4UFj14Rs42ExadiFOY0WPsw4RYaVN1W8pQRWgKVubwFxZ4i2E/x+XeH/ae7m3d3fxkGLQG32M7Yc6KatTpznk2ydv7RHJ0WVcaitQBOkOMldJBO36S8oNEl/zrXF3DmMLKg/5A1uOlTPtxjjuIg2baCJZa/9plv2nBg5U+sKZL/VwtKFWJ0QFFi3vJqotOovcwY2BtNel+GVx86sBFeBSxDIbcs7mk9KaoODLqU6IvrZCksH2qM9/uxonw9oZz; bm_sv=B875EC57E094201C119B399A54AB5B08~x4qQoaVc7K+pGqLxOHNEbMgykyIKaofv8b6aP/lHqsuoqKLZxD23NY1uv0r0qOEpWbMTXpL7e5Oj51Ll+GlJ5uuSVZ+/0SbWYqWXHaSKzIL5x0+v7p4ZYJIQBwHCIhMMq4sXjBuf5HRlmeJN4pTvqDjHx3dYDlNbf/6ktsqdsik=; _gat_clientNinja=1'}

for item in range(10):
    r = requests.get("https://www.otodom.pl/frontera/api/item/owner/phone/154-147-95-63-124-231-56-151-181-24-172-166-153-110-202-140-185-214-191-162-155-200-255-142-82-184-41-75-23-189-204-95-97-210-122-115", headers=headers)
    print(r.text)
    time.sleep(10)  # keep the request rate low
You are hitting an API that is probably not intended for external consumption, at least not the way you're using it. 403 Forbidden means that you have explicitly been denied the request. Most likely, someone saw your requests and actively blacklisted you, or an adaptive firewall has blocked you.
Related
Please look over my code - what am I doing wrong that my request response is empty? Any pointers?
URL in question (it should generate a results page):
https://www.ucr.gov/enforcement/343121222
But I cannot replicate it with Python requests. Why?
import requests

headers = {
    'Host': 'www.ucr.gov',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:105.0) Gecko/20100101 Firefox/105.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive'
}
data = {
    'scheme': 'https;',
    'host': 'www.ucr.gov',
    'filename': '/enforcement/3431212'
}
url = "https://www.ucr.gov/enforcement/3431212"
result = requests.get(url, params=data, headers=headers)
print(result.status_code)
print(result.text)
The page at the link you provided is fully rendered on the client side using JavaScript. This means you won't be able to obtain the same response with a simple HTTP request.
In this case a common solution is headless scraping: automating a headless browser so that it accesses the website content the way a regular client would. In Python, headless scraping can be implemented with several libraries, including Selenium and Pyppeteer (the Python port of Puppeteer).
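As a rough illustration, a headless Selenium session might look like this (the five-second wait is a crude placeholder; waiting on a concrete element with WebDriverWait is more robust):

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.ucr.gov/enforcement/3431212")
    time.sleep(5)  # give the client-side JavaScript time to render the results
    print(driver.page_source)
finally:
    driver.quit()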
I'm trying to log into the website santillanaconnect.com with Python requests. The problem is that I need the value of g-Recaptcha-Response to be able to log in (among some other things, but those are not a problem). If I inspect the element and search for g-Recaptcha-Response, I get:
<input type="hidden" id="g-Recaptcha-Response" name="g-Recaptcha-Response" value="long string">
I tried to make a GET request and save the value in a variable, so that I could then make a POST request with that token. The problem is that the value of g-Recaptcha-Response never shows up in the HTML of the response. I already tried adding headers (thinking maybe the default python-requests User-Agent is blacklisted or just uncommon), but it still doesn't work:
import requests

url = "https://www.santillanaconnect.com/Account/Login/?wtrealm=http%3A%2F%2Flms30.santillanacompartir.com%2Flogin%2Fcompartir%2F&wreply=https%3A%2F%2Flms30.santillanacompartir.com%2Flogin%2Fsso%2Floginconnect"
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "es-ES,es;q=0.9",
    "Dnt": "1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.57"
}

with requests.Session() as s:
    r = s.get(url, headers=headers)
    print(r.text)
https://security.stackexchange.com/questions/238333/how-can-i-bypass-recaptcha-while-using-python-requests
"You can not bypass a recaptcha, unless as #schroeder stated, where there is a weakness or an option on the site where you can bypass it. You might have to do some experimenting to see and find out if this is the case.
A more exact solution involves multiple requests, as well as requests to other websites. First make a request that causes the recaptcha to appear. Then take the received info and send a web request via python, to a re-captcha third party solving service, that accepts recaptcha question data from online requests. Then from the answers you get back from them, send that as requests via python back to where it answers are sent for recaptcha questions. This way you are solving the recaptchas by sending requests via python. hYou still are not bypassing, however you are sending requests via python to get thru the recaptcha.
Another solution for python might be able to solve the recaptcha, and send a request so that the recaptcha is passed. Basically when you solve a recaptcha, your results are sent via POST request. Although not exactly a 100% answer to your question, but it does involve sending requests via python.
Solving the recaptcha in using python however would require intense machine learning, precise image recognition, and AI programing knowledge, most of those are probably still not yet at par 100% to successfully do every captcha. Also, the recaptcha was designed also to be difficult for bots to solve even" - Amol Soneji
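To make the "third-party solving service" flow concrete, here is a rough sketch. The endpoints and field names below are hypothetical placeholders; every real provider documents its own API:

import time
import requests

API_KEY = "your-service-api-key"              # placeholder credential
SUBMIT = "https://solver.example.com/submit"  # hypothetical endpoints
RESULT = "https://solver.example.com/result"

# 1) Hand the page's recaptcha site key and URL to the solving service.
job = requests.post(SUBMIT, data={
    "key": API_KEY,
    "sitekey": "site-key-scraped-from-the-login-page",
    "pageurl": "https://www.santillanaconnect.com/Account/Login/",
}).json()

# 2) Poll until the service returns a g-Recaptcha-Response token.
while True:
    answer = requests.get(RESULT, params={"key": API_KEY, "id": job["id"]}).json()
    if answer.get("token"):
        break
    time.sleep(5)

# 3) Include the token as the g-Recaptcha-Response field of the login POST.
print(answer["token"])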
How can a server detect a bot from a single HTML request identical to one made from an interactive session? For example, I can open a new private browser in Firefox, enter a URL and have everything come back 200. However, when I copy the initial HTML request that loaded the page -- url, headers and all -- and make it using a scripted tool like requests_html on the same device, I get a 403. What other information is the server using to differentiate between these two requests? Is there something that Firefox or requests_html are doing that is not visible from the developer tools and python code?
Sample code (domain substituted):
from requests_html import HTMLSession  # note: the package name is requests_html

url = 'https://www.example.com'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'DNT': '1',
    'Host': 'www.example.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8'
}

session = HTMLSession()
response = session.get(url, headers=headers)
I would really recommend the selenium package. requests is really bad at dealing with dynamic loading and asynchronously rendered pages; it's great for interacting with APIs, but if you're doing scraping, selenium is the tool.
requests_html renders pages with a headless Chromium browser, so if you really send an identical request it should not be distinguishable.
Normally an HTTP request contains only the protocol version, the method, and the headers, so if both requests are identical it is strange that the web server can tell them apart.
Servers might detect timing, but I assume this is the very first request and that you tried both from the same IP address.
Servers might also flag a request that is 100% identical to a previously performed one, but I assume you tested this already by first trying with your script and then with your private browser.
I assume as well that you looked at your browser and there were no redirects involved.
Other differences could occur during the TLS negotiation (for example, the order of cipher suites being offered/accepted).
It might also be that your browser requests 'favicon.ico' before the page, and requests_html doesn't.
My suggestion is to first make sure you can reproduce one of your browser's requests with requests_html at all.
Concretely: set up your own web server on your local machine, in a virtual machine, or in a container on one of your remote servers, and configure nginx to log at debug level.
Then perform the access with your private browser and then with your script using requests_html, go through the generated log file, and look for any differences.
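If you don't want to configure nginx, a quick Python alternative is a tiny local server that dumps each request's headers; hit it with both the browser and requests_html and diff the output (note this is plain HTTP, so it will not reveal TLS-level differences):

from http.server import BaseHTTPRequestHandler, HTTPServer

class DumpHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Print the raw request line and every header in arrival order.
        print(self.requestline)
        for name, value in self.headers.items():
            print(f"{name}: {value}")
        print("-" * 40)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

# Visit http://127.0.0.1:8000/ from the browser and from the script.
HTTPServer(("127.0.0.1", 8000), DumpHandler).serve_forever()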
I've got a script that is meant to first log into Twitter. I get a 200 response if I check it, but after a successful POST I'm not redirected to my logged-in Twitter account; instead it stays on the same page.
import requests

url = 'https://twitter.com/login/error?redirect_after_login=%2F'
r = requests.Session()

# Need the headers below so Twitter doesn't reject the request.
headers = {
    'Host': "twitter.com",
    'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
    'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    'Accept-Language': "en-US,en;q=0.5",
    'Accept-Encoding': "gzip, deflate, br",
    'Referer': "https://twitter.com/login/error?redirect_after_login=%2F",
    'Upgrade-Insecure-Requests': "1",
    'Connection': "keep-alive"
}
login_data = {"session[username_or_email]": "Username", "session[password]": "Password"}
response = r.post(url, data=login_data, headers=headers, allow_redirects=True)
How do I get redirected to my account (the logged-in state) after a successful POST request? Am I not using the correct headers, or something like that? I've not done a huge amount of web stuff before, so I'm sorry if it's something really obvious.
Note: I cannot use the Twitter API to do this. The Referer is the error page because that's where I'm logging in from - unless of course I'm wrong in doing that.
Perhaps the GET parameter redirect_after_login triggers a JavaScript or HTML meta-refresh redirect rather than an HTTP redirect; if that's the case, the requests module will not follow it.
So once you retrieve your authentication token from the first request, make a second request to https://twitter.com/, not forgetting to include the security token in your HTTP request fields. You can find more information about Twitter's REST API here: https://dev.twitter.com/overview/api
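A sketch of the "retrieve your authentication token first" idea (the authenticity_token field and the /sessions endpoint reflect Twitter's old web login form and may well have changed):

import re
import requests

session = requests.Session()

# Load the login page so the session picks up cookies and the hidden
# CSRF token embedded in the form.
page = session.get("https://twitter.com/login")
token = re.search(r'name="authenticity_token" value="([^"]+)"', page.text).group(1)

login_data = {
    "session[username_or_email]": "Username",
    "session[password]": "Password",
    "authenticity_token": token,
}
response = session.post("https://twitter.com/sessions", data=login_data)
print(response.url)  # should land on the logged-in timeline on success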
But the joy of Python is having libraries for everything, so I suggest you take a look here:
https://github.com/bear/python-twitter
It's a library for communicating with Twitter's REST API.
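For completeness (noting that the question rules the official API out), a minimal python-twitter call looks roughly like this; the credentials are placeholders:

import twitter

api = twitter.Api(consumer_key="...",
                  consumer_secret="...",
                  access_token_key="...",
                  access_token_secret="...")
print(api.VerifyCredentials())  # confirms which account is authenticated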
I'm trying to check whether a given @hotmail.com address is taken.
However, I'm not getting the response I would get using the Chrome developer tools.
#!/usr/bin/python
import requests
cookies = {
    'MC0': '1449950274804',
    'mkt': 'en-US',
    'MSFPC': 'ID=a9b016cd39838248bbf321ea5ad1ecae&CS=1&LV=201512&V=1',
    'wlv': 'A|ekIL-d:s*cAHzDg.2+1+0+3',
    'HIC': '7c5d20284ecdbbaa||0|||',
    'wlxS': 'wpc=1&WebIM=1',
    'RVC': 'm=1&v=17.5.9510.1001&t=12/12/2015 20:37:45',
    'amcanary': '0',
    'CkTst': 'MX1449957709484',
    'LDH': '9',
    'wla42': 'KjEsN0M1RDIwMjg0RUNEQkJBQSwsLDAsLTEsLTE=',
    'LN': 'u9GMx1450021043143',
}
headers = {
    'Origin': 'https://signup.live.com',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8,ja;q=0.6',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36',
    'canary': 'aeIntzIq6OCS9qOE2KKP2G6Q7yCCPLAQVPIw0oy2Vksln3bbwVR9I8DcpfzC9RiCnNiJBw4YxtWsqJfnx0PeR9ovjRG+bF1jKkyPVWUTyuDTO5UkwRNNJFTIdeaClMgHtATSy+gI99ojsAKwuRFBMNbOgCwZIMCRCmky/voftX/63gjTqC9V5Ry/bECc2P66ouDZNC7TA/KN6tfsmszelEoSrmvU7LAKDoZnkhRQjpn6WYGxUzr5S+UYXExa32AY:1:3c',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Accept': 'application/json',
    'Referer': 'https://signup.live.com/signup?wa=wsignin1.0&rpsnv=12&ct=1450038320&rver=6.4.6456.0&wp=MBI_SSL_SHARED&wreply=https',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive',
}
data = {"signInName": "testfoobar1234@outlook.com", "uaid": "f1d115020fc94af6ba17e722277cdcb8", "performDisambigCheck": "true", "includeSuggestions": "true", "uiflvr": "1001", "scid": "100118", "hpgid": "200407"}
asdf = requests.post('https://signup.live.com/API/CheckAvailableSigninNames?wa=wsignin1.0&rpsnv=12&ct=1450038320&rver=6.4.6456.0&wp=MBI_SSL_SHARED&wreply=https', headers=headers, cookies=cookies, data=data)
print(asdf.json())
This is what Chrome gives me when checking testfoobar1234@hotmail.com: (screenshot)
This is what my script is giving me for testfoobar1234@hotmail.com: (screenshot)
If you want to connect from a Python script on your local machine to login.live.com with the right credentials but with cookies copied from your Chrome, it will not work.
Consider what you actually want to do: read emails, send email, or just get contacts from the address book. The algorithm in your script will be different for each. For example, mail is available via the outlook.com system, while contacts live in people.live.com (and its API, as far as I remember).
If you want to emulate a login the way Chrome does it, you need to (see the sketch below):
Get and collect all cookies from the outlook.com main page - don't forget about all the redirects :) - via your Python script.
Send a request with the collected cookies and credentials to login.live.com (outlook.com will redirect to it).
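A rough sketch of those two steps (the login endpoint and form field names are placeholders; inspect the real login form before using them):

import requests

session = requests.Session()

# Step 1: hit the main page; Session follows redirects and keeps every
# cookie set along the redirect chain.
session.get("https://outlook.com/")
print(session.cookies.get_dict())

# Step 2: post credentials to the login endpoint with those cookies.
# URL and field names are placeholders, not the real Live login contract.
session.post("https://login.live.com/some-login-endpoint",
             data={"login": "user@example.com", "passwd": "hunter2"})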
But, from my experience, the latest Outlook versions (both the regular and the Outlook Preview systems) detect a suspicious login attempt about 90% of the time and respond with a login-confirmation page (code or email). That means your solution will be unstable. Do you really want to do this?
If you just want to parse the JSON correctly, you need:
import json

data = json.loads(asdf.text)  # equivalent to calling asdf.json()
print(data)
If you want to see how many actions the browser performs, install Firebug and disable clearing of the "Network" panel, then watch how many requests are processed before you are logged into your account.
To see all the traffic, I suggest Firefox + Firebug + Tamper Data.
Also, I think it would be quicker to use an existing library like Selenium for browser emulation.