I'm trying to check whether a given @hotmail.com address is taken.
However, I'm not getting the response I would get using Chrome developer tools.
#!/usr/bin/python
import requests
cookies = {
    'MC0': '1449950274804',
    'mkt': 'en-US',
    'MSFPC': 'ID=a9b016cd39838248bbf321ea5ad1ecae&CS=1&LV=201512&V=1',
    'wlv': 'A|ekIL-d:s*cAHzDg.2+1+0+3',
    'HIC': '7c5d20284ecdbbaa||0|||',
    'wlxS': 'wpc=1&WebIM=1',
    'RVC': 'm=1&v=17.5.9510.1001&t=12/12/2015 20:37:45',
    'amcanary': '0',
    'CkTst': 'MX1449957709484',
    'LDH': '9',
    'wla42': 'KjEsN0M1RDIwMjg0RUNEQkJBQSwsLDAsLTEsLTE=',
    'LN': 'u9GMx1450021043143',
}
headers = {
    'Origin': 'https://signup.live.com',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8,ja;q=0.6',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36',
    'canary': 'aeIntzIq6OCS9qOE2KKP2G6Q7yCCPLAQVPIw0oy2Vksln3bbwVR9I8DcpfzC9RiCnNiJBw4YxtWsqJfnx0PeR9ovjRG+bF1jKkyPVWUTyuDTO5UkwRNNJFTIdeaClMgHtATSy+gI99ojsAKwuRFBMNbOgCwZIMCRCmky/voftX/63gjTqC9V5Ry/bECc2P66ouDZNC7TA/KN6tfsmszelEoSrmvU7LAKDoZnkhRQjpn6WYGxUzr5S+UYXExa32AY:1:3c',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Accept': 'application/json',
    'Referer': 'https://signup.live.com/signup?wa=wsignin1.0&rpsnv=12&ct=1450038320&rver=6.4.6456.0&wp=MBI_SSL_SHARED&wreply=https',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive',
}
data = {"signInName":"testfoobar1234#outlook.com","uaid":"f1d115020fc94af6ba17e722277cdcb8","performDisambigCheck":"true","includeSuggestions":"true","uiflvr":"1001","scid":"100118","hpgid":"200407"}
asdf = requests.post('https://signup.live.com/API/CheckAvailableSigninNames?wa=wsignin1.0&rpsnv=12&ct=1450038320&rver=6.4.6456.0&wp=MBI_SSL_SHARED&wreply=https', headers=headers, cookies=cookies, data=data)
print(asdf.json())
This is what Chrome gives me when checking testfoobar1234@hotmail.com:
This is what my script gives me for testfoobar1234@hotmail.com:
If you connect to login.live.com from a Python script on your local machine with the right credentials but with cookies copied from your Chrome session, it will not work.
Decide what you actually want to do: read emails, send email, or just get contacts from the address book. The script's logic will be different for each. For example, mail is available via the outlook.com system, while contacts live at people.live.com (which, as far as I remember, also has an API).
If you want to emulate a login the way Chrome does it, you need to:
Get and collect all cookies from the outlook.com main page via your Python script, not forgetting all the redirects :)
Send a request with the collected cookies and your credentials to login.live.com (outlook.com will redirect there).
But in my experience, the latest Outlook version (both the regular and Outlook Preview systems) detects a suspicious login attempt about 90% of the time and sends you to a confirmation page (code or email). So you would end up with an unstable solution. Do you really want to do that?
If you just want to parse the JSON correctly, you need:
import json
data = json.loads(asdf.text)
print(data)
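Note that requests can already do this for you: asdf.json() is equivalent to json.loads(asdf.text), and it raises an error if the body is not valid JSON.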
If you want to see how many requests the browser actually makes, install Firebug and disable clearing of the "Network" panel, then watch how many requests are processed before you are logged in to your account.
To see all the traffic, I suggest using Firefox + Firebug + Tamper Data.
Also, I think it will be quicker to use an existing library like Selenium for browser emulation.
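For example, a minimal Selenium sketch (assuming chromedriver is installed and on your PATH):
from selenium import webdriver

driver = webdriver.Chrome()
# the real browser collects cookies and follows redirects for you
driver.get('https://outlook.com')
print(driver.title)
driver.quit()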
Related
Please look over my code. What am I doing wrong that my request response is empty? Any pointers?
URL in question (it should generate a results page):
https://www.ucr.gov/enforcement/343121222
But I cannot replicate it with Python requests. Why?
import requests

headers = {
    'Host': 'www.ucr.gov',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:105.0) Gecko/20100101 Firefox/105.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive'
}
data = {
    'scheme': 'https;',
    'host': 'www.ucr.gov',
    'filename': '/enforcement/3431212'
}
url = "https://www.ucr.gov/enforcement/3431212"
result = requests.get(url, params=data, headers=headers)
print(result.status_code)
print(result.text)
The page at the link you provided is fully rendered on the client side using JavaScript. This means you won't be able to obtain the same response with a simple HTTP request; requests only fetches the initial, mostly empty HTML shell.
In this case a common solution is what's called headless scraping, which means automating a headless browser so that it accesses the website's content the way a regular client would. In Python, headless scraping can be implemented with several libraries, including Selenium and Playwright (Puppeteer itself is a Node.js library; its Python port is Pyppeteer).
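A minimal sketch with headless Chrome via Selenium (the fixed sleep is a crude assumption; an explicit WebDriverWait on a known page element would be more robust):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

options = Options()
options.add_argument('--headless')  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get('https://www.ucr.gov/enforcement/3431212')
time.sleep(5)  # wait for the client-side JavaScript to render the results
print(driver.page_source)
driver.quit()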
I'm trying to automate the downloading of a text file from a shared link that was sent to me by email. The original link points to a folder containing two files, but I got the direct download link of the file I need, which is:
https://'abc'-my.sharepoint.com/personal/gamma_'abc'/_layouts/15/download.aspx?UniqueId=a0db276e%2Ddf75%2D49b7%2Db671%2D1c49e365ef3f
When I enter the above URL into a web browser, I get the popup option to open or download the file. I'm trying to write some Python code to download the file automatically, and this is what I've come up with so far:
import requests

url = "https://<abc>-my.sharepoint.com/personal/gamma_<abc>/_layouts/15" \
      "/download.aspx?UniqueId=a0db276e%2Ddf75%2D49b7%2Db671%2D1c49e365ef3f"
hdr = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.5',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Connection': 'keep-alive'
}
myfile = requests.get(url, headers=hdr)
with open('c:/users/scott/onedrive/desktop/gamma.las', 'wb') as f:
    f.write(myfile.content)
I originally tried without the User-Agent, and when I opened gamma.las it contained only 403 FORBIDDEN. If I send the headers too, the file contains the HTML of what looks like a Microsoft login page, so I'm assuming I'm missing some authentication step.
I have no affiliation with this organization; someone sent me this link by email so I could download a text file, which works fine through the browser but not via Python. I don't log in to anything to get it, as I have no username or password with this domain.
Am I able to do this using Requests? If not, can I use the REST API without user credentials for this company's SharePoint?
Not supplying your credentials most probably means you are (implicitly) using built-in Windows authentication in your organization. Check whether this helps: Handling windows authentication while accessing url using requests
The Python library mentioned there to handle built-in Windows auth is requests-negotiate-sspi. I'm not sure whether it will work with federation (your website ends with ".sharepoint.com", meaning you are probably using federation as well), but it may be worth trying.
So, I would try something like this (I doubt the headers really matter in your case, but you could try adding them as well):
import requests
from requests_negotiate_sspi import HttpNegotiateAuth

url = ...
# HttpNegotiateAuth authenticates with the credentials of the logged-in Windows user
myfile = requests.get(url, auth=HttpNegotiateAuth())
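Since you say you have no account on that tenant, another thing that may be worth a try (purely a sketch; the share URL below is a hypothetical stand-in for the folder link from your email): open the shared-folder link with a requests.Session first, so that any anonymous-share cookies SharePoint sets are stored, then fetch the download URL from the same session.
import requests

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'

# 1. Visit the share link from the email (hypothetical URL) so the cookies
#    granting anonymous access end up in the session's cookie jar.
session.get('https://<abc>-my.sharepoint.com/:f:/g/personal/gamma_<abc>/<share-token>')

# 2. Reuse those cookies for the direct download.
resp = session.get('https://<abc>-my.sharepoint.com/personal/gamma_<abc>/_layouts/15'
                   '/download.aspx?UniqueId=a0db276e%2Ddf75%2D49b7%2Db671%2D1c49e365ef3f')
resp.raise_for_status()
with open('gamma.las', 'wb') as f:
    f.write(resp.content)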
My problem:
I want to scrape the following website: https://www.coches.net/segunda-mano/.
But every time I open it with Python Selenium, I get the message that they have detected me as a bot.
How can I bypass this detection?
First I tried simple code with Selenium:
from selenium import webdriver

browser = webdriver.Chrome('C:/Python38/chromedriver.exe')
URL = 'https://www.coches.net/segunda-mano/'
browser.get(URL)
Then I tried it with requests, but that doesn't work either:
from fake_useragent import UserAgent
import requests

ua = UserAgent()
headers = {'User-Agent': ua.random}
URL = 'https://www.coches.net/segunda-mano/'
r = requests.get(URL, headers=headers)
print(r.status_code)
In this case I get 403, the status code stating that access to the URL is prohibited.
I don't know how to get into this webpage without being blocked. I would be very grateful for your help. Thanks in advance.
Selenium is fairly easily detected, especially by all major anti-bot providers (Cloudflare, Akamai, etc).
Why?
Selenium, and most other major webdrivers, set a browser variable (that websites can access) called navigator.webdriver to true. You can check this yourself by heading to your Google Chrome console and running console.log(navigator.webdriver). On a normal browser, it will be false.
The User-Agent: virtually all devices send what is called a "user agent", which identifies the device (and browser) accessing the website. Selenium's headless User-Agent looks something like this: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/59.0.3071.115 Safari/537.36. Did you catch that? HeadlessChrome is included; this is another route of detection.
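The User-Agent part, at least, is easy to override with a Chrome option; a quick sketch (the UA string here is just an example of a regular desktop Chrome):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# present a regular desktop Chrome user agent instead of HeadlessChrome
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36')
driver = webdriver.Chrome(options=options)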
These are just two of the multiple ways a Selenium browser can be detected, I would highly recommend reading up on this and this as well.
And lastly, if you want an easy, drop-in solution to bypass detection that implements almost all of the concepts we've talked about, I'd suggest using undetected-chromedriver. This is an open source project that tries its best to keep your Selenium chromedriver looking human.
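A minimal usage sketch (assuming pip install undetected-chromedriver):
import undetected_chromedriver as uc

# patches chromedriver so tells like navigator.webdriver are hidden
driver = uc.Chrome()
driver.get('https://www.coches.net/segunda-mano/')
print(driver.title)
driver.quit()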
I think your problem is not bot detection. You can't use plain requests to get the results from that page, because it makes XHR requests behind the scenes, so you would normally reach for Selenium, Splash, etc., but that doesn't seem to be an option in this case.
However, if you dig around the page a bit, you can find which URL is requested behind the scenes to display the results. I did that research and found this endpoint (https://ms-mt--api-web.spain.advgo.net/search); it returns JSON, which will ease your work in terms of parsing. Using Chrome dev tools I grabbed the cURL request, mapped it to Python requests, and obtained this code:
import json
import requests

headers = {
    'authority': 'ms-mt--api-web.spain.advgo.net',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'accept': 'application/json, text/plain, */*',
    'x-adevinta-channel': 'web-desktop',
    'x-schibsted-tenant': 'coches',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'content-type': 'application/json;charset=UTF-8',
    'origin': 'https://www.coches.net',
    'sec-fetch-site': 'cross-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.coches.net/',
    'accept-language': 'en-US,en;q=0.9,es;q=0.8',
}
data = '{"pagination":{"page":1,"size":30},"sort":{"order":"desc","term":"relevance"},"filters":{"categories":{"category1Ids":[2500]},"offerTypeIds":[0,2,3,4,5],"isFinanced":false,"price":{"from":null,"to":null},"year":{"from":null,"to":null},"km":{"from":null,"to":null},"provinceIds":[],"fuelTypeIds":[],"bodyTypeIds":[],"doors":[],"seats":[],"transmissionTypeId":0,"hp":{"from":null,"to":null},"sellerTypeId":0,"hasWarranty":null,"isCertified":false,"luggageCapacity":{"from":null,"to":null},"contractId":0}}'
while True:
    response = requests.post('https://ms-mt--api-web.spain.advgo.net/search', headers=headers, data=data).json()
    # you should parse items here.
    print(response)
    if not response["items"]:
        break  # empty page: no more results
    data_dict = json.loads(data)
    data_dict["pagination"]["page"] = data_dict["pagination"]["page"] + 1  # get the next page.
    data = json.dumps(data_dict)
There are probably a lot of headers and body fields that are unnecessary; you can test and trim to improve it.
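For the "parse items here" step, something along these lines might work inside the loop (the field names are assumptions on my part; inspect the actual JSON to confirm them):
for item in response.get("items", []):
    # hypothetical fields; check the real response structure
    print(item.get("title"), item.get("price"))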
Proxy rotation can be useful if you are scraping a large amount of data:
options = Options()
options.add_argument('--proxy-server=http://<ip>:<port>')
Then initialize the Chrome driver with the options object.
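Put together, a minimal sketch (the proxy address is a placeholder to replace with one from your rotation pool):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--proxy-server=http://203.0.113.5:8080')  # placeholder proxy
driver = webdriver.Chrome(options=options)
driver.get('https://www.coches.net/segunda-mano/')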
I've been scraping some real estate pages and hit a wall with one particular website. It started while trying to scrape phone numbers hidden behind a JavaScript onclick event. I don't really know much about JS, but from what I can tell this is somehow intertwined with displaying ads.
After some closer inspection I found this JSON data on each page:
"data": {
"advert": {
"...,
"phoneObj":[ {
"phone": "735", "phoneCode": "173-28-189-69-82-145-233-192-109-58-19-5-226-110-115-225-135-77-50-22-83-36-187-139-85-8-219-95-87-164-33-33-139-78-248-201"
}
}
Playing around with the web dev tools, I determined that this phoneCode is used to get the real phone number by passing its value to a special API URL. I scraped the phoneCode, made another request with this special URL and...
Everything worked!
Unfortunately, after a few successful requests I started receiving 403 errors:
Access Denied
You don't have permission to access "http://www.host.com/frontera/api/item/owner/phone/173-28-189-69-82-145-233-192-109-58-19-5-226-110-115-225-135-77-50-22-83-36-187-139-85-8-219-95-87-164-33-33-139-78-248-201" on this server.
Reference #18.97645e68.1577009665.1a1da860
I'm not trying to go very fast while scraping those pages, and I don't really think the request limits are that low. I opened a bunch of browser windows and tried clicking those numbers manually, and didn't get any problems whatsoever.
My first thought was that it has to do with proper sessions, so I immediately started tinkering with requests.Session() for cookie persistence and more custom headers:
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'pl,en-US;q=0.7,en;q=0.3',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Host': 'www.host.com',
    'Pragma': 'no-cache'
}
But this doesn't really help at all. Is there anything I can do to better pinpoint what the problem is?
Basically, the API is protected by AkamaiGHost, which is a well-known web application firewall.
If you browse the API via the browser, it will let you do whatever you want without any kind of block.
Once you call the page from code, it will allow you once or a few times, but once it analyzes your connection it will block your IP address, so you will need to use Tor.
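A minimal sketch of routing requests through Tor (assumptions: a local Tor daemon is listening on port 9050, and requests is installed with SOCKS support, i.e. pip install requests[socks]):
import requests

# socks5h makes DNS resolution go through Tor as well
proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050',
}
r = requests.get('https://check.torproject.org', proxies=proxies)
print(r.status_code)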
Here's a solution:
import requests
import time
# cookies copied from a live browser session; they expire, so refresh them from DevTools when this stops working
headers = {'Cookie': '_csrf=nnukVQ2___D88BVt_3GzfUDG; b4da1ddd423e4e8c32114620d61bbfb1=0bfd4455bf8032a1e5e62a2b2db2ca85; bm_sz=5167FDE07330E3099013E268D18AFB4A~YAAQHeF6XED5n3NuAQAAoV5mLQb4rWWWvSkll4rpV5LNOx8bW8lYWVGiw+SKyXZCLJjazkuhzG5U9GpJexyQW4kSullpdq9N7ImTrLC07uFTnoDvyWAcmsXiAtPxUSoxK5ZCSF9xrcr/ZSBrV4ZC3uHuLjryXNfrhaQdQdB1f5gFz+1STW38EPK8TDffNZg=; _abck=68E27962FFDDB95297DAEE8A49E6E038~0~YAAQZ3JlX9+7an5uAQAAYaHaLQOfSeBfhlHt3FQ0LuHf1ZoysOC4RkdU3rtfAAvweF3Ovx8R2/+0IdpC6JNOrX+W/f4QdPA6R31aTAVu/WdgxSNxL5HUQyMqQG1CV1NarTEIramfKxO8++LIpFwfZ2KanBQZbodULrgJAB69ID1tzBz+RMAeBUom/MXsMID9SxRy95qp1lQF+RBxl8t4XlZJF0+2FxqdYEDrlsl2RxO/yqxhn5Z/Xb13c9gnQMn2036VVMhcZVlA2i6n1XppqFyLBoymiMeswoejjYIjRsTexO0jNuvg1LTcRglGn9Umde2l0mls~-1~-1~-1; onap=16f2d666288x105b1c23-3-16f2dd863cfx6b41b522-5-1577024110; ldTd=true; _gcl_au=1.1.1963230701.1577014683; lqstatus=1577022670|16f2dd863cfx6b41b522|gre-9806; laquesis=gre-10591#b#gre-9806#b; laquesisff=; _ga=GA1.2.38751210.1577014684; _gid=GA1.2.543658842.1577014684; __gads=ID=9053a074379a671c-22d104a743a50087:T=1577014685:S=ALNI_MaJQJ-d6rutVuf7LnxOXmGIR03RoA; __gfp_64b=9c7FPwH7yWthkZRHZr3Y4Z0rh6LMYCLWXZPKlTLdINP.a7; PHPSESSID=4dea16ddf9b53cc9374c4e9033cf62a9; mobile_default=desktop; optimizelyEndUserId=oeu1577015678908r0.8964587898345969; lastCatType=101; cookieBarSeen=true; ak_bmsc=6029E960D640205FDEA262145E90C8B65F657267692E0000C972FF5DA7317F36~plNg4UFj14Rs42ExadiFOY0WPsw4RYaVN1W8pQRWgKVubwFxZ4i2E/x+XeH/ae7m3d3fxkGLQG32M7Yc6KatTpznk2ydv7RHJ0WVcaitQBOkOMldJBO36S8oNEl/zrXF3DmMLKg/5A1uOlTPtxjjuIg2baCJZa/9plv2nBg5U+sKZL/VwtKFWJ0QFFi3vJqotOovcwY2BtNel+GVx86sBFeBSxDIbcs7mk9KaoODLqU6IvrZCksH2qM9/uxonw9oZz; bm_sv=B875EC57E094201C119B399A54AB5B08~x4qQoaVc7K+pGqLxOHNEbMgykyIKaofv8b6aP/lHqsuoqKLZxD23NY1uv0r0qOEpWbMTXpL7e5Oj51Ll+GlJ5uuSVZ+/0SbWYqWXHaSKzIL5x0+v7p4ZYJIQBwHCIhMMq4sXjBuf5HRlmeJN4pTvqDjHx3dYDlNbf/6ktsqdsik=; _gat_clientNinja=1'}
for item in range(10):
    r = requests.get("https://www.otodom.pl/frontera/api/item/owner/phone/154-147-95-63-124-231-56-151-181-24-172-166-153-110-202-140-185-214-191-162-155-200-255-142-82-184-41-75-23-189-204-95-97-210-122-115", headers=headers)
    print(r.text)
    time.sleep(10)
You are hitting an API that is probably not intended for external consumption, at least not the way you're using it. 403 Forbidden means that you have explicitly been denied the request. Most likely, someone saw your requests and actively blacklisted you, or an adaptive firewall has blocked you.
I am trying to use httplib2 to log in to a web page. I am able to log in to the page by simply opening the following URL in a Chrome incognito window:
https://domain.com/auth?name=USERNAME&pw=PASSWORD
I tried the following code to emulate this login with httplib2:
from httplib2 import Http
h = Http(disable_ssl_certificate_validation=True)
resp, content = h.request('https://domain.com/auth?name=USERNAME&pw=PASSWORD')
Unfortunately, this request does not lead to a successful login.
I tried changing the request headers to match those provided by Chrome:
headers = {
    'Host': 'domain.com',
    'Connection': 'keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8'
}
resp, content = h.request('https://domain.com/auth?name=USERNAME&pw=PASSWORD', 'GET', headers=headers)
This changes the response slightly, but still does not lead to a successful login.
I tried inspecting the actual network traffic with Wireshark but since it's HTTPS and thus encrypted, I can't see the actual traffic.
Does anybody know what the difference in requests between Chrome and httplib2 could be? Maybe httplib2 changes some of my headers?
Following Games Brainiac's comment, I ended up simply using Python Requests instead of httplib2. The following requests code works out of the box:
import requests
session = requests.Session()
response = session.get('https://domain.com/auth?name=USERNAME&pw=PASSWORD')
Further requests with the same username/password can simply be performed on the Session object:
...
next_response = session.get('https://domain.com/someOtherPage')
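This works because requests.Session stores the cookies the server sets during the auth request and sends them automatically on every subsequent request. If you want to verify that the login actually set something, you can inspect the cookie jar:
print(session.cookies.get_dict())  # the auth cookies the session will reuse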