Python script to fetch URL protected by DES/Kerberos - python

I have a Python script that does an automatic download from a URL once a day.
Recently the authentication protecting the URL was changed. To get it to work with Internet Explorer I had to enable DES for Kerberos by adding SupportedEncryptionTypes = 0x7FFFFFFF to a registry entry somewhere. Then IE prompts me for my domain/user/password when I browse to the site.
My Python code that was working before is:
def __build_ntlm_opener(self):
    passman = HTTPPasswordMgrWithDefaultRealm()
    passman.add_password(None, self.answers_url, self.ntlm_username, self.ntlm_password)
    ntlm_handler = HTTPNtlmAuthHandler(passman)
    opener = urllib.request.build_opener(ntlm_handler)
    opener.addheaders = [
        # ('User-agent', 'Mozilla/5.0 (Windows NT 6.0; rv:5.0) Gecko/20100101 Firefox/5.0')
        ('User-agent', 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)')
    ]
    return opener
Now the code is failing with a simple 401 when using the opener:
urllib.error.HTTPError: HTTP Error 401: Unauthorized
I don't know much about Kerberos or DES, and from what I've seen so far I can't tell whether urllib supports them.
Is there any 3rd party library or trick I can use to get this working again?
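One option is to switch from urllib to the requests library with a Kerberos/SPNEGO authentication plugin. The sketch below uses the third-party requests-kerberos package and assumes the machine already has a valid Kerberos ticket (from a Windows domain logon or kinit); the URL and output filename are placeholders, not values from the question. If the server still accepts NTLM, the requests-ntlm package (HttpNtlmAuth) works the same way.
# Minimal sketch, assuming Kerberos/SPNEGO negotiation and an existing ticket.
# pip install requests requests-kerberos
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

url = "https://example.com/protected/report.csv"  # placeholder URL
response = requests.get(url, auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL))
response.raise_for_status()
with open("report.csv", "wb") as f:
    f.write(response.content)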

You could try using Selenium's webdriver to drive a browser directly. I do that sometimes when I want to scrape sites that are dynamically generated. Here's a code example for opening a page and entering a password:
from selenium import webdriver

b = webdriver.Chrome()
b.get('http://www.example.com')

# Fill in the login form and submit it
username_field = b.find_element_by_id('username')
username_field.send_keys('my_username')
password_field = b.find_element_by_id('password')
password_field.send_keys('secret')
b.find_element_by_link_text('login').click()
That would get you past a typical login screen. Then
b.page_source
will give you the source code for the page, even if it was mainly generated with JavaScript.
The source code is very simple to parse: http://code.google.com/p/selenium/source/browse/trunk/py/selenium/webdriver/remote/webelement.py
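If you then want to pull data out of that HTML, a common follow-up is to hand b.page_source to a parser such as BeautifulSoup. A minimal sketch, assuming beautifulsoup4 is installed and that the page has table cells you care about:
from bs4 import BeautifulSoup

soup = BeautifulSoup(b.page_source, 'html.parser')
# Example: collect the text of every table cell on the page
cells = [td.get_text(strip=True) for td in soup.find_all('td')]
print(cells)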

Related

How to complete geetest (captcha) when scraping, by python-requests, while request values are taken by solving captcha manually?

I'm trying to scrape a website that uses DataDome, and after some requests I have to complete a GeeTest (slider captcha puzzle).
Here is a sample link to it:
captcha link
I've decided not to use selenium (at least for now) and I'm trying to solve my problem with the Python requests module.
My idea was to complete the GeeTest myself and then send, from my program, the same request that my web browser sends after completing the slider.
To start, I scraped the HTML that the website returns when the captcha appears:
<head><title>allegro.pl</title><style>#cmsg{animation: A 1.5s;}#keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script>var dd={'cid':'AHrlqAAAAAMAsB0jkXrMsMAsis8SQ==','hsh':'77DC0FFBAA0B77570F6B414F8E5BDB','t':'fe','s':29701,'host':'geo.captcha-delivery.com'}</script><script src="https://ct.captcha-delivery.com/c.js"></script></body></html>
I couldn't access the iframe where the most important info is, but I found that the link to that iframe can be built from the info in the HTML above. As you can see in the link above:
cid is initialCid, hsh is hash, etc.; the other cid in the link is a cookie I got at the moment the captcha appeared.
I know there are services that can solve the captcha for you, but I decided to complete the captcha myself, then copy the exact request, including cookies and headers, into my program and send it with requests. For now I'm doing it by hand, but it doesn't work. The response is 403, while in the browser it's 200 and a redirect.
Here is a sample request that my browser sends after completing the captcha:
sample request
I'm sending it in my program with:
import requests

s = requests.Session()
s.headers = headers                     # headers copied from the browser request
s.cookies.set(cookie_from_web_browser)  # cookie copied from the browser (pseudocode)
captcha = s.get(request)                # URL captured from the browser
The response is 403 and I have no idea how to make it work; any help is appreciated.
Captchas are really tricky in the web scraping world; most of the time you can bypass one by solving the captcha manually and then taking the cookie from the returned response and plugging it into your script. Depending on the website, the cookie could hold for 15 minutes, a day, or even longer.
The other alternative is to use captcha-solving services such as https://www.scraperapi.com/, where you pay a fee per x amount of requests but you won't run into the captcha issue, as they solve the captchas for you.
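A minimal sketch of the cookie-reuse idea with requests, assuming the DataDome cookie (typically named datadome) was copied out of the browser's developer tools after solving the captcha manually; the cookie name, domain and target URL below are illustrative assumptions, not values from the question:
import requests

# Value pasted from the browser after solving the captcha manually.
solved_cookie = "PASTE_COOKIE_VALUE_HERE"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36",
})
# "datadome" cookie name and ".allegro.pl" domain are assumptions for this example.
session.cookies.set("datadome", solved_cookie, domain=".allegro.pl")

response = session.get("https://allegro.pl/")
print(response.status_code)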
Use a header parameter to solve this problem, like so:
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "referer": "https://www.google.com/"
}
r = requests.get("http://webcache.googleusercontent.com/search?q=cache:www.naukri.com/jobs-in-andhra-pradesh", headers=header)
Test it with the web cache before running it against the real URL.

Scraping Data from website with a login page

I am trying to log in to my university website using Python and the requests library with the following code, but I am not able to.
import requests

payloads = {"User_ID": <username>,
            "Password": <password>,
            "option": "credential",
            "Log in": "Log in"
            }

with requests.Session() as session:
    session.post('', data=payloads)  # login URL omitted in the question
    get = session.get("")            # target URL omitted in the question
    print(get.text)
Does anyone have any idea on what I am doing wrong?
In order to log in you will need to post all the information requested by the <input> tags. In your case you will also have to provide the hidden inputs. You can do this by scraping those values and then posting them. You might also need to send some headers to simulate browser behaviour.
from lxml import html
import requests

s = requests.Session()

login_url = "https://intranet.cardiff.ac.uk/students/applications"
session_url = "https://login.cardiff.ac.uk/nidp/idff/sso?sid=1&sid=1"

to_get = s.get(login_url)
tree = html.fromstring(to_get.text)
hidden_inputs = tree.xpath(r'//form//input[@type="hidden"]')
payloads = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}
payloads["Ecom_User_ID"] = "<username>"
payloads["Ecom_Password"] = "<password>"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
result = s.post(session_url, data=payloads, headers=headers)
Hope this works
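A quick sanity check that the POST above actually logged you in might look like this; what a successful response looks like depends on the site, so treat it as a sketch only:
# Rough checks only; adapt to whatever the site returns on success.
print(result.status_code)              # expect 200 (redirects already followed)
print(result.url)                      # should no longer be the login URL
print("Ecom_Password" in result.text)  # True would suggest we're still on the form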
In order to log in to a website with Python, you will often need a more involved approach than the requests library alone, because you have to simulate a browser making the login requests to the school's servers. The server needs to believe it is getting the request from a browser; it then returns the resulting page, which you have to render before you can scrape it. A good way to do this in Python is with the selenium module.
I would recommend googling around to learn more about selenium. This blog post is a good example of using selenium to log in to a web page, with detailed explanations of what each line of code is doing. This SO answer on using selenium to log in to a website is also a good entry point.
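As a rough sketch only: the field names below (User_ID, Password, Log in) are taken from the payload keys in the question and may not match the university's actual login page, and the login URL is a placeholder.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.university.edu/login")  # placeholder login URL

# Field names follow the payload keys from the question; verify them
# against the real page's HTML before relying on this.
driver.find_element_by_name("User_ID").send_keys("<username>")
driver.find_element_by_name("Password").send_keys("<password>")
driver.find_element_by_name("Log in").click()

print(driver.page_source)  # rendered HTML after login, ready for scraping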

403 Forbidden using Urllib2 [Python]

import urllib
import urllib2

url = 'https://www.instagram.com/accounts/login/ajax/'
values = {'username': 'User',
          'password': 'Pass'}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers={'User-Agent': "Mozilla/5.0"})
con = urllib2.urlopen(req)
the_page = con.read()
Does anyone have any ideas about this? I keep getting the error "403 forbidden".
It's possible Instagram has something that won't let me connect via Python (I don't want to connect via their API). What on earth is going on here, does anyone have any ideas?
Thanks!
EDIT: Adding more info.
The error I was getting was this
This page could not be loaded. If you have cookies disabled in your browser, or you are browsing in Private Mode, please try enabling cookies or turning off Private Mode, and then retrying your action.
I edited my code but am still getting that error.
import urllib
import urllib2
import cookielib

jar = cookielib.FileCookieJar("cookies")
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
print len(jar) #prints 0
opener.addheaders = [('User-agent','Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36')]
result = opener.open('https://www.instagram.com')
print result.getcode(), len(jar) #prints 200 and 2
url = 'https://www.instagram.com/accounts/login/ajax/'
values = {'username': 'username',
          'password': 'password'}
data = urllib.urlencode(values)
response = opener.open(url, data)
print response.getcode()
Two important things, for starters:
make sure you stay on the legal side. According to Instagram's Terms of Use:
We prohibit crawling, scraping, caching or otherwise accessing any content on the Service via automated means, including but not limited to, user profiles and photos (except as may be the result of standard search engine protocols or technologies used by a search engine with Instagram's express consent).
You must not create accounts with the Service through unauthorized means, including but not limited to, by using an automated device, script, bot, spider, crawler or scraper.
there is an Instagram API that would help you stay on the legal side and make life easier. There is a Python client: python-instagram
Aside from that, Instagram itself is JavaScript-heavy and you may find it difficult to work with using just urllib2 or requests. If, for some reason, you cannot use the API, look into browser automation via selenium. Note that you can also automate a headless browser like PhantomJS. Here is sample code to log in:
from selenium import webdriver
USERNAME = "username"
PASSWORD = "password"
driver = webdriver.PhantomJS()
driver.get("https://www.instagram.com")
driver.find_element_by_name("username").send_keys(USERNAME)
driver.find_element_by_name("password").send_keys(PASSWORD)
driver.find_element_by_xpath("//button[. = 'Log in']").click()
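PhantomJS development has since been suspended and newer Selenium releases have dropped PhantomJS support, so a similar sketch using headless Chrome may age better. This assumes Chrome and a matching chromedriver are installed; the options= keyword needs a reasonably recent Selenium (older versions call it chrome_options=):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
driver.get("https://www.instagram.com")
driver.find_element_by_name("username").send_keys("username")
driver.find_element_by_name("password").send_keys("password")
driver.find_element_by_xpath("//button[. = 'Log in']").click()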

Unable to get google search results python

I'm building a script to scrape Google search results. I've got this far:
import urllib
keyword = "google"
print urllib.urlopen("https://www.google.co.in/search?q=" + keyword).read()
But it gives me a reply as follows:
<!DOCTYPE html><html lang=en><meta charset=utf-8><meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width"><title>Error 403 (Forbidden)!!1</title><style>*{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}#media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/errors/logo_sm_2.png) no-repeat}#media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/errors/logo_sm_2_hr.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/errors/logo_sm_2_hr.png) 0}}#media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/errors/logo_sm_2_hr.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:55px;width:150px}</style><a href=//www.google.com/><span id=logo aria-label=Google></span></a><p><b>403.</b> <ins>That’s an error.</ins><p>Your client does not have permission to get URL <code>/search?q=google</code> from this server. (Client IP address: 117.196.168.89)<br><br>
Please see Google's Terms of Service posted at http://www.google.com/terms_of_service.html
<BR><BR><P>If you believe that you have received this response in error, please report your problem. However, please make sure to take a look at our Terms of Service (http://www.google.com/terms_of_service.html). In your email, please send us the <b>entire</b> code displayed below. Please also send us any information you may know about how you are performing your Google searches-- for example, "I'm using the Opera browser on Linux to do searches from home. My Internet access is through a dial-up account I have with the FooCorp ISP." or "I'm using the Konqueror browser on Linux to search from my job at myFoo.com. My machine's IP address is 10.20.30.40, but all of myFoo's web traffic goes through some kind of proxy server whose IP address is 10.11.12.13." (If you don't know any information like this, that's OK. But this kind of information can help us track down problems, so please tell us what you can.)</P><P>We will use all this information to diagnose the problem, and we'll hopefully have you back up and searching with Google again quickly!</P>
<P>Please note that although we read all the email we receive, we are not always able to send a personal response to each and every email. So don't despair if you don't hear back from us!</P>
<P>Also note that if you do not send us the <b>entire</b> code below, <i>we will not be able to help you</i>.</P><P>Best wishes,<BR>The Google Team</BR></P><BLOCKQUOTE>[long diagnostic code block omitted]</BLOCKQUOTE>
Doesn't Google allow its pages to be scraped?
Actually, Google doesn't, in the sense that it blocks bots. But you can use mechanize to fake a browser and get the results.
import mechanize

keyword = "google"  # the search term from the question

chrome = mechanize.Browser()
chrome.set_handle_robots(False)
chrome.addheaders = [('User-agent',
                      'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36')]

base_url = 'https://www.google.co.in/search?q='
search_url = base_url + keyword.replace(' ', '+')
htmltext = chrome.open(search_url).read()
try this. I hope it helps.
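If you want the result links rather than the raw HTML, mechanize can also walk the links on the page it just opened. The '/url?q=' filter below is an assumption about how Google wraps result links and may need adjusting:
# Iterate over the links mechanize parsed from the results page.
for link in chrome.links():
    if link.url.startswith('/url?q='):
        print link.url, link.text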
You could also fake the headers in urllib2 to get the results.
Something like:
import urllib2
keyword = "google"
url = "https://www.google.co.in/search?q=" + keyword
# Build a opener
opener = urllib2.build_opener()
# In case you have proxy then u need to build a ProxyHandler opener
#opener = urllib2.build_opener(urllib2.ProxyHandler(proxies={"http": "http://proxy.corp.ads:8080"}))
# To fake the browser
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print opener.open(url).read()
Google sees your script under a different user-agent (if you're using requests it will be python-requests) and treats it accordingly.
All you need is to specify a browser user-agent (Chrome, Firefox, Edge, IE, Safari...) so Google will treat the request as coming from a "user", i.e. fake a real browser visit.
If you're using the requests library, you can specify it this way (lists of user-agents are published on various websites):
import requests

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get(
    'https://www.google.com/search?q=pizza is awesome', headers=headers).text
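To pull titles out of that response, one option is to hand the HTML to BeautifulSoup. The markup assumption below (result titles inside <h3> tags) is illustrative only and breaks whenever Google changes its layout:
from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"
}
html = requests.get("https://www.google.com/search?q=pizza",
                    headers=headers).text
soup = BeautifulSoup(html, "html.parser")

# Result titles are assumed to sit in <h3> tags; verify against the live markup.
for h3 in soup.find_all("h3"):
    print(h3.get_text())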
I answered the question on how to scrape Google Search result titles, summary and links with example code here.
Alternatively, you can use third-party Google Search Engine Results API
or Google Organic Results API from SerpApi. It's a paid API with a free trial.
Check out Playground to test and see the output.
Code to get the raw HTML response:
import os
import urllib.request

from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "london",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

html = results['search_metadata']['raw_html_file']
print(urllib.request.urlopen(html).read())
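If you want parsed results rather than raw HTML, the same response normally carries structured data too; a short sketch, assuming the usual organic_results field is present:
# Structured results, assuming the response includes "organic_results".
for result in results.get("organic_results", []):
    print(result.get("title"), result.get("link"))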
Disclaimer, I work for SerpApi.

Can't automate login using python mechanize (must "activate" specific browser)

I seem to have difficulty logging into a website which requires browser authentication.
What happens is when you first log on, the website redirects you to a page saying "We have sent an email to your email, click on the link to authenticate this browser."
I'm using the mechanize module for Python. The page does log in; however, the website never recognizes the "browser", hence the many "Please register this browser" emails! I tried giving custom headers as well as adding a cookie handler as per other examples... no luck. The website thinks the script is a new (unauthorized) browser each time I visit.
Init code looks like this:
self.br = mechanize.Browser(factory=mechanize.RobustFactory())
self.br.add_handler(PrettifyHandler())

cj = cookielib.LWPCookieJar()
self.br.set_cookiejar(cj)

self.br.addheaders = [('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
                      ('User-agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.17 (KHTML, like Gecko) Ubuntu Chromium/24.0.1312.56 Chrome/24.0.1312.56 Safari/537.17'),
                      ('Referer', 'https://www.temp.com/logout'),
                      ('Accept-Encoding', 'gzip,deflate,sdch'),
                      ('Accept-Language', 'en-GB,en-US;q=0.8,en;q=0.6'),
                      ('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'),
                      ]
And my login code looks like this. It fills in a simple html form and submits it.
self.br.open('https://www.temp.com/login')
# Select the first (index zero) form
self.br.select_form(nr=0)
# User credentials
self.br.form['username'] = 'temp'
self.br.form['password'] = 'temp'
# Login
self.br.submit()
# Inventory
body = self.br.response().read().split('\n')
And yet everytime I get this email : "To activate your browser, please click on the following link..." even after I follow the link and activate/authenticate the browser.
If you want to keep the session, try saving the cookies with the cookie jar's save/load functions. Example:
cj = cookielib.LWPCookieJar()
cj.save('cookies.txt', ignore_discard=False, ignore_expires=False)
...
cj.load('cookies.txt', ignore_discard=False, ignore_expires=False)
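Put together with the browser setup from the question, that might look roughly like this. It is a sketch only: whether the site accepts the replayed cookies depends on how it fingerprints the "activated" browser, and if the activation cookie is a session cookie you may need ignore_discard=True when saving and loading.
import os
import cookielib
import mechanize

COOKIE_FILE = 'cookies.txt'

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Reuse cookies from a previous (already activated) session if we have them.
if os.path.exists(COOKIE_FILE):
    cj.load(COOKIE_FILE, ignore_discard=False, ignore_expires=False)

br.open('https://www.temp.com/login')
br.select_form(nr=0)
br.form['username'] = 'temp'
br.form['password'] = 'temp'
br.submit()

# Persist the session (including any "activated browser" cookie) for next time.
cj.save(COOKIE_FILE, ignore_discard=False, ignore_expires=False)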
