I am doing some summer research with my school. I have to download ~2000 images from a restricted site with graphs. I could absolutely do this manually, but I know it would be much faster with some sort of script. I've settled on Python because I assume it will be easier than another language. I have the URL for the site and the generic link for the database where the images are stored. I plan to feed the program a list of orbit numbers, and it will download the appropriate images. The main issue is that when you visit the site, it pops up a login window through the browser, not an HTML form, so I cannot view any of the site's code to see how to submit the login.
I have already tried using urllib and cookielib. I realize that urllib2 does not work in Python 3. I have also looked into requests and mechanize, with no luck.
import cookielib
import urllib2

def cook():
    url = "SITE"
    cj = cookielib.LWPCookieJar()
    authinfo = urllib2.HTTPBasicAuthHandler()
    realm = "realmName"
    username = "USERNAME"
    password = "PASS"
    host = "HOST"
    authinfo.add_password(realm, host, username, password)
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj), authinfo)
    urllib2.install_opener(opener)

    # Create request object
    txheaders = {'User-agent': "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"}
    try:
        req = urllib2.Request(url, None, txheaders)
        cj.add_cookie_header(req)
        f = urllib2.urlopen(req)
    except IOError as e:
        print("Failed to open", url)
        if hasattr(e, 'code'):
            print("Error code:", e.code)
    else:
        print(f)
        print(f.read())
        print(f.info())
        f.close()

        print('Cookies:')
        for index, cookie in enumerate(cj):
            print(index, " : ", cookie)
        cj.save("cookies.lwp")
The code, obviously, just throws a bunch of errors. I really just need to be able to get into the site and download my images.
I was able to fix it by bypassing the SSL certificate verification. I know it's not a great method, but it does what I need it to. Thanks, guys!
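In case it helps anyone, here is a minimal sketch of what that workaround looks like with requests, assuming HTTP Basic Auth behind the browser popup and a predictable image URL pattern; the base URL, credentials, and orbit numbers are all placeholders:

import requests

# All names below are placeholders for the real restricted site.
BASE_URL = "https://example.edu/database/orbit_{}.png"
orbits = [1001, 1002, 1003]

session = requests.Session()
session.auth = ("USERNAME", "PASS")  # answers the browser's Basic Auth popup
session.verify = False               # bypasses SSL certificate verification

for orbit in orbits:
    response = session.get(BASE_URL.format(orbit))
    response.raise_for_status()
    with open("orbit_{}.png".format(orbit), "wb") as f:
        f.write(response.content)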
You could use the Selenium WebDriver to automate the login and download the images; there are tutorials on scraping data from login-required websites that walk through exactly this.
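For illustration, a minimal sketch of that approach, assuming a plain HTML login form; the URL and field names are placeholders, and the bulk download reuses the browser's cookies with requests:

from selenium import webdriver
import requests

# Placeholder URL and field names; inspect the real login form for the actual ones.
driver = webdriver.Firefox()
driver.get("https://example.com/login")
driver.find_element_by_name("username").send_keys("USERNAME")
driver.find_element_by_name("password").send_keys("PASSWORD")
driver.find_element_by_name("submit").click()

# Copy the authenticated session cookies into requests for the bulk image downloads.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

image = session.get("https://example.com/images/orbit_1001.png")
with open("orbit_1001.png", "wb") as f:
    f.write(image.content)
driver.quit()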
I know there are a lot of questions about this topic, and I have tried most of them.
My goal is to get the article from this page and use it in GAE (Google App Engine).
When I try to log in, it redirects to a long URL; after I log in there, it redirects back to the article.
First I tried urllib2, as mentioned in "how to login to a website with python and mechanize", and it didn't work.
Then I took the SelectLoginForm and login functions from https://github.com/cdhigh/KindleEar/blob/master/books/base.py, and that didn't work either.
Selenium won't work, because I am going to use this in GAE; I guess GAE can't support Selenium.
So I started looking into the mechanize module. My current code is:
# -*- coding: cp1254 -*-
import cookielib
import urllib2
import mechanize

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [("User-agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")]

br.open('https://hurpass.com/iframe/login?appkey=52da7ef64037f9497f0acb091390051062215&secret=52da7f0c4037f9497f0acb0b1390051084754&domain=sosyal.hurriyet.com.tr&callback_url=http://sosyal.hurriyet.com.tr/Account/AutoLogin?returnUrl=http://sosyal.hurriyet.com.tr/yazar/ahmet-hakan_131/baskanlik-diktatorluk-getirir-diyenleri-girtlaklamak-istiyorum_28116073&referer=http://sosyal.hurriyet.com.tr&user_page=http://sosyal.hurriyet.com.tr/Account/AutoLogin?returnUrl=http://sosyal.hurriyet.com.tr/yazar/ahmet-hakan_131/baskanlik-diktatorluk-getirir-diyenleri-girtlaklamak-istiyorum_28116073&is_mobile=0&session_timeout=0&is_vative=0&email=')
br.select_form(name='frm_login')
br["email"] = "tasklak#hotmail.com"
br["password"] = "123456"
br.submit(type="submit")

url = 'http://sosyal.hurriyet.com.tr/yazar/ahmet-hakan_131/baskanlik-diktatorluk-getirir-diyenleri-girtlaklamak-istiyorum_28116073'
last_response = br.response()
http_header_dict = last_response.info().dict
html_string_list = last_response.readlines()
html_data = "".join(html_string_list)

page = br.open(url)
print page.read().decode("UTF-8")

ha = open("test.html", 'w')
ha.write(html_data)
ha.close()
Again, I can't get this working, but if I open the HTML file it created, it redirects to the logged-in article page. Could this be a mechanize redirection problem, or is it simply impossible to log in to this page?
Edit, after mihail's answer:
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
user = 'tasklak#hotmail.com'
password = '123456'
# Replicate the site's JavaScript XOR encoding of the password
xor_password = ''.join(chr(12 ^ ord(c)) for c in password)
auth_url = 'http://auth.hurriyet.com.tr/api/loginuser/{}/?{}'.format(user, xor_password)
url = 'http://www.hurriyet.com.tr/anasayfa/'
# Pull the session ID out of the auth response
sessionidd = urllib2.urlopen(auth_url).read().split(',')[1].split('\"')[3]
print sessionidd
opener.open(url + ';ASPSESSIONID=' + sessionidd)
print cj
Edit 2:
sessionidd = urllib2.urlopen(auth_url).read().split(',')[1].split('\"')[3]
print sessionidd
opener.open(url)
k = 0
for a in cj:
    if k < 2:
        a.value = sessionidd
    k += 1
print cj
First of all, you should know that if there isn't a publicly available API for this, then it's very likely that what you are doing is not welcome by the website owners, is against their terms of service, and could even be illegal and punishable by law, depending on where you live.
Unless mechanize can interpret JavaScript code (which I doubt, although I might be wrong), it's not going to be very helpful. However, skimming through the links you provided with Chrome's DevTools, it looks like you could implement what you want with a few pure urllib2 requests.
For example, when you log in for the first time, you'll see a GET request to the http://auth.hurriyet.com.tr/api/loginuser/tasklak#hotmail.com/?%3D%3E%3F89%3A URL, which includes your username and encoded password and returns some session IDs. The reason mechanize doesn't work is that the password is encoded by JavaScript code that is not interpreted when you submit the form from your script.
Going into the source code of the login form, you'll see that clicking the "Submit" button calls a loginUser() function, in which the password is XOR'ed with the following code:
for (i = 0; i < password.length; ++i) {
    encoded_password += String.fromCharCode(12 ^ password.charCodeAt(i));
}
You would have to rewrite this in Python, so to receive the initial session IDs you'd have something like:
import urllib2
user = 'tasklak#hotmail.com'
password = '123456'
xor_password = ''.join(chr(12 ^ ord(c)) for c in password)
auth_url = 'http://auth.hurriyet.com.tr/api/loginuser/{}/?{}'.format(user, xor_password)
print(urllib2.urlopen(auth_url).read())
It looks like you'll then need to validate the session IDs you received and retrieve session cookies, which you can then use to get the full articles, but I will leave that to you.
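For what it's worth, a rough sketch of that last step might look like the following; this is guesswork, assuming the site accepts the session ID as an ASPSESSIONID cookie (as the edits above try), with the cookie domain and path also being assumptions:

import cookielib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

session_id = '...'  # parsed from the auth response as shown above

# Hand-craft the session cookie; name, domain, and path are guesses.
cookie = cookielib.Cookie(
    version=0, name='ASPSESSIONID', value=session_id,
    port=None, port_specified=False,
    domain='.hurriyet.com.tr', domain_specified=True, domain_initial_dot=True,
    path='/', path_specified=True, secure=False, expires=None,
    discard=True, comment=None, comment_url=None, rest={})
cj.set_cookie(cookie)

article = opener.open('http://sosyal.hurriyet.com.tr/yazar/ahmet-hakan_131/baskanlik-diktatorluk-getirir-diyenleri-girtlaklamak-istiyorum_28116073')
print article.read()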
import urllib
import urllib2
import cookielib

def xueqiuBrower(url, user, passwd):
    login_page = 'http://xueqiu.com/'
    try:
        cj = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        opener.addheaders = [('User-agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')]
        data = urllib.urlencode({'email': user, 'password': passwd})
        opener.open(login_page, data)
        op = opener.open(url)
        data = op.read()
        return data
    except Exception, e:
        print str(e)

if __name__ == '__main__':
    url = 'http://xueqiu.com/'
    name = '....'
    passwd = '....'
    print xueqiuBrower(url, name, passwd)
I use Python 2.7 and want to log in to this website, but it fails and returns "HTTP Error 404: Not Found" and None. Please help me solve this, thanks.
You are using the wrong URL. The correct URL to access the login form is:
http://xueqiu.com/service/login
When you call opener.open with the data argument, Python sends a POST request. However, that does not seem to be allowed on the URL you specified, and the remote site returns the wrong error code: instead of 405 Method Not Allowed, it returns 404 Not Found.
Inspecting the source code of the page by simply right-clicking the login form and choosing "inspect element", revealed the correct URL.
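A minimal corrected sketch, assuming the form fields are still named email and password as in the question's code:

import urllib
import urllib2
import cookielib

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')]

# POST the credentials to the actual login endpoint, then fetch the page.
data = urllib.urlencode({'email': 'user@example.com', 'password': 'secret'})
opener.open('http://xueqiu.com/service/login', data)
print opener.open('http://xueqiu.com/').read()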
I'm working on a screen scraper using BeautifulSoup for what.cd using Python. I came across this script while working and decided to look at it, since it seems to be similar to what I'm working on. However, every time I run the script I get a message that my credentials are wrong, even though they are not.
As far as I can tell, I'm getting this message because when the script tries to log into what.cd, what.cd is supposed to return a cookie containing the information that lets me request pages later in the script. So where the script is failing is:
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username': username,
                               'password': password})
check = opener.open('http://what.cd/login.php', login_data)
soup = BeautifulSoup(check.read())
warning = soup.find('span', 'warning')
if warning:
    exit(str(warning) + '\n\nprobably means username or pw is wrong')
I've tried multiple methods of authenticating with the site including using CookieFileJar, the script located here, and the Requests module. I've gotten the same HTML message with each one. It says, in short, that "Javascript is disabled", and "Cookies are disabled", and also provides a login box in HTML.
I don't really want to mess around with Mechanize, but I don't see any other way to do it at the moment. If anyone can provide any help, it would be greatly appreciated.
After a few more hours of searching, I found the solution to my problem. I'm still not sure why this code works as opposed to the version above, but it does. Here is the code I'm using now:
import urllib
import urllib2
import cookielib
cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
request = urllib2.Request("http://what.cd/index.php", None)
f = urllib2.urlopen(request)
f.close()
data = urllib.urlencode({"username": "your-login", "password" : "your-password"})
request = urllib2.Request("http://what.cd/login.php", data)
f = urllib2.urlopen(request)
html = f.read()
f.close()
Credit goes to carl.waldbieser from linuxquestions.org. Thanks for everyone who gave input.
I have a Perl program that retrieves data from my university library's database, and it works well. Now I want to rewrite it in Python, but I ran into this problem:
<urlopen error [errno 104] connection reset by peer>
The perl code is:
my $ua = LWP::UserAgent->new;
$ua->cookie_jar( HTTP::Cookies->new() );
$ua->timeout(30);
$ua->env_proxy;
my $response = $ua->get($url);
The python code I wrote is:
cj = CookieJar();
request = urllib2.Request(url); # url: target web page
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj));
opener = urllib2.install_opener(opener);
data = urllib2.urlopen(request);
I use a VPN (virtual private network) to log in to my university's library from home, and I tried both the Perl code and the Python code. The Perl code works as expected, but the Python code always hits the "urlopen error".
I googled the problem, and it seems that urllib2 fails to load the environment's proxy settings. But according to the urllib2 documentation, the urlopen() function works transparently with proxies that do not require authentication. Now I'm quite confused. Can anybody help me with this problem?
I tried faking the User-Agent headers as Uku Loskit and Mikko Ohtamaa suggested, and solved my problem. The code is as follows:
proxy = "YOUR_PROXY_GOES_HERE"
proxies = {"http":"http://%s" % proxy}
headers={'User-agent' : 'Mozilla/5.0'}
proxy_support = urllib2.ProxyHandler(proxies)
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler(debuglevel=1))
urllib2.install_opener(opener)
req = urllib2.Request(url, None, headers)
html = urllib2.urlopen(req).read()
print html
Hope it is useful for someone else!
Firstly, as Steve said, you need response.read(), but that's not your problem
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
Can you give details of the error? You can get it like this:
try:
    urllib2.urlopen(req)
except URLError, e:
    print e.code
    print e.read()
Source: http://www.voidspace.org.uk/python/articles/urllib2.shtml
(I put this in a comment but it ate my newlines)
You might find that the requests module is a much easier-to-use replacement for urllib2.
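For instance, a rough equivalent of the flow above with requests (requests honors the http_proxy/https_proxy environment variables by default, much like LWP's env_proxy):

import requests

# A Session keeps cookies across calls, replacing the CookieJar/opener boilerplate.
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})
response = session.get('http://www.uni-database.com')
print response.text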
Did you try specifying the proxy manually?
proxy = urllib2.ProxyHandler({'http': 'your_proxy_ip'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
urllib2.urlopen('http://www.uni-database.com')
If it still fails, try faking your User-Agent header to make it seem that the request is coming from a real browser.
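Combining both suggestions might look like this (the proxy address is a placeholder):

import urllib2

# Explicit proxy plus a browser-like User-Agent header.
proxy = urllib2.ProxyHandler({'http': 'your_proxy_ip'})
opener = urllib2.build_opener(proxy)
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib2.install_opener(opener)
html = urllib2.urlopen('http://www.uni-database.com').read()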
I'm interested in getting the number of friends each of my friends on Facebook has. Apparently the official Facebook API does not allow getting the friends of friends, so I need to get around this (somewhat sensible) limitation somehow. I tried the following:
import sys
import urllib, urllib2, cookielib
username = 'me#example.com'
password = 'mypassword'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'email' : username, 'pass' : password})
request = urllib2.Request('https://login.facebook.com/login.php')
request.add_header('User-Agent','Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.12) Gecko/20101027 Fedora/3.6.12-1.fc14 Firefox/3.6.12')
opener.open(request, login_data)
resp = opener.open('http://facebook.com')
print resp.read()
but I only end up with a captcha page. Any idea how FB is detecting that the request is not from a "normal" browser? I could add an extra step and solve the captcha but that would add unnecessary complexity to the program so I would rather avoid it. When I use a web browser with the same User-Agent string I don't get a captcha.
Alternatively, does anyone have any saner ideas on how to accomplish my goal, i.e. get a list of friends of friends?
Have you tried tracing and comparing HTTP transactions with Fiddler2 or Wireshark? Fiddler can even trace https, as long as your client code can be made to work with bogus certs.
I tried a lot of ways to scrape Facebook, and the only one that worked for me was this:
Install Selenium: the Firefox plugin (Selenium IDE), the server, and the Python client library.
Then, with the Firefox plugin, you can record the actions you take to log in and export them as a Python script; use that as the base for your work (see the sketch below). Basically, I added to this script a request to my web server to fetch a list of things to inspect on FB, and at the end of the script I send the results back to my server.
I could NOT find a way to do it directly from my server with a browser simulator like mechanize or the like! I guess it needs to be done from a client browser.
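For a rough idea, the core of such an exported script might look like the following; the element IDs are assumptions based on Facebook's login page at the time, and Selenium IDE records the real ones for you:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.facebook.com/")

# Element IDs below are assumptions; the IDE export contains the recorded ones.
driver.find_element_by_id("email").send_keys("me#example.com")
driver.find_element_by_id("pass").send_keys("mypassword")
driver.find_element_by_id("loginbutton").click()

# ...navigate to the pages to inspect, then send the results back to the server...
driver.quit()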