Retrieving pages from what.cd - python

I'm working on a screen scraper for what.cd in Python, using BeautifulSoup. I came across this script while working and decided to look at it, since it seems similar to what I'm working on. However, every time I run the script I get a message that my credentials are wrong, even though they are not.
As far as I can tell, I'm getting this message because when the script logs into what.cd, the site is supposed to return a cookie containing the information that lets me request pages later in the script. So the script is failing here:
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username': username,
                               'password': password})
check = opener.open('http://what.cd/login.php', login_data)
soup = BeautifulSoup(check.read())
warning = soup.find('span', 'warning')
if warning:
    exit(str(warning) + '\n\nprobably means username or pw is wrong')
I've tried multiple methods of authenticating with the site, including using FileCookieJar, the script located here, and the Requests module. I've gotten the same HTML response with each one: it says, in short, that "Javascript is disabled" and "Cookies are disabled", and it also provides a login box in HTML.
I don't really want to mess around with Mechanize, but I don't see any other way to do it at the moment. If anyone can provide any help, it would be greatly appreciated.

After a few more hours of searching, I found the solution to my problem. I'm still not sure why this code works as opposed to the version above, but it does. Here is the code I'm using now:
import urllib
import urllib2
import cookielib

cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

# Request the index page first; this appears to be the step that makes the
# login work, since the site sets its session cookie on this response.
request = urllib2.Request("http://what.cd/index.php", None)
f = urllib2.urlopen(request)
f.close()

# Now POST the credentials; the cookie jar sends the session cookie along.
data = urllib.urlencode({"username": "your-login", "password": "your-password"})
request = urllib2.Request("http://what.cd/login.php", data)
f = urllib2.urlopen(request)
html = f.read()
f.close()
Credit goes to carl.waldbieser from linuxquestions.org. Thanks to everyone who gave input.
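For anyone who prefers the Requests module mentioned above, the same flow translates directly; here is a minimal sketch, assuming the same endpoints and form fields as the script:
import requests

session = requests.Session()  # the Session keeps cookies across requests
session.get('http://what.cd/index.php')  # pick up the session cookie first
resp = session.post('http://what.cd/login.php',
                    data={'username': 'your-login', 'password': 'your-password'})
html = resp.text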

Related

How to login into router with default credentials using python

I have a TP-Link router (WR841N). I want to log into the router and change the primary and secondary DNS using a script.
I tried to log in using the script below, but did not succeed:
import urllib2
import urllib
import cookielib

def main():
    userName = 'admin'
    pcPassword = 'admin'
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    login_data = urllib.urlencode({'userName': userName, 'pcPassword': pcPassword})
    resp = opener.open('http://192.168.0.1/userRpm/LoginRpm.htm', login_data)
    print(resp.read())

if __name__ == '__main__':
    main()
And then, how can I change the primary and secondary DNS using a script?
The HTTPCookieProcessor doesn't set the POST headers for you.
You need to set Content-Type and Content-Length to match your login_data.
I recommend installing the opener you built with urllib2.install_opener(), and then using a Request object:
r = urllib2.Request('http://192.168.0.1/userRpm/LoginRpm.htm')
r.add_header("Content-Type", "application/x-www-form-urlencoded")
r.add_header("Content-Length", str(len(login_data)))
r.add_data(login_data)
u = urllib2.urlopen(r)
print u.read()
u.close()
Then you have to continue by filling in the other forms to change what you want.
If the cookies aren't managed by JavaScript, you will be able to do it. If they are, it may still be possible, provided you examine the page code carefully and reproduce manually what the JavaScript does to the cookies. I have done it before.
But, yeah, SSH or telnet or rlogin would be easier than HTTP. If you do continue with HTTP, take a look at the Requests package; it can be helpful and will make your code smaller, since it manages sessions for you.
Adding the urlencoded content type might not help if the login form's enctype attribute is set to something else (plain text or multipart).
I don't think that will be the case here, but if it is, you can still do it with a bit more work.
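For example, here is a minimal sketch of the Requests approach, reusing the endpoint and field names from the question (any later form URLs and field names would have to be read from the router's own pages):
import requests

session = requests.Session()  # the Session manages cookies for you
login_data = {'userName': 'admin', 'pcPassword': 'admin'}
resp = session.post('http://192.168.0.1/userRpm/LoginRpm.htm', data=login_data)
print(resp.text)
# Subsequent form submissions (e.g. the DNS settings page) go through the
# same session; the exact URLs and field names must be taken from the HTML.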

python-requests - can't login

I'm trying to scrape some data, but first I need to log in. I am using python-requests, and here is my code so far:
import requests

login_url = "https://www.wehelpen.nl/login/"
users_url = "https://www.wehelpen.nl/ik-zoek-hulp/hulpprofielen/"
profile_url = "https://www.wehelpen.nl/profiel/01136/hulpvragen/"
uname = "****"
pword = "****"

def main():
    s = login(uname, pword, login_url)
    page = s.get(users_url)
    print makeUTF8(page.text)  # grab html and grep for logged-in text to make sure!

def login(uname, pword, url):
    s = requests.session()
    s.get(url, auth=(uname, pword))
    csrftoken = s.cookies['csrftoken']
    login_data = dict(username=uname, password=pword,
                      csrfmiddlewaretoken=csrftoken, next='/')
    s.post(url, data=login_data, headers=dict(Referer=url))
    return s

def makeUTF8(text):
    return text.encode('utf-8')
Basically, I need to log in at login_url with a POST request (using a CSRF token, because I get an error otherwise). Then, using the session object returned from login(), I want to check that I am logged in by making a GET request to a user page. Once I have page.text, I can run a grep command to check for a certain href that tells me whether I am logged in or not.
So far I have been unable to log in and keep a working session object. Can anyone help me? This has been the most tedious Python experience of my life.
EDIT: I have searched, searched and searched SO for answers, and nothing is working...
You need to have the correct names for the dictionary keys: they must match the name attributes of the HTML form's input fields. In your case those names are identification and password.
login_data = dict(identification=uname, password=pword,
                  csrfmiddlewaretoken=csrftoken, next='/')
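Put together, a minimal sketch of the corrected login() might look like this, assuming the CSRF handling from the question is otherwise right:
import requests

def login(uname, pword, url):
    s = requests.session()
    s.get(url)  # load the login page so the csrftoken cookie gets set
    csrftoken = s.cookies['csrftoken']
    # the keys must match the form's input names: identification/password
    login_data = dict(identification=uname, password=pword,
                      csrfmiddlewaretoken=csrftoken, next='/')
    s.post(url, data=login_data, headers=dict(Referer=url))
    return s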
There are lots of options, but I have had success using cookielib instead of trying to "manually" handle the cookies.
import urllib2
import cookielib
cookiejar = cookielib.CookieJar()
cookiejar.clear()
urlOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))
# ...etc...
Some potentially relevant answers on getting this set up are on SO, including: https://stackoverflow.com/a/5826033/1681480
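As a rough sketch of how that opener might then be used for the login POST (the field names below are only placeholders; take the real ones from the form's HTML):
import urllib

login_data = urllib.urlencode({'identification': 'your-login',
                               'password': 'your-password'})
response = urlOpener.open('https://www.wehelpen.nl/login/', login_data)
print response.read()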

issue with cookies and sending POST/GET to get the web content in Python [duplicate]

Possible Duplicate:
How to use Python to login to a webpage and retrieve cookies for later usage?
I want to download the whole webpage source from a service that handles cookies in an unusual way. I wrote a script that actually works and seems fine, but at some point it returned this error:
urllib2.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
My script works in a loop, changing the link to each subpage whose content I want to download.
I get a cookie, send a packet of data, and then I am able to reach the proper link and download the HTML.
The script looks like this:
import urllib2
data = 'some_string'
url = "http://example/index.php"
url2 = "http://example/source"
req1 = urllib2.Request(url)
response = urllib2.urlopen(req1)
cookie = response.info().getheader('Set-Cookie')
## Use the cookie in subsequent requests
req2 = urllib2.Request(url, data)
req2.add_header('cookie', cookie)
response = urllib2.urlopen(req2)
## reuse again
req3 = urllib2.Request(url2)
req3.add_header('cookie', cookie)
response = urllib2.urlopen(req3)
html = response.read()
I've been reading a bit about cookiejar/cookielib, because using that library is supposed to get rid of the error mentioned above, but I have no clue how to rework my code to use http.cookiejar and urllib.request.
I tried something like this:
import http.cookiejar, urllib.request
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener( urllib.request.HTTPCookieProcessor(cj) )
r = opener.open(url) # now cookies are stored in cj
r1 = opener.open(url, data) # TypeError: POST data should be bytes or an iterable of bytes. It cannot be str.
r2 = opener.open(url2)
print( r2.read() )
But it's not working like my first script.
PS: Sorry for my English, I am not a native speaker.
@Piotr Dobrogost thanks for the link, it solved the issue.
The TypeError was solved by using data=b"string" instead of data="string".
I've still got some issues porting to Python 3, but this issue can be closed.
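For reference, a minimal sketch of the ported flow under Python 3, using the same url, url2 and data as before (note the POST body is bytes, and every request goes through the one opener so the CookieJar is reused):
import http.cookiejar
import urllib.request

url = "http://example/index.php"
url2 = "http://example/source"
data = b"some_string"  # POST data must be bytes in Python 3

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

r = opener.open(url)         # GET: cookies are now stored in cj
r1 = opener.open(url, data)  # POST: stored cookies are sent automatically
r2 = opener.open(url2)       # GET the subpage with the same cookies
print(r2.read())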

Python and Google Checkout

I am trying to use a Python script to log in and grab the HTML from my Google Checkout account. It seems to log in, but it returns a strange page (screenshot omitted) which doesn't have any of the order info I am trying to parse. I know Google Checkout has an API, but there is no way to get just the payout totals, which are all I care about.
Here is my code:
import urllib, urllib2, cookielib
username = 'username'
password = 'password'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'Email' : username, 'Passwd' : password})
opener.open('https://accounts.google.com/ServiceLogin?service=sierra&passive=1200&continue=https://checkout.google.com/sell/orders&followup=https://checkout.google.com/sell/orders&ltmpl=seller&scc=1&authuser=0', login_data)
resp = opener.open('https://checkout.google.com/sell/payouts')
f = open('test.html', 'w')
f.write(resp.read())
f.close()
print "Finished"
How can I get this code to display the proper HTML of my account so I can parse it?
It depends on what sort of browser-detection or JavaScript tricks Google Checkout may be using. It may be enough simply to set your User-Agent to that of a well-known desktop browser; from the screenshot, it seems Google Checkout is assuming you're on a mobile browser.
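For example, a desktop User-Agent can be set on the opener from the question before logging in (the exact string below is just an illustrative desktop value):
# Present a desktop User-Agent on every request made through this opener.
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0')]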

Python auth_handler not working for me

I've been reading about Python's urllib2's ability to open and read directories that are password protected, but even after looking at examples in the docs, and here on StackOverflow, I can't get my script to work.
import urllib2
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm=None,
                          uri='https://webfiles.duke.edu/',
                          user='someUserName',
                          passwd='thisIsntMyRealPassword')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
socks = urllib2.urlopen('https://webfiles.duke.edu/?path=/afs/acpub/users/a')
print socks.read()
socks.close()
When I print the contents, I get the login screen that the URL I'm trying to open redirects to. Does anyone know why this is?
auth_handler is only for basic HTTP authentication. The site here uses an HTML form, so you'll need to submit your username/password as POST data.
I recommend using the mechanize module, which will simplify the login for you.
Quick example:
import mechanize
browser = mechanize.Browser()
browser.open('https://webfiles.duke.edu/?path=/afs/acpub/users/a')
browser.select_form(nr=0)
browser.form['user'] = 'username'
browser.form['pass'] = 'password'
req = browser.submit()
print req.read()
