I'm aware of a Python API for sale here (http://oktaykilic.com/my-projects/google-alerts-api-python/), but I'd like to understand why the way I'm doing it now isn't working.
Here is what I have so far:
class GAlerts():
def __init__(self, uName = 'USERNAME', passWord = 'PASSWORD'):
self.uName = uName
self.passWord = passWord
def addAlert(self):
self.cj = mechanize.CookieJar()
loginURL = 'https://www.google.com/accounts/ServiceLogin?hl=en&service=alerts&continue=http://www.google.com/alerts'
alertsURL = 'http://www.google.com/alerts'
#log into google
initialRequest = mechanize.Request(loginURL)
response = mechanize.urlopen(initialRequest)
#put in form info
forms = ClientForm.ParseResponse(response, backwards_compat=False)
forms[0]['Email'] = self.uName
forms[0]['Passwd'] = self.passWord
#click form and get cookies
request2 = forms[0].click()
response2 = mechanize.urlopen(request2)
self.cj.extract_cookies(response, initialRequest)
#now go to alerts page with cookies
request3 = mechanize.Request(alertsURL)
self.cj.add_cookie_header(request3)
response3 = mechanize.urlopen(request3)
#parse forms on this page
formsAdd = ClientForm.ParseResponse(response3, backwards_compat=False)
formsAdd[0]['q'] = 'Hines Ward'
#click it and submit
request4 = formsAdd[0].click()
self.cj.add_cookie_header(request4)
response4 = mechanize.urlopen(request4)
print response4.read()
myAlerter = GAlerts()
myAlerter.addAlert()
As far as I can tell, it successfully logs in and gets to the adding alerts homepage, but when I enter a query and "click" submit it sends me to a page that says "Please enter a valid e-mail address". Is there some kind of authentication I'm missing? I also don't understand how to change the values on google's custom drop-down menus? Any ideas?
Thanks
The custom drop-down menus are done using JavaScript, so the proper solution would be to figure out the URL parameters and then try to reproduce them (this might be the reason it doesn't works as expected right now - you are omitting required URL parameters that are normally set by JavaScript when you visit the site in a browser).
The lazy solution is to use the galerts library, it looks like it does exactly what you need.
A few hints for future projects involving mechanize (or screen-scraping in general):
Use Fiddler, an extremely useful HTTP debugging tool. It captures HTTP traffic from most browsers and allows you to see what exactly your browser requests. You can then craft the desired request manually and in case it doesn't work, you just have to compare. Tools like Firebug or Google Chrome's developer tools come in handy too, especially for lots of async requests. (you have to call set_proxies on your browser object to use it with Fiddler, see documentation)
For debugging purposes, do something like for f in self.forms(): print f. This shows you all forms mechanize recognized on a page, along with their name.
Handling cookies is repetitive, so - surprise! - there's an easy way to automate it. Just do this in your browser class constructor: self.set_cookiejar(cookielib.CookieJar()). This keeps track of cookies automatically.
I have been relying a long time on custom parses like BeautifulSoup (and I still use it for some special cases), but in most cases the fastest approach on web screen scraping is using XPath (for example, lxml has a very good implementation).
Mechanize doesn't handle JavaScript, and those drop-down Menus are JS. If you want to do automatization where JavaScript is involved, I suggest using Selenium, which also has Python bindings.
http://seleniumhq.org/
Related
I am having trouble creating and keeping new sessions when I am scraping my page. I am initiating a session within my script using the Requests library and then parsing values to a web form. However, it's is returning a "Your session has timed out" page.
Here is my source:
import requests
session = requests.Session()
params = {'Rctl00$ContentPlaceHolder1$txtName': 'Andrew'}
r = session.post("https://www.searchiqs.com/NYALB/SearchResultsMP.aspx", data=params)
print(r.text)
The url I want to search from is this https://www.searchiqs.com/NYALB/SearchAdvancedMP.aspx
I am searching for a Party 1 name called "Andrew". I have identified the form element holding this search box as 'Rctl00$ContentPlaceHolder1$txtName'. The action url is SearchResultsMP.aspx.
When i do it from a browser, it gives the first page of results. When i do it in the terminal it gives me the session expired page. Any ideas?
First, I would refer you to the advanced documentation related to use of sessions within the requests Python module.
http://docs.python-requests.org/en/master/user/advanced/
I also notice that navigating to the base URL in your invocation of sessions.post redirects to:
https://www.searchiqs.com/NYALB/InvalidLogin.aspx?InvLogInCode=OldSession%2007/24/2016%2004:19:37%20AM
I "hacked" the URL to navigate to:
https://www.searchiqs.com/NYALB/
...and notice that if I click on the Show Login Fields link on that page, I am prompted a form appears with prompts for User ID and Password. Your attempts to programmatically do your searches are likely failing because you have not done any sorts of authentication. It likely works in your browser because you have been permitted to access this, either by some previous authentication you have completed and may have forgotten about, or some sort of server side access rules that don't ask for this based upon some criteria.
Running those commands in a local interpreter, I can see that the site owner did not bother to return a status code indicative of failed auth. If you check, the r.status_code is 200 but your r.text will be the Invalid Login page. I know nada about ASP, but am guessing that HTTP status codes should be indicative of what actually happened.
Here is some code, that does not really work, but may illustrate how you may want to interact with the site and sessions.
import requests
# Create dicts with our login and search data
login_params = {'btnGuestLogin': 'Log+In+as+GUEST'}
search_params = {'ctl00$ContentPlaceHolder1$txtName': 'Andrew'}
full_params = {'btnGuestLogin': 'Log+In+as+GUEST', 'ctl00$ContentPlaceHolder1$txtName': 'Andrew'}
# Create session and add login params
albany_session = requests.session()
albany_session.params = login_params
# Login and confirm login via searching for the 'ASP.NET_SessionId' cookie.
# Use the login page, not the search page first.
albany_session.post('https://www.searchiqs.com/NYALB/LogIn.aspx')
print(albany_session.cookies)
# Prepare a your search request
search_req = requests.Request('POST', 'https://www.searchiqs.com/NYALB/SearchAdvancedMP.aspx',data=search_params)
prepped_search_req = albany_session.prepare_request(search_req)
# Probably should work but does not seem to, for "reasons" unknown to me.
search_response = albany_session.send(prepped_search_req)
print(search_response.text)
An alternative may be for you to consider is Selenium browser automation with Python bindings.
http://selenium-python.readthedocs.io/
I need to log into a website to access its html on a login-protected page for a project I'm doing.
I'm using this person's answer with the values I need:
from twill.commands import *
go('https://example.com/login')
fv("3", "email", "myemail#example.com")
fv("3", "password", "mypassword")
submit()
Assumedly this should log me in so I then run:
sock = urllib.urlopen("https://www.example.com/activities")
html_source = sock.read()
sock.close()
print html_source
Which I thought would print the html of the (now) accessible page but instead just gives me the html of the login page. I've tried other methods (e.g. with mechanize) but I get the identical result.
What am I missing? Do some sites restrict this type of login or does it not work with https or something? (The site is FitBit, since I couldn't use the url in the question)
You're using one library to log in and another to then retrieve the subsequent page. twill and urllib are not sharing data about your sessions. (Similar issue to this one.) If you do that, then you need to manage the session cookie / authentication yourself. Specifically, you'll need to copy the cookie + data and add that to the post-login request in the other library.
Otherwise, and more logically, use the same one for both the login and post-login requests.
I have a Python script using mechanize browser which logs into a self hosted Wordpress blog, navigates to a different page after the automatic redirect to the dashboard to automate several builtin functions.
This script actually works 100% on most of my blogs but goes into a permanent loop with one of them.
The difference is that the only one which fails has a plugin called Wassup running. This plugin sets a session cookie for all visitors and this is what I think is causing the issue.
When the script goes to the new page the Wordpress code doesn't get the proper cookie set, decides that the browser isn't logged in and redirects to the login page. The script logs in again and attempts the same function and round we go again.
I tried using Twill which does login correctly and handles the cookies correctly but Twill, by default, outputs everything to the command line. This is not the behaviour I want as I am doing page manipulation at this point and I need access to the raw html.
This is the setup code
# Browser
self.br = mechanize.Browser()
# Cookie Jar
policy = mechanize.DefaultCookiePolicy(rfc2965=True)
cj = mechanize.LWPCookieJar(policy=policy)
self.br.set_cookiejar(cj)
After successful login I call this function
def open(self):
if 'http://' in str(self.burl):
site = str(self.burl) + '/wp-admin/plugin-install.php'
self.burl = self.burl[7:]
else:
site = "http://" + str(self.burl) + '/wp-admin/plugin-install.php'
try:
r = self.br.open(site, timeout=1000)
html = r.read()
return html
except HTTPError, e:
return str(e.code)
I'm thinking that I will need to save the cookies to a file and then shuffle the order so the Wordpress session cookie gets returned before the Wassup one.
Any other suggestions?
This turned out to be a quite different problem, and fix, than it seemed which is why I have decided to put the answer here for anyone who reads this later.
When a WordPress site is setup there is an option for the url to default to http://sample.com or http://www.sample.com. This turned out to be a problem for the cookie storage. Cookies are stored with the url as part of their name. My program semi-hardcodes the url with one or the other of these formats. This meant that every time I made a new url request it had the wrong format and no cookie with the right name could be found so the WordPress site rightfully decided I wasn't logged in and sent me back to login again.
The fix is to grab the url delivered in the redirect after login and recode the variable (in this case self.burl) to reflect what the .httaccess file expects to see.
This fixed my problem because some of my sites had one format and some the other.
I hope this helps someone out with using requests, twill, mechanise etc.
This script succeeds at getting a 200 response object, getting a cookie, and returning reddit's stock homepage source. However, it is supposed to get the source of the "recent activity" subpage which can only be accessed after logging in. This makes me think it's failing to log in appropriately but the username and password are accurate, I've double checked that.
#!/usr/bin/python
import requests
import urllib2
auth = ('username', 'password')
with requests.session(auth=auth) as s:
c = s.get('http://www.reddit.com')
cookies = c.cookies
for k, v in cookies.items():
opener = urllib2.build_opener()
opener.addheaders.append(('cookie', '{}={}'.format(k, v)))
f = opener.open('http://www.reddit.com/account-activity')
print f.read()
It looks like you're using the standard "HTTP Basic" authentication, which is not what Reddit uses to log in to its web site. (Almost no web sites use HTTP Basic (which pops up a modal dialog box requesting authentication), but implement their own username/password form).
What you'll need to do is get the home page, read the login form fields, fill in the user name and password, POST the response back to the web site, get the resulting cookie, then use the cookie in future requests. There may be quite a number of other details for you to work out too, but you'll have to experiment.
I just think maybe we're having the same problem. I get status code 200 ok. But the script never logged me in. I'm getting some suggestions and help. Hopefully you'll let me know what works for you too. Seems reddit is using the same system too.
Check out this page where my problem is being discussed.
Authentication issue using requests on aspx site
So, I started out with Mechanize, and apparently the first thing I try it on is a monkey-rhino-level high JavaScript navigated site.
Now the thing I'm stuck on is submitting the form.
Normally I'd do a submit using the Mechanize built-in submit() function.
import mechanize
browser = mechanize.Browser()
browser.select_form(name = 'foo')
browser.form['bar'] = 'baz'
browser.submit()
This way it'd use the submit button that's available in the HTML form.
However, the site I'm stuck on had to be one that doesn't use HTML submit buttons... No, they're trying to be JavaScript gurus, and do a submit via JavaScript.
The usual submit() doesn't seem to work with this.
So... Is there a way to get around this?
Any help is appreciated. Many thanks!
--[Edit]--
The JavaScript function I'm stuck on:
function foo(bar, baz) {
var qux = document.forms["qux"];
qux.bar.value = bar.split("$").join(":");
qux.baz.value = baz;
qux.submit();
}
What I did in Python (and what doesn't work):
def foo(browser, bar, baz):
qux = browser.select_form("qux")
browser.form[bar] = ":".join(bar.split("$"))
browser.form[baz] = baz
browser.submit()
Three ways:
The first method is preferable if the form is submitted using the POST/GET method, otherwise you'll have to resort to second and third method.
Submitting the form manually and check for POST/GET requests, their parameters and the post url required to submit the form. Popular tools for checking headers are the Live HTTP headers extension and Firebug extension for Firefox, and Developer Tools extension for Chrome. An example of using the POST/GET method:
import mechanize
import urllib
browser = mechanize.Browser()
#These are the parameters you've got from checking with the aforementioned tools
parameters = {'parameter1' : 'your content',
'parameter2' : 'a constant value',
'parameter3' : 'unique characters you might need to extract from the page'
}
#Encode the parameters
data = urllib.urlencode(parameters)
#Submit the form (POST request). You get the post_url and the request type(POST/GET) the same way with the parameters.
browser.open(post_url,data)
#Submit the form (GET request)
browser.open(post_url + '%s' % data)
Rewrite the javascript and execute it in Python. Check out spidermonkey.
Emulate a full browser. Check out Selenium and Windmill.