I'm writing a script with the mechanize.Browser module.
Everything else works, but when I submit() the form, it doesn't work,
so I went looking for the suspicious part of the source.
In the HTML source I found the following:
<form method="post" onsubmit="return loginCheck(this)" name="FRMLOGIN"/>
I think loginCheck(this) is causing the problem when the form is submitted.
But how do I handle this kind of JavaScript function with the mechanize module, so that I can
successfully submit the form and receive the result?
The following is my current script source.
If anyone can help, much appreciated!!
# -*- coding: cp949 -*-
import sys, os
import re
import datetime, time, socket
import mechanize, urllib
import cookielib
from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup, Tag
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# Want debugging messages?
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)
# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6)')]
br.open('http://user.buddybuddy.co.kr/Login/LoginForm.asp?URL=')
html = br.response().read()
print html
br.select_form(name='FRMLOGIN')
print br.viewing_html()
br.form['ID']='zero1zero2'
br.form['PWD']='012045'
br.submit()
print br.response().read()
mechanize doesn't support Javascript at all. If you absolutely have to run that Javascript, look into Selenium. It offers python bindings to control a real, running browser like Firefox or IE.
You will either need to make use of the unmaintained DOMForm module and Spidermonkey (http://pypi.python.org/pypi/python-spidermonkey) to process the JavaScript, or figure out what loginCheck() is doing and perform its work in Python prior to form submission. If loginCheck() just checks the login data for obvious validity, that should be pretty easy.
Please note that the action parameter of the quoted form tag is missing. It's probably set in the JavaScript.
Depending on what you intend, it might be easier to work with urllib2 only. You might assume a static structure for that web page and just post the data with urllib2's methods, then read the results with it as well.
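A minimal sketch of that urllib2-only approach, assuming the form posts its ID and PWD fields to an action URL that you dig out of the JavaScript or the browser's network tab (the LoginProc.asp path below is purely hypothetical):
import urllib, urllib2

# Hypothetical action URL -- the real one must be found in loginCheck()
# or in the browser's network tab, since the form tag has no action.
action_url = 'http://user.buddybuddy.co.kr/Login/LoginProc.asp'
data = urllib.urlencode({'ID': 'zero1zero2', 'PWD': '012045'})
response = urllib2.urlopen(action_url, data)  # passing data makes this a POST
print response.read()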
onsubmit is simply ignored by mechanize; no JavaScript interpretation is done.
You need to check what loginCheck() does; in some limited cases (e.g. validation) you can do programmatically in Python what the JavaScript does.
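For example, a minimal sketch, assuming loginCheck() only rejects empty fields (read the actual script on the page and mirror its real rules):
# mechanize ignores the onsubmit handler, so replicate a hypothetical
# loginCheck() in Python before calling submit().
def login_check(user_id, pwd):
    # Assumption: the JS only rejects empty fields; adjust to the real script.
    return bool(user_id.strip()) and bool(pwd.strip())

user_id, pwd = 'zero1zero2', '012045'
if login_check(user_id, pwd):
    br.form['ID'] = user_id
    br.form['PWD'] = pwd
    br.submit()  # br is the mechanize.Browser from the question's script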
Related
I am trying to scrape the full text of articles from a New York Times archives search for an NLP task (search here: http://query.nytimes.com/search/sitesearch/). I have legal access to all of the articles and can view them if I search the archives manually.
However, when I use urllib2, mechanize, or requests to pull the HTML from the search results page, they are not pulling the relevant part of the code (links to the articles, number of hits) that I need to scrape the full articles. I am not getting an error message; the relevant sections, which are clearly visible in Inspect Element, are simply missing from the HTML that is pulled.
Because some of the articles are accessible to subscribers only, it occurred to me that this may be the problem, so I supplied my user credentials with the request through Mechanize; however, this makes no difference in the code pulled.
There is a NYT API, however it does not give access to the full text of the articles, so it is useless to me for my purposes.
I assume that NYT has intentionally made scraping the page difficult, but I have a legal right to view all of these articles and so would appreciate any help with strategies that may help me get around the hurdles they have put up. I am new to web-scraping and am not sure where to start in figuring out this problem.
I tried pulling the HTML with all of the following, and got the same incomplete results each time:
url = 'http://query.nytimes.com/search/sitesearch/#/India+%22united+states%22/from19810101to20150228/allresults/1/allauthors/relevance/Opinion/'
#trying urllib
import urllib
opener = urllib.FancyURLopener()
print opener.open(url).read()
#trying urllib2
import urllib2
request = urllib2.Request(url)
response = urllib2.urlopen(request)
print response.read()
#trying requests
import requests
print requests.get(url).text
#trying mechanize (impersonating browser)
import mechanize
import cookielib
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
r = br.open(url)
print r.read()
Why don't you use a framework like Scrapy? It gives you a lot of power out of the box; for example, you can retrieve just the parts of the page you are interested in and discard the rest. I wrote a small example of dealing with Scrapy and AJAX pages here: http://www.6020peaks.com/2014/12/how-to-scrape-hidden-web-data-with-python/
Maybe it can help you get an idea of how Scrapy works.
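Note also that everything after the # in your URL is a fragment, which the browser never sends to the server; the results are filled in afterwards by JavaScript. A minimal sketch of fetching the underlying XHR endpoint directly, assuming you copy the real URL and parameters from your browser's network tab (the endpoint and parameter names below are hypothetical placeholders):
import requests

# Hypothetical XHR endpoint -- find the real one in the network tab while
# running the archive search in a normal browser.
api_url = 'http://query.nytimes.com/svc/search/sitesearch.json'
params = {'q': 'India "united states"',
          'begin_date': '19810101',
          'end_date': '20150228',
          'page': 0}
print requests.get(api_url, params=params).json()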
You could try using a tool like kimonolabs.com to scrape the articles. If you're having trouble using authentication, kimono has a built-in framework that allows you to enter and store your credentials, so that might help where you're otherwise limited with the NYT API. I made this NYT API using kimono that you can clone and use if you make a kimono account: https://www.kimonolabs.com/apis/c8i366xe
Here's a help center article for how to make an API behind a login: https://help.kimonolabs.com/hc/en-us/articles/203275010-Fetch-data-from-behind-a-log-in-auth-API-
This article walks you through how to go through links to get the detailed page information: https://help.kimonolabs.com/hc/en-us/articles/203438300-Source-URLs-to-crawl-from-another-kimono-API
I know there are a lot of questions about this, but I've tried most of them.
My goal is to get the article from this page and use it in GAE.
If I try to log in, it redirects to a long URL; after I log in there, it redirects back to the article.
First I tried urllib2, as mentioned in "how to login to a website with python and mechanize", and it didn't work.
Then I took the SelectLoginForm and login functions from https://github.com/cdhigh/KindleEar/blob/master/books/base.py and they didn't work either.
Selenium won't work because I'm going to use this in GAE, and I guess GAE can't support Selenium.
So I started looking into the mechanize module. My current code is:
# -*- coding: cp1254 -*-
import cookielib
import urllib2
import mechanize
# the original code mixed two Browser objects (b and br); use a single one
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [("User-agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")]
br.open('https://hurpass.com/iframe/login?appkey=52da7ef64037f9497f0acb091390051062215&secret=52da7f0c4037f9497f0acb0b1390051084754&domain=sosyal.hurriyet.com.tr&callback_url=http://sosyal.hurriyet.com.tr/Account/AutoLogin?returnUrl=http://sosyal.hurriyet.com.tr/yazar/ahmet-hakan_131/baskanlik-diktatorluk-getirir-diyenleri-girtlaklamak-istiyorum_28116073&referer=http://sosyal.hurriyet.com.tr&user_page=http://sosyal.hurriyet.com.tr/Account/AutoLogin?returnUrl=http://sosyal.hurriyet.com.tr/yazar/ahmet-hakan_131/baskanlik-diktatorluk-getirir-diyenleri-girtlaklamak-istiyorum_28116073&is_mobile=0&session_timeout=0&is_vative=0&email=')
br.select_form(name='frm_login')
br["email"] = "tasklak#hotmail.com"
br["password"] = "123456"
br.submit(type="submit")
url = 'http://sosyal.hurriyet.com.tr/yazar/ahmet-hakan_131/baskanlik-diktatorluk-getirir-diyenleri-girtlaklamak-istiyorum_28116073'
last_response = br.response()
http_header_dict = last_response.info().dict
html_string_list = last_response.readlines()
html_data = "".join(html_string_list)
page = br.open(url)
print page.read().decode("UTF-8")
ha = open("test.html", 'w')
ha.write(html_data)
ha.close()
Again I can't get this working, but if I open the HTML file it creates, it redirects to the logged-in article page. Could this be a mechanize redirection problem, or is it impossible to log in to this page?
Edit, after Mihail's answer:
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
user = 'tasklak#hotmail.com'
password = '123456'
xor_password = ''.join(chr(12 ^ ord(c)) for c in password)
auth_url = 'http://auth.hurriyet.com.tr/api/loginuser/{}/?{}'.format(user, xor_password)
url='http://www.hurriyet.com.tr/anasayfa/'
sessionidd=urllib2.urlopen(auth_url).read().split(',')[1].split('\"')[3]
print sessionidd
opener.open(url+';ASPSESSIONID='+sessionidd)
print cj
Edit 2:
sessionidd = urllib2.urlopen(auth_url).read().split(',')[1].split('"')[3]
print sessionidd
opener.open(url)
k = 0
for a in cj:
    if k < 2:
        a.value = sessionidd
    k += 1
print cj
First of all, you should know that if there isn't a publicly available API to do all this without scraping, then it's very likely that what you are doing is not welcomed by the website owners, is against their terms of service, and could even be illegal and punishable by law, depending on where you live.
Unless mechanize can interpret JavaScript code (which I doubt it does, although I might be wrong), it's not going to be very helpful. Skimming through the links you provided with Chrome's DevTools, though, it looks like you could implement what you want with a few pure urllib2 requests.
For example, when you log in for the first time, you'll see a GET request to the http://auth.hurriyet.com.tr/api/loginuser/tasklak#hotmail.com/?%3D%3E%3F89%3A URL, which includes your username and encoded password and returns some session IDs. The reason mechanize wouldn't work is that the password is encoded by JavaScript code that is not interpreted when you submit the form in your script.
Going into the source code of the login form, you'll see that when the "Submit" button is clicked a loginUser() function is called, which, when you find it, shows that the password is being XOR'ed with the following code:
for (i = 0; i < password.length; ++i) {
    encoded_password += String.fromCharCode(12 ^ password.charCodeAt(i));
}
which you would have to rewrite in Python. So, to receive the initial session IDs, you'd have something like:
import urllib2
user = 'tasklak#hotmail.com'
password = '123456'
xor_password = ''.join(chr(12 ^ ord(c)) for c in password)
auth_url = 'http://auth.hurriyet.com.tr/api/loginuser/{}/?{}'.format(user, xor_password)
print(urllib2.urlopen(auth_url).read())
It looks like you're then going to need to validate the session IDs you received and retrieve session cookies, which you can then use to get the full articles, but I will leave that to you.
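A rough sketch of how that last step might look, assuming (as the question's later edits suggest, but unverified) that the site accepts the session ID as an ASPSESSIONID cookie:
import cookielib, urllib2

user = 'tasklak#hotmail.com'
password = '123456'
xor_password = ''.join(chr(12 ^ ord(c)) for c in password)
auth_url = 'http://auth.hurriyet.com.tr/api/loginuser/{}/?{}'.format(user, xor_password)

# Fragile parsing, copied from the question's edit: pull the session ID
# out of the auth response.
session_id = urllib2.urlopen(auth_url).read().split(',')[1].split('"')[3]

# Assumption: presenting the ID as an ASPSESSIONID cookie is enough.
cj = cookielib.CookieJar()
cj.set_cookie(cookielib.Cookie(
    version=0, name='ASPSESSIONID', value=session_id,
    port=None, port_specified=False,
    domain='.hurriyet.com.tr', domain_specified=True, domain_initial_dot=True,
    path='/', path_specified=True, secure=False, expires=None,
    discard=True, comment=None, comment_url=None, rest={}))
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

article_url = ('http://sosyal.hurriyet.com.tr/yazar/ahmet-hakan_131/'
               'baskanlik-diktatorluk-getirir-diyenleri-girtlaklamak-istiyorum_28116073')
print opener.open(article_url).read()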
I am trying to use Python to simulate the login to our company email.
The script works fine, and the result page shows an already-logged-in sign.
import urllib
import urllib2
import mechanize
import cookielib
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)')]
url = 'http://www.company.com/'
form_data = {'name_file': 'THENAME', 'password_field': 'THEPASSWORD'}
params = urllib.urlencode(form_data)
response = br.open(url, params)
However, I need to click the "GoTo Email" button on the web page to enter the email dashboard. Note that the web address doesn't change, and there is no redirect to any other page, before or after clicking the button.
The HTML is shown below. I think it's a button:
<input id=mail1_btnGoto class=btn onmouseover="this.className='btnOver'" onmouseout="this.className='btn'" name=mail1$btnGoto value="GoTo Email" type=submit>
I thought about using winapi to simulate a mouse click, but that's silly because it only controls the mouse at the front end. Selenium isn't a solution in this case because I want the script to run at the back end.
How can I have the button 'clicked' on the webpage?
It seems the email dashboard is driven by JavaScript, so you cannot simply use winapi to simulate a mouse click without evaluating the script.
Generally there are two workarounds:
Use a full-featured browser driver. As you mentioned, Selenium is a good choice across many programming languages. The webdriver does not need a browser to be opened manually and can be fully controlled by scripts. You can also try GhostDriver instead: it uses PhantomJS and can run on a backend server (but installing PhantomJS is required).
Mock the request. Logging in usually involves an HTTP/HTTPS request, so you can use Python to mock that request. Use HTTP debugging tools like Fiddler, Wireshark, or Chrome's web inspector to capture the information the browser sends to the authentication server.
I have tried to be specific and detailed, but due to the diversity of web crawling, a step-by-step guide is beyond my reach.
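Since the button is a type=submit input named mail1$btnGoto, there may also be a simpler option: let mechanize "click" it by submitting its form via that control's name, so hidden ASP.NET fields like __VIEWSTATE are carried along automatically. A minimal sketch, continuing from the logged-in br above and assuming the button's form is the first form on the page:
br.select_form(nr=0)  # assumption: the dashboard form is the first form
response = br.submit(name='mail1$btnGoto')  # posts mail1$btnGoto=GoTo Email
print response.read()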
I'm using mechanize/cookiejar/lxml to read a page, and it works for some pages but not others. The error I'm getting for them is the one in the title. I can't post the pages here because they aren't SFW, but is there a way to fix it? Basically, this is what I do:
import mechanize, cookielib
from lxml import etree
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(False)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 maverick Firefox/3.6.13')]
response = br.open('...')
tree = etree.parse(response) #error
After that I get the root and search the document for the values I want. Apparently iterparse doesn't crash, but at the moment I'm assuming that's just because I haven't processed anything with it; I also haven't figured out yet how to search for the values with it.
I've tried disabling gzip and enabling sending the Referer as well, but neither solves the problem. I also tried saving the source code to disk and creating the tree from there, just for the sake of it, and I get the same error.
Edit
The response I get seems to be fine; using print repr(response) as suggested, I get a <response_seek_wrapper at 0xa4a160c whose wrapped object = <stupid_gzip_wrapper at 0xa49acec whose fp = <socket._fileobject object at 0xa49c32c>>>. I can also save the response using the read() method and check that the saved .xml works in the browser.
Also, in one of the pages there is a ’ that gives me the following error: "lxml.etree.XMLSyntaxError: Entity 'rsquo' not defined, line 17, column 7054". So far I've replaced it with a regex, but is there a parser that can handle this? I've gotten this error even with the lxml.html.parse suggested below.
Regarding the file being highlighted, I meant that when I open it with gEdit it looks like this: http://img34.imageshack.us/img34/9574/gedit.jpg
Use lxml.html.parse for HTML; it can handle even very broken HTML. Do you still get an error then?
What is the nature of response? According to the help, etree.parse is expecting one of:
- a file name/path
- a file object
- a file-like object
- a URL using the HTTP or FTP protocol
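If the pages really are XML/XHTML and the only problem is HTML-only entities such as &rsquo;, a recovering parser may be enough. A minimal sketch, passing the mechanize response in as the file-like source:
import mechanize
from lxml import etree

br = mechanize.Browser()
response = br.open('...')  # the (NSFW) URL from the question

# recover=True tells libxml2 to continue past errors such as undefined
# entities instead of raising XMLSyntaxError.
parser = etree.XMLParser(recover=True)
tree = etree.parse(response, parser)
print tree.getroot().tag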
I am looking for a way to view the request (not response) headers, specifically what browser mechanize claims to be. Also, how would I go about manipulating them, e.g. setting a different browser?
Example:
import mechanize
browser = mechanize.Browser()
# Now I want to make a request to eg example.com with custom headers using browser
The purpose is of course to test a website and see whether or not it shows different pages depending on the reported browser.
It has to be the mechanize browser as the rest of the code depends on it (but is left out as it's irrelevant.)
browser.addheaders = [('User-Agent', 'Mozilla/5.0 blahblah')]
You've got an answer on how to change the headers, but if you want to see the exact headers that are being sent, try using a proxy that displays the traffic, e.g. Fiddler2 on Windows, or see this question for some Linux alternatives.
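Alternatively, mechanize can print the outgoing headers itself via its debug switches (the same set_debug_http seen in the snippets above); a small sketch:
import sys
import logging
import mechanize

# Make sure mechanize's debug output is visible.
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

browser = mechanize.Browser()
browser.set_debug_http(True)  # dump raw request and response headers
browser.addheaders = [('User-Agent', 'Mozilla/5.0 blahblah')]
browser.open('http://example.com')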
You can modify the Referer, too:
br.addheaders = [('Referer', 'http://google.com')]