Python & mechanize: How to scrape through pages in a row?

My problem is as follows:
I'm trying to write a scraper that runs through the order process of an airline ticketing website. So I want to scrape a couple of pages, each of which depends on the results of the page before it. This is what I have so far:
import mechanize, urllib, urllib2
url = 'any url'
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 5.2; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.47 Safari/536.11')]
br.open(url)
response = br.response().read()
br.select_form(nr=1)
br.form.set_all_readonly(False)
## now I am reading out the variables of form(nr=1)
for control in br.form.controls:
    if not control.name:
        print " - (type) =", (control.type)
        continue
    print " - (name, type, value) =", (control.name, control.type, br[control.name])
## now I am modifying the variables
br['fromdate'] = '2012/11/03'
br['todate'] = '2012/11/07'
## now I am submitting the form and saving the output in the variable bookingsite
response = br.submit()
bookingsite = response.read()
And here is my problem: how can I work with the variable bookingsite, which again contains a form that I want to modify and submit, the same way I would with a normal URL? Simply by calling
br.open(bookingsite)
? Or is there another way of modifying and submitting the output (and then submitting that output again to receive the next page)?

After your initial response = br.submit(), the browser object is already pointing at the result page, so select the next form directly on the browser:
br.select_form(nr=0)  # use whichever index or name the next form has
After you've changed the values of the fields within the form, submit it the same way as before:
response = br.submit()
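Putting it together, a minimal sketch of the chained flow (the second form's index and field name are placeholders; print the results page's forms to find the real ones):
## first page: fill in and submit the search form
br.select_form(nr=1)
br['fromdate'] = '2012/11/03'
br['todate'] = '2012/11/07'
br.submit()  # the browser now holds the results page
## second page: select the next form in the chain and repeat
br.select_form(nr=0)  # placeholder index
br['somefield'] = 'some value'  # placeholder field name
response = br.submit()
bookingsite = response.read()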
P.S. If you're automating booking sites, they most likely lean heavily on JavaScript, which mechanize doesn't execute. Requests doesn't execute JavaScript either, but it makes replicating the site's underlying HTTP calls much more pleasant; for pages that genuinely need a JavaScript engine, look at Selenium. You'll be happy you did.

Related

Website Login Not Accepting POST Method Python Script Login

I am trying to login to a site with POST method, then navigate to another page and scrape the HTML data from that second page.
However, the website is not accepting the payload I am posting to it, and the script returns the data for a non-member landing page instead of the member page I want.
Below is the current code, which does not work:
#Import Packages
import requests
from bs4 import BeautifulSoup
# Login Data
url = "https://WEBSITE.com/ajax/ajax.login.php"
data ={'username':'NAME%40MAIL.com','password':'PASSWORD%23','token':'ea83a09716ffea1a3a34a1a2195af96d2e91f4a32e4b00544db3962c33d82c40'}
# note: the '@' in the e-mail is URL-encoded as '%40', e.g. NAME%40MAIL.com
# note: the '#' in the password is URL-encoded as '%23', e.g. PASSWORD%23
headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36"}
# Post data package and log in
session = requests.Session()
r = session.post(url, headers=headers, data=data)
# Navigate to the next page and scrape the data
s = session.get('https://WEBSITE.com/page/93/EXAMPLE-PAGE')
soup = BeautifulSoup(s.text, 'html.parser')
print(soup)
I have inspected the elements on the login page; the AJAX URL for the login action is correct, and there are three form fields that need to be filled, as seen in the image below. I pulled the hidden token value from the Inspect Element panel and passed it along with the username/e-mail and password:
[Screenshot: Inspect Element panel showing the three login form fields]
I really have no clue what the issue might be, but a boolean IS_GUEST variable comes back TRUE in the returned HTML, which tells me I have done something wrong and the script has not been granted access.
This is also puzzling to troubleshoot, since the site redirects to a landing page and returns no server error codes to analyze or give me a hint.
I am using a different User-Agent header than my actual machine's, but that has never stopped me before on simpler logins.
I have encoded the '@' in the e-mail as '%40' and the special character required in the password, '#', as '%23' (i.e. NAME@MAIL.COM becomes 'NAME%40MAIL.COM' and PASSWORD# becomes 'PASSWORD%23'). Whenever I change the e-mail back to a literal '@' I get a garbage response, and putting the literal '#' back in the password changed nothing either.
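For what it's worth, requests form-encodes the data dict itself, so the values should normally be passed with the literal characters ('@', '#') rather than pre-encoded, and a per-session token usually has to be scraped fresh from the login page rather than hard-coded. A minimal sketch of that approach (the login-page URL and the 'token' input name are assumptions; the endpoint and credentials are from the question):
import requests
from bs4 import BeautifulSoup
session = requests.Session()
session.headers['user-agent'] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36"
# fetch the page that holds the login form and pull the current token from its hidden input
login_page = session.get('https://WEBSITE.com/login')  # assumed URL of the page with the form
soup = BeautifulSoup(login_page.text, 'html.parser')
token = soup.find('input', {'name': 'token'})['value']  # assumed input name
# post literal characters; requests URL-encodes the body itself
data = {'username': 'NAME@MAIL.com', 'password': 'PASSWORD#', 'token': token}
r = session.post('https://WEBSITE.com/ajax/ajax.login.php', data=data)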

Python scraper with POST request doesn't bring any results

I've written a script to scrape the "First Name" from a webpage using a POST request in Python. However, running my script I get neither any results nor any error. It seems to me that I'm doing things the right way. I hope somebody will point me in the right direction and show me what I'm missing here:
import requests
from lxml import html
payload = {'ScriptManager1':'UpdatePanel1|btnProceed','__EVENTTARGET':'','__EVENTARGUMENT':'','__VIEWSTATE':'/wEPDwULLTE2NzQxNDczNTcPZBYCAgQPZBYCAgMPZBYCZg9kFgQCAQ9kFgQCAQ9kFgICAQ9kFg4CBQ8QZGQWAGQCFQ8QZGQWAWZkAiEPEGRkFgFmZAI3DxBkZBYAZAI7DxBkZBYAZAJvDw9kFgIeBXZhbHVlZWQCew8PZBYCHwBlZAICD2QWAgIBD2QWAgIBD2QWAmYPZBYSAgcPEGRkFgBkAi0PEGRkFgFmZAJFDxYCHgdFbmREYXRlBmYcik5ut9RIZAJNDxBkZBYBZmQCZQ8WAh8BBmYcik5ut9RIZAJ7DxBkZBYAZAKBAQ8QZGQWAGQCyAEPD2QWAh8AZWQC1AEPD2QWAh8AZWQCBw9kFgICAw88KwARAgEQFgAWABYADBQrAABkGAMFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYDBQxyZG9QZXJtYW5lbnQFDHJkb1Byb3Zpc2lvbgUMcmRvUHJvdmlzaW9uBQlHcmlkVmlldzEPZ2QFCk11bHRpVmlldzEPD2RmZFSgnfO4lYFs09JWdr2kB8ZwSO3808nJf+616Y8YJ3UF','__VIEWSTATEGENERATOR':'5629D98D','__EVENTVALIDATION':'/wEdAAekSVFWk+dy9X9XnzfYeR4NT1Z25jJdJ6rNAjXmHpbD+Q8ekkJ2enuXq0jY/CeUlod/njRPjRiZUniYWoSlesZ/+0XiOc/vwjI5jxqS0D5ang1Wtvp3KMocxPzInS3xjMbN+DvxnwFeFeJ9MIBWR693SSiBqUlIhPoALKQ2G08CpjEhrdvaa2JXqLbLG45vzvU=','r1':'rdoPermanent','txtRegistNo':'SRO0394294','__ASYNCPOST':'true','btnProceed':'Proceed'}
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'}
response = requests.post("https://www.icaionlineregistration.org/StudentRegistrationForCaNo.aspx", params=payload, headers=headers).text
tree = html.fromstring(response)
item = tree.xpath('//div[@class="div_input_place"]/input[@id="txt_name"]/@value')
print(item)
The URL is given in my script, and the registration number to get the "First Name" is "SRO0394294". The XPath I've used above is the correct one.
The __VIEWSTATE input is always changing; this input may be there to protect the registration form from bots.
The problem is probably that the __EVENTTARGET field is empty; it may be needed in order to submit your request. In most cases you can find the value to set by looking at the form's submit button.
Also, since the __VIEWSTATE is regenerated on every request, you'll need to grab it fresh: first do a GET request and save the __VIEWSTATE input's value, then do the POST request with that value.
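A minimal sketch of that two-step flow (note that the form fields belong in data=, not params=, so they end up in the POST body; the __EVENTTARGET value is an assumption to verify against the page):
import requests
from lxml import html
url = "https://www.icaionlineregistration.org/StudentRegistrationForCaNo.aspx"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'}
with requests.Session() as s:
    s.headers.update(headers)
    # GET first to obtain the freshly generated hidden fields
    tree = html.fromstring(s.get(url).text)
    payload = {name: tree.xpath('//input[@name="%s"]/@value' % name)[0]
               for name in ('__VIEWSTATE', '__VIEWSTATEGENERATOR', '__EVENTVALIDATION')}
    payload.update({'__EVENTTARGET': 'btnProceed',  # assumed; check the submit button's name
                    '__EVENTARGUMENT': '',
                    'r1': 'rdoPermanent',
                    'txtRegistNo': 'SRO0394294',
                    'btnProceed': 'Proceed'})
    # POST the form fields in the request body this time
    response = s.post(url, data=payload)
    tree = html.fromstring(response.text)
    print(tree.xpath('//div[@class="div_input_place"]/input[@id="txt_name"]/@value'))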

Python mechanize form submission not working

I'm trying to put together a Python program that will periodically gather course data from a university website and save it to a CSV file for personal use. After some searching around I stumbled across mechanize. I'm trying to set it up so I can log in to my account, but I've run into a stumbling block. The code I put together here is supposed to submit the form containing the login information. The website is https://idp-prod.cc.ucf.edu/idp/Authn/UserPassword. The response page is supposed to display a red error message when a form is submitted with the wrong login information. The response I keep getting lacks such an error message, and I can't figure out what I'm doing wrong.
import mechanize
from bs4 import BeautifulSoup
import cookielib
import urllib2
# log in page url
url = "https://idp-prod.cc.ucf.edu/idp/Authn/UserPassword"
# create the browser object and give it fake headers
myBrowser = mechanize.Browser()
myBrowser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
myBrowser.set_handle_robots(False)
# have the browser open the site
myBrowser.open(url)
# handle the cookies
cj = cookielib.LWPCookieJar()
myBrowser.set_cookiejar(cj)
# select third form down
myBrowser.select_form(nr=2)
# fill out the form
myBrowser["j_username"] = "somestudentusername"
myBrowser["pwd"] = "somestudentpassword"
# submit the form and save the response
response = myBrowser.submit()
# upon the submission of incorrect login information an error message is displayed in red
soup = BeautifulSoup(response.read(), 'html.parser')
print (soup.find(color="red")) #find the error message in the response
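One thing worth checking: mechanize swaps in a new cookie handler when set_cookiejar() is called, so attaching the jar after the first open() discards any session cookies the login page has already set. A minimal sketch with the cookie jar wired up before the page is opened (credentials and form index as in the question):
import mechanize
import cookielib
url = "https://idp-prod.cc.ucf.edu/idp/Authn/UserPassword"
myBrowser = mechanize.Browser()
# attach the cookie jar BEFORE the first request so the session cookie is kept
cj = cookielib.LWPCookieJar()
myBrowser.set_cookiejar(cj)
myBrowser.set_handle_robots(False)
myBrowser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
myBrowser.open(url)
myBrowser.select_form(nr=2)
myBrowser["j_username"] = "somestudentusername"
myBrowser["pwd"] = "somestudentpassword"
response = myBrowser.submit()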

Login to a website with a radio button using Python's requests module

I am trying to log in to a website and retrieve some data therein. I tried the following code:
from requests import session
payload = {
    r"Login1$UserName": "myusername",
    r"Login1$Password": "thepassword",
    r"Login1$RadioButtonList_Type": "Tuna"
}
with session() as s:
    s.post("http://elogbook.ofdc.org.tw/", data=payload)
    req = s.get("http://elogbook.ofdc.org.tw/Member/CatchReportTuna.aspx")
    print(req.text)
But the result shows that I am not logged in to the site. I would like to know why the above code failed, and how to log in to the website.
I am new to scraping data from sites, so any opinion is sincerely welcome; thanks in advance.
P.S. The name r"Login1$RadioButtonList_Type" refers to the name of a radio button on that website, and I would like to set its value to Tuna.
The key problem is that there are hidden ASP.NET form fields that also need to be part of the payload. This means that you first need to make a GET request to the page and parse the hidden input field values out of it. Also, you need to provide a User-Agent header. Using BeautifulSoup for the HTML-parsing part:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from requests import session
payload = {
    r"Login1$UserName": "myusername",
    r"Login1$Password": "thepassword",
    r"Login1$RadioButtonList_Type": "Tuna",
    r"Login1$LoginButton": u"登入"
}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36'}
with session() as s:
    s.headers = headers
    response = s.get('http://elogbook.ofdc.org.tw/')
    soup = BeautifulSoup(response.content, 'html.parser')
    # copy the hidden ASP.NET fields into the payload
    for input_name in ['__EVENTTARGET', '__EVENTARGUMENT', '__VIEWSTATE', '__VIEWSTATEGENERATOR', '__EVENTVALIDATION']:
        payload[input_name] = soup.find('input', {'name': input_name}).get('value', '')
    s.post("http://elogbook.ofdc.org.tw/", data=payload)
    req = s.get("http://elogbook.ofdc.org.tw/Member/CatchReportTuna.aspx")
    print(req.content)
Just FYI, you could have also used one of the following tools to submit the form without explicitly worrying about the hidden form fields (a MechanicalSoup sketch follows the list):
mechanize
MechanicalSoup
robobrowser
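For example, a minimal MechanicalSoup sketch of the same login (field names as above; the form selector is an assumption, untested against the live site):
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("http://elogbook.ofdc.org.tw/")
browser.select_form("form")  # assumes the page has a single ASP.NET form
browser["Login1$UserName"] = "myusername"
browser["Login1$Password"] = "thepassword"
browser["Login1$RadioButtonList_Type"] = "Tuna"
browser.submit_selected()  # hidden __VIEWSTATE etc. are carried along automatically
page = browser.open("http://elogbook.ofdc.org.tw/Member/CatchReportTuna.aspx")
print(page.text)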
Another option would be to mimic a real user by automating a real browser through selenium:
from selenium import webdriver
login = "mylogin"
password = "mypassword"
driver = webdriver.Firefox()
driver.get('http://elogbook.ofdc.org.tw/')
# fill the form
driver.find_element_by_id('Login1_UserName').send_keys(login)
driver.find_element_by_id('Login1_Password').send_keys(password)
driver.find_element_by_id('Login1_RadioButtonList_Type_0').click()
# submit
driver.find_element_by_id('Login1_LoginButton').click()
driver.get('http://elogbook.ofdc.org.tw/Member/CatchReportTuna.aspx')
print(driver.page_source)

Download html in python?

I am trying to download the html of a page that is requested through a javascript action when you click a link in the browser. I can download the first page because it has a general URL:
http://www.locationary.com/stats/hotzone.jsp?hz=1
But there are links along the bottom of the page that are numbers (1 to 10). So if you click on one, it goes to, for example, page 2:
http://www.locationary.com/stats/hotzone.jsp?ACTION_TOKEN=hotzone_jsp$JspView$NumericAction&inPageNumber=2
When I put that URL into my program and try to download the HTML, it gives me the HTML of a different page on the website; I think it is the home page.
How can I get the HTML of a page that is generated through JavaScript and has no URL of its own?
Thanks.
Code:
import urllib
import urllib2
import cookielib
import re
URL = ''
def load(url):
    data = urllib.urlencode({"inUserName": "email", "inUserPass": "password"})
    jar = cookielib.FileCookieJar("cookies")
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders.append(('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1'))
    opener.addheaders.append(('Referer', 'http://www.locationary.com/'))
    opener.addheaders.append(('Cookie', 'site_version=REGULAR'))
    # log in and pull the session id out of the Set-Cookie response header
    request = urllib2.Request("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction", data)
    response = opener.open(request)
    page = opener.open("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction").read()
    h = response.info().headers
    jsid = re.findall(r'Set-Cookie: (.*);', str(h[5]))
    # rebuild the opener, this time sending the session id cookie explicitly
    data = urllib.urlencode({"inUserName": "email", "inUserPass": "password"})
    jar = cookielib.FileCookieJar("cookies")
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders.append(('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1'))
    opener.addheaders.append(('Referer', 'http://www.locationary.com/'))
    opener.addheaders.append(('Cookie', 'site_version=REGULAR; ' + str(jsid[0])))
    request = urllib2.Request("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction", data)
    response = opener.open(request)
    page = opener.open(url).read()
    print page
load(URL)
The selenium webdriver from the selenium tool suite uses standard browsers to retrieve the HTML (its main goal is test automation for web applications), so it is well suited for scraping JavaScript-rich applications. It has nice Python bindings.
I tend to use selenium to grab the page source after all the AJAX calls have fired, and then parse it with something like BeautifulSoup (BeautifulSoup copes well with malformed HTML).
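A minimal sketch of that approach for the page above (the link-text locator is an assumption; inspect the pagination links, and add an explicit wait if the content loads slowly):
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://www.locationary.com/stats/hotzone.jsp?hz=1')
# click the "2" pagination link to fire the JavaScript action
driver.find_element_by_link_text('2').click()  # assumed locator
soup = BeautifulSoup(driver.page_source, 'html.parser')  # source after the JS has run
driver.quit()
print(soup.title)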
