I'm trying to use mechanize to grab prices for New York's Metro-North Railroad from this site: http://as0.mta.info/mnr/fares/choosestation.cfm
The problem is that when you select an origin station in the first list, the site uses JavaScript to populate the list of possible destinations. I have written equivalent code in Python, but I can't get it all working. Here's what I have so far:
import mechanize
import cookielib
from bs4 import BeautifulSoup
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open("http://as0.mta.info/mnr/fares/choosestation.cfm")
br.select_form(name="form1")
br.form.set_all_readonly(False)
origin_control = br.form.find_control("orig_stat", type="select")
origin_control_list = origin_control.items
origin_control.value = [origin_control.items[0].name]
destination_control_list = reFillList(0, origin_control_list)
destination_control = br.form.find_control("dest_stat", type="select")
destination_control.items = destination_control_list
destination_control.value = [destination_control.items[0].name]
response = br.submit()
response_text = response.read()
print response_text
I haven't included the code for reFillList() because it's long, but assume it correctly creates a list of mechanize option objects. Python doesn't complain about anything, but on submit I get the HTML for this alert:
"Fare information for travel between two lines is not available on-line. Please contact our Customer Information Center at 511 and ask to speak to a representative for further information."
Am I missing something here? Thanks for all the help!
It can't really be done with mechanize alone unless you make sense of, and reimplement, the convoluted logic in the page's JavaScript function. I suggest a JavaScript-based solution or a full browser driven by Selenium.
I'm trying to put together a Python program that will periodically gather course data from a university website and save it to a CSV file for personal use. After some searching I came across mechanize. I'm trying to set it up so that I can log into my account, but I've hit a stumbling block. The code below is supposed to submit the form containing the login information. The website is https://idp-prod.cc.ucf.edu/idp/Authn/UserPassword. When the form is submitted with the wrong login information, the response page is supposed to display a red error message. The response I keep getting lacks that error message, and I can't figure out what I'm doing wrong.
import mechanize
from bs4 import BeautifulSoup
import cookielib
import urllib2
# log in page url
url = "https://idp-prod.cc.ucf.edu/idp/Authn/UserPassword"
# create the browser object and give it fake headers
myBrowser = mechanize.Browser()
myBrowser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
myBrowser.set_handle_robots(False)
# handle the cookies (install the jar before the first request so its cookies are kept)
cj = cookielib.LWPCookieJar()
myBrowser.set_cookiejar(cj)
# have the browser open the site
myBrowser.open(url)
# select third form down
myBrowser.select_form(nr=2)
# fill out the form
myBrowser["j_username"] = "somestudentusername"
myBrowser["pwd"] = "somestudentpassword"
# submit the form and save the response
response = myBrowser.submit()
# upon the submission of incorrect login information an error message is displayed in red
soup = BeautifulSoup(response.read(), 'html.parser')
print (soup.find(color="red")) #find the error message in the response
I would like to get the estimated number of results from Google for a keyword. I'm using Python 3.3 and am trying to do this with BeautifulSoup and urllib.request. This is my simple code so far:
from urllib.request import Request, urlopen
from urllib.error import URLError
from bs4 import BeautifulSoup

def numResults():
    scounttext = None  # stays None if the request fails
    try:
        page_google = '''http://www.google.de/#output=search&sclient=psy-ab&q=pokerbonus&oq=pokerbonus&gs_l=hp.3..0i10l2j0i10i30l2.16503.18949.0.20819.10.9.0.1.1.0.413.2110.2-6j1j1.8.0....0...1c.1.19.psy-ab.FEBvxrgi0KU&pbx=1&bav=on.2,or.r_qf.&bvm=bv.48705608,d.Yms&'''
        req_google = Request(page_google)
        # the header name is 'User-Agent', with a hyphen
        req_google.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
        html_google = urlopen(req_google).read()
        soup = BeautifulSoup(html_google)
        scounttext = soup.find('div', id='resultStats')
    except URLError as e:
        print(e)
    return scounttext
My problem is that my soup variable seems to be encoded somehow, and I can't get any information out of it. I get back None because soup.find doesn't find anything.
What am I doing wrong, and how can I extract the resultStats I want?
Many thanks!
If you haven't solved this problem yet, it looks like the reason BeautifulSoup is failing to find anything is that the resultStats never appear in the soup - your Request(page_google) is only returning JavaScript, not any search results that the JavaScript is dynamically loading in. You can verify this by adding a
print(soup)
command to your code and you will see that the resultStats div doesn't appear.
The following code:
import sys
from urllib2 import Request, urlopen
import urllib
from bs4 import BeautifulSoup
query = 'pokerbonus'
url = "http://www.google.de/search?q=%s" % urllib.quote_plus(query)
req_google = Request(url)
req_google.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
html_google = urlopen(req_google).read()
soup = BeautifulSoup(html_google)
scounttext = soup.find('div', id='resultStats')
print(scounttext)
Will print
<div class="sd" id="resultStats">Ungefähr 1.060.000 Ergebnisse</div>
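Since the question targets Python 3, here is a minimal sketch of the same two steps there: building the plain search URL and turning the text of the resultStats div into an integer. The helper names build_search_url and parse_result_count are made up for illustration, and the parser assumes the count is the first run of digits grouped by dots, commas, or spaces, as in the German output above.

```python
import re
from urllib.parse import quote_plus


def build_search_url(query, base="http://www.google.de/search"):
    """Return a plain search URL (no '#' fragment) for the query."""
    return "%s?q=%s" % (base, quote_plus(query))


def parse_result_count(stats_text):
    """Extract the first grouped number from text such as
    'Ungefähr 1.060.000 Ergebnisse'; return None if there is none."""
    match = re.search(r"\d[\d.,\s]*", stats_text)
    if match is None:
        return None
    # Drop the thousands separators and keep only the digits
    digits = re.sub(r"\D", "", match.group(0))
    return int(digits)
```

With the div found above, something like parse_result_count(scounttext.get_text()) should give the count as a number, provided the page still uses that layout.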
Lastly, using a tool like Selenium Webdriver might be a better way to go about solving this, as Google does not allow bots to scrape search results.
My problem is as follows:
I'm trying to write a scraper that runs through the order process of an airline ticketing website, so I want to scrape several pages, each of which depends on the result of the page before it. This is what I have so far:
import mechanize, urllib, urllib2
url = 'any url'
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 5.2; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.47 Safari/536.11')]
br.open(url)
response = br.response().read()
br.select_form(nr=1)
br.form.set_all_readonly(False)
## now I am reading out the variables of form(nr=1)
for control in br.form.controls:
    if not control.name:
        print " - (type) =", (control.type)
        continue
    print " - (name, type, value) =", (control.name, control.type, br[control.name])
## now I am modifying the variables
br['fromdate'] = '2012/11/03'
br['todate'] = '2012/11/07'
## now I am submitting the form and saving the output in the variable bookingsite
response = br.submit()
bookingsite = response.read()
And here is my problem: how can I use the variable bookingsite, which again contains a form that I want to modify and submit, just like a normal URL? Simply by calling
br.open(bookingsite)
Or is there another way to modify and submit the output (and then submit that output again to receive the next page)?
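After your initial response = br.submit(), the browser object itself holds the page that was returned, so select the next form on br (the response object has no select_form method; nr=0 is only an example index):
br.select_form(nr=0)
After you've changed the values of the fields within the form, submit it the same way:
response = br.submit()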
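P.S. If you're automating booking sites, they most likely rely heavily on JavaScript, which mechanize doesn't execute. I'd suggest using Requests instead; it doesn't run JavaScript either, but it makes it easier to replicate the underlying HTTP calls yourself. You'll be happy you did.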
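If you do want to work from the raw HTML string (such as bookingsite) rather than from the browser object, a rough sketch using only the Python 3 standard library is shown here. It swaps in html.parser for mechanize, the names FormFieldParser and build_form_post are hypothetical, and it only looks at input elements of the first form, so selects and textareas are ignored.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collect the action of the first <form> and its <input> name/value pairs."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms


def build_form_post(html, overrides):
    """Return (action, urlencoded_body) for the first form in html,
    with the given field values overriding the ones found in the page."""
    parser = FormFieldParser()
    parser.feed(html)
    fields = dict(parser.fields)
    fields.update(overrides)
    return parser.action, urlencode(fields)
```

The returned action may be relative and would need to be resolved against the page URL, and any session cookies from the earlier requests still have to be sent along, which is what the mechanize browser object otherwise does for you.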
I need to download videos from YouTube using a Python script. However, I am unable to get the URL of the video from the YouTube page.
For example, given the url: http://www.youtube.com/watch?v=5qcmCUsw4EQ&feature=g-all-u&context=G2633db8FAAAAAAAAAAA
I need to download the video as FLV or any other format. I also need to be able to download it in multiple qualities.
I tried several scripts such as youtube-dl and quvi, but they all give errors and don't work. Please help. It would be deeply appreciated.
You need to parse the flashvars variable of the <embed> tag that contains the video. The variable names change from time to time, so some experimentation may be required to find the current ones. Roughly speaking, you'll want to use libraries like mechanize to grab the HTML of the page and BeautifulSoup to parse the HTML and extract the flashvars attribute of the <embed> element. Then look through the variables to figure out which one contains the video URL.
e.g.,
import urlparse
import mechanize
import BeautifulSoup  # BeautifulSoup 3, as used below

YOUTUBE_URL = 'http://www.youtube.com/watch'  # assumed; not defined in the original snippet

def get_video_url(vidId):
    br = mechanize.Browser()
    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    # Follow refresh 0, but don't hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    # User-Agent (this is cheating, ok?)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    br.open('%s?v=%s' % (YOUTUBE_URL, vidId))
    soup = BeautifulSoup.BeautifulSoup(br.response().read())
    flashVars = urlparse.parse_qs(soup.find('embed').get('flashvars'))
    # Return the second '|'-separated field, i.e. the first video source URL
    return flashVars['fmt_stream_map'][0].split('|')[1]
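Because the question also asks for multiple qualities, here is a small Python 3 sketch of pulling every stream out of a flashvars string rather than just the first. It assumes, based only on the split('|') in the code above, that fmt_stream_map holds comma-separated 'format|url' entries; that layout, the function name stream_urls, and the example string are all assumptions, and the real variable names and format have changed over time.

```python
from urllib.parse import parse_qs


def stream_urls(flashvars):
    """Map each format code to its URL, assuming 'fmt|url,fmt|url' entries."""
    params = parse_qs(flashvars)
    entries = params.get("fmt_stream_map", [""])[0].split(",")
    streams = {}
    for entry in entries:
        fmt, sep, url = entry.partition("|")
        if sep:  # skip entries without a '|' separator
            streams[fmt] = url
    return streams


# Made-up flashvars value, for illustration only
example = (
    "fmt_stream_map=34%7Chttp%3A%2F%2Fexample.com%2Fhigh.flv"
    "%2C5%7Chttp%3A%2F%2Fexample.com%2Flow.flv&t=abc"
)
print(stream_urls(example))
```

A URL that itself contains a comma would break this simple split, so treat it as a starting point for experimentation.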
I'm trying to automate the download of some data from a webform. I'm using python's mechanize module.
The url is here: http://www.hiv.lanl.gov/components/sequence/HIV/search/search.html
I need to fill out the Sequence Length, Subtype and Genomic Region. I've got the Sequence Length and Genomic Region figured out, but I'm having trouble selecting the Subtype. When I load the form I get an empty SelectControl, and mechanize won't let me select anything.
This code should load the website:
import mechanize
import cookielib
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open('http://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html')
br.follow_link(text = 'Search Interface')
br.select_form(nr = 1)
Any help would be greatly appreciated.
-Will
EDIT:
I tried to use BeautifulSoup to re-parse the HTML (as per this SO question) but no luck there.
New Edit:
Below is an excerpt of the mechanize form.
<search POST http://www.hiv.lanl.gov/components/sequence/HIV/search/search.comp multipart/form-data
<TextControl(value SequenceAccessions SA_GenBankAccession 1=)>
<SelectControl(master=[Any, *HIV-1, HIV-2, SIV, SHIV, syntheticwholeDNA, NULL])>
<TextControl(value SEQ_SAMple SSAM_common_name 1=)>
<SelectControl(slave=[])>
<TextControl(value SequenceEntry SE_sequenceLength 4=)>
<CheckboxControl(sample_year_exact=[*1])>
<TextControl(value SEQ_SAMple SSAM_Sample_year 11=)>
<TextControl(value SEQ_SAMple SSAM_Sample_country 1=)>
<CheckboxControl(recombinants=[Recombinants])>
For some reason the slave control is not populated with the possible choices.
The problem is that the Subtype <select> options are populated by Javascript, which mechanize does not execute. The Javascript code runs when the page loads, populating the slave options with the HIV-1 list:
addLoadEvent(fillList(document.forms['search'].slave, lists['HIV-1']));
The mapping of Virus -> Subtype is stored in the JavaScript file search.js. You may need to copy this mapping into your Python script and set the slave form value manually.
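A minimal sketch of keeping such a mapping in Python is shown here. The subtype values in SUBTYPES_BY_VIRUS are placeholders, not the real lists from search.js, and the function names are hypothetical; the point is only to look up and validate a subtype before writing it into the form.

```python
# Placeholder mapping: copy the real lists from search.js. These values are
# illustrative only and are NOT taken from the site.
SUBTYPES_BY_VIRUS = {
    "HIV-1": ["Any", "A", "B", "C"],
    "HIV-2": ["Any", "A", "B"],
}


def subtype_choices(virus):
    """Return the subtype options stored for a virus (the master value)."""
    try:
        return list(SUBTYPES_BY_VIRUS[virus])
    except KeyError:
        raise ValueError("no subtype list stored for %r" % virus)


def validate_subtype(virus, subtype):
    """Return subtype if it is listed for the virus, else raise ValueError."""
    if subtype not in subtype_choices(virus):
        raise ValueError("%r is not a known subtype of %s" % (subtype, virus))
    return subtype
```

How the validated value is then written into the empty slave control depends on mechanize, so check its documentation for adding items to a SelectControl before relying on any particular call.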