I intend to use twill to fill out a form on one page, hit the submit button, and then use BeautifulSoup to parse the resulting page. How can I feed BeautifulSoup the HTML page? I assume I have to read the current URL, but I do not know how to actually retrieve it. I have tried twill's TwillBrowser.get_url(), but it only returns None.
For any future sufferers: I had better luck using mechanize instead of twill, since twill is a thin, no-longer-maintained shell around mechanize. The solution is as follows:
import mechanize

url = "http://foo.com"  # mechanize needs a full URL, including the scheme

br = mechanize.Browser()
br.open(url)
br.select_form(name="YOURFORMNAMEHERE")  # keep the quotation marks
br["YOURINPUTFIELDNAMEHERE"] = ["YOURVALUEHERE"]  # must be a list, even for a single value
response = br.submit()
print response.geturl()
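From there, the resulting page can be handed straight to BeautifulSoup, which was the original goal; a minimal sketch, assuming bs4 is installed:

from bs4 import BeautifulSoup

# response is the mechanize response object returned by br.submit()
soup = BeautifulSoup(response.read(), "html.parser")
print soup.title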
Finally figured this out!
If you import twill like so:
import twill.commands as com
then you can get the current URL with:
url = com.browser.get_url()
Source: http://nullege.com/codes/search/twill.commands.browser.get_url
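For completeness, the same round trip can be done in twill itself; a rough sketch using twill's documented go/formvalue/submit commands, assuming your twill version exposes browser.get_html() (the form and field names are placeholders):

import twill.commands as com
from bs4 import BeautifulSoup

com.go("http://foo.com")
com.formvalue("YOURFORMNAMEHERE", "YOURINPUTFIELDNAMEHERE", "YOURVALUEHERE")
com.submit()

url = com.browser.get_url()  # URL of the page after submission
soup = BeautifulSoup(com.browser.get_html(), "html.parser")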
I'm a beginner in the world of Python and web scrapers. I'm used to building scrapers for dynamic URLs, where the URI changes when I input specific parameters in the URL itself.
Ex: Wikipedia.
(If I search for "Stack Overflow" I get a URI that looks like this: https://en.wikipedia.org/wiki/Stack_Overflow)
At the moment I have been challenged to develop a web scraper to collect data from this page.
The field "Texto/Termos a serem pesquisados" corresponds to a search field, but when I input the search the URL stays the same, so I can't get the right HTML code for my research.
I'm used to working with BeautifulSoup and Requests to do the scraping, but in this case they are of no use, since the URL stays the same after the search.
import requests
from bs4 import BeautifulSoup

url = 'http://comprasnet.gov.br/acesso.asp?url=/ConsultaLicitacoes/ConsLicitacao_texto.asp'
html = requests.get(url)
bsObj = BeautifulSoup(html.content, 'html.parser')
print(bsObj)
# And from now on I can't go any further
Usually I would do something like:
url = 'https://en.wikipedia.org/wiki/'
term = input('Input your search :) ')  # renamed so it does not shadow the input() builtin
search = url + term
And then do all the BeautifulSoup and findAll work to get my data from the HTML code.
I have tried Selenium too, but I'm looking for something different because of all the webdriver overhead. With the following piece of code I have achieved some odd results, but I still can't scrape the HTML in a good way.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
from bs4 import BeautifulSoup
# Access the page and input the search in the field
driver = webdriver.Chrome()
driver.get('http://comprasnet.gov.br/acesso.asp?url=/ConsultaLicitacoes/ConsLicitacao_texto.asp')
driver.switch_to.frame('main2')
busca = driver.find_element_by_id("txtTermo")
busca.send_keys("GESTAO DE PESSOAS")
#data_inicio = driver.find_element_by_id('dt_publ_ini')
#data_inicio.send_keys("01/01/2018")
#data_fim = driver.find_element_by_id('dt_publ_fim')
#data_fim.send_keys('20/12/2018')
botao = driver.find_element_by_id('ok')
botao.click()
So, given all that:
Is there a way to scrape data from these static URLs?
Can I input a search in the field via code?
Why can't I scrape the right source code?
The problem is that your initial search page is using frames for the searching & results, which makes it harder for BeautifulSoup to work with it. I was able to obtain the search results by using a slightly different URL and MechanicalSoup instead:
>>> from mechanicalsoup import StatefulBrowser
>>> sb = StatefulBrowser()
>>> sb.open('http://comprasnet.gov.br/ConsultaLicitacoes/ConsLicitacao_texto.asp')
<Response [200]>
>>> sb.select_form() # select the search form
<mechanicalsoup.form.Form object at 0x7f2c10b1bc18>
>>> sb['txtTermo'] = 'search text' # input the text to search for
>>> sb.submit_selected() # submit the form
<Response [200]>
>>> page = sb.get_current_page() # get the returned page in BeautifulSoup form
>>> type(page)
<class 'bs4.BeautifulSoup'>
Note that the URL I'm using here is that of the frame that has the search form and not the page you provided that was inlining it. This removes one layer of indirection.
MechanicalSoup is built on top of BeautifulSoup and provides some tools for interacting with websites in a similar way to the old mechanize library.
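From there, page supports the full BeautifulSoup API; for example (the 'tr' selector is a guess, so inspect the actual results markup first):

# page is the bs4.BeautifulSoup object from sb.get_current_page()
for row in page.find_all('tr'):  # hypothetical selector for result rows
    print(row.get_text(strip=True))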
I have the following script:
import requests
import cookielib
jar = cookielib.CookieJar()
login_url = 'http://www.whispernumber.com/signIn.jsp?source=calendar.jsp'
acc_pwd = {'USERNAME': 'myusername',
           'PASSWORD': 'mypassword'}
r = requests.get(login_url, cookies=jar)
r = requests.post(login_url, cookies=jar, data=acc_pwd)
page = requests.get('http://www.whispernumber.com/calendar.jsp?day=20150129', cookies=jar)
print page.text
But print page.text shows that the site is trying to redirect me back to the login page:
<script>location.replace('signIn.jsp?source=calendar.jsp');</script>
I have a feeling this is because of the JSP, and I'm not sure how to log in to a JSP page. Thanks for the help!
Firstly, you're posting to the wrong page. If you view the HTML from your link, you'll see the form is as follows:
<form action="ValidatePassword.jsp" method="post">
Assuming you're correctly authenticated you will probably get a cookie back that you can use for subsequent page requests. (You seem to be thinking along the right lines.)
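Assuming the field names match the question's USERNAME/PASSWORD inputs (worth verifying against the form's HTML), a rough sketch with requests.Session, which manages the cookie jar for you:

import requests

session = requests.Session()

# Hit the login page first so the session picks up any initial cookies
session.get('http://www.whispernumber.com/signIn.jsp?source=calendar.jsp')

# Post the credentials to the form's action, not to the login page itself
session.post('http://www.whispernumber.com/ValidatePassword.jsp',
             data={'USERNAME': 'myusername', 'PASSWORD': 'mypassword'})

# The session now carries any auth cookie to subsequent requests
page = session.get('http://www.whispernumber.com/calendar.jsp?day=20150129')
print page.text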
Requests isn't a web browser; it is an HTTP client that simply grabs the raw text of the page. You are going to want to use something like Selenium or another headless browser to programmatically log in to a site.
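A minimal sketch of that route, reusing the Selenium API that appears earlier on this page (the element IDs are hypothetical; inspect the real sign-in form for the actual ones):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.whispernumber.com/signIn.jsp?source=calendar.jsp')

# Hypothetical element IDs; check the page source for the real ones
driver.find_element_by_id('USERNAME').send_keys('myusername')
driver.find_element_by_id('PASSWORD').send_keys('mypassword')
driver.find_element_by_id('submit').click()

# The browser session now holds the auth cookie
driver.get('http://www.whispernumber.com/calendar.jsp?day=20150129')
print(driver.page_source)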
I'm trying to do something very simple using Python's Mechanize library. I want to go to http://careers.force.com/jobs/ts2__JobSearch, select Dublin Ireland from the drop-down list, and then hit enter.
I've written a very short Python script for this, but for some reason when I run it, it returns the HTML for the default search page rather than the search page that is produced after selecting the location (Dublin Ireland) and hitting enter. I have no idea what is going wrong:
import mechanize
link = "http://careers.force.com/jobs/ts2__JobSearch"
br = mechanize.Browser()
br.open(link)
br.select_form('j_id0:j_id1:atsForm')
br.form['j_id0:j_id1:atsForm:j_id38:1:searchCtrl'] = ["Ireland - Dublin"]
response = br.submit()
newsite = response.read()
This is in case you're still having this problem, or if not, in case anyone else is having it in the future....
I looked at the POST data being sent by your browser when you manually selected something, and wrote a function that will get you to the page you want by manually performing a POST with urllib.urlencode'd data. Cheers.
import mechanize, cookielib, urllib, re

def get_search(html, controls):
    # viewstate
    s = re.search('ViewState" value="', html).span()[1]
    e = re.search('"', html[s:]).span()[0] + s
    state = html[s:e]
    # viewstateversion
    s = re.search('ViewStateVersion', html).span()[1]
    s = s + re.search('value="', html[s:]).span()[1]
    e = re.search('"', html[s:]).span()[0] + s
    version = html[s:e]
    # viewstatemac
    s = re.search('ViewStateMAC', html).span()[1]
    s = s + re.search('value="', html[s:]).span()[1]
    e = re.search('"', html[s:]).span()[0] + s
    mac = html[s:e]
    return {controls[0]: controls[0], controls[1]: '', controls[2]: 'Ireland - Dublin',
            controls[3]: 'Search', 'com.salesforce.visualforce.ViewState': state,
            'com.salesforce.visualforce.ViewStateVersion': version,
            'com.salesforce.visualforce.ViewStateMAC': mac}
#Define variables and create a mechanize browser
link = "http://careers.force.com/jobs/ts2__JobSearch"
br = mechanize.Browser()
cj=cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.open(link)
#get the html data
html=br.response().read()
#get the control names from the correct form
br.select_form(nr=1)
controls=[control.name for control in br.form.controls]
#run function with html and control names list as parameters and run urllib.urlencode on what gets returned
postdata=urllib.urlencode(get_search(br.response().read(), controls))
#go to the webpage again but this time also submit the encoded data
br.open(link, postdata)
#There Ya Go
print br.response().read()
I am using Python and mechanize to log in to a site. I can get it to log in, but once I am in I need mechanize to select a new form and then submit again. I will need to do this three or four times to get to the page I need. Once I am logged in, how do I select the form on the second page?
import mechanize
import urlparse
br = mechanize.Browser()
br.open("https://test.com")
print(br.title())
br.select_form(name="Login")
br['login_name'] = "test"
br['pwd'] = "test"
br.submit()
new_br = mechanize.Browser()
new_br.open("test2.com")
new_br.select_form(name="frm_page2") # where the error happens
I get the following error.
FormNotFoundError: no form matching name 'frm_page2'
Thanks for the help.
You can't use name='...' when selecting a form, because Mechanize already uses 'name' itself.
If you want to find something by name, you need to do:
br.select_form(attrs={'name': 'Login'})
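Applied to the code in the question, the fix looks something like this (a sketch; note it reuses the same browser for the second page so the login cookies carry over, rather than creating a fresh one):

br.select_form(attrs={'name': 'Login'})
br['login_name'] = "test"
br['pwd'] = "test"
br.submit()

# Keep using the same browser so the session survives the login
br.open("https://test2.com")
br.select_form(attrs={'name': 'frm_page2'})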
I know how to do a HEAD request with httplib, but I have to use mechanize for this site.
Essentially, what I need to do is grab a value from the headers (the filename) without actually downloading the file.
Any suggestions on how I could accomplish this?
Mechanize itself only sends GETs and POSTs, but you can easily extend the Request class to send HEAD. Example:
import mechanize

class HeadRequest(mechanize.Request):
    def get_method(self):
        return "HEAD"

request = HeadRequest("http://www.example.com/")
response = mechanize.urlopen(request)
print response.info()
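If the value you need is the filename, it normally comes from the Content-Disposition header; a small sketch on top of the above (assuming the server sets that header at all):

# response.info() behaves like a Python 2 httplib.HTTPMessage
disposition = response.info().getheader('Content-Disposition')
print disposition  # e.g. attachment; filename="report.pdf"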
In mechanize there is no need for a HeadRequest class or the like.
Simply:
import mechanize
br = mechanize.Browser()
r = br.open("http://www.example.com/")
print r.info()
That's all.