I'm trying to scrape data from Mexico's central bank (Banxico) website but have hit a wall. In terms of actions: I first need to follow a link on an initial URL. On the page it leads to, I need to select two dropdown values and then click a submit button. If all goes well, I am taken to a new URL where a set of links to PDFs is available.
The original url is:
"http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html"
The nested URL (the one with the dropdowns) is:
"http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX"
The inputs (arbitrary) are, say: '07/03/2019' and '14/03/2019'.
Using BeautifulSoup and requests, I feel like I got as far as filling in the dropdown values, but I failed to "click" the button and reach the final URL with the list of links.
My code follows below:
from bs4 import BeautifulSoup
import requests

# Landing page: grab the last link, which points to the page with the dropdowns.
pagem = requests.get("http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html")
soupm = BeautifulSoup(pagem.content, "lxml")
lst = soupm.find_all('a', href=True)
url = lst[-1]['href']

# Page with the two date dropdowns.
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
xin = soup.find("select", {"id": "_id0:selectOneFechaIni"})
xfn = soup.find("select", {"id": "_id0:selectOneFechaFin"})
ino = list(xin.stripped_strings)   # available start dates
fino = list(xfn.stripped_strings)  # available end dates

# Try to submit the form.
headers = {'Referer': url}
data = {'_id0:selectOneFechaIni': '07/03/2019',
        '_id0:selectOneFechaFin': '14/03/2019',
        "_id0:accion": "_id0:accion"}
respo = requests.post(url, data, headers=headers)
print(respo.url)
In the code, respo.url ends up equal to url, so the code fails. Can anybody please help me identify where the problem is? I'm a newbie to scraping, so the issue might be obvious; apologies in advance for that. I'd appreciate any help. Thanks!
Last time I checked, you cannot submit a form by clicking buttons with BeautifulSoup and Python. There are two approaches I typically see:
Reverse engineer the form
If the form makes AJAX calls (e.g. makes a request behind the scenes, common for SPAs written in React or Angular), then the best approach is to use the network requests tab in Chrome or another browser to understand what the endpoint is and what the payload is. Once you have those answers, you can make a POST request with the requests library to that endpoint with data=your_payload_dictionary (e.g. manually do what the form is doing behind the scenes). Read this post for a more elaborate tutorial.
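For illustration only, here is a minimal sketch of that idea; the endpoint and field names below are placeholders I made up, not values taken from any real form, so replace them with whatever the network tab shows:
import requests

# Hypothetical endpoint and payload observed in the browser's network tab.
endpoint = "https://example.com/api/search"                    # assumption: replace with the observed URL
payload = {"dateFrom": "07/03/2019", "dateTo": "14/03/2019"}   # assumption: field names will differ

response = requests.post(endpoint, data=payload)
print(response.status_code, response.headers.get("Content-Type"))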
Use a headless browser
If the website is written in something like ASP.NET or a similar MVC framework, then the best approach is to use a headless browser to fill out a form and click submit. A popular framework for this is Selenium. This simulates a normal browser. Read this post for a more elaborate tutorial.
Judging by a cursory look at the page you're working on, I recommend approach #2.
The page you have to scrape is:
http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces
Add the date to consult and the JSESSIONID from your cookies to the payload, and put Referer, User-Agent and all the usual good stuff in the request headers.
Example:
import requests
import pandas as pd
cl = requests.session()
url = "http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces"
payload = {
    "JSESSIONID": "cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000",
    "fechaAConsultar": "21/03/2019"
}
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded",
    "Referer": "http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000"
}
response = cl.post(url, data=payload, headers=headers)
tables = pd.read_html(response.text)
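If the post is accepted, pd.read_html returns the page's tables as a list of DataFrames (tables[0] being the first one); whether the sectorization data actually comes back depends on the session cookie matching the dates selected earlier.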
When just clicking through the pages it looks like there's some sort of cookie/session stuff going on that might be difficult to take into account when using requests.
(Example: http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=8AkD5D0IDxiiwQzX6KqkB2WIYRjIQb2TIERO1lbP35ClUgzmBNkc!-1120047000)
It might be easier to code this up using Selenium, since that will automate the browser (and take care of all the headers and whatnot). You'll still have access to the HTML in order to scrape what you need, and you can probably reuse a lot of what you're already doing.
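As a rough sketch of what that could look like (Selenium 3 style locators, reusing the element ids _id0:selectOneFechaIni and _id0:selectOneFechaFin from your own code; that the submit button can be located by the name _id0:accion is an assumption, so verify it in the browser's inspector):
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()  # or Firefox(), assuming the matching driver is installed
driver.get("http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces"
           "?BMXC_claseIns=GUB&BMXC_lang=es_MX")

# Pick the two dates from the dropdowns and submit the form.
Select(driver.find_element_by_id("_id0:selectOneFechaIni")).select_by_visible_text("07/03/2019")
Select(driver.find_element_by_id("_id0:selectOneFechaFin")).select_by_visible_text("14/03/2019")
driver.find_element_by_name("_id0:accion").click()  # assumption: the button's name attribute

# The resulting page should hold the pdf links; scrape them straight from the driver.
links = [a.get_attribute("href") for a in driver.find_elements_by_tag_name("a")]
print(links)
driver.quit()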
This question already has answers here:
How to "log in" to a website using Python's Requests module?
I googled my user agent and put it into my program, but no luck:
import requests
from bs4 import BeautifulSoup

URL = 'Servicenow blah blah'
headers = {
    "User-Agent": 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'
}

page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
Very simple code so far.
Ultimately, I am trying to log into this website (or even circumvent the login by using a user agent that is already logged in, if that is possible; this is my main question here) and then parse the HTML for a certain element and monitor it for changes.
OR, if there is a better, simpler tool for this, I would LOVE to know.
The HTML that gets printed says "Your session has expired", etc.
Firstly, a user agent is not typically how session data is tracked; it just lets the website know details about which browser and version you are using. Session information is typically kept in your cookies.
For the login issue, it sounds like you just need to perform the login request and keep track of the cookies (etc.) it provides. However, since you said "monitor for changes", I suspect there might be some JavaScript involved down the line ;) I recommend looking into Selenium for this. It's a browser driver, which means it just interacts with a normal browser and will take care of all JavaScript execution and cookie tracking for you!
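If you do go the plain-requests route first, a minimal sketch might look like this; the login URL and the username/password field names are placeholders, so copy the real ones from the login form or the browser's network tab:
import requests

session = requests.Session()

login_url = "https://your-instance.example.com/login.do"                   # assumption
credentials = {"user_name": "me@example.com", "user_password": "secret"}   # assumption: field names vary

resp = session.post(login_url, data=credentials)
resp.raise_for_status()

# The session now carries the authentication cookies, so later requests stay "logged in".
page = session.get("https://your-instance.example.com/some_page.do")       # assumption
print(page.status_code)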
This question already has an answer here:
How to programmatically send POST request to JSF page without using HTML form?
I would like to automate the extraction of data from this site:
http://www.snirh.gov.br/hidroweb/publico/medicoes_historicas_abas.jsf
Explanation of the steps to follow to extract the data that I want:
Beginning at the URL above, click on "Séries Históricas". You should see a page with a form with some inputs. In my case I only need to enter the station code in the "Código da Estação" input. Suppose the station code is 938001: insert that and hit "Consultar". Now you should see a lot of checkboxes. Check the one below "Selecionar", which checks all the others. Supposing that I don't want all kinds of data, only rain rate and flow rate, I check only the checkbox below "Chuva" and the one below "Vazão". After that it is necessary to choose the file type to download; choose "Arquivo Texto (.TXT)", which is the .txt format. Then generate the file by clicking "Gerar Arquivo". Finally, download it by clicking "Baixar Arquivo".
Note: the site now is in version v1.0.0.12, it may be different in the future.
I have a list of station codes. Imagine how bad it would be to do these operations more than 1,000 times! I want to automate this.
Many people in Brazil have been trying to automate the extraction of data from this web site. Some that I found:
Really old one: https://www.youtube.com/watch?v=IWCrC0MlasQ
Others:
https://pt.stackoverflow.com/questions/60124/gerar-e-baixar-links-programaticamente/86150#86150
https://pt.stackoverflow.com/questions/282111/r-download-de-dados-do-portal-hidroweb
The earliest attempt I found, which also no longer works because the site has changed: https://github.com/duartejr/pyHidroWeb
So a lot of people need this, and none of the above solutions work anymore because of updates to the site.
I do not want to use Selenium; it is slow compared with a solution that uses the requests library, and it needs a browser interface.
My attempt:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
from urllib import parse
URL = 'http://www.snirh.gov.br/hidroweb/publico/apresentacao.jsf'
s = requests.Session()
r = s.get(URL)
JSESSIONID = s.cookies['JSESSIONID']
soup = BeautifulSoup(r.content, "html.parser")
javax_faces_ViewState = soup.find("input", {"type": "hidden", "name":"javax.faces.ViewState"})['value']
d = {}
d['menuLateral:menuForm'] = 'menuLateral:menuForm'
d['javax.faces.ViewState'] = javax_faces_ViewState
d['menuLateral:menuForm:menuSection:j_idt68:link'] = 'menuLateral:menuForm:menuSection:j_idt68:link'
h = {}
h['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'
h['Accept-Encoding'] = 'gzip, deflate'
h['Accept-Language'] = 'pt-BR,pt;q=0.9,en-US;q=0.8,en;q=0.7'
h['Cache-Control'] = 'max-age=0'
h['Connection'] = 'keep-alive'
h['Content-Length'] = '218'
h['Content-Type'] = 'application/x-www-form-urlencoded'
h['Cookie'] = '_ga=GA1.3.4824711.1520011013; JSESSIONID={}; _gid=GA1.3.743342153.1522450617'.format(JSESSIONID)
h['Host'] = 'www.snirh.gov.br'
h['Origin'] = 'http://www.snirh.gov.br'
h['Referer'] = 'http://www.snirh.gov.br/hidroweb/publico/apresentacao.jsf'
h['Upgrade-Insecure-Requests'] = '1'
h['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
URL2 = 'http://www.snirh.gov.br/hidroweb/publico/medicoes_historicas_abas.jsf'
post_response = s.post(URL2, headers=h, data=d)
soup = BeautifulSoup(post_response.text, "html.parser")
javax_faces_ViewState = soup.find("input", {"type": "hidden", "name":"javax.faces.ViewState"})['value']
def f_headers(JSESSIONID):
    headers = {}
    headers['Accept'] = '*/*'
    headers['Accept-Encoding'] = 'gzip, deflate'
    headers['Accept-Language'] = 'pt-BR,pt;q=0.9,en-US;q=0.8,en;q=0.7'
    headers['Connection'] = 'keep-alive'
    headers['Content-Length'] = '672'
    headers['Content-type'] = 'application/x-www-form-urlencoded;charset=UTF-8'
    headers['Cookie'] = '_ga=GA1.3.4824711.1520011013; JSESSIONID=' + str(JSESSIONID)
    headers['Faces-Request'] = 'partial/ajax'
    headers['Host'] = 'www.snirh.gov.br'
    headers['Origin'] = 'http://www.snirh.gov.br'
    headers['Referer'] = 'http://www.snirh.gov.br/hidroweb/publico/medicoes_historicas_abas.jsf'
    headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    return headers
def build_data(data, n, javax_faces_ViewState):
    if n == 1:
        data['form'] = 'form'
        data['form:fsListaEstacoes:codigoEstacao'] = '938001'
        data['form:fsListaEstacoes:nomeEstacao'] = ''
        data['form:fsListaEstacoes:j_idt92'] = 'a39c3713-c0f7-4461-b2c8-c2814b3a9af1'
        data['form:fsListaEstacoes:j_idt101'] = 'a39c3713-c0f7-4461-b2c8-c2814b3a9af1'
        data['form:fsListaEstacoes:nomeResponsavel'] = ''
        data['form:fsListaEstacoes:nomeOperador'] = ''
        data['javax.faces.ViewState'] = javax_faces_ViewState
        data['javax.faces.source'] = 'form:fsListaEstacoes:bt'
        data['javax.faces.partial.event'] = 'click'
        data['javax.faces.partial.execute'] = 'form:fsListaEstacoes:bt form:fsListaEstacoes'
        data['javax.faces.partial.render'] = 'form:fsListaEstacoes:pnListaEstacoes'
        data['javax.faces.behavior.event'] = 'action'
        data['javax.faces.partial.ajax'] = 'true'
data = {}
build_data(data, 1, javax_faces_ViewState)
headers = f_headers(JSESSIONID)
post_response = s.post(URL, headers=headers, data=data)
print(post_response.text)
That prints:
<?xml version='1.0' encoding='UTF-8'?>
<partial-response><changes><update id="javax.faces.ViewState"><![CDATA[-1821287848648292010:1675387092887841821]]></update></changes></partial-response>
Explanation of what I tried:
I used the Chrome developer tools (F12), opened the "Network" tab, and on the website clicked "Séries Históricas" to discover what the headers and form fields are. I think I did it correctly. Is there another, or a better, way? Some people have told me about Postman and Postman Interceptor, but I don't know how to use them or whether they would help.
After that I filled in the station code 938001 in the "Código da Estação" input and hit "Consultar" to see what the headers and form fields were.
Why is the site returning XML? Does this mean that something went wrong?
This XML has a CDATA section.
What does <![CDATA[]]> in XML mean?
I understand the basic idea of CDATA, but not how it is used on this site or how I should use it in the scrape. I guess it is used to save partial information, but that is just a guess; I am lost.
I tried this for the other clicks too, and got more form fields, and the responses were also XML. I did not put them here because they would make the post longer and the XML is big too.
One SO answer that is not completely related to mine is this:
https://stackoverflow.com/a/8625286
That answer explains the steps to upload a file to a JSF-generated form using Java. That is not my case; I want to download a file using Python requests.
General questions:
When is it possible, and when is it not possible, to use requests + bs4 to scrape a website?
What are the steps for this kind of web scrape?
In cases like this site, is it possible to extract the information straightforwardly in one request, or do we have to mimic, step by step, what we would do by hand when filling in the form? Based on this answer it looks like the answer is no: https://stackoverflow.com/a/35665529
I have faced many difficulties and doubts. In my opinion there is a gap in the explanations available for this kind of situation.
I agree with this SO question
Python urllib2 or requests post method
on the point that most tutorials are useless for a situation like the site I am trying to scrape. A question like this one, https://stackoverflow.com/q/43793998/9577149, which is as hard as mine, has no answer.
This is my first post on Stack Overflow; sorry if I made any mistakes, and I am not a native English speaker, so feel free to correct me.
1) It is always possible to scrape HTML websites using bs4. But getting the response you would like requires more than just Beautiful Soup.
2) My approach with bs4 is usually as follows:
import requests
from bs4 import BeautifulSoup

params = {}  # dict of query-string parameters
response = requests.request(
    method="GET",
    url='http://yourwebsite.com',
    params=params
)
soup = BeautifulSoup(response.text, 'html.parser')
3) If you notice, when you fill out the first form (Séries Históricas) and click submit, the page URL (or action URL) does not change. That is because an AJAX request is being made to retrieve and update the data on the current page. Since you can't see that request in the page URL itself, it is hard to mimic it directly.
To submit the form I would recommend looking into Mechanize, a Python library for filling in and submitting form data:
from mechanize import Browser

b = Browser()
b.open("http://yourwebsite.com")
b.select_form(name="form")   # select the form by its name attribute
b["bacia"] = ["value"]       # set a control's value(s)
response = b.submit()        # submit the form
The URL of the last request is wrong: in the penultimate line of code, s.post(URL, headers=headers, data=data), the parameter should be URL2 instead.
Also, the cookie name is now SESSIONID rather than JSESSIONID, but that must be a change made since the question was asked.
You do not need to manage cookies manually like that when using requests.Session(); it will keep track of cookies for you automatically.
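A minimal sketch of those two fixes together, reusing the URL variables from the question's script (the payload here is just a placeholder for the JSF form data built earlier):
import requests

URL = 'http://www.snirh.gov.br/hidroweb/publico/apresentacao.jsf'
URL2 = 'http://www.snirh.gov.br/hidroweb/publico/medicoes_historicas_abas.jsf'

s = requests.Session()
s.get(URL)                      # the session stores whatever session cookie the server sets

payload = {}                    # placeholder: build the JSF form data as in the question
post_response = s.post(URL2, data=payload)   # post to URL2, with no manual 'Cookie' header
print(post_response.status_code)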
I'm trying to crawl a website using the requests library. However, the particular website I am trying to access (http://www.vi.nl/matchcenter/vandaag.shtml) has a very intrusive cookie statement.
I am trying to access the website as follows:
from bs4 import BeautifulSoup as soup
import requests
website = r"http://www.vi.nl/matchcenter/vandaag.shtml"
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"})
htmlsoup = soup(html.text, "html.parser")
This returns a web page that consists of just the cookie statement with a big button to accept. If you try accessing this page in a browser, you find that pressing the button redirects you to the requested page. How can I do this using requests?
I considered using mechanize.Browser but that seems a pretty roundabout way of doing it.
Try setting:
cookies = dict(BCPermissionLevel='PERSONAL')
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"}, cookies=cookies)
This will bypass the cookie consent page and land you straight on the page you requested.
Note: you can find the above by analyzing the JavaScript code that runs on the cookie consent page; it is a bit obfuscated, but it should not be difficult to follow. If you run into the same type of problem again, take a look at what cookies are set by the JavaScript that runs in the relevant event handler.
I have found this SO question which asks how to send cookies in a POST using requests. The accepted answer states that the latest build of Requests will build CookieJars for you from simple dictionaries. Below is the POC code included in the original answer.
import requests
cookie = {'enwiki_session': '17ab96bd8ffbe8ca58a78657a918558'}
r = requests.post('http://wikipedia.org', cookies=cookie)
I want to download a webpage using Python for a web scraping task. The problem is that the website requires cookies to be enabled; otherwise it serves a different version of the page. I did implement a solution that solves the problem, but in my opinion it is inefficient. I need your help to improve it!
This is how I go over it now:
import requests
import cookielib
cj = cookielib.CookieJar()
user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
#first request to get the cookies
requests.get('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&',headers=user_agent, timeout=2, cookies = cj)
# second request reusing cookies served first time
r = requests.get('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&',headers=user_agent, timeout=2, cookies = cj)
html_text = r.text
Basically, I create a CookieJar object and then send two consecutive requests for the same URL. The first time it serves me the bad page but, as compensation, gives me cookies. The second request reuses those cookies and I get the right page.
The question is: Is it possible to just use one request and still get the right cookie enabled version of a page?
I tried sending a HEAD request the first time instead of a GET to minimize traffic, but in that case no cookies are served. Googling for it didn't give me the answer either.
So, it would be interesting to understand how to do this efficiently. Any ideas?
You need to make a request to get the cookie, so no, you cannot obtain the cookie and reuse it without making two separate requests. If by "cookie-enabled" you mean the version that recognizes your script as having cookies, then it all depends on the server, and you could try:
hardcoding the cookies before making first request,
requesting some smallest possible page (with smallest possible response yet containing cookies) to obtain first cookie,
trying to find some workaround (maybe adding some GET argument will fool the site into believing you have cookies, but you would need to find it for this specific site),
I think the winner here might be to use requests's session framework, which takes care of the cookies for you.
That would look something like this:
import requests

user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}

s = requests.Session()
s.headers.update(user_agent)  # headers set on the session apply to every request it makes

r = s.get('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&', timeout=2)
html_text = r.text
Try that and see if that works?
I am looking for a way to view the request (not the response) headers, specifically which browser mechanize claims to be. I would also like to know how to manipulate them, e.g. to set another browser.
Example:
import mechanize
browser = mechanize.Browser()
# Now I want to make a request to eg example.com with custom headers using browser
The purpose is of course to test a website and see whether or not it shows different pages depending on the reported browser.
It has to be the mechanize browser as the rest of the code depends on it (but is left out as it's irrelevant.)
browser.addheaders = [('User-Agent', 'Mozilla/5.0 blahblah')]
You've got an answer on how to change the headers, but if you want to see the exact headers that are being used, try using a proxy that displays the traffic, e.g. Fiddler2 on Windows, or see this question for some Linux alternatives.
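If I remember correctly, mechanize can also dump the raw HTTP traffic itself, which shows the exact request headers without an external proxy; treat this as a sketch to verify against your mechanize version:
import sys
import logging
import mechanize

# Route mechanize's internal logging to stdout.
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.DEBUG)

browser = mechanize.Browser()
browser.set_debug_http(True)   # print the headers exchanged with the server
browser.addheaders = [('User-Agent', 'Mozilla/5.0 blahblah')]
browser.open("http://example.com")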
You can modify the Referer too:
br.addheaders = [('Referer', 'http://google.com')]