I am currently pulling data from a public data series at https://www3.bcb.gov.br/expectativas/publico/en/serieestatisticas
This is a public page that I believe uses Apache Wicket.
I'm usually fine with scraping, whether GET or POST, but here my colleagues and I are stuck. Can anyone help me understand what URL needs to be used to actually make the request? Here's what I've got so far:
The Fiddler capture of the manually executed request (text view) shows the form inputs, and this is the same form data I'm passing in my own request:
form19_hf_0=&indicador=0&calculo=0&linhaPeriodicidade%3Aperiodicidade=0&tfDataInicial=11%2F10%2F2015&tfDataFinal=11%2F24%2F2015&divPeriodoRefereEstatisticas%3AgrupoAnoReferencia%3AanoReferenciaInicial=16&divPeriodoRefereEstatisticas%3AgrupoAnoReferencia%3AanoReferenciaFinal=16&btnCSV=Generate+CSV
Summary:
I need some help: I can't seem to get the POST working correctly; it takes me to a different page, and I'm not sure how to work through this one.
NB: I'm trying to grab back a CSV.
The libraries I'm using are primarily Requests (I was going to use lxml, but I don't think it's going to be applicable here).
I've been trying to figure out the right form with Postman and Fiddler to understand what the request needs to be.
So, the solution to this was somewhat indirect. We were not able to do a straight POST because the page incremented the actual POST URL in a way that was essentially impossible to predict.
The solution we used was to install the Selenium WebDriver and use it to select the dropdowns by their visible values and simulate the button clicks.
This worked out very cleanly.
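For anyone curious, the gist of it was roughly the following (a minimal sketch; it assumes the form controls carry the same name attributes that appear in the Fiddler payload above, namely indicador, calculo, tfDataInicial, tfDataFinal and btnCSV, so verify the locators against the rendered HTML):

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()  # any WebDriver-supported browser works
driver.get("https://www3.bcb.gov.br/expectativas/publico/en/serieestatisticas")

# Pick the indicator and calculation; the values mirror indicador=0 / calculo=0
# from the captured payload (selecting by visible text also works).
Select(driver.find_element(By.NAME, "indicador")).select_by_value("0")
Select(driver.find_element(By.NAME, "calculo")).select_by_value("0")

# Fill in the date range.
driver.find_element(By.NAME, "tfDataInicial").send_keys("11/10/2015")
driver.find_element(By.NAME, "tfDataFinal").send_keys("11/24/2015")

# Click "Generate CSV" and give the browser a moment to finish the download.
driver.find_element(By.NAME, "btnCSV").click()
time.sleep(10)
driver.quit()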
Thanks and HTH anyone else who might have a similar problem.
Related
I am not experienced in web development, and I am trying to use requests.get to get some authenticated data. So far the internet appears to tell me to just do it, and I think I am formatting it wrong, but I'm unsure how. After some trial and error, I was able to grab my cookie for the website. The following is a made-up version of what I grabbed, with similar formatting.
cookie = "s:abcDEfGHIJ12k34LMNopqRst5UvW-6xy.ZAbCd/eFGhi7j8KlmnoPqrstUvWXYZ90a1BCDE2fGH3"
Then, in Python, I am trying to send a request. The following is a bit more pseudocode for what I am doing:
r = requests.get('https://www.website.com/api/getData', cookies={"connect.sid": cookie})
After all this, the site keeps sending me a 400 error. I'm wondering if you have any idea whether I am putting in the wrong cookie (or the wrong part of it), whether everything looks right and it is probably the site at fault, or what.
I grabbed a Wireshark capture and found there were other fields in the cookie being sent that I had not filled out:
_ga
_gid
___gads
Once I filled those out with the relevant values, it worked.
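For anyone hitting the same wall, the working request ended up looking roughly like this (the extra cookie values below are placeholders; copy the real ones from your browser or the Wireshark capture):

import requests

# Placeholder values: copy the real ones from the browser / Wireshark capture.
cookies = {
    "connect.sid": "s:abcDEfGHIJ12k34LMNopqRst5UvW-6xy.ZAbCd/eFGhi7j8KlmnoPqrstUvWXYZ90a1BCDE2fGH3",
    "_ga": "GA1.2.111111111.1111111111",
    "_gid": "GA1.2.222222222.2222222222",
    "___gads": "ID=placeholder:T=1111111111:S=placeholder",
}

r = requests.get("https://www.website.com/api/getData", cookies=cookies)
print(r.status_code)  # 200 once all the expected cookie fields are present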
I'm currently trying to design a filter with which I can block certain URLs and also block based on keywords that may appear in the data coming back in the HTTP response.
Just for clarification I'm working on a Windows 10 x64 machine for this project.
In order to be able to do such a thing, I quickly understood that I would need a web proxy.
I checked out about six proxies written in Python that I found on GitHub.
These are the projects I tried to use (some are Python 3, some Python 2):
https://github.com/abhinavsingh/proxy.py/blob/develop/proxy.py
https://github.com/inaz2/proxy2/blob/master/proxy2.py
https://github.com/inaz2/SimpleHTTPProxy - this one is an earlier version of the one above
https://github.com/FeeiCN/WebProxy
Abhinavsingh's Proxy (first in the list):
What I want to happen
I want the proxy to block sites based on both the request and the content that comes back. I also need the filter to live in a separate file and be generic, so I can apply it to every site and every request/response.
I'd like to understand where the correct place is to hook a filter into this proxy, and how to redirect or just send a block page back when the client tries to access sites with specific URLs, or when the response is a page that contains certain keywords.
What I tried
I set the proxy in Google Chrome's 'Open proxy settings' and executed the script. It looked pretty promising, and I noticed that I can insert a call to a filter function at line 383, inside the _process_request function, so that the filter can return another host to redirect to or simply block the request. It worked partially for me.
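Roughly, the kind of generic, separate-file filter I have in mind looks like this (a sketch only; the blocklists are made-up examples, and the proxy would call check_request before forwarding a request and check_response on the body it gets back, from whatever hook it exposes, e.g. _process_request in this proxy):

# filter_rules.py: a generic, proxy-agnostic filter (sketch).
# The lists below are made-up examples; adapt them and the hook points to the proxy you use.
BLOCKED_URL_KEYWORDS = ["ads.example.com", "tracker"]      # hypothetical URL fragments
BLOCKED_CONTENT_KEYWORDS = [b"casino", b"forbidden-word"]  # hypothetical body keywords

BLOCK_PAGE = (
    b"HTTP/1.1 403 Forbidden\r\n"
    b"Content-Type: text/html\r\n\r\n"
    b"<html><body><h1>Blocked by filter</h1></body></html>"
)

def check_request(url: str) -> bool:
    # True if the request URL should be blocked.
    return any(keyword in url for keyword in BLOCKED_URL_KEYWORDS)

def check_response(body: bytes) -> bool:
    # True if the response body contains a forbidden keyword.
    return any(keyword in body for keyword in BLOCKED_CONTENT_KEYWORDS)

When check_request returns True, the proxy would write BLOCK_PAGE back to the client socket instead of forwarding the request; check_response would be used the same way on the response side. For HTTPS the proxy only sees a CONNECT tunnel, so content filtering there needs TLS interception, which is the limitation described below.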
The problems
First of all, I couldn't fully redirect/block the sites. Sometimes it worked, sometimes it didn't.
Another problem I ran into was that I can't access the content of the returned site if it is served over HTTPS.
Also, how to filter the response was unfortunately not clear to me.
I also noticed that proxy2 (the second in the list) can solve the problem I had with filtering HTTPS page content, but I couldn't figure out how to make this feature work (and I think it requires Linux utilities anyway).
The process I described above is pretty much the one I tried with every proxy in the list. With some proxies, like proxy2.py, I couldn't understand at all what I needed to do.
If anybody has managed to build a filter on top of this proxy, or any other from the list, and can help me understand how to do it, I'd be grateful if you'd comment below.
Thank you.
I want to scrape the following website:
http://www.vaarweginformatie.nl/fdd/main/hydro/overview?contentId=HYDRO_WATERSTANDEN&pub=Hydro+waterstanden+-+rapport&detailPub=Hydro+waterstanden+-+detail&menuId=WATERSTANDEN/ajax
The information I need is in the table, and it is provided via an AJAX response that I can't seem to capture. I read that Selenium might work but is also unstable, so I don't think that solution is great, since I want to set up a real-time connection.
Are there other possibilities that I can try?
I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs-up and thumbs-down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the thing I have to look for, namely an em tag under the div class 'ugccmt-rate'. However, I'm not able to find it in my Python program. In trying to track down the root of the problem, I viewed the source of the page, and it seems that this tag is not there. Do you know how I should approach this problem? Does this have something to do with the JavaScript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files in a script, such as checking the browser (user-agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is impossible to get around.)
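As a sketch of that approach (the URI here is hypothetical; use whatever endpoint the Web Console shows the page requesting for the comment ratings):

import requests

# Hypothetical endpoint: replace with the actual URI observed in the browser's network console.
url = "http://news.yahoo.com/_xhr/contentcomments/get_comments/"

headers = {
    # Present the request as coming from a regular browser visiting the article.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html",
}

resp = requests.get(url, headers=headers)
resp.raise_for_status()
data = resp.json()  # such endpoints usually return JSON, which Python parses natively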
I asked a question about this a month ago; it's here: "post" method to communicate directly with a server.
I still haven't figured out why I sometimes get a 404 error and sometimes everything works fine; I've tried the same code with several different WordPress blogs. Using Firefox or IE, you can post the comment without any problem on whatever WordPress blog it is, but using Python and a POST request sent directly to the server I got a 404 with several blogs. I've tried spoofing the headers and adding cookies in the code, but the result remains the same. It's been bugging me for quite a while... Does anybody know the reason, or what code I should add to make the program work just like a browser such as Firefox or IE? Hopefully you guys can help me out!
You should use something like mechanize.
The blog may have some spam protection against this kind of posting. (A programmatic POST made without accessing/reading the page first can easily be detected using JavaScript-based protection.)
But if that's the case, I'm surprised you receive a 404...
Anyway, if you want to simulate a real browser, the best way is to use a real browser remote-controlled from Python.
Check out WebDriver (http://seleniumhq.org/docs/09_webdriver.html). It has a Python implementation and can drive HtmlUnit, Chrome, IE and Firefox browsers.
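A minimal sketch with the Python bindings (the blog URL is a placeholder, and the field IDs author/email/comment/submit are the WordPress defaults, so check them against the actual comment form):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # Chrome or IE also work
driver.get("http://example-blog.wordpress.com/some-post/")  # placeholder post URL

# Default WordPress comment form fields; verify the IDs on the actual blog.
driver.find_element(By.ID, "author").send_keys("My Name")
driver.find_element(By.ID, "email").send_keys("me@example.com")
driver.find_element(By.ID, "comment").send_keys("Nice post!")
driver.find_element(By.ID, "submit").click()

driver.quit()

Because the comment goes through the rendered page, any JavaScript-based spam checks run exactly as they would for a human visitor, which is the point of driving a real browser here.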