capture AJAX response - python

I want to scrape the following website:
http://www.vaarweginformatie.nl/fdd/main/hydro/overview?contentId=HYDRO_WATERSTANDEN&pub=Hydro+waterstanden+-+rapport&detailPub=Hydro+waterstanden+-+detail&menuId=WATERSTANDEN/ajax
The information that I need is in the table, which is filled in by an AJAX response, and I can't seem to capture that response. I've read that Selenium might work, but also that it is unstable, so I don't think that solution is a good fit since I want to set up a real-time connection.
Are there other possibilities that I can try?
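A common browserless approach is to find the XHR endpoint the table calls (browser DevTools, Network tab, XHR filter) and hit it directly with the requests library. A minimal sketch, assuming a hypothetical endpoint path; the real URL has to be read from the network tab while the table loads:

import requests

# Hypothetical endpoint -- replace with the XHR URL you observe in the
# browser's network tab while the table loads.
AJAX_URL = "http://www.vaarweginformatie.nl/fdd/main/hydro/EXAMPLE_AJAX_ENDPOINT"

session = requests.Session()
# Visit the overview page first so the session collects any cookies
# the AJAX endpoint may require.
session.get("http://www.vaarweginformatie.nl/fdd/main/hydro/overview",
            params={"contentId": "HYDRO_WATERSTANDEN"})

resp = session.get(AJAX_URL, headers={"X-Requested-With": "XMLHttpRequest"})
resp.raise_for_status()

# If the endpoint returns JSON, resp.json() gives you the table data
# directly; if it returns an HTML fragment, parse it with BeautifulSoup.
print(resp.text[:500])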

Related

How to scrape infinitely scrolling websites with login using python request (or similar)

I would like to scrape a website that does not have an API and is an "infinite scroller". I have been using Selenium for this, but now I need to scrape a lot more pages, all at once. The problem is that Selenium is very resource-hungry, since I am running a full (headless) Chrome browser in each instance, and it is also not stable at all (probably because of the limited resources, but still). I know there is a way to look for the AJAX requests that the site uses and access them with the requests library, but I have two issues:
I can't seem to find the desired request
The ones that I try to use with the requests library require the user to be logged in, and I have no idea how to do that (maybe pass cookies and whatnot; I am not a web developer).
Let me take Twitter as an example, since it is exactly what I am describing (except that it has an API). You have to log in, and then the feed loads infinitely. So the goal is to "scroll" and take the content of each tweet. How can this be done? If you can, please provide a working example.
Thank you.
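Not a working Twitter scraper, but here is the general shape of the cookie-based approach as a sketch: log in once through a requests.Session (which stores and resends the cookies for you), then replay the feed's paginated XHR. All URLs and field names below are hypothetical stand-ins for what you would find in the network tab:

import requests

session = requests.Session()

# Hypothetical URLs and field names -- take the real ones from the
# login form and the XHR requests you see in the browser's network tab.
LOGIN_URL = "https://example.com/login"
FEED_URL = "https://example.com/api/feed"

# Logging in through the session stores the authentication cookies,
# which are then sent automatically with every later request.
session.post(LOGIN_URL, data={"username": "...", "password": "..."})

cursor = None
for _ in range(10):  # fetch ten "scrolls" worth of items
    params = {"cursor": cursor} if cursor else {}
    data = session.get(FEED_URL, params=params).json()
    for item in data["items"]:
        print(item["text"])
    cursor = data.get("next_cursor")
    if not cursor:
        break  # no more pages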

Sending a post request with python to extract youtube comments from json

I am trying to build a comment scraper for YouTube comments. I already tried to build a scraper with Selenium, by opening a headless browser, scrolling down, opening comment responses and then extracting the data. But a headless browser is too slow, and the scrolling seems to be unreliable, because the number of scraped comments does not match the number of comments shown for each video. Maybe I did something wrong, but that is irrelevant to me right now: one of the main reasons why I would like to find another way is time: it is too slow.
I know there have been a lot of questions about scraping YouTube comments on Stack Overflow, but almost every answer I found suggested using some kind of headless browser, i.e. Selenium, to do the job. I don't like that, for the reasons mentioned above. Also, some references I found online suggest refraining from using Selenium for web scraping and instead trying reverse engineering, which means, as far as I understand, emulating the AJAX calls and getting the data, ideally as JSON.
When scrolling down a YouTube video (here's an example), the comments are loaded dynamically. If I have a look at the XHR activity, I get a lot of information.
Looking at the XHR activity, it seems like I can find all the information I need to make a request that returns the JSON data with the comments. But in practice I struggle to construct the right request to obtain that JSON file.
I read some tutorials online, which mostly offer simple examples that are easy to replicate, but none of them really helped me with my endeavour.
Can someone give me a hint and show me how to send the request in order to get the JSON file with the comments, using Python? I am using the requests library.
I know there is a YouTube API which can do the job, but I want to find out whether there is a way of doing the same thing without the API. I think it should be possible.
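YouTube's internal comment endpoint is undocumented and changes over time, so I can't give the exact URL and payload, but the general pattern for replaying a captured XHR with requests looks like this; the URL, payload and JSON paths below are placeholders to be copied from the XHR you see in DevTools:

import requests

# Placeholder values -- copy the real URL, headers and body from the
# XHR request DevTools shows when the comments load.
XHR_URL = "https://www.youtube.com/EXAMPLE_COMMENTS_ENDPOINT"
payload = {"continuation": "TOKEN_FROM_THE_PREVIOUS_RESPONSE"}
headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",
}

resp = requests.post(XHR_URL, json=payload, headers=headers)
resp.raise_for_status()
data = resp.json()

# Drill into the JSON structure you observed in DevTools to pull out
# the comment texts and the continuation token for the next page.
print(list(data.keys()))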

Python request get data through HTTPS tunnel

I am currently pulling data from a public data series at https://www3.bcb.gov.br/expectativas/publico/en/serieestatisticas
This is a public page that, I believe, uses Apache Wicket.
I am usually OK with scraping, whether GET or POST, but here my colleagues and I are stuck. Can anyone help me understand what URL needs to be used to actually make the request? Here's what I've got so far:
The form with inputs (screenshot omitted). The Fiddler capture of the manually executed request, text view:
form19_hf_0=&indicador=0&calculo=0&linhaPeriodicidade%3Aperiodicidade=0&tfDataInicial=11%2F10%2F2015&tfDataFinal=11%2F24%2F2015&divPeriodoRefereEstatisticas%3AgrupoAnoReferencia%3AanoReferenciaInicial=16&divPeriodoRefereEstatisticas%3AgrupoAnoReferencia%3AanoReferenciaFinal=16&btnCSV=Generate+CSV
Form data I'm passing in the request (screenshot omitted).
Summary:
I need some help; I can't seem to get the POST working correctly. It takes me to a different page, and I'm not sure how to work through this one.
NB: I'm trying to grab back a CSV.
The libraries I'm using are primarily Requests (I was going to use lxml, but I don't think it's going to be applicable here).
I've been trying to figure out the right form with Postman and Fiddler to understand what the request needs to be.
So, the solution to this was somewhat indirect. We were not able to do a straight POST, because the page incremented the actual POST URL in a way that was generally impossible to predict.
The solution that we used was installing the Selenium WebDriver and using it to select the dropdowns by their visible values and simulate the button clicks.
This worked out very cleanly.
Thanks and HTH anyone else who might have a similar problem.
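For reference, a minimal sketch of that Selenium approach; the field names come from the captured form body above, but the actual locators on the live page may differ:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()  # any WebDriver-backed browser works
driver.get("https://www3.bcb.gov.br/expectativas/publico/en/serieestatisticas")

# Field names taken from the captured form body; verify them against
# the live page, since Wicket ids can vary between requests.
Select(driver.find_element(By.NAME, "indicador")).select_by_index(0)
driver.find_element(By.NAME, "tfDataInicial").send_keys("11/10/2015")
driver.find_element(By.NAME, "tfDataFinal").send_keys("11/24/2015")
driver.find_element(By.NAME, "btnCSV").click()  # triggers the CSV download

driver.quit()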

transferring real time data from a website in python

I am programming in Python.
I would like to extract real time data from a webpage without refreshing it:
http://www.fxstreet.com/rates-charts/currency-rates/
I think the real-time data on the page is loaded via AJAX, but I am not quite sure.
I thought about driving a web browser from the program, but I do not really know/like that approach... Is there another way to do it?
I would like to fill a dictionary in my program (or even a SQL database) with the latest numbers each second.
Please help me do this in Python, thanks!
To get the data, you'll need to look through the JavaScript and HTML source to find the URL the page hits to fetch the data it displays. Then you can call that URL with urllib or your favorite Python library and parse the result.
Also, it may be easier if you use a plugin like Firebug that lets you watch the AJAX requests.
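As a sketch of that polling approach with the requests library; the endpoint URL and the JSON layout are placeholders, and the real ones come out of Firebug's network view:

import time
import requests

rates = {}  # latest numbers, keyed by currency pair

# Placeholder -- use Firebug (or the browser's network tab) to find the
# endpoint the page actually polls for its rate updates.
DATA_URL = "http://www.fxstreet.com/EXAMPLE_RATES_ENDPOINT"

while True:
    data = requests.get(DATA_URL).json()
    for row in data["rates"]:       # field names depend on the endpoint
        rates[row["pair"]] = row["last"]
    time.sleep(1)                   # one request per second, as requested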

Parsing lines from a live streaming website in Python

I'm trying to read in info that is constantly changing from a website.
For example, say I wanted to read in the artist name that is playing on an online radio site.
I can grab the current artist's name, but when the song changes, the HTML updates itself, and I've already opened the page via:
f = urllib.urlopen("SITE")
So I can't see the updated artist name for the new song.
Can I keep closing and opening the URL in a while(1) loop to get the updated HTML code or is there a better way to do this? Thanks!
You'll have to periodically re-download the website. Don't do it constantly, because that would be too hard on the server.
This is because HTTP, by nature, is not a streaming protocol. Once you connect to the server, it expects you to send an HTTP request, and it will then send back an HTTP response containing the page. If your connection is keep-alive (the default as of HTTP/1.1), you can send the same request again and get the page up to date.
What would I recommend? Depending on your needs, fetch the page every n seconds and extract the data you need. If the site provides an API, you can possibly capitalize on that. Also, if it's your own site, you might be able to implement comet-style AJAX over HTTP and get a true stream.
Also note that if it's someone else's page, the site may use AJAX via JavaScript to keep itself up to date; this means other requests are causing the updates, and you may need to dissect the website to figure out which requests you need to make to get the data.
If you use urllib2, you can read the headers when you make the request. If the server sends back a "304 Not Modified" status, then the content hasn't changed.
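One caveat: the server only answers 304 if you make the request conditional, e.g. by echoing back the ETag (or Last-Modified date) it gave you. A sketch with the requests library; the URL and polling interval are placeholders:

import time
import requests

URL = "http://example.com/nowplaying"  # placeholder page URL
etag = None

while True:
    # Sending the stored ETag back makes the request conditional:
    # the server replies 304 with no body if nothing has changed.
    headers = {"If-None-Match": etag} if etag else {}
    resp = requests.get(URL, headers=headers)
    if resp.status_code != 304:
        etag = resp.headers.get("ETag")
        print(resp.text[:200])  # parse the updated page here
    time.sleep(30)  # don't hammer the server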
Yes, this is the correct approach. To see changes on the web, you have to send a new query each time; live AJAX sites do exactly the same thing internally.
Some sites provide an additional API, including long polling. Look for documentation on the site, or ask their developers whether one exists.
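If a site does offer long polling, the client side is just a loop with a generous timeout; a minimal sketch, with a hypothetical endpoint:

import requests

# Hypothetical long-polling endpoint: the server holds the request open
# until new data arrives, then responds immediately.
URL = "http://example.com/api/updates"

while True:
    try:
        resp = requests.get(URL, timeout=90)  # generous timeout
        resp.raise_for_status()
        print(resp.json())  # handle the fresh data here
    except requests.exceptions.Timeout:
        continue  # no update within the window; reconnect and keep waiting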
