fetching an updated page after scrolling with selenium webdriver - python

I'm trying to scrape titles and links from a youtube search, using selenium webdriver, and I'm currently iterating over the process until a certain condition turns false. Though I can see the page scrolling when it's launched, the data I get only seems to be from the first page fetched, before scrolling a single time. How can I access the updated data after I've scrolled down?
This is some of my code:
driver.get(URL)
while condition:
    # extract data, check for the condition, and write to a CSV file
    driver.execute_script("window.scrollTo(0, 10000)")
    WebDriverWait(driver, 60)
    if iteration_terminating_condition:
        break

It depends on what you're using to extract the data. You can do this with Selenium, but if you're extracting lots of data it's probably not that efficient. Generally, Selenium should be used as a last resort for getting data you can't get through other means.
Consider the following alternatives for getting dynamic content.
API - YouTube does provide one and it is worth checking out. You could use the requests package with it, which is far more efficient than Selenium (a rough sketch follows after these options).
Re-engineering HTTP requests - This is based on the fact that JavaScript makes Asynchronous JavaScript and XML (AJAX) requests to display information on a page without refreshing it. If we can mimic those requests, we can grab the data we want. This applies to infinite scrolling, which is what the YouTube website does, but it can also be used for search forms etc. A request is made to the server and the response is then rendered on the page by JavaScript. This is also an efficient way to deal with dynamic content.
You could use Splash, which pre-renders pages and can execute JavaScript; it is somewhat more efficient than, say, Selenium.
Selenium, which you're attempting here. It is meant for automated testing and was never really intended for web scraping. That being said, if it's needed then it's needed. The downsides are that it is incredibly slow for lots of data and it can be quite brittle: if the server takes longer than expected to load a page before the next command executes, you can run into exceptions you don't want.
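For the API option above, a rough sketch with the requests package might look like this (it assumes you have created a key for the YouTube Data API v3 in the Google developer console; the endpoint and parameter names below come from that API and are not taken from your code):

import requests

API_KEY = "YOUR_API_KEY"  # placeholder; create your own key in the Google developer console
url = "https://www.googleapis.com/youtube/v3/search"
params = {
    "part": "snippet",
    "q": "web scraping tutorial",  # the search term
    "type": "video",
    "maxResults": 25,
    "key": API_KEY,
}

response = requests.get(url, params=params)
response.raise_for_status()

# Each item carries the video id and title in the snippet
for item in response.json().get("items", []):
    video_id = item["id"]["videoId"]
    title = item["snippet"]["title"]
    print(title, "https://www.youtube.com/watch?v=" + video_id)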
If you are going to use Selenium, my advice is to use as little of it as possible. That is, if the HTML page is updated when you scroll down, parse that HTML with, say, BeautifulSoup rather than using Selenium to grab the data you want. Every single time you use Selenium to extract data or scroll, you are issuing another command: Selenium works by sending HTTP requests over a connection between the webdriver and the browser driver (e.g. chromedriver), and browser activity is generated through those requests. So you can imagine that if you have many lines of code extracting data, the overhead becomes much greater.
You could re-read driver.page_source as you scroll, since it changes with each scroll attempt, and parse the data each time. The other option, which may make more sense, is to wait until the page stops producing new content when you scroll and then grab driver.page_source once, so you can parse the entire HTML with the data you want.
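As a rough sketch of that second approach (the search URL, the two-second sleep and the a#video-title selector below are assumptions and may need adjusting for the current YouTube markup):

from bs4 import BeautifulSoup
from selenium import webdriver
import time

URL = "https://www.youtube.com/results?search_query=web+scraping"  # your search URL
driver = webdriver.Chrome()
driver.get(URL)

last_height = driver.execute_script("return document.documentElement.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight)")
    time.sleep(2)  # crude wait; an explicit WebDriverWait on new results is more robust
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:
        break  # no new content loaded, stop scrolling
    last_height = new_height

# Parse the fully loaded page once with BeautifulSoup instead of many Selenium calls
soup = BeautifulSoup(driver.page_source, "html.parser")
for link in soup.select("a#video-title"):  # selector is an assumption about YouTube's markup
    print(link.get("title"), link.get("href"))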

Related

Python Webscraping Solution Recommendations required

I would like to know the best/preferred Python 3.x solution (fast to execute, easy to implement, with the option to specify a user agent, browser, version etc. to the web server to avoid my IP being blacklisted) which can scrape data in all of the scenarios below (listed by complexity as per my understanding).
Any static webpage with data in tables / divs
Dynamic webpage which completes loading in one go
Dynamic webpage which requires sign-in using username and password, and completes loading in one go after we log in.
Sample URL for username password: https://dashboard.janrain.com/signin?dest=http://janrain.com
Dynamic webpage which requires sign-in using OAuth from a popular service like LinkedIn, Google etc., and completes loading in one go after we log in. I understand this involves some page redirects, token handling etc.
Sample URL for oauth based logins: https://dashboard.janrain.com/signin?dest=http://janrain.com
All of bullet point 4 above, combined with selecting some drop-down (let's say "sort by date") or some check-boxes, based on which the displayed dynamic data changes.
I need to scrape the data after the check-box/drop-down actions have been performed, just as any user would do to change the displayed dynamic data.
Sample URL - https://careers.microsoft.com/us/en/search-results?rk=l-seattlearea
The page has a drop-down as well as some checkboxes.
Dynamic webpage with AJAX loading in which data can keep loading as:
=> 6.1 we keep scrolling down, like the Facebook, Twitter or LinkedIn main page, to get data
Sample URL - facebook, twitter, linkedin etc.
=> 6.2 or we keep clicking some button/div at the end of the AJAX container to get the next set of data;
Sample URL - https://www.linkedin.com/pulse/cost-climate-change-indian-railways-punctuality-more-editors-india-/
Here you have to click "Show Previous Comments" at the bottom of the page if you need to look & scrape all the comments
I want to learn & build one exhaustive scraping solution which can be tweaked to cater to everything from the easy task of bullet point 1 to the complex task of bullet point 6 above, as and when required.
I would recommend using BeautifulSoup for your problems 1 and 2.
For 3 and 5 you can use Selenium WebDriver (available as a Python library).
Using Selenium you can perform whatever operations you wish (e.g. logging in, changing drop-down values, navigating, etc.) and then access the page content via driver.page_source (you may need a sleep, or better an explicit wait, until the content has fully loaded).
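A minimal sketch of that flow, using the sample sign-in URL from the question (the element locators and the drop-down are placeholders I made up; inspect the real page to find the actual ones):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
import time

driver = webdriver.Firefox()
driver.get("https://dashboard.janrain.com/signin?dest=http://janrain.com")

# Locators below are placeholders; use your browser's inspector to find the real ones.
driver.find_element(By.NAME, "email").send_keys("you@example.com")
driver.find_element(By.NAME, "password").send_keys("secret")
driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
time.sleep(5)  # naive wait for the post-login page to finish loading

# Changing a drop-down (point 5 in the question), again with a placeholder locator.
Select(driver.find_element(By.ID, "sort")).select_by_visible_text("Sort by date")
time.sleep(2)

# Hand the rendered HTML to BeautifulSoup instead of extracting element by element.
soup = BeautifulSoup(driver.page_source, "html.parser")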
For 6 you can use the site's own API to get the list of news feeds and their links (the returned object usually includes the link to each feed item); once you have the links you can use BeautifulSoup to parse the content.
Note: Please read each website's terms and conditions before scraping, because some of them classify automated data collection as prohibited, which we should not engage in as professionals.
Scrapy is for you if you're looking for a truly scalable, bulletproof solution. In fact, the Scrapy framework is an industry standard for Python crawling tasks.
By the way, I'd suggest you avoid JS rendering: all that stuff (chromedriver, Selenium, PhantomJS) should be a last resort for crawling sites.
Most AJAX data can be fetched simply by forging the needed requests yourself.
Just spend more time in Chrome's "network" tab.
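To give a sense of what that looks like, a bare-bones spider might be (the start URL and CSS selectors are placeholders, not taken from the question):

import scrapy

class ListingSpider(scrapy.Spider):
    """Minimal spider: extract items from a listing page and follow pagination."""
    name = "listing"
    start_urls = ["https://example.com/listings"]  # placeholder start page

    def parse(self, response):
        # Selectors below are placeholders for whatever the target site actually uses.
        for row in response.css("div.listing"):
            yield {
                "title": row.css("a.title::text").get(),
                "link": response.urljoin(row.css("a.title::attr(href)").get()),
            }
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

You can run it with scrapy runspider spider.py -o items.csv, and Scrapy handles request scheduling, retries, throttling and export for you.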

How to approach web-scraping in python

I am new to Python and have just started with web-scraping. I have to scrape data from this realtor site.
I need to scrape all the details of real-estate agents according to their real-estate agency.
To do this in a web browser I have to follow these steps:
Go to this site
Click on the agency offices button, enter the 4000 pin in the search box and then submit.
Then we get a list of the agencies.
Go to the "our team" tab, and we get the agents there.
Then we have to go to each agent's page and record their information.
Can anyone tell me how to approach this? What's the best way to build this type of scraper?
Do I have to use Selenium for the interaction with the pages?
I have worked with requests, BeautifulSoup and simple form submission using mechanize.
For a search site like this I would recommend either Selenium or Requests with sessions. The advantage of Selenium is that it will probably work, but it will be slow. With Selenium you can just use the Selenium IDE (a Firefox add-on) to record what you do, then get the HTML from the webpage and use BeautifulSoup to parse the data.
If you want to scrape the data quickly and without using many resources, I usually use Requests with sessions. To scrape a website like this, open up a modern web browser (Firefox, Chrome) and use its network tools (usually found in the developer tools or via right-click > Inspect Element). Once you are recording the network traffic you can interact with the webpage to see the connections made to the server. In an example search they may use suggestions, e.g.
https://suggest.example.com.au/smart-suggest?query=4000&n=7&regions=false
The response will probably be JSON containing the suggested results. Once you select a suggestion you can just submit a request with those search parameters, e.g.
https://www.example.com.au/find-agent/agents/petrie-terrace-qld-4000
The URLs for the agents will then be in that HTML page; you just need to send a separate request to each one and extract the information with BeautifulSoup.
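A short sketch of that flow with Requests and BeautifulSoup (it reuses the placeholder domain from above; the agent-link selector is an assumption about the real markup):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})  # look like a normal browser

listing = session.get("https://www.example.com.au/find-agent/agents/petrie-terrace-qld-4000")
soup = BeautifulSoup(listing.text, "html.parser")

# "a.agent-profile" is a placeholder selector for the links to individual agent pages
agent_urls = [urljoin(listing.url, a["href"]) for a in soup.select("a.agent-profile")]

for url in agent_urls:
    agent_page = BeautifulSoup(session.get(url).text, "html.parser")
    # pull the name, phone number, etc. out of agent_page here
    print(url, agent_page.title.string if agent_page.title else "")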
You might want to give Node and jQuery a try. I used to use Python all the time, but it gets messy and hard to maintain after a while.
Using Node, you can turn the page HTML into a DOM object and then scrape all the data very easily using jQuery. I have done this for IMDb here: "Using jQuery & NodeJS to scrape the web" #asimmittal https://medium.com/#asimmittal/using-jquery-nodejs-to-scrape-the-web-9bb5d439413b
You can modify this to scrape Yelp.

Scrapy for dynamic content

Can we use Scrapy to get content from a web page that is loaded by JavaScript?
I'm trying to scrape usage examples from this page,
but since they are loaded via JavaScript as a JSON object, I'm not able to get them with Scrapy.
Could you suggest the best way to deal with this kind of issue?
Open your browser's developer tools and look at the Network tab. If you hit the "next" button on that page enough times, it sends out a new request.
After removing the JSONP parameter, the URL is pretty straightforward:
https://corpus.vocabulary.com/api/1.0/examples.json?query=unalienable&maxResults=24&startOffset=24&filter=0
By making the minimal number of requests, your spider will be fast.
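A rough sketch of paging through that endpoint with requests (the URL and parameters are taken from the request above, but the field names inside the JSON response are an assumption, so inspect one response before relying on them):

import requests

base = "https://corpus.vocabulary.com/api/1.0/examples.json"
params = {"query": "unalienable", "maxResults": 24, "startOffset": 0, "filter": 0}

while True:
    data = requests.get(base, params=params).json()
    sentences = data.get("result", {}).get("sentences", [])  # assumed field names
    if not sentences:
        break  # no more pages
    for sentence in sentences:
        print(sentence)
    params["startOffset"] += params["maxResults"]  # move to the next page of results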
If you want to just emulate a full browser and execute the JavaScript, you can use something like Selenium or Scrapinghub's Splash (and its corresponding Scrapy plugin).

Removing a part of HTML before loading a page in WebDriver - Selenium & Python

I have a script (inside <script></script> tags) that is executed every time I load a page. Is it possible to remove a WebElement before the page is loaded in the WebDriver, to prevent that script from executing?
I am thinking of something along the lines of:
Somehow get the raw HTML code (perhaps the page source or a separate request), remove the offending part (with Selenium or a parser), "inject" the edited code back into Selenium (Firefox WebDriver or maybe PhantomJS), and finally execute it, for every page on that website.
Is it possible to do that, or is this impossible by design?
If you install selenium-requests, you can make a GET request for the page, process the HTML that comes back, and then place it in the tab.
It might be tricky to insert the processed result, since you will likely also need to set the current browser URL to match (simply inserting it will cause issues with cross-domain loading of scripts, relative paths, etc.); perhaps there is a way of overriding (or allowing an override of) the GET response that Selenium receives with the pre-processed information.
Selenium-Requests makes a request using the requests library that uses the running webdriver's cookies for that domain and emulates the default HTTP headers sent by that webdriver. The result is a low-level HTTP request and response created with the webdriver's state. This is needed because the Selenium interface is very high-level, and doing much more than opening pages and exploring the DOM is not really natively possible in Python.
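A short sketch of how that might be wired together (selenium-requests exposes a request method on the normal webdriver classes; the URLs and the document.write trick are illustrative only):

from seleniumrequests import Firefox

# selenium-requests adds a .request() method to the usual webdriver classes.
driver = Firefox()

# Fetch the raw page with the webdriver's cookies/headers, without rendering it.
response = driver.request("GET", "https://example.com/page-with-inline-script")
html = response.text

# ... strip the offending <script> block from `html` here (e.g. with a parser) ...

# Getting the edited HTML back into the browser is the tricky part described above;
# one (imperfect) option is to write it into the current document:
driver.get("https://example.com/")  # navigate first so relative URLs roughly resolve
driver.execute_script(
    "document.open(); document.write(arguments[0]); document.close();", html
)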

Selenium: What functions would fire request?

I am new to Selenium and web applications. Please bear with me for a second if my question seems way too obvious. Here is my story.
I have written a scraper in Python that uses the Selenium 2.0 WebDriver to crawl AJAX web pages. One of the biggest challenges (and an ethical one) is that I do not want to overload the website's server. Therefore I need a way to monitor the number of requests my webdriver is firing on each page parsed.
I have done some Googling. It seems like only Selenium RC provides such functionality. However, I do not want to rewrite my code just for this reason. As a compromise, I decided to limit the rate of method calls that could lead to the headless browser firing requests at the server.
In the script, I have the following kind of method calls:
driver.find_element_by_XXXX()
driver.execute_script()
webElement.get_attribute()
webElement.text
I use the second function to scroll to the bottom of the window and get the AJAX content, like the following:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
Based on my intuition, only the second call will trigger requests, since the others appear to just parse existing HTML content.
Is my intuition wrong?
Many many thanks
Perhaps I should elaborate more. I am automating a process of crawling a website in Python. There is a substantial amount of work done, and the script runs without major bugs.
My colleagues, however, reminded me that if, in the process of crawling a page, I make too many requests for the AJAX list within a short time, I may get banned by the server. This is why I started looking for a way to monitor the number of requests I am firing from my headless PhantomJS browser in the script.
Since I cannot find a way to monitor the number of requests in the script, I made the compromise mentioned above.
Therefore I need a way to monitor the number of requests my webdriver
is firing on each page parsed
As far as I know, the number of requests depends on the webpage's design, i.e. the resources used by the page and the requests made by JavaScript/AJAX. WebDriver opens a browser and loads the webpage just like a normal user would.
In Chrome, you can check the requests and responses using the Developer Tools panel. You can refer to this post; the current UI of Developer Tools is different, but the basic functions are the same. Alternatively, you can use the Firebug plugin in Firefox.
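If you do want to count requests from the script itself rather than by eye, one Chrome-specific option (assuming a reasonably recent Selenium and chromedriver rather than PhantomJS) is to read Chrome's performance log through the webdriver:

import json
from selenium import webdriver

options = webdriver.ChromeOptions()
# Ask chromedriver to capture DevTools network events in the "performance" log.
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
driver = webdriver.Chrome(options=options)

driver.get("https://example.com")

events = [json.loads(entry["message"])["message"] for entry in driver.get_log("performance")]
sent = [e for e in events if e["method"] == "Network.requestWillBeSent"]
print(f"{len(sent)} requests fired while loading the page")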
Updated:
Another method to check the requests and responses is by using Wireshark. Please refer to these Wireshark filters.
