I am using PhantomJS via Python through Selenium+Ghostdriver.
I want to load several pages simultaneously, so I am looking for an asynchronous way to load pages.
From my research, PhantomJS already lives in a separate thread and supports multiple tabs, so I believe the only missing piece of the puzzle is a method to load pages in a non-blocking way.
Any solution would be welcome, be it a simple Ghostdriver method I overlooked, bypassing Ghostdriver and interfacing directly with PhantomJS or a different headless browser.
Thanks for the help and suggestions.
Yuval
If you want to bypass GhostDriver, you can write your PhantomJS scripts directly in JavaScript or CoffeeScript. As far as I know, there is no way of doing this with the Selenium WebDriver except by using separate threads in the language of your choice (Python).
If you are not happy with that, there is CasperJS, which gives you more freedom in writing scripts than Selenium does, but you will only be able to use PhantomJS or SlimerJS.
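If you do stay with Selenium in Python, the thread-based approach can be sketched roughly like this; `fetch_page` and the example URLs are placeholders, and the sketch assumes one PhantomJS driver per worker, since WebDriver instances are not thread-safe:

```python
# Rough sketch: load several pages in parallel by giving each worker thread
# its own PhantomJS driver (WebDriver instances are not thread-safe).
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=4):
    """Apply `fetch` to every URL concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

def fetch_page(url):
    """Placeholder worker: open one page in its own PhantomJS instance."""
    from selenium import webdriver  # imported lazily so each worker pays its own cost
    driver = webdriver.PhantomJS()
    try:
        driver.get(url)  # blocks only this worker thread
        return driver.page_source
    finally:
        driver.quit()

# Usage: pages = fetch_all(["http://example.com", "http://example.org"], fetch_page)
```

Each `driver.get()` still blocks, but only within its own thread, so the pool gives you the non-blocking behaviour overall.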
I'm not completely sure on how to do this via Selenium/Ghostdriver specifically, but if you (or future readers) are able to touch the phantom scripts directly, then the solution is as simple as:
page.open(newUrl, function (status) { /* runs once the page has loaded */ });
The page.open() method is asynchronous by default and should serve your needs. Quite some time has passed since you asked this question, so I'm not sure whether you still need the help, but for those who read this later, I hope it helps!
What is the best method to scrape a dynamic website where most of the content is generated by what appear to be Ajax requests? I have previous experience with a Mechanize, BeautifulSoup, and Python combo, but I am up for something new.
--Edit--
For more detail: I'm trying to scrape the CNN primary database. There is a wealth of information there, but there doesn't appear to be an API.
The best solution that I found was to use Firebug to monitor XmlHttpRequests, and then to use a script to resend them.
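A rough Python sketch of that approach; the URL, parameters, and header below are placeholders, and the real values would be copied from the request Firebug captured:

```python
# Rough sketch: rebuild a captured XmlHttpRequest with the `requests` library.
import requests

def build_xhr(url, params=None):
    """Recreate the captured request; URL and params are whatever the monitor showed."""
    req = requests.Request(
        "GET", url, params=params,
        headers={"X-Requested-With": "XMLHttpRequest"},  # many ajax endpoints expect this
    )
    return req.prepare()

def replay_xhr(prepared):
    """Resend the rebuilt request and return the response body."""
    with requests.Session() as session:
        resp = session.send(prepared)
        resp.raise_for_status()
        return resp.text
```

Some endpoints also check cookies or a Referer header; those too can be copied from the captured request into the `headers` dict.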
This is a difficult problem because you either have to reverse engineer the JavaScript on a per-site basis, or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls).
It's a heavyweight solution, but I've seen people do this with GreaseMonkey scripts: let Firefox render everything and run the JavaScript, then scrape the elements. You can even initiate user actions on the page if needed.
Selenium IDE, a tool for testing, is something I've used for a lot of screen scraping. There are a few things it doesn't handle well (JavaScript window.alert() and popup windows in general), but it does its work on a page by actually triggering the click events and typing into the text boxes. Because the IDE portion runs in Firefox, you don't have to do all of the management of sessions, etc., as Firefox takes care of it. The IDE records and plays back tests.
It also exports C#, PHP, Java, etc. code to build compiled tests/scrapers that are executed on the Selenium server. I've done that for more than a few of my Selenium scripts, which makes things like storing the scraped data in a database much easier.
Scripts are fairly simple to write and alter, being made up of things like ("clickAndWait","submitButton"). Worth a look given what you're describing.
Adam Davis's advice is solid.
I would additionally suggest that you try to "reverse-engineer" what the JavaScript is doing, and instead of trying to scrape the page, you issue the HTTP requests that the JavaScript is issuing and interpret the results yourself (most likely in JSON format, nice and easy to parse). This strategy could be anything from trivial to a total nightmare, depending on the complexity of the JavaScript.
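A minimal sketch of that strategy, assuming the endpoint returns JSON; the URL and field names here are made up purely for illustration:

```python
# Minimal sketch: call the endpoint the page's JavaScript uses and interpret
# the JSON yourself. The URL and field names are hypothetical.
import requests

def fetch_records(url):
    """Fetch the endpoint and parse its JSON body into Python structures."""
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.json()  # a dict/list built straight from the JSON body

def delegate_totals(records):
    """Example interpretation step: sum a numeric field per candidate."""
    totals = {}
    for row in records:
        totals[row["candidate"]] = totals.get(row["candidate"], 0) + row["delegates"]
    return totals
```

Once the JSON is parsed, the "scraping" reduces to ordinary dict and list handling, which is the nice-and-easy case described above.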
The best possibility, of course, would be to convince the website's maintainers to implement a developer-friendly API. All the cool kids are doing it these days 8-) Of course, they might not want their data scraped in an automated fashion... in which case you can expect a cat-and-mouse game of making their page increasingly difficult to scrape :-(
There is a bit of a learning curve, but tools like Pamie (Python) or Watir (Ruby) will let you latch onto the IE web browser and get at the elements. This turns out to be easier than Mechanize and other HTTP-level tools, since you don't have to emulate the browser; you just ask the browser for the HTML elements. And it's going to be far easier than reverse-engineering the JavaScript/Ajax calls. If needed, you can also use tools like Beautiful Soup in conjunction with Pamie.
Probably the easiest way is to use the IE WebBrowser control in C# (or any other language). You have access to all the stuff inside the browser out of the box, and you don't need to care about cookies, SSL, and so on.
I found the IE WebBrowser control to have all kinds of quirks and needed workarounds that would justify some high-quality software, layered around the shdocvw.dll API and MSHTML, to take care of all those inconsistencies and provide a framework.
This seems like a pretty common problem. I wonder why no one has developed a programmatic browser. I'm envisioning a Firefox you can call from the command line with a URL as an argument: it loads the page, runs all of the initial page-load JS events, and saves the resulting file.
I mean, Firefox and other browsers already do all this; why can't we simply strip off the UI?
I have been wondering about this because I could really benefit from a program that performs actions on websites I use for my job, ones that require the same commands over and over again.
I know some python and I love to learn new things.
I tried looking for one on Google, but I guess I'm not sure how to search for it.
I would love it if you could direct me to a guide or something like that.
Thank you very much!
Selenium interacts with a web browser directly, although you can hide the browser window in code (look up running Selenium in --headless mode). This is a good choice for filling out a lot of forms or interacting with graphical user interface elements.
However, if you need to request information from websites, you don't always need to interact with the web browser directly. You can use the package called Requests. This doesn't depend on any web browsers and can run silently in the background.
I think you can do it with Python and some packages like Selenium. You will also need some HTML knowledge to search the HTML source code of the specific webpage.
I found an interesting use case, maybe that helps you:
https://towardsdatascience.com/controlling-the-web-with-python-6fceb22c5f08
I know about tampermonkey/greasemonkey and have used it a fair bit, but now my task is to write a program that runs in the background and automates mundane tasks (clicking buttons, typing into input fields etc.) on a specific webpage. Running a browser in the background takes too much RAM and processing power, so I'm looking for an alternative.
So far I've found Selenium, but after a bit of research it looks like it requires having a browser open at all times as well (or maybe not? the documentation isn't that good). I thought about Python scripts too, but I don't have any experience with those, nor do I have any idea whether they can handle anything that's not basic HTML. If they can, does anyone know of a good tutorial for such Python scripts? I used the language a few years ago, so I shouldn't really have a problem with Python itself.
If Python scripts aren't ideal either, is there a (preferably somewhat simple) way I could achieve what I want?
It depends on whether you want a script that interacts with a web UI or a script that automates web requests. Do you really need to click buttons and type into input fields? Presumably, the data from those buttons and input fields is eventually sent to a web server. You could skip the entire UI and just make the requests directly. You don't need a browser for that, and Python is fine for these sorts of things (you don't even need Selenium; you can just use Requests).
On the other hand, if you're trying to test out the UI of a web page, or you actually need to interact with the web UI for some other reason, then yeah, you'll need an application (like a web browser) that's capable of rendering the UI so you can interact with it.
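To make the first option concrete, here is a minimal sketch with Requests; the endpoint URL and field names are hypothetical, and in practice you'd copy them from the real form submission in your browser's network tab:

```python
# Minimal sketch: send the form's data straight to the server, no browser UI.
import requests

FORM_URL = "https://example.com/login"  # hypothetical endpoint

def build_payload(username, password):
    """Field names must match what the real form posts."""
    return {"username": username, "password": password}

def submit_form(username, password):
    with requests.Session() as session:  # a Session keeps cookies across calls
        resp = session.post(FORM_URL, data=build_payload(username, password))
        resp.raise_for_status()
        return resp
```

The Session object matters for "logged in" workflows: it carries the cookies from the login response into your later requests, the way a browser would.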
First of all - many thanks in advance. I really appreciate it all.
So I need to crawl a small number of URLs fairly constantly (around every hour) and fetch specific data.
A PHP site will be updated with the crawled data; I cannot change that.
I've read this solution: Best solution to host a crawler? which seems to be fine and has the upside of using cloud services if you want something to be scaled up.
I'm also aware of the existence of Scrapy.
Now, I wonder if there is a more complete solution to this without me having to set all these things up myself. The problem I'm trying to solve doesn't seem like a very unusual one, and I'd like to save time by using a more complete solution or instructions.
I would contact the person in this thread to get more specific help, but I can't. (https://stackoverflow.com/users/2335675/marcus-lind)
Currently running Windows on my personal machine, trying to mess with Scrapy is not the easiest thing, with installation problems and stuff like that.
Do you think there is no way of avoiding this specific work?
If there isn't, how do I know whether I should go with Python/Scrapy or Ruby on Rails, for example?
If the data you're trying to get are reasonably well structured, you could use a third party service like Kimono or import.io.
I find setting up a basic crawler in Python to be incredibly easy. After looking at a lot of options, including Scrapy (it didn't play well with my Windows machine either, due to the nightmare dependencies), I settled on using Selenium's Python package driven by PhantomJS for headless browsing.
Defining your crawling function would probably only take a handful of lines of code. This is a little rudimentary but if you wanted to do it super simply as a straight python script you could even do something like this and just let it run while some condition is true or until you kill the script.
from selenium import webdriver
import time

crawler = webdriver.PhantomJS()
crawler.set_window_size(1024, 768)

def crawl():
    crawler.get('http://www.url.com/')
    # Find your elements, get the contents, parse them using Selenium or BeautifulSoup

while True:
    crawl()
    time.sleep(3600)
I'm trying to find a way to dynamically decide which web browser will open the link I clicked.
There are a few sites I visit that work best in Internet Explorer and others that I prefer to open with Chrome. If I set my default browser to one of these, then I'll constantly find myself opening a site in one browser, then copying the URL and opening it in the other. This happens a lot when people send me links.
I've thought of making a Python script the default browser and writing a function that decides which browser should open the page. I've tried setting the script as my default browser by changing some registry keys. It seemed to work, but when I try to open a site (for example, by typing "http://stackoverflow.com" in the Run window), the URL doesn't show up in sys.argv.
Is there another way of finding the arguments sent to the program?
The registry keys I changed are:
HKEY_CURRENT_USER\Software\Classes\http\shell\open\command
HKEY_CURRENT_USER\Software\Classes\https\shell\open\command
HKEY_LOCAL_MACHINE\SOFTWARE\Classes\http\shell\open\command
HKEY_LOCAL_MACHINE\SOFTWARE\Classes\https\shell\open\command
It seemed to work on Windows XP, but it doesn't work on Windows 7 (the default browser is still the same...).
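For what it's worth, the dispatcher idea itself can be sketched like this; note the registry command needs to end in "%1" so Windows passes the clicked URL to the script as an argument. The browser paths and host rules below are placeholders:

```python
# Rough sketch: a tiny "default browser" that routes each URL to a real browser.
import subprocess
import sys
from urllib.parse import urlparse

# Placeholder paths and rules -- adjust to your machine.
IE = r"C:\Program Files\Internet Explorer\iexplore.exe"
CHROME = r"C:\Program Files\Google\Chrome\Application\chrome.exe"
IE_HOSTS = {"intranet.example.com"}  # sites that work best in IE

def pick_browser(url):
    """Choose a browser executable based on the URL's hostname."""
    host = urlparse(url).hostname or ""
    return IE if host in IE_HOSTS else CHROME

# Entry point: the "%1" in the registry command arrives as sys.argv[1], e.g.
#   subprocess.Popen([pick_browser(sys.argv[1]), sys.argv[1]])
```

If sys.argv stays empty, that's usually a sign the registered command line is missing the "%1" placeholder rather than a problem in the script itself.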
Have you considered using a browser extension that emulates IE rendering instead of a homegrown solution? I believe there is one called 'IE Tab' for Chrome/Firefox. http://www.ietab.net/
You could try building something on top of existing software that automates browser-webpage interaction. Have a look at Selenium; maybe you can tweak it somehow to suit your needs.
But beware: the problem you are trying to solve is fairly complex. For instance, consider just this: how are you going to translate your own subjective experience of a website into code? There are some objective indicators (some pages simply break), but many things, such as bad CSS styling, are difficult to assess and quantify.
EDIT: here's a web testing framework in which you can generate your own tests in Python. It's probably easier to start with than Selenium.