PhantomJS / Splinter - issue with cache - Python

I have an EC2 Ubuntu instance where I have scheduled a script to run twice a day.
The script uses the Splinter Python lib with the PhantomJS headless browser to test some buttons and actions on my website.
I have just noticed that my t1.micro instance is getting slower and slower, to the point where my script no longer launches.
I ran du on the instance and found that PhantomJS is taking up a lot of space on my disk.
Can I remove those files?
How can I prevent these files from accumulating?
I can't find anything related to this in the Splinter or PhantomJS documentation.
Thanks!
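For what it's worth, on Linux PhantomJS keeps its local storage and cache under the home directory (e.g. ~/.local/share/Ofi Labs/PhantomJS) and GhostDriver leaves temporary files in /tmp; those are generally safe to delete when no PhantomJS process is running. Below is a minimal sketch, not a confirmed fix, of launching Splinter's PhantomJS browser with the disk cache disabled and local storage pointed at a disposable path, and making sure the browser is quit after every run (the flags come from PhantomJS's command-line options; the storage path is just an example):

from splinter import Browser

service_args = [
    '--disk-cache=false',                          # don't write a disk cache at all
    '--local-storage-path=/tmp/phantomjs-storage'  # example path, easy to clean up
]

browser = Browser('phantomjs', service_args=service_args)
try:
    browser.visit('https://example.com')
    # ... click buttons, run the checks ...
finally:
    # Quitting matters: otherwise each cron run leaves a phantomjs process
    # and its temporary files behind.
    browser.quit()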

Related

WinError 10061 Running Selenium+Chromedriver

I have a Python GUI application that uses Selenium and Chromedriver to crawl sites, interact with elements, download files, etc. The application has been packaged as a standalone .exe (produced using PyInstaller) and has performed well in tests across a few different Windows and Mac machines. However, on one machine it is producing WinError 10061, screenshot below:
A few other details:
The Web Crawler application appears to work fine and hit all targets when run in headless mode
Directly ahead of this error, the crawler successfully 1) opened the Chromedriver browser (outside of headless mode, so the webpage was visible) and 2) accessed the start URL and performed automated tasks on the page (i.e., filling out and completing a login page, clicking the 'Submit' button, refreshing the page). It's only when accessing subsequent URLs that Chromedriver quits and produces this error. I'm not sure why it would be able to successfully initiate the browser, get the start URL and perform tasks, but then fail upon getting another URL on the same site.
The URL it fails upon is https://econtent.hogrefe.com/toc/prx/current, but the error has been seen on completely different sites that similarly do not use the headless browser.
Any ideas as to what's happening here?
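WinError 10061 is Windows' "connection refused", i.e. the Python client could no longer reach the local chromedriver service, which usually means the chromedriver process died or was killed mid-run. As a debugging aid only (not a diagnosis), one way to narrow it down is to catch the failure around the navigation call and recreate the driver; the helper name below is just for illustration:

from selenium import webdriver
from selenium.common.exceptions import WebDriverException

def make_driver():
    # assumes chromedriver is on PATH / bundled next to the .exe
    return webdriver.Chrome()

driver = make_driver()
try:
    driver.get('https://econtent.hogrefe.com/toc/prx/current')
except WebDriverException as exc:
    # If the chromedriver process has died, its local port is closed and the
    # client surfaces that as WinError 10061 (connection refused).
    print('lost the driver connection:', exc)
    driver = make_driver()  # start a fresh session, then retry or log the URL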

How to install Selenium (Python) on an Apache web server?

I have an Apache server up and running with Python 3.x already installed on it. Right now I am trying to run a little Python program (let's say filename.py) ON the server. But this Python program uses the Chrome webdriver from Selenium. It also uses sleep from time (but I think that comes by default, so I figure it won't be a problem).
from selenium import webdriver
When I coded this program for the first time on my computer, not only did I have to write the line of code above, but I also had to manually download the webdriver for Chrome and place it in /usr/local/bin. Here is the link to the file in case you're wondering: Webdriver for Chrome
Anyway, I do not know what the equivalent steps are to configure this on my server. Do you have any idea how to do it? Or any concepts I could learn related to installing packages on an Apache server?
Simple solution:
You don't need to install the driver in /usr/local/bin. You can have the .exe anywhere and specify that with an executable path; see here for an example.
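For reference, a minimal sketch of pointing Selenium at a driver binary stored somewhere other than /usr/local/bin (the path below is just an example; in Selenium 4 the path is passed via a Service object instead):

from selenium import webdriver

# The chromedriver binary can live anywhere readable by the process that runs
# the script; just pass its location explicitly.
driver = webdriver.Chrome(executable_path='/home/user/drivers/chromedriver')
driver.get('https://example.com')
driver.quit()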
Solution for running on a server
If you have Python installed on the server, ideally >= 3.4, it comes with pip by default. Then install ChromeDriver on the standalone server by following the instructions here.
Note that Selenium always needs an instance of a browser to control.
Luckily, there are browsers out there that aren't as heavy as the usual browsers you know. You don't have to open IE / Firefox / Chrome / Opera. You can use HtmlUnitDriver, which controls HTMLUnit - a headless Java browser without any UI - or PhantomJSDriver, which drives PhantomJS - another headless browser running on WebKit.
Those headless browsers are much less memory-heavy and usually faster (since they don't have to render anything), they don't require a graphical interface to be available on the computer they run on, and they are therefore easily usable server-side.
Sample code of headless setup
from selenium import webdriver

op = webdriver.ChromeOptions()
op.add_argument('--headless')
driver = webdriver.Chrome(options=op)
It's also worth reading up on running Selenium RC; see here for that.

Selenium Firefox instances freeze

I have a process running to scrape some information in bulk. It uses Selenium and starts one Firefox instance per URL to be scraped.
Similar processes run on multiple instances (to split the workload). On one such instance, I now see processes freezing. They keep running for hours and hours. I have not put any conditional waits in my Selenium scripts. The other instances, which are also scraping the same website, are running fine.
Selenium version - 3.0.2
Geckodriver version - 0.13.0
Firefox version - Mozilla Firefox 50.1.0
I tried going through the geckodriver logs but could not make sense of anything. Any ideas on how to debug this further?
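One thing worth trying, offered as a sketch rather than a known fix: a page load that never finishes will block get() indefinitely, and Selenium lets you cap that per session so the script raises instead of hanging for hours (the timeout values below are arbitrary):

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()        # assumes geckodriver is on PATH
driver.set_page_load_timeout(60)    # give up on page loads after 60 s
driver.set_script_timeout(30)       # give up on async scripts after 30 s

try:
    driver.get('https://example.com/page-to-scrape')
except TimeoutException:
    # Log and move on instead of freezing the whole worker.
    print('page load timed out, skipping this URL')
finally:
    driver.quit()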

What would use less RAM and CPU on a Raspberry Pi: Selenium and Xvfb with IceWeasel, or Selenium with PhantomJS?

I am planning on running browser automation on my Raspberry Pi Model B; it will automate submitting forms and clicking buttons on webpages. I plan to control this from Python, as I currently have a working solution using the iMacros scripting feature to control Firefox on a Windows machine.
(Firefox uses uBlock, NoScript and Memory Fox to reduce RAM)
I want to know which would use the least amount of CPU and RAM. I know that I will have to use a precompiled PhantomJS binary, as it would take 2 days to compile. The alternative is to use Xvfb and PyVirtualDisplay to run IceWeasel/Firefox.
My bot needs to be able to log into a few websites (only one at a time) that use a captcha upon logging in (which I handle by saving a screenshot of the webpage and solving it manually) and email verification, and it needs to save the cookies so it does not have to log in manually each time (easy with IceWeasel or Firefox, not so easy in PhantomJS). The bot should be able to run for weeks without stopping, so I can't use anything with memory leaks, and I would like something that can deal with the internet going down.
I would also like to be able to tell whether a command I send to the browser completed successfully or not, e.g. with a try/except, or by the command returning an error code like it does with iMacros.
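Whichever browser ends up being lighter, the screenshot-for-captcha and cookie-persistence parts look roughly the same through Selenium's Python bindings. A rough sketch, with arbitrary file names and URL, assuming PhantomJS (swap in Firefox under Xvfb the same way):

import pickle
from selenium import webdriver

driver = webdriver.PhantomJS()   # or webdriver.Firefox() under Xvfb/PyVirtualDisplay
driver.get('https://example.com/login')

# Save a screenshot of the login page so the captcha can be solved by hand.
driver.save_screenshot('login.png')

# ... fill in the form, submit, complete email verification ...

# Persist cookies so later runs can skip the manual login.
with open('cookies.pkl', 'wb') as f:
    pickle.dump(driver.get_cookies(), f)

# On a later run, restore them after loading the domain once:
# driver.get('https://example.com')
# with open('cookies.pkl', 'rb') as f:
#     for cookie in pickle.load(f):
#         driver.add_cookie(cookie)

driver.quit()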

Running Selenium tests with IE using Python

The installation page here says to add the selenium server standalone jar to the CLASSPATH. What does the jar do? Do I need it? I ran some Selenium code already and it works without it. I just instantiated IE by doing
driver = webdriver.Ie()
I am running WebDriver (Selenium 2) in Python, trying to test IE9 (and then IE8 after that). (I'm not using .NET, just running a .py file.) Thanks!
That was the jar you needed to run with "Selenium 1". It is not necessary with Selenium 2 as far as I know. It may be used when doing remote testing (I haven't done that with Selenium 2), but it is definitely not needed for local testing.
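For local runs the Python bindings talk to the IE driver directly, so no jar is involved; the standalone jar only matters if you point the tests at a remote Selenium server. A sketch of both, with an example hub URL:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Local IE: no standalone jar needed (recent Selenium versions also expect
# IEDriverServer.exe on PATH).
driver = webdriver.Ie()
driver.get('https://example.com')
driver.quit()

# Remote IE: here the selenium-server-standalone jar runs on the hub machine.
remote = webdriver.Remote(
    command_executor='http://localhost:4444/wd/hub',
    desired_capabilities=DesiredCapabilities.INTERNETEXPLORER,
)
remote.quit()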
