Python Selenium Wire Webdriver not utilizing Page Load Strategy - python

Edit: There are other questions addressing the ability to interact with pages that aren't fully loaded. THIS IS NOT THAT. This is specific to the SeleniumWire driver, not just Selenium Webdriver.
I'm currently working on a project using Selenium with Chromedriver in Python 3.8 that requires manipulating a page that takes a very long time to load. As such, I'm using the 'eager' page load strategy (options.page_load_strategy = 'eager') so that I can manipulate certain elements of the page before it has fully loaded.
I set up a test that measures the time from the browser being created until an element can be clicked (effectively measuring how long the page takes to load to the point where a constant button is clickable). With the regular Selenium webdriver, 15 runs gave an average time of 0.7352 seconds. With the SeleniumWire webdriver (the only change being which webdriver was used), the average over 15 runs was 4.3745 seconds. That is on par with running the test on the regular Selenium webdriver with the 'normal' (default) page load strategy, which averaged 4.3900 seconds over 15 runs.
I therefore believe that SeleniumWire is not applying the page load strategy, and I'm looking for possible solutions. How can I make sure that SeleniumWire uses eager loading?

Related

Google returning different layouts for pagination

I am using Selenium and Chrome to search on Google, but it returns different layouts for the pagination. I am using different proxies, and different user agents via the fake_useragent library.
I only want the second image's layout. Does anybody know how I can get it every time?
First Image
Second Image
The issue was that the fake_useragent library sometimes returned old user agents, even after I updated its database. I tried the latest-user-agents library (https://pypi.org/project/latest-user-agents/) instead, and it returns newer user agents.
Here is the working code.
from latest_user_agents import get_latest_user_agents
import random
from selenium import webdriver
PATH = r'C:\Program Files (x86)\chromedriver.exe'  # raw string so the backslashes are not treated as escapes
proxy = ''
url = ''
user_agent = random.choice(get_latest_user_agents())
options = webdriver.ChromeOptions()
options.add_argument(f'--proxy-server={proxy}')
options.add_argument(f'user-agent={user_agent}')
driver = webdriver.Chrome(PATH, options=options)
driver.get(url)
The difference between the two layouts: when JavaScript is disabled, Google shows the pagination in the first image's layout.
To make sure you get the second layout every time, you need to ensure JavaScript is enabled.
If you create your driver options with options = webdriver.ChromeOptions(), the following keeps JavaScript enabled:
options.add_argument("--enable-javascript")
Edit based on OP's comment
I got it working by using the latest_user_agents library. The fake_useragent library was returning old user-agents sometimes. That's why it was showing the old layout.
Installing the latest_user_agents library: https://pypi.org/project/latest-user-agents/
Don't try to automate Google and Google products with automation tools: Google changes the web elements and layout of their pages every day.
For multiple reasons, logging into sites like Gmail and Facebook using WebDriver is not recommended. Aside from being against the usage terms for these sites (where you risk having the account shut down), it is slow and unreliable.
The ideal practice is to use the APIs that email providers offer, or in the case of Facebook the developer tools service which exposes an API for creating test accounts, friends, and so forth. Although using an API might seem like a bit of extra hard work, you will be paid back in speed, reliability, and stability. The API is also unlikely to change, whereas webpages and HTML locators change often and require you to update your test framework.
Logging in to third-party sites using WebDriver at any point of your test increases the risk of your test failing because it makes your test longer. A general rule of thumb is that longer tests are more fragile and unreliable.
WebDriver implementations that are W3C conformant also annotate the navigator object with a WebDriver property so that Denial of Service attacks can be mitigated.

IE webdriver of Selenium Python loads the webpage and goes to a halt state

Hey there Python Experts,
I have used Beautiful Soup and Requests to scrape data from static pages for my project, but for dynamic content I am unable to do the same, so I installed Selenium. However, when I execute the code below, it hangs after opening the browser; I can see only '1test' in my output window.
Please help :)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Ie(executable_path='IEDriverServer_Win32_3.150.1/IEDriverServer.exe')
print('1test')
driver.get('http://www.google.com')
print('2test')
driver.close()
print('3test')
driver.quit()
You need to temporarily pause the execution of the program by importing the time module and writing
time.sleep(<number of seconds>)
after every step. (Note that there is no driver.sleep() method in Selenium's Python bindings.)
This is needed because the different elements on a webpage take time to load; if they are accessed before they have fully loaded, Selenium will not be able to find them and will raise an error. Selenium's implicit and explicit waits are a more robust alternative to fixed sleeps.

Selenium + Webdriver return same information as first get after repeated gets

Repeated calls to the same URL using Python Selenium and a webdriver (geckodriver or chromedriver) return the proper information the first time the program is run, but each successive run, even after a wait of 60 seconds and even after a reboot, still returns the information from the first get of the URL.
Below is the code for the start of a program to scrape the odds at a racetrack every minute. The sleep in the code is less for testing purposes.
import time
import os
from selenium import webdriver
#url = "https://www.drf.com/live_odds/winodds/track/DED/USA/3/D"
#url = "https://www.drf.com/live_odds/winodds/track/TAM/USA/10/D"
#url = "https://www.drf.com/live_odds/winodds/track/AUS-MNG/AUS/5/D"
#url = "https://www.drf.com/live_odds/winodds/track/AUS-AUC/AUS/2/D"
url = "https://www.drf.com/live_odds/winodds/track/SA/USA/5/D"
driver = webdriver.Chrome()
driver.get(url)
driver.refresh()
time.sleep(50)
#url = "https://www.drf.com/live_odds/winodds/track/AQU/USA/7/D"
#url = "https://www.drf.com/live_odds/winodds/track/LA/USA/4/D"
driver.close()
driver.quit()
# os.system('killall "Chrome"')
At first I thought the problem was in my requests, so I switched to Selenium with geckodriver and later chromedriver. Then it worked: the first time I got the URL, the proper information was returned. The second time I used the same URL and did a get (the odds will eventually change), I still got the results from the first get. Even when I run the program again, I still get the same results as the first get. But if I run Chrome without Selenium and go to the same URL, I get the properly updated odds. Also, running Chrome without Selenium, the odds are displayed horizontally across the page, while they are displayed in a column when I run Selenium with chromedriver. I know there are often compatibility issues, but I downloaded Selenium and the drivers within the last two or three weeks.
This program will be far from accurate if you run it when SA (Santa Anita Racetrack) is not running; in this code it is race number 5. You can easily change the race number to match the current race, and you can change the track by going to www.drf.com, going to Entries, and clicking on Live Odds. There you will see a list of tracks; click on one and you will see the appropriate URL. Paste that into the program as the new URL. Again, the first screen returned is correct, but if you run the program again and again you will only get the results from the first screen, not the new odds that you get when running Chrome without Selenium. Is there some reference stuck in a buffer, or is the website trying to prohibit continuous scraping of the odds? I also tried rebooting, but still got the old results when running with Selenium and chromedriver again.
Again, if I run just Chrome, I get the new updates. Could this mean some reference to the original request is saved on disk, since everything in memory would be erased by a reboot? Could this involve a socket reference?

Is it advisable to speed up scraping using selenium by starting multiple webdrivers?

I have over 19,000 links which I need to visit to scrape data from. Each takes about 5 seconds to fully load, which means that I will need slightly more than 26 hours to scrape everything on a single webdriver.
To me, it seems that a solution is simply to start another webdriver (or a few others) in a separate Python notebook, each going through a different portion of the links in parallel, e.g.:
In first iPython notebook:
from selenium import webdriver
driver1 = webdriver.Firefox()
... scraping code looping over links 0-9500 using driver1...
In second iPython notebook:
from selenium import webdriver
driver2 = webdriver.Firefox()
... scraping code looping over links 9501-19000 using driver2...
I'm fairly new to scraping so this question may be completely elementary/ridiculous(?). However, I've tried searching for this and haven't seen anything on the topic, so I would appreciate any advice on this matter. Or any recommendations for a better/more correct way to implement this.
I've heard of multi-threading using the thread module (http://www.tutorialspoint.com/python/python_multithreading.htm), but wonder whether implementing it in this manner would have any advantage over simply creating multiple webdrivers as in the aforementioned code.
Do you really need Selenium to do this?
Check out Scrapy: with that framework you can easily send lots of requests and scrape the data. Selenium is only useful when you need real browser automation.
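If you do stick with Selenium, the multiple-driver idea from the question can be sketched with a small thread pool, one driver per worker. The scrape logic and worker count below are placeholders, and webdriver instances must never be shared between threads:

```python
# Sketch: split the links across N_WORKERS workers, one webdriver per worker.
# scrape_one() is a placeholder for the actual scraping logic; N_WORKERS is a
# guess - tune it to your machine and bandwidth.
from concurrent.futures import ThreadPoolExecutor

N_WORKERS = 4

def chunk(items, n):
    """Split items into n roughly equal contiguous chunks."""
    size, rem = divmod(len(items), n)
    out, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        out.append(items[start:end])
        start = end
    return out

def scrape_chunk(links):
    # Each worker owns its own driver; webdrivers are not thread-safe.
    # from selenium import webdriver
    # driver = webdriver.Firefox()
    results = []
    for link in links:
        results.append(link)  # placeholder for: scrape_one(driver, link)
    # driver.quit()
    return results

def scrape_all(links):
    with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
        parts = [c for c in chunk(links, N_WORKERS) if c]
        return [r for part in pool.map(scrape_chunk, parts) for r in part]
```

Note that more drivers means more memory and CPU per browser instance, so the speedup flattens out well before worker count matches link count.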

Python django: How to call selenium.set_speed() with django LiveServerTestCase

To run my functional tests i use LiveServerTestCase.
I want to call set_speed (and other methods, set_speed is just an example) that aren't in the webdriver, but are in the selenium object.
http://selenium.googlecode.com/git/docs/api/py/selenium/selenium.selenium.html#module-selenium.selenium
my subclass of LiveServerTestCase
from django.test import LiveServerTestCase  # import added for completeness
from selenium import webdriver

class SeleniumLiveServerTestCase(LiveServerTestCase):

    @classmethod
    def setUpClass(cls):
        cls.driver = webdriver.Firefox()
        cls.driver.implicitly_wait(7)
        cls.driver.maximize_window()
        # how to call selenium.selenium.set_speed() from here? how to get the ref to the selenium object?
        super(SeleniumLiveServerTestCase, cls).setUpClass()
How do I get that? I can't call the constructor on selenium, I think.
You don't. Setting the speed in WebDriver is not possible and the reason for this is that you generally shouldn't need to, and the 'waiting' is now done at a different level.
Previously, it was possible to tell Selenium: don't run this at normal speed, run it slower, to allow more things to become available on page load, for slow-loading or AJAX-heavy pages.
Now, you do away with that altogether. Example:
I have a login page, I login and once logged in I see a "Welcome" message. The problem is the Welcome message is not displayed instantly and is on a time delay (using jQuery).
Pre-WebDriver code would tell Selenium: run this test, but slow down so we can wait until the Welcome message appears.
Newer WebDriver code tells Selenium: run this test, but after logging in, wait up to 20 seconds for the Welcome message to appear, using an explicit wait.
Now, if you really want access to "set" Selenium's speed, first off I'd recommend against it but the solution would be to dive into the older, now deprecated code.
If you use WebDriver heavily already, you can use the WebDriverBackedSelenium which can give you access to the older Selenium methods, whilst keeping the WebDriver backing the same, therefore much of your code would stay the same.
https://groups.google.com/forum/#!topic/selenium-users/6E53jIIT0TE
Second option is to dive into the old Selenium code and use it, this will change a lot of your existing code (because it is before the "WebDriver" concept was born).
The code for both Selenium RC & WebDriverBackedSelenium lives here, for the curious:
https://code.google.com/p/selenium/source/browse/py/selenium/selenium.py
Something along the lines of:
from selenium import webdriver
from selenium import selenium
driver = webdriver.Firefox()
sel = selenium('localhost', 4444, '*webdriver', 'http://www.google.com')
sel.start(driver = driver)
You'd then get access to do this:
sel.set_speed(5000)
