How to read data from dynamic website faster in selenium - python

I'm scraping a few dynamic websites (football live bets). There's no API, so I'm reading all of them with Selenium. I run an infinite loop and find the elements on every iteration:
while True:
    elements = self.driver.find_elements_by_xpath(games_path)
    for e in elements:
        match = Match()
        match.betting_opened = len(e.find_elements_by_class_name('no_betting_odds')) == 0
The problem is it's one hundred times slower than I need it to be.
What's the alternative to this? Any other library or how to speed it up with Selenium?
One of the websites I'm scraping: https://www.betcris.pl/zaklady-live#/Soccer

The piece of code you shared has a while True loop without a break, which is by definition an infinite loop. From such a short snippet I can't tell whether this is the root cause of your "infinite loop" issue, but it may well be, so check whether you have any break statements inside your while loop.
As for the other part of your question: I'm not sure how you measure the performance of an infinite loop, but there is a way to speed up parsing pages with Selenium: not using Selenium. Grab a snapshot of the page and use that to evaluate states, values and so on.
import lxml.html
page_snapshot = lxml.html.document_fromstring(self.driver.page_source)
games = page_snapshot.xpath(games_path)
This approach is about two orders of magnitude faster than querying via the Selenium API. Grab the page once, parse the hell out of it real quick and grab the page again later if you want to. If you just want to read stuff, you don't need WebElements at all, only the tree of data. To interact with elements you'll of course still need a WebElement from Selenium, but to get values and states, a snapshot may be sufficient.
Or what you could do with Selenium only: add the 'no_betting_odds' condition to the games_path XPath. It seems to me that you want to grab those elements which do not have a 'no_betting_odds' class. Then just append './/*[not(contains(@class, "no_betting_odds"))]' to games_path (which you did not share, so I can't update it for you).
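For completeness, here is a minimal sketch of how the snapshot idea could slot into your loop; games_path and the Match class are your own names and are treated as placeholders here:

import time
import lxml.html

while True:
    # one round-trip to the browser per iteration instead of one per element
    page_snapshot = lxml.html.document_fromstring(self.driver.page_source)
    for game in page_snapshot.xpath(games_path):
        match = Match()
        # betting is open when the subtree has no 'no_betting_odds' node
        match.betting_opened = len(game.xpath('.//*[contains(@class, "no_betting_odds")]')) == 0
    time.sleep(1)  # small pause so the page has time to change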

Related

How can I write Python selenium test for PrimeFaces?

I am trying to write a Selenium test, but I have learned that the page is generated with PrimeFaces, so the element IDs change randomly from time to time. Locating elements without IDs doesn't seem very reliable. Is there anything I can do?
Not having meaningful stable IDs is not a problem, as there are always alternative ways to locate elements on a page. Just to name a few options:
partial id matches with XPath or CSS, e.g.:
# contains
driver.find_element_by_css_selector("span[id*=customer]")
driver.find_element_by_xpath("//span[contains(#id, 'customer')]")
# starts with
driver.find_element_by_css_selector("span[id^=customer]")
driver.find_element_by_xpath("//span[starts-with(#id, 'customer')]")
# ends with
driver.find_element_by_css_selector("span[id$=customer]")
classes which convey valuable information about the data they mark up ("data-oriented locators"):
driver.find_element_by_css_selector(".price")
driver.find_element_by_class_name("price")
going sideways from a label:
# <label>Price</label><span id="65123safg12">10.00</span>
driver.find_element_by_xpath("//label[.='Price']/following-sibling::span")
links by link text or partial link text:
driver.find_element_by_link_text("Information")
driver.find_element_by_partial_link_text("more")
And, you can, of course, get creative and combine them. There are more:
Locating Elements
There is also this relevant thread which goes over best practices when choosing a method to locate an element on a page:
What makes a good selenium locator?
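As a hedged illustration of combining these strategies: the container class, id prefix and label text below are invented for the example, not taken from PrimeFaces:

# CSS: a span whose id starts with "customer", scoped to a container with a known class
driver.find_element_by_css_selector("div.billing span[id^=customer]")
# XPath: the same idea, anchored on a label instead of the generated id
driver.find_element_by_xpath("//div[contains(@class, 'billing')]//label[.='Price']/following-sibling::span")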

Scraping text values using Selenium with Python

For each vendor in an ERP system (total # of vendors = 800+), I am collecting its data and exporting this information as a pdf file. I used Selenium with Python, created a class called Scraper, and defined multiple functions to automate this task. The function, gather_vendors, is responsible for scraping and does this by extracting text values from tag elements.
Every vendor has a section called EFT Manager. EFT Manager has 9 rows I am extracting from:
Rows #2 and #3 both have string values (confidential info is crossed out), but #3 returns null. I don't understand why #3 onward returns null when there are text values to be extracted.
The format of code for each element is the same.
I tried switching frames but that did not work. I tried scraping from edit mode and that didn't work either. I was curious if anyone has ever encountered a similar situation. It seems as though no matter what I do I can't scrape certain values… I'd appreciate any advice or insight into how I should proceed.
Thank you.
Why not try
driver.find_element_by_class_name("panelList").find_elements_by_tag_name('li')
to collect all of the li elements, and then use li.text to retrieve their text values? It's hard to tell what your actual output is beyond you saying it "returns null".
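A minimal sketch of that suggestion; "panelList" is the class name guessed above, so adjust it to the actual markup:

panel = driver.find_element_by_class_name("panelList")
for li in panel.find_elements_by_tag_name("li"):
    # .text returns the rendered text of each list item
    print(li.text)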
Try to use visibility_of_element_located instead of presence_of_element_located
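For example, with an explicit wait; the timeout and the element id here are placeholders:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
# blocks until the element is actually displayed, not merely present in the DOM
element = wait.until(EC.visibility_of_element_located((By.ID, "some_element_id")))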
Try to get the element's textContent with JavaScript (see: Given a (python) selenium WebElement can I get the innerText?):
element = driver.find_element_by_id('txtTemp_creditor_agent_bic')
text = driver.execute_script("return arguments[0].textContent", element)
The following is what worked for me:
Get rid of the try/except blocks.
Find elements via IDs (not XPath).
That allowed me to extract text from elements I couldn't extract from before.
You should switch to extracting the elements on the web page by ID, since each of them is given a distinct id. If you want to use XPath, try locating the elements through their visible text instead, e.g.:
//span[text()='Bank Name']

Selenium WebDriver Very Slow to Append WebElement Data to List

I'm trying to store webelement content to a python list. While it works, it's taking ~15min to process ~2,000 rows.
# Grab webelements via xpath
rowt = driver.find_elements_by_xpath("//tbody[@class='table-body']/tr/th[@class='listing-title']")
rowl = driver.find_elements_by_xpath("//tbody[@class='table-body']/tr/td[@class='listing-location']")
rowli = driver.find_elements_by_xpath("//tbody[@class='table-body']/tr/th/a")
title = []
location = []
link = []
# Add webElement strings to lists
print('Compiling list...')
[title.append(i.text) for i in rowt]
[location.append(i.text) for i in rowl]
[link.append(i.get_attribute('href')) for i in rowli]
Is there a faster way to do this?
Your solution parses the table three times: once for the titles, once for the locations, and once for the links.
Try parsing the table just once. Have a selector for the rows, loop through them, and for each row extract the 3 values using a relative path, e.g. for the link it would look like this:
link.append(row.find_element_by_xpath("./th/a").get_attribute('href'))
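Putting that together, a single-pass sketch could look like this; the XPaths are copied from your snippet, so adjust them if the page differs:

title, location, link = [], [], []
rows = driver.find_elements_by_xpath("//tbody[@class='table-body']/tr")
for row in rows:
    # one row lookup, then three relative lookups inside it
    title.append(row.find_element_by_xpath("./th[@class='listing-title']").text)
    location.append(row.find_element_by_xpath("./td[@class='listing-location']").text)
    link.append(row.find_element_by_xpath("./th/a").get_attribute('href'))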
Suggestions (apologies if they're not helpful):
Pandas can load HTML tables directly; if your intent is to scrape a table, a library like bs4 may also come in handy (see the sketch below).
You could also store the entire HTML and parse it with a regex, because all the data you are extracting is enclosed in a fixed set of HTML tags.
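A minimal sketch of the Pandas route, assuming the listings really live in a plain <table> element; which table index you need is a guess:

import pandas as pd

# read_html parses every <table> in the HTML into a DataFrame
tables = pd.read_html(driver.page_source)
df = tables[0]  # pick the table that holds the listings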
Depending on what you're trying to do, if the server that is presenting the page has an API, it would likely be significantly faster for you to use that to retrieve the data, rather than scraping the content from the page.
You could use the browser's developer tools to see what requests are being sent to the server; perhaps the data is being returned in JSON form, from which you can easily retrieve what you need.
This, of course, assumes that you're interested in the data, not in verifying the content of the page directly.
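If you do spot such a request in the browser's network tab, a rough sketch of calling it directly might look like this; the URL and field names are entirely hypothetical:

import requests

# hypothetical JSON endpoint discovered via the network tab
response = requests.get("https://example.com/api/listings")
for item in response.json():
    # the real field names depend on what the API actually returns
    print(item.get("title"), item.get("location"))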
I guess the slowest one is [location.append(i.text) for i in rowl].
When you call i.text, Selenium needs to determine what will be displayed in that element, so it needs more time to process.
You can use a workaround i.get_attribute('innerText') instead.
[location.append(i.get_attribute('innerText')) for i in rowl]
However, I can't guarantee that the result will be the same. (It should be the same as, or similar to, .text.)
I've tested this on my machines with ~2000 row, i.text took 80 sec. while i.get_attribute('innerText') took 28 sec.
Using bs4 would definitely help.
Even if you may have to find elements again using bs4, it was still faster to use bs4.
I'd like to suggest you try bs4.
I.e., code like this would work:
import bs4

soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
elements = soup.find_all(...)
for i in range(len(elements)):
    # do your job using elements[i]['target attribute']
    ...
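A more concrete, hedged version for the table in this question, reusing the class names from your XPaths (adjust them if the markup differs):

import bs4

soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
title, location, link = [], [], []
for row in soup.select("tbody.table-body tr"):
    title_cell = row.select_one("th.listing-title")
    location_cell = row.select_one("td.listing-location")
    anchor = row.select_one("th a")
    if title_cell and location_cell and anchor:
        title.append(title_cell.get_text(strip=True))
        location.append(location_cell.get_text(strip=True))
        link.append(anchor.get("href"))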

Possible bottle-neck issue in web-scraping with Python

First of all I apologize for the vague title, but the problem is that I'm not sure what is causing the error.
I'm using Python to extract some data from a website.
The code I created works perfectly when passing one link at a time, but it somehow breaks when trying to collect the data from the 8000 pages I have (it actually breaks way before that). The process I need to follow is this:
Collect all the links from one single page (8000 links)
From each link, extract another link contained in an iframe
Scrape the data from the link obtained in 2.
Point 1 is easy and works fine.
Points 2 and 3 work for a while and then I get some errors, every time at a different point and never the same one. After some tests, I decided to try a different approach and run my code up to point 2 on all the links from point 1, trying to collect all the links first. And at this point I found out that I probably get the error during this stage.
The code works like this: in a for loop I pass each item of a list of URLs to the function below. It's supposed to search for a link to the Disqus website; there should be only one such link and there always is one. Because a library like lxml cannot look inside an iframe, I use Selenium and ChromeDriver.
def get_url(webpage_url):
    chrome_driver_path = '/Applications/chromedriver'
    driver = webdriver.Chrome(chrome_driver_path)
    driver.get(webpage_url)
    iframes = driver.find_elements_by_tag_name("iframe")
    list_urls = []
    urls = []
    # collects all the urls of all the iframe tags
    for iframe in iframes:
        driver.switch_to_frame(iframe)
        time.sleep(3)
        list_urls.append(driver.current_url)
        driver.switch_to_default_content()
    driver.quit()
    for item in list_urls:
        if item.startswith('http://disqus'):
            urls.append(item)
    if len(urls) > 1:
        print "too many urls collected in iframes"
    else:
        url = urls[0]
    return url
At the beginning there was no time.sleep and it worked for roughly 30 links. Then I put in a time.sleep(2) and it got to about 60. Now with time.sleep(3) it works for around 130 links. Of course, this cannot be the solution. The error I get now is always the same (index out of range in url=urls[0]), but each time on a different link. If I rerun my code on just the link where it breaks, the code works, so it actually can find URLs there. And sometimes it gets past a link where it stopped before and works with no issue.
I suspect I get this because of some kind of time-out, but of course I'm not sure.
So, how can I figure out what the issue is here?
If the problem is that I'm making too many requests (even with the sleep), how can I deal with that?
Thank you.
From your description of the problem, it might be that the host throttles your client when you issue too many requests in a given time. This is a common protection against DoS attacks and ill-behaved robots - like yours.
The clean solution here is to check whether the site has a robots.txt file and, if so, parse it and respect its rules - otherwise, set a large enough wait time between two requests so you don't get kicked, as in the sketch below.
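A minimal sketch of the robots.txt check, using Python 2's robotparser to match the rest of the code here; the URLs are placeholders:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()
if rp.can_fetch("*", "http://example.com/some-article"):
    # allowed by robots.txt; still pause between requests to stay polite
    pass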
You can also run into quite a few other issues - 404s, lost network connections etc. - and even page-load timing issues with selenium.webdriver, as documented here:
Dependent on several factors, including the OS/Browser combination, WebDriver may or may not wait for the page to load. In some circumstances, WebDriver may return control before the page has finished, or even started, loading. To ensure robustness, you need to wait for the element(s) to exist in the page using Explicit and Implicit Waits.
With regard to your IndexError: you blindly assume that you'll get at least one url (which means at least one iframe), which might not be the case for any of the reasons above (and a few others too). First you want to make sure you properly handle all corner cases, then fix your code so it doesn't assume you have at least one url:
url = None
if len(urls) > 1:
    print "too many urls collected in iframes"
elif len(urls) == 1:
    url = urls[0]
else:
    print "no url found"
Also, if all you want is the first http://disqus url you can find, there's no need to collect them all, filter them, then return the first:
def get_url(webpage_url):
    chrome_driver_path = '/Applications/chromedriver'
    driver = webdriver.Chrome(chrome_driver_path)
    driver.get(webpage_url)
    iframes = driver.find_elements_by_tag_name("iframe")
    # walk the iframes and return the first disqus url found
    for iframe in iframes:
        driver.switch_to_frame(iframe)
        time.sleep(3)
        if driver.current_url.startswith('http://disqus'):
            url = driver.current_url
            driver.quit()
            return url
        driver.switch_to_default_content()
    driver.quit()
    return None  # nothing found

How can I iterate through the pages of a website using Python?

I'm new to software development, and I'm not sure how to go about this. I want to visit every page of a website and grab a specific bit of data from each one. My problem is, I don't know how to iterate through all of the existing pages without knowing the individual urls ahead of time. For example, I want to visit every page whose url starts with
"http://stackoverflow.com/questions/"
Is there a way to compile a list and then iterate through that, or is it possible to do this without creating a giant list of urls?
Try Scrapy.
It handles all of the crawling for you and lets you focus on processing the data, not extracting it. Instead of copy-pasting the code already in the tutorial, I'll leave it to you to read it.
To grab a specific bit of data from a web site you could use a web scraping tool, e.g., scrapy.
If the required data is generated by JavaScript, then you might need a browser-like tool such as Selenium WebDriver and implement the crawling of the links by hand.
For example, you can make a simple for loop, like this:
def webIterate():
    base_link = "http://stackoverflow.com/questions/"
    for i in xrange(24):
        print "%s%d" % (base_link, i)
The output will be:
http://stackoverflow.com/questions/0
http://stackoverflow.com/questions/1
http://stackoverflow.com/questions/2
...
http://stackoverflow.com/questions/23
It's just an example. You can pass in the question numbers and do whatever you want with them.
