Fastest way to extract html from Chromedriver - python

I need a fast way of extracting the html-code for a specific table using Chromedriver with selenium in python. So far I have found that this option
table_data = webdriver.find_element_by_xpath("//table[@class='cell-table']").get_attribute('innerHTML')
is slightly faster than this option
table_data = webdriver.find_element_by_xpath("//table[@class='cell-table']").text
and both options give me the html-code I need. This option is significantly faster
table_data = webdriver.find_elements(By.XPATH, "//table[@class]/tbody/tr")
however, as far as I can tell, for each row in table_data it needs the following code to actually get access to the html-data:
for row in table_data:
    row.get_attribute('innerHTML')
which is quite slow. It seems like it actually goes back to the browser to extract the html-code for each row, doesn't it?
Does anyone have suggestions on how to extract the html-code for a table in a faster way? Due to my setup I need to use Chromedriver.

First of all, your guess is correct. A Selenium WebElement object is just a reference, a pointer to the physical web element on the web page. So when you apply an action like row.get_attribute('innerHTML'), you pass Selenium the reference row; Selenium then accesses the web page, locates the physical web element that the reference points to, and retrieves its attribute.
So, code like this:
for row in table_data:
    row.get_attribute('innerHTML')
will actually access the web page at least len(table_data) times. And yes, this will take some time.
So, if you are looking for the fastest way, you need a single call that returns the whole table, e.g.
table_data = webdriver.find_element_by_xpath("//table[@class='cell-table']").get_attribute('innerHTML')
which, as you mentioned, is slightly faster than
table_data = webdriver.find_element_by_xpath("//table[@class='cell-table']").text
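If you then need the individual rows, you can parse that single string locally instead of asking the browser for each one. A minimal sketch, assuming the cell-table class from your question and that lxml is installed:
from lxml import html

# One round trip to the browser: fetch the whole table's HTML as a string.
table_html = webdriver.find_element_by_xpath(
    "//table[@class='cell-table']").get_attribute('innerHTML')

# Parse the rows locally; no further calls to the browser are needed.
table_tree = html.fromstring("<table>" + table_html + "</table>")
rows = [html.tostring(tr, encoding="unicode") for tr in table_tree.xpath(".//tr")]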

Related

Python - How to scrape a table from a website with a dropdown of available rows

I am trying to scrape the earnings calendar data from the table from zacks.com and the url is attached below.
https://www.zacks.com/stock/research/aapl/earnings-calendar
The thing is I am trying to scrape all data from the table, but it has a dropdown list to select 10, 25, 50 and 100 rows on a page. Ideally I want to scrape for all 100 rows but when I select 100 from the dropdown list, the url doesn't change. My code is below.
Note that the website blocks the default user agent, so I had to use ChromeDriver to impersonate a human visiting the site. The result of pd.read_html is a list of all the tables, and d[4] returns the earnings calendar with only 10 rows (which I want to change to 100).
driver = webdriver.Chrome('../files/chromedriver96')
symbol = 'AAPL'
url = 'https://www.zacks.com/stock/research/{}/earnings-calendar'.format(symbol)
driver.get(url)
content = driver.page_source
d = pd.read_html(content)
d[4]
So I'm calling for help from anyone who can guide me on this.
Thanks!
UPDATE: It looks like my last post was downvoted due to a lack of clear articulation and evidence of prior research. Maybe I am still a newbie to posting questions on this site. I have actually found several pages, including this one, with the same issue, but the solutions didn't seem to work for me, which is why I posted this as a new question.
UPDATE 12/05:
Thanks a lot for the advice. As commented below, I finally got it working. Below is the code I used
dropdown = driver.find_element_by_css_selector('#earnings_announcements_earnings_table_length')
time.sleep(1)
hundreds = dropdown.find_element_by_xpath(".//option[. = '100']")
hundreds.click()
Having taken a look, this is not going to be something that is easy to scrape. Given that the table is produced by JavaScript, I would say you have two options.
Option one:
Use Selenium to render the page, allowing the JavaScript to run. This way you can simply use the id/class of the dropdown to interact with it (see the sketch after this answer).
You can then scrape the data by looking at the values in the table.
Option two:
This is the more challenging one. Look through the requests the page makes and try to find the ones whose responses contain the data you then see on the page. By cross-referencing these, there will be a way to request the data you want directly.
You may find that to get at the data you want, you need to accept a key from the original request to the page and then send that key as part of a second request. This way you can scrape the data without having to run a Selenium instance, which will be more efficient.
My personal suggestion is to go with option one, as computer resources are cheap and developer time is expensive.
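For completeness, here is a minimal sketch of option one, based on the code in the question and the asker's update. It assumes the length control with id earnings_announcements_earnings_table_length contains a standard <select> element and that the earnings calendar is still the fifth table on the page:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome('../files/chromedriver96')
driver.get('https://www.zacks.com/stock/research/AAPL/earnings-calendar')

# Pick "100" from the rows-per-page dropdown (assumes it is a normal <select>).
select_el = driver.find_element_by_css_selector(
    '#earnings_announcements_earnings_table_length select')
Select(select_el).select_by_visible_text('100')

time.sleep(1)  # crude wait for the table to redraw; an explicit wait would be more robust
tables = pd.read_html(driver.page_source)
earnings = tables[4]  # index taken from the question; verify it still holds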

How to read data from a dynamic website faster in Selenium

I have a few dynamic websites (football live bets). There's no API, so I'm reading all of them in Selenium. I've got an infinite loop and I'm finding the elements every time.
while True:
    elements = self.driver.find_elements_by_xpath(games_path)
    for e in elements:
        match = Match()
        match.betting_opened = len(e.find_elements_by_class_name('no_betting_odds')) == 0
The problem is it's one hundred times slower than I need it to be.
What's the alternative to this? Any other library or how to speed it up with Selenium?
One of the websites I'm scraping: https://www.betcris.pl/zaklady-live#/Soccer
The piece of code of yours has a while True loop without a break, which is an implementation of an infinite loop. From a short snippet I cannot tell whether this is the root cause of your "infinite loop" issue, but it may be, so check whether you have any break statements inside your while loop.
As for the other part of your question: I am not sure how you measure performance of an infinite loop, but there is a way to speed up parsing pages with selenium: not using selenium. Grab a snapshot from the page and use that for evaluating states, values and stuff.
import lxml.html
page_snapshot = lxml.html.document_fromstring(self.driver.page_source)
games = page_snapshot.xpath(games_path)
This approach is about two orders of magnitude faster than querying via the Selenium API. Grab the page once, parse the hell out of it real quick, and grab the page again later if you want to. If you want to just read stuff, you don't need webelements at all, just the tree of data. To interact with elements you'll of course need the webelement with Selenium, but to get values and states, a snapshot may be sufficient.
Or, with Selenium only: add the 'no_betting_odds' filter to the games_path XPath. It seems to me that you want to grab those elements which do not have a 'no_betting_odds' class, so just append [not(.//*[contains(@class, "no_betting_odds")])] to the games_path (which you did not share, so I can't update it).
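Putting the two ideas together, a minimal sketch (games_path was not shared, so the XPath below is only a placeholder, and driver stands for the question's WebDriver instance):
import lxml.html

games_path = '//div[contains(@class, "game")]'  # placeholder; use your real games_path

# One call to the browser, then all evaluation happens on the local snapshot.
page_snapshot = lxml.html.document_fromstring(driver.page_source)
for game in page_snapshot.xpath(games_path):
    # Betting is open if the element has no descendant with the no_betting_odds class.
    betting_opened = not game.xpath('.//*[contains(@class, "no_betting_odds")]')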

Scraping text values using Selenium with Python

For each vendor in an ERP system (total # of vendors = 800+), I am collecting its data and exporting this information as a pdf file. I used Selenium with Python, created a class called Scraper, and defined multiple functions to automate this task. The function, gather_vendors, is responsible for scraping and does this by extracting text values from tag elements.
Every vendor has a section called EFT Manager. EFT Manager has 9 rows I am extracting from:
For #2 and #3, both have string values (crossed out confidential info). But, #3 returns null. I don’t understand why #3 onward returns null when there are text values to be extracted.
The format of code for each element is the same.
I tried switching frames, but that did not work. I tried to scrape from edit mode, and that didn't work either. I was curious if anyone has ever encountered a similar situation. It seems as though no matter what I do I can't scrape certain values… I'd appreciate any advice or insight into how I should proceed.
Thank you.
Why not try to use
find_element_by_class_name("panelList").find_elements_by_tag_name('li')
to collect all of the li elements, and use li.text to retrieve their text values. It's hard to tell what your actual output is beyond you saying it "returns null".
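A minimal sketch of that idea, assuming driver is your existing WebDriver and panelList is the class of the list container:
# Collect every <li> under the panel list and read its text in one pass.
items = driver.find_element_by_class_name("panelList").find_elements_by_tag_name("li")
values = [li.text for li in items]
print(values)  # shows whether the "null" values are empty strings or missing elements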
Try to use visibility_of_element_located instead of presence_of_element_located
Try to get textContent with JavaScript for the element, as in "Given a (python) selenium WebElement can I get the innerText?":
element = driver.find_element_by_id('txtTemp_creditor_agent_bic')
text = driver.execute_script("return arguments[0].textContent", element)
The following is what worked for me:
Get rid of the try/except blocks.
Find elements via IDs (not XPath).
That allowed me to extract text from elements I couldn't extract from before.
You should change the way you extract the elements on the web page to IDs, since all the elements have a distinct id provided. If you want to use XPaths, then you should try the JavaScript function to find them.
E.g.
//span[text()='Bank Name']
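A minimal sketch of that JavaScript route, evaluating the XPath in the page itself and returning textContent (which sidesteps Selenium's visibility handling):
xpath = "//span[text()='Bank Name']"
text = driver.execute_script(
    """
    var result = document.evaluate(arguments[0], document, null,
                                   XPathResult.FIRST_ORDERED_NODE_TYPE, null);
    var node = result.singleNodeValue;
    return node ? node.textContent : null;
    """,
    xpath)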

Selenium WebDriver Very Slow to Append WebElement Data to List

I'm trying to store webelement content to a python list. While it works, it's taking ~15min to process ~2,000 rows.
# Grab webelements via xpath
rowt = driver.find_elements_by_xpath("//tbody[#class='table-body']/tr/th[#class='listing-title']")
rowl = driver.find_elements_by_xpath("//tbody[#class='table-body']/tr/td[#class='listing-location']")
rowli = driver.find_elements_by_xpath("//tbody[#class='table-body']/tr/th/a")
title = []
location = []
link = []
# Add webElement strings to lists
print('Compiling list...')
[title.append(i.text) for i in rowt]
[location.append(i.text) for i in rowl]
[link.append(i.get_attribute('href')) for i in rowli]
Is there a faster way to do this?
Your solution is parsing through the table three times: once for the titles, once for the locations, and once for the links.
Try parsing the table just once. Have a selector for the row, then loop through the rows and, for each row, extract the 3 elements using a relative path; e.g. for the link, it would look like this:
link.append(row.find_element_by_xpath("./th/a").get_attribute('href'))
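A minimal sketch of that single-pass loop, using the selectors from the question (note that each per-row lookup still calls the browser, so combine this with the page-source parsing suggestions below if it is still too slow):
# Loop over the rows once and pull the three values from each row.
rows = driver.find_elements_by_xpath("//tbody[@class='table-body']/tr")

title, location, link = [], [], []
for row in rows:
    title.append(row.find_element_by_xpath("./th[@class='listing-title']").text)
    location.append(row.find_element_by_xpath("./td[@class='listing-location']").text)
    link.append(row.find_element_by_xpath("./th/a").get_attribute("href"))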
Suggestions (apologies if they're not helpful):
I think pandas can be used to load HTML tables directly (see the sketch after this list). If your intent is to scrape a table, then libraries like bs4 might also come in handy.
You can store the entire HTML and then parse it using a regex, because all the data you are extracting is going to be enclosed in a fixed set of HTML tags.
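A minimal sketch of the pandas route, assuming the rows live in a real <table> element:
import pandas as pd

# Parse every <table> in the rendered page in one call, with no per-element round trips.
tables = pd.read_html(driver.page_source)
df = tables[0]  # pick the table you need; the index depends on the page layout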
Depending on what you're trying to do, if the server that is presenting the page has an API, it would likely be significantly faster for you to use that to retrieve the data, rather than scraping the content from the page.
You could use the browser's developer tools to see what requests are being sent to the server; perhaps the data is being returned in a JSON form from which you can easily retrieve it.
This, of course, assumes that you're interested in the data, not in verifying the content of the page directly.
I guess the slowest one is [location.append(i.text) for i in rowl].
When you call i.text, Selenium needs to determine what will be displayed in that element, so it needs more time to process.
You can use a workaround i.get_attribute('innerText') instead.
[location.append(i.get_attribute('innerText')) for i in rowl]
However, I can't guarantee that the result will be the same. (It should be the same as or similar to .text.)
I've tested this on my machine with ~2,000 rows: i.text took 80 sec., while i.get_attribute('innerText') took 28 sec.
Using bs4 would definitely help. Even if you have to find the elements again with bs4, it was still faster in my tests, so I'd suggest you try it.
I.e., code like this would work (using the selectors from your question as an example):
import bs4

soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
elements = soup.find_all("th", class_="listing-title")
for i in range(len(elements)):
    # do some job using elements[i], e.g. its text or a target attribute
    print(elements[i].get_text())

xpath query on id //*[@id="page"] returns two elements

I'm trying to scrape the site ketabejam.ir.
I'm using Python 3.4.1, and for parsing I use lxml 3.4.1.
By the way, I parsed it with the lxml.html.fromstring method.
When I load the document in my interpreter and run the following query to get the number of pages (so I can handle pagination):
s = doc.xpath("//*[#id='page']")
surprisingly I get the result:
>>>len(s) == 2
True
I got the address of the element from Firebug's minimal XPath;
when I choose the normal XPath, the query runs smoothly.
Is it a bug, or am I doing something wrong?
You can work around this in general by always doing something like:
s = doc.xpath("(//*[#id='page'])[1]")
...if you know you really just want the first node that matches, and can safely ignore any subsequent ones (which seems like a safe bet in this case).
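A minimal sketch with lxml, matching the question's setup (doc is the parsed document):
# Take only the first node that matches the id, ignoring any duplicates.
matches = doc.xpath("(//*[@id='page'])[1]")
page_element = matches[0] if matches else None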
Looking at the page source for the page you linked, there are exactly two elements with that id in the page: most probably one at the top of the table and the other at the bottom of the table.
Firebug's "copy minimal XPath" option works based on the id of the element. It is only available for elements that have an id attribute, and it creates an XPath in the format
//*[@id="elementID"]
which is what you are getting.
Ideally, in every HTML page there should only be one element with a particular id; that is, ids should be unique across the page. It seems like Firebug's minimal XPath depends on that.
In your context, I think both elements return the same link, so you can use either to continue your scraping. Or, as you indicated, you can use the normal XPath for that.
