I'm trying to store webelement content to a python list. While it works, it's taking ~15min to process ~2,000 rows.
# Grab webelements via xpath
rowt = driver.find_elements_by_xpath("//tbody[@class='table-body']/tr/th[@class='listing-title']")
rowl = driver.find_elements_by_xpath("//tbody[@class='table-body']/tr/td[@class='listing-location']")
rowli = driver.find_elements_by_xpath("//tbody[@class='table-body']/tr/th/a")
title = []
location = []
link = []
# Add webElement strings to lists
print('Compiling list...')
[title.append(i.text) for i in rowt]
[location.append(i.text) for i in rowl]
[link.append(i.get_attribute('href')) for i in rowli]
Is there a faster way to do this?
Your solution parses the table three times: once for the titles, once for the locations, and once for the links.
Try parsing the table just once. Have a selector for the row, then loop through the rows, and for each row, extract the 3 elements using a relative path, e.g. for the link, it would look like this:
link.append(row.find_element_by_xpath("./th/a").get_attribute('href'))
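If it helps to see the one-pass pattern end to end, here is the same idea as a runnable sketch using the standard library's ElementTree on a made-up snapshot of the question's table (with Selenium you would loop over the row elements and call row.find_element_by_xpath with the same relative paths):

```python
import xml.etree.ElementTree as ET

# Invented snapshot of the table from the question, for illustration only.
snippet = """
<tbody class="table-body">
  <tr>
    <th class="listing-title"><a href="/jobs/1">Engineer</a></th>
    <td class="listing-location">Berlin</td>
  </tr>
  <tr>
    <th class="listing-title"><a href="/jobs/2">Analyst</a></th>
    <td class="listing-location">Madrid</td>
  </tr>
</tbody>
"""

rows = ET.fromstring(snippet).findall("./tr")
titles, locations, links = [], [], []
for row in rows:  # a single pass over the rows
    anchor = row.find("./th/a")           # relative path, like "./th/a" in Selenium
    titles.append(anchor.text)
    locations.append(row.find("./td").text)
    links.append(anchor.get("href"))

print(titles, locations, links)
```

The point is that each row is visited once and the three values are pulled out together, instead of walking the whole table three times.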
Suggestions (apologies if it’s not helpful):
I think Pandas can load HTML tables directly (pandas.read_html). If your intent is to scrape a table, then a library like bs4 might also come in handy.
You could store the entire HTML and then parse it with a regex, because all the data you are extracting is enclosed in a fixed set of HTML tags.
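For what it's worth, if the markup really is that regular, the regex route over the stored page source could look like this sketch (the HTML fragment and pattern are invented for illustration; regexes become brittle as soon as attributes or nesting vary):

```python
import re

# Invented fragment standing in for the stored driver.page_source.
html = ('<th class="listing-title"><a href="/jobs/1">Engineer</a></th>'
        '<th class="listing-title"><a href="/jobs/2">Analyst</a></th>')

# Each (href, title) pair sits inside the same fixed set of tags,
# so one pattern pulls both capture groups out of the whole string.
pairs = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)
print(pairs)  # → [('/jobs/1', 'Engineer'), ('/jobs/2', 'Analyst')]
```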
Depending on what you're trying to do, if the server that is presenting the page has an API, it would likely be significantly faster for you to use that to retrieve the data, rather than scraping the content from the page.
You could use the browser tools to see what the different requests are being sent to the server, and perhaps the data is being returned in a JSON form that you can easily retrieve your data from.
This, of course, assumes that you're interested in the data, not in verifying the content of the page directly.
I guess the slowest one is [location.append(i.text) for i in rowl].
When you call i.text, Selenium needs to determine what will be displayed in that element, so it needs more time to process.
You can use a workaround i.get_attribute('innerText') instead.
[location.append(i.get_attribute('innerText')) for i in rowl]
However, I can't guarantee that the result will be the same (it should be the same as, or similar to, .text).
I've tested this on my machine with ~2,000 rows: i.text took 80 seconds, while i.get_attribute('innerText') took 28 seconds.
Using bs4 would definitely help.
Even though you have to locate the elements again with bs4, it is still faster overall, so I'd suggest you try it.
I.e., code like this would work:
soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
elements = soup.find_all(...)
for element in elements:
    # do some job using element['target attribute']
Related
I need a fast way of extracting the html-code for a specific table using Chromedriver with selenium in python. So far I have found that this option
table_data = webdriver.find_element_by_xpath("//table[@class='cell-table']").get_attribute('innerHTML')
is slightly faster than this option
table_data = webdriver.find_element_by_xpath("//table[@class='cell-table']").text
and both options give me the html-code I need. This option is significantly faster
table_data = webdriver.find_elements(By.XPATH, "//table[@class]/tbody/tr")
however, as far as I can tell, for each row in table_data it needs the following code to actually get access to the html-data:
for row in table_data:
    row.get_attribute('innerHTML')
Which is quite slow. It seems like it actually goes back to the browser to extract the html-code for each row?
Does anyone have suggestions on how to extract the html-code for a table in a faster way? Due to my setup I need to use Chromedriver.
First of all, your guess is correct. A Selenium WebElement object is just a reference, a pointer to the physical web element on the web page. So when you apply an action like row.get_attribute('innerHTML'), you pass Selenium the reference row; Selenium then accesses the web page, locates the physical element the reference points to, and retrieves its attribute.
So, code like this:
for row in table_data:
    row.get_attribute('innerHTML')
will actually access the web page at least len(table_data) times. And yes, this will take some time.
So, if you are looking for the fastest way, you should use
table_data = webdriver.find_element_by_xpath("//table[@class='cell-table']").get_attribute('innerHTML')
As you mentioned, this is slightly faster than
table_data = webdriver.find_element_by_xpath("//table[@class='cell-table']").text
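Combining the two points: make a single get_attribute('innerHTML') round trip for the whole table, then split the rows locally. A minimal sketch with the standard library's html.parser, run here on an invented innerHTML string:

```python
from html.parser import HTMLParser

# Invented stand-in for the string returned by a single
# get_attribute('innerHTML') call on the table element.
table_html = """
<tbody>
  <tr><td>Alice</td><td>London</td></tr>
  <tr><td>Bob</td><td>Paris</td></tr>
</tbody>
"""

class RowCollector(HTMLParser):
    """Collects the text of each <td>, grouped per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and data.strip():
            self.rows[-1].append(data.strip())

parser = RowCollector()
parser.feed(table_html)
print(parser.rows)  # → [['Alice', 'London'], ['Bob', 'Paris']]
```

This way the browser is hit once instead of len(table_data) times; everything after that is local string processing.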
I'm scraping a few dynamic websites (football live bets). There's no API, so I'm reading all of them with Selenium: I've got an infinite loop and I find the elements again on every iteration.
while True:
    elements = self.driver.find_elements_by_xpath(games_path)
    for e in elements:
        match = Match()
        match.betting_opened = len(e.find_elements_by_class_name('no_betting_odds')) == 0
The problem is it's one hundred times slower than I need it to be.
What's the alternative to this? Any other library or how to speed it up with Selenium?
One of websites I'm scraping https://www.betcris.pl/zaklady-live#/Soccer
The piece of code of yours has a while True loop without a break. That is an implementation of an infinite loop. From a short snippet I cannot tell whether this is the root cause of your "infinite loop" issue, but it may be, so check whether you have any break statements inside your while loop.
As for the other part of your question: I am not sure how you measure performance of an infinite loop, but there is a way to speed up parsing pages with selenium: not using selenium. Grab a snapshot from the page and use that for evaluating states, values and stuff.
import lxml.html
page_snapshot = lxml.html.document_fromstring(self.driver.page_source)
games = page_snapshot.xpath(games_path)
This approach is about two orders of magnitude faster than querying via the Selenium API. Grab the page once, parse it real quick, and grab the page again later if you want to. If you only want to read values, you don't need WebElements at all, just the tree of data. To interact with elements you will of course need the WebElement via Selenium, but to read values and states, a snapshot may be sufficient.
Or, with Selenium only: add the 'no_betting_odds' condition to the games_path XPath. It seems to me that you want to grab the elements which do not have a 'no_betting_odds' class, so just add .//*[not(contains(@class, "no_betting_odds"))] to games_path (which you did not share, so I can't update it).
Okay so.
The title might make it seem like this question has already been asked, but I had no luck finding an answer for it.
I need help with making link extracting program with python.
Actually, it works: it finds all <a> elements on a webpage, takes their href="" values, puts them in an array, and then exports the array to a CSV file. Which is what I want.
But I can't get a hold of one thing.
The website is dynamic so I am using the Selenium webdriver to get JavaScript results.
The code for the program is pretty simple. I open a website with webdriver and then get its content. Then I get all links with
results = driver.find_elements_by_tag_name('a')
Then I loop through results with for loop and get href with
result.get_attribute("href")
I store results in an array and then print them out.
But the problem is that I can't get the names of the links.
For example, for a link like <a href="https://google.com">This leads to Google</a>, is there any way to get the 'This leads to Google' string?
I need it for every link that is stored in an array.
Thank you for your time
UPDATE!!!!!
As it seems, it only gets dynamic links. I just noticed this, and it is really strange: for hard-coded links it returns an empty string, but for a dynamically added link it returns its name.
Okay. So. The answer is that instead of using .text you should use get_attribute("textContent"). It works better than get_attribute("innerHTML").
Thanks KunduK for this answer. You saved my day :)
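For readers without a live driver handy, the same href-plus-link-text extraction can be reproduced offline with the standard library's html.parser (markup invented for illustration; with Selenium this corresponds to get_attribute('href') plus get_attribute('textContent')):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.links.append((self._href, data.strip()))
            self._href = None

# Invented markup standing in for driver.page_source:
collector = LinkCollector()
collector.feed('<a href="https://google.com">This leads to Google</a>')
print(collector.links)  # → [('https://google.com', 'This leads to Google')]
```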
For each vendor in an ERP system (total # of vendors = 800+), I am collecting its data and exporting this information as a pdf file. I used Selenium with Python, created a class called Scraper, and defined multiple functions to automate this task. The function, gather_vendors, is responsible for scraping and does this by extracting text values from tag elements.
Every vendor has a section called EFT Manager. EFT Manager has 9 rows I am extracting from:
Rows #2 and #3 both have string values (I crossed out the confidential info), but #3 returns null. I don't understand why #3 onward returns null when there are text values to be extracted.
The format of code for each element is the same.
I tried switching frames but that did not work. I tried to scrape from edit mode and that didn’t work as well. I was curious if anyone ever encountered a similar situation. It seems as though no matter what I do I can’t scrape certain values… I’d appreciate any advice or insight into how I should proceed.
Thank you.
Why not try
find_element_by_class_name("panelList").find_elements_by_tag_name('li')
to collect all of the li elements, and then use li.text to retrieve their text values? It's hard to tell what your actual output is beyond you saying "returns null".
Try to use visibility_of_element_located instead of presence_of_element_located
Try to get textContent with JavaScript for the element, as in Given a (python) selenium WebElement can I get the innerText?:
element = driver.find_element_by_id('txtTemp_creditor_agent_bic')
text = driver.execute_script("return arguments[0].textContent", element)
The following is what worked for me:
Get rid of the try/except blocks.
Find elements via IDs (not XPath).
That allowed me to extract text from elements I couldn't extract from before.
You should switch to locating the elements on the page by ID, since each of them has a distinct ID. If you want to keep using XPaths, then try evaluating them via a JavaScript function instead.
E.g.
//span[text()='Bank Name']
I'm new to programming. I'm trying out my first Web Crawler program that will help me with my job. I'm trying to build a program that will scrape tr/td table data from a web page, but am having difficulties succeeding. Here is what I have so far:
import requests
from bs4 import BeautifulSoup
def start(url):
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code)
    for table_data in soup.find_all('td', {'class': 'sorting_1'}):
        print(table_data)

start('http://www.datatables.net/')
My goal is to print out each line and then export it to an excel file.
Thank you,
-Cire
My recommendation: if you are new to Python, play with things in an IPython notebook (interactive prompt) to get things working first and to get a feel for them before you try writing a script or a function. On the plus side, all variables stick around and it is much easier to see what is going on.
From the screenshot here, you can see immediately that the find_all function is not finding anything: an empty list [] is being returned. With IPython you can easily try other variants of a call on a previously defined variable, for example soup.find_all('td').
Looking at the source of http://www.datatables.net, I do not see any instances of the text sorting_1, so I wouldn't expect a search for all table cells of that class to return anything.
Perhaps that class appeared on a different URL associated with the DataTables website, in which case you would need to use that URL in your code. It's also possible that that class only appears after certain JavaScript has been run client-side (i.e. after certain actions with the sample tables, perhaps), and not on the initially loaded page.
I'd recommend starting with tags you know are on the initial page (seen by looking at the page source in your browser).
For example, currently, I can see a div with class="content". So the find_all code could be changed to the following:
for table_data in soup.find_all('div', {'class': 'content'}):
    print(table_data)
And that should find something.
Response to comments from OP:
The precise reason why you're not finding that tag/class pairing in this case is that DataTables renders the table client-side via JavaScript, generally after the DOM has finished loading (although it depends on the page and where the DataTables init code is placed). That means the HTML associated with the base URL does not contain this content. You can see this if you curl the base URL and look at the output.
However when loading it in a browser, once the JavaScript for DataTables fires, the table is rendered and the DOM is dynamically modified to add the table, including cells with the class for which you're looking.