For each vendor in an ERP system (800+ vendors in total), I am collecting its data and exporting the information as a PDF file. I used Selenium with Python, created a class called Scraper, and defined multiple functions to automate this task. The function gather_vendors is responsible for scraping, which it does by extracting text values from tag elements.
Every vendor has a section called EFT Manager. EFT Manager has 9 rows I am extracting from.
Rows #2 and #3 both have string values (confidential info is crossed out), but #3 returns null. I don't understand why #3 onward returns null when there are text values to be extracted.
The format of code for each element is the same.
I tried switching frames, but that did not work. I tried scraping from edit mode, and that didn't work either. I was curious whether anyone has ever encountered a similar situation. It seems that no matter what I do, I can't scrape certain values. I'd appreciate any advice or insight into how I should proceed.
Thank you.
Why not try using
find_element_by_class_name("panelList").find_elements_by_tag_name('li')
to collect all of the li elements, and then li.text to retrieve their text values? It's hard to tell what your actual output is beyond your saying it "returns null".
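A minimal sketch of that suggestion (the panelList class name comes from this answer; whether each row is an li under it is an assumption, and driver is the question's existing WebDriver):
# Grab the container by class, then iterate over its list items
panel = driver.find_element_by_class_name("panelList")
for row in panel.find_elements_by_tag_name("li"):
    print(row.text)  # an empty string here is likely what "returns null" means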
Try to use visibility_of_element_located instead of presence_of_element_located
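For example, with an explicit wait (the 10-second timeout is illustrative, and the ID is borrowed from the answer below):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Blocks until the element is actually displayed, not merely present in the DOM
element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.ID, 'txtTemp_creditor_agent_bic'))
)
print(element.text)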
Try to get textContent with JavaScript for the element (see: Given a (python) selenium WebElement can I get the innerText?):
element = driver.find_element_by_id('txtTemp_creditor_agent_bic')
# arguments[0] is the WebElement passed into execute_script
text = driver.execute_script("return arguments[0].textContent", element)
The following is what worked for me:
Get rid of the try/except blocks.
Find elements via IDs (not XPath).
That allowed me to extract text from elements I couldn't extract from before.
You should change the way you extract elements on the web page to IDs, since each field has its own ID. If you want to use XPaths, try a text-based one and read the value with the JavaScript approach above.
E.g.
//span[text()='Bank Name']
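A short sketch combining the two suggestions (driver is assumed from the question; the ID comes from the earlier answer, the XPath from this one):
# Locate the label span by its text, then read it via textContent
bank_name = driver.find_element_by_xpath("//span[text()='Bank Name']")
print(bank_name.get_attribute('textContent'))

# Or locate a field by ID and pull textContent via JavaScript
field = driver.find_element_by_id('txtTemp_creditor_agent_bic')
print(driver.execute_script("return arguments[0].textContent;", field))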
I'm trying to get the number in all the <b> tags on this website. I want every single "qid" (question id), so I think I have to use qids = driver.find_elements_by_tag_name("b"), and based on other questions I've found I also need a for loop and then print(qids.get_attribute("text")). But my code can't even seem to find the <b> elements, since I keep getting a NoSuchElementException. The appearance of the website leads me to believe the content I'm looking for is inside an iframe, but I'm not sure whether that affects my code.
Here's a screencap of the website for reference
The html isn't of much use because the tag is its only defining trait:
<b>13570etc...</b>
Any help is much appreciated.
You could try searching by XPath:
driver.find_elements_by_xpath("//b")
Where // means "find all matching elements regardless of where they are in the document/current scope." Check out the XPath syntax here and mess around with a few different options.
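Since you suspect an iframe, you will also need to switch into it before searching; a sketch (the frame locator below is an assumption, and driver is your existing WebDriver):
# Switch into the iframe first -- elements inside it are invisible
# to find_elements until you do
driver.switch_to.frame(driver.find_element_by_tag_name("iframe"))

qids = driver.find_elements_by_xpath("//b")
for qid in qids:
    print(qid.text)

driver.switch_to.default_content()  # return to the top-level document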
Okay, so: the title might make it seem like this question has already been asked, but I had no luck finding an answer for it.
I need help with a link-extracting program in Python.
Actually, it works. It finds all <a> elements on a webpage, takes their href="" values, and puts them in an array. Then it exports them to a CSV file, which is what I want.
But I can't get a hold of one thing.
The website is dynamic so I am using the Selenium webdriver to get JavaScript results.
The code for the program is pretty simple. I open a website with webdriver and then get its content. Then I get all links with
results = driver.find_elements_by_tag_name('a')
Then I loop through results with for loop and get href with
result.get_attribute("href")
I store results in an array and then print them out.
But the problem is that I can't get the name of a link, i.e. its visible text. For example, given
<a href="...">This leads to Google</a>
is there any way to get the 'This leads to Google' string? I need it for every link stored in the array.
Thank you for your time
UPDATE!!!!!
It seems it only gets names for dynamic links; I just noticed this, and it's really strange. For hard-coded items, it returns an empty string; for a dynamically added link, it returns the name.
Okay. So. The answer is that instead of using .text you should use get_attribute("textContent"). It works better than get_attribute("innerHTML").
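Applied to the loop described above, a minimal sketch:
results = driver.find_elements_by_tag_name('a')
links = []
for result in results:
    href = result.get_attribute("href")
    # .text comes back empty for elements Selenium treats as hidden;
    # textContent returns the markup text regardless of visibility
    name = result.get_attribute("textContent").strip()
    links.append((name, href))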
Thanks KunduK for this answer. You saved my day :)
Python and Selenium beginner here. I'm trying to scrape the titles of the sections of a Udemy class. I've tried using find_elements_by_class_name and others, but for some reason it only brings back partial data.
page I'm scraping: https://www.udemy.com/selenium-webdriver-with-python3/
1) I want to get the title of the sections. They are the bold titles.
2) I want to get the title of the subsections.
from selenium import webdriver
driver = webdriver.Chrome()
url = 'https://www.udemy.com/selenium-webdriver-with-python3/'
driver.get(url)
main_titles = driver.find_elements_by_class_name("lecture-title-text")
sub_titles = driver.find_elements_by_class_name("title")
Problem
1) Using main_titles, I got a length of only 10. It only goes from Introduction to Modules; Working With Files and everything after it doesn't come out, even though the class names are exactly the same. Modules / WorkingWithFiles is basically the cutoff point, and the elements in the inspector also look different from that point on. They all have the same span class tag, but only part of them is returned:
<span class="lecture-title-text">
(Screenshot: element inspection between the Modules title and the WorkingWithFiles title.)
At this point the web scrape breaks down, and I'm not sure why.
2) Using sub_titles, I got a length of 58 items, but when I print them out, I only get the top two:
Introduction
How to reach me anytime and ask questions? *** MUST WATCH ***
After this, it's all blank lines. Not sure why it's only pulling the top two and not the rest, when all the items have the same tag:
<div class='title'>
Maybe I could try using BeautifulSoup, but currently I'm trying to get better with Selenium. Is there dynamic content throwing off the Selenium scrape, or am I not scraping it in the proper way?
Thank you guys for the input. Sorry for the long post. I wanted to make sure I describe the problem correctly.
The reason you're only getting the first 10 sections is that only the first ten are shown. You might be logged in on your browser, so when you go to check it out, it shows every section; but for me and your scraper, it's only showing the first 10. You'll need to click that .section-container--more-sections button before looking for the titles.
As for the weird case of the titles not being scraped properly: when an element is hidden, its text attribute will always come back empty, which is why it only works for the first section. I'd try using WebElement.get_attribute('textContent') to scrape the text.
OK, I've gone through the suggestions in the comments and solved it. I'm writing it here in case anyone in the future wants to see the solution.
1) Following the suggestions, I made a command to click on '24 more sections' to expand the tab, and then scraped it, which worked perfectly!
driver.find_element_by_class_name("js-load-more").click()
titles = driver.find_elements_by_class_name("lecture-title-text")
for each in titles:
print (each.text)
This pulled all 34 section titles.
2) Using Matt's suggestion, I found the WebElement and used get_attribute('textContent') to pull out the text data. There were a bunch of spaces, so I used strip() to get the strings only.
sub_titles = driver.find_elements_by_class_name("title")
for each in sub_titles:
    print(each.get_attribute('textContent').strip())
This pulled all 210 subsection titles!
This is the first question I've posted, so do let me know if I should make it clearer. Furthermore, I've only just started out with Python, so I hope I can phrase the question with the correct terms.
Basically, I have created a customizable web scraper that relies on the user's knowledge of CSS selectors. Users first go to the website they want to scrape, jot down the CSS selectors ("AA") of their desired elements, and enter them in an Excel file; the Python script then reads those inputs, passes each through browser.find_elements_by_css_selector("AA"), and gets the relevant text through .text.encode('utf-8').
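A hypothetical sketch of that flow (the workbook name, sheet layout, and target URL are illustrative assumptions, not from the original):
from selenium import webdriver
import openpyxl

browser = webdriver.Chrome()
browser.get('https://example.com')  # placeholder target site

wb = openpyxl.load_workbook('selectors.xlsx')  # hypothetical input file
for row in wb.active.iter_rows(min_row=2, values_only=True):
    selector = row[0]  # the user-supplied "AA" selector
    for el in browser.find_elements_by_css_selector(selector):
        print(el.text.encode('utf-8'))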
However, I noticed that sometimes there might be important information in an attribute value that should be scraped. I've looked around and found that the suggestion is always to use .get_attribute().
1) Is there an alternative for getting attribute values using just browser.find_elements_by_css_selector("AA"), without browser.find_elements_by_css_selector("AA").get_attribute("BB")? Otherwise,
2) Is it possible for users to enter some value for "BB" in browser.find_elements_by_css_selector("AA").get_attribute("BB") such that only browser.find_elements_by_css_selector("AA") will run?
Yes, there is an alternative for retrieving text and attribute values without using the get_attribute() method. I am not sure whether that can be achieved through CSS, but through XPath it is possible. A couple of examples are as follows:
//h3[#class="lvtitle"]/a/text()
/*/book[1]/title/#lang
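One caveat: Selenium's find_elements_by_xpath can only return element nodes, so XPaths ending in text() or an attribute like @lang raise an InvalidSelectorException there. They do work in a standalone HTML/XML parser such as lxml; a sketch (reusing the browser driver assumed above):
from lxml import html

# Parse the page source Selenium has already rendered
tree = html.fromstring(browser.page_source)
titles = tree.xpath('//h3[@class="lvtitle"]/a/text()')  # a list of bare strings
# The second example targets an XML document; with lxml.etree it would be:
# etree.parse('books.xml').xpath('/*/book[1]/title/@lang')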
I'm trying to gather some data from a site using Python, Selenium, and XPath. There are multiple data points I want, and they are all in this structure:
/tr[1]/td
/tr[2]/td
/tr[3]/td
/tr[4]/td
I do not know how many <tr>'s there are, so I am trying to search in a way that just gives me all the results (hopefully in a list). How do I do that?
Here is my actual code, but it only gives me individual results. I'm new to web scraping and unsure whether the issue is with my XPath (am I not doing wildcards correctly?) or with my get_attribute call (if it's getting innerHTML, is it only getting it for the single entry?):
data = driver.find_element_by_xpath('//*[@id="a-stockFinancials_tabs"]/div[2]/div[1]/table/tbody/tr[5]/td').get_attribute("innerHTML")
print(data)
You should give find_elements_by_xpath a try. Dropping the [5] index from tr lets the XPath match the td in every row, and find_elements returns them all as a list. I think, without seeing your full HTML, that this would work:
data = driver.find_elements_by_xpath('//*[@id="a-stockFinancials_tabs"]/div[2]/div[1]/table/tbody/tr/td')
for element in data:
    print(element.get_attribute("innerHTML"))