Extracting links from website with selenium bs4 and python - python

Okay so.
The heading might make it seem like this question has already been asked, but I had no luck finding an answer for it.
I need help with making a link-extracting program with Python.
Actually, it works: it finds all <a> elements on a webpage, takes their href="" values, puts them in an array, and then exports it to a CSV file, which is what I want.
But I can't get a hold of one thing.
The website is dynamic so I am using the Selenium webdriver to get JavaScript results.
The code for the program is pretty simple. I open a website with webdriver and then get its content. Then I get all links with
results = driver.find_elements_by_tag_name('a')
Then I loop through results with for loop and get href with
result.get_attribute("href")
I store results in an array and then print them out.
But the problem is that I can't get the names (the visible text) of the links. For a link like
<a href="https://www.google.com">This leads to Google</a>
is there any way to get the 'This leads to Google' string?
I need it for every link that is stored in an array.
Thank you for your time
UPDATE!
As it seems, it only gets dynamic links. I just noticed this, and it is really strange. For hard-coded items, it returns an empty string; for a dynamic link, it returns its name.

Okay. So. The answer is that instead of using .text you should use get_attribute("textContent"). It works better than get_attribute("innerHTML").
Thanks KunduK for this answer. You saved my day :)
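Putting the accepted fix together, a minimal sketch (the helper names and CSV layout are my own, not from the question) might look like this:

```python
# Sketch: collect each link's href and visible text, then export to CSV.
# Uses get_attribute('textContent') since .text is empty for hidden links.
import csv

def collect_links(driver):
    """Return (href, text) pairs for every <a> element the driver finds."""
    rows = []
    for a in driver.find_elements_by_tag_name('a'):
        href = a.get_attribute('href')
        # .text returns '' for hidden elements; 'textContent' still works
        name = a.get_attribute('textContent').strip()
        rows.append((href, name))
    return rows

def export_csv(rows, path):
    """Write the collected (href, name) pairs to a CSV file."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['href', 'name'])
        writer.writerows(rows)

# Usage (requires a live browser session):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get('https://example.com')
# export_csv(collect_links(driver), 'links.csv')
```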

Related

Find Element in Selenium (Python) not possible due to multiple html tags

Hi :) This is my first time here and I am new to programming.
I am currently trying to automate some work-steps, using Selenium.
There was no problem, mainly using the find_element(By.ID,'') function and clicking stuff.
But now I cannot find any element that comes after the second "html" tag on the site (see screenshot).
I tried to google this "multiple html" problem, but all I found was people saying it is not possible to have multiple html tags. I basically don't know anything about HTML, but this site seems to have more than one - there are actually three. And anything after the first one cannot be found by the find_element function. Please help me with this confusion.
These "multiple html" tags are due to the iframes in the HTML code. Each iframe has its own html document. If the selector you are using is meant to find something inside one of these iframes, you have to "move" your driver inside the iframe. You can find an example in this other question.
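As a hedged sketch of that "move into the iframe" step (the helper below is my own; it only assumes Selenium's switch_to API and the locator methods already used in this thread):

```python
# Sketch: elements inside an <iframe> live in a separate document, so the
# driver must switch into the frame before it can locate them.

def find_in_any_frame(driver, locate):
    """Try `locate(driver)` in the top document, then inside each iframe.

    `locate` is a callable that returns an element or raises if not found.
    """
    driver.switch_to.default_content()
    try:
        return locate(driver)
    except Exception:
        pass
    for frame in driver.find_elements_by_tag_name('iframe'):
        driver.switch_to.default_content()
        driver.switch_to.frame(frame)   # move the driver into this iframe
        try:
            return locate(driver)
        except Exception:
            continue
    raise LookupError('element not found in any frame')

# Usage (with a live driver):
# element = find_in_any_frame(driver, lambda d: d.find_element_by_id('my-id'))
```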

Scrapy can't see a list

I'm trying to crawl a specific page of a website (https://www.johnlewis.com/jaeger-wool-check-knit-shift-dress-navy-check/p3767291) to get used to Scrapy and its features. However, I can't get Scrapy to see the 'li' that contains the thumbnail images on the carousel. My parse Function currently looks as follows:
def parse(self, response):
    for item in response.css('li.thumbnail-slide'):
        # The for loop works for li.size-small-item
        print("We have a match!")
No matter what, Scrapy isn't "seeing" the li. I've tried viewing the page in a Scrapy shell to check that Scrapy could see the images, and they do show up in the response (so I'm assuming Scrapy can definitely see the list/images in the list). I've tried alternative lists and got a different list to work (as per the comment in the code).
My only thought is that the carousel may be loaded with JavaScript/AJAX, but I can't be too sure. I do know that the list class changes for the selected image from "li.thumbnail-slide" to "li.thumbnail-slide thumbnail-slide-active"; however, I've tried the following in my script to no avail:
li.thumbnail-slide
li.thumbnail-slide-active
li.thumbnail-slide.thumbnail-slide-active
li.thumbnail-slide thumbnail-slide-active
Nothing works.
Does anyone have any suggestions on what I may be doing wrong? Or suggest any further reading that may help?
Thanks in advance!
Your assumption is correct: the elements are there, but not exactly where you think they are.
To easily check whether an element is part of the response HTML and not loaded by JavaScript, I normally recommend using a browser plugin to disable JavaScript.
If you want the images, they are still part of the html response, you can get them with:
response.css('li.product-images__item')
the main image appears separately:
response.css('meta[itemprop=image]::attr(content)')
Hope that helps you.
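For reference, a minimal parse method using those two selectors might look like this (the yielded field names are my assumptions, not from the answer):

```python
# Sketch of a Scrapy parse method built on the selectors from this answer.
def parse(self, response):
    # Thumbnails live under a different class than the JS carousel suggests
    thumbnails = response.css('li.product-images__item').getall()
    # The main image is exposed separately via a meta tag
    main_image = response.css('meta[itemprop=image]::attr(content)').get()
    yield {'main_image': main_image, 'thumbnail_count': len(thumbnails)}
```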

Scraping text values using Selenium with Python

For each vendor in an ERP system (total # of vendors = 800+), I am collecting its data and exporting this information as a pdf file. I used Selenium with Python, created a class called Scraper, and defined multiple functions to automate this task. The function, gather_vendors, is responsible for scraping and does this by extracting text values from tag elements.
Every vendor has a section called EFT Manager. EFT Manager has 9 rows I am extracting from:
For #2 and #3, both have string values (confidential info is crossed out), but #3 returns null. I don't understand why #3 onward returns null when there are text values to be extracted.
The format of code for each element is the same.
I tried switching frames but that did not work. I tried to scrape from edit mode and that didn’t work as well. I was curious if anyone ever encountered a similar situation. It seems as though no matter what I do I can’t scrape certain values… I’d appreciate any advice or insight into how I should proceed.
Thank you.
Why not try
find_element_by_class_name("panelList").find_elements_by_tag_name('li')
to collect all of the li elements, and then use li.text to retrieve their text values? It's hard to tell what your actual output is beyond you saying it "returns null".
Try to use visibility_of_element_located instead of presence_of_element_located
Try to get textContent with JavaScript for the element (see: Given a (python) selenium WebElement can I get the innerText?):
element = driver.find_element_by_id('txtTemp_creditor_agent_bic')
text = driver.execute_script("return arguments[0].textContent", element)
The following is what worked for me:
Get rid of the try/except blocks.
Find elements via ID's (not xpath).
That allowed me to extract text from elements I couldn't extract from before.
You should change the way you extract the elements on the web page to IDs, since all of the fields have different IDs provided. If you want to use XPaths, you should try locating them through a JavaScript function instead.
E.g.
//span[text()='Bank Name']
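Combining the suggestions in this thread, one hedged pattern (the helper is mine; it only assumes Selenium's execute_script and the element's text attribute) is to fall back to textContent whenever .text comes back empty:

```python
# Sketch: prefer the element's .text, but fall back to textContent via
# JavaScript when Selenium reports an empty string (e.g. hidden elements).

def element_text(driver, element):
    """Return the element's visible text, or its textContent if hidden."""
    text = element.text
    if not text:
        # arguments[0] is how execute_script receives the passed element
        text = driver.execute_script('return arguments[0].textContent',
                                     element)
    return (text or '').strip()

# Usage (with a live driver):
# el = driver.find_element_by_id('txtTemp_creditor_agent_bic')
# print(element_text(driver, el))
```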

Selenium Webscraping for some reason data only brings back partial instead of everything. Not sure if any dynamic data is in background

Python and Selenium beginner here. I'm trying to scrape the titles of the sections of a Udemy class. I've tried using find_elements_by_class_name and others, but for some reason it only brings back partial data.
page I'm scraping: https://www.udemy.com/selenium-webdriver-with-python3/
1) I want to get the title of the sections. They are the bold titles.
2) I want to get the title of the subsections.
from selenium import webdriver
driver = webdriver.Chrome()
url = 'https://www.udemy.com/selenium-webdriver-with-python3/'
driver.get(url)
main_titles = driver.find_elements_by_class_name("lecture-title-text")
sub_titles = driver.find_elements_by_class_name("title")
Problem
1) Using main_titles, I got the length to be only 10. It only goes from Introduction to Modules; Working With Files and the sections after it don't come out, even though the class names are exactly the same. Modules / Working With Files is basically the cutoff point. The elements in the inspector also look different at this point: they all have the same span class tag, but I'm not sure why only part of them is returned:
<span class="lecture-title-text">
Element Inspection between Modules title and WorkingWithFiles title
At this point the webscrape breaks down. Not sure why.
2) Using sub_titles, I got length to be 58 items but when I print them out, I only get the top two:
Introduction
How to reach me anytime and ask questions? *** MUST WATCH ***
After this, it's all blank lines. Not sure why it's only pulling the top two and not the rest when all the tags have
<div class='title'>
Maybe I could try using BeautifulSoup, but currently I'm trying to get better at Selenium. Is there dynamic content throwing off the Selenium scrape, or am I not scraping it in a proper way?
Thank you guys for the input. Sorry for the long post. I wanted to make sure I describe the problem correctly.
The reason why you're only getting the first 10 sections is that only the first ten are shown. You might be logged in in your browser, so when you go to check it out, it shows every section; but for me and your scraper, it's only showing the first 10. You'll need to click that .section-container--more-sections button before looking for the titles.
As for the weird case of the titles not being scraped properly: it's because when an element is hidden, its text attribute will always be empty, which is why it only works for the first section. I'd try using WebElement.get_attribute('textContent') to scrape the text.
OK, I've gone through the suggestions in the comments and solved it. I'm writing it here in case anyone in the future wants to see how the solution went.
1) Using the suggestions, I added a command to click on the '24 more sections' button to expand the tab and then scraped it, which worked perfectly!
driver.find_element_by_class_name("js-load-more").click()
titles = driver.find_elements_by_class_name("lecture-title-text")
for each in titles:
    print(each.text)
This pulled all 34 section titles.
2) Using Matt's suggestion, I found the WebElement and used get_attribute('textContent') to pull out the text data. There were a bunch of spaces, so I used strip() to get the strings only.
sub_titles = driver.find_elements_by_class_name("title")
for each in sub_titles:
    print(each.get_attribute('textContent').strip())
This pulled all 210 subsection titles!

How to get attribute using Selenium python without using .get_attribute()

This is the first question I've posted so do let me know if I should make the question clearer. Furthermore I've only just started out Python so I hope I can phrase the question right with the correct terms.
Basically I have created a customizable web scraper that relies on the user's knowledge of CSS selectors. Users will first have to go to the website they want to scrape, jot down the CSS selectors ("AA") of their desired elements, and enter them in an Excel file, from which the Python script reads the inputs, passes them through browser.find_elements_by_css_selector("AA"), and gets the relevant text through .text.encode('utf-8').
However, I noticed that sometimes there might be important information in an attribute value that should be scraped. I've looked around and found that the suggestion is always to include .get_attribute().
1) Is there an alternative to getting attribute values by just using browser.find_elements_by_css_selector("AA") without using browser.find_elements_by_css_selector("AA").get_attribute("BB"). Otherwise,
2) Is it possible for users to enter some value in "BB" in browser.find_elements_by_css_selector("AA").get_attribute("BB") such that only browser.find_elements_by_css_selector("AA") will run?
Yes, there is an alternative for retrieving text and attribute values without using the get_attribute() method. I am not sure whether it can be achieved through CSS selectors, but through XPath it is possible. A couple of examples are as follows:
//h3[@class="lvtitle"]/a/text()
/*/book[1]/title/@lang
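Note that Selenium's own find_element calls generally must return element nodes, so XPaths ending in text() or @lang are usually evaluated outside Selenium. A hedged sketch (assuming the third-party lxml package is available) evaluates them over driver.page_source instead:

```python
# Sketch: evaluate "value-returning" XPaths (text()/@attr endings) over the
# raw page HTML with lxml, since Selenium locators must return elements.
from lxml import html

def xpath_values(page_source, xpath):
    """Return the list of strings an XPath expression selects from HTML."""
    return html.fromstring(page_source).xpath(xpath)

# Usage (with a live driver):
# titles = xpath_values(driver.page_source,
#                       '//h3[@class="lvtitle"]/a/text()')
```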
