The website I am trying to scrape is 'https://www.lamiastampante.it/cerca_codice_cartuccia.php?codice=D111L&lg=it', and I am using python with Selenium for that.
I want to click on the title of the first product returned by my search.
It is an <a> element within a <div>, but when I copy the XPath of that element's parent (the <div>'s XPath), my Python script resolves it to a different (incorrect) element: a panel located on the right of the webpage.
I noticed this because when I print the class of the element returned for that XPath I get "panel-heading", while it should be "col-xs-12 col-sm-12 col-md-12".
This is my very short python script:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.get("https://www.lamiastampante.it")
driver.find_element(By.ID, "form_oem_code").send_keys("D111L" + Keys.ENTER)
first_product = driver.find_element_by_xpath("""/html/body/div[6]/div/div[1]/div[4]/div[1]""")# XPath of the target's parent element.
# first_product.click() /Commented out because I should first get the <a> element within that contains the link that can be clicked.
You can visit the web page and inspect its HTML structure. I had a hard time copying it here in a comprehensive and useful way.
To click on the link in the search results, induce WebDriverWait() and wait for element_to_be_clickable() with the following XPath.
driver.get("https://www.lamiastampante.it")
driver.find_element(By.ID, "form_oem_code").send_keys("D111L" + Keys.ENTER)
WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH,"(//a[starts-with(@id,'a_') and contains(.,'Toner')])[1]"))).click()
Or you can use visibility_of_all_elements_located() with the XPath below.
driver.get("https://www.lamiastampante.it")
driver.find_element(By.ID, "form_oem_code").send_keys("D111L" + Keys.ENTER)
elements=WebDriverWait(driver,10).until(EC.visibility_of_all_elements_located((By.XPATH,"//a[starts-with(@id,'a_') and contains(.,'Toner')]")))
elements[0].click()
You need the following imports.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
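One possible reason a copied absolute XPath lands on the wrong element (not confirmed for this site) is that the index-based path was copied from a DOM that differs slightly from the one the automated session sees. The sketch below illustrates the idea offline with the standard library's ElementTree, which supports only a small XPath subset. The markup, the banner div, and the id a_1 are hypothetical and heavily simplified.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup (not the real page).
# What the browser's inspector showed when the XPath was copied:
INSPECTED = """
<html><body>
  <div class="banner">promo</div>
  <div class="results"><a id="a_1">Toner D111L</a></div>
  <div class="panel-heading">side panel</div>
</body></html>
"""

# What the automated session received: same page, banner missing.
AUTOMATED = """
<html><body>
  <div class="results"><a id="a_1">Toner D111L</a></div>
  <div class="panel-heading">side panel</div>
</body></html>
"""


def class_at_position(markup):
    """Class of the element found by a copied, index-based path."""
    root = ET.fromstring(markup.strip())
    return root.find("./body/div[2]").get("class")


def text_by_attribute(markup):
    """Text of the link found by its id, wherever it sits in the tree."""
    root = ET.fromstring(markup.strip())
    return root.find(".//a[@id='a_1']").text


print(class_at_position(INSPECTED))   # results
print(class_at_position(AUTOMATED))   # panel-heading
print(text_by_attribute(INSPECTED))   # Toner D111L
print(text_by_attribute(AUTOMATED))   # Toner D111L
```

The index-based path changes meaning as soon as one sibling appears or disappears, while the attribute-based path keeps pointing at the same link. That is the motivation for the relative, attribute-based XPath used in the answer above.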
Hi, I'm new to Selenium and web scraping and I need some help.
I am trying to scrape a site and I don't know how to get a span by its class.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
PATCH = "/Users/bobo/Downloads/chromedriver"
driver = webdriver.Chrome(PATCH)
driver.get("https://neonet.pl")
print(driver.title)
search = driver.find_element_by_class_name("inputCss-input__label-263")
search.send_keys(Keys.RETURN)
time.sleep(5)
I am trying to extract this span:
<span class="inputCss-input__label-263">Szukaj produktu</span>
I can see that you are trying to search for something in the search bar.
First, I recommend using an XPath instead of the class name. Here is a simple technique to get the XPath of any element on a webpage:
Right-click the page and choose Inspect, select the element picker (the mouse-in-a-box icon in the upper left of the developer tools), click the element on the webpage to highlight its HTML, then right-click the highlighted HTML and choose Copy, then XPath.
Here is a code example that searches for an element on the webpage. I also included WebDriverWait because the code sometimes runs too fast and can't find the next element, so the wait pauses the code until the element is visible:
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(executable_path="/Users/bobo/Downloads/chromedriver")
driver.get("https://neonet.pl") #loading page
wait = WebDriverWait(driver, 20) #defining webdriver wait
search_word = 'iphone\n' # \n is going to act as an enter key
wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="root"]/main/div[1]/div[4]/div/div/div[2]/div/button[1]'))).click() #clicking on cookies popup
wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="root"]/main/header/div[2]/button'))).click() #clicking on search button
wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="root"]/aside[2]/section/form/label/input'))).send_keys(search_word) #typing into the search input
print('done!')
sleep(10)
Hope this helped you!
wait=WebDriverWait(driver,10)
driver.get('https://neonet.pl')
elem=wait.until(EC.visibility_of_element_located((By.XPATH, "//span[contains(@class,'inputCss-input')]"))).text
print(elem)
To output the text of the search bar's label, use .text on the Selenium WebElement.
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
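The contains(@class, ...) test above matches on a fragment of the class attribute, which helps if the numeric suffix (-263) is generated and may change (an assumption, not verified for this site). A minimal offline sketch of the same partial-class matching, using only the standard library's html.parser on the span from the question; the names SpanTextCollector and span_texts are made up for illustration.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collects the text of <span> tags whose class contains a fragment."""

    def __init__(self, class_fragment):
        super().__init__()
        self.class_fragment = class_fragment
        self.depth = 0    # greater than 0 while inside a matching span
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            self.depth += 1  # nested span inside a matching span
        elif self.class_fragment in (dict(attrs).get("class") or ""):
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def span_texts(html, class_fragment):
    """Return the stripped text of every matching span in the HTML string."""
    parser = SpanTextCollector(class_fragment)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]


html = '<span class="inputCss-input__label-263">Szukaj produktu</span>'
print(span_texts(html, "inputCss-input"))  # ['Szukaj produktu']
```

With Selenium, the same function could be fed driver.page_source once the page has loaded, although reading .text from the located element as shown above is more direct.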
For a research study I would like to scrape some links from web pages that are located outside the viewport (to see these links you need to scroll down the page).
Page example (https://www.twitch.tv/lirik)
Link example: https://www.amazon.com/dp/B09FVR22R2
Link located in div class='Layout-sc-nxg1ff-0 itdjvg default-panel' (in total 16 links on the page).
I have written the script, but I get an empty list:
from selenium import webdriver
import time
browser = webdriver.Firefox()
browser.get('https://www.twitch.tv/lirik')
time.sleep(3)
browser.execute_script("window.scrollBy(0,document.body.scrollHeight)")
time.sleep(3)
panel_blocks = browser.find_elements(by='class name', value='Layout-sc-nxg1ff-0 itdjvg default-panel')
browser.close()
print(panel_blocks)
print(type(panel_blocks))
I just get an empty list after the page has loaded. Here is the output from the script above:
/usr/local/bin/python /Users/greg.fetisov/PycharmProjects/baltazar_platform/Twitch_parser.py
[]
<class 'list'>
Process finished with exit code 0
p.s.
When the webdriver opens the page, I see no scroll-down action. It just opens the page and then closes it after the time.sleep pause.
How can I change the script to get the links properly?
Any help or advice would be appreciated!
You are using the wrong locator: a class name locator accepts a single class, not several space-separated ones.
You should use explicit waits with expected conditions instead of hardcoded pauses.
The find_elements method returns a list of web elements, while you want the link inside each element.
This should work better:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
browser = webdriver.Firefox()
browser.get('https://www.twitch.tv/lirik')
wait = WebDriverWait(browser, 20)
wait.until(EC.element_to_be_clickable((By.XPATH, "//div[@class='channel-panels-container']//a")))
time.sleep(0.5)
link_blocks = browser.find_elements(By.XPATH, "//div[@class='channel-panels-container']//a")
for link_block in link_blocks:
    # Each element is an <a>; its target URL is in the href attribute.
    link = link_block.get_attribute("href")
    print(link)
browser.close()
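The "wrong locator" in the question is a class name locator given three space-separated classes at once. One way around it is to turn the class attribute into a CSS selector that requires all of the classes. The helper below is a small sketch of that conversion; classes_to_css is a hypothetical name, and it assumes the class names contain no characters that need CSS escaping.

```python
def classes_to_css(tag, class_attribute):
    """Turn a space-separated class attribute into a CSS selector."""
    classes = class_attribute.split()
    if not classes:
        raise ValueError("class_attribute must contain at least one class")
    # "a b c" on a div becomes "div.a.b.c" (element must carry all classes).
    return tag + "".join("." + name for name in classes)


selector = classes_to_css("div", "Layout-sc-nxg1ff-0 itdjvg default-panel")
print(selector)  # div.Layout-sc-nxg1ff-0.itdjvg.default-panel
```

The resulting string can be passed to browser.find_elements(By.CSS_SELECTOR, selector). Generated class names such as itdjvg may change between site releases, so a selector built this way can stop matching without warning.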
To print the values of the href attribute, induce WebDriverWait for visibility_of_all_elements_located() with the following locator strategy:
Using CSS_SELECTOR:
driver.get("https://www.twitch.tv/lirik")
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.Layout-sc-nxg1ff-0.itdjvg.default-panel > a")))])
Console Output:
['https://www.amazon.com/dp/B09FVR22R2', 'http://bs.serving-sys.com/Serving/adServer.bs?cn=trd&pli=1077437714&gdpr=$%7BGDPR%7D&gdpr_consent=$%7BGDPR_CONSENT_68%7D&adid=1085757156&ord=[timestamp]', 'https://store.epicgames.com/lirik/rumbleverse', 'https://bitly/3GP0cM0', 'https://lirik.com/', 'https://streamlabs.com/lirik', 'https://twitch.amazon.com/tp', 'https://www.twitch.tv/subs/lirik', 'https://www.youtube.com/lirik?sub_confirmation=1', 'http://www.twitter.com/lirik', 'http://www.instagram.com/lirik', 'http://gfuel.ly/lirik', 'http://www.cyberpowerpc.com/', 'https://www.cyberpowerpc.com/page/Intel/LIRIK/', 'https://discord.gg/lirik', 'http://www.amazon.com/?_encoding=UTF8&camp=1789&creative=390957&linkCode=ur2&tag=l0e6d-20&linkId=YNM2SXSSG3KWGYZ7']
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
I'm new to Python and Selenium.
This is an HTML snippet of the page I'm trying to scrape.
I want to locate the link by the data-attr1 value given in the HTML snippet.
Please share some code to find this link and click on it.
<a class="js-track-click challenge-list-item" data-analytics="ChallengeListChallengeName" data-js-track="Challenge-Title" data-attr1="grading" data-attr3="code" data-attr4="true" data-attr5="true" data-attr7="10" data-attr-10="0.9" data-attr11="false" href="/challenges/grading"><div class="single-item challenges-list-view-v2 first-challenge cursor"><div id="contest-challenges-problem" class="individual-challenge-card-v2 content--list-v2 track_content"><div class="content--list_body"><header class="content--list_header-v2"><div class="challenge-name-details "><div class="pull-left inline-block"><h4 class="challengecard-title">Grading Students<div class="card-details pmT"><span class="difficulty easy detail-item">Easy</span><span class="max-score detail-item">Max Score: 10</span><span class="success-ratio detail-item">Success Rate: 96.59%</span></div></h4></div></div><span class="bookmark-cta"><button class="ui-btn ui-btn-normal ui-btn-plain star-button" tabindex="0" aria-label="Add bookmark"><div class="ui-content align-icon-right"><span class="ui-text"><i class="js-bookmark star-icon ui-icon-star"></i></span></div></button></span><div class="cta-container"><div class="ctas"><div class="challenge-submit-btn"><button class="ui-btn ui-btn-normal primary-cta ui-btn-line-primary" tabindex="0"><div class="ui-content align-icon-right has-icon"><span class="ui-text">Solved</span><i class="ui-icon-check-circle ui-btn-icon"></i></div></button></div></div></div></header></div></div><div class="__react_component_tooltip place-top type-dark " data-id="tooltip"></div></div></a>
If the page is not loading as fast as expected you can try:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@data-attr1='grading']"))).click()
You can click on the element using the xpath:
from selenium.webdriver.common.by import By
element = driver.find_element(By.XPATH, "//a[@data-attr1='grading']")
element.click()
Since there is no name, id, or unique class available directly, you can use XPath.
The element that you are looking for is in a div with class challenges-list and you want to click on the first link inside it. You can use this XPath:
//a[@data-attr1='grading']
And to click it you can do
driver.find_element(By.XPATH, "//a[@data-attr1='grading']").click()  # needs: from selenium.webdriver.common.by import By
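If several links have to be located by different data attributes, the XPath string can be built by a small helper instead of being typed by hand. This is only a sketch: xpath_by_attribute is a hypothetical name, and the quoting logic covers the simple cases only.

```python
def xpath_by_attribute(tag, attribute, value):
    """Build //tag[@attribute='value'], choosing quotes that fit the value."""
    if "'" not in value:
        literal = "'" + value + "'"
    elif '"' not in value:
        literal = '"' + value + '"'
    else:
        # XPath 1.0 has no escape character; such values need concat().
        raise ValueError("value mixes both quote types; build it with concat()")
    return "//{}[@{}={}]".format(tag, attribute, literal)


print(xpath_by_attribute("a", "data-attr1", "grading"))
# //a[@data-attr1='grading']
```

The returned string is the same expression used in the answers here, so it can be passed to driver.find_element(By.XPATH, ...) or to an explicit wait.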
If a plain click() does not work, you can try a double-click with ActionChains.
e.g.
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
# Get the element however you want
element = driver.find_element(By.XPATH, "//a[@data-attr1='grading']")
ActionChains(driver).double_click(element).perform()
When I run the code, the website loads fine but then it won't click on the button: an error appears saying the element is not interactable. What do I need to do to click the button? I am relatively new to this and would be grateful for any help.
I have already tried finding it by id and tag.
page = driver.get("https://kenpreston.co.uk/author/")
element = driver.find_element_by_id('mk-button-31')
element.click()
SOLVED:
I used driver.find_element_by_link_text and this worked fine.
I have checked the website and noticed that mk-button-31 is an id for a div tag and inside it there is an a tag. Try getting the url from the a tag and do another driver.get instead of clicking on it.
Also the whole div tag is not clickable so that is why you are getting this error.
Use sleep from the time library to give the page time to load fully:
from time import sleep
from selenium.webdriver.common.by import By
page = driver.get("https://kenpreston.co.uk/author/")
sleep(2)
element = driver.find_element(By.ID, 'mk-button-31')
element.click()
It looks like your element is not clickable. Replace the id selector with a CSS selector and wait for the element before clicking on it.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
page = driver.get("https://kenpreston.co.uk/author/")
element = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#mk-button-31 span")))
element.click()
Consider adding Explicit Wait to your script as it might be the case the DOM had finished loading and the button you're looking for is still not there.
The classes you're looking for are:
WebDriverWait
expected_conditions
Suggested code change:
#your other imports here
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
#your other code here
page = driver.get("https://kenpreston.co.uk/author/")
element = WebDriverWait(driver, 10).until(expected_conditions.element_to_be_clickable((By.ID, "mk-button-31")))
element.click()
More information: How to use Selenium to test web applications using AJAX technology
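To make the idea of an explicit wait concrete, here is a rough sketch of the polling loop behind it: call a condition repeatedly until it returns something truthy or a timeout expires. This is not Selenium's implementation (WebDriverWait, for example, also ignores NoSuchElementException by default); wait_until and fake_button_lookup are hypothetical names used only for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(poll_interval)


# Stand-in for a page whose button only appears on the third check.
attempts = {"count": 0}


def fake_button_lookup():
    attempts["count"] += 1
    return "button" if attempts["count"] >= 3 else None


found = wait_until(fake_button_lookup, timeout=1.0, poll_interval=0.01)
print(found, attempts["count"])  # button 3
```

Unlike a fixed sleep(2), the loop returns as soon as the condition holds and fails loudly when it never does, which is why the answers here prefer explicit waits.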
I want to extract the tag names (hashtags) from the explore page in twitter using selenium on python3. But there are no special tags or classes or even ids to be able to locate them and save them.
Is there a way that I can extract them even if they change without having to edit my code every time?
I think the following code will take me to the explore page using the link text. But I can not use the same method to locate the tags as they change every now and then.
explore = driver.find_element_by_link_text("Explore")
I want to be able to locate the tags and save them into a list so I can use that list in my work later on.
This is the HTML code for one of the tags:
<span class="r-18u37iz"><span dir="ltr" class="css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0">#ARSBUR</span></span>
The classes are not unique and they are used in other elements of the page, so I can not use them.
Is there a way to locate the (#) mark so that I only get the text that includes it?
To extract the hashtags from the explore page in twitter i.e https://twitter.com/explorer?lang=en using Selenium on Python 3 you have to induce WebDriverWait for the visibility_of_all_elements_located() and you can use either of the following Locator Strategies:
Using CSS_SELECTOR:
driver.get("https://twitter.com/explorer?lang=en")
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[href^='/hashtag']>span.trend-name")))])
Using XPATH:
driver.get("https://twitter.com/explorer?lang=en")
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[starts-with(@href, '/hashtag')]/span[contains(@class, 'trend-name')]")))])
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Console Output:
['#MCITOT', '#WorldSupportsKashmir', '#MCIvsTOT', '#11YearsOFViratism', '#ManCity']
You could dump page source into beautifulsoup 4.7.1 + and use :contains along with class. Your classes appear different from the ones I see but I am making an assumption about url.
N.B. On the page there can be other # under a different class which would make selector ".trend-name, .twitter-hashtag" .
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
d = webdriver.Chrome(r'path\chromedriver.exe')
d.get('https://twitter.com/explorer?lang=en')
WebDriverWait(d,5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".trend-name")))
soup = bs(d.page_source, 'lxml')
hashtag_trends = [i.text for i in soup.select('.trend-name:contains("#")')]
print(hashtag_trends)
Or, using Selenium only, test whether .text begins with #:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
d = webdriver.Chrome(r'path\chromedriver.exe')
d.get('https://twitter.com/explorer?lang=en')
hashtag_trends = [i.text for i in
WebDriverWait(d,5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".trend-name")))
if i.text.startswith('#')
]
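The startswith('#') filter can be separated from the browser code, so it can be checked without opening Twitter. A minimal sketch, assuming the element texts have already been collected into a list of strings; keep_hashtags is a hypothetical helper name, and the sample list is made up.

```python
def keep_hashtags(texts):
    """Keep strings that start with '#', dropping duplicates but keeping order."""
    seen = set()
    hashtags = []
    for text in texts:
        cleaned = text.strip()
        if cleaned.startswith("#") and cleaned not in seen:
            seen.add(cleaned)
            hashtags.append(cleaned)
    return hashtags


sample = ["#MCITOT", "Premier League", " #ManCity ", "#MCITOT", ""]
print(keep_hashtags(sample))  # ['#MCITOT', '#ManCity']
```

With Selenium, the input would be something like [e.text for e in elements], where elements comes from one of the waits above. Because the filter looks only at the text, it does not depend on class names that the site may change.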
To locate a trending topic you can use XPath:
driver.find_element(By.XPATH, '(//*[contains(@class,"trend-name")])[1]').text
driver.find_element(By.XPATH, '(//*[contains(@class,"trend-name")])[1]').click()
You can count the matching elements with:
len_locator = driver.find_elements(By.XPATH, '//*[contains(@class,"trend-name")]')
print(len(len_locator))
Or, if you only want elements whose text starts with #, you can use:
driver.find_element(By.XPATH, '(//*[@dir="ltr" and starts-with(text(), "#")])[1]').text
driver.find_element(By.XPATH, '(//*[@dir="ltr" and starts-with(text(), "#")])[1]').click()
You can count the matching elements with:
len_locator = driver.find_elements(By.XPATH, '//*[@dir="ltr" and starts-with(text(), "#")]')
print(len(len_locator))
The [1] selects the first trending topic; for the second and so on, replace [1] with [2], etc. Iterate over the list returned by find_elements to grab them all.