I've written some code in Python in combination with Selenium to parse different product names from a webpage. A few "load more" buttons appear as the browser scrolls down, and the page only displays its full content once it has been scrolled all the way to the bottom and there is no "load more" button left to click. My scraper seems to be doing well, but I'm not getting all the results: there are around 200 products on that page, yet I'm only getting 90 of them. What change should I make in my scraper to get them all? Thanks in advance.
The webpage I'm dealing with: Page_Link
This is the script I'm trying with:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("put_above_url_here")
wait = WebDriverWait(driver, 10)
page = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".listing_item")))

for scroll in range(17):
    page.send_keys(Keys.PAGE_DOWN)
    time.sleep(2)
    try:
        load = driver.find_element_by_css_selector(".lm-btm")
        load.click()
    except Exception:
        pass

for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[id^=item_]"))):
    name = item.find_element_by_css_selector(".pro-name.el2").text
    print(name)

driver.quit()
Try the code below to get the required data:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.purplle.com/search?q=hair%20fall%20shamboo")
wait = WebDriverWait(driver, 10)

# Hide the page header so it does not get in the way of scrolling and clicking
header = driver.find_element_by_tag_name("header")
driver.execute_script("arguments[0].style.display='none';", header)

while True:
    try:
        page = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".listing_item")))
        driver.execute_script("arguments[0].scrollIntoView();", page)
        page.send_keys(Keys.END)
        load = wait.until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, "LOAD MORE")))
        driver.execute_script("arguments[0].scrollIntoView();", load)
        load.click()
        wait.until(EC.staleness_of(load))
    except Exception:
        # No LOAD MORE button left to click -- all products are loaded
        break

for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[id^=item_]"))):
    name = item.find_element_by_css_selector(".pro-name.el2").text
    print(name)

driver.quit()
You should only use Selenium as a last resort. A quick look around the webpage revealed the API it calls to get your data, which returns a JSON output with all the details:
Link
You can now just loop over the JSON and store it in a dataframe easily. Very fast, and fewer errors than Selenium.
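As a minimal sketch of that approach (the endpoint URL, query parameters, and "products" field name below are placeholders, since the actual API link is not reproduced here):

import requests
import pandas as pd

# Hypothetical endpoint -- substitute the real search API URL you find
# in the browser's network tab.
API_URL = "https://www.example.com/api/search"
params = {"q": "hair fall shampoo", "page": 1}

rows = []
while True:
    data = requests.get(API_URL, params=params, timeout=10).json()
    items = data.get("products", [])  # field name is an assumption
    if not items:
        break
    rows.extend(items)
    params["page"] += 1

df = pd.DataFrame(rows)
print(df.head())

Paging until the response comes back empty mirrors clicking LOAD MORE until it disappears, without driving a browser at all.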
As the title says. I couldn't find a solution online. You'll see from the print statements that the program gets the text property from the WebElement object, but it is always an empty string, even though I am using WebDriverWait.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

PATH = "C:\\Users\\Anthony\\Desktop\\chromedriver.exe"
driver = webdriver.Chrome(PATH)
website = "https://www.magicspoiler.com"
driver.get(website)

el_name = '//div[@class="set-card-2 pad5"]'
try:
    main = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, el_name)))
    articles = main.find_elements(By.XPATH, el_name)
    for article in articles:
        print(type(article.text))
        print(article.text)
finally:
    driver.quit()
I checked the HTML DOM and I don't see any text in the element you are looking for. They are simply images with anchor links, which is why you are getting blank responses.
I gave it an alternate try, extracting the links instead, and that was successful.
So, after refactoring, your code looks like this:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

PATH = "C:\\Users\\Anthony\\Desktop\\chromedriver.exe"
driver = webdriver.Chrome(PATH)
website = "https://www.magicspoiler.com"
driver.get(website)

el_name = '//div[@class="set-card-2 pad5"]/a'
try:
    main = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, el_name)))
    articles = main.find_elements(By.XPATH, el_name)
    for article in articles:
        # print(type(article.text))
        print(article.get_attribute("href"))
finally:
    driver.quit()
So, please check whether you want the links or really the text. If you want the text, then there is none that I can see in the DOM, as I said. You may have to traverse through each of the links by clicking on it and look for what you need on the navigated page.
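If you do go that route, here is a minimal sketch, reusing the articles list from the code above and collecting the hrefs first so the element references do not go stale while navigating (run it before the driver quits):

# Collect the hrefs up front; navigating away makes the original
# WebElement references stale.
links = [a.get_attribute("href") for a in articles]
for url in links:
    driver.get(url)
    print(driver.title)  # or locate whatever text you need on this page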
I'm trying to make my program fetch the link of an image and then store it as a string in a variable.
This is the XPath of the image. I need to do it through XPaths because the XPaths on the website are very similar apart from the "/article[x]" part. This allows me to increase the number with a variable so that I can go through all the XPaths on the page.
/html/body/div[2]/div[2]/div[3]/div[2]/div[2]/div[1]/div/article[1]/div[2]/div[1]/a/img
[Picture of the website whose image links I'm trying to retrieve]
My code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import tkinter
import time

Anime = input("Enter Anime:")
driver = webdriver.Chrome(executable_path=r"C:\Users\amete\Documents\chromedriver.exe")
driver.get("https://myanimelist.net/search/all?q=one%20piece&cat=all")
search = driver.find_element_by_xpath('//input[@name="q"]')
wait = WebDriverWait(driver, 20)
wait.until(EC.element_to_be_clickable((By.XPATH, '//input[@name="q"]')))

# Clears the field
search.send_keys(Keys.CONTROL, 'a')
search.send_keys(Keys.DELETE)
# The field is now cleared and the program can type whatever it wants
search.send_keys(Anime)
search.send_keys(Keys.RETURN)

# Accept the cookies
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="qc-cmp2-ui"]/div[2]/div/button[3]'))).click()

# Added this wait
wait.until(EC.element_to_be_clickable((By.XPATH, '//h2[@id="anime"]//ancestor::div[@class="content-left"]//article[1]/div[contains(@class, "list")][1]/div[contains(@class, "information")]/a[1]')))
link = driver.find_element_by_xpath('//h2[@id="anime"]//ancestor::div[@class="content-left"]//article[1]/div[contains(@class, "list")][1]/div[contains(@class, "information")]/a[1]').text
piclink = driver.find_element_by_xpath('/html/body/div[2]/div[2]/div[3]/div[2]/div[2]/div[1]/div/article[1]/div[2]/div[1]/a/img')
print(piclink)
You can get it like this (specify the attribute):
piclink = driver.find_element_by_xpath('/html/body/div[2]/div[2]/div[3]/div[2]/div[2]/div[1]/div/article[1]/div[2]/div[1]/a/img').get_attribute('src')
print(piclink)
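Since those absolute XPaths differ only in the article index, you could also collect every image at once with find_elements instead of incrementing a counter. A sketch, assuming each result card is an article element containing the thumbnail as a link with an img inside (adjust the XPath if the markup differs):

# Grab all result thumbnails in one query instead of one absolute
# XPath per article index.
images = driver.find_elements_by_xpath('//article//a/img')
for img in images:
    print(img.get_attribute('src'))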
I would like to scrape a web page that was opened by Selenium from a different webpage.
I entered a search term into a website using Selenium, and this landed me on a new page. My aim is to create soup out of this new page, but the soup is getting created out of the previous page where I entered my search term. Help, please!
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries')
inputElement.send_keys(Keys.ENTER)
driver.wait.until(staleness_of('txtStock'))
source = driver.page_source
soup = BeautifulSoup(source)
You need to know the exact company name for your search. After using send_keys, you tried to check for the staleness of an element; I did not understand how that statement is supposed to work, so I added a WebDriverWait for an element of the new page instead.
The following works for me regarding the Selenium part, up to getting the page source:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries Ltd.')
inputElement.send_keys(Keys.ENTER)
company = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'lblCompany')))
source = driver.page_source
You should add exception handling.
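For example, a minimal sketch wrapping the wait in try/finally so the browser always quits, even on a timeout:

from selenium.common.exceptions import TimeoutException

try:
    company = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'lblCompany')))
    source = driver.page_source
except TimeoutException:
    print("Result page did not load in time")
    source = None
finally:
    driver.quit()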
@Jens Dibbern has given a working solution, but it is not necessary to give the exact name of the company in the search. What happens is that when you type a non-exact name, a drop-down pops up.
I have observed that unless this drop-down is present, the enter key does not work. You can check this by going to the site, pasting the name, and pressing the enter key as fast as possible without waiting: nothing happens.
You could also wait for this drop-down to become visible instead and then send the enter key. This also works perfectly. Note that this will end up selecting the first item in the drop-down if more than one is present.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries')
drop_down=driver.find_element_by_css_selector("#listPlacementStock")
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#listPlacementStock:not([style*="display: none"])')))
inputElement.send_keys(Keys.ENTER)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="CompanyLink"]')))
source = driver.page_source
soup = BeautifulSoup(source,'html.parser')
print(soup)
I'm trying to do some web scraping from a betting website:
As part of the process, I have to click on the different buttons under the "Favourites" section on the left side to select different competitions.
Let's take the ENG Premier League button as an example. The XPath I identified for the button is //*[@id="SportMenuF"]/div[3] and its ID is 91.
My code for clicking on the button is as follows:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
chrome_path = "C:\Python27\Scripts\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("URL Removed")
content = driver.find_element_by_xpath('//*[@id="SportMenuF"]/div[3]')
content.click()
Unfortunately, I always get this error message when I run the script:
"no such element: Unable to locate element:
{"method":"xpath","selector":"//*[#id="SportMenuF"]/div[3]"}"
I have tried different identifiers such as CSS selector, ID and, as shown in the example above, the XPath. I tried using waits and explicit conditions, too. None of this has worked.
I also attempted scraping some values from the website without any success:
from selenium import webdriver
from selenium.webdriver.common.by import By
chrome_path = "C:\Python27\Scripts\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("URL removed")
content = driver.find_elements_by_class_name('price-val')
for entry in content:
    print(entry.text)
Same problem, nothing shows up.
The website embeds an iframe from a different website. Could this be the cause of my problems? I tried scraping directly from the iframe URL, too, which didn't work either.
I would appreciate any suggestions.
Sometimes elements are either hiding behind an iframe or they haven't loaded yet.
For the iframe check, try:
driver.switch_to.frame(0)
For the wait check, try:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '-put the x-path here-')))
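Putting the two together, a minimal sketch that switches into the iframe, waits for an element, and then switches back so later lookups run against the top-level document (the frame index is a placeholder; the price-val class comes from your second snippet):

driver.switch_to.frame(0)  # frame index is a placeholder
price = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'price-val')))
print(price.text)
driver.switch_to.default_content()  # return to the main document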
I've written a script in Python with Selenium to parse some results that are populated upon filling in an input box and pressing the Go button. My script does this portion well at the moment. However, my main goal is to parse the title of that container, visible as Toys & Games, as well.
This is my try so far (I could not figure out how to write a loop to do the same for all the containers):
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
url = "https://www.fbatoolkit.com/"
driver = webdriver.Chrome()
driver.get(url)
time.sleep(3)
driver.find_element_by_css_selector(".estimator-container .estimator-input").send_keys("25000", Keys.RETURN)
time.sleep(2)
item = driver.find_element_by_css_selector(".estimator-result div").text
print(item)
driver.quit()
The result I get:
4 (30 Days Avg)
Result I would like to have:
Toys & Games
4 (30 Days Avg)
Link to an image in which you can see what they look like on that site. The expected fields are also marked with a pencil to show the location of the fields I'm trying to parse.
Try the code below to get the required output:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.fbatoolkit.com/"
driver = webdriver.Chrome()
driver.get(url)

for container in wait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[class='chart-container']"))):
    wait(container, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input.estimator-input"))).send_keys("25000", Keys.RETURN)
    title = wait(container, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".chart text"))).text
    item = wait(container, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".estimator-result div"))).text
    print(title, item)

driver.quit()