Scraping with Selenium not showing all data (possible duplicate) - python

I was trying to write a simple script for scraping a dynamic website (I'm a newbie with Selenium). The data I intended to scrape is the product name and the price. I ran the code and it worked, but it only showed 10 entries, while there are 60 entries per page. Here is the code:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://www.tokopedia.com/p/komputer-laptop/media-penyimpanan-data') # the link
product_name = driver.find_elements(By.CSS_SELECTOR, value='span.css-1bjwylw')
product_price = driver.find_elements(By.CSS_SELECTOR, value='span.css-o5uqvq')
list_product = []
list_price = []
for i in range(len(product_name)):
    list_product.append(product_name[i].text)
for j in range(len(product_price)):
    list_price.append(product_price[j].text)
driver.quit()
df = pd.DataFrame(columns=['product', 'price'])
df['product'] = list_product
df['price'] = list_price
print(df)
I used webdriver-manager instead of downloading the driver first and then pointing to it, because I thought it was simpler. Also, I used Service instead of Options (many tutorials use Options) because I got some errors with Options, and with Service it worked out fine. Oh, and I am using PyCharm, in case that matters.
Any help or suggestions will be very much appreciated, thank you!

You need to scroll down to the bottom of the page first so that all 60 items get loaded. The website is dynamic, and more data is loaded as you scroll down. You can scroll via JavaScript from the webdriver as follows: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"). Add this below driver.get() and before find_elements().
Don't forget to sleep after scrolling, since the newly loaded items take a moment to appear. A minimal sketch is shown below.
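Here is a minimal sketch based on the code from the question; it assumes a single scroll to the bottom plus a short sleep is enough to trigger the lazy loading (on some pages you may need to scroll in smaller steps):

# minimal sketch: scroll once to the bottom, wait, then collect the elements
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://www.tokopedia.com/p/komputer-laptop/media-penyimpanan-data')

# scroll to the bottom so the remaining products are loaded, then wait for them
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)

product_name = driver.find_elements(By.CSS_SELECTOR, 'span.css-1bjwylw')
product_price = driver.find_elements(By.CSS_SELECTOR, 'span.css-o5uqvq')

list_product = [name.text for name in product_name]
list_price = [price.text for price in product_price]
driver.quit()

df = pd.DataFrame({'product': list_product, 'price': list_price})
print(df)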

Related

Selenium webpage not loading properly

I am trying to web scrape university ranking information from the USNews site. The problem is that when I use Selenium to open the webpage, the 'Load More' button is not working properly. (I think I successfully click it, but in the Chrome window opened by the webdriver, when I scroll down to the button, it says 'We're sorry, there was a problem loading the next page of search results'.)
I am new to web scraping and I did a lot of research on this; there are several similar questions but none of those answers helped. I really need some help. Here is my code:
driver_path = 'xxx'  # chromedriver path
driver = webdriver.Chrome(executable_path=driver_path)
url2 = 'https://www.usnews.com/education/best-global-universities/rankings'
wait = WebDriverWait(driver, 30)
driver.get(url2)
driver.maximize_window()
count = 1
while True:
    try:
        print(1)
        # driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        wait.until(EC.visibility_of_element_located((By.XPATH, "//*[@id='rankings']/div[3]/button")))
        print(2)
        show_more = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='rankings']/div[3]/button")))
        ActionChains(driver).move_to_element(show_more).click().perform()
        print(3)
        # driver.find_element(By.XPATH, "//*[@id='rankings']/div[3]/button").click()
        # print(4)
        # wait.until(EC.visibility_of_element_located((By.XPATH, "//*[@id='rankings']/div[3]/button")))
        # print(5)
        count += 1
        time.sleep(2)
        if count >= 2:
            break
    except Exception as e:
        print(e)
        break
Even though I did not write code to close the ad, I don't think the ad is the problem, since when I manually close it and then click the button, it still doesn't work. Is it a problem with the website?
import requests
import os
from bs4 import BeautifulSoup
from selenium import webdriver
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
It is clear that there is some anti-scraping protection on this particular site. It is always recommended to consult the robots.txt file beforehand to check whether scraping a certain site is allowed or not.
In general, this site simply blocks your IP (try to go to other pages afterwards and you will see that you get a 403 error).
The approach you used does not seem wrong to me, however. You can try contacting the site directly to see if the problem can be solved in some other way.
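For the robots.txt check mentioned above, here is a minimal sketch using only the standard library; the URL and the generic user agent are just illustrative examples:

# minimal sketch of a robots.txt check with the standard library
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.usnews.com/robots.txt')
rp.read()

url = 'https://www.usnews.com/education/best-global-universities/rankings'
print(rp.can_fetch('*', url))  # True if the generic user agent '*' may fetch this URL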

Selenium cannot find element from workera.ai

I am trying to scrape question answers from workera.ai, but I am stuck because Selenium cannot find any element I search for by class name. When I check the page source, the element is there, but Selenium cannot find it. Here is what I am doing.
Signup using: https://workera.ai/candidates/signup
from selenium import webdriver
from selenium.webdriver.chrome import service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time, os
option = webdriver.ChromeOptions()
option.add_argument("start-maximized")
option.add_experimental_option("excludeSwitches", ["enable-automation"])
option.add_experimental_option('useAutomationExtension', False)
option.add_argument("--disable-blink-features")
option.add_argument("--disable-gpu")
option.add_argument(r"--user-data-dir=C:\Users\user_name\AppData\Local\Google\Chrome\User Data") #e.g. C:\Users\You\AppData\Local\Google\Chrome\User Data
option.add_argument(r'--profile-directory=Profile 2') # using profile which is logged into the website
#option.add_argument("--headless")
option.add_argument('--disable-blink-features=AutomationControlled')
wd = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=option)
skill_sets = ['https://workera.ai/app/learner/skillset/82746bf6-4eb2-4065-b2fb-740bc3207d14','https://workera.ai/app/learner/skillset/7553e8f8-52bf-4136-a4ea-6aa63eb963d9','https://workera.ai/app/learner/skillset/e11cb698-38c1-4a4f-aa7b-43b85bdf5a51','https://workera.ai/app/learner/skillset/a999048c-ab99-4576-b849-4e72c9455418','https://workera.ai/app/learner/skillset/7df84ad9-ae67-4faf-a981-a95c1c02adbb', 'https://workera.ai/app/learner/skillset/737fa250-8c66-4ea0-810b-6847c304aa5b','https://workera.ai/app/learner/skillset/ed4f2f1f-2333-4b28-b36a-c7f736da9647','https://workera.ai/app/learner/skillset/323ba5d9-fffe-48c0-b7b4-966d1ebca99a','https://workera.ai/app/learner/skillset/488492e9-53c4-4600-b336-6dfe44340402']
# AI fluent AI literate DATA ANAlyst DATA Engineer DATA scientist Deep learn ML Responsible AI Software Engineer
for skill in skill_sets:
    wd.get(skill)
    time.sleep(20)
    num = wd.find_element(By.CLASS_NAME, "sc-jNHgKk hrMhpT")  # class name is different for every account
    num = num.split('of')[1]
    num = int(num)
    print(num)
    button = wd.find_elements(By.CLASS_NAME, "styled__SBase-sc-cmjz60-0 styled__SPrimary-sc-cmjz60-1 kSmXiJ hwoYMb sc-fKVqWL eOjNfz")
    print(len(button))
wd.close()
I don't know why this is happening. Does the site block Selenium webdrivers, or is it something else?
Edit
I tried getting the page source from Selenium and then accessing the elements with bs4, and that works. So I think the website is blocking Selenium in some way.
The problem is that with Selenium's By.CLASS_NAME you can't pass a value that contains more than one class like this.
In order to select such elements, you can either use just one of the classes as the value, or join them with "."
For example:
wd.find_element(By.CLASS_NAME,"class1.class2")
You can also select the class that exists for all accounts, which I believe is "sc-jNHgKk", so you won't have the problem of selecting a different class for each account; or you can just use XPath instead.
num = int(wd.find_element(By.CLASS_NAME, "sc-jNHgKk").text.split("of ")[1])
button = wd.find_elements(By.CLASS_NAME, "styled__SBase-sc-cmjz60-0")
print(len(button))
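Equivalently, if you prefer CSS selectors, multiple classes can be chained with dots. A small sketch reusing the class names from the question (it assumes the driver wd from the question is still open):

# CSS selector equivalent: chain the classes with dots (class names taken from the question)
num_text = wd.find_element(By.CSS_SELECTOR, ".sc-jNHgKk").text
num = int(num_text.split("of ")[1])
buttons = wd.find_elements(By.CSS_SELECTOR, ".styled__SBase-sc-cmjz60-0.styled__SPrimary-sc-cmjz60-1")
print(len(buttons))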

No output while scraping Google search page

I am trying to scrape from Google search results the blue highlighted portion as shown below:
When I use inspect element, it shows span class="YhemCb". I have tried various soup.find and soup.find_all commands, but everything I have tried has produced no output so far. What command should I use to scrape this part?
Google uses JavaScript to display most of its web elements, so using something like requests and BeautifulSoup is unfortunately not enough.
Instead, use Selenium! It essentially allows you to control a browser with code.
First, you will need to navigate to the Google page you wish to scrape:
google_search = 'https://www.google.com/search?q=courtyard+by+marriott+fayetteville+fort+bragg'
driver.get(google_search)
Then, you have to wait until the review page loads in the browser.
This is done using WebDriverWait: you have to specify an element that needs to appear on the page. The [data-attrid="kc:/local:one line summary"] span css selector allows me to select the review info about the hotel.
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '[data-attrid="kc:/local:one line summary"] span'))
review_element = WebDriverWait(driver, timeout).until(expectation)
And finally, print the rating
print(review_element.get_attribute('innerHTML'))
Here's the full code in case you want to play around with it
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
# navigate to google
google_search = 'https://www.google.com/search?q=courtyard+by+marriott+fayetteville+fort+bragg'
driver.get(google_search)
# wait until the page loads
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '[data-attrid="kc:/local:one line summary"] span'))
review_element = WebDriverWait(driver, timeout).until(expectation)
# print the rating
print(review_element.get_attribute('innerHTML'))
Note: Google is notoriously defensive against anyone trying to scrape it. On the first few attempts you might be successful, but eventually you will have to deal with Google's captcha.
To work around that, I would suggest using a search engine scraper; something like the quickstart guide should get you started!
Disclaimer: I work at Oxylabs.io

How do I use selenium ChromeDriver to scroll the sidebar on Google maps to load more results?

I've run into a problem trying to use Selenium ChromeDriver to scroll down the sidebar of a Google Maps results page. I am trying to get to the 6th result down, but that result does not fully load until you scroll down. Using the find_element_by_xpath method, I am able to access results 1-5 and click into them individually, but when I try to use the actions.move_to_element(link).perform() method to scroll to the 6th element, it does not work and throws an error message.
The error that I get is:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element:
However, I know this element exists because when I manually scroll and more results are loaded, the XPath works correctly. What am I doing wrong? I've spent many hours trying to solve this and haven't been able to with the material available out there. I appreciate any help or insights you can offer, thank you!
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as soup
import time

PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://www.google.com/maps")
time.sleep(7)
page = soup(driver.page_source, 'html.parser')

# find the search bar, enter the search, and hit return
search = driver.find_element_by_id('searchboxinput')
search.send_keys("dentists in Austin Texas")
search.send_keys(Keys.RETURN)
driver.maximize_window()
time.sleep(7)

# I want to get the 6th result down, but it requires a sidebar scroll to load
link = driver.find_element_by_xpath("//*[@id='pane']/div/div[1]/div/div/div[4]/div[1]/div[13]/div/a")
actions = ActionChains(driver)
actions.move_to_element(link).perform()
link.click()
time.sleep(5)
driver.back()
I found a solution that works: target the element by XPath from JavaScript executed through Selenium's execute_script. You then run two commands in one statement (locating the node, then scrolling it):
driver.execute_script("var el = document.evaluate('/html/body/jsl/div[3]/div[10]/div[8]/div/div[1]/div/div/div[4]/div[1]', document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue; el.scroll(0, 5000);")
This is the only solution that worked for me.
The search results in the Google Maps sidebar are located with the //div[contains(@aria-label,'dentists in Austin Texas')]//div[contains(@jsaction,'mouseover')] XPath.
So, to select the 6th element there you can do the following:
from selenium.webdriver.common.action_chains import ActionChains
results = driver.find_elements_by_xpath('//div[contains(@aria-label,"dentists in Austin Texas")]//div[contains(@jsaction,"mouseover")]')
ActionChains(driver).move_to_element(results[5]).click().perform()  # index 5 is the 6th result
I was just implementing scrolling on the Google Maps sidebar; it is working on my side. Please check this code:
# selecting scroll body
driver.find_element_by_xpath('/html/body/div[3]/div[9]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div[1]').click()
#start scrolling your sidebar
html = driver.find_element_by_xpath('/html/body/div[3]/div[9]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div[1]')
html.send_keys(Keys.END)
Also add the Keys import:
from selenium.webdriver.common.keys import Keys
I hope it helps you.
By the way, I have implemented scraping of Google Maps with the data available there and used the above code to scroll. Check it and let me know if you have any problems.

Using Python to Scrape a JS Form

I'm currently working on a research project in which we are trying to collect saved image files from Brazil's Hemeroteca database. I've done web scraping on PHP pages before using C/C++ with HTML forms, but since this is a shared script, I need to switch to Python so that everyone in the group can use this tool.
The page which I'm trying to scrape is: http://bndigital.bn.gov.br/hemeroteca-digital/
There are three form fields which populate; the first is the newspaper/journal. Upon selecting this, the available time periods populate, and the final field is the search term. I've inspected the HTML page, and the three IDs of these are respectively 'PeriodicoCmb1_Input', 'PeriodoCmb1_Input', and 'PesquisaTxt1'.
Some Google searches on this topic led me to the Selenium package, and I've put together this sample code to attempt to read the page:
import webbrowser
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
print("Begin...")
browser = webdriver.Chrome()
url = "http://bndigital.bn.gov.br/hemeroteca-digital/"
browser.get(url)
print("Waiting to load page... (Delay 3 seconds)")
time.sleep(3)
print("Searching for elements")
journal = browser.find_element_by_id("PeriodicoCmb1_Input")
timeRange = browser.find_element_by_id("PeriodoCmb1_Input")
searchTerm = browser.find_element_by_id("PesquisaTxt1")
print(journal)
print("Set fields, delay 3 seconds between input")
search_journal = "Relatorios dos Presidentes dos Estados Brasileiros (BA)"
search_timeRange = "1890 - 1899"
search_text = "Milho"
journal.send_keys(search_journal)
time.sleep(3)
timeRange.send_keys(search_timeRange)
time.sleep(3)
searchTerm.send_keys(search_text)
print("Perform search")
submitButton = browser.find_element_by_id("PesquisarBtn1_input")
submitButton.click()
The script runs to the print(journal) statement, where an error is thrown saying the element cannot be found.
Can anyone take a quick look at the page in question and check whether I've got the general premise of this script right, or point me towards some examples to get me going on this problem?
Thanks!
The DOM elements you are trying to find are located inside an iframe, so before using the find_element_by_id API you should switch to the iframe context.
Here is the code for switching to the iframe context:
# add your code
frame_ref = browser.find_elements_by_tag_name("iframe")[0]
iframe = browser.switch_to.frame(frame_ref)
journal = browser.find_element_by_id("PeriodicoCmb1_Input")
timeRange = browser.find_element_by_id("PeriodoCmb1_Input")
searchTerm = browser.find_element_by_id("PesquisaTxt1")
# add your code
Here is a link describing switching to iframe context.
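Applied to the script from the question, a minimal sketch might look like the following (it assumes the search form lives in the first iframe on the page and reuses the field IDs and search values from the question):

# minimal sketch, assuming the search form is in the first iframe on the page;
# element IDs and search values are taken from the question
import time
from selenium import webdriver

browser = webdriver.Chrome()
browser.get("http://bndigital.bn.gov.br/hemeroteca-digital/")
time.sleep(3)

# switch into the iframe that holds the search form
frame_ref = browser.find_elements_by_tag_name("iframe")[0]
browser.switch_to.frame(frame_ref)

journal = browser.find_element_by_id("PeriodicoCmb1_Input")
timeRange = browser.find_element_by_id("PeriodoCmb1_Input")
searchTerm = browser.find_element_by_id("PesquisaTxt1")

journal.send_keys("Relatorios dos Presidentes dos Estados Brasileiros (BA)")
time.sleep(3)
timeRange.send_keys("1890 - 1899")
time.sleep(3)
searchTerm.send_keys("Milho")

submitButton = browser.find_element_by_id("PesquisarBtn1_input")
submitButton.click()

# to work with elements outside the iframe again, switch back
browser.switch_to.default_content()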
