I was able to scrape the first page of eBay sold items, so I attempted pagination. Here's what I have:
ebay_url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=oakley+sunglasses&_sacat=0&Brand=Oakley&rt=nc&LH_Sold=1&LH_Complete=1&_ipg=200&_oaa=1&_fsrp=1&_dcat=79720'
# Load in html
html = requests.get(ebay_url)
# print(html.text)
driver = wd.Chrome(executable_path=r'/Users/mburley/Downloads/chromedriver')
driver.maximize_window() #Maximizes window
driver.implicitly_wait(30) # Gives an implicit wait for 30 seconds
driver.get(ebay_url)
wait = WebDriverWait(driver, 20) # Makes driver wait 20 seconds
sold_date = []
title = []
price = []
i = 1
## Loop here to get multiple pages
next_page = True
while next_page:
    try:
        # for item in all_items
        for item in driver.find_elements(By.XPATH, "//div[contains(@class,'title--tagblock')]/span[@class='POSITIVE']"):
            try:
                # Get Sale Date of item and update 'data'
                sold_date.append(item.text)
            except NoSuchElementException:
                # Element not found
                sold_date.append(None)
            try:
                # Get title of each item and update 'data'
                title.append(driver.find_element_by_xpath(f"(//div[contains(@class,'title--tagblock')]/span[@class='POSITIVE']/ancestor::div[contains(@class,'tag')]/following-sibling::a/h3)[{i}]").text)
            except NoSuchElementException:
                # Element not found
                title.append(None)
            try:
                # Get price of each item and update 'data'
                price.append(item.find_element_by_xpath(f"(//div[contains(@class,'title--tagblock')]/span[@class='POSITIVE']/ancestor::div[contains(@class,'tag')]/following-sibling::div[contains(@class,'details')]/descendant::span[@class='POSITIVE'])[{i}]").text)
            except NoSuchElementException:
                # Element not found
                price.append(None)
            i = i + 1
        # Print results of scraped data on page
        print(sold_date)
        print(title)
        print(price)
        data = {
            'Sold_date': sold_date,
            'title': title,
            'price': price
        }
        # Load Next Page by clicking button
        button = driver.find_element_by_name('pagination__next icon-link')
        button.click()
        print("Clicked on Next Page!")
        time.sleep(1)
    except:
        print("Done!")
        next_page = False

df = pd.DataFrame.from_dict(data)
df.to_csv('out_two.csv', index=0)
After I had the code to scrape page 1, I added:
... code ...

## Loop here to get multiple pages
next_page = True
while next_page:
    try:
        ... code to scrape page 1 ...
        # Load Next Page by clicking button
        button = driver.find_element_by_name('pagination__next icon-link')
        button.click()
        print("Clicked on Next Page!")
        time.sleep(1)
    except:
        print("Done!")
        next_page = False
Unfortunately, this only scrapes the first item, then searches for the next page, can't find the "button", so it exits and prints "Done!". I don't know a lot about scraping, so I tried to follow an online example. Can anyone help? Thanks!
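For what it's worth, here is a minimal sketch of just the pagination part, under the assumption that pagination__next and icon-link are CSS classes on the Next link (find_element_by_name matches the name attribute, not classes), so a CSS selector is used instead; the selector is taken from the question and not verified against eBay's current markup:

# Minimal pagination sketch (assumes the Next control is an <a> element
# carrying the CSS classes "pagination__next" and "icon-link").
from selenium.common.exceptions import NoSuchElementException
import time

next_page = True
while next_page:
    # ... scrape the current page here ...
    try:
        button = driver.find_element_by_css_selector('a.pagination__next.icon-link')
        button.click()
        print("Clicked on Next Page!")
        time.sleep(1)
    except NoSuchElementException:
        # No Next link found, so this was presumably the last page.
        print("Done!")
        next_page = False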
Related
I am trying to extract the reviews with their respective titles of a particular hotel in Trip Advisor, using Web Scraping techniques with Python and Selenium but I have only been able to extract the reviews of a single page and I need to extract all or most of the reviews, but the pagination is not working. What I do is click on the Next button, iterate over a range of pages and extract the information.
The web scraping is from the home page https://www.tripadvisor.com/Hotel_Review-g562644-d1490165-Reviews-Parador_de_Alcala_de_Henares-Alcala_De_Henares.html
Notice that when the page changes, an offset like -or20- is added to the URL (-or20- for the third page, -or30- for the fourth, -or40- for the fifth, and so on); each page always shows 10 reviews.
For example, this is the third page: https://www.tripadvisor.com/Hotel_Review-g562644-d1490165-Reviews-or20-Parador_de_Alcala_de_Henares-Alcala_De_Henares.html
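Since the offset just grows by 10 per page, the review-page URLs can also be built directly instead of clicking Next; a minimal sketch of that idea, reusing the hotel URL from above:

# Build TripAdvisor review-page URLs from the -orNN- offset described above:
# page 1 has no offset, page 2 uses -or10-, page 3 uses -or20-, and so on.
base = "https://www.tripadvisor.com/Hotel_Review-g562644-d1490165-Reviews"
suffix = "Parador_de_Alcala_de_Henares-Alcala_De_Henares.html"

def review_page_url(page):
    if page == 1:
        return f"{base}-{suffix}"
    return f"{base}-or{(page - 1) * 10}-{suffix}"

# The third page, matching the example URL quoted above:
print(review_page_url(3))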
Basically this is what I do: read a csv, open the page (changing to all languages is optional), expand the reviews, read the reviews, click on the Next button, write to the csv, and iterate over a range of pages.
Any help is appreciated, thanks in advance!
Images:
Trip Advisor HTML reviews
Trip Advisor HTML Next button
This is my code so far:
# Web scraping
# Writing to csv
with open('reviews_hotel_9.csv', 'w', encoding="utf-8") as file:
    file.write('titles, reviews \n')

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://www.tripadvisor.com/Hotel_Review-g562644-d1490165-Reviews-Parador_de_Alcala_de_Henares-Alcala_De_Henares.html")
sleep(3)

cookie = driver.find_element_by_xpath('//*[@id="onetrust-accept-btn-handler"]')  # cookies accept
try:
    cookie.click()
except:
    pass
print('ok')

for k in range(10):  # range pagination
    #container = driver.find_elements_by_xpath("//div[@data-reviewid]")
    try:
        # radio button all languages (optional)
        #driver.execute_script("arguments[0].click();", WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="component_14"]/div/div[3]/div[1]/div[1]/div[4]/ul/li[1]/label/span[1]'))))
        # read more expand reviews
        driver.execute_script("arguments[0].click();", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//span[@class="Ignyf _S Z"]'))))
        # review titles
        titles = driver.find_elements_by_xpath('//div[@class="KgQgP MC _S b S6 H5 _a"]/a/span')
        sleep(1)
        # reviews
        reviews = driver.find_elements_by_xpath('//q[@class="QewHA H4 _a"]/span')
        sleep(1)
    except TimeoutException:
        pass

    with open('reviews_hotel_9.csv', 'a', encoding="utf-8") as file:
        for i in range(len(titles)):
            file.write(titles[i].text + ";" + reviews[i].text + "\n")

    try:
        #driver.execute_script("arguments[0].click();", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//a[@class="ui_button nav next primary "]'))))
        # click on Next button
        driver.execute_script("arguments[0].click();", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CLASS_NAME, 'ui_button nav next primary '))))
    except TimeoutException:
        pass

file.close()
driver.quit()
You can scrape the data from each review page by clicking on the Next button. In the code below we run an infinite loop and keep clicking the Next button; we stop when the click raises ElementClickInterceptedException. At the end we save the data to the file MyData.csv.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.common.exceptions import ElementClickInterceptedException
import pandas as pd
import time
chrome_path = r"C:\Users\hpoddar\Desktop\Tools\chromedriver_win32\chromedriver.exe"
s = Service(chrome_path)
url = 'https://www.tripadvisor.com/Hotel_Review-g562644-d1490165-Reviews-Parador_de_Alcala_de_Henares-Alcala_De_Henares.html'
driver = webdriver.Chrome(service=s)
driver.get(url)
df = pd.DataFrame(columns = ['title', 'review'])
while True:
    time.sleep(2)
    reviews = driver.find_elements(by=By.CSS_SELECTOR, value='.WAllg._T')
    for review in reviews:
        title = review.find_element(by=By.CSS_SELECTOR, value='.KgQgP.MC._S.b.S6.H5._a').text
        review = review.find_element(by=By.CSS_SELECTOR, value='.fIrGe._T').text
        df.loc[len(df)] = [title, review]
    try:
        driver.find_element(by=By.CSS_SELECTOR, value='.ui_button.nav.next.primary').click()
    except ElementClickInterceptedException:
        break

df.to_csv("MyData.csv")
This gives us the scraped titles and reviews saved in MyData.csv.
https://www.bestbuy.com/site/promo/health-fitness-deals
I want to loop through these 10 pages and scrape their names and hrefs.
Below is my code, which only scrapes the 1st page 10 times:
def name():
    for i in range(1, 11):
        tag = driver.find_elements_by_xpath('/html/body/div[4]/main/div[9]/div/div/div/div/div/div/div[2]/div[2]/div[3]/div/div[5]/ol/li[3]/div/div/div/div/div/div[2]/div[1]/div[2]/div/h4')
        for a in tag:
            for name in a.find_elements_by_tag_name('a'):
                links = name.get_attribute("href")
                names = name.get_attribute('text')
                watches_name.append(names)
                watches_link.append(links)
        # print(watches_name)
        # print(watches_link)

name()
If you want to get elements from the next pages then you have to click() on the > link:
driver.find_element_by_css_selector('.sku-list-page-next').click()
Minimal working code with other changes.
I reduced the XPath to something much simpler, and I keep name and link as a pair because it is easier to write them to a CSV file or a database, or to filter and sort them (a short CSV-writing sketch follows the code below).
I had to use a longer sleep; sometimes my browser needs more time to update the elements on the page.
from selenium import webdriver
import time

url = 'https://www.bestbuy.com/site/promo/health-fitness-deals'

driver = webdriver.Firefox()
driver.get(url)
time.sleep(2)

# page "Hello! Choose a Country" - selecting United States flag
driver.find_element_by_class_name('us-link').click()

items = []

for page in range(1, 11):
    print('\n[DEBUG] wait 15 seconds to update page\n')
    time.sleep(15)

    print('\n--- page', page, '---\n')

    all_links = driver.find_elements_by_css_selector('#main-results h4 a')
    for a in all_links:
        link = a.get_attribute("href")
        name = a.get_attribute('text')
        items.append( [name, link] )
        print(name)

    print('\n[DEBUG] click next\n')
    driver.find_element_by_css_selector('.sku-list-page-next').click()

#print(items)
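Because items holds (name, link) pairs, writing them out afterwards takes only a few lines with the standard csv module; a minimal sketch (the file name results.csv is just an example):

import csv

# Write the collected (name, link) pairs to a CSV file with a header row.
with open('results.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'link'])
    writer.writerows(items)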
BTW:
This could also be done with while True and some way to recognize whether the > link is still there, exiting the loop when it is not. That way it would work with any number of pages.
Other method.
When you manually visit a few pages you will see that the second page has a URL ending with ?cp=2, the third with ?cp=3, etc., so you can use that to load the pages directly:
driver.get(url + '?cp=' + str(page+1) )
Minimal working code.
from selenium import webdriver
import time

url = 'https://www.bestbuy.com/site/promo/health-fitness-deals'

driver = webdriver.Firefox()
driver.get(url)
time.sleep(2)

# page "Hello! Choose a Country" - selecting United States flag
driver.find_element_by_class_name('us-link').click()

items = []

for page in range(1, 11):
    print('\n[DEBUG] wait 15 seconds to update page\n')
    time.sleep(15)

    print('\n--- page', page, '---\n')

    all_links = driver.find_elements_by_css_selector('#main-results h4 a')
    for a in all_links:
        link = a.get_attribute("href")
        name = a.get_attribute('text')
        items.append( [name, link] )
        print(name)

    print('\n[DEBUG] load next url\n')
    driver.get(url + '?cp=' + str(page+1) )

#print(items)
This method could also use while True and a page variable to get any number of pages.
EDIT:
Versions with while True
from selenium import webdriver
import time

url = 'https://www.bestbuy.com/site/promo/health-fitness-deals'

driver = webdriver.Firefox()
driver.get(url)
time.sleep(2)

# page "Hello! Choose a Country" - selecting United States flag
driver.find_element_by_class_name('us-link').click()

items = []

page = 1
while True:
    print('\n[DEBUG] wait 15 seconds to update page\n')
    time.sleep(15)

    print('\n--- page', page, '---\n')

    all_links = driver.find_elements_by_css_selector('#main-results h4 a')
    for a in all_links:
        link = a.get_attribute("href")
        name = a.get_attribute('text')
        items.append( [name, link] )
        print(name)

    page += 1
    print('\n[DEBUG] load next url\n')
    driver.get(url + '?cp=' + str(page) )

    if driver.title == 'Best Buy: Page Not Found':
        print('\n[DEBUG] exit loop\n')
        break

#print(items)
and
from selenium import webdriver
import time

url = 'https://www.bestbuy.com/site/promo/health-fitness-deals'

driver = webdriver.Firefox()
driver.get(url)
time.sleep(2)

# page "Hello! Choose a Country" - selecting United States flag
driver.find_element_by_class_name('us-link').click()

items = []

page = 1
while True:
    print('\n[DEBUG] wait 15 seconds to update page\n')
    time.sleep(15)

    print('\n--- page', page, '---\n')

    all_links = driver.find_elements_by_css_selector('#main-results h4 a')
    for a in all_links:
        link = a.get_attribute("href")
        name = a.get_attribute('text')
        items.append( [name, link] )
        print(name)

    page += 1
    print('\n[DEBUG] click next\n')
    item = driver.find_element_by_css_selector('.sku-list-page-next')
    if item.get_attribute("href"):
        item.click()
    else:
        print('\n[DEBUG] exit loop\n')
        break

#print(items)
I guess if your code is working right, you will just need to click the pagination button. I found it can be located with the CSS selector '#Caret_Right_Line_Sm'. Try adding this line to your function:
def name():
    for i in range(1, 11):
        tag = driver.find_elements_by_xpath('/html/body/div[4]/main/div[9]/div/div/div/div/div/div/div[2]/div[2]/div[3]/div/div[5]/ol/li[3]/div/div/div/div/div/div[2]/div[1]/div[2]/div/h4')
        for a in tag:
            for name in a.find_elements_by_tag_name('a'):
                links = name.get_attribute("href")
                names = name.get_attribute('text')
                watches_name.append(names)
                watches_link.append(links)
        # print(watches_name)
        # print(watches_link)
        driver.find_elements_by_css_selector('#Caret_Right_Line_Sm')[1].click()

name()
I have a problem with scraping all pictures from an Instagram profile. I scroll the page to the bottom and then find all "a" tags, but I always get only the last 30 links to pictures. I think the driver doesn't see the full content of the page.
#scroll
scrolldown = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var scrolldown=document.body.scrollHeight;return scrolldown;")

match = False
while(match == False):
    last_count = scrolldown
    time.sleep(2)
    scrolldown = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var scrolldown=document.body.scrollHeight;return scrolldown;")
    if last_count == scrolldown:
        match = True

#posts
posts = []
time.sleep(2)

links = driver.find_elements_by_tag_name('a')
time.sleep(2)
for link in links:
    post = link.get_attribute('href')
    if '/p/' in post:
        posts.append(post)
It looks like you first scroll to the bottom of the page and only then collect the links, instead of collecting the links inside the scrolling loop.
So, if you want to get all the links, you should perform the
links = driver.find_elements_by_tag_name('a')
time.sleep(2)
for link in links:
    post = link.get_attribute('href')
    if '/p/' in post:
        posts.append(post)
inside the scrolling loop, and also once before the first scroll.
Something like this:
def get_links():
    time.sleep(2)
    links = driver.find_elements_by_tag_name('a')
    time.sleep(2)
    for link in links:
        post = link.get_attribute('href')
        if '/p/' in post:
            posts.add(post)

posts = set()

get_links()

scrolldown = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var scrolldown=document.body.scrollHeight;return scrolldown;")

match = False
while(match == False):
    get_links()
    last_count = scrolldown
    time.sleep(2)
    scrolldown = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var scrolldown=document.body.scrollHeight;return scrolldown;")
    if last_count == scrolldown:
        match = True
I have been struggling with the Next page button; the scraper manages to click the next page and go to it, however it then keeps going back to the first page and eventually breaks. I only want to scrape all the next pages (in this case there is only one, but there might be more in the future).
Any ideas on what might be wrong here? Here is the code:
class DatatracSpider(scrapy.Spider):
    name = 'data_trac'

    start_urls = [
        # FOR SALE
        'https://www.milieuproperties.com/search-results.aspx?paramb=ADVANCE%20SEARCH:%20Province%20(Western%20Cape),%20%20Area%20(Cape%20Town)']

    def __init__(self):
        # path to driver
        self.driver = webdriver.Chrome('my_path')

    def parse(self, response):
        self.driver.get(response.url)
        url = self.driver.current_url

        while True:
            try:
                elem = WebDriverWait(self.driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ContentPlaceHolder1_lvDataPager1"]/a[text()="Next"]')))
                elem.click()
            except TimeoutException:
                break
            WebDriverWait(self.driver, 10).until(lambda driver: self.driver.current_url != url)
            url = self.driver.current_url
            yield scrapy.Request(url=url, callback=self.parse_page, dont_filter=False)

    def parse_page(self, response):
        offering = response.css('span#ContentPlaceHolder1_lblbreadcum::text').get()
        try:
            offering = 'rent' if 'Rental' in offering else 'buy'
        except TypeError:
            offering = 'buy'

        base_link = response.request.url.split('/')
        try:
            base_link = base_link[0] + '//' + base_link[2] + '/'
        except:
            pass

        for p in response.xpath('//div[@class="ct-itemProducts ct-u-marginBottom30 ct-hover"]'):
            link = base_link + p.css('a::attr(href)').get()
            yield scrapy.Request(
                link,
                callback=self.parse_property,
                meta={'item': {
                    'url': link,
                    'offering': offering,
                }},
            )

        # follow to next page

    def parse_property(self, response):
        item = response.meta.get('item')
        . . .
For this particular website your code won't work, because the URL doesn't change after you click the Next button. Try waiting until the current page number changes (instead of waiting for the URL to change):
def parse(self, response):
    self.driver.get(response.url)
    url = self.driver.current_url  # kept from the original code so the Request below has a URL
    current_page_number = self.driver.find_element_by_css_selector('#ContentPlaceHolder1_lvDataPager1>span').text

    while True:
        try:
            elem = WebDriverWait(self.driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ContentPlaceHolder1_lvDataPager1"]/a[text()="Next"]')))
            elem.click()
        except TimeoutException:
            break
        WebDriverWait(self.driver, 10).until(lambda driver: self.driver.find_element_by_css_selector('#ContentPlaceHolder1_lvDataPager1>span').text != current_page_number)
        current_page_number = self.driver.find_element_by_css_selector('#ContentPlaceHolder1_lvDataPager1>span').text
        yield scrapy.Request(url=url, callback=self.parse_page, dont_filter=False)
To repeatedly click the Next button until the last page, do the following:
driver.get('https://www.milieuproperties.com/search-results.aspx?paramb=ADVANCE%20SEARCH:%20Province%20(Western%20Cape),%20%20Area%20(Cape%20Town)')
wait = WebDriverWait(driver, 10)

while True:
    try:
        wait.until(EC.element_to_be_clickable((By.XPATH, "//a[contains(.,'Next') and not(contains(@class,'aspNetDisabled'))]"))).click()
    except:
        break
Please check out this logic; I was able to move to the next pages successfully:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome('')  # Please provide your driver path here
driver.get('https://www.milieuproperties.com/search-results.aspx?paramb=ADVANCE%20SEARCH:%20Province%20(Western%20Cape),%20%20Area%20(Cape%20Town)')
driver.find_element(By.XPATH, '//*[@id="ContentPlaceHolder1_lvDataPager1"]').click()
I'm scraping an e-commerce website, Lazada, using Selenium and bs4. I managed to scrape the 1st page, but I am unable to iterate to the next page. What I'm trying to achieve is to scrape all the pages based on the categories I've selected.
Here is what I've tried:
# Run the argument with incognito
option = webdriver.ChromeOptions()
option.add_argument('--incognito')
driver = webdriver.Chrome(executable_path='chromedriver', chrome_options=option)
driver.get('https://www.lazada.com.my/')
driver.maximize_window()

# Select category item #
element = driver.find_elements_by_class_name('card-categories-li-content')[0]
webdriver.ActionChains(driver).move_to_element(element).click(element).perform()

t = 10
try:
    WebDriverWait(driver, t).until(EC.visibility_of_element_located((By.ID, "a2o4k.searchlistcategory.0.i0.460b6883jV3Y0q")))
except TimeoutException:
    print('Page Refresh!')
    driver.refresh()
    element = driver.find_elements_by_class_name('card-categories-li-content')[0]
    webdriver.ActionChains(driver).move_to_element(element).click(element).perform()
    print('Page Load!')

# Soup and select element
def getData(np):
    soup = bs(driver.page_source, "lxml")
    product_containers = soup.findAll("div", class_='c2prKC')
    for p in product_containers:
        title = (p.find(class_='c16H9d').text)  # title
        selling_price = (p.find(class_='c13VH6').text)  # selling price
        try:
            original_price = (p.find("del", class_='c13VH6').text)  # original price
        except:
            original_price = "-1"
        if p.find("i", class_='ic-dynamic-badge ic-dynamic-badge-freeShipping ic-dynamic-group-2'):
            freeShipping = 1
        else:
            freeShipping = 0
        try:
            discount = (p.find("span", class_='c1hkC1').text)
        except:
            discount = "-1"
        if p.find(("div", {'class':['c16H9d']})):
            url = "https:" + (p.find("a").get("href"))
        else:
            url = "-1"
        nextpage_elements = driver.find_elements_by_class_name('ant-pagination-next')[0]
        np = webdriver.ActionChains(driver).move_to_element(nextpage_elements).click(nextpage_elements).perform()
        print("- -"*30)
        toSave = [title, selling_price, original_price, freeShipping, discount, url]
        print(toSave)
        writerows(toSave, filename)

getData(np)
The problem might be that the driver is trying to click the button before the element is even loaded correctly.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(PATH, chrome_options=option)

# use this code after driver initialization
# this makes the driver wait 5 seconds for the page to load.
driver.implicitly_wait(5)

url = "https://www.lazada.com.ph/catalog/?q=phone&_keyori=ss&from=input&spm=a2o4l.home.search.go.239e359dTYxZXo"
driver.get(url)

next_page_path = "//ul[@class='ant-pagination ']//li[@class=' ant-pagination-next']"

# the following code will wait 5 seconds for
# the element to become clickable
# and then try clicking the element.
try:
    next_page = WebDriverWait(driver, 5).until(
        EC.element_to_be_clickable((By.XPATH, next_page_path)))
    next_page.click()
except Exception as e:
    print(e)
EDIT 1
Changed the code to make the driver wait for the element to become clickable. You can put this code inside a while loop to iterate multiple times, and break the loop when the button is not found or not clickable.
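For example, a minimal sketch of that loop, reusing next_page_path from the snippet above and stopping when the wait times out (i.e. the Next button is no longer clickable); what you scrape inside the loop is up to you:

from selenium.common.exceptions import TimeoutException

# Keep clicking Next until the button can no longer be found or clicked.
while True:
    # ... scrape the currently loaded page here ...
    try:
        next_page = WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.XPATH, next_page_path)))
        next_page.click()
    except TimeoutException:
        # No clickable Next button left, so assume this was the last page.
        break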