How to do web scraping across multiple pages with Selenium? - python

I would like to scrape all pages of the website below using Selenium, but so far I have only managed to do it on the first page. I also put the data into a Pandas DataFrame. How can I do this for every page of the website? For now, I have:
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome(executable_path=r"C:/Users/Usuario/.spyder-py3/chromedriver.exe")
driver.get("https://www.mercadolivre.com.br/ofertas")
driver.implicitly_wait(3)
tituloProduto = driver.find_elements_by_class_name('promotion-item__title')
precoProduto = driver.find_elements_by_class_name('promotion-item__price')
df = pd.DataFrame()
produtos = []
for x in tituloProduto:
    produtos.append(x.text)
preco = []
for x in precoProduto:
    preco.append(x.text)
df['produto'] = produtos
df['preco'] = preco
df.head()
produto preco
Furadeira Parafusadeira Com Impacto 20v 2 Bate... R$ 34232
Sony Playstation 4 Slim 1tb Mega Pack: Ghost O... R$ 2.549
Tablet Galaxy A7 Lite T225 4g Ram 64gb Grafite... R$ 1.199
Smart Tv Philco Ptv55q20snbl Dled 4k 55 110v/220v R$ 2.799
Nintendo Switch 32gb Standard Cor Vermelho-néo... R$ 2.349

I found that the website you want to scrape has 209 pages in total, and each one can be accessed by page number, e.g. https://www.mercadolivre.com.br/ofertas?page=2, so it should not be too difficult.
One thing you can do is loop 209 times to get the data from each page. A better approach would be to identify the "next page" button and loop until it's unavailable, but simply using the given page count (209) is easier, so I will use that.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec

driver = webdriver.Chrome(executable_path=r".../chromedriver.exe")
...
# Initialize outside the loop
preco = []
produtos = []

for i in range(1, 210):  # pages 1 to 209
    # Parse each page with the code you already have.
    driver.get('https://www.mercadolivre.com.br/ofertas?page=' + str(i))
    # You may have to wait for each page to load
    wait = WebDriverWait(driver, 10)
    wait.until(ec.visibility_of_element_located((By.CSS_SELECTOR, "a.sc-2vbwj7-22.blyzsR")))
    # If you want to speed things up, you can process the pages in parallel,
    # but only do that if it's worth the extra development time.
    # Get the elements you want
    tituloProduto = driver.find_elements_by_class_name('promotion-item__title')
    precoProduto = driver.find_elements_by_class_name('promotion-item__price')
    for x in tituloProduto:
        produtos.append(x.text)
    for x in precoProduto:
        preco.append(x.text)
Store the lists in a DataFrame and do what you want with it.
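For example, a minimal sketch of that last step (the CSV filename here is just an example):

import pandas as pd

df = pd.DataFrame({'produto': produtos, 'preco': preco})
df.to_csv('ofertas.csv', index=False)  # example output file; or keep working with df directly
print(df.head())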

You can use this code.
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome(
executable_path=r"C:/Users/Usuario/.spyder-py3/chromedriver.exe")
url = "https://www.mercadolivre.com.br/ofertas?page="
df = pd.DataFrame()
produtos = []
preco = []
for i in range(1, 210):  # pages 1 to 209
    driver.get(url + str(i))
    driver.implicitly_wait(3)
    tituloProduto = driver.find_elements_by_class_name('promotion-item__title')
    precoProduto = driver.find_elements_by_class_name('promotion-item__price')
    for x in tituloProduto:
        produtos.append(x.text)
    for x in precoProduto:
        preco.append(x.text)

df['produto'] = produtos
df['preco'] = preco
print(df)
Hope this helps. Thanks.

What you could do, if you were using Scrapy, is find the pagination link and assign its href to a next_page variable like so:
next_page = response.xpath('XPATH HERE').css('a::attr(href)').extract_first()
and then call it like so:
yield scrapy.Request(next_page, callback=self.parse)
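For context, those two lines would live inside a Scrapy spider's parse method; here is a minimal sketch under that assumption (the item and pagination selectors are hypothetical, not taken from the site):

import scrapy

class OfertasSpider(scrapy.Spider):
    name = 'ofertas'
    start_urls = ['https://www.mercadolivre.com.br/ofertas']

    def parse(self, response):
        # Yield one item per product card (selector names are hypothetical).
        for item in response.css('li.promotion-item'):
            yield {
                'produto': item.css('.promotion-item__title::text').get(),
                'preco': item.css('.promotion-item__price::text').get(),
            }
        # Follow the pagination link until there is none left (hypothetical selector).
        next_page = response.css('a.pagination-next::attr(href)').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)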


Python - Need Help Web Scraping Dynamic Website

I'm pretty new to web scraping and would appreciate any advice for the scenarios below:
I'm trying to produce a home loans listing table using data from https://www.canstar.com.au/home-loans/
I'm mainly trying to get listing values like the ones below:
Homestar Finance | Star Essentials P&I 80% | Variable
Unloan | Home Loan LVR <80% | Variable
TicToc Home Loans | Live-in Variable P&I | Variable
ubank | Neat Home Loan Owner Occupied P&I 70-80% | Variable
and push them into a nested table
results = [[Homestar Finance, Star Essentials P&I 80%, Variable], etc, etc]
For my first attempt, I used BeautifulSoup entirely and practiced on an offline version of the site.
import pandas as pd
from bs4 import BeautifulSoup
with open('/local/path/canstar.html', 'r') as canstar_offline:
    content = canstar_offline.read()

results = [['Affiliate', 'Product Name', 'Product Type']]

soup = BeautifulSoup(content, 'lxml')
for listing in soup.find_all('div', class_='table-cards-container'):
    for listing1 in listing.find_all('a'):
        if listing1.text.strip() != 'More details' and listing1.text.strip() != '':
            results.append(listing1.text.strip().split(' | '))

df = pd.DataFrame(results[1:], columns=results[0]).to_dict('list')
df2 = pd.DataFrame(df)
print(df2)
I got very close to what I wanted, but unfortunately it doesn't work on the actual site because it looks like I'm getting blocked for repeated requests.
So I tried this again with Selenium, but now I'm stuck.
I tried to reuse as much of the transferable filtering logic from the BeautifulSoup version as possible, but I can't get anywhere close to what I had using Selenium.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.canstar.com.au/home-loans'
results = []

driver = webdriver.Chrome()
driver.get(url)
# content = driver.page_source
# soup = BeautifulSoup(content)

time.sleep(3)

tables = driver.find_elements(By.CLASS_NAME, 'table-cards-container')
for table in tables:
    listing = table.find_element(By.TAG_NAME, 'a')
    print(listing.text)
This version (above) only returns one listing (I'm trying to get the entire table through iteration)
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.canstar.com.au/home-loans'
results = []

driver = webdriver.Chrome()
driver.get(url)
# content = driver.page_source
# soup = BeautifulSoup(content)

time.sleep(3)

tables = driver.find_elements(By.CLASS_NAME, 'table-cards-container')
for table in tables:
    # listing = table.find_element(By.TAG_NAME, 'a')
    print(table.text)
This version (above) looks like it gets all the text from the 'table-cards-container' class, but I'm unable to filter through it to just get the listings.
I think you can try something like this; I hope the comments in the code explain what it is doing.
# Needed libs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Initiate the driver and navigate
driver = webdriver.Chrome()
url = 'https://www.canstar.com.au/home-loans'
driver.get(url)
# We save the loans list
loans = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, "//cnslib-table-card")))

# We loop once per loan in the list (XPath positions start at 1)
for i in range(1, len(loans) + 1):
    # With this XPath I get the title of the loan
    loan_title = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//a)[1]"))).text
    print(loan_title)
    # With this XPath I get the first percentage we see for the loan
    loan_first_percentage = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[1]"))).text
    print(loan_first_percentage)
    # With this XPath I get the second percentage we see for the loan
    loan_second_percentage = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[3]"))).text
    print(loan_second_percentage)
    # With this XPath I get the amount we see for the loan
    loan_amount = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[5]"))).text
    print(loan_amount)
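If you want to keep the values instead of just printing them, a small follow-up sketch could collect each loan into a list of dicts and build a DataFrame (the column names here are made up for the example):

import pandas as pd

rows = []
for i in range(1, len(loans) + 1):
    rows.append({
        'title': WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//a)[1]"))).text,
        'rate_1': WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[1]"))).text,
        'rate_2': WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[3]"))).text,
        'amount': WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[5]"))).text,
    })

df = pd.DataFrame(rows)
print(df)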

Why does Selenium display the same page even after going to the next page?

I'm trying to scrape rental listing data on Zillow. Specifically, I want the link, price, and address of each property. However, after scraping the first page successfully and clicking the next arrow button, it just displays the same listings even though the page shows I'm on page 2, 3, etc. How do I get the next page(s) listings? The project is supposed to use BeautifulSoup and Selenium, but after some research it looks like using only selenium is the easiest way to do this since Zillow uses lazy-loading.
main.py code:
from time import sleep
from enter_data import DataEntry  # DataEntry is defined in enter_data.py below

DRIVER_PATH = r"D:\chromedriver.exe"
FORM_URL = "HIDDEN"
WEBPAGE = "https://www.zillow.com/toronto-on/rentals/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22mapBounds%22%3A%7B%22west%22%3A-79.40771727189582%2C%22east%22%3A-79.35750631913703%2C%22south%22%3A43.639155005365474%2C%22north%22%3A43.66405824004801%7D%2C%22mapZoom%22%3A15%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A792680%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%2C%22sort%22%3A%7B%22value%22%3A%22days%22%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22sf%22%3A%7B%22value%22%3Afalse%7D%2C%22tow%22%3A%7B%22value%22%3Afalse%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22fsba%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%7D"
data_entry = DataEntry(DRIVER_PATH)

# Opens the webpage and gets the count of total pages (via self.next_btns_len)
data_entry.open_webpage(WEBPAGE)

# n is the iterator for the number of pages on the site.
n = 1

# Scrapes link, price, address data, adds each to a specified class list, and then goes to the next page.
while n < (data_entry.next_btns_len + 1):
    # Scrapes one page of data and adds data to lists in the class object
    data_entry.scrape_data()
    # Goes to next page for scraping
    sleep(5)
    data_entry.next_page()
    n += 1
enter_data.py code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from time import sleep
class DataEntry:
    """Enters the data from soup into Google Form"""

    def __init__(self, driver_path):
        # Options keeps the browser open after execution.
        self.chrome_options = Options()
        self.chrome_options.add_experimental_option("detach", True)
        self.driver = webdriver.Chrome(executable_path=driver_path, chrome_options=self.chrome_options)
        self.links = []
        self.prices = []
        self.addresses = []
        self.next_btns_len = 0

    def open_webpage(self, webpage):
        # Opens desired webpage and gives two seconds to load
        self.driver.get(webpage)
        sleep(2)
        # Gets total page numbers for main.py while loop
        page_nums = self.driver.find_element(By.CSS_SELECTOR, '.Text-c11n-8-69-2__sc-aiai24-0.gCvDSp')
        self.next_btns_len = int(page_nums.text.split()[3])

    def scrape_data(self):
        # Scrolls to each listing to make it visible to Selenium.
        n = 1
        while n < 41:
            listing = self.driver.find_element(By.XPATH, f'/html/body/div[1]/div[5]/div/div/div/div[1]/ul/li[{n}]')
            self.driver.execute_script("arguments[0].scrollIntoView(true);", listing)
            print(n)
            n += 1

        # todo: Create a list of links for all the listings you scraped.
        links = self.driver.find_elements(By.CSS_SELECTOR, ".list-card-info .list-card-link")
        link_list = [link.get_attribute("href") for link in links]
        # The if statement checks whether the DOM class name has changed, which produces an empty list.
        # If the list is empty, it changes the css_selector. The website alternates between two.
        if len(link_list) == 0:
            links = self.driver.find_elements(By.CSS_SELECTOR, ".StyledPropertyCardDataArea-c11n-8-69-2__sc-yipmu-0.dZxoFm.property-card-link")
            link_list = [link.get_attribute("href") for link in links]
        self.links.extend(link_list)
        print(len(self.links))
        print(self.links)

        # todo: Create a list of prices for all the listings you scraped.
        prices = self.driver.find_elements(By.CSS_SELECTOR, ".list-card-price")
        price_list = [price.text for price in prices]
        if len(price_list) == 0:
            prices = self.driver.find_elements(By.CSS_SELECTOR, ".StyledPropertyCardDataArea-c11n-8-69-2__sc-yipmu-0.kJFQQX")
            price_list = [price.text for price in prices]
        split_price_list = [price.split() for price in price_list]
        final_price_list = [price[0].strip("C+/mo") for price in split_price_list]
        self.prices.extend(final_price_list)
        print(len(self.prices))
        print(self.prices)

        # todo: Create a list of addresses for all the listings you scraped.
        addresses = self.driver.find_elements(By.CSS_SELECTOR, ".list-card-addr")
        address_list = [address.text for address in addresses]
        if len(address_list) == 0:
            addresses = self.driver.find_elements(By.CSS_SELECTOR, ".StyledPropertyCardDataArea-c11n-8-69-2__sc-yipmu-0.dZxoFm.property-card-link address")
            address_list = [address.text for address in addresses]
        self.addresses.extend(address_list)
        print(len(self.addresses))
        print(self.addresses)

    def next_page(self):
        # Clicks the next arrow and waits for the page to load
        next_arrow = self.driver.find_element(By.XPATH, "//a[@title='Next page']")
        next_arrow.click()
        sleep(5)

    def close_webpage(self):
        self.driver.quit()

    def enter_data(self, form_url, address, rent, link):
        # Opens the Google Form and waits for it to load.
        self.driver.get(form_url)
        sleep(2)
        # Enters each address, rent, and link into the form. Clicks submit after.
        address_input = self.driver.find_element(By.XPATH, '//*[@id="mG61Hd"]/div[2]/div/div[2]/div[1]/div/div/div[2]/div/div[1]/div/div[1]/input')
        address_input.send_keys(address)
        rent_input = self.driver.find_element(By.XPATH, '//*[@id="mG61Hd"]/div[2]/div/div[2]/div[2]/div/div/div[2]/div/div[1]/div/div[1]/input')
        rent_input.send_keys(rent)
        link_input = self.driver.find_element(By.XPATH, '//*[@id="mG61Hd"]/div[2]/div/div[2]/div[3]/div/div/div[2]/div/div[1]/div/div[1]/input')
        link_input.send_keys(link)
        submit_btn = self.driver.find_element(By.XPATH, '//*[@id="mG61Hd"]/div[2]/div/div[3]/div[1]/div[1]/div/span/span')
        submit_btn.click()
There is a less complex way to obtain the data you're looking for, using cloudscraper and pandas (and tqdm for convenience). You might also be in for a surprise, considering the time taken to get the data:
import cloudscraper
import pandas as pd
from tqdm import tqdm
scraper = cloudscraper.create_scraper()
df_list = []

for current_page in tqdm(range(1, 21)):
    url = f'https://www.zillow.com/search/GetSearchPageState.htm?searchQueryState=%7B%22pagination%22%3A%7B%22currentPage%22%3A{current_page}%7D%2C%22mapBounds%22%3A%7B%22west%22%3A-79.44174913987678%2C%22east%22%3A-79.32347445115607%2C%22south%22%3A43.57772225826024%2C%22north%22%3A43.7254027835563%7D%2C%22mapZoom%22%3A13%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A792680%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22isForSaleForeclosure%22%3A%7B%22value%22%3Afalse%7D%2C%22isAllHomes%22%3A%7B%22value%22%3Atrue%7D%2C%22sortSelection%22%3A%7B%22value%22%3A%22days%22%7D%2C%22isAuction%22%3A%7B%22value%22%3Afalse%7D%2C%22isNewConstruction%22%3A%7B%22value%22%3Afalse%7D%2C%22isForRent%22%3A%7B%22value%22%3Atrue%7D%2C%22isSingleFamily%22%3A%7B%22value%22%3Afalse%7D%2C%22isTownhouse%22%3A%7B%22value%22%3Afalse%7D%2C%22isForSaleByOwner%22%3A%7B%22value%22%3Afalse%7D%2C%22isComingSoon%22%3A%7B%22value%22%3Afalse%7D%2C%22isForSaleByAgent%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%7D&wants=%7B%22cat1%22:[%22listResults%22,%22mapResults%22]%7D&requestId=6'
    r = scraper.get(url)
    for x in r.json()['cat1']['searchResults']['listResults']:
        status = x['statusText']
        address = x['address']
        try:
            price = x['units'][0]['price']
        except Exception as e:
            price = x['price']
        if not 'https://www.' in x['detailUrl']:
            url = 'https://zillow.com' + x['detailUrl']
        else:
            url = x['detailUrl']
        df_list.append((address, price, url))

df = pd.DataFrame(df_list, columns=['Address', 'Price', 'Url'])
df.to_csv('renting_in_toronto.csv')
print(df)
This will save the data in a csv file, and print out:
100% 20/20 [00:16<00:00, 1.19it/s]
Address Price Url
0 2221 Yonge St, Toronto, ON C$1,900+ https://zillow.com/b/Toronto-ON/43.70606,-79.3...
1 10 Yonge St, Toronto, ON C$2,100+ https://zillow.com/b/10-yonge-st-toronto-on-BM...
2 924 Avenue Rd, Toronto, ON M5P 2K6 C$1,895/mo https://www.zillow.com/homedetails/924-Avenue-...
3 797 Don Mills Rd, Toronto, ON C$1,850+ https://zillow.com/b/Toronto-ON/43.71951,-79.3...
4 15 Queens Quay E, Toronto, ON C$2,700+ https://zillow.com/b/Toronto-ON/43.64202,-79.3...
... ... ...
You can install the packages with pip install cloudscraper and pip install tqdm. The URL accessed is visible in Dev Tools, under the Network tab; it returns the JSON data that JavaScript loads into the page.

How to scrape a website table that uses an "option value" button?

In particular, I am trying to scrape this table (https://whalewisdom.com/filer/berkshire-hathaway-inc#tabholdings_tab_link), but I would like to scrape the first 50 rows via Python code.
For this reason I need to set the option value so the page shows the first 50 rows per page.
My current code is:
import pandas as pd
import selenium.webdriver
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# link_scr (defined elsewhere in my script) holds the link and index of each filer page
test = {}
dict_scr = {}

for ii in range(0, 12):
    options = webdriver.FirefoxOptions()
    options.binary_location = r'C:/Users/Mozilla Firefox/firefox.exe'
    driver = selenium.webdriver.Firefox(executable_path='C:/Users/geckodriver.exe', options=options)
    driver.execute("get", {'url': link_scr['Links'][ii]})
    Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[text()='50']"))))
    test[link_scr.index[ii]] = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, "table#current_holdings_table"))).get_attribute("outerHTML")
    dict_scr[link_scr.index[ii]] = pd.read_html(test[link_scr.index[ii]])
    print(test[link_scr.index[ii]])
How can I modify this code to scrape the first 50 rows into a DataFrame?
I wrote two samples; you can refer to GitHub.
Sample 1: change the page size to 50 and scrape the table
from time import sleep
from clicknium import clicknium as cc, locator
tab = cc.chrome.open("https://whalewisdom.com/filer/berkshire-hathaway-inc#tabholdings_tab_link")
tab.find_element(locator.chrome.whalewisdom.button_25).click()
tab.find_element(locator.chrome.whalewisdom.a_50).click()
sleep(3)  # wait for table to load

elems_sector = tab.find_elements(locator.chrome.whalewisdom.td_informationtechnology)
elemns_shares = tab.find_elements(locator.chrome.whalewisdom.td_890923410)
count = len(elems_sector)
for idx in range(count):
    sector = elems_sector[idx].get_text()
    shares = elemns_shares[idx].get_text()
    print({'sector': sector, 'shares': shares})
Sample 2: don't change the page size; scrape two pages of data
from time import sleep
from clicknium import clicknium as cc, locator
tab = cc.chrome.open("https://whalewisdom.com/filer/berkshire-hathaway-inc#tabholdings_tab_link")
i = 0
while True:
    elems_sector = tab.find_elements(locator.chrome.whalewisdom.td_informationtechnology)
    elemns_shares = tab.find_elements(locator.chrome.whalewisdom.td_890923410)
    count = len(elems_sector)
    for idx in range(count):
        sector = elems_sector[idx].get_text()
        shares = elemns_shares[idx].get_text()
        print({'sector': sector, 'shares': shares})
    i += 1
    if i > 1:
        break
    tab.find_element(locator.chrome.whalewisdom.a).click()
    sleep(2)  # wait for table to load
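For reference, here is a rough Selenium-only sketch of the same idea, closer to the question's original code; the locators for the page-size button and the "50" option are assumptions and may need adjusting to the actual page:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep

driver = webdriver.Firefox(executable_path='C:/Users/geckodriver.exe')
driver.get('https://whalewisdom.com/filer/berkshire-hathaway-inc#tabholdings_tab_link')

# Open the rows-per-page control, then pick the "50" option (both locators are guesses).
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[text()='25']"))).click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='50']"))).click()
sleep(3)  # wait for the table to re-render with 50 rows

# Grab the table's HTML and let pandas parse it.
table_html = WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "table#current_holdings_table"))
).get_attribute("outerHTML")
df = pd.read_html(table_html)[0].head(50)
print(df)
driver.quit()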

scraping sports table from oddsportal

I'm trying to scrape this webpage: https://www.oddsportal.com/moving-margins
The code sometimes works and sometimes doesn't, and even when it works it doesn't scrape all the data I need per match.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

u = 'https://www.oddsportal.com/moving-margins/'
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe")
driver.maximize_window()
driver.get(u)

# Use an explicit wait for fast execution
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, "#moving_margins_content_overall")))
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")

table_data = driver.find_elements_by_xpath("//div[@id='moving_margins_content_overall']//tr[@class='odd' or @class='dark']")
table = []

# Creating a list of lists, where each inner list holds all the data in a row with class 'dark' or 'odd'
for data in table_data:
    row = []
    dark_row = data.find_elements_by_xpath(".//th//a")
    for col in dark_row:
        row.append(col.text.replace("\n", " "))
    odd_row = data.find_elements_by_xpath(".//following-sibling::tr[@class='odd']//td")
    for col in odd_row:
        row.append(col.text.replace("\n", " "))
    table.append(row)
My goal is to store the data in a CSV file with these columns:
sport  | country | competition | handicap   | match_date | match                 | hdp_open | hdp_close | bookmaker
Tennis | Czech   | Ostrava..   | AH 0 Games | Today12:00 | Karatsev A. - Otte O. | 0.5      | -1.5      | Nordicbet
I think the problem in your code is that the page sometimes has a single "dark" row for many "odd" rows, so when you loop over the elements you create a single record for a table that actually has more records.
This code should fit your needs, but keep in mind that it's not optimal since it doesn't handle possible exceptions; still, it's a starting point:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
import selenium.webdriver.support.expected_conditions as EC
from selenium.webdriver.common.by import By
u = 'https://www.oddsportal.com/moving-margins/'
driver = webdriver.Chrome(executable_path=r"chromedriver.exe")
driver.maximize_window()
driver.get(u)
#Use Explicit time wait for fast execution
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, "#moving_margins_content_overall")))
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
tables = driver.find_elements_by_xpath("//div[@id='moving_margins_content_overall']//table")
tableData = []

for table in tables:
    trDark = table.find_element_by_xpath('.//tr[@class="dark"]')
    trOdds = table.find_elements_by_xpath('.//tr[@class="odd"]')
    row = [trDark.text.strip().replace("\n", " ")]
    for odd in trOdds:
        tds = [
            td.text.strip().replace("\n", " ")
            for td in odd.find_elements_by_xpath('.//td')
        ]
        row = row + tds
    tableData.append(row)

print(tableData)
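As a follow-up, here is a minimal sketch of dumping tableData to a CSV file; note that the first element of each row is the whole "dark" header text, so it would still need to be split into the sport/country/competition/handicap columns asked for in the question:

import csv

with open('moving_margins.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for row in tableData:
        writer.writerow(row)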

How to extract data from a dropdown menu using python beautifulsoup

I am trying to scrape data from a website that has a multilevel drop-down menu; every time an item is selected, it changes the sub-items of the dependent drop-downs.
The problem is that on every loop it extracts the same sub-items from the drop-downs. The selection happens, but the items are not updated to reflect the new selection made in the loop.
Can anyone help me figure out why I am not getting the desired results?
Perhaps this is because my drop-down list is built with JavaScript or something.
I have gotten this far:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
import csv
# from selenium.webdriver.support import Select
import time
print ("opening chorome....")
driver = webdriver.Chrome()
driver.get('https://www.wheelmax.com/')
time.sleep(10)
csvData = ['Year', 'Make', 'Model', 'Body', 'Submodel', 'Size']
# variables
yeart = []
make= []
model=[]
body = []
submodel = []
size = []
Yindex = Mkindex = Mdindex = Bdindex = Smindex = Sindex = 0
print ("waiting for program to set variables....")
time.sleep(20)
print ("initializing and setting variables....")
# initializing Year
Year = Select(driver.find_element_by_id("icm-years-select"))
Year.select_by_value('2020')
yr = driver.find_elements(By.XPATH, '//*[#id="icm-years-select"]')
time.sleep(15)
# initializing Make
Make = Select(driver.find_element_by_id("icm-makes-select"))
Make.select_by_index(1)
mk = driver.find_elements(By.XPATH, '//*[#id="icm-makes-select"]')
time.sleep(15)
# initializing Model
Model = Select(driver.find_element_by_id("icm-models-select"))
Model.select_by_index(1)
mdl = driver.find_elements(By.XPATH, '//*[#id="icm-models-select"]')
time.sleep(15)
# initializing body
Body = Select(driver.find_element_by_id("icm-drivebodies-select"))
Body.select_by_index(1)
bdy = driver.find_elements(By.XPATH, '//*[#id="icm-drivebodies-select"]')
time.sleep(15)
# initializing submodel
Submodel = Select(driver.find_element_by_id("icm-submodels-select"))
Submodel.select_by_index(1)
sbm = driver.find_elements(By.XPATH, '//*[#id="icm-submodels-select"]')
time.sleep(15)
# initializing size
Size = Select(driver.find_element_by_id("icm-sizes-select"))
Size.select_by_index(0)
siz = driver.find_elements(By.XPATH, '//*[#id="icm-sizes-select"]')
time.sleep(5)
Cyr = Cmk = Cmd = Cbd = Csmd = Csz = ""
print ("fetching data from variables....")
for y in yr:
    obj1 = driver.find_element_by_id("icm-years-select")
    Year = Select(obj1)
    Year.select_by_index(++Yindex)
    obj1.click()
    # obj1.click()
    yeart.append(y.text)
    Cyr = y.text
    time.sleep(10)
    for m in mk:
        obj2 = driver.find_element_by_id("icm-makes-select")
        Make = Select(obj2)
        Make.select_by_index(++Mkindex)
        obj2.click()
        # obj2.click()
        make.append(m.text)
        Cmk = m.text
        time.sleep(10)
        for md in mdl:
            Mdindex = 0
            obj3 = driver.find_element_by_id("icm-models-select")
            Model = Select(obj3)
            Model.select_by_index(++Mdindex)
            obj3.click()
            # obj3.click(clickobj)
            model.append(md.text)
            Cmd = md.text
            time.sleep(10)
            Bdindex = 0
            for bd in bdy:
                obj4 = driver.find_element_by_id("icm-drivebodies-select")
                Body = Select(obj4)
                Body.select_by_index(++Bdindex)
                obj4.click()
                # obj4.click(clickobj2)
                body.append(bd.text)
                Cbd = bd.text
                time.sleep(10)
                Smindex = 0
                for sm in sbm:
                    obj5 = driver.find_element_by_id("icm-submodels-select")
                    Submodel = Select(obj5)
                    obj5.click()
                    Submodel.select_by_index(++Smindex)
                    # obj5.click(clickobj5)
                    submodel.append(sm.text)
                    Csmd = sm.text
                    time.sleep(10)
                    Sindex = 0
                    for sz in siz:
                        Size = Select(driver.find_element_by_id("icm-sizes-select"))
                        Size.select_by_index(++Sindex)
                        size.append(sz.text)
                        Csz = sz.text
                        csvData += [Cyr, Cmk, Cmd, Cbd, Csmd, Csz]
https://www.wheelmax.com has multilevel drop-down menus that depend on each other: for example, if you select an option in the Select Year drop-down, the Select Make drop-down is then enabled and displays options based on the selected year.
So basically you need to use the Selenium package to handle the dynamic options.
Install selenium web driver as per your browser
Download chrome web driver :
http://chromedriver.chromium.org/downloads
Install web driver for chrome browser:
unzip ~/Downloads/chromedriver_linux64.zip -d ~/Downloads
chmod +x ~/Downloads/chromedriver
sudo mv -f ~/Downloads/chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
selenium tutorial
https://selenium-python.readthedocs.io/
E.g., using Selenium to select multiple drop-down options:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time
driver = webdriver.Chrome()
driver.get('https://www.wheelmax.com/')
time.sleep(4)
selectYear = Select(driver.find_element_by_id("icm-years-select"))
selectYear.select_by_value('2019')
time.sleep(2)
selectMakes = Select(driver.find_element_by_id("icm-makes-select"))
selectMakes.select_by_value('58')
Update: to read the drop-down option values or count the total options:
for option in selectYear.options:
    print(option.text)

print(len(selectYear.options))
See more: How to extract data from a dropdown menu using python beautifulsoup
The page does a callback to populate the years. Simply mimic that.
If you actually need to change years and select from the dependent drop-downs, which becomes a different question, you need browser automation, e.g. Selenium, or to perform the selection manually and inspect the network tab to see if there is an XHR request you can mimic to submit your choices.
import requests

r = requests.get('https://www.iconfigurators.com/json2/?returnType=json&bypass=true&id=13898&callback=yearObj').json()
years = [item['year'] for item in r['years']]
print(years)
I guess the reason you can't parse the years with Beautiful Soup is that the 'select' tag containing the 'option' tags with all the years is not present yet / is hidden at the moment Beautiful Soup downloads the page. It is added to the DOM by additional JavaScript, I assume. If you look at the DOM of the loaded page using your browser's developer tools (for example F12 in Mozilla), you'll see that the tag containing the information you're looking for is <select id="icm-years-select">. If you try to parse for this tag in the object downloaded with Beautiful Soup, you get an empty list of tag objects:
from bs4 import BeautifulSoup
from requests import get

response = get('https://www.wheelmax.com/')
yourSoup = BeautifulSoup(response.text, "lxml")

print(len(yourSoup.select('div #vehicle-search')))  # length = 1 -> visible
print()
print(len(yourSoup.select('#icm-years-select')))    # length = 0 -> not visible
So if you want to get the years using Python by all means, you might try to click on the respective tag and then parse again, using some combination of requests/Beautiful Soup or the Selenium module, which will require a bit more digging :-)
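For what it's worth, here is a rough sketch of that idea, with Selenium rendering the page and Beautiful Soup parsing it afterwards (the sleep time is an arbitrary assumption):

import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.wheelmax.com/')
time.sleep(5)  # give the page's JavaScript time to populate the year select

soup = BeautifulSoup(driver.page_source, 'lxml')
years = [option.get('value') for option in soup.select('#icm-years-select option') if option.get('value')]
print(years)

driver.quit()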
Otherwise if you just quickly need the years parsed, use JavaScript:
countYears = document.getElementById('icm-years-select').length;
yearArray = [];
for (i = 0; i < countYears; i++) {yearArray.push(document.getElementById('icm-years-select')[i].value)};
