Can't Find Element XPath, Skip & Write Placeholder Text - python

This is a loop for scraping several elements. The price isn't always found. Instead of just passing through the except block, I need to print/write a placeholder value whenever no price is found; when it simply passes, the lists get out of sync and the variable values are mismatched when printing (title, link, image, price). Hopefully you can see the logic of what I'm trying to accomplish below.
#finds titles
deal_title = browser.find_elements_by_xpath("//a[@id='dealTitle']/span")
titles = []
for title in deal_title:
    titles.append(title.text)

#finds links
deal_link = browser.find_elements_by_xpath("//div[@class='a-row dealDetailContainer']/div/a[@id='dealTitle']")
links = []
for link in deal_link:
    links.append(link.get_attribute('href'))

#finds images
deal_image = browser.find_elements_by_xpath("//a[@id='dealImage']/div/div/div/img")
images = []
for image in deal_image:
    images.append(image.get_attribute('src'))

try:
    deal_price = browser.find_elements_by_xpath("//div[@class='a-row priceBlock unitLineHeight']/span")
    prices = []
    for price in deal_price:
        prices.append(price.text)
except NoSuchElementException:
    price = "PRINT/WRITE THIS TEXT INSTEAD OF PASSING"

#writes to html
for title, link, image, price in zip(titles, links, images, prices):
    f.write("<tr class='border'><td class='image'>" + "<img src=" + image + "></td>" +
            "<td class='title'>" + title + "</td><td class='price'>" + price + "</td></tr>")

import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

capabilities = {
    'browserName': 'chrome',
    'chromeOptions': {
        'useAutomationExtension': False,
        'forceDevToolsScreenshot': True,
        'args': ['--start-maximized', '--disable-infobars']
    }
}

driver = webdriver.Chrome(executable_path='./chromedriver_2.38.exe', desired_capabilities=capabilities)
driver.get("https://www.amazon.com/gp/goldbox/ref=gbps_ftr_s-4_bedf_page_10?gb_f_deals1=enforcedCategories:2972638011,dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,includedAccessTypes:,page:10,sortOrder:BY_SCORE,dealsPerPage:32&pf_rd_p=afc45143-5c9c-4b30-8d5c-d838e760bedf&pf_rd_s=slot-4&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=ZDV4YBQJFDVR3PAY4ZBS&ie=UTF8")
time.sleep(15)

golds = driver.find_elements_by_css_selector(".widgetContainer #widgetContent > div.singleCell")
print("found %d golds" % len(golds))

template = """\
<tr class="border">
<td class="image"><img src="{0}"></td>\
<td class="title">{2}</td>\
<td class="price">{3}</td>
</tr>"""

lines = []
for gold in golds:
    goldInfo = {}
    goldInfo['title'] = gold.find_element_by_css_selector('#dealTitle > span').text
    goldInfo['link'] = gold.find_element_by_css_selector('#dealTitle').get_attribute('href')
    goldInfo['image'] = gold.find_element_by_css_selector('#dealImage img').get_attribute('src')
    try:
        goldInfo['price'] = gold.find_element_by_css_selector('.priceBlock > span').text
    except NoSuchElementException:
        goldInfo['price'] = 'No price display'
    print(goldInfo['title'])
    line = template.format(goldInfo['image'], goldInfo['link'], goldInfo['title'], goldInfo['price'])
    lines.append(line)

html = """\
<html>
<body>
<table>
{0}
</table>
</body>
</html>\
"""

f = open('./result.html', 'w')
f.write(html.format('\n'.join(lines)))
f.close()

If I understood correctly, your problem is that some page elements are not loaded in time.
The elements you want to scrape may not have loaded yet when you read them.
To prevent this, you can use explicit waits (the script will wait up to a specified time until the specified element loads).
With this in place, there is a smaller chance of missing values.
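As a minimal sketch of such an explicit wait, assuming driver is the WebDriver instance from the answer above and reusing the .priceBlock > span selector from this thread (the 10-second timeout is an arbitrary choice):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

try:
    # wait up to 10 seconds for at least one price element to be present in the DOM
    price_elements = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".priceBlock > span"))
    )
except TimeoutException:
    price_elements = []  # nothing appeared in time; fall back to a placeholder value later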

Yes, you can skip the price as well. As an alternative approach, you can create a list of the available prices and their respective images as follows:
Code Block:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
browser = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
browser.get("https://www.amazon.com/gp/goldbox/ref=gbps_ftr_s-4_bedf_page_10?gb_f_deals1=enforcedCategories:2972638011,dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,includedAccessTypes:,page:10,sortOrder:BY_SCORE,dealsPerPage:32&pf_rd_p=afc45143-5c9c-4b30-8d5c-d838e760bedf&pf_rd_s=slot-4&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=ZDV4YBQJFDVR3PAY4ZBS&ie=UTF8")

#finds images
deal_image = browser.find_elements_by_xpath("//div[@class='a-row priceBlock unitLineHeight']/span//preceding::img[1]")
images = []
for image in deal_image:
    images.append(image.get_attribute('src'))

#finds prices
deal_price = browser.find_elements_by_xpath("//div[@class='a-row priceBlock unitLineHeight']/span")
prices = []
for price in deal_price:
    prices.append(price.text)

#print the information
for image, price in zip(images, prices):
    print(image, price)
Console Output:
https://images-na.ssl-images-amazon.com/images/I/31zt-ovKJqL._AA210_.jpg $9.25
https://images-na.ssl-images-amazon.com/images/I/610%2BKAfr72L._AA210_.jpg $15.89
https://images-na.ssl-images-amazon.com/images/I/41whkQ1m0uL._AA210_.jpg $31.49
https://images-na.ssl-images-amazon.com/images/I/41cAbUWEdoL._AA210_.jpg $259.58 - $782.99
https://images-na.ssl-images-amazon.com/images/I/51raHLFC8wL._AA210_.jpg $139.56
https://images-na.ssl-images-amazon.com/images/I/41fuZZwdruL._AA210_.jpg $41.24
https://images-na.ssl-images-amazon.com/images/I/51N2rdMSh0L._AA210_.jpg $19.50 - $20.99
https://images-na.ssl-images-amazon.com/images/I/515DbJhCtOL._AA210_.jpg $22.97
https://images-na.ssl-images-amazon.com/images/I/51OzOZrj1rL._AA210_.jpg $109.95
https://images-na.ssl-images-amazon.com/images/I/31-QDRkNbhL._AA210_.jpg $15.80
https://images-na.ssl-images-amazon.com/images/I/41vXJ9fvcIL._AA210_.jpg $88.99
https://images-na.ssl-images-amazon.com/images/I/51fKqo2YfcL._AA210_.jpg $21.85
https://images-na.ssl-images-amazon.com/images/I/31GcGUXz9TL._AA210_.jpg $220.99 - $241.99
https://images-na.ssl-images-amazon.com/images/I/41sROkWjnpL._AA210_.jpg $40.48
https://images-na.ssl-images-amazon.com/images/I/51vXMFtZajL._AA210_.jpg $22.72
https://images-na.ssl-images-amazon.com/images/I/512s5ZrjoFL._AA210_.jpg $51.99
https://images-na.ssl-images-amazon.com/images/I/51A8Nfvf8eL._AA210_.jpg $8.30
https://images-na.ssl-images-amazon.com/images/I/51aDac6YN5L._AA210_.jpg $18.53
https://images-na.ssl-images-amazon.com/images/I/31SQON%2BiOBL._AA210_.jpg $10.07
Link:
https://www.amazon.com/gp/goldbox/ref=gbps_ftr_s-4_bedf_page_10?gb_f_deals1=enforcedCategories:2972638011,dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,includedAccessTypes:,page:10,sortOrder:BY_SCORE,dealsPerPage:32&pf_rd_p=afc45143-5c9c-4b30-8d5c-d838e760bedf&pf_rd_s=slot-4&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=ZDV4YBQJFDVR3PAY4ZBS&ie=UTF8

Related

Python Selenium Web Scraping

I am trying to extract the following information from the website https://www.brecorder.com/pakistan/2022-11-17
I want to do the following things
Extract Category name
Extract News Articles links as well as headline given on the page
Go to Individual article link and fetch whole news from there
Paginate to previous day and repeat the above mentioned steps
Store everything in a csv file
So far I can get the category name, extract the article links, and paginate to the previous page, but my code isn't working well. First of all, I am getting random article links that aren't part of that particular webpage. I can paginate and extract article links for the previous day, but the same happens there too (I am attaching a screenshot of it). Moreover, I am unable to click on an individual link and fetch the detailed news from there.
I am also attaching snippets of the page's HTML:
[Page's html](https://i.stack.imgur.com/rK1El.png) (thumbnail: https://i.stack.imgur.com/juvg0.png)
My code up till now is:
import time
import csv
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

PATH = r"C:\Users\HP\PycharmProjects\WebScraping\chromedriver.exe"
driver = webdriver.Chrome(PATH)

category_urls = ['https://www.brecorder.com/pakistan/2022-11-17']

Category = []
Headline = []
Date = []
Url = []
News = []

def Url_Extraction():
    category_name = driver.find_element_by_css_selector('div[class="p-4 text-md text-gr bg-orange-200 inline-block my-2 font-sans font-medium text-white"]')
    cat = category_name.text  # Save category name in variable
    print(f"{cat}")

    news_articles = driver.find_elements_by_css_selector('a[class="story__link "]')
    for element in news_articles:
        URL = element.get_attribute('href')
        print(URL)
        Url.append(URL)
        Category.append(cat)

    current_time = time.time() - start_time
    print(f'{len(Url)} urls extracted')
    print(f'{len(Category)} categories extracted')
    print(f'Current Time: {current_time / 3600:.2f} hr, {current_time / 60:.2f} min, {current_time:.2f} sec',
          flush=True)

    try:
        next_page = driver.find_element(By.CSS_SELECTOR, 'a[class="infinite-more-link w-40 mx-auto text-center p-2 my-10 border bg-beige-400"]')
        driver.execute_script("arguments[0].click();", next_page)
    except Exception as e:
        print(e)

start_time = time.time()

for url in category_urls:
    driver.get(url)  # Go to Webpage
    driver.implicitly_wait(30)  # we don't need to wait 30 secs if element is already there (very useful)
    for num in range(2):
        print(f'page no. {num+1}')
        Url_Extraction()

''' Saving URLs to a csv file'''
with open('URL_List', 'w', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(Url)
f.close()

''' Adding Data to a Dataframe'''
cols = ['Url', 'Category']
data = pd.DataFrame(columns=cols, index=range(len(Url)))
for index in range(len(Url)):
    data.loc[index].Url = Url[index]
    data.loc[index].Category = Category[index]
data.to_csv('URLlist_with_Cat.csv')

time.sleep(3)
driver.quit()

Selenium webscrape not scraping all item information on Amazon

I've been building a web scraper that:
1.) Asks what item you'd like to look for on Amazon
2.) Opens a Chrome browser with Selenium and searches for the item
3.) Runs through a pre-set number of pages (I have it at 1 for time efficiency while debugging)
4.) Scrapes all the item information on each page and creates a list of "Product" objects.
The issue I'm having is that even with the try & except I still don't get all the information for each item. When debugging I've double- and triple-checked my XPaths with "Xpath Helper" and don't see where I went wrong.
Below is my code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from reference_functions import Product
import time
from lxml import html
from selenium.webdriver.chrome.options import Options
import pandas as pd
import datetime as datetime

## SETTING UP QUESTIONS NEEDED FOR SCRAPE
question_product = "What would you like to search for?\n:"
search_term = "invicta mens watch"  # str(input(question_product))
search_terms = search_term.split(" ")
question_export = "Do you want to export all item data to excel?\n:"
export_data = "no"  # str(input(question_export))

## SETTING UP WEBDRIVER
s = Service('/Users/nicholaskenney/PycharmProjects/Amazon_Scrape/chromedriver')
chrome_options = Options()
#chrome_options.add_argument("--headless")
driver = webdriver.Chrome(service=s, options=chrome_options)

## OPENING URL
url = "https://www.amazon.com/"
driver.get(url)
time.sleep(3)

## SENDING SEARCH TERMS TO SEARCH BOX FOLLOWED BY THE ENTER KEY
search_box = driver.find_element(By.ID, "twotabsearchtextbox")
search_box.send_keys(search_term)
search_box.send_keys(Keys.RETURN)
time.sleep(3)

products_list = []
page = 1
while True:
    if page != 0:
        try:
            driver.get(driver.current_url + "&page=" + str(page))
            time.sleep(3)
        except:
            break
    else:
        break
    tree = html.fromstring(driver.page_source)
    time.sleep(3)
    for product_tree in tree.xpath('//div[contains(@data-cel-widget, "search_result_")]'):
        should_add = True
        title = ""
        price = ""
        url = ""
        number_of_reviews = ""
        review_score = ""
        previous_price = ""
        try:
            ## Finding Title of item
            try:
                title = product_tree.xpath('.//span[@class="a-size-medium a-color-base a-text-normal"]/text()')
            except Exception as e:
                print("This is from first title try: " + e)
                title = product_tree.xpath('.//span[@class="a-size-base-plus a-color-base a-text-normal"]/text()')
            ## FINDING CURRENT PRICE OF ITEM
            price = product_tree.xpath('.//span[@class="a-price-whole"]/text()')
            ## FINDING NUMBER OF REVIEWS OF EACH ITEM
            try:
                number_of_reviews = product_tree.xpath('.//span[@class="a-size-base"]/text()')
            except:
                number_of_reviews = product_tree.xpath('.//span[@class="a-size-base a-color-base s-underline-text"]/text()')
            ## REVIEW SCORE FOR EACH ITEM
            try:
                review_score = product_tree.xpath('.//span[@class="a-icon-alt"]/text()')
            except:
                review_score = product_tree.xpath('.//span[@class="a-size-base a-color-base s-underline-text"]/text()')
            ## FINDING LINK FOR EACH ITEM
            try:
                links = product_tree.xpath('.//a[@class="a-link-normal s-link-style a-text-normal"]')
                for link in links:
                    if 'href' in link.attrib:
                        url = str(link.attrib['href'])
            except:
                links = product_tree.xpath('.//a[@class="a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal"]')
                for link in links:
                    if 'href' in link.attrib:
                        url = str(link.attrib['href'])
            ## PREVIOUS PRICE SCRAPE
            try:
                previous_price = product_tree.xpath('.//span[@class="a-price a-text-price"]//span[@class="a-offscreen"]/text()')
            except:
                previous_price = price
        except:
            print("exception")
            should_add = False
        ## IF ALL INFORMATION IS SCRAPED (SHOULD_ADD IS TRUE) CREATE PRODUCT OBJECTS FOR EACH ITEM AND APPEND TO PRODUCT LIST
        product = Product(price, title, url, number_of_reviews, review_score, previous_price)
        if should_add == True:
            products_list.append(product)
    page = page - 1

print("Number of items scraped: " + str(len(products_list)))

## End of Webscrape
driver.quit()

## PRINTING RESULT FOR DEBUGGING
count = 0
for x in products_list:
    print(x)
    print(x.url)
    print("Price is: " + str(x.price))
    print("Previous Price is: " + str(x.previous_price))
    print("Item title: " + str(x.title))
    print("Number of review: " + str(x.number_of_reviews))
    print("Review Scores: " + str(x.review_score))
    print("__________")
And this is the result I get:
Number of items scraped: 83
<reference_functions.Product object at 0x7ffd78e8bf10>
https://www.amazon.com/
Price is: []
Previous Price is: []
Item title: []
Number of review: []
Review Scores: ['4.6 out of 5 stars.', '4.6 out of 5 stars.', '4.6 out of 5 stars.']
__________
<reference_functions.Product object at 0x7ffd78eb10d0>
/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A05122932N0ETGH50WEB2&url=%2FWatches-Chronograph-Stainless-Waterproof-Business%2Fdp%2FB07Z62B354%2Fref%3Dsr_1_1_sspa%3Fcrid%3DIFKI3E407I9T%26keywords%3Dinvicta%2Bmens%2Bwatch%26qid%3D1640751697%26sprefix%3Di%252Caps%252C70%26sr%3D8-1-spons%26psc%3D1&qualifier=1640751697&id=2139685257788988&widgetName=sp_atf
Price is: ['42']
Previous Price is: ['$49.99']
Item title: []
Number of review: ['6,012']
Review Scores: ['4.4 out of 5 stars']
__________
<reference_functions.Product object at 0x7ffd78eb12e0>
/Invicta-Diver-Blue-Watch-26972/dp/B07GMSXZBM/ref=sr_1_2?crid=IFKI3E407I9T&keywords=invicta+mens+watch&qid=1640751697&sprefix=i%2Caps%2C70&sr=8-2
Price is: ['49']
Previous Price is: []
Item title: []
Number of review: ['6,122']
Review Scores: ['4.6 out of 5 stars']
__________
<reference_functions.Product object at 0x7ffd78eb1130>
/Invicta-Diver-Quartz-Green-30623/dp/B08447S81T/ref=sr_1_omk_3?crid=IFKI3E407I9T&keywords=invicta+mens+watch&qid=1640751697&sprefix=i%2Caps%2C70&sr=8-3
Price is: ['59']
Previous Price is: ['$69.90']
Item title: []
Number of review: ['6']
Review Scores: ['4.8 out of 5 stars']
__________
<reference_functions.Product object at 0x7ffd78eb1070>
/Invicta-12847-Specialty-Stainless-Steel/dp/B00962GV2E/ref=sr_1_4?crid=IFKI3E407I9T&keywords=invicta+mens+watch&qid=1640751697&sprefix=i%2Caps%2C70&sr=8-4
Price is: ['37']
Previous Price is: []
Item title: []
Number of review: ['5,376']
Review Scores: ['4.7 out of 5 stars']
Etc. Etc. Etc.
On this trial run it exported the URL and the total reviews, but I find that every other run doesn't export these variables. Is that because Amazon's HTML changes each time I run it, or is it something wrong with the code?
Any help on this would be greatly appreciated!
Personally, I would use a CSS selector to find the link, since I don't find XPath reliable. The code I would use to find the link would be:
product_tree.find_element(By.CSS_SELECTOR, 'a.a-link-normal.s-no-outline').get_attribute('href')
Running this selector returns the correct link every time for me, without any problems.
As for the reviews, I would also use CSS. In this case it would be:
product_tree.find_element(By.CSS_SELECTOR, 'span.a-icon-alt').text
If it still doesn't work, I would suggest writing the whole page source to a text file using driver.page_source and then using a tool to view what the driver is seeing.
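As a minimal sketch of that last suggestion, assuming driver is the active WebDriver instance (the output file name is arbitrary):

# dump the rendered HTML the driver actually sees, so the selectors can be
# checked against it offline in a browser or text editor
with open("page_source_dump.html", "w", encoding="utf-8") as dump:
    dump.write(driver.page_source)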

Selecting multiple options in unchanging url

I need to scrape content from the website by selecting a state, district and block from the drop-down menus.
I tried using Python requests and POST calls, but I'm not able to scrape the content properly, as the URL of the website never changes for the options I choose.
This is the code I've tried so far:
# importing all necessary packages
import urllib3
import requests
from bs4 import BeautifulSoup
import csv
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
url = "http://swachhbharatmission.gov.in/tsc/Report_NBA/Panchayat/Rpt_SarpanchDetail.aspx"
session = requests.Session()
html = session.get(url, verify=False).content
soup = BeautifulSoup(html, "lxml")
option = soup.find("select",{"name":"ctl00$ContentPlaceHolder1$ddlState"}).findAll("option")
# create dictionary 'states' mapping each state with it's code
states = {}
for elem in option[1:]:
    key = elem['value']
    value = elem.text
    states[key] = value

for state in states.keys():
    payload_ano = {'ctl00$ContentPlaceHolder1$ddlState': str(state)}
    r = requests.post(url, data=payload_ano, verify=False)
    break
soup = BeautifulSoup(r.text,"html.parser")
option = soup.find("select",{"name":"ctl00$ContentPlaceHolder1$ddlDistrict"}).findAll("option")
option # only gives [<option selected="selected" value="%">All District</option>] from the home page and not the districts inside the state chosen
I have used a break statement so the code terminates earlier. The problem is that the variable option in the final line should contain the contents of the district drop-down for the chosen state, but it only shows the content of the home page.
Any help or suggestions would be really appreciated.
You can use Selenium to select an option from the drop-downs.
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get('http://swachhbharatmission.gov.in/tsc/Report_NBA/Panchayat/Rpt_SarpanchDetail.aspx')

# get state options
state_element = driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_ddlState"]')
state_select = Select(state_element)
state_options = [state_option.text for state_option in state_select.options]

# choose state option number
print('\nselect state:')
for i, state in enumerate(state_options):
    print(f'{i+1} - {state.strip()}')
state = input(':- ')

# select state option
state_selected = driver.find_element_by_xpath(f'//*[@id="ctl00_ContentPlaceHolder1_ddlState"]/option[{state}]')
state_selected.click()

# get district options
district_element = driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_ddlDistrict"]')
district_select = Select(district_element)
district_options = [district_option.text for district_option in district_select.options]

# choose district option number
print('\nselect district:')
for i, district in enumerate(district_options):
    print(f'{i+1} - {district.strip()}')
district = input(':- ')

# select district option
district_selected = driver.find_element_by_xpath(f'//*[@id="ctl00_ContentPlaceHolder1_ddlDistrict"]/option[{district}]')
district_selected.click()

# get block options
block_element = driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_ddlBlock"]')
block_select = Select(block_element)
block_options = [block_option.text for block_option in block_select.options]

# choose block option number
print('\nselect block:')
for i, block in enumerate(block_options):
    print(f'{i+1} - {block.strip()}')
block = input(':- ')

# select block option
block_selected = driver.find_element_by_xpath(f'//*[@id="ctl00_ContentPlaceHolder1_ddlBlock"]/option[{block}]')
block_selected.click()

# get data of each record
try:
    table_element = driver.find_element_by_css_selector('table.Table')
except NoSuchElementException:
    print('\nRecord not found')
else:
    table_rows = table_element.find_elements_by_css_selector('table.Table tr')
    print('\nGrampanchayat Sarpanch Details')
    for table_row in table_rows[2:]:
        table_cols = table_row.find_elements_by_css_selector('table.Table tr td')
        for table_col in table_cols:
            print(table_col.text, end=',\t')
        print()
Note:
You need to download Chrome Driver into your project folder.
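If it helps, a minimal sketch of pointing webdriver.Chrome at a driver binary sitting in the project folder (the relative path and filename are assumptions; the code above instead relies on the driver being on PATH):

from selenium import webdriver

# assumes chromedriver.exe was downloaded into the project folder next to the script
driver = webdriver.Chrome(executable_path='./chromedriver.exe')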

Python BeautifulSoup page drill down

I have a Python script which scrapes information from an Amazon page using a list of keywords stored in a .txt file. I get almost all the information I need from the page below:
'https://www.amazon.co.uk/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords={a}'.format(a=keyword)
The missing bit is the seller info (for example: "by ZETA"), for which I need to drill down into each product page, like the one below:
https://www.amazon.co.uk/Stroller-Pushchair-Colours-Available-Raincover/dp/B073B2D7CL/ref=sr_1_9?keywords=Pushchair&qid=1555063828&s=gateway&sr=8-9
I guess I need a loop inside the get_data function, but I'm not sure how to implement it. See below for the code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver
import re
import datetime
from collections import deque
import logging
import csv


class AmazonScaper(object):

    def __init__(self, keywords, output_file='example.csv', sleep=2):
        self.browser = webdriver.Chrome(executable_path='chromedriver.exe')  # Add path to your Chromedriver
        self.keyword_queue = deque(keywords)  # Add the start URL to our list of URLs to crawl
        self.output_file = output_file
        self.sleep = sleep
        self.results = []

    def get_page(self, keyword):
        try:
            self.browser.get('https://www.amazon.co.uk/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords={a}'.format(a=keyword))
            return self.browser.page_source
        except Exception as e:
            logging.exception(e)
            return

    def get_soup(self, html):
        if html is not None:
            soup = BeautifulSoup(html, 'lxml')
            return soup
        else:
            return

    def get_data(self, soup, keyword):
        try:
            results = soup.select('.s-result-list [data-asin]')
            for a, b in enumerate(results):
                soup = b
                header = soup.find('h5')
                result = a + 1
                title = header.text.strip()
                try:
                    link = soup.find('a', attrs={'class': 'a-link-normal a-text-normal'})
                    url = link['href']
                    url = re.sub(r'/ref=.*', '', str(url))
                except:
                    url = "None"
                # Extract the ASIN from the URL - ASIN is the breaking point to filter out if the position is sponsored
                ASIN = re.sub(r'.*/dp/', '', str(url))
                # Extract Score Data using ASIN number to find the span class
                # <span class="a-icon-alt">4.3 out of 5 stars</span>
                try:
                    score = soup.select_one('.a-icon-alt')
                    score = score.text
                    score = score.strip('\n')
                    score = re.sub(r' .*', '', str(score))
                except:
                    score = "None"
                # Extract Number of Reviews in the same way
                try:
                    reviews = soup.select_one("[href*='#customerReviews']")
                    reviews = reviews.text.strip()
                except:
                    reviews = "None"
                # And again for Prime
                try:
                    PRIME = soup.select_one('[field-lbr_brands_browse-bin=*"]')
                    PRIME = PRIME['field-lbr_brands_browse-bin']
                    # <i class="a-icon a-icon-prime" role="img" aria-label="Amazon Prime"></i>
                except:
                    PRIME = "None"
                try:
                    seller = ""
                    seller = ""
                except:
                    seller = "None"
                data = {keyword: [keyword, str(result), seller, title, ASIN, score, reviews, PRIME, datetime.datetime.today().strftime("%B %d, %Y")]}
                self.results.append(data)
        except Exception as e:
            print(e)
        return 1

    def csv_output(self):
        keys = ['Keyword', 'Rank', 'seller', 'Title', 'ASIN', 'Score', 'Reviews', 'Prime', 'Dates']
        print(self.results)
        with open(self.output_file, 'a', encoding='utf-8') as outputfile:
            dict_writer = csv.DictWriter(outputfile, keys)
            dict_writer.writeheader()
            for item in self.results:
                for key, value in item.items():
                    print(".".join(value))
                    outputfile.write(",".join('"' + item + '"' for item in value) + "\n")  # Add "" quote character so the CSV accepts commas

    def run_crawler(self):
        while len(self.keyword_queue):  # If we have keywords to check
            keyword = self.keyword_queue.popleft()  # We grab a keyword from the left of the list
            html = self.get_page(keyword)
            soup = self.get_soup(html)
            time.sleep(self.sleep)  # Wait for the specified time
            if soup is not None:  # If we have soup - parse and save data
                self.get_data(soup, keyword)
        # self.browser.quit()
        self.csv_output()  # Save the object data to csv


if __name__ == "__main__":
    keywords = [str.replace(line.rstrip('\n'), ' ', '+') for line in
                open('keywords.txt')]  # Use our file of keywords & replaces spaces with +
    ranker = AmazonScaper(keywords)  # Create the object
    ranker.run_crawler()  # Run the rank checker
On the search page, each search item is contained in tags like:
<div data-asin="B0089TV3CS" data-index="1" class="sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 s-result-item sg-col-4-of-28 sg-col-4-of-16 AdHolder sg-col sg-col-4-of-20 sg-col-4-of-32" data-cel-widget="search_result_1">
Look right at the end of the above line; you can see the pattern that all search results follow. So you can use a regex search on those div tags' data-cel-widget attributes like so:
search_results = soup.find_all("div", {"data-cel-widget": re.compile(r"search_result_\d")})
Now you can loop through each search result, and extract the links to the individual product pages, noting that the links are contained in tags like:
<a class="a-link-normal a-text-normal" href="/Sterling-Necklace-Infinity-Pendant-Jewellery/dp/B07BPSPD14/ref=sr_1_8?keywords=cross&qid=1555066092&s=gateway&sr=8-8">
I'm not familiar with selenium, but if I were using the requests module, I'd use it to load each product page in the loop, make a BeautifulSoup from it, and then look for the following tag, which is where the seller info is contained:
<a id="bylineInfo" class="a-link-normal" href="/ZETA/b/ref=bl_dp_s_web_1658218031?ie=UTF8&node=1658218031&field-lbr_brands_browse-bin=ZETA">ZETA</a>
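As a rough sketch of that drill-down, reusing the requests and BeautifulSoup imports from the question; the browser-like User-Agent header and the relative-link handling are my assumptions, and search_results is the list produced by the find_all call above:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = 'https://www.amazon.co.uk'
headers = {'User-Agent': 'Mozilla/5.0'}  # assumed: a browser-like UA to avoid the basic bot block

for result in search_results:
    # link tag on the search page, as quoted above
    link = result.find('a', attrs={'class': 'a-link-normal a-text-normal'})
    if link is None:
        continue
    product_url = urljoin(BASE, link['href'])  # hrefs on the search page are relative
    product_soup = BeautifulSoup(requests.get(product_url, headers=headers).content, 'lxml')
    byline = product_soup.find('a', id='bylineInfo')  # the seller tag the answer points at
    seller = byline.text.strip() if byline else 'None'
    print(product_url, seller)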

Simulate clicking a link inside a link - Selenium Python

Python Knowledge: beginner
I managed to create a script to scrape contact information. The flow I followed, since I am a beginner, is to extract all the first-level links, copy them to a text file, and use that file in link = browser.find_element_by_link_text(str(link_text)). Scraping of contact details has been confirmed to work (based on a separate run). The problem is that after clicking the first link, it won't go on to click the links inside it, so it cannot scrape the contact info.
What is wrong with my script? Please bear in mind I am a beginner, so my script is a little manual and lengthy.
Thanks very much!!!
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import requests
from bs4 import BeautifulSoup
import urllib
import re
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import csv, time, lxml

######################### open file list ####################################
testfile = open("category.txt")  # this is where I saved the category
readfile = testfile.read()
readfilesplit = readfile.split("\n")
############################### end ###################################

################### open browser ###############################
browser = webdriver.Firefox()
browser.get('http://aucklandtradesmen.co.nz/')
####################### end ###################################

link_texts = readfilesplit
for link_text in link_texts:

    link = browser.find_element_by_link_text(str(link_text))
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".add-listing")))
    link.click()  # click link
    time.sleep(5)

    print "-------------------------------------------------------------------------------------------------"
    print("Getting listings for '%s'" % link_text)

    ################# get list name #######################
    urlNoList = 'http://aucklandtradesmen.co.nz/home-mainmenu-1.html'
    r = requests.get(browser.current_url)

    if (urlNoList != browser.current_url):
        soup = BeautifulSoup(r.content, 'html.parser')

        g_data = soup.find_all("div", {"class": "listing-summary"})
        pageRange = soup.find_all("span", {"class": "xlistings"})

        pageR = [pageRange[0].text]
        pageMax = str(pageR)[-4:-2]  # get max item for lists
        X = str(pageMax).replace('nd', '0')
        # print "Number of listings: ", X
        Y = int(X)  # convert string to int
        print "Number of listings: ", Y

        for item in g_data:
            try:
                listingNames = item.contents[1].text
                lstList = []
                lstList[len(lstList):] = [listingNames]

                replStr = re.sub(r"u'", "'", str(lstList))  # strip u' char
                replStr1 = re.sub(r"\s+'", "'", str(replStr))  # strip space and '
                replStr2 = re.sub(r"\sFeatured", "", str(replStr1))  # strip Featured string
                print "Cleaned string: ", replStr2

                ################ SCRAPE INFO ################
                ################### This is where the code is not executing #######################
                count = 0
                while (count < Y):
                    for info in replStr2:
                        link2 = browser.find_element_by_link_text(str(info))
                        time.sleep(10)
                        link2.click()
                        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "#rating-msg")))
                        print "count", count
                        count += 1
                        print("Contact info for: '%s'" % link_text)
                        r2 = requests.get(browser.current_url)
                        soup2 = BeautifulSoup(r2.content, 'html.parser')
                        g_data2 = soup.find_all("div", {"class": "fields"})
                        for item2 in g_data2:
                            # print item.contents[0]
                            print item2.contents[0].text
                            print item2.contents[1].text
                            print item2.contents[2].text
                            print item2.contents[3].text
                            print item2.contents[4].text
                            print item2.contents[5].text
                            print item2.contents[6].text
                            print item2.contents[7].text
                            print item2.contents[8].text
                        browser.back()
                        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".add-listing")))
                ################### END ---- This is where the code is not executing END ---#######################
                ############ END SCRAPE INFO ####################
            except NoSuchElementException:
                browser.back()
                WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))
    else:
        browser.back()
        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))
        print "Number of listings: 0"

    browser.back()
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))
By the way this is some of the result:
-------------------------------------------------------------------------------------------------
Getting listings for 'Plumbers'
Number of listings: 5
Cleaned string: ['Hydroflame Plumbing & Gas Ltd']
Cleaned string: ['Osborne Plumbing Ltd']
Cleaned string: ['Plumbers Auckland Central']
Cleaned string: ['Griffiths Plumbing']
Cleaned string: ['Plumber Auckland']
-------------------------------------------------------------------------------------------------
Getting listings for 'Professional Services'
Number of listings: 2
Cleaned string: ['North Shore Chiropractor']
Cleaned string: ['Psychotherapy Werks - Rob Hunter']
-------------------------------------------------------------------------------------------------
Getting listings for 'Property Maintenance'
Number of listings: 7
Cleaned string: ['Auckland Tree Services']
Cleaned string: ['Bob the Tree Man']
Cleaned string: ['Flawless House Washing & Drain Unblocking']
Cleaned string: ['Yardiez']
Cleaned string: ['Build Corp Apartments Albany']
Cleaned string: ['Auckland Trellis']
Cleaned string: ['Landscape Design']
What I would do is change the logic a bit. Here's the logic flow I would suggest, which eliminates writing the links out to a file and speeds up the script (a sketch follows the list below):
1. Navigate to http://aucklandtradesmen.co.nz/
2. Grab all elements using CSS selector "#index a" and store the attribute "href" of each in an array of strings (links to each category page)
3. Loop through the href array
   3.1. Navigate to href
        3.1.1. Grab all elements using CSS selector "div.listing-summary a" and store the .text of each (company names)
        3.1.2. If an element .by_link_text("Next") exists, click it and return to 3.1.1.
If you want business contact info off of the company pages, store the href in 3.1.1. as well, then loop through that list and grab what you want off each page.
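Here is a rough sketch of that flow, assuming the CSS selectors named in the outline ("#index a" and "div.listing-summary a") actually match the site's markup, which I have not verified:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Firefox()
browser.get('http://aucklandtradesmen.co.nz/')

# step 2: collect the category links up front instead of reading them from a file
category_links = [a.get_attribute('href')
                  for a in browser.find_elements_by_css_selector('#index a')]

company_names = []
for href in category_links:        # step 3: loop through the hrefs
    browser.get(href)              # step 3.1: navigate to the category page
    while True:
        # step 3.1.1: store the company names (store hrefs too if contact info is needed later)
        for a in browser.find_elements_by_css_selector('div.listing-summary a'):
            company_names.append(a.text)
        # step 3.1.2: follow the "Next" pagination link if it exists
        try:
            browser.find_element_by_link_text('Next').click()
        except NoSuchElementException:
            break

print(company_names)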
Okay, I found a solution after thinking about @jeffC's suggestion:
Extract the href values and append them to the base URL, which is http://aucklandtradesmen.co.nz. For example, if the extracted href is /home-mainmenu-1/alarms-a-security/armed-alarms-ltd-.html, tell the browser to navigate to that full URL, and then I can do whatever I want in the current page.
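A minimal sketch of that joining step, using urljoin so the base URL and a relative href combine cleanly (the example href is the one quoted above, and browser is assumed to be the WebDriver instance from the question):

from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

base_url = 'http://aucklandtradesmen.co.nz'
href = '/home-mainmenu-1/alarms-a-security/armed-alarms-ltd-.html'

full_url = urljoin(base_url, href)
browser.get(full_url)  # now scrape whatever is needed from the current page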
