I've been building a web scraper that:
1.) Asks what item you'd like to look for on Amazon
2.) Opens a Chrome browser with Selenium and searches for the item
3.) Runs through a pre-set number of pages (I have it at 1 for time efficiency when debugging)
4.) Scrapes each item's information on each page and creates a list of "Product" objects.
The issue I'm having is that even with the try/except blocks I still don't get all the information for each item. When debugging I've double- and triple-checked my XPaths with "XPath Helper" and don't see where I went wrong.
Below is my code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from reference_functions import Product
import time
from lxml import html
from selenium.webdriver.chrome.options import Options
import pandas as pd
import datetime as datetime
## SETTING UP QUESTIONS NEEDED FOR SCRAPE
question_product = "What would you like to search for?\n:"
search_term = "invicta mens watch" #str(input(question_product))
search_terms = search_term.split(" ")
question_export = "Do you want to export all item data to excel?\n:"
export_data = "no"#str(input(question_export))
## SETTING UP WEBDRIVER
s = Service('/Users/nicholaskenney/PycharmProjects/Amazon_Scrape/chromedriver')
chrome_options = Options()
#chrome_options.add_argument("--headless")
driver = webdriver.Chrome(service=s, options=chrome_options)
## OPENING URL
url = "https://www.amazon.com/"
driver.get(url)
time.sleep(3)
## SENDING SEARCH TERMS TO SEARCH BOX FOLLOWED BY ENTER KEY
search_box = driver.find_element(By.ID, "twotabsearchtextbox")
search_box.send_keys(search_term)
search_box.send_keys(Keys.RETURN)
time.sleep(3)
products_list = []
page = 1
while True:
    if page != 0:
        try:
            driver.get(driver.current_url + "&page=" + str(page))
            time.sleep(3)
        except:
            break
    else:
        break
    tree = html.fromstring(driver.page_source)
    time.sleep(3)
    for product_tree in tree.xpath('//div[contains(@data-cel-widget, "search_result_")]'):
        should_add = True
        title = ""
        price = ""
        url = ""
        number_of_reviews = ""
        review_score = ""
        previous_price = ""
        try:
            ## Finding Title of item
            try:
                title = product_tree.xpath('.//span[@class="a-size-medium a-color-base a-text-normal"]/text()')
            except Exception as e:
                print("This is from first title try: " + e)
                title = product_tree.xpath('.//span[@class="a-size-base-plus a-color-base a-text-normal"]/text()')
            ## FINDING CURRENT PRICE OF ITEM
            price = product_tree.xpath('.//span[@class="a-price-whole"]/text()')
            ## FINDING NUMBER OF REVIEWS OF EACH ITEM
            try:
                number_of_reviews = product_tree.xpath('.//span[@class="a-size-base"]/text()')
            except:
                number_of_reviews = product_tree.xpath('.//span[@class="a-size-base a-color-base s-underline-text"]/text()')
            ## REVIEW SCORE FOR EACH ITEM
            try:
                review_score = product_tree.xpath('.//span[@class="a-icon-alt"]/text()')
            except:
                review_score = product_tree.xpath('.//span[@class="a-size-base a-color-base s-underline-text"]/text()')
            ## FINDING LINK FOR EACH ITEM
            try:
                links = product_tree.xpath('.//a[@class="a-link-normal s-link-style a-text-normal"]')
                for link in links:
                    if 'href' in link.attrib:
                        url = (str(link.attrib['href']))
            except:
                links = product_tree.xpath('.//a[@class="a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal"]')
                for link in links:
                    if 'href' in link.attrib:
                        url = (str(link.attrib['href']))
            ## PREVIOUS PRICE SCRAPE
            try:
                previous_price = product_tree.xpath('.//span[@class="a-price a-text-price"]//span[@class="a-offscreen"]/text()')
            except:
                previous_price = price
        except:
            print("exception")
            should_add = False
        ## IF ALL INFORMATION IS SCRAPED (SHOULD_ADD IS TRUE) CREATE PRODUCT OBJECTS FOR EACH ITEM AND APPEND TO PRODUCT LIST
        product = Product(price, title, url, number_of_reviews, review_score, previous_price)
        if should_add == True:
            products_list.append(product)
    page = page - 1
print("Number of items scraped: " + str(len(products_list)))
## End of Webscrape
driver.quit()
## PRINTING RESULT FOR DEBUGGING
count = 0
for x in products_list:
    print(x)
    print(x.url)
    print("Price is: " + str(x.price))
    print("Previous Price is: " + str(x.previous_price))
    print("Item title: " + str(x.title))
    print("Number of reviews: " + str(x.number_of_reviews))
    print("Review Scores: " + str(x.review_score))
    print("__________")
And this is the result I get:
Number of items scraped: 83
<reference_functions.Product object at 0x7ffd78e8bf10>
https://www.amazon.com/
Price is: []
Previous Price is: []
Item title: []
Number of review: []
Review Scores: ['4.6 out of 5 stars.', '4.6 out of 5 stars.', '4.6 out of 5 stars.']
__________
<reference_functions.Product object at 0x7ffd78eb10d0>
/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A05122932N0ETGH50WEB2&url=%2FWatches-Chronograph-Stainless-Waterproof-Business%2Fdp%2FB07Z62B354%2Fref%3Dsr_1_1_sspa%3Fcrid%3DIFKI3E407I9T%26keywords%3Dinvicta%2Bmens%2Bwatch%26qid%3D1640751697%26sprefix%3Di%252Caps%252C70%26sr%3D8-1-spons%26psc%3D1&qualifier=1640751697&id=2139685257788988&widgetName=sp_atf
Price is: ['42']
Previous Price is: ['$49.99']
Item title: []
Number of review: ['6,012']
Review Scores: ['4.4 out of 5 stars']
__________
<reference_functions.Product object at 0x7ffd78eb12e0>
/Invicta-Diver-Blue-Watch-26972/dp/B07GMSXZBM/ref=sr_1_2?crid=IFKI3E407I9T&keywords=invicta+mens+watch&qid=1640751697&sprefix=i%2Caps%2C70&sr=8-2
Price is: ['49']
Previous Price is: []
Item title: []
Number of review: ['6,122']
Review Scores: ['4.6 out of 5 stars']
__________
<reference_functions.Product object at 0x7ffd78eb1130>
/Invicta-Diver-Quartz-Green-30623/dp/B08447S81T/ref=sr_1_omk_3?crid=IFKI3E407I9T&keywords=invicta+mens+watch&qid=1640751697&sprefix=i%2Caps%2C70&sr=8-3
Price is: ['59']
Previous Price is: ['$69.90']
Item title: []
Number of review: ['6']
Review Scores: ['4.8 out of 5 stars']
__________
<reference_functions.Product object at 0x7ffd78eb1070>
/Invicta-12847-Specialty-Stainless-Steel/dp/B00962GV2E/ref=sr_1_4?crid=IFKI3E407I9T&keywords=invicta+mens+watch&qid=1640751697&sprefix=i%2Caps%2C70&sr=8-4
Price is: ['37']
Previous Price is: []
Item title: []
Number of review: ['5,376']
Review Scores: ['4.7 out of 5 stars']
Etc. Etc. Etc.
On this trial run it exported the URL and the total reviews. I find that every other run doesn't export these variables. Is that because Amazon's HTML changes each time I run it, or is it something wrong with the code?
Any help on this would be greatly appreciated!
Personally, I would use a CSS selector to find the link, since I don't find XPath reliable. The code that I would use to find the link would be:
product_tree.find_element(By.CSS_SELECTOR, 'a.a-link-normal.s-no-outline').get_attribute('href')
Running this selector for me returns the correct link every time, without any problems.
As for the reviews, I would also use CSS. In this case it would be:
product_tree.find_element(By.CSS_SELECTOR, 'span.a-icon-alt').text
If it still doesn't work, I would suggest writing the whole page source to a text file using driver.page_source and then using a tool to view what the driver is seeing.
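Putting both suggestions together, here is a rough sketch (an assumption, not the asker's exact code) of how the loop could look if the result containers are located with Selenium itself instead of lxml. The container selector div[data-cel-widget*="search_result_"] mirrors the data-cel-widget XPath from the question; Amazon's markup changes often, so every selector here should be verified against the live page.
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

containers = driver.find_elements(By.CSS_SELECTOR, 'div[data-cel-widget*="search_result_"]')
for product_tile in containers:
    try:
        url = product_tile.find_element(By.CSS_SELECTOR, 'a.a-link-normal.s-no-outline').get_attribute('href')
    except NoSuchElementException:
        url = ""  # some sponsored tiles use a different anchor class
    try:
        # the rating span can be visually hidden; if .text comes back empty,
        # get_attribute('textContent') is an alternative
        review_score = product_tile.find_element(By.CSS_SELECTOR, 'span.a-icon-alt').text
    except NoSuchElementException:
        review_score = ""
    print(url, review_score)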
So, for some reason, when I try to get the results for this script, it just crashes and shows no error at all before I get anything. Could someone please help me get this to work? I don't know why this is; I think it may have to do with getting the items variable in some regard, but I just can't figure it out! Any help would be appreciated.
Here Is The Script:
from bs4 import BeautifulSoup
import requests
import re
import time
print("Computer Deal Finder")
print("\nBy: ViridianTelamon.")
print("\nThis Program Will Help You Find The Best Computers, Adapters, Electronics, And Computer Components Using The Website New Egg.")
item_thing = input("\nEnter The Item You Want To Find The Best Deals On: ")
time.sleep(2)
#url = f"https://www.amazon.com/s?k={item}&page=1&crid=1BE844NMMQSV7&sprefix={item}%2Caps%2C1923&ref=nb_sb_noss_1"
url = f"https://www.newegg.ca/p/pl?d={item_thing}&N=4131"
page = requests.get(url).text
doc = BeautifulSoup(page, "html.parser")
#page_text = doc.find(class_="s-pagination-item s-pagination-selected")
page_text = doc.find(class_="list-tool-pagination-text").strong
pages = int(str(page_text).split("/")[-2].split(">")[-1][:-1])
items_found = []
for page in range(1, pages + 1):
    url = f"https://www.newegg.ca/p/pl?d={item_thing}&N=4131page={page}"
    page = requests.get(url).text
    doc = BeautifulSoup(page, "html.parser")
    items = doc.find_all(text=re.compile(item_thing))
    #items = div.find_all(text=re.compile(item_thing))
    for item in items:
        parent = item.parent
        link = None
        if parent.name != "a":
            continue
        link = parent['href']
        next_parent = item.find_parent(class_="item-container")
        try:
            price = next_parent.find(class_="price-current").find("strong").string
            items_found[item] = {"Price: ": int(price.replace(",", "")), "URL: ": link}
        except:
            pass
#sorted_items = sorted(items_found.items(), key=lambda x: x[1]['price'])
sorted_items = sorted(items_found, key=lambda x: x[1]['price'])
print("\n--------------------")
for item in sorted_items:
    print("\n"f"Name: {item[0]}")
    print("\n"f"Price: ${items[1]['price']}")
    print("\n"f"URL: items[1]['link']")
    print("\n--------------------")
    time.sleep(0.2)
I suggest you test the result of your .find() calls as not all items contain the information you need. For example:
from bs4 import BeautifulSoup
import requests
import re
import time
item_thing = "adapter"
url = f"https://www.newegg.ca/p/pl?d={item_thing}&N=4131"
page = requests.get(url).text
doc = BeautifulSoup(page, "html.parser")
page_text = doc.find(class_="list-tool-pagination-text").strong
pages = int(str(page_text).split("/")[-2].split(">")[-1][:-1])
items_found = []
for page in range(1, pages + 1):
print(f"Getting page {page}")
url = f"https://www.newegg.ca/p/pl?d={item_thing}&N=4131&page={page}"
req = requests.get(url)
doc = BeautifulSoup(req.content, "html.parser")
for div in doc.find_all('div', class_="item-container"):
li_price = div.find(class_='price-current')
price = 0 # assume unknown price
if li_price:
strong = li_price.find('strong')
if strong:
price = float(strong.text.replace(',', ''))
a_tag = div.find('a', class_='item-title', href=True)
items_found.append([price, a_tag['href'], a_tag.text])
for price, link, name in sorted(items_found):
print(f"Name: {name}")
print(f"Price: ${price}")
print(f"URL: {link}")
print("--------------------")
This would give you results starting:
Name: axGear Universal Brass 3.5mm Male to 6.5mm Female Stereo Audio Adapter Jack Connector
Price: $3.0
URL: https://www.newegg.ca/p/231-0099-00023?Description=adapter&cm_re=adapter-_-9SIAD1NC9E3870-_-Product
--------------------
Name: axGear USB-C Female to USB 3.0 Male Adapter Converter Type C to USB 3 F/M
Price: $7.0
URL: https://www.newegg.ca/p/231-0099-00018?Description=adapter&cm_re=adapter-_-9SIAD1NB4E4533-_-Product
--------------------
Name: ORICO USB to Bluetooth 4.0 Portable Adapter Wireless Receiver Adapter Dongle -White
Price: $8.0
URL: https://www.newegg.ca/orico-bta-403/p/0XM-000H-00009?Description=adapter&cm_re=adapter-_-0XM-000H-00009-_-Product
--------------------
Hi guys
I have a problem with scraping this dynamic site (https://kvartiry-bolgarii.ru/).
I need to get all the links to the home sale ads.
I used Selenium to load the page and get links to ads; after that I scroll the page down to load new ads. Once the new ads are loaded, I parse all the links on the page again and write them to the list.
But the data in the list is not updated, and the script continues to work with the links that were on the page before scrolling down.
By the way, I set up a check so that the script keeps running until the last announcement on the site appears in the list (I found its link in advance).
How can this problem be corrected?
def get_link_info():
    try:
        url = "https://kvartiry-bolgarii.ru/"
        driver = webdriver.Chrome(
            executable_path=r'C:\Users\kk\Desktop\scrape_house\drivers\chromedriver.exe',
            options=options
        )
        driver.get(url)
        req = requests.get(url)
        req.encoding = 'utf8'
        soup = BeautifulSoup(req.text, "lxml")
        articles = soup.find_all("div", class_="content")
        links_urls = []
        for article in articles:
            house_url = article.find("a").get("href")
            links_urls.append(house_url)
        #print(links_urls)
        first_link_number = links_urls[-2].split("-")[-1]
        first_link_number = first_link_number[1:]
        #print(first_link_number)
        last_link_number = links_urls[-1].split("-")[-1]
        last_link_number = last_link_number[1:]
        #print(last_link_number)
        html = driver.find_element_by_tag_name('html')
        html.send_keys(Keys.END)
        check = "https://kvartiry-bolgarii.ru/kvartira-v-elitnom-komplekse-s-unikalynym-sadom-o21751"
        for a in links_urls:
            if a != check:
                for article in articles:
                    house_url = article.find("a").get("href")
                    links_urls.append(house_url)
                html = driver.find_element_by_tag_name('html')
                html.send_keys(Keys.END)
                print(links_urls[-1])
            else:
                print(links_urls[0], links_urls[-1])
                print("all links are ready")
Some pointers: you don't need to mix Selenium, requests, and BeautifulSoup; Selenium alone is enough. When you are scrolling infinitely, you need to remove duplicate elements before adding them to your list.
You can try this. This should work.
from selenium import webdriver
import time
def get_link_info():
    all_links = []
    try:
        driver = webdriver.Chrome(executable_path='C:/chromedriver.exe')
        driver.get('https://kvartiry-bolgarii.ru/')
        time.sleep(3)
        old_links = set()  # Empty Set
        while True:
            # Scroll to get more ads
            driver.execute_script("window.scrollBy(0,3825)", "")
            # Wait for new ads to load
            time.sleep(8)
            links_divs = driver.find_elements_by_xpath('//div[@class="content"]//a')  # Find Elements
            ans = set(links_divs) - set(old_links)  # Remove old elements
            for link in ans:
                # Scroll to the link.
                driver.execute_script("arguments[0].scrollIntoView();", link)
                fir = link.get_attribute('href')
                all_links.append(fir)
            # Remove Duplicates
            old_links = links_divs
    except Exception as e:
        raise e

get_link_info()
I wrote a Python script for web scraping so that I can import the data from Flipkart.
I need to load multiple pages so that I can import many products, but right now only one product page is coming.
from urllib.request import urlopen as uReq
from requests import get
from bs4 import BeautifulSoup as soup
import tablib
my_url = 'https://www.xxxxxx.com/food-processors/pr?sid=j9e%2Cm38%2Crj3&page=1'
uClient2 = uReq(my_url)
page_html = uClient2.read()
uClient2.close()
page_soup = soup(page_html, "html.parser")
containers11 = page_soup.findAll("div",{"class":"_3O0U0u"})
filename = "FoodProcessor.csv"
f = open(filename, "w", encoding='utf-8-sig')
headers = "Product, Price, Description \n"
f.write(headers)
for container in containers11:
    title_container = container.findAll("div",{"class":"_3wU53n"})
    product_name = title_container[0].text
    price_con = container.findAll("div",{"class":"_1vC4OE _2rQ-NK"})
    price = price_con[0].text
    description_container = container.findAll("ul",{"class":"vFw0gD"})
    product_description = description_container[0].text
    print("Product: " + product_name)
    print("Price: " + price)
    print("Description: " + product_description)
    f.write(product_name + "," + price.replace(",","") +"," + product_description +"\n")
f.close()
You have to check whether the next-page button exists or not. If it does, return True, go to that next page, and start scraping; if it doesn't, return False and move to the next container. Check the class name of that button first.
# to check if a pagination exists on the page:
def go_next_page():
    try:
        button = driver.find_element_by_xpath('//a[@class="<class name>"]')
        return True, button
    except NoSuchElementException:
        return False, None
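For completeness, a minimal usage sketch for the helper above (untested; scrape_current_page() is a hypothetical placeholder for the parsing you already have, and the class name inside go_next_page() still has to be filled in from Flipkart's live markup):
import time

while True:
    scrape_current_page()              # hypothetical: your existing per-page parsing
    has_next, next_button = go_next_page()
    if not has_next:
        break                          # no pagination button left, so stop
    next_button.click()                # move to the next page
    time.sleep(3)                      # crude wait; an explicit wait would be more robust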
You can first get the number of pages available, then iterate over each of the pages and parse the data respectively.
For example, if you change the URL with respect to the page:
'https://www.flipkart.com/food-processors/pr?sid=j9e%2Cm38%2Crj3&page=1' which points to page 1
'https://www.flipkart.com/food-processors/pr?sid=j9e%2Cm38%2Crj3&page=2' which points to page 2
try:
    next_btn = driver.find_element_by_xpath("//a//span[text()='Next']")
    next_btn.click()
except ElementClickInterceptedException as ec:
    classes = "_3ighFh"
    overlay = driver.find_element_by_xpath("(//div[@class='{}'])[last()]".format(classes))
    driver.execute_script("arguments[0].style.visibility = 'hidden'", overlay)
    next_btn = driver.find_element_by_xpath("//a//span[text()='Next']")
    next_btn.click()
except Exception as e:
    print(str(e.msg()))
    break
except TimeoutException:
    print("Page Timed Out")
    driver.quit()
For me, the easiest way is to add an extra loop with the "page" variable:
# just check the number of the last page on the website
page = 1
while page != 10:
    print(f'Scraping page: {page}')
    my_url = f'https://www.xxxxxx.com/food-processors/pr?sid=j9e%2Cm38%2Crj3&page={page}'
    # here add the for loop you already have
    page += 1
This method should work.
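Concretely, plugging the page loop into the question's parsing code might look roughly like this (the Flipkart URL is the one from the other answer above, and the container class names come from the question; both may have changed on the site since):
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

for page_number in range(1, 11):  # upper bound is a placeholder; read the real count from the site if needed
    my_url = f'https://www.flipkart.com/food-processors/pr?sid=j9e%2Cm38%2Crj3&page={page_number}'
    page_soup = soup(uReq(my_url).read(), "html.parser")
    containers11 = page_soup.findAll("div", {"class": "_3O0U0u"})
    print(f"Page {page_number}: {len(containers11)} product containers")
    # ...run the existing per-container scraping loop here...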
This is a loop for scraping several elements. Sometimes a price isn't found. Instead of passing through the except, I need to print/write a value for those times when no price is found. The reason being that when it just passes, the variable values get mismatched when printing (title, link, image, price). Hopefully you can see my logic below in what I'm trying to accomplish. I'm also attaching a screenshot so you can see what I mean.
#finds titles
deal_title = browser.find_elements_by_xpath("//a[@id='dealTitle']/span")
titles = []
for title in deal_title:
    titles.append(title.text)
#finds links
deal_link = browser.find_elements_by_xpath("//div[@class='a-row dealDetailContainer']/div/a[@id='dealTitle']")
links = []
for link in deal_link:
    links.append(link.get_attribute('href'))
#finds images
deal_image = browser.find_elements_by_xpath("//a[@id='dealImage']/div/div/div/img")
images = []
for image in deal_image:
    images.append(image.get_attribute('src'))
try:
    deal_price = browser.find_elements_by_xpath("//div[@class='a-row priceBlock unitLineHeight']/span")
    prices = []
    for price in deal_price:
        prices.append(price.text)
except NoSuchElementException:
    price = ("PRINT/WRITE THIS TEXT INSTEAD OF PASSING")
#writes to html
for title, link, image, price in zip(titles, links, images, prices):
    f.write("<tr class='border'><td class='image'>" + "<img src=" + image + "></td>" + "<td class='title'>" + title + "</td><td class='price'>" + price + "</td></tr>")
import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
capabilities = {
    'browserName': 'chrome',
    'chromeOptions': {
        'useAutomationExtension': False,
        'forceDevToolsScreenshot': True,
        'args': ['--start-maximized', '--disable-infobars']
    }
}
driver = webdriver.Chrome(executable_path='./chromedriver_2.38.exe', desired_capabilities=capabilities)
driver.get("""https://www.amazon.com/gp/goldbox/ref=gbps_ftr_s-4_bedf_page_10?
gb_f_deals1=enforcedCategories:2972638011,
dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,
includedAccessTypes:,page:10,sortOrder:BY_SCORE,
dealsPerPage:32&pf_rd_p=afc45143-5c9c-4b30-8d5c-d838e760bedf&
pf_rd_s=slot-4&pf_rd_t=701&pf_rd_i=gb_main&
pf_rd_m=ATVPDKIKX0DER&pf_rd_r=ZDV4YBQJFDVR3PAY4ZBS&ie=UTF8""")
time.sleep(15)
golds = driver.find_elements_by_css_selector(".widgetContainer #widgetContent > div.singleCell")
print("found %d golds" % len(golds))
template = """\
<tr class="border">
<td class="image"><img src="{0}"></td>\
<td class="title">{2}</td>\
<td class="price">{3}</td>
</tr>"""
lines = []
for gold in golds:
    goldInfo = {}
    goldInfo['title'] = gold.find_element_by_css_selector('#dealTitle > span').text
    goldInfo['link'] = gold.find_element_by_css_selector('#dealTitle').get_attribute('href')
    goldInfo['image'] = gold.find_element_by_css_selector('#dealImage img').get_attribute('src')
    try:
        goldInfo['price'] = gold.find_element_by_css_selector('.priceBlock > span').text
    except NoSuchElementException:
        goldInfo['price'] = 'No price display'
    print(goldInfo['title'])
    line = template.format(goldInfo['image'], goldInfo['link'], goldInfo['title'], goldInfo['price'])
    lines.append(line)
html = """\
<html>
<body>
<table>
{0}
</table>
</body>
</html>\
"""
f = open('./result.html', 'w')
f.write(html.format('\n'.join(lines)))
f.close()
If I understood it correctly, you have a problem with some page elements not loading "on time".
The elements you want to scrape may not have loaded yet when you are reading them.
To prevent this, you can use explicit waits (the script will wait a specified amount of time until the specified element loads).
When using this, there is a smaller chance that you miss some values.
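As a minimal sketch (assuming the browser instance and the deal-title XPath from the question), an explicit wait could look like this; it blocks for up to 15 seconds until the elements are present instead of relying on a fixed sleep:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 15)
deal_title = wait.until(
    EC.presence_of_all_elements_located((By.XPATH, "//a[@id='dealTitle']/span"))
)
titles = [title.text for title in deal_title]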
Yes, you can skip the price as well. I am providing an alternative approach where you can create a list of the available prices and the respective images as follows:
Code Block:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
browser=webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
browser.get("https://www.amazon.com/gp/goldbox/ref=gbps_ftr_s-4_bedf_page_10?gb_f_deals1=enforcedCategories:2972638011,dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,includedAccessTypes:,page:10,sortOrder:BY_SCORE,dealsPerPage:32&pf_rd_p=afc45143-5c9c-4b30-8d5c-d838e760bedf&pf_rd_s=slot-4&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=ZDV4YBQJFDVR3PAY4ZBS&ie=UTF8")
#finds images
deal_image = browser.find_elements_by_xpath("//div[@class='a-row priceBlock unitLineHeight']/span//preceding::img[1]")
images = []
for image in deal_image:
    images.append(image.get_attribute('src'))
#finds prices
deal_price = browser.find_elements_by_xpath("//div[@class='a-row priceBlock unitLineHeight']/span")
prices = []
for price in deal_price:
    prices.append(price.text)
#print the information
for image, price in zip(images, prices):
    print(image, price)
Console Output:
https://images-na.ssl-images-amazon.com/images/I/31zt-ovKJqL._AA210_.jpg $9.25
https://images-na.ssl-images-amazon.com/images/I/610%2BKAfr72L._AA210_.jpg $15.89
https://images-na.ssl-images-amazon.com/images/I/41whkQ1m0uL._AA210_.jpg $31.49
https://images-na.ssl-images-amazon.com/images/I/41cAbUWEdoL._AA210_.jpg $259.58 - $782.99
https://images-na.ssl-images-amazon.com/images/I/51raHLFC8wL._AA210_.jpg $139.56
https://images-na.ssl-images-amazon.com/images/I/41fuZZwdruL._AA210_.jpg $41.24
https://images-na.ssl-images-amazon.com/images/I/51N2rdMSh0L._AA210_.jpg $19.50 - $20.99
https://images-na.ssl-images-amazon.com/images/I/515DbJhCtOL._AA210_.jpg $22.97
https://images-na.ssl-images-amazon.com/images/I/51OzOZrj1rL._AA210_.jpg $109.95
https://images-na.ssl-images-amazon.com/images/I/31-QDRkNbhL._AA210_.jpg $15.80
https://images-na.ssl-images-amazon.com/images/I/41vXJ9fvcIL._AA210_.jpg $88.99
https://images-na.ssl-images-amazon.com/images/I/51fKqo2YfcL._AA210_.jpg $21.85
https://images-na.ssl-images-amazon.com/images/I/31GcGUXz9TL._AA210_.jpg $220.99 - $241.99
https://images-na.ssl-images-amazon.com/images/I/41sROkWjnpL._AA210_.jpg $40.48
https://images-na.ssl-images-amazon.com/images/I/51vXMFtZajL._AA210_.jpg $22.72
https://images-na.ssl-images-amazon.com/images/I/512s5ZrjoFL._AA210_.jpg $51.99
https://images-na.ssl-images-amazon.com/images/I/51A8Nfvf8eL._AA210_.jpg $8.30
https://images-na.ssl-images-amazon.com/images/I/51aDac6YN5L._AA210_.jpg $18.53
https://images-na.ssl-images-amazon.com/images/I/31SQON%2BiOBL._AA210_.jpg $10.07
Link:
https://www.amazon.com/gp/goldbox/ref=gbps_ftr_s-4_bedf_page_10?gb_f_deals1=enforcedCategories:2972638011,dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,includedAccessTypes:,page:10,sortOrder:BY_SCORE,dealsPerPage:32&pf_rd_p=afc45143-5c9c-4b30-8d5c-d838e760bedf&pf_rd_s=slot-4&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=ZDV4YBQJFDVR3PAY4ZBS&ie=UTF8
Python Knowledge: beginner
I managed to create a script to scrape contact information. The flow I followed, since I am a beginner, was to extract all the first links and copy them to a text file, which is then used in link = browser.find_element_by_link_text(str(link_text)). Scraping of contact details has been confirmed working (based on my separate run). The problem is that after clicking the first links, it won't go on to click the links inside them, hence it cannot scrape the contact info.
What is wrong with my script? Please bear in mind I am a beginner, so my script is a little manual and lengthy.
Thanks very much!!!
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import requests
from bs4 import BeautifulSoup
import urllib
import re
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import csv, time, lxml
######################### open file list ####################################
testfile = open("category.txt") # this is where I saved the category
readfile = testfile.read()
readfilesplit = readfile.split("\n")
############################### end ###################################
################### open browser ###############################
browser = webdriver.Firefox()
browser.get('http://aucklandtradesmen.co.nz/')
####################### end ###################################
link_texts = readfilesplit
for link_text in link_texts:
    link = browser.find_element_by_link_text(str(link_text))
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".add-listing")))
    link.click() #click link
    time.sleep(5)
    print "-------------------------------------------------------------------------------------------------"
    print("Getting listings for '%s'" % link_text)
    ################# get list name #######################
    urlNoList = 'http://aucklandtradesmen.co.nz/home-mainmenu-1.html'
    r = requests.get(browser.current_url)
    if (urlNoList != browser.current_url):
        soup = BeautifulSoup(r.content, 'html.parser')
        g_data = soup.find_all("div", {"class":"listing-summary"})
        pageRange = soup.find_all("span", {"class":"xlistings"})
        pageR = [pageRange[0].text]
        pageMax = str(pageR)[-4:-2] # get max item for lists
        X = str(pageMax).replace('nd', '0')
        # print "Number of listings: ", X
        Y = int(X) #convert string to int
        print "Number of listings: ", Y
        for item in g_data:
            try:
                listingNames = item.contents[1].text
                lstList = []
                lstList[len(lstList):] = [listingNames]
                replStr = re.sub(r"u'", "'",str(lstList)) #strip u' char
                replStr1 = re.sub(r"\s+'", "'",str(replStr)) #strip space and '
                replStr2 = re.sub(r"\sFeatured", "",str(replStr1)) #strip Featured string
                print "Cleaned string: ", replStr2
                ################ SCRAPE INFO ################
                ################### This is where the code is not executing #######################
                count = 0
                while (count < Y):
                    for info in replStr2:
                        link2 = browser.find_element_by_link_text(str(info))
                        time.sleep(10)
                        link2.click()
                        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "#rating-msg")))
                        print "count", count
                        count+= 1
                        print("Contact info for: '%s'" % link_text)
                        r2 = requests.get(browser.current_url)
                        soup2 = BeautifulSoup(r2.content, 'html.parser')
                        g_data2 = soup.find_all("div", {"class":"fields"})
                        for item2 in g_data2:
                            # print item.contents[0]
                            print item2.contents[0].text
                            print item2.contents[1].text
                            print item2.contents[2].text
                            print item2.contents[3].text
                            print item2.contents[4].text
                            print item2.contents[5].text
                            print item2.contents[6].text
                            print item2.contents[7].text
                            print item2.contents[8].text
                        browser.back()
                        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".add-listing")))
                ################### END ---- This is where the code is not executing END ---#######################
                ############ END SCRAPE INFO ####################
            except NoSuchElementException:
                browser.back()
                WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))
    else:
        browser.back()
        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))
        print "Number of listings: 0"
    browser.back()
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))
By the way this is some of the result:
-------------------------------------------------------------------------------------------------
Getting listings for 'Plumbers'
Number of listings: 5
Cleaned string: ['Hydroflame Plumbing & Gas Ltd']
Cleaned string: ['Osborne Plumbing Ltd']
Cleaned string: ['Plumbers Auckland Central']
Cleaned string: ['Griffiths Plumbing']
Cleaned string: ['Plumber Auckland']
-------------------------------------------------------------------------------------------------
Getting listings for 'Professional Services'
Number of listings: 2
Cleaned string: ['North Shore Chiropractor']
Cleaned string: ['Psychotherapy Werks - Rob Hunter']
-------------------------------------------------------------------------------------------------
Getting listings for 'Property Maintenance'
Number of listings: 7
Cleaned string: ['Auckland Tree Services']
Cleaned string: ['Bob the Tree Man']
Cleaned string: ['Flawless House Washing & Drain Unblocking']
Cleaned string: ['Yardiez']
Cleaned string: ['Build Corp Apartments Albany']
Cleaned string: ['Auckland Trellis']
Cleaned string: ['Landscape Design']
What I would do is change the logic a bit. Here's the logic flow I would suggest you use; this will eliminate writing the links out to a file and speed up the script.
1. Navigate to http://aucklandtradesmen.co.nz/
2. Grab all elements using CSS selector "#index a" and store the attribute "href" of each in an array of strings (links to each category page)
3. Loop through the href array
   3.1. Navigate to href
        3.1.1. Grab all elements using CSS selector "div.listing-summary a" and store the .text of each (company names)
        3.1.2. If an element .by_link_text("Next") exists, click it and return to 3.1.1.
If you want business contact info off of the company pages, you would want to store the href in 3.1.1. and then loop through that list and grab what you want off the page.
Sorry about the weirdness of the formatting of the list. It won't let me indent more than one level.
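A rough, untested sketch of that flow (using the "#index a" and "div.listing-summary a" selectors named above and the same Selenium API style as the question; whether those selectors still match the live site is an assumption):
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

browser = webdriver.Firefox()
browser.get('http://aucklandtradesmen.co.nz/')

# step 2: collect every category href up front
category_links = [a.get_attribute('href')
                  for a in browser.find_elements_by_css_selector('#index a')]

company_pages = []
for href in category_links:                       # step 3
    browser.get(href)                             # step 3.1
    while True:
        # step 3.1.1: store the company names and hrefs on this listing page
        for a in browser.find_elements_by_css_selector('div.listing-summary a'):
            company_pages.append((a.text, a.get_attribute('href')))
        # step 3.1.2: follow the "Next" pagination link if it exists
        try:
            browser.find_element_by_link_text('Next').click()
            time.sleep(2)
        except NoSuchElementException:
            break

# contact details can then be scraped by visiting each stored company href
print("collected %d company pages" % len(company_pages))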
Okay, I found a solution after thinking about @jeffC's suggestion:
Extract the href values and append them to the base URL, which is http://aucklandtradesmen.co.nz. So, for example, if the extracted href is /home-mainmenu-1/alarms-a-security/armed-alarms-ltd-.html, I tell the browser to navigate to that URL, and then I can do whatever I want in the current page.
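A tiny sketch of that follow-up idea, joining the relative href onto the base URL before navigating (the example href is the one quoted above, and browser is the Selenium instance from the earlier script):
base_url = 'http://aucklandtradesmen.co.nz'
href = '/home-mainmenu-1/alarms-a-security/armed-alarms-ltd-.html'
browser.get(base_url + href)  # then scrape whatever is needed from the company page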