How to find element by class in BS4 (python 3) - python

Website: https://ca.cartier.com/en-ca/collections/jewelry/categories.viewall.html
Looking to scrape all information for each product and copy into excel file for further charting/analysis.
Been following the documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
Work so far returns nothing:
import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString

url = "https://ca.cartier.com/en-ca/collections/jewelry/categories.viewall.html"
headers = {'User-Agent': 'Mozilla/5.0'}

page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

lst = []
for my_items in soup.find_all("div", attrs={"class": "grid-item"}):
    print(my_items)

The page is loaded dynamically, so requests alone won't see the rendered products. However, the data is available by sending a GET request to:
https://ca.cartier.com/en-ca/collections/jewelry/categories.productlistingservletv2.json
The response is JSON, which requests' .json() method decodes into a Python dictionary (dict) whose keys/values you can access directly:
>>> import requests
>>>
>>>
>>> URL = "https://ca.cartier.com/en-ca/collections/jewelry/categories.productlistingservletv2.json"
>>> response = requests.get(URL).json()
>>> print(type(response))
<class 'dict'>
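From there you can drill into the dictionary. A hypothetical sketch (the key names below, such as "products", are assumptions; inspect the actual response first to learn its structure):

import requests

URL = "https://ca.cartier.com/en-ca/collections/jewelry/categories.productlistingservletv2.json"
response = requests.get(URL).json()

# inspect the top-level keys to learn the structure
print(response.keys())

# hypothetical: if the payload exposes a list of products under a "products" key,
# iterate over it like any other list of dicts
for product in response.get("products", []):
    print(product)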
An alternative would be to use Selenium to scrape the page.
Install it with: pip install selenium.
Download the correct ChromeDriver from here.
In your example:
from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup
URL = "https://ca.cartier.com/en-ca/collections/jewelry/categories.viewall.html"
driver = webdriver.Chrome(r"c:\path\to\chromedriver.exe")
driver.get(URL)
# Wait for the page to fully render
sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
for tag in soup.find_all("div", attrs={"class": "grid-item"}):
    print(tag)

There's no need to use selenium as data is present in the HTML.
In order to collect all the information and go through all the pages you need to use pagination.
A while loop is a good fit for going through all possible pages: no matter how many there are, it keeps going dynamically until none are left.
Pagination is done like this:
First, find the URL parameter responsible for the page number; in our case it is "page=1", which becomes visible in the URL when you click the "LOAD MORE" button.
For ease of working with the query and page parameters, they can be moved into a params dict:
params = {
    "page": 1,                         # page number
    "cgid": "Jewelry_AllCollections"   # query
}
To go to the next page, check for the presence of the "LOAD MORE" button; the CSS selector "#product-search-results .button--fluid" matches it.
If the button is not present, the loop terminates. Otherwise, increment the page URL parameter by 1, which paginates to the next page:
if soup.select_one("#product-search-results .button--fluid"):
    params['page'] += 1
else:
    break
Save data to Excel using pandas to_excel() and make sure you pip install openpyxl:
pd.DataFrame(data=data).to_excel(
    'jewelry-items.xlsx',
    sheet_name='Jewelry_Sheet',
    index=False
)
See how it works in online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml
import pandas as pd

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}

params = {
    "page": 1,                         # page number
    "cgid": "Jewelry_AllCollections"   # query
}

data = []

while True:
    page = requests.get('https://www.cartier.com/en-ca/jewelry/all-collections/?', params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, 'lxml')

    print(f"Extracting page: {params['page']}")
    print("-" * 10)

    for result in soup.select(".product-tile__body"):
        title = result.select_one("#product-search-results .heading-type").text.strip()
        material = result.select_one(".product-tile__material").text.strip()

        try:
            price = result.select_one(".value").text.strip()
        except AttributeError:
            price = None

        data.append({
            "title": title,
            "material": material,
            "price": price
        })

    if soup.select_one("#product-search-results .button--fluid"):
        params['page'] += 1
    else:
        break

# save to excel
pd.DataFrame(data=data).to_excel('jewelry-items.xlsx', sheet_name='Jewelry_Sheet', index=False)

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "Clash de Cartier necklace Diamonds",
"material": "Rose gold, diamonds",
"price": "8,250.00 C$"
},
{
"title": "Juste un Clou bracelet",
"material": "Rose gold, diamonds",
"price": "25,000.00 C$"
},
{
"title": "Solitaire 1895",
"material": "Platinum, rubies, diamond",
"price": "25,000.00 C$"
},
{
"title": "Panthère de Cartier ring",
"material": "White gold, tsavorite garnets, onyx",
"price": "8,800.00 C$"
},
{
"title": "Symbol pendant",
"material": "Yellow gold",
"price": "760.00 C$"
},
# ...
]
There's a "13 ways to scrape any public data from any website" blog post if you want to know more about web scraping.

Related

I'm trying to web scrape eBay using Python and BeautifulSoup, but I'm getting a "list index out of range" error

As in the title, I'm trying to write an eBay web-scraping program, yet when I try to find the price it raises a list index error, even though it works for getting the product name.
The url is: https://www.ebay.com.au/sch/i.html?_from=R40&_nkw=switch&_sacat=0&_pgn=1
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = "https://www.ebay.com.au/sch/i.html?_from=R40&_nkw=switch&_sacat=0&_pgn=1"

# Open Collection
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

# grabs each product
containers = page_soup.findAll("div", {"class" : "s-item__wrapper clearfix"})

filename = "EbayWebscraping.csv"
f = open(filename, "w")
headers = "product_name, quality"

for container in containers:
    title_container = container.findAll('h3', {'class' : 's-item__title'})
    product_name = title_container[0].text

    # Where the problem is #
    price_container = container.findAll('span', {'class' : 's-item__price'})
    price = price_container[0].text

    print('Product: ' + product_name)
    print('Price: ' + price)
If you look at the containers, the one at index 0 has no product or price info, so you can start from index 1; alternatively, you can use try/except instead.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.ebay.com.au/sch/i.html?_from=R40&_nkw=switch&_sacat=0&_pgn=1")
soup = BeautifulSoup(page.text, "lxml")

containers = soup.findAll("div", {"class" : "s-item__wrapper clearfix"})[1:]

for container in containers:
    print(container.find('h3', {'class' : 's-item__title'}).text)
    print(container.find("span", class_="s-item__price").text)
Output:
30 in 1 Game Collection Nintendo Switch Brand New Sealed
AU $47.00
Street Fighter 30th Anniversary Collection Nintendo Switch Brand New Sealed
AU $47.00
For Nintendo Switch Case ZUSLAB Clear Slim Soft Heavy Duty Shockproof Cover
AU $9.99 to AU $16.95
.....
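If you prefer the try/except approach mentioned above instead of slicing off the first container, a sketch using the same selectors could look like this:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.ebay.com.au/sch/i.html?_from=R40&_nkw=switch&_sacat=0&_pgn=1")
soup = BeautifulSoup(page.text, "lxml")

for container in soup.find_all("div", {"class" : "s-item__wrapper clearfix"}):
    try:
        print(container.find('h3', {'class' : 's-item__title'}).text)
        print(container.find("span", class_="s-item__price").text)
    except AttributeError:
        # the first container is typically a placeholder tile without a title/price
        continue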
You can also check if the selector is present before doing further processing:
if container.findAll('span', {'class' : 's-item__price'}):
    # do something
You also don't need to access the [0] index; .text alone works. Additionally, there's no need to call findAll again, since each container already holds the title and price elements inside it. Think of the container as a matryoshka doll if that makes more sense.
You just have to call the title and price selectors, e.g.:
containers = page_soup.findAll("div", {"class" : "s-item__wrapper clearfix"})

for container in containers:
    product_name = container.find('h3', {'class' : 's-item__title'}).text
    price = container.find('span', {'class' : 's-item__price'}).text
Code that paginates through all pages and example in online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}

params = {
    '_nkw': 'switch',  # search query
    '_pgn': 1          # page number
}

data = []

while True:
    page = requests.get('https://www.ebay.com/sch/i.html', params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, 'lxml')

    print(f"Extracting page: {params['_pgn']}")
    print("-" * 10)

    for products in soup.select(".s-item__info"):
        title = products.select_one(".s-item__title span").text
        price = products.select_one(".s-item__price").text
        link = products.select_one(".s-item__link")["href"]

        data.append({
            "title": title,
            "price": price,
            "link": link
        })

    if soup.select_one(".pagination__next"):
        params['_pgn'] += 1
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output
Extracting page: 1
----------
[
{
"title": "Shop on eBay",
"price": "$20.00",
"link": "https://ebay.com/itm/123456?hash=item28caef0a3a:g:E3kAAOSwlGJiMikD&amdata=enc%3AAQAHAAAAsJoWXGf0hxNZspTmhb8%2FTJCCurAWCHuXJ2Xi3S9cwXL6BX04zSEiVaDMCvsUbApftgXEAHGJU1ZGugZO%2FnW1U7Gb6vgoL%2BmXlqCbLkwoZfF3AUAK8YvJ5B4%2BnhFA7ID4dxpYs4jjExEnN5SR2g1mQe7QtLkmGt%2FZ%2FbH2W62cXPuKbf550ExbnBPO2QJyZTXYCuw5KVkMdFMDuoB4p3FwJKcSPzez5kyQyVjyiIq6PB2q%7Ctkp%3ABlBMULq7kqyXYA"
},
{
"title": "Jabra Elite 7 Pro - Black Certified Refurbished",
"price": "$82.99",
"link": "https://www.ebay.com/itm/165621993671?epid=12050643207&hash=item268fd710c7:g:gMwAAOSwx8Bi9Fwg&amdata=enc%3AAQAHAAAA4NGq89JefbLJPItXeQ93BWhuE9Wt3pRHvU92HE2wiaGKAUlhQ8hDKu9iP2m5gdNQc8t8ujFSUwXJSyCxrnjh9qaxVXN0s0V7clbWiZTPr7Co3AwECpNLit29NfC%2BXbQxEv7kePJokjM9wnHv%2BAamoTlPl0K8BHa0S3FVrb7IUn9s%2FmvdzTiGUd4DHYNdIEQeFNK7zqB8%2BlWrukvfUz62JemzooE1UYtLbCtQwfIDP1F2GbOL4DoRwHXynUtpduYPA8TX6qZOv8eL44j4hNnP6%2BjGBaDGCReJ6ld13xxhYEUf%7Ctkp%3ABFBM3qnT0f5g"
},
{
"title": "New Listingnintendo switch bundle ",
"price": "$225.00",
"link": "https://www.ebay.com/itm/354344900745?hash=item52809a1889:g:egsAAOSw-qZjUQl-&amdata=enc%3AAQAHAAAA4MkbjLSYGoCVhjI%2BBE%2F1cIoqAfUyH73WJdSL7XugI%2BMtaCzRdexKqk3SnxM3PT5yMHSrChuJdcLC6ESDVvNs2j01yTzx8Cl9i9CQbV89Gp9tzPQNIaBGkVwSh989DJ4lmSmCKywnPQ9yLQqY3fz96kBJbbZwGd63yks4tTuZOiNcAl7PTriDOrVNHF%2FUXm3s18tajQeqtrZxW4pb8nWa5%2FtdmrwDphxTKmA9sONVXfKX5oFujpDxrwswe%2FgoJi2XGjGqe06ruHbzH295EHuRLUv4Tn0R2Kf7CKaman2IEpPo%7Ctkp%3ABFBM3qnT0f5g"
},
# ...
]
As an alternative, you can use Ebay Organic Results API from SerpApi. It's a paid API with a free plan that handles blocks and parsing on their backend.
Example code that paginates through all pages:
from serpapi import EbaySearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),  # serpapi api key
    "engine": "ebay",                 # search engine
    "ebay_domain": "ebay.com",        # ebay domain
    "_nkw": "switch",                 # search query
    "_pgn": 1,                        # page number
    "LH_Sold": "1"                    # shows sold items
}

search = EbaySearch(params)           # where data extraction happens

page_num = 0
data = []

while True:
    results = search.get_dict()       # JSON -> Python dict

    if "error" in results:
        print(results["error"])
        break

    for organic_result in results.get("organic_results", []):
        link = organic_result.get("link")
        price = organic_result.get("price")

        data.append({
            "price": price,
            "link": link
        })

    page_num += 1
    print(page_num)

    if "next" in results.get("pagination", {}):
        params['_pgn'] += 1
    else:
        break

print(json.dumps(data, indent=2))
Output:
[
{
"price": {
"raw": "$70.00",
"extracted": 70.0
},
"link": "https://www.ebay.com/itm/334599074264?hash=item4de7a8b1d8:g:Vy4AAOSwLLNjUK2i&amdata=enc%3AAQAHAAAAkKM1u%2BmETRpbgLxiKL9uymVFiae4NU2iJa00z6qQK4lyzoe477sEDhhVVjF39BDTAOJQ4PLP%2BoXj1xf5wH8Ja5v1oAmO%2FNRlSFlTK80FlnQkHpIYswiG%2BNH44f98M5LWkwgeOb%2FRVc9uU6Ep9HYaV9JV39LZFRiOJLOGgFvoRxLD4731y0VuzM%2BcPXThX7aXtA%3D%3D%7Ctkp%3ABk9SR4KOv9H-YA"
},
{
"price": {
"raw": "$169.95",
"extracted": 169.95
},
"link": "https://www.ebay.com/itm/185625421454?epid=4050657390&hash=item2b3823268e:g:WrIAAOSwPKdjPfvK&amdata=enc%3AAQAHAAAAoBkI9bwtrhJH9mDVPkHzYgem23XBXWHO%2FghvdNjkqq2RX%2BCoy33RIc%2FxXg%2BHWp0Y5jUL9%2BOfnpKyRshkZTRttODPLt%2Fu0VIfjunwr%2F6r9lKHiZ9w%2FnaITM0BTU0FeU1gKw2dERJwDKrzgCPNc%2FStsq0BdCUYNxQeLG4I1ezDBYZSseUv96U33wRLz%2BJ94pP6UgnCp2nj4oX3qFujBLsvG%2F8%3D%7Ctkp%3ABk9SR4KOv9H-YA"
},
# ...
]
Disclaimer, I work for SerpApi.

Scraping Ebay, working until I use it in sold items

I will use this code to explain my doubt:
Using the url without sold filter
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
url = "https://www.ebay.es/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=iphone+x&_sacat=0&LH_TitleDesc=0&_udlo=400&LH_Auction=1&_osacat=0&_odkw=Pok%C3%A9mon+card+Charizard+4%2F102&rt=nc"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", {"class": "s-item__info clearfix"})
print(len(results))
Output: 12
Then I use the URL with only sold items; I check the HTML and the class is the same.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
url = "https://www.ebay.es/sch/i.html?_from=R40&_nkw=iphone+x&_sacat=0&LH_TitleDesc=0&_udlo=400&LH_Auction=1&rt=nc&LH_Sold=1&LH_Complete=1"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", {"class": "s-item__info clearfix"})
print(len(results))
Output: 0
I tried different classes but I can never obtain anything.
Thanks.
It was a captcha problem. Thanks!
There are several reasons why the output may be empty.
It is often because the site thinks it is being accessed by a bot: the default user-agent in the requests library is python-requests. This can be prevented by passing your actual User-Agent in the headers, and it also seems to be the reason you got a CAPTCHA.
If passing a User-Agent alone doesn't work, the next step is to rotate user-agents, for example switching between PC, mobile, and tablet, as well as between browsers, e.g. Chrome, Firefox, Safari, Edge, and so on.
If request headers are still not enough, you can try using proxies (ideally residential) in combination with them.
An additional step is to use a CAPTCHA solver, for example 2captcha. It allows bypassing all possible CAPTCHAs depending on the target website.
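As a rough sketch of the user-agent rotation and proxy idea with requests (the user-agent strings and the proxy address below are placeholders to replace with your own):

import random
import requests

# placeholder User-Agent strings; rotate between desktop, mobile, and different browsers
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148",
]

headers = {"User-Agent": random.choice(user_agents)}

# placeholder proxy; residential proxies tend to get blocked less often
proxies = {"https": "http://username:password@proxy.example.com:8080"}

page = requests.get(
    "https://www.ebay.es/sch/i.html",
    params={"_nkw": "iphone x", "LH_Sold": "1"},
    headers=headers,
    # proxies=proxies,  # uncomment once you have a working proxy
    timeout=30,
)
print(page.status_code)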
Check the code using BeautifulSoup in online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
}

params = {
    '_nkw': 'iphone_x',  # search query
    'LH_Sold': '1',      # shows sold items
    '_pgn': 1            # page number
}

data = []
limit = 10  # page limit (if needed)

while True:
    page = requests.get('https://www.ebay.es/sch/i.html', params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, 'lxml')

    print(f"Extracting page: {params['_pgn']}")
    print("-" * 10)

    for products in soup.select(".s-item__info"):
        title = products.select_one(".s-item__title span").text
        price = products.select_one(".s-item__price").text

        data.append({
            "title": title,
            "price": price
        })

    if params['_pgn'] == limit:
        break

    if soup.select_one(".pagination__next"):
        params['_pgn'] += 1
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "Apple iPhone X 64 GB y 256 GB Grado A++ Desbloqueado - Excelente Estado Todos los Colores",
"price": "234,52 EUR"
},
{
"title": "Funda de silicona a prueba de golpes para iPhone 11 Pro Max 14Pro 8 7 SE 2022 colores",
"price": "4,56 EUR"
},
{
"title": "Apple iPhone X 64 GB 256 GB gris plateado sin contrato COMO NUEVO SIN MANCHA Wow ",
"price": "377,00 EUR a 409,00 EUR"
},
{
"title": "Funda transparente de silicona completa a prueba de golpes para iPhone 11 12 13 14 PRO MAX Mini X XR 8",
"price": "1,13 EUR a 4,06 EUR"
},
{
"title": "Apple iPhone X - 256 GB - Plateado (Desbloqueado) (Leer descripción) FA1065",
"price": "163,88 EUR"
},
other results ...
]
You can also use the official eBay Finding API, which has a limit of 5,000 requests per day, or a third-party API like Ebay Organic Results API from SerpApi. It's a paid API with a free plan that handles blocks and parsing on their backend.
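For the Finding API route, a minimal sketch of the findItemsByKeywords operation is below. It assumes you have registered for an eBay App ID ("YOUR-APP-ID" is a placeholder), and the nested response structure shown is an assumption to verify against the API docs before parsing deeper. The SerpApi route follows after it.

import requests

# hypothetical sketch of the official eBay Finding API (findItemsByKeywords)
params = {
    "OPERATION-NAME": "findItemsByKeywords",
    "SERVICE-VERSION": "1.0.0",
    "SECURITY-APPNAME": "YOUR-APP-ID",   # placeholder: your eBay App ID
    "RESPONSE-DATA-FORMAT": "JSON",
    "keywords": "iphone x",
}

response = requests.get(
    "https://svcs.ebay.com/services/search/FindingService/v1",
    params=params,
    timeout=30,
).json()

# the JSON wraps most values in single-element lists; verify before relying on it
items = response["findItemsByKeywordsResponse"][0]["searchResult"][0].get("item", [])
for item in items:
    print(item["title"][0], item["sellingStatus"][0]["currentPrice"][0]["__value__"])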
Example code with pagination:
from serpapi import EbaySearch
import json

params = {
    "api_key": "...",          # serpapi key, https://serpapi.com/manage-api-key
    "engine": "ebay",          # search engine
    "ebay_domain": "ebay.es",  # ebay domain
    "_nkw": "iphone_x",        # search query
    "LH_Sold": "1",            # shows sold items
    "_pgn": 1                  # page number
}

search = EbaySearch(params)    # where data extraction happens

page_num = 0
data = []

while True:
    results = search.get_dict()  # JSON -> Python dict

    if "error" in results:
        print(results["error"])
        break

    for organic_result in results.get("organic_results", []):
        title = organic_result.get("title")
        price = organic_result.get("price")

        data.append({
            "title": title,
            "price": price
        })

    page_num += 1
    print(page_num)

    if "next" in results.get("pagination", {}):
        params['_pgn'] += 1
    else:
        break

print(json.dumps(data, indent=2))
Output:
[
{
"title": "Apple iPhone X (10) Desbloqueado 64 GB/256 GB Gris espacial/Plateado - Excelente Estado",
"price": {
"raw": "297,34 EUR",
"extracted": 297.34
}
},
{
"title": "Nuevo anuncioApple iPhone X - 64GB - Bianco (Sbloccato).",
"price": {
"raw": "340,00 EUR",
"extracted": 340.0
}
},
{
"title": "Apple iPhone X - 256GB - Factory Unlocked - Good Condition",
"price": {
"raw": "230,80 EUR",
"extracted": 230.8
}
},
other results ...
]

Using Python to search google and store the websites into variables

I'm looking for a way to search Google with Python and store each website into a slot in a data list. I'm looking for something like the example code below.
search=input('->')
results=google.search((search),(10))
print results
In this case I want it to search Google for whatever is in the variable "search", 10 is the number of results I want to store in the variable, and finally the results are put on the screen with "print results".
I would appreciate any help or anything similar to what I want. Thanks.
As mentioned above, Google does provide an API for completing searches (https://developers.google.com/custom-search/json-api/v1/overview), and, as mentioned, depending on what you are trying to accomplish it can get quite expensive. Another option is to scrape the Google page. Below is an example I created using Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#) to scrape the Google results.
import urllib2
import xml.etree.ElementTree
from bs4 import BeautifulSoup  # install using 'pip install beautifulsoup4'

'''
Since spaces will not work in url parameters, the spaces have to be converted to '+'
ex) "example text" -> "example+text"
'''
def spacesToPluses(string):
    words = string.split(" ")
    convertedString = ""
    for i in range(0, len(words)):
        convertedString += words[i] + "+"
    return convertedString[0:len(convertedString)-1]

'''
Opens the url with the parameter included and reads it as a string
'''
def getRawGoogleResponse(url):
    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    headers = {'User-Agent': user_agent}  # Required for google to allow url request
    request = urllib2.Request(url, None, headers)
    response = urllib2.urlopen(request)
    rawResponse = response.read()
    return rawResponse

'''
Takes in the raw string representation and converts it into an easier to navigate object (Beautiful Soup)
'''
def getParsedGoogleResponse(url):
    rawResponse = getRawGoogleResponse(url)
    fullPage = BeautifulSoup(rawResponse, 'html.parser')
    return fullPage

'''
Finds all of the urls on a single page
'''
def getGoogleResultsOnPage(fullPage):
    searchResultContainers = fullPage.find_all("h3", {"class": "r"})  # the results are contained in an h3 element with the class 'r'
    pageUrls = []
    for container in searchResultContainers:  # get each link in the container
        fullUrl = container.find('a')['href']
        beginningOfUrl = fullUrl.index('http')
        pageUrls.append(fullUrl[beginningOfUrl:])  # chops off the extra bits google adds to the url
    return pageUrls

'''
Returns number of pages (max of 10)
'''
def getNumPages(basePage):
    navTable = basePage.find("table", {"id": "nav"})  # the nav table contains the number of pages (up to 10)
    pageNumbers = navTable.find_all("a", {"class": "fl"})
    lastPageNumber = int(pageNumbers[len(pageNumbers)-2].text)
    return lastPageNumber

'''
Loops through pages gathering url from each page
'''
def getAllGoogleSearchResults(search, numResults):
    baseUrl = "https://www.google.com/search?q=" + spacesToPluses(search)
    basePage = getParsedGoogleResponse(baseUrl)
    numPages = getNumPages(basePage)
    allUrls = []
    for i in range(0, numPages):
        completeUrl = baseUrl + "&start=" + str(i * 10)  # google uses the parameter 'start' to represent the url to start at (10 urls per page)
        page = getParsedGoogleResponse(completeUrl)
        for url in getGoogleResultsOnPage(page):
            allUrls.append(url)
    return allUrls[0:numResults]  # return just the number of results

def main():
    print(getAllGoogleSearchResults("even another test", 1))

main()
The solution works for the first 10 pages of Google results (or however many exist, up to 10). The URLs are returned in an array of string objects. The information is scraped by reading the response obtained with urllib2. Hope this helps.
A Google search page returns a maximum of 10 results by default; the num parameter in the params dict is responsible for this:
params = {
    "q": query,     # query
    "hl": "en",     # language
    "gl": "us",     # country of the search, US -> USA
    "start": 0,     # results offset; the first page starts at 0
    # "num": 100    # defines the maximum number of results to return
}
To get more data, you can paginate through all pages using an infinite while loop. Pagination is possible as long as the next-page button exists (determined by the presence of the CSS selector .d6cvqb a[id=pnnext] on the page). Increase the value of params["start"] by 10 to access the next page if it is present; otherwise, exit the while loop:
if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
You also need to keep in mind that most sites, including Google, do not like being scraped, and the request might be blocked if you use requests with its default user-agent, which is python-requests. An additional step could be to rotate user-agents, for example to switch between PC, mobile, and tablet, as well as between browsers, e.g. Chrome, Firefox, Safari, Edge, and so on.
Check code in online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml

query = input("Input your query: ")  # for example: "auto"

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": query,     # query
    "hl": "en",     # language
    "gl": "us",     # country of the search, US -> USA
    "start": 0,     # results offset; the first page starts at 0
    # "num": 100    # defines the maximum number of results to return
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

page_num = 0
website_data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = f'Title: {result.select_one("h3").text}'
        link = f'Link: {result.select_one("a")["href"]}'

        website_data.append({
            "title": title,
            "link": link,
        })

    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

print(json.dumps(website_data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "Title: Show Your Auto",
"link": "Link: http://www.showyourauto.com/vehicles/388/2002-ford-saleen-mustang-s281-extreme-speedster"
},
{
"title": "Title: Global Competition in the Auto Parts Industry: Hearings ...",
"link": "Link: https://books.google.com/books?id=dm7bjDjkrRQC&pg=PA2&lpg=PA2&dq=auto&source=bl&ots=sIf4ELozPN&sig=ACfU3U3xea1-cJl9hiQe8cpac2KLrIF20g&hl=en&sa=X&ved=2ahUKEwjWn7ukv6P7AhU3nGoFHSRxABY4jgIQ6AF6BAgEEAM"
},
{
"title": "Title: Issues relating to the domestic auto industry: hearings ...",
"link": "Link: https://books.google.com/books?id=fHX_MJobx3EC&pg=PA79&lpg=PA79&dq=auto&source=bl&ots=jcrwR-jwck&sig=ACfU3U0p0Wn6f-RU11U8Z0GtqMjTKd44ww&hl=en&sa=X&ved=2ahUKEwjWn7ukv6P7AhU3nGoFHSRxABY4jgIQ6AF6BAgaEAM"
},
# ...
]
You can also use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that it will bypass blocks (including CAPTCHA) from Google, no need to create the parser and maintain it.
Code example:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

query = input("Input your query: ")  # for example: "auto"

params = {
    "api_key": os.getenv("API_KEY"),  # serpapi key
    "engine": "google",               # serpapi parser engine
    "q": query,                       # search query
    "num": "100"                      # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()  # JSON -> Python dictionary

    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "page_num": page_num,
            "title": result.get("title"),
            "link": result.get("link"),
            "displayed_link": result.get("displayed_link"),
        })

    if "next_link" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output:
[
{
"page_num": 4,
"title": "San Francisco's JFK Drive to Remain Closed to Cars",
"link": "https://www.nbcbayarea.com/decision-2022/jfk-drive-san-francisco-election/3073470/",
"displayed_link": "https://www.nbcbayarea.com › decision-2022 › jfk-driv..."
},
{
"page_num": 4,
"title": "Self-Driving Cruise Cars Are Expanding to Most of SF, Says ...",
"link": "https://sfstandard.com/business/self-driving-cruise-cars-are-expanding-to-most-of-sf-says-ceo/",
"displayed_link": "https://sfstandard.com › business › self-driving-cruise-c..."
},
# ...
]

Why does soup only shows half of the chart I'm scraping?

I'm scraping from a google search but I can only get the first row of a two row chart on the right-hand side.
The search query is:
https://www.google.com/search?q=kegerators
I've noticed that using Inspect Element doesn't really help here, as BeautifulSoup seems to receive different code.
The code I have is:
htmltext=br.open(query).read()
soup=BeautifulSoup(htmltext)
search = soup.findAll("div", attrs={ "class" : "_cf" })
print search
Upon looking at the code (basically looking for "b>$" - as I know I should see 8 of those) I only get 4, which happen to be the top row of the chart.
This is the result of the search:
[<div class="_cf" style="overflow:hidden"><span class="_vf" style="height:86px;width:86px"><span class="_uf"></span><img class="_wf" src="http://t3.gstatic.com/shopping?q=tbn:ANd9GcRY5NBoY-anFlJUYExmil81vJG5i1nw6LqVu64lSjw8tSPBUEdh3JaiFix-gfSKMGtE2ZwX8w&usqp=CAc"/></span><div style="height:2.4em;overflow:hidden">EdgeStar Ultra Low Temp F...</div><div><b>$599.00</b></div><div><cite style="white-space:nowrap">Kegerator</cite></div></div>, <div class="_cf" style="overflow:hidden"><span class="_vf" style="height:86px;width:86px"><span class="_uf"></span><img class="_wf" src="http://t3.gstatic.com/shopping?q=tbn:ANd9GcRS4iCsD4EDV37Rg1kZf0nxFK3bYgYaWC-bxMv-ISg4dI8m-COU3ZHCZGs3FdJBK3npkpoE&usqp=CAc"/></span><div style="height:2.4em;overflow:hidden">Kegco K199SS‑2 D...</div><div><b>$539.99</b></div><div><cite style="white-space:nowrap">BeverageFa...</cite></div></div>, <div class="_cf" style="overflow:hidden"><span class="_vf" style="height:86px;width:86px"><span class="_uf"></span><img class="_wf" src="http://t2.gstatic.com/shopping?q=tbn:ANd9GcSkf6-jVZt34pd_6QyqZGre06VxszvFZX70-wUOEDRhEFhorX_Yek0oyr-5jvk8FNpj2KWusQ&usqp=CAc"/></span><div style="height:2.4em;overflow:hidden">EdgeStar Ultra Low Temp F...</div><div><b>$499.00</b></div><div><cite style="white-space:nowrap">Compact Ap...</cite></div></div>, <div class="_cf" style="overflow:hidden"><span class="_vf" style="height:86px;width:86px"><span class="_uf"></span><img class="_wf" src="http://t1.gstatic.com/shopping?q=tbn:ANd9GcTf56EQ6DVbOk02D7cLgVmlurU-2gNrhD6a74MnzQBWg1W290DTYQuj0sSUxQEbxo1XO6pB&usqp=CAc"/></span><div style="height:2.4em;overflow:hidden">FunTime Black Kegge...</div><div><b>$399.99</b></div><div><cite style="white-space:nowrap">Max Tool</cite></div></div>]
Is Google doing something strange here?
The reason why results might differ is that Google displays different results on each request, e.g. sometimes it could get 10 shopping results, sometimes 7 or 4.
Specifying gl (country, e.g: us), hl (language, e.g: en) query params could get exact or close to the exact result that you see in your browser.
Also, don't forget to specify a user-agent, otherwise, Google will block your requests eventually.
Code and example in the online IDE:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "buy coffe",  # intentional grammatical error to display right side shopping results
    "hl": "en",
    "gl": "us"
}

response = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(response.text, 'html.parser')

# scrapes both top and right side shopping results
for result in soup.select('.pla-hovercard-content-ellip'):
    title = result.select_one('.pymv4e').text
    link = result.select_one('.pla-hovercard-content-ellip a.tkXAec')['href']
    ad_link = f"https://www.googleadservices.com/pagead{result.select_one('.pla-hovercard-content-ellip a')['href']}"
    price = result.select_one('.qptdjc').text

    try:
        rating = result.select_one('.Fam1ne.tPhRLe')["aria-label"].replace("Rated ", "").replace(" out of ", "").replace(",", "")
    except:
        rating = None

    try:
        reviews = result.select_one('.GhQXkc').text.replace("(", "").replace(")", "")
    except:
        reviews = None

    source = result.select_one('.zPEcBd.LnPkof').text.strip()

    print(f'{title}\n{link}\n{ad_link}\n{price}\n{rating}\n{reviews}\n{source}\n')
-------------
'''
MUD\WTR | Mushroom Coffee Replacement, 90 servings
https://mudwtr.com/collections/shop/products/90-serving-bag
https://www.googleadservices.com/pagead/aclk?sa=l&ai=DChcSEwj5p8u-2rzyAhV2yJQJHfzhBoUYABAHGgJ5bQ&sig=AOD64_3NGBzLzkTv61K7kSrD2f9AREHH_g&ctype=5&q=&ved=2ahUKEwji7MK-2rzyAhWaaM0KHcnaDDcQ9aACegQIAhBo&adurl=
$125.00
4.85
1k+
mudwtr.com
...
'''
Alternatively, you can do the same thing using Google Inline Shopping API from SerpApi. It's a paid API with a free plan.
The difference is that everything is already extracted, and all that needs to be done is just to iterate over structured JSON.
Code to integrate:
import json, os
from serpapi import GoogleSearch

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "buy coffe",
    "hl": "en",
    "gl": "us",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['shopping_results']:
    print(json.dumps(result, indent=2, ensure_ascii=False))
--------
'''
{
"position": 1,
"block_position": "right",
"title": "Maxwell House Original Roast | 48oz",
"price": "$10.49",
"extracted_price": 10.49,
"link": "https://www.google.com/aclk?sa=l&ai=DChcSEwiGn8aT2rzyAhXgyZQJHZHdBJMYABAEGgJ5bQ&ae=2&sig=AOD64_0jBjdUIMeqJvrXYxn4NGcpwCYrJQ&ctype=5&q=&ved=2ahUKEwiOxLmT2rzyAhWiFVkFHWMNAaEQ5bgDegQIAhBa&adurl=",
"source": "Boxed",
"rating": 4.6,
"reviews": 2000,
"thumbnail": "https://serpapi.com/searches/611e1b2cfdca3e6a1c9335e6/images/e4ae7f31164ec52021f1c04d8be4e4bda2138b1acd12c868052125eb86ead292.png"
}
...
'''
P.S - I wrote a blog post about this topic that you can find here.
Disclaimer, I work for SerpApi.

Using mechanize bing search returns blank page

I am using mechanize to perform a bing search and then I will process the results with beautiful soup. I have successfully performed google and yahoo searches with this same method but when I do a bing search all I get is a blank page.
I am thoroughly confused why this is the case and if anyone can shed any light on the matter that would be greatly appreciated. Here is a sample of the code I'm using:
from BeautifulSoup import BeautifulSoup
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.bing.com/search?count=100&q=cheese")
content = br.response()
content = content.read()
soup = BeautifulSoup(content, convertEntities=BeautifulSoup.ALL_ENTITIES)
print soup
The result is a blank line printed.
You probably got a response saying that the answer is already in your browser cache. Try changing your query string a little, for example decrease count to 50.
You can also add some debugging code and look at the headers returned by the server:
br.open("http://www.bing.com/search?count=50&q=cheese")
response = br.response()
headers = response.info()
print headers
content = response.read()
EDIT:
I have tried this query with count=100 in the Firefox and Opera browsers, and it seems that Bing does not like such a "big" count. When I decrease the count, it works. So this is not a fault of mechanize or another Python library; your query is simply problematic for Bing. It also seems that a browser can query Bing with count=100, but only after it has first queried Bing with some smaller count. Strange!
Another way to achieve this is by using requests with beautifulsoup
Code and example in online IDE:
from bs4 import BeautifulSoup
import requests, lxml, json

headers = {
    'User-agent':
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

def get_organic_results():
    html = requests.get('https://www.bing.com/search?q=nfs', headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')

    bing_data = []

    for result in soup.find_all('li', class_='b_algo'):
        title = result.h2.text

        try:
            link = result.h2.a['href']
        except:
            link = None

        displayed_link = result.find('div', class_='b_attribution').text

        try:
            snippet = result.find('div', class_='b_caption').p.text
        except:
            snippet = None

        for inline in soup.find_all('div', class_='b_factrow'):
            try:
                inline_title = inline.a.text
            except:
                inline_title = None
            try:
                inline_link = inline.a['href']
            except:
                inline_link = None

        bing_data.append({
            'title': title,
            'link': link,
            'displayed_link': displayed_link,
            'snippet': snippet,
            'inline': [{'title': inline_title, 'link': inline_link}]
        })

    print(json.dumps(bing_data, indent=2))

get_organic_results()
# part of the created json output:
'''
[
{
"title": "Need for Speed Video Games - Official EA Site",
"link": "https://www.ea.com/games/need-for-speed",
"displayed_link": "https://www.ea.com/games/need-for-speed",
"snippet": "Need for Speed Forums Buy Now All Games Forums Buy Now Learn More Buy Now Hit the gas and tear up the roads in this legendary action-driving series. Push your supercar to its limits and leave the competition in your rearview or shake off a full-scale police pursuit \u2013 it\u2019s all just a key-turn away.",
"inline": [
{
"title": null,
"link": null
}
]
}
]
'''
Alternatively, you can do the same thing using Bing Organic Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
Code to integrate:
from serpapi import GoogleSearch
import os

def get_organic_results():
    params = {
        "api_key": os.getenv('API_KEY'),
        "engine": "bing",
        "q": "nfs most wanted"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for result in results['organic_results']:
        title = result['title']
        link = result['link']
        displayed_link = result['displayed_link']

        try:
            snippet = result['snippet']
        except:
            snippet = None

        try:
            inline = result['sitelinks']['inline']
        except:
            inline = None

        print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n{inline}\n')

get_organic_results()
# part of the output:
'''
Need for Speed: Most Wanted - Car Racing Game - Official ...
https://www.ea.com/games/need-for-speed/need-for-speed-most-wanted
https://www.ea.com/games/need-for-speed/need-for-speed-most-wanted
Jun 01, 2017 · To be Most Wanted, you’ll need to outrun the cops, outdrive your friends, and outsmart your rivals. With a relentless police force gunning to take you down, you’ll need to make split-second decisions. Use the open world to …
[{'title': 'Need for Speed No Limits', 'link': 'https://www.ea.com/games/need-for-speed/need-for-speed-no-limits'}, {'title': 'Buy Now', 'link': 'https://www.ea.com/games/need-for-speed/need-for-speed-heat/buy'}, {'title': 'Need for Speed Undercover', 'link': 'https://www.ea.com/games/need-for-speed/need-for-speed-undercover'}, {'title': 'Need for Speed The Run', 'link': 'https://www.ea.com/games/need-for-speed/need-for-speed-the-run'}, {'title': 'News', 'link': 'https://www.ea.com/games/need-for-speed/need-for-speed-payback/news'}]
'''
Disclaimer, I work for SerpApi.
