I'm trying to scrape product titles from the first results page of an Amazon search using HTMLSession and XPath.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
def getTitle(url):
    session = HTMLSession()
    r = session.get(url)
    r.html.render(sleep=1)
    product = {
        'title': r.html.xpath('//*[@class="a-size-medium a-color-base a-text-normal"]').text
    }
    print(product)
    return product
getTitle('https://www.amazon.com/s?k=amazon+echo+dot&qid=1605730376&ref=sr_pg_1')
Output:
{'title': 'Echo Dot (3rd Gen) - Smart speaker with Alexa - Charcoal'}
The product titles all have the attribute class="a-size-medium a-color-base a-text-normal", so I want to scrape every title displayed on the page, but the code only outputs one of them.
For example, I would want something like:
{'title': 'Echo dot 1st gen...'}
{'title': 'Echo dot for kids...'}
{'title': 'Amazon Echo dot 3rd gen...'}
Any tip or workaround?
Thank you
I have modified your function a bit to actually collect the titles into a product list of dictionaries (which, by the way, you don't really need). You also don't need bs4 for this.
def getTitle(url):
    session = HTMLSession()
    r = session.get(url)
    r.html.render(sleep=1)
    product = [{'title': item.text} for item in r.html.xpath('//*[@class="a-size-medium a-color-base a-text-normal"]')]
    return product

results = getTitle('https://www.amazon.com/s?k=amazon+echo+dot&qid=1605730376&ref=sr_pg_1')
Replace the product line with the one below to get a list of titles (strings) instead of dictionaries containing the title key and value.
product = [item.text for item in r.html.xpath('//*[@class="a-size-medium a-color-base a-text-normal"]')]
Why XPath for simple things like this?
[x.text for x in soup.find_all(class_="a-size-medium a-color-base a-text-normal")]
One thing to note is that dicts don't allow duplicate keys, so you can't have multiple title keys in one dict. But you can number them, like title0, title1, ...:
{'title'+str(x):y.text for x,y in enumerate(soup.find_all(class_="a-size-medium a-color-base a-text-normal"))}
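For illustration, here is the same enumerate pattern applied to a plain list (the sample titles are hypothetical):

titles = ['Echo Dot 1st gen', 'Echo Dot for kids', 'Echo Dot 3rd gen']  # hypothetical sample data
print({'title' + str(x): y for x, y in enumerate(titles)})
# {'title0': 'Echo Dot 1st gen', 'title1': 'Echo Dot for kids', 'title2': 'Echo Dot 3rd gen'}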
I'm new to programming and cannot figure out why this won't loop. It prints and converts the first item exactly how I want, but stops after the first iteration.
from bs4 import BeautifulSoup
import requests
import re
import json
url = 'http://books.toscrape.com/'
page = requests.get(url)
html = BeautifulSoup(page.content, 'html.parser')
section = html.find_all('ol', class_='row')
for books in section:
    # Title element
    header_element = books.find("article", class_='product_pod')
    title_element = header_element.img
    title = title_element['alt']
    # Price element
    price_element = books.find(class_='price_color')
    price_str = str(price_element.text)
    price = price_str[1:]
    # Create JSON
    final_results_json = {"Title": title, "Price": price}
    final_result = json.dumps(final_results_json, sort_keys=True, indent=1)
    print(title)
    print(price)
    print()
    print(final_result)
First, clarify what you are looking for: presumably you want to print the title, price, and final_result for every book scraped from books.toscrape.com. The code is working exactly as written; it is the expectation that differs. Notice that you are finding all the <ol> tags with class name "row", and there is just one such element on the page, so section contains a single element and the for loop iterates just once.
How to debug it?
Check the type of section with type(section)
Print section to see what it contains
Write some print statements in the for loop to understand what happens where (see the sketch below)
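For example, a hypothetical debugging sketch along these lines would reveal the issue:

print(type(section))   # <class 'bs4.element.ResultSet'>
print(len(section))    # 1 -- only one <ol class="row"> exists on the page
for books in section:
    print(books.name)  # prints 'ol' exactly once, so the loop body runs once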
It isn't hard to debug this one.
You need to change:
section = html.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')
There is only one <ol> in that document.
I think you want
for book in section[0].find_all('li'):
ol means ordered list, of which there is exactly one in this case; there are many li (list items) inside that ol.
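Putting the two suggestions together, a minimal corrected version might look like this (a sketch reusing the question's own selectors, not the only way to do it):

import json
import requests
from bs4 import BeautifulSoup

html = BeautifulSoup(requests.get('http://books.toscrape.com/').content, 'html.parser')
# iterate over the <li> items inside the single <ol class="row">
for book in html.find_all('ol', class_='row')[0].find_all('li'):
    title = book.find('article', class_='product_pod').img['alt']
    price = book.find(class_='price_color').get_text()[1:]  # drop the currency symbol
    print(json.dumps({'Title': title, 'Price': price}, sort_keys=True, indent=1))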
I've been trying to write a program in Python that returns a list of all the product names on the first page. I have a function that builds the URL based on what you want to search for:
def get_url(search_term):
    template = 'https://www.amazon.com/s?k={}&ref=nb_sb_noss_1'
    search_term = search_term.replace(' ', '+')
    url = template.format(search_term)
    print(url)
    return url
Then I pass the URL into another function and here is where I need help. Right now my function to retrieve the title and number of reviews is this:
from requests_html import HTMLSession

def getInfo(url):
    r = HTMLSession().get(url)
    r.html.render()
    product = {
        'title': r.html.find('.a-size-medium.a-color-base.a-text-normal', first=True).text,
        'reviews': r.html.find('.a-size-base', first=True).text
    }
    print(product)
However, the r.html.find part isn't getting the info I need: it returns [] (or None if I add first=True). I've tried different approaches, like using an XPath or a CSS selector, but none of them seemed to work. Can anyone help me find a way to use the html.find method to find all the product names and save them under title in the product dictionary?
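For what it's worth, one possible approach is to reuse the list-comprehension pattern from the first answer on this page; a minimal sketch (the class names, and the need for sleep=1 during rendering, are assumptions that may need adjusting):

from requests_html import HTMLSession

def get_titles(url):
    session = HTMLSession()
    r = session.get(url)
    r.html.render(sleep=1)  # give the JavaScript-rendered page a moment to settle
    # without first=True, find() returns a list of every matching element
    return [el.text for el in r.html.find('.a-size-medium.a-color-base.a-text-normal')]

titles = get_titles(get_url('amazon echo dot'))  # reuses the get_url function defined above
print(titles)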
I am trying to extract different pieces of information from websites with BeautifulSoup, such as the product title and the price.
I do that with different URLs, looping through them with for ... in .... Here, I'll just provide a snippet without the loop.
from bs4 import BeautifulSoup
import requests
import csv
url= 'https://www.mediamarkt.ch/fr/product/_lg-oled65gx6la-1991479.html'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
price = soup.find('meta', property="product:price:amount")
title = soup.find("div", {"class": "flix-model-name"})
title2 = soup.find('div', class_="flix-model-name")
title3 = soup.find("div", attrs={"class": "flix-model-name"})
print(price['content'])
print(title)
print(title2)
print(title3)
So from this URL https://www.mediamarkt.ch/fr/product/_lg-oled65gx6la-1991479.html I want to extract the product number. The only place I find it is in the div with class="flix-model-name". However, I am totally unable to reach it. I tried different ways to access it with title, title2, and title3, but I always get the output None.
I am a bit of a beginner, so I guess I am probably missing something basic... If so, please pardon me for that.
Any help is welcome! Many thanks in advance!
Just for info: with each URL, I thought of appending the data and writing it to a CSV file like this:
for url in urls:
    html_content = requests.get(url).text
    soup = BeautifulSoup(html_content, "lxml")
    row = []
    try:
        # title = YOUR VERY WELCOMED ANSWER
        prices = soup.find('meta', property="product:price:amount")
        row = (title.text + ',' + prices['content'] + '\n')
        data.append(row)
    except:
        pass

file = open('database.csv', 'w')
i = 0
while i < (len(data)):
    file.write(data[i])
    i += 1
file.close()
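As a side note, since csv is already imported above, csv.writer could handle the delimiters and quoting; a minimal sketch (assuming data held (title, price) tuples rather than pre-joined strings):

import csv

# 'data' is assumed to be the list built in the loop above,
# holding (title, price) tuples instead of pre-joined strings
with open('database.csv', 'w', newline='') as f:
    out = csv.writer(f)
    out.writerow(['title', 'price'])  # header row
    out.writerows(data)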
Many thanks in advance for your help!
David
Try the approach below using Python requests: it is simple, straightforward, reliable, and fast, and requires less code. I fetched the API URL from the website itself after inspecting the Network section in the Google Chrome developer tools.
What exactly the script below does:
First it takes the API URL, builds the final URL from two dynamic parameters (product and category), and then makes a GET request to fetch the data.
After getting the data, the script parses the JSON using json.loads.
Finally, it iterates over the list of products one by one and prints the details, which are divided into the two categories 'box1_ProductToProduct' and 'box2_KategorieTopseller': brand, name, product number, and unit price. In the same way you can add more details by looking into the API call.
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
def scrap_product_details():
    PRODUCT = 'MMCH1991479'   # product number
    CATEGORY = '680942'       # category number
    URL = 'https://www.mediamarkt.ch/rde_server/res/MMCH/recomm/product_detail/sid/WACXyEbIf3khlu6FcHlh1B1?product=' + PRODUCT + '&category=' + CATEGORY  # dynamic URL
    response = requests.get(URL, verify=False)  # GET request to fetch the data
    result = json.loads(response.text)  # parse the JSON data using json.loads
    box1_ProductToProduct = result[0]['box1_ProductToProduct']      # extracted data from the API
    box2_KategorieTopseller = result[1]['box2_KategorieTopseller']
    for item in box1_ProductToProduct:  # loop over the extracted data
        print('-' * 100)
        print('Brand : ', item['brand'])
        print('Name : ', item['name'])
        print('Net Unit Price : ', item['netUnitPrice'])
        print('Product Number : ', item['product_nr'])
    print('-' * 100)
    for item in box2_KategorieTopseller:  # loop over the extracted data
        print('-' * 100)
        print('Brand : ', item['brand'])
        print('Name : ', item['name'])
        print('Net Unit Price : ', item['netUnitPrice'])
        print('Product Number : ', item['product_nr'])
    print('-' * 100)

scrap_product_details()
I'm using BS4 for the first time and need to scrape the items from an online catalogue to CSV.
I have set up my code; however, when I run it, the results just repeat the first item in the catalogue n times (where n is the number of items).
Can someone review my code and let me know where I am going wrong?
Thanks
import requests
from bs4 import BeautifulSoup
from csv import writer
#response = requests.get('https://my.supplychain.nhs.uk/Catalogue/browse/27/anaesthetic-oxygen-and-resuscitation?CoreListRequest=BrowseCoreList')
response = requests.get('https://my.supplychain.nhs.uk/Catalogue/browse/32/nhs-cat?LastCartId=&LastFavouriteId=&CoreListRequest=BrowseAll')
soup = BeautifulSoup(response.text , 'html.parser')
items = soup.find_all(class_='productPrevDetails')
#print(items)
for item in items:
    ItemCode = soup.find(class_='product_npc ').get_text().replace('\n', '')
    ItemNameS = soup.select('p')[58].get_text()
    ProductInfo = soup.find(class_='product_key_info').get_text()
    print(ItemCode, ItemNameS, ProductInfo)
You always see the first result because you are searching soup, not item. Try:
for item in items:
    ItemCode = item.find(class_='product_npc ').get_text().replace('\n', '')
    ItemNameS = item.select('p')[58].get_text()
    ProductInfo = item.find(class_='product_key_info').get_text()
    print(ItemCode, ItemNameS, ProductInfo)
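Since the original goal was to write the catalogue to CSV, and the question already imports writer from csv, here is a minimal sketch of how the corrected loop could feed it (the filename 'catalogue.csv' is hypothetical, and the selectors are taken from the answer above as-is):

from csv import writer

with open('catalogue.csv', 'w', newline='') as f:
    out = writer(f)
    out.writerow(['ItemCode', 'ItemNameS', 'ProductInfo'])  # header row
    for item in items:
        out.writerow([
            item.find(class_='product_npc ').get_text().replace('\n', ''),
            item.select('p')[58].get_text(),
            item.find(class_='product_key_info').get_text(),
        ])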
I've created a script to harvest the links of different products from a webpage. My intention is to scrape the links only when the products carry the Ajouter au panier sign, meaning "Add to basket". The HTML structures are very straightforward and easy to play with, but the logic to get the desired links appears to be tricky. I've used three different links to show the variation.
A few URLs lead to the desired products, but there are still catalogues which (if I make use of their links) produce some more products. Check out the image links to see for yourself. In the first image I've drawn circles around the catalogues that can still produce the desired products, even though a few desired products are already on that page.
check out the variation
another one: only catalogues
This is the script I've written:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
urls = (
    "https://www.directmedical.fr/categorie/aspirateurs-de-mucosite.html",
    "https://www.directmedical.fr/categorie/literie.html",
    "https://www.directmedical.fr/categorie/vetement.html"
)

def get_links(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select(".browseCategoryName a"):
        ilink = urljoin(link, item.get("href"))
        print(ilink)

if __name__ == '__main__':
    for url in urls:
        get_links(url)
How can I get all the product links having the Ajouter au panier sign using those URLs?
If you need to select product links from both the initial page and (if there are no products on the initial page) from the category pages, try:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
domain = "https://www.directmedical.fr/"
urls = (
    "https://www.directmedical.fr/categorie/aspirateurs-de-mucosite.html",
    "https://www.directmedical.fr/categorie/literie.html",
    "https://www.directmedical.fr/categorie/vetement.html"
)

def get_links(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    products = soup.select(".browseElements td > a")
    if products:
        # products found on the initial page
        for item in products:
            ilink = urljoin(link, item.get("href"))
            print(ilink)
    else:
        # no products here, so follow the child category links instead
        categories = [urljoin(domain, item.get("href")) for item in soup.select(".browseChildsCategorys td > a")]
        for category in categories:
            c = requests.get(category)
            c_soup = BeautifulSoup(c.text, "lxml")
            for item in c_soup.select(".browseElements td > a"):
                c_link = urljoin(domain, item.get("href"))
                print(c_link)

if __name__ == '__main__':
    for url in urls:
        get_links(url)