I've been trying to write a Python program that returns a list of all the product names on the first page of Amazon search results. I have a function that builds the URL based on what you want to search for:
def get_url(search_term):
    template = 'https://www.amazon.com/s?k={}&ref=nb_sb_noss_1'
    search_term = search_term.replace(' ', '+')
    url = template.format(search_term)
    print(url)
    return url
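For example, a quick sanity check of the URL builder:

get_url('echo dot')
# prints (and returns): https://www.amazon.com/s?k=echo+dot&ref=nb_sb_noss_1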
Then I pass the URL into another function, and this is where I need help. Right now my function to retrieve the title and number of reviews looks like this:
from requests_html import HTMLSession

def getInfo(url):
    r = HTMLSession().get(url)
    r.html.render()
    product = {
        # compound CSS selector: an element carrying all three classes
        'title': r.html.find('.a-size-medium.a-color-base.a-text-normal', first=True).text,
        'reviews': r.html.find('.a-size-base', first=True).text
    }
    print(product)
However, the r.html.find part isn't getting the info I need; it either returns [] or None if I add first=True. I've tried different approaches, like using an XPath or a CSS selector, but none of them worked. Can anyone help me find a way to use the html.find method to find all the product names and save them under 'title' in the product dictionary?
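For reference, this is roughly the shape I'm aiming for, as a sketch (assuming the class names above are the ones Amazon actually serves; find() without first=True returns a list of every matching element):

products = []
for item in r.html.find('.a-size-medium.a-color-base.a-text-normal'):
    # one dictionary per search result
    products.append({'title': item.text})
print(products)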
from requests import get
from bs4 import BeautifulSoup

while (x < go):
    url = "https://www.shoppingwesbite.com/search?=product" + input_a
    headers = {'User-Agent': 'my user agent here'}
    ok = get(url, headers=headers)
    data = BeautifulSoup(ok.content, 'html.parser')
    price = data.find_all('div', {"class": "css-rey619"})[x].get_text()
    title = data.find_all('div', {"class": "css-398hol"})[x].get_text()
    reviews = data.find_all('span', {'class': 'css-402phy'})[x].get_text()
I have included this piece of code from my web scraper; it essentially pulls the first 10 results on a shopping website for a product entered by the user. Most of the time it works, but sometimes it raises an IndexError on the reviews line, I think because it's trying to pull a review for a product that doesn't have a review yet. I don't know how to get around this and would appreciate any suggestions or ideas on what I could try. I was thinking of adding some logic that checks whether the listing has a review and only outputs it if it does, but I don't know how to achieve this. Thanks!
You can check the length of the reviews; if it is zero, you got an empty review.
for i in range(len(reviews)):
    if len(reviews[i]) == 0:
        print("you got empty review now you can easily remove it")
I'm trying to scrape product titles on the first product page of Amazon using HTMLSession and XPath.
from requests_html import HTMLSession
from bs4 import BeautifulSoup

def getTitle(url):
    session = HTMLSession()
    r = session.get(url)
    r.html.render(sleep=1)
    product = {
        'title': r.html.xpath('//*[@class="a-size-medium a-color-base a-text-normal"]', first=True).text
    }
    print(product)
    return product

getTitle('https://www.amazon.com/s?k=amazon+echo+dot&qid=1605730376&ref=sr_pg_1')
>{'title': 'Echo Dot (3rd Gen) - Smart speaker with Alexa - Charcoal'}
The product titles have the attribute class="a-size-medium a-color-base a-text-normal", so I want to scrape all the titles of the products displayed on the page, but the code only outputs one of them.
For example, I would want something like:
{'title': 'Echo dot 1st gen...'}
{'title': 'Echo dot for kids...'}
{'title': 'Amazon Echo dot 3rd gen...'}
Any tip or workaround?
Thank you
Have modified your function a bit to actually collect the titles into a product list of dictionaries (which, by the way, you don't really need). You also do not need bs4 for this.
def getTitle(url):
    session = HTMLSession()
    r = session.get(url)
    r.html.render(sleep=1)
    product = [{'title': item.text} for item in r.html.xpath('//*[@class="a-size-medium a-color-base a-text-normal"]')]
    return product

results = getTitle('https://www.amazon.com/s?k=amazon+echo+dot&qid=1605730376&ref=sr_pg_1')
Replace the product line with the one below to get a list of titles (strings) instead of dictionaries containing the title key and value.
product = [item.text for item in r.html.xpath('//*[@class="a-size-medium a-color-base a-text-normal"]')]
Why XPath for simple things like this?
[x.text for x in soup.find_all(class_="a-size-medium a-color-base a-text-normal")]
One thing to note is that dicts don't allow duplicate keys, so you can't have multiple 'title' keys in one dict. But you can number them, like title0, title1, title2:
{'title'+str(x):y.text for x,y in enumerate(soup.find_all(class_="a-size-medium a-color-base a-text-normal"))}
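For completeness, here is hypothetical glue for the two bs4 one-liners above, assuming you are working from the rendered requests_html response r (r.html.html holds the rendered page as an HTML string):

from bs4 import BeautifulSoup

soup = BeautifulSoup(r.html.html, 'html.parser')
titles = [x.text for x in soup.find_all(class_="a-size-medium a-color-base a-text-normal")]
print(titles)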
I have the following code which successfully pulls links, titles, etc. for podcast episodes. How would I go about pulling just the first one it comes to (i.e. the latest episode), then immediately stopping and producing just that result? Any advice would be greatly appreciated.
def get_playable_podcast(soup):
    """
    @param: parsed html page
    """
    subjects = []
    for content in soup.find_all('item'):
        try:
            link = content.find('enclosure')
            link = link.get('url')
            print("\n\nLink: ", link)

            title = content.find('title')
            title = title.get_text()

            desc = content.find('itunes:subtitle')
            desc = desc.get_text()

            thumbnail = content.find('itunes:image')
            thumbnail = thumbnail.get('href')
        except AttributeError:
            continue

        item = {
            'url': link,
            'title': title,
            'desc': desc,
            'thumbnail': thumbnail
        }
        subjects.append(item)

    return subjects
def compile_playable_podcast(playable_podcast):
    """
    @param: list containing dicts of key/value pairs for playable podcasts
    """
    items = []
    for podcast in playable_podcast:
        items.append({
            'label': podcast['title'],
            'thumbnail': podcast['thumbnail'],
            'path': podcast['url'],
            'info': podcast['desc'],
            'is_playable': True,
        })
    return items
The answer of @John Gordon is completely correct.
@John Gordon pointed out that:
soup.find('item')
will always return the first matching item (which is perfectly fine for you, since you want to scrape the latest episode).
However, imagine you wanted to select the second, third, fourth, etc. item from your BeautifulSoup results. Then you could do that with the following line of code:
soup.find_all('item')[0]  # works the same way as soup.find('item') and returns the first item
When you replace the 0 with any other index (e.g. 3), you get only the chosen item (in this example, the fourth) ;).
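Applied to the function from the question, here is a minimal sketch that returns only the latest episode (same tag names as the original; the name get_latest_podcast is mine, and it returns None if no item parses cleanly):

def get_latest_podcast(soup):
    """Return the first <item> that parses cleanly, i.e. the latest episode."""
    for content in soup.find_all('item'):
        try:
            return {
                'url': content.find('enclosure').get('url'),
                'title': content.find('title').get_text(),
                'desc': content.find('itunes:subtitle').get_text(),
                'thumbnail': content.find('itunes:image').get('href'),
            }
        except AttributeError:
            continue  # skip malformed items
    return None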
I'm trying to make a function that scrapes book names from Goodreads using Python and BeautifulSoup.
I've realized some Goodreads pages have a common URL of the form:
"https://www.goodreads.com/shelf/show/" + category_name + "?page=" + page_number, so I've made a function that receives a category name and a maximum page count in order to iterate from page 1 to max_pages.
The problem is that every time the program iterates, it doesn't advance to the requested page but instead fetches the first (default) page for the category. I've tried providing the full URL, for example https://www.goodreads.com/shelf/show/art?page=2, but it still doesn't work, so I'm guessing it might be that BeautifulSoup converts the URL I'm passing into another format that doesn't work, but I don't know.
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

def scrape_category(category_name, search_range):
    book_names = []
    for i in range(search_range):
        quote_page = "https://www.goodreads.com/shelf/show/" + category_name + "?page=" + str(i + 1)
        page = urlopen(quote_page)
        soup = BeautifulSoup(page, 'lxml')
        names = soup.find_all('a', attrs={"class": 'bookTitle'})
        for name in names:
            book_name = name.text
            book_name = re.sub(r'\"', '', book_name)
            book_names.append(book_name)
    return book_names
The result from this code is always the book names from the first page of the category I pass as a parameter, never the second, third, ... or nth page from the range 1 to max_pages that I'm requesting.
I see the same books when I enter https://www.goodreads.com/shelf/show/art?page=2 and https://www.goodreads.com/shelf/show/art?page=15 in my browser. This is not a problem with BeautifulSoup; it is just how this site was built.
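A quick way to confirm this from code (a sketch using the asker's setup; it assumes Goodreads serves these pages to anonymous requests): fetch two different page numbers and compare the first book title from each:

from urllib.request import urlopen
from bs4 import BeautifulSoup

def first_title(url):
    soup = BeautifulSoup(urlopen(url), 'lxml')
    return soup.find('a', attrs={'class': 'bookTitle'}).text.strip()

# If both lines print the same title, the site is ignoring ?page= for these requests.
print(first_title('https://www.goodreads.com/shelf/show/art?page=2'))
print(first_title('https://www.goodreads.com/shelf/show/art?page=15'))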
The website is:
https://pokemongo.gamepress.gg/best-attackers-type
My code is as follows, for now:
from bs4 import BeautifulSoup
import requests
import re

# `headers` was referenced but not defined in the original snippet; a placeholder:
headers = {'User-Agent': 'my user agent here'}

site = 'https://pokemongo.gamepress.gg/best-attackers-type'
page_data = requests.get(site, headers=headers)
soup = BeautifulSoup(page_data.text, 'html.parser')
check_gamepress = soup.body.findAll(text=re.compile("Strength"))
print(check_gamepress)
However, I really want to scrape certain data, and I'm having trouble.
For example, how would I scrape the portion that shows the following for the best Bug type:
"Good typing and lightning-fast attacks. Though cool-looking, Scizor is somewhat fragile."
This information could obviously be updated, as it has been in the past, when a better Pokemon comes out for that type. So how would I scrape this data in a way that keeps working when it is updated in the future, without me having to make code changes when that occurs?
Thank you in advance for reading!
This particular site is a bit tough because of how the HTML is organized. The relevant tags containing the information don't really have many distinguishing features, so we have to get a little clever. To make matters more complicated, the divs that contain the information across the whole page are siblings. We'll also have to make up for this web-design travesty with some ingenuity.
I did notice a pattern that is (almost entirely) consistent throughout the page. Each 'type' and its underlying section are broken into 3 divs:
A div containing the type and pokemon, for example Dark Type: Tyranitar.
A div containing the 'specialty' and moves.
A div containing the 'ratings' and commentary.
The basic idea that follows here is that we can begin to organize this markup chaos through a procedure that loosely goes like this:
Identify each of the type title divs
For each of those divs, get the other two divs by accessing its siblings
Parse the information out of each of those divs
With this in mind, I produced a working solution. The meat of the code consists of five functions: one to find each section, one to extract the sibling divs, and three to parse each of those divs.
import re
import json
import requests
from pprint import pprint
from bs4 import BeautifulSoup

def type_section(tag):
    """Find the tags that have the move type and pokemon name"""
    pattern = r"[A-Za-z]{3,} Type: [A-Za-z]{3,}"
    # if all these things are true, it should be the right tag
    return all((tag.name == 'div',
                len(tag.get('class', [])) == 1,
                'field__item' in tag.get('class', []),
                re.findall(pattern, tag.text),
                ))
def parse_type_pokemon(tag):
    """Parse out the move type and pokemon from the tag text"""
    s = tag.text.strip()
    poke_type, pokemon = s.split(' Type: ')
    return {'type': poke_type, 'pokemon': pokemon}
def parse_speciality(tag):
    """Parse the tag containing the speciality and moves"""
    table = tag.find('table')
    rows = table.find_all('tr')
    speciality_row, fast_row, charge_row = rows
    speciality_types = []
    for anchor in speciality_row.find_all('a'):
        # Each type 'badge' has a href with the type name at the end
        href = anchor.get('href')
        speciality_types.append(href.split('#')[-1])
    fast_move = fast_row.find('td').text
    charge_move = charge_row.find('td').text
    return {'speciality': speciality_types,
            'fast_move': fast_move,
            'charge_move': charge_move}
def parse_rating(tag):
    """Parse the tag containing categorical ratings and commentary"""
    table = tag.find('table')
    category_tags = table.find_all('th')
    strength_tag, meta_tag, future_tag = category_tags
    str_rating = strength_tag.parent.find('td').text.strip()
    meta_rating = meta_tag.parent.find('td').text.strip()
    future_rating = future_tag.parent.find('td').text.strip()
    blurb_tags = table.find_all('td', {'colspan': '2'})
    if blurb_tags:
        # `if` to accommodate fire section bug
        str_blurb_tag, meta_blurb_tag, future_blurb_tag = blurb_tags
        str_blurb = str_blurb_tag.text.strip()
        meta_blurb = meta_blurb_tag.text.strip()
        future_blurb = future_blurb_tag.text.strip()
    else:
        str_blurb = meta_blurb = future_blurb = None
    return {'strength': {
                'rating': str_rating,
                'commentary': str_blurb},
            'meta': {
                'rating': meta_rating,
                'commentary': meta_blurb},
            'future': {
                'rating': future_rating,
                'commentary': future_blurb}
            }
def extract_divs(tag):
    """
    Get the divs containing the moves/ratings,
    determined based on sibling position from the type tag
    """
    _, speciality_div, _, rating_div, *_ = tag.next_siblings
    return speciality_div, rating_div
def main():
    """All together now"""
    url = 'https://pokemongo.gamepress.gg/best-attackers-type'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    types = {}
    for type_tag in soup.find_all(type_section):
        type_info = {}
        type_info.update(parse_type_pokemon(type_tag))
        speciality_div, rating_div = extract_divs(type_tag)
        type_info.update(parse_speciality(speciality_div))
        type_info.update(parse_rating(rating_div))
        type_ = type_info.get('type')
        types[type_] = type_info
    pprint(types)  # We did it
    with open('pokemon.json', 'w') as outfile:
        json.dump(types, outfile)

if __name__ == '__main__':
    main()
There is, for now, one small wrench in the whole thing. Remember when I said this pattern was almost entirely consistent? Well, the Fire type is an odd-ball here, because they included two pokemon for that type, so the Fire type results are not correct. I or some brave person may come up with a way to deal with that. Or maybe they'll decide on one fire pokemon in the future.
This code, the resulting json (prettified), and an archive of HTML response used can be found in this gist.