Python Trouble with find_all() - python

I am attempting to scrape a site for for some data on an long list of items. Below is my code:
def find_info():
page = requests.get(URL)
soup = bs(page.content, 'html.parser')
soup2 = bs(soup.prettify(), "html.parser")
lists = soup2.find_all('div', class_="featured_item")
#with open('data.csv', 'w', encoding='utf8', newline='') as f:
# thewriter = writer(f)
# header = ['Rank', 'Number', 'Price'] #csv: comma separated values
# thewriter.writerow(header)
for list in lists:
stats = lists.find_all('div', class_="item_stats")
sale = lists.find('div', class_="sale")
rank = float(stats.select('.item_stats span')[0].text)
number = stats.select('.item_stats span')[1].text.strip().replace('#', '')
price = sale.select('span')[0].text.strip().replace('◎', '')
print(rank, number, price)
find_info()
I get the following error:
Traceback (most recent call last):
File "C:\Users\Cameron Lemon\source\repos\ScrapperGUI\module1.py", line 36, in <module>
find_info()
File "C:\Users\Cameron Lemon\source\repos\ScrapperGUI\module1.py", line 27, in
find_info
stats = lists.find_all('div', class_="item_stats")
File "C:\Users\Cameron Lemon\AppData\Local\Programs\Python\Python310\lib\site-
packages\bs4\element.py", line 2253, in __getattr__
raise AttributeError(
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a
list of elements like a single element. Did you call find_all() when you meant to call
find()?
Press any key to continue . . .
This code runs well, until I add the line for list in lists: and use find_all() instead of find(). I am very confused, because in the past I have coded to find data for the first item in the list successfully and then set a for loop and changed my find() to find_all(). Any advice is much appreciated. Thank you
Below is my lists variable, for two of the many items within it.
[<div class="featured_item">
<div class="featured_item_img">
<a href="/degds/3251/">
<img src="URL"/>
</a>
</div>
<div class="featured_image_desc" style="padding-bottom: 5px;">
<div class="item_stats">
<div class="item_stat">
rank
<span>
1
</span>
</div>
<div class="item_stat">
item no.
<span>
#3251
</span>
</div>
</div>
<div class="sale" style="height: 40px; margin-top: 10px; margin-bottom: 10px;">
<span>
not for sale
</span>
</div>
<div class="last_sale" style="font-size: 11px;">
no sale history
</div>
</div>
</div>, <div class="featured_item">
<div class="featured_item_img">
<a href="/dgds/8628/">
<img src="URL"/>
</a>
</div>
<div class="featured_image_desc" style="padding-bottom: 5px;">
<div class="item_stats">
<div class="item_stat">
rank
<span>
2
</span>
</div>
<div class="item_stat">
item no.
<span>
#8628
</span>
</div>
</div>
<div class="sale" style="height: 40px; margin-top: 10px; margin-bottom: 10px;">
<span>
not for sale
</span>
</div>
<div class="last_sale" style="font-size: 11px;">
no sale history
</div>
</div>

You are using find_all method two times. When you use for loop then you will get each item meaning list didn't exist any more. So we can't use find_all without list that's why you are getting that error.
html='''
<div class="featured_item">
<div class="featured_item_img">
<a href="/degds/3251/">
<img src="URL"/>
</a>
</div>
<div class="featured_image_desc" style="padding-bottom: 5px;">
<div class="item_stats">
<div class="item_stat">
rank
<span>
1
</span>
</div>
<div class="item_stat">
item no.
<span>
#3251
</span>
</div>
</div>
<div class="sale" style="height: 40px; margin-top: 10px; margin-bottom: 10px;">
<span>
not for sale
</span>
</div>
<div class="last_sale" style="font-size: 11px;">
no sale history
</div>
</div>
</div>, <div class="featured_item">
<div class="featured_item_img">
<a href="/dgds/8628/">
<img src="URL"/>
</a>
</div>
<div class="featured_image_desc" style="padding-bottom: 5px;">
<div class="item_stats">
<div class="item_stat">
rank
<span>
2
</span>
</div>
<div class="item_stat">
item no.
<span>
#8628
</span>
</div>
</div>
<div class="sale" style="height: 40px; margin-top: 10px; margin-bottom: 10px;">
<span>
not for sale
</span>
</div>
<div class="last_sale" style="font-size: 11px;">
no sale history
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"html.parser")
lists = soup.find_all('div', class_="featured_item")
for lis in lists:
stats = lis.find('div', class_="item_stats")
sale = lis.find('div', class_="sale")
rank = float(stats.select('.item_stats span')[0].text)
number = stats.select('.item_stats span')[1].text.strip().replace('#', '')
price = sale.select('span')[0].text.strip().replace('◎', '')
print(rank, number, price)
Output:
1.0 3251 not for sale
2.0 8628 not for sale

Related

How to get value of html tags with beautifulsoup in python?

I scraped multiple pages, some of which have a red class, some of which I do not want to store red class values in an array, but I want those pages that do not have this class to be in an empty array. Because of that I wrote this code and now I want to get value of them. can you help me?
for i in soup:
search = i.find_all('div', {'class':"red"})
if len(search)>0:
whoFollowThisDr.append(i.find_all('div', {'class':"info"},'span'))
i = i.text
else:
whoFollowThisDr.append(' ')
whoFollowThisDr
output:
[[<div class="info"> <strong> a</strong> <span>b</span> </div>,
<div class="info"> <strong> c</strong> <span>d</span> </div>,
<div class="info"> <strong style="font-size: 15px !important;"> e</strong> <span style="font-size: 12px !important;">f</span> </div>,
<div class="info"> <strong style="font-size: 15px !important;"> g</strong> <span style="font-size: 12px !important;">h</span> </div>],
[<div class="info"> <strong> i</strong> <span>j</span> </div>]]
What I want:
[[a,c,e],[i]]
i = i.text has no effect, since you never use i after the assignment. You should use .text when you're appending to the list. Use a list comprehension to call it on each element.
whoFollowThisDr.append([div.text for div in i.find_all('div', {'class':"info"},'span')])

Beautifulsoup get element with the same class

I'm having trouble parsing HTML elements with "class" attribute using Beautifulsoup.
The html code is like this :
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
I nead to get data (XPANDER 1.5L GLX, MT, 1499, Gasoline)
I try with script detail.find(class_='item-content') just only get XPANDER 1.5L GLX
please help
Use .find_all() or .select():
from bs4 import BeautifulSoup
html_doc = """
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
items = [
item.get_text(strip=True) for item in soup.find_all(class_="item-content")
]
print(*items)
Prints:
XPANDER 1.5L GLX MT 1499 cc Bensin
Or:
items = [item.get_text(strip=True) for item in soup.select(".item-content")]
You can try this
soup = BeautifulSoup(html, "html.parser")
items = [item.text for item in soup.find_all("div", {"class": "item-content"})]
find_all retreives all occurences

Unable to scrape h1 class with python/beautiful soup

I am trying to scrape a title from an h1 class, but I keep getting "None"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('h1', {'class': 'prod-name'})
print(title)
I've also tried using this way:
name_div = soup.find_all('div', {'class': 'col-md-12 col-sm-12 col-xs-12'})[0]
name = name_div.find('h1').text
print(name)
in which case I get: "IndexError: list index out of range"
Can anybody help me out?
This is the source code:
<div class="row attachDetails __web-inspector-hidebefore-shortcut__">
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<div class="brand-desc">POLO RALPH LAUREN</div>
<h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
<div class="panel-group" id="accordion">
<div class="borders-overview">
<div class="panel-heading">
<h4 class="panel-title">
<label class="overview-label collapsed" data-angle="overview-label" data-toggle="collapse" data-parent="#accordion" href="#collapse1">
<a class="fa fa-angle-up pull-right"></a>
<a class="over-view">OVERVIEW</a>
<span class="color-disp over-view">COLOR: FAWN GREY HEATHER</span>
<span class="style-num over-view">MATERIAL# : 710766783002
</span></label>
</h4>
</div>
<div id="collapse1" class="panel-collapse collapse">
<div class="short-desc-section"></div>
</div>
</div>
<div class="border-details">
<div class="panel-heading">
<h4 class="panel-title">
<label class="prod-details collapsed" data-angle="prod-details" data-toggle="collapse" data-parent="#accordion" href="#collapse2">
<a class="detail-link">Details</a>
<a class="fa fa-angle-up pull-right"></a>
</label>
</h4>
</div>
<div id="collapse2" class="long-desc panel-collapse collapse">
<div><ol><li>STANDARD FIT</li><li>COTTON</li></ol></div>
<ol>
<div><li><b>Board:</b> S196SC23</li></div>
<!--***********************************************************************************************************-->
</ol>
</div>
</div>
</div>
</div>
</div>
</div>

How to extract image/href url from div class using scrapy

I having hard time to extract href url from given website code
<div class="expando expando-uninitialized" style="display: none" data-cachedhtml=" <div class="media-preview" id="media-preview-66hch1" style="max-width: 534px"> <div class="media-preview-content"> <a href="https://i.redd.it/nctvpvsnbpsy.jpg" class="may-blank"> <img class="preview" src="https://i.redditmedia.com/UELqh-mbh5mwnXr67PoBbi23nwZuNl2v3flNbkmewQE.jpg?w=534&amp;s=1426be7f811e5d5043760f8882674070" width="534" height="768"> </a> </div> </div> " data-pin-condition="function() {return this.style.display != 'none';}"><span class="error">loading...</span></div>
Probably, you can use regular expressions for this. Here is example:
s = """<div class="expando expando-uninitialized" style="display: none" data-cachedhtml=" <div class="media-preview" id="media-preview-66hch1" style="max-width: 534px"> <div class="media-preview-content"> <a href="https://i.redd.it/nctvpvsnbpsy.jpg" class="may-blank"> <img class="preview" src="https://i.redditmedia.com/UELqh-mbh5mwnXr67PoBbi23nwZuNl2v3flNbkmewQE.jpg?w=534&amp;s=1426be7f811e5d5043760f8882674070" width="534" height="768"> </a> </div> </div> " data-pin-condition="function() {return this.style.display != 'none';}"><span class="error">loading...</span></div>"""
re.search('href="(.*jpg)&quot', s).groups()[0]
# 'https://i.redd.it/nctvpvsnbpsy.jpg'

BeautifulSoup: scraping for a span gives me a result, for another span it gives "None"

I am coding a scraper for Etsy and when I scrape the span for reviews I get the right output. However when I scrape for the span with the price it gives me only None values and I don't understand why. If someone could help, it would be great!
#html parsing
page_soup = soup(page_html, "html.parser")
#grabs each listing card
divs = page_soup.find_all("div", {"class": "v2-listing-card__shop"})
for i in divs:
shop = i.p.text
reviews = i.find("span", {"class" : "text-body-smaller text-gray-lighter display-inline-block vertical-align-middle icon-b-1"})
prices = i.find("span", {"class" : "currency-value"})
print shop
print reviews.text
print prices
Here are the two span elements as on the website:
<div class="v2-listing-card__info">
<p class="text-gray text-truncate mb-xs-0 text-body">
Blush Watercolor Flowers & Leaves with Different Shades Clipart Separate Elements Hand Painted Commercial Use | S15 Fairy Tale
</p>
<div class="v2-listing-card__shop">
<p class="text-gray-lighter text-body-smaller display-inline-block mr-xs-1">PatishopArt</p>
<div class="v2-listing-card__rating icon-t-2">
<div class="stars-svg stars-smaller ">
<input name="initial-rating" type="hidden" value="5"/>
<input name="rating" type="hidden" value="5"/>
<span class="screen-reader-only">5 out of 5 stars</span>
<div aria-hidden="true" class="rating lit rating-first icon-b-2" data-rating="1">
<span class="etsy-icon stars-svg-star" title="Disappointed"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span>
<div class="rating lit" data-rating="2">
<span class="etsy-icon stars-svg-star" title="Not a fan"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span>
<div class="rating lit" data-rating="3">
<span class="etsy-icon stars-svg-star" title="It's okay"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span>
<div class="rating lit" data-rating="4">
<span class="etsy-icon stars-svg-star" title="Like it"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span>
<div class="rating lit" data-rating="5">
<span class="etsy-icon stars-svg-star" title="Love it"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span>
</div>
</div>
</div>
</div>
</div>
</div>
<span class="text-body-smaller text-gray-lighter display-inline-block vertical-align-middle icon-b-1">(110)</span>
</div>
</div>
<p class="n-listing-card__price text-gray strong mt-xs-0">
<span class="currency-symbol">$</span><span class="currency-value">6.60</span>
</p>
<!-- This shows Free shipping on its own line , we only show it if it wasn't shown above -->
</div>
You are only checking in divs of type listing-card__shop but it looks to me as if the span in question, is outside of those divs

Categories