Python Trouble with find_all() - python
I am attempting to scrape a site for some data on a long list of items. Below is my code:
def find_info():
    page = requests.get(URL)
    soup = bs(page.content, 'html.parser')
    soup2 = bs(soup.prettify(), "html.parser")
    lists = soup2.find_all('div', class_="featured_item")
    #with open('data.csv', 'w', encoding='utf8', newline='') as f:
    # thewriter = writer(f)
    # header = ['Rank', 'Number', 'Price'] #csv: comma separated values
    # thewriter.writerow(header)
    for list in lists:
        stats = lists.find_all('div', class_="item_stats")
        sale = lists.find('div', class_="sale")
        rank = float(stats.select('.item_stats span')[0].text)
        number = stats.select('.item_stats span')[1].text.strip().replace('#', '')
        price = sale.select('span')[0].text.strip().replace('◎', '')
        print(rank, number, price)
find_info()
I get the following error:
Traceback (most recent call last):
  File "C:\Users\Cameron Lemon\source\repos\ScrapperGUI\module1.py", line 36, in <module>
    find_info()
  File "C:\Users\Cameron Lemon\source\repos\ScrapperGUI\module1.py", line 27, in find_info
    stats = lists.find_all('div', class_="item_stats")
  File "C:\Users\Cameron Lemon\AppData\Local\Programs\Python\Python310\lib\site-packages\bs4\element.py", line 2253, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Press any key to continue . . .
This code runs well until I add the line "for list in lists:" and use find_all() instead of find(). I am very confused, because in the past I have written code that finds the data for the first item in the list, then wrapped it in a for loop and changed find() to find_all(), and that worked. Any advice is much appreciated. Thank you.
Below is my lists variable, for two of the many items within it.
[<div class="featured_item">
<div class="featured_item_img">
<a href="/degds/3251/">
<img src="URL"/>
</a>
</div>
<div class="featured_image_desc" style="padding-bottom: 5px;">
<div class="item_stats">
<div class="item_stat">
rank
<span>
1
</span>
</div>
<div class="item_stat">
item no.
<span>
#3251
</span>
</div>
</div>
<div class="sale" style="height: 40px; margin-top: 10px; margin-bottom: 10px;">
<span>
not for sale
</span>
</div>
<div class="last_sale" style="font-size: 11px;">
no sale history
</div>
</div>
</div>, <div class="featured_item">
<div class="featured_item_img">
<a href="/dgds/8628/">
<img src="URL"/>
</a>
</div>
<div class="featured_image_desc" style="padding-bottom: 5px;">
<div class="item_stats">
<div class="item_stat">
rank
<span>
2
</span>
</div>
<div class="item_stat">
item no.
<span>
#8628
</span>
</div>
</div>
<div class="sale" style="height: 40px; margin-top: 10px; margin-bottom: 10px;">
<span>
not for sale
</span>
</div>
<div class="last_sale" style="font-size: 11px;">
no sale history
</div>
</div>
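You are calling find_all() twice. The first call returns a ResultSet, which is a list of the matching elements. Inside the for loop, each iteration gives you a single element, but your code still calls find_all() and find() on lists (the whole ResultSet) instead of on the loop variable. A ResultSet has no find_all() method, which is why you get that error. Call find() on the loop variable instead, so that stats and sale are single elements that support select():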
html='''
<div class="featured_item">
<div class="featured_item_img">
<a href="/degds/3251/">
<img src="URL"/>
</a>
</div>
<div class="featured_image_desc" style="padding-bottom: 5px;">
<div class="item_stats">
<div class="item_stat">
rank
<span>
1
</span>
</div>
<div class="item_stat">
item no.
<span>
#3251
</span>
</div>
</div>
<div class="sale" style="height: 40px; margin-top: 10px; margin-bottom: 10px;">
<span>
not for sale
</span>
</div>
<div class="last_sale" style="font-size: 11px;">
no sale history
</div>
</div>
</div>, <div class="featured_item">
<div class="featured_item_img">
<a href="/dgds/8628/">
<img src="URL"/>
</a>
</div>
<div class="featured_image_desc" style="padding-bottom: 5px;">
<div class="item_stats">
<div class="item_stat">
rank
<span>
2
</span>
</div>
<div class="item_stat">
item no.
<span>
#8628
</span>
</div>
</div>
<div class="sale" style="height: 40px; margin-top: 10px; margin-bottom: 10px;">
<span>
not for sale
</span>
</div>
<div class="last_sale" style="font-size: 11px;">
no sale history
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"html.parser")
# find_all() returns a ResultSet: a list of the matching Tag objects
lists = soup.find_all('div', class_="featured_item")
for lis in lists:
    # search inside the current item (lis), not the whole ResultSet (lists)
    stats = lis.find('div', class_="item_stats")
    sale = lis.find('div', class_="sale")
    rank = float(stats.select('.item_stats span')[0].text)
    number = stats.select('.item_stats span')[1].text.strip().replace('#', '')
    price = sale.select('span')[0].text.strip().replace('◎', '')
    print(rank, number, price)
Output:
1.0 3251 not for sale
2.0 8628 not for sale
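To make the difference between the list and its elements concrete, here is a minimal sketch. The HTML is a trimmed-down stand-in for the page in the question, and the names items, raised and numbers are only illustrative. It shows that the object returned by find_all() rejects find_all(), while each element inside it accepts find() and find_all().

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

# Trimmed-down stand-in for the markup in the question (assumption: same class names).
html = """
<div class="featured_item"><div class="item_stats"><span>1</span><span>#3251</span></div></div>
<div class="featured_item"><div class="item_stats"><span>2</span><span>#8628</span></div></div>
"""

soup = BeautifulSoup(html, "html.parser")
items = soup.find_all("div", class_="featured_item")

# find_all() hands back a ResultSet (a list subclass), not a Tag.
print(isinstance(items, ResultSet))

# Calling find_all() on the ResultSet itself reproduces the AttributeError.
try:
    items.find_all("div", class_="item_stats")
    raised = False
except AttributeError:
    raised = True

# Each element of the ResultSet is a Tag, and Tags do support find()/find_all().
numbers = []
for item in items:
    spans = item.find("div", class_="item_stats").find_all("span")
    numbers.append(spans[1].get_text(strip=True).replace("#", ""))

print(raised, numbers)
```

The rule of thumb: anything you get from find_all() has to be indexed or looped over before you can search inside it.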
Related
How to get value of html tags with beautifulsoup in python?
I scraped multiple pages. Some of them contain a div with the class red and some do not. For the pages that have it, I want to store the values in an array; for the pages that do not, I want an empty entry in the array. I wrote the code below for that, and now I want to get the text values out of the results. Can you help me?
for i in soup:
    search = i.find_all('div', {'class':"red"})
    if len(search)>0:
        whoFollowThisDr.append(i.find_all('div', {'class':"info"},'span'))
        i = i.text
    else:
        whoFollowThisDr.append(' ')
whoFollowThisDr output:
[[<div class="info"> <strong> a</strong> <span>b</span> </div>, <div class="info"> <strong> c</strong> <span>d</span> </div>, <div class="info"> <strong style="font-size: 15px !important;"> e</strong> <span style="font-size: 12px !important;">f</span> </div>, <div class="info"> <strong style="font-size: 15px !important;"> g</strong> <span style="font-size: 12px !important;">h</span> </div>], [<div class="info"> <strong> i</strong> <span>j</span> </div>]]
What I want:
[[a,c,e],[i]]
i = i.text has no effect, since you never use i after the assignment. You should use .text when you're appending to the list. Use a list comprehension to call it on each element:
whoFollowThisDr.append([div.text for div in i.find_all('div', {'class':"info"},'span')])
Note that div.text includes the text of both the strong and the span children; use div.strong.text if you only want the first value.
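Here is a small sketch of that idea which returns only the strong values, as in the desired output. It assumes each page is available as its own HTML string. The pages list, the who_follow name and the sample markup are made up for illustration. It also drops the third positional argument 'span': in find_all() that position is the recursive flag, so it does not filter by tag name.

```python
from bs4 import BeautifulSoup

# Hypothetical input: one HTML string per scraped page.
pages = [
    # page with a "red" div: its info values should be collected
    '<div class="red"></div>'
    '<div class="info"><strong> a</strong><span>b</span></div>'
    '<div class="info"><strong> c</strong><span>d</span></div>',
    # page without a "red" div: should yield an empty list
    '<div class="info"><strong> x</strong><span>y</span></div>',
]

who_follow = []
for page_html in pages:
    page = BeautifulSoup(page_html, "html.parser")
    if page.find("div", class_="red") is not None:
        # Take only the <strong> text of each info div, skipping divs without one.
        who_follow.append([
            div.strong.get_text(strip=True)
            for div in page.find_all("div", class_="info")
            if div.strong is not None
        ])
    else:
        who_follow.append([])

print(who_follow)  # [['a', 'c'], []]
```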
Beautifulsoup get element with the same class
I'm having trouble parsing HTML elements with a "class" attribute using Beautifulsoup. The HTML code looks like this:
<div class="info-item">
 <div class="item-name">Model</div>
 <div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
 <div class="item-name">Transmission</div>
 <div class="item-content"> MT </div>
</div>
<div class="info-item">
 <div class="item-name">Engine Capacity (cc)</div>
 <div class="item-content">1499 cc</div>
</div>
<div class="info-item">
 <div class="item-name">Fuel</div>
 <div class="item-content">Bensin </div>
</div>
I need to get the data (XPANDER 1.5L GLX, MT, 1499, Gasoline). I tried the script detail.find(class_='item-content'), but it only returns XPANDER 1.5L GLX. Please help.
Use .find_all() or .select():
from bs4 import BeautifulSoup
html_doc = """
<div class="info-item">
 <div class="item-name">Model</div>
 <div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
 <div class="item-name">Transmission</div>
 <div class="item-content"> MT </div>
</div>
<div class="info-item">
 <div class="item-name">Engine Capacity (cc)</div>
 <div class="item-content">1499 cc</div>
</div>
<div class="info-item">
 <div class="item-name">Fuel</div>
 <div class="item-content">Bensin </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
items = [
    item.get_text(strip=True) for item in soup.find_all(class_="item-content")
]
print(*items)
Prints:
XPANDER 1.5L GLX MT 1499 cc Bensin
Or:
items = [item.get_text(strip=True) for item in soup.select(".item-content")]
You can try this:
soup = BeautifulSoup(html, "html.parser")
items = [item.text for item in soup.find_all("div", {"class": "item-content"})]
find_all retrieves all occurrences, whereas find returns only the first one.
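If the labels are needed together with the values, one possible variation (a sketch, not something the answers above proposed) is to walk over each info-item block and pair its item-name with its item-content. The specs dictionary is just an illustrative name.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="info-item">
 <div class="item-name">Model</div>
 <div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
 <div class="item-name">Transmission</div>
 <div class="item-content"> MT </div>
</div>
<div class="info-item">
 <div class="item-name">Engine Capacity (cc)</div>
 <div class="item-content">1499 cc</div>
</div>
<div class="info-item">
 <div class="item-name">Fuel</div>
 <div class="item-content">Bensin </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Map each label to its value; skip blocks that lack either part.
specs = {}
for item in soup.find_all("div", class_="info-item"):
    name = item.find(class_="item-name")
    content = item.find(class_="item-content")
    if name is not None and content is not None:
        specs[name.get_text(strip=True)] = content.get_text(strip=True)

print(specs)
```

Looking values up by label (for example specs["Fuel"]) is less fragile than relying on the position of each item-content div.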
Unable to scrape h1 class with python/beautiful soup
I am trying to scrape a title from an h1 class, but I keep getting "None":
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('h1', {'class': 'prod-name'})
print(title)
I've also tried it this way:
name_div = soup.find_all('div', {'class': 'col-md-12 col-sm-12 col-xs-12'})[0]
name = name_div.find('h1').text
print(name)
in which case I get: "IndexError: list index out of range". Can anybody help me out? This is the source code:
<div class="row attachDetails __web-inspector-hidebefore-shortcut__"> <div class="row"> <div class="col-md-12 col-sm-12 col-xs-12"> <div class="brand-desc">POLO RALPH LAUREN</div> <h1 class="prod-name">ARAN CREWNECK SWEATER</h1> <div class="panel-group" id="accordion"> <div class="borders-overview"> <div class="panel-heading"> <h4 class="panel-title"> <label class="overview-label collapsed" data-angle="overview-label" data-toggle="collapse" data-parent="#accordion" href="#collapse1"> <a class="fa fa-angle-up pull-right"></a> <a class="over-view">OVERVIEW</a> <span class="color-disp over-view">COLOR: FAWN GREY HEATHER</span> <span class="style-num over-view">MATERIAL# : 710766783002 </span></label> </h4> </div> <div id="collapse1" class="panel-collapse collapse"> <div class="short-desc-section"></div> </div> </div> <div class="border-details"> <div class="panel-heading"> <h4 class="panel-title"> <label class="prod-details collapsed" data-angle="prod-details" data-toggle="collapse" data-parent="#accordion" href="#collapse2"> <a class="detail-link">Details</a> <a class="fa fa-angle-up pull-right"></a> </label> </h4> </div> <div id="collapse2" class="long-desc panel-collapse collapse"> <div><ol><li>STANDARD FIT</li><li>COTTON</li></ol></div> <ol> <div><li><b>Board:</b> S196SC23</li></div> <!--***********************************************************************************************************--> </ol> </div> </div> </div> </div> </div> </div>
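As a sanity check, the sketch below runs the same two lookups against a shortened copy of the markup posted above. Both succeed on that markup. That suggests (this is an assumption, not something confirmed in the thread) that the HTML returned by requests differs from what the browser inspector shows, for example because the page is filled in by JavaScript. Printing page.text and searching it for prod-name would confirm or rule that out.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (only the relevant part).
snippet = """
<div class="row">
 <div class="col-md-12 col-sm-12 col-xs-12">
  <div class="brand-desc">POLO RALPH LAUREN</div>
  <h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
 </div>
</div>
"""

soup = BeautifulSoup(snippet, "html.parser")

# First approach from the question: look the h1 up directly by class.
title = soup.find("h1", {"class": "prod-name"})
title_text = title.get_text(strip=True) if title is not None else None

# Second approach: find the container first, guarding against an empty result
# instead of indexing [0] unconditionally.
containers = soup.find_all("div", {"class": "col-md-12 col-sm-12 col-xs-12"})
name = containers[0].find("h1").get_text(strip=True) if containers else None

print(title_text, name)
```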
How to extract image/href url from div class using scrapy
I am having a hard time extracting the href URL from the following website code:
<div class="expando expando-uninitialized" style="display: none" data-cachedhtml=" <div class="media-preview" id="media-preview-66hch1" style="max-width: 534px"> <div class="media-preview-content"> <a href="https://i.redd.it/nctvpvsnbpsy.jpg" class="may-blank"> <img class="preview" src="https://i.redditmedia.com/UELqh-mbh5mwnXr67PoBbi23nwZuNl2v3flNbkmewQE.jpg?w=534&s=1426be7f811e5d5043760f8882674070" width="534" height="768"> </a> </div> </div> " data-pin-condition="function() {return this.style.display != 'none';}"><span class="error">loading...</span></div>
You can probably use a regular expression for this. Here is an example:
import re
s = """<div class="expando expando-uninitialized" style="display: none" data-cachedhtml=" <div class="media-preview" id="media-preview-66hch1" style="max-width: 534px"> <div class="media-preview-content"> <a href="https://i.redd.it/nctvpvsnbpsy.jpg" class="may-blank"> <img class="preview" src="https://i.redditmedia.com/UELqh-mbh5mwnXr67PoBbi23nwZuNl2v3flNbkmewQE.jpg?w=534&s=1426be7f811e5d5043760f8882674070" width="534" height="768"> </a> </div> </div> " data-pin-condition="function() {return this.style.display != 'none';}"><span class="error">loading...</span></div>"""
re.search('href="(.*jpg)"', s).groups()[0]
# 'https://i.redd.it/nctvpvsnbpsy.jpg'
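A slightly more defensive variant of the same idea, as a sketch: the greedy .* in the pattern above can run across several attributes if more than one quoted value ends in jpg, so this version only allows characters that are not a double quote inside the URL, and it checks for a failed match. The string s here is a shortened copy of the markup from the question.

```python
import re

# Shortened copy of the relevant part of the markup from the question.
s = (
    '<a href="https://i.redd.it/nctvpvsnbpsy.jpg" class="may-blank"> '
    '<img class="preview" src="https://i.redditmedia.com/'
    'UELqh-mbh5mwnXr67PoBbi23nwZuNl2v3flNbkmewQE.jpg?w=534" width="534"> </a>'
)

# [^"]+ stops at the closing quote of the href value, so the match cannot
# spill over into the attributes that follow it.
match = re.search(r'href="([^"]+\.jpg)"', s)
url = match.group(1) if match else None

print(url)  # https://i.redd.it/nctvpvsnbpsy.jpg
```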
BeautifulSoup: scraping for a span gives me a result, for another span it gives "None"
I am coding a scraper for Etsy and when I scrape the span for reviews I get the right output. However, when I scrape for the span with the price it gives me only None values and I don't understand why. If someone could help, it would be great!
#html parsing
page_soup = soup(page_html, "html.parser")
#grabs each listing card
divs = page_soup.find_all("div", {"class": "v2-listing-card__shop"})
for i in divs:
    shop = i.p.text
    reviews = i.find("span", {"class" : "text-body-smaller text-gray-lighter display-inline-block vertical-align-middle icon-b-1"})
    prices = i.find("span", {"class" : "currency-value"})
    print shop
    print reviews.text
    print prices
Here are the two span elements as on the website:
<div class="v2-listing-card__info"> <p class="text-gray text-truncate mb-xs-0 text-body"> Blush Watercolor Flowers & Leaves with Different Shades Clipart Separate Elements Hand Painted Commercial Use | S15 Fairy Tale </p> <div class="v2-listing-card__shop"> <p class="text-gray-lighter text-body-smaller display-inline-block mr-xs-1">PatishopArt</p> <div class="v2-listing-card__rating icon-t-2"> <div class="stars-svg stars-smaller "> <input name="initial-rating" type="hidden" value="5"/> <input name="rating" type="hidden" value="5"/> <span class="screen-reader-only">5 out of 5 stars</span> <div aria-hidden="true" class="rating lit rating-first icon-b-2" data-rating="1"> <span class="etsy-icon stars-svg-star" title="Disappointed"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span> <div class="rating lit" data-rating="2"> <span class="etsy-icon stars-svg-star" title="Not a fan"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" 
xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span> <div class="rating lit" data-rating="3"> <span class="etsy-icon stars-svg-star" title="It's okay"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span> <div class="rating lit" data-rating="4"> <span class="etsy-icon stars-svg-star" title="Like it"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span> <div class="rating lit" data-rating="5"> <span class="etsy-icon stars-svg-star" title="Love it"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span> </div> </div> </div> </div> </div> </div> <span class="text-body-smaller text-gray-lighter display-inline-block vertical-align-middle icon-b-1">(110)</span> </div> </div> <p class="n-listing-card__price 
text-gray strong mt-xs-0"> <span class="currency-symbol">$</span><span class="currency-value">6.60</span> </p> <!-- This shows Free shipping on its own line , we only show it if it wasn't shown above --> </div>
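You are only searching inside the divs with the class v2-listing-card__shop, but it looks to me as if the span in question (currency-value) is outside of those divs, so find() returns None for it. Search from an element that contains both spans, such as the surrounding v2-listing-card__info div.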
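A minimal sketch of that fix, using a heavily trimmed version of the markup from the question (the star-rating block is left out, and the names card, shop_div and results are only illustrative). It loops over the outer v2-listing-card__info div, which contains both the shop block and the price paragraph.

```python
from bs4 import BeautifulSoup

# Heavily trimmed version of the listing card from the question.
page_html = """
<div class="v2-listing-card__info">
 <div class="v2-listing-card__shop">
  <p class="text-gray-lighter text-body-smaller display-inline-block mr-xs-1">PatishopArt</p>
  <span class="text-body-smaller text-gray-lighter display-inline-block vertical-align-middle icon-b-1">(110)</span>
 </div>
 <p class="n-listing-card__price text-gray strong mt-xs-0">
  <span class="currency-symbol">$</span><span class="currency-value">6.60</span>
 </p>
</div>
"""

page_soup = BeautifulSoup(page_html, "html.parser")

results = []
# Iterate over the outer card, which contains both the shop div and the price.
for card in page_soup.find_all("div", class_="v2-listing-card__info"):
    shop_div = card.find("div", class_="v2-listing-card__shop")
    shop = shop_div.p.get_text(strip=True) if shop_div is not None else None
    # The price span is a sibling of the shop div, so search from the card.
    price = card.find("span", class_="currency-value")
    results.append((shop, price.get_text(strip=True) if price is not None else None))

print(results)  # [('PatishopArt', '6.60')]
```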