I am trying to access divs nested within a div for my web scraper using Python and BeautifulSoup, but I am getting an error.
my_url='https://www.boohooman.com/mens/shirts/white-shirts'
data= Soup(page, "html.parser")
P_Info= data.findAll("div",{"class":"product-tile js-product-tile"})
content=P_Info[0]
Now, content.div prints:
<div itemprop="brand" itemscope="" itemtype="https://schema.org/Brand">
And, content.a prints:
<a class="thumb-link js-canonical-link"
This a is inside a sibling of that first div.
However, content.div.div prints nothing. And, content.div.div.a throws an error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'a'
Can someone point out what I am doing wrong?
This is the structure of each product tile (a.k.a content):
( content ) -> ( div ) -> ( div ) + ( div -> a )
You are trying to reach the anchor element a inside content, but this a sits inside the second div. content.div selects the first div inside content, so content.div.div searches inside that first div, which contains neither a nested div nor an a element. As a result, content.div.div evaluates to None.
Getting None back is perfectly allowed. However, content.div.div.a then tries to access the attribute a on None, and that raises the AttributeError you see on the screen.
Solution: You need to find the second div (I am calling it target), by typing this:
target = content.find("div", {"class":"product-image js-product-image load-bg"})
Now, you can safely call target.a to get the a you were looking for!
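To see the difference without hitting the site, here is a minimal sketch. The markup is a hypothetical, simplified stand-in for one product tile (only the class names come from the question), so treat it as an illustration rather than the real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for one product tile.
html = """
<div class="product-tile js-product-tile">
  <div itemprop="brand"></div>
  <div class="product-image js-product-image load-bg">
    <a class="thumb-link js-canonical-link" href="/item-1">Item</a>
  </div>
</div>
"""

content = BeautifulSoup(html, "html.parser").div

first_div = content.div   # first <div> nested in the tile (the brand div)
nested = content.div.div  # the brand div holds no <div>, so this is None

# Searching by class skips the brand div and lands on the sibling holding the link.
target = content.find("div", {"class": "product-image js-product-image load-bg"})
link = target.a["href"] if target is not None else None
print(first_div.get("itemprop"), nested, link)
```

Checking target against None before using it keeps the same mistake from resurfacing if a tile ever lacks the image div.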
Edit: Since your comment said that you wanted to implement logic to get all necessary information from each product on the page, here is the complete code.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
URL = "https://www.boohooman.com/mens/shirts/white-shirts"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
productTiles = soup.findAll("div", {"class": "product-tile js-product-tile"})
for productTile in productTiles:
    nameElement = productTile.find(
        "a", {"class": "name-link js-canonical-link"})
    name = nameElement.text.strip()
    print(f"Name: {name}")
    linkElement = productTile.find(
        "a", {"class": "thumb-link js-canonical-link"})
    link = linkElement["href"]
    # urljoin resolves the href against the page URL, whether it is relative or absolute
    link = urljoin(URL, link)
    print(f"Link: {link}")
    imgElement = productTile.find("img")
    imgSrc = imgElement["src"]
    print(f"Image: {imgSrc}")
    # Not every tile has both prices, so check for None before reading the text
    stdPriceElement = productTile.find(
        "span", {"class": "product-standard-price"})
    if stdPriceElement is not None:
        stdPrice = stdPriceElement.text.strip()
        print(f"Standard price: {stdPrice}")
    salesPriceElement = productTile.find(
        "span", {"class": "product-sales-price"})
    if salesPriceElement is not None:
        salesPrice = salesPriceElement.text.strip()
        print(f"Sales price: {salesPrice}")
    print("----------------------------------------------")
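If you would rather collect the results than print them, something along these lines could work. It is only a sketch: the inline HTML is made up for the example, and text_or_none is a hypothetical helper, not part of BeautifulSoup.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://www.boohooman.com/mens/shirts/white-shirts"

# Hypothetical markup: the second tile has no sales price.
html = """
<div class="product-tile js-product-tile">
  <a class="name-link js-canonical-link" href="/shirt-a"> Shirt A </a>
  <span class="product-sales-price"> 12.00 </span>
</div>
<div class="product-tile js-product-tile">
  <a class="name-link js-canonical-link" href="/shirt-b"> Shirt B </a>
</div>
"""


def text_or_none(element):
    """Return the stripped text of a tag, or None when find() found nothing."""
    return element.get_text(strip=True) if element is not None else None


soup = BeautifulSoup(html, "html.parser")
products = []
for tile in soup.find_all("div", {"class": "product-tile js-product-tile"}):
    name_link = tile.find("a", {"class": "name-link js-canonical-link"})
    products.append({
        "name": text_or_none(name_link),
        # urljoin resolves a relative href against the page URL.
        "link": urljoin(BASE_URL, name_link["href"]) if name_link is not None else None,
        "sales_price": text_or_none(tile.find("span", {"class": "product-sales-price"})),
    })
print(products)
```

A missing element then shows up as None in the record instead of raising an AttributeError halfway through the loop.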
In this code I am scraping a website for all "a" tags inside every "div class='image'" found on the page, and printing the contents of each of those "a" tags.
import requests
from bs4 import BeautifulSoup
import os
#edible mushroom scraping
url = 'http://www.mushroom.world/mushrooms/edible?page=0'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
#find all image classes
images = soup.find('div', attrs={'class': 'image'})
#images = soup.select('div.class image')
#find all images within class
for link in images.findAll("a"):
    #get url ending from image a href
    image_url= (link.get('href'))
    #creates usable url
    image_url = image_url.replace('/../', 'https://www.mushroom.world/')
    print(image_url)
I believe the issue with the code is around the:
#find all image classes
images = soup.find('div', attrs={'class': 'image'})
When using soup.find, images is set to the first div of class "image", and the rest of the code successfully retrieves the "a" tags found inside that first "image" div. However, when I change the code to:
#find all image classes
images = soup.find_all('div', attrs={'class': 'image'})
in order to go through all "image" class divs, then the code gives the error:
Exception has occurred: AttributeError
ResultSet object has no attribute 'findAll'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
To select all <a> tags under class="image", you can use CSS selector:
import requests
from bs4 import BeautifulSoup
url = "http://www.mushroom.world/mushrooms/edible?page=0"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
for a in soup.select(".image a"):
    u = a["href"].replace("/../", "https://www.mushroom.world/")
    print(u)
Prints:
https://www.mushroom.world/data/fungi/Agaricusarvensis1.JPG
https://www.mushroom.world/data/fungi/Agaricusarvensis2.JPG
https://www.mushroom.world/data/fungi/Agaricusarvensis3.JPG
https://www.mushroom.world/data/fungi/Agaricusarvensis4.JPG
https://www.mushroom.world/data/fungi/Agaricusarvensis5.JPG
https://www.mushroom.world/data/fungi/Agaricusaugustus1.jpg
https://www.mushroom.world/data/fungi/Agaricusaugustus2.jpg
https://www.mushroom.world/data/fungi/Agaricusaugustus3.jpg
https://www.mushroom.world/data/fungi/Agaricusaugustus4.jpg
https://www.mushroom.world/data/fungi/Albatrellusovinus1.JPG
https://www.mushroom.world/data/fungi/Albatrellusovinus2.JPG
https://www.mushroom.world/data/fungi/Albatrellusovinus3.JPG
https://www.mushroom.world/data/fungi/Albatrellusovinus4.JPG
https://www.mushroom.world/data/fungi/Albatrellusovinus5.JPG
https://www.mushroom.world/data/fungi/Armillariamellea1.JPG
https://www.mushroom.world/data/fungi/Armillariamellea2.JPG
https://www.mushroom.world/data/fungi/Armillariamellea3.JPG
https://www.mushroom.world/data/fungi/Armillariamellea4.JPG
https://www.mushroom.world/data/fungi/Armillariamellea5.JPG
https://www.mushroom.world/data/fungi/Armillariamellea6.JPG
https://www.mushroom.world/data/fungi/Boletusbadius1.JPG
https://www.mushroom.world/data/fungi/Boletusbadius2.JPG
https://www.mushroom.world/data/fungi/Boletusbadius3.JPG
https://www.mushroom.world/data/fungi/Boletusbadius4.JPG
https://www.mushroom.world/data/fungi/Boletusbadius5.JPG
https://www.mushroom.world/data/fungi/Boletusbadius7.JPG
https://www.mushroom.world/data/fungi/Boletusedulis1.JPG
https://www.mushroom.world/data/fungi/Boletusedulis2.JPG
https://www.mushroom.world/data/fungi/Boletusedulis3.JPG
https://www.mushroom.world/data/fungi/Boletusedulis4.JPG
https://www.mushroom.world/data/fungi/Boletusedulis5.JPG
https://www.mushroom.world/data/fungi/Boletusedulis6.JPG
https://www.mushroom.world/data/fungi/Boletuspinophilus1.jpg
https://www.mushroom.world/data/fungi/Boletuspinophilus2.JPG
https://www.mushroom.world/data/fungi/Boletuspinophilus3.JPG
https://www.mushroom.world/data/fungi/Boletuspinophilus4.JPG
https://www.mushroom.world/data/fungi/Boletuspinophilus5.JPG
https://www.mushroom.world/data/fungi/Boletuspinophilus6.JPG
https://www.mushroom.world/data/fungi/Boletussubtomentosus1.JPG
https://www.mushroom.world/data/fungi/Boletussubtomentosus2.JPG
https://www.mushroom.world/data/fungi/Boletussubtomentosus3.JPG
https://www.mushroom.world/data/fungi/Boletussubtomentosus4.JPG
https://www.mushroom.world/data/fungi/Boletussubtomentosus5.JPG
https://www.mushroom.world/data/fungi/Boletussubtomentosus6.JPG
https://www.mushroom.world/data/fungi/Boletussubtomentosus7.JPG
If you don't prefer CSS selectors:
for img in soup.find_all(class_="image"):
    for a in img.find_all("a"):
        u = a["href"].replace("/../", "https://www.mushroom.world/")
        print(u)
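The error in the question comes from the difference between what find and find_all return. The sketch below shows it on a small made-up snippet (the two divs and their hrefs are placeholders, not the real page): find gives back a single Tag, find_all gives back a ResultSet that has to be looped over, and both approaches above collect the same links.

```python
from bs4 import BeautifulSoup

# Placeholder markup with two "image" divs.
html = """
<div class="image"><a href="/../data/fungi/a1.JPG">a1</a></div>
<div class="image"><a href="/../data/fungi/b1.JPG">b1</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

single = soup.find("div", class_="image")    # one Tag: has find_all()
many = soup.find_all("div", class_="image")  # ResultSet: a list of Tags

# Loop over the ResultSet and search inside each Tag.
via_loop = [a["href"] for div in many for a in div.find_all("a")]
# The CSS selector does the same descent in one call.
via_css = [a["href"] for a in soup.select(".image a")]
print(type(single).__name__, type(many).__name__, via_loop == via_css)
```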
I have tried the script below and it works just fine:
from bs4 import BeautifulSoup
import requests
pr= input("search: ")
source= requests.get('https://www.flipkart.com/search?q={}&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off'.format(pr)).content
soup = BeautifulSoup(source, 'html.parser')
url= soup.find_all('div', class_=('_3O0U0u'))
whole_product_list= []
whole_url_list= []
main_product_list= []
main_url_list= []
for i in url:
    tag_a_data= i.find_all('a')
    for l in tag_a_data:
        product_list= l.find('div', class_= '_3wU53n')
        if product_list:
            main_product_list.append(product_list.text)
        else:
            product_ok= l.get('title')
            main_product_list.append(product_ok)
print(main_product_list)
So, for example, if I pass "samsung" as input, it returns a list built from the text of each "div" with the given class, and if I pass something else, like "shoes", whose results carry a "title" attribute instead, it returns a list of all the titles available in the HTML.
But if I reverse the order, like below:
from bs4 import BeautifulSoup
import requests
pr= input("search: ")
source= requests.get('https://www.flipkart.com/search?q={}&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off'.format(pr)).content
soup = BeautifulSoup(source, 'html.parser')
url= soup.find_all('div', class_=('_3O0U0u'))
whole_product_list= []
whole_url_list= []
main_product_list= []
main_url_list= []
for i in url:
    tag_a_data= i.find_all('a')
    for l in tag_a_data:
        product_list = l.get('title')
        if product_list:
            main_product_list.append(product_list)
        else:
            product_ok= l.find('div', class_= '_3wU53n').text
            main_product_list.append(product_ok)
print(main_product_list)
it starts giving an attribute error:
Traceback (most recent call last):
File "tess.py", line 28, in <module>
product_ok= l.find('div', class_= '_3wU53n').text
AttributeError: 'NoneType' object has no attribute 'text'
I don't understand why the first script works with its if-else logic but the second one does not.
In this line:
product_ok= l.find('div', class_= '_3wU53n').text
l.find('div', class_= '_3wU53n') returns None, meaning it doesn't find the div. None has no text attribute, so accessing .text raises an AttributeError exception.
A fix would be to use the walrus operator (available since Python 3.8):
if product_ok := l.find('div', class_= '_3wU53n'):
    product_ok = product_ok.text
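Put into the loop from the question, the idea could look like the sketch below. The three links are invented test data (one with a title, one with the div, one with neither), and it needs Python 3.8 or later for the := syntax.

```python
from bs4 import BeautifulSoup

# Invented sample links: title only, div only, and neither.
html = """
<a title="Running shoes" href="/p1"></a>
<a href="/p2"><div class="_3wU53n">Samsung Galaxy</div></a>
<a href="/p3"></a>
"""
soup = BeautifulSoup(html, "html.parser")

names = []
for link in soup.find_all("a"):
    if title := link.get("title"):
        names.append(title)
    elif (div := link.find("div", class_="_3wU53n")) is not None:
        names.append(div.text)
    # Links with neither a title nor the div are skipped.
print(names)
```

Each lookup is assigned and tested in the same expression, so .text is only read from a div that was actually found.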
Suppose you have collected the following data for your "l" values:
item1 <title>title1</title><div class='_3wU53n'>xyz</div>
item2 <title>title1</title><div>xyz</div>
item3 <title>title1</title><div class='_3wU53n'>xyz</div>
With the first script, your product_list variable will match item1 and item3. The titles of the items are available, so you can always fall back to them, and the code works without any problem.
With the second script, your product_list variable will contain item1, item2, and item3. But in this case you cannot get the required div tag for every item, as it does not exist for the second item, so find returns None and reading .text from it raises the AttributeError.
In short, an item will always have a title, but it will not always have the required div tag.
The following change should get it working:
from bs4 import BeautifulSoup
import requests
pr= input("search: ")
source= requests.get('https://www.flipkart.com/search?q={}&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off'.format(pr)).content
soup = BeautifulSoup(source, 'html.parser')
url= soup.find_all('div', class_=('_3O0U0u'))
whole_product_list= []
whole_url_list= []
main_product_list= []
main_url_list= []
for i in url:
    tag_a_data= i.find_all('a')
    for l in tag_a_data:
        product_list = l.get('title')
        if product_list:
            main_product_list.append(product_list)
        else:
            if l.find("div", class_='_3wU53n'):
                product_ok= l.find('div', class_= '_3wU53n').text
                main_product_list.append(product_ok)
print(main_product_list)
My Python scraping program is running into a TypeError.
Here's my code:
from bs4 import BeautifulSoup
import requests, feedparser
cqrss = feedparser.parse('https://www.reddit.com/r/pics/new.rss')
for submission in cqrss.entries:
    folder_name = submission.title #use for create folder
    reddit_url = submission.link
    source = requests.get(reddit_url)
    plain_text = source.content
    soup = BeautifulSoup(plain_text, 'lxml')
    title = soup.find('a', 'title may-blank outbound', href=True)
    if 'imgur.com' in title['href']:
        imgur_link = title['href']
    print(imgur_link)
Error:
if 'imgur.com' in title['href']:
TypeError: 'NoneType' object is not subscriptable
What did I do wrong?
find "fails" (i.e. does not find anything) for some data and returns None.
if title and 'imgur.com' in title['href']:
    imgur_link = title['href']
    print(imgur_link)
should work.
Note that print was moved under the if clause, since it makes no sense to call it when the data isn't there.
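Here is a small offline sketch of that guard. The two page fragments are invented for the example (one has the outbound title link, one does not), and html.parser is used instead of lxml only so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Invented fragments: only the first one has the "outbound" title link.
pages = [
    '<a class="title may-blank outbound" href="https://imgur.com/abc">pic</a>',
    '<a class="title may-blank" href="https://www.reddit.com/r/pics/x">self post</a>',
]

imgur_links = []
for page in pages:
    soup = BeautifulSoup(page, "html.parser")
    title = soup.find("a", "title may-blank outbound", href=True)
    if title is None:
        # find() found no matching link on this page, so skip it.
        continue
    if "imgur.com" in title["href"]:
        imgur_links.append(title["href"])
print(imgur_links)
```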
Usually I would just select the div by a class name, but it's not unique. The only unique thing the div tag has is the word "data-sc-replace" right after div. This is a shortened example of the source code:
<div data-sc-replace data-sc-slot="1234" class = "inlineblock" data-sc-params="{'magnet': 'magnet:?......'extension': 'epub', 'stream': '' }"></div>
How would I go about calling the word "data-sc-replace" if it's not attached to a class or an id?
This is the code I have
import requests
from bs4 import BeautifulSoup
url_to_scrape = "http://example.com"
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html5lib")
list = soup.findAll('div', {'class':'inlineblock'})
print(list)
# list = soup.findAll("div", "data-sc-params")
# list = soup.find('data-sc-replace')
# list = soup.find('data-sc-params')
# list = soup.find('div', {'class':'inlineblock'}, 'data-sc-params')
Use a CSS selector. This finds all divs that have a data-sc-replace attribute:
result = soup.select('div[data-sc-replace]')
That distinctive mark seems to be an HTML attribute without value. So try this:
soup.find('div', attrs = {'data-sc-replace': ''})
# or use find_all() to get all such div containers
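Both suggestions can be checked on a cut-down version of the markup. This is a sketch with made-up sample divs; it also uses attrs={"data-sc-replace": True}, which matches on the presence of the attribute whatever its value, so it does not depend on how the parser represents a valueless attribute.

```python
from bs4 import BeautifulSoup

# Cut-down sample: only the first div carries the valueless attribute.
html = """
<div data-sc-replace data-sc-slot="1234" class="inlineblock"></div>
<div class="inlineblock"></div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector: matches any div that has the attribute.
by_css = soup.select("div[data-sc-replace]")
# True matches the attribute regardless of its value.
by_presence = soup.find_all("div", attrs={"data-sc-replace": True})
print(len(by_css), len(by_presence), by_css[0]["data-sc-slot"])
```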
I just started a Python web course and was trying to parse HTML data using BeautifulSoup when I came across this error. I researched but couldn't find a precise solution. Here is the piece of code:
import requests
from bs4 import BeautifulSoup
request = requests.get("http://www.johnlewis.com/toms-berkley-slipper-grey/p3061099")
content = request.content
soup = BeautifulSoup(content, 'html.parser')
element = soup.find(" span", {"itemprop ": "price ", "class": "now-price"})
string_price = (element.text.strip())
print(int(string_price))
# <span itemprop="price" class="now-price"> £40.00 </span>
And this is the error I face :
C:\Users\IngeniousAmbivert\venv\Scripts\python.exe
C:/Users/IngeniousAmbivert/PycharmProjects/FullStack/price-eg/src/app.py
Traceback (most recent call last):
File "C:/Users/IngeniousAmbivert/PycharmProjects/FullStack/price-eg/src/app.py", line 8, in <module>
string_price = (element.text.strip())
AttributeError: 'NoneType' object has no attribute 'text'
Process finished with exit code 1
Any help will be appreciated
The problem is the extra space characters inside the tag name, the attribute name and the attribute value. Replace:
element = soup.find(" span", {"itemprop ": "price ", "class": "now-price"})
with:
element = soup.find("span", {"itemprop": "price", "class": "now-price"})
After that, two more things to fix when converting the string:
strip the £ character from the left
use float() instead of int()
Fixed version:
element = soup.find("span", {"itemprop": "price", "class": "now-price"})
string_price = (element.get_text(strip=True).lstrip("£"))
print(float(string_price))
You would see 40.0 printed, since float() does not keep the trailing zero.
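If the exact decimal form of the price matters, one option is Decimal instead of float. This is a sketch on the span quoted in the question rather than the live page; the None check and the comma removal are my own additions for safety, not something the page is known to need.

```python
from decimal import Decimal
from bs4 import BeautifulSoup

# The span quoted in the question, used as inline test data.
html = '<span itemprop="price" class="now-price"> £40.00 </span>'
soup = BeautifulSoup(html, "html.parser")

element = soup.find("span", {"itemprop": "price", "class": "now-price"})
if element is None:
    raise ValueError("price element not found")

raw = element.get_text(strip=True)
# Drop the currency sign and any thousands separator before converting.
price = Decimal(raw.lstrip("£").replace(",", ""))
print(price)
```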
You can also try it like this, using a CSS selector:
import requests
from bs4 import BeautifulSoup
request = requests.get("http://www.johnlewis.com/toms-berkley-slipper-grey/p3061099")
content = request.content
soup = BeautifulSoup(content, 'html.parser')
# print(soup)
element = soup.select("div p.price span.now-price")[0]
print(element)
string_price = (element.text.strip())
print(int(float(string_price[1:])))
Output:
<span class="now-price" itemprop="price">
£40.00
</span>
40
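One caveat with indexing select(...)[0] is that it raises an IndexError when nothing matches. A variant with select_one, which returns None in that case, might look like this; the markup is a minimal stand-in for the page, and span.was-price is just a made-up selector that matches nothing.

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the relevant part of the page.
html = '<div><p class="price"><span class="now-price"> £40.00 </span></p></div>'
soup = BeautifulSoup(html, "html.parser")

found = soup.select_one("div p.price span.now-price")
missing = soup.select_one("div p.price span.was-price")  # no such element

# Skip the leading currency sign, then convert as in the answer above.
price = int(float(found.get_text(strip=True)[1:])) if found is not None else None
print(price, missing)
```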