Getting an AttributeError when adding the .text attribute - python

I have tried the script below and it works just fine:
from bs4 import BeautifulSoup
import requests

pr = input("search: ")
source = requests.get('https://www.flipkart.com/search?q={}&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off'.format(pr)).content
soup = BeautifulSoup(source, 'html.parser')
url = soup.find_all('div', class_='_3O0U0u')
whole_product_list = []
whole_url_list = []
main_product_list = []
main_url_list = []
for i in url:
    tag_a_data = i.find_all('a')
    for l in tag_a_data:
        product_list = l.find('div', class_='_3wU53n')
        if product_list:
            main_product_list.append(product_list.text)
        else:
            product_ok = l.get('title')
            main_product_list.append(product_ok)
print(main_product_list)
So, for example, if I pass "samsung" as input it returns a list built from the "div" elements with the given class, and if I pass something else like "shoes", whose links carry a "title" attribute instead, it returns a list of all the titles available in the HTML.
But if I reverse the order, like below:
from bs4 import BeautifulSoup
import requests

pr = input("search: ")
source = requests.get('https://www.flipkart.com/search?q={}&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off'.format(pr)).content
soup = BeautifulSoup(source, 'html.parser')
url = soup.find_all('div', class_='_3O0U0u')
whole_product_list = []
whole_url_list = []
main_product_list = []
main_url_list = []
for i in url:
    tag_a_data = i.find_all('a')
    for l in tag_a_data:
        product_list = l.get('title')
        if product_list:
            main_product_list.append(product_list)
        else:
            product_ok = l.find('div', class_='_3wU53n').text
            main_product_list.append(product_ok)
print(main_product_list)
it starts raising an AttributeError:
Traceback (most recent call last):
  File "tess.py", line 28, in <module>
    product_ok = l.find('div', class_='_3wU53n').text
AttributeError: 'NoneType' object has no attribute 'text'
I'm not getting why the first script works fine with the same if-else logic but the second does not.

In this line:
product_ok = l.find('div', class_='_3wU53n').text
l.find('div', class_='_3wU53n') returns None, meaning it doesn't find the div. None has no text attribute, so accessing .text on it raises an AttributeError.
A fix would be to use the walrus operator (available since Python 3.8):
if product_ok := l.find('div', class_='_3wU53n'):
    product_ok = product_ok.text
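As a self-contained sketch of that pattern, using inline sample HTML in place of the live Flipkart page (the class name and titles here are just stand-ins for illustration):

```python
from bs4 import BeautifulSoup

# Inline sample standing in for product cards: one <a> only has a title
# attribute, the other carries the product-name div.
html = """
<a title="Shoe X"><span>no name div here</span></a>
<a><div class="_3wU53n">Samsung Galaxy</div></a>
"""
soup = BeautifulSoup(html, "html.parser")

names = []
for a in soup.find_all("a"):
    # := binds the find() result, so .text is only read when a div was found
    if product := a.find("div", class_="_3wU53n"):
        names.append(product.text)
    else:
        names.append(a.get("title"))
print(names)  # ['Shoe X', 'Samsung Galaxy']
```

Either branch appends something, so the order of the checks no longer matters.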

Suppose you have the following data collected for your l values:
item1: <a title="title1"><div class="_3wU53n">xyz</div></a>
item2: <a title="title1"><div>xyz</div></a>
item3: <a title="title1"><div class="_3wU53n">xyz</div></a>
Using the first code, your product_list variable is the matching div for item1 and item3, so their text gets appended; item2 falls through to the else branch and its title, which is available, is used instead. So the code works without any problem.
Using the second code, your product_list variable holds the titles of item1, item2, and item3. But for any item without a title, the else branch looks for the required div tag, and when it doesn't exist (as with item2's div, which lacks the class), calling .text on the None result causes the AttributeError.
The simple point is that items in the database will almost always have a title, but most likely won't always have the required div tag.
The following change should get it working:
from bs4 import BeautifulSoup
import requests

pr = input("search: ")
source = requests.get('https://www.flipkart.com/search?q={}&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off'.format(pr)).content
soup = BeautifulSoup(source, 'html.parser')
url = soup.find_all('div', class_='_3O0U0u')
whole_product_list = []
whole_url_list = []
main_product_list = []
main_url_list = []
for i in url:
    tag_a_data = i.find_all('a')
    for l in tag_a_data:
        product_list = l.get('title')
        if product_list:
            main_product_list.append(product_list)
        else:
            product_div = l.find('div', class_='_3wU53n')
            if product_div:
                main_product_list.append(product_div.text)
print(main_product_list)

Related

BeautifulSoup and Lists

I am attempting to use BeautifulSoup to loop through and request each URL in a txt file. I am able to scrape the first link for what I seek, but on progressing to the next URL I hit an error.
This is the error I keep getting:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
from bs4 import BeautifulSoup as bs
import requests
import constants as c

file = open(c.fvtxt)
read = file.readlines()
res = []
DOMAIN = c.vatican_domain
pdf = []

def get_soup(url):
    return bs(requests.get(url).text, 'html.parser')

for link in read:
    bs = get_soup(link)
    res.append(bs)
    soup = bs.find('div', {'class': 'headerpdf'})
    pdff = soup.find('a')
    li = pdff.get('href')
    surl = f"{DOMAIN}{li}"
    pdf.append(f"{surl}\n")
print(pdf)
It's your variable name that confuses the Python interpreter: you cannot use the same name, bs, for the imported function and a variable at the same time. After the first loop iteration, the global bs no longer refers to BeautifulSoup, so the next call to get_soup() goes through your soup object instead and returns a ResultSet.
It should work fine if you rename the variable bs to parsed_text or anything else but bs.
for link in read:
    parsed_text = get_soup(link)
    res.append(parsed_text)
    soup = parsed_text.find('div', {'class': 'headerpdf'})
    pdff = soup.find('a')
    li = pdff.get('href')
    print(li)
    surl = f"{DOMAIN}{li}"
    pdf.append(f"{surl}\n")
print(pdf)
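The failure mode can be reproduced in isolation. Calling a Tag or soup object is a shorthand for find_all(), so once the global name bs is rebound, get_soup() silently returns a ResultSet instead of a new soup (a minimal sketch with inline markup, not the Vatican pages):

```python
from bs4 import BeautifulSoup as bs

def get_soup(markup):
    # 'bs' is looked up in the global scope at call time, not at definition time
    return bs(markup, "html.parser")

first = get_soup("<div class='headerpdf'><a href='/x.pdf'>x</a></div>")
bs = first  # shadows the imported class, as in the original loop

second = get_soup("<p>hi</p>")  # now calls first(...), i.e. first.find_all(...)
print(type(second).__name__)   # ResultSet - and a ResultSet has no .find()
```

Calling second.find('a') at this point raises exactly the AttributeError from the question.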

How to access other divs with Python and BeautifulSoup

I am trying to access other divs nested within a div for my web scraper using Python and BeautifulSoup, but there seems to be an error.
import requests
from bs4 import BeautifulSoup as Soup

my_url = 'https://www.boohooman.com/mens/shirts/white-shirts'
page = requests.get(my_url).content
data = Soup(page, "html.parser")
P_Info = data.findAll("div", {"class": "product-tile js-product-tile"})
content = P_Info[0]
Now, content.div prints:
<div itemprop="brand" itemscope="" itemtype="https://schema.org/Brand">
And, content.a prints:
<a class="thumb-link js-canonical-link"
This a is inside the sibling div to the former div.
However, content.div.div prints nothing (it is None). And content.div.div.a throws an error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'a'
Can someone point out what I am doing wrong?
This is the structure of each product tile (a.k.a content):
( content ) -> ( div ) -> ( div ) + ( div -> a )
You are trying to find the anchor element a inside your content. However, this a is inside the second div. The problem with your logic is that content.div defaults to selecting the first div it reaches, and that first div contains neither the a element nor a nested div of its own. So the chained lookup content.div.div evaluates to None.
That by itself is perfectly allowed. However, when you then type content.div.div.a you are searching inside a None object, and that raises the error you see on the screen.
Solution: You need to find the second div (I am calling it target), by typing this:
target = content.find("div", {"class":"product-image js-product-image load-bg"})
Now, you can safely call target.a to get the a you were looking for!
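On a stripped-down tile with the same shape (hypothetical markup, assuming the brand div comes first and the image div holds the anchor), the difference is easy to see:

```python
from bs4 import BeautifulSoup

# Minimal mock of one product tile: an empty brand div first,
# then the image div that actually contains the <a>
html = """
<div class="product-tile js-product-tile">
  <div itemprop="brand" itemscope="" itemtype="https://schema.org/Brand"></div>
  <div class="product-image js-product-image load-bg">
    <a class="thumb-link js-canonical-link" href="/p/1">thumb</a>
  </div>
</div>
"""
content = BeautifulSoup(html, "html.parser").div

print(content.div)       # the brand div - attribute access means find('div')
print(content.div.div)   # None: the brand div has no nested div
target = content.find("div", {"class": "product-image js-product-image load-bg"})
print(target.a["href"])  # /p/1
```

Finding the target div explicitly by class sidesteps the guessing that dotted navigation does.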
Edit: Since your comment said that you wanted logic to get all the necessary information from each product on the page, here is the complete code.
import requests
from bs4 import BeautifulSoup

URL = "https://www.boohooman.com/mens/shirts/white-shirts"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
productTiles = soup.findAll("div", {"class": "product-tile js-product-tile"})
for productTile in productTiles:
    nameElement = productTile.find("a", {"class": "name-link js-canonical-link"})
    name = nameElement.text.strip()
    print(f"Name: {name}")
    linkElement = productTile.find("a", {"class": "thumb-link js-canonical-link"})
    link = URL + linkElement["href"]
    print(f"Link: {link}")
    imgElement = productTile.find("img")
    imgSrc = imgElement["src"]
    print(f"Image: {imgSrc}")
    stdPriceElement = productTile.find("span", {"class": "product-standard-price"})
    if stdPriceElement is not None:
        stdPrice = stdPriceElement.text.strip()
        print(f"Standard price: {stdPrice}")
    salesPriceElement = productTile.find("span", {"class": "product-sales-price"})
    if salesPriceElement is not None:
        salesPrice = salesPriceElement.text.strip()
        print(f"Sales price: {salesPrice}")
    print("----------------------------------------------")

Scraping a URL using BeautifulSoup

Hello, I am a beginner in data scraping.
In this case I want to get URLs like "https:// . . .", but instead the link variable ends up as a list of all the links on the page. Here is the code below:
import requests
from bs4 import BeautifulSoup

url = 'https://www.detik.com/search/searchall?query=KPK'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
artikel = soup.findAll('div', {'class': 'list media_rows list-berita'})
p = 1
link = []
for p in artikel:
    s = p.findAll('a', href=True)['href']
    link.append(s)
the code above raises an error such as
TypeError Traceback (most recent call last)
<ipython-input-141-469cb6eabf70> in <module>
3 link = []
4 for p in artikel:
5 s = p.findAll('a', href=True)['href']
6 link.append(s)
TypeError: list indices must be integers or slices, not str
What I want is a list of all the "https:// . . ." links inside <div class="list media_rows list-berita">.
Thank you in advance.
Code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.detik.com/search/searchall?query=KPK'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
articles = soup.findAll('div', {'class': 'list media_rows list-berita'})
links = []
for article in articles:
    hrefs = article.find_all('a', href=True)
    for href in hrefs:
        links.append(href['href'])
print(links)
Output:
['https://news.detik.com/kolom/d-5609578/bahaya-laten-narasi-kpk-sudah-mati', 'https://news.detik.com/berita/d-5609585/penyuap-nurdin-abdullah-tawarkan-proyek-sulsel-ke-pengusaha-minta-rp-1-m', 'https://news.detik.com/berita/d-5609537/7-gebrakan-ahok-yang-bikin-geger', 'https://news.detik.com/berita/d-5609423/ppp-minta-bkn-jangan-asal-sebut-twk-kpk-dokumen-rahasia',
'https://news.detik.com/berita/d-5609382/mantan-sekjen-nasdem-gugat-pasal-suap-ke-mk-karena-dinilai-multitafsir', 'https://news.detik.com/berita/d-5609381/kpk-gali-informasi-soal-nurdin-abdullah-beli-tanah-pakai-uang-suap', 'https://news.detik.com/berita/d-5609378/hrs-bandingkan-kasus-dengan-pinangki-ary-askhara-tuntutan-ke-saya-gila', 'https://news.detik.com/detiktv/d-5609348/pimpinan-kpk-akhirnya-penuhi-panggilan-komnas-ham', 'https://news.detik.com/berita/d-5609286/wakil-ketua-kpk-nurul-ghufron-penuhi-panggilan-komnas-ham-soal-polemik-twk']
There is only one div with the class list media_rows list-berita, so you can use find instead of findAll.
1. Select the div with class name list media_rows list-berita.
2. Select all the <a> tags with findAll from that div. This gives you a list of every <a> tag present inside the div.
3. Iterate over that list and extract the href of each.
Here is a working code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.detik.com/search/searchall?query=KPK'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
artikel = soup.find('div', {'class': 'list media_rows list-berita'})
a_hrefs = artikel.findAll('a')
link = []
for k in a_hrefs:
    link.append(k['href'])
print(link)

Web scraping with beautifulsoup: find and replace missing nodes with None

I'm using the following code to scrape web items with Beaufulsoup:
item_id = []
items = soup.find_all('div', class_='item-id')
for one_item in items:
    list_item = one_item.text
    item_id.append(list_item)
However, some items are missing, and when I run the code I only get the list of the items that are available. How can I get the entire list, with the missing ones listed as "None"?
import requests
from bs4 import BeautifulSoup as bsoup

site_source = requests.get("https://search.bvsalud.org/global-literature-on-novel-coronavirus-2019-ncov/?output=site&lang=en&from=0&sort=&format=summary&count=100&fb=&page=1&skfp=&index=tw&q=%28%22rapid+test%22+OR+%22rapid+diagnostic+test%22%29+AND+sensitivity+AND+specificity").content
soup = bsoup(site_source, "html.parser")
item_list = soup.find_all('div', class_='textArt')
result_list = []
for item in item_list:
    result = item.find('div', class_='reference')
    if result is None:
        result_list.append('None')
    else:
        result_list.append(result.text)
for result in result_list:
    print(result)
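The same pattern can be checked offline against inline HTML (hypothetical markup mirroring the page's textArt/reference structure), appending None itself rather than the string 'None':

```python
from bs4 import BeautifulSoup

# Three result blocks; the middle one is missing its reference div
html = """
<div class="textArt"><div class="reference">Ref A</div></div>
<div class="textArt"><p>no reference here</p></div>
<div class="textArt"><div class="reference">Ref C</div></div>
"""
soup = BeautifulSoup(html, "html.parser")

result_list = []
for item in soup.find_all('div', class_='textArt'):
    result = item.find('div', class_='reference')
    # keep a placeholder for missing items so positions stay aligned
    result_list.append(result.text if result is not None else None)
print(result_list)  # ['Ref A', None, 'Ref C']
```

Keeping one entry per scraped item means the output list stays the same length as the input, which matters if you later zip it with other columns.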

Why does TypeError appear occasionally?

My Python scraping program is running into a TypeError.
Here's my code:
from bs4 import BeautifulSoup
import requests, feedparser

cqrss = feedparser.parse('https://www.reddit.com/r/pics/new.rss')
for submission in cqrss.entries:
    folder_name = submission.title  # used to create a folder
    reddit_url = submission.link
    source = requests.get(reddit_url)
    plain_text = source.content
    soup = BeautifulSoup(plain_text, 'lxml')
    title = soup.find('a', 'title may-blank outbound', href=True)
    if 'imgur.com' in title['href']:
        imgur_link = title['href']
        print(imgur_link)
Error:
if 'imgur.com' in title['href']:
TypeError: 'NoneType' object is not subscriptable
What did I do wrong?
find "fails" (i.e. does not find anything) for some data and returns None.
if title and 'imgur.com' in title['href']:
    imgur_link = title['href']
    print(imgur_link)
should work.
Note that print was moved under the if clause, as it obviously makes no sense to call it if the data isn't there.
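A quick way to see the guard working is to run it over two inline snippets, one with the matching anchor and one without (sample markup, not real Reddit output):

```python
from bs4 import BeautifulSoup

pages = [
    '<a class="title may-blank outbound" href="https://imgur.com/abc">pic</a>',
    '<p>a submission with no outbound link</p>',
]
links = []
for markup in pages:
    soup = BeautifulSoup(markup, 'html.parser')
    title = soup.find('a', 'title may-blank outbound', href=True)
    if title and 'imgur.com' in title['href']:  # guard: skip when find() returns None
        links.append(title['href'])
print(links)  # ['https://imgur.com/abc']
```

The second snippet makes find() return None, and thanks to short-circuiting the subscript title['href'] is never evaluated for it.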
