Why is getAttribute() not giving a result in Selenium? - python

I was trying to scrape a Yellow Pages Australia page. I searched for all the pizza restaurants in Australia. Now I want to fetch the email of every restaurant, which is the value of data-email (an attribute of an anchor tag). Below is my code. I used getAttribute() on the anchor tag, but it always gives me this error.
TypeError: 'NoneType' object is not callable
This is my code
import csv
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
url = "https://www.yellowpages.com.au/search/listings?clue=Pizza+Restaurants&locationClue=Sydney+CBD%2C+NSW&lat=&lon="
driver=webdriver.Chrome(executable_path="/usr/local/share/chromedriver")
driver.get(url)
pageSource=driver.page_source
bsObj=BeautifulSoup(pageSource,'lxml')
items=bsObj.find('div',{'class':'flow-layout outside-gap-large inside-gap inside-gap-large vertical'}).findAll('div',class_='cell in-area-cell find-show-more-trial middle-cell')
for item in items:
    print(item.find('a',class_='contact contact-main contact-email ').getAttribute("data-email"))

Tag.getAttribute does not exist - you want either Tag[<attrname>] (if you are sure the item has this attribute) or Tag.get(<attrname>[,default=None]) if you're not.
Note that with most Python objects you would have got an AttributeError but beautifulsoup uses the __getattr__ hook a lot and returns None instead of raising an AttributeError when it cannot dynamically resolve an attribute, which is rather confusing.
This being said, item.find() can return None so you should indeed also test the result of item.find() before calling .get() on it, ie:
tag = item.find('a', ...)
if tag:
    email = tag.get("data-email")
    if email:
        print(email)
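The points above can be sketched on a small, made-up snippet. The markup, the class names and the address below are assumptions for illustration and are not taken from the Yellow Pages page; get_email is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = (
    '<div>'
    '<a class="contact-email" data-email="info@example.com">Email</a>'
    '<a class="plain">No email</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")


def get_email(tag):
    """Return the data-email value, or None if the tag or the attribute is missing."""
    if tag is None:
        return None
    return tag.get("data-email")


with_email = soup.find("a", class_="contact-email")
without_email = soup.find("a", class_="plain")
missing = soup.find("a", class_="does-not-exist")  # find() returns None here

# An unknown attribute name is treated as a child-tag lookup, so this
# evaluates to None instead of raising AttributeError. Calling it is what
# produces "'NoneType' object is not callable".
not_a_method = with_email.getAttribute

emails = [get_email(t) for t in (with_email, without_email, missing)]
print(emails)
```

Using tag["data-email"] instead of tag.get("data-email") would raise KeyError for the second anchor, which is why .get() is the safer choice when the attribute is optional.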

You can also try something like this
https://github.com/n0str/beautifulsoup-none-catcher
So, it becomes
from maybe import Maybe
bsObj=BeautifulSoup(pageSource,'lxml')
items=Maybe(bsObj).find('div',{'class':'flow-layout outside-gap-large inside-gap inside-gap-large vertical'}).find_all('div', {'class': 'cell in-area-cell find-show-more-trial middle-cell'})
print('\n'.join(filter(lambda x: x, [Maybe(item).find('a', {'class': 'contact-email'}).get("data-email").resolve() for item in items.resolve()])))
Output
[..]#crust.com.au
[..]#madinitalia.com
<...>
[..]#ventuno.com.au
Just wrap Maybe(soup) and call .resolve() afterwards

Related

Beautifulsoup html parser attribute crawling question

from bs4 import BeautifulSoup as bs4
import json
import requests
url = 'https://www.11st.co.kr/products/4976666261?NaPm=ct=ld6p5dso|ci=e5e093b328f0ae7bb7c9b67d5fd75928ea152434|tr=slsbrc|sn=17703|hk=87f5ed3e082f9a3cd79cdd0650afa9612c37d9e8&utm_term=&utm_campaign=%B3%D7%C0%CC%B9%F6pc_%B0%A1%B0%DD%BA%F1%B1%B3%B1%E2%BA%BB&utm_source=%B3%D7%C0%CC%B9%F6_PC_PCS&utm_medium=%B0%A1%B0%DD%BA%F1%B1%B3'
html = '''
<div class="side_sm">선물</div>
'''
json_data=soup.find('div',class_='side_sm').find('a').attrs('data-log-body')
print(json_data)
TypeError: 'dict' object is not callable
I want to get the value of the dictionary, which is the value of the data-log-body attribute of the a tag under the div tag with class side_sm.
I keep getting this error, so please give me some advice on how to deal with it.
To access a dict with key use square brackets:
.attrs['data-log-body']
Use tag.get('attr') if you’re not sure attr is defined, just as you
would with a Python dictionary.
docs
So I would recommend to use:
.get('data-log-body')
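A minimal sketch of both access styles. The question's HTML does not show the a tag, so the markup and the JSON payload here are invented, and the assumption that data-log-body holds JSON is mine.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical markup: the real attribute value is not shown in the question.
html = '''<div class="side_sm"><a href="#" data-log-body='{"btn_name": "gift"}'>gift</a></div>'''
soup = BeautifulSoup(html, "html.parser")

link = soup.find("div", class_="side_sm").find("a")

# attrs is a plain dict, so it is indexed with [] and never called with ().
raw = link.get("data-log-body")

# Assumption: the attribute holds JSON text, so it is decoded into a dict.
data = json.loads(raw) if raw is not None else {}

# get() accepts a default for attributes that may be absent.
fallback = link.get("data-missing", "n/a")

print(data, fallback)
```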

Python web scraping error: 'NoneType' object is not callable after using split function

I'm a beginner writing my first scraping script trying to extract company name, phone number, and email from the following page.
So far my script successfully pulls out the name and phone number, but I am getting stuck on pulling out the email, which is nested within a script object. My latest two attempts involved using regex, and when that failed, a split function, which is returning the error mentioned in the title.
Script:
import re
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup
url1 = "http://pcoc.officialbuyersguide.net/Listing?MDSID=CPC-1210"
html = urlopen(url1)
soup = BeautifulSoup(html,'html.parser')
for company_name in soup.find_all(class_='ListingPageNameAddress NONE'):
    print(company_name.find('h1').text)
for phone in soup.find_all(class_='ListingPageNameAddress NONE'):
    print(phone.find(class_='Disappear').text)
for email in soup.findAll(class_='ListingPageNameAddress NONE'):
    print(email.find('script').text)
    print(email.split('LinkValue: "')[1].split('"')[0])
    print(re.findall(r"([\w\._]+\#([\w_]+\\.)+[a-zA-Z]+)", soup))
Error:
TypeError Traceback (most recent call last)
<ipython-input-20-ace5e5106ea7> in <module>
1 for email in soup.findAll(class_='ListingPageNameAddress NONE'):
2 print(email.find('script').text)
----> 3 print(email.split('LinkValue: "')[1].split('"')[0])
4 print(re.findall(r"([\w\._]+\#([\w_]+\\.)+[a-zA-Z]+)", soup))
TypeError: 'NoneType' object is not callable
HTML within "script" that I'm trying to pull from:
EMLink('com','aol','mikemhnam','<div class="emailgraphic"><img style="position: relative; top: 3px;" src="https://www.naylornetwork.com/EMailProtector/text-gif.aspx?sx=com&nx=mikemhnam&dx=aol&size=9&color=034af3&underline=yes" border=0></div>','pcoc.officialbuyersguide.net Inquiry','onClick=\'$.get("TrackLinkClick", { LinkType: "Email", LinkValue: "mikemhnam#aol.com", MDSID: "CPC-1210", AdListingID: "" });\'')
As far as I'm aware, BeautifulSoup doesn't expose a split function on elements.
BeautifulSoup elements let you access any attribute name, though, and if it isn't a property or function of the element, it will look for a tag with that name. For instance, element.div would find the first descendant of element that is a div. So you can even do things like element.nonsense, and since nonsense is not a function or property of the element object, it then searches the document tree for a descendant with the name nonsense, and since one doesn't exist, it will simply return None.
So when you call email.split(...), it doesn't find a function or property called split on the email object, so it searches the HTML tree for a tag named split. Since it can't find an element named split, it returns None, and you try to call it as a function, which results in the error you are getting.
Is it possible you meant to split the text of the element instead, i.e. email.text.split(...)?
Try this; it might solve your problem.
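Once the script text is a plain string, ordinary string tools work on it. Here is a sketch using only the standard library on a shortened copy of the onClick fragment quoted in the question; extract_link_value is a hypothetical helper name.

```python
import re

# Shortened copy of the onClick fragment quoted in the question.
script_text = '''$.get("TrackLinkClick", { LinkType: "Email", LinkValue: "mikemhnam#aol.com", MDSID: "CPC-1210", AdListingID: "" });'''


def extract_link_value(text):
    """Return the quoted value that follows LinkValue:, or None if it is absent."""
    match = re.search(r'LinkValue:\s*"([^"]*)"', text)
    return match.group(1) if match else None


# The split() approach from the question also works once it runs on a str.
by_split = script_text.split('LinkValue: "')[1].split('"')[0]

print(extract_link_value(script_text))
print(by_split)
```

The regex version returns None when the marker is missing, whereas the split version raises IndexError in that case.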
import re
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup
url1 = "http://pcoc.officialbuyersguide.net/Listing?MDSID=CPC-1210"
html = urlopen(url1)
soup = BeautifulSoup(html,'html.parser')
for company_name in soup.find_all(class_='ListingPageNameAddress NONE'):
    print(company_name.find('h1').text)
for phone in soup.find_all(class_='ListingPageNameAddress NONE'):
    print(phone.find(class_='Disappear').text)
for email in soup.findAll(class_='ListingPageNameAddress NONE'):
    print(email.find('script').text)
    a=email.find('script').text
    # print(email.split('LinkValue: "')[1].split('"')[0])
    print(str(re.findall(r"\S+#\S+", a)).split('"')[1])
Did you try str(email) before you split it? It worked for me!

how to get html text in <strong> tag using python

I have tried multiple methods to no avail.
I have this simple html that I want to extract the number 373 and then do some division.
<span id="ctl00_cph1_lblRecCount">Records Found: <strong> 373</strong></span>
I attempted to get the number with this python script below
import requests
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
import urllib3
import re
NSNpreviousAwardRef = "https://www.dibbs.bsm.dla.mil/Awards/AwdRecs.aspx?Category=nsn&TypeSrch=cq&Value="+NSN+"&Scope=all&Sort=nsn&EndDate=&StartDate=&lowCnt=&hiCnt="
NSNdriver.get(NSNpreviousAwardRef)
previousAwardSoup = BeautifulSoup(NSNdriver.page_source,"html5lib");
# parsing of table
try:
    totalPrevAward = previousAwardSoup.find("span", {"id": "ctl00_cph1_lblRecCount"}).strong.text
    awardpagetotala = float(totalPrevAward) / (50)
    awardpagetotal = math.ceil(awardpagetotala)+1
    print(date)
    print("total previous awards: "+ str(totalPrevAward))
    print("page total : "+ str(awardpagetotal))
except Exception as e:
    print(e)
    continue
all I get is this error
'NoneType' object has no attribute 'strong'
I tried parsing the HTML with lxml and still get the same error. What am I doing wrong, and how can I fix it?
The code to access the strong tag, soup.find("span").strong, is perfectly right.
You can explicitly try it by putting that html line in a variable, and creating your BeautifulSoup object from that variable.
Now, the error clearly tells you that the span tag you're looking for does not exist.
So here are some potential sources of the problem, off the top of my head:
Are you sure of the html input you feed into BeautifulSoup to create previousAwardSoup?
Are you sure that the id attribute is correct? More specifically, is it always the same and not randomized?
Print your previousAwardSoup and check if it has the span tag that you're searching for.
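To try this in isolation, the HTML line from the question can be parsed directly. This sketch uses html.parser rather than html5lib to avoid an extra dependency; page_count is a hypothetical helper, and it leaves out the extra +1 that the question's code adds.

```python
import math

from bs4 import BeautifulSoup

# The HTML line quoted in the question.
html = '<span id="ctl00_cph1_lblRecCount">Records Found: <strong> 373</strong></span>'


def page_count(page_source, per_page=50):
    """Return the number of result pages, or None if the counter span is missing."""
    soup = BeautifulSoup(page_source, "html.parser")
    span = soup.find("span", {"id": "ctl00_cph1_lblRecCount"})
    # find() returns None when nothing matches, so check before using .strong.
    if span is None or span.strong is None:
        return None
    total = int(span.strong.text.strip())
    return math.ceil(total / per_page)


print(page_count(html))
print(page_count("<html><body>blocked</body></html>"))
```

If the second call's result (None) is what you see against the live page, the span is not in the page source that the driver returned at that moment.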

HTML class visible when using inspect element, but can't be found with BS4 in Python

I'm trying to search for a particular class on a webpage; when I use inspect element, I can clearly see that the class exists. But when I use BeautifulSoup to find the class, eg
import bs4
import requests
url = r"https://twitter.com/TheSun/status/998755828931932160"
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, "html.parser")
body = soup.find(class_ = "permalink-inner permalink-tweet-container")
Yet body is a NoneType object, which indicates that BS4 was unable to find a class by the name permalink-inner permalink-tweet-container. Anyone know why this is? If you go to the URL I provided, you can see that it is of a Tweet, and the class I am trying to access represents the "body" of the Tweet. This code works for some tweets, but seemingly randomly gives me None for body.

TypeError: 'NoneType' object is not iterable using BeautifulSoup

I am pretty new to Python and this could be a very simple type of error, but I can't work out what's wrong. I am trying to get the links from a website that contain a specific substring, but I get "TypeError: 'NoneType' object is not iterable" when I do it. I believe the problem is related to the links I get from the website. Does anybody know what the problem is here?
from bs4 import BeautifulSoup
from urllib.request import urlopen
html_page = urlopen("http://www.scoresway.com/?sport=soccer&page=competition&id=87&view=matches")
soup = BeautifulSoup(html_page, 'html.parser')
lista=[]
for link in soup.find_all('a'):
    lista.append(link.get('href'))
for text in lista:
    if "competition" in text:
        print (text)
You're getting a TypeError exception because some 'a' tags don't have an 'href' attribute, so get('href') returns None, which is not iterable.
You can fix this if you replace this :
soup.find_all('a')
with this :
soup.find_all('a', href=True)
to ensure that all your links have a 'href' attribute
In the line lista.append(link.get('href')), the expression link.get('href') can return None. You then evaluate "competition" in text, where text may be None, and None does not support the in operator. To avoid this, use link.get('href', '') so that get() falls back to an empty string, which can be searched safely.
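The failure and the fix can be reproduced without any network access. The list below is made up to mimic what link.get('href') might return; the second URL is hypothetical.

```python
# Made-up hrefs, including None for anchors without an href attribute.
hrefs = [
    None,
    "http://facebook.com/scoresway",
    "/?sport=soccer&page=competition&id=87",
    None,
]

# Testing membership against None raises TypeError.
try:
    "competition" in hrefs[0]
    error = None
except TypeError as exc:
    error = exc

# Skipping falsy entries (None or '') avoids the error.
matches = [h for h in hrefs if h and "competition" in h]

print(type(error).__name__)
print(matches)
```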
I found mistakes in two places.
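First of all, on Python 2 the urllib module doesn't have a request submodule, so the import has to change there. On Python 3 the original import is correct and should be left alone.
from urllib.request import urlopen
# should be, on Python 2 only
from urllib import urlopen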
The second one is that, when you fetch the links from the page, BeautifulSoup returns None for a few of them.
print(lista)
# prints [None, u'http://facebook.com/scoresway', u'http://twitter.com/scoresway', ...., None]
As you can see, your list contains two None values, and that's why you get the TypeError when you iterate over it and test each entry.
How to fix it?
You should keep None out of the list.
for link in soup.find_all('a'):
    href = link.get('href')
    if href is not None: # Add this check
        lista.append(href)
