Beautifulsoup html parser attribute crawling question - python

from bs4 import BeautifulSoup as bs4
import json
import requests
url = 'https://www.11st.co.kr/products/4976666261?NaPm=ct=ld6p5dso|ci=e5e093b328f0ae7bb7c9b67d5fd75928ea152434|tr=slsbrc|sn=17703|hk=87f5ed3e082f9a3cd79cdd0650afa9612c37d9e8&utm_term=&utm_campaign=%B3%D7%C0%CC%B9%F6pc_%B0%A1%B0%DD%BA%F1%B1%B3%B1%E2%BA%BB&utm_source=%B3%D7%C0%CC%B9%F6_PC_PCS&utm_medium=%B0%A1%B0%DD%BA%F1%B1%B3'
html = '''
<div class="side_sm">선물</div>
'''
json_data=soup.find('div',class_='side_sm').find('a').attrs('data-log-body')
print(json_data)
TypeError: 'dict' object is not callable
I want to get the value of the data-log-body attribute (a dictionary) of the a tag inside the div tag with class side_sm.
I keep getting this error, so please give me some advice on how to deal with it.

attrs is a dict, not a method, so access a key with square brackets instead of parentheses:
.attrs['data-log-body']
From the BeautifulSoup docs: use tag.get('attr') if you're not sure attr is defined, just as you would with a Python dictionary.
So I would recommend:
.get('data-log-body')
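As a minimal sketch of the difference between the two forms: the markup below is hypothetical (the a tag and its data-log-body value are invented for illustration, since the real page content is not shown in the question), and it assumes the attribute holds a JSON string.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical markup: the <a> tag and its attribute value are made up.
html = '''
<div class="side_sm"><a href="#" data-log-body='{"content_no": "123"}'>선물</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('div', class_='side_sm').find('a')

by_index = link.attrs['data-log-body']  # raises KeyError if the attribute is missing
by_get = link.get('data-log-body')      # returns None if the attribute is missing
missing = link.get('data-missing')      # attribute not present, so this is None

# The attribute value is a plain string; parse it to get a dictionary.
log_body = json.loads(by_get)
print(log_body)
```

The attribute value comes back as a string either way, so json.loads() is only needed if you want an actual dict.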

Related

BeautifulSoup object returns as NoneType

The following code is supposed to download the logo image at pythonscraping.com but fails with this error:
AttributeError: 'NoneType' object has no attribute 'find'
The error seems to come from the BeautifulSoup object bs being NoneType.
The exact same code has worked for every other BeautifulSoup object so far. Where is the error in this case, please? Thanks.
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com')
bs = BeautifulSoup(html, 'html.parser')
imageLocation = bs.find('a', {'id': 'logo'}).find('img')['src']
urlretrieve(imageLocation, 'logo.jpg')
The error is in this line:
imageLocation = bs.find('a', {'id': 'logo'}).find('img')['src']
It is raised by the second find(). The first find() does not match any a tag with id 'logo', so it returns None, and you cannot call find() on a 'NoneType' object.
If this code worked in the past, the HTML of the page has probably changed, and the lookup needs to be updated to match the new markup.
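One way to make this failure explicit is to check each find() result before chaining the next call. This is only a sketch: the HTML string is a hypothetical stand-in for a page that no longer has an a tag with id 'logo', and find_logo_src is an illustrative helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical page: there is no <a id="logo"> element here.
html = '<html><body><a id="home"><img src="/img/home.png"></a></body></html>'
bs = BeautifulSoup(html, 'html.parser')

def find_logo_src(soup):
    """Return the logo image URL, or None if either tag is missing."""
    anchor = soup.find('a', {'id': 'logo'})
    if anchor is None:
        return None
    img = anchor.find('img')
    if img is None:
        return None
    return img.get('src')

image_location = find_logo_src(bs)
if image_location is None:
    print('No logo found; inspect the page source and update the lookup.')
```

Only call urlretrieve() once image_location is not None.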

how to get html text in <strong> tag using python

I have tried multiple methods to no avail.
I have this simple HTML, from which I want to extract the number 373 and then do some division.
<span id="ctl00_cph1_lblRecCount">Records Found: <strong> 373</strong></span>
I attempted to get the number with the Python script below.
import requests
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
import urllib3
import re
NSNpreviousAwardRef = "https://www.dibbs.bsm.dla.mil/Awards/AwdRecs.aspx?Category=nsn&TypeSrch=cq&Value="+NSN+"&Scope=all&Sort=nsn&EndDate=&StartDate=&lowCnt=&hiCnt="
NSNdriver.get(NSNpreviousAwardRef)
previousAwardSoup = BeautifulSoup(NSNdriver.page_source,"html5lib");
# parsing of table
try:
    totalPrevAward = previousAwardSoup.find("span", {"id": "ctl00_cph1_lblRecCount"}).strong.text
    awardpagetotala = float(totalPrevAward) / (50)
    awardpagetotal = math.ceil(awardpagetotala)+1
    print(date)
    print("total previous awards: "+ str(totalPrevAward))
    print("page total : "+ str(awardpagetotal))
except Exception as e:
    print(e)
    continue
All I get is this error:
'NoneType' object has no attribute 'strong'
I tried parsing the HTML with lxml and still got the same error. What am I doing wrong, and how can I fix it?
The code that accesses the strong tag, soup.find("span").strong, is correct.
You can check this yourself by putting that HTML line in a variable and creating your BeautifulSoup object from that variable.
The error tells you that find() returned None, which means the span tag you are looking for does not exist in the parsed document.
So here are some potential sources of the problem, off the top of my head:
Are you sure of the HTML input you feed into BeautifulSoup to create previousAwardSoup?
Are you sure that the id attribute is correct? More specifically, is it always the same and not randomized?
Print previousAwardSoup and check whether it contains the span tag that you're searching for.
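The check suggested above can be sketched like this, using the HTML line from the question as a literal string. record_count is a hypothetical helper name, and the page arithmetic simply repeats the formula from the question (50 records per page, plus one).

```python
import math
from bs4 import BeautifulSoup

# The HTML line from the question, fed to BeautifulSoup directly.
html = '<span id="ctl00_cph1_lblRecCount">Records Found: <strong> 373</strong></span>'
soup = BeautifulSoup(html, 'html.parser')

def record_count(soup):
    """Return the record count as an int, or None if the span is absent."""
    span = soup.find('span', {'id': 'ctl00_cph1_lblRecCount'})
    if span is None or span.strong is None:
        return None
    return int(span.strong.text.strip())

total = record_count(soup)
pages = math.ceil(total / 50) + 1  # same formula as in the question
print(total, pages)
```

If this works on the literal string but record_count() returns None on the live page source, the span is missing from what the driver actually loaded.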

Why getAttribute() not giving result in selenium?

I was trying to scrape the Yellow Pages Australia site. I searched for all the pizza restaurants in Australia. Now I want to fetch the email of every restaurant, which is the value of data-email (an attribute of an anchor tag). Below is my code. I used getAttribute() on the anchor tag, but it always gives me this error:
TypeError: 'NoneType' object is not callable
This is my code:
import csv
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
url = "https://www.yellowpages.com.au/search/listings?clue=Pizza+Restaurants&locationClue=Sydney+CBD%2C+NSW&lat=&lon="
driver=webdriver.Chrome(executable_path="/usr/local/share/chromedriver")
driver.get(url)
pageSource=driver.page_source
bsObj=BeautifulSoup(pageSource,'lxml')
items=bsObj.find('div',{'class':'flow-layout outside-gap-large inside-gap inside-gap-large vertical'}).findAll('div',class_='cell in-area-cell find-show-more-trial middle-cell')
for item in items:
    print(item.find('a',class_='contact contact-main contact-email ').getAttribute("data-email"))
Tag.getAttribute does not exist. You want either Tag[<attrname>] (if you are sure the item has this attribute) or Tag.get(<attrname>[, default=None]) if you're not.
Note that with most Python objects you would have got an AttributeError, but BeautifulSoup uses the __getattr__ hook a lot and returns None instead of raising an AttributeError when it cannot dynamically resolve an attribute, which is rather confusing. That None is what then gets called, hence "'NoneType' object is not callable".
That being said, item.find() can also return None, so you should test the result of item.find() before calling .get() on it, i.e.:
tag = item.find('a', ...)
if tag:
    email = tag.get("data-email")
    if email:
        print(email)
You can also try a None-catching wrapper such as this one:
https://github.com/n0str/beautifulsoup-none-catcher
With it, the code becomes:
from maybe import Maybe
bsObj=BeautifulSoup(pageSource,'lxml')
items=Maybe(bsObj).find('div',{'class':'flow-layout outside-gap-large inside-gap inside-gap-large vertical'}).find_all('div', {'class': 'cell in-area-cell find-show-more-trial middle-cell'})
print('\n'.join(filter(lambda x: x, [Maybe(item).find('a', {'class': 'contact-email'}).get("data-email").resolve() for item in items.resolve()])))
Output:
[..]@crust.com.au
[..]@madinitalia.com
<...>
[..]@ventuno.com.au
Just wrap the soup in Maybe(soup) and call .resolve() at the end.

Python 3.4: href with XPATH

Using lxml and requests, I am passing an XPath to retrieve the href attributes of a tags. Every time I run the simple code below I get an AttributeError, as shown below.
import requests
from lxml import html
import csv
url = 'https://biz.yahoo.com/p/sum_conameu.html'
resp = requests.get(url)
tree = html.fromstring(resp.text)
update_tick = [td.text_content()
               for td in tree.xpath('''//tr[starts-with(normalize-space(.), "Industry")]
                                    /following-sibling::tr[position()>0]
                                    /td/a/@href''')]
print(update_tick)
AttributeError: 'str' object has no attribute 'text_content'
Passing an XPath attribute selector (.../@href) to the xpath() method makes it return the string values of the matched attributes, so there is no need to call text_content() in this case:
update_tick = [td
               for td in tree.xpath('''//tr[starts-with(normalize-space(.), "Industry")]
                                    /following-sibling::tr[position()>0]
                                    /td/a/@href''')]
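To see the two return types side by side, here is a small sketch on a hypothetical table (the rows and links are invented stand-ins for the Yahoo page, which is not reproduced here). Selecting the attribute yields strings, while selecting the a elements yields elements that do support text_content().

```python
from lxml import html

# Hypothetical table standing in for the page in the question.
page = '''
<table>
  <tr><td>Industry</td></tr>
  <tr><td><a href="/p/1.html">Alpha</a></td></tr>
  <tr><td><a href="/p/2.html">Beta</a></td></tr>
</table>
'''
tree = html.fromstring(page)
rows = '//tr[starts-with(normalize-space(.), "Industry")]/following-sibling::tr'

# Attribute selector: xpath() returns the attribute values as strings.
hrefs = tree.xpath(rows + '/td/a/@href')

# Element selector: xpath() returns elements, so text_content() is available.
names = [a.text_content() for a in tree.xpath(rows + '/td/a')]

print(hrefs)
print(names)
```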

BeautifulSoup: AttributeError: 'NavigableString' object has no attribute 'name'

Do you know why the first example in the BeautifulSoup tutorial http://www.crummy.com/software/BeautifulSoup/documentation.html#QuickStart gives AttributeError: 'NavigableString' object has no attribute 'name'? According to this answer, the space characters in the HTML cause the problem. I tried the sources of a few pages; one worked and the others gave the same error (I removed the spaces). Can you explain what "name" refers to and why this error happens? Thanks.
Just ignore NavigableString objects while iterating through the tree:
import requests
from bs4 import BeautifulSoup, NavigableString, Tag
# url is the address of the page to parse (defined elsewhere)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for body_child in soup.body.children:
    # Text between tags (including whitespace) is a NavigableString; skip it
    if isinstance(body_child, NavigableString):
        continue
    if isinstance(body_child, Tag):
        print(body_child.name)
name refers to the name of the tag when the object is a Tag (e.g. for <html>, name is "html").
If your markup has whitespace between nodes, BeautifulSoup turns it into NavigableString objects. So if you index into contents to grab nodes, you may get a NavigableString instead of the next Tag.
To avoid this, query for the node you are looking for: Searching the Parse Tree
Or, if you know the name of the next tag you want, you can use that name as a property; it returns the first Tag with that name, or None if no child with that name exists: Using Tag Names as Members
If you want to use contents, you have to check the type of the objects you are working with. The error you are getting just means the code accessed the name property on an object it assumed was a Tag.
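As a minimal sketch of the points above, the snippet below parses a hypothetical four-line document and shows that the whitespace between tags appears among the children as NavigableString objects, and that querying for tags directly sidesteps them. find_all(True, recursive=False) is used here as one way to list only the direct child tags.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup with newlines between the tags.
html = '<body>\n<h1>Title</h1>\n<p>Text</p>\n</body>'
soup = BeautifulSoup(html, 'html.parser')

# The newlines show up as NavigableString children next to the Tag children.
child_types = [type(c).__name__ for c in soup.body.children]

# Filtering by type keeps only real tags.
tag_names = [c.name for c in soup.body.children if isinstance(c, Tag)]

# Querying for tags directly never returns the whitespace strings.
direct_tags = [t.name for t in soup.body.find_all(True, recursive=False)]

print(child_types)
print(tag_names, direct_tags)
```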
You can use try/except to skip the cases where a NavigableString turns up in the loop. NavigableString is not an exception class, so catch the AttributeError reported in the question, like this:
for j in soup.find_all(...):
    try:
        print(j.find(...))
    except AttributeError:
        pass
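This is the latest working code to obtain the names of the tags in soup.
import requests
from bs4 import BeautifulSoup, Tag
# url is the address of the page to parse (defined elsewhere)
res = requests.get(url).content
soup = BeautifulSoup(res, 'lxml')
for child in soup.body.children:
    # Only Tag objects have a tag name; text nodes are skipped
    if isinstance(child, Tag):
        print(child.name)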
