Do you know why the first example in the BeautifulSoup tutorial (http://www.crummy.com/software/BeautifulSoup/documentation.html#QuickStart) gives AttributeError: 'NavigableString' object has no attribute 'name'? According to this answer, the space characters in the HTML cause the problem. I tried the sources of a few pages: one worked and the others gave the same error, even after I removed the spaces. Can you explain what "name" refers to and why this error happens? Thanks.
Just ignore NavigableString objects while iterating through the tree:
import requests
from bs4 import BeautifulSoup, NavigableString, Tag

# url is the address of the page to parse
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for body_child in soup.body.children:
    # Text nodes (including whitespace between tags) have no tag name
    if isinstance(body_child, NavigableString):
        continue
    if isinstance(body_child, Tag):
        print(body_child.name)
name refers to the name of the tag if the object is a Tag (e.g. for <html>, name is "html").
If your markup has whitespace between nodes, BeautifulSoup turns it into NavigableString objects. So if you use an index into contents to grab nodes, you might grab a NavigableString instead of the next Tag.
To avoid this, query for the node you are looking for: Searching the Parse Tree
Or, if you know the name of the next tag you want, you can use that name as a property; it returns the first Tag with that name, or None if no child with that name exists: Using Tag Names as Members
If you want to use contents, you have to check the type of each object you are working with. The error you are getting just means the code accesses the name property on a NavigableString because it assumes every node is a Tag.
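A minimal sketch of the point above, using a made-up HTML snippet (the markup and variable names are illustrative assumptions, not taken from the tutorial). It shows the whitespace between tags appearing in contents as NavigableString objects, while find() and tag-name access only ever return Tag objects or None:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup: note the newlines between the tags.
html = "<html><body>\n<p>first</p>\n<div>second</div>\n</body></html>"
soup = BeautifulSoup(html, "html.parser")

# .contents mixes whitespace strings and tags.
kinds = [type(node).__name__ for node in soup.body.contents]

# Keep only real tags before touching .name.
tag_names = [node.name for node in soup.body.contents if isinstance(node, Tag)]

# Searching by name skips the whitespace nodes entirely.
first_div = soup.body.find("div")
first_p = soup.body.p        # tag name used as a member
missing = soup.body.table    # no such child, so this is None

print(kinds)
print(tag_names)
```

Indexing into contents (for example soup.body.contents[0]) would hand back the leading newline here, which is the situation that produces the AttributeError in the question.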
You can use try/except to skip the cases where a NavigableString turns up in the loop. NavigableString is not an exception class, so catch the AttributeError it triggers instead:
for j in soup.find_all(...):
    try:
        print(j.find(...))
    except AttributeError:
        pass
This is the latest working code to obtain the names of the tags in the soup.
import requests
from bs4 import BeautifulSoup, Tag

res = requests.get(url).content
soup = BeautifulSoup(res, 'lxml')
for child in soup.body.children:
    # Only Tag objects have a meaningful name
    if isinstance(child, Tag):
        print(child.name)
I have a class in my html code. I need to locate td class "Currentlocation" using python.
CODE :
<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>
Below are the codes I tried.
First attempt:
My_result = page_soup.find_element_by_class_name('CURRENTLOCATION')
Getting "TypeError: 'NoneType' object is not callable" error.
Second attempt:
My_result = page_soup.find(‘td’, attrs={‘class’: ‘CURRENTLOCATION’})
Getting "invalid character in identifier" error.
Can anyone please help me locate a class in html code using python?
from bs4 import BeautifulSoup

# Raw string so the backslashes in the src path are not read as escape sequences
sdata = r'<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>'
soup = BeautifulSoup(sdata, 'lxml')
mytds = soup.find_all("td", {"class": "CURRENTLOCATION"})
for td in mytds:
    print(td)
I tried your code, the second example, and the problem is the quotation marks you use. They are typographic (curly) quotes (‘ and ’, Unicode code points U+2018 and U+2019), while the Python interpreter requires straight single (') or double (") quotation marks.
Changing them I can find the tag:
>>> bs.find('td', attrs={'class': 'CURRENTLOCATION'})
<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>
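As a rough illustration of the quoting problem (the helper name straighten_quotes and the mapping table are assumptions made for this sketch, not part of BeautifulSoup), pasted code can be cleaned by translating curly quotes to straight ones before running it:

```python
# Map the four common curly quote characters to their straight equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(source: str) -> str:
    """Return source with curly quotes replaced by straight quotes."""
    return source.translate(CURLY_TO_STRAIGHT)


# The second attempt from the question, with its curly quotes written as escapes.
pasted = (
    "page_soup.find(\u2018td\u2019, "
    "attrs={\u2018class\u2019: \u2018CURRENTLOCATION\u2019})"
)
fixed = straighten_quotes(pasted)
print(fixed)
```

The cleaned line compiles as ordinary Python, whereas the pasted version is rejected by the parser before any BeautifulSoup code runs.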
About your first example: I do not know where you found a reference to the method find_element_by_class_name, but it is not implemented by the BeautifulSoup class. Instead, the class implements __getattr__, a special method that is invoked whenever you try to access a non-existent attribute. Here is an excerpt of the method:
def __getattr__(self, tag):
    #print "Getattr %s.%s" % (self.__class__, tag)
    if len(tag) > 3 and tag.endswith('Tag'):
        #
    # We special case contents to avoid recursion.
    elif not tag.startswith("__") and not tag == "contents":
        return self.find(tag)
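When you try to access the attribute find_element_by_class_name, you are actually looking for a tag with that name. No such tag exists, so find() returns None, and calling None as if it were a method raises "TypeError: 'NoneType' object is not callable".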
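The behaviour described above can be reproduced with a small, made-up document (the HTML string and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

html = '<td class="CURRENTLOCATION">Metrics</td>'
soup = BeautifulSoup(html, "html.parser")

# An unknown attribute name is treated as a tag name, so this is
# equivalent to soup.find("find_element_by_class_name").
lookup = soup.find_element_by_class_name

try:
    lookup("CURRENTLOCATION")   # calling None
    error = None
except TypeError as exc:
    error = str(exc)

print(lookup)
print(error)
```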
There is a function in BeautifulSoup for this.
You can pass the tag name and the attributes you are looking for to the find_all function. It returns a list of all the elements that fulfill the criteria.
from bs4 import BeautifulSoup

# Raw string so the backslashes in the src path are kept literally
text = r'<td class="CURRENTLOCATION"><img align="MIDDLE" src="..\Images\FolderOpen.bmp"/> Metrics</td>'
soup = BeautifulSoup(text, 'lxml')
# Find all the td tags whose class attribute is set to CURRENTLOCATION
output_list = soup.find_all('td', {"class": "CURRENTLOCATION"})
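A variation on the same lookup, sketched with the class_ keyword argument of find_all and a slightly extended, made-up table row (the second cell and the variable names are assumptions). It also shows that class matching is case-sensitive, which matters because the question writes the class as "Currentlocation" while the markup uses "CURRENTLOCATION":

```python
from bs4 import BeautifulSoup

# Hypothetical row with one matching and one non-matching cell.
html = (
    r'<table><tr>'
    r'<td class="CURRENTLOCATION"><img src="..\Images\FolderOpen.bmp"/> Metrics</td>'
    r'<td class="OTHER">Reports</td>'
    r'</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# class is a reserved word in Python, hence the trailing underscore.
cells = soup.find_all("td", class_="CURRENTLOCATION")
labels = [cell.get_text(strip=True) for cell in cells]

# Different capitalisation does not match.
wrong_case = soup.find_all("td", class_="Currentlocation")

print(labels)
print(wrong_case)
```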
For some reason I keep getting the following error when I run the fnmatch function.
Error: 'NoneType' object has no attribute 'replace'
It works when I try it with a single link, but doesn't work when I loop through an array and try to match every link in the array.
from fnmatch import fnmatch

import requests
from bs4 import BeautifulSoup

xml = requests.get("https://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html").text
soup = BeautifulSoup(xml, 'html.parser')
all_links = [link.get('href') for link in soup.find_all('a')]
matched_links = [fnmatch(link, pattern) for link in all_links]
Not all <a> tags have href attributes. Your all_links probably contains some None values, and fnmatch() expects strings, so it cannot do its shell-style pattern matching on them. This happens because some of those tags are probably used as named anchors (<a name="whatever">) instead of links.
You could add a guard condition to your list comprehension to make sure these get filtered out.
all_links = [link.get('href') for link in soup.find_all('a') if link.get('href')]
Alternatively, you could do the conditional check in your second comprehension.
matched_links = [fnmatch(link, pattern) for link in all_links if link]
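Another way to apply the same guard, sketched on a small made-up page: passing href=True to find_all keeps only the tags that actually have the attribute. The HTML and the pattern "*/epolls/*" are assumptions for illustration, since the question does not show its pattern.

```python
from fnmatch import fnmatch

from bs4 import BeautifulSoup

# Hypothetical page: one named anchor without href, two real links.
html = (
    '<a name="top"></a>'
    '<a href="/epolls/2010/governor/ca.html">CA</a>'
    '<a href="/about.html">About</a>'
)
soup = BeautifulSoup(html, "html.parser")

pattern = "*/epolls/*"  # assumed pattern, not from the question

# href=True selects only <a> tags that carry an href attribute.
all_links = [a["href"] for a in soup.find_all("a", href=True)]

# Keep the links themselves rather than a list of True/False flags.
matched_links = [link for link in all_links if fnmatch(link, pattern)]

print(all_links)
print(matched_links)
```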
I was hoping to ask a pretty simple question. I have come across the below code and have not been able to find a decent explanation as to:
What exactly does the .attrs function do in this case?
What is the function of the ['href'] part at the end i.e. what exactly does that part of the code execute?
Here is the code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("url")
bsObj = BeautifulSoup(html)
for link in bsObj.findAll("a"):
if 'href' in link.attrs:
print (link.attrs['href'])
Let's go line by line after the imports:
Load a URL into a variable called html
Create a BeautifulSoup object from html
For every link among the object's "a" tags (it goes over the tags in the HTML, finds every <a>, and loops over them)
If the tag's attributes include 'href' (in <a href="">, href is an attribute, and it is stored in the link's .attrs property)
Print to stdout the value of the attribute with key 'href' (.attrs is a dictionary such as {'href': 'http://something'})
The indentation is a bit wrong there: print should be indented more than if.
Let's try to fetch this question itself and see:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://stackoverflow.com/q/39308028/1005215")
bsObj = BeautifulSoup(html)
i) what exactly does the .attrs function do in this code
In [6]: bsObj.findAll("a")[30]
Out[6]: <a class="question-hyperlink" href="/questions/39308028/beautifuelsoup-python">Beautifuelsoup - Python</a>
In [7]: bsObj.findAll("a")[30].attrs
Out[7]:
{'class': ['question-hyperlink'],
'href': '/questions/39308028/beautifuelsoup-python'}
In [8]: type(bsObj.findAll("a")[30])
Out[8]: bs4.element.Tag
If you read the documentation, you will notice that a tag may have any number of attributes. In the element number 30, the tag has attributes 'class' and 'href'
ii) what is the function of the ['href'] part at the end
In [9]: bsObj.findAll("a")[30]['href']
Out[9]: '/questions/39308028/beautifuelsoup-python'
If you look at the above output, you will see that the tag had an attribute 'href' and the above code fetched us the value for that attribute.
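The same ideas in a self-contained sketch that needs no network access. The second <a> tag and the variable names are assumptions added for illustration; it contrasts tag['href'], which raises KeyError when the attribute is missing, with tag.get('href'), which returns None:

```python
from bs4 import BeautifulSoup

html = (
    '<a class="question-hyperlink" '
    'href="/questions/39308028/beautifuelsoup-python">Beautifuelsoup - Python</a>'
    '<a name="top">anchor without href</a>'
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

attrs = first.attrs            # dictionary of all attributes on the tag
href = first["href"]           # shorthand for first.attrs["href"]
missing = second.get("href")   # None, because the attribute is absent

try:
    second["href"]
    key_error = False
except KeyError:
    key_error = True

print(attrs)
print(href, missing, key_error)
```

This is why the original loop checks `'href' in link.attrs` before indexing.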
from bs4 import BeautifulSoup
import urllib.request
import win_unicode_console
win_unicode_console.enable()
link = ('https://pietroalbini.io/')
req = urllib.request.Request(link, headers={'User-Agent': 'Mozilla/5.0'})
url = urllib.request.urlopen(req).read()
soup = BeautifulSoup(url, "html.parser")
body = soup.find_all('div', {"class":"wrapper"})
print(body.text)
Hi, I have a problem with Beautiful Soup. If I run this code without ".text" at the end, it shows me a list of divs, but if I add ".text" at the end I get this error:
Traceback (most recent call last):
File "script.py", line 15, in <module>
print(body.text)
AttributeError: 'ResultSet' object has no attribute 'text'
find_all returns a ResultSet object, which you can iterate over using a for loop. What you can do is:
for wrapper in soup.find_all('div', {"class":"wrapper"}):
    print(wrapper.text)
If you type:
print(type(body))
you'll see body is <class 'bs4.element.ResultSet'>, which means it holds all the elements that match the class. You can either iterate over them:
for div in body:
    print(div.text)
Or, if you know there is only one such div, you can use find instead:
div = soup.find('div', {"class":"wrapper"})
div.text
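A small offline sketch of the difference between the two calls, using made-up markup (the HTML and variable names are assumptions):

```python
from bs4 import BeautifulSoup

html = '<div class="wrapper">one</div><div class="wrapper">two</div>'
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like ResultSet, even when only one element matches.
wrappers = soup.find_all("div", {"class": "wrapper"})
has_text = hasattr(wrappers, "text")      # the collection itself has no .text
texts = [div.text for div in wrappers]    # each element does

# find returns the first matching Tag, or None when nothing matches.
first = soup.find("div", {"class": "wrapper"})
nothing = soup.find("div", {"class": "missing"})

print(type(wrappers).__name__, has_text)
print(texts, first.text, nothing)
```

Because find can return None, checking the result before reading .text avoids a different AttributeError on pages where the div is absent.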
I probably should have posted this as an answer, so here it is, almost verbatim from the comments.
Your code should be the following:
for div in body:
    print(div.text)
Use whatever naming scheme you prefer.
The find_all method returns a list-like collection (using the term "list" loosely) of the items that BeautifulSoup found matching your criteria after parsing the page's HTML, searching either recursively or non-recursively depending on how you call it.
As the error says, the resulting set of objects has no attribute text, since it isn't an element but rather a collection of them.
However, the items inside the resulting set (should any be found) do.
You can view the documentation here
I'm working on a project to parse an HTML page. It is for an internal company website, but I changed the example so you can try it.
I get the source code of an HTML page and search for a certain tag. Then I want to extract a substring of this tag, but it doesn't work: Python returns None. Below is my code, with Python's output in the comments:
#!/usr/bin/python
import urllib2
from bs4 import BeautifulSoup
response = urllib2.urlopen("http://www.resto.be/restaurant/liege/4000-liege/8219-le-bar-a-gouts/")
page_source = response.read()
soup = BeautifulSoup(page_source)
name = soup.find_all("meta", attrs={"itemprop":"name"})
print(name[0])
# <meta content="LE BAR A GOUTS" itemprop="name"/>
print(name[0].find("<meta"))
# None
You don't have a string, you have a Tag object. Printing the tag gives a nice HTML representation, but it is not a string object.
As such, you are calling the BeautifulSoup Tag.find() method, which returns None if there are no child tags with the tag name <meta, and indeed there are none here.
If you wanted to find the content attribute, use item access:
print(name[0]['content'])
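To make the distinction concrete, here is a sketch on a stand-alone snippet of the same meta tag (the surrounding <head> and the variable names are assumptions). Tag.find() searches for child tags, whereas the substring search the question intended only works after converting the tag to a string:

```python
from bs4 import BeautifulSoup

html = '<head><meta content="LE BAR A GOUTS" itemprop="name"/></head>'
soup = BeautifulSoup(html, "html.parser")
meta = soup.find("meta", attrs={"itemprop": "name"})

# Tag.find looks for a child tag named "<meta"; there is none.
child = meta.find("<meta")

# Item access reads an attribute of the tag.
content = meta["content"]

# str.find on the rendered markup is the substring search.
position = str(meta).find("<meta")

print(child, content, position)
```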