Beautiful Soup - Python

I was hoping to ask a pretty simple question. I have come across the below code and have not been able to find a decent explanation as to:
What exactly does the .attrs function do in this case?
What is the function of the ['href'] part at the end i.e. what exactly does that part of the code execute?
Here is the code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("url")
bsObj = BeautifulSoup(html)
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])

Let's go line by line after the imports:
Load a URL into a variable called html.
Create a BeautifulSoup object from html.
For every link in the object's "a" tags (it loops over the HTML tags in the document, finds all <a> tags, and loops over them).
If the tag's attributes include 'href' (in <a href="">, href is an attribute, and that's stored in the link's .attrs property).
Print to stdout the attribute with key 'href' (.attrs is a dictionary, e.g. {'href': 'http://something'}).
Note that the indentation in the question as posted was slightly wrong: the print line should be indented one level deeper than the if.

Let's try to fetch this question itself and see:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://stackoverflow.com/q/39308028/1005215")
bsObj = BeautifulSoup(html)
i) what exactly does the .attrs function do in this code
In [6]: bsObj.findAll("a")[30]
Out[6]: <a class="question-hyperlink" href="/questions/39308028/beautifuelsoup-python">Beautifuelsoup - Python</a>
In [7]: bsObj.findAll("a")[30].attrs
Out[7]:
{'class': ['question-hyperlink'],
'href': '/questions/39308028/beautifuelsoup-python'}
In [8]: type(bsObj.findAll("a")[30])
Out[8]: bs4.element.Tag
If you read the documentation, you will notice that a tag may have any number of attributes. In element number 30, the tag has the attributes 'class' and 'href'.
ii) what is the function of the ['href'] part at the end
In [9]: bsObj.findAll("a")[30]['href']
Out[9]: '/questions/39308028/beautifuelsoup-python'
If you look at the above output, you will see that the tag had an attribute 'href' and the above code fetched us the value for that attribute.
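Both accessors can be tried without any network access on a made-up snippet (the HTML below is illustrative, not taken from the actual page):

```python
from bs4 import BeautifulSoup

# illustrative stand-in for one of the <a> tags on the page
html = '<a class="question-hyperlink" href="/questions/39308028">Beautiful Soup - Python</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("a")

# .attrs is a plain dict of the tag's attributes
print(tag.attrs)
# {'class': ['question-hyperlink'], 'href': '/questions/39308028'}

# indexing the tag directly is shorthand for indexing .attrs
print(tag['href'])                       # /questions/39308028
print(tag['href'] == tag.attrs['href'])  # True
```

As a side note, tag.get('href') returns None instead of raising KeyError when the attribute is missing, which is the same situation the question's if 'href' in link.attrs guard protects against.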

Related

How to get the desired value in BeautifulSoup?

Suppose we have the html code as follows:
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, 'lxml')
I want to get the name xyz. Then, I write
soup.find('div',{'class':'name'})
However, it returns abc.
How to solve this problem?
Beautiful Soup returns the first element that matches, and the first div has both the classes dt and name, so find('div', {'class': 'name'}) matches that div and returns abc.
So searching by div helps, but it still leaves two divs. find_all (and calling soup('div')) returns a list, so to get the second div use print(soup('div')[1].text). If you want to print all the divs, use this code:
for i in range(len(soup('div'))):
    print(soup('div')[i].text)
And as pointed out in Ankur Sinha's answer, if you want to select all the divs that have only class name, then you have to use select, like this:
soup.select('div[class=name]')[0].get_text()
But if there are multiple divs that satisfy this property, use this:
for i in range(len(soup.select('div[class=name]'))):
    print(soup.select('div[class=name]')[i].get_text())
Just to continue Ankur Sinha's answer: when you use select, or even just soup(), the result is a list, because there can be multiple matches. That's why I used len() to find the length of the list, then ran a for loop over it, printing the result of select at each index starting from 0.
Indexing into that list gives you a specific div rather than the whole list; calling get_text() on the list itself would produce an error, because a list is not a tag.
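Since select() already hands back a list, iterating over it directly is a bit cleaner than indexing with range(len(...)); a minimal sketch on the same made-up HTML, with a third div added to show multiple matches:

```python
from bs4 import BeautifulSoup

html = '<div class="dt name">abc</div><div class="name">xyz</div><div class="name">pqr</div>'
soup = BeautifulSoup(html, 'html.parser')

# select() returns a list of Tag objects, so iterate over it directly;
# [class=name] matches only an exact class attribute value of "name"
for div in soup.select('div[class=name]'):
    print(div.get_text())
# xyz
# pqr
```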
This blog was helpful in doing what you would like, and that is to explicitly find a tag with specific class attribute:
from bs4 import BeautifulSoup
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, 'html.parser')
soup.find(lambda tag: tag.name == 'div' and tag.get('class') == ['name'])
Output:
<div class="name">xyz</div>
You can do it without lambda also using select to find exact class name like this:
soup.select("div[class = name]")
Will give:
[<div class="name">xyz</div>]
And if you want the value between tags:
soup.select("div[class=name]")[0].get_text()
Will give:
xyz
In case you have multiple div with class = 'name', then you can do:
for i in range(len(soup.select("div[class=name]"))):
    print(soup.select("div[class=name]")[i].get_text())
Reference:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
This might work for you, note that it is contingent on the div being the second div item in the html.
from bs4 import BeautifulSoup
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, features='lxml')
print(soup('div')[1].text)

Get URL from BeautifulSoup object

Somebody is handing my function a BeautifulSoup object (BS4) that he has gotten using the typical call:
soup = BeautifulSoup(url)
my code:
def doSomethingUseful(soup):
    url = soup.???
How do I get the original URL from the soup object? I tried reading the docs AND the BeautifulSoup source code... I'm still not sure.
If the url variable is a string of an actual URL, then you should just forget the BeautifulSoup here and use the same variable url. You should be using BeautifulSoup to parse HTML code, not a simple URL. In fact, if you try to use it like this, you get a warning:
>>> from bs4 import BeautifulSoup
>>> url = "https://foo"
>>> soup = BeautifulSoup(url)
C:\Python27\lib\site-packages\bs4\__init__.py:336: UserWarning: "https://foo" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
' that document to Beautiful Soup.' % decoded_markup
Since the URL is just a string, BeautifulSoup doesn't really know what to do with it when you "soupify" it, except for wrapping it up in basic HTML:
>>> soup
<html><body><p>https://foo</p></body></html>
If you still wanted to extract the URL from this, you could just use .text on the object, since it's the only thing in there:
>>> print(soup.text)
https://foo
If, on the other hand, url is not really a URL at all but rather a bunch of HTML code (in which case the variable name would be very misleading), then how you'd extract a specific link depends on how that HTML is structured. Doing a find to get the first a tag and then extracting the href value would be one way.
>>> actual_html = '<html><body><a href="http://moo">My link text</a></body></html>'
>>> newsoup = BeautifulSoup(actual_html)
>>> newsoup.find('a')['href']
'http://moo'
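For the question as asked: a BeautifulSoup object holds only the parsed document, not where it came from, so there is nothing on the soup to read the URL back from. The usual pattern is to carry the URL alongside the soup; a sketch under that assumption (the function names here are illustrative, not part of bs4):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def make_soup(url, html):
    # keep the source URL paired with the parsed document
    return url, BeautifulSoup(html, 'html.parser')

def do_something_useful(url, soup):
    # e.g. resolve a relative link against the page's own URL
    return urljoin(url, soup.a['href'])

url, soup = make_soup('http://example.com/page', '<a href="/about">About</a>')
print(do_something_useful(url, soup))  # http://example.com/about
```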

Removing span tags in python

I'm a newbie having trouble removing span tags after using BeautifulSoup to grab the HTML from a page. Tried using del links['span'] but it returned the same results. A few attempts at using getText() failed as well. Clearly I'm doing something wrong that should be very easy. Help?
from bs4 import BeautifulSoup
import urllib.request
import re
url = urllib.request.urlopen("http://www.python.org")
content = url.read()
soup = BeautifulSoup(content)
for links in soup.find_all("span", text=re.compile(".com")):
    del links['class']
    print(links)
Use the .unwrap() method to remove tags, preserving their contents:
for links in soup.find_all("span", text=re.compile(".com")):
    links.unwrap()
print(soup)
Depending what you are trying to do, you could either use unwrap to remove the tag (in fact, replacing the element by its content) or decompose to remove the element and its content.
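A self-contained sketch of that difference, on made-up markup:

```python
from bs4 import BeautifulSoup

html = '<p>Visit <span class="url">python.org</span> today</p>'

# unwrap(): remove the tag itself but keep its contents in place
soup = BeautifulSoup(html, 'html.parser')
soup.span.unwrap()
print(soup)   # <p>Visit python.org today</p>

# decompose(): remove the tag and everything inside it
soup = BeautifulSoup(html, 'html.parser')
soup.span.decompose()
print(soup)   # <p>Visit  today</p>
```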

Beautiful Soup - Cannot find the tags

The page is: http://item.taobao.com/item.htm?id=13015989524
you can see its source code.
In its source code the following code exists
<a href="http://item.taobao.com/item.htm?id=13015989524" target="_blank">
But when I use BeautifulSoup to read the source code and execute the following
soup.findAll('a', href="http://item.taobao.com/item.htm?id=13015989524")
It returns [], an empty list. Why does it return []?
As far as I can see, the <a> tag you are trying to find is inside a <textarea> tag. BS does not parse the contents of <textarea> as HTML, and rightly so since <textarea> should not contain HTML. In short, that page is doing something sketchy.
If you really need to get that, you might "cheat" and parse the contents of <textarea> again and search within them:
import urllib
from BeautifulSoup import BeautifulSoup as BS
soup = BS(urllib.urlopen("http://item.taobao.com/item.htm?id=13015989524"))
a = []
for textarea in soup.findAll("textarea"):
    textsoup = BS(textarea.text)  # parse the contents as html
    a.extend(textsoup.findAll("a", attrs={"href": "http://item.taobao.com/item.htm?id=13015989524"}))
for tag in a:
    print tag
# outputs
# <a href="http://item.taobao.com/item.htm?id=13015989524" target="_blank"><img ...
# <a href="http://item.taobao.com/item.htm?id=13015989524" title="901 ...
Use a dictionary to store the attribute:
soup.findAll('a', {
    'href': "http://item.taobao.com/item.htm?id=13015989524"
})
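For reference, the same re-parsing trick works in current bs4 syntax. A self-contained stand-in is used here, since the Taobao page may have changed; the entity-escaped markup inside the <textarea> is invented for illustration:

```python
from bs4 import BeautifulSoup

# stand-in for a page that hides markup inside a <textarea>
html = '''<html><body>
<textarea>&lt;a href="http://example.com/item"&gt;item&lt;/a&gt;</textarea>
</body></html>'''

soup = BeautifulSoup(html, 'html.parser')
found = []
for textarea in soup.find_all('textarea'):
    # the textarea's text is itself HTML, so parse it a second time
    inner = BeautifulSoup(textarea.text, 'html.parser')
    found.extend(inner.find_all('a', href='http://example.com/item'))

print(found[0])  # <a href="http://example.com/item">item</a>
```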

BeautifulSoup: AttributeError: 'NavigableString' object has no attribute 'name'

Do you know why the first example in the BeautifulSoup tutorial http://www.crummy.com/software/BeautifulSoup/documentation.html#QuickStart gives AttributeError: 'NavigableString' object has no attribute 'name'? According to this answer, the space characters in the HTML cause the problem. I tried with the sources of a few pages; one worked, the others gave the same error (I removed the spaces). Can you explain what "name" refers to and why this error happens? Thanks.
Just ignore NavigableString objects while iterating through the tree:
import requests
from bs4 import BeautifulSoup, NavigableString, Tag

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for body_child in soup.body.children:
    if isinstance(body_child, NavigableString):
        continue
    if isinstance(body_child, Tag):
        print(body_child.name)
name refers to the name of the tag, if the object is a Tag object (e.g. for <html>, name is "html").
If you have spaces in your markup between nodes, BeautifulSoup will turn those into NavigableStrings. So if you use the index of .contents to grab nodes, you might grab a NavigableString instead of the next Tag.
To avoid this, query for the node you are looking for: Searching the Parse Tree.
Or, if you know the name of the next tag you want, you can use that name as a property and it will return the first Tag with that name, or None if no child with that name exists: Using Tag Names as Members.
If you want to use .contents, you have to check the type of the objects you are working with. The error you are getting just means the code tried to access the name property on an object it assumed was a Tag.
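A small self-contained illustration of how that whitespace ends up in .contents (the markup is made up):

```python
from bs4 import BeautifulSoup, Tag

html = '<body>\n<p>one</p>\n<p>two</p>\n</body>'
soup = BeautifulSoup(html, 'html.parser')

# the newlines between tags become NavigableString children
print([type(c).__name__ for c in soup.body.contents])
# ['NavigableString', 'Tag', 'NavigableString', 'Tag', 'NavigableString']

# so filter by type before touching .name
for child in soup.body.contents:
    if isinstance(child, Tag):
        print(child.name)  # prints "p" twice
```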
You can use try/except to skip the iterations where a NavigableString is being handled in the loop. Note that NavigableString is not an exception class, so it cannot appear in an except clause as the original answer suggested; the error actually raised is AttributeError:
for j in soup.find_all(...):
    try:
        print(j.find(...))
    except AttributeError:
        pass
This is working code to obtain the names of the tags in the soup:
import requests
from bs4 import BeautifulSoup, Tag

res = requests.get(url).content
soup = BeautifulSoup(res, 'lxml')
for child in soup.body.children:
    if isinstance(child, Tag):
        print(child.name)
