media:thumbnail w/ BeautifulSoup - python

What is the correct way to parse the url attribute of media:thumbnail tags using BeautifulSoup? I have tried the following:
doc = BeautifulSoup(urlopen('http://rss.cnn.com/rss/edition.rss'), 'xml')
items = doc.findAll('item')
for item in items:
    title = item.title.text
    link = item.link.text
    image = item.find('media:thumbnail')[0]['url']
However, I get the 'NoneType' object is not subscriptable error.

Don't include the namespace prefix:
>>> doc.find('thumbnail')
<media:thumbnail height="51" url="http://i2.cdn.turner.com/cnn/dam/assets/150116173806-amateur-video-amedy-coulibaly-top-tease.jpg" width="90"/>
The element.find() method returns a single element (or None if nothing matches), so there is no need to index into the result here; you can access the url attribute on the element directly:
>>> doc.find('thumbnail')['url']
u'http://i2.cdn.turner.com/cnn/dam/assets/150116173806-amateur-video-amedy-coulibaly-top-tease.jpg'
There currently isn't any support for searching by a specific namespace; the namespace URL is stored (in the .namespace attribute) but not used by .find() or .find_all().
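Since find() returns None when nothing matches, it can help to check the result before reading an attribute. A minimal sketch, assuming a simplified made-up feed snippet and the built-in html.parser rather than the XML parser from the question (so the tag is written without a namespace prefix; the names feed, thumb and missing are only for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for one <item> of an RSS feed
feed = '<item><thumbnail url="http://example.com/a.jpg"/></item>'
soup = BeautifulSoup(feed, "html.parser")

item = soup.find("item")

# find() returns a single Tag, so read the attribute directly (no [0])
thumb = item.find("thumbnail")
url = thumb["url"] if thumb is not None else None

# A tag that is not present gives None, which is what caused the
# "'NoneType' object is not subscriptable" error in the question
missing = item.find("enclosure")

print(url)
print(missing)
```

The explicit None check is a design choice; wrapping the lookup in try/except TypeError would also work, but hides which lookup failed.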

Related

String argument behaves differently within my script

I've tried to extract text from some HTML elements using the string argument, the way it is described in the documentation, but failed miserably. I tried two different ways, and each time I got the same AttributeError.
How can I use string argument in this very case to fetch the text?
I've tried with:
import re
from bs4 import BeautifulSoup
htmlelement = """
<caption>
<span class="toggle open"></span>
ASIC registration
</caption>
"""
soup = BeautifulSoup(htmlelement,"lxml")
try:
    item = soup.find("caption",string="ASIC registration").text
    #item = soup.find("caption",string=re.compile("ASIC registration",re.I)).text
except AttributeError:
    item = ""
print(item)
Expected output (only using string argument):
ASIC registration
How can I use string argument in this very case to fetch the text?
You can't.
Note:
I am assuming you mean changing only the string parameter in
item = soup.find("caption",string="ASIC registration").text
As stated in the documentation:
If a tag has only one child, and that child is a NavigableString, the child is made available as .string:
import re
from bs4 import BeautifulSoup
htmlelement = """
<caption>
<span class="toggle open"></span>
ASIC registration
</caption>
"""
soup = BeautifulSoup(htmlelement,"lxml")
item = soup.find("caption")
print(item.string)
Output
None
Here the .string is None as caption has more than one child.
If you are trying to get the parent of the text (the caption tag in this case), you could do
item = soup.find(string=re.compile('ASIC registration')).parent
which will give
<caption>
<span class="toggle open"></span>
ASIC registration
</caption>
Calling .text on this parent tag gives the full text within the tag, even when the pattern matches only part of it.
item = soup.find(string=re.compile('ASIC')).parent.text
will give the output (surrounded by the whitespace from the markup)
ASIC registration
The issue you're running into is that the string argument, used on its own, searches for strings instead of tags, as stated in the documentation you linked.
The syntax you are using:
soup.find("caption",string="ASIC registration")
is for finding tags.
For finding strings:
soup.find(string=re.compile('ASIC'))
With the first one you are asking for a caption tag whose .string matches your string. This caption tag contains a child tag as well as text, so its .string is None and nothing is returned.
The second one asks for a string that contains 'ASIC', so it returns the string.
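To make the difference concrete, here is a small sketch using the markup from the question. It uses the built-in html.parser instead of lxml (an assumption made only so the example has fewer dependencies), and the variable names are illustrative:

```python
import re
from bs4 import BeautifulSoup

html = """
<caption>
<span class="toggle open"></span>
ASIC registration
</caption>
"""
soup = BeautifulSoup(html, "html.parser")

# The caption has several children (whitespace, a span, text),
# so its .string is None
caption = soup.find("caption")
print(caption.string)

# Tag search filtered on .string: nothing matches because .string is None
tag_match = soup.find("caption", string="ASIC registration")
print(tag_match)

# String search: returns the text node itself; .parent is the caption tag
text_node = soup.find(string=re.compile("ASIC registration"))
parent = text_node.parent
print(parent.name)
print(parent.get_text(strip=True))
```

get_text(strip=True) is used here instead of .text only to drop the newlines that surround the text in the markup.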
Turns out the string parameter doesn't work if a tag has a child tag. The following code is clumsy, but it works:
real_item = ""
pattern = re.compile("ASIC registration", re.I)
for item in soup.find_all("caption"):
    # item.strings yields every text node inside the tag, including nested ones
    if any(pattern.search(s) for s in item.strings):
        real_item = item
        break
print(real_item)

Beautiful Soup 'ResultSet' object has no attribute 'text'

from bs4 import BeautifulSoup
import urllib.request
import win_unicode_console
win_unicode_console.enable()
link = ('https://pietroalbini.io/')
req = urllib.request.Request(link, headers={'User-Agent': 'Mozilla/5.0'})
url = urllib.request.urlopen(req).read()
soup = BeautifulSoup(url, "html.parser")
body = soup.find_all('div', {"class":"wrapper"})
print(body.text)
Hi, I have a problem with Beautiful Soup. If I run this code without ".text" at the end, it shows me a list of divs, but if I add ".text" at the end I get this error:
Traceback (most recent call last):
  File "script.py", line 15, in <module>
    print(body.text)
AttributeError: 'ResultSet' object has no attribute 'text'
find_all returns a ResultSet object, which you can iterate over using a for loop. What you can do is:
for wrapper in soup.find_all('div', {"class":"wrapper"}):
    print(wrapper.text)
If you type:
print(type(body))
you'll see that body is <class 'bs4.element.ResultSet'>, a collection of all the elements that match the class. You can either iterate over them:
for div in body:
    print(div.text)
Or, if you know there is only one matching div, you can use find instead:
div = soup.find('div', {"class":"wrapper"})
div.text
I probably should have posted this as an answer, so here it is, almost verbatim from the comments.
Your code should be the following:
for div in body:
    print(div.text)
    # Python 2:
    # print div.text
Or use whatever variable names you prefer.
The find_all method returns a list-like collection (loosely using the term list here) of the items that Beautiful Soup found matching your criteria after parsing the page's HTML, searching either recursively or non-recursively depending on how you call it.
As the error says, the resulting set of objects has no attribute text, since it isn't an element but rather a collection of them.
However, the items inside the resulting set (should any be found) do.
See the find_all() section of the Beautiful Soup documentation.
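The two approaches from the answers above can be compared in a short sketch. The HTML string below is made up for illustration (it is not the page from the question), and html.parser is used as in the question:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two matching divs
html = '<div class="wrapper">first</div><div class="wrapper">second</div>'
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet: read .text on each element, not on the set
divs = soup.find_all("div", {"class": "wrapper"})
texts = [div.text for div in divs]

# find() returns only the first match (or None), so .text works directly
first = soup.find("div", {"class": "wrapper"})
first_text = first.text if first is not None else ""

print(texts)
print(first_text)
```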

Python 3.4: href with XPATH

Using lxml and requests, I am passing an XPath expression to retrieve the href attributes of a tags. Every time I run the simple code below I get an AttributeError, as shown below.
import requests
from lxml import html
import csv
url = 'https://biz.yahoo.com/p/sum_conameu.html'
resp = requests.get(url)
tree = html.fromstring(resp.text)
update_tick = [td.text_content()
               for td in tree.xpath('''//tr[starts-with(normalize-space(.), "Industry")]
                                       /following-sibling::tr[position()>0]
                                       /td/a/@href''')]
print(update_tick)
AttributeError: 'str' object has no attribute 'text_content'
Passing an XPath attribute selector (.../@href) to the xpath() method makes it return the string values of the matched attributes, so there is no need to call text_content() in this case:
update_tick = [href
               for href in tree.xpath('''//tr[starts-with(normalize-space(.), "Industry")]
                                         /following-sibling::tr[position()>0]
                                         /td/a/@href''')]

Extract URL using beautifulsoup 4

I'm trying to extract a URL using BS4. I can get to the correct location, but I'm not sure how to remove the tags from the URL. I tried adding .text, but this just returned nothing.
vid_screenshot = (soup('a', {'class':'mp4Thumb'}))[0].contents[0]
>> <img src="www.fgfg.com/dsfasdf.jpg"/>
desired result
>> www.fgfg.com/dsfasdf.jpg
This did not work; it returned nothing:
(soup('a', {'class':'mp4Thumb'}))[0].contents[0].text
Would anyone know how to strip these tags?
You have the HTML tag; you need to take the src attribute:
vid_screenshot = soup('a', {'class':'mp4Thumb'})[0].contents[0]
vid_screenshot_src = vid_screenshot['src']
This assumes there is always going to be a src attribute on the tag. You can also use the .get() method to return None if the attribute is not present:
vid_screenshot = soup('a', {'class':'mp4Thumb'})[0].contents[0]
vid_screenshot_src = vid_screenshot.get('src')
or you can give .get() a second argument to return if the attribute is missing:
vid_screenshot = soup('a', {'class':'mp4Thumb'})[0].contents[0]
vid_screenshot_src = vid_screenshot.get('src', 'default value')
See the Attributes documentation section.
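As a quick check of the three access styles, here is a sketch built around the img tag from the question. The surrounding a tag is reconstructed from the selector in the question, and the alt attribute is deliberately absent so the fallback behaviour shows up:

```python
from bs4 import BeautifulSoup

# Markup reconstructed from the question's selector and output
html = '<a class="mp4Thumb" href="#"><img src="www.fgfg.com/dsfasdf.jpg"/></a>'
soup = BeautifulSoup(html, "html.parser")

img = soup.find("a", {"class": "mp4Thumb"}).find("img")

src = img["src"]                            # raises KeyError if src is missing
alt = img.get("alt")                        # None, because there is no alt attribute
alt_default = img.get("alt", "no alt text") # fallback value instead of None

print(src)
print(alt)
print(alt_default)
```

Looking up the img with find("img") rather than contents[0] avoids picking up a whitespace text node if the markup is ever reformatted.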

BeautifulSoup: AttributeError: 'NavigableString' object has no attribute 'name'

Do you know why the first example in the BeautifulSoup tutorial http://www.crummy.com/software/BeautifulSoup/documentation.html#QuickStart gives AttributeError: 'NavigableString' object has no attribute 'name'? According to this answer, the space characters in the HTML cause the problem. I tried with the sources of a few pages; one worked and the others gave the same error (I removed the spaces). Can you explain what "name" refers to and why this error happens? Thanks.
Just ignore NavigableString objects while iterating through the tree:
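import requests
from bs4 import BeautifulSoup, NavigableString, Tag
# url is the address of the page you want to parse
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for body_child in soup.body.children:
    # Whitespace between tags shows up as NavigableString objects; skip them
    if isinstance(body_child, NavigableString):
        continue
    if isinstance(body_child, Tag):
        print(body_child.name)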
name refers to the name of the tag if the object is a Tag object (i.e. for <html>, name = "html").
If you have spaces in your markup between nodes, BeautifulSoup will turn those into NavigableString objects. So if you use the index of the contents to grab nodes, you might grab a NavigableString instead of the next Tag.
To avoid this, query for the node you are looking for: Searching the Parse Tree
Or, if you know the name of the next tag you would like, you can use that name as a property and it will return the first Tag with that name, or None if no children with that name exist: Using Tag Names as Members
If you want to use the contents, you have to check the type of the objects you are working with. The error you are getting just means that the code assumes every object is a Tag and tries to access the name property on one that is actually a NavigableString.
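The whitespace effect described above is easy to see on a tiny document. A minimal sketch, assuming a made-up HTML string and html.parser; the variable names are illustrative:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup with newlines between the child tags
html = "<html><body>\n<p>one</p>\n<div>two</div>\n</body></html>"
soup = BeautifulSoup(html, "html.parser")

# The newlines between tags become NavigableString children of <body>
kinds = [type(child).__name__ for child in soup.body.children]

# Option 1: filter the children by type
tag_names = [child.name for child in soup.body.children if isinstance(child, Tag)]

# Option 2: let find_all() return only the direct child tags
direct_tags = [tag.name for tag in soup.body.find_all(True, recursive=False)]

print(kinds)
print(tag_names)
print(direct_tags)
```

Option 2 follows the "query for the node" advice: find_all() only returns Tag objects unless you search by string, so no type check is needed.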
You can use try/except to skip the cases where a NavigableString is encountered in the loop. Note that NavigableString is not an exception class, so catch the AttributeError instead, like this:
for j in soup.find_all(...):
    try:
        print(j.find(...))
    except AttributeError:
        pass
This is the latest working code to obtain the names of the tags in the soup.
import requests
from bs4 import BeautifulSoup, Tag
# url is the address of the page you want to parse
res = requests.get(url).content
soup = BeautifulSoup(res, 'lxml')
for child in soup.body.children:
    # Only Tag objects have a name; text nodes are skipped
    if isinstance(child, Tag):
        print(child.name)
