pubDate RSS parsing weirdness with Beautifulsoup/Python - python

I'm trying to parse an RSS/Podcast feed using Beautifulsoup and everything is working nicely except I can't seem to parse the 'pubDate' field.
import urllib2
from BeautifulSoup import BeautifulStoneSoup
data = urllib2.urlopen("http://www.democracynow.org/podcast.xml")
dom = BeautifulStoneSoup(data, fromEncoding='utf-8')
items = dom.findAll('item')
for item in items:
    title = item.find('title').string.strip()
    pubDate = item.find('pubDate').string.strip()
The title gets parsed fine but when it gets to pubDate, it says:
Traceback (most recent call last):
File "", line 2, in
AttributeError: 'NoneType' object has no attribute 'string'
However, when I download a copy of the XML file and rename 'pubDate' to something else, then parse it again, it seems to work. Is pubDate a reserved variable or something in Python?
Thanks,
g

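It works with item.find('pubdate').string.strip(). BeautifulStoneSoup lowercases tag names while parsing, so the element is stored as 'pubdate' and a search for 'pubDate' returns None. pubDate is not a reserved word in Python.
Alternatively, why not use feedparser?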
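To see the case issue in isolation, here is a minimal sketch. It assumes the newer bs4 package rather than the BeautifulSoup 3 API from the question, and the feed snippet is made up for illustration. The html.parser backend also lowercases tag names, so the lookup fails and succeeds in the same way:

```python
from bs4 import BeautifulSoup

# Hypothetical one-item feed, not the real podcast.xml
FEED = """<rss><channel><item>
<guid>episode-1</guid>
<pubDate>Mon, 01 Jan 2018 08:00:00 -0500</pubDate>
</item></channel></rss>"""

soup = BeautifulSoup(FEED, "html.parser")
item = soup.find("item")

# Tag names were lowercased during parsing, so the mixed-case name matches nothing
missing = item.find("pubDate")

# The lowercase name matches the stored tag
found = item.find("pubdate")
pub_date = found.string.strip()

print(missing)
print(pub_date)
```

If the original tag case has to be preserved, bs4's "xml" parser (which needs lxml installed) keeps names as written.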

Related

Why doesn't bs4 find the href attribute?

So I'm learning with Automate the Boring Stuff with Python (atbwp), and I'm now writing a program that opens the top 5 search results on a website.
It all works up until I have to get the href for each of the top results and open it. I get this error:
Traceback (most recent call last):
File "C:\Users\Asus\Desktop\pyhton\projects\emagSEARCH.py", line 33, in <module>
webbrowser.open(url)
File "C:\Users\Asus\AppData\Local\Programs\Python\Python38-32\lib\webbrowser.py", line 86, in open
if browser.open(url, new, autoraise):
File "C:\Users\Asus\AppData\Local\Programs\Python\Python38-32\lib\webbrowser.py", line 603, in open
os.startfile(url)
TypeError: startfile: filepath should be string, bytes or os.PathLike, not NoneType
This is how the html looks:
The No-Brainer Set The Ordinary, Deciem
And this is the part of my code that doesn't work:
Soup=bs4.BeautifulSoup(res.text,'html.parser')
results= Soup.select('.item-title')
numberTabs=min(5,len(results))
print('Opening top '+str(numberTabs)+' top results...')
for i in range(numberTabs):
    url=results[i].get('href')
    webbrowser.open(url)
It does what it should until the for loop. It looks pretty much exactly like the example program in the book, so I don't understand why it doesn't work. What am I doing wrong?
If you want to extract the href from the a tag, use this:
html = ' The No-Brainer Set The Ordinary, Deciem'
Soup=bs4.BeautifulSoup(html,'html.parser')
url = Soup.find('a')['href']
print(url)
webbrowser.open(url)
Output:
https://comenzi.farmaciatei.ro/ingrijire-personala/ingrijire-corp-si-fata/tratamente-/the-no-brainer-set-the-ordinary-deciem-p344003
You can do the same for every a tag to get all the hrefs.
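For the loop in the question, a rough sketch of the same idea follows. It assumes that '.item-title' matches a wrapper element with the a tag inside it, which would explain why .get('href') returns None. The markup and URLs below are hypothetical, since the real page source is not shown:

```python
import bs4

# Hypothetical markup: the class sits on a wrapper, the href on the nested <a>
HTML = """
<h2 class="item-title"><a href="https://example.com/p1">First</a></h2>
<h2 class="item-title"><a href="https://example.com/p2">Second</a></h2>
<h2 class="item-title">No link here</h2>
"""

soup = bs4.BeautifulSoup(HTML, "html.parser")

urls = []
for result in soup.select(".item-title")[:5]:
    # Use the element itself if it is already an <a>, otherwise look inside it
    link = result if result.name == "a" else result.find("a")
    href = link.get("href") if link is not None else None
    if href:  # skip results without a usable link instead of passing None on
        urls.append(href)

print(urls)
```

Each entry in urls can then be passed to webbrowser.open(). Checking for None first avoids the TypeError from the traceback.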

Python - Web Scraping exercise - Attribute Error

I am learning how to scrape web information. Below is a snippet of the actual code solution + output from datacamp.
On datacamp this works perfectly fine, but when I run it in Spyder on my own MacBook, it doesn't work.
This is because on datacamp the page has already been pre-loaded into a variable named 'response', whereas in Spyder I have to define it myself.
So I first defined the response variable as response = requests.get('https://www.datacamp.com/courses/all') so that the code points to datacamp's website.
My code looks like:
from scrapy.selector import Selector
import requests
response = requests.get('https://www.datacamp.com/courses/all')
this_url = response.url
this_title = response.xpath('/html/head/title/text()').extract_first()
print_url_title( this_url, this_title )
When I run this on Spyder, I got an error message
Traceback (most recent call last):
File "<ipython-input-30-6a8340fd3a71>", line 11, in <module>
this_title = response.xpath('/html/head/title/text()').extract_first()
AttributeError: 'Response' object has no attribute 'xpath'
Could someone please guide me? I would really like to know how to get this code working on Spyder.. thank you very much.
The value returned by requests.get('https://www.datacamp.com/courses/all') is a Response object, and this object has no attribute xpath, hence the error: AttributeError: 'Response' object has no attribute 'xpath'
I assume that in your tutorial, response was assigned a different object (most likely the one returned by etree.HTML), not the value returned by requests.get(url).
You can however do this:
import requests
from lxml import etree
response = requests.get('https://www.datacamp.com/courses/all')  # get the Response object
tree = etree.HTML(response.text)  # parse the page source into an element tree
result = tree.xpath('/html/head/title/text()')  # returns a list of matching text nodes
print(response.url)  # url
print(result)  # findings
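Note that xpath() on an lxml tree returns a list, while extract_first() in the tutorial returns a single value. A small sketch of bridging that gap, using a made-up static page so it runs without network access (the page content is an assumption, not datacamp's real HTML):

```python
from lxml import etree

# Hypothetical page source standing in for response.text
PAGE = "<html><head><title>Example Courses</title></head><body></body></html>"

tree = etree.HTML(PAGE)
titles = tree.xpath("/html/head/title/text()")

# Mimic extract_first(): take the first match, or None when nothing matched
this_title = titles[0] if titles else None

print(this_title)
```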

Beautifulsoup: ValueError: Tag.index: element not in tag

I am converting ePub to single HTML files, so I need to concatenate the individual chapters into one HTML file. They are named "..._split_000.html" etc., and I set up various structures to iterate over the ToC, generate directory names and so on.
I want to concat the HTML content of the individual parts with Beautifulsoup by appending the content of the body element of the following parts to the body of the first part. Only my code doesn't seem to work. "book" is an instance of the epub class of ebooklib. "docsfiles" is a dictionary with the names of the HTML files as a key and a list of files as one value among others:
def concat_articles(book, docsfiles, toc):
    articles = {}
    for doc, val in docsfiles.iteritems():
        firstsoup = False
        for f in val['files']:
            content = book.get_item_with_href(f).content
            soup = BeautifulSoup(content, "html.parser")
            if not firstsoup:
                firstsoup = soup
                continue
            body = copy.copy(soup.body)
            firstsoup.body.append(body)
        articles[val['id']] = firstsoup.prettify("utf-8")
    return articles
When I run this on my ePub, an error occurs:
Traceback (most recent call last):
File "extract-new.py", line 170, in <module>
articles_html = concat_articles(book, docsfiles, toc)
File "extract-new.py", line 97, in concat_articles
firstsoup.body.append(body)
File "/Library/Python/2.7/site-packages/bs4/element.py", line 338, in append
self.insert(len(self.contents), tag)
File "/Library/Python/2.7/site-packages/bs4/element.py", line 291, in insert
new_child.extract()
File "/Library/Python/2.7/site-packages/bs4/element.py", line 235, in extract
del self.parent.contents[self.parent.index(self)]
File "/Library/Python/2.7/site-packages/bs4/element.py", line 888, in index
raise ValueError("Tag.index: element not in tag")
ValueError: Tag.index: element not in tag
Actually I should unwrap() soup.body in the above code, but that leads to another error, so I thought I would solve this first.
Strangely enough, it works when I use Martijn Pieters' "clone()" method from this StackOverflow post:
body = clone(soup.body)
firstsoup.body.append(body)
Why this works and "copy.copy()" doesn't, I have yet to figure out.
The complete working solution without duplication of the body tags looks like this:
body = clone(soup.body)
for child in body.contents:
    firstsoup.body.append(clone(child))
This also works when I am using "copy.copy()" in the first line but not when I replace "clone()" by "copy.copy()" in the last line.
It might be too late, but I ran into a similar problem and found a simpler solution: convert all the objects you extract with BeautifulSoup to strings, using the str() function.
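For reference, a minimal sketch of the merge on two tiny made-up documents. It assumes a reasonably recent bs4 release, where copy.copy() on a tag yields a detached copy that can be appended elsewhere. This is not the Python 2.7 setup from the traceback:

```python
import copy
from bs4 import BeautifulSoup

# Two hypothetical chapter files, reduced to the bare minimum
first = BeautifulSoup("<html><body><p>one</p></body></html>", "html.parser")
second = BeautifulSoup(
    "<html><body><p>two</p><p>three</p></body></html>", "html.parser"
)

# Iterate over a snapshot of the children and append a copy of each one,
# so the second document is left untouched and no <body> gets nested
for child in list(second.body.contents):
    first.body.append(copy.copy(child))

merged = str(first)
print(merged)
```

Appending copies rather than the original children matters because append() moves an element out of its current parent, which changes the list being iterated.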

renderContents in beautifulsoup (python)

The code I'm trying to get working is:
h = str(heading)
# '<h1>Heading</h1>'
heading.renderContents()
I get this error:
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
print h.renderContents()
AttributeError: 'str' object has no attribute 'renderContents'
Any ideas?
I have a string with HTML tags that I need to clean. If there is a different way of doing that, please suggest it.
Your error message and your code sample don't line up. You say you're calling:
heading.renderContents()
But your error message says you're calling:
print h.renderContents()
Which suggests that perhaps you have a bug in your code, trying to call renderContents() on a string object that doesn't define that method.
In any case, it would help if you checked what type of object heading is to make sure it's really a BeautifulSoup instance. This works for me with BeautifulSoup 3.2.0:
from BeautifulSoup import BeautifulSoup
heading = BeautifulSoup('<h1>heading</h1>')
repr(heading)
# '<h1>heading</h1>'
print heading.renderContents()
# <h1>heading</h1>
print str(heading)
# <h1>heading</h1>
h = str(heading)
print h
# <h1>heading</h1>
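If the goal is to clean the tags out of a string, a short sketch with the newer bs4 package may help. This is an assumption on my part, since the answer above uses BeautifulSoup 3, and the sample markup is made up. decode_contents() returns the inner markup as a string and get_text() drops the tags altogether:

```python
from bs4 import BeautifulSoup

raw = "<h1>Heading <em>one</em></h1>"  # hypothetical input string
soup = BeautifulSoup(raw, "html.parser")

# Inner markup of the <h1>, without the <h1> tag itself
inner = soup.h1.decode_contents()

# Plain text with every tag removed
text = soup.get_text()

print(inner)
print(text)
```

Both calls need a parsed object, so the string has to go through BeautifulSoup() first. Calling them on the result of str() raises the same AttributeError as in the question.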

AttributeError: xmlNode instance has no attribute 'isCountNode'

I'm using libxml2 in a Python app I'm writing, and am trying to run some test code to parse an XML file. The program downloads an XML file from the internet and parses it. However, I have run into a problem.
With the following code:
xmldoc = libxml2.parseDoc(gfile_content)
droot = xmldoc.children # Get document root
dchild = droot.children # Get child nodes
while dchild is not None:
    if dchild.type == "element":
        print "\tAn element with ", dchild.isCountNode(), "child(ren)"
        print "\tAnd content", repr(dchild.content)
    dchild = dchild.next
xmldoc.freeDoc()
...which is based on the code example found in this article on XML.com, I receive the following error when I attempt to run this code on Python 2.4.3 (CentOS 5.2 package).
Traceback (most recent call last):
File "./xml.py", line 25, in ?
print "\tAn element with ", dchild.isCountNode(), "child(ren)"
AttributeError: xmlNode instance has no attribute 'isCountNode'
I'm rather stuck here.
Edit: I should note here I also tried IsCountNode() and it still threw an error.
isCountNode should read "lsCountNode" (a lower-case "L")
