Why doesn't bs4 find the href attribute? - python

I'm learning Python from Automate the Boring Stuff with Python, and I'm writing a program that opens the top 5 search results on a website.
Everything works until I have to get the href for each of the top results and open it. I get this error:
Traceback (most recent call last):
File "C:\Users\Asus\Desktop\pyhton\projects\emagSEARCH.py", line 33, in <module>
webbrowser.open(url)
File "C:\Users\Asus\AppData\Local\Programs\Python\Python38-32\lib\webbrowser.py", line 86, in open
if browser.open(url, new, autoraise):
File "C:\Users\Asus\AppData\Local\Programs\Python\Python38-32\lib\webbrowser.py", line 603, in open
os.startfile(url)
TypeError: startfile: filepath should be string, bytes or os.PathLike, not NoneType
This is how the html looks:
The No-Brainer Set The Ordinary, Deciem
And this is the part of my code that doesn't work:
Soup = bs4.BeautifulSoup(res.text, 'html.parser')
results = Soup.select('.item-title')
numberTabs = min(5, len(results))
print('Opening top ' + str(numberTabs) + ' results...')
for i in range(numberTabs):
    url = results[i].get('href')
    webbrowser.open(url)
It does what it should until the for loop. It looks pretty much exactly like the example program in the book, so I don't understand why it doesn't work. What am I doing wrong?

If you want to extract the href from the a tag, use this:
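import bs4, webbrowser
html = '<a href="https://comenzi.farmaciatei.ro/ingrijire-personala/ingrijire-corp-si-fata/tratamente-/the-no-brainer-set-the-ordinary-deciem-p344003">The No-Brainer Set The Ordinary, Deciem</a>'
Soup = bs4.BeautifulSoup(html, 'html.parser')
url = Soup.find('a')['href']
print(url)
webbrowser.open(url)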
Output:
https://comenzi.farmaciatei.ro/ingrijire-personala/ingrijire-corp-si-fata/tratamente-/the-no-brainer-set-the-ordinary-deciem-p344003
You can do the same for all a tags to get all the hrefs.
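A likely reason .get('href') returns None in the question is that the item-title class sits on a wrapper element rather than on the link itself. The sketch below assumes exactly that; the markup and the example.com URLs are made up for illustration, so check them against the real page.

```python
import bs4

# Hypothetical markup: the class is on a wrapper, the link is nested inside it
html = """
<h2 class="item-title"><a href="https://example.com/p1">First</a></h2>
<h2 class="item-title"><a href="https://example.com/p2">Second</a></h2>
<h2 class="item-title">No link here</h2>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# .get('href') on the wrapper gives None because <h2> has no href attribute
wrapper_hrefs = [tag.get("href") for tag in soup.select(".item-title")]

# Select the nested <a> elements that actually carry an href
urls = [a["href"] for a in soup.select(".item-title a[href]")]

for url in urls[:5]:
    print(url)  # in the real script this would be webbrowser.open(url)
```

Filtering with a[href] also skips results that have no link at all, so webbrowser.open() never receives None.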


Beautifulsoup: ValueError: Tag.index: element not in tag

I am converting ePub to single HTML files, so I need to concatenate the individual chapters into one HTML file. They are named "..._split_000.html" etc., and I set up various structures to iterate over the ToC, generate directory names and so on.
I want to concat the HTML content of the individual parts with Beautifulsoup by appending the content of the body element of the following parts to the body of the first part. Only my code doesn't seem to work. "book" is an instance of the epub class of ebooklib. "docsfiles" is a dictionary with the names of the HTML files as a key and a list of files as one value among others:
def concat_articles(book, docsfiles, toc):
    articles = {}
    for doc, val in docsfiles.iteritems():
        firstsoup = False
        for f in val['files']:
            content = book.get_item_with_href(f).content
            soup = BeautifulSoup(content, "html.parser")
            if not firstsoup:
                firstsoup = soup
                continue
            body = copy.copy(soup.body)
            firstsoup.body.append(body)
        articles[val['id']] = firstsoup.prettify("utf-8")
    return articles
When I run this on my ePub, an error occurs:
Traceback (most recent call last):
File "extract-new.py", line 170, in <module>
articles_html = concat_articles(book, docsfiles, toc)
File "extract-new.py", line 97, in concat_articles
firstsoup.body.append(body)
File "/Library/Python/2.7/site-packages/bs4/element.py", line 338, in append
self.insert(len(self.contents), tag)
File "/Library/Python/2.7/site-packages/bs4/element.py", line 291, in insert
new_child.extract()
File "/Library/Python/2.7/site-packages/bs4/element.py", line 235, in extract
del self.parent.contents[self.parent.index(self)]
File "/Library/Python/2.7/site-packages/bs4/element.py", line 888, in index
raise ValueError("Tag.index: element not in tag")
ValueError: Tag.index: element not in tag
Actually, I should unwrap() soup.body in the above code, but that leads to another error, so I thought I would solve this one first.
Strangely enough, it works when I use Martijn Pieters' "clone()" method from this StackOverflow post:
body = clone(soup.body)
firstsoup.body.append(body)
Why this works and "copy.copy()" doesn't, I have yet to figure out.
The complete working solution without duplication of the body tags looks like this:
body = clone(soup.body)
for child in body.contents:
    firstsoup.body.append(clone(child))
This also works when I am using "copy.copy()" in the first line but not when I replace "clone()" by "copy.copy()" in the last line.
It might be too late, but I ran into a similar problem and found a simpler solution: convert all the objects you extract with BeautifulSoup to strings using the str() function.
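Another way to merge the chapters without copying is to move the children of each later body into the first one. This is only a sketch with a current bs4 release and inline HTML strings standing in for the ePub chapters; merge_bodies is a hypothetical helper, not part of bs4 or ebooklib.

```python
from bs4 import BeautifulSoup


def merge_bodies(html_parts):
    """Append the body children of every later part to the first part's body."""
    first = BeautifulSoup(html_parts[0], "html.parser")
    for part in html_parts[1:]:
        soup = BeautifulSoup(part, "html.parser")
        # Snapshot the children: append() moves each node out of its old tree,
        # which would otherwise change the list while it is being iterated.
        for child in list(soup.body.contents):
            first.body.append(child)
    return first


merged = merge_bodies([
    "<html><body><p>one</p></body></html>",
    "<html><body><p>two</p></body></html>",
])
print(merged)
```

Because the children are moved rather than the body tag itself, the result keeps a single body element.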

lxml.etree.XPathEvalError: Invalid expression

I am getting an error with Python that I am not able to understand. I have simplified my code to the very bare minimum:
response = requests.get('http://pycoders.com/archive')
tree = html.fromstring(response.text)
r = tree.xpath('//divass="campaign"]/a/@href')
print(r)
and still get the error
Traceback (most recent call last):
File "ultimate-1.py", line 17, in <module>
r = tree.xpath('//divass="campaign"]/a/@href')
File "lxml.etree.pyx", line 1509, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:50702)
File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:145954)
File "xpath.pxi", line 238, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:144962)
File "xpath.pxi", line 224, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:144817)
lxml.etree.XPathEvalError: Invalid expression
Would anyone have an idea of where the issue is coming from? Might it be a dependencies problem? Thanks.
The expression '//divass="campaign"]/a/@href' is not syntactically correct and does not make much sense. You meant to check the class attribute instead:
//div[@class="campaign"]/a/@href
That avoids the Invalid expression error, but the expression would still find nothing, because the data is not in the response that requests receives. You would need to mimic what the browser does and make an additional request for the JavaScript file containing the campaigns.
Here is what works for me:
import ast
import re
import requests
from lxml import html
with requests.Session() as session:
    # extract script url
    response = session.get('http://pycoders.com/archive')
    tree = html.fromstring(response.text)
    script_url = tree.xpath("//script[contains(@src, 'generate-js')]/@src")[0]
    # get the script
    response = session.get(script_url)
    # response.text is a str, so the str pattern also matches on Python 3
    data = ast.literal_eval(re.match(r'document\.write\((.*?)\);$', response.text).group(1))
    # extract the desired data
    tree = html.fromstring(data)
    campaigns = [item.attrib["href"].replace("\\", "") for item in tree.xpath('//div[@class="campaign"]/a')]
    print(campaigns)
Prints:
['http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=3384ab2140',
...
'http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=8b91cb0481'
]
Your XPath expression was malformed.
If you want to get all the hrefs, your XPath should look like this:
hrefs = tree.xpath('//div[@class="campaign"]/a')
for href in hrefs:
    print(href.get('href'))
or in one line:
hrefs = [item.get('href') for item in tree.xpath('//div[@class="campaign"]/a')]
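The two ways of reading the attribute can be compared on a small inline snippet. The markup and paths below are invented for illustration and are not taken from the pycoders page.

```python
from lxml import html

# Made-up markup that mirrors the structure discussed above
snippet = """
<div>
  <div class="campaign"><a href="/archive/1">Issue 1</a></div>
  <div class="campaign"><a href="/archive/2">Issue 2</a></div>
  <div class="other"><a href="/about">About</a></div>
</div>
"""
tree = html.fromstring(snippet)

# Ending the path with /@href returns the attribute values directly
hrefs_direct = tree.xpath('//div[@class="campaign"]/a/@href')

# Selecting the <a> elements and reading the attribute afterwards
hrefs_via_elements = [a.get("href") for a in tree.xpath('//div[@class="campaign"]/a')]

print(hrefs_direct)
print(hrefs_via_elements)
```

Both forms should give the same links; the element form is handier when you also need the link text.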

Not clear on why my function is returning none

I have very limited coding background except for some Ruby, so if there's a better way of doing this, please let me know!
Essentially I have a .txt file full of words. I want to import the .txt file and turn it into a list. Then, I want to take the first item in the list, assign it to a variable, and use that variable in an external request that sends off to get the definition of the word. The definition is returned, and tucked into a different .txt file. Once that's done, I want the code to grab the next item in the list and do it all again until the list is exhausted.
Below is my code in progress to give an idea of where I'm at. I'm still trying to figure out how to iterate through the list correctly, and I'm having a hard time interpreting the documentation.
Sorry in advance if this was already asked! I searched, but couldn't find anything that specifically answered my issue.
from __future__ import print_function
import requests
import urllib
from bs4 import BeautifulSoup
def get_definition(x):
    url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query={0}'.format(x)
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    return soup.find('pre', text=True)[0]
lines = []
with open('vocab.txt') as f:
    lines = f.readlines()
lines = [line.strip() for line in lines]
definitions = []
for line in lines:
    definitions.append(get_definition(line))
out_str = '\n'.join(definitions)
with open('definitions.txt', 'w') as f:
    f.write(out_str)
The problem I'm having is this error:
Traceback (most recent call last):
File "WIP.py", line 20, in <module>
definitions.append(get_definition(line))
File "WIP.py", line 11, in get_definition
return soup.find('pre', text=True)[0]
File "/Library/Python/2.7/site-packages/bs4/element.py", line 958, in __getitem__
return self.attrs[key]
KeyError: 0
I understand that soup.find('pre', text=True) is returning None, but not why or how to fix it.
Your problem is that find() returns a single result, not a list. The result is a dict-like Tag object, so [0] is treated as a lookup of the attribute key 0, which does not exist.
Just remove the [0] and you should be fine.
Also, soup.find(...) is not returning None; it is returning a result. If it were returning None, you would get this error instead:
TypeError: 'NoneType' object has no attribute '__getitem__'
See the Beautiful Soup documentation for find().
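The difference between find() and find_all() can be seen on a small inline document. This is a minimal sketch using html.parser and made-up markup instead of the dictionary service's response.

```python
from bs4 import BeautifulSoup

html = "<html><body><pre>first</pre><pre>second</pre></body></html>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("pre")        # a single Tag, or None if nothing matches
all_pre = soup.find_all("pre")  # always a list-like ResultSet

first_text = first.get_text() if first is not None else None
texts = [tag.get_text() for tag in all_pre]

# Indexing a Tag looks up an HTML attribute, so first[0] raises KeyError
try:
    first[0]
    index_error = None
except KeyError as exc:
    index_error = exc

print(first_text, texts, repr(index_error))
```

Checking for None before calling get_text() covers the case where a word has no definition block at all.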

Tags with : in name in lxml

I'm trying to use lxml.etree to parse a Wordpress export document (it's XML, somewhat RSS like). I'm only interested in published posts, so I'm using the following to loop through published posts:
for item in data.findall("item"):
    if item.find("wp:post_type").text != "post":
        continue
    if item.find("wp:status").text != "publish":
        continue
    write_post(item)
where data is the tag that all item tags are found in. item tags contain posts, pages, and drafts. My problem is that lxml can't find tags that have a : in their name (e.g. wp:post_type). When I try item.find("wp:post_type") I get this error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "lxml.etree.pyx", line 1279, in lxml.etree._Element.find (src/lxml/lxml.etree.c:38124)
File "/usr/lib64/python2.7/site-packages/lxml/_elementpath.py", line 210, in find
it = iterfind(elem, path)
File "/usr/lib64/python2.7/site-packages/lxml/_elementpath.py", line 200, in iterfind
selector = _build_path_iterator(path)
File "/usr/lib64/python2.7/site-packages/lxml/_elementpath.py", line 184, in _build_path_iterator
selector.append(ops[token[0]](_next, token))
KeyError: ':'
I assume the KeyError : ':' refers to the colon in the name of the tag being invalid. Is there some way I can escape the colon so that lxml finds the right tag? Does : have some special meaning in this context? Or am I doing something wrong? Any help would be appreciated.
The : is an XML namespace separator: wp is a prefix that stands for a namespace URI declared in the document. To look up such a tag in lxml, replace the prefix and colon with the namespace URI in curly braces, as in item.find("{http://example.org/}status").text.
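Here is a small sketch of both lookup styles. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages; lxml.etree accepts the same curly-brace notation and the same namespaces mapping in find(). The namespace URI and the sample document are placeholders, so take the real URI from the xmlns:wp declaration in your export file.

```python
import xml.etree.ElementTree as ET

WP_NS = "http://example.org/wp"  # placeholder URI, not the real WordPress one

xml_doc = """<rss xmlns:wp="http://example.org/wp">
  <channel>
    <item><title>A</title><wp:post_type>post</wp:post_type><wp:status>publish</wp:status></item>
    <item><title>B</title><wp:post_type>page</wp:post_type><wp:status>publish</wp:status></item>
    <item><title>C</title><wp:post_type>post</wp:post_type><wp:status>draft</wp:status></item>
  </channel>
</rss>"""

root = ET.fromstring(xml_doc)
ns = {"wp": WP_NS}

published = []
for item in root.iter("item"):
    # Curly-brace form: the namespace URI replaces the prefix and colon
    if item.find("{%s}post_type" % WP_NS).text != "post":
        continue
    # Equivalent: keep the prefix and pass a prefix-to-URI mapping
    if item.find("wp:status", ns).text != "publish":
        continue
    published.append(item.find("title").text)

print(published)
```

The mapping form keeps the paths readable when many namespaced tags are involved.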

pubDate RSS parsing weirdness with Beautifulsoup/Python

I'm trying to parse an RSS/Podcast feed using Beautifulsoup and everything is working nicely except I can't seem to parse the 'pubDate' field.
data = urllib2.urlopen("http://www.democracynow.org/podcast.xml")
dom = BeautifulStoneSoup(data, fromEncoding='utf-8')
items = dom.findAll('item')
for item in items:
    title = item.find('title').string.strip()
    pubDate = item.find('pubDate').string.strip()
The title gets parsed fine but when it gets to pubDate, it says:
Traceback (most recent call last):
File "", line 2, in
AttributeError: 'NoneType' object has no attribute 'string'
However, when I download a copy of the XML file and rename 'pubDate' to something else, then parse it again, it seems to work. Is pubDate a reserved variable or something in Python?
Thanks,
g
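It works with item.find('pubdate').string.strip(). pubDate is not a reserved name in Python; the parser appears to convert tag names to lowercase, so the lookup has to use the lowercase name.
Why don't you use feedparser?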
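The lowercasing can be reproduced with the current bs4 package and its html.parser backend. Note that this is a different setup from the BeautifulStoneSoup class in the question, and the feed fragment is made up.

```python
from bs4 import BeautifulSoup

# Made-up feed fragment with a mixed-case tag name
feed = (
    "<rss><channel><item>"
    "<pubDate>Mon, 01 Jan 2018 08:00:00 -0500</pubDate>"
    "</item></channel></rss>"
)
soup = BeautifulSoup(feed, "html.parser")
item = soup.find("item")

missing = item.find("pubDate")  # None: the stored tag name is lowercase
found = item.find("pubdate")    # matches the lowercased tag

pub_date = found.string.strip() if found is not None else None
print(missing, pub_date)
```

Guarding against None before reading .string avoids the AttributeError from the question when a tag is absent.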
