lxml.etree.XPathEvalError: Invalid expression - python

I am getting an error with Python that I am not able to understand. I have simplified my code to the very bare minimum:
import requests
from lxml import html

response = requests.get('http://pycoders.com/archive')
tree = html.fromstring(response.text)
r = tree.xpath('//divass="campaign"]/a/#href')
print(r)
and still get the error
Traceback (most recent call last):
File "ultimate-1.py", line 17, in <module>
r = tree.xpath('//divass="campaign"]/a/#href')
File "lxml.etree.pyx", line 1509, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:50702)
File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:145954)
File "xpath.pxi", line 238, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:144962)
File "xpath.pxi", line 224, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:144817)
lxml.etree.XPathEvalError: Invalid expression
Would anyone have an idea of where the issue is coming from? Might it be a dependencies problem? Thanks.

The expression '//divass="campaign"]/a/#href' is not syntactically correct and does not make much sense. Instead you meant to check the class attribute:
//div[@class="campaign"]/a/@href
Now, that would help to avoid the Invalid Expression error, but you would get nothing found by the expression. This is because the data is not there in the response that requests receives. You would need to mimic what the browser does to get the desired data and make an additional request to get the javascript file containing the campaigns.
Here is what works for me:
import ast
import re
import requests
from lxml import html

with requests.Session() as session:
    # extract the script url
    response = session.get('http://pycoders.com/archive')
    tree = html.fromstring(response.text)
    script_url = tree.xpath("//script[contains(@src, 'generate-js')]/@src")[0]
    # get the script
    response = session.get(script_url)
    data = ast.literal_eval(re.match(r'document\.write\((.*?)\);$', response.text).group(1))
    # extract the desired data
    tree = html.fromstring(data)
    campaigns = [item.attrib["href"].replace("\\", "") for item in tree.xpath('//div[@class="campaign"]/a')]
    print(campaigns)
Prints:
['http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=3384ab2140',
...
'http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=8b91cb0481'
]

Your XPath expression was wrong.
If you want to take all the hrefs, your XPath should look like this:
hrefs = tree.xpath('//div[@class="campaign"]/a')
for href in hrefs:
    print(href.get('href'))
or in one line:
hrefs = [item.get('href') for item in tree.xpath('//div[@class="campaign"]/a')]
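Alternatively, lxml can return the attribute values straight from the XPath expression, with no loop needed:
hrefs = tree.xpath('//div[@class="campaign"]/a/@href')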


Why doesn't bs4 find the href attribute?

So I'm learning using atbwp (Automate the Boring Stuff with Python) and I'm now doing a program where I open the top 5 search results on a website.
It all works up until I have to get the href for each of the top results and open it. I get this error:
Traceback (most recent call last):
File "C:\Users\Asus\Desktop\pyhton\projects\emagSEARCH.py", line 33, in <module>
webbrowser.open(url)
File "C:\Users\Asus\AppData\Local\Programs\Python\Python38-32\lib\webbrowser.py", line 86, in open
if browser.open(url, new, autoraise):
File "C:\Users\Asus\AppData\Local\Programs\Python\Python38-32\lib\webbrowser.py", line 603, in open
os.startfile(url)
TypeError: startfile: filepath should be string, bytes or os.PathLike, not NoneType
This is how the html looks:
<a href="https://comenzi.farmaciatei.ro/ingrijire-personala/ingrijire-corp-si-fata/tratamente-/the-no-brainer-set-the-ordinary-deciem-p344003">The No-Brainer Set The Ordinary, Deciem</a>
And this is the part of my code which won't work for some reason..:
Soup = bs4.BeautifulSoup(res.text, 'html.parser')
results = Soup.select('.item-title')
numberTabs = min(5, len(results))
print('Opening top ' + str(numberTabs) + ' top results...')
for i in range(numberTabs):
    url = results[i].get('href')
    webbrowser.open(url)
It does what it should until the for loop. It looks pretty much exactly like the example program in the book, so I don't understand why it doesn't work. What am I doing wrong?
If you want to extract the href under the a tag, then use this:
import bs4
import webbrowser

html = '<a href="https://comenzi.farmaciatei.ro/ingrijire-personala/ingrijire-corp-si-fata/tratamente-/the-no-brainer-set-the-ordinary-deciem-p344003">The No-Brainer Set The Ordinary, Deciem</a>'
Soup = bs4.BeautifulSoup(html, 'html.parser')
url = Soup.find('a')['href']
print(url)
webbrowser.open(url)
Output:
https://comenzi.farmaciatei.ro/ingrijire-personala/ingrijire-corp-si-fata/tratamente-/the-no-brainer-set-the-ordinary-deciem-p344003
You can do the same for all a tags in order to get all hrefs.
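For instance, a minimal sketch that gathers every href on a parsed page (reusing the Soup object from above):
# Collect the href of every <a> tag, skipping anchors without one
urls = [a.get('href') for a in Soup.find_all('a') if a.get('href')]
print(urls)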

How to locate an XML error in python given the line number and column number?

I am getting an error when I parse my xml. It gives a line and column number, but I am not sure how to go about locating it.
My code
import requests
from xml.etree import ElementTree as ET

urlBase = 'https://www.goodreads.com/review/list_rss/'
urlMiddle = '?shelf=read&order=d&sort=rating&per_page=200&page='
finalUrl = urlBase + str(32994) + urlMiddle + str(1)
resp = requests.get(finalUrl)
x = ET.fromstring(resp.content)
Error
File "<string>", line unknown
ParseError: not well-formed (invalid token): line 952, column 1023
I try to print the contents, but it's just one line
resp.content
The output is too big to print here.
So I'm not sure how to check a specific line since it's just one line.
You are trying to parse HTML content with an XML parser. You will run into problems whenever the content is not valid XML, that is, when it is not XHTML.
Instead of that, you can use a HTML parser like the one available with lxml.
For instance
from io import BytesIO
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(BytesIO(resp.content), parser)
This will solve your issue.
Most likely you are on Windows and the console print isn't rendering the \n line breaks, so everything appears as one line.
Try adding:
open('resp.xml', 'wb').write(resp.content)
after where you get resp.
Then you can open resp.xml in an editor and see what line 952 looks like.
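If you would rather locate the spot programmatically, here is a small sketch (assuming the resp object from the question and the line/column pair reported by the ParseError) that prints the neighborhood of the offending position:
# Decode the payload, split it into lines, and show the text around the reported column
text = resp.content.decode('utf-8', errors='replace')
lines = text.splitlines()
line, column = 952, 1023  # values from the ParseError message
print(lines[line - 1][max(0, column - 40):column + 40])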

Beautifulsoup: ValueError: Tag.index: element not in tag

I am converting ePub to single HTML files, so I need to concatenate the individual chapters into one HTML file. They are named "..._split_000.html" etc., and I set up various structures to iterate over the ToC, generate directory names, and so on.
I want to concat the HTML content of the individual parts with Beautifulsoup by appending the content of the body element of the following parts to the body of the first part. Only my code doesn't seem to work. "book" is an instance of the epub class of ebooklib. "docsfiles" is a dictionary with the names of the HTML files as a key and a list of files as one value among others:
def concat_articles(book, docsfiles, toc):
    articles = {}
    for doc, val in docsfiles.iteritems():
        firstsoup = False
        for f in val['files']:
            content = book.get_item_with_href(f).content
            soup = BeautifulSoup(content, "html.parser")
            if not firstsoup:
                firstsoup = soup
                continue
            body = copy.copy(soup.body)
            firstsoup.body.append(body)
        articles[val['id']] = firstsoup.prettify("utf-8")
    return articles
When I run this on my ePub, an error occurs:
Traceback (most recent call last):
File "extract-new.py", line 170, in <module>
articles_html = concat_articles(book, docsfiles, toc)
File "extract-new.py", line 97, in concat_articles
firstsoup.body.append(body)
File "/Library/Python/2.7/site-packages/bs4/element.py", line 338, in append
self.insert(len(self.contents), tag)
File "/Library/Python/2.7/site-packages/bs4/element.py", line 291, in insert
new_child.extract()
File "/Library/Python/2.7/site-packages/bs4/element.py", line 235, in extract
del self.parent.contents[self.parent.index(self)]
File "/Library/Python/2.7/site-packages/bs4/element.py", line 888, in index
raise ValueError("Tag.index: element not in tag")
ValueError: Tag.index: element not in tag
Actually I should unwrap() the soup.body in the above code, but that leads to another error, so I thought I would solve this one first.
Strangely enough, it works when I use Martijn Pieters' clone() method from this StackOverflow post:
body = clone(soup.body)
firstsoup.body.append(body)
Why this works and "copy.copy()" doesn't, I have yet to figure out.
The complete working solution without duplication of the body tags looks like this:
body = clone(soup.body)
for child in body.contents:
    firstsoup.body.append(clone(child))
This also works when I am using "copy.copy()" in the first line but not when I replace "clone()" by "copy.copy()" in the last line.
It might be too late, but I ran into a similar problem and found a simpler solution: turn all the objects you extract with BeautifulSoup into strings, using the str() function.
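In that spirit, a minimal sketch using the names from the snippet above (an assumption, not the poster's exact code): stringify the parsed body and re-parse it, so the elements you append belong to a fresh, consistent tree:
# Re-parse the stringified body; its children carry consistent parent links
body_copy = BeautifulSoup(str(soup.body), "html.parser").body
for child in list(body_copy.contents):  # snapshot, since append() moves elements
    firstsoup.body.append(child)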

readlines() error with for-loop in python

This error is hard to describe because I can't figure out how the loop is even affecting the readline() and readlines() methods. When I try using the former, I get these unexpected Traceback errors. When I try the latter, my code runs and nothing happens. I have determined that the bug is located in the first eight lines. The first few lines of the Topics.txt file are posted below.
Code
import requests
from html.parser import HTMLParser
from bs4 import BeautifulSoup

Url = "https://ritetag.com/best-hashtags-for/"
Topicfilename = "Topics.txt"
Topicfile = open(Topicfilename, 'r')
Line = Topicfile.readlines()
Linenumber = 0
for Line in Topicfile:
    Linenumber += 1
    print("Reading line", Linenumber)
    Topic = Line
    Newtopic = Topic.strip("\n").replace(' ', '').replace(',', '')
    print(Newtopic)
    Link = Url.join(Newtopic)
    print(Link)
    Sourcecode = requests.get(Link)
When I run this bit here, it prints the URL preceded by the first character of the line. For example, for 24 Hour Fitness it prints 2https://ritetag.com/best-hashtags-for/4https://ritetag.com/best-hashtags-for/Hhttps://ritetag.com/best-hashtags-for/ and so on.
Topics.txt
21st Century Fox
24 Hour Fitness
2K Games
3M
Full Error
Reading line 1
24HourFitness
2https://ritetag.com/best-hashtags-for/4https://ritetag.com/best-hashtags-for/Hhttps://ritetag.com/best-hashtags-for/ohttps://ritetag.com/best-hashtags-for/uhttps://ritetag.com/best-hashtags-for/rhttps://ritetag.com/best-hashtags-for/Fhttps://ritetag.com/best-hashtags-for/ihttps://ritetag.com/best-hashtags-for/thttps://ritetag.com/best-hashtags-for/nhttps://ritetag.com/best-hashtags-for/ehttps://ritetag.com/best-hashtags-for/shttps://ritetag.com/best-hashtags-for/s
Traceback (most recent call last):
  File "C:\Users\Caden\Desktop\Programs\LususStudios\AutoDealBot\HashtagScanner.py", line 17, in <module>
    Sourcecode = requests.get(Link)
  File "C:\Python34\lib\site-packages\requests-2.10.0-py3.4.egg\requests\api.py", line 71, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python34\lib\site-packages\requests-2.10.0-py3.4.egg\requests\api.py", line 57, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python34\lib\site-packages\requests-2.10.0-py3.4.egg\requests\sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python34\lib\site-packages\requests-2.10.0-py3.4.egg\requests\sessions.py", line 579, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\Python34\lib\site-packages\requests-2.10.0-py3.4.egg\requests\sessions.py", line 653, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '2https://ritetag.com/best-hashtags-for/4https://ritetag.com/best-hashtags-for/Hhttps://ritetag.com/best-hashtags-for/ohttps://ritetag.com/best-hashtags-for/uhttps://ritetag.com/best-hashtags-for/rhttps://ritetag.com/best-hashtags-for/Fhttps://ritetag.com/best-hashtags-for/ihttps://ritetag.com/best-hashtags-for/thttps://ritetag.com/best-hashtags-for/nhttps://ritetag.com/best-hashtags-for/ehttps://ritetag.com/best-hashtags-for/shttps://ritetag.com/best-hashtags-for/s'
I think there are two issues:
You seem to be iterating over Topicfile instead of the list returned by Topicfile.readlines().
Url.join(Newtopic) isn't returning what you think it is. .join takes an iterable (in this case a string, i.e. a sequence of characters) and inserts Url between each element.
Here is code with these problems addressed:
import requests

Url = "https://ritetag.com/best-hashtags-for/"
Topicfilename = "topics.txt"
Topicfile = open(Topicfilename, 'r')
Lines = Topicfile.readlines()
Linenumber = 0
for Line in Lines:
    Linenumber += 1
    print("Reading line", Linenumber)
    Topic = Line
    Newtopic = Topic.strip("\n").replace(' ', '').replace(',', '')
    print(Newtopic)
    Link = '{}{}'.format(Url, Newtopic)
    print(Link)
    Sourcecode = requests.get(Link)
As an aside, I also recommend using lowercased variable names since camel case is generally reserved for class names in Python :)
Firstly, Python convention is to lowercase all variable names.
Secondly, you are exhausting the file pointer: you read all the lines up front, then continue to loop over the already-consumed file.
Try to simply open the file, then loop over it
linenumber = 0
with open("Topics.txt") as topicfile:
    for line in topicfile:
        # do work
        linenumber += 1
Then, regarding the issue in the traceback: if you look closely, you are building up a really long URL string that is definitely not a valid URL, so requests throws an error:
InvalidSchema: No connection adapters were found for '2https://ritetag.com/best-hashtags-for/4https://ritetag.com/...
And you can debug to see that Url.join(Newtopic) is "interleaving" the Url string between each character of the Newtopic string, which is exactly what str.join does.
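A quick illustration of that interleaving, runnable on its own:
Url = "https://ritetag.com/best-hashtags-for/"
print(Url.join("24"))            # 2https://ritetag.com/best-hashtags-for/4
print('{}{}'.format(Url, "24"))  # https://ritetag.com/best-hashtags-for/24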

ElementTree XML parsing and urllib2.urlopen

I'm opening a URL using:
response = urllib2.urlopen(url, data, timeout=_TIMEOUT)
and using response.read(), it gives the following output:
<XMLlookup licenseid="X4X6X42" reason="OK" status="1" />
but when I want to parse it using ElementTree, like so:
print response.read()
t = ET.parse(response)
r = t.getroot()
print r.attrib.get('status')
gives me the following error message:
File "<string>", line 62, in parse
File "<string>", line 38, in parse
cElementTree.ParseError: no element found: line 1, column 0
but when I remove the line response.read(), the code works fine. What am I doing wrong?
You can only read a response once, because it is a file-like object (in fact an addinfourl instance). Subsequent calls to read will return an empty string, since you have already consumed the whole text.
So either do not call read before using ET.parse(response), or store the result in a string and use that for ET:
txt = response.read()
# do what you want with txt (without changing it)
t = ET.fromstring(txt)
You need to use:
t = ET.fromstring(response.read())
instead of
response.read()
t = ET.parse(response)
r = t.getroot()
try
resp = response.read()
t = ET.fromstring(resp)
r = t.getroot()
or
t = ET.fromstring(response.read())
r = t.getroot()
Also, you should note that not all HTML is parsable as XML. If your request returns XHTML you will be fine, but otherwise you will get an error very similar to the one you are seeing.
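If you cannot guarantee XHTML, one option is to try a strict XML parse and fall back to a lenient HTML parse; the sketch below uses lxml rather than the cElementTree module from the question:
from lxml import etree

payload = response.read()
try:
    root = etree.fromstring(payload)  # strict XML parse
except etree.XMLSyntaxError:
    root = etree.fromstring(payload, parser=etree.HTMLParser())  # lenient HTML fallback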
