What is wrong with my xpath expression? - python

I want to extract all the links in td whose class is u-ctitle.
import os
import urllib
import lxml.html
down='http://v.163.com/special/opencourse/bianchengdaolun.html'
file=urllib.urlopen(down).read()
root=lxml.html.document_fromstring(file)
namelist=root.xpath('//td[@class="u-ctitle"]/a')
len(namelist)
The output is [], but there are many td elements whose class is "u-ctitle" (you can see them with Firebug). Why can't I extract them?
My python version is 2.7.9.
Renaming the file variable to something else makes no difference.

Your XPath is correct. The problem is unrelated.
If you examine the HTML, you will see the following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=GBK" />
And in this code:
file=urllib.urlopen(down).read()
root=lxml.html.document_fromstring(file)
file is actually a byte string, so decoding from GBK-encoded bytes to a Unicode string happens inside the document_fromstring method.
The problem is, HTML encoding is not actually GBK and lxml decodes it incorrectly, leading to loss of data.
>>> file.decode('gbk')
Traceback (most recent call last):
File "down.py", line 9, in <module>
file.decode('gbk')
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 7247-7248: illegal multibyte sequence
After some trial and error, we find that the actual encoding is GB18030. To make the script work, you need to decode the bytes manually:
root=lxml.html.document_fromstring(file.decode('GB18030'))
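A quick self-contained sketch of why this works (the snippet below is a made-up stand-in for the real page, using an emoji as an example of a character outside GBK): GB18030 is a superset of GBK that adds four-byte sequences covering all of Unicode, so bytes that make the gbk codec fail can still decode as gb18030.

```python
# a made-up stand-in for the downloaded page, containing a character
# that GB18030 can represent but GBK cannot
raw = u'<td class="u-ctitle"><a href="#">\U0001F600</a></td>'.encode('gb18030')

try:
    raw.decode('gbk')
    print('gbk succeeded')
except UnicodeDecodeError:
    print('gbk failed')  # prints: gbk failed

# gb18030 decodes the same bytes without data loss
print(u'\U0001F600' in raw.decode('gb18030'))  # prints: True
```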

Related

SimpleHTTPServer in Python 3.6.4 cannot handle non-ASCII strings (Chinese in my case)

I run SimpleHTTPServer in Python 3.6.4 64-bit with this command:
python -m http.server --cgi
Then I create a form in test.py and submit it to test_form_action.py, which prints the input text.
cgi-bin/test.py
# coding=utf-8
from __future__ import unicode_literals, absolute_import
print("Content-Type: text/html") # HTML is following
print()
reshtml = '''<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html" charset="utf-8"/>
</head>
<body>
<div style="text-align: center;">
<form action="/cgi-bin/test_form_action.py" method="POST"
target="_blank">
输入:<input type="text" id="id" name="name"/>
<button type="submit">Submit</button>
</form>
</div>
</body>
</html>'''
print(reshtml)
cgi-bin/test_form_action.py
# coding=utf-8
from __future__ import unicode_literals, absolute_import
# Import modules for CGI handling
import cgi, cgitb
cgitb.enable()
if __name__ == '__main__':
    print("Content-Type: text/html")  # HTML is following
    print()
    form = cgi.FieldStorage()
    print(form)
    id = form.getvalue("id")
    name = form.getvalue("name")
    print(id)
When I visit http://127.0.0.1:8000/cgi-bin/test.py, the Chinese characters "输入" don't display correctly; they look like "����". I have to manually change the text encoding of the page from "Unicode" to "Chinese Simplified" in Firefox to make the Chinese characters look normal.
It's weird, since I put charset="utf-8" in cgi-bin/test.py.
Furthermore, when I type some Chinese into the input form and submit it, cgi-bin/test_form_action.py is blank, and meanwhile errors show up in the Windows terminal where I run SimpleHTTPServer:
127.0.0.1 - - [23/Mar/2018 23:43:32] b'Error in sys.excepthook:\r\nTraceback (most recent call last):\r\n  File "E:\Python\Python36\Lib\cgitb.py", line 268, in __call__\r\n    self.handle((etype, evalue, etb))\r\n  File "E:\Python\Python36\Lib\cgitb.py", line 288, in handle\r\n    self.file.write(doc + \'\n\')\r\nUnicodeEncodeError: \'gbk\' codec can\'t encode character \'\ufffd\' in position 1894: illegal multibyte sequence\r\n\r\nOriginal exception was:\r\nTraceback (most recent call last):\r\n  File "G:\Python\Project\VideoHelper\cgi-bin\test_form_action.py", line 13, in <module>\r\n    print(form)\r\nUnicodeEncodeError: \'gbk\' codec can\'t encode character \'\ufffd\' in position 52: illegal multibyte sequence\r\n'
127.0.0.1 - - [23/Mar/2018 23:43:32] CGI script exit status 0x1
When you use the print() function, Python converts the strings to bytes, i.e. it encodes them using a default codec.
The choice of this default value depends on the environment – in your case it seems to be GBK (judging from the error message).
In the HTML page your CGI script returns, you specify the codec ("charset") as UTF-8.
You can of course change this to GBK, but it will only solve your first problem (display of test.py), not the second one (encoding error in test_form_action.py).
Instead, it's probably better to get Python to send UTF-8-encoded data on STDOUT.
One approach is to replace all occurrences of
print(x)
with
sys.stdout.buffer.write(x.encode('utf8'))
Alternatively, you can replace sys.stdout with a re-encoded wrapper, without changing the print() occurrences:
sys.stdout = open(sys.stdout.buffer.fileno(), 'w', encoding='utf8')
Note: These two solutions don't work in Python 2.x (you'd have to omit the .buffer part there).
I'm writing this because your code has from __future__ import statements, which have no use in code that is run with Python 3 exclusively.
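The stdout-wrapping idea can be sketched offline with io.TextIOWrapper; here a BytesIO stands in for the CGI script's byte-oriented stdout (in the real script you would wrap sys.stdout.buffer the same way):

```python
import io

# stand-in for the CGI script's underlying byte stream (sys.stdout.buffer)
raw_stdout = io.BytesIO()

# wrap it so that text written through it is encoded as UTF-8,
# regardless of the platform's default codec (GBK on a Chinese Windows)
utf8_stdout = io.TextIOWrapper(raw_stdout, encoding='utf-8')
print(u'输入', file=utf8_stdout)
utf8_stdout.flush()

print(raw_stdout.getvalue())  # b'\xe8\xbe\x93\xe5\x85\xa5\n'
```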

Error with Python and XML

I'm getting an error when trying to grab a value from my XML. I get "Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
Here is my code:
import requests
import lxml.etree
from requests.auth import HTTPBasicAuth
r= requests.get("https://somelinkhere/folder/?parameter=abc", auth=HTTPBasicAuth('username', 'password'))
print r.text
root = lxml.etree.fromstring(r.text)
textelem = root.find("opensearch:totalResults")
print textelem.text
I'm getting this error:
Traceback (most recent call last):
File "tickets2.py", line 8, in <module>
root = lxml.etree.fromstring(r.text)
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:82934)
File "src/lxml/parser.pxi", line 1814, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:124471)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
Here is what the XML looks like; I'm trying to grab the value in the last line.
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:apple-wallpapers="http://www.apple.com/ilife/wallpapers" xmlns:g-custom="http://base.google.com/cns/1.0" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:georss="http://www.georss.org/georss/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:cc="http://web.resource.org/cc/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:g-core="http://base.google.com/ns/1.0">
<title>Feed from some link here</title>
<link rel="self" href="https://somelinkhere/folder/?parameter=abc" />
<link rel="first" href="https://somelinkhere/folder/?parameter=abc" />
<id>https://somelinkhere/folder/?parameter=abc</id>
<updated>2018-03-06T17:48:09Z</updated>
<dc:creator>company.com</dc:creator>
<dc:date>2018-03-06T17:48:09Z</dc:date>
<opensearch:totalResults>4</opensearch:totalResults>
I have tried various changes from links like https://twigstechtips.blogspot.com/2013/06/python-lxml-strings-with-encoding.html and http://makble.com/how-to-parse-xml-with-python-and-lxml but I keep running into the same error.
Instead of r.text, which guesses at the text encoding and decodes it, try using r.content which accesses the response body as bytes. (See http://docs.python-requests.org/en/latest/user/quickstart/#response-content.)
You could also use r.raw. See parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml) for more info.
Once that issue is fixed, you'll have the issue of the namespace. The element you're trying to find (opensearch:totalResults) has the prefix opensearch which is bound to the uri http://a9.com/-/spec/opensearch/1.1/.
You can find the element by combining the namespace uri and the local name (Clark notation):
{http://a9.com/-/spec/opensearch/1.1/}totalResults
See http://lxml.de/tutorial.html#namespaces for more info.
Here's an example with both changes implemented:
ns = "{http://a9.com/-/spec/opensearch/1.1/}"
root = lxml.etree.fromstring(r.content)
textelem = root.find(ns + "totalResults")
print textelem.text
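Alternatively, both lxml and the standard library accept a prefix-to-URI mapping as the second argument to find(), which avoids spelling out Clark notation. A self-contained sketch using the stdlib parser and a trimmed, made-up stand-in for the feed:

```python
from xml.etree import ElementTree as ET

# trimmed stand-in for the feed, keeping only the namespace we care about
xml = (b'<feed xmlns="http://www.w3.org/2005/Atom" '
       b'xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">'
       b'<opensearch:totalResults>4</opensearch:totalResults></feed>')

# map the prefix to its URI; the prefix name is local to this dict
ns = {'opensearch': 'http://a9.com/-/spec/opensearch/1.1/'}
root = ET.fromstring(xml)
print(root.find('opensearch:totalResults', ns).text)  # 4
```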

pykml and utf8: either the input is incorrect, or I don't get it

I have a kmz file from the web and wish to read it into CSV or similar using pykml.
The file is in UTF-8, or at least claims to be (see the header below). Reading it works, but an error is triggered at the first accented character.
<?xml version='1.0' encoding='UTF-8'?>
<kml xmlns='http://www.opengis.net/kml/2.2'>
<Document>
<name>
from pykml import parser
with open(KMZFIL) as f:
    folder = parser.parse(f).getroot().Document.Folder
for pm in folder.Placemark:
    print(pm.name)
Ablitas (militar) (Emerg)
Ademuz (forestal)
Ager (PL%)
Alcala del Rio (ILIPA MAGNA)(Esc.)
Traceback (most recent call last):
File "bin4/b21_xxxxxxx", line 15, in <module>
print(pm.name)
grep "name" $INFIL | head -7
( ... )
<name>Ablitas (militar) (Emerg)</name>
<name>Ademuz (forestal)</name>
<name>Ager (PL%)</name>
<name>Alcala del Rio (ILIPA MAGNA)(Esc.)</name>
<name>Ainzón</name>
You need to open the file in a way that instructs Python to interpret the bytes as UTF-8 characters. In Python 2.7 you do it with the codecs module.
import codecs
with codecs.open(KMZFIL, encoding='utf-8') as f:
In Python 3 the encoding option has been added to the standard open so there's no need to use codecs.
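As a minimal round-trip sketch of codecs.open (the file path and content below are made up for illustration):

```python
import codecs
import os
import tempfile

# write an accented placemark name to a temporary file as UTF-8
path = os.path.join(tempfile.mkdtemp(), 'placemarks.kml')
with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(u'<name>Ainz\u00f3n</name>')

# reading it back through codecs.open decodes the bytes as UTF-8
with codecs.open(path, encoding='utf-8') as f:
    print(f.read())  # <name>Ainzón</name>
```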
I didn't see this answer here: these are lxml StringElements, so I used .text to fix the error.
Change print(pm.name) to print(pm.name.text).
https://lxml.de/api/lxml.objectify.StringElement-class.html

Extracting the value of an a tag by XPath in Python

I have a simple python script like:
#!/usr/bin/python
import requests
from lxml import html
response = requests.get('http://site.ir/')
out=response.content
tree = html.fromstring(open(out).read())
print [e.text_content() for e in tree.xpath('//div[class="group"]/div[class="groupinfo"]/a/text()')]
I used XPath to get the value of the a tag, as you can see from the image below...
But the sample output is not what I expected.
UPDATE
I have also the following error:
Traceback (most recent call last):
File "p.py", line 7, in <module>
tree = html.fromstring(open(out).read())
IOError: [Errno 36] File name too long: '\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" ....
You need to put @ at the beginning of an attribute name to address an attribute in XPath:
//div[@class="group"]/div[@class="groupinfo"]/a/text()
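A quick self-contained check of the corrected predicate, using the stdlib parser and a made-up snippet (the same @class syntax works in lxml's xpath()):

```python
from xml.etree import ElementTree as ET

# made-up markup mirroring the structure the question targets
snippet = ('<body><div class="group"><div class="groupinfo">'
           '<a href="/x">Item one</a></div></div></body>')
root = ET.fromstring(snippet)

# @class selects the attribute; without @, "class" would mean a child element
links = root.findall('.//div[@class="group"]/div[@class="groupinfo"]/a')
print([a.text for a in links])  # ['Item one']
```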

Iteratively parse a large XML file without using the DOM approach

I have an xml file
<temp>
<email id="1" Body="abc"/>
<email id="2" Body="fre"/>
.
.
<email id="998349883487454359203" Body="hi"/>
</temp>
I want to process the XML file one email tag at a time: read email id="1" and extract its Body, then read email id="2" and extract its Body, and so on.
I first tried the DOM approach to XML parsing, but since my file is 100 GB, that approach does not work. I then tried:
from xml.etree import ElementTree as ET
tree=ET.parse('myfile.xml')
root=ET.parse('myfile.xml').getroot()
for i in root.findall('email/'):
    print i.get('Body')
Once I get the root, I don't understand why my code fails to parse the file.
When I use iterparse instead, the code throws the following error:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 437: ordinal not in range(128)"
Can somebody help?
An example for iterparse:
import cStringIO
from xml.etree.ElementTree import iterparse
fakefile = cStringIO.StringIO("""<temp>
<email id="1" Body="abc"/>
<email id="2" Body="fre"/>
<email id="998349883487454359203" Body="hi"/>
</temp>
""")
for _, elem in iterparse(fakefile):
    if elem.tag == 'email':
        print elem.attrib['id'], elem.attrib['Body']
        elem.clear()
Just replace fakefile with your real file.
Also read this for further details.
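For Python 3 readers: cStringIO no longer exists there, but the same approach works with io.BytesIO:

```python
import io
from xml.etree.ElementTree import iterparse

# in-memory stand-in for the real 100 GB file
fakefile = io.BytesIO(b"""<temp>
<email id="1" Body="abc"/>
<email id="2" Body="fre"/>
</temp>
""")

for _, elem in iterparse(fakefile):
    if elem.tag == 'email':
        print(elem.attrib['id'], elem.attrib['Body'])
        elem.clear()  # free the element as we go; essential for huge files
```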
