Python ─ UTF-8 filename from HTML form via CherryPy - python

Python Header: #!/usr/bin/env python
# -*- coding: utf-8 -*-
# image_upload.py
Cherrypy Config: cherrypy.config.update(
{'tools.encode.on': True,
'tools.encode.encoding': 'utf-8',
'tools.decode.on': True,
},)
HTML Header: <head><meta http-equiv="Content-Type"
content="text/html;charset=ISO-8859-1"></head>
""" Python 2.7.3
Cherrypy 3.2.2
Ubuntu 12.04
"""
With an HTML form, I'm uploading an image file to a database. That works so far without problems. However, if the filename is not 100% ASCII, there seems to be no way to retrieve it as UTF-8. This is weird, because with the HTML text input fields it works without problems, from saving through to display. Therefore I assume it's an encoding or decoding problem with the web application framework CherryPy, because the upload is handled by it, like here.
How it works:
The HTML form POSTs the uploaded file to another Python function, which receives the file in the keyword-argument dictionary **kwargs. From there you get the filename with extension, like this: filename = kwargs['file'].filename. But that's already in the wrong encoding. Up to this point the image hasn't been processed, stored or used in any way.
I'm asking for a solution that avoids having to parse the filename and change it back "manually". I guess the string is already mangled by the time it reaches my code, which makes it cumbersome to get right. That's why getting CherryPy to do it might be the best way. But maybe it's even an HTML issue, because the file comes from a form.
Here are the wrongly decoded umlauts.
What I need is the input back as the result.
input → result   input → result
ä → Ã¤   Ä → Ã„
ö → Ã¶   Ö → Ã–
ü → Ã¼   Ü → Ãœ
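For reference, the broken forms above are exactly what you get when each umlaut's UTF-8 bytes are mis-read in the browser's "latin1" (which in practice is windows-1252); a small sketch in Python 3 syntax:

```python
# Reproduce the mojibake table: encode each umlaut as UTF-8,
# then mis-decode the resulting bytes as windows-1252 (cp1252).
for ch in 'äÄöÖüÜ':
    print(ch, '→', ch.encode('utf-8').decode('cp1252'))
# e.g. ü → Ã¼
```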
Following are the failed attempts to get the right result, which would be: "Würfel"
NOTE: img_file = kwargs['file']
original attempt:
result = img_file.filename.rsplit('.',1)[0]
result: "WÃ¼rfel"
change system encoding:
reload(sys)
sys.setdefaultencoding('utf-8')
result: "WÃ¼rfel"
encoding attempt 1:
result = img_file.filename.rsplit('.',1)[0].encode('utf-8')
result: "WÃ¼rfel"
encoding attempt 2:
result = unicode(img_file.filename.rsplit('.',1)[0], 'utf-8')
Error Message:
TypeError: decoding Unicode is not supported
decoding attempt:
result = img_file.filename.rsplit('.',1)[0].decode('utf-8')
Error Message:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)
cast attempt:
result = str(img_file.filename.rsplit('.',1)[0])
Error Message:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)

Trying with your string, it seems I can get the filename by using the latin1 encoding.
>>> s = u'W\xc3\xbcrfel.jpg'
>>> print s.encode('latin1')
Würfel.jpg
>>>
You simply need to apply that .encode('latin1') before splitting.
But the problem here is broader: you really need to figure out why your web encoding is latin1 instead of utf8. I don't know CherryPy, but try to make sure you use utf8 everywhere, or you could run into other glitches when serving your application through a web server like Apache or nginx.
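To get a proper unicode object back (rather than just printable bytes), the same latin1 trick can be taken one step further; a minimal sketch, shown in Python 3 syntax:

```python
# The filename as CherryPy handed it over: UTF-8 bytes mis-decoded as Latin-1.
s = u'W\xc3\xbcrfel.jpg'                    # displays as 'WÃ¼rfel.jpg'
fixed = s.encode('latin1').decode('utf-8')  # undo the wrong decode step
print(fixed)  # Würfel.jpg
```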

The problem is that you serve your HTML with charset ISO-8859-1; this confuses browsers, and they use that same charset when sending data back to the server. Always serve your HTML with UTF-8, write your code in UTF-8, and set your terminal to UTF-8, and you shouldn't have problems.

Related

Python opening files with utf-8 file names

In my code I used something like file = open(path +'/'+filename, 'wb') to write the file
but in my attempt to support non-ascii filenames, I encode it as such
naming = path + '/' + filename
file = open(naming.encode('utf-8', 'surrogateescape'), 'wb')
# ... write binary data ...
so the file is named something like directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt
and it works, but the issue arises when I try to get that file again by crawling into the same directory using:
for file in path.iterdir():
    data = open(file.as_posix(), 'rb')
    ...
I keep getting this error 'ascii' codec can't encode characters in position..
I tried converting the string to bytes like data = open(bytes(file.as_posix(), encoding='utf-8'), 'rb') but I get 'utf-8' codec can't encode characters in position...'
I also tried file.as_posix().encode('utf-8', 'surrogateescape'), I found that both encode and print just fine but with open() I still get the error 'utf-8' codec can't encode characters in position...'
How can I open a file with a utf-8 filename?
I'm using Python 3.9 on ubuntu linux
Any help is greatly appreciated.
EDIT
I figured out why the issue happens when crawling to the directory after writing.
So, when I write the file and give it the raw string directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt and encode the string to utf, it writes fine.
But when finding the file again by crawling into the directory the str(filepath) or filepath.as_posix() returns the string as directory/path/????????.txt so it gives me an error when I try to encode it to any codec.
Currently I'm investigating if the issue's related to my linux locale, it was set to POSIX, I changed it to C.UTF-8 but still no luck atm.
More context: this is a file system where the file is uploaded through a site, so I receive the filename string in utf-8 format
I don't understand why you feel you need to recode filepaths.
Linux (Unix) filenames are just sequences of bytes (with a couple of prohibited byte values). There's no need to break astral characters into surrogate pairs; the UTF-8 sequence for an astral character is perfectly acceptable in a filename. But creating surrogate pairs is likely to get you into trouble, because there is no UTF-8 encoding for a surrogate. So if you actually manage to create something that looks like the UTF-8 encoding of a surrogate codepoint, you're likely to hit a decoding error when you try to turn it back into a Unicode codepoint.
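The byte-oriented view described above is exactly what os.fsencode / os.fsdecode expose; a small sketch (assuming a UTF-8 locale, as on most modern Linux systems):

```python
import os
import sys

# On POSIX, Python maps filename str <-> bytes via the filesystem encoding,
# using the surrogateescape handler so arbitrary bytes survive the round trip.
name = 'mañana.txt'
raw = os.fsencode(name)   # str -> bytes, e.g. b'ma\xc3\xb1ana.txt' under UTF-8
back = os.fsdecode(raw)   # bytes -> str, lossless round trip
print(sys.getfilesystemencoding(), raw, back == name)
```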
Anyway, there's no need to go to all that trouble. Before running this session, I created a directory called 'ñ' with two empty files, 𝔐 and mañana. The first one is an astral character, U+1D510. As you can see, everything works fine, with no need for manual decoding.
>>> [*Path('ñ').iterdir()]
[PosixPath('ñ/𝔐'), PosixPath('ñ/mañana')]
>>> Path('ñ2').mkdir()
>>> for path in Path('ñ').iterdir():
... open(Path('ñ2', path.name), 'w').close()
...
>>> [*Path('ñ2').iterdir()]
[PosixPath('ñ2/𝔐'), PosixPath('ñ2/mañana')]
>>> [open(path).read() for path in Path('ñ2').iterdir()]
['', '']
Note:
In a comment, OP says that they had previously tried:
file = open('/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')
and received the error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-11: ordinal not in range(128)
Without more details, it's hard to know how to respond to that. It's possible that open will raise that error for a filesystem which doesn't allow non-ascii characters, but that wouldn't be normal on Linux.
However, it's worth noting that the string literal
'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png'
is not the string you think it is. \x escapes in a Python string are Unicode codepoints (with a maximum value of 255), not individual UTF-8 byte values. The Python string literal, "\xd8\xb9" contains two characters, "O with stroke" (Ø) and "superscript 1" (¹); in other words, it is exactly the same as the string literal "\u00d8\u00b9".
To get the Arabic letter ain (ع), either just type it (if you have an Arabic keyboard setting and your source file encoding is UTF-8, which is the default), or use a Unicode escape for its codepoint U+0639: "\u0639".
If for some reason you insist on using explicit UTF-8 byte encoding, you can use a byte literal as the argument to open:
file = open(b'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')
But that's not recommended.
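The difference between those escapes can be checked directly; a small sketch (Python 3):

```python
s = '/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png'
# \x escapes in a str are Unicode codepoints: '\xd8' is 'Ø', not a raw UTF-8 byte.
print(s[8], s[9])  # Ø ¹
# Re-encoding those codepoints as Latin-1 recovers the intended UTF-8 bytes,
# which then decode to the Arabic text the escapes were meant to spell:
print(s.encode('latin1').decode('utf-8'))  # /upload/عربي.png
```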
So after being down a rabbit hole for the past few days, I figured out the issue isn't with Python itself but with the locale my web framework was using. Debugging this, I saw that
import sys
print(sys.getfilesystemencoding())
returned 'ASCII', which was weird considering I had set the Linux locale to C.UTF-8. I then discovered that, since I was running WSGI on Apache2, I had to add the locale to my WSGI daemon as WSGIDaemonProcess my_app locale='C.UTF-8' in the Apache configuration file, thanks to this post.

UnicodeDammit: detwingle crashes on a website

I'm scraping websites and use BeautifulSoup4 to parse them. As the websites can have really random charsets, I use UnicodeDammit.detwingle to ensure that I feed proper data to BeautifulSoup. It worked fine... until it crashed. One website causes the code to break. The code that builds the "soup" looks like this:
u = bs.UnicodeDammit.detwingle(html_blob)  # <-- here it crashes
u = bs.UnicodeDammit(u.decode('utf-8'),
                     smart_quotes_to='html',
                     is_html=True)
u = u.unicode_markup
soup = bs.BeautifulSoup(u)
And the error (standard Python-Unicode hell duo)
File ".../something.py", line 92, in load_bs_from_html_blob
u = bs.UnicodeDammit.detwingle( html_blob )
File ".../beautifulsoup4-4.1.3-py2.7.egg/bs4/dammit.py", line 802, in detwingle
return b''.join(byte_chunks)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0:
ordinal not in range(128)
The offending website is this one
Question: How to make a proper and bulletproof website source decoding?
This website is not a special case in terms of character encoding at all: it's perfectly valid UTF-8, with even the HTTP header set correctly. It then follows that your code would have crashed on any UTF-8 website containing code points beyond ASCII.
It is also evident from the documentation that UnicodeDammit.detwingle takes a bytestring, and the crash inside b''.join suggests that html_blob, despite its name, has already been decoded to a unicode string somewhere upstream.
Handling arbitrary website encodings is not trivial when the HTTP header or the markup lies about the encoding, or omits it entirely. You need to apply various heuristics, and even then you won't always get it right. But this website sends the charset header correctly and is encoded correctly in that charset.
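A minimal, dependency-free version of such a heuristic: trust the declared charset first, then try UTF-8, then fall back to Latin-1, which never fails (though it may mis-map). The function name and the fallback order here are illustrative assumptions, not what UnicodeDammit does internally.

```python
def decode_html(raw, declared=None):
    """Decode raw HTML bytes, preferring the declared charset, then UTF-8."""
    for enc in [declared, 'utf-8']:
        if not enc:
            continue
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            pass
    # Latin-1 maps every byte to a codepoint, so this last resort cannot fail.
    return raw.decode('latin-1')

print(decode_html('définition'.encode('utf-8')))  # définition
```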
Interesting trivia: the only beyond-ASCII text on the website is these JavaScript comments (after being decoded as UTF-8):
image = new Array(4); //¶¨ÒåimageΪͼƬÊýÁ¿µÄÊý×é
image[0] = 'sample_BG_image01.png' //±³¾°Í¼ÏóµÄ·¾¶
If you then encode those to ISO-8859-1, and decode the result as GB2312, you get:
image = new Array(4); //定义image为图片数量的数组
image[0] = 'sample_BG_image01.png' //背景图象的路径
Which Google Translate (Chinese → English) renders as:
image = new Array(4); //Defined image of the array of the number of images
image[0] = 'sample_BG_image01.png' //The path of the background image

How to deal with <FE><FF> in what should be valid utf-8? What am I doing wrong?

I am using pdftotext with options "-enc utf-8 -htmlmeta -raw" and passing that into a python script, which is parsing the output. (Please read on even if you're unfamiliar with pdftotext, since that may not be relevant.)
For some of the pdf's that we are processing, pdftotext is outputting metadata that looks like this:
<meta name="CreationDate" content="<FE><FF>">
In python, I am doing this (basically):
attrib[name] = content.decode('utf-8')
where content is that <FE><FF> string in the above piece of metadata. Python raises an exception:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfe in position 0: unexpected code byte
At this point, I am unsure if the problem is the PDF itself, or the output from pdftotext, or Python's way of interpreting utf-8.
I have googled and not found anything conclusive.
Essentially, I would expect pdftotext -enc utf-8 to only output valid utf-8. And I would expect Python to understand how to deal with that utf-8 when decoding. Is there some part of this that I am missing?
I would appreciate any help in understanding why this is occurring, and help with a solution.
Thanks!
Two things:
First, instead of using content.decode('utf-8'), use:
content.decode('utf-8-sig')
This will automatically remove the BOM (if one is present).
Second, it looks like pdftotext is outputting a UTF-16 BOM, not a UTF-8 one. The UTF-8 BOM is '\xEF\xBB\xBF'. You'll need to figure out why you're getting UTF-16, or change your script to decode from UTF-16.
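The two BOMs can be told apart with the constants the stdlib codecs module provides; a small sketch of handling both cases (the content value mirrors the metadata from the question):

```python
import codecs

# The stdlib exposes the BOM byte sequences as constants.
print(codecs.BOM_UTF8)      # b'\xef\xbb\xbf'
print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'

content = b'\xfe\xff'  # what pdftotext emitted: a UTF-16 BE BOM and nothing else
if content.startswith(codecs.BOM_UTF16_BE):
    text = content.decode('utf-16')     # the utf-16 codec consumes the BOM itself
else:
    text = content.decode('utf-8-sig')  # strips a UTF-8 BOM if one is present
print(repr(text))  # '' — the metadata value was just a BOM, no actual date
```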

Python Decoding Unicode files with 'ÆØÅ'

I read in some data from a Danish text file, but I can't seem to find a way to decode it.
The original text is "dør" but in the raw text file its stored as "d√∏r"
So I tried the obvious:
InputData = "d√∏r"
print InputData.decode('iso-8859-1')
sadly resulting in the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-6: ordinal not in range(128)
UTF-8 gives the same error.
(using Python 2.6.5)
How can I decode this text so the printed message is "dør"?
C3 B8 is the UTF-8 encoding for "ø". You need to read the file in UTF-8 encoding:
import codecs
codecs.open(myfile, encoding='utf-8')
The reason that you're getting a UnicodeEncodeError is that you're trying to output the text and Python doesn't know what encoding your terminal is in, so it defaults to ascii. To fix this issue, use sys.stdout = codecs.getwriter('utf8')(sys.stdout) or use the environment variable PYTHONIOENCODING="utf-8".
Note that this will give you the text as unicode objects; if everything else in your program is str then you're going to run into compatibility issues. Either convert everything to unicode or (probably easier) re-encode the file into Latin-1 using ustr.encode('iso-8859-1'), but be aware that this will break if anything is outside the Latin-1 codepage. It might be easier to convert your program to use str in utf-8 encoding internally.
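Incidentally, "d√∏r" is what you see when the UTF-8 bytes of "dør" are mis-decoded as Mac Roman (√ sits at byte 0xC3 and ∏ at 0xB8 in that charset). If you are stuck with the already-mangled string rather than the file, the mis-decode can be reversed; a sketch in Python 3 syntax:

```python
s = 'd\u221a\u220fr'                           # 'd√∏r' as read from the file
fixed = s.encode('mac_roman').decode('utf-8')  # undo the Mac Roman mis-decode
print(fixed)  # dør
```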

UnicodeEncodeError when fetching url

I have this issue trying to get all the text nodes in an HTML document using lxml, but I get a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128). However, when I try to find out the encoding of this page (encoding = chardet.detect(response)['encoding']), it says it's utf-8. It seems weird that a single page would be both utf-8 and ascii. Actually, this:
fromstring(response).text_content().encode('ascii', 'replace')
solves the problem.
Here it's my code:
from lxml.html import fromstring
import urllib2
import chardet
request = urllib2.Request(my_url)
request.add_header('User-Agent',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)')
request.add_header("Accept-Language", "en-us")
response = urllib2.urlopen(request).read()
encoding = chardet.detect(response)['encoding']
print encoding
print fromstring(response).text_content()
Output:
utf-8
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128)
What can I do to solve this issue? Keep in mind that I want to do this for a few other pages too, so I don't want to encode on an individual basis.
UPDATE:
Maybe there is something else going on here. When I run this script in the terminal, I get correct output, but when I run it inside SublimeText, I get UnicodeEncodeError...
UPDATE2:
It's also happening when I create a file with this output. .encode('ascii', 'replace') works, but I'd like a more general solution.
Regards
Can you try wrapping your string with repr()?
This article might help.
print repr(fromstring(response).text_content())
As far as writing out to a file as said in your edit, I would recommend opening the file with the codecs module:
import codecs
output_file = codecs.open('filename.txt','w','utf8')
I don't know SublimeText, but it seems to be trying to read your output as ASCII, hence the encoding error.
Based on your first update I would say that the terminal told Python to output utf-8 and SublimeText made clear it expects ascii. So I think the solution will be in finding the right settings in SublimeText.
However, if you cannot change what SublimeText expects it is better to use the encode function like you already did in a separate function.
import sys

def smartprint(text):
    if sys.stdout.encoding is None:
        print text
    else:
        print text.encode(sys.stdout.encoding, 'replace')
You can use this function instead of print. Keep in mind that your program's output when run in SublimeText differs from the Terminal: because of the 'replace' error handler, characters outside the target encoding are substituted when this code is run in SublimeText, e.g. é will come out as ?.
