UnicodeError when trying to send a file with a Greek filename - python

I have created a set() populated with Greek names, and I pass this set of values to a view function.
When I try to print this set, the Greek names appear as gibberish. I believe this has something to do with Apache mod_wsgi or Bottle not starting with UTF-8 support.
How can I tell Apache/Bottle to use LANG=el_GR.utf-8 so that Unicode is displayed properly, assuming that is the cause?
I looked for AddDefaultCharset utf-8 in httpd.conf, but it is already enabled, so why do the Greek characters appear as gibberish?
This happens when I try to download a file with a Greek filename.
Error: 500 Internal Server Error
Sorry, the requested URL 'http://superhost.gr/downloads/file' caused an error:
Internal Server Error
Exception:
UnicodeEncodeError('ascii', '/static/files/Î\x92ιογÏ\x81αÏ\x86ικÏ\x8c - Î\x9dίκοÏ\x82.docx', 14, 34, 'ordinal not in range(128)')
Traceback:
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/bottle.py", line 862, in _handle
return route.call(**args)
File "/usr/lib/python3.6/site-packages/bottle.py", line 1740, in wrapper
rv = callback(*a, **ka)
File "/usr/lib/python3.6/site-packages/bottle.py", line 2690, in wrapper
return func(*a, **ka)
File "/home/nikos/public_html/downloads.py", line 148, in file
return static_file(filename, root='/static/files', download=True)
File "/usr/lib/python3.6/site-packages/bottle.py", line 2471, in static_file
if not os.path.exists(filename) or not os.path.isfile(filename):
File "/usr/lib64/python3.6/genericpath.py", line 19, in exists
os.stat(path)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 14-33: ordinal not in range(128)
The code used to download the file is:
return static_file(filename, root='/static/files', download=True)
My system is set to UTF-8:
[root@superhost public_html]# echo $LANG
en_US.UTF-8
Perhaps it is something with Apache, or is it a problem with Python 3?

You can't use Bottle's static_file() with a Unicode filename and download=True. See the accepted answer to this question for two alternative ways around this limitation.
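The exception text in the question offers two hints that may be worth checking. The 'ascii' codec in the message suggests that the Python process serving the request uses ASCII for filesystem paths, and the garbled path looks like UTF-8 bytes that were read as Latin-1, which is how WSGI hands request text to the application. The following is a minimal diagnostic sketch, not a confirmed fix: the helper name recover_utf8 is hypothetical, the sample filename is reconstructed from the traceback, and whether it applies depends on how filename reaches the view.

```python
import sys


def recover_utf8(wsgi_text: str) -> str:
    """Undo a Latin-1 misreading of UTF-8 bytes (hypothetical helper).

    Raises UnicodeError if the text is not such a misreading.
    """
    return wsgi_text.encode("latin-1").decode("utf-8")


# Encoding Python uses for os.stat(), open() and similar calls in this process.
# Under mod_wsgi this can differ from what `echo $LANG` shows in a shell.
print("filesystem encoding:", sys.getfilesystemencoding())

# Simulate what the traceback shows: UTF-8 bytes decoded as Latin-1.
# The name below is reconstructed from the traceback and is an assumption.
original = "Βιογραφικό - Νίκος.docx"
garbled = original.encode("utf-8").decode("latin-1")
print(recover_utf8(garbled) == original)
```

If the first line prints an ASCII encoding (for example ANSI_X3.4-1968) when run inside the web application, the locale of the Apache process, rather than that of the login shell, is a likely place to look.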

Related

indian rupee symbol UnicodeEncodeError while uploading file to s3 using pandas

I have scraped some data from a website for my assignment. It contains the Indian rupee character, "₹". When I save the data to a CSV file in UTF-8 on my local machine using pandas, it saves without problems. When I changed the delimiters and tried to save the same file to S3 using pandas, it raised a UnicodeEncodeError. I'm scraping the web page using the Scrapy framework.
Earlier I was trying to save the file in Latin-1 ("ISO-8859-1") and then changed to "utf-8", but the same error occurs. I'm using Python 3.7 for development.
The code below, which saves to the local machine, works:
result_df.to_csv(filename+str2+'.csv',index=False)
Below code is used to save the file to S3:
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',encoding = 'utf-8',line_terminator='^',sep='~',index=False)
Below is the error while saving the file to S3:
2019-10-29 19:24:27 [scrapy.utils.signal] ERROR: Error caught on signal handler: <function Spider.close at 0x0000019CD3B1AA60>
Traceback (most recent call last):
File "c:\programdata\anaconda3\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
result = f(*args, **kw)
File "c:\programdata\anaconda3\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 94, in close
return closed(reason)
File "C:\local_path\spiders\Pduct_Scrape.py", line 430, in closed
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',encoding = 'utf-8',line_terminator='^',sep='~',index=False)
File "c:\programdata\anaconda3\lib\site-packages\pandas\core\generic.py", line 3020, in to_csv
formatter.save()
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 172, in save
self._save()
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 288, in _save
self._save_chunk(start_i, end_i)
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 315, in _save_chunk
self.cols, self.writer)
File "pandas/_libs/writers.pyx", line 75, in pandas._libs.writers.write_csv_rows
File "c:\programdata\anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20b9' in position 2661: character maps to <undefined>
I am very new to Stack Overflow, so please let me know if more information is needed.
The error is evidence that the code tries to encode the filename_str2.csv file in cp1252. From your stack trace:
...File "C:\local_path\spiders\Pduct_Scrape.py", line 430, in closed
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',......
File "c:\programdata\anaconda3\lib\encodings\cp1252.py", line 19, in encode
I do not know the reason, because you explicitly ask for UTF-8 encoding. The codecs page in the Python Standard Library reference gives utf_8 (with an underscore instead of a hyphen) as the canonical name for UTF-8 and does not list utf-8 among the aliases, so I would first try utf_8. If it still uses cp1252, then you will have to give the exact versions of Python and pandas that you are using.
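As a quick check on the spelling question, Python's codec lookup can be queried directly. The sketch below uses only the standard library and is independent of pandas and S3. It shows that both spellings resolve to the same codec on a standard CPython install, so changing the spelling alone is unlikely to change the outcome, and that cp1252 has no mapping for the rupee sign, which matches the last line of the traceback. It does not explain why cp1252 was selected in the first place.

```python
import codecs

# Both spellings resolve to the same codec; lookup normalizes the name.
same_codec = codecs.lookup("utf-8").name == codecs.lookup("utf_8").name
print("utf-8 and utf_8 are the same codec:", same_codec)

rupee = "\u20b9"  # the character named in the traceback

# UTF-8 can represent it (three bytes).
utf8_bytes = rupee.encode("utf-8")
print(utf8_bytes)

# cp1252 cannot, which reproduces the error class from the traceback.
try:
    rupee.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError as exc:
    cp1252_ok = False
    print("cp1252 failed:", exc.reason)
```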

Rasa App breaks in Pycharm but works fine in terminal

Whenever I try to run my Rasa app using the run button in PyCharm, or try to use the debugger, I get the following error:
Traceback (most recent call last):
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/pykwalify/core.py", line 76, in __init__
self.source = yaml.load(stream)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/main.py", line 933, in load
loader = Loader(stream, version, preserve_quotes=preserve_quotes)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/loader.py", line 50, in __init__
Reader.__init__(self, stream, loader=self)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/reader.py", line 85, in __init__
self.stream = stream # type: Any # as .read is called
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/reader.py", line 130, in stream
self.determine_encoding()
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/reader.py", line 190, in determine_encoding
self.update_raw()
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/reader.py", line 297, in update_raw
data = self.stream.read(size)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 473: ordinal not in range(128)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/matthewspeck/project/trainer_app/app.py", line 25, in <module>
parser=False, core=True)
File "/Users/matthewspeck/project/trainer_app/rasa_model.py", line 165, in make_rasa_model
rasa_config=rasa_config
File "/Users/matthewspeck/project/trainer_app/rasa_model.py", line 66, in __init__
self._parser = create_agent(use_rasa_nlu=True, load_models=True)
File "/Users/matthewspeck/project/trainer_app/rasa.py", line 32, in create_agent
domain = create_domain()
File "/Users/matthewspeck/project/trainer_app/rasa.py", line 83, in create_domain
domain = ClarifyDomain.load(domain_path)
File "/Users/project/clarification/domain.py", line 39, in load
domain = TemplateDomain.load(filename)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/rasa_core/domain.py", line 404, in load
cls.validate_domain_yaml(filename)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/rasa_core/domain.py", line 438, in validate_domain_yaml
schema_files=[schema_file])
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/pykwalify/core.py", line 78, in __init__
raise CoreError(u"Unable to load any data from source yaml file")
pykwalify.errors.CoreError: <CoreError: error code 3: Unable to load any data from source yaml file: Path: '/'>
Process finished with exit code 1
However, when I run the app from my terminal, or from my text editor (I use VSCode), it runs with no problems whatsoever. I've looked online, and every answer I see has something to do with Rasa, but nothing mentions problems with PyCharm.
I've also checked that the yaml for the domain is properly formatted, and it is. Anyone have any idea why I would be getting this error in PyCharm, but not in any other environment, and how I could fix it?
I believe your problem was fixed in Rasa Core version 0.12 (changelog: https://github.com/RasaHQ/rasa_core/blob/master/CHANGELOG.rst#0120---2018-11-11).
I recommend upgrading to a newer version of Rasa Core which parses the training data correctly.
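The first traceback ends in encodings/ascii.py, which suggests the YAML file was opened in text mode with ASCII as the default encoding. One possible explanation for the PyCharm-only failure, not confirmed by the answer above, is that the run configuration starts Python without the locale variables that the terminal sets. The sketch below uses only the standard library; the file name domain.yml and its content are made up for illustration. It prints the default that open() would use and shows that passing the encoding explicitly removes the dependence on the environment.

```python
import locale
import os
import tempfile

# Default encoding used by open() in text mode when none is passed.
# Comparing this value between PyCharm and the terminal is a cheap first check.
print("preferred encoding:", locale.getpreferredencoding(False))

# Hypothetical stand-in for a domain file containing a curly apostrophe,
# whose UTF-8 encoding starts with byte 0xe2 (the byte in the traceback).
content = "utter_greet:\n  - text: \"What\u2019s up?\"\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "domain.yml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(content)

    # Reading as ASCII fails with the same error class as in the traceback.
    try:
        with open(path, encoding="ascii") as fh:
            fh.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # An explicit encoding does not depend on LANG / LC_ALL.
    with open(path, encoding="utf-8") as fh:
        roundtrip = fh.read()

print(ascii_failed, roundtrip == content)
```

If the printed encoding differs between the two environments, setting LANG or LC_ALL (for example to en_US.UTF-8) in the environment variables of the PyCharm run configuration is one thing to try.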

Flask raises UnicodeEncodeError (latin-1) when send attachment with UTF-8 characters

I'm creating a file server with Flask. While testing the download feature, I found that it raises a UnicodeEncodeError if I try to download files whose names contain non-ASCII (UTF-8) characters.
Create a file at upload/1512026299/%E6%97%A0%E6%A0%87%E9%A2%98.png, then run the code below:
@app.route('/getfile/<timestamp>/<filename>')
def download(timestamp, filename):
    dirpath = os.path.join(os.path.join(os.path.abspath(os.path.dirname(__file__)), 'upload'), timestamp)
    return send_from_directory(dirpath, filename, as_attachment=True)
You will get an exception, which should be like this:
127.0.0.1 - - [30/Nov/2017 21:39:05] "GET /getfile/1512026299/%E6%97%A0%E6%A0%87%E9%A2%98.png HTTP/1.1" 200 -
Error on request:
Traceback (most recent call last):
File "C:\Program Files\Python36\lib\site-packages\werkzeug\serving.py", line 209, in run_wsgi
execute(self.server.app)
File "C:\Program Files\Python36\lib\site-packages\werkzeug\serving.py", line 200, in execute
write(data)
File "C:\Program Files\Python36\lib\site-packages\werkzeug\serving.py", line 168, in write
self.send_header(key, value)
File "C:\Program Files\Python36\lib\http\server.py", line 508, in send_header
("%s: %s\r\n" % (keyword, value)).encode('latin-1', 'strict'))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 43-45: ordinal not in range(256)
The problem is that with as_attachment=True the filename is sent in the headers. Unfortunately, it seems that Flask does not yet support RFC 5987, which specifies how to encode attachment filenames in an encoding other than Latin-1.
The easiest solution in this case would be to drop as_attachment=True; the response is then sent without a Content-Disposition header, which avoids the problem.
If you really have to send the Content-Disposition header, you could try the code posted in the related issue:
response = make_response(send_file(out_file))
basename = os.path.basename(out_file)
response.headers["Content-Disposition"] = \
    "attachment;" \
    "filename*=UTF-8''{utf_filename}".format(
        utf_filename=quote(basename.encode('utf-8'))
    )
return response
This should be fixed in the next release (>0.12)
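For reference, the header value used in that snippet can be built with the standard library alone. This is a minimal sketch of the RFC 5987 filename* form. The function name content_disposition and the ASCII fallback strategy are illustrative assumptions, not part of Flask's API.

```python
from urllib.parse import quote


def content_disposition(filename: str) -> str:
    """Build an attachment header value that stays within Latin-1.

    `filename*` carries the percent-encoded UTF-8 name (RFC 5987);
    `filename` is a plain ASCII fallback for clients that ignore it.
    """
    # Replace anything outside ASCII so the fallback is always encodable.
    fallback = filename.encode("ascii", "replace").decode("ascii")
    # Avoid characters that would break the quoted-string form.
    fallback = fallback.replace("\\", "_").replace('"', "_")
    # quote() encodes str input as UTF-8 by default.
    encoded = quote(filename, safe="")
    return "attachment; filename=\"{}\"; filename*=UTF-8''{}".format(fallback, encoded)


header = content_disposition("无标题.png")
print(header)
# The whole value is ASCII, so encoding it as Latin-1 (the step that failed
# in http/server.py in the traceback) succeeds.
header.encode("latin-1")
```

The value could then be assigned to response.headers["Content-Disposition"] in the same way as in the snippet above.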

tf.gfile.Glob gives me a UnicodeDecodeError. Is there any way to fix this?

I was trying to get the list of names of the txt files (written in Korean) in the specified directory with the code below:
dir_list = tf.gfile.Glob(engine.TXT_DIR+"/*.txt")
However, this gives me the following error:
Traceback (most recent call last):
File "D:/Prj_mayDay/Prj_FrankenShtine/shakespear_reborn/main.py", line 108, in <module>
dir_list = tf.gfile.Glob(engine.TXT_DIR+"/*.txt")
File "D:\KimKanna's Class\python35\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 326, in get_matching_files
compat.as_bytes(filename), status)
File "D:\KimKanna's Class\python35\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 325, in <listcomp>
for matching_filename in pywrap_tensorflow.GetMatchingFiles(
File "D:\KimKanna's Class\python35\lib\site-packages\tensorflow\python\util\compat.py", line 106, in as_str_any
return as_str(value)
File "D:\KimKanna's Class\python35\lib\site-packages\tensorflow\python\util\compat.py", line 84, in as_text
return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 19: invalid start byte
After some research, I found the reason:
The error occurs because there is some non-ASCII character in the dictionary and it can't be encoded/decoded
However, I do not see any way to apply the solution to my code. Or is there one?
If there is alternative code for this, it should work for both a cloud storage bucket and my local hard drive, as the code above does.
I'm using Python 3 and TensorFlow 1.2.0-rc2.
After a few hours of fiddling with my code, I finally found the solution.
It turned out that one of the files inside the directory I specified had a name in Korean. After I took it out of the directory, the problem was gone.
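To locate such files without inspecting a directory by hand, a small standard-library check can list the names that contain non-ASCII characters. This is only a sketch for a local directory: the function name non_ascii_names is made up, and it neither covers cloud storage buckets nor changes how tf.gfile.Glob behaves.

```python
import os
from typing import Iterable, List


def non_ascii_names(names: Iterable[str]) -> List[str]:
    """Return the names containing at least one non-ASCII character."""
    return [name for name in names if any(ord(ch) > 127 for ch in name)]


# Usage on a local directory (the current directory is a placeholder).
suspects = non_ascii_names(os.listdir("."))
# ascii() keeps the output printable even on an ASCII-only console.
print(ascii(suspects))
```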

Umlauts in JSON files lead to errors in Python code created by ANTLR4

I've created python modules from the JSON grammar on github / antlr4 with
antlr4 -Dlanguage=Python3 JSON.g4
I've written a main program "JSON2.py" following this guide: https://github.com/antlr/antlr4/blob/master/doc/python-target.md
and downloaded the example1.json also from github.
python3 ./JSON2.py example1.json # works perfectly, but
python3 ./JSON2.py bookmarks-2017-05-24.json # the bookmarks contain German Umlauts like "ü"
...
File "/home/xyz/lib/python3.5/site-packages/antlr4/FileStream.py", line 27, in readDataFrom
return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 227: ordinal not in range(128)
The offending line in JSON2.py is
input = FileStream(argv[1])
I've searched stackoverflow and tried this instead of using the above FileStream:
fp = codecs.open(argv[1], 'rb', 'utf-8')
try:
    input = fp.read()
finally:
    fp.close()
lexer = JSONLexer(input)
stream = CommonTokenStream(lexer)
parser = JSONParser(stream)
tree = parser.json() # This is line 39, mentioned in the error message
Execution of this program ends with an error message, even if the input file doesn't contain Umlauts:
python3 ./JSON2.py example1.json
Traceback (most recent call last):
File "./JSON2.py", line 46, in <module>
main(sys.argv)
File "./JSON2.py", line 39, in main
tree = parser.json()
File "/home/x/Entwicklung/antlr/links/JSONParser.py", line 108, in json
self.enterRule(localctx, 0, self.RULE_json)
File "/home/xyz/lib/python3.5/site-packages/antlr4/Parser.py", line 358, in enterRule
self._ctx.start = self._input.LT(1)
File "/home/xyz/lib/python3.5/site-packages/antlr4/CommonTokenStream.py", line 61, in LT
self.lazyInit()
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 186, in lazyInit
self.setup()
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 189, in setup
self.sync(0)
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 111, in sync
fetched = self.fetch(n)
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 123, in fetch
t = self.tokenSource.nextToken()
File "/home/xyz/lib/python3.5/site-packages/antlr4/Lexer.py", line 111, in nextToken
tokenStartMarker = self._input.mark()
AttributeError: 'str' object has no attribute 'mark'
This parses correctly:
javac *.java
grun JSON json -gui bookmarks-2017-05-24.json
So the grammar itself is not the problem.
So finally the question: How should I process the input file in python, so that lexer and parser can digest it?
Thanks in advance.
Make sure your input file is actually encoded as UTF-8. Many problems with character recognition by the lexer are caused by using other encodings. I just took a testbed application, added ë to the list of available characters for an IDENTIFIER, and it works. UTF-8 is the key, and make sure your grammar also allows these characters where you want to accept them.
I solved it by passing the encoding:
input = FileStream(sys.argv[1], encoding = 'utf8')
Without the encoding argument, I get the same error as you:
Traceback (most recent call last):
File "test.py", line 20, in <module>
main()
File "test.py", line 9, in main
input = FileStream(sys.argv[1])
File ".../lib/python3.5/site-packages/antlr4/FileStream.py", line 20, in __init__
super().__init__(self.readDataFrom(fileName, encoding, errors))
File ".../lib/python3.5/site-packages/antlr4/FileStream.py", line 27, in readDataFrom
return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 1: ordinal not in range(128)
Where my input data is
[今明]天(台南|高雄)的?天氣如何
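The second traceback in the question ('str' object has no attribute 'mark') suggests that the lexer was handed a plain string rather than a stream object. The ANTLR Python runtime provides InputStream for wrapping an already decoded string, and FileStream accepts an encoding argument, as the answer above shows. The decoding step itself can be reproduced with the standard library, as in the sketch below. The helper read_text and the sample file are hypothetical; the helper mirrors the codecs.decode call visible in the traceback.

```python
import codecs
import os
import tempfile


def read_text(path: str, encoding: str = "utf-8", errors: str = "strict") -> str:
    """Read a file as bytes and decode it with an explicit encoding."""
    with open(path, "rb") as fh:
        return codecs.decode(fh.read(), encoding, errors)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bookmarks.json")  # made-up sample file
    with open(path, "wb") as fh:
        # "ü" is encoded as bytes 0xc3 0xbc in UTF-8.
        fh.write('{"title": "M\u00fcnchen"}'.encode("utf-8"))

    text = read_text(path)  # decodes as UTF-8

    try:
        read_text(path, encoding="ascii")  # the encoding that failed in the question
        ascii_error = None
    except UnicodeDecodeError as exc:
        ascii_error = exc

# ascii() keeps the output printable even on an ASCII-only console.
print(ascii(text))
print(ascii_error is not None)
```

With the ANTLR runtime, the decoded string would then be wrapped before use, for example JSONLexer(InputStream(text)), rather than passed to the lexer directly.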
