Handling French text in Python

I am trying to read some French text and do some frequency analysis of words. I want the characters with accents and other diacritics to stay intact. So, I did this for testing:
>>> import codecs
>>> f = codecs.open('file','r','utf-8')
>>> for line in f:
...     print line
...
Faites savoir à votre famille que vous êtes en sécurité.
So far, so good. But, I have a list of French files which I iterate over in the following way:
import codecs,sys,os
path = sys.argv[1]
for f in os.listdir(path):
french = codecs.open(os.path.join(path,f),'r','utf-8')
for line in french:
print line
Here, it gives the following error:
rdholaki74: python TestingCodecs.py ../frenchResources | more
Traceback (most recent call last):
File "TestingCodecs.py", line 7, in <module>
print line
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 14: ordinal not in range(128)
Why is it that the same file throws up an error when passed as an argument and not when given explicitly in the code?
Thanks.

Because you're misinterpreting the cause: the file isn't the problem, the output is. Piping the output means Python can't detect what encoding to use for stdout, so it falls back to ASCII. If stdout is not a TTY, you'll need to encode to UTF-8 manually before printing.
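A minimal Python 2 sketch of that idea, reusing the codecs.open pattern from the question (the filename 'file' is just a placeholder): check whether stdout is a terminal and encode each line yourself when it is not.
import codecs
import sys

french = codecs.open('file', 'r', 'utf-8')   # placeholder path
for line in french:
    if sys.stdout.isatty():
        print line                      # interactive terminal: Python picks the encoding
    else:
        print line.encode('utf-8')      # piped or redirected: encode explicitly
french.close()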

It is a print error due to redirection. You could use:
PYTHONIOENCODING=utf-8 python ... | ...
Specify another encoding if your terminal doesn't use UTF-8.
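As an in-script alternative to the environment variable (not part of the answer above, just a common Python 2 workaround), you can wrap sys.stdout in a UTF-8 writer so that printed unicode strings are encoded even when output is piped:
import codecs
import sys

# Only wrap when output is redirected; an interactive terminal already works.
if not sys.stdout.isatty():
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

print u'Faites savoir \xe0 votre famille que vous \xeates en s\xe9curit\xe9.'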

Related

How can I write to a file with the same formatting as print?

TL;DR
While trying to write a string to a file the following error occurred:
Code
logfile.write(cli_args.last_name)
Output
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-9: ordinal not in range(128)
But this works:
Code
print(cli_args.last_name)
Output
Pérez
Why?
FULL CONTEXT
I made a script which receives data from a Linux CLI, processes it, and finally creates a Zendesk ticket with the provided data. It is a kind of CLI API: in front of my script there is a larger system with a web interface, where users fill in form fields whose values are then substituted into the CLI call. For example:
myscript.py --first_name '_first_name_' --last_name '_last_name_'
The script was working with no issues until yesterday, when the website was updated. I think they changed something related to charsets or encoding.
I do some simple logging with f-strings, opening a file and writing informative messages so that if anything fails I can go back and check where it happened. The CLI arguments are read with the argparse module. Example:
logfile.write(f"\tChecking for opened tickets for user '{cli_args.first_name} {cli_args.last_name}'\n")
After the website update I am getting an error like this:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-9: ordinal not in range(128)
Doing some troubleshooting, I found it is because some users enter names with accent marks, like Carlos Pérez.
I need the script to work again and also to handle inputs like that, so I looked for answers. Checking the HTTP headers of the web console's input forms, I found it uses Content-Type: text/html; charset=UTF-8. My first try was to encode the string passed in the CLI argument to UTF-8 and decode it again with the same codec, but that didn't work.
On my second try, I checked the Python docs for str.encode() and bytes.decode(). So I tried this:
logfile.write(
    "\tChecking for opened tickets for user "
    f"'{cli_args.first_name.encode(encoding='utf-8', errors='ignore').decode('utf-8')} "
    f"{cli_args.last_name.encode(encoding='utf-8', errors='ignore').decode('utf-8')}'"
)
It worked, but it removed the accented letter, so Carlos Pérez became Carlos Prez, which is of no use to me in this case; I need the full input.
As a desperate move I tried printing the same f-string I was trying to write to the logfile, and to my surprise it worked: it printed Carlos Pérez to the console without any kind of encoding/decoding.
How does print work? Why doesn't writing to the file work? And, most importantly, how can I write to a file with the same formatting as print?
Edit 1 @MarkTolonen
Tried the following:
logfile = open("/usr/share/pandora_server/util/plugin/plugin_mcm/sandbox/755bug.txt", mode="a", encoding="utf8")
logfile.write(cli_args.body)
logfile.close()
Output:
Traceback (most recent call last):
  File "/usr/share/pandora_server/util/plugin/plugin_mcm/sandbox/ticket_query_app.py", line 414, in <module>
    main()
  File "/usr/share/pandora_server/util/plugin/plugin_mcm/sandbox/ticket_query_app.py", line 81, in main
    logfile.write(cli_args.body)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 8-9: surrogates not allowed
Edit 2
I managed to get the text that is causing the issue:
if __name__ == "__main__":
    string = (
        "Buenos d\udcc3\udcadas,\r\n\r\n"
        "Mediante monitoreo autom\udcc3\udca1tico se ha detectado un evento fuera de lo normal:\r\n\r\n"
        "Descripci\udcc3\udcb3n del evento: _snmp_f13_\r\n"
        "Causas sugeridas del evento: _snmp_f14_\r\n"
        "Posible afectaci\udcc3\udcb3n del evento: _snmp_f15_\r\n"
        "Validaciones de bajo impacto: _snmp_f16_\r\n"
        "Fecha y hora del evento: 2021-07-14 17:47:51\r\n\r\n"
        "Saludos."
    )

    # Output: Text with the unicodes translated
    print(string)

    # Output: "UnicodeEncodeError: 'utf-8' codec can't encode characters in position 8-9: surrogates not allowed"
    with open(file="test.log", mode="w", encoding="utf8") as logfile:
        logfile.write(string)
The answer is the encoding parameter to open. Observe:
[timrprobocom@jared-ingersoll ~]$ python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open('x.txt','a')
>>> g = open('y.txt','a',encoding='utf-8')
>>> s = "spades \u2660 spades"
>>> f.write(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u2660' in position 7: ordinal not in range(128)
>>> g.write(s)
15
>>>
[timrprobocom@jared-ingersoll ~]$ hexdump -C y.txt
00000000 73 70 61 64 65 73 20 e2 99 a0 20 73 70 61 64 65 |spades ... spade|
*
00000011
It looks like something upstream is misconfigured. Your string appears to have been produced by a decode operation with the wrong encoding, with errors='surrogateescape' error handling. From the data shown, it looks like the decoding operation tried to decode UTF-8-encoded text as ASCII.
errors='surrogateescape' is a way for an encoding to handle invalid bytes during a decode operation. The error handler replaces the invalid bytes with partial surrogates in the range U+DC80..U+DCFF when converting to a Unicode string, and the process can be reversed to get the original byte string back by performing an encode with errors='surrogateescape' and the same encoding.
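To make the mechanism concrete, here is a small round-trip illustration (Python 3; the name is an invented example, not taken from the question):
# 'é' encoded as UTF-8 is the two bytes 0xc3 0xa9.
raw = 'Pérez'.encode('utf-8')                        # b'P\xc3\xa9rez'

# Decoding UTF-8 bytes as ASCII with surrogateescape keeps the undecodable
# bytes as lone surrogates instead of raising UnicodeDecodeError.
broken = raw.decode('ascii', errors='surrogateescape')
print(ascii(broken))                                 # 'P\udcc3\udca9rez'

# The same error handler reverses the mapping, recovering the original bytes,
# which can then be decoded with the correct codec.
repaired = broken.encode('ascii', errors='surrogateescape').decode('utf-8')
print(repaired)                                      # Pérez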
The partial surrogates in your string match the pattern of what a decode(encoding='ascii', errors='surrogateescape') call would produce when given data actually encoded in UTF-8 - the surrogates are all in the range surrogateescape uses, and the bytes they correspond to form valid UTF-8. In the code below, I recover the original bytes, then decode them correctly as UTF-8. Once the Unicode string is valid, it can be written to the log file with encoding='utf8'.
string = (
    "Buenos d\udcc3\udcadas,\r\n\r\n"
    "Mediante monitoreo autom\udcc3\udca1tico se ha detectado un evento fuera de lo normal:\r\n\r\n"
    "Descripci\udcc3\udcb3n del evento: _snmp_f13_\r\n"
    "Causas sugeridas del evento: _snmp_f14_\r\n"
    "Posible afectaci\udcc3\udcb3n del evento: _snmp_f15_\r\n"
    "Validaciones de bajo impacto: _snmp_f16_\r\n"
    "Fecha y hora del evento: 2021-07-14 17:47:51\r\n\r\n"
    "Saludos."
)

fixed = string.encode('ascii', errors='surrogateescape').decode('utf8')
print(fixed)

with open(file="test.log", mode="w", encoding="utf8") as logfile:
    logfile.write(fixed)
You can read more about surrogate escapes in PEP 383.

Strange UnicodeEncodeError/AttributeError in my script

Currently I am writing a script in Python 2.7 that works fine, except that after running for a few seconds it hits an error:
Enter Shopify website URL (without HTTP): store.highsnobiety.com
Scraping! Check log file # z:\shopify_output.txt to see output.
!!! Also make sure to clear file every hour or so !!!
Copper Bracelet - 3mm - Polished ['3723603267']
Traceback (most recent call last):
File "shopify_sitemap_scraper.py", line 38, in <module>
print(prod, variants).encode('utf-8')
AttributeError: 'NoneType' object has no attribute 'encode'
The script is to get data from a Shopify website and then print it to console. Code here:
# -*- coding: utf-8 -*-
from __future__ import print_function
from lxml.html import fromstring
import requests
import time
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# Log file location, change "z://shopify_output.txt" to your location.
logFileLocation = "z:\shopify_output.txt"
log = open(logFileLocation, "w")
# URL of Shopify website from user input (for testing, just use store.highsnobiety.com during input)
url = 'http://' + raw_input("Enter Shopify website URL (without HTTP): ") + '/sitemap_products_1.xml'
print ('Scraping! Check log file # ' + logFileLocation + ' to see output.')
print ("!!! Also make sure to clear file every hour or so !!!")
while True:
    page = requests.get(url)
    tree = fromstring(page.content)
    # skip first url tag with no image:title
    url_tags = tree.xpath("//url[position() > 1]")
    data = [(e.xpath("./image/title//text()")[0], e.xpath("./loc/text()")[0]) for e in url_tags]
    for prod, url in data:
        # add xml extension to url
        page = requests.get(url + ".xml")
        tree = fromstring(page.content)
        variants = tree.xpath("//variants[@type='array']//id[@type='integer']//text()")
        print(prod, variants).encode('utf-8')
The craziest part is that when I take out the .encode('utf-8'), it gives me a UnicodeEncodeError, seen here:
Enter Shopify website URL (without HTTP): store.highsnobiety.com
Scraping! Check log file # z:\shopify_output.txt to see output.
!!! Also make sure to clear file every hour or so !!!
Copper Bracelet - 3mm - Polished ['3723603267']
Copper Bracelet - 5mm - Brushed ['3726247811']
Copper Bracelet - 7mm - Polished ['3726253635']
Highsnobiety x EARLY - Leather Pouch ['14541472963', '14541473027', '14541473091']
Traceback (most recent call last):
File "shopify_sitemap_scraper.py", line 38, in <module>
print(prod, variants)
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xae' in position 13: character maps to <undefined>
Any ideas? Have no idea what else to try after hours of googling.
snakecharmerb almost got it, but missed the cause of your first error. Your code
print(prod, variants).encode('utf-8')
means you print the values of the prod and variants variables, then try to call encode() on the return value of print. Unfortunately, print() (as a function in Python 2 and always in Python 3) returns None. To fix it, encode the string instead:
print(prod.encode("utf-8"), variants)
Your console has a default encoding of cp437, and cp437 is unable to represent the character u'\xae'.
>>> print (u'\xae')
®
>>> print (u'\xae'.encode('utf-8'))
b'\xc2\xae'
>>> print (u'\xae'.encode('cp437'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/encodings/cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\xae' in position 0: character maps to <undefined>
You can see that it's trying to convert to cp437 in the traceback:
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
(I reproduced the problem in Python3.5, but it's the same issue in both versions of Python)

nltk NERTagger UnicodeDecodeError in python

I am writing a program in Python 2.7.6 that uses NLTK with the Stanford named entity tagger on Windows 7 Professional to tag text and print the result, as follows:
import re
from nltk.tag.stanford import NERTagger
WORD = re.compile(r'\w+')
st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar")
text = "title Wienfilm 1896-1976 (1976)"
words = WORD.findall(text)
print words
answer = st.tag(words)
print answer
The last print statement in the program is supposed to print a list of five (word, tag) tuples:
[(u'title', u'O'), (u'Wienfilm', u'O'), (u'1896', u'O'), (u'1976', u'O'), (u'1976', u'O')]
However when I run the program, it gives me the following error message:
['title', 'Wienfilm', '1896', '1976', '1976']
Traceback (most recent call last):
  File "E:\Google Drive\myPyPrgs\testNLP.py", line 27, in <module>
    answer = st.tag(words)
  File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 59, in tag
    return self.tag_sents([tokens])[0]
  File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 82, in tag_sents
    stanpos_output = stanpos_output.decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 23: ordinal not in range(128)
Note that if I remove the number '-1976' from the text string, the program tags and prints the correct answer; but if '-1976' is in the text, I always get the above error.
In this forum, somebody suggested changing the default encoding in NLTK's stanford.py. When I changed the default encoding in stanford.py from ASCII to UTF-16 and replaced the last print statement of the above code with the following loop:
for i, word_pos in enumerate(answer):
    word, pos = word_pos
    print i, word.encode('utf-16'), pos.encode('utf-16')
I got the following incorrect output:
0 ÿþ ÿþtitle/O Wienfilm/O 1896 1976 1976/O
Any clues on how to deal with this issue? Thanks in advance.
This worked for me: specify the encoding argument as UTF-8 when you create the NERTagger object:
st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar", encoding='utf-8')
Open a terminal (cmd) and run:
chcp
It should return something like:
Active code page: 857
Then run:
chcp 1254
After that, write this at the top of your .py script:
# -*- coding: cp1254 -*-
This should solve your problem. If it doesn't, copy these lines and paste them at the top of your script:
# -*- coding: utf-8 -*-
import locale
locale.setlocale(locale.LC_ALL, '')
I had many problems with decoding before; these methods solved them.
ASCII covers only 2^7 = 128 characters, which is why you are getting that error, as the message says: ordinal not in range(128).
Also, please check this website (use the arrow keys to switch pages :-)); I believe it will solve your problem.
At the top of your app add:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
I was dealing with the same problem and solved it by adding encoding options in NLTK's internals.py.
Open internals.py, found at:
%YourPythonFolder%\Lib\site-packages\nltk\internals.py
Then go to the java method and add this line after # Construct the full command string (around line 147):
cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']
This section code must look like:
# Construct the full command string.
cmd = list(cmd)
cmd = ['-cp', classpath] + cmd
cmd = [_java_bin] + _java_options + cmd
cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']
Hope it helps.

Python - Encoding string - Swedish Letters

I'm having some trouble with Python's raw_input command (Python 2.6).
For some reason, raw_input does not accept the converted string that swedify() produces, and this gives me an encoding error, the very one I'm aware of and wrote swedify() to avoid in the first place.
Here's what i'm trying to do:
elif cmd in ('help', 'hjälp', 'info'):
    buffert += 'Just nu är programmet relativt begränsat,\nDe funktioner du har att använda är:\n'
    buffert += ' * historik :: skriver ut all din historik\n'
    buffert += ' * ändra <något> :: ändrar något i databasen, följande finns att ändra:\n'
    print swedify(buffert)
This works just fine; it outputs the Swedish characters to the console just as I want.
But when I try (in the same code, with the same \x?? values) to print this piece:
core['goalDistance'] = raw_input(swedify('Hur långt i kilometer är ditt mål: '))
core['goalTime'] = raw_input(swedify('Vad är ditt mål i minuter att springa ' + core['goalDistance'] + 'km på: '))
Then i get this:
C:\Users\Anon>python löp.py
Traceback (most recent call last):
File "l÷p.py", line 92, in <module>
core['goalDistance'] = raw_input(swedify('Hur långt i kilometer är ditt mål: '))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 5: ordinal not in range(128)
Now, I've googled around and found some "solutions", but none of them work. Some said I have to create a batch script that executes chcp ??? at the start, but that's not a clean solution IMO.
Here is swedify:
def swedify(inp):
    try:
        return inp.decode('utf-8')
    except:
        return '(!Dec:) ' + str(inp)
Any solutions for getting raw_input to accept the return value from swedify()?
I've tried from encodings import getencoder, getdecoder and others, but nothing has helped.
You mention that you received an encoding error, which motivated you to write swedify in the first place, and that the solutions you found revolve around chcp, which is a Windows command.
On *nix systems with UTF-8 terminals, swedify is not necessary:
>>> raw_input('Hur långt i kilometer är ditt mål: ')
Hur långt i kilometer är ditt mål: 100
'100'
>>> a = raw_input('Hur långt i kilometer är ditt mål: ')
Hur långt i kilometer är ditt mål: 200
>>> a
'200'
FWIW, when I do use swedify, I get the same error you do:
>>> def swedify(inp):
...     try:
...         return inp.decode('utf-8')
...     except:
...         return '(!Dec:) ' + str(inp)
...
>>> swedify('Hur långt i kilometer är ditt mål: ')
u'Hur l\xe5ngt i kilometer \xe4r ditt m\xe5l: '
>>> raw_input(swedify('Hur långt i kilometer är ditt mål: '))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 5: ordinal not in range(128)
Your swedify function returns a unicode object. The built-in raw_input is just not happy with unicode objects.
>>> raw_input("å")
åeee
'eee'
>>> raw_input(u"å")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 0: ordinal not in range(128)
You might want to try this in Python 3. See this Python bug.
Also of interest: How to read Unicode input and compare Unicode strings in Python?.
UPDATE: According to this blog post there is a way to set the system's default encoding. That might be worth a try.
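Another workaround, not mentioned in the answers here, is to skip raw_input's prompt argument entirely: write the prompt yourself, encoded for the terminal, then call raw_input with no argument. A Python 2 sketch, assuming sys.stdout.encoding is set (falling back to UTF-8):
# -*- coding: utf-8 -*-
import sys

prompt = u'Hur långt i kilometer är ditt mål: '
encoding = sys.stdout.encoding or 'utf-8'

# Write the encoded prompt ourselves, then read the reply without a prompt.
sys.stdout.write(prompt.encode(encoding))
answer = raw_input()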
For me it worked fine with:
#-*- coding: utf-8 -*-
import sys
import codecs
koden=sys.stdin.encoding
a=raw_input( u'Frågan är öppen? '.encode(koden))
print a
Per
On Windows, the console's native Unicode support is broken, and even the apparent UTF-8 codepage isn't a proper fix.
To read from and write to the Windows console you need to use https://github.com/Drekin/win-unicode-console, which works directly with the underlying console API so that multi-byte characters are read and written correctly.
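If I remember the project's README correctly, it is installed with pip install win-unicode-console and enabled with a single call early in the script (treat the exact call as an assumption and check the project page):
# Enable the patched console streams before doing any Unicode I/O.
import win_unicode_console
win_unicode_console.enable()

print(u'Hur långt i kilometer är ditt mål?')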
The Windows command prompt uses code page 850 with Swedish regional settings (https://en.wikipedia.org/wiki/Code_page_850).
It is probably used for backwards compatibility with old MS-DOS programs.
You can set Windows command prompt to use UTF-8 as encoding by entering:
chcp 65001 (Unicode characters in Windows command line - how?)
Try this magic comment at the very top of your script:
# -*- coding: utf-8 -*-
Here is some information about it:
http://www.python.org/dev/peps/pep-0263/
Solution to a lot of problems:
Edit C:\Python??\Lib\site.py and replace "del sys.setdefaultencoding" with "pass".
Then put this at the top of your code:
import sys
sys.setdefaultencoding('latin-1')
The holy grail of fixing the Swedish/non-UTF-8-compatible characters.

Spanish text in .py files

This is the code
A = "Diga sí por cualquier número de otro cuidador.".encode("utf-8")
I get this error:
'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128)
I tried numerous encodings unsuccessfully.
Edit:
I already have this at the beginning
# -*- coding: utf-8 -*-
Changing to
A = u"Diga sí por cualquier número de otro cuidador.".encode("utf-8")
doesn't help
Are you using Python 2?
In Python 2, that string literal is a bytestring. You're trying to encode it, but you can encode only a Unicode string, so Python will first try to decode the bytestring to a Unicode string using the default "ascii" encoding.
Unfortunately, your string contains non-ASCII characters, so it can't be decoded to Unicode.
The best solution is to use a Unicode string literal, like this:
A = u"Diga sí por cualquier número de otro cuidador.".encode("utf-8")
Error message: 'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128)
says that the 7th byte is 0xed. This is either the first byte of the UTF-8 sequence for some (maybe CJK) high-ordinal Unicode character (that's absolutely not consistent with the reported facts), or it's your i-acute encoded in Latin1 or cp1252. I'm betting on the cp1252.
If your file were encoded in UTF-8, the offending byte would not be 0xed but 0xc3:
Preliminaries:
>>> import unicodedata
>>> unicodedata.name(u'\xed')
'LATIN SMALL LETTER I WITH ACUTE'
>>> uc = u'Diga s\xed por'
What happens if the file is encoded in UTF-8:
>>> infile = uc.encode('utf8')
>>> infile
'Diga s\xc3\xad por'
>>> infile.encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
#### NOT the message reported in the question ####
What happens if the file is encoded in cp1252 or latin1 or similar:
>>> infile = uc.encode('cp1252')
>>> infile
'Diga s\xed por'
>>> infile.encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128)
#### As reported in the question ####
Having # -*- coding: utf-8 -*- at the start of your code does not magically ensure that your file is encoded in UTF-8 -- that's up to you and your text editor.
Actions:
Save your file as UTF-8.
As suggested by others, use a Unicode literal: u'blah blah'.
Put this on the first line of your code:
# -*- coding: utf-8 -*-
You should specify your source file's encoding by adding the following line to the very beginning of your code (assuming that your file is encoded in UTF-8):
# Encoding: UTF-8
Otherwise, Python will assume an ASCII encoding and fail during parsing.
You are probably operating on a normal string, not a unicode string:
>> type(u"zażółć gęślą jaźń")
-> <type 'unicode'>
>> type("zażółć gęślą jaźń")
-> <type 'str'>
so
u"Diga sí por cualquier número de otro cuidador.".encode("utf-8")
should work.
If you want to use non-ASCII characters in your string literals, put
# -*- coding: utf-8 -*-
in the first line of your script.
Also have a look at the docs.
P.S. It's Polish in examples above :)
In the first or second line of your code, type the comment:
# -*- coding: latin-1 -*-
For a list of symbols supported see:
http://en.wikipedia.org/wiki/Latin-1_Supplement_%28Unicode_block%29
And the languages covered: http://en.wikipedia.org/wiki/ISO_8859-1
Maybe this is what you want to do:
A = 'Diga sí por cualquier número de otro cuidador'.decode('latin-1')
And don't forget to add # -*- coding: latin-1 -*- at the beginning of your code.
