I made a script that returns all the Wi-Fi profile names and passwords stored on your device. Since my device's language is Spanish, the cmd output contains accented characters and Python can't decode it.
I want to know how to strip the accent marks or simply use a different encoding.
import subprocess
import re
import unicodedata

command_output = subprocess.run(['netsh', 'wlan', 'show', 'profiles'], capture_output=True).stdout.decode()
profile_names = re.findall("Perfil de todos los usuarios : (.*)\r", command_output)
wifi_list = []
if len(profile_names) != 0:
    for red in profile_names:
        wifi_profile = {}
        profile_info = subprocess.run(['netsh', 'wlan', 'show', 'profiles', red], capture_output=True).stdout.decode()
        if re.search("Clave de seguridad : Ausente", profile_info):
            continue
        else:
            wifi_profile["ssid"] = red
            profile_info_pass = subprocess.run(['netsh', 'wlan', 'show', 'profile', red, 'key=clear'], capture_output=True).stdout.decode()
            password = re.search("Contenido de la clave : (.*)\r", profile_info_pass)
            if password is None:
                wifi_profile["password"] = None
            else:
                wifi_profile["password"] = password[1]
            wifi_list.append(wifi_profile)
for x in range(len(wifi_list)):
    print(wifi_list[x])
The error I get is:
File "getwifi.py", line 11, in <module>
profile_info=subprocess.run(['netsh','wlan','show','profiles',red], capture_output=True).stdout.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 179: invalid start byte
By default, Python assumes that the bytes value is UTF-8-encoded. You can either pass the correct encoding (I'll assume it's iso8859 here) to decode:
profile_info = subprocess.run(['netsh', 'wlan', 'show', 'profiles', red],
                              capture_output=True).stdout.decode("iso8859")
or you can tell subprocess.run to decode it for you with the encoding argument.
profile_info = subprocess.run(['netsh', 'wlan', 'show', 'profiles', red],
                              capture_output=True, encoding="iso8859").stdout
A third option (beyond the scope of this answer) would be to configure your environment to make sure that netsh outputs UTF-8-encoded output in the first place.
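There is also a middle ground on Windows: Python 3.6+ ships an "oem" codec that maps to whatever OEM codepage the console is actually using, so you don't have to guess the encoding. A minimal sketch, assuming Windows and Python 3.6 or newer:
import subprocess

# "oem" resolves to the console's current OEM codepage (e.g. cp850 on a Spanish system),
# so netsh output decodes correctly regardless of the system language.
command_output = subprocess.run(['netsh', 'wlan', 'show', 'profiles'],
                                capture_output=True, encoding='oem').stdout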
TL;DR
While trying to write a string to a file the following error occurred:
Code
logfile.write(cli_args.last_name)
Output
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-9: ordinal not in range(128)
But this works:
Code
print(cli_args.last_name)
Output
Pérez
Why?
FULL CONTEXT
I made a script which receives data from a Linux CLI, processes it, and finally creates a Zendesk ticket with the provided data. It is a kind of CLI API: in front of my script sits a larger system with a web interface, where users fill in form fields whose values are then substituted into the CLI invocation. For example:
myscript.py --first_name '_first_name_' --last_name '_last_name_'
The script was working with no issues until yesterday, when the website was updated. I think they changed something related to charsets or encoding.
I do some simple logging with f-strings by opening a file and writing informative messages in case anything fails, so I can go back and check where it happened. The CLI arguments are read with the argparse module. Example:
logfile.write(f"\tChecking for opened tickets for user '{cli_args.first_name} {cli_args.last_name}'\n")
After the website update I am getting an error like this:
UnicodeEncodeError: 'ascii' codec can't encode characters in position
8-9: ordinal not in range(128)
Doing some troubleshooting I found it is because some users input names with accent marks like Carlos Pérez.
I need the script to work again and also to handle inputs like that, so I looked for answers. Checking the HTTP headers in the web console's input forms, I found it uses Content-Type: text/html; charset=UTF-8. My first try was to encode the str passed in the CLI argument to UTF-8 and decode it again with the same codec, but that didn't succeed.
On my second try, I checked the Python docs for str.encode() and bytes.decode() and tried this:
logfile.write(
    "\tChecking for opened tickets for user "
    f"'{cli_args.first_name.encode(encoding='utf-8', errors='ignore').decode('utf-8')} "
    f"{cli_args.last_name.encode(encoding='utf-8', errors='ignore').decode('utf-8')}'"
)
It worked, but it removed the accented letter, so Carlos Pérez became Carlos Prez, which is of no use to me; I need the full input.
As a desperate move I tried printing the same f-string I was trying to write to the logfile, and to my surprise it worked: it printed Carlos Pérez to the console without any encoding/decoding step.
How does print work? Why did writing to the file fail? And most importantly, how can I write to a file the same way print writes to the console?
Edit 1 (@MarkTolonen)
Tried the following:
logfile = open("/usr/share/pandora_server/util/plugin/plugin_mcm/sandbox/755bug.txt", mode="a", encoding="utf8")
logfile.write(cli_args.body)
logfile.close()
Output:
Traceback (most recent call last):
File "/usr/share/pandora_server/util/plugin/plugin_mcm/sandbox/ticket_query_app.py", line 414, in
main()
File "/usr/share/pandora_server/util/plugin/plugin_mcm/sandbox/ticket_query_app.py", line 81, in main
logfile.write(cli_args.body)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 8-9: surrogates not allowed
Edit 2
I managed to get the text that is causing the issue:
if __name__ == "__main__":
    string = (
        "Buenos d\udcc3\udcadas,\r\n\r\n"
        "Mediante monitoreo autom\udcc3\udca1tico se ha detectado un evento fuera de lo normal:\r\n\r\n"
        "Descripci\udcc3\udcb3n del evento: _snmp_f13_\r\n"
        "Causas sugeridas del evento: _snmp_f14_\r\n"
        "Posible afectaci\udcc3\udcb3n del evento: _snmp_f15_\r\n"
        "Validaciones de bajo impacto: _snmp_f16_\r\n"
        "Fecha y hora del evento: 2021-07-14 17:47:51\r\n\r\n"
        "Saludos."
    )

    # Output: text with the unicodes translated
    print(string)

    # Output: "UnicodeEncodeError: 'utf-8' codec can't encode characters in position 8-9: surrogates not allowed"
    with open(file="test.log", mode="w", encoding="utf8") as logfile:
        logfile.write(string)
The answer is the encoding parameter to open. Observe:
Last login: Wed Jul 14 15:05:24 2021 from 50.126.68.34
[timrprobocom@jared-ingersoll ~]$ python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open('x.txt','a')
>>> g = open('y.txt','a',encoding='utf-8')
>>> s = "spades \u2660 spades"
>>> f.write(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u2660' in position 7: ordinal not in range(128)
>>> g.write(s)
15
>>>
[timrprobocom@jared-ingersoll ~]$ hexdump -C y.txt
00000000 73 70 61 64 65 73 20 e2 99 a0 20 73 70 61 64 65 |spades ... spade|
*
00000011
It looks like something upstream is misconfigured. Your string appears to have been produced by a decode operation with the wrong encoding, with errors='surrogateescape' error handling. From the data shown, it looks like the decoding operation tried to decode UTF-8-encoded text as ASCII.
errors='surrogateescape' is a way for an encoding to handle invalid bytes during a decode operation. The error handler replaces the invalid bytes with partial surrogates in the range U+DC80..U+DCFF when converting to a Unicode string, and the process can be reversed to get the original byte string back by performing an encode with errors='surrogateescape' and the same encoding.
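A minimal, self-contained sketch of that round trip (the name here is made up for illustration, not taken from the question's data):
# How errors='surrogateescape' round-trips bytes that don't fit the codec.
raw = "Pérez".encode("utf-8")                            # b'P\xc3\xa9rez'
broken = raw.decode("ascii", errors="surrogateescape")   # non-ASCII bytes become lone surrogates
print(repr(broken))                                      # 'P\udcc3\udca9rez'
restored = broken.encode("ascii", errors="surrogateescape")
assert restored == raw                                   # the original bytes come back unchanged
print(restored.decode("utf-8"))                          # Pérez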
The partial surrogates in your string match the pattern of what a decode(encoding='ascii', errors='surrogateescape') call would produce when given data actually encoded in UTF-8 - the surrogates are all in the range surrogateescape uses, and the bytes they correspond to form valid UTF-8. In the code below, I recover the original bytes, then decode them correctly as UTF-8. Once the Unicode string is valid, it can be written to the log file with encoding='utf8'.
string = (
    "Buenos d\udcc3\udcadas,\r\n\r\n"
    "Mediante monitoreo autom\udcc3\udca1tico se ha detectado un evento fuera de lo normal:\r\n\r\n"
    "Descripci\udcc3\udcb3n del evento: _snmp_f13_\r\n"
    "Causas sugeridas del evento: _snmp_f14_\r\n"
    "Posible afectaci\udcc3\udcb3n del evento: _snmp_f15_\r\n"
    "Validaciones de bajo impacto: _snmp_f16_\r\n"
    "Fecha y hora del evento: 2021-07-14 17:47:51\r\n\r\n"
    "Saludos."
)

fixed = string.encode('ascii', errors='surrogateescape').decode('utf8')
print(fixed)

with open(file="test.log", mode="w", encoding="utf8") as logfile:
    logfile.write(fixed)
You can read more about surrogate escapes in PEP 383.
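If you cannot fix whatever is decoding the arguments upstream, the same repair can also be applied to the argparse values just before logging. A sketch using the question's cli_args and logfile names; fix_surrogates is a hypothetical helper, and it assumes the mangled values really are UTF-8 underneath:
def fix_surrogates(text):
    # Undo a bad ASCII + surrogateescape decode, assuming the underlying bytes are UTF-8.
    try:
        return text.encode('ascii', errors='surrogateescape').decode('utf-8')
    except UnicodeError:
        return text  # already a clean string (or not UTF-8 after all): leave it untouched

logfile.write(f"\tChecking for opened tickets for user "
              f"'{fix_surrogates(cli_args.first_name)} {fix_surrogates(cli_args.last_name)}'\n")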
I am trying to read some French text and do some frequency analysis of words. I want the characters with the umlauts and other diacritics to stay. So, I did this for testing:
>>> import codecs
>>> f = codecs.open('file','r','utf-8')
>>> for line in f:
... print line
...
Faites savoir à votre famille que vous êtes en sécurité.
So far, so good. But, I have a list of French files which I iterate over in the following way:
import codecs, sys, os

path = sys.argv[1]
for f in os.listdir(path):
    french = codecs.open(os.path.join(path, f), 'r', 'utf-8')
    for line in french:
        print line
Here, it gives the following error:
rdholaki74: python TestingCodecs.py ../frenchResources | more
Traceback (most recent call last):
File "TestingCodecs.py", line 7, in <module>
print line
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 14: ordinal not in range(128)
Why is it that the same file throws up an error when passed as an argument and not when given explicitly in the code?
Thanks.
Because you're misinterpreting the cause. The fact that you're piping the output means that Python can't detect what encoding to use. If stdout is not a TTY then you'll need to encode as UTF-8 manually before outputting.
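For example, the question's loop (Python 2, as in the question) could encode each line itself before writing it out; this is only a sketch, and it assumes the files really are UTF-8:
import codecs, os, sys

path = sys.argv[1]
for f in os.listdir(path):
    french = codecs.open(os.path.join(path, f), 'r', 'utf-8')
    for line in french:
        sys.stdout.write(line.encode('utf-8'))  # explicit encode, safe even when piped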
It is a print error due to redirection. You could use:
PYTHONIOENCODING=utf-8 python ... | ...
Specify a different encoding if your terminal doesn't use UTF-8.
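An in-script alternative (a sketch for Python 2, the question's version) is to wrap sys.stdout yourself when the output is not a terminal:
import codecs
import sys

# When output is piped, sys.stdout.encoding is None and unicode falls back to ASCII;
# wrapping the stream makes every subsequent print encode as UTF-8 instead.
if sys.stdout.encoding is None:
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)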
I'm having some trouble with Python's raw_input command (Python 2.6).
For some reason, raw_input does not accept the converted string that swedify() produces, and this gives me an encoding error, which I'm aware of; that's why I made swedify() in the first place.
Here's what I'm trying to do:
elif cmd in ('help', 'hjälp', 'info'):
    buffert += 'Just nu är programmet relativt begränsat,\nDe funktioner du har att använda är:\n'
    buffert += ' * historik :: skriver ut all din historik\n'
    buffert += ' * ändra <något> :: ändrar något i databasen, följande finns att ändra:\n'
    print swedify(buffert)
This works just fine; it outputs the Swedish characters to the console just as I want.
But when I try (in the same code, with the same \x?? values) to run this piece:
core['goalDistance'] = raw_input(swedify('Hur långt i kilometer är ditt mål: '))
core['goalTime'] = raw_input(swedify('Vad är ditt mål i minuter att springa ' + core['goalDistance'] + 'km på: '))
Then I get this:
C:\Users\Anon>python löp.py
Traceback (most recent call last):
File "l÷p.py", line 92, in <module>
core['goalDistance'] = raw_input(swedify('Hur långt i kilometer är ditt mål: '))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 5: ordinal not in range(128)
I've googled around and found some "solutions", but none of them work; some said I have to create a batch script that runs chcp ??? at the start, but that's not a clean solution IMO.
Here is swedify:
def swedify(inp):
    try:
        return inp.decode('utf-8')
    except:
        return '(!Dec:) ' + str(inp)
Any solutions on how to get raw_input to read the return value from swedify()?
I've tried from encodings import getencoder, getdecoder among others, but nothing has helped.
You mention that you received an encoding error, which motivated you to write swedify in the first place, and that the solutions you found revolve around chcp, which is a Windows command.
On *nix systems with UTF-8 terminals, swedify is not necessary:
>>> raw_input('Hur långt i kilometer är ditt mål: ')
Hur långt i kilometer är ditt mål: 100
'100'
>>> a = raw_input('Hur långt i kilometer är ditt mål: ')
Hur långt i kilometer är ditt mål: 200
>>> a
'200'
FWIW, when I do use swedify, I get the same error you do:
>>> def swedify(inp):
... try:
... return inp.decode('utf-8')
... except:
... return '(!Dec:) ' + str(inp)
...
>>> swedify('Hur långt i kilometer är ditt mål: ')
u'Hur l\xe5ngt i kilometer \xe4r ditt m\xe5l: '
>>> raw_input(swedify('Hur långt i kilometer är ditt mål: '))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 5: ordinal not in range(128)
Your swedify function returns a unicode object. The built-in raw_input is just not happy with unicode objects.
>>> raw_input("å")
åeee
'eee'
>>> raw_input(u"å")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 0: ordinal not in range(128)
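One common workaround under Python 2 (a sketch, not the only option) is to write the prompt yourself and call raw_input() with no argument, so nothing ever tries to ASCII-encode the unicode prompt:
import sys

prompt = u'Hur långt i kilometer är ditt mål: '
sys.stdout.write(prompt.encode(sys.stdout.encoding or 'utf-8'))  # encode explicitly
svar = raw_input()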
You might want to try this in Python 3. See this Python bug.
Also of interest: How to read Unicode input and compare Unicode strings in Python?.
UPDATE: According to this blog post, there is a way to set the system's default encoding. This might be worth a try.
For me it worked fine with:
# -*- coding: utf-8 -*-
import sys
import codecs

koden = sys.stdin.encoding
a = raw_input(u'Frågan är öppen? '.encode(koden))
print a
Per
On Windows, the console's native Unicode support is broken. Even the apparent UTF-8 codepage isn't a proper fix.
To read and write with the Windows console you need to use https://github.com/Drekin/win-unicode-console, which works directly with the underlying console API, so that multi-byte characters are read and written correctly.
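Usage is just a couple of lines; a minimal sketch, assuming the package is installed:
import win_unicode_console
win_unicode_console.enable()  # route stdin/stdout through the Windows console API

print(u'Hur långt i kilometer är ditt mål?')  # Swedish characters now survive the console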
Windows command prompt uses Codepage 850 when using Swedish regional settings (https://en.wikipedia.org/wiki/Code_page_850).
It's probably used for backwards compatibility with old MS-DOS programs.
You can set Windows command prompt to use UTF-8 as encoding by entering:
chcp 65001 (Unicode characters in Windows command line - how?)
Try this magic comment at the very top of your script:
# -*- coding: utf-8 -*-
Here is some information about it:
http://www.python.org/dev/peps/pep-0263/
Solution to a lot of problems:
Edit C:\Python??\Lib\site.py and replace "del sys.setdefaultencoding" with "pass".
Then put this at the top of your code:
sys.setdefaultencoding('latin-1')
The holy grail of fixing the Swedish/non-UTF-8-compatible characters.
This is the code:
A = "Diga sí por cualquier número de otro cuidador.".encode("utf-8")
I get this error:
'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128)
I tried numerous encodings unsuccessfully.
Edit:
I already have this at the beginning
# -*- coding: utf-8 -*-
Changing to
A = u"Diga sí por cualquier número de otro cuidador.".encode("utf-8")
doesn't help
Are you using Python 2?
In Python 2, that string literal is a bytestring. You're trying to encode it, but you can encode only a Unicode string, so Python will first try to decode the bytestring to a Unicode string using the default "ascii" encoding.
Unfortunately, your string contains non-ASCII characters, so it can't be decoded to Unicode.
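In other words, a rough sketch of the implicit step Python 2 performs (not the exact interpreter internals):
s = "Diga sí por cualquier número de otro cuidador."  # a bytestring in Python 2
# s.encode("utf-8") behaves roughly like:
s.decode("ascii").encode("utf-8")  # the decode step is what raises UnicodeDecodeError here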
The best solution is to use a Unicode string literal, like this:
A = u"Diga sí por cualquier número de otro cuidador.".encode("utf-8")
The error message 'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128) says that the 7th byte is 0xed.
That is either the first byte of a UTF-8 sequence for some high-ordinal (maybe CJK) Unicode character, which is not consistent with the reported facts, or it's your i-acute encoded in Latin-1 or cp1252. I'm betting on cp1252.
If your file was encoded in UTF-8, the offending byte would be not 0xed but 0xc3:
Preliminaries:
>>> import unicodedata
>>> unicodedata.name(u'\xed')
'LATIN SMALL LETTER I WITH ACUTE'
>>> uc = u'Diga s\xed por'
What happens if file is encoded in UTF-8:
>>> infile = uc.encode('utf8')
>>> infile
'Diga s\xc3\xad por'
>>> infile.encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
#### NOT the message reported in the question ####
What happens if file is encoded in cp1252 or latin1 or similar:
>>> infile = uc.encode('cp1252')
>>> infile
'Diga s\xed por'
>>> infile.encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128)
#### As reported in the question ####
Having # -*- coding: utf-8 -*- at the start of your code does not magically ensure that your file is encoded in UTF-8 -- that's up to you and your text editor.
Actions:
1. Save your file as UTF-8.
2. As suggested by others, you need u'blah blah'.
3. Put this on the first line of your code: # -*- coding: utf-8 -*-
You should specify your source file's encoding by adding the following line to the very beginning of your code (assuming that your file is encoded in UTF-8):
# Encoding: UTF-8
Otherwise, Python will assume an ASCII encoding and fail during parsing.
You're probably operating on a normal (byte) string, not a unicode string:
>> type(u"zażółć gęślą jaźń")
-> <type 'unicode'>
>> type("zażółć gęślą jaźń")
-> <type 'str'>
so
u"Diga sí por cualquier número de otro cuidador.".encode("utf-8")
should work.
If you want use unicode strings by default, put
# -*- coding: utf-8 -*-
in the first line of your script.
See also the docs.
P.S. The examples above are in Polish :)
In the first or second line of your code, type the comment:
# -*- coding: latin-1 -*-
For a list of symbols supported see:
http://en.wikipedia.org/wiki/Latin-1_Supplement_%28Unicode_block%29
And the languages covered: http://en.wikipedia.org/wiki/ISO_8859-1
Maybe this is what you want to do:
A = 'Diga sí por cualquier número de otro cuidador'.decode('latin-1')
And don't forget to add # -*- coding: latin-1 -*- at the beginning of your code.
I am working through the Django RSS reader project here.
The RSS feed will read something like "OKLAHOMA CITY (AP) — James Harden let". The RSS feed's encoding reads encoding="UTF-8" so I believe I am passing utf-8 to markdown in the code snippet below. The em dash is where it chokes.
I get the Django error "'ascii' codec can't encode character u'\u2014' in position 109: ordinal not in range(128)", which is a UnicodeEncodeError. In the variables being passed I see "OKLAHOMA CITY (AP) \u2014 James Harden". The code line that is not working is:
content = content.encode(parsed_feed.encoding, "xmlcharrefreplace")
I am using markdown 2.0, django 1.1, and python 2.4.
What is the magic sequence of encoding and decoding that I need to do to make this work?
(In response to Prometheus' request. I agree the formatting helps)
So in views I add a smart_unicode line above the parsed_feed encoding line...
content = smart_unicode(content, encoding='utf-8', strings_only=False, errors='strict')
content = content.encode(parsed_feed.encoding, "xmlcharrefreplace")
This pushes the problem to my models.py, where I have:
def save(self, force_insert=False, force_update=False):
    if self.excerpt:
        self.excerpt_html = markdown(self.excerpt)
    # super save after this
If I change the save method to have...
def save(self, force_insert=False, force_update=False):
    if self.excerpt:
        encoded_excerpt_html = (self.excerpt).encode('utf-8')
        self.excerpt_html = markdown(encoded_excerpt_html)
I get the error "'ascii' codec can't decode byte 0xe2 in position 141: ordinal not in range(128)" because it now reads "\xe2\x80\x94" where the em dash was.
If the data that you are receiving is, in fact, encoded in UTF-8, then it should be a sequence of bytes -- a Python 'str' object, in Python 2.X.
You can verify this with an assertion:
assert isinstance(content, str)
Once you know that that's true, you can move to the actual encoding. Python doesn't do transcoding -- directly from UTF-8 to ASCII, for instance. You need to first turn your sequence of bytes into a Unicode string, by decoding it:
unicode_content = content.decode('utf-8')
(If you can trust parsed_feed.encoding, then use that instead of the literal 'utf-8'. Either way, be prepared for errors.)
You can then take that string, and encode it in ASCII, substituting high characters with their XML entity equivalents:
xml_content = unicode_content.encode('ascii', 'xmlcharrefreplace')
The full method, then, would look something like this:
try:
    content = content.decode(parsed_feed.encoding).encode('ascii', 'xmlcharrefreplace')
except UnicodeDecodeError:
    # Couldn't decode the incoming string -- possibly not encoded in utf-8.
    # Do something here to report the error.
    pass
Django provides a couple of useful functions for converting back and forth between Unicode and bytestrings:
from django.utils.encoding import smart_unicode, smart_str
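A short usage sketch with the question's variable names (content and parsed_feed come from the question; the errors='replace' choice is an assumption):
from django.utils.encoding import smart_unicode, smart_str

unicode_content = smart_unicode(content, encoding=parsed_feed.encoding, errors='replace')  # bytes -> unicode
byte_content = smart_str(unicode_content, encoding='utf-8')                                # unicode -> bytes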
I encountered this error while writing a file name to a zip file. The following failed:
ZipFile.write(root + '/%s' % file, newRoot + '/%s' % file)
and the following worked:
ZipFile.write(str(root + '/%s' % file), str(newRoot + '/%s' % file))