Python convention name files encoding from iso-8859-5 to utf-8 - python

I have about 3500 files whose name is encoded in 'iso-8859-5' and the contents too.
here's how it looks on the Linux console and the 7 zip program:
I'm trying to write a script that converts to 'UTF-8'
# -*- coding: utf-8 -*-
import os
#Exemple
# how it should look like
#iso-8859-5 ==> utf-8
#НјБ_ФШРУ_Г99 ==> ЭМС_диаг_У99
path = r"C://Users//Kamel//Desktop//работа//macros"
obj = os.scandir(path)
for entry in obj:
if entry.is_dir() or entry.is_file():
command = entry.name
print(command, end="\t\t")
file_name = command.encode('iso-8859-5').decode('UTF-8')
print(command)
I get this error
C:\Python\Python310\python.exe D:/PycharmProjects/pythonProject3/ansi_to_utf.py
Traceback (most recent call last):
File "D:\PycharmProjects\pythonProject3\ansi_to_utf.py", line 15, in <module>
file_name = command.encode('iso-8859-5').decode('UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 11: invalid start byte
BE_BEF BE_BEF
BE_BEF_IMP_0 BE_BEF_IMP_0
BE_BEF_IMP_1 BE_BEF_IMP_1
BE_BEF_IMP_6 BE_BEF_IMP_6
BE_BEF_IMP_7 BE_BEF_IMP_7
BE_BEF_IMP_8 BE_BEF_IMP_8
BE_BEF_IMP_K BE_BEF_IMP_K
BE_BEF_IMP_T BE_BEF_IMP_T
BE_BEF_IMP_В
Process finished with exit code 1

A mojibake case. Your example НјБ_ФШРУ_Г99 ==> ЭМС_диаг_У99 could be accomplished as:
'НјБ_ФШРУ_Г99'.encode('cp1251').decode('iso-8859-5')
# 'ЭМС_диаг_У99'
or (alternatively) as
'НјБ_ФШРУ_Г99'.encode('ptcp154').decode('iso-8859-5')
# 'ЭМС_диаг_У99'
Your failing example (… can't decode byte 0xb2 in position 11):
'BE_BEF_IMP_В'.encode('iso-8859-5')
# b'BE_BEF_IMP_\xb2'
is solved using the same mechanism:
'BE_BEF_IMP_В'.encode('cp1251').decode('iso-8859-5')
# 'BE_BEF_IMP_Т'

Related

How can I replace 'æ' 'ø' and 'å' in a text without error: 'ascii' codec can't decode byte 0xc3 in position 0

I am making a program which is supposed to open a textfile, then replace letters 'æ, ø, and å' (Danish text) with 'ae, oe, aa'.
I need to open the program and run it through the mac terminal.
I tried using the replace() function, and tried writing:
# -*- coding: utf-8 -*-
#!/usr/bin/env python
in the beginning of the file.
But I keep getting error:
File "replace.py", line 20, in replace_nonascii
word = word.replace('å', 'aa')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
any suggestions? have tried googling this for days, I have no clue how to fix it.
Here is my program:
filepath = input('insert path for text')
with codecs.open(filepath, 'r', encoding = 'utf8') as file_object:
filename_cont['text1'] = file_object.read()
def replace_nonascii(word):
word = word.lower()
word = word.replace('å', 'aa')
word = word.replace('æ', 'ae')
word = word.strip('/-.,?!')
print(word)
for text in filename_cont:
newtext = filename_cont[text]
for word in newtext.split():
replace_nonascii(word)

Unicode-encode issues while sending desktop notification using Python

I am fetching latest football scores from a website and sending a notification on the desktop (OS X). I am using BeautifulSoup to scrape the data. I had issues with the unicode data which was generating this error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128).
So I inserted this at the beginning which solved the problem while outputting on the terminal.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
But the problem exists when I am sending notifications on the desktop. I use terminal-notifier to send desktop-notifications.
def notify (title, subtitle, message):
t = '-title {!r}'.format(title)
s = '-subtitle {!r}'.format(subtitle)
m = '-message {!r}'.format(message)
os.system('terminal-notifier {}'.format(' '.join((m, t, s))))
The below images depict the output on the terminal Vs the desktop notification.
Output on terminal.
Desktop Notification
Also, if I try to replace the comma in the string, I get the error,
new_scorer = str(new_scorer[0].text).replace(",","")
File "live_football_bbc01.py", line 41, in get_score
new_scorer = str(new_scorer[0].text).replace(",","")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)
How do I get the output on the desktop notifications like the one on the terminal? Thanks!
Edit : Snapshot of the desktop notification. (Solved)
You are formatting using !r which gives you the repr output, forget the terrible reload logic and either use unicode everywhere:
def notify (title, subtitle, message):
t = u'-title {}'.format(title)
s = u'-subtitle {}'.format(subtitle)
m = u'-message {}'.format(message)
os.system(u'terminal-notifier {}'.format(u' '.join((m, t, s))))
or encode:
def notify (title, subtitle, message):
t = '-title {}'.format(title.encode("utf-8"))
s = '-subtitle {}'.format(subtitle.encode("utf-8"))
m = '-message {}'.format(message.encode("utf-8"))
os.system('terminal-notifier {}'.format(' '.join((m, t, s))))
When you call str(new_scorer[0].text).replace(",","") you are trying to encode to ascii, you need to specify the encoding to use:
In [13]: s1=s2=s3= u'\xfc'
In [14]: str(s1) # tries to encode to ascii
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-14-589849bdf059> in <module>()
----> 1 str(s1)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128)
In [15]: "{}".format(s1) + "{}".format(s2) + "{}".format(s3) # tries to encode to ascii---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-15-7ca3746f9fba> in <module>()
----> 1 "{}".format(s1) + "{}".format(s2) + "{}".format(s3)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128)
You can encode straight away:
In [16]: "{}".format(s1.encode("utf-8")) + "{}".format(s2.encode("utf-8")) + "{}".format(s3.encode("utf-8"))
Out[16]: '\xc3\xbc\xc3\xbc\xc3\xbc'
Or use use all unicode prepending a u to the format strings and encoding last:
In [17]: out = u"{}".format(s1) + u"{}".format(s2) + u"{}".format(s3)
In [18]: out
Out[18]: u'\xfc\xfc\xfc'
In [19]: out.encode("utf-8")
Out[19]: '\xc3\xbc\xc3\xbc\xc3\xbc'
If you use !r you are always going to the the bytes in the output:
In [30]: print "{}".format(s1.encode("utf-8"))
ü
In [31]: print "{!r}".format(s1).encode("utf-8")
u'\xfc'
You can also pass the args using subprocess:
from subprocess import check_call
def notify (title, subtitle, message):
cheek_call(['terminal-notifier','-title',title.encode("utf-8"),
'-subtitle',subtitle.encode("utf-8"),
'-message'.message.encode("utf-8")])
Use: ˋsys.getfilesystemencoding` to get your encoding
Encode your string with it, ignore or replace errors:
import sys
encoding = sys.getfilesystemencoding()
msg = new_scorer[0].text.replace(",", "")
print(msg.encode(encoding, errons="replace"))

Unicode error in `str.format()`

I am trying to run the following script, which scans for *.csproj files and checks for project dependencies in Visual Studio solutions, but I am getting the following error. I have already tried all sorts of codec and encode/decode and u'' combination, to no avail...
(the diacritics are intended and I plan to keep them).
Traceback (most recent call last):
File "E:\00 GIT\SolutionDependencies.py", line 44, in <module>
references = GetProjectReferences("MiotecGit")
File "E:\00 GIT\SolutionDependencies.py", line 40, in GetProjectReferences
outputline = u'"{}" -> "{}"'.format(projectName, referenceName)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 19: ordinal not in range(128)
import glob
import os
import fnmatch
import re
import subprocess
import codecs
gvtemplate = """
digraph g {
rankdir = "LR"
#####
}
""".strip()
def GetProjectFiles(rootFolder):
result = []
for root, dirnames, filenames in os.walk(rootFolder):
for filename in fnmatch.filter(filenames, "*.csproj"):
result.append(os.path.join(root, filename))
return result
def GetProjectName(path):
result = os.path.splitext(os.path.basename(path))[0]
return result
def GetProjectReferences(rootFolder):
result = []
projectFiles = GetProjectFiles(rootFolder)
for projectFile in projectFiles:
projectName = GetProjectName(projectFile)
with codecs.open(projectFile, 'r', "utf-8") as pfile:
content = pfile.read()
references = re.findall("<ProjectReference.*?</ProjectReference>", content, re.DOTALL)
for reference in references:
referenceProject = re.search('"([^"]*?)"', reference).group(1)
referenceName = GetProjectName(referenceProject)
outputline = u'"{}" -> "{}"'.format(projectName, referenceName)
result.append(outputline)
return result
references = GetProjectReferences("MiotecGit")
output = u"\n".join(*references)
with codecs.open("output.gv", "w", 'utf-8') as outputfile:
outputfile.write(gvtemplate.replace("#####", output))
graphvizpath = glob.glob(r"C:\Program Files*\Graphviz*\bin\dot.*")[0]
command = '{} -Gcharset=latin1 -T pdf -o "output.pdf" "output.gv"'.format(graphvizpath)
subprocess.call(command)
When Python 2.x tries to use a byte string in a Unicode context, it automatically tries to decode the byte string to a Unicode string using the ascii codec. While the ascii codec is a safe choice, it often doesn't work.
For Windows environments the mbcs codec will select the code page that Windows uses for 8-bit characters. You can decode the string yourself explicitly.
outputline = u'"{}" -> "{}"'.format(projectName.decode('mbcs'), referenceName.decode('mbcs'))

python: batch string en- / decoding

i have following problem:
i wrote a python script and it needs inputparameters to run... but if the parameters include one of our german "umlaute" like äüö or ß the script stops with following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position
8: ordinal not in range(128)
and if i start the script with a batchfile, the "umlaute" are replaced with random chars like ?, some other variation of the ö....
pls help me.. thx :)
part of the code:
...
if batch_exe:
try:
aIndex = sys.argv.index("-a")
buchungsart_regEx = sys.argv[aIndex+1]
except:
buchungsart_regEx = ""
else:
...
select_stmt = select_stmt + " AND REGEXP_LIKE (BUCHUNGSART, " + "'" + buchungsart_regEx + "')"
...
db_list = sde_conn.execute(select_stmt)
...
and the cmdinput is something like:
python C:\...\Script.py -i ..... -a äöüß
Check this answer: https://stackoverflow.com/a/846931/1686094
You can use his sys.argv = win32_unicode_argv()
And maybe you can then encode your sys.argv with utf-8 for future use.
You could try adding the type of encoding at the top of your script:
# -*- coding: utf-8 -*-

How to fix this UnicodeDecodeError that appears when I try to remove accents in Python strings?

I'm trying to use this function:
import unicodedata
def remove_accents(input_str):
nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])
in the code below (which unzips and reads files with non-ASCII strings). But I'm getting this error, (from this library file C:\Python27\Lib\encodings\utf_8.py):
Message File Name Line Position
Traceback
<module> C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\USSSALoader.py 64
getNameList C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\USSSALoader.py 26
remove_accents C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\USSSALoader.py 17
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 3: ordinal not in range(128)
Why am I getting this error? How to avoid it and make remove_accents work?
Thanks for any help!
Here's the entire code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# -*- coding: utf-8 -*-
import os
import re
from zipfile import ZipFile
import csv
##def strip_accents(s):
## return ''.join((c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))
import unicodedata
def remove_accents(input_str):
nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])
def getNameList():
namesDict=extractNamesDict()
maleNames=list()
femaleNames=list()
for name in namesDict:
print name
# name = strip_accents(name)
name = remove_accents(name)
counts=namesDict[name]
tuple=(name,counts[0],counts[1])
if counts[0]>counts[1]:
maleNames.append(tuple)
elif counts[1]>counts[0]:
femaleNames.append(tuple)
names=(maleNames,femaleNames)
# print maleNames
return names
def extractNamesDict():
zf=ZipFile('names.zip', 'r')
filenames=zf.namelist()
names=dict()
genderMap={'M':0,'F':1}
for filename in filenames:
file=zf.open(filename,'r')
rows=csv.reader(file, delimiter=',')
for row in rows:
#name=row[0].upper().decode('latin1')
name=row[0].upper()
gender=genderMap[row[1]]
count=int(row[2])
if not names.has_key(name):
names[name]=[0,0]
names[name][gender]=names[name][gender]+count
file.close()
# print '\tImported %s'%filename
# print names
return names
if __name__ == "__main__":
getNameList()
Best practice is to decode to Unicode when the data comes into your program:
for row in rows:
name=row[0].upper().decode('utf8') # or whatever...you DO need to know the encoding.
Then remove_accents can just be:
def remove_accents(input_str):
nkfd_form = unicodedata.normalize('NFKD', input_str)
return u''.join(c for c in nkfd_form if not unicodedata.combining(c))
Encode data when leaving your program such as writing to a file, database, terminal, etc.
Why remove accents in the first place?
If you want to robustly convert unicode characters to ascii in a string, you should use the awesome unidecode module:
>>> import unidecode
>>> unidecode.unidecode(u'Björk')
'Bjork'
>>> unidecode.unidecode(u'András Sütő')
'Andras Suto'
>>> unidecode.unidecode(u'Ελλάς')
'Ellas'
You get it because you are decoding from a bytestring without specifying a codec:
unicode(input_str)
Add a codec there (here I assume your data is encoded in utf-8, 0xe1 would be the first of a 3-byte character):
unicode(input_str, 'utf8')

Categories