Convert multiple CSV files into UTF-8 encoding in Python

I need to convert multiple CSV files (with different encodings) into UTF-8.
Here is my code:
# find encoding and if not in UTF-8 convert it
import os
import sys
import glob
import chardet
import codecs

myFiles = glob.glob('/mypath/*.csv')
csv_encoding = []
for file in myFiles:
    with open(file, 'rb') as opened_file:
        bytes_file = opened_file.read()
        result = chardet.detect(bytes_file)
        my_encoding = result['encoding']
        csv_encoding.append(my_encoding)
print(csv_encoding)

for file in myFiles:
    if csv_encoding in ['utf-8', 'ascii']:
        print(file + ' in utf-8 encoding')
    else:
        with codecs.open(file, 'r') as file_for_conversion:
            read_file_for_conversion = file_for_conversion.read()
        with codecs.open(file, 'w', 'utf-8') as converted_file:
            converted_file.write(read_file_for_conversion)
        print(file + ' converted to utf-8')
When I try to run this code I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 5057: invalid continuation byte
Can someone help me? Thanks!!!

You need to zip the lists myFiles and csv_encoding to get their values aligned:

for file, encoding in zip(myFiles, csv_encoding):
    ...

And you need to specify that value in the open() call:

    ...
    with codecs.open(file, 'r', encoding=encoding) as file_for_conversion:
Note: in Python 3 there's no need to use the codecs module for opening files.
Just use the built-in open function and specify the encoding with the encoding parameter.
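Putting both fixes together, a minimal sketch based on the question's code and the built-in open (the membership test is hardened slightly, since chardet may return None or a differently-cased name):

import glob
import chardet

myFiles = glob.glob('/mypath/*.csv')

# First pass: detect each file's encoding from its raw bytes.
csv_encoding = []
for file in myFiles:
    with open(file, 'rb') as opened_file:
        result = chardet.detect(opened_file.read())
    csv_encoding.append(result['encoding'])

# Second pass: walk both lists in step and re-encode where needed.
for file, encoding in zip(myFiles, csv_encoding):
    if encoding is None:
        continue  # chardet could not make a guess (e.g. empty file)
    if encoding.lower() in ('utf-8', 'ascii'):
        print(file + ' in utf-8 encoding')
        continue
    with open(file, 'r', encoding=encoding) as file_for_conversion:
        read_file_for_conversion = file_for_conversion.read()
    with open(file, 'w', encoding='utf-8') as converted_file:
        converted_file.write(read_file_for_conversion)
    print(file + ' converted to utf-8')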

Related

How to write an .iadx or text file with BOM in Python

I want to use a BOM with UTF-8, but my code only saves files in plain UTF-8. What can I do? I'm rather new, so could you please write your answer as a direct addition to the sample code I shared?
import os
import codecs

a = 1
filelist = os.listdir("name")
for file in filelist:
    filelen = len(os.listdir("name/" + file))
    if filelen == 10:
        with open(file + ".iadx", "w", encoding="UTF-8") as f:
            f.write("<name>")
            f.write("\n")
            f.write('something')
From the Python documentation on codecs (search for "-sig"):

On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file.

So just doing:

with open(file + ".iadx", "w", encoding="utf-8-sig") as f:
#                                             ^^^^

will do the trick.
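A quick way to confirm the BOM was written is to read the first three bytes back (a sketch with a hypothetical file name example.iadx):

with open("example.iadx", "w", encoding="utf-8-sig") as f:
    f.write("<name>\n")

with open("example.iadx", "rb") as f:
    print(f.read(3))  # b'\xef\xbb\xbf' -- the UTF-8 BOM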

Encoding and decoding with utf-8 returns UnicodeError

I am both encoding and decoding with utf-8, but I still get a UnicodeError.
import pandas as pd
df.to_csv('myfile.csv', index=False, encoding='utf-8')
Then, in another .py file in the same project:

import pandas as pd

with open(file, 'r') as f:
    csv = pd.read_csv(f, encoding='utf-8')
The error is:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 51956: character maps to <undefined>
This is not the first time I get this issue.
Ok, found it. Makes a lot of sense now: open() was already decoding the file with the platform default codec ('charmap' on Windows) before pandas ever saw it, so the encoding passed to read_csv had no effect. The encoding has to go to open() itself:

with open(file, 'r', encoding='utf-8') as f:
    csv = pd.read_csv(f)
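Equivalently, you can hand read_csv the path and let pandas open the file itself; the encoding parameter then actually applies (a sketch, with file being the same path variable as above):

import pandas as pd

# read_csv opens the file itself here, so encoding= takes effect.
csv = pd.read_csv(file, encoding='utf-8')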

Decode every file in a zip file with Python

I have a zip file (file link). It is encoded with utf-8; how can I decode every file within it? I tried the code below but failed with:

TypeError: decoding with 'utf-8' codec failed (TypeError: a bytes-like object is required, not 'ZipFile')
from zipfile import ZipFile
import codecs

with ZipFile('articles.zip', 'r') as zip:
    with zip.open('articles/document0001.txt') as file:
        codecs.decode(file, encoding='utf-8', errors='strict')
Also, there are 100 files in that zip; is there a smart way to decode all of them in one go?
You can use bytes.decode on the file's contents:

from zipfile import ZipFile

with ZipFile('articles.zip', 'r') as z:
    with z.open('articles/document0001.txt') as file:
        file_text = file.read().decode('utf-8')
        print(file_text)  # or do whatever else you want to do with it
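For the second part of the question, a sketch that decodes every entry in the archive by iterating over ZipFile.namelist(), skipping directory entries:

from zipfile import ZipFile

texts = {}
with ZipFile('articles.zip', 'r') as z:
    for name in z.namelist():
        if name.endswith('/'):  # directory entry, nothing to decode
            continue
        with z.open(name) as file:
            texts[name] = file.read().decode('utf-8')

print('decoded %d files' % len(texts))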

Convert a bunch of files from guessed encoding to UTF-8

I have this Python script that attempts to detect the character encoding of a text file (in this case, C# .cs source files, but they could be any text file) and then convert them from that character encoding and into UTF-8 (without BOM).
While chardet detects the encoding well enough and the script runs without errors, characters like © are encoded into $. So I assume there's something wrong with the script and my understanding of encoding in Python 2. Since converting files from UTF-8-SIG to UTF-8 works, I have a feeling that the problem is the decoding (reading) part and not the encoding (writing) part.
Can anyone tell me what I'm doing wrong? If switching to Python 3 is a solution, I'm all for it; I then just need help figuring out how to convert the script from running on version 2.7 to 3.4. Here's the script:
import os
import glob
import fnmatch
import codecs
from chardet.universaldetector import UniversalDetector

# from http://farmdev.com/talks/unicode/
def to_unicode_or_bust(obj, encoding='utf-8'):
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj

def enforce_unicode():
    detector = UniversalDetector()
    for root, dirnames, filenames in os.walk('.'):
        for filename in fnmatch.filter(filenames, '*.cs'):
            detector.reset()
            filepath = os.path.join(root, filename)
            with open(filepath, 'r') as f:
                for line in f:
                    detector.feed(line)
                    if detector.done: break
            detector.close()
            encoding = detector.result['encoding']
            if encoding and not encoding == 'UTF-8':
                print '%s -> UTF-8 %s' % (encoding.ljust(12), filepath)
                with codecs.open(filepath, 'r', encoding=encoding) as f:
                    content = ''.join(f.readlines())
                content = to_unicode_or_bust(content)
                with codecs.open(filepath, 'w', encoding='utf-8') as f:
                    f.write(content)

enforce_unicode()
I have tried to do content = content.decode(encoding).encode('utf-8') before writing the file, but that fails with the following error:
/usr/local/.../lib/python2.7/encodings/utf_8_sig.py:19: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if input[:3] == codecs.BOM_UTF8:
Traceback (most recent call last):
  File "./enforce-unicode.py", line 48, in <module>
    enforce_unicode()
  File "./enforce-unicode.py", line 43, in enforce_unicode
    content = content.decode(encoding).encode('utf-8')
  File "/usr/local/.../lib/python2.7/encodings/utf_8_sig.py", line 22, in decode
    (output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 87: ordinal not in range(128)
Ideas?
chardet simply got the detected codec wrong; your code is otherwise correct. Character detection is based on statistics, heuristics and plain guesses; it is not a foolproof method.
For example, the Windows 1252 codepage is very close to the Latin-1 codec; files encoded with the one encoding can be decoded without error in the other encoding. Detecting the difference between a control code in the one or a Euro symbol in the other usually takes a human being looking at the result.
I'd record the chardet guess for each file; if a file turns out to be wrongly re-coded, you can then look at what other codecs could be close. All of the 1250-series codepages look a lot alike.
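chardet also reports a confidence score next to each guess, so recording both makes it easy to revisit the doubtful files later. A minimal sketch, reusing the question's os.walk/fnmatch layout:

import os
import fnmatch
import chardet

guesses = {}
for root, dirnames, filenames in os.walk('.'):
    for filename in fnmatch.filter(filenames, '*.cs'):
        filepath = os.path.join(root, filename)
        with open(filepath, 'rb') as f:
            result = chardet.detect(f.read())
        # Remember both the guessed codec and how sure chardet was.
        guesses[filepath] = (result['encoding'], result['confidence'])

for path, (encoding, confidence) in sorted(guesses.items()):
    print('%-12s %.2f %s' % (encoding, confidence, path))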

Python script to convert from UTF-8 to ASCII [duplicate]

I'm trying to write a script in python to convert utf-8 files into ASCII files:
#!/usr/bin/env python
# -*- coding: iso-8859-1 -*-
import sys
import os
filePath = "test.lrc"
fichier = open(filePath, "rb")
contentOfFile = fichier.read()
fichier.close()
fichierTemp = open("tempASCII", "w")
fichierTemp.write(contentOfFile.encode("ASCII", 'ignore'))
fichierTemp.close()
When I run this script I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 13: ordinal not in range(128)

I thought I could ignore errors with the ignore parameter of the encode method, but it seems not.
I'm open to other ways to convert.
data="UTF-8 DATA"
udata=data.decode("utf-8")
asciidata=udata.encode("ascii","ignore")
import codecs
...
fichier = codecs.open(filePath, "r", encoding="utf-8")
...
fichierTemp = codecs.open("tempASCII", "w", encoding="ascii", errors="ignore")
fichierTemp.write(contentOfFile)
...
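Filling in the elided parts with the rest of the question's script, the whole conversion then reads (a sketch; file names taken from the question):

import codecs

filePath = "test.lrc"

# codecs.open decodes from UTF-8 as the file is read...
fichier = codecs.open(filePath, "r", encoding="utf-8")
contentOfFile = fichier.read()
fichier.close()

# ...and this handle encodes to ASCII on write, silently dropping
# any character that has no ASCII equivalent.
fichierTemp = codecs.open("tempASCII", "w", encoding="ascii", errors="ignore")
fichierTemp.write(contentOfFile)
fichierTemp.close()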
UTF-8 is a superset of ASCII. Either your UTF-8 file is ASCII, or it can't be converted without loss.
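A quick illustration of that superset relationship (Python 3 semantics assumed):

# ASCII text encodes to identical bytes under both codecs...
text = "plain ascii"
assert text.encode("ascii") == text.encode("utf-8")

# ...while a character like © exists in UTF-8 but not in ASCII,
# so converting it to ASCII is necessarily lossy.
print("©".encode("utf-8"))            # b'\xc2\xa9'
print("©".encode("ascii", "ignore"))  # b'' -- dropped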
