Output CSV file encoding is incorrect UTF-8


I have a UTF-8 (no BOM) encoded CSV file:
aaa;bbb;ccc
fff;äää;ööö
The following snippet reads the file and then saves it again using a different encoding:
import csv
rows = []
with open('test_in.csv', 'r', newline='') as file:
    csvReader = csv.reader(file, delimiter=';')
    for row in csvReader:
        rows.append(row)
with open('test_out.csv', 'w', newline='', encoding='iso-8859-1') as file:
    csvWriter = csv.writer(file, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for row in rows:
        csvWriter.writerow(row)
Problem: Saved file is not iso-8859-1, but utf-8 encoded.
If I replace the file read with the following list in my UTF-8 encoded source code file, it works correctly:
rows = [
    ['aaa', 'bbb', 'ccc'],
    ['fff', 'äää', 'ööö']
]
Is this a bug in Python? Or do I have to use additional encoding options?
Tested with Python 3.4.

I tried with Python 3.5.1 and it worked fine for me:
sharad#ss:~$ rm test_out.csv
sharad#ss:~$ ls test_in.csv
test_in.csv
sharad#ss:~$ cat my.py
import csv
rows = []
#with open('my.csv', 'r', newline='', encoding='utf-8') as file:
with open('test_in.csv', 'r', newline='') as file:
    csvReader = csv.reader(file, delimiter=';')
    for row in csvReader:
        rows.append(row)
with open('test_out.csv', 'w', newline='', encoding='iso-8859-1') as file:
    csvWriter = csv.writer(file, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for row in rows:
        csvWriter.writerow(row)
sharad#ss:~$
sharad#ss:~$ python3.5 my.py
sharad#ss:~$ ls test_out.csv
test_out.csv
sharad#ss:~$ file test_*.csv
test_in.csv: UTF-8 Unicode text
test_out.csv: ISO-8859 text, with CRLF line terminators
sharad#ss:~$

It seems the encoding option of open() doesn't work the way I thought (I assumed it defaults to UTF-8). The docs for open() say:
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. ...
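On my system it seems to default to cp1252, so the UTF-8 bytes of the input file are decoded with the wrong codec. Thus repr(rows) returns
[['aaa', 'bbb', 'ccc'], ['fff', 'Ã¤Ã¤Ã¤', 'Ã¶Ã¶Ã¶']]
Encoding those garbled characters as iso-8859-1 reproduces the original UTF-8 bytes, which is why the output file still looks UTF-8 encoded.
The fix is to explicitly specify the encoding for the input file as well: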
with open('test_in.csv', 'r', newline='', encoding='utf-8') as file:
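To make the effect reproducible on any platform, here is a minimal sketch that forces the wrong codec instead of relying on the system default. The scratch directory and the variable names wrong_rows and out_bytes are my own assumptions for illustration; the file names and data come from the question.

```python
import csv
import os
import tempfile

# Hypothetical scratch directory so the sketch does not touch real files.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'test_in.csv')
dst = os.path.join(workdir, 'test_out.csv')

# Create the UTF-8 (no BOM) input file from the question.
with open(src, 'w', newline='', encoding='utf-8') as f:
    f.write('aaa;bbb;ccc\r\nfff;äää;ööö\r\n')

# Decoding the UTF-8 bytes as cp1252 is what produces the garbled text.
with open(src, 'r', newline='', encoding='cp1252') as f:
    wrong_rows = list(csv.reader(f, delimiter=';'))

# Decoding with the codec the file was actually written in recovers the text.
with open(src, 'r', newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f, delimiter=';'))

# Now the re-encoding step really produces iso-8859-1 output.
with open(dst, 'w', newline='', encoding='iso-8859-1') as f:
    csv.writer(f, delimiter=';', quoting=csv.QUOTE_MINIMAL).writerows(rows)

with open(dst, 'rb') as f:
    out_bytes = f.read()

print(wrong_rows[1])
print(out_bytes)
```

In the output bytes each ä and ö occupies a single byte (0xE4 and 0xF6), as expected for iso-8859-1.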

Related

csv.writer encoding 'utf-8', but reading encoding 'cp1252'

When writing to a file I use the following code. Here it's upper case, but I've also seen the encoding in lower case utf-8.
import csv
import os
path_to_file = os.path.join(r'C:\Users\jpm\Downloads', 'c19_Vaccine_Current.csv')
# write to file
with open(path_to_file, 'w', newline='', encoding='UTF-8') as csvfile:
    f = csv.writer(csvfile)
    # write the headers of the csv file
    f.writerow(['County', 'AdminCount', 'AdminCountChange', 'RollAvg', 'AllocDoses', 'FullyVaccinated', 'FullyVaccinatedChange', 'ReportDate', 'Pop', 'PctVaccinated', 'LHDInventory', 'CommInventory',
                'TotalInventory', 'InventoryDate'])
And to check if the *.csv is in fact utf-8 I open it and read it:
with open(path_to_file, 'r') as r:
    print(r)
I'm expecting the encoding to be utf-8, but I get:
<_io.TextIOWrapper name='C:\\Users\\jpm\\Downloads\\c19_Vaccine_Current.csv' mode='r' encoding='cp1252'>
I pretty much borrowed the code from this answer. And I've also read the doc. It's crucial that I have the *.csv file as utf-8, but that doesn't appear to be the case.

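The encoding has to be specified when you open the file for reading as well. The default encoding used by open() is platform dependent; on your Windows system it appears to be cp1252. Note that printing the file object only shows the encoding the reader will use to decode the file, not the encoding the file was written in.
You can check the default platform encoding like this (on a Mac it gives UTF-8):
>>> import locale
>>> locale.getpreferredencoding(False)
'UTF-8'
So pass the encoding explicitly for both reading and writing:
with open('file', 'r', encoding='utf-8'):
    ...
with open('file', 'w', encoding='utf-8'):
    ...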
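If the goal is to check what is actually on disk, one option is to read the raw bytes and try to decode them. This is only a sketch: the helper is_valid_utf8, the scratch path and the sample row are hypothetical, and a clean decode only shows the bytes are valid UTF-8 (pure ASCII content would pass for cp1252 as well).

```python
import csv
import os
import tempfile

# Hypothetical file in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), 'check.csv')

with open(path, 'w', newline='', encoding='utf-8') as csvfile:
    csv.writer(csvfile).writerow(['County', 'Doña Ana'])

# Binary mode returns the bytes exactly as stored; nothing is decoded.
with open(path, 'rb') as raw:
    data = raw.read()


def is_valid_utf8(blob):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        blob.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


print(data)
print(is_valid_utf8(data))
```

The same text encoded as cp1252 contains a lone 0xF1 byte, which is not valid UTF-8, so the helper would return False for it.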

Avoiding UnicodeEncodeError in python

I tried to parse an HTML table into CSV using Python with the following script:
from bs4 import BeautifulSoup
import requests
import csv
csvFile = open('log.csv', 'w', newline='')
writer = csv.writer(csvFile)
def parse():
    html = requests.get('https://en.wikipedia.org/wiki/Comparison_of_text_editors')
    bs = BeautifulSoup(html.text, 'lxml')
    table = bs.select_one('table.wikitable')
    rows = table.select('tr')
    for row in rows:
        csvRow = []
        for cell in row.findAll(['th', 'td']):
            csvRow.append(cell.getText())
        writer.writerow(csvRow)
        print(csvRow)
parse()
csvFile.close()
This code output a cleanly formatted CSV file with no encoding issues.
Everything was fine until it reached Enrico Tröger's Geany. My script was unable to write ö into the CSV file, so I tried this:
csvRow.append(cell.text.encode('ascii', 'replace'))
instead of this:
csvRow.append(cell.getText())
That worked, except that each table cell was now wrapped in b''. So how can I get a cleanly formatted CSV file without encoding issues (like in the first screenshot), with all non-ASCII symbols replaced or ignored (like in the second screenshot), using my script?

Change this one:
csvFile = open('log.csv', 'w', newline='')
To this one:
csvFile = open('log.csv', 'w', newline='', encoding='utf8')
csv module documentation:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.
I suppose your system default encoding is not utf8.
You can check it like this:
import locale
locale.getpreferredencoding()
Hope it helps!

Looks like the csv module expects strings, not bytes. So you could decode your bytes back to a string before passing them:
cell.text.encode('ascii', 'replace').decode('ascii')
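If an ASCII-only file really is what you want, another option is to let the file object do the replacing through the errors argument of open(), so every cell stays a plain string. A minimal sketch, with a hypothetical scratch path and a hand-written row standing in for the scraped table:

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'log.csv')  # hypothetical location

# errors='replace' substitutes '?' for any character the codec cannot encode.
with open(path, 'w', newline='', encoding='ascii', errors='replace') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Geany', 'Enrico Tröger'])

with open(path, 'rb') as f:
    written = f.read()

print(written)
```

With errors='ignore' the unencodable characters would be dropped instead of replaced.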

python - writing hex digits to csv

I have the following string:
>>> line = '\x00\t\x007\x00\t\x00C\x00a\x00r\x00d\x00i\x00o\x00 \x00M\x00e\x00t\x00a\x00b\x00o\x00l\x00i\x00c\x00 \x00C\x00a\x00r\x00e\x00\t\x00\t\x00\t\x00\t\x00 \x001\x002\x00,\x007\x008\x008\x00,\x005\x002\x008\x00.\x000\x004\x00\r\x00\n'
When I type the variable line in the Python terminal, it shows the following:
>>> line
'\x00\t\x007\x00\t\x00C\x00a\x00r\x00d\x00i\x00o\x00 \x00M\x00e\x00t\x00a\x00b\x00o\x00l\x00i\x00c\x00 \x00C\x00a\x00r\x00e\x00\t\x00\t\x00\t\x00\t\x00 \x001\x002\x00,\x007\x008\x008\x00,\x005\x002\x008\x00.\x000\x004\x00\r\x00\n'
When I print it, it shows the following:
>>> print line
7 Cardio Metabolic Care 12,788,528.04
In the variable line each word is separated by \t, and I want to save it to a CSV file. So I tried the following code:
import csv
with open('test.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(line.split('\t'))
When I look into the test.csv file, I get only the following:
,,,,,,
Is there any way to get the words into the CSV file? Kindly help.

Your input text is not corrupted, it's encoded - as UTF-16 (Big Endian in this case). And it's CSV itself, just with tab as the delimiter.
You must decode it into a string, after that you can use it normally.
Ideally you declare the proper byte encoding when you read it from a source. For example, when you open a file you can state the encoding the file uses so that the file reader will decode the contents for you.
If you have that byte string from a source where you can't declare an encoding while reading it, you can decode manually:
line = '\x00\t\x007\x00\t\x00C\x00a\x00r\x00d\x00i\x00o\x00 \x00M\x00e\x00t\x00a\x00b\x00o\x00l\x00i\x00c\x00 \x00C\x00a\x00r\x00e\x00\t\x00\t\x00\t\x00\t\x00 \x001\x002\x00,\x007\x008\x008\x00,\x005\x002\x008\x00.\x000\x004\x00\r\x00\n'
decoded = line.decode('utf_16_be')
print decoded
# 7 Cardio Metabolic Care 12,788,528.04
But since I suppose that you are actually reading it from a file:
import csv
import codecs
with codecs.open('input.txt', 'r', encoding='utf16') as in_file, \
        codecs.open('output.csv', 'w', encoding='utf8') as out_file:
    reader = csv.reader(in_file, delimiter='\t')
    writer = csv.writer(out_file, delimiter=',', quotechar='"')
    writer.writerows(reader)
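The snippets above are Python 2. For comparison, here is a rough Python 3 sketch of the same decode-then-convert idea, where the data is a bytes object. To avoid retyping the long literal, raw is rebuilt by encoding the visible text as UTF-16 big endian, which I believe yields the same bytes as the string in the question; in-memory io.StringIO buffers stand in for the input and output files.

```python
import csv
import io

# Rebuild the byte string from the question (UTF-16, big endian, no BOM).
raw = '\t7\tCardio Metabolic Care\t\t\t\t 12,788,528.04\r\n'.encode('utf-16-be')

# Decode the bytes first; only then is the content usable as text.
text = raw.decode('utf-16-be')

# The decoded text is tab-separated, so parse it with the csv module.
rows = list(csv.reader(io.StringIO(text, newline=''), delimiter='\t'))

# Write it back out as comma-separated values.
out = io.StringIO(newline='')
csv.writer(out, delimiter=',', quotechar='"').writerows(rows)

print(rows)
print(repr(out.getvalue()))
```

The last field contains commas, so the writer quotes it automatically.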

Using r+ mode to read and write into the same file

I have a script that successfully removes a column from a csv file. Currently it does this by creating a new file. I want it to write to the original file rather than create a new one.
I’ve tried this by using the r+ mode for open but it’s not working how I want. See notes below. I think r+ mode is the one I need but I’m struggling to find working examples to learn from.
My code:
import csv
in_file = "Path to Source"
out_file = "Path to Result"
with open(in_file, 'r', newline='') as inf, \
        open(out_file, 'w', newline='') as outf:
    reader = csv.reader(inf)
    writer = csv.writer(outf)
    for r in reader:
        writer.writerow((r[0],r[1],r[2],r[3],r[4],r[5],r[6]))
Attempt using r+ mode:
with open(in_file, 'r+', newline='') as inf:
    reader = csv.reader(inf)
    writer = csv.writer(inf)
    for r in reader:
        writer.writerow((r[0],r[1],r[2],r[3],r[4],r[5],r[6]))
This fails with the error "list index out of range".

From what I see, the writer writes while the reader is still reading, both on the same file.
Files have a 'cursor', i.e. a current position at which they are read from or written to.
So the writer overwrites the rows that follow the one the reader has just read, with catastrophic consequences for the subsequent reads.
I think the first approach is the best one: create a new file and then rename it over the original. os.replace() overwrites the original input file on every platform, whereas os.rename() raises an error on Windows if the destination already exists.
import csv, os
in_file = "Path to Source"
out_file = "Path to Result"
with open(in_file, 'r', newline='') as inf, \
        open(out_file, 'w', newline='') as outf:
    reader = csv.reader(inf)
    writer = csv.writer(outf)
    for r in reader:
        writer.writerow(r[:7])
os.replace(out_file, in_file)
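The write-then-rename idea can also be wrapped in a small function so the caller only sees an in-place update. This is a sketch under my own assumptions: the function name keep_first_columns, its parameters and the UTF-8 default are hypothetical, and the temporary file is created next to the original so the final swap stays on one filesystem.

```python
import csv
import os
import tempfile


def keep_first_columns(path, n_columns, encoding='utf-8'):
    """Rewrite the CSV at `path`, keeping only the first n_columns of each row."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory as the original.
    fd, tmp_path = tempfile.mkstemp(suffix='.csv', dir=directory)
    try:
        with os.fdopen(fd, 'w', newline='', encoding=encoding) as outf, \
                open(path, 'r', newline='', encoding=encoding) as inf:
            writer = csv.writer(outf)
            for row in csv.reader(inf):
                writer.writerow(row[:n_columns])
        # Both files are closed here; swap the new file in for the original.
        os.replace(tmp_path, path)
    except BaseException:
        # Something went wrong: leave the original untouched and clean up.
        os.remove(tmp_path)
        raise
```

A call such as keep_first_columns('data.csv', 7) would then do the same job as the script above; the original file is only replaced once the new one has been written completely.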

Python unicode csv Cyrillic, don't convert

I have a problem with writing Cyrillic to CSV. I use unicodecsv and the following snippet:
import unicodecsv
ff = open('test.csv', 'wb+')
writer = unicodecsv.writer(ff, encoding='utf-8', delimiter=',', quotechar='"')
writer.writerows([[u'тест', 'aaa', 'nnn']])
ff.close()
The CSV is generated fine, but then I open it in Microsoft Excel 2011 and I see this:
I tried it in LibreOffice too, same problem...
OS: Mac OS X Yosemite
It doesn't work with utf-8-sig either:
writer = unicodecsv.writer(ff, encoding='utf-8-sig', delimiter=',', quotechar='"')

Excel likes UTF-8-encoded files to have a BOM (byte order mark) signature. Use utf-8-sig as the encoding instead, else it thinks the file is ANSI-encoded. "ANSI" is locale-dependent, and is Windows-1252 on U.S. Windows.
Test source file saved with UTF-8 encoding:
#coding:utf8
import unicodecsv
with open('test.csv', 'wb+') as ff:
    writer = unicodecsv.writer(ff, encoding='utf-8-sig', delimiter=',', quotechar='"')
    writer.writerows([[u'тест', 'aaa', 'nnn']])
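On Python 3 the third-party unicodecsv package should not be needed, because the built-in csv module works on text files and open() accepts the utf-8-sig codec directly. A minimal sketch, assuming Python 3 and a hypothetical scratch path, that also reads the file back in binary mode to confirm the BOM is there:

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'test.csv')  # hypothetical location

# 'utf-8-sig' writes the BOM bytes EF BB BF once at the start of the file.
with open(path, 'w', newline='', encoding='utf-8-sig') as ff:
    writer = csv.writer(ff, delimiter=',', quotechar='"')
    writer.writerows([['тест', 'aaa', 'nnn']])

# Inspect the raw bytes to confirm the signature was written.
with open(path, 'rb') as ff:
    data = ff.read()

print(data[:3])
```

Reading the file back with encoding='utf-8-sig' strips the signature again, so the first cell does not pick up a stray \ufeff character.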