Reading Korean through a CSV in Python

I am having an issue reading a CSV file containing English and Korean characters into Python. I have tested my code without the Korean and it works fine.
Code (Python - 3.6.4)
import csv
with open('Kor3.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)
print(your_list)
Error
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position
2176: character maps to undefined
CSV File Output: This was converted from Excel to Unicode text, and then the file extension was changed to .csv. I think this is the root of the problem.
Would it be better to read from an Excel or another format?
Sample Input (2 Columns)
생일 축하해요 Happy birthday

Just declare the encoding when opening the file:
with open('Kor3.csv', 'r', encoding='utf-8') as f:
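If the file really came from Excel's "Unicode Text" export, it may not be UTF-8 at all. To my knowledge that export is usually UTF-16 and tab-delimited, but treat that as an assumption and check your own file. A minimal sketch, where read_rows and the temporary sample file are made up for illustration:

```python
import csv
import os
import tempfile


def read_rows(path, encoding, delimiter=","):
    """Read a delimited text file with an explicit encoding (hypothetical helper)."""
    # newline="" lets the csv module handle line endings itself
    with open(path, "r", encoding=encoding, newline="") as f:
        return list(csv.reader(f, delimiter=delimiter))


# Simulate an Excel "Unicode Text" export: assumed UTF-16 with BOM, tab-delimited
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write("생일 축하해요\tHappy birthday\r\n".encode("utf-16"))

rows = read_rows(tmp.name, encoding="utf-16", delimiter="\t")
os.remove(tmp.name)
print(rows)
```

If encoding='utf-8' raises another UnicodeDecodeError on your file, trying 'utf-16' with a tab delimiter is a reasonable next step.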

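Use Python 3. Its csv module works with Unicode strings, but open() still decodes the file with the platform's default encoding unless you pass encoding= explicitly.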

In the end I just went with importing from the Excel file; I think this was an issue with the CSV rather than Python. Thanks for your help.
from xlrd import open_workbook
wb = open_workbook('Korean.xlsx')
values = []
for s in wb.sheets():
    # print('Sheet:', s.name)
    for row in range(1, s.nrows):  # start at 1 to skip the header row
        col_names = s.row(0)
        col_value = []
        for name, col in zip(col_names, range(s.ncols)):
            value = s.cell(row, col).value
            # Excel numbers arrive as floats; turn them into integer strings
            # (this truncates any fractional part) and leave text untouched
            try:
                value = str(int(value))
            except (TypeError, ValueError):
                pass
            col_value.append(value)
        values.append(col_value)
print(values)  # test
print(values[0][1], values[1][1])  # test2

Related

decoding breaks lines into characters in python 3

I am reading a CSV file through a Samba share. My CSV file format:
hello;world
1;2;
Python code
import csv
import urllib.request
from smb.SMBHandler import SMBHandler
PATH = 'smb://myusername:mypassword@192.168.1.200/myDir/'
opener = urllib.request.build_opener(SMBHandler)
fh = opener.open(PATH + 'myFileName')
data = fh.read().decode('utf-8')
print(data)  # This prints the data correctly
csvfile = csv.reader(data, delimiter=';')
for myrow in csvfile:
    print(myrow)  # This prints just ['h'], but it should print the row hello;world
    break
fh.close()
The problem is that after decoding from UTF-8, the rows are not the actual lines in the file.
Desired output of a row after reading the file: hello;world
Current output of a row after reading the file: h
Any help is appreciated.
csv.reader takes an iterable that returns lines. Strings, when iterated, yield characters. The fix is simple:
csvfile = csv.reader(data.splitlines(), delimiter=';')
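To see the difference without a Samba share, here is a small sketch that uses the sample data from the question as an in-memory string. The io.StringIO variant is my addition, not part of the answer above; it may be preferable when quoted fields can contain line breaks, since splitlines() would cut those fields apart.

```python
import csv
import io

data = "hello;world\n1;2;\n"

# Iterating over a str yields single characters, so the first "line"
# that csv.reader receives is just "h".
first_wrong = next(csv.reader(data, delimiter=";"))

# Splitting into lines first gives csv.reader what it expects.
rows = list(csv.reader(data.splitlines(), delimiter=";"))

# Alternative: wrap the text in a file-like object.
rows_via_stringio = list(csv.reader(io.StringIO(data, newline=""), delimiter=";"))

print(first_wrong)
print(rows)
```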

UnicodeEncodeError: 'ascii' codec can't encode character despite trying other SO solutions [duplicate]

This question already has answers here: Python CSV DictReader with UTF-8 data (7 answers).
I am trying to convert a CSV file to a JSON file. During that process, when I try to write to the JSON file, I get a Unicode error partway through:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u06ec' in position 933: ordinal not in range(128)
my code:
import csv
import json
import codecs
csvfile = codecs.open('my.csv', 'r', encoding='utf-8', errors='ignore')
jsonfile = codecs.open('my.json', "w", encoding='utf-8', errors='ignore')
fieldnames = ("Title", "Date", "Text", "Country", "Page", "Week")
reader = csv.DictReader(csvfile, fieldnames)
for row in reader:
    row['Text'] = row['Text'].encode('ascii', errors='ignore')  # error occurs on this line
    json.dump(row, jsonfile)
    jsonfile.write('\n')
example of a row:
{'Country': 'UK', 'Title': '12345', 'Text': " hi there hi john i currently ", 'Week': 'week2', 'Page': 'homepage', 'Date': '1/3/16'}
Don't convert to ASCII.
JSON handles unicode natively.
Simply remove the .encode("ascii", ...) part.
Also, you don't need to set an encoding on the file object you use for JSON, because json.dump escapes non-ASCII characters by default, so its output is plain ASCII.
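For anyone on Python 3, the same idea might look like the sketch below. It uses in-memory streams in place of my.csv and my.json, and csv_to_json_lines is a name I made up for illustration; only the field names come from the question.

```python
import csv
import io
import json

FIELDNAMES = ("Title", "Date", "Text", "Country", "Page", "Week")


def csv_to_json_lines(csv_stream, json_stream, fieldnames=FIELDNAMES):
    """Write one JSON object per CSV row (hypothetical helper)."""
    reader = csv.DictReader(csv_stream, fieldnames=fieldnames)
    for row in reader:
        # No .encode('ascii') needed: json.dump escapes non-ASCII
        # characters as \uXXXX by default (ensure_ascii=True).
        json.dump(row, json_stream)
        json_stream.write("\n")


# Stand-ins for the real files; the sample row contains U+06EC
source = io.StringIO("12345,1/3/16,hi \u06ec there,UK,homepage,week2\n")
target = io.StringIO()
csv_to_json_lines(source, target)
print(target.getvalue())
```

With real files you would open the CSV with open('my.csv', encoding='utf-8', newline='') and pass the two file objects instead of the StringIO stand-ins.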
I edited my code to read the CSV file as binary. That then gave me an invalid-byte error, which I solved by converting the text string to unicode.
This is the working code:
csvfile = open('my.csv', 'rb')
jsonfile = codecs.open('my.json', "w")
fieldnames = ("Title", "Date", "Text", "Country", "Page", "Week")
reader = csv.DictReader(csvfile, fieldnames)
for row in reader:
    print row
    row['Text'] = unicode(row['Text'], errors='replace')

Reading russian language data from csv

I have some data in CSV file that are in Russian:
2-комнатная квартира РДТ', мкр Тастак-3, Аносова — Толе би;Алматы
2-комнатная квартира БГР', мкр Таугуль, Дулати (Навои) — Токтабаева;Алматы
2-комнатная квартира ЦФМ', мкр Тастак-2, Тлендиева — Райымбека;Алматы
The delimiter is the ; symbol.
I want to read the data and put it into an array. I tried to read it using this code:
def loadCsv(filename):
    lines = csv.reader(open(filename, "rb"), delimiter=";")
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [str(x) for x in dataset[i]]
    return dataset
Then I read and print result:
mydata = loadCsv('krish(csv3).csv')
print mydata
Output:
[['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-3, \xc0\xed\xee\xf1\xee\xe2\xe0 \x97 \xd2\xee\xeb\xe5 \xe1\xe8', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf3\xe3\xf3\xeb\xfc, \xc4\xf3\xeb\xe0\xf2\xe8 (\xcd\xe0\xe2\xee\xe8) \x97 \xd2\xee\xea\xf2\xe0\xe1\xe0\xe5\xe2\xe0', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-2, \xd2\xeb\xe5\xed\xe4\xe8\xe5\xe2\xe0 \x97 \xd0\xe0\xe9\xfb\xec\xe1\xe5\xea\xe0', '\xc0\xeb\xec\xe0\xf2\xfb']]
I found that in this case codecs are required and tried to do the same with this code:
import codecs
with codecs.open('krish(csv3).csv', 'r', encoding='utf8') as f:
    text = f.read()
print text
I got this error:
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 2: invalid continuation byte
What is the problem? When using codecs, how do I specify the delimiter for my data?
I just want to read data from file and put it in 2-dimensional array.
The byte 0xEA is the windows-1251 / cp5347 encoding of к. Therefore, you need to decode as windows-1251, not UTF-8.
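You can check this claim directly on the bytes from the question's output. This is a quick Python 3 sketch (the question itself uses Python 2); cp1251 is Python's codec name for windows-1251.

```python
# Bytes copied from the output shown in the question
raw = b"\xc0\xeb\xec\xe0\xf2\xfb"
decoded = raw.decode("cp1251")
print(decoded)

# The same kind of data is not valid UTF-8: 0xEA starts a multi-byte
# sequence, but the byte after it is not a continuation byte.
try:
    b"2-\xea\xee".decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```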
In Python 2.7, the CSV library does not support Unicode properly - See "Unicode" in https://docs.python.org/2/library/csv.html
The docs propose a simple workaround using:
class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # UTF8Recoder is the helper class defined next to this one in the linked docs
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]
    def __iter__(self):
        return self
This would allow you to do:
def loadCsv(filename):
    lines = UnicodeReader(open(filename, "rb"), delimiter=";", encoding="windows-1251")
    # If you really need a list, uncomment the next line;
    # it lets you access specific rows, e.g. `line_12 = lines[12]`
    # return list(lines)
    # Otherwise return an iterator, so the file is read lazily as you loop over it;
    # use this if you'll do a `for x in lines`
    return lines
If you try to print dataset, you'll get a representation of a list within a list, where the outer list holds the rows and each inner list holds the columns. Any encoded bytes or non-ASCII characters will be represented with \x or \u escapes. To print the values, do:
for csv_line in loadCsv("myfile.csv"):
    print u", ".join(csv_line)
If you need to write your results to another file (fairly typical), you could do:
import io
with io.open("my_output.txt", "w", encoding="utf-8") as my_output:
    for csv_line in loadCsv("myfile.csv"):
        my_output.write(u", ".join(csv_line) + u"\n")
This will automatically convert/encode your output to UTF-8.
You can try:
import pandas as pd
pd.read_csv(path_file, sep=";", encoding="cp1251")
or
import csv
with open(path_file, encoding="cp1251", errors='ignore') as source_file:
    reader = csv.reader(source_file, delimiter=";")
Could your .csv be in another encoding, not UTF-8? (Judging by the error message, it must be.) Try other Cyrillic encodings such as Windows-1251, CP866, or KOI8.
In py3:
import csv
path = 'C:/Users/me/Downloads/sv.csv'
with open(path, encoding="UTF8") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
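For this particular file, the Python 3 version would probably need the Cyrillic encoding and the ; delimiter rather than UTF-8 and the default comma. A sketch under that assumption, using a temporary file with made-up sample content in place of krish(csv3).csv (load_csv is a hypothetical helper):

```python
import csv
import os
import tempfile


def load_csv(filename, encoding="cp1251", delimiter=";"):
    """Return the file's rows as a list of lists of str (hypothetical helper)."""
    with open(filename, encoding=encoding, newline="") as f:
        return list(csv.reader(f, delimiter=delimiter))


# Write a small windows-1251 sample so the sketch runs on its own
sample = "2-комнатная квартира, мкр Тастак-3;Алматы\n"
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(sample.encode("cp1251"))

dataset = load_csv(tmp.name)
os.remove(tmp.name)
print(dataset)
```

In Python 3 the rows are already str, so printing the list shows the Cyrillic text instead of \x escapes.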

export a list as a csv file in python and getting UnicodeEncodeError

I want to get a csv file from my list.
This is my list:
temp = ['سلام' , 'چطوری' ]
The members of the list are in Persian.
I tried to get a CSV file with this code:
import csv
with open("output.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(temp)
but the terminal gives me this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u06a9' in position 0: ordinal not in range(128)
How can I solve it and get my csv file?
P.S
Actually when I print temp , I see these strings:
[u'\u06a9\u0627\u062e \u0645\u0648\u0632\u0647 \u06af\u0644\u0633\u062a\u0627\u0646 | Golestan Palace', u'\u062a\u0647\u0631\u0627\u0646', u'\u062a\u0647\u0631\u0627\]
But when I print temp[1] I get this:
کاخ موزه گلستان | Golestan Palace
Why does Python sometimes show my data escaped and sometimes not?
In another answer, you said you were using Python 2.7. Extract from Python Standard Library Reference Manual - csv module :
The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
The same section gives an example of a class that can be used to deal with Unicode data:
import csv, codecs, cStringIO
class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()
    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)
    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
But you could also try simpler code:
import csv
with open("output.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerow([u.encode('utf-8') for u in temp])
if temp is a list of unicode strings (this writes them as a single row)
or:
import csv
with open("output.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows([[u.encode('utf-8') for u in row] for row in temp])
if temp is a list of lists of unicode strings
The csv library in Python 2 cannot handle Unicode data. This is fixed in Python 3, but will not be backported. However, there is a drop-in replacement 3rd party library that fixes the problem.
Try using unicodecsv instead.
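For comparison, in Python 3 none of the workarounds above should be needed, because the csv module writes str directly to a text-mode file. A minimal sketch, writing each list item as its own one-column row (that layout is my assumption about what the asker wants) to a temporary directory:

```python
import csv
import os
import tempfile

temp = ["سلام", "چطوری"]

out_path = os.path.join(tempfile.mkdtemp(), "output.csv")

# Text mode with an explicit encoding; newline="" is what the csv docs recommend
with open(out_path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([item] for item in temp)  # one string per row

# Read the file back to confirm the round trip
with open(out_path, encoding="utf-8", newline="") as f:
    round_trip = list(csv.reader(f))
print(round_trip)
```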

Python process a csv file to remove unicode characters greater than 3 bytes

I'm using Python 2.7.5 and trying to take an existing CSV file and process it to remove unicode characters that are greater than 3 bytes. (Sending this to Mechanical Turk, and it's an Amazon restriction.)
I've tried to use the top (amazing) answer in this question (How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?). I assume that I can just iterate through the csv row-by-row, and wherever I spot unicode characters of >3 bytes, replace them with a replacement character.
# -*- coding: utf-8 -*-
import csv
import re
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
ifile = open('sourcefile.csv', 'rU')
reader = csv.reader(ifile, dialect=csv.excel_tab)
ofile = open('outputfile.csv', 'wb')
writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
#skip header row
next(reader, None)
for row in reader:
    writer.writerow([re_pattern.sub(u'\uFFFD', unicode(c).encode('utf8')) for c in row])
ifile.close()
ofile.close()
I'm currently getting this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 264: ordinal not in range(128)
So this does iterate properly through some rows, but stops when it gets to the strange unicode characters.
I'd really appreciate some pointers; I'm completely confused. I've replaced 'utf8' with 'latin1' and unicode(c).encode to unicode(c).decode and I keep getting this same error.
Your input is still encoded data, not Unicode values. You'd need to decode to unicode values first, but you didn't specify an encoding to use. You then need to encode again back to encoded values to write back to the output CSV:
writer.writerow([re_pattern.sub(u'\uFFFD', unicode(c, 'utf8')).encode('utf8')
                 for c in row])
Your error stems from the unicode(c) call; without an explicit codec to use, Python falls back to the default ASCII codec.
If you use your file objects as context managers, there is no need to manually close them:
import csv
import re
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
def limit_to_BMP(value, patt=re_pattern):
    return patt.sub(u'\uFFFD', unicode(value, 'utf8')).encode('utf8')
with open('sourcefile.csv', 'rU') as ifile, open('outputfile.csv', 'wb') as ofile:
    reader = csv.reader(ifile, dialect=csv.excel_tab)
    writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    next(reader, None)  # header is not added to output file
    writer.writerows(map(limit_to_BMP, row) for row in reader)
I moved the replacement action to a separate function too, and used a generator expression to produce all rows on demand for the writer.writerows() function.
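The answer above targets Python 2.7. In Python 3 the replacement function gets simpler, because values read from a text-mode file are already str and need no decode/encode round trip. A rough sketch of just that function (limit_to_bmp and NON_BMP are my own names):

```python
import re

# Anything outside the Basic Multilingual Plane needs 4 bytes in UTF-8
NON_BMP = re.compile(r"[^\u0000-\uD7FF\uE000-\uFFFF]")


def limit_to_bmp(value, replacement="\uFFFD"):
    """Replace characters that would take more than 3 bytes in UTF-8."""
    return NON_BMP.sub(replacement, value)


cleaned = limit_to_bmp("a\U0001F600b")
print(cleaned.encode("utf-8"))
```

You would then open both files in text mode with encoding='utf-8' and newline='' and apply limit_to_bmp to each cell, as in the writerows() line above.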
