Reading Russian language data from csv - python

I have some data in a CSV file that is in Russian:
2-комнатная квартира РДТ', мкр Тастак-3, Аносова — Толе би;Алматы
2-комнатная квартира БГР', мкр Таугуль, Дулати (Навои) — Токтабаева;Алматы
2-комнатная квартира ЦФМ', мкр Тастак-2, Тлендиева — Райымбека;Алматы
The delimiter is the ; symbol.
I want to read the data and put it into an array. I tried to read it using this code:
def loadCsv(filename):
    lines = csv.reader(open(filename, "rb"), delimiter=";")
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [str(x) for x in dataset[i]]
    return dataset
Then I read and print the result:
mydata = loadCsv('krish(csv3).csv')
print mydata
Output:
[['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-3, \xc0\xed\xee\xf1\xee\xe2\xe0 \x97 \xd2\xee\xeb\xe5 \xe1\xe8', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf3\xe3\xf3\xeb\xfc, \xc4\xf3\xeb\xe0\xf2\xe8 (\xcd\xe0\xe2\xee\xe8) \x97 \xd2\xee\xea\xf2\xe0\xe1\xe0\xe5\xe2\xe0', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-2, \xd2\xeb\xe5\xed\xe4\xe8\xe5\xe2\xe0 \x97 \xd0\xe0\xe9\xfb\xec\xe1\xe5\xea\xe0', '\xc0\xeb\xec\xe0\xf2\xfb']]
I found that in this case codecs are required and tried to do the same with this code:
import codecs
with codecs.open('krish(csv3).csv', 'r', encoding='utf8') as f:
    text = f.read()
    print text
I got this error:
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 2: invalid continuation byte
What is the problem? When using codecs, how do I specify the delimiter?
I just want to read the data from the file and put it into a 2-dimensional array.

\xea is the windows-1251 (cp1251) encoding of к. Therefore, you need to decode as windows-1251, not UTF-8.
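A quick check in Python 3, using the raw bytes from the output above:

```python
# The raw bytes from the question's output decode cleanly as cp1251:
raw = b'\xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0'
print(raw.decode('cp1251'))  # квартира ("apartment")

# Decoding the same bytes as UTF-8 raises the error from the question:
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)
```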
In Python 2.7, the csv module does not support Unicode properly; see the "Unicode" section of https://docs.python.org/2/library/csv.html
The documentation proposes a simple workaround:
class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # UTF8Recoder (defined in the same docs section) re-encodes the
        # input from the source encoding to UTF-8
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self
This would allow you to do:
def loadCsv(filename):
    lines = UnicodeReader(open(filename, "rb"), delimiter=";", encoding="windows-1251")
    # If you really need a list (so you can index rows directly,
    # e.g. line_12 = lines[12]), return one instead:
    # return list(lines)
    # Returning the iterator means the file is read lazily,
    # which is what you want if you just loop over the rows:
    return lines
If you try to print dataset, you'll get the representation of a list of lists, where the outer list holds the rows and each inner list holds the columns. Any encoded bytes or literals will be shown as \x or \u escapes. To print the actual values, do:
for csv_line in loadCsv("myfile.csv"):
    print u", ".join(csv_line)
If you need to write your results to another file (fairly typical), you could do:
import io

with io.open("my_output.txt", "w", encoding="utf-8") as my_output:
    for csv_line in loadCsv("myfile.csv"):
        my_output.write(u", ".join(csv_line) + u"\n")
This will automatically convert/encode your output to UTF-8.
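For completeness: in Python 3 none of this machinery is needed, because open() accepts an encoding argument directly. A minimal sketch (load_csv is a hypothetical name):

```python
import csv

def load_csv(filename):
    # Python 3's open() decodes cp1251 transparently while reading,
    # so no recoder class is needed
    with open(filename, newline='', encoding='cp1251') as f:
        return list(csv.reader(f, delimiter=';'))
```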

You can try:
import pandas as pd

df = pd.read_csv(path_file, sep=";", encoding="cp1251")
or
import csv

with open(path_file, encoding="cp1251", errors='ignore') as source_file:
    reader = csv.reader(source_file, delimiter=";")

Could your .csv be in another encoding, not UTF-8? (Given the error message, it almost certainly is.) Try other Cyrillic encodings such as Windows-1251, CP866, or KOI8-R.
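A quick way to eyeball which one fits (the byte string is taken from the output in the question):

```python
# The same raw bytes render differently under each candidate Cyrillic codec;
# the right encoding is the one that produces readable words.
sample = b'\xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0'
for enc in ('cp1251', 'cp866', 'koi8-r'):
    print(enc, '->', sample.decode(enc))
# only cp1251 yields the readable word "квартира"
```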

In Python 3:

import csv

path = 'C:/Users/me/Downloads/sv.csv'
with open(path, encoding="UTF8") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)


decoding breaks lines into characters in python 3

I am reading a CSV file through a Samba share. My CSV file format:
hello;world
1;2;
Python code
import csv
import urllib.request
from smb.SMBHandler import SMBHandler

PATH = 'smb://myusername:mypassword@192.168.1.200/myDir/'

opener = urllib.request.build_opener(SMBHandler)
fh = opener.open(PATH + 'myFileName')
data = fh.read().decode('utf-8')
print(data)  # this prints the data correctly

csvfile = csv.reader(data, delimiter=';')
for myrow in csvfile:
    print(myrow)  # this prints only ['h'], but it should print ['hello', 'world']
    break
fh.close()
The problem is that, after decoding to utf-8, the rows are not the actual lines of the file.
Desired output of a row after reading the file: hello;world
Current output of a row after reading the file: h
Any help is appreciated.
csv.reader takes an iterable that returns lines. Strings, when iterated, yield characters. The fix is simple:
csvfile = csv.reader(data.splitlines(), delimiter=';')
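A minimal, standalone demonstration of the difference, with an inline string instead of the SMB file:

```python
import csv

data = 'hello;world\n1;2;\n'

# Passing the string directly: csv.reader iterates it character by character,
# so each "line" it sees is a single character.
print(next(csv.reader(data, delimiter=';')))  # ['h']

# Passing the list of lines: csv.reader sees real rows.
print(list(csv.reader(data.splitlines(), delimiter=';')))
# [['hello', 'world'], ['1', '2', '']]
```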

Reading Korean through a CSV in Python

I am having an issue reading a CSV file into Python that contains English and Korean characters; I have tested my code without the Korean and it works fine.
Code (Python - 3.6.4)
import csv

with open('Kor3.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)

print(your_list)
Error
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 2176: character maps to <undefined>
CSV File Output: This was converted from Excel to Unicode text and then the file extension changed to .csv. I think this is the root of the problem.
Would it be better to read from an Excel or another format?
Sample Input (2 Columns)
생일 축하해요 Happy birthday
Just declare the encoding when opening the file:
with open('Kor3.csv', 'r', encoding='utf-8') as f:
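One caveat: if utf-8 also fails, remember that Excel's "Unicode Text" export is typically UTF-16 with a BOM, not UTF-8. A hedged sketch that tries the encodings such an export usually produces (read_csv_any is a hypothetical helper):

```python
import csv

def read_csv_any(path, encodings=('utf-8-sig', 'utf-16', 'cp949')):
    # Try each candidate in turn: utf-8-sig strips a UTF-8 BOM if present,
    # utf-16 handles Excel's "Unicode Text" export, cp949 is legacy Korean Windows
    for enc in encodings:
        try:
            with open(path, 'r', encoding=enc, newline='') as f:
                return enc, list(csv.reader(f))
        except (UnicodeDecodeError, UnicodeError):
            continue
    raise ValueError('none of the candidate encodings fit')
```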
Use Python 3; its csv functions read Unicode by default.
In the end I just went with importing from the Excel file; I think this was an issue with the CSV rather than with Python. Thanks for your help.
from xlrd import open_workbook

wb = open_workbook('Korean.xlsx')
values = []
for s in wb.sheets():
    #print 'Sheet:', s.name
    for row in range(1, s.nrows):
        col_names = s.row(0)
        col_value = []
        for name, col in zip(col_names, range(s.ncols)):
            value = s.cell(row, col).value
            try:
                value = str(int(value))
            except:
                pass
            col_value.append(value)
        values.append(col_value)
print(values)  # test
print(values[0][1], values[1][1])  # test2

trouble with unicodecsv reader in python

I'm having trouble using the unicodecsv reader. I keep looking for different examples of how to use the module, but everyone keeps referencing the exact sample from the unicodecsv website (or some similar variation).
import unicodecsv as csv
from io import BytesIO
f = BytesIO()
w = csv.writer(f, encoding='utf-8')
_ = w.writerow((u'é', u'ñ'))
_ = f.seek(0)
r = csv.reader(f, encoding='utf-8')
next(r) == [u'é', u'ñ']
>>> True
For me this example makes too many assumptions about our understanding. It doesn't look like a csv file is being passed. I've completely missed the plot.
What I want to do is:
Read the first line of the csv file, which contains the headers
Read the remaining lines and put them in a dictionary
My broken code:
import unicodecsv

i = 0
myCSV = "$_input.csv"
dic = {}

f = open(myCSV, "rb")
reader = unicodecsv.reader(f, delimiter=',')

# read the first line of the csv
# use a custom function to parse the header
strHeader = reader.next()
myHeader = FNC.PARSE_HEADER(strHeader)

# read the remaining lines
# put the data into a dictionary of class objects
for row in reader:
    i += 1
    dic[i] = cDATA(myHeader, row)
And, as expected, I get the 'UnicodeDecodeError'. Maybe the example above has the answers, but they are just completely going over my head.
Can someone please fix my code? I'm running out of hair to pull out.
I switched the reader line to:
reader = unicodecsv.reader(f, encoding='utf-8')
Traceback:
for row in reader:
  File "C:\Python27\unicodecsv\py2.py", line 128, in next
    for value in row]
UnicodeDecodeError: 'utf8' codec can't decode byte 0x90 in position 48: invalid start byte
When I simply print the data using:
f = open(myCSV, "rb")
reader = csv.reader(f, delimiter=',')
for row in reader:
    print(str(row[9]) + '\n')
    print(repr(row[9] + '\n'))
>>> UTAS ? Offline
>>> 'UTAS ? Offline'
You need to declare the encoding of the input file when creating the reader, just like you did when creating the writer:
>>> import unicodecsv as csv
>>> with open('example.csv', 'wb') as f:
... writer = csv.writer(f, encoding='utf-8')
... writer.writerow(('heading0', 'heading1'))
... writer.writerow((u'é', u'ñ'))
... writer.writerow((u'ŋ', u'ŧ'))
...
>>> with open('example.csv', 'rb') as f:
... reader = csv.reader(f, encoding='utf-8')
... headers = next(reader)
... print headers
... data = {i: v for (i, v) in enumerate(reader)}
... print data
...
[u'heading0', u'heading1']
{0: [u'\xe9', u'\xf1'], 1: [u'\u014b', u'\u0167']}
Printing the dictionary shows the escaped representation of the data, but you can see the characters by printing them individually:
>>> for v in data.values():
... for s in v:
... print s
...
é
ñ
ŋ
ŧ
EDIT:
If the encoding of the file is unknown, then it's best to use something like chardet to determine the encoding before processing.
If your final goal is to read a CSV file and convert the data into dicts, then I would recommend using csv.DictReader. DictReader takes care of reading the header and converts the remaining rows into dicts (rowdicts). It is part of the standard csv module, which has plenty of documentation and examples available.
>>> import csv
>>> with open('names.csv') as csvfile:
... reader = csv.DictReader(csvfile)
... for row in reader:
... print(row['first_name'], row['last_name'])
For more examples, see https://docs.python.org/2/library/csv.html#csv.DictReader
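A self-contained version of the same idea (Python 3, with inline data standing in for names.csv):

```python
import csv
import io

# Inline data stands in for the names.csv file from the docs example
data = io.StringIO('first_name,last_name\nJohn,Cleese\nEric,Idle\n')
reader = csv.DictReader(data)
for row in reader:
    print(row['first_name'], row['last_name'])
# John Cleese
# Eric Idle
```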

export a list as a csv file in python and getting UnicodeEncodeError

I want to get a csv file from my list.
This is my list:
temp = ['سلام' , 'چطوری' ]
Members of list are in Persian language.
I tried to create the CSV file with this code:
import csv

with open("output.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(temp)
but the terminal gives me this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u06a9' in position 0: ordinal not in range(128)
How can I solve it and get my csv file?
P.S.
Actually, when I print temp, I see these strings:
[u'\u06a9\u0627\u062e \u0645\u0648\u0632\u0647 \u06af\u0644\u0633\u062a\u0627\u0646 | Golestan Palace', u'\u062a\u0647\u0631\u0627\u0646', u'\u062a\u0647\u0631\u0627\]
But when I call temp[1] I get this:
کاخ موزه گلستان | Golestan Palace
How can I solve it and get my csv file?
Why does Python sometimes encode my data and sometimes not?
In another answer, you said you were using Python 2.7. Here is an extract from the Python Standard Library Reference Manual, csv module:
The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
The same paragraph gives an example of a class that can be used to deal with Unicode data:
# requires: import csv, codecs, cStringIO (Python 2)
class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and re-encode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty the queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
But you could also try simpler code:

import csv

with open("output.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerow([u.encode('utf-8') for u in temp])

if temp is a flat list of unicode strings. (Note writerow, not writerows: writerows would treat each string as a separate row and split it into individual characters.)
or:

import csv

with open("output.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows([[u.encode('utf-8') for u in row] for row in temp])

if temp is a list of lists of unicode strings.
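In Python 3 this problem disappears entirely, because the csv module is Unicode-aware. A minimal sketch (utf-8-sig is an assumption that helps Excel detect the encoding):

```python
import csv

temp = [u'سلام', u'چطوری']

# newline='' is the csv module's recommended way to open files in Python 3;
# the utf-8-sig codec writes a BOM so that Excel detects the encoding
with open('output.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(temp)  # both strings on a single row
```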
The csv library in Python 2 cannot handle Unicode data. This is fixed in Python 3, but will not be backported. However, there is a drop-in replacement 3rd party library that fixes the problem.
Try using UnicodeCSV instead.

Python process a csv file to remove unicode characters greater than 3 bytes

I'm using Python 2.7.5 and trying to take an existing CSV file and process it to remove unicode characters that are greater than 3 bytes. (Sending this to Mechanical Turk, and it's an Amazon restriction.)
I've tried to use the top (amazing) answer in this question (How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?). I assume that I can just iterate through the csv row-by-row, and wherever I spot unicode characters of >3 bytes, replace them with a replacement character.
# -*- coding: utf-8 -*-
import csv
import re

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

ifile = open('sourcefile.csv', 'rU')
reader = csv.reader(ifile, dialect=csv.excel_tab)
ofile = open('outputfile.csv', 'wb')
writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)

# skip header row
next(reader, None)

for row in reader:
    writer.writerow([re_pattern.sub(u'\uFFFD', unicode(c).encode('utf8')) for c in row])

ifile.close()
ofile.close()
I'm currently getting this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 264: ordinal not in range(128)
So this does iterate properly through some rows, but stops when it gets to the strange unicode characters.
I'd really appreciate some pointers; I'm completely confused. I've tried replacing 'utf8' with 'latin1', and unicode(c).encode with unicode(c).decode, and I keep getting the same error.
Your input is still encoded data, not Unicode values. You'd need to decode to unicode values first, but you didn't specify an encoding to use. You then need to encode again back to encoded values to write back to the output CSV:
writer.writerow([re_pattern.sub(u'\uFFFD', unicode(c, 'utf8')).encode('utf8')
                 for c in row])
Your error stems from the unicode(c) call; without an explicit codec to use, Python falls back to the default ASCII codec.
If you use your file objects as context managers, there is no need to manually close them:
import csv
import re

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def limit_to_BMP(value, patt=re_pattern):
    return patt.sub(u'\uFFFD', unicode(value, 'utf8')).encode('utf8')

with open('sourcefile.csv', 'rU') as ifile, open('outputfile.csv', 'wb') as ofile:
    reader = csv.reader(ifile, dialect=csv.excel_tab)
    writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    next(reader, None)  # the header is not added to the output file
    writer.writerows(map(limit_to_BMP, row) for row in reader)
I moved the replacement action to a separate function too, and used a generator expression to produce all rows on demand for the writer.writerows() function.
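The same filtering works on plain strings, which is an easy way to sanity-check the pattern (a Python 3 sketch, where no encode/decode dance is needed):

```python
import re

# Characters outside the Basic Multilingual Plane (> 3 bytes in UTF-8)
# are replaced with U+FFFD; everything else passes through untouched.
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

print(re_pattern.sub(u'\uFFFD', u'plain text, кириллица, 漢字'))  # unchanged
print(re_pattern.sub(u'\uFFFD', u'emoji \U0001F600 here'))      # emoji \ufffd here
```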
