Write numpy.ndarray with Russian characters to file


I am trying to write a numpy.ndarray to a file.
I use
unique1 = np.unique(df['search_term'])
unique1 = unique1.tolist()
and then try
1)
edf = pd.DataFrame()
edf['term'] = unique1
writer = pd.ExcelWriter(r'term.xlsx', engine='xlsxwriter')
edf.to_excel(writer)
writer.close()
and 2)
thefile = codecs.open('domain.txt', 'w', encoding='utf-8')
for item in unique1:
    thefile.write("%s\n" % item)
But both fail with UnicodeDecodeError: 'utf8' codec can't decode byte 0xd7 in position 9: invalid continuation byte

The second example should work if you encode the strings as UTF-8.
The following works in Python 2 when the source file is saved as UTF-8:
# -*- coding: utf-8 -*-
import pandas as pd
edf = pd.DataFrame()
edf['term'] = ['foo', 'bar', u'русском']
writer = pd.ExcelWriter(r'term.xlsx', engine='xlsxwriter')
edf.to_excel(writer)
writer.save()
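For comparison, here is a minimal Python 3 sketch of the second approach (one term per line in a UTF-8 text file). It assumes some terms may arrive as byte strings in cp1251, which is only a guess based on the Russian data in the question. The helper names to_text and write_terms and the sample values are made up for illustration.

```python
import io
import os
import tempfile


def to_text(value, source_encoding="cp1251"):
    """Decode byte strings to text; pass real text through unchanged.

    source_encoding is an assumption, replace it with the real encoding.
    """
    if isinstance(value, bytes):
        return value.decode(source_encoding)
    return value


def write_terms(path, terms):
    """Write one term per line as UTF-8 text."""
    with io.open(path, "w", encoding="utf-8") as handle:
        for term in terms:
            handle.write(u"%s\n" % to_text(term))


# Hypothetical sample: two text values and one cp1251 byte string.
terms = [u"foo", u"bar", u"русском".encode("cp1251")]
path = os.path.join(tempfile.mkdtemp(), "domain.txt")
write_terms(path, terms)

# Read the file back to confirm the round trip.
with io.open(path, "r", encoding="utf-8") as handle:
    round_trip = [line.rstrip(u"\n") for line in handle]
```

The point of the sketch is to decode once, at the boundary, so that everything written to the file is already text.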

Related

csv read raises "UnicodeDecodeError: 'charmap' codec can't decode..."

I've read every post I can find, but my situation seems unique. I'm totally new to Python so this could be basic. I'm getting the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 70: character maps to <undefined>
When I run the code:
import csv
input_file = 'input.csv'
output_file = 'output.csv'
cols_to_remove = [4, 6, 8, 9, 10, 11,13, 14, 19, 20, 21, 22, 23, 24]
cols_to_remove = sorted(cols_to_remove, reverse=True)
row_count = 0 # Current amount of rows processed
with open(input_file, "r") as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='') as result:
        writer = csv.writer(result)
        for row in reader:
            row_count += 1
            print('\r{0}'.format(row_count), end='')
            for col_index in cols_to_remove:
                del row[col_index]
            writer.writerow(row)
What am I doing wrong?

In Python 3, the csv module processes the file as unicode strings, and because of that has to first decode the input file. You can use the exact encoding if you know it, or just use Latin1 because it maps every byte to the unicode character with same code point, so that decoding+encoding keep the byte values unchanged. Your code could become:
...
with open(input_file, "r", encoding='Latin1') as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='', encoding='Latin1') as result:
...
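The Latin1 claim is easy to check. The sketch below (the function name survives_round_trip is made up for illustration) decodes and re-encodes bytes, and shows that Latin-1 preserves every byte value while cp1252 rejects the byte 0x8d from the error message. That the asker's default codec was cp1252 is an assumption based on the 'charmap' wording.

```python
def survives_round_trip(raw, encoding):
    """Return True if decoding then re-encoding gives back the same bytes."""
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except UnicodeError:
        return False


# Every possible byte value, 0x00 through 0xff.
every_byte = bytes(range(256))

# Latin-1 maps each byte to the code point with the same number.
latin1_ok = survives_round_trip(every_byte, "latin-1")

# 0x8d has no mapping in Python's cp1252 codec, so decoding fails.
cp1252_ok = survives_round_trip(b"\x8d", "cp1252")
```

This only guarantees that bytes pass through unchanged. If the file is really UTF-8, the text seen inside the program will still look wrong, so the exact encoding is preferable when it is known.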

Add encoding="utf8" when opening the files. Try the following instead:
with open(input_file, "r", encoding="utf8") as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='', encoding="utf8") as result:

Try pandas:
import pandas
df = pandas.read_csv('input.csv')
df.to_csv('output.csv', index=False)
If that still fails, try saving the input file again as CSV UTF-8.

ASCII Encode/Decode Error

I have csv files encoded as UTF-8. I need to convert them to an Excel workbook with the same encoding, but I have not been able to. I tried many things without success. Here is the code snippet.
Note: I am using the xlsxwriter package.
def csv_to_excel(input_file_path, output_file_path):
    file_path = input_file_path
    excel_file_path = output_file_path
    wb = Workbook(excel_file_path.encode('utf-8', 'ignore'), {'encoding': 'utf-8'})
    sheet1 = wb.add_worksheet(("anyname1").encode('utf-8','ignore'))
    sheet2 = wb.add_worksheet(("anyname2").encode('utf-8','ignore'))
    for filename in glob.glob(file_path):
        (f_path, f_name) = os.path.split(filename)
        w_tab = str(f_name.split('_')[2]).split('.')[0]
        if (w_tab=="anyname1"):
            w_sheet = sheet1
        elif (w_tab=="anyname2"):
            w_sheet = sheet2
        spamReader = csv.reader(open(filename, "rb"), delimiter=',',quotechar='"')
        row_count = 0
        for row in spamReader:
            for col in range(len(row)):
                w_sheet.write(row_count,col,row[col])
            row_count +=1
    try:
        os.remove(excel_file_path)
    except:
        pass
    wb.close()
    print "Converted CSVs to Excel File"
Errors:
Case 1: When I open the UTF-8 encoded csv file as follows:
spamReader = csv.reader(io.open(filename, "r", encoding = 'utf-8'), delimiter=',',quotechar='"')
I get this error while iterating over the spamReader object:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 92: ordinal not in range(128)
Case 2: When I open the same csv file as binary, as in the code snippet above, I cannot save it as a UTF-8 encoded Excel file. Calling wb.close() raises:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 12: ordinal not in range(128)
I have just started learning Python, so this may not be a big issue, but please help me with it.

UnicodeDecodeError while trying to change points into commas in Excel

I'm currently trying to run following code:
import csv
with open("C:\\Users\\User\\Downloads\\stu21617.rw2.xlsx", 'r', encoding='utf-8') as infile:
    with open("C:\\Users\\User\\Documents\\jens\\komma.xlsx", 'w', encoding='utf-8') as outfile:
        tabel = []
        writer = csv.writer(outfile, delimiter = ';')
        for rij in csv.reader(infile, delimiter = ';'):
            writer.writerow(rij.replace('.', ','))
But I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 15-16: invalid continuation byte
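A hedged note on the snippet above: an .xlsx file is a ZIP archive of XML parts, not delimited text, so opening it with a text codec will generally fail this way. The sketch below assumes the sheet has first been exported to a real semicolon-separated CSV. The function name points_to_commas and the sample data are made up for illustration. It also replaces per cell, since each row from csv.reader is a list and has no replace method.

```python
import csv
import io


def points_to_commas(source, target, delimiter=";"):
    """Copy CSV rows from source to target, turning '.' into ',' in every cell."""
    reader = csv.reader(source, delimiter=delimiter)
    writer = csv.writer(target, delimiter=delimiter, lineterminator="\n")
    for rij in reader:
        # rij is a list of strings, so replace inside each cell.
        writer.writerow([cell.replace(".", ",") for cell in rij])


# In-memory stand-ins for the exported CSV input and the output file.
infile = io.StringIO("naam;score\njens;7.5\n")
outfile = io.StringIO()
points_to_commas(infile, outfile)
result = outfile.getvalue()
```

With real files, the same function can be called with handles opened via open(path, newline='', encoding=...) using the encoding the export was saved in.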

Python JSON to CSV - bad encoding, UnicodeDecodeError: 'charmap' codec can't decode byte

I have a problem converting nested JSON to CSV. For this I use https://github.com/vinay20045/json-to-csv (forked a bit to support Python 3.4); here is the full json-to-csv.py file.
The conversion works if I set
#Base Condition
else:
    reduced_item[str(key)] = (str(value)).encode('utf8','ignore')
and
fp = open(json_file_path, 'r', encoding='utf-8')
but when I import the csv into MS Excel I see bad Cyrillic characters, for example \xe0\xf1; English text is OK.
I experimented with encode('cp1251','ignore'), but then I got an error:
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
(the same error as in the question "UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>")
import sys
import json
import csv
##
# This function converts an item like
# {
# "item_1":"value_11",
# "item_2":"value_12",
# "item_3":"value_13",
# "item_4":["sub_value_14", "sub_value_15"],
# "item_5":{
# "sub_item_1":"sub_item_value_11",
# "sub_item_2":["sub_item_value_12", "sub_item_value_13"]
# }
# }
# To
# {
# "node_item_1":"value_11",
# "node_item_2":"value_12",
# "node_item_3":"value_13",
# "node_item_4_0":"sub_value_14",
# "node_item_4_1":"sub_value_15",
# "node_item_5_sub_item_1":"sub_item_value_11",
# "node_item_5_sub_item_2_0":"sub_item_value_12",
# "node_item_5_sub_item_2_1":"sub_item_value_13"
# }
##
def reduce_item(key, value):
    global reduced_item
    #Reduction Condition 1
    if type(value) is list:
        i=0
        for sub_item in value:
            reduce_item(key+'_'+str(i), sub_item)
            i=i+1
    #Reduction Condition 2
    elif type(value) is dict:
        sub_keys = value.keys()
        for sub_key in sub_keys:
            reduce_item(key+'_'+str(sub_key), value[sub_key])
    #Base Condition
    else:
        reduced_item[str(key)] = (str(value)).encode('cp1251','ignore')
if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("\nUsage: python json_to_csv.py <node_name> <json_in_file_path> <csv_out_file_path>\n")
    else:
        #Reading arguments
        node = sys.argv[1]
        json_file_path = sys.argv[2]
        csv_file_path = sys.argv[3]
        fp = open(json_file_path, 'r', encoding='cp1251')
        json_value = fp.read()
        raw_data = json.loads(json_value)
        processed_data = []
        header = []
        for item in raw_data[node]:
            reduced_item = {}
            reduce_item(node, item)
            header += reduced_item.keys()
            processed_data.append(reduced_item)
        header = list(set(header))
        header.sort()
        with open(csv_file_path, 'wt+') as f:#wb+ for python 2.7
            writer = csv.DictWriter(f, header, quoting=csv.QUOTE_ALL, delimiter=',')
            writer.writeheader()
            for row in processed_data:
                writer.writerow(row)
        print("Just completed writing csv file with %d columns" % len(header))
How can I convert the Cyrillic text correctly? I would also like to skip bad characters.

You need to know which Cyrillic encoding the file you are opening uses.
For example, this is enough in Python 3:
with open(args.input_file, 'r', encoding="cp866") as input_file:
    data = input_file.read()
    structure = json.loads(data)
In Python 3 the data variable is a str that has already been decoded to Unicode. In Python 2 there might be a problem with feeding the input to json.
Also try printing a line in the Python interpreter to see whether the symbols are right. Without the input file it is hard to tell whether everything is correct. Are you sure this is a Python problem and not an Excel one? Did you try opening the file in Notepad++ or a similar encoding-aware editor?
The most important thing when working with encodings is checking that both the input and the output are right. I would suggest looking here.
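To make the input/output check concrete, here is a minimal Python 3 sketch. It assumes the JSON is stored as cp1251 and that the CSV is meant for Excel, and the sample data and the name csv_path are made up. Values stay as str (calling .encode() on them would make csv write the bytes repr, such as b'\xe0\xf1', into the cells), and the output file is opened with utf-8-sig, which puts a byte order mark at the start that Excel usually uses to detect UTF-8.

```python
import codecs
import csv
import json
import os
import tempfile

# Hypothetical input: JSON bytes stored in cp1251, as a legacy export might be.
raw_json = u'{"node": [{"name": "Привет", "city": "Москва"}]}'.encode("cp1251")

# Decode once on input, using the encoding the file really has.
structure = json.loads(raw_json.decode("cp1251"))
rows = structure["node"]
header = sorted(rows[0].keys())

# Encode once on output; the file object does it, not the individual values.
csv_path = os.path.join(tempfile.mkdtemp(), "out.csv")
with open(csv_path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, header, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)

# Inspect the raw bytes to confirm what was actually written.
with open(csv_path, "rb") as f:
    written = f.read()
```

If characters that the target encoding cannot represent should be dropped, open() accepts errors='ignore' alongside encoding, which is a cleaner place for that than encoding each value by hand.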

Maybe you could use chardet to detect the file's encoding.
import json
import chardet
File='arq.GeoJson'
enc=chardet.detect(open(File,'rb').read())['encoding']
with open(File,'r', encoding = enc) as f:
    data=json.load(f)
This avoids having to guess the encoding.
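If installing chardet is not an option, a cruder alternative is to try a short list of candidate encodings in order. This is only a sketch: the function name decode_with_fallback and the candidate list are assumptions, and the order matters, because a single-byte codec such as cp1251 accepts almost any input and so has to come after UTF-8.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1251")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    If none does, decode with the first candidate and replace bad bytes,
    returning None as the encoding to signal the lossy fallback.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw.decode(candidates[0], errors="replace"), None


# cp1251 bytes are not valid UTF-8, so the second candidate is used.
text, used = decode_with_fallback(u"Москва".encode("cp1251"))

# Valid UTF-8 is accepted by the first candidate.
text_utf8, used_utf8 = decode_with_fallback(u"Москва".encode("utf-8"))
```

Unlike a statistical detector, this cannot tell two single-byte Cyrillic encodings apart, so it is only reasonable when the set of possible encodings is small and known.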

how to write a unicode csv in Python 2.7

I want to write data to files where a row from a CSV should look like this list (directly from the Python console):
row = ['\xef\xbb\xbft_11651497', 'http://kozbeszerzes.ceu.hu/entity/t/11651497.xml', "Szabolcs Mag '98 Kft.", 'ny\xc3\xadregyh\xc3\xa1za', 'ny\xc3\xadregyh\xc3\xa1za', '4400', 't\xc3\xbcnde utca 20.', 47.935175, 21.744975, u'Ny\xedregyh\xe1za', u'Borb\xe1nya', u'Szabolcs-Szatm\xe1r-Bereg', u'Ny\xedregyh\xe1zai', u'20', u'T\xfcnde utca', u'Magyarorsz\xe1g', u'4405']
Py2k does not do Unicode, but I had a UnicodeWriter wrapper:
import csv, cStringIO, codecs
class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()
    def writerow(self, row):
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)
    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
However, these lines still produce the dreaded encoding error message below:
f.write(codecs.BOM_UTF8)
writer = UnicodeWriter(f)
writer.writerow(row)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 9: ordinal not in range(128)
What is there to do? Thanks!

You are passing bytestrings containing non-ASCII data in, and these are being decoded to Unicode using the default codec at this line:
self.writer.writerow([unicode(s).encode("utf-8") for s in row])
unicode(bytestring) with data that cannot be decoded as ASCII fails:
>>> unicode('\xef\xbb\xbft_11651497')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
Decode the data to Unicode before passing it to the writer:
row = [v.decode('utf8') if isinstance(v, str) else v for v in row]
This assumes that your bytestring values contain UTF-8 data instead. If you have a mix of encodings, try to decode to Unicode at the point of origin; where your program first sourced the data. You really want to do so anyway, regardless of where the data came from or if it already was encoded to UTF-8 as well.
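For readers on Python 3, where str is already Unicode and bytes is a separate type, the same idea can be sketched as follows. The helper name to_text and the shortened sample row are made up for illustration, and the byte strings are assumed to be UTF-8, as in the answer above.

```python
import csv
import io


def to_text(value, encoding="utf-8"):
    """Decode bytes to str; leave str, numbers and other values untouched."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A mixed row: UTF-8 bytes, already-decoded text, and a float.
row = [b"ny\xc3\xadregyh\xc3\xa1za", u"Ny\xedregyh\xe1za", 47.935175]

# Normalise everything to text before the csv module sees it.
clean = [to_text(v) for v in row]

# An in-memory text buffer stands in for a file opened with encoding="utf-8".
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(clean)
line = buffer.getvalue()
```

With a real file, opening it as open(path, "w", newline="", encoding="utf-8") lets the file object do the encoding, so no UnicodeWriter wrapper is needed.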
