Write zlib compressed utf8 data to a file - python

I have a file with data encoded in UTF-8. I would like to read the data, remove whitespace, separate the words with newlines, compress the entire content, and write it to a file. This is what I am trying to do:
with codecs.open('1020104_4.utf8', encoding='utf8', mode='r') as fr:
    data = re.split(r'\s+', fr.read().encode('utf8'))
    #with codecs.open('out2', encoding='utf8', mode='w') as fw2:
    data2 = ('\n'.join(data)).decode('utf8')
    data3 = zlib.compress(data2)
    #fw2.write(data3)
However, I get an error:
Traceback (most recent call last):
File "tmp2.py", line 17, in <module>
data3 = zlib.compress(data2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 47-48: ordinal not in range(128)
How can I write this data to a file?

I think your encoding-foo is just the wrong way round; in Python 3 this would be a lot clearer ☺.
First, you want to split on decoded data, i.e. on Unicode strings. Since you are using codecs.open, read already gives you those, so the first line should be:
data = re.split(r'\s+', fr.read())
Consequently, before passing data to zlib you want to convert it to bytes by encoding it:
data2 = ('\n'.join(data)).encode('utf8')
data3 = zlib.compress(data2)
In the last step you want to write it to a binary file handle:
with open("output", "wb") as fw:
fw.write(data3)
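To sanity-check the result, here is a quick round-trip sketch (assuming the data and filenames from above): decompressing the file and decoding should recover the newline-joined text.

with open("output", "rb") as fr2:
    restored = zlib.decompress(fr2.read()).decode('utf8')
assert restored == '\n'.join(data)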
You can shorten this a bit by using the gzip module instead:
with codecs.open('1020104_4.utf8', encoding='utf8', mode='r') as fr:
    data = re.split(r'\s+', fr.read())

with gzip.open('out2', mode='wb') as fw2:
    data2 = ('\n'.join(data)).encode('utf8')
    fw2.write(data2)
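Reading it back is symmetric (a sketch): gzip.open decompresses transparently on read, so only the UTF-8 decode is left to you.

with gzip.open('out2', mode='rb') as fr2:
    text = fr2.read().decode('utf8')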

Related

Reading Korean through a CSV in Python

I am having an issue reading a CSV file into Python containing English and Korean characters; I have tested my code without the Korean and it works fine.
Code (Python - 3.6.4)
import csv
with open('Kor3.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)
print(your_list)
Error
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 2176: character maps to <undefined>
CSV File Output: This has been converted from Excel to Unicode text, then the filename changed to CSV. I think this is the root of the problem.
Would it be better to read from an Excel or another format?
Sample Input (2 Columns)
생일 축하해요 Happy birthday
Just declare the encoding when opening the file:
with open('Kor3.csv', 'r', encoding='utf-8') as f:
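Putting it together with your code, a minimal sketch (note that Excel's "Unicode Text" export is often UTF-16 rather than UTF-8, so if utf-8 still fails, encoding='utf-16' is worth trying):

import csv

with open('Kor3.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f)
    your_list = list(reader)
print(your_list)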
Use Python 3; its csv functions read Unicode by default.
In the end I just went with importing from the Excel file; I think this was an issue with the CSV rather than Python. Thanks for your help.
from xlrd import open_workbook

wb = open_workbook('Korean.xlsx')
values = []
for s in wb.sheets():
    #print('Sheet:', s.name)
    for row in range(1, s.nrows):
        col_names = s.row(0)
        col_value = []
        for name, col in zip(col_names, range(s.ncols)):
            value = s.cell(row, col).value
            try:
                value = str(int(value))
            except:
                pass
            col_value.append(value)
        values.append(col_value)
print(values)  # test
print(values[0][1], values[1][1])  # test2
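For comparison, a shorter sketch of the same Excel read using pandas (an assumption: pandas and an Excel engine such as xlrd must be installed):

import pandas as pd

df = pd.read_excel('Korean.xlsx')  # header row handled automatically
print(df.values.tolist())          # a comparable list-of-rows structure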

UTF-8 decoding with Python

I have a csv with some data, and in one row there is text that was added after being encoded in UTF-8.
This is the text:
"b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'"
I'm trying to use this text to obtain the original characters using the decode function, but it's impossible.
Does anyone know which is the correct procedure to do it?
Assuming that the line in your file is exactly like this:
b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'
And reading the line from the file gives the output:
>>> line
"b'\\xe7\\x94\\xb3\\xe8\\xbf\\xaa\\xe8\\xa5\\xbf\\xe8\\xb7\\xaf255\\xe5\\xbc\\x84660\\xe5\\x8f\\xb7\\xe5\\x92\\x8c665\\xe5\\x8f\\xb7 \\xe4\\xb8\\xad\\xe5\\x9b\\xbd\\xe4\\xb8\\x8a\\xe6\\xb5\\xb7\\xe6\\xb5\\xa6\\xe4\\xb8\\x9c\\xe6\\x96\\xb0\\xe5\\x8c\\xba 201205'"`
You can try to use the eval() function:
with open(r"your_csv.csv", "r") as csvfile:
for line in csvfile:
# when you reach the desired line
b = eval(line).decode('utf-8')
Output:
>>> print(b)
'申迪西路255弄660号和665号 中国上海浦东新区 201205'
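Note that eval() executes whatever the line contains, so it is risky on untrusted files; ast.literal_eval is a safer equivalent for a plain bytes literal like this (a sketch under the same assumption about the file's contents):

import ast

with open(r"your_csv.csv", "r") as csvfile:
    for line in csvfile:
        # parses the b'...' literal without executing arbitrary code
        b = ast.literal_eval(line).decode('utf-8')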
Try this:
a = b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'
print(a.decode('utf-8')) #your decoded output
Since you say you are reading from a file, you can try passing the encoding when reading:
import codecs

f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print(repr(line))

Reading russian language data from csv

I have some data in CSV file that are in Russian:
2-комнатная квартира РДТ', мкр Тастак-3, Аносова — Толе би;Алматы
2-комнатная квартира БГР', мкр Таугуль, Дулати (Навои) — Токтабаева;Алматы
2-комнатная квартира ЦФМ', мкр Тастак-2, Тлендиева — Райымбека;Алматы
The delimiter is the ; symbol.
I want to read the data and put it into an array. I tried to read this data using this code:
def loadCsv(filename):
    lines = csv.reader(open(filename, "rb"), delimiter=";")
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [str(x) for x in dataset[i]]
    return dataset
Then I read and print the result:
mydata = loadCsv('krish(csv3).csv')
print mydata
Output:
[['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-3, \xc0\xed\xee\xf1\xee\xe2\xe0 \x97 \xd2\xee\xeb\xe5 \xe1\xe8', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf3\xe3\xf3\xeb\xfc, \xc4\xf3\xeb\xe0\xf2\xe8 (\xcd\xe0\xe2\xee\xe8) \x97 \xd2\xee\xea\xf2\xe0\xe1\xe0\xe5\xe2\xe0', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-2, \xd2\xeb\xe5\xed\xe4\xe8\xe5\xe2\xe0 \x97 \xd0\xe0\xe9\xfb\xec\xe1\xe5\xea\xe0', '\xc0\xeb\xec\xe0\xf2\xfb']]
I found that in this case codecs are required and tried to do the same with this code:
import codecs
with codecs.open('krish(csv3).csv', 'r', encoding='utf8') as f:
    text = f.read()
    print text
I got this error:
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 2: invalid continuation byte
What is the problem? When using codecs, how do I specify the delimiter in my data?
I just want to read the data from the file and put it into a 2-dimensional array.
\xea is the windows-1251 / cp5347 encoding for к. Therefore, you need to use windows-1251 decoding, not UTF-8.
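You can check this in an interactive session:
>>> '\xea'.decode('windows-1251')
u'\u043a'
>>> print '\xea'.decode('windows-1251')
к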
In Python 2.7, the csv library does not support Unicode properly; see the "Unicode" section of https://docs.python.org/2/library/csv.html
The documentation proposes a simple workaround:
class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self
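Note that UnicodeReader relies on the UTF8Recoder helper defined on the same documentation page; for completeness (it additionally needs import codecs):

class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")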
This would allow you to do:
def loadCsv(filename):
    lines = UnicodeReader(open(filename, "rb"), delimiter=";", encoding="windows-1251")
    # If you really need lists, uncomment the next line;
    # it will let you fetch exact lines by doing `line_12 = lines[12]`:
    # return list(lines)
    # Otherwise this returns an "iterator", so the file is read on demand;
    # use this if you'll do a `for x in lines`:
    return lines
If you try to print dataset, you'll get the representation of a list of lists, where the outer list holds the rows and each inner list holds the columns. Any encoded bytes or literals will be represented with \x or \u escapes. To print the values, do:
for csv_line in loadCsv("myfile.csv"):
    print u", ".join(csv_line)
If you need to write your results to another file (fairly typical), you could do:
with io.open("my_output.txt", "w", encoding="utf-8") as my_ouput:
for csv_line in loadCsv("myfile.csv"):
my_output.write(u", ".join(csv_line))
This will automatically convert/encode your output to UTF-8.
You can try:
import pandas as pd
pd.read_csv(path_file, encoding="cp1251", sep=";")
or
import csv
with open(path_file, encoding="cp1251", errors='ignore') as source_file:
    reader = csv.reader(source_file, delimiter=";")
Could your .csv be in another encoding, not UTF-8? (Considering the error message, it must be.) Try other Cyrillic encodings such as Windows-1251, CP866, or KOI8.
In Python 3:
import csv

path = 'C:/Users/me/Downloads/sv.csv'
with open(path, encoding="UTF8") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

export a list as a csv file in python and getting UnicodeEncodeError

I want to get a csv file from my list.
This is my list:
temp = ['سلام', 'چطوری']
Members of list are in Persian language.
I tried to get the csv file with this code:
import csv
with open("output.csv", "wb") as f:
writer = csv.writer(f)
writer.writerows(temp)
but terminal gives me this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u06a9' in position 0: ordinal not in range(128)
How can I solve it and get my csv file?
P.S.
Actually, when I print temp, I see these strings:
[u'\u06a9\u0627\u062e \u0645\u0648\u0632\u0647 \u06af\u0644\u0633\u062a\u0627\u0646 | Golestan Palace', u'\u062a\u0647\u0631\u0627\u0646', u'\u062a\u0647\u0631\u0627\]
But when I print temp[1] I get this:
کاخ موزه گلستان | Golestan Palace
Why does Python sometimes encode my data and sometimes not?
In another answer, you said you were using Python 2.7. Extract from the Python Standard Library Reference Manual, csv module:
The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
The same paragraph gives an example of a class that can be used to deal with Unicode data:
import csv
import codecs
import cStringIO

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
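A usage sketch, assuming temp is the list of unicode strings from the question and each string should become its own single-column row:

with open("output.csv", "wb") as f:
    writer = UnicodeWriter(f)
    writer.writerows([[u] for u in temp])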
But you could also try simpler code:
import csv
with open("output.csv", "wb") as f:
writer = csv.writer(f)
writer.writerows([u.encode('utf-8') for u in temp])
if temp is a list of unicode strings
or:
import csv
with open("output.csv", "wb") as f:
writer = csv.writer(f)
writer.writerows([[u.encode('utf-8') for u in row] for row in temp])
if temp is a list of lists of unicode strings.
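For reference, in Python 3 the csv module handles Unicode natively, so none of this is needed; a sketch (open in text mode with an explicit encoding and newline='', as the csv docs recommend):

import csv

temp = ['سلام', 'چطوری']
with open('output.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(temp)  # one row, one column per string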
The csv library in Python 2 cannot handle Unicode data. This is fixed in Python 3, but will not be backported. However, there is a drop-in replacement 3rd party library that fixes the problem.
Try using UnicodeCSV instead.

Python process a csv file to remove unicode characters greater than 3 bytes

I'm using Python 2.7.5 and trying to take an existing CSV file and process it to remove unicode characters that are greater than 3 bytes. (Sending this to Mechanical Turk, and it's an Amazon restriction.)
I've tried to use the top (amazing) answer in this question (How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?). I assume that I can just iterate through the csv row-by-row, and wherever I spot unicode characters of >3 bytes, replace them with a replacement character.
# -*- coding: utf-8 -*-
import csv
import re

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

ifile = open('sourcefile.csv', 'rU')
reader = csv.reader(ifile, dialect=csv.excel_tab)
ofile = open('outputfile.csv', 'wb')
writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)

# skip header row
next(reader, None)

for row in reader:
    writer.writerow([re_pattern.sub(u'\uFFFD', unicode(c).encode('utf8')) for c in row])

ifile.close()
ofile.close()
I'm currently getting this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 264: ordinal not in range(128)
So this does iterate properly through some rows, but stops when it gets to strange Unicode characters.
I'd really appreciate some pointers; I'm completely confused. I've tried replacing 'utf8' with 'latin1', and unicode(c).encode with unicode(c).decode, and I keep getting the same error.
Your input is still encoded data, not Unicode values. You'd need to decode to unicode values first, but you didn't specify an encoding to use. You then need to encode again back to encoded values to write back to the output CSV:
writer.writerow([re_pattern.sub(u'\uFFFD', unicode(c, 'utf8')).encode('utf8')
                 for c in row])
Your error stems from the unicode(c) call; without an explicit codec to use, Python falls back to the default ASCII codec.
If you use your file objects as context managers, there is no need to manually close them:
import csv
import re

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def limit_to_BMP(value, patt=re_pattern):
    return patt.sub(u'\uFFFD', unicode(value, 'utf8')).encode('utf8')

with open('sourcefile.csv', 'rU') as ifile, open('outputfile.csv', 'wb') as ofile:
    reader = csv.reader(ifile, dialect=csv.excel_tab)
    writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    next(reader, None)  # header is not added to output file
    writer.writerows(map(limit_to_BMP, row) for row in reader)
I moved the replacement action to a separate function too, and used a generator expression to produce all rows on demand for the writer.writerows() function.
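As a quick sanity check of the pattern itself, a sketch (Python 2; the input is a hypothetical string containing a 4-byte emoji; note that on a "narrow" build the emoji is stored as a surrogate pair, so it may be replaced by two U+FFFD characters):

# -*- coding: utf-8 -*-
import re

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

raw = u'caf\xe9 \U0001F600'.encode('utf8')  # "café" plus a 4-byte emoji
cleaned = re_pattern.sub(u'\uFFFD', unicode(raw, 'utf8')).encode('utf8')
print repr(cleaned)  # emoji replaced; only sequences of <= 3 bytes remain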
