How to write exact wordings of unicode characters into a file?

I want to write "සිවු අවුරුදු පාටමාලාව" with the exact wording into a JSON file using Python 3.6, but instead \u0dc3\u0dd2\u0dc3\u0dd4\u0db1\u0dca\u0da7 \u0dc3\u0dd2\u0dc0\u0dd4 is written into the JSON file.
I read an Excel file using xlrd and write the output using open().
import xlrd
import json
wb = xlrd.open_workbook('data.xlsx',encoding_override='utf-8')
sheet = wb.sheet_by_index(0)
with open('data.json', 'w') as outfile:
    data = json.dump(outerdata,outfile,ensure_ascii=True)

If I do this in Python with the escape string you report:
>>> print ("\u0dc3\u0dd2\u0dc3\u0dd4\u0db1\u0dca\u0da7 \u0dc3\u0dd2\u0dc0\u0dd4")
සිසුන්ට සිවු
you will see that the escapes do render as the characters you want. These are two different representations of the same data, and both are valid in JSON. But you are using json.dump() and you have specified ensure_ascii=True, which tells json.dump() that you want the representation with escapes: every non-ASCII character (anything above chr(127)) is written as a \uXXXX escape, so the output contains only ASCII. Change that to ensure_ascii=False.
But because you are then no longer writing pure ASCII to your output file data.json, you need to specify an encoding when you open it:
with open("data.json", "w", encoding="utf-8") as outfile:
    json.dump(outerdata, outfile, ensure_ascii=False)
This will make your JSON file look the way you want it to look.
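To see the two representations side by side without touching a file, here is a minimal sketch using json.dumps(). The `sample` dictionary is a hypothetical stand-in, since the question does not show how `outerdata` is built.

```python
import json

# Hypothetical sample data; the question's `outerdata` is not shown.
sample = {"title": "\u0dc3\u0dd2\u0dc0\u0dd4"}

# ensure_ascii=True (the default) escapes every non-ASCII character.
escaped = json.dumps(sample, ensure_ascii=True)

# ensure_ascii=False keeps the characters as they are.
literal = json.dumps(sample, ensure_ascii=False)

print(escaped)
print(literal)

# Both strings decode back to the same Python object.
assert json.loads(escaped) == json.loads(literal) == sample
```

Either form is read back identically by a JSON parser; the choice only affects how the file looks in an editor.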

Related

Compare unicode string with byte string

Version: Python 2.7
I'm reading values from a Unicode CSV file and looping through to find a particular product code - a string. The variable p is from the CSV file.
import binascii
sku = '1450' # sku can contain spaces.
print p, '|', sku
print p == '1450'
print binascii.hexlify(p), '|', binascii.hexlify(sku)
print binascii.hexlify(p) == binascii.hexlify(sku)
print 'repr(p): ', repr(p)
which results in
1450 | 1450
False
003100340035003000 | 31343530
False
repr(p): '\x001\x004\x005\x000\x00'
Q1. What is a future-proof way (for version 3, etc.) to successfully compare?
Q2. The Unicode is little-endian. Why have I got 00 at both ends of the Unicode hex?
Note: attempts at converting to Unicode - u'1450' - don't seem to have any effect on the output.
Thanks.

This is probably much easier in Python 3 because of the change in how strings are handled.
Try opening your file with the encoding specified and pass the file object to the csv library (see the examples in the csv module documentation):
import csv
with open('some.csv', newline='', encoding='UTF-16LE') as fh:
    reader = csv.reader(fh)
    for row in reader:  # reader is iterable
        print(row)  # work with row
After some comments, it turns out the file is read from an FTP server.
Switching the transfer to FTP binary mode and reading through an io.TextIOWrapper() may work, now with even more context managers:
import io
import csv
from ftplib import FTP
with FTP("ftp.example.org") as ftp:
    ftp.login()  # assumption: anonymous login; pass user and passwd if the server requires them
    with io.BytesIO() as binary_buffer:
        # read all of products.csv into a binary buffer
        ftp.retrbinary("RETR products.csv", binary_buffer.write)
        binary_buffer.seek(0)  # rewind file pointer
        # create a text wrapper to associate an encoding with the file-like for reading
        with io.TextIOWrapper(binary_buffer, encoding="UTF-16LE", newline="") as csv_string:
            for row in csv.reader(csv_string):
                print(row)  # work with row
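The same idea can be tried without an FTP server. The sketch below builds a small UTF-16LE payload in memory (the rows are made up for illustration) and shows that once the bytes are decoded, the comparison from the question behaves as expected in Python 3.

```python
import csv
import io

# Hypothetical stand-in for the downloaded file: two CSV rows encoded as UTF-16LE.
raw_bytes = "sku,name\r\n1450,Widget\r\n".encode("utf-16-le")

# The undecoded bytes are not equal to the ASCII bytes of the same text.
assert raw_bytes != b"sku,name\r\n1450,Widget\r\n"

# Wrap the bytes in a text layer so csv.reader receives decoded str values.
with io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="utf-16-le", newline="") as text_stream:
    rows = list(csv.reader(text_stream))

sku = "1450"
# Plain str == str comparison now works.
matches = [row for row in rows if row[0] == sku]
print(matches)
```

The point is to decode once at the boundary (file or network) and compare text with text afterwards.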

Saving data from python to excel file as CSV UTF-8 file format

I have been trying to save data as a file of type CSV UTF-8 (Comma delimited) (*.csv), which is different from the normal CSV (Comma delimited) (*.csv) file. The CSV UTF-8 type displays Unicode text correctly when opened in Excel. I can easily save a file in that format from Excel, but from Python I am only able to save it as a normal CSV. That does not cause any loss of data, but when the file is opened it shows text like "à¤à¤‰à¤Ÿà¤¾" instead of "एउटा".
If I open the file in Notepad, copy the text into an Excel file, and then manually save it as CSV UTF-8, the correct display is preserved. But doing so is time-consuming, since all values appear on the same line in Notepad and I have to separate them in Excel.
So I just want to know how I can save data in Excel's CSV UTF-8 format using Python.
I have tried the following code, but it results in a normal CSV file.
import codecs
import unicodecsv as csv
input_text = codecs.open('input.txt', encoding='utf-8')
all_text = input_text.read()
text_list = all_text.split()
output_list = [['Words','Tags']]
for input_word in text_list:
    word_tag_list = [input_word,'O']
    output_list.append(word_tag_list)
with codecs.open("output.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(output_list)

You need to indicate to Excel that this is a UTF-8 file. Unfortunately the only way to do this is by prepending a special byte sequence (the byte-order mark, BOM) to the front of the file. Python will do this automatically if you use the utf_8_sig encoding.
with codecs.open("output.csv", "w", encoding="utf_8_sig") as f:
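For Python 3, where the built-in csv module handles text directly, a minimal sketch could look like this. The rows and the temporary file path are illustrative assumptions, not taken from the question.

```python
import csv
import os
import tempfile

# Hypothetical rows mirroring the question's Words/Tags layout.
rows = [["Words", "Tags"], ["एउटा", "O"]]

path = os.path.join(tempfile.mkdtemp(), "output.csv")

# utf-8-sig prepends the BOM bytes EF BB BF; newline="" is recommended by the csv docs.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# Inspect the raw bytes to confirm the BOM is there.
with open(path, "rb") as f:
    first_bytes = f.read(3)
print(first_bytes)  # b'\xef\xbb\xbf'

# Reading back with the same codec strips the BOM again.
with open(path, encoding="utf-8-sig", newline="") as f:
    round_trip = list(csv.reader(f))
```

The three leading bytes are what Excel looks for when deciding to treat the file as UTF-8.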

I have found the answer. encoding="utf_8_sig" should be passed to the csv.writer call (unicodecsv's writer, imported as csv above) to write the file as CSV UTF-8. The previous code can be written as:
with open("output.csv", "wb") as f:
    writer = csv.writer(f, dialect='excel', encoding='utf_8_sig')
    writer.writerows(output_list)
However, there was a problem when the data had a comma at the end, e.g. "भने,". In this case I didn't need the comma, so I removed it with the following code inside the for loop:
import re
if re.search(r'.,$',input_word):
    input_word = re.sub(',$','',input_word)
Finally, I was able to obtain the desired output, with Unicode characters displayed correctly and the extra trailing comma removed. If anyone knows how to ignore a comma at the end of the data in an Excel file, please comment here. Thanks.
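The comma-stripping step can also be folded into one small helper. This is only a sketch of the same logic; the function name `strip_trailing_comma` is made up for illustration.

```python
import re

def strip_trailing_comma(word):
    """Drop one trailing comma, but only if some other character precedes it."""
    # The lookbehind requires a character before the comma, like r'.,$' in the answer.
    return re.sub(r"(?<=.),$", "", word)

print(strip_trailing_comma("भने,"))
```

A word that consists of nothing but a comma is left unchanged, matching the re.search() guard above.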

Mixed encoding in csv file

I have a fairly large database (10,000+ records with about 120 vars each) in R. The problem is that about half of the variables in the original .csv file were correctly encoded in UTF-8, while the rest were encoded in ANSI (Windows-1252) but are being decoded as UTF-8, resulting in weird characters for non-ASCII characters (mainly Latin) like Ã© or Ã³.
I cannot simply change the file encoding, because half of it would then be decoded with the wrong type. Furthermore, I have no way of knowing which columns were encoded correctly and which ones weren't, and all I have is the original .csv file, which I'm trying to fix.
So far I have found that a plain text file can be encoded in UTF-8 and misinterpreted characters (bad Unicode) can be inferred. One library that provides such functionality is ftfy for Python. However, I'm using the following code and so far, haven't had success:
import ftfy
file = open("file.csv", "r", encoding = "UTF8")
content = file.read()
content = ftfy.fix_text(content)
However, content shows exactly the same text as before. I believe this has to do with the way ftfy infers the content encoding.
Nevertheless, if I run ftfy.fix_text("PÃºblica que cotiza en MÃ©xico") it shows the right response:
>> 'Pública que cotiza en México'
I'm thinking that maybe the way to solve the problem is to iterate through each of the values (cells) in the .csv file, try to fix each one with ftfy, and then import the file back into R, but that seems a little complicated.
Any suggestions?

In fact, there was mixed encoding in random cells in several places. Probably there was an issue when exporting the data from its original source.
The problem with ftfy is that it processes the file line by line, and if it encounters well-formed characters, it assumes that the whole line is encoded in the same way and that the strange characters were intended.
Since these errors appeared randomly throughout the file, I wasn't able to transpose the whole table and process every line (column), so the answer was to process cell by cell. Fortunately, Python has a standard library module that makes it painless to work with csv (especially because it escapes cells correctly).
This is the code I used to process the file:
import csv
import ftfy
import sys
def main(argv):
    # input file
    csvfile = open(argv[1], "r", encoding = "UTF8")
    reader = csv.DictReader(csvfile)
    # output stream
    outfile = open(argv[2], "w", encoding = "Windows-1252") # Windows doesn't like utf8
    writer = csv.DictWriter(outfile, fieldnames = reader.fieldnames, lineterminator = "\n")
    # clean values
    writer.writeheader()
    for row in reader:
        for col in row:
            row[col] = ftfy.fix_text(row[col])
        writer.writerow(row)
    # close files
    csvfile.close()
    outfile.close()
if __name__ == "__main__":
    main(sys.argv)
And then, calling:
$ python fix_encoding.py data.csv out.csv
will output a csv file with the right encoding.
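To illustrate why working cell by cell helps, here is a rough standard-library sketch of the underlying idea. It is not ftfy's algorithm, only a simplified stand-in: it assumes the damage is always UTF-8 bytes mis-decoded as Windows-1252, and the helper name `fix_cell` is hypothetical.

```python
def fix_cell(cell):
    """Undo UTF-8 text that was mis-decoded as Windows-1252; otherwise return the cell unchanged."""
    try:
        # Recover the original bytes, then decode them as the UTF-8 they really were.
        return cell.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Already-correct text usually fails the round trip, so it is left alone.
        return cell

print(fix_cell("PÃºblica que cotiza en MÃ©xico"))
print(fix_cell("Pública que cotiza en México"))
```

Because each cell is judged on its own, a correctly encoded cell in the same row does not stop a broken one from being repaired. ftfy covers many more cases than this sketch.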

A small suggestion: divide and conquer.
Try using one tool (ftfy?) to bring the whole file to the same encoding (and save it as a plain-text file), and only then try parsing it as CSV.

Raw string for variables in python?

I have seen several similar posts on this but nothing has solved my problem.
I am reading a list of numbers with backslashes and writing them to a .csv. Obviously the backslashes are causing problems.
from csv import writer
addr = "6253\342\200\2236387"
with open("output.csv", 'a') as w:
    write = writer(w)
    write.writerow([addr])
I found that using r"6253\342\200\2236387" gave me exactly what I want for the output, but since I am reading my input from a file I can't use a raw string. I tried .encode('string-escape'), but that gave me 6253\xe2\x80\x936387 as output, which is definitely not what I want. unicode-escape gave me an error. Any thoughts?

The r in front of a string is only for defining a string. If you're reading data from a file, it's already 'raw'. You shouldn't have to do anything special when reading in your data.
Note that if your data is not plain ascii, you may need to decode it or read it in binary. For example, if the data is utf-8, you can open the file like this before reading:
import codecs
f = codecs.open("test", "r", "utf-8")
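A quick way to convince yourself of this is to round-trip the value through a file. The sketch below is written for Python 3; the temporary file stands in for the real input, which the question does not show.

```python
import csv
import io
import os
import tempfile

# The same characters typed as a raw literal and as a normal literal.
raw_literal = r"6253\342\200\2236387"
normal_literal = "6253\342\200\2236387"  # octal escapes are interpreted here

# Hypothetical input file containing the backslashes as plain characters.
path = os.path.join(tempfile.mkdtemp(), "input.txt")
with open(path, "w", encoding="ascii") as f:
    f.write(raw_literal + "\n")

# Text read from a file is never escape-processed.
with open(path, encoding="ascii") as f:
    addr = f.readline().strip()

# Write the value with csv.writer into an in-memory buffer.
buffer = io.StringIO()
csv.writer(buffer).writerow([addr])
print(buffer.getvalue())
```

Escape processing only happens when Python parses a string literal in source code, which is why the hard-coded `addr` in the question behaves differently from data read at run time.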

The text file contains:
1234\4567\7890
41\5432\345\6789
Code:
import csv
with open('c:/tmp/numbers.csv', 'ab') as w:
    f = open(textfilepath)
    wr = csv.writer(w)
    for line in f:
        line = line.strip()
        wr.writerow([line])
    f.close()
This produced a CSV with each whole line in a single column. Maybe use 'ab' rather than 'a' as your file open mode; I was getting extra blank records in my CSV when using just 'a'. (In Python 3, open the file in text mode with newline='' instead.)

I created this a while back. It helps you write to a CSV file.
import csv
def write2csv(fileName,theData):
    # append one row; the with block closes the file afterwards
    with open(fileName+'.csv', 'a') as theFile:
        wr = csv.writer(theFile, delimiter = ',', quoting=csv.QUOTE_MINIMAL)
        wr.writerow(theData)

remove <feff> from a file

I am using this Python script to convert CSV to XML. After conversion I see <feff> tags in the text (in vim), which cause an XML parsing error.
I have already tried the answers from here, without success.
The converted XML file.
Thanks for any help!

Your input file has a BOM (byte-order mark), and Python doesn't strip it automatically when the file is decoded as utf8. See: Reading Unicode file data with BOM chars in Python
>>> s = '\xef\xbb\xbfABC'
>>> s.decode('utf8')
u'\ufeffABC'
>>> s.decode('utf-8-sig')
u'ABC'
So for your specific case, try something like
from io import StringIO
s = StringIO(open(csvFile).read().decode('utf-8-sig'))
csvData = csv.reader(s)
Very terrible style, but that script is a hacked-together script for a one-shot job anyway.

Change utf-8 to utf-8-sig:
import csv
with open('example.txt', 'r', encoding='utf-8-sig') as file:
    reader = csv.reader(file)  # the BOM is stripped while decoding
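The difference between the two codecs is easy to check in Python 3 on a few bytes. The sample data below is made up for illustration.

```python
import csv
import io

# Hypothetical file content: a UTF-8 BOM followed by one CSV row.
data = b"\xef\xbb\xbfname,value\r\n"

with_bom = data.decode("utf-8")         # the BOM survives as U+FEFF
without_bom = data.decode("utf-8-sig")  # the BOM is stripped

print(repr(with_bom))
print(repr(without_bom))

# With the BOM gone, the first header cell is clean.
first_row = next(csv.reader(io.StringIO(without_bom, newline="")))
print(first_row)
```

With plain utf-8 the U+FEFF character ends up glued to the first field, which is what later shows up as <feff> in the generated XML.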

Here's an example of a script that uses a real XML-aware library to run a similar conversion. It doesn't have the exact same output, but, well, it's an example -- salt to taste.
import csv
import lxml.etree
csvFile = 'myData.csv'
xmlFile = 'myData.xml'
reader = csv.reader(open(csvFile, 'r'))
with lxml.etree.xmlfile(xmlFile) as xf:
    xf.write_declaration(standalone=True)
    with xf.element('root'):
        for row in reader:
            row_el = lxml.etree.Element('row')
            for col in row:
                col_el = lxml.etree.SubElement(row_el, 'col')
                col_el.text = col
            xf.write(row_el)
To refer to the content of, say, row 2 column 3, you'd then use XPath like /root/row[2]/col[3]/text().
