convert unicode data into malayalam using python (\u0d35 format) - python

I have been doing topic modeling on Malayalam news articles. The topics are generated as escaped Unicode. The output is as follows:
u'0.021*"\u0d2a\u0d3f" + 0.021*"\u0d35\u0d3f\u0d36\u0d4d\u0d35\u0d02\u0d2d\u0d30\u0d28\u0d4d\u0d31\u0d46" + 0.021*"\u0d05\u0d26\u0d4d\u0d26\u0d47\u0d39\u0d02"'
I want to convert this into a readable string. Whenever file operations are involved, the output file just shows the same escaped string. What I want written to the file is:
0.021*"പി" + 0.021*"വിശ്വംഭരന്റെ" + 0.021*"അദ്ദേഹം"

This seems to work fine for me. Make sure the terminal you are printing to supports Malayalam text.
If you want to write it to a file, you probably need to encode it to UTF-8:
with open("some_file","wb") as f:
    f.write(u'0.021*"\u0d2a\u0d3f" + 0.021*"\u0d35\u0d3f\u0d36\u0d4d\u0d35\u0d02\u0d2d\u0d30\u0d28\u0d4d\u0d31\u0d46" + 0.021*"\u0d05\u0d26\u0d4d\u0d26\u0d47\u0d39\u0d02"'.encode("utf-8"))
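A minimal sketch of the same idea using io.open, which takes an encoding argument on both Python 2 and Python 3. The file name topics.txt and the shortened topic string are placeholders. The last part covers a separate case, which is only an assumption about your data: the text may literally contain a backslash, a u and four hex digits instead of being a real unicode string, and then the escapes have to be decoded first.

```python
import io
import os
import tempfile

# Shortened version of the topic string from the question.
topic = u'0.021*"\u0d2a\u0d3f" + 0.021*"\u0d05\u0d26\u0d4d\u0d26\u0d47\u0d39\u0d02"'

# Placeholder path; any writable location works.
path = os.path.join(tempfile.gettempdir(), "topics.txt")

# The encoding argument makes the file object do the UTF-8 encoding.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(topic + u"\n")

# Read it back to confirm nothing was lost.
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read().rstrip(u"\n")

# Assumed case: the data holds literal backslash-u escapes as bytes.
escaped = b'0.021*"\\u0d2a\\u0d3f"'
decoded = escaped.decode("unicode_escape")
```

Open the resulting file in an editor that reads UTF-8 and has a Malayalam font to see the readable text.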

Related

How can I write characters such as § into a file using Python?

This is my code for creating the string to be written ('result' is the variable that holds the final text):
fileobj = open('file_name.yml','a+')
begin = initial+":0 "
n_name = '"§'+tag+name+'§!"'
begin_d = initial+"_desc:0 "
n_desc = '"§3'+desc+'§!"'
title = ' '+begin + n_name
descript = ' '+begin_d + n_desc
result = title+'\n'+descript
print()
fileobj.close()
return result
This is my code for actually writing it into the file:
text = writing(initial, tag, name, desc)
override = inserter(fileobj, country, text)
fileobj.close()
fileobj = open('file_name.yml','w+')
fileobj.write(override)
fileobj.close()
(P.S: Override is a function which works perfectly. It returns a longer string to be written into the file.)
I have tried this with .txt and .yml files but in both cases, instead of §, this is what takes its place: xA7 (I cannot copy the actual text into the internet as it changes into the correct character. It is, however, appearing as xA7 in the file.) Everything else is unaffected, and the code runs fine.
Do let me know if I can improve the question in any way.
You're running into a problem called character encoding. There are two parts to the problem - first is to get the encoding you want in the file, the second is to get the OS to use the same encoding.
The most flexible and common encoding is UTF-8, because it can handle any Unicode character while remaining backwards compatible with the very old 7-bit ASCII character set. Most Unix-like systems like Linux will handle it automatically.
fileobj = open('file_name.yml','w+',encoding='utf-8')
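Note that the PYTHONIOENCODING environment variable only affects stdin, stdout and stderr, not files opened with open(). To make UTF-8 the default for open() as well, you can enable Python's UTF-8 mode (Python 3.7+) by setting the PYTHONUTF8=1 environment variable.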
Windows operating systems are a little trickier because they'll rarely assume UTF-8, especially if it's a Microsoft program opening the file. There's a magic byte sequence called a BOM (byte order mark) that will trigger Microsoft programs to use UTF-8 if it's at the beginning of a file. Python can add that automatically for you:
fileobj = open('file_name.yml','w+',encoding='utf_8_sig')
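To see the difference between the two encodings on disk, here is a small sketch for Python 3. The path and the sample text are made up for illustration; only the encoding names come from the answer above.

```python
import os
import tempfile

# Hypothetical sample text; \u00a7 is the section sign.
text = '"\u00a73Some description\u00a7!"'

# Placeholder path in the temp directory.
path = os.path.join(tempfile.gettempdir(), "encoding_demo.yml")

# Plain UTF-8: the section sign becomes the two bytes C2 A7.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "rb") as f:
    utf8_bytes = f.read()

# UTF-8 with signature: same bytes, preceded by the BOM EF BB BF.
with open(path, "w", encoding="utf-8-sig") as f:
    f.write(text)
with open(path, "rb") as f:
    bom_bytes = f.read()

print(utf8_bytes[:3])
print(bom_bytes[:3])
```

Reading a file back with encoding='utf-8-sig' strips the BOM again if one is present, so it is a reasonable choice for reading files of either kind.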

Wrong character encoding of rtf file

When I copy and paste the sentence How brave they’ll all think me at home! into a blank TextEdit rtf document on the Mac, it looks fine. But if I create an apparently identical rtf file programmatically and write the same sentence into it, on opening it in TextEdit it appears as How brave they’ll all think me at home! In the following code, output is OK, but the file when viewed in TextEdit has problems with the right single quotation mark (here used as an apostrophe), Unicode U+2019.
header = r"""{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf400
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\paperw11900\paperh16840\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\f0\fs24 \cf0 """
sen = 'How brave they’ll all think me at home!'
with open('staging.rtf', 'w+') as f:
    f.write(header)
    f.write(sen)
    f.write('}')
with open('staging.rtf') as f:
    output = f.read()
print(output)
I’ve discovered from https://www.i18nqa.com/debug/utf8-debug.html that this may be caused by “UTF-8 bytes being interpreted as Windows-1252”, and that makes sense as it seems that ansicpg1252 in the header indicates US Windows.
But I still can’t work out how to fix it, even having read the similar issue here: Encoding of rtf file. I’ve tried replacing ansi with mac without effect. And adding ,encoding='utf8' to the open function doesn’t seem to help either.
(The reason for using rtf by the way is to be able to export sentences with colour-coded words, allow them to be manually edited, then read back in for further processing).
OK, I've found the answer myself. I needed to use , encoding='windows-1252' both when writing to the rtf file and also when reading from it.
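A minimal sketch of that fix for Python 3. The header is cut down to a few control words for brevity and the path is a placeholder; neither is meant to match the TextEdit header from the question exactly.

```python
import os
import tempfile

# Shortened header; it still declares code page 1252 via \ansicpg1252.
header = r"{\rtf1\ansi\ansicpg1252 \f0\fs24 "
sen = "How brave they\u2019ll all think me at home!"

# Placeholder path in the temp directory.
path = os.path.join(tempfile.gettempdir(), "staging.rtf")

# Write with the same code page that the header declares.
with open(path, "w", encoding="windows-1252") as f:
    f.write(header)
    f.write(sen)
    f.write("}")

# On disk, U+2019 is now the single byte 0x92 rather than three UTF-8 bytes.
with open(path, "rb") as f:
    raw = f.read()

# Read back with the same encoding to recover the original text.
with open(path, encoding="windows-1252") as f:
    output = f.read()
```

One limitation of this approach: a character that has no Windows-1252 equivalent raises UnicodeEncodeError on write, so it only suits text that fits in that code page.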

Writing out text with double double quotes - Python on Linux

I'm trying to take the text output of a query to an SSD (pulling a log page, similar to pulling SMART data) and write it out to a log file that I update periodically.
My problem happens when the log data for some drives has double double-quotes as a placeholder for a blank field. Here is a snippet of the input:
VER 0x10200
VID 0x15b7
BoardRev 0x0
BootLoadRev ""
When this gets written out (appended) to my own log file, the "" gets replaced with several null characters, and when I then try to open the file, every text editor tells me it's corrupted.
The "" characters are replaced by something like this on my Linux system:
BootLoadRev "\00\00\00\00"
Some fields are even longer with the \00 characters. If the "" is not there, things write out OK.
The code is similar to this:
f=open(fileName, 'w')
test_bench.send_command('get_log_page')
identify_data = test_bench.get_data_in()
f.write(identify_data)
f.close()
Is there a way to send this text to a file w/o these nulls causing problems?
Assuming that this is Python 2 (and that your content is thus what Python 3 would call a bytestring), and that your intended data format is raw ASCII, the trivial solution is simply to remove the NULs from your content before you write to disk:
f.write(identify_data.replace('\0', ''))
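If the data turns out to be a bytes object on Python 3, the same replacement needs byte literals. The helper below is a sketch that handles both cases; strip_nuls is a made-up name, and the sample is modelled on the snippet in the question rather than on real drive output.

```python
def strip_nuls(data):
    """Remove NUL characters from either a byte string or a text string."""
    if isinstance(data, bytes):
        return data.replace(b"\x00", b"")
    return data.replace(u"\x00", u"")


# Sample modelled on the log snippet in the question.
sample = b'BoardRev 0x0\nBootLoadRev "\x00\x00\x00\x00"\n'
cleaned = strip_nuls(sample)
print(cleaned)
```

When the cleaned value is bytes, open the log file in binary mode ('wb' or 'ab') before writing it.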

pyparsing not working on windows text file but works on linux text file

I have a simple pyparsing construct for extracting parts of a log message. It looks like this
log_line = timestamp + task_info + Suppress(LineEnd())
This construct parses a log file generated on Linux very well but doesn't parse a similar file generated on Windows. I am pretty sure it is because of the difference in newline representation. I was wondering if LineEnd() takes care of that? If it doesn't, how do I take care of it?
Try Suppress("\r\n") instead of Suppress(LineEnd())
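An alternative that leaves the grammar untouched, offered only as a sketch, is to normalize the line endings before the text reaches the parser. The function name normalize_newlines and the sample log lines are invented for illustration; this uses plain Python and makes no claim about how pyparsing itself treats "\r".

```python
def normalize_newlines(text):
    """Convert Windows (\\r\\n) and old Mac (\\r) line endings to \\n."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Hypothetical log content with Windows line endings.
windows_log = "12:00:01 task_a started\r\n12:00:02 task_b started\r\n"
linux_style = normalize_newlines(windows_log)

for line in linux_style.splitlines():
    print(line)
```

The normalized string can then be handed to the same grammar that already works for the Linux file.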

Python codecs module

I am trying to load a file saved as UTF-8 into python (ver2.6.6) which contains 14 different languages. I am using the python codecs module to decode the txt file.
import codecs
f = open('C:/temp/list_test.txt', 'r')
for lines in f:
    line = filter_str(lines.decode("utf-8"))
This all works well. I parse the entire file and then want to export 14 different language files. The problem that I can't understand is the following.
I use the following code for output:
malangout = codecs.open("C:/temp/'polish.txt",'w','utf-8','surrogateescape')
for item in lang_dic['English']:
    temp = lang_dic[lang1][item]
    malangout.write(temp + '\n')
malangout.close()
Example:
Language: Polish
Expected output: Dziennik zakłóceń
Actual output: Dziennik zak‚óceƒ
The string is stored as is:
u'Dziennik zak\u201a\xf3ce\u0192'
I have tried many encodings from the Python docs (7.8 codecs). Any information would help at this point.
The string is stored as is:
u'Dziennik zak\u201a\xf3ce\u0192'
Well, that's a problem since
In [25]: print(u'Dziennik zak\u201a\xf3ce\u0192')
Dziennik zak‚óceƒ
in contrast to
In [26]: print(u'Dziennik zak\u0142\xf3ce\u0144')
Dziennik zakłóceń
So it looks like the unicode you are storing is incorrect. Are you sure it is correct in C:/temp/list_test.txt? That is, does list_test.txt contain
In [28]: u'Dziennik zak\u201a\xf3ce\u0192'.encode('utf-8')
Out[28]: 'Dziennik zak\xe2\x80\x9a\xc3\xb3ce\xc6\x92'
or
In [27]: u'Dziennik zak\u0142\xf3ce\u0144'.encode('utf-8')
Out[27]: 'Dziennik zak\xc5\x82\xc3\xb3ce\xc5\x84'
?
PS. You may want to change
temp + '\n'
to
temp + u'\n'
to make it clear you are adding two unicode together to form a unicode.
The two lines above have the same result in Python 2. In Python 3, '\n' is already a unicode string (str), so temp + '\n' works there too, but adding a str and a bytes object together raises a TypeError. I think the challenge in transitioning to Python 3 will be in changing one's mental attitude toward mixing text and byte strings: in Python 2 the conversion is silently attempted for you, in Python 3 it is disallowed.
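A short sketch of that difference, written for Python 3. The Polish string is the correct one from In [26] above; the variable names are only illustrative.

```python
# The correctly stored string from the answer above.
text = u'Dziennik zak\u0142\xf3ce\u0144'

# Text plus text is fine.
line = text + u'\n'

# Text plus bytes is refused on Python 3 instead of being converted silently.
try:
    text + b'\n'
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Encode explicitly at the point where bytes are actually needed.
encoded = line.encode('utf-8')
print(mixed_ok)
print(encoded)
```

Keeping everything as text inside the program and encoding only when writing avoids this class of error.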
