Convert into Malayalam text in Python 3.3.2 - python

Hi, my code is like this (Python 3.3.2):
fw = codecs.open('outputfile.txt','w')
if((unidata[i]==U'\u0d46' and unidata[i-1]==U'\u0d28') and (unidata[i+1]==U'\u0d24') and (unidata[i+2]==U'\u0d4d')):
    print ('code 1')
    if(var==1):
        x=unidata[0:i-1]+U'\u0d7b'+ ' + '+U'\u0d0e'+unidata[i+1:len(unidata)]
        first_word=unidata[0:i-1]+U'\u0d7b'
        fw.write(str(first_word.encode('UTF-8')))
The output in the file looks like this:
(b'\xe0\xb4\xb0\xe0\xb4\xbe\xe0\xb4\xae\xe0\xb5\xbb')
The actual output should be:
രാമൻ
How can I resolve this?

This works:
fw=open("myunicodefile.txt","wb")  # binary mode, because encode() returns bytes
fw.write(first_word.encode('UTF-8'))
But I think you are asking about the string that ends up inside the file.
Yes, that is what the encoded bytes look like after converting them with str():
"\xe0\xb4\xb0\xe0\xb4\xbe\xe0\xb4\xae\xe0\xb5\xbb"
These are the UTF-8 bytes of the text. To see them as Malayalam in a text editor, the file must be opened as UTF-8.
If you use Python to read the file, open it with the UTF-8 encoding and print what you read.
Example:
fr=open("mytext.txt","r",encoding="utf-8")
data=fr.read()
print(data)
This will print Malayalam.
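A minimal sketch of the same idea in Python 3, assuming the word is already a str: passing encoding='utf-8' to open() makes the manual encode() and str() calls unnecessary. The file location is hypothetical (a temporary directory) and is there only to keep the example self-contained.

```python
import os
import tempfile

first_word = '\u0d30\u0d3e\u0d2e\u0d7b'  # the Malayalam word from the question

# Hypothetical output location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'outputfile.txt')

# Text mode with an explicit encoding: write the str directly.
with open(path, 'w', encoding='utf-8') as fw:
    fw.write(first_word)

# Reading the raw bytes shows what is physically stored in the file.
with open(path, 'rb') as fr:
    raw_bytes = fr.read()

# Reading in text mode with the same encoding gives the original str back.
with open(path, 'r', encoding='utf-8') as fr:
    round_trip = fr.read()

print(raw_bytes)
print(round_trip == first_word)
```

The bytes on disk are the same ones the question shows; the difference is that they are stored as bytes, not as the text of a b'...' literal.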

unicode deconversion issues and solutions
I'm giving the link because they explain it better than I can, and there are additional function definitions there as well. I think number 3 on the directly linked page is what helps you, though.

Related

How can I print to the Visual Studio Code console in Portuguese?

I'm trying to print a Portuguese name to the console. It seems I need some particular encoding, but I just can't make it work.
The code is the following:
name = "João".encode().decode("latin_1")
print(name)
I know Python 3 already decodes to utf-8, so I tried to decode it to latin_1. However, with no success. I just can't make it print the way I defined it. I already tried cp860 and cp1252, but it leads to the same problem.
The output of the previous code is:
João
How can I achieve this?
You should write your code like this:
name = "João".encode('latin_1').decode("latin_1")
print(name)
When encoding, the encoding should be specified explicitly; otherwise encode() falls back to its default, which is UTF-8.
You shouldn't need to do any encoding or decoding of the string in Python 3 for it to work with printing to your terminal as Python already knows what your terminal's encoding is and strings are already Unicode so it implicitly encodes it for you.
Executing the following from VS Code on Windows 10:
name = "João"
print(name)
leads to:
João
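A small sketch of why the round trip in the question misbehaves, assuming Python 3: encode() with no argument produces UTF-8 bytes, and decoding those bytes as latin_1 turns the two bytes of "ã" into two separate characters. The variable names are illustrative only.

```python
name = "João"

# encode() with no argument uses UTF-8, so "ã" becomes two bytes.
utf8_bytes = name.encode()

# Decoding UTF-8 bytes as Latin-1 maps each byte to its own character.
mojibake = utf8_bytes.decode("latin_1")

# Reversing the two steps recovers the original text.
repaired = mojibake.encode("latin_1").decode("utf-8")

print(len(name), len(mojibake))
print(ascii(mojibake))
print(repaired == name)
```

The mismatched pair of codecs is the problem, not the terminal: with no encode/decode at all, print(name) receives the original string.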

Basic Unicode encoding/decoding

Python 2.7.9 / Windows environment
When I run
print myString
I'm seeing:
u'\u5df1\u6b66\u8d2a\u5929\u66f2'
Now I know the console I'm using (git-bash) is capable of displaying Unicode. How can I encode (or decode, whichever is the right process) myString so that it displays:
己武贪天曲
I understand that the question is very basic. If anyone has good introductory material or a reference, links would be most welcome.
What you see is the result of print repr(u'\u5df1\u6b66\u8d2a\u5929\u66f2'). If isinstance(myString, (str, unicode)) is true, then find the source where the string is defined and fix it. If myString is some other type, then look at how its __str__, __repr__, and __unicode__ methods are defined. To fix it, remove the code that calls repr() unnecessarily (it can hide in a formatting operation, e.g., "%r" % o).
To check whether your environment supports Unicode, run: print u'\u5929'. It should produce 天.
If your input is a Python literal and you can't change it (you should try at the very least to switch it to json format) then you could use ast.literal_eval(r"u'\u5929'") to get unicode string object:
import ast
print ast.literal_eval(myString)
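As a rough Python 3 illustration of the distinction above (the answer itself targets Python 2), the sketch below assumes my_string holds the text of a literal rather than the characters themselves. The name my_string is hypothetical.

```python
import ast

# The *text* of a Python literal: quotes, prefix and backslashes included.
my_string = "u'\\u5df1\\u6b66\\u8d2a\\u5929\\u66f2'"

# literal_eval safely parses that text into the string object it describes.
value = ast.literal_eval(my_string)

# The literal text is much longer than the five characters it stands for.
print(len(my_string), len(value))
print(ascii(value))
```

ast.literal_eval only accepts literals, so unlike eval() it will not execute arbitrary code found in the input.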
You should try this:
message=u'\\u5df1\\u6b66\\u8d2a\\u5929\\u66f2'
print message.decode('unicode-escape')
I guess you are missing a "\" before every desired character.
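For comparison, a sketch of the same escape decoding in Python 3, where str has no decode() method, so the text is first turned into bytes. This assumes the input contains only ASCII characters and backslash escapes; unicode_escape treats any other byte as Latin-1.

```python
# Text that contains literal backslash-u sequences, not the characters.
message = '\\u5df1\\u6b66\\u8d2a\\u5929\\u66f2'

# Go through bytes, then let the unicode_escape codec interpret the escapes.
decoded = message.encode('ascii').decode('unicode_escape')

print(len(message), len(decoded))
print(ascii(decoded))
```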
You should use the encode method. Consider this example:
str='hello'
print(str.encode(encoding='base64'))
For the list of available encodings, check this:
https://docs.python.org/2/library/codecs.html#standard-encodings

How to solve UnicodeEncodeError while working with Cyrillic (Russian) letters?

I'm trying to read an RSS feed using feedparser.
import feedparser
url = 'http://example.com/news.xml'
d=feedparser.parse(url)
f = open('rss.dat','w')
for e in d.entries:
    title = e.title
    print >>f, title
f.close()
It works fine with English RSS-feeds but I get a UnicodeEncodeError if I try to display a title written in Cyrillic letters. It happens when I:
Try to write a title into a file.
Try to display a title on the screen.
Try to use it in URL to access a web page.
My question is how to solve this problem easily. I would love to have a solution as simple as this:
new_title = some_function(title)
Maybe there is a way to replace every Cyrillic symbol with its HTML code?
FeedParser itself works fine with encodings, except in the case when it is wrongly declared. Refer to http://code.google.com/p/feedparser/issues/detail?id=114 for a possible explanation. It seems Python 2.5 uses ascii as default encoding, and causes problems.
Can you paste the actual feed URL, so we can see how the encoding is declared there? If it appears that the declared encoding is wrong, you'll have to find a way to instruct FeedParser to override the default value.
EDIT: Okay, it seems the error is in the print statement.
Use
f.write(title.encode('utf-8'))
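A hedged sketch of both options raised in this thread, written for Python 3: encoding the title as UTF-8 before writing it, and replacing each Cyrillic character with its numeric HTML reference through the standard 'xmlcharrefreplace' error handler. The title value and the in-memory buffer are stand-ins for e.title and the real output file.

```python
import io

# Stand-in for e.title: a short Cyrillic string written with escapes.
title = '\u041f\u0440\u0438\u0432\u0435\u0442'

# Option 1: encode explicitly and write the bytes (BytesIO mimics a file
# opened in binary mode).
buffer = io.BytesIO()
buffer.write(title.encode('utf-8'))

# Option 2: keep the output pure ASCII by turning every non-ASCII
# character into a numeric character reference.
html_safe = title.encode('ascii', 'xmlcharrefreplace').decode('ascii')

print(buffer.getvalue())
print(html_safe)
```

The second form is only meaningful where the result is later interpreted as HTML or XML; in a plain text file or a URL the references would be shown literally.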

python2.7 - reading a dictionary from a .txt file riddled with unicode

I enrolled into a Chinese Studies course some time ago, and I thought it'd be a great exercise for me to write a flashcard program in python. I'm storing the flash card lists in a dictionary in a .txt file, so far without trouble. The real problems kick in when I try to load the file, encoded in utf-8, into my program. An excerpt of my code:
import codecs
f = codecs.open(('list.txt'),'r','utf-8')
quiz_list = eval(f.read())
quizy = str(quiz_list).encode('utf-8')
print quizy
Now, if for example list.txt consists of:
{'character1':'男人'}
what is printed is actually
{'character1': '\xe7\x94\xb7\xe4\xba\xba'}
Obviously there are some serious encoding issues here, but I cannot for the life of me understand where these occur. I am working with a terminal which supports utf-8, so not the standard cmd.exe: this is not the problem. Reading a normal list.txt without the curly dict-bits returns the chinese characters without a problem, so my guess is I'm not handling the dictionary part correctly. Any thoughts would be greatly appreciated!
There's nothing wrong with your encoding... Look at this:
>>> d = {1:'男人'}
>>> d[1]
'\xe7\x94\xb7\xe4\xba\xba'
>>> print d[1]
男人
Printing a string is one thing; printing its representation is another.
str(quiz_list) calls repr(quiz_list['character1']), which produces an ASCII representation of the string value. If you just print quiz_list['character1'], you'll see the characters themselves rather than their escape codes.
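The same distinction can be sketched in Python 3, with two assumptions that differ from the question: ast.literal_eval is swapped in for eval() as a safer way to parse the dictionary text, and ascii() is used to reproduce the escaped representation, since Python 3's own repr() keeps printable non-ASCII characters.

```python
import ast

# Stand-in for the contents of list.txt after decoding it as UTF-8.
text = "{'character1': '\u7537\u4eba'}"

# Parse the dictionary literal without executing arbitrary code.
quiz_list = ast.literal_eval(text)

# The representation of the whole dict escapes the characters...
escaped = ascii(quiz_list)

# ...while the value itself is an ordinary two-character string.
value = quiz_list['character1']

print(escaped)
print(len(value))
```

Printing individual values, or formatting the entries yourself, avoids going through the container's representation.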

how to url-safe encode a string with python? and urllib.quote is wrong

Hello, I was wondering if you know any other way to encode a string to be URL-safe, because urllib.quote is doing it wrong; the output is different than expected.
If I try
urllib.quote('á')
I get
'%C3%A1'
But that's not the correct output; it should be
%E1
as demonstrated by the tool provided on this site.
And this is not me being difficult: the incorrect output of quote is preventing the browser from finding resources. If I try
urllib.quote('\images\á\some file.jpg')
and then try the same string with the JavaScript tool I mentioned, I get these strings respectively:
%5Cimages%5C%C3%A1%5Csome%20file.jpg
%5Cimages%5C%E1%5Csome%20file.jpg
Note how they are almost the same, but the URL produced by quote doesn't work and the other one does.
I tried messing with encode('utf-8') on the string passed to quote, but it does not make a difference.
I tried other Spanish words with accents and the ñ; they are all represented differently.
Is this a Python bug?
Do you know of a module that gets this right?
According to RFC 3986, %C3%A1 is correct. Characters are supposed to be converted to an octet stream using UTF-8 before the octet stream is percent-encoded. The site you link is out of date.
See Why does the encoding's of a URL and the query string part differ? for more detail on the history of handling non-ASCII characters in URLs.
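To make the two outputs concrete, here is a sketch using Python 3's urllib.parse.quote (the question uses Python 2's urllib.quote). Its encoding parameter selects which bytes get percent-encoded: UTF-8 by default, as RFC 3986 recommends, or Latin-1 if a legacy server expects it.

```python
from urllib.parse import quote

# Default behaviour: encode to UTF-8 first, then percent-encode each byte.
utf8_quoted = quote('á')

# Legacy behaviour: percent-encode the single Latin-1 byte instead.
latin1_quoted = quote('á', encoding='latin-1')

# The path from the question, quoted with the default UTF-8 rules.
path_quoted = quote('\\images\\á\\some file.jpg')

print(utf8_quoted)
print(latin1_quoted)
print(path_quoted)
```

Which form works depends on how the server decodes the request path, so the Latin-1 variant is a workaround for that server, not a correction of quote.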
OK, got it. I have to encode to iso-8859-1 like this:
word = u'á'
word = word.encode('iso-8859-1')
print word
Python 2 reads source code as ASCII by default, so even though your file may be encoded differently, your UTF-8 character ends up as two bytes rather than one character.
Try putting a comment like this on the first or second line of your code to match the file encoding; you might need to use u'á' as well.
# coding: utf-8
What about using unicode strings and the numeric representation (ord) of the char?
>>> print '%{0:X}'.format(ord(u'á'))
%E1
In this question it seems someone wrote a pretty large function to convert to ASCII URLs, and that's what I need. But I was hoping there was some encoding tool in the standard library for the job.
