How to Reverse Hebrew String in Python?

I am trying to reverse a Hebrew string in Python:
line = 'אבגד'
reversed = line[::-1]
print reversed
but I get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 0: ordinal not in range(128)
Care to explain what I'm doing wrong?
EDIT:
I'm also trying to save the string into a file using:
w1 = open('~/fileName', 'w')
w1.write(reverseLine)
but now I get:
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-3: character maps to <undefined>
Any ideas how to fix that, too?

You need more than a string reversal to flip Hebrew backwards, because numbers and embedded non-Hebrew runs keep their left-to-right order. The algorithm is much more complicated; all the answers on this page (to this date) will most likely mangle your numbers and non-Hebrew text. For most cases you should use:
from bidi.algorithm import get_display
print get_display(text)
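For example, here is a minimal sketch of the difference (the sample string is hypothetical; it assumes the python-bidi package is installed):
# -*- coding: utf-8 -*-
from bidi.algorithm import get_display

text = u'שלום 123'
print text[::-1]         # naive slicing also reverses the digits
print get_display(text)  # reorders only the right-to-left runs, so '123' stays readable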

Adding u in front of the Hebrew string works for me:
In [1]: line = u'אבגד'
In [2]: reversed = line[::-1]
In [3]: print reversed
דגבא
To your second question: you can use codecs to write the unicode string to fileName:
import codecs

w1 = codecs.open("~/fileName", "w", "utf-8")  # "w" mode for writing, not "r"
w1.write(reversed)
w1.close()
Alternatively, without codecs, you will need to encode the reversed string as UTF-8 when writing to the file:
with open('~/fileName', 'w') as f:
    f.write(reversed.encode('utf-8'))

You need to use a unicode string constant:
line = u'אבגד'
reversed = line[::-1]
print reversed

Strings default to being treated as ASCII. Use u'' for Unicode:
line = u'אבגד'
reversed = line[::-1]
print reversed

Make sure you're using unicode objects
line = unicode('אבגד', 'utf-8')
reversed = line[::-1]
print reversed
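Note that for Python 2 to accept the Hebrew literal in a source file at all, the file also needs an encoding declaration at the top, e.g.:
# -*- coding: utf-8 -*-
line = unicode('אבגד', 'utf-8')
reversed = line[::-1]
print reversed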

Found how to write to the file:
import codecs

w1 = codecs.open('~/fileName', 'w', encoding='utf-8')
w1.write(reverseLine)
w1.close()

Related

Error when reading UTF-8 characters with python

I have the following function in Python, which takes a string as an argument and returns the same string in ASCII (e.g. "alçapão" -> "alcapao"):
def filt(word):
    dic = { u'á':'a', u'ã':'a', u'â':'a' }  # the whole dictionary is too big, this is just a sample
    new = ''
    for l in word:
        new = new + dic.get(l, l)
    return new
It is supposed to "filter" all strings in a list that I read from a file using this:
lines = []
with open("to-filter.txt", "r") as f:
    for line in f:
        lines.append(line.strip())
lines = [filt(l) for l in lines]
But I get this:
filt.py:9: UnicodeWarning: Unicode equal comparison failed to convert
both arguments to Unicode - interpreting them as being unequal
new = new + dic.get(l, l)
and the strings filtered have characters like '\xc3\xb4' instead of ASCII characters. What should I do?
You're mixing and matching Unicode strs and regular (byte) strs. Use the io module to open and decode your text file to Unicode as it's read:
import io

with io.open("to-filter.txt", "r", encoding="utf-8") as f:
This assumes your to-filter.txt file is UTF-8 encoded.
You can also shrink the file read into a list with just:
with io.open("to-filter.txt", "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
lines is now a list of Unicode strings.
Optional
It looks like you're trying to convert non-ASCII characters to their closest ASCII equivalents. The easy way to do this is:
import unicodedata

def filt(word):
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')
What this does is:
Decomposes each character into its component parts. For example, ã can be expressed as a single Unicode char (U+00E3 'LATIN SMALL LETTER A WITH TILDE') or as two Unicode characters: U+0061 'LATIN SMALL LETTER A' + U+0303 'COMBINING TILDE'.
Encodes the component parts to ASCII. Non-ASCII parts (those with code points greater than U+007F) are ignored.
Decodes back to a Unicode str for convenience.
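To see the decomposition concretely, here is a small sketch (the printed values are the code points NFKD produces for ã):
import unicodedata

s = u'\u00e3'                                # 'ã' as a single code point
decomposed = unicodedata.normalize('NFKD', s)
print([hex(ord(c)) for c in decomposed])     # ['0x61', '0x303']
print(decomposed.encode('ascii', 'ignore'))  # 'a' - the combining tilde is dropped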
Tl;dr
Your code is now:
import io
import unicodedata

def filt(word):
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')

with io.open("to-filter.txt", "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
lines = [filt(l) for l in lines]
Python 3.x
Although not strictly necessary, you can drop io and use the built-in open() directly.
The root of your problem is that you're not reading Unicode strings from the file, you're reading byte strings. There are three ways to fix this. The first is to open the file with the io module, as suggested in another answer. The second is to convert each string as you read it:
lines = []
with open("to-filter.txt", "r") as f:
    for line in f:
        lines.append(line.decode('utf-8').strip())
The third way is to use Python 3, which always reads text files into Unicode strings.
Finally, there's no need to write your own code to turn accented characters into plain ASCII; the unidecode package already does that:
from unidecode import unidecode
print(unidecode(line))

Using ASCII number to character in python

I am trying to print a list of dicts to a file that's encoded in Latin-1. Each field is to be separated by ASCII character 254 and the end of line should be ASCII character 20.
When I try to use a character that is greater than 128 I get "UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 12: ordinal not in range(128)"
This is my current code. Could someone help me with how to encode ASCII char 254 and how to add an end-of-line ASCII char 20 when using DictWriter?
Thanks
my Code:
import codecs
import csv

with codecs.open("test.dat", "w", "ISO-8859-1") as outputFile:
    delimiter = chr(254)
    keys = file_dict[0].keys()
    dict_writer = csv.DictWriter(outputFile, keys, delimiter=delimiter)
    dict_writer.writeheader()
    for value in file_dict:
        dict_writer.writerow(value)
ASCII only contains character codes 0-127. Codes in the range 128-255 are not defined in ASCII, only in encodings that extend it, like the ANSI code pages, Latin-1, or Unicode. In your case the codecs wrapper is probably double-encoding the string, which fails.
It works if you use the standard built-in open function without specifying a codec:
with open("test.dat", "w") as outputFile: # omit the codec stuff here
delimiter = (chr(254))
keys = file_dict[0].keys()
dict_writer = csv.DictWriter(outputFile, keys, delimiter=delimiter)
dict_writer.writeheader()
for value in file_dict:
dict_writer.writerow(value)
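To also get character 20 as the line ending, as the question asks, DictWriter accepts a lineterminator argument. A minimal sketch, with hypothetical sample data:
import csv

file_dict = [{'name': 'foo', 'value': 'bar'}]  # hypothetical sample rows

with open("test.dat", "w") as outputFile:
    writer = csv.DictWriter(outputFile, file_dict[0].keys(),
                            delimiter=chr(254), lineterminator=chr(20))
    writer.writeheader()
    for row in file_dict:
        writer.writerow(row)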

How does one print a Unicode character code in Python?

I would like to print a unicode character's code, not the actual glyph it represents, in Python.
For example, if u is a list of unicode characters:
>>> u[0]
u'\u0103'
>>> print u[0]
ă
I would like to output the character code as a raw string: u'\u0103'.
I have tried to just print it to a file, but this doesn't work without encoding it in UTF-8.
>>> w = open('~/foo.txt', 'w')
>>> print>>w, u[0].decode('utf-8')
Traceback (most recent call last):
File "<pyshell#33>", line 1, in <module>
print>>w, u[0].decode('utf-8')
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0103' in position 0: ordinal not in range(128)
>>> print>>w, u[0].encode('utf-8')
>>> w.close()
Encoding it results in the glyph ă being written to the file.
How can I write the character code?
For printing raw unicode data, one need only specify the correct encoding:
>>> s = u'\u0103'
>>> print s.encode('raw_unicode_escape')
\u0103
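If you want the quoted literal form (including the u prefix) rather than just the escape, repr() gives you that directly:
>>> s = u'\u0103'
>>> print repr(s)
u'\u0103'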

Why is the first line longer?

I'm using Python to read a txt document with:
f = open(path, "r")
for line in f:
    line = line.decode('utf8').strip()
    length = len(line)
    firstLetter = line[:1]
It seems to work, but the first line's length is always longer by... 1.
For example, if the first line is "XXXX", where X denotes a Chinese character, then length will be 5, not 4, and firstLetter will be nothing.
From the second line onwards it works properly.
Thanks~
You have a UTF-8 BOM at the start of your file. Don't faff about inspecting the first character. Instead of the utf8 encoding, use the utf_8_sig encoding with either codecs.open() or your_byte_string.decode(); it sucks up the BOM if it exists, and you don't see it in your code.
>>> bom8 = u'\ufeff'.encode('utf8')
>>> bom8
'\xef\xbb\xbf'
>>> bom8.decode('utf8')
u'\ufeff'
>>> bom8.decode('utf_8_sig')
u'' # removes the BOM
>>> 'abcd'.decode('utf_8_sig')
u'abcd' # doesn't care if no BOM
>>>
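Applied to the loop in the question, a minimal sketch (assuming the file is UTF-8, with or without a BOM):
import codecs

with codecs.open(path, "r", encoding="utf_8_sig") as f:
    for line in f:
        line = line.strip()
        length = len(line)      # now 4 for the first line, not 5
        firstLetter = line[:1]  # really the first character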
You are probably getting the Byte Order Mark (BOM) as the first character on the first line.

Python write to file

I've got a little problem here.
I'm converting binary strings to ASCII in order to compress data.
All seems to work fine, but when I convert '11011011' to ASCII and try to write it to a file, I keep getting this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\xdb' in position 0: character maps to <undefined>
Here's my code:
byte = ""
handleR = open(self.getInput())
handleW = open(self.getOutput(), 'w')
file = handleR.readlines()
for line in file:
for a in range(0, len(line)):
chunk = result[ord(line[a])]
for b in chunk:
if (len(byte) < 8):
byte+=str(chunk[b])
else:
char = chr(eval('0b'+byte))
print(byte, char)
handleW.write(char)
byte = ""
handleR.close()
handleW.close()
Any help appreciated,
Thank You
I think you want:
handleR = open(self.getInput(), 'rb')
handleW = open(self.getOutput(), 'wb')
That will ensure you're reading and writing byte streams. Also, you can parse binary strings without eval:
char = chr(int(byte, 2))
And of course, it would be faster to use bit manipulation: instead of appending to a string, you can use << (left shift) and | (bitwise or); see the sketch after the edits below.
EDIT: For the actual writing, you can use:
handleW.write(bytes([char]))
This creates and writes a bytes object from a list consisting of a single number.
EDIT 2: Correction, it should be:
handleW.write(bytes([int(byte, 2)]))
There is no need to use chr.
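Putting the two suggestions together, here is a minimal sketch of the inner loop using bit manipulation (variable names follow the question's code; chunk is assumed to be an iterable of '0'/'1' bits):
acc = 0    # bit accumulator
nbits = 0  # number of bits collected so far
for bit in chunk:
    acc = (acc << 1) | int(bit)  # shift left and OR in the next bit
    nbits += 1
    if nbits == 8:               # a full byte is ready
        handleW.write(bytes([acc]))
        acc = 0
        nbits = 0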
