Error when reading UTF-8 characters with Python

I have the following function in python, which takes a string as argument and returns the same string in ASCII (e.g. "alçapão" -> "alcapao"):
def filt(word):
    dic = { u'á':'a',u'ã':'a',u'â':'a' } # the whole dictionary is too big, it is just a sample
    new = ''
    for l in word:
        new = new + dic.get(l, l)
    return new
It is supposed to "filter" all strings in a list that I read from a file using this:
lines = []
with open("to-filter.txt","r") as f:
    for line in f:
        lines.append(line.strip())
lines = [filt(l) for l in lines]
But I get this:
filt.py:9: UnicodeWarning: Unicode equal comparison failed to convert
both arguments to Unicode - interpreting them as being unequal
new = new + dic.get(l, l)
and the strings filtered have characters like '\xc3\xb4' instead of ASCII characters. What should I do?

You're mixing and matching Unicode strs and regular (byte) strs.
Use the io module to open your text file and decode it to Unicode as it's read:
with io.open("to-filter.txt","r", encoding="utf-8") as f:
this assumes your to-filter.txt file is UTF-8 encoded.
You can also shrink your file read into an array with just:
import io
with io.open("to-filter.txt","r", encoding="utf-8") as f:
    lines = f.read().splitlines()
lines is now a list of Unicode strings.
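To make the bytes-versus-text distinction concrete, here is a minimal Python 3 sketch. The sample word and the two-entry table are illustrative stand-ins, not the asker's full dictionary. It shows why a table keyed by characters cannot match undecoded UTF-8 input: each accented letter occupies two bytes until the data is decoded.

```python
# Illustrative sample ("alçapão" written with escapes so the code points are explicit).
text = "al\u00e7ap\u00e3o"
raw = text.encode("utf-8")      # what a byte-mode read would hand back

# 7 characters, but 9 bytes: the two accented letters take two bytes each.
print(len(text), len(raw))

# Hypothetical two-entry table standing in for the asker's dictionary.
dic = {"\u00e7": "c", "\u00e3": "a"}

# Decoding first yields one item per character, so the lookups succeed.
decoded = raw.decode("utf-8")
filtered = "".join(dic.get(ch, ch) for ch in decoded)
print(filtered)
```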
Optional
It looks like you're trying to convert non-ASCII characters to their closest ASCII equivalent. The easy way to do this is:
import unicodedata
def filt(word):
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')
What this does is:
Decomposes each character into its component parts. For example, ã can be expressed as a single Unicode character (U+00E3 'LATIN SMALL LETTER A WITH TILDE') or as two Unicode characters: U+0061 'LATIN SMALL LETTER A' + U+0303 'COMBINING TILDE'.
Encodes the component parts to ASCII. Non-ASCII parts (those with code points greater than U+007F) are ignored.
Decodes back to a Unicode str for convenience.
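The three steps can be traced one at a time. This is a small Python 3 sketch; the variable names are only for illustration.

```python
import unicodedata

word = "\u00e3"  # 'ã' as a single precomposed code point (U+00E3)

# Step 1: decompose into base letter + combining mark.
decomposed = unicodedata.normalize("NFKD", word)
names = [unicodedata.name(ch) for ch in decomposed]
print(names)

# Step 2: encode to ASCII, silently dropping anything that does not fit.
ascii_bytes = decomposed.encode("ascii", errors="ignore")

# Step 3: decode back to a text string.
result = ascii_bytes.decode("ascii")
print(result)

# A letter with no decomposition (here U+00F8, 'ø') is dropped, not approximated.
dropped = unicodedata.normalize("NFKD", "\u00f8").encode("ascii", errors="ignore").decode("ascii")
print(repr(dropped))
```

The last lines show the main limitation of this approach: characters that do not decompose into an ASCII base letter simply disappear from the output.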
Tl;dr
Your code is now:
import io
import unicodedata
def filt(word):
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')
with io.open("to-filter.txt","r", encoding="utf-8") as f:
    lines = f.read().splitlines()
lines = [filt(l) for l in lines]
Python 3.x
Although not strictly necessary, you can drop the io. prefix and call the built-in open(), which accepts the encoding argument directly.
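A Python 3 version might look like the sketch below. To stay self-contained it writes a small temporary sample file first; in practice you would open the existing to-filter.txt instead. The sample words are made up for the illustration.

```python
import os
import tempfile
import unicodedata

def filt(word):
    # Same approach as above: decompose, drop non-ASCII, decode.
    return unicodedata.normalize("NFKD", word).encode("ascii", errors="ignore").decode("ascii")

# Write a small UTF-8 sample file (stand-in for to-filter.txt).
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write("al\u00e7ap\u00e3o\n\u00e1gua\n")
    path = tmp.name

# The built-in open() accepts encoding= directly in Python 3.
with open(path, "r", encoding="utf-8") as f:
    lines = [filt(line) for line in f.read().splitlines()]

os.remove(path)
print(lines)
```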

The root of your problem is that you're not reading Unicode strings from the file; you're reading byte strings. There are three ways to fix this. The first is to open the file with the io module, as suggested by another answer. The second is to decode each string as you read it:
with open("to-filter.txt","r") as f:
    for line in f:
        lines.append(line.decode('utf-8').strip())
The third way is to use Python 3, which always reads text files into Unicode strings.
Finally, there's no need to write your own code to turn accented characters into plain ASCII; the unidecode package does that.
from unidecode import unidecode
print(unidecode(line))

Related

Trying to compare two text files in UTF-8 encoding to find and count similar words

I want to compare two text files that are in UTF-8 encoding, File 1 is a dictionary of words and file 2 contains a sentence. I want to find out the similar words that are present in File 1 and File 2.
import codecs
f1 = codecs.open('poswords.txt', 'r', 'UTF-8')
for line in f1:
    print(line)
f2 = codecs.open('0001b.txt', 'r', 'UTF-8')
words=set(line.strip() for line in f1)
for line in f2:
    word,freq =line.split()
    if word in words:
        print (word)
File 1 (i.e. the dictionary) contains:
کرخت
ناجائز فائدہ
آب دیدہ
ابال
ابال کر پکانا
**ابالنا**
ابتدائ
ابتر
File 2 contains a sentence:
وفاقی وزیر اطلاعات فواد چودھری سے استعفیٰ لے لیا**ابالنا** گیا ہے
There are two common words in both files; I want to find them and count their occurrences.
I want it to return the similar words, but it raises an error: ValueError: too many values to unpack (expected 2)
You are attempting to retrieve two values from split:
word, freq = line.split()
This will only work when there are exactly two words on a line (and by the variable naming, the second should apparently be a frequency count).
Another problem is that you consume all the lines from the first file when you print them. Once you have read all the lines from a handle, attempting to read more lines will simply return nothing. The simple fix is to both print and save each input word to the words set inside the same loop. (Maybe comment out the print(), actually; or import logging and change it to logging.debug(). This also ensures that the diagnostic output is not mixed with the program's regular standard output.)
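The exhausted-handle behaviour is easy to reproduce. In this sketch, io.StringIO stands in for an open text file and the two sample words are placeholders.

```python
import io

# io.StringIO behaves like an open text file for iteration purposes.
f1 = io.StringIO("alpha\nbeta\n")

first_pass = [line.strip() for line in f1]   # consumes every line
second_pass = [line.strip() for line in f1]  # nothing left to read

print(first_pass)
print(second_pass)

# Fix: collect the words in the same loop that inspects them.
f1 = io.StringIO("alpha\nbeta\n")
words = set()
for line in f1:
    words.add(line.strip())
print(sorted(words))
```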
In Python 3, UTF-8 should be the default encoding on most sane platforms (though this conspicuously and emphatically excludes Windows); maybe you don't need the explicit codecs at all.
Finally, you should be aware that Unicode can often represent the same string in multiple ways. I don't read Arabic, but briefly, for example, you can write "salaam" as a single glyph U+FDF5 or you can spell it out. Unicode normalization attempts to iron out any such wrinkles so you can be sure that text which displays the same is also written the same, and thus identical to Python's string comparison operator.
import codecs
import unicodedata
with codecs.open('poswords.txt', 'r', 'UTF-8') as f1:
    words = set()
    for line in f1:
        print(line)
        words.add(unicodedata.normalize('NFC', line.strip()))
with codecs.open('0001b.txt', 'r', 'UTF-8') as f2:
    for line in f2:
        for word in line.split():
            if unicodedata.normalize('NFC', word) in words:
                print (word)
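The question also asks for a count of the matches, which the code above does not produce. One possible extension is sketched below with collections.Counter. The in-memory sample lists are hypothetical stand-ins for the two files, and the sketch assumes single-word dictionary entries: a multi-word entry can never equal a single token from line.split(). The sample deliberately writes the same word in two different Unicode forms to show what the normalization step buys you.

```python
import unicodedata
from collections import Counter

def nfc(text):
    return unicodedata.normalize("NFC", text)

# Hypothetical sample data standing in for the two files.
dictionary_lines = ["caf\u00e9", "tea"]                  # accented e as one code point
sentence_lines = ["cafe\u0301 or tea then cafe\u0301"]  # e + combining acute accent

# Without normalization the two spellings compare as different strings.
print("caf\u00e9" == "cafe\u0301")

words = {nfc(line.strip()) for line in dictionary_lines}

counts = Counter()
for line in sentence_lines:
    for word in line.split():
        word = nfc(word)
        if word in words:
            counts[word] += 1

print(counts)
```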

Iterating through a file and replacing strings, leaving the number of characters intact

I'm attempting to anonymize a file so that all the content except certain keywords are replaced with gibberish, but the format is kept the same (including punctuation, length of string and capitalization). For example:
I am testing this, check it out! This is a keyword: long
Wow, another line.
should turn into:
T ad ehistmg ptrs, erovj qo giw! Tgds ar o qpyeogf: long
Yeg, rmbjthe yadn.
I am attempting to do this in Python, but I'm having no luck finding a solution. I have tried replacing via tokenization and writing to another file, but without much success.
Initially let's disregard the fact that we have to preserve some keywords. We will fix that later.
The easiest way to perform this kind of 1-to-1 mapping is to use the method str.translate. The string module also contains constants that contain all ASCII lowercase and uppercase characters, and random.shuffle can be used to obtain a random permutation.
import string
import random
random_caps = list(string.ascii_uppercase)
random_lows = list(string.ascii_lowercase)
random.shuffle(random_caps)
random.shuffle(random_lows)
all_random_chars = ''.join(random_lows + random_caps)
translation_table = str.maketrans(string.ascii_letters, all_random_chars)
with open('the-file-i-want.txt', 'r') as f:
    contents = f.read()
translated_contents = contents.translate(translation_table)
with open('the-file-i-want.txt', 'w') as f:
    f.write(translated_contents)
In Python 2, maketrans is a function in the string module instead of a static method of str.
The translation_table is a mapping from characters to characters, so it maps each ASCII letter to a letter of the same case. The translate method simply applies this table to each character in the string.
Important note: the above method is actually reversible, because each letter is mapped to a unique other letter. This means that a simple frequency analysis of the symbols can be enough to reverse it.
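The reversibility is easy to see when the table is known: swapping the arguments to str.maketrans gives the inverse mapping. The sketch below uses a fixed seed only so that the run is repeatable. An attacker without the table would have to estimate it, for example from letter frequencies, which this sketch does not attempt.

```python
import random
import string

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
random_lows = list(string.ascii_lowercase)
random_caps = list(string.ascii_uppercase)
rng.shuffle(random_lows)
rng.shuffle(random_caps)
all_random_chars = "".join(random_lows + random_caps)

forward = str.maketrans(string.ascii_letters, all_random_chars)
# Swapping the arguments yields the inverse table.
backward = str.maketrans(all_random_chars, string.ascii_letters)

original = "I am testing this, check it out!"
scrambled = original.translate(forward)
recovered = scrambled.translate(backward)

print(scrambled)
print(recovered == original)
```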
If you want to make this harder or impossible, you could re-create the translation_table for every line:
import string
import random
random_caps = list(string.ascii_uppercase)
random_lows = list(string.ascii_lowercase)
with open('the-file-i-want.txt', 'r') as f:
    translated_lines = []
    for line in f:
        random.shuffle(random_lows)
        random.shuffle(random_caps)
        all_random_chars = ''.join(random_lows + random_caps)
        translation_table = str.maketrans(string.ascii_letters, all_random_chars)
        translated_lines.append(line.translate(translation_table))
with open('the-file-i-want.txt', 'w') as f:
    f.writelines(translated_lines)
Also note that you could translate and save the file line by line:
with open('the-file-i-want.txt', 'r') as f, open('output.txt', 'w') as o:
    for line in f:
        random.shuffle(random_lows)
        random.shuffle(random_caps)
        all_random_chars = ''.join(random_lows + random_caps)
        translation_table = str.maketrans(string.ascii_letters, all_random_chars)
        o.write(line.translate(translation_table))
This means you can translate huge files with this code, as long as the lines themselves are not insanely long.
The code above scrambles all characters, without taking the keywords into account.
The simplest way to handle that requirement is to check, for each line, whether any of the keywords occurs and "reinsert" it there:
import re
import string
import random
random_caps = list(string.ascii_uppercase)
random_lows = list(string.ascii_lowercase)
keywords = ['long'] # add all the possible keywords in this list
keyword_regex = re.compile('|'.join(re.escape(word) for word in keywords))
with open('the-file-i-want.txt', 'r') as f, open('output.txt', 'w') as o:
    for line in f:
        random.shuffle(random_lows)
        random.shuffle(random_caps)
        all_random_chars = ''.join(random_lows + random_caps)
        translation_table = str.maketrans(string.ascii_letters, all_random_chars)
        matches = keyword_regex.finditer(line)
        translated_line = list(line.translate(translation_table))
        for match in matches:
            translated_line[match.start():match.end()] = match.group()
        o.write(''.join(translated_line))
Sample usage (using the version that preserves keywords):
$ echo 'I am testing this, check it out! This is a keyword: long
Wow, another line.' > the-file-i-want.txt
$ python3 trans.py
$ cat output.txt
M vy hoahitc hfia, ufoum ih pzh! Hfia ia v modjpel: long
Ltj, fstkwzb hdsz.
Note how long is preserved.
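One caveat: the keyword pattern above also matches a keyword inside a longer word, so the "long" in "longer" would be restored too. If only whole words should survive, a possible variant wraps the alternation in \b word boundaries. The sketch below is an assumption about what is wanted, not part of the original answer, and the helper name scramble_line is made up for the illustration.

```python
import random
import re
import string

keywords = ["long"]  # add all the possible keywords in this list
# \b anchors restrict matches to whole words (an addition to the pattern above).
keyword_regex = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in keywords) + r")\b")

def scramble_line(line, rng=random):
    """Scramble ASCII letters in line, then restore whole-word keywords."""
    lows = list(string.ascii_lowercase)
    caps = list(string.ascii_uppercase)
    rng.shuffle(lows)
    rng.shuffle(caps)
    table = str.maketrans(string.ascii_letters, "".join(lows + caps))
    out = list(line.translate(table))
    for match in keyword_regex.finditer(line):
        # Same-length slice assignment, so positions stay aligned.
        out[match.start():match.end()] = match.group()
    return "".join(out)

result = scramble_line("This is a keyword: long")
print(result)
```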

Unicode in python 3

I want to convert a string that contains Unicode escape sequences to ordinary text. For example, the file "input.txt" contains the string '\u0057\u0068\u0061\u0074', and I want to know what it means. If the string is written directly in the code like:
s = '\u0057\u0068\u0061\u0074'
b = s.encode('utf-8')
print(b)
it works perfectly, but if I do the same with the file I get this result: b'\\u0057\\u0068\\u0061\\u0074'.
How can I fix this problem? I am on Windows 8, and the files are encoded as 'windows-1251'.
If your file contains those unicode escape sequences, then you can use the unicode_escape “codec” to interpret them after you read the file contents as a string.
>>> s = r'\u0057\u0068\u0061\u0074'
>>> print(s)
\u0057\u0068\u0061\u0074
>>> s.encode('utf-8').decode('unicode_escape')
'What'
Or, you can just read a bytes string directly and decode that:
with open('file.txt', 'br') as f:
    print(f.read().decode('unicode_escape'))
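An end-to-end sketch of the bytes route follows. It creates its own temporary sample file so it can run anywhere; the file contents are the escape sequences from the question. Note that unicode_escape treats any non-escape bytes as Latin-1, so this is only safe if the file contains nothing but ASCII and escapes. Real windows-1251 text mixed into the same file would come out wrong.

```python
import os
import tempfile

# Create a sample file holding the literal backslash-u sequences.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(rb"\u0057\u0068\u0061\u0074")
    path = tmp.name

with open(path, "rb") as f:
    raw = f.read()

os.remove(path)

print(raw)  # the escapes are stored as plain ASCII bytes
text = raw.decode("unicode_escape")
print(text)
```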

Python JSON preserve encoding

I have a file like this:
aarónico
aaronita
ababol
abacá
abacería
abacero
ábaco
#more words, with no ascii chars
When I read and print that file to the console, it prints exactly the same, as expected, but when I do:
f.write(json.dumps({word: Lookup(line)}))
This is saved instead:
{"aar\u00f3nico": ["Stuff"]}
When i expected:
{"aarónico": ["Stuff"]}
I need to get the same when I json.loads() it, but I don't know where or how to do the encoding, or whether it's needed to get it to work.
EDIT
This is the code that saves the data to a file:
with open(LEMARIO_FILE, "r") as flemario:
    with open(DATA_FILE, "w") as f:
        while True:
            word = flemario.readline().strip()
            if word == "":
                break
            print word #this is correct
            f.write(json.dumps({word: RAELookup(word)}))
            f.write("\n")
And this one loads the data and returns the dictionary object:
with open(DATA_FILE, "r") as f:
    while True:
        new = f.readline().strip()
        if new == "":
            break
        print json.loads(new) #this is not
I cannot lookup the dictionaries if the keys are not the same as the saved ones.
EDIT 2
>>> import json
>>> f = open("test", "w")
>>> f.write(json.dumps({"héllö": ["stuff"]}))
>>> f.close()
>>> f = open("test", "r")
>>> print json.loads(f.read())
{u'h\xe9ll\xf6': [u'stuff']}
>>> "héllö" in {u'h\xe9ll\xf6': [u'stuff']}
False
This is normal and valid JSON behaviour. The \uxxxx escape is also used by Python, so make sure you don't confuse python literal representations with the contents of the string.
Demo in Python 3.3:
>>> import json
>>> print('aar\u00f3nico')
aarónico
>>> print(json.dumps('aar\u00f3nico'))
"aar\u00f3nico"
>>> print(json.loads(json.dumps('aar\u00f3nico')))
aarónico
In python 2.7:
>>> import json
>>> print u'aar\u00f3nico'
aarónico
>>> print(json.dumps(u'aar\u00f3nico'))
"aar\u00f3nico"
>>> print(json.loads(json.dumps(u'aar\u00f3nico')))
aarónico
When reading and writing from and to files, and when specifying just raw byte strings (and "héllö" is a raw byte string) then you are not dealing with Unicode data. You need to learn about the differences between encoded and Unicode data first. I strongly recommend you read at least 2 of the following 3 articles:
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
You were lucky with your "héllö" python raw byte string representation, Python managed to decode it automatically for you. The value read back from the file is perfectly normal and correct:
>>> print u'h\xe9ll\xf6'
héllö
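If the file itself should contain readable characters rather than \uXXXX escapes, json.dumps accepts ensure_ascii=False. The Python 3 sketch below shows both forms and checks that they load back to the same dictionary; the sample entry is taken from the question.

```python
import json

data = {"aar\u00f3nico": ["Stuff"]}

escaped = json.dumps(data)                       # default: ensure_ascii=True
readable = json.dumps(data, ensure_ascii=False)  # keeps the accented letter as-is

print(escaped)
print(readable)

# Both forms decode to the same dictionary, so key lookups behave identically.
same = json.loads(escaped) == json.loads(readable) == data
print(same)
```

When writing the readable form to disk, open the file with an explicit encoding such as encoding="utf-8" so the non-ASCII characters are stored consistently.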

Why is the first line longer?

I'm using Python to read a txt document with:
f = open(path,"r")
for line in f:
    line = line.decode('utf8').strip()
    length = len(line)
    firstLetter = line[:1]
it seems to work, but the first line's length is always longer by... 1
for example:
the first line is "XXXX" where X denotes a chinese character
then length will be 5, but not 4
and firstLetter will be nothing
but when it goes to the second and after lines,it works properly
tks~
You have a UTF-8 BOM at the start of your file. Don't faff about inspecting the first character. Instead of the utf8 encoding, use the utf_8_sig encoding with either codecs.open() or your_byte_string.decode() ... this sucks up the BOM if it exists and you don't see it in your code.
>>> bom8 = u'\ufeff'.encode('utf8')
>>> bom8
'\xef\xbb\xbf'
>>> bom8.decode('utf8')
u'\ufeff'
>>> bom8.decode('utf_8_sig')
u'' # removes the BOM
>>> 'abcd'.decode('utf_8_sig')
u'abcd' # doesn't care if no BOM
>>>
You are probably getting the Byte Order Mark (BOM) as the first character on the first line.
Information about dealing with it is here
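In Python 3 the same codec can be passed straight to the built-in open(). The sketch below writes a temporary sample file (a UTF-8 BOM followed by four Chinese characters, chosen arbitrarily) and compares the two encodings.

```python
import os
import tempfile

# Sample file: UTF-8 BOM followed by four Chinese characters and a newline.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbf" + "\u4e2d\u6587\u5b57\u7b26\n".encode("utf-8"))
    path = tmp.name

# Plain utf-8 keeps the BOM as an invisible first character (U+FEFF);
# strip() does not remove it because it is not whitespace.
with open(path, encoding="utf-8") as f:
    with_bom = f.readline().strip()

# utf-8-sig swallows the BOM if present.
with open(path, encoding="utf-8-sig") as f:
    without_bom = f.readline().strip()

os.remove(path)
print(len(with_bom), len(without_bom))
```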
