Encoding in Python

I have a problem comparing a string from a file with a string I entered in the program. They should be equal, but no matter whether I use decode('utf-8') I get that they are not equal. Here's the code:
final = open("info", 'r')
exported = open("final", 'w')
lines = final.readlines()
for line in lines:
    if line == "Wykształcenie i praca":  # error
        print "ok"
and here is how I save the file that I then try to read:
comm_p = bs4.BeautifulSoup(comm)
comm_f.write(comm_p.prettify().encode('utf-8'))
for string in comm_p.strings:
    # print repr(string).encode('utf-8')
    save = string.encode('utf-8')  # this is how I save
    info.write(save)
    info.write("\n")
info.close()
and at the top of the file I have # -*- coding: utf-8 -*-
Any ideas?

This should do what you need:
# -*- coding: utf-8 -*-
import io

with io.open('info', encoding='utf-8') as final:
    lines = final.readlines()

for line in lines:
    if line.strip() == u"Wykształcenie i praca":
        print "ok"
You need to open the file with the right encoding, and since your string is not ascii, you should mark it as unicode.

First, you need some basic knowledge about encodings. This is a good place to start. You don't have to read everything right now, but try to get as far as you can.
About your current problem:
You're reading a UTF-8 encoded file (probably), but you're reading it as an ASCII file. open() doesn't do any conversion for you.
So what you need to do (at least):
use codecs.open("info", "r", encoding="utf-8") to read the file
use Unicode strings for comparison: if line.rstrip() == u"Wykształcenie i praca":
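Putting those two points together, a minimal sketch (Python 2, assuming the file really is UTF-8):
# -*- coding: utf-8 -*-
import codecs

with codecs.open("info", "r", encoding="utf-8") as final:
    for line in final:
        # lines come back as unicode, including the trailing '\n'
        if line.rstrip() == u"Wykształcenie i praca":
            print "ok"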

It is likely the difference is in a '\n' character.
readlines() doesn't strip '\n' - see Best method for reading newline delimited files in Python and discarding the newlines?
In general it is not a good idea to put a Unicode string literal in your code; it would be better to read it from a resource file, as in the sketch below.
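A minimal sketch of that idea (Python 2; "expected.txt" is a hypothetical one-line UTF-8 resource file holding the phrase):
# -*- coding: utf-8 -*-
import io

# read the expected phrase from a resource file instead of hard-coding it
with io.open("expected.txt", encoding="utf-8") as f:
    expected = f.read().strip()

with io.open("info", encoding="utf-8") as final:
    for line in final:
        if line.rstrip("\n") == expected:
            print "ok"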

use unicode for string comparison
>>> s = u'Wykształcenie i praca'
>>> s == u'Wykształcenie i praca'
True
>>>
when it comes to strings, unicode is the smartest move :)

Related

Python sets and good encoding

I'm using Python 2 and I am trying to put many words from a French dictionary into a set object, but I always have an encoding problem with the words that have accents.
This is my main code (this part reads a text file):
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
from sets import Set

with open('.../test_unicode.txt', 'r') as word:
    lines = word.readlines()
print(lines)
And this is the result of my print:
['\xc3\xa9l\xc3\xa9phants\n', 'bonjour\n', '\xc3\xa9l\xc3\xa8ves\n']
This is my text file for this example:
éléphants
bonjour
élèves
Next, this is the rest of my main code, which puts the words into a Python set:
dict_word = Set()
for line in lines:
    print(line)
    dict_word.add(line[:-1].upper())  # get rid of the '\n'
print(dict_word)
This is the result of my print:
Set(['\xc3\xa9L\xc3\xa8VES', 'BONJOUR', '\xc3\xa9L\xc3\xa9PHANTS'])
What I want is this output:
Set(['ÉLÈVES', 'BONJOUR', 'ÉLÉPHANTS'])
But I can't figure out a way to get this result. I tried many things, including putting the line '# -*- encoding: utf-8 -*-' at the top of my file. I also tried 'with codecs.open()', but it didn't work either.
Thanks!
In Python 2 you can use the codecs module to read the file with an encoding. Remember that the repr of a unicode string will look funky (it starts with a u and escapes the non-ASCII characters), but the actual string is in fact unicode.
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
from sets import Set
import codecs

with codecs.open('test.txt', encoding='utf-8') as word:
    lines = [line.strip() for line in word.readlines()]

# since you print the list, it shows you the repr of its values
print(lines)

# but they really are unicode
for line in lines:
    print(line)
The output shows the unicode repr when printing the list, but the real string when printing the strings themselves.
[u'\xe9l\xe9phants', u'bonjour', u'\xe9l\xe8ves']
éléphants
bonjour
élèves
The reason is probably that you read the file using the wrong encoding.
In Python 3 you would simply switch:
from with open('.../test_unicode.txt', 'r') as word:
to with open('.../test_unicode.txt', 'r', encoding="utf-8") as word:
In Python 2, it seems you can do something like this: Backporting Python 3 open(encoding="utf-8") to Python 2
I.e. use io.open (you have to import io first), and specify encoding="utf-8". I would have expected this to work with codecs.open as well, if you specify that same keyword argument.
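For example, a sketch of the io.open variant for this case (assuming the file really is UTF-8, and using the built-in set rather than sets.Set):
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
import io

with io.open('.../test_unicode.txt', 'r', encoding='utf-8') as word:
    # each line comes back as unicode; strip the newline and upper-case it
    dict_word = set(line.strip().upper() for line in word)

print(dict_word)
# e.g. set([u'\xc9L\xc8VES', u'BONJOUR', u'\xc9L\xc9PHANTS'])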
You can try to infer the input encoding with chardet (note that chardet.detect() wants the raw bytes of the whole file, not a list of lines):
from sets import Set
import chardet

with open('.../test_unicode.txt', 'rb') as word:
    bin_data = word.read()

enc = chardet.detect(bin_data)
lines = bin_data.decode(enc['encoding'])
print(lines)

How not to decode escaped sequences when reading from file but keep the string representation

I am reading in a text file that contains lines with binary data dumped in an encoded fashion, but still as a string (at least in emacs):
E.g.:
\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207\016\327\201\360\242
This is perfectly fine for me, and when I read in that file I want to keep this string and not decode or change it in any way. However, when I read in the file, Python does the decoding. How can I prevent that?
with open("/path/to/file") as file:
for line in file:
print line
the output will look like:
'���k���G�r��#�\0320^��\021�C\035\000�\016ׁ��'
but should look like:
\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207\016\327\201\360\242
Edit: However, this encoded data is not the only data in the file; it is part of a larger text dump.
You can read the file as binary with the 'rb' option, and it will retain the data as it is.
Ex:
with open(PathToFile, 'rb') as file:
    raw_binary_data = file.read()
print(raw_binary_data)
If you really want the octal representation you can define a function that prints it back out.
import string

def octal_print(s):
    print(''.join(map(lambda x: x if x in string.printable else '\\' + oct(ord(x))[2:], s)))

s = '\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207'
octal_print(s)
# prints:
# \240\263\205k\347\301\360G\224\217yr\335\355#\333\320^\242\367\21\227C\35\0\207
Based on James's answer, I adapted the octal_print function to discriminate between actual octal escapes and innocent characters.
def octal_print(s):
    charlist = list()
    for character in s:
        try:
            character.decode('ascii')
            charlist.append(character)
        except:
            charlist.append('\\' + oct(ord(character))[1:])
    return ''.join(charlist)
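A quick usage sketch (Python 2, since str.decode() is used), with a shortened piece of the sample data:
s = '\240\263\205k\347\301\360G\224\217yr'
print octal_print(s)
# \240\263\205k\347\301\360G\224\217yr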

Why does ï»¿ appear in my data?

I downloaded the file 'pi_million_digits.txt' from here:
https://github.com/ehmatthes/pcc/blob/master/chapter_10/pi_million_digits.txt
I then used this code to open and read it:
filename = 'pi_million_digits.txt'

with open(filename) as file_object:
    lines = file_object.readlines()

pi_string = ''
for line in lines:
    pi_string += line.strip()

print(pi_string[:52] + "...")
print(len(pi_string))
However, the output produced is correct apart from the fact that it is preceded by some strange symbols: "ï»¿3.141...."
What causes these strange symbols? I am stripping the lines, so I'd expect such symbols to be removed.
It looks like you're opening a file with a UTF-8 encoded Byte Order Mark using the ISO-8859-1 encoding (presumably because this is the default encoding on your OS).
If you open it as bytes and read the first line, you should see something like this:
>>> next(open('pi_million_digits.txt', 'rb'))
b'\xef\xbb\xbf3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
… where \xef\xbb\xbf is the UTF-8 encoding of the BOM. Opened as ISO-8859-1, it looks like what you're getting:
>>> next(open('pi_million_digits.txt', encoding='iso-8859-1'))
'3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
… and opening it as UTF-8 shows the actual BOM character U+FEFF:
>>> next(open('pi_million_digits.txt', encoding='utf-8'))
'\ufeff3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
To strip the mark out, use the special encoding utf-8-sig:
>>> next(open('pi_million_digits.txt', encoding='utf-8-sig'))
'3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
The use of next() in the examples above is just for demonstration purposes. In your code, you just need to add the encoding argument to your open() line, e.g.
with open(filename, encoding='utf-8-sig') as file_object:
    # ... etc.

Encoding issue when writing to text file, with Python

I'm writing a program to 'manually' arrange a csv file into proper JSON syntax, using a short Python script. From the input file I use readlines() to format the file as a list of rows, which I manipulate and concatenate into a single string, which is then output to a separate .txt file. The output, however, contains gibberish instead of the Hebrew characters that were present in the input file, and the output is double-spaced horizontally (a whitespace character is added between each character). As far as I can tell, the problem has to do with the encoding, but I haven't been able to figure out what it is. When I detect the encoding of the input and output files (using the .encoding attribute), they both return None, which means they use the system default. Technical details: Python 2.7, Windows 7.
While there are a number of questions out there on this topic, I didn't find a direct answer to my problem.
Detecting the system defaults won't help me in this case, because I need the program to be portable.
Here's the code:
def txt_to_JSON(csv_list):
    ...some manipulation of the list...
    return JSON_string

file_name = "input_file.txt"
my_file = open(file_name)

# make each line of input file a value in a list
lines = my_file.readlines()

# break up each line into a list such that each 'column' is a value in that list
for i in range(0, len(lines)):
    lines[i] = lines[i].split("\t")

J_string = txt_to_JSON(lines)

json_file = open("output_file.txt", "w+")
json_file.write(J_string)
json_file.close()
All data needs to be encoded to be stored on disk. If you don't know the encoding, the best you can do is guess. There's a library for that: https://pypi.python.org/pypi/chardet
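For example, a minimal sketch of guessing the input encoding with chardet (the file name is taken from the question; the guess can be wrong for short or ambiguous files):
import chardet

# read the raw bytes, guess the encoding, then decode
with open("input_file.txt", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)   # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
text = raw.decode(guess["encoding"])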
I highly recommend Ned Batchelder's presentation
http://nedbatchelder.com/text/unipain.html
for details.
There's an explanation about the use of "unicode" as an encoding on windows: What's the difference between Unicode and UTF-8?
TLDR:
Microsoft uses UTF-16 as the encoding for Unicode strings, but decided to call it "unicode" as they also use it internally.
Even if Python2 is a bit lenient as to string/unicode conversions, you should get used to always decode on input and encode on output.
In your case
filename = 'where your data lives'
with open(filename, 'rb') as f:
    encoded_data = f.read()

decoded_data = encoded_data.decode("UTF-16")

# do stuff, resulting in result (all on unicode strings)
result = text_to_json(decoded_data)

encoded_result = result.encode("UTF-16")  # really, just using UTF-8 for everything makes things a lot easier

outfile = 'where your data goes'
with open(outfile, 'wb') as f:
    f.write(encoded_result)
You need to tell Python to use the Unicode character encoding to decode the Hebrew characters.
Here's a link to how you can read Unicode characters in Python: Character reading from file in Python
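For example, a minimal sketch using codecs (this assumes the input file is UTF-8; a Hebrew file produced on Windows might instead be cp1255 or UTF-16):
# -*- coding: utf-8 -*-
import codecs

# decode on input: lines come back as unicode strings
with codecs.open("input_file.txt", encoding="utf-8") as my_file:
    lines = my_file.readlines()

# ... build J_string (a unicode string) from the lines ...

# encode on output
with codecs.open("output_file.txt", "w", encoding="utf-8") as json_file:
    json_file.write(J_string)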

Python string encodings and ==

I am having some trouble with strings in Python not being == when I think they should be, and I believe it has something to do with the way they are encoded. Basically, I'm parsing some comma-separated values that are stored in zip archives (GTFS feeds specifically, for those who are curious).
I'm using the ZipFile module in Python to open certain files in the zip archives and then comparing the text there to some known values. Here's an example file:
agency_id,agency_name,agency_url,agency_phone,agency_timezone,agency_lang
ARLC,Arlington Transit,http://www.arlingtontransit.com,703-228-7433,America/New_York,en
The code I'm using is trying to identify the position of the string "agency_id" in the first line of the text so that I can use the corresponding value in any subsequent lines. Here's a snippet of the code:
zipped_feed = ZipFile(feed_name, "r")
agency_file = zipped_feed.open("agency.txt", "r")
line_num = 0
agencyline = agency_file.readline()
while agencyline:
    if line_num == 0:
        # this is the header, all we care about is the agency_id
        lineparts = agencyline.split(",")
        position = -1
        counter = 0
        for part in lineparts:
            part = part.strip()
            if part == "agency_id":
                position = counter
            counter += 1
        line_num += 1
        agencyline = agency_file.readline()
    else:
        .....
This code works for some zip archives, but not for others. I did some research and tried printing repr(part), and I got '\xef\xbb\xbfagency_id' instead of 'agency_id'. Does anyone know what's going on here and how I can fix it? Thanks for all the help!
That is a Byte Order Mark, which tells the encoding of the file, and in the case of UTF-16 and UTF-32 it also tells the endianness of the file. You can either interpret it or check for it and remove it from your string.
To remove it you could do this:
import codecs
unicode(part, "utf8").lstrip(codecs.BOM_UTF8.decode("utf8", "strict"))
Your input file seems to be UTF-8 and to start with a 'ZERO WIDTH NO-BREAK SPACE' character,
import unicodedata
unicodedata.name('\xef\xbb\xbf'.decode('utf8'))
# gives: 'ZERO WIDTH NO-BREAK SPACE'
which is used as a BOM (or, more accurately, to identify the file as being UTF-8, since byte order isn't really meaningful with UTF-8, but it's commonly called a BOM anyway)
Simple: some of your zip archives have the Unicode BOM (Byte Order Mark) at the beginning of the file. The BOM is used to indicate the byte order for multi-byte encodings; here the bytes '\xef\xbb\xbf' are its UTF-8 form, and you're reading them in as part of a bytestring. The easiest thing to do would be to check for it at the start of the string and remove it.
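For example, a minimal sketch of that check, using the variable names from the question:
import codecs

# drop a leading UTF-8 BOM from the header line before splitting it
if agencyline.startswith(codecs.BOM_UTF8):
    agencyline = agencyline[len(codecs.BOM_UTF8):]
lineparts = agencyline.split(",")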
What you've got is a file that may occasionally have a Unicode Byte Order mark at the front of the file. Sometimes this is introduced by editors to indicate encoding.
Here's some details - http://en.wikipedia.org/wiki/Byte_order_mark
Bottom line is that you could look for the \xef\xbb\xbf string, which is the marker for UTF-8 encoded data, and just strip it. Or the other choice is to open it with the codecs package:
with codecs.open('input', 'r', 'utf-8') as file:
or in your case
zipped_feed = ZipFile(feed_name, "r")
# wrapping zipped_feed.open(...) in a codecs StreamReader;
# 'utf-8-sig' also swallows a leading BOM
agency_file = codecs.getreader("utf-8-sig")(zipped_feed.open("agency.txt", "r"))
