I am having some trouble with strings in Python not being == when I think they should be, and I believe it has something to do with the way they are encoded. Basically, I'm parsing some comma-separated values that are stored in zip archives (GTFS feeds specifically, for those who are curious).
I'm using the ZipFile module in Python to open certain files inside the zip archives and then comparing the text there to some known values. Here's an example file:
agency_id,agency_name,agency_url,agency_phone,agency_timezone,agency_lang
ARLC,Arlington Transit,http://www.arlingtontransit.com,703-228-7433,America/New_York,en
The code I'm using is trying to identify the position of the string "agency_id" in the first line of the text so that I can use the corresponding value in any subsequent lines. Here's a snippet of the code:
zipped_feed = ZipFile(feed_name, "r")
agency_file = zipped_feed.open("agency.txt", "r")
line_num = 0
agencyline = agency_file.readline()
while agencyline:
    if line_num == 0:
        # this is the header, all we care about is the agency_id
        lineparts = agencyline.split(",")
        position = -1
        counter = 0
        for part in lineparts:
            part = part.strip()
            if part == "agency_id":
                position = counter
            counter += 1
        line_num += 1
        agencyline = agency_file.readline()
    else:
        .....
This code works for some zip archives, but not for others. I did some research and tried printing repr(part), and I got '\xef\xbb\xbfagency_id' instead of 'agency_id'. Does anyone know what's going on here and how I can fix it? Thanks for all the help!
That is a Byte Order Mark, which identifies the encoding of the file and, in the case of UTF-16 and UTF-32, also tells you the endianness of the file. You can either interpret it, or check for it and remove it from your string.
To remove it you could do this:
import codecs
unicode(part, "utf8").lstrip(codecs.BOM_UTF8.decode("utf8", "strict"))
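In Python 3 (and via codecs in Python 2 as well), the "utf-8-sig" codec does this in one step: it decodes UTF-8 and silently drops a leading BOM if one is present. A minimal sketch:

```python
# "utf-8-sig" decodes UTF-8 and drops a leading BOM, if any;
# it behaves exactly like "utf-8" when no BOM is present
raw = b"\xef\xbb\xbfagency_id"
print(raw.decode("utf-8-sig"))           # agency_id
print(b"agency_id".decode("utf-8-sig"))  # agency_id (unchanged)
```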
Your input file seems to be utf-8 and starts with a 'ZERO WIDTH NO-BREAK SPACE' character:
import unicodedata
unicodedata.name('\xef\xbb\xbf'.decode('utf8'))
# gives: 'ZERO WIDTH NO-BREAK SPACE'
This character is used as a BOM (or, more accurately, to identify the file as utf-8, since byte order isn't really meaningful with utf-8; it's commonly called a BOM anyway).
Simple: some of your zip archives have the Unicode BOM (Byte Order Mark) at the beginning of the file. For multi-byte encodings like UTF-16 it indicates the byte order; the '\xef\xbb\xbf' you are seeing is its UTF-8 form, which you are reading in as part of a bytestring. The easiest thing to do would be to check for it at the start of the string and remove it.
What you've got is a file that may occasionally have a Unicode Byte Order Mark at the front. Sometimes this is introduced by editors to indicate the encoding.
Here are some details: http://en.wikipedia.org/wiki/Byte_order_mark
The bottom line is that you could look for the \xef\xbb\xbf byte sequence, which is the marker for UTF-8 encoded data, and just strip it. The other choice is to open the file with the codecs package:
with codecs.open('input', 'r', 'utf-8') as file:
or in your case
import codecs

zipped_feed = ZipFile(feed_name, "r")
# wrap the file object from zipped_feed.open(...) in a decoding reader;
# the "utf-8-sig" codec also swallows a leading BOM
agency_file = codecs.getreader("utf-8-sig")(zipped_feed.open("agency.txt", "r"))
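In Python 3 the same idea can be written with io.TextIOWrapper around the binary file object that ZipFile.open() returns; the "utf-8-sig" codec consumes the BOM automatically. This sketch builds a small in-memory zip as a stand-in for a real GTFS feed:

```python
import io
import zipfile

# stand-in for a real feed: agency.txt starts with a UTF-8 BOM
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("agency.txt",
                b"\xef\xbb\xbfagency_id,agency_name\nARLC,Arlington Transit\n")

with zipfile.ZipFile(buf, "r") as zf:
    # "utf-8-sig" strips a leading BOM if present, so the first header
    # field compares equal to "agency_id"
    with io.TextIOWrapper(zf.open("agency.txt"), encoding="utf-8-sig") as agency_file:
        header = agency_file.readline().strip().split(",")

print(header.index("agency_id"))  # 0
```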
Related
How does one read binary and text from the same file in Python? I know how to do each separately, and can imagine doing both very carefully, but not both with the built-in IO library directly.
So I have a file whose format has large chunks of UTF-8 text interspersed with binary data. The text does not have a length written before it or a special character like "\0" delineating it from the binary data; instead, there is a large portion of text near the end which, when parsed, means "we are coming to an end".
The optimal solution would be to have the built-in file reading classes have "read(n)" and "read_char(n)" methods, but alas they don't. I can't even open the file twice, once as text and once as binary, since the return value of tell() on the text one can't be used with the binary one in any meaningful way.
So my first idea would be to open the whole file as binary and when I reach a chunk of text, read it "character by character" until I realize that the text is ending and then go back to reading it as binary. However this means that I have to read byte-by-byte and do my own decoding of UTF-8 characters (do I need to read another byte for this character before doing something with it?). If it was a fixed-width character encoding I would just read that many bytes each time. In the end I would also like the universal line endings as supported by the Python text-readers, but that would be even more difficult to implement while reading byte-by-byte.
Another easier solution would be if I could ask the text file object its real offset in the file. That alone would solve all my problems.
One way might be to use Hachoir to define a file parsing protocol.
The simple alternative is to open the file in binary mode and manually initialise a buffer and text wrapper around it. You can then switch in and out of binary pretty neatly:
import io

my_file = io.open("myfile.txt", "rb")
my_file_buffer = io.BufferedReader(my_file, buffer_size=1)  # not as performant, but a larger buffer would "eat" into the binary data
my_file_text_reader = io.TextIOWrapper(my_file_buffer, encoding="utf-8")

string_buffer = ""
while True:
    while "near the end" not in string_buffer:
        string_buffer += my_file_text_reader.read(1)  # read one Unicode char at a time
    # binary data must be next. Where do we get the binary length from?
    print string_buffer
    data = my_file_buffer.read(3)
    print data
    string_buffer = ""
A quicker, less extensible way might be the approach you've suggested in your question: intelligently parse the text portions, reading one UTF-8 byte sequence at a time. The following code (from http://rosettacode.org/wiki/Read_a_file_character_by_character/UTF8#Python) seems to be a neat way to conservatively read UTF-8 bytes into characters from a binary file:
def get_next_character(f):
    # note: assumes valid utf-8
    c = f.read(1)
    while c:
        while True:
            try:
                yield c.decode('utf-8')
            except UnicodeDecodeError:
                # we've encountered a multibyte character
                # read another byte and try again
                c += f.read(1)
            else:
                # c was a valid char, and was yielded, continue
                c = f.read(1)
                break

# Usage:
with open("input.txt", "rb") as f:
    my_unicode_str = ""
    for c in get_next_character(f):
        my_unicode_str += c
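As an aside on the try/except trick above: UTF-8 lead bytes encode the sequence length, so instead of probing with repeated decodes you can compute up front how many continuation bytes to read. A Python 3 sketch (indexing a bytes object yields ints there), assuming valid UTF-8 just like the generator:

```python
def utf8_char_len(lead_byte):
    # sequence length implied by a UTF-8 lead byte (assumes valid UTF-8)
    if lead_byte < 0x80:           # 0xxxxxxx: single-byte (ASCII)
        return 1
    if lead_byte >> 5 == 0b110:    # 110xxxxx: 2-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:   # 1110xxxx: 3-byte sequence
        return 3
    return 4                       # 11110xxx: 4-byte sequence

print(utf8_char_len("a".encode("utf-8")[0]))  # 1
print(utf8_char_len("é".encode("utf-8")[0]))  # 2
print(utf8_char_len("€".encode("utf-8")[0]))  # 3
```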
I am using Python 2.6 on Linux. I have a shift_jis (Japanese) encoded .csv file that I am loading. I am reading the header in, and doing a regex replacement to translate a few values, then writing the file back as shift_jis. I am hitting a UnicodeDecodeError on one of the characters in the file, ①, which should be a valid character according to http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml. The other Japanese characters decode fine.
1) I am decoding the string using shift_jis in a list comprehension. What can I do if I want to just ignore (work around) this and other bad characters? Here is the code, with the csv values already read into list_of_row_values.
#! /usr/bin/python
# -*- coding: utf-8 -*-
import csv
import re

with open('test.csv', 'wb') as output_file:
    wr = csv.writer(output_file, delimiter=',', quoting=csv.QUOTE_NONE)
    # the following corresponds to reading from a shift_jis encoded csv file: "日付,直流電流計測①,直流電流計測②"
    # 直流電流計測① is throwing an exception when decoded, but it is a valid character according to
    # http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml
    list_of_row_values = ['\x93\xfa\x95t', '\x92\xbc\x97\xac\x93d\x97\xac\x8cv\x91\xaa\x87@', '\x92\xbc\x97\xac\x93d\x97\xac\x8cv\x91\xaa\x87A']
    # take away the last character in entries two and three and it would work,
    # but that means I'd have to know all the bad characters beforehand
    #list_of_row_values = ['\x93\xfa\x95t', '\x92\xbc\x97\xac\x93d\x97\xac\x8cv\x91\xaa', '\x92\xbc\x97\xac\x93d\x97\xac\x8cv\x91\xaa']
    try:
        list_of_unicode_row_values = [str.decode('shift_jis') for str in list_of_row_values]
    except UnicodeDecodeError:
        # Question: what if I want to just ignore the character that cannot be decoded and still get the list
        # of "日付,直流電流計測,直流電流計測" as unicode?
        # right now, list_of_unicode_row_values would remain undefined, and the next line will
        # have a NameError
        print 'UnicodeDecodeError'
    # do a regex substitution to translate one column heading value
    list_of_translated_unicode_row_values = \
        [re.sub('日付'.decode('utf-8'), 'Date Time', str) for str in list_of_unicode_row_values]
    list_of_translated_row_values = [unicode_str.encode('shift_jis') for unicode_str in list_of_translated_unicode_row_values]
    wr.writerow(list_of_translated_row_values)
2) On a side note, how should I report this Python bug that a particular shift_jis character seems to fail to be properly decoded?
In general, you can use errors='ignore' to skip over invalid characters:
list_of_unicode_row_values = [str.decode('shift_jis', errors='ignore') for str in list_of_row_values]
This results in the following entries in list_of_unicode_row_values:
日付
直流電流計測
直流電流計測
However, in your particular case, you are using the wrong encoding. Python's shift_jis encoding conforms to the JIS X 0208 standard, while the character ① exists in the newer JIS X 0213 standard. To use the latter, just use the shift_jisx0213 encoding:
list_of_unicode_row_values = [str.decode('shift_jisx0213') for str in list_of_row_values]
You will get the following entries:
日付
直流電流計測①
直流電流計測②
as expected.
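The difference is easy to demonstrate. In the following Python 3 sketch, b"\x87@" (bytes 0x87 0x40) is the Shift JIS byte pair for ①; the plain shift_jis codec rejects it while shift_jisx0213 decodes it:

```python
raw = b"\x92\xbc\x97\xac\x93d\x97\xac\x8cv\x91\xaa\x87@"  # 直流電流計測①

# plain shift_jis (JIS X 0208) cannot decode the circled digit ...
try:
    raw.decode("shift_jis")
except UnicodeDecodeError:
    print("shift_jis cannot decode the trailing \\x87\\x40")

# ... but shift_jisx0213 (JIS X 0213) can
print(raw.decode("shift_jisx0213"))  # 直流電流計測①
```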
I'm writing a short Python script to 'manually' rearrange a csv file into proper JSON syntax. From the input file I use readlines() to read the file as a list of rows, which I manipulate and concatenate into a single string, which is then output to a separate .txt file. The output, however, contains gibberish instead of the Hebrew characters that were present in the input file, and the output is 'double-spaced' horizontally (a whitespace character appears between each character). As far as I can understand, the problem has to do with the encoding, but I haven't been able to figure out what it is. When I check the encoding of the input and output files (using the .encoding attribute), they both return None, which means they use the system default. Technical details: Python 2.7, Windows 7.
While there are a number of questions out there on this topic, I didn't find a direct answer to my problem.
Detecting the system defaults won't help me in this case, because I need the program to be portable.
Here's the code:
def txt_to_JSON(csv_list):
    ...some manipulation of the list...
    return JSON_string

file_name = "input_file.txt"
my_file = open(file_name)

# make each line of input file a value in a list
lines = my_file.readlines()

# break up each line into a list such that each 'column' is a value in that list
for i in range(0, len(lines)):
    lines[i] = lines[i].split("\t")

J_string = txt_to_JSON(lines)

json_file = open("output_file.txt", "w+")
json_file.write(J_string)
json_file.close()
All data needs to be encoded to be stored on disk. If you don't know the encoding, the best you can do is guess. There's a library for that: https://pypi.python.org/pypi/chardet
I highly recommend Ned Batchelder's presentation
http://nedbatchelder.com/text/unipain.html
for details.
There's an explanation about the use of "unicode" as an encoding on windows: What's the difference between Unicode and UTF-8?
TLDR:
Microsoft uses UTF-16 as the encoding for Unicode strings, but decided to call it "unicode", as they also use it internally.
Even though Python 2 is a bit lenient about string/unicode conversions, you should get used to always decoding on input and encoding on output.
In your case
filename = 'where your data lives'
with open(filename, 'rb') as f:
    encoded_data = f.read()

decoded_data = encoded_data.decode("utf-16")

# do stuff, resulting in result (all on unicode strings)
result = txt_to_JSON(decoded_data)

encoded_result = result.encode("utf-16")  # really, just using utf-8 for everything makes things a lot easier

outfile = 'where your data goes'
with open(outfile, 'wb') as f:
    f.write(encoded_result)
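Incidentally, the 'double-spaced' output is the classic fingerprint of UTF-16 handled as single bytes: every ASCII-range character is followed by a NUL byte, and each Hebrew letter becomes two non-ASCII bytes (the gibberish). A quick sketch:

```python
# UTF-16-LE stores two bytes per BMP character, so ASCII text
# viewed byte-by-byte appears "double-spaced" with NULs
encoded = "abc".encode("utf-16-le")
print(encoded)  # b'a\x00b\x00c\x00'
```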
You need to tell Python which character encoding to use to decode the Hebrew characters.
Here's a link to how you can read Unicode characters in Python: Character reading from file in Python
I have a problem with comparing a string from a file with a string I entered in the program. I should get that they are equal, but no matter whether I use decode('utf-8') or not, I get that they are not equal. Here's the code:
final = open("info", 'r')
exported = open("final", 'w')
lines = final.readlines()
for line in lines:
    if line == "Wykształcenie i praca":  # error
        print "ok"
and here is how I save the file that I try to read:
comm_p = bs4.BeautifulSoup(comm)
comm_f.write(comm_p.prettify().encode('utf-8'))
for string in comm_p.strings:
    #print repr(string).encode('utf-8')
    save = string.encode('utf-8')  # this is how I save
    info.write(save)
    info.write("\n")
info.close()
and at the top of the file I have # -*- coding: utf-8 -*-
Any ideas?
This should do what you need:
# -*- coding: utf-8 -*-
import io

with io.open('info', encoding='utf-8') as final:
    lines = final.readlines()
    for line in lines:
        if line.strip() == u"Wykształcenie i praca":  # error
            print "ok"
You need to open the file with the right encoding, and since your string is not ascii, you should mark it as unicode.
First, you need some basic knowledge about encodings. This is a good place to start. You don't have to read everything right now, but try to get as far as you can.
About your current problem:
You're reading a UTF-8 encoded file (probably), but you're reading it as an ASCII file. open() doesn't do any conversion for you.
So what you need to do (at least):
use codecs.open("info", "r", encoding="utf-8") to read the file
use Unicode strings for comparison: if line.rstrip() == u"Wykształcenie i praca":
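Putting both fixes together as a runnable sketch (the file name "info" and the header string are taken from the question):

```python
import codecs

# write a utf-8 encoded file, as the question's save code does ...
with codecs.open("info", "w", encoding="utf-8") as f:
    f.write(u"Wykształcenie i praca\n")

# ... then read it back decoded and compare against a Unicode literal
with codecs.open("info", "r", encoding="utf-8") as f:
    line = f.readline()

print(line.rstrip() == u"Wykształcenie i praca")  # True
```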
It is likely the difference is in a '\n' character
readlines doesn't strip '\n' - see Best method for reading newline delimited files in Python and discarding the newlines?
In general it is not a good idea to put a Unicode string in your code, it would be a good idea to read it from a resource file
Use unicode for the string comparison:
>>> s = u'Wykształcenie i praca'
>>> s == u'Wykształcenie i praca'
True
>>>
When it comes to strings, unicode is the smartest move :)
I'm programming in Python 3 and I'm having a small problem which I can't find any reference to it on the net.
As far as I understand it, the default string is utf-16, but I must work with utf-8, and I can't find the command that will convert from the default one to utf-8.
I'd appreciate your help very much.
In Python 3 there are two different datatypes that matter when you are working with string manipulation. First there is the str class, an object that represents Unicode code points. The important thing to get is that this string is not some bytes, but really a sequence of characters. Secondly, there is the bytes class, which is just a sequence of bytes, often representing a string stored in an encoding (like utf-8 or iso-8859-15).
What does this mean for you? As far as I understand, you want to read and write utf-8 files. Let's make a program that replaces all 'ć' characters with 'ç':
def main():
    # First open an output file. We pass an encoding so Python knows that
    # whatever we write to the file should be encoded as utf-8.
    with open('output_file', 'w', encoding='utf-8') as out_file:
        # Read every line. We give open() the encoding so it will return Unicode strings.
        for line in open('input_file', encoding='utf-8'):
            # Replace the characters we want. A string literal in Python 3 is
            # automatically a Unicode string, so no worries about encoding there.
            # Because we opened the output file with the utf-8 encoding, print()
            # encodes the whole string to utf-8 on the way out.
            print(line.replace('ć', 'ç'), end='', file=out_file)
So when should you use bytes? Not often. An example I could think of would be when you read something from a socket. If you have the data in a bytes object, you can make it a Unicode string by calling its decode('encoding') method, and vice versa with encode('encoding') on a str. But as said, you probably won't need it.
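The decode()/encode() round trip mentioned above, as a minimal sketch:

```python
data = "français".encode("utf-8")  # str -> bytes
print(data)                        # b'fran\xc3\xa7ais'
text = data.decode("utf-8")        # bytes -> str
print(text == "français")          # True
```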
Still, because it is interesting, here the hard way, where you encode everything yourself:
def main():
    # Open the file in binary mode, so we write bytes to it instead of strings
    with open('output_file', 'wb') as out_file:
        # Read every line. Again, we open the input in binary mode, so we get bytes
        for line_bytes in open('input_file', 'rb'):
            # Convert the bytes to a string
            line_string = line_bytes.decode('utf-8')
            # Replace the characters we want
            line_string = line_string.replace('ć', 'ç')
            # Encode the result back to bytes
            out_bytes = line_string.encode('utf-8')
            # Write the bytes
            out_file.write(out_bytes)
Good reading about this topic (string encodings) is http://www.joelonsoftware.com/articles/Unicode.html. Really recommended read!
Source: http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
(P.S. As you see, I didn't mention utf-16 in this post. I actually don't know whether Python uses it as its internal encoding or not, but that is totally irrelevant: the moment you are working with a string, you work with characters (code points), not bytes.)