Encoding with the open() parameter vs. the string encode() method: byte size - Python

I have come across something that I can't get my head around.
I tried to encode a string containing one non-ASCII character using both the string's encode() method and open()'s encoding parameter. For some reason, the two methods report a different size for the same write.
Here is sample code:
with open("in.txt", "wb") as f:
no = f.write("Wlazł".encode("utf-8"))
print(no) # -> 6
with open("in.txt", "w", encoding="utf-8") as f:
no = f.write("Wlazł")
print(no) # -> 5
Does anyone know why this is so?

When you open a file in binary mode you get an io.BufferedWriter (or an io.RawIOBase if buffering is disabled), and its write() returns the number of bytes written.
When you open a file in text mode you get an io.TextIOWrapper (an io.TextIOBase subclass), and TextIOBase.write returns the number of characters written.
So the reason for the difference is that one is a count of bytes, the other of characters.
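You can see the same difference without any file I/O; 'ł' is the only character in the string that needs more than one UTF-8 byte:
s = "Wlazł"
print(len(s))                  # 5 characters
print(len(s.encode("utf-8")))  # 6 bytes: 'ł' encodes to two bytes, 0xC5 0x82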

Related

How to encode when writing to file?

I am trying to write some data to a file. In some instances, obviously depending on the data I am trying to write, I get a UnicodeEncodeError (UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f622' in position 141: character maps to <undefined>).
I did some research and found out that I can encode the data I am writing with the encode function.
This is the code prior to modifying it (not supporting Unicode):
scriptDir = os.path.dirname(__file__)
path = os.path.join(scriptDir, filename)
with open(path, 'w') as fp:
    for sentence in iobTriplets:
        fp.write("\n".join("{} {} {}".format(triplet[0], triplet[1], triplet[2]) for triplet in sentence))
        fp.write("\n")
        fp.write("\n")
So I thought maybe I could just add the encoding when writing, like this:
fp.write("\n".join("{} {} {}".format(triplet[0],triplet[1],triplet[2]).encode('utf8') for triplet in sentence))
But that doesn't work as I am getting the following error:
TypeError: sequence item 0: expected str instance, bytes found
I also tried opening the file in binary mode by adding a b after the w, but that didn't help either.
Does anybody know how to fix this?
Btw: I am using python 3.
Opening the file in text mode already encodes automatically for you. There is no need to manually encode anything unless you are writing binary data.
You can specify any supported encoding in open():
with open(path, 'w', encoding='utf-16be') as fp:
Unless the file is opened in binary mode, you need to remove the str.encode() call inside fp.write():
fp.write("\n".join("{} {} {}".format(triplet[0],triplet[1],triplet[2]) for triplet in sentence))

How to open a file with utf-8 non encoded characters?

I want to open a text file (.dat) in Python, and I get the following error:
'utf-8' codec can't decode byte 0x92 in position 4484: invalid start byte
The file is supposed to be UTF-8 encoded, so there must be some character in it that cannot be decoded. Is there a way to handle the problem without hunting down every single offending character? I have a rather large text file, and finding the non-UTF-8 bytes by hand would take hours.
Here is my code
import codecs

f = codecs.open('compounds.dat', encoding='utf-8')
for line in f:
    if "InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
        print(line)
f.close()
It shouldn't "take you hours" to find the bad byte. The error tells you exactly where it is; it's at index 4484 in your input with a value of 0x92; if you did:
with open('compounds.dat', 'rb') as f:
    data = f.read()
the invalid byte would be at data[4484], and you can slice as you like to figure out what's around it.
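Continuing from that snippet, you could inspect the area directly:
print(data[4484])        # the offending byte as an int: 146, i.e. 0x92
print(data[4470:4500])   # a slice of the surrounding context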
In any event, if you just want to ignore or replace invalid bytes, that's what the errors parameter is for. Using io.open (because codecs.open is subtly broken in many ways, and io.open is both faster and more correct):
# If this is Py3, you don't even need the import; plain open is an alias for io.open
import io

with io.open('compounds.dat', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if u"InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
            print(line)
will just ignore the invalid bytes (dropping them as if they never existed). You can also pass errors='replace' to insert a replacement character for each garbage byte, so you're not silently dropping data.
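A quick demonstration of both modes on a made-up byte string containing the 0x92 byte from the question:
bad = b'it\x92s'                              # 0x92 is not valid UTF-8
print(bad.decode('utf-8', errors='ignore'))   # -> 'its' (the byte is dropped)
print(bad.decode('utf-8', errors='replace'))  # -> 'it\ufffds' (replacement character)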
If you are working with huge data, it is better to set the encoding explicitly, and if the error persists, add errors="ignore" as well:
with open("filename", 'r', encoding="utf-8", errors="ignore") as f:
    f.read()

Writing more data to file than reading?

I am currently experimenting with how Python 3 handles bytes when reading and writing data, and I have come across a particularly troubling problem that I can't seem to find the source of. I am basically reading bytes out of a JPEG file, converting them to integers using ord(), then returning the bytes to their original characters using chr(character).encode('utf-8') and writing them back into a JPEG file. No issue, right? Well, when I try opening the JPEG file, I get a Windows 8.1 notification saying it cannot open the photo. When I check the two files against each other, one is 5.04 MB and the other is 7.63 MB, which has me awfully confused.
def __main__():
    operating_file = open('photo.jpg', 'rb')
    while True:
        data_chunk = operating_file.read(64 * 1024)
        if len(data_chunk) == 0:
            print('COMPLETE')
            break
        else:
            new_operation = open('newFile.txt', 'ab')
            for character in list(data_chunk):
                new_operation.write(chr(character).encode('utf-8'))

if __name__ == '__main__':
    __main__()
This is the exact code I am using, any ideas on what is happening and how I can fix it?
NOTE: I am assuming that the list of numbers that list(data_chunk) provides is equivalent to what ord() would give.
Here is a simple example you might wish to play with:
import sys

f = open('gash.txt', 'rb')
stuff = f.read()  # stuff refers to a bytes object
f.close()
print(stuff)

f2 = open('gash2.txt', 'wb')
for i in stuff:
    f2.write(i.to_bytes(1, sys.byteorder))
f2.close()
As you can see, the bytes object is iterable, but in the for loop we get back an int in i. To convert that to a single byte I use the int.to_bytes() method.
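For a single value, bytes([i]) is an equivalent spelling:
i = 162                               # one byte read back as an int
print(i.to_bytes(1, sys.byteorder))   # b'\xa2'
print(bytes([i]))                     # b'\xa2' -- same result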
When you have a code point and you encode it in UTF-8, the result can contain more bytes than the original.
For a specific example, refer to the Wikipedia page on UTF-8 and consider the hexadecimal value 0xA2.
This is a single byte value, but because it is 0x80 or greater, encoding it as UTF-8 produces two bytes: 0xC2 0xA2.
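You can reproduce that expansion directly:
print(chr(0xA2).encode('utf-8'))   # b'\xc2\xa2' -- one code point, two bytes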
Given that you are pulling bytes out of your source file, my first recommendation would be to simply pass the bytes directly to the writer of your target file.
If you are trying to understand how file I/O works, be wary of encode() when using a binary file mode. Binary files don't need to be encoded or decoded; they are raw data.
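As a sketch, the whole program then collapses into a byte-for-byte copy (the source name is from the question; the target gets a .jpg extension here so Windows treats it as a photo):
with open('photo.jpg', 'rb') as src, open('newFile.jpg', 'wb') as dst:
    while True:
        data_chunk = src.read(64 * 1024)
        if not data_chunk:
            break
        dst.write(data_chunk)   # bytes in, bytes out; no encoding involved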

Read file in text mode but also count raw bytes in Python?

I'd like to read a file in text mode line by line, but at the same time I'd like to insert an intermediate step that works on the bytes and counts how many have been read so far.
Is there a good way in the standard library to achieve that (without manually opening in bytes mode, searching for newlines, encoding, ...)? In the end I need a text-mode reader object (to be passed to the CSV reader) that additionally has a byte counter.
Python 2
The csv module works with binary files in Python 2, so you can just call the file.tell() method to get the current byte offset in the file.
Python 3
You can't use text_file.tell() (TextIOBase instance) -- it is documented to return an opaque number that may not correspond to the actual byte position.
If it is acceptable for your use case to get the byte offset with ± bufsize precision then:
file = open(filename, 'rb') # open in binary mode
text_file = io.TextIOWrapper(file, newline='') # text mode
# pass text_file to csv module
byte_offset = file.tell() # get position ± buffering
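A minimal sketch of feeding this into the csv module (the file name here is made up):
import csv
import io

with open('data.csv', 'rb') as raw:
    text_file = io.TextIOWrapper(raw, encoding='utf-8', newline='')
    for row in csv.reader(text_file):
        print(row, raw.tell())   # byte offset, accurate only to within the buffer size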

Converting from utf-16 to utf-8 in Python 3

I'm programming in Python 3 and I'm having a small problem that I can't find any reference to on the net.
As far as I understand, the default string in Python is UTF-16, but I must work with UTF-8, and I can't find the command that will convert from the default encoding to UTF-8.
I'd appreciate your help very much.
In Python 3 there are two different datatypes that matter when you are working with string manipulation. First there is the string class, an object that represents Unicode code points. The important thing to grasp is that this string is not a bunch of bytes but really a sequence of characters. Second, there is the bytes class, which is just a sequence of bytes, often representing a string stored in some encoding (like UTF-8 or ISO-8859-15).
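A quick illustration of the two types:
s = 'ć'                  # str: a single character (code point U+0107)
b = s.encode('utf-8')    # bytes: b'\xc4\x87', two bytes long
print(type(s), len(s))   # <class 'str'> 1
print(type(b), len(b))   # <class 'bytes'> 2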
What does this mean for you? As far as I understand, you want to read and write UTF-8 files. Let's make a program that replaces every 'ć' with a 'ç' character:
def main():
    # Open the output file with an explicit encoding, so Python knows that
    # anything we write to it should be encoded as UTF-8.
    with open('output_file', 'w', encoding='utf-8') as out_file:
        # Read every line. We give open() the encoding so it returns str.
        for line in open('input_file', encoding='utf-8'):
            # String literals in Python 3 are already Unicode, so there are no
            # encoding worries here; the file object encodes to UTF-8 on write.
            out_file.write(line.replace('ć', 'ç'))
So when should you use bytes? Not often. An example I can think of is when you read something from a socket. If you have the data in a bytes object, you can turn it into a Unicode string with bytes.decode('encoding'), and vice versa with str.encode('encoding'). But as said, you probably won't need it.
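For completeness, the round trip with a made-up payload:
payload = b'caf\xc3\xa9'         # bytes as they might arrive from a socket
text = payload.decode('utf-8')   # -> 'café'
data = text.encode('utf-8')      # -> b'caf\xc3\xa9' again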
Still, because it is interesting, here is the hard way, where you encode everything yourself:
def main():
    # Open the file in binary mode, so we write bytes to it instead of strings.
    with open('output_file', 'wb') as out_file:
        # Read every line. Again we open in binary mode, so we get bytes.
        for line_bytes in open('input_file', 'rb'):
            # Convert the bytes to a string.
            line_string = line_bytes.decode('utf-8')
            # Replace the characters we want.
            line_string = line_string.replace('ć', 'ç')
            # Encode back to bytes for writing.
            out_bytes = line_string.encode('utf-8')
            # Write the bytes.
            out_file.write(out_bytes)
A good read about this topic (string encodings) is http://www.joelonsoftware.com/articles/Unicode.html. Highly recommended!
Source: http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
(P.S. As you see, I didn't mention UTF-16 in this post. I actually don't know whether Python uses it as its internal encoding or not, but it is totally irrelevant: the moment you are working with a string, you work with characters (code points), not bytes.)
