When I open a file with codecs.open('f.txt', 'r', encoding=None), Python 2.7.8 chooses some default encoding.
Which is it? And where is this documented?
Some experimentation has revealed that the default encoding is not utf-8, ascii, sys.getdefaultencoding(), locale.getpreferredencoding(), or locale.getpreferredencoding(False).
Edit (clarifying my motivation): I want to know which encoding is chosen by Python 2.7.8 when I run a script like this:
f = codecs.open('f.txt', 'r', encoding=None) # or equivalently: f=open('f.txt')
for line in f:
    print len(line) # obviously SOME encoding has been chosen if I can print the number of characters
I'm not interested in other ways to guess the encoding of a file.
It basically won't do any transparent encoding or decoding at all; it just opens the file and returns it.
Here is the code from the library:
def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
    """ Open an encoded file using the given mode and return
        a wrapped version providing transparent encoding/decoding.
        Note: The wrapped version will only accept the object format
        defined by the codecs, i.e. Unicode objects for most builtin
        codecs. Output is also codec dependent and will usually be
        Unicode as well.
        Files are always opened in binary mode, even if no binary mode
        was specified. This is done to avoid data loss due to encodings
        using 8-bit values. The default file mode is 'rb' meaning to
        open the file in binary read mode.
        encoding specifies the encoding which is to be used for the
        file.
        errors may be given to define the error handling. It defaults
        to 'strict' which causes ValueErrors to be raised in case an
        encoding error occurs.
        buffering has the same meaning as for the builtin open() API.
        It defaults to line buffered.
        The returned wrapped file object provides an extra attribute
        .encoding which allows querying the used encoding. This
        attribute is only available if an encoding was specified as
        parameter.
    """
    if encoding is not None:
        if 'U' in mode:
            # No automatic conversion of '\n' is done on reading and writing
            mode = mode.strip().replace('U', '')
            if mode[:1] not in set('rwa'):
                mode = 'r' + mode
        if 'b' not in mode:
            # Force opening of the file in binary mode
            mode = mode + 'b'
    file = __builtin__.open(filename, mode, buffering)
    if encoding is None:
        return file
    info = lookup(encoding)
    srw = StreamReaderWriter(file, info.streamreader, info.streamwriter, errors)
    # Add attributes to simplify introspection
    srw.encoding = encoding
    return srw
As you can see, if encoding is None it just returns the opened file.
Here is your file with each byte represented in decimal showing its corresponding ascii character:
46 .
46 .
46 .
32 'space'
48 0
45 -
49 1
10 'line feed'
10 'line feed'
91 [
69 E
118 v
101 e
110 n
116 t
32 'space'
34 "
72 H
97 a
114 r
118 v
97 a
114 r
100 d
32 'space'
67 C
117 u
112 p
32 'space'
51 3
48 0
180 'this is not ascii'
34 "
93 ]
10 'line feed'
46 .
46 .
46 .
The problem you hit when opening it as ASCII is the byte with the decimal value 180: ASCII only goes up to 127. So this must be some kind of extended ASCII where 128-255 are used for extra symbols. The Wikipedia article about ASCII (https://en.wikipedia.org/wiki/ASCII) mentions a popular extension called windows-1252, in which the decimal value 180 maps to the acute accent character (´). I then googled the string in your file to see what it actually relates to, and found "Harvard Cup 30´": http://www.365chess.com/tournaments/Harvard_Cup_30%C2%B4_1989/21650
So in summary, the correct encoding is probably windows-1252. Here is my test program:
import codecs
with codecs.open('f.txt', 'r', encoding='windows-1252') as f:
    print f.read()
outputs
... 0-1
[Event "Harvard Cup 30´"]
...
Using codecs.open('f.txt','r',encoding=None) returns byte strings instead of Unicode strings when the file is read. It doesn't try to decode the file data with an encoding at all. It is equivalent to open('f.txt','r'). The length you receive is the number of individual bytes in the line as stored in the file with no translation.
A small example:
>>> import codecs
>>> codecs.open('f.txt','r',encoding=None).read()
'abc\n'
>>> codecs.open('f.txt','r',encoding='ascii').read() # Note Unicode string returned.
u'abc\r\n'
>>> open('f.txt','r').read()
'abc\n'
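To make the difference concrete, here is a minimal sketch that uses an in-memory byte string in place of f.txt. The sample line is taken from the byte dump above; the variable names are my own. It should behave the same on Python 2.7 and Python 3:

```python
# One line of the file as raw bytes; 0xb4 is decimal 180.
raw_line = b'[Event "Harvard Cup 30\xb4"]\n'

# With no encoding you get bytes back, so len() counts bytes.
byte_count = len(raw_line)

# Decoding is an explicit, separate step.
text = raw_line.decode('windows-1252')

# ASCII cannot represent byte 180, so decoding as ASCII fails.
try:
    raw_line.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Re-encoded as UTF-8, the acute accent takes two bytes instead of one.
utf8_count = len(text.encode('utf-8'))

print(byte_count, len(text), utf8_count, ascii_ok)
```

The byte count and the character count happen to match here only because windows-1252 is a single-byte encoding; the UTF-8 form of the same text is one byte longer.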
Related
I have python 2 code that works:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
from os import path
filename = "test.bin" # file contents in hex: 57 58 59 5A 12 00 00 00 4E 44
ID = 4
myfile = open(filename, 'rb')
filesize = path.getsize(filename)
data = list(myfile.read(filesize))
myfile.close()
temp_ptr = data[ID:ID+2]
pointer = int(''.join(reversed(temp_ptr)).encode('hex'), 16)
print(pointer)
Prints "18"
However, it does not work in python 3. I get:
Traceback (most recent call last):
File "py2vs3.py", line 13, in <module>
ptr = int(''.join(reversed(temp_ptr)).encode('hex'), 16)
TypeError: sequence item 0: expected str instance, int found
I am simply grabbing one 32-bit field from a file and printing how C would see it. How do I make this work in Py3? All the code examples I find are for python 2, and the docs make no sense to me.
Python 3 distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return their contents as bytes objects without any decoding; see https://docs.python.org/3/library/functions.html#open
Instead of reading from a file, I reproduced your example inline below.
# Python 2
frame = "\x57\x58\x59\x5A\x12\x00\x00\x00\x4E\x44"
int(''.join(reversed(frame[4:6])).encode('hex'), 16)
# Result is 18
Same thing in Python 3
# Python 3
# The b prefix makes this a bytes literal, the same type
# returned when reading from a file in binary mode
frame = b"\x57\x58\x59\x5A\x12\x00\x00\x00\x4E\x44"
int.from_bytes(frame[4:6], "little")
# The 2nd argument "little" is the byte order: the least significant
# byte comes first; more details in the link below
# Result is 18
https://docs.python.org/3/library/stdtypes.html#int.from_bytes has more information about the method
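Since the question mentions a 32-bit field, here is a hedged sketch that reads all four bytes at offset 4 rather than two. It assumes the field is an unsigned little-endian integer, which the question does not state explicitly. The struct call works on Python 2 and 3; the int.from_bytes line is Python 3 only:

```python
import struct

frame = b"\x57\x58\x59\x5A\x12\x00\x00\x00\x4E\x44"
ID = 4  # byte offset of the field, as in the question

# "<I" means little-endian unsigned 32-bit integer (assumed layout).
(pointer,) = struct.unpack_from("<I", frame, ID)

# Python 3 only: the same value via int.from_bytes on a 4-byte slice.
same_pointer = int.from_bytes(frame[ID:ID + 4], "little")

print(pointer, same_pointer)
```

For this sample data both approaches give 18, the same value as the two-byte slice, because the upper two bytes of the field are zero.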
As Mad Wombat commented, Python 3 reads the file as a bytes object rather than a string. The following snippet essentially synthesizes the process.
data = [char for char in myfile.read()]+['\n']
I have this very simple Python code:
in_data = "eNrtmD1Lw0AY..."
print("Input: " + in_data)
out_data = in_data.decode('base64').decode('zlib').encode('zlib').encode('base64')
print("Output: " + out_data)
It outputs:
Input: eNrtmD1Lw0AY...
Output: eJztmE1LAkEY...
The string is also correctly decoded; if I display in_data.decode('base64').decode('zlib'), it gives the expected result.
Also, the formatting is different for both strings.
Why is the decoding/encoding not working properly? Are there some sort of parameters I should use?
Your data on input starts with the hex bytes 78 DA, your output starts with 78 9C:
>>> 'eNrt'.decode('base64').encode('hex')[:4]
'78da'
>>> 'eJzt'.decode('base64').encode('hex')[:4]
'789c'
DA is the highest compression level, 9C is the default. See What does a zlib header look like?
Rather than using .encode('zlib'), use the zlib.compress() function and set the level to 9:
import zlib
zlib.compress(decoded_data, 9).encode('base64')
The output of the base64 encoding inserts a newline every 76 characters to make it suitable for MIME encapsulation (emailing). You could use the base64.b64encode() function instead to encode without newlines.
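The two points above can be checked with a short sketch. The payload here is made-up stand-in data, not the data from the question, and the code is written for Python 3 (where the 'zlib' and 'base64' string codecs are no longer available on str):

```python
import base64
import zlib

payload = b"example payload " * 20  # stand-in for the decoded data

default_level = zlib.compress(payload)   # default compression level
best_level = zlib.compress(payload, 9)   # level 9, as suggested above

# The second header byte reflects the level used when compressing.
default_header = default_level[:2]
best_header = best_level[:2]

# b64encode returns a single line, with no newline every 76 characters.
encoded = base64.b64encode(best_level)

# Whatever the level, decompression recovers the original bytes.
round_trip = zlib.decompress(base64.b64decode(encoded))

print(default_header, best_header, round_trip == payload)
```

Matching the level makes the header agree, but identical compressed output is still not guaranteed across different zlib versions; only the decompressed data is.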
I'm trying to save concrete content of the dictionary to a file but when I try to write it, I get the following error:
Traceback (most recent call last):
File "P4.py", line 83, in <module>
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)
And here is the code:
from collections import Counter
with open("corpus.txt") as inf:
    wordtagcount = Counter(line.decode("latin_1").rstrip() for line in inf)
with open("lexic.txt", "w") as outf:
    outf.write('Palabra\tTag\tApariciones\n'.encode("utf-8"))
    for word,count in wordtagcount.iteritems():
        outf.write(u"{}\t{}\n".format(word, count).encode("utf-8"))
"""
2) TAGGING USING THE MODEL
Given the test files, assign each word the most probable tag
according to the model. Save the result in files that have
this format for each line: Word Prediction
"""
file=open("lexic.txt", "r") # open the lexic file (our model) (try with this one)
data=file.readlines()
file.close()
diccionario = {}
"""
In this portion of code we iterate the lines of the .txt document and we create a dictionary with a word as a key and a List as a value
Key: word
Value: List ([tag, #ocurrencesWithTheTag])
"""
for linea in data:
    aux = linea.decode('latin_1').encode('utf-8')
    sintagma = aux.split('\t') # Here we split the string into a list: [word, tag, occurrences], word=sintagma[0], tag=sintagma[1], occurrences=sintagma[2]
    if (sintagma[0] != "Palabra" and sintagma[1] != "Tag"): # We are not interested in the first line of the file, this is the filter
        if (diccionario.has_key(sintagma[0])): # Here we check if the word was included before in the dictionary
            aux_list = diccionario.get(sintagma[0]) # We know the name already exists in the dict, so we fetch its list
            aux_list.append([sintagma[1], sintagma[2]]) # We add to the list the tag and the occurrences for this concrete word
            diccionario.update({sintagma[0]:aux_list}) # Update the value with the new list (new list = previous list + newly appended element)
        else: # If the key does not exist in the dict, we add the values as a new list (no need to append)
            aux_list_else = ([sintagma[1],sintagma[2]])
            diccionario.update({sintagma[0]:aux_list_else})
"""
Here we create a new dictionary based on the dictionary created before, in this new dictionary (diccionario2) we want to keep the next
information:
Key: word
Value: List ([suggestedTag, #ocurrencesOfTheWordInTheDocument, probability])
For retrieve the information from diccionario, we have to keep in mind:
In case we have more than 1 Tag associated to a word (keyword ), we access to the first tag with keyword[0], and for ocurrencesWithTheTag with keyword[1],
from the second case and forward, we access to the information by this way:
diccionario.get(keyword)[2][0] -> with this we access to the second tag
diccionario.get(keyword)[2][1] -> with this we access to the second ocurrencesWithTheTag
diccionario.get(keyword)[3][0] -> with this we access to the third tag
...
..
.
etc.
"""
diccionario2 = dict.fromkeys(diccionario.keys())#We create a dictionary with the keys from diccionario and we set all the values to None
with open("estimation.txt", "w") as outfile:
    for keyword in diccionario:
        tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8')) # tagSugerido is the tag with the most occurrences for a concrete keyword
        maximo = float(diccionario.get(keyword)[1]) # maximo is a variable for the maximum number of occurrences in a keyword
        if ((len(diccionario.get(keyword))) > 2): # in case we have > 2 tags for a concrete word
            suma = float(diccionario.get(keyword)[1])
            for i in range (2, len(diccionario.get(keyword))):
                suma += float(diccionario.get(keyword)[i][1])
                if (diccionario.get(keyword)[i][1] > maximo):
                    tagSugerido = unicode(diccionario.get(keyword)[i][0].decode('utf-8'))
                    maximo = float(diccionario.get(keyword)[i][1])
            probabilidad = float(maximo/suma);
            diccionario2.update({keyword:([tagSugerido, suma, probabilidad])})
        else:
            diccionario2.update({keyword:([diccionario.get(keyword)[0],diccionario.get(keyword)[1], 1])})
        outfile.write(u"{}\t{}\n".format(keyword, tagSugerido).encode("utf-8"))
The desired output will look like this:
keyword(String) tagSugerido(String):
Hello NC
Friend N
Run V
...etc
The problematic line is:
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
Thank you.
Like zmo suggested:
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
should be:
outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))
A note on unicode in Python 2
Your software should only work with unicode strings internally, converting to a particular encoding on output.
To avoid making the same error over and over again, make sure you understand the difference between the ASCII and UTF-8 encodings, and also between str and unicode objects in Python.
The difference between ASCII and UTF-8 encoding:
ASCII needs just one byte to represent every character in its charset. UTF-8 needs up to four bytes per character to represent the complete Unicode charset.
ascii (default)
1 If the code point is < 128, each byte is the same as the value of the code point.
2 If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)
utf-8 (unicode transformation format)
1 If the code point is <128, it’s represented by the corresponding byte value.
2 If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
3 Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
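The ranges listed above can be checked with a small sketch. The sample characters are my own choice, one per range; the u prefix keeps the literals valid on both Python 2.7 and Python 3:

```python
samples = [
    (u"A", 1),           # U+0041, below 128
    (u"\u00e9", 2),      # U+00E9, between 128 and 0x7ff
    (u"\u2014", 3),      # U+2014, above 0x7ff
    (u"\U0001F600", 4),  # U+1F600, outside the Basic Multilingual Plane
]

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = [len(ch.encode("utf-8")) for ch, _ in samples]

# ASCII cannot represent code points of 128 or greater.
try:
    u"\u00e9".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(utf8_lengths, ascii_failed)
```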
The difference between str and unicode objects:
You can say that str is basically a byte string and unicode is a Unicode string. A str holds bytes in some encoding such as ASCII or UTF-8; a unicode object holds code points and only gets an encoding when you call encode() on it.
str vs. unicode
1 str = byte string (8-bit) - uses \x and two digits
2 unicode = unicode string - uses \u and four digits
3 basestring
      /\
     /  \
  str    unicode
If you follow some simple rules you should be fine handling str/unicode objects in different encodings like ASCII or UTF-8 or whatever encoding you have to use:
Rules
1 encode(): Gets you from Unicode -> bytes
encode([encoding], [errors='strict']), returns an 8-bit string version of the Unicode string,
2 decode(): Gets you from bytes -> Unicode
decode([encoding], [errors]) method that interprets the 8-bit string using the given encoding
3 codecs.open(encoding="utf-8"): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
4 u'': Makes your string literals into Unicode objects rather than byte sequences.
5 unicode(string[, encoding, errors])
Warning: Don’t use encode() on bytes or decode() on Unicode objects
And again: Software should only work with Unicode strings internally, converting to a particular encoding on output.
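Rule 3 and the closing advice can be sketched as follows. This is only an illustration: the file names and the sample row are made up, and it uses io.open (available on Python 2.6+ and Python 3) in place of codecs.open, so that the file object does the decoding on input and the encoding on output:

```python
import io
import os
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "lexic_latin1.txt")     # hypothetical input
dst = os.path.join(workdir, "estimation_utf8.txt")  # hypothetical output

# Create a small Latin-1 input file so the example is self-contained.
with io.open(src, "w", encoding="latin_1") as f:
    f.write(u"ni\u00f1o\tNC\t3\n")

# Decode on input: every line read here is already Unicode text.
with io.open(src, "r", encoding="latin_1") as f:
    rows = [line.rstrip(u"\n").split(u"\t") for line in f]

# Work with Unicode internally, and let the file encode on output.
with io.open(dst, "w", encoding="utf-8") as f:
    for word, tag, count in rows:
        f.write(u"{}\t{}\n".format(word, tag))

# Inspect the bytes that actually reached the disk.
with io.open(dst, "rb") as f:
    raw = f.read()

print(rows, raw)
```

Because every value passed to format() is Unicode and the only encoding step is inside the file object, there is no implicit ASCII conversion left to fail.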
Since you haven't given a short, self-contained example that illustrates your question, I'll just give you general advice on what the error is likely to be:
If you're getting a decode error, it's because tagSugerido is read as ASCII and not as Unicode. To fix that, you should do:
tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8'))
to store it as a unicode object.
Then you're likely to get an encode error at the write() stage, and you should fix your write the following way:
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
should be:
outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))
I literally answered a very similar question moments ago. And when working with Unicode strings, switch to Python 3; it'll make your life easier!
If you cannot switch to Python 3 just yet, you can make Python 2 behave almost like Python 3 by using the __future__ import statement:
from __future__ import absolute_import, division, print_function, unicode_literals
N.B.: instead of doing:
file=open("lexic.txt", "r") # open the lexic file (our model) (try with this one)
data=file.readlines()
file.close()
which will fail to properly close the file descriptor if readlines() fails, you had better do:
with open("lexic.txt", "r") as f:
    data=f.readlines()
which takes care of always closing the file, even upon failure.
N.B.2: Avoid using file as a variable name, since it shadows a Python built-in type; use f or lexic_file instead.
I'm making an encryption program and I need to open a file in binary mode to access non-ASCII and non-printable characters. I need to check whether each character from the file is a letter, number, symbol or unprintable character. That means I have to check one by one whether the bytes (when they are decoded to ASCII) match any of these characters:
{^9,dzEV=Q4ciT+/s};fnq3BFh% #2!k7>YSU<GyD\I]|OC_e.W0M~ua-jR5lv1wA`#8t*xr'K"[P)&b:g$p(mX6Ho?JNZL
I think I could encode these characters above to binary and then compare them with bytes. I don't know how to do this.
P.S. Sorry for bad English and binary misunderstanding. (I hope you know what I mean by bytes, I mean characters in binary mode like this):
\x01\x00\x9a\x9c\x18\x00
There are two major string types in Python: bytestrings (a sequence of bytes) that represent binary data and Unicode strings (a sequence of Unicode codepoints) that represent human-readable text. It is simple to convert one into another (☯):
unicode_text = bytestring.decode(character_encoding)
bytestring = unicode_text.encode(character_encoding)
If you open a file in binary mode e.g., 'rb' then file.read() returns a bytestring (bytes type):
>>> b'A' == b'\x41' == chr(0b1000001).encode()
True
There are several methods that can be used to classify bytes:
string methods such as bytes.isdigit():
>>> b'1'.isdigit()
True
string constants such as string.printable
>>> import string
>>> b'!' in string.printable.encode()
True
regular expressions such as \d
>>> import re
>>> bool(re.match(br'\d+$', b'123'))
True
classification functions in curses.ascii module e.g., curses.ascii.isprint()
>>> from curses import ascii
>>> bytearray(filter(ascii.isprint, b'123'))
bytearray(b'123')
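Putting the string constants to work on the original question, here is a minimal Python 3 sketch that sorts each byte into one of the four categories asked about. The function name, the category labels, and the decision to count the space as a symbol are my own assumptions, not something the question specifies:

```python
import string

# Byte values (ints) for each category, built from the string constants.
LETTERS = frozenset(string.ascii_letters.encode("ascii"))
DIGITS = frozenset(string.digits.encode("ascii"))
SYMBOLS = frozenset((string.punctuation + " ").encode("ascii"))

def classify_byte(value):
    """Return a coarse category for one byte given as an int in range(256)."""
    if value in LETTERS:
        return "letter"
    if value in DIGITS:
        return "digit"
    if value in SYMBOLS:
        return "symbol"
    return "unprintable"

# Iterating over a bytes object in Python 3 yields ints.
sample = b"\x01\x00\x9a\x9c\x18\x00A7!"
categories = [classify_byte(b) for b in sample]
print(categories)
```

In real use, sample would come from open(filename, 'rb').read(); set membership keeps each lookup cheap even for large files.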
bytearray is a mutable sequence of bytes — unlike a bytestring you can change it inplace e.g., to lowercase every 3rd byte that is uppercase:
>>> import string
>>> a = bytearray(b'ABCDEF_')
>>> uppercase = string.ascii_uppercase.encode()
>>> a[::3] = [b | 0b0100000 if b in uppercase else b
... for b in a[::3]]
>>> a
bytearray(b'aBCdEF_')
Notice: b'ad' are lowercase but b'_' remained the same.
To modify a binary file inplace, you could use mmap module e.g., to lowercase 4th column in every other line in 'file':
#!/usr/bin/env python3
import mmap
import string
uppercase = string.ascii_uppercase.encode()
ncolumn = 3 # select 4th column
with open('file', 'r+b') as file, \
     mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    while True:
        mm.readline() # ignore every other line
        pos = mm.tell() # remember current position
        if not mm.readline(): # EOF
            break
        if mm[pos + ncolumn] in uppercase:
            mm[pos + ncolumn] |= 0b0100000 # lowercase
Note: Python 2 and 3 APIs differ in this case. The code uses Python 3.
Input
ABCDE1
FGHIJ
ABCDE
FGHI
Output
ABCDE1
FGHiJ
ABCDE
FGHi
Notice: 4th column became lowercase on 2nd and 4th lines.
Typically if you want to change a file: you read from the file, write modifications to a temporary file, and on success you move the temporary file in place of the original file:
#!/usr/bin/env python3
import os
import string
from tempfile import NamedTemporaryFile
caesar_shift = 3
filename = 'file'
def caesar_bytes(plaintext, shift, alphabet=string.ascii_lowercase.encode()):
    shifted_alphabet = alphabet[shift:] + alphabet[:shift]
    return plaintext.translate(plaintext.maketrans(alphabet, shifted_alphabet))
dest_dir = os.path.dirname(filename)
chunksize = 1 << 15
with open(filename, 'rb') as file, \
     NamedTemporaryFile('wb', dir=dest_dir, delete=False) as tmp_file:
    while True: # encrypt
        chunk = file.read(chunksize)
        if not chunk: # EOF
            break
        tmp_file.write(caesar_bytes(chunk, caesar_shift))
os.replace(tmp_file.name, filename)
Input
abc
def
ABC
DEF
Output
def
ghi
ABC
DEF
To convert the output back, set caesar_shift = -3.
To open a file in binary mode you use the open("filena.me", "rb") command. I've never used the command personally, but that should get you the information you need.
I am getting an image from an email attachment that will never touch disk. The image will be placed into a StringIO container and processed by PIL. How do I get the file size in bytes?
image_file = StringIO('image from email')
im = Image.open(image_file)
Use StringIO's .tell() method by seeking to the end of the file:
>>> from StringIO import StringIO
>>> s = StringIO("foobar")
>>> s.tell()
0
>>> s.seek(0, 2)
>>> s.tell()
6
In your case:
image_file = StringIO('image from email')
image_file.seek(0, 2) # Seek to the end
bytes = image_file.tell() # Get no. of bytes
image_file.seek(0) # Seek to the start
im = Image.open(image_file)
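On Python 3, binary data such as an image belongs in io.BytesIO rather than StringIO. Here is a hedged sketch of the same seek/tell idea wrapped in a helper that restores the original position; the function name is my own and the sample bytes are only a stand-in for a real attachment:

```python
import io
import os

def stream_size(stream):
    """Return the total size in bytes of a seekable binary stream.

    The current position is saved and restored, so the caller can keep
    reading from wherever it was.
    """
    position = stream.tell()
    stream.seek(0, os.SEEK_END)  # jump to the end
    size = stream.tell()         # offset at the end == number of bytes
    stream.seek(position)        # go back to where we started
    return size

# Stand-in data: the 8-byte PNG signature followed by 8 zero bytes.
image_file = io.BytesIO(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)
size = stream_size(image_file)
print(size, image_file.tell())
```

For an in-memory BytesIO, len(image_file.getvalue()) gives the same number, at the cost of copying the data.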
Suppose you have (Python 3):
>>> import os
>>> from io import StringIO
>>> s = StringIO("cat\u2014jack")
>>> s.seek(0, os.SEEK_END)
8
>>> print(s.tell())
8
This however is incorrect as a byte count: 8 is the number of characters. \u2014 is an em dash, a single character that is 3 bytes long in UTF-8.
>>> len("\u2014")
1
>>> len("\u2014".encode("utf-8"))
3
StringIO may not use UTF-8 for storage either. It used to use UCS-2 or UCS-4, but since PEP 393 it may be using UTF-8 or some other encoding depending on the circumstances.
What matters is the binary representation you ultimately go with. If you're encoding text and will eventually write the file out as UTF-8, you have to encode the value in its entirety to know how many bytes it will take up. This is because UTF-8 is a variable-length encoding: a character may require multiple bytes.
You could do something like:
>>> s = StringIO("cat\u2014jack")
>>> len(s.getvalue().encode('utf-8'))
10