I have Python 2 code that works:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
from os import path
filename = "test.bin" # file contents in hex: 57 58 59 5A 12 00 00 00 4E 44
ID = 4
myfile = open(filename, 'rb')
filesize = path.getsize(filename)
data = list(myfile.read(filesize))
myfile.close()
temp_ptr = data[ID:ID+2]
pointer = int(''.join(reversed(temp_ptr)).encode('hex'), 16)
print(pointer)
Prints "18"
However, it does not work in Python 3. I get:
Traceback (most recent call last):
File "py2vs3.py", line 13, in <module>
pointer = int(''.join(reversed(temp_ptr)).encode('hex'), 16)
TypeError: sequence item 0: expected str instance, int found
I am simply grabbing one 32-bit field from a file and printing how C would see it. How do I make this work in Python 3? All the code examples I find are for Python 2, and the docs make no sense to me.
Python 3 distinguishes between binary and text I/O. Files opened in binary mode (with 'b' in the mode argument) return their contents as bytes objects without any decoding; see https://docs.python.org/3/library/functions.html#open
I have imitated your example inline below instead of reading from a file.
# Python 2
frame = "\x57\x58\x59\x5A\x12\x00\x00\x00\x4E\x44"
int(''.join(reversed(frame[4:6])).encode('hex'), 16)
# Result is 18
The same thing in Python 3:
# Python 3
# The b'' prefix signifies a bytes literal, the same type
# returned when reading from a file in binary mode
frame = b"\x57\x58\x59\x5A\x12\x00\x00\x00\x4E\x44"
int.from_bytes(frame[4:6], "little")
# The 2nd argument specifies the byte order: "little" means the least
# significant byte comes first, "big" means it comes last; see the link below
# Result is 18
https://docs.python.org/3/library/stdtypes.html#int.from_bytes has more information about the method
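If you need a version that runs unchanged on both Python 2 and 3, the struct module can unpack the same field; a minimal sketch ('<H' denotes a little-endian unsigned 16-bit integer):
import struct
frame = b"\x57\x58\x59\x5A\x12\x00\x00\x00\x4E\x44"
(pointer,) = struct.unpack('<H', frame[4:6])  # unpack always returns a tuple
print(pointer)
# Result is 18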
As Mad Wombat commented, Python 3 reads the file as a sequence of bytes rather than a string, so iterating over the data yields ints instead of one-character strings. The following snippet essentially synthesizes the process:
data = [byte for byte in myfile.read()] + [ord('\n')]  # a list of ints in Python 3
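Putting it together, here is a sketch of the original script ported to Python 3 (same test.bin as above):
#!/usr/bin/env python3
from os import path
filename = "test.bin"  # file contents in hex: 57 58 59 5A 12 00 00 00 4E 44
ID = 4
with open(filename, 'rb') as myfile:
    data = myfile.read(path.getsize(filename))  # a bytes object
pointer = int.from_bytes(data[ID:ID+2], 'little')
print(pointer)  # prints 18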
Related
When I open a file with codecs.open('f.txt', 'r', encoding=None), Python 2.7.8 chooses some default encoding.
Which is it? And where is this documented?
Some experimentation has revealed that the default encoding is not utf-8, ascii, sys.getdefaultencoding(), locale.getpreferredencoding(), or locale.getpreferredencoding(False).
Edit (clarifying my motivation): I want to know which encoding is chosen by Python 2.7.8 when I run a script like this:
f = codecs.open('f.txt', 'r', encoding=None) # or equivalently: f=open('f.txt')
for line in f:
    print len(line) # obviously SOME encoding has been chosen if I can print the number of characters
I'm not interested in other ways to guess the encoding of a file.
It basically won't do any transparent encoding/decoding at all; it just opens the file and returns it. Here is the code from the library:
def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
    """ Open an encoded file using the given mode and return
        a wrapped version providing transparent encoding/decoding.

        Note: The wrapped version will only accept the object format
        defined by the codecs, i.e. Unicode objects for most builtin
        codecs. Output is also codec dependent and will usually be
        Unicode as well.

        Files are always opened in binary mode, even if no binary mode
        was specified. This is done to avoid data loss due to encodings
        using 8-bit values. The default file mode is 'rb' meaning to
        open the file in binary read mode.

        encoding specifies the encoding which is to be used for the
        file.

        errors may be given to define the error handling. It defaults
        to 'strict' which causes ValueErrors to be raised in case an
        encoding error occurs.

        buffering has the same meaning as for the builtin open() API.
        It defaults to line buffered.

        The returned wrapped file object provides an extra attribute
        .encoding which allows querying the used encoding. This
        attribute is only available if an encoding was specified as
        parameter.
    """
    if encoding is not None:
        if 'U' in mode:
            # No automatic conversion of '\n' is done on reading and writing
            mode = mode.strip().replace('U', '')
            if mode[:1] not in set('rwa'):
                mode = 'r' + mode
        if 'b' not in mode:
            # Force opening of the file in binary mode
            mode = mode + 'b'
    file = __builtin__.open(filename, mode, buffering)
    if encoding is None:
        return file
    info = lookup(encoding)
    srw = StreamReaderWriter(file, info.streamreader, info.streamwriter, errors)
    # Add attributes to simplify introspection
    srw.encoding = encoding
    return srw
As you can see, if encoding is None it just returns the opened file.
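You can verify this yourself; a quick check in Python 2, assuming some existing f.txt:
import codecs
f = codecs.open('f.txt', 'r', encoding=None)
print type(f)         # <type 'file'> -- the plain builtin file object
print type(f.read())  # <type 'str'>  -- raw bytes, no decoding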
Here is your file with each byte represented in decimal showing its corresponding ascii character:
46 .
46 .
46 .
32 'space'
48 0
45 -
49 1
10 'line feed'
10 'line feed'
91 [
69 E
118 v
101 e
110 n
116 t
32 'space'
34 "
72 H
97 a
114 r
118 v
97 a
114 r
100 d
32 'space'
67 C
117 u
112 p
32 'space'
51 3
48 0
180 'this is not ascii'
34 "
93 ]
10 'line feed'
46 .
46 .
46 .
The issue you are having when opening it as ascii is the byte with the decimal value 180. Ascii only goes up to 127, so this got me thinking it must be some kind of extended ascii, where 128-255 are used for extra symbols. After a good read of the Wikipedia article about ascii (https://en.wikipedia.org/wiki/ASCII), I found it mentions a popular extension called windows-1252. In windows-1252 the decimal value 180 maps to the acute accent character (´). Then I decided to google the string in your file to see what it actually related to, and that is when I found "Harvard Cup 30´": http://www.365chess.com/tournaments/Harvard_Cup_30%C2%B4_1989/21650
So in summary, the correct encoding is probably windows-1252. Here is my test program:
import codecs
with codecs.open('f.txt', 'r', encoding='windows-1252') as f:
    print f.read()
outputs
... 0-1
[Event "Harvard Cup 30´"]
...
Using codecs.open('f.txt','r',encoding=None) returns byte strings instead of Unicode strings when the file is read. It doesn't try to decode the file data with an encoding at all; it is equivalent to open('f.txt','r'). The length you receive is the number of individual bytes in the line, as stored in the file, with no translation.
A small example:
>>> import codecs
>>> codecs.open('f.txt','r',encoding=None).read()
'abc\n'
>>> codecs.open('f.txt','r',encoding='ascii').read() # Note Unicode string returned.
u'abc\r\n'
>>> open('f.txt','r').read()
'abc\n'
I'm making an encryption program and I need to open a file in binary mode to access non-ASCII and non-printable characters. I need to check whether a character from the file is a letter, a number, a symbol, or an unprintable character. That means I have to check one by one whether the bytes (when decoded to ascii) match any of these characters:
{^9,dzEV=Q4ciT+/s};fnq3BFh% #2!k7>YSU<GyD\I]|OC_e.W0M~ua-jR5lv1wA`#8t*xr'K"[P)&b:g$p(mX6Ho?JNZL
I think I could encode the characters above to binary and then compare them with the bytes, but I don't know how to do this.
P.S. Sorry for bad English and binary misunderstanding. (I hope you
know what I mean by bytes, I mean characters in binary mode like
this):
\x01\x00\x9a\x9c\x18\x00
There are two major string types in Python: bytestrings (a sequence of bytes) that represent binary data, and Unicode strings (a sequence of Unicode codepoints) that represent human-readable text. It is simple to convert one into the other:
unicode_text = bytestring.decode(character_encoding)
bytestring = unicode_text.encode(character_encoding)
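For example, using UTF-8 as the character encoding (Python 3 syntax):
>>> 'élève'.encode('utf-8')
b'\xc3\xa9l\xc3\xa8ve'
>>> b'\xc3\xa9l\xc3\xa8ve'.decode('utf-8')
'élève'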
If you open a file in binary mode, e.g. 'rb', then file.read() returns a bytestring (bytes type):
>>> b'A' == b'\x41' == chr(0b1000001).encode()
True
There are several methods that can be used to classify bytes:
string methods such as bytes.isdigit():
>>> b'1'.isdigit()
True
string constants such as string.printable
>>> import string
>>> b'!' in string.printable.encode()
True
regular expressions such as \d
>>> import re
>>> bool(re.match(br'\d+$', b'123'))
True
classification functions in curses.ascii module e.g., curses.ascii.isprint()
>>> from curses import ascii
>>> bytearray(filter(ascii.isprint, b'123'))
bytearray(b'123')
bytearray is a mutable sequence of bytes; unlike a bytestring, you can change it in place, e.g. to lowercase every 3rd byte that is uppercase:
>>> import string
>>> a = bytearray(b'ABCDEF_')
>>> uppercase = string.ascii_uppercase.encode()
>>> a[::3] = [b | 0b0100000 if b in uppercase else b
... for b in a[::3]]
>>> a
bytearray(b'aBCdEF_')
Notice: b'ad' are lowercase but b'_' remained the same.
To modify a binary file in place, you could use the mmap module, e.g. to lowercase the 4th column in every other line of 'file':
#!/usr/bin/env python3
import mmap
import string
uppercase = string.ascii_uppercase.encode()
ncolumn = 3 # select 4th column
with open('file', 'r+b') as file, \
     mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    while True:
        mm.readline()  # ignore every other line
        pos = mm.tell()  # remember current position
        if not mm.readline():  # EOF
            break
        if mm[pos + ncolumn] in uppercase:
            mm[pos + ncolumn] |= 0b0100000  # lowercase
Note: Python 2 and 3 APIs differ in this case. The code uses Python 3.
Input
ABCDE1
FGHIJ
ABCDE
FGHI
Output
ABCDE1
FGHiJ
ABCDE
FGHi
Notice: the 4th column became lowercase on the 2nd and 4th lines.
Typically, if you want to change a file, you read from the file, write modifications to a temporary file, and on success you move the temporary file in place of the original file:
#!/usr/bin/env python3
import os
import string
from tempfile import NamedTemporaryFile
caesar_shift = 3
filename = 'file'
def caesar_bytes(plaintext, shift, alphabet=string.ascii_lowercase.encode()):
    shifted_alphabet = alphabet[shift:] + alphabet[:shift]
    return plaintext.translate(plaintext.maketrans(alphabet, shifted_alphabet))

dest_dir = os.path.dirname(filename)
chunksize = 1 << 15
with open(filename, 'rb') as file, \
     NamedTemporaryFile('wb', dir=dest_dir, delete=False) as tmp_file:
    while True:  # encrypt
        chunk = file.read(chunksize)
        if not chunk:  # EOF
            break
        tmp_file.write(caesar_bytes(chunk, caesar_shift))
os.replace(tmp_file.name, filename)
Input
abc
def
ABC
DEF
Output
def
ghi
ABC
DEF
To convert the output back, set caesar_shift = -3.
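As a quick sanity check (assuming caesar_bytes from the sketch above is in scope):
>>> caesar_bytes(caesar_bytes(b'abc', 3), -3)  # encrypt, then decrypt
b'abc'
>>> caesar_bytes(b'xyz', 3)  # the alphabet wraps around
b'abc'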
To open a file in binary mode you use the open("filena.me", "rb") call. I've never used it for this personally, but that should get you the information you need.
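For instance, a minimal Python 3 sketch along those lines ("filena.me" is just the placeholder name from above, and the categories are assumptions about what you need):
with open("filena.me", "rb") as f:
    data = f.read()
for b in data:   # iterating over bytes yields ints in Python 3
    ch = chr(b)  # map the byte value to the corresponding code point
    if ch.isalpha():
        kind = "letter"
    elif ch.isdigit():
        kind = "number"
    elif ch.isprintable():
        kind = "symbol or space"
    else:
        kind = "unprintable"
    print(hex(b), kind)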
I'm dealing with a character-separated hex file, where each field has a particular start code. I've opened the file as 'rb', but I was wondering: after I get the index of the start code using .find, how do I read a certain number of bytes from that position?
This is how I am loading the file and what I am attempting to do
with open(someFile, 'rb') as fileData:
    startIndex = fileData.find('(G')
    data = fileData[startIndex:7]
where 7 is the number of bytes I want to read from the index returned by the find function. I am using Python 2.7.3.
You can get the position of a substring in a bytestring under Python 2.7 like this:
>>> with open('student.txt', 'rb') as f:
... data = f.read()
...
>>> data # holds the French word for student: élève
'\xc3\xa9l\xc3\xa8ve\n'
>>> len(data) # this shows we are dealing with bytes here, because "élève\n" would be 6 characters long, had it been properly decoded!
8
>>> len(data.decode('utf-8'))
6
>>> data.find('\xa8') # continue with the bytestring...
4
>>> bytes_to_read = 3
>>> data[4:4+bytes_to_read]
'\xa8ve'
You can look for the special characters, and for compatibility with Python 3 it's better to prepend the character with a b, indicating these are bytes. (Note that Python 3 only allows ASCII characters inside bytes literals, so there you would have to write the pattern as b'\xc3\xa8' rather than b'è'.)
>>> data.find(b'è') # in Python 2.x this works too (unfortunately, because it has led to a lot of confusion): data.find('è')
3
>>> bytes_to_read = 3
>>> pos = data.find(b'è')
>>> data[pos:pos+bytes_to_read] # when you use the syntax 'n:m', it will read bytes in a bytestring
'\xc3\xa8v'
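For reference, the same lookup as a Python 3 sketch; slicing a bytes object likewise returns bytes:
data = 'élève\n'.encode('utf-8')
pos = data.find(b'\xc3\xa8')  # the UTF-8 bytes of 'è'
print(pos)                    # 3
print(data[pos:pos + 3])      # b'\xc3\xa8v'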
I am trying to read big-endian data from a file using NumPy's fromfile function.
According to the docs I figured that:
">u2" - big-endian unsigned word
"<u2" - little-endian unsigned word
I made a test file to check this:
$ echo -ne '\xfe\xdc\xba\x98\x76\x54\x32\x10' > file
However, I now get the opposite result of what I expected.
For example:
from numpy import *
import sys
print sys.byteorder
with open('file', 'rb') as fh:
    a = fromfile(fh, dtype='>u2', count=2, sep='')
print a
for i in a:
    print hex(i)
gives output:
little
[65244 47768]
0xfedc
0xba98
showing that I am on a little-endian system (the first line of output). However, I am trying to read the data as big-endian. Shouldn't I then get
0xdcfe
0x98ba
?
Actually, you should not. Let's look at the hexdump of the file:
$ hexdump -C file
00000000 fe dc ba 98 76 54 32 10
Then look at the byte-order diagram in the Wikipedia article on endianness and you'll see that your output is correct: '>u2' means the first byte in the file is the most significant, so the bytes fe dc are read as the value 0xfedc. hex() then prints that value most-significant digit first, regardless of how it was stored in memory.
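A minimal sketch contrasting the two dtypes on an in-memory buffer instead of a file:
import numpy as np
raw = b'\xfe\xdc'
print(hex(np.frombuffer(raw, dtype='>u2')[0]))  # 0xfedc: first byte is most significant
print(hex(np.frombuffer(raw, dtype='<u2')[0]))  # 0xdcfe: first byte is least significant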
I'm in a little over my head on this one, so please pardon my terminology in advance.
I'm running this using Python 2.7 on Windows XP.
I found some Python code that reads a log file, does some stuff, then displays something.
What, that's not enough detail? Ok, here's a simplified version:
#!/usr/bin/python
import re
import sys
class NotSupportedTOCError(Exception):
    pass

def filter_toc_entries(lines):
    while True:
        line = lines.next()
        if re.match(r""" \s*
                .+\s+ \| (?#track)
                \s+.+\s+ \| (?#start)
                \s+.+\s+ \| (?#length)
                \s+.+\s+ \| (?#start sec)
                \s+.+\s*$ (?#end sec)
                """, line, re.X):
            lines.next()
            break
    while True:
        line = lines.next()
        m = re.match(r"""
                ^\s*
                (?P<num>\d+)
                \s*\|\s*
                (?P<start_time>[0-9:.]+)
                \s*\|\s*
                (?P<length_time>[0-9:.]+)
                \s*\|\s*
                (?P<start_sector>\d+)
                \s*\|\s*
                (?P<end_sector>\d+)
                \s*$
                """, line, re.X)
        if not m:
            break
        yield m.groupdict()

def calculate_mb_toc_numbers(eac_entries):
    eac = list(eac_entries)
    num_tracks = len(eac)
    tracknums = [int(e['num']) for e in eac]
    if range(1, num_tracks + 1) != tracknums:
        raise NotSupportedTOCError("Non-standard track number sequence: %s", tracknums)
    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
    offsets = [(int(x['start_sector']) + 150) for x in eac]
    return [1, num_tracks, leadout_offset] + offsets
f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart
The code works fine as long as the log file is "simple" text (I'm tempted to say ASCII, although that may not be precise/accurate; e.g. Notepad++ indicates it's ANSI).
However, the script doesn't work on certain log files (in these cases, Notepad++ says "UCS-2 Little Endian").
I get the following error:
Traceback (most recent call last):
File "simple.py", line 55, in <module>
mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_
toc_entries(f)))
File "simple.py", line 49, in calculate_mb_toc_numbers
leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
IndexError: list index out of range
This log works
This log breaks
I believe it's the encoding that's breaking the script because if I simply do this at a command prompt:
type ascii.log > scrubbed.log
and then run the script on scrubbed.log, the script works fine (this is actually fine for my purposes since there's no loss of important information and I'm not writing back to a file, just printing to the console).
One workaround would be to "scrub" the log file before passing it to Python (e.g. using the type pipe trick above to a temporary file and then have the script run on that), but I would like to have Python "ignore" the encoding if it's possible. I'm also not sure how to detect what type of log file the script is reading so I can act appropriately.
I'm reading this and this but my eyes are still spinning around in their head, so while that may be my longer term strategy, I'm wondering if there's an interim hack I could use.
codecs.open() will allow you to open a file using a specific encoding, and it will produce unicode strings. You can try a few, going from most likely to least likely (or the tool could just always produce UTF-16LE, but ha ha, fat chance).
Also, "Unicode In Python, Completely Demystified".
works.log appears to be encoded in ASCII:
>>> data = open('works.log', 'rb').read()
>>> all(d < '\x80' for d in data)
True
breaks.log appears to be encoded in UTF-16LE -- it starts with the 2 bytes '\xff\xfe' (the UTF-16LE byte-order mark). None of the characters in breaks.log are outside the ASCII range:
>>> data = open('breaks.log', 'rb').read()
>>> data[:2]
'\xff\xfe'
>>> udata = data.decode('utf16')
>>> all(d < u'\x80' for d in udata)
True
If these are the only two possibilities, you should be able to get away with the following hack. Change your mainline code from:
f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(
str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart
to this:
f = open(sys.argv[1], 'rb')
data = f.read()
f.close()
if data[:2] == '\xff\xfe':
    data = data.decode('utf16').encode('ascii')
# ilines is a generator which produces newline-terminated strings
ilines = (line + '\n' for line in data.splitlines())
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(ilines)))
print mb_toc_urlpart
Python 2.x expects normal strings to be ASCII (or at least single-byte). Try this:
Put this at the top of your Python source file:
from __future__ import unicode_literals
And change all the str to unicode.
[edit]
And as Ignacio Vazquez-Abrams wrote, try codecs.open() to open the input file.
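A sketch combining both suggestions with the question's code (it assumes the functions from the question are in scope and that the log really is UTF-16):
from __future__ import unicode_literals
import codecs
import sys
f = codecs.open(sys.argv[1], 'r', encoding='utf-16')
mb_toc_urlpart = "%20".join(
    unicode(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart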