Why am I only writing 28,672 bytes to this file? - python

I have been working on a project where it is necessary to program a binary file, of a certain kind, onto an AT28C256 chip. The specifics are not important beyond the fact that the file needs to be exactly 32,768 bytes in size.
I have some "minimal problem" code here:
o = open("images.bin", "wb")
c = 0
for i in range(256):
    for j in range(128):
        c += 1
        o.write(chr(0).encode('utf-8'))
print(c)
This, to me, would appear to write 32,768 bytes to a file (the split into i,j is necessary because I need to write an image to the device) as 128*256 = 32768. And the output of c is 32768!
But the file it creates is only 28,672 bytes long! The fact that this is 0x7000 in hex has not passed me by, but I'm not sure why this is happening. Any ideas?

You should call o.close() to flush the write buffer and close the file properly. The missing 4,096 bytes (32,768 - 28,672) are simply sitting in the unflushed write buffer when the script exits.
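An even safer fix is a with block, which flushes and closes the file even if the loop raises. A minimal sketch of the same program:

c = 0
with open("images.bin", "wb") as o:  # closed (and flushed) automatically
    for i in range(256):
        for j in range(128):
            c += 1
            o.write(b"\x00")  # one zero byte, same as chr(0).encode('utf-8')
print(c)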

Related

Reading a python binary file with a C# BinaryReader

I need to export some data like integers, floats etc. to a binary file with Python. Afterwards, I have to read the file with C# again, but it doesn't work for me.
I tried several ways of writing a binary file with python and it works as long as I read it with python as well:
a = 3
b = 5
with open('test.tcd', 'wb') as file:
    file.write(bytes(a))
    file.write(bytes(b))
or writing it like this:
import pickle as p
with open('test.tcd', 'wb') as file:
    p.dump([a, b], file)
Currently I am reading the file in C# like this:
static void LoadFile(String path)
{
    BinaryReader br = new BinaryReader(new FileStream(path, FileMode.Open));
    int a = br.ReadInt32();
    int b = br.ReadInt32();
    System.Diagnostics.Debug.WriteLine(a);
    System.Diagnostics.Debug.WriteLine(b);
    br.Close();
}
Unfortunately the output isn't 3 and 5; instead my output is just zero. How do I read or write the binary file properly?
In Python, you have to explicitly pack each integer into 4 bytes. Read more here: struct.pack
import struct

a = 3
b = 5
with open('test.tcd', 'wb') as file:
    file.write(struct.pack("<i", a))
    file.write(struct.pack("<i", b))
Your C# code should work now.
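If you want to double-check the file from the Python side first, here is a small sketch (assuming the same test.tcd) that reads it back the way C#'s BinaryReader.ReadInt32 does, as little-endian 32-bit integers:

import struct

with open('test.tcd', 'rb') as f:
    # two little-endian 32-bit ints, mirroring BinaryReader.ReadInt32
    a, b = struct.unpack('<ii', f.read(8))
print(a, b)  # expected: 3 5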
It's possible Python is not writing data in the same format that C# expects, so you may need to swap byte endianness. You could read the raw bytes in C# and run them through BitConverter to see if that fixes it.
Another option is to specify the endianness explicitly in Python; note that C#'s BinaryReader reads little-endian.
an_int = 5
a_bytes_big = an_int.to_bytes(2, 'big')
print(a_bytes_big)
Output
b'\x00\x05'
a_bytes_little = an_int.to_bytes(2, 'little')
print(a_bytes_little)
Output
b'\x05\x00'

Why does Python split the read function into multiple syscalls?

I tested this:
strace python -c "fp = open('/dev/urandom', 'rb'); ans = fp.read(65600); fp.close()"
With the following partial output:
read(3, "\211^\250\202P\32\344\262\373\332\241y\226\340\16\16!<\354\250\221\261\331\242\304\375\24\36\253!\345\311"..., 65536) = 65536
read(3, "\7\220-\344\365\245\240\346\241>Z\330\266^Gy\320\275\231\30^\266\364\253\256\263\214\310\345\217\221\300"..., 4096) = 4096
There are two calls to the read syscall, each with a different number of requested bytes.
When I repeat the same using dd command,
dd if=/dev/urandom bs=65600 count=1 of=/dev/null
just one read syscall is triggered using the exact number of bytes requested.
read(0, "P.i\246!\356o\10A\307\376\2332\365=\262r`\273\"\370\4\n!\364J\316Q1\346\26\317"..., 65600) = 65600
I have googled this without finding any explanation. Is this related to page size or to Python memory management?
Why does this happen?
I did some research on exactly why this happens.
Note: I did my tests with Python 3.5. Python 2 has a different I/O system with the same quirk for a similar reason, but this was easier to understand with the new IO system in Python 3.
As it turns out, this is due to Python's BufferedReader, not anything about the actual system calls.
You can try this code:
fp = open('/dev/urandom', 'rb')
fp = fp.detach()
ans = fp.read(65600)
fp.close()
If you try to strace this code, you will find:
read(3, "]\"\34\277V\21\223$l\361\234\16:\306V\323\266M\215\331\3bdU\265C\213\227\225pWV"..., 65600) = 65600
Our original file object was a BufferedReader:
>>> open("/dev/urandom", "rb")
<_io.BufferedReader name='/dev/urandom'>
If we call detach() on this, then we throw away the BufferedReader portion and just get the FileIO, which is what talks to the kernel. At this layer, it'll read everything at once.
So the behavior that we're looking for is in BufferedReader. We can look in Modules/_io/bufferedio.c in the Python source, specifically the function _io__Buffered_read_impl. In our case, where the file has not yet been read from until this point, we dispatch to _bufferedreader_read_generic.
Now, this is where the quirk we see comes from:
while (remaining > 0) {
    /* We want to read a whole block at the end into buffer.
       If we had readv() we could do this in one pass. */
    Py_ssize_t r = MINUS_LAST_BLOCK(self, remaining);
    if (r == 0)
        break;
    r = _bufferedreader_raw_read(self, out + written, r);
Essentially, this will read as many full "blocks" as possible directly into the output buffer. The block size is based on the parameter passed to the BufferedReader constructor, which has a default selected by a few parameters:
* Binary files are buffered in fixed-size chunks; the size of the buffer
is chosen using a heuristic trying to determine the underlying device's
"block size" and falling back on `io.DEFAULT_BUFFER_SIZE`.
On many systems, the buffer will typically be 4096 or 8192 bytes long.
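If you're curious what those numbers are on your own machine, here is a Unix-specific sketch; the heuristic consults st_blksize from fstat, with io.DEFAULT_BUFFER_SIZE as the fallback:

import io
import os

print(io.DEFAULT_BUFFER_SIZE)  # the fallback value, commonly 8192
with open('/dev/urandom', 'rb') as fp:
    # the block size reported by the underlying device, typically 4096 here
    print(os.fstat(fp.fileno()).st_blksize)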
So the C loop above will read as many full blocks as possible directly into the output, without needing to start filling the reader's internal buffer. That comes to 65536 bytes in this case, the largest multiple of 4096 less than or equal to 65600. Reading directly into the output avoids filling up and emptying the buffer, which would be slower.
Once it's done with that, there might be a bit more to read. In our case, 65600 - 65536 == 64, so it needs to read at least 64 more bytes. Yet it reads 4096! What gives? Well, the key here is that the point of a BufferedReader is to minimize the number of kernel reads we actually have to do, as each read has significant overhead in and of itself. So it simply reads another block to fill its buffer (4096 bytes) and gives you the first 64 bytes of it.
Hopefully, that makes sense in terms of explaining why it happens like this.
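To make the arithmetic concrete, here is a small illustrative model of how the two read sizes fall out of the request size and the block size (my own sketch, not CPython code):

# Illustrative model of BufferedReader's read pattern, not the real code.
def expected_reads(request, block=4096):
    full = request - (request % block)   # whole blocks, read directly
    reads = [full] if full else []
    if request > full:
        reads.append(block)              # one more block to refill the buffer
    return reads

print(expected_reads(65600))             # [65536, 4096], as in the strace above
print(expected_reads(65600, 30000))      # [60000, 30000], as in the demo below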
As a demonstration, we could try this program:
import _io
fp = _io.BufferedReader(_io.FileIO("/dev/urandom", "rb"), 30000)
ans = fp.read(65600)
fp.close()
With this, strace tells us:
read(3, "\357\202{u'\364\6R\fr\20\f~\254\372\3705\2\332JF\n\210\341\2s\365]\270\r\306B"..., 60000) = 60000
read(3, "\266_ \323\346\302}\32\334Yl\ry\215\326\222\363O\303\367\353\340\303\234\0\370Y_\3232\21\36"..., 30000) = 30000
Sure enough, this follows the same pattern: as many blocks as possible, and then one more.
dd, in a quest for high efficiency of copying lots and lots of data, would try to read up to a much larger amount at once, which is why it only uses one read. Try it with a larger set of data, and I suspect you may find multiple calls to read.
TL;DR: the BufferedReader reads as many full blocks as possible (16 * 4096 = 65536 bytes) and then one extra block of 4096 to fill its buffer.
EDIT:
The easy way to change the buffer size, as @fcatho pointed out, is to change the buffering argument on open:
open(name[, mode[, buffering]])
( ... )
The optional buffering argument specifies the file’s desired buffer size: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size (in bytes). A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used.
This works on both Python 2 and Python 3.
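Putting that together, a buffer at least as large as the request should collapse everything into a single syscall. A sketch you could verify under strace yourself (buffering is passed positionally so it works on Python 2 and 3):

fp = open('/dev/urandom', 'rb', 65600)  # buffering = 65600
ans = fp.read(65600)                     # should show one read of 65600 bytes
fp.close()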

python error on struct.unpack

I am new to python and I am trying to use unpack like this:
data = f.read(4)
AAA=len(data)
BBB=struct.calcsize(cformat)
print AAA
print BBB
value = struct.unpack(cformat, data)
return value[0]
This runs fine as long as AAA == BBB, but sometimes f.read only reads 3 bytes and then I get an error. The actual value in the file that I am trying to read is 26. It reads all of the values from 1 to 221 except for 26, where it errors because f.read(size) only reads three bytes.
Assuming the question is "How should I read a 26 without an error?"
First check the arguments to the open() that produces f. Under Windows, unless you open a file in binary mode (f = open(filename, "rb")), Python assumes that the file is a text file. Windows treats byte value 26 (Ctrl+Z) in a text file as an end-of-file marker, a quirk that it inherited from CP/M.
You have opened a binary file in text mode, and you are using an operating system where the distinction matters. Try adding b to the mode parameter when you open the file:
f = open("my_input_file.bin", "rb")

Reading non-text files into Python

I want to read in a non-text file. It has the extension ".map" but can be opened by Notepad. How should I open this file in Python?
file = open("path-to-file","r") doesn't work for me. It returns No such file or directory: error.
Here's what my file looks like:
111 + gi|89106884|ref|AC_000091.1| 725803 TCGAGATCGACCATGTTGCCCGCCT IIIIIIIIIIIIIIIIIIIIIIIII 0 14:A>G
457 + gi|89106884|ref|AC_000091.1| 32629 CCGTGTCCACCGACTACGACACCTC IIIIIIIIIIIIIIIIIIIIIIIII 0 4:C>G,22:T>C
779 + gi|89106884|ref|AC_000091.1| 483582 GATCACCCACGCAAAGATGGGGCGA IIIIIIIIIIIIIIIIIIIIIIIII 0 15:A>G,18:C>G
784 + gi|89106884|ref|AC_000091.1| 226200 ACCGATAGTGAACCAGTACCGTGAG IIIIIIIIIIIIIIIIIIIIIIIII 1
If I do the following:
file = open("D:\bowtie-0.12.7-win32\bowtie-0.12.7\output_635\results_NC_000117.fna.1.ebwt.map","rb")
It still gives me the No such file or directory: 'D:\x08owtie-0.12.7-win32\x08owtie-0.12.7\\output_635\results_NC_000117.fna.1.ebwt.map' error. Is this because the file isn't binary, or because I don't have the right permissions?
Would appreciate help with this!
Binary files should use a binary mode.
f = open("path-to-file","rb")
But that won't help if you don't have the appropriate permissions or don't know the format of the file itself.
EDIT:
Look closely at the error message: the filename it reports is not the one you expected. In an ordinary Python string literal, \b is an escape sequence for the backspace character (\x08), which is why "D:\bowtie" turned into 'D:\x08owtie'. Either double the backslashes or use a raw string:
f = open("D:\\bowtie-0.12.7-win32\\bowtie-0.12.7\\output_635\\results_NC_000117.fna.1.ebwt.map","rb")
f = open(r"D:\bowtie-0.12.7-win32\bowtie-0.12.7\output_635\results_NC_000117.fna.1.ebwt.map","rb")
You have hit upon a minor difference between Unix and Windows here.
Since you mentioned Notepad, you must be running this on Windows. In DOS/Windows land, opening a binary file requires specifying attribute 'b' for binary, as others have already indicated. Unix/Linux are a bit more relaxed about this. Omitting attribute 'b' will still open a binary file.
The same behavior is exhibited in the C library's fopen() call.
If it's a non-text file, you could try opening it in binary mode. Try this:
with open("path-to-file", "rb") as f:
    byte = f.read(1)
    while byte != "":
        # Do stuff with byte.
        byte = f.read(1)
The with statement handles opening and closing the file, including if an exception is raised in the inner block.
Of course, since the format is binary you need to know what to do with the data after you read it. Also, here I read 1 byte at a time; you can use bigger chunk sizes too.
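For example, a chunked version of the same loop might look like this (4096 is an arbitrary choice):

with open("path-to-file", "rb") as f:
    while True:
        chunk = f.read(4096)  # read up to 4 KB at a time
        if not chunk:         # empty result means end of file
            break
        # Do stuff with chunk.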
UPDATE: Maybe this is not a binary file. You might be having problems with the file encoding: the characters might not be ASCII, or they might belong to a Unicode charset. Try this:
import codecs
f = codecs.open(u'path-to-file','r','utf-8')
print f.read()
f.close()
If you print this out in the terminal, you might still get gibberish since the terminal might not support this charset. I would advise going ahead and processing the text, assuming it opened properly.

Random Loss of precision in Python ReadLine()

We have a process which takes a very large csv (1.6GB) and breaks it down into pieces (in this case 3). This runs nightly and normally doesn't give us any problems. When it ran last night, however, the first of the output files had lost precision on the numeric fields in the data. The active lines in the script are:
while lineCounter <= chunk:
    oOutFile.write(oInFile.readline())
    lineCounter = lineCounter + 1
and the normal output might be something like
StringField1; StringField2; StringField3; StringField4; 1000000; StringField5; 0.000054454
etc.
On this one occasion and in this one output file the numeric fields were all output with 6 zeros at the end i.e.
StringField1; StringField2; StringField3; StringField4; 1000000.000000; StringField5; 0.000000
We are using Python v2.6 (and don't want to upgrade unless we really have to), but we can't afford to lose this data. Does anyone have any idea why this might have happened? If readline is doing some kind of implicit conversion, is there a way to do a binary read? We really just want this data to pass through untouched.
It is very weird to us that this only affected one of the output files generated by the same script, and that when it was rerun the output was as expected.
thanks
Jack
(the line-counting method referenced in the thread below)
def count_lines(filename):
    f = open(filename)
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read  # loop optimization
    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)
    return lines
.readline() doesn't do anything with the content of the line, certainly not with numbers, so it's definitely not the culprit.
Thanks for giving more info, but this still looks very mysterious to me as neither function should be causing such a change. You didn't open the output in Excel, by any chance? Sometimes Excel does weird things and interprets stuff in an unexpected way. Grasping at straws here...
(As an aside, I don't see the big optimization potential in read_f = f.read :))
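If you want to rule out any text-mode translation entirely, here is a minimal sketch of the split loop with both files opened in binary mode (the file names and the chunk count are made up for illustration):

chunk = 500000  # hypothetical number of lines per output piece
oInFile = open('big.csv', 'rb')     # binary mode: bytes pass through untouched
oOutFile = open('piece1.csv', 'wb')
lineCounter = 1
while lineCounter <= chunk:
    line = oInFile.readline()
    if not line:  # stop early at end of file
        break
    oOutFile.write(line)
    lineCounter = lineCounter + 1
oOutFile.close()
oInFile.close()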
