I tested this:
strace python -c "fp = open('/dev/urandom', 'rb'); ans = fp.read(65600); fp.close()"
With the following partial output:
read(3, "\211^\250\202P\32\344\262\373\332\241y\226\340\16\16!<\354\250\221\261\331\242\304\375\24\36\253!\345\311"..., 65536) = 65536
read(3, "\7\220-\344\365\245\240\346\241>Z\330\266^Gy\320\275\231\30^\266\364\253\256\263\214\310\345\217\221\300"..., 4096) = 4096
There are two read syscalls, each requesting a different number of bytes.
When I repeat the same thing using the dd command,
dd if=/dev/urandom bs=65600 count=1 of=/dev/null
just one read syscall is triggered, requesting the exact number of bytes.
read(0, "P.i\246!\356o\10A\307\376\2332\365=\262r`\273\"\370\4\n!\364J\316Q1\346\26\317"..., 65600) = 65600
I have googled this without finding any explanation. Is this related to page size or any Python memory management?
Why does this happen?
I did some research on exactly why this happens.
Note: I did my tests with Python 3.5. Python 2 has a different I/O system with the same quirk for a similar reason, but this was easier to understand with the new IO system in Python 3.
As it turns out, this is due to Python's BufferedReader, not anything about the actual system calls.
You can try this code:
fp = open('/dev/urandom', 'rb')
fp = fp.detach()
ans = fp.read(65600)
fp.close()
If you try to strace this code, you will find:
read(3, "]\"\34\277V\21\223$l\361\234\16:\306V\323\266M\215\331\3bdU\265C\213\227\225pWV"..., 65600) = 65600
Our original file object was a BufferedReader:
>>> open("/dev/urandom", "rb")
<_io.BufferedReader name='/dev/urandom'>
If we call detach() on this, then we throw away the BufferedReader portion and just get the FileIO, which is what talks to the kernel. At this layer, it'll read everything at once.
So the behavior that we're looking for is in BufferedReader. We can look in Modules/_io/bufferedio.c in the Python source, specifically the function _io__Buffered_read_impl. In our case, where the file has not yet been read from until this point, we dispatch to _bufferedreader_read_generic.
Now, this is where the quirk we see comes from:
while (remaining > 0) {
    /* We want to read a whole block at the end into buffer.
       If we had readv() we could do this in one pass. */
    Py_ssize_t r = MINUS_LAST_BLOCK(self, remaining);
    if (r == 0)
        break;
    r = _bufferedreader_raw_read(self, out + written, r);
Essentially, this will read as many full "blocks" as possible directly into the output buffer. The block size is the buffer size passed to the BufferedReader constructor, whose default is chosen by the heuristic the documentation describes:
* Binary files are buffered in fixed-size chunks; the size of the buffer
is chosen using a heuristic trying to determine the underlying device's
"block size" and falling back on `io.DEFAULT_BUFFER_SIZE`.
On many systems, the buffer will typically be 4096 or 8192 bytes long.
So this code will read as much as possible without needing to start filling its buffer. This will be 65536 bytes in this case, because it's the largest multiple of 4096 bytes less than or equal to 65600. By doing this, it can read the data directly into the output and avoid filling up and emptying its own buffer, which would be slower.
Once it's done with that, there might be a bit more to read. In our case, 65600 - 65536 == 64, so it needs to read at least 64 more bytes. And yet it reads 4096! What gives? Well, the key here is that the point of a BufferedReader is to minimize the number of kernel reads we actually have to do, as each read has significant overhead in and of itself. So it simply reads another full block to fill its buffer (4096 bytes) and gives you the first 64 bytes of it.
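To spell out the arithmetic, here is a quick sketch (assuming the 4096-byte block size visible in the strace output above):
remaining = 65600
block_size = 4096                                    # assumed default block size
full_blocks = remaining - (remaining % block_size)   # 65536: read straight into the caller's buffer
leftover = remaining - full_blocks                   # 64: served from the next 4096-byte buffer fill
print(full_blocks, leftover)                         # 65536 64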
Hopefully, that makes sense in terms of explaining why it happens like this.
As a demonstration, we could try this program:
import _io
fp = _io.BufferedReader(_io.FileIO("/dev/urandom", "rb"), 30000)
ans = fp.read(65600)
fp.close()
With this, strace tells us:
read(3, "\357\202{u'\364\6R\fr\20\f~\254\372\3705\2\332JF\n\210\341\2s\365]\270\r\306B"..., 60000) = 60000
read(3, "\266_ \323\346\302}\32\334Yl\ry\215\326\222\363O\303\367\353\340\303\234\0\370Y_\3232\21\36"..., 30000) = 30000
Sure enough, this follows the same pattern: as many blocks as possible, and then one more.
dd, in a quest for high efficiency of copying lots and lots of data, would try to read up to a much larger amount at once, which is why it only uses one read. Try it with a larger set of data, and I suspect you may find multiple calls to read.
TL;DR: the BufferedReader reads as many full blocks as possible (16 * 4096 = 65536 bytes) and then one extra block of 4096 to fill its buffer.
EDIT:
The easy way to change the buffer size, as fcatho pointed out, is to change the buffering argument on open:
open(name[, mode[, buffering]])
( ... )
The optional buffering argument specifies the file’s desired buffer size: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size (in bytes). A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used.
This works on both Python 2 and Python 3.
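For example, here is a small sketch (reusing /dev/urandom and the sizes from the question); with an explicit buffer size, strace should show reads of 60000 and 30000 instead of 65536 and 4096:
# An explicit buffering value sets the BufferedReader block size.
fp = open('/dev/urandom', 'rb', buffering=30000)
ans = fp.read(65600)
fp.close()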
Related
I am currently creating a data logging function with the Raspberry Pi, and I am unsure as to whether I have found a slight bug. The code I am using is as follows:
import sys, time, os
_File = 'TemperatureData1.csv'
_newDir = '/home/pi/Documents/Temperature Data Logs'
_directoryList = os.listdir(_newDir)
os.chdir(_newDir)
# Here I am specifying the file, that I want to write to it, and that
# I want to use a buffer of 5kb
output = open(_File, 'w', 5000)
try:
    while True:
        output.write('hi\n')
        time.sleep(0.01)
except KeyboardInterrupt:
    print('Keyboard has been pressed')
    output.close()
    sys.exit(1)
What I have found is that when I periodically view the created file's properties, the file size increases in steps of the default buffer size (8192 bytes), and not the 5 kB that I specified. However, when I run the exact same program in Python 2.7.13, the buffer size changes to 5 kB as requested.
I was wondering if anyone else has experienced this and has any ideas on how to get the program working as intended on Python 3.6.3? Thanks in advance. I can work around the problem by using Python 2.7.13; it is pure curiosity that has led me to post this question.
The Python 2 definition of open is what you are using:
open(name[, mode[, buffering]])
In Python 3, the open function is a little different, in that buffering is typically passed as a keyword argument rather than a bare positional integer:
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
The docs have the following note:
buffering is an optional integer used to set the buffering policy. Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size in bytes of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:
Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
“Interactive” text files (files for which isatty() returns True) use line buffering. Other text files use the policy described above for binary files.
That special 8192 number is simply 2^13.
I would suggest trying buffering=5000.
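In code, that suggestion is simply the following sketch (reusing the question's _File variable); note that the next answer explains why text mode may still not flush in 5000-byte steps:
# Pass the desired buffer size explicitly as a keyword argument in Python 3.
output = open(_File, 'w', buffering=5000)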
I have done some more research and managed to find some reasons why setting buffering to a value greater than 1 does not set the buffer to the desired size (in bytes) in Python 3.
It seems to be because the io library uses two buffers when working with files: a text buffer and a binary buffer. In text mode, the file is flushed according to the text buffer (which does not appear to be adjustable when buffering > 1). The buffering argument instead sets the size of the binary buffer, which then feeds into the text buffer, so the argument does not behave the way the programmer expects. This is explained further in the following link:
https://bugs.python.org/issue30718
There is however a work around; you need to use open() in binary mode and not text mode, then use the io.TextIOWrapper function to write to a txt or csv file using the binary buffer. The work around is as follows:
import sys, time, os, io
_File = 'TemperatureData1.csv'
# Open or overwrite the file _File in binary mode, using a 700-byte
# binary buffer in RAM before data is saved to disk.
output = open(_File, mode='wb', buffering=700)
output = io.TextIOWrapper(output, write_through=True)
try:
    while True:
        output.write('h\n')
        time.sleep(0.01)
except KeyboardInterrupt:
    print('Keyboard has been pressed')
    output.close()
    sys.exit(1)
I'm refactoring a horrendous Python script, part of the Polycode project, that generates Lua bindings.
I am considering writing lua lines out, as they are generated, in blocks.
But my question, in generic form, is: what are the detriments/caveats of writing to a file very quickly?
Take for example:
persistent_file = open('/tmp/demo.txt', 'w')
for i in range(1000000):
    persistent_file.write(str(i)*80 + '\n')
for i in range(2000):
    persistent_file.write(str(i)*20 + '\n')
for i in range(1000000):
    persistent_file.write(str(i)*100 + '\n')
persistent_file.close()
That's just a simple way to write a lot of data to a file, basically as quickly as possible.
I don't really expect to hit any real problems doing this, but I do want to be informed, is it ever advantageous to cache up for one big write?
From the documentation on the open function:
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None) -> file object
...
buffering is an optional integer used to set the buffering policy.
Pass 0 to switch buffering off (only allowed in binary mode), 1 to select
line buffering (only usable in text mode), and an integer > 1 to indicate
the size of a fixed-size chunk buffer. When no buffering argument is
given, the default buffering policy works as follows:
Binary files are buffered in fixed-size chunks; the size of the buffer
is chosen using a heuristic trying to determine the underlying device's
"block size" and falling back on io.DEFAULT_BUFFER_SIZE.
On many systems, the buffer will typically be 4096 or 8192 bytes long.
"Interactive" text files (files for which isatty() returns True)
use line buffering. Other text files use the policy described above
for binary files.
In other words, for the most part, the only overhead that you will hit on frequent calls to write() is the overhead of a function call.
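You can see that buffering at work with a rough sketch like this (the path and sizes are arbitrary): the on-disk size lags behind what has been "written" until the buffer fills or the file is closed.
import os

path = '/tmp/buffer_demo.txt'        # arbitrary example path
with open(path, 'w') as f:
    for i in range(200):
        f.write('x' * 10 + '\n')     # 11 bytes per call, 2200 bytes total
    # Nothing has necessarily reached the disk yet: the data is sitting in
    # Python's user-space buffer until it fills up or the file is flushed/closed.
    print('size before close:', os.path.getsize(path))
print('size after close:', os.path.getsize(path))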
buffersize = 50000
infile = open('in.jpg', 'rb')
outfile = open('out.jpg', 'wb')
buffer = infile.read(buffersize)
while len(buffer):
    outfile.write(buffer)
    buffer = infile.read(buffersize)
I am learning the basics of reading/writing binary files in Python, and am trying to understand this code.
I'd appreciate any help on understanding this code.
Thank you!
Q1: Is 50000 in buffersize equivalent to 50kb? (in.jpg is about 150kb)
Q2: How is the next increment of data (i.e. the next 50,000 bytes) read from the input file?
(The first 50,000 bytes are read and stored before the while loop, then written into the output file;
how are the next 50,000 bytes read without any incrementing of a range?)
Q3: len(buffer) means the size of buffer (the file object). When does this become false in the while loop?
The documentation answers all your questions:
file.read([size])
Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread() more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
1: yes. The size parameter is interpreted as a number of bytes.
2: infile.read(50000) means "read (at most) 50000 bytes from infile". The second time you invoke this method, it will automatically read the next 50000 bytes from the file.
3: buffer is not the file but what you last read from the file. len(buffer) will evaluate to False when buffer is empty, i.e. when there's no more data to read from the file.
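Putting that together, here is the same copy loop spelled out a little more explicitly (a sketch, keeping the question's names):
# Each read(buffersize) picks up where the previous one left off, because the
# file object tracks its position. At EOF, read() returns an empty string
# (b'' in binary mode), so len(buffer) becomes 0 and the loop ends.
buffersize = 50000
with open('in.jpg', 'rb') as infile, open('out.jpg', 'wb') as outfile:
    while True:
        buffer = infile.read(buffersize)
        if not buffer:
            break
        outfile.write(buffer)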
Inspired by this question, I'm wondering exactly what the optional buffering argument to Python's open() function does. From looking at the source, I see that buffering is passed into setvbuf to set the buffer size for the stream (and that it does nothing on a system without setvbuf, which the docs confirm).
However, when you iterate over a file, there is a constant called READAHEAD_BUFSIZE that appears to define how much data is read at a time (this constant is defined here).
My question is exactly how the buffering argument relates to READAHEAD_BUFSIZE. When I iterate through a file, which one defines how much data is being read off disk at a time? And is there a place in the C source that makes this clear?
READAHEAD_BUFSIZE is only used when you use the file as an iterator:
for line in fileobj:
    print line
It is a separate buffer from the normal one set by the buffering argument, which is handled by the fread() C API calls. Both are used when iterating.
From file.next():
In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer.
The OS buffer size is not changed, the setvbuf is done when the file is opened and not touched by the file iteration code. Instead, calls to Py_UniversalNewlineFread (which uses fread) are used to fill the read-ahead buffer, creating a second buffer internal to Python. Python otherwise leaves the regular buffering up to the C API calls (fread() calls are buffered; the userspace buffer is consulted by fread() to satisfy the request, Python doesn't have to do anything about that).
readahead_get_line_skip() then serves lines (newline terminated) from this buffer. If the buffer no longer contains newlines, it'll refill the buffer by recursing over itself with a buffer size 1.25 times the previous value. This means that file iteration can read the whole rest of the file into the memory buffer if there are no more newline characters in the whole file!
To see how much the buffer reads, print the file position (using fileobj.tell()) while looping:
>>> with open('test.txt') as f:
... for line in f:
... print f.tell()
...
8192 # 1 times the buffer size
8192
8192
~ lines elided
18432 # + 1.25 times the buffer size
18432
18432
~ lines elided
26624 # + 1 times the buffer size; the last newline must've aligned on the buffer boundary
26624
26624
~ lines elided
36864 # + 1.25 times the buffer size
36864
36864
etc.
What bytes are actually read from the disk (provided fileobj is an actual physical file on your disk) depends not only on the interplay between the fread() buffer and the internal read-ahead buffer, but also on whether the OS itself is using buffering. It could well be that even if the file buffer is exhausted, the OS serves the read system call from its own cache instead of going to the physical disk.
After digging through the source a bit more and trying to understand more how setvbuf and fread work, I think I understand how buffering and READAHEAD_BUFSIZE relate to each other: when iterating through a file, a buffer of READAHEAD_BUFSIZE is filled on each line, but filling this buffer uses calls to fread, each of which fills a buffer of buffering bytes.
Python's read is implemented as file_read, which calls Py_UniversalNewlineFread, passing it the number of bytes to read as n. Py_UniversalNewlineFread then eventually calls fread to read n bytes.
When you iterate over a file, the function readahead_get_line_skip is what retrieves a line. This function also calls Py_UniversalNewlineFread, passing n = READAHEAD_BUFSIZE. So this eventually becomes a call to fread for READAHEAD_BUFSIZE bytes.
So now the question is, how many bytes does fread actually read from disk. If I run the following code in C, then 1024 bytes get copied into buf and 512 into buf2. (This might be obvious but never having used setvbuf before it was a useful experiment for me.)
FILE *f = fopen("test.txt", "r");
void *buf = malloc(1024);
void *buf2 = malloc(512);
setvbuf(f, buf, _IOFBF, 1024);
fread(buf2, 512, 1, f);
So, finally, this suggests to me that when iterating over a file, at least READAHEAD_BUFSIZE bytes are read from disk, but it might be more. I think that the first iteration of for line in f will read x bytes, where x is the smallest multiple of buffering that is greater than READAHEAD_BUFSIZE.
If anyone can confirm that this is what's actually going on, that would be great!
I need to import a binary file from Python -- the contents are signed 16-bit integers, big endian.
The following Stack Overflow questions suggest how to pull in several bytes at a time, but is this the way to scale up to read in a whole file?
Reading some binary file in Python
Receiving 16-bit integers in Python
I thought I would create a function like:
from numpy import *
import os
import struct

def readmyfile(filename, bytes=2, endian='>h'):
    totalBytes = os.path.getsize(filename)
    values = empty(totalBytes/bytes)
    with open(filename, 'rb') as f:
        for i in range(len(values)):
            values[i] = struct.unpack(endian, f.read(bytes))[0]
    return values

filecontents = readmyfile('filename')
But this is quite slow (the file is 165924350 bytes). Is there a better way?
Use numpy.fromfile.
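For example, a minimal sketch ('filename' stands in for the real path); big-endian signed 16-bit integers correspond to the dtype '>i2':
import numpy as np

# One call reads the whole file and interprets it as big-endian signed 16-bit ints.
values = np.fromfile('filename', dtype='>i2')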
I would read directly until EOF (which means checking for an empty string being returned), removing the need to use range() and getsize.
Alternatively, using xrange (instead of range) should improve things, especially for memory usage.
Moreover, as Falmarri suggested, reading more data at the same time would improve performance quite a lot.
That said, I would not expect miracles, also because I am not sure a list is the most efficient way to store all that amount of data.
What about using NumPy's Array, and its facilities to read/write binary files? In this link there is a section about reading raw binary files, using numpyio.fread. I believe this should be exactly what you need.
Note: personally, I have never used NumPy; however, its main raison d'etre is exactly handling of big sets of data - and this is what you are doing in your question.
You're reading and unpacking 2 bytes at a time
values[i] = struct.unpack(endian,f.read(bytes))[0]
Why don't you read like, 1024 bytes at a time?
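That might look something like this sketch (not the poster's code; the chunk size and 'filename' are placeholders):
import struct

# Unpack a whole chunk of big-endian shorts per read() instead of one value per read().
values = []
with open('filename', 'rb') as f:
    while True:
        chunk = f.read(1024)                 # 512 two-byte integers per chunk
        if not chunk:
            break
        fmt = '>%dh' % (len(chunk) // 2)     # e.g. '>512h' for a full chunk
        values.extend(struct.unpack(fmt, chunk))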
I have had the same kind of problem, although in my particular case I have had to convert a very strange binary-format file (500 MB) with interlaced blocks of 166 elements that were 3-byte signed integers; so I've also had the problem of converting from 24-bit to 32-bit signed integers, which slows things down a little.
I've resolved it using NumPy's memmap (it's just a handy way of using Python's memmap) and struct.unpack on large chunk of the file.
With this solution I'm able to convert (read, do stuff, and write to disk) the entire file in about 90 seconds (timed with time.clock()).
I could upload part of the code.
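Since that code was never posted, here is only a rough sketch of the general idea (not the poster's actual solution): memory-map the file and unpack it in large chunks.
import numpy as np
import struct

# Map the raw bytes without copying the whole file into RAM up front.
mm = np.memmap('filename', dtype=np.uint8, mode='r')    # 'filename' is a placeholder
chunk_bytes = 2 * 500000                                 # 500,000 big-endian shorts per chunk
n = len(mm) // 2 * 2                                     # whole shorts only
values = []
for start in range(0, n, chunk_bytes):
    chunk = mm[start:min(start + chunk_bytes, n)].tobytes()
    values.extend(struct.unpack('>%dh' % (len(chunk) // 2), chunk))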
I think the bottleneck you have here is twofold.
Depending on your OS and disc controller, the calls to f.read(2), with f being a biggish file, are usually efficiently buffered -- usually. In other words, the OS will read one or two sectors (with disc sectors usually several KB) off disc into memory because this is not a lot more expensive than reading 2 bytes from that file. The extra bytes are cached efficiently in memory, ready for the next call to read that file. Don't rely on that behavior -- it might be your bottleneck -- but I think there are other issues here.
I am more concerned about the one-value-at-a-time conversions to a short and the single-element assignments into numpy. These are not cached at all. You can keep all the shorts in a Python list of ints and convert the whole list to numpy when (and if) needed. You can also make a single call to struct.unpack_from to convert everything in a buffer rather than one short at a time.
Consider:
#!/usr/bin/python
import random
import os
import struct
import numpy
import ctypes

def read_wopper(filename, bytes=2, endian='>h'):
    buf_size = 1024*2
    buf = ctypes.create_string_buffer(buf_size)
    new_buf = []

    with open(filename, 'rb') as f:
        while True:
            st = f.read(buf_size)
            l = len(st)
            if l == 0:
                break
            fmt = endian[0]+str(l/bytes)+endian[1]
            new_buf += (struct.unpack_from(fmt, st))

    na = numpy.array(new_buf)
    return na

fn = 'bigintfile'

def createmyfile(filename):
    bytes = 165924350
    endian = '>h'
    f = open(filename, "wb")
    count = 0
    try:
        for i in range(0, bytes/2):
            # The first 32,767 values are [0,1,2..0x7FFF]
            # to allow testing the read values with new_buf[value<0x7FFF]
            value = count if count < 0x7FFF else random.randint(-32767, 32767)
            count += 1
            f.write(struct.pack(endian, value & 0x7FFF))
    except IOError:
        print "file error"
    finally:
        f.close()

if not os.path.exists(fn):
    print "creating file, don't count this..."
    createmyfile(fn)
else:
    read_wopper(fn)
    print "Done!"
I created a file of random signed short ints of 165,924,350 bytes (158.24 MB), which corresponds to 82,962,175 signed 2-byte shorts. With this file, I ran the read_wopper function above and it ran in:
real 0m15.846s
user 0m12.416s
sys 0m3.426s
If you don't need the shorts to be numpy, this function runs in 6 seconds. All this on OS X, Python 2.6.1 64-bit, 2.93 GHz Core i7, 8 GB RAM. If you change buf_size=1024*2 in read_wopper to buf_size=2**16, the run time is:
real 0m10.810s
user 0m10.156s
sys 0m0.651s
So your main bottleneck, I think, is the one-value-at-a-time calls to unpack, not your 2-byte reads from disc. You might want to make sure that your data files are not fragmented and, if you are using OS X, that your free disc space is not fragmented.
Edit: I posted the full code to create and then read a binary file of ints. On my iMac, I consistently get under 15 seconds to read the file of random ints. It takes about 1:23 to create, since the creation is done one short at a time.