open() manual buffer size manipulation does not work - python

I am currently creating a data logging function with the Raspberry Pi, and I am unsure as to whether I have found a slight bug. The code I am using is as follows:
import sys, time, os
_File = 'TemperatureData1.csv'
_newDir = '/home/pi/Documents/Temperature Data Logs'
_directoryList = os.listdir(_newDir)
os.chdir(_newDir)
# Here I am specifying the file, that I want to write to it, and that
# I want to use a buffer of 5kb
output = open(_File, 'w', 5000)
try:
    while True:
        output.write('hi\n')
        time.sleep(0.01)
except KeyboardInterrupt:
    print('Keyboard has been pressed')
    output.close()
    sys.exit(1)
What I have found is that when I periodically view the created file's properties, the file size increases in steps of the default buffer size (8192 bytes), not the 5 kB I specified. However, when I run the exact same program on Python 2.7.13, the buffer size changes to 5 kB as requested.
I was wondering if anyone else has experienced this and has any ideas on how to get the program working on Python 3.6.3? Thanks in advance. I can work with the problem on Python 2.7.13; it is pure curiosity that has led me to post this question.

In Python 2, open has the signature you are using:
open(name[, mode[, buffering]])
In Python 3, open is a little different: buffering is still the third parameter, but it defaults to -1 and several keyword-only options follow it:
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
The docs have the following note:
buffering is an optional integer used to set the buffering policy. Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size in bytes of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:
Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
“Interactive” text files (files for which isatty() returns True) use line buffering. Other text files use the policy described above for binary files.
That special 8192 number is simply 2^13.
I would suggest trying buffering=5000.

I have done some more research and managed to find some reasons why setting buffering to a value greater than 1 does not set the buffer to the desired size (in bytes) in Python 3 and above.
It seems to be because the io library uses two buffers when working with text files: a text buffer and a binary buffer. In text mode, the file is flushed in accordance with the text buffer (which does not seem to be adjustable when buffering > 1). Instead, the buffering argument sizes the binary buffer, which then feeds the text buffer, so buffering does not behave the way the programmer expects. This is further explained in the following link:
https://bugs.python.org/issue30718
There is however a workaround: open() the file in binary mode rather than text mode, then wrap it in an io.TextIOWrapper to write to a txt or csv file through the binary buffer. The workaround is as follows:
import sys, time, os, io
_File = 'TemperatureData1.csv'
# Open or overwrite the file _File, using a 700-byte binary buffer in
# RAM before data is saved to disk.
output = open(_File, mode='wb', buffering=700)
output = io.TextIOWrapper(output, write_through=True)
try:
    while True:
        output.write('h\n')
        time.sleep(0.01)
except KeyboardInterrupt:
    print('Keyboard has been pressed')
    output.close()
    sys.exit(1)


Python writing data in a file is sometimes delayed [duplicate]

1. How often does Python flush to a file?
2. How often does Python flush to stdout?
I'm unsure about (1).
As for (2), I believe Python flushes to stdout on every newline. But if you redirect stdout to a file, does it flush as often?
For file operations, Python uses the operating system's default buffering unless you configure it otherwise. You can specify a buffer size, unbuffered, or line buffered.
For example, the open function takes a buffer size argument.
http://docs.python.org/library/functions.html#open
"The optional buffering argument specifies the file’s desired buffer size:"
0 means unbuffered,
1 means line buffered,
any other positive value means use a buffer of (approximately) that size.
A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files.
If omitted, the system default is used.
code:
bufsize = 0
f = open('file.txt', 'w', buffering=bufsize)
(Note: in Python 3 this raises ValueError, since unbuffered mode is only allowed for binary files; use buffering=1 for line-buffered text.)
You can also force flush the buffer to a file programmatically with the flush() method.
with open('out.log', 'w+') as f:
    f.write('output is ')
    # some work
    s = 'OK.'
    f.write(s)
    f.write('\n')
    f.flush()
    # some other work
    f.write('done\n')
    f.flush()
I have found this useful when tailing an output file with tail -f.
You can also check the default buffer size by reading the io module's read-only DEFAULT_BUFFER_SIZE attribute.
import io
print(io.DEFAULT_BUFFER_SIZE)
I don't know if this applies to python as well, but I think it depends on the operating system that you are running.
On Linux for example, output to terminal flushes the buffer on a newline, whereas for output to files it only flushes when the buffer is full (by default). This is because it is more efficient to flush the buffer fewer times, and the user is less likely to notice if the output is not flushed on a newline in a file.
You might be able to auto-flush the output if that is what you need.
EDIT: I think you would auto-flush in Python this way (based on the approach from here):
import sys

# 0 means there is no buffer, so all output
# will be auto-flushed (Python 2; in Python 3,
# unbuffered mode is only allowed for binary files)
fsock = open('out.log', 'w', 0)
sys.stdout = fsock
# do whatever
fsock.close()
Here is another approach, up to the OP to choose which one he prefers.
If you include the code below in the __init__.py file before any other code, messages printed with print and any errors will no longer be logged to Ableton's Log.txt but to separate files on your disk:
import sys
path = "/Users/#username#"
errorLog = open(path + "/stderr.txt", "w", 1)
errorLog.write("---Starting Error Log---\n")
sys.stderr = errorLog
stdoutLog = open(path + "/stdout.txt", "w", 1)
stdoutLog.write("---Starting Standard Out Log---\n")
sys.stdout = stdoutLog
(for Mac, change #username# to the name of your user folder. On Windows the path to your user folder will have a different format)
When you open the files in a text editor that refreshes its content when the file on disk is changed (example for Mac: TextEdit does not but TextWrangler does), you will see the logs being updated in real-time.
Credits: this code was copied mostly from the liveAPI control surface scripts by Nathan Ramella

Why does Python split the read function into multiple syscalls?

I tested this:
strace python -c "fp = open('/dev/urandom', 'rb'); ans = fp.read(65600); fp.close()"
With the following partial output:
read(3, "\211^\250\202P\32\344\262\373\332\241y\226\340\16\16!<\354\250\221\261\331\242\304\375\24\36\253!\345\311"..., 65536) = 65536
read(3, "\7\220-\344\365\245\240\346\241>Z\330\266^Gy\320\275\231\30^\266\364\253\256\263\214\310\345\217\221\300"..., 4096) = 4096
There are two calls for read syscall with different number of requested bytes.
When I repeat the same using dd command,
dd if=/dev/urandom bs=65600 count=1 of=/dev/null
just one read syscall is triggered using the exact number of bytes requested.
read(0, "P.i\246!\356o\10A\307\376\2332\365=\262r`\273\"\370\4\n!\364J\316Q1\346\26\317"..., 65600) = 65600
I have googled this without finding any explanation. Is this related to page size or to Python's memory management?
Why does this happen?
I did some research on exactly why this happens.
Note: I did my tests with Python 3.5. Python 2 has a different I/O system with the same quirk for a similar reason, but this was easier to understand with the new IO system in Python 3.
As it turns out, this is due to Python's BufferedReader, not anything about the actual system calls.
You can try this code:
fp = open('/dev/urandom', 'rb')
fp = fp.detach()
ans = fp.read(65600)
fp.close()
If you try to strace this code, you will find:
read(3, "]\"\34\277V\21\223$l\361\234\16:\306V\323\266M\215\331\3bdU\265C\213\227\225pWV"..., 65600) = 65600
Our original file object was a BufferedReader:
>>> open("/dev/urandom", "rb")
<_io.BufferedReader name='/dev/urandom'>
If we call detach() on this, then we throw away the BufferedReader portion and just get the FileIO, which is what talks to the kernel. At this layer, it'll read everything at once.
So the behavior that we're looking for is in BufferedReader. We can look in Modules/_io/bufferedio.c in the Python source, specifically the function _io__Buffered_read_impl. In our case, where the file has not yet been read from until this point, we dispatch to _bufferedreader_read_generic.
Now, this is where the quirk we see comes from:
while (remaining > 0) {
    /* We want to read a whole block at the end into buffer.
       If we had readv() we could do this in one pass. */
    Py_ssize_t r = MINUS_LAST_BLOCK(self, remaining);
    if (r == 0)
        break;
    r = _bufferedreader_raw_read(self, out + written, r);
Essentially, this will read as many full "blocks" as possible directly into the output buffer. The block size is based on the parameter passed to the BufferedReader constructor, which has a default selected by a heuristic:
* Binary files are buffered in fixed-size chunks; the size of the buffer
is chosen using a heuristic trying to determine the underlying device's
"block size" and falling back on `io.DEFAULT_BUFFER_SIZE`.
On many systems, the buffer will typically be 4096 or 8192 bytes long.
So this code will read as much as possible without needing to start filling its buffer. This will be 65536 bytes in this case, because it's the largest multiple of 4096 bytes less than or equal to 65600. By doing this, it can read the data directly into the output and avoid filling up and emptying its own buffer, which would be slower.
Once it's done with that, there might be a bit more to read. In our case, 65600 - 65536 == 64, so it needs to read at least 64 more bytes. And yet it reads 4096! What gives? Well, the key here is that the point of a BufferedReader is to minimize the number of kernel reads we actually have to do, as each read has significant overhead in and of itself. So it simply reads another block to fill its buffer (4096 bytes) and gives you the first 64 bytes of it.
Hopefully, that makes sense in terms of explaining why it happens like this.
As a demonstration, we could try this program:
import _io
fp = _io.BufferedReader(_io.FileIO("/dev/urandom", "rb"), 30000)
ans = fp.read(65600)
fp.close()
With this, strace tells us:
read(3, "\357\202{u'\364\6R\fr\20\f~\254\372\3705\2\332JF\n\210\341\2s\365]\270\r\306B"..., 60000) = 60000
read(3, "\266_ \323\346\302}\32\334Yl\ry\215\326\222\363O\303\367\353\340\303\234\0\370Y_\3232\21\36"..., 30000) = 30000
Sure enough, this follows the same pattern: as many blocks as possible, and then one more.
dd, in a quest for high efficiency of copying lots and lots of data, would try to read up to a much larger amount at once, which is why it only uses one read. Try it with a larger set of data, and I suspect you may find multiple calls to read.
TL;DR: the BufferedReader reads as many full blocks as possible (16 * 4096 = 65536 bytes) and then one extra block of 4096 to refill its buffer.
EDIT:
The easy way to change the buffer size, as @fcatho pointed out, is to change the buffering argument to open:
open(name[, mode[, buffering]])
( ... )
The optional buffering argument specifies the file’s desired buffer size: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size (in bytes). A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used.
This works on both Python 2 and Python 3, though in Python 3 a buffering of 0 is only allowed in binary mode and 1 only selects line buffering in text mode.

Is it bad to call write() on a file object frequently with small additions?

I'm refactoring a horrendous Python script, part of the Polycode project, that generates Lua bindings.
I am considering writing the Lua lines out in blocks as they are generated.
But my question, in generic form, is: what are the detriments/caveats of writing to a file very quickly?
Take for example:
persistent_file = open('/tmp/demo.txt', 'w')
for i in range(1000000):
    persistent_file.write(str(i)*80 + '\n')
for i in range(2000):
    persistent_file.write(str(i)*20 + '\n')
for i in range(1000000):
    persistent_file.write(str(i)*100 + '\n')
persistent_file.close()
That's just a simple way to write to a file a lot, basically as quickly as possible.
I don't really expect to hit any real problems doing this, but I do want to be informed: is it ever advantageous to batch the output up for one big write?
From the documentation on the open function:
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None) -> file object
...
buffering is an optional integer used to set the buffering policy.
Pass 0 to switch buffering off (only allowed in binary mode), 1 to select
line buffering (only usable in text mode), and an integer > 1 to indicate
the size of a fixed-size chunk buffer. When no buffering argument is
given, the default buffering policy works as follows:
Binary files are buffered in fixed-size chunks; the size of the buffer
is chosen using a heuristic trying to determine the underlying device's
"block size" and falling back on io.DEFAULT_BUFFER_SIZE.
On many systems, the buffer will typically be 4096 or 8192 bytes long.
"Interactive" text files (files for which isatty() returns True)
use line buffering. Other text files use the policy described above
for binary files.
In other words, for the most part, the only overhead that you will hit on frequent calls to write() is the overhead of a function call.

Python fails to open 11gb csv in r+ mode but opens in r mode

I'm having problems with some code that loops through a bunch of .csvs and deletes the final line if there's nothing in it (i.e. files that end with the \n newline character).
My code works successfully on all files except one, which is the largest file in the directory at 11gb. The second largest file is 4.5gb.
The line it fails on is simply:
with open(path_str,"r+") as my_file:
and I get the following message:
IOError: [Errno 22] invalid mode ('r+') or filename: 'F:\\Shapefiles\\ab_premium\\processed_csvs\\a.csv'
The path_str I create using os.path.join to avoid errors, and I tried renaming the file to a.csv just to make sure there wasn't anything odd going on with the filename. This made no difference.
Even more strangely, the file is happy to open in r mode. I.e. the following code works fine:
with open(path_str,"r") as my_file:
I have tried navigating around the file in read mode, and it's happy to read characters at the start, end, and in the middle of the file.
Does anyone know of any limits on the size of file that Python can deal with or why I might be getting this error? I'm on Windows 7 64bit and have 16gb of RAM.
The default I/O stack in Python 2 is layered over CRT FILE streams. On Windows these are built on top of a POSIX emulation API that uses file descriptors (which in turn is layered over the user-mode Windows API, which is layered over the kernel-mode I/O system, which is itself a deeply layered system based on I/O request packets; the hardware is down there somewhere...). In the POSIX layer, opening a file in _O_RDWR | _O_TEXT mode (as with "r+") requires seeking to the end of the file to remove CTRL+Z, if it's present. Here's a quote from the CRT's fopen documentation:
Open in text (translated) mode. In this mode, CTRL+Z is interpreted as
an end-of-file character on input. In files opened for reading/writing
with "a+", fopen checks for a CTRL+Z at the end of the file and
removes it, if possible. This is done because using fseek and ftell to
move within a file that ends with a CTRL+Z, may cause fseek to behave
improperly near the end of the file.
The problem here is that the above check calls the 32-bit _lseek (bear in mind that sizeof long is 4 bytes on 64-bit Windows, unlike most other 64-bit platforms), instead of _lseeki64. Obviously this fails for an 11 GB file. Specifically, SetFilePointer fails because it gets called with a NULL value for lpDistanceToMoveHigh. Here's the return value and LastErrorValue for the latter call:
0:000> kc 2
Call Site
KERNELBASE!SetFilePointer
MSVCR90!lseek_nolock
0:000> r rax
rax=00000000ffffffff
0:000> dt _TEB @$teb LastErrorValue
ntdll!_TEB
   +0x068 LastErrorValue : 0x57
The error code 0x57 is ERROR_INVALID_PARAMETER. This is referring to lpDistanceToMoveHigh being NULL when trying to seek from the end of a large file.
To work around this problem with CRT FILE streams, I recommend opening the file using io.open instead. This is a backported implementation of Python 3's I/O stack. It always opens files in raw binary mode (_O_BINARY), and it implements its own buffering and text-mode layers on top of the raw layer.
>>> import io
>>> f = io.open('a.csv', 'r+')
>>> f
<_io.TextIOWrapper name='a.csv' encoding='cp1252'>
>>> f.buffer
<_io.BufferedRandom name='a.csv'>
>>> f.buffer.raw
<_io.FileIO name='a.csv' mode='rb+'>
>>> import os
>>> f.seek(0, os.SEEK_END)
11811160064L

What is the difference between the buffering argument to open() and the hardcoded readahead buffer size used when iterating through a file?

Inspired by this question, I'm wondering exactly what the optional buffering argument to Python's open() function does. From looking at the source, I see that buffering is passed into setvbuf to set the buffer size for the stream (and that it does nothing on a system without setvbuf, which the docs confirm).
However, when you iterate over a file, there is a constant called READAHEAD_BUFSIZE that appears to define how much data is read at a time (this constant is defined here).
My question is exactly how the buffering argument relates to READAHEAD_BUFSIZE. When I iterate through a file, which one defines how much data is being read off disk at a time? And is there a place in the C source that makes this clear?
READAHEAD_BUFSIZE is only used when you use the file as an iterator:
for line in fileobj:
    print line
It is a separate buffer from the normal buffer argument, which is handled by the fread C API calls. Both are used when iterating.
From file.next():
In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer.
The OS buffer size is not changed, the setvbuf is done when the file is opened and not touched by the file iteration code. Instead, calls to Py_UniversalNewlineFread (which uses fread) are used to fill the read-ahead buffer, creating a second buffer internal to Python. Python otherwise leaves the regular buffering up to the C API calls (fread() calls are buffered; the userspace buffer is consulted by fread() to satisfy the request, Python doesn't have to do anything about that).
readahead_get_line_skip() then serves lines (newline terminated) from this buffer. If the buffer no longer contains newlines, it'll refill the buffer by recursing over itself with a buffer size 1.25 times the previous value. This means that file iteration can read the whole rest of the file into the memory buffer if there are no more newline characters in the whole file!
To see how much the buffer reads, print the file position (using fileobj.tell()) while looping:
>>> with open('test.txt') as f:
...     for line in f:
...         print f.tell()
...
8192 # 1 times the buffer size
8192
8192
~ lines elided
18432 # + 1.25 times the buffer size
18432
18432
~ lines elided
26624 # + 1 times the buffer size; the last newline must've aligned on the buffer boundary
26624
26624
~ lines elided
36864 # + 1.25 times the buffer size
36864
36864
etc.
What bytes are actually read from the disk (provided fileobj is an actual physical file on your disk) depends not only on the interplay between the fread() buffer and the internal read-ahead buffer, but also on whether the OS itself is buffering. It could well be that even if the file buffer is exhausted, the OS serves the read system call from its own cache instead of going to the physical disk.
After digging through the source a bit more and trying to understand better how setvbuf and fread work, I think I understand how buffering and READAHEAD_BUFSIZE relate to each other: when iterating through a file, a buffer of READAHEAD_BUFSIZE bytes is filled as lines are consumed, and filling this buffer uses calls to fread, each of which fills a buffer of buffering bytes.
Python's read is implemented as file_read, which calls Py_UniversalNewlineFread, passing it the number of bytes to read as n. Py_UniversalNewlineFread then eventually calls fread to read n bytes.
When you iterate over a file, the function readahead_get_line_skip is what retrieves a line. This function also calls Py_UniversalNewlineFread, passing n = READAHEAD_BUFSIZE. So this eventually becomes a call to fread for READAHEAD_BUFSIZE bytes.
So now the question is: how many bytes does fread actually read from disk? If I run the following code in C, then 1024 bytes get copied into buf and 512 into buf2. (This might be obvious, but never having used setvbuf before, it was a useful experiment for me.)
FILE *f = fopen("test.txt", "r");
void *buf = malloc(1024);
void *buf2 = malloc(512);
setvbuf(f, buf, _IOFBF, 1024);
fread(buf2, 512, 1, f);
So, finally, this suggests to me that when iterating over a file, at least READAHEAD_BUFSIZE bytes are read from disk, but it might be more. I think that the first iteration of for line in f will read x bytes, where x is the smallest multiple of buffering that is greater than READAHEAD_BUFSIZE.
If anyone can confirm that this is what's actually going on, that would be great!
