Python script that writes results to a txt file - why the lag?

I'm using Windows 7 and I have a super-simple script that goes over a directory of images, checking a specified condition for each image (in my case, whether there's a face in the image, using dlib), while writing the paths of images that fulfilled the condition to a text file:
def process_dir(dir_path):
    i = 0
    with open(txt_output, 'a') as f:
        for filename in os.listdir(dir_path):
            # loading image to check whether dlib detects a face:
            image_path = os.path.join(dir_path, filename)
            opencv_img = cv2.imread(image_path)
            dets = detector(opencv_img, 1)
            if len(dets) > 0:
                f.write(image_path)
                f.write('\n')
                i = i + 1
                print i
Now the following thing happens: there seems to be a significant lag in appending lines to the file. For example, I can see the script has "finished" checking several images (i.e., the console prints ~20, meaning 20 files that fulfill the condition have been found), but the .txt file is still empty. At first I thought there was a problem with my script, but after waiting a while I saw that the paths were in fact added to the file, only it seems to be updated in "batches".
This may not seem like the most crucial issue (and it's definitely not), but still I'm wondering - what explains this behavior? As far as I understand, every time the f.write(image_path) line is executed the file is changed - then why do I see the update with a lag?

Data written to a file object won't necessarily show up on disk immediately.
In the interests of efficiency, most operating systems will buffer the writes, meaning that data is only written out to disk when a certain amount has accumulated (usually 4K).
If you want to write your data right now, use the flush() function, as others have said.
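For example, a minimal sketch using the same f as in the question:

    f.write(image_path + '\n')
    f.flush()  # force the buffered data out to the OS immediately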

Did you try opening the file with a buffer size of 0, i.e. open(txt_output, 'a', 0)?

I'm not sure about Windows (please, someone correct me here if I'm wrong), but I believe this is because of how the write buffer is handled. Although you are requesting a write, the buffer only writes every so often (when the buffer is full), and when the file is closed. You can open the file with a smaller buffer:
with open(txt_output, 'a', 0) as f:
or manually flush it at the end of the loop:
if len(dets) > 0:
    f.write(image_path)
    f.write('\n')
    f.flush()
    i = i + 1
    print i
I would personally recommend flushing manually when you need to.

It sounds like you're running into file stream buffering.
In short, writing to a file is a very slow process (relative to other sorts of things that the processor does). Modifying the hard disk is about the slowest thing you can do, other than maybe printing to the screen.
Because of this, most file I/O libraries will "buffer" your output, meaning that as you write to the file the library will save your data in an in-memory buffer instead of modifying the hard disk right away. Only when the buffer fills up will it "flush" the buffer (write the data to disk), after which point it starts filling the buffer again. This often reduces the number of actual write operations by quite a lot.
To answer your question, the first question to answer is: do you really need to append to the file immediately every time you find a face? It will probably slow down your processing by a noticeable amount, especially if you're processing a large number of files.
If you really do need to update immediately, you basically have two options:
Manually flush the write buffer each time you write to the file. In Python, this usually means calling f.flush(), as @JamieCounsell pointed out.
Tell Python to just not use a buffer, or more accurately to use a buffer of size 0. As @VikasMadhusudana pointed out, you can tell Python how big of a buffer to use with a third argument to open(): open(txt_output, 'a', 0) for a 0-byte buffer.
Again, you probably don't need this; the only case I can think that might require this sort of thing is if you have some other external operation that's watching the file and triggers off of new data being added to it.
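If fully unbuffered output feels like overkill, a line-buffered file is a middle ground; a minimal sketch (buffering=1 means line-buffered in text mode, so every newline triggers a flush):

    with open(txt_output, 'a', 1) as f:  # third argument 1 = line buffering
        f.write(image_path)
        f.write('\n')  # the newline causes this line to be flushed to the OS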
Hope that helps!

It's flush related, try:
print(image_path, file=f) # Python 3
or
print >>f, image_path # Python 2
instead of:
f.write(image_path)
f.write('\n')
print can flush for you: in Python 3.3+ pass flush=True (by default, print does not flush).
Another good thing about print is that it gives you the newline for free.
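A minimal sketch of the Python 3 form with an explicit flush, using the same f and image_path as in the question:

    print(image_path, file=f, flush=True)  # writes the line, adds '\n', and flushes the buffer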

Related

PYTHON: distinguishing between end of file and blank line in file.readline() [duplicate]

I want to iterate over each line of an entire file. One way to do this is by reading the entire file, saving it to a list, then going over the line of interest. This method uses a lot of memory, so I am looking for an alternative.
My code so far:
for each_line in fileinput.input(input_file):
do_something(each_line)
for each_line_again in fileinput.input(input_file):
do_something(each_line_again)
Executing this code gives an error message: device active.
Any suggestions?
The purpose is to calculate pair-wise string similarity, meaning for each line in file, I want to calculate the Levenshtein distance with every other line.
Nov. 2022 Edit: A related question that was asked 8 months after this question has many useful answers and comments. To get a deeper understanding of the Python behavior, also read the related question How should I read a file line-by-line in Python?
The correct, fully Pythonic way to read a file is the following:
with open(...) as f:
    for line in f:
        # Do something with 'line'
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered I/O and memory management so you don't have to worry about large files.
There should be one -- and preferably only one -- obvious way to do it.
Two memory efficient ways in ranked order (first is best) -
use of with - supported from python 2.5 and above
use of yield if you really want to have control over how much to read
1. use of with
with is the nice and efficient Pythonic way to read large files. Advantages: 1) the file object is automatically closed after exiting the with block; 2) this holds even if an exception is raised inside the with block; 3) the for loop iterates through the file object f line by line, and internally it does buffered I/O (to optimize costly I/O operations) and memory management.
with open("x.txt") as f:
for line in f:
do something with data
2. use of yield
Sometimes one might want more fine-grained control over how much to read in each iteration. In that case, use iter & yield. Note that with this method one explicitly needs to close the file at the end.
def readInChunks(fileObj, chunkSize=2048):
    """
    Lazy function to read a file piece by piece.
    Default chunk size: 2kB.
    """
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

f = open('bigFile')
for chunk in readInChunks(f):
    do_something(chunk)
f.close()
Pitfalls, and for the sake of completeness: the methods below are not as good or as elegant for reading large files, but please read them to get a rounded understanding.
In Python, the most common way to read lines from a file is to do the following:
for line in open('myfile','r').readlines():
    do_something(line)
When this is done, however, the readlines() function (the same applies to the read() function) loads the entire file into memory, then iterates over it. A slightly better approach for large files (the two methods mentioned first above are the best) is to use the fileinput module, as follows:
import fileinput

for line in fileinput.input(['myfile']):
    do_something(line)
The fileinput.input() call reads lines sequentially but doesn't keep them in memory after they've been read; this works because a file in Python is iterable.
References
Python with statement
To strip newlines:
with open(file_path, 'rU') as f:
    for line_terminated in f:
        line = line_terminated.rstrip('\n')
        ...
With universal newline support all text file lines will seem to be terminated with '\n', whatever the terminators in the file, '\r', '\n', or '\r\n'.
EDIT - To specify universal newline support:
Python 2 on Unix - open(file_path, mode='rU') - required [thanks @Dave]
Python 2 on Windows - open(file_path, mode='rU') - optional
Python 3 - open(file_path, newline=None) - optional
The newline parameter is only supported in Python 3 and defaults to None. The mode parameter defaults to 'r' in all cases. The U is deprecated in Python 3. In Python 2 on Windows some other mechanism appears to translate \r\n to \n.
Docs:
open() for Python 2
open() for Python 3
To preserve native line terminators:
with open(file_path, 'rb') as f:
    for line_native_terminated in f:
        ...
Binary mode can still parse the file into lines with in. Each line will have whatever terminators it has in the file.
Thanks to @katrielalex's answer, Python's open() doc, and iPython experiments.
This is a possible way of reading a file in Python:
f = open(input_file)
for line in f:
    do_stuff(line)
f.close()
It does not allocate a full list; it iterates over the lines.
Some context up front as to where I am coming from. Code snippets are at the end.
When I can, I prefer to use an open source tool like H2O to do super high performance parallel CSV file reads, but this tool is limited in feature set. I end up writing a lot of code to create data science pipelines before feeding to H2O cluster for the supervised learning proper.
I have been reading files like 8GB HIGGS dataset from UCI repo and even 40GB CSV files for data science purposes significantly faster by adding lots of parallelism with the multiprocessing library's pool object and map function. For example clustering with nearest neighbor searches and also DBSCAN and Markov clustering algorithms requires some parallel programming finesse to bypass some seriously challenging memory and wall clock time problems.
I usually like to break the file row-wise into parts using gnu tools first and then glob-filemask them all to find and read them in parallel in the python program. I use something like 1000+ partial files commonly. Doing these tricks helps immensely with processing speed and memory limits.
The pandas dataframe.read_csv is single threaded, so you can do these tricks to make pandas considerably faster by running a map() for parallel execution. You can use htop to see that with plain old sequential pandas dataframe.read_csv, 100% CPU on just one core is the actual bottleneck in pd.read_csv, not the disk at all.
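A minimal sketch of that split-then-read-in-parallel pattern (the file names, chunk size, and the assumption that the CSV has no header row are made up for illustration; assumes pandas and GNU split are available):

    # First split the big CSV row-wise with GNU tools, e.g.:
    #   split -l 1000000 big.csv part_      # ~1M rows per partial file

    import glob
    from multiprocessing import Pool
    import pandas as pd

    def read_part(path):
        # each worker runs the single-threaded pd.read_csv on one partial file
        return pd.read_csv(path, header=None)

    if __name__ == '__main__':
        paths = sorted(glob.glob('part_*'))       # glob-filemask the partial files
        with Pool() as pool:
            frames = pool.map(read_part, paths)   # parallel reads, one per worker
        df = pd.concat(frames, ignore_index=True)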
I should add I'm using an SSD on fast video card bus, not a spinning HD on SATA6 bus, plus 16 CPU cores.
Also, another technique that I discovered works great in some applications is parallel CSV file reads all within one giant file, starting each worker at different offset into the file, rather than pre-splitting one big file into many part files. Use python's file seek() and tell() in each parallel worker to read the big text file in strips, at different byte offset start-byte and end-byte locations in the big file, all at the same time concurrently. You can do a regex findall on the bytes, and return the count of linefeeds. This is a partial sum. Finally sum up the partial sums to get the global sum when the map function returns after the workers finished.
Following is some example benchmarks using the parallel byte offset trick:
I use 2 files: HIGGS.csv is 8 GB. It is from the UCI machine learning repository. all_bin.csv is 40.4 GB and is from my current project.
I use 2 programs: GNU wc program which comes with Linux, and the pure python fastread.py program which I developed.
HP-Z820:/mnt/fastssd/fast_file_reader$ ls -l /mnt/fastssd/nzv/HIGGS.csv
-rw-rw-r-- 1 8035497980 Jan 24 16:00 /mnt/fastssd/nzv/HIGGS.csv
HP-Z820:/mnt/fastssd$ ls -l all_bin.csv
-rw-rw-r-- 1 40412077758 Feb 2 09:00 all_bin.csv
ga#ga-HP-Z820:/mnt/fastssd$ time python fastread.py --fileName="all_bin.csv" --numProcesses=32 --balanceFactor=2
2367496
real 0m8.920s
user 1m30.056s
sys 2m38.744s
In [1]: 40412077758. / 8.92
Out[1]: 4530501990.807175
That’s some 4.5 GB/s, or about 36 Gb/s, file slurping speed. That ain’t no spinning hard disk, my friend. That’s actually a Samsung Pro 950 SSD.
Below is the speed benchmark for the same file being line-counted by gnu wc, a pure C compiled program.
What is cool is that you can see my pure Python program essentially matched the speed of the gnu wc compiled C program in this case. Python is interpreted but C is compiled, so this is a pretty interesting feat of speed, I think you would agree. Of course, wc really needs to be changed to a parallel program, and then it would really beat the socks off my Python program. But as it stands today, gnu wc is just a sequential program. You do what you can, and Python can do parallel today. Cython compiling might be able to help me (for some other time). Also, memory-mapped files were not explored yet.
HP-Z820:/mnt/fastssd$ time wc -l all_bin.csv
2367496 all_bin.csv
real 0m8.807s
user 0m1.168s
sys 0m7.636s
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000
real 0m2.257s
user 0m12.088s
sys 0m20.512s
HP-Z820:/mnt/fastssd/fast_file_reader$ time wc -l HIGGS.csv
11000000 HIGGS.csv
real 0m1.820s
user 0m0.364s
sys 0m1.456s
Conclusion: The speed is good for a pure Python program compared to a C program. However, it’s not good enough to use the pure Python program over the C program, at least for the line-counting purpose. Generally the technique can be used for other file processing, so this Python code is still good.
Question: Does compiling the regex just one time and passing it to all workers improve speed? Answer: Regex pre-compiling does NOT help in this application. I suppose the reason is that the overhead of process serialization and creation for all the workers is dominating.
One more thing.
Does parallel CSV file reading even help? Is the disk the bottleneck, or is it the CPU? Many so-called top-rated answers on stackoverflow contain the common dev wisdom that you only need one thread to read a file, best you can do, they say. Are they sure, though?
Let’s find out:
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000
real 0m2.256s
user 0m10.696s
sys 0m19.952s
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=1
11000000
real 0m17.380s
user 0m11.124s
sys 0m6.272s
Oh yes, yes it does. Parallel file reading works quite well. Well there you go!
Ps. In case some of you wanted to know, what if the balanceFactor was 2 when using a single worker process? Well, it’s horrible:
HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=2
11000000
real 1m37.077s
user 0m12.432s
sys 1m24.700s
Key parts of the fastread.py python program:
fileBytes = stat(fileName).st_size  # Read quickly from OS how many bytes are in a text file
startByte, endByte = PartitionDataToWorkers(workers=numProcesses, items=fileBytes, balanceFactor=balanceFactor)
p = Pool(numProcesses)
partialSum = p.starmap(ReadFileSegment, zip(startByte, endByte, repeat(fileName)))  # startByte is already a list. fileName is made into a same-length list of duplicate values.
globalSum = sum(partialSum)
print(globalSum)


def ReadFileSegment(startByte, endByte, fileName, searchChar='\n'):  # counts number of searchChar appearing in the byte range
    with open(fileName, 'r') as f:
        f.seek(startByte - 1)  # seek is initially at byte 0 and then moves forward the specified amount, so seek(5) points at the 6th byte.
        bytes = f.read(endByte - startByte + 1)
        cnt = len(re.findall(searchChar, bytes))  # findall with implicit compiling runs just as fast here as re.compile once + re.finditer many times.
        return cnt
The def for PartitionDataToWorkers is just ordinary sequential code. I left it out in case someone else wants to get some practice on what parallel programming is like. I gave away for free the harder parts: the tested and working parallel code, for your learning benefit.
Thanks to: The open-source H2O project, by Arno and Cliff and the H2O staff for their great software and instructional videos, which have provided me the inspiration for this pure python high performance parallel byte offset reader as shown above. H2O does parallel file reading using java, is callable by python and R programs, and is crazy fast, faster than anything on the planet at reading big CSV files.
Katrielalex provided the way to open & read one file.
However, the way your algorithm goes, it reads the whole file for each line of the file. That means the overall amount of file reading - and of computing the Levenshtein distance - will be N*N if N is the number of lines in the file. Since you're concerned about file size and don't want to keep it in memory, I am concerned about the resulting quadratic runtime. Your algorithm is in the O(n^2) class of algorithms, which can often be improved with specialization.
I suspect that you already know the tradeoff of memory versus runtime here, but maybe you would want to investigate if there's an efficient way to compute multiple Levenshtein distances in parallel. If so it would be interesting to share your solution here.
How many lines do your files have, and on what kind of machine (mem & cpu power) does your algorithm have to run, and what's the tolerated runtime?
Code would look like:
with open(input_file, 'r') as f_outer:
    for line_outer in f_outer:
        with open(input_file, 'r') as f_inner:
            for line_inner in f_inner:
                compute_distance(line_outer, line_inner)
But the questions are how do you store the distances (matrix?) and can you gain an advantage of preparing e.g. the outer_line for processing, or caching some intermediate results for reuse.
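As a rough sketch of the parallel idea mentioned above (not the asker's code; the file name is hypothetical, and it assumes the third-party python-Levenshtein package and that the lines fit in memory, trading memory for speed):

    from functools import partial
    from multiprocessing import Pool

    import Levenshtein  # third-party: pip install python-Levenshtein

    def distances_for_line(line, all_lines):
        # one worker computes this line's distance against every other line
        return [Levenshtein.distance(line, other) for other in all_lines]

    if __name__ == '__main__':
        with open('input.txt') as f:             # hypothetical file name
            lines = [l.rstrip('\n') for l in f]  # assumes the lines fit in memory
        with Pool() as pool:
            matrix = pool.map(partial(distances_for_line, all_lines=lines), lines)
        # matrix[i][j] is the Levenshtein distance between line i and line j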
Need to frequently read a large file from the last read position?
I have created a script used to cut an Apache access.log file several times a day.
So I needed to set a position cursor on the last line parsed during the last execution.
To this end, I used the file.seek() and file.tell() methods, which allow storing the cursor position in a file.
My code:
ENCODING = "utf8"
CURRENT_FILE_DIR = os.path.dirname(os.path.abspath(__file__))
# This file is used to store the last cursor position
cursor_position = os.path.join(CURRENT_FILE_DIR, "access_cursor_position.log")
# Log file with new lines
log_file_to_cut = os.path.join(CURRENT_FILE_DIR, "access.log")
cut_file = os.path.join(CURRENT_FILE_DIR, "cut_access", "cut.log")
# Set in from_line
from_position = 0
try:
with open(cursor_position, "r", encoding=ENCODING) as f:
from_position = int(f.read())
except Exception as e:
pass
# We read log_file_to_cut to put new lines in cut_file
with open(log_file_to_cut, "r", encoding=ENCODING) as f:
with open(cut_file, "w", encoding=ENCODING) as fw:
# We set cursor to the last position used (during last run of script)
f.seek(from_position)
for line in f:
fw.write("%s" % (line))
# We save the last position of cursor for next usage
with open(cursor_position, "w", encoding=ENCODING) as fw:
fw.write(str(f.tell()))
From the python documentation for fileinput.input():
This iterates over the lines of all files listed in sys.argv[1:], defaulting to sys.stdin if the list is empty
Further, the definition of the function is:
fileinput.FileInput([files[, inplace[, backup[, mode[, openhook]]]]])
Reading between the lines, this tells me that files can be a list, so you could have something like:
for each_line in fileinput.input([input_file, input_file]):
    do_something(each_line)
See here for more information
# Using a text file for the example
with open("yourFile.txt", "r") as f:
    text = f.readlines()

for line in text:
    print line
Open your file for reading (r)
Read the whole file and save each line into a list (text)
Loop through the list printing each line.
If you want, for example, to check a specific line for a length greater than 10, work with what you already have available.
for line in text:
    if len(line) > 10:
        print line
I would strongly recommend not using the default file loading as it is horrendously slow. You should look into the numpy functions and the IOpro functions (e.g. numpy.loadtxt()).
http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html
https://store.continuum.io/cshop/iopro/
Then you can break your pairwise operation into chunks:
import numpy as np
import math

lines_total = n
similarity = np.zeros((n, n))
lines_per_chunk = m
n_chunks = math.ceil(float(n) / m)

for i in xrange(n_chunks):
    for j in xrange(n_chunks):
        chunk_i = (function of your choice to read lines i*lines_per_chunk to (i+1)*lines_per_chunk)
        chunk_j = (function of your choice to read lines j*lines_per_chunk to (j+1)*lines_per_chunk)
        similarity[i*lines_per_chunk:(i+1)*lines_per_chunk,
                   j*lines_per_chunk:(j+1)*lines_per_chunk] = fast_operation(chunk_i, chunk_j)
It's almost always much faster to load data in chunks and then do matrix operations on it than to do it element by element!!
The best way to read a large file line by line is to use Python's enumerate function:
with open(file_name, "rU") as read_file:
    for i, row in enumerate(read_file, 1):
        # do something
        # i is the line number (starting at 1)
        # row contains all the data of that line

Why doesn't write function write to file immediately? [duplicate]

I'm running a test, and found that the file doesn't actually get written until I control-C to abort the program. Can anyone explain why that would happen?
I expected it to write at the same time, so I could read the file in the middle of the process.
import os
from time import sleep

f = open("log.txt", "a+")
i = 0
while True:
    f.write(str(i))
    f.write("\n")
    i += 1
    sleep(0.1)
Writing to disk is slow, so many programs store up writes into large chunks which they write all-at-once. This is called buffering, and Python does it automatically when you open a file.
When you write to the file, you're actually writing to a "buffer" in memory. When it fills up, Python will automatically write it to disk. You can tell it "write everything in the buffer to disk now" with
f.flush()
This isn't quite the whole story, because the operating system will probably buffer writes as well. You can tell it to write the buffer of the file with
os.fsync(f.fileno())
Finally, you can tell Python not to buffer a particular file with open(f, "w", 0) or only to keep a 1-line buffer with open(f,"w", 1). Naturally, this will slow down all operations on that file, because writes are slow.
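A minimal sketch of the question's loop with an explicit flush (and an optional fsync) on every iteration:

    import os
    from time import sleep

    f = open("log.txt", "a+")
    i = 0
    while True:
        f.write(str(i) + "\n")
        f.flush()                # push Python's buffer to the OS
        os.fsync(f.fileno())     # optional: ask the OS to push its cache to disk (slower)
        i += 1
        sleep(0.1)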
You need to f.close() to flush the file write buffer out to the file. Or in your case you might just want to do f.flush(); os.fsync(f.fileno()); so you can keep looping with the opened file handle.
Don't forget to import os.
You have to force the write, so I use the following lines to make sure a file is written:
# Two commands together force the OS to store the file buffer to disc
f.flush()
os.fsync(f.fileno())
You will want to check out file.flush() - although take note that this might not write the data to disk, to quote:
Note:
flush() does not necessarily write the file’s data to disk. Use flush() followed by os.fsync() to ensure this behavior.
Closing the file (file.close()) will also ensure that the data is written - using with will do this implicitly, and is generally a better choice for more readability and clarity - not to mention solving other potential problems.
This is a Windows-ism. If you add an explicit .close() when you're done with the file, it'll appear in Explorer at that time. Even just flushing it might be enough (I don't have a Windows box handy to test). But basically f.write does not actually write, it just appends to the write buffer - until the buffer gets flushed you won't see it.
On unix the files will typically show up as a 0-byte file in this situation.
The file handle needs to be flushed:
f.flush()
The file does not get written, as the output buffer is not getting flushed until the garbage collection takes effect, and flushes the I/O buffer (more than likely by calling f.close()).
Alternately, in your loop, you can call f.flush() followed by os.fsync(), as documented here.
f.flush()
os.fsync(f.fileno())
All that being said, if you ever plan on sharing the data in that file with other portions of your code, I would highly recommend using a StringIO object.
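A minimal sketch of that idea: io.StringIO keeps the "file" contents in memory, so other parts of the program can read them back without any disk I/O:

    import io

    buf = io.StringIO()
    buf.write("first line\n")
    buf.write("second line\n")

    contents = buf.getvalue()  # everything written so far, available immediately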

Program copies specific lines from HTML file; certain lines break it with no visible pattern?

I have written a program in Python to take the text from specific elements in an HTML file ('input') and write it all into another document ('output'). The program works by searching for the particular HTML tag that precedes all elements of the desired type, and then writing the next line. This is the code, generalized:
input = open(filepath, 'r')
output = open(filepath2, 'w')

collect = 0
onstring = "string to be searched for"

for i in range(numberOfLines):
    line = input.readline()
    if onstring in line:
        collect = 1
    elif collect == 1:
        output.write(line)
        collect = 0
I doubt it is optimal, but it functions as intended except for one hangup: for every HTML file I try this on, between 5 and 15 of the last elements that should be copied get completely cut off. There is seemingly no pattern to how many are cut off, so I was wondering if someone more experienced with Python could point out an apparent flaw.
If it helps, some things I have tested:
if I append two HTML files, the same amount of posts will be cut off as are cut off from the second one alone.
if I remove the last element that gets copied, more elements that were cut off after it are copied as normal, but usually some posts get cut off later, suggesting that the particular element that is copied is responsible for this issue. There is no discernible pattern of which ones 'break' the program, though.
Expanding on my comment from above: you have opened the file for writing, but each write operation does not go directly to disk. Instead, it is sent to a write buffer; when the buffer fills up, all of the write operations in the buffer are written to the physical disk. Closing the file forces whatever write operations are in the buffer to be written out.
Since your program exits without closing the file, you have writes left in the memory buffer that were never written to disk. Try using:
output.close()
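Alternatively, a minimal sketch of the same loop using with, which closes (and therefore flushes) both files automatically:

    onstring = "string to be searched for"
    collect = 0

    with open(filepath, 'r') as input, open(filepath2, 'w') as output:
        for line in input:
            if onstring in line:
                collect = 1
            elif collect == 1:
                output.write(line)
                collect = 0
    # both files are closed here, so everything left in the write buffer is on disk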
I solved the problem by properly closing the files with output.close().
Credit goes to James for his helpful comment.
If not, you may have writes left in the memory buffer that were never written to disk. Try using output.close().

Does the Python "open" function save its content in memory or in a temp file?

For the following Python code:
fp = open('output.txt', 'wb')
# Very big file, writes a lot of lines, n is a very large number
for i in range(1, n):
    fp.write('something' * n)
fp.close()
The writing process above can last more than 30 min. Sometimes I get the error MemoryError. Is the content of the file before closing stored in memory or written in a temp file? If it's in a temporary file, what is its general location on a Linux OS?
Edit:
Added fp.write in a for loop
It's stored in the operating system's disk cache in memory until it is flushed to disk, either implicitly due to timing or space issues, or explicitly via fp.flush().
There will be write buffering in the Linux kernel, but at (ir)regular intervals they will be flushed to disk. Running out of such buffer space should never cause an application-level memory error; the buffers should empty before that happens, pausing the application while doing so.
Building on ataylor's comment to the question:
You might want to nest your loop. Something like
for i in range(1, n):
    for each in range(n):
        fp.write('something')
fp.close()
That way, the only thing that gets put into memory is the string "something", not "something" * n.
If you are writing out a large file for which the writes might fail, you are better off flushing the file to disk yourself at regular intervals using fp.flush(). This way the data will be in a location of your choosing that you can easily get to, rather than being at the mercy of the OS:
fp = open('output.txt', 'wb')
counter = 0
for line in many_lines:
    fp.write(line)
    counter += 1
    if counter > 999:
        fp.flush()
        counter = 0  # reset so we flush again after the next 1000 lines
fp.close()
This will flush the file to disk every 1000 lines.
If you write line by line, it should not be a problem. You should show the code of what you are doing before the write. For a start you can try to delete objects where not necessary, use fp.flush() etc..
File writing should never give a memory error; in all probability, you have some bug in another place.
If you have a loop, and a memory error, then I would look if you are "leaking" references to objects.
Something like:
def do_something(a, b=[]):
    b.append(a)
    return b

fp = open('output.txt', 'wb')
for i in range(1, n):
    something = do_something(i)
    fp.write(something)
fp.close()
I am just picking an example here, and in your actual case the reference leak may be much more difficult to find; however this example will keep leaking memory inside do_something because of the way Python handles default parameters of functions.
