I followed the solution proposed here
In order to test it, I used two programs, writer.py and reader.py.
# writer.py
import time

with open('pipe.txt', 'w', encoding='utf-8') as f:
    i = 0
    while True:
        f.write('{}'.format(i))
        print('I wrote {}'.format(i))
        time.sleep(3)
        i += 1
# reader.py
import time, os

# Set the filename and open the file
filename = 'pipe.txt'
file = open(filename, 'r', encoding='utf-8')

# Find the size of the file and move to the end
st_results = os.stat(filename)
st_size = st_results[6]
file.seek(st_size)

while 1:
    where = file.tell()
    line = file.readline()
    if not line:
        time.sleep(1)
        file.seek(where)
    else:
        print(line)
But when I run:
> python writer.py
> python reader.py
the reader only prints the lines after the writer has exited (when I kill the process).
Is there any way to read the contents while they are being written?
[EDIT]
The program that actually writes to the file is an .exe application and I don't have access to the source code.
You need to flush your writes/prints to files, or they'll default to being block-buffered (so you'd have to write several kilobytes before the user-mode buffer would actually be sent to the OS for writing).
The simplest solution is to call .flush() after write calls:
f.write('{}'.format(i))
f.flush()
There are two different problems here:
The OS and file system must allow concurrent access to a file. If you get no error, that is the case, but on some systems it could be disallowed.
The writer must flush its output for it to reach the disk, so that the reader can find it. If you do not, the output stays in an in-memory buffer until the buffer is full, which can require several kilobytes.
So the writer should become:
# writer.py
import time

with open('pipe.txt', 'w', encoding='utf-8') as f:
    i = 0
    while True:
        f.write('{}'.format(i))
        f.flush()
        print('I wrote {}'.format(i))
        time.sleep(3)
        i += 1
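An alternative sketch, in case explicit flushing is undesirable: if every record ends with a newline, the file can be opened line-buffered (buffering=1, which is only valid for text-mode files), so each completed line is handed to the OS automatically. The newline added in the write below is my assumption, not part of the original writer:
# writer.py (line-buffered variant)
import time

with open('pipe.txt', 'w', encoding='utf-8', buffering=1) as f:
    i = 0
    while True:
        f.write('{}\n'.format(i))  # the trailing newline triggers the line-buffer flush
        print('I wrote {}'.format(i))
        time.sleep(3)
        i += 1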
Related
I am trying to create some temporary files and perform some operations on them inside a loop. Then I will access the information in all of the temporary files and do some operations with that information. For simplicity, the following code reproduces my issue:
import tempfile

tmp_files = []
for i in range(40):
    tmp = tempfile.NamedTemporaryFile(suffix=".txt")
    with open(tmp.name, "w") as f:
        f.write(str(i))
    tmp_files.append(tmp.name)

string = ""
for tmp_file in tmp_files:
    with open(tmp_file, "r") as f:
        data = f.read()
        string += data

print(string)
ERROR:
    with open(tmp_file, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpynh0kbnw.txt'
When I look at the /tmp directory (with some time.sleep(2) in the loop), I see that the files are deleted and only one is preserved, hence the error.
Of course I could keep all the files with the flag tempfile.NamedTemporaryFile(suffix=".txt", delete=False). But that is not the idea; I would like to keep the temporary files just for the running time of the script. I could also delete the files with os.remove. But my question is more about why this happens: I expected the files to persist until the end of the run, because I don't close them during execution (or do I?).
Many thanks in advance.
tdelaney already answers your actual question.
I would just like to offer you an alternative to NamedTemporaryFile: why not create a temporary folder which is removed (with all files in it) at the end of the script?
Instead of using a NamedTemporaryFile, you could use tempfile.TemporaryDirectory. The directory will be deleted when closed.
The example below uses the with statement, which closes the file handle automatically when the block ends (see John Gordon's comment).
import os
import tempfile

with tempfile.TemporaryDirectory() as temp_folder:
    tmp_files = []
    for i in range(40):
        tmp_file = os.path.join(temp_folder, f"{i}.txt")
        with open(tmp_file, "w") as f:
            f.write(str(i))
        tmp_files.append(tmp_file)

    string = ""
    for tmp_file in tmp_files:
        with open(tmp_file, "r") as f:
            data = f.read()
            string += data

    print(string)
By default, a NamedTemporaryFile deletes its file when closed. It's a bit subtle, but tmp = tempfile.NamedTemporaryFile(suffix=".txt") in the loop causes the previous file to be deleted when tmp is reassigned. One option is to use the delete=False parameter. Or just keep the file open and seek to the beginning after the write.
NamedTemporaryFile is already a file object - you can write to it directly without reopening. Just make sure the mode is "write plus" and in text, not binary, mode. Put the code in a try/finally block to make sure the files are really deleted at the end.
import tempfile

tmp_files = []
try:
    for i in range(40):
        tmp = tempfile.NamedTemporaryFile(suffix=".txt", mode="w+")
        tmp.write(str(i))
        tmp.seek(0)
        tmp_files.append(tmp)

    string = ""
    for tmp_file in tmp_files:
        data = tmp_file.read()
        string += data
finally:
    for tmp_file in tmp_files:
        tmp_file.close()

print(string)
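To see the reassignment subtlety in isolation, here is a minimal sketch; it assumes CPython, where reference counting finalizes the orphaned object promptly:
import os
import tempfile

tmp = tempfile.NamedTemporaryFile(suffix=".txt")
first_name = tmp.name
tmp = tempfile.NamedTemporaryFile(suffix=".txt")  # the first object loses its last reference...
print(os.path.exists(first_name))                 # ...so it is closed and its file deleted: False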
I'm trying to make multiple programs communicate using named pipes under Python.
Here's how I'm proceeding :
import os

os.mkfifo("/tmp/p")
file = os.open("/tmp/p", os.O_RDONLY)

while True:
    line = os.read(file, 255)
    print("'%s'" % line)
Then, after starting it, I send some simple data through the pipe:
echo "test" > /tmp/p
I expected test\n to show up here, and then Python to block at os.read() again.
What actually happens is that Python prints 'test\n' and then prints '' (an empty string) infinitely.
Why is that happening, and what can I do about it?
From http://man7.org/linux/man-pages/man7/pipe.7.html :
If all file descriptors referring to the write end of a pipe have been
closed, then an attempt to read(2) from the pipe will see end-of-file
From https://docs.python.org/2/library/os.html#os.read :
If the end of the file referred to by fd has been reached, an empty string is returned.
So, you're closing the write end of the pipe (when your echo command finishes) and Python is reporting that as end-of-file.
If you want to wait for another process to open the FIFO, then you could detect when read() returns end-of-file, close the FIFO, and open it again. The open should block until a new writer comes along.
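A minimal sketch of that reopen-on-EOF loop, reusing the FIFO from the question (opening a FIFO read-only blocks until a writer connects):
import os

path = "/tmp/p"
while True:
    fd = os.open(path, os.O_RDONLY)  # blocks here until some writer opens the FIFO
    while True:
        data = os.read(fd, 255)
        if not data:                 # empty read means the writer closed: end-of-file
            break
        print("'%s'" % data)
    os.close(fd)                     # close and reopen to wait for the next writer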
As an alternative to user9876's answer, you can open your pipe for writing right after creating it; this allows it to stay open for writing at all times.
Here's an example contextmanager for working with pipes:
import contextlib
import os

@contextlib.contextmanager
def pipe(path):
    try:
        os.mkfifo(path)
    except FileExistsError:
        pass
    try:
        with open(path, 'w'):  # dummy writer
            with open(path, 'r') as reader:
                yield reader
    finally:
        os.unlink(path)
And here is how you use it:
with pipe('myfile') as reader:
    while True:
        print(reader.readline(), end='')
I have a simple Python script where I read a logfile continuously (same as tail -f):
while True:
    line = f.readline()
    if line:
        print line,
    else:
        time.sleep(0.1)
How can I make sure that I can still read the logfile after it has been rotated by logrotate?
i.e. I need to do the same as what tail -F would do.
I am using Python 2.7.
As long as you only plan to do this on Unix, the most robust way is probably to check that the open file still refers to the same inode as the name, and reopen it when that is no longer the case. You can get the inode number of the file from os.stat and os.fstat, in the st_ino field.
It could look like this:
import os, sys, time

name = "logfile"
current = open(name, "r")
curino = os.fstat(current.fileno()).st_ino
while True:
    while True:
        buf = current.read(1024)
        if buf == "":
            break
        sys.stdout.write(buf)
    try:
        if os.stat(name).st_ino != curino:
            new = open(name, "r")
            current.close()
            current = new
            curino = os.fstat(current.fileno()).st_ino
            continue
    except IOError:
        pass
    time.sleep(1)
I doubt this works on Windows, but since you're speaking in terms of tail, I'm guessing that's not a problem. :)
You can do it by keeping track of where you are in the file and reopening it when you want to read. When the log file rotates, you notice that the file is smaller, and since you reopen it, you handle any unlinking too.
import time

cur = 0
while True:
    try:
        with open('myfile') as f:
            f.seek(0, 2)
            if f.tell() < cur:
                f.seek(0, 0)
            else:
                f.seek(cur, 0)
            for line in f:
                print line.strip()
            cur = f.tell()
    except IOError, e:
        pass
    time.sleep(1)
This example hides errors like file-not-found because I'm not sure of logrotate details, such as small periods of time where the file is not available.
NOTE: In Python 3, things are different. A regular open translates bytes to str, and the interim buffer used for that conversion means that seek and tell don't operate properly (except when seeking to 0 or to the end of file). Instead, open in binary mode ("rb") and do the decoding manually, line by line. You'll have to know the file encoding and what that encoding's newline looks like. For utf-8, it's b"\n" (one of the reasons utf-8 is superior to utf-16, btw).
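A minimal sketch of that binary-mode approach for Python 3, rewriting the loop above under the assumption that 'myfile' is utf-8 encoded:
import time

cur = 0
while True:
    try:
        with open('myfile', 'rb') as f:   # binary mode keeps seek/tell reliable
            f.seek(0, 2)
            if f.tell() < cur:            # the file shrank: it was rotated
                f.seek(0, 0)
            else:
                f.seek(cur, 0)
            for raw in f:                 # raw is bytes, split on b"\n"
                print(raw.decode('utf-8').strip())
            cur = f.tell()
    except IOError:
        pass
    time.sleep(1)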
Thanks to @tdelaney's and @Dolda2000's answers, I ended up with what follows. It should work on both Linux and Windows, and also handle logrotate's copytruncate or create options (respectively: copy then truncate the size to 0, and move then recreate the file).
import os
import time

file_name = 'my_log_file'
seek_end = True
while True:  # handle moved/truncated files by allowing to reopen
    with open(file_name) as f:
        if seek_end:  # reopened files must not seek end
            f.seek(0, 2)
        while True:  # line reading loop
            line = f.readline()
            if not line:
                try:
                    if f.tell() > os.path.getsize(file_name):
                        # rotation occurred (copytruncate/create)
                        f.close()
                        seek_end = False
                        break
                except FileNotFoundError:
                    # rotation occurred but new file still not created
                    pass  # wait 1 second and retry
                time.sleep(1)
            do_stuff_with(line)  # placeholder for your own line handler
A limitation when using the copytruncate option: if lines are appended to the file while sleeping, and rotation occurs before wake-up, the last lines will be "lost" (they will still be in the now-"old" log file, but I cannot see a decent way to "follow" that file to finish reading it). This limitation is not relevant with the "move and create" create option, because the f descriptor will still point to the renamed file, and therefore the last lines will be read before the descriptor is closed and opened again.
Using tail -F
man tail:
-F     same as --follow=name --retry
-f, --follow[={name|descriptor}]     output appended data as the file grows;
--retry     keep trying to open a file if it is inaccessible
The -F option will follow the name of the file, not the descriptor. So when logrotate happens, it will follow the new file.
import subprocess
from typing import Generator

def tail(filename: str) -> Generator[str, None, None]:
    proc = subprocess.Popen(["tail", "-F", filename],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    while True:
        line = proc.stdout.readline()
        if line:
            yield line.decode("utf-8")
        else:
            break

for line in tail("/config/logs/openssh/current"):
    print(line.strip())
I made a variation of the awesome answer above by @pawamoy, turning it into a generator function for my log monitoring and following needs.
def tail_file(file):
    """Generator function that yields new lines in a file.

    :param file: file path as a string
    :type file: str
    :rtype: collections.Iterable
    """
    seek_end = True
    while True:  # handle moved/truncated files by allowing to reopen
        with open(file) as f:
            if seek_end:  # reopened files must not seek end
                f.seek(0, 2)
            while True:  # line reading loop
                line = f.readline()
                if not line:
                    try:
                        if f.tell() > os.path.getsize(file):
                            # rotation occurred (copytruncate/create)
                            f.close()
                            seek_end = False
                            break
                    except FileNotFoundError:
                        # rotation occurred but new file still not created
                        pass  # wait 1 second and retry
                    time.sleep(1)
                yield line
It can easily be used like below:
import os, time

access_logfile = '/var/log/syslog'
loglines = tail_file(access_logfile)
for line in loglines:
    print(line)
Here is a test I created to recreate a problem I was having when I used tempfile.NamedTemporaryFile(). The problem is that when I use tempfile, the data in my CSV is truncated off the end of the file.
When you run this test script, temp2.csv will get truncated and temp1.csv will be the same size as the original CSV.
I'm using Python 2.7.1.
You can download the sample CSV from http://explore.data.gov/Energy-and-Utilities/Residential-Energy-Consumption-Survey-RECS-Files-A/eypy-jxs2
#!/usr/bin/env python
import tempfile
import shutil

def main():
    f = open('RECS05alldata.csv')
    data = f.read()
    f.close()

    f = open('temp1.csv', 'w+b')
    f.write(data)
    f.close()

    temp = tempfile.NamedTemporaryFile()
    temp.write(data)
    shutil.copy(temp.name, 'temp2.csv')
    temp.close()

if __name__ == '__main__':
    main()
Add temp.flush() after temp.write(data).
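In other words, a sketch of the corrected portion of the test script:
temp = tempfile.NamedTemporaryFile()
temp.write(data)
temp.flush()  # push the buffered data to the OS so the copy sees the whole file
shutil.copy(temp.name, 'temp2.csv')
temp.close()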
You copy the file before you close it. Files are buffered, which means that some of the data will remain in the buffer while waiting to be written to the file. Closing the file writes out all remaining data from the buffer as part of closing it.
This has nothing to do with NamedTemporaryFile.
I think your problem is that Python has not flushed the entire file to disk when you call shutil.copy.
Change
temp = tempfile.NamedTemporaryFile()
temp.write(data)
shutil.copy(temp.name, 'temp2.csv')
temp.close()
to
temp = tempfile.NamedTemporaryFile(delete=False)  # without delete=False, closing would delete the file before the copy
temp.write(data)
temp.close()
shutil.copy(temp.name, 'temp2.csv')
Please advise - I'm going to use this as a learning point. I'm a beginner.
I'm splitting a 25 MB file into several smaller files.
A kindly guru here gave me a Ruby script. It works beautifully fast. So, in order to learn, I mimicked it with a Python script. This runs like a three-legged cat (slow). I wonder if anyone can tell me why?
My Python script:
##split a file into smaller files
###########################################
import re

def splitlines(file):
    fileNo = 0001
    outFile = open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\\%s.txt" % fileNo, 'a')  ## open file to append
    fh = open(file, "r")  ## open the file for reading
    mylines = fh.readlines()  ### read in lines
    for line in mylines:  ## for each line
        if re.search("Copyright ", line):  # if the line matches the regex
            outFile.close()  ## close the file
            fileNo += 1  # and add one to the filename, starting to read lines in again
        else:  # otherwise
            outFile = open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\\%s.txt" % fileNo, 'a')  ## open file to append
            outFile.write(line)  ## then append it to the open outFile
    fh.close()
The guru's Ruby 1.9 script
g=0001
f=File.open(g.to_s + ".txt","w")
open("corpus1.txt").each do |line|
  if line[/\d+ of \d+ DOCUMENTS/]
    f.close
    f=File.open(g.to_s + ".txt","w")
    g+=1
  end
  f.print line
end
There are many reasons why your script is slow -- the main one being that you reopen the output file for almost every line you write. Since the old file gets implicitly closed on opening a new one (due to Python garbage collection), the write buffer is flushed for every single line you write, which is quite expensive.
A cleaned-up and corrected version of your script would be:
def file_generator():
    file_no = 1
    while True:
        f = open(r"C:\Users\dunner7\Desktop\Textomics\Media"
                 r"\LexisNexus\ele\newdocs\%s.txt" % file_no, 'a')
        yield f
        f.close()
        file_no += 1

def splitlines(filename):
    files = file_generator()
    out_file = next(files)
    with open(filename) as in_file:
        for line in in_file:
            if "Copyright " in line:
                out_file = next(files)
            out_file.write(line)
    out_file.close()
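A hypothetical invocation (the input filename is borrowed from the Ruby example, not from the original script):
splitlines("corpus1.txt")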
I guess the reason your script is so slow is that you open a new file descriptor for each line. If you look at your guru's Ruby script, it closes and reopens the output file only when the separator matches.
In contrast, your Python script opens a new file descriptor for every line you read (and, by the way, does not close them). Opening a file requires talking to the kernel, so this is relatively slow.
Another change I would suggest is to change
fh = open(file, "r") ## open the file for reading
mylines = fh.readlines() ### read in lines
for line in mylines: ## for each line
to
fh = open(file, "r")
for line in fh:
With this change, you do not read the whole file into memory, but only block after block. Although it should not matter with a 25MiB file, it will hurt you with big files and is good practice (and less code ;)).
The Python code might be slow due to the regex and not IO. Try
import re

def splitlines(file):
    fileNo = 0001
    outFile = open("newdocs/%s.txt" % fileNo, 'a')  ## open file to append
    reg = re.compile("Copyright ")
    for line in open(file, "r"):
        if reg.search(line):  # if the regex matches the line
            outFile.close()  ## close the file
            outFile = open("newdocs/%s.txt" % fileNo, 'a')  ## open file to append
            fileNo += 1  # and add one to the filename, starting to read lines in again
        outFile.write(line)  ## then append it to the open outFile
Several notes:
Always use / instead of \ for path names
If a regex is used repeatedly, compile it
Do you need re.search, or re.match?
UPDATE:
@Ed. S: point taken
@Winston Ewert: code updated to be closer to the original Ruby code
rosser,
Don't use the names of built-in objects as identifiers in code (file, splitlines).
The following code respects the effect of your own code: an out_file is closed without writing the line containing 'Copyright ' that constitutes the signal to close it.
The use of the function writelines() is intended to obtain faster execution than with repeated out_file.write(line) calls.
The if li: block is there to trigger the closing of out_file in case the last line of the read file doesn't contain 'Copyright '.
def splitfile(filename, wordstop, destrep, file_no=1, li=[]):
    with open(filename) as in_file:
        for line in in_file:
            if wordstop in line:
                with open(destrep + str(file_no) + '.txt', 'w') as f:
                    f.writelines(li)
                file_no += 1
                li = []
            else:
                li.append(line)
        if li:
            with open(destrep + str(file_no) + '.txt', 'w') as f:
                f.writelines(li)
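A hypothetical call, where the stop word comes from the question and the destination prefix is an assumption (the newdocs directory must already exist):
splitfile('corpus1.txt', 'Copyright ', 'newdocs/')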