IO + subprocess.check_output iteration breaking file writes - python

I am writing a python script that will go through a code file, comment out an import, build the program, and determine if each import is actually being used.
I have each component working, but when I combine them I get a very weird issue while writing back to the code file: whenever I use subprocess.check_output to build the code, the writes to the file always stop around line 315 and nothing after that point is written. I would assume it was some race condition, but it always ends on the same line each time I run it.
I have also tried os.system() with the same results. I'm not much of a Python coder, so maybe there is something obvious about IO that I'm missing?
Some Code:
with open(file, 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    lines = re.split("\r?\n", content)
    for index, line in enumerate(lines):
        result = regex.search(line)
        if result:
            commentLine = '//' + line + ' //Not Needed'
            lines[index] = commentLine
            f.seek(0)
            f.truncate()
            for writeLine in lines:
                f.write("%s\n" % writeLine)
            returnValue = 0
            try:
                returnValue = subprocess.check_output(['/usr/bin/xcodebuild', '-workspace', 'path_to_my_workspace', '-configuration', 'Debug', '-scheme', 'my_scheme'], stderr=subprocess.STDOUT)
            except:
                returnValue = 1
            if returnValue == 0:  # Build succeeded!
                print "Build succeeded!"
            else:  # Build failed
                print "Build failed"
                lines[index] = line
                f.seek(0)
                f.truncate()
                for writeLine in lines:
                    f.write("%s\n" % writeLine)
Walkthrough:
Open the file,
split it line by line, searching for a regex matching #import etc. etc.,
comment out the matching line,
try to build,
if the build succeeds, leave the comment,
if it fails, remove the comment,
repeat for each regex match (a sketch of this loop follows below).
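For illustration, here is a minimal sketch of that loop which flushes and fsyncs the file before each build; buffered writes that have not yet reached disk are one plausible cause of the truncated output described above. The paths, the regex, and the helper names are all hypothetical.
import os
import re
import subprocess

IMPORT_RE = re.compile(r'^#import\b')  # hypothetical pattern

def try_removing_imports(path):
    with open(path) as f:
        lines = f.read().splitlines()
    for index, line in enumerate(lines):
        if not IMPORT_RE.search(line):
            continue
        lines[index] = '//' + line + ' //Not Needed'
        rewrite(path, lines)
        if build_succeeds():
            print('Build succeeded without %r' % line)
        else:
            lines[index] = line  # the import is needed: restore it
            rewrite(path, lines)

def rewrite(path, lines):
    # rewrite the whole file and force it out to disk, so the build tool
    # sees the new content rather than a partially flushed buffer
    with open(path, 'w') as f:
        f.write('\n'.join(lines) + '\n')
        f.flush()
        os.fsync(f.fileno())

def build_succeeds():
    try:
        subprocess.check_output(
            ['/usr/bin/xcodebuild', '-workspace', 'path_to_my_workspace',
             '-configuration', 'Debug', '-scheme', 'my_scheme'],
            stderr=subprocess.STDOUT)
        return True
    except subprocess.CalledProcessError:
        return False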

Related

How to edit a file in python without a temporary sacrificial file? [duplicate]

I'm using Python, and would like to insert a string into a text file without deleting or copying the file. How can I do that?
Unfortunately there is no way to insert into the middle of a file without re-writing it. As previous posters have indicated, you can append to a file or overwrite part of it using seek but if you want to add stuff at the beginning or the middle, you'll have to rewrite it.
This is an operating system thing, not a Python thing. It is the same in all languages.
What I usually do is read from the file, make the modifications and write them out to a new file called myfile.txt.tmp or something like that. This is better than reading the whole file into memory because the file may be too large for that. Once the temporary file is completed, I rename it to the original file's name.
This is a good, safe way to do it because if the file write crashes or aborts for any reason, you still have your untouched original file.
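A minimal sketch of that read/transform/rename pattern, with a hypothetical transform callback:
import os

def modify_file(path, transform):
    tmp_path = path + '.tmp'
    with open(path) as src, open(tmp_path, 'w') as dst:
        for line in src:  # stream line by line; the whole file is never in memory
            dst.write(transform(line))
    os.rename(tmp_path, path)  # atomic on POSIX when both paths are on the same filesystem

# e.g. upper-case every line:
# modify_file('myfile.txt', lambda line: line.upper())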
Depends on what you want to do. To append you can open it with "a":
with open("foo.txt", "a") as f:
    f.write("new line\n")
If you want to prepend something you have to read from the file first:
with open("foo.txt", "r+") as f:
    old = f.read()  # read everything in the file
    f.seek(0)  # rewind
    f.write("new line\n" + old)  # write the new line before
The fileinput module of the Python standard library will rewrite a file in place if you use the inplace=1 parameter:
import sys
import fileinput

# replace all occurrences of 'sit' with 'SIT' and insert a line after the 5th
for i, line in enumerate(fileinput.input('lorem_ipsum.txt', inplace=1)):
    sys.stdout.write(line.replace('sit', 'SIT'))  # replace 'sit' and write
    if i == 4:
        sys.stdout.write('\n')  # write a blank line after the 5th line
Rewriting a file in place is often done by saving the old copy with a modified name. Unix folks add a ~ to mark the old one. Windows folks do all kinds of things -- add .bak or .old -- or rename the file entirely or put the ~ on the front of the name.
import shutil

shutil.move(aFile, aFile + "~")
destination = open(aFile, "w")
source = open(aFile + "~", "r")
for line in source:
    destination.write(line)
    if <some condition>:
        destination.write(<some additional line> + "\n")
source.close()
destination.close()
Instead of shutil, you can use the following.
import os
os.rename(aFile, aFile + "~")
Python's mmap module will allow you to insert into a file. The following sample shows how it can be done in Unix (Windows mmap may be different). Note that this does not handle all error conditions and you might corrupt or lose the original file. Also, this won't handle unicode strings.
import os
from mmap import mmap

def insert(filename, s, pos):
    if len(s) < 1:
        # nothing to insert
        return
    f = open(filename, 'r+')
    m = mmap(f.fileno(), os.path.getsize(filename))
    origSize = m.size()
    # clamp pos to the file; or this could be an error
    if pos > origSize:
        pos = origSize
    elif pos < 0:
        pos = 0
    m.resize(origSize + len(s))
    m[pos + len(s):] = m[pos:origSize]  # shift the tail right by len(s)
    m[pos:pos + len(s)] = s             # write the new data into the gap
    m.close()
    f.close()
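If I read the function above correctly, usage would look something like this (Python 2, since the slice assignments expect byte strings; the file name is hypothetical):
# create a small file, then splice text into the middle of it
with open('sample.txt', 'w') as f:
    f.write('hello world\n')

insert('sample.txt', 'brave ', 6)
# sample.txt now contains 'hello brave world\n'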
It is also possible to do this without mmap with files opened in 'r+' mode, but it is less convenient and less efficient as you'd have to read and temporarily store the contents of the file from the insertion position to EOF - which might be huge.
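A sketch of that 'r+' approach, with the caveats just mentioned (the whole tail is held in memory, and there is no error handling):
def insert_without_mmap(filename, s, pos):
    with open(filename, 'r+') as f:
        f.seek(pos)
        tail = f.read()    # everything from the insertion point to EOF
        f.seek(pos)
        f.write(s + tail)  # rewrite the tail, shifted right by len(s)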
As mentioned by Adam, you have to take your system limitations into consideration before deciding on an approach: do you have enough memory to read the whole file into memory, replace parts of it, and re-write it?
If you're dealing with a small file or have no memory issues this might help:
Option 1)
Read the entire file into memory, do a regex substitution on the entire or part of the line, and replace it with that line plus the extra line. You will need to make sure the 'middle line' is unique in the file; if you have timestamps on each line, this should be pretty reliable.
import re

# open file with r+b (allow write and binary mode)
f = open("file.log", 'r+b')
# read entire content of file into memory
f_content = f.read()
# basically match middle line and replace it with itself and the extra line
f_content = re.sub(r'(middle line)', r'\1\nnew line', f_content)
# return pointer to top of file so we can re-write the content with replaced string
f.seek(0)
# clear file content
f.truncate()
# re-write the content with the updated content
f.write(f_content)
# close file
f.close()
Option 2)
Figure out middle line, and replace it with that line plus the extra line.
# open file with r+b (allow write and binary mode)
f = open("file.log" , 'r+b')
# get array of lines
f_content = f.readlines()
# get middle line
middle_line = len(f_content)/2
# insert an extra line right after the middle line
f_content[middle_line] += "new line\n"
# return pointer to top of file so we can re-write the content with replaced string
f.seek(0)
# clear file content
f.truncate()
# re-write the content with the updated content
f.write(''.join(f_content))
# close file
f.close()
Wrote a small class for doing this cleanly.
import tempfile

class FileModifierError(Exception):
    pass

class FileModifier(object):

    def __init__(self, fname):
        self.__write_dict = {}
        self.__filename = fname
        self.__tempfile = tempfile.TemporaryFile()
        with open(fname, 'rb') as fp:
            for line in fp:
                self.__tempfile.write(line)
        self.__tempfile.seek(0)

    def write(self, s, line_number='END'):
        if line_number != 'END' and not isinstance(line_number, (int, float)):
            raise FileModifierError("Line number %s is not a valid number" % line_number)
        try:
            self.__write_dict[line_number].append(s)
        except KeyError:
            self.__write_dict[line_number] = [s]

    def writeline(self, s, line_number='END'):
        self.write('%s\n' % s, line_number)

    def writelines(self, s, line_number='END'):
        for ln in s:
            self.writeline(ln, line_number)

    def __popline(self, index, fp):
        try:
            ilines = self.__write_dict.pop(index)
            for line in ilines:
                fp.write(line)
        except KeyError:
            pass

    def close(self):
        self.__exit__(None, None, None)

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        with open(self.__filename, 'w') as fp:
            for index, line in enumerate(self.__tempfile.readlines()):
                self.__popline(index, fp)
                fp.write(line)
            for index in sorted(self.__write_dict):
                for line in self.__write_dict[index]:
                    fp.write(line)
        self.__tempfile.close()
Then you can use it this way:
with FileModifier(filename) as fp:
    fp.writeline("String 1", 0)
    fp.writeline("String 2", 20)
    fp.writeline("String 3")  # to write at the end of the file
If you know some unix you could try the following:
Notes: $ means the command prompt
Say you have a file my_data.txt with content as such:
$ cat my_data.txt
This is a data file
with all of my data in it.
Then, using the os module, you can run the usual sed commands:
import os

# identifiers used:
my_data_file = "my_data.txt"
command = "sed -i 's/all/none/' " + my_data_file

# execute the command
os.system(command)
If you aren't aware of sed, check it out, it is extremely useful.
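If you would rather avoid shell quoting, the same sed call can be made with the subprocess module instead of os.system; a sketch:
import subprocess

my_data_file = "my_data.txt"
# each argument is passed to sed directly, so no shell quoting is involved;
# -i rewrites the file in place
subprocess.check_call(["sed", "-i", "s/all/none/", my_data_file])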

Python read from a file, and only do work if a string isn't found

So I'm trying to make a reddit bot that will exec code from a submission. I have my own sub for controlling these clients.
while __name__ == '__main__':
    string = open('config.txt').read()
    for submission in subreddit.get_new(limit=1):
        if submission.url not in string:
            f.write(submission.url + "\n")
            f.close()
            f = open('config.txt', "a")
            string = open('config.txt').read()
What this is supposed to do is read from the config file, then only do work if the submission url isn't in config.txt. However, it always sees the most recent post and does its work. This is how f is opened:
if not os.path.exists('file'):
    open('config.txt', 'w').close()
f = open('config.txt', "a")
First a critique of your existing code (in comments):
# the next two lines are not needed; open('config.txt', "a")
# will create the file if it doesn't exist.
if not os.path.exists('file'):
    open('config.txt', 'w').close()
f = open('config.txt', "a")

# this is an unusual condition which will confuse readers
while __name__ == '__main__':
    # the next line will open a file handle and never explicitly close it
    # (it will probably get closed automatically when it goes out of scope,
    # but it's not good form)
    string = open('config.txt').read()
    for submission in subreddit.get_new(limit=1):
        # the next line should check for a full-line match; as written, it
        # will match "http://www.test.com" if "http://www.test.com/level2"
        # is in config.txt
        if submission.url not in string:
            f.write(submission.url + "\n")
            # the next two lines could be replaced with f.flush()
            f.close()
            f = open('config.txt', "a")
            # this is a cumbersome way to keep your string synced with the file,
            # and it never explicitly releases the new file handle
            string = open('config.txt').read()
    # If subreddit.get_new() doesn't return any results, this will act as
    # a busy loop, repeatedly requesting new results as fast as possible.
    # If that is undesirable, you might want to sleep here.
# file handle f should get closed after the loop
None of the problems pointed out above should keep your code from working (except maybe the imprecise matching), but simpler code may be easier to debug. Here's some code that does the same thing. Note: I assume there is no chance any other process is writing to config.txt at the same time. You could try this code (or your code) with pdb, line by line, to see whether it works as expected (see the note after the code).
import time
import praw

r = praw.Reddit(...)
subreddit = r.get_subreddit(...)

if __name__ == '__main__':
    # open config.txt for reading and writing without truncating.
    # moves pointer to end of file; closes file at end of block
    with open('config.txt', "a+") as f:
        # move pointer to start of file
        f.seek(0)
        # make a set of existing lines; also move pointer to end of file
        lines = set(f.read().splitlines())
        while True:
            got_one = False
            for submission in subreddit.get_new(limit=1):
                got_one = True
                if submission.url not in lines:
                    lines.add(submission.url)
                    f.write(submission.url + "\n")
                    # write data to disk immediately
                    f.flush()
                    ...
            if not got_one:
                # wait a little while before trying again
                time.sleep(10)
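To follow the pdb suggestion above, you can either run the whole script under the debugger or set a breakpoint in code (the script name here is hypothetical):
# run everything under the debugger:
#     python -m pdb bot.py
# or break programmatically just before the suspect block:
import pdb
pdb.set_trace()  # then step with 'n', and inspect state with e.g. 'p lines'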

Why does my script write every input string twice to the output file?

When I look at what it wrote, it's always doubled. For example, if I write 'dog' I'll get 'dogdog'. Why?
Reading and writing to file, filename taken from command line arguments:
from sys import argv
script, text = argv

def reading(f):
    print f.read()

def writing(f):
    print f.write(line)

# opening file
filename = open(text)
reading(filename)
filename.close()

filename = open(text, 'w')
line = raw_input()
filename.write(line)
writing(filename)
filename.close()
As I said, the output I get is double the input I give.
You are getting the doubled value because you are writing the line twice:
1) from the function call:
def writing(f):
    print f.write(line)
2) by writing to the file directly with filename.write(line).
Use this code:
from sys import argv
script, text = argv

def reading(f):
    print f.read()

def writing(f):
    print f.write(line)

filename = open(text, 'w')
line = raw_input()
writing(filename)
filename.close()
Also, there is no need to close the file twice; once you have finished all the read and write operations, just close it once.
If you want to display each line and then write a new line, you should probably just read the entire file first, and then loop over the lines when writing new content.
Here's how you can do it. When you use with open(), you don't have to close() the file, since that's done automatically.
from sys import argv

filename = argv[1]

# first read the file content
with open(filename, 'r') as fp:
    lines = fp.readlines()
# `lines` is now a list of strings.

# then open the file for writing.
# This will empty the file so we can write from the start.
with open(filename, 'w') as fp:
    # by using enumerate, we can get the line numbers as well.
    for index, line in enumerate(lines, 1):
        print 'line %d of %d:\n%s' % (index, len(lines), line.rstrip())
        new_line = raw_input()
        fp.write(new_line + '\n')

python: read file continuously, even after it has been logrotated

I have a simple Python script where I read a logfile continuously (same as tail -f):
while True:
    line = f.readline()
    if line:
        print line,
    else:
        time.sleep(0.1)
How can I make sure that I can still read the logfile, after it has been rotated by logrotate?
i.e. I need to do what tail -F would do.
I am using Python 2.7.
As long as you only plan to do this on Unix, the most robust way is probably to check whether the open file still refers to the same i-node as the name, and reopen it when that is no longer the case. You can get the i-number of the file from os.stat and os.fstat, in the st_ino field.
It could look like this:
import os, sys, time

name = "logfile"
current = open(name, "r")
curino = os.fstat(current.fileno()).st_ino
while True:
    while True:
        buf = current.read(1024)
        if buf == "":
            break
        sys.stdout.write(buf)
    try:
        if os.stat(name).st_ino != curino:
            new = open(name, "r")
            current.close()
            current = new
            curino = os.fstat(current.fileno()).st_ino
            continue
    except IOError:
        pass
    time.sleep(1)
I doubt this works on Windows, but since you're speaking in terms of tail, I'm guessing that's not a problem. :)
You can do it by keeping track of where you are in the file and reopening it when you want to read. When the log file rotates, you notice that the file is smaller and since you reopen, you handle any unlinking too.
import time

cur = 0
while True:
    try:
        with open('myfile') as f:
            f.seek(0, 2)
            if f.tell() < cur:
                f.seek(0, 0)
            else:
                f.seek(cur, 0)
            for line in f:
                print line.strip()
            cur = f.tell()
    except IOError, e:
        pass
    time.sleep(1)
This example hides errors like file not found because I'm not sure of logrotate details such as small periods of time where the file is not available.
NOTE: In Python 3, things are different. A regular open translates bytes to str, and the interim buffer used for that conversion means that seek and tell don't operate properly (except when seeking to 0 or to the end of file). Instead, open in binary mode ("rb") and do the decode manually, line by line. You'll have to know the file encoding and what that encoding's newline looks like. For utf-8, it's b"\n" (one of the reasons utf-8 is superior to utf-16, btw).
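A rough sketch of that binary-mode approach (Python 3, utf-8 assumed, rotation handling omitted):
import os
import time

def tail_binary(name, encoding='utf-8'):
    # binary mode keeps seek()/tell() honest; decode each line manually
    with open(name, 'rb') as f:
        f.seek(0, os.SEEK_END)
        while True:
            line = f.readline()  # returns b'' at EOF
            if line.endswith(b'\n'):
                print(line.decode(encoding), end='')
            else:
                # partial (or no) line yet: rewind past it and retry
                f.seek(-len(line), os.SEEK_CUR)
                time.sleep(0.1)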
Thanks to @tdelaney's and @Dolda2000's answers, I ended up with what follows. It should work on both Linux and Windows, and also handle logrotate's copytruncate or create options (respectively: copy then truncate size to 0, and move then recreate the file).
file_name = 'my_log_file'
seek_end = True
while True:  # handle moved/truncated files by allowing to reopen
    with open(file_name) as f:
        if seek_end:  # reopened files must not seek end
            f.seek(0, 2)
        while True:  # line reading loop
            line = f.readline()
            if not line:
                try:
                    if f.tell() > os.path.getsize(file_name):
                        # rotation occurred (copytruncate/create)
                        f.close()
                        seek_end = False
                        break
                except FileNotFoundError:
                    # rotation occurred but new file still not created
                    pass  # wait 1 second and retry
                time.sleep(1)
            do_stuff_with(line)
A limitation of the copytruncate option is that if lines are appended to the file while the script is sleeping, and rotation occurs before it wakes up, the last lines will be "lost" (they will still be in the now "old" log file, but I cannot see a decent way to "follow" that file to finish reading it). This limitation is not relevant with the "create" option (move then recreate), because the f descriptor will still point to the renamed file, and therefore the last lines will be read before the descriptor is closed and reopened.
Using tail -F
From man tail:
-F     same as --follow=name --retry
-f, --follow[={name|descriptor}]
       output appended data as the file grows
--retry
       keep trying to open a file if it is inaccessible
The -F option will follow the name of the file, not the descriptor. So when logrotate happens, it will follow the new file.
import subprocess
from typing import Generator

def tail(filename: str) -> Generator[str, None, None]:
    proc = subprocess.Popen(["tail", "-F", filename], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    while True:
        line = proc.stdout.readline()
        if line:
            yield line.decode("utf-8")
        else:
            break

for line in tail("/config/logs/openssh/current"):
    print(line.strip())
I made a variation of @pawamoy's awesome answer above, turning it into a generator function for my log monitoring and following needs.
def tail_file(file):
    """Generator function that yields new lines in a file.

    :param file: file path as a string
    :type file: str
    :rtype: collections.Iterable
    """
    seek_end = True
    while True:  # handle moved/truncated files by allowing to reopen
        with open(file) as f:
            if seek_end:  # reopened files must not seek end
                f.seek(0, 2)
            while True:  # line reading loop
                line = f.readline()
                if not line:
                    try:
                        if f.tell() > os.path.getsize(file):
                            # rotation occurred (copytruncate/create)
                            f.close()
                            seek_end = False
                            break
                    except FileNotFoundError:
                        # rotation occurred but new file still not created
                        pass  # wait 1 second and retry
                    time.sleep(1)
                yield line
It can be used like this:
import os, time

access_logfile = '/var/log/syslog'
loglines = tail_file(access_logfile)
for line in loglines:
    print(line)

Slow python file I/O; Ruby runs better than this; Got the wrong language?

Please advise - I'm going to use this as a learning point. I'm a beginner.
I'm splitting a 25 MB file into several smaller files.
A kindly guru here gave me a Ruby script. It works beautifully fast. So, in order to learn, I mimicked it with a Python script. This runs like a three-legged cat (slow). I wonder if anyone can tell me why?
My python script
## split a file into smaller files
###########################################
import re

def splitlines(file):
    fileNo = 0001
    outFile = open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\\%s.txt" % fileNo, 'a')  # open file to append
    fh = open(file, "r")  # open the file for reading
    mylines = fh.readlines()  # read in lines
    for line in mylines:  # for each line
        if re.search("Copyright ", line):  # if the line matches the regex
            outFile.close()  # close the file
            fileNo += 1  # and add one to the filename, starting to read lines in again
        else:  # otherwise
            outFile = open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\\%s.txt" % fileNo, 'a')  # open file to append
            outFile.write(line)  # then append it to the open outFile
    fh.close()
The guru's Ruby 1.9 script
g = 0001
f = File.open(g.to_s + ".txt", "w")
open("corpus1.txt").each do |line|
  if line[/\d+ of \d+ DOCUMENTS/]
    f.close
    f = File.open(g.to_s + ".txt", "w")
    g += 1
  end
  f.print line
end
There are many reasons why your script is slow; the main one is that you reopen the output file for almost every line you write. Since the old file gets implicitly closed when you open a new one (due to Python garbage collection), the write buffer is flushed for every single line you write, which is quite expensive.
A cleaned up and corrected version of your script would be:
def file_generator():
    file_no = 1
    while True:
        f = open(r"C:\Users\dunner7\Desktop\Textomics\Media"
                 r"\LexisNexus\ele\newdocs\%s.txt" % file_no, 'a')
        yield f
        f.close()
        file_no += 1

def splitlines(filename):
    files = file_generator()
    out_file = next(files)
    with open(filename) as in_file:
        for line in in_file:
            if "Copyright " in line:
                out_file = next(files)
            out_file.write(line)
    out_file.close()
I guess the reason your script is so slow is that you open a new file descriptor for each line. If you look at your guru's ruby script, it closes and opens the output file only if your separator matches.
In contrast to that, your python script opens a new file descriptor for every line you read (and btw, does not close them). Opening a file requires talking to the kernel, so this is relatively slow.
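To see that cost for yourself, a rough micro-benchmark with timeit (the scratch file name is hypothetical):
import timeit

setup = "data = ['line %d\\n' % i for i in range(10000)]"

reopen = """
for line in data:
    out = open('out.txt', 'a')  # new file descriptor per line, like the original script
    out.write(line)
out.close()
"""

keep_open = """
out = open('out.txt', 'w')  # one descriptor for the whole run
for line in data:
    out.write(line)
out.close()
"""

print(timeit.timeit(reopen, setup, number=1))
print(timeit.timeit(keep_open, setup, number=1))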
Another change I would suggest is to change
fh = open(file, "r") ## open the file for reading
mylines = fh.readlines() ### read in lines
for line in mylines: ## for each line
to
fh = open(file, "r")
for line in fh:
With this change, you do not read the whole file into memory, but only block after block. Although it should not matter with a 25MiB file, it will hurt you with big files and is good practice (and less code ;)).
The Python code might be slow due to the regex and not IO. Try:
import re

def splitlines(file):
    fileNo = 0001
    outFile = open("newdocs/%s.txt" % fileNo, 'a')  # open file to append
    reg = re.compile("Copyright ")
    for line in open(file, "r"):
        if reg.search(line):  # a compiled pattern takes only the string to search
            outFile.close()  # close the current file
            fileNo += 1  # add one to the filename
            outFile = open("newdocs/%s.txt" % fileNo, 'a')  # open the next file to append
        outFile.write(line)  # then append the line to the open outFile
Several notes:
Always use / instead of \ in path names.
If a regex is used repeatedly, compile it.
Do you need re.search, or re.match? (See the example below.)
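To illustrate the last note: re.match only tries to match at the very start of the string, while re.search scans the whole line.
import re

pat = re.compile("Copyright ")
print(bool(pat.match("Copyright 2012 Acme")))      # True: pattern sits at position 0
print(bool(pat.match("   Copyright 2012 Acme")))   # False: match never scans forward
print(bool(pat.search("   Copyright 2012 Acme")))  # True: search scans the line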
UPDATE:
@Ed S.: point taken
@Winston Ewert: code updated to be closer to the original Ruby code
rosser,
Don't use names of built-in objects as identifiers in your code (file, splitlines).
The following code matches the behaviour of your own code: each out_file is written without the line containing 'Copyright ', which is the signal for closing it.
The function writelines() is used because it is faster than repeated out_file.write(line) calls.
The if li: block triggers a final write of out_file in case the last line of the read file doesn't contain 'Copyright '.
def splitfile(filename, wordstop, destrep, file_no=1):
    li = []
    with open(filename) as in_file:
        for line in in_file:
            if wordstop in line:
                with open(destrep + str(file_no) + '.txt', 'w') as f:
                    f.writelines(li)
                file_no += 1
                li = []
            else:
                li.append(line)
    if li:
        with open(destrep + str(file_no) + '.txt', 'w') as f:
            f.writelines(li)
