Is it possible to parse a file line by line, and edit a line in-place while going through the lines?
It can be simulated using a backup file as stdlib's fileinput module does.
Here's an example script that removes lines that do not satisfy some_condition from files given on the command line or stdin:
#!/usr/bin/env python
# grep_some_condition.py
import fileinput
for line in fileinput.input(inplace=True, backup='.bak'):
    if some_condition(line):
        print line,  # this goes to the current file
Example:
$ python grep_some_condition.py first_file.txt second_file.txt
On completion first_file.txt and second_file.txt files will contain only lines that satisfy some_condition() predicate.
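The example above is Python 2 (note the print statement). A Python 3 version of the same script, with a hypothetical some_condition() filled in purely for illustration:

#!/usr/bin/env python3
# grep_some_condition.py (Python 3)
import fileinput

def some_condition(line):
    return 'keep' in line   # hypothetical predicate, for illustration only

for line in fileinput.input(inplace=True, backup='.bak'):
    if some_condition(line):
        print(line, end='')  # this goes to the current file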
The fileinput module has a very ugly API; I found a much nicer module for this task: in_place. Example for Python 3:
import in_place

with in_place.InPlace('data.txt') as file:
    for line in file:
        line = line.replace('test', 'testZ')
        file.write(line)
The main differences from fileinput:
Instead of hijacking sys.stdout, a new filehandle is returned for writing.
The filehandle supports all of the standard I/O methods, not just readline().
Important Notes:
This solution deletes every line in the file that you do not re-write with file.write().
Also, if the process is interrupted, you lose any line in the file that has not already been re-written.
No. You cannot safely write to a file you are also reading, as any changes you make could overwrite content you have not read yet. To do it safely you'd have to read the file into a buffer, update any lines as required, and then re-write the file.
If you're replacing the content byte-for-byte (i.e. if the text you are replacing is exactly the same length as the new string), then you can get away with it, but it's a hornet's nest, so save yourself the hassle: read the full file, replace the content in memory (or via a temporary file), and write it out again.
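A minimal sketch of that safe pattern, using a temporary file and an atomic rename (the file name and the replacement itself are hypothetical):

import os
import tempfile

path = 'data.txt'
with open(path) as src, tempfile.NamedTemporaryFile(
        'w', dir=os.path.dirname(path) or '.', delete=False) as dst:
    for line in src:
        dst.write(line.replace('old', 'new'))   # hypothetical edit
os.replace(dst.name, path)   # atomically swap in the new version (Python 3.3+)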
If you only intend to perform localized changes that do not change the length of the part of the file that is modified (e.g. changing all characters to lower case), then you can actually overwrite the old contents of the file dynamically.
To do that, you can use random file access with the seek() method of a file object.
Alternatively, you may be able to use an mmap object to treat the whole file as a mutable string. Keep in mind that mmap objects may impose a maximum file-size limit in the 2-4 GB range on a 32-bit CPU, depending on your operating system and its configuration.
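For instance, a sketch of a same-length replacement through mmap in Python 3 (the file name and byte strings are hypothetical):

import mmap

with open('data.bin', 'r+b') as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        idx = mm.find(b'old!')
        if idx != -1:
            mm[idx:idx + len(b'new!')] = b'new!'  # replacement must be exactly the same length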
You have to back up by the size of the line in bytes. Assuming you used readline(), you can take the length of the line just read and back up using:
file.seek(offset[, whence])
Set whence to SEEK_CUR, set offset to -length.
See Python Docs or look at the manpage for seek.
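A sketch in binary mode (Python 3 text-mode files refuse relative seeks with a nonzero offset, so "r+b" is used here; the file name is hypothetical):

import os

with open('data.txt', 'r+b') as f:
    line = f.readline()
    f.seek(-len(line), os.SEEK_CUR)            # back up to the start of the line
    f.write(b'X' * (len(line) - 1) + b'\n')    # same-length replacement; assumes the line ends in '\n'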
Related
I've installed pypiwin32 already, so I can use the win32file module, but I don't have much experience with Python.
How would I change my code below that opens a couple of files (I'm not worried about locking the first one), reads a line then replaces/writes a part of a line in the second file? I don't want the second file to get locked while it's open/writing, hence utilizing Win32 API.
with open("C:\\Temp\\Fileorg.txt", "rt") as fin:
with open("C:\\Temp\\File2.txt", "wt") as fout:
for line in fin:
fout.write(line.replace('part/list.txt', 'part/list.txt?id='+text))
The newline symbol is just the character '\n'. If the file is "line1\nline2" and it is changed to "line1X\nline2", then everything after the '\n' has to be re-written, and the output file has to be locked while that happens.
In this scenario, it is possible to use file sharing to read the characters before X. But reading the characters at and after X in the right sequence is a big challenge.
The best option is to use a database.
A second option is to write to a temporary file, "temp.tmp". Once the operation is finished, copy the whole file from "temp.tmp" to "File2.txt"; copying files is fast. The reader can check whether the file is available; if it is not, it should wait 1 second for CopyFile to finish and try again, up to 5 times (I made up the numbers 1 and 5).
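A sketch of that second option, reusing the code from the question (the text variable comes from the question's surrounding code, so a placeholder is used here):

import shutil

text = '42'  # hypothetical value; comes from the question's surrounding code

with open("C:\\Temp\\Fileorg.txt", "rt") as fin:
    with open("C:\\Temp\\temp.tmp", "wt") as fout:
        for line in fin:
            fout.write(line.replace('part/list.txt', 'part/list.txt?id=' + text))

shutil.copyfile("C:\\Temp\\temp.tmp", "C:\\Temp\\File2.txt")  # fast whole-file copy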
A third option is to use formatted text. For example, the output file is
0line1####
0line2####
0line3####
You can modify this data only by changing the # characters. The 0 at the start of each line indicates that line is not busy. If the writer is updating the line, it changes 0 to 1, and back.
This way the file size doesn't change when the data changes, so you can use file sharing and find data at a predictable offset. You can add data, but deleting data would be harder. This would be a big project.
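A sketch of how the writer could update one record in place under this hypothetical layout (the record length and field width are made up to match the example above):

RECORD_LEN = 11  # '0' flag + 9 data characters padded with '#' + '\n'

def update_record(path, index, data):
    # Overwrite record number `index` without changing the file size.
    field = data.ljust(9, '#')[:9]          # pad/trim to the fixed width
    with open(path, 'r+b') as f:
        f.seek(index * RECORD_LEN)
        f.write(b'1')                       # mark the record busy
        f.write(field.encode('ascii'))
        f.seek(index * RECORD_LEN)
        f.write(b'0')                       # mark it free again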
I have a Python script that is supposed to read a file. The issue is that the file is very large, so for efficiency I decided that my script should only read from line 650000 onward, since the previous lines do not contain relevant information.
Is there any way to only modify lines 650000 till EOF, so that, for example, if I read() this file only those specific lines would appear?
Files are not line-oriented, they are blocks of bytes.
There's no way, short of reading the data in, to figure out how many bytes make up those first 650,000 lines, so you'd have to do that just in order to skip them.
Modifying a file starting at a certain offset is possible, but that offset will be in bytes, which is the addressing unit used by files.
Skipping lines can be done easily enough:
with open("myfile.txt", "w+t") as f:
for i in xrange(650000):
f.readline() # Read a line and throw it away
f.write("hello")
This will truncate the file so that there will be no data after the hello (but 650,000 lines before it, of course).
cPickle.dump(object,file) always dumps at the end of the file. Is there a way to dump at specific position in the file? I expected the following snippet to work
file = open("test","ab")
file.seek(50,0)
cPickle.dump(object, file)
file.close()
However, the above snippet dumps the object at the end of the file (assume file already contains 1000 chars), no matter where I seek the file pointer to.
I think this may be more of a problem with how you open the file than with cPickle.
The "ab" mode passes the O_APPEND flag to the low-level open syscall, which forces every write to the end of the file regardless of any seek() you do beforehand. That is exactly the behaviour you are seeing. If you want to overwrite data at an arbitrary offset without truncating, open the file in "r+b" mode instead.
If this doesn't solve your problem and your objects are not very large, you can still use dumps:
file = open("test", "r+b")
file.seek(50, 0)
dumped = cPickle.dumps(object)
file.write(dumped)
file.close()
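To read the object back later, seek to the same offset and let load() pick up from there (a sketch, assuming offset 50 still marks the start of the pickle):

file = open("test", "rb")
file.seek(50, 0)
restored = cPickle.load(file)  # load() reads from the current file position
file.close()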
Let's say I have a file /etc/conf1
it's contents are along the lines of
option = banana
name = monkey
operation = eat
and let's say I want to replace "monkey" with "ostrich". How can I do that without reading the file to memory, altering it and then just writing it all back? Basically, how can I modify the file "in place"?
You can't. "ostrich" is one letter longer than "monkey", so you'll have to rewrite the file at least from that point onwards. File systems do not support "shifting" file contents upwards or downwards.
If it's just a small file, there's no reason to bother with even this, and you might as well rewrite the whole file.
If it's a really large file, you'll need to reconsider the internal design of the file's contents, for example, with a block-based approach.
You should look at the fileinput module:
http://docs.python.org/library/fileinput.html
There's an option to perform inplace editing via the input method:
http://docs.python.org/library/fileinput.html#fileinput.input
UPDATE - example code:
import fileinput
import re
import sys

for line in fileinput.input(inplace=True):
    sys.stdout.write(re.sub(r'monkey', 'ostrich', line))
Using sys.stdout.write so as not to add any extra newlines in.
It depends on what you mean by "in place". How can you do it if you want to replace monkey with supercalifragilisticexpialidocious? Do you want to overwrite the remaining file? If not, you are going to have to read ahead and shift subsequent contents of the file forwards.
CPU instructions operate on data which come from memory.
The portion of the file you wish to read must be resident in memory before you can read it; before you write anything to disk, that information must be in memory.
The whole file doesn't have to be there at once, but to do a search-replace on an entire file, every character of the file will pass through RAM at some point.
What you're probably looking for is something like the mmap() system call. The above fileinput module sounds like a plausible thing to use.
In-place modifications are only easy if you don't alter the size of the file or only append to it. The following example replaces the first byte of the file by an "a" character:
import os

fd = os.open("...", os.O_WRONLY | os.O_CREAT)
os.write(fd, b"a")  # a bytes literal works on both Python 2 and 3
os.close(fd)
Note that Python's file objects don't support this; you have to use the low-level functions. For appending, open the file with the open() function in "a" mode.
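For instance, appending a line (the file name is hypothetical):

with open("data.log", "a") as f:
    f.write("appended line\n")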
sed -i.bak 's/monkey/ostrich/' file
I have a file in CSV format where the delimiter is the ASCII unit separator ^_ and the line terminator is the ASCII record separator ^^ (obviously, since these are nonprinting characters, I've just used one of the standard ways of writing them here). I've written plenty of code that reads and writes CSV files, so my issue isn't with Python's csv module per se. The problem is that the csv module doesn't support reading (but it does support writing) line terminators other than a carriage return or line feed, at least as of Python 2.6 where I just tested it. The documentation says that this is because it's hard coded, which I take to mean it's done in the C code that underlies the module, since I didn't see anything in the csv.py file that I could change.
Does anyone know a way around this limitation (patch, another CSV module, etc.)? I really need to read in a file where I can't use carriage returns or new lines as the line terminator because those characters will appear in some of the fields, and I'd like to avoid writing my own custom reader code if possible, even though that would be rather simple to meet my needs.
Why not supply a custom iterable to the csv.reader function? Here is a naive implementation which reads the entire contents of the CSV file into memory at once (which may or may not be desirable, depending on the size of the file):
import csv

def records(path):
    with open(path) as f:
        contents = f.read()
    return (record for record in contents.split('\x1e'))  # '\x1e' is the ASCII record separator (the ^^ from the question)

csv.reader(records('input.csv'), delimiter='\x1f')  # '\x1f' is the ASCII unit separator (^_)
I think that should work.
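If reading the whole file into memory is a problem, a streaming variant is possible. Here is a hypothetical sketch that yields records chunk by chunk:

def records_stream(path, chunk_size=8192):
    # Yield records terminated by the ASCII record separator ('\x1e')
    # without loading the whole file at once.
    buf = ''
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            parts = buf.split('\x1e')
            buf = parts.pop()      # keep the trailing partial record
            for record in parts:
                yield record
    if buf:
        yield buf                  # the last record may lack a terminator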