How to skip the header while processing a .txt file? - python

In Think Python by Allen Downey, exercise 13-2 asks you to process any .txt file from gutenberg.org and skip the header information, which ends with something like "Produced by". This is the solution the author gives:
import string

def process_file(filename, skip_header):
    """Makes a dict that contains the words from a file.

    filename: string
    skip_header: boolean, whether to skip the Gutenberg header

    returns: map from each word to the number of times it appears
    """
    res = {}
    fp = open(filename)
    if skip_header:
        skip_gutenberg_header(fp)
    for line in fp:
        process_line(line, res)
    return res

def process_line(line, res):
    for word in line.split():
        word = word.lower().strip(string.punctuation)
        if word.isalpha():
            res[word] = res.get(word, 0) + 1

def skip_gutenberg_header(fp):
    """Reads from fp until it finds the line that ends the header.

    fp: open file object
    """
    for line in fp:
        if line.startswith('Produced by'):
            break
I really don't understand the flow of execution in this code. Once the code starts reading the file via skip_gutenberg_header(fp), which contains "for line in fp:", it finds the needed line and breaks. However, the next loop picks up right where the break statement left off. But why? My vision of it is that there are two independent iterations here, both containing "for line in fp:", so shouldn't the second one start from the beginning?

No, it shouldn't re-start from the beginning. An open file object maintains a file position indicator, which gets moved as you read (or write) the file. You can also move the position indicator via the file's .seek method, and query it via the .tell method.
So if you break out of a for line in fp: loop you can continue reading where you left off with another for line in fp: loop.
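For instance, a minimal sketch you can run yourself (the file name and header text here are placeholders):

with open("example.txt") as fp:
    for line in fp:
        if line.startswith("Produced by"):
            break            # stop once the header's last line is found
    print(fp.tell())         # the position indicator now sits just past that line
    for line in fp:          # this second loop resumes from there,
        print(line, end="")  # not from the beginning of the file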
BTW, this behaviour of files isn't specific to Python: all modern languages that inherit C's notion of streams and files work like this.
The .seek and .tell methods are mentioned briefly in the tutorial.
For a more in-depth treatment of file / stream handling in Python, please see the docs for the io module. There's a lot of info in that document, and some of it is mainly intended for advanced coders. You will probably need to read it several times and write a few test programs to absorb what it says, so feel free to skim through it the first time you read it... or the first few times. ;)

My vision of it is that there are two independent iterations here both containing "for line in fp:", so shouldn't the second one start from the beginning?
If fp were a list, then of course they would. However, it's not: it's a file-like object with methods like seek, tell, and read, and file-like objects keep state. When you read a line from one, the position of the read pointer in the file moves, so the next read starts one line further down.
This is commonly used to skip the header of tabular data (when you're not using a csv.reader, at least):
with open("/path/to/file") as f:
headers = next(f).strip() # first line
for line in f:
# iterate by-line for the rest of the file
...
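If the data is CSV, the same next() trick works on a csv.reader as well; a minimal sketch (the path is a placeholder):

import csv

with open("/path/to/file", newline="") as f:
    reader = csv.reader(f)
    headers = next(reader)  # consume the header row
    for row in reader:
        ...                 # iterate over the remaining rows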

Related

Remove double file contents

I previously wrote a file in Python, and ended up writing the same contents twice when I ran the script a second time.
Here is my file content:
Story1: A short story is a piece of prose fiction that typically can be read in one sitting and focuses on a self-contained incident or series of linked incidents, with the intent of evoking a "single effect" or mood, however there are many exceptions to this. A dictionary definition is "an invented prose narrative shorter than a novel usually dealing with a few characters and aiming at unity of effect and often concentrating on the creation of mood rather than plot. Story1: A short story is a piece of prose fiction that typically can be read in one sitting and focuses on a self-contained incident or series of linked incidents, with the intent of evoking a "single effect" or mood, however there are many exceptions to this. A dictionary definition is "an invented prose narrative shorter than a novel usually dealing with a few characters and aiming at unity of effect and often concentrating on the creation of mood rather than plot.
I am using Python's set like this, but it won't work for my case:
uniqlines = set(open('file.txt').readlines())
bar = open('file', 'w').writelines(set(uniqlines))
In my case, there are no newline characters, so everything is read as one line. I want to be able to delete the contents after Story1: is encountered the second time.
How do I accomplish it?
Update: Since you don't have line breaks to split up the file, you're likely better off just slurping the file, splitting appropriately, and writing a new file. A simple solution would be:
import os, tempfile

with open('file.txt') as f, \
     tempfile.NamedTemporaryFile('w', dir='.', delete=False) as tf:
    # You've got a space only before the second copy, so it's a useful partition point
    firstcopy, _, _ = f.read().partition(' Story1: ')
    # Write the first copy
    tf.write(firstcopy)
# Exiting the with block closes the temporary file, so the data is all there
# Atomically replace the original file with the rewritten temporary file
os.replace(tf.name, 'file.txt')
Technically, this isn't completely safe against actual power loss, since data might not be written to disk before the replace metadata update occurs. If you're paranoid, tweak it to explicitly block until the data is synced by adding the following two lines just before dedenting out of the with block (after the write):
    tf.flush()             # Flushes Python-level buffers to the OS
    os.fsync(tf.fileno())  # Flushes the OS kernel buffer out to disk, blocks until done
Old answer for the case where the copies begin on separate lines:
Find where the second copy begins, and truncate the file:
seen_story1 = False
with open('file.txt', 'r+') as f:
    while True:
        pos = f.tell()  # Record position before reading the next line
        line = f.readline()
        if not line:
            break  # Hit EOF
        if line.startswith('Story1:'):
            if seen_story1:
                # Seen it already, we're in duplicate territory
                f.seek(pos)   # Go back to the end of the previous line
                f.truncate()  # Truncate the file there
                break         # We're done
            else:
                seen_story1 = True  # Seeing it for the first time
Since all you're doing is removing duplicate information from the end of the file, this is safe and effective; truncate should be atomic on most OSes, so the trailing data is freed all at once, with no risk of partial write corruption or the like.
You could use the find method.
# Set the word you want to look for
myword = "Story1"

# Read the file into a variable called text
with open('file.txt', 'r+') as fin:
    text = fin.read()

# Find the word for the first time. find() returns the lowest index of the
# substring if it is found; that's why we add the length of the word.
index_first_time_found = text.find(myword) + len(myword)

# Search again, but now starting from the index of the previous result.
index_second_time_found = text.find(myword, index_first_time_found)

# Cut off everything up to the index of the second occurrence.
new_text = text[:index_second_time_found]
print(new_text)

Python: read a line and write back to that same line

I am using Python to make a template updater for HTML. I read a line and compare it with the template file to see if there are any changes that need to be updated. Then I want to write any changes (if there are any) back to the same line I just read from.
Reading the file, my file pointer is positioned on the next line after a readline(). Is there any way I can write back to the line I just read without having to open two file handles for reading and writing?
Here is a code snippet of what I want to do:
cLine = fp.readline()
if cLine != templateLine:
    # Here is where I would like to write back to the line I read from
    # in cLine
Updating lines in place in a text file - very difficult
Many questions on SO try to read a file and update it at once.
While this is technically possible, it is very difficult.
Text files are not organized on disk by lines, but by bytes.
The problem is that the number of bytes in an old line very often differs from the new one, and this messes up the resulting file.
Update by creating a new file
While it sounds inefficient, it is the most effective way from a programming point of view.
Just read from the file on one side, write to another file on the other side, close the files, and copy the content of the newly created file over the old one.
Or build the content in memory and write it over the old file after you close it.
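A minimal sketch of that approach, assuming a 0-based line number to replace (rewrite_line and its arguments are illustrative names, not from the question):

import os
import tempfile

def rewrite_line(path, lineno, new_text):
    # Stream the old file into a temporary one, swapping out a single line
    with open(path) as src, \
         tempfile.NamedTemporaryFile('w', dir=os.path.dirname(path) or '.',
                                     delete=False) as dst:
        for i, line in enumerate(src):
            dst.write(new_text + '\n' if i == lineno else line)
    # Atomically replace the original with the rewritten copy
    os.replace(dst.name, path)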
At the OS level, things are a bit different from how they look from Python. From Python, a file looks almost like a list of strings, with each string having arbitrary length, so it seems easy to swap a line for something else without affecting the rest of the lines:
l = ["Hello", "world"]
l[0] = "Good bye"
In reality, though, any file is just a stream of bytes, with strings following each other without any "padding". So you can only overwrite the data in-place if the resulting string has exactly the same length as the source string - otherwise it'll simply overwrite the following lines.
If that is the case (your processing guarantees not to change the length of strings), you can "rewind" the file to the start of the line and overwrite the line with new data. The below script converts all lines in file to uppercase in-place:
def eof(f):
    """Return True if the current position is at or past end-of-file."""
    cur_loc = f.tell()
    f.seek(0, 2)        # jump to the end of the file
    eof_loc = f.tell()
    f.seek(cur_loc, 0)  # jump back to where we were
    return cur_loc >= eof_loc

with open('testfile.txt', 'r+t') as fp:
    while True:
        last_pos = fp.tell()
        line = fp.readline()
        new_line = line.upper()
        fp.seek(last_pos)   # rewind to the start of the line just read
        fp.write(new_line)  # overwrite it with same-length text
        print("Read %s, Wrote %s" % (line, new_line))
        if eof(fp):
            break
Somewhat related: Undo a Python file readline() operation so file pointer is back in original state
This approach is only justified when your output lines are guaranteed to have the same length, and when, say, the file you're working with is really huge so you have to modify it in place.
In all other cases it would be much easier and more performant to just build the output in memory and write it back at once. Another option is to write to a temporary file, then delete the original and rename the temporary file so it replaces the original file.

Replace text in the first line of a huge tab-delimited txt file

I have a huge text file (19GB in size); it is a genetic data file with variables and observations.
The first line contains the variable names, and they are structured as follows:
id1.var1 id1.var2 id1.var3 id2.var1 id2.var2 id2.var3
I need to swap id1, id2, etc. with corresponding values that are in another text file (this file has about 7k rows); the ids are not in any particular order, and it's structured as follows:
oldId newIds
id1 rs004
id2 rs135
I have done some Google searching and could not really find a language that would allow me to do the following:
read the first line
replace the ids with the new ids
remove the first line from the original file and replace it with the new one
Is this a good approach or is there a better one?
Which is the best language to accomplish this?
We have people with experience in Python, VBScript and Perl.
The whole "replace" thing is possible in almost any language (I'm sure about Python and Perl), as long as the length of the replacement line is the same as the original, or if it can be made the same by padding with whitespace (otherwise, you'll have to rewrite the whole file).
Open the file for reading and writing (w+ mode), read the first line, prepare the new line, seek to position 0 in the file, write the new line, close the file.
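A minimal sketch of that recipe in Python (remap_ids is a hypothetical helper that swaps the ids in the header; the padding step is what keeps the length unchanged):

with open('datafile.txt', 'r+') as f:
    first = f.readline()
    new_first = remap_ids(first.rstrip('\n'))  # hypothetical id-swapping helper
    # pad with spaces so the line keeps exactly its original length
    new_first = new_first.ljust(len(first) - 1) + '\n'
    assert len(new_first) == len(first), "replacement line must not be longer"
    f.seek(0)
    f.write(new_first)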
I suggest you use the Tie::File module, which maps the lines in a text file to a Perl array and will make the rewriting of the lines after the header a simple job.
This program demonstrates. It first reads all of the old/new IDs into a hash, and then maps the data file using Tie::File. The first line of the file (in $file[0]) is modified using a substitution, and then the array is untied to rewrite and close the file.
You will need to change your file names from the ones I have used. Also beware that I have assumed that the IDs are always "word" characters (alphanumeric plus underscore) followed by a dot, and have no spaces. Of course you will want to back up your file before you modify it, and you should test the program on a smaller file before you update the real thing.
use strict;
use warnings;

use Tie::File;

my %ids;

open my $fh, '<', 'newids.txt' or die $!;
while (<$fh>) {
    my ($old, $new) = split;
    $ids{$old} = $new;
}

tie my @file, 'Tie::File', 'datafile.txt' or die $!;
$file[0] =~ s<(\w+)(?=\.)><$ids{$1} // $1>eg;
untie @file;
This should be pretty easy. I would use Python as I am a Python fan. Outline:
Read the mapping file, and save the mapping (in Python, use a dictionary).
Read the data file a line at a time, remap variable names, and output the edited line.
You really can't edit a file in-place... hmm, I guess you could if every new variable name was always exactly the same length as the old name. But for ease of programming, and safety while running, it would be best to always write a new output file and then delete the original. This means you will need at least 20 GB of free disk space before running this, but that shouldn't be a problem.
Here is a Python program that shows how to do it. I used your example data to make test files and this seems to work.
#!/usr/bin/python

import re
import sys

try:
    fname_idmap, fname_in, fname_out = sys.argv[1:]
except ValueError:
    print("Usage: remap_ids <id_map_file> <input_file> <output_file>")
    sys.exit(1)

# pattern to match an ID, only as a complete word (do not match inside another id)
# match start of line or whitespace, then match non-period until a period is seen
pat_id = re.compile(r"(^|\s)([^.]+)\.")

idmap = {}

def remap_id(m):
    before_word = m.group(1)
    word = m.group(2)
    if word in idmap:
        return before_word + idmap[word] + "."
    else:
        return m.group(0)  # return full matched string unchanged

def replace_ids(line, idmap):
    return re.sub(pat_id, remap_id, line)

with open(fname_idmap, "r") as f:
    next(f)  # discard first line with column header: "oldId newIds"
    for line in f:
        key, value = line.split()
        idmap[key] = value

with open(fname_in, "r") as f_in, open(fname_out, "w") as f_out:
    for line in f_in:
        line = replace_ids(line, idmap)
        f_out.write(line)

file pointer down then over

The Task
I am writing a program in Python that runs SAP2000 by importing a new .s2k file each time; a new file is then generated from the results of the previous run by exporting the data.
The file is about 1,500 lines containing arbitrary words and numbers. (For a better understanding, see this: http://pastebin.com/8ptYacJz, which is the file I am dealing with.)
I'm required to replace one number in the file.
That number is somewhere in the middle of line 800.
The Question
Does anyone know an efficient way to move down to the middle of line 800 in a file, in order to replace one number?
What I've Tried
Regular expressions did not work, because there can be more than one instance of the same number.
So I came up with the solution of templating the file and writing a new file each time with the number to be changed as a template parameter.
This solution does work but the person insists that I can move the file pointer down to line 800, then over to the middle of the line to replace the number.
Here is the only code I have for the problem; it takes the file pointer down to a line, but jumps back to the beginning when I try to seek over.
import sys
import os

# open the file
f = open("output.$2k")

# this will go to line 883 in the text file
count = 0
while count < 883:
    line = f.readline()
    count = count + 1

# this would seek over to the middle of the line; DOESN'T WORK
f.seek(0, 0)
line = f.readline()
print(line)
f.close()
Yes and no. Consider:
f = open('output.$2k', 'r+')
f.seek(300)
f.write('\n')
f.close()
This script just changes the 300th character in your ASCII file to a newline. Now the tricky part is that there is no way to know the length of a line in an ASCII file short of reading until you get to a newline. So, locating a particular character in the middle of the 800th line is non-trivial. However, if you can make guarantees (due to the way the file was written) about the line length, you can calculate the position without any problem. Also note that replacing 1 with 100 won't work here. You need to replace 1 character with 1 character.
And just for all the other *NIX users in the world ... please don't put $ in your filename. That's just a nightmare...
OK, I'm not a professional programmer, but my (stupid) approach would be: if it's always line 800, read the file line by line while tracking the line numbers, writing directly to a new file; read line 800, change it, write it, then write the rest. Dumb and not elegant, but it should work, unless I miss something, which I probably do. A sketch of that idea is below. And there goes my meager reputation :D
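A minimal sketch of that idea (fix_number is a hypothetical helper that edits the number in the line; the file names are placeholders):

with open("output.s2k") as src, open("output_new.s2k", "w") as dst:
    for lineno, line in enumerate(src, start=1):
        if lineno == 800:
            line = fix_number(line)  # hypothetical: replace the number here
        dst.write(line)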
No. Read in the line, manipulate it, then write it out to the new file you've previously opened for writing (and have been writing the other lines to, unmodified).
A first thing:
#this would seek over to middle of file DOESN'T WORK
f.seek(0,0)
this is not true. This seeks to the beginning of the file.
To your actual question:
Does anyone know an efficient way to move down to the middle of line 800 in a file, in order to replace one number?
In general, no. You'd need to rewrite the file. For example like this:
# open the file in read-and-update mode
with open("file", 'r+') as f:
# read all lines
lines = f.readlines()
# update 800'th line
my_line = lines[799].split()
my_line[5] = "%s" % my_number # TODO: put in index of number and updated number
lines[799] = " ".join(my_line)
# truncate and rewrite file
f.truncate(0)
f.writelines(lines)
You can do it, if the starting position of the number in the file is predictable (e.g. number_starting_pos = 1234 from the beginning of the file) and the size of the string representation is also predictable (e.g. 20).
Then you could rewrite the number and make sure you fill up the padding with whitespace again to overwrite any content of the previous entry.
Similar to this:
with open("file", 'r+') as f:
# seek to the number starting position
f.seek(number_starting_pos, 0)
# update number field, assuming width (20), arbitrary space-padding allowed
my_number_string = "%19s " % my_number
# make sure the string is indeed exactly of the specific size (it may be longer)
assert len(my_number_string) == 20, "file writing would fail! aborting!"
f.write(my_number_string)
For this to work, you'd need to have a look at the docs of your SAP-thingy and see if whitespace indeed does not matter.
However, both approaches are based on a lot of assumptions. Depending on your use case, they may easily break your code, e.g. if a line or even a single character is inserted before the number field.

How do I modify the last line of a file?

The last line of my file is:
29-dez,40,
How can I modify that line so that it reads:
29-Dez,40,90,100,50
Note: I don't want to write a new line. I want to take the same line and put new values after 29-Dez,40,
I'm new at Python. I'm having a lot of trouble manipulating files, and every example I look at seems difficult.
Unless the file is huge, you'll probably find it easier to read the entire file into a data structure (which might just be a list of lines), and then modify the data structure in memory, and finally write it back to the file.
On the other hand, maybe your file is really huge, multiple GBs at least. In which case: the last line is probably terminated with a newline character; if you seek to that position, you can overwrite it with the new text at the end of the last line.
So perhaps:
f = open("foo.file", "wb")
f.seek(-len(os.linesep), os.SEEK_END)
f.write("new text at end of last line" + os.linesep)
f.close()
(Modulo line endings on different platforms)
To expand on what Doug said, in order to read the file contents into a data structure you can use the readlines() method of the file object.
The below code sample reads the file into a list of "lines", edits the last line, then writes it back out to the file:
#!/usr/bin/python

MYFILE = "file.txt"

# read the file into a list of lines
lines = open(MYFILE, 'r').readlines()

# now edit the last line of the list of lines
new_last_line = lines[-1].rstrip() + ",90,100,50"
lines[-1] = new_last_line

# now write the modified list back out to the file
open(MYFILE, 'w').writelines(lines)
If the file is very large then this approach will not work well, because this reads all the file lines into memory each time and writes them back out to the file, which is very inefficient. For a small file however this will work fine.
Don't work with files directly; make a data structure that fits your needs in the form of a class, and give it read-from-file and write-to-file methods.
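For example, a minimal sketch of that idea (the class and method names are made up for illustration):

class LineFile:
    """Owns the file's lines and hides the read/write details."""
    def __init__(self, path):
        self.path = path
        with open(path) as f:
            self.lines = f.readlines()

    def append_to_last_line(self, extra):
        self.lines[-1] = self.lines[-1].rstrip("\n") + extra + "\n"

    def save(self):
        with open(self.path, "w") as f:
            f.writelines(self.lines)

You could then load LineFile("file.txt"), call append_to_last_line(",90,100,50"), and write everything back with save().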
I recently wrote a script to do something very similar to this. It would traverse a project, find all module dependencies and add any missing import statements. I won't clutter this post up with the entire script, but I'll show how I went about modifying my files.
import os
from mmap import mmap

def insert_import(filename, text):
    """Insert `text` (a bytes object) right after the file's leading imports."""
    if len(text) < 1:
        return
    f = open(filename, 'r+')
    m = mmap(f.fileno(), os.path.getsize(filename))
    origSize = m.size()
    # grow the mapping so there is room for the inserted text
    m.resize(origSize + len(text))
    pos = 0
    while True:
        l = m.readline()
        if l.startswith((b'import', b'from')):  # mmap yields bytes, not str
            continue
        else:
            # back up to the start of the first non-import line
            pos = m.tell() - len(l)
            break
    # shift the tail of the file down, then drop the new text into the gap
    m[pos + len(text):] = m[pos:origSize]
    m[pos:pos + len(text)] = text
    m.close()
    f.close()
Summary: This snippet takes a filename and a blob of text to insert. It finds the last import statement already present, and sticks the text in at that location.
The part I suggest paying most attention to is the use of mmap. It lets you work with files in the same manner you may work with a string. Very handy.
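For example, a hypothetical call (note that text must be a bytes object, since mmap slices are bytes):

insert_import("script.py", b"import logging\n")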
