I'm having trouble dealing with large text files (about 1 GB) when I want to read them and use them in while loops.
More specifically: first I do some parsing on the lines of the file in order to find, e.g., all lines that start with "x". As I go, I add the indices of the matching lines to a list (say l). This is the pre-processing part.
Now, in a while loop, I choose a random index from l and want to read its corresponding line (or, say, the 5 lines around it). So I need to keep the file available throughout the while loop, as a priori I do not know which lines I will end up reading (they are randomly picked from l).
The problem is that when I open the file before my main loop, the reading succeeds during the first run of the loop, but from the second run on the file seems to have vanished. What I have tried:
The preprocess part:
with open(filename) as f:
    for i, line in enumerate(f):
        prep = ''.join(c for c in line if c.isalnum() or c.isspace())
        if 'x' in prep:
            l.append(i)
Now I have my l list. Loading the file before the main loop:
with open(filename, 'r') as f:
    while (some condition):
        random_index = random.sample(range(0, len(l)), 1)[0]
        output_file = open("out", "w") # I will write the read line(s) here
        for i, line in enumerate(f):
            # (the lines to be read, starting from the given random index)
            if (i >= l[random_index]) and (i < l[random_index + 1]):
                output_file.write(line)
        output_file.close()
Things work properly only during the first run of the loop.
Alternatively I also tried:
f = open(filename)
while (some condition):
    random_index = ... # rest is the same as above
Same issue: only the first run works. One thing that did work was putting f = open(filename) inside the loop, so the file is reopened on every run. But since it is a large file, this is not a practical solution.
What am I doing wrong here?
How should such readings be done properly?
What am I doing wrong here?
This answer addresses the same problem: you can't read a file twice.
You open the file f outside of the while loop and read it completely, via for i, line in enumerate(f):, during the first iteration of the while loop. During the second iteration you can't read it again, since it has already been consumed.
How should such readings be done properly?
As noted in the linked answer:
To answer your question directly, once a file has been read, with read() you can use seek(0) to return the read cursor to the start of the file (docs are here).
That means that to solve your problem you can add f.seek(0) at the end of the while loop to move the pointer back to the start of the file after each iteration. That way you can reread the file from the start on every run.
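A minimal sketch of the repaired loop; the file name, the index list, and the loop bound are placeholders standing in for the question's setup:
import random

filename = "bigfile.txt" # placeholder for the real input file
l = [0, 10, 42]          # line indices collected in the pre-processing step

with open(filename, 'r') as f, open("out", "w") as output_file:
    for _ in range(3):                 # stands in for "while (some condition)"
        start = random.choice(l)
        for i, line in enumerate(f):
            if start <= i < start + 5: # the chosen line plus the four after it
                output_file.write(line)
        f.seek(0)                      # rewind so the next pass can reread the file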
Related
Missing first line while reading from a file and transferring lines to a new one; wondering why. I found a workaround by adding
x = flower1.readline()
y = flower2.readline()
and writing them to the file, but I still did not understand why the first lines are missing in the first version of my code:
flower1 = open('/Users/berketurer/Desktop/week5_flower1.txt','r')
flower2 = open('/Users/berketurer/Desktop/week5_flower2.txt', 'r')
newFile = open('/Users/berketurer/Desktop/FlowerCombined.txt','w')
for line in flower1:
    lines = flower1.read()
    newFile.write(lines)
newFile.write('\n')
for line in flower2:
    lines = flower2.read()
    newFile.write(lines)
flower1.close()
flower2.close()
newFile.close()
That loop makes no sense:
for line in flower1:
    lines = flower1.read()
    newFile.write(lines)
This iterates over the lines of the file, and inside the loop you read the whole file!
The first line is missing because Python has already consumed it during the first iteration of for line in flower1. The read() then returns the rest of the file, which gets written, and the loop ends because the end of the file has been reached... Note that line is never used in the code. Usually not a good sign.
Just remove the loop:
lines = flower1.read()
newFile.write(lines)
If you want to process line by line for some reason (may come in handy if you want to filter out some lines), keep the loop, but don't read inside it:
for line in flower1:
    newFile.write(line)
There's also this one-liner: newFile.writelines(flower1) if you don't need filtering.
There are better ways to concatenate arbitrary files (binary or text), though. I like the version that uses shutil.copyfileobj: even if the files are huge, the memory footprint stays moderate, and since the block size is larger than one line, the speed may be slightly better too.
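For reference, a sketch of that approach, using the file names from the question above:
import shutil

# Concatenate the files block by block; copyfileobj keeps the memory
# footprint small even for very large inputs.
with open('FlowerCombined.txt', 'wb') as out:
    for name in ('week5_flower1.txt', 'week5_flower2.txt'):
        with open(name, 'rb') as src:
            shutil.copyfileobj(src, out)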
I am trying to loop through a text file and apply some logic, but I am not able to. Currently I have a text file structured like this:
--- section1 ---
"a","b","c"
"d","e","f"
--- section2 ---
"1","2","3"
"4","5","6"
--- section3 ---
"12","12","12"
"11","11","11"
I am trying to filter out the first line, which contains '---', and convert the lines below into JSON until the next '---' line appears in the document. However, I get this error:
fields1 = next(file).split(',')
StopIteration
with open(fileName, 'r') as file:
    for line in file:
        if line.startswith('-') and 'section1' in line:
            while '---' not in next(file):
                fields1 = next(file).split(',')
                for x in range(0, len(fields1)):
                    testarr.append({
                        config.get('test', 'test'): fields1[x]
                    })
                with open(test_dir, 'w') as test_file:
                    json.dump(testarr, test_file)
Any idea why my code is not working, or how I can solve the error?
The cause of your error is that you are misusing the file object's iterator by calling next on it twice as often as you think. Each call to next gets a line and returns it. Therefore, while '---' not in next(file): fields1 = next(file).split(',') gets a line, checks it for ---, then gets another line and tries to parse it. This means you can skip right over a line containing --- by having it come up in the second next. In that case you reach the end of the file before finding the line you are looking for, and StopIteration is how iterators normally indicate that their input has been exhausted.
There are a couple of other issues you may want to address in your code:
Using next on a generator like a file when you are already inside a for loop may cause undefined behavior. You may be getting away with it this time, but it is not good practice in general. The main reason you are getting away with it, by the way, is possibly that you never actually return control to the for loop once the while is triggered, and not that files are particularly permissive in this regard.
The inner with that dumps your data to a file is inside your while loop. That means that the file you open with 'w' permissions will get truncated for every iteration of the while (i.e., each line in the file). As the array grows, the output will actually appear fine, but you probably want to move that out of the inner loop.
The simplest solution would be to rewrite the code in two loops: one to find the start of the part you care about, and the other to process it until the end is found.
Something like this:
testarr = []
with open(fileName, 'r') as file:
    for line in file:
        if line.startswith('---') and 'section1' in line:
            break
    for line in file:
        if '---' in line:
            break
        fields1 = line.split(',')
        for item in fields1:
            testarr.append({config.get('test', 'test'): item})
with open(test_dir, 'w') as test_file:
    json.dump(testarr, test_file)
EDIT:
Taking @tripleee's advice, I have removed the regex check for the start line. While regex gives great precision and flexibility for finding a specific pattern, it is really overkill for this example. I would like to point out that if you are looking for a section other than section1, or if section1 appears after other lines with dashes, you will absolutely need this two-loop approach; the one-loop solutions in the other answers will not work in a non-trivial case. A parametrized sketch follows.
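For illustration, the same two-loop pattern wrapped in a function and parametrized by section name (a sketch; the quote-stripping is an assumption about the desired output):
def read_section(path, wanted):
    # Collect the rows of one "--- name ---" section as lists of strings.
    rows = []
    with open(path) as f:
        for line in f: # loop 1: find the wanted section header
            if line.startswith('---') and wanted in line:
                break
        for line in f: # loop 2: read rows until the next header
            if '---' in line:
                break
            rows.append([x.strip('"') for x in line.rstrip('\n').split(',')])
    return rows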
Looks like you are overcomplicating matters massively. The next inside the inner while loop is, I imagine, tripping up the outer for loop; but it's unnecessary anyway. You are already looping over lines; pick the ones you want, then quit when you're done.
with open(fileName, 'r') as inputfile:
    for line in inputfile:
        if line.startswith('-') and 'section1' in line:
            continue
        elif line.startswith('-'):
            break
        else:
            testarr.append({config.get('test', 'test'): x
                for x in line.split(',')})
with open(test_dir, 'w') as test_file:
    json.dump(testarr, test_file)
I hope I got the append right, as I wanted to also show you how to map the split fields more elegantly, but I'm not sure I completely understand what your original code did. (I'm guessing you'll want to trim the \n off the end of the line before splitting it, actually. Also, I imagine you want to trim the quotes from around each value. x.strip('"') for x in line.rstrip('\n').split(','))
I also renamed file to inputfile to avoid shadowing the built-in file.
If you want to write more files, basically, add more states to the loop and move the write snippet back inside it. I don't particularly want to explain how this is equivalent to a state machine, but it should not be hard to understand: with two states, you are either skipping or collecting; to extend this, add one more state for the boundary when flipping back, where you write out the collected data and reinitialize the collected lines. A sketch of that extension follows.
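Here is a hedged sketch of that three-state version; the input path and the per-section output files are placeholders, and the quote-stripping is an assumption:
import json

fileName = 'input.txt' # placeholder path
collected = None       # None = skipping, a list = collecting
section_name = None

with open(fileName, 'r') as inputfile:
    for line in inputfile:
        if line.startswith('-'):
            if collected is not None:
                # boundary state: flush the section we just finished
                with open(section_name + '.json', 'w') as out:
                    json.dump(collected, out)
            section_name = line.strip('- \n')
            collected = []
        elif collected is not None:
            collected.append([x.strip('"') for x in line.rstrip('\n').split(',')])
    if collected is not None:
        # end of file: flush the last section
        with open(section_name + '.json', 'w') as out:
            json.dump(collected, out)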
next() raises a StopIteration exception when the iterator is exhausted. In other words, your code gets to the end of the file, and you call next() again, and there's nothing more for it to return, so it raises that exception.
As for how to solve your problem, I think this might be what you want:
with open(fileName, 'r') as file:
    for line in file:
        if line.startswith('---'):
            if 'section1' in line:
                continue
            else:
                break
        fields1 = line.split(',')
        for x in range(len(fields1)):
            testarr.append({
                config.get('test', 'test'): fields1[x]
            })
with open(test_dir, 'w') as test_file:
    json.dump(testarr, test_file)
So I seem to be doing something incredibly dumb and I can't figure it out. I am trying to create a script that will search a file for terms defined in another file. This seems pretty basic to me, but for some reason the outer loop's iteration appears empty inside the inner loop.
if __name__ == "__main__":
    searchfile = open(sys.argv[1], "r")
    terms = open(sys.argv[2], "r")
    for line in searchfile:
        for term in terms:
            if re.match(term, line.rstrip()):
                print line
If I print line before the term loop it has the information. If I print line inside the term loop, it doesn't. What am I missing?
The issue here is that files are iterators that get exhausted - this means that once they have been iterated over once, they will not restart from the beginning.
You are probably used to lists - iterables that return a new iterator each time you loop over them, from the beginning.
Files are single-use iterables - once you loop over them, they are exhausted.
You can either use list() to construct a list you can iterate over multiple times, or open the file inside the loop, so that it is reopened each time, creating a new iterator from the beginning.
Which option is best will vary depending on the use case. Opening the file and reading from disk will be slower, but making a list will require all the data being held in memory - if your file is extremely large, this may be a problem.
It's also worth noting that you should use the with statement when opening files in Python.
with open(sys.argv[1], "r") as searchfile, open(sys.argv[2], "r") as terms:
    terms = list(terms)
    for line in searchfile:
        for term in terms:
            if re.match(term, line.rstrip()):
                print line
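The reopen-inside-the-loop alternative might look like this (a sketch; slower, since the terms file is reread from disk for every line, but it never holds all the terms in memory):
import re
import sys

with open(sys.argv[1], "r") as searchfile:
    for line in searchfile:
        # reopening creates a fresh iterator over the terms on every pass
        with open(sys.argv[2], "r") as terms:
            for term in terms:
                if re.match(term.rstrip(), line.rstrip()):
                    print(line)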
Here is what happens: in the first iteration of the outer loop you read the first line of searchfile and compare it with every line in terms, reading the file terms in full. After that, terms has been read completely, so in every subsequent iteration of the searchfile loop the terms loop isn't executed any more (terms is 'empty').
I'm trying to write a Python script that uses a particular external application belonging to the company I work for. I can generally figure things out for myself when it comes to programming and scripting, but this time I am truly lost!
I can't seem to figure out why the while loop won't function as it is meant to. It doesn't give any errors, which doesn't help me. It just seems to skip past the important part of the code in the centre of the loop, and then goes on to increment the "count" like it should afterwards!
f = open('C:/tmp/tmp1.txt', 'w') # Create a temporary text file
f.write("TEXTFILE\nTEXTFILE\nTEXTFILE\nTEXTFILE\nTEXTFILE\nTEXTFILE\n") # Put some simple text in there
f.close() # Close the file
count = 0 # Insert the line number from the text file you want to begin with (first line starts with 0)
num_lines = sum(1 for line1 in open('C:/tmp/tmp1.txt')) # Get the number of lines in the text file
f = open('C:/tmp/tmp2.txt', 'w') # Create a new text file
f.close() # Close it
while (count < num_lines): # Keep the loop within the starting line and total number of lines from the first text file
    with open('C:/tmp/tmp1.txt', 'r') as f: # Open the first text file
        line2 = f.readlines() # Read these lines for later input
        for line2[count] in f: # For each line from the chosen starting line until the last line of the first text file,...
            with open('C:/tmp/tmp2.txt', 'a') as g: # ...with the second text file open for appending,...
                g.write("hello\n") # ...write 'hello\n' each time while "count" < "num_lines"
    count = count + 1 # Increment the "count"
I think everything works up until: "for line2[count] in f:"
The real code I'm working on is somewhat more complicated, and the application I'm using isn't exactly for sharing, so I have simplified the code to give silly outputs instead just to fix the problem.
I'm not looking for alternative code, I'm just looking for a reason why the loop isn't working so I can try to fix it myself.
All answers will be appreciated, and thanking everyone in advance!
Cormac
Some comments:
num_lines = sum(1 for line1 in open('C:/tmp/tmp1.txt'))
Why? What's wrong with len(open(filename, 'rb').readlines())?
while (count < num_lines):
    ...
    count = count + 1
This is bad style; you could use:
for i in range(num_lines):
    ...
Note that I named your index i, which is universally recognized, and that I used range and a for loop.
Now, your problem, like I said in the comment, is that f is a file (that is, a stream of bytes with a location pointer) and you've already read all the lines from it. So when you do for line2[count] in f:, it will try reading a line into line2[count] (this is a bit weird, actually; you almost never use a for loop with a list member as a loop variable, but apparently you can do that), see that there's no line to read, and never execute what's inside the loop.
Anyway, you want to read a file, line by line, starting from a given line number? Here's a better way to do that:
from itertools import islice
start_line = 0 # change this
filename = "foobar" # also this
with open(filename, 'rb') as f:
    for line in islice(f, start_line, None):
        print(line)
I realize you don't want alternative code, but your code really is needlessly complicated.
If you want to iterate over the lines in the file f, I suggest replacing your "for" line with
for line in line2:
    # do something with "line"...
You put the lines in an array called line2, so use that array! Using line2[count] as a loop variable doesn't make sense to me.
You seem to have misunderstood how the for line in f loop works. It iterates over the file, calling readline, until there are no lines left to read. But by the time you start the loop, all the lines have already been read (via f.readlines()) and the file's position is at the end. You could achieve what you want by calling f.seek(0), but that doesn't seem like a good decision anyway, since you would be reading the file again, and that's slow I/O.
Instead you want to do something like:
for line in line2[count:]: # iterate over the lines already read, starting with line `count`
    do_smth_with(line)
I have an application that reads lines from a file and runs its magic on each line as it is read. Once the line is read and properly processed, I would like to delete the line from the file. A backup of the removed line is already being kept. I would like to do something like
file = open('myfile.txt', 'rw+')
for line in file:
    processLine(line)
    file.truncate(line)
This seems like a simple problem, but I would like to do it right rather than with a whole lot of complicated seek() and tell() calls.
Maybe all I really want to do is remove a particular line from a file.
After spending far too long on this problem, I decided that everyone was probably right and this is just not a good way to do things. It just seemed like such an elegant solution. What I was looking for was something akin to a FIFO that would let me pop lines out of a file.
Remove all lines after you're done with them:
with open('myfile.txt', 'r+') as file:
    for line in file:
        processLine(line)
    file.truncate(0)
Remove each line independently:
lines = open('myfile.txt').readlines()
for line in lines[::-1]: # process lines in reverse order
    processLine(line)
    del lines[-1] # remove the [last] line
open('myfile.txt', 'w').writelines(lines)
You can leave only those lines that cause exceptions:
import fileinput, sys
for line in fileinput.input(['myfile.txt'], inplace=1):
    try:
        processLine(line)
    except Exception:
        sys.stdout.write(line) # it prints to 'myfile.txt'
In general, as other people have already said, what you are trying to do is a bad idea.
You can't. It is just not possible with actual text file implementations on current filesystems.
Text files are sequential, because the lines in a text file can be of any length.
Deleting a particular line would mean rewriting the entire file from that point on.
Suppose you have a file with the following four lines:
'line1\nline2reallybig\nline3\nlast line'
To delete the second line you'd have to move the third and fourth lines' positions in the disk. The only way would be to store the third and fourth lines somewhere, truncate the file on the second line, and rewrite the missing lines.
If you know the size of every line in the text file, you can truncate the file in any position using .truncate(line_size * line_number) but even then you'd have to rewrite everything after the line.
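To make the cost concrete, here is a minimal sketch of what deleting line n actually entails; the whole file is read and then rewritten:
def delete_line(filename, n):
    # read everything, then rewrite the file without line n
    with open(filename, 'r') as f:
        lines = f.readlines()
    with open(filename, 'w') as f:
        f.writelines(lines[:n] + lines[n + 1:])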
You're better off keeping an index into the file so that you can start where you stopped last, without destroying part of the file. Something like this would work:
try:
    for index, line in enumerate(file):
        processLine(line)
except:
    # Failed, start from this line number next time.
    print(index)
    raise
Truncating the file as you read it seems a bit extreme. What if your script has a bug that doesn't cause an error? In that case you'll want to restart at the beginning of your file.
How about having your script print the line number it breaks on and having it take a line number as a parameter so you can tell it which line to start processing from?
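Something along these lines (a sketch; processLine is a stand-in for the real processing, and the optional line-number argument is hypothetical):
import sys

def processLine(line):
    pass # stand-in for the real per-line work

start = int(sys.argv[2]) if len(sys.argv) > 2 else 0
with open(sys.argv[1]) as f:
    for index, line in enumerate(f):
        if index < start:
            continue # skip lines handled on a previous run
        try:
            processLine(line)
        except Exception:
            print("failed on line %d; restart from here" % index)
            raise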
First of all, calling the operation truncate is probably not the best pick. If I understand the problem correctly, you want to delete everything up to the current position in the file. (I would expect truncate to cut everything from the current position up to the end of the file. This is how the standard Python truncate method works, at least if I Googled correctly.)
Second, I am not sure it is wise to modify the file while iterating over it using the for loop. Wouldn't it be better to save the number of lines processed and delete them after the main loop has finished, exception or not? The file iterator supports in-place filtering, which means it should be fairly simple to drop the processed lines afterwards.
P.S. I don’t know Python, take this with a grain of salt.
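In Python, that deferred deletion could be sketched like this (processLine is again a stand-in for the real work):
def processLine(line):
    pass # stand-in for the real per-line work

processed = 0
try:
    with open('myfile.txt') as f:
        for line in f:
            processLine(line)
            processed += 1
finally:
    # drop the lines that were handled, whether or not an exception hit
    with open('myfile.txt') as f:
        remaining = f.readlines()[processed:]
    with open('myfile.txt', 'w') as f:
        f.writelines(remaining)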
A related post has what seems a good strategy to do that, see
How can I run the first process from a list of processes stored in a file and immediately delete the first line as if the file was a queue and I called "pop"?
I have used it as follows:
import os
tasklist_file = open(tasklist_filename, 'rw')
first_line = tasklist_file.readline()
temp = os.system("sed -i -e '1d' " + tasklist_filename) # remove first line from task file
I'm not sure it works on Windows.
Tried it on a mac and it did do the trick.
This is what I use for file-based queues. It returns the first line and rewrites the file with the rest. When it's done it returns None:
def pop_a_text_line(filename):
    with open(filename, 'r') as f:
        S = f.readlines()
    if len(S) > 0:
        pop = S[0]
        with open(filename, 'w') as f:
            f.writelines(S[1:])
    else:
        pop = None
    return pop
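A hypothetical usage, draining a queue file that holds one task per line:
task = pop_a_text_line('tasks.txt')
while task is not None:
    print('running:', task.strip()) # stand-in for actually running the task
    task = pop_a_text_line('tasks.txt')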