iterative variable losing value in nested loop - python

So I seem to be doing something incredibly dumb and I can't figure it out. I am trying to create a script that will search a file for terms defined in another file. This seems pretty basic to me, but for some reason the outer loop's variable appears empty inside the inner loop.
if __name__ == "__main__":
    searchfile = open(sys.argv[1], "r")
    terms = open(sys.argv[2], "r")
    for line in searchfile:
        for term in terms:
            if re.match(term, line.rstrip()):
                print line
If I print line before the term loop it has the information. If I print line inside the term loop, it doesn't. What am I missing?

The issue here is that files are iterators that get exhausted - this means that once they have been iterated over once, they will not restart from the beginning.
You are probably used to lists - iterables that return a new iterator each time you loop over them, from the beginning.
Files are single-use iterables - once you loop over them, they are exhausted.
You can either use list() to construct a list you can iterate over multiple times, or open the file inside the loop so that it is reopened each time, creating a new iterator from the beginning (a sketch of the second option follows the fix below).
Which option is best will vary depending on the use case. Opening the file and reading from disk each time will be slower, but making a list requires holding all the data in memory; if your file is extremely large, this may be a problem.
It's also worth noting that you should use the with statement when opening files in Python.
with open(sys.argv[1], "r") as searchfile, open(sys.argv[2], "r") as terms:
    terms = list(terms)
    for line in searchfile:
        for term in terms:
            if re.match(term, line.rstrip()):
                print line
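For completeness, here is a minimal sketch of the other option mentioned above: reopening the terms file for every line instead of holding it in memory. Slower, since the file is read from disk repeatedly, but nothing is kept in memory (imports added to make the sketch self-contained):

import re
import sys

with open(sys.argv[1], "r") as searchfile:
    for line in searchfile:
        # Reopening gives a fresh iterator over terms each time
        with open(sys.argv[2], "r") as terms:
            for term in terms:
                if re.match(term, line.rstrip()):
                    print(line)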

Here is what your code is actually doing: in the first iteration of the outer loop you read the first line of searchfile and compare it with every line in terms, reading the terms file to the end as you go. After that, terms has been read completely, so in every subsequent iteration of the searchfile loop the terms loop doesn't execute at all (terms is 'empty').

Related

Error in looping through a text file in python

I am trying to loop through a text file and apply some logic but I am not able to loop through the text file. So currently I have a text file that is structured like this:
--- section1 ---
"a","b","c"
"d","e","f"
--- section2 ---
"1","2","3"
"4","5","6"
--- section3 ---
"12","12","12"
"11","11","11"
I am trying to filter out the first line, which contains '---', and convert the lines below into JSON until the next '---' line appears in the text document.
However, I get a StopIteration error on the line fields1 = next(file).split(',').
with open(fileName, 'r') as file:
    for line in file:
        if line.startswith('-') and 'section1' in line:
            while '---' not in next(file):
                fields1 = next(file).split(',')
                for x in range(0, len(fields1)):
                    testarr.append({
                        config.get('test', 'test'): fields1[x]
                    })
                with open(test_dir, 'w') as test_file:
                    json.dump(testarr, test_file)
Any idea why my code is not working or how I can solve the error?
The cause of your error is that you are misusing the file object's iterator by calling next on it twice as often as you think. Each call to next gets a line and returns it. Therefore, while '---' not in next(file): fields1 = next(file).split(',') gets a line, checks it for ---, then gets another line and tries to parse it. This means you can skip right over a line containing --- when it comes up in the second next. In that case you will reach the end of the file before finding the line you are looking for. StopIteration is how iterators normally indicate that their input has been exhausted.
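A tiny self-contained demonstration of that double-next effect, using an in-memory iterator in place of the file:

lines = iter(['"a","b"\n', '--- section2 ---\n', '"c","d"\n'])
try:
    while '---' not in next(lines):       # first next: '"a","b"'
        fields1 = next(lines).split(',')  # second next swallows the marker line
except StopIteration:
    print('ran off the end without ever seeing the marker')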
There are a couple of other issues you may want to address in your code:
Using next on a generator like a file when you are already inside a for loop may cause undefined behavior. You may be getting away with it this time, but it is not good practice in general. The main reason you are getting away with it, by the way, is possibly that you never actually return control to the for loop once the while is triggered, and not that files are particularly permissive in this regard.
The inner with that dumps your data to a file is inside your while loop. That means that the file you open with 'w' permissions will get truncated for every iteration of the while (i.e., each line in the file). As the array grows, the output will actually appear fine, but you probably want to move that out of the inner loop.
The simplest solution would be to rewrite the code in two loops: one to find the start of the part you care about, and the other to process it until the end is found.
Something like this:
testarr = []
with open(fileName, 'r') as file:
    for line in file:
        if line.startswith('---') and 'section1' in line:
            break
    for line in file:
        if '---' in line:
            break
        fields1 = line.split(',')
        for item in fields1:
            testarr.append({config.get('test', 'test'): item})
with open(test_dir, 'w') as test_file:
    json.dump(testarr, test_file)
EDIT:
Taking @tripleee's advice, I have removed the regex check for the start line. While regex gives great precision and flexibility for finding a specific pattern, it is really overkill for this example. I would like to point out that if you are looking for a section other than section1, or if section1 appears after some other lines with dashes, you will absolutely need this two-loop approach. The one-loop solutions in the other answers will not work in a non-trivial case.
Looks like you are overcomplicating matters massively. The next inside the inner while loop I imagine is tripping up the outer for loop, but that's just unnecessary anyway. You are already looping over lines; pick the ones you want, then quit when you're done.
with open(fileName, 'r') as inputfile:
    for line in inputfile:
        if line.startswith('-') and 'section1' in line:
            continue
        elif line.startswith('-'):
            break
        else:
            testarr.append({config.get('test', 'test'): x
                            for x in line.split(',')})
with open(test_dir, 'w') as test_file:
    json.dump(testarr, test_file)
I hope I got the append right, as I wanted to also show you how to map the split fields more elegantly, but I'm not sure I completely understand what your original code did. (I'm guessing you'll want to trim the \n off the end of the line before splitting it, actually. Also, I imagine you want to trim the quotes from around each value. x.strip('"') for x in line.rstrip('\n').split(','))
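As a quick sanity check of that trimming expression:

line = '"a","b","c"\n'
print([x.strip('"') for x in line.rstrip('\n').split(',')])
# prints ['a', 'b', 'c']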
I also renamed file to inputfile to avoid shadowing the built-in name file (it is not actually a reserved keyword, but clobbering built-ins is best avoided).
If you want to write more files, basically, add more states in the loop and move the write snippet back inside the loop. I don't particularly want to explain how this is equivalent to a state machine but it should not be hard to understand: with two states, you are skipping or collecting; to extend this, add one more state for the boundary when flipping back, where you write out the collected data and reinitialize the collected lines to none.
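A rough sketch of that extension, assuming (as in the question) that every '--- sectionN ---' header starts a new section, and inventing one hypothetical output file per section name (fileName is the variable from the question):

import json

def flush(section, rows):
    # Boundary state: write out whatever was collected; the caller resets
    if section and rows:
        with open(section + '.json', 'w') as out:
            json.dump(rows, out)

section, rows = None, []
with open(fileName, 'r') as inputfile:
    for line in inputfile:
        if line.startswith('-'):
            flush(section, rows)              # flipping back: write and reset
            section = line.strip('- \n')      # e.g. 'section1'
            rows = []
        else:
            rows.append([x.strip('"') for x in line.rstrip('\n').split(',')])
    flush(section, rows)                      # don't forget the final section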
next() raises a StopIteration exception when the iterator is exhausted. In other words, your code gets to the end of the file, and you call next() again, and there's nothing more for it to return, so it raises that exception.
As for how to solve your problem, I think this might be what you want:
with open(fileName, 'r') as file:
    for line in file:
        if line.startswith('---'):
            if 'section1' in line:
                continue
            else:
                break
        fields1 = line.split(',')
        for x in range(len(fields1)):
            testarr.append({
                config.get('test', 'test'): fields1[x]
            })
with open(test_dir, 'w') as test_file:
    json.dump(testarr, test_file)

Python ValueError

So I keep receiving this error:
ValueError: Mixing iteration and read methods would lose data
And 1) I don't quite understand why I'm receiving it, and 2) people with similar problems seem to be doing things with their code that are much more complex than a beginner like myself can adapt.
The idea of my code is to read a data_file.txt and convert each line into its own individual array.
so far I have this:
array = []  # declaring a list with name 'array'
with open('file.txt', 'r') as input_file:
    for line in input_file:
        line = input_file.readlines()
        array.append(line)
        print('done 1')  # for test purposes
return array
And I keep receiving an error.
"Value error: Mixing iteration and read methods would lose data "message while extracting numbers from a string from a .txt file using python
The above question seemed to be doing something similar, reading items into an array; however, that code was skipping lines and using a range to pull in certain parts of the file, which I don't need. All I need is to read in all the lines and have them made into an array.
Python: Mixing files and loops
In this question, once again, something much more complex than I can follow was being asked. From what I understood, he just wanted code that would restart after an error and continue, and the answers dealt with that part. Once again, not what I'm looking for.
The error is pretty much self-explanatory (once you know what it is about), so here goes.
You start with the loop for line in input_file:. File objects are iterable in Python. They iterate over the lines in the file. This means that for each iteration of the loop, line will contain the next line in your file.
Next you call line = input_file.readlines(). This attempts to read all of the remaining lines of the file at once, but you are already iterating over the lines in the for loop.
Files are usually read sequentially, with no going backwards, so what you end up with is a conflict. If a read method consumed lines, the loop's iterator would be forced to resume somewhere past the lines it has promised to return next, since it cannot go back. The error is telling you that the file object knows there is an active iterator and that mixing in a read call would lose data.
If you take out line = input_file.readlines(), the loop will do what you expect it to.
To make an array of the lines of the file, with one line per array element:
with open('file.txt', 'r') as input_file:
    array = input_file.readlines()
return array
since readlines will give you the whole file in one shot. Alternatively,
return list(open('file.txt','r'))
will do the same per the docs.
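If you do want to build the list a line at a time (for example, to strip the trailing newlines as you go), a minimal sketch:

array = []
with open('file.txt', 'r') as input_file:
    for line in input_file:  # iterate only; no readlines() inside the loop
        array.append(line.rstrip('\n'))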

Reading large files in a loop

I'm having some trouble dealing with large text files (about 1GB), when I want to read them and use them in while loops.
More specifically: First I start by doing some parsing on the lines of the file, in order to find e.g. all lines that start with "x". In doing so, I add the indices of the found lines to a list (say l). This is the pre-processing part.
Now, in a while loop, I choose random indices from l and want to read the corresponding line (or, say, the 5 lines around it). Thus I need the file to stay readable throughout the while loop, as a priori I do not know which lines I will end up reading (the line is randomly picked from l).
The problem is, when I call the file before my main loop, during the first run of the loop, the reading gets done successfully, but already from the second run, the file has vanished from memory. What I have tried:
The preprocess part:
for i, line in enumerate(filename):
    prep = ''.join(c for c in line if c.isalnum() or c.isspace())
    if 'x' in prep: l.append(i)
Now I have my l list. Loading the file before the main loop:
with open(filename, 'r') as f:
    while (some condition):
        random_index = random.sample(range(0, len(l)), 1)
        output_file = open("out", "w")  # I will write here the read line(s)
        for i, line in enumerate(f):
            # (the lines to be read, starting from the given random index)
            if (i >= l[random_index]) and (i < l[random_index + 1]):
                out.write(line)
        out.close()
Only during the first run of the loop, things work properly.
Alternatively I also tried:
f = open(filename)
while (some condition):
    random_index = ...  # rest is same as above
Same issue; only the first run works. One thing that worked was putting f = open(filename) inside the loop, so the file is reopened on every run. But since it is a large file, this is really not a practical solution.
What am I doing wrong here?
How should such readings be done properly?
What am I doing wrong here?
This answer addresses the same problem: you can't read a file twice.
You open the file f outside of the while loop and read it completely by running for i, line in enumerate(f): during the first iteration of the while loop. During the second iteration you can't read it again, since it has already been read.
How should such readings be done properly?
As noted in the linked answer:
To answer your question directly, once a file has been read, with read() you can use seek(0) to return the read cursor to the start of the file (docs are here).
That means that to solve your problem you can add f.seek(0) at the end of the while loop body to move the pointer back to the start of the file after each iteration. Doing this, you can re-read the file from the start again.
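A minimal sketch of that fix; some_condition and the surrounding index logic are placeholders standing in for your real code:

import random

with open(filename, 'r') as f:
    while some_condition:
        random_index = random.choice(l)  # one of the stored line numbers
        with open("out", "w") as out:
            for i, line in enumerate(f):
                if random_index <= i < random_index + 5:
                    out.write(line)      # the 5 lines from that index
        f.seek(0)                        # rewind so the next pass can re-read f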

Delete line after it has been read from file in Python

I have a function that reads lines from a file and processes them. However, I want to delete every line that I have read, but without using readlines(), which reads all of the lines at once and stores them into a list.
If the problem is that you run out of memory, then I suggest you use the for line in file syntax, as this will only load the lines one at a time:
bigFile = open('path/to/file.dat', 'r')
for line in bigFile:
    processLine(line)
If you can construct your system so that it can process the file line-by-line, then it won't run out of memory trying to read the whole file. The program will discard the copy it has made of the file contents when it moves onto the next line.
Why does this work when readlines doesn't?
In Python there are iterators, which provide an interface to supply one item of a collection at a time, iterating over the whole collection if .next() is called repeatedly. Because you rarely need the whole collection at once, this can allow the program to work with a single item in memory instead, and thus allow large files to be processed.
By contrast, the readlines function has to return a whole list, rather than an iterator object, so it cannot delay the processing of later lines like an iterator could. Since Python 2.3, the old xreadlines read iterator was deprecated in favour of using for line in file, because the file object returned by open had been changed to return an iterator rather than a list.
This follows the functional paradigm called 'lazy evaluation', where you avoid doing any actual processing unless and until the result is needed.
More iterators
Iterators can be chained together (process the lines of this file, then that one), or otherwise combined using the excellent itertools module (included in Python). These are very powerful, and can allow you to separate out the way you combine files or inputs from the code that processes them.
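For instance, a hedged sketch of chaining two input files (hypothetical names) into a single stream of lines with itertools.chain, reusing the processLine handler from above:

import itertools

with open('first.txt') as a, open('second.txt') as b:
    for line in itertools.chain(a, b):  # all lines of a, then all lines of b
        processLine(line)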
First of all, deleting the first line of a file is a costly process. Actually, you are unlikely to be able to do it without rewriting most of the file.
You have multiple approaches that could solve your issue:
1. In Python, file objects are iterators over their lines; maybe you can use this to solve your memory issues:
document_count = 0
with open(filename) as handler:
    for index, line in enumerate(handler):
        if line.rstrip('\n') == '.':  # iterated lines keep their trailing newline
            document_count += 1
2. Use an index. Reserve a certain part of your file for the index (fixed size; make sure to reserve enough space, say the first 100KB of your file), or even use a separate index file. Every time you add a document, put its starting position in the index. Once you know a document's position, just use the seek function to get there and start reading.
3. Read the file once and store every document's position (see the sketch below). This is very similar to the previous idea, except the index lives in memory, which is better performance-wise, but you will have to repeat the process every time you run the application (no persistence).
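A sketch of idea 3, assuming documents are separated by lines containing only '.': record each position with tell(), then jump back with seek(). A readline loop is used because tell() cannot be mixed with for-line iteration over a file:

positions = [0]  # the first document starts at offset 0
with open(filename, 'r') as f:
    line = f.readline()
    while line:
        if line.rstrip('\n') == '.':
            positions.append(f.tell())  # the next document starts right here
        line = f.readline()
    f.seek(positions[0])  # jump straight back to any stored document start
    print(f.readline())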

Reading from text file into python list

Very new to python and can't understand why this isn't working. I have a list of web addresses stored line by line in a text file. I want to store the first 10 in an array/list called bing, the next 10 in a list called yahoo, and the last 10 in a list called duckgo. I'm using the readlines function to read the data from the file into each array. The problem is nothing is being written to the lists. The count is incrementing like it should. Also, if I remove the loops altogether and just read the whole text file into one list it works perfectly. This leads me to believe that the loops are causing the problem. The code I am using is below. Would really appreciate some feedback.
count = 0
#Open the file
fo = open("results.txt", "r")
#read into each array
while (count < 30):
    if (count < 10):
        bing = fo.readlines()
        count += 1
        print bing
        print count
    elif (count >= 10 and count <= 19):
        yahoo = fo.readlines()
        count += 1
        print count
    elif (count >= 20 and count <= 29):
        duckgo = fo.readlines()
        count += 1
        print count
print bing
print yahoo
print duckgo
fo.close()
You're using readlines to read the files. readlines reads all of the lines at once, so the very first time through your loop, you exhaust the entire file and store the result in bing. Then, every time through the loop, you overwrite bing, yahoo, or duckgo with the (empty) result of the next readlines call. So your lists all wind up being empty.
There are lots of ways to fix this. Among other things, you should consider reading the file a line at a time, with readline (no 's'). Or better yet, you could iterate over the file, line by line, simply by using a for loop:
for line in fo:
    ...
To keep the structure of your current code you could use enumerate:
for line_number, line in enumerate(fo):
    if condition(line_number):
        ...
But frankly I think you should ditch your current system. A much simpler way would be to use readlines without a loop, and slice the resulting list!
lines = fo.readlines()
bing = lines[0:10]
yahoo = lines[10:20]
duckgo = lines[20:30]
There are many other ways to do this, and some might be better, but none are simpler!
readlines() reads all of the lines of the file. If you call it again, you get an empty list. So you are overwriting your lists with empty data as you iterate through your loop.
You should be using readline() instead of readlines()
readlines() reads the entire file in at once, whereas readline() reads a single line from the file.
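A minimal sketch of the readline() version, kept close to the original structure:

fo = open("results.txt", "r")
line = fo.readline()
while line:  # readline() returns '' once the end of the file is reached
    # handle one line here
    line = fo.readline()
fo.close()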
I suggest you rewrite it like so:
bing = []
yahoo = []
duckgo = []
with open("results.txt", "r") as f:
    for i, line in enumerate(f):
        if i < 10:
            bing.append(line)
        elif i < 20:
            yahoo.append(line)
        elif i < 30:
            duckgo.append(line)
        else:
            raise RuntimeError, "too many lines in input file"
Note how we use enumerate() to get a running count of lines, rather than making our own count variable and needing to increment it ourselves. This is considered good style in Python.
But I think the best way to solve this problem would be to use itertools like so:
import itertools as it

with open("results.txt", "r") as f:
    bing = list(it.islice(f, 10))
    yahoo = list(it.islice(f, 10))
    duckgo = list(it.islice(f, 10))
    if list(it.islice(f, 1)):
        raise RuntimeError, "too many lines in input file"
itertools.islice() (or it.islice() since I did the import itertools as it) will pull a specified number of items from an iterator. Our open file-handle object f is an iterator that returns lines from the file, so it.islice(f, 10) pulls exactly 10 lines from the input file.
Because it.islice() returns an iterator, we must explicitly expand it out to a list by wrapping it in list().
I think this is the simplest way to do it. It perfectly expresses what we want: for each one, we want a list with 10 lines from the file. There is no need to keep a counter at all, just pull the 10 lines each time!
EDIT: The check for extra lines now uses it.islice(f, 1) so that it will only pull a single line. Even one extra line is enough to know that there are more than just the 30 expected lines, and this way if someone accidentally runs this code on a very large file, it won't try to slurp the whole file into memory.
