Reading files to list of strings in Python

When you use the fileName.readlines() function in Python, is there a symbol for the end of the file that is included in the list?
For example, if the file is read into a list of strings and the last line is 'End', will there be another place in the list with a symbol indicating the end of the file?
Thanks.

No, the list contains one element for each line in the file.
You can do something with each line in a for loop like this:
lines = infile.readlines()
for line in lines:
    # Do something with this line
    process(line)
Python has a shorter way of accomplishing this that avoids reading the whole file into memory at once:
for line in infile:
    # Do something with this line
    process(line)
If you just want the last line of the file:
lines = infile.readlines()
last_line = lines[-1]
Why do you think you need a special symbol at the end?

The list returned by .readlines(), like any other Python list, has no "end" marker -- assuming e.g. you start with L = myfile.readlines(), you can see how many items L has with len(L), get the last one with L[-1], loop over all items one after the other with for item in L:, and so forth ... again -- just like any other list.
If for some peculiar reason you want, after the last line, some special "sentinel" value, you could for example do L.append('') to add just such a sentinel, or marker -- the empty string is never going to be the value of any other line (since each item is a complete line including the trailing '\n', so it will never be an empty string!).
You could use other arbitrary markers, such as None, but often it's simpler, when feasible (and it's obviously feasible in this case, as I've just shown), if the sentinel is of exactly the same type as any other item. That depends on what processing you're doing that needs a past-the-end "sentinel" (I'm not going to guess because you give us no indication, and only relatively rare problems are best solved that way in Python, anyway ;-), but in almost every conceivable case of sentinel use a '' sentinel will be just fine.

It simply returns a list containing each line -- there's no "EOF" item in the list.
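A minimal sketch of what the answers above describe, assuming Python 3. The sample text is made up, and io.StringIO only stands in for a real file opened with open():

```python
import io

# io.StringIO behaves like an open text file; the three lines are sample data.
infile = io.StringIO("Start\nMiddle\nEnd\n")

lines = infile.readlines()
print(lines)       # one string per line, each keeping its trailing '\n'
print(len(lines))  # the length of the list tells you where the data stops

# The file is now exhausted: readline() returns an empty string, not a marker object.
at_eof = infile.readline()
print(repr(at_eof))
```

The only end-of-file signal is the empty string that readline() returns once the file is exhausted; it never shows up as an element of the list.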

Related

Return the exact lines of a Huge file after pattern matching without using FOR in Python3

I am new to Python. My problem here is that:
I want to match a pattern against a large file and return matching lines(not just the matched string) from it. I DO NOT want a FOR loop for this as my file is huge. I am using mmap for reading the file.
in the above file, if I search for bhuvi, I should get 2 rows, bhuvi and bhuvi Kumar
I used re.findall() for this, but it just returns the substrings, not the whole lines.
Can someone please suggest what I can do here?

If your input file is huge, you cannot use readlines, but nothing
prevents you from reading one line in a loop.
As the file object is iterable, you can write the loop as:
for line in fh:
and process the content of the input line inside the loop.
The file size is not important, as you do not attempt to read all lines at once.
To check for presence of your string (bhuvi) in the line use
re.search, not re.findall.
Actually you don't need any list of matches, it is enough to find
a single match (it works quicker).
Below you have an example program (Python 3.7), writing the lines containing your
string, along with the line number:
import re
cnt = 0
with open('input.txt') as fh:
    for line in fh:
        line = line.rstrip()
        cnt += 1
        if re.search('bhuvi', line):
            print(f'{cnt}: {line}')
Note that I used rstrip() to remove the trailing newline, if any.
Edit after your comment:
You wrote that the file to check is huge. So there is a risk that
if you try to read it whole into the computer memory, the program
runs out of memory.
In such a case you would have to read the file chunk by chunk and
perform search in each chunk separately.
There is also a risk that a row with the text you are looking for will be
partially read in one chunk and the rest in the next,
so you have to take some measure to avoid this in your program.
On the other hand, if there is no other way but using mmap,
try something like re.finditer(rb'[^\n]*bhuvi[^\n]*', mm), where mm is the
mmap object (a memory-mapped file holds bytes, so in Python 3 the pattern
must be a bytes pattern), i.e. create an iterator looking for:
A sequence of chars other than \n.
Your string.
Another sequence of chars other than \n.
This way the match object returned by the iterator will match the
whole line, not your string alone.
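The mmap suggestion could be sketched roughly as follows, assuming Python 3. The function name matching_lines, the temporary file, and the sample names other than bhuvi are made up for illustration:

```python
import mmap
import os
import re
import tempfile

def matching_lines(path, needle):
    """Return every whole line of the file at `path` that contains `needle` (bytes)."""
    # [^\n]* on both sides widens the match from the needle to the full line.
    pattern = re.compile(rb'[^\n]*' + re.escape(needle) + rb'[^\n]*')
    with open(path, 'rb') as fh:
        # Length 0 maps the whole file; mmap raises ValueError for an empty file.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Collect the matches before the map is closed.
            return [m.group().decode() for m in pattern.finditer(mm)]

# Demo on a small throwaway file (hypothetical sample data).
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'arun\nbhuvi\nbhuvi Kumar\nchitra\n')
found = matching_lines(tmp.name, b'bhuvi')
os.remove(tmp.name)
print(found)
```

There is no explicit for loop over lines here, but the regex engine still scans the mapped bytes sequentially, so this is not necessarily faster than iterating over the file object.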

Unexpected output from textfile - cleaning read in lines correctly

I am trying to use a very basic text file as a settings file. Three lines repeat in this order/format that govern some settings/input for my program. Text file is as follows:
Facebook
1#3#5#2
Header1#Header2#Header3#Header4
...
This is read in using the following Python code:
f = open('settings.txt', 'r')
for row in f:
    platform = f.readline()
    rows_to_keep = int(f.readline().split('#'))
    row_headers = f.readline().split('#')
    clean_output(rows_to_keep, row_headers, platform)
I would expect single string to be read in platform, an array of ints in the second and an array of strings in the third. These are then passed to the function and this is repeated numerous times.
However, the following three things are happening:
Int doesn't convert and I get a TypeError
First line in text file is ignored and I get rows to keep in platform
\n at the end of each line
I suspect these are related and so am only posting one question.

You cannot call int on a list; you need to do some kind of list comprehension like
rows_to_keep = [int(a) for a in f.readline().split('#')]
The for loop already reads a line, and then readline reads another line from the file, so the first line is lost. You should either do some kind of slicing (see Python how to read N number of lines at a time) or call a function with the three lines after every third iteration.
Use .strip() to remove line endings and other surrounding whitespace.

Try this:
with open('settings.txt', 'r') as f:
    platform, rows_to_keep, row_headers = f.read().splitlines()
rows_to_keep = [int(x) for x in rows_to_keep.split('#')]
row_headers = row_headers.split('#')
clean_output(rows_to_keep, row_headers, platform)

There are several things going on here. First, when you do the split on the second line, you're trying to cast a list to type int. That won't work. You can, instead, use map (in Python 3, map returns an iterator, so wrap it in list() to get a list).
rows_to_keep = list(map(int, f.readline().strip().split("#")))
Additionally, you see the strip() method above. That removes leading and trailing whitespace chars from your line, e.g. \n.
Try that change and also use strip() on each readline() call.

With as few changes as possible, I've attempted to solve your issues and show you where you went wrong. @Daniel's answer is how I would personally solve the issues.
f = open('settings.txt', 'r')
#See 1. We remove the unnecessary for loop
platform = f.readline().strip()
#See 4. We make sure there are no unwanted leading or trailing characters by stripping them out
rows_to_keep = f.readline().strip().split('#')
#See 3. The enumerate function yields pairs (index, value)
for row in enumerate(rows_to_keep):
    rows_to_keep[row[0]] = int(row[1])
row_headers = f.readline().strip().split('#')
#See 2. We close the file when we're done reading
f.close()
clean_output(rows_to_keep, row_headers, platform)
1. You don't need (and don't want) both a for loop on f and calls to readline. You should pick one or the other.
2. You need to close f with f.close().
3. You cannot convert a list to an int; you want to convert the elements in the list to int. This can be accomplished with a for loop.
4. You probably want to call .strip to get rid of trailing newlines.
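The answers above handle a single group of three lines, while the question says the three lines repeat. One possible way to read every group is sketched below, assuming Python 3. The function name read_settings and the second ("Twitter") group are made up for illustration, and io.StringIO stands in for open('settings.txt'):

```python
import io

def read_settings(lines):
    """Yield (platform, rows_to_keep, row_headers) for each group of three lines."""
    # Strip newlines and surrounding whitespace, then drop blank lines.
    cleaned = (line.strip() for line in lines)
    cleaned = (line for line in cleaned if line)
    # Passing the same iterator to zip() three times pulls three consecutive
    # lines per group; an incomplete trailing group is silently dropped.
    for platform, rows, headers in zip(cleaned, cleaned, cleaned):
        yield platform, [int(n) for n in rows.split('#')], headers.split('#')

# Hypothetical sample data in the format described in the question.
sample = io.StringIO(
    "Facebook\n1#3#5#2\nHeader1#Header2#Header3#Header4\n"
    "Twitter\n2#4\nHeaderA#HeaderB\n"
)
settings = list(read_settings(sample))
for platform, rows_to_keep, row_headers in settings:
    print(platform, rows_to_keep, row_headers)
```

Each tuple could then be passed to clean_output(rows_to_keep, row_headers, platform) from the question.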

Adding numbers from a file to a list

OK, so I have a .txt file whose contents I need to add to a list. The problem is that there is only one character per row; for example, if I need to have "2+3", in the .txt it would look like this:
2
+
3
and then I have to add it to a list so that it looks like this: [2,+,3]
With the code I have right now, it adds the contents as strings and leaves a "\n" at the end of every list element. I can't find a way to add each number as an int and without the \n.
This is the code:
def readlist():
    count=0
    file=open("readfile.txt","r")
    list1=[]
    line=file.readlines()
    list1.append(line)
    print(list1)
    file.close
(the file it is reading has 1(2+3) in it)
Thanks in advance for the help.

The safest way is to use a try/except:
out = []
with open("in.txt") as f:
    for line in f:
        try:
            out.append(int(line))
        except ValueError:
            out.append(line.rstrip())
print(out)
[2, '+', 3]
You don't need to strip whitespace or newline characters when casting to int; Python is forgiving in that regard, so we only need to rstrip the newline when we catch an exception, because then we have an operator.
Also, with will automatically close your file, something you are not actually doing in your own code: you are missing the parentheses needed to call the method, so file.close should be file.close().

This problem can be fixed with a few additions.
First, every line has a \n in its string because each one ends with a newline in the file. To remove this you can use the rstrip method.
From here you're going to want to convert the string into an int using int(line). This will turn the line into an integer that you can then add to your list as wanted.
The problem now is going to be choosing which lines to convert into an int and which ones are arithmetic operators such as the + you have in your example file.
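One way to make that choice, sketched under the assumption that numbers appear as plain unsigned digits (a line such as -2 would be kept as a string). The function name read_tokens is made up, and io.StringIO stands in for the opened file:

```python
import io

def read_tokens(lines):
    """Return ints for digit-only lines and plain strings for everything else."""
    tokens = []
    for line in lines:
        text = line.strip()  # removes the trailing '\n' and other whitespace
        if not text:
            continue  # skip blank lines
        # str.isdecimal() is True only for non-empty strings of decimal digits,
        # so operators such as '+' stay as strings.
        tokens.append(int(text) if text.isdecimal() else text)
    return tokens

tokens = read_tokens(io.StringIO("2\n+\n3\n"))
print(tokens)
```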

You can split each line on the newline and keep the first piece:
line.split('\n')[0]

Iterate over a portion of a list in a list comprehension

I'd like to print out the first 10 lines of a file and avoid reading in any extra lines. How can I do that with a list comprehension without reading in the whole file?
I know that I can do the code like this:
N = 10
with open(path,'rb') as f_in:
    for line in f_in:
        print line.strip()
        N -= 1
        if N == 0:
            break
But I think a list comprehension is more appropriate:
with open(path,'rb') as f_in:
    [print line for i, line in enumerate(f_in) if i<N]
However, that doesn't work because of the print statement so i end up with this mess:
with open(path,'rb') as f_in:
    lines = [line.strip() for i, line in enumerate(f_in) if i<N]
for line in lines:
    print line
And the real point of my question is how do you get the list comprehension to stop when i==N instead of needlessly continuing and only filtering out the extra lines?
Is there a way to limit how far into an iterator a list comprehension will go? And is there an appropriate way to print out from a list comprehension? I'm fairly new to python and so I'm trying to learn how to do things the right way rather than just the first way I can think of it. I'd like to able to write this in a pythonic way.

how do you get the list comprehension to stop when i==N instead of
needlessly continuing and only filtering out the extra lines?
Is there a way to limit how far into an iterator a list comprehension will go?
You can use itertools.islice to iterate over a slice of an iterable:
from itertools import islice
with open(path,'rb') as f_in:
    for line in islice(f_in, N):
        print line.strip()
Actually you can specify the index of the first line to produce and even a step (like list or string slicing).
Note that you shouldn't use a list-comprehension if you don't actually need a list, because it consumes memory (in your case you keep all the contents of the file in memory, which can be bad if the file is big).
If you simply want to iterate once over something use a generator expression:
lines = (line.strip() for line in f_in)
(Yes, you simply have to replace the [] with ().)
This avoids building the whole list in memory.
is there an appropriate way to print out from a list comprehension?
No.
In Python 2, print is a statement and thus cannot appear inside an expression.
In Python 3 you could call print since it is a function, but it is a very bad idea.
List comprehensions have a specific purpose: build a list from a given iterable.
You are throwing the list away, thus defeating the whole purpose of that syntax.
For this reason there is no support for "breaking" out of the loop in a list comprehension. If your code is complex enough to require a break, you'd better write it with an explicit for loop.
The same is true if you tried to do something like calling map:
map(lambda line: print line, lines)
(assuming it were possible to put a print statement in a lambda).
This fails even in Python 3, where print is a function: map is lazy there, so nothing is printed until the result is consumed.
If you want to write good python code the number one rule is to follow the language design:
don't mix expression and statements, that is to say: use expression return values, don't abuse them to produce side-effects.
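For readers on Python 3, the islice approach might look like the sketch below. The helper name head and the generated sample lines are made up, and io.StringIO stands in for the opened file:

```python
import io
from itertools import islice

def head(lines, n):
    """Return the first n lines, stripped, consuming no more than n lines."""
    return [line.strip() for line in islice(lines, n)]

# Hypothetical 100-line input.
source = io.StringIO("".join(f"line {i}\n" for i in range(100)))
first = head(source, 10)
print("\n".join(first))

# islice stopped after ten lines, so the next read picks up at the eleventh.
remaining = source.readline()
```

Here a list is actually wanted (it is joined and printed afterwards), so building it with a comprehension is reasonable; the printing stays outside the comprehension.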

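You can also call next() on the file object once for each line you require:
lines = [f_in.next() for x in range(10)]
This will give you the first ten lines. In Python 3 the method is gone, so use the built-in form next(f_in) instead (it also works in Python 2.6 and later). If the file has fewer than ten lines, next() raises StopIteration.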
Using next() can be useful if you want to skip headers or other lines at the start of your file. Each time you call next on the file object you will move to the next line of the file.
If you wanted to print the contents of lines you could use join():
print "".join(lines)

Putting parts of a text file into a list

I have this text file and I need certain parts of it to be inserted into a list.
The file looks like:
blah blah
.........
item: A,B,C.....AA,BB,CC....
Other: ....
....
I only need to rip out the A,B,C.....AA,BB,CC..... parts and put them into a list. That is, everything after "Item:" and before "Other:"
This can be easily done with small input, but the problem is that it may contain a large number of items and text file may be pretty huge. Would using rfind and strip be as efficient for huge input as for small input, algorithmically speaking?
What would be an efficient way to do it?

I can see no need for rfind() nor strip().
It looks like you're simply trying to do:
start = 'item: '
end = 'Other: '
should_append = False
the_list = []
with open('file') as f:
    for line in f:
        if line.startswith(start):
            data = line[len(start):]
            the_list.append(data)
            should_append = True
        elif line.startswith(end):
            should_append = False
            break
        elif should_append:
            the_list.append(line)
print(the_list)
Iterating over the file object (rather than calling readlines(), which reads everything at once) doesn't hold the whole file in memory, just the current line and the list of lines found between the start and the end patterns.

To answer the question about efficiency specifically, reading in the file and comparing it line by line will net O(n) average case performance.
Example code:
pattern = "item:"
with open("file.txt", 'r') as f:
    for line in f:
        if line.startswith(pattern):
            # You can do what you like with it; here the rest of the line is split on commas into a list.
            items = line[len(pattern):].strip().split(',')
You're searching the entire file sequentially, and you have to compare some number of elements in the file before you come across the element you're looking for.
You have the option of building a search tree instead. While it costs O(n) to build, it would cost O(log_k n) time to search (resulting in O(n) time overall, again), where k is the number of starting characters you'd have in your list.

Though I usually jump at the chance to employ regular expressions, I feel like for a single occurrence in a large file, it would be much more work and too computationally expensive to use regex. So perhaps the straightforward answer (in python) would be most appropriate:
s = 'item:'
yourlist = next(line[len(s)+1:].split(',') for line in open(r"c:\zzz.txt") if line.startswith(s))
This, of course, assumes that 'item:' doesn't exist on any other lines that are NOT followed by 'other:', but in the event 'item:' exists only once and at the start of the line, this simple generator should work for your purposes.

This problem is simple enough that it really only has two states, so you could just use a Boolean variable to keep track of what you are doing. But the general case for problems like this is to write a state machine that transitions from one state to the next until it has worked its way through the problem.
I like to use enums for states; unfortunately Python didn't have a built-in enum before the enum module was added in Python 3.4. So here I am using a class with some class variables to store the enums.
Using the standard Python idiom for line in f (where f is the open file object) you get one line at a time from the text file. This is an efficient way to process files in Python; your initial lines, which you are skipping, are simply discarded. Then when you collect items, you just keep the ones you want.
This answer is written to assume that "item:" and "Other:" never occur on the same line. If this can ever happen, you need to write code to handle that case.
EDIT: I made the start_code and stop_code into arguments to the function, instead of hard-coding the values from the example.
import sys
class States:
    pass
States.looking_for_item = 1
States.collecting_input = 2
def get_list_from_file(fname, start_code, stop_code):
    lst = []
    state = States.looking_for_item
    with open(fname, "rt") as f:
        for line in f:
            l = line.lstrip()
            # Don't collect anything until after we find "item:"
            if state == States.looking_for_item:
                if not l.startswith(start_code):
                    # Discard input line; stay in same state
                    continue
                else:
                    # Found item! Advance state and start collecting stuff.
                    state = States.collecting_input
                    # chop out start_code
                    l = l[len(start_code):]
                    # Collect everything after "item":
                    # Split on commas to get strings. Strip white-space from
                    # ends of strings. Append to lst.
                    lst += [s.strip() for s in l.split(",")]
            elif state == States.collecting_input:
                if not l.startswith(stop_code):
                    # Continue collecting input; stay in same state
                    # Split on commas to get strings. Strip white-space from
                    # ends of strings. Append to lst.
                    lst += [s.strip() for s in l.split(",")]
                else:
                    # We found our terminating condition! Don't bother to
                    # update the state variable, just return lst and we
                    # are done.
                    return lst
            else:
                print("invalid state reached somehow! state: " + str(state))
                sys.exit(1)
lst = get_list_from_file(sys.argv[1], "item:", "Other:")
# do something with lst; for now, just print
print(lst)
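On Python 3.6 or later, the same two-state idea could be written with the standard enum module. This is only a compact sketch, not the answer's original code: the names State and collect_items and the sample text are made up, and unlike the version above it drops empty strings left by trailing commas and returns whatever was collected if the stop code never appears.

```python
import enum
import io

class State(enum.Enum):
    LOOKING_FOR_ITEM = enum.auto()
    COLLECTING_INPUT = enum.auto()

def collect_items(lines, start_code, stop_code):
    """Collect comma-separated items between start_code and stop_code."""
    items = []
    state = State.LOOKING_FOR_ITEM
    for line in lines:
        text = line.strip()
        if state is State.LOOKING_FOR_ITEM:
            if not text.startswith(start_code):
                continue  # still before the start marker; discard the line
            state = State.COLLECTING_INPUT
            text = text[len(start_code):]  # chop out start_code
        elif text.startswith(stop_code):
            break  # reached the stop marker
        # Split on commas, strip whitespace, and skip empty pieces.
        items += [s.strip() for s in text.split(",") if s.strip()]
    return items

# Hypothetical sample shaped like the file in the question.
sample = io.StringIO("blah blah\nitem: A,B,C\nAA,BB,CC\nOther: x\nitem: ignored\n")
items = collect_items(sample, "item:", "Other:")
print(items)
```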

I wrote an answer that assumes that the start code and stop code must occur at the start of a line. This answer also assumes that the lines in the file are reasonably short.
You could, instead, read the file in chunks, and check to see if the start code exists in the chunk. For this simple check, you could use if code in chunk (in other words, use the Python in operator to check for a string being contained within another string).
So, read a chunk, check for start code; if not present discard the chunk. If start code present, begin collecting chunks while searching for the stop code. In a recent Python version you can concatenate the blocks one at a time with reasonable performance. (In an old version of Python you should store the chunks in a list, then use the .join() method to join the chunks together.)
Once you have built a string that holds data from the start code to the end code, you can use .find() and .rfind() to find the start code and end code, and then cut out just the data you want.
If the start code and stop code can occur more than once in the file, wrap all of the above in a loop and loop until end of file is reached.
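A rough sketch of the chunked approach, assuming Python 3 and a single start/stop pair. The function name extract_between, the chunk_size parameter, and the sample text are made up. It searches with .find() as the chunks arrive instead of building the whole string first, and it keeps a short tail of each discarded chunk so that a start code split across two chunks is still found:

```python
import io

def extract_between(stream, start_code, stop_code, chunk_size=4096):
    """Return the text between start_code and stop_code, reading in chunks.

    Returns None if start_code never appears; if stop_code never appears,
    returns everything after start_code.
    """
    buffer = ""
    collecting = False
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of file
        buffer += chunk
        if not collecting:
            start = buffer.find(start_code)
            if start == -1:
                # Keep only a tail long enough to catch a start code
                # that straddles the boundary with the next chunk.
                keep = len(start_code) - 1
                buffer = buffer[-keep:] if keep > 0 else ""
                continue
            buffer = buffer[start + len(start_code):]
            collecting = True
        stop = buffer.find(stop_code)
        if stop != -1:
            return buffer[:stop]
    return buffer if collecting else None

# Tiny chunk size to exercise the boundary handling on hypothetical sample data.
text = "blah blah\nitem: A,B,C\nAA,BB,CC\nOther: x\n"
data = extract_between(io.StringIO(text), "item:", "Other:", chunk_size=4)
items = [s.strip() for s in data.replace("\n", ",").split(",") if s.strip()]
print(items)
```

Searching the growing buffer for the stop code on every chunk is simple but repeats work; for a very long span between the codes, searching only the newly added part (plus a short overlap) would be cheaper.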
