Unexpected output from textfile - cleaning read in lines correctly


I am trying to use a very basic text file as a settings file. Three lines repeat in this order/format that govern some settings/input for my program. Text file is as follows:
Facebook
1#3#5#2
Header1#Header2#Header3#Header4
...
This is read in using the following Python code:
f = open('settings.txt', 'r')
for row in f:
    platform = f.readline()
    rows_to_keep = int(f.readline().split('#'))
    row_headers = f.readline().split('#')
    clean_output(rows_to_keep, row_headers, platform)
I would expect single string to be read in platform, an array of ints in the second and an array of strings in the third. These are then passed to the function and this is repeated numerous times.
However, the following three things are happening:
Int doesn't convert and I get a TypeError
First line in text file is ignored and I get rows to keep in platform
\n at the end of each line
I suspect these are related and so am only posting one question.

You cannot call int on a list; you need to do some kind of list comprehension, like
rows_to_keep = [int(a) for a in f.readline().split('#')]
You're reading a line, then reading another line from the file. You should either do some kind of slicing (see Python how to read N number of lines at a time) or call a function with the three lines after every third iteration.
Use .strip() to remove line endings and other surrounding whitespace.
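The three points above can be combined into one loop. The following is only a minimal sketch, assuming the settings file always contains complete groups of three lines; the helper name read_settings and the second "Twitter" group are made up for illustration, and io.StringIO stands in for the real settings.txt.

```python
import io

def read_settings(f):
    """Yield (platform, rows_to_keep, row_headers) for each group of three lines."""
    # Strip newlines and skip blank lines before grouping.
    lines = [line.strip() for line in f if line.strip()]
    # Slicing with a step of 3 pairs up lines 0,1,2 then 3,4,5 and so on.
    for platform, rows, headers in zip(lines[0::3], lines[1::3], lines[2::3]):
        rows_to_keep = [int(n) for n in rows.split('#')]  # convert each element
        row_headers = headers.split('#')
        yield platform, rows_to_keep, row_headers

# Hypothetical file contents; a real program would pass open('settings.txt').
sample = io.StringIO(
    "Facebook\n1#3#5#2\nHeader1#Header2#Header3#Header4\n"
    "Twitter\n2#4\nHeaderA#HeaderB\n"
)
settings = list(read_settings(sample))
```

Each tuple in settings can then be unpacked and passed on to clean_output.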

Try this:
with open('settings.txt', 'r') as f:
    platform, rows_to_keep, row_headers = f.read().splitlines()
rows_to_keep = [int(x) for x in rows_to_keep.split('#')]
row_headers = row_headers.split('#')
clean_output(rows_to_keep, row_headers, platform)

There are several things going on here. First, when you do the split on the second line, you're trying to cast a list to type int. That won't work. You can, instead, use map (wrapped in list() on Python 3, where map returns an iterator).
rows_to_keep = list(map(int, f.readline().strip().split("#")))
Additionally, you see the strip() method above. That removes leading and trailing whitespace characters from your line, i.e. \n.
Try that change and also using strip() on each readline() call.

With as few changes as possible, I've attempted to solve your issues and show you where you went wrong. @Daniel's answer is how I would personally solve the issues.
f = open('settings.txt', 'r')
#See 1. We remove the unnecessary for loop
platform = f.readline()
#See 4. We make sure there are no unwanted leading or trailing characters by stripping them out
rows_to_keep = f.readline().strip().split('#')
#See 3. The enumerate function creates a list of pairs [index, value]
for row in enumerate(rows_to_keep):
    rows_to_keep[row[0]] = int(row[1])
row_headers = f.readline().strip().split('#')
#See 2. We close the file when we're done reading
f.close()
clean_output(rows_to_keep, row_headers, platform)
1. You don't need (and don't want) a for loop on f, as well as calls to readline. You should pick one or the other.
2. You need to close f with f.close().
3. You cannot convert a list to an int; you want to convert the elements in the list to int. This can be accomplished with a for loop.
4. You probably want to call .strip() to get rid of trailing newlines.

Related

Amend list from file - Correct syntax and file format?

I currently have a list hard coded into my python code. As it keeps expanding, I wanted to make it more dynamic by reading the list from a file. I have read through many articles about how to do this, but in practice I can't get this working. So firstly, here is an example of the existing hardcoded list:
serverlist = []
serverlist.append(("abc.com", "abc"))
serverlist.append(("def.com", "def"))
serverlist.append(("hji.com", "hji"))
When I enter the command 'print serverlist' the output is shown below and my list works perfectly when I access it:
[('abc.com', 'abc'), ('def.com', 'def'), ('hji.com', 'hji')]
Now I've replaced the above code with the following:
serverlist = []
with open('/server.list', 'r') as f:
    serverlist = [line.rstrip('\n') for line in f]
With the contents of server.list being:
'abc.com', 'abc'
'def.com', 'def'
'hji.com', 'hji'
When I now enter the command print serverlist, the output is shown below:
["'abc.com', 'abc'", "'def.com', 'def'", "'hji.com', 'hji'"]
And the list is not working correctly. So what exactly am I doing wrong? Am I reading the file incorrectly or am I formatting the file incorrectly? Or something else?

The contents of the file are not interpreted as Python code. When you read a line in f, it is a string; and the quotation marks, commas etc. in your file are just those characters as parts of a string.
If you want to create some other data structure from the string, you need to parse it. The program has no way to know that you want to turn the string "'abc.com', 'abc'" into the tuple ('abc.com', 'abc'), unless you instruct it to.
This is the point where the question becomes "too broad".
If you are in control of the file contents, then you can simplify the data format to make this more straightforward. For example, if you just have abc.com abc on the line of the file, so that your string ends up as 'abc.com abc', you can then just .split() that; this assumes that you don't need to represent whitespace inside either of the two items. You could instead split on another character (like the comma, in your case) if necessary (.split(',')). If you need a general-purpose hammer, you might want to look into JSON. There is also ast.literal_eval which can be used to treat text as simple Python literal expressions - in this case, you would need the lines of the file to include the enclosing parentheses as well.
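As a rough illustration of two of the options mentioned above, the sketch below parses the original quoted format with ast.literal_eval and a simplified whitespace-separated format with .split(). The sample lists stand in for lines already read from the file with their newlines removed; the variable names are made up for this example.

```python
import ast

# Lines as they would look after reading the file and stripping "\n".
quoted_lines = ["'abc.com', 'abc'", "'def.com', 'def'", "'hji.com', 'hji'"]

# Option 1: keep the quoted format and parse each line as a Python literal.
# Wrapping the line in parentheses makes it an explicit tuple literal.
from_literals = [ast.literal_eval("(" + line + ")") for line in quoted_lines]

# Option 2: simplify the file to "abc.com abc" and split on whitespace.
plain_lines = ["abc.com abc", "def.com def", "hji.com hji"]
from_split = [tuple(line.split()) for line in plain_lines]

print(from_literals)
print(from_split)
```

Both approaches produce the same list of tuples; the second is simpler but assumes neither field ever contains whitespace.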

If you are willing to let go of the quotes in your file and rewrite it as
abc.com, abc
def.com, def
hji.com, hji
the code to load can be reduced to a one liner using the fact that files are iterables
with open('servers.list') as f:
    servers = [tuple(line.rstrip('\n').split(', ')) for line in f]
Remember that iterating over a file does not strip the newlines, so each line has to be stripped before it is split.
You can allow arbitrary whitespace by doing something like
servers = [tuple(word.strip() for word in line.split(',')) for line in f]
It might be easier to use something like regex to parse the original format. You could use an expression that captures the parts of the line you care about and matches but discards the rest:
import re
pattern = re.compile('\'(.+)\',\\s*\'(.+)\'')
You could then extract the names from the matched groups
with open('servers.list') as f:
    servers = [pattern.fullmatch(line.strip()).groups() for line in f]
This is just a trivialized example. You can make it as complicated as you wish for your real file format.

Try this:
serverlist = []
with open('/server.list', 'r') as f:
    for line in f:
        serverlist.append(tuple(line.rstrip('\n').split(',')))
Explanation
You want an explicit for loop so you cycle through each line as expected.
You need list.append for each line to append to your list.
You need to use split(',') in order to split by commas.
Convert to tuple as this is your desired output.
List comprehension method
The for loop can be condensed as below:
with open('/server.list', 'r') as f:
    serverlist = [tuple(line.rstrip('\n').split(',')) for line in f]

Adding numbers from a file to a list

OK, so I have a .txt file whose contents I need to add to a list. The problem is that there is only one character per row. For example, if I need to have "2+3", in the .txt it would look like this:
2
+
3
and then I have to add it to a list so that it looks like this: [2,+,3]
The code I have right now adds the contents as strings, with a "\n" at the end of every list element. I can't find a way to add each character as an int and without the \n.
This is the code:
def readlist():
    count=0
    file=open("readfile.txt","r")
    list1=[]
    line=file.readlines()
    list1.append(line)
    print(list1)
    file.close
(the file is reading has 1(2+3) into it)
thanks in advance for the help

The safest way is to use a try/except:
out = []
with open("in.txt") as f:
    for line in f:
        try:
            out.append(int(line))
        except ValueError:
            out.append(line.rstrip())
print(out)
[2, '+', 3]
You don't need to strip whitespace or newline characters when casting to int; Python is forgiving in that regard, so we only need to rstrip the newline when we catch an exception, because then we have an operator.
Also, with will automatically close your file, something you are not actually doing in your own code: you are missing the parentheses needed to call the method, so file.close should be file.close().

This problem can be fixed with a few additions.
First, every line has a \n in its string because it ends with a newline in the file. To remove this you can use the rstrip method.
From here you're going to want to convert the string into a int using int(line). This will turn the line into a integer that you can then add to your list as wanted.
The problem now is going to be choosing which line to convert into an int and which ones are arithmetic operations such as the + you have in your example file.
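One possible way to make that choice is sketched below. It assumes the file only holds non-negative whole numbers and single operators, one per line; the helper name to_token and the sample list are hypothetical, and this check-first approach is just one alternative to catching ValueError.

```python
def to_token(line):
    """Return an int for digit-only lines, otherwise the stripped text."""
    text = line.rstrip()  # drop the trailing "\n"
    # str.isdecimal() is True only for non-empty strings made of decimal
    # digits, so operators such as "+" are returned unchanged.
    return int(text) if text.isdecimal() else text

raw_lines = ["2\n", "+\n", "3\n"]  # what iterating over the file would give
tokens = [to_token(line) for line in raw_lines]
print(tokens)
```

Negative numbers or decimals would not pass the isdecimal() check, so a real file with such values would need a different test.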

You can do a
line.split('\n')

Python - txt file into a list

I'm trying to read a file and save the text inside as a list of strings
(each item in the list is a string holding one line of the text).
I only managed to print out the text line by line, and not as a list.
The text is a long list of biological stuff (random letters if you ask me (:)
and this:
def read_proteome(filename):
    f = open(filename).readlines()
    for line in f:
        print(line)
only printed the file (it did separate the lines).
Where did I go wrong?
How do I turn it into a list?

The result of .readlines() is a list. Just print f:
print(f)

Martijn Pieters already gives a simple and complete answer, but it's worth learning how to figure these things out yourself. It's usually faster, and it doesn't cost you a half-dozen downvotes.
First, the fact that you can use for line in f: means that f is obviously some kind of list-ish object, in that it can be used in a for loop the same way a list can. Technically, this means it's an "iterable".
Maybe this means it's already a list? If so, you're already done. If not, the list function takes any iterable and makes it into a list, so, you can just add f = list(f) and you're done.
How do you find out which?
Well, you can add print(type(f)) to your code and see what it prints out. If it says list, you're done; if it says anything else, you need to add the conversion line f = list(f).
It's often easier to do this all interactively, rather than in a script:
>>> f = open(filename).readlines()
>>> type(f)
list
>>> f
['first line\n', 'second line\n', 'last line\n']
If you break this down into pieces, you can see the types of each piece separately:
>>> filename = 'C:/foo.txt'
>>> type(filename)
str
>>> fileobj = open(filename)
>>> type(fileobj)
_io.TextIOWrapper
>>> lines = fileobj.readlines()
>>> type(lines)
list
When you put this all together in one line, lines=open('C:/foo.txt').readlines(), the end result is the same as if you did it in three steps—lines is a list.
But what if you can't figure something out by experimenting, because you don't know what to try?
Well, the interactive interpreter has built-in help:
>>> fileobj = open(filename)
>>> help(fileobj.readlines)
Help on built-in function readlines:
readlines(...)
Return a list of lines from the stream.
hint can be specified to control the number of lines read: no more
lines will be read if the total size (in bytes/characters) of all
lines so far exceeds hint.
It says right there that it returns a list.
Or, you can look at the documentation. Trying to guess where readlines might be in 3.x is actually a bit complicated, because the type of thing open returns is not obvious… but you can just use the "quick search" on the left, and you'll find io.IOBase.readlines, which gives you the same answer:
readlines(hint=-1)
Read and return a list of lines from the stream. hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.

As Martijn said, .readlines() reads the whole file into a list of lines. However, this is what I think you are trying to do.
with open("file.txt", 'r') as file:
    print(file.read().split("\n"))

As Martijn said, the readlines() method returns a list, as stated in the Python documentation.
If what you need is to convert a string that looks like a list into a real list, use ast.literal_eval:
import ast
stringList = "['a', 'text']"
realList = ast.literal_eval(stringList)
print(realList)
Hope this is what you need!

How can I open a file and iterate through it, adding data from only certain lines?

I have the following code
my_file=open("test.stl","r+")
vertices=[]
for line in my_file:
    line=line.strip()
    line=line.split()
    if line.startswith('vertex'):
        vertices.append([[line[1],line[2],line[3]])
print vertices
my_file.close()
and right now it gives this error:
File "convert.py", line 10
vertices.append([[line[1],line[2],line[3]])
^
SyntaxError: invalid syntax
My file has a bunch of lines in it, a lot of them formatted as vertex 5.6354345 3.34344 7.345345, for example (STL file). I want to add those three numbers to my array so that my array will eventually have [[v1,v2,v3],[v1,v2,v3],....] where all those v's are from the lines. Reading other similar questions, it looks like I may need to import sys, but I am not sure why this is.

Do the lines in your STL file have any leading whitespace?
If they do, you need to strip that off first.
line = line.strip()
Also: calling line.split() doesn't affect line. It produces a new list, and you're expected to give the new list a name and use it afterwards, like this:
fields = line.split()
vertices.append([fields[1], fields[2], fields[3]])
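Putting those two points together, a loop over the file might look like the sketch below. The sample text is a hypothetical STL-like fragment held in io.StringIO rather than a real test.stl, and converting the fields to float is an extra step that the question only implies.

```python
import io

# Hypothetical STL-like sample; a real program would use open("test.stl").
sample = io.StringIO(
    "facet normal 0 0 1\n"
    "  outer loop\n"
    "    vertex 5.6354345 3.34344 7.345345\n"
    "    vertex 1.0 2.0 3.0\n"
    "  endloop\n"
)

vertices = []
for line in sample:
    fields = line.split()  # split() with no argument also drops leading whitespace
    # Only keep lines of the form "vertex x y z".
    if len(fields) == 4 and fields[0] == "vertex":
        vertices.append([float(x) for x in fields[1:4]])

print(vertices)
```

Checking the length of fields before indexing avoids an IndexError on blank or short lines.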

You're not assigning the result of line.split() to a variable, e.g.:
line_split = line.split()
vertices.append([line_split[1],line_split[2],line_split[3]])
Another way would be:
for line in my_file:
    line_split = line.split()
    if line_split and line_split[0] == 'vertex':
        vertices.append([line_split[1],line_split[2],line_split[3]])

vertices.append([[line[1],line[2],line[3]])
^
SyntaxError: invalid syntax
Remove the first [ (there is missing ] otherwise) to fix the SyntaxError. There are other errors in your code.
To parse lines that have:
vertex 5.6354345 3.34344 7.345345
format into a list of 3D points with float coordinates:
with open("test.stl") as file:
    vertices = [list(map(float, line.split()[1:4]))
                for line in file
                if line.lstrip().startswith('vertex')]
print(vertices)

Apart from what others have mentioned:
vertices.append([[line[1],line[2],line[3]])
One too many left brackets before line[1], should be:
vertices.append([line[1],line[2],line[3]])
print verticies
Your list is named vertices, not verticies.

str.split() does not modify the string; it produces an entirely new list.
Assign the result of line.split() to line: line = line.split()
Then proceed as normal.
http://www.tutorialspoint.com/python/string_split.htm
This won't solve the problem by itself, though: without the assignment you would still be pulling individual characters out of line rather than whole fields, because strings can be indexed like lists of characters to begin with (see below).
text = "cat"
print(text[1])
>>> 'a'
I suspect that Python never gets past the if line.startswith('vertex'): condition. So as others have said, the core issue probably involves leading space or the file itself.
Also, if you're only reading the file, there's no need to include the access mode "r+". my_file=open("test.stl") works just as well and is more pythonic.

Try to use:
for line in my_file.readlines():
readlines returns a list of all lines in the file.
You don't need to import sys in your case.

Python: How to ignore #comment lines when reading in a file

In Python, I have just read a line from a text file and I'd like to know how to write code that ignores comments with a hash # at the beginning of the line.
I think it should be something like this:
for
if line !contain #
then ...process line
else end for loop
But I'm new to Python and I don't know the syntax

You can use startswith(), e.g.:
for line in open("file"):
    li=line.strip()
    if not li.startswith("#"):
        print(line.rstrip())

I recommend you don't ignore the whole line when you see a # character; just ignore the rest of the line. You can do that easily with a string method function called partition:
with open("filename") as f:
    for line in f:
        line = line.partition('#')[0]
        line = line.rstrip()
        # ... do something with line ...
partition returns a tuple: everything before the partition string, the partition string, and everything after the partition string. So, by indexing with [0] we take just the part before the partition string.
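To make the three-part result concrete, here is a small sketch run on an in-memory list instead of a file; the host names are invented sample data.

```python
lines = ["host01  # primary\n", "# whole-line comment\n", "host02\n"]

cleaned = []
for line in lines:
    # partition always returns three parts: before, separator, after.
    before, sep, after = line.partition('#')
    before = before.rstrip()
    if before:  # whole-line comments leave an empty string behind
        cleaned.append(before)

# When the separator is absent, the whole string comes back as the first part.
no_hash = "host02\n".partition('#')

print(cleaned)
print(no_hash)
```

Because the first element is the full string when no '#' is present, indexing with [0] is safe for every line.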
EDIT:
If you are using a version of Python that doesn't have partition(), here is code you could use:
with open("filename") as f:
    for line in f:
        line = line.split('#', 1)[0]
        line = line.rstrip()
        # ... do something with line ...
This splits the string on a '#' character, then keeps everything before the split. The 1 argument makes the .split() method stop after one split; since we are just grabbing the 0th substring (by indexing with [0]) you would get the same answer without the 1 argument, but this might be a little bit faster. (Simplified from my original code thanks to a comment from @gnr. My original code was messier for no good reason; thanks, @gnr.)
You could also just write your own version of partition(). Here is one called part():
def part(s, s_part):
    i0 = s.find(s_part)
    if i0 == -1:
        # separator not found: behave like str.partition()
        return (s, '', '')
    i1 = i0 + len(s_part)
    return (s[:i0], s[i0:i1], s[i1:])
@dalle noted that '#' can appear inside a string. It's not that easy to handle this case correctly, so I just ignored it, but I should have said something.
If your input file has simple enough rules for quoted strings, this isn't hard. It would be hard if you accepted any legal Python quoted string, because there are single-quoted, double-quoted, multiline quotes with a backslash escaping the end-of-line, triple quoted strings (using either single or double quotes), and even raw strings! The only possible way to correctly handle all that would be a complicated state machine.
But if we limit ourselves to just a simple quoted string, we can handle it with a simple state machine. We can even allow a backslash-quoted double quote inside the string.
c_backslash = '\\'
c_dquote = '"'
c_comment = '#'
def chop_comment(line):
    # a little state machine with two state variables:
    in_quote = False  # whether we are in a quoted string right now
    backslash_escape = False  # true if we just saw a backslash
    for i, ch in enumerate(line):
        if not in_quote and ch == c_comment:
            # not in a quote, saw a '#', it's a comment. Chop it and return!
            return line[:i]
        elif backslash_escape:
            # we must have just seen a backslash; reset that flag and continue
            backslash_escape = False
        elif in_quote and ch == c_backslash:
            # we are in a quote and we see a backslash; escape next char
            backslash_escape = True
        elif ch == c_dquote:
            in_quote = not in_quote
    return line
I didn't really want to get this complicated in a question tagged "beginner" but this state machine is reasonably simple, and I hope it will be interesting.

I'm coming at this late, but the problem of handling shell style (or python style) # comments is a very common one.
I've been using some code almost every time I read a text file.
Problem is that it doesn't handle quoted or escaped comments properly. But it works for simple cases and is easy.
for line in whatever:
    line = line.split('#',1)[0].strip()
    if not line:
        continue
    # process line
A more robust solution is to use shlex:
import shlex
for line in instream:
    lex = shlex.shlex(line)
    lex.whitespace = ''  # if you want to strip newlines, use '\n'
    line = ''.join(list(lex))
    if not line:
        continue
    # process decommented line
This shlex approach not only handles quotes and escapes properly, it adds a lot of cool functionality (like the ability to have files source other files if you want). I haven't tested it for speed on large files, but it is zippy enough for small stuff.
The common case when you're also splitting each input line into fields (on whitespace) is even simpler:
import shlex
for line in instream:
    fields = shlex.split(line, comments=True)
    if not fields:
        continue
    # process list of fields
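For a sense of what shlex.split(line, comments=True) returns, the sketch below runs it over a few invented lines; the host names, port numbers and the quoted label are sample data, not part of the original answer.

```python
import shlex

lines = [
    "# full-line comment",
    "host02 22 'my # label'  # trailing comment",
    "",
    "host03 2222",
]

records = []
for line in lines:
    # comments=True makes shlex drop everything after an unquoted '#'.
    fields = shlex.split(line, comments=True)
    if not fields:  # blank lines and comment-only lines give an empty list
        continue
    records.append(fields)

print(records)
```

The '#' inside the quoted label survives because shlex only treats it as a comment marker outside quotes.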

This is the shortest possible form:
for line in open(filename):
    if line.startswith('#'):
        continue
    # PROCESS LINE HERE
The startswith() method on a string returns True if the string you call it on starts with the string you passed in.
While this is okay in some circumstances like shell scripts, it has two problems. First, it doesn't specify how to open the file. The default mode for opening a file is 'r', which already means 'read the file in text mode', so 'rt' behaves identically; spelling it out simply makes the intent explicit. Under Python 2 the text/binary distinction is irrelevant on UNIX-like operating systems, but it matters on Windows, where text mode translates line endings (and in Python 3 it decides whether you get str or bytes on every platform).
The second problem is the open file handle. The open() function returns a file object, and it's considered good practice to close files when you're done with them. To do that, call the close() method on the object. Now, Python will probably do this for you, eventually; in Python objects are reference-counted, and when an object's reference count goes to zero it gets freed, and at some point after an object is freed Python will call its destructor (a special method called __del__). Note that I said probably: Python has a bad habit of not actually calling the destructor on objects whose reference count drops to zero shortly before the program finishes. I guess it's in a hurry!
For short-lived programs like shell scripts, and particularly for file objects, this doesn't matter. Your operating system will automatically clean up any file handles left open when the program finishes. But if you opened the file, read the contents, then started a long computation without explicitly closing the file handle first, Python is likely to leave the file handle open during your computation. And that's bad practice.
This version will work in any 2.x version of Python, and fixes both the problems I discussed above:
f = open(filename, 'rt')
for line in f:
    if line.startswith('#'):
        continue
    # PROCESS LINE HERE
f.close()
This is the best general form for older versions of Python.
As suggested by steveha, using the "with" statement is now considered best practice. If you're using 2.6 or above you should write it this way:
with open(filename, 'rt') as f:
    for line in f:
        if line.startswith('#'):
            continue
        # PROCESS LINE HERE
The "with" statement will clean up the file handle for you.
In your question you said "lines that start with #", so that's what I've shown you here. If you want to filter out lines that start with optional whitespace and then a '#', you should strip the whitespace before looking for the '#'. In that case, you should change this:
if line.startswith('#'):
to this:
if line.lstrip().startswith('#'):
In Python, strings are immutable, so this doesn't change the value of line. The lstrip() method returns a copy of the string with all its leading whitespace removed.

I've found recently that a generator function does a great job of this. I've used similar functions to skip comment lines, blank lines, etc.
I define my function as
def skip_comments(file):
    for line in file:
        if not line.strip().startswith('#'):
            yield line
That way, I can just do
f = open('testfile')
for line in skip_comments(f):
    print line
This is reusable across all my code, and I can add any additional handling/logging/etc. that I need.

I know that this is an old thread, but this is a generator function that I
use for my own purposes. It strips comments no matter where they
appear in the line, as well as stripping leading/trailing whitespace and
blank lines. The following source text:
# Comment line 1
# Comment line 2
# host01 # This host commented out.
host02 # This host not commented out.
host03
  host04 # Oops! Included leading whitespace in error!
will yield:
host02
host03
host04
Here is documented code, which includes a demo:
def strip_comments(item, *, token='#'):
    """Generator. Strips comments and whitespace from input lines.
    This generator strips comments, leading/trailing whitespace, and
    blank lines from its input.
    Arguments:
        item (obj): Object to strip comments from.
        token (str, optional): Comment delimiter. Defaults to ``#``.
    Yields:
        str: Next uncommented non-blank line from ``item`` with
            comments and leading/trailing whitespace stripped.
    """
    for line in item:
        s = line.split(token, 1)[0].strip()
        if s:
            yield s
if __name__ == '__main__':
    HOSTS = """# Comment line 1
# Comment line 2
# host01 # This host commented out.
host02 # This host not commented out.
host03
  host04 # Oops! Included leading whitespace in error!""".split('\n')
    hosts = strip_comments(HOSTS)
    print('\n'.join(h for h in hosts))
The normal use case will be to strip the comments from a file (i.e., a hosts file, as in my example above). If this is the case, then the tail end of the above code would be modified to:
if __name__ == '__main__':
    with open('aa.txt', 'r') as f:
        hosts = strip_comments(f)
        for host in hosts:
            print('\'%s\'' % host)

A more compact version of a filtering expression can also look like this:
for line in (l for l in open(filename) if not l.startswith('#')):
    # do something with line
(l for ... ) is called a "generator expression", which acts here as a wrapping iterator that filters out all unneeded lines from the file while iterating over it. Don't confuse it with the same thing in square brackets, [l for ... ], which is a "list comprehension" that will first read all the lines from the file into memory and only then start iterating over them.
Sometimes you might want to have it less one-liney and more readable:
lines = open(filename)
lines = (l for l in lines if ... )
# more filters and mappings you might want
for line in lines:
    # do something with line
All the filters will be executed on the fly in one iteration.
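A concrete version of such a chain might look like the sketch below. The filters chosen here (drop comment lines, strip whitespace, drop blank lines) are only examples, and io.StringIO stands in for open(filename).

```python
import io

# Stand-in for a real file; any iterable of lines works the same way.
source = io.StringIO("# comment\n\nalpha \nbeta # note\n")

lines = (l for l in source if not l.startswith('#'))  # drop comment lines
lines = (l.strip() for l in lines)                     # trim whitespace
lines = (l for l in lines if l)                        # drop blank lines

# Nothing has been read yet; the generators run only when consumed.
result = list(lines)
print(result)
```

Each stage pulls one line at a time from the stage before it, so the file is still traversed only once.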

Use a regex such as re.compile(r"^\s*(?:#|$)") with match() to skip blank lines and comment lines.

I tend to use
for line in lines:
    if '#' not in line:
        #do something
This will ignore the whole line, though the answer which uses partition has my upvote, as it keeps any information from before the #.

A good way to get rid of comments that works for both inline comments and whole-line comments:
def clear_coments(f):
    new_text = ''
    for line in f.readlines():
        if "#" in line:
            # keep the text before the comment and restore the newline
            line = line.split("#")[0].rstrip() + "\n"
        new_text += line
    return new_text
