Python - txt file into a list

I'm trying to read a file and save the text inside it as a list of strings
(each item in the list is a string holding one line of the text).
I only managed to print the text line by line, not as a list.
The text is a long list of biological data (random letters, if you ask me (: ).
This code:
def read_proteome(filename):
    f = open(filename).readlines()
    for line in f:
        print(line)
only printed the file (it did separate the lines).
Where did I go wrong?
How do I get it into a list?

The result of .readlines() is a list. Just print f:
print(f)
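A minimal sketch of the same function returning the list instead of printing it. The name read_proteome comes from the question; the sample contents and the temporary file are made up here so that the snippet runs on its own, and stripping the trailing newline is an optional extra the question did not ask for.

```python
import os
import tempfile


def read_proteome(filename):
    """Return the lines of the file as a list of strings, without trailing newlines."""
    with open(filename) as f:
        return [line.rstrip("\n") for line in f]


# Hypothetical sample data, written to a temporary file so the sketch is runnable.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("MKVLA\nGHTRE\nPLQSW\n")
    path = tmp.name

proteome = read_proteome(path)
os.remove(path)
print(proteome)
```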

Martijn Pieters already gives a simple and complete answer, but it's worth learning how to figure these things out yourself. It's usually faster, and it doesn't cost you a half-dozen downvotes.
First, the fact that you can use for line in f: means that f is obviously some kind of list-ish object, in that it can be used in a for loop the same way a list can. Technically, this means it's an "iterable".
Maybe this means it's already a list? If so, you're already done. If not, the list function takes any iterable and makes it into a list, so, you can just add f = list(f) and you're done.
How do you find out which?
Well, you can add print(type(f)) to your code and see what it prints out. If it says list, you're done; if it says anything else, you need to add the conversion line f = list(f).
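As a rough illustration of that check, the sketch below uses io.StringIO as a stand-in for an open text file (an assumption, so that no real file is needed). The object is iterable but is not a list, and list() turns it into one.

```python
import io

# io.StringIO stands in for an open text file (assumption, keeps this self-contained).
fileobj = io.StringIO("first line\nsecond line\nlast line\n")

print(type(fileobj))              # some file-like type, not list
print(isinstance(fileobj, list))  # False

lines = list(fileobj)             # list() consumes any iterable
print(type(lines))
print(lines)
```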
It's often easier to do this all interactively, rather than in a script:
>>> f = open(filename).readlines()
>>> type(f)
list
>>> f
['first line\n', 'second line\n', 'last line\n']
If you break this down into pieces, you can see the types of each piece separately:
>>> filename = 'C:/foo.txt'
>>> type(filename)
str
>>> fileobj = open(filename)
>>> type(fileobj)
_io.TextIOWrapper
>>> lines = fileobj.readlines()
>>> type(lines)
list
When you put this all together in one line, lines=open('C:/foo.txt').readlines(), the end result is the same as if you did it in three steps—lines is a list.
But what if you can't figure something out by experimenting, because you don't know what to try?
Well, the interactive interpreter has built-in help:
>>> fileobj = open(filename)
>>> help(fileobj.readlines)
Help on built-in function readlines:
readlines(...)
Return a list of lines from the stream.
hint can be specified to control the number of lines read: no more
lines will be read if the total size (in bytes/characters) of all
lines so far exceeds hint.
It says right there that it returns a list.
Or, you can look at the documentation. Trying to guess where readlines might be in 3.x is actually a bit complicated, because the type of thing open returns is not obvious… but you can just use the "quick search" on the left, and you'll find io.IOBase.readlines, which gives you the same answer:
readlines(hint=-1)
Read and return a list of lines from the stream. hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.

As Martijn said, .readlines() reads the file and returns its lines as a list. However, this is what I think you are trying to do:
with open("file.txt", 'r') as file:
    print(file.read().split("\n"))
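One detail worth knowing about this approach: when the text ends with a newline, split("\n") leaves an empty string at the end of the list, while str.splitlines() does not. A small sketch with a made-up string in place of the file contents:

```python
# Hypothetical file contents, ending with a newline as most text files do.
text = "first line\nsecond line\n"

with_split = text.split("\n")     # keeps an empty last item
with_splitlines = text.splitlines()

print(with_split)       # ['first line', 'second line', '']
print(with_splitlines)  # ['first line', 'second line']
```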

As Martijn said, the readlines() method returns a list, as stated in the Python documentation.
If what you need is to convert the string representation of a list into a real list, use ast.literal_eval:
import ast
stringList = "['a', 'text']"
realList = ast.literal_eval(stringList)
print(realList)
Hope this is what you need!

Related

Amend list from file - Correct syntax and file format?

I currently have a list hard coded into my python code. As it keeps expanding, I wanted to make it more dynamic by reading the list from a file. I have read through many articles about how to do this, but in practice I can't get this working. So firstly, here is an example of the existing hardcoded list:
serverlist = []
serverlist.append(("abc.com", "abc"))
serverlist.append(("def.com", "def"))
serverlist.append(("hji.com", "hji"))
When I enter the command 'print serverlist' the output is shown below and my list works perfectly when I access it:
[('abc.com', 'abc'), ('def.com', 'def'), ('hji.com', 'hji')]
Now I've replaced the above code with the following:
serverlist = []
with open('/server.list', 'r') as f:
    serverlist = [line.rstrip('\n') for line in f]
With the contents of server.list being:
'abc.com', 'abc'
'def.com', 'def'
'hji.com', 'hji'
When I now enter the command print serverlist, the output is shown below:
["'abc.com', 'abc'", "'def.com', 'def'", "'hji.com', 'hji'"]
And the list is not working correctly. So what exactly am I doing wrong? Am I reading the file incorrectly or am I formatting the file incorrectly? Or something else?
The contents of the file are not interpreted as Python code. When you read a line in f, it is a string; and the quotation marks, commas etc. in your file are just those characters as parts of a string.
If you want to create some other data structure from the string, you need to parse it. The program has no way to know that you want to turn the string "'abc.com', 'abc'" into the tuple ('abc.com', 'abc'), unless you instruct it to.
This is the point where the question becomes "too broad".
If you are in control of the file contents, then you can simplify the data format to make this more straightforward. For example, if you just have abc.com abc on the line of the file, so that your string ends up as 'abc.com abc', you can then just .split() that; this assumes that you don't need to represent whitespace inside either of the two items. You could instead split on another character (like the comma, in your case) if necessary (.split(',')). If you need a general-purpose hammer, you might want to look into JSON. There is also ast.literal_eval which can be used to treat text as simple Python literal expressions - in this case, you would need the lines of the file to include the enclosing parentheses as well.
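The three options mentioned above can be sketched side by side. The sample lines are hypothetical stand-ins for lines read from the file, and the JSON layout (one array per line) is just one possible format, not something the question specified.

```python
import ast
import json

# Option 1: plain whitespace-separated fields, parsed with str.split().
plain_line = "abc.com abc\n"
from_split = tuple(plain_line.split())

# Option 2: one JSON array per line (a hypothetical file format).
json_line = '["abc.com", "abc"]\n'
from_json = tuple(json.loads(json_line))

# Option 3: a Python tuple literal per line, including the parentheses.
literal_line = "('abc.com', 'abc')\n"
from_literal = ast.literal_eval(literal_line.strip())

print(from_split, from_json, from_literal)
```

All three end up as the same tuple, so the choice mostly depends on who writes the file and how much structure it needs to carry.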
If you are willing to let go of the quotes in your file and rewrite it as
abc.com, abc
def.com, def
hji.com, hji
the code to load can be reduced to a one-liner using the fact that files are iterable:
with open('servers.list') as f:
    servers = [tuple(line.rstrip('\n').split(', ')) for line in f]
Remember that iterating over a file does not strip the newlines, so each line has to be stripped before it is split.
You can allow arbitrary whitespace by doing something like
    servers = [tuple(word.strip() for word in line.split(',')) for line in f]
It might be easier to use something like regex to parse the original format. You could use an expression that captures the parts of the line you care about and matches but discards the rest:
import re
pattern = re.compile(r"'(.+)',\s*'(.+)'")
You could then extract the names from the matched groups (the line is stripped first, because fullmatch would otherwise fail on the trailing newline):
with open('servers.list') as f:
    servers = [pattern.fullmatch(line.strip()).groups() for line in f]
This is just a trivialized example. You can make it as complicated as you wish for your real file format.
Try this:
serverlist = []
with open('/server.list', 'r') as f:
    for line in f:
        serverlist.append(tuple(part.strip(" '") for part in line.rstrip('\n').split(',')))
Explanation
You want an explicit for loop so you cycle through each line as expected.
You need list.append for each line to append to your list.
You need to use split(',') in order to split by commas.
You need strip(" '") on each part to drop the spaces and quotes that surround it in the file.
Convert to tuple as this is your desired output.
List comprehension method
The for loop can be condensed as below:
with open('/server.list', 'r') as f:
    serverlist = [tuple(part.strip(" '") for part in line.rstrip('\n').split(',')) for line in f]

Removing an imported text file (Python)

I'm trying to remove a couple of lines from a text file that I imported from my Kindle. The text looks like:
Shall I come to you?
Nicholls David, One Day, loc. 876-876
Dexter looked up at the window of the flat where Emma used to live.
Nicholls David, One Day, loc. 883-884
I want to grab the bin bag and do a forensics
Sophie Kinsella, I've Got Your Number, loc. 64-64
The complete file is longer; this is just a piece of the document. The aim of my code is to remove all lines that contain "loc. " so that just the extracts remain. My target can also be seen as removing the line just before each blank line.
My code so far looks like this:
f = open('clippings_export.txt','r', encoding='utf-8')
message = f.read()
line=message[0:400]
f.close()
key=["l","o","c","."," "]
for i in range(0,len(line)-5):
    if line[i]==key[0]:
        if line[i+1]==key[1]:
            if line[i + 2]==key[2]:
                if line[i + 3]==key[3]:
                    if line[i + 4]==key[4]:
                        print(i)  # index where "loc. " starts
The last if finds exactly the position (index) where each "loc. " is located in the file. However, after this stage I do not know how to go back in the line so that the code finds where the line starts and the whole line can be removed. What could I do next? Would you recommend another way to remove these lines?
Thanks in advance!
I think that the question might be a bit misleading!
Anyway, if you simply want to remove those lines, you need to check whether they contain the "loc." substring. Probably the easiest way is to use the in operator.
Instead of getting the whole file from read(), read the file line by line (using readlines(), for example). You can then check whether each line contains your key and omit it if it does.
Since the result is now a list of strings, you might want to merge it with str.join().
Here I used another list to store the desired lines; you could also use the "more pythonic" filter() or a list comprehension (there is an example in the similar question I mention below).
f = open('clippings_export.txt','r', encoding='utf-8')
lines = f.readlines()
f.close()
filtered_lines = []
for line in lines:
    if "loc." in line:
        continue
    else:
        filtered_lines.append(line)
result = ""
result = result.join(filtered_lines)
By the way, I thought it might be a duplicate - here's a question about the opposite (that is, wanting the lines which contain the key).
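For comparison, the list-comprehension variant could look roughly like this. io.StringIO with a short excerpt stands in for clippings_export.txt (an assumption, so the sketch runs without the real file), and the key "loc. " is taken from the question.

```python
import io

# Hypothetical excerpt in the format from the question; StringIO stands in for the file.
sample = io.StringIO(
    "Shall I come to you?\n"
    "Nicholls David, One Day, loc. 876-876\n"
    "\n"
    "Dexter looked up at the window of the flat where Emma used to live.\n"
    "Nicholls David, One Day, loc. 883-884\n"
)

# Keep every line that does not contain the marker.
kept = [line for line in sample if "loc. " not in line]
result = "".join(kept)
print(result)
```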

Unexpected output from textfile - cleaning read in lines correctly

I am trying to use a very basic text file as a settings file. Three lines repeat in this order/format that govern some settings/input for my program. Text file is as follows:
Facebook
1#3#5#2
Header1#Header2#Header3#Header4
...
This is read in using the following Python code:
f = open('settings.txt', 'r')
for row in f:
    platform = f.readline()
    rows_to_keep = int(f.readline().split('#'))
    row_headers = f.readline().split('#')
    clean_output(rows_to_keep, row_headers, platform)
I would expect single string to be read in platform, an array of ints in the second and an array of strings in the third. These are then passed to the function and this is repeated numerous times.
However, the following three things are happening:
Int doesn't convert and I get a TypeError
First line in text file is ignored and I get rows to keep in platform
\n at the end of each line
I suspect these are related and so am only posting one question.
You cannot call int on a list; you need to do some kind of list comprehension like
rows_to_keep = [int(a) for a in f.readline().split('#')]
You're reading a line with the for loop, then reading another line from the file with readline(). You should either do some kind of slicing (see Python how to read N number of lines at a time) or call a function with the three lines after every third iteration.
Use .strip() to remove line endings and other whitespace.
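One way to read the file three lines at a time is sketched below with itertools.islice. The helper read_groups and the second (Twitter) block of settings are invented for illustration, and io.StringIO stands in for settings.txt; the call to clean_output is left out because that function is not shown in the question.

```python
import io
from itertools import islice

# Stand-in for settings.txt; the second group is hypothetical sample data.
settings = io.StringIO(
    "Facebook\n"
    "1#3#5#2\n"
    "Header1#Header2#Header3#Header4\n"
    "Twitter\n"
    "2#4\n"
    "HeaderA#HeaderB\n"
)


def read_groups(f, size=3):
    """Yield lists of `size` stripped lines until the file is exhausted."""
    while True:
        group = [line.strip() for line in islice(f, size)]
        if len(group) < size:
            return
        yield group


parsed = []
for platform, rows, headers in read_groups(settings):
    rows_to_keep = [int(x) for x in rows.split("#")]  # convert each element, not the list
    row_headers = headers.split("#")
    parsed.append((platform, rows_to_keep, row_headers))

print(parsed)
```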
Try this:
with open('settings.txt', 'r') as f:
    platform, rows_to_keep, row_headers = f.read().splitlines()
rows_to_keep = [int(x) for x in rows_to_keep.split('#')]
row_headers = row_headers.split('#')
clean_output(rows_to_keep, row_headers, platform)
There are several things going on here. First, when you do the split on the second line, you're trying to convert a list to int. That won't work. You can, instead, use map (wrapped in list() so that you also get a list on Python 3).
rows_to_keep = list(map(int, f.readline().strip().split("#")))
Additionally, you see the strip() method above. That removes leading and trailing whitespace characters from your line, i.e. \n.
Try that change and also using strip() on each readline() call.
With as few changes as possible, I've attempted to solve your issues and show you where you went wrong. @Daniel's answer is how I would personally solve the issues.
f = open('settings.txt', 'r')
#See 1. We remove the unnecessary for loop
platform = f.readline()
#See 4. We make sure there are no unwanted leading or trailing characters by stripping them out
rows_to_keep = f.readline().strip().split('#')
#See 3. The enumerate function yields (index, value) pairs
for row in enumerate(rows_to_keep):
    rows_to_keep[row[0]] = int(row[1])
row_headers = f.readline().strip().split('#')
#See 2. We close the file when we're done reading
f.close()
clean_output(rows_to_keep, row_headers, platform)
1. You don't need (and don't want) a for loop on f, as well as calls to readline. You should pick one or the other.
2. You need to close f with f.close().
3. You cannot convert a list to an int, you want to convert the elements in the list to int. This can be accomplished with a for loop.
4. You probably want to call .strip to get rid of trailing newlines.

Python read file from command line and strip "\n\r" with very large files

I am learning python for the first time and I've just learned that readlines() is incredibly slow and taxing on memory. This would be fine, but as I am programming for a data structures class with up to 10^6 inputs, I believe that runtime is very important.
This is what I have so far that works. I did not strip the '\r' yet.
def generateListOfPoints(stuff):
    List = open(stuff).readlines()
    a = []
    for i in range(len(List)):
        a.append(List[i].rstrip('\n').split(","))
    return a
This is what I tried to do with a for loop (which I heard was better), but all I'm getting is errors and I don't know what is going on.
def generateListOfPoints(stuff):
    a = []
    with open(stuff) as f:
        for line in f:
            a.append(stuff.rstrip('\n').rstrip('\r').split(","))
    return a
Replace stuff with line. stuff is simply the file path; the actual content is in line, the variable used for iterating over the file object f:
a.append(line.rstrip('\n').split(","))
You might like to store the list formed after using split on line, as a tuple instead, such that a would be a list of tuples, where each tuple would correspond to a line in the file. You can do that using:
a.append(tuple(line.rstrip('\n').split(",")))
Make sure to name your variables so they make sense. Naming something stuff is convenient but obviously leads to errors. The example below renames this to filename and fixes appending line to the list instead of the filename.
Also, the rstrip function takes a set of characters to strip, so you can strip both \r and \n in one function call. So you would have:
def generateListOfPoints(filename):
    a = []
    with open(filename) as f:
        for line in f:
            a.append(line.rstrip('\r\n').split(","))
    return a
This will create a list of lists. If you want to flatten out the inner list in your solution, you will want to use extend instead of append.
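The difference between the two is easy to see on a couple of made-up lines (the rows list below is sample data, not from the question):

```python
# Hypothetical lines, already stripped of their line endings.
rows = ["1,2", "3,4"]

nested = []  # built with append: one inner list per line
flat = []    # built with extend: all fields in a single list

for line in rows:
    fields = line.split(",")
    nested.append(fields)
    flat.extend(fields)

print(nested)  # [['1', '2'], ['3', '4']]
print(flat)    # ['1', '2', '3', '4']
```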
I suggest you experiment using the command line interpreter. This makes it easy to learn how rstrip and split work. Assuming you start using the line variable as suggested, you are probably still not appending what you want to the list a. Also, you can strip both \n and \r with one call to rstrip.
python
>>> a = []
>>> line = "this,is,a,test\n\r"
>>> line.rstrip('\n\r')
'this,is,a,test'
>>> line.rstrip('\n\r').split(',')
['this', 'is', 'a', 'test']
>>> a.append(line.rstrip('\n\r').split(','))
>>> a
[['this', 'is', 'a', 'test']]

Iterate over a portion of a list in a list comprehension

I'd like to print out the first 10 lines of a file and avoid reading in any extra lines. How can I do that with a list comprehension without reading in the whole file?
I know that I can do the code like this:
N = 10
with open(path,'rb') as f_in:
    for line in f_in:
        print line.strip()
        N -= 1
        if N == 0:
            break
But I think a list comprehension is more appropriate:
with open(path,'rb') as f_in:
    [print line for i, line in enumerate(f_in) if i<N]
However, that doesn't work because of the print statement so i end up with this mess:
with open(path,'rb') as f_in:
    lines = [line.strip() for i, line in enumerate(f_in) if i<N]
for line in lines:
    print line
And the real point of my question is how do you get the list comprehension to stop when i==N instead of needlessly continuing and only filtering out the extra lines?
Is there a way to limit how far into an iterator a list comprehension will go? And is there an appropriate way to print out from a list comprehension? I'm fairly new to Python, so I'm trying to learn how to do things the right way rather than just the first way I can think of. I'd like to be able to write this in a pythonic way.
how do you get the list comprehension to stop when i==N instead of
needlessly continuing and only filtering out the extra lines?
Is there a way to limit how far into an iterator a list comprehension will go?
You can use itertools.islice to iterate over a slice of an iterable:
from itertools import islice
with open(path,'rb') as f_in:
    for line in islice(f_in, N):
        print line.strip()
Actually you can specify the index of the first line to produce and even a step (like list or string slicing).
Note that you shouldn't use a list-comprehension if you don't actually need a list, because it consumes memory (in your case you keep all the contents of the file in memory, which can be bad if the file is big).
If you simply want to iterate once over something use a generator expression:
lines = (line.strip() for line in f_in)
(Yes, you simply have to replace the [] with ()).
This avoids building the whole list when executed.
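Combining the two ideas, a generator expression wrapped in islice only pulls as many lines as are asked for. The sketch below is written for Python 3 and uses io.StringIO with 100 generated lines in place of the real file (both assumptions); the final readline() shows that reading stopped after the first N lines.

```python
import io
from itertools import islice

# Stand-in for the open file: 100 hypothetical lines.
f_in = io.StringIO("".join(f"line {i}\n" for i in range(100)))
N = 10

stripped = (line.strip() for line in f_in)  # lazy: nothing is read yet
first_n = list(islice(stripped, N))         # pulls exactly N lines through the generator

next_line = f_in.readline()                 # the source is positioned at line N
print(first_n)
print(repr(next_line))
```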
is there an appropriate way to print out from a list comprehension?
No.
In python2 print is a statement and thus it cannot be present in an expression
In python3 you could call print since it is a function, but it is a very bad idea.
List-comprehensions have a specific purpose: build a list from a given iterable.
You are throwing the list away, thus defeating the whole purpose of that syntax.
For this reason there is no support for "breaking" out of the loop in a list-comprehension. If you have a code so complex to require a break you'd better write it with an explicit for loop.
The same is true if you tried to do something like calling map:
map(lambda line: print line, lines)
(assuming it were possible to put a print statement in a lambda).
This fails even in Python 3, where print is a function: map is lazy there, so nothing is printed unless the result is consumed.
If you want to write good python code the number one rule is to follow the language design:
don't mix expression and statements, that is to say: use expression return values, don't abuse them to produce side-effects.
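A small sketch of the Python 3 behaviour, assuming Python 3 and using a made-up seen list in place of print so that the effect can be checked: map returns a lazy iterator, so the function is not called until something consumes it, which is one more reason to use a plain for loop for side effects.

```python
lines = ["first", "second"]
seen = []  # hypothetical stand-in for printed output

lazy = map(seen.append, lines)  # nothing is called yet: map is lazy in Python 3
called_before = list(seen)      # snapshot taken before any loop runs: still empty

for line in lines:              # a plain loop is the clear way to get side effects
    seen.append(line)

print(called_before, seen)
```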
You can also call next() on the file object in the range of lines you require:
lines = [f_in.next() for x in range(10)]
This will give you the first ten lines.
Using next() can be useful if you want to skip headers or other lines at the start of your file. Each time you call next on the file object you will move to the next line of the file.
If you wanted to print the contents of lines you could use join():
print "".join(lines)
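In Python 3 the file object has no .next() method, and the built-in next() function is used instead. A rough equivalent, with io.StringIO and made-up contents standing in for the file:

```python
import io

# Stand-in for the open file, with a header line to skip (hypothetical contents).
f_in = io.StringIO("header\nrow 1\nrow 2\nrow 3\n")

header = next(f_in)                     # consume and skip the header line
lines = [next(f_in) for _ in range(2)]  # take the next two lines

print("".join(lines), end="")
```

Note that next() raises StopIteration if the file has fewer lines than requested, so this suits cases where the number of lines is known to be available.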
