Using a text file in Python

I'm trying to take a text file and use only the first 30 lines of it in Python.
This is what I wrote:
text = open("myText.txt")
lines = text.readlines(30)
print lines
For some reason I get more than 150 lines when I print.
What am I doing wrong?

Use itertools.islice
import itertools

for line in itertools.islice(open("myText.txt"), 0, 30):
    print line

If you are going to process your lines individually, an alternative could be to use a loop:
file = open('myText.txt')
for i in range(30):
    line = file.readline()
    # do stuff with line here
EDIT: some of the comments below express concern about this method assuming there are at least 30 lines in the file. If that is an issue for your application, you can check the value of line before processing. readline() will return an empty string '' once EOF has been reached:
for i in range(30):
    line = file.readline()
    if line == '':  # note that an empty line will return '\n', not ''!
        break
    # do stuff with line here

The sizehint argument for readlines isn't what you think it is (bytes, not lines).
If you really want to use readlines, try text.readlines()[:30] instead.
Do note that this is inefficient for large files as it first creates a list containing the whole file before returning a slice of it.
A straightforward solution would be to use readline within a loop (as shown in mac's answer).
To handle files of various sizes (more or less than 30), Andrew's answer provides a robust solution using itertools.islice(). To achieve similar results without itertools, consider:
output = [line for _, line in zip(range(30), open("yourfile.txt", "r"))]
or as a generator expression (Python 2.4+):
output = (line for _, line in zip(range(30), open("yourfile.txt", "r")))
for line in output:
    pass  # do something with line

The argument to readlines is a size hint: roughly how many bytes to read, not how many lines, and it is rounded up to an internal buffer size, which is why a hint of 30 came back with 150+ lines.
Doing it with a for loop instead will give you proper results. Unfortunately, there doesn't seem to be a better built-in function for that.
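For example, a minimal sketch of the loop approach (this also copes with a file that has fewer than 30 lines):
lines = []
with open("myText.txt") as text:
    for i, line in enumerate(text):
        if i >= 30:          # stop after the first 30 lines
            break
        lines.append(line)
print(lines)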


First line of the file is missing while reading

The first line is missing while reading from a file and transferring the lines to a new one, and I'm wondering why. I found a workaround by adding
x = flower1.readline()
y = flower2.readline()
and writing those to the file, but I still did not understand why the first lines are missing in the first version of my code:
flower1 = open('/Users/berketurer/Desktop/week5_flower1.txt','r')
flower2 = open('/Users/berketurer/Desktop/week5_flower2.txt', 'r')
newFile = open('/Users/berketurer/Desktop/FlowerCombined.txt','w')

for line in flower1:
    lines = flower1.read()
    newFile.write(lines)

newFile.write('\n')

for line in flower2:
    lines = flower2.read()
    newFile.write(lines)

flower1.close()
flower2.close()
newFile.close()
That loop makes no sense:
for line in flower1:
    lines = flower1.read()
    newFile.write(lines)
This iterates over the lines of the file, and inside the loop you read the whole file!
The first line is missing because Python has already consumed it in the first iteration of for line in flower1. Then read() reads the rest of the file and writes it, and the loop ends because the end of the file has been reached. Note that line is never used in the code, which is usually not a good sign.
Just remove the loop:
lines = flower1.read()
newFile.write(lines)
If you want to process line by line for some reason (may come in handy if you want to filter out some lines) keep the loop but don't read in the loop:
for line in flower1:
    newFile.write(line)
There's also this one-liner: newFile.writelines(flower1) if you don't need filtering.
There are better ways to concatenate files (binary or text), though. I like the approach that uses shutil.copyfileobj: even if the files are huge, the memory footprint stays moderate, and performance-wise, since the block size is larger than one line, the speed may be slightly better.
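For reference, a minimal sketch of the shutil.copyfileobj approach, reusing the file names from the question:
import shutil

with open('/Users/berketurer/Desktop/FlowerCombined.txt', 'wb') as out_file:
    for name in ('/Users/berketurer/Desktop/week5_flower1.txt',
                 '/Users/berketurer/Desktop/week5_flower2.txt'):
        with open(name, 'rb') as in_file:
            shutil.copyfileobj(in_file, out_file)  # copies in fixed-size chunks, not line by line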

How to read a specific range of lines in an external file in python?

Let's say you have a Python file with 50 lines of code in it, and you want to read a specific range of lines into a list. If you want to read ALL the lines in the file, you can just use the code from this answer:
with open('yourfile.py') as f:
    content = f.readlines()
print(content)
But what if you want to read a specific range of lines, like reading line 23-27?
I tried this, but it doesn't work:
f.readlines(23:27)
You were close. readlines returns a list and you can slice that, but it's invalid syntax to try and pass the slice directly in the function call.
f.readlines()[23:27]
If the file is very large, avoid the memory overhead of reading the entire file:
start, stop = 23, 27
for i in range(start):
    next(f)
content = []
for i in range(stop - start):
    content.append(next(f))
Try this:
sublines = content[23:27]
If there are lots and lots of lines in your file, I believe you should consider using f.readline() (without an s) 27 times, and only save the lines starting wherever you want. :)
Otherwise, the other answer's solution is what I would have done too, meaning f.readlines()[23:28]; 28 because, as far as I remember, the upper bound of a slice is excluded.
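A minimal sketch of that readline() idea, assuming 0-based line numbers and that the file may have fewer lines than expected:
wanted = []
with open('yourfile.py') as f:
    for i in range(27):
        line = f.readline()
        if not line:        # fewer than 27 lines in the file
            break
        if i >= 23:
            wanted.append(line)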

Not understanding read command in Python

I'm trying to understand what's going on with my read function. I'm simply doing a readline of a text document I created in Canopy. For some reason it only gives me w for whatever value I put in. I'm new to the world of Python so I'm sure it's an easy answer! Thanks for your help!
import os
my_file = open(os.path.expanduser("~/Desktop/Python Files/Test Text.txt"),'r')
print my_file.readline(3)
my_file.close()
My text document is below:
w
o
r
d
s
my_file.readline(3) reads up to 3 bytes from the first line.
The first line contains a w and an end-of-line character.
If you want to read up to the first 3 bytes regardless of the line, use my_file.read(3). Note that end-of-line characters are included in the count.
If you want to print the first 3 lines, you could use
import os

with open(os.path.expanduser("~/Desktop/Python Files/Test Text.txt"), 'r') as my_file:
    for i, line in enumerate(my_file):
        if i >= 3: break
        print(line)
or
import itertools as IT

with open(os.path.expanduser("~/Desktop/Python Files/Test Text.txt"), 'r') as my_file:
    for line in IT.islice(my_file, 3):
        print(line)
For short files you could instead use
with open(os.path.expanduser("~/Desktop/Python Files/Test Text.txt"), 'r') as my_file:
    lines = my_file.readlines()
    for line in lines[:3]:
        print(line)
but note that my_file.readlines() returns a list of all the lines in the
file. Since this can be very memory-intensive if the file is huge, and since it
is usually possible to process a file line-by-line (which is much less
memory-intensive), generally the first two methods of reading a file are
preferred over the third.
From the docstring: readline([size]) -> next line from the file, as a string. Retain newline. A non-negative size argument limits the maximum number of bytes to return (an incomplete line may be returned then). Return an empty string at EOF.
readline reads the next line each time it is called, and the size argument is the maximum number of bytes to read from that line.
Using f.readline does not give random access to the file. I think you want to read the third (or maybe fourth if you're zero-indexing) line. The argument that you're passing to f.readline is a maximum byte count to read, rather than a specific line to read.

Reading from text file into python list

Very new to python and can't understand why this isn't working. I have a list of web addresses stored line by line in a text file. I want to store the first 10 in an array/list called bing, the next 10 in a list called yahoo, and the last 10 in a list called duckgo. I'm using the readlines function to read the data from the file into each array. The problem is nothing is being written to the lists. The count is incrementing like it should. Also, if I remove the loops altogether and just read the whole text file into one list it works perfectly. This leads me to believe that the loops are causing the problem. The code I am using is below. Would really appreciate some feedback.
count=0;
#Open the file
fo=open("results.txt","r")
#read into each array
while(count<30):
    if(count<10):
        bing = fo.readlines()
        count+=1
        print bing
        print count
    elif(count>=10 and count<=19):
        yahoo = fo.readlines()
        count+=1
        print count
    elif(count>=20 and count<=29):
        duckgo = fo.readlines()
        count+=1
        print count

print bing
print yahoo
print duckgo
fo.close
You're using readlines to read the files. readlines reads all of the lines at once, so the very first time through your loop, you exhaust the entire file and store the result in bing. Then, every time through the loop, you overwrite bing, yahoo, or duckgo with the (empty) result of the next readlines call. So your lists all wind up being empty.
There are lots of ways to fix this. Among other things, you should consider reading the file a line at a time, with readline (no 's'). Or better yet, you could iterate over the file, line by line, simply by using a for loop:
for line in fo:
    ...
To keep the structure of your current code you could use enumerate:
for line_number, line in enumerate(fo):
    if condition(line_number):
        ...
But frankly I think you should ditch your current system. A much simpler way would be to use readlines without a loop, and slice the resulting list!
lines = fo.readlines()
bing = lines[0:10]
yahoo = lines[10:20]
duckgo = lines[20:30]
There are many other ways to do this, and some might be better, but none are simpler!
readlines() reads all of the lines of the file. If you call it again, you get an empty list, so you are overwriting your lists with empty data as you iterate through your loop.
You should be using readline() instead of readlines()
readlines() reads the entire file in at once, whereas readline() reads a single line from the file.
I suggest you rewrite it like so:
bing = []
yahoo = []
duckgo = []
with open("results.txt", "r") as f:
    for i, line in enumerate(f):
        if i < 10:
            bing.append(line)
        elif i < 20:
            yahoo.append(line)
        elif i < 30:
            duckgo.append(line)
        else:
            raise RuntimeError("too many lines in input file")
Note how we use enumerate() to get a running count of lines, rather than making our own count variable and needing to increment it ourselves. This is considered good style in Python.
But I think the best way to solve this problem would be to use itertools like so:
import itertools as it

with open("results.txt", "r") as f:
    bing = list(it.islice(f, 10))
    yahoo = list(it.islice(f, 10))
    duckgo = list(it.islice(f, 10))
    if list(it.islice(f, 1)):
        raise RuntimeError("too many lines in input file")
itertools.islice() (or it.islice() since I did the import itertools as it) will pull a specified number of items from an iterator. Our open file-handle object f is an iterator that returns lines from the file, so it.islice(f, 10) pulls exactly 10 lines from the input file.
Because it.islice() returns an iterator, we must explicitly expand it out to a list by wrapping it in list().
I think this is the simplest way to do it. It perfectly expresses what we want: for each one, we want a list with 10 lines from the file. There is no need to keep a counter at all, just pull the 10 lines each time!
EDIT: The check for extra lines now uses it.islice(f, 1) so that it will only pull a single line. Even one extra line is enough to know that there are more than just the 30 expected lines, and this way if someone accidentally runs this code on a very large file, it won't try to slurp the whole file into memory.

How to jump to a particular line in a huge text file?

Are there any alternatives to the code below:
startFromLine = 141978  # or whatever line I need to jump to
urlsfile = open(filename, "rb", 0)
linesCounter = 1
for line in urlsfile:
    if linesCounter > startFromLine:
        DoSomethingWithThisLine(line)
    linesCounter += 1
I'm processing a huge text file (~15 MB) with lines of unknown but varying length, and I need to jump to a particular line whose number I know in advance. I feel bad about processing the lines one by one when I know I could ignore at least the first half of the file. I'm looking for a more elegant solution, if there is one.
You can't jump ahead without reading in the file at least once, since you don't know where the line breaks are. You could do something like:
# Read in the file once and build a list of line offsets
line_offset = []
offset = 0
for line in file:
    line_offset.append(offset)
    offset += len(line)
file.seek(0)

# Now, to skip to line n (with the first line being line 0), just do
file.seek(line_offset[n])
linecache:
The linecache module allows one to get any line from a Python source file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file. This is used by the traceback module to retrieve source lines for inclusion in the formatted traceback...
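For reference, a minimal usage sketch (linecache reads and caches the whole file on first access, and line numbers are 1-based):
import linecache

line = linecache.getline(filename, 141978)  # returns '' if the line does not exist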
You don't really have that many options if the lines are of different length... you sadly need to process the line ending characters to know when you've progressed to the next line.
You can, however, dramatically speed this up AND reduce memory usage by changing the last parameter to "open" to something not 0.
0 means the file reading operation is unbuffered, which is very slow and disk intensive. 1 means the file is line buffered, which would be an improvement. Anything above 1 (say 8 kB, i.e. 8192, or higher) reads chunks of the file into memory. You still access it through for line in open(etc):, but Python only goes a bit at a time, discarding each buffered chunk after it's processed.
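For example, here is the question's loop unchanged except for the buffering argument (8192 is just one reasonable choice):
urlsfile = open(filename, "rb", 8192)  # 8 kB buffer instead of 0 (unbuffered)
linesCounter = 1
for line in urlsfile:
    if linesCounter > startFromLine:
        DoSomethingWithThisLine(line)
    linesCounter += 1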
I am surprised no one mentioned islice:
import itertools

line = next(itertools.islice(Fhandle, index_of_interest, index_of_interest + 1), None)  # just the one line
or if you want the whole rest of the file
rest_of_file = itertools.islice(Fhandle, index_of_interest, None)
for line in rest_of_file:
    print line
or if you want every other line from the file
rest_of_file = itertools.islice(Fhandle, index_of_interest, None, 2)
for odd_line in rest_of_file:
    print odd_line
I'm probably spoiled by abundant RAM, but 15 MB is not huge. Reading into memory with readlines() is what I usually do with files of this size. Accessing a line after that is trivial.
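Roughly (a sketch; n stands for whichever 0-based line number you want):
with open(filename) as f:
    lines = f.readlines()  # the whole ~15 MB file as a list of lines
line = lines[n]            # constant-time access from here on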
Since there is no way to determine the length of all lines without reading them, you have no choice but to iterate over all lines before your starting line. All you can do is make it look nice. If the file is really huge then you might want to use a generator-based approach:
from itertools import dropwhile

def iterate_from_line(f, start_from_line):
    return (l for i, l in dropwhile(lambda x: x[0] < start_from_line, enumerate(f)))

for line in iterate_from_line(open(filename, "r", 0), 141978):
    DoSomethingWithThisLine(line)
Note: the index is zero-based in this approach.
I have had the same problem (needing to retrieve a specific line from a huge file).
Of course, I can run through all the records in the file each time and stop when the counter equals the target line, but that does not work efficiently when you want to obtain several specific rows. That leaves the main issue to be solved: how to seek directly to the required place in the file.
Here is the solution I found:
First, I fill a dictionary with the start position of each line (the key is the line number, and the value is the cumulative length of the previous lines).
t = open(file, 'r')
dict_pos = {}
kolvo = 0
length = 0
for each in t:
    dict_pos[kolvo] = length
    length = length + len(each)
    kolvo = kolvo + 1
Finally, the lookup function:
def give_line(line_number):
    t.seek(dict_pos.get(line_number))
    line = t.readline()
    return line
t.seek(dict_pos[line_number]) is the call that moves the file pointer to the start of the target line.
So if you then call readline(), you obtain your target line.
Using this approach I have saved a significant amount of time.
If you don't want to read the entire file into memory, you may need to come up with some format other than plain text.
Of course, it all depends on what you're trying to do and how often you will jump around the file.
For instance, if you're going to be jumping to lines many times in the same file, and you know that the file does not change while you are working with it, you can do this:
First, pass through the whole file and record the seek location of some key line numbers (say, every 1000 lines).
Then, if you want line 12005, jump to the position of line 12000 (which you've recorded), read 5 lines, and you'll know you're on line 12005.
And so on.
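A minimal sketch of that checkpoint idea (the file name, the checkpoint spacing, and the helper function are illustrative assumptions):
CHECKPOINT_EVERY = 1000
checkpoints = {0: 0}               # line number -> byte offset where that line starts

f = open('huge_file.txt', 'rb')
line_no = 0
while f.readline():
    line_no += 1
    if line_no % CHECKPOINT_EVERY == 0:
        checkpoints[line_no] = f.tell()

def get_line(target):              # target is a 0-based line number
    base = (target // CHECKPOINT_EVERY) * CHECKPOINT_EVERY
    f.seek(checkpoints[base])      # jump to the nearest recorded checkpoint
    for _ in range(target - base): # then read only the few remaining lines
        f.readline()
    return f.readline()

print(get_line(12005))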
You may use mmap to find the offsets of the lines. mmap seems to be the fastest way to process a file. Example:
import mmap

with open('input_file', "r+b") as f:
    mapped = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    i = 1
    for line in iter(mapped.readline, b""):
        if i == Line_I_want_to_jump:
            offsets = mapped.tell()
        i += 1
then use f.seek(offsets) to move to the line you need
None of the answers are particularly satisfactory, so here's a small snippet to help.
class LineSeekableFile:
    def __init__(self, seekable):
        self.fin = seekable
        self.line_map = list()  # Map from line index -> file position.
        self.line_map.append(0)
        while seekable.readline():
            self.line_map.append(seekable.tell())

    def __getitem__(self, index):
        # NOTE: This assumes that you're not reading the file sequentially.
        # For that, just use 'for line in file'.
        self.fin.seek(self.line_map[index])
        return self.fin.readline()
Example usage:
In:  !cat /tmp/test.txt
Out:
Line zero.
Line one!
Line three.
End of file, line four.
In:
with open("/tmp/test.txt", 'rt') as fin:
    seeker = LineSeekableFile(fin)
    print(seeker[1])
Out:
Line one!
This involves doing a lot of file seeks, but is useful for the cases where you can't fit the whole file in memory. It does one initial read to get the line locations (so it does read the whole file, but doesn't keep it all in memory), and then each access does a file seek after the fact.
I offer the snippet above under the MIT or Apache license at the discretion of the user.
If you know in advance the position in the file (rather than the line number), you can use file.seek() to go to that position.
Edit: you can use the linecache.getline(filename, lineno) function, which will return the contents of line lineno, but only after reading the entire file into memory. Good if you're randomly accessing lines from within the file (as Python itself might want to do to print a traceback), but not good for a 15 MB file.
What generates the file you want to process? If it is something under your control, you could generate an index (which line is at which position) at the time the file is appended to. The index file can have a fixed line size (space-padded or zero-padded numbers) and will definitely be smaller, so it can be read and processed quickly. Which line do you want?
Calculate the byte offset of the corresponding line number in the index file (possible because the line size of the index file is constant).
Use seek or whatever to jump directly to that line in the index file.
Parse it to get the byte offset of the corresponding line in the actual file.
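A minimal sketch of that index-file idea (the file names, the entry width, and the helper function are illustrative assumptions, not from the answer):
INDEX_WIDTH = 12  # each index entry: zero-padded byte offset plus a newline

# Build the index once, e.g. whenever the data file is (re)generated.
with open('data.txt', 'rb') as data, open('data.idx', 'wb') as idx:
    offset = 0
    for line in data:
        idx.write(str(offset).zfill(INDEX_WIDTH).encode() + b'\n')
        offset += len(line)

# Later: jump straight to line n (0-based) via the index.
def get_line(n):
    with open('data.idx', 'rb') as idx, open('data.txt', 'rb') as data:
        idx.seek(n * (INDEX_WIDTH + 1))  # fixed-width entries make this a simple multiply
        offset = int(idx.readline())     # byte offset of line n in the data file
        data.seek(offset)
        return data.readline()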
Do the lines themselves contain any index information? If the content of each line was something like "<line index>:Data", then the seek() approach could be used to do a binary search through the file, even if the amount of Data is variable. You'd seek to the midpoint of the file, read a line, check whether its index is higher or lower than the one you want, etc.
Otherwise, the best you can do is just readlines(). If you don't want to read all 15MB, you can use the sizehint argument to at least replace a lot of readline()s with a smaller number of calls to readlines().
If you're dealing with a text file on a Linux system, you can use Linux commands.
For me, this worked well!
import commands

def read_line(path, line=1):
    return commands.getoutput('head -%s %s | tail -1' % (line, path))

line_to_jump = 141978
read_line("path_to_large_text_file", line_to_jump)
Here's an example using readlines(sizehint) to read a chunk of lines at a time. DNS pointed out that solution. I wrote this example because the other examples here are single-line oriented.
def getlineno(filename, lineno):
    if lineno < 1:
        raise TypeError("First line is line 1")
    f = open(filename)
    lines_read = 0
    while 1:
        lines = f.readlines(100000)
        if not lines:
            return None
        if lines_read + len(lines) >= lineno:
            return lines[lineno - lines_read - 1]
        lines_read += len(lines)

print getlineno("nci_09425001_09450000.smi", 12000)
@george brilliantly suggested mmap, which presumably uses the syscall mmap. Here's another rendition.
import mmap

LINE = 2  # your desired line
with open('data.txt', 'rb') as i_file, mmap.mmap(i_file.fileno(), length=0, prot=mmap.PROT_READ) as data:
    for i, line in enumerate(iter(data.readline, b'')):
        if i != LINE:
            continue
        pos = data.tell() - len(line)
        break

    # optionally copy data to `chunk`
    i_file.seek(pos)
    chunk = i_file.read(len(line))

print(f'line {i}')
print(f'byte {pos}')
print(f'data {line}')
print(f'data {chunk}')
You can use this function to return line n:
def skipton(infile, n):
    with open(infile, 'r') as fi:
        for i in range(n - 1):
            next(fi)
        return next(fi)
