I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in python?
Note: These files are relatively large in length, about 1-2 million lines each and I have to do this for several hundred files.
To read both the first and final line of a file you could...
open the file, ...
... read the first line using built-in readline(), ...
... seek (move the cursor) to the end of the file, ...
... step backwards until you encounter EOL (line break) and ...
... read the last line from there.
def readlastline(f):
f.seek(-2, 2) # Jump to the second last byte.
while f.read(1) != b"\n": # Until EOL is found ...
f.seek(-2, 1) # ... jump back, over the read byte plus one more.
return f.read() # Read all data from this point on.
with open(file, "rb") as f:
first = f.readline()
last = readlastline(f)
Jump to the second last byte directly to prevent trailing newline characters to cause empty lines to be returned*.
The current offset is pushed ahead by one every time a byte is read so the stepping backwards is done two bytes at a time, past the recently read byte and the byte to read next.
The whence parameter passed to fseek(offset, whence=0) indicates that fseek should seek to a position offset bytes relative to...
0 or os.SEEK_SET = The beginning of the file.
1 or os.SEEK_CUR = The current position.
2 or os.SEEK_END = The end of the file.
* As would be expected as the default behavior of most applications, including print and echo, is to append one to every line written and has no effect on lines missing trailing newline character.
Efficiency
1-2 million lines each and I have to do this for several hundred files.
I timed this method and compared it against against the top answer.
10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95.
Millions of lines would increase the difference a lot more.
Exakt code used for timing:
with open(file, "rb") as f:
first = f.readline() # Read and store the first line.
for last in f: pass # Read all lines, keep final value.
Amendment
A more complex, and harder to read, variation to address comments and issues raised since.
Return empty string when parsing empty file, raised by comment.
Return all content when no delimiter is found, raised by comment.
Avoid relative offsets to support text mode, raised by comment.
UTF16/UTF32 hack, noted by comment.
Also adds support for multibyte delimiters, readlast(b'X<br>Y', b'<br>', fixed=False).
Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify to your need, or do not use it at all as you're probably better off using f.readlines()[-1] with files opened in text mode.
#!/bin/python3
from os import SEEK_END
def readlast(f, sep, fixed=True):
r"""Read the last segment from a file-like object.
:param f: File to read last line from.
:type f: file-like object
:param sep: Segment separator (delimiter).
:type sep: bytes, str
:param fixed: Treat data in ``f`` as a chain of fixed size blocks.
:type fixed: bool
:returns: Last line of file.
:rtype: bytes, str
"""
bs = len(sep)
step = bs if fixed else 1
if not bs:
raise ValueError("Zero-length separator.")
try:
o = f.seek(0, SEEK_END)
o = f.seek(o-bs-step) # - Ignore trailing delimiter 'sep'.
while f.read(bs) != sep: # - Until reaching 'sep': Read sep-sized block
o = f.seek(o-step) # and then seek to the block to read next.
except (OSError,ValueError): # - Beginning of file reached.
f.seek(0)
return f.read()
def test_readlast():
from io import BytesIO, StringIO
# Text mode.
f = StringIO("first\nlast\n")
assert readlast(f, "\n") == "last\n"
# Bytes.
f = BytesIO(b'first|last')
assert readlast(f, b'|') == b'last'
# Bytes, UTF-8.
f = BytesIO("X\nY\n".encode("utf-8"))
assert readlast(f, b'\n').decode() == "Y\n"
# Bytes, UTF-16.
f = BytesIO("X\nY\n".encode("utf-16"))
assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"
# Bytes, UTF-32.
f = BytesIO("X\nY\n".encode("utf-32"))
assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"
# Multichar delimiter.
f = StringIO("X<br>Y")
assert readlast(f, "<br>", fixed=False) == "Y"
# Make sure you use the correct delimiters.
seps = { 'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00' }
assert "\n".encode('utf8' ) == seps['utf8']
assert "\n".encode('utf16')[2:] == seps['utf16']
assert "\n".encode('utf32')[4:] == seps['utf32']
# Edge cases.
edges = (
# Text , Match
("" , "" ), # Empty file, empty string.
("X" , "X" ), # No delimiter, full content.
("\n" , "\n"),
("\n\n", "\n"),
# UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
(b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
)
for txt, match in edges:
for enc,sep in seps.items():
assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match
if __name__ == "__main__":
import sys
for path in sys.argv[1:]:
with open(path) as f:
print(f.readline() , end="")
print(readlast(f,"\n"), end="")
docs for io module
with open(fname, 'rb') as fh:
first = next(fh).decode()
fh.seek(-1024, 2)
last = fh.readlines()[-1].decode()
The variable value here is 1024: it represents the average string length. I choose 1024 only for example. If you have an estimate of average line length you could just use that value times 2.
Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:
for line in fh:
pass
last = line
You don't need to bother with the binary flag you could just use open(fname).
ETA: Since you have many files to work on, you could create a sample of couple of dozens of files using random.sample and run this code on them to determine length of last line. With an a priori large value of the position shift (let say 1 MB). This will help you to estimate the value for the full run.
Here's a modified version of SilentGhost's answer that will do what you want.
with open(fname, 'rb') as fh:
first = next(fh)
offs = -100
while True:
fh.seek(offs, 2)
lines = fh.readlines()
if len(lines)>1:
last = lines[-1]
break
offs *= 2
print first
print last
No need for an upper bound for line length here.
Can you use unix commands? I think using head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1], but that may take too much memory.
This is my solution, compatible also with Python3. It does also manage border cases, but it misses utf-16 support:
def tail(filepath):
"""
#author Marco Sulla (marcosullaroma#gmail.com)
#date May 31, 2016
"""
try:
filepath.is_file
fp = str(filepath)
except AttributeError:
fp = filepath
with open(fp, "rb") as f:
size = os.stat(fp).st_size
start_pos = 0 if size - 1 < 0 else size - 1
if start_pos != 0:
f.seek(start_pos)
char = f.read(1)
if char == b"\n":
start_pos -= 1
f.seek(start_pos)
if start_pos == 0:
f.seek(start_pos)
else:
char = ""
for pos in range(start_pos, -1, -1):
f.seek(pos)
char = f.read(1)
if char == b"\n":
break
return f.readline()
It's ispired by Trasp's answer and AnotherParker's comment.
First open the file in read mode.Then use readlines() method to read line by line.All the lines stored in a list.Now you can use list slices to get first and last lines of the file.
a=open('file.txt','rb')
lines = a.readlines()
if lines:
first_line = lines[:1]
last_line = lines[-1]
w=open(file.txt, 'r')
print ('first line is : ',w.readline())
for line in w:
x= line
print ('last line is : ',x)
w.close()
The for loop runs through the lines and x gets the last line on the final iteration.
with open("myfile.txt") as f:
lines = f.readlines()
first_row = lines[0]
print first_row
last_row = lines[-1]
print last_row
Here is an extension of #Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.
def tail(filepath):
with open(filepath, "rb") as f:
first = f.readline() # Read the first line.
f.seek(-2, 2) # Jump to the second last byte.
while f.read(1) != b"\n": # Until EOL is found...
try:
f.seek(-2, 1) # ...jump back the read byte plus one more.
except IOError:
f.seek(-1, 1)
if f.tell() == 0:
break
last = f.readline() # Read last line.
return last
Nobody mentioned using reversed:
f=open(file,"r")
r=reversed(f.readlines())
last_line_of_file = r.next()
Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek some amount from SEEK_END find the second to last line ending and then readline() the last line.
with open(filename, "rb") as f:#Needs to be in binary mode for the seek from the end to work
first = f.readline()
if f.read(1) == '':
return first
f.seek(-2, 2) # Jump to the second last byte.
while f.read(1) != b"\n": # Until EOL is found...
f.seek(-2, 1) # ...jump back the read byte plus one more.
last = f.readline() # Read last line.
return last
The above answer is a modified version of the above answers which handles the case that there is only one line in the file
Related
This question already has answers here:
How to read the last line of a file in Python?
(10 answers)
Closed 1 year ago.
I have a csv file that grows until it reaches approximately 48M of lines.
Before adding new lines to it, I need to read the last line.
I tried the code below, but it got too slow and I need a faster alternative:
def return_last_line(filepath):
with open(filepath,'r') as file:
for x in file:
pass
return x
return_last_line('lala.csv')
Here is my take, in python:
I created a function that lets you choose how many last lines, because the last lines may be empty.
def get_last_line(file, how_many_last_lines = 1):
# open your file using with: safety first, kids!
with open(file, 'r') as file:
# find the position of the end of the file: end of the file stream
end_of_file = file.seek(0,2)
# set your stream at the end: seek the final position of the file
file.seek(end_of_file)
# trace back each character of your file in a loop
n = 0
for num in range(end_of_file+1):
file.seek(end_of_file - num)
# save the last characters of your file as a string: last_line
last_line = file.read()
# count how many '\n' you have in your string:
# if you have 1, you are in the last line; if you have 2, you have the two last lines
if last_line.count('\n') == how_many_last_lines:
return last_line
get_last_line('lala.csv', 2)
This lala.csv has 48 million lines, such as in your example. It took me 0 seconds to get the last line.
Here is code for finding the last line of a file mmap, and it should work on Unixen and derivatives and Windows alike (I've tested this on Linux only, please tell me if it works on Windows too ;), i.e. pretty much everywhere where it matters. Since it uses memory mapped I/O it could be expected to be quite performant.
It expects that you can map the entire file into the address space of a processor - should be OK for 50M file everywhere but for 5G file you'd need a 64-bit processor or some extra slicing.
import mmap
def iterate_lines_backwards(filename):
with open(filename, "rb") as f:
# memory-map the file, size 0 means whole file
with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
start = len(mm)
while start > 0:
start, prev = mm.rfind(b"\n", 0, start), start
slice = mm[start + 1:prev + 1]
# if the last character in the file was a '\n',
# technically the empty string after that is not a line.
if slice:
yield slice.decode()
def get_last_nonempty_line(filename):
for line in iterate_lines_backwards(filename):
if stripped := line.rstrip("\r\n"):
return stripped
print(get_last_nonempty_line("datafile.csv"))
As a bonus there is a generator iterate_lines_backwards that would efficiently iterate over the lines of a file in reverse for any number of lines:
print("Iterating the lines of datafile.csv backwards")
for l in iterate_lines_backwards("datafile.csv"):
print(l, end="")
This is generally a rather tricky thing to do. A very efficient way of getting a chunk that includes the last lines is the following:
import os
def get_last_lines(path, offset=500):
""" An efficient way to get the last lines of a file.
IMPORTANT:
1. Choose offset to be greater than
max_line_length * number of lines that you want to recover.
2. This will throw an os.OSError if the file is shorter than
the offset.
"""
with path.open("rb") as f:
f.seek(-offset, os.SEEK_END)
while f.read(1) != b"\n":
f.seek(-2, os.SEEK_CUR)
return f.readlines()
You need to know the maximum line length though and ensure that the file is at least one offset long!
To use it, do the following:
from pathlib import Path
n_last_lines = 10
last_bit_of_file = get_last_lines(Path("/path/to/my/file"))
real_last_n_lines = last_bit_of_file[-10:]
Now finally you need to decode the binary to strings:
real_last_n_lines_non_binary = [x.decode() for x in real_last_n_lines]
Probably all of this could be wrapped in one more convenient function.
If you are running your code in a Unix based environment, you can execute tail shell command from Python to read the last line:
import subprocess
subprocess.run(['tail', '-n', '1', '/path/to/lala.csv'])
You could additionally store the last line in a separate file, which you update whenever you add new lines to the main file.
This works well for me:
https://pypi.org/project/file-read-backwards/
from file_read_backwards import FileReadBackwards
with FileReadBackwards("/tmp/file", encoding="utf-8") as frb:
# getting lines by lines starting from the last line up
for l in frb:
if l:
print(l)
break
An easy way to do this is with deque:
from collections import deque
def return_last_line(filepath):
with open(filepath,'r') as f:
q = deque(f, 1)
return q[0]
since seek() returns the position that it moved to, you can use it to move backward and position the cursor to the beginning of the last line.
with open("test.txt") as f:
p = f.seek(0,2)-1 # ignore trailing end of line
while p>0 and f.read(1)!="\n": # detect end of line (or start of file)
p = f.seek(p-1,0) # search backward
lastLine = f.read().strip() # read from start of last line
print(lastLine)
To get the last non-empty line, you can add a while loop around the search:
with open("test.txt") as f:
p,lastLine = f.seek(0,2),"" # start from end of file
while p and not lastLine: # want last non-empty line
while p>0 and f.read(1)!="\n": # detect end of line (or start of file)
p = f.seek(p-1,0) # search backward
lastLine = f.read().strip() # read from start of last line
Based on #kuropan
Faster and shorter:
# 60.lastlinefromlargefile.py
# juanfc 2021-03-17
import os
def get_last_lines(fileName, offset=500):
""" An efficient way to get the last lines of a file.
IMPORTANT:
1. Choose offset to be greater than
max_line_length * number of lines that you want to recover.
2. This will throw an os.OSError if the file is shorter than
the offset.
"""
with open(fileName, "rb") as f:
f.seek(-offset, os.SEEK_END)
return f.read().decode('utf-8').rstrip().split('\n')[-1]
print(get_last_lines('60.lastlinefromlargefile.py'))
I would like to insert a string at a specific column of a specific line in a file.
Suppose I have a file file.txt
How was the English test?
How was the Math test?
How was the Chemistry test?
How was the test?
I would like to change the last line to say How was the History test? by adding the string History at line 4 column 13.
Currently I read in every line of the file and add the string to the specified position.
with open("file.txt", "r+") as f:
# Read entire file
lines = f.readlines()
# Update line
lino = 4 - 1
colno = 13 -1
lines[lino] = lines[lino][:colno] + "History " + lines[lino][colno:]
# Rewrite file
f.seek(0)
for line in lines:
f.write(line)
f.truncate()
f.close()
But I feel like I should be able to simply add the line to the file without having to read and rewrite the entire file.
This is possibly a duplicate of below SO thread
Fastest Way to Delete a Line from Large File in Python
In above it's a talk about delete, which is just a manipulation, and yours is more of a modification. So the code would get updated like below
def update(filename, lineno, column, text):
fro = open(filename, "rb")
current_line = 0
while current_line < lineno - 1:
fro.readline()
current_line += 1
seekpoint = fro.tell()
frw = open(filename, "r+b")
frw.seek(seekpoint, 0)
# read the line we want to update
line = fro.readline()
chars = line[0: column-1] + text + line[column-1:]
while chars:
frw.writelines(chars)
chars = fro.readline()
fro.close()
frw.truncate()
frw.close()
if __name__ == "__main__":
update("file.txt", 4, 13, "History ")
In a large file it make sense to not make modification till the lineno where the update needs to happen, Imagine you have file with 10K lines and update needs to happen at 9K, your code will load all 9K lines of data in memory unnecessarily. The code you have would work still but is not the optimal way of doing it
The function readlines() reads the entire file. But it doesn't have to. It actually reads from the current file cursor position to the end, which happens to be 0 right after opening. (To confirm this, try f.tell() right after with statement.) What if we started closer to the end of the file?
The way your code is written implies some prior knowledge of your file contents and layouts. Can you place any constraints on each line? For example, given your sample data, we might say that lines are guaranteed to be 27 bytes or less. Let's round that to 32 for "power of 2-ness" and try seeking backwards from the end of the file.
# note the "rb+"; need to open in binary mode, else seeking is strictly
# a "forward from 0" operation. We need to be able to seek backwards
with open("file.txt", "rb+") as f:
# caveat: if file is less than 32 bytes, this will throw
# an exception. The second parameter, 2, says "from end of file"
f.seek(-32, 2)
last = f.readlines()[-1].decode()
At which point the code has only read the last 32 bytes of the file.1 readlines() (at the byte level) will look for the line end byte (in Unix, \n or 0x0a or byte value 10), and return the before and after. Spelled out:
>>> last = f.readlines()
>>> print( last )
[b'hemistry test?\n', b'How was the test?']
>>> last = last[-1]
>>> print( last )
b'How was the test?'
Crucially, this works robustly under UTF-8 encoding by exploiting the UTF-8 property that ASCII byte values under 128 do not occur when encoding non-ASCII bytes. In other words, the exact byte \n (or 0x0a) only ever occurs as a newline and never as part of a character. If you are using a non-UTF-8 encoding, you will need to check if the code assumptions still hold.
Another note: 32 bytes is arbitrary given the example data. A more realistic and typical value might be 512, 1024, or 4096. Finally, to put it back to a working example for you:
with open("file.txt", "rb+") as f:
# caveat: if file is less than 32 bytes, this will throw
# an exception. The second parameter, 2, says "from end of file"
f.seek(-32, 2)
# does *not* read while file, unless file is exactly 32 bytes.
last = f.readlines()[-1]
last_decoded = last.decode()
# Update line
colno = 13 -1
last_decoded = last_decoded[:colno] + "History " + last_decoded[colno:]
last_line_bytes = len( last )
f.seek(-last_line_bytes, 2)
f.write( last_decoded.encode() )
f.truncate()
Note that there is no need for f.close(). The with statement handles that automatically.
1 The pedantic will correctly note that the computer and OS will likely have read at least 512 bytes, if not 4096 bytes, relating to the on-disk or in-memory page size.
You can use this piece of code :
with open("test.txt",'r+') as f:
# Read the file
lines=f.readlines()
# Gets the column
column=int(input("Column:"))-1
# Gets the line
line=int(input("Line:"))-1
# Gets the word
word=input("Word:")
lines[line]=lines[line][0:column]+word+lines[line][column:]
# Delete the file
f.seek(0)
for i in lines:
# Append the lines
f.write(i)
This answer will only loop through the file once and only write everything after the insert. In cases where the insert is at the end there is almost no overhead and where the insert at the beginning it is no worse than a full read and write.
def insert(file, line, column, text):
ln, cn = line - 1, column - 1 # offset from human index to Python index
count = 0 # initial count of characters
with open(file, 'r+') as f: # open file for reading an writing
for idx, line in enumerate(f): # for all line in the file
if idx < ln: # before the given line
count += len(line) # read and count characters
elif idx == ln: # once at the line
f.seek(count + cn) # place cursor at the correct character location
remainder = f.read() # store all character afterwards
f.seek(count + cn) # move cursor back to the correct character location
f.write(text + remainder) # insert text and rewrite the remainder
return # You're finished!
I'm not sure whether you were having problems changing your file to contain the word "History", or whether you wanted to know how to only rewrite certain parts of a file, without having to rewrite the whole thing.
If you were having problems in general, here is some simple code which should work, so long as you know the line within the file that you want to change. Just change the first and last lines of the program to read and write statements accordingly.
fileData="""How was the English test?
How was the Math test?
How was the Chemistry test?
How was the test?""" # So that I don't have to create the file, I'm writing the text directly into a variable.
fileData=fileData.split("\n")
fileData[3]=fileData[3][:11]+" History"+fileData[3][11:] # The 3 referes to the line to add "History" to. (The first line is line 0)
storeData=""
for i in fileData:storeData+=i+"\n"
storeData=storeData[:-1]
print(storeData) # You can change this to a write command.
If you wanted to know how to change specific "parts" to a file, without rewriting the whole thing, then (to my knowledge) that is not possible.
Say you had a file which said Ths is a TEST file., and you wanted to correct it to say This is a TEST file.; you would technically be changing 17 characters and adding one on the end. You are changing the "s" to an "i", the first space to an "s", the "i" (from "is") to a space, etc... as you shift the text forward.
A computer can't actually insert bytes between other bytes. It can only move the data, to make room.
I have a very long file containing data ("text.txt") and a single file that contains exactly 1 line that is the last line of text.txt. This single line should be overwritten every 10 minutes (done by a simple chronjob) as text.txt receives another line every 10 minutes.
So based on other code snippets I found on stackoverflow I currently run this code:
#!/usr/bin/env python
import os, sys
file = open(sys.argv[1], "r+")
#Move the pointer (similar to a cursor in a text editor) to the end of the file.
file.seek(0, os.SEEK_END)
#This code means the following code skips the very last character in the file -
#i.e. in the case the last line is null we delete the last line
#and the penultimate one
pos = file.tell() - 1
#Read each character in the file one at a time from the penultimate
#character going backwards, searching for a newline character
#If we find a new line, exit the search
while pos > 0 and file.read(1) != "\n":
pos -= 1
file.seek(pos, os.SEEK_SET)
#So long as we're not at the start of the file, delete all the characters ahead of this position
if pos > 0:
file.seek(pos, os.SEEK_SET)
w = open("new.txt",'w')
file.writelines(pos)
w.close()
file.close()
With this code I get the error:
TypeError: writelines() requires an iterable argument
(of course). When using file.truncate() I can get rid of the last line in the original file; but I want to keep it there and just extract that last line to new.txt. But I don't comprehend how this works when working with file.seek. So I'd need help for the last part of the code.
file.readlines() with lines[:-1] does not work properly with such huge files.
Not sure why you're opening w, only to close it without doing anything with it. If you want new.txt to have all the text from file starting at pos and ending at the end, how about:
if pos > 0:
file.seek(pos, os.SEEK_SET)
w = open("new.txt",'w')
w.write(file.read())
w.close()
According to your code, pos is an integer which is used to denote the position of first \n from the end of the file.
You cannot do - file.writelines(pos) , as writelines requires a list of lines. But pos is a single integer.
Also you want to write to new.txt , so you should use w file to write, not file . Example -
if pos > 0:
file.seek(pos, os.SEEK_SET)
w = open("new.txt",'w')
w.write(file.read())
w.close()
How about the following approach:
max_line_length = 1000
with open(sys.argv[1], "r") as f_long, open('new.txt', 'w') as f_new:
f_long.seek(-max_line_length, os.SEEK_END)
lines = [line for line in f_long.read().split("\n") if len(line)]
f_new.write(lines[-1])
This will seek to almost the end of the file and read the remaining part of the file in. It is then split into non-empty lines and the last entry is written to new.txt.
Here's how to tail the last 2 lines of a file into a list:
import subprocess
output = subprocess.check_output(['tail', '-n 2', '~/path/to/my_file.txt'])
lines = output.split('\n')
Now you can get the info you need out of the list lines.
I'm writing a short program in Python that will read a FASTA file which is usually in this format:
>gi|253795547|ref|NC_012960.1| Candidatus Hodgkinia cicadicola Dsem chromosome, 52 lines
GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA
TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC
TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA
I've created another program that reads the first line(aka header) of this FASTA file and now I want this second program to start reading and printing beginning from the sequence.
How would I do that?
so far i have this:
FASTA = open("test.txt", "r")
def readSeq(FASTA):
"""returns the DNA sequence of a FASTA file"""
for line in FASTA:
line = line.strip()
print line
readSeq(FASTA)
Thanks guys
-Noob
def readSeq(FASTA):
"""returns the DNA sequence of a FASTA file"""
_unused = FASTA.next() # skip heading record
for line in FASTA:
line = line.strip()
print line
Read the docs on file.next() to see why you should be wary of mixing file.readline() with for line in file:
you should show your script. To read from second line, something like this
f=open("file")
f.readline()
for line in f:
print line
f.close()
You might be interested in checking BioPythons handling of Fasta files (source).
def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
"""Generator function to iterate over Fasta records (as SeqRecord objects).
handle - input file
alphabet - optional alphabet
title2ids - A function that, when given the title of the FASTA
file (without the beginning >), will return the id, name and
description (in that order) for the record as a tuple of strings.
If this is not given, then the entire title line will be used
as the description, and the first word as the id and name.
Note that use of title2ids matches that of Bio.Fasta.SequenceParser
but the defaults are slightly different.
"""
#Skip any text before the first record (e.g. blank lines, comments)
while True:
line = handle.readline()
if line == "" : return #Premature end of file, or just empty?
if line[0] == ">":
break
while True:
if line[0]!=">":
raise ValueError("Records in Fasta files should start with '>' character")
if title2ids:
id, name, descr = title2ids(line[1:].rstrip())
else:
descr = line[1:].rstrip()
id = descr.split()[0]
name = id
lines = []
line = handle.readline()
while True:
if not line : break
if line[0] == ">": break
#Remove trailing whitespace, and any internal spaces
#(and any embedded \r which are possible in mangled files
#when not opened in universal read lines mode)
lines.append(line.rstrip().replace(" ","").replace("\r",""))
line = handle.readline()
#Return the record and then continue...
yield SeqRecord(Seq("".join(lines), alphabet),
id = id, name = name, description = descr)
if not line : return #StopIteration
assert False, "Should not reach this line"
good to see another bioinformatician :)
just include an if clause within your for loop above the line.strip() call
def readSeq(FASTA):
for line in FASTA:
if line.startswith('>'):
continue
line = line.strip()
print(line)
A pythonic and simple way to do this would be slice notation.
>>> f = open('filename')
>>> lines = f.readlines()
>>> lines[1:]
['TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC\n', 'TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTG
TTGGGCTCGCCGAAAGCTGGCAGGTCGA']
That says "give me all elements of lines, from the second (index 1) to the end.
Other general uses of slice notation:
s[i:j] slice of s from i to j
s[i:j:k] slice of s from i to j with step k (k can be negative to go backward)
Either i or j can be omitted (to imply the beginning or the end), and j can be negative to indicate a number of elements from the end.
s[:-1] All but the last element.
Edit in response to gnibbler's comment:
If the file is truly massive you can use iterator slicing to get the same effect while making sure you don't get the whole thing in memory.
import itertools
f = open("filename")
#start at the second line, don't stop, stride by one
for line in itertools.islice(f, 1, None, 1):
print line
"islicing" doesn't have the nice syntax or extra features of regular slicing, but it's a nice approach to remember.
I have 2 simple questions about python:
1.How to get number of lines of a file in python?
2.How to locate the position in a file object to the
last line easily?
lines are just data delimited by the newline char '\n'.
1) Since lines are variable length, you have to read the entire file to know where the newline chars are, so you can count how many lines:
count = 0
for line in open('myfile'):
count += 1
print count, line # it will be the last line
2) reading a chunk from the end of the file is the fastest method to find the last newline char.
def seek_newline_backwards(file_obj, eol_char='\n', buffer_size=200):
if not file_obj.tell(): return # already in beginning of file
# All lines end with \n, including the last one, so assuming we are just
# after one end of line char
file_obj.seek(-1, os.SEEK_CUR)
while file_obj.tell():
ammount = min(buffer_size, file_obj.tell())
file_obj.seek(-ammount, os.SEEK_CUR)
data = file_obj.read(ammount)
eol_pos = data.rfind(eol_char)
if eol_pos != -1:
file_obj.seek(eol_pos - len(data) + 1, os.SEEK_CUR)
break
file_obj.seek(-len(data), os.SEEK_CUR)
You can use that like this:
f = open('some_file.txt')
f.seek(0, os.SEEK_END)
seek_newline_backwards(f)
print f.tell(), repr(f.readline())
Let's not forget
f = open("myfile.txt")
lines = f.readlines()
numlines = len(lines)
lastline = lines[-1]
NOTE: this reads the whole file in memory as a list. Keep that in mind in the case that the file is very large.
The easiest way is simply to read the file into memory. eg:
f = open('filename.txt')
lines = f.readlines()
num_lines = len(lines)
last_line = lines[-1]
However for big files, this may use up a lot of memory, as the whole file is loaded into RAM. An alternative is to iterate through the file line by line. eg:
f = open('filename.txt')
num_lines = sum(1 for line in f)
This is more efficient, since it won't load the entire file into memory, but only look at a line at a time. If you want the last line as well, you can keep track of the lines as you iterate and get both answers by:
f = open('filename.txt')
count=0
last_line = None
for line in f:
num_lines += 1
last_line = line
print "There were %d lines. The last was: %s" % (num_lines, last_line)
One final possible improvement if you need only the last line, is to start at the end of the file, and seek backwards until you find a newline character. Here's a question which has some code doing this. If you need both the linecount as well though, theres no alternative except to iterate through all lines in the file however.
For small files that fit memory,
how about using str.count() for getting the number of lines of a file:
line_count = open("myfile.txt").read().count('\n')
I'd like too add to the other solutions that some of them (those who look for \n) will not work with files with OS 9-style line endings (\r only), and that they may contain an extra blank line at the end because lots of text editors append it for some curious reasons, so you might or might not want to add a check for it.
The only way to count lines [that I know of] is to read all lines, like this:
count = 0
for line in open("file.txt"): count = count + 1
After the loop, count will have the number of lines read.
For the first question there're already a few good ones, I'll suggest #Brian's one as the best (most pythonic, line ending character proof and memory efficient):
f = open('filename.txt')
num_lines = sum(1 for line in f)
For the second one, I like #nosklo's one, but modified to be more general should be:
import os
f = open('myfile')
to = f.seek(0, os.SEEK_END)
found = -1
while found == -1 and to > 0:
fro = max(0, to-1024)
f.seek(fro)
chunk = f.read(to-fro)
found = chunk.rfind("\n")
to -= 1024
if found != -1:
found += fro
It seachs in chunks of 1Kb from the end of the file, until it finds a newline character or the file ends. At the end of the code, found is the index of the last newline character.
Answer to the first question (beware of poor performance on large files when using this method):
f = open("myfile.txt").readlines()
print len(f) - 1
Answer to the second question:
f = open("myfile.txt").read()
print f.rfind("\n")
P.S. Yes I do understand that this only suits for small files and simple programs. I think I will not delete this answer however useless for real use-cases it may seem.
Answer1:
x = open("file.txt")
opens the file or we have x associated with file.txt
y = x.readlines()
returns all lines in list
length = len(y)
returns length of list to Length
Or in one line
length = len(open("file.txt").readlines())
Answer2 :
last = y[-1]
returns the last element of list
Approach:
Open the file in read-mode and assign a file object named “file”.
Assign 0 to the counter variable.
Read the content of the file using the read function and assign it to a
variable named “Content”.
Create a list of the content where the elements are split wherever they encounter an “\n”.
Traverse the list using a for loop and iterate the counter variable respectively.
Further the value now present in the variable Counter is displayed
which is the required action in this program.
Python program to count the number of lines in a text file
# Opening a file
file = open("filename","file mode")#file mode like r,w,a...
Counter = 0
# Reading from file
Content = file.read()
CoList = Content.split("\n")
for i in CoList:
if i:
Counter += 1
print("This is the number of lines in the file")
print(Counter)
The above code will print the number of lines present in a file. Replace filename with the file with extension and file mode with read - 'r'.