Two simple questions about python

Two simple questions about python - python

I have 2 simple questions about python:
1.How to get number of lines of a file in python?
2.How to locate the position in a file object to the
last line easily?

lines are just data delimited by the newline char '\n'.
1) Since lines are variable length, you have to read the entire file to know where the newline chars are, so you can count how many lines:
count = 0
for line in open('myfile'):
count += 1
print count, line # it will be the last line
2) reading a chunk from the end of the file is the fastest method to find the last newline char.
def seek_newline_backwards(file_obj, eol_char='\n', buffer_size=200):
if not file_obj.tell(): return # already in beginning of file
# All lines end with \n, including the last one, so assuming we are just
# after one end of line char
file_obj.seek(-1, os.SEEK_CUR)
while file_obj.tell():
ammount = min(buffer_size, file_obj.tell())
file_obj.seek(-ammount, os.SEEK_CUR)
data = file_obj.read(ammount)
eol_pos = data.rfind(eol_char)
if eol_pos != -1:
file_obj.seek(eol_pos - len(data) + 1, os.SEEK_CUR)
break
file_obj.seek(-len(data), os.SEEK_CUR)
You can use that like this:
f = open('some_file.txt')
f.seek(0, os.SEEK_END)
seek_newline_backwards(f)
print f.tell(), repr(f.readline())

Let's not forget
f = open("myfile.txt")
lines = f.readlines()
numlines = len(lines)
lastline = lines[-1]
NOTE: this reads the whole file in memory as a list. Keep that in mind in the case that the file is very large.

The easiest way is simply to read the file into memory. eg:
f = open('filename.txt')
lines = f.readlines()
num_lines = len(lines)
last_line = lines[-1]
However for big files, this may use up a lot of memory, as the whole file is loaded into RAM. An alternative is to iterate through the file line by line. eg:
f = open('filename.txt')
num_lines = sum(1 for line in f)
This is more efficient, since it won't load the entire file into memory, but only look at a line at a time. If you want the last line as well, you can keep track of the lines as you iterate and get both answers by:
f = open('filename.txt')
count=0
last_line = None
for line in f:
num_lines += 1
last_line = line
print "There were %d lines. The last was: %s" % (num_lines, last_line)
One final possible improvement if you need only the last line, is to start at the end of the file, and seek backwards until you find a newline character. Here's a question which has some code doing this. If you need both the linecount as well though, theres no alternative except to iterate through all lines in the file however.

For small files that fit memory,
how about using str.count() for getting the number of lines of a file:
line_count = open("myfile.txt").read().count('\n')

I'd like too add to the other solutions that some of them (those who look for \n) will not work with files with OS 9-style line endings (\r only), and that they may contain an extra blank line at the end because lots of text editors append it for some curious reasons, so you might or might not want to add a check for it.

The only way to count lines [that I know of] is to read all lines, like this:
count = 0
for line in open("file.txt"): count = count + 1
After the loop, count will have the number of lines read.

For the first question there're already a few good ones, I'll suggest #Brian's one as the best (most pythonic, line ending character proof and memory efficient):
f = open('filename.txt')
num_lines = sum(1 for line in f)
For the second one, I like #nosklo's one, but modified to be more general should be:
import os
f = open('myfile')
to = f.seek(0, os.SEEK_END)
found = -1
while found == -1 and to > 0:
fro = max(0, to-1024)
f.seek(fro)
chunk = f.read(to-fro)
found = chunk.rfind("\n")
to -= 1024
if found != -1:
found += fro
It seachs in chunks of 1Kb from the end of the file, until it finds a newline character or the file ends. At the end of the code, found is the index of the last newline character.

Answer to the first question (beware of poor performance on large files when using this method):
f = open("myfile.txt").readlines()
print len(f) - 1
Answer to the second question:
f = open("myfile.txt").read()
print f.rfind("\n")
P.S. Yes I do understand that this only suits for small files and simple programs. I think I will not delete this answer however useless for real use-cases it may seem.

Answer1:
x = open("file.txt")
opens the file or we have x associated with file.txt
y = x.readlines()
returns all lines in list
length = len(y)
returns length of list to Length
Or in one line
length = len(open("file.txt").readlines())
Answer2 :
last = y[-1]
returns the last element of list

Approach:
Open the file in read-mode and assign a file object named “file”.
Assign 0 to the counter variable.
Read the content of the file using the read function and assign it to a
variable named “Content”.
Create a list of the content where the elements are split wherever they encounter an “\n”.
Traverse the list using a for loop and iterate the counter variable respectively.
Further the value now present in the variable Counter is displayed
which is the required action in this program.
Python program to count the number of lines in a text file
# Opening a file
file = open("filename","file mode")#file mode like r,w,a...
Counter = 0
# Reading from file
Content = file.read()
CoList = Content.split("\n")
for i in CoList:
if i:
Counter += 1
print("This is the number of lines in the file")
print(Counter)
The above code will print the number of lines present in a file. Replace filename with the file with extension and file mode with read - 'r'.

Related

Python. ".readline", ".readlines" methods issue. How does it work

Don't understand (passing task on Coursers) how the second code can read the file content. Without ".readline", ".readlines" methods.
Can't run the first.
Task (The file contains lines of words, need to count quantity of words)
#the first
file = open('/Users/max/Desktop/education/Test_record.csv', 'r')
num_words = 0
for lines in file.readlines():
line = file.split()
num_words += len(line)
print(num_words)
#file.close()
#the second (runs well)
num_words = 0
fileref = "/Users/max/Desktop/education/Test_record.csv"
with open(fileref, 'r') as file:
for line in file:
num_words += len(line.split())
print(num_words)
#file.close()

Your first sample could work, it only contains an error since you define a lines variable then try to call split on file, not on lines. It is however a bit redundant, cause you first read all lines in memory, then iterate over them.
The second sample is the standard way to do the job: you iterate over the file lines while reading them.

How to quickly get the last line of a huge csv file (48M lines)? [duplicate]

This question already has answers here:
How to read the last line of a file in Python?
(10 answers)
Closed 1 year ago.
I have a csv file that grows until it reaches approximately 48M of lines.
Before adding new lines to it, I need to read the last line.
I tried the code below, but it got too slow and I need a faster alternative:
def return_last_line(filepath):
with open(filepath,'r') as file:
for x in file:
pass
return x
return_last_line('lala.csv')

Here is my take, in python:
I created a function that lets you choose how many last lines, because the last lines may be empty.
def get_last_line(file, how_many_last_lines = 1):
# open your file using with: safety first, kids!
with open(file, 'r') as file:
# find the position of the end of the file: end of the file stream
end_of_file = file.seek(0,2)
# set your stream at the end: seek the final position of the file
file.seek(end_of_file)
# trace back each character of your file in a loop
n = 0
for num in range(end_of_file+1):
file.seek(end_of_file - num)
# save the last characters of your file as a string: last_line
last_line = file.read()
# count how many '\n' you have in your string:
# if you have 1, you are in the last line; if you have 2, you have the two last lines
if last_line.count('\n') == how_many_last_lines:
return last_line
get_last_line('lala.csv', 2)
This lala.csv has 48 million lines, such as in your example. It took me 0 seconds to get the last line.

Here is code for finding the last line of a file mmap, and it should work on Unixen and derivatives and Windows alike (I've tested this on Linux only, please tell me if it works on Windows too ;), i.e. pretty much everywhere where it matters. Since it uses memory mapped I/O it could be expected to be quite performant.
It expects that you can map the entire file into the address space of a processor - should be OK for 50M file everywhere but for 5G file you'd need a 64-bit processor or some extra slicing.
import mmap
def iterate_lines_backwards(filename):
with open(filename, "rb") as f:
# memory-map the file, size 0 means whole file
with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
start = len(mm)
while start > 0:
start, prev = mm.rfind(b"\n", 0, start), start
slice = mm[start + 1:prev + 1]
# if the last character in the file was a '\n',
# technically the empty string after that is not a line.
if slice:
yield slice.decode()
def get_last_nonempty_line(filename):
for line in iterate_lines_backwards(filename):
if stripped := line.rstrip("\r\n"):
return stripped
print(get_last_nonempty_line("datafile.csv"))
As a bonus there is a generator iterate_lines_backwards that would efficiently iterate over the lines of a file in reverse for any number of lines:
print("Iterating the lines of datafile.csv backwards")
for l in iterate_lines_backwards("datafile.csv"):
print(l, end="")

This is generally a rather tricky thing to do. A very efficient way of getting a chunk that includes the last lines is the following:
import os
def get_last_lines(path, offset=500):
""" An efficient way to get the last lines of a file.
IMPORTANT:
1. Choose offset to be greater than
max_line_length * number of lines that you want to recover.
2. This will throw an os.OSError if the file is shorter than
the offset.
"""
with path.open("rb") as f:
f.seek(-offset, os.SEEK_END)
while f.read(1) != b"\n":
f.seek(-2, os.SEEK_CUR)
return f.readlines()
You need to know the maximum line length though and ensure that the file is at least one offset long!
To use it, do the following:
from pathlib import Path
n_last_lines = 10
last_bit_of_file = get_last_lines(Path("/path/to/my/file"))
real_last_n_lines = last_bit_of_file[-10:]
Now finally you need to decode the binary to strings:
real_last_n_lines_non_binary = [x.decode() for x in real_last_n_lines]
Probably all of this could be wrapped in one more convenient function.

If you are running your code in a Unix based environment, you can execute tail shell command from Python to read the last line:
import subprocess
subprocess.run(['tail', '-n', '1', '/path/to/lala.csv'])

You could additionally store the last line in a separate file, which you update whenever you add new lines to the main file.

This works well for me:
https://pypi.org/project/file-read-backwards/
from file_read_backwards import FileReadBackwards
with FileReadBackwards("/tmp/file", encoding="utf-8") as frb:
# getting lines by lines starting from the last line up
for l in frb:
if l:
print(l)
break

An easy way to do this is with deque:
from collections import deque
def return_last_line(filepath):
with open(filepath,'r') as f:
q = deque(f, 1)
return q[0]

since seek() returns the position that it moved to, you can use it to move backward and position the cursor to the beginning of the last line.
with open("test.txt") as f:
p = f.seek(0,2)-1 # ignore trailing end of line
while p>0 and f.read(1)!="\n": # detect end of line (or start of file)
p = f.seek(p-1,0) # search backward
lastLine = f.read().strip() # read from start of last line
print(lastLine)
To get the last non-empty line, you can add a while loop around the search:
with open("test.txt") as f:
p,lastLine = f.seek(0,2),"" # start from end of file
while p and not lastLine: # want last non-empty line
while p>0 and f.read(1)!="\n": # detect end of line (or start of file)
p = f.seek(p-1,0) # search backward
lastLine = f.read().strip() # read from start of last line

Based on #kuropan
Faster and shorter:
# 60.lastlinefromlargefile.py
# juanfc 2021-03-17
import os
def get_last_lines(fileName, offset=500):
""" An efficient way to get the last lines of a file.
IMPORTANT:
1. Choose offset to be greater than
max_line_length * number of lines that you want to recover.
2. This will throw an os.OSError if the file is shorter than
the offset.
"""
with open(fileName, "rb") as f:
f.seek(-offset, os.SEEK_END)
return f.read().decode('utf-8').rstrip().split('\n')[-1]
print(get_last_lines('60.lastlinefromlargefile.py'))

Counting the lines of a long file not working correctly in Python

I've been trying to count the lines of a very long file (more than 635000 lines).
I've tried with:
def file_len(fname):
with open(fname) as f:
for i, l in enumerate(f):
pass
return i + 1
and also:
num_lines = sum(1 for line in open(fname))
Both work perfectly for files with not so much lines. I've checked making a 5 lines file and works ok, output is 5.
But with with a long file, which exactly has 635474 lines, the output of both methods posted above is 635466.
I know that the file has 635474 lines, not 635466 lines because I'm creating strings inside of the file and the last two lines are:
alm_asdf_alarm635473=.NOT USED
alm_asdf_alarm635474=.NOT USED
And also because if I open the file with Notepad++ the last line is counted as 635474.
What's the logic behind this? Why is it counting less lines that the real ones?
Thanks in advance.

If all your lines have same structure, you could try a program like that :
import re
num = re.compile('[^0-9]*([0-9]+)')
delta = 1 # initial delta
with open(...) as fd:
for i, line in enumerate(fd, delta):
m = num.match(line)
if i != int(m.group(1)):
print i, "th line for number ", int(m.group(1))
break
It should be enough to find first line where you have a difference (delta is here for the case where first line would be internally numbered 1 and not 0). Then you could more easily understand where the problem really comes from with notepad++.
Note : if only some lines have this structure, you could use that variation :
m = num.match(line)
if (m is not None) and (i != int(m.group(1))):

Python- how to use while loop to return longest line of code

I just started learning python 3 weeks ago, I apologize if this is really basic. I needed to open a .txt file and print the length of the longest line of code in the file. I just made a random file named it myfile and saved it to my desktop.
myfile= open('myfile', 'r')
line= myfile.readlines()
len(max(line))-1
#the (the "-1" is to remove the /n)
Is this code correct? I put it in interpreter and it seemed to work OK.
But I got it wrong because apparently I was supposed to use a while loop. Now I am trying to figure out how to put it in a while loop. I've read what it says on python.org, watched videos on youtube and looked through this site. I just am not getting it. The example to follow that was given is this:
import os
du=os.popen('du/urs/local')
while 1:
line= du.readline()
if not line:
break
if list(line).count('/')==3:
print line,

print max([len(line) for line in file(filename).readlines()])

Taking what you have and stripping out the parts you don't need
myfile = open('myfile', 'r')
max_len = 0
while 1:
line = myfile.readline()
if not line:
break
if len(line) # ... somethin
# something
Note that this is a crappy way to loop over a file. It relys on the file having an empty line at the end. But homework is homework...

max(['b','aaa']) is 'b'
This lexicographic order isn't what you want to maximise, you can use the key flag to choose a different function to maximise, like len.
max(['b','aaa'], key=len) is 'aaa'
So the solution could be: len ( max(['b','aaa'], key=len) is 'aaa' ).
A more elegant solution would be to use list comprehension:
max ( len(line)-1 for line in myfile.readlines() )
.
As an aside you should enclose opening a file using a with statement, this will worry about closing the file after the indentation block:
with open('myfile', 'r') as mf:
print max ( len(line)-1 for line in mf.readlines() )

As other's have mentioned, you need to find the line with the maximum length, which mean giving the max() function a key= argument to extract that from each of lines in the list you pass it.
Likewise, in a while loop you'd need to read each line and see if its length was greater that the longest one you had seen so far, which you could store in a separate variable and initialize to 0 before the loop.
BTW, you would not want to open the file with os.popen() as shown in your second example.

I think it will be easier to understand if we keep it simple:
max_len = -1 # Nothing was read so far
with open("filename.txt", "r") as f: # Opens the file and magically closes at the end
for line in f:
max_len = max(max_len, len(line))
print max_len
As this is homework... I would ask myself if I should count the line feed character or not. If you need to chop the last char, change len(line) by len(line[:-1]).
If you have to use while, try this:
max_len = -1 # Nothing was read
with open("t.txt", "r") as f: # Opens the file
while True:
line = f.readline()
if(len(line)==0):
break
max_len = max(max_len, len(line[:-1]))
print max_len

For those still in need. This is a little function which does what you need:
def get_longest_line(filename):
length_lines_list = []
open_file_name = open(filename, "r")
all_text = open_file_name.readlines()
for line in all_text:
length_lines_list.append(len(line))
max_length_line = max(length_lines_list)
for line in all_text:
if len(line) == max_length_line:
return line.strip()
open_file_name.close()

Fastest Way to Delete a Line from Large File in Python

I am working with a very large (~11GB) text file on a Linux system. I am running it through a program which is checking the file for errors. Once an error is found, I need to either fix the line or remove the line entirely. And then repeat...
Eventually once I'm comfortable with the process, I'll automate it entirely. For now however, let's assume I'm running this by hand.
What would be the fastest (in terms of execution time) way to remove a specific line from this large file? I thought of doing it in Python...but would be open to other examples. The line might be anywhere in the file.
If Python, assume the following interface:
def removeLine(filename, lineno):
Thanks,
-aj

You can have two file objects for the same file at the same time (one for reading, one for writing):
def removeLine(filename, lineno):
fro = open(filename, "rb")
current_line = 0
while current_line < lineno:
fro.readline()
current_line += 1
seekpoint = fro.tell()
frw = open(filename, "r+b")
frw.seek(seekpoint, 0)
# read the line we want to discard
fro.readline()
# now move the rest of the lines in the file
# one line back
chars = fro.readline()
while chars:
frw.writelines(chars)
chars = fro.readline()
fro.close()
frw.truncate()
frw.close()

Modify the file in place, offending line is replaced with spaces so the remainder of the file does not need to be shuffled around on disk. You can also "fix" the line in place if the fix is not longer than the line you are replacing
import os
from mmap import mmap
def removeLine(filename, lineno):
f=os.open(filename, os.O_RDWR)
m=mmap(f,0)
p=0
for i in range(lineno-1):
p=m.find('\n',p)+1
q=m.find('\n',p)
m[p:q] = ' '*(q-p)
os.close(f)
If the other program can be changed to output the fileoffset instead of the line number, you can assign the offset to p directly and do without the for loop

As far as I know, you can't just open a txt file with python and remove a line. You have to make a new file and move everything but that line to it. If you know the specific line, then you would do something like this:
f = open('in.txt')
fo = open('out.txt','w')
ind = 1
for line in f:
if ind != linenumtoremove:
fo.write(line)
ind += 1
f.close()
fo.close()
You could of course check the contents of the line instead to determine if you want to keep it or not. I also recommend that if you have a whole list of lines to be removed/changed to do all those changes in one pass through the file.

If the lines are variable length then I don't believe that there is a better algorithm than reading the file line by line and writing out all lines, except for the one(s) that you do not want.
You can identify these lines by checking some criteria, or by keeping a running tally of lines read and suppressing the writing of the line(s) that you do not want.
If the lines are fixed length and you want to delete specific line numbers, then you may be able to use seek to move the file pointer... I doubt you're that lucky though.

Update: solution using sed as requested by poster in comment.
To delete for example the second line of file:
sed '2d' input.txt
Use the -i switch to edit in place. Warning: this is a destructive operation. Read the help for this command for information on how to make a backup automatically.

def removeLine(filename, lineno):
in = open(filename)
out = open(filename + ".new", "w")
for i, l in enumerate(in, 1):
if i != lineno:
out.write(l)
in.close()
out.close()
os.rename(filename + ".new", filename)

I think there was a somewhat similar if not exactly the same type of question asked here. Reading (and writing) line by line is slow, but you can read a bigger chunk into memory at once, go through that line by line skipping lines you don't want, then writing this as a single chunk to a new file. Repeat until done. Finally replace the original file with the new file.
The thing to watch out for is when you read in a chunk, you need to deal with the last, potentially partial line you read, and prepend that into the next chunk you read.

#OP, if you can use awk, eg assuming line number is 10
$ awk 'NR!=10' file > newfile

I will provide two alternatives based on the look-up factor (line number or a search string):
Line number
def removeLine2(filename, lineNumber):
with open(filename, 'r+') as outputFile:
with open(filename, 'r') as inputFile:
currentLineNumber = 0
while currentLineNumber < lineNumber:
inputFile.readline()
currentLineNumber += 1
seekPosition = inputFile.tell()
outputFile.seek(seekPosition, 0)
inputFile.readline()
currentLine = inputFile.readline()
while currentLine:
outputFile.writelines(currentLine)
currentLine = inputFile.readline()
outputFile.truncate()
String
def removeLine(filename, key):
with open(filename, 'r+') as outputFile:
with open(filename, 'r') as inputFile:
seekPosition = 0
currentLine = inputFile.readline()
while not currentLine.strip().startswith('"%s"' % key):
seekPosition = inputFile.tell()
currentLine = inputFile.readline()
outputFile.seek(seekPosition, 0)
currentLine = inputFile.readline()
while currentLine:
outputFile.writelines(currentLine)
currentLine = inputFile.readline()
outputFile.truncate()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Two simple questions about python - python

I have 2 simple questions about python: 1.How to get number of lines of a file in python? 2.How to locate the position in a file object to the last line easily?

Let's not forget f = open("myfile.txt") lines = f.readlines() numlines = len(lines) lastline = lines[-1] NOTE: this reads the whole file in memory as a list. Keep that in mind in the case that the file is very large.

For small files that fit memory, how about using str.count() for getting the number of lines of a file: line_count = open("myfile.txt").read().count('\n')

The only way to count lines [that I know of] is to read all lines, like this: count = 0 for line in open("file.txt"): count = count + 1 After the loop, count will have the number of lines read.

Answer1: x = open("file.txt") opens the file or we have x associated with file.txt y = x.readlines() returns all lines in list length = len(y) returns length of list to Length Or in one line length = len(open("file.txt").readlines()) Answer2 : last = y[-1] returns the last element of list

Related

Python. ".readline", ".readlines" methods issue. How does it work

How to quickly get the last line of a huge csv file (48M lines)? [duplicate]

Counting the lines of a long file not working correctly in Python

Python- how to use while loop to return longest line of code

Fastest Way to Delete a Line from Large File in Python

Categories

Resources