I have a file, like this:
<prop type="ltattr-match">1-1</prop>
id =>3</prop>
<tuv xml:lang="en">
<seg> He is not a good man </seg>
And what I want is to detect the third line before the line "He is not a good man", i.e. (id =>3). The file is big. What can I do?
I suggest using a double-ended queue (deque) with a maximum length: this way, only the required amount of "backlog" is stored and you don't have to fiddle around with slices manually. We don't actually need the double-endedness, but the normal Queue class blocks when the queue is full, so it isn't suitable here.
import collections

dq = collections.deque([], 3)  # empty deque that keeps at most 3 lines
with open("mybigfile.txt") as file:
    for line in file:  # iterate lazily, no need to load the whole file
        if line.startswith('<seg>'):
            print(dq[0])  # the line 3 lines back; or append it to a list
            break
        dq.append(line)  # save the line; if 3 lines are already stored,
                         # the oldest one is discarded
with open("mybigfile.txt") as file:
lines = file.readlines()
for idx, line in enumerate(lines):
if line.startswith("<seg>"):
line_to_detect = lines[idx-3]
#use idx-2 if you want the _second_ line before this one,
#ex `id =>3</prop>`
print "This line was detected:"
print line_to_detect
Result:
This line was detected:
<prop type="ltattr-match">1-1</prop>
As we previously discussed in chat, this method can be memory intensive for very large files. But 100 pages isn't very large, so this should be fine.
Read each line in sequence, remembering only the last 3 read at any point.
Something like:
# Assume f is a file object open to your file
last3 = []
last3.append(f.readline())
last3.append(f.readline())
last3.append(f.readline())
while True:
    line = f.readline()
    if line.startswith('<seg>'):  # whatever condition identifies your line
        break
    last3 = last3[1:] + [line]
# At this point last3[0] is 3 lines before the matching line
You'll need to modify this to handle files with fewer than 3 lines, or the case where no line matches your condition.
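For example, those edge cases could be handled roughly like this (a sketch, reusing the <seg> check as a stand-in for your condition):
def line_3_before(f, condition=lambda l: l.startswith('<seg>')):
    """Return the line 3 lines before the first matching line,
    or None if the file is too short or nothing matches."""
    last3 = []
    for line in f:                        # iterating handles end-of-file for us
        if condition(line):
            return last3[0] if len(last3) == 3 else None
        last3 = (last3 + [line])[-3:]     # keep at most the last 3 lines
    return None                           # no line matched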
file = "path/to/the/file"
f = open(file, "r")
lines = f.readlines()
f.close()
i = 0
for line in lines:
if "<seg> He is not a good man </seg>" in line:
print(lines[i]) #Print the prvious line
else
i += 1
If you need the second line before, just change it to print(lines[i-2]).
Related
Suppose I have a text file that goes like this:
AAAAAAAAAAAAAAAAAAAAA #<--- line 1
BBBBBBBBBBBBBBBBBBBBB #<--- line 2
CCCCCCCCCCCCCCCCCCCCC #<--- line 3
DDDDDDDDDDDDDDDDDDDDD #<--- line 4
EEEEEEEEEEEEEEEEEEEEE #<--- line 5
FFFFFFFFFFFFFFFFFFFFF #<--- line 6
GGGGGGGGGGGGGGGGGGGGG #<--- line 7
HHHHHHHHHHHHHHHHHHHHH #<--- line 8
Ignore "#<--- line...", it's just for demonstration
Assumptions
I don't know what line 3 is going to contain (because it changes all the time)...
The first 2 lines have to be deleted...
After the first 2 lines, I want to keep 3 lines...
Then, I want to delete all lines after the 3rd line.
End Result
The end result should look like this:
CCCCCCCCCCCCCCCCCCCCC #<--- line 3
DDDDDDDDDDDDDDDDDDDDD #<--- line 4
EEEEEEEEEEEEEEEEEEEEE #<--- line 5
Lines deleted: First 2 + Everything after the next 3 (i.e. after line 5)
Required
All Pythonic suggestions are welcome! Thanks!
Reference Material
https://thispointer.com/python-how-to-delete-specific-lines-in-a-file-in-a-memory-efficient-way/
import os

def delete_multiple_lines(original_file, line_numbers):
    """In a file, delete the lines at the line numbers in the given list"""
    is_skipped = False
    counter = 0
    # Create name of dummy / temporary file
    dummy_file = original_file + '.bak'
    # Open original file in read-only mode and dummy file in write mode
    with open(original_file, 'r') as read_obj, open(dummy_file, 'w') as write_obj:
        # Line by line copy data from original file to dummy file
        for line in read_obj:
            # If current line number exists in the list then skip copying that line
            if counter not in line_numbers:
                write_obj.write(line)
            else:
                is_skipped = True
            counter += 1
    # If any line was skipped then replace the original file with the dummy file
    if is_skipped:
        os.remove(original_file)
        os.rename(dummy_file, original_file)
    else:
        os.remove(dummy_file)
Then...
delete_multiple_lines('sample.txt', [0,1,2])
The problem with this method might be that, if your file had lines 1-100 on top to delete, you'd have to specify [0,1,2...100]. Right?
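(A side note, not from the original post: since delete_multiple_lines only checks membership, a contiguous block of line numbers can be generated with range rather than typed out.)
# delete the first 100 lines (0-based indices 0..99) without listing them by hand
delete_multiple_lines('sample.txt', list(range(100)))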
Answer
Courtesy of @sandes
The following code will:
delete the first 63
get you the next 95
ignore the rest
create a new file
with open("sample.txt", "r") as f:
lines = f.readlines()
new_lines = []
idx_lines_wanted = [x for x in range(63,((63*2)+95))]
# delete first 63, then get the next 95
for i, line in enumerate(lines):
if i > len(idx_lines_wanted) -1:
break
if i in idx_lines_wanted:
new_lines.append(line)
with open("sample2.txt", "w") as f:
for line in new_lines:
f.write(line)
EDIT: iterating directly over f,
based on @Kenny's comment and @chepner's suggestion
with open("your_file.txt", "r") as f:
new_lines = []
for idx, line in enumerate(f):
if idx in [x for x in range(2,5)]: #[2,3,4]
new_lines.append(line)
with open("your_new_file.txt", "w") as f:
for line in new_lines:
f.write(line)
This is really something that's better handled by an actual text editor.
import subprocess
subprocess.run(['ed', original_file], input=b'1,2d\n+3,$d\nwq\n')
A crash course in ed, the POSIX standard text editor.
ed opens the file named by its argument. It then proceeds to read commands from its standard input. Each command is a single character, with some commands taking one or two "addresses" to indicate which lines to operate on.
After each command, the "current" line number is set to the line last affected by a command. This is used with relative addresses, as we'll see in a moment.
1,2d deletes lines 1 through 2; the current line is then the line that followed them, which after the deletion has become the new line 1 (originally line 3)
+3,$d deletes everything from three lines past the current line through the end of the file ($ is a special address indicating the last line); with the current line at the new line 1, +3 is the new line 4, so the three lines we want to keep survive and everything after them is removed
wq writes all changes to disk and quits the editor.
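If you are calling ed from Python as shown above, a small addition lets the script fail loudly if ed reports an error ('sample.txt' is just an assumed file name for illustration):
import subprocess

# check=True raises CalledProcessError if ed exits with a non-zero status
subprocess.run(['ed', 'sample.txt'], input=b'1,2d\n+3,$d\nwq\n', check=True)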
I am trying to copy the line that appears four lines before a line that contains a specific keyword.
if line.find("keyword") == 0:
    f.write(line -3)
I don't need the line where I found the keyword, but the 4th line before it. Since the write method doesn't work with line numbers, I got stuck.
If you're already using two files, it's as simple as keeping a buffer and writing out the last 3 entries in it when you encounter a match:
buf = []  # your buffer
with open("in_file", "r") as f_in, open("out_file", "w") as f_out:  # open the in/out files
    for line in f_in:  # iterate the input file line by line
        if "keyword" in line:  # the current line contains the keyword
            f_out.writelines(buf[-3:])  # write the last 3 lines (or fewer if not available)
            f_out.write(line)  # write the current line, omit if not needed
            buf = []  # reset the buffer
        else:
            buf.append(line)  # add the current line to the buffer
You can just use a list, appending each line to it and truncating it to the last 4. When you reach the target line you are done.
last_4 = []
with open("the_dst_file", "w") as fw:
    with open("the_source_file") as fr:
        for line in fr:
            if line.find("keyword") == 0:
                fw.write(last_4[0])  # the line already ends with "\n"
                last_4 = []
                continue
            last_4.append(line)
            last_4 = last_4[-4:]
If the format of the file is known such that "keyword" always has at least 4 lines preceding it, and at least 4 lines between instances, then the above is good. If not, then you would need to guard against the write by checking that the length of last_4 is 4 before pulling off the first element.
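A minimal sketch of that guard, inside the same loop as above:
if line.find("keyword") == 0:
    if len(last_4) == 4:  # only write when 4 preceding lines actually exist
        fw.write(last_4[0])
    last_4 = []
    continue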
I have a file; after reading a line from it I have named that line current_line, and I want to fetch the 4th line above current_line. How can this be done using Python?
line 1
line 2
line 3
line 4
line 5
line 6
Now say I have fetched line 6 and I have made
current_line = line 6
Now I want the 4th line above it, i.e. I now want line 2:
output_line = line 2
PS: I don't want to read the file from the bottom.
You can keep a list of the last 4 lines while iterating over the lines of your file. A good way to do it is to use a deque with a maximum length of 4:
from collections import deque

last_lines = deque(maxlen=4)

with open('test.txt') as f:
    for line in f:
        if line.endswith('6\n'):  # Your real condition here
            print(last_lines[0])
        last_lines.append(line)

# Output:
# line 2
Once a bounded length deque is full, when new items are added, a
corresponding number of items are discarded from the opposite end.
We read the file line by line and only keep the needed lines in memory.
Imagine we have just read line 10. We have lines 6 to 9 in the queue.
If the condition is met, we retrieve line 6 at the start of the queue and use it.
We append line 10 to the deque; the first item (line 6) gets pushed out, since we are sure we won't need it anymore. We now have lines 7 to 10 in the queue.
My approach would be converting the contents to a list by splitting on \n and retrieving the required line by index.
lines = '''line 1
line 2
line 3
line 4
line 5
line 6'''
s = lines.split('\n')
current_line = 'line 6'
output_line = s[s.index(current_line) - 4]
# line 2
Since you are reading from a file, you don't need to explicitly split on \n. You could read the file into a list of lines with splitlines (readlines would keep the trailing newlines, which would make the index lookup fail):
with open('path/to/your_file') as f:
    lines = f.read().splitlines()

current_line = 'line 6'
output_line = lines[lines.index(current_line) - 4]
# line 2
You can use enumerate on your open file object. For example:
with open('path/to/your.file') as f:
    for i, line in enumerate(f):
        pass  # do something with line here; i is its index
To go back to line i-4, you might think about using a while loop.
But why do you need to go back?
You can do:
with open("file.txt") as f:
    lines = f.readlines()
    for nbr_line, line in enumerate(lines):
        if line == 'line 6\n':  # your condition here
            output_line = lines[nbr_line - 4]  # !!! nbr_line - 4 may be < 0
As I can see, you are reading the file line by line. I suggest you read the whole file into a list, as in the example below.
with open("filename.txt", "r") as fd:
    lines = fd.readlines()  # this reads every line and appends it to the lines list
lines[line_number] will give you the respective line.
f.readlines() is not an efficient solution. If you are working with a huge file, why read the whole file into memory?
def getNthLine(i):
    if i < 1:
        return 'NaN'
    else:
        with open('temp.text', 'r') as f:
            for index, line in enumerate(f):
                if index == i:
                    return line.strip()

with open('temp.text', 'r') as f:
    for i, line in enumerate(f):
        print(line.strip())
        print(getNthLine(i - 1))
There aren't many more options for solving this kind of problem.
You could also play around with the tell and seek methods, but generally there is no need to get that fancy :).
If you are working with a huge file, just don't forget to use enumerate.
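For the curious, here is a rough sketch of the tell/seek idea (not part of the original answer; the 'line 6' check is just a placeholder): record the byte offset at which each line starts, then seek back to an earlier one.
offsets = []                             # byte offset where each line starts
with open('temp.text', 'r') as f:
    while True:
        offsets.append(f.tell())         # position of the line we are about to read
        line = f.readline()
        if not line:                     # end of file reached
            break
        if line.strip() == 'line 6' and len(offsets) >= 5:
            f.seek(offsets[-5])          # jump back 4 lines
            print(f.readline().strip())  # -> line 2
            break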
This is how you could do it with a generator, avoids reading the whole file into memory.
Update: used collections.deque (deque stands for "double ended queue") as recommended by Thierry Lathuille.
import collections

def file_generator(filepath):
    with open(filepath) as file:
        for l in file:
            yield l.rstrip()

def get_n_lines_previous(filepath, n, match):
    file_gen = file_generator(filepath)
    stored_lines = collections.deque('', n)
    for line in file_gen:
        if line == match:
            return stored_lines[0]
        stored_lines.append(line)

if __name__ == "__main__":
    print(get_n_lines_previous("lines.txt", 4, "line 6"))
How do you skip ahead to later lines of a file while looping over it line by line? The code below skips lines for the total count in the 2nd loop; I want it to skip the lines one by one for the desired count so I can pull the right information from the file.
f = open("someTXT", "r")
lines = iter(f.readlines())
for line in lines:
thisLine = line.split(',')
if len(thisLine) > 3:
count = thisLine[4]
for i in range(1,int(count)):
next(lines)
print(line)
Here's a bit of code review. Not sure what you're asking though.
Use the context manager to open files:
with open("someTXT", 'rU') as f: # Universal newline flag, best practice
# lines = iter(f) # no need for this, my_file is an iterator
container = [] # use a container to hold your lines
for line in f:
test = test_for_correct_lines(line) # return True if keep and print
if test:
container.append(line)
# join the lines you want to keep with a newline and print them
print('\n'.join(container))
I am working with a very large (~11GB) text file on a Linux system. I am running it through a program which is checking the file for errors. Once an error is found, I need to either fix the line or remove the line entirely. And then repeat...
Eventually once I'm comfortable with the process, I'll automate it entirely. For now however, let's assume I'm running this by hand.
What would be the fastest (in terms of execution time) way to remove a specific line from this large file? I thought of doing it in Python...but would be open to other examples. The line might be anywhere in the file.
If Python, assume the following interface:
def removeLine(filename, lineno):
Thanks,
-aj
You can have two file objects for the same file at the same time (one for reading, one for writing):
def removeLine(filename, lineno):
    fro = open(filename, "rb")

    current_line = 0
    while current_line < lineno:
        fro.readline()
        current_line += 1

    seekpoint = fro.tell()
    frw = open(filename, "r+b")
    frw.seek(seekpoint, 0)

    # read the line we want to discard
    fro.readline()

    # now move the rest of the lines in the file
    # one line back
    chars = fro.readline()
    while chars:
        frw.write(chars)
        chars = fro.readline()

    fro.close()
    frw.truncate()
    frw.close()
Modify the file in place: the offending line is replaced with spaces, so the remainder of the file does not need to be shuffled around on disk. You can also "fix" the line in place if the fix is not longer than the line you are replacing.
import os
from mmap import mmap

def removeLine(filename, lineno):
    f = os.open(filename, os.O_RDWR)
    m = mmap(f, 0)
    p = 0
    for i in range(lineno - 1):
        p = m.find(b'\n', p) + 1
    q = m.find(b'\n', p)
    m[p:q] = b' ' * (q - p)  # blank out the line with spaces of the same length
    m.close()
    os.close(f)
If the other program can be changed to output the file offset instead of the line number, you can assign the offset to p directly and do without the for loop.
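As a rough sketch of that variant (the function name is illustrative, not from the original answer):
import os
from mmap import mmap

def blank_line_at_offset(filename, offset):
    """Blank out the line starting at byte `offset`, in place."""
    f = os.open(filename, os.O_RDWR)
    m = mmap(f, 0)
    q = m.find(b'\n', offset)            # end of the offending line
    m[offset:q] = b' ' * (q - offset)    # overwrite it with spaces
    m.close()
    os.close(f)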
As far as I know, you can't just open a txt file with Python and remove a line. You have to make a new file and move everything but that line into it. If you know the specific line number, then you would do something like this:
f = open('in.txt')
fo = open('out.txt', 'w')

ind = 1
for line in f:
    if ind != linenumtoremove:  # linenumtoremove: the 1-based line number to drop
        fo.write(line)
    ind += 1

f.close()
fo.close()
You could of course check the contents of the line instead to determine whether you want to keep it. I also recommend that, if you have a whole list of lines to be removed or changed, you do all those changes in one pass through the file, as sketched below.
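A minimal sketch of that one-pass idea, assuming a set of 1-based line numbers to drop (the names and numbers are illustrative):
lines_to_remove = {3, 7, 42}  # 1-based line numbers to drop (example values)

with open('in.txt') as f, open('out.txt', 'w') as fo:
    for ind, line in enumerate(f, start=1):
        if ind not in lines_to_remove:
            fo.write(line)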
If the lines are variable length, then I don't believe there is a better algorithm than reading the file line by line and writing out all lines except the one(s) that you do not want.
You can identify these lines by checking some criteria, or by keeping a running tally of lines read and suppressing the writing of the line(s) that you do not want.
If the lines are fixed length and you want to delete specific line numbers, then you may be able to use seek to move the file pointer... I doubt you're that lucky, though.
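If you do happen to be that lucky, a sketch of the fixed-length case could look like this (the function name and record_len parameter are assumptions, not from the original answer):
def remove_fixed_length_line(filename, lineno, record_len):
    """Delete record `lineno` (0-based) from a file of fixed-length records
    of `record_len` bytes each, newline included."""
    with open(filename, 'r+b') as f:
        f.seek((lineno + 1) * record_len)  # start of the record after the one to delete
        rest = f.read()                    # tail of the file (copy in chunks if it is huge)
        f.seek(lineno * record_len)        # rewind to the record being deleted
        f.write(rest)                      # shift the tail up by one record
        f.truncate()                       # drop the now-duplicated final record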
Update: solution using sed as requested by poster in comment.
To delete for example the second line of file:
sed '2d' input.txt
Use the -i switch to edit in place. Warning: this is a destructive operation. Read the help for this command for information on how to make a backup automatically.
import os

def removeLine(filename, lineno):
    fin = open(filename)
    fout = open(filename + ".new", "w")
    for i, l in enumerate(fin, 1):
        if i != lineno:
            fout.write(l)
    fin.close()
    fout.close()
    os.rename(filename + ".new", filename)
I think there was a somewhat similar, if not exactly the same, type of question asked here before. Reading (and writing) line by line is slow, but you can read a bigger chunk into memory at once, go through it line by line skipping the lines you don't want, and then write it as a single chunk to a new file. Repeat until done, and finally replace the original file with the new file.
The thing to watch out for is that when you read in a chunk, you need to deal with the last, potentially partial line you read, and prepend it to the next chunk you read. A sketch follows.
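A rough sketch of that chunked approach (the chunk size, file names and should_keep check are assumptions for illustration):
CHUNK_SIZE = 1024 * 1024  # read roughly 1 MB at a time

def should_keep(line):
    return 'ERROR' not in line  # placeholder criterion for lines to keep

with open('in.txt') as src, open('out.txt', 'w') as dst:
    leftover = ''  # partial line carried over from the previous chunk
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            if leftover and should_keep(leftover):
                dst.write(leftover)  # flush the final (unterminated) line
            break
        chunk = leftover + chunk
        lines = chunk.split('\n')
        leftover = lines.pop()  # the last element may be a partial line
        kept = [l for l in lines if should_keep(l)]
        if kept:
            dst.write('\n'.join(kept) + '\n')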
@OP, if you can use awk, e.g. assuming the line number is 10:
$ awk 'NR!=10' file > newfile
I will provide two alternatives based on the look-up factor (line number or a search string):
Line number
def removeLine2(filename, lineNumber):
    with open(filename, 'r+') as outputFile:
        with open(filename, 'r') as inputFile:
            currentLineNumber = 0
            while currentLineNumber < lineNumber:
                inputFile.readline()
                currentLineNumber += 1

            seekPosition = inputFile.tell()
            outputFile.seek(seekPosition, 0)

            inputFile.readline()  # skip the line to remove

            # shift the remaining lines back by one line
            currentLine = inputFile.readline()
            while currentLine:
                outputFile.write(currentLine)
                currentLine = inputFile.readline()

            outputFile.truncate()
String
def removeLine(filename, key):
    with open(filename, 'r+') as outputFile:
        with open(filename, 'r') as inputFile:
            seekPosition = 0
            currentLine = inputFile.readline()
            while not currentLine.strip().startswith('"%s"' % key):
                seekPosition = inputFile.tell()
                currentLine = inputFile.readline()

            outputFile.seek(seekPosition, 0)

            # the matched line has already been consumed; copy from the next one
            currentLine = inputFile.readline()
            while currentLine:
                outputFile.write(currentLine)
                currentLine = inputFile.readline()

            outputFile.truncate()