Python - Speedy solution for deleting duplicate lines in 2 files

Python - Speedy solution for deleting duplicate lines in 2 files - python

I have two files: fileA and fileB. I'd like to get the line numbers of all the lines in the fileB that exist in the fileA. But if a line is indeed in fileA, I won't identify it as "exists in fileA" unless the next line is also in it. So I've written the following code:
def compare_two(fileA, fileB):
with open(fileA, 'r') as fa:
fa_content = fa.read()
with open(fileB, 'r') as fb:
keep_line_num = [] # the line number that's not in fileA
i = 1
while True:
line = fb.readline()
if line == '': # There are no blank lines in both files
break
last_pos = fb.tell()
theFollowing = line
new_line = fb.readline() # get the next line
theFollowing += new_line
fb.seek(last_pos)
if theFollowing not in fa_content:
keep_line_num.append(i)
i += 1
fb.close()
fa.close()
return keep_line_num
compare_two(fileA, fileB)
This works fine for small files. But I want to use it for large files as large as 2GB and this method is too slow for me. Are there any other way to work with this in Python2.7?

Take a look at difflib, it comes with Python.
It can tell you where your files are different or identical. See also python difflib comparing files

Related

Python. ".readline", ".readlines" methods issue. How does it work

Don't understand (passing task on Coursers) how the second code can read the file content. Without ".readline", ".readlines" methods.
Can't run the first.
Task (The file contains lines of words, need to count quantity of words)
#the first
file = open('/Users/max/Desktop/education/Test_record.csv', 'r')
num_words = 0
for lines in file.readlines():
line = file.split()
num_words += len(line)
print(num_words)
#file.close()
#the second (runs well)
num_words = 0
fileref = "/Users/max/Desktop/education/Test_record.csv"
with open(fileref, 'r') as file:
for line in file:
num_words += len(line.split())
print(num_words)
#file.close()

Your first sample could work, it only contains an error since you define a lines variable then try to call split on file, not on lines. It is however a bit redundant, cause you first read all lines in memory, then iterate over them.
The second sample is the standard way to do the job: you iterate over the file lines while reading them.

How to merge two files by line names using python

I think this should be easy but yet have not been able to solve it. I have two files as below and I want to merge them in a way that lines starting with > in the file1 to be the header of the lines in the file2
file1:
>seq12
ACGCTCGCA
>seq34
GCATCGCGT
>seq56
GCGATGCGC
file2:
ATCGCGCATGATCTCAG
AGCGCGCATGCGCATCG
AGCAAATTTAGCAACTC
so the desired output should be:
>seq12
ATCGCGCATGATCTCAG
>seq34
AGCGCGCATGCGCATCG
>seq56
AGCAAATTTAGCAACTC
I have tried this code so far but in output, all the lines coming from file2 are the same:
from Bio import SeqIO
with open(file1) as fw:
with open(file2,'r') as rv:
for line in rv:
items = line
for record in SeqIO.parse(fw, 'fasta'):
print('>' + record.id)
print(line)

If you cannot store your files in memory, you need a solution that reads line by line from each file, and writes accordingly to the output file. The following program does that. The comments try to clarify, though I believe it is clear from the code.
with open("file1.txt") as first, open("file2.txt") as second, open("output.txt", "w+") as output:
while 1:
line_first = first.readline() # line from file1 (header)
line_second = second.readline() # line from file2 (body)
if not (line_first and line_second):
# if any file has ended
break
# write to output file
output.writelines([line_first, line_second])
# jump one line from file1
first.readline()
Note that this will only work if file1.txt has the specific format you presented (odd lines are headers, even lines are useless).
In order to allow a bit more customization, you can wrap it up in a function as:
def merge_files(header_file_path, body_file_path, output_file="output.txt", every_n_lines=2):
with open(header_file_path) as first, open(body_file_path) as second, open(output_file, "w+") as output:
while 1:
line_first = first.readline() # line from header
line_second = second.readline() # line from body
if not (line_first and line_second):
# if any file has ended
break
# write to output file
output.writelines([line_first, line_second])
# jump n lines from header
for _ in range(every_n_lines - 1):
first.readline()
And then calling merge_files("file1.txt", "file2.txt") should do the trick.

If both files are small enough to fit in memory simultaneously, you can simply read them simultaneously and interleave them.
# Open two file handles.
with open("f1", mode="r") as f1, open("f2", mode="r") as f2:
lines_first = f1.readlines() # Read all lines in f1.
lines_second = f2.readlines() # Read all lines in f2.
lines_out = []
# For each line in the file without headers...
for idx in range(len(lines_second)):
# Take every even line from the first file and prepend it to
# the line from the second.
lines_out.append(lines_first[2 * idx + 1].rstrip() + lines_second[idx].rstrip())
You can generate the seq headers very easily given idx: I leave this as an exercise to the reader.
If either or both files are too large to fit in memory, you can repeat the above process line-by-line over both handles (using one variable to store information from the file with headers).

How to extract specific line from large number (4.5 M) of files and debug properly?

I have a question regarding data manipulation and extraction.
I have a large amount of files (about 4.5 million files) from which I want to extract the third row (line) from each file and save it to a new file. However, there seems to be a small discrepancy of about 5 lines that are missing with the number of files and the number of lines extracted.
I have tried debugging to see where the error occurs. For debugging purposes I can think of two possible problems:
(1) I am counting the number of lines incorrectly (I have tried two algorithms for row count and they seem to match)
(2) It reads an empty string which I have also tried to debug in the code. What other possibilities are there that I could look to debug?
Algorithm for calculating file length 1
def file_len(filename):
with open(filename) as f:
for i, l in enumerate(f):
pass
return i + 1
Algorithm for calculating file length 2
def file_len2(filename):
i = sum(1 for line in open(filename))
return i
Algorithm for extracting line no. 3
def extract_line(filename):
f = open(filename, 'r')
for i, line in enumerate(f):
if i == 2: # Line number 3
a = line
if not a.strip():
print(Error!)
f.close()
return a
There were no error messages.
I expect the number of input files to match the number of lines in the output file, but there is a small discrepancy of about 5 lines out of 4.5 million lines between the two.

Suggestion: If a is set globally, checking to see if there are less than three lines will fail.
(I would put this in comments but I don’t have enough rep)

Your general idea is correct, but things can be made a bit simpler.
I also suppose that the discrepancy is due to files with empty third line, or with fewer than 3 lines..
def extract_line(filename):
with open(filename) as f:
for line_no, line_text in enumerate(f):
if line_no == 2:
return line_text.strip() # Stop searching, we found the third line
# Here file f is closed because the `with` statement's scope ended.
# None is implicitly returned here.
def process_files(source_of_filenames):
processed = 0 # Count the files where we found the third line.
for filename in source_of_filenames:
third_line = extract_line(filename)
if third_line:
processed += 1 # Account for the success.
# Write the third line; given as an illustration.
with open(filename + ".3rd-line", "w") as f:
f.write(third_line)
else:
print("File %s has a problem with third line" % filename);
return processed
def main(): # I don't know the source of your file names.
filenames = # Produce a list or a generator here.
processed = process_files(filenames)
print("Processed %d files successfully", processed)
Hope this helps.

Fastest Way to Delete a Line from Large File in Python

I am working with a very large (~11GB) text file on a Linux system. I am running it through a program which is checking the file for errors. Once an error is found, I need to either fix the line or remove the line entirely. And then repeat...
Eventually once I'm comfortable with the process, I'll automate it entirely. For now however, let's assume I'm running this by hand.
What would be the fastest (in terms of execution time) way to remove a specific line from this large file? I thought of doing it in Python...but would be open to other examples. The line might be anywhere in the file.
If Python, assume the following interface:
def removeLine(filename, lineno):
Thanks,
-aj

You can have two file objects for the same file at the same time (one for reading, one for writing):
def removeLine(filename, lineno):
fro = open(filename, "rb")
current_line = 0
while current_line < lineno:
fro.readline()
current_line += 1
seekpoint = fro.tell()
frw = open(filename, "r+b")
frw.seek(seekpoint, 0)
# read the line we want to discard
fro.readline()
# now move the rest of the lines in the file
# one line back
chars = fro.readline()
while chars:
frw.writelines(chars)
chars = fro.readline()
fro.close()
frw.truncate()
frw.close()

Modify the file in place, offending line is replaced with spaces so the remainder of the file does not need to be shuffled around on disk. You can also "fix" the line in place if the fix is not longer than the line you are replacing
import os
from mmap import mmap
def removeLine(filename, lineno):
f=os.open(filename, os.O_RDWR)
m=mmap(f,0)
p=0
for i in range(lineno-1):
p=m.find('\n',p)+1
q=m.find('\n',p)
m[p:q] = ' '*(q-p)
os.close(f)
If the other program can be changed to output the fileoffset instead of the line number, you can assign the offset to p directly and do without the for loop

As far as I know, you can't just open a txt file with python and remove a line. You have to make a new file and move everything but that line to it. If you know the specific line, then you would do something like this:
f = open('in.txt')
fo = open('out.txt','w')
ind = 1
for line in f:
if ind != linenumtoremove:
fo.write(line)
ind += 1
f.close()
fo.close()
You could of course check the contents of the line instead to determine if you want to keep it or not. I also recommend that if you have a whole list of lines to be removed/changed to do all those changes in one pass through the file.

If the lines are variable length then I don't believe that there is a better algorithm than reading the file line by line and writing out all lines, except for the one(s) that you do not want.
You can identify these lines by checking some criteria, or by keeping a running tally of lines read and suppressing the writing of the line(s) that you do not want.
If the lines are fixed length and you want to delete specific line numbers, then you may be able to use seek to move the file pointer... I doubt you're that lucky though.

Update: solution using sed as requested by poster in comment.
To delete for example the second line of file:
sed '2d' input.txt
Use the -i switch to edit in place. Warning: this is a destructive operation. Read the help for this command for information on how to make a backup automatically.

def removeLine(filename, lineno):
in = open(filename)
out = open(filename + ".new", "w")
for i, l in enumerate(in, 1):
if i != lineno:
out.write(l)
in.close()
out.close()
os.rename(filename + ".new", filename)

I think there was a somewhat similar if not exactly the same type of question asked here. Reading (and writing) line by line is slow, but you can read a bigger chunk into memory at once, go through that line by line skipping lines you don't want, then writing this as a single chunk to a new file. Repeat until done. Finally replace the original file with the new file.
The thing to watch out for is when you read in a chunk, you need to deal with the last, potentially partial line you read, and prepend that into the next chunk you read.

#OP, if you can use awk, eg assuming line number is 10
$ awk 'NR!=10' file > newfile

I will provide two alternatives based on the look-up factor (line number or a search string):
Line number
def removeLine2(filename, lineNumber):
with open(filename, 'r+') as outputFile:
with open(filename, 'r') as inputFile:
currentLineNumber = 0
while currentLineNumber < lineNumber:
inputFile.readline()
currentLineNumber += 1
seekPosition = inputFile.tell()
outputFile.seek(seekPosition, 0)
inputFile.readline()
currentLine = inputFile.readline()
while currentLine:
outputFile.writelines(currentLine)
currentLine = inputFile.readline()
outputFile.truncate()
String
def removeLine(filename, key):
with open(filename, 'r+') as outputFile:
with open(filename, 'r') as inputFile:
seekPosition = 0
currentLine = inputFile.readline()
while not currentLine.strip().startswith('"%s"' % key):
seekPosition = inputFile.tell()
currentLine = inputFile.readline()
outputFile.seek(seekPosition, 0)
currentLine = inputFile.readline()
while currentLine:
outputFile.writelines(currentLine)
currentLine = inputFile.readline()
outputFile.truncate()

Two simple questions about python

I have 2 simple questions about python:
1.How to get number of lines of a file in python?
2.How to locate the position in a file object to the
last line easily?

lines are just data delimited by the newline char '\n'.
1) Since lines are variable length, you have to read the entire file to know where the newline chars are, so you can count how many lines:
count = 0
for line in open('myfile'):
count += 1
print count, line # it will be the last line
2) reading a chunk from the end of the file is the fastest method to find the last newline char.
def seek_newline_backwards(file_obj, eol_char='\n', buffer_size=200):
if not file_obj.tell(): return # already in beginning of file
# All lines end with \n, including the last one, so assuming we are just
# after one end of line char
file_obj.seek(-1, os.SEEK_CUR)
while file_obj.tell():
ammount = min(buffer_size, file_obj.tell())
file_obj.seek(-ammount, os.SEEK_CUR)
data = file_obj.read(ammount)
eol_pos = data.rfind(eol_char)
if eol_pos != -1:
file_obj.seek(eol_pos - len(data) + 1, os.SEEK_CUR)
break
file_obj.seek(-len(data), os.SEEK_CUR)
You can use that like this:
f = open('some_file.txt')
f.seek(0, os.SEEK_END)
seek_newline_backwards(f)
print f.tell(), repr(f.readline())

Let's not forget
f = open("myfile.txt")
lines = f.readlines()
numlines = len(lines)
lastline = lines[-1]
NOTE: this reads the whole file in memory as a list. Keep that in mind in the case that the file is very large.

The easiest way is simply to read the file into memory. eg:
f = open('filename.txt')
lines = f.readlines()
num_lines = len(lines)
last_line = lines[-1]
However for big files, this may use up a lot of memory, as the whole file is loaded into RAM. An alternative is to iterate through the file line by line. eg:
f = open('filename.txt')
num_lines = sum(1 for line in f)
This is more efficient, since it won't load the entire file into memory, but only look at a line at a time. If you want the last line as well, you can keep track of the lines as you iterate and get both answers by:
f = open('filename.txt')
count=0
last_line = None
for line in f:
num_lines += 1
last_line = line
print "There were %d lines. The last was: %s" % (num_lines, last_line)
One final possible improvement if you need only the last line, is to start at the end of the file, and seek backwards until you find a newline character. Here's a question which has some code doing this. If you need both the linecount as well though, theres no alternative except to iterate through all lines in the file however.

For small files that fit memory,
how about using str.count() for getting the number of lines of a file:
line_count = open("myfile.txt").read().count('\n')

I'd like too add to the other solutions that some of them (those who look for \n) will not work with files with OS 9-style line endings (\r only), and that they may contain an extra blank line at the end because lots of text editors append it for some curious reasons, so you might or might not want to add a check for it.

The only way to count lines [that I know of] is to read all lines, like this:
count = 0
for line in open("file.txt"): count = count + 1
After the loop, count will have the number of lines read.

For the first question there're already a few good ones, I'll suggest #Brian's one as the best (most pythonic, line ending character proof and memory efficient):
f = open('filename.txt')
num_lines = sum(1 for line in f)
For the second one, I like #nosklo's one, but modified to be more general should be:
import os
f = open('myfile')
to = f.seek(0, os.SEEK_END)
found = -1
while found == -1 and to > 0:
fro = max(0, to-1024)
f.seek(fro)
chunk = f.read(to-fro)
found = chunk.rfind("\n")
to -= 1024
if found != -1:
found += fro
It seachs in chunks of 1Kb from the end of the file, until it finds a newline character or the file ends. At the end of the code, found is the index of the last newline character.

Answer to the first question (beware of poor performance on large files when using this method):
f = open("myfile.txt").readlines()
print len(f) - 1
Answer to the second question:
f = open("myfile.txt").read()
print f.rfind("\n")
P.S. Yes I do understand that this only suits for small files and simple programs. I think I will not delete this answer however useless for real use-cases it may seem.

Answer1:
x = open("file.txt")
opens the file or we have x associated with file.txt
y = x.readlines()
returns all lines in list
length = len(y)
returns length of list to Length
Or in one line
length = len(open("file.txt").readlines())
Answer2 :
last = y[-1]
returns the last element of list

Approach:
Open the file in read-mode and assign a file object named “file”.
Assign 0 to the counter variable.
Read the content of the file using the read function and assign it to a
variable named “Content”.
Create a list of the content where the elements are split wherever they encounter an “\n”.
Traverse the list using a for loop and iterate the counter variable respectively.
Further the value now present in the variable Counter is displayed
which is the required action in this program.
Python program to count the number of lines in a text file
# Opening a file
file = open("filename","file mode")#file mode like r,w,a...
Counter = 0
# Reading from file
Content = file.read()
CoList = Content.split("\n")
for i in CoList:
if i:
Counter += 1
print("This is the number of lines in the file")
print(Counter)
The above code will print the number of lines present in a file. Replace filename with the file with extension and file mode with read - 'r'.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python - Speedy solution for deleting duplicate lines in 2 files - python

Take a look at difflib, it comes with Python. It can tell you where your files are different or identical. See also python difflib comparing files

Related

Python. ".readline", ".readlines" methods issue. How does it work

How to merge two files by line names using python

How to extract specific line from large number (4.5 M) of files and debug properly?

Fastest Way to Delete a Line from Large File in Python

Two simple questions about python

Categories

Resources