How to merge two files by line names using Python

I think this should be easy, but I have not been able to solve it. I have two files as below, and I want to merge them so that the lines starting with > in file1 become the headers of the lines in file2.
file1:
>seq12
ACGCTCGCA
>seq34
GCATCGCGT
>seq56
GCGATGCGC
file2:
ATCGCGCATGATCTCAG
AGCGCGCATGCGCATCG
AGCAAATTTAGCAACTC
so the desired output should be:
>seq12
ATCGCGCATGATCTCAG
>seq34
AGCGCGCATGCGCATCG
>seq56
AGCAAATTTAGCAACTC
I have tried this code so far, but in the output all the lines coming from file2 are the same:
from Bio import SeqIO
with open(file1) as fw:
    with open(file2, 'r') as rv:
        for line in rv:
            items = line
            for record in SeqIO.parse(fw, 'fasta'):
                print('>' + record.id)
                print(line)

If you cannot store your files in memory, you need a solution that reads line by line from each file, and writes accordingly to the output file. The following program does that. The comments try to clarify, though I believe it is clear from the code.
with open("file1.txt") as first, open("file2.txt") as second, open("output.txt", "w+") as output:
while 1:
line_first = first.readline() # line from file1 (header)
line_second = second.readline() # line from file2 (body)
if not (line_first and line_second):
# if any file has ended
break
# write to output file
output.writelines([line_first, line_second])
# jump one line from file1
first.readline()
Note that this will only work if file1.txt has the specific format you presented (odd lines are headers, even lines are useless).
In order to allow a bit more customization, you can wrap it up in a function as:
def merge_files(header_file_path, body_file_path, output_file="output.txt", every_n_lines=2):
    with open(header_file_path) as first, open(body_file_path) as second, open(output_file, "w+") as output:
        while 1:
            line_first = first.readline()    # line from header
            line_second = second.readline()  # line from body
            if not (line_first and line_second):
                # if any file has ended
                break
            # write to output file
            output.writelines([line_first, line_second])
            # jump n lines from header
            for _ in range(every_n_lines - 1):
                first.readline()
And then calling merge_files("file1.txt", "file2.txt") should do the trick.

If both files are small enough to fit in memory at the same time, you can simply read them both and interleave their lines.
# Open two file handles.
with open("f1", mode="r") as f1, open("f2", mode="r") as f2:
    lines_first = f1.readlines()   # Read all lines in f1.
    lines_second = f2.readlines()  # Read all lines in f2.
    lines_out = []
    # For each line in the file without headers...
    for idx in range(len(lines_second)):
        # Take the header (every other line of the first file, starting at
        # index 0) and pair it with the corresponding line from the second.
        lines_out.append(lines_first[2 * idx])
        lines_out.append(lines_second[idx])
Writing lines_out to a file (for example with writelines) then gives the desired output.
If either or both files are too large to fit in memory, you can repeat the above process line-by-line over both handles (using one variable to store information from the file with headers).
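For example, a minimal line-by-line sketch of that idea, assuming the same layout as above (f1 alternates header and sequence lines, f2 has one body line per record):
# Streaming variant: never holds more than two lines in memory at once.
with open("f1") as f1, open("f2") as f2, open("out.txt", "w") as out:
    for body in f2:
        header = f1.readline()  # header line from the file with headers
        f1.readline()           # skip the unused sequence line
        if not header:
            break               # the header file ran out of records first
        out.write(header)
        out.write(body)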

Related

Find unique entries in files

I guess you have a solution concerning the following issue:
I want to compare two lists for common entries (based on column 10) and write the common entries to one file and the entries unique to the first list to another file. The code I wrote is:
INFILE1 = open("c:\\python\\test\\58962.filtered.csv", "r")
INFILE2 = open("c:\\python\\test\\83887.filtered.csv", "r")
OUTFILE1 = open("c:\\python\\test\\58962_vs_83887.common.csv", "w")
OUTFILE2 = open("c:\\python\\test\\58962_vs_83887.unique.csv", "w")
for line in INFILE1:
    line = line.rstrip().split(",")
    if line[11] in INFILE2:
        OUTFILE1.write(line)
    else:
        OUTFILE2.write(line)
INFILE1.close()
INFILE2.close()
OUTFILE1.close()
OUTFILE2.close()
The following error appears:
8 OUTFILE1.write(line)
9 else:
---> 10 OUTFILE2.write(line)
11 INFILE1.close()
TypeError: write() argument must be str, not list
Does somebody know how to help with this?
Best
This line
line = line.rstrip().split(",")
replaces the line you read from the file with its split-up list. You then try to write that list to your file - that's not how the write method works, and the error tells you exactly that.
Change it to:
for line in INFILE1:
    lineList = line.rstrip().split(",")  # don't overwrite line, use lineList
    if lineList[11] in INFILE2:          # used lineList
        OUTFILE1.write(line)             # corrected indentation
    else:
        OUTFILE2.write(line)
You could have easily found this error yourself by printing out the line before and after splitting, or just before writing.
Please read How to debug small programs and follow it - it's easier to find and fix bugs yourself than to post questions here.
You have another problem at hand, though:
Files are stream based; they start at position 0 in the file, and the position advances as you access parts of the file. When you are at the end, you won't get anything more from INFILE2.read() or other methods.
So if you want to repeatedly check whether some line's column from file1 is somewhere in file2, you need to read file2 into a list (or another data structure) so your repeated checks work. In other words, this:
if lineList[11] in INFILE2:
might work once; after that the file is consumed and it will return False every time.
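A tiny illustration of that behaviour:
with open("c:\\python\\test\\83887.filtered.csv") as INFILE2:
    print("something" in INFILE2)  # scans the whole file looking for a matching line
    print("something" in INFILE2)  # now always False: the stream is exhausted
    INFILE2.seek(0)                # rewinding makes the file readable again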
You also might want to change from:
f = open(...., ...)
# do something with f
f.close()
to
with open(name, "r") as f:
    # do something with f, no close needed, closed when leaving block
as it is safer and will close the file even if exceptions happen.
To solve that, try this (untested) code:
with open("c:\\python\\test\\83887.filtered.csv", "r") as file2:
    infile2 = file2.readlines()  # read in all lines as list

with open("c:\\python\\test\\58962.filtered.csv", "r") as INFILE1:
    # next 2 lines are 1 line, \ at end signifies line continues
    with open("c:\\python\\test\\58962_vs_83887.common.csv", "w") as OUTFILE1, \
         open("c:\\python\\test\\58962_vs_83887.unique.csv", "w") as OUTFILE2:
        for line in INFILE1:
            lineList = line.rstrip().split(",")
            if any(lineList[11] in x for x in infile2):  # check the list of lines:
                                                         # does any contain lineList[11]?
                OUTFILE1.write(line)
            else:
                OUTFILE2.write(line)
# all files are autoclosed here
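As a side note, any(lineList[11] in x for x in infile2) rescans every line of file2 for every line of file1. If the files are large, building a set of the column values once makes each lookup O(1) on average. A rough (also untested) sketch, assuming an exact match on that column is what you want and that every line has at least 12 fields:
with open("c:\\python\\test\\83887.filtered.csv", "r") as file2:
    # keep only column 12 (index 11) of every line, for constant-time lookups
    wanted = {l.rstrip().split(",")[11] for l in file2}

with open("c:\\python\\test\\58962.filtered.csv", "r") as INFILE1, \
     open("c:\\python\\test\\58962_vs_83887.common.csv", "w") as OUTFILE1, \
     open("c:\\python\\test\\58962_vs_83887.unique.csv", "w") as OUTFILE2:
    for line in INFILE1:
        if line.rstrip().split(",")[11] in wanted:
            OUTFILE1.write(line)
        else:
            OUTFILE2.write(line)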
Links to read:
The with statement
any() and other built-ins

in python loop print lines from alternating files

I am trying to use python to find four-line blocks of interest in two separate files and then print out some of those lines in a controlled order. Below are the two input files and an example of the desired output file. Note that the DNA sequence in Input.fasta is different from the DNA sequence in Input.fastq because the .fasta file has been read-corrected.
Input.fasta
>read1
AAAGGCTGT
>read2
AGTCTTTAT
>read3
CGTGCCGCT
Input.fastq
@read1
AAATGCTGT
+
'(''%$'))
@read2
AGTCTCTAT
+
&---+2010
@read3
AGTGTCGCT
+
0-23;:677
DesiredOutput.fastq
@read1
AAAGGCTGT
+
'(''%$'))
@read2
AGTCTTTAT
+
&---+2010
@read3
CGTGCCGCT
+
0-23;:677
Basically I need the sequence lines "AAAGGCTGT", "AGTCTTTAT", and "CGTGCCGCT" from Input.fasta and all other lines from Input.fastq. This allows the restoration of quality information to a read-corrected .fasta file.
Here is my closest failed attempt:
fastq = open("Input.fastq", "r")
fasta = open("Input.fasta", "r")
ReadIDs = []
IDs = []
with fastq as fq:
    for line in fq:
        if "read" in line:
            ReadIDs.append(line)
            print(line.strip())
            for ID in ReadIDs:
                IDs.append(ID[1:6])
            with fasta as fa:
                for line in fa:
                    if any(string in line for string in IDs):
                        print(next(fa).strip())
            next(fq)
            print(next(fq).strip())
            print(next(fq).strip())
I think I am running into trouble by trying to nest "with" calls to two different files in the same loop. This prints the desired lines for read1 correctly, but it does not continue to iterate through the remaining lines and throws the error "ValueError: I/O operation on closed file".
I suggest you use Biopython, which will save you a lot of trouble as it provides nice parsers for these file formats that handle not only the standard cases but also, for example, multi-line fasta.
Here is an implementation that replaces the fastq sequence lines with the corresponding fasta sequence lines:
from Bio import SeqIO

fasta_dict = {record.id: record.seq for record in
              SeqIO.parse('Input.fasta', 'fasta')}

def yield_records():
    for record in SeqIO.parse('Input.fastq', 'fastq'):
        record.seq = fasta_dict[record.id]
        yield record

SeqIO.write(yield_records(), 'DesiredOutput.fastq', 'fastq')
If you don't want to use the headers but just rely on the order, the solution is even simpler and more memory efficient (just make sure the order and number of records is the same): there is no need to define the dictionary first, just iterate over the records together:
fasta_records = SeqIO.parse('Input.fasta', 'fasta')
fastq_records = SeqIO.parse('Input.fastq', 'fastq')

def yield_records():
    for fasta_record, fastq_record in zip(fasta_records, fastq_records):
        fastq_record.seq = fasta_record.seq
        yield fastq_record

SeqIO.write(yield_records(), 'DesiredOutput.fastq', 'fastq')
## Open the files (and close them after the 'with' block ends)
with open("Input.fastq", "r") as fq, open("Input.fasta", "r") as fa:
    ## Read in the Input.fastq file and save its content to a list
    fastq = fq.readlines()
    ## Do the same for the Input.fasta file
    fasta = fa.readlines()
    ## For every read (the .fasta has 2 lines per read, the .fastq has 4)
    for i in range(len(fasta) // 2):
        print(fastq[4 * i].rstrip())      # header line from the fastq
        print(fasta[2 * i + 1].rstrip())  # corrected sequence from the fasta
        print(fastq[4 * i + 2].rstrip())  # '+' separator from the fastq
        print(fastq[4 * i + 3].rstrip())  # quality line from the fastq
I like the Biopython solution by @Chris_Rands better for small files, but here is a solution that only uses the batteries included with Python and is memory efficient. It assumes that the fasta and fastq files contain the same number of reads in the same order.
with open('Input.fasta') as fasta, open('Input.fastq') as fastq, open('DesiredOutput.fastq', 'w') as fo:
    for i, line in enumerate(fastq):
        if i % 4 == 1:
            for j in range(2):
                line = fasta.readline()
        print(line, end='', file=fo)

Searching rows of a file in another file and printing appropriate rows in python

I have a csv file like this: (no headers)
aaa,1,2,3,4,5
bbb,2,3,4,5,6
ccc,3,5,7,8,5
ddd,4,6,5,8,9
I want to search another csv file: (no headers)
bbb,1,2,3,4,5,,6,4,7
kkk,2,3,4,5,6,5,4,5,6
ccc,3,4,5,6,8,9,6,9,6
aaa,1,2,3,4,6,6,4,6,4
sss,1,2,3,4,5,3,5,3,5
and print the rows in the second file (based on matching the first columns) that exist in the first file. So the results will be:
bbb,1,2,3,4,5,,6,4,7
ccc,3,4,5,6,8,9,6,9,6
aaa,1,2,3,4,6,6,4,6,4
I have the following code, but it does not print anything:
labels = []
with open("csv1.csv", "r") as f:
    f.readline()
    for line in f:
        labels.append((line.strip("\n")))
with open("csv2.csv", "r") as f:
    f.readline()
    for line in f:
        if (line.split(",")[1]) in labels:
            print (line)
If possible, could you tell me how to do this, please? What is wrong with my code? Thanks in advance!
This is one solution, although you may also look into csv-specific tools and pandas as suggested:
labels = []
with open("csv1.csv", "r") as f:
    lines = f.readlines()
    for line in lines:
        labels.append(line.split(',')[0])

with open("csv2.csv", "r") as f:
    lines = f.readlines()
    with open("csv_out.csv", "w") as out:
        for line in lines:
            temp = line.split(',')
            if any(temp[0].startswith(x) for x in labels):
                out.write((',').join(temp))
The program first collects only the labels from csv1.csv - note that you used readline, where you seem to have expected all the lines of the file to be read at once. One way to do that is with readlines. The program also has to collect the lines returned by readlines - here it stores them in a list named lines. To collect the labels, the program loops through each line, splits it on , and appends the first element to the list of labels, labels.
In the second part, the program reads all the lines from csv2.csv while also opening the file for writing the output, csv_out.csv. It processes the lines from csv2.csv one by one, writing the matching lines to the output file as it goes.
To do that, the program again splits each line on , and checks whether the label from csv2 is found in the labels list. If it is, that line is written to csv_out.csv.
Try using pandas, it's a very effective way to read csv files into a data structure called a DataFrame.
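For example, a rough pandas sketch for the files in this question (dtype=str and keep_default_na=False keep every field exactly as written, including the empty one):
import pandas as pd

df1 = pd.read_csv("csv1.csv", header=None, dtype=str, keep_default_na=False)
df2 = pd.read_csv("csv2.csv", header=None, dtype=str, keep_default_na=False)

# keep the rows of csv2 whose first column appears in csv1's first column
df2[df2[0].isin(df1[0])].to_csv("csv_out.csv", header=False, index=False)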
EDIT
labels = []
with open("csv1.csv", "r") as f:
    f.readline()
    for line in f:
        labels.append(line.split(',')[0])
with open("csv2.csv", "r") as f:
    f.readline()
    for line in f:
        if (line.split(",")[0]) in labels:
            print (line)
I changed it so that labels only contains the first part of each string, i.e. ['aaa', 'bbb', etc.].
Then you want to check whether line.split(",")[0] is in labels.
Since you only want to match based on the first column, you should use split and then take the first item of the result, which is at index 0.
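For example, with the first line of csv2.csv:
line = "bbb,1,2,3,4,5,,6,4,7"
print(line.split(",")[0])  # prints 'bbb', the value to look up in labels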

Python - Compare values in lists (not 1:1 match)

I've got 2 txt files that are structured like this:
File 1
LINK1;FILENAME1
LINK2;FILENAME2
LINK3;FILENAME3
File 2
FILENAME1
FILENAME2
FILENAME3
And I use this code to print the "unique" lines contained in both files:
with open('1.txt', 'r') as f1, open('2.txt', 'r') as f2:
    a = f1.readlines()
    b = f2.readlines()
    non_duplicates = [line for line in a if line not in b]
    non_duplicates += [line for line in b if line not in a]
    for i in range(1, len(non_duplicates)):
        print non_duplicates[i]
The problem is that this way it prints all the lines of both files. What I want to do is to search whether FILENAME1 appears in some line of file 1 (the one with both links and filenames) and delete that line.
You need to first load all the lines in 2.txt and then filter out the lines in 1.txt that contain a line from the former. Use a set or frozenset for the "blacklist", so that each not in runs in O(1) on average. Also note that f1 and f2 are already iterable:
with open('2.txt', 'r') as f2:
    blacklist = frozenset(f2)

with open('1.txt', 'r') as f1:
    non_duplicates = [x.strip() for x in f1 if x.split(";")[1] not in blacklist]
If file2 is not too big, create a set of all its lines, split the file1 lines, and check if the second element is in the set of lines:
import fileinput
import sys

with open("file2.txt") as f:
    lines = set(map(str.rstrip, f))  # itertools.imap python2

for line in fileinput.input("file1.txt", inplace=True):
    # if FILENAME1 etc.. is not in the line, write the line
    if line.rstrip().split(";")[1] not in lines:
        sys.stdout.write(line)
file1:
LINK1;FILENAME1
LINK2;FILENAME2
LINK3;FILENAME3
LINK1;FILENAME4
LINK2;FILENAME5
LINK3;FILENAME6
file2:
FILENAME1
FILENAME2
FILENAME3
file1 after:
LINK1;FILENAME4
LINK2;FILENAME5
LINK3;FILENAME6
fileinput.input with inplace changes the original file. You don't need to store the lines in a list.
You can also write to a tempfile, writing the unique lines to it and using shutil.move to replace the original file:
from tempfile import NamedTemporaryFile
from shutil import move

with open("file2.txt") as f, open("file1.txt") as f2, NamedTemporaryFile(dir=".", delete=False) as out:
    lines = set(map(str.rstrip, f))
    for line in f2:
        if line.rstrip().split(";")[1] not in lines:
            out.write(line)

move(out.name, "file1.txt")
If your code errors out, you won't lose any data in the original file when using a tempfile.
Using a set to store the lines means we have O(1) lookups on average; storing all the lines in a list would give you a quadratic rather than a linear solution, so for larger files the set is significantly more efficient. There is also no need to store all the lines of the other file in a list with readlines, as you can write as you iterate over the file object and do your lookups.
Unless the files are too large, you may print the lines in file1.txt (which I call entries) whose filename part is not listed in file2.txt with something like this:
with open('file1.txt') as f1:
    entries = f1.read().splitlines()

with open('file2.txt') as f2:
    filenames_to_delete = f2.read().splitlines()

print [entry for entry in entries if entry.split(';')[1] not in filenames_to_delete]
If file1.txt is large and file2.txt is small, then you may load the filenames in file2.txt entirely in memory, and then open file1.txt and go through it, checking against the in-memory list.
If file1.txt is small and file2.txt is large, you may do it the other way round.
If file1.txt and file2.txt are both excessively large, then if it is known that both files’ lines are sorted by filename, one could write some elaborate code to take advantage of that sorting to get the task done without loading the entire files in memory, as in this SO question. But if this is not an issue, you’ll be better off loading everything in memory and keeping things simple.
P.S. Since it is not necessary to open the two files simultaneously, we avoid it; we open a file, read it, close it, and then repeat for the next. That way the code is simpler to follow.
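For the record, a rough sketch of the sorted-files idea mentioned above (it assumes both files are sorted by filename in ascending order and follow the formats shown earlier); it streams both files and keeps only one line of each in memory:
with open('file1.txt') as f1, open('file2.txt') as f2:
    to_delete = f2.readline().rstrip('\n')
    for entry in f1:
        filename = entry.rstrip('\n').split(';')[1]
        # advance file2 until its current name is >= this filename
        while to_delete and to_delete < filename:
            to_delete = f2.readline().rstrip('\n')
        if filename != to_delete:
            print(entry.rstrip('\n'))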

Fastest Way to Delete a Line from Large File in Python

I am working with a very large (~11GB) text file on a Linux system. I am running it through a program which is checking the file for errors. Once an error is found, I need to either fix the line or remove the line entirely. And then repeat...
Eventually once I'm comfortable with the process, I'll automate it entirely. For now however, let's assume I'm running this by hand.
What would be the fastest (in terms of execution time) way to remove a specific line from this large file? I thought of doing it in Python...but would be open to other examples. The line might be anywhere in the file.
If Python, assume the following interface:
def removeLine(filename, lineno):
Thanks,
-aj
You can have two file objects for the same file at the same time (one for reading, one for writing):
def removeLine(filename, lineno):
    fro = open(filename, "rb")

    current_line = 0
    while current_line < lineno:
        fro.readline()
        current_line += 1

    seekpoint = fro.tell()
    frw = open(filename, "r+b")
    frw.seek(seekpoint, 0)

    # read the line we want to discard
    fro.readline()

    # now move the rest of the lines in the file
    # one line back
    chars = fro.readline()
    while chars:
        frw.writelines(chars)
        chars = fro.readline()

    fro.close()
    frw.truncate()
    frw.close()
Modify the file in place: the offending line is replaced with spaces so the remainder of the file does not need to be shuffled around on disk. You can also "fix" the line in place if the fix is not longer than the line you are replacing.
import os
from mmap import mmap

def removeLine(filename, lineno):
    f = os.open(filename, os.O_RDWR)
    m = mmap(f, 0)
    p = 0
    for i in range(lineno - 1):
        p = m.find('\n', p) + 1
    q = m.find('\n', p)
    m[p:q] = ' ' * (q - p)
    os.close(f)
If the other program can be changed to output the file offset instead of the line number, you can assign the offset to p directly and do without the for loop.
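A sketch of that variant (the name removeLineAtOffset is made up; written for Python 3, where mmap works with bytes, and p is the byte offset reported by the other program):
import os
from mmap import mmap

def removeLineAtOffset(filename, p):
    f = os.open(filename, os.O_RDWR)
    m = mmap(f, 0)
    q = m.find(b'\n', p)     # end of the offending line
    m[p:q] = b' ' * (q - p)  # blank it out with spaces, same length as before
    m.close()
    os.close(f)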
As far as I know, you can't just open a txt file with python and remove a line. You have to make a new file and move everything but that line to it. If you know the specific line, then you would do something like this:
f = open('in.txt')
fo = open('out.txt', 'w')

ind = 1
for line in f:
    if ind != linenumtoremove:
        fo.write(line)
    ind += 1

f.close()
fo.close()
You could of course check the contents of the line instead to determine whether you want to keep it. I also recommend that if you have a whole list of lines to be removed or changed, you do all those changes in one pass through the file, as sketched below.
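For instance, a sketch of that one-pass idea, with a made-up set of line numbers to drop:
lines_to_remove = {3, 17, 42}  # hypothetical line numbers to delete

f = open('in.txt')
fo = open('out.txt', 'w')
for ind, line in enumerate(f, 1):
    if ind not in lines_to_remove:
        fo.write(line)
f.close()
fo.close()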
If the lines are variable length then I don't believe that there is a better algorithm than reading the file line by line and writing out all lines, except for the one(s) that you do not want.
You can identify these lines by checking some criteria, or by keeping a running tally of lines read and suppressing the writing of the line(s) that you do not want.
If the lines are fixed length and you want to delete specific line numbers, then you may be able to use seek to move the file pointer, as in the sketch below... I doubt you're that lucky though.
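For completeness, a sketch of that fixed-length idea (the function name and the reclen parameter are made up; it assumes every line, including its newline, is exactly reclen bytes, and it blanks the line with spaces rather than shifting the rest of the file):
def blankFixedLengthLine(filename, lineno, reclen):
    with open(filename, 'r+b') as f:
        f.seek((lineno - 1) * reclen)         # jump straight to the record
        f.write(b' ' * (reclen - 1) + b'\n')  # overwrite it with spaces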
Update: solution using sed as requested by poster in comment.
To delete for example the second line of file:
sed '2d' input.txt
Use the -i switch to edit in place. Warning: this is a destructive operation. Read the help for this command for information on how to make a backup automatically.
import os

def removeLine(filename, lineno):
    fin = open(filename)               # 'in' is a reserved word, so use fin/fout
    fout = open(filename + ".new", "w")
    for i, l in enumerate(fin, 1):
        if i != lineno:
            fout.write(l)
    fin.close()
    fout.close()
    os.rename(filename + ".new", filename)
I think there was a somewhat similar, if not exactly the same, question asked here. Reading (and writing) line by line is slow, but you can read a bigger chunk into memory at once, go through it line by line skipping the lines you don't want, then write it out as a single chunk to a new file. Repeat until done. Finally replace the original file with the new file.
The thing to watch out for is that when you read in a chunk, you need to deal with the last, potentially partial line you read, and prepend it to the next chunk you read.
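A rough sketch of that chunked approach (filter_file and keep_line are made-up names; keep_line is whatever predicate decides which lines to keep):
def filter_file(src, dst, keep_line, chunk_size=1 << 20):
    with open(src) as fin, open(dst, 'w') as fout:
        leftover = ''
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            lines = (leftover + chunk).split('\n')
            leftover = lines.pop()  # possibly partial last line, carried over
            fout.writelines(l + '\n' for l in lines if keep_line(l))
        if leftover and keep_line(leftover):
            fout.write(leftover)
os.rename or shutil.move can then swap the new file in for the original, as described above.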
@OP, if you can use awk, e.g. assuming the line number is 10:
$ awk 'NR!=10' file > newfile
I will provide two alternatives based on the look-up factor (line number or a search string):
Line number
def removeLine2(filename, lineNumber):
    with open(filename, 'r+') as outputFile:
        with open(filename, 'r') as inputFile:

            currentLineNumber = 0
            while currentLineNumber < lineNumber:
                inputFile.readline()
                currentLineNumber += 1

            seekPosition = inputFile.tell()
            outputFile.seek(seekPosition, 0)

            inputFile.readline()

            currentLine = inputFile.readline()
            while currentLine:
                outputFile.writelines(currentLine)
                currentLine = inputFile.readline()

            outputFile.truncate()
String
def removeLine(filename, key):
    with open(filename, 'r+') as outputFile:
        with open(filename, 'r') as inputFile:

            seekPosition = 0
            currentLine = inputFile.readline()
            while not currentLine.strip().startswith('"%s"' % key):
                seekPosition = inputFile.tell()
                currentLine = inputFile.readline()

            outputFile.seek(seekPosition, 0)

            currentLine = inputFile.readline()
            while currentLine:
                outputFile.writelines(currentLine)
                currentLine = inputFile.readline()

            outputFile.truncate()
