Multiple file comparison - python

I want to compare multiple gzipped files (15-20) and extract the lines that are common to them, but it is not quite that simple: lines should count as matching if they are identical in certain columns, and I would also like a count of how many files each line appears in. If the count is 1, the line is unique to a single file, and so on. It would also be nice to keep track of which file names each line came from.
Each file looks like this:
##SAMPLE=<ID=NormalID,Description="Cancer-paired normal sample. Sample ID 'NORMAL'">
##SAMPLE=<ID=CancerID,Description="Cancer sample. Sample ID 'TUMOR'">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NormalID_NORMAL CancerID_TUMOR
chrX 136109567 . C CT . PASS IC=8;IHP=8;NT=ref;QSI=35;QSI_NT=35;RC=7;RU=T;SGT=ref->het;SOMATIC;TQSI=1;TQSI_NT=1;phastCons;CSQ=T|ENSG00000165370|ENST00000298110|Transcript|5KB_downstream_variant|||||||||YES|GPR101||||| DP:DP2:TAR:TIR:TOR:DP50:FDP50:SUBDP50 23:23:21,21:0,0:2,2:21.59:0.33:0.00 33:33:16,16:13,13:4,4:33.38:0.90:0.00
chrX 150462334 . T TA . PASS IC=2;IHP=2;NT=ref;QSI=56;QSI_NT=56;RC=1;RU=A;SGT=ref->het;SOMATIC;TQSI=2;TQSI_NT=2;CSQ=A||||intergenic_variant||||||||||||||| DP:DP2:TAR:TIR:TOR:DP50:FDP50:SUBDP50 30:30:30,30:0,0:0,0:31.99:0.00:0.00 37:37:15,17:16,16:6,5:36.7:0.31:0.00
Files are tab delimited.
If a line starts with #, ignore it. We are only interested in the lines that do not.
Using 0-based Python indices, we are interested in fields 0, 1, 2, 3 and 4. They have to match between files for a line to be reported as common. However, we still need to hold the information from the rest of the columns/fields, so that they can be written to the output file.
Right now I have the following code:
import gzip

filenames = ['a', 'b', 'c']
files = [gzip.open(name, 'rt') for name in filenames]
sets = [set(line.strip() for line in file if not line.startswith('#')) for file in files]
common = set.intersection(*sets)
for file in files:
    file.close()
print(common)
In my current code I do not know where to correctly place the if not line.startswith('#') check, and how to specify which columns of a line should be matched. Not to mention that I have no idea how to get the lines that are, for example, present in 6 files, or present in 10 out of a total of 15 files.
Any help with this?

Collect the lines in a dictionary with the fields that make them similar as key:
from collections import defaultdict

d = defaultdict(list)

def process(filename, line):
    if line[0] == '#':
        return
    fields = line.split('\t')
    key = tuple(fields[0:5])  # Fields that make lines similar/same
    d[key].append((filename, line))

for filename in filenames:
    with gzip.open(filename, 'rt') as fh:
        for line in fh:
            process(filename, line.strip())
Now, you have a dictionary with lists of filename-line tuples. You can now print all the lines which appear more than 10 times:
for l in d.values():
    if len(l) < 10:
        continue
    print('Same key found %d times:' % len(l))
    for filename, line in l:
        print('%s: %s' % (filename, line))
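Note that len(l) counts occurrences, so a key that appears twice in one file is counted twice. To count distinct files per key, and to report, say, the keys present in exactly 10 files together with their counts and file names, a minimal sketch building on the d from above could look like this (the threshold and the tab-separated output format are just examples):

for key, entries in d.items():
    names = sorted(set(fn for fn, _ in entries))  # distinct files containing this key
    if len(names) == 10:  # or len(names) >= 6, etc.
        print('%s\t%d\t%s' % ('\t'.join(key), len(names), ','.join(names)))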


Use a file to search another file and print lines matching a pattern to first file

Python noob here. I've been smashing my head trying to do this; I've tried several Unix tools, and I'm convinced that Python is the way to go.
I have two files, File1 has headers and numbers like this:
>id1
77
>id2
2
>id3
2
>id4
22
...
Note that each ID is unique, but the number assigned to it may repeat. I have several files like this, all with the same number of headers (~500).
File2 has all the numbers of File1, each followed by an appended sequence:
1
ATCGTCATA
2
ATCGTCGTA
...
22
CCCGTCGTA
...
77
ATCGTCATA
...
Note that each sequence ID is unique, as are all the sequences that follow them. I have the same number of these files as of File1, but the number of sequences within each File2 may vary (~150).
My desired output is File1 with the sequences from File2; it is important that File1 maintains its original order.
>id1
ATCGTCATA
>id2
ATCGTCGTA
>id3
ATCGTCGTA
>id4
CCCGTCGTA
My approach is to extract the numbers from File1 and use them as a pattern to match in File2. First I am trying to make this work with only a pair of files. Here is what I have achieved:
#!/usr/bin/env python
import re
datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'
with open(datafile, 'r') as f:
    datafile_lines = set([line.strip() for line in f])  # maybe I could use regex to get only lines with number as pattern?
print(datafile_lines)
outputlist = []
with open(schemaseqs, 'r') as f:
    for line in f:
        seqs = line.split(',')[0]
        if seqs[1:-1] in datafile_lines:
            outputlist.append(line)
print(outputlist)
This outputs a mix of patterns from File1 and the sequences from File2. Any help is appreciated.
PS: I am open to modifications to the file structure; I tried substituting the \n in File2 with "," to no avail.
datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'
d = {}
prev = None
with open(datafile, 'r') as f:
    i = 0
    for line in f:
        if i % 2 == 0:
            d[line.strip()] = 0
            prev = line.strip()
        else:
            d[prev] = line.strip()
        i += 1
new_d = {}
with open(schemaseqs, 'r') as f:
    i = 0
    prev = None
    for line in f:
        if i % 2 == 0:
            new_d[line.strip()] = 0
            prev = line.strip()
        else:
            new_d[prev] = line.strip()
        i += 1
for key, value in d.items():
    if value in new_d:
        d[key] = new_d[value]
print(d)
# note: this overwrites the original datafile with the merged result
with open(datafile, 'w') as filee:
    for k, v in d.items():
        filee.write(k)
        filee.write('\n')
        filee.write(v)
        filee.write('\n')
Creating two dictionaries is easy; then you can map the values of one dictionary onto the other.
Since the files are so neatly organized, I wouldn't use a set to store the lines. Sets don't enforce order, and the order of these lines conveys a lot of information. I also wouldn't use Regex; it's probably overkill for the task of parsing individual lines, but not powerful enough to keep track of which ID corresponds to each gene sequence.
Instead, I would read the files in the opposite order. First, read the file with the gene sequences and build a mapping of IDs to genes. Then read in the first file and replace each id with the corresponding value in that mapping.
If the IDs are a continuous sequence (1, 2, 3... n, n+1), then a list is probably the easiest way to store them. If the file is already in order, you don't even have to pay attention to the ID numbers; you can just skip every other row and append each gene sequence to an array in order. If they aren't continuous, you can use a dictionary with the IDs as keys. I'll use the dictionary approach for this example:
id_to_gene_map = {}
with open(file2, 'r') as id_to_gene_file:
    for line_number, line in enumerate(id_to_gene_file, start=1):
        if line_number % 2 == 1:  # Update ID on odd numbered lines, including line 1
            current_id = line
        else:
            id_to_gene_map[current_id] = line  # Map previous line's ID to this line's value

with open(file1, 'r') as input_file, open('output.txt', 'w') as output_file:
    for line in input_file:
        if not line.startswith(">"):  # Keep ">id1" lines unchanged
            line = id_to_gene_map[line]  # Otherwise, replace with the corresponding gene
        output_file.write(line)
In this case, the IDs and values both have trailing newlines. You can strip them out, but since you'll want to add them back in for writing the output file, it's probably easiest to leave them alone.
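For completeness, here is a rough sketch of the list-based variant mentioned above, assuming the IDs in File2 are the continuous sequence 1, 2, 3, ... and the file strictly alternates ID and sequence lines (the file names here are placeholders):

genes = []
with open('file2.txt', 'r') as id_to_gene_file:
    for line_number, line in enumerate(id_to_gene_file):
        if line_number % 2 == 1:  # sequence lines; the IDs are implicit in the order
            genes.append(line)

with open('file1.txt', 'r') as input_file, open('output.txt', 'w') as output_file:
    for line in input_file:
        if not line.startswith('>'):
            line = genes[int(line) - 1]  # ID n maps to list index n-1
        output_file.write(line)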

How can I edit my python script so it can select the whole text of a fasta sequence?

I have 2 files: one is a text file that contains a series of IDs, and the other is a multifasta file that contains the fasta sequences corresponding to the IDs in the first file. I have a Python script that can select the matching IDs from both files. It looks like this:
from Bio import SeqIO

fasta = SeqIO.parse("fasta1.fasta", "fasta")
seq_dict = {}
for record in fasta:
    seq_dict[record.id] = record.seq
#print(seq_dict)
for line in open("names", "r"):
    line = line.strip()
    print(line)
for cle in seq_dict.keys():
    print(cle)
I need to edit my script so it can select the text of the sequence next to its corresponding ID. Can you help me please to do that? Thank you.
After playing a bit with Bio.SeqIO, I concluded that @Bazingaa is probably correct. I adapted your code like so:
from Bio import SeqIO

fasta = SeqIO.parse("fasta1.fasta", "fasta")
seq_dict = {}
for record in fasta:
    seq_dict[record.id] = record.description
#print(seq_dict)
for line in open("names", "r"):
    line = line.strip()
    print(line)
for cle, desc in seq_dict.items():
    print(cle)
    print(desc)
You seem to be new to python, so here's what I did:
instead of keeping record.seq, I stored record.description
for a, b in <some dictionary>.items() will iterate through the dictionary items returning key, value pairs into the a, b variables
Hope this helps.
Edit:
Here's a somewhat more "pythonic" version. I don't really understand what fasta is so I assumed you'd want to read lines from names, take the 'tr|something|something' part as ids (without the leading '>') and print out the ones from 'fasta1.fasta' if they're in names:
from Bio import SeqIO

fasta = SeqIO.parse("fasta1.fasta", "fasta")
# read all the names
with open("names", "r") as f:  # this takes care to close the file afterwards
    names = [line.strip().lstrip('>') for line in f]
print("Names: ", names)
for record in fasta:
    print("Record:", record.id)
    if record.id in names:
        print("Matching record:", record.id, record.seq, record.description)
If you want to extract only the items from the names file, it is probably more efficient to read the names into memory first.
from Bio import SeqIO

wanted = dict()
with open("names", "r") as lines:
    for line in lines:
        wanted[line.strip()] = 1
for record in SeqIO.parse("fasta1.fasta", "fasta"):
    if record.id in wanted:
        print(record.seq)
See if this works:
from Bio import SeqIO

fasta = SeqIO.parse("fasta1.fasta", "fasta")
seq_dict = {}
for record in fasta:
    seq_dict[record.id.strip()] = record.seq
with open("names", "r") as lines:
    for line in lines:
        l = line.strip().lstrip('>')
        if l in seq_dict:
            print(l)              # ID
            print(seq_dict[l])    # sequence
Note this assumes that the IDs obtained from the fasta file are the same as the IDs in the names file. If this is not the case, please provide further details of exactly what each of the two files contains (with examples).

how to subsample a fasta file based on the headers if headers contain certain strings?

I have a fasta file like this:
>gi|373248686|emb|HE586118.1| Streptomyces albus subsp. albus salinomycin biosynthesis cluster, strain DSM 41398
GGATGCGAAGGACGCGCTGCGCAAGGCGCTGTCGATGGGTGCGGACAAGGGCATCCACGT
CGAGGACGACGATCTGCACGGCACCGACGCCGTGGGTACCTCGCTGGTGCTGGCCAAGGC
>gi|1139489917|gb|KX622588.1| Hyalangium minutum strain DSM 14724 myxochromide D subtype 1 biosynthetic gene cluster and tRNA-Thr gene, complete sequence
ATGCGCAAGCTCGTCATCACGGTGGGGATTCTGGTGGGGTTGGGGCTCGTGGTCCTTTGG
TTCTGGAGCCCGGGAGGCCCAGTCCCCTCCACGGACACGGAGGGGGAAGGGCGGAGTCAG
CGCCGGCAGGCCATGGCCCGGCCCGGCTCCGCGCAGCTGGAGAGTCCCGAGGACATGGGG
>gi|930076459|gb|KR364704.1| Streptomyces sioyaensis strain BCCO10_981 putative annimycin-type biosynthetic gene cluster, partial sequence
GCCGGCAGGTGGGCCGCGGTCAGCTTCAGGACCGTGGCCGTCGCGCCCGCCAGCACCACG
GAGGCCCCCACGGCCAGCGCCGGGCCCGTGCCCGTGCCGTACGCGAGGTCCGTGCTGAAC
and I have a text file containing a list of numbers:
373248686
930076459
296280703
......
I want to do the following:
if the header in the fasta file contains the numbers in the text file:
write all the matches(header+sequence) to a new output.fasta file.
How do I do this in Python? It seems easy; a few for loops should do the job, but somehow I cannot make it happen, and if my files are really big, a loop inside another loop may take a long time. Here's what I have tried:
from Bio import SeqIO
import sys

wanted = []
for line in open(sys.argv[2]):
    titles = line.strip()
    wanted.append(titles)
seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
sys.stdout = open('output.fasta', 'w')
new_seq = []
for seq in seqiter:
    new_seq.append(seq if i in seq.id for i in wanted)
SeqIO.write(new_seq, sys.stdout, "fasta")
sys.stdout.close()
Got this error:
new_seq.append(seq if i in seq.id for i in wanted)
^
SyntaxError: invalid syntax
Is there a better way to do this?
Thank you!
Use a program like this
from Bio import SeqIO
import sys

# read in the text file
numbersInTxtFile = set()
# hint: use with, then you don't need to program the file closing.
# Furthermore, error handling comes along with this too.
with open(sys.argv[2], "r") as inF:
    for line in inF:
        line = line.strip()
        if line == "":
            continue
        numbersInTxtFile.add(int(line))

# read in the fasta file
with open(sys.argv[1], "r") as inF:
    for record in SeqIO.parse(inF, "fasta"):
        # now check if this record in the fasta file
        # has an id we are searching for
        name = record.description
        seq_id = int(name.split("|")[1])
        if seq_id in numbersInTxtFile:
            # we need to output this record
            print(">%s" % name)
            print(record.seq)
which you can then call like so from the commandline
python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa
Import your "keeper" IDs into a dictionary rather than a list, this will be much faster as the list doesn't have to be searched thousands of times.
keepers = {}
with open("ids.txt", "r") as id_handle:
    for curr_id in id_handle:
        keepers[curr_id.strip()] = True  # strip the newline so lookups match
A list comprehension generates a list, so you don't need to append to another list.
keeper_seqs = [x for x in seqiter if x.id in keepers]
With larger files you will want to loop over seqiter and write the entries one at a time to avoid memory issues.
You should also never assign to sys.stdout without a good reason; if you want to output to STDOUT, just use print or sys.stdout.write().
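Putting those points together, a streaming version might look like this (a sketch, assuming as above that the lines of ids.txt match record.id exactly; the file names are examples):

from Bio import SeqIO

keepers = set()
with open("ids.txt", "r") as id_handle:
    for curr_id in id_handle:
        keepers.add(curr_id.strip())

with open("input.fasta") as in_handle, open("output.fasta", "w") as out_handle:
    for record in SeqIO.parse(in_handle, "fasta"):
        if record.id in keepers:
            SeqIO.write(record, out_handle, "fasta")  # write one record at a time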

Compare files line by line to see if they are the same, if so output them

How would I go about this? I have files in which I have sorted the information, and I want to compare a certain index in one file with an index in another. One problem is that the files are enormously large, millions of lines. I want to compare the files line by line; if they match, I want to output both those values along with other values, using an index method.
=======================
Let me clarify: I want to take, say, line[x], where x will remain the same as the file is formatted uniformly. I want to compare line[x] from one file against line[y] in another file, do this for the whole file, and output every matching pair to another file. In that output file I also want to include other pieces from the first file, which would be like just adding more indexes, such as line[a], line[b], line[c], line[d], and finally line[y] as the match to that information.
Try 3:
I have a file with information in this format:
#x is a line
x= data,data,data,data,data,data
there is millions of lines of that.
I have another file, same format:
#x is a line
x= data,data,data,data
I want to use x[#] from the first file and x[#] from the second file, see if those two values match, and if they do, output them along with several other x[#] values from the second file that are on the same line.
Did that help at all to understand?
The files are formatted as I said (but there are millions of lines, and I want to find the pairs in the two files, because they should all match up):
line 1 data,data,data,data
line 2 data,data,data,data
data from file 1:
(N'068D556A1A665123A6DD2073A36C1CAF', N'A76EEAF6D310D4FD2F0BD610FAC02C04DFE6EB67',
N'D7C970DFE09687F1732C568AE1CFF9235B2CBB3673EA98DAA8E4507CC8B9A881');
data from file 2:
00000040f2213a27ff74019b8bf3cfd1|index.docbook|Redhat 7.3 (32bit)|Linux
00000040f69413a27ff7401b8bf3cfd1|index.docbook|Redhat 8.0 (32bit)|Linux
00000965b3f00c92a18b2b31e75d702c|Localizable.strings|Mac OS X 10.4|OSX
0000162d57845b6512e87db4473c58ea|SYSTEM|Windows 7 Home Premium (32bit)|Windows
000011b20f3cefd491dbc4eff949cf45|totem.devhelp|Linux Ubuntu Desktop 9.10 (32bit)|Linux
The files are sorted alphanumerically, and I want to use a slider method. By that I mean: if file1[x] < file2[x], move the slider down or up depending on which value is greater, until a match is found; when and if one is found, print the output along with other values that will identify that hash.
What I want as a result would be:
file1[x] and its corresponding match in file2[x], output to a file, as well as other file1[x] values, where x can be any index from the line.
Using this method, comparing line by line, you don't have to store the files in memory, which matters as the files are huge in size:
with open('file1.txt') as f1, open('file2.txt') as f2, open('file3.txt', 'w') as f3:
    for x, y in zip(f1, f2):
        if x == y:
            f3.write(x)
What I got from the clarification:
file1 and file2 are in the same format, where each line looks like
{32 char hex key}|{text1}|{text2}|{text3}
the files are sorted in ascending order by key
for each key that appears in both file1 and file2, you want merged output, so each line looks like
{32 char hex key}|{text11}|{text12}|{text13}|{text21}|{text22}|{text23}
You basically want the collisions from a merge sort:
import csv

def getnext(csvfile, key=lambda row: int(row[0], 16)):
    row = next(csvfile)
    return key(row), row

with open('file1.dat', newline='') as inf1, open('file2.dat', newline='') as inf2, open('merged.dat', 'w', newline='') as outf:
    a = csv.reader(inf1, delimiter='|')
    b = csv.reader(inf2, delimiter='|')
    res = csv.writer(outf, delimiter='|')
    a_key, b_key = -1, 0
    try:
        while True:
            while a_key < b_key:
                a_key, a_row = getnext(a)
            while b_key < a_key:
                b_key, b_row = getnext(b)
            if a_key == b_key:
                res.writerow(a_row + b_row[1:])
                a_key, a_row = getnext(a)  # advance past the match so the loop keeps making progress
    except StopIteration:
        # reached the end of an input file
        pass
I still have no idea what you are trying to communicate by 'as well as other file1[x] where x can be any index from the line'.
Comparing the contents of two files starting at a specified byte offset:
fp1 = open("file1.txt", "r")
fp2 = open("file2.txt", "r")
fp1.seek(index)
fp2.seek(index)
line1 = fp1.readline()
line2 = fp2.readline()
if line1 == line2:
print(line1)
fp1.close()
fp2.close()
Comparing two files line by line to see if they match, otherwise print the line:
fp1 = open("file1.txt", "r")
fp2 = open("file2.txt", "r")
line1, line2 = fp1.readline(), fp2.readline()
while line1 and line2:
if line1 != line2:
print("Mismatch.\n1: %s\n2: %s" % (line1, line2))
fp1.close()
fp2.close()
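If you want to compare a single field per line rather than the whole line, as the question asks, a variant might look like this (a sketch that pairs the files line by line like the zip example above; the field positions x and y and the '|' delimiter are assumptions based on the sample data):

x, y = 0, 0  # field positions to compare in file1 and file2
with open('file1.txt') as f1, open('file2.txt') as f2, open('matches.txt', 'w') as out:
    for line1, line2 in zip(f1, f2):
        fields1 = line1.rstrip('\n').split('|')
        fields2 = line2.rstrip('\n').split('|')
        if fields1[x] == fields2[y]:
            # write the matching key plus the other fields from file2
            out.write('|'.join([fields1[x]] + fields2[1:]) + '\n')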

writing lines group by group in different files

I've got a little script which is not working nicely for me, hope you can help and find the problem.
I have two starting files:
traveltimes: contains the lines I need. It's a single-column file (every row has just one number). The groups of lines I need are separated by a line which starts with 11 whitespaces.
header lines: contains three header lines
output_file: I want to get 29 files (STA%s). What's inside? Every file will contain the same header lines, after which I want to append the group of lines contained in the traveltimes file (a different group of lines for every file). Every group of lines is made up of 74307 rows (1 column).
So far this script creates 29 files with the same header lines, but then it mixes everything up; I mean it writes something, but it's not what I want.
Any idea????
def make_station_files(traveltimes, header_lines):
    """Gives the STAxx.tgrid files required by loc3d"""
    sta_counter = 1
    with open(header_lines, 'r') as file_in:
        data = file_in.readlines()
        for i in range(29):
            with open('STA%s' % (sta_counter), 'w') as output_files:
                sta_counter += 1
                for i in data[0:3]:
                    values = i.strip()
                    output_files.write("%s\n\t1\n" % (values))
                with open(traveltimes, 'r') as times_file:
                    #collector = []
                    for line in times_file:
                        if line.startswith(" "):
                            break
                        output_files.write("%s" % (line))
Suggestion:
Read the header rows first. Make sure this works before proceeding. None of the rest of the code needs to be indented under this.
Consider writing a separate function to group the traveltimes file into a list of lists (see the sketch after this list).
Once you have a working traveltimes reader and grouper, only then create a new STA file, print the headers to it, and then write the timegroups to it.
Build your program up step-by-step, making sure it does what you expect at each step. Don't try to do it all at once because then you won't easily be able to track down where the issue lies.
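A plain grouper along the lines of the second suggestion might look like this (a sketch, assuming groups are separated by lines starting with 11 spaces, as described in the question):

def group_traveltimes(traveltimes):
    'Split the traveltimes file into a list of groups of lines'
    groups, current = [], []
    with open(traveltimes, 'r') as times_file:
        for line in times_file:
            if line.startswith(' ' * 11):  # separator line between groups
                if current:
                    groups.append(current)
                    current = []
            else:
                current.append(line)
    if current:
        groups.append(current)
    return groups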
My quick edit of your script uses itertools.groupby() as the grouper. It is a little advanced because the grouping function is stateful and tracks its state in a mutable list:
from itertools import groupby

def make_station_files(traveltimes, header_lines):
    'Gives the STAxx.tgrid files required by loc3d'
    with open(header_lines, 'r') as f:
        headers = f.readlines()

    def station_counter(line, cnt=[1]):
        'Stateful station counter -- keeps the count in a mutable list'
        if line.strip() == '':
            cnt[0] += 1
        return cnt[0]

    with open(traveltimes, 'r') as times_file:
        for station, group in groupby(times_file, station_counter):
            with open('STA%s' % (station), 'w') as output_file:
                for header in headers[:3]:
                    output_file.write('%s\n\t1\n' % (header.strip()))
                for line in group:
                    if not line.startswith(' '):
                        output_file.write('%s' % (line))
This code is untested because I don't have sample data. Hopefully, you'll get the gist of it.
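A call would then look something like this (the file names are placeholders):

make_station_files('traveltimes.txt', 'header_lines.txt')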
