Comparing parts of lines in two TSV files in Python

So I want to sum/analyse values pertaining to a given line in one file when that line matches a line in another file.
The format of the first file I wish to compare against is:
Acetobacter cibinongensis Acetobacter Acetobacteraceae Rhodospirillales Proteobacteria Bacteria
Acetobacter ghanensis Acetobacter Acetobacteraceae Rhodospirillales Proteobacteria Bacteria
Acetobacter pasteurianus Acetobacter Acetobacteraceae Rhodospirillales Proteobacteria Bacteria
And the second file is like:
Blochmannia endosymbiont of Polyrhachis (Hedomyrma) turneri Candidatus Blochmannia Enterobacteriaceae Enterobacteriales Proteobacteria Bacteria 1990 7.511 14946.9
Blochmannia endosymbiont of Polyrhachis (Hedomyrma) turneri Candidatus Blochmannia Enterobacteriaceae Enterobacteriales Proteobacteria Bacteria 2061 6.451 13295.5
Calyptogena okutanii thioautotrophic gill symbiont Proteobacteria-undef Proteobacteria-undef Proteobacteria-undef Proteobacteria Bacteria 7121 2.466 17560.4
What I want to do is parse every line in the first file, and for every line in the second file where the first 6 fields match, perform analysis on the numbers in the 3 fields following the species info.
My code is as follows:
with open('file1', 'r') as file1:
    with open('file2', 'r') as file2:
        for line in file1:
            count = 0
            line = line.split("\t")
            for l in file2:
                l = l.split("\t")
                if l[0:6] == line[0:6]:
                    count += 1
            count = str(count)
            print line + '\t' + count + '\t' + '\n'
Which I'm hoping will give me the line from the first file and the number of times that species was found in the second file.
I know there's probably a better way of doing THIS particular part of the analysis, but I wanted to give a simple example of the objective.
Anyway, I don't get any matches, i.e. I never see an instance where
l[0:6] == line[0:6]
is True.
Any ideas?? :-S

The root cause is that you consume file2 on the first iteration of the outer loop; after that, the inner loop always iterates over nothing.
Quick fix: read file2 fully and put it in a list. However, this is rather inefficient in terms of speed (O(N^2): a double loop). It could be better to create a dictionary whose key is the tuple of the first 6 values.
with open('file2', 'r') as f:
    file2 = list(f)

with open('file1', 'r') as file1:
    for line in file1:
        count = 0
        line = line.split("\t")
        for l in file2:
            l = l.split("\t")
            if l[0:6] == line[0:6]:
                count += 1
        # join the fields back together before printing (line is now a list)
        print('\t'.join(line).rstrip('\n') + '\t' + str(count))
Also, using the csv module configured with TAB as the separator would spare you some surprises in the future.
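A minimal sketch of that suggestion (io.StringIO stands in for an open file handle here; in practice you would pass the handle from open()):

```python
import csv
import io

def read_tsv(fileobj):
    """Parse tab-separated rows; the csv module handles quoting and
    other edge cases that a bare line.split("\t") would not."""
    return list(csv.reader(fileobj, delimiter='\t'))

rows = read_tsv(io.StringIO("a\tb\tc\nd\te\tf\n"))
# rows is now [['a', 'b', 'c'], ['d', 'e', 'f']]
```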
Better version, using a dictionary for faster access to the data of file2. The first 6 elements form the key; note that we cannot use a list as a key, since lists are mutable, so we convert it to a tuple:
d = dict()
# create the dictionary from file2
with open('file2', 'r') as file2:
    for l in file2:
        fields = l.split("\t")
        d[tuple(fields[0:6])] = fields[6:]

# iterate through file1, and use dict lookup on data of file2
# much, much faster if file2 contains a lot of data
with open('file1', 'r') as file1:
    for line in file1:
        count = 0
        line = line.split("\t")
        if tuple(line[0:6]) in d:  # check if in dictionary
            count += 1
            # we could extract the extra data by accessing
            # d[tuple(line[0:6])]
        print('\t'.join(line).rstrip('\n') + '\t' + str(count))
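If the actual number of matching lines in file2 is wanted (the dictionary check only records presence), a collections.Counter over the same 6-field tuples works too. A sketch, with the file handling factored out so it accepts any iterable of lines; the tab-separated format is assumed, as above:

```python
from collections import Counter

def count_matches(file1_lines, file2_lines):
    """For each line of file1, count how many lines of file2 share
    the same first six tab-separated fields."""
    counts = Counter(tuple(l.rstrip('\n').split('\t')[:6]) for l in file2_lines)
    return [(line.rstrip('\n'), counts[tuple(line.rstrip('\n').split('\t')[:6])])
            for line in file1_lines]
```

In practice, file1_lines and file2_lines would be open file handles.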

Related

replace lines in a larger file with ID

Hello everyone, I have a problem replacing lines that have the same content with the same ID, e.g.:
ONE -----------> 1
TWO -----------> 2
THREE-----------> 3
HELLO-----------> 4
SEVEN-----------> 5
ONE-----------> 1
ONE-----------> 1
ONE-----------> 1
TWO-----------> 2
I have worked on the code below, but with no results:
NOTE: filein and filein2 contain the same values, as in the example shown below.
# opening the file in read mode
file = open("filein.txt", "r")
# opening the file in read and write mode
file2 = open("filein2.txt", "r+")
replacement = ""
count = 1
# using the for loop
for line in file:
    for line2 in file2:
        line = line.strip()
        if line == line2:
            changes = line.replace(line, str(count))
            replacement = replacement + changes + "\n"
            file2.seek(0)
            file2.write(replacement)
            count = count + 1
file.close()
file.close()
filein and filein2 contain the same values:
ONE
TWO
THREE
HELLO
SEVEN
ONE
ONE
ONE
TWO
To my understanding, this is what you want: compare two files line by line; if the corresponding lines are equal, assign an ID to them; if a line repeats somewhere else in the file, assign the same ID as before; if a line has not occurred before, assign a new ID. If the lines are different, keep both of their contents. In the end, write either the ID or the line content to a new file:
index_dct = dict()
id_ = 1

with open('text.txt') as f1, open('text1.txt') as f2, open('result.txt', 'w') as result:
    for line1, line2 in zip(f1, f2):
        line1, line2 = line1.strip(), line2.strip()
        if line1 == line2:
            text = index_dct.get(line1)
            if text is None:
                text = index_dct[line1] = id_
                id_ += 1
        else:
            text = f'{line1} {line2}'
        result.write(f'{text}\n')
A quick overview of how this works:
First you have a dictionary to store the value and its corresponding ID so that if a value repeats you can assign the same ID.
Then using a context manager (with) you open three files:
Then iterate over the first two files at the same time using zip and check whether the lines match. If they do, first try to get their corresponding ID based on their value; if there is no such value in the dictionary yet, use the current line value as a key with the current ID as its value, then increase the ID by one.
If the lines don't match then just concatenate them together
Finally write the resultant value to the third file
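The first-seen-ID idea in isolation can be sketched as a small function (the names are hypothetical; len(ids) replaces an explicit counter):

```python
def assign_ids(values):
    """Give each distinct value a stable 1-based ID,
    in order of first appearance."""
    ids = {}
    out = []
    for v in values:
        if v not in ids:
            ids[v] = len(ids) + 1
        out.append(ids[v])
    return out

assign_ids(["ONE", "TWO", "THREE", "HELLO", "SEVEN", "ONE", "ONE", "ONE", "TWO"])
# → [1, 2, 3, 4, 5, 1, 1, 1, 2]
```

This reproduces the ID assignment from the example in the question.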
If you're trying to make each unique word have a unique ID, you could use a dictionary:
inputText = "ONE TWO THREE HELLO SEVEN ONE ONE ONE TWO"
indexDictionary = {}
count = 1
outList = []
for word in inputText.split(" "):
    if word not in indexDictionary:
        indexDictionary[word] = count
        count += 1
    outList.append(indexDictionary[word])
print(outList)
print(indexDictionary)

Use a file to search another file and print lines matching a pattern to first file

Python noob here. I've been smashing my head trying to do this, tried several Unix tools and I'm convinced that python is the way to go.
I have two files, File1 has headers and numbers like this:
>id1
77
>id2
2
>id3
2
>id4
22
...
Note that the id number is unique, but the number assigned to it may repeat. I have several files like this, all with the same number of headers (~500).
File2 has all the numbers of File1, each followed by a sequence:
1
ATCGTCATA
2
ATCGTCGTA
...
22
CCCGTCGTA
...
77
ATCGTCATA
...
Note that each sequence id is unique, as are all the sequences that follow them. I have the same number of files as for File1, but the number of sequences within each File2 may vary (~150).
My desired output is File1 with the sequences from File2; it is important that File1 maintains its original order:
>id1
ATCGTCATA
>id2
ATCGTCGTA
>id3
ATCGTCGTA
>id4
CCCGTCGTA
My approach is to extract numbers from File1 and use them as a pattern to match in File2. First I am trying to make this work with only a pair of files. Here is what I achieved:
#!/usr/bin/env python
import re

datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'

with open(datafile, 'r') as f:
    datafile_lines = set([line.strip() for line in f])  # maybe I could use regex to get only lines with number as pattern?

print(datafile_lines)

outputlist = []
with open(schemaseqs, 'r') as f:
    for line in f:
        seqs = line.split(',')[0]
        if seqs[1:-1] in datafile_lines:
            outputlist.append(line)

print(outputlist)
This outputs a mix of patterns from File1 and the sequences from File2. Any help is appreciated.
PS: I am open to modifications to the file structure; I tried substituting "\n" in File2 with "," to no avail.
datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'

d = {}
prev = None
with open(datafile, 'r') as f:
    i = 0
    for line in f:
        if i % 2 == 0:
            d[line.strip()] = 0
            prev = line.strip()
        else:
            d[prev] = line.strip()
        i += 1

new_d = {}
with open(schemaseqs, 'r') as f:
    i = 0
    prev = None
    for line in f:
        if i % 2 == 0:
            new_d[line.strip()] = 0
            prev = line.strip()
        else:
            new_d[prev] = line.strip()
        i += 1

for key, value in d.items():
    if value in new_d:
        d[key] = new_d[value]
print(d)

with open(datafile, 'w') as filee:
    for k, v in d.items():
        filee.write(k)
        filee.write('\n')
        filee.write(v)
        filee.write('\n')
Creating two dictionaries makes this easy; then map the values of one dictionary onto the other.
Since the files are so neatly organized, I wouldn't use a set to store the lines. Sets don't enforce order, and the order of these lines conveys a lot of information. I also wouldn't use Regex; it's probably overkill for the task of parsing individual lines, but not powerful enough to keep track of which ID corresponds to each gene sequence.
Instead, I would read the files in the opposite order. First, read the file with the gene sequences and build a mapping of IDs to genes. Then read in the first file and replace each id with the corresponding value in that mapping.
If the IDs are a continuous sequence (1, 2, 3... n, n+1), then a list is probably the easiest way to store them. If the file is already in order, you don't even have to pay attention to the ID numbers; you can just skip every other row and append each gene sequence to an array in order. If they aren't continuous, you can use a dictionary with the IDs as keys. I'll use the dictionary approach for this example:
id_to_gene_map = {}

with open(file2, 'r') as id_to_gene_file:
    for line_number, line in enumerate(id_to_gene_file, start=1):
        if line_number % 2 == 1:  # Update ID on odd numbered lines, including line 1
            current_id = line
        else:
            id_to_gene_map[current_id] = line  # Map previous line's ID to this line's value

with open(file1, 'r') as input_file, open('output.txt', 'w') as output_file:
    for line in input_file:
        if not line.startswith(">"):  # Keep ">id1" lines unchanged
            line = id_to_gene_map[line]  # Otherwise, replace with the corresponding gene
        output_file.write(line)
In this case, the IDs and values both have trailing newlines. You can strip them out, but since you'll want to add them back in for writing the output file, it's probably easiest to leave them alone.
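The list-based variant mentioned earlier (for the continuous-ID case, IDs 1..n in an already-sorted file) can be sketched like this; the helper name is hypothetical, and file2_lines is any iterable of lines:

```python
def genes_in_order(file2_lines):
    """Assuming the file alternates ID line / sequence line and the IDs
    run 1..n in order, keep every second line: the element at index 0
    is the gene for ID 1, index 1 is the gene for ID 2, and so on."""
    return [line.strip() for i, line in enumerate(file2_lines) if i % 2 == 1]
```

Then the gene for ID k is simply genes[k - 1], with no dictionary needed.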

Joining DNA sequences from two files under the same species name

I have two FASTA files with DNA sequences coding for two different proteins. I want to join the sequences for the different proteins of the same species into one long sequence.
for example, I have:
Protein 1
>sce
AGTAGATGACAGCT
>act
GCTAGCTAGCT
Protein 2
>sce
GCTACGATCGACT
>act
TACGATCAGCTA
Protein 1+2
>sce
AGTAGATGACAGCTGCTACGATCGACT
>act
GCTAGCTAGCTTACGATCAGCTA
Something that might be a bit of an issue is that the species don't appear in the same order in both files, and there are a few sequences that are found in one but not in the other (the files are about 110 species long, with a discrepancy of 4 or 5).
My first attempt at writing a code for it was:
gamma = open('gamma.fas', 'w')
spc = open("spc98.fas", 'w')
outfile = open("joined.fas", 'w')

for line in gamma:
    if line.startswith(">"):
        for line2 in spc:
            if line2.startswith(">"):
                if line == line2:
                    outfile.write(line)
            else:
                outfile.write(line)
fh.close()
but since the DNA sequences are very long and take many lines of the file, I don't know how to select them.
Please help!
Since you tagged Biopython, here is a compact solution. Note it puts the whole file into memory (as most simple approaches will):
from Bio.Seq import Seq
from Bio import SeqIO
d = SeqIO.to_dict(SeqIO.parse('1.fasta', 'fasta'))
for r in SeqIO.parse('2.fasta', 'fasta'):
    d[r.id] = d.setdefault(r.id, Seq('')) + r.seq
SeqIO.write(d.values(), 'output.fasta', 'fasta')
Here 1.fasta and 2.fasta are your two input fasta files, and output.fasta is your merged output file.
Also, note that biologically I think this is an odd thing to do: concatenating sequences across multiple files could lead to the creation of 'fake' contiguous sequences, and the order of concatenation is surely important, so be careful.
By using a dictionary, you could append fasta sequences to each ID. And then, print them to the output file.
outfile = open("joined.fas", 'w')

d = dict()
for file in ('gamma.fas', 'spc98.fas'):
    with open(file, 'r') as f:
        for line in f:
            line = line.rstrip()
            if line.startswith('>'):
                key = line
            else:
                d.setdefault(key, '')
                d[key] += line

for key, seq in d.items():
    outfile.write(key + "\n" + seq + "\n")
outfile.close()
EDIT: By the way, you are opening your two input files for writing, which will clobber them:
gamma = open('gamma.fas', 'w')
spc = open("spc98.fas", 'w')
They should be opened with r instead of w.
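The same dictionary idea, packaged as a self-contained function for illustration (a sketch: headers are assumed to start with '>', and multi-line sequences are concatenated as they are read):

```python
def merge_fastas(*fastas):
    """Concatenate sequences that share a header, in the order headers
    first appear; each argument is an iterable of FASTA lines."""
    d = {}
    key = None
    for lines in fastas:
        for line in lines:
            line = line.rstrip()
            if line.startswith('>'):
                key = line
                d.setdefault(key, '')
            else:
                d[key] += line
    return d

merge_fastas([">sce", "AGT"], [">sce", "TTT", ">act", "GCT"])
# → {'>sce': 'AGTTTT', '>act': 'GCT'}
```

Species present in only one file simply keep their single sequence, which matches the discrepancy described in the question.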

Read lines in one file and find all strings starting with 4-letter strings listed in another txt file

I have 2 txt files (file_a and file_b).
file_a.txt contains a long list of 4-letter combinations (one combination per line):
aaaa
bcsg
aacd
gdee
aadw
hwer
etc.
file_b.txt contains a list of letter combinations of various length (some with spaces):
aaaibjkes
aaleoslk
abaaaalkjel
bcsgiweyoieotpwe
csseiolskj
gaelsi asdas
aaaloiersaaageehikjaaa
hwesdaaadf wiibhuehu
bcspwiopiejowih
gdeaes
aaailoiuwegoiglkjaaake
etc.
I am looking for a python script that would allow me to do the following:
read file_a.txt line by line
take each 4-letter combination (e.g. aaai)
read file_b.txt and find all the various-length letter combinations starting with the 4-letter combination (eg. aaaibjkes, aaailoiersaaageehikjaaa, aaailoiuwegoiglkjaaaike etc.)
print the results of each search in a separate txt file named with the 4-letter combination.
File aaai.txt:
aaaibjkes
aaailoiersaaageehikjaaa
aaailoiuwegoiglkjaaake
etc.
File bcsi.txt:
bcspwiopiejowih
bcsiweyoieotpwe
etc.
I'm sorry I'm a newbie. Can someone point me in the right direction, please. So far I've got only:
#I presume I will have to use regex at some point
import re
file1 = open('file_a.txt', 'r').readlines()
file2 = open('file_b.txt', 'r').readlines()
#Should I look into findall()?
I hope this helps:
file1 = open('file_a.txt', 'r')
file2 = open('file_b.txt', 'r')
# get every item in your second file into a list
mylist = file2.readlines()
# read each line in the first file
for line in file1:
    searchStr = line.strip()
    # find the lines in your second file that start with this string
    exists = [s for s in mylist if s.startswith(searchStr)]
    if exists:
        # if matches exist in your second file, then create a file for them
        fileNew = open(searchStr + '.txt', 'w')
        for match in exists:
            fileNew.write(match)
        fileNew.close()
file1.close()
file2.close()
What you can do is open both files and run down both of them line by line using for loops.
You can have two for loops: the first reads file_a.txt, as you will be reading through it only once; the second reads through file_b.txt and looks for the string at the start of each line.
To do so, you can use .find() to search for the string. Since the match must be at the start, the returned value should be 0.
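For example (str.startswith expresses the same prefix test a bit more directly):

```python
# .find() returns the index of the first occurrence, so a prefix match is 0
assert "bcsgiweyoieotpwe".find("bcsg") == 0
assert "abcsg".find("bcsg") == 1         # found, but not at the start
assert "csseiolskj".find("bcsg") == -1   # not found at all
# the equivalent prefix test:
assert "bcsgiweyoieotpwe".startswith("bcsg")
```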
file_a = open("file_a.txt", "r")
file_b = open("file_b.txt", "r")

for a_line in file_a:
    # This result value will be written into your new file
    result = ""
    # This is what we will search with
    search_val = a_line.strip("\n")
    print("---- Using " + search_val + " from file_a to search. ----")
    for b_line in file_b:
        print("Searching file_b using " + b_line.strip("\n"))
        if b_line.strip("\n").find(search_val) == 0:
            result += b_line
    print("---- Search ended ----")
    # Set the read pointer to the start of the file again
    file_b.seek(0, 0)
    if result:
        # Write the contents of "result" into a file with the name of "search_val"
        with open(search_val + ".txt", "a") as f:
            f.write(result)

file_a.close()
file_b.close()
Test Cases:
I am using the test cases in your question:
file_a.txt
aaaa
bcsg
aacd
gdee
aadw
hwer
file_b.txt
aaaibjkes
aaleoslk
abaaaalkjel
bcsgiweyoieotpwe
csseiolskj
gaelsi asdas
aaaloiersaaageehikjaaa
hwesdaaadf wiibhuehu
bcspwiopiejowih
gdeaes
aaailoiuwegoiglkjaaake
The program produces an output file bcsg.txt as it is supposed to with bcsgiweyoieotpwe inside.
Try this:
f1 = open("a.txt", "r").readlines()
f2 = open("b.txt", "r").readlines()
file1 = [word.replace("\n", "") for word in f1]
file2 = [word.replace("\n", "") for word in f2]

data = []
data_dict = {}
for short_word in file1:
    data += [[short_word, w] for w in file2 if w.startswith(short_word)]

for single_data in data:
    if single_data[0] in data_dict:
        data_dict[single_data[0]].append(single_data[1])
    else:
        data_dict[single_data[0]] = [single_data[1]]

for key, val in data_dict.items():
    open(key + ".txt", "w").writelines("\n".join(val))
    print(key + ".txt created")

Compare files line by line to see if they are the same, if so output them

How would I go about this? I have files in which I have sorted the information; I want to compare a certain index in one file with an index in another. One problem is that the files are enormously large, millions of lines. I want to compare the files line by line; if they match, I want to output both those values, along with other values, using an index method.
=======================
Let me clarify: I want to take, say, line[x], where x will remain the same as the file is formatted uniformly. I want to run line[x] against line[y] in another file, do this for the whole file, and output every matching pair to another file. In that other file I also want to be able to include other pieces from the first file, which would be like just adding more indexes, such as line[a], line[b], line[c], line[d], and finally line[y] as the match to that information.
Try 3:
I have a file with information in this format:
#x is a line
x= data,data,data,data,data,data
there is millions of lines of that.
I have another file, same format:
# x is a line
x= data,data,data,data
I want to use x[#] from first file and x[#] from second file, I want to see if those two values match, if they do I want to output those, along with several other x[#] values from the second file, which are on the same line.
Did that help at all to understand?
The files are in the format I said (but there are millions of lines, and I want to find the pairs in the two files, because they all should match up):
line 1 data,data,data,data
line 2 data,data,data,data
data from file 1:
(N'068D556A1A665123A6DD2073A36C1CAF', N'A76EEAF6D310D4FD2F0BD610FAC02C04DFE6EB67',
N'D7C970DFE09687F1732C568AE1CFF9235B2CBB3673EA98DAA8E4507CC8B9A881');
data from file 2:
00000040f2213a27ff74019b8bf3cfd1|index.docbook|Redhat 7.3 (32bit)|Linux
00000040f69413a27ff7401b8bf3cfd1|index.docbook|Redhat 8.0 (32bit)|Linux
00000965b3f00c92a18b2b31e75d702c|Localizable.strings|Mac OS X 10.4|OSX
0000162d57845b6512e87db4473c58ea|SYSTEM|Windows 7 Home Premium (32bit)|Windows
000011b20f3cefd491dbc4eff949cf45|totem.devhelp|Linux Ubuntu Desktop 9.10 (32bit)|Linux
The order it is sorted in is alphanumeric, and I want to use a slider method. By that I mean: if file1[x] < file2[x], move the slider down or up depending on which value is greater, until a match is found; when and if so, print the output along with other values that will identify that hash.
What I want as a result would be:
file1[x] and its corresponding match on file2[x] outputted to a file, as well as other file1[x] where x can be any index from the line.
Using this method and comparing line by line, you don't have to store the files in memory, which matters as the files are huge in size.
with open('file1.txt') as f1, open('file2.txt') as f2, open('file3.txt', 'w') as f3:
    for x, y in zip(f1, f2):
        if x == y:
            f3.write(x)
What I got from the clarification:
file1 and file2 are in the same format, where each line looks like
{32 char hex key}|{text1}|{text2}|{text3}
the files are sorted in ascending order by key
for each key that appears in both file1 and file2, you want merged output, so each line looks like
{32 char hex key}|{text11}|{text12}|{text13}|{text21}|{text22}|{text23}
You basically want the collisions from a merge sort:
import csv

def getnext(csvreader, key=lambda row: int(row[0], 16)):
    row = next(csvreader)
    return key(row), row

with open('file1.dat', newline='') as inf1, open('file2.dat', newline='') as inf2, \
        open('merged.dat', 'w', newline='') as outf:
    a = csv.reader(inf1, delimiter='|')
    b = csv.reader(inf2, delimiter='|')
    res = csv.writer(outf, delimiter='|')
    a_key, b_key = -1, 0
    try:
        while True:
            while a_key < b_key:
                a_key, a_row = getnext(a)
            while b_key < a_key:
                b_key, b_row = getnext(b)
            if a_key == b_key:
                res.writerow(a_row + b_row[1:])
                a_key, a_row = getnext(a)  # advance, or we would loop on this match forever
    except StopIteration:
        # reached the end of an input file
        pass
I still have no idea what you are trying to communicate by 'as well as other file1[x] where x can be any index from the line'.
Comparing the contents of two files at a specified index:
index = 0  # a byte offset into the files, not a line number
fp1 = open("file1.txt", "r")
fp2 = open("file2.txt", "r")
fp1.seek(index)
fp2.seek(index)
line1 = fp1.readline()
line2 = fp2.readline()
if line1 == line2:
    print(line1)
fp1.close()
fp2.close()
Comparing two files line by line to see if they match, printing the lines that don't:
fp1 = open("file1.txt", "r")
fp2 = open("file2.txt", "r")
line1, line2 = fp1.readline(), fp2.readline()
while line1 and line2:
    if line1 != line2:
        print("Mismatch.\n1: %s\n2: %s" % (line1, line2))
    line1, line2 = fp1.readline(), fp2.readline()  # advance both files
fp1.close()
fp2.close()
