I have files (fasta files with a sequence) that look like this:
File1.fasta
>1
GTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTT
CTGATACCTCCAGCAACCCTCACAGGCCACCTTCGCAGGCTTACAGAACGCTCCCCTACC
CAACAACGCATAAACGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTT
CCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTTAAATGATGGCTGCTTCTA
AGCCAACATCCTGGCTGTCTGG
>2
AAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAA
ACCATGCACCGAAGCTGCGGCAGCGACACTCAGGTGTTGTTGGGTAGGGGAGCGTTCTGT
AAGCCTGTGAAGGTGGCCTGTGAGGGTTGCTGGAGGTATCAGAAGTGCGAATGCTGACAT
AAGTAACGATAAAGCGGGTGAAAAGCCCGCTCGCCGGAAGACCAAGGGTTCCTGTCCAAC
GTTAATCGGGGCAGG
File2.fasta
>1
CAACAACGCATAAACGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTT
>2
CCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTTAAATGATGGCTGCTTCTA
With my script, I count all the 5-mers in these files. My code is as follows:
import operator
import glob

def printSeq(name, seq):
    kmers = {}
    k = 5
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i+k]
        if kmer in kmers:
            kmers[kmer] += 1
        else:
            kmers[kmer] = 1
    for kmer, count in kmers.items():
        print(kmer + "\t" + str(count))
    sortedKmer = sorted(kmers.items(), reverse=True)
    for item in sortedKmer:
        print(item[0] + "\t" + str(item[1]))

for name in glob.glob('*.fasta'):
    with open(name, 'r') as f:
        seq = ""
        key = ""
        for line in f.readlines():
            if line.startswith(">"):
                if key and seq:
                    printSeq(key, seq)
                key = line[1:].strip()
                seq = ""
            else:
                seq += line.strip()
        printSeq(key, seq)
The output is now the 5-mer followed by the count.
I want to adjust my output so that each output line shows the filename, then the header, then the 5-mer and its count, like this:
File1 1 GTCTT 1
File1 1 TCTTC 1
File1 1 CTTCC 1
....
File2 2 TTCTA 1
How can I achieve that?
Additional question
I want to add the reverse complement sequence of the data and count that together with the previous data. My code to get the reverse complement is as follows:
from Bio import SeqIO

for fasta_file in glob.glob('*.fasta'):
    for record in SeqIO.parse(fasta_file, "fasta"):
        reverse_complement = ">" + record.id + "\n" + record.seq.reverse_complement()
So the reverse complement of File1, header >1, has to be counted together with the forward sequence of that record, and so on. How can I include these data with my previous files and count them together? (A possible sketch follows the reverse-complement data below.)
My reverse_complement data is
File1.fasta (reverse_complement)
>1
CCAGACAGCCAGGATGTTGGCTTAGAAGCAGCCATCATTTAAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAAACCATGCACCGAAGCTGCGGCAGCGACGTTTATGCGTTGTTGGGTAGGGGAGCGTTCTGTAAGCCTGCGAAGGTGGCCTGTGAGGGTTGCTGGAGGTATCAGAAGTGCGAATGCTGACATAAGTAACGATAAAGCGGGTGAAAAGCCCGCTCGCCGGAAGAC
>2
CCTGCCCCGATTAACGTTGGACAGGAACCCTTGGTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTTCTGATACCTCCAGCAACCCTCACAGGCCACCTTCACAGGCTTACAGAACGCTCCCCTACCCAACAACACCTGAGTGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTTCCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTT
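One possible way to do this (a sketch, not the original code): count the 5-mers of the forward sequence and of its reverse complement into one Counter per record, using Biopython for the reverse complement. The filename/header/k-mer/count output layout asked for above is assumed.
import os
import glob
from collections import Counter
from Bio import SeqIO

K = 5

for fasta_file in glob.glob('*.fasta'):
    basename = os.path.splitext(os.path.basename(fasta_file))[0]
    for record in SeqIO.parse(fasta_file, "fasta"):
        counts = Counter()
        for seq in (str(record.seq), str(record.seq.reverse_complement())):
            # count the 5-mers of the forward strand and the reverse complement into one Counter
            counts.update(seq[i:i+K] for i in range(len(seq) - K + 1))
        for kmer, count in sorted(counts.items()):
            print(basename + "\t" + record.id + "\t" + kmer + "\t" + str(count))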
This could also be done using a Counter() as follows:
import os
import glob
from collections import Counter
from itertools import groupby

for fasta_file in glob.glob('*.fasta'):
    basename = os.path.splitext(os.path.basename(fasta_file))[0]
    with open(fasta_file) as f_fasta:
        for k, g in groupby(f_fasta, lambda x: x.startswith('>')):
            if k:
                sequence = next(g).strip('>\n')
            else:
                d = list(''.join(line.strip() for line in g))
                counts = Counter()
                while len(d) >= 5:
                    five_mer = '{}{}{}{}{}'.format(d[0], d[1], d[2], d[3], d[4])
                    counts[five_mer] += 1
                    del d[0]
                for five_mer, count in sorted(counts.items(), key=lambda x: (-x[1], x[0])):
                    print("{} {} {} {}".format(basename, sequence, five_mer, count))
This would give you output with the largest counts first and then alphabetically:
File1 1 CAGGC 3
File1 1 CGCAG 3
File1 1 GCTTT 3
File1 1 AACGC 2
File1 1 ACATC 2
File1 1 ACGCT 2
File1 1 AGGCC 2
It uses Python's groupby() function to read groups of lines together: either a single header line or a run of sequence lines. k is the result of the startswith() call, so when k is False it takes all the lines in the group, strips the newline from each and joins them into one string of characters.
It then reads the first 5 characters from the list, joins them back together and adds them as a key to a Counter(). It then removes the first character from the list and repeats until fewer than 5 characters remain.
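To make the grouping concrete, here is a tiny, purely illustrative example of what groupby() yields for FASTA-style lines (the input values are made up):
from itertools import groupby

lines = [">1\n", "GTCTT\n", "CCGGC\n", ">2\n", "AAAGA\n"]
for is_header, group in groupby(lines, lambda x: x.startswith('>')):
    print(is_header, list(group))
# True  ['>1\n']
# False ['GTCTT\n', 'CCGGC\n']
# True  ['>2\n']
# False ['AAAGA\n']
As a design note, slicing the joined string (seq[i:i+5]) would avoid the repeated del d[0], which shifts the whole list on every step.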
For just alphabetical ordering:
for five_mer, count in sorted(counts.items()):
A Counter() works the same way as a dictionary, so .items() would give a list of key value pairs. These are sorted before being displayed.
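For example (illustrative only):
from collections import Counter

print(sorted(Counter("AACAG").items()))
# [('A', 3), ('C', 1), ('G', 1)]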
You change the signature of
def printSeq(name, seq)
to
def printSeq(file, header, name, seq):
and incorporate the new variables in the print statements,
e.g.
print (item[0] + "\t" + str(item[1]))
becomes
print (file + "\t" + header + "\t" + item[0] + "\t" + str(item[1]))
Then, in your loop you pass the information to this function.
You have the file name available in the loop, stored in the variable name. You parse the header in the lines where you detect it, and store it in a variable for later use; the later use is when you call the printSeq function (a possible sketch follows).
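Put together, the adjusted script could look roughly like this (a sketch of the change described above, not the original poster's code; the unused name parameter is dropped and the filename is shortened with os.path.splitext to match the desired output):
import os
import glob

def printSeq(fname, header, seq):
    kmers = {}
    k = 5
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i+k]
        kmers[kmer] = kmers.get(kmer, 0) + 1
    for kmer, count in sorted(kmers.items()):
        # filename (without extension), header, k-mer, count
        print(fname + "\t" + header + "\t" + kmer + "\t" + str(count))

for name in glob.glob('*.fasta'):
    base = os.path.splitext(name)[0]
    with open(name) as f:
        seq = ""
        key = ""
        for line in f:
            if line.startswith(">"):
                if key and seq:
                    printSeq(base, key, seq)
                key = line[1:].strip()
                seq = ""
            else:
                seq += line.strip()
        if key and seq:
            printSeq(base, key, seq)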
Related
I am trying to use glob.glob to loop through my text files while, at the same time, looping through a list of words, so that I can concatenate a different word to each text file.
The 4 text files and contents
File01.txt | File02.txt | File03.txt | File04.txt
bird | cat | dog | fish
List of four words: ['ONE','TWO','THREE','FOUR']
Here is my code:
import glob
import os

for name in glob.glob('file*'):
    print(name)
    num = 4
    lst = ['ONE','TWO','THREE','FOUR']
    cnt = 0
    while num > 0:
        lst_loop = lst[cnt]
        print(lst_loop)
        cnt += 1
        num -= 1
    file = open('header', 'w')
    file.write(lst[0])
    file.close()
    # message = "cat header " + name + " > " lst[cnt] + name
    # command = os.popen(message)
    # print(command.read())
    # print(command.close())
I am aware that glob does not return a sorted list, but as it stands it returns the text files in the order of
file03.txt
file01.txt
file02.txt
file04.txt
So with this in mind the resulting files and file names should look like so:
TWOFile01.txt | THREEFile02.txt | ONEFile03.txt | FOURFile04.txt
TWO | THREE | ONE | FOUR
bird | cat | dog | fish
files = ["File01.txt", "File02.txt", "File03.txt", "File04.txt"]
words = ["ONE", "TWO", "THREE", "FOUR"]

for i in range(len(files)):
    with open(files[i], "a") as f:
        f.write(words[i])
import glob
import os

## list has newline character to allow for correct concatenation
lst = ['ONE\n','TWO\n','THREE\n','FOUR\n']
cnt = 0
for name in glob.glob('file*'):
    lst_loop = lst[cnt]
    print(lst_loop)
    ## as I loop I write the contents to a text file called 'header'
    file = open('header', 'w')
    file.write(lst_loop)
    file.close()
    ## I now strip the newline character from the list entry for naming
    strip = lst_loop.strip("\n")
    ## and concatenate the contents of the header to the current file
    message = "cat header " + name + " > " + strip + name
    command = os.popen(message)
    print(command.read())
    print(command.close())
    ## counter to loop through list of words
    cnt += 1
Something like this should work
from glob import iglob

words = ["ONE", "TWO", "THREE", "FOUR"]

for word, f in zip(words, sorted(iglob("file*.txt"))):
    with open(f) as fp, open(word + f.capitalize(), "w+") as new:
        new.write("\n".join([word, fp.read()]))
NOTE: zip will not throw an error if there is a length mismatch between words and the files found with the glob pattern. It will use the shortest iterable and then stop.
sorted(iglob("file*.txt")) should get the order correct for you.
open(word + f.capitalize(), "w+") creates a new file, or overwrites it if it exists.
The with statement is used to close the files even if an error occurs at runtime.
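To illustrate the zip behaviour mentioned in the note above (purely illustrative values):
words = ["ONE", "TWO"]
files = ["file01.txt", "file02.txt", "file03.txt"]
print(list(zip(words, files)))
# [('ONE', 'file01.txt'), ('TWO', 'file02.txt')] -- the unmatched file is silently skipped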
import glob

fls = glob.glob("*.txt")  # match all text files
# sorting the list given by glob in ascending order
fls = sorted(fls)  # ['File01.txt', 'File02.txt', 'File03.txt', 'File04.txt']
lst = ['ONE','TWO','THREE','FOUR']
c = 0
# counter variable c will help us loop through the list 'lst'
for fl in fls:
    with open(fl, 'r+') as f:      # open files in read/write mode
        k = f.read().splitlines()  # each line into a list
        s = lst[c] + "\n" + k[0]   # using counter c; k is a list of lines
        f.seek(0)                  # move file pointer to the start of the file
        f.truncate()               # remove everything from the start
        f.write(s)                 # write the new string 's'
    c = c + 1  # move the counter to the next index in list 'lst'
I have a file that looks like this:
<s0> 3
line1
line2
line3
<s1> 5
line1
line2
<s2> 4
etc. up to more than a thousand
Each sequence has a header like <s0> 3, which in this case states that three lines follow. In the example above, the number of lines below <s1> is two, so I have to correct the header to <s1> 2.
The code I have below picks out the sequence headers and the correct number of lines below them. But for some reason, it never gets the details of the last sequence. I know something is wrong but I don't know what. Can someone point me to what I am doing wrong?
import re

def call():
    with open('trial_perl.txt') as fp:
        docHeader = open("C:\path\header.txt","w")
        c = 0
        c1 = 0
        header = []
        k = -1
        for line in fp:
            if line.startswith("<s"):
                #header = line.split(" ")
                #print header[1]
                c = 0
            else:
                c1 = c + 1
                c += 1
            if c == 0 and c1 > 0:
                k += 1
                printing = c1
                if printing >= 0:
                    s = "<s%s>" % (k)
                    #print "%s %d" % (s, printing)
                    docHeader.write(s+" "+str(printing)+"\n")

call()
You have no sentinel at the end of the last sequence in your data, so your code will need to deal with the last sequence AFTER the loop is done.
If I may suggest some Python tricks to get to your results: you don't need those c/c1/k counter variables, as they make the code more difficult to read and maintain. Instead, populate a map of sequence header to sequence items and then use the map to do all your work:
(this code works only if all sequence headers are unique - if you have duplicates, it won't work)
with open('trial_perl.txt') as fp:
    docHeader = open("C:\path\header.txt","w")
    data = {}
    for line in fp:
        if line.startswith("<s"):
            current_sequence = line
            # create a list with the header as the key
            data[current_sequence] = []
        else:
            # add each sequence line to the list we defined above
            data[current_sequence].append(line)
Your map is ready! It looks like this:
{"<s0> 3": ["line1", "line2", "line3"],
"<s1> 5": ["line1", "line2"]}
You can iterate it like this:
for header, lines in data.items():
    # header is the key, or "<s0> 3"
    # lines is the list of lines under that header ["line1", "line2", etc]
    num_of_lines = len(lines)
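To actually emit the corrected headers from this map, a possible finishing step could look like the sketch below (not part of the original answer; it assumes the headers are unique, keeps only the <sN> part of each stored key, and relies on the dict preserving insertion order, as it does on Python 3.7+):
with open("header.txt", "w") as out:          # output path is a placeholder
    for header, lines in data.items():
        tag = header.split()[0]               # "<s0>" from the stored key "<s0> 3\n"
        out.write("%s %d\n" % (tag, len(lines)))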
The main problem is that you neglect to check the value of c after you have read the last line. You probably had difficulty spotting this problem because of all the superfluous code. You don't have to increment k, since you can extract the value from the <s...> tag. And you don't have to have all three variables c, c1, and printing. A single count variable will do.
import re, sys

def call():
    with open('trial_perl.txt') as fp:
        docHeader = sys.stdout  # open("C:\path\header.txt","w")
        count = 0
        id = None
        for line in fp:
            if line.startswith("<s"):
                if id != None:
                    tag = '<s%s>' % id
                    docHeader.write('<s%d> %d\n' % (id, count))
                count = 0
                id = int(line[2:line.find('>')])
            else:
                count += 1
        if id != None:
            tag = '<s%s>' % id
            docHeader.write('<s%d> %d\n' % (id, count))

call()
Another approach uses groupby from itertools: take the maximum line count in each group, where a group corresponds to one header plus the lines that follow it in your file:
from itertools import groupby

def call():
    with open('stack.txt') as fp:
        header = [-1]
        lines = [0]
        for line in fp:
            if line.startswith("<s"):
                header.append(header[-1]+1)
                lines.append(0)
            else:
                header.append(header[-1])
                lines.append(lines[-1] + 1)
        with open('result','w') as f:
            for key, group in groupby(zip(header[1:], lines[1:]), lambda x: x[0]):
                f.write(str(("<s%d> %d\n" % max(group))))
            f.close()

call()
#<s0> 3
#<s1> 2
stack.txt is the file containing your data:
<s0> 3
line1
line2
line3
<s1> 5
line1
line2
This program is to take the grammar rules found in Binary.txt and store them into a dictionary, where the rules are:
N = N D
N = D
D = 0
D = 1
but the current code returns D: D = 1, N:N = D, whereas I want N: N D, N: D, D:0, D:1
import sys
import string

# default length of 3
stringLength = 3

# get last argument of command line (file)
filename1 = sys.argv[-1]

# get a length from the user
try:
    stringLength = int(input('Length? '))
    filename = input('Filename: ')
except ValueError:
    print("Not a number")

# checks
print(stringLength)
print(filename)

def str2dict(filename="Binary.txt"):
    result = {}
    with open(filename, "r") as grammar:
        # read file
        lines = grammar.readlines()
        count = 0
        # loop through
        for line in lines:
            print(line)
            result[line[0]] = line
        print(result)
    return result

print(str2dict("Binary.txt"))
Firstly, your data structure of choice is wrong. A dictionary in Python is a simple key-to-value mapping. What you'd like is a map from a key to multiple values. For that you'll need:
from collections import defaultdict
result = defaultdict(list)
Next, where are you splitting on '='? You'll need to do that in order to get the proper key/value pairs you are looking for:
key, value = line.split('=', 1)  # returns a list, which gets unpacked into two variables
Putting the above two together, you'd go about in the following way:
# (this is the body of str2dict from the question)
result = defaultdict(list)
with open(filename, "r") as grammar:
    #read file
    lines = grammar.readlines()
    count = 0
    #loop through
    for line in lines:
        print(line)
        key, value = line.split('=', 1)
        result[key.strip()].append(value.strip())
return result
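With the Binary.txt shown in the question, this should produce roughly the following mapping (a rough expectation, assuming the file has no blank lines):
defaultdict(<class 'list'>, {'N': ['N D', 'D'], 'D': ['0', '1']})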
Dictionaries, by definition, cannot have duplicate keys. Therefore there can only ever be a single 'D' key. You could, however, store a list of values at that key if you'd like. Ex:
from collections import defaultdict
# rest of your code...

result = defaultdict(list)  # Use defaultdict so that an insert to an empty key creates a new list automatically
with open(filename, "r") as grammar:
    #read file
    lines = grammar.readlines()
    count = 0
    #loop through
    for line in lines:
        print(line)
        result[line[0]].append(line)
    print(result)
return result
This will result in something like:
{"N" : ["N = N D", "N = D"], "D" : ["D = 0", "D = 1"]}
I have multiple files, each containing 8 or 9 columns.
For a single file: I have to read the last column, which contains some value, count the number of occurrences of each value and then generate an output file.
I have done it like this:
inp = open(filename,'r').read().strip().split('\n')
out = open(filename,'w')
from collections import Counter
C = Counter()
for line in inp:
    k = line.split()[-1]  # as to read last column
    C[k] += 1
for value,count in C.items():
    x = "%s %d" % (value,count)
    out.write(x)
    out.write('\n')
out.close()
Now the problem is that this works fine if I have to generate one output for one input. But I need to scan a directory using the glob.iglob function for all files to be used as input, then perform the above program on each file, gather the results for each file and, of course, write all of the analyzed results into a single OUTPUT file.
NOTE: When generating the single OUTPUT file, if any value turns out to be repeated then, instead of writing the same entry twice, it is preferred to sum up the counts. E.g. analysis of the 1st file generates:
123 6
111 5
0 6
45 5
and the 2nd file generates:
121 9
111 7
0 1
22 2
in this case the OUTPUT file must be written in such a way that it contains:
123 6
111 12 #sum up count no. in case of similar value entry
0 7
45 5
22 2
I have written the program for single-file analysis, but I'm stuck on the mass-analysis part.
Please help.
from collections import Counter
import glob

out = open('OUTPUT.txt','w')   # the combined output file; pick any name here
g_iter = glob.iglob('path_to_dir/*')
C = Counter()
for filename in g_iter:
    f = open(filename,'r')
    inp = f.read().strip().split('\n')
    f.close()
    for line in inp:
        k = line.split()[-1]  # as to read last column
        C[k] += 1
for value,count in C.items():
    x = "%s %d" % (value,count)
    out.write(x)
    out.write('\n')
out.close()
After de-uglification:
from collections import Counter
import glob

def main():
    # create Counter
    cnt = Counter()
    # collect data
    for fname in glob.iglob('path_to_dir/*.dat'):
        with open(fname) as inf:
            cnt.update(line.split()[-1] for line in inf)
    # dump results
    with open("summary.dat", "w") as outf:
        outf.writelines("{:5s} {:>5d}\n".format(val, num) for val, num in cnt.iteritems())

if __name__=="__main__":
    main()
Initialise an empty dictionary at the top of the program,
let's say, dic = dict(),
and for each Counter update dic so that the values of similar keys are summed and the new keys are also added to dic.
To update dic use this:
dic=dict( (n, dic.get(n, 0)+C.get(n, 0)) for n in set(dic)|set(C) )
where C is the current Counter; after all files are finished, write dic to the output file.
import glob
from collections import Counter

dic = dict()
g_iter = glob.iglob(r'c:\\python32\fol\*')
for x in g_iter:
    lis = []
    with open(x) as f:
        inp = f.readlines()
        for line in inp:
            num = line.split()[-1]
            lis.append(num)
    C = Counter(lis)
    dic = dict( (n, dic.get(n, 0)+C.get(n, 0)) for n in set(dic)|set(C) )
for x in dic:
    print(x,'\t',dic[x])
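As an aside (not part of the answer above), a Counter can absorb another Counter or any iterable directly via update(), which sums counts for repeated keys and avoids the set-union comprehension; a minimal sketch under the same directory assumption:
import glob
from collections import Counter

total = Counter()
for path in glob.iglob(r'c:\\python32\fol\*'):
    with open(path) as f:
        # update() adds to the existing counts, so repeated values are summed across files
        total.update(line.split()[-1] for line in f if line.strip())

for value, count in total.items():
    print(value, '\t', count)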
I did it like this:
import glob
from collections import Counter

out = open("write.txt",'a')
C = Counter()
for file in glob.iglob('temp*.txt'):
    for line in open(file,'r').read().strip().split('\n'):
        k = line.split()[-1]  # as to read last column
        C[k] += 1
for value,count in C.items():
    x = "%s %d" % (value,count)
    out.write(x)
    out.write('\n')
out.close()
The following program has been running for about 22 hours on two files (txt, ~10 MB each). Each file has about 100K rows. Can someone give me an indication of how inefficient my code is and perhaps a faster method? The input dicts are ordered and preserving order is necessary:
import collections

def uniq(input):
    output = []
    for x in input:
        if x not in output:
            output.append(x)
    return output

Su = {}
with open('Sucrose_rivacombined.txt') as f:
    for line in f:
        (key, val) = line.split('\t')
        Su[(key)] = val
Su_OD = collections.OrderedDict(Su)
Su_keys = Su_OD.keys()

Et = {}
with open('Ethanol_rivacombined.txt') as g:
    for line in g:
        (key, val) = line.split('\t')
        Et[(key)] = val
Et_OD = collections.OrderedDict(Et)
Et_keys = Et_OD.keys()

merged_keys = Su_keys + Et_keys
merged_keys = uniq(merged_keys)

d3 = collections.OrderedDict()
output_doc = open("compare.txt","w+")

for chr_local in merged_keys:
    line_output = chr_local
    if (Et.has_key(chr_local)):
        line_output = line_output + "\t" + Et[chr_local]
    else:
        line_output = line_output + "\t" + "ND"
    if (Su.has_key(chr_local)):
        line_output = line_output + "\t" + Su[chr_local]
    else:
        line_output = line_output + "\t" + "ND"
    output_doc.write(line_output + "\n")
The input files are as follows: not every key is present in both files
Su:
chr1:3266359 80.64516129
chr1:3409983 100
chr1:3837894 75.70093458
chr1:3967565 100
chr1:3977957 100
Et:
chr1:3266359 95
chr1:3456683 78
chr1:3837894 54.93395855
chr1:3967565 100
chr1:3976722 23
I would like the output to look as follows:
chr1:3266359 80.645 95
chr1:3456683 ND 78
Replace uniq with this as the inputs are hashable:
def uniq(input):
    output = []
    s = set()
    for x in input:
        if x not in s:
            output.append(x)
            s.add(x)
    return output
This will reduce a nearly O(n^2) process to nearly O(n).
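As an aside (not in the original answer), on Python versions where plain dicts preserve insertion order (3.7+), the same order-preserving de-duplication can be written in one line:
def uniq(input):
    # dict.fromkeys keeps the first occurrence of each key, in insertion order
    return list(dict.fromkeys(input))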
You don't need your unique function.
Pseudo code like:
read file 2 as an OrderedDict
process file 1, writing out its items (already ordered correctly)
pop, with a default, from file 2 for the last part of the output line
after file 1 is consumed, process the OrderedDict from file 2
Also, love list comprehensions...you can read the file with:
OrderedDict(line.strip().split('\t') for line in open('Ethanol_rivacombined.txt'))
Only one OrderedDict is built, and 'Sucrose_rivacombined.txt' never even makes it into memory as a whole. It should be super fast.
EDIT complete code (not sure about your output line format)
from collections import OrderedDict

Et_OD = OrderedDict(line.strip().split('\t') for line in open('Ethanol_rivacombined.txt'))

with open("compare.txt","w+") as output_doc:
    for line in open('Sucrose_rivacombined.txt'):
        key, val = line.strip().split('\t')
        line_out = '\t'.join((key, val, Et_OD.pop(key,'ND')))
        output_doc.write(line_out+'\n')
    for key, val in Et_OD.items():
        line_out = '\t'.join((key, 'ND', val))
        output_doc.write(line_out+'\n')
Your output is a list, but your input is dictionaries: their keys are guaranteed unique, but your 'not in output' check has to compare against every element of the list, which is quadratic. (You're doing n^2 comparisons because of that check.)
You could probably completely replace uniq with:
Su_OD.update(Et_OD)
This works for me:
from collections import OrderedDict

one = OrderedDict()
two = OrderedDict()

one['hello'] = 'one'
one['world'] = 'one'

two['world'] = 'two'
two['cats'] = 'two'

one.update(two)

for k, v in one.iteritems():
    print k, v
output:
hello one
world two
cats two