Help writing an awk version of a Python script that generates a count matrix - python

I have not found any similar question to this...
I have this Python script to generate a count matrix from files containing only sequences, but it takes forever to run. I know awk would do a faster job, but I am not very good with awk and am hoping someone can help.
The Python script is as follows:
import sys

numFiles = int(sys.argv[1])
allParams = int(numFiles + 4)
key_file = sys.argv[2]
out_file = sys.argv[3]

#open the output file
outHandle = open(out_file, 'w')

#Open key file and read one line at a time
with open(key_file) as kf:
    for eachline in kf:
        temp_list = [0] * numFiles
        kSeq = eachline.strip(' \t\n\r')
        upRange = int(numFiles + 4)
        for i in range(4, upRange):
            with open(sys.argv[i]) as f:
                for eachline in f:
                    seq = eachline.strip(' \t\n\r')
                    if (kSeq == seq):
                        curr = int(temp_list[i - 4])
                        nw = int(curr + 1)
                        temp_list[i - 4] = nw
                    else:
                        continue
        outHandle.write(str(kSeq) + "\t")
        for ind, item in enumerate(temp_list):
            lastItemIndex = numFiles - 1
            if (ind == lastItemIndex):
                outHandle.write(str(item) + "\n")
            else:
                outHandle.write(str(item) + "\t")
Let me try to create an example:
Input: a key file and X other files (all input files are just words in a single column).
Output: a matrix containing the number of occurrences of each word from the key file in each of the X files.
Keyfile:
word
one
two
three
four
five
file1:
word
three
five
three
one
two
one
four
four
three
file2:
word
four
one
three
three
one
two
three
two
one
OUTPUT:
word    file1   file2
one     2       3
two     1       2
three   3       3
four    2       1
five    1       0
The number of files could be up to 4.
I hope this illustration makes it clearer.
Thank you.

So, after extensive reading and a number of tries, I got what I wanted using the following awk command:
awk 'fname != FILENAME { fname = FILENAME; idx++ }
     idx == 1 { key[$0] = $0 }
     idx == 2 { if ($1 == key[$1]) { f1[$1] += 1 } }
     idx == 3 { if ($1 == key[$1]) { f2[$1] += 1 } }
     END { for (seq in key) print seq "\t" f1[seq] "\t" f2[seq] }' keyFile file1 file2
Thanks all for your input.
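For completeness, the same single-pass idea the awk command uses (read the key file once, then tally each count file into its own table) can also be sketched in Python. This is only an illustration using the example's keyFile/file1/file2 names rather than the original script's argv handling, and unlike the awk command it prints 0 for keys that never occur in a file:

from collections import Counter

count_files = ["file1", "file2"]        # up to four of these, per the question

with open("keyFile") as kf:
    keys = [line.strip() for line in kf if line.strip()]

# One Counter per input file: each file is read exactly once,
# instead of once per key as in the original script.
counts = []
for fname in count_files:
    with open(fname) as f:
        counts.append(Counter(line.strip() for line in f))

with open("out.txt", "w") as out:
    for k in keys:
        out.write(k + "\t" + "\t".join(str(c[k]) for c in counts) + "\n")

Reading each count file exactly once is what makes both this and the awk version fast; the original script re-opened every file once per key.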

Related

Trying to stepwise iterate through 2 files in python

I am trying to merge two LARGE input files together into 1 output, sorting as I go.
## Above I counted the number of lines in each table
print("Processing Table Lines: table 1 has " + str(count1) + " and table 2 has " + str(count2))

newLine, compare, line1, line2 = [], 0, [], []
while count1 + count2 > 0:
    if count1 > 0 and compare <= 0: count1, line1 = count1 - 1, ifh1.readline().rstrip().split('\t')
    else: line1 = []
    if count2 > 0 and compare >= 0: count2, line2 = count2 - 1, ifh2.readline().rstrip().split('\t')
    else: line2 = []
    compare = compareTableLines( line1, line2 )
    newLine = mergeLines( line1, line2, compare, tIndexes )
    ofh.write('\t'.join( newLine + '\n'))
What I expect to happen is that, as lines are written to the output, I pull the next line from whichever file was just read, if one is available. I also expect the loop to stop once both files are exhausted.
However I keep getting this error:
ValueError: Mixing iteration and read methods would lose data
I just don't see how to get around it. Either file is too large to keep in memory so I want to read as I go.
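For context, that ValueError comes from mixing readline() with for-line iteration on the same Python 2 file object, because iteration reads ahead into an internal buffer. A common workaround is to do all reading through the iterator, e.g. with next(). Below is only a sketch of the loop rewritten that way; compareTableLines, mergeLines, the counts and the file handles are the question's own, and the final join is also adjusted so it no longer concatenates a string to a list:

newLine, compare, line1, line2 = [], 0, [], []
while count1 + count2 > 0:
    if count1 > 0 and compare <= 0:
        # next() pulls a line through the iterator, so it can be mixed
        # with any other iteration over the same file elsewhere
        count1, line1 = count1 - 1, next(ifh1).rstrip().split('\t')
    else:
        line1 = []
    if count2 > 0 and compare >= 0:
        count2, line2 = count2 - 1, next(ifh2).rstrip().split('\t')
    else:
        line2 = []
    compare = compareTableLines(line1, line2)
    newLine = mergeLines(line1, line2, compare, tIndexes)
    ofh.write('\t'.join(newLine) + '\n')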
Here's an example of merging two ordered files, CSV files in this case, using heapq.merge() and itertools.groupby(). Given 2 CSV files:
x.csv:
key1,99
key2,100
key4,234
y.csv:
key1,345
key2,4
key3,45
Running:
import csv, heapq, itertools

keyfun = lambda row: row[0]

with open("x.csv") as inf1, open("y.csv") as inf2, open("z.csv", "w") as outf:
    in1, in2, out = csv.reader(inf1), csv.reader(inf2), csv.writer(outf)
    for key, rows in itertools.groupby(heapq.merge(in1, in2, key=keyfun), keyfun):
        out.writerow([key, sum(int(r[1]) for r in rows)])
we get:
z.csv:
key1,444
key2,104
key3,45
key4,234
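The same pattern should adapt to the question's tab-separated tables by giving the readers and the writer delimiter='\t' (note that heapq.merge only accepts a key argument on Python 3.5+). The key column and the way matching rows are combined below are assumptions for illustration, not the question's mergeLines() logic:

import csv, heapq, itertools

keyfun = lambda row: row[0]

with open("table1.tsv") as inf1, open("table2.tsv") as inf2, open("merged.tsv", "w", newline="") as outf:
    in1 = csv.reader(inf1, delimiter="\t")
    in2 = csv.reader(inf2, delimiter="\t")
    out = csv.writer(outf, delimiter="\t")
    for key, rows in itertools.groupby(heapq.merge(in1, in2, key=keyfun), keyfun):
        # here rows sharing a key are simply concatenated after it;
        # a real merge would use something like the question's mergeLines()
        merged = [key]
        for row in rows:
            merged.extend(row[1:])
        out.writerow(merged)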

How to change this recursion into a loop

I want to write all base-26 numbers (with letters of the alphabet as digits) of a certain length into an ASCII-file.
For length = 4 this would look like
aaaa
aaab
aaac
...
zzzx
zzzy
zzzz
I achieved this with the following recursive code:
def fuz(data, ll_str):
    ll_str += 1

    def for_once(data_once, ll_str_once):
        tmp_str = ll_str_once
        tmp_str -= 1
        new_data = []
        for m in data_once:
            for i1 in range(97, 123):
                new_data.append(m + chr(i1))
        if tmp_str != 0:
            return for_once(new_data, tmp_str)
        else:
            return data_once

    return for_once(data, ll_str)

if __name__ == '__main__':
    ll = 4
    test = ['']
    file_output = open("out.txt", 'a')
    out_data = fuz(test, ll)
    for out in out_data:
        file_output.write(out + '\n')
    file_output.close()
However, for any length > 4, this solution runs out of memory on my machine.
Therefore I look for an alternative without recursion - can anybody give me a hint how to do this?
This loop writes all base-26 numbers of length 4 (with letters as digits) into a file named out.txt.
base and lngth can be chosen arbitrarily, but prepare to be patient for larger values...
import itertools as it

base = 26
lngth = 4

with open('out.txt', 'w') as f:
    for t in it.product(range(97, 97 + base), repeat=lngth):
        s = ''.join(map(chr, t))
        f.write(s + chr(13))
At least it doesn't consume too much memory, as requested by the OP.
However, with base 26 a length-5 file was already 70 MB, and I stopped writing a length-6 file at 1.4 GB; at that point Notepad++ was no longer able to open it. So everybody can judge the usefulness of this code for themselves.
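Since the question asks specifically how to replace the recursion with a loop, here is a sketch of a manual counter ("odometer") approach that needs neither recursion nor itertools; base and lngth match the answer above, and writing '\n' line endings is an assumption:

base = 26
lngth = 4

with open('out.txt', 'w') as f:
    digits = [0] * lngth                        # starts at "aaaa"
    while True:
        f.write(''.join(chr(97 + d) for d in digits) + '\n')
        # increment the rightmost digit, carrying leftwards like an odometer
        pos = lngth - 1
        while pos >= 0 and digits[pos] == base - 1:
            digits[pos] = 0
            pos -= 1
        if pos < 0:                             # carried past the leftmost digit: done
            break
        digits[pos] += 1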

Possible to do this more efficiently (turn compact file to sparse)

I have to read in, line by line, a file that holds the indices at which a vector has 1's,
so for example:
1 3 9 10
means:
0,1,0,1,0,0,0,0,0,1,1
My goal is to write a program that takes each line and prints out the full vector, including the 0's.
I am able to do this with my current program for a few lines:
#create a sparse vector
list_line_sparse = [0] * int(num_features)

#loop over all the lines
for item in lines:
    #split the line on spaces
    zz = item.split(' ')
    #get all ints on a line
    d = [int(x.strip()) for x in zz]
    #loop over all ints and change index to 1 in sparse vector
    for i in d:
        list_line_sparse[i] = 1
    out_file += (', '.join(str(item) for item in list_line_sparse))
    #change back to 0's
    for i in d:
        list_line_sparse[i] = 0
    out_file += '\n'

f = open('outfile', 'w')
f.write(out_file)
f.close()
The problem is that for a file with a lot of features and lines, my program is very, very inefficient; it basically never finishes. Is there anything obvious that I should change to make it more efficient (i.e. the two for loops)?
It would probably be more efficient to write each line of data to your output file as it is generated, rather than building up a huge string in memory.
numpy is a popular Python module that's good for doing bulk operations on numbers. If you start with:
import numpy as np
list_line_sparse = np.zeros(num_features, dtype=np.uint8)
Then, given d as the list of numbers on the current line, you can simply do:
list_line_sparse[d] = 1
to set ALL of those indexes in the array at the same time, no loop required. (At the Python level at least, obviously there's still a loop involved, but it's down in the C implementation of numpy).
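Putting those pieces together, a complete version might look like the sketch below; the file names and the num_features value are placeholders rather than values from the question:

import numpy as np

num_features = 20   # placeholder width of the dense vector

with open('input.txt') as f_in, open('output.txt', 'w') as f_out:
    for line in f_in:
        d = [int(x) for x in line.split()]
        row = np.zeros(num_features, dtype=np.uint8)
        row[d] = 1                      # set every listed index in one operation
        f_out.write(','.join(map(str, row)) + '\n')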
It is slowing down because you are doing string concatenation. It is better to work with lists.
Also, you could use csv to read your space-separated lines and then write each row with the commas added automatically:
import csv

num_features = 20

with open('input.txt', 'r', newline='') as f_input, open('output.txt', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input, delimiter=' ')
    csv_output = csv.writer(f_output)
    for row in csv_input:
        list_line_sparse = [0] * int(num_features)
        for v in map(int, row):
            list_line_sparse[v] = 1
        csv_output.writerow(list_line_sparse)
So if input.txt contained the following:
1 3 9 10
1 3 9 11
2 7 3 5
Giving you an output.txt containing:
0,1,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0
0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0
0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
Too many loops: first the item.split(), then the for x in zz, then for i in d, then for item in list_line_sparse, and then for i in d again. The string concatenations could be your most expensive part: the .join and the output +=. And all of this for every line.
You could try parsing and writing character by character. Something like this:
#features per line
count = int(num_features)

f = open('outfile.txt', 'w')

#loop over all lines
for item in lines:
    #reset the feature
    i = 0
    #the characters buffer
    index = ""
    #parse character by character
    for character in item:
        #if a space or end of line is found,
        #and the characters buffer (index) is not empty
        if character in (" ", "\r", "\n"):
            if index:
                #parse the characters buffer
                index = int(index)
                #if is not the first feature
                if i > 0:
                    #add the separator
                    f.write(", ")
                #add 0's until index
                while i < index:
                    f.write("0, ")
                    i += 1
                #and write 1
                f.write("1")
                i += 1
                #reset the characters buffer
                index = ""
        #if is not a space or end on line
        else:
            #add the character to the buffer
            index += character
    #if the last line didn't end with a carriage return,
    #index could be waiting to be parsed
    if index:
        index = int(index)
        if i > 0:
            f.write(", ")
        while i < index:
            f.write("0, ")
            i += 1
        f.write("1")
        i += 1
        index = ""
    #fill with 0's
    while i < count:
        if i == 0:
            f.write("0")
        else:
            f.write(", 0")
        i += 1
    f.write("\n")
f.close()
Let's rework your code into a simpler package that takes better advantage of Python's features:
import sys

NUM_FEATURES = 12

with open(sys.argv[1]) as source, open(sys.argv[2], 'w') as sink:
    for line in source:
        list_line_sparse = [0] * NUM_FEATURES
        indicies = map(int, line.rstrip().split())
        for index in indicies:
            list_line_sparse[index] = 1
        print(*list_line_sparse, file=sink, sep=',')
I revisited this problem with your "more efficiently" in mind. Although the above is more memory efficient, it is a hair slower time-wise. I reconsidered your original and came up with a solution that is less memory efficient but about 2x faster than your code:
import sys

NUM_FEATURES = 12

data = ''

with open(sys.argv[1]) as source:
    for line in source:
        list_line_sparse = ["0"] * NUM_FEATURES
        indicies = map(int, line.rstrip().split())
        for index in indicies:
            list_line_sparse[index] = "1"
        data += ",".join(list_line_sparse) + '\n'

with open(sys.argv[2], 'w') as sink:
    sink.write(data)
Like your original solution, it stores all the data in memory and writes it out at the end, which is both a disadvantage (memory-wise) and an advantage (time-wise).
input.txt
1 3 9 10
1 3 9 11
2 7 3 5
USAGE
% python3 test.py input.txt output.txt
output.txt
0,1,0,1,0,0,0,0,0,1,1,0
0,1,0,1,0,0,0,0,0,1,0,1
0,0,1,1,0,1,0,1,0,0,0,0

Python: print filename and header

I have files (fasta files with a sequence) that look like this:
File1.fasta
>1
GTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTT
CTGATACCTCCAGCAACCCTCACAGGCCACCTTCGCAGGCTTACAGAACGCTCCCCTACC
CAACAACGCATAAACGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTT
CCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTTAAATGATGGCTGCTTCTA
AGCCAACATCCTGGCTGTCTGG
>2
AAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAA
ACCATGCACCGAAGCTGCGGCAGCGACACTCAGGTGTTGTTGGGTAGGGGAGCGTTCTGT
AAGCCTGTGAAGGTGGCCTGTGAGGGTTGCTGGAGGTATCAGAAGTGCGAATGCTGACAT
AAGTAACGATAAAGCGGGTGAAAAGCCCGCTCGCCGGAAGACCAAGGGTTCCTGTCCAAC
GTTAATCGGGGCAGG
File2.fasta
>1
CAACAACGCATAAACGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTT
>2
CCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTTAAATGATGGCTGCTTCTA
With my script, I count all the 5-mers in these files. My code is as follows:
import operator
import glob

def printSeq(name, seq):
    kmers = {}
    k = 5
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i+k]
        if kmer in kmers:
            kmers[kmer] += 1
        else:
            kmers[kmer] = 1
    for kmer, count in kmers.items():
        print (kmer + "\t" + str(count))
    sortedKmer = sorted(kmers.items(), reverse=True)
    for item in sortedKmer:
        print (item[0] + "\t" + str(item[1]))

for name in glob.glob('*.fasta'):
    with open(name, 'r') as f:
        seq = ""
        key = ""
        for line in f.readlines():
            if line.startswith(">"):
                if key and seq:
                    printSeq(key, seq)
                key = line[1:].strip()
                seq = ""
            else:
                seq += line.strip()
        printSeq(key, seq)
The output is now the 5-mer followed by its count.
I want to adjust my output so that each output line contains the filename, then the header, then the 5-mer and its count, like this:
File1 1 GTCTT 1
File1 1 TCTTC 1
File1 1 CTTCC 1
....
File2 2 TTCTA 1
How can I achieve that?
Additional question
I want to add the reverse complement sequence of the data and count it together with the previous data. My code to get the reverse complement is as follows:
from Bio import SeqIO
import glob

for fasta_file in glob.glob('*.fasta'):
    for record in SeqIO.parse(fasta_file, "fasta"):
        reverse_complement = ">" + record.id + "\n" + record.seq.reverse_complement()
So the reverse_complement of File1, header >1, has to be counted together with the previous data, etc. How can I include this data with my previous files and count them together?
My reverse_complement data is
File1.fasta (reverse_complement)
>1
CCAGACAGCCAGGATGTTGGCTTAGAAGCAGCCATCATTTAAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAAACCATGCACCGAAGCTGCGGCAGCGACGTTTATGCGTTGTTGGGTAGGGGAGCGTTCTGTAAGCCTGCGAAGGTGGCCTGTGAGGGTTGCTGGAGGTATCAGAAGTGCGAATGCTGACATAAGTAACGATAAAGCGGGTGAAAAGCCCGCTCGCCGGAAGAC
>2
CCTGCCCCGATTAACGTTGGACAGGAACCCTTGGTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTTCTGATACCTCCAGCAACCCTCACAGGCCACCTTCACAGGCTTACAGAACGCTCCCCTACCCAACAACACCTGAGTGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTTCCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTT
This could also be done using a Counter() as follows:
from collections import Counter
from itertools import groupby
import glob
import os

for fasta_file in glob.glob('*.fasta'):
    basename = os.path.splitext(os.path.basename(fasta_file))[0]
    with open(fasta_file) as f_fasta:
        for k, g in groupby(f_fasta, lambda x: x.startswith('>')):
            if k:
                sequence = next(g).strip('>\n')
            else:
                d = list(''.join(line.strip() for line in g))
                counts = Counter()
                while len(d) >= 5:
                    five_mer = '{}{}{}{}{}'.format(d[0], d[1], d[2], d[3], d[4])
                    counts[five_mer] += 1
                    del d[0]
                for five_mer, count in sorted(counts.items(), key=lambda x: (-x[1], x[0])):
                    print("{} {} {} {}".format(basename, sequence, five_mer, count))
This would give you output with the largest counts first and then alphabetically:
File1 1 CAGGC 3
File1 1 CGCAG 3
File1 1 GCTTT 3
File1 1 AACGC 2
File1 1 ACATC 2
File1 1 ACGCT 2
File1 1 AGGCC 2
It uses Python's groupby() function to read groups of lines together: either a single header line or the block of sequence lines that follows it. k is the result of the startswith() call, so when k is False it takes all the sequence lines returned, removes the newline from each and joins them together into one long string of characters.
It then takes the first 5 characters from that list, joins them back together and adds them as a key to a Counter(). It then removes the first character from the list and repeats until fewer than 5 characters remain.
For just alphabetical ordering:
for five_mer, count in sorted(counts.items()):
A Counter() works the same way as a dictionary, so .items() would give a list of key value pairs. These are sorted before being displayed.
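As a side note, the del d[0] window in the code above can also be expressed with plain string slicing, which some readers may find easier to follow; this is only a sketch of the same counting step, not part of the original answer:

from collections import Counter

def count_five_mers(sequence):
    # slide a window of length 5 along the string and tally each substring
    return Counter(sequence[i:i + 5] for i in range(len(sequence) - 4))

# e.g. count_five_mers("GTCTTC") gives Counter({'GTCTT': 1, 'TCTTC': 1})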
You change the signature of
def printSeq(name, seq)
to
def printSeq(file, header, name, seq):
and incorporate the new variables into the print statements, e.g.
print (item[0] + "\t" + str(item[1]))
becomes
print (file + "\t" + header + "\t" + item[0] + "\t" + str(item[1]))
Then, in your loop, you pass this information to the function.
You already have the file name available in the loop, stored in the variable name.
You parse the header in the lines where you detect it and store it in a variable for later use; the later use is when you call the printSeq function.
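A sketch of what those changes might look like, following the steps above; the unused name parameter is dropped here, and trimming the .fasta extension with os.path.splitext is an assumption rather than something the question asks for:

import glob
import os

def printSeq(fname, header, seq):
    kmers = {}
    k = 5
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        kmers[kmer] = kmers.get(kmer, 0) + 1
    for kmer, count in sorted(kmers.items(), reverse=True):
        print(fname + "\t" + header + "\t" + kmer + "\t" + str(count))

for name in glob.glob('*.fasta'):
    base = os.path.splitext(name)[0]   # "File1.fasta" -> "File1" (assumption)
    with open(name) as f:
        seq, key = "", ""
        for line in f:
            if line.startswith(">"):
                if key and seq:
                    printSeq(base, key, seq)
                key = line[1:].strip()
                seq = ""
            else:
                seq += line.strip()
        if key and seq:
            printSeq(base, key, seq)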

How to combine the output of two files into a single file in Python?

Right now my code outputs two text files, absorbance.txt and energy.txt, separately. I need to modify it so that it outputs only one file named combined.txt, in which every line has two values separated by a comma: the first value from absorbance.txt and the second from energy.txt. (I apologize if anyone is confused by my writing; please comment if you need more clarification.)
My code looks like this:
g = open("absorbance.txt", "w")
h = open("Energy.txt", "w")

ask = easygui.fileopenbox()
f = open(ask, "r")
a = f.readlines()

bg = []
wavelength = []
for string in a:
    index_j = 0
    comma_count = 0
    for j in string:
        index_j += 1
        if j == ',':
            comma_count += 1
            if comma_count == 1:
                slicing_point = index_j
    t = string[slicing_point:]
    new_str = string[:(slicing_point - 1)]
    new_energ = (float(1239.842 / int(float(new_str))) * 8065.54)
    print >>h, new_energ

import math
list = []
for i in range(len(ref)):
    try:
        ans = ((float(ref[i]) - float(bg[i])) / (float(sample[i]) - float(bg[i])))
        print ans
        base = 10
        final_ans = (math.log(ans, base))
    except:
        ans = -1 * ((float(ref[i]) - float(bg[i])) / (float(sample[i]) - float(bg[i])))
        print ans
        base = 10
        final_ans = (math.log(ans, base))
    print >>g, final_ans
Similar to Robert's approach, but aiming to keep control flow as simple as possible.
absorbance.txt:
Hello
How are you
I am fine
Does anybody want a peanut?
energy.txt:
123
456
789
Code:
input_a = open("absorbance.txt")
input_b = open("energy.txt")
output = open("combined.txt", "w")

for left, right in zip(input_a, input_b):
    #use rstrip to remove the newline character from the left string
    output.write(left.rstrip() + ", " + right)

input_a.close()
input_b.close()
output.close()
combined.txt:
Hello, 123
How are you, 456
I am fine, 789
Note that the fourth line of absorbance.txt was not included in the result, because energy.txt does not have a fourth line to go with it.
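If those unpaired trailing lines should be kept instead of dropped, itertools.zip_longest (izip_longest in Python 2) can pad the shorter file; a minimal sketch, assuming an empty string as the fill value:

from itertools import zip_longest

with open("absorbance.txt") as input_a, open("energy.txt") as input_b, \
        open("combined.txt", "w") as output:
    for left, right in zip_longest(input_a, input_b, fillvalue=""):
        # rstrip both sides so a missing partner doesn't leave a stray newline
        output.write(left.rstrip("\n") + ", " + right.rstrip("\n") + "\n")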
You can open both text files and append their contents to a new text file as shown below. This is what I came up with based on your question, not necessarily on the code you provided.
combined = open("Combined.txt", "w")

with open(r'Energy.txt', "rU") as EnergyLine:
    with open(r'Absorbance.txt', "rU") as AbsorbanceLine:
        for line in EnergyLine:
            Eng = line[:-1]
            for line2 in AbsorbanceLine:
                Abs = line2[:-1]
                combined.write("%s,%s\n" % (Eng, Abs))
                break

combined.close()
