Trying to stepwise iterate through 2 files in Python

I am trying to merge two LARGE input files together into 1 output, sorting as I go.
## Above I counted the number of lines in each table
print("Processing Table Lines: table 1 has " + str(count1) + " and table 2 has " + str(count2) )
newLine, compare, line1, line2 = [], 0, [], []
while count1 + count2 > 0:
    if count1 > 0 and compare <= 0: count1, line1 = count1 - 1, ifh1.readline().rstrip().split('\t')
    else: line1 = []
    if count2 > 0 and compare >= 0: count2, line2 = count2 - 1, ifh2.readline().rstrip().split('\t')
    else: line2 = []
    compare = compareTableLines( line1, line2 )
    newLine = mergeLines( line1, line2, compare, tIndexes )
    ofh.write('\t'.join(newLine) + '\n')
What I expect to happen is that as each merged line is written to the output, the next line is pulled from whichever file was just consumed, if one is available. I also expect the loop to exit once both files are exhausted.
However I keep getting this error:
ValueError: Mixing iteration and read methods would lose data
I just don't see how to get around it. The files are too large to keep in memory, so I want to read as I go.
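For what it's worth, that ValueError is raised (in Python 2) when a file object that has already been used as an iterator, for example with for line in ifh1 while counting the lines, is later read with readline(): iteration keeps a read-ahead buffer that readline() would bypass. Two hedged ways around it, assuming ifh1 and ifh2 are the handles that were iterated over earlier:

    # Option 1: rewind after counting; seek() should discard the iterator's
    # read-ahead buffer, so readline() can be used safely afterwards
    ifh1.seek(0)
    ifh2.seek(0)

    # Option 2: stay with the iterator protocol instead of readline();
    # next(fh, '') returns '' at end of file rather than raising StopIteration
    line1 = next(ifh1, '').rstrip().split('\t')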

Here's an example of merging two ordered files, CSV files in this case, using heapq.merge() and itertools.groupby(). Given 2 CSV files:
x.csv:
key1,99
key2,100
key4,234
y.csv:
key1,345
key2,4
key3,45
Running:
import csv, heapq, itertools

keyfun = lambda row: row[0]

with open("x.csv") as inf1, open("y.csv") as inf2, open("z.csv", "w") as outf:
    in1, in2, out = csv.reader(inf1), csv.reader(inf2), csv.writer(outf)
    for key, rows in itertools.groupby(heapq.merge(in1, in2, key=keyfun), keyfun):
        out.writerow([key, sum(int(r[1]) for r in rows)])
we get:
z.csv:
key1,444
key2,104
key3,45
key4,234
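The same pattern should carry over to the question's tab-separated tables: pass delimiter='\t' to csv.reader and csv.writer and pick a key function that matches the sort order. A hedged sketch, assuming the files are already sorted on column 0 and that ifh1, ifh2, ofh are the open handles from the question:

    import csv, heapq

    in1 = csv.reader(ifh1, delimiter='\t')
    in2 = csv.reader(ifh2, delimiter='\t')
    out = csv.writer(ofh, delimiter='\t')
    for row in heapq.merge(in1, in2, key=lambda r: r[0]):
        out.writerow(row)

Note this only interleaves the rows in sorted order; combining rows that share a key would still need a groupby step like the one above, or the question's mergeLines.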

Related

Assist writing awk script version of python code to generate count matrix

I have not found any similar question to this...
I have this Python script to generate a count matrix from files containing only sequences, but it takes forever to run; I know awk will do a faster job. I am not so good with awk, but I'm hoping someone might be able to help.
The Python script is as follows:
import sys

numFiles = int(sys.argv[1])
allParams = int(numFiles + 4)
key_file = sys.argv[2]
out_file = sys.argv[3]
#open the output file
outHandle = open(out_file,'w')
#Open key file and read one line at a time
with open(key_file) as kf:
    for eachline in kf:
        temp_list = [0] * numFiles
        kSeq = eachline.strip(' \t\n\r')
        upRange = int(numFiles + 4)
        for i in range(4, upRange):
            with open(sys.argv[i]) as f:
                for eachline in f:
                    seq = eachline.strip(' \t\n\r')
                    if (kSeq == seq):
                        curr = int(temp_list[i-4])
                        nw = int(curr + 1)
                        temp_list[i-4] = nw
                    else:
                        continue
        outHandle.write(str(kSeq) + "\t")
        for ind, item in enumerate(temp_list):
            lastItemIndex = numFiles - 1
            if (ind == lastItemIndex):
                outHandle.write(str(item) + "\n")
            else:
                outHandle.write(str(item) + "\t")
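As written, every count file in sys.argv[4:] is reopened and rescanned once per key line, which is the main reason it takes so long. Reading each file once and tallying it with collections.Counter avoids that; the following is only a rough sketch, assuming the same argument layout as the script above:

    import sys
    from collections import Counter

    numFiles = int(sys.argv[1])
    key_file = sys.argv[2]
    out_file = sys.argv[3]

    # one pass over each count file, tallying every sequence it contains
    counts = [Counter(line.strip(' \t\n\r') for line in open(fname))
              for fname in sys.argv[4:4 + numFiles]]

    with open(key_file) as kf, open(out_file, 'w') as out:
        for eachline in kf:
            kSeq = eachline.strip(' \t\n\r')
            out.write(kSeq + "\t" + "\t".join(str(c[kSeq]) for c in counts) + "\n")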
Trying to create an example:
Input: A keyFile, X number of other files (All input file are basically just words in a single column)
Output: A matrix containing the number of occurrences of the words in the keyFile in the X number of files.
Keyfile:
word
one
two
three
four
five
file1:
word
three
five
three
one
two
one
four
four
three
file2:
word
four
one
three
three
one
two
three
two
one
OUTPUT:
word    file1   file2
one     2       3
two     1       2
three   3       3
four    2       1
five    1       0
The number of files could be up to 4.
I hope this illustration is clearer.
Thank you
So, after extensive reading and attempts, I got what I wanted to achieve using this code:
awk 'fname != FILENAME { fname = FILENAME; idx++ } idx == 1 {key[$0] = $0 } idx == 2 {if($1 == key[$1]){ f1[$1] += 1 }} idx == 3 {if($1 == key[$1]){ f2[$1] += 1 }} END {for(seq in key) print seq "\t" f1[seq] "\t" f2[seq] }' keyFile file1 file2
Thanks all for your input.

Possible to do this more efficiently (turn compact file to sparse)

I have to read in a file line by line that has indices of where a vector has 1's
so for example:
1 3 9 10
means:
0,1,0,1,0,0,0,0,0,1,1
My goal is to write a program that will take each line and print out the full vector with the 0's.
I am able to do this with my current program for a few lines:
#create a sparse vector
list_line_sparse = [0] * int(num_features)
#accumulate the output in one string
out_file = ''
#loop over all the lines
for item in lines:
    #split the line on spaces
    zz = item.split(' ')
    #get all ints on a line
    d = [int(x.strip()) for x in zz]
    #loop over all ints and change index to 1 in sparse vector
    for i in d:
        list_line_sparse[i] = 1
    out_file += (', '.join(str(item) for item in list_line_sparse))
    #change back to 0's
    for i in d:
        list_line_sparse[i] = 0
    out_file += '\n'
f = open('outfile', 'w')
f.write(out_file)
f.close()
The problem is that for a file with a lot of features and lines, my program is very, very inefficient - it basically never finishes. Is there anything sticking out that I should change to make it more efficient? (i.e. the 2 for loops)
It would probably be more efficient to write each line of data to your output file as it is generated, rather than building up a huge string in memory.
numpy is a popular Python module that's good for doing bulk operations on numbers. If you start with:
import numpy as np
list_line_sparse = np.zeros(num_features, dtype=np.uint8)
Then, given d as the list of numbers on the current line, you can simply do:
list_line_sparse[d] = 1
to set ALL of those indexes in the array at the same time, no loop required. (At the Python level at least, obviously there's still a loop involved, but it's down in the C implementation of numpy).
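Putting those pieces together with the write-as-you-go suggestion above, a minimal sketch (the input/output file names, the num_features value, and the ', ' separator are assumptions carried over from the question):

    import numpy as np

    num_features = 12  # assumed width of each output vector

    with open('input.txt') as f_in, open('outfile', 'w') as f_out:
        row = np.zeros(num_features, dtype=np.uint8)
        for line in f_in:
            d = [int(x) for x in line.split()]
            row[d] = 1                                   # set every listed index at once
            f_out.write(', '.join(str(v) for v in row) + '\n')
            row[d] = 0                                   # reset only the indices we touched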
It is slowing down because you are doing string concatenation. It is better to work with lists.
Also you could use csv to read your space separated lines in, and to then write each row with commas automatically added:
import csv

num_features = 20

with open('input.txt', 'r', newline='') as f_input, open('output.txt', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input, delimiter=' ')
    csv_output = csv.writer(f_output)

    for row in csv_input:
        list_line_sparse = [0] * int(num_features)
        for v in map(int, row):
            list_line_sparse[v] = 1
        csv_output.writerow(list_line_sparse)
So if input.txt contained the following:
1 3 9 10
1 3 9 11
2 7 3 5
you would get an output.txt containing:
0,1,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0
0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0
0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
Too many loops: first the item.split(), then the for x in zz, then for i in d, then for item in list_line_sparse, and then for i in d again. String concatenation could be your most expensive part: the .join and the output +=. And all this for every line.
You could try a "character by character" parsing and writing. Something like this:
#features per line
count = int(num_features)
f = open('outfile.txt', 'w')
#loop over all lines
for item in lines:
    #reset the feature
    i = 0
    #the characters buffer
    index = ""
    #parse character by character
    for character in item:
        #if a space or end of line is found,
        #and the characters buffer (index) is not empty
        if character in (" ", "\r", "\n"):
            if index:
                #parse the characters buffer
                index = int(index)
                #if this is not the first feature
                if i > 0:
                    #add the separator
                    f.write(", ")
                #add 0's until index
                while i < index:
                    f.write("0, ")
                    i += 1
                #and write 1
                f.write("1")
                i += 1
                #reset the characters buffer
                index = ""
        #if it is not a space or end of line
        else:
            #add the character to the buffer
            index += character
    #if the last line didn't end with a carriage return,
    #index could be waiting to be parsed
    if index:
        index = int(index)
        if i > 0:
            f.write(", ")
        while i < index:
            f.write("0, ")
            i += 1
        f.write("1")
        i += 1
        index = ""
    #fill with 0's
    while i < count:
        if i == 0:
            f.write("0")
        else:
            f.write(", 0")
        i += 1
    f.write("\n")
f.close()
Let's rework your code into a simpler package that takes better advantage of Python's features:
import sys

NUM_FEATURES = 12

with open(sys.argv[1]) as source, open(sys.argv[2], 'w') as sink:
    for line in source:
        list_line_sparse = [0] * NUM_FEATURES
        indicies = map(int, line.rstrip().split())
        for index in indicies:
            list_line_sparse[index] = 1
        print(*list_line_sparse, file=sink, sep=',')
I revisited this problem with your "more efficiently" in mind. Although the above is more memory efficient, it is a hair slower time-wise. I reconsidered your original and came up with a solution that is less memory efficient but about 2x faster than your code:
import sys

NUM_FEATURES = 12

data = ''
with open(sys.argv[1]) as source:
    for line in source:
        list_line_sparse = ["0"] * NUM_FEATURES
        indicies = map(int, line.rstrip().split())
        for index in indicies:
            list_line_sparse[index] = "1"
        data += ",".join(list_line_sparse) + '\n'

with open(sys.argv[2], 'w') as sink:
    sink.write(data)
Like your original solution, it stores all the data in memory and writes it out at the end which is both a disadvantage (memory-wise) and an advantage (time-wise.)
input.txt
1 3 9 10
1 3 9 11
2 7 3 5
USAGE
% python3 test.py input.txt output.txt
output.txt
0,1,0,1,0,0,0,0,0,1,1,0
0,1,0,1,0,0,0,0,0,1,0,1
0,0,1,1,0,1,0,1,0,0,0,0

Code doesn't print the last sequence in a file

I have a file that looks like this:
<s0> 3
line1
line2
line3
<s1> 5
line1
line2
<s2> 4
etc. up to more than a thousand
Each sequence has a header like <s0> 3, which in this case states that three lines follow. In the example above, the number of lines below <s1> is two, so I have to correct the header to <s1> 2.
The code I have below picks out the sequence headers and the correct number of lines below them. But for some reason, it never gets the details of the last sequence. I know something is wrong but I don't know what. Can someone point me to what I am doing wrong?
import re

def call():
    with open('trial_perl.txt') as fp:
        docHeader = open("C:\path\header.txt","w")
        c = 0
        c1 = 0
        header = []
        k = -1
        for line in fp:
            if line.startswith("<s"):
                #header = line.split(" ")
                #print header[1]
                c = 0
            else:
                c1 = c + 1
                c += 1
            if c == 0 and c1 > 0:
                k += 1
                printing = c1
                if printing >= 0:
                    s = "<s%s>" % (k)
                    #print "%s %d" % (s, printing)
                    docHeader.write(s + " " + str(printing) + "\n")

call()
You have no sentinel at the end of the last sequence in your data, so your code will need to deal with the last sequence AFTER the loop is done.
If I may suggest some python tricks to get to your results; you don't need those c/c1/k counter variables, as they make the code more difficult to read and maintain. Instead, populate a map of sequence header to sequence items and then use the map to do all your work:
(this code works only if all sequence headers are unique - if you have duplicates, it won't work)
with open('trial_perl.txt') as fp:
    docHeader = open("C:\path\header.txt","w")
    data = {}
    for line in fp:
        if line.startswith("<s"):
            current_sequence = line
            # create a list with the header as the key
            data[current_sequence] = []
        else:
            # add each sequence to the list we defined above
            data[current_sequence].append(line)
Your map is ready! It looks like this:
{"<s0> 3": ["line1", "line2", "line5"],
"<s1> 5": ["line1", "line2"]}
You can iterate it like this:
for header, lines in data.items():
    # header is the key, e.g. "<s0> 3"
    # lines is the list of lines under that header, e.g. ["line1", "line2", ...]
    num_of_lines = len(lines)
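To actually emit the corrected headers from that map, one possible ending (a sketch only; it assumes the dict preserves insertion order, which is guaranteed from Python 3.7 on, and that docHeader is still open):

    for header, lines in data.items():
        tag = header.split()[0]                    # e.g. "<s0>" from the original header line
        docHeader.write("%s %d\n" % (tag, len(lines)))
    docHeader.close()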
The main problem is that you neglect to check the value of c after you have read the last line. You probably had difficulty spotting this problem because of all the superfluous code. You don't have to increment k, since you can extract the value from the <s...> tag. And you don't have to have all three variables c, c1, and printing. A single count variable will do.
import re, sys

def call():
    with open('trial_perl.txt') as fp:
        docHeader = sys.stdout  #open("C:\path\header.txt","w")
        count = 0
        id = None
        for line in fp:
            if line.startswith("<s"):
                if id != None:
                    tag = '<s%s>' % id
                    docHeader.write('<s%d> %d\n' % (id, count))
                count = 0
                id = int(line[2:line.find('>')])
            else:
                count += 1
        if id != None:
            tag = '<s%s>' % id
            docHeader.write('<s%d> %d\n' % (id, count))

call()
Another approach uses groupby from itertools, taking the maximum line count in each group, where a group corresponds to one header plus the lines that follow it in your file:
from itertools import groupby

def call():
    with open('stack.txt') as fp:
        header = [-1]
        lines = [0]
        for line in fp:
            if line.startswith("<s"):
                header.append(header[-1] + 1)
                lines.append(0)
            else:
                header.append(header[-1])
                lines.append(lines[-1] + 1)
        with open('result', 'w') as f:
            for key, group in groupby(zip(header[1:], lines[1:]), lambda x: x[0]):
                f.write(str(("<s%d> %d\n" % max(group))))
            f.close()

call()
#<s0> 3
#<s1> 2
stack.txt is the file containing your data:
<s0> 3
line1
line2
line3
<s1> 5
line1
line2

Counting the number of character differences between two files

I have two somewhat large (~20 MB) txt files which are essentially just long strings of integers (only either 0,1,2). I would like to write a python script which iterates through the files and compares them integer by integer. At the end of the day I want the number of integers that are different and the total number of integers in the files (they should be exactly the same length). I have done some searching and it seems like difflib may be useful but I am fairly new to python and I am not sure if anything in difflib will count the differences or the number of entries.
Any help would be greatly appreciated! What I am trying right now is the following, but it only looks at one entry and then terminates, and I don't understand why.
f1 = open("file1.txt", "r")
f2 = open("file2.txt", "r")
fileOne = f1.readlines()
fileTwo = f2.readlines()
f1.close()
f2.close()
correct = 0
x = 0
total = 0
for i in fileOne:
if i != fileTwo[x]:
correct +=1
x += 1
total +=1
if total != 0:
percent = (correct / total) * 100
print "The file is %.1f %% correct!" % (percent)
print "%i out of %i symbols were correct!" % (correct, total)
Not tested at all, but look at this as something a lot easier (and more Pythonic):
from itertools import izip

with open("file1.txt", "r") as f1, open("file2.txt", "r") as f2:
    data = [(1, x == y) for x, y in izip(f1.read(), f2.read())]

print sum(1.0 for t in data if t[1]) / len(data) * 100
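If memory ever becomes a concern (at ~20 MB per file it probably will not), a hedged alternative is to compare fixed-size chunks instead of reading whole files; this sketch is Python 3 and assumes the two files have exactly the same length:

    CHUNK = 1 << 16
    diffs = total = 0
    with open("file1.txt") as f1, open("file2.txt") as f2:
        while True:
            a, b = f1.read(CHUNK), f2.read(CHUNK)
            if not a and not b:
                break
            total += len(a)
            diffs += sum(1 for x, y in zip(a, b) if x != y)

    print("%d of %d characters differ (%.1f%%)" % (diffs, total, 100.0 * diffs / total))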
You can use enumerate to check the chars in your strings that don't match
If all strings are guaranteed to be the same length:
with open("file1.txt","r") as f:
l1 = f.readlines()
with open("file2.txt","r") as f:
l2 = f.readlines()
non_matches = 0.
total = 0.
for i,j in enumerate(l1):
non_matches += sum([1 for k,l in enumerate(j) if l2[i][k]!= l]) # add 1 for each non match
total += len(j.split(","))
print non_matches,total*2
print non_matches / (total * 2) * 100. # if strings are all same length just mult total by 2
6 40
15.0

Improving Python code for reading files

I wrote a Python script to process text files.
The input is a file with several lines. At the beginning of each line there is a number (1, 2, 3..., n). Then there is an empty line, and a last line on which some text is written.
I need to read through this file to delete some lines at the beginning and some at the end (say numbers 1 to 5, and then number 78 to the end). I want to write the remaining lines to a new file (in a new directory) and renumber the leading numbers on these lines (in my example, 6 would become 1, 7 would become 2, etc.).
I wrote the following:
def treatFiles(oldFile, newFile, firstF, startF, lastF):
    # firstF is simply an index
    # startF corresponds to the first line I want to keep
    # lastF corresponds to the last line I want to keep
    numberFToDeleteBeginning = int(startF) - int(firstF)
    with open(oldFile) as old, open(newFile, 'w') as new:
        countLine = 0
        for line in old:
            countLine += 1
            if countLine <= numberFToDeleteBeginning:
                pass
            elif countLine > int(lastF) - int(firstF):
                pass
            elif line.split(',')[0] == '\n':
                newLineList = line.split(',')
                new.write(line)
            else:
                newLineList = [str(countLine - numberFToDeleteBeginning)] + line.split(',')
                del newLineList[1]
                newLine = str(newLineList[0])
                for k in range(1, len(newLineList)):
                    newLine = newLine + ',' + str(newLineList[k])
                new.write(newLine)
if __name__ == '__main__':
    from sys import argv
    import os

    os.makedirs('treatedFiles')
    newFile = 'treatedFiles/' + argv[1]
    treatFiles(argv[1], newFile, argv[2], argv[3], argv[4])
My code works correctly but is far too slow (I have files of about 10Gb to treat and it's been running for hours).
Does anyone know how I can improve it?
I would get rid of the for loop in the middle and the expensive .split():
from itertools import islice

def treatFiles(old_file, new_file, index, start, end):
    with open(old_file, 'r') as old, open(new_file, 'w') as new:
        sliced_file = islice(old, start - index, end - index)
        for line_number, line in enumerate(sliced_file, start=1):
            number, rest = line.split(',', 1)
            if number == '\n':
                new.write(line)
            else:
                new.write(str(line_number) + ',' + rest)
Also, convert your three numerical arguments to integers before passing them into the function:
treatFiles(argv[1], newFile, int(argv[2]), int(argv[3]), int(argv[4]))
