Counting the number of character differences between two files - python

I have two somewhat large (~20 MB) txt files which are essentially just long strings of integers (only 0, 1, or 2). I would like to write a Python script which iterates through the files and compares them integer by integer. At the end of the day I want the number of integers that are different and the total number of integers in the files (they should be exactly the same length). I have done some searching and it seems like difflib may be useful, but I am fairly new to Python and I am not sure whether anything in difflib will count the differences or the number of entries.
Any help would be greatly appreciated! What I am trying right now is the following, but it only looks at one entry and then terminates, and I don't understand why.
f1 = open("file1.txt", "r")
f2 = open("file2.txt", "r")
fileOne = f1.readlines()
fileTwo = f2.readlines()
f1.close()
f2.close()
correct = 0
x = 0
total = 0
for i in fileOne:
    if i != fileTwo[x]:
        correct += 1
    x += 1
    total += 1
if total != 0:
    percent = (correct / total) * 100
    print "The file is %.1f %% correct!" % (percent)
    print "%i out of %i symbols were correct!" % (correct, total)

Not tested at all, but look at this as something a lot easier (and more Pythonic):
from itertools import izip

with open("file1.txt", "r") as f1, open("file2.txt", "r") as f2:
    data = [(1, x == y) for x, y in izip(f1.read(), f2.read())]
print sum(1.0 for t in data if t[1]) / len(data) * 100
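If memory is a concern with ~20 MB files, here is a Python 3 sketch of the same idea that compares fixed-size chunks instead of reading everything at once (file names as in the question; the chunk size is arbitrary):

from itertools import zip_longest

CHUNK = 1 << 20  # compare 1 MiB at a time

diffs = total = 0
with open("file1.txt") as f1, open("file2.txt") as f2:
    while True:
        c1, c2 = f1.read(CHUNK), f2.read(CHUNK)
        if not c1 and not c2:
            break
        total += max(len(c1), len(c2))
        # zip_longest pads the shorter chunk, so a length mismatch counts as a difference
        diffs += sum(1 for a, b in zip_longest(c1, c2) if a != b)

print("%d of %d symbols differ (%.1f%% correct)"
      % (diffs, total, 100.0 * (total - diffs) / total if total else 100.0))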

You can use enumerate to check the chars in your strings that don't match
If all strings are guaranteed to be the same length:
with open("file1.txt","r") as f:
l1 = f.readlines()
with open("file2.txt","r") as f:
l2 = f.readlines()
non_matches = 0.
total = 0.
for i,j in enumerate(l1):
non_matches += sum([1 for k,l in enumerate(j) if l2[i][k]!= l]) # add 1 for each non match
total += len(j.split(","))
print non_matches,total*2
print non_matches / (total * 2) * 100. # if strings are all same length just mult total by 2
6 40
15.0

Related

Trying to stepwise iterate through 2 files in python

I am trying to merge two LARGE input files together into 1 output, sorting as I go.
## Above I counted the number of lines in each table
print("Processing Table Lines: table 1 has " + str(count1) + " and table 2 has " + str(count2))
newLine, compare, line1, line2 = [], 0, [], []
while count1 + count2 > 0:
    if count1 > 0 and compare <= 0: count1, line1 = count1 - 1, ifh1.readline().rstrip().split('\t')
    else: line1 = []
    if count2 > 0 and compare >= 0: count2, line2 = count2 - 1, ifh2.readline().rstrip().split('\t')
    else: line2 = []
    compare = compareTableLines( line1, line2 )
    newLine = mergeLines( line1, line2, compare, tIndexes )
    ofh.write('\t'.join( newLine + '\n'))
What I expect to happen is that, as each merged line is written to the output, the next line is pulled from whichever file was just read, if one is available. I also expect the loop to end once both files are empty.
However I keep getting this error:
ValueError: Mixing iteration and read methods would lose data
I just don't see how to get around it. Either file is too large to keep in memory so I want to read as I go.
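For the error itself: in Python 2, iterating a file with for line in fh uses an internal read-ahead buffer, and calling readline() on the same handle afterwards raises exactly this ValueError. A minimal sketch of the usual workaround, with a hypothetical file name: do every read through next() (or rewind with seek(), which also discards that buffer):

ifh1 = open("table1.txt")
count1 = sum(1 for _ in ifh1)  # iterating fills the read-ahead buffer
ifh1.seek(0)                   # rewind; this also discards the buffer
line1 = next(ifh1, '').rstrip().split('\t')  # next() instead of readline()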
Here's an example of merging two ordered files, CSV files in this case, using heapq.merge() and itertools.groupby(). Given 2 CSV files:
x.csv:
key1,99
key2,100
key4,234
y.csv:
key1,345
key2,4
key3,45
Running:
import csv, heapq, itertools

keyfun = lambda row: row[0]

with open("x.csv") as inf1, open("y.csv") as inf2, open("z.csv", "w") as outf:
    in1, in2, out = csv.reader(inf1), csv.reader(inf2), csv.writer(outf)
    for key, rows in itertools.groupby(heapq.merge(in1, in2, key=keyfun), keyfun):
        out.writerow([key, sum(int(r[1]) for r in rows)])
we get:
z.csv:
key1,444
key2,104
key3,45
key4,234
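One caveat on this approach: heapq.merge() only interleaves its inputs, so both CSV files must already be sorted by the grouping key, otherwise groupby() can emit the same key more than once. Note also that the key= argument of heapq.merge() requires Python 3.5+.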

How to change this recursive function to a loop

I want to write all base-26 numbers (with letters of the alphabet as digits) of a certain length into an ASCII-file.
For length = 4 this would look like
aaaa
aaab
aaac
...
zzzx
zzzy
zzzz
I achieved this with the following recursive code:
def fuz(data, ll_str):
    ll_str += 1

    def for_once(data_once, ll_str_once):
        tmp_str = ll_str_once
        tmp_str -= 1
        new_data = []
        for m in data_once:
            for i1 in range(97, 123):
                new_data.append(m + chr(i1))
        if tmp_str != 0:
            return for_once(new_data, tmp_str)
        else:
            return data_once
    return for_once(data, ll_str)

if __name__ == '__main__':
    ll = 4
    test = ['']
    file_output = open("out.txt", 'a')
    out_data = fuz(test, ll)
    for out in out_data:
        file_output.write(out + '\n')
    file_output.close()
However, for any length > 4, this solution runs out of memory on my machine.
Therefore I am looking for an alternative without recursion - can anybody give me a hint on how to do this?
This loop writes all base-26 numbers of length 4 (with letters as digits) into a file named out.txt.
base and length can be arbitrarily chosen - but prepare to be patient for higher values...
import itertools as it

base = 26
lngth = 4

with open('out.txt', 'w') as f:
    for t in it.product(range(97, 97 + base), repeat=lngth):
        s = ''.join(map(chr, t))
        f.write(s + '\n')  # '\n' rather than chr(13): a lone carriage return is not a line break on most platforms
At least it doesn't consume too much memory, as requested by the OP.
However, with base 26 a length-5 file was already 70 MB, and I stopped writing a length-6 file at 1.4 GB; at that point Notepad++ could no longer open it. So everyone can judge for themselves how useful this code is.
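If you would rather see the loop written out than delegate it to itertools, here is a sketch of the same enumeration with an explicit odometer: keep a list of digit values and carry increments from right to left (same out.txt target assumed):

base = 26
length = 4

with open('out.txt', 'w') as f:
    digits = [0] * length  # each digit counts 0..base-1, i.e. 'a'..'z'
    while True:
        f.write(''.join(chr(97 + d) for d in digits) + '\n')
        # increment the rightmost digit, carrying left like an odometer
        pos = length - 1
        while pos >= 0:
            digits[pos] += 1
            if digits[pos] < base:
                break
            digits[pos] = 0
            pos -= 1
        else:
            break  # carried past the leftmost digit: all combinations written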

Possible to do this more efficiently (turn compact file to sparse)

I have to read in a file line by line that has indices of where a vector has 1's
so for example:
1 3 9 10
means:
0,1,0,1,0,0,0,0,0,1,1
My goal is to write program that will take each line and print out the full vector with the 0's.
I am able to do this with my current program for a few lines:
#create a sparse vector
list_line_sparse = [0] * int(num_features)
#loop over all the lines
for item in lines:
    #split the line on spaces
    zz = item.split(' ')
    #get all ints on a line
    d = [int(x.strip()) for x in zz]
    #loop over all ints and change index to 1 in sparse vector
    for i in d:
        list_line_sparse[i] = 1
    out_file += (', '.join(str(item) for item in list_line_sparse))
    #change back to 0's
    for i in d:
        list_line_sparse[i] = 0
    out_file += '\n'
f = open('outfile', 'w')
f.write(out_file)
f.close()
The problem is that for a file with a lot of features and lines, my program is very, very inefficient - it basically never finishes. Is there anything sticking out that I should change to make it more efficient (i.e., the two for loops)?
It would probably be more efficient to write each line of data to your output file as it is generated, rather than building up a huge string in memory.
numpy is a popular Python module that's good for doing bulk operations on numbers. If you start with:
import numpy as np
list_line_sparse = np.zeros(num_features, dtype=np.uint8)
Then, given d as the list of numbers on the current line, you can simply do:
list_line_sparse[d] = 1
to set ALL of those indexes in the array at the same time, no loop required. (At the Python level at least, obviously there's still a loop involved, but it's down in the C implementation of numpy).
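To make that concrete, a minimal end-to-end sketch of this numpy approach (input.txt and output.txt are hypothetical names, and num_features is assumed to be known up front):

import numpy as np

num_features = 12  # assumed known in advance

with open('input.txt') as f_in, open('output.txt', 'w') as f_out:
    row = np.zeros(num_features, dtype=np.uint8)
    for line in f_in:
        idx = [int(x) for x in line.split()]
        row[idx] = 1  # set every listed index in one vectorized assignment
        f_out.write(','.join(map(str, row)) + '\n')
        row[idx] = 0  # reset only the positions we touched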
It is slowing down because you are doing string concatenation. It is better to work with lists.
Also you could use csv to read your space separated lines in, and to then write each row with commas automatically added:
import csv
num_features = 20
with open('input.txt', 'r', newline='') as f_input, open('output.txt', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input, delimiter=' ')
    csv_output = csv.writer(f_output)
    for row in csv_input:
        list_line_sparse = [0] * int(num_features)
        for v in map(int, row):
            list_line_sparse[v] = 1
        csv_output.writerow(list_line_sparse)
So if input.txt contained the following:
1 3 9 10
1 3 9 11
2 7 3 5
Giving you an output.txt containing:
0,1,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0
0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0
0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
Too many loops: first the item.split(), then the for x in zz, then the for i in d, then the for item in list_line_sparse, and then the for i in d again. String concatenation could be your most expensive part: the .join and the out_file +=. And all of this for every line.
You could try a "character by character" parsing and writing. Something like this:
#features per line
count = int(num_features)
f = open('outfile.txt', 'w')
#loop over all lines
for item in lines:
    #reset the feature index
    i = 0
    #the characters buffer
    index = ""
    #parse character by character
    for character in item:
        #if a space or end of line is found,
        #and the characters buffer (index) is not empty
        if character in (" ", "\r", "\n"):
            if index:
                #parse the characters buffer
                index = int(index)
                #if it is not the first feature
                if i > 0:
                    #add the separator
                    f.write(", ")
                #add 0's until index
                while i < index:
                    f.write("0, ")
                    i += 1
                #and write the 1
                f.write("1")
                i += 1
                #reset the characters buffer
                index = ""
        #if it is not a space or end of line
        else:
            #add the character to the buffer
            index += character
    #if the last line didn't end with a carriage return,
    #index could still be waiting to be parsed
    if index:
        index = int(index)
        if i > 0:
            f.write(", ")
        while i < index:
            f.write("0, ")
            i += 1
        f.write("1")
        i += 1
        index = ""
    #fill the rest with 0's
    while i < count:
        if i == 0:
            f.write("0")
        else:
            f.write(", 0")
        i += 1
    f.write("\n")
f.close()
Let's rework your code into a simpler package that takes better advantage of Python's features:
import sys

NUM_FEATURES = 12

with open(sys.argv[1]) as source, open(sys.argv[2], 'w') as sink:
    for line in source:
        list_line_sparse = [0] * NUM_FEATURES
        indices = map(int, line.rstrip().split())
        for index in indices:
            list_line_sparse[index] = 1
        print(*list_line_sparse, file=sink, sep=',')
I revisited this problem with your "more efficiently" in mind. Although the above is more memory efficient, it is a hair slower time-wise. I reconsidered your original and came up with a solution that is less memory efficient but about 2x faster than your code:
import sys

NUM_FEATURES = 12

data = ''
with open(sys.argv[1]) as source:
    for line in source:
        list_line_sparse = ["0"] * NUM_FEATURES
        indices = map(int, line.rstrip().split())
        for index in indices:
            list_line_sparse[index] = "1"
        data += ",".join(list_line_sparse) + '\n'
with open(sys.argv[2], 'w') as sink:
    sink.write(data)
Like your original solution, it stores all the data in memory and writes it out at the end, which is both a disadvantage (memory-wise) and an advantage (time-wise).
input.txt
1 3 9 10
1 3 9 11
2 7 3 5
USAGE
% python3 test.py input.txt output.txt
output.txt
0,1,0,1,0,0,0,0,0,1,1,0
0,1,0,1,0,0,0,0,0,1,0,1
0,0,1,1,0,1,0,1,0,0,0,0

Iterating on a file and comparing values using python

I have a section of code that opens files containing information with wavenumber and intensity like this:
500.21506 -0.00134
500.45613 0.00231
500.69720 -0.00187
500.93826 0.00129
501.17933 -0.00049
501.42040 0.00028
501.66147 0.00114
501.90253 -0.00036
502.14360 0.00247
My code attempts to parse the information between two given wavelengths: lowwav and highwav. I would like to print only the intensities of the wavenumbers that fall between lowwav and highwav. My entire code looks like:
import datetime
import glob

path = '/Users/140803/*'
files = glob.glob(path)

for line in open('sfit4.ctl', 'r'):
    x = line.strip()
    if x.startswith('band.1.nu_start'):
        a, b = x.split('=')
        b = float(b)
        b = "{0:.3f}".format(b)
        lowwav = b
    if x.startswith('band.1.nu_stop'):
        a, b = x.split('=')
        b = float(b)
        b = "{0:.3f}".format(b)
        highwav = b

with open('\\_spec_final.t15', 'w') as f:
    with open('info.txt', 'rt') as infofile:
        for count, line in enumerate(infofile):
            lat = float(line[88:94])
            lon = float(line[119:127])
            year = int(line[190:194])
            month = int(line[195:197])
            day = int(line[198:200])
            hour = int(line[201:203])
            minute = int(line[204:206])
            second = int(line[207:209])
            dur = float(line[302:315])
            numpoints = float(line[655:660])
            fov = line[481:497] # field of view?
            sza = float(line[418:426])
            snr = 0.0000
            roe = 6396.2
            res = 0.5000
            lowwav = float(lowwav)
            highwav = float(highwav)
            spacebw = (highwav - lowwav) / numpoints
            d = datetime.datetime(year, month, day, hour, minute, second)
            f.write('{:>12.5f}{:>12.5f}{:>12.5f}{:>12.5f}{:>8.1f}'.format(sza, roe, lat, lon, snr)) # line 1
            f.write("\n")
            f.write('{:>10d}{:>5d}{:>5d}{:>5d}{:>5d}{:>5d}'.format(year, month, day, hour, minute, second)) # line 2
            f.write("\n")
            f.write(('{:%Y/%m/%d %H:%M:%S}'.format(d)) + "UT Solar Azimuth:" + ('{:>6.3f}'.format(sza)) + " Resolution:" + ('{:>6.4f}'.format(res)) + " Duration:" + ('{:>6.2f}'.format(dur))) # line 3
            f.write("\n")
            f.write('{:>21.13f}{:>26.13f}{:>24.17e}{:>12f}'.format(lowwav, highwav, spacebw, numpoints)) # line 4
            f.write("\n")
            with open(files[count], 'r') as g:
                for line in g:
                    wave_no, tensity = [float(item) for item in line.split()]
                    if lowwav <= wave_no <= highwav:
                        f.write(str(tensity) + '\n')
            g.close()
f.close()
infofile.close()
Right now, everything works fine except the last part where I compare wavelengths and print out the intensities corresponding to wavelengths between lowwav and highwav. No intensities are printing into the output file.
The problem is that when you iterate over the file g you are effectively moving its "file pointer", so a second loop over g starts at the end of the file and doesn't produce any values.
Secondly, you are producing all these nums lists, but every iteration of the loop shadows the previous value, making it unreachable.
Either you want to collect all the values and then iterate over those:
with open(files[count], 'r') as g:
    all_nums = []
    for line in g:
        all_nums.append([float(item) for item in line.split()])
for nums in all_nums:
    if (lowwav - nums[0]) < 0 or (highwav - nums[0]) > 0:
        f.write(str(nums[1]))
        f.write('\n')
    else:
        break
Or just do everything inside the first loop (this should be more efficient):
with open(files[count], 'r') as g:
    for line in g:
        nums = [float(item) for item in line.split()]
        if (lowwav - nums[0]) < 0 or (highwav - nums[0]) > 0:
            f.write(str(nums[1]))
            f.write('\n')
        else:
            break
Also note that the break statement will stop processing as soon as the condition is first false; you probably want to remove it.
That said, note that your code prints all values where nums[0] is either greater than lowwav or less than highwav, which means that if lowwav < highwav every value will be printed. You probably want and in place of or if you want to check whether they fall between lowwav and highwav. Moreover, in Python you can simply write lowwav < nums[0] < highwav for this.
I would personally use the following:
with open(files[count], 'r') as g:
    for line in g:
        wave_no, intensity = [float(item) for item in line.split()]
        if lowwav < wave_no < highwav:
            f.write(str(intensity) + '\n')
Split each line on whitespace and unpack the split list into two names, wavelength and intensity.
[line.split() for line in r] turns
500.21506 -0.00134
500.45613 0.00231
into
[['500.21506', '-0.00134'], ['500.45613', '0.00231']]
The list comprehension [(wavelength, intensity) for wavelength, intensity in lol if low <= float(wavelength) <= high] returns
[('500.21506', '-0.00134'), ('500.45613', '0.00231')]
If you join them back with [' '.join((w, i)) for w, i in [('500.21506', '-0.00134'), ('500.45613', '0.00231')]] you get ['500.21506 -0.00134', '500.45613 0.00231'].
Use a list comprehension to filter on wavelength, then join each wavelength and intensity back into a string and write it to the file:
with open('data.txt', 'r') as r, open('\\_spec_final.t15', 'w') as w:
    lol = (line.split() for line in r)
    intensities = (' '.join((wavelength, intensity)) + '\n'  # append the newline; writelines() does not add one
                   for wavelength, intensity in lol
                   if low <= float(wavelength) <= high)
    w.writelines(intensities)
If you want to output to terminal do print(list(intensities)) instead of w.writelines(intensities)
Contents of data.txt;
500.21506 -0.00134
500.45613 0.00231
500.69720 -0.00187
500.93826 0.00129
501.17933 -0.00049
501.42040 0.00028
501.66147 0.00114
501.90253 -0.00036
502.14360 0.00247
Output when low is 500 and high is 500.5:
['500.21506 -0.00134', '500.45613 0.00231']

Python3: read from a file and sort the values

I have a txt file that contains data in the following fashion:
13
56
9
32
99
74
2
each value on a different line. I created three functions:
the first one is to swap the values
def swap(lst, x, y):
    temp = lst[x]
    lst[x] = lst[y]
    lst[y] = temp
and the second function is to sort the values:
def selection_sort(lst):
    for x in range(0, len(lst) - 1):
        print(lst)
        swap(lst, x, findMinFrom(lst[x:]) + x)
the third function is to find the minimum value from the list:
def findMinFrom(lst):
    minIndex = -1
    for m in range(0, len(lst)):
        if minIndex == -1:
            minIndex = m
        elif lst[m] < lst[minIndex]:
            minIndex = m
    return minIndex
Now, how can I read from the file that contains the numbers and print them sorted?
Thanks in advance!
I used:
def main():
    f = []
    filename = input("Enter the file name: ")
    for line in open(filename):
        for eachElement in line:
            f += eachElement
    print(f)
    selectionSort(f)
    print(f)

main()
but it is still not working! Any help?
Good programmers don't reinvent the wheel and use sorting routines that are standard in most modern languages. You can do:
with open('input.txt') as fp:
    for line in sorted(fp):
        print(line, end='')
to print the lines sorted alphabetically (as strings). And
with open('input.txt') as fp:
    for val in sorted(map(int, fp)):
        print(val)
to sort numerically.
To read all the lines in a file:
f = open('test.txt')
your_listname = list(f)
To sort and print:
selection_sort(your_listname)
print(your_listname)
You may need to strip newline characters before sorting/printing:
stripped_listname = []
for i in your_listname:
    i = i.strip('\n')
    stripped_listname.append(i)
You probably also want to take the print statement out of your sort function so it doesn't print the list many times while sorting it.
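Putting the pieces together with the OP's own functions, a minimal sketch (assuming one integer per line, and that the print(lst) debug line has been removed from selection_sort):

def main():
    filename = input("Enter the file name: ")
    with open(filename) as fp:
        numbers = [int(line) for line in fp]  # int() tolerates the trailing '\n'
    selection_sort(numbers)
    for n in numbers:
        print(n)

main()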
