How to compare two files (~50MB) fast? - python

I am using Python and want to compare two files and find the lines that are unique to each of them. I'm doing it as shown below, but it is too slow.
import difflib

f1 = open(file1)
text1Lines = f1.readlines()
f2 = open(file2)
text2Lines = f2.readlines()

diffInstance = difflib.Differ()
diffList = list(diffInstance.compare(text1Lines, text2Lines))
How can I speed it up considerably?

You might read and compare the files simultaneously, instead of storing them in memory. The following snippet makes a lot of unrealistic assumptions (i.e. the files have the same number of lines and no line appears twice within the same file), but it illustrates the idea:
unique_1 = []
unique_2 = []
# handle_1 and handle_2 are assumed to be open file objects
for line_1 in handle_1:
    # Read a line from the 1st file and check whether we have already seen it in the 2nd
    if line_1 in unique_2:
        unique_2.remove(line_1)
    # If the line was unique so far, remember it
    else:
        unique_1.append(line_1)
    # The same, only with the files the other way around
    line_2 = handle_2.readline()
    if line_2 in unique_1:
        unique_1.remove(line_2)
    else:
        unique_2.append(line_2)
print('\n'.join(unique_1))
print('\n'.join(unique_2))
Sure, it smells of reinventing the wheel, but you might get better performance from a simple algorithm than from the fancy diff-building and distance calculations of difflib. Alternatively, if you are absolutely sure your files will never be too big to fit in memory (not the safest assumption, to be honest), you can just use the set difference:
set1 = set()
set2 = set()
for line in handle_1:
    set1.add(line)
for line in handle_2:
    set2.add(line)
set1_uniques = set1.difference(set2)
set2_uniques = set2.difference(set1)

Your code may have some bugs: it can only find differences on the same line. If the two files have a different number of lines, or the data is unsorted, your code will have a problem. Here is my code:
f1 = open('a.txt')
text1Lines = f1.readlines()
f2 = open('b.txt')
text2Lines = f2.readlines()
set1 = set(text1Lines)
set2 = set(text2Lines)
diffList = (set1 | set2) - (set1 & set2)
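For what it's worth, (set1 | set2) - (set1 & set2) is the symmetric difference, which Python's set type also exposes directly via the ^ operator; a minimal illustration:
set1 = {"a\n", "b\n", "c\n"}
set2 = {"b\n", "c\n", "d\n"}
diffList = set1 ^ set2  # same as set1.symmetric_difference(set2)
print(sorted(diffList))  # ['a\n', 'd\n']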

Related

Finding image similarities in a folder of thousands

I've cobbled together/written some code (thanks Stack Overflow users!) that checks for similarities in images using imagehash, but now I am having issues checking thousands of images (roughly 16,000). Is there anything I could improve in the code (or a different route entirely) to more accurately find matches and/or decrease the time required? Thanks!
I first changed the list that is created into an itertools combination, so it only compares unique pairs of images.
import csv
import itertools
import os

import imagehash
from PIL import Image

new_loc = os.chdir(r'''myimagelocation''')
dirloc = os.listdir(r'''myimagelocation''')

duplicates = []
dup = []

for f1, f2 in itertools.combinations(dirloc, 2):
    # Honestly not sure which hash method to use, so I went with dhash.
    dhash1 = imagehash.dhash(Image.open(f1))
    dhash2 = imagehash.dhash(Image.open(f2))
    hashdif = dhash1 - dhash2

    if hashdif < 5:  # May change the 5 to find more accurate matches
        print("images are similar due to dhash", "image1", f1, "image2", f2)
        duplicates.append(f1)
        dup.append(f2)

    # Setting up a CSV file with the similar images to review before deleting
    with open("duplicates.csv", "w") as myfile:
        wr = csv.writer(myfile)
        wr.writerows(zip(duplicates, dup))
Currently, this code may take days to process the number of images I have in the folder. I'm hoping to reduce this down to hours if possible.
Try this: instead of hashing each image at comparison time (127,992,000 hashes), hash everything ahead of time and compare the hashes, since those are not going to change (16,000 hashes).
import csv
import itertools
import os

import imagehash
from PIL import Image

new_loc = os.chdir(r'''myimagelocation''')
dirloc = os.listdir(r'''myimagelocation''')

duplicates = []
dup = []
hashes = []

# Hash every image exactly once, up front
for file in dirloc:
    hashes.append((file, imagehash.dhash(Image.open(file))))

for pair1, pair2 in itertools.combinations(hashes, 2):
    f1, dhash1 = pair1
    f2, dhash2 = pair2
    # Honestly not sure which hash method to use, so I went with dhash.
    hashdif = dhash1 - dhash2
    if hashdif < 5:  # May change the 5 to find more accurate matches
        print("images are similar due to dhash", "image1", f1, "image2", f2)
        duplicates.append(f1)
        dup.append(f2)

# Setting up a CSV file with the similar images to review before deleting
# (moved out of the loop so you aren't rewriting the file every time)
with open("duplicates.csv", "w") as myfile:
    wr = csv.writer(myfile)
    wr.writerows(zip(duplicates, dup))
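As a further option (not part of the answer above), exact duplicates can be found without any pairwise comparison at all by bucketing files under their hash value; note this only catches hashdif == 0, not the < 5 near-matches. A rough sketch, assuming the same imagehash/PIL setup and folder:
import os
from collections import defaultdict

import imagehash
from PIL import Image

folder = r'''myimagelocation'''  # placeholder path from the question
groups = defaultdict(list)
for name in os.listdir(folder):
    image = Image.open(os.path.join(folder, name))
    groups[str(imagehash.dhash(image))].append(name)

# Any bucket holding more than one file is a set of exact dhash duplicates
for hash_value, names in groups.items():
    if len(names) > 1:
        print(hash_value, names)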

Python Script takes much time / List comprehensions

I have a small script which compares, from CSV input files, how many items of the first list are in the second list. However, it takes quite some time to run when there are many entries.
data_1 = import_csv("test1.csv")
print(len(data_1))
data_2 = import_csv("test2.csv")
print(len(data_2))
data_to_keep = len([i for i in data_1 if i in data_2])
I just ran a test with 598,756 items in the first list and 76,612 in the second, and the script hasn't finished yet.
As I'm still relatively new to Python, I would like to know if there is a faster way to achieve what I'm trying to do. Thank you for your help :)
EDIT: import_csv looks like this:
def import_csv(csvfilename):
    data = []
    with open(csvfilename, "r", encoding="utf-8", errors="ignore") as scraped:
        reader = csv.reader(scraped, delimiter=',')
        for row in reader:
            if row:  # avoid blank lines
                data.append(row[0])
    return data
Make data_2 a set.
data_2 = set(import_csv("test2.csv"))
In Python, sets are much faster for checking if an object is present (using the in operator).
You might also see some improvement from switching the order of your inputs. Make the larger file the set, that way you do fewer lookups when you iterate over the elements of the smaller file.
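Applied to the original snippet, the basic change might look like this (a sketch reusing the question's import_csv helper; per the note above you could also swap which file becomes the set):
data_1 = import_csv("test1.csv")
data_2 = set(import_csv("test2.csv"))  # set membership tests are O(1) on average

data_to_keep = len([i for i in data_1 if i in data_2])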
You can use a set and its intersection, if duplicates can be safely discarded:
data1 = [1,2,3,3,4]
data2 = [2,3,5,6,1,6]
print(len(set(data1).intersection(data2)))
# 3
This is a set operation and is guaranteed to be faster than what you are doing.
Try it:
import csv

with open('test1.csv', newline='') as csvfile:
    list1 = list(csv.reader(csvfile, delimiter=','))  # materialise the reader so it can be iterated more than once

with open('test2.csv', newline='') as csvfile2:
    list2 = list(csv.reader(csvfile2, delimiter=','))

data_to_keep = len([i for i in list1 if i in list2])
I'm making a few assumptions here but here's an idea...
test1.csv and test2.csv hold something unique, like serial numbers. Like...
9210268126,4628032171,6691918168,1499888554,2024134986,
8826205840,5643225730,3174290295,1881330725,7192644763,
7210351670,7956881819,4897219228,4638431591,6444695480,
1949859915,8919131597,2176933146,3875411064,3546520925
Try...
with open("test1.csv") as f1, open("test2.csv") as f2:
    # flatten each file into individual values so the entries are hashable
    data_1 = [value.strip() for line in f1 for value in line.split(",") if value.strip()]
    data_2 = [value.strip() for line in f2 for value in line.split(",") if value.strip()]
Since they're unique we can use the set functions to see which entries are in the other file:
data_to_keep = set(data_1).intersection(set(data_2))
I'm not sure how to do it faster - at that point it might be a hardware bottleneck.
This one should also work. It converts the list to a dictionary, which avoids the sequential search the in operator performs on a list. On large datasets you generally want to avoid using the in operator on lists.
data_1 = import_csv("test1.csv")
data_2 = dict([(i,i) for i in import_csv("test2.csv")])
data_to_keep = len([i for i in data_1 if data_2.get(i) is not None])
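To see the difference concretely, here is a small, hypothetical timing sketch with timeit (the numbers will vary by machine):
import timeit

data = [str(i) for i in range(100000)]
as_list = list(data)
as_set = set(data)

# Membership on a list scans sequentially; on a set it is a hash lookup
print(timeit.timeit(lambda: "99999" in as_list, number=1000))
print(timeit.timeit(lambda: "99999" in as_set, number=1000))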

How to iterate over two files effectively (25000+ Lines)

So, I am trying to make a combined list inside of Python for matching data of about 25,000 lines.
The first list's data came from the file mac.uid and looks like this:
Mac|ID
The second list's data came from serial.uid and looks like this:
Serial|Mac
The Mac from list 1 must equal the Mac from list 2 before they are joined.
This is what I am currently doing; I believe there is too much repetition going on.
combined = []

def combineData():
    lines = open('mac.uid', 'r+')
    for line in lines:
        with open('serial.uid', 'r+') as serial:
            for each in serial:
                a, b = line.strip().split('|')
                a = a.lower()
                x, y = each.strip().split('|')
                y = y.lower()
                if a == y:
                    combined.append(a + "" + b + "" + x)
The final list is supposed to look like this:
Mac(List1), ID(List1), Serial(List2)
So that I can import it into an excel sheet.
Thanks for any and all help!
Instead of your nested loops (which give quadratic complexity) you should use a dictionary, which brings the whole thing down to roughly linear time since dict lookups are O(1) on average. To do so, first read serial.uid once and store the mapping of MAC addresses to serial numbers in a dict.
serial = dict()
with open('serial.uid') as istr:
    for line in istr:
        (ser, mac) = split_fields(line)
        serial[mac] = ser
Then you can close the file again and process mac.uid looking up the serial number for each MAC address in the dictionary you've just created.
combined = list()
with open('mac.uid') as istr:
    for line in istr:
        (mac, uid) = split_fields(line)
        combined.append((mac, uid, serial[mac]))
Note that I've changed combined from a list of strings to a list of tuples. I've also factored the splitting logic out into a separate function. (You'll have to put its definition before its use, of course.)
def split_fields(line):
    return line.strip().lower().split('|')
Finally, I recommend that you start using more descriptive names for your variables.
For files of 25k lines, you should have no issues storing everything in memory. If your data sets become too large for that, you might want to start looking into using a database. Note that the Python standard library ships with an SQLite module.
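Since the stated goal is to get the result into an Excel sheet, one possible way to finish (the output filename and header names below are just assumptions) is to dump the combined tuples with the csv module, which Excel opens directly:
import csv

# combined is the list of (mac, uid, serial) tuples built above
with open('combined.csv', 'w') as ostr:
    writer = csv.writer(ostr)
    writer.writerow(('mac', 'id', 'serial'))  # assumed header row
    writer.writerows(combined)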

Upper memory limit?

Is there a limit to memory for Python? I've been using a Python script to calculate average values from a file which is at least 150 MB in size.
Depending on the size of the file I sometimes encounter a MemoryError.
Can more memory be assigned to the python so I don't encounter the error?
EDIT: Code now below
NOTE: The file sizes can vary greatly (up to 20 GB); the minimum size of a file is 150 MB.
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")

file_write = open("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")

files = [file_A2_B2, file_A2_B2, file_A1_B2, file_A2_B1]  # note: file_A2_B2 is listed twice and file_A1_B1 is never used

for u in files:
    line = u.readlines()
    list_of_lines = []
    for i in line:
        values = i.split('\t')
        list_of_lines.append(values)
    count = 0
    for j in list_of_lines:
        count += 1
    for k in range(0, count):
        list_of_lines[k].remove('\n')
    length = len(list_of_lines[0])
    print_counter = 4
    for o in range(0, length):
        total = 0
        for p in range(0, count):
            number = float(list_of_lines[p][o])
            total = total + number
        average = total/count
        print average
        if print_counter == 4:
            file_write.write(str(average) + '\n')
            print_counter = 0
        print_counter += 1
    file_write.write('\n')
(This is my third answer because I misunderstood what your code was doing in my original, and then made a small but crucial mistake in my second; hopefully three's a charm.)
Edits: Since this seems to be a popular answer, I've made a few modifications over the years to improve its implementation, most of them not too major. That way, if folks use it as a template, it will provide an even better basis.
As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.
Python's memory limits are determined by how much physical RAM and virtual-memory disk space your computer and operating system have available. Even if you don't use it all up and your program "works", using that much may be impractical because it takes too long.
Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.
To accomplish this, a list of running totals for each of the fields is kept. When that is finished, the average value of each field can be calculated by dividing the corresponding total value by the count of total lines read. Once that is done, these averages can be printed out and some written to one of the output files. I've also made a conscious effort to use very descriptive variable names to try to make it understandable.
try:
    from itertools import izip_longest
except ImportError:  # Python 3
    from itertools import zip_longest as izip_longest

GROUP_SIZE = 4
input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
                    "A2_B1_100000.txt"]
file_write = open("average_generations.txt", 'w')
mutation_average = open("mutation_average", 'w')  # left in, but nothing written

for file_name in input_file_names:
    with open(file_name, 'r') as input_file:
        print('processing file: {}'.format(file_name))
        totals = []
        for count, fields in enumerate((line.split('\t') for line in input_file), 1):
            totals = [sum(values) for values in
                      izip_longest(totals, map(float, fields), fillvalue=0)]
        averages = [total/count for total in totals]
        for print_counter, average in enumerate(averages):
            print(' {:9.4f}'.format(average))
            if print_counter % GROUP_SIZE == 0:
                file_write.write(str(average) + '\n')
    file_write.write('\n')

file_write.close()
mutation_average.close()
mutation_average.close()
You're reading the entire file into memory (line = u.readlines()) which will fail of course if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.
Better iterate over each line:
for current_line in u:
    do_something_with(current_line)
is the recommended approach.
Later in your script, you're doing some very strange things like first counting all the items in a list, then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much easier.
This is one of the advantages of high-level languages like Python (as opposed to C where you do have to do these housekeeping tasks yourself): Allow Python to handle iteration for you, and only collect in memory what you actually need to have in memory at any given time.
Also, as it seems that you're processing TSV files (tabulator-separated values), you should take a look at the csv module which will handle all the splitting, removing of \ns etc. for you.
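For illustration, a minimal sketch of that csv-based, line-at-a-time approach (assuming tab-separated numeric columns, and using one of the question's filenames):
import csv

column_totals = []
row_count = 0
with open("A1_B1_100000.txt") as input_file:
    for row in csv.reader(input_file, delimiter='\t'):
        # keep only non-empty fields, so trailing tabs/newlines don't break float()
        values = [float(field) for field in row if field.strip()]
        if not column_totals:
            column_totals = [0.0] * len(values)
        for index, value in enumerate(values):
            column_totals[index] += value
        row_count += 1

column_averages = [total / row_count for total in column_totals]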
Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about
1959167 [MiB]
On jython 2.5 it crashes earlier:
239000 [MiB]
probably I can configure Jython to use more memory (it uses limits from JVM)
Test app:
import sys

sl = []
i = 0
# some magic: 1024 - overhead of the string object
fill_size = 1024
if sys.version.startswith('2.7'):
    fill_size = 1003
if sys.version.startswith('3'):
    fill_size = 497
print(fill_size)
MiB = 0
while True:
    s = str(i).zfill(fill_size)
    sl.append(s)
    if i == 0:
        try:
            sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
        except AttributeError:
            pass
    i += 1
    if i % 1024 == 0:
        MiB += 1
        if MiB % 25 == 0:
            sys.stderr.write('%d [MiB]\n' % (MiB))
In your app you read the whole file at once. For files this big you should read them line by line.
No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than available on the machine you're running on.
In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.
Edit:
Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.
A better approach would be to read the files one line at a time:
for u in files:
    for line in u:  # This will iterate over each line in the file
        # Read values from the line, do the necessary calculations
        pass
Not only are you reading the whole of each file into memory, but also you laboriously replicate the information in a table called list_of_lines.
You have a secondary problem: your choices of variable names severely obfuscate what you are doing.
Here is your script rewritten with the readlines() caper removed and with meaningful names:
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")

file_write = open("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")  # not used

files = [file_A2_B2, file_A2_B2, file_A1_B2, file_A2_B1]

for afile in files:
    table = []
    for aline in afile:
        values = aline.split('\t')
        values.remove('\n')  # why?
        table.append(values)
    row_count = len(table)
    row0length = len(table[0])
    print_counter = 4
    for column_index in range(row0length):
        column_total = 0
        for row_index in range(row_count):
            number = float(table[row_index][column_index])
            column_total = column_total + number
        column_average = column_total / row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average) + '\n')
            print_counter = 0
        print_counter += 1
    file_write.write('\n')
It rapidly becomes apparent that (1) you are calculating column averages (2) the obfuscation led some others to think you were calculating row averages.
As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.
Here is a revised version of the outer loop code:
for afile in files:
    for row_count, aline in enumerate(afile, start=1):
        values = aline.split('\t')
        values.remove('\n')  # why?
        fvalues = map(float, values)
        if row_count == 1:
            row0length = len(fvalues)
            column_index_range = range(row0length)
            column_totals = fvalues
        else:
            assert len(fvalues) == row0length
            for column_index in column_index_range:
                column_totals[column_index] += fvalues[column_index]
    print_counter = 4
    for column_index in column_index_range:
        column_average = column_totals[column_index] / row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average) + '\n')
            print_counter = 0
        print_counter += 1

Loop over a file and write the next line if a condition is met

Having a hard time fixing this or finding any good hints about it.
I'm trying to loop over one file, modify each line slightly, and then loop over a different file. If the line in the second file starts with the line from the first then the following line in the second file should be written to a third file.
with open('ids.txt', 'rU') as f:
    with open('seqres.txt', 'rU') as g:
        for id in f:
            id = id.lower()[0:4] + '_' + id[4]
            with open(id + '.fasta', 'w') as h:
                for line in g:
                    if line.startswith('>' + id):
                        h.write(g.next())
All the correct files appear, but they are empty. Yes, I am sure the if has true cases. :-)
"seqres.txt" has lines with an ID number in a certain format, each followed by a line with data. The "ids.txt" has lines with the ID numbers of interest in a different format. I want each line of data with an interesting ID number in its own file.
Thanks a million to anyone with a little advice!
Here's a mostly flattened implementation. Depending on how many hits you're going to get for each ID, and how many entries there are in 'seqres' you could redesign it.
# Extract the IDs in the desired format and cache them
ids = [x.lower()[0:4] + '_' + x[4] for x in open('ids.txt', 'rU')]
ids = set(ids)

# Create an iterator for the seqres.txt file and pull the first value
iseqres = iter(open('seqres.txt', 'rU'))
lineA = iseqres.next()

# Iterate through the rest of seqres, staggering
for lineB in iseqres:
    lineID = lineA[1:7]
    if lineID in ids:
        with open("%s.fasta" % lineID, 'a') as h:
            h.write(lineB)
    lineA = lineB
I think there is still progress to be made on the code you declare as final. You can make the result a little less nested and avoid a couple of somewhat silly things.
from contextlib import nested
from itertools import tee, izip

# Stole the pairwise recipe from the itertools documentation
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

with nested(open('ids.txt', 'rU'), open('seqres.txt', 'rU')) as (f, g):
    for id in f:
        id = id.lower()[0:4] + '_' + id[4]
        with open(id + '.fasta', 'w') as h:
            g.seek(0)  # start at the beginning of g each time
            for line, next_line in pairwise(g):
                if line.startswith('>' + id):
                    h.write(next_line)
This is an improvement over the final code you posted in that
It does not unnecessarily read the whole files into memory, but simply iterates over the file objects. (This may or may not be the best option for g, really. It definitely scales better.)
It does not contain the crash condition using gl[line+1] if we are already on the last line of gl
Depending on how g actually looks, there might be something more applicable than pairwise.
It is not as deeply nested.
It conforms to PEP8 for things like spaces around operators and indentation depth.
This algorithm is O(n * m), where n and m are the number of lines in f and g. If the length of f is unbounded, you can use a set of its ids to reduce the algorithm to O(n) (linear in the number of lines in g).
For speed, you really want to avoid looping over the same file multiple times. This means you've turned into an O(N*M) algorithm, when you could be a using an O(N+M) one.
To achieve this, read your list of ids into a fast lookup structure, like a set. Since there are only 4600 this in-memory form shouldn't be any problem.
The new solution is also reading the list into memory. That's probably not a huge problem with just a few hundred thousand lines, but it's wasting more memory than you need, since you can do the whole thing in a single pass, reading only the smaller ids.txt file into memory. You can just set a flag when the previous line was something interesting, which signals that the next line should be written out.
Here's a reworked version:
with open('ids.txt', 'rU') as f:
    interesting_ids = set('>' + line.lower()[0:4] + "_" + line[4] for line in f)  # Get all ids in a set.

found_id = None
with open('seqres.txt', 'rU') as g:
    for line in g:
        if found_id is not None:
            with open(found_id[1:] + '.fasta', 'w') as h:  # found_id[1:] drops the leading '>' from the filename
                h.write(line)
        id = line[:7]
        if id in interesting_ids:
            found_id = id
        else:
            found_id = None
The problem is that you are only looping through file g once - after you have read through it the first time the file index position is left at the end of the file, so any further reads will fail with EOF. You would need to reopen g every time round the loop.
However this will be massively inefficient - you are reading the same file repeatedly, once for every line in f. It will be orders of magnitude faster to read all of g into an array at the start and use that, so long as it will fit in memory.
After the first line in the ids.txt file has been processed, the file seqres.txt has been exhausted. There is something wrong with the nesting of your loops. Also, you're modifying the iterator inside the for line in g loop. Not a good idea.
If you really want to append the line that follows the line whose ID matches, then perhaps something like this might work better:
with open('ids.txt', 'rU') as f:
    ids = f.readlines()
with open('seqres.txt', 'rU') as g:
    seqres = g.readlines()

for id in ids:
    id = id.lower()[0:4] + '_' + id[4]
    with open(id + '.fasta', 'a') as h:
        # pair each line with the line that follows it (a list has no .next() method)
        for line, next_line in zip(seqres, seqres[1:]):
            if line.startswith('>' + id):
                h.write(next_line)
