I have a dictionary I created by reading in a whole lot of image files. It looks like this:
files = { 'file1.png': [data...], 'file2.png': [data...], ... 'file1000.png': [data...]}
I am trying to process these images to see how similar each of them is to the others. The thing is, with thousands of files' worth of data this is taking forever. I'm sure there are 20 different places I could optimize, but I am trying to work through it one piece at a time to see where I can do better.
My original method tested file1 against all of the rest of the files. Then I tested file2 against all of the files, which meant testing it against file1 again. So by the time I get to file1000 in the example above, I shouldn't need to test anything at all, since it has already been tested 999 times.
This is what I tried:
answers = {}
for x in files:
    for y in files:
        if y not in answers or x not in answers[y]:
            if compare(files[x], files[y]) < 0.01:
                answers.setdefault(x, []).append(y)
This doesn't work, as I am getting the wrong output now. The compare function is just this:
def compare(h1, h2):
    rms = math.sqrt(functools.reduce(operator.add,
                                     map(lambda a, b: (a - b)**2, h1[0], h2[0])) / len(h1[0]))
    return rms
I just didn't want to put that huge equation into the if statement.
Does anyone have a good method for comparing each of the data segments of the files dictionary without overlapping the comparisons?
Edit:
After trying ShadowRanger's answer I have realized that I may not have fully understood what I needed. My original answers dictionary looked like this:
{ 'file1.png': ['file1.png', 'file23.png', 'file333.png'],
'file2.png': ['file2.png'],
'file3.png': ['file3.png', 'file4.png', 'file5.png'],
'file4.png': ['file3.png', 'file4.png', 'file5.png'],
...}
And for now I am storing my results in a file like this:
file1.png file23.png file33.png
file2.png
file3.png file4.png file5.png
file6.png
...
I thought that by using combinations and only testing individual files once I would save a lot of time retesting files and not have to waste time getting rid of duplicate answers. But as far as I can tell, the combinations have actually reduced my ability to find matches and I'm not sure why.
You can avoid redundant comparisons with itertools.combinations to get order-insensitive unique pairs. Just import itertools and replace your doubly nested loop:
for x in files:
    for y in files:
with a single loop that gets the combinations:
for x, y in itertools.combinations(files, 2):
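If the goal is to reproduce the answers layout shown in the edit above (every file listed with itself, and each match recorded under both files), a minimal sketch along these lines may help; it assumes the files dictionary and compare function from the question:
import itertools

# seed every file with itself, since combinations() never yields an (x, x) pair
answers = {name: [name] for name in files}

for x, y in itertools.combinations(files, 2):
    if compare(files[x], files[y]) < 0.01:
        # record the match in both directions to mirror the original output
        answers[x].append(y)
        answers[y].append(x)
Note that combinations() deliberately skips the self-pair and only yields each unordered pair once, which is why a result built from it looks sparser than the original doubly nested loop unless you add those entries back yourself.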
Related
I have a JSON file with stations listed with subfields. x contains geographical coordinates, and I'd like to find the station closest to my cellMiddle coordinates. Currently I'm using this:
closestStationCoord = min(stations,
                          key=lambda x: abs(x[0]-cellMiddle[0]) + abs(x[1]-cellMiddle[1]))
So the coordinates are those with the minimum difference between x and cellMiddle. However, this takes a lot of time (in my experience, lambdas usually do take a long time to run). Is there any way I can find this minimum faster?
If there are a lot of items, you should consider algorithmic optimizations to avoid checking all the stations that are irrelevant.
I believe this answer already has a good summary on your possible options: https://gamedev.stackexchange.com/questions/27264/how-do-i-optimize-searching-for-the-nearest-point
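For example, if SciPy is available, a spatial index such as a k-d tree avoids scanning every station on every query. A minimal sketch, assuming stations is a list of (x, y) coordinate pairs and cellMiddle is a single (x, y) pair:
from scipy.spatial import cKDTree

tree = cKDTree(stations)                 # build once, reuse for many lookups
dist, idx = tree.query(cellMiddle, p=1)  # p=1 keeps the Manhattan-style metric from the lambda
closestStationCoord = stations[idx]
Building the tree costs O(n log n) once, but each lookup is then roughly O(log n) instead of a full scan, which pays off as soon as you have many queries.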
I have a dictionary called lemma_all_context_dict, and it has approximately 8000 keys. I need a list of all possible pairs of these keys.
I used:
pairs_of_words_list = list(itertools.combinations(lemma_all_context_dict.keys(), 2))
However, when using this line I get a MemoryError. I have 8GB of RAM but perhaps I get this error anyway because I've got a few very large dictionaries in this code.
So I tried a different way:
pairs_of_words_list = []
for p_one in range(len(lemma_all_context_dict.keys())):
    for p_two in range(p_one + 1, len(lemma_all_context_dict.keys())):
        pairs_of_words_list.append([lemma_all_context_dict.keys()[p_one],
                                    lemma_all_context_dict.keys()[p_two]])
But this piece of code takes around 20 minutes to run... does anyone know of a more efficient way to solve the problem? Thanks
I don't think that this question is a duplicate because what I'm asking - and I don't think this has been asked - is how to implement this stuff without my computer crashing :-P
Don't build a list, since that's the reason you get the memory error (you even create two lists, since that's what .keys() does here). You can iterate over the combinations iterator directly (that's what iterators are for):
for a, b in itertools.combinations(lemma_all_context_dict, 2):
    print a, b
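If the pairs only need to be consumed once, you can also stream them straight to disk (or into whatever processing you do) without ever holding them all in memory. A sketch, with pairs.txt as a made-up output name:
import itertools

with open('pairs.txt', 'w') as out:
    for a, b in itertools.combinations(lemma_all_context_dict, 2):
        out.write('{} {}\n'.format(a, b))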
It's kind of hard to explain, but I'm working with a directory that has a number of different files, and essentially I want to loop over files at irregular intervals.
so in pseudocode I guess it would be written like:
A = 1E4, 1E5, 5E5, 7E5, 1E6, 1.05E6, 1.1E6, 1.2E6, 1.5E6, 2E6
For A in range(start(A), end(A)):
    inputdir = "../../../COMBI_Output/Noise Studies/[A] Macro Particles/10KT_[A]MP_IP1hoN0.0025/"
    Run rest of code
Because at the moment I'm doing it manually by changing the value in [A], and it's a nightmare and time consuming. I'm using Python on a MacBook, so I wonder if writing a bash script that is called within Python would be the right idea?
Or replacing A with a text file, such that it's:
import numpy as np
mpnum = np.loadtxt("mp.txt")
for A in range(0, len(A)):
    for B in range(0, len(A)):
        inputdir = "../../../COMBI_Output/Noise Studies/", [A] "Macro Particles/10KT_", [A]"MP_IP1hoN0.0025/"
But I tried this first and still had no luck.
You are almost there. You don't need a range, just iterate over the list. Then do a replacement in the string using format.
A = ['1E4', '1E5', '5E5', '7E5', '1E6', '1.05E6', '1.1E6', '1.2E6', '1.5E6', '2E6']
for a in A:
    inputdir = "../../../COMBI_Output/Noise Studies/{} Macro Particles/10KT_{}MP_IP1hoN0.0025/".format(a, a)
The idea of putting the file names in a list and simply iterating over them using
for a in A:
seems to be the best idea. However, one small suggestion, if I may: if you're going to have a large number of files, why not make it a dictionary instead of a list? That way you can iterate through your files easily and also keep count of them, as sketched below.
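A small sketch of that idea, assuming the values of A and the directory pattern from the question; the labels here are just for counting and lookup:
runs = {
    '1E4': "../../../COMBI_Output/Noise Studies/1E4 Macro Particles/10KT_1E4MP_IP1hoN0.0025/",
    '1E5': "../../../COMBI_Output/Noise Studies/1E5 Macro Particles/10KT_1E5MP_IP1hoN0.0025/",
    # ... one entry per macro-particle setting
}

for count, (label, inputdir) in enumerate(runs.items(), start=1):
    print("run {} of {}: {}".format(count, len(runs), inputdir))
    # rest of the processing code goes here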
I have a question as to how I can perform this task in Python:
I have an array of entries like:
[IPAddress, connections, policystatus, activity flag, longitude, latitude] (all as strings)
ex.
['172.1.21.26','54','1','2','31.15424','12.54464']
['172.1.21.27','12','2','4','31.15424','12.54464']
['172.1.27.34','40','1','1','-40.15474','-54.21454']
['172.1.2.45','32','1','1','-40.15474','-54.21454']
...
up to about 110,000 entries, with about 4,000 different longitude-latitude combinations.
I want to compute the average connections, average policy status, and average activity flag for each location,
something like this:
[longitude,latitude,avgConn,avgPoli,avgActi]
['31.15424','12.54464','33','2','3']
['-40.15474','-54.21454','31','1','1']
...
and so on.
And I have about 195 files with ~110,000 entries each (sort of a big-data problem).
My files are in .csv, but I'm treating them as .txt to work with them more easily in Python (not sure if this is the best idea).
I'm still new to Python, so I'm not really sure what the best approach is, but I sincerely appreciate any help or guidance with this problem.
Thanks in advance!
No, if you have the files as .csv, treating them as plain text does not make sense, since Python ships with the excellent csv module.
You could read the csv rows into a dict to group them, but I'd suggest writing the data into a proper database and using SQL's AVG() and GROUP BY. Python ships with bindings for most databases. If you have none installed, consider using the built-in sqlite3 module.
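A rough sketch of the sqlite3 route, assuming the six columns from the question and a list filePaths holding the ~195 CSV files; the table and column names here are made up:
import csv
import sqlite3

conn = sqlite3.connect('stats.db')
conn.execute("""CREATE TABLE IF NOT EXISTS entries
                (ip TEXT, connections INTEGER, policystatus INTEGER,
                 activityflag INTEGER, longitude TEXT, latitude TEXT)""")

for path in filePaths:
    with open(path, newline='') as f:
        conn.executemany("INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?)",
                         csv.reader(f))
conn.commit()

for row in conn.execute("""SELECT longitude, latitude, AVG(connections),
                                  AVG(policystatus), AVG(activityflag)
                           FROM entries GROUP BY longitude, latitude"""):
    print(row)
The database approach also keeps memory use flat regardless of how many files you load, since the grouping happens inside SQLite rather than in Python.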
I'll only give you the algorithm; you would learn more by writing the actual code yourself.
1. Use a dictionary with keys of the form (longitude, latitude) and values of the form [connectionSum, policyStatusSum, activityFlagSum, count].
2. Loop over the entries once:
a. for each entry, if the location already exists, add the connection, policy status, and activity values to the existing sums and increment the count;
b. if the location does not exist yet, insert it with the value [0, 0, 0, 0] and then add the entry's values as in (a).
3. Do steps 1 and 2 for all files.
4. After all the entries have been scanned, loop over the dictionary and divide each of the three sums by that location's count to get the average values for each location.
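A rough sketch of that algorithm, assuming rows shaped like [IPAddress, connections, policystatus, activityflag, longitude, latitude] and a list filePaths of the CSV files:
import csv

sums = {}  # (longitude, latitude) -> [connSum, poliSum, actiSum, count]
for path in filePaths:
    with open(path, newline='') as f:
        for row in csv.reader(f):
            key = (row[4], row[5])
            totals = sums.setdefault(key, [0, 0, 0, 0])
            totals[0] += int(row[1])
            totals[1] += int(row[2])
            totals[2] += int(row[3])
            totals[3] += 1

averages = [[lon, lat, c / n, p / n, a / n]
            for (lon, lat), (c, p, a, n) in sums.items()]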
As long as your locations are restricted to the same file (or are even just close to each other within a file), all you need is the stream-processing paradigm. For example, if you know that duplicate locations only appear within a single file, read each file, calculate the averages, then close the file. As long as you let the old data fall out of scope, the garbage collector will get rid of it for you. Basically, do this:
def processFile(pathToFile):
    ...

totalResults = ...
for path in filePaths:
    partialResults = processFile(path)
    totalResults = combine...partialResults...with...totalResults
An even more elegant solution would be to use the O(1)-memory method of calculating averages "on-line". If, for example, you are averaging 5, 6, 7, you would do 5/1 = 5.0, (5.0*1+6)/2 = 5.5, (5.5*2+7)/3 = 6. At each step, you only keep track of the current average and the number of elements. This solution uses the minimal amount of memory (no more than the size of your final result!) and doesn't care about which order you visit elements in. It would go something like this. See http://docs.python.org/library/csv.html for what functions you'll need in the csv module.
import csv

def allTheRecords():
    # filePaths is assumed to hold the paths of all ~195 CSV files
    for path in filePaths:
        with open(path, newline='') as f:
            for row in csv.reader(f):
                # row layout from the question:
                # [IPAddress, connections, policystatus, activityflag, longitude, latitude]
                yield row

averages = {}  # dict: keys are tuples (lat, long), values are dicts holding the
               # running averages {avgConn, avgPoli, avgActi} plus a count, num
for record in allTheRecords():
    position = (record[5], record[4])
    current = averages.get(position, {'avgConn': 0.0, 'avgPoli': 0.0, 'avgActi': 0.0, 'num': 0})
    n = current['num'] + 1
    # the on-line average from above: new = (old * (n - 1) + value) / n
    current['avgConn'] = (current['avgConn'] * (n - 1) + float(record[1])) / n
    current['avgPoli'] = (current['avgPoli'] * (n - 1) + float(record[2])) / n
    current['avgActi'] = (current['avgActi'] * (n - 1) + float(record[3])) / n
    current['num'] = n
    averages[position] = current
(Do note that the notion of an "average at a location" is not well-defined. Well, it is well-defined, but not very useful: if you knew the exact location of every IP event to infinite precision, each group would contain a single event and its average would just be that event. The only reason you can compress your dataset is that your latitude and longitude have finite precision. If you run into this issue when you acquire more precise data, you can choose to round to an appropriate precision; it may be reasonable to round to within 10 meters or so (see latitude and longitude). This requires just a little bit of math/geometry.)
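For instance, one way to do that rounding is to snap each coordinate to a fixed number of decimal places before using it as a dictionary key (four decimal places is roughly 10 m of latitude); a tiny sketch:
def position_key(longitude, latitude, digits=4):
    # coordinates that differ only beyond `digits` decimals collapse to one key
    return (round(float(longitude), digits), round(float(latitude), digits))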
I have two files, prefix.txt and terms.txt, each with about 100 lines. I'd like to write out a third file with their Cartesian product (see http://en.wikipedia.org/wiki/Join_(SQL)#Cross_join), about 10,000 lines.
What is the best way to approach this in Python?
Secondly, is there a way to write the 10,000 lines to the third file in a random order?
You need itertools.product.
for prefix, term in itertools.product(open('prefix.txt'), open('terms.txt')):
    print(prefix.strip() + term.strip())
Print them, or accumulate them, or write them directly. You need the .strip() because of the newline that comes with each of them.
Afterwards, you can read the third file back in, shuffle its lines with random.shuffle(), and write them out again, but I don't know how fast that will be on a file of the size you are using.
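For example, something along these lines; note that random.shuffle() rearranges the list in place and returns None, so the lines have to be read, shuffled, and written back out (thirdfile.txt is just a placeholder name):
import random

lines = list(open('thirdfile.txt'))
random.shuffle(lines)
with open('thirdfile.txt', 'w') as out:
    out.writelines(lines)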
A Cartesian product enumerates all combinations. The easiest way to enumerate all combinations is to use nested loops.
You cannot write files in a random order very easily. To write to a "random" position, you must use file.seek(). How would you know which position to seek to? How would you know how long each part (prefix + term) will be?
You can, however, read entire files into memory (100 lines is nothing) and process the in-memory collections in "random" orders. This will assure that the output is randomized.
from random import shuffle
a = list(open('prefix.txt'))
b = list(open('terms.txt'))
c = [x.strip() + y.strip() for x in a for y in b]
shuffle(c)
open('result.txt', 'w').write('\n'.join(c))
Certainly not the best way in terms of speed and memory, but 10,000 lines is not big enough to sacrifice brevity anyway. You should normally close your file objects, and you could loop through at least one of the files without storing its contents in RAM. This: [:-1] removes the trailing newline from each element of a and b.
Edit: using s.strip() instead of s[:-1] to get rid of the newlines, since it's more portable.