I have a JSON file with stations listed with subfields. x contains geographical coordinates, and I'd like to find the station closest to my cellMiddle coordinates. Currently I'm using this:
closestStationCoord = min(stations,
                          key=lambda x: abs(x[0] - cellMiddle[0]) + abs(x[1] - cellMiddle[1]))
So the chosen coordinates are those with the minimum difference between x and cellMiddle. However, this takes a lot of time (in my experience, lambdas usually take a long time to run). Is there any way I can find this minimum faster?
If there are a lot of items, you should consider algorithmic optimizations that avoid checking every irrelevant station.
I believe this answer already has a good summary on your possible options: https://gamedev.stackexchange.com/questions/27264/how-do-i-optimize-searching-for-the-nearest-point
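For example, a spatial index lets you skip most stations entirely. Here is a minimal sketch, assuming each station is an (x, y) coordinate pair as your lambda suggests, using scipy.spatial.cKDTree (my suggestion, not something your code already uses):

import numpy as np
from scipy.spatial import cKDTree

coords = np.asarray(stations)        # shape (n_stations, 2)
tree = cKDTree(coords)               # build once, reuse for many cells

# p=1 matches the |dx| + |dy| (Manhattan) metric from your lambda
dist, idx = tree.query(cellMiddle, p=1)
closestStationCoord = coords[idx]

Building the tree costs O(n log n) once, and each query is roughly O(log n), which pays off quickly if you repeat this for many cells.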
I have a user group list UserGroupA=[CustomerA_id1,CustomerA_id2 ....] containing 1000 users and a user group list UserGroupB=[CustomerB_id1,CustomerB_id2 ...] containing 10000 users, and I have a similarity function defined for any two users from UserGroupA and UserGroupB:
Similarity(CustomerA_id(k),CustomerB_id(l)) where k and l are indices for users in Group A and B.
My objective is to find the 1000 users from Group B that are most similar to the users in Group A, and I want to use CrossSimilarity to determine that. Is there a more efficient way to do it, especially as the size of Group B increases?
CrossSimilarity = [0] * 10000
for i in range(10000):
    for j in range(1000):
        CrossSimilarity[i] = CrossSimilarity[i] + Similarity(CustomerA_id[j], CustomerB_id[i])
CrossSimilarity.sort()
It really depends on the Similarity function and how much time it takes. I expect it will heavily dominate your runtime, but without a runtime profile, it's hard to say. I have some general advice only:
Have a look at how you calculate Similarity and whether you can improve the process by handling everyone from group A (or B) in one go, rather than starting from scratch for each pair.
There are some micro-optimisations you can do: for example, += will be a tiny bit faster, and you can cache CustomerB_id[i] in the outer loop. You can likely squeeze some time out of your similarity function the same way, but I wouldn't expect this time to matter much.
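As a rough sketch of those micro-optimisations (same algorithm as your loop, just with fewer repeated lookups; CustomerA_id, CustomerB_id and Similarity are the names from your question):

CrossSimilarity = [0] * len(CustomerB_id)
for i, b in enumerate(CustomerB_id):     # cache CustomerB_id[i] once per outer iteration
    total = 0
    for a in CustomerA_id:
        total += Similarity(a, b)        # += instead of rebuilding the whole expression
    CrossSimilarity[i] = total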
If your code is pure Python and CPU-heavy, you could try compiling it with Cython, or running it in PyPy instead of the standard CPython interpreter.
Since what you are doing is essentially a pairwise computation between the two lists (UserGroupA and UserGroupB), a more efficient and faster way to perform it in memory could be to use the scikit-learn module, which provides the function:
sklearn.metrics.pairwise.pairwise_distances(X, Y, metric='euclidean')
where X=UserGroupA and Y=UserGroupB, and for the metric argument you can use one of sklearn's built-in metrics or pass your own callable.
It will return a distance matrix D such that D_{i, k} is the distance between the ith array from X and the kth array from Y.
Then, to find the 1000 most similar users, you can aggregate the matrix per Group B user and sort the result.
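A hedged sketch of that last step (this assumes each user is represented by a numeric feature vector and that a smaller distance means a more similar user, both assumptions on my part):

import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

X = np.asarray(UserGroupA)      # shape (1000, n_features)
Y = np.asarray(UserGroupB)      # shape (10000, n_features)

D = pairwise_distances(X, Y, metric='euclidean')   # shape (1000, 10000)

totals = D.sum(axis=0)                  # one aggregate score per Group B user
top_1000 = np.argsort(totals)[:1000]    # indices of the 1000 closest Group B users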
Maybe it is a little more involved than your solution, but it should be faster :)
I have a dictionary I created by reading in a whole lot of image files. It looks like this:
files = { 'file1.png': [data...], 'file2.png': [data...], ... 'file1000': [data...]}
I am trying to process these images to see how similar each of them are to each other. The thing is, with 1000s of files worth of data this is taking forever. I'm sure I have 20 different places I could optimize but I am trying to work through it one piece at a time to see how I can better optimize it.
My original method tested file1 against all of the rest of the files. Then I tested file2 against all of the files. But I still tested it against file1. So, by the time I get to file1000 in the above example I shouldn't even need to test anything at that point since it has already been tested 999 times.
This is what I tried:
answers = {}
for x in files:
    for y in files:
        if y not in answers or x not in answers[y]:
            if compare(files[x], files[y]) < 0.01:
                answers.setdefault(x, []).append(y)
This doesn't work, as I am getting the wrong output now. The compare function is just this:
def compare(h1, h2):
    rms = math.sqrt(functools.reduce(operator.add,
                                     map(lambda a, b: (a - b) ** 2, h1[0], h2[0])) / len(h1[0]))
    return rms
I just didn't want to put that huge equation into the if statement.
Does anyone have a good method for comparing each of the data segments of the files dictionary without overlapping the comparisons?
Edit:
After trying ShadowRanger's answer I have realized that I may not have fully understood what I needed. My original answers dictionary looked like this:
{ 'file1.png': ['file1.png', 'file23.png', 'file333.png'],
'file2.png': ['file2.png'],
'file3.png': ['file3.png', 'file4.png', 'file5.png'],
'file4.png': ['file3.png', 'file4.png', 'file5.png'],
...}
And for now I am storing my results in a file like this:
file1.png file23.png file33.png
file2.png
file3.png file4.png file5.png
file6.png
...
I thought that by using combinations and only testing individual files once I would save a lot of time retesting files and not have to waste time getting rid of duplicate answers. But as far as I can tell, the combinations have actually reduced my ability to find matches and I'm not sure why.
You can avoid redundant comparisons with itertools.combinations to get order-insensitive unique pairs. Just import itertools and replace your doubly nested loop:
for x in files:
    for y in files:
with a single loop that gets the combinations:
for x, y in itertools.combinations(files, 2):
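Putting it together, a short sketch of what the whole loop could look like (files and compare() are the objects from your question). Note that each unordered pair now comes up exactly once, so if you want a match recorded under both filenames, as your original answers dict suggests, you have to append it in both directions yourself:

import itertools

answers = {}
for x, y in itertools.combinations(files, 2):
    if compare(files[x], files[y]) < 0.01:
        answers.setdefault(x, []).append(y)
        answers.setdefault(y, []).append(x)   # record the match for both files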
I wrote some code which works perfectly on small data, but when I run it over a dataset with 52,000 features, it seems to get stuck in the function below:
def extract_neighboring_OSM_nodes(ref_nodes, cor_nodes):
    time_start = time.time()
    print "here we start finding neighbors at ", time_start
    for ref_node in ref_nodes:
        buffered_node = ref_node[2].buffer(10)
        for cor_node in cor_nodes:
            if cor_node[2].within(buffered_node):
                ref_node[4].append(cor_node[0])
                cor_node[4].append(ref_node[0])
        # node[4][:] = [cor_nodes.index(x) for x in cor_nodes if x[2].within(buffered_node)]
    time_end = time.time()
    print "neighbor extraction took ", time_end - time_start
    return ref_nodes
ref_nodes and cor_nodes are lists of tuples of the form:
[(FID, point, geometry, links, neighbors)]
neighbors is an empty list which is going to be populated in the above function.
As I said, the last message printed out is the first print statement in this function. The function seems to be very slow, but for 52,000 features it should not take 24 hours, should it?
Any idea where the problem is or how to make the function faster?
You can try multiprocessing, here is an example - http://pythongisandstuff.wordpress.com/2013/07/31/using-arcpy-with-multiprocessing-%E2%80%93-part-3/.
If you want to get the K nearest neighbors of every (or some, it doesn't matter) sample of a dataset, or the eps-neighborhood of samples, there is no need to implement it yourself. There are libraries out there specifically for this purpose.
Once they have built the data structure (usually some kind of tree), you can query it for the neighborhood of a certain sample. For high-dimensional data these structures are usually not as effective as they are in low dimensions, but there are solutions for high-dimensional data as well.
One I can recommend here is the k-d tree, which has a SciPy implementation (scipy.spatial.KDTree / cKDTree).
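A hedged sketch of how that could replace the nested loop above (I'm assuming the geometry objects in position 2 are Shapely-style points exposing .x and .y, since the code calls .buffer() and .within()):

import numpy as np
from scipy.spatial import cKDTree

cor_xy = np.array([(node[2].x, node[2].y) for node in cor_nodes])
tree = cKDTree(cor_xy)                   # build the index over cor_nodes once

for ref_node in ref_nodes:
    # all cor_nodes within distance 10 of this ref_node, instead of testing every one
    for idx in tree.query_ball_point((ref_node[2].x, ref_node[2].y), r=10):
        ref_node[4].append(cor_nodes[idx][0])
        cor_nodes[idx][4].append(ref_node[0])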
I hope you find it useful as I did.
I am new to Python and my problem is the following:
I have defined a function func(a, b) that returns a value, given two input values.
Now I have my data stored in lists or NumPy arrays A, B and would like to use func for every combination. (A and B have over one million entries.)
ATM I use this snippet:
for p in A:
    for k in B:
        value = func(p, k)
This takes a really, really long time.
So I was thinking that maybe something like this would work:
C=(map(func,zip(A,B)))
But this method only works pairwise... Any ideas?
Thanks for the help.
First issue
You need to calculate the output of f for many pairs of values. The "standard" way to speed up this kind of loop is to make your function f accept (NumPy) arrays as input and do the calculation on the whole array at once (i.e., no looping as seen from Python). Check any NumPy tutorial for an introduction.
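A tiny illustration of that idea, with a hypothetical func written in terms of array operations (your real func will differ):

import numpy as np

def func(a, b):                    # works on scalars and on whole arrays
    return np.sqrt(a * a + b * b)

a = np.random.rand(10**6)
b = np.random.rand(10**6)
values = func(a, b)                # one call, element-wise over a million pairs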
Second issue
If A and B have over a million entries each, there are one trillion combinations. For 64-bit numbers, that means you'll need about 7.3 TiB just to store the result of your calculation. Do you have enough disk space to store the result?
Third issue
If A and B were much smaller, in your particular case you'd be able to do this:
values = f(*meshgrid(A, B))
numpy.meshgrid returns the Cartesian product of A and B, so it's simply a way to generate all the points that have to be evaluated.
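A small, self-contained sketch of that approach, with a hypothetical vectorised f (np.meshgrid is the function meant above):

import numpy as np

def f(a, b):                      # must be written with array operations
    return a * b + np.sin(a)

A = np.linspace(0, 1, 1000)       # much smaller than a million entries
B = np.linspace(0, 1, 2000)

AA, BB = np.meshgrid(A, B)        # two 2-D arrays covering every combination
values = f(AA, BB)                # shape (2000, 1000), no Python-level loop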
Summary
You need to use NumPy effectively to avoid Python loops. (Or if all else fails or they can't easily be vectorized, write those loops in a compiled language, for instance by using Cython)
Working with terabytes of data is hard. Do you really need that much data?
Any solution that calls a function f 1e12 times in a loop is bound to be slow, especially in CPython (the default Python implementation; if you're not sure which one you're running, it's almost certainly CPython, and if you're using NumPy you're running CPython too).
I suppose itertools.product does what you need:
from itertools import product
pro = product(A,B)
C = map(lambda x: func(*x), pro)
Since product returns a lazy iterator, it doesn't require additional memory for the combinations themselves.
One million times one million is one trillion. Calling f one trillion times will take a while.
Unless you have a way of reducing the number of values to compute, you can't do better than the above.
If you use NumPy, you should definitely look at the np.vectorize function, which is designed for this kind of problem...
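A short sketch of how np.vectorize could be applied here (note it is a convenience wrapper, so the underlying function is still called once per pair; the func below is a hypothetical stand-in):

import numpy as np

def func(a, b):                        # any plain Python function of two scalars
    return (a - b) ** 2

vfunc = np.vectorize(func)
A = np.arange(1000)
B = np.arange(2000)
C = vfunc(A[:, None], B[None, :])      # broadcasts over all combinations, shape (1000, 2000)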
I have a question as to how I can perform this task in Python:
I have an array of entries like:
[IPAddress, connections, policystatus, activity flag, longitude, latitude] (all as strings)
ex.
['172.1.21.26','54','1','2','31.15424','12.54464']
['172.1.21.27','12','2','4','31.15424','12.54464']
['172.1.27.34','40','1','1','-40.15474','-54.21454']
['172.1.2.45','32','1','1','-40.15474','-54.21454']
...
up to about 110,000 entries, with about 4,000 different longitude-latitude combinations.
I want to compute the average connections, average policy status and average activity flag for each location,
something like this:
[longitude,latitude,avgConn,avgPoli,avgActi]
['31.15424','12.54464','33','2','3']
['-40.15474','-54.21454','31','1','1']
...
and so on.
I have about 195 files with ~110,000 entries each (sort of a big data problem).
My files are .csv, but I'm reading them as .txt to work with them more easily in Python (not sure if this is the best idea).
I'm still new to Python, so I'm not really sure what the best approach is, but I sincerely appreciate any help or guidance on this problem.
Thanks in advance!
No, if you have the files as .csv, treating them as plain text does not make sense, since Python ships with the excellent csv module.
You could read the CSV rows into a dict to group them, but I'd suggest writing the data into a proper database and using SQL's AVG() and GROUP BY. Python ships with bindings for most databases; if you have none installed, consider using the built-in sqlite3 module.
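A rough sketch of that database route, assuming the CSV columns are in the order shown in the question, the files have no header row, and 'somefile.csv' stands in for each of your real filenames:

import csv
import sqlite3

conn = sqlite3.connect(':memory:')           # or a file such as 'stats.db'
conn.execute("""CREATE TABLE entries
                (ip TEXT, connections REAL, policystatus REAL,
                 activity REAL, longitude TEXT, latitude TEXT)""")

with open('somefile.csv', newline='') as fh:     # repeat for each of the 195 files
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?)", csv.reader(fh))

query = """SELECT longitude, latitude,
                  AVG(connections), AVG(policystatus), AVG(activity)
           FROM entries
           GROUP BY longitude, latitude"""
for row in conn.execute(query):
    print(row)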
I'll only give you the algorithm; you would learn more by writing the actual code yourself (a rough sketch follows after this list).
1. Use a dictionary with the key being the pair (longitude, latitude) and the value being a list of the form [ConnectionSum, policystatusSum, ActivityFlagSum, Count].
2. Loop over the entries once:
   a. For each entry, if the location already exists, add the connection, policy status and activity flag values to the existing sums and increment Count.
   b. If the location does not exist yet, insert [0, 0, 0, 0] as its value, then apply step (a).
3. Do steps 1 and 2 for all files.
4. After all the entries have been scanned, loop over the dictionary and divide ConnectionSum, policystatusSum and ActivityFlagSum by that location's Count to get the average values for each location.
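Here is the sketch mentioned above (file_paths is a hypothetical list of your 195 files, and the column order is taken from the question):

import csv

sums = {}   # (longitude, latitude) -> [conn_sum, policy_sum, activity_sum, count]

for path in file_paths:
    with open(path, newline='') as fh:
        for ip, conn, policy, activity, lon, lat in csv.reader(fh):
            entry = sums.setdefault((lon, lat), [0.0, 0.0, 0.0, 0])
            entry[0] += float(conn)
            entry[1] += float(policy)
            entry[2] += float(activity)
            entry[3] += 1

averages = [[lon, lat, c / n, p / n, a / n]
            for (lon, lat), (c, p, a, n) in sums.items()]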
As long as duplicate locations are restricted to the same file (or are even just close to each other within a file), all you need is the stream-processing paradigm. For example, if you know that duplicate locations only appear within a single file, read each file, calculate the averages, then close the file. As long as you let the old data fall out of scope, the garbage collector will get rid of it for you. Basically, do this:
def processFile(pathToFile):
    ...

totalResults = ...
for path in filePaths:
    partialResults = processFile(path)
    totalResults = combine...partialResults...with...totalResults
An even more elegant solution would be to use the O(1)-memory method of calculating averages "online". If, for example, you are averaging 5, 6, 7, you would do 5/1 = 5.0, (5.0*1 + 6)/2 = 5.5, (5.5*2 + 7)/3 = 6. At each step, you only keep track of the current average and the number of elements. This solution uses the minimal amount of memory (no more than the size of your final result!) and doesn't care about the order in which you visit elements. It would go something like this; see http://docs.python.org/library/csv.html for the functions you'll need from the csv module.
import csv

def allTheRecords():
    for path in filePaths:
        for row in csv.somehow_get_rows(path):
            yield SomeStructure(row)

averages = {}   # dict: keys are tuples (lat, long), values are an arbitrary
                # datastructure, e.g. a dict representing {avgConn, avgPoli, avgActi, num}
for record in allTheRecords():
    position = (record.lat, record.long)
    currentAverage = averages.get(position, {'avgConn': 0, 'avgPoli': 0, 'avgActi': 0, 'num': 0})
    newAverage = ...   # apply the math I mentioned above
    averages[position] = newAverage
(Do note that the notion of an "average at a location" is not well-defined. Well, it is well-defined, but not very useful: if you knew the exact location of every IP event to infinite precision, every group would contain a single event and its average would be itself. The only reason you can compress your dataset is that your latitude and longitude have finite precision. If you run into this issue when you acquire more precise data, you can choose to round to an appropriate precision; rounding to within 10 meters or so may be reasonable. This requires just a little bit of math/geometry.)