I wrote code that works perfectly with small datasets, but when I run it over a dataset with 52,000 features, it seems to get stuck in the function below:
def extract_neighboring_OSM_nodes(ref_nodes, cor_nodes):
    time_start = time.time()
    print "here we start finding neighbors at ", time_start
    for ref_node in ref_nodes:
        buffered_node = ref_node[2].buffer(10)
        for cor_node in cor_nodes:
            if cor_node[2].within(buffered_node):
                ref_node[4].append(cor_node[0])
                cor_node[4].append(ref_node[0])
        # node[4][:] = [cor_nodes.index(x) for x in cor_nodes if x[2].within(buffered_node)]
    time_end = time.time()
    print "neighbor extraction took ", time_end - time_start, " seconds"
    return ref_nodes
ref_nodes and cor_nodes are lists of tuples of the following form:
[(FID, point, geometry, links, neighbors)]
neighbors is an empty list that gets populated in the function above.
As I said, the last message printed is the one from the first print statement in this function. The function seems to be slow, but even for 52,000 features it should not take 24 hours, should it?
Any idea where the problem might be, or how to make the function faster?
You can try multiprocessing, here is an example - http://pythongisandstuff.wordpress.com/2013/07/31/using-arcpy-with-multiprocessing-%E2%80%93-part-3/.
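For illustration, here is a rough Pool-based sketch of that idea (my own sketch, not the code from the linked post). It assumes ref_nodes and cor_nodes as described in the question, and it returns the neighbor FIDs per reference node instead of appending in place, since worker processes cannot mutate the parent's lists:

from functools import partial
from multiprocessing import Pool

def neighbors_for_chunk(cor_nodes, chunk):
    # for each ref_node in the chunk, collect the FIDs of cor_nodes within 10 units
    out = []
    for ref_node in chunk:
        buffered = ref_node[2].buffer(10)
        out.append((ref_node[0], [c[0] for c in cor_nodes if c[2].within(buffered)]))
    return out

if __name__ == "__main__":
    n_workers = 4
    chunks = [ref_nodes[i::n_workers] for i in range(n_workers)]
    pool = Pool(n_workers)
    results = pool.map(partial(neighbors_for_chunk, cor_nodes), chunks)
    pool.close()
    pool.join()

Note that this still does the same O(n*m) amount of work; it only spreads it over several cores.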
If you want to get the K nearest neighbors of every (or some, it doesn't matter) sample of a dataset, or the eps-neighborhood of samples, there is no need to implement it yourself; there are libraries built specifically for this purpose.
Once they have built the data structure (usually some kind of tree), you can query it for the neighborhood of a given sample. These structures are usually not as effective in high dimensions as they are in low dimensions, but there are solutions for high-dimensional data as well.
One I can recommend here is KDTree, which has a SciPy implementation.
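For the function above, a minimal sketch could look like this (assuming the node tuples from the question, with point geometries at index 2 that expose .x and .y, and treating "within the 10-unit buffer" as "within distance 10"):

import numpy as np
from scipy.spatial import cKDTree

cor_coords = np.array([(n[2].x, n[2].y) for n in cor_nodes])
tree = cKDTree(cor_coords)  # built once, queried once per reference node

for ref_node in ref_nodes:
    # indices of all cor_nodes within distance 10 of this ref_node
    idx = tree.query_ball_point((ref_node[2].x, ref_node[2].y), r=10)
    for i in idx:
        ref_node[4].append(cor_nodes[i][0])
        cor_nodes[i][4].append(ref_node[0])

This replaces the quadratic scan with one tree build plus one radius query per reference node.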
I hope you find it useful as I did.
I want to create a randomized array that contains another array a few times. So if:
Big_Array = np.zeros((5, 5))
Inner_array = np.array([[1, 1, 1],
                        [2, 1, 2]])
And if we want 2 Inner_arrays, it could look like:
Big_Array = [[1, 2, 0, 0, 0],
             [1, 1, 0, 0, 0],
             [1, 2, 0, 0, 0],
             [0, 0, 2, 1, 2],
             [0, 0, 1, 1, 1]]
I would like to write code that will
A. tell whether the bigger array can fit the required number of inner arrays, and
B. randomly place the inner array (in random rotations) x times in the big array without overlap
Thanks in advance!
If I understood correctly, you'd like to sample valid tilings of a square which contain a specified amount of integral-sided rectangles.
This is a special case of the exact cover problem, which is NP-complete, so in general I'd expect there to be no really efficient solution, but you could solve it with Knuth's Algorithm X. It would take a while to code yourself, though.
There are also a few implementations of DLX online, such as this one from Code Review SE (not sure what the copyright on that is, though).
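To give a sense of the size of the task, here is a sketch of the widely circulated dictionary-of-sets formulation of Algorithm X (my illustration, not the DLX code from that link). X maps each constraint (e.g., each cell of the big array, with single-cell "blank" placements covering the cells left empty) to the set of candidate placements that cover it, and Y maps each placement to the list of constraints it covers:

def solve(X, Y, solution=None):
    # yields every exact cover: a set of placements covering each constraint exactly once
    if solution is None:
        solution = []
    if not X:
        yield list(solution)
        return
    c = min(X, key=lambda k: len(X[k]))  # constraint with fewest candidates
    for r in list(X[c]):
        solution.append(r)
        cols = select(X, Y, r)
        for s in solve(X, Y, solution):
            yield s
        deselect(X, Y, r, cols)
        solution.pop()

def select(X, Y, r):
    cols = []
    for j in Y[r]:
        for i in X[j]:
            for k in Y[i]:
                if k != j:
                    X[k].remove(i)
        cols.append(X.pop(j))
    return cols

def deselect(X, Y, r, cols):
    for j in reversed(Y[r]):
        X[j] = cols.pop()
        for i in X[j]:
            for k in Y[i]:
                if k != j:
                    X[k].add(i)

You would enumerate every legal position/rotation of the inner array as a placement, then sample from the covers this yields (or stop at the first one found).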
I'm currently trying to solve the 'Dance Recital' Kattis challenge in Python 3. See here
After taking input on how many performances there are in the dance recital, you must arrange performances in such a way that sequential performances by a dancer are minimized.
I've seen this challenge completed in C++, but my code kept running out of time and I wanted to optimize it.
Question: As of right now, I generate all possible permutations of performances and run comparisons off of them. A faster way would be to not generate all permutations, as some of them are simply reversed and would produce exactly the same output.
import itertools
print(list(itertools.permutations(range(2))))  # --> [(0, 1), (1, 0)]  they're the same, backwards and forwards
print(magic_algorithm(range(2)))               # --> [(0, 1)]          this is what I want
How might I generate such a list of permutations?
I've tried:
-Generating all permutations, running over them again to remove reversed() duplicates, and saving them. This takes too long, and the result can't be hard-coded into the solution because the file becomes too big.
-Only generating permutations up to the halfway mark and then stopping, assuming that no unique permutations are generated after that (not true, as I found out).
-I've checked out questions here, but no one seems to have the same question as me; ditto on the web.
Here's my current code:
from itertools import permutations

number_of_routines = int(input())  # first line is number of routines
dance_routine_list = [0] * 10
permutation_list = list(permutations(range(number_of_routines)))  # generate permutations

for q in range(number_of_routines):
    s = input()
    for c in s:
        v = ord(c) - 65
        dance_routine_list[q] |= (1 << v)  # each routine ex. 'ABC' is A-Z where each char represents a performer in the routine

def calculate():
    least_changes_possible = 1e9  # this will become smaller, as optimizations are found
    for j in permutation_list:
        tmp = 0
        for i in range(1, number_of_routines):
            tmp += bin(dance_routine_list[j[i]] & dance_routine_list[j[i - 1]]).count('1')  # each 1 represents a performer who must complete sequential routines
        least_changes_possible = min(least_changes_possible, tmp)
    return least_changes_possible

print(calculate())
Edit: Took a shower and decided that adding a 2-element-comparison look-up table would speed it up, as many of the operations are repeated. It still doesn't avoid iterating over all the permutations, but it should help.
Edit: Found another thread that answered this pretty well. How to generate permutations of a list without "reverse duplicates" in Python using generators
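As a minimal illustration of that idea (my own sketch, not the generator from the linked thread): for distinct items, a permutation and its reverse can be told apart by comparing their first and last elements, so you can keep exactly one of each pair with a filter. Note that this still generates every permutation internally; it only halves the work done afterwards:

from itertools import permutations

def half_permutations(seq):
    # keep exactly one of each (p, reversed(p)) pair
    for p in permutations(seq):
        if p[0] < p[-1]:
            yield p

print(list(half_permutations(range(3))))
# [(0, 1, 2), (0, 2, 1), (1, 0, 2)]  -- 3 of the 6 permutations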
Thank you all!
There are at most 10 possible dance routines, so at most ~3.6M permutations, and even a bad algorithm like "generate them all and test" will finish very quickly.
If you wanted a fast solution for up to 24 or so routines, then I would do it like this...
Given the R dance routines, at any point in the recital, in order to decide which routine you can perform next, you need to know:
Which routines you've already performed, because you can't do those ones next. There are 2^R possible sets of already-performed routines; and
Which routine was performed last, because that helps determine the cost of the next one. There are at most R possible values for that.
So there are fewer than (R+1)*2^R possible recital states...
Imagine a directed graph that connects each possible state to all the possible following states, by an edge for the routine that you would perform to get to that state. Label each edge with the cost of performing that routine.
For example, if you've performed routines 5 and 6, with 5 last, then you would be in state (5,6):5, and there would be an edge to (3,5,6):3 that you could get to after performing routine 3.
Starting at the initial "nothing performed yet" state ():-, use Dijkstra's algorithm to find the least cost path to a state with all routines performed.
Total complexity is O(R^2 * 2^R) or so, depending on exactly how you implement it.
For R=10, R^2 * 2^R is ~100,000, which won't take very long at all. For R=24 it's about 9 billion, which is going to take under half a minute in pretty good C++.
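For concreteness, here is a rough sketch of that state-space idea in Python, written as a plain DP over (performed-set, last-routine) states rather than an explicit Dijkstra run (which is fine here, since the states ordered by set size form a DAG with non-negative edge costs). It assumes the dance_routine_list bitmasks built in the question's code:

def min_quick_changes(routines):
    R = len(routines)
    INF = float("inf")
    # best[mask][last] = least cost to have performed exactly the routines in
    # `mask`, ending with routine `last`
    best = [[INF] * R for _ in range(1 << R)]
    for r in range(R):
        best[1 << r][r] = 0
    for mask in range(1 << R):
        for last in range(R):
            cost = best[mask][last]
            if cost == INF:
                continue
            for nxt in range(R):
                if mask & (1 << nxt):
                    continue  # already performed
                overlap = bin(routines[last] & routines[nxt]).count('1')
                if cost + overlap < best[mask | (1 << nxt)][nxt]:
                    best[mask | (1 << nxt)][nxt] = cost + overlap
    full = (1 << R) - 1
    return min(best[full])

print(min_quick_changes(dance_routine_list[:number_of_routines]))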
I have a JSON file with stations listed with subfields. x contains geographical coordinates, and I'd like to find the station closest to my cellMiddle coordinates. Currently I'm using this:
closestStationCoord = min(stations,
                          key=lambda x: abs(x[0]-cellMiddle[0]) + abs(x[1]-cellMiddle[1]))
So the coordinates are those with the minimum difference between x and cellMiddle. However, this takes a lot of time (in my experience, lambdas usually do take a long time to run). Is there any way I can find this minimum faster?
If there are a lot of items, you should consider algorithmic optimizations to avoid checking all the stations that are irrelevant.
I believe this answer already has a good summary on your possible options: https://gamedev.stackexchange.com/questions/27264/how-do-i-optimize-searching-for-the-nearest-point
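One concrete option along those lines is a spatial index such as a k-d tree. A sketch, assuming SciPy is available and that each station's first two fields are its coordinates, as in the snippet above:

import numpy as np
from scipy.spatial import cKDTree

coords = np.array([(s[0], s[1]) for s in stations])
tree = cKDTree(coords)  # built once

# p=1 uses the Manhattan metric, matching the |dx| + |dy| key above
dist, idx = tree.query(cellMiddle, p=1)
closestStationCoord = stations[idx]

Building the tree costs O(n log n) once, after which each query is roughly O(log n) instead of a full scan of all stations.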
I am looking for a way to speed up my code. I managed to speed up most parts of it, reducing the runtime to about 10 hours, but it's still not fast enough, and since I'm running out of time I'm looking for a quick way to optimize my code.
An example:
text = pd.read_csv(os.path.join(dir,"text.csv"),chunksize = 5000)
new_text = [np.array(chunk)[:,2] for chunk in text]
new_text = list(itertools.chain.from_iterable(new_text))
In the code above I read in about 6 million rows of text documents in chunks and flatten them. This code takes about 3-4 hours to execute and is the main bottleneck of my program. Edit: I realized that I wasn't very clear on what the main issue was: the flattening is the part that takes the most time.
Also this part of my program takes a long time:
train_dict = dict(izip(text,labels))
result = [train_dict[test[sample]] if test[sample] in train_dict else predictions[sample] for sample in xrange(len(predictions))]
The code above first zips the text documents with their corresponding labels (this is a machine learning task, with train_dict being the training set). Earlier in the program I generated predictions on a test set. There are duplicates between my train and test sets, so I need to find those duplicates. Therefore I need to iterate over my test set row by row (2 million rows in total); when I find a duplicate I don't want to use the predicted label but the label from the duplicate in train_dict. I assign the result of this iteration to the variable result in the code above.
I heard there are various libraries in Python that could speed up parts of your code, but I don't know which of them could do the job, and right now I don't have the time to investigate this, which is why I need someone to point me in the right direction. Is there a way to speed up the code snippets above?
edit2
I have investigated again, and it is definitely a memory issue. I tried to read the file row by row, and after a while the speed declined dramatically; furthermore, my RAM usage is nearly 100% and Python's disk usage increased sharply. How can I decrease the memory footprint? Or should I find a way to make sure that I don't hold everything in memory?
edit3
As memory is the main issue, I'll give an outline of part of my program. I have dropped the predictions for the time being, which reduced the complexity of my program significantly; instead I insert a standard sample for every non-duplicate in my test set.
import numpy as np
import pandas as pd
import itertools
import os
train = pd.read_csv(os.path.join(dir,"Train.csv"),chunksize = 5000)
train_2 = pd.read_csv(os.path.join(dir,"Train.csv"),chunksize = 5000)
test = pd.read_csv(os.path.join(dir,"Test.csv"), chunksize = 80000)
sample = list(np.array(pd.read_csv(os.path.join(dir,"Samples.csv"))[:,2]))#this file is only 70mb
sample = sample[1]
test_set = [np.array(chunk)[:,2] for chunk in test]
test_set = list(itertools.chain.from_iterable(test_set))
train_set = [np.array(chunk)[:,2] for chunk in train]
train_set = list(itertools.chain.from_iterable(train_set))
labels = [np.array(chunk)[:,3] for chunk in train_2]
labels = list(itertools.chain.from_iterable(labels))
"""zipping train and labels"""
train_dict = dict(izip(train,labels))
"""finding duplicates"""
results = [train_dict[test[item]] if test[item] in train_dict else sample for item in xrange(len(test))]
Although this isn't my entire program, this is the part of my code that needs optimization. As you can see, I am only using three important modules in this part: pandas, numpy and itertools. The memory issues arise when flattening train_set and test_set. All I am doing is reading in the files, getting the necessary parts, zipping the train documents with the corresponding labels in a dictionary, and then searching for duplicates.
edit 4
As requested, I'll give an explanation of my data sets. My Train.csv contains 4 columns. The first column contains IDs for every sample, the second column contains titles, and the third column contains text-body samples (varying from 100-700 words). The fourth column contains category labels. Test.csv contains only the IDs, titles and text bodies. The columns are separated by commas.
Could you please post a dummy sample data set of a half dozen rows or so?
I can't quite see what your code is doing and I'm not a Pandas expert, but I think we can greatly speed up this code. It reads all the data into memory and then keeps re-copying the data to various places.
By writing "lazy" code we should be able to avoid all the re-copying. The ideal would be to read one line, transform it as we want, and store it into its final destination. Also this code uses indexing when it should be just iterating over values; we can pick up some speed there too.
Is the code you posted your actual code, or something you made just to post here? It appears to contain some mistakes so I am not sure what it actually does. In particular, train and labels would appear to contain identical data.
I'll check back and see if you have posted sample data. If so I can probably write "lazy" code for you that will have less re-copying of data and will be faster.
EDIT: Based on your new information, here's my dummy data:
id,title,body,category_labels
0,greeting,hello,noun
1,affirm,yes,verb
2,deny,no,verb
Here is the code that reads the above:
def get_train_data(training_file):
    with open(training_file, "rt") as f:
        next(f)  # throw away "headers" in first line
        for line in f:
            lst = line.rstrip('\n').split(',')
            # lst contains: id,title,body,category_labels
            yield (lst[2], lst[3])  # map body -> category label
train_dict = dict(get_train_data("data.csv"))
And here is a faster way to build results:
results = [train_dict.get(x, sample) for x in test]
Instead of repeatedly indexing test to find the next item, we just iterate over the values in test. The dict.get() method handles the if x in train_dict test we need.
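The test side can be read just as lazily. A sketch, assuming Test.csv has the id,title,body layout described above and, as in the training reader, no embedded commas inside the fields:

def get_test_bodies(test_file):
    with open(test_file, "rt") as f:
        next(f)  # skip the header line
        for line in f:
            lst = line.rstrip('\n').split(',')
            # lst contains: id,title,body
            yield lst[2]

results = [train_dict.get(body, sample) for body in get_test_bodies("Test.csv")]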
You can try Cython. It supports numpy and can give you a nice speedup.
Here is an introduction and explanation of what needs to be done
http://www.youtube.com/watch?v=Iw9-GckD-gQ
If the order of your rows is not important, you can use sets to find the elements that are in both the train set and the test set (intersection: trainset & testset) and add their training labels to your results first, then use the set difference (testset - trainset) to add the elements that are in your test set but not in the train set. This saves checking, for every sample, whether it is in the train set.
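A rough sketch of that set-based idea, reusing test_set, train_dict and sample from the question's code (note that it drops row order and collapses duplicate test rows, which is the trade-off this approach accepts):

train_keys = set(train_dict)
test_keys = set(test_set)

in_both = test_keys & train_keys      # duplicates: reuse the training label
only_test = test_keys - train_keys    # non-duplicates: fall back to sample

results = [train_dict[t] for t in in_both] + [sample for _ in only_test]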
Again I have a question concerning large loops.
Suppose I have a function limits:
def limits(a, b):
    # evaluate the integral with upper and lower limits a and b
    return result  # a float
A and B are simple np.arrays that store my values a and b. Now I want to evaluate the integral 300,000^2/2 times, because A and B each have length 300,000 and the integral is symmetric.
In Python I tried several approaches, such as itertools.combinations_with_replacement, to create the combinations of A and B and then feed them into the integral, but that takes a huge amount of time and completely overloads the memory.
Is there any way, for example by moving the loop into another language, to speed this up?
I would like to run the loop
for i in range(len(A)):
    for j in range(len(B)):
        np.histogram(limits(A[i], B[j]))
I think histogramming the return value of limits is desirable so that I don't have to store additional arrays that grow quadratically.
From what I read, Python is not really the best choice for this kind of iterative approach.
So would it be reasonable to evaluate this loop in another language from within Python, and if so, how would I do it? I know there are ways to call other languages from Python, but I have never done it so far.
Thanks for your help.
If you're worried about memory footprint, all you need to do is bin the results as you go in the for loop.
num_bins = 100
bin_upper_limits = np.linspace(-456, 456, num=num_bins-1)
# (last bin has no upper limit, it goes from 456 to infinity)
bin_count = np.zeros(num_bins)

for a in A:
    for b in B:
        if b < a:
            # you said the integral is symmetric, so we can skip these, right?
            continue
        new_result = limits(a, b)
        which_bin = np.digitize([new_result], bin_upper_limits)
        bin_count[which_bin] += 1
So nothing large is saved in memory.
As for speed, I imagine that the overwhelming majority of time is spent evaluating limits(a,b). The looping and binning is plenty fast in this case, even in python. To convince yourself of this, try replacing the line new_result = limits(a,b) with new_result = 234. You'll find that the loop runs very fast. (A few minutes on my computer, much much less than the 4 hour figure you quote.) Python does not loop very fast compared to C, but it doesn't matter in this case.
Whatever you do to speed up the limits() call (including implementing it in another language) will speed up the program.
If you change the algorithm, there is vast room for improvement. Let's take an example of what it seems you're doing. Let's say A and B are 0,1,2,3. You're integrating a function over the ranges 0-->0, 0-->1, 1-->1, 1-->2, 0-->2, etc. etc. You're re-doing the same work over and over. If you have integrated 0-->1 and 1-->2, then you can add up those two results to get the integral 0-->2. You don't have to use a fancy integration algorithm, you just have to add two numbers you already know.
Therefore it seems to me that you can compute integrals in all the smallest ranges (0-->1, 1-->2, 2-->3), store the results in an array, and add subsets of the results to get the integral over whatever range you want. If you want this program to run in a few minutes instead of 4 hours, I suggest thinking through an alternative algorithm along those lines.
(Sorry if I'm misunderstanding the problem you're trying to solve.)
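For concreteness, a rough sketch of that add-up-small-ranges idea (my illustration, assuming a fixed integrand so that adjacent integrals add, and that the requested limits a <= b are themselves values taken from A and B):

import numpy as np

points = np.unique(np.concatenate([A, B]))            # sorted breakpoints
piece = np.array([limits(points[i], points[i + 1])    # integral over each
                  for i in range(len(points) - 1)])   # consecutive small range
cum = np.concatenate([[0.0], np.cumsum(piece)])       # cum[k] = integral points[0] -> points[k]

def integral(a, b):
    ia = np.searchsorted(points, a)
    ib = np.searchsorted(points, b)
    return cum[ib] - cum[ia]    # integral a -> b by subtraction

This needs only len(points) - 1 calls to limits() instead of ~300,000^2/2.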