Here is my code to fill the 'cropped_inputs_final' array from the 'cropped_inputs_index' and 'inputs_data_list' arrays.
for i in range(cropped_inputs_final.shape[0]):
    a1 = int(cropped_inputs_index[i])
    cropped_inputs_final[i] = inputs_data_list[a1][0]
However, this for loop takes too much time.
Is there any way to replace the for loop with numpy (or anything else) to reduce the running time?
Thank you very much.
What you're trying to do is called vectorization, and it is in general a very good thing, though it can be pretty counterintuitive sometimes.
I would need more context to give you an exact answer (what are the arrays shapes?), but array indexing seems to be what you need.
cropped_inputs_final = np.asarray(inputs_data_list)[np.asarray(cropped_inputs_index).astype(int), 0]
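Since the array shapes are not given in the question, here is a minimal sketch on made-up data. It assumes every item of inputs_data_list has the same shape, so the list can be stacked into one array, and that the indices are stored as floats (which the int(...) call in the loop suggests). It compares the loop with integer-array indexing.

```python
import numpy as np

# Hypothetical data: 5 items, each holding 2 rows of 3 values.
inputs_data_list = [np.full((2, 3), k, dtype=float) for k in range(5)]
cropped_inputs_index = np.array([4.0, 0.0, 2.0, 2.0])  # indices stored as floats

# Loop version from the question.
looped = np.empty((len(cropped_inputs_index), 3))
for i in range(looped.shape[0]):
    a1 = int(cropped_inputs_index[i])
    looped[i] = inputs_data_list[a1][0]

# Vectorized version: stack once, then index with an integer array.
data = np.asarray(inputs_data_list)        # shape (5, 2, 3)
idx = cropped_inputs_index.astype(int)     # shape (4,)
vectorized = data[idx, 0]                  # shape (4, 3)
```

If the items have different shapes, the stacking step does not apply and the indexing would need to be adapted.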
I'm trying to do an operation that is pretty similar to a numpy.add.at, but with two pairs of indices, and I'm wondering if there's a faster way to do this with numpy or something else rather than a for loop, which is running pretty slowly.
The following works, but I'm trying to do it faster:
for x, y in indices:
    A[B[x, y]] += C[x, y]
where the values obtained for B[x,y] will have a lot of duplicates, so B[1,1] may be equal to B[1,2]
numpy.add.at(A, indices, C) is pretty close, but doesn't get me there, as B basically maps the indices into another space. I'm hoping there's a faster way to do this with numpy or something else, probably without an explicit loop.
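One possible approach, sketched on made-up data: look up the target slots B[x, y] for all pairs at once and pass those to numpy.add.at, or use numpy.bincount with weights. The sketch assumes indices can be turned into an (n, 2) integer array and that B holds non-negative integers (bincount requires that); the array sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.integers(0, 4, size=(3, 5))   # maps (x, y) to a slot in A, with duplicates
C = rng.random((3, 5))
indices = np.array([(0, 1), (1, 1), (1, 2), (2, 4), (0, 1)])  # (x, y) pairs

# Reference: the loop from the question.
A_loop = np.zeros(4)
for x, y in indices:
    A_loop[B[x, y]] += C[x, y]

xs, ys = indices[:, 0], indices[:, 1]

# Option 1: unbuffered in-place add, handles repeated targets correctly.
A_at = np.zeros(4)
np.add.at(A_at, B[xs, ys], C[xs, ys])

# Option 2: weighted bincount builds the per-slot sums in one pass.
A_bincount = np.bincount(B[xs, ys], weights=C[xs, ys], minlength=4)
```

With a non-zero starting A, the bincount result would be added to A rather than replacing it.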
I have a vector of floats (coming from an operation on an array) and a float value (which is actually an element of the array, but that's unimportant), and I need to find the smallest float out of them all.
I'd love to be able to find the minimum between them in one line in a 'Pythony' way.
MinVec = N[i,:] + N[:,j]
Answer = min(min(MinVec),N[i,j])
Clearly I'm performing two minimisation calls, and I'd love to be able to replace this with one call. Perhaps I could eliminate the vector MinVec as well.
As an aside, this is for a short program in Dynamic Programming.
TIA.
EDIT: My apologies, I didn't specify I was using numpy. The variable N is an array.
You can append the value, then minimize. I'm not sure what the relative time considerations of the two approaches are, though - I wouldn't necessarily assume this is faster:
Answer = min(np.append(MinVec, N[i, j]))
This is the same idea as the answer above but without using numpy. Note that list.append returns None, so build a new list rather than appending inside the call:
Answer = min(list(MinVec) + [N[i, j]])
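A small check on a made-up 3x3 array N (the values and the choice of i, j are hypothetical) showing that the two-call version, the append-then-minimize version, and a variant that avoids the copy all agree:

```python
import numpy as np

N = np.array([[4.0, 2.0, 7.0],
              [1.0, 9.0, 3.0],
              [6.0, 5.0, 8.0]])
i, j = 1, 2

MinVec = N[i, :] + N[:, j]                       # [8., 12., 11.]

two_calls = min(min(MinVec), N[i, j])            # version from the question
one_call = np.min(np.append(MinVec, N[i, j]))    # append, then one minimum
no_copy = min(MinVec.min(), N[i, j])             # array minimum, then scalar compare
```

np.append copies the vector, so the last variant may be the cheaper one for long vectors; timing it on real data is the only way to be sure.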
All,
I am computing some feature values with the following Python code. Because the inputs are very large, it is very time-consuming. Please help me optimize the code.
leaving_volume=len([x for x in pickup_ids if x not in dropoff_ids])
arriving_volume=len([x for x in dropoff_ids if x not in pickup_ids])
transition_volume=len([x for x in dropoff_ids if x in pickup_ids])
union_ids=list(set(pickup_ids + dropoff_ids))
busstop_ids=[x for x in union_ids if self.geoitems[x].fare>0]
busstop_density=np.sum([Util.Geodist(self.geoitems[x].orilat, self.geoitems[x].orilng, self.geoitems[x].destlat, self.geoitems[x].destlng)/(1000*self.geoitems[x].fare) for x in busstop_ids])/len(busstop_ids) if len(busstop_ids) > 0 else 0
busstop_ids=[x for x in union_ids if self.geoitems[x].balance>0]
smartcard_balance=np.sum([self.geoitems[x].balance for x in busstop_ids])/len(busstop_ids) if len(busstop_ids) > 0 else 0
Hi, All,
Here is my revised version. I ran this code on my GPS trace data, and it is faster.
intersect_ids=set(pickup_ids).intersection( set(dropoff_ids) )
union_ids=list(set(pickup_ids + dropoff_ids))
leaving_ids=set(pickup_ids)-intersect_ids
leaving_volume=len(leaving_ids)
arriving_ids=set(dropoff_ids)-intersect_ids
arriving_volume=len(arriving_ids)
transition_volume=len(intersect_ids)
busstop_density=np.mean([Util.Geodist(self.geoitems[x].orilat, self.geoitems[x].orilng, self.geoitems[x].destlat, self.geoitems[x].destlng)/(1000*self.geoitems[x].fare) for x in union_ids if self.geoitems[x].fare>0])
if not busstop_density > 0:
    busstop_density = 0
smartcard_balance=np.mean([self.geoitems[x].balance for x in union_ids if self.geoitems[x].balance>0])
if not smartcard_balance > 0:
    smartcard_balance = 0
Many thanks for the help.
Just a few things I noticed, as some Python efficiency trivia:
if x not in dropoff_ids
Checking for membership using the in operator is more efficient on a set than a list. But iterating with for through a list is probably more efficient than on a set. So if you want your first two lines to be as efficient as possible you should have both types of data structure around beforehand.
list(set(pickup_ids + dropoff_ids))
It's more efficient to create your sets before you combine data, rather than creating a long list and constructing a set from it. Luckily you probably already have the set versions around now (see the first comment)!
Above all you need to ask yourself the question:
Is the time I save by constructing extra data structures worth the time it takes to construct them?
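As a rough sketch of the two points above (the ID values are made up): keep the lists for iteration, build the sets once for membership tests, and combine the sets instead of concatenating the lists.

```python
pickup_ids = [1, 2, 3, 3, 5]
dropoff_ids = [3, 4, 5, 5, 6]

# Build each set once; membership tests on a set are O(1) on average.
pickup_set = set(pickup_ids)
dropoff_set = set(dropoff_ids)

# Iterate over the lists (duplicates still count, as in the original code),
# but test membership against the sets.
leaving_volume = sum(1 for x in pickup_ids if x not in dropoff_set)
arriving_volume = sum(1 for x in dropoff_ids if x not in pickup_set)
transition_volume = sum(1 for x in dropoff_ids if x in pickup_set)

# Union of the existing sets, no intermediate concatenated list.
union_ids = list(pickup_set | dropoff_set)
```

This keeps the original counting semantics; counting the lengths of set differences instead would ignore duplicate IDs.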
Next one:
np.sum([...])
I've been trained by Python to think of constructing a list and then applying a function that theoretically only requires a generator as a code smell. I'm not sure if this applies in numpy, since from what I remember it's not completely straightforward to pull data from a generator and put it in a numpy structure.
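For what it's worth, numpy does offer np.fromiter for filling a one-dimensional array straight from a generator. A minimal sketch, where the balances dict is a hypothetical stand-in for self.geoitems[x].balance:

```python
import numpy as np

# Hypothetical id -> balance mapping standing in for self.geoitems[x].balance.
balances = {10: 5.0, 11: 0.0, 12: 7.5, 13: 2.5}
union_ids = [10, 11, 12, 13]

# np.fromiter consumes the generator directly, with no intermediate list.
positive = np.fromiter(
    (balances[x] for x in union_ids if balances[x] > 0), dtype=float
)
smartcard_balance = positive.mean() if positive.size else 0.0

# The plain-Python equivalent for a sum needs no numpy at all.
python_total = sum(balances[x] for x in union_ids if balances[x] > 0)
```

Whether this beats the list version depends on the data size; the per-item attribute lookups are likely to dominate either way.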
It looks like this is just a small fragment of your code. If you're really concerned about efficiency I'd recommend making use of numpy arrays rather than lists, and trying to stick to numpy's built-in data structures and functions as much as possible. They are likely more highly optimized for raw data crunching in C than the built-in Python functions.
If you're really, really concerned about efficiency then you should probably be doing this data analysis straight-up in C. Especially if you don't have much more code than what you've presented here it might be pretty easy to translate over.
I can only support what machine yerning wrote in his post. If you are thinking of switching to numpy, first make sure your variables pickup_ids and dropoff_ids are numpy arrays (maybe they already are); if not, convert them:
dropoff_ids = np.array( dropoff_ids, dtype='i' )
pickup_ids = np.array( pickup_ids, dtype='i' )
Then you can make use of the function np.in1d(), which gives you a True/False array that you can simply sum to get the total number of True entries.
leaving_volume = (~np.in1d( pickup_ids, dropoff_ids )).sum()
transition_volume= np.in1d( dropoff_ids, pickup_ids).sum()
arriving_volume = (~np.in1d( dropoff_ids, pickup_ids)).sum()
I have the feeling that transition_volume = len(dropoff_ids) - arriving_volume, since every entry of dropoff_ids is either in pickup_ids or not.
Another function that could be useful to you is np.unique() if you want to get rid of duplicate entries which in a way will turn your array into a set.
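A small sketch of these counts on made-up IDs. It uses np.isin, the newer spelling of np.in1d (this assumes NumPy 1.13 or later), and the ~ operator to invert the boolean mask, since current NumPy rejects unary minus on boolean arrays.

```python
import numpy as np

pickup_ids = np.array([1, 2, 3, 3, 5])
dropoff_ids = np.array([3, 4, 5, 5, 6])

# Boolean masks; ~ inverts them, .sum() counts the True entries.
leaving_volume = (~np.isin(pickup_ids, dropoff_ids)).sum()
transition_volume = np.isin(dropoff_ids, pickup_ids).sum()
arriving_volume = (~np.isin(dropoff_ids, pickup_ids)).sum()

# Sorted, duplicate-free union of both ID arrays.
union_ids = np.union1d(pickup_ids, dropoff_ids)
```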
print(sum(1 for x in alist if x[1] == 8))
This code runs fine, but it is very slow. My list is very large and the computation takes a lot of time. Do you know a better and faster way to do it?
You'd have to create indexes or cached counts to speed up such code; trade memory for speed.
Wherever you handle your list (add to it, remove from it, edit entries) you also maintain your indices. For example, if you had a counts dict with ids as keys and their frequency as values, all you had to do is look up the count directly, and ensure that the counts stayed up-to-date as you manipulate alist.
The best way to manage this is by encapsulating your list in a custom type, so that you can control all manipulations of the data structure and maintain the extra information.
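A minimal sketch of such a wrapper. The class name CountedList and its methods are hypothetical, and it only covers append and remove; a real version would need to cover every way the list is modified.

```python
from collections import Counter


class CountedList:
    """Hypothetical wrapper that keeps counts of row[1] up to date."""

    def __init__(self, rows=()):
        self._rows = []
        self._counts = Counter()
        for row in rows:
            self.append(row)

    def append(self, row):
        self._rows.append(row)
        self._counts[row[1]] += 1

    def remove(self, row):
        # list.remove raises ValueError if the row is absent,
        # so the counts are only touched after a successful removal.
        self._rows.remove(row)
        self._counts[row[1]] -= 1

    def count_second(self, value):
        # O(1) lookup instead of scanning the whole list.
        return self._counts[value]


alist = CountedList([(1, 8), (2, 3), (3, 8)])
alist.append((4, 8))
alist.remove((1, 8))
```

The lookup alist.count_second(8) then replaces the full scan, at the cost of a little extra work on every modification.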
Not sure how much faster it would be but
len([x for x in alist if x[1] == 8])
is a little clearer.
I would use numpy. My numpy skills are a little bit rusty, but np.count_nonzero(np_array == 8) would give you what you need for a one-dimensional array (len() of the comparison only returns the array length, not the number of matches). For you it would be np.count_nonzero(np_array[:, 1] == 8), assuming your data fits in a numpy array.
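A quick check on made-up data, assuming alist holds equal-length rows so that it converts to a two-dimensional array. It also shows why taking len() of a boolean mask does not count matches: the mask has one entry per row regardless of the comparison result.

```python
import numpy as np

alist = [(1, 8), (2, 3), (3, 8), (4, 8)]
np_array = np.array(alist)               # shape (4, 2)

# Generator version from the question.
count_generator = sum(1 for x in alist if x[1] == 8)

# Compare the second column to 8, then count the True entries.
count_numpy = np.count_nonzero(np_array[:, 1] == 8)

# len() of the mask is just the number of rows, not the number of matches.
mask_length = len(np_array[:, 1] == 8)
```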
I am new to python and my problem is the following:
I have defined a function func(a,b) that return a value, given two input values.
Now I have my data stored in lists or numpy arrays A, B and would like to use func for every combination. (A and B have over one million entries.)
At the moment I use this snippet:
for p in A:
    for k in B:
        value = func(p,k)
This takes really really a lot of time.
So i was thinking that maybe something like this:
C=(map(func,zip(A,B)))
But this method only works pairwise... Any ideas?
Thanks for help
First issue
You need to calculate the output of f for many pairs of values. The "standard" way to speed up this kind of loop is to make your function f accept (NumPy) arrays as input, and do the calculation on the whole array at once (i.e., no looping as seen from Python). Check any NumPy tutorial to get an introduction.
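As a sketch of what "accepting arrays" can look like: the function body below is a made-up stand-in, since the real func is not shown. Written with array operations only, it can be evaluated on all pairs at once by broadcasting a column of A against a row of B.

```python
import numpy as np


def func_scalar(a, b):
    # Hypothetical stand-in for the real func, one pair at a time.
    return a * a + 3.0 * b


def func_array(a, b):
    # Same formula, but every operation works elementwise on arrays.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a * a + 3.0 * b


A = np.array([1.0, 2.0, 3.0])
B = np.array([10.0, 20.0])

# Nested Python loops, as in the question.
looped = np.array([[func_scalar(p, k) for k in B] for p in A])

# Broadcasting: (3, 1) against (1, 2) gives all 3 x 2 combinations.
vectorized = func_array(A[:, None], B[None, :])
```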
Second issue
If A and B have over a million entries each, there are one trillion combinations. For 64-bit numbers, that means you'll need 7.3 TiB of space just to store the result of your calculation. Do you have enough disk space just to store the result?
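The arithmetic behind that figure, assuming exactly one million entries per array and 8 bytes per result:

```python
n_a = n_b = 10**6            # entries in A and in B
bytes_per_value = 8          # one 64-bit number

total_bytes = n_a * n_b * bytes_per_value   # bytes for all combinations
tib = total_bytes / 2**40                   # convert bytes to TiB
```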
Third issue
If A and B were much smaller, in your particular case you'd be able to do this:
values = f(*meshgrid(A, B))
meshgrid returns the cartesian product of A and B, so it's simply a way to generate the points that have to be evaluated.
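A small sketch with made-up inputs; f here is a hypothetical array-friendly function. Note that with the default indexing, np.meshgrid(A, B) returns grids of shape (len(B), len(A)), so values[k, p] corresponds to f(A[p], B[k]).

```python
import numpy as np


def f(a, b):
    # Hypothetical function that already works elementwise on arrays.
    return a - 2 * b


A = np.array([1, 2, 3])
B = np.array([10, 20])

# Both grids have shape (len(B), len(A)) with the default indexing='xy'.
grid_a, grid_b = np.meshgrid(A, B)
values = f(grid_a, grid_b)
```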
Summary
You need to use NumPy effectively to avoid Python loops. (Or if all else fails or they can't easily be vectorized, write those loops in a compiled language, for instance by using Cython)
Working with terabytes of data is hard. Do you really need that much data?
Any solution that calls a function f 1e12 times in a loop is bound to be slow, especially in CPython (the default Python implementation; if you're using NumPy and aren't sure which implementation you have, it's CPython).
I suppose itertools.product does what you need:
from itertools import product
pro = product(A,B)
C = map(lambda x: func(*x), pro)
Since product returns an iterator, the pairs are generated on demand rather than stored in memory (in Python 3, map is lazy as well).
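A variant of the same idea using itertools.starmap, which unpacks each pair into func without a lambda. The func below is a made-up stand-in. Results are consumed one at a time, so nothing the size of the full product is ever held in memory.

```python
from itertools import product, starmap


def func(a, b):
    # Hypothetical stand-in for the real func.
    return a + 10 * b


A = [1, 2, 3]
B = [4, 5]

# Lazy: nothing is computed until the iterator is consumed.
lazy_results = starmap(func, product(A, B))

total = 0
count = 0
for value in lazy_results:
    total += value
    count += 1
```

This saves memory, not time: func is still called once per pair.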
One million times one million is one trillion. Calling f one trillion times will take a while.
Unless you have a way of reducing the number of values to compute, you can't do better than the above.
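If you use NumPy, you should definitely look at the np.vectorize function, which lets a scalar function accept arrays and broadcast over them. Note that the NumPy documentation describes it as a convenience rather than a performance tool: internally it is essentially a for loop.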
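A minimal sketch of np.vectorize on a made-up scalar function with a branch (the kind of function that does not work on arrays as written). Combined with broadcasting it evaluates every combination, but it still calls func once per pair.

```python
import numpy as np


def func(a, b):
    # Hypothetical scalar-only function: the branch fails on whole arrays.
    return a if a > b else b - a


vfunc = np.vectorize(func)

A = np.array([1, 5, 9])
B = np.array([4, 6])

# Broadcasting (3, 1) against (1, 2) yields all 3 x 2 combinations.
values = vfunc(A[:, None], B[None, :])
```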