All,
I am computing some feature values with the following Python code, but because the inputs are very large it is extremely time-consuming. Please help me optimize it.
leaving_volume=len([x for x in pickup_ids if x not in dropoff_ids])
arriving_volume=len([x for x in dropoff_ids if x not in pickup_ids])
transition_volume=len([x for x in dropoff_ids if x in pickup_ids])
union_ids=list(set(pickup_ids + dropoff_ids))
busstop_ids=[x for x in union_ids if self.geoitems[x].fare>0]
busstop_density=np.sum([Util.Geodist(self.geoitems[x].orilat, self.geoitems[x].orilng, self.geoitems[x].destlat, self.geoitems[x].destlng)/(1000*self.geoitems[x].fare) for x in busstop_ids])/len(busstop_ids) if len(busstop_ids) > 0 else 0
busstop_ids=[x for x in union_ids if self.geoitems[x].balance>0]
smartcard_balance=np.sum([self.geoitems[x].balance for x in busstop_ids])/len(busstop_ids) if len(busstop_ids) > 0 else 0
Hi, All,
Here is my revised version. I ran this code on my GPS trace data and it is faster.
intersect_ids=set(pickup_ids).intersection( set(dropoff_ids) )
union_ids=list(set(pickup_ids + dropoff_ids))
leaving_ids=set(pickup_ids)-intersect_ids
leaving_volume=len(leaving_ids)
arriving_ids=set(dropoff_ids)-intersect_ids
arriving_volume=len(arriving_ids)
transition_volume=len(intersect_ids)
busstop_density=np.mean([Util.Geodist(self.geoitems[x].orilat, self.geoitems[x].orilng, self.geoitems[x].destlat, self.geoitems[x].destlng)/(1000*self.geoitems[x].fare) for x in union_ids if self.geoitems[x].fare>0])
if not busstop_density > 0:
busstop_density = 0
smartcard_balance=np.mean([self.geoitems[x].balance for x in union_ids if self.geoitems[x].balance>0])
if not smartcard_balance > 0:
smartcard_balance = 0
Many thanks for the help.
Just a few things I noticed, as some Python efficiency trivia:
if x not in dropoff_ids
Checking for membership using the in operator is much more efficient on a set (roughly constant time) than on a list (linear time). But iterating with for through a list is probably more efficient than over a set. So if you want your first two lines to be as efficient as possible you should have both types of data structure around beforehand.
list(set(pickup_ids + dropoff_ids))
It's more efficient to create your sets before you combine data, rather than creating a long list and constructing a set from it. Luckily you probably already have the set versions around now (see the first comment)!
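For example, a sketch of that idea with the names from the question (note this counts unique ids, like your revised version already does):

# Build each set once, then reuse them for every membership test and the union.
pickup_set = set(pickup_ids)
dropoff_set = set(dropoff_ids)

leaving_volume = len(pickup_set - dropoff_set)
arriving_volume = len(dropoff_set - pickup_set)
transition_volume = len(pickup_set & dropoff_set)
union_ids = pickup_set | dropoff_set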
Above all you need to ask yourself the question:
Is the time I save by constructing extra data structures worth the time it takes to construct them?
Next one:
np.sum([...])
I've been trained by Python to think of constructing a list and then applying a function that theoretically only requires a generator as a code smell. I'm not sure if this applies in numpy, since from what I remember it's not completely straightforward to pull data from a generator and put it in a numpy structure.
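For the record, np.fromiter is the usual way to pull data from a generator into NumPy without building a list first (a small sketch with toy data):

import numpy as np

# np.fromiter consumes the generator directly into a 1-D array of the given dtype.
values = np.fromiter((x * x for x in range(10)), dtype=float)
total = values.sum()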
It looks like this is just a small fragment of your code. If you're really concerned about efficiency I'd recommend making use of numpy arrays rather than lists, and trying to stick within numpy's built-in data structures and functions as much as possible. They are likely more highly optimized for raw data crunching in C than the built-in Python functions.
If you're really, really concerned about efficiency then you should probably be doing this data analysis straight-up in C. Especially if you don't have much more code than what you've presented here it might be pretty easy to translate over.
I can only support what machine yerning wrote in his post. If you are thinking of switching to numpy, make sure your variables pickup_ids and dropoff_ids are numpy arrays. Maybe they already are; otherwise do:
dropoff_ids = np.array( dropoff_ids, dtype='i' )
pickup_ids = np.array( pickup_ids, dtype='i' )
Then you can make use of the function np.in1d(), which will give you a True/False array that you can simply sum over to get the total number of True entries.
leaving_volume = (~np.in1d(pickup_ids, dropoff_ids)).sum()
transition_volume = np.in1d(dropoff_ids, pickup_ids).sum()
arriving_volume = (~np.in1d(dropoff_ids, pickup_ids)).sum()
Somehow I have the feeling that transition_volume = len(dropoff_ids) - arriving_volume, since every entry of dropoff_ids is either in pickup_ids or not.
Another function that could be useful to you is np.unique(), if you want to get rid of duplicate entries, which in a way will turn your array into a set.
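For example (with hypothetical toy data):

import numpy as np

dropoff_ids = np.array([3, 1, 3, 2, 1])
unique_dropoffs = np.unique(dropoff_ids)   # array([1, 2, 3]): duplicates removed, result sorted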
Related
I have a Pandas dataframe which is 13.4 GB in size. Since it is a multi-index dataframe whose first level varies in length, I would like to convert it to a ragged tensor. Here is how I do it:
import tensorflow as tf

def convert(df):
    # Convert each first-level group to a tensor, then stack them into a ragged tensor.
    X = df.groupby(level=0).apply(tf.convert_to_tensor)
    X = X.tolist()
    X = tf.ragged.stack(X)
    return X
However, due to the enormous size of the data, this process has become absolutely intractable.
I have found out the following:
By far most of the time is being spent in the line "X = tf.ragged.stack(X)"
tf.ragged.constant or even tf.RaggedTensor.from_row_lengths are not faster.
Any advice would be much appreciated!
What I thought I had found was wrong; using the factory functions such as tf.RaggedTensor.from_row_lengths is tremendously faster, especially on large datasets. I think that's because these functions require you to manually encode the data, whereas tf.ragged.constant forces TensorFlow to calculate the encoding. Also, they are implemented in C instead of in Python. However, be sure to set the argument validate to False, because the difference in time is enormous.
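In case it is useful, here is roughly what that can look like (a sketch only, assuming df holds a single numeric dtype and is sorted by its first index level so each group is a contiguous block of rows; convert_fast is just an illustrative name):

import tensorflow as tf

def convert_fast(df):
    # How many rows each first-level group contributes.
    row_lengths = df.groupby(level=0).size().to_numpy()
    # All rows concatenated, in the same (sorted) group order.
    flat_values = tf.convert_to_tensor(df.to_numpy())
    # validate=False skips the expensive consistency checks.
    return tf.RaggedTensor.from_row_lengths(flat_values, row_lengths, validate=False)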
I am wondering what would be the most efficient way of creating a column in a pandas dataframe that is 1 if an id_row exists in a given list and 0 otherwise.
I'm currently using a lambda function to apply the result. My problem is that it is taking a long time, as my dataframe is around 2M rows and the list it checks against has between 100k and 200k items. If I'm not mistaken this is quadratic time (I'm really not sure though), which in this case runs really slowly given the size of the objects.
The worst part is that I have to repeat this bit of code for over 100 other (different) dataframes.
Here is the function:
lst_to_add = [1,2,3.......,n]
df_table['TEST'] = df_table['id_row'].apply(lambda x : 1 if x in lst_to_add else 0)
I'm wondering how could I make the bit of code (way) more efficient.
I thought of a 'divide and conquer' solution using a recursive function perhaps, but I'm really open to any suggestions.
Last thing: I also have memory constraints, so I would prefer a method that takes a little more time but uses less memory than the alternative (if I had the choice).
You could do
df_table['TEST'] = (df_table['id_row'].isin(lst_to_add)).astype(int)
This code checks if the id_row variables are in lst_to_add and returns True and False, which the astype(int) converts to 1's and 0's. Since this approach is vectorized (acts on the entire series), it should be faster than using apply.
As far as time complexity goes, your list should be a set; this will make your O(M*N) solution O(N), since set membership tests are constant time instead of linear time (as they are for lists). Then use the built-in method .isin:
lst_to_add = set(lst_to_add)
df_table['TEST'] = df_table['id_row'].isin(lst_to_add)
You should stick to the boolean type if memory is an issue, and you only want 0 and 1.
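For a rough sense of the difference (reusing the df_table and lst_to_add names from the question; a bool column stores one byte per row, int64 stores eight):

bool_col = df_table['id_row'].isin(lst_to_add)   # dtype bool
int_col = bool_col.astype('int64')               # dtype int64
print(bool_col.memory_usage(deep=True), int_col.memory_usage(deep=True))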
print sum(1 for x in alist if x[1] == 8)
This code runs fine, but it is so slow. Is there a better way? My list is very large and the computation takes a lot of time. Do you know a better and faster way to do it?
You'd have to create indexes or cached counts to speed up such code; trade memory for speed.
Wherever you handle your list (add to it, remove from it, edit entries) you also maintain your indexes. For example, if you had a counts dict with ids as keys and their frequencies as values, all you would have to do is look up the count directly, and ensure that the counts stay up to date as you manipulate alist.
The best way to manage this is by encapsulating your list in a custom type, so that you can control all manipulations of the data structure and maintain the extra information.
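A minimal sketch of that encapsulation (the class and method names are just illustrative), keeping a collections.Counter of the x[1] values in sync with the list:

from collections import Counter

class CountedList:
    """Wraps a list of sequence items and keeps counts of item[1] up to date."""

    def __init__(self, items=()):
        self._items = list(items)
        self._counts = Counter(item[1] for item in self._items)

    def append(self, item):
        self._items.append(item)
        self._counts[item[1]] += 1

    def remove(self, item):
        self._items.remove(item)
        self._counts[item[1]] -= 1

    def count_of(self, value):
        # O(1) lookup instead of scanning the whole list.
        return self._counts[value]

# usage: alist = CountedList(alist); alist.count_of(8)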
Not sure how much faster it would be but
len([x for x in alist if x[1] == 8])
is a little clearer.
I would use numpy. My numpy skills are a little bit rusty, but (np_array == 8).sum() would give you the count for a one-dimensional array. For your case I think it would be (np_array[:, 1] == 8).sum(), but I would have to check (this assumes your problem could use numpy arrays).
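A minimal sketch of that (assuming alist is a list of pairs, so the array is two-dimensional):

import numpy as np

np_array = np.array(alist)               # shape (n, 2), assuming pairs
count = (np_array[:, 1] == 8).sum()      # number of rows whose second element equals 8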
I am new to Python and my problem is the following:
I have defined a function func(a,b) that returns a value, given two input values.
Now I have my data stored in lists or numpy arrays A, B and would like to use func for every combination. (A and B have over one million entries each.)
At the moment I use this snippet:
for p in A:
    for k in B:
        value = func(p, k)
This takes a really long time.
So I was thinking that maybe something like this:
C=(map(func,zip(A,B)))
But this method only works pairwise... Any ideas?
Thanks for the help
First issue
You need to calculate the output of f for many pairs of values. The "standard" way to speed up this kind of loop is to make your function f accept (NumPy) arrays as input, and do the calculation on the whole array at once (i.e., no looping as seen from Python). Check any NumPy tutorial to get an introduction.
Second issue
If A and B have over a million entries each, there are one trillion combinations. For 64-bit numbers, that means you'll need 7.3 TiB of space just to store the result of your calculation. Do you have enough hard drive space to just store the result?
Third issue
If A and B were much smaller, in your particular case you'd be able to do this:
values = f(*np.meshgrid(A, B))
np.meshgrid returns coordinate arrays covering the Cartesian product of A and B, so it's simply a way to generate the points that have to be evaluated.
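As a tiny illustration (with a made-up f, just to show the shapes involved):

import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([10.0, 20.0])

def f(p, k):                 # hypothetical function that works elementwise on arrays
    return p * k + 1.0

# np.meshgrid expands A and B into two 2-D grids covering every (p, k) pair.
values = f(*np.meshgrid(A, B))   # shape (len(B), len(A)); values[i, j] == f(A[j], B[i])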
Summary
You need to use NumPy effectively to avoid Python loops. (Or if all else fails or they can't easily be vectorized, write those loops in a compiled language, for instance by using Cython)
Working with terabytes of data is hard. Do you really need that much data?
Any solution that calls a function f 1e12 times in a loop is bound to be slow, especially in CPython (which is the default Python implementation; if you're not really sure which one you're using and you're using NumPy, then you're using CPython).
I suppose itertools.product does what you need:
from itertools import product
pro = product(A,B)
C = map(lambda x: func(*x), pro)
Since product returns a lazy iterator, it doesn't require additional memory (in Python 3 map is lazy as well; in Python 2 you would use itertools.imap).
One million times one million is one trillion. Calling f one trillion times will take a while.
Unless you have a way of reducing the number of values to compute, you can't do better than the above.
If you use NumPy, you should definitely look at the np.vectorize function, which is designed for this kind of problem...
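A minimal sketch (assuming A and B are NumPy arrays and func is the scalar function from the question; note that the NumPy docs describe np.vectorize as a convenience rather than a performance tool, so it still loops in Python under the hood):

import numpy as np

vfunc = np.vectorize(func)            # wraps the scalar func so it broadcasts over arrays
C = vfunc(A[:, None], B[None, :])     # shape (len(A), len(B)); C[i, j] == func(A[i], B[j])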
cost = 0
for i in range(12):
    cost = cost + math.pow(float(float(q[i]) - float(w[i])), 2)
cost = math.sqrt(cost)
Is there any faster alternative to this? I need to improve my entire code, so I am trying to improve each statement's performance.
Thank you.
In addition to the general optimization remarks that are already made (and to which I subscribe), there is a more "optimized" way of doing what you want: you manipulate arrays of values and combine them mathematically. This is a job for the very useful and widely used NumPy package!
Here is how you would do it:
q_array = numpy.array(q, dtype=float)
w_array = numpy.array(w, dtype=float)
cost = math.sqrt(((q_array-w_array)**2).sum())
(If your arrays q and w already contain floats, you can remove the dtype=float.)
This is almost as fast as it can get, since NumPy's operations are optimized for arrays. It is also much more legible than a loop, because it is both simple and short.
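For what it's worth, NumPy also has a one-liner for the same Euclidean distance:

cost = numpy.linalg.norm(q_array - w_array)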
Just a hint, but usually real performance improvements come when you evaluate the code at a function or even higher level.
During a good evaluation, you may find whole blocks that could be thrown away or rewritten to simplify the process.
Profilers are useful AFTER you've cleaned up crufty, not-very-legible code. Irrespective of whether it's to be run once or N zillion times, you should not write code like that.
Why are you doing float(q[i]) and float(w[i])? What type(s) are the elements of q and w?
If q[i] and w[i] are floats, then q[i] - w[i] will be a float too, so that's 3 apparently redundant occurrences of float() already.
Calling math.pow() instead of using the ** operator bears the overhead of lookups on 'math' and 'pow'.
Etc etc
See if the following code gives the same answers and reads better and is faster:
costsq = 0.0
for i in xrange(12):
    costsq += (q[i] - w[i]) ** 2
cost = math.sqrt(costsq)
After you've tested that and understood why the changes were made, you can apply the lessons to other Python code. Then if you have a lot more array or matrix work to do, consider using numpy.
Assuming q and w contain numbers, the conversions to float are not necessary; otherwise you should convert the lists to a usable representation earlier (and separately from your calculation).
Given that your function seems to only be doing the equivalent of this:
cost = sum( (qi-wi)**2 for qi,wi in zip(q[:12],w) ) ** 0.5
Perhaps this form would execute faster.