I'm looking for the most efficient way to select multiple columns from a data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(4,8), columns = list('abcdefgh'))
I want to select only the following columns a, c, e, f, g, which can be done by using indexing:
df.ix[:,[0,2,4,5,6]]
For a large data frame with many columns, this seems an inefficient method, and I would much rather specify consecutive column indexes by range, if at all possible. However, attempts such as the following both throw syntax errors:
df.ix[:,[0,2,4:6]]
or
df.ix[:,[0,2,[4:6]]]
As soon as you select non-adjacent columns, you will pay a cost.
If your data is homogeneous, falling back to numpy gives you a notable improvement.
%timeit df[['a','c','e','f','g']]
100 loops, best of 3: 2.67 ms per loop

%timeit df.values[:,[0,2,4,5,6]]
10000 loops, best of 3: 58.7 µs per loop

%timeit df.ix[:,[0,2,4,5,6]]
1000 loops, best of 3: 1.81 ms per loop

%timeit pd.DataFrame(df.values[:,[0,2,4,5,6]], columns=df.columns[[0,2,4,5,6]])
1000 loops, best of 3: 568 µs per loop
I think you can use range (in Python 3, wrap it in list, since range no longer returns a list):
print([0, 2] + list(range(4, 7)))
[0, 2, 4, 5, 6]
print(df.ix[:, [0, 2] + list(range(4, 7))])
a c e f g
0 0.278231 0.192650 0.653491 0.944689 0.663457
1 0.416367 0.477074 0.582187 0.730247 0.946496
2 0.396906 0.877941 0.774960 0.057290 0.556719
3 0.119685 0.211581 0.526096 0.213282 0.492261
Pandas is relatively well thought out; the shortest way to write it is usually also the most efficient:
df[['a','c','e','f','g']]
You don't need ix, as it will do a search through your data; to index by name directly you obviously need the names of the columns.
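If you do want to mix single positions with ranges, as the question asks, one option is np.r_, which concatenates scalars and slices into a single integer array that positional indexers accept. A small sketch, using iloc (the non-deprecated counterpart of ix):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))
cols = np.r_[0, 2, 4:7]           # array([0, 2, 4, 5, 6])
df.iloc[:, cols]                  # same columns as df[['a', 'c', 'e', 'f', 'g']]

This keeps the selection positional while still letting you express consecutive columns as a range.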
If I have a Python Pandas DataFrame containing two columns of people and sequence respectively like:
people sequence
John 1
Rob 2
Bob 3
How can I return the person where sequence is maximal? In this example I want to return 'Bob'
pandas.Series.idxmax is the method that tells you the index label where the maximum occurs.
You can then use that label to get at the value of the other column.
df.at[df['sequence'].idxmax(), 'people']
'Bob'
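For completeness, a self-contained version of the same lookup, using the example data from the question:

import pandas as pd

df = pd.DataFrame({'people': ['John', 'Rob', 'Bob'], 'sequence': [1, 2, 3]})
df.at[df['sequence'].idxmax(), 'people']   # 'Bob'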
I like the solution @user3483203 provided in the comments. The reason I provided a different one is to show that the same thing can be done with fewer objects created.
In this case, df['sequence'] accesses an internally stored object, and we then call the idxmax method on it. At that point we access a specific cell in the dataframe df with the at accessor.
We can see that we are accessing the internally stored object because we can access it in two different ways and validate that it is the same object.
df['sequence'] is df.sequence
True
While
df['sequence'] is df.sequence.copy()
False
On the other hand, df.set_index('people') creates a new object and that is expensive.
Clearly this is over a ridiculously small data set but:
%timeit df.loc[df['sequence'].idxmax(), 'people']
10000 loops, best of 3: 65.1 µs per loop

%timeit df.at[df['sequence'].idxmax(), 'people']
10000 loops, best of 3: 62.6 µs per loop

%timeit df.set_index('people').sequence.idxmax()
1000 loops, best of 3: 556 µs per loop
Over a much larger data set:
df = pd.DataFrame(dict(
    people=range(10000),
    sequence=np.random.permutation(range(10000))
))
%timeit df.loc[df['sequence'].idxmax(), 'people']
10000 loops, best of 3: 107 µs per loop

%timeit df.at[df['sequence'].idxmax(), 'people']
10000 loops, best of 3: 101 µs per loop

%timeit df.set_index('people').sequence.idxmax()
1000 loops, best of 3: 816 µs per loop
The relative difference is consistent.
I'm calculating the average slope between a part of a series and a lagged part of the series:
One way is:
def get_trend(series, slen):
    sum = 0.0
    for i in range(slen):
        sum += series[i+slen] - series[i]
    return sum / slen**2
A second way is:
numpy.average(numpy.subtract(series[slen:2*slen], series[:slen]))/float(slen)
The first code is faster than the second by about 50% according to timeit, and the results differ in the 18th digit and onward for a series of size 200 with slen = 66 and numbers in the series ranging between 0 and 1.
I've also tried replacing the average with a sum and dividing by slen**2, as I do in the for loop:
numpy.sum(numpy.subtract(series[slen:2*slen], series[:slen]))/float(slen**2)
This is equivalent in execution time to the for-loop version, but the result is still not exactly the same; it's also (sometimes) not the same as the average version, although more often than not it does match the average version.
Questions are:
Which of these should give the most accurate answer?
Why does the last version give a different answer from the for loop?
And why is the average function so inefficient?
Note: for timing I'm measuring the operation on a standard list; on a numpy array the average is faster than the for loop, but the sum is still more than twice as fast as the average.
A better suggestion
I think a better vectorized approach would be with slicing -
(series[slen:2*slen] - series[:slen]).sum()/float(slen**2)
Runtime test and verification -
In [139]: series = np.random.randint(11,999,(200))
...: slen= 66
...:
# Original app
In [140]: %timeit get_trend(series, slen)
100000 loops, best of 3: 17.1 µs per loop
# Proposed app
In [141]: %timeit (series[slen:2*slen] - series[:slen]).sum()/float(slen**2)
100000 loops, best of 3: 3.81 µs per loop
In [142]: out1 = get_trend(series, slen)
In [143]: out2 = (series[slen:2*slen] - series[:slen]).sum()/float(slen**2)
In [144]: out1, out2
Out[144]: (0.7587235996326905, 0.75872359963269054)
Investigating the average-based approach against the loopy one
Let's add the second approach (vectorized one) from the question for testing -
In [146]: np.average(np.subtract(series[slen:2*slen], series[:slen]))/float(slen)
Out[146]: 0.75872359963269054
Its timings are better than the loopy one's, and the results look good. So, I suspect the way you are timing things might be off.
If you are using NumPy ufuncs to leverage the vectorized operations with NumPy, you should work with arrays. So, if your data is a list, convert it to an array and then use the vectorized approach. Let's investigate it a bit more -
Case #1 : With a list of 200 elems and slen = 66
In [147]: series_list = np.random.randint(11,999,(200)).tolist()
In [148]: series = np.asarray(series_list)
In [149]: slen = 66
In [150]: %timeit get_trend(series_list, slen)
100000 loops, best of 3: 5.68 µs per loop
In [151]: %timeit np.asarray(series_list)
100000 loops, best of 3: 7.99 µs per loop
In [152]: %timeit np.average(np.subtract(series[slen:2*slen], series[:slen]))/float(slen)
100000 loops, best of 3: 6.98 µs per loop
Case #2 : Scale it 10x
In [157]: series_list = np.random.randint(11,999,(2000)).tolist()
In [159]: series = np.asarray(series_list)
In [160]: slen = 660
In [161]: %timeit get_trend(series_list, slen)
10000 loops, best of 3: 53.6 µs per loop
In [162]: %timeit np.asarray(series_list)
10000 loops, best of 3: 65.4 µs per loop
In [163]: %timeit np.average(np.subtract(series[slen:2*slen], series[:slen]))/float(slen)
100000 loops, best of 3: 8.71 µs per loop
So, it's the overhead of converting to an array that's hurting you!
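In other words, convert once, outside the repeated work, and the conversion cost is paid a single time. A minimal sketch (the list and slen here are just stand-ins for your data):

import numpy as np

series_list = list(range(2000))      # stand-in for your Python list
slen = 660
series = np.asarray(series_list)     # pay the list-to-array conversion cost once
trend = np.average(series[slen:2*slen] - series[:slen]) / float(slen)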
Investigating the sum-based approach against the average-based one
On the third part, comparing the sum-based code against the average-based one: that's because np.average is indeed slower than doing it "manually" with a summation. Timing this as well -
In [173]: a = np.random.randint(0,1000,(1000))
In [174]: %timeit np.sum(a)/float(len(a))
100000 loops, best of 3: 4.36 µs per loop
In [175]: %timeit np.average(a)
100000 loops, best of 3: 7.2 µs per loop
np.mean does a bit better than np.average -
In [179]: %timeit np.mean(a)
100000 loops, best of 3: 6.46 µs per loop
Now, looking into the source code for np.average, it seems to use np.mean. This explains why it's slower than np.mean, as with np.mean we avoid that extra function-call overhead. On the tussle between np.sum and np.mean, I think np.mean does take care of overflow in case we are adding a huge number of elements, which we might miss with np.sum. So, to be on the safe side, I guess it's better to go with np.mean.
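To illustrate that overflow point (my own small sketch, not from the timings above): with a low-precision dtype such as float16, a plain sum can overflow where np.mean, which by default uses float32 intermediates for float16 input, does not:

import numpy as np

a = np.full(100000, 10.0, dtype=np.float16)   # true mean is 10.0
print(np.sum(a) / len(a))   # float16 accumulation overflows to inf for this many elements
print(np.mean(a))           # float32 intermediates by default, so this prints 10.0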
As for the first two questions... well, I would not say you have different results!
A difference on the order of 1e-18 is very small (also think of it in the context of your script).
This is why you should not compare floats for strict equality, but set a tolerance:
What is the best way to compare floats for almost-equality in Python?
https://www.python.org/dev/peps/pep-0485/
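For instance, a minimal sketch along the lines of those references (math.isclose needs Python 3.5+):

import math
import numpy as np

print(0.1 + 0.2 == 0.3)                  # False: binary floating-point rounding
print(math.isclose(0.1 + 0.2, 0.3))      # True: compared within a relative tolerance
print(np.isclose(0.1 + 0.2, 0.3))        # True: numpy's equivalent, also works elementwise on arrays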
Below, I compared the performance of sum operations on C-contiguous and Fortran-contiguous arrays (C vs Fortran memory order). I set axis=0 to ensure numbers are added up column-wise. I was surprised that the Fortran-contiguous array is actually slower than its C counterpart. Isn't it the case that a Fortran-contiguous array stores its columns contiguously in memory, and hence should be better at column-wise operations?
import numpy as np
a = np.random.standard_normal((10000, 10000))
c = np.array(a, order='C')
f = np.array(a, order='F')
In a Jupyter notebook, run
%timeit c.sum(axis=0)
10 loops, best of 3: 84.6 ms per loop
%timeit f.sum(axis=0)
10 loops, best of 3: 137 ms per loop
I think it's down to the implementation of np.sum(). For example:
import numpy as np
A = np.random.standard_normal((10000,10000))
C = np.array(A, order='C')
F = np.array(A, order='F')
Benchmarking with IPython:
In [7]: %timeit C.sum(axis=0)
10 loops, best of 3: 101 ms per loop
In [8]: %timeit C.sum(axis=1)
10 loops, best of 3: 149 ms per loop
In [9]: %timeit F.sum(axis=0)
10 loops, best of 3: 149 ms per loop
In [10]: %timeit F.sum(axis=1)
10 loops, best of 3: 102 ms per loop
So it's behaving exactly the opposite of what was expected. But let's try out another function:
In [17]: %timeit np.amax(C, axis=0)
1 loop, best of 3: 173 ms per loop
In [18]: %timeit np.amax(C, axis=1)
10 loops, best of 3: 70.4 ms per loop
In [13]: %timeit np.amax(F,axis=0)
10 loops, best of 3: 72 ms per loop
In [14]: %timeit np.amax(F,axis=1)
10 loops, best of 3: 168 ms per loop
Sure, it's apples to oranges. But np.amax() works along an axis, as sum does, and returns a vector with one element for each row/column, and it behaves as one would expect.
In [25]: C.strides
Out[25]: (80000, 8)
In [26]: F.strides
Out[26]: (8, 80000)
This tells us that the arrays are in fact packed in row order and column order respectively, and that looping in the matching direction should be a lot faster. Unless, for example, the sum adds row by row as it travels along the columns to produce the column sum (axis=0). But without a means of peeking inside the .pyd I'm just speculating.
EDIT:
From percusse's link: http://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.reduce.html
Reduces a's dimension by one, by applying ufunc along one axis.
Let a.shape = (N_0, ..., N_i, ..., N_{M-1}).
Then ufunc.reduce(a, axis=i)[k_0, ..,k_{i-1}, k_{i+1}, .., k_{M-1}] = the result of iterating j over range(N_i), cumulatively applying ufunc to each
a[k_0, ..,k_{i-1}, j, k_{i+1}, .., k_{M-1}]
So in pseudocode, F.sum(axis=0) does roughly:
for i in rows:        # iterate row by row over axis 0, the axis being reduced
    for j in cols:
        out[j] += F[i, j]
So it actually iterates row by row when calculating the column sum, which slows things down considerably when the array is in column-major order. Behaviour such as this would explain the difference.
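A rough Python sketch of that access pattern (an illustration only, not numpy's actual C implementation): computing the column sum by accumulating one row at a time touches memory contiguously for a C-ordered array but with a large stride for an F-ordered one:

import numpy as np

def column_sum_by_rows(arr):
    # Walk the reduced axis (axis=0) row by row, as in the pseudocode above.
    out = np.zeros(arr.shape[1], dtype=arr.dtype)
    for row in arr:          # contiguous reads for C order, strided reads for F order
        out += row
    return out

c = np.random.standard_normal((1000, 1000))
f = np.asfortranarray(c)
print(np.allclose(column_sum_by_rows(c), c.sum(axis=0)))   # True
print(np.allclose(column_sum_by_rows(f), f.sum(axis=0)))   # True, but the row reads are strided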
eric's link provides the implementation, for anybody curious enough to go through a large amount of code to find the reason.
That's expected. If you check the result of
%timeit f.sum(axis=1)
you will see a timing similar to that of c. Similarly,
%timeit c.sum(axis=1)
is slower.
Some explanation: suppose you have the following structure
|1| |6|
|2| |7|
|3| |8|
|4| |9|
|5| |10|
As Eric mentioned, these operations work via reduce. Let's say we are asking for the column sum. The intuitive mechanism, where each column is accessed once, summed, and recorded, is not what happens. In fact it is the opposite: each row is accessed in turn and the function (here, summing) is applied, in essence as if we had two arrays a, b and executed
a += b
That's a very informal way of repeating what is mentioned super-cryptically in the documentation of reduce.
This requires rows to be accessed contiguously, even though we are performing a column sum: [1,6] + [2,7] + [3,8]... Hence, the direction of iteration is determined by the operation, not by the array's layout.
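Spelled out on the small array above, that accumulation looks like this (a sketch to make the a += b step concrete):

import numpy as np

m = np.array([[1, 6],
              [2, 7],
              [3, 8],
              [4, 9],
              [5, 10]])

acc = np.zeros(2, dtype=m.dtype)
for row in m:            # [1, 6], then [2, 7], ...
    acc += row           # the "a += b" step described above
print(acc)               # [15 40]
print(m.sum(axis=0))     # [15 40], same result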
I ran a comparison of several ways to access data in a DataFrame. See the results below. The quickest access was using the get_value method on a DataFrame; I was referred to it in another post.
What surprised me is that access via get_value is quicker than access via the underlying numpy object df.values.
Question
My question is, is there a way to access elements of a numpy array as quickly as I can access a pandas dataframe via get_value?
Setup
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(16).reshape(4, 4))
Testing
%%timeit
df.iloc[2, 2]
10000 loops, best of 3: 108 µs per loop
%%timeit
df.values[2, 2]
The slowest run took 5.42 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.02 µs per loop
%%timeit
df.iat[2, 2]
The slowest run took 4.96 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 9.85 µs per loop
%%timeit
df.get_value(2, 2)
The slowest run took 19.29 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.57 µs per loop
iloc is pretty general, accepting slices and lists as well as simple integers. In the case above, where you have simple integer indexing, pandas first determines that it is a valid integer, then it converts the request to an iat index, so clearly it will be much slower. iat eventually resolves down to a call to get_value, so naturally a direct call to get_value is going to be fast. get_value itself is cached, so micro-benchmarks like these may not reflect performance in real code.
df.values does return an ndarray, but only after checking that it is a single contiguous block. This requires a few lookups and tests so it is a little slower than retrieving the value from the cache.
We can defeat the caching by creating a new data frame every time. This shows that the values accessor is fastest, at least for data of a uniform type:
In [111]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4))
10000 loops, best of 3: 186 µs per loop
In [112]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.values[2,2]
1000 loops, best of 3: 200 µs per loop
In [113]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.get_value(2,2)
1000 loops, best of 3: 309 µs per loop
In [114]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.iat[2,2]
1000 loops, best of 3: 308 µs per loop
In [115]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.iloc[2,2]
1000 loops, best of 3: 420 µs per loop
In [116]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.ix[2,2]
1000 loops, best of 3: 316 µs per loop
The code claims that ix is the most general, and so in theory should be slower than iloc; it may be that your particular test favours ix, while other tests would favour iloc, simply because of the order of the checks needed to identify the index as a scalar index.
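To answer the question at the top directly: the way to get ndarray-speed access is to extract the underlying array once, outside the hot path, so the consolidation checks behind df.values are paid only a single time. A sketch, assuming the uniform-dtype frame from the setup:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4))
arr = df.values    # pay the extraction/consolidation cost once
arr[2, 2]          # every later access is plain ndarray indexing, no pandas machinery

If the element lookup happens repeatedly inside a loop, hoisting the array out like this is typically much faster than calling df.get_value (or df.iat) on every iteration.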
Reposted from https://groups.google.com/forum/#!topic/pydata/5mhuatNAl5g
It seems that when creating a DataFrame from a structured array, the data is copied?
I get similar results if the data is instead a dictionary of numpy arrays.
Is there any way to create a DataFrame from a structured array or similar without any copying or checking?
In [44]: sarray = np.random.randn(int(1e7), 10).view([(name, float) for name in 'abcdefghij']).squeeze()
In [45]: for N in [10,100,1000,10000,100000,1000000,10000000]:
...: s = sarray[:N]
...: %timeit z = pd.DataFrame(s)
...:
1000 loops, best of 3: 830 µs per loop
1000 loops, best of 3: 834 µs per loop
1000 loops, best of 3: 872 µs per loop
1000 loops, best of 3: 1.33 ms per loop
100 loops, best of 3: 15.4 ms per loop
10 loops, best of 3: 161 ms per loop
1 loops, best of 3: 1.45 s per loop
Thanks,
Dave
Pandas' DataFrame uses a BlockManager to consolidate similarly typed data into a single memory chunk. It's this consolidation into a single chunk that causes the copy. If you initialize as follows:
pd.DataFrame(npmatrix, copy=False)
then the DataFrame will not copy the data, but reference it instead.
HOWEVER, sometimes you may start with multiple arrays, and the BlockManager will try to consolidate the data into a single chunk. In that situation, I think your only option is to monkey-patch the BlockManager not to consolidate the data.
I agree with @DaveHirschfeld that this could be provided as a consolidate=False parameter to BlockManager. Pandas would be better for it.
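One way to check whether a particular construction actually copied (a sketch; exact behaviour varies across pandas versions, especially with copy-on-write in 2.x) is np.shares_memory:

import numpy as np
import pandas as pd

arr = np.random.randn(1000, 4)
df = pd.DataFrame(arr, copy=False)
print(np.shares_memory(arr, df.values))   # True when the DataFrame references arr instead of copying it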
The approach below will by definition coerce the dtypes to a single dtype (e.g. float64 in this case); no way around that. It gives you a view on the original array. Note that this only helps with construction; most operations will tend to make and return copies.
In [44]: s = sarray[:1000000]
Original Method
In [45]: %timeit DataFrame(s)
10 loops, best of 3: 107 ms per loop
Coerce to an ndarray. Pass in copy=False (this doesn't affect a structured array, ONLY a plain single dtyped ndarray).
In [47]: %timeit DataFrame(s.view(np.float64).reshape(-1,len(s.dtype.names)),columns=s.dtype.names,copy=False)
100 loops, best of 3: 3.3 ms per loop
In [48]: result = DataFrame(s.view(np.float64).reshape(-1,len(s.dtype.names)),columns=s.dtype.names,copy=False)
In [49]: result2 = DataFrame(s)
In [50]: result.equals(result2)
Out[50]: True
Note that both DataFrame.from_dict and DataFrame.from_records will copy this. Pandas keeps like-dtyped ndarrays as a single ndarray, and it's expensive to do a np.concatenate to aggregate them, which is what is done under the hood. Using a view avoids this issue.
I suppose this could be the default for a structured array if the passed dtypes are all the same. But then you have to ask why you are using a structured array in the first place. (Obviously to get name access... but is there another reason?)