Python Pandas, apply function

I am trying to use apply to avoid an iterrows() iterator in a function.
However, that pandas method is poorly documented and I can't find examples of how to use it, except for the trivial .apply(np.sqrt) in the documentation... No example of how to pass arguments, etc.
Anyway, here is a toy example of what I am trying to do.
My understanding is that apply does essentially the same thing as iterrows(), i.e. it iterates (over the rows if axis=0). On each iteration the input x of the function should be the row being iterated over. However, the error messages I keep receiving sort of disprove that assumption...
grid = np.random.rand(5, 2)
df = pd.DataFrame(grid)

def multiply(x):
    x[3] = x[0]*x[1]

df = df.apply(multiply, axis=0)
The example above returns an empty df. Can anyone shed some light on my misunderstanding?

import pandas as pd
import numpy as np

grid = np.random.rand(5, 2)
df = pd.DataFrame(grid)

def multiply(x):
    return x[0]*x[1]

df['multiply'] = df.apply(multiply, axis=1)
print(df)
Results in:
          0         1  multiply
0  0.550750  0.713054  0.392715
1  0.061949  0.661614  0.040987
2  0.472134  0.783479  0.369907
3  0.827371  0.277591  0.229670
4  0.961102  0.137510  0.132162
Explanation:
The function you are applying needs to return a value. You also need to apply it across each row, not each column; the axis parameter you passed (axis=0) was incorrect in this regard.
Finally, notice that I am assigning the result to the 'multiply' column outside of the function. You can easily change this to df[3] = ... like you have and get a dataframe like this:
          0         1         3
0  0.550750  0.713054  0.392715
1  0.061949  0.661614  0.040987
2  0.472134  0.783479  0.369907
3  0.827371  0.277591  0.229670
4  0.961102  0.137510  0.132162
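For completeness, the df[3] variant shown in the table above reuses the same multiply function and only changes the column label (a minimal sketch of the same approach):

# same apply as before, assigning to the integer column label 3 as in the question
df[3] = df.apply(multiply, axis=1)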

It should be noted that you can use lambda functions as well; see the pandas documentation for DataFrame.apply.
For your example, you can run:
df['multiply'] = df.apply(lambda row: row[0] * row[1], axis = 1)
which produces the same output as @Andy's answer above.
This can be useful if your function is in the form of
def multiply(a, b):
    return a*b

df['multiply'] = df.apply(lambda row: multiply(row[0], row[1]), axis=1)
More examples can be found in the pandas documentation section Enhancing Performance.
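Since the question also asks how to pass arguments, note that DataFrame.apply accepts an args tuple (and keyword arguments) that are forwarded to your function after the row itself. A minimal sketch, where multiply_by and factor are made-up names for illustration:

# extra positional arguments are passed via args=...; the row is always the first argument
def multiply_by(row, factor):
    return row[0] * row[1] * factor

df['scaled'] = df.apply(multiply_by, axis=1, args=(10,))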

When apply-ing a function, you need that function to return the result of the operation over the column/row. You are getting None because multiply doesn't return anything. That is, the function you pass to apply should return a result computed from the row's values, not do the assignment itself.
You're also iterating over the wrong axis here. Your current code takes the first and second element of each column and multiplies them together.
A correct multiply function:
def multiply(x):
    return x[0]*x[1]

df[3] = df.apply(multiply, axis='columns')
With that being said, you can do much better than apply here, as it is not a vectorized operation. Just multiply the columns together directly.
df[3] = df[0]*df[1]
In general, you should avoid apply when possible as it is not much more than a loop itself under the hood.

One of the rules of Pandas Zen says: always try to find a vectorized solution first.
.apply(..., axis=1) is not vectorized!
Consider alternatives:
In [164]: df.prod(axis=1)
Out[164]:
0 0.770675
1 0.539782
2 0.318027
3 0.597172
4 0.211643
dtype: float64
In [165]: df[0] * df[1]
Out[165]:
0 0.770675
1 0.539782
2 0.318027
3 0.597172
4 0.211643
dtype: float64
Timing against a 50,000-row DataFrame:
In [166]: df = pd.concat([df] * 10**4, ignore_index=True)
In [167]: df.shape
Out[167]: (50000, 2)
In [168]: %timeit df.apply(multiply, axis=1)
1 loop, best of 3: 6.12 s per loop
In [169]: %timeit df.prod(axis=1)
100 loops, best of 3: 6.23 ms per loop
In [170]: def multiply_vect(x1, x2):
     ...:     return x1*x2
     ...:
In [171]: %timeit multiply_vect(df[0], df[1])
1000 loops, best of 3: 604 µs per loop
Conclusion: use .apply() as a very last resort (i.e. when nothing else helps)

Related

Efficiently taking time slices of variable length in a dataframe

I would like to efficiently slice a DataFrame with a DatetimeIndex (similar to a resample or groupby operation), but the desired time slices are different lengths.
This is relatively easy to do by looping (see the code below), but with large timeseries the multiple slices quickly become slow. Any suggestions on vectorising this/improving speed?
import pandas as pd, datetime as dt, numpy as np

# Example DataFrame with a DatetimeIndex
idx = pd.DatetimeIndex(start=dt.datetime(2017,1,1), end=dt.datetime(2017,1,31), freq='h')
df = pd.Series(index=idx, data=np.random.rand(len(idx)))

# The slicer dataframe contains a series of start and end windows
slicer_df = pd.DataFrame(index=[1, 2])
slicer_df['start_window'] = [dt.datetime(2017,1,2,2), dt.datetime(2017,1,6,12)]
slicer_df['end_window'] = [dt.datetime(2017,1,6,12), dt.datetime(2017,1,15,2)]

# The results should be stored to a dataframe, indexed by the index of the slicer dataframe
# This is the loop that I would like to vectorise
slice_results = pd.DataFrame()
slice_results['total'] = None
for index, row in slicer_df.iterrows():
    slice_results.loc[index, 'total'] = df[(df.index >= row.start_window) &
                                           (df.index <= row.end_window)].sum()
NB. I've just realised that my particular data set has adjacent windows (ie. the start of one window corresponds to the end of the one before it), but the windows are of different lengths. It feels like there should be a way to perform a groupby or similar with only one pass over df...
You can do this as an apply, which will concat the results rather than iteratively update the DataFrame:
In [11]: slicer_df.apply((lambda row: \
             df[(df.index >= row.start_window)
                & (df.index <= row.end_window)].sum()), axis=1)
Out[11]:
1 36.381155
2 111.521803
dtype: float64
You can vectorize this with searchsorted (assuming the datetime index is sorted; otherwise sort it first):
In [11]: inds = np.searchsorted(df.index.values, slicer_df.values)
In [12]: s = df.cumsum() # only sum once!
In [13]: pd.Series([s[end] - s[start-1] if start else s[end] for start, end in inds], slicer_df.index)
Out[13]:
1 36.381155
2 111.521803
dtype: float64
There's still a loop in there, but it's now a lot cheaper!
That leads us to a completely vectorized solution (it's a little more cryptic):
In [21]: inds2 = np.maximum(1, inds) # see note
In [22]: inds2[:, 0] -= 1
In [23]: inds2
Out[23]:
array([[ 23,  96],
       [119, 336]])
In [24]: x = s[inds2]
In [25]: x
Out[25]:
array([[ 11.4596498 ,  47.84080472],
       [ 55.94941276, 167.47121538]])
In [26]: x[:, 1] - x[:, 0]
Out[26]: array([ 36.38115493, 111.52180263])
Note: when the start date is before the first date in the index, we want to avoid the start index rolling back from 0 to -1 (which would mean the end of the array, i.e. underflow).
I have come up with a vectorised method which relies on the varying length "windows" being always adjacent to one another, ie. that the start of a window is the same as the end of the window before it.
# Ensure that the join will be successful by rounding to a specific frequency
round_freq = '1h'
df.index = df.index.round(round_freq)
slicer_df.start_window = slicer_df.start_window.dt.round(round_freq)
# Give the index of the slicer a useful name
slicer_df.index.name = 'event_number'
# Perform a join to the start of the window, forward fill to the next window, then groupby to get the totals for each time window
df = df.to_frame('orig_data').join(slicer_df.reset_index().set_index('start_window')[['event_number']])
df.event_number = df.event_number.ffill()
df.groupby('event_number').sum()
Of course this only works when the windows are adjacent, ie. they can't overlap or have any gaps. If anyone has a more general method that works for the above, I'd love to see it!

Is the pandas .apply function vectorised?

Does the pandas df.apply(x, axis=1) method apply the function x to all the rows simultaneously, or iteratively? I had a look in the docs but didn't find anything.
It's iterative:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: def f(row):
    ...:     f.count += 1
    ...:     return f.count
In [13]: f.count = 0
In [14]: df.apply(f, axis=1)
Out[14]:
0 1
1 2
dtype: int64
Note: although in this example it doesn't seem to be the case, the documentation warns:
In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.
The actual for loop (for Python functions rather than ufuncs) happens in pandas' internal lib.reduce.
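As a quick way to check how many times your own pandas version actually calls the function (a minimal sketch; record, calls and df_check are made-up names), you can count the calls yourself:

import pandas as pd

calls = []

def record(row):
    calls.append(row.name)   # row.name is the row's index label
    return row.sum()

df_check = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df_check.apply(record, axis=1)

# on older pandas versions this may exceed len(df_check) because of the
# extra call on the first row; on recent versions it should be equal
print(len(calls), "calls for", len(df_check), "rows")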
I believe iteratively is the answer. Consider this:
import pandas as pd
import numpy as np
import time
# Make a 1000 row long dataframe
df = pd.DataFrame(np.random.random((1000, 4)))
# Apply this time delta function over the length of the dataframe
t0 = time.time()
times = df.apply(lambda _: time.time()-t0, axis=1)
# Print some of the results
print(times[::100])
Out[]:
0 0.000500
100 0.001029
200 0.001532
300 0.002036
400 0.002531
500 0.003033
600 0.003536
700 0.004035
800 0.004537
900 0.005513
dtype: float64

Python: Pandas Series - Why use loc?

Why do we use loc for pandas dataframes? It seems the following code, with or without using loc, compiles and runs at a similar speed:
%timeit df_user1 = df.loc[df.user_id=='5561']
100 loops, best of 3: 11.9 ms per loop
or
%timeit df_user1_noloc = df[df.user_id=='5561']
100 loops, best of 3: 12 ms per loop
So why use loc?
Edit: This has been flagged as a duplicate question. But although pandas iloc vs ix vs loc explanation? does mention that
you can do column retrieval just by using the data frame's
__getitem__:
df['time'] # equivalent to df.loc[:, 'time']
it does not say why we use loc. Although it does explain lots of features of loc, my specific question is 'why not just omit loc altogether?', for which I have accepted a very detailed answer below.
Also, in that other post the answer (which I do not think is an answer) is very hidden in the discussion, and anyone searching for what I was looking for would find it hard to locate the information; they would be much better served by the answer provided to my question.
Explicit is better than implicit.
df[boolean_mask] selects rows where boolean_mask is True, but there is a corner case when you might not want it to: when df has boolean-valued column labels:
In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df
Out[229]:
   False  True
0      3     1
1      4     2
2      5     3
You might want to use df[[True]] to select the True column. Instead it raises a ValueError:
In [230]: df[[True]]
ValueError: Item wrong length 1 instead of 3.
Versus using loc:
In [231]: df.loc[[True]]
Out[231]:
   False  True
0      3     1
In contrast, the following does not raise ValueError even though the structure of df2 is almost the same as df1 above:
In [258]: df2 = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]}); df2
Out[258]:
   A  B
0  1  3
1  2  4
2  3  5
In [259]: df2[['B']]
Out[259]:
   B
0  3
1  4
2  5
Thus, df[boolean_mask] does not always behave the same as df.loc[boolean_mask]. Even though this is arguably an unlikely use case, I would recommend always using df.loc[boolean_mask] instead of df[boolean_mask] because the meaning of df.loc's syntax is explicit. With df.loc[indexer] you know automatically that df.loc is selecting rows. In contrast, it is not clear if df[indexer] will select rows or columns (or raise ValueError) without knowing details about indexer and df.
df.loc[row_indexer, column_index] can select rows and columns. df[indexer] can only select rows or columns depending on the type of values in indexer and the type of column values df has (again, are they boolean?).
In [237]: df2.loc[[True,False,True], 'B']
Out[237]:
0 3
2 5
Name: B, dtype: int64
When a slice is passed to df.loc the end-points are included in the range. When a slice is passed to df[...], the slice is interpreted as a half-open interval:
In [239]: df2.loc[1:2]
Out[239]:
   A  B
1  2  4
2  3  5
In [271]: df2[1:2]
Out[271]:
   A  B
1  2  4
Performance consideration on "chained assignment" with and without using .loc
Let me supplement the already very good answers with a consideration of system performance.
The question itself includes a comparison of the system performance (execution time) of two pieces of code with and without using .loc. The execution times are roughly the same for the code samples quoted. However, for some other code samples, there can be a considerable difference in execution times with and without using .loc: e.g. several times or more!
A common case of pandas dataframe manipulation is creating a new column derived from the values of an existing column. We may use the code below to filter on a condition (based on the existing column) and set different values in the new column:
df[df['mark'] >= 50]['text_rating'] = 'Pass'
However, this kind of "Chained Assignment" does not work since it could create a "copy" instead of a "view" and assignment to the new column based on this "copy" will not update the original dataframe.
Two options are available:
we can either use .loc, or
code it another way without using .loc.
An example of the second option:
df['text_rating'][df['mark'] >= 50] = 'Pass'
By placing the filtering last (after specifying the new column name), the assignment works and the original dataframe is updated.
The solution using .loc is as follows:
df.loc[df['mark'] >= 50, 'text_rating'] = 'Pass'
Now, let's see their execution time:
Without using .loc:
%%timeit
df['text_rating'][df['mark'] >= 50] = 'Pass'
2.01 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
With using .loc:
%%timeit
df.loc[df['mark'] >= 50, 'text_rating'] = 'Pass'
577 µs ± 5.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
As we can see, using .loc the execution time is more than 3x faster!
For a more detailed explanation of "chained assignment", you can refer to another related post, How to deal with SettingWithCopyWarning in pandas?, and in particular the answer of cs95. That post is excellent at explaining the functional differences of using .loc; here I just supplement it with the system performance (execution time) difference.
In addition to what has already been said (issues with having True/False as column names without using loc, and the ability to select and slice both rows and columns with loc), another big difference is that you can use loc to assign values to specific rows and columns. If you try to select a subset of the dataframe using a boolean series and attempt to change a value of that selection, you will likely get the SettingWithCopy warning.
Let's say you're trying to change the "upper management" column for all the rows whose salary is bigger than 60000.
This:
mask = df["salary"] > 60000
df[mask]["upper management"] = True
throws the warning "A value is trying to be set on a copy of a slice from a DataFrame" and won't work, because df[mask] creates a copy and trying to update "upper management" on that copy has no effect on the original df.
But this succeeds:
mask = df["salary"] > 60000
df.loc[mask,"upper management"] = True
Note that in both cases you can do df[df["salary"] > 60000] or df.loc[df["salary"] > 60000], but I think storing boolean condition in a variable first is cleaner.

Apply numpy index to matrix

I have spent the last hour trying to figure this out
Suppose we have
import numpy as np
a = np.random.rand(5, 20) - 0.5
amin_index = np.argmin(np.abs(a), axis=1)
print(amin_index)
> [ 0 12 5 18 1] # or something similar
this does not work:
a[amin_index]
So, in essence, I need to find the minima along a certain axis for the array np.abs(a), but then extract the values from the array a at these positions. How can I apply an index to just one axis?
Probably very simple, but I can't get it figured out. Also, I can't use any loops since I have to do this for arrays with several million entries.
thanks 😊
One way is to pass in the array of row indexes (e.g. [0,1,2,3,4]) and the list of column indexes for the minimum in each corresponding row (your list amin_index).
This returns an array containing the value at [i, amin_index[i]] for each row i:
>>> a[np.arange(a.shape[0]), amin_index]
array([-0.0069325 , 0.04268358, -0.00128002, -0.01185333, -0.00389487])
Note that indexing with integer arrays like this is advanced ("fancy") indexing rather than basic indexing, so the returned array is a new array (a copy) rather than a view of a.
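A quick way to convince yourself of the copy semantics (a minimal check, not part of the original answer):

import numpy as np

a = np.random.rand(5, 20) - 0.5
amin_index = np.argmin(np.abs(a), axis=1)

picked = a[np.arange(a.shape[0]), amin_index]
picked[0] = 999.0              # modify the result...
print(a[0, amin_index[0]])     # ...the original array is unchanged, so picked is a copy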
This is because argmin returns the index of the column for each row (with axis=1), so you need to access each row at those particular columns:
a[range(a.shape[0]), amin_index]
Why not simply do np.amin(np.abs(a), axis=1)? It's much simpler if you don't need the intermediate amin_index array from argmin (note, though, that this gives the absolute values, not the signed values from a).
Numpy's reference page is an excellent resource, see "Indexing".
Edit: timing is always useful:
In [3]: a=np.random.rand(4000, 4000)-.5
In [4]: %timeit np.amin(np.abs(a), axis=1)
10 loops, best of 3: 128 ms per loop
In [5]: %timeit a[np.arange(a.shape[0]), np.argmin(np.abs(a), axis=1)]
10 loops, best of 3: 135 ms per loop

Filter rows of a numpy array?

I am looking to apply a function to each row of a numpy array. If this function evaluates to true I will keep the row, otherwise I will discard it. For example, my function might be:
def f(row):
    if sum(row) > 10:
        return True
    else:
        return False
I was wondering if there was something similar to:
np.apply_over_axes()
which applies a function to each row of a numpy array and returns the result. I was hoping for something like:
np.filter_over_axes()
which would apply a function to each row of an numpy array and only return rows for which the function returned true. Is there anything like this? Or should I just use a for loop?
Ideally, you would be able to implement a vectorized version of your function and use that to do boolean indexing. For the vast majority of problems this is the right solution. Numpy provides quite a few functions that can act over various axes as well as all the basic operations and comparisons, so most useful conditions should be vectorizable.
import numpy as np
x = np.random.randn(20, 3)
x_new = x[np.sum(x, axis=1) > .5]
If you are absolutely sure that you can't do the above, I would suggest using a list comprehension (or np.apply_along_axis) to create an array of bools to index with.
def myfunc(row):
    return sum(row) > .5
bool_arr = np.array([myfunc(row) for row in x])
x_new = x[bool_arr]
This will get the job done in a relatively clean way, but will be significantly slower than a vectorized version. An example:
x = np.random.randn(5000, 200)
%timeit x[np.sum(x, axis=1) > .5]
# 100 loops, best of 3: 5.71 ms per loop
%timeit x[np.array([myfunc(row) for row in x])]
# 1 loops, best of 3: 217 ms per loop
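The np.apply_along_axis alternative mentioned above would look roughly like this (a sketch of the same idea; expect performance close to the list comprehension, since it is also a Python-level loop under the hood):

import numpy as np

def myfunc(row):
    return sum(row) > .5

x = np.random.randn(5000, 200)

# apply_along_axis calls myfunc once per row (axis=1) and returns a boolean array
bool_arr = np.apply_along_axis(myfunc, 1, x)
x_new = x[bool_arr]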
