Is the pandas .apply function vectorised?

Does the pandas df.apply(x, axis=1) method apply the function x to all the rows simultaneously, or iteratively? I had a look in the docs but didn't find anything.

It applies the function iteratively:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: def f(row):
    ...:     f.count += 1
    ...:     return f.count
In [13]: f.count = 0
In [14]: df.apply(f, axis=1)
Out[14]:
0    1
1    2
dtype: int64
Note: although it doesn't appear to be the case in this example, the documentation warns:
In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.
The actual for loop (for Python functions rather than ufuncs) happens in lib.reduce.
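Conceptually, df.apply(f, axis=1) behaves like the following pure-Python loop (a simplified sketch of the observed behaviour, not the actual pandas internals):

import pandas as pd

def apply_rows(df, func):
    # Call func once per row, in order, and collect the results.
    results = {}
    for label, row in df.iterrows():
        results[label] = func(row)
    return pd.Series(results)

apply_rows(df, f) produces the same Series as df.apply(f, axis=1) above (ignoring the extra first-row call mentioned in the docs warning).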

I believe the answer is "iteratively". Consider this:
import pandas as pd
import numpy as np
import time
# Make a 1000 row long dataframe
df = pd.DataFrame(np.random.random((1000, 4)))
# Apply this time delta function over the length of the dataframe
t0 = time.time()
times = df.apply(lambda _: time.time()-t0, axis=1)
# Print some of the results
print(times[::100])
Out[]:
0      0.000500
100    0.001029
200    0.001532
300    0.002036
400    0.002531
500    0.003033
600    0.003536
700    0.004035
800    0.004537
900    0.005513
dtype: float64

Related

Find t confidence interval across rows in dataframe

This is an example dataframe; my actual dataframe has hundreds more rows.
nums_1  nums_2  nums_3
     1       1       8
     2       1       7
     3       5       9
Is there a method that will calculate the 95% confidence interval across each row? One that would work for a large dataframe?
df = pd.DataFrame({'nums_1': [1, 2, 3], 'nums_2': [1, 1, 5], 'nums_3' : [8,7,9]})
You can use:
import numpy as np
from scipy import stats
df.apply(lambda x: stats.t.interval(0.95, len(x)-1, loc=np.mean(x), scale=stats.sem(x)), axis=1)
You will obtain essentially the same results by using the following:
import statsmodels.stats.api as sms
df.apply(lambda x: sms.DescrStatsW(x).tconfint_mean(), axis=1)
Both approaches return the same result: a Series of tuples.
The answer is described here: Compute a confidence interval from sample data
What is important to understand is that it works correctly if each row (each sample) is drawn independently from a normal distribution with an unknown standard deviation.
When it comes to large dataframes, the easy solution is to use swifter. However, it only speeds this calculation up by roughly a factor of two. Nevertheless, it is worth trying: https://towardsdatascience.com/do-you-use-apply-in-pandas-there-is-a-600x-faster-way-d2497facfa66
import statsmodels.stats.api as sms
import swifter
df.swifter.apply(lambda x: sms.DescrStatsW(x).tconfint_mean(), axis=1)
Edit: if you want to round your results and maybe get two columns instead of one with tuples, you can use:
def get_conf_interv(x):
    res1, res2 = sms.DescrStatsW(x).tconfint_mean()
    return round(res1, 2), round(res2, 2)
df[['res1', 'res2']] = df.swifter.apply(get_conf_interv, axis=1, result_type='expand')
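For a large dataframe you can also skip apply entirely and compute the interval from vectorized row statistics. Below is a minimal sketch using scipy's t.ppf (the helper name row_tconfint is mine); it relies on the same normality assumptions discussed above:

import numpy as np
import pandas as pd
from scipy import stats

def row_tconfint(df, alpha=0.05):
    # Per-row confidence interval: mean +/- t_crit * standard error,
    # computed for all rows at once without a Python-level loop.
    n = df.shape[1]
    mean = df.mean(axis=1)
    sem = df.std(axis=1, ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    return pd.DataFrame({'res1': mean - t_crit * sem,
                         'res2': mean + t_crit * sem})

This gives the same numbers as the stats.t.interval version, just without calling it once per row.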

Iterations over items of pandas dataframe

I have read many times that iterations should be avoided in dataframes so I have been trying the "better ways", such as applying functions, but I get stuck with the following error:
The truth value of a Series is ambiguous
I need to run iterative calculations across various row items and get updated values. Here is a simplified example, but the real case has a lot of math in it, hence why functions are preferred:
df = pd.DataFrame({'A':[10,20,30,40], 'B':[4,3,2,1]})
def match_col(A,B):
    while A != B:
        B = B + 1
df.apply(lambda x: match_col(df['A'],df['B']),axis=1)
Basically, I need for each row to use a number of items, run iterative calcs, and output new/updated items. Where am I getting the logic wrong?
Instead do:
df.apply(lambda x: match_col(x['A'],x['B']),axis=1)
Because you're applying the function over each row, the row's values are what need to be passed to match_col and not entire series e.g. df['A'].
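To see where the original error comes from: df['A'] != df['B'] is an element-wise comparison that yields a boolean Series, and while cannot reduce a Series to a single True/False (a minimal reproduction):

import pandas as pd

df = pd.DataFrame({'A': [10, 20], 'B': [4, 3]})
cond = df['A'] != df['B']  # a boolean Series, not a scalar
bool(cond)                 # ValueError: The truth value of a Series is ambiguous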
You also need to return something from your function:
def match_col(A,B):
    while A != B:
        B = B + 1
    return B
Then you'll get this result:
In [10]: df.apply(lambda x: match_col(x['A'],x['B']),axis=1)
Out[10]:
0    10
1    20
2    30
3    40
dtype: int64
I made some changes to the apply function:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[10,20,30,40], 'B':[1,3,2,1]})
def match_col(col):
    while col.A != col.B:
        col.B = col.B + 1
    return col.B
df.apply(match_col,axis=1)
Output:
0    10
1    20
2    30
3    40
dtype: int64
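Since this particular loop just raises B until it equals A (and would never terminate if B started above A), the whole apply collapses to a vectorized one-liner for this toy data (the column name B_matched is illustrative):

df['B_matched'] = df['A']  # same result as the loop above, no iteration

The asker's real math presumably doesn't reduce this far, but it's worth checking for a closed form before reaching for apply.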

Pandas .loc[] method only returns only DataType not Series despite calling single index

TL;DR: .loc[] returns DataFrame type all the time, even when specifying a single index.
I've tried everything. This is driving me insane.
I can't seem to reproduce it anywhere else.
I've checked every type of data that's being passed. Everything is as it should be. But no matter what I pass into .loc[], it will return a DataFrame, not a Series.
import numpy as np
import pandas as pd
import datetime

index_list = 'A B C D E F G H'.split()
df = pd.DataFrame(data=None, index=index_list)

k = 0
while k <= 2:
    now = datetime.datetime.now().strftime('%H:%M:%S')
    df.loc[:, now] = 1
    for i in index_list:
        print(df.loc[i])
        print(type(df.loc[i]))
    k += 1
The code above runs without errors and returns Series-type data.
This is distilled code, but it's exactly the same as the real one: same flow, exactly the same types of data being passed.
'now' is set as a column name, and the new value for every index is 1.
Next the script iterates through index_list and prints the type.
The problem is that in the real script .loc only returns DataFrame type, not Series, and I have no idea why. I even tried manually entering the .loc index name to check that I wasn't passing the wrong type of data. It still returned a DataFrame.
I'm 100% out of ideas of what I could be doing wrong.
Maybe some of you have ideas?
EDIT
removed The original code.
I found that if I call
print(df.loc[i].iloc[0])
It will return the Series data for the column.
print(type(df.loc[i].iloc[0]))
Will print:
20:48:48    1
Name: (A,), dtype: int64
<class 'pandas.core.series.Series'>
Why is the name (A,) a tuple?
TL;DR: remove the extra brackets when building dfCoinMaster's index.
In the working code:
df = pd.DataFrame(data=None,index=index_list)
In the non-working code:
dfCoinMaster = pd.DataFrame(data=None,index=[current_coin_listings])
You're adding an extra level of list nesting, which you can see in your
Name: (A,), dtype: int64
line. You can reproduce the same behaviour by adding the extra brackets to your test:
In [28]: df = pd.DataFrame(data=None, index=[index_list])
In [29]: df.loc[:, 'test'] = 10
In [30]: df
Out[30]:
   test
A    10
B    10
C    10
D    10
E    10
F    10
G    10
H    10
In [31]: df.loc['A']
Out[31]:
   test
A    10
In [32]: type(_)
Out[32]: pandas.core.frame.DataFrame
But:
In [33]: df.loc[('A',)]
Out[33]:
test    10
Name: (A,), dtype: int64
In [34]: type(_)
Out[34]: pandas.core.series.Series
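The underlying reason: a list containing a list is treated as a list of arrays, so pandas builds a one-level MultiIndex. Selecting a plain scalar label from a MultiIndex is a partial selection, which returns a DataFrame; selecting the full tuple returns a Series. A quick way to check (a small diagnostic sketch):

import pandas as pd

index_list = 'A B C D E F G H'.split()
flat = pd.DataFrame(index=index_list)
nested = pd.DataFrame(index=[index_list])

print(type(flat.index))    # pandas.core.indexes.base.Index
print(type(nested.index))  # pandas.core.indexes.multi.MultiIndex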

Efficiently taking time slices of variable length in a dataframe

I would like to efficiently slice a DataFrame with a DatetimeIndex (similar to a resample or groupby operation), but the desired time slices are different lengths.
This is relatively easy to do by looping (see the code below), but with a large timeseries the repeated slicing quickly becomes slow. Any suggestions on vectorising this/improving speed?
import pandas as pd, datetime as dt, numpy as np
# Example Series with a DatetimeIndex
idx = pd.date_range(start=dt.datetime(2017,1,1), end=dt.datetime(2017,1,31), freq='h')
df = pd.Series(index=idx, data=np.random.rand(len(idx)))
#The slicer dataframe contains a series of start and end windows
slicer_df = pd.DataFrame(index = [1,2])
slicer_df['start_window'] = [dt.datetime(2017,1,2,2), dt.datetime(2017,1,6,12)]
slicer_df['end_window'] = [dt.datetime(2017,1,6,12), dt.datetime(2017,1,15,2)]
#The results should be stored to a dataframe, indexed by the index of the slicer dataframe
#This is the loop that I would like to vectorise
slice_results = pd.DataFrame()
slice_results['total'] = None
for index, row in slicer_df.iterrows():
    slice_results.loc[index, 'total'] = df[(df.index >= row.start_window) &
                                           (df.index <= row.end_window)].sum()
NB. I've just realised that my particular data set has adjacent windows (ie. the start of one window corresponds to the end of the one before it), but the windows are of different lengths. It feels like there should be a way to perform a groupby or similar with only one pass over df...
You can do this as an apply, which will concat the results rather than iteratively update the DataFrame:
In [11]: slicer_df.apply(lambda row: df[(df.index >= row.start_window) &
    ...:                                (df.index <= row.end_window)].sum(), axis=1)
Out[11]:
1     36.381155
2    111.521803
dtype: float64
You can vectorize this with searchsorted (assuming the datetime index is sorted, otherwise first sort):
In [11]: inds = np.searchsorted(df.index.values, slicer_df.values)
In [12]: s = df.cumsum()  # only sum once!
In [13]: pd.Series([s.iloc[end] - s.iloc[start-1] if start else s.iloc[end]
    ...:            for start, end in inds], slicer_df.index)
Out[13]:
1     36.381155
2    111.521803
dtype: float64
There's still a loop in there, but it's now a lot cheaper!
That leads us to a completely vectorized solution (it's a little more cryptic):
In [21]: inds2 = np.maximum(1, inds) # see note
In [22]: inds2[:, 0] -= 1
In [23]: inds2
Out[23]:
array([[ 23,  96],
       [119, 336]])
In [24]: x = s.values[inds2]
In [25]: x
Out[25]:
array([[ 11.4596498 ,  47.84080472],
       [ 55.94941276, 167.47121538]])
In [26]: x[:, 1] - x[:, 0]
Out[26]: array([ 36.38115493, 111.52180263])
Note: when the start date is before the first date, we want to avoid the start index rolling back from 0 to -1 (which would wrap around to the end of the array, i.e. underflow).
I have come up with a vectorised method which relies on the varying length "windows" being always adjacent to one another, ie. that the start of a window is the same as the end of the window before it.
# Ensure that the join will be successful by rounding to a specific frequency
round_freq = '1h'
df.index = df.index.round(round_freq)
slicer_df.start_window = slicer_df.start_window.dt.round(round_freq)
# Give the index of the slicer a useful name
slicer_df.index.name = 'event_number'
# Perform a join to the start of the window, forward fill to the next window, then groupby to get the totals for each time window
df = df.to_frame('orig_data').join(slicer_df.reset_index().set_index('start_window')[['event_number']])
df.event_number = df.event_number.ffill()
df.groupby('event_number').sum()
Of course this only works when the windows are adjacent, ie. they can't overlap or have any gaps. If anyone has a more general method that works for the above, I'd love to see it!
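A more general option, which handles gaps as well as adjacency (as long as the windows don't overlap), is to bin the timestamps with pd.cut against an IntervalIndex and group on the bins. Note the intervals here are left-closed, so a timestamp on a shared boundary counts toward the later window, which differs slightly from the <= end convention above (a sketch reusing df and slicer_df):

bins = pd.IntervalIndex.from_arrays(slicer_df.start_window,
                                    slicer_df.end_window, closed='left')
totals = df.groupby(pd.cut(df.index, bins)).sum()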

Python Pandas, apply function

I am trying to use apply to avoid an iterrows() iterator in a function. However, that pandas method is poorly documented and I can't find examples of how to use it, except for the trivial .apply(np.sqrt) in the documentation... There are no examples of how to use arguments, etc.
Anyway, here is a toy example of what I am trying to do.
In my understanding, apply will actually do the same as iterrows(), i.e. iterate (over the rows if axis=0). On each iteration the input x of the function should be the row being iterated over. However, the error messages I keep receiving sort of disprove that assumption...
grid = np.random.rand(5,2)
df = pd.DataFrame(grid)
def multiply(x):
    x[3] = x[0]*x[1]
df = df.apply(multiply, axis=0)
The example above returns an empty df. Can anyone shed some light on my misunderstanding?
import pandas as pd
import numpy as np
grid = np.random.rand(5,2)
df = pd.DataFrame(grid)
def multiply(x):
    return x[0]*x[1]
df['multiply'] = df.apply(multiply, axis = 1)
print(df)
Results in:
          0         1  multiply
0  0.550750  0.713054  0.392715
1  0.061949  0.661614  0.040987
2  0.472134  0.783479  0.369907
3  0.827371  0.277591  0.229670
4  0.961102  0.137510  0.132162
Explanation:
The function you are applying needs to return a value. You are also applying it to each row, not each column; the axis parameter you passed was incorrect in this regard.
Finally, notice that I am assigning the result to the 'multiply' column outside of the function. You can easily change this to df[3] = ... like you have and get a dataframe like this:
          0         1         3
0  0.550750  0.713054  0.392715
1  0.061949  0.661614  0.040987
2  0.472134  0.783479  0.369907
3  0.827371  0.277591  0.229670
4  0.961102  0.137510  0.132162
It should be noted that you can use lambda functions as well; see the documentation for apply.
For your example, you can run:
df['multiply'] = df.apply(lambda row: row[0] * row[1], axis = 1)
which produces the same output as @Andy's answer.
This can be useful if your function has the form:
def multiply(a,b):
    return a*b
df['multiply'] = df.apply(lambda row: multiply(row[0] ,row[1]), axis = 1)
More examples can be found in the Enhancing Performance section of the docs.
When apply-ing a function, you need that function to return the result for that operation over the column/row. You are getting None because multiply doesn't return anything. That is, apply should return a result computed from the given values, not perform the assignment itself.
You're also iterating over the wrong axis here: your current code takes the first and second element of each column and multiplies them together.
A correct multiply function:
def multiply(x):
    return x[0]*x[1]
df[3] = df.apply(multiply, 'columns')
With that being said, you can do much better than apply here, as it is not a vectorized operation. Just multiply the columns together directly.
df[3] = df[0]*df[1]
In general, you should avoid apply when possible as it is not much more than a loop itself under the hood.
One of the rules of Pandas Zen says: always try to find a vectorized solution first.
.apply(..., axis=1) is not vectorized!
Consider alternatives:
In [164]: df.prod(axis=1)
Out[164]:
0    0.770675
1    0.539782
2    0.318027
3    0.597172
4    0.211643
dtype: float64
In [165]: df[0] * df[1]
Out[165]:
0    0.770675
1    0.539782
2    0.318027
3    0.597172
4    0.211643
dtype: float64
Timing against a 50,000-row DataFrame:
In [166]: df = pd.concat([df] * 10**4, ignore_index=True)
In [167]: df.shape
Out[167]: (50000, 2)
In [168]: %timeit df.apply(multiply, axis=1)
1 loop, best of 3: 6.12 s per loop
In [169]: %timeit df.prod(axis=1)
100 loops, best of 3: 6.23 ms per loop
In [170]: def multiply_vect(x1, x2):
     ...:     return x1*x2
     ...:
In [171]: %timeit multiply_vect(df[0], df[1])
1000 loops, best of 3: 604 µs per loop
Conclusion: use .apply() as a very last resort (i.e. when nothing else helps)
