I understand how to take the cube root of both positive and negative numbers. But when I try to use the apply-lambda approach to process all elements of a dataframe efficiently, I run into an ambiguity error. Interestingly, the error does not arise with equalities, so I am wondering what could be wrong with this code:
sample[columns] = sample[columns].apply(lambda x: (-1) * np.power(-x, 1./3) if x < 0 else np.power(x, 1./3))
It looks like you are passing a list or array of column names. I assume this because your variable name is plural, ending in an s. If that's the case, then sample[columns] is a dataframe. This is an issue because apply iterates over the columns, passing each column to the lambda you gave it. So you end up evaluating
(-1) * np.power(-series_object, 1./3) if series_object < 0 else np.power(series_object, 1./3)
And it's the series_object < 0 that breaks things: you are asking for the truthiness of an entire series being less than zero, which pandas refuses with "The truth value of a Series is ambiguous."
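A minimal reproduction of that error (using a throwaway Series, not the asker's data):
import pandas as pd
s = pd.Series([-8, 27])
try:
    bool(s < 0)   # what the lambda's `if x < 0` ultimately asks for
except ValueError as err:
    print(err)    # The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().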
applymap
f = lambda x: -np.power(-x, 1./3) if x < 0 else np.power(x, 1./3)
sample[columns] = sample[columns].applymap(f)
That said, I'd use a lambda defined as follows
f = lambda x: np.sign(x) * np.power(abs(x), 1./3)
Then you could perform this on the entire dataframe
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(-10, 10, (5, 5)))
df
0 1 2 3 4
0 6 1 -8 0 5
1 3 1 3 9 -2
2 -10 2 -10 -8 -10
3 -3 9 3 8 2
4 -6 -7 9 3 -3
f = lambda x: np.sign(x) * np.power(abs(x), 1./3)
f(df)
0 1 2 3 4
0 1.817121 1.000000 -2.000000 0.000000 1.709976
1 1.442250 1.000000 1.442250 2.080084 -1.259921
2 -2.154435 1.259921 -2.154435 -2.000000 -2.154435
3 -1.442250 2.080084 1.442250 2.000000 1.259921
4 -1.817121 -1.912931 2.080084 1.442250 -1.442250
Same as
df.applymap(f)
0 1 2 3 4
0 1.817121 1.000000 -2.000000 0.000000 1.709976
1 1.442250 1.000000 1.442250 2.080084 -1.259921
2 -2.154435 1.259921 -2.154435 -2.000000 -2.154435
3 -1.442250 2.080084 1.442250 2.000000 1.259921
4 -1.817121 -1.912931 2.080084 1.442250 -1.442250
Check for equality
df.applymap(f).equals(f(df))
True
And it's faster:
%timeit df.applymap(f)
1000 loops, best of 3: 1.11 ms per loop
%timeit f(df)
1000 loops, best of 3: 473 µs per loop
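Note: newer pandas (2.1 and later) deprecates applymap in favour of the elementwise DataFrame.map, which accepts the same function. A quick sketch on a throwaway frame:
import numpy as np
import pandas as pd
f = lambda x: np.sign(x) * np.power(abs(x), 1./3)
tmp = pd.DataFrame({'a': [-8.0, 27.0]})
tmp.map(f)   # pandas >= 2.1: same elementwise result as tmp.applymap(f)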
It doesn't have to be complicated; simply use NumPy's cube-root function np.cbrt:
df[columns] = np.cbrt(df[columns])
It requires NumPy >= 1.10 though.
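For example, a quick check on a throwaway frame (not the asker's sample):
import numpy as np
import pandas as pd
tmp = pd.DataFrame({'a': [-8.0, 0.0, 27.0]})
np.cbrt(tmp)   # -2.0, 0.0, 3.0 -- negative inputs get a real cube root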
For older versions you could use np.absolute and np.sign instead of using conditionals:
df[columns] = df[columns].apply(lambda x: np.power(np.absolute(x), 1./3) * np.sign(x))
This takes the cube root of the absolute value and then restores the sign appropriately.
Try:
sample[columns] = sample[columns]**(1/3)
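Be aware, though, that a negative base raised to a fractional power produces NaN in a float column, so this only covers non-negative values. A quick check on a throwaway Series:
import pandas as pd
s = pd.Series([-8.0, 27.0])
s**(1/3)   # NaN for -8.0, ~3.0 for 27.0 -- the real cube root of the negative value is lost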
Related
In the example from the pandas documentation about the new .pipe() method for GroupBy objects, calling .apply() with the same lambda returns the same results:
In [195]: import numpy as np
In [196]: n = 1000
In [197]: df = pd.DataFrame({'Store': np.random.choice(['Store_1', 'Store_2'], n),
.....: 'Product': np.random.choice(['Product_1', 'Product_2', 'Product_3'], n),
.....: 'Revenue': (np.random.random(n)*50+10).round(2),
.....: 'Quantity': np.random.randint(1, 10, size=n)})
In [199]: (df.groupby(['Store', 'Product'])
.....: .pipe(lambda grp: grp.Revenue.sum()/grp.Quantity.sum())
.....: .unstack().round(2))
Out[199]:
Product Product_1 Product_2 Product_3
Store
Store_1 6.93 6.82 7.15
Store_2 6.69 6.64 6.77
I can see how the pipe functionality differs from apply for DataFrame objects, but not for GroupBy objects. Does anyone have an explanation or examples of what can be done with pipe but not with apply for a GroupBy?
What pipe does is let you pass a callable, with the expectation that the object calling pipe is what gets passed to the callable.
With apply, we assume that the object calling apply has subcomponents that will each be passed to the callable given to apply. In the context of a groupby, the subcomponents are slices of the dataframe that called groupby, where each slice is a dataframe itself. The same holds analogously for a series groupby.
The main difference in a groupby context is that with pipe the callable has the entire scope of the groupby object available to it; with apply, you only know about the local slice.
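To make that concrete, a minimal sketch on a throwaway frame (separate from the setup below):
import pandas as pd
tmp = pd.DataFrame({'A': list('XXYY'), 'B': [1, 2, 3, 4]})
gb = tmp.groupby('A')
gb.pipe(lambda g: g.ngroups)        # 2 -- the callable receives the whole GroupBy object
gb['B'].apply(lambda s: s.sum())    # the callable receives each group's slice, one at a time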
Setup
Consider df
df = pd.DataFrame(dict(
    A=list('XXXXYYYYYY'),
    B=range(10)
))
A B
0 X 0
1 X 1
2 X 2
3 X 3
4 Y 4
5 Y 5
6 Y 6
7 Y 7
8 Y 8
9 Y 9
Example 1
Make the entire 'B' column sum to 1 while each sub-group sums to the same amount. This requires the calculation to know how many groups exist, which apply can't see on its own, because each call to apply only knows about its own group.
s = df.groupby('A').B.pipe(lambda g: df.B / g.transform('sum') / g.ngroups)
s
0 0.000000
1 0.083333
2 0.166667
3 0.250000
4 0.051282
5 0.064103
6 0.076923
7 0.089744
8 0.102564
9 0.115385
Name: B, dtype: float64
Note:
s.sum()
0.99999999999999989
And:
s.groupby(df.A).sum()
A
X 0.5
Y 0.5
Name: B, dtype: float64
Example 2
Subtract the mean of one group from the values of another. Again, this can't be done with apply because apply doesn't know about other groups.
df.groupby('A').B.pipe(
    lambda g: (
        g.get_group('X') - g.get_group('Y').mean()
    ).append(
        g.get_group('Y') - g.get_group('X').mean()
    )
)
0 -6.5
1 -5.5
2 -4.5
3 -3.5
4 2.5
5 3.5
6 4.5
7 5.5
8 6.5
9 7.5
Name: B, dtype: float64
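Note that Series.append has since been removed from pandas; the same pipe can be written with pd.concat (a sketch against the same df):
import pandas as pd
df.groupby('A').B.pipe(
    lambda g: pd.concat([
        g.get_group('X') - g.get_group('Y').mean(),
        g.get_group('Y') - g.get_group('X').mean(),
    ])
)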
For what it's worth, the normalisation from Example 1 can also be reproduced with apply by reaching back out to the full dataframe inside the lambda:
print(df.groupby(['A'])['B'].apply(lambda l: l/l.sum()/df.A.nunique()))
I would like to get the average value of a row in a dataframe where I only use values greater than or equal to zero.
For example:
if my dataframe looked like:
df = pd.DataFrame([[3,4,5], [4,5,6],[4,-10,6]])
3 4 5
4 5 6
4 -10 6
Currently, to get the average of each row, I write:
df['mean'] = df.mean(axis = 1)
and get:
3 4 5 4
4 5 6 5
4 -10 6 0
I would like the average to use only values greater than zero, so that the result looks like:
3 4 5 4
4 5 6 5
4 -10 6 5
In the above example, -10 is excluded from the average. Is there a command that excludes the -10?
You can use df[df > 0] to mask the data frame before calculating the average; df[df > 0] returns a data frame where cells less than or equal to zero are replaced with NaN, and NaN values are ignored when calculating the mean:
df[df > 0].mean(1)
#0 4.0
#1 5.0
#2 5.0
#dtype: float64
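To see what the intermediate mask does (using the same example frame):
import pandas as pd
df = pd.DataFrame([[3, 4, 5], [4, 5, 6], [4, -10, 6]])
df[df > 0]               # -10 becomes NaN
df[df > 0].mean(axis=1)  # NaN is skipped, so row 2 averages 4 and 6 -> 5.0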
Not nearly as succinct as @Psidom's answer, but if you want to use NumPy for some added speed:
v0 = df.values
v1 = np.where(v0 > 0, v0, np.nan)
v2 = np.nanmean(v1, axis=1)
df.assign(Mean=v2)
0 1 2 Mean
0 3 4 5 4.0
1 4 5 6 5.0
2 4 -10 6 5.0
Timing
small data
%timeit df.assign(Mean=df[df > 0].mean(1))
1000 loops, best of 3: 1.71 ms per loop
%%timeit
v0 = df.values
v1 = np.where(v0 > 0, v0, np.nan)
v2 = np.nanmean(v1, axis=1)
df.assign(Mean=v2)
1000 loops, best of 3: 407 µs per loop
I have a dataframe that looks like this:
In [60]: df1
Out[60]:
DIFF UID
0 NaN 1
1 13.0 1
2 4.0 1
3 NaN 2
4 3.0 2
5 23.0 2
6 NaN 3
7 4.0 3
8 29.0 3
9 42.0 3
10 NaN 4
11 3.0 4
and for each UID I want to count how many instances have a DIFF value above a given threshold.
I have tried something like this:
In [61]: threshold = 5
In [62]: df1[df1.DIFF > threshold].groupby('UID')['DIFF'].count().reset_index().rename(columns={'DIFF':'ATTR_NAME'})
Out[63]:
UID ATTR_NAME
0 1 1
1 2 1
2 3 2
That works fine as far as returning the right count of instances per user. However, I would also like to include the users that have 0 instances, which are currently excluded by the df1[df1.DIFF > threshold] filter.
The desired output would be:
UID ATTR_NAME
0 1 1
1 2 1
2 3 2
3 4 0
Any ideas?
Thanks
Simple, use .reindex:
req = df1[df1.DIFF > threshold].groupby('UID')['DIFF'].count()
req = req.reindex(df1.UID.unique(), fill_value=0).reset_index().rename(columns={'DIFF': 'ATTR_NAME'})
The fill_value=0 ensures that UIDs with no rows above the threshold (here UID 4) show up with a count of 0 rather than NaN.
In one line:
df1[df1.DIFF > threshold].groupby('UID')['DIFF'].count().reindex(df1.UID.unique(), fill_value=0).reset_index().rename(columns={'DIFF': 'ATTR_NAME'})
Another way would be to use a function with apply() to do this:
In [101]: def count_instances(x, threshold):
   .....:     counter = 0
   .....:     for i in x:
   .....:         if i > threshold:
   .....:             counter += 1
   .....:     return counter
   .....:
In [102]: df1.groupby('UID')['DIFF'].apply(lambda x: count_instances(x, 5)).reset_index()
Out[102]:
UID DIFF
0 1 1
1 2 1
2 3 2
3 4 0
It appears this way is marginally faster as well, although the difference here is negligible:
In [103]: %timeit df1.groupby('UID')['DIFF'].apply(lambda x: count_instances(x, 5)).reset_index()
100 loops, best of 3: 2.38 ms per loop
In [104]: %timeit df1[df1.DIFF > 5].groupby('UID')['DIFF'].count().reset_index()
100 loops, best of 3: 2.39 ms per loop
Would something like this work well for you?
Counting the values that match a criterion, without dropping the keys that have no matches, is equivalent to counting the number of True values per group, which can be done by summing booleans:
(df1.DIFF > 5).groupby(df1.UID).sum().reset_index()
UID DIFF
0 1 1.0
1 2 1.0
2 3 2.0
3 4 0.0
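If integer counts are preferred (the boolean sum comes back as float above), a cast tidies it up; a sketch against the same df1:
(df1.DIFF > 5).groupby(df1.UID).sum().astype(int).reset_index()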
Is it possible to put percentile cuts on all columns of a dataframe without using a loop? This is how I am doing it now:
df = pd.DataFrame(np.random.randn(10,5))
df_q = pd.DataFrame()
for i in list(range(len(df.columns))):
    df_q[i] = pd.qcut(df[i], 5, labels=list(range(5)))
I am hoping there is a slick pandas solution for this to avoid the use of a loop.
Thanks!
pd.qcut accepts a 1D array or Series as its argument. To apply pd.qcut to every column requires multiple calls to pd.qcut, so no matter how you dress it up, there will be a loop -- either explicit or implicit.
You could, for example, use apply to call pd.qcut on each column:
In [46]: df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
Out[46]:
0 1 2 3 4
0 4 0 3 0 3
1 0 0 2 3 0
2 3 4 1 2 3
3 4 1 1 1 4
4 3 2 2 4 1
5 2 4 3 0 1
6 2 3 0 4 4
7 1 3 4 2 2
8 0 1 4 3 0
9 1 2 0 1 2
but under the hood, df.apply is using a for-loop, so it really isn't very different than your for-loop:
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
In [47]: %timeit df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
100 loops, best of 3: 2.9 ms per loop
In [48]: %%timeit
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
100 loops, best of 3: 2.95 ms per loop
Note that
for i in list(range(len(df.columns))):
will only work if the columns of df happen to be sequential integers starting at 0.
It is more robust to use
for col in df:
to iterate over the columns of the DataFrame.
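For example, with string column labels (a throwaway frame), iterating positions fails while iterating the frame yields the labels themselves:
import numpy as np
import pandas as pd
tmp = pd.DataFrame(np.random.randn(4, 2), columns=['a', 'b'])
# tmp[0] would raise KeyError: 0 -- positions are not labels
for col in tmp:
    print(col)   # 'a', then 'b'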
I have a pandas Series, like so,
data = [1,2,3,2,4,5,6,3,5]
ds = pd.Series(data)
print (ds)
0 1
1 2
2 3
3 2
4 4
5 5
6 6
7 3
8 5
I am interested in getting the standard deviation at each index. For example, when I am at index 5, I want to calculate the standard deviation of ds[0:4].
I have done this with the following code,
df = pd.DataFrame(columns = ['data', 'avreturns', 'sd'])
df.data = data
for i in df.index:
    dataslice = df.loc[0:i]
    df.loc[i, 'avreturns'] = dataslice.data.mean()
    df.loc[i, 'sd'] = dataslice.data.std()
print (df)
data avreturns sd
0 1 1 NaN
1 2 1.5 0.7071068
2 3 2 1
3 2 2 0.8164966
4 4 2.4 1.140175
5 5 2.833333 1.47196
6 6 3.285714 1.799471
7 3 3.25 1.669046
8 5 3.444444 1.666667
This works, but I am using a loop and it is slow. Is there a way to vectorize this?
I was able to vectorize the mean calculations by using the cumsum() function:
df.data.cumsum()/(df.index+1)
Is there a way to vectorize the standard deviation calculations?
You might be interested in pd.expanding_std, which calculates the cumulative standard deviation for you:
>>> pd.expanding_std(ds)
0 NaN
1 0.707107
2 1.000000
3 0.816497
4 1.140175
5 1.471960
6 1.799471
7 1.669046
8 1.666667
dtype: float64
For what it's worth, this type of cumulative operation might be very fiddly to vectorise: the Pandas implementation appears to loop using Cython for speed.
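For reference, pd.expanding_std has since been removed from pandas; the expanding-window API gives the same result:
import pandas as pd
ds = pd.Series([1, 2, 3, 2, 4, 5, 6, 3, 5])
ds.expanding().std()   # modern equivalent of pd.expanding_std(ds)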
To expand on @ajcr's answer, I ran %timeit against the two approaches. Using expanding_std is faster by several orders of magnitude...
data = [x for x in range(1000)]
ds = pd.Series(data)
df = pd.DataFrame(columns = ['data', 'avreturns', 'sd'])
df.data = data
def foo(df):
    for i in df.index:
        dataslice = df.loc[0:i]
        df.loc[i, 'avreturns'] = dataslice.data.mean()
        df.loc[i, 'sd'] = dataslice.data.std()
    return df
%timeit foo(df)
1 loops, best of 3: 1min 36s per loop
%timeit pd.expanding_std(df.data)
10000 loops, best of 3: 126 µs per loop