I am puzzled by the behavior of the pandas function agg(). When I pass a custom aggregator without first calling groupby(), the aggregation fails. Why does this happen?
Specifically, when I run the following code ...
import pandas as pd
import numpy as np
def mymean(series):
    return np.mean(series)
frame = pd.DataFrame()
frame["a"] = [1,2,3]
display(frame.agg(["mean"]))
display(frame.agg([mymean]))
frame["const"] = 0
display(frame.groupby("const").agg(["mean"]))
display(frame.groupby("const").agg([mymean]))
... I obtain this:
In the second of the four calls to agg(), the aggregation fails. Why is that? What I want is to be able to compute custom summary statistics on a frame without first calling groupby(), so that I obtain a frame where each row corresponds to one statistic (mean, kurtosis, and so on).
Thanks for your help :-)
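One workaround (a sketch under the assumption that only per-column reductions are needed) is to apply each statistic with frame.apply and stack the results into one row per statistic, sidestepping agg's callable dispatch entirely:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3]})

def mymean(series):
    return np.mean(series)

# Apply each statistic column-wise, then stack: one row per statistic.
stats = [mymean]
summary = pd.DataFrame({fn.__name__: frame.apply(fn) for fn in stats}).T
print(summary)  # one row labeled "mymean", one column "a"
```

Extending the stats list with further callables (kurtosis etc.) adds one row per function.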
Related
I have been struggling with a problem with a custom aggregate function in pandas that I have not been able to figure out. Let's consider the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'value': np.arange(1, 5), 'weights':np.arange(1, 5)})
Now, if I want to calculate the average of the value column using agg in pandas, it would be:
df.agg({'value': 'mean'})
which results in a scalar value of 2.5, as shown in the following:
However, if I define the following custom mean function:
def my_mean(vec):
    return np.mean(vec)
and use it in the following code:
df.agg({'value': my_mean})
I would get the following result:
So, the question here is: what should I do to get the same result as the default mean aggregate function? One more thing to note: if I use the mean method inside the custom function (shown below), it works just fine. However, I would like to know how to use the np.mean function in my custom function. Any help would be much appreciated!
def my_mean2(vec):
    return vec.mean()
When you pass a callable as the aggregate function, if that callable is not one of the predefined callables like np.mean, np.sum, etc., pandas treats it as a transform and acts like df.apply(). The way around it is to let pandas know that your callable expects a vector of values. A crude way to do it is to have something like:
def my_mean(vals):
    print(type(vals))
    try:
        vals.shape
    except:
        raise TypeError()
    return np.mean(vals)
>>> df.agg({'value': my_mean})
<class 'int'>
<class 'pandas.core.series.Series'>
value 2.5
dtype: float64
You see, at first pandas tries to call the function on each row (like df.apply), but my_mean raises a TypeError, and on the second attempt pandas passes the whole column as a Series object. Comment out the try...except part and you'll see my_mean being called on each row with an int argument.
More on the first part:
my_mean1 = np.mean
my_mean2 = lambda *args, **kwargs: np.mean(*args, **kwargs)
df.agg({'value': my_mean1})
df.agg({'value': my_mean2})
Although my_mean2 and np.mean are essentially the same, since `my_mean2 is np.mean` evaluates to False, it'll go down the df.apply route, while my_mean1 works as expected.
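To make the identity point concrete (a small check; the exact agg dispatch details vary across pandas versions, so treat the second call as illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(1, 5)})
my_mean2 = lambda *args, **kwargs: np.mean(*args, **kwargs)

# Functionally the same as np.mean, but a distinct object, so the
# identity check pandas relies on fails:
print(my_mean2 is np.mean)         # False
print(df.agg({'value': np.mean}))  # recognized callable: reduces to 2.5
```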
Is it possible to "apply" a user function to a python datatable after groupby?
For example:
import datatable as dt
from datatable import f, by, sum
df = dt.Frame(SYM=['A','A','A','B','B'], xval=[1.1,1.2,2.3,2.4,2.5])
print(df[:, sum(f.xval), by(f.SYM)])
This works. But I would like to replace the "sum" function with a user function defined using:
def func(x):
    # do some operations here; e.g. ranking
    y = x
    return y
Is this possible? Can you please provide an example (maybe using numpy rank inside func above)?
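For comparison, the per-group ranking asked about here is straightforward in pandas; this is only a pandas sketch on the same data, not datatable syntax:

```python
import pandas as pd

pdf = pd.DataFrame({'SYM': ['A', 'A', 'A', 'B', 'B'],
                    'xval': [1.1, 1.2, 2.3, 2.4, 2.5]})
# Rank xval within each SYM group.
pdf['rank'] = pdf.groupby('SYM')['xval'].rank()
print(pdf['rank'].tolist())  # [1.0, 2.0, 3.0, 1.0, 2.0]
```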
I get error messages with the following code. The peculiar thing is that if I start a debugger, I can see that within the list comprehension the dmatr object is (as I expected) a DataFrame (view).
However, the debugger also shows that within the subsequent invocation of the sumsq function, the parameter D is a Series and not a DataFrame. I really don't get what's happening here. I am just invoking a method (apply) on a DataFrame, yet the function acts as if the method were invoked on a Series object. Can anyone help me understand what is going on here?
d: pandas DataFrame
import pandas as pd
import numpy as np
import collections
from sklearn.linear_model import LogisticRegression
def fisher(d, colIndex=0):
    k = d.shape[1]  # number of variables
    sel = list(map(lambda x: x != colIndex, list(range(0, k))))
    dfull = d.iloc[:, sel]
    zw = dfull.groupby(d.iloc[:, colIndex])
    def sumsq(D):
        return D  # when invoked from the line below, the passed argument seems to be a Series!
    return [dmatr.apply(sumsq) for (name, dmatr) in zw]  # herein dmatr is a DataFrame
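The behaviour can be reproduced without the fisher machinery: DataFrame.apply invokes its function once per column, handing each column over as a Series (a minimal illustration, not the original data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# apply calls the function once per column; each call receives a Series
kinds = df.apply(lambda col: type(col).__name__)
print(kinds.tolist())  # ['Series', 'Series']
```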
I am looking to execute a function over all arguments in a list (map could do that part) and then "join" them using another function that could be exited early (say if the objective was to find an instance or reach a threshold).
Here is an example where the function is ~np.isnan over a variable number of columns from a data frame, and the "join" is the bitwise & operator on the resulting boolean masks. So it finds whether there are any NaN values in a data frame at the locations corresponding to a variable list of columns. It then removes the rows where a NaN is found for the supplied column names.
import pandas as pd
import numpy as np
import random
data_values = range(10)
column_names = list(map(lambda x: "C" + str(x), data_values))
data = pd.DataFrame(columns=column_names, data=np.reshape(np.repeat(data_values,10,0),(10,10)))
data.iloc[random.sample(data_values,random.sample(data_values,1)[0]),random.sample(data_values,random.sample(data_values,1)[0])] = np.nan
cols_to_check = random.sample(column_names,random.sample(data_values,1)[0])
# ideally: data.loc[pd.notnull(data[cols_to_check[0]]) & pd.notnull(data[cols_to_check[1]]) & ...]
# or perhaps: data.loc[chainFunc(pd.notnull, np.logical_and, cols_to_check)]
masks = [list(np.where(~np.isnan(data[x]))[0]) for x in cols_to_check]
data.iloc[list(set(masks[0]).intersection(*masks))]
This becomes extremely slow on large data frames but is it possible to generalize this using the itertools and functools and drastically improve performance? Say something like (pseudocode):
def chainFunc(func_applied, func_chain, args):
    x = func_applied(args[0])
    for arg_counter in range(len(args) - 1):
        x = func_chain(x, func_applied(args[arg_counter + 1]))
    return x
How would it work on the data frame example above?
I was looking for a generic way to combine an arbitrary list of arguments and apply the result to a data frame. I guess in the above example the application is close to dropna, but not exactly. I was looking for a combination of reduce and chain; there is no pandas-specific interface for this, but it is possible to get something working:
import functools
data.iloc[np.where(functools.reduce(lambda x, y: x & y,
                                    map(lambda z: pd.notnull(data[z]),
                                        cols_to_check)))[0]]
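The same reduce pattern can be wrapped into the chainFunc signature from the question (a sketch; note that functools.reduce does not exit early, so a genuinely short-circuiting join would need an explicit loop):

```python
import functools
import numpy as np
import pandas as pd

def chain_func(func_applied, func_chain, args):
    # Map func_applied over args, then fold the results with func_chain.
    return functools.reduce(func_chain, (func_applied(a) for a in args))

data = pd.DataFrame({'C0': [1.0, np.nan, 3.0],
                     'C1': [1.0, 2.0, np.nan]})
mask = chain_func(lambda c: pd.notnull(data[c]),
                  lambda x, y: x & y,
                  ['C0', 'C1'])
print(data.loc[mask])  # keeps only the row with no NaN in C0 or C1
```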
I would like to apply a specific function (in this case a logit model) to a dataframe which can be grouped (by the variable "model"). I know the task can be performed through a loop; however, I believe this to be inefficient at best. Example code below:
import pandas as pd
import numpy as np
import statsmodels.api as sm
df1=pd.DataFrame(np.random.randint(0,100,size=(100,10)),columns=list('abcdefghij'))
df2=pd.DataFrame(np.random.randint(0,100,size=(100,10)),columns=list('abcdefghij'))
df1['model']=1
df1['target']=np.random.randint(2,size=100)
df2['model']=2
df2['target']=np.random.randint(2,size=100)
data=pd.concat([df1,df2])
### Clunky, but works...
for i in range(1, 2+1):
    lm = sm.Logit(data[data['model']==i]['target'],
                  sm.add_constant(data[data['model']==i].drop(['target'], axis=1))).fit(disp=0)
    print(lm.summary2())
### Can this work?
def elegant(self):
    lm = sm.Logit(data['target'],
                  sm.add_constant(data.drop(['target'], axis=1))).fit(disp=0)
better = data.groupby(['model']).apply(elegant)
If the above groupby can work, is this a more efficient way to perform than looping?
This could work:
def elegant(df):
lm = sm.Logit(df['target'],
sm.add_constant(df.drop(['target'],axis=1))).fit(disp=0)
return lm
better = data.groupby('model').apply(elegant)
Using .apply, you pass the group DataFrames to the function elegant, so elegant has to take a DataFrame as its first argument here. Also, your function needs to return the result of your calculation, lm.
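A quick way to convince yourself that apply hands each group over as a DataFrame (a toy example, unrelated to the statsmodels fit):

```python
import pandas as pd

data = pd.DataFrame({'model': [1, 1, 2], 'x': [10.0, 20.0, 30.0]})
# Each group arrives as a DataFrame, so type(...) names it accordingly.
types = data.groupby('model').apply(lambda g: type(g).__name__)
print(types.tolist())  # ['DataFrame', 'DataFrame']
```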
For more complex functions, the following structure can be used:
def some_func(df, kw_param=1):
    # some calculations on df using kw_param
    return df
better = data.groupby('model').apply(lambda group: some_func(group, kw_param=99))
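An equivalent spelling, if the lambda feels noisy, binds the keyword with functools.partial; the some_func body below is a hypothetical placeholder, not the original model fit:

```python
import functools
import pandas as pd

def some_func(df, kw_param=1):
    # Hypothetical calculation: sum the x column, scaled by kw_param.
    return df['x'].sum() * kw_param

data = pd.DataFrame({'model': [1, 1, 2], 'x': [1, 2, 10]})
better = data.groupby('model').apply(functools.partial(some_func, kw_param=10))
print(better.tolist())  # [30, 100]
```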