Apply an aggregate function to a python datatable column after group by - python

Is it possible to "apply" a user function to a python datatable after groupby?
For example:
import datatable as dt
from datatable import f, by, sum
df = dt.Frame(SYM=['A','A','A','B','B'], xval=[1.1,1.2,2.3,2.4,2.5])
print(df[:, sum(f.xval), by(f.SYM)])
This works. But I would like to replace the "sum" function with a user function defined using:
def func(x):
    # do some operations here; e.g. ranking
    y = x
    return y
Is this possible? Can you please provide an example (maybe using numpy.rank inside func above)?
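One possible workaround, since (to my knowledge) datatable's f-expressions do not accept arbitrary Python callables: round-trip through pandas, whose groupby does accept user functions, and convert back. A minimal sketch, using a per-group rank in place of func:
import datatable as dt
import pandas as pd

df = dt.Frame(SYM=['A','A','A','B','B'], xval=[1.1,1.2,2.3,2.4,2.5])

# Convert to pandas, rank xval within each SYM group, convert back.
pdf = df.to_pandas()
pdf['rank'] = pdf.groupby('SYM')['xval'].rank()
df = dt.Frame(pdf)
print(df)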

Related

How do I apply a function over a column?

I have created a function I would like to apply over a given dataframe column. Is there an apply function so that I can create a new column and apply my created function?
Example code:
dat = pd.DataFrame({'title': ['cat', 'dog', 'lion','turtle']})
Manual method that works:
print(calc_similarity(chosen_article, str(dat['title'][1]), model_word2vec))
print(calc_similarity(chosen_article, str(dat['title'][2]), model_word2vec))
Attempt to apply over dataframe column:
dat['similarity'] = calc_similarity(chosen_article, str(dat['title']), model_word2vec)
The issue I have been running into is that the function outputs the same result over the entirety of the newly created column.
I have tried apply() as follows:
dat['similarity'] = dat['title'].apply(lambda x: calc_similarity(chosen_article, str(x), model_word2vec))
and
dat['similarity'] = dat['title'].astype(str).apply(lambda x: calc_similarity(chosen_article, x, model_word2vec))
Both of which result in a ZeroDivisionError, which I don't understand since I am not passing empty strings.
Function being used:
def calc_similarity(input1, input2, vectors):
    s1words = set(vocab_check(vectors, input1.split()))
    s2words = set(vocab_check(vectors, input2.split()))
    output = vectors.n_similarity(s1words, s2words)
    return output
It sounds like you are having difficulty applying a function while passing additional keyword arguments. Here's how you can do that:
# By default, apply passes each value of the column as the function's
# first argument; additional keyword arguments can be given to apply itself.
dat['similarity'] = dat['title'].apply(
    calc_similarity,
    input2=chosen_article,
    vectors=model_word2vec
)
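As for the ZeroDivisionError: gensim raises it from n_similarity when either word list is empty, so the likely culprit is a row whose tokens are all filtered out by vocab_check (out-of-vocabulary words), not an empty input string.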

How to use pandas aggregate with custom function

I am puzzled by the behavior of the pandas function agg(). When I pass a custom aggregator without first calling groupby(), the aggregation fails. Why does this happen?
Specifically, when I run the following code ...
import pandas as pd
import numpy as np
def mymean(series):
    return np.mean(series)
frame = pd.DataFrame()
frame["a"] = [1,2,3]
display(frame.agg(["mean"]))
display(frame.agg([mymean]))
frame["const"] = 0
display(frame.groupby("const").agg(["mean"]))
display(frame.groupby("const").agg([mymean]))
... I obtain the following: in the second of the four calls to agg(), the aggregation fails. Why is that? What I want is to be able to compute custom summary statistics on a frame without first calling groupby(), so that I obtain a frame where each row corresponds to one statistic (mean, kurtosis, other stuff...).
Thanks for your help :-)
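A possible workaround, sketched below, sidesteps agg()'s handling of function lists rather than explaining the failure: apply each statistic column-wise yourself and stack the results, so each row of the summary corresponds to one statistic (the stats dict and its names are illustrative):
import pandas as pd
import numpy as np

def mymean(series):
    return np.mean(series)

frame = pd.DataFrame({"a": [1, 2, 3]})

# Apply each statistic to every column, then transpose so that
# rows are statistics and columns are the frame's columns.
stats = {"mymean": mymean, "kurtosis": pd.Series.kurt}
summary = pd.DataFrame({name: frame.apply(func) for name, func in stats.items()}).T
print(summary)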

chain join multiple arguments from a list of variable size using a supplied bivariate function

I am looking to execute a function over all arguments in a list (map could do that part) and then "join" them using another function that could be exited early (say if the objective was to find an instance or reach a threshold).
Here is an example where the function is ~np.isnan over a variable number of columns from a data frame, and the "join" is the bitwise & operator on the resulting boolean masks. It finds whether there are any NaN values in the data frame at locations corresponding to a variable list of columns, and then removes the rows where a NaN is found for the supplied column names.
import pandas as pd
import numpy as np
import random
data_values = range(10)
column_names = list(map(lambda x: "C" + str(x), data_values))
data = pd.DataFrame(columns=column_names, data=np.reshape(np.repeat(data_values,10,0),(10,10)))
data.iloc[random.sample(data_values,random.sample(data_values,1)[0]),random.sample(data_values,random.sample(data_values,1)[0])] = np.nan
cols_to_check = random.sample(column_names,random.sample(data_values,1)[0])
# ideally: data.loc[pd.notnull(data[cols_to_check[0]]) & pd.notnull(data[cols_to_check[1]]) & ...]
# or perhaps: data.loc[chainFunc(pd.notnull, np.logical_and, cols_to_check)]
masks = [list(np.where(~np.isnan(data[x]))[0]) for x in cols_to_check]
data.iloc[list(set(masks[0]).intersection(*masks))]
This becomes extremely slow on large data frames, but is it possible to generalize it using itertools and functools and drastically improve performance? Say something like (pseudocode):
def chainFunc(func_applied, func_chain, args):
    x = func_applied(args[0])
    for arg_counter in range(len(args) - 1):
        x = func_chain(x, func_applied(args[arg_counter + 1]))
    return x
How would it work on the data frame example above?
I was looking for a generic way to combine an arbitrary list of arguments and apply the result to a data frame. I guess in the above example the application is close to dropna but not exactly. I was looking for a combination of reduce and chain; there is no real pandas-specific interface for this, but it is possible to get something working:
import functools
data.iloc[np.where(functools.reduce(lambda x, y: x & y,
                                    map(lambda z: pd.notnull(data[z]),
                                        cols_to_check)))[0]]
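The pseudocode above also generalizes as-is; a minimal sketch (the name chain_func and the mask-building lambdas are illustrative):
import functools

def chain_func(func_applied, func_chain, args):
    # Map func_applied over args, then fold the results together with func_chain.
    return functools.reduce(func_chain, map(func_applied, args))

# Build one boolean mask per column, AND them together, and keep
# only the rows with no NaN in the supplied columns.
mask = chain_func(lambda c: pd.notnull(data[c]), lambda x, y: x & y, cols_to_check)
filtered = data.loc[mask]
Note that reduce always consumes the whole iterable, so this does not exit early; for a short-circuiting join (finding a first instance, reaching a threshold), a plain loop with break, or any()/all() over a generator, is the idiomatic route.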

Python equivalent to Spark rangeBetween for window?

I am trying to find out if there is a way in Python to do the equivalent of Spark's rangeBetween in a rolling aggregation. In Spark, you can use rangeBetween so that the window does not have to be symmetrical around the targeted row, i.e., for each row I can look from -5h to +3h: all rows that happen between 5 hours before and 3 hours after, based on a datetime column. I know that pandas has the rolling option, but after reading all the documentation I can find, it looks like it only takes one input as the window. You can change whether that window is centered on each row or not, but I can't find a way to explicitly set it so it can look at a range of my choosing.
Does anyone know of another function or functionality that I am not aware of that would work to do this?
I'm not sure if it's the best answer but it's mine and it works so I guess it'll have to do until there is a better option. I made a python function out of it so you can sub in whatever aggregation function you want.
import pandas as pd
from datetime import timedelta

def rolling_stat(pdf, lower_bound, upper_bound, group, statistic='mean'):
    data_agg = pd.DataFrame()
    for grp in pdf[group].drop_duplicates():
        dataframe_grp = pdf[pdf[group] == grp].sort_index()
        for index, row in dataframe_grp.iterrows():
            # Asymmetric window: lower_bound minutes back, upper_bound minutes forward.
            lower = index - timedelta(minutes=lower_bound)
            upper = index + timedelta(minutes=upper_bound)
            agg = dataframe_grp.loc[lower:upper]['nbr'].agg([statistic])
            dataframe_grp.at[index, 'agg'] = agg.iloc[0]
        data_agg = pd.concat([data_agg, dataframe_grp])
    return data_agg
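A hypothetical call, assuming pdf has a DatetimeIndex, a value column named 'nbr', and a grouping column named 'grp':
# Look 5 hours (300 min) back and 3 hours (180 min) forward around each row.
result = rolling_stat(pdf, lower_bound=300, upper_bound=180, group='grp', statistic='mean')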

Apply Function or Lambda to Pandas GROUPBY

I would like to apply a specific function (in this case a logit model) to a dataframe which can be grouped (by the variable "model"). I know the task can be performed through a loop; however, I believe this to be inefficient at best. Example code below:
import pandas as pd
import numpy as np
import statsmodels.api as sm
df1=pd.DataFrame(np.random.randint(0,100,size=(100,10)),columns=list('abcdefghij'))
df2=pd.DataFrame(np.random.randint(0,100,size=(100,10)),columns=list('abcdefghij'))
df1['model']=1
df1['target']=np.random.randint(2,size=100)
df2['model']=2
df2['target']=np.random.randint(2,size=100)
data=pd.concat([df1,df2])
### Clunky, but works...
for i in range(1, 2+1):
    lm = sm.Logit(data[data['model']==i]['target'],
                  sm.add_constant(data[data['model']==i].drop(['target'], axis=1))).fit(disp=0)
    print(lm.summary2())
### Can this work?
def elegant(self):
    lm = sm.Logit(data['target'],
                  sm.add_constant(data.drop(['target'], axis=1))).fit(disp=0)
better = data.groupby(['model']).apply(elegant)
If the above groupby can work, is it a more efficient approach than looping?
This could work:
def elegant(df):
    lm = sm.Logit(df['target'],
                  sm.add_constant(df.drop(['target'], axis=1))).fit(disp=0)
    return lm

better = data.groupby('model').apply(elegant)
With .apply you pass the dataframe groups to the function elegant, so elegant has to take a dataframe as its first argument here. Your function also needs to return the result of the calculation, lm. Note that .apply still calls elegant once per group in Python, so the gain over the explicit loop is mainly cleaner code rather than raw speed.
For more complex functions, the following structure can be used:
def some_func(df, kw_param=1):
    # some calculations on df using kw_param
    return df

better = data.groupby('model').apply(lambda group: some_func(group, kw_param=99))
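Since elegant returns the fitted model for each group, better is a Series of results indexed by the 'model' value; a hypothetical way to inspect them:
# Print each group's fitted summary.
for model_id, result in better.items():
    print(result.summary2())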
