I have been struggling with a problem with a custom aggregate function in Pandas that I have not been able to figure out. Let's consider the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'value': np.arange(1, 5), 'weights':np.arange(1, 5)})
Now if I want to calculate the average of the value column using agg in Pandas, it would be:
df.agg({'value': 'mean'})
which results in a scalar value of 2.5 as shown in the following:
However, if I define the following custom mean function:
def my_mean(vec):
    return np.mean(vec)
and use it in the following code:
df.agg({'value': my_mean})
I would get the following result:
So the question here is: what should I do to get the same result as the default mean aggregate function? One more thing to note: if I use the mean method inside the custom function (shown below), it works just fine; however, I would like to know how to use the np.mean function in my custom function. Any help would be much appreciated!
def my_mean2(vec):
    return vec.mean()
When you pass a callable as the aggregate function, if that callable is not one of the predefined callables like np.mean, np.sum, etc., pandas treats it as a transform and it acts like df.apply().
The way around it is to let pandas know that your callable expects a vector of values. A crude way to do it is to have something like:
def my_mean(vals):
    print(type(vals))
    try:
        vals.shape
    except:
        raise TypeError()
    return np.mean(vals)
>>> df.agg({'value': my_mean})
<class 'int'>
<class 'pandas.core.series.Series'>
value 2.5
dtype: float64
You see, at first pandas tries to call the function on each row (like df.apply), but my_mean raises a TypeError, and on the second attempt pandas passes the whole column as a Series object. Comment the try...except part out and you'll see my_mean being called on each row with an int argument.
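A slightly cleaner variant of the same trick (my wording, not from the original answer) is an explicit isinstance check instead of probing .shape:

import numpy as np
import pandas as pd

def my_mean(vals):
    # Raise on scalar (per-row) calls so pandas falls back to
    # passing the whole column as a Series.
    if not isinstance(vals, pd.Series):
        raise TypeError("expected a Series")
    return np.mean(vals)

df = pd.DataFrame({'value': np.arange(1, 5), 'weights': np.arange(1, 5)})
print(df.agg({'value': my_mean}))
# value    2.5
# dtype: float64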
More on the first part:
my_mean1 = np.mean
my_mean2 = lambda *args, **kwargs: np.mean(*args, **kwargs)
df.agg({'value': my_mean1})
df.agg({'value': my_mean2})
Although my_mean2 and np.mean are essentially the same, since my_mean2 is np.mean evaluates to False, it'll go down the df.apply route, while my_mean1 works as expected.
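You can check that identity comparison directly:

import numpy as np

my_mean1 = np.mean
my_mean2 = lambda *args, **kwargs: np.mean(*args, **kwargs)

print(my_mean1 is np.mean)  # True  -> dispatched like the built-in 'mean'
print(my_mean2 is np.mean)  # False -> goes down the df.apply route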
I have the following dataframe, where I wanted to combine products that share the same value in the Match column.
I did that by searching online and using the following piece of code:
data2['Together'] = data2.groupby(by = ['Match'])['Product'].transform(lambda x : ','.join(x))
req = data2[['Order ID', 'Together']].drop_duplicates()
req
It gives the following result
Question 1
I tried to understand what was happening here by applying the same transform operation on each group separately; there the transform function operates elementwise and gives something like this. So how does pandas change the result for the command shown above?
Although the question might not be very clear, I still think posting an answer is better than deleting it.
As seen in the results above, when transform was applied to the whole GroupBy object, it applied the function to each group's whole Series and duplicated the result across the group, whereas when I applied the function to an individual Series (a single group), it ran the function on each single element, i.e. like the apply function of a Series.
After searching through the documentation and looking at the output of a custom function (below), this is what I found.
The groupby transform function passes each group's Series directly to the function, then checks whether the output matches the length of the passed object or is a scalar, in which case it broadcasts the output to that length.
But Series.transform first tries to apply the function elementwise (like Series.apply), and only if that fails does it apply the function to the whole object.
This is what I got after reading the source code; you can also see the output below. I created a function and called it with both transforms:
def func(val):
    print(type(val))
    return ','.join(val.tolist())
# For series transforms
<class 'str'>
<class 'str'>
# For groupby transforms
<class 'pandas.core.series.Series'>
Now, if I modify the function so that it can work only on a whole Series object and not on individual strings, observe how Series.transform behaves:
# Modified function (works on a whole Series, not on individual strings)
def func(val):
    print(type(val))
    return val.str.split().str[0]
#For Series transforms
<class 'str'>
<class 'pandas.core.series.Series'>
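For reference, a minimal, self-contained reproduction of both behaviors (the small frame here is invented for illustration; only the Match and Product columns matter):

import pandas as pd

data2 = pd.DataFrame({'Match': ['A', 'A', 'B'],
                      'Product': ['x', 'y', 'z']})

def func(val):
    print(type(val))
    return ','.join(val)

# groupby transform: func receives each group's whole Series and the
# scalar result is broadcast back over the group -> 'x,y', 'x,y', 'z'
print(data2.groupby('Match')['Product'].transform(func))

# Series transform: func is first tried element by element, so it
# receives individual strings.
print(data2['Product'].transform(func))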
If I read the CSV data and take the rolling mean the way given below, it produces the expected output:
data = pd.read_csv("abc.csv")
avg = data['A'].rolling(3).mean()
print(avg)
But if I pass the value the way given below, it produces an error.
import scipy.signal

dff1 = abs(data['A'])
b, a = scipy.signal.butter(2, 0.05, 'highpass')
dff = scipy.signal.filtfilt(b, a, dff1)  # returns a numpy ndarray
avg = dff.rolling(3).mean()              # fails: ndarray has no .rolling
print(avg)
Error is:
AttributeError: 'numpy.ndarray' object has no attribute 'rolling'
I can't figure it out, what is wrong with the code?
After applying dff = pd.DataFrame(dff), a new problem arises: one unexpected zero is displayed at the top.
What is the reason behind this? How to get rid of this problem?
rolling is a method of Pandas Series and DataFrames. SciPy knows nothing about these and produces NumPy ndarrays as output. It can accept DataFrames and Series as input, because the Pandas types can mimic ndarrays when needed.
The solution might be as simple as re-wrapping the ndarray as a dataframe using
dff = pd.DataFrame(dff)
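As a sketch of the full fixed pipeline (re-using abc.csv and column A from the question; the Series name is my own invention): the unexpected zero at the top is most likely just the default column label 0 that pd.DataFrame assigns, so wrapping the result as a named Series with the original index avoids it.

import pandas as pd
import scipy.signal

data = pd.read_csv("abc.csv")
b, a = scipy.signal.butter(2, 0.05, 'highpass')
filtered = scipy.signal.filtfilt(b, a, data['A'].abs())  # numpy ndarray

# Re-wrap as a pandas object so .rolling() is available again; the name
# replaces the bare default label 0 in the printed output.
dff = pd.Series(filtered, index=data.index, name='A_filtered')
avg = dff.rolling(3).mean()
print(avg)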
I have a dataframe called df. I need to pass columns as arguments to a function.
Outside the function, this code works :
df.colname.fillna(method='ffill')
If I use the same line inside a function and pass df.colname as the argument (colname = df.colname), it does not work; the line is ignored:
def Funct(colname):
    colname.fillna(method='ffill')
This works (colname = df.colname):
def Funct(colname):
    colname[1:] = colname[1:].fillna(method='ffill')
What's happening?
Does the function change the dataframe object to an array? Does this make the code inefficient, and is there a better way of doing this?
(Note: This is part of a larger function which I am paraphrasing here for simplicity)
fillna() by default does not update the dataframe in place; instead it returns a new object that you are expected to assign back, for example:
def Funct(colname):
    return colname.fillna(method='ffill')

df = Funct(df)
It's worth mentioning that fillna() does have an argument inplace= which you could set to True if you need to update in place.
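A runnable sketch of the assign-back pattern (column name and data invented for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'colname': [1.0, np.nan, np.nan, 4.0]})

def Funct(colname):
    # Returns a new Series; the caller's column is left untouched.
    return colname.fillna(method='ffill')

df['colname'] = Funct(df['colname'])  # assign the result back
print(df)
# or, updating the whole frame in place instead:
# df.fillna(method='ffill', inplace=True)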
I am puzzled by the behavior of the pandas function agg(). When I pass a custom aggregator without first calling groupby(), the aggregation fails. Why does this happen?
Specifically, when I run the following code ...
import pandas as pd
import numpy as np
def mymean(series):
    return np.mean(series)
frame = pd.DataFrame()
frame["a"] = [1,2,3]
display(frame.agg(["mean"]))
display(frame.agg([mymean]))
frame["const"] = 0
display(frame.groupby("const").agg(["mean"]))
display(frame.groupby("const").agg([mymean]))
... I obtain this:
In the second of the four calls to agg(), aggregation fails. Why is that? What I want is to be able to compute custom summary statistics on a frame without first calling groupby() so that I obtain a frame where each row corresponds to one statistic (mean, kurtosis, other stuff...)
Thanks for your help :-)
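(This looks like the same apply-first fallback described in the first answer above: a bare callable that happens to work on scalars is treated like apply. A sketch of the same guard, adapted to this example; whether the fallback fires can depend on the pandas version:)

import numpy as np
import pandas as pd

def mymean(series):
    # Reject scalar (per-element) calls so pandas retries with the whole Series.
    if not isinstance(series, pd.Series):
        raise TypeError("expected a Series")
    return np.mean(series)

frame = pd.DataFrame({"a": [1, 2, 3]})
print(frame.agg([mymean]))  # aggregates instead of failing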
I'm new to Python. I have a dataframe and I want to do min-max (0-1) scaling on every column (every attribute). I found the method MinMaxScaler, but I don't know how to use it with a dataframe.
from sklearn import preprocessing
def sci_minmax(X):
    minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
    return minmax_scale.fit_transform(X)
data_normalized = sci_minmax(data)
data_variance=data_normalized.var()
data_variance.head(10)
The error is 'numpy.float64' object has no attribute 'head'. I need the return type to be a dataframe.
There is no head method in scipy/numpy.
If you want a pandas.DataFrame, you'll have to call the constructor.
Any chance you mean to look at the first 10 records with head?
You can do this easily with numpy, too.
To select the first 10 records of an array, the Python syntax is array[:10]. With two-dimensional numpy arrays, you will want to specify rows and columns: array[:10, :] for the first 10 rows, or array[:, :10] for the first 10 columns.
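For example, a sketch of re-wrapping the scaler output (the small frame stands in for the question's data):

import pandas as pd
from sklearn import preprocessing

data = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 30.0, 20.0]})

scaled = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit_transform(data)

# fit_transform returns a numpy ndarray; rebuilding a DataFrame restores
# pandas methods such as .var() and .head().
data_normalized = pd.DataFrame(scaled, index=data.index, columns=data.columns)
data_variance = data_normalized.var()
print(data_variance.head(10))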