I have a data frame whose columns are to be transformed, and a second object containing the transformation parameters, one per column, in the form of a pd.Series:
import numpy as np
import pandas as pd
from scipy import stats

p = np.random.rand(5, 3)  # random data for the data frame
cols = ["A", "B", "C"]
df1 = pd.DataFrame(p, columns=cols)
a = np.array([0.3, 0.4, 0.5])  # create series of transform parameters
a = pd.Series(a, index=cols)
I wonder how to iterate over the df columns and transform each one with the appropriate transform parameter, something like below:
df1.apply(stats.boxcox, lmbda=a)
which of course does not work. My temporary solution is just a brute-force function:
def boxcox_transform(df, lambdas):
    df1 = pd.DataFrame(index=df.index)
    for column in list(df):
        df1[column] = stats.boxcox(df[column], lambdas[column])
    return df1

boxcox_transform(df1, a)
I wonder whether there is a more elegant solution, for example something like R's mapply, which can iterate over two lists in parallel.
You can use a lambda:
result_df = df1.apply(lambda col: stats.boxcox(col, a.loc[col.name]))
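This works because apply passes each column to the lambda as a Series whose .name attribute holds the column label, so a.loc[col.name] looks up the matching parameter. As a quick sanity check (a sketch, assuming the setup and boxcox_transform from the question):

looped = boxcox_transform(df1, a)
assert np.allclose(result_df, looped)  # apply and the loop give the same values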
Is there a fast way of adding a column to a data frame df with values depending on all the rows of df with smaller index? A very simple example where the new column only depends on the value of one other column would be df["new_col"] = df["old_col"].cumsum() (if df is ordered), but I have something more complicated in mind. Ideally, I'd like to write something like
df["new_col"] = df.[some function here](f),
where [some function] sets the i-th value of df["new_col"] to f(df[df.index <= df.index[i]]). (Ideally [some function] can also be applied to groupby() objects.)
At the moment I loop through rows, add a temporary column containing a dict of relevant values and then apply a function, but this is very slow, memory-inefficient, etc.
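One route worth sketching (not from this thread; f below is a hypothetical stand-in) is pandas' expanding windows for the single-column case, and a plain comprehension over prefixes for the general case:

import pandas as pd

df = pd.DataFrame({"old_col": [1, 3, 2, 5]})

# Single-column case: expanding() hands the function the first i+1 values,
# so cumsum() is equivalent to an expanding sum.
df["new_col"] = df["old_col"].expanding().apply(lambda s: s.sum())

# General case: f sees the whole sub-frame of rows with index <= i
# (inherently O(n^2), like any "all smaller indices" computation).
def f(sub):  # hypothetical function of the prefix sub-frame
    return sub["old_col"].sum()

df["new_col2"] = [f(df.iloc[:i + 1]) for i in range(len(df))]

expanding() is also available on groupby() objects, which covers the grouped case mentioned in the question.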
I'm doing:
df.apply(lambda x: x.rename(x.name + "_something"))
I think this should return the dataframe with _something appended to all the column names, but it just returns the same df.
What am I doing wrong?
EDIT: I need to act on the Series column by column, not on the dataframe object, as I'll be applying other transformations to x in the lambda, not shown here.
EDIT 2 Full Context:
I've got a time series dataframe, and I'm trying to generate features from the data.
I've written a bunch of primitive functions like:
def sumn(n, s):
    return s.rolling(n).sum().rename(s.name + "_sum_" + str(n))
When I apply those to Series, it renames them well.
When I apply them to columns in a DataFrame, the numerical transformation goes through, but the rename doesn't work.
(I suppose this implies that a DataFrame isn't just a collection of Series, which means that, in all likelihood, I now have to explicitly rename things on the df.)
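A minimal reproduction of the behavior, assuming a toy frame:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
out = df.apply(lambda x: x.rename(x.name + "_something"))
print(out.columns.tolist())  # ['a', 'b'] -- the renames are discarded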
I think you can do this using pd.concat:
pd.concat([df[e].rename(df[e].name + '_Something') for e in df], axis=1)
Inside the list comprehension, you can add your other logic:
df[e].rename(df[e].name + '_Something').apply(...)
If you directly use df.apply, you can't change the column names; there is no way that I can think of.
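One workaround (a sketch, not part of the answer above; the doubling transform is just a placeholder): do the numeric work with apply and rename everything afterwards with the stock add_suffix method:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

result = df.apply(lambda x: x * 2).add_suffix("_something")
print(result.columns.tolist())  # ['a_something', 'b_something']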
Suppose you have an array of functions. Each function returns a pandas.Series object with the same index and size. Each function takes the same input, the main dataframe df.
I'm looking for an output that has each of the series as a column of a resulting dataframe
Currently I have the following:
df_result = [f(df) for f in f_arr]
df_result = pd.DataFrame(df_result)
This takes a long time (there seems to be some overhead on the list operation) and the resulting dataframe is the transpose of what I need. I feel like there should be a clean map/apply way to do this.
Using
df_result = pd.concat(df_result, axis=1)
in place of the second line will avoid getting the transpose.
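A minimal end-to-end sketch (the two functions here are hypothetical stand-ins for f_arr):

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

f_arr = [lambda d: (d["x"] * 2).rename("doubled"),
         lambda d: (d["x"] + 10).rename("shifted")]

series_list = [f(df) for f in f_arr]
df_result = pd.concat(series_list, axis=1)  # one column per function

concat along axis=1 aligns the Series on their shared index, so each one becomes a column directly and no transpose step is needed.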
I have a pandas dataframe like this:
How can I calculate the mean (min/max, median) of a specific column if Cluster==1 or Cluster==2?
Thanks!
You can create a new df with only the relevant rows, using:
newdf = df[df['Cluster'].isin([1, 2])]
newdf.mean()
In order to calculate the mean of a specific column you can:
newdf["page"].mean()
If you meant take the mean only where Cluster is 1 or 2, then the other answers here address your issue. If you meant take a separate mean for each value of Cluster, you can use pandas' aggregation functions, including groupby and agg:
df.groupby("Cluster").mean()
is the simplest and will take means of all columns, grouped by Cluster.
df.groupby("Cluster").agg({"duration" : np.mean})
is an example where you are taking the mean of just one specific column, grouped by cluster. You can also use np.min, np.max, np.median, etc.
The groupby method produces a GroupBy object, which is something like but not like a DataFrame. Think of it as the DataFrame grouped, waiting for aggregation to be applied to it. The GroupBy object has simple built-in aggregation functions that apply to all columns (the mean() in the first example), and also a more general aggregation function (the agg() in the second example) that you can use to apply specific functions in a variety of ways. One way of using it is passing a dict of column names keyed to functions, so specific functions can be applied to specific columns.
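For instance, a small sketch (the frame here is hypothetical, reusing the column names from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({"Cluster": [1, 1, 2, 2, 3],
                   "duration": [10.0, 20.0, 30.0, 40.0, 50.0],
                   "page": [1, 2, 3, 4, 5]})

# dict of column -> function: a specific aggregation per column
df.groupby("Cluster").agg({"duration": np.mean, "page": np.median})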
You can do it in one line, using boolean indexing. For example you can do something like:
import numpy as np
import pandas as pd
# This will just produce an example DataFrame
df = pd.DataFrame({'a': np.arange(30), 'Cluster': np.ones(30, dtype=int)})
df.loc[10:19, "Cluster"] *= 2
df.loc[20:, "Cluster"] *= 3
# This line is all you need
df.loc[(df['Cluster'] == 1) | (df['Cluster'] == 2), 'a'].mean()
The boolean indexing array is True for the correct clusters. a is just the name of the column to compute the mean over.
Simple intuitive answer
First pick the rows of interest, then average, then pick the columns of interest.
clusters_of_interest = [1, 2]
columns_of_interest = ['page']
# rows of interest
newdf = df[ df.CLUSTER.isin(clusters_of_interest) ]
# average and pick columns of interest
newdf.mean(axis=0)[ columns_of_interest ]
More advanced
# Create a groups object according to the value in the 'CLUSTER' column
grp = df.groupby('CLUSTER')
# apply the functions of interest to all cluster groupings
data_agg = grp.agg(['mean', 'max', 'min'])
The pandas documentation on groupby aggregation is also a good reference for these techniques. It should be noted that the "simple answer" averages over clusters 1 AND 2 together (or whatever is specified in clusters_of_interest), while the .agg function aggregates separately over each group of rows having the same CLUSTER value.
I browsed for a while but could not find a way to group a pandas data frame using a function.
For example, assume:
df2=df1.groupby(df1['ColA']).sum()
Can we define a function f such that:
df2=df1.groupby(f).sum()
Can this function f also take inputs from several columns of df1? For example, what if the key according to which the grouping is done is a function of df1['ColA'] and df1['ColC']? I cannot find any example of this, although it seems it should be possible from the API doc at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html.
Thanks
You could apply f first, and pass the return value to groupby:
df2 = df1.groupby(f(df1['ColA'], df1['ColB'])).sum()
Note that you can pass a list of arrays to groupby.
So, if you have two functions and want to use both return values as keys, you could do this:
df2 = df1.groupby([f(df1['ColA'], df1['ColB']),
                   g(df1['ColC'], df1['ColD'])]).sum()
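For a concrete sketch (the data and the key function are hypothetical; the key here is just the parity of the sum of two columns):

import pandas as pd

df1 = pd.DataFrame({"ColA": [1, 2, 3, 4],
                    "ColC": [10, 10, 20, 20],
                    "val": [1.0, 2.0, 3.0, 4.0]})

def f(col_a, col_c):  # hypothetical key function of two columns
    return (col_a + col_c) % 2

df2 = df1.groupby(f(df1["ColA"], df1["ColC"])).sum()

Since groupby accepts any Series aligned to the frame's index, the key can be computed from as many columns as needed.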