In a complex method chain using pandas, one of the steps is grouping data by a column and then calculating some metrics. This is a simplified example of the procedure I want to achieve. I have many more assignments in the workflow, but it fails at the very first one.
import pandas as pd
import numpy as np
data = pd.DataFrame({'Group':['A','A','A','B','B','B'],'first':[1,12,4,5,4,3],'last':[5,3,4,5,2,7,]})
data.groupby('Group').assign(average_ratio=lambda x: np.mean(x['first']/x['last']))
>>>> AttributeError: 'DataFrameGroupBy' object has no attribute 'assign'
I know I could use apply this way:
data.groupby('Group').apply(lambda x: np.mean(x['first']/x['last']))
Group
A 1.733333
B 1.142857
dtype: float64
or much better, renaming the column in the same step:
data.groupby('Group').apply(lambda x: pd.Series({'average_ratio':np.mean(x['first']/x['last'])}))
average_ratio
Group
A 1.733333
B 1.142857
Is there any way of using .assign to obtain the same?
To answer your last question: no, for your needs you cannot. DataFrame.assign simply adds new columns or replaces existing ones, and returns a DataFrame with the same index plus the new or adjusted columns.
You are attempting a grouped aggregation, which reduces the rows to group level and thereby changes the index and the granularity of the DataFrame from unit level to aggregated group level. Therefore you need to run your groupby operations without assign.
To encapsulate multiple assigned aggregated columns in a way that fits a chained process, define a function and apply it to the groups:
def aggfunc(row):
    # Despite its name, `row` is the sub-DataFrame holding one group.
    row['first_mean'] = np.mean(row['first'])
    row['last_mean'] = np.mean(row['last'])
    row['average_ratio'] = np.mean(row['first'].div(row['last']))
    return row

agg_data = data.groupby('Group').apply(aggfunc)
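If the goal is to keep everything in a single chain, one option is to compute the row-level ratio with assign before grouping and aggregate it afterwards. This is only a minimal sketch, assuming named aggregation (pandas 0.25 or later) is available; the intermediate column name ratio is made up for illustration:

```python
import pandas as pd

data = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'first': [1, 12, 4, 5, 4, 3],
    'last': [5, 3, 4, 5, 2, 7],
})

result = (
    data
    # row-level step: assign works here because the index is unchanged
    .assign(ratio=lambda d: d['first'] / d['last'])
    # group-level step: reduce each group to a single row
    .groupby('Group')
    .agg(average_ratio=('ratio', 'mean'))
)
print(result)
```

The assign call still operates on the ungrouped DataFrame, so it does not contradict the point above: the reduction itself is done by agg.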
Related
My dataframe is a pandas dataframe df with many rows & columns.
Now I want to create a new column (Series) based on the values of an object column, e.g.:
df.loc[0, 'oldcolumn'] is 0, which should give me 0 in the new column, and
df.loc[1, 'oldcolumn'] is 'ab%$.', which should give me 5 in the same new column (number of characters, incl. spaces).
In addition, is there a way to avoid loops or own functions?
Thank you
To create a new column based on the length of the value in another column, you should do
df['newcol'] = df['oldcol'].apply(lambda x: len(str(x)))
Although this is a generic way of creating a new column based on data from existing columns, Henry's approach is also a good one.
In addition, is there a way to avoid loops or own functions?
I recommend you take a look at How To Make Your Pandas Loop 71803 Times Faster.
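As for avoiding a hand-written lambda, the .str accessor offers a vectorized length. A small sketch with made-up sample values; note that after astype(str) the integer 0 becomes the one-character string '0', so it is counted as length 1, which may or may not be what the question wants:

```python
import pandas as pd

# Hypothetical object column mixing an integer and strings
df = pd.DataFrame({'oldcol': [0, 'ab%$.', 'x y']})

# Convert every value to text, then take the length of each string
df['newcol'] = df['oldcol'].astype(str).str.len()
print(df)
```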
You can try this:
df['strlen'] = df['oldcolumn'].astype(str).apply(len)  # astype(str) so that non-string values such as 0 do not raise a TypeError
print(df)
I'm doing:
df.apply(lambda x: x.rename(x.name + "_something"))
I think this should return the column with _something appended to all columns, but it just returns the same df.
What am I doing wrong?
EDIT: I need to act on the series column by column, not on the dataframe object, as I'll be applying other transformations to x in the lambda, not shown here.
EDIT 2 Full Context:
I've got a time series dataframe, and I'm trying to generate features from the data.
I've written a bunch of primitive functions like:
def sumn(n, s):
    return s.rolling(n).sum().rename(s.name + "_sum_" + str(n))
When I apply those to Series, it renames them well.
When I apply them to columns in a DataFrame, the numerical transformation goes through, but the rename doesn't work.
(I suppose it implies that a DataFrame isn't just a collection of Series, which means in all likelihood, I now have to explicitly rename things on the df)
I think you can do this using pd.concat:
pd.concat([df[e].rename(df[e].name + '_Something') for e in df], axis=1)
Inside the list comprehension, you can add your other logic:
df[e].rename(df[e].name + '_Something').apply(...)
If you use df.apply directly, you can't change the column names; there is no way that I can think of.
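Applied to the sumn primitive from the question, the pd.concat approach might look like the sketch below. The two-column DataFrame is invented for illustration:

```python
import pandas as pd

def sumn(n, s):
    # Rolling sum over n rows, renamed after the source column
    return s.rolling(n).sum().rename(f"{s.name}_sum_{n}")

# Made-up data standing in for the time series
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# Build each renamed Series separately, then glue them side by side;
# concat keeps the name of every Series as its column label.
features = pd.concat([sumn(2, df[col]) for col in df], axis=1)
print(features.columns.tolist())
```

Because each Series is renamed before concatenation, the new names survive, which is what df.apply does not give you.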
I just wanted to know what is the difference in the function performed by these 2.
Data:
import pandas as pd
df = pd.DataFrame({"ID":["A","B","A","C","A","A","C","B"], "value":[1,2,4,3,6,7,3,4]})
reset_index() :
df_group1 = df.groupby("ID").sum().reset_index()
as_index=False :
df_group2 = df.groupby("ID", as_index=False).sum()
Both of them give the exact same output.
ID value
0 A 18
1 B 6
2 C 6
Can anyone tell me what is the difference and any example illustrating the same?
When you use as_index=False, you indicate to groupby() that you don't want to set the column ID as the index (duh!). When both implementations yield the same results, use as_index=False because it saves you some typing and an unnecessary pandas operation ;)
However, sometimes you want to apply more complicated operations to your groups. On those occasions, you might find that one is better suited than the other.
Example 1: You want to sum the values of three variables (i.e. columns) in a group on both axes.
Using as_index=True allows you to apply a sum over axis=1 without specifying the names of the columns, and then sum the values over axis 0. When the operation is finished, you can use reset_index(drop=True/False) to get the DataFrame into the right form.
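One possible reading of Example 1, as a rough sketch: the columns x, y and z are invented, and here the per-group sum is taken first and the sum across columns second.

```python
import pandas as pd

# Made-up data with three value columns
df = pd.DataFrame({
    "ID": ["A", "B", "A"],
    "x": [1, 2, 3],
    "y": [10, 20, 30],
    "z": [100, 200, 300],
})

# Default as_index=True: ID moves to the index, so only x, y, z remain as columns
per_group = df.groupby("ID").sum()

# Sum across the remaining columns without naming them,
# then bring ID back as a regular column
totals = per_group.sum(axis=1).reset_index(name="total")
print(totals)
```

With as_index=False, ID would stay among the columns and the axis=1 sum would have to exclude it explicitly.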
Example 2: You need to set a value for the group based on the columns in the groupby().
Setting as_index=False allows you to check the condition on a regular column rather than on an index, which is often much easier.
At some point, you might come across KeyError when applying operations on groups. In that case, it is often because you are trying to use a column in your aggregate function that is currently an index of your GroupBy object.
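A small sketch of the difference, with made-up data. It shows the KeyError after aggregation rather than inside an aggregate function, but the cause is the same: ID has become the index and is no longer a column. The is_A column is only an illustration of a per-group condition:

```python
import pandas as pd

df = pd.DataFrame({"ID": ["A", "B", "A", "C"], "value": [1, 2, 4, 3]})

indexed = df.groupby("ID").sum()               # ID becomes the index
flat = df.groupby("ID", as_index=False).sum()  # ID stays a regular column

# With ID as a column, a condition on the group key is a plain boolean mask
flat["is_A"] = flat["ID"] == "A"

# With ID as the index, asking for it as a column raises KeyError
key_error_raised = False
try:
    indexed["ID"]
except KeyError:
    key_error_raised = True

print(flat)
print(key_error_raised)
```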
I have a pandas dataframe like that:
How can I calculate the mean (min/max, median) of a specific column where Cluster==1 or Cluster==2?
Thanks!
You can create a new df with only the relevant rows, using:
newdf = df[df['cluster'].isin([1, 2])]
newdf.mean(axis=0)
In order to calculate the mean of a specific column you can do:
newdf["page"].mean()
If you meant taking the mean only where Cluster is 1 or 2, then the other answers here address your issue. If you meant taking a separate mean for each value of Cluster, you can use pandas' aggregation functions, including groupby and agg:
df.groupby("Cluster").mean()
is the simplest and will take means of all columns, grouped by Cluster.
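df.groupby("Cluster").agg({"duration": "mean"})
is an example where you are taking the mean of just one specific column, grouped by cluster. You can also use "min", "max", "median", etc. (NumPy functions such as np.mean are accepted too, but recent pandas versions warn about them and recommend the string names.)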
The groupby method produces a GroupBy object, which resembles a DataFrame but is not one. Think of it as the grouped DataFrame, waiting for an aggregation to be applied to it. The GroupBy object has simple built-in aggregation functions that apply to all columns (the mean() in the first example), and also a more general aggregation function (the agg() in the second example) that you can use to apply specific functions in a variety of ways. One way of using it is to pass a dict that maps column names to functions, so that specific functions are applied to specific columns.
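To illustrate the dict form of agg, here is a hedged sketch. The data are invented, and the column names duration and page are borrowed from the examples in this thread:

```python
import pandas as pd

# Made-up data
df = pd.DataFrame({
    "Cluster": [1, 1, 2, 2, 3],
    "duration": [10.0, 20.0, 30.0, 50.0, 70.0],
    "page": [1, 3, 5, 7, 9],
})

# Each key is a column, each value a list of aggregations for that column.
# The result has one row per cluster and two-level column labels.
summary = df.groupby("Cluster").agg({
    "duration": ["mean"],
    "page": ["min", "max", "median"],
})
print(summary)
```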
You can do it in one line, using boolean indexing. For example you can do something like:
import numpy as np
import pandas as pd
# This will just produce an example DataFrame
df = pd.DataFrame({'a': np.arange(30), 'Cluster': np.ones(30, dtype=int)})  # np.int was removed in NumPy 1.24; plain int works
df.loc[10:19, "Cluster"] *= 2
df.loc[20:, "Cluster"] *= 3
# This line is all you need
df.loc[(df['Cluster']==1)|(df['Cluster']==2), 'a'].mean()
The boolean indexing array is True for the correct clusters. a is just the name of the column to compute the mean over.
Simple intuitive answer
First pick the rows of interest, then average, then pick the columns of interest.
clusters_of_interest = [1, 2]
columns_of_interest = ['page']
# rows of interest
newdf = df[ df.CLUSTER.isin(clusters_of_interest) ]
# average and pick columns of interest
newdf.mean(axis=0)[ columns_of_interest ]
More advanced
# Create groups object according to the value in the 'cluster' column
grp = df.groupby('CLUSTER')
# apply functions of interest to all cluster groupings
data_agg = grp.agg( ['mean' , 'max' , 'min' ] )
This is also a good link which describes aggregation techniques. It should be noted that the "simple answer" averages over clusters 1 AND 2 or whatever is specified in the clusters_of_interest while the .agg function averages over each group of values having the same CLUSTER value.
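The distinction can be seen on a tiny made-up DataFrame. This is a sketch only, reusing the CLUSTER and page column names from this answer:

```python
import pandas as pd

# Made-up data
df = pd.DataFrame({
    "CLUSTER": [1, 1, 2, 2, 3],
    "page": [2, 4, 6, 8, 100],
})

# "Simple answer": one mean over all rows belonging to cluster 1 or 2
pooled = df.loc[df["CLUSTER"].isin([1, 2]), "page"].mean()

# groupby: a separate mean for every cluster value
per_cluster = df.groupby("CLUSTER")["page"].mean()

print(pooled)
print(per_cluster)
```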
I have a data set with columns Dist, Class, and Count.
I want to group that data set by dist and divide the count column of each group by the sum of the counts for that group (normalize it to one).
The following MWE demonstrates my approach thus far. But I wonder: is there a more compact/pandaific way of writing this?
import pandas as pd
import numpy as np
a = np.random.randint(0,4,(10,3))
s = pd.DataFrame(a,columns=['Dist','Class','Count'])
def manipcolumn(x):
    csum = x['Count'].sum()
    x['Count'] = x['Count'].apply(lambda x: x/csum)
    return x
s.groupby('Dist').apply(manipcolumn)
One alternative way to get the normalised 'Count' column could be to use groupby and transform to get the sums for each group and then divide the 'Count' column by the returned Series. You can reassign the result back to your DataFrame:
s['Count'] = s['Count'] / s.groupby('Dist')['Count'].transform('sum')
This avoids the need for a bespoke Python function and the use of apply. Testing it for the small example DataFrame in your question showed that it was around 8 times faster.
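A quick way to check the transform approach is with fixed data instead of random values (random counts can produce a group whose sum is zero). This is a sketch with invented numbers:

```python
import pandas as pd

# Made-up data with two Dist groups
s = pd.DataFrame({
    'Dist': [0, 0, 1, 1, 1],
    'Class': [0, 1, 0, 1, 2],
    'Count': [1, 3, 2, 2, 4],
})

# transform('sum') returns the group total aligned to every original row,
# so the division is element-wise and keeps the original index
s['Count'] = s['Count'] / s.groupby('Dist')['Count'].transform('sum')

# Each group should now sum to one
group_totals = s.groupby('Dist')['Count'].sum()
print(s)
print(group_totals)
```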