Pandas Mean for Certain Column - python

I have a pandas dataframe like that:
How can I able to calculate mean (min/max, median) for specific column if Cluster==1 or CLuster==2?
Thanks!

You can create new df with only the relevant rows, using:
newdf = df[df['cluster'].isin([1,2)]
newdf.mean(axis=1)
In order to calc mean of a specfic column you can:
newdf["page"].mean(axis=1)

If you meant take the mean only where Cluster is 1 or 2, then the other answers here address your issue. If you meant take a separate mean for each value of Cluster, you can use pandas' aggregation functions, including groupyby and agg:
df.groupby("Cluster").mean()
is the simplest and will take means of all columns, grouped by Cluster.
df.groupby("Cluster").agg({"duration" : np.mean})
is an example where you are taking the mean of just one specific column, grouped by cluster. You can also use np.min, np.max, np.median, etc.
The groupby method produces a GroupBy object, which is something like but not like a DataFrame. Think of it as the DataFrame grouped, waiting for aggregation to be applied to it. The GroupBy object has simple built-in aggregation functions that apply to all columns (the mean() in the first example), and also a more general aggregation function (the agg() in the second example) that you can use to apply specific functions in a variety of ways. One way of using it is passing a dict of column names keyed to functions, so specific functions can be applied to specific columns.

You can do it in one line, using boolean indexing. For example you can do something like:
import numpy as np
import pandas as pd
# This will just produce an example DataFrame
df = pd.DataFrame({'a':np.arange(30), 'Cluster':np.ones(30,dtype=np.int)})
df.loc[10:19, "Cluster"] *= 2
df.loc[20:, "Cluster"] *= 3
# This line is all you need
df.loc[(df['Cluster']==1)|(df['Cluster']==2), 'a'].mean()
The boolean indexing array is True for the correct clusters. a is just the name of the column to compute the mean over.

Simple intuitive answer
First pick the rows of interest, then average then pick the columns of interest.
clusters_of_interest = [1, 2]
columns_of_interest = ['page']
# rows of interest
newdf = df[ df.CLUSTER.isin(clusters_of_interest) ]
# average and pick columns of interest
newdf.mean(axis=0)[ columns_of_interest ]
More advanced
# Create groups object according to the value in the 'cluster' column
grp = df.groupby('CLUSTER')
# apply functions of interest to all cluster groupings
data_agg = grp.agg( ['mean' , 'max' , 'min' ] )
This is also a good link which describes aggregation techniques. It should be noted that the "simple answer" averages over clusters 1 AND 2 or whatever is specified in the clusters_of_interest while the .agg function averages over each group of values having the same CLUSTER value.

Related

assign not working in grouped pandas dataframe

In a complex chained method using pandas, one of the steps is grouping data by a column and then calculate some metrics. This is a simplified example of the procedure i want to achieve. I have many more assignments in the workflow but is failing miserabily at first.
import pandas as pd
import numpy as np
data = pd.DataFrame({'Group':['A','A','A','B','B','B'],'first':[1,12,4,5,4,3],'last':[5,3,4,5,2,7,]})
data.groupby('Group').assign(average_ratio=lambda x: np.mean(x['first']/x['last']))
>>>> AttributeError: 'DataFrameGroupBy' object has no attribute 'assign'
I know i could use apply this way:
data.groupby('Group').apply(lambda x: np.mean(x['first']/x['last']))
Group
A 1.733333
B 1.142857
dtype: float64
or much better, renaming the column in the same step:
data.groupby('Group').apply(lambda x: pd.Series({'average_ratio':np.mean(x['first']/x['last'])}))
average_ratio
Group
A 1.733333
B 1.142857
Is there any way of using .assign to obtain the same?
To answer last question, for your needs no you cannot. The method, DataFrame.assign simply adds new columns or replace existing columns but return the same index DataFrame with new/adjusted columns.
You are attempted a grouped aggregation that reduces the rows to group level and thereby changing the index and DataFrame granularity from unit level to aggregated grouped level. Therefore you need to run your groupby operations without assign.
To encapsulate multiple assigned aggregated columns that aligns to chained process, use a defined method and then apply it accordingly:
def aggfunc(row):
row['first_mean'] = np.mean(row['first'])
row['last_mean'] = np.mean(row['last'])
row['average_ratio'] = np.mean(row['first'].div(row['last']))
return row
agg_data = data.groupby('Group').apply(aggfunc)

Map Value to Specific Row and Column - Python Pandas

I have a data set where i want to match the index row and change the value of a column within that row.
I have looked at map and loc and have been able to locate the data use df.loc but it filters that data down, all i want to do is change the value in a column on that row when that row is found.
What is the best approach - my original post can be found here:
Original post
It's simple to do in excel but struggling with Pandas.
Edit:
I have this so far which seems to work but it includes a lot of numbers after the total calculation along with dtype: int64
import pandas as pd
df = pd.read_csv(r'C:\Users\david\Documents\test.csv')
multiply = {2.1: df['Rate'] * df['Quantity']}
df['Total'] = df['Code'].map(multiply)
df.head()
how do i get around this?
The pandas method mask is likely a good option here. Mask takes two main arguments: a condition and something with which to replace values that meet that condition.
If you're trying to replace values with a formula that draws on values from multiple dataframe columns, you'll also want to pass in an additional axis argument.
The condition: this would be something like, for instance:
df['Code'] == 2.1
The replacement value: this can be a single value, a series/dataframe, or (most valuable for your purposes) a function/callable. For example:
df['Rate'] * df['Quantity']
The axis: Because you're passing a function/callable as the replacement argument, you need to tell mask() how to find those values. It might look something like this:
axis=0
So all together, the code would read like this:
df['Total'] = df['Code'].mask(
df['Code'] == 2.1,
df['Rate'] * df['Quantity'],
axis=0
)

Difference between "as_index = False", and "reset_index()" in pandas groupby

I just wanted to know what is the difference in the function performed by these 2.
Data:
import pandas as pd
df = pd.DataFrame({"ID":["A","B","A","C","A","A","C","B"], "value":[1,2,4,3,6,7,3,4]})
as_index=False :
df_group1 = df.groupby("ID").sum().reset_index()
reset_index() :
df_group2 = df.groupby("ID", as_index=False).sum()
Both of them give the exact same output.
ID value
0 A 18
1 B 6
2 C 6
Can anyone tell me what is the difference and any example illustrating the same?
When you use as_index=False, you indicate to groupby() that you don't want to set the column ID as the index (duh!). When both implementation yield the same results, use as_index=False because it will save you some typing and an unnecessary pandas operation ;)
However, sometimes, you want to apply more complicated operations on your groups. In those occasions, you might find out that one is more suited than the other.
Example 1: You want to sum the values of three variables (i.e. columns) in a group on both axes.
Using as_index=True allows you to apply a sum over axis=1 without specifying the names of the columns, then summing the value over axis 0. When the operation is finished, you can use reset_index(drop=True/False) to get the dataframe under the right form.
Example 2: You need to set a value for the group based on the columns in the groupby().
Setting as_index=False allow you to check the condition on a common column and not on an index, which is often way easier.
At some point, you might come across KeyError when applying operations on groups. In that case, it is often because you are trying to use a column in your aggregate function that is currently an index of your GroupBy object.

How do I Substract values in one column with the values of another in a DataFrame?

I have the following dataframe:
I want to see which country has the biggest difference between the column "Gold" and "Gold 1". The index currently is the countries.
As an example, with Afghanistan, it would be 0 - 0 = 0. I do this with every country and than the highest number in that list is my response. That's how I figured I want to do it.
Does anyone know how I can do that? Or is there a built-in function that can calculate that?
You can subtract these two columns using the built-in vector subtraction:
df1['Gold'] - df2['Gold 1']
The country with biggest difference is
df.Gold.sub(df['Gold 1']).idxmax()
The biggest absolute difference
df.Gold.sub(df['Gold 1']).abs().idxmax()
You can also sort this by the difference
df.loc[df.Gold.sub(df['Gold 1']).sort_values().index]
Or the absolute differences
df.loc[df.Gold.sub(df['Gold 1']).abs().sort_values().index]
you can try out the code below:
import pandas as pd
df = pd.DataFrame([['Afgh',0,0],['Agnt',18,0]], columns = ['Country','Gold','Gold1'])
df['GoldDiff'] = df['Gold'] - df['Gold1']
df.sort_values(by = 'GoldDiff', ascending = False)
df is just a test dataframe based on yours above. df['GoldDiff'] creates a new column to store differences.
Then you can simply sort the values by using the sort_values function from pandas. You can also add the option inplace = True if you want to modify your dataframe as the sorted one.

Manipulate A Group Column in Pandas

I have a data set with columns Dist, Class, and Count.
I want to group that data set by dist and divide the count column of each group by the sum of the counts for that group (normalize it to one).
The following MWE demonstrates my approach thus far. But I wonder: is there a more compact/pandaific way of writing this?
import pandas as pd
import numpy as np
a = np.random.randint(0,4,(10,3))
s = pd.DataFrame(a,columns=['Dist','Class','Count'])
def manipcolumn(x):
csum = x['Count'].sum()
x['Count'] = x['Count'].apply(lambda x: x/csum)
return x
s.groupby('Dist').apply(manipcolumn)
One alternative way to get the normalised 'Count' column could be to use groupby and transform to get the sums for each group and then divide the returned Series by the 'Count' column. You can reassign this Series back to your DataFrame:
s['Count'] = s['Count'] / s.groupby('Dist')['Count'].transform(np.sum)
This avoids the need for a bespoke Python function and the use of apply. Testing it for the small example DataFrame in your question showed that it was around 8 times faster.

Categories