I just wanted to know what the difference is between what these two do.
Data:
import pandas as pd
df = pd.DataFrame({"ID":["A","B","A","C","A","A","C","B"], "value":[1,2,4,3,6,7,3,4]})
reset_index() :
df_group1 = df.groupby("ID").sum().reset_index()
as_index=False :
df_group2 = df.groupby("ID", as_index=False).sum()
Both of them give the exact same output.
ID value
0 A 18
1 B 6
2 C 6
Can anyone tell me what the difference is, with an example illustrating it?
When you use as_index=False, you indicate to groupby() that you don't want to set the column ID as the index (duh!). When both implementations yield the same results, use as_index=False, because it will save you some typing and an unnecessary pandas operation ;)
However, sometimes you want to apply more complicated operations to your groups. On those occasions, you might find that one is better suited than the other.
Example 1: You want to sum the values of three variables (i.e. columns) in a group on both axes.
Using as_index=True allows you to apply a sum over axis=1 without specifying the names of the columns, and then to sum the result over axis=0. When the operation is finished, you can use reset_index(drop=True/False) to get the dataframe back into the right form.
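A minimal sketch of Example 1, assuming a made-up frame with three value columns (the names x, y and z are illustrative only):
import pandas as pd
# Hypothetical data: one grouping column and three value columns
df = pd.DataFrame({"ID": ["A", "B", "A", "C"],
                   "x": [1, 2, 4, 3],
                   "y": [6, 7, 3, 4],
                   "z": [2, 1, 5, 0]})
# With as_index=True (the default) "ID" becomes the index, so every
# remaining column is numeric and the axis=1 sum needs no column selection.
totals = df.groupby("ID").sum().sum(axis=1)
# Bring "ID" back as a regular column once the computation is done.
totals = totals.reset_index(name="total")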
Example 2: You need to set a value for the group based on the columns in the groupby().
Setting as_index=False allows you to check the condition on a common column and not on an index, which is often way easier.
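A minimal sketch of Example 2, reusing the df from the question (the flag column and the condition on "A" are just placeholders):
import numpy as np
import pandas as pd
df = pd.DataFrame({"ID":["A","B","A","C","A","A","C","B"], "value":[1,2,4,3,6,7,3,4]})
# "ID" stays a regular column, so the condition can be written against it
# directly instead of against the index.
grouped = df.groupby("ID", as_index=False).sum()
grouped["flag"] = np.where(grouped["ID"] == "A", "target", "other")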
At some point, you might come across a KeyError when applying operations on groups. In that case, it is often because you are trying to use, in your aggregate function, a column that is currently an index of your GroupBy object.
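A quick illustration of that KeyError (the frame is a shortened copy of the one from the question):
import pandas as pd
df = pd.DataFrame({"ID": ["A", "B", "A", "C"], "value": [1, 2, 4, 3]})
summed = df.groupby("ID").sum()   # "ID" is now the index, not a column
# summed["ID"]                    # raises KeyError: 'ID'
summed.reset_index()["ID"]        # fine once the index is back as a column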
In a complex chained method using pandas, one of the steps is grouping data by a column and then calculating some metrics. This is a simplified example of the procedure I want to achieve. I have many more assignments in the workflow, but it is failing miserably at this first one.
import pandas as pd
import numpy as np
data = pd.DataFrame({'Group':['A','A','A','B','B','B'],'first':[1,12,4,5,4,3],'last':[5,3,4,5,2,7,]})
data.groupby('Group').assign(average_ratio=lambda x: np.mean(x['first']/x['last']))
>>>> AttributeError: 'DataFrameGroupBy' object has no attribute 'assign'
I know I could use apply this way:
data.groupby('Group').apply(lambda x: np.mean(x['first']/x['last']))
Group
A 1.733333
B 1.142857
dtype: float64
or much better, renaming the column in the same step:
data.groupby('Group').apply(lambda x: pd.Series({'average_ratio':np.mean(x['first']/x['last'])}))
average_ratio
Group
A 1.733333
B 1.142857
Is there any way of using .assign to obtain the same result?
To answer your last question: for your needs, no, you cannot. The method DataFrame.assign simply adds new columns or replaces existing ones, and returns the same-indexed DataFrame with the new/adjusted columns.
You are attempting a grouped aggregation that reduces the rows to group level, thereby changing the index and the DataFrame granularity from unit level to aggregated group level. Therefore you need to run your groupby operations without assign.
To encapsulate multiple aggregated columns in a way that fits a chained process, define a method and then apply it accordingly:
def aggfunc(row):
    # "row" is actually a whole group (a DataFrame), one call per Group value
    row['first_mean'] = np.mean(row['first'])
    row['last_mean'] = np.mean(row['last'])
    row['average_ratio'] = np.mean(row['first'].div(row['last']))
    return row
agg_data = data.groupby('Group').apply(aggfunc)
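If all you need is the single ratio column, one chain-friendly sketch (an assumption on my part, not part of the original answer) is to assign the per-row ratio before grouping and then reduce it with a named aggregation:
import pandas as pd
data = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
                     'first': [1, 12, 4, 5, 4, 3],
                     'last': [5, 3, 4, 5, 2, 7]})
# assign works on the plain DataFrame, so compute the per-row ratio first,
# then collapse to group level.
result = (data
          .assign(ratio=lambda d: d['first'] / d['last'])
          .groupby('Group', as_index=False)
          .agg(average_ratio=('ratio', 'mean')))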
I have a data set where I want to match the index row and change the value of a column within that row.
I have looked at map and loc and have been able to locate the data using df.loc, but that filters the data down; all I want to do is change the value in a column of that row when the row is found.
What is the best approach - my original post can be found here:
Original post
It's simple to do in Excel, but I'm struggling with Pandas.
Edit:
I have this so far, which seems to work, but the result includes a lot of extra numbers after the total calculation, along with dtype: int64
import pandas as pd
df = pd.read_csv(r'C:\Users\david\Documents\test.csv')
multiply = {2.1: df['Rate'] * df['Quantity']}
df['Total'] = df['Code'].map(multiply)
df.head()
How do I get around this?
The pandas method mask is likely a good option here. Mask takes two main arguments: a condition and something with which to replace values that meet that condition.
If you're trying to replace values with a formula that draws on values from multiple dataframe columns, you'll also want to pass in an additional axis argument.
The condition: this would be something like, for instance:
df['Code'] == 2.1
The replacement value: this can be a single value, a series/dataframe, or (most valuable for your purposes) a function/callable. For example:
df['Rate'] * df['Quantity']
The axis: Because you're passing a function/callable as the replacement argument, you need to tell mask() how to find those values. It might look something like this:
axis=0
So all together, the code would read like this:
df['Total'] = df['Code'].mask(
    df['Code'] == 2.1,
    df['Rate'] * df['Quantity'],
    axis=0
)
I would like to take a categorical column, group by individual type and then sum each type
I am using the Python code below, and its result is what I want
data2 = data.groupby(['service_type']).sum().unstack()
popular_ser2 = data2.sort_values(ascending = False).head(10).droplevel(0)
popular_ser2
I would like to confirm whether my code is logical, because the need for unstack and droplevel is uncommon to see when using groupby and sort_values.
I have a Pandas Dataframe that has some missing values. I would like to fill the missing values with something that doesn't influence the statistics that I will do on the data.
As an example, if in Excel you try to average a cell that contains 5 and an empty cell, the average will be 5. I'd like to have the same in Python.
I tried to fill with NaN but if I sum a certain column, for example, the result is NaN.
I also tried to fill with None but I get an error because I'm summing different datatypes.
Can somebody help? Thank you in advance.
There are many answers to your two questions.
Here is a solution for your first one:
If you wish to insert a value into the NaN entries of your DataFrame that won't alter your statistics, then I would suggest using the mean of that data.
Example:
df # your dataframe with NaN values
df.fillna(df.mean(), inplace=True)
For the second question:
If you need to check descriptive statistics from your dataframe, and that descriptive stats should not be influenced by the NaN values, here are two solutions for it:
1)
df # your dataframe with NaN values
df.fillna(df.mean(), inplace=True)
df.mean()
df.std()
# or even:
df.describe()
2)
I would suggest using the NumPy NaN-aware functions such as numpy.nansum, numpy.nanmean, and numpy.nanstd:
df.apply(numpy.nansum)
df.apply(numpy.nanstd) #...
The answer to your question is that missing values work differently in Pandas than in Excel. You can read about the technical reasons for that here. Basically, there is no magic number that we can fill a df with that will cause Pandas to just overlook it. Depending on our needs, we will sometimes choose to fill the missing values, sometimes to drop them (either permanently or for the duration of a calculation), or sometimes to use methods that can work with them (e.g. numpy.nansum, as Philipe Riskalla Leal mentioned).
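A small sketch of those options, assuming a toy column with a single missing value:
import numpy as np
import pandas as pd
s = pd.Series([5, np.nan])
s.mean()           # 5.0 -- pandas skips NaN by default (skipna=True)
s.dropna().mean()  # 5.0 -- drop the gap just for this calculation
s.fillna(0).mean() # 2.5 -- filling with a number does change the result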
You can use df.fillna(). Here is an example of how to do it.
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, 1, np.nan],
                   [2, np.nan, 3, 4],
                   [4, np.nan, np.nan, 3],
                   [np.nan, 2, 1, np.nan]], columns=list('ABCD'))
df.fillna(0.0)
Generally, filling values with something like 0 will affect the statistics you run on your data.
So go for the mean of the data, which will make sure it doesn't affect your statistics.
So, use df.fillna(df.mean()) instead
If you want to change the datatype of a specific column so that its missing values are filled with NaN for statistical operations, you can simply use the line of code below. It will convert all values of that column to a numeric type, all missing values will automatically be replaced with NaN, and it will not affect your statistical operations.
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')
If you want to do the same for all the columns in dataframe you can use:
for i in df.columns:
    df[i] = pd.to_numeric(df[i], errors='coerce')
I have a pandas dataframe like this:
How can I calculate the mean (min/max, median) for a specific column if Cluster==1 or Cluster==2?
Thanks!
You can create a new df with only the relevant rows, using:
newdf = df[df['cluster'].isin([1, 2])]
newdf.mean()
In order to calc the mean of a specific column you can:
newdf["page"].mean()
If you meant taking the mean only where Cluster is 1 or 2, then the other answers here address your issue. If you meant taking a separate mean for each value of Cluster, you can use pandas' aggregation functions, including groupby and agg:
df.groupby("Cluster").mean()
is the simplest and will take means of all columns, grouped by Cluster.
df.groupby("Cluster").agg({"duration" : np.mean})
is an example where you are taking the mean of just one specific column, grouped by cluster. You can also use np.min, np.max, np.median, etc.
The groupby method produces a GroupBy object, which is something like but not like a DataFrame. Think of it as the DataFrame grouped, waiting for aggregation to be applied to it. The GroupBy object has simple built-in aggregation functions that apply to all columns (the mean() in the first example), and also a more general aggregation function (the agg() in the second example) that you can use to apply specific functions in a variety of ways. One way of using it is passing a dict of column names keyed to functions, so specific functions can be applied to specific columns.
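For instance, a sketch of that dict form with more than one function per column (the frame and column names are made up for illustration):
import numpy as np
import pandas as pd
df = pd.DataFrame({"Cluster": [1, 1, 2, 2, 3],
                   "duration": [10, 20, 30, 40, 50],
                   "page": [1, 3, 2, 5, 4]})
# One function for "duration", a list of functions for "page"; only the
# columns named in the dict appear in the result, grouped by Cluster.
df.groupby("Cluster").agg({"duration": np.mean,
                           "page": [np.min, np.max, np.median]})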
You can do it in one line, using boolean indexing. For example you can do something like:
import numpy as np
import pandas as pd
# This will just produce an example DataFrame
df = pd.DataFrame({'a': np.arange(30), 'Cluster': np.ones(30, dtype=int)})
df.loc[10:19, "Cluster"] *= 2
df.loc[20:, "Cluster"] *= 3
# This line is all you need
df.loc[(df['Cluster']==1)|(df['Cluster']==2), 'a'].mean()
The boolean indexing array is True for the correct clusters. a is just the name of the column to compute the mean over.
Simple intuitive answer
First pick the rows of interest, then average, then pick the columns of interest.
clusters_of_interest = [1, 2]
columns_of_interest = ['page']
# rows of interest
newdf = df[ df.CLUSTER.isin(clusters_of_interest) ]
# average and pick columns of interest
newdf.mean(axis=0)[ columns_of_interest ]
More advanced
# Create groups object according to the value in the 'cluster' column
grp = df.groupby('CLUSTER')
# apply functions of interest to all cluster groupings
data_agg = grp.agg( ['mean' , 'max' , 'min' ] )
This is also a good link that describes aggregation techniques. It should be noted that the "simple answer" averages over clusters 1 AND 2 together (or whatever is specified in clusters_of_interest), while the .agg function averages separately over each group of values having the same CLUSTER value.