Here's the code I currently have:
df.groupby(df['LOCAL_COUNCIL']).agg({'CRIME_RATING': ['mean', 'count']}).reset_index()
which returns something like the following (I've made these values up):
  CRIME_RATING
          mean count
0     3.000000     1
1     3.118397    39
2     2.790698    32
3     5.125000    18
4     4.000000     1
5     4.222222    22
but I'd quite like to exclude indexes 0 and 4 from the resulting dataframe given that they both have a count of 1. Can this be done?
Use Series.ne to filter out the rows whose count equals 1: select the count column of the MultiIndex with a tuple, then filter with boolean indexing:
df1 = df.groupby(df['LOCAL_COUNCIL']).agg({'CRIME_RATING': ['mean', 'count']}).reset_index()
df2 = df1[df1[('CRIME_RATING','count')].ne(1)]
If you want to avoid the MultiIndex, use named aggregation:
df1 = df.groupby(df['LOCAL_COUNCIL']).agg(mean=('CRIME_RATING', 'mean'),
                                          count=('CRIME_RATING', 'count'))
df2 = df1[df1['count'].ne(1)]
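As a minimal, self-contained sketch (the council labels and ratings below are invented, since the original data isn't shown), both variants look like this:

import pandas as pd

# Invented sample data standing in for the original frame
df = pd.DataFrame({'LOCAL_COUNCIL': ['A', 'B', 'B', 'C'],
                   'CRIME_RATING': [3.0, 2.5, 3.5, 4.0]})

# MultiIndex variant: select the count column with a tuple
df1 = df.groupby('LOCAL_COUNCIL').agg({'CRIME_RATING': ['mean', 'count']}).reset_index()
print(df1[df1[('CRIME_RATING', 'count')].ne(1)])

# Named-aggregation variant: flat column names
df2 = df.groupby('LOCAL_COUNCIL').agg(mean=('CRIME_RATING', 'mean'),
                                      count=('CRIME_RATING', 'count'))
print(df2[df2['count'].ne(1)])   # only council B survives; A and C have count 1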
I have the following dataframe:
df
  Group  Dist
0     A     5
1     B     2
2     A     3
3     B     1
4     B     0
5     A     5
I am trying to drop all rows that match Group if the Dist column equals zero. This works to delete row 4:
df = df[df.Dist != 0]
however I also want to delete rows 1 and 3 so I am left with:
df
  Group  Dist
0     A     5
2     A     3
5     A     5
Any ideas on how to drop the group based off this condition?
Thanks!
First get all Group values where Dist == 0, then filter them out by checking the Group column against the mask inverted by ~:
df1 = df[~df['Group'].isin(df.loc[df.Dist == 0, 'Group'])]
print (df1)
  Group  Dist
0     A     5
2     A     3
5     A     5
Or you can use GroupBy.transform with 'all' to test whether each group contains no 0 values:
df1 = df[(df.Dist != 0).groupby(df['Group']).transform('all')]
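For intuition, here is what the intermediate mask looks like on the sample frame above (a sketch; transform('all') broadcasts each group's result back to that group's rows):

import pandas as pd

df = pd.DataFrame({'Group': ['A', 'B', 'A', 'B', 'B', 'A'],
                   'Dist': [5, 2, 3, 1, 0, 5]})

mask = (df.Dist != 0).groupby(df['Group']).transform('all')
print(mask.tolist())   # [True, False, True, False, False, True]
print(df[mask])        # keeps only group A, which has no zeros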
EDIT: To remove all groups that contain missing values:
df2 = df[df['Dist'].notna().groupby(df['Group']).transform('all')]
To test for missing values:
print (df[df['Dist'].isna()])
If this returns nothing, there are no missing values (NaN or None).
You can also check a scalar, e.g. the value in the row with index 10:
print (df.loc[10, 'Dist'])
print (type(df.loc[10, 'Dist']))
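To test such a scalar programmatically, pd.isna also works on single values (a sketch; it assumes a row with index 10 exists):

value = df.loc[10, 'Dist']
print(pd.isna(value))   # True for NaN or None, False otherwise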
You can use groupby and the filter method:
df.groupby('Group').filter(lambda x: x['Dist'].ne(0).all())
Output:
  Group  Dist
0     A     5
2     A     3
5     A     5
If you want to filter out groups with missing values:
df.groupby('Group').filter(lambda x: x['Dist'].notna().all())
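A quick sketch with invented data containing a missing value shows the notna variant in action:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'B', 'B'],
                   'Dist': [5, 3, np.nan, 1]})

print(df.groupby('Group').filter(lambda x: x['Dist'].notna().all()))
#   Group  Dist
# 0     A   5.0
# 1     A   3.0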
I am trying to build a table that has groups that are divided by subgroups with count and average for each subgroup. For example, I want to convert the following data frame:
To a table that looks like this where the interval is a bigger group and columns a thru i become subgroups within the group with the corresponding subgroups' count and average in each cell:
I have tried this with no success:
Try:
df.groupby(['interval']).apply(lambda x: x.stack()
                                          .groupby(level=-1)
                                          .agg(['count', 'mean']))
This uses groupby with apply to run a function on each group: stack the columns, group the stacked values by column name, and aggregate with count and mean (a list rather than a set, so the output column order is stable).
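Since the original frame was posted as an image, here is a sketch on invented data (the column names a and b and all values are made up). It applies the same stack-then-group idea, reshaped as set_index + stack so that interval stays out of the stacked values:

import pandas as pd

# Invented stand-in for the OP's data
df = pd.DataFrame({'interval': [0, 0, 1],
                   'a': [5, 3, 2],
                   'b': [0, 4, 1]})

out = (df.set_index('interval')   # keep the grouping key out of the values
         .stack()                 # long Series indexed by (interval, subgroup)
         .groupby(level=[0, 1])
         .agg(['count', 'mean']))
print(out)
# interval 0: a -> count 2, mean 4.0; b -> count 2, mean 2.0
# interval 1: a -> count 1, mean 2.0; b -> count 1, mean 1.0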
Use DataFrame.melt with GroupBy.agg, passing (new_name, function) tuples to name the aggregated columns:
df1 = (df.melt('interval', var_name='source')
.groupby(['interval','source'])['value']
.agg([('cnt','count'), ('average','mean')])
.reset_index())
print (df1.head())
   interval source  cnt  average
0         0      a    1      5.0
1         0      b    1      0.0
2         0      c    1      0.0
3         0      d    1      0.0
4         0      f    1      0.0
The following code solves the problem I asked about:
df.groupby(['interval'], as_index=False).agg({
    'a': ['count', 'mean'],
    'b': ['count', 'mean'],
    'c': ['count', 'mean'],
    'd': ['count', 'mean'],
    'f': ['count', 'mean'],
    'g': ['count', 'mean'],
    'i': ['count', 'mean']
})
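A dict comprehension avoids repeating the same spec for every column (a sketch; the column list simply mirrors the one above):

cols = ['a', 'b', 'c', 'd', 'f', 'g', 'i']
df.groupby('interval', as_index=False).agg({c: ['count', 'mean'] for c in cols})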
I have a pandas dataframe whose index repeats values. I want to re-index it as a MultiIndex in which the repeated values are grouped.
The index looks like this:
so I would like all the 112335586 index values to be grouped under the same index.
I have looked at this question, Create pandas dataframe by repeating one row with new multiindex, but there the index values are pre-defined, which is not possible here as my dataframe is far too large to hard-code.
I also looked at the MultiIndex documentation, but it likewise pre-defines the values for the index.
I believe you need:
s = pd.Series([1,2,3,4], index=[10,10,20,20])
s.index.name = 'EVENT_ID'
print (s)
EVENT_ID
10    1
10    2
20    3
20    4
dtype: int64
# number the occurrences within each repeated index value
s1 = s.index.to_series()
s2 = s1.groupby(s1).cumcount()
# attach the counter as a second index level
s.index = [s.index, s2]
print (s)
EVENT_ID
10        0    1
          1    2
20        0    3
          1    4
dtype: int64
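An equivalent one-step variant builds the new index with MultiIndex.from_arrays (a sketch applied to the original flat-indexed s):

s.index = pd.MultiIndex.from_arrays(
    [s.index, s.groupby(level=0).cumcount()],
    names=['EVENT_ID', None])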
Try this:
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID','sub_idx'], inplace=True)
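For example, on a small invented frame with a repeating EVENT_ID index (a sketch):

import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4]},
                  index=pd.Index([10, 10, 20, 20], name='EVENT_ID'))

df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID', 'sub_idx'], inplace=True)
print(df)
#                   value
# EVENT_ID sub_idx
# 10       0            1
#          1            2
# 20       0            3
#          1            4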
Here is the snippet:
test = pd.DataFrame({'userid': [1,1,1,2,2], 'order_id': [1,2,3,4,5], 'fee': [2,1,5,3,1]})
I'd like to group based on userid and count the 'order_id' column and sum the 'fee' column:
test.groupby('userid').order_id.count()
test.groupby('userid').fee.sum()
Is it possible to perform these two operations in one line of code so that I can get a resulting df looks like this:
userid counts sum
...
I've tried pivot_table:
test.pivot_table(index='userid', values=['order_id', 'fee'], aggfunc=[np.size, np.sum])
It gives something like this:
         size           sum
          fee order_id   fee order_id
userid
1           3        3     8        6
2           2        2     4        9
Is it possible to tell pandas to use np.size & np.sum on one column but not both?
Use DataFrameGroupBy.agg and rename the columns:
d = {'order_id':'counts','fee':'sum'}
df = (test.groupby('userid')
          .agg({'order_id': 'count', 'fee': 'sum'})
          .rename(columns=d)
          .reset_index())
print (df)
   userid  sum  counts
0       1    8       3
1       2    4       2
But it is better to aggregate with size here; count matters only when you need to exclude NaNs:
df = (test.groupby('userid')
          .agg({'order_id': 'size', 'fee': 'sum'})
          .rename(columns=d)
          .reset_index())
print (df)
   userid  sum  counts
0       1    8       3
1       2    4       2
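In pandas 0.25+, named aggregation does the counting, summing, and renaming in one step (a sketch of the same operation):

df = (test.groupby('userid')
          .agg(counts=('order_id', 'size'), sum=('fee', 'sum'))
          .reset_index())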
This is very similar to this question, except I want my code to be able to apply to the length of a dataframe, instead of specific columns.
I have a DataFrame, and I'm trying to get a sum of each row to append to the dataframe as a column.
df = pd.DataFrame([[1,0,0],[20,7,1],[63,13,5]],columns=['drinking','drugs','both'],index = ['First','Second','Third'])
        drinking  drugs  both
First          1      0     0
Second        20      7     1
Third         63     13     5
Desired output:
        drinking  drugs  both  total
First          1      0     0      1
Second        20      7     1     28
Third         63     13     5     81
Current code:
df['total'] = df.apply(lambda row: (row['drinking'] + row['drugs'] + row['both']),axis=1)
This works great. But what if I have another dataframe, with seven columns, which are not called 'drinking', 'drugs', or 'both'? Is it possible to adjust this function so that it applies to the length of the dataframe? That way I can use the function for any dataframe at all, with a varying number of columns, not just a dataframe with columns called 'drinking', 'drugs', and 'both'?
Something like:
df['total'] = df.apply(for col in df: [code to calculate sum of each row]),axis=1)
You can use sum:
df['total'] = df.sum(axis=1)
If you need to sum only some of the columns, use a subset:
df['total'] = df[['drinking', 'drugs', 'both']].sum(axis=1)
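If the frame may also contain non-numeric columns, it is safer to restrict the sum to numeric data (a sketch; numeric_only makes pandas skip non-numeric columns):

df['total'] = df.sum(axis=1, numeric_only=True)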
What about something like this:
df.loc[:, 'Total'] = df.sum(axis=1)
with the output:
Out[4]:
        drinking  drugs  both  Total
First          1      0     0      1
Second        20      7     1     28
Third         63     13     5     81
It will sum all columns by row.