I have simple data:
type age
A 4
A 4
B 4
A 5
I want to get
type age count
A 4 2
A 5 1
B 4 1
How can I do this in pandas? What should I do after df.groupby(['type'])?
Let's use groupby with 'type' and 'age', then count and reset_index:
df.groupby(['type','age'])['age'].count().reset_index(name='count')
Output:
type age count
0 A 4 2
1 A 5 1
2 B 4 1
You could also do
df.groupby(['type','age']).size().reset_index(name='count')
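Note that the two differ when missing values are present: count() skips NaNs in the selected column, while size() counts every row in the group. A quick sketch on made-up data:

import numpy as np
import pandas as pd

# Toy frame (not the question's data) with a missing age for type A
df = pd.DataFrame({'type': ['A', 'A', 'B'], 'age': [4, np.nan, 4]})

print(df.groupby('type')['age'].count())  # non-NaN values only: A 1, B 1
print(df.groupby('type').size())          # all rows per group:  A 2, B 1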
I would like to achieve the result below in Python using Pandas.
I tried groupby and sum on the id and Group columns using the below:
df.groupby(['id','Group'])['Total'].sum()
I got the first two columns, but I'm not sure how to get the third column (Overall_Total).
How can I do it?
Initial data (before grouping):

id  Group  Time
1   a      2
1   a      2
1   a      1
1   b      1
1   b      1
1   c      1
2   e      2
2   a      4
2   e      1
2   a      5
3   c      1
3   e      4
3   a      3
3   e      4
3   a      2
3   h      4
Assuming df is your initial dataframe, please try this:
df_group = df.groupby(['id','Group'])['Time'].sum().reset_index(name='Total')
df_group['Overall_Total'] = df_group.groupby('id')['Total'].transform('sum')
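For a self-contained check on the sample data above, a runnable sketch (reset_index turns the summed Time into a regular Total column):

import pandas as pd

# Rebuild the question's sample data
df = pd.DataFrame({
    'id':    [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
    'Group': list('aaabbc') + list('eaea') + list('ceaeah'),
    'Time':  [2, 2, 1, 1, 1, 1, 2, 4, 1, 5, 1, 4, 3, 4, 2, 4],
})

df_group = df.groupby(['id', 'Group'])['Time'].sum().reset_index(name='Total')
df_group['Overall_Total'] = df_group.groupby('id')['Total'].transform('sum')
print(df_group)  # Overall_Total: 8 for id 1, 12 for id 2, 18 for id 3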
I want to convert multiple columns in a dataframe (pandas) to the type "category" using the method .astype. Here is my code:
df['Field_1'].astype('category').cat.codes
works, however
categories = df.select_dtypes('object')
categories['Field_1'].cat.codes
doesn't.
Would someone please tell me why?
In general, the question is how to apply a method (.astype) to a whole dataframe. I know how to apply a method to a single column, but applying it to the dataframe hasn't been successful, even with a for loop, since the loop yields a Series and .cat.codes is not applicable to a Series unless it already has category dtype.
I think you need to process each column separately with DataFrame.apply and a lambda function; your code failed because cat.codes works only on a Series of category dtype and is not implemented for a DataFrame:
df = pd.DataFrame({
'A':list('acbdac'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':list('dddbbb')
})
cols = df.select_dtypes('object').columns
df[cols] = df[cols].apply(lambda x: x.astype('category').cat.codes)
print (df)
A B C D
0 0 4 7 1
1 2 5 8 1
2 1 4 9 1
3 3 5 4 0
4 0 5 2 0
5 2 4 3 0
A similar idea, though I am not sure the output is always identical, is to convert all columns to categorical in a first step with DataFrame.astype:
cols = df.select_dtypes('object').columns
df[cols] = df[cols].astype('category').apply(lambda x: x.cat.codes)
print (df)
A B C D
0 0 4 7 1
1 2 5 8 1
2 1 4 9 1
3 3 5 4 0
4 0 5 2 0
5 2 4 3 0
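If the goal is only the category dtype itself, without extracting integer codes, the conversion can also be done in one assignment; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'A': list('acbdac'), 'B': [4, 5, 4, 5, 5, 4], 'D': list('dddbbb')})

# Convert every object column to category dtype at once
cols = df.select_dtypes('object').columns
df[cols] = df[cols].astype('category')
print(df.dtypes)  # A and D are category, B stays int64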
I have a df like this:
A B
1 1
1 2
1 3
2 2
2 1
3 2
3 3
3 4
I would like to extract the rows of the groups whose column B is not ascending, like:
A B
2 2
2 1
I tried
df.groupby("A").filter()...
but I got stuck on how to extract them.
If you have any solution, please let me know.
One way is to use pandas.Series.is_monotonic (deprecated since pandas 1.5 and removed in 2.0, where is_monotonic_increasing should be used instead):
df[df.groupby('A')['B'].transform(lambda x:not x.is_monotonic)]
Output:
A B
3 2 2
4 2 1
Use GroupBy.transform with Series.diff, test for at least one negative difference with Series.lt and Series.any, and filter by boolean indexing:
df1 = df[df.groupby('A')['B'].transform(lambda x: x.diff().lt(0).any())]
print (df1)
A B
3 2 2
4 2 1
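For a self-contained check, the same sample data and filter end to end:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 3, 3, 3],
                   'B': [1, 2, 3, 2, 1, 2, 3, 4]})

# Keep the groups in which B decreases at least once
mask = df.groupby('A')['B'].transform(lambda x: x.diff().lt(0).any())
print(df[mask])
#    A  B
# 3  2  2
# 4  2  1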
I have a dataframe df with the shape (4573,64) that I'm trying to pivot. The last column is an 'id' with two possible string values 'old' and 'new'. I would like to set the first 63 columns as index and then have the 'id' column across the top with values being the count of 'old' or 'new' for each index row.
I've created a list named cols out of the column labels that I want as the index.
I tried the following:
df.pivot(index=cols, columns='id')['id']
this gives an error: 'all arrays must be same length'
I also tried the following to see if I could get a sum, but no luck either:
pd.pivot_table(df, index=cols, values=['id'], aggfunc=np.sum)
Any ideas greatly appreciated.
I found a thread online about a possible bug in pandas 0.23.0 where pandas.pivot_table() will not accept a multi-index as long as it contains NaNs (link to GitHub in the comments). My workaround was to do
df.fillna('empty', inplace=True)
then the solution below:
df1 = pd.pivot_table(df, index=cols,columns='id',aggfunc='size', fill_value=0)
as proposed by jezrael works as intended, hence that answer is accepted.
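To see why the fillna step matters: rows with NaN in any grouping key are dropped from the result, so a pivot over 63 index columns can silently lose rows (or, per the bug report, fail outright). A toy sketch, not the original data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': ['p', np.nan, 'p'], 'id': ['old', 'new', 'old']})

# The row with NaN in the index column disappears from the pivot
print(pd.pivot_table(df, index='x', columns='id', aggfunc='size', fill_value=0))

# Workaround: fill the NaNs with a placeholder first
df.fillna('empty', inplace=True)
print(pd.pivot_table(df, index='x', columns='id', aggfunc='size', fill_value=0))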
I believe you need to convert the column names to a list and then aggregate size with unstack:
df = pd.DataFrame({'B':[4,4,4,5,5,4],
'C':[1,1,9,4,2,3],
'D':[1,1,5,7,1,0],
'E':[0,0,6,9,2,4],
'id':list('aaabbb')})
print (df)
B C D E id
0 4 1 1 0 a
1 4 1 1 0 a
2 4 9 5 6 a
3 5 4 7 9 b
4 5 2 1 2 b
5 4 3 0 4 b
cols = df.columns.tolist()
df1 = df.groupby(cols)['id'].size().unstack(fill_value=0)
print (df1)
id a b
B C D E
4 1 1 0 2 0
3 0 4 0 1
9 5 6 1 0
5 2 1 2 0 1
4 7 9 0 1
Solution with pivot_table (here the index must exclude 'id', which is the last entry of cols, since a column cannot be both an index and a columns grouper):
df1 = pd.pivot_table(df, index=cols[:-1], columns='id', aggfunc='size', fill_value=0)
print (df1)
id a b
B C D E
4 1 1 0 2 0
3 0 4 0 1
9 5 6 1 0
5 2 1 2 0 1
4 7 9 0 1
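Another option along the same lines is pd.crosstab, which counts co-occurrences without an explicit aggfunc (a sketch on the same toy frame):

import pandas as pd

df = pd.DataFrame({'B': [4, 4, 4, 5, 5, 4],
                   'C': [1, 1, 9, 4, 2, 3],
                   'D': [1, 1, 5, 7, 1, 0],
                   'E': [0, 0, 6, 9, 2, 4],
                   'id': list('aaabbb')})

# The listed Series become the MultiIndex, the values of 'id' the columns
df1 = pd.crosstab(index=[df.B, df.C, df.D, df.E], columns=df.id)
print(df1)  # same table as the two solutions above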
In pandas and Python:
I have a large dataset of health records in which patients have records of diagnoses.
How can I display the most frequent diagnoses, counting each distinct diagnosis only once per patient?
Example ('pid' is patient id. 'code' is the code of a diagnosis):
IN:
pid code
1 A
1 B
1 A
1 A
2 A
2 A
2 B
2 A
3 B
3 C
3 D
4 A
4 A
4 A
4 B
OUT:
B 4
A 3
C 1
D 1
I would like to be able to use .isin and .index if possible.
Example:
Remove all rows with a frequency count of less than 3 in column 'code':
s = df['code'].value_counts().ge(3)
df = df[df['code'].isin(s[s].index)]
You can use groupby + nunique:
df.groupby(by='code').pid.nunique().sort_values(ascending=False)
Out[60]:
code
B 4
A 3
D 1
C 1
Name: pid, dtype: int64
To remove all rows whose 'code' is seen in fewer than 3 distinct patients:
df.groupby(by='code').filter(lambda x: x.pid.nunique()>=3)
Out[55]:
pid code
0 1 A
1 1 B
2 1 A
3 1 A
4 2 A
5 2 A
6 2 B
7 2 A
8 3 B
11 4 A
12 4 A
13 4 A
14 4 B
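Since the question asks to keep the value_counts / .isin(...).index pattern, an equivalent route is to drop duplicate (pid, code) pairs first and then use a plain value_counts; a sketch on the question's data:

import pandas as pd

df = pd.DataFrame({'pid':  [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4],
                   'code': list('ABAAAABABCDAAAB')})

# One occurrence of each diagnosis per patient, then an ordinary count
s = df.drop_duplicates(['pid', 'code'])['code'].value_counts()
print(s)  # B 4, A 3, C 1, D 1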
Since you mention value_counts (note that the level argument of count was removed in pandas 2.0; use .groupby(level=0).count() there):
df.groupby('code').pid.value_counts().count(level=0)
Out[42]:
code
A 3
B 4
C 1
D 1
Name: pid, dtype: int64
You should be able to use the groupby and nunique() functions to obtain a distinct count of patients that had each diagnosis. This should give you the result you need:
df[['pid', 'code']].groupby(['code']).nunique()
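To match the ordering in the expected output, sort the result; note this form returns a DataFrame with a 'pid' column rather than a Series (a small follow-up, reusing the df from the sketch above):

out = df[['pid', 'code']].groupby('code').nunique()
print(out.sort_values('pid', ascending=False))  # B 4, A 3, then C and D with 1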