groupby count same values in two columns in pandas? - python

I have the following Pandas dataframe:
name1 name2
A B
A A
A C
A A
B B
B A
I want to build a column named new that counts how often each value occurs in name1 or name2 combined, with one row per distinct value across both columns. The expected output is the following dataframe:
name new
A 7
B 4
C 1
I've tried
df.groupby(["name1"]).count().groupby(["name2"]).count(), among many other things. Although that last one seems to give me the correct results, I can't get the joined datasets.

You can use value_counts with df.stack():
df[['name1','name2']].stack().value_counts()
#df.stack().value_counts() for all cols
A 7
B 4
C 1
Specifically:
(df[['name1','name2']].stack().value_counts().
to_frame('new').rename_axis('name').reset_index())
name new
0 A 7
1 B 4
2 C 1
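As a self-contained sketch of the stack approach above, the snippet below rebuilds the question's dataframe and produces the name/new table. The variable names stacked and result are illustrative and do not come from the original answer.

```python
import pandas as pd

# The dataframe from the question.
df = pd.DataFrame({
    "name1": ["A", "A", "A", "A", "B", "B"],
    "name2": ["B", "A", "C", "A", "B", "A"],
})

# Stack both columns into one long Series, then count each distinct value.
stacked = df[["name1", "name2"]].stack()
result = (
    stacked.value_counts()
    .rename_axis("name")       # name the index that holds the distinct values
    .reset_index(name="new")   # turn it into a two-column dataframe
)
print(result)
```

Selecting the two columns explicitly keeps the count from picking up unrelated columns if the dataframe grows later.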

Let us try melt
df.melt().value.value_counts()
Out[17]:
A 7
B 4
C 1
Name: value, dtype: int64

Alternatively,
df.name1.value_counts().add(df.name2.value_counts(), fill_value=0).astype(int)
gives you
A 7
B 4
C 1
dtype: int64

Using pd.concat with Series.value_counts (Series.append was deprecated in pandas 1.4 and removed in 2.0):
pd.concat([df['name1'], df['name2']]).value_counts()
A 7
B 4
C 1
dtype: int64
value_counts moves the counted values into the index. To get your desired output, use rename_axis with reset_index:
pd.concat([df['name1'], df['name2']]).value_counts().rename_axis('name').reset_index(name='new')
name new
0 A 7
1 B 4
2 C 1

Python's collections.Counter is another solution:
from collections import Counter
s = pd.Series(Counter(df.to_numpy().flatten()))
In [1325]: s
Out[1325]:
A 7
B 4
C 1
dtype: int64
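A related variant, not taken from the answers above, counts the flattened values with numpy.unique instead of Counter. This is a minimal sketch that assumes every cell holds a plain string, so the values can be sorted; the names values and counts are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name1": ["A", "A", "A", "A", "B", "B"],
    "name2": ["B", "A", "C", "A", "B", "A"],
})

# np.unique flattens the 2-D array, sorts the distinct values,
# and returns how many times each one occurs.
values, counts = np.unique(df[["name1", "name2"]].to_numpy(), return_counts=True)
s = pd.Series(counts, index=values)
print(s)
```

Unlike value_counts, the result is ordered by value rather than by frequency.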

Related

Pivot table based on the first value of the group in Pandas

I have the following DataFrame:
I'm trying to pivot it in pandas and achieve the following format:
I tried the classical approach with pd.pivot_table(), but it does not work:
pd.pivot_table(df,values='col2', index=[df.index], columns = 'col1')
I would appreciate any suggestions. Thanks!
You can use pivot and then dropna for each column:
>>> df.pivot(columns='col1', values='col2').apply(lambda x: x.dropna().tolist()).astype(int)
col1 a b c
0 1 2 9
1 4 5 0
2 6 8 7
Another option is to create a Series of lists using groupby.agg; then construct a DataFrame:
out = df.groupby('col1')['col2'].agg(list).pipe(lambda x: pd.DataFrame(zip(*x), columns=x.index.tolist()))
Output:
A B C
0 1 2 9
1 4 5 0
2 6 8 7
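Another way to get the same shape, offered here as a sketch rather than as part of the original answers, is to number the rows within each col1 group using groupby.cumcount and pivot on that counter. The input data below is hypothetical, because the question's original table was not preserved; the helper column name row is also an assumption.

```python
import pandas as pd

# Hypothetical input: the question's original table is not available.
df = pd.DataFrame({
    "col1": ["a", "b", "c", "a", "b", "c", "a", "b", "c"],
    "col2": [1, 2, 9, 4, 5, 0, 6, 8, 7],
})

# Number each row within its col1 group (0, 1, 2, ...), then use that
# counter as the row index of the pivoted table.
out = (
    df.assign(row=df.groupby("col1").cumcount())
    .pivot(index="row", columns="col1", values="col2")
)
print(out)
```

If the groups have unequal sizes, the shorter columns are padded with NaN and become float, so an integer cast would need the missing values handled first.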

All possibilities with groupby and value_counts(), issue with MultiIndex

I have a table that looks like this:
ACCOUNT_ID | OPTION
1 A
2 A
2 B
2 B
2 C
I want to count the groups for each so I ran df.groupby(['ACCOUNT_ID'])['OPTION'].value_counts() and the result looks like this:
ACCOUNT_ID | OPTION
1 A 1
2 A 1
2 B 2
2 C 1
This works well but I want every possible option to be shown (so A, B, C counts for each account_id) like:
ACCOUNT_ID | OPTION
1 A 1
1 B 0
1 C 0
2 A 1
2 B 2
2 C 1
I found this response, using .sort_index().reindex(uniques, fill_value = 0), which looks great, but doesn't work since I am using a MultiIndex.
Any tips would be amazing!!
One solution is to unstack the inner level of the MultiIndex into columns. Here df is the Series returned by the groupby(...).value_counts() call, not the original table. Unstacking gives you a DataFrame whose columns have float64 dtype, with NaN values for missing combinations of ACCOUNT_ID and OPTION. Fill NaNs with 0, convert back to integer dtype with astype, and stack the columns back into the index to recreate the MultiIndex:
df.unstack().fillna(0).astype(int).stack()
ACCOUNT_ID OPTION
1 A 1
B 0
C 0
2 A 1
B 2
C 1
dtype: int64
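An alternative to the unstack/stack round trip, closer to the reindex idea mentioned in the question, is to reindex the counts against the full product of both levels. This is a sketch that assumes the set of options is exactly the values present in the OPTION column; the names counts, full_index and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ACCOUNT_ID": [1, 2, 2, 2, 2],
    "OPTION": ["A", "A", "B", "B", "C"],
})

counts = df.groupby("ACCOUNT_ID")["OPTION"].value_counts()

# Build every (ACCOUNT_ID, OPTION) combination, then reindex so that
# combinations absent from the data get a count of 0.
full_index = pd.MultiIndex.from_product(
    [df["ACCOUNT_ID"].unique(), df["OPTION"].unique()],
    names=["ACCOUNT_ID", "OPTION"],
)
result = counts.reindex(full_index, fill_value=0)
print(result)
```

Because the fill value is supplied during the reindex, the counts stay integers and no astype call is needed.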

Replace each value in Series with its relative ranking

I have a sorted Series, is there a simple way to change it from
A 0.064467
B 0.042283
C 0.037581
D 0.017410
dtype: float64
to
A 1
B 2
C 3
D 4
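You can use rank (here df is your Series). It returns floats, so cast to int if you want whole-number ranks:
df.rank(ascending=False).astype(int)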
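A runnable sketch of the ranking step, using the values from the question. The choice of method="first" is an assumption added here: it breaks ties by order of appearance, so every rank is a whole number before the integer cast.

```python
import pandas as pd

s = pd.Series([0.064467, 0.042283, 0.037581, 0.017410], index=list("ABCD"))

# Largest value gets rank 1; method="first" avoids fractional ranks on ties.
ranks = s.rank(ascending=False, method="first").astype(int)
print(ranks)
```

With the default method="average", tied values would share a fractional rank such as 1.5, and casting to int would silently truncate it.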

In pandas, how to display the most frequent diagnoses in dataframe, but only count 1 occurrence of the same diagnoses per patient

In pandas and python:
I have a large datasets with health records where patients have records of diagnoses.
How to display the most frequent diagnoses, but only count 1 occurrence of the same diagnoses per patient?
Example ('pid' is patient id. 'code' is the code of a diagnosis):
IN:
pid code
1 A
1 B
1 A
1 A
2 A
2 A
2 B
2 A
3 B
3 C
3 D
4 A
4 A
4 A
4 B
OUT:
B 4
A 3
C 1
D 1
I would like to be able to use .isin .index if possible.
Example:
Remove all rows with less than 3 in frequency count in column 'code'
s = df['code'].value_counts().ge(3)
df = df[df['code'].isin(s[s].index)]
You can use groupby + nunique:
df.groupby(by='code').pid.nunique().sort_values(ascending=False)
Out[60]:
code
B 4
A 3
D 1
C 1
Name: pid, dtype: int64
To remove all rows whose code was recorded for fewer than 3 distinct patients:
df.groupby(by='code').filter(lambda x: x.pid.nunique()>=3)
Out[55]:
pid code
0 1 A
1 1 B
2 1 A
3 1 A
4 2 A
5 2 A
6 2 B
7 2 A
8 3 B
11 4 A
12 4 A
13 4 A
14 4 B
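Since you mention value_counts (the level argument of Series.count was removed in pandas 2.0, so group on the index level instead):
df.groupby('code').pid.value_counts().groupby(level=0).count()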
Out[42]:
code
A 3
B 4
C 1
D 1
Name: pid, dtype: int64
You should be able to use the groupby and nunique() functions to obtain a distinct count of patients that had each diagnosis. This should give you the result you need:
df[['pid', 'code']].groupby(['code']).nunique()
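The following sketch ties the answers back to the .isin/.index pattern the question asked for. It rebuilds the example data, counts each diagnosis once per patient by dropping duplicate (pid, code) pairs, and then filters the rows. The names per_code, frequent and filtered are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "pid": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4],
    "code": list("ABAAAABABCDAAAB"),
})

# Keep one row per (patient, diagnosis) pair, then count the diagnoses.
per_code = df.drop_duplicates(["pid", "code"])["code"].value_counts()
print(per_code)

# Keep only rows whose diagnosis appears for at least 3 distinct patients.
frequent = per_code[per_code.ge(3)].index
filtered = df[df["code"].isin(frequent)]
print(filtered)
```

Dropping duplicates first gives the same per-code numbers as groupby('code').pid.nunique(), and it keeps the result in the value_counts form that the question's isin snippet already uses.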

Renaming tuple column name in dataframe

I am new to python and pandas. I have attached a picture of a pandas dataframe,
I need to know how I can fetch data from the last column and how to rename the last column.
You can use:
df = df.rename(columns = {df.columns[-1] : 'newname'})
Or:
df.columns = df.columns[:-1].tolist() + ['new_name']
The following approach appears to be buggy:
df.columns.values[-1] = 'newname'
After renaming this way, pandas functions return strange errors (see the KeyError at the end of the sample below).
To fetch data from the last column, you can select by position with iloc:
s = df.iloc[:,-1]
And after rename:
s1 = df['newname']
print (s1)
Sample:
df = pd.DataFrame({'R':[7,8,9],
'T':[1,3,5],
'E':[5,3,6],
('Z', 'a'):[7,4,3]})
print (df)
E T R (Z, a)
0 5 1 7 7
1 3 3 8 4
2 6 5 9 3
s = df.iloc[:,-1]
print (s)
0 7
1 4
2 3
Name: (Z, a), dtype: int64
df.columns = df.columns[:-1].tolist() + ['new_name']
print (df)
E T R new_name
0 5 1 7 7
1 3 3 8 4
2 6 5 9 3
df = df.rename(columns = {('Z', 'a') : 'newname'})
print (df)
E T R newname
0 5 1 7 7
1 3 3 8 4
2 6 5 9 3
s = df['newname']
print (s)
0 7
1 4
2 3
Name: newname, dtype: int64
df.columns.values[-1] = 'newname'
s = df['newname']
print (s)
>KeyError: 'newname'
fetch data from the last column
Retrieving the last column using df.iloc[:,-1] as suggested by other answers works fine only when it is indeed the last column.
However, using absolute column positions like -1 is not a stable solution, i.e. if you add some other column, your code will break.
A stable, generic approach
First of all, make sure all your column names are strings:
# rename columns
df.columns = [str(s) for s in df.columns]
# access column by name
df['(vehicle_id, reservation_count)']
rename the last column
It is preferable to have similar column names for all columns, without brackets in them. This makes your code more readable and your dataset easier to use:
# access column by name
df['vehicle_id_reservation_count']
This is a straightforward conversion of all columns that are named by a tuple:
# rename columns: join tuple names with underscores, leave other names unchanged
def rename(col):
    if isinstance(col, tuple):
        col = '_'.join(str(c) for c in col)
    return col
df.columns = [rename(col) for col in df.columns]
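A self-contained sketch of the tuple-flattening idea, using a reduced version of the sample dataframe from the earlier answer. The helper name flatten and the resulting label Z_a are illustrative, not part of the original answers.

```python
import pandas as pd

# One ordinary column and one column labelled by a tuple.
df = pd.DataFrame({"R": [7, 8, 9], ("Z", "a"): [7, 4, 3]})

def flatten(col):
    """Join tuple labels with underscores; leave other labels unchanged."""
    return "_".join(str(c) for c in col) if isinstance(col, tuple) else col

# Assign a new list of labels instead of editing df.columns.values in place.
df.columns = [flatten(col) for col in df.columns]
print(df.columns.tolist())

# The column can now be fetched by a stable string name.
last = df["Z_a"]
print(last)
```

Accessing the column by its new name keeps working if more columns are appended later, which positional access with iloc[:, -1] does not guarantee.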
You can drop the last column and reassign it with a different name.
This isn't technically renaming the column. However, I think it's intuitive.
Using @jezrael's setup
df = pd.DataFrame({'R':[7,8,9],
'T':[1,3,5],
'E':[5,3,6],
('Z', 'a'):[7,4,3]})
print(df)
R T E (Z, a)
0 7 1 5 7
1 8 3 3 4
2 9 5 6 3
How can I fetch the last column?
You can use iloc
df.iloc[:, -1]
0 7
1 4
2 3
Name: (Z, a), dtype: int64
You can rename the column after you've extracted it
df.iloc[:, -1].rename('newcolumn')
0 7
1 4
2 3
Name: newcolumn, dtype: int64
There are many ways to rename it within the dataframe. To continue with the theme I've started, namely fetching the column and then renaming it:
option 1
start by dropping the last column with iloc[:, :-1]
use join to add the renamed column referenced above
df.iloc[:, :-1].join(df.iloc[:, -1].rename('newcolumn'))
R T E newcolumn
0 7 1 5 7
1 8 3 3 4
2 9 5 6 3
option 2
Or we can use assign to put it back and save the rename
df.iloc[:, :-1].assign(newname=df.iloc[:, -1])
R T E newname
0 7 1 5 7
1 8 3 3 4
2 9 5 6 3
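To change the column name:
columns = df.columns.tolist()  # copy the labels; editing df.columns.values in place can leave the index inconsistent
columns[-1] = "Column name"
df.columns = columns
To fetch data from the dataframe:
You can use the loc and iloc indexers (ix was deprecated in pandas 0.20 and removed in 1.0).
loc fetches values by label
iloc fetches values by integer position
ix could fetch data by either position or label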
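To make the difference between label-based and position-based selection concrete, here is a minimal sketch. The dataframe and its string index labels x, y, z are made up for illustration.

```python
import pandas as pd

# Hypothetical data with string row labels so labels and positions differ.
df = pd.DataFrame({"R": [7, 8, 9], "T": [1, 3, 5]}, index=["x", "y", "z"])

by_label = df.loc["y", "T"]    # row labelled "y", column labelled "T"
by_position = df.iloc[1, 1]    # second row, second column
last_col = df.iloc[:, -1]      # every row of the last column

print(by_label, by_position)
print(last_col)
```

Both scalar lookups hit the same cell here; they would diverge as soon as the rows were reordered or a column was inserted.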
Learn about loc and iloc
http://pandas.pydata.org/pandas-docs/stable/dsintro.html#indexing-selection
Learn more about indexing and selecting data
http://pandas.pydata.org/pandas-docs/stable/indexing.html
