How to groupby count across multiple columns in pandas - python

I have the following sample dataframe in Python pandas:
+---+------+------+------+
| | col1 | col2 | col3 |
+---+------+------+------+
| 0 | a | d | b |
+---+------+------+------+
| 1 | a | c | b |
+---+------+------+------+
| 2 | c | b | c |
+---+------+------+------+
| 3 | b | b | c |
+---+------+------+------+
| 4 | a | a | d |
+---+------+------+------+
I would like to perform a count of all the 'a', 'b', 'c' and 'd' values across columns 1-3 so that I end up with a dataframe like this:
+---+--------+-------+
| | letter | count |
+---+--------+-------+
| 0 | a | 4 |
+---+--------+-------+
| 1 | b | 5 |
+---+--------+-------+
| 2 | c | 4 |
+---+--------+-------+
| 3 | d | 2 |
+---+--------+-------+
One way I can do this is to stack the columns on top of each other and then do a groupby count, but I feel like there has to be a better way. Can someone help me with this?
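For reference, a minimal construction of the sample frame (values copied from the table above) is:
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'c', 'b', 'a'],
                   'col2': ['d', 'c', 'b', 'b', 'a'],
                   'col3': ['b', 'b', 'c', 'c', 'd']})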

You can stack() the dataframe to put all columns into rows and then do value_counts:
df.stack().value_counts()
b 5
c 4
a 4
d 2
dtype: int64
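If you want the exact letter/count layout from the question, one way to finish it off (a sketch building on the line above) is:
counts = (df.stack()
            .value_counts()
            .rename_axis('letter')
            .reset_index(name='count')
            .sort_values('letter', ignore_index=True))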

You can apply value_counts to each column and then sum the counts across the columns:
print (df.apply(pd.value_counts))
col1 col2 col3
a 3.0 1 NaN
b 1.0 2 2.0
c 1.0 1 2.0
d NaN 1 1.0
df1 = df.apply(pd.value_counts).sum(1).reset_index()
df1.columns = ['letter','count']
df1['count'] = df1['count'].astype(int)
print (df1)
letter count
0 a 4
1 b 5
2 c 4
3 d 2
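Another equivalent option, not from the answers above, is to melt everything into one long column first and count that (a sketch):
df1 = (df.melt(value_name='letter')
         .groupby('letter')
         .size()
         .reset_index(name='count'))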

Related

Movement of a specific string from one column to another

Hi, I have a DataFrame which has values like
| ID| Value| comments |
| 1 | a | |
| 2 | b | |
| 3 | a;b;c| |
| 4 | b;c | |
| 5 | d;a;c| |
I need to move 'a' and 'b' from Value to comments for every row they appear in, so that only values other than 'a' and 'b' remain in Value.
The new df would look like this:
| ID| Value| comments |
| 1 | | a |
| 2 | | b |
| 3 | c | a;b |
| 4 | c | b |
| 5 | d;c | a |
Can you point me in the right direction for how to approach this?
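For reference, the input can be built roughly like this (a sketch; starting the comments column as empty strings is an assumption based on the table):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Value': ['a', 'b', 'a;b;c', 'b;c', 'd;a;c'],
                   'comments': ['', '', '', '', '']})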
(i) Use str.split to split on ';' and explode the "Value" column
(ii) Use boolean indexing to filter the exploded rows where 'a' or 'b' occur, take them out, group by the original index, and join them back with ';' as the separator
exploded_series = df['Value'].str.split(';').explode()
mask = exploded_series.isin(['a','b'])
df['comments'] = exploded_series[mask].groupby(level=0).apply(';'.join)
df['Value'] = exploded_series[~mask].groupby(level=0).apply(';'.join)
df = df.fillna('')
Output:
ID Value comments
0 1 a
1 2 b
2 3 c a;b
3 4 c b
4 5 d;c a
Explode your Value column, then label each exploded value with its destination column:
import numpy as np

out = df.assign(Value=df['Value'].str.split(';')).explode('Value')
out['col'] = np.where(out['Value'].isin(['a', 'b']), 'comments', 'Value')
print(out)
# Intermediate output
ID Value comments col
0 1 a NaN comments
1 2 b NaN comments
2 3 a NaN comments
2 3 b NaN comments
2 3 c NaN Value
3 4 b NaN comments
3 4 c NaN Value
4 5 d NaN Value
4 5 a NaN comments
4 5 c NaN Value
Now pivot your dataframe:
out = out.pivot_table(index='ID', columns='col', values='Value', aggfunc=';'.join) \
.fillna('').reset_index().rename_axis(columns=None)
print(out)
# Final output
ID Value comments
0 1 a
1 2 b
2 3 c a;b
3 4 c b
4 5 d;c a

Combine and expand Dataframe based on IDs in columns

I have 3 dataframes A,B,C:
import pandas as pd
import numpy as np

A = pd.DataFrame({"id": [1, 2],
                  "connected_to_B_id1": ["A", "B"],
                  "connected_to_B_id2": ["B", "C"],
                  "connected_to_B_id3": ["C", np.nan],
                  # entry can have multiple ids from B
                  })
B = pd.DataFrame({"id": ["A", "B", "C"],
                  "connected_to_C_id1": [1, 1, 2],
                  "connected_to_C_id2": [2, 2, np.nan],
                  # entry can have multiple ids from C
                  })
C = pd.DataFrame({"id": [1, 2],
                  "name": ["a", "b"],
                  })
# Output should be D:
D = pd.DataFrame({"id_A": [1, 1, 1, 1, 1, 2, 2, 2],
                  "id_B": ["A", "A", "B", "B", "C", "B", "B", "C"],
                  "id_C": [1, 2, 1, 2, 2, 1, 2, 1],
                  "name": ["a", "b", "a", "b", "b", "a", "b", "a"]
                  })
I want to use the IDs stored in the "connected_to_X" columns of each dataframe to create a dataframe that contains all relationships recorded in the three individual dataframes.
What is the most elegant way to combine A, B and C into D?
Currently I am using dicts, lists and for loops, and it's messy and complicated.
D:
|idx |id_A|id_B|id_C|name|
|---:|--:|--:|--:|--:|
| 0 | 1 | A | 1 | a |
| 1 | 1 | A | 2 | b |
| 2 | 1 | B | 1 | a |
| 3 | 1 | B | 2 | b |
| 4 | 1 | C | 2 | b |
| 5 | 2 | B | 1 | a |
| 6 | 2 | B | 2 | b |
| 7 | 2 | C | 1 | a |
You just need to unpivot A and B, then you can join the tables up.
(A.
melt(id_vars='id').
merge(B.melt(id_vars='id'), left_on = 'value', right_on='id', how='left').
merge(C, left_on = 'value_y', right_on='id').
drop(columns = ['variable_x', 'variable_y', 'value_x']).
sort_values(['id_x', 'id_y']).
reset_index(drop=True).
reset_index()
)
index id_x id_y value_y id name
0 0 1 A 1.0 1 a
1 1 1 A 2.0 2 b
2 2 1 B 1.0 1 a
3 3 1 B 2.0 2 b
4 4 1 C 2.0 2 b
5 5 2 B 1.0 1 a
6 6 2 B 2.0 2 b
7 7 2 C 2.0 2 b
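If you want the column names from the target frame D, a final rename and column selection along these lines should do it (a sketch; it repeats the merge chain from the answer and stores it in a variable, here called res):
res = (A.melt(id_vars='id')
        .merge(B.melt(id_vars='id'), left_on='value', right_on='id', how='left')
        .merge(C, left_on='value_y', right_on='id'))

D = (res.rename(columns={'id_x': 'id_A', 'id_y': 'id_B', 'id': 'id_C'})
        [['id_A', 'id_B', 'id_C', 'name']]
        .sort_values(['id_A', 'id_B'])
        .reset_index(drop=True))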

Grouping many columns in one column in Pandas

I have a DataFrame that is similar to this one:
| | id | Group1 | Group2 | Group3 |
|---|----|--------|--------|--------|
| 0 | 22 | A | B | C |
| 1 | 23 | B | C | D |
| 2 | 24 | C | B | A |
| 3 | 25 | D | A | C |
And I want to get something like this:
| | Group | id_count |
|---|-------|----------|
| 0 | A | 3 |
| 1 | B | 3 |
| 2 | C | 3 |
| 3 | D | 2 |
Basically, for each group I want to know how many people (id) have chosen it.
I know there is DataFrame.groupby(), but it only gives an appropriate result for one column (if I give it a list, it does not combine Group1, Group2 and Group3 into one column).
Use DataFrame.melt with GroupBy.size:
df1 = (df.melt('id', value_name='Group')
.groupby('Group')
.size()
.reset_index(name='id_count'))
print (df1)
Group id_count
0 A 3
1 B 3
2 C 4
3 D 2
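An equivalent option in the spirit of the first question above, not part of the original answer, is to stack the three group columns and count them (a sketch):
df1 = (df.set_index('id')
         .stack()
         .value_counts()
         .rename_axis('Group')
         .reset_index(name='id_count')
         .sort_values('Group', ignore_index=True))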

Python: how do I filter data as long as a group contains any of a particular value

df = pd.DataFrame({'VisitID':[1,1,1,1,2,2,2,3,3,4,4], 'Item':['A','B','C','D','A','D','B','B','C','D','C']})
I have a dataset like this:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
4 | D |
4 | C |
I want to return all rows for a VisitID as long as that VisitID had an occurrence of item A or B. How do I go about it? Expected result:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
In R, I can do this via
library(dplyr)
df %>% group_by(VisitID) %>% filter(any(Item %in% c('A', 'B')))
How can I perform this in Python?
Something like df.groupby(['VisitID']).query(any(['A','B']))?
The syntax is similar, just use groupby.filter:
df.groupby('VisitID').filter(lambda g: g.Item.isin(['A','B']).any())
VisitID Item
0 1 A
1 1 B
2 1 C
3 1 D
4 2 A
5 2 D
6 2 B
7 3 B
8 3 C
To extract the groups that contain either value, we can use groupby().transform('any') on the result of isin():
s = (df.Item.isin(['A','B'])
.groupby(df['VisitID']).transform('any')
)
df[s]
Output:
VisitID Item
0 1 A
1 1 B
2 1 C
3 1 D
4 2 A
5 2 D
6 2 B
7 3 B
8 3 C
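Another simple route, not in the answers above, is to collect the qualifying VisitIDs first and then filter on membership (a sketch):
keep = df.loc[df['Item'].isin(['A', 'B']), 'VisitID'].unique()
df[df['VisitID'].isin(keep)]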

How do I add or subtract a row to an entire pandas dataframe?

I have a dataframe like this:
| a | b | c |
0 | 0 | 0 | 0 |
1 | 5 | 5 | 5 |
I have a dataframe row (or series) like this:
| a | b | c |
0 | 1 | 2 | 3 |
I want to add the row to the entire dataframe (and likewise be able to subtract it) to obtain this:
| a | b | c |
0 | 1 | 2 | 3 |
1 | 6 | 7 | 8 |
Any help is appreciated, thanks.
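For reference, the two frames can be constructed like this (a sketch using the names df1 and df2 that the answers below assume):
import pandas as pd

df1 = pd.DataFrame({'a': [0, 5], 'b': [0, 5], 'c': [0, 5]})
df2 = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})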
Use DataFrame.add or DataFrame.sub, converting the one-row DataFrame to a Series, e.g. with DataFrame.iloc for the first row:
df = df1.add(df2.iloc[0])
#alternative select by row label
#df = df1.add(df2.loc[0])
print (df)
a b c
0 1 2 3
1 6 7 8
Detail:
print (df2.iloc[0])
a 1
b 2
c 3
Name: 0, dtype: int64
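For the subtraction case from the title, the same pattern works with DataFrame.sub:
df = df1.sub(df2.iloc[0])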
You can convert the second dataframe to a NumPy array so it broadcasts across the rows:
df1 + df2.values
Output:
a b c
0 1 2 3
1 6 7 8
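Subtraction works the same way with the array, but note that this relies on the column order of df1 and df2 matching, since the NumPy array carries no labels (the .add/.sub approach aligns by column name):
# relies on column order, not labels
df1 - df2.values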
