Pandas transform columns into counts grouped by ID - python

I have a dataframe like this:
|   | ID  | sex | est |
|---|-----|-----|-----|
| 0 | aaa | M   | S   |
| 1 | aaa | M   | C   |
| 2 | aaa | F   | D   |
| 3 | bbb | F   | D   |
| 4 | bbb | M   | C   |
| 5 | ccc | F   | C   |
I need to change it to this:
|   | ID  | M | F | S | C | D |
|---|-----|---|---|---|---|---|
| 0 | aaa | 2 | 1 | 1 | 1 | 1 |
| 1 | bbb | 1 | 1 | 0 | 1 | 1 |
| 2 | ccc | 0 | 1 | 0 | 1 | 0 |
For each unique ID I need to count the number of occurrences of each value, but I can't do it manually; there are too many rows and columns.
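For reference, a minimal reconstruction of this dataframe, as a sketch based on the table above:

import pandas as pd

df = pd.DataFrame({'ID': ['aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'ccc'],
                   'sex': ['M', 'M', 'F', 'F', 'M', 'F'],
                   'est': ['S', 'C', 'D', 'D', 'C', 'C']})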

Try this:
out = (df
       .set_index('ID')
       .stack()             # long Series of all sex/est values, indexed by ID
       .str.get_dummies()   # one 0/1 indicator column per unique value
       .groupby(level=0)    # collect the indicators back per ID
       .sum()
       .reset_index()
       )
print(out)
    ID  C  D  F  M  S
0  aaa  1  1  1  2  1
1  bbb  1  1  1  1  0
2  ccc  1  0  1  0  0
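If you want the columns back in the question's M, F, S, C, D order, one possible follow-up (assuming these are the only values that occur):

out = out[['ID', 'M', 'F', 'S', 'C', 'D']]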

Alternatively, use pd.get_dummies directly on the dataframe to avoid the stack step, then aggregate with groupby:
(pd.get_dummies(df,
                columns=['sex', 'est'],
                prefix='',        # empty prefix keeps the raw values as column names
                prefix_sep='')
   .groupby('ID', as_index=False)
   .sum()
)
    ID  F  M  C  D  S
0  aaa  1  2  1  1  1
1  bbb  1  1  1  1  0
2  ccc  1  0  1  0  0
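A crosstab-based sketch of the same idea (not from the answers above):

m = df.melt('ID')        # long format: one row per (ID, value) pair
out = pd.crosstab(m['ID'], m['value']).reset_index()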

Related

Pandas: group and custom transform dataframe long to wide

I have a dataframe in the following form:
+---------+-------+-------+---------+---------+
| payment | type | err | country | source |
+---------+-------+-------+---------+---------+
| visa | type1 | OK | AR | source1 |
| paypal | type1 | OK | DE | source1 |
| mc | type2 | ERROR | AU | source2 |
| visa | type3 | OK | US | source2 |
| visa | type2 | OK | FR | source3 |
| visa | type1 | OK | FR | source2 |
+---------+-------+-------+---------+---------+
df = pd.DataFrame({'payment': ['visa', 'paypal', 'mc', 'visa', 'visa', 'visa'],
                   'type': ['type1', 'type1', 'type2', 'type3', 'type2', 'type1'],
                   'err': ['OK', 'OK', 'ERROR', 'OK', 'OK', 'OK'],
                   'country': ['AR', 'DE', 'AU', 'US', 'FR', 'FR'],
                   'source': ['source1', 'source1', 'source2', 'source2', 'source3', 'source2'],
                   })
My goal is to group by payment and country, and create new columns:
number_payments - the plain count for the group,
num_errors - the number of ERROR values in the group,
num_type1 .. num_type3 - the number of corresponding values in column type (only 3 possible values),
num_source1 .. num_source3 - the number of corresponding values in column source (only 3 possible values).
Like this:
+---------+---------+-----------------+------------+-----------+-----------+-----------+-------------+-------------+-------------+
| payment | country | number_payments | num_errors | num_type1 | num_type2 | num_type3 | num_source1 | num_source2 | num_source3 |
+---------+---------+-----------------+------------+-----------+-----------+-----------+-------------+-------------+-------------+
| visa | AR | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| visa | US | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| visa | FR | 2 | 0 | 1 | 2 | 0 | 0 | 1 | 1 |
| mc | AU | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
+---------+---------+-----------------+------------+-----------+-----------+-----------+-------------+-------------+-------------+
I tried to combine pandas groupby and pivot, but I failed to make it all work and what I have is ugly. I'm pretty sure there are good and fast methods to do this.
Appreciate any help.
You can use get_dummies, then select the two grouper columns to create the group, and join the group size with the sum:
c = df['err'].eq("ERROR")       # boolean: True where err is ERROR
g = (df[['payment', 'country']]
     .assign(num_errors=c,
             **pd.get_dummies(df[['type', 'source']], prefix=['num', 'num']))
     .groupby(['payment', 'country']))
out = g.size().to_frame("number_payments").join(g.sum()).reset_index()
print(out)
  payment country  number_payments  num_errors  num_type1  num_type2  \
0      mc      AU                1           1          0          1
1  paypal      DE                1           0          1          0
2    visa      AR                1           0          1          0
3    visa      FR                2           0          1          1
4    visa      US                1           0          0          0

   num_type3  num_source1  num_source2  num_source3
0          0            0            1            0
1          0            1            0            0
2          0            1            0            0
3          0            0            1            1
4          1            0            1            0
First it is better to clean the data for your stated purposes:
df['err_bool'] = (df['err'] == 'ERROR').astype(int)
Then we aggregate with groupby, using named aggregation (agg({'number_payments': 'count', ...}) would fail, because number_payments is not an existing column):
df_grouped = df.groupby(['country', 'payment']).agg(
    number_payments=('err_bool', 'count'),   # group size
    num_errors=('err_bool', 'sum'))          # number of ERROR rows
Then we can use pivot_table for type and source (pivot has no aggfunc argument, and the keyword is values, not value):
df['dummy'] = 1
df_type = df.pivot_table(
    index=['country', 'payment'],
    columns='type',
    values='dummy',
    aggfunc='sum',
    fill_value=0,
)
df_source = df.pivot_table(
    index=['country', 'payment'],
    columns='source',
    values='dummy',
    aggfunc='sum',
    fill_value=0,
)
Then we join everything together:
df_grouped = df_grouped.join(df_type).join(df_source)
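Note that the pivot_table columns come out as type1 .. source3 rather than num_type1 .. num_source3; if you want the names from the question, add_prefix can be applied before joining (a sketch replacing the join above):

df_grouped = (df_grouped
              .join(df_type.add_prefix('num_'))
              .join(df_source.add_prefix('num_'))
              .reset_index())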

Group by as a list and create new column for each value

I have a dataframe where every row is a user id and an item that user has:
| user | item_id |
|------|---------|
| 1 | a |
| 1 | b |
| 2 | b |
| 3 | c |
| 4 | a |
| 4 | c |
I want to create n columns, where n is the number of possible values of item_id, group to one row per user, and fill with 1/0 according to whether the value is present for that user.
| user | item_a | item_b | item_c |
|------|--------|--------|--------|
| 1    | 1      | 1      | 0      |
| 2    | 0      | 1      | 0      |
| 3    | 0      | 0      | 1      |
| 4    | 1      | 0      | 1      |
Use pivot_table:
import pandas as pd

df = pd.DataFrame({'user': [1, 1, 2, 3, 4, 4], 'item_id': list('abbcac')})
df = df.assign(val=1).pivot_table(values='val',
                                  index='user',
                                  columns='item_id',
                                  fill_value=0)
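To get the item_-prefixed columns from the question, a possible follow-up on this result:

df.add_prefix('item_').reset_index()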
Starting again from the original long dataframe, a one-line alternative with crosstab:
pd.crosstab(df.user, df.item_id).add_prefix('item_').reset_index()
Yet another approach, again starting from the original long dataframe, is to use get_dummies and a groupby sum:
df = pd.get_dummies(df, columns=['item_id']).groupby('user').sum().reset_index()
which gives the desired result:
   user  item_id_a  item_id_b  item_id_c
0     1          1          1          0
1     2          0          1          0
2     3          0          0          1
3     4          1          0          1
and to shorten the column names:
df.columns = df.columns.str.replace('_id', '')
df
   user  item_a  item_b  item_c
0     1       1       1       0
1     2       0       1       0
2     3       0       0       1
3     4       1       0       1
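One caveat (an assumption about the data, not raised in the original answers): if the same (user, item_id) pair can appear more than once, the groupby sum produces counts above 1. Taking max instead keeps strict 0/1 indicators:

# starting from the original long (user, item_id) dataframe
flags = (pd.get_dummies(df, columns=['item_id'])
           .groupby('user')
           .max()          # caps repeated (user, item) pairs at 1
           .astype(int)
           .reset_index())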

Python pandas count occurrences in each column

I am new to pandas. Can someone help me calculate the frequencies of values in each column?
Dataframe:
id|flag1|flag2|flag3|
---------------------
1 | 1 | 2 | 1 |
2 | 3 | 1 | 1 |
3 | 3 | 4 | 4 |
4 | 4 | 1 | 4 |
5 | 2 | 3 | 2 |
I want something like
id|flag1|flag2|flag3|
---------------------
1 | 1 | 2 | 2 |
2 | 1 | 1 | 1 |
3 | 2 | 1 | 0 |
4 | 1 | 1 | 2 |
Explanation: the value 1 occurs once in flag1, twice in flag2, and twice in flag3 (the first column of the desired output is the value itself, not an id).
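A minimal reconstruction of this dataframe, as a sketch from the table above:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'flag1': [1, 3, 3, 4, 2],
                   'flag2': [2, 1, 4, 1, 3],
                   'flag3': [1, 1, 4, 4, 2]})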
First select only the flag columns with filter (or by removing the id column), then apply value_counts to each column, and finally replace NaN with 0 and cast to int:
df = df.filter(like='flag').apply(lambda x: x.value_counts()).fillna(0).astype(int)
print(df)
   flag1  flag2  flag3
1      1      2      2
2      1      1      1
3      2      1      0
4      1      1      2
Or:
df = df.drop(columns='id').apply(lambda x: x.value_counts()).fillna(0).astype(int)
print(df)
   flag1  flag2  flag3
1      1      2      2
2      1      1      1
3      2      1      0
4      1      1      2
Thank you, Bharath, for the suggestion (note that value_counts is passed to apply, not called):
df = df.filter(like='flag').apply(pd.Series.value_counts).fillna(0).astype(int)

How to groupby count across multiple columns in pandas

I have the following sample dataframe in Python pandas:
+---+------+------+------+
| | col1 | col2 | col3 |
+---+------+------+------+
| 0 | a | d | b |
+---+------+------+------+
| 1 | a | c | b |
+---+------+------+------+
| 2 | c | b | c |
+---+------+------+------+
| 3 | b | b | c |
+---+------+------+------+
| 4 | a | a | d |
+---+------+------+------+
I would like to perform a count of all the 'a,' 'b,' 'c,' and 'd' values across columns 1-3 so that I would end up with a dataframe like this:
+---+--------+-------+
| | letter | count |
+---+--------+-------+
| 0 | a | 4 |
+---+--------+-------+
| 1 | b | 5 |
+---+--------+-------+
| 2 | c | 4 |
+---+--------+-------+
| 3 | d | 2 |
+---+--------+-------+
One way I can do this is to stack the columns on top of each other and THEN do a groupby count, but I feel like there has to be a better way. Can someone help me with this?
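A minimal reconstruction of this dataframe, as a sketch from the table above:

import pandas as pd

df = pd.DataFrame({'col1': list('aacba'),
                   'col2': list('dcbba'),
                   'col3': list('bbccd')})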
You can stack() the dataframe to put all columns into rows and then do value_counts:
df.stack().value_counts()
b    5
c    4
a    4
d    2
dtype: int64
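To get the exact letter/count dataframe from the question, a possible follow-up:

out = (df.stack()
         .value_counts()
         .rename_axis('letter')
         .reset_index(name='count'))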
You can apply value_counts to each column and then sum across the rows:
print(df.apply(pd.value_counts))
   col1  col2  col3
a   3.0     1   NaN
b   1.0     2   2.0
c   1.0     1   2.0
d   NaN     1   1.0
df1 = df.apply(pd.value_counts).sum(axis=1).reset_index()
df1.columns = ['letter', 'count']
df1['count'] = df1['count'].astype(int)
print(df1)
  letter  count
0      a      4
1      b      5
2      c      4
3      d      2
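pd.value_counts as a top-level function is deprecated in recent pandas; an equivalent sketch using the Series method:

df1 = (df.apply(lambda s: s.value_counts())   # per-column frequencies
         .sum(axis=1)                         # total per letter, NaN treated as 0
         .astype(int)
         .rename_axis('letter')
         .reset_index(name='count'))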

Pandas / Numpy: How to Turn Column Data Into Sparse Matrix

I'm working on an IPython project with Pandas and NumPy. I'm just learning too, so this question is probably pretty basic. Let's say I have two columns of data:
---------------
| col1 | col2 |
---------------
| a | b |
| c | d |
| b | e |
---------------
I want to transform this data into the following form:
---------------------
| a | b | c | d | e |
---------------------
| 1 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 1 | 0 | 0 | 1 |
---------------------
Then I want to take a three-column version
---------------------
| col1 | col2 | val |
---------------------
| a | b | .5 |
| c | d | .3 |
| b | e | .2 |
---------------------
and turn it into
---------------------------
| a | b | c | d | e | val |
---------------------------
| 1 | 1 | 0 | 0 | 0 | .5 |
| 0 | 0 | 1 | 1 | 0 | .3 |
| 0 | 1 | 0 | 0 | 1 | .2 |
---------------------------
I'm very new to Pandas and Numpy, how would I do this? What functions would I use?
I think you're looking for the pandas.get_dummies() function combined with DataFrame.add() (older pandas had a combineAdd method for this, but it has since been removed):
In [7]: df = pd.DataFrame({'col1': list('acb'),
                           'col2': list('bde'),
                           'val': [.5, .3, .2]})
In [8]: df1 = pd.get_dummies(df.col1)
In [9]: df2 = pd.get_dummies(df.col2)
This produces the following two dataframes:
In [16]: df1
Out[16]:
a b c
0 1 0 0
1 0 0 1
2 0 1 0
[3 rows x 3 columns]
In [17]: df2
Out[17]:
b d e
0 1 0 0
1 0 1 0
2 0 0 1
[3 rows x 3 columns]
Which can be combined as follows (add with fill_value=0 is the modern replacement for combineAdd):
In [10]: dummies = df1.add(df2, fill_value=0).astype(int)
In [18]: dummies
Out[18]:
a b c d e
0 1 1 0 0 0
1 0 0 1 1 0
2 0 1 0 0 1
[3 rows x 5 columns]
The last step is to copy the val column into the new dataframe.
In [19]: dummies['val'] = df.val
In [20]: dummies
Out[20]:
a b c d e val
0 1 1 0 0 0 0.5
1 0 0 1 1 0 0.3
2 0 1 0 0 1 0.2
[3 rows x 6 columns]
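A more compact route, as a sketch mirroring the stack + get_dummies pattern from the first answer on this page (for a genuinely sparse result, get_dummies also accepts sparse=True):

dummies = (df[['col1', 'col2']]
           .stack()
           .str.get_dummies()   # one 0/1 indicator column per letter
           .groupby(level=0)    # collapse back to one row per original row
           .sum())
dummies['val'] = df['val']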
