I have a dataframe in the following form:
+---------+-------+-------+---------+---------+
| payment | type | err | country | source |
+---------+-------+-------+---------+---------+
| visa | type1 | OK | AR | source1 |
| paypal | type1 | OK | DE | source1 |
| mc | type2 | ERROR | AU | source2 |
| visa | type3 | OK | US | source2 |
| visa | type2 | OK | FR | source3 |
| visa | type1 | OK | FR | source2 |
+---------+-------+-------+---------+---------+
df = pd.DataFrame({'payment':['visa','paypal','mc','visa','visa','visa'],
'type':['type1','type1','type2','type3','type2','type1'],
'err':['OK','OK','ERROR','OK','OK','OK'],
'country':['AR','DE','AU','US','FR','FR'],
'source':['source1','source1','source2','source2','source3','source2'],
})
My goal is to group by payment and country and create new columns:
number_payments - just the row count for the group,
num_errors - the number of ERROR values in the group,
num_type1 .. num_type3 - counts of the corresponding values in the type column (only 3 possible values),
num_source1 .. num_source3 - counts of the corresponding values in the source column (only 3 possible values).
Like this:
+---------+---------+-----------------+------------+-----------+-----------+-----------+-------------+-------------+-------------+
| payment | country | number_payments | num_errors | num_type1 | num_type2 | num_type3 | num_source1 | num_source2 | num_source3 |
+---------+---------+-----------------+------------+-----------+-----------+-----------+-------------+-------------+-------------+
| visa    | AR      | 1               | 0          | 1         | 0         | 0         | 1           | 0           | 0           |
| visa    | US      | 1               | 0          | 0         | 0         | 1         | 0           | 1           | 0           |
| visa    | FR      | 2               | 0          | 1         | 1         | 0         | 0           | 1           | 1           |
| mc      | AU      | 1               | 1          | 0         | 1         | 0         | 0           | 1           | 0           |
+---------+---------+-----------------+------------+-----------+-----------+-----------+-------------+-------------+-------------+
I tried to combine pandas groupby and pivot, but failed to put it all together and it got ugly. I'm pretty sure there are good and fast methods to do this.
Appreciate any help.
You can use get_dummies, then select the two grouper columns to create the group, and finally join the group size with the group sum:
c = df['err'].eq("ERROR")
g = (df[['payment','country']]
       .assign(num_errors=c,
               **pd.get_dummies(df[['type','source']], prefix=['num','num']))
       .groupby(['payment','country']))
out = g.size().to_frame("number_payments").join(g.sum()).reset_index()
print(out)
payment country number_payments num_errors num_type1 num_type2 \
0 mc AU 1 1 0 1
1 paypal DE 1 0 1 0
2 visa AR 1 0 1 0
3 visa FR 2 0 1 1
4 visa US 1 0 0 0
num_type3 num_source1 num_source2 num_source3
0 0 0 1 0
1 0 1 0 0
2 0 1 0 0
3 0 0 1 1
4 1 0 1 0
First, it is better to clean the data for your stated purposes:
df['err_bool'] = (df['err'] == 'ERROR').astype(int)
Then we use groupby with named aggregations for the applicable columns:
df_grouped = df.groupby(['country','payment']).agg(
    number_payments=('err_bool', 'count'),
    num_errors=('err_bool', 'sum'))
Then we can use pivot_table for type and source:
df['dummy'] = 1
df_type = df.pivot_table(
    index=['country','payment'],
    columns='type',
    values='dummy',
    aggfunc='sum',
    fill_value=0
)
df_source = df.pivot_table(
    index=['country','payment'],
    columns='source',
    values='dummy',
    aggfunc='sum',
    fill_value=0
)
Then we join everything together:
df_grouped = df_grouped.join(df_type).join(df_source)
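A hedged finishing touch (assuming you want the num_-prefixed column names from the desired output): the pivoted columns keep the raw type/source labels after the join, so one option is to rename them afterwards:
df_grouped = df_grouped.rename(columns=lambda c: 'num_' + c if c.startswith(('type', 'source')) else c)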
I have the following code:
import pandas as pd
status = ['Pass','Fail']
item_info = pd.DataFrame({
'student': ['John','Alice','Pete','Mike','John','Alice','Joseph'],
'test': ['Pass','Pass','Pass','Pass','Pass','Pass','Pass']
})
item_status = pd.crosstab(item_info['student'],item_info['test'])
print(item_status)
Which produces:
| Student | Pass |
|---------|------|
| Alice | 2 |
| John | 2 |
| Joseph | 1 |
| Mike | 1 |
| Pete | 1 |
However, I want to create something that looks like this:
| Student | Pass | Fail | Total |
|---------|------|------|-------|
| Alice | 2 | 0 | 2 |
| John | 2 | 0 | 2 |
| Joseph | 1 | 0 | 1 |
| Mike | 1 | 0 | 1 |
| Pete | 1 | 0 | 1 |
How do I change the code so that it includes a Fail column with 0 for all of the students and provides a total?
A generic solution that adds an extra label without knowing the existing labels in advance, using reindex:
cols = item_info['test'].unique().tolist()+['Fail'] #adding the extra label
pd.crosstab(item_info['student'],item_info['test']).reindex(columns=cols,fill_value=0)
Or, depending on what you want (the version above assumes you want to chain methods), simply add the column afterwards:
item_status = pd.crosstab(item_info['student'],item_info['test'])
item_status['Fail'] = 0
test Pass Fail
student
Alice 2 0
John 2 0
Joseph 1 0
Mike 1 0
Pete 1 0
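The desired output also has a Total column; one hedged way to add it (assuming a plain row-wise sum of the counts is what you want) is:
item_status = pd.crosstab(item_info['student'], item_info['test']).reindex(columns=cols, fill_value=0)
item_status['Total'] = item_status.sum(axis=1)   # Pass + Fail per student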
I am trying to locate duplicate rows in my pandas dataframe. In reality, df.shape is (438796, 4531), but I am using the toy example below as an MRE:
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low |
| id_104 | 1 | 1 | 10 | 1 | 1 | High |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low |
| id_106 | 0 | 0 | 0 | 0 | 0 | High |
| id_107 | 1 | 1 | 6 | 0 | 1 | High |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium |
| id_110 | 0 | 1 | 32 | 0 | 1 | High |
What I am trying to accomplish is, looking only at a subset of the features, to find duplicate rows, keep the first, and then record which id: label pairs are its duplicates.
I have looked at the following posts:
find duplicate rows in a pandas dataframe
(I could not figure out how to replace col1 in df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin') with my list of cols)
Find all duplicate rows in a pandas dataframe
I know pandas has a duplicated() call. So I tried implementing that and it sort of works:
import pandas as pd
# Read in example data
df = pd.read_clipboard()
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# Create a list of duplicates
dupes = sub_df.index[sub_df.duplicated(keep='first')].tolist()
# Loop through the duplicates and print out the values I want
for idx in dupes:
    # print(df[:idx])
    print(df.loc[[idx], ['id', 'label']])
However, what I am trying to do is, for a particular row, determine which rows are duplicates of it by saving those rows as id: label combinations. So while I'm able to extract the id and label for each duplicate, I have no way to map it back to the original row of which it is a duplicate.
An ideal dataset would look like:
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label | duplicates |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|:-------------------------------------------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High | {id_102: Low, id_104: High, id_108: Medium} |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium | {id_107: High} |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low | |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low | |
| id_104 | 1 | 1 | 10 | 1 | 1 | High | |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low | {id_110: High} |
| id_106 | 0 | 0 | 0 | 0 | 0 | High | |
| id_107 | 1 | 1 | 6 | 0 | 1 | High | |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium | |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium | |
| id_110 | 0 | 1 | 32 | 0 | 1 | High | |
How can I take my duplicated values and map them back to their originals efficiently (understanding the size of my actual dataset)?
Working with dictionaries in columns is really complicated; here is one possible solution:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# mask of duplicate rows (all except the first occurrence)
m = sub_df.duplicated()
# create (id, label) tuples for the duplicate rows and aggregate them into a dict per group
s = (df.assign(a=df[['id','label']].apply(tuple, axis=1))[m]
       .groupby(cols)['a']
       .agg(lambda x: dict(list(x))))
# add new column
df = df.join(s.rename('duplicates'), on=cols)
# replace missing values and non-first duplicates with empty strings
df['duplicates'] = df['duplicates'].fillna('').mask(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicates
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
An alternative uses a custom function that assigns all duplicates except the first (as a dict) to the first row of each group; the last step uses a changed mask to replace the remaining values with empty strings:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
m = ~df.duplicated(subset=cols) & df.duplicated(subset=cols, keep=False)
def f(x):
    x.loc[x.index[0], 'duplicated'] = [dict(x[['id','label']].to_numpy()[1:])]
    return x
df = df.groupby(cols).apply(f)
df['duplicated'] = df['duplicated'].where(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicated
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
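Given the real shape of (438796, 4531), note that the groupby.apply version runs a Python function per group, so it will likely be noticeably slower than the first, join-based solution; if performance matters, the first approach is probably the better starting point.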
I have a dataframe where every row is a user id and an item that the user has:
| user | item_id |
|------|---------|
| 1 | a |
| 1 | b |
| 2 | b |
| 3 | c |
| 4 | a |
| 4 | c |
I want to create n columns, where n is the number of distinct item_id values, keep one row per user, and fill the columns with 1/0 according to whether the value is present for that user.
| user | item_a | item_b | item_c |
|------|--------|--------|--------|
| 1    | 1      | 1      | 0      |
| 2    | 0      | 1      | 0      |
| 3    | 0      | 0      | 1      |
| 4    | 1      | 0      | 1      |
Use pivot_table:
import pandas as pd
df = pd.DataFrame({'user': [1,1,2,3,4,4], 'item_id': list('abbcac')})
out = df.assign(val=1).pivot_table(values='val',
                                   index='user',
                                   columns='item_id',
                                   fill_value=0)
Or with crosstab on the original df:
pd.crosstab(df.user, df.item_id).add_prefix('item_').reset_index()
Yet another approach is to use get_dummies and then group by user and sum:
out = pd.get_dummies(df, columns=['item_id']).groupby('user').sum().reset_index()
desired result:
user item_id_a item_id_b item_id_c
0 1 1 1 0
1 2 0 1 0
2 3 0 0 1
3 4 1 0 1
and to rename the columns:
out.columns = out.columns.str.replace("_id", "")
out
user item_a item_b item_c
0 1 1 1 0
1 2 0 1 0
2 3 0 0 1
3 4 1 0 1
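A hedged edge case (not present in the sample data): if a user could hold the same item more than once, the summed dummies would exceed 1; using max instead of sum keeps the result as clean presence flags:
out = (pd.get_dummies(df, columns=['item_id'])
         .groupby('user').max()   # max instead of sum: 1 if the item appears at all
         .astype(int)             # newer pandas get_dummies returns bool, cast to 0/1
         .reset_index())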
Current data frame: I have a pandas data frame where each employee has a text code (all codes start with T) and an associated frequency right next to the code. All text codes have 8 characters.
+----------+-------------------------------------------------------------+
| emp_id | text |
+----------+-------------------------------------------------------------+
| E0001 | [T0431516,-8,T0401531,-12,T0517519,12] |
| E0002 | [T0701540,-1,T0431516,-2] |
| E0003 | [T0517519,-1,T0421531,-7,T0516319,9,T0500371,-6,T0309711,-3]|
| E0004 | [T0516319,-3] |
| E0005 | [T0431516,2] |
+----------+-------------------------------------------------------------+
Expected data frame: I am trying to make the text codes present in the data frame as individual columns and if an employee has a frequency for that code then populate frequency else 0.
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| emp_id | T0431516 | T0401531 | T0517519 | T0701540 | T0421531 | T0516319 | T0500371 | T0309711 |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
| E0001  | -8       | -12      | 12       | 0        | 0        | 0        | 0        | 0        |
| E0002  | -2       | 0        | 0        | -1       | 0        | 0        | 0        | 0        |
| E0003  | 0        | 0        | -1       | 0        | -7       | 9        | -6       | -3       |
| E0004  | 0        | 0        | 0        | 0        | 0        | -3       | 0        | 0        |
| E0005  | 2        | 0        | 0        | 0        | 0        | 0        | 0        | 0        |
+--------+----------+----------+----------+----------+----------+----------+----------+----------+
Sample data:
pd.DataFrame({'emp_id' : {0: 'E0001', 1: 'E0002', 2: 'E0003', 3: 'E0004', 4: 'E0005'},
'text' : {0: '[T0431516,-8,T0401531,-12,T0517519,12]', 1: '[T0701540,-1,T0431516,-2]', 2: '[T0517519,-1,T0421531,-7,T0516319,9,T0500371,-6,T0309711,-3]', 3: '[T0516319,-3]', 4: '[T0431516,2]'}
})
So far, my attempts have been unsuccessful. Any pointers/help is much appreciated!
You can explode the dataframe and then create a pivot_table:
df = pd.DataFrame({'emp_id': ['E0001', 'E0002', 'E0003', 'E0004', 'E0005'],
                   'text': [['T0431516',-8,'T0401531',-12,'T0517519',12],
                            ['T0701540',-1,'T0431516',-2],
                            ['T0517519',-1,'T0421531',-7,'T0516319',9,'T0500371',-6,'T0309711',-3],
                            ['T0516319',-3],
                            ['T0431516',2]]})
df = df.explode('text')
df['freq'] = df['text'].shift(-1)
df = df[df['text'].str[0] == 'T']
df['freq'] = df['freq'].astype(int)
df = pd.pivot_table(df, index='emp_id', columns='text', values='freq',aggfunc = 'sum').fillna(0).astype(int)
df
Out[1]:
text T0309711 T0401531 T0421531 T0431516 T0500371 T0516319 T0517519 \
emp_id
E0001 0 -12 0 -8 0 0 12
E0002 0 0 0 -2 0 0 0
E0003 -3 0 -7 0 -6 9 -1
E0004 0 0 0 0 0 -3 0
E0005 0 0 0 2 0 0 0
text T0701540
emp_id
E0001 0
E0002 -1
E0003 0
E0004 0
E0005 0
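If the text column actually holds strings such as '[T0431516,-8,...]' (as in the question's sample data) rather than lists, a hedged pre-processing step is to strip the brackets and split on commas so that explode sees real lists; the rest of the approach stays the same:
# assumes df is the sample frame from the question, with string-valued 'text'
df['text'] = df['text'].str.strip('[]').str.split(',')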
I have data with row-level granularity in terms of events, and I want to aggregate it by customer ID. The data is a pandas df and looks like this:
| Event ID | Cust ID | P1  | P2 | P3 | P4 |
|----------|---------|-----|----|----|----|
| 1        | 1       | 12  | 0  | 0  | 0  |
| 2        | 1       | 12  | 0  | 0  | 0  |
| 3        | 1       | 10  | 12 | 0  | 0  |
| 4        | 2       | 206 | 0  | 0  | 0  |
| 5        | 2       | 206 | 25 | 0  | 0  |
P1 to P4 hold numbers which are just levels; they are event categories I need to get counts of (there are 175+ codes), where each event category gets its own column.
The output I want would ideally look like:
| Cust ID | Count(12) | Count(10) | Count(25) | Count(206) |
|---------|-----------|-----------|-----------|------------|
| 1       | 3         | 1         | 0         | 0          |
| 2       | 0         | 0         | 1         | 2          |
The challenge I am facing is taking the counts across multiple columns. There are 2 '12's in P1 and 1 '12' in P2.
I tried using groupby and merge. But I've either used them incorrectly or they're the wrong functions to use because I get a lot of 'NaN's in the resulting table.
You can use the following method:
df = pd.DataFrame({'Event ID':[1,2,3,4,5],
'Cust ID':[1]*3+[2]*2,
'P1':[12,12,10,206,25],
'P2':[0,0,12,0,0],
'P3':[0]*5,
'P4':[0]*5})
df.melt(['Event ID','Cust ID'])\
.groupby('Cust ID')['value'].value_counts()\
.unstack().add_prefix('Count_')\
.reset_index()
Output:
value Cust ID Count_0 Count_10 Count_12 Count_25 Count_206
0 1 8.0 1.0 3.0 NaN NaN
1 2 6.0 NaN NaN 1.0 1.0
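The padding zeros are counted as well (Count_0) and absent categories show up as NaN; a hedged cleanup (assuming 0 is only padding and not a real event code) drops that column and fills the gaps within the same chain:
out = (df.melt(['Event ID', 'Cust ID'])
         .groupby('Cust ID')['value'].value_counts()
         .unstack(fill_value=0)     # missing categories become 0 instead of NaN
         .drop(columns=0)           # drop the count of the padding value
         .add_prefix('Count_')
         .reset_index())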