How to Sum and Group Data Frame Elements? - python

+-----+-------+--------+
| | Buyer | Sex |
+-----+-------+--------+
| 0 | 1 | Male |
| 1 | 1 | Female |
| 2 | 0 | Male |
| 3 | 1 | Female |
| ... | ... | ... |
+-----+-------+--------+
I'd like to sum and group the data frame above into the table below. Does pandas have any built-in functions that can accomplish this, or do I have to manually iterate, sum, and group?
+---+---------+------+
| | Female | Male |
+---+---------+------+
| 0 | 81 | 392 |
| 1 | 539 | 233 |
+---+---------+------+

Use pivot_table, with 'count' as your aggfunc.
Also, since some Buyer/Sex combinations may never occur, use fillna to fill the empty cells with 0:
In [28]:
df['V'] = 1
print(df)
   Buyer     Sex  V
0      1    Male  1
1      1  Female  1
2      0    Male  1
3      1  Female  1
In [29]:
print(df.pivot_table(index='Buyer', columns='Sex', values='V', aggfunc='count').fillna(0))
Sex    Female  Male
Buyer
0           0     1
1           2     1
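If all you need is the counts, pd.crosstab is a possible shortcut that avoids the helper column V; a minimal, self-contained sketch:
import pandas as pd

df = pd.DataFrame({'Buyer': [1, 1, 0, 1], 'Sex': ['Male', 'Female', 'Male', 'Female']})

# cross-tabulate Buyer against Sex; missing combinations come back as 0 automatically
print(pd.crosstab(df['Buyer'], df['Sex']))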

frame.groupby('Sex').aggregate(some_function)
should work,
or even, in your case:
frame.groupby('Sex').sum()
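If you prefer the groupby route but still want the Buyer-by-Sex table from the question, a minimal sketch (reusing the same toy data) is:
import pandas as pd

df = pd.DataFrame({'Buyer': [1, 1, 0, 1], 'Sex': ['Male', 'Female', 'Male', 'Female']})

# count rows per (Buyer, Sex) pair, then move Sex into the columns; fill_value covers absent pairs
print(df.groupby(['Buyer', 'Sex']).size().unstack(fill_value=0))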

Related

Pandas: group and custom transform dataframe long to wide

I have a dataframe in the following form:
+---------+-------+-------+---------+---------+
| payment | type | err | country | source |
+---------+-------+-------+---------+---------+
| visa | type1 | OK | AR | source1 |
| paypal | type1 | OK | DE | source1 |
| mc | type2 | ERROR | AU | source2 |
| visa | type3 | OK | US | source2 |
| visa | type2 | OK | FR | source3 |
| visa | type1 | OK | FR | source2 |
+---------+-------+-------+---------+---------+
df = pd.DataFrame({'payment': ['visa', 'paypal', 'mc', 'visa', 'visa', 'visa'],
                   'type': ['type1', 'type1', 'type2', 'type3', 'type2', 'type1'],
                   'err': ['OK', 'OK', 'ERROR', 'OK', 'OK', 'OK'],
                   'country': ['AR', 'DE', 'AU', 'US', 'FR', 'FR'],
                   'source': ['source1', 'source1', 'source2', 'source2', 'source3', 'source2'],
                   })
My goal is to transform it so that it is grouped by payment and country, with these new columns:
number_payments - just the count for the groupby,
num_errors - the number of ERROR values in the group,
num_type1.. num_type3 - the number of corresponding values in column type (only 3 possible values),
num_source1.. num_source3 - the number of corresponding values in column source (only 3 possible values).
Like this:
+---------+---------+-----------------+------------+-----------+-----------+-----------+-------------+-------------+-------------+
| payment | country | number_payments | num_errors | num_type1 | num_type2 | num_type3 | num_source1 | num_source2 | num_source3 |
+---------+---------+-----------------+------------+-----------+-----------+-----------+-------------+-------------+-------------+
| visa | AR | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| visa | US | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| visa | FR | 2 | 0 | 1 | 2 | 0 | 0 | 1 | 1 |
| mc | AU | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
+---------+---------+-----------------+------------+-----------+-----------+-----------+-------------+-------------+-------------+
I tried to combine pandas groupby and pivot, but I couldn't get everything working and what I had was ugly. I'm pretty sure there are good, fast ways to do this.
I'd appreciate any help.
You can use get_dummies, then select the two grouper columns and build the groupby, and finally join the group size with the sums:
c = df['err'].eq("ERROR")
g = (df[['payment', 'country']]
       .assign(num_errors=c, **pd.get_dummies(df[['type', 'source']], prefix=['num', 'num']))
       .groupby(['payment', 'country']))
out = g.size().to_frame("number_payments").join(g.sum()).reset_index()
print(out)
  payment country  number_payments  num_errors  num_type1  num_type2  \
0      mc      AU                1           1          0          1
1  paypal      DE                1           0          1          0
2    visa      AR                1           0          1          0
3    visa      FR                2           0          1          1
4    visa      US                1           0          0          0

   num_type3  num_source1  num_source2  num_source3
0          0            0            1            0
1          0            1            0            0
2          0            1            0            0
3          0            0            1            1
4          1            0            1            0
First it is better to clean the data for your stated purposes:
df['err_bool'] = (df['err'] == 'ERROR').astype(int)
Then we use groupby for the applicable columns:
df_grouped = df.groupby(['country', 'payment']).agg(
    number_payments=('err_bool', 'count'),
    num_errors=('err_bool', 'sum'))
Then we can use pivot_table for type and source:
df['dummy'] = 1
df_type = df.pivot_table(
    index=['country', 'payment'],
    columns='type',
    values='dummy',
    aggfunc='sum',
)
df_source = df.pivot_table(
    index=['country', 'payment'],
    columns='source',
    values='dummy',
    aggfunc='sum',
)
Then we join everything together:
df_grouped = df_grouped.join(df_type).join(df_source)
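If a given type or source level never occurs for some group, the joined frame will contain NaN in those cells; a small, hedged follow-up is:
# replace the NaN left by missing type/source levels with zero counts
df_grouped = df_grouped.fillna(0).astype(int)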

Pass / Fail Dataframe example

I have the following code:
import pandas as pd

status = ['Pass', 'Fail']
item_info = pd.DataFrame({
    'student': ['John', 'Alice', 'Pete', 'Mike', 'John', 'Alice', 'Joseph'],
    'test': ['Pass', 'Pass', 'Pass', 'Pass', 'Pass', 'Pass', 'Pass']
})
item_status = pd.crosstab(item_info['student'], item_info['test'])
print(item_status)
Which produces:
| Student | Pass |
|---------|------|
| Alice | 2 |
| John | 2 |
| Joseph | 1 |
| Mike | 1 |
| Pete | 1 |
However, I want to create something that looks like this:
| Student | Pass | Fail | Total |
|---------|------|------|-------|
| Alice | 2 | 0 | 2 |
| John | 2 | 0 | 2 |
| Joseph | 1 | 0 | 1 |
| Mike | 1 | 0 | 1 |
| Pete | 1 | 0 | 1 |
How do I change the code so that it includes a Fail column with 0 for all of the students and provides a total?
A generic solution that adds the extra label without knowing the existing labels in advance, using reindex:
cols = item_info['test'].unique().tolist() + ['Fail']  # adding the extra label
pd.crosstab(item_info['student'], item_info['test']).reindex(columns=cols, fill_value=0)
Or, depending on what you want (I assumed above that you were looking to chain methods), just add the missing column afterwards:
item_status = pd.crosstab(item_info['student'],item_info['test'])
item_status['Fail'] = 0
test     Pass  Fail
student
Alice       2     0
John        2     0
Joseph      1     0
Mike        1     0
Pete        1     0
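The question also asks for a Total column, which neither snippet above produces; one possible sketch (not part of the original answers) lets crosstab compute the margins and then reindexes in the missing Fail label:
import pandas as pd

item_info = pd.DataFrame({
    'student': ['John', 'Alice', 'Pete', 'Mike', 'John', 'Alice', 'Joseph'],
    'test': ['Pass', 'Pass', 'Pass', 'Pass', 'Pass', 'Pass', 'Pass'],
})

# margins=True adds an extra row/column of totals named by margins_name
item_status = pd.crosstab(item_info['student'], item_info['test'],
                          margins=True, margins_name='Total')

# make sure both labels exist, keep the Total column last, and drop the totals row
item_status = (item_status.reindex(columns=['Pass', 'Fail', 'Total'], fill_value=0)
                          .drop(index='Total'))
print(item_status)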

Mapping duplicate rows to originals with dictionary - Python 3.6

I am trying to locate duplicate rows in my pandas dataframe. In reality, df.shape is (438796, 4531), but I am using the toy example below as an MRE.
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low |
| id_104 | 1 | 1 | 10 | 1 | 1 | High |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low |
| id_106 | 0 | 0 | 0 | 0 | 0 | High |
| id_107 | 1 | 1 | 6 | 0 | 1 | High |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium |
| id_110 | 0 | 1 | 32 | 0 | 1 | High |
What I am trying to accomplish is to look at a subset of the features and, if there are duplicate rows, keep the first and record which id: label pairs are its duplicates.
I have looked at the following posts:
find duplicate rows in a pandas dataframe
(I could not figure out how to replace col1 in df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin') with my list of cols)
Find all duplicate rows in a pandas dataframe
I know pandas has a duplicated() call. So I tried implementing that and it sort of works:
import pandas as pd

# Read in example data
df = pd.read_clipboard()

# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']

# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]

# Create a list of duplicates
dupes = sub_df.index[sub_df.duplicated(keep='first')].tolist()

# Loop through the duplicates and print out the values I want
for idx in dupes:
    # print(df[:idx])
    print(df.loc[[idx], ['id', 'label']])
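For reference, instead of read_clipboard the toy frame above can also be built explicitly, so the snippets here and in the answers run as-is; a sketch:
import pandas as pd

# explicit construction of the toy frame from the table above
df = pd.DataFrame({
    'id': [f'id_{i}' for i in range(100, 111)],
    'ft1': [1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    'ft2': [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1],
    'ft3': [43, 33, 12, 46, 10, 99, 0, 6, 29, 27, 32],
    'ft4': [1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
    'ft5': [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
    'label': ['High', 'Medium', 'Low', 'Low', 'High', 'Low', 'High', 'High', 'Medium', 'Medium', 'High'],
})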
However, what I am trying to do is, for a particular row, determine which rows are duplicates of it and save those rows as id: label combinations. So while I'm able to extract the id and label of each duplicate, I have no way to map it back to the original row that it duplicates.
An ideal dataset would look like:
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label | duplicates |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|:-------------------------------------------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High | {id_102: Low, id_104: High, id_108: Medium} |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium | {id_107: High} |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low | |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low | |
| id_104 | 1 | 1 | 10 | 1 | 1 | High | |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low | {id_110: High} |
| id_106 | 0 | 0 | 0 | 0 | 0 | High | |
| id_107 | 1 | 1 | 6 | 0 | 1 | High | |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium | |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium | |
| id_110 | 0 | 1 | 32 | 0 | 1 | High | |
How can I take my duplicated values and map them back to their originals efficiently (understanding the size of my actual dataset)?
Working with dictionaries in columns is really awkward; here is one possible solution:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']

# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]

# mask for rows that duplicate an earlier row
m = sub_df.duplicated()

# create (id, label) tuples, then aggregate them into a dict per duplicate group
s = (df.assign(a=df[['id', 'label']].apply(tuple, 1))[m]
       .groupby(cols)['a']
       .agg(lambda x: dict(list(x))))

# add the new column
df = df.join(s.rename('duplicates'), on=cols)

# replace missing values and non-first duplicates with empty strings
df['duplicates'] = df['duplicates'].fillna('').mask(m, '')

print(df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicates
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
An alternative with a custom function that assigns all duplicates (excluding the first) to the first row of each group in a new column; at the end the inverted mask replaces the remaining values with empty strings:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']

m = ~df.duplicated(subset=cols) & df.duplicated(subset=cols, keep=False)

def f(x):
    # store the (id, label) pairs of every duplicate after the first on the group's first row
    x.loc[x.index[0], 'duplicated'] = [dict(x[['id', 'label']].to_numpy()[1:])]
    return x

df = df.groupby(cols).apply(f)
df['duplicated'] = df['duplicated'].where(m, '')
print(df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicated
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10

Sorting according to duplication with multiple subset of duplication within a pandas data frame

I'm new to Python, and I want to sort some duplicated data according to certain columns within a data frame, for example:
import pandas as pd

df = pd.read_excel('Data.xlsx', index_col='ID')
df2 = df[df.duplicated(subset=['A', 'B'], keep=False)]
print(df2)
Let's say the output will be like this
'ID'|'Name' |'A'|'B'|
1 | Ash | 1 | 1 |
2 | James | 1 | 1 |
3 | Ash | 1 | 1 |
4 | James | 1 | 1 |
5 | Ash | 2 | 1 |
6 | James | 1 | 1 |
7 | Ash | 2 | 1 |
I would like the output to look like this:
'Name' |'A'|'B'|'Pattern'|'Frequency of Pattern'|
Ash | 1 | 1 | 1 | 2 |
Ash | 2 | 1 | 2 | 2 |
James | 1 | 1 | 3 | 3 |
So far I haven't found any similar post.
Use GroupBy.size to count the duplicates and then add the new column at a specific position with DataFrame.insert:
df4 = df2.groupby(['Name', 'A', 'B']).size().reset_index(name='Frequency of Pattern')
df4.insert(3, 'Pattern', df4.index + 1)
print(df4)
Name A B Pattern Frequency of Pattern
0 Ash 1 1 1 2
1 Ash 2 1 2 2
2 James 1 1 3 3
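For completeness, a self-contained sketch of the whole flow, reconstructing df2 from the duplicated rows shown above (the data values are taken from the example; the final sort is an assumption about what 'sorting according to duplication' should mean):
import pandas as pd

# the duplicated rows printed above, with ID as the index
df2 = pd.DataFrame({
    'Name': ['Ash', 'James', 'Ash', 'James', 'Ash', 'James', 'Ash'],
    'A': [1, 1, 1, 1, 2, 1, 2],
    'B': [1, 1, 1, 1, 1, 1, 1],
}, index=pd.Index(range(1, 8), name='ID'))

df4 = df2.groupby(['Name', 'A', 'B']).size().reset_index(name='Frequency of Pattern')
df4.insert(3, 'Pattern', df4.index + 1)

# optionally sort the patterns by how often they occur
print(df4.sort_values('Frequency of Pattern', ascending=False))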

SQL / Python - how to return count for each attribute and sub-attribute from another table

I have a SELECT that returns a table which has:
- 5 possible values for region (from 1 to 5), and
- 3 possible values for age (1-3), with 2 possible values (1 or 2) for gender for each age group.
So table 1 looks something like this:
+----------+-----------+--------------+---------------+---------+
| att_name | att_value | sub_att_name | sub_att_value | percent |
+----------+-----------+--------------+---------------+---------+
| region | 1 | NULL | 0 | 34 |
| region | 2 | NULL | 0 | 22 |
| region | 3 | NULL | 0 | 15 |
| region | 4 | NULL | 0 | 37 |
| region | 5 | NULL | 0 | 12 |
| age | 1 | gender | 1 | 28 |
| age | 1 | gender | 2 | 8 |
| age | 2 | gender | 1 | 13 |
| age | 2 | gender | 2 | 45 |
| age | 3 | gender | 1 | 34 |
| age | 3 | gender | 2 | 34 |
+----------+-----------+--------------+---------------+---------+
The second table holds records with values from table 1, where the unique values of att_name and sub_att_name in table 1 are the columns of table 2:
+--------+-----+-----+
| region | age | gen |
+--------+-----+-----+
| 2 | 2 | 1 |
| 3 | 1 | 2 |
| 3 | 3 | 2 |
| 1 | 3 | 1 |
| 4 | 2 | 2 |
| 5 | 2 | 1 |
+--------+-----+-----+
I want to return the count of each unique value for the region and age/gender attributes from the second table.
The final result should look like this:
+----------+-----------+-----------------+--------------+---------------+---------------------+---------+
| att_name | att_value | att_value_count | sub_att_name | sub_att_value | sub_att_value_count | percent |
+----------+-----------+-----------------+--------------+---------------+---------------------+---------+
| region | 1 | 1 | NULL | 0 | NULL | 34 |
| region | 2 | 1 | NULL | 0 | NULL | 22 |
| region | 3 | 2 | NULL | 0 | NULL | 15 |
| region | 4 | 1 | NULL | 0 | NULL | 37 |
| region | 5 | 1 | NULL | 0 | NULL | 12 |
| age | 1 | NULL | gender | 1 | 0 | 28 |
| age | 1 | NULL | gender | 2 | 1 | 8 |
| age | 2 | NULL | gender | 1 | 2 | 13 |
| age | 2 | NULL | gender | 2 | 1 | 45 |
| age | 3 | NULL | gender | 1 | 1 | 34 |
| age | 3 | NULL | gender | 2 | 1 | 34 |
+----------+-----------+-----------------+--------------+---------------+---------------------+---------+
Explanation
Region - doesn't have a sub-attribute, so sub_att_name and sub_att_value_count are NULL.
att_value_count - counts the appearances of each unique region (1 for every region except region 3, which appears twice).
Age/gender - counts the combinations of age and gender that appear (the groups are 1/1, 1/2, 2/1, 2/2 and 3/1, 3/2).
Since we only need to fill in values for the combinations, att_value_count is NULL for those rows.
I'm tagging python and pandas in this question since I don't know whether this is possible in SQL at all... I hope it is, since we use analytical tools to pull tables and views from the database more naturally.
EDIT
SQL - the answers look complicated; I'll test them and see if they work tomorrow.
Python - seems more appealing now - is there a way to parse att_name and sub_att_name, find the level-1 and level-2 attributes, and act accordingly? I think this is only possible with Python, and we do have different attributes and attribute levels.
I'm already thankful for the answers given!
I think this is good enough to solve the issue:
import numpy as np
import pandas as pd

data_1 = {'att_name': ['region','region','region','region','region','age','age','age','age','age','age'], 'att_value': [1,2,3,4,5,1,1,2,2,3,3], 'sub_att_name': [np.nan,np.nan,np.nan,np.nan,np.nan,'gender','gender','gender','gender','gender','gender'], 'sub_att_value': [0,0,0,0,0,1,2,1,2,1,2], 'percent': [34,22,15,37,12,28,8,13,45,34,34]}
df_1 = pd.DataFrame(data_1)
data_2 = {'region': [2,3,3,1,4,5], 'age': [2,1,3,3,2,2], 'gen': [1,2,2,1,2,1]}
df_2 = pd.DataFrame(data_2)
df_2_grouped = df_2.groupby(['age','gen'], as_index=False).agg({'region': 'count'}).rename(columns={'region': 'counts'})
df_final = df_1.merge(df_2_grouped, how='left', left_on=['att_value','sub_att_value'], right_on=['age','gen']).drop(columns=['age','gen']).rename(columns={'counts': 'sub_att_value_count'})
Output of df_final:
   att_name  att_value sub_att_name  sub_att_value  percent  sub_att_value_count
0    region          1          NaN              0       34                  NaN
1    region          2          NaN              0       22                  NaN
2    region          3          NaN              0       15                  NaN
3    region          4          NaN              0       37                  NaN
4    region          5          NaN              0       12                  NaN
5       age          1       gender              1       28                  NaN
6       age          1       gender              2        8                  1.0
7       age          2       gender              1       13                  2.0
8       age          2       gender              2       45                  1.0
9       age          3       gender              1       34                  1.0
10      age          3       gender              2       34                  1.0
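The expected result also contains an att_value_count for the region rows, which df_final above does not yet have; one possible addition in the same spirit (a sketch, where region_counts and is_region are names introduced here) is:
# count how often each region appears in df_2 and map it onto the region rows only
region_counts = df_2['region'].value_counts()
is_region = df_final['att_name'].eq('region')
df_final.loc[is_region, 'att_value_count'] = df_final.loc[is_region, 'att_value'].map(region_counts)
print(df_final)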
This is a pandas solution; basically a lookup/map.
import numpy as np

df['att_value_count'] = np.nan
s = df['att_name'].eq('region')
df.loc[s, 'att_value_count'] = df.loc[s, 'att_value'].map(df2['region'].value_counts())

# step 2
counts = df2.groupby('age')['gen'].value_counts().unstack('gen', fill_value=0)
df['sub_att_value_count'] = np.nan
tmp = df.loc[~s, ['att_value', 'sub_att_value']]
df.loc[~s, 'sub_att_value_count'] = counts.lookup(tmp['att_value'], tmp['sub_att_value'])
You can also use merge, which is more SQL-friendly. For example, in step 2:
counts = df2.groupby('age')['gen'].value_counts().reset_index(name='sub_att_value_count')
(df.merge(counts,
          left_on=['att_value', 'sub_att_value'],
          right_on=['age', 'gen'],
          how='outer')
   .drop(['age', 'gen'], axis=1)
)
Output:
    att_name  att_value sub_att_name  sub_att_value  percent  att_value_count  sub_att_value_count
0     region          1          nan              0       34                1                  nan
1     region          2          nan              0       22                1                  nan
2     region          3          nan              0       15                2                  nan
3     region          4          nan              0       37                1                  nan
4     region          5          nan              0       12                1                  nan
5        age          1       gender              1       28              nan                    0
6        age          1       gender              2        8              nan                    1
7        age          2       gender              1       13              nan                    2
8        age          2       gender              2       45              nan                    1
9        age          3       gender              1       34              nan                    1
10       age          3       gender              2       34              nan                    1
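A side note: DataFrame.lookup, used in step 2 above, was deprecated in pandas 1.2 and removed in 2.0. On a newer pandas, a rough equivalent of that single line, reusing the unstacked counts frame and tmp from step 2, could be:
# positional lookup that mirrors counts.lookup(tmp['att_value'], tmp['sub_att_value'])
rows = counts.index.get_indexer(tmp['att_value'])
cols = counts.columns.get_indexer(tmp['sub_att_value'])
df.loc[~s, 'sub_att_value_count'] = counts.to_numpy()[rows, cols]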
Update: Excuse my SQL skill if this doesn't run (it should though)
select
    b.*,
    c.sub_att_value_count
from
    (select
        df1.*,
        a.att_value_count
     from
        (select
            region, count(*) as att_value_count
         from df2
         group by region
        ) as a
        full outer join df1
            on df1.att_value = a.region
    ) as b
    full outer join
    (select
        age, gen, count(*) as sub_att_value_count
     from df2
     group by age, gen
    ) as c
        on b.att_value = c.age and b.sub_att_value = c.gen
