The following is my data:
name id junk date time value value2
abc 1 1 1/1/2017 18:07:54 5 10
abc 1 2 1/1/2017 19:07:54 10 15
abc 2 3 2/1/2017 20:07:54 15 20
abc 2 4 2/1/2017 21:07:54 20 25
def 3 5 3/1/2017 22:07:54 25 30
def 3 6 3/1/2017 23:07:54 30 35
def 4 7 4/1/2017 12:07:54 35 40
def 4 8 4/1/2017 13:07:54 40 45
I want to remove duplicates based on three columns (name, id and date) and keep the first value. I tried the following command:
data.drop_duplicates(subset=['name', 'id', 'date'], keep='first')
I also want to group by these three columns and take the sum of the value and value2 columns, which I tried with:
data[['name', 'id', 'date', 'value']].groupby(['name', 'id', 'date']).sum()
data[['name', 'id', 'date', 'value2']].groupby(['name', 'id', 'date']).sum()
Now I would need to join all three data frames and pick out the columns I need, but I am thinking there should be a better way to do this. The following is the output I am looking for:
name id junk date time value value2
abc 1 1 1/1/2017 18:07:54 15 25
abc 2 3 2/1/2017 20:07:54 35 45
def 3 5 3/1/2017 22:07:54 55 65
def 4 7 4/1/2017 12:07:54 75 85
That is, I want to drop duplicates based on the name, id and date columns, take the first value of the junk and time columns, and sum the value and value2 columns.
Can anybody help me with this?
You need groupby with agg:
df = (df.groupby(['name', 'id', 'date'])
        .agg({'value':'sum', 'value2':'sum', 'time':'first', 'junk':'first'})
        .reset_index())
print (df)
name id date value2 time junk value
0 abc 1 1/1/2017 25 18:07:54 1 15
1 abc 2 2/1/2017 45 20:07:54 3 35
2 def 3 3/1/2017 65 22:07:54 5 55
3 def 4 4/1/2017 85 12:07:54 7 75
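For reference, here is a runnable version of this solution, rebuilding the sample frame from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'name': ['abc'] * 4 + ['def'] * 4,
    'id': [1, 1, 2, 2, 3, 3, 4, 4],
    'junk': [1, 2, 3, 4, 5, 6, 7, 8],
    'date': ['1/1/2017', '1/1/2017', '2/1/2017', '2/1/2017',
             '3/1/2017', '3/1/2017', '4/1/2017', '4/1/2017'],
    'time': ['18:07:54', '19:07:54', '20:07:54', '21:07:54',
             '22:07:54', '23:07:54', '12:07:54', '13:07:54'],
    'value': [5, 10, 15, 20, 25, 30, 35, 40],
    'value2': [10, 15, 20, 25, 30, 35, 40, 45],
})

# Sum value/value2 per group, keep the first junk/time
out = (df.groupby(['name', 'id', 'date'], as_index=False)
         .agg({'value': 'sum', 'value2': 'sum',
               'time': 'first', 'junk': 'first'}))
print(out)
```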
Dynamic solution:
g_cols = ['name','id','date']
sum_cols = ['value','value2']
#select the columns that are neither group keys nor summed
cols = df.columns[~df.columns.isin(sum_cols + g_cols)]
print (cols)
Index(['junk', 'time'], dtype='object')
#dict comprehension for sum columns
d_sum = {col:'sum' for col in sum_cols}
#dict comprehension for first columns
d = {col:'first' for col in cols}
#add dicts together
d.update(d_sum)
print (d)
{'value2': 'sum', 'time': 'first', 'junk': 'first', 'value': 'sum'}
df = df.groupby(g_cols).agg(d).reset_index()
print (df)
name id date value2 time junk value
0 abc 1 1/1/2017 25 18:07:54 1 15
1 abc 2 2/1/2017 45 20:07:54 3 35
2 def 3 3/1/2017 65 22:07:54 5 55
3 def 4 4/1/2017 85 12:07:54 7 75
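A self-contained sketch of the dynamic solution, using two sample rows from the question's data:

```python
import pandas as pd

# Two sample rows from the question's data
df = pd.DataFrame({'name': ['abc', 'abc'], 'id': [1, 1],
                   'junk': [1, 2], 'date': ['1/1/2017', '1/1/2017'],
                   'time': ['18:07:54', '19:07:54'],
                   'value': [5, 10], 'value2': [10, 15]})

g_cols = ['name', 'id', 'date']
sum_cols = ['value', 'value2']

# Columns that are neither group keys nor summed get 'first'
first_cols = df.columns[~df.columns.isin(sum_cols + g_cols)]
d = {c: 'first' for c in first_cols}
d.update({c: 'sum' for c in sum_cols})

out = df.groupby(g_cols).agg(d).reset_index()
print(out)
```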
Related
I have a dataframe sorted by date and time:
ID Date Time A B C
abc 06/Feb 11 12 12 10
abc 06/Feb 12 14 13 5
xyz 07/Feb 1 16 14 50
xyz 07/Feb 2 18 15 0
xyz 07/Feb 3 20 16 10
I want to group it by ID and Date and get the sum as a numerator (Num) and the count as a denominator (Denom), where for each subsequent date the sum is the cumulative sum over the previous dates and the count is likewise cumulative. I also want three more columns holding the last value of the A, B and C columns, like this:
ID Date A_Num A_denom B_Num B_Denom C_Num C_Denom A_Last B_Last C_Last
abc 06/Feb 26 2 25 2 15 2 14 13 5
xyz 07/Feb 54 3 45 3 60 3 20 16 10
I am not able to perform all of this in one go. Can anyone help me with this? Thanks in advance.
Now I want to add df2 to df1 according to ID, as:
ID Date A_Num A_denom B_Num B_Denom C_Num C_Denom A_Last B_Last C_Last
abc 06/Feb 52 4 50 4 30 4 14 13 5
xyz 07/Feb 108 6 90 6 120 6 20 16 10
You can aggregate sum and size per group with GroupBy.agg and take the cumulative sum across groups, then use concat to append another DataFrame created with GroupBy.last:
cols = ['A','B','C']
print (df[cols].dtypes)
A int64
B int64
C int64
dtype: object
d = {'sum':'Num','size':'denom'}
df1 = df.groupby(['ID','Date'])[cols].agg(['sum','size']).rename(columns=d).cumsum()
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df2 = df.groupby(['ID','Date'])[cols].last().add_suffix('_Last')
df3 = pd.concat([df1, df2], axis=1).reset_index()
print (df3)
ID Date A_Num A_denom B_Num B_denom C_Num C_denom A_Last \
0 abc 06/Feb 26 2 25 2 15 2 14
1 xyz 07/Feb 80 5 70 5 75 5 20
B_Last C_Last
0 13 5
1 16 10
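For reference, the steps above as a self-contained script, rebuilding the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['abc', 'abc', 'xyz', 'xyz', 'xyz'],
    'Date': ['06/Feb', '06/Feb', '07/Feb', '07/Feb', '07/Feb'],
    'Time': [11, 12, 1, 2, 3],
    'A': [12, 14, 16, 18, 20],
    'B': [12, 13, 14, 15, 16],
    'C': [10, 5, 50, 0, 10],
})

cols = ['A', 'B', 'C']
d = {'sum': 'Num', 'size': 'denom'}

# Per-group sum and count, made cumulative across groups
df1 = (df.groupby(['ID', 'Date'])[cols]
         .agg(['sum', 'size'])
         .rename(columns=d)
         .cumsum())
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')

# Last value of each column per group
df2 = df.groupby(['ID', 'Date'])[cols].last().add_suffix('_Last')

df3 = pd.concat([df1, df2], axis=1).reset_index()
print(df3)
```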
To write the result to a file without the index, use:
df3.to_csv('file', index=False)
If there is no .reset_index in the solution, use:
df3.to_csv('file')
I have a dataframe :
Id age number
1 35 7
5 76 9
1 15 0
2 10 4
5 43 8
What I need to get is:
Id age number freq
1 35 7 2
5 76 9 2
1 15 0 1
2 10 4 1
5 43 8 1
Add a new column freq: for each row, count the rows with the same Id value from that row onward (a descending counter within each Id group).
If you need a counter in descending order, use GroupBy.cumcount:
df['freq'] = df.groupby('Id').cumcount(ascending=False).add(1)
But if you need the total count of rows per Id, use GroupBy.transform with size; the output is different:
df['freq'] = df.groupby('Id')['Id'].transform('size')
Alternative with Series.map and Series.value_counts:
df['freq'] = df['Id'].map(df['Id'].value_counts())
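A self-contained comparison of the two interpretations on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Id': [1, 5, 1, 2, 5],
                   'age': [35, 76, 15, 10, 43],
                   'number': [7, 9, 0, 4, 8]})

# Descending counter within each Id group (matches the desired output)
df['freq'] = df.groupby('Id').cumcount(ascending=False).add(1)

# Total number of rows per Id, broadcast back to every row
df['size'] = df.groupby('Id')['Id'].transform('size')

print(df['freq'].tolist())  # [2, 2, 1, 1, 1]
print(df['size'].tolist())  # [2, 2, 2, 1, 2]
```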
There are 2 dataframes with 1 to 1 correspondence. I can retrieve an idxmax from all columns in df1.
Input:
df1 = pd.DataFrame({'ref':[2,4,6,8,10,12,14],'value1':[76,23,43,34,0,78,34],'value2':[1,45,8,0,76,45,56]})
df2 = pd.DataFrame({'ref':[2,4,6,8,10,12,14],'value1_pair':[0,0,0,0,180,180,90],'value2_pair':[0,0,0,0,90,180,90]})
df=df1.loc[df1.iloc[:,1:].idxmax(), 'ref']
Output: df1, df2 and df
ref value1 value2
0 2 76 1
1 4 23 45
2 6 43 8
3 8 34 0
4 10 0 76
5 12 78 45
6 14 34 56
ref value1_pair value2_pair
0 2 0 0
1 4 0 0
2 6 0 0
3 8 0 0
4 10 180 90
5 12 180 180
6 14 90 90
5 12
4 10
Name: ref, dtype: int64
Now I want to create a df which contains 3 columns
Desired Output df:
ref max value corresponding value
12 78 180
10 76 90
What are the best options to extract the corresponding values from df2?
Your main problem is matching the columns between df1 and df2. Let's rename them properly, melt both dataframes, merge and extract:
(df1.melt('ref')
.merge(df2.rename(columns={'value1_pair':'value1',
'value2_pair':'value2'})
.melt('ref'),
on=['ref', 'variable'])
.sort_values('value_x')
.groupby('variable').last()
)
Output:
ref value_x value_y
variable
value1 12 78 180
value2 10 76 90
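For reference, the snippet above as a runnable script, rebuilding the input frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'ref': [2, 4, 6, 8, 10, 12, 14],
                    'value1': [76, 23, 43, 34, 0, 78, 34],
                    'value2': [1, 45, 8, 0, 76, 45, 56]})
df2 = pd.DataFrame({'ref': [2, 4, 6, 8, 10, 12, 14],
                    'value1_pair': [0, 0, 0, 0, 180, 180, 90],
                    'value2_pair': [0, 0, 0, 0, 90, 180, 90]})

# Melt both frames long, merge on ref + variable, then keep the
# row with the maximum value per variable (sort + last)
out = (df1.melt('ref')
          .merge(df2.rename(columns={'value1_pair': 'value1',
                                     'value2_pair': 'value2'})
                    .melt('ref'),
                 on=['ref', 'variable'])
          .sort_values('value_x')
          .groupby('variable')
          .last())
print(out)
```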
I have 2 dataframes, and I want to take the values from df2 and fill the columns of df1 wherever df1.id = df2.id and df1.name = df2.name.
df1:
id name price_1 price_2 price_3 price_4
1 a
2 b
df2:
id name price_1 price_2 price_3 price_4
1 a 10 11 12 11
2 b 11 44 22 55
3 c 76 56 45 34
output:
id name price_1 price_2 price_3 price_4
1 a 10 11 12 11
2 b 11 44 22 55
If all id, name combinations from df1 are present in df2 (as in the provided example), just take a subset of df2 with a match in df1:
df2.merge(df1[['id', 'name']])
Output:
id name price_1 price_2 price_3 price_4
0 1 a 10 11 12 11
1 2 b 11 44 22 55
You can use merge:
df1[['id','name']].merge(df2, on=['id','name'], how='left')
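Here is a runnable sketch of the inner-merge approach, rebuilding the frames from the question (the empty price columns of df1 are omitted, since only its key columns are used):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']})
df2 = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c'],
                    'price_1': [10, 11, 76], 'price_2': [11, 44, 56],
                    'price_3': [12, 22, 45], 'price_4': [11, 55, 34]})

# Inner merge on the shared columns keeps only the (id, name)
# pairs that are present in df1
out = df2.merge(df1[['id', 'name']])
print(out)
```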
I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second one, smaller like this:
df2
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: df1.C = df2.H where df1.A == df2.G.
I managed to do it with for loops, but the dataframe is massive and the code runs really slowly, so I am looking for a pandas or NumPy way to do it.
Many thanks,
Boris
If you only want to match mutual rows in both dataframes:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
Name Special ability
0 Sara Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
Name Age Special ability
0 Sara 4 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
This can also be done with more than one matching key. (In this example neither Sara nor Patrik from df1 matches a row in df2, because the ages differ, so no row merges and Special ability is NaN throughout.)
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[12,83]})
df1
Name Special ability Age
0 Sara Walk on water 12
1 Patrik FireBalls 83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, on=['Name','Age'], how='left')
df
Name Age Special ability
0 Sara 4 NaN
1 Gustaf 12 NaN
2 Patrik 11 NaN
You probably want to use a merge:
df = df1.merge(df2, left_on="A", right_on="G")
This gives you a dataframe with four columns, since both key columns A and G are kept. Drop the duplicate key, then rename:
df = df.drop(columns="G")
df.columns = ["A", "B", "C"]
You can use map by Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Or merge with drop and rename:
df = (df1.merge(df2, left_on="A", right_on="G", how='left')
         .drop('G', axis=1)
         .rename(columns={'H':'C'}))
print (df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Here's one vectorized NumPy approach -
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could be computed more simply with df2.G.searchsorted(df1.A), but that is unlikely to be any faster, because we want to work on the underlying arrays via .values for performance, as done above. Note that searchsorted assumes df2.G is sorted in ascending order.
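A runnable sketch of the searchsorted approach on a reduced version of the question's data; it assumes df2.G is sorted ascending and that every value of df1.A actually occurs in df2.G:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [0, 0, 1, 1, 2, 2],
                    'B': [12, 15, 45, 78, 45, 44]})
df2 = pd.DataFrame({'G': [0, 1, 2], 'H': [15, 45, 31]})

# For each value in df1.A, find its position in the sorted df2.G,
# then pull the matching H value by position
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
print(df1['C'].tolist())  # [15, 15, 45, 45, 31, 31]
```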