Dropping Columns which have same values but different names - python

I am currently merging two data frames where some of the columns of the two dataframes are the same, but not all.
df = pd.merge(df_1, df_2, how='inner', on='name' )
This returns:
index name val1_x val2_x val1_y val2_y
0 name1 1 2 1 3
2 name2 12 14 12 34
3 name3 14 3 14 96
But I would like:
index name val1_x val2_x val2_y
0 name1 1 2 3
2 name2 12 14 34
3 name3 14 3 96
How could you get this result? Either with the merge command or after?
--------- Extension - outer merge -------------
With an inner merge,
df = pd.merge(df_1, df_2, how='inner', on='name').T.drop_duplicates().T
works as suggested in the solutions.
However with an outer merge
df = pd.merge(df_1, df_2, how='outer', on='name' )
It does not work since there are NaN values. It returns:
index name val1_x val2_x val1_y val2_y
0 name1 1 2 NaN 3
2 name2 12 14 12 34
3 name3 14 3 14 96
But I would like:
index name val1_x val2_x val2_y
0 name1 1 2 3
2 name2 12 14 34
3 name3 14 3 96
How can one achieve this?

Use drop_duplicates
df = pd.merge(df_1, df_2, how='inner', on='name' ).T.drop_duplicates().T
index name val1_x val2_x val2_y
0 0 name1 1 2 3
1 2 name2 12 14 34
2 3 name3 14 3 96

This is a complicated aggregation, so you can write your own function to resolve the groups. This method will only work to resolve numeric data (datetime and bool also work). With strings, you'll need to fall back to a much slower nunique call over the rows, as sketched after the example below.
For each group, we check whether the columns are completely duplicated (using np.unique after filling the NaNs) and then either return the original group or the deduplicated column.
Starting Data
index name val1_x val2_x val1_y val2_y
0 0 name1 1 2 NaN 3
1 2 name2 12 14 12.0 34
2 3 name3 14 3 14.0 96
Code
import numpy as np
import pandas as pd

l = []
for idx, gp in df.groupby(df.columns.str.split('_').str[0], axis=1):
    if any(gp.dtypes == 'O') or (gp.shape[1] == 1):  # Can't/don't resolve these types
        l.append(gp)
    else:
        arr = np.unique(gp.ffill(axis=1).bfill(axis=1).to_numpy(), axis=1)
        if arr.shape[1] == 1:  # Columns are duplicates; keep a single deduplicated column
            l.append(pd.DataFrame(index=gp.index, columns=[idx], data=arr))
        else:
            l.append(gp)

df = pd.concat(l, axis=1)
index name val1 val2_x val2_y
0 0 name1 1.0 2 3
1 2 name2 12.0 14 34
2 3 name3 14.0 3 96

Related

How to compare 2 non-identical dataframes in python

I have two dataframes with the same column order but different column names and different rows. df2 rows vary from df1 rows.
df1= col_id num name
0 1 3 linda
1 2 4 James
df2= id no name
0 1 2 granpa
1 2 6 linda
2 3 7 sam
This is the output I need. It outputs rows with the same, OLD, and NEW values so the user can clearly see what changed between the two dataframes:
result col_id num name
0 1 was 3| now 2 was linda| now granpa
1 2 was 4| now 6 was James| now linda
2 was | now 3 was | now 7 was | now sam
Since your goal is just to compare differences, use DataFrame.compare instead of aggregating into strings.
However,
DataFrame.compare can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames
So we just need to align the row/column indexes, either via merge or reindex.
Align via merge
Outer-merge the two dfs:
merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')
# col_id num name_x id no name_y
# 0 1 3 linda 1 2 granpa
# 1 2 4 james 2 6 linda
# 2 NaN NaN NaN 3 7 sam
Divide the merged frame into left/right frames and align their columns with set_axis:
cols = df1.columns
left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
# col_id num name
# 0 1 3 linda
# 1 2 4 james
# 2 NaN NaN NaN
right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)
# col_id num name
# 0 1 2 granpa
# 1 2 6 linda
# 2 3 7 sam
Compare the aligned left/right frames (use keep_equal=True to show equal cells):
left.compare(right, keep_shape=True, keep_equal=True)
# col_id num name
# self other self other self other
# 0 1 1 3 2 linda granpa
# 1 2 2 4 6 james linda
# 2 NaN 3 NaN 7 NaN sam
left.compare(right, keep_shape=True)
# col_id num name
# self other self other self other
# 0 NaN NaN 3 2 linda granpa
# 1 NaN NaN 4 6 james linda
# 2 NaN 3 NaN 7 NaN sam
Align via reindex
If you are 100% sure that one df is a subset of the other, then reindex the subsetted rows.
In your example, df1 is a subset of df2, so reindex df1:
(df1.assign(id=df1.col_id)           # copy col_id (we need the original col_id after reindexing)
    .set_index('id')                 # set index to the copied id
    .reindex(df2.id)                 # reindex against df2's id
    .reset_index(drop=True)          # remove the copied id
    .set_axis(df2.columns, axis=1)   # align column names
    .compare(df2, keep_equal=True, keep_shape=True))
# col_id num name
# self other self other self other
# 0 1 1 3 2 linda granpa
# 1 2 2 4 6 james linda
# 2 NaN 3 NaN 7 NaN sam
Nullable integers
Normally int cannot mix with NaN, so pandas converts to float. To keep the int values as int (like in the examples above):
Ideally we'd convert the int columns to nullable integers with astype('Int64') (capital I).
However, there is currently a comparison bug with Int64, so just use astype(object) for now.
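A minimal sketch of that workaround, casting the question's integer columns to object before the merge/reindex step:

# Cast int columns to object so the outer merge's NaNs don't upcast them to float
df1[['col_id', 'num']] = df1[['col_id', 'num']].astype(object)
df2[['id', 'no']] = df2[['id', 'no']].astype(object)
# The align-and-compare steps above then show 3, 2, 7, ... instead of 3.0, 2.0, 7.0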
If I understand correctly, you want something like this:
new_df = df1.drop(['name', 'num'], axis=1).merge(df2.rename({'id': 'col_id'}, axis=1), how='outer')
Output:
>>> new_df
col_id no name
0 1 2 granpa
1 2 6 linda
2 3 7 sam

how to concat specific rows through a pandas dataframe

So, I have this situation: there is a dataframe like this:
Number   Description
10001    name 2
1002     name2(pt1)
NaN      name2(pt2)
1003     name3
1004     name4(pt1)
NaN      name4(pt2)
1005     name5
I need to concatenate the name (part 1 and part 2) into just one field and then drop the NaN rows, but I have no clue how to do this because the rows do not follow a specific interval pattern.
Try groupby and aggregate on a series built from the notna 'Number' values.
The groups are created from:
df['Number'].notna().cumsum()
0 1
1 2
2 2
3 3
4 4
5 4
6 5
Name: Number, dtype: int32
Then aggregate taking the 'first' Number (since first value is guaranteed to be notna) and doing some operation to combine Descriptions like join:
new_df = (
    df.groupby(df['Number'].notna().cumsum(), as_index=False)
      .aggregate({'Number': 'first', 'Description': ''.join})
)
new_df:
Number Description
0 10001.0 name 2
1 1002.0 name2(pt1)name2(pt2)
2 1003.0 name3
3 1004.0 name4(pt1)name4(pt2)
4 1005.0 name5
Or comma separated join:
new_df = (
    df.groupby(df['Number'].notna().cumsum(), as_index=False)
      .aggregate({'Number': 'first', 'Description': ','.join})
)
new_df:
Number Description
0 10001.0 name 2
1 1002.0 name2(pt1),name2(pt2)
2 1003.0 name3
3 1004.0 name4(pt1),name4(pt2)
4 1005.0 name5
Or as list:
new_df = (
    df.groupby(df['Number'].notna().cumsum(), as_index=False)
      .aggregate({'Number': 'first', 'Description': list})
)
new_df:
Number Description
0 10001.0 [name 2]
1 1002.0 [name2(pt1), name2(pt2)]
2 1003.0 [name3]
3 1004.0 [name4(pt1), name4(pt2)]
4 1005.0 [name5]

Extract corresponding df value with reference from another df

There are 2 dataframes with a 1-to-1 row correspondence. I can retrieve the idxmax from all value columns in df1.
Input:
df1 = pd.DataFrame({'ref':[2,4,6,8,10,12,14],'value1':[76,23,43,34,0,78,34],'value2':[1,45,8,0,76,45,56]})
df2 = pd.DataFrame({'ref':[2,4,6,8,10,12,14],'value1_pair':[0,0,0,0,180,180,90],'value2_pair':[0,0,0,0,90,180,90]})
df=df1.loc[df1.iloc[:,1:].idxmax(), 'ref']
Output: df1, df2 and df
ref value1 value2
0 2 76 1
1 4 23 45
2 6 43 8
3 8 34 0
4 10 0 76
5 12 78 45
6 14 34 56
ref value1_pair value2_pair
0 2 0 0
1 4 0 0
2 6 0 0
3 8 0 0
4 10 180 90
5 12 180 180
6 14 90 90
5 12
4 10
Name: ref, dtype: int64
Now I want to create a df which contains 3 columns
Desired Output df:
ref max value corresponding value
12 78 180
10 76 90
What are the best options to extract the corresponding values from df2?
Your main problem is matching the columns between df1 and df2. Let's rename them properly, melt both dataframes, merge and extract:
(df1.melt('ref')
    .merge(df2.rename(columns={'value1_pair': 'value1',
                               'value2_pair': 'value2'})
              .melt('ref'),
           on=['ref', 'variable'])
    .sort_values('value_x')
    .groupby('variable').last()
)
Output:
ref value_x value_y
variable
value1 12 78 180
value2 10 76 90
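If the stated 1-to-1 row correspondence can be relied on, a positional lookup is another possible route (a sketch, not part of the answer above): reuse the idxmax row labels to pull the matching '_pair' cells directly out of df2.

idx = df1.iloc[:, 1:].idxmax()  # e.g. value1 -> 5, value2 -> 4
out = pd.DataFrame({
    'ref': df1.loc[idx, 'ref'].to_numpy(),
    'max value': [df1.at[i, c] for c, i in idx.items()],
    'corresponding value': [df2.at[i, c + '_pair'] for c, i in idx.items()],
}, index=idx.index)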

python pandas : lambda or other method to count NaN values / len(value)<1 along rows

Derive a new pandas column based on the length of strings in other columns.
I want to count the number of columns which have a value in each row and create a new column with that count. For example, if I have 6 columns and two of the columns starting with 'a' have a value in a given row, then the new column for that row will have the value 2.
df = pd.DataFrame({'ID':['1','2','3'],'ID2':['11','12','13'], 'J1': ['a','ab',''],'J2':['22','','33'],'a1': ['a11','','ab1'],'a2':['22','1','33']})
print(df)
The output should be like:
ID J1 J2 a1 a2 Count_J_cols_have_values count_a_cols_have_values
0 1 a 22 a11 22 2 2
1 2 ab 1 1 1
2 3 33 ab1 33 1 2
Use DataFrame.filter with Series.ne and Series.sum as:
df['Count_J_cols_have_values'] = df.filter(regex='^J').ne('').sum(1)
df['count_a_cols_have_values'] = df.filter(regex='^a').ne('').sum(1)
print(df)
ID ID2 J1 J2 a1 a2 Count_J_cols_have_values count_a_cols_have_values
0 1 11 a 22 a11 22 2 2
1 2 12 ab 1 1 1
2 3 13 33 ab1 33 1 2
Or use filter, replace and count:
import numpy as np

df['Count_J_cols_have_values'] = df.filter(regex='^J').replace('', np.nan).count(1)
df['count_a_cols_have_values'] = df.filter(regex='^a').replace('', np.nan).count(1)
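The title also mentions a lambda / string-length route; a rough equivalent (a sketch using the same column prefixes) counts the cells whose string length is at least 1:

df['Count_J_cols_have_values'] = df.filter(regex='^J').apply(lambda s: s.str.len().ge(1)).sum(axis=1)
df['count_a_cols_have_values'] = df.filter(regex='^a').apply(lambda s: s.str.len().ge(1)).sum(axis=1)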

Compare two pandas dataframe with different size

I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second one, smaller like this:
df2
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G
I managed to do it with for loops, but the dataframe is massive and the code runs really slowly, so I am looking for a pandas or numpy way to do it.
Many thanks,
Boris
If you only want to match mutual rows in both dataframes:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
Name Special ability
0 Sara Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
Name Age Special ability
0 Sara 4 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
This can also be done with more than one matching argument. (In this example Patrik from df1 does not match df2 because the two frames give him different ages, and therefore he will not merge.)
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[12,83]})
df1
Name Special ability Age
0 Sara Walk on water 12
1 Patrik FireBalls 83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[12,12,11]})
df2
Name Age
0 Sara 12
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1,left_on=['Name','Age'],right_on=['Name','Age'],how='left')
df
Name Age Special ability
0 Sara 12 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
You probably want to use a merge:
df = df1.merge(df2, left_on="A", right_on="G").drop(columns="G")
will give you a dataframe with 3 columns, but the third one's name will be H;
df.columns = ["A", "B", "C"]
will then give you the column names you want.
You can use map by Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Or merge with drop and rename:
df = (df1.merge(df2, left_on="A", right_on="G", how='left')
         .drop('G', axis=1)
         .rename(columns={'H': 'C'}))
print (df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Here's one vectorized NumPy approach -
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could be computed in a simpler way with df2.G.searchsorted(df1.A), but I don't think that would be any more efficient, because we want to work with the underlying arrays via .values for performance, as done earlier.
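Note that searchsorted relies on df2.G being sorted and on every df1.A value actually occurring in it. If the ordering is not guaranteed, a sorted view can be built first; a small sketch:

import numpy as np

order = np.argsort(df2.G.values)                          # positions that sort G ascending
pos = np.searchsorted(df2.G.values[order], df1.A.values)  # look up against the sorted view
df1['C'] = df2.H.values[order[pos]]                       # map back to df2's original row order
# Still assumes every value in df1.A occurs somewhere in df2.G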
