I did a pandas merge and now have two columns, col_x and col_y. I'd like to fill values in col_x from col_y, but only for rows where col_y is not NaN (i.e. actually has a value). In other words, I want to keep the original values in col_x and only replace them from col_y where col_x is NaN.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'i': [0, 1, 2, 3],
    'c': [np.nan, {'a': 'A'}, np.nan, {'b': 'B'}],
    'd': [{'c': 'C'}, np.nan, {'d': 'D'}, np.nan]
})
Expected output:
i           c           d
0  {'c': 'C'}  {'c': 'C'}
1  {'a': 'A'}         NaN
2  {'d': 'D'}  {'d': 'D'}
3  {'b': 'B'}         NaN
Are you just trying to fillna?
df.c.fillna(df.d, inplace=True)
You can use np.where()
So something like
df['c'] = np.where(df['c'].isna(), df['d'], df['c'])
should do the trick! The first parameter is the condition to check, the second is what to return if the condition is true, and the third is what to return if the condition is false.
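A self-contained sketch of that approach on the frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'i': [0, 1, 2, 3],
    'c': [np.nan, {'a': 'A'}, np.nan, {'b': 'B'}],
    'd': [{'c': 'C'}, np.nan, {'d': 'D'}, np.nan]
})

# Where 'c' is NaN take the value from 'd', otherwise keep 'c'
df['c'] = np.where(df['c'].isna(), df['d'], df['c'])
print(df)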
Try:
df["c"] = [y if str(x) == "nan" else x for x,y in zip(df.c,df.d)]
There's probably a cleaner way, but this is a one-liner.
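If you prefer not to compare against the string "nan", a variant of the same comprehension using pd.isna() should also work (a small sketch, untested beyond the example above):
# Test for NaN with pd.isna() instead of str(x) == "nan"
df["c"] = [y if pd.isna(x) else x for x, y in zip(df.c, df.d)]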
I am looking for an iterative way to append values from a maptable to my df of reference. In a simple form, this can be achieved using a left merge.
The challenges that I am facing are:
The new column might already be present in my df (but contain empty values)
There might be duplicate indices (e.g. id1 contains 1 twice and 2 twice)
In each iteration, I only want to fill the empty values in df (column col_to_map)
To give you an example:
import numpy as np
import pandas as pd

df_dict = {
    'id1': [1, 2, 3, None, 1, 2],
    'id2': ['a', 'b', 'c', 'd', None, None],
    'val1': [None, None, None, None, None, None],
    'val2': ['21a', None, None, None, None, None]
}
map_dict = {
    'id1': [5, 4, 3, 5, np.nan, np.nan],
    'id2': ['e', 'd', None, None, 'b', None],
    'val1': ['15e', '14d', '13c', '15e', '12b', 'x1'],
    'val2': ['25e', '24d', None, None, None, 'x2'],
    'val3': ['35e', '34d', '33c', None, '32b', None],
}
df = pd.DataFrame.from_dict(df_dict)
maptable = pd.DataFrame.from_dict(map_dict)
for id_col in ['id1', 'id2']:
    for col_to_map in ['val1', 'val2', 'val3']:
        print(f'map {col_to_map} using id {id_col}')
        # logic here to append only the non-empty values
        df = map_iteration(df=df, maptable=maptable, id_col=id_col, col_to_map=col_to_map)
WHAT I TRIED AND WHERE I AM STUCK
I tried the following for map_iteration(), but I receive the error "ValueError: cannot reindex from a duplicate axis":
def map_iteration(df, maptable, id_col, col_to_map):
    """Map empty values in df col_to_map using the maptable and on the id_col"""
    # Add column to df
    if col_to_map not in df:
        df.insert(len(df.columns), column=col_to_map, value=None)

    # Take out invalid ids in maptable
    maptable = maptable[~maptable[id_col].isnull() & ~maptable[id_col].isna()]
    maptable = maptable[~maptable[id_col].duplicated(keep='first')]

    # Target rows
    elig_rows = df.loc[:, col_to_map].isnull() & ~df.loc[:, id_col].isnull() & ~df.loc[:, id_col].isna()

    # To string ids
    df.loc[:, id_col] = df.loc[:, id_col].astype(str).str.strip()
    maptable.loc[:, id_col] = maptable.loc[:, id_col].astype(str).str.strip()

    # Strip maptable
    m = maptable.loc[:, [id_col, col_to_map]]

    # Merge
    df_left = df.loc[elig_rows, [id_col]].merge(m, how='left', on=id_col)

    # Indexing
    df_left = df_left.set_index(id_col)
    df = df.set_index(id_col)

    # Assign
    df.loc[df.index[elig_rows], col_to_map] = df_left.loc[:, col_to_map]

    # Drop index
    df = df.reset_index()
    return df
I came up with a solution:
for x in df.columns.values:
    for ix, val in enumerate(df[x].values):
        if not df[x].values[ix]:
            df.at[ix, x] = maptable[x].values[ix]
returns:
id1 id2 val1 val2
0 1.0 a 15e 21a
1 2.0 b 14d 24d
2 3.0 c 13c None
3 5.0 d 15e None
Still not the most elegant solution though; someone might come up with a list comprehension that loops over two columns of df simultaneously, etc.
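For what it's worth, such a per-column comprehension might look roughly like this (a sketch only; like the loop above, it assumes df and maptable share the same row order and column names):
# Fill empty (None/NaN) cells of df positionally from maptable, column by column
for col in df.columns:
    df[col] = [m if pd.isnull(v) else v for v, m in zip(df[col], maptable[col])]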
Well, if this is too cumbersome (and yes, it is), one solution may be to find all null values in your df and fill them with the exact correspondences from the map table,
such as:
import numpy as np

# Find null cells in df
nulls = np.array(df.isnull())

# Find the row and column indexes of these null cells
row_ind, col_ind = np.where(nulls)

# Fill them from maptable
df_cols = df.columns.values
for ix, col in enumerate(col_ind):
    df.at[row_ind[ix], df_cols[col]] = maptable[df_cols[col]].values[row_ind[ix]]
still returns:
id1 id2 val1 val2
0 1.0 a 15e 21a
1 2.0 b 14d 24d
2 3.0 c 13c None
3 5.0 d 15e None
I have two dataframes A and B with common indexes. These common indexes can appear several times (duplicates) in both A and B.
I want to merge A and B in these 3 ways:
Case 0: If index i appears once in A (i1) and once in B (i1), I want my merged-by-index dataframe to add the row A(i1), B(i1).
Case 1: If index i appears once in A (i1) and twice in B, in this order: (i1, i2), I want my merged-by-index dataframe to add the rows A(i1), B(i1) and A(i1), B(i2).
Case 2: If index i appears twice in A, in this order: (i1, i2), and twice in B, in this order: (i1, i2), I want my merged-by-index dataframe to add the rows A(i1), B(i1) and A(i2), B(i2).
These 3 cases are all of the possible cases that can appear in my data.
When using pandas.merge, case 0 and case 1 work. But for case 2, the returned dataframe adds the rows A(i1), B(i1) and A(i1), B(i2) and A(i2), B(i1) and A(i2), B(i2) instead of A(i1), B(i1) and A(i2), B(i2).
I could use pandas.merge and then delete the undesired merged rows, but is there a way to handle these 3 cases at the same time?
A = pd.DataFrame([[1, 2], [4, 2], [5,5], [5,5], [1,1]], index=['a', 'a', 'b', 'c', 'c'])
B = pd.DataFrame([[1, 5], [4, 8], [7,7], [5,5]], index=['b', 'c', 'a', 'a'])
pd.merge(A,B, left_index=True, right_index=True, how='inner')
For example, in the dataframe above, I want exactly that, but without the second and third rows for index 'a'.
Basically, your 3 cases can be summarized into 2 cases:
Index i occurs the same number of times (1 or 2) in A and B: merge according to the order.
Index i occurs 2 times in A and 1 time in B: merge using the B content for all rows.
Prep code:
def add_secondary_index(df):
    df.index.name = 'Old'
    df['Order'] = df.groupby(df.index).cumcount()
    df.set_index('Order', append=True, inplace=True)
    return df
import pandas as pd
A = pd.DataFrame([[1, 2], [4, 2], [5,5], [5,5], [1,1]], index=['a', 'a', 'b', 'c', 'c'])
B = pd.DataFrame([[1, 5], [4, 8], [7,7], [5,5]], index=['b', 'c', 'a', 'a'])
index_times = A.groupby(A.index).count() == B.groupby(B.index).count()
Case 1 is easy to solve: you just need to add the secondary index:
same_times_index = index_times[index_times[0].values].index
A_same = A.loc[same_times_index].copy()
B_same = B.loc[same_times_index].copy()
add_secondary_index(A_same)
add_secondary_index(B_same)
result_merge_same = pd.merge(A_same,B_same,left_index=True,right_index=True)
For case 2, you need to consider it separately:
not_same_times_index = index_times[~index_times.index.isin(same_times_index)].index
A_notsame = A.loc[not_same_times_index].copy()
B_notsame = B.loc[not_same_times_index].copy()
result_merge_notsame = pd.merge(A_notsame,B_notsame,left_index=True,right_index=True)
You can then decide whether to add the secondary index to result_merge_notsame, or drop it from result_merge_same.
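If you want a single frame at the end, one possibility (a sketch, reusing the variable names above) is to drop the secondary index from result_merge_same and concatenate the two partial results:
# Drop the secondary 'Order' level so both partial results share a plain index,
# then stack them back together
result_merge_same = result_merge_same.reset_index(level='Order', drop=True)
combined = pd.concat([result_merge_same, result_merge_notsame]).sort_index()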
Current Pandas DataFrame
import numpy as np
import pandas as pd

fn1 = pd.DataFrame([['A', np.nan, np.nan, 9, 6],
                    ['B', np.nan, 2, np.nan, 7],
                    ['C', 3, 2, np.nan, 10],
                    ['D', np.nan, 7, np.nan, np.nan],
                    ['E', np.nan, np.nan, 3, 3],
                    ['F', np.nan, np.nan, 7, np.nan]],
                   columns=['Symbol', 'Condition1', 'Condition2', 'Condition3', 'Condition4'])
fn1.set_index('Symbol', inplace=True)
Condition1 Condition2 Condition3 Condition4
Symbol
A NaN NaN 9 6
B NaN 2 NaN 7
C 3 2 NaN 10
D NaN 7 NaN NaN
E NaN NaN 3 3
F NaN NaN 7 NaN
I'm currently working with a Pandas DataFrame that looks like the one above. I'm trying to go column by column and substitute values that are not NaN with the 'Symbol' associated with that row, then collapse each column (or write to a new DataFrame) so that each column is a list of the 'Symbol's that were present for each 'Condition', as shown in the desired output:
Desired Output
I've been able to get the 'Symbols' that were present for each condition into a list of lists (see below) but want to maintain the same column names and had trouble adding them to an ever-growing new DataFrame because the lengths are variable and I'm looping through columns.
ls2 = []
for col in fn1.columns:
    fn2 = fn1[fn1[col] > 0]
    ls2.append(list(fn2.index))
Here fn1 is the DataFrame shown above, with the 'Symbol' column set as the index.
Thank you in advance for any help.
Another approach would be slicing, as shown below (explanations in the comments):
import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({
    "Symbol": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"],
    "Condition1": [1, np.nan, 3, np.nan, np.nan, np.nan, 7, np.nan, np.nan, 8, 12],
    "Condition2": [np.nan, 2, 2, 7, np.nan, np.nan, 5, 11, 14, np.nan, np.nan],
})

new_df = pd.concat(
    [
        df["Symbol"][df[column].notnull()].reset_index(drop=True)  # get the symbols whose value is not null, ignoring the index (as your output suggests)
        for column in list(df)[1:]  # iterate over all columns except "Symbol"
    ],
    axis=1,  # column-wise concatenation
)

# Rename columns
new_df.columns = list(df)[1:]

# You can leave NaNs or replace them with an empty string, your choice
new_df.fillna("", inplace=True)
Output of this operation will be:
Condition1 Condition2
0 a b
1 c c
2 g d
3 j g
4 k h
5 i
If you need any further clarification, post a comment down below.
You can map the symbols to each of the columns, and then take the set of non-null values.
fn1_symbols = fn1.apply(lambda x: x.map(fn1['Symbol'].to_dict()))
condition_symbols = {col: sorted(list(set(fn1_symbols[col].dropna()))) for col in fn1.columns[1:]}
This will give you a dictionary:
{'Condition1': ['B', 'D'],
'Condition2': ['C', 'H'],
'Condition3': ['D', 'H', 'J'],
'Condition4': ['D', 'G', 'H', 'K']}
I know you asked for a DataFrame, but since the lengths of these lists differ, it would not make sense to turn them into a DataFrame directly. If you still want a DataFrame, you could run this code:
pd.DataFrame({k: pd.Series(v) for k, v in condition_symbols.items()})
This gives you the following output:
Condition1 Condition2 Condition3 Condition4
0 B C D D
1 D H H G
2 NaN NaN J H
3 NaN NaN NaN K
I have a data frame like this:
pd.DataFrame([
[1, None, 'a'],
[1, 3.3, None],
[2, 1.7, 'c']
], columns=['unique_id', 'x', 'target'])
I want to drop one of the rows where unique_id is 1, but take the union of their values. That is, I want to produce this:
pd.DataFrame([
[1, 3.3, 'a'],
[2, 1.7, 'c']
], columns=['unique_id', 'x', 'target'])
Can this be done efficiently in Pandas?
Assume this data frame has between 10k and 100k rows, with maybe 10% being duplicates I want to eliminate. There will only be 2 or 3 duplicates of each unique_id.
Edit: when both rows have disagreeing entries, just taking the first one is fine in my case. But I'm open to solutions where, e.g. both values are collected in a list.
This gives the result for your example. It takes the first non-NaN value for each column, in each group.
df.groupby("unique_id", as_index=False).first()
Use groupby and first:
df.groupby('unique_id').first()
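Note that without as_index=False this puts unique_id into the index; if you want it back as a column, reset the index afterwards:
df.groupby('unique_id').first().reset_index()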