Pandas merging mixed length datasets without duplicate columns - python

I'm trying to merge several dataframes of mixed length into one combined dataset. Some values are missing in one dataframe but present in another, and some dataframes contain extra columns; those columns should be added, with NaN in every row that has no value for them.
The merge is based on one or several key columns; the row index has no meaning. The real dataset has many columns, so manually removing anything is highly undesirable.
In short: merge several dataframes on one or several columns, prioritizing any non-NaN value. If two conflicting non-NaN values exist, prioritize the value already in the base dataframe over the one being merged in.
df1 = pd.DataFrame({
'id': [1, 2, 4],
'data_one': [np.nan, 3, np.nan],
'data_two': [4, np.nan, np.nan],
})
id data_one data_two
0 1 NaN 4.0
1 2 3.0 NaN
2 4 NaN NaN
df2 = pd.DataFrame({
'id': [1, 3],
'data_one': [8, np.nan],
'data_two': [np.nan, 4],
'data_three': [np.nan, 100]
})
id data_one data_two data_three
0 1 8.0 NaN NaN
1 3 NaN 4.0 100.0
# Desired result
res = pd.DataFrame({
'id': [1, 2, 3, 4],
'data_one': [8, 3, np.nan, np.nan],
'data_two': [4, np.nan, 4, np.nan],
'data_three': [np.nan, np.nan, 100, np.nan],
})
id data_one data_two data_three
0 1 8.0 4.0 NaN
1 2 3.0 NaN NaN
2 3 NaN 4.0 100.0
3 4 NaN NaN NaN
The functions I have been experimenting with so far are pd.merge(), DataFrame.join(), and DataFrame.combine_first(), but I haven't had any success; maybe I'm missing something easy.

You can do a groupby() coupled with ffill() and bfill():
pd.concat([df1,df2]).groupby('id').apply(lambda x: x.ffill().bfill()).drop_duplicates()
results in:
id data_one data_two data_three
0 1 8.0 4.0 NaN
1 2 3.0 NaN NaN
1 3 NaN 4.0 100.0
2 4 NaN NaN NaN
Note that this returns separate rows for ids where df1 and df2 both have non-null values that differ in the same column. This is intentional, since I don't know what you want to do in such cases.
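If the base dataframe should win whenever two non-NaN values conflict, as the question asks, one possible alternative is DataFrame.combine_first, which keeps the caller's value and only fills its NaN cells from the other frame. The following is a minimal sketch, assuming 'id' is unique within each dataframe; the names frames, indexed, and combined are illustrative only. The column order of the result may differ from the inputs.

```python
from functools import reduce

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 4],
    'data_one': [np.nan, 3, np.nan],
    'data_two': [4, np.nan, np.nan],
})
df2 = pd.DataFrame({
    'id': [1, 3],
    'data_one': [8, np.nan],
    'data_two': [np.nan, 4],
    'data_three': [np.nan, 100],
})

# Assumption: 'id' is unique within each frame, so it can serve as the index.
frames = [df1, df2]
indexed = [frame.set_index('id') for frame in frames]

# combine_first keeps the left (base) value and fills only its NaN cells
# from the right frame; rows and columns are the union of both frames.
combined = reduce(lambda base, other: base.combine_first(other), indexed)
combined = combined.reset_index()
print(combined)
```

Because reduce folds the list from left to right, earlier dataframes in frames take priority over later ones.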

Related

Pandas excel: write rows with NaN values in a separate DataFrame

I have df with 2 columns: Name, Number.
I need to write every row that has NaN in a cell to a new DataFrame.
path = 'Files/Directory.xlsx'
df = pd.read_excel(path)
I've tried so many different things, spent 3 days and still can't get it.
df = pd.DataFrame(
{
"Name": ["Alex", "Bob", "Jim", np.nan, np.nan],
"Number": [1, 2, np.nan, 3, np.nan],
}
)
df
Name Number
Alex 1.0
Bob 2.0
Jim NaN
NaN 3.0
NaN NaN
So it depends if you want to write rows with any NaN values to a new DataFrame or if you just want to write rows with all NaN values to the new DataFrame.
If any, the following should work:
df_nan = df.loc[df.isnull().any(axis=1)]
df_nan
Name Number
Jim NaN
NaN 3.0
NaN NaN
If all, this should work:
df_nan = df.loc[df.isnull().all(axis=1)]
df_nan
Name Number
NaN NaN
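If the rows without NaN are also needed, the same boolean mask can be reused to split the DataFrame in two. This is a small sketch using the sample data above; the names has_nan and df_complete, and the commented-out output path, are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alex", "Bob", "Jim", np.nan, np.nan],
        "Number": [1, 2, np.nan, 3, np.nan],
    }
)

# True for every row that contains at least one NaN
has_nan = df.isna().any(axis=1)

df_nan = df[has_nan]        # rows with at least one NaN
df_complete = df[~has_nan]  # rows with no NaN at all

# Hypothetical output path; writing .xlsx requires an Excel engine such as openpyxl.
# df_nan.to_excel("Files/Rows_with_nan.xlsx", index=False)
```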

Is there a pythonic way of shifting pandas dataframe cells to the left, while pushing out or overwriting any nan?

I have a pandas dataframe (starting_df) with nan values in the left-hand columns. I'd like to shift all values over to the left for a left-aligned dataframe. My Dataframe is 24x24, but for argument's sake, I'm just posting a 4x4 version.
After some cool initial answers here, I modified the dataframe to also include a non-leading nan, whose position I'd like to preserve.
I have a piece of code that accomplishes what I want, but it relies on nested for-loops and suppressing an IndexError, which does not feel very pythonic. I have no experience with error handling in general, but simply suppressing an error does not seem to be the right strategy.
Starting dataframe and desired final dataframe:
Here is the code that (poorly) accomplishes the goal.
import pandas as pd
import numpy as np
def get_left_aligned(starting_df):
    """take a starting df with right-aligned numbers and nan, and
    turn it into a left aligned table."""
    left_aligned_df = pd.DataFrame()
    for temp_index_1 in range(0, starting_df.shape[0]):
        temp_series = []
        for temp_index_2 in range(0, starting_df.shape[0]):
            try:
                temp_series.append(starting_df.iloc[temp_index_2, temp_index_2 + temp_index_1])
                temp_index_2 += 1
            except IndexError:
                pass
        temp_series = pd.DataFrame(temp_series, columns=['col'+str(temp_index_1 + 1)])
        left_aligned_df = pd.concat([left_aligned_df, temp_series], axis=1)
    return left_aligned_df
df = pd.DataFrame(dict(col1=[1, np.nan, np.nan, np.nan],
col2=[5, 2, np.nan, np.nan],
col3=[7, np.nan, 3, np.nan],
col4=[9, 8, 6, 4]))
df_expected = pd.DataFrame(dict(col1=[1, 2, 3, 4],
col2=[5, np.nan, 6, np.nan],
col3=[7, 8, np.nan, np.nan],
col4=[9, np.nan, np.nan, np.nan]))
df_left = get_left_aligned(df)
I appreciate any help with this.
Thanks!
Alternatively, transpose the df and use shift on each column; this works when the number of leading NaNs increases by one per row.
dfn = df.T.copy()
for i, col in enumerate(dfn.columns):
    dfn[col] = dfn[col].shift(-i)
dfn = dfn.T
print(dfn)
col1 col2 col3 col4
0 1.0 5.0 7.0 9.0
1 2.0 NaN 8.0 NaN
2 3.0 6.0 NaN NaN
3 4.0 NaN NaN NaN
One way to resolve your challenge is to move the data into numpy territory, sort the data, then return it as a pandas DataFrame:
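Numpy converts pandas NA to object data type; pd.to_numeric resolves that to data types numpy can work with. Note that np.sort orders each row by value and places NaN last, so the original left-to-right order is kept only if each row's values are already ascending.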
pd.DataFrame(
np.sort(df.transform(pd.to_numeric).to_numpy(), axis=1),
columns=df.columns,
dtype="Int64",
)
col1 col2 col3
0 1 4 6
1 2 5 <NA>
2 3 <NA> <NA>
You can sort the values in each row by their positions, keeping the NaN values at the end by giving them a very high sort key (np.inf, for example) rather than their actual position.
df.T.apply(
lambda x: [z[1] for z in sorted(enumerate(x), key=(lambda k: np.inf if pd.isna(k[1]) else k[0]), reverse=False)],
axis=0
).T
Here is an example:
import numpy as np
import pandas as pd
df = pd.DataFrame(
data = [
[np.nan, 2, 4, 7],
[np.nan, np.nan, 6, 9],
[np.nan, np.nan, np.nan, 10],
[np.nan, np.nan, np.nan, np.nan],
],
columns=['A', 'B', 'C', 'D']
)
df2 = df.T.apply(
lambda x: [z[1] for z in sorted(enumerate(x), key=(lambda k: np.inf if pd.isna(k[1]) else k[0]), reverse=False)],
axis=0
).T
And this is df2:
A B C D
0 2.0 4.0 7.0 NaN
1 6.0 9.0 NaN NaN
2 10.0 NaN NaN NaN
3 NaN NaN NaN NaN
EDIT
If you have rows with NaNs after the first non-NaN value, you can use this approach based on first_valid_index:
df.apply(
lambda x: x.shift(-list(x.index).index(x.first_valid_index() or x.index[0])),
axis=1,
)
An example for this case:
import numpy as np
import pandas as pd
df = pd.DataFrame(
data = [
[np.nan, 2, 4, 7],
[np.nan, np.nan, 6, 9],
[np.nan, np.nan, np.nan, 10],
[np.nan, np.nan, np.nan, np.nan],
[np.nan, 5, np.nan, 3],
],
columns=['A', 'B', 'C', 'D']
)
df3 = df.apply(
lambda x: x.shift(-list(x.index).index(x.first_valid_index() or x.index[0])),
axis=1,
)
And df3 is:
A B C D
0 2.0 4.0 7.0 NaN
1 6.0 9.0 NaN NaN
2 10.0 NaN NaN NaN
3 NaN NaN NaN NaN
4 5.0 NaN 3.0 NaN
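For larger frames, the same "shift each row left by its number of leading NaNs" idea can also be expressed with NumPy indexing instead of a row-wise apply. The following is a rough sketch, assuming all columns are numeric; the function name shift_left_by_leading_nans is made up for illustration.

```python
import numpy as np
import pandas as pd


def shift_left_by_leading_nans(df):
    """Shift every row left by its count of leading NaNs (numeric data assumed)."""
    values = df.to_numpy(dtype=float)
    n_rows, n_cols = values.shape

    # Position of the first non-NaN value per row; argmax returns 0 for
    # all-NaN rows, which leaves those rows unchanged.
    lead = (~np.isnan(values)).argmax(axis=1)

    # Source column for each target cell, and whether it is inside the row.
    src_cols = np.arange(n_cols) + lead[:, None]
    in_range = src_cols < n_cols
    src_rows = np.broadcast_to(np.arange(n_rows)[:, None], src_cols.shape)

    out = np.full_like(values, np.nan)
    out[in_range] = values[src_rows[in_range], src_cols[in_range]]
    return pd.DataFrame(out, index=df.index, columns=df.columns)


df = pd.DataFrame(
    data=[
        [np.nan, 2, 4, 7],
        [np.nan, np.nan, 6, 9],
        [np.nan, np.nan, np.nan, 10],
        [np.nan, np.nan, np.nan, np.nan],
        [np.nan, 5, np.nan, 3],
    ],
    columns=['A', 'B', 'C', 'D'],
)
shifted = shift_left_by_leading_nans(df)
print(shifted)
```

Non-leading NaNs keep their position relative to the first valid value, as in the first_valid_index approach above.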

Iterate through two data frames and update a column of the first data frame with a column of the second data frame in pandas

I am converting a piece of code written in R to python. The following code is in R. df1 and df2 are the dataframes. id, case, feature, feature_value are column names. The code in R is
for(i in 1:dim(df1)[1]){
temp = subset(df2,df2$id == df1$case[i],select = df1$feature[i])
df1$feature_value[i] = temp[,df1$feature[i]]
}
My code in python is as follows.
for i in range(0,len(df1)):
    temp=np.where(df1['case'].iloc[i]==df2['id']),df1['feature'].iloc[i]
    df1['feature_value'].iloc[i]=temp[:,df1['feature'].iloc[i]]
but it gives
TypeError: tuple indices must be integers or slices, not tuple
How to rectify this error? Appreciate any help.
Unfortunately, R and Pandas handle dataframes pretty differently. If you'll be using Pandas a lot, it would probably be worth going through a tutorial on it.
I'm not too familiar with R so this is what I think you want to do:
Find rows in df1 where the 'case' matches an 'id' in df2. If such a row is found, add the "feature" in df1 to a new df1 column called "feature_value."
If so, you can do this with the following:
#create a sample df1 and df2
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5]})
>>> df1
case feature
0 1 3
1 2 4
2 3 5
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39]})
>>> df2
id age
0 1 45
1 3 63
2 7 39
#create a list with all the "id" values of df2
>>> df2_list = df2['id'].to_list()
>>> df2_list
[1, 3, 7]
#lambda allows small functions; in this case, the value of df1['feature_value']
#for each row is assigned df1['feature'] if df1['case'] is in df2_list,
#and otherwise it is assigned np.nan.
>>> df1['feature_value'] = df1.apply(lambda x: x['feature'] if x['case'] in df2_list else np.nan, axis=1)
>>> df1
case feature feature_value
0 1 3 3.0
1 2 4 NaN
2 3 5 5.0
Instead of lambda, a full function can be created, which may be easier to understand:
def get_feature_values(df, id_list):
    if df['case'] in id_list:
        feature_value = df['feature']
    else:
        feature_value = np.nan
    return feature_value
df1['feature_value'] = df1.apply(get_feature_values, id_list=df2_list, axis=1)
Another way of going about this would involve merging df1 and df2 to find rows where the "case" value in df1 matches an "id" value in df2 (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
===================
To address the follow-up question in the comments:
You can do this by merging the dataframes and then creating a function.
#create example dataframes
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5], 'names': ['a', 'b', 'c']})
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39], 'a': [30, 31, 32], 'b': [40, 41, 42], 'c': [50, 51, 52]})
#merge the dataframes
>>> df1 = df1.merge(df2, how='left', left_on='case', right_on='id')
>>> df1
case feature names id age a b c
0 1 3 a 1.0 45.0 30.0 40.0 50.0
1 2 4 b NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0
Then you can create the following function:
def get_feature_values_2(df):
    if pd.notnull(df['id']):
        feature_value = df['feature']
        column_of_interest = df['names']
        feature_extended_value = df[column_of_interest]
    else:
        feature_value = np.nan
        feature_extended_value = np.nan
    return feature_value, feature_extended_value
# "result_type='expand'" allows multiple values to be returned from the function
df1[['feature_value', 'feature_extended_value']] = df1.apply(get_feature_values_2, result_type='expand', axis=1)
#This results in the following dataframe:
case feature names id age a b c feature_value \
0 1 3 a 1.0 45.0 30.0 40.0 50.0 3.0
1 2 4 b NaN NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0 5.0
feature_extended_value
0 30.0
1 NaN
2 51.0
#To keep only a subset of the columns:
#First create a copy-pasteable list of the column names
list(df1.columns)
['case', 'feature', 'names', 'id', 'age', 'a', 'b', 'c', 'feature_value', 'feature_extended_value']
#Choose the subset of columns you would like to keep
df1 = df1[['case', 'feature', 'names', 'feature_value', 'feature_extended_value']]
df1
case feature names feature_value feature_extended_value
0 1 3 a 3.0 30.0
1 2 4 b NaN NaN
2 3 5 c 5.0 51.0
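For comparison, the R loop in the question appears to look up, for each row of df1, the df2 column named in 'feature' on the df2 row whose 'id' equals 'case'. A minimal sketch of that per-row lookup follows, assuming 'id' is unique in df2 and 'feature' holds df2 column names; the sample frames and the names lookup and lookup_feature are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 'feature' names a column of df2
df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [1, 3, 7],
                    'a': [30, 31, 32],
                    'b': [40, 41, 42],
                    'c': [50, 51, 52]})

# Assumption: 'id' is unique, so it can be used as a lookup index.
lookup = df2.set_index('id')


def lookup_feature(row):
    """Return df2[row.feature] for the df2 row whose id equals row.case."""
    if row['case'] in lookup.index and row['feature'] in lookup.columns:
        return lookup.at[row['case'], row['feature']]
    return np.nan  # no matching id or column


df1['feature_value'] = df1.apply(lookup_feature, axis=1)
print(df1)
```

Rows of df1 whose 'case' has no matching 'id' in df2 get NaN rather than raising an error.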

Dropping columns with >N NaNs excluding specific columns

I'm wondering if there is a concise way to exclude all columns with more than N NaNs, while exempting one column from this filter.
For example:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5]],
columns=list('ABCD'))
Results in:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
Running the following, I get:
df.dropna(thresh=2, axis=1)
B D
0 2.0 0
1 4.0 1
2 NaN 5
I would like to keep column 'C'. I.e., to perform this thresholding except on column 'C'.
Is that possible?
You can put the column back once you've done the thresholding. If you do this all on one line, you don't even need to store a reference to the column.
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5]],
columns=list('ABCD'))
df.dropna(thresh=2, axis=1).assign(C=df['C'])
You could also do
C = df['C']
df = df.dropna(thresh=2, axis=1)
df = df.assign(C=C)
As suggested by @Wen, you can also do an indexing operation that won't remove column C to begin with.
threshold = 2
df = df.loc[:, (df.isnull().sum(0) < threshold) | (df.columns == 'C')]
The index here for the column will select columns that have fewer than threshold NaN values, or whose name is C. If you wanted to include more than just one column in the exception, you can chain more conditions with the "or" operator |. For example:
df = df.loc[
:,
(df.isnull().sum(0) < threshold) |
(df.columns == 'C') |
(df.columns == 'D')]
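When the list of exempt columns grows, chaining == comparisons gets unwieldy. One possible variation is Index.isin, sketched below on the question's sample data; max_nans and keep_always are illustrative names, and the NaN limit of 1 is an assumption matching the examples above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))

max_nans = 1               # drop columns with more NaNs than this...
keep_always = ['C', 'D']   # ...unless they are listed here

# Per-column NaN count combined with a membership test on the column names
mask = (df.isna().sum() <= max_nans) | df.columns.isin(keep_always)
result = df.loc[:, mask]
print(result)
```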
df.loc[:,(df.isnull().sum(0)<=1)|(df.isnull().sum(0)==len(df))]
Out[415]:
B C D
0 2.0 NaN 0
1 4.0 NaN 1
2 NaN NaN 5
As per Zero's suggestion
df.loc[:,(df.isnull().sum(0)<=1)|(df.isnull().all(0))]
EDIT :
df.loc[:,(df.isnull().sum(0)<=1)|(df.columns=='C')]
Another take that blends some concepts from other answers.
df.loc[:, df.isnull().assign(C=False).sum().lt(2)]
B C D
0 2.0 NaN 0
1 4.0 NaN 1
2 NaN NaN 5

Pandas: How to drop multiple columns with nan as col name?

As per the title here's a reproducible example:
raw_data = {'x': ['this', 'that', 'this', 'that', 'this'],
np.nan: [np.nan, np.nan, np.nan, np.nan, np.nan],
'y': [np.nan, np.nan, np.nan, np.nan, np.nan],
np.nan: [np.nan, np.nan, np.nan, np.nan, np.nan]}
df = pd.DataFrame(raw_data, columns = ['x', np.nan, 'y', np.nan])
df
x NaN y NaN
0 this NaN NaN NaN
1 that NaN NaN NaN
2 this NaN NaN NaN
3 that NaN NaN NaN
4 this NaN NaN NaN
Aim is to drop only the columns with nan as the col name (so keep column y). dropna() doesn't work as it conditions on the nan values in the column, not nan as the col name.
df.drop(np.nan, axis=1, inplace=True) works if there's a single column in the data with nan as the col name, but not with multiple columns with nan as the col name, as in my data.
So how to drop multiple columns where the col name is nan?
In [218]: df = df.loc[:, df.columns.notna()]
In [219]: df
Out[219]:
x y
0 this NaN
1 that NaN
2 this NaN
3 that NaN
4 this NaN
You can try
df.columns = df.columns.fillna('to_drop')
df.drop('to_drop', axis = 1, inplace = True)
As of pandas 1.4.0
df.drop is the simplest solution, as it now handles multiple NaN headers properly:
df = df.drop(columns=np.nan)
# x y
# 0 this NaN
# 1 that NaN
# 2 this NaN
# 3 that NaN
# 4 this NaN
Or the equivalent axis syntax:
df = df.drop(np.nan, axis=1)
Note that it's possible to use inplace instead of assigning back to df, but inplace is not recommended and will eventually be deprecated.
