Merging dataframes in pandas while ignoring NaN values - python

Suppose I have two dataframes:
df_a
A B C
0 1.0 NaN NaN
1 NaN 1.0 NaN
2 NaN NaN 1.0
df_b
A B C
0 NaN NaN 2.0
1 NaN 2.0 NaN
2 2.0 NaN NaN
How would I go about merging/concatenating them so the result dataframe looks like this:
df_c
A B C
0 1.0 NaN 2.0
1 NaN 2.0 NaN
2 2.0 NaN 1.0
The way I got closer conceptually was by using pd.merge(df_a, df_b, "right"), but all values on df_a ended up replaced.
Is there any way to ignore NaN values when merging?

In your case do combine_first
df_c = df_b.combine_first(df_a)
df_c
Out[151]:
A B C
0 1.0 NaN 2.0
1 NaN 2.0 NaN
2 2.0 NaN 1.0

df_c = df_a.combine_first(df_b)
df_c
A B C
0 1.0 NaN 2.0
1 NaN 1.0 NaN
2 2.0 NaN 1.0
df_d = df_b.combine_first(df_a)
df_d
A B C
0 1.0 NaN 2.0
1 NaN 2.0 NaN
2 2.0 NaN 1.0

Related

Dataframe from a dict of lists of dicts?

I have a dict of lists of dicts. What is the most efficient way to convert this into a DataFrame in pandas?
data = {
"0a2":[{"a":1,"b":1},{"a":1,"b":1,"c":1},{"a":1,"b":1}],
"279":[{"a":1,"b":1,"c":1},{"a":1,"b":1,"d":1}],
"ae2":[{"a":1,"b":1},{"a":1,"d":1},{"a":1,"b":1},{"a":1,"d":1}],
#...
}
import pandas as pd
pd.DataFrame(data, columns=["a","b","c","d"])
What I've tried:
One solution is to denormalize the data like this, by duplicating the "id" keys:
bad_data = [
{"a":1,"b":1,"id":"0a2"},{"a":1,"b":1,"c":1,"id":"0a2"},{"a":1,"b":1,"id":"0a2"},
{"a":1,"b":1,"c":1,"id":"279"},{"a":1,"b":1,"d":1,"id":"279"},
{"a":1,"b":1,"id":"ae2"},{"a":1,"d":1,"id":"ae2"},{"a":1,"b":1,"id":"ae2"},{"a":1,"d":1,"id":"ae2"}
]
pd.DataFrame(bad_data, columns=["a","b","c","d","id"])
But my data is very large, so I'd prefer some other hierarchical index solution.
IIUC, you can do (remcomended)
new_df = pd.concat((pd.DataFrame(d) for d in data.values()), keys=data.keys())
Output:
a b c d
0a2 0 1 1.0 NaN NaN
1 1 1.0 1.0 NaN
2 1 1.0 NaN NaN
279 0 1 1.0 1.0 NaN
1 1 1.0 NaN 1.0
ae2 0 1 1.0 NaN NaN
1 1 NaN NaN 1.0
2 1 1.0 NaN NaN
3 1 NaN NaN 1.0
Or
pd.concat(pd.DataFrame(v).assign(ID=k) for k,v in data.items())
Output:
a b c ID d
0 1 1.0 NaN 0a2 NaN
1 1 1.0 1.0 0a2 NaN
2 1 1.0 NaN 0a2 NaN
0 1 1.0 1.0 279 NaN
1 1 1.0 NaN 279 1.0
0 1 1.0 NaN ae2 NaN
1 1 NaN NaN ae2 1.0
2 1 1.0 NaN ae2 NaN
3 1 NaN NaN ae2 1.0

Fill missing values of pandas based on values of other columns

I have a following dataframe:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
Now I want to fill null values of A with the values in B or D. i.e. if the value is Null in B than check D. So resultant dataframe looks like this.
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5 NaN NaN 5
3 3.0 3.0 NaN 4
I can do this using following code:
df['A'] = df['A'].fillna(df['B'])
df['A'] = df['A'].fillna(df['D'])
But I want to do this in one line, how can I do that?
You could simply chain both .fillna():
df['A'] = df.A.fillna(df.B).fillna(df.D)
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5.0 NaN NaN 5
3 3.0 3.0 NaN 4
Or using fillna with combine_first:
df['A'] = df.A.fillna(df.B.combine_first(df.D))
If dont need chain because many columns better is use back filling missing values with seelcting first column by positions:
df['A'] = df['A'].fillna(df[['B','D']].bfill(axis=1).iloc[:, 0])
print (df)
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5.0 NaN NaN 5
3 3.0 3.0 NaN 4

I want to subtract each column from the previous non-null column using the diff function

I have a long list of columns and I want to subtract the previous column from the current column and replace the current column with the difference.
So if I have:
A B C D
1 NaN 3 7
3 NaN 8 10
2 NaN 6 11
I want the output to be:
A B C D
1 NaN 2 4
3 NaN 5 2
2 NaN 4 5
I have been trying to use this code:
df2 = df1.diff(axis=1)
but this does not produce the desired output
Thanks in advance.
You can do this with df.where and then update to bring back the first non-null entry for each row of your DataFrame.
Sample Data: df
A B C D
0 1.0 NaN 3.0 7.0
1 1.0 4.0 5.0 9.0
2 NaN 4.0 NaN 4.0
3 NaN 4.0 NaN NaN
4 NaN NaN 3.0 7.0
5 3.0 NaN NaN 7.0
6 6.0 NaN NaN NaN
Code:
df_d = df.where(df.isnull(),
df.fillna(method='ffill', axis=1).diff(axis=1))
df_d.update(df.where(df.notnull().cumsum(1).cumsum(1) == 1))
Output: df_d
A B C D
0 1.0 NaN 2.0 4.0
1 1.0 3.0 1.0 4.0
2 NaN 4.0 NaN 0.0
3 NaN 4.0 NaN NaN
4 NaN NaN 3.0 4.0
5 3.0 NaN NaN 4.0
6 6.0 NaN NaN NaN
Actually, it is producing the desired result but you are trying to calculate diff on nan values which will be nan so diff is working as expected.
For your case just fetch the first column from original dataframe and you should be fine
df2=df1.diff(axis=1)
df2.A=df1.A
print(df2)
Output
A B C D
1 NaN 2.0 4.0

Pandas: Drop Rows, Columns If More Than Half Are NaN

I have a Pandas DataFrame called df with 1,460 rows and 81 columns. I want to remove all columns where at least half the entries are NaN and to do something similar for rows.
From the Pandas docs, I attempted this:
train_df.shape //(1460, 81)
train_df.dropna(thresh=len(train_df)/2, axis=1, inplace=True)
train_df.shape //(1460, 77)
Is this the correct way of doing it? It seems to remove 4 columns but I'm surprised. I would have thought len(train_df) gets me the number of rows so I've passed the wrong value to thresh...?
How would I do the same thing for rows (removing rows where at least half the columns are NaN)?
Thanks!
I guess you did the right thing but forgot to add the .index.
The line should look like this:
train_df.dropna(thresh=len(train_df.index)/2, axis=1, inplace=True)
Hope that helps.
Using count and loc. count(axis=) ignores NaNs for counting.
In [4135]: df.loc[df.count(1) > df.shape[1]/2, df.count(0) > df.shape[0]/2]
Out[4135]:
0
0 0.382991
1 0.428040
7 0.441113
Details
In [4136]: df
Out[4136]:
0 1 2 3
0 0.382991 0.658090 0.881214 0.572673
1 0.428040 0.258378 0.865269 0.173278
2 0.579953 NaN NaN NaN
3 0.117927 NaN NaN NaN
4 0.597632 NaN NaN NaN
5 0.547839 NaN NaN NaN
6 0.998631 NaN NaN NaN
7 0.441113 0.527205 0.779821 0.251350
In [4137]: df.count(1) > df.shape[1]/2
Out[4137]:
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 True
dtype: bool
In [4138]: df.count(0) < df.shape[0]/2
Out[4138]:
0 False
1 True
2 True
3 True
dtype: bool
Setup
np.random.seed([3,14159])
df = pd.DataFrame(np.random.choice([1, np.nan], size=(10, 10)))
df
0 1 2 3 4 5 6 7 8 9
0 1.0 1.0 NaN NaN NaN 1.0 1.0 NaN 1.0 NaN
1 NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0
2 NaN 1.0 1.0 NaN NaN NaN NaN 1.0 1.0 1.0
3 1.0 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN
4 1.0 1.0 1.0 1.0 1.0 1.0 NaN NaN 1.0 NaN
5 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.0
6 NaN NaN 1.0 NaN NaN 1.0 1.0 NaN NaN 1.0
7 NaN NaN NaN 1.0 NaN 1.0 NaN 1.0 NaN NaN
8 1.0 1.0 1.0 NaN 1.0 NaN 1.0 NaN NaN 1.0
9 NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Solution 1
This assumes you make the calculation for rows and columns before you drop either rows or columns.
n = df.notnull()
df.loc[n.mean(1) > .5, n.mean() > .5]
5 6 9
1 1.0 1.0 1.0
4 1.0 NaN NaN
8 NaN 1.0 1.0
9 1.0 1.0 1.0
Solution 2
Similar concept but using numpy tools.
v = np.isnan(df.values)
r = np.count_nonzero(v, 1) < v.shape[1] // 2
c = np.count_nonzero(v, 0) < v.shape[0] // 2
df.loc[r, c]
5 6 9
1 1.0 1.0 1.0
4 1.0 NaN NaN
8 NaN 1.0 1.0
9 1.0 1.0 1.0
Try this code, it would do !
df.dropna(thresh = df.shape[1]/3, axis = 0, inplace = True)

Pandas: operations with nans

Suppose I produce the following using pandas:
df = pd.DataFrame(np.ones(25).reshape((5,5)),index = ['A','B','C','D','E'])
df1 = pd.DataFrame(np.ones(25).reshape((5,5))*2,index = ['A','B','C','D','E'])
df[2] = np.nan
df1[3] = np.nan
df[4] = np.nan
df1[4] = np.nan
df2 = df+df1
print(df2)
0 1 2 3 4
A 3.0 3.0 NaN NaN NaN
B 3.0 3.0 NaN NaN NaN
C 3.0 3.0 NaN NaN NaN
D 3.0 3.0 NaN NaN NaN
E 3.0 3.0 NaN NaN NaN
What do I have to do to get this instead?
0 1 2 3 4
A 3 3 2 1 NaN
B 3 3 2 1 NaN
C 3 3 2 1 NaN
D 3 3 2 1 NaN
E 3 3 2 1 NaN
Use the fill_value argument of the DataFrame.add method:
fill_value : None or float value, default None Fill missing (NaN)
values with this value. If both DataFrame locations are missing, the
result will be missing.
df.add(df1, fill_value=0)
Out:
0 1 2 3 4
A 3.0 3.0 2.0 1.0 NaN
B 3.0 3.0 2.0 1.0 NaN
C 3.0 3.0 2.0 1.0 NaN
D 3.0 3.0 2.0 1.0 NaN
E 3.0 3.0 2.0 1.0 NaN

Categories