I have two data frames:
import pandas as pd
from numpy import nan
df1 = pd.DataFrame({'key': [1, 2, 3, 4],
                    'only_at_df1': ['a', 'b', 'c', 'd'],
                    'col2': ['e', 'f', 'g', 'h']})
df2 = pd.DataFrame({'key': [1, 9],
                    'only_at_df2': [nan, 'x'],
                    'col2': ['e', 'z']})
How can I obtain this?
df3 = pd.DataFrame({'key': [1, 2, 3, 4, 9],
                    'only_at_df1': ['a', 'b', 'c', 'd', nan],
                    'only_at_df2': [nan, nan, nan, nan, 'x'],
                    'col2': ['e', 'f', 'g', 'h', 'z']})
Any help is appreciated.
The best approach is probably to use combine_first after temporarily setting "key" as the index:
df1.set_index('key').combine_first(df2.set_index('key')).reset_index()
output:
key col2 only_at_df1 only_at_df2
0 1 e a NaN
1 2 f b NaN
2 3 g c NaN
3 4 h d NaN
4 9 z NaN x
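Note that combine_first keeps the caller's non-NaN values wherever both frames have data for the same index/column; the other frame's values only fill the gaps. A minimal sketch of that priority rule:

```python
import pandas as pd
from numpy import nan

a = pd.DataFrame({'x': [1.0, nan]}, index=[0, 1])
b = pd.DataFrame({'x': [9.0, 9.0]}, index=[0, 1])

# a's non-NaN values win; b only fills a's missing slots
out = a.combine_first(b)
print(out)
#      x
# 0  1.0
# 1  9.0
```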
This seems like a straightforward use of merge with how="outer":
df1.merge(df2, how="outer")
Output:
key only_at_df1 col2 only_at_df2
0 1 a e NaN
1 2 b f NaN
2 3 c g NaN
3 4 d h NaN
4 9 NaN z x
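With no on= argument, merge joins on all columns the two frames share (here key and col2). If you also want to see which side each row came from, indicator=True adds a _merge column; a small sketch using the frames from the question:

```python
import pandas as pd
from numpy import nan

df1 = pd.DataFrame({'key': [1, 2, 3, 4],
                    'only_at_df1': ['a', 'b', 'c', 'd'],
                    'col2': ['e', 'f', 'g', 'h']})
df2 = pd.DataFrame({'key': [1, 9],
                    'only_at_df2': [nan, 'x'],
                    'col2': ['e', 'z']})

# _merge is 'both', 'left_only', or 'right_only' per row
out = df1.merge(df2, how='outer', indicator=True)
print(out['_merge'].tolist())
# ['both', 'left_only', 'left_only', 'left_only', 'right_only']
```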
I'm concatenating two pandas data frames that have exactly the same columns but different numbers of rows. I'd like to stack the first dataframe on top of the second.
When I do the following, I get many NaN values in some of the columns. I've tried the fix from a related post, using .reset_index, but I'm still getting NaN values. My dataframes have the following columns:
The first one, rem_dup_pre, and the second one, rem_dup_po, have shapes (54178, 11) and (83502, 11) respectively.
I've tried this:
concat_mil = pd.concat([rem_dup_pre.reset_index(drop=True), rem_dup_po.reset_index(drop=True)], axis=0)
and I get NaN values. For example in 'Station Type', where previously there were no NaN values in either rem_dup_pre or rem_dup_po:
How can I simply concat them without NaN values?
Here's how I did it, and I don't get any additional NaNs:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                    'b': ['a', 'b', 'c', 'd', np.nan, np.nan],
                    'c': ['x', np.nan, np.nan, np.nan, 'y', 'z']})
df2 = pd.DataFrame(np.random.randint(0,10,(3,3)), columns = list('abc'))
print (df1)
print (df2)
df = pd.concat([df1,df2]).reset_index(drop=True)
print (df)
The output of this is:
DF1:
a b c
0 1 a x
1 2 b NaN
2 3 c NaN
3 4 d NaN
4 5 NaN y
5 6 NaN z
DF2:
a b c
0 4 8 4
1 8 4 4
2 2 8 1
DF: after concat
a b c
0 1 a x
1 2 b NaN
2 3 c NaN
3 4 d NaN
4 5 NaN y
5 6 NaN z
6 4 8 4
7 8 4 4
8 2 8 1
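If concat along axis=0 introduces new NaNs, the usual culprit is column labels that don't match exactly (a trailing space in a header is enough): pandas aligns on labels and pads the mismatched columns with NaN. A quick diagnostic sketch, using hypothetical frames where one header has a stray trailing space:

```python
import pandas as pd

# hypothetical frames: the second header has a trailing space
pre = pd.DataFrame({'Station Type': ['A', 'B']})
po = pd.DataFrame({'Station Type ': ['C', 'D']})

# symmetric_difference lists labels present in only one frame
print(pre.columns.symmetric_difference(po.columns).tolist())
# ['Station Type', 'Station Type ']

# normalizing the labels (here: stripping whitespace) fixes the padding
po.columns = po.columns.str.strip()
out = pd.concat([pre, po], ignore_index=True)
print(out['Station Type'].isna().sum())  # 0
```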
I need to perform the following operation on a pandas dataframe df inside a for loop with 50 iterations or more:
Column 'X' of df has to be merged with column 'X' of df1,
Column 'Y' of df has to be merged with column 'Y' of df2,
Column 'Z' of df has to be merged with column 'Z' of df3,
Column 'W' of df has to be merged with column 'W' of df4.
The columns common to all 5 dataframes (df, df1, df2, df3 and df4) are A, B, C and D.
EDIT
All five dataframes have different shapes: df is the master dataframe with the maximum number of rows, while the other 4 dataframes each have fewer rows than df and differ from one another. So while merging the columns, rows from the dataframes need to be matched first.
Input df
A B C D X Y Z W
1 2 3 4 nan nan nan nan
2 3 4 5 nan nan nan nan
5 9 7 8 nan nan nan nan
4 8 6 3 nan nan nan nan
df1
A B C D X Y Z W
2 3 4 5 100 nan nan nan
4 8 6 3 200 nan nan nan
df2
A B C D X Y Z W
1 2 3 4 nan 50 nan nan
df3
A B C D X Y Z W
1 2 3 4 nan nan 1000 nan
4 8 6 3 nan nan 2000 nan
df4
A B C D X Y Z W
2 3 4 5 nan nan nan 25
5 9 7 8 nan nan nan 35
4 8 6 3 nan nan nan 45
Output df
A B C D X Y Z W
1 2 3 4 nan 50 1000 nan
2 3 4 5 100 nan nan 25
5 9 7 8 nan nan nan 35
4 8 6 3 200 nan 2000 45
What is the most efficient and fastest way to achieve this? I tried using 4 separate combine_first statements, but that doesn't seem to be the most efficient approach.
Can this be done with just 1 line of code instead?
Any help will be appreciated. Many thanks in advance.
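One possible single-pass approach, sketched here under the assumption that rows should be matched on the shared columns A-D, is to stack all five frames and keep the first non-NaN value per key group (the frames below are rebuilt from the question's example):

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame({'A': [1, 2, 5, 4], 'B': [2, 3, 9, 8],
                   'C': [3, 4, 7, 6], 'D': [4, 5, 8, 3],
                   'X': nan, 'Y': nan, 'Z': nan, 'W': nan})
df1 = pd.DataFrame({'A': [2, 4], 'B': [3, 8], 'C': [4, 6], 'D': [5, 3],
                    'X': [100, 200], 'Y': nan, 'Z': nan, 'W': nan})
df2 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': [4],
                    'X': nan, 'Y': [50], 'Z': nan, 'W': nan})
df3 = pd.DataFrame({'A': [1, 4], 'B': [2, 8], 'C': [3, 6], 'D': [4, 3],
                    'X': nan, 'Y': nan, 'Z': [1000, 2000], 'W': nan})
df4 = pd.DataFrame({'A': [2, 5, 4], 'B': [3, 9, 8], 'C': [4, 7, 6],
                    'D': [5, 8, 3], 'X': nan, 'Y': nan, 'Z': nan,
                    'W': [25, 35, 45]})

keys = ['A', 'B', 'C', 'D']
out = (pd.concat([d.set_index(keys) for d in (df, df1, df2, df3, df4)])
         .groupby(level=keys, sort=False)  # sort=False keeps df's row order
         .first()                          # first non-NaN value per column
         .reset_index())
print(out)
```

Listing df first in the concat makes its row order win under sort=False; whether this beats chained combine_first calls at scale is worth profiling on the real data.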
How can I replace specific row-wise duplicate cells in selected columns without dropping rows (preferably without looping through the rows)?
Basically, I want to keep the first value and replace the remaining duplicates in a row with NAN.
For example:
df_example = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['a', 'f', 'c'], 'C': [1, 2, 3]})
df_example.head()
Original:
A B C
0 a a 1
1 b f 2
2 c c 3
Expected output:
A B C
0 a nan 1
1 b f 2
2 c nan 3
A bit more complicated example is as follows:
Original:
A B C D
0 a 1 a 1
1 b 2 f 5
2 c 3 c 3
Expected output:
A B C D
0 a 1 nan nan
1 b 2 f 5
2 c 3 nan nan
Use DataFrame.mask with Series.duplicated applied per row via DataFrame.apply:
df_example = df_example.mask(df_example.apply(lambda x: x.duplicated(), axis=1))
print (df_example)
A B C
0 a NaN 1
1 b f 2
2 c NaN 3
With new data:
df_example = df_example.mask(df_example.apply(lambda x: x.duplicated(), axis=1))
print (df_example)
A B C D
0 a 1 NaN NaN
1 b 2 f 5.0
2 c 3 NaN NaN
I have a pandas dataframe df as shown.
col1 col2
0 NaN a
1 2 b
2 NaN c
3 NaN d
4 5 e
5 6 f
I want to find the first NaN value in col1 and assign a new value to it. I've tried both of the following, but neither works:
df.loc[df['col1'].isna(), 'col1'][0] = 1
df.loc[df['col1'].isna(), 'col1'].iloc[0] = 1
Neither raises an error or a warning, but when I check the original dataframe, the value hasn't changed.
What is the correct way to do this?
You can use .fillna() with the limit=1 parameter; assign the result back rather than using inplace=True, which on a selected column can silently fail to update the original dataframe:
df['col1'] = df['col1'].fillna(1, limit=1)
print(df)
Prints:
col1 col2
0 1.0 a
1 2.0 b
2 NaN c
3 NaN d
4 5.0 e
5 6.0 f
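An equivalent single-cell assignment (assuming at least one NaN exists) finds the first NaN's label with idxmax and writes through .loc in one step, which avoids the chained-indexing problem in the question:

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame({'col1': [nan, 2, nan, nan, 5, 6],
                   'col2': list('abcdef')})

# idxmax on the boolean mask returns the label of the first True (first NaN)
df.loc[df['col1'].isna().idxmax(), 'col1'] = 1
print(df['col1'].tolist())
# [1.0, 2.0, nan, nan, 5.0, 6.0]
```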
I have a dataframe like below:
Text Label
a NaN
b NaN
c NaN
1 NaN
2 NaN
b NaN
c NaN
a NaN
b NaN
c NaN
Whenever the pattern "a,b,c" occurs downwards I want to label that part as a string such as 'Check'. Final dataframe should look like this:
Text Label
a Check
b Check
c Check
1 NaN
2 NaN
b NaN
c NaN
a Check
b Check
c Check
What is the best way to do this? Thank you =)
Here's a NumPy based approach leveraging broadcasting:
import numpy as np
w = df.Text.cumsum().str[-3:].eq('abc') # inefficient for large dfs
m = (w[w].index.values[:,None] + np.arange(-2,1)).ravel()
df.loc[m, 'Label'] = 'Check'
Text Label
0 a Check
1 b Check
2 c Check
3 1 NaN
4 2 NaN
5 b NaN
6 c NaN
7 a Check
8 b Check
9 c Check
Use this solution with numpy.where for a general solution:
arr = df['Text'].to_numpy()
pat = list('abc')
N = len(pat)

def rolling_window(a, window):
    # overlapping length-`window` views over a 1-D array
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]
d = [i for x in c for i in range(x, x + N)]
df['label'] = np.where(np.isin(np.arange(len(arr)), d), 'Check', np.nan)
print(df)
print (df)
Text Label label
0 a NaN Check
1 b NaN Check
2 c NaN Check
3 1 NaN nan
4 2 NaN nan
5 b NaN nan
6 c NaN nan
7 a NaN Check
8 b NaN Check
9 c NaN Check
Good old shift and bfill work as well (for a small number of steps):
s = df.Text.eq('c') & df.Text.shift().eq('b') & df.Text.shift(2).eq('a')
df.loc[s, 'Label'] = 'Check'
df['Label'] = df['Label'].bfill(limit=2)
Output:
Text Label
0 a Check
1 b Check
2 c Check
3 1 NaN
4 2 NaN
5 b NaN
6 c NaN
7 a Check
8 b Check
9 c Check
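The same shift-and-mask idea can be generalized to an arbitrary pattern with functools.reduce, so the number of shift calls follows the pattern length; a sketch under the assumption that Text holds strings:

```python
from functools import reduce

import numpy as np
import pandas as pd

df = pd.DataFrame({'Text': list('abc12bcabc'), 'Label': np.nan})

pat = list('abc')
# True where the pattern *ends*: shift(0) must match the last element,
# shift(1) the one before it, and so on
end = reduce(lambda acc, t: acc & df.Text.shift(t[0]).eq(t[1]),
             enumerate(reversed(pat)),
             pd.Series(True, index=df.index))

df.loc[end, 'Label'] = 'Check'
# back-fill the label onto the len(pat) - 1 rows preceding each match
df['Label'] = df['Label'].bfill(limit=len(pat) - 1)
print(df)
```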