Concatenating selected columns from two data frames in python pandas

Concatenating selected columns from two data frames in python pandas - python

I am trying to concatenate some of the columns in my data frame in python pandas. Say, I have the following data frames:
df1['Head','Body','feat1','feat2']
df2['Head','Body','feat3','feat4']
I want to merge the dataframes into:
merged_df['Head','Body','feat1','feat2','feat3',feat4']
Intuitively, I did this:
merged_df = pd.concat([df1, df2['feat3','feat4'],axis=1)
It did not work. I did my research and did this:
merged_df =
df1[['Head','Body','feat1','feat2']].merge(df2[['Head','feat3','feat4']],
on='Head', how='left')
It worked but caused some discrepancies on my data. Turns out some of my 'Head' data are not unique. So now I am just looking for the most straight forward way to concatenate the selected columns from DF2 into my DF1. Note that both data frames follow the same order, so the row 1 in DF1 is directly related to row 1 in DF2, so is the row 8120th and so on..
Thanks

taking an example, lets suppose we have two DataFrame's as df1 and df2, so, if the values are of the columns are same or unique across then you simple do merge which will align the columns as you desired.
$ df1
Head Body feat1 feat2
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
$ df2
Head Body feat3 feat4
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
Step 1 solution:
>>> pd.merge(df1, df2, on=['Head', 'Body'])
Head Body feat1 feat2 feat3 feat4
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
Secondly, if you have the columns values are different as follows then you can use pd.concat or pd.merge:
$ df1
Head Body feat1 feat2
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
$ df2
Head Body feat3 feat4
0 4 1 1 1
1 5 2 2 2
2 6 3 3 3
Step 2 solution:
If you want to use union of keys from both frames, then you can do it both with concat and merge as follows:
>>> pd.concat([df1,df2], join="outer", sort=False)
Head Body feat1 feat2 feat3 feat4
0 1 1 1.0 1.0 NaN NaN
1 2 2 2.0 2.0 NaN NaN
2 3 3 3.0 3.0 NaN NaN
0 4 1 NaN NaN 1.0 1.0
1 5 2 NaN NaN 2.0 2.0
2 6 3 NaN NaN 3.0 3.0
>>> pd.merge(df1, df2, on=['Head', 'Body'], how='outer')
Head Body feat1 feat2 feat3 feat4
0 1 1 1.0 1.0 NaN NaN
1 2 2 2.0 2.0 NaN NaN
2 3 3 3.0 3.0 NaN NaN
3 4 1 NaN NaN 1.0 1.0
4 5 2 NaN NaN 2.0 2.0
5 6 3 NaN NaN 3.0 3.0
Or you can opt to have :
a) if you want to use keys from left frame
pd.merge(df1, df2, on=['Head', 'Body'], how='left')
b) if you want to use keys from right frame
pd.merge(df1, df2, on=['Head', 'Body'], how='right')
Default it takes 'inner'.
inner: use intersection of keys from both frames, similar to a SQL
inner join; preserve the order of the left keys
You Can see DataFrame.merge for detail options..
After looking at your workaround, you want to use the keys from left frame
>>> pd.merge(df1, df2, on=['Head', 'Body'], how='left')
Head Body feat1 feat2 feat3 feat4
0 1 1 1 1 NaN NaN
1 2 2 2 2 NaN NaN
2 3 3 3 3 NaN NaN

I think you need value assign , and it will ignore index
df1['feat3']=df2['feat3'].values
df1['feat4']=df2['feat4'].values

Related

fill nan values with values from another row with common values in two or more columns [duplicate]

I am trying to impute/fill values using rows with similar columns' values.
For example, I have this dataframe:
one | two | three
1 1 10
1 1 nan
1 1 nan
1 2 nan
1 2 20
1 2 nan
1 3 nan
1 3 nan
I wanted to using the keys of column one and two which is similar and if column three is not entirely nan then impute the existing value from a row of similar keys with value in column '3'.
Here is my desired result:
one | two | three
1 1 10
1 1 10
1 1 10
1 2 20
1 2 20
1 2 20
1 3 nan
1 3 nan
You can see that keys 1 and 3 do not contain any value because the existing value does not exists.
I have tried using groupby+fillna():
df['three'] = df.groupby(['one','two'])['three'].fillna()
which gave me an error.
I have tried forward fill which give me rather strange result where it forward fill the column 2 instead. I am using this code for forward fill.
df['three'] = df.groupby(['one','two'], sort=False)['three'].ffill()

If only one non NaN value per group use ffill (forward filling) and bfill (backward filling) per group, so need apply with lambda:
df['three'] = df.groupby(['one','two'], sort=False)['three']
.apply(lambda x: x.ffill().bfill())
print (df)
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
But if multiple value per group and need replace NaN by some constant - e.g. mean by group:
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
df['three'] = df.groupby(['one','two'], sort=False)['three']
.apply(lambda x: x.fillna(x.mean()))
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 25.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN

You can sort data by the column with missing values then groupby and forwardfill:
df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one','two'])['three'].ffill()

Replace NaN of rows using data from another rows [duplicate]

I am trying to impute/fill values using rows with similar columns' values.
For example, I have this dataframe:
one | two | three
1 1 10
1 1 nan
1 1 nan
1 2 nan
1 2 20
1 2 nan
1 3 nan
1 3 nan
I wanted to using the keys of column one and two which is similar and if column three is not entirely nan then impute the existing value from a row of similar keys with value in column '3'.
Here is my desired result:
one | two | three
1 1 10
1 1 10
1 1 10
1 2 20
1 2 20
1 2 20
1 3 nan
1 3 nan
You can see that keys 1 and 3 do not contain any value because the existing value does not exists.
I have tried using groupby+fillna():
df['three'] = df.groupby(['one','two'])['three'].fillna()
which gave me an error.
I have tried forward fill which give me rather strange result where it forward fill the column 2 instead. I am using this code for forward fill.
df['three'] = df.groupby(['one','two'], sort=False)['three'].ffill()

If only one non NaN value per group use ffill (forward filling) and bfill (backward filling) per group, so need apply with lambda:
df['three'] = df.groupby(['one','two'], sort=False)['three']
.apply(lambda x: x.ffill().bfill())
print (df)
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
But if multiple value per group and need replace NaN by some constant - e.g. mean by group:
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
df['three'] = df.groupby(['one','two'], sort=False)['three']
.apply(lambda x: x.fillna(x.mean()))
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 25.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN

You can sort data by the column with missing values then groupby and forwardfill:
df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one','two'])['three'].ffill()

Pandas: How to replace values of Nan in column based on another column?

Given that, i have a dataset as below:
dict = {
"A": [math.nan,math.nan,1,math.nan,2,math.nan,3,5],
"B": np.random.randint(1,5,size=8)
}
dt = pd.DataFrame(dict)
My favorite output is, if the in column A we have an Nan then multiply the value of the column B in the same row and replace it with Nan. So, given that, the below is my dataset:
A B
NaN 1
NaN 1
1.0 3
NaN 2
2.0 3
NaN 1
3.0 1
5.0 3
My favorite output is:
A B
2 1
2 1
1 3
4 2
2 3
2 1
3 1
5 3
My current solution is as below which does not work:
dt[pd.isna(dt["A"])]["A"] = dt[pd.isna(dt["A"])]["B"].apply( lambda x:2*x )
print(dt)

In your case with fillna
df.A.fillna(df.B*2, inplace=True)
df
A B
0 2.0 1
1 2.0 1
2 1.0 3
3 4.0 2
4 2.0 3
5 2.0 1
6 3.0 1
7 5.0 3

How to drop a row in one dataframe if missing value in another dataframe?

I have two DataFrames (example below). I would like to delete any row in df1 with a value equal to df2[patnum] if df2[city] is 'nan'.
For example: I would want to drop rows 2 and 3 in df1 since they contain '4' and patnum '4' in df2 has a missing value in df2['city'].
How would I do this?
df1
Citer Citee
0 1 2
1 2 4
2 3 5
3 4 7
df2
Patnum City
0 1 new york
1 2 amsterdam
2 3 copenhagen
3 4 nan
4 5 sydney
expected result:
df1
Citer Citee
0 1 2
1 3 5

IIUC stack isin and dropna
the idea is to return a True/False boolean based on matches then drop those rows after we unstack the dataframe.
val = df2[df2['City'].isna()]['Patnum'].values
df3 = df1.stack()[~df1.stack().isin(val)].unstack().dropna(how="any")
Citer Citee
0 1.0 2.0
2 3.0 5.0
Details
df1.stack()[~df1.stack().isin(val)]
0 Citer 1
Citee 2
1 Citer 2
2 Citer 3
Citee 5
3 Citee 7
dtype: int64
print(df1.stack()[~df1.stack().isin(val)].unstack())
Citer Citee
0 1.0 2.0
1 2.0 NaN
2 3.0 5.0
3 NaN 7.0

append two data frames with unequal columns

I am trying to append two dataframes in pandas which have two different no of columns.
Example:
df1
A B
1 1
2 2
3 3
df2
A
4
5
Expected concatenated dataframe
df
A B
1 1
2 2
3 3
4 Null(or)0
5 Null(or)0
I am using
df1.append(df2) when the columns are same. But no idea how to deal with unequal no of columns.

How about pd.concat?
>>> pd.concat([df1,df2])
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
Also, df1.append(df2) still works:
>>> df1.append(df2)
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
From the docs of df.append:
Columns not in this frame are added as new columns.

Use the concat to join two columns and pass the additional argument ignore_index=True to reset the index other wise you might end with indexes as 0 1 2 0 1. For additional information refer docs here:
df1 = pd.DataFrame({'A':[1,2,3], 'B':[1,2,3]})
df2 = pd.DataFrame({'A':[4,5]})
df = pd.concat([df1,df2],ignore_index=True)
df
Output:
without ignore_index = True :
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
with ignore_index = True :
A B
0 1 1.0
1 2 2.0
2 3 3.0
3 4 NaN
4 5 NaN

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Concatenating selected columns from two data frames in python pandas - python

I think you need value assign , and it will ignore index df1['feat3']=df2['feat3'].values df1['feat4']=df2['feat4'].values

Related

fill nan values with values from another row with common values in two or more columns [duplicate]

Replace NaN of rows using data from another rows [duplicate]

Pandas: How to replace values of Nan in column based on another column?

How to drop a row in one dataframe if missing value in another dataframe?

append two data frames with unequal columns

Categories

Resources