I am trying to concatenate some of the columns in my data frame in python pandas. Say, I have the following data frames:
df1['Head','Body','feat1','feat2']
df2['Head','Body','feat3','feat4']
I want to merge the dataframes into:
merged_df['Head','Body','feat1','feat2','feat3',feat4']
Intuitively, I did this:
merged_df = pd.concat([df1, df2['feat3','feat4'],axis=1)
It did not work. I did my research and did this:
merged_df =
df1[['Head','Body','feat1','feat2']].merge(df2[['Head','feat3','feat4']],
on='Head', how='left')
It worked but caused some discrepancies on my data. Turns out some of my 'Head' data are not unique. So now I am just looking for the most straight forward way to concatenate the selected columns from DF2 into my DF1. Note that both data frames follow the same order, so the row 1 in DF1 is directly related to row 1 in DF2, so is the row 8120th and so on..
Thanks
taking an example, lets suppose we have two DataFrame's as df1 and df2, so, if the values are of the columns are same or unique across then you simple do merge which will align the columns as you desired.
$ df1
Head Body feat1 feat2
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
$ df2
Head Body feat3 feat4
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
Step 1 solution:
>>> pd.merge(df1, df2, on=['Head', 'Body'])
Head Body feat1 feat2 feat3 feat4
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
Secondly, if you have the columns values are different as follows then you can use pd.concat or pd.merge:
$ df1
Head Body feat1 feat2
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
$ df2
Head Body feat3 feat4
0 4 1 1 1
1 5 2 2 2
2 6 3 3 3
Step 2 solution:
If you want to use union of keys from both frames, then you can do it both with concat and merge as follows:
>>> pd.concat([df1,df2], join="outer", sort=False)
Head Body feat1 feat2 feat3 feat4
0 1 1 1.0 1.0 NaN NaN
1 2 2 2.0 2.0 NaN NaN
2 3 3 3.0 3.0 NaN NaN
0 4 1 NaN NaN 1.0 1.0
1 5 2 NaN NaN 2.0 2.0
2 6 3 NaN NaN 3.0 3.0
>>> pd.merge(df1, df2, on=['Head', 'Body'], how='outer')
Head Body feat1 feat2 feat3 feat4
0 1 1 1.0 1.0 NaN NaN
1 2 2 2.0 2.0 NaN NaN
2 3 3 3.0 3.0 NaN NaN
3 4 1 NaN NaN 1.0 1.0
4 5 2 NaN NaN 2.0 2.0
5 6 3 NaN NaN 3.0 3.0
Or you can opt to have :
a) if you want to use keys from left frame
pd.merge(df1, df2, on=['Head', 'Body'], how='left')
b) if you want to use keys from right frame
pd.merge(df1, df2, on=['Head', 'Body'], how='right')
Default it takes 'inner'.
inner: use intersection of keys from both frames, similar to a SQL
inner join; preserve the order of the left keys
You Can see DataFrame.merge for detail options..
After looking at your workaround, you want to use the keys from left frame
>>> pd.merge(df1, df2, on=['Head', 'Body'], how='left')
Head Body feat1 feat2 feat3 feat4
0 1 1 1 1 NaN NaN
1 2 2 2 2 NaN NaN
2 3 3 3 3 NaN NaN
I think you need value assign , and it will ignore index
df1['feat3']=df2['feat3'].values
df1['feat4']=df2['feat4'].values
Related
I am trying to impute/fill values using rows with similar columns' values.
For example, I have this dataframe:
one | two | three
1 1 10
1 1 nan
1 1 nan
1 2 nan
1 2 20
1 2 nan
1 3 nan
1 3 nan
I wanted to using the keys of column one and two which is similar and if column three is not entirely nan then impute the existing value from a row of similar keys with value in column '3'.
Here is my desired result:
one | two | three
1 1 10
1 1 10
1 1 10
1 2 20
1 2 20
1 2 20
1 3 nan
1 3 nan
You can see that keys 1 and 3 do not contain any value because the existing value does not exists.
I have tried using groupby+fillna():
df['three'] = df.groupby(['one','two'])['three'].fillna()
which gave me an error.
I have tried forward fill which give me rather strange result where it forward fill the column 2 instead. I am using this code for forward fill.
df['three'] = df.groupby(['one','two'], sort=False)['three'].ffill()
If only one non NaN value per group use ffill (forward filling) and bfill (backward filling) per group, so need apply with lambda:
df['three'] = df.groupby(['one','two'], sort=False)['three']
.apply(lambda x: x.ffill().bfill())
print (df)
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
But if multiple value per group and need replace NaN by some constant - e.g. mean by group:
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
df['three'] = df.groupby(['one','two'], sort=False)['three']
.apply(lambda x: x.fillna(x.mean()))
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 25.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
You can sort data by the column with missing values then groupby and forwardfill:
df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one','two'])['three'].ffill()
I am trying to impute/fill values using rows with similar columns' values.
For example, I have this dataframe:
one | two | three
1 1 10
1 1 nan
1 1 nan
1 2 nan
1 2 20
1 2 nan
1 3 nan
1 3 nan
I wanted to using the keys of column one and two which is similar and if column three is not entirely nan then impute the existing value from a row of similar keys with value in column '3'.
Here is my desired result:
one | two | three
1 1 10
1 1 10
1 1 10
1 2 20
1 2 20
1 2 20
1 3 nan
1 3 nan
You can see that keys 1 and 3 do not contain any value because the existing value does not exists.
I have tried using groupby+fillna():
df['three'] = df.groupby(['one','two'])['three'].fillna()
which gave me an error.
I have tried forward fill which give me rather strange result where it forward fill the column 2 instead. I am using this code for forward fill.
df['three'] = df.groupby(['one','two'], sort=False)['three'].ffill()
If only one non NaN value per group use ffill (forward filling) and bfill (backward filling) per group, so need apply with lambda:
df['three'] = df.groupby(['one','two'], sort=False)['three']
.apply(lambda x: x.ffill().bfill())
print (df)
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
But if multiple value per group and need replace NaN by some constant - e.g. mean by group:
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
df['three'] = df.groupby(['one','two'], sort=False)['three']
.apply(lambda x: x.fillna(x.mean()))
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 25.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
You can sort data by the column with missing values then groupby and forwardfill:
df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one','two'])['three'].ffill()
Given that, i have a dataset as below:
dict = {
"A": [math.nan,math.nan,1,math.nan,2,math.nan,3,5],
"B": np.random.randint(1,5,size=8)
}
dt = pd.DataFrame(dict)
My favorite output is, if the in column A we have an Nan then multiply the value of the column B in the same row and replace it with Nan. So, given that, the below is my dataset:
A B
NaN 1
NaN 1
1.0 3
NaN 2
2.0 3
NaN 1
3.0 1
5.0 3
My favorite output is:
A B
2 1
2 1
1 3
4 2
2 3
2 1
3 1
5 3
My current solution is as below which does not work:
dt[pd.isna(dt["A"])]["A"] = dt[pd.isna(dt["A"])]["B"].apply( lambda x:2*x )
print(dt)
In your case with fillna
df.A.fillna(df.B*2, inplace=True)
df
A B
0 2.0 1
1 2.0 1
2 1.0 3
3 4.0 2
4 2.0 3
5 2.0 1
6 3.0 1
7 5.0 3
I have two DataFrames (example below). I would like to delete any row in df1 with a value equal to df2[patnum] if df2[city] is 'nan'.
For example: I would want to drop rows 2 and 3 in df1 since they contain '4' and patnum '4' in df2 has a missing value in df2['city'].
How would I do this?
df1
Citer Citee
0 1 2
1 2 4
2 3 5
3 4 7
df2
Patnum City
0 1 new york
1 2 amsterdam
2 3 copenhagen
3 4 nan
4 5 sydney
expected result:
df1
Citer Citee
0 1 2
1 3 5
IIUC stack isin and dropna
the idea is to return a True/False boolean based on matches then drop those rows after we unstack the dataframe.
val = df2[df2['City'].isna()]['Patnum'].values
df3 = df1.stack()[~df1.stack().isin(val)].unstack().dropna(how="any")
Citer Citee
0 1.0 2.0
2 3.0 5.0
Details
df1.stack()[~df1.stack().isin(val)]
0 Citer 1
Citee 2
1 Citer 2
2 Citer 3
Citee 5
3 Citee 7
dtype: int64
print(df1.stack()[~df1.stack().isin(val)].unstack())
Citer Citee
0 1.0 2.0
1 2.0 NaN
2 3.0 5.0
3 NaN 7.0
I am trying to append two dataframes in pandas which have two different no of columns.
Example:
df1
A B
1 1
2 2
3 3
df2
A
4
5
Expected concatenated dataframe
df
A B
1 1
2 2
3 3
4 Null(or)0
5 Null(or)0
I am using
df1.append(df2) when the columns are same. But no idea how to deal with unequal no of columns.
How about pd.concat?
>>> pd.concat([df1,df2])
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
Also, df1.append(df2) still works:
>>> df1.append(df2)
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
From the docs of df.append:
Columns not in this frame are added as new columns.
Use the concat to join two columns and pass the additional argument ignore_index=True to reset the index other wise you might end with indexes as 0 1 2 0 1. For additional information refer docs here:
df1 = pd.DataFrame({'A':[1,2,3], 'B':[1,2,3]})
df2 = pd.DataFrame({'A':[4,5]})
df = pd.concat([df1,df2],ignore_index=True)
df
Output:
without ignore_index = True :
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
with ignore_index = True :
A B
0 1 1.0
1 2 2.0
2 3 3.0
3 4 NaN
4 5 NaN