append two data frames with unequal columns - python

I am trying to append two DataFrames in pandas that have different numbers of columns.
Example:
df1
A B
1 1
2 2
3 3
df2
A
4
5
Expected concatenated dataframe
df
A B
1 1
2 2
3 3
4 NaN (or 0)
5 NaN (or 0)
I am using df1.append(df2) when the columns are the same, but I have no idea how to deal with an unequal number of columns.

How about pd.concat?
>>> pd.concat([df1,df2])
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
Also, df1.append(df2) still works:
>>> df1.append(df2)
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
From the docs of df.append:
Columns not in this frame are added as new columns.
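Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is the forward-compatible spelling. A minimal sketch on the question's data, including the "(or) 0" variant via fillna:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]})
df2 = pd.DataFrame({'A': [4, 5]})

out = pd.concat([df1, df2])                                   # missing column B becomes NaN
filled = pd.concat([df1, df2], ignore_index=True).fillna(0)   # or fill with 0 instead
```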

Use pd.concat to join the two DataFrames and pass the additional argument ignore_index=True to reset the index; otherwise you might end up with indexes like 0 1 2 0 1. See the pd.concat docs for details:
df1 = pd.DataFrame({'A':[1,2,3], 'B':[1,2,3]})
df2 = pd.DataFrame({'A':[4,5]})
df = pd.concat([df1,df2],ignore_index=True)
df
Output:
without ignore_index = True :
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
with ignore_index = True :
A B
0 1 1.0
1 2 2.0
2 3 3.0
3 4 NaN
4 5 NaN


How to drop a row in one dataframe if missing value in another dataframe?

I have two DataFrames (example below). I would like to delete any row in df1 containing a value equal to a df2['Patnum'] whose df2['City'] is NaN.
For example: I would want to drop rows 1 and 3 in df1, since they contain '4' and Patnum 4 in df2 has a missing value in df2['City'].
How would I do this?
df1
Citer Citee
0 1 2
1 2 4
2 3 5
3 4 7
df2
Patnum City
0 1 new york
1 2 amsterdam
2 3 copenhagen
3 4 nan
4 5 sydney
expected result:
df1
Citer Citee
0 1 2
1 3 5
IIUC, use stack, isin and dropna.
The idea is to build a True/False boolean mask based on matches, then drop those rows after we unstack the DataFrame.
val = df2[df2['City'].isna()]['Patnum'].values
df3 = df1.stack()[~df1.stack().isin(val)].unstack().dropna(how="any")
Citer Citee
0 1.0 2.0
2 3.0 5.0
Details
df1.stack()[~df1.stack().isin(val)]
0 Citer 1
Citee 2
1 Citer 2
2 Citer 3
Citee 5
3 Citee 7
dtype: int64
print(df1.stack()[~df1.stack().isin(val)].unstack())
Citer Citee
0 1.0 2.0
1 2.0 NaN
2 3.0 5.0
3 NaN 7.0
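An alternative sketch that avoids the float cast introduced by stack/unstack: collect the Patnum values with a missing City into a plain list, then drop any df1 row containing one of them (assuming the frames are as shown in the question):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Citer': [1, 2, 3, 4], 'Citee': [2, 4, 5, 7]})
df2 = pd.DataFrame({'Patnum': [1, 2, 3, 4, 5],
                    'City': ['new york', 'amsterdam', 'copenhagen',
                             np.nan, 'sydney']})

# Patnum values whose City is missing (a list, so isin matches by value, not index)
val = df2.loc[df2['City'].isna(), 'Patnum'].tolist()

# keep rows of df1 that contain none of those values in any column
out = df1[~df1.isin(val).any(axis=1)]
```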

Pandas - combine columns and put one after another?

I have the following dataframe:
a1,a2,b1,b2
1,2,3,4
2,3,4,5
3,4,5,6
The desirable output is:
a,b
1,3
2,4
3,5
2,4
3,5
4,6
There are a lot of "a"- and "b"-named headers in the dataframe; the maximum is a50 and b50. So I am looking for a way to combine them all into just "a" and "b".
I think it's possible to do with concat, but I have no idea how to combine it all, putting all the values under each other. I'll be grateful for any ideas.
You can use pd.wide_to_long:
pd.wide_to_long(df.reset_index(), ['a','b'], 'index', 'No').reset_index()[['a','b']]
Output:
a b
0 1 3
1 2 4
2 3 5
3 2 4
4 3 5
5 4 6
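An alternative without wide_to_long: since the columns come in numbered families, one can simply concatenate each family column by column. A sketch, assuming the suffixes run 1..n with no gaps:

```python
import pandas as pd

df = pd.DataFrame({'a1': [1, 2, 3], 'a2': [2, 3, 4],
                   'b1': [3, 4, 5], 'b2': [4, 5, 6]})

n = 2  # highest suffix; would be 50 in the real data
out = pd.DataFrame({
    letter: pd.concat([df[f'{letter}{i}'] for i in range(1, n + 1)],
                      ignore_index=True)
    for letter in ['a', 'b']
})
```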
First we read the dataframe:
import pandas as pd
from io import StringIO
s = """a1,a2,b1,b2
1,2,3,4
2,3,4,5
3,4,5,6"""
df = pd.read_csv(StringIO(s), sep=',')
Then we stack the columns, and separate the number of the columns from the letter 'a' or 'b':
stacked = df.stack().rename("val").reset_index(1).reset_index()
cols_numbers = pd.DataFrame(stacked
                            .level_1
                            .str.split(r'(\d+)')
                            .apply(lambda l: l[:2])
                            .tolist(),
                            columns=["col", "num"])
x = cols_numbers.join(stacked[['val', 'index']])
print(x)
col num val index
0 a 1 1 0
1 a 2 2 0
2 b 1 3 0
3 b 2 4 0
4 a 1 2 1
5 a 2 3 1
6 b 1 4 1
7 b 2 5 1
8 a 1 3 2
9 a 2 4 2
10 b 1 5 2
11 b 2 6 2
Finally, we group by index and num to get two columns a and b, and we fill the first row of the b column with the second value, to get what was expected:
result = (x
          .set_index("col", append=True)
          .groupby(["index", "num"])
          .val
          .apply(lambda g: g.unstack().bfill().head(1))
          .reset_index(-1, drop=True))
print(result)
col a b
index num
0 1 1.0 3.0
2 2.0 4.0
1 1 2.0 4.0
2 3.0 5.0
2 1 3.0 5.0
2 4.0 6.0
To get rid of the multiindex at the end: result.reset_index(drop=True)

Keep all cells above given value in pandas DataFrame

I would like to discard all cells that contain a value below a given value. So not only whole rows or whole columns, but individual cells.
I tried the code below, which should keep only values of at least 3 in each cell, but it doesn't work.
df[(df >= 3).any(axis=1)]
Example
import pandas as pd
my_dict = {'A':[1,5,6,2],'B':[9,9,1,2],'C':[1,1,3,5]}
df = pd.DataFrame(my_dict)
df
A B C
0 1 9 1
1 5 9 1
2 6 1 3
3 2 2 5
I want to keep only the cells that are at least 3.
If you want "all values in each cell should be at least 3"
df[df < 3] = 3
df
A B C
0 3 9 3
1 5 9 3
2 6 3 3
3 3 3 5
If you want "to keep only the cells that are at least 3" (applied to the original df):
df = df[df >= 3]
df
A B C
0 NaN 9.0 NaN
1 5.0 9.0 NaN
2 6.0 NaN 3.0
3 NaN NaN 5.0
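Equivalently, df.where keeps values where the condition holds and fills the rest with NaN; a sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 6, 2], 'B': [9, 9, 1, 2], 'C': [1, 1, 3, 5]})
masked = df.where(df >= 3)   # same result as df[df >= 3]
```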
You can check whether each value is >= 3, then drop all rows containing a NaN value:
df[df >= 3].dropna()
DEMO:
import pandas as pd
my_dict = {'A':[1,5,6,3],'B':[9,9,1,3],'C':[1,1,3,5]}
df = pd.DataFrame(my_dict)
df
A B C
0 1 9 1
1 5 9 1
2 6 1 3
3 3 3 5
df = df[df >= 3].dropna().reset_index(drop=True)
df
A B C
0 3.0 3.0 5.0

Concatenating selected columns from two data frames in python pandas

I am trying to concatenate some of the columns in my data frame in python pandas. Say, I have the following data frames:
df1['Head','Body','feat1','feat2']
df2['Head','Body','feat3','feat4']
I want to merge the dataframes into:
merged_df['Head','Body','feat1','feat2','feat3',feat4']
Intuitively, I did this:
merged_df = pd.concat([df1, df2['feat3','feat4'],axis=1)
It did not work. I did my research and did this:
merged_df = df1[['Head','Body','feat1','feat2']].merge(
    df2[['Head','feat3','feat4']], on='Head', how='left')
It worked but caused some discrepancies in my data. Turns out some of my 'Head' values are not unique. So now I am just looking for the most straightforward way to concatenate the selected columns from DF2 into my DF1. Note that both data frames follow the same order, so row 1 in DF1 is directly related to row 1 in DF2, as is row 8120 and so on.
Thanks
Taking an example, let's suppose we have two DataFrames, df1 and df2. If the key column values are the same (and unique) across both frames, you can simply do a merge, which will align the columns as you desired.
$ df1
Head Body feat1 feat2
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
$ df2
Head Body feat3 feat4
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
Step 1 solution:
>>> pd.merge(df1, df2, on=['Head', 'Body'])
Head Body feat1 feat2 feat3 feat4
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
Secondly, if the key column values differ between the frames, as follows, then you can use pd.concat or pd.merge:
$ df1
Head Body feat1 feat2
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
$ df2
Head Body feat3 feat4
0 4 1 1 1
1 5 2 2 2
2 6 3 3 3
Step 2 solution:
If you want to use union of keys from both frames, then you can do it both with concat and merge as follows:
>>> pd.concat([df1,df2], join="outer", sort=False)
Head Body feat1 feat2 feat3 feat4
0 1 1 1.0 1.0 NaN NaN
1 2 2 2.0 2.0 NaN NaN
2 3 3 3.0 3.0 NaN NaN
0 4 1 NaN NaN 1.0 1.0
1 5 2 NaN NaN 2.0 2.0
2 6 3 NaN NaN 3.0 3.0
>>> pd.merge(df1, df2, on=['Head', 'Body'], how='outer')
Head Body feat1 feat2 feat3 feat4
0 1 1 1.0 1.0 NaN NaN
1 2 2 2.0 2.0 NaN NaN
2 3 3 3.0 3.0 NaN NaN
3 4 1 NaN NaN 1.0 1.0
4 5 2 NaN NaN 2.0 2.0
5 6 3 NaN NaN 3.0 3.0
Alternatively:
a) if you want to use keys from the left frame only:
pd.merge(df1, df2, on=['Head', 'Body'], how='left')
b) if you want to use keys from the right frame only:
pd.merge(df1, df2, on=['Head', 'Body'], how='right')
By default it takes 'inner':
inner: use intersection of keys from both frames, similar to a SQL
inner join; preserve the order of the left keys
See DataFrame.merge for the detailed options.
After looking at your workaround, it seems you want to use the keys from the left frame:
>>> pd.merge(df1, df2, on=['Head', 'Body'], how='left')
Head Body feat1 feat2 feat3 feat4
0 1 1 1 1 NaN NaN
1 2 2 2 2 NaN NaN
2 3 3 3 3 NaN NaN
I think you need value assignment, which ignores the index:
df1['feat3']=df2['feat3'].values
df1['feat4']=df2['feat4'].values
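Since the rows align positionally, a concat along axis=1 also works; a sketch assuming default RangeIndexes (the reset_index calls guard against mismatched indexes, which would otherwise introduce NaNs):

```python
import pandas as pd

df1 = pd.DataFrame({'Head': [1, 2, 3], 'Body': [1, 2, 3],
                    'feat1': [1, 2, 3], 'feat2': [1, 2, 3]})
df2 = pd.DataFrame({'Head': [1, 2, 3], 'Body': [1, 2, 3],
                    'feat3': [1, 2, 3], 'feat4': [1, 2, 3]})

# side-by-side concatenation: row i of df1 is paired with row i of df2
merged_df = pd.concat(
    [df1.reset_index(drop=True),
     df2[['feat3', 'feat4']].reset_index(drop=True)],
    axis=1)
```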

How to drop duplicates from a subset of rows in a pandas dataframe?

I have a dataframe like this:
A B C
12 true 1
12 true 1
3 nan 2
3 nan 3
I would like to drop all rows where the value of column A is duplicate but only if the value of column B is 'true'.
The resulting dataframe I have in mind is:
A B C
12 true 1
3 nan 2
3 nan 3
I tried using: df.loc[df['B']=='true'].drop_duplicates('A', inplace=True, keep='first') but it doesn't seem to work.
Thanks for your help!
You can use pd.concat to split the df by B:
df=pd.concat([df.loc[df.B!=True],df.loc[df.B==True].drop_duplicates(['A'],keep='first')]).sort_index()
df
Out[1593]:
A B C
0 12 True 1
2 3 NaN 2
3 3 NaN 3
Or with a single boolean mask:
df[df.B.ne(True) | ~df.A.duplicated()]
A B C
0 12 True 1
2 3 NaN 2
3 3 NaN 3
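A runnable version of the boolean-mask one-liner, assuming B holds True/NaN as in the Out display: keep a row if B is not True, or if its A value appears for the first time.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [12, 12, 3, 3],
                   'B': [True, True, np.nan, np.nan],
                   'C': [1, 1, 2, 3]})

# rows where B != True are always kept; among the rest, only the first
# occurrence of each A value survives
out = df[df['B'].ne(True) | ~df['A'].duplicated()]
```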
