Combining DataFrames without NaNs - python

I have two DataFrames. One maps values to IDs; the other has multiple entries for each of these IDs. I want a result that combines the second DataFrame with the values from the first assigned to the respective IDs.
df1 =
Val1 Val2 Val3
x 1000 2 0
y 2000 3 9
z 3000 1 8
df2=
foo ID bar
0 something y a
1 nothing y b
2 everything x c
3 who z d
result=
foo ID bar Val1 Val2 Val3
0 something y a 2000 3 9
1 nothing y b 2000 3 9
2 everything x c 1000 2 0
3 who z d 3000 1 8
I've tried merge and join (obviously incorrectly) but I am getting a bunch of NaNs when I do that. It appears that I am getting NaNs on every alternate ID.
I have also tried indexing both DFs by ID but that didn't seem to help either. I am obviously missing something that I am guessing is a core functionality but I can't get my head around it.

merge and join could both get you the result DataFrame you want. Since one of your DataFrames is indexed by ID and the other has just an integer index, merge is the logical choice.
Merge:
# use ID as the column to join on in df2 and the index of df1
result = df2.merge(df1, left_on="ID", right_index=True, how="inner")
Join:
df2.set_index("ID", inplace=True) # index df2 in place so you can use join, which merges by index by default
result = df2.join(df1, how="inner") # join df1 by index
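For reference, here is a minimal, self-contained version of the merge approach built from the sample data in the question (assuming df1 is indexed by the ID labels x, y, z):
import pandas as pd

# df1: one row of values per ID; the ID labels form the index
df1 = pd.DataFrame({"Val1": [1000, 2000, 3000],
                    "Val2": [2, 3, 1],
                    "Val3": [0, 9, 8]},
                   index=["x", "y", "z"])

# df2: multiple rows referencing those IDs in a regular column
df2 = pd.DataFrame({"foo": ["something", "nothing", "everything", "who"],
                    "ID": ["y", "y", "x", "z"],
                    "bar": ["a", "b", "c", "d"]})

# match df2's ID column against df1's index
result = df2.merge(df1, left_on="ID", right_index=True, how="inner")
print(result)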

Related

Creating a new map from existing maps in python

This question might be common but I am new to python and would like to learn more from the community. I have 2 map files which have data mapping like this:
map1 : A --> B
map2 : B --> C,D,E
I want to create a new map file which will be A --> C
What is the most efficient way to achieve this in python? A generic approach would be very helpful, as I need to apply the same logic to different files and different columns.
Example:
Map1:
1,100
2,453
3,200
Map2:
100,25,30,
200,300,,
250,190,20,1
My map3 should be:
1,25
2,0
3,300
As 453 is not present in map2, our map3 contains value 0 for key 2.
First create DataFrames:
df1 = pd.read_csv(Map1, header=None)
df2 = pd.read_csv(Map2, header=None)
Then use Series.map on the second column, mapping through a Series built from df2 with its first column set as the index; finally replace missing values with 0 for unmatched keys:
df1[1] = df1[1].map(df2.set_index(0)[1]).fillna(0, downcast='int')
print (df1)
0 1
0 1 25
1 2 0
2 3 300
EDIT: to map multiple columns, use a left join, remove all-missing columns with DataFrame.dropna, drop the join columns b and c, and finally replace missing values:
df1.columns=['a','b']
df2.columns=['c','d','e','f']
df = (df1.merge(df2, how='left', left_on='b', right_on='c')
         .dropna(how='all', axis=1)
         .drop(['b','c'], axis=1)
         .fillna(0)
         .convert_dtypes())
print (df)
a d e
0 1 25 30
1 2 0 0
2 3 300 0
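Since the question asks for a new map file rather than just a DataFrame, the single-column result can be written back out (a sketch, assuming the same headerless comma-separated format as the inputs; 'Map3' is a hypothetical output path):
df1.to_csv('Map3', header=False, index=False)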

Match multiple columns on Python to a single value

I hope you are doing well.
I am trying to perform a match based on multiple columns, where the values of Column B of df1 are scattered across three to four columns in df2. The goal is to return the value of Column A of df2 if a value of Column B matches any value in columns C, D, E.
What I have done until now is multiple left merges, renaming Column B each time to match the name of columns C, D, E of df2.
I am trying to simplify the process but I am unsure how I am supposed to do this.
My dataset looks like that:
Df1:
ID
0 77
1 4859
2 LSP
DF2:
X id1 id2 id3
0 AAAAA_XX 889 77 BSP
1 BBBBB_XX 4859 CC 998P
2 CCCC_YY YUI TYU LSP
My goal is to have in df1:
ID X
0 77 AAAAA_XX
1 4859 BBBBB_XX
2 LSP CCCC_YY
Thank you very much !
You can first gather the values from all the id columns into one with pd.concat,
then merge the tables like this:
df3 = pd.concat([df2.id1, df2.id2, df2.id3]).reset_index()  # id3 must be included, or LSP will not match
df1 = df2.merge(df3, how="left", left_on=df1.ID, right_on=df3[0])
df1 = df1.iloc[:, :2]
df1 = df1.rename(columns={"key_0": "ID"})
Not the most beautiful code in the world, but it works.
output:
ID X
0 77 AAAAA_XX
1 4859 BBBBB_XX
2 LSP CCCC_YY
Use DataFrame.merge with DataFrame.melt:
df = df1.merge(df2.melt(id_vars='X', value_name='ID').drop('variable', axis=1),
               how='left',
               on='ID')
print (df)
ID X
0 77 AAAAA_XX
1 4859 BBBBB_XX
2 LSP CCCC_YY
If duplicated ID values are possible, use:
df = (df1.merge(df2.melt(id_vars='X', value_name='ID')
                   .drop('variable', axis=1)
                   .drop_duplicates('ID'),
                how='left',
                on='ID'))
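To see why this works, here is the intermediate result of melt on the sample data: it stacks the three id columns into a single ID column keyed by X, which can then be merged on directly:
print(df2.melt(id_vars='X', value_name='ID').drop('variable', axis=1))
          X    ID
0  AAAAA_XX   889
1  BBBBB_XX  4859
2   CCCC_YY   YUI
3  AAAAA_XX    77
4  BBBBB_XX    CC
5   CCCC_YY   TYU
6  AAAAA_XX   BSP
7  BBBBB_XX  998P
8   CCCC_YY   LSP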

Dataframe becomes larger than it should be after join operation in pandas

I have an excel dataframe which I am trying to populate with fields from other excel file like so:
df = pd.read_excel("file1.xlsx")
df_new = df.join(conv.set_index('id'), on='id', how='inner')
df_new['PersonalN'] = df_new['PersonalN'].apply(lambda x: "" if x==0 else x) # if id==0, its same as nan
df_new = df_new.dropna() # drop nan
df_new['PersonalN'] = df_new['PersonalN'].apply(lambda x: str(int(x))) # convert id to string
df_new = df_new.drop_duplicates() # drop duplicates, if any
It is clear that df_new should be a subset of df. However, when I run the following code:
len(df[df['id'].isin(df_new['id'].values)]) # length of this should be same as len(df_new)
len(df_new)
I get different results (there are 6 more rows in df_new than in df). How can that be? I have checked all dataframes for duplicates and none of them contain any. Interestingly, the following code does give the expected results:
len(df_new[df_new['id'].isin(df['id'].values)])
len(df_new)
These both print the same numbers.
Edit:
I have also tried the following: others = df[~df['id'].isin(df_new['id'].values)], and checking whether others has the same length as len(df) - len(df_new), but again, the dataframe others has 6 more rows than expected.
The problem comes from your conv dataframe. Assume that the df that comes from file1 is:
id PersonalN
0 1
And conv is
id other_col
0 'abc'
0 'def'
After the join you will get:
id PersonalN other_col
0 1 'abc'
0 1 'def'
The size of df_new is then larger than that of df, and drop_duplicates() or dropna() will not help you reduce the shape of your resulting dataframe.
It's hard to know without the data, but even if there are no duplicates in either of the dataframe, the size of the result of an inner join can be larger than the original dataframe size. Consider the following example:
df1 = pd.DataFrame(range(10), columns=["id_"])
df2 = pd.DataFrame({"id_": list(range(10)) + [1] * 3, "something": range(13)})
df2.drop_duplicates(inplace=True)
print(len(df1), len(df2))
==> 10 13
df_new = df1.join(df2.set_index("id_"), on = "id_")
len(df_new)
==> 13
print(df_new)
id_ something
0 0 0
1 1 1
1 1 10
1 1 11
1 1 12
2 2 2
...
The reason is of course that the ids of the other dataframe are not unique, and a single id in the original dataframe (df1 in my example) is joined to several rows on the other dataframe (df2 in my example, conv in yours).
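If you want to catch this situation up front instead of debugging row counts after the fact, pandas can validate merge cardinality for you (a sketch; validate='one_to_one' makes the merge raise a MergeError when either side has duplicate keys):
# fails fast if 'id' is duplicated in df or conv
df_new = df.merge(conv, on='id', how='inner', validate='one_to_one')
Alternatively, check conv['id'].duplicated().any() before joining.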

Will passing ignore_index=True to pd.concat preserve index succession within dataframes that I'm concatenating?

I have two dataframes:
df1 =
value
0 a
1 b
2 c
df2 =
value
0 d
1 e
I need to concatenate them across index, but I have to preserve the index of the first dataframe and continue it in the second dataframe, like this:
result =
value
0 a
1 b
2 c
3 d
4 e
My guess is that pd.concat([df1, df2], ignore_index=True) will do the job. However, I'm worried that for large dataframes the order of the rows may be changed and I'll end up with something like this (first two rows changed indices):
result =
value
0 b
1 a
2 c
3 d
4 e
So my question is, does the pd.concat with ignore_index=True save the index succession within dataframes that are being concatenated, or there is randomness in the index assignment?
In my experience, pd.concat concatenates the rows in the order the DataFrames are passed to it.
If you want to be safe, specify sort=False which will also avoid sorting on columns:
pd.concat([df1, df2], axis=0, sort=False, ignore_index=True)
value
0 a
1 b
2 c
3 d
4 e
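Another way to convince yourself: ignore_index=True only relabels the stacked result 0..n-1, it never reorders rows, so it is equivalent to concatenating and then resetting the index (a small check using the frames from the question):
import pandas as pd

df1 = pd.DataFrame({"value": list("abc")})
df2 = pd.DataFrame({"value": list("de")})

a = pd.concat([df1, df2], ignore_index=True)
b = pd.concat([df1, df2]).reset_index(drop=True)
assert a.equals(b)  # same rows, same order, fresh 0..4 index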

join two pandas dataframe using a specific column

I am new to pandas and I am trying to join two dataframes based on the equality of one specific column. For example, suppose that I have the following:
df1
A B C
1 2 3
2 2 2
df2
A B C
5 6 7
2 8 9
Both dataframes have the same columns, and only the values of one column (say A) might be equal. What I want as output is this:
df3
A B C B C
2 8 9 2 2
The values for column 'A' are unique in both dataframes.
Thanks
pd.concat([df1.set_index('A'), df2.set_index('A')], axis=1, join='inner')
If you wish to maintain column A as a non-index, then:
pd.concat([df1.set_index('A'), df2.set_index('A')], axis=1, join='inner').reset_index()
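For the sample data this yields (note the duplicated column names):
   A  B  C  B  C
0  2  2  2  8  9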
Alternatively, you could just do:
df3 = df1.merge(df2, on='A', how='inner', suffixes=('_1', '_2'))
And then you can keep track of each value's origin.
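With the sample data, the merge produces (columns from df1 get the _1 suffix, those from df2 get _2):
   A  B_1  C_1  B_2  C_2
0  2    2    2    8    9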
