I have two dataframes, df1:
column1 column2
0 A B
1 A A
2 C A
3 None None
4 None None
and df2
id l
40 100005090 A
188 100020985 B
Now I want to join df1 and df2, but I don't know how to match the indices. If I simply do df1.join(df2), the indices are aligned to df2. That is, it finds the 40th entry of df2 and that is now the first entry of the dataframe that starts at 40 (df1). How do I tell pandas to align indices to df1, meaning that the first entry of df2 is actually index 40? That is, I would like to get:
id l column1 column2
40 100005090 A A B
188 100020985 B A A
...
You can take a slice of df1 that is the same length as df2, overwrite its index values with those of df2, and then join:
In [174]:
sub = df1.iloc[:len(df2)].copy()  # copy so the index assignment does not touch df1
sub.index = df2.index
df2.join(sub)
Out[174]:
id l column1 column2
40 100005090 A A B
188 100020985 B A A
If the dfs are the same length then the first line is not needed, you just overwrite the index with the index values from the other df.
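As a self-contained sketch of the positional alignment described above, the following rebuilds both frames from the values shown in the question (the constructor calls are an assumption about how the data was created) and joins them row for row:

```python
import pandas as pd

# Frames rebuilt from the values printed in the question.
df1 = pd.DataFrame({"column1": ["A", "A", "C", None, None],
                    "column2": ["B", "A", "A", None, None]})
df2 = pd.DataFrame({"id": [100005090, 100020985], "l": ["A", "B"]},
                   index=[40, 188])

# Take the first len(df2) rows of df1 by position, then relabel them
# with df2's index so that join() lines the rows up one to one.
sub = df1.iloc[:len(df2)].copy()
sub.index = df2.index

result = df2.join(sub)
print(result)
```

This only makes sense when row order is the thing that links the two frames; if there is a real key column, joining on that key is usually safer.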
This question might be common but I am new to python and would like to learn more from the community. I have 2 map files which have data mapping like this:
map1 : A --> B
map2 : B --> C,D,E
I want to create a new map file which will be A --> C
What is the most efficient way to achieve this in python? A generic approach would be very helpful as I need to apply the same logic on different files and different columns
Example:
Map1:
1,100
2,453
3,200
Map2:
100,25,30,
200,300,,
250,190,20,1
My map3 should be:
1,25
2,0
3,300
As 453 is not present in map2, our map3 contains value 0 for key 2.
First create DataFrames:
df1 = pd.read_csv(Map1, header=None)
df2 = pd.read_csv(Map2, header=None)
Then use Series.map on the second column of df1, passing a Series built from df2 whose index is its first column, and finally replace the missing values of the unmatched keys with 0:
df1[1] = df1[1].map(df2.set_index(0)[1]).fillna(0).astype(int)
print (df1)
0 1
0 1 25
1 2 0
2 3 300
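Since the question asks for a generic approach, here is a minimal sketch that wraps the lookup in a reusable function. The name compose_maps and its default parameter are hypothetical, and the in-memory StringIO objects stand in for the real map files:

```python
import io
import pandas as pd

def compose_maps(first, second, default=0):
    """Replace the second column of `first` with the value that
    `second` maps it to (first column -> second column of `second`).
    Hypothetical helper; keys missing from `second` become `default`."""
    lookup = second.set_index(second.columns[0])[second.columns[1]]
    out = first.copy()
    col = out.columns[1]
    out[col] = out[col].map(lookup).fillna(default).astype(int)
    return out

# Sample data copied from the question.
map1 = io.StringIO("1,100\n2,453\n3,200\n")
map2 = io.StringIO("100,25,30,\n200,300,,\n250,190,20,1\n")

df1 = pd.read_csv(map1, header=None)
df2 = pd.read_csv(map2, header=None)

map3 = compose_maps(df1, df2)
print(map3)
```

The sketch assumes the keys in the first column of the second file are unique; Series.map cannot look values up through a Series with a duplicated index.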
EDIT: to map multiple columns, use a left join on columns b and c, remove the columns that contain only missing values with DataFrame.dropna, drop the join columns b and c, and finally replace the remaining missing values:
df1.columns=['a','b']
df2.columns=['c','d','e','f']
df = (df1.merge(df2, how='left', left_on='b', right_on='c')
.dropna(how='all', axis=1)
.drop(['b','c'], axis=1)
.fillna(0)
.convert_dtypes())
print (df)
a d e
0 1 25 30
1 2 0 0
2 3 300 0
I hope you are doing well.
I am trying to perform a match based on multiple columns, where the values of Column B of df1 are scattered across three to four columns in df2. The goal is to return the values of Column A of df2 if a value of Column B matches any of the values in columns C, D, E.
What I did until now was actually to do multiple left merges (and changing the name of Column B to match the name of columns C,D,E of df2).
I am trying to simplify the process but I am unsure how I am supposed to do this?
My dataset looks like that:
Df1:
ID
0 77
1 4859
2 LSP
DF2:
X id1 id2 id3
0 AAAAA_XX 889 77 BSP
1 BBBBB_XX 4859 CC 998P
2 CCCC_YY YUI TYU LSP
My goal is to have in df1:
ID X
0 77 AAAAA_XX
1 4859 BBBBB_XX
2 LSP CCCC_YY
Thank you very much !
You can first gather the values of all the id columns into a single column with pd.concat, keeping X alongside each value,
then merge the tables like this:
# stack id1, id2 and id3 into one "ID" column, carrying X along as the index
df3 = pd.concat([df2.set_index("X")[col] for col in ["id1", "id2", "id3"]]).rename("ID").reset_index()
# look up the X value for every ID in df1
df1 = df1.merge(df3, how="left", on="ID")
not the most beautiful code in the world, but it works.
output:
ID X
0 77 AAAAA_XX
1 4859 BBBBB_XX
2 LSP CCCC_YY
Use DataFrame.merge with DataFrame.melt:
df = df1.merge(df2.melt(id_vars='X', value_name='ID').drop('variable', axis=1),
how='left',
on='ID')
print (df)
ID X
0 77 AAAAA_XX
1 4859 BBBBB_XX
2 LSP CCCC_YY
If duplicated ID values are possible, use:
df = (df1.merge(df2.melt(id_vars='X', value_name='ID')
.drop('variable', axis=1)
.drop_duplicates('ID'),
how='left',
on='ID'))
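For a runnable version of the melt-then-merge approach, the sketch below rebuilds the sample data from the question. It assumes every ID is stored as a string, since the columns mix numeric-looking and alphabetic values; if the real data mixes integer and string types, the keys would need to be cast to a common type first.

```python
import pandas as pd

# Sample data from the question, with all IDs held as strings (assumption).
df1 = pd.DataFrame({"ID": ["77", "4859", "LSP"]})
df2 = pd.DataFrame({"X": ["AAAAA_XX", "BBBBB_XX", "CCCC_YY"],
                    "id1": ["889", "4859", "YUI"],
                    "id2": ["77", "CC", "TYU"],
                    "id3": ["BSP", "998P", "LSP"]})

# Reshape df2 to long form: one row per (X, ID) pair.
long_df2 = df2.melt(id_vars="X", value_name="ID").drop(columns="variable")

# A single left merge now replaces the three separate merges.
result = df1.merge(long_df2, how="left", on="ID")
print(result)
```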
I have an excel dataframe which I am trying to populate with fields from other excel file like so:
df = pd.read_excel("file1.xlsx")
df_new = df.join(conv.set_index('id'), on='id', how='inner')
df_new['PersonalN'] = df_new['PersonalN'].apply(lambda x: "" if x==0 else x) # if id==0, its same as nan
df_new = df_new.dropna() # drop nan
df_new['PersonalN'] = df_new['PersonalN'].apply(lambda x: str(int(x))) # convert id to string
df_new = df_new.drop_duplicates() # drop duplicates, if any
It is clear that df_new should be a subset of df; however, when I run the following code:
len(df[df['id'].isin(df_new['id'].values)]) # length of this should be same as len(df_new)
len(df_new)
I get different results (there are 6 more rows in df_new than in df). How can that be? I have checked all dataframes for duplicates and none of them contain any. Interestingly, following code does give expected results:
len(df_new[df_new['id'].isin(df['id'].values)])
len(df_new)
Both of these print the same number.
Edit:
I have also tried following: others = df[~df['id'].isin(df_new['id'].values)], and checking if others has same length as len(df) - len(df_new), but again, in dataframe others there are 6 more rows than expected
The problem comes from your conv dataframe. Assume that your df that comes from file1 is
id PersonalN
0 1
And conv is
id other_col
0 'abc'
0 'def'
After the join you will get:
id PersonalN other_col
0 1 'abc'
0 1 'def'
The size of df_new is larger than that of df, and neither drop_duplicates() nor dropna() will help you reduce the shape of the resulting dataframe.
It's hard to know without the data, but even if there are no duplicate rows in either of the dataframes, the size of the result of an inner join can be larger than the original dataframe size. Consider the following example:
df1 = pd.DataFrame(range(10), columns=["id_"])
df2 = pd.DataFrame({"id_": list(range(10)) + [1] * 3, "something": range(13)})
df2.drop_duplicates(inplace = True)
print(len(df1), len(df2))
==> 10 13
df_new = df1.join(df2.set_index("id_"), on = "id_")
len(df_new)
==> 13
print(df_new)
id_ something
0 0 0
1 1 1
1 1 10
1 1 11
1 1 12
2 2 2
...
The reason is of course that the ids of the other dataframe are not unique, and a single id in the original dataframe (df1 in my example) is joined to several rows on the other dataframe (df2 in my example, conv in yours).
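One way to catch this situation early is to check the key column for uniqueness, or to let pandas check it through the validate argument of DataFrame.merge. The sketch below uses made-up data modeled on the small example above; the column values are assumptions, not the asker's real data.

```python
import pandas as pd

# Toy data: `conv` has the key 0 twice, as in the example above.
df = pd.DataFrame({"id": [0, 1], "PersonalN": [1, 2]})
conv = pd.DataFrame({"id": [0, 0, 1], "other_col": ["abc", "def", "ghi"]})

# Quick check: is the join key unique in the lookup table?
print(conv["id"].is_unique)

# Show every row whose key occurs more than once.
dupes = conv[conv["id"].duplicated(keep=False)]
print(dupes)

# validate="many_to_one" makes merge raise if the right-hand keys repeat.
raised = False
try:
    df.merge(conv, on="id", how="inner", validate="many_to_one")
except pd.errors.MergeError as exc:
    raised = True
    print("merge rejected:", exc)
```

Note that drop_duplicates() without a subset only removes rows that are identical in every column, so rows sharing an id but differing elsewhere survive it.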
I have two dataframes df1 and df2.
df1
index emp_id name code
0 07 emp07 'A'
1 11 emp11 'B'
2 30 emp30 'C'
df2
index emp_id salary
0 06 1000
1 17 2000
2 11 3000
I want to store a map from df1['emp_id'] to df2.index.
Example: input array - ['emp11','B'] (from df1)
Expected output: [11, 2] # this is df1['emp_id'], df2.index
Code I am trying:
columns_to_idx = {emp_id: i for i, emp_id in
enumerate(list(DF1.set_index('emp_id').loc[DF2.index][['name', 'code']]))}
I think you need DataFrame.merge with an inner join, plus DataFrame.reset_index to turn the index of df2 into a column so that it is not lost:
df = df1.merge(df2.reset_index(), on='emp_id')
print (df)
emp_id name code index salary
0 11 emp11 B 2 3000
Then it is possible to create a MultiIndex and select by tuple:
df2 = (df1.merge(df2.reset_index(), on='emp_id')
.set_index(['name','code'])[['emp_id','index']])
print (df2)
emp_id index
name code
emp11 B 11 2
print (df2.loc[('emp11','B')].tolist())
[11, 2]
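If a plain dictionary is preferred over a MultiIndex, the same merge can feed a dict comprehension. This is a sketch under the assumption that emp_id is stored as a string (so the leading zeros in 07 and 06 survive); the name lookup is hypothetical.

```python
import pandas as pd

# Frames rebuilt from the question; emp_id kept as strings (assumption).
df1 = pd.DataFrame({"emp_id": ["07", "11", "30"],
                    "name": ["emp07", "emp11", "emp30"],
                    "code": ["A", "B", "C"]})
df2 = pd.DataFrame({"emp_id": ["06", "17", "11"],
                    "salary": [1000, 2000, 3000]})

# reset_index() exposes df2's index as a column named "index".
merged = df1.merge(df2.reset_index(), on="emp_id")

# Map (name, code) -> [emp_id, position of that emp_id in df2].
lookup = {
    (name, code): [emp_id, int(idx)]
    for name, code, emp_id, idx in zip(
        merged["name"], merged["code"], merged["emp_id"], merged["index"]
    )
}
print(lookup[("emp11", "B")])
```

Only employees present in both frames end up in the dictionary, because the merge is an inner join.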
I have two df. One maps values to IDs. The other one has multiple entries of these IDs. I want to have a df with the first dataframe with the values assigned to the respective IDs.
df1 =
Val1 Val2 Val3
x 1000 2 0
y 2000 3 9
z 3000 1 8
df2=
foo ID bar
0 something y a
1 nothing y b
2 everything x c
3 who z d
result=
foo ID bar Val1 Val2 Val3
0 something y a 2000 3 9
1 nothing y b 2000 3 9
2 everything x c 1000 2 0
3 who z d 3000 1 8
I've tried merge and join (obviously incorrectly) but I am getting a bunch of NaNs when I do that. It appears that I am getting NaNs on every alternate ID.
I have also tried indexing both DFs by ID but that didn't seem to help either. I am obviously missing something that I am guessing is a core functionality but I can't get my head around it.
merge and join could both get you the result DataFrame you want. Since one of your DataFrames is indexed (by ID) and the other has just an integer index, merge is the logical choice.
Merge:
# use ID as the column to join on in df2 and the index of df1
result = df2.merge(df1, left_on="ID", right_index=True, how="inner")
Join:
df2.set_index("ID", inplace=True) # index df2 in place so you can use join, which merges by index by default
result = df2.join(df1, how="inner") # join df1 by index
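To see which rows fail to match, a left merge with indicator=True can help. The sketch below rebuilds the sample frames from the question; the suggestion that unmatched keys explain the NaNs is only an assumption, since the failing code is not shown.

```python
import pandas as pd

# Frames rebuilt from the question.
df1 = pd.DataFrame({"Val1": [1000, 2000, 3000],
                    "Val2": [2, 3, 1],
                    "Val3": [0, 9, 8]},
                   index=["x", "y", "z"])
df2 = pd.DataFrame({"foo": ["something", "nothing", "everything", "who"],
                    "ID": ["y", "y", "x", "z"],
                    "bar": ["a", "b", "c", "d"]})

# A left join keeps every row of df2; the extra "_merge" column added by
# indicator=True records whether a matching index label was found in df1.
result = df2.merge(df1, left_on="ID", right_index=True,
                   how="left", indicator=True)

# Rows whose ID had no counterpart in df1 (these would hold NaN values).
unmatched = result[result["_merge"] == "left_only"]
print(result)
print(unmatched)
```

With the sample data every row matches, so unmatched is empty. On real data, rows listed there point at keys that differ between the frames, for example through stray whitespace or a different dtype.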