I have two pandas DataFrames:
key id count
100 9821 7
200 9813 10
nodekey nodeid
100 9821
200 9813
If nodekey+nodeid in df2 match key+id in df1, count in df1 has to be set to 0. So the result for the example above should be:
key id count
100 9821 0
200 9813 0
I tried the following (matching on key and nodekey only, as a test) but receive an error:
df1['count']=np.where((df1.key == df2.nodekey),0)
ValueError: either both or neither of x and y should be given

This should work:
df1.loc[df1[['key', 'id']].apply(tuple, axis=1).isin(df2[['nodekey', 'nodeid']].apply(tuple, axis=1)), "count"] = 0
This is basically
df1.loc[mask, 'count'] = 0
where mask is True for the rows whose (key, id) tuple matches any (nodekey, nodeid) tuple.
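A minimal, self-contained sketch of the same mask idea, assuming small sample frames modelled on the question (the 300/9855 row is added only to show a non-matching row). Instead of per-row tuples it builds the pairs with pd.MultiIndex.from_frame, which is a swapped-in way of getting the same boolean mask, not the method used above:

```python
import pandas as pd

# Sample data modelled on the question; the third row is an illustrative non-match.
df1 = pd.DataFrame({"key": [100, 200, 300],
                    "id": [9821, 9813, 9855],
                    "count": [7, 10, 7]})
df2 = pd.DataFrame({"nodekey": [100, 200], "nodeid": [9821, 9813]})

# Treat each (key, id) row as one composite key.
left_pairs = pd.MultiIndex.from_frame(df1[["key", "id"]])
right_pairs = pd.MultiIndex.from_frame(df2[["nodekey", "nodeid"]])

# True where the (key, id) pair of df1 also appears as (nodekey, nodeid) in df2.
mask = left_pairs.isin(right_pairs)

df1.loc[mask, "count"] = 0
print(df1)
```

Only rows where both columns match are reset; a row matching on key alone is left untouched.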

Merge the DataFrames with a left merge (rows that are present in df1 but not in df2 get NaN in the df2 columns):
combined = df1.merge(df2, left_on=['key', 'id'],
                     right_on=['nodekey', 'nodeid'], how='left')
Set count to 0 for the rows that found a match (nodekey is not NaN):
combined.loc[combined.nodekey.notnull(), 'count'] = 0
Clean up the unwanted columns:
combined.drop(['nodekey', 'nodeid'], axis=1, inplace=True)
# key id count
#0 100 9821 0
#1 200 9813 0
#2 300 9855 7
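A variation on the left merge, sketched under the assumption that df2 may contain repeated (nodekey, nodeid) pairs. It uses the indicator option of merge to flag matched rows instead of testing for NaN. The sample frames and the name pairs are illustrative only:

```python
import pandas as pd

df1 = pd.DataFrame({"key": [100, 200, 300],
                    "id": [9821, 9813, 9855],
                    "count": [7, 10, 7]})
df2 = pd.DataFrame({"nodekey": [100, 200], "nodeid": [9821, 9813]})

# Give df2 the same column names as df1 and drop repeated pairs,
# so the left merge cannot multiply rows of df1.
pairs = df2.rename(columns={"nodekey": "key", "nodeid": "id"}).drop_duplicates()

# indicator=True adds a "_merge" column: "both" for matches, "left_only" otherwise.
combined = df1.merge(pairs, on=["key", "id"], how="left", indicator=True)
combined.loc[combined["_merge"] == "both", "count"] = 0
combined = combined.drop(columns="_merge")
print(combined)
```

Because the key columns share names here, there are no leftover nodekey/nodeid columns to drop afterwards.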


Creating a new map from existing maps in python

This question might be common but I am new to python and would like to learn more from the community. I have 2 map files which have data mapping like this:
map1 : A --> B
map2 : B --> C,D,E
I want to create a new map file which will be A --> C
What is the most efficient way to achieve this in Python? A generic approach would be very helpful, as I need to apply the same logic to different files and different columns.
My map3 should be:
As 453 is not present in map2, our map3 contains value 0 for key 2.
First create DataFrames:
df1 = pd.read_csv(Map1, header=None)
df2 = pd.read_csv(Map2, header=None)
Then use Series.map on the second column, passing a Series built from df2 with its first column set as the index. Finally, replace the missing values of the unmatched keys with 0:
df1[1] = df1[1].map(df2.set_index(0)[1]).fillna(0).astype(int)
print (df1)
0 1
0 1 25
1 2 0
2 3 300
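Since the question asks for something generic, here is a small sketch that wraps the Series.map step in a helper. The function name compose_maps and the intermediate values 111 and 222 are made up for illustration; only 453, 25, 300 and the final result come from the example above. It assumes the keys of the second map are unique:

```python
import pandas as pd

def compose_maps(map_ab, map_bc, default=0):
    """Return A -> C from two-column frames holding A -> B and B -> C.

    Hypothetical helper: assumes the first column of map_bc has unique values.
    """
    a_col, b_col = map_ab.columns[:2]
    key_col, value_col = map_bc.columns[:2]
    lookup = map_bc.set_index(key_col)[value_col]          # B -> C as a Series
    out = map_ab[[a_col]].copy()
    out[value_col] = map_ab[b_col].map(lookup).fillna(default)  # unmatched B -> default
    return out

# Headerless frames, as produced by pd.read_csv(..., header=None)
map1 = pd.DataFrame({0: [1, 2, 3], 1: [111, 453, 222]})
map2 = pd.DataFrame({0: [111, 222], 1: [25, 300]})

map3 = compose_maps(map1, map2)
map3[1] = map3[1].astype(int)  # fillna leaves floats; cast back for this integer data
print(map3)
```

The same helper can be pointed at other files by selecting the two relevant columns before calling it.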
EDIT: To map multiple columns, use a left join, remove any all-NaN columns with DataFrame.dropna, drop the columns b and c used for the join, and finally replace the missing values:
df = (df1.merge(df2, how='left', left_on='b', right_on='c')
         .dropna(how='all', axis=1)
         .drop(['b','c'], axis=1)
         .fillna(0)
         .astype(int))
print (df)
a d e
0 1 25 30
1 2 0 0
2 3 300 0
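A runnable sketch of the multi-column case. The column names a to e follow the snippet above; the values in b and c are invented for illustration (only 453 and the printed result come from the example), and it assumes every remaining column is numeric so the final cast is safe:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [111, 453, 222]})           # A -> B
df2 = pd.DataFrame({"c": [111, 222], "d": [25, 300], "e": [30, 0]})  # B -> D, E

df = (
    df1.merge(df2, how="left", left_on="b", right_on="c")  # unmatched b gives NaN in c, d, e
       .drop(columns=["b", "c"])                           # join columns are no longer needed
       .fillna(0)                                          # unmatched rows map to 0
       .astype(int)                                        # NaN forced floats; cast back
)
print(df)
```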

Match multiple columns on Python to a single value

I hope you are doing well.
I am trying to perform a match based on multiple columns, where the values of Column B of df1 are scattered across three to four columns in df2. The goal is to return the value of Column A of df2 if the value in Column B matches any value in columns C, D, E.
What I have done so far is multiple left merges (renaming Column B each time to match the name of columns C, D, E of df2).
I am trying to simplify the process, but I am unsure how to do this.
My dataset looks like this:
df1:
ID
0 77
1 4859
df2:
X id1 id2 id3
0 AAAAA_XX 889 77 BSP
1 BBBBB_XX 4859 CC 998P
My goal is to have in df1:
1 4859 BBBBB_XX
Thank you very much !
You can first stack the values of the id columns into a single column with pd.concat,
then merge the tables like this:
df3 = pd.concat([df2.id1, df2.id2]).reset_index()
df1 = df2.merge(df3, how="left", left_on = df1.ID, right_on = df3[0])
df1 = df1.iloc[:, :2]
df1 = df1.rename(columns={"key_0": "ID"})
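Not the most beautiful code in the world, but it works for this example. Note that an array passed as left_on must have the same length as the left frame, so using df1.ID with df2.merge only works here because df1 and df2 have the same number of rows.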
1 4859 BBBBB_XX
Use DataFrame.merge with DataFrame.melt:
df = df1.merge(df2.melt(id_vars='X', value_name='ID').drop('variable', axis=1),
print (df)
1 4859 BBBBB_XX
If duplicated IDs are possible, use:
df = (df1.merge(df2.melt(id_vars='X', value_name='ID')
.drop('variable', axis=1)
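A self-contained sketch of the melt-then-merge idea, with sample frames rebuilt from the question. Casting the ids to strings and keeping one row per id are assumptions made here so that the mixed int/string id columns compare cleanly; the helper column ID_str and the name lookup are hypothetical:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [77, 4859]})
df2 = pd.DataFrame({"X": ["AAAAA_XX", "BBBBB_XX"],
                    "id1": [889, 4859],
                    "id2": [77, "CC"],
                    "id3": ["BSP", "998P"]})

# Long format: one row per (X, id value), regardless of which id column it came from.
lookup = (
    df2.melt(id_vars="X", value_name="ID_str")
       .drop(columns="variable")
       .assign(ID_str=lambda d: d["ID_str"].astype(str))  # compare ids as strings
       .drop_duplicates("ID_str")                          # keep one X per id
)

result = (
    df1.assign(ID_str=df1["ID"].astype(str))
       .merge(lookup, on="ID_str", how="left")  # left merge keeps every row of df1
       .drop(columns="ID_str")
)
print(result)
```

This replaces the chain of one left merge per id column with a single merge against the reshaped frame.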

Dataframe becomes larger than it should be after join operation in pandas

I have a DataFrame loaded from an Excel file, which I am trying to populate with fields from another Excel file like so:
df = pd.read_excel("file1.xlsx")
df_new = df.join(conv.set_index('id'), on='id', how='inner')
df_new['PersonalN'] = df_new['PersonalN'].apply(lambda x: "" if x==0 else x) # if id==0, its same as nan
df_new = df_new.dropna() # drop nan
df_new['PersonalN'] = df_new['PersonalN'].apply(lambda x: str(int(x))) # convert id to string
df_new = df_new.drop_duplicates() # drop duplicates, if any
It is clear that df_new should be a subset of df; however, when I run the following code:
len(df[df['id'].isin(df_new['id'].values)]) # length of this should be same as len(df_new)
I get different results (there are 6 more rows in df_new than in df). How can that be? I have checked all dataframes for duplicates, and none of them contain any. Interestingly, the following code does give the expected results:
Both of these print the same number.
I have also tried others = df[~df['id'].isin(df_new['id'].values)] and checked whether others has the same length as len(df) - len(df_new), but again others has 6 more rows than expected.
The problem comes from your conv dataframe. Assume that your df that comes from file1 is
id PersonalN
0 1
And conv is
id other_col
0 'abc'
0 'def'
After the join you will get:
id PersonalN other_col
0 1 'abc'
0 1 'def'
The size of df_new is larger than that of df, and drop_duplicates() or dropna() will not help you reduce the shape of the resulting dataframe.
It's hard to know without the data, but even if neither dataframe contains duplicate rows, the result of an inner join can be larger than the original dataframe. Consider the following example:
df1 = pd.DataFrame(range(10), columns=["id_"])
df2 = pd.DataFrame({"id_": list(range(10)) + [1] * 3, "something": range(13)})
df2.drop_duplicates(inplace = True)
print(len(df1), len(df2))
==> 10 13
df_new = df1.join(df2.set_index("id_"), on = "id_")
print(len(df_new))
==> 13
id_ something
0 0 0
1 1 1
1 1 10
1 1 11
1 1 12
2 2 2
The reason is of course that the ids of the other dataframe are not unique, and a single id in the original dataframe (df1 in my example) is joined to several rows on the other dataframe (df2 in my example, conv in yours).
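One way to catch this early is sketched below: look for repeated ids in the lookup frame, and let merge check the relationship through its validate argument. The sample data and the choice to keep the first row per id are assumptions for illustration; which row to keep depends on the real data:

```python
import pandas as pd
from pandas.errors import MergeError

df = pd.DataFrame({"id": [0, 1, 2], "PersonalN": [1, 2, 3]})
conv = pd.DataFrame({"id": [0, 0, 1], "other_col": ["abc", "def", "ghi"]})

# Ids that occur more than once in the lookup frame.
repeated_ids = conv.loc[conv["id"].duplicated(keep=False), "id"].unique().tolist()
print("repeated ids in conv:", repeated_ids)

# validate="many_to_one" makes merge raise if the right-hand keys are not unique.
try:
    df.merge(conv, on="id", how="inner", validate="many_to_one")
    merge_ok = True
except MergeError:
    merge_ok = False

# Assumption: keeping the first row per id is acceptable.
conv_unique = conv.drop_duplicates(subset="id")
df_new = df.merge(conv_unique, on="id", how="inner", validate="many_to_one")
print(len(df), len(df_new))
```

Note that drop_duplicates() without subset only removes rows that are identical in every column, which is why it does not help in the examples above.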

Join two pandas data frames with the indices of the first?

I have two dataframes, df1:
column1 column2
0 A B
1 A A
2 C A
3 None None
4 None None
and df2
id l
40 100005090 A
188 100020985 B
Now I want to join df1 and df2, but I don't know how to match the indices. If I simply do df1.join(df2), the indices are aligned to df2. That is, it finds the 40th entry of df2 and that is now the first entry of the dataframe that starts at 40 (df1). How do I tell pandas to align indices to df1, meaning that the first entry of df2 is actually index 40? That is, I would like to get:
id l column1 column2
40 100005090 A A B
188 100020985 B A A
You can take a slice of the longer frame (df below) that is the same length as df1, overwrite its index with the index of df1, and then join:
sub = df.iloc[:len(df1)].copy()
sub.index = df1.index
df1.join(sub)
id l column1 column2
40 100005090 A A B
188 100020985 B A A
If the dfs are the same length, the first line is not needed; you just overwrite the index with the index values from the other df.
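The same steps as a self-contained sketch, using the frame names from the question (df1 holds column1/column2, df2 holds id/l with the index 40, 188). The .copy() is added here to avoid modifying a view of df1:

```python
import pandas as pd

df1 = pd.DataFrame({"column1": ["A", "A", "C", None, None],
                    "column2": ["B", "A", "A", None, None]})
df2 = pd.DataFrame({"id": [100005090, 100020985], "l": ["A", "B"]},
                   index=[40, 188])

# Take as many leading rows of df1 as df2 has, then relabel them with df2's index
# so that the join pairs rows by position rather than by the original labels.
sub = df1.iloc[:len(df2)].copy()
sub.index = df2.index

result = df2.join(sub)
print(result)
```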

Combining DataFrames without Nans

I have two dfs. One maps values to IDs. The other one has multiple entries of these IDs. I want a single df in which each row of the second dataframe carries the values that the first dataframe assigns to its ID.
df1 =
Val1 Val2 Val3
x 1000 2 0
y 2000 3 9
z 3000 1 8
df2 =
foo ID bar
0 something y a
1 nothing y b
2 everything x c
3 who z d
Desired result:
foo ID bar Val1 Val2 Val3
0 something y a 2000 3 9
1 nothing y b 2000 3 9
2 everything x c 1000 2 0
3 who z d 3000 1 8
I've tried merge and join (obviously incorrectly) but I am getting a bunch of NaNs when I do that. It appears that I am getting NaNs on every alternate ID.
I have also tried indexing both DFs by ID but that didn't seem to help either. I am obviously missing something that I am guessing is a core functionality but I can't get my head around it.
merge and join could both get you the result DataFrame you want. Since one of your DataFrames is indexed (by ID) and the other has just an integer index, merge is the logical choice.
# use ID as the column to join on in df2 and the index of df1
result = df2.merge(df1, left_on="ID", right_index=True, how="inner")
Alternatively, index df2 by ID so you can use join, which merges on the index by default:
df2.set_index("ID", inplace=True) # index df2 in place
result = df2.join(df1, how="inner") # join df1 by index
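A runnable sketch of the merge variant with the sample data from the question. It uses how="left" instead of "inner" so that rows of df2 without a matching ID stay visible as NaN rather than disappearing; the unmatched check is an addition for debugging. Keys that do not compare equal (for example stray whitespace or different dtypes) are one possible cause of unexpected NaNs, but that is a guess, since the real data is not shown:

```python
import pandas as pd

df1 = pd.DataFrame({"Val1": [1000, 2000, 3000],
                    "Val2": [2, 3, 1],
                    "Val3": [0, 9, 8]},
                   index=["x", "y", "z"])
df2 = pd.DataFrame({"foo": ["something", "nothing", "everything", "who"],
                    "ID": ["y", "y", "x", "z"],
                    "bar": ["a", "b", "c", "d"]})

# Match the ID column of df2 against the index of df1.
result = df2.merge(df1, left_on="ID", right_index=True, how="left")

# Rows of df2 whose ID was not found in df1's index (empty for this data).
unmatched = result[result["Val1"].isna()]
print(result)
print("unmatched rows:", len(unmatched))
```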
