adding a new column in a dataframe based on another dataframe column - python

Let's assume we have the following 2 dataframes:
df1(36000, 20) and df2(80, 6)
They have 3 columns in common (let's say Name, Last Name, Date).
df1 includes the data of df2 (minus the data in the 3 different columns) and of course some extra information.
df2 has a column that I am interested in (let's name it Rent).
What I want is to create an extra column in df1 that has the value "Overdue" for the rows that also appear in df2 and "Due" for the rows that do not, while keeping the rest of the columns in df1.
I tried the following
merged = df1.merge(df2, how='left', on=list(df1.columns), indicator=True)
df1['Rent'] = np.where(merged['_merge'] == 'both', 'Overdue', 'Due')
However I get an error due to the fact that not all columns of df1 exist in df2. Any ideas?
Also I tried the following
df1['Rent'].apply(lambda x: 'Overdue' if df1['Name'].isin(df2['Name']) else 'Due')
but I'm getting the following error
AttributeError: 'function' object has no attribute 'df2'

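Try this:
df1['Rent'] = np.where(df1['Name'].isin(df2['Name']), 'Overdue', 'Due')
The main point is not to use .apply(): isin already returns one boolean per row, and np.where (after import numpy as np) turns those booleans into the two labels.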
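The answer above checks only the Name column. Since the question says the two dataframes share three columns, here is a minimal sketch that matches on all three with a left merge and indicator=True. The sample rows and the Extra column are made up for illustration; only the column names Name, Last Name, Date and Rent come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the three shared key columns matter here.
df1 = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "Last Name": ["Lee", "Ray", "Fox"],
    "Date": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "Extra": [1, 2, 3],
})
df2 = pd.DataFrame({
    "Name": ["Bob"],
    "Last Name": ["Ray"],
    "Date": ["2021-01-02"],
    "Rent": [500],
})

keys = ["Name", "Last Name", "Date"]

# Merge on the shared keys only. Dropping duplicate key rows from df2
# guarantees the merged frame has exactly one row per row of df1.
merged = df1.merge(df2[keys].drop_duplicates(), on=keys, how="left", indicator=True)

# "_merge" is "both" where the key combination was found in df2.
df1["Rent"] = np.where(merged["_merge"] == "both", "Overdue", "Due")
print(df1)
```

Merging on the keys alone avoids the error from on=list(df1.columns), because df2 is never asked for columns it does not have.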

Related

creating a new column in a dataframe based on 4 other dataframes

Imagine we have 4 dataframes
df1(35000, 20)
df2(12000, 21)
df3(323, 18)
df4(220, 6)
Here is where it gets tricky:
df4 was created by a merge of df3 and df2 based on 1 column.
It took 3 columns from df3 and 3 columns from df2. (that is why it has 6 cols in total)
What I want is the following: I wish to create an extra column in df1 and insert specific values for the rows that have the same value in a specific column in df1 and df3. For this reason I have done the following:
df1['new col'] = df1['Name'].isin(df3['Name'])
Now my new column is filled with True/False values indicating whether the value in the Name column appears in both df1 and df3. So far so good, but what I want is to fill this new column with the values of a specific column from df2. I tried the following:
df1['new col'] = df1['Name'].map({True:df2['Address'],False:'no address inserted'})
However, it inserts all the address values from df2 into that cell instead of only the one value that is needed. Any ideas?
I also tried the following
merged = df2.merge(df4, how='left', left_on='Name', right_on='First Name', indicator=True)
df1['Code'] = np.where(merged['_merge'] == 'both', merged['Address'], 'n.a.')
but I get the following error
Length of values (1210) does not match length of index (35653)
Merge using how='left' and then fill the missing values with fillna.
merged = df2.merge(df4, how='left', left_on='Name', right_on='First Name', indicator=True)
# address_column is the name (or list of names) of the columns whose NaNs you want to replace
merged[address_column] = merged[address_column].fillna('n.a.')
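The length error in the question comes from assigning a column built from a different dataframe back to df1. One way around it, sketched here under the assumption that df2 has both a Name and an Address column, is to build a Name to Address lookup and map it onto df1, which always returns one value per row of df1. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
df1 = pd.DataFrame({"Name": ["Ann", "Bob", "Cid"]})
df2 = pd.DataFrame({
    "Name": ["Bob", "Ann", "Bob"],
    "Address": ["2 Oak St", "1 Elm St", "9 Old Rd"],
})

# Keep the first address per name so that the lookup index is unique.
lookup = df2.drop_duplicates("Name").set_index("Name")["Address"]

# map aligns on df1's own rows; names missing from the lookup become NaN,
# which fillna then replaces with the placeholder text.
df1["new col"] = df1["Name"].map(lookup).fillna("no address inserted")
print(df1)
```

If a name can legitimately have several addresses, keeping only the first one is a choice you would need to revisit.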

Compare data of two columns of one dataframe with two columns of another dataframe and find mismatch data

I have dataframe df1 as following-
Second dataframe df2 is as following-
and I want the resulted dataframe as following
Dataframes df1 and df2 contain a large number of columns and rows, but here I am showing sample data. My goal is to compare the Customer and ID columns of df1 with the Customer and Part Number columns of df2, that is, to find the rows where the pair (df1['Customer'], df1['ID']) has no matching pair (df2['Customer'], df2['Part Number']), and to store the mismatched rows in another dataframe df3. For example: Customer(rishab) with ID(89ab) is present in df1 but not in df2. Thus Customer, Order#, and Part are stored in df3.
I am using the isin() method to find the mismatches between df1 and df2 for one column only, but I am not able to do it for a comparison of two columns.
df3 = df1[~df1['ID'].isin(df2['Part Number'].values)]
#here I am only able to find mismatch based upon only 1 column ID but I want to include Customer also
I could also use a loop, but the data is very large (the running time would increase) and I am sure there is a one-liner to achieve this task. I have also tried to use merge but was not able to produce the exact output.
So, how can I produce this exact output? I am not able to use isin() for two columns, and I think isin() cannot be used for two columns.
The easiest way to achieve this is:
df3 = df1.merge(df2, left_on = ['Customer', 'ID'],right_on= ['Customer', 'Part Number'], how='left', indicator=True)
df3.reset_index(inplace = True)
df3 = df3[df3['_merge'] == 'left_only']
Here, you first do a left join on the columns with indicator=True, which adds a column named _merge indicating which side each row exists on, and then you keep only the rows marked left_only.
You can also try an outer join to get the non-matching rows from both sides. Something like df3 = df1.merge(df2, left_on=['Customer', 'ID'], right_on=['Customer', 'Part Number'], how='outer', indicator=True), and then keep the rows where _merge is not 'both'.
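On the question of whether isin() can work on two columns at once: Series.isin cannot, but a MultiIndex built from the two columns can be tested against a list of (Customer, Part Number) tuples. A minimal sketch follows; the sample rows are invented, apart from the rishab/89ab pair mentioned in the question.

```python
import pandas as pd

# Hypothetical sample data modeled on the question's description.
df1 = pd.DataFrame({
    "Customer": ["rishab", "rishab", "akash"],
    "Order#": [1, 2, 3],
    "Part": ["bolt", "nut", "gear"],
    "ID": ["89ab", "12cd", "34ef"],
})
df2 = pd.DataFrame({
    "Customer": ["rishab", "akash"],
    "Part Number": ["12cd", "34ef"],
})

# Treat each (Customer, ID) pair in df1 as one composite key.
left_keys = pd.MultiIndex.from_frame(df1[["Customer", "ID"]])

# The pairs that exist in df2, as plain tuples.
right_keys = list(zip(df2["Customer"], df2["Part Number"]))

# True where the df1 pair also exists in df2; negate to get the mismatches.
mask = left_keys.isin(right_keys)
df3 = df1.loc[~mask, ["Customer", "Order#", "Part"]]
print(df3)
```

This gives the same rows as the left_only filter, without adding the columns of df2 to the result.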

compare columns different dataframes

I got two DataFrames I would like to merge, but I would prefer to check if the one column that exists in both dfs has the exact same values in each row.
For general merging I tried several solutions; the comments show the resulting shapes:
df = pd.concat([df_b, df_c], axis=1, join='inner') # (245131, 40)
df = pd.concat([df_b, df_c], axis=1).reindex(df_b.index) # (245131, 40)
df = pd.merge(df_b, df_c, on=['client_id'], how='inner') # (420707, 39)
df = pd.concat([df_b, df_c], axis=1) # (245131, 40)
The original df_c is (245131, 14) and df_b is (245131, 26)
By that I assume that the column client_id has the exact values, since in three approaches I have a shape of 245131 rows.
I would like to compare the client_ids in a new_df, tried it with .loc, but it did not work out. Tried also df.rename(columns={ df.columns[20]: "client_id_1" }, inplace=True) but it renamed both columns
I tried
df_test = df_c.client_id
df_test.append(df_b.client_id, ignore_index=True)
but I only receive one index and one client_id column but the shape says 245131 rows.
If I can be sure that the values are exactly the same, should I drop the client_id in one df and do the concat/merge after that, so that I get the correct shape of (245131, 39)?
Is there a mangle_dupe_cols option for merge or compare like there is for read_csv?
Chris, if you wish to check whether 2 columns of 2 separate dataframes are exactly the same, you can try the following:
tuple(df1['col'].values) == tuple(df2['col'].values)
This returns a single bool value.
If you want to merge the 2 dataframes, make sure the column you merge on has unique values in each dataframe, as duplicates will multiply the rows in the result.
Otherwise, use concat if you want to join the dataframes along an axis.
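To make the advice above concrete, here is a small sketch that first checks whether client_id is identical row by row and only then concatenates with the duplicate column dropped. The tiny frames and the b_val/c_val columns are hypothetical stand-ins for df_b and df_c. The fallback branch uses the validate argument of pd.merge, which raises an error if client_id is duplicated on either side.

```python
import pandas as pd

# Hypothetical stand-ins for df_b and df_c.
df_b = pd.DataFrame({"client_id": [1, 2, 3], "b_val": [10, 20, 30]})
df_c = pd.DataFrame({"client_id": [1, 2, 3], "c_val": ["x", "y", "z"]})

# Compare the two columns position by position, ignoring the index labels.
same_ids = df_b["client_id"].reset_index(drop=True).equals(
    df_c["client_id"].reset_index(drop=True)
)

if same_ids:
    # Rows line up one to one, so drop the duplicate key and glue side by side.
    df = pd.concat(
        [df_b.reset_index(drop=True),
         df_c.drop(columns="client_id").reset_index(drop=True)],
        axis=1,
    )
else:
    # Raises pandas.errors.MergeError if client_id repeats in either frame.
    df = pd.merge(df_b, df_c, on="client_id", how="inner", validate="one_to_one")

print(df.shape)
```

A merge that returns more rows than either input, like the (420707, 39) result in the question, is usually a sign that the key is repeated in both frames.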

Comparing two dataframe columns and outputting a third

I apologize in advance if this has been covered, I could not find anything quite like this. This is my first programming job (I was previously software QA) and I've been beating my head against a wall on this.
I have 2 dataframes, one is very large [df2] (14.6 million lines) and I am iterating through it in chunks. I attempted to compare a column of the same name in each dataframe, if they're equal I would like to output a secondary column of the larger frame.
i.e.
if df1['tag'] == df2['tag']:
df1['new column'] = df2['plate']
I attempted a merge but this didn't output what I expected.
df3 = pd.merge(df1, df2, on='tag', how='left')
I hope I did an okay job explaining this.
[Edit:] I also believe I should mention that df2 and df1 both have many additional columns I do not want to interact with/change. Is it possible to only compare the single columns of two dataframes, and output the third additional column?
You may try an inner merge. First, inner merge df1 with df2; you will then get plates only for the common rows, and you can rename the new column in df1 as needed.
df1 = df1.merge(df2, on="tag", how = 'inner')
df1['new column'] = df1['plate']
del df1['plate']
I hope this works.
As smci mentioned, this is a perfect time to use join/merge. If you're looking to preserve df1, a left join is what you want. So you were on the right path:
df1 = pd.merge(df1,
               df2[['tag', 'plate']],
               on='tag', how='left')
df1 = df1.rename({'plate': 'new column'}, axis='columns')
That will only compare the tag columns in each dataframe, so the other columns won't matter. It brings over the plate column from df2 and then renames it to whatever you want your new column to be named.
This is totally a case for join/merge. You want to put df2 on the left because it's smaller.
df2.join(df1, on='tag', ...)
You only misunderstood the type of join/merge you want to make:
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’
how='left' would create an (unwanted) entry for all rows of the LHS df2. That's not quite what you want (if df2 contained other tag values not seen in df1, you'd also get entries for them).
how='inner' would form the intersection of df2 and df1 on the on='tag' field, i.e. you only get entries where df1 contains a valid tag value according to df2.
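So:
# join's on= matches df2['tag'] against the index of the other frame, so index df1 by tag first
df3 = df2.join(df1.set_index('tag'), on='tag', how='inner')
# then reference df3['plate']
or if you only want the 'plate' column in df3 (or some other selection of columns), you can directly do:
df2.join(df1.set_index('tag'), on='tag', how='inner')['plate']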
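Since the question mentions reading the large dataframe in chunks, here is a hedged sketch of how the lookup could be built chunk by chunk and then mapped onto df1, without touching any other column. The sample frames, the other/more columns and the way chunks are produced are all assumptions; in practice the chunks might come from something like pd.read_csv(..., chunksize=...).

```python
import pandas as pd

# Hypothetical sample data; df2 stands in for the very large frame.
df1 = pd.DataFrame({"tag": ["a", "b", "z"], "other": [1, 2, 3]})
df2 = pd.DataFrame({
    "tag": ["a", "b", "c", "d"],
    "plate": ["P1", "P2", "P3", "P4"],
    "more": [0, 0, 0, 0],
})

# Stand-in for chunked iteration over the large frame (two rows at a time).
chunks = (df2.iloc[i:i + 2] for i in range(0, len(df2), 2))

# From each chunk keep only the tags df1 cares about and the two needed columns.
wanted = set(df1["tag"])
parts = [chunk.loc[chunk["tag"].isin(wanted), ["tag", "plate"]] for chunk in chunks]

# One plate per tag (the first one seen), indexed by tag for the lookup.
lookup = pd.concat(parts).drop_duplicates("tag").set_index("tag")["plate"]

# Tags that never appear in df2 end up as NaN in the new column.
df1["new column"] = df1["tag"].map(lookup)
print(df1)
```

Compared with a merge, map never changes the number of rows in df1, which makes it a convenient choice when only one column has to be brought over.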

pandas combine_first with particular index columns?

I'm trying to join two dataframes in pandas to have the following behavior: I want to join on a specified column, but without adding redundant columns to the dataframe. This is analogous to combine_first, except that combine_first does not seem to take an optional index column argument. Example:
# combine df1 and df2 based on "id" column
df1 = pandas.merge(df1, df2, how="outer", on=["id"])
The problem with the above is that columns common to df1/df2 aside from "id" will be added twice (with _x, _y suffixes) to df1. How can I do something like:
# Do outer join from df2 to df1, matching items by "id" but not adding
# columns that are redundant (df1 takes precedence if the values disagree)
df1.combine_first(df2, on=["id"])
How can this be done?
If you are trying to merge columns from df2 into df1 while excluding any redundant columns, the following should work.
df1.set_index("id", inplace=True)
df2.set_index("id", inplace=True)
df3 = df1.merge(df2[df2.columns.difference(df1.columns)], left_index=True, right_index=True, how="outer")
However, this will not update any values in df1 with values from df2, as it only brings in the non-redundant columns. But since you said df1 takes precedence on any values that disagree, perhaps this will do the trick?
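As an alternative sketch, combine_first can be made to work "on id" by moving id into the index of both frames first. Values from df1 win wherever df1 has a non-missing value, gaps in df1 are filled from df2, and no _x/_y columns appear. The sample frames and column names a, b, c are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: "a" and "b" exist in both frames, "c" only in df2.
df1 = pd.DataFrame({"id": [1, 2], "a": ["x", "y"], "b": ["p", None]})
df2 = pd.DataFrame({
    "id": [2, 3],
    "a": ["Y2", "z"],
    "b": ["q", "r"],
    "c": ["m", "n"],
})

# With "id" as the index, combine_first aligns rows by id and columns by name.
combined = (
    df1.set_index("id")
       .combine_first(df2.set_index("id"))
       .reset_index()
)
print(combined)
```

Note that this differs from the merge above in one respect: missing values in df1 are filled from df2, which may or may not be what you want.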
