pandas combine_first with particular index columns? - python

I'm trying to join two dataframes in pandas to have the following behavior: I want to join on a specified column, but have it so redundant columns are not added to the dataframe. This is analogous to combine_first except combine_first does not seem to take an index column optional argument. Example:
# combine df1 and df2 based on "id" column
df1 = pandas.merge(df1, df2, how="outer", on=["id"])
The problem with the above is that columns common to df1/df2 aside from "id" will be added twice (with _x/_y suffixes) to df1. How can I do something like:
# Do outer join from df2 to df1, matching items by "id" but not adding
# columns that are redundant (df1 takes precedence if the values disagree)
df1.combine_first(df2, on=["id"])
How can this be done?

If you are trying to merge columns from df2 into df1 while excluding any redundant columns, the following should work.
df1.set_index("id", inplace=True)
df2.set_index("id", inplace=True)
df3 = df1.merge(df2[df2.columns.difference(df1.columns)], left_index=True, right_index=True, how="outer")
However, this will not update any values in df1 with values from df2, since it only brings in the non-redundant columns. But since you said df1 takes precedence when values disagree, perhaps this will do the trick?
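For what it's worth, starting from the original (un-indexed) frames, combine_first itself can get close to the desired behaviour if you put "id" into the index on both sides first. A minimal sketch, assuming "id" is unique in each frame:
df3 = df1.set_index("id").combine_first(df2.set_index("id")).reset_index()
combine_first aligns on the index, keeps df1's values wherever the two frames disagree, and fills df1's missing values (and missing rows/columns) from df2.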

Related

Pandas; how to concat two dataframes but only the columns that are the same?

If I have two DataFrames, e.g.:
df1 = pd.DataFrame({'Code': ['1001', '1002', '1003', '1004'],
                    'Place': ['Chile', 'Peru', 'Colombia', 'Argentina']})
and:
df2 = pd.DataFrame({'Code': ['1001', '1002', '1003'],
                    'Place': ['Chile', 'Peru', 'Colombia']})
How can I concat these two to make one DataFrame of two rows, but only with the columns that are the same? Thanks
If I understand your question correctly, you want two rows, namely "Code" and "Place", i.e. you need to transpose the DataFrame:
df = df1.merge(df2, how="inner").T
print(df)
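With the sample frames above, the inner merge on the shared Code and Place columns keeps the three matching rows, and .T turns the two columns into two rows, printing something like:
          0     1         2
Code   1001  1002      1003
Place  Chile  Peru  Colombia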

Pandas `hash_pandas_object` not producing duplicate hash values for duplicate entries

I have two dataframes, df1 and df2, and I know that df2 is a subset of df1. What I am trying to do is find the set difference between df1 and df2, such that df1 has only entries that are different from those in df2. To accomplish this, I first used pandas.util.hash_pandas_object on each of the dataframes, and then found the set difference between the two hashed columns.
df1['hash'] = pd.util.hash_pandas_object(df1, index=False)
df2['hash'] = pd.util.hash_pandas_object(df2, index=False)
df1 = df1.loc[~df1['hash'].isin(df2['hash'])]
This results in df1 remaining the same size; that is, none of the hash values matched. However, when I use a lambda function, df1 is reduced by the expected amount.
df1['hash'] = df1.apply(lambda x: hash(tuple(x)), axis=1)
df2['hash'] = df2.apply(lambda x: hash(tuple(x)), axis=1)
df1 = df1.loc[~df1['hash'].isin(df2['hash'])]
The problem with the second approach is that it takes an extremely long time to execute (df1 has about 3 million rows). Am I just misunderstanding how to use pandas.util.hash_pandas_object?
The difference is that in the first case you are hashing the complete dataframe, while in the second case you are hashing each individual row.
If your objective is to remove the duplicate rows, you can achieve this faster using a left merge with the indicator option and then dropping the rows that are not unique to the original dataframe.
df_merged = df1.merge(df2, how='left', on=list_columns, indicator=True)
df_merged = df_merged[df_merged['_merge'] == 'left_only']  # keep only the unmatched rows
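Note that list_columns is not defined in the snippet above; it stands for the list of column names to match on. A minimal self-contained sketch, assuming df1 and df2 share the same columns:
list_columns = list(df1.columns)  # assumed: match on every shared column
df_merged = df1.merge(df2, how='left', on=list_columns, indicator=True)
df1_only = df_merged[df_merged['_merge'] == 'left_only'].drop(columns='_merge')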

Compare data of two columns of one dataframe with two columns of another dataframe and find mismatch data

I have dataframe df1 as follows:
The second dataframe df2 is as follows:
and I want the resulting dataframe to be as follows:
Dataframes df1 and df2 contain a large number of columns and rows, but I am showing sample data here. My goal is to compare the Customer and ID columns of df1 with the Customer and Part Number columns of df2, finding the rows where df1['Customer'] and df1['ID'] have no match in df2['Customer'] and df2['Part Number'], and finally storing the mismatched data in another dataframe df3. For example: Customer (rishab) with ID (89ab) is present in df1 but not in df2, so its Customer, Order#, and Part are stored in df3.
I am using the isin() method to find mismatches between df1 and df2, but only for one column; I am not able to do it for a comparison across two columns.
df3 = df1[~df1['ID'].isin(df2['Part Number'].values)]
# here I am only able to find mismatches based on the single ID column, but I want to include Customer as well
I could use a loop, but the data is very large (time complexity would increase) and I am sure there is a one-liner to achieve this task. I have also tried merge but was not able to produce the exact output.
So, how can I produce this exact output? I am also not able to use isin() for two columns, and I think isin() cannot be used across two columns.
The easiest way to achieve this is:
df3 = df1.merge(df2, left_on=['Customer', 'ID'], right_on=['Customer', 'Part Number'], how='left', indicator=True)
df3.reset_index(inplace=True)
df3 = df3[df3['_merge'] == 'left_only']
Here, you first do a left join on those columns and pass indicator=True, which adds a column named _merge indicating which side of the merge each row came from, and then you keep the left_only rows.
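As an aside, since you asked about isin() for two columns: a two-column membership test is possible through a MultiIndex. A minimal sketch, assuming the column names above and pandas >= 0.24 for MultiIndex.from_frame:
mask = pd.MultiIndex.from_frame(df1[['Customer', 'ID']]).isin(
    pd.MultiIndex.from_frame(df2[['Customer', 'Part Number']]))
df3 = df1[~mask]  # rows of df1 with no (Customer, ID) match in df2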
You can also try an outer join with the indicator to get the non-matching rows. Something like:
df3 = df1.merge(df2, left_on=['Customer', 'ID'], right_on=['Customer', 'Part Number'], how='outer', indicator=True)
df3 = df3[df3['_merge'] != 'both']

Comparing two dataframe columns and outputting a third

I apologize in advance if this has been covered; I could not find anything quite like this. This is my first programming job (I was previously in software QA) and I've been beating my head against a wall on this.
I have 2 dataframes; one [df2] is very large (14.6 million lines) and I am iterating through it in chunks. I attempted to compare a column of the same name in each dataframe; where they're equal, I would like to output a secondary column of the larger frame.
i.e.
if df1['tag'] == df2['tag']:
    df1['new column'] = df2['plate']
I attempted a merge but this didn't output what I expected.
df3 = pd.merge(df1, df2, on='tag', how='left')
I hope I did an okay job explaining this.
[Edit:] I should also mention that df1 and df2 both have many additional columns I do not want to interact with or change. Is it possible to compare only the single columns of the two dataframes, and output the third additional column?
You may try an inner merge. First, inner merge df1 with df2; you will get plates only for the common rows, and you can then rename the new column as you need:
df1 = df1.merge(df2, on="tag", how = 'inner')
df1['new column'] = df1['plate']
del df1['plate']
I hope this works.
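If df2 has many other columns that you don't want carried over, you can restrict the merge to the two relevant columns and rename in the same step; a sketch, assuming the column names above:
df1 = (df1.merge(df2[['tag', 'plate']], on='tag', how='inner')
          .rename(columns={'plate': 'new column'}))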
As smci mentioned, this is a perfect time to use join/merge. If you're looking to preserve df1, a left join is what you want. So you were on the right path:
df1 = pd.merge(df1,
               df2[['tag', 'plate']],
               on='tag', how='left')
df1 = df1.rename(columns={'plate': 'new column'})
That will only compare the tag columns in each dataframe, since the merge is on 'tag', so the other columns won't matter. It brings over the plate column from df2 and then renames it to whatever you want your new column to be named.
This is totally a case for join/merge. You want to put df2 on the left because it's smaller.
df2.join(df1.set_index('tag'), on='tag', ...)
You only misunderstood the type of join/merge you want to make:
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’
'how'='left' would create an (unwanted) entry for all rows of the LHS df2. That's not quite what you want (if df2 contained other tag values not seen in df1, you'd also get entries for them).
'how'='inner' would form the intersection of df2 and df1 on the 'on'='tag' field, i.e. you only get entries where df2's tag value also appears in df1.
So:
df3 = df2.join(df1.set_index('tag'), on='tag', how='inner')
# then reference df3['plate']
or if you only want the 'plate' column in df3 (or some other selection of columns), you can directly do:
df2.join(df1.set_index('tag'), on='tag', how='inner')['plate']
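One caveat: DataFrame.join matches the calling frame's on column against the other frame's index, which is why df1 is reindexed by 'tag' above. If you prefer to stay with merge, an equivalent sketch is:
df3 = pd.merge(df2, df1, on='tag', how='inner')
# then reference df3['plate'] as before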

Pandas how to concat two dataframes without losing the column headers

I have the following toy code:
import pandas as pd
df = pd.DataFrame()
df["foo"] = [1, 2, 3, 4]
df2 = pd.DataFrame()
df2["bar"] = [4, 5, 6, 7]
df = pd.concat([df, df2], ignore_index=True, axis=1)
print(list(df))
Output: [0,1]
Expected Output: [foo,bar] (order is not important)
Is there any way to concatenate two dataframes without losing the original column headers, if I can guarantee that the headers will be unique?
Iterating through the columns and then adding them to one of the DataFrames comes to mind, but is there a pandas function, or concat parameter that I am unaware of?
Thanks!
As stated in the merge, join, and concatenate documentation, ignore_index will remove all name references and use a range (0...n-1) instead. So it should give you the result you want once you remove the ignore_index argument or set it to False (the default):
df = pd.concat([df, df2], axis=1)
This will join your df and df2 based on their indexes (rows with the same index will be concatenated side by side; if the other dataframe has no member of that index, the values will be filled with NaN).
If you have different indexing on your dataframes and want to concatenate them this way, you can either create a temporary index and join on that, or set the new dataframe's columns after using concat(..., ignore_index=True).
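For example, a minimal sketch that aligns the two frames positionally before concatenating, so the original headers survive even when the indexes differ:
df = pd.concat([df.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)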
I don't think the accepted answer answers the question, which is about column headers, not indexes.
I am facing the same problem, and my workaround is to add the column names after the concatenation:
df.columns = ["foo", "bar"]
