Compare columns of different DataFrames - Python

I have two DataFrames I would like to merge, but first I would prefer to check whether the one column that exists in both dfs has exactly the same values in each row.
For general merging I tried several solutions; in the comment after each line you can see the resulting shape:
df = pd.concat([df_b, df_c], axis=1, join='inner') # (245131, 40)
df = pd.concat([df_b, df_c], axis=1).reindex(df_b.index) # (245131, 40)
df = pd.merge(df_b, df_c, on=['client_id'], how='inner') # (420707, 39)
df = pd.concat([df_b, df_c], axis=1) # (245131, 40)
The original df_c is (245131, 14) and df_b is (245131, 26).
From that I assume the column client_id has exactly the same values, since three of the approaches give a shape of 245131 rows.
I would like to compare the client_ids in a new_df. I tried it with .loc, but it did not work out. I also tried df.rename(columns={ df.columns[20]: "client_id_1" }, inplace=True), but it renamed both columns.
I tried
df_test = df_c.client_id
df_test.append(df_b.client_id, ignore_index=True)
but I only get one index and one client_id column, and the shape still says 245131 rows.
If I can be sure that the values are exactly the same, should I drop the client_id in one df and do the concat/merge after that, so that I get the correct shape of (245131, 39)?
Is there a mangle_dupe_cols option for merge or compare, like for read_csv?

Chris, if you wish to check whether two columns of two separate DataFrames are exactly the same, you can try the following:
tuple(df1['col'].values) == tuple(df2['col'].values)
This returns a bool.
If you want to merge two DataFrames, make sure all rows in your column of interest have unique values, as duplicates will cause additional rows.
Otherwise, use concat if you want to join the DataFrames along an axis.
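As a small runnable sketch of this answer plus the drop-then-concat idea from the question (frame contents invented for illustration; Series.equals is an alternative to the tuple comparison that also treats NaN == NaN as equal):

```python
import pandas as pd

df_b = pd.DataFrame({'client_id': [1, 2, 3], 'b_val': [10, 20, 30]})
df_c = pd.DataFrame({'client_id': [1, 2, 3], 'c_val': ['x', 'y', 'z']})

# Check that the shared column is identical row by row
same = df_b['client_id'].equals(df_c['client_id'])
print(same)  # True

if same:
    # Drop the duplicate column from one frame before concatenating,
    # so the result has no repeated client_id column
    df = pd.concat([df_b, df_c.drop(columns='client_id')], axis=1)
print(df.shape)  # (3, 3)
```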

Related

creating a new column in a dataframe based on 4 other dataframes

Imagine we have 4 dataframes
df1(35000, 20)
df2(12000, 21)
df3(323, 18)
df4(220, 6)
Here is where it gets tricky:
df4 was created by a merge of df3 and df2 based on 1 column.
It took 3 columns from df3 and 3 columns from df2. (that is why it has 6 cols in total)
What I want is the following: I wish to create an extra column in df1 and insert specific values for the rows that have the same value in a specific column of df1 and df3. For this I have done the following:
df1['new col'] = df1['Name'].isin(df3['Name'])
Now my new column is filled with True/False depending on whether the value in the Name column is the same in both df1 and df3. So far so good, but what I want is to fill this new column with the values of a specific column from df2. I tried the following:
df1['new col'] = df1['Name'].map({True:df2['Address'],False:'no address inserted'})
However, it inserts all the address values from df2 into that cell instead of only the one value that is needed. Any ideas?
I also tried the following
merged = df2.merge(df4, how='left', left_on='Name', right_on='First Name', indicator=True)
df1['Code'] = np.where(merged['_merge'] == 'both', merged['Address'], 'n.a.')
but I get the following error
Length of values (1210) does not match length of index (35653)
Merge using how='left' and then fill the missing values with fillna:
merged = df2.merge(df4, how='left', left_on='Name', right_on='First Name', indicator=True)
merged[address_column] = merged[address_column].fillna('n.a.')  # address_column is the name (or list of names) of the columns where you want to replace the NaNs
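A minimal runnable sketch of that pattern, with made-up names and addresses (here address_column is simply 'Address'):

```python
import pandas as pd

df2 = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cid'], 'Other': [1, 2, 3]})
df4 = pd.DataFrame({'First Name': ['Ann', 'Cid'],
                    'Address': ['12 Elm St', '9 Oak Ave']})

# Left join keeps every row of df2; unmatched rows get NaN in Address
merged = df2.merge(df4, how='left', left_on='Name',
                   right_on='First Name', indicator=True)

# Replace the NaNs left by unmatched rows with a placeholder
merged['Address'] = merged['Address'].fillna('n.a.')
print(merged['Address'].tolist())  # ['12 Elm St', 'n.a.', '9 Oak Ave']
```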

Compare data of two columns of one dataframe with two columns of another dataframe and find mismatch data

I have dataframe df1 as follows -
The second dataframe df2 is as follows -
and I want the resulting dataframe as follows.
Dataframes df1 & df2 contain a large number of columns and data, but here I am showing sample data. My goal is to compare the Customer and ID columns of df1 with the Customer and Part Number columns of df2. The comparison is to find mismatches between df1['Customer'] / df1['ID'] and df2['Customer'] / df2['Part Number'], and finally to store the mismatched data in another dataframe df3. For example: Customer (rishab) with ID (89ab) is present in df1 but not in df2, thus Customer, Order#, and Part are stored in df3.
I am using the isin() method to find mismatches of df1 against df2, but only for one column; I am not able to do it for a comparison of two columns.
df3 = df1[~df1['ID'].isin(df2['Part Number'].values)]
# here I am only able to find mismatches based on the single column ID, but I want to include Customer as well
I could use a loop, but the data is very large (the time complexity would increase), and I am sure there must be a one-liner to achieve this task. I have also tried merge but was not able to produce the exact output.
So, how do I produce this exact output? I am also not able to use isin() for two columns, and I think isin() cannot be used for two columns.
The easiest way to achieve this is:
df3 = df1.merge(df2, left_on = ['Customer', 'ID'],right_on= ['Customer', 'Part Number'], how='left', indicator=True)
df3.reset_index(inplace = True)
df3 = df3[df3['_merge'] == 'left_only']
Here, you first do a left join on the columns and pass indicator=True, which adds a column called _merge indicating which side the data exists on, and then we pick the left_only rows from it.
You can try an outer join to get the non-matching rows. Something like df3 = df1.merge(df2, left_on=['Customer', 'ID'], right_on=['Customer', 'Part Number'], how='outer')
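On a tiny made-up sample, the left-join/indicator pattern from the answer looks like this (rishab/89ab is the row missing from df2, as in the question):

```python
import pandas as pd

df1 = pd.DataFrame({'Customer': ['rishab', 'john'],
                    'ID': ['89ab', '12cd'],
                    'Order#': [1, 2]})
df2 = pd.DataFrame({'Customer': ['john'],
                    'Part Number': ['12cd']})

# Left join with indicator=True adds a _merge column; left_only rows
# are the (Customer, ID) pairs present in df1 but missing from df2
df3 = df1.merge(df2, left_on=['Customer', 'ID'],
                right_on=['Customer', 'Part Number'],
                how='left', indicator=True)
df3 = df3[df3['_merge'] == 'left_only']
print(df3['Order#'].tolist())  # [1]
```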

pandas concat adding as columns with nans?

I have two dataframes, each with the same number of columns:
print(df1.shape)
(54, 35238)
print(df2.shape)
(64, 35238)
And both don't have any index set
print(df1.index.name)
None
print(df2.index.name)
None
However, whenever I try to vertically concat them (to get a third dataframe with shape (118, 35238)), it produces a new df full of NaNs:
df3 = pandas.concat([df1, df2], ignore_index=True)
print(df3)
The resulting df has the correct number of rows, but it has decided to concat them as new columns. Setting the "axis" flag to 1 results in the same number of (inappropriate) columns (e.g. a shape of (63, 70476)).
Any ideas on how to fix this?
They have the same number of columns, but are the column names different? The documentation on concat suggests to me that you need identical column names to have them stack the way you want.
If this is the problem, you could probably fix it by changing one dataframe's column names to match the other's before concatenating:
df2.columns = df1.columns
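A small sketch reproducing the symptom and the fix (column names invented for illustration):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6]], columns=['A', 'B'])  # differently named columns

# Mismatched names make concat take the union of columns, padding with NaN
bad = pd.concat([df1, df2], ignore_index=True)
print(bad.shape)  # (3, 4)

# Align the names first, then the rows stack as intended
df2.columns = df1.columns
good = pd.concat([df1, df2], ignore_index=True)
print(good.shape)  # (3, 2)
```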
This might be because your df2 is a Series; you can try:
pd.concat([df1, pd.DataFrame([df2])], axis=0, ignore_index=True)

how to reindex the 'multi - groupbyed' dataframe?

I have a dataframe containing 4 columns; the first 3 columns are numerical variables which indicate features of the variable in the last column, and the last column contains strings.
I want to merge the last string column by the previous 3 columns through the groupby function. It works (I mean the strings which share the same features logged by the first three columns have been merged successfully).
Previously the length of the dataframe was 1200, and the length of the merged dataframe is 1100. I found the resulting df is multi-indexed (hierarchical index) and only contains 2 columns. Thus I tried the reindex method with a generated ascending numerical list. Sadly, I failed.
df1.columns
*[Out]Index(['time', 'column','author', 'text'], dtype='object')
series = df1.groupby(['time', 'column', 'author'])['body_text'].sum()  # merge the last column by the first 3 columns
dfx = series.to_frame()  # get the new df
dfx.columns
*[Out]Index(['author', 'text'], dtype='object')
len(dfx)
*[Out]1100
indexs = list(range(1100))
dfx.reindex(index = indexs)
*[Out]Exception: cannot handle a non-unique multi-index!
Reindex is not necessary here; better to use DataFrame.reset_index, or add the parameter as_index=False to DataFrame.groupby:
dfx = df1.groupby(['time', 'column','author'])['body_text'].sum().reset_index()
Or:
dfx = df1.groupby(['time', 'column','author'], as_index=False)['body_text'].sum()
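A runnable sketch of the as_index=False form on a tiny invented frame (column names taken from the question; summing an object column concatenates the strings per group):

```python
import pandas as pd

df1 = pd.DataFrame({
    'time': [1, 1, 2],
    'column': ['x', 'x', 'y'],
    'author': ['a', 'a', 'b'],
    'body_text': ['foo ', 'bar', 'baz'],
})

# as_index=False keeps the group keys as regular columns,
# so no MultiIndex appears and no reindex is needed
dfx = df1.groupby(['time', 'column', 'author'], as_index=False)['body_text'].sum()
print(dfx.shape)  # (2, 4)
print(list(dfx['body_text']))  # ['foo bar', 'baz']
```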

Concat data frame not working

I have two data frames, df1 and df2. Both have the same number of rows but different columns.
I want to concat all columns of df1 and the 2nd and 3rd columns of df2.
df1 has 119 columns and df2 has 3, of which I want the 2nd & 3rd.
Code I am using is:
data_train_test = pd.concat([df1, df2.iloc[:, [2, 3]]], axis=1, ignore_index=False)
Error I am getting is
ValueError: Shape of passed values is (121, 39880), indices imply (121, 28898)
My analysis:
39880 - 28898 = 10982
df1 is a TF-IDF data frame made from a concat of two other data frames with rows 17916 + 10982 = 28898.
how I made df2 is
frames = [data, prediction_data]
df2 = pd.concat(frames)
I am not able to find the exact reason for this problem. Can someone please help?
I think I solved it by resetting the index while creating df2:
frames = [data, prediction_data]
df2 = pd.concat(frames).reset_index()
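A small sketch of why the duplicate index breaks the axis=1 concat and how resetting it helps (data invented; drop=True is a variant that avoids keeping the old index as an extra column):

```python
import pandas as pd

data = pd.DataFrame({'x': [1, 2]})             # index 0, 1
prediction_data = pd.DataFrame({'x': [3, 4]})  # index 0, 1 again

# Without reset_index, df2 would carry the duplicate index 0, 1, 0, 1,
# which pd.concat(axis=1) cannot align against df1's 0..3 index
df2 = pd.concat([data, prediction_data]).reset_index(drop=True)
df1 = pd.DataFrame({'y': [10, 20, 30, 40]})

out = pd.concat([df1, df2], axis=1)
print(out.shape)  # (4, 2)
```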
I am not sure I understood your question correctly, but I think what you want to do is:
data_train_test = pd.concat([df1, df2.iloc[:, [1, 2]]], axis=1)
Note that .iloc[:, ...] selects columns by position (0-based), so the 2nd and 3rd columns are at positions 1 and 2.
import pandas as pd
df1 = pd.DataFrame(data={'a':[0]})
df2 = pd.DataFrame(data={'b1':[1], 'b2':[2], 'b3':[3]})
data_train_test = pd.concat([df1,df2[df2.columns[1:3]]], axis=1)
# or
data_train_test = pd.concat([df1,df2.loc[:,df2.columns[1:3]]], axis=1)
