creating a new column in a dataframe based on 4 other dataframes - python

Imagine we have 4 dataframes
df1(35000, 20)
df2(12000, 21)
df3(323, 18)
df4(220, 6)
Here is where it gets tricky:
df4 was created by a merge of df3 and df2 based on 1 column.
It took 3 columns from df3 and 3 columns from df2. (that is why it has 6 cols in total)
What I want is the following: I wish to create an extra column in df1 and fill it with specific values for the rows whose value in a particular column is the same in df1 and df3. To that end I have done the following:
df1['new col'] = df1['Name'].isin(df3['Name'])
Now my new column is filled with True/False depending on whether the value in the Name column appears in both df1 and df3. So far so good, but what I actually want is to fill this new column with the values of a specific column from df2. I tried the following:
df1['new col'] = df1['Name'].map({True:df2['Address'],False:'no address inserted'})
However, it inserts all the address values from df2 into each cell instead of only the one value that is needed. Any ideas?
I also tried the following
merged = df2.merge(df4, how='left', left_on='Name', right_on='First Name', indicator=True)
df1['Code'] = np.where(merged['_merge'] == 'both', merged['Address'], 'n.a.')
but I get the following error
Length of values (1210) does not match length of index (35653)

Merge using how='left' and then fill the missing values with fillna.
merged = df2.merge(df4, how='left', left_on='Name', right_on='First Name', indicator=True)
# address_column is the name (or list of names) of the column(s) whose NaNs you want to replace
merged[address_column] = merged[address_column].fillna('n.a.')
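One way to get a single address per row, rather than the whole Address column in every cell, is to turn df2 into a Name-to-Address lookup and pass it to Series.map. This is a minimal sketch, assuming df2 has 'Name' and 'Address' columns and that one address per name is wanted; the sample rows are made up for illustration.

```python
import pandas as pd

# Made-up sample data; only the column names 'Name' and 'Address' come from the question.
df1 = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cid']})
df2 = pd.DataFrame({'Name': ['Ann', 'Cid'], 'Address': ['1 Main St', '9 Oak Ave']})

# Build a Series indexed by Name so that map() can look up one address per row.
# drop_duplicates keeps the first address when a name occurs more than once.
lookup = df2.drop_duplicates('Name').set_index('Name')['Address']

# Names missing from df2 map to NaN, which fillna() replaces with the placeholder.
df1['new col'] = df1['Name'].map(lookup).fillna('no address inserted')
```

If only names that also appear in df3 should receive an address, the lookup could be narrowed first with lookup[lookup.index.isin(df3['Name'])].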

Related

adding a new column in a dataframe based on another dataframe column

Let's assume we have the following 2 dataframes:
df1(36000, 20) and df2(80,6)
They have 3 columns in common(let's say Name, Last Name, Date)
df1 includes the data of df2 (minus the data in the 3 different columns) and of course some extra information.
df2 has a column that I am interested in (let's call it Rent).
What I want is to create an extra column in df1 that holds "Overdue" for the rows that also appear in df2 and "Due" for the rows that do not, while keeping the rest of the columns in df1.
I tried the following
merged = df1.merge(df2, how='left', on=list(df1.columns),
indicator=True)
df1['Rent'] = np.where(merged['_merge'] == 'both', 'Overdue', 'Due')
However I get an error due to the fact that not all columns of df1 exist in df2. Any ideas?
Also I tried the following
df1['Rent'].apply(lambda x: 'Overdue' if df1['Name'].isin(df2['Name']) else 'Due')
but I am getting the following error
AttributeError: 'function' object has no attribute 'df2'
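Try this:
df1['Rent'] = np.where(df1['Name'].isin(df2['Name']), 'Overdue', 'Due')
The main point is not to use .apply(): isin() already returns one boolean per row, and np.where turns that whole column into labels in a single vectorized step.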
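The merge-based attempt from the question can also work if the merge uses only the shared key columns instead of every column of df1. The sketch below assumes the three common columns are called Name, Last Name and Date, as the question suggests; the sample values and the 'Extra' column are invented.

```python
import numpy as np
import pandas as pd

# Invented sample data; 'Extra' stands in for df1's additional columns.
df1 = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cid'],
    'Last Name': ['Lee', 'Ray', 'Fox'],
    'Date': ['2020-01-01', '2020-01-02', '2020-01-03'],
    'Extra': [1, 2, 3],
})
df2 = pd.DataFrame({
    'Name': ['Bob'],
    'Last Name': ['Ray'],
    'Date': ['2020-01-02'],
    'Rent': [500],
})

keys = ['Name', 'Last Name', 'Date']

# Merge on the shared keys only; drop_duplicates keeps the row count equal to df1's.
merged = df1.merge(df2[keys].drop_duplicates(), on=keys, how='left', indicator=True)

# A left merge preserves the order of df1's rows, so the labels line up positionally.
df1['Rent'] = np.where(merged['_merge'] == 'both', 'Overdue', 'Due')
```

Matching on all three keys is stricter than matching on Name alone, which matters when different people share a first name.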

Compare 2 Data frames for partial similarities

How do I compare 2 data frames and remove the rows that have similar values?
df = pd.read_csv('trace_id.csv')
df1 = pd.read_csv('people.csv')
combinedf = pd.concat([df, df1], axis=1)
df contains the column 'trace_id', and df1 contains the columns 'index' and 'name'. Notice that trace_id and index hold very similar values, for example 'TRACE_PERSON_0000000003' and 'PERSON_0000000003' respectively. How do I remove the rows that have matching trace_id and index values?
Example: for trace_id = 'TRACE_PERSON_0000000003' and index = 'PERSON_0000000003', the trace_id, index and name will all be removed. 'PERSON_0000000000' is not found in the trace_id column, so 'PERSON_0000000000' and 'Amy Berger' will be retained in the data frame.
Hard to be sure without example data, but you can:
delete TRACE_ from the trace_id column of df
merge on the trimmed trace_id and index, passing indicator=True to get a merge indicator column named _merge
exclude rows where _merge == 'both'
df = pd.read_csv('trace_id.csv')
df1 = pd.read_csv('people.csv')
# Delete "TRACE_" from `trace_id` column
df['trace_id_trimmed'] = df['trace_id'].str.replace('TRACE_', '')
# Outer merge with indicator column
merged = df.merge(df1,
                  how='outer',
                  left_on='trace_id_trimmed',
                  right_on='index',
                  indicator=True)
# Exclude rows where merge key was found in both DataFrames
merged = merged[merged['_merge'] != 'both'].copy()
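Since the answer was written without example data, here is the same approach run on a tiny in-memory stand-in for the two CSV files. The ids and 'Amy Berger' come from the question; the second name is a placeholder, and regex=False is added only to make the literal replacement explicit.

```python
import pandas as pd

# In-memory stand-ins for trace_id.csv and people.csv.
df = pd.DataFrame({'trace_id': ['TRACE_PERSON_0000000003']})
df1 = pd.DataFrame({
    'index': ['PERSON_0000000000', 'PERSON_0000000003'],
    'name': ['Amy Berger', 'Placeholder Name'],  # second name is hypothetical
})

# Strip the literal "TRACE_" prefix so both key columns use the same format.
df['trace_id_trimmed'] = df['trace_id'].str.replace('TRACE_', '', regex=False)

# The outer merge keeps unmatched rows from both sides and labels each row's origin.
merged = df.merge(df1, how='outer', left_on='trace_id_trimmed',
                  right_on='index', indicator=True)

# Drop the rows whose key was found in both frames.
remaining = merged[merged['_merge'] != 'both'].drop(columns='_merge')
```

Here only the 'PERSON_0000000000' row survives, because its key has no counterpart in the trace file.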

Compare data of two columns of one dataframe with two columns of another dataframe and find mismatch data

I have a dataframe df1 as follows:
The second dataframe, df2, is as follows:
and I want the resulting dataframe to be as follows:
Dataframes df1 and df2 contain a large number of columns and rows, but here I am showing sample data. My goal is to compare the Customer and ID columns of df1 with the Customer and Part Number columns of df2, in order to find the (Customer, ID) pairs in df1 that have no matching (Customer, Part Number) pair in df2, and to store those mismatches in another dataframe, df3. For example, Customer (rishab) with ID (89ab) is present in df1 but not in df2, so its Customer, Order#, and Part are stored in df3.
I am using the isin() method to find the mismatches between df1 and df2 for one column only, but I am not able to do it for a comparison of two columns.
df3 = df1[~df1['ID'].isin(df2['Part Number'].values)]
#here I am only able to find mismatch based upon only 1 column ID but I want to include Customer also
I could also use a loop, but the data is very large (the running time would grow) and I am sure there is a one-liner to achieve this task. I have also tried to use merge but was not able to produce the exact output.
So, how do I produce this exact output? I am not able to use isin() for two columns, and I think isin() cannot be used for two columns.
The easiest way to achieve this is:
df3 = df1.merge(df2, left_on = ['Customer', 'ID'],right_on= ['Customer', 'Part Number'], how='left', indicator=True)
df3.reset_index(inplace = True)
df3 = df3[df3['_merge'] == 'left_only']
Here, you first do a left join on the columns with indicator=True, which adds a column named _merge indicating on which side each row exists; then you keep only the rows marked left_only.
You can also try an outer join to get the non-matching rows. Something like df3 = df1.merge(df2, left_on=['Customer', 'ID'], right_on=['Customer', 'Part Number'], how='outer', indicator=True), then keep the rows where _merge is not 'both'.
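The left-join-with-indicator idea can be sketched end to end as below. Only the column names and the rishab/89ab pair come from the question; every other value is invented, and renaming 'Part Number' to 'ID' before merging is just one way to keep extra columns out of the result.

```python
import pandas as pd

# Invented sample rows, apart from the ('rishab', '89ab') pair mentioned in the question.
df1 = pd.DataFrame({
    'Customer': ['rishab', 'rishab', 'akash'],
    'ID': ['89ab', '12cd', '34ef'],
    'Order#': [1, 2, 3],
    'Part': ['bolt', 'nut', 'gear'],
})
df2 = pd.DataFrame({
    'Customer': ['rishab', 'akash'],
    'Part Number': ['12cd', '34ef'],
})

# Keep only df2's key columns, renamed to match df1, so the merge adds nothing but _merge.
keys = (df2[['Customer', 'Part Number']]
        .rename(columns={'Part Number': 'ID'})
        .drop_duplicates())

flagged = df1.merge(keys, on=['Customer', 'ID'], how='left', indicator=True)

# 'left_only' marks (Customer, ID) pairs that exist in df1 but not in df2.
df3 = flagged.loc[flagged['_merge'] == 'left_only', ['Customer', 'Order#', 'Part']]
```

Because both key columns take part in the merge, a row counts as a match only when the customer and the id agree together, which a single-column isin() cannot express.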

Pandas: Merge 2 dataframes based on a column values; for multiple rows containing same column value, append those to different columns

I have two dataframes, dataframe1 and dataframe2. They both share the same data in a particular column; let's call this column 'share1' in dataframe1 and 'share2' in dataframe2.
The issue is that there are instances where dataframe1 has only one row with a particular value in 'share1' (let's call it 'c34z'), but dataframe2 has multiple rows with the value 'c34z' in the 'share2' column.
What I would like to do is, in the new merged dataframe, place each of those additional values in a new column.
So the number of added columns in the new dataframe will be the maximum number of duplicates for a particular value in 'share2'. For rows whose value is unique in 'share2', the rest of the added columns will be blank.
You can use cumcount to create an additional key, then pivot df2:
newdf2=df2.assign(key=df2.groupby('share2').cumcount(),v=df2.share2).pivot_table(index='share2',columns='key',values='v',aggfunc='first')
After this, use reindex to align the pivoted frame with df1 and concat the two:
newdf2=newdf2.reindex(df1.share1)
newdf2.index=df1.index
yourdf=pd.concat([df1,newdf2],axis=1)
Loading Data:
import pandas as pd
df1 = {'key': ['c34z', 'c34z_2'], 'value': ['x', 'y']}
df2 = {'key': ['c34z', 'c34z_2', 'c34z_2'], 'value': ['c34z_value', 'c34z_2_value', 'c34z_2_value']}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
Convert df2 by grouping and pivoting
df2_pivot = df2.groupby('key')['value'].apply(lambda df: df.reset_index(drop=True)).unstack().reset_index()
Merge df1 and df2_pivot
df_merged = pd.merge(df1, df2_pivot, on='key')
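The cumcount-and-pivot idea can also be written against the 'share1' and 'share2' column names from the question. This is only a sketch: the 'other' and 'info' columns, the value 'a11b', and the 'info_N' naming scheme are all hypothetical.

```python
import pandas as pd

# Hypothetical sample data; only 'share1', 'share2' and 'c34z' come from the question.
dataframe1 = pd.DataFrame({'share1': ['c34z', 'a11b'], 'other': [1, 2]})
dataframe2 = pd.DataFrame({'share2': ['c34z', 'c34z', 'a11b'],
                           'info': ['x', 'y', 'z']})

# Number the repeats of each share2 value: 0 for the first occurrence, 1 for the second, ...
occurrence = dataframe2.groupby('share2').cumcount()

# Spread the repeats across columns, one column per occurrence number.
wide = (dataframe2.assign(occurrence=occurrence)
        .pivot(index='share2', columns='occurrence', values='info'))
wide.columns = [f'info_{i}' for i in wide.columns]

# Attach the widened values to dataframe1; values with fewer repeats are left as NaN.
merged = dataframe1.merge(wide, left_on='share1', right_index=True, how='left')
```

The number of added columns equals the largest number of repeats of any single 'share2' value, which is the layout the question asks for.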

How to sum columns from three different dataframes with a common key

I am reading in an excel spreadsheet about schools with three sheets as follows.
import sys
import pandas as pd
inputfile = sys.argv[1]
xl = pd.ExcelFile(inputfile)
print(xl.sheet_names)
df1 = xl.parse(xl.sheet_names[0], skiprows=14)
df2 = xl.parse(xl.sheet_names[1], skiprows=14)
df3 = xl.parse(xl.sheet_names[2], skiprows=14)
df1.columns = [chr(65+i) for i in range(len(df1.columns))]
df2.columns = df1.columns
df3.columns = df1.columns
The unique id for each school is in column 'D' in each of the three dataframes. I would like to make a new dataframe which has two columns. The first is the sum of column 'G' from df1, df2, df3 and the second is the sum of column 'K' from df1, df2, df3. In other words, I think I need the following steps.
1. Filter rows for which unique column 'D' ids actually exist in all three dataframes. If the school doesn't appear in all three sheets then I discard it.
2. For each remaining row (school), add up the values in column 'G' in the three dataframes.
3. Do the same for column 'K'.
I am new to pandas but how should I do this? Somehow the unique ids have to be used in steps 2 and 3 to make sure the values that are added correspond to the same school.
Attempted solution
df1 = df1.set_index('D')
df2 = df2.set_index('D')
df3 = df3.set_index('D')
df1['SumK']= df1['K'] + df2['K'] + df3['K']
df1['SumG']= df1['G'] + df2['G'] + df3['G']
After concatenating the dataframes, you can use groupby and size to find the values of "D" that occur three times, i.e. that exist in all three dataframes, since each id appears only once per dataframe. You can then use this to filter the concatenated dataframe and sum whichever columns you need (this assumes "D" is still a regular column, not the index), e.g.:
df = pd.concat([df1, df2, df3])
counts = df.groupby('D').size()
criteria = df['D'].isin(counts[counts == 3].index)
df[criteria].groupby('D')[['G', 'K']].sum()
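To make the filtering step concrete, here is a small self-contained run of the concatenate, count, and sum approach. The three frames are invented stand-ins for the spreadsheet sheets, and the sketch assumes each school id occurs at most once per sheet.

```python
import pandas as pd

# Invented sample sheets; columns D (school id), G and K follow the question's naming.
df1 = pd.DataFrame({'D': [1, 2, 3], 'G': [10, 20, 30], 'K': [1, 2, 3]})
df2 = pd.DataFrame({'D': [1, 2], 'G': [11, 21], 'K': [4, 5]})
df3 = pd.DataFrame({'D': [2, 3], 'G': [22, 32], 'K': [6, 7]})

frames = [df1, df2, df3]
stacked = pd.concat(frames, ignore_index=True)

# With at most one row per id per sheet, a count equal to the number of
# sheets means the school is present in every sheet.
counts = stacked.groupby('D').size()
in_all = counts[counts == len(frames)].index

# Sum G and K per school, restricted to schools found in all sheets.
totals = stacked[stacked['D'].isin(in_all)].groupby('D')[['G', 'K']].sum()
```

Only school 2 appears in all three sample sheets, so totals has a single row with G = 63 and K = 13.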
