How do I compare 2 data frames and remove the rows that have similar values?
df = pd.read_csv('trace_id.csv')
df1 = pd.read_csv('people.csv')
combinedf = pd.concat([df, df1], axis=1)
df contains the column 'trace_id', and df1 contains the columns 'index' and 'name'. Notice that the trace_id and index values are very similar: 'TRACE_PERSON_0000000003' and 'PERSON_0000000003' respectively. How do I remove the rows whose trace_id and index values match in this way?
Example: trace_id = 'TRACE_PERSON_0000000003' matches index = 'PERSON_0000000003', so that row's trace_id, index and name will all be removed. 'PERSON_0000000000' is not found in the trace_id column, so 'PERSON_0000000000' and 'Amy Berger' will be retained in the data frame.
Hard to be sure without example data, but you can:
delete TRACE_ from the trace_id column of df
merge on the trimmed trace_id and index, passing indicator=True to get a merge indicator column named _merge
exclude rows where _merge == 'both'
df = pd.read_csv('trace_id.csv')
df1 = pd.read_csv('people.csv')
# Delete "TRACE_" from `trace_id` column
df['trace_id_trimmed'] = df['trace_id'].str.replace('TRACE_', '', regex=False)
# Outer merge with indicator column
merged = df.merge(df1,
                  how='outer',
                  left_on='trace_id_trimmed',
                  right_on='index',
                  indicator=True)
# Exclude rows where merge key was found in both DataFrames
merged = merged[merged['_merge'] != 'both'].copy()
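If you don't want the bookkeeping columns in the result, a short cleanup sketch (column names as in the snippet above):
# Drop the helper columns once the filtering is done
merged = merged.drop(columns=['trace_id_trimmed', '_merge'])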
Imagine we have 4 dataframes
df1(35000, 20)
df2(12000, 21)
df3(323, 18)
df4(220, 6)
Here is where it gets tricky:
df4 was created by a merge of df3 and df2 based on 1 column.
It took 3 columns from df3 and 3 columns from df2. (that is why it has 6 cols in total)
What I want is the following: I wish to create an extra column in df1 and insert specific values for the rows that have the same value in a specific column of df1 and df3. For this reason I have done the following:
df1['new col'] = df1['Name'].isin(df3['Name'])
Now my new column is filled with True/False values indicating whether the value in the 'Name' column is the same for both df1 and df3. So far so good, but what I want is to fill this new column with the values of a specific column from df2. I tried the following:
df1['new col'] = df1['new col'].map({True: df2['Address'], False: 'no address inserted'})
However, it inserts all of the address values from df2 into each cell, instead of only the one value that is needed. Any ideas?
I also tried the following
merged = df2.merge(df4, how='left', left_on='Name', right_on='First Name', indicator=True)
df1['Code'] = np.where(merged['_merge'] == 'both', merged['Address'], 'n.a.')
but I get the following error
Length of values (1210) does not match length of index (35653)
Merge using how='left' and then fill in the missing values with fillna.
merged = df2.merge(df4, how='left', left_on='Name', right_on='First Name', indicator=True)
# address_column is the name (or list of names) of the column(s) whose NaNs you want replaced
merged[address_column] = merged[address_column].fillna('n.a.')
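A minimal, self-contained sketch of that approach; the toy frames and the 'Name'/'Address' columns here are assumptions standing in for your real data:
import pandas as pd

# Hypothetical stand-ins for the real frames
df1 = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cleo']})
df2 = pd.DataFrame({'Name': ['Ann', 'Cleo'], 'Address': ['1 Elm St', '9 Oak Ave']})

# Left merge keeps every row of df1; names missing from df2 get NaN addresses
merged = df1.merge(df2, how='left', on='Name')
merged['Address'] = merged['Address'].fillna('n.a.')
print(merged)
#    Name    Address
# 0   Ann   1 Elm St
# 1   Bob       n.a.
# 2  Cleo  9 Oak Ave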
I have a dictionary of dataframes. Each of these dataframes has a column 'defrost_temperature'. What I want to do is make one new dataframe that collects all of those columns, keeping them as separate columns.
This is what I am doing right now:
merged_defrosts = pd.DataFrame()
for key in df_dict.keys():
    merged_defrosts[key] = df_dict[key]["defrost_temperature"]
But unfortunately, only the first column is filled correctly. The other columns are filled entirely with NaN.
The different defrosts are not necessarily the same length. (the fourth dataframe is 108 rows, the others are 109 rows)
You can try pd.merge on the index, always merging from the larger frame:
df_result = pd.DataFrame()
for i, df in enumerate(df_dict.values()):
    s1, s2 = f'_{i}', f'_{i+1}'
    m1, m2 = df_result.shape[0], df.shape[0]
    if m1 == 0:
        # First iteration: start from the first dataframe
        df_result = df
    elif m1 >= m2:
        df_result = df_result.merge(df, how='left', left_index=True, right_index=True, suffixes=(s1, s2))
    else:
        df_result = df.merge(df_result, how='left', left_index=True, right_index=True, suffixes=(s2, s1))
This does produce unwieldy column names, though; you can rename them manually afterwards.
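For instance, a hedged sketch of that rename step; the generated suffixed names below are hypothetical, so inspect df_result.columns first:
# See what the merge loop actually produced before renaming
print(df_result.columns)
df_result = df_result.rename(columns={
    'defrost_temperature_0': 'defrost_0',  # hypothetical generated name
    'defrost_temperature_1': 'defrost_1',
})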
You could concat the columns horizontally after resetting each index so that the rows align by position:
merged_defrosts = pd.concat(
    [df["defrost_temperature"].reset_index(drop=True).rename(key)
     for key, df in df_dict.items()],
    axis=1,
)
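A quick hedged demo with made-up temperatures, showing that the shorter frame simply pads with NaN:
import pandas as pd

# Hypothetical dict of dataframes with unequal lengths
df_dict = {
    'a': pd.DataFrame({'defrost_temperature': [1.0, 2.0, 3.0]}),
    'b': pd.DataFrame({'defrost_temperature': [4.0, 5.0]}),
}
merged_defrosts = pd.concat(
    [df['defrost_temperature'].reset_index(drop=True).rename(key)
     for key, df in df_dict.items()],
    axis=1,
)
print(merged_defrosts)
#      a    b
# 0  1.0  4.0
# 1  2.0  5.0
# 2  3.0  NaN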
I have a DataFrame df1 with the columns below. I need to compare the headers of the columns in df1 with a list of headers from df2:
df1 =['a','b','c','d','f']
df2 =['a','b','c','d','e','f']
I need to compare df1 with df2 and, if any columns are missing, add them to df1 with blank values.
I tried concat and also append, and neither worked: with concat I'm not able to add the column 'e', and with append it appends all the columns from both df1 and df2. How would I get only the missing columns added to df1, in the same order?
df1_cols = df1.columns
df2_cols = df2.columns
if (df1_cols == df2_cols).all():
    df1.to_csv(path + file_name, sep='|')
else:
    print("something is missing, continuing")
    # pd.concat([my_df, flat_data_frame], ignore_index=False, sort=False)
    all_list = my_df.append(flat_data_frame, ignore_index=False, sort=False)
I wanted to see the results as
a|b|c|d|e|f - > headers
1|2|3|4||5 -> values
pandas.DataFrame.align
df1.align(df2, axis=1)[0]
By default this does an 'outer' join
By specifying axis=1 we focus on columns
This returns a tuple of the aligned df1 and df2, with the calling dataframe as the first element, so I grab the aligned df1 with [0].
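A small hedged demo, assuming two one-row toy frames with the column sets from the question:
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, 4, 5]], columns=['a', 'b', 'c', 'd', 'f'])
df2 = pd.DataFrame([[0, 0, 0, 0, 0, 0]], columns=['a', 'b', 'c', 'd', 'e', 'f'])

# Outer-align the columns; 'e' appears in df1 filled with NaN
aligned_df1 = df1.align(df2, axis=1)[0]
print(aligned_df1)
#    a  b  c  d   e  f
# 0  1  2  3  4 NaN  5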
pandas.DataFrame.reindex
df1.reindex(columns=df1.columns.union(df2.columns))
You can treat pandas.Index objects like sets most of the time, and df1.columns.union(df2.columns) is the union of those two index objects (newer pandas versions deprecate the | operator for this, so .union() is the safer spelling). I then reindex using the result.
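A matching hedged sketch for the reindex route, reusing the toy frames from the previous snippet:
# Union of the two column sets; .union() sorts, giving alphabetical order
df1_fixed = df1.reindex(columns=df1.columns.union(df2.columns))
print(df1_fixed.columns.tolist())
# ['a', 'b', 'c', 'd', 'e', 'f']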
Let's first create the two dataframes:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.random((5, 5)), columns=['a', 'b', 'c', 'd', 'f'])
df2 = pd.DataFrame(np.random.random((5, 7)), columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
Now add those columns of df2 to df1 (with NaN values) that are not already in df1:
for i in list(df2):
    if i not in list(df1):
        df1[i] = np.nan
Now order the columns of df1 alphabetically:
df1 = df1[sorted(list(df1))]
I have two dataframes, dataframe1 and dataframe2. They both share the same data in a particular column; let's call this column 'share1' in dataframe1 and 'share2' in dataframe2.
The issue is that there are instances where 'share1' in dataframe1 has only one row with a particular value (let's call it 'c34z'), but dataframe2 has multiple rows with the value 'c34z' in the 'share2' column.
What I would like to do is, in the new merged dataframe, place those extra values in new columns.
So the number of columns in the new dataframe will be the maximum number of duplicates for a particular value in 'share2', and for rows where there was only a unique value in 'share2', the added columns will be blank.
You can use cumcount to create an additional key, then pivot df2:
newdf2 = (df2.assign(key=df2.groupby('share2').cumcount(), v=df2.share2)
             .pivot_table(index='share2', columns='key', values='v', aggfunc='first'))
After this, reindex the pivoted frame to df1's keys and concat it onto df1:
newdf2 = newdf2.reindex(df1.share1)
newdf2.index = df1.index
yourdf = pd.concat([df1, newdf2], axis=1)
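A minimal hedged sketch of that recipe with made-up data; note that here I spread a hypothetical payload column 'b' across the new columns (the snippet above pivots the key itself):
import pandas as pd

# Made-up data: 'c34z' appears once in df1 but twice in df2
df1 = pd.DataFrame({'share1': ['c34z', 'x9'], 'a': [1, 2]})
df2 = pd.DataFrame({'share2': ['c34z', 'c34z', 'x9'], 'b': [10, 20, 30]})

newdf2 = (df2.assign(key=df2.groupby('share2').cumcount(), v=df2.b)
             .pivot_table(index='share2', columns='key', values='v', aggfunc='first'))
newdf2 = newdf2.reindex(df1.share1)
newdf2.index = df1.index
yourdf = pd.concat([df1, newdf2], axis=1)
print(yourdf)
#   share1  a     0     1
# 0   c34z  1  10.0  20.0
# 1     x9  2  30.0   NaN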
Loading Data:
import pandas as pd
df1 = {'key': ['c34z', 'c34z_2'], 'value': ['x', 'y']}
df2 = {'key': ['c34z', 'c34z_2', 'c34z_2'], 'value': ['c34z_value', 'c34z_2_value', 'c34z_2_value']}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
Convert df2 by grouping and pivoting:
df2_pivot = (df2.groupby('key')['value']
                .apply(lambda s: s.reset_index(drop=True))
                .unstack()
                .reset_index())
Merge df1 and df2_pivot:
df_merged = pd.merge(df1, df2_pivot, on='key')
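With the toy data above, this should give one pivoted column per duplicate key, with NaN where a key had fewer rows:
print(df_merged)
#       key value             0             1
# 0    c34z     x    c34z_value           NaN
# 1  c34z_2     y  c34z_2_value  c34z_2_value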
I am reading in an Excel spreadsheet about schools with three sheets, as follows.
import sys
import pandas as pd

inputfile = sys.argv[1]
xl = pd.ExcelFile(inputfile)
print(xl.sheet_names)

df1 = xl.parse(xl.sheet_names[0], skiprows=14)
df2 = xl.parse(xl.sheet_names[1], skiprows=14)
df3 = xl.parse(xl.sheet_names[2], skiprows=14)

# Relabel the columns 'A', 'B', 'C', ... to match the spreadsheet letters
df1.columns = [chr(65 + i) for i in range(len(df1.columns))]
df2.columns = df1.columns
df3.columns = df1.columns
The unique id for each school is in column 'D' in each of the three dataframes. I would like to make a new dataframe which has two columns. The first is the sum of column 'G' from df1, df2, df3 and the second is the sum of column 'K' from df1, df2, df3. In other words, I think I need the following steps.
Filter rows for which unique column 'D' ids actually exist in all three dataframes. If the school doesn't appear in all three sheets then I discard it.
For each remaining row (school), add up the values in column 'G' in the three dataframes.
Do the same for column 'K'.
I am new to pandas but how should I do this? Somehow the unique ids have to be used in steps 2 and 3 to make sure the values that are added correspond to the same school.
Attempted solution
df1 = df1.set_index('D')
df2 = df2.set_index('D')
df3 = df3.set_index('D')
df1['SumK']= df1['K'] + df2['K'] + df3['K']
df1['SumG']= df1['G'] + df2['G'] + df3['G']
After concatenating the dataframes, you can use groupby and size to find the values of "D" that appear three times; since each id occurs at most once per dataframe, a count of three means the school exists in all three sheets. You can then use this to filter the concatenated dataframe and sum whichever columns you need, e.g.:
df = pd.concat([df1, df2, df3])
counts = df.groupby('D').size()
criteria = df.D.isin(counts[counts == 3].index)
df[criteria].groupby('D')[['G', 'K']].sum()
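As a hedged sanity check with made-up ids, where school 's2' is missing from the third sheet and gets dropped:
import pandas as pd

# Hypothetical minimal sheets: 's2' is absent from df3
df1 = pd.DataFrame({'D': ['s1', 's2'], 'G': [1, 2], 'K': [10, 20]})
df2 = pd.DataFrame({'D': ['s1', 's2'], 'G': [3, 4], 'K': [30, 40]})
df3 = pd.DataFrame({'D': ['s1'], 'G': [5], 'K': [50]})

df = pd.concat([df1, df2, df3])
counts = df.groupby('D').size()
criteria = df.D.isin(counts[counts == 3].index)
print(df[criteria].groupby('D')[['G', 'K']].sum())
#     G   K
# D
# s1  9  90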