I have two dataframes that are structurally identical.
Both have the following format:
file_name | country_name | country_code | .....
I want to compare the two, and get a percentage of equality for each column.
The second dataframe is the test dataframe that holds the true values. Some of the values are NaN and should be ignored. So far I have managed to compare the two and get the total number of equal samples for each column; my problem is dividing each of them by the total number of relevant samples (those that are not NaN in the second dataframe) in a "nice" way.
For example:
df1
file_name | country_name
1         | a
2         | b
3         | d
4         | c
df2
file_name | country_name
1         | a
2         | b
3         | NaN
4         | d
I expect an output of 66% for this column, because 2 of the 3 relevant samples have the same value, and the row where df2 is NaN (file_name 3) is excluded from the calculation.
What I've done so far:
test_set = pd.read_excel(file_path)
test_set = test_set.astype(str)
a_set = pd.read_excel(file2_path)
a_set = a_set.astype(str)
merged_df = a_set.merge(test_set, on='file_name')
# fields is the list of column names to compare (defined elsewhere)
for field in fields:
    if field == 'file_name':
        continue
    # 1 if the values match (case-insensitively) and the test value is not NaN, else 0
    merged_df[field] = merged_df.apply(
        lambda x: 0 if x[field + '_y'] == 'nan'
        else 1 if x[field + '_x'].lower() == x[field + '_y'].lower()
        else 0,
        axis=1)
scores = merged_df.drop('file_name', axis=1).sum(axis=0)
This gives me these (correct) results:
country_name 14
country_code 0
state_name 4
state_code 59
city 74
...
But now I want to divide each of them by the total number of samples that don't contain NaN in the corresponding field of the test_set dataframe. I can think of naive ways to do this, like creating another column that holds the number of non-NaN values for each of these columns, but I'm looking for a prettier solution.
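For reference, here is a minimal sketch of the straightforward division I have in mind (assuming the merged_df, fields and scores variables from the snippet above; the '_y' columns come from test_set):
score_fields = [f for f in fields if f != 'file_name']
# count, per field, how many test_set values are not NaN (stored as the string 'nan' after astype(str))
valid_counts = pd.Series({f: (merged_df[f + '_y'] != 'nan').sum() for f in score_fields})
percentages = scores[score_fields].div(valid_counts).mul(100)
print(percentages)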
Since you have unique filenames, I would use fully vectorized operations and take advantage of index alignment:
# set the filename as index in both frames
df1b = df1.set_index('file_name')
df2b = df2.set_index('file_name')
# compare, count matches, and divide by the number of non-NA values per column
out = df1b.eq(df2b).sum().div(df2b.notna().sum())
Output:
country_name 0.666667
dtype: float64
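For completeness, here is a self-contained sketch of the same approach run on the question's example data (the frames are reconstructed here under that assumption):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'file_name': [1, 2, 3, 4],
                    'country_name': ['a', 'b', 'd', 'c']})
df2 = pd.DataFrame({'file_name': [1, 2, 3, 4],
                    'country_name': ['a', 'b', np.nan, 'd']})

df1b = df1.set_index('file_name')
df2b = df2.set_index('file_name')
out = df1b.eq(df2b).sum().div(df2b.notna().sum())
print(out)  # country_name    0.666667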
If you don't have to merge, you could use:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([
    ["1", "a"],
    ["2", np.nan],
    ["3", "c"]
])
df2 = pd.DataFrame([
    ["1", "X"],
    ["100", "b"],
    ["3", "c"]
])
# expected:
# col 0: equal = 2, ratio: 2/3
# col 1: equal = 1, ratio: 1/2
df1 = df1.sort_index()
df2 = df2.sort_index()
def get_col_ratio(col):
    colA = df1[col]
    colB = df2[col]
    # keep only rows where neither side is NaN
    mask = ~(colA.isna() | colB.isna())
    colA_ = colA[mask]
    colB_ = colB[mask]
    return (colA_.str.lower() == colB_.str.lower()).sum() / len(colA_)
ratios = pd.DataFrame([[get_col_ratio(i) for i in df1.columns]], columns=df1.columns)
print(ratios)
Or, using pd.merge:
fields = df1.columns
merged = pd.merge(df1,df2, left_index=True, right_index=True)
def get_ratio(col):
    cols = merged[[f"{col}_x", f"{col}_y"]]
    cols = cols.dropna()
    # compare the two sides case-insensitively and take the share of equal rows
    equal_rows = cols[cols.columns[0]].str.lower() == cols[cols.columns[1]].str.lower()
    return equal_rows.sum() / len(cols)
ratios = pd.DataFrame([[get_ratio(i) for i in fields]], columns=fields)
ratios
Hi, I have the following two DataFrames (each with a 2-level index):
df1 = pd.DataFrame()
df1["Index1"] = ["A", "AA"]
df1["Index2"] = ["B", "BB"]
df1 = df1.set_index(["Index1", "Index2"])
df1["Value1"] = 1
df1["Value2"] = 2
df1
df2 = pd.DataFrame()
df2["Index1"] = ["X", "XX"]
df2["Index2"] = ["Y", "YY"]
df2["Value1"] = 3
df2["Value2"] = 4
df2 = df2.set_index(["Index1", "Index2"])
df2
I would like to create a DataFrame with a 3-level index where the first level indicates which DataFrame the values are taken from. Note that all DataFrames have exactly the same columns.
How can I do this in the most automatic way? Ideally I would like to have the following solution:
# start with an empty dataframe
res = pd.DataFrame(index=pd.MultiIndex(levels=[[], [], []],
                                       codes=[[], [], []],
                                       names=["Df number", "Index1", "Index2"]),
                   columns=["Value1", "Value2"])
# AddDataFrameAtIndex is the (hypothetical) operation I'm looking for
res = AddDataFrameAtIndex(index="DF1", level=0, dfToInsert=df1)
res = AddDataFrameAtIndex(index="DF2", level=0, dfToInsert=df2)
A possible solution, based on pandas.concat:
pd.concat([df1, df2], keys=['DF1', 'DF2'], names=['DF number'])
Output:
Value1 Value2
DF number Index1 Index2
DF1 A B 1 2
AA BB 1 2
DF2 X Y 3 4
XX YY 3 4
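As a hedged follow-up (not part of the original answer): once the keys are in the first index level, each original frame can be recovered with .loc or .xs on that level, for example:
res = pd.concat([df1, df2], keys=['DF1', 'DF2'], names=['DF number'])

# select the block that came from df1; the first index level is dropped
print(res.loc['DF1'])
# equivalent cross-section on the named level
print(res.xs('DF2', level='DF number'))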
I have 2 dataframes: one has around 4.5 million rows while the other has 1,200 rows. I want to find the values of the smaller dataframe in a column of the bigger dataframe and eventually drop those records based on a True/False flag.
df1 = {'id': ['1234', '4566', '6789'], 'Name': ['Sara', 'Iris', 'Jeff'], 'Age': [10, 12, 47]}
df2 = {'id': ['1234', '4566', '1080']}
The function I wrote:
def find_match(row):
    if (row.column in df1.column.values) == (row.column in df2.column.values):
        return "True"
    else:
        return "False"

df1["flag"] = df1.apply(find_match, axis=1)
Once I run the .apply(), it takes a long time since the dataframe is huge.
You can try concatenating the two df's using pandas.concat, then dropping the duplicate rows.
import pandas as pd
df1 = pd.DataFrame({"colA":["a1", "a1", "a2", "a3"], "colB":[0,1,1,1], "colC":["A","A","B","B"]})
df2 = pd.DataFrame({"colA":["a1", "a1", "a2", "a3"], "colB":[1,1,1,1], "colC":["A","B","B","B"]})
df = pd.concat([df1, df2])
print("df: \n", df)
df_dropped = df.drop_duplicates()
print("df_dropped: \n", df_dropped)
This code returns the values from df1 that match df2.
df1 = pd.DataFrame({"id":["1234","4566","6789"], "Name":["Sara", "Iris","Jeff"], "Age":[10,12,47]})
id Name Age
0 1234 Sara 10
1 4566 Iris 12
2 6789 Jeff 47
df2 = pd.DataFrame({ "id":["1234","4566","1080"]})
id
0 1234
1 4566
2 1080
new_df = df2.merge(df1, on = "id", how = "outer")
This will return the rows that do match, as well as the rows that do not match, with NaN values in the Name and Age columns. Then you can drop the ones that match and keep only the NaN ones:
df_not_match = new_df[new_df["Name"].isna()] # will return the row id : 1080
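As a hedged side note (my own sketch, not from the answer above): the membership test itself can also be done with a vectorized isin mask, which avoids the slow row-wise apply from the question:
# flag rows of df1 whose id appears in df2, then keep or drop them
mask = df1['id'].isin(df2['id'])
matched = df1[mask]        # ids 1234 and 4566
not_matched = df1[~mask]   # id 6789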
I have a pandas dataframe like as given below
dfx = pd.DataFrame({'min_temp' :[38,36,np.nan,38,37,39],'max_temp': [41,39,39,41,43,44],
'min_hr': [89,87,85,84,82,86],'max_hr': [91,98,np.nan,94,92,96], 'min_sbp':[21,23,25,27,28,29],
'ethnicity':['A','B','C','D','E','F'],'Gender':['M','F','F','F','F','F']})
What I would like to do is:
1) Identify all columns that contain min and max.
2) Find their corresponding pair, e.g. min_temp and max_temp are a pair; similarly min_hr and max_hr.
3) Convert each pair into a single column, e.g. rel_temp for the temp pair. See below for the formula:
rel_temp = (max_temp - min_temp)/min_temp
This is what I tried. Do note that my real data has several thousand records and hundreds of columns like this.
def myfunc(n):
    return lambda a, b: ((b - a) / a)

dfx.apply(myfunc(col for col in dfx.columns))  # didn't know how to apply str.contains here
I expect my output to be like this. Please note that only the min and max columns have to be transformed; the rest of the columns in the dataframe should be left as is.
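A hedged sketch (not from the answers below) of one way to identify the min/max column pairs by name, using the dfx frame defined in the question:
# list the min_/max_ column pairs; pairs without a max_ counterpart (e.g. min_sbp) are skipped
min_cols = [c for c in dfx.columns if c.startswith('min_')]
pairs = [(c, c.replace('min_', 'max_')) for c in min_cols
         if c.replace('min_', 'max_') in dfx.columns]
print(pairs)  # [('min_temp', 'max_temp'), ('min_hr', 'max_hr')]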
The idea is to create df1 and df2 with the same column names using DataFrame.filter and rename, then subtract and divide all columns with DataFrame.sub and DataFrame.div:
df1 = dfx.filter(like='max').rename(columns=lambda x: x.replace('max','rel'))
df2 = dfx.filter(like='min').rename(columns=lambda x: x.replace('min','rel'))
df = df1.sub(df2).div(df2).join(dfx.loc[:, ~dfx.columns.str.contains('min|max')])
print (df)
rel_temp rel_hr ethnicity Gender
0 0.078947 0.022472 A M
1 0.083333 0.126437 B F
2 NaN NaN C F
3 0.078947 0.119048 D F
4 0.162162 0.121951 E F
5 0.128205 0.116279 F F
Try using:
cols = dfx.columns
con = cols[cols.str.contains('_')]
for i in con.str.split('_').str[-1].unique():
    # skip measures that do not have both a min_ and a max_ column (e.g. min_sbp)
    if 'max_%s' % i not in con or 'min_%s' % i not in con:
        continue
    df = dfx[[x for x in con if i in x]]
    dfx['rel_%s' % i] = (df['max_%s' % i] - df['min_%s' % i]) / df['min_%s' % i]
dfx = dfx.drop(con, axis=1)
print(dfx)
As a simplified example, suppose I had a DataFrame as follows:
Group Type Value1 Value2
Red A 13 24
Red B 3 12
Blue C 5 0
Red C 8 9
Green A 2 -1
Red None 56 78
Blue A 40 104
Green B 1 -5
What I want to calculate, for each Group entry, is the difference in Value1 and the difference in Value2 between the rows of Type A and Type B.
Since Red and Green are the only Groups having entries of Type A and B, we would only calculate new rows for these Groups. So the resulting DataFrame would be:
Group Type Value1 Value2
Red A-B 10 12
Green A-B 1 4
My initial idea was simply to filter for rows where Type is either 'A' or 'B' with df = df[df['Type'].isin(['A', 'B'])], then filter again for Groups that are in rows with both 'A' and 'B' as Type (not sure how to do this), then sort and apply diff().
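A hedged sketch of that outlined idea (my own, not one of the answers below; df is assumed to hold the example table above):
# keep only rows of Type A or B
ab = df[df['Type'].isin(['A', 'B'])]
# keep only groups that contain both types
ab = ab[ab.groupby('Group')['Type'].transform('nunique').eq(2)]
# one row per group, Type as columns, then A minus B
wide = ab.pivot(index='Group', columns='Type', values=['Value1', 'Value2'])
result = (wide.xs('A', axis=1, level='Type') - wide.xs('B', axis=1, level='Type')).reset_index()
result.insert(1, 'Type', 'A-B')
print(result)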
import pandas as pd
from io import StringIO

# read data using StringIO
data = StringIO("""Group,Type,Value1,Value2
Red,A,13,24
Red,B,3,12
Blue,C,5,0
Red,C,8,9
Green,A,2,-1
Red,None,56,78
Blue,A,40,104
Green,B,1,-5""")
df = pd.read_csv(data)

# create a tidyr spread-like operation
def spread(df, propcol, valcol):
    indcol = list(df.columns.drop(valcol))
    df = df.set_index(indcol).unstack(propcol).reset_index()
    df.columns = [i[1] if i[0] == valcol else i[0] for i in df.columns]
    return df

df = spread(df, 'Group', 'Type')

# create filter conditions to remove 'C'; can also do the opposite
notBlueC = df['Blue'] != 'C'
notGreenC = df['Green'] != 'C'
notRedC = df['Red'] != 'C'
clean_df = df[notBlueC & notGreenC & notRedC]
So the following code will make groups for each type, then subtract each dataframe from each other dataframe, resulting in a final dataframe that has the subtracted values. Input your dataframe as inp_df, and the dataframe you want will be final_df:
grouped = inp_df.groupby('Type')

# Getting the list of groups:
list_o_groups = list(grouped.groups.keys())

# Going through each group and subtracting the one from the other:
sub_df_dict = {}
for first_idx, first_df in enumerate(list_o_groups):
    for second_idx, second_df in enumerate(list_o_groups):
        if second_idx <= first_idx:
            continue
        key = '%s-%s' % (first_df, second_df)
        sub_df_dict[key] = pd.DataFrame()
        sub_df_dict[key]['Value1'] = grouped.get_group(first_df)['Value1'] - grouped.get_group(second_df)['Value1']
        sub_df_dict[key]['Value2'] = grouped.get_group(first_df)['Value2'] - grouped.get_group(second_df)['Value2']
        sub_df_dict[key]['Type'] = [key] * sub_df_dict[key].shape[0]

# Combining them into one df:
for idx, each_key in enumerate(sub_df_dict.keys()):
    if idx == 0:
        final_df = sub_df_dict[each_key]
    else:
        # DataFrame.append was removed in recent pandas; use pd.concat instead
        final_df = pd.concat([final_df, sub_df_dict[each_key]])

# Cleaning up the dataframe
final_df.dropna(inplace=True)
The result of this code on your example dataframe.
*EDIT - added the dropna to clean up the df.
I want to know if this is possible with pandas:
From df2, I want to create new1 and new2.
new1 is the latest date that can be found in df1 where both columns A and B match.
new2 is the latest date that can be found in df1 where column A matches but B does not.
I managed to get new1 but not new2.
Code:
import pandas as pd
d1 = [['1/1/19', 'xy','p1','54'], ['1/1/19', 'ft','p2','20'], ['3/15/19', 'xy','p3','60'],['2/5/19', 'xy','p4','40']]
df1 = pd.DataFrame(d1, columns = ['Name', 'A','B','C'])
d2 =[['12/1/19', 'xy','p1','110'], ['12/10/19', 'das','p10','60'], ['12/20/19', 'fas','p50','40']]
df2 = pd.DataFrame(d2, columns = ['Name', 'A','B','C'])
d3 = [['12/1/19', 'xy','p1','110','1/1/19','3/15/19'], ['12/10/19', 'das','p10','60','0','0'], ['12/20/19', 'fas','p50','40','0','0']]
dfresult = pd.DataFrame(d3, columns = ['Name', 'A','B','C','new1','new2'])
Updated!
IIUC, you want to add two columns to df2: new1 and new2.
First I modified two things:
df1 = pd.DataFrame(d1, columns = ['Name1', 'A','B','C'])
df2 = pd.DataFrame(d2, columns = ['Name2', 'A','B','C'])
df1.Name1 = pd.to_datetime(df1.Name1)
I renamed Name to Name1 and Name2 for ease of use. Then I turned Name1 into a real date, so we can get the maximum date by group.
Then we merge df2 with df1 on the A column. This gives us the rows that match on that column:
aux = df2.merge(df1, on='A')
Then, where the B column is the same in both dataframes, we take Name1:
df2['new1'] = df2.index.map(aux[aux.B_x==aux.B_y].Name1).fillna(0)
If they're different, we take the maximum date for every A group:
df2['new2'] = df2.A.map(aux[aux.B_x!=aux.B_y].groupby('A').Name1.max()).fillna(0)
Output:
Name2 A B C new1 new2
0 12/1/19 xy p1 110 2019-01-01 00:00:00 2019-03-15 00:00:00
1 12/10/19 das p10 60 0 0
2 12/20/19 fas p50 40 0 0
You can do this by:
standard merge based on A
removing all entries which match B values
sorting for dates
dropping duplicates on A, keeping last date (n.b. assumes dates are in date format, not as strings!)
merging back on id
Thus:
source = df1.copy() # renamed
v = df2.merge(source, on='A', how='left') # get all values where df2.A == source.A
v = v[v['B_x'] != v['B_y']] # drop entries where B values are the same
nv = v.sort_values(by=['Name_y']).drop_duplicates(subset=['Name_x'], keep='last')
df2.merge(nv[['Name_y', 'Name_x']].rename(columns={'Name_y': 'new2', 'Name_x': 'Name'}),
          on='Name', how='left')  # keeps non-matching, consider inner
This yields:
Out[94]:
Name A B C new2
0 12/1/19 xy p1 110 3/15/19
1 12/10/19 das p10 60 NaN
2 12/20/19 fas p50 40 NaN
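A hedged addition to the n.b. above (not part of the original answer): if Name is still stored as a string, convert it to datetime before merging and sorting, so the "latest date" is chosen chronologically rather than lexicographically:
# assumes the Name columns hold date strings as in the question
df1['Name'] = pd.to_datetime(df1['Name'])
df2['Name'] = pd.to_datetime(df2['Name'])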
My initial thought was to do something like the code below. Sadly, it is not elegant. Generally, this sort of approach to determining a value is frowned upon, mostly because it fails to scale and gets especially slow with large data.
def find_date(row, source=df1):  # renamed df1 to source
    t = source[source['B'] != row['B']]
    t = t[t['A'] == row['A']]
    if t.empty:
        return None
    # the date column here is 'Name'; sort descending and take the latest
    return t.sort_values(by='Name', ascending=False)['Name'].iloc[0]

df2['new2'] = df2.apply(find_date, axis=1)