Difference between values based on condition - python

I have two dataframes and I want to find the difference between dataframe1 and dataframe2 based on the condition. What I mean is the following:
df.ref_well:
zone depth
a 34
b 23
c 11
d 35
e -9999
df_well
zone depth
a 17
c 15
d 25
f 11
what I want is to generate the df3 with the zone name and the difference between depth of the same zones in df1 and df3:
df3 = well- ref well (the same zones)
zone depth
a 17
b -9999
c -4
d 10
e -9999
I have tried to iterate through dfs separately and identify the same zones, and if they are equal to find the difference:
ref_well_zone_count=len(df_ref_well.iloc[:,0])
well_zone_count=len(df_well.iloc[:,0])
delta_depth=[]
for ref_zone in range(ref_well_zone_count):
for well_zone in range(well_zone_count):
if df_ref_well.iloc[ref_zone,0]==df_well.iloc[well_zone,0]:
delta_tvdss.append(df_well.iloc[well_zone, 1] - df_ref_well.iloc[ref_zone, 1])
The problem is I can't fill the results into the new column I am not able to insert them, so when I try adding the delta_depth as a column it says that:
ValueError: Length of values does not match length of index
But if I print out the results it calculates perfectly

You didn't specify what you want to do if there is no match. So I will assume no match means depth = 0
Link 2 df together using merge, then fill those that doesn't have a match will have 0 by default:
df3 = pd.merge(ref_well,df_well, on=['zone'], how='outer').fillna(0)
Calculate the difference and put it back
df3['diff'] = df3.depth_x - df3.depth_y

Related

How to compare 2 data frames in python and highlight the differences?

I am trying to compare 2 files one is in xls and other is in csv format.
File1.xlsx (not actual data)
Title Flag Price total ...more columns
0 A Y 12 300
1 B N 15 700
2 C N 18 1000
..
..
more rows
File2.csv (not actual data)
Title Flag Price total ...more columns
0 E Y 7 234
1 B N 16 600
2 A Y 12 300
3 C N 17 1000
..
..
more rows
I used Pandas and moved those files to data frame. There is no unique columns(to make id) in the files and there are 700K records to compare. I need to compare File 1 with File 2 and show the differences. I have tried few things but I am not getting the outliers as expected.
If I use merge function as below, I am getting output with the values only for File 1.
diff_df = df1.merge(df2, how = 'outer' ,indicator=True).query('_merge == "left_only"').drop(columns='_merge')
output I am getting
Title Attention_Needed Price total
1 B N 15 700
2 C N 18 1000
This output is not showing the correct diff as record with Title 'E' is missing
I also tried using panda merge
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
& output for above was
Title Flag Price total Exist
0 A Y 12 300 both
1 B N 15 700 left_only
2 C N 18 1000 left_only
3 E Y 7 234 right_only
4 B N 16 600 right_only
5 C N 17 1000 right_only
Problem with above output is it is showing records from both the data frames and it will be very difficult if there are 1000 of records in each data frame.
Output I am looking for (for differences) by adding extra column("Comments") and give message as matching, exact difference, new etc. or on the similar lines
Title Flag Price total Comments
0 A Y 12 300 matching
1 B N 15 700 Price, total different
2 C N 18 1000 Price different
3 E Y 7 234 New record
If above output can not be possible, then please suggest if there is any other way to solve this.
PS: This is my first question here, so please let me know if you need more details here.
Rows in DF1 Which Are Not Available in DF2
df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']
Rows in DF2 Which Are Not Available in DF1
df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='right_only']
If you're differentiating by row not column
pd.concat([df1,df2]).drop_duplicates(keep=False)
If each df has the same columns and each column should be compared individually
for col in data.columns:
set(df1.col).symmetric_difference(df2.col)
# WARNING: this way of getting column diffs likely won't keep row order
# new row order will be [unique_elements_from_df1_REVERSED] concat [unique_elements_from_df2_REVERSED]
lets assume df1 (left) is our "source of truth" for what's considered an original record.
after running
diff_df = df1.merge(df2, how = 'outer' ,indicator=True).query('_merge == "left_only"').drop(columns='_merge')
take the output and split it into 2 df's.
df1 = diff_df[diff_df["Exist"] in ["both", "left_only"]]
df2 = diff_df[diff_df["Exist"] == "right_only"]
Right now, if you drop the "exist" row from df1, you'll have records where the comment would be "matching".
Let's assume you add the 'comments' column to df1
you could say that everything in df2 is a new record, but that would disregard the "price/total different".
If you really want the difference comment, now is a tricky bit where the 'how' really depends on what order columns matter most (title > flag > ...) and how much they matter (weighting system)
After you have a wighting system determined, you need a 'scoring' method that will compare two rows in order to see how similar they are based on the column ranking you determine.
# distributes weight so first is heaviest, last is lightest, total weight = 100
# if i was good i'd do this with numpy not manually
def getWeights(l):
weights = [0 for col in l]
total = 100
while total > 0:
for i, e in enumerate(l):
for j in range(i+1):
weights[j] += 1
total -= 1
return weights
def scoreRows(row1, row2):
s = 0
for i, colName in enumerate(colRank):
if row1[colName] == row2[colName]:
s += weights[i]
colRank = ['title', 'flag']
weights = getWeights(colRank)
Let's say only these 2 matter and the rest are considered 'modifications' to an original row
That is to say, if a row in df2 doesn't have a matching title OR flag for ANY row in df1, that row is a new record
What makes a row a new record is completely up to you.
Another way of thinking about it is that you need to determine what makes some row in df2 'differ' from some row in df1 and not a different row in df1
if you have 2 rows in df1
row1: [1, 2, 3, 4]
row2: [1, 6, 3, 7]
and you want to compare this row against that df
[1, 6, 5, 4]
this row has the same first element as both, the same second element as row2, and the same 4th element of row1.
so which row does it differ from?
if this is a question you aren't sure how to answer, consider cutting losses and just keep df1 as "good" records and df2 as "new" records
if you're sticking with the 'differs' comment, our next step is to filter out truly new records from records that have slight differences by building a score table
# to recap
# df1 has "both" and "left_only" records ("matching" comment)
# df2 has "right_only" records (new records and differing records)
rowScores = []
# list of lists
# each inner list index correlates to the index for df2
# inner lists are
# made up of tuples
# each tuple first element is the actual row from df1 that is matched
# second element is the score for matching (out of 100)
for i, row1 in df2.itterrows():
thisRowsScores = []
#df2 first because they are what we are scoring
for j, row2 in df1.iterrows():
s = scoreRows(row1, row2)
if s>0: # only save rows and scores that matter
thisRowsScores.append((row2, s))
# at this point, you can either leave the scoring as a table and have comments refer how different differences relate back to some row
# or you can just keep the best score like i'll be doing
#sort by score
sortedRowScores = thisRowsScores.sort(key=lambda x: x[1], reverse=True)
rowScores.append(sortedRowScores[0])
# appends empty list if no good matches found in df1
# alternatively, remove 'reversed' from above and index at -1
The reason we save the row itself is so that it can be indexed by df1 in order to add a "differ" comments
At this point, lets just say that df1 already has the comments "matching" added to it
Now that each row in df2 has a score and reference to the row it matched best in df1, we can edit the comment to that row in df1 to list the columns with different values.
But at this point, I feel as though that df now needs a reference back to df2 so that the record and values those difference refer to are actually gettable.

nunique compare two Pandas dataframe with duplicates and pivot them

My input:
df1 = pd.DataFrame({'frame':[ 1,1,1,2,3,0,1,2,2,2,3,4,4,5,5,5,8,9,9,10,],
'label':['GO','PL','ICV','CL','AO','AO','AO','ICV','PL','TI','PL','TI','PL','CL','CL','AO','TI','PL','ICV','ICV'],
'user': ['user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1']})
df2 = pd.DataFrame({'frame':[ 1, 1, 2, 3, 4,0,1,2,2,2,4,4,5,6,6,7,8,9,10,11],
'label':['ICV','GO', 'CL','TI','PI','AO','GO','ICV','TI','PL','ICV','TI','PL','CL','CL','CL','AO','AO','PL','ICV'],
'user': ['user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2']})
df_c = pd.concat([df1,df2])
I trying compare two df, frame by frame, and check if label in df1 existing in same frame in df2. And make some calucation with result (pivot for example)
That my code:
m_df = df1.merge(df2,on=['frame'],how='outer' )
m_df['cross']=m_df.apply(lambda row: 'Matched'
if row['label_x']==row['label_y']
else 'Mismatched', axis='columns')
pv_m_unq= pd.pivot_table(m_df,
columns='cross',
index='label_x',
values='frame',
aggfunc=pd.Series.nunique,fill_value=0,margins=True)
pv_mc = pd.pivot_table(m_df,
columns='cross',
index='label_x',
values='frame',
aggfunc=pd.Series.count,fill_value=0,margins=True)
but this creates a some problem:
first, I can calqulate "simple" total (column All) of matched and missmatched as descipted in picture, or its "duplicated" as AO in pv_m or wrong number as in CL in pv_m_unq
and second, I think merge method as I use int not clever way, because I get if frame+label repetead in df(its happens often), in merged df I get number row in df1 X number of rows in df2 for this specific frame+label
I think maybe there is a smarter way to compare df and pivot them?
You got the unexpected result on margin total because the margin is making use the same function passed to aggfunc (i.e. pd.Series.nunique in this case) for its calculation and the values of Matched and Mismatched in these 2 rows are both the same as 1 (hence only one unique value of 1). (You are currently getting the unique count of frame id's)
Probably, you can achieve more or less what you want by taking the count on them (including margin, Matched and Mismatched) instead of the unique count of frame id's, by using pd.Series.count instead in the last line of codes:
pv_m = pd.pivot_table(m_df,columns='cross',index='label_x',values='frame', aggfunc=pd.Series.count, margins=True, fill_value=0)
Result
cross Matched Mismatched All
label_x
AO 0 1 1
CL 1 0 1
GO 1 1 2
ICV 1 1 2
PL 0 2 2
All 3 5 8
Edit
If all you need is to have the All column being the sum of Matched and Mismatched, you can do it as follows:
Change your code of generating pv_m_unq without building margin:
pv_m_unq= pd.pivot_table(m_df,
columns='cross',
index='label_x',
values='frame',
aggfunc=pd.Series.nunique,fill_value=0)
Then, we create the column All as the sum of Matched and Mismatched for each row, as follows:
pv_m_unq['All'] = pv_m_unq['Matched'] + pv_m_unq['Mismatched']
Finally, create the row All as the sum of Matched and Mismatched for each column and append it as the last row, as follows:
row_All = pd.Series({'Matched': pv_m_unq['Matched'].sum(),
'Mismatched': pv_m_unq['Mismatched'].sum(),
'All': pv_m_unq['All'].sum()},
name='All')
pv_m_unq = pv_m_unq.append(row_All)
Result:
print(pv_m_unq)
Matched Mismatched All
label_x
AO 1 3 4
CL 1 2 3
GO 1 1 2
ICV 2 4 6
PL 1 5 6
TI 2 3 5
All 8 18 26
You can use isin() function like this:
df3 =df1[df1.label.isin(df2.label)]

pandas take pairwise difference between a lot of columns

I have a pandas dataframe having a lot of actual (column name ending with _act) and projected columns (column name ending with _proj). Other than actual and projected there's also a date column. Now I want to add an error column (in that order, i.e., beside its projected column) for all of them. Sample dataframe:
date a_act a_proj b_act b_proj .... z_act z_proj
2020 10 5 9 11 .... 3 -1
.
.
What I want:
date a_act a_proj a_error b_act b_proj b_error .... z_act z_proj z_error
2020 10 5 5 9 11 -2 .... 3 -1 4
.
.
What's the best way to achieve this, as I have a lot of actual and projected columns?
You could do:
df = df.set_index('date')
# create new columns
columns = df.columns[df.columns.str.endswith('act')].str.replace('act', 'error')
# compute differences
diffs = pd.DataFrame(data=df.values[:, ::2] - df.values[:, 1::2], index=df.index, columns=columns)
# concat
res = pd.concat((df, diffs), axis=1)
# reorder columns
res = res.reindex(sorted(res.columns), axis=1)
print(res)
Output
a_act a_error a_proj b_act b_error b_proj z_act z_error z_proj
date
2020 10 5 5 9 -2 11 3 4 -1

Select highest member of close coordinates saved in pandas dataframe

I have a dataframe that has following columns: X and Y are Cartesian coordinates and Value is the value of element at these coordinates. What I want to achieve is to select only one coordinates out of n that are close to other, lets say coordinates are close if distance is lower than some value m, so the initial DF looks like this (example):
data = {'X':[0,0,0,1,1,5,6,7,8],'Y':[0,1,4,2,6,5,6,4,8],'Value':[6,7,4,5,6,5,6,4,8]}
df = pd.DataFrame(data)
X Y Value
0 0 0 6
1 0 1 7
2 0 4 4
3 1 2 5
4 1 6 6
5 5 5 5
6 6 6 6
7 7 4 4
8 8 8 8
distance is count with following function:
def countDistance(lat1, lon1, lat2, lon2):
#use basic knowledge about triangles - values are in meters
distance = sqrt(pow(lat1-lat2,2)+pow(lon1-lon2,2))
return distance
lets say if we want to m<=3, the output dataframe would look like this:
X Y Value
1 0 1 7
4 1 6 6
8 8 8 8
What is to be done:
rows 0,1,3 are close, highest value is in row 1, continue
rows 2 and 4 (from original df) are close, keep row 4
rows 5,6,7 are close, keep row 6
left over row 6 is close to row 8, keep row 8, has higher value
So I need to go through dataframe row by row, check the rest, select best match and then continue. I can't think about any simple method how to achieve this, this cant be use case of drop_duplicates, since they are not duplicates, but looping over the whole DF will be very inefficient. One method I could think about was to loop just once, for each of rows finds close ones (probably apply countdistance()), select the best fitting row and replace rest with its values, in the end use drop_duplicates. The other idea was to create a recursive function that would create a new DF, then while original df will have rows select first, find close ones, best match append to new DF, remove first row and all close from original DF and continue until empty, then return same function with new DF as to remove possible uncaught close points.
These ideas are all kind of inefficient, is there a nice and efficient pythonic way to achieve this?
For now, I have created simple code with recursion, the code works but is most likely not optimal.
def recModif(self,df):
#columns=['','X','Y','Value']
new_df = df.copy()
new_df = new_df[new_df['Value']<0] #create copy to work with
changed = False
while not df.empty: #for all the data
df = df.reset_index(drop=True) #need to reset so 0 is always accessible
x = df.loc[0,'X'] #first row x and y
y = df.loc[0,'Y']
df['dist'] = self.countDistance(x,y,df['X'],df['Y']) #add column with distances
select = df[df['dist']<10] #number of meters that two elements cant be next to other
if(len(select.index)>1): #if there is more than one elem close
changed = True
#print(select,select['Value'].idxmax())
select = select.loc[[select['Value'].idxmax()]] #get the highest one
new_df = new_df.append(pd.DataFrame(select.iloc[:,:3]),ignore_index=True) #add it to new df
df = df[df['dist'] >= 10] #drop the elements now
if changed:
return self.recModif(new_df) #use recursion if possible overlaps
else:
return new_df #return new df if all was OK

How can I keep all columns in a dataframe, plus add groupby, and sum?

I have a data frame with 5 fields. I want to copy 2 fields from this into a new data frame. This works fine. df1 = df[['task_id','duration']]
Now in this df1, when I try to group by task_id and sum duration, the task_id field drops off.
Before (what I have now).
After (what I'm trying to achieve).
So, for instance, I'm trying this:
df1['total'] = df1.groupby(['task_id'])['duration'].sum()
The result is:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I don't know why I can't just sum the values in a column and group by unique IDs in another column. Basically, all I want to do is preserve the original two columns (['task_id', 'duration']), sum duration, and calculate a percentage of duration in a new column named pct. This seems like a very simple thing but I can't get anything working. How can I get this straightened out?
The code will take care of having the columns retained and getting the sum.
df[['task_id', 'duration']].groupby(['task_id', 'duration']).size().reset_index(name='counts')
Setup:
X = np.random.choice([0,1,2], 20)
Y = np.random.uniform(2,10,20)
df = pd.DataFrame({'task_id':X, 'duration':Y})
Calculate pct:
df = pd.merge(df, df.groupby('task_id').agg(sum).reset_index(), on='task_id')
df['pct'] = df['duration_x'].divide(df['duration_y'])*100
df.drop('duration_y', axis=1) # Drops sum duration, remove this line if you want to see it.
Result:
duration_x task_id pct
0 8.751517 0 58.017921
1 6.332645 0 41.982079
2 8.828693 1 9.865355
3 2.611285 1 2.917901
4 5.806709 1 6.488531
5 8.045490 1 8.990189
6 6.285593 1 7.023645
7 7.932952 1 8.864436
8 7.440938 1 8.314650
9 7.272948 1 8.126935
10 9.162262 1 10.238092
11 7.834692 1 8.754639
12 7.989057 1 8.927129
13 3.795571 1 4.241246
14 6.485703 1 7.247252
15 5.858985 2 21.396850
16 9.024650 2 32.957771
17 3.885288 2 14.188966
18 5.794491 2 21.161322
19 2.819049 2 10.295091
disclaimer: All data is randomly generated in setup, however, calculations are straightforward and should be correct for any case.
I finally got everything working in the following way.
# group by and sum durations
df1 = df1.groupby('task_id', as_index=False).agg({'duration': 'sum'})
list(df1)
# find each task_id as relative percentage of whole
df1['pct'] = df1['duration']/(df1['duration'].sum())
df1 = pd.DataFrame(df1)

Categories