I’m trying to replicate in python/pandas what would be fairly straightforward in SQL, but am stuck.
I want to take a data frame with three columns:
dataframe1
Org Des Score
0 A B 10
1 A B 11
2 A B 15
3 A C 4
4 A C 4.5
5 A C 6
6 A D 100
7 A D 110
8 A D 130
And filter out all score values that are greater than the minimum * 1.2 for each Org-Des combination.
So the output table would be:
output_dataframe
Org Des Score
0 A B 10
1 A B 11
3 A C 4
4 A C 4.5
6 A D 100
7 A D 110
For the first Org-Des combo, A-B, the min Score is 10 and (1.2 * min) = 12. So rows 0 and 1 would be preserved because Scores 10 and 11 are < 12. Row 2 would be eliminated because 15 is > 12.
For A-C, the min Score is 4 and (1.2 * min) = 4.8. So rows 3 and 4 are preserved because they are < 4.8, and row 5 (Score 6) is eliminated. And so on...
My approach
I thought I'd use the following approach:
Use a groupby function to create a dataframe with the mins by Org-Des pair:
dataframe2 = pd.DataFrame(dataframe1.groupby(['Org','Des'])['Score'].min())
Then do an inner join (or a merge?) between dataframe1 and dataframe2 with the criteria that the Score < 1.2 * min for each Org-Des pair type.
But I haven't been able to get this to work, for two reasons: 1) dataframe2 ends up with a funky shape, which I would need to figure out how to join or merge with dataframe1 (or transform and then join/merge), and 2) I don't know how to set criteria as part of a join/merge.
Is this the right approach or is there a more pythonic way to achieve the same goal?
Edit to reflect @Psidom's answer:
I tried the code you suggested and it gave me an error, here's the full code and output:
In: import pandas as pd
import numpy as np
In: df1 = pd.DataFrame({'Org': ['A','A','A','A','A','A','A','A','A'],
'Des': ['B','B','B','C','C','C','D','D','D'],
'Score': ['10','11','15','4','4.5','6','100','110','130'], })
Out: Org Des Score
0 A B 10
1 A B 11
2 A B 15
3 A C 4
4 A C 4.5
5 A C 6
6 A D 100
7 A D 110
8 A D 130
In: df2 = pd.DataFrame(df1.groupby(['Org','Des'])['Score'].min())
df2
Out: Score
Org Des
A B 10
C 4
D 100
In: df1 = pd.merge(df1, df2.groupby(['Org', 'Des']).min()*1.2, left_on = ['Org', 'Des'], right_index=True)
df.loc[df1.Score_x < df1.Score_y, :]
Out: KeyError: 'Org' # It's a big error but this seems to be the relevant part. Let me know if it would be useful to paste the whole error.
I suspect I may have the df1, df2 and df names mixed up? I changed them from the original answer post to match my code.
You can set up the join like this: for the original data frame, use ['Org', 'Des'] as the join columns; for the aggregated data frame, the grouped columns become the index, so you need to set right_index=True. Then it should work as expected:
import pandas as pd
df1 = pd.DataFrame({'Org': ['A','A','A','A','A','A','A','A','A'],
'Des': ['B','B','B','C','C','C','D','D','D'],
'Score': [10,11,15,4,4.5,6,100,110,130]})
df2 = pd.DataFrame(df1.groupby(['Org','Des'])['Score'].min())
df3 = pd.merge(df1, df2, left_on = ['Org', 'Des'], right_index=True)
df1.loc[df3.Score_x < df3.Score_y * 1.2, ]
# Org Des Score
#0 A B 10.0
#1 A B 11.0
#3 A C 4.0
#4 A C 4.5
#6 A D 100.0
#7 A D 110.0
I did it like this:
df[df.groupby(['Org', 'Des']).Score.apply(lambda x: x < x.min() * 1.2)]
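For what it's worth, a minimal sketch of the same filter using groupby().transform('min'), assuming numeric Score values as in the answer above: transform broadcasts each group's minimum back onto its rows, so no merge or apply is needed.
import pandas as pd

df1 = pd.DataFrame({'Org': ['A'] * 9,
                    'Des': ['B','B','B','C','C','C','D','D','D'],
                    'Score': [10, 11, 15, 4, 4.5, 6, 100, 110, 130]})

# Per-group minimum, broadcast to every row of that group
group_min = df1.groupby(['Org', 'Des'])['Score'].transform('min')

# Keep rows whose Score is below 1.2 * the group minimum
print(df1[df1['Score'] < group_min * 1.2])
#   Org Des  Score
# 0   A   B   10.0
# 1   A   B   11.0
# 3   A   C    4.0
# 4   A   C    4.5
# 6   A   D  100.0
# 7   A   D  110.0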
Related
I have two (or more) dataframes that I want to append under each other (or outer merge, in a way). How do I append the two dataframes so that, if an index appears in both, the value of the variable is updated with the one from the second dataframe (dfB)?
As an illustration:
dfA =
Index Var1
A 5
B 6
C 7
dfB =
Index Var1
A 6
D 8
E 10
Desired output should look like
output =
Index Var1
A 6
B 6
C 7
D 8
E 10
Any help would be greatly appreciated!
Thanks
For this particular case, considering the update, you can use pd.concat() with the argument ignore_index=True followed by drop_duplicates(['Index'], keep='last'):
output = pd.concat([dfA,dfB],ignore_index=True).drop_duplicates(['Index'],keep='last')
Example:
A = {'Index':['A','B','C'],'Var1':[5,6,7]}
B = {'Index':['A','D','E'],'Var1':[6,8,10]}
dfA = pd.DataFrame(A)
dfB = pd.DataFrame(B)
output = pd.concat([dfA,dfB],ignore_index=True).drop_duplicates(['Index'],keep='last')
print(output)
Index Var1
1 B 6
2 C 7
3 A 6
4 D 8
5 E 10
After this you can use sort_values() (and optionally set_index()) if you want to sort your dataframe alphabetically by the Index column.
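For example, a tiny sketch of that final sort, continuing from the output variable in the snippet above:
# Sort alphabetically by the 'Index' column and renumber the rows
output = output.sort_values('Index').reset_index(drop=True)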
You can also merge and fillna:
final = (dfA.merge(dfB,on='Index',how='outer',suffixes=('_x',''))
         .assign(Var1 = lambda x: x['Var1'].fillna(x['Var1_x']))[dfA.columns])
Index Var1
0 A 6.0
1 B 6.0
2 C 7.0
3 D 8.0
4 E 10.0
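Another minimal sketch of the same idea, assuming Index is meant to be the actual index: combine_first takes dfB's values wherever the index overlaps and falls back to dfA's everywhere else.
import pandas as pd

dfA = pd.DataFrame({'Index': ['A', 'B', 'C'], 'Var1': [5, 6, 7]})
dfB = pd.DataFrame({'Index': ['A', 'D', 'E'], 'Var1': [6, 8, 10]})

# dfB wins wherever the index overlaps; dfA fills the remaining labels
output = dfB.set_index('Index').combine_first(dfA.set_index('Index'))
print(output)
#        Var1
# Index
# A       6.0
# B       6.0
# C       7.0
# D       8.0
# E      10.0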
I currently have a dataframe of countries by series, with values ranging from 0 to 25.
I want to sort the df so that the highest values appear in the top left (first), while the lowest appear in the bottom right (last).
FROM
A B C D ...
USA 4 0 10 16
CHN 2 3 13 22
UK 2 1 8 14
...
TO
D C A B ...
CHN 22 13 2 3
USA 16 10 4 0
UK 14 8 2 1
...
In this, the column with the highest values is now first, and the same is true with the index.
I have considered reindexing, but this loses the 'Countries' Index.
D C A B ...
0 22 13 2 3
1 16 10 4 0
2 14 8 2 1
...
I have thought about creating a new column and row holding the mean or sum of values for that respective column/row, but is this the most efficient way?
How would I then sort the df after I have the new rows/columns?
Is there a way to reindex using...
df_mv.reindex(df_mv.mean(or sum)().sort_values(ascending = False).index, axis=1)
... that would allow me to keep the country index, and simply sort it accordingly?
Thanks for any and all advice or assistance.
EDIT
Intended result organizes columns AND rows from largest to smallest.
Regarding the first row of the A and B columns in the intended output, these are supposed to be 2, 3 respectively. This is because the intended result interprets the A column as greater than the B column in both sum and mean (even though either sum or mean can be considered for the 'value' of a row/column).
By saying the higher numbers would be in the top left, while the lower ones would be in the bottom right, I simply meant this as a general trend for the resulting df. It is the columns and rows as a whole, however, that are the intended focus. I apologize for the confusion.
You could use:
rows_index=df.max(axis=1).sort_values(ascending=False).index
col_index=df.max().sort_values(ascending=False).index
new_df=df.loc[rows_index,col_index]
print(new_df)
D C A B
CHN 22 13 2 3
USA 16 10 4 0
UK 14 8 2 1
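If, per the EDIT, whole columns and rows should instead be ranked by their sums (or means, which give the same ordering), the reindex idea floated in the question can be written out as a small sketch that also keeps the country index:
import pandas as pd

df = pd.DataFrame({'A': [4, 2, 2], 'B': [0, 3, 1], 'C': [10, 13, 8], 'D': [16, 22, 14]},
                  index=['USA', 'CHN', 'UK'])

# Rank columns by their column sums and rows by their row sums, both descending
df = df.reindex(df.sum().sort_values(ascending=False).index, axis=1)
df = df.reindex(df.sum(axis=1).sort_values(ascending=False).index, axis=0)
print(df)
#       D   C  A  B
# CHN  22  13  2  3
# USA  16  10  4  0
# UK   14   8  2  1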
Use .T to transpose rows to columns and vice versa:
df = df.sort_values(df.max().idxmax(), ascending=False)
df = df.T
df = df.sort_values(df.columns[0], ascending=False).T
Result:
>>> df
D C B A
CHN 22 13 3 2
USA 16 10 0 4
UK 14 8 1 2
Here's another way, this time without transposing but using axis=1 as an argument:
df = df.sort_values(df.max().idxmax(), ascending=False)
df = df.sort_values(df.index[0], axis=1, ascending=False)
Using numpy:
arr = df.to_numpy()
order = np.max(arr, axis=1).argsort()[::-1]  # rows ranked by their max, descending
arr = arr[order, :]
arr = np.sort(arr, axis=1)[:, ::-1]          # each row sorted descending
df1 = pd.DataFrame(arr, index=df.index[order], columns=df.columns)
print(df1)
Output:
A B C D
CHN 22 13 3 2
USA 16 10 4 0
UK 14 8 2 1
I have the following dummy data frame:
df = pd.DataFrame([[1,50,60],[5,70,80],[2,120,30],[3,125,450],[5,80,90],[4,100,200],[2,1000,2000],[1,10,20]],columns = ['A','B','C'])
A B C
0 1 50 60
1 5 70 80
2 2 120 30
3 3 125 450
4 5 80 90
5 4 100 200
6 2 1000 2000
7 1 10 20
I am using a for loop in Python at the moment, and I would like to know whether a for loop in Python can generate multiple results. I would like to break up the above data frame with a for loop so that for each value in column A I get a new df, sorted by column B and with column C multiplied by 2:
df1 =
A B C
1 10 40
1 50 120
df2 =
A B C
2 120 60
2 1000 4000
df3 =
A B C
3 125 900
df4 =
A B C
4 100 400
df5 =
A B C
5 70 160
5 80 180
I am not sure if this can be done in Python. Normally I use MATLAB, and for this I tried the following in my Python script:
def f(df):
    for i in np.unique(df['A'].values):
        df = df.sort_values(['A','B'])
        df = df['C'].assign(C = lambda x: x.C*2)
        print df
Of course this is wrong, since it will not generate multiple results as df1, df2, ..., df5 (it is important that these variables end in 1, 2, ..., 5 so that they can be traced back to column A of the dataframe). Could anyone help me please? I understand that this can easily be done without a for loop (vectorization), but I have many unique values in column A, I would like to run a for loop over them, and I would also like to learn more about loops in Python. Many thanks.
Using DataFrame.groupby is faster than looping over Series.unique.
Optionally you can save the dataframes in a dictionary.
The advantage of using a dictionary over a list is that each group can be looked up by its key, i.e. the value in A.
df2=df.copy()
df2['C']=df2['C']*2
df2=df2.sort_values('B')
dfs={i:group for i,group in df2.groupby('A')}
access the dictionary based on the value in A:
for key in dfs:
    print(f'dfs[{key}]')
    print(dfs[key])
    print('_'*20)
dfs[1]
A B C
7 1 10 40
0 1 50 120
____________________
dfs[2]
A B C
2 2 120 60
6 2 1000 4000
____________________
dfs[3]
A B C
3 3 125 900
____________________
dfs[4]
A B C
5 4 100 400
____________________
dfs[5]
A B C
1 5 70 160
4 5 80 180
Sort and multiply before chunking into pieces:
df['C'] = 2* df['C']
[group for name, group in df.sort_values(by=['A','B']).groupby('A')]
Or if you want a dict:
{name: group for name, group in df.sort_values(by=['A','B']).groupby('A')}
I have a similar answer to Ansev's:
df = pd.DataFrame([[1,50,60],[5,70,80],[2,120,30],[3,125,450],[5,80,90],[4,100,200],[2,1000,2000],[1,10,20]],columns = ['A','B','C'])
A = np.unique(df['A'].values)
df_result = []
for a in A:
    df1 = df.loc[df['A'] == a]
    df1 = df1.sort_values('B')
    df1 = df1.assign(C = lambda x: x.C*2)
    df_result += [df1]
I am still unable to automate this so that the results come out as df_result1, df_result2, ..., df_result5. All I can do is access the result of each loop iteration as df_result[0], df_result[1], ..., df_result[4].
What you want to do is group by column A and then store the resulting dataframes in a dict indexed by the value of A. Code to do that would be:
df_dict = {}
for ix, gp in df.groupby('A'):
    new_df = gp.sort_values('B')
    new_df['C'] = 2*new_df['C']
    df_dict[ix] = new_df
Then the variable df_dict contains all the resulting dataframes, sorted by column B and with column C multiplied by 2. For example
print(df_dict[1])
A B C
1 10 40
1 50 120
Cumsum until value exceeds a certain number:
Say that we have two data frames, A and B, that look like this:
A = pd.DataFrame({"type":['a','b','c'], "value":[100, 50, 30]})
B = pd.DataFrame({"type": ['a','a','a','a','b','b','b','c','c','c','c','c'], "value": [10,50,45,10,45,10,5,6,6,8,12,10]})
The two data frames would look like this.
>>> A
type value
0 a 100
1 b 50
2 c 30
>>> B
type value
0 a 10
1 a 50
2 a 45
3 a 10
4 b 45
5 b 10
6 b 5
7 c 6
8 c 6
9 c 8
10 c 12
11 c 10
For each group in "type" in data frame A, i would like to add the column value in B up to the number specified in the column value in A. I would also like to count the number of rows in B that were added. I've been trying to use a cumsum() but I don't know exactly to to stop the sum when the value is reached,
The output should be:
type value
0 a 3
1 b 2
2 c 4
Thank you,
Merging the two data frames beforehand should help:
import pandas as pd
df = pd.merge(B, A, on = 'type')
df['cumsum'] = df.groupby('type')['value_x'].cumsum()
B[(df.groupby('type')['cumsum'].shift().fillna(0) < df['value_y'])].groupby('type').count()
# type value
# a 3
# b 2
# c 4
Assuming B['type'] is sorted, as in the sample case, here's a NumPy-based solution -
IDs = np.searchsorted(A['type'],B['type'])
count_cumsum = np.bincount(IDs,B['value']).cumsum()
upper_bound = A['value'] + np.append(0,count_cumsum[:-1])
Bv_cumsum = np.cumsum(B['value'])
grp_start = np.unique(IDs,return_index=True)[1]
A['output'] = np.searchsorted(Bv_cumsum,upper_bound) - grp_start + 1
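For comparison, a plain pandas sketch of the same counting idea (assuming every type in B also appears in A and each group's running total eventually reaches its target): count the rows whose running total is still below the target, plus the row that reaches it.
import pandas as pd

A = pd.DataFrame({"type": ['a', 'b', 'c'], "value": [100, 50, 30]})
B = pd.DataFrame({"type": ['a','a','a','a','b','b','b','c','c','c','c','c'],
                  "value": [10, 50, 45, 10, 45, 10, 5, 6, 6, 8, 12, 10]})

counts = {}
for t, grp in B.groupby('type'):
    target = A.loc[A['type'] == t, 'value'].iloc[0]
    # rows strictly below the target, plus the one that reaches it
    counts[t] = int((grp['value'].cumsum() < target).sum()) + 1

print(counts)  # {'a': 3, 'b': 2, 'c': 4}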
I have two data sets as follows:
A B
IDs IDs
1 1
2 2
3 5
4 7
How, in pandas or NumPy, can I apply a join that gives me all the data from B which is not present in A?
Something like Following
B
Ids
5
7
I know it can be done with a for loop, but I don't want that, since my real data runs into the millions, and I am really not sure how to use pandas/NumPy here. I tried something like the following:
pd.merge(A, B, on='ids', how='right')
Thanks
You can use NumPy's setdiff1d, like so -
np.setdiff1d(B['IDs'],A['IDs'])
Also, np.in1d could be used for the same effect, like so -
B[~np.in1d(B['IDs'],A['IDs'])]
Please note that np.setdiff1d would give us a sorted NumPy array as output.
Sample run -
>>> A = pd.DataFrame([1,2,3,4],columns=['IDs'])
>>> B = pd.DataFrame([1,7,5,2],columns=['IDs'])
>>> np.setdiff1d(B['IDs'],A['IDs'])
array([5, 7])
>>> B[~np.in1d(B['IDs'],A['IDs'])]
IDs
1 7
2 5
You can use merge with the indicator parameter and then boolean indexing. Finally, you can drop the _merge column:
A = pd.DataFrame({'IDs':[1,2,3,4],
'B':[4,5,6,7],
'C':[1,8,9,4]})
print (A)
B C IDs
0 4 1 1
1 5 8 2
2 6 9 3
3 7 4 4
B = pd.DataFrame({'IDs':[1,2,5,7],
'A':[1,8,3,7],
'D':[1,8,9,4]})
print (B)
A D IDs
0 1 1 1
1 8 8 2
2 3 9 5
3 7 4 7
df = (pd.merge(A, B, on='IDs', how='outer', indicator=True))
df = df[df._merge == 'right_only']
df = df.drop('_merge', axis=1)
print (df)
B C IDs A D
4 NaN NaN 5.0 3.0 9.0
5 NaN NaN 7.0 7.0 4.0
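A slight variation on the same indicator idea (a sketch, not the answer above): merging with how='right' keeps only B's rows, so filtering on 'right_only' directly yields the rows unique to B.
import pandas as pd

A = pd.DataFrame({'IDs': [1, 2, 3, 4], 'B': [4, 5, 6, 7], 'C': [1, 8, 9, 4]})
B = pd.DataFrame({'IDs': [1, 2, 5, 7], 'A': [1, 8, 3, 7], 'D': [1, 8, 9, 4]})

# how='right' keeps every row of B; rows with no match in A are tagged 'right_only'
df = pd.merge(A, B, on='IDs', how='right', indicator=True)
df = df[df._merge == 'right_only'].drop('_merge', axis=1)
print(df[['IDs']])
#    IDs
# 2    5
# 3    7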
You could convert the data series to sets and take the difference:
import pandas as pd
df=pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
A=set(df['A'])
B=set(df['B'])
C=pd.DataFrame({'C' : list(B-A)}) # Take difference and convert back to DataFrame
The variable "C" then yields
C
0 5
1 7
You can simply use pandas' .isin() method:
df = pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
df[~df['B'].isin(df['A'])]
If these are separate DataFrames:
a = pd.DataFrame({'IDs' : [1,2,3,4]})
b = pd.DataFrame({'IDs' : [1,2,5,7]})
b[~b['IDs'].isin(a['IDs'])]
Output:
IDs
2 5
3 7