Outer join in Python Pandas

I have two data sets as follows:
A      B
IDs    IDs
1      1
2      2
3      5
4      7
How, in pandas/NumPy, can I apply a join that gives me all the data from B that is not present in A?
Something like the following:
B
IDs
5
7
I know it can be done with a for loop, but I don't want that, since my real data runs into millions of rows, and I am really not sure how to use pandas/NumPy here. Something like the following:
pd.merge(A, B, on='ids', how='right')
Thanks

You can use NumPy's setdiff1d, like so -
np.setdiff1d(B['IDs'],A['IDs'])
Also, np.in1d (np.isin in newer NumPy) could be used to the same effect, like so -
B[~np.in1d(B['IDs'],A['IDs'])]
Please note that np.setdiff1d would give us a sorted NumPy array as output.
Sample run -
>>> A = pd.DataFrame([1,2,3,4],columns=['IDs'])
>>> B = pd.DataFrame([1,7,5,2],columns=['IDs'])
>>> np.setdiff1d(B['IDs'],A['IDs'])
array([5, 7])
>>> B[~np.in1d(B['IDs'],A['IDs'])]
   IDs
1    7
2    5

You can use merge with the parameter indicator and then boolean indexing. Finally, you can drop the _merge column:
A = pd.DataFrame({'IDs':[1,2,3,4],
                  'B':[4,5,6,7],
                  'C':[1,8,9,4]})
print (A)
   B  C  IDs
0  4  1    1
1  5  8    2
2  6  9    3
3  7  4    4
B = pd.DataFrame({'IDs':[1,2,5,7],
                  'A':[1,8,3,7],
                  'D':[1,8,9,4]})
print (B)
   A  D  IDs
0  1  1    1
1  8  8    2
2  3  9    5
3  7  4    7
df = (pd.merge(A, B, on='IDs', how='outer', indicator=True))
df = df[df._merge == 'right_only']
df = df.drop('_merge', axis=1)
print (df)
     B    C  IDs    A    D
4  NaN  NaN  5.0  3.0  9.0
5  NaN  NaN  7.0  7.0  4.0
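If you prefer a single chained expression, here is a minimal sketch of the same steps, assuming the A and B frames defined above:
# outer merge with an indicator, keep the right-only rows, drop the helper column
df = (pd.merge(A, B, on='IDs', how='outer', indicator=True)
        .query('_merge == "right_only"')
        .drop(columns='_merge'))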

You could convert the data series to sets and take the difference:
import pandas as pd
df=pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
A=set(df['A'])
B=set(df['B'])
C=pd.DataFrame({'C' : list(B-A)}) # Take difference and convert back to DataFrame
The variable C then yields
   C
0  5
1  7

You can simply use pandas' .isin() method:
df = pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
df[~df['B'].isin(df['A'])]
If these are separate DataFrames:
a = pd.DataFrame({'IDs' : [1,2,3,4]})
b = pd.DataFrame({'IDs' : [1,2,5,7]})
b[~b['IDs'].isin(a['IDs'])]
Output:
   IDs
2    5
3    7

Related

Add label/multi-index on top of columns

Context: I'd like to add a new multi-index/row on top of the columns. For example if I have this dataframe:
tt = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
Desired Output: How could I make it so that I can add "Table X" on top of the columns A, B, and C?
  Table X
        A  B  C
0       1  4  7
1       2  5  8
2       3  6  9
Possible solutions(?): I was thinking about transposing the dataframe, adding the multi-index, and transposing it back again, but I'm not sure how to do that without having to write out the dataframe columns manually (I've checked other SO posts about this as well).
Thank you!
In the meantime I've also discovered this solution:
tt = pd.concat([tt],keys=['Table X'], axis=1)
Which also yields the desired output
  Table X
        A  B  C
0       1  4  7
1       2  5  8
2       3  6  9
If you want a data frame like the one you wrote, you need a MultiIndex data frame; try this:
import pandas as pd
# you need a nested dict first
dict_nested = {'Table X': {'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]}}
# then you have to reform it
reformed_dict = {}
for outer_key, inner_dict in dict_nested.items():
    for inner_key, values in inner_dict.items():
        reformed_dict[(outer_key, inner_key)] = values
# last but not least convert it to a multiindex dataframe
multiindex_df = pd.DataFrame(reformed_dict)
print(multiindex_df)
# >>   Table X
# >>         A  B  C
# >> 0       1  4  7
# >> 1       2  5  8
# >> 2       3  6  9
You can use pd.MultiIndex.from_tuples() to set / change the columns of the dataframe with a multi index:
tt.columns = pd.MultiIndex.from_tuples((
    ('Table X', 'A'), ('Table X', 'B'), ('Table X', 'C')))
Result (tt):
  Table X
        A  B  C
0       1  4  7
1       2  5  8
2       3  6  9
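If you would rather not write each tuple out by hand, pd.MultiIndex.from_product can build the same columns from the existing single-level ones; a small sketch under that assumption:
# assumes tt still has its original single-level columns ['A', 'B', 'C']
tt.columns = pd.MultiIndex.from_product([['Table X'], tt.columns])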
Add-on: as those are MultiIndex levels, you can later change them (set_levels returns a new index; the inplace argument is removed in recent pandas):
tt.columns = tt.columns.set_levels(['table_x'], level=0)
tt.columns = tt.columns.set_levels(['a', 'b', 'c'], level=1)
  table_x
        a  b  c
0       1  4  7
1       2  5  8
2       3  6  9

groupby a column to get a minimum value and corresponding value in another column [duplicate]

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
   A  B   C
0  1  4   3
1  1  5   4
2  1  2  10
3  2  7   2
4  2  4   4
5  2  6   6
I would like to get:
   A  B   C
0  1  2  10
1  2  4   4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
   A  B   C
2  1  2  10
4  2  4   4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
   A  B   C
0  1  2  10
1  2  4   4
Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
data.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
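A shorter pipe-friendly variant of the same idea uses GroupBy.head directly; a sketch, assuming the df from the question:
# keep the first (i.e. smallest-B) row of each A group after sorting
df.sort_values('B').groupby('A').head(1).reset_index(drop=True)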
I found an answer that is a little more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
   A  B   C
0  1  4   3
1  1  5   4
2  1  2  10
3  2  7   2
4  2  4   4
5  2  6   6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1    2
2    4
Name: B, dtype: int64
Then, we merge this Series result onto the original data frame:
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
   A  B   C  B_min
0  1  4   3      2
1  1  5   4      2
2  1  2  10      2
3  2  7   2      4
4  2  4   4      4
5  2  6   6      4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
   A  B   C
2  1  2  10
4  2  4   4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
   A  B   C
2  1  2  10
4  2  4   4
The solution is, as written before:
df.loc[df.groupby('A')['B'].idxmin()]
If you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were NaN values in column B, so I used dropna() and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to select the rows where column B equals the group minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
   A  B   C
2  1  2  10
4  2  4   4

Merge/Append pandas dataframes but update overlapping rows

I have two (or more) dataframes that I want to append under each other (or outer merge, in a way). How do I make sure that I can append the two dataframes, and at the same time, if an index value is present in both, update the value with the one from the second (dfB) dataframe?
As an illustration:
dfA =
Index  Var1
A      5
B      6
C      7
dfB =
Index  Var1
A      6
D      8
E      10
Desired output should look like
output =
Index  Var1
A      6
B      6
C      7
D      8
E      10
Any help would be greatly appreciated!
Thanks
For this particular case, considering the update, you can use pd.concat() with the argument ignore_index=True followed by drop_duplicates(['Index'], keep='last'):
output = pd.concat([dfA, dfB], ignore_index=True).drop_duplicates(['Index'], keep='last')
Example:
A = {'Index':['A','B','C'],'Var1':[5,6,7]}
B = {'Index':['A','D','E'],'Var1':[6,7,8]}
dfA = pd.DataFrame(A)
dfB = pd.DataFrame(B)
output = pd.concat([dfA,dfB],ignore_index=True).drop_duplicates(['Index'],keep='last')
print(output)
  Index  Var1
1     B     6
2     C     7
3     A     6
4     D     7
5     E     8
After this you can use sort_values() (and optionally set_index()) if you want to sort your dataframe alphabetically by the Index column, as shown below.
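A minimal sketch of that last step, assuming the output frame from above:
# sort alphabetically by the 'Index' column and optionally promote it to the index
output = output.sort_values('Index').set_index('Index')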
You can also merge and fillna:
final = (df1.merge(df2, on='Index', how='outer', suffixes=('_x', ''))
            .assign(Var1=lambda x: x['Var1'].fillna(x['Var1_x']))[df1.columns])
  Index  Var1
0     A   6.0
1     B   6.0
2     C   7.0
3     D   8.0
4     E  10.0
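Since dfB should win wherever both frames share an Index value, DataFrame.combine_first can also express the update; a hedged sketch assuming the dfA/dfB frames from the question:
# keep dfB's values and fill the rows it is missing from dfA
output = (dfB.set_index('Index')
             .combine_first(dfA.set_index('Index'))
             .reset_index())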

Python Pandas: MultiIndex groupby second level of columns

I'm trying to group rows by multiple columns.
What I want to achieve can be illustrated by this small example:
import pandas as pd
col_index = pd.MultiIndex.from_arrays([['A','A','B','B'],['a','b','c','d']])
df = pd.DataFrame([[1,2,3,3],
                   [4,2,2,2],
                   [6,4,2,2],
                   [1,2,4,4],
                   [3,8,4,4],
                   [1,2,3,3]], columns=col_index)
DataFrame created by this looks like this:
   A     B
   a  b  c  d
0  1  2  3  3
1  4  2  2  2
2  6  4  2  2
3  1  2  4  4
4  3  8  4  4
5  1  2  3  3
I would like to group by 'c' and 'd', that is, by the whole 'B' block.
This gives me "KeyError: 'c'":
#something like this
df.groupby(['c','d'], axis = 1, level = 1)
#or like this
df.groupby('B', axis = 1, level = 0)
I tried searching for answer but I can't seem to find any.
Can somebody tell me what I'm doing wrong?
This is one way of doing it by resetting the columns first:
df.set_axis(df.columns.droplevel(0), axis=1).groupby(['c','d']).sum()
Out[531]:
      a   b
c d
2 2  10   6
3 3   2   4
4 4   4  10
You can also specify the 2-level column keys explicitly (followed by whatever aggregation you need, e.g. .sum()):
df.groupby([("B", "c"), ("B", "d")]).sum()
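If you want to key on every column under 'B' without naming each one, a possible sketch (assuming the df defined in the question) builds the keys from the 'B' sub-frame:
# group rows by all columns in the top-level 'B' block, then aggregate the rest
keys = [('B', c) for c in df['B'].columns]   # [('B', 'c'), ('B', 'd')]
df.groupby(keys).sum()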

Delete a row if it doesn't contain a specified integer value (Pandas)

I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering if it is possible to remove a row if one of its column values does not match a set of allowed values. For example, if I have the dataframe:
   a  b
0  1  6
1  4  7
2  2  4
3  3  7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
   a  b
0  1  6
1  3  7
...
Currently, my implementation is not working because some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. Hence I would like to drop anything that is not an integer with one of those values.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with double isin and &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
   a  b
0  1  6
3  3  7
Or use numpy.in1d:
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
   a  b
0  1  6
3  3  7
But if you need to remove all rows with non-numeric values, use to_numeric with errors='coerce', which returns NaN for unparseable values, and then filter with notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
                   'b':['4','5','dsws7']})
print (df)
      a      b
0  1abc      4
1     2      5
2     3  dsws7
mask = (pd.to_numeric(df['a'], errors='coerce').notnull() &
        pd.to_numeric(df['b'], errors='coerce').notnull())
df1 = df[mask].astype(int)
print (df1)
   a  b
1  2  5
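To get back to the question's original goal (keep only rows where a is one of [1, 3] and b is one of [6, 7], even when values carry trailing text), the two ideas can be combined; a hedged sketch assuming a frame df with string-ish columns a and b:
# coerce to numbers first, then apply the membership test on the coerced values
a_num = pd.to_numeric(df['a'], errors='coerce')
b_num = pd.to_numeric(df['b'], errors='coerce')
df1 = df[a_num.isin([1, 3]) & b_num.isin([6, 7])]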
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
                   'b':['4','5',np.nan]})
print (df)
      a    b
0  1abc    4
1  None    5
2     3  NaN
print (df[df.isnull().any(axis=1)])
      a    b
1  None    5
2     3  NaN
You can use pandas isin()
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
   a  b
0  1  6
3  3  7
