My DataFrame looks like this:
A B
100 1
100 2
200 2
200 3
I need to find all possible combinations of A and B values and create a new DataFrame with these combinations plus a third column indicating whether each combination is present in the original df:
A B C
100 1 True
100 2 True
100 3 False
200 1 False
200 2 True
200 3 True
How I'm doing it now:
import pandas as pd
df = pd.DataFrame({'A' : [100,100,200,200], 'B' : [1,2,2,3]})
df['D'] = 42
df2 = df[['A','D']].merge(df[['B','D']], on='D')[['A','B']].drop_duplicates()
i1 = df.set_index(['A','B']).index
i2 = df2.set_index(['A','B']).index
df2['C'] = i2.isin(i1)
print(df2)
It works, but looks ugly. Is there a cleaner way?
You can use:
create a new column filled with True
set_index on the columns used for the combinations
build MultiIndex.from_product from the levels of the df1 index
reindex the original df, filling missing combinations with False
reset_index to turn the MultiIndex back into columns
df['C'] = True
df1 = df.set_index(['A','B'])
mux = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)
df = df1.reindex(mux, fill_value=False).reset_index()
print (df)
A B C
0 100 1 True
1 100 2 True
2 100 3 False
3 200 1 False
4 200 2 True
5 200 3 True
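For what it's worth, the same idea can be written a bit more compactly by building the grid straight from the unique values (a sketch of mine, assuming the original two-column df with no duplicate (A, B) pairs):
mux = pd.MultiIndex.from_product([df['A'].unique(), df['B'].unique()],
                                 names=['A', 'B'])
out = (df.assign(C=True)                    # mark existing combinations
         .set_index(['A', 'B'])
         .reindex(mux, fill_value=False)    # missing combinations get False
         .reset_index())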
With the help of itertools and tuples:
import itertools

# Build every (A, B) combination, then flag the ones present in the original df
newdf = pd.DataFrame(list(itertools.product(df['A'].unique(), df['B'].unique())),
                     columns=['A', 'B'])
dft = list(df[['A', 'B']].itertuples(index=False))
newdf['C'] = newdf.apply(lambda x: tuple(x) in dft, axis=1)
Output:
A B C
0 100 1 True
1 100 2 True
2 100 3 False
3 200 1 False
4 200 2 True
5 200 3 True
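As a side note, the membership test above works because itertuples() yields namedtuples, which compare equal to plain tuples with the same values. On larger frames a set of plain tuples (a small tweak of mine, not part of the original answer) gives faster lookups:
seen = set(df[['A', 'B']].itertuples(index=False, name=None))    # plain tuples, O(1) membership
newdf['C'] = [t in seen for t in newdf[['A', 'B']].itertuples(index=False, name=None)]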
Using cartesian_product and pd.merge
In [415]: combs = pd.core.reshape.util.cartesian_product(
df.set_index(['A', 'B']).index.levels)
In [416]: combs
Out[416]:
[array([100, 100, 100, 200, 200, 200], dtype=int64),
array([1, 2, 3, 1, 2, 3], dtype=int64)]
In [417]: (pd.DataFrame({'A': combs[0], 'B': combs[1]})
.merge(df, how='left', indicator='C')
.replace({'C': {'both': True, 'left_only': False}}) )
Out[417]:
A B C
0 100 1 True
1 100 2 True
2 100 3 False
3 200 1 False
4 200 2 True
5 200 3 True
For combs, you could also use:
In [432]: pd.core.reshape.util.cartesian_product([df.A.unique(), df.B.unique()])
Out[432]:
[array([100, 100, 100, 200, 200, 200], dtype=int64),
array([1, 2, 3, 1, 2, 3], dtype=int64)]
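If you would rather avoid the private pd.core.reshape.util module, a rough public-API equivalent (my sketch, not from the answer above) builds the grid with MultiIndex.from_product and uses the merge indicator directly as a boolean:
grid = pd.MultiIndex.from_product([df['A'].unique(), df['B'].unique()],
                                  names=['A', 'B']).to_frame(index=False)
result = grid.merge(df, how='left', indicator=True)
result['C'] = result.pop('_merge').eq('both')   # 'both' means the pair exists in df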
I have data similar to this:
data = {'A': [10,20,30,10,-10], 'B': [100,200,300,100,-100], 'C':[1000,2000,3000,1000, -1000]}
df = pd.DataFrame(data)
df
Index    A     B      C
0       10   100   1000
1       20   200   2000
2       30   300   3000
3       10   100   1000
4      -10  -100  -1000
Here index values 0, 3 and 4 are element-wise equal except that one is negative, so for such scenarios I want a fourth column D populated with the value 'Exact opposite' for index values 3 and 4 (either one of the pair would do).
Similar to:
Index    A     B      C   D
0       10   100   1000
1       20   200   2000
2       30   300   3000
3       10   100   1000   Exact opposite
4      -10  -100  -1000   Exact opposite
One approach I can think of is adding a column that sums the values of all the columns:
column_names=['A','B','C']
df['Sum Val'] = df[column_names].sum(axis=1)
Index    A     B      C   Sum Val
0       10   100   1000      1110
1       20   200   2000      2220
2       30   300   3000      3330
3       10   100   1000      1110
4      -10  -100  -1000     -1110
and then check if there are any negative values and try to find the corresponding equal positive value, but I could not proceed from there. Please pardon any mistakes.
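For what it's worth, one rough way to continue from that Sum Val idea (a sketch of my own; note it flags every row involved in an opposite pair, not just one of them, and it assumes matching sums really do come from opposite rows):
import numpy as np

sums = df[column_names].sum(axis=1)
# flag a row when the negated row sum also occurs somewhere in the column
df['D'] = np.where(sums.isin(-sums) & sums.ne(0), 'Exact opposite', '')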
Like this maybe:
In [69]: import numpy as np

# Create column 'D' marking rows that are exact duplicates after 'abs'
In [68]: df['D'] = np.where(df.abs().duplicated(keep=False), 'Duplicate', '')

# Collect the index of those duplicated rows
In [77]: ix = df.index[df.D.eq('Duplicate')]

# If the duplicated rows sum to 0, they are exact opposites
In [78]: if df.loc[ix, ['A', 'B', 'C']].sum(1).sum() == 0:
    ...:     df.loc[ix, 'D'] = 'Exact Opposite'
    ...:
In [79]: df
Out[79]:
A B C D
0 10 100 1000 Exact Opposite
1 20 200 2000
2 30 300 3000
3 -10 -100 -1000 Exact Opposite
To follow your logic, let's just add abs with groupby, so the output returns the paired indices as a list:
df.reset_index().groupby(df['Sum Val'].abs())['index'].agg(list)
Out[367]:
Sum Val
1110 [0, 3]
2220 [1]
3330 [2]
Name: index, dtype: object
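A possible follow-up from that pair list (my own sketch, not part of the answer; it labels every row of a group that contains both a positive and a negative total):
df['D'] = ''
sums = df['Sum Val']
groups = df.reset_index().groupby(sums.abs())['index'].agg(list)
for idx in groups[groups.str.len() > 1]:
    if sums.loc[idx].gt(0).any() and sums.loc[idx].lt(0).any():   # both signs present
        df.loc[idx, 'D'] = 'Exact opposite'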
import pandas as pd

data = {'A': [10, 20, 30, -10], 'B': [100, 200, 300, -100], 'C': [1000, 2000, 3000, -1000]}
df = pd.DataFrame(data)
print(df)

# Sum each row, then label rows whose negated total also appears in the column
df['total'] = df.sum(axis=1)
df['total'] = df['total'].apply(lambda x: "Exact opposite" if sum(df['total'] == -1 * x) else "")
print(df)
Problem
Consider the following dataframe:
import pandas as pd

data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A'],
}
df_so = pd.DataFrame(data_so, columns=['ID', 'letter'])
I want to obtain a new column where all duplicates in different groups are True. All other duplicates in the same group should be False.
What I've tried
I've tried using
df_so['dup'] = df_so.duplicated(subset=['letter'], keep=False)
but the result is not what I want.
The first occurrence of A in the first group (ID 100, row 0) is True because there is a duplicate in another group (row 7). However, all other occurrences of A in the same group (row 2) should be False.
If row 7 is deleted, then row 0 should be False because A is not present anymore in any other group.
What you need is essentially the AND of two different duplicated() calls:
~df_so.duplicated() handles duplicates within a group
df_so.drop_duplicates().duplicated(subset='letter', keep=False).fillna(True) handles duplicates between groups, ignoring within-group duplicates
Code:
import pandas as pd
data_so = {'ID': [100, 100, 100, 200, 200, 300, 300, 300], 'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A']}
df_so = pd.DataFrame(data_so, columns=['ID', 'letter'])
df_so['dup'] = ~df_so.duplicated() & df_so.drop_duplicates().duplicated(subset='letter',keep=False).fillna(True)
print(df_so)
Output:
ID letter dup
0 100 A True
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
7 300 A True
Other case:
data_so = { 'ID': [100, 100, 100, 200, 200, 300, 300], 'letter': ['A','B','A','C','D','E','D'] }
Output:
ID letter dup
0 100 A False
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
As you clarified in the comment, you need an additional mask besides the current duplicated check:
m1 = df_so.duplicated(subset=['letter'], keep=False)
m2 = ~df_so.groupby('ID').letter.apply(lambda x: x.duplicated())
df_so['dup'] = m1 & m2
Out[157]:
ID letter dup
0 100 A True
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
7 300 A True
8 300 A False
Note: I added row 8 as mentioned in the comment.
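For reference, the data with that extra row would look roughly like this (my reconstruction from the output above):
data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A', 'A'],
}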
My idea for this problem:
import datatable as dt
df = dt.Frame(df_so)
df[:1, :, dt.by("ID", "letter")]
I would group by both the ID and letter column. Then simply select the first row.
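For comparison, a rough pandas counterpart of that datatable expression (my sketch) would be:
df_so.drop_duplicates(subset=['ID', 'letter'])   # keeps the first row of each (ID, letter) group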
I have two DataFrames. Each row in DataFrame A has a list of indices corresponding to entries in DataFrame B, plus a set of other values. I want to join the two DataFrames so that each entry in B gets the other values from the row of A whose index list contains that entry's index.
So far I have found a way of extracting the rows in B for the list of indices in each row of A, but only row by row, based on this answer, and I am not sure where to go from there. I am also not sure whether there is a better way of doing this dynamically with pandas, since the length of the index lists can vary.
import pandas as pd
import numpy as np
# Inputs
A = pd.DataFrame.from_dict({
"indices": [[0,1],[2,3],[4,5]],
"a1": ["a","b","c"],
"a2": [100,200,300]
})
print(A)
>> indices a1 a2
>> 0 [0, 1] a 100
>> 1 [2, 3] b 200
>> 2 [4, 5] c 300
B = pd.DataFrame.from_dict({
"b": [10,20,30,40,50,60]
})
print(B)
>> b
>> 0 10
>> 1 20
>> 2 30
>> 3 40
>> 4 50
>> 5 60
# This is the desired output
out = pd.DataFrame.from_dict({
"b": [10,20,30,40,50,60],
"a1": ["a","a", "b", "b", "c", "c"],
"a2": [100,100,200,200,300,300]
})
print(out)
>> b a1 a2
>> 0 10 a 100
>> 1 20 a 100
>> 2 30 b 200
>> 3 40 b 200
>> 4 50 c 300
>> 5 60 c 300
If you have pandas >=0.25, you can use explode:
C = A.explode('indices')
This gives:
indices a1 a2
0 0 a 100
0 1 a 100
1 2 b 200
1 3 b 200
2 4 c 300
2 5 c 300
Then do:
output = pd.merge(B, C, left_index = True, right_on = 'indices')
output.index = output.indices.values
output.drop('indices', axis = 1, inplace = True)
Final Output:
b a1 a2
0 10 a 100
1 20 a 100
2 30 b 200
3 40 b 200
4 50 c 300
5 60 c 300
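If you prefer, the three steps can be collapsed into one chain (a sketch on the same pandas >= 0.25 assumption; the explicit astype is there because explode leaves the exploded column as object dtype):
output = (B.merge(A.explode('indices').astype({'indices': int}),
                  left_index=True, right_on='indices')
            .set_index('indices')
            .rename_axis(None)
            .sort_index())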
Using pd.merge:
df2 = pd.DataFrame(A.set_index(['a1', 'a2']).indices)
df = (pd.DataFrame(df2.indices.values.tolist(), index=df2.index)
        .stack()
        .reset_index()
        .drop('level_2', axis=1)
        .set_index(0))
pd.merge(B, df, left_index=True, right_index=True)
Output
b a1 a2
0 10 a 100
1 20 a 100
2 30 b 200
3 40 b 200
4 50 c 300
5 60 c 300
Here you go:
helper = A.indices.apply(pd.Series).stack().reset_index(level=1, drop=True)
A = A.reindex(helper.index).drop(columns=['indices'])
A['indices'] = helper
B = B.merge(A, left_index=True, right_on='indices').drop(columns=['indices']).reset_index(drop=True)
Result:
b a1 a2
0 10 a 100
1 20 a 100
2 30 b 200
3 40 b 200
4 50 c 300
5 60 c 300
You can also use melt instead of stack, but it's more complicated as you must drop columns you don't need:
import pandas as pd
import numpy as np
# Inputs
A = pd.DataFrame.from_dict({
"indices": [[0,1],[2,3],[4,5]],
"a1": ["a","b","c"],
"a2": [100,200,300]
})
B = pd.DataFrame.from_dict({
"b": [10,20,30,40,50,60]
})
AA = pd.concat([A.indices.apply(pd.Series), A], axis=1)
AA.drop(['indices'], axis=1, inplace=True)
print(AA)
0 1 a1 a2
0 0 1 a 100
1 2 3 b 200
2 4 5 c 300
AA = AA.melt(id_vars=['a1', 'a2'], value_name='val').drop(['variable'], axis=1)
print(AA)
a1 a2 val
0 a 100 0
1 b 200 2
2 c 300 4
3 a 100 1
4 b 200 3
5 c 300 5
pd.merge(AA.set_index(['val']), B, left_index=True, right_index=True)
Out[8]:
a1 a2 b
0 a 100 10
2 b 200 30
4 c 300 50
1 a 100 20
3 b 200 40
5 c 300 60
This solution will handle indices of varying lengths.
A = pd.DataFrame.from_dict({
"indices": [[0,1],[2,3],[4,5]],
"a1": ["a","b","c"],
"a2": [100,200,300]
})
A = A.indices.apply(pd.Series) \
.merge(A, left_index = True, right_index = True) \
.drop(["indices"], axis = 1)\
.melt(id_vars = ['a1', 'a2'], value_name = "index")\
.drop("variable", axis = 1)\
.dropna()
A = A.set_index('index')
B = pd.DataFrame.from_dict({
"b": [10,20,30,40,50,60]
})
B
B.merge(A,left_index=True,right_index=True)
Final Output:
b a1 a2
0 10 a 100
1 20 a 100
2 30 b 200
3 40 b 200
4 50 c 300
5 60 c 300
This is my dataframe:
df = pd.DataFrame({'sym': list('aaaaaabb'), 'key': [1, 1, 1, 1, 2, 2, 3, 3], 'x': [100, 100, 90, 100, 500, 500, 700, 700]})
I group them by key and sym:
groups = df.groupby(['key', 'sym'])
Now I want to check whether all x values in each group are equal. If they are not, I want to drop that group from df. In this case the first group should be omitted.
This is my desired df:
key sym x
4 2 a 500
5 2 a 500
6 3 b 700
7 3 b 700
Use GroupBy.transform with SeriesGroupBy.nunique, compare the result with 1, and filter by boolean indexing:
df1 = df[df.groupby(['key', 'sym'])['x'].transform('nunique').eq(1)]
print (df1)
sym key x
4 a 2 500
5 a 2 500
6 b 3 700
7 b 3 700
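An equivalent spelled with GroupBy.filter (my sketch; it reads closer to the stated intent, though transform is usually faster when there are many groups):
df1 = df.groupby(['key', 'sym']).filter(lambda g: g['x'].nunique() == 1)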
I am trying to solve this question with Python.
import numpy as np
import pandas as pd

ID = np.concatenate((np.repeat("A", 5),
                     np.repeat("B", 4),
                     np.repeat("C", 2)))
Hour = np.array([0, 2, 5, 6, 9, 0, 2, 5, 6, 0, 2])
testVector = [0, 2, 5]
df = pd.DataFrame({'ID': ID, 'Hour': Hour})
We group the rows by ID; we then want to remove all rows of df belonging to groups whose Hour column does not contain every value in testVector. We could achieve that as follows:
def all_in(x,y):
return all([z in list(x) for z in y])
to_keep = df.groupby(by='ID')['Hour'].aggregate(lambda x: all_in(x,testVector))
to_keep = list(to_keep[to_keep].index)
df = df[df['ID'].isin(to_keep)]
I want to make this code as short and efficient as possible. Any suggestions for improvements or alternative solution approaches?
In [99]: test_set = set(testVector)
In [100]: df.loc[df.groupby('ID').Hour.transform(lambda x: set(x) & test_set == test_set)]
Out[100]:
Hour ID
0 0 A
1 2 A
2 5 A
3 6 A
4 9 A
5 0 B
6 2 B
7 5 B
8 6 B
Explanation:
In the lambda x: set(x) & test_set == test_set function we create a set of Hour values for each group:
In [104]: df.groupby('ID').Hour.apply(lambda x: set(x))
Out[104]:
ID
A {0, 2, 5, 6, 9}
B {0, 2, 5, 6}
C {0, 2}
Name: Hour, dtype: object
Then we do set intersection with the test_set:
In [105]: df.groupby('ID').Hour.apply(lambda x: set(x) & test_set)
Out[105]:
ID
A {0, 2, 5}
B {0, 2, 5}
C {0, 2}
Name: Hour, dtype: object
and compare it with the test_set again:
In [106]: df.groupby('ID').Hour.apply(lambda x: set(x) & test_set == test_set)
Out[106]:
ID
A True
B True
C False
Name: Hour, dtype: bool
PS: I used .apply() instead of .transform() above just to show how it works.
But we need to use transform in order to use boolean indexing later on:
In [107]: df.groupby('ID').Hour.transform(lambda x: set(x) & test_set == test_set)
Out[107]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 False
10 False
Name: Hour, dtype: bool
Similar to MaxU's solution but I used a Series instead of a set:
testVector = pd.Series(testVector)
df[df.groupby('ID')['Hour'].transform(lambda x: testVector.isin(x).all())]
Out:
Hour ID
0 0 A
1 2 A
2 5 A
3 6 A
4 9 A
5 0 B
6 2 B
7 5 B
8 6 B
Filter might be more idiomatic here though:
df.groupby('ID').filter(lambda x: testVector.isin(x['Hour']).all())
Out:
Hour ID
0 0 A
1 2 A
2 5 A
3 6 A
4 9 A
5 0 B
6 2 B
7 5 B
8 6 B
Create a set of Hour values for each ID first. Then map those sets back onto the ID column to get a new Series, and compare each set with the vector using the superset operator >=:
df = df[df['ID'].map(df.groupby(by='ID')['Hour'].apply(set)) >= set(testVector)]
print (df)
Hour ID
0 0 A
1 2 A
2 5 A
3 6 A
4 9 A
5 0 B
6 2 B
7 5 B
8 6 B
Timings:
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'ID': np.random.randint(200, size=N),
'Hour': np.random.choice(range(10000),N)})
print (df)
testVector = [0,2,5]
test_set = set(testVector)
s = pd.Series(testVector)
#maxu sol
In [259]: %timeit (df.loc[df.groupby('ID').Hour.transform(lambda x: set(x) & test_set == test_set)])
1 loop, best of 3: 356 ms per loop
#jez sol
In [260]: %timeit (df[df['ID'].map(df.groupby(by='ID')['Hour'].apply(set)) >= set(testVector)])
1 loop, best of 3: 462 ms per loop
#ayhan sol1
In [261]: %timeit (df[df.groupby('ID')['Hour'].transform(lambda x: s.isin(x).all())])
1 loop, best of 3: 300 ms per loop
#ayhan sol2
In [263]: %timeit (df.groupby('ID').filter(lambda x: s.isin(x['Hour']).all()))
1 loop, best of 3: 211 ms per loop