return rows with unique pairs across columns - python

I'm trying to find rows that have unique pairs of values across 2 columns, so this dataframe:
A B
1 0
2 0
3 0
0 1
2 1
3 1
0 2
1 2
3 2
0 3
1 3
2 3
will be reduced to only the rows whose pair is unique when the order is ignored. For instance, 1 and 3 is a combination I only want returned once, so if the same pair exists with the columns flipped (3 and 1), that row can be removed. The table I'm looking to get is:
A B
0 2
0 3
1 0
1 2
1 3
2 3
Where there is only one occurrence of each pair of values, regardless of which column each value appears in.

I think you can use apply sorted + drop_duplicates:
df = df.apply(sorted, axis=1).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
A faster solution with numpy.sort (vectorized, so it avoids a Python-level sorted call per row):
df = pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
A solution without sorting, using DataFrame.min and DataFrame.max:
a = df.min(axis=1)
b = df.max(axis=1)
df['A'] = a
df['B'] = b
df = df.drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
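If you would rather keep the original (unsorted) values of the first occurrence of each pair instead of the sorted ones, a small variation is possible. This is only a sketch (not part of the answer above); it treats each row's pair as an unordered set and drops later rows whose pair was already seen:
key = df.apply(lambda r: frozenset((r['A'], r['B'])), axis=1)  # unordered pair per row
df = df[~key.duplicated()]                                     # keep only first occurrences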

Loading the data:
import numpy as np
import pandas as pd
a = np.array("1 2 3 0 2 3 0 1 3 0 1 2".split(), dtype=np.double)  # split on whitespace (split("\t") would not separate these values)
b = np.array("0 0 0 1 1 1 2 2 2 3 3 3".split(), dtype=np.double)
df = pd.DataFrame(dict(A=a,B=b))
In case you don't need to sort the entire DF:
df["trans"] = df.apply(
    lambda row: (min(row['A'], row['B']), max(row['A'], row['B'])), axis=1
)
df.drop_duplicates("trans")
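Note that drop_duplicates returns a new DataFrame rather than modifying df in place, so assign the result back; the helper column can also be dropped once it has done its job. A minimal follow-up sketch:
df = df.drop_duplicates("trans").drop(columns="trans")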

Related

Pandas sum all columns of group result into a single result

I've generated the following dummy data, where the number of rows per id ranges from 1 to 5:
import pandas as pd, numpy as np
import random
import functools
import operator
uid = functools.reduce(operator.iconcat, [np.repeat(x,random.randint(1,5)).tolist() for x in range(10)], [])
df = pd.DataFrame(columns=list("ABCD"), data= np.random.randint(1, 5, size=(len(uid), 4)))
df['id'] = uid
df.head()
A B C D id
0 1 1 2 2 0
1 1 2 4 4 0
2 2 3 3 2 0
3 4 3 3 1 1
4 1 3 4 4 1
I would like to group by id and then sum all the values, i.e.:
A B C D id
0 1 1 2 2 0
1 1 2 4 4 0
2 2 3 3 2 0 = (1+1+2+2) + (1+2+4+4) + (2+3+3+2) = 27
Then duplicate the value for the group so the result would be:
A B C D id val
0 1 1 2 2 0 27
1 1 2 4 4 0 27
2 2 3 3 2 0 27
I've tried df.groupby('id').sum(), but it sums each column separately.
You can use set_index and then stack, followed by groupby and sum, then Series.map:
df['val'] = df['id'].map(df.set_index("id").stack().groupby(level=0).sum())
Or, as suggested by @jezrael, sum has a level=0 argument which does the same as above (note that the level argument of sum is deprecated in newer pandas versions in favour of an explicit groupby):
df['val'] = df['id'].map(df.set_index("id").stack().sum(level=0))
A B C D id val
0 4 2 4 1 0 34
1 3 4 4 2 0 34
2 2 4 3 1 0 34
3 1 1 1 3 1 6
4 2 3 1 4 2 50
You can first sum all columns except id and then use GroupBy.transform:
df['val'] = df.drop('id', axis=1).sum(axis=1).groupby(df['id']).transform('sum')
Another idea:
df['val'] = df['id'].map( df.groupby('id').sum().sum(axis=1))
M = df.groupby("id").sum()
print(M)
print(np.sum(M, axis=1))
First, M holds the per-column sums within each id group. Summing each row of M then gives the total for that id, which is the wanted result.
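The snippet above stops at the per-id totals; to attach them back to every row as val, as the question asks, one more step is needed. A minimal sketch (my addition, reusing M from above, not part of the original answer):
df['val'] = df['id'].map(np.sum(M, axis=1))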

Change 1st row of a dataframe based on a condition in pandas

I have 2 columns, and based on their values I want to update a third column, but only for 1 row.
I have:
df = pd.DataFrame({'A':[1,1,2,3,4,4],
                   'B':[2,2,4,3,2,1],
                   'C':[0] * 6})
print (df)
A B C
0 1 2 0
1 1 2 0
2 2 4 0
3 3 3 0
4 4 2 0
5 4 1 0
If A=1 and B=2, then only the 1st matching row should get C=1, like this:
print (df)
A B C
0 1 2 1
1 1 2 0
2 2 4 0
3 3 3 0
4 4 2 0
5 4 1 0
Right now I have used
df.loc[(df['A']==1) & (df['B']==2)].iloc[[0]].loc['C'] = 1
but it doesn't change the dataframe.
Solution if the mask always matches at least one row:
Create a boolean mask and set the value at the first True index, found with idxmax:
mask = (df['A']==1) & (df['B']==2)
df.loc[mask.idxmax(), 'C'] = 1
But if no value matches, idxmax returns the index of the first False value, so add an if-else:
mask = (df['A']==1) & (df['B']==2)
idx = mask.idxmax() if mask.any() else np.repeat(False, len(df))
df.loc[idx, 'C'] = 1
print (df)
A B C
0 1 2 1
1 1 2 0
2 2 4 0
3 3 3 0
4 4 2 0
5 4 1 0
mask = (df['A']==10) & (df['B']==20)
idx = mask.idxmax() if mask.any() else np.repeat(False, len(df))
df.loc[idx, 'C'] = 1
print (df)
A B C
0 1 2 0
1 1 2 0
2 2 4 0
3 3 3 0
4 4 2 0
5 4 1 0
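A slightly more explicit variant of the same idea (my sketch, not from the answer above) is to guard the assignment with a plain if, which avoids building the all-False indexer:
mask = (df['A']==1) & (df['B']==2)
if mask.any():
    df.loc[mask.idxmax(), 'C'] = 1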
Using pd.Series.cumsum to ensure only the first row matching the criteria is affected:
mask = df['A'].eq(1) & df['B'].eq(2)
df.loc[mask & mask.cumsum().eq(1), 'C'] = 1
print(df)
A B C
0 1 2 1
1 1 2 0
2 2 4 0
3 3 3 0
4 4 2 0
5 4 1 0
If performance is a concern, see Efficiently return the index of the first value satisfying condition in array.
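For illustration, a rough numpy-based sketch of that idea (the names here are illustrative, not taken from the linked question):
arr = (df['A'].to_numpy() == 1) & (df['B'].to_numpy() == 2)
pos = arr.argmax()  # position of the first True (0 if there is no True at all)
if arr[pos]:        # confirm a match actually exists
    df.iloc[pos, df.columns.get_loc('C')] = 1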

Concat() alternate group by python3.0

My goal here is to concat() alternating groups from two dataframes.
Desired result:
group ordercode quantity
0 A 1
B 1
C 1
D 1
0 A 1
B 3
1 A 1
B 2
C 1
1 A 1
B 1
C 2
My dataframes:
import pandas as pd
df1=pd.DataFrame([[0,"A",1],[0,"B",1],[0,"C",1],[0,"D",1],[1,"A",1],[1,"B",2],[1,"C",1]],columns=["group","ordercode","quantity"])
df2=pd.DataFrame([[0,"A",1],[0,"B",3],[1,"A",1],[1,"B",1],[1,"C",2]],columns=["group","ordercode","quantity"])
print(df1)
print(df2)
I have used dfff = pd.concat([df1,df2]).sort_index(kind="merge")
but I got the result below:
group ordercode quantity
0 0 A 1
0 0 A 1
1 B 1
1 B 3
2 C 1
3 D 1
4 1 A 1
4 1 A 1
5 B 2
5 B 1
6 C 1
6 C 2
You can see here that the concatenation interleaves row by row, not group by group.
It should print like:
group 0 of df1
group 0 of df2
group 1 of df1
group 1 of df2, and so on.
Note:
I have created these DataFrames using the groupby() function:
df = pd.DataFrame(np.concatenate(df.apply(lambda x: [x[0]] * x[1], 1).to_numpy()),  # .as_matrix() in the original; .to_numpy() is the current equivalent
                  columns=['ordercode'])
df['quantity'] = 1
df['group'] = sorted(list(range(0, len(df)//3, 1)) * 4)[0:len(df)]
df = df.groupby(['group', 'ordercode']).sum()
Question:
Where did I go wrong? It seems to be sorting by the index.
I have used .set_index("group"), but it didn't work either.
Use cumcount to create a helper column, then sort by group and the helper column with sort_values:
df1['g'] = df1.groupby('ordercode').cumcount()
df2['g'] = df2.groupby('ordercode').cumcount()
dfff = pd.concat([df1,df2]).sort_values(['group','g']).reset_index(drop=True)
print (dfff)
group ordercode quantity g
0 0 A 1 0
1 0 B 1 0
2 0 C 1 0
3 0 D 1 0
4 0 A 1 0
5 0 B 3 0
6 1 C 2 0
7 1 A 1 1
8 1 B 2 1
9 1 C 1 1
10 1 A 1 1
11 1 B 1 1
and finally remove the helper column:
dfff = dfff.drop('g', axis=1)
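As an aside (a sketch of an alternative, not from the answer above): because pd.concat([df1, df2]) already places all of df1's rows before df2's, a stable sort on group alone gives the same group-wise interleaving without a helper column:
dfff = pd.concat([df1, df2]).sort_values('group', kind='mergesort').reset_index(drop=True)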

How to apply cumulative count on multiple columns of a dataframe

Dataframe
a b c
0 0 1 1
1 0 1 1
2 0 0 1
3 0 0 1
4 1 1 0
5 1 1 1
6 1 1 1
7 0 0 1
I am trying to apply a cumulative count (cumcount) on multiple columns of a dataframe; I have tried applying the cumulative count by grouping each column. Is there any easy way to achieve the expected output?
I have tried this code, but it is not working:
li = []
for column in df.columns:
    li.append(df.groupby(column)[column].cumcount())
pd.concat(li, axis=1)
Expected output
a b c
0 1 1 1
1 1 2 2
2 1 1 3
3 1 1 4
4 1 1 1
5 2 2 1
6 3 3 2
7 1 1 3
Create consecutive-run group ids by comparing with the shifted values, apply cumcount per column, and finally set 1 where the original value is 0 using a boolean mask:
df = (df.ne(df.shift()).cumsum()                        # run ids: increment whenever a value changes
        .apply(lambda x: df.groupby(x).cumcount() + 1)  # 1-based position within each run
        .mask(df == 0, 1))                              # rows where the original value is 0 become 1
print (df)
a b c
0 1 1 1
1 1 2 2
2 1 1 3
3 1 1 4
4 1 1 1
5 2 2 1
6 3 3 2
7 1 1 3
Another solution if performance is important: count only the 1 values and then set 1 everywhere else with np.where:
a = df == 1
b = a.cumsum()  # running count of 1s per column
# subtract the count accumulated before the current run of 1s, so each run restarts at 1
arr = np.where(a, b - b.mask(a).ffill().fillna(0).astype(int), 1)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df)
a b c
0 1 1 1
1 1 2 2
2 1 1 3
3 1 1 4
4 1 1 1
5 2 2 1
6 3 3 2
7 1 1 3

Apply a value to all instances of a number based on conditions

I have a df like this:
ID Number
1 0
1 0
1 1
2 0
2 0
3 1
3 1
3 0
I want to apply a 5 to any ID that has a 1 anywhere in the Number column, and a 0 to those that don't. For example, if the number "1" appears anywhere in the Number column for ID 1, I want to place a 5 in the Total column for every instance of that ID.
My desired output would look like this:
ID Number Total
1 0 5
1 0 5
1 1 5
2 0 0
2 0 0
3 1 5
3 1 5
3 0 5
I'm trying to think of a way to leverage applymap for this, but I'm not sure how to implement it.
Use transform to add a column to your df as a result of a groupby on 'ID':
In [6]:
df['Total'] = df.groupby('ID').transform(lambda x: 5 if (x == 1).any() else 0)
df
Out[6]:
ID Number Total
0 1 0 5
1 1 0 5
2 1 1 5
3 2 0 0
4 2 0 0
5 3 1 5
6 3 1 5
7 3 0 5
You can use DataFrame.groupby() on the ID column, take the max() of the Number column, turn that into a dictionary, and then use it to create the 'Total' column. Example:
grouped = df.groupby('ID')['Number'].max().to_dict()
df['Total'] = df.apply((lambda row:5 if grouped[row['ID']] else 0), axis=1)
Demo -
In [44]: df
Out[44]:
ID Number
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 3 1
6 3 1
7 3 0
In [56]: grouped = df.groupby('ID')['Number'].max().to_dict()
In [58]: df['Total'] = df.apply((lambda row:5 if grouped[row['ID']] else 0), axis=1)
In [59]: df
Out[59]:
ID Number Total
0 1 0 5
1 1 0 5
2 1 1 5
3 2 0 0
4 2 0 0
5 3 1 5
6 3 1 5
7 3 0 5
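As a compact variant in the same spirit (a sketch, not one of the answers above): since Number only holds 0 and 1, the per-group maximum can be broadcast back with transform and scaled to 5:
df['Total'] = df.groupby('ID')['Number'].transform('max') * 5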
