How to delete duplicate lists in each row (pandas)?

I have a list in each row and I would like to remove the duplicated lists, keeping the one with the highest score.
Here is my data from DataFrame df1:
pair score
0 [A , A ] 1.0000
1 [A , F ] 0.9990
2 [A , G ] 0.9985
3 [A , G ] 0.9975
4 [A , H ] 0.9985
5 [A , H ] 0.9990
I would like to see the result as
pair score
0 [A , A ] 1.0000
1 [A , F ] 0.9990
2 [A , G ] 0.9985
4 [A , H ] 0.9990
I have tried to use groupby and take the max of score, but it's not working.
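For reference, the example frame can be rebuilt as below, and the groupby/max idea can be made to work once each list is converted to a hashable tuple key. This is a hedged sketch, not the approach in the answers below; the tuple key is an assumption added for illustration.
import pandas as pd

df1 = pd.DataFrame({'pair': [['A', 'A'], ['A', 'F'], ['A', 'G'],
                             ['A', 'G'], ['A', 'H'], ['A', 'H']],
                    'score': [1.0000, 0.9990, 0.9985, 0.9975, 0.9985, 0.9990]})

# lists are unhashable, so group on a tuple view of the pair column
key = df1['pair'].map(tuple)
result = df1.loc[df1.groupby(key)['score'].idxmax()]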

First, I think working with lists in pandas is not a good idea.
A solution that works is to convert the lists to a helper column of tuples, then use sort_values with drop_duplicates:
df['new'] = df.pair.apply(tuple)
df = df.sort_values('score', ascending=False).drop_duplicates('new')
print (df)
pair score new
0 [A, A] 1.0000 (A, A)
1 [A, F] 0.9990 (A, F)
5 [A, H] 0.9990 (A, H)
2 [A, G] 0.9985 (A, G)
Or split the pair into 2 new columns:
df[['a', 'b']] = pd.DataFrame(df.pair.values.tolist())
df = df.sort_values('score', ascending=False).drop_duplicates(['a', 'b'])
print (df)
pair score a b
0 [A, A] 1.0000 A A
1 [A, F] 0.9990 A F
5 [A, H] 0.9990 A H
2 [A, G] 0.9985 A G

Make a new column pair2 with the sorted values converted to strings, and then drop duplicates.
This will also handle the case where pair contains [A, G] and [G, A], treating them as the same.
df['pair2']=df.pair.map(sorted).astype(str)
df.sort_values('score',ascending=False).drop_duplicates('pair2',keep='first').drop('pair2',axis=1).reset_index(drop=True)
Output:
pair score
[A, A] 1.0000
[A, F] 0.9990
[A, H] 0.9990
[A, G] 0.9985

Related

Check for pairs not having same values in pandas dataframe

I have a dataset in the form of a pandas dataframe:
   A  B  C  D  F
A  0  4  5  8  7
B  8  0  6  9  5
C  5  3  0  2  6
D  5  7  4  0  2
E  1  2  3  4  1
In this dataset, I want to find the name pairs whose values are not the same in both directions. It should also work for non-square matrices. Example:
A to B is 4. So, B to A must be 4. But B to A is 8.
A to C is 5. So, C to A must be 5. OK.
A to D is 8. So, D to A must be 8. But D to A is 5.
B to C is 6. So, C to B must be 6. But C to B is 3.
and so on...
So, I want the output as:
(A,B,4) and (B,A,8)
(A,D,8) and (D,A,5)
(B,C,6) and (C,B,3)
Don't print where the value is the same.
I am trying to do it using numpy arrays and dictionaries but can't figure out the exact logic.
Here's an approach using Pandas
data = {'A':[0, 8,5,5,1], 'B':[4,0,3,7,2], 'C':[5,6,0,4,3], 'D':[8,9,2,0,4], 'F':[7,5,6,2,1]}
tdf = pd.DataFrame(data, index=['A', 'B', 'C', 'D', 'E'])
for idx in tdf.index:
    if idx in tdf.columns:
        for col in tdf.columns:
            if col in tdf.index and col != idx and tdf[idx][col] != tdf[col][idx]:
                print(f'({idx}, {col}, {tdf[idx][col]}) and ({col}, {idx}, {tdf[col][idx]})')
tdf:
A B C D F
A 0 4 5 8 7
B 8 0 6 9 5
C 5 3 0 2 6
D 5 7 4 0 2
E 1 2 3 4 1
and output is of the form:
(A, B, 8) and (B, A, 4)
(A, D, 5) and (D, A, 8)
(B, A, 4) and (A, B, 8)
(B, C, 3) and (C, B, 6)
(B, D, 7) and (D, B, 9)
(C, B, 6) and (B, C, 3)
(C, D, 4) and (D, C, 2)
(D, A, 8) and (A, D, 5)
(D, B, 9) and (B, D, 7)
(D, C, 2) and (C, D, 4)
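Note that the double loop above visits both (idx, col) and (col, idx), so every mismatch is printed twice. Restricting the pairs to one triangle avoids the duplicates; a minimal sketch along the same lines:
cols = [c for c in tdf.columns if c in tdf.index]
for i, idx in enumerate(cols):
    for col in cols[i + 1:]:  # visit each unordered pair once
        if tdf[idx][col] != tdf[col][idx]:
            print(f'({idx}, {col}, {tdf[idx][col]}) and ({col}, {idx}, {tdf[col][idx]})')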
Making a dataframe, thanks to what @itprorh66 has provided:
df = pd.DataFrame({'A':[0, 8,5,5,1], 'B':[4,0,3,7,2], 'C':[5,6,0,4,3],
'D':[8,9,2,0,4], 'F':[7,5,6,2,1]}, index=['A', 'B', 'C', 'D', 'E'])
Intersect the index and columns and create a square DataFrame:
cmg = df.index.intersection(df.columns)
df = df[cmg].loc[cmg]
We can use the numpy upper and lower triangle functions, and pull out the indices for the upper triangle:
mat = df.to_numpy()
tr = np.triu_indices(len(cmg),k=1)
Then put everything into a dataframe. The joining of the row names and column names is a bit unsightly, but that's the best I can do for now:
match_tri = pd.DataFrame({'i1': df.index[tr[0]] + ',' + df.columns[tr[1]],
                          'v1': mat[tr],
                          'i2': df.index[tr[1]] + ',' + df.columns[tr[0]],
                          'v2': mat.T[tr]})
Then we just subset based on the values:
match_tri[match_tri.v1 != match_tri.v2]
i1 v1 i2 v2
0 A,B 4 B,A 8
2 A,D 8 D,A 5
3 B,C 6 C,B 3
4 B,D 9 D,B 7
5 C,D 2 D,C 4
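An equivalent, slightly more compact sketch compares the square block with its transpose directly and prints the mismatches; it assumes df has already been reduced to the square block as above.
import numpy as np

vals = df.to_numpy()
mask = (vals != vals.T) & np.triu(np.ones(vals.shape, dtype=bool), k=1)  # each unordered pair once
for r, c in np.argwhere(mask):
    print(f'({df.index[r]},{df.columns[c]},{vals[r, c]}) and '
          f'({df.columns[c]},{df.index[r]},{vals[c, r]})')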

Arrange sequences of entries in pairs in a dataframe

Given a table of the form:
ID Sequence
1 A,C,D,E,F,G
2 D,F,G,B
3 A,B,A,C
and so on
Now I wish to arrange this data so that it can be fed into an RNN in a sequential manner, so that I'm able to predict the next entry in each sequence. So here's what's required (in a new dataframe), in the form of all possible sequences:
X Y
A,C,D E
C,D,E F
D,E,F G
D,F,G B
A,B,A C
X could be of length 3 or any custom length. How should I go about it?
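The core of the task is a sliding window over each split sequence. A minimal sketch of that logic on the first row, with the window length n as a parameter (3 here, per the example):
seq = 'A,C,D,E,F,G'.split(',')
n = 3
windows = [(seq[i:i + n], seq[i + n]) for i in range(len(seq) - n)]
# [(['A', 'C', 'D'], 'E'), (['C', 'D', 'E'], 'F'), (['D', 'E', 'F'], 'G')]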
Here's another way, using str.split and applying pd.Series to the sublists:
In [623]: df.Sequence.str.split(',')\
...: .apply(lambda x: pd.Series([x[i : i + 3], x[i + 3]] for i in range(0, len(x)- 3))).stack()\
...: .apply(lambda x: pd.Series([x[0], x[1]]))\
...: .reset_index(drop=True)
Out[623]:
0 1
0 [A, C, D] E
1 [C, D, E] F
2 [D, E, F] G
3 [D, F, G] B
4 [A, B, A] C
Setting the columns is as simple as df.columns = ['X', 'Y'].
This will do the job:
vals = [l.split(',') for l in df.Sequence.values]
X,Y=zip(*sum([[[','.join(el[i:i+3]),el[i+3]] for i in range(len(el)-3)] for el in vals],[]))
res=pd.DataFrame({'X':X,'Y':Y})
Then res is
X Y
0 A,C,D E
1 C,D,E F
2 D,E,F G
3 D,F,G B
4 A,B,A C
Here's one of the (many) ways of doing it.
In [52]: vals = df.Sequence.str.split(',')
In [53]: seqs = []
In [54]: for val in vals:
...: seqs += [{'X': val[i:i+3], 'Y': val[i+3]} for i in range(len(val)-3)]
...:
In [55]: pd.DataFrame(seqs)
Out[55]:
X Y
0 [A, C, D] E
1 [C, D, E] F
2 [D, E, F] G
3 [D, F, G] B
4 [A, B, A] C
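Since X may be of length 3 or any custom length, the per-row loop above generalises into a small helper. The function name windowize, its n parameter and the col argument are hypothetical, chosen for illustration:
import pandas as pd

def windowize(frame, n=3, col='Sequence'):
    """Build (X, Y) rows with a sliding window of length n over each sequence."""
    rows = []
    for seq in frame[col].str.split(','):
        rows += [{'X': ','.join(seq[i:i + n]), 'Y': seq[i + n]}
                 for i in range(len(seq) - n)]
    return pd.DataFrame(rows)

res = windowize(df, n=3)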

How to convert column with list of values into rows in Pandas DataFrame

Hi I have a dataframe like this:
A B
0: some value [[L1, L2]]
I want to change it into:
A B
0: some value L1
1: some value L2
How can I do that?
Pandas >= 0.25
df1 = pd.DataFrame({'A':['a','b'],
'B':[[['1', '2']],[['3', '4', '5']]]})
print(df1)
A B
0 a [[1, 2]]
1 b [[3, 4, 5]]
df1 = df1.explode('B')  # first explode unwraps the outer list
df1.explode('B')        # second explode gives one row per value
A B
0 a 1
0 a 2
1 b 3
1 b 4
1 b 5
I don't know how good this approach is but it works when you have a list of items.
you can do it this way:
In [84]: df
Out[84]:
A B
0 some value [[L1, L2]]
1 another value [[L3, L4, L5]]
In [85]: (df['B'].apply(lambda x: pd.Series(x[0]))
....: .stack()
....: .reset_index(level=1, drop=True)
....: .to_frame('B')
....: .join(df[['A']], how='left')
....: )
Out[85]:
B A
0 L1 some value
0 L2 some value
1 L3 another value
1 L4 another value
1 L5 another value
UPDATE: a more generic solution
Faster solution with chain.from_iterable and numpy.repeat:
from itertools import chain
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':['a','b'],
'B':[[['A1', 'A2']],[['A1', 'A2', 'A3']]]})
print (df)
A B
0 a [[A1, A2]]
1 b [[A1, A2, A3]]
df1 = pd.DataFrame({"A": np.repeat(df.A.values,
                                   [len(x) for x in chain.from_iterable(df.B)]),
                    "B": list(chain.from_iterable(chain.from_iterable(df.B)))})
print (df1)
A B
0 a A1
1 a A2
2 b A1
3 b A2
4 b A3
Timings:
import random
import string

A = np.unique(np.random.randint(0, 1000, 1000))
B = [[list(string.ascii_letters[:random.randint(3, 10)])] for _ in range(len(A))]
df = pd.DataFrame({"A":A, "B":B})
print (df)
A B
0 0 [[a, b, c, d, e, f, g, h]]
1 1 [[a, b, c]]
2 3 [[a, b, c, d, e, f, g, h, i]]
3 5 [[a, b, c, d, e]]
4 6 [[a, b, c, d, e, f, g, h, i]]
5 7 [[a, b, c, d, e, f, g]]
6 8 [[a, b, c, d, e, f]]
7 10 [[a, b, c, d, e, f]]
8 11 [[a, b, c, d, e, f, g]]
9 12 [[a, b, c, d, e, f, g, h, i]]
10 13 [[a, b, c, d, e, f, g, h]]
...
...
In [67]: %timeit pd.DataFrame({ "A": np.repeat(df.A.values, [len(x) for x in (chain.from_iterable(df.B))]),"B": list(chain.from_iterable(chain.from_iterable(df.B)))})
1000 loops, best of 3: 818 µs per loop
In [68]: %timeit ((df['B'].apply(lambda x: pd.Series(x[0])).stack().reset_index(level=1, drop=True).to_frame('B').join(df[['A']], how='left')))
10 loops, best of 3: 103 ms per loop
I can't find an elegant way to handle this, but the following code works...
import pandas as pd
import numpy as np
df = pd.DataFrame([{"a":1,"b":[[1,2]]},{"a":4, "b":[[3,4,5]]}])
z = []
for k, row in df.iterrows():
    for j in list(np.array(row.b).flat):
        z.append({'a': row.a, 'b': j})
result = pd.DataFrame(z)
I think this is the fastest and simplest way:
df = pd.DataFrame({'A':['a','b'],
'B':[[['A1', 'A2']],[['A1', 'A2', 'A3']]]})
df.set_index('A')['B'].apply(lambda x: pd.Series(x[0]))
Here's another option
unpacked = (pd.melt(df.B.apply(pd.Series).reset_index(), id_vars='index')
            .merge(df, left_on='index', right_index=True))
unpacked = (unpacked.loc[unpacked.value.notnull(), :]
            .drop(columns=['index', 'variable', 'B'])
            .rename(columns={'value': 'B'}))
1. Apply pd.Series to column B --> splits each list entry into a different row
2. Melt this, so that each entry is a separate row (preserving the index)
3. Merge this back onto the original dataframe
4. Tidy up - drop the unnecessary columns and rename the values column

Using apply on a column

I have a dataframe like this one.
A B C D E
0 a b c d e
1 f g h i j
2 k l m n o
3 p q r s t
What I'd like is to get a dataframe with each column as a list.
0
0 [a, f, k, p]
1 [b, g, l, q]
2 [c, h, m, r]
3 [d, i, n, s]
4 [e, j, o, t]
I'd like to somehow apply a function to each column, converting it to a list and placing it in a new DataFrame. However, apply only operates on individual entries.
df2 = pd.DataFrame(df.transpose().apply(lambda x: [', '.join(x)], axis=1))
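If real Python lists (rather than comma-joined strings) are wanted in the new frame, a hedged alternative is a plain comprehension over the columns; the frame layout is assumed to be the one shown in the question:
import pandas as pd

# one list per original column; the 0 column name mirrors the desired output
df2 = pd.DataFrame({0: [df[c].tolist() for c in df.columns]})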

Optimizing pandas filter inside apply function

I have a list of pairs--stored in a DataFrame--each pair having an 'a' column and a 'b' column. For each pair I want to return the 'b's that have the same 'a'. For example, given the following set of pairs:
a b
0 c d
1 e f
2 c g
3 e h
4 i j
5 e k
I would like to end up with:
a b equivalents
0 c d [g]
1 e f [h, k]
2 c g [d]
3 e h [f, k]
4 i j []
5 e k [f, h]
I can do this with the following:
def equivalents(x):
    l = pairs[pairs["a"] == x["a"]]["b"].tolist()
    return l[1:] if l else l

pairs["equivalents"] = pairs.apply(equivalents, axis=1)
But it is painfully slow on larger sets (e.g. 1 million plus pairs). Any suggestions how I could do this faster?
I think this ought to be a bit faster. First, just add them up.
df['equiv'] = df.groupby('a')['b'].transform(sum)
a b equiv
0 c d dg
1 e f fhk
2 c g dg
3 e h fhk
4 i j j
5 e k fhk
Now convert to a list and remove whichever letter is already in column 'b'.
df.apply( lambda x: [ y for y in list( x.equiv ) if y != x.b ], axis=1 )
0 [g]
1 [h, k]
2 [d]
3 [f, k]
4 []
5 [f, h]
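A hedged sketch of a further speed-up along the same lines: build each group's list once with groupby, then strip every row's own value with a list comprehension instead of a per-row apply. This assumes the 'b' values are unique, as in the example:
# map each 'a' to the full list of its 'b' values, computed once
groups = pairs.groupby('a')['b'].apply(list).to_dict()
pairs['equivalents'] = [[b for b in groups[a] if b != own]
                        for a, own in zip(pairs['a'], pairs['b'])]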
