Related
I need to combine column 1 and 2 in 3 with separator , but ignore empty cells.
So I have this dataframe:
1 2
0 A, B, B D
1 C, D
2 B, B, C D, A
And need to create column 3 (desired output):
1 2 3
0 A, B, B D A, B, B, D
1 C, D C, D
2 B, B, C D, A B, B, C, D, A
So as you see here, empty sell was ignored and , separate elements between in df["3"] (B, B, C, D, A).
I try to do this with simple concatenation, but didn't succeed.
If I simple concatenate df["1"] + df["2"] I will get that the last element of first column combine with first element of last column (BD, CD):
1 2 3
0 A, B, B D A, B, BD
1 C, D C, D
2 B, B, C D, A B, B, CD, A
If add ", " (df["1"] + ", " + df["2"]):
1 2 3
0 A, B, B D A, B, B, D
1 C, D , C, D
2 B, B, C D, A B, B, C, D, A
You will see that each empty cell replace with ", " and added to df["3"] (example = , C, D, but I need C, D).
Code for reproduce:
import pandas as pd
df = pd.DataFrame({"1":["A, B, B","","B, B, C"], "2":["D","C, D","D, A"]})
print(df)
Use str.strip for possible removing , from both sides:
(df["1"] + ", " + df["2"]).str.strip(', ')
how about multiple columns with empty cells in the middle?
A|B|C|D|E|F
1| |3| |5|6
This should produce:
A|B|C|D|E|F|Full
1| |3| |5|6|1$3$5$6
I have a list contained in each row and I would like to delete duplicated element by keeping the highest value from a score.
here is my data from data frame df1
pair score
0 [A , A ] 1.0000
1 [A , F ] 0.9990
2 [A , G ] 0.9985
3 [A , G ] 0.9975
4 [A , H ] 0.9985
5 [A , H ] 0.9990
I would like to see the result as
pair score
0 [A , A ] 1.0000
1 [A , F ] 0.9990
2 [A , G ] 0.9985
4 [A , H ] 0.9990
I have tried to use group by and set a score = max, but it's not working
First I think working with lists in pandas is not good idea.
Solution working if convert lists to helper column with tuples - then sort_values with drop_duplicates:
df['new'] = df.pair.apply(tuple)
df = df.sort_values('score', ascending=False).drop_duplicates('new')
print (df)
pair score new
0 [A, A] 1.0000 (A, A)
1 [A, F] 0.9990 (A, F)
5 [A, H] 0.9990 (A, H)
2 [A, G] 0.9985 (A, G)
Or to 2 new columns:
df[['a', 'b']] = pd.DataFrame(df.pair.values.tolist())
df = df.sort_values('score', ascending=False).drop_duplicates(['a', 'b'])
print (df)
pair score a b
0 [A, A] 1.0000 A A
1 [A, F] 0.9990 A F
5 [A, H] 0.9990 A H
2 [A, G] 0.9985 A G
Make new column pair2 with sorted values of string type and then drop duplicates
Will handle if pair have value [A,G] and [G,A] treating them same
df['pair2']=df.pair.map(sorted).astype(str)
df.sort_values('score',ascending=False).drop_duplicates('pair2',keep='first').drop('pair2',axis=1).reset_index(drop=True)
Ouput:
pair score
[A, A] 1.0000
[A, F] 0.9990
[A, H] 0.9990
[A, G] 0.9985
Given a table of the form:
ID Sequence
1 A,C,D,E,F,G
2 D,F,G,B
3 A,B,A,C
and so on
Now I wish to arrange this data so that it can be fed into a RNN in a sequential manner so that I'm able to predict the next entry in each sequence. So here's what's required (in a new dataframe) in the form of all possible sequences:
X Y
A,C,D E
C,D,E F
D,E,F G
D,F,G B
A,B,A C
X could be of length 3 or any custom length. How should I go about it?
Here's another way using df.split and applying pd.Series to sublists:
In [623]: df.Sequence.str.split(',')\
...: .apply(lambda x: pd.Series([x[i : i + 3], x[i + 3]] for i in range(0, len(x)- 3))).stack()\
...: .apply(lambda x: pd.Series([x[0], x[1]]))\
...: .reset_index(drop=True)
Out[623]:
0 1
0 [A, C, D] E
1 [C, D, E] F
2 [D, E, F] G
3 [D, F, G] B
4 [A, B, A] C
Setting the columns is as simple as df.columns = ['X', 'Y'].
This will do the job:
vals=[l.split(',') for l in df.sequences.values]
X,Y=zip(*sum([[[','.join(el[i:i+3]),el[i+3]] for i in range(len(el)-3)] for el in vals],[]))
res=pd.DataFrame({'X':X,'Y':Y})
Then res is
X Y
0 A,C,D E
1 C,D,E F
2 D,E,F G
3 D,F,G B
4 A,B,A C
Here's one of the (many) ways of doing it.
In [52]: vals = df.Sequence.str.split(',')
In [53]: seqs = []
In [54]: for val in vals:
...: seqs += [{'X': val[i:i+3], 'Y': val[i+3]} for i in xrange(len(val)-3)]
...:
In [55]: pd.DataFrame(seqs)
Out[55]:
X Y
0 [A, C, D] E
1 [C, D, E] F
2 [D, E, F] G
3 [D, F, G] B
4 [A, B, A] C
Hi I have a dataframe like this:
A B
0: some value [[L1, L2]]
I want to change it into:
A B
0: some value L1
1: some value L2
How can I do that?
Pandas >= 0.25
df1 = pd.DataFrame({'A':['a','b'],
'B':[[['1', '2']],[['3', '4', '5']]]})
print(df1)
A B
0 a [[1, 2]]
1 b [[3, 4, 5]]
df1 = df1.explode('B')
df1.explode('B')
A B
0 a 1
0 a 2
1 b 3
1 b 4
1 b 5
I don't know how good this approach is but it works when you have a list of items.
you can do it this way:
In [84]: df
Out[84]:
A B
0 some value [[L1, L2]]
1 another value [[L3, L4, L5]]
In [85]: (df['B'].apply(lambda x: pd.Series(x[0]))
....: .stack()
....: .reset_index(level=1, drop=True)
....: .to_frame('B')
....: .join(df[['A']], how='left')
....: )
Out[85]:
B A
0 L1 some value
0 L2 some value
1 L3 another value
1 L4 another value
1 L5 another value
UPDATE: a more generic solution
Faster solution with chain.from_iterable and numpy.repeat:
from itertools import chain
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':['a','b'],
'B':[[['A1', 'A2']],[['A1', 'A2', 'A3']]]})
print (df)
A B
0 a [[A1, A2]]
1 b [[A1, A2, A3]]
df1 = pd.DataFrame({ "A": np.repeat(df.A.values,
[len(x) for x in (chain.from_iterable(df.B))]),
"B": list(chain.from_iterable(chain.from_iterable(df.B)))})
print (df1)
A B
0 a A1
1 a A2
2 b A1
3 b A2
4 b A3
Timings:
A = np.unique(np.random.randint(0, 1000, 1000))
B = [[list(string.ascii_letters[:random.randint(3, 10)])] for _ in range(len(A))]
df = pd.DataFrame({"A":A, "B":B})
print (df)
A B
0 0 [[a, b, c, d, e, f, g, h]]
1 1 [[a, b, c]]
2 3 [[a, b, c, d, e, f, g, h, i]]
3 5 [[a, b, c, d, e]]
4 6 [[a, b, c, d, e, f, g, h, i]]
5 7 [[a, b, c, d, e, f, g]]
6 8 [[a, b, c, d, e, f]]
7 10 [[a, b, c, d, e, f]]
8 11 [[a, b, c, d, e, f, g]]
9 12 [[a, b, c, d, e, f, g, h, i]]
10 13 [[a, b, c, d, e, f, g, h]]
...
...
In [67]: %timeit pd.DataFrame({ "A": np.repeat(df.A.values, [len(x) for x in (chain.from_iterable(df.B))]),"B": list(chain.from_iterable(chain.from_iterable(df.B)))})
1000 loops, best of 3: 818 µs per loop
In [68]: %timeit ((df['B'].apply(lambda x: pd.Series(x[0])).stack().reset_index(level=1, drop=True).to_frame('B').join(df[['A']], how='left')))
10 loops, best of 3: 103 ms per loop
I can't find a elegant way to handle this, but the following codes can work...
import pandas as pd
import numpy as np
df = pd.DataFrame([{"a":1,"b":[[1,2]]},{"a":4, "b":[[3,4,5]]}])
z = []
for k,row in df.iterrows():
for j in list(np.array(row.b).flat):
z.append({'a':row.a, 'b':j})
result = pd.DataFrame(z)
I think this is the fastest and simplest way:
df = pd.DataFrame({'A':['a','b'],
'B':[[['A1', 'A2']],[['A1', 'A2', 'A3']]]})
df.set_index('A')['B'].apply(lambda x: pd.Series(x[0]))
Here's another option
unpacked = (pd.melt(df.B.apply(pd.Series).reset_index(),id_vars='index')
.merge(df, left_on = 'index', right_index = True))
unpacked = (unpacked.loc[unpacked.value.notnull(),:]
.drop(columns=['index','variable','B'])
.rename(columns={'value':'B'})
Apply pd.series to column B --> splits each list entry to a different row
Melt this, so that each entry is a separate row (preserving index)
Merge this back on original dataframe
Tidy up - drop unnecessary columns and rename the values column
I have a list of pairs--stored in a DataFrame--each pair having an 'a' column and a 'b' column. For each pair I want to return the 'b's that have the same 'a'. For example, given the following set of pairs:
a b
0 c d
1 e f
2 c g
3 e h
4 i j
5 e k
I would like to end up with:
a b equivalents
0 c d [g]
1 e f [h, k]
2 c g [d]
3 e h [f, k]
4 i j []
5 e k [h, e]
I can do this with the following:
def equivalents(x):
l = pairs[pairs["a"] == x["a"]]["b"].tolist()
return l[1:] if l else l
pairs["equivalents"] = pairs.apply(equivalents, axis = 1)
But it is painfully slow on larger sets (e.g. 1 million plus pairs). Any suggestions how I could do this faster?
I think this ought to be a bit faster. First, just add them up.
df['equiv'] = df.groupby('a')['b'].transform(sum)
a b equiv
0 c d dg
1 e f fhk
2 c g dg
3 e h fhk
4 i j j
5 e k fhk
Now convert to a list and remove whichever letter is already in column 'b'.
df.apply( lambda x: [ y for y in list( x.equiv ) if y != x.b ], axis=1 )
0 [g]
1 [h, k]
2 [d]
3 [f, k]
4 []
5 [f, h]