Combine two columns, ignore empty cells and add a separator - python

I need to combine columns 1 and 2 into a new column 3 with a ", " separator, but ignore empty cells.
So I have this dataframe:
         1     2
0  A, B, B     D
1           C, D
2  B, B, C  D, A
And I need to create column 3 (desired output):
         1     2              3
0  A, B, B     D     A, B, B, D
1           C, D           C, D
2  B, B, C  D, A  B, B, C, D, A
So as you can see, the empty cell was ignored, and ", " separates the elements in df["3"] (B, B, C, D, A).
I tried to do this with simple concatenation, but didn't succeed.
If I simply concatenate df["1"] + df["2"], the last element of the first column fuses with the first element of the second column (BD, CD):
         1     2            3
0  A, B, B     D     A, B, BD
1           C, D         C, D
2  B, B, C  D, A  B, B, CD, A
If I add ", " (df["1"] + ", " + df["2"]):
         1     2              3
0  A, B, B     D     A, B, B, D
1           C, D         , C, D
2  B, B, C  D, A  B, B, C, D, A
You can see that each empty cell is replaced with ", " and added to df["3"] (for example , C, D, but I need C, D).
Code to reproduce:
import pandas as pd
df = pd.DataFrame({"1": ["A, B, B", "", "B, B, C"], "2": ["D", "C, D", "D, A"]})
print(df)

Use str.strip to remove a possible leading or trailing ", ":
(df["1"] + ", " + df["2"]).str.strip(', ')

How about multiple columns with empty cells in the middle?
A|B|C|D|E|F
1| |3| |5|6
This should produce:
A|B|C|D|E|F|Full
1| |3| |5|6|1$3$5$6
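One possible approach for this follow-up (a sketch, not from the original answers; the column layout is assumed from the example above): join the non-empty cells of each row with the "$" separator.

import pandas as pd

df = pd.DataFrame({"A": ["1"], "B": [""], "C": ["3"],
                   "D": [""], "E": ["5"], "F": ["6"]})
# Row-wise join that skips empty cells.
df["Full"] = df.apply(lambda row: "$".join(v for v in row if v != ""), axis=1)
print(df)
#    A B  C D  E  F     Full
# 0  1    3    5  6  1$3$5$6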

Related

Get the frequency of all combinations in Pandas

I am trying to get the purchase frequency of all combinations of products.
Suppose my transactions are the following:
userid  product
u1      A
u1      B
u1      C
u2      A
u2      C
So the solution should be:
combination  count_of_distinct_users
A            2
B            1
C            2
A, B         1
A, C         2
B, C         1
A, B, C      1
i.e. 2 users have purchased product A, one user has purchased product B, ..., 2 users have purchased products A and C, ...
Define a function combine to generate all combinations:
from itertools import combinations

def combine(s):
    result = []
    for i in range(1, len(s) + 1):
        for c in combinations(s, i):
            result += [c]
    return result
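For example:

combine(['A', 'B', 'C'])
# [('A',), ('B',), ('C',), ('A', 'B'), ('A', 'C'), ('B', 'C'), ('A', 'B', 'C')]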
This will give all combinations in a column:
df.groupby('userid')['product'].apply(combine)
# Out:
# userid
# u1    [(A,), (B,), (C,), (A, B), (A, C), (B, C), (A,...
# u2    [(A,), (C,), (A, C)]
# Name: product, dtype: object
Now use explode():
(df.groupby('userid')['product'].apply(combine)
   .reset_index(name='product_combos')
   .explode('product_combos')
   .groupby('product_combos')
   .size().reset_index(name='user_count'))
# Out:
# product_combos user_count
# 0 (A,) 2
# 1 (A, B) 1
# 2 (A, B, C) 1
# 3 (A, C) 2
# 4 (B,) 1
# 5 (B, C) 1
# 6 (C,) 2
Be careful with the combinations, because the list gets large with many different products!
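For a sense of scale: n distinct products yield 2**n - 1 non-empty combinations, so the exploded frame grows exponentially:

for n in (5, 10, 20, 30):
    print(n, 2**n - 1)
# 5 31
# 10 1023
# 20 1048575
# 30 1073741823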
My simple trick here is to convert the df to a dict mapping each product to its list of users, like {'A': ['u1', 'u2'], 'B': ['u1']}, then build every combination of products and, for each combination, intersect the user lists involved and count the distinct users in the intersection. For example, A: ['u1', 'u2'] and B: ['u1'] intersect to {'u1'}, so the final count is 1. (Taking the min of the list lengths gives the same numbers on this example, but the intersection is what is correct in general.)
Code:
from more_itertools import powerset
d = df.groupby('product')['userid'].apply(list).to_dict()
## Output: {'A': ['u1', 'u2'], 'B': ['u1'], 'C': ['u1', 'u2']}
new = pd.DataFrame([', '.join(i) for i in list(powerset(d.keys()))[1:]],
                   columns=['combination'])
## Output: ['A', 'B', 'C', 'A, B', 'A, C', 'B, C', 'A, B, C']
new['count'] = new['combination'].apply(
    lambda x: len(set.intersection(*(set(d[y]) for y in x.split(', ')))))
new
Output:
   combination  count
0            A      2
1            B      1
2            C      2
3         A, B      1
4         A, C      2
5         B, C      1
6      A, B, C      1
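If more_itertools isn't installed, powerset can be rebuilt from the standard library with the classic itertools recipe:

from itertools import chain, combinations

def powerset(iterable):
    # powerset(['A', 'B']) -> (), ('A',), ('B',), ('A', 'B')
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))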

Check for pairs not having same values in pandas dataframe

I have a dataset in the form of a pandas dataframe (reproduced as tdf in the first answer below).
In this dataset, I want to find those pairs of names whose two values are not the same. It should also work for non-square matrices. Example:
A to B is 4. So, B to A must be 4. But B to A is 8.
A to C is 5. So, C to A must be 5. OK.
A to D is 8. So, D to A must be 8. But D to A is 5.
B to C is 6. So, C to B must be 6. But C to B is 3.
and so on...
So, I want output as:
(A,B,4) and (B,A,8)
(A,D,8) and (D,A,5)
(B,C,6) and (C,B,3)
Don't print pairs where the values are the same.
I am trying to do it using numpy arrays and dictionaries, but can't figure out the exact logic.
Here's an approach using Pandas
import pandas as pd

data = {'A': [0, 8, 5, 5, 1], 'B': [4, 0, 3, 7, 2], 'C': [5, 6, 0, 4, 3],
        'D': [8, 9, 2, 0, 4], 'F': [7, 5, 6, 2, 1]}
tdf = pd.DataFrame(data, index=['A', 'B', 'C', 'D', 'E'])

for idx in tdf.index:
    if idx in tdf.columns:
        for col in tdf.columns:
            if col in tdf.index and col != idx and tdf[idx][col] != tdf[col][idx]:
                print(f'({idx}, {col}, {tdf[idx][col]}) and ({col}, {idx}, {tdf[col][idx]})')
tdf:
   A  B  C  D  F
A  0  4  5  8  7
B  8  0  6  9  5
C  5  3  0  2  6
D  5  7  4  0  2
E  1  2  3  4  1
and output is of the form:
(A, B, 8) and (B, A, 4)
(A, D, 5) and (D, A, 8)
(B, A, 4) and (A, B, 8)
(B, C, 3) and (C, B, 6)
(B, D, 7) and (D, B, 9)
(C, B, 6) and (B, C, 3)
(C, D, 4) and (D, C, 2)
(D, A, 8) and (A, D, 5)
(D, B, 9) and (B, D, 7)
(D, C, 2) and (C, D, 4)
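Note that this loop reports every mismatched pair twice, once in each direction. A hedged tweak (not part of the original answer): visiting each unordered pair only once also restores the row-to-column orientation the question asks for.

for r in tdf.index:
    for c in tdf.columns:
        # visit each unordered pair once, and only where both labels
        # exist as a row and as a column
        if r < c and c in tdf.index and r in tdf.columns and tdf.loc[r, c] != tdf.loc[c, r]:
            print(f'({r}, {c}, {tdf.loc[r, c]}) and ({c}, {r}, {tdf.loc[c, r]})')
# (A, B, 4) and (B, A, 8)
# (A, D, 8) and (D, A, 5)
# (B, C, 6) and (C, B, 3)
# (B, D, 9) and (D, B, 7)
# (C, D, 2) and (D, C, 4)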
Making a dataframe from what @itprorh66 has provided:
df = pd.DataFrame({'A': [0, 8, 5, 5, 1], 'B': [4, 0, 3, 7, 2], 'C': [5, 6, 0, 4, 3],
                   'D': [8, 9, 2, 0, 4], 'F': [7, 5, 6, 2, 1]},
                  index=['A', 'B', 'C', 'D', 'E'])
Intersect the index with the columns and create a square DataFrame:
cmg = df.index.intersection(df.columns)
df = df[cmg].loc[cmg]
We can use the numpy upper and lower triangle functions, and pull out the indices for the upper triangle:
import numpy as np

mat = df.to_numpy()
tr = np.triu_indices(len(cmg), k=1)
Then put everything into a dataframe. The joining of the row names and column names is a bit unsightly, but that's the best I can do for now:
match_tri = pd.DataFrame({'i1': df.index[tr[0]] + ',' + df.columns[tr[1]],
                          'v1': mat[tr],
                          'i2': df.index[tr[1]] + ',' + df.columns[tr[0]],
                          'v2': mat.T[tr]})
Then we just subset based on the values:
match_tri[match_tri.v1 != match_tri.v2]
    i1  v1   i2  v2
0  A,B   4  B,A   8
2  A,D   8  D,A   5
3  B,C   6  C,B   3
4  B,D   9  D,B   7
5  C,D   2  D,C   4

Arrange sequences of entries in pairs in a dataframe

Given a table of the form:
ID  Sequence
1   A,C,D,E,F,G
2   D,F,G,B
3   A,B,A,C
and so on
Now I wish to arrange this data so that it can be fed into an RNN in a sequential manner, so that I'm able to predict the next entry in each sequence. So here's what's required (in a new dataframe), in the form of all possible sequences:
X Y
A,C,D E
C,D,E F
D,E,F G
D,F,G B
A,B,A C
X could be of length 3 or any custom length. How should I go about it?
Here's another way, using str.split and applying pd.Series to the sublists:
In [623]: df.Sequence.str.split(',')\
...: .apply(lambda x: pd.Series([x[i : i + 3], x[i + 3]] for i in range(0, len(x)- 3))).stack()\
...: .apply(lambda x: pd.Series([x[0], x[1]]))\
...: .reset_index(drop=True)
Out[623]:
0 1
0 [A, C, D] E
1 [C, D, E] F
2 [D, E, F] G
3 [D, F, G] B
4 [A, B, A] C
Setting the columns is as simple as df.columns = ['X', 'Y'].
This will do the job:
vals = [l.split(',') for l in df.Sequence.values]
X, Y = zip(*sum([[[','.join(el[i:i+3]), el[i+3]] for i in range(len(el)-3)] for el in vals], []))
res = pd.DataFrame({'X': X, 'Y': Y})
Then res is
X Y
0 A,C,D E
1 C,D,E F
2 D,E,F G
3 D,F,G B
4 A,B,A C
Here's one of the (many) ways of doing it.
In [52]: vals = df.Sequence.str.split(',')
In [53]: seqs = []
In [54]: for val in vals:
    ...:     seqs += [{'X': val[i:i+3], 'Y': val[i+3]} for i in range(len(val)-3)]
    ...:
In [55]: pd.DataFrame(seqs)
In [55]: pd.DataFrame(seqs)
Out[55]:
X Y
0 [A, C, D] E
1 [C, D, E] F
2 [D, E, F] G
3 [D, F, G] B
4 [A, B, A] C
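To make the window length configurable (the question allows X to be "of length 3 or any custom length"), the same idea fits in a small helper; make_xy is a name introduced here, and window=3 reproduces the outputs above:

import pandas as pd

def make_xy(df, window=3):
    # Each window of `window` entries predicts the entry that follows it.
    rows = []
    for seq in df['Sequence'].str.split(','):
        for i in range(len(seq) - window):
            rows.append({'X': ','.join(seq[i:i + window]), 'Y': seq[i + window]})
    return pd.DataFrame(rows)

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Sequence': ['A,C,D,E,F,G', 'D,F,G,B', 'A,B,A,C']})
print(make_xy(df, window=3))
#        X  Y
# 0  A,C,D  E
# 1  C,D,E  F
# 2  D,E,F  G
# 3  D,F,G  B
# 4  A,B,A  C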

How to convert column with list of values into rows in Pandas DataFrame

Hi, I have a dataframe like this:
A B
0: some value [[L1, L2]]
I want to change it into:
A B
0: some value L1
1: some value L2
How can I do that?
Pandas >= 0.25
df1 = pd.DataFrame({'A': ['a', 'b'],
                    'B': [[['1', '2']], [['3', '4', '5']]]})
print(df1)
A B
0 a [[1, 2]]
1 b [[3, 4, 5]]
df1 = df1.explode('B')  # first explode unwraps the outer list
df1.explode('B')        # second explode splits the inner lists into rows
A B
0 a 1
0 a 2
1 b 3
1 b 4
1 b 5
I don't know how good this approach is, but it works when you have a list of items.
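explode keeps the original row index, hence the repeated 0s and 1s above. A minimal end-to-end version with a fresh RangeIndex:

df1 = pd.DataFrame({'A': ['a', 'b'],
                    'B': [[['1', '2']], [['3', '4', '5']]]})
out = df1.explode('B').explode('B').reset_index(drop=True)
print(out)
#    A  B
# 0  a  1
# 1  a  2
# 2  b  3
# 3  b  4
# 4  b  5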
You can do it this way:
In [84]: df
Out[84]:
A B
0 some value [[L1, L2]]
1 another value [[L3, L4, L5]]
In [85]: (df['B'].apply(lambda x: pd.Series(x[0]))
....: .stack()
....: .reset_index(level=1, drop=True)
....: .to_frame('B')
....: .join(df[['A']], how='left')
....: )
Out[85]:
B A
0 L1 some value
0 L2 some value
1 L3 another value
1 L4 another value
1 L5 another value
Faster solution with chain.from_iterable and numpy.repeat:
from itertools import chain
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b'],
                   'B': [[['A1', 'A2']], [['A1', 'A2', 'A3']]]})
print (df)
A B
0 a [[A1, A2]]
1 b [[A1, A2, A3]]
df1 = pd.DataFrame({ "A": np.repeat(df.A.values,
[len(x) for x in (chain.from_iterable(df.B))]),
"B": list(chain.from_iterable(chain.from_iterable(df.B)))})
print (df1)
A B
0 a A1
1 a A2
2 b A1
3 b A2
4 b A3
Timings:
import random
import string

A = np.unique(np.random.randint(0, 1000, 1000))
B = [[list(string.ascii_letters[:random.randint(3, 10)])] for _ in range(len(A))]
df = pd.DataFrame({"A": A, "B": B})
print (df)
A B
0 0 [[a, b, c, d, e, f, g, h]]
1 1 [[a, b, c]]
2 3 [[a, b, c, d, e, f, g, h, i]]
3 5 [[a, b, c, d, e]]
4 6 [[a, b, c, d, e, f, g, h, i]]
5 7 [[a, b, c, d, e, f, g]]
6 8 [[a, b, c, d, e, f]]
7 10 [[a, b, c, d, e, f]]
8 11 [[a, b, c, d, e, f, g]]
9 12 [[a, b, c, d, e, f, g, h, i]]
10 13 [[a, b, c, d, e, f, g, h]]
...
...
In [67]: %timeit pd.DataFrame({ "A": np.repeat(df.A.values, [len(x) for x in (chain.from_iterable(df.B))]),"B": list(chain.from_iterable(chain.from_iterable(df.B)))})
1000 loops, best of 3: 818 µs per loop
In [68]: %timeit ((df['B'].apply(lambda x: pd.Series(x[0])).stack().reset_index(level=1, drop=True).to_frame('B').join(df[['A']], how='left')))
10 loops, best of 3: 103 ms per loop
I can't find an elegant way to handle this, but the following code works...
import pandas as pd
import numpy as np

df = pd.DataFrame([{"a": 1, "b": [[1, 2]]}, {"a": 4, "b": [[3, 4, 5]]}])
z = []
for k, row in df.iterrows():
    for j in list(np.array(row.b).flat):
        z.append({'a': row.a, 'b': j})
result = pd.DataFrame(z)
I think this is the fastest and simplest way:
df = pd.DataFrame({'A': ['a', 'b'],
                   'B': [[['A1', 'A2']], [['A1', 'A2', 'A3']]]})
df.set_index('A')['B'].apply(lambda x: pd.Series(x[0]))
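Note the result is wide, with one column per list element and NaN padding for shorter lists; a hedged follow-up to reach the long shape the question asks for:

(df.set_index('A')['B']
   .apply(lambda x: pd.Series(x[0]))  # wide: one column per element
   .stack()                           # long form; the NaN padding is dropped
   .droplevel(1)
   .rename('B')
   .reset_index())
#    A   B
# 0  a  A1
# 1  a  A2
# 2  b  A1
# 3  b  A2
# 4  b  A3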
Here's another option
unpacked = (pd.melt(df.B.apply(pd.Series).reset_index(), id_vars='index')
            .merge(df, left_on='index', right_index=True))
unpacked = (unpacked.loc[unpacked.value.notnull(), :]
            .drop(columns=['index', 'variable', 'B'])
            .rename(columns={'value': 'B'}))
1. Apply pd.Series to column B; this splits each list entry into a separate column
2. Melt, so that each entry becomes a separate row (preserving the index)
3. Merge back onto the original dataframe
4. Tidy up: drop the unnecessary columns and rename the values column

Using apply on a column

I have a dataframe like this one.
A B C D E
0 a b c d e
1 f g h i j
2 k l m n o
3 p q r s t
What I'd like is to get a dataframe with each column as a list.
0
0 [a, f, k, p]
1 [b, g, l, q]
2 [c, h, m, r]
3 [d, i, n, s]
4 [e, j, o, t]
I'd like to somehow apply a function to each column, converting it to a list and placing it in a new DataFrame. However, apply only operates on individual entries.
df2 = pd.DataFrame(df.transpose().apply(list, axis=1)).reset_index(drop=True)
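For completeness, a minimal reproduction with the sample frame from the question:

import pandas as pd

df = pd.DataFrame([list('abcde'), list('fghij'),
                   list('klmno'), list('pqrst')],
                  columns=list('ABCDE'))
df2 = pd.DataFrame(df.transpose().apply(list, axis=1)).reset_index(drop=True)
print(df2)
#               0
# 0  [a, f, k, p]
# 1  [b, g, l, q]
# 2  [c, h, m, r]
# 3  [d, i, n, s]
# 4  [e, j, o, t]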
