Creating a new column in a data frame based on row values - python

I want to get the following result without using a for loop or df.apply().
The result for each row should be the row values up to and including the column whose label matches the group value.
   group  0  1  2  3  4  5  6  7
0      2  a  b  c  d  e  f  g  h
1      5  s  t  u  v  w  x  y  z
2      7  a  b  c  d  e  f  g  h

Expected result:

   group                    result
0      2                 [a, b, c]
1      5        [s, t, u, v, w, x]
2      7  [a, b, c, d, e, f, g, h]

Use DataFrame.melt, filter the group and variable columns with DataFrame.query, and finally aggregate lists:
s = (df.melt('group', ignore_index=False)
       .astype({'variable': int})
       .query("group >= variable")
       .groupby(level=0)['value']
       .agg(list))
df = df[['group']].join(s.rename('result'))
print(df)
   group                    result
0      2                 [a, b, c]
1      5        [s, t, u, v, w, x]
2      7  [a, b, c, d, e, f, g, h]
Or use apply:
df = (df.set_index('group')
        .rename(columns=int)
        .apply(lambda x: list(x[x.index <= x.name]), axis=1)
        .reset_index(name='result'))
print(df)
   group                    result
0      2                 [a, b, c]
1      5        [s, t, u, v, w, x]
2      7  [a, b, c, d, e, f, g, h]
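Since the question asks to avoid both loops and apply, a NumPy masking sketch is also possible, starting again from the original frame (this assumes the value columns are labelled 0 through 7 as above; the final list comprehension only collects the masked values into output lists):

import numpy as np

vals = df.drop(columns='group')
# Boolean mask: True where the column label is <= the row's group value
mask = vals.columns.astype(int).to_numpy() <= df['group'].to_numpy()[:, None]
df['result'] = [row[m].tolist() for row, m in zip(vals.to_numpy(), mask)]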


Pandas function isin error - 'str' object has no attribute 'isin' [duplicate]

I have a dataframe like this:
          aa bb  cc
0  [a, x, y]  a   1
1  [b, d, z]  b   2
2  [c, e, f]  s   3
3        NaN  d   4
I'm trying to create a new column like this:
          aa bb  cc     dd
0  [a, x, y]  a   1   True
1  [b, d, z]  b   2   True
2  [c, e, f]  s   3  False
3        NaN  d   4  False
My current solution is:
def some_function(row):
    if row['bb'].isin(row['aa']) == True:
        return True
    return False

df['dd'] = df.apply(lambda row: some_function(row), axis=1)
But this throws an error: ("'str' object has no attribute 'isin'", 'occurred at index 0').
I suspect I'm missing something about how the isin check works.
Essentially, I need to check if the str value of bb is in column aa which has a list in each cell.
Any ideas on how to do this?
You need the in operator to check membership in a list (isin is a Series method, and row['bb'] here is a plain string):
df['dd'] = df.apply(lambda x: x.bb in x.aa, axis=1)
print(df)
          aa bb  cc     dd
0  [a, x, y]  a   1   True
1  [b, d, z]  b   2   True
2  [c, e, f]  s   3  False
EDIT: To combine with another condition, e.g. cc == 1:
df['dd'] = df.apply(lambda x: (x.bb in x.aa) and (x.cc == 1), axis=1)
print(df)
          aa bb  cc     dd
0  [a, x, y]  a   1   True
1  [b, d, z]  b   2  False
2  [c, e, f]  s   3  False
Or:
df['dd'] = df.apply(lambda x: x.bb in x.aa, axis=1) & (df['cc'] == 1)
print(df)
          aa bb  cc     dd
0  [a, x, y]  a   1   True
1  [b, d, z]  b   2  False
2  [c, e, f]  s   3  False
EDIT: To handle NaN values in aa:
df['dd'] = df.apply(lambda x: x.bb in x.aa if isinstance(x.aa, list) else False, axis=1)
print(df)
          aa bb  cc     dd
0  [a, x, y]  a   1   True
1  [b, d, z]  b   2   True
2  [c, e, f]  s   3  False
3        NaN  d   4  False
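A loop-free sketch of the same check is a plain list comprehension over both columns:

# isinstance guards against the NaN row; `b in a` is the membership test
df['dd'] = [isinstance(a, list) and b in a for a, b in zip(df['aa'], df['bb'])]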

change value in pandas dataframe using iteration

I have a training data set of the following format:
print(data.head(5))
# Output
           0  1
0  a b c d e  1
1  a b c d e  1
2  a b c d e  1
3  a b c d e  1
4  a b c d e  1
It is a text classification task and I am trying to split the text "a b c d e" into a Python list. I tried iteration:
data  # the dataset
len_data = len(data)
for row_num in range(len_data):
    data.loc[row_num, 0] = data.loc[row_num, 0].split(" ")
However, this doesn't work and returns the error "Must have equal len keys and value when setting with an iterable". Could someone help me with this problem? Many thanks!
Use Series.str.split:
df[0] = df[0].str.split()
print(df)
# Output
                 0  1
0  [a, b, c, d, e]  1
1  [a, b, c, d, e]  1
2  [a, b, c, d, e]  1
3  [a, b, c, d, e]  1
4  [a, b, c, d, e]  1
Setup:
data = {0: {0: 'a b c d e', 1: 'a b c d e'}, 1: {0: 1, 1: 1}}
df = pd.DataFrame(data)
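For reference, the original loop fails because .loc interprets a list on the right-hand side as multiple values to align against the selection, not as a single cell value. A sketch using .at, which always addresses exactly one cell, works on the original string data (though str.split above is the idiomatic fix):

for row_num in range(len(df)):
    # .at assigns into a single cell, so the list is stored as-is
    df.at[row_num, 0] = df.at[row_num, 0].split(" ")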

Repeating other column values when using pandas.Series.explode()

I have a pandas dataframe of the form
           a  b
0     [a, b]  0
1  [c, d, e]  1
I have written a function to create a list of partial lists:
def partials(l):
    result = []
    for i, elem in enumerate(l):
        result.append(l[:i+1])
    return result
which, when applied to the series df['a'] and exploded, using df['a'].apply(partials).explode(), correctly gives:
0          [a]
0       [a, b]
1          [c]
1       [c, d]
1    [c, d, e]
However, this series is necessarily longer than the original. How can I apply this function to column a of my dataframe, such that column b repeats its value wherever the corresponding row of the original dataframe is exploded, like this?
            a  b
0         [a]  0
0      [a, b]  0
1         [c]  1
1      [c, d]  1
1  [c, d, e]  1
You can join back:
(df['a'].apply(partials)
        .explode().to_frame()
        .join(df.drop('a', axis=1))
)
Output:
            a  b
0         [a]  0
0      [a, b]  0
1         [c]  1
1      [c, d]  1
1  [c, d, e]  1
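An equivalent sketch with DataFrame.explode (pandas 0.25+): assign the partial lists back to column a first, and every other column repeats automatically:

# assign replaces column a with the lists of partials; explode then
# repeats b for each partial list produced from the same row
out = df.assign(a=df['a'].apply(partials)).explode('a')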

Counting each unique array of an array in each row of a column in a data frame

I am practicing pandas and Python and I am not so good at for loops. I have a data frame, df, as below:
Name  Value
A     [[A, B], [C, D]]
B     [[A, B], [D, E]]
C     [[D, E], [K, L], [M, L]]
D     [[K, L]]
I want to go through each row, find the unique inner arrays, and count them.
I have tried np.unique(a, return_index=True), but it returns two separate lists and my problem is that I don't know how to iterate over each array.
Expected result would be:
Value   Counts
[A, B]  2
[D, E]  2
[K, L]  2
[C, D]  1
[M, L]  1
Thank you very much.
Use DataFrame.explode in pandas 0.25+:
df.explode('Value')['Value'].value_counts()
Output:
[K, L]    2
[A, B]    2
[D, E]    2
[C, D]    1
[M, L]    1
Name: Value, dtype: int64
Use Series.explode with Series.value_counts:
df = df['Value'].explode().value_counts().rename_axis('Value').reset_index(name='Counts')
print(df)
    Value  Counts
0  [D, E]       2
1  [A, B]       2
2  [K, L]       2
3  [C, D]       1
4  [M, L]       1
NumPy solution:
a, v = np.unique(np.concatenate(df['Value']), axis=0, return_counts=True)
df = pd.DataFrame({'Value': a.tolist(), 'Counts': v})
print(df)
    Value  Counts
0  [A, B]       2
1  [C, D]       1
2  [D, E]       2
3  [K, L]       2
4  [M, L]       1
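A plain-Python sketch with collections.Counter is another option; since lists are unhashable, each inner list is converted to a tuple before counting:

from collections import Counter

# flatten the inner lists across all rows, counting each as a tuple
counts = Counter(tuple(pair) for row in df['Value'] for pair in row)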

How to combine multiple pandas data frames that are the same size into one cohesive database?

I am trying to build a database from several pandas data frames that are all the same size. I want to pair each cell in one data frame with the corresponding cell in the others, so that I get something like this:
Data frame 1:
   1  2  3  4
1  A  B  C  D
2  A  B  C  D
3  A  B  C  D
Data frame 2:
   1  2  3  4
1  A  D  G  J
2  B  E  H  K
3  C  F  I  L
Final data frame (something like this):
        1       2       3       4
1  [A, A]  [B, D]  [C, G]  [D, J]
2  [A, B]  [B, E]  [C, H]  [D, K]
3  [A, C]  [B, F]  [C, I]  [D, L]
Check with:
pd.concat([df1, df2], axis=1).groupby(level=0, axis=1).agg(lambda x: x.values.tolist())
Out[529]:
        1       2       3       4
1  [A, A]  [B, D]  [C, G]  [D, J]
2  [A, B]  [B, E]  [C, H]  [D, K]
3  [A, C]  [B, F]  [C, I]  [D, L]
If I were you, I would look at a MultiIndex rather than list-type columns:
pd.concat([df1, df2], axis=1, keys=[1, 2])
Out[530]:
   1           2
   1  2  3  4  1  2  3  4
1  A  B  C  D  A  D  G  J
2  A  B  C  D  B  E  H  K
3  A  B  C  D  C  F  I  L
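With the keys in place, either source frame can still be recovered by selecting the first column level:

combined = pd.concat([df1, df2], axis=1, keys=[1, 2])
combined[1]   # the block of columns that came from df1
combined[2]   # the block that came from df2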
You could add the values from both dataframes with a separator in between, then iterate over the columns using str.split to create lists. Finally, use pd.concat to turn the result into a dataframe:
df = df1 + ',' + df2
pd.concat([df[col].str.split(',') for col in df.columns], axis=1)
        1       2       3       4
1  [A, A]  [B, D]  [C, G]  [D, J]
2  [A, B]  [B, E]  [C, H]  [D, K]
3  [A, C]  [B, F]  [C, I]  [D, L]
Or using DataFrame.combine:
df1.combine(df2, func=lambda a, b: (a + ',' + b).str.split(','))
        1       2       3       4
1  [A, A]  [B, D]  [C, G]  [D, J]
2  [A, B]  [B, E]  [C, H]  [D, K]
3  [A, C]  [B, F]  [C, I]  [D, L]
Not very efficient, but you can convert all cells to one-element lists and then simply add the dataframes, like this:
import pandas as pd
import numpy as np

np.random.seed(42)
df1 = pd.DataFrame([list('ABCD') for _ in range(5)])
df2 = pd.DataFrame([np.random.choice(list('ABCD'), size=4).tolist() for _ in range(5)])

df1 = df1.apply(lambda x: [[v] for v in x])
df2 = df2.apply(lambda x: [[v] for v in x])
df1 + df2
which outputs
        0       1       2       3
0  [A, C]  [B, D]  [C, A]  [D, C]
1  [A, C]  [B, D]  [C, A]  [D, A]
2  [A, C]  [B, B]  [C, C]  [D, C]
3  [A, C]  [B, C]  [C, D]  [D, A]
4  [A, D]  [B, D]  [C, D]  [D, C]
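A sketch that generalizes to any number of same-shaped frames, stacking the cell values with NumPy instead of concatenating lists (this assumes scalar-valued frames, i.e. df1/df2 as first constructed above, before the list conversion):

import numpy as np
import pandas as pd

frames = [df1, df2]  # any number of aligned, same-shaped frames
# stack the cell values along a new last axis: shape (rows, cols, n)
stacked = np.stack([f.to_numpy() for f in frames], axis=-1)
result = pd.DataFrame(stacked.tolist(), index=frames[0].index, columns=frames[0].columns)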
