Pandas function isin error - 'str' object has no attribute 'isin' [duplicate] - python

I have a dataframe like this:
aa bb cc
[a, x, y] a 1
[b, d, z] b 2
[c, e, f] s 3
np.nan d 4
I'm trying to create a new column like this:
aa bb cc dd
[a, x, y] a 1 True
[b, d, z] b 2 True
[c, e, f] s 3 False
np.nan d 4 False
My current solution is:
def some_function(row):
    if row['bb'].isin(row['aa']) == True:
        return True
    return False

df['dd'] = df.apply(lambda row: some_function(row), axis=1)
But this throws an error: ("'str' object has no attribute 'isin'", 'occurred at index 0')
I suspect I'm missing something about how isin works here.
Essentially, I need to check if the str value of bb is in column aa which has a list in each cell.
Any ideas on how to do this?

You need the in operator to check membership in a list; with apply(axis=1), each row['bb'] is a plain str, which has no isin method:
df['dd'] = df.apply(lambda x: x.bb in x.aa, axis=1)
print (df)
aa bb cc dd
0 [a, x, y] a 1 True
1 [b, d, z] b 2 True
2 [c, e, f] s 3 False
EDIT:
df['dd'] = df.apply(lambda x: (x.bb in x.aa) and (x.cc == 1), axis=1)
print (df)
aa bb cc dd
0 [a, x, y] a 1 True
1 [b, d, z] b 2 False
2 [c, e, f] s 3 False
Or:
df['dd'] = df.apply(lambda x: x.bb in x.aa, axis=1) & (df['cc'] == 1)
print (df)
aa bb cc dd
0 [a, x, y] a 1 True
1 [b, d, z] b 2 False
2 [c, e, f] s 3 False
EDIT:
df['dd'] = df.apply(lambda x: x.bb in x.aa if isinstance(x.aa, list) else False, axis=1)
print (df)
aa bb cc dd
0 [a, x, y] a 1 True
1 [b, d, z] b 2 True
2 [c, e, f] s 3 False
3 NaN d 4 False
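As a faster alternative to apply, a plain list comprehension over the zipped columns does the same membership check in one pass; a sketch, assuming the frame from the question (with np.nan in the last aa cell):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'aa': [['a', 'x', 'y'], ['b', 'd', 'z'], ['c', 'e', 'f'], np.nan],
                   'bb': ['a', 'b', 's', 'd'],
                   'cc': [1, 2, 3, 4]})

# zip the two columns and test membership, guarding against the NaN
# cell, which is a float rather than a list
df['dd'] = [b in a if isinstance(a, list) else False
            for a, b in zip(df['aa'], df['bb'])]
print(df)
```

This avoids the per-row Python function call overhead of apply while still handling the non-list cell.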

Related

group by two columns and get unique column values in a list

I'm trying to get the unique values of column prod per name1 and name2 into a further column all_prods. If I just group by one column (for example name1), I know how to do it, but how does it work if I want to group by two columns (name1 and name2)?
df = pd.DataFrame({'name1':['x','x','x','z','z'],'name2':['x','x','x','z','z'],'prod':['c','c','f','f','f']})
df.assign(**{'all_prods': df.name1.map(df.groupby('name1')['prod'].unique())})
Doing transform:
df['all_prods'] = df.groupby(['name1','name2'])['prod'].transform(lambda x: len(x) * [x.unique()])
df
Out[6]:
name1 name2 prod all_prods
0 x x c [c, f]
1 x x c [c, f]
2 x x f [c, f]
3 z z f [f]
4 z z f [f]
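The len(x) * [x.unique()] part works because transform must return one value per row of each group, so the unique array is wrapped in a one-element list and repeated once per row. A minimal sketch with the setup frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'name1': ['x', 'x', 'x', 'z', 'z'],
                   'name2': ['x', 'x', 'x', 'z', 'z'],
                   'prod': ['c', 'c', 'f', 'f', 'f']})

# transform needs a result aligned with each group, so the unique
# array is repeated len(x) times: every row gets the same array back
out = df.groupby(['name1', 'name2'])['prod'].transform(lambda x: len(x) * [x.unique()])
print(out)
```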
Variant of your original approach:
df.merge(df.groupby(['name1', 'name2'], as_index=False)
           .agg(all_prods=('prod', lambda x: list(dict.fromkeys(x)))))
output:
name1 name2 prod all_prods
0 x x c [c, f]
1 x x c [c, f]
2 x x f [c, f]
3 z z f [f]
4 z z f [f]
another way to do it:
df.merge(df.groupby(['name1', 'name2'])['prod'].apply(lambda x: x.unique()).reset_index(name='all_prods'), on=['name1', 'name2'], how='left')
#
name1 name2 prod all_prods
0 x x c [c, f]
1 x x c [c, f]
2 x x f [c, f]
3 z z f [f]
4 z z f [f]

Creating a new column in a data frame based on row values

I want to be able to get the following result without using a for loop or df.apply()
The result for each row should be the row values up until the group index.
group 0 1 2 3 4 5 6 7
0 2 a b c d e f g h
1 5 s t u v w x y z
2 7 a b c d e f g h
group result
0 2 [a, b, c]
1 5 [s, t, u, v, w, x]
2 7 [a, b, c, d, e, f, g, h]
Use DataFrame.melt, filter the group and variable columns with DataFrame.query, and last aggregate lists:
s = (df.melt('group', ignore_index=False)
       .astype({'variable': int})
       .query("group >= variable")
       .groupby(level=0)['value']
       .agg(list))
df = df[['group']].join(s.rename('result'))
print (df)
group result
0 2 [a, b, c]
1 5 [s, t, u, v, w, x]
2 7 [a, b, c, d, e, f, g, h]
Or use apply:
df = (df.set_index('group')
        .rename(columns=int)
        .apply(lambda x: list(x[x.index <= x.name]), axis=1)
        .reset_index(name='result'))
print (df)
group result
0 2 [a, b, c]
1 5 [s, t, u, v, w, x]
2 7 [a, b, c, d, e, f, g, h]
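Setup (an assumed reconstruction of the frame above; note the value columns are the strings '0'–'7', which is why the answers cast them with astype({'variable': int}) / rename(columns=int)):

```python
import pandas as pd

# columns '0'..'7' are strings here, matching the casts in the answers
df = pd.DataFrame([list('abcdefgh'),
                   list('stuvwxyz'),
                   list('abcdefgh')],
                  columns=[str(i) for i in range(8)])
df.insert(0, 'group', [2, 5, 7])
```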

change value in pandas dataframe using iteration

I have a training data set of the following format:
print(data.head(5))
#Output
0 1
0 a b c d e 1
1 a b c d e 1
2 a b c d e 1
3 a b c d e 1
4 a b c d e 1
It is a text classification task and I am trying to split the text "a b c d e" into a Python list. I tried iteration:
data #the dataset
len_data = len(data)
for row_num in range(len_data):
    data.loc[row_num, 0] = data.loc[row_num, 0].split(" ")
However, this doesn't work and returns the error Must have equal len keys and value when setting with an iterable. Could someone help me with this problem? Many thanks!
Use str.split:
df[0] = df[0].str.split()
print(df)
# Output
0 1
0 [a, b, c, d, e] 1
1 [a, b, c, d, e] 1
2 [a, b, c, d, e] 1
3 [a, b, c, d, e] 1
4 [a, b, c, d, e] 1
Setup:
data = {0: {0: 'a b c d e', 1: 'a b c d e'}, 1: {0: 1, 1: 1}}
df = pd.DataFrame(data)
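For completeness: the loop fails because .loc interprets a list on the right-hand side as an element-wise assignment. If you ever do need to place a list into a single cell, scalar access with .at works; a sketch (the vectorized str.split above is still the better fix):

```python
import pandas as pd

df = pd.DataFrame({0: ['a b c d e', 'a b c d e'], 1: [1, 1]})

# .at addresses one cell as a scalar slot, so a list can be stored in
# an object-dtype column without triggering element-wise alignment
for row_num in range(len(df)):
    df.at[row_num, 0] = df.at[row_num, 0].split(' ')
print(df)
```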

Repeating other column values when using pandas.Series.explode()

I have a pandas dataframe of the form
a b
0 [a, b] 0
1 [c, d, e] 1
I have written a function to create a list of partial lists:
def partials(l):
    result = []
    for i, elem in enumerate(l):
        result.append(l[:i+1])
    return result
which, when applied to the series df['a'] and exploded, using df['a'].apply(partials).explode(), correctly gives:
0 [a]
0 [a, b]
1 [c]
1 [c, d]
1 [c, d, e]
However, this series is necessarily longer than the original. How can I apply this function in place to column a of my dataframe, such that column b repeats its value wherever the corresponding row of the original dataframe is 'exploded', like this:
a b
0 [a] 0
0 [a, b] 0
1 [c] 1
1 [c, d] 1
1 [c, d, e] 1
?
You can join back:
(df['a'].apply(partials)
        .explode().to_frame()
        .join(df.drop('a', axis=1))
)
Output:
a b
0 [a] 0
0 [a, b] 0
1 [c] 1
1 [c, d] 1
1 [c, d, e] 1
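Since pandas 0.25, DataFrame.explode does this repetition for you: explode the list column on the whole frame and the other columns are carried along automatically. A sketch with the example data:

```python
import pandas as pd

def partials(l):
    # all prefixes of l: [l[:1], l[:2], ..., l]
    return [l[:i + 1] for i in range(len(l))]

df = pd.DataFrame({'a': [['a', 'b'], ['c', 'd', 'e']], 'b': [0, 1]})

# replace 'a' with its list of prefixes, then explode the whole frame;
# 'b' is repeated for each exploded row automatically
out = df.assign(a=df['a'].apply(partials)).explode('a')
print(out)
```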

How to combine multiple pandas data frames that are the same size into one cohesive database?

I am trying to build a database from various pandas data frames that are the same size. I want to pair each cell of one data frame with the corresponding cell of the others.
Such that I get something like this:
Data frame 1:
1 2 3 4
1 A B C D
2 A B C D
3 A B C D
Data frame 2:
1 2 3 4
1 A D G J
2 B E H K
3 C F I L
Final Data Frame (something like this)
1 2 3 4
1 [A,A] [B,D] [C,G] [D,J]
2 [A,B] [B,E] [C,H] [D,K]
3 [A,C] [B,F] [C,I] [D,L]
Check with groupby along the columns:
pd.concat([df1, df2], axis=1).groupby(level=0, axis=1).agg(lambda x: x.values.tolist())
Out[529]:
1 2 3 4
1 [A, A] [B, D] [C, G] [D, J]
2 [A, B] [B, E] [C, H] [D, K]
3 [A, C] [B, F] [C, I] [D, L]
If I were you, I would look at a MultiIndex rather than list-type columns:
pd.concat([df1, df2], axis=1, keys=[1, 2])
Out[530]:
1 2
1 2 3 4 1 2 3 4
1 A B C D A D G J
2 A B C D B E H K
3 A B C D C F I L
You could add the values from both dataframes, plus a separator in between, and iterate over the columns using str.split to create lists. Finally use pd.concat to turn the result into a dataframe:
df = df1 + ',' + df2
pd.concat([df[col].str.split(',') for col in df.columns], axis=1)
1 2 3 4
1 [A, A] [B, D] [C, G] [D, J]
2 [A, B] [B, E] [C, H] [D, K]
3 [A, C] [B, F] [C, I] [D, L]
Or using DataFrame.combine:
df1.combine(df2, func=lambda a,b: (a + ',' + b).str.split(','))
1 2 3 4
1 [A, A] [B, D] [C, G] [D, J]
2 [A, B] [B, E] [C, H] [D, K]
3 [A, C] [B, F] [C, I] [D, L]
Not very efficient, but you can convert all cells to lists and then simply add the dataframes, like this:
import pandas as pd
import numpy as np
np.random.seed(42)
df1 = pd.DataFrame(
[list('ABCD') for _ in range(5)]
)
df2 = pd.DataFrame(
[np.random.choice(list('ABCD'), size=4).tolist() for _ in range(5)]
)
df1 = df1.apply(lambda x: [[v] for v in x])
df2 = df2.apply(lambda x: [[v] for v in x])
df1 + df2
which outputs
0 1 2 3
0 [A, C] [B, D] [C, A] [D, C]
1 [A, C] [B, D] [C, A] [D, A]
2 [A, C] [B, B] [C, C] [D, C]
3 [A, C] [B, C] [C, D] [D, A]
4 [A, D] [B, D] [C, D] [D, C]
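One caveat with the separator-based answers: they break if the cell values themselves contain commas. Zipping the two frames column by column avoids the string round-trip entirely; a sketch, assuming the labels from the first example:

```python
import pandas as pd

df1 = pd.DataFrame([list('ABCD')] * 3, columns=[1, 2, 3, 4], index=[1, 2, 3])
df2 = pd.DataFrame([list('ADGJ'), list('BEHK'), list('CFIL')],
                   columns=[1, 2, 3, 4], index=[1, 2, 3])

# pair corresponding cells column by column, no string concatenation
combined = pd.DataFrame({c: [list(t) for t in zip(df1[c], df2[c])]
                         for c in df1.columns}, index=df1.index)
print(combined)
```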
