I have a pandas data frame with two columns where each row holds a comma-separated list of words. How can I check whether there are word match(es) between these two columns on a given row? (The flag column is the desired output.)
A             B            flag
hello,hi,bye  bye, also    1
but, as well  see, pandas  0
I have tried
df['A'].str.contains(df['B'])
but I got this error
TypeError: 'Series' objects are mutable, thus they cannot be hashed
You can split each value into separate words with split, build sets, and check the intersection with &. Then convert the values to boolean - empty sets become False - and finally to int - False becomes 0 and True becomes 1. (The error above occurs because str.contains expects a scalar pattern, not another Series.)
zipped = zip(df['A'], df['B'])
df['flag'] = [int(bool(set(a.split(',')) & set(b.split(',')))) for a, b in zipped]
print (df)
A B flag
0 hello,hi,bye bye,also 1
1 but,as well see,pandas 0
Similar solution (zip is recreated here, because the iterator above is already exhausted):
df['flag'] = np.array([set(a.split(',')) & set(b.split(',')) for a, b in zip(df['A'], df['B'])]).astype(bool).astype(int)
print (df)
A B flag
0 hello,hi,bye bye, also 1
1 but,as well see, pandas 0
EDIT: There may be some whitespace around the commas, so add map with str.strip, and also remove empty strings with filter:
df = pd.DataFrame({'A': ['hello,hi,bye', 'but,,,as well'],
'B': ['bye ,,, also', 'see,,,pandas']})
print (df)
A B
0 hello,hi,bye bye ,,, also
1 but,,,as well see,,,pandas
zipped = zip(df['A'], df['B'])
def setify(x):
    return set(map(str.strip, filter(None, x.split(','))))
df['flag'] = [int(bool(setify(a) & setify(b))) for a, b in zipped]
print (df)
A B flag
0 hello,hi,bye bye ,,, also 1
1 but,,,as well see,,,pandas 0
Related
I am trying to replace groups from a group object that have more than one unique value in a particular column.
This line works, and replaces groups with >1 unique values in the column:
df.groupby(['ID'])\
.apply(lambda group: group if len(set(group['col_name'])) > 1 else np.NaN)
However, if I just change the operator in the lambda to == (or <=), it fails:
df.groupby(['ID'])\
.apply(lambda group: group if len(set(group['col_name'])) == 1 else np.NaN)
resulting in:
AttributeError: 'float' object has no attribute '_get_axis'
I am having trouble connecting this error to my implementation; I tried casting 1 as a float, to no avail.
If there is a better way to accomplish this same task, that would be helpful as well.
I believe you can use SeriesGroupBy.nunique with transform to get a Series the same length as the original column, which makes comparison possible, and then replace values with where:
df = pd.DataFrame({
'col_name':list('abcdef'),
'ID':list('aaabbc')
})
df['col_name'] = df['col_name'].where(df.groupby('ID')['col_name'].transform('nunique') > 1)
Another solution with numpy.where:
m = df.groupby('ID')['col_name'].transform('nunique') > 1
df['col_name'] = np.where(m, df['col_name'], np.nan)
print (df)
col_name ID
0 a a
1 b a
2 c a
3 d b
4 e b
5 NaN c
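To see what both solutions compare against, here is the intermediate transform result (a small sketch, computed on the original df before values are replaced):
print (df.groupby('ID')['col_name'].transform('nunique'))
0    3
1    3
2    3
3    2
4    2
5    1
Name: col_name, dtype: int64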
I have a dataframe with a multiindex, where one of the index levels holds multiple values separated by a "|", like this:
value
left right
x a|b 2
y b|c|d -1
I want to duplicate the rows based on the "right" column, to get something like this:
value
left right
x a 2
x b 2
y b -1
y c -1
y d -1
The solution I have to this feels wrong and runs slow, because it's based on iteration:
df2 = df.iloc[:0]
for index, row in df.iterrows():
    stgs = index[1].split("|")
    for s in stgs:
        row.name = (index[0], s)
        df2 = df2.append(row)
Is there a more vectorized way to do this?
Pandas Series have a dedicated method, str.split, to perform this operation.
split works only on Series, so isolate the column you want:
SO = df['right']
Now three steps at once: split returns a Series of lists; apply(pd.Series) converts each list into columns; stack stacks those columns into a single column:
S1 = SO.str.split(',').apply(pd.Series).stack()
The only issue is that you now have a MultiIndex, so just drop the level you don't need:
S1.index = S1.index.droplevel(-1)
Full example
SO = pd.Series(data=["a,b", "b,c,d"])
S1 = SO.str.split(',').apply(pd.Series).stack()
S1
Out[4]:
0 0 a
1 b
1 0 b
1 c
2 d
S1.index = S1.index.droplevel(-1)
S1
Out[5]:
0 a
0 b
1 b
1 c
1 d
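On newer pandas versions (0.25+), Series.explode can replace the apply/stack/droplevel combination in one step; a short sketch with the same sample data:
S1 = SO.str.split(',').explode()
S1
Out[6]:
0    a
0    b
1    b
1    c
1    d
dtype: object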
Building upon the answer by @xNoK, I am adding here the additional step needed to include the result back in the original DataFrame.
We have this data:
arrays = [['x', 'y'], ['a|b', 'b|c|d']]
midx = pd.MultiIndex.from_arrays(arrays, names=['left', 'right'])
df = pd.DataFrame(index=midx, data=[2, -1], columns=['value'])
df
Out[17]:
value
left right
x a|b 2
y b|c|d -1
First, let's generate the values for the right index level as @xNoK suggested: take the index level we want to work on with index.levels[1], convert it to a Series so that we can use str.split(), and finally stack() it to get the result we want.
new_multi_idx_val = df.index.levels[1].to_series().str.split('|').apply(pd.Series).stack()
new_multi_idx_val
Out[18]:
right
a|b 0 a
1 b
b|c|d 0 b
1 c
2 d
dtype: object
Now we want to put these values back into the original DataFrame df. To do that, let's change its shape so that the result generated in the previous step can be copied over.
To do so, we can repeat each row (including its index) by the number of |-separated values in the right level of the multi-index. df.index.levels[1].to_series().str.split('|').apply(lambda x: len(x)) gives the number of times each row (including its index) should be repeated. We pass this to index.repeat() and fetch the values at those indexes to create a new DataFrame df_repeted.
df_repeted = df.loc[df.index.repeat(df.index.levels[1].to_series().str.split('|').apply(lambda x: len(x)))]
df_repeted
Out[19]:
value
left right
x a|b 2
a|b 2
y b|c|d -1
b|c|d -1
b|c|d -1
Now the df_repeted DataFrame has a shape where we can change the index to get the answer we want.
Replace the index of df_repeted with the desired values as follows:
df_repeted.index = [df_repeted.index.droplevel(1), new_multi_idx_val]
df_repeted.index.rename(names=['left', 'right'], inplace=True)
df_repeted
Out[20]:
value
left right
x a 2
b 2
y b -1
c -1
d -1
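For completeness, a more compact sketch of the whole reconstruction using explode (assumes pandas 0.25+ and the df defined above):
out = (df.reset_index()
         .assign(right=lambda d: d['right'].str.split('|'))
         .explode('right')
         .set_index(['left', 'right']))
print (out)
            value
left right
x    a          2
     b          2
y    b         -1
     c         -1
     d         -1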
I'm new to python/pandas and came across a code snippet.
df = df[~df['InvoiceNo'].str.contains('C')]
I would be much obliged if someone could explain the tilde sign's usage in this context.
It means bitwise NOT, inverting the boolean mask - False to True and True to False.
Sample:
df = pd.DataFrame({'InvoiceNo': ['aaC','ff','lC'],
'a':[1,2,5]})
print (df)
InvoiceNo a
0 aaC 1
1 ff 2
2 lC 5
#check if column contains C
print (df['InvoiceNo'].str.contains('C'))
0 True
1 False
2 True
Name: InvoiceNo, dtype: bool
#inversing mask
print (~df['InvoiceNo'].str.contains('C'))
0 False
1 True
2 False
Name: InvoiceNo, dtype: bool
Filter by boolean indexing:
df = df[~df['InvoiceNo'].str.contains('C')]
print (df)
InvoiceNo a
1 ff 2
So the output is all rows of the DataFrame which do not contain C in column InvoiceNo.
It's used to invert boolean Series, see pandas-doc.
df = df[~df['InvoiceNo'].str.contains('C')]
The above line removes from the pandas dataframe all rows whose string value in the InvoiceNo column contains the letter "C".
The tilde (~) sign works as a NOT (!) operator in this scenario.
The same pattern is commonly used to remove rows that have null values in a column, e.g. df[~df['InvoiceNo'].isnull()].
tilde ~ is a bitwise operator: if a bit is 1 it returns 0, and if 0 it returns 1. Applied to a boolean mask it inverts every element, so you get the InvoiceNo values in the df that do not contain the string 'C'.
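One caveat worth noting: on plain Python integers ~ performs two's-complement NOT rather than a logical inversion, so it should only be applied to boolean masks; a small sketch:
print (~1)  # -2, two's complement on a plain int
print (~pd.Series([True, False]))
0    False
1     True
dtype: bool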
I have the following pandas Series, where each row is a long string with no spaces. It is shaped (250,) (i.e. there are 250 rows):
import pandas as pd
sr1 = pd.Series(...)
0
0 abdcadbcadcbadacbadbdddddacbadcbadadbcadbcadad...
1 cacacdacadbdcadcabdcbadcbadbdabcabdbbbbbacdbac...
2 bbbbbcadcacddabcadbcdabcbaddcbadcbadbcadbcaaba...
3 acdbcdacdbadbadcbdbaaaacbdacadbacaddcabdacbdab...
....
I have a list of 250 strings which I would like to prepend to the corresponding rows:
list_of_strings = ["prefix1", "prefix2", "prefix3", ...., "prefix250"]
How does one append each element in list_of_strings to the corresponding row in sr1? The resulting Series should look like this:
0
0 prefix1 abdcadbcadcbadacbadbdddddacbadcbadadbcadbcadad...
1 prefix2 cacacdacadbdcadcabdcbadcbadbdabcabdbbbbbacdbac...
2 prefix3 bbbbbcadcacddabcadbcdabcbaddcbadcbadbcadbcaaba...
3 prefix4 acdbcdacdbadbadcbdbaaaacbdacadbacaddcabdacbdab...
....
My first thought was to try something like:
sr1.insert(0, "prefixes", value = list_of_strings)
But this throws the error AttributeError: 'Series' object has no attribute 'insert'. One could convert sr1 to a pandas DataFrame with sr1 = sr1.to_frame(), but then the previous .insert() just results in a DataFrame with two columns, not the prefixed strings I want.
In python, we can concatenate strings with a specified delimiter as follows:
first = "firstword"
second = "secondword"
combined = " ".join([first, second])
## outputs 'firstword secondword'
I'm not sure how this is done with a pandas Series. Perhaps .apply(' '.join) somehow?
You need first to create a Series from the list, and then use add or + twice - once for the whitespace and once for s:
s = pd.Series(['a','b','c'])
list_of_strings = ["prefix1", "prefix2", "prefix3"]
print (pd.Series(list_of_strings, index=s.index).add(' ').add(s))
#same as
#print (pd.Series(list_of_strings, index=s.index)+ ' ' + s)
0 prefix1 a
1 prefix2 b
2 prefix3 c
dtype: object
Another solution with cat:
print (pd.Series(list_of_strings, index=s.index).str.cat(s, sep=' '))
0 prefix1 a
1 prefix2 b
2 prefix3 c
dtype: object
Solution with apply, but it needs a DataFrame first - built by the constructor or by concat:
print (pd.DataFrame({'prefix':list_of_strings, 'vals':s}).apply(' '.join, axis=1))
0 prefix1 a
1 prefix2 b
2 prefix3 c
dtype: object
print (pd.concat([pd.Series(list_of_strings, index=s.index), s], axis=1)
.apply(' '.join, axis=1))
0 prefix1 a
1 prefix2 b
2 prefix3 c
dtype: object
You can make a series of your prefixes, then just add the two series together:
import pandas as pd
s1 = pd.Series(['a'*10,'b'*10,'c'*10])
s1
# returns:
# 0 aaaaaaaaaa
# 1 bbbbbbbbbb
# 2 cccccccccc
s2 = pd.Series(['pre1', 'pre2', 'pre3'])
s2+s1
# returns:
# 0 pre1aaaaaaaaaa
# 1 pre2bbbbbbbbbb
# 2 pre3cccccccccc
How about just turning the list of prefixes into a series of length 250, then adding the two series?
sr0 = pd.Series(list_of_strings)
sr1 = sr0 + sr1
Use + operator, it will concatenate strings automatically.
pd.Series(list_of_strings) + " " + sr1
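One caveat with the + approach: arithmetic between Series aligns on the index, so if sr1 carries a non-default index, build the prefix Series with a matching index (a sketch, assuming sr1 as in the question):
pd.Series(list_of_strings, index=sr1.index) + " " + sr1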
I have a problem with adding columns in pandas.
I have a DataFrame whose dimensions are n x k. During processing I will need to add columns with dimensions m x 1, where m = [1, n], but I don't know m in advance.
When I try do it:
df['Name column'] = data
# type(data) = list
result:
AssertionError: Length of values does not match length of index
Can I add columns with different length?
If you use the accepted answer, you'll lose your column names, as shown in the accepted answer's example and described in the documentation (emphasis added):
The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
It looks like column names ('Name column') are meaningful to the Original Poster / Original Question.
To keep the column names, use pandas.concat, but don't pass ignore_index (its default value is False, so you can omit that argument altogether). Continue to use axis=1:
import pandas
# Note these columns have 3 rows of values:
original = pandas.DataFrame({
'Age':[10, 12, 13],
'Gender':['M','F','F']
})
# Note this column has 4 rows of values:
additional = pandas.DataFrame({
'Name': ['Nate A', 'Jessie A', 'Daniel H', 'John D']
})
new = pandas.concat([original, additional], axis=1)
# Identical:
# new = pandas.concat([original, additional], ignore_index=False, axis=1)
print(new.head())
# Age Gender Name
#0 10 M Nate A
#1 12 F Jessie A
#2 13 F Daniel H
#3 NaN NaN John D
Notice how John D does not have an Age or a Gender.
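A side note: concat also aligns on the index, so if either frame carries a non-default index, reset it first or the columns will not line up row by row (a hedged sketch, assuming the frames above):
new = pandas.concat([original.reset_index(drop=True),
                     additional.reset_index(drop=True)], axis=1)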
Use concat and pass axis=1 and ignore_index=True:
In [38]:
import numpy as np
df = pd.DataFrame({'a':np.arange(5)})
df1 = pd.DataFrame({'b':np.arange(4)})
print(df1)
df
b
0 0
1 1
2 2
3 3
Out[38]:
a
0 0
1 1
2 2
3 3
4 4
In [39]:
pd.concat([df,df1], ignore_index=True, axis=1)
Out[39]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 NaN
We can add lists of different sizes to a DataFrame.
Example
a = [0,1,2,3]
b = [0,1,2,3,4,5,6,7,8,9]
c = [0,1]
Find the length of all the lists:
la,lb,lc = len(a),len(b),len(c)
# now find the max
max_len = max(la,lb,lc)
Resize all the lists to the determined max length by padding with empty strings:
if not max_len == la:
    a.extend([''] * (max_len - la))
if not max_len == lb:
    b.extend([''] * (max_len - lb))
if not max_len == lc:
    c.extend([''] * (max_len - lc))
Now all the lists are the same length, so create the DataFrame:
pd.DataFrame({'A':a,'B':b,'C':c})
Final output:
   A  B  C
0  0  0  0
1  1  1  1
2  2  2
3  3  3
4     4
5     5
6     6
7     7
8     8
9     9
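A shorter sketch of the same idea, using the original un-padded lists: wrapping each one in a Series lets the DataFrame constructor pad the missing positions with NaN instead of empty strings:
pd.DataFrame({'A': pd.Series(a), 'B': pd.Series(b), 'C': pd.Series(c)})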
I had the same issue: two different dataframes without a common column, and I just needed to put them beside each other in a csv file.
Merge:
In this case, "merge" does not work, even after adding a temporary column to both dfs and then dropping it, because this method forces both dfs to the same length: it repeats the rows of the shorter dataframe to match the longer dataframe's length.
Concat:
The idea from The Red Pea's answer didn't work for me: it just appended the shorter df to the longer one (row-wise) while leaving an empty column (NaNs) above the shorter df's column.
Solution: You need to do the following:
df1 = df1.reset_index()
df2 = df2.reset_index()
df = [df1, df2]
df_final = pd.concat(df, axis=1)
df_final.to_csv(filename, index=False)
This way, you'll see your dfs beside each other (column-wise), each with its own length.
If somebody would like to replace a specific column of a different size instead of adding one:
Based on this answer, I use a dict as an intermediate type.
Create Pandas Dataframe with different sized columns
If the column to be inserted is not a list but already a dict, the respective line can be omitted.
def fill_column(dataframe: pd.DataFrame, values: list, column: str):
    dict_from_list = dict(enumerate(values))  # create an enumerated dict from the list
    dataframe_as_dict = dataframe.to_dict()  # get the DataFrame as a dict
    dataframe_as_dict[column] = dict_from_list  # assign the specific column
    return pd.DataFrame.from_dict(dataframe_as_dict, orient='index').T  # build a new DataFrame from the dict and return it
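A hypothetical usage example of the sketch above:
df = pd.DataFrame({'A': [1, 2, 3]})
print (fill_column(df, ['x', 'y'], 'B'))
#    A    B
# 0  1    x
# 1  2    y
# 2  3  NaN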