I have a Pandas DataFrame with two columns, where each row contains a list of elements. I'm trying to find the set difference between the two columns for each row using the pandas apply method.
My df, for example:
A B
0 ['a','b','c'] ['a']
1 ['e', 'f', 'g'] ['f', 'g']
So it should look like this:
df.apply(set_diff_func, axis=1)
What I'm trying to achieve:
0 ['b','c']
1 ['e']
I can do it using iterrows, but I once read that it's better to use apply when possible.
How about
df.apply(lambda row: list(set(row['A']) - set(row['B'])), axis = 1)
or
(df['A'].apply(set) - df['B'].apply(set)).apply(list)
Here's the function you need. You can change the column names through the col1 and col2 arguments by passing them to the args option of apply:
def set_diff_func(row, col1, col2):
    return list(set(row[col1]).difference(set(row[col2])))
This should return the required result:
>>> dataset = pd.DataFrame(
[{'A':['a','b','c'], 'B':['a']},
{'A':['e', 'f', 'g'] , 'B':['f', 'g']}])
>>> dataset.apply(set_diff_func, axis=1, args=['A','B'])
0 [c, b]
1 [e]
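Note that sets are unordered, which is why the output above shows [c, b] rather than [b, c]. If the original order matters, a minimal sketch along these lines may help. The helper name ordered_diff and its default column names are assumptions made for illustration, not part of the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [['a', 'b', 'c'], ['e', 'f', 'g']],
    'B': [['a'], ['f', 'g']],
})

def ordered_diff(row, col1='A', col2='B'):
    # Hypothetical helper: keep items of col1 that are absent from col2,
    # preserving the order in which they appear in col1.
    exclude = set(row[col2])
    return [item for item in row[col1] if item not in exclude]

result = df.apply(ordered_diff, axis=1)
print(result)
```

The set is still used for the membership test, so the lookup stays fast, but the result is built by walking the original list.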
Does anyone know how to concatenate multiple columns while excluding duplicated values?
I'm a Python student, this is my first project and I have a problem.
I have a dataset like this one:
each number represents a column from my dataframe
df = {'col1': ['a','b','a','c'], 'col2': ["","",'a',''], 'col3': ['','a','','b'], 'col4': ['a','','b',''], 'col5': ['b','c','c','']}
I need an output like this:
new column
a-b
a-b-c
a-c-b
b-c
I need the data sorted and concatenated, with unique values only.
I was able to do this in Excel using transpose, sort and unique, like this:
=TEXTJOIN("-";;TRANSPOSE(UNIQUE(SORT(TRANSPOSE(A1:E1)))))
But I couldn't figure out how to do it in pandas. Can anyone help me, please?
Example
You can build a reproducible example from a dict (for instance via to_dict), using None for the missing values:
data = {'col1': ['a','b','a','c'], 'col2': [None,None,'a',None], 'col3': [None,'a',None,'b'], 'col4': ['a',None,'b',None], 'col5': ['b','c','c',None]}
df = pd.DataFrame(data)
df
col1 col2 col3 col4 col5
0 a None None a b
1 b None a None c
2 a a None b c
3 c None b None None
Code
df.apply(lambda x: '-'.join(sorted(x.dropna().unique())) ,axis=1)
output:
0 a-b
1 a-b-c
2 a-b-c
3 b-c
dtype: object
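If the blanks are empty strings, as in the question, rather than None, dropna will not remove them. A minimal sketch, assuming empty strings mark the missing cells, filters them inside the row function instead. The helper name join_unique is made up for illustration.

```python
import pandas as pd

data = {
    'col1': ['a', 'b', 'a', 'c'],
    'col2': ['', '', 'a', ''],
    'col3': ['', 'a', '', 'b'],
    'col4': ['a', '', 'b', ''],
    'col5': ['b', 'c', 'c', ''],
}
df = pd.DataFrame(data)

def join_unique(row):
    # Hypothetical helper: keep non-empty strings only, drop duplicates
    # with a set, then sort and join with hyphens.
    values = {v for v in row if isinstance(v, str) and v}
    return '-'.join(sorted(values))

result = df.apply(join_unique, axis=1)
print(result)
```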
Assuming your DataFrame's identifier is just 'df', you could try something like this:
# List for output
newList = []
# Iterate over the DataFrame rows
for i, row in df.iterrows():
    # Remove duplicates
    row.drop_duplicates(inplace=True)
    # Remove NaNs
    row.dropna(inplace=True)
    # Sort alphabetically
    row.sort_values(inplace=True)
    # Add adjusted row to the list
    newList.append(row.to_list())
# Output
print(newList)
If your DataFrame isn't named 'df', just replace 'df.iterrows()' with '[your dataframe].iterrows()'.
This gives the output:
[['a', 'b'], ['a', 'b', 'c'], ['a', 'b', 'c'], ['b', 'c']]
If you really need the output to be formatted like you said (a-b, a-b-c, etc): you can iterate over 'newList' to concatenate them and add the hyphens.
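The concatenation step mentioned above can be sketched as follows, assuming newList holds the lists printed earlier. The name joined is chosen here for illustration.

```python
# newList as printed in the output above
newList = [['a', 'b'], ['a', 'b', 'c'], ['a', 'b', 'c'], ['b', 'c']]

# Join the items of each inner list with a hyphen
joined = ['-'.join(items) for items in newList]
print(joined)
```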
I want to sort a dataframe by multiple columns like this:
df.sort_values( by=[ 'A', 'B', 'C', 'D', 'E' ], inplace=True )
However, I found out that pandas sorts the uppercase values first and then the lowercase ones.
I tried this:
df.sort_values( by=[ 'A', 'B', 'C', 'D', 'E' ], inplace=True, key=lambda x: x.str.lower() )
but I get this error:
TypeError: sort_values() got an unexpected keyword argument 'key'
If I could, I would convert all columns to lowercase, but I want to keep them as they are.
Any hints?
According to the docs for DataFrame.sort_values, the key argument requires pandas 1.1.0 or later, so you need to upgrade pandas for this to work:
key - callable, optional
Apply the key function to the values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.
New in version 1.1.0.
Sample:
df = pd.DataFrame({
'A':list('MmMJJj'),
'B':list('aYAbCc')
})
df.sort_values(by=[ 'A', 'B'], inplace=True, key=lambda x: x.str.lower())
print (df)
A B
3 J b
4 J C
5 j c
0 M a
2 M A
1 m Y
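If upgrading is not an option, one possible workaround (a sketch, not taken from the docs) is to sort a lowercased copy of the key columns and reuse the resulting index order on the original frame. The names lowered and order are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': list('MmMJJj'),
    'B': list('aYAbCc'),
})

# Lowercased copy of the sort columns; the original df is left untouched
lowered = df[['A', 'B']].apply(lambda col: col.str.lower())

# Sort the copy, then reorder the original rows by the sorted index labels
order = lowered.sort_values(by=['A', 'B']).index
result = df.loc[order]
print(result)
```

This relies on the index labels being unique, since df.loc[order] looks rows up by label.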
I have a list and a dataframe which look like this:
list = ['a', 'b']
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]})
I would like to do something like this:
df1 = df.drop([x for x in list])
I am getting the following error message:
"KeyError: "['a' 'b'] not found in axis""
I know I can do the following:
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]}).set_index('A')
df1 = df.drop([x for x in list])
How can I drop the list values without having to set column 'A' as index? My dataframe has multiple columns.
Input:
A B
0 a 9
1 b 9
2 c 8
3 d 4
Code:
for i in list:
    ind = df[df['A']==i].index.tolist()
    df=df.drop(ind)
df
Output:
A B
2 c 8
3 d 4
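A loop-free variant of the same idea can be sketched with Series.isin and a boolean mask. The variable name values is an assumption, used here to avoid shadowing the built-in list.

```python
import pandas as pd

values = ['a', 'b']
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': [9, 9, 8, 4]})

# Keep only the rows whose 'A' value is NOT in the list
df1 = df[~df['A'].isin(values)]
print(df1)
```

This leaves column 'A' as a regular column, so there is no need to set it as the index first.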
If you want to drop columns rather than rows, you need to specify the correct axis, namely axis=1.
According to the docs:
axis{0 or ‘index’, 1 or ‘columns’}, default 0. Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
You also need to make sure you are using the correct column names (i.e. uppercase rather than lowercase). You should also avoid shadowing Python built-in names, so don't call your variable list.
This should work:
myList = ['A', 'B']
df1 = df.drop([x for x in myList], axis=1)
I currently have a DataFrame like such:
col1 col2 col3
0 0 1 ['a', 'b', 'c']
1 2 3 ['d', 'e', 'f']
2 4 5 ['g', 'h', 'i']
what I want to do is select the rows where a certain value is contained in the lists of col3. For example, the code I would initially run is:
df.loc['a' in df['col3']]
but I get the following error:
KeyError: False
I've taken a look at this question: KeyError: False in pandas dataframe but it doesn't quite answer my question. I've tried the suggested solutions in the answers and it didn't help.
How would I go about this issue? Thanks.
Use a list comprehension to test each list:
df1 = df[['a' in x for x in df['col3']]]
print (df1)
col1 col2 col3
0 0 1 [a, b, c]
Or use Series.map:
df1 = df[df['col3'].map(lambda x: 'a' in x)]
#alternative
#df1 = df[df['col3'].apply(lambda x: 'a' in x)]
Or create a DataFrame from the lists and test it with DataFrame.eq and DataFrame.any:
df1 = df[pd.DataFrame(df['col3'].tolist()).eq('a').any(axis=1)]
Use:
df = df[df['col3'].map(lambda x: 'a' in x)]
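As for why the original attempt fails: the in operator applied to a Series checks the index labels, not the values, and returns a single boolean, which df.loc then tries to look up as a label. A small sketch, rebuilding the DataFrame from the question (the variable names label_check, index_check and mask are made up here), illustrates this:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [0, 2, 4],
    'col2': [1, 3, 5],
    'col3': [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
})

# 'in' on a Series tests membership in the index labels (0, 1, 2 here)
label_check = 'a' in df['col3']   # no index label equals 'a'
index_check = 0 in df['col3']     # 0 is an index label

# A per-row test produces one boolean per row, which is what indexing needs
mask = df['col3'].map(lambda items: 'a' in items)
print(label_check, index_check)
print(df[mask])
```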
I know I can find duplicate columns using:
df.T.duplicated()
What I'd like to know is which column a duplicate column is a duplicate of. For example, both C and D are duplicates of A below:
df = pd.DataFrame([[1,0,1,1], [2,0,2,2]], columns=['A', 'B', 'C', 'D'])
A B C D
0 1 0 1 1
1 2 0 2 2
I'd like something like:
duplicate_index = pd.Series([None, None, 'A', 'A'], ['A', 'B', 'C', 'D'])
I don't know whether duplicated has an option that reports the first row with the same data. My idea is to use groupby and transform, such as:
arr_first = (df.T.reset_index().groupby([col for col in df.T.columns])['index']
.transform(lambda x: x.iloc[0]).values)
With your example, arr_first is then equal to array(['A', 'B', 'A', 'A'], dtype=object). Because its values are in the same order as df.columns, you can get the expected output with np.where (pd.np is no longer available in recent pandas versions, so import numpy directly):
import numpy as np
duplicate_index = pd.Series(np.where(arr_first != df.columns, arr_first, None), df.columns)
and the result for duplicate_index is
A None
B None
C A
D A
dtype: object
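For comparison, the same mapping can be sketched without groupby by remembering the first column seen for each tuple of values. This is an alternative under the assumption that the column values are hashable. The names first_seen and duplicate_of are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 0, 1, 1], [2, 0, 2, 2]], columns=['A', 'B', 'C', 'D'])

first_seen = {}    # tuple of column values -> first column name holding them
duplicate_of = {}  # column name -> name of the earlier identical column, or None

for name in df.columns:
    key = tuple(df[name])
    duplicate_of[name] = first_seen.get(key)  # None if not seen before
    first_seen.setdefault(key, name)

duplicate_index = pd.Series(
    list(duplicate_of.values()),
    index=list(duplicate_of.keys()),
    dtype=object,
)
print(duplicate_index)
```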
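Another way to check numeric columns against each other is the correlation matrix, which tests all pairs of columns. Keep in mind that a correlation of 1.0 only indicates a perfect positive linear relationship, so it flags duplicates but also columns that are scaled or shifted copies of each other. Here is the code: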
import pandas as pd
df = pd.DataFrame([[1,0,1,1], [2,0,2,2]], columns=['A', 'B', 'C', 'D'])
# compute the correlation matrix
cm = df.corr()
cm
This shows a matrix of the correlation of each column with every other column (including itself). If a column is identical to another column, the value is 1.0. A constant column such as B has no variance, so its correlation is NaN.
To find all columns that are candidate duplicates of A:
cm['A']
A 1.0
B NaN
C 1.0
D 1.0
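Since a correlation of 1.0 does not by itself prove that two columns are equal, a follow-up check with Series.equals can confirm the candidates. The sketch below uses a small made-up DataFrame in which column E is A multiplied by two. The column E and the name same_as_a are assumptions for illustration.

```python
import pandas as pd

# Made-up data: C is a true duplicate of A, E is only a scaled copy of A
df = pd.DataFrame({'A': [1, 2, 3], 'E': [2, 4, 6], 'C': [1, 2, 3]})

corr = df.corr()
print(corr['A'])  # both E and C correlate perfectly with A

# Confirm real duplicates with an exact comparison
same_as_a = {col: df['A'].equals(df[col]) for col in df.columns}
print(same_as_a)
```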
If your columns are categorical (string objects) rather than numeric, you could make a cross-correlation table instead.
Hope this helps!