Concatenate columns ignoring duplicates with pandas

Concatenate columns ignoring duplicates with pandas - python

does anyone know how to concatenate multiple columns excluding duplicated values?
I'm a student of python, this is my first project and I have a problem
I have a dataset like this one:
each number represents a column from my dataframe
df = {'col1': ['a','b','a','c'], 'col2': ["","",'a',''], 'col3': ['','a','','b'], 'col4': ['a','','b',''], 'col2': ['b','c','c','']}
Need a output like this:
new colum
a-b
a-b-c
a-c-b
b-c
Need the data sorted, concatenated and with unique values
I was able to do this in excel using transpose, sort and unique, like this:
=TEXTJOIN("-";;TRANSPOSE(UNIQUE(SORT(TRANSPOSE(A1:E1)))))
But I couldn't figure out how to do it on pandas. Can anoyne help me plz?

Example
you can make example by to_dict and make null from None
data = {'col1': ['a','b','a','c'], 'col2': [None,None,'a',None], 'col3': [None,'a',None,'b'], 'col4': ['a',None,'b',None], 'col5': ['b','c','c',None]}
df = pd.DataFrame(data)
df
col1 col2 col3 col4 col5
0 a None None a b
1 b None a None c
2 a a None b c
3 c None b None None
Code
df.apply(lambda x: '-'.join(sorted(x.dropna().unique())) ,axis=1)
output:
0 a-b
1 a-b-c
2 a-b-c
3 b-c
dtype: object

Assuming your DataFrame's identifier is just 'df', you could try something like this:
# List for output
newList = []
# Iterate over the DataFrame rows
for i, row in df.iterrows():
# Remove duplicates
row.drop_duplicates(inplace=True)
# Remove NaNs
row.dropna(inplace=True)
# Sort alphabetically
row.sort_values(inplace=True)
# Add adjusted row to the list
newList.append(row.to_list())
# Output
print(newList)
If your DataFrame isn't named 'df', just substitute 'df.iterrows()' for '[your dataframe].iterrows()'.
This gives the output:
[['a', 'b'], ['a', 'b', 'c'], ['a', 'b', 'c'], ['b', 'c']]
If you really need the output to be formatted like you said (a-b, a-b-c, etc): you can iterate over 'newList' to concatenate them and add the hyphens.

Related

Python fill out multiple columns with random values from a dictionary

I'm trying to fill multiple columns of a dataframe with random values from a dictionary. From a another post I understood that you could specify a list and have a column filled with random values from that list like this:
Dataframe:
Col1 Col2 Col3
1 NaN NaN values
2 NaN NaN .
3 NaN NaN .
my_list = ['a', 'b', 'c', 'd']
df['Col1'] = np.random.choice(my_list, len(df))
The code would then fill the column like this:
Col1 Col2 Col3
1 b NaN values
2 d NaN .
3 a NaN .
What I want is to fill out multiple columns and while the first thought would be to use something ugly like this:
my_list1 = ['a', 'b', 'c', 'd']
my_list2 = ['k', 'l', 'e', 'f']
df['Col1'] = np.random.choice(my_list1, len(df))
df['Col2'] = np.random.choice(my_list2, len(df))
I would like to declare a dictionary of lists and somehow call a function that maps the random values to their respective columns:
my_dict = {'Col1': ['a', 'b', 'c', 'd'],
'Col2': ['k', 'l', 'e', 'f']}
df = <insert function to fill columns>
And then the dataframe would end up looking like this:
Col1 Col2 Col3
1 b l values
2 d f .
3 c k .
Note that I would only want to fill out a certain amount of columns in my dataframe and not all of them

This should get you where you need to go:
for k, v in my_dict.items():
df[k] = np.random.choice(v, len(df))

#vtasca's answers is perfectly fine. Because you asked about other ways of doing this though, here's a fun way using dictionary comprehensions. It's a loop hiding behind some makeup:
chosen = {k: np.random.choice(v, len(df)) for k, v in my_dict.items()}
df = pd.concat([df, pd.DataFrame(chosen)], axis=1)

How to drop rows based on column value if column is not set as index in pandas?

I have a list and a dataframe which look like this:
list = ['a', 'b']
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]})
I would like to do something like this:
df1 = df.drop([x for x in list])
I am getting the following error message:
"KeyError: "['a' 'b'] not found in axis""
I know I can do the following:
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]}).set_index('A')
df1 = df.drop([x for x in list])
How can I drop the list values without having to set column 'A' as index? My dataframe has multiple columns.

Input:
A B
0 a 9
1 b 9
2 c 8
3 d 4
Code:
for i in list:
ind = df[df['A']==i].index.tolist()
df=df.drop(ind)
df
Output:
A B
2 c 8
3 d 4

You need to specify the correct axis, namely axis=1.
According to the docs:
axis{0 or ‘index’, 1 or ‘columns’}, default 0. Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
You also need to make sure you are using the correct column names, (ie. uppercase rather than lower case. And you should avoid using python reserved names, so don't use list.
This should work:
myList = ['A', 'B']
df1 = df.drop([x for x in myList], axis=1)

Selecting rows of a DataFrame according to whether an element is in a column's list

I currently have a DataFrame like such:
col1 col2 col3
0 0 1 ['a', 'b', 'c']
1 2 3 ['d', 'e', 'f']
2 4 5 ['g', 'h', 'i']
what I want to do is select the rows where a certain value is contained in the lists of col3. For example, the code I would initially run is:
df.loc['a' in df['col3']]
but I get the following error:
KeyError: False
I've taken a look at this question: KeyError: False in pandas dataframe but it doesn't quite answer my question. I've tried the suggested solutions in the answers and it didn't help.
How would I go about this issue? Thanks.

Use list comprehension for test each list:
df1 = df[['a' in x for x in df['col3']]]
print (df1)
col1 col2 col3
0 0 1 [a, b, c]
Or use Series.map:
df1 = df[df['col3'].map(lambda x: 'a' in x)]
#alternative
#df1 = df[df['col3'].apply(lambda x: 'a' in x)]
Or create DataFrame and test by DataFrame.eq with DataFrame.any:
df1 = df[pd.DataFrame(df['col3'].tolist()).eq('a').any(axis=1)]

Use:
df = df[df.testc.map(lambda x: 'a' in x)]

Apply a specific function for two columns in Pandas DataFrame

I have a Pandas DataFrame with two columns, each row contains a list of elements. I'm trying to find set difference between two columns for each row using pandas.apply method.
My df for example
A B
0 ['a','b','c'] ['a']
1 ['e', 'f', 'g'] ['f', 'g']
So it should look like this:
df.apply(set_diff_func, axis=1)
What I'm trying to achieve:
0 ['b','c']
1 ['e']
I can make it using iterrows, but I've once read, that it's better to use apply when it's possible.

How about
df.apply(lambda row: list(set(row['A']) - set(row['B'])), axis = 1)
or
(df['A'].apply(set) - df['B'].apply(set)).apply(list)

Here's the function you need, you can change the name of the columns with the col1 and col2 arguments by passing them to the args option in apply:
def set_diff_func(row, col1, col2):
return list(set(row[col1]).difference(set(row[col2])))
This should return the required result:
>>> dataset = pd.DataFrame(
[{'A':['a','b','c'], 'B':['a']},
{'A':['e', 'f', 'g'] , 'B':['f', 'g']}])
>>> dataset.apply(set_diff_func, axis=1, args=['A','B'])
0 [c, b]
1 [e]

Finding the Location of the Duplicate for Duplicated Columns in Pandas

I know I can find duplicate columns using:
df.T.duplicated()
what I'd like to know the index that a duplicate column is a duplicate of. For example, both C and D are duplicates of a A below:
df = pd.DataFrame([[1,0,1,1], [2,0,2,2]], columns=['A', 'B', 'C', 'D'])
A B C D
0 1 0 1 1
1 2 0 2 2
I'd like something like:
duplicate_index = pd.Series([None, None, 'A', 'A'], ['A', 'B', 'C', 'D'])

I don't know if duplicated have an option to give information about the first row with the same data. My idea is by using groupby and transform such as:
arr_first = (df.T.reset_index().groupby([col for col in df.T.columns])['index']
.transform(lambda x: x.iloc[0]).values)
With your example, arr_first is then equal to array(['A', 'B', 'A', 'A'], dtype=object) and because they have the same order than df.columns, to get the expected output, you use np.where like:
duplicate_index = pd.Series(pd.np.where(arr_first != df.columns, arr_first, None),df.columns)
and the result for duplicate_index is
A None
B None
C A
D A
dtype: object

Another more direct way to test if two numeric columns are duplicated with each other is to test the correlation matrix, which test all pairs of columns. Here is the code:
import pandas as pd
df = pd.DataFrame([[1,0,1,1], [2,0,2,2]], columns=['A', 'B', 'C', 'D'])
# compute the correlation matrix
cm = df.corr()
cm
This shows a matrix of the correlation of all columns to each other column (including itself). If a column is 1:1 with another column, then the value is 1.0.
To find all columns that are duplicates of A, then :
cm['A']
A 1.0
B NaN
C 1.0
D 1.0
If you have categorical (string objects) and not numeric, you could make a cross correlation table.
Hope this helps!

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Concatenate columns ignoring duplicates with pandas - python

Related

Python fill out multiple columns with random values from a dictionary

How to drop rows based on column value if column is not set as index in pandas?

Selecting rows of a DataFrame according to whether an element is in a column's list

Apply a specific function for two columns in Pandas DataFrame

Finding the Location of the Duplicate for Duplicated Columns in Pandas

Categories

Resources