Does anyone know how to concatenate multiple columns while excluding duplicated values?
I'm learning Python, this is my first project, and I have run into a problem.
I have a dataset like this one:
Each key represents a column of my dataframe:
df = {'col1': ['a','b','a','c'], 'col2': ['','','a',''], 'col3': ['','a','','b'], 'col4': ['a','','b',''], 'col5': ['b','c','c','']}
I need an output like this:
new column
a-b
a-b-c
a-b-c
b-c
I need the values sorted, de-duplicated, and concatenated.
I was able to do this in Excel using transpose, sort and unique, like this:
=TEXTJOIN("-";;TRANSPOSE(UNIQUE(SORT(TRANSPOSE(A1:E1)))))
But I couldn't figure out how to do it in pandas. Can anyone help me, please?
Example
You can build a reproducible example with to_dict, using None for the missing values:
data = {'col1': ['a','b','a','c'], 'col2': [None,None,'a',None], 'col3': [None,'a',None,'b'], 'col4': ['a',None,'b',None], 'col5': ['b','c','c',None]}
df = pd.DataFrame(data)
df
col1 col2 col3 col4 col5
0 a None None a b
1 b None a None c
2 a a None b c
3 c None b None None
Code
df.apply(lambda x: '-'.join(sorted(x.dropna().unique())) ,axis=1)
output:
0 a-b
1 a-b-c
2 a-b-c
3 b-c
dtype: object
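The question's data uses empty strings rather than None, so the one-liner above would treat '' as a value. A minimal sketch of one way to handle that, assuming empty strings are the only placeholder for missing data (the names cleaned and new_column are illustrative, not from the original post):

```python
import numpy as np
import pandas as pd

# Data as given in the question, with '' marking missing cells
df = pd.DataFrame({
    'col1': ['a', 'b', 'a', 'c'],
    'col2': ['', '', 'a', ''],
    'col3': ['', 'a', '', 'b'],
    'col4': ['a', '', 'b', ''],
    'col5': ['b', 'c', 'c', ''],
})

# Turn empty strings into NaN so dropna() can discard them
cleaned = df.replace('', np.nan)

# Per row: drop missing values, de-duplicate, sort, join with hyphens
df['new_column'] = cleaned.apply(
    lambda row: '-'.join(sorted(row.dropna().unique())), axis=1
)
print(df['new_column'].tolist())
```

The apply call is the same as in the answer; only the replace step is added.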
Assuming your DataFrame's identifier is just 'df', you could try something like this:
# List for output
newList = []
# Iterate over the DataFrame rows
for i, row in df.iterrows():
    # Remove duplicates
    row.drop_duplicates(inplace=True)
    # Remove NaNs
    row.dropna(inplace=True)
    # Sort alphabetically
    row.sort_values(inplace=True)
    # Add adjusted row to the list
    newList.append(row.to_list())
# Output
print(newList)
If your DataFrame isn't named 'df', just replace 'df.iterrows()' with '[your dataframe].iterrows()'.
This gives the output:
[['a', 'b'], ['a', 'b', 'c'], ['a', 'b', 'c'], ['b', 'c']]
If you really need the output formatted the way you described (a-b, a-b-c, etc.), you can iterate over 'newList' and join each inner list with hyphens.
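The joining step could look roughly like this. It is a sketch that starts from the list printed above; the names new_list and joined are illustrative:

```python
# Output of the loop above, copied here so the snippet runs on its own
new_list = [['a', 'b'], ['a', 'b', 'c'], ['a', 'b', 'c'], ['b', 'c']]

# Join each inner list with a hyphen
joined = ['-'.join(items) for items in new_list]
print(joined)
```

The resulting list has one string per row, so it can be assigned back as a new column, for example df['new_column'] = joined.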
Related
I'm trying to fill multiple columns of a dataframe with random values from a dictionary. From another post I understood that you can specify a list and have a column filled with random values from that list, like this:
Dataframe:
Col1 Col2 Col3
1 NaN NaN values
2 NaN NaN .
3 NaN NaN .
my_list = ['a', 'b', 'c', 'd']
df['Col1'] = np.random.choice(my_list, len(df))
The code would then fill the column like this:
Col1 Col2 Col3
1 b NaN values
2 d NaN .
3 a NaN .
What I want is to fill multiple columns. My first thought would be to use something ugly like this:
my_list1 = ['a', 'b', 'c', 'd']
my_list2 = ['k', 'l', 'e', 'f']
df['Col1'] = np.random.choice(my_list1, len(df))
df['Col2'] = np.random.choice(my_list2, len(df))
I would like to declare a dictionary of lists and somehow call a function that maps the random values to their respective columns:
my_dict = {'Col1': ['a', 'b', 'c', 'd'],
'Col2': ['k', 'l', 'e', 'f']}
df = <insert function to fill columns>
And then the dataframe would end up looking like this:
Col1 Col2 Col3
1 b l values
2 d f .
3 c k .
Note that I only want to fill a certain number of columns in my dataframe, not all of them.
This should get you where you need to go:
for k, v in my_dict.items():
    df[k] = np.random.choice(v, len(df))
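For a version that can be run end to end, here is a small sketch of the same loop. The frame contents, the index 1..3, and the seeded generator rng are assumptions made so that the example is reproducible; they are not part of the original answer:

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like the one in the question
df = pd.DataFrame(
    {'Col1': [np.nan] * 3, 'Col2': [np.nan] * 3, 'Col3': ['x', 'y', 'z']},
    index=[1, 2, 3],
)

my_dict = {'Col1': ['a', 'b', 'c', 'd'],
           'Col2': ['k', 'l', 'e', 'f']}

# A seeded generator makes the random draws repeatable
rng = np.random.default_rng(0)

# Only the columns named in my_dict are overwritten; Col3 is left alone
for col, values in my_dict.items():
    df[col] = rng.choice(values, size=len(df))

print(df)
```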
@vtasca's answer is perfectly fine. Since you asked about other ways of doing this, though, here's a fun one using a dictionary comprehension. It's a loop hiding behind some makeup:
chosen = {k: np.random.choice(v, len(df)) for k, v in my_dict.items()}
# assign overwrites the existing columns; pd.concat(..., axis=1) would append duplicate Col1/Col2 columns
df = df.assign(**chosen)
I have a list and a dataframe which look like this:
list = ['a', 'b']
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]})
I would like to do something like this:
df1 = df.drop([x for x in list])
I am getting the following error message:
KeyError: "['a' 'b'] not found in axis"
I know I can do the following:
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]}).set_index('A')
df1 = df.drop([x for x in list])
How can I drop the list values without having to set column 'A' as index? My dataframe has multiple columns.
Input:
A B
0 a 9
1 b 9
2 c 8
3 d 4
Code:
for i in list:
    ind = df[df['A'] == i].index.tolist()
    df = df.drop(ind)
df
Output:
A B
2 c 8
3 d 4
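The same rows can also be removed without a loop by building a boolean mask with Series.isin. A short sketch, using to_drop instead of list as the variable name so the built-in is not shadowed:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': [9, 9, 8, 4]})
to_drop = ['a', 'b']

# Keep only the rows whose 'A' value is NOT in to_drop
df1 = df[~df['A'].isin(to_drop)]
print(df1)
```

Column 'A' stays a regular column, so no set_index call is needed.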
If the goal is to drop columns rather than rows, you need to specify the correct axis, namely axis=1.
According to the docs:
axis : {0 or 'index', 1 or 'columns'}, default 0. Whether to drop labels from the index (0 or 'index') or columns (1 or 'columns').
You also need to make sure you are using the correct column names (i.e. uppercase rather than lowercase). And you should avoid shadowing Python built-in names, so don't call a variable list.
This should work:
myList = ['A', 'B']
df1 = df.drop([x for x in myList], axis=1)
I currently have a DataFrame like such:
col1 col2 col3
0 0 1 ['a', 'b', 'c']
1 2 3 ['d', 'e', 'f']
2 4 5 ['g', 'h', 'i']
What I want to do is select the rows where a certain value is contained in the lists of col3. For example, the code I initially ran is:
df.loc['a' in df['col3']]
but I get the following error:
KeyError: False
I've taken a look at this question: KeyError: False in pandas dataframe but it doesn't quite answer my question. I've tried the suggested solutions in the answers and it didn't help.
How would I go about this issue? Thanks.
Use a list comprehension to test each list:
df1 = df[['a' in x for x in df['col3']]]
print (df1)
col1 col2 col3
0 0 1 [a, b, c]
Or use Series.map:
df1 = df[df['col3'].map(lambda x: 'a' in x)]
#alternative
#df1 = df[df['col3'].apply(lambda x: 'a' in x)]
Or expand the lists into a DataFrame and test it with DataFrame.eq and DataFrame.any:
df1 = df[pd.DataFrame(df['col3'].tolist()).eq('a').any(axis=1)]
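As background for why the original expression fails (this explanation is not part of the answers): `'a' in df['col3']` tests the Series index labels, not the values, so it evaluates to a single False, and df.loc[False] then raises KeyError: False. A small sketch that shows this and builds a per-row mask instead:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [0, 2, 4],
    'col2': [1, 3, 5],
    'col3': [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
})

# `in` on a Series looks at the index labels (0, 1, 2), so this is one bool
label_check = 'a' in df['col3']
print(label_check)

# A per-row membership test yields one bool per row, usable as a mask
mask = df['col3'].apply(lambda items: 'a' in items)
df1 = df[mask]
print(df1)
```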
Use:
df = df[df['col3'].map(lambda x: 'a' in x)]
I have a Pandas DataFrame with two columns, each row contains a list of elements. I'm trying to find set difference between two columns for each row using pandas.apply method.
My df for example
A B
0 ['a','b','c'] ['a']
1 ['e', 'f', 'g'] ['f', 'g']
So it should look like this:
df.apply(set_diff_func, axis=1)
What I'm trying to achieve:
0 ['b','c']
1 ['e']
I can do it using iterrows, but I once read that it's better to use apply when possible.
How about
df.apply(lambda row: list(set(row['A']) - set(row['B'])), axis = 1)
or
(df['A'].apply(set) - df['B'].apply(set)).apply(list)
Here's the function you need. You can choose the columns through the col1 and col2 arguments by passing them to the args option of apply:
def set_diff_func(row, col1, col2):
    return list(set(row[col1]).difference(set(row[col2])))
This should return the required result (the order within each list is not guaranteed, because sets are unordered):
>>> dataset = pd.DataFrame(
...     [{'A': ['a', 'b', 'c'], 'B': ['a']},
...      {'A': ['e', 'f', 'g'], 'B': ['f', 'g']}])
>>> dataset.apply(set_diff_func, axis=1, args=('A', 'B'))
0 [c, b]
1 [e]
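If the original order of the elements matters, one possible variation is to use the set only for the membership test and keep the list order of the first column. This is a sketch; ordered_diff is a hypothetical helper name, not part of the answers above:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [['a', 'b', 'c'], ['e', 'f', 'g']],
    'B': [['a'], ['f', 'g']],
})

def ordered_diff(row, col1, col2):
    # Build the set once per row, then filter col1 in its original order
    exclude = set(row[col2])
    return [item for item in row[col1] if item not in exclude]

result = df.apply(ordered_diff, axis=1, args=('A', 'B'))
print(result.tolist())
```

Unlike a plain set difference, this keeps repeated elements of the first list if there are any.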
I know I can find duplicate columns using:
df.T.duplicated()
What I'd like to know is which column each duplicate column is a duplicate of. For example, both C and D are duplicates of A below:
df = pd.DataFrame([[1,0,1,1], [2,0,2,2]], columns=['A', 'B', 'C', 'D'])
A B C D
0 1 0 1 1
1 2 0 2 2
I'd like something like:
duplicate_index = pd.Series([None, None, 'A', 'A'], ['A', 'B', 'C', 'D'])
I don't know whether duplicated has an option that reports the first row with the same data. My idea is to use groupby and transform, such as:
arr_first = (df.T.reset_index().groupby([col for col in df.T.columns])['index']
.transform(lambda x: x.iloc[0]).values)
With your example, arr_first is then equal to array(['A', 'B', 'A', 'A'], dtype=object), and because it is in the same order as df.columns, you can get the expected output with np.where:
# pd.np is deprecated and unavailable in recent pandas versions; use NumPy directly (import numpy as np)
duplicate_index = pd.Series(np.where(arr_first != df.columns, arr_first, None), df.columns)
and the result for duplicate_index is
A None
B None
C A
D A
dtype: object
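Put together, the approach can be run end to end roughly as follows. This is a sketch rather than the answerer's exact code: it uses transform('first') in place of the lambda and np.where in place of pd.np.where, and value_cols is a name introduced here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 0, 1, 1], [2, 0, 2, 2]], columns=['A', 'B', 'C', 'D'])

# One row per original column; the old column labels land in 'index'
t = df.T.reset_index()
value_cols = [c for c in t.columns if c != 'index']

# For each group of identical rows, take the label of the first member
arr_first = t.groupby(value_cols)['index'].transform('first').to_numpy()

# Keep the label only where a column differs from its group's first column
duplicate_index = pd.Series(
    np.where(arr_first != df.columns.to_numpy(), arr_first, None),
    index=df.columns,
)
print(duplicate_index)
```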
Another way to check whether numeric columns duplicate each other is to inspect the correlation matrix, which compares all pairs of columns. Here is the code:
import pandas as pd
df = pd.DataFrame([[1,0,1,1], [2,0,2,2]], columns=['A', 'B', 'C', 'D'])
# compute the correlation matrix
cm = df.corr()
cm
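This shows the correlation of every column with every other column (including itself). If two columns are identical and not constant, their correlation is 1.0. The converse does not hold: a correlation of 1.0 only means a perfect positive linear relationship, so a scaled or shifted copy also scores 1.0, and a constant column such as B yields NaN.
To find the candidate duplicates of A: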
cm['A']
A 1.0
B NaN
C 1.0
D 1.0
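Because a correlation of 1.0 does not prove equality, an exact comparison can be used to confirm the candidates. A minimal sketch, in which column E is a hypothetical scaled copy added only to show the difference:

```python
import pandas as pd

df = pd.DataFrame([[1, 0, 1, 1], [2, 0, 2, 2]], columns=['A', 'B', 'C', 'D'])
df['E'] = df['A'] * 2  # hypothetical column: perfectly correlated, not equal

# Correlation flags C, D and E alike
corr_with_a = df.corr()['A']

# Series.equals compares the actual values, so only true copies remain
exact_dupes = [col for col in df.columns
               if col != 'A' and df[col].equals(df['A'])]
print(exact_dupes)
```

Here E correlates perfectly with A but is not reported as an exact duplicate.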
If your columns are categorical (string objects) rather than numeric, you could build a cross-correlation table instead.
Hope this helps!