Concatenate columns in alphabetical (or numerical order) - python

Suppose a df:
col1 col2
1 A B
2 A C
3 G A
I want to get:
col1 col2 col3
1 A B AB
2 A C AC
3 G A AG
Is there a short built-in function to achieve that, or do I need to write my own and apply it?

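Use a list comprehension that sorts each row with numpy.sort or sorted and then joins the values:
df['col3'] = [''.join(x) for x in np.sort(df)]
#alternative (iterate over the row values; iterating over df itself yields column names)
#df['col3'] = [''.join(sorted(x)) for x in df.to_numpy()]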
print (df)
col1 col2 col3
1 A B AB
2 A C AC
3 G A AG
A solution with apply and a lambda function also works, but it is slower:
df['col3'] = df.apply(lambda x: ''.join(sorted(x)), axis=1)
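For reference, here is a minimal self-contained sketch of both approaches on the sample data. The DataFrame is rebuilt from the question, and the names `cols`, `sorted_rows` and `via_apply` are illustrative only. Restricting the operation to `cols` is an assumption that keeps the code re-runnable once col3 exists.

```python
import numpy as np
import pandas as pd

# Sample data from the question (index 1..3 as shown there).
df = pd.DataFrame({"col1": ["A", "A", "G"], "col2": ["B", "C", "A"]}, index=[1, 2, 3])

# Only these columns take part in the concatenation.
cols = ["col1", "col2"]

# Sort the values within each row, then join each sorted row into one string.
sorted_rows = np.sort(df[cols].to_numpy(dtype=object), axis=1)
df["col3"] = ["".join(row) for row in sorted_rows]

# Row-wise apply gives the same values, but calls a Python function per row.
via_apply = df[cols].apply(lambda row: "".join(sorted(row)), axis=1)

print(df)
```

Note that the source columns are left unchanged; only the new column holds the sorted concatenation.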

You can sort values with numpy.sort and concatenate with str.join:
df['col3'] = list(map(''.join, np.sort(df)))
Output:
col1 col2 col3
1 A B AB
2 A C AC
3 G A AG

Related

Create new column by using a list comprehension with two 'for' loops in Pandas DataFrame

I have the following dataframe
df=pd.DataFrame({'col1': ['aaaa', 'aabb', 'bbcc', 'ccdd'],
'col2': ['ab12', 'cd15', 'kf25', 'zx78']})
df
col1 col2
0 aaaa ab12
1 aabb cd15
2 bbcc kf25
3 ccdd zx78
I want to create 'col3' based on 'col1' and 'col2', I want to get:
df
col1 col2 col3
0 aaaa ab12 aa-12
1 aabb cd15 aa-15
2 bbcc kf25 bb-25
3 ccdd zx78 cc-78
I tried to use list comprehension but I got the error: ValueError: Length of values (16) does not match length of index (4)
The code I used is :
df['col3']=[x[0:2]+'-'+y[2:4] for x in df['col1'] for y in df['col2']]
Use simple slicing with the str accessor, and concatenation:
df['col3'] = df['col1'].str[:2] + '-' + df['col2'].str[2:4]
Or, if you want the last two characters of col2:
df['col3'] = df['col1'].str[:2] + '-' + df['col2'].str[-2:]
Output:
col1 col2 col3
0 aaaa ab12 aa-12
1 aabb cd15 aa-15
2 bbcc kf25 bb-25
3 ccdd zx78 cc-78
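Why your approach did not work:
The two nested for loops pair every value of col1 with every value of col2, which produces 4 x 4 = 16 values instead of 4. To walk through both columns in parallel you need zip: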
df['col3'] = [x[0:2]+'-'+y[2:4] for x,y in zip(df['col1'], df['col2'])]
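The difference between the nested loops and zip can be checked directly. This is a small sketch on the question's data; the variable names `nested`, `paired` and `vectorized` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["aaaa", "aabb", "bbcc", "ccdd"],
                   "col2": ["ab12", "cd15", "kf25", "zx78"]})

# Two nested loops build every (x, y) combination: 4 * 4 = 16 values.
nested = [x[:2] + "-" + y[2:4] for x in df["col1"] for y in df["col2"]]

# zip pairs the columns element by element: 4 values, one per row.
paired = [x[:2] + "-" + y[2:4] for x, y in zip(df["col1"], df["col2"])]

# The vectorized str accessor gives the same result as the zip version.
vectorized = df["col1"].str[:2] + "-" + df["col2"].str[2:4]

print(len(nested), len(paired))
print(vectorized.tolist())
```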

How to filter dataframe by columns values combinations?

I have a dataframe:
col1 col2 col3
a b b
a b c
k l o
b l b
I want to keep only rows where col1 is "a", col2 is "b" and col3 is "b" or col1 is "k", col2 is "l" and col3 is "o". So desired result is:
col1 col2 col3
a b b
k l o
How can I do that? I can write dt[(dt["col1"]=="a")&(dt["col2"]=="b")&(dt["col3"]=="b")] for the first case, but what about the second case? Should I combine the two with "or"?
abb = (df["col1"]=="a") & (df["col2"]=="b") & (df["col3"]=="b")
klo = (df["col1"]=="k") & (df["col2"]=="l") & (df["col3"]=="o")
df[(abb) | (klo)]
col1 col2 col3
0 a b b
2 k l o
Alternatively, you could write something like this, just to avoid all those conditionals:
abb = 'abb'
klo = 'klo'
strings = [abb, klo]
def f(x):
    if ''.join(x) in strings:
        return True
    return False
df[df.apply(lambda x: f(x), axis=1)]
col1 col2 col3
0 a b b
2 k l o
So, here we are applying a custom function to each row with df.apply. Inside the function we turn the row into a single string with str.join and we check if this string exists in our predefined list of strings. Finally, we use the resulting pd.Series with booleans to select from our df.
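One caveat with joining a row into a single string is that it can be ambiguous when values have more than one character (for example "ab" + "c" and "a" + "bc" both give "abc"). A possible variation, sketched here under the assumption that the wanted combinations are known up front, compares whole rows as tuples instead. The name `wanted` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "a", "k", "b"],
                   "col2": ["b", "b", "l", "l"],
                   "col3": ["b", "c", "o", "b"]})

# Combinations to keep, one tuple per allowed (col1, col2, col3) row.
wanted = {("a", "b", "b"), ("k", "l", "o")}

# itertuples(index=False, name=None) yields each row as a plain tuple.
rows = df[["col1", "col2", "col3"]].itertuples(index=False, name=None)
mask = [row in wanted for row in rows]

result = df[mask]
print(result)
```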

Add a new column with matching values in a list in pandas

I have a dataframe such as :
the_list =['LjHH','Lhy_kd','Ljk']
COL1 COL2
A ADJJDUD878_Lhy_kd
B Y0_0099JJ_Ljk
C YTUUDBBDHHD
D POL0990E_LjHH'
And I would like to add a new COL3 column where if within COL2 I have a match with a value in the_list, I add in that column the matching element of the_list.
Expected result:
COL1 COL2 COL3
A ADJJDUD878_Lhy_kd Lhy_kd
B Y0_0099JJ_2_Ljk Ljk
C YTUUDBBDHHD NA
D POL0990E_LjHH' LjHH
To get only the first matched value, use Series.str.extract with the list values joined by | (the regex "or"):
the_list =['LjHH','Lhy_kd','Ljk']
df['COL3'] = df['COL2'].str.extract(f'({"|".join(the_list)})', expand=False)
print (df)
COL1 COL2 COL3
0 A ADJJDUD878_Lhy_kd Lhy_kd
1 B Y0_0099JJ_Ljk Ljk
2 C YTUUDBBDHHD NaN
3 D POL0990E_LjHH' LjHH
To get all matched values (if a row can contain several), use Series.str.findall with Series.str.join, and finally replace empty strings with NaN:
the_list =['LjHH','Lhy_kd','Ljk']
df['COL3']=df['COL2'].str.findall(f'{"|".join(the_list)}').str.join(',').replace('',np.nan)
print (df)
COL1 COL2 COL3
0 A ADJJDUD878_Lhy_kd Lhy_kd
1 B Y0_0099JJ_Ljk Ljk
2 C YTUUDBBDHHD NaN
3 D POL0990E_LjHH' LjHH
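If the list entries could contain regex metacharacters (such as "." or "+"), building the pattern with re.escape is a safer choice. The following is a sketch on the sample data with the stray trailing quote omitted; the column names `first_match` and `all_matches` are illustrative, not from the question.

```python
import re

import numpy as np
import pandas as pd

the_list = ["LjHH", "Lhy_kd", "Ljk"]
df = pd.DataFrame({
    "COL1": ["A", "B", "C", "D"],
    "COL2": ["ADJJDUD878_Lhy_kd", "Y0_0099JJ_Ljk", "YTUUDBBDHHD", "POL0990E_LjHH"],
})

# Escape each entry so it is matched literally, then join with the regex "or".
pattern = "|".join(re.escape(s) for s in the_list)

# First match per row (NaN when nothing matches).
df["first_match"] = df["COL2"].str.extract(f"({pattern})", expand=False)

# All matches per row, comma-separated; rows without a match become NaN.
df["all_matches"] = (
    df["COL2"].str.findall(pattern).str.join(",").replace("", np.nan)
)

print(df)
```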

How to create a dataframe aggregating (grouping?) a dataframe containing only strings

I would like to create a dataframe "aggregating" a larger data set.
Starting:
df:
col1 col2
1 A B
2 A C
3 A B
and getting:
df_aggregated:
col1 col2
1 A B
2 A C
without using any calculation (such as count()).
I would write:
df_aggregated = df.groupby('col1')
but I do not get a dataframe back:
print ( df_aggregated )
"error"
Any help is appreciated.
You can accomplish this by simply dropping the duplicate rows with DataFrame.drop_duplicates. Keep the default keep='first'; keep=False would drop every copy of a duplicated row, including the first one:
df_aggregated = df.drop_duplicates(subset=['col1', 'col2'])
print(df_aggregated)
col1 col2
1 A B
2 A C
Alternatively, you can use groupby with an aggregation function:
In [849]: df.groupby('col2', as_index=False).max()
Out[849]:
col2 col1
0 B A
1 C A
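For the drop_duplicates approach, the `keep` argument decides which of the repeated rows survive. Here is a minimal sketch on the sample data; the variable names `first_kept` and `none_kept` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "A", "A"], "col2": ["B", "C", "B"]}, index=[1, 2, 3])

# Default keep='first': one copy of each distinct row is kept.
first_kept = df.drop_duplicates(subset=["col1", "col2"])

# keep=False: every row that has a duplicate is removed entirely.
none_kept = df.drop_duplicates(subset=["col1", "col2"], keep=False)

print(first_kept)
print(none_kept)
```

With the default, rows 1 and 2 remain, which matches the desired output; with keep=False only row 2 (A, C) is left.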

Return groupby columns as new dataframe in Python Pandas

Input: CSV with 5 columns.
Expected Output: Unique combinations of 'col1', 'col2', 'col3'.
Sample Input:
col1 col2 col3 col4 col5
0 A B C 11 30
1 A B C 52 10
2 B C A 15 14
3 B C A 1 91
Sample Expected Output:
col1 col2 col3
A B C
B C A
Just expecting this as output. I don't need col4 and col5 in output. And also don't need any sum, count, mean etc. Tried using pandas to achieve this but no luck.
My code:
input_df = pd.read_csv("input.csv")
output_df = input_df.groupby(['col1', 'col2', 'col3'])
This code is returning 'pandas.core.groupby.DataFrameGroupBy object at 0x0000000009134278'.
But I need dataframe like above. Any help much appreciated.
df[['col1', 'col2', 'col3']].drop_duplicates()
First you can use .drop() to delete col4 and col5 as you said you don't need them.
df = df.drop(['col4', 'col5'], axis=1)
Then, you can use .drop_duplicates() to delete the duplicate rows in col1, col2 and col3.
df = df.drop_duplicates(['col1', 'col2', 'col3'])
df
The output:
col1 col2 col3
0 A B C
2 B C A
Notice that the index in the output is 0, 2 instead of 0, 1. To fix that, you can reassign the index:
df.index = range(len(df))
df
The output:
col1 col2 col3
0 A B C
1 B C A
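The whole flow can also be written as one chain. This sketch reads the sample rows from an in-memory string instead of input.csv (an assumption made so the example runs on its own) and uses reset_index(drop=True) as an alternative way to renumber the index.

```python
import io

import pandas as pd

# In-memory stand-in for input.csv, using the sample rows from the question.
csv_text = """col1,col2,col3,col4,col5
A,B,C,11,30
A,B,C,52,10
B,C,A,15,14
B,C,A,1,91
"""

input_df = pd.read_csv(io.StringIO(csv_text))

# Select the three columns, drop repeated rows, and renumber the index from 0.
output_df = (
    input_df[["col1", "col2", "col3"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

print(output_df)
```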
