Python-Pandas: Join two columns by adding a character

There are three columns; col2 and col3 need to be joined with the character "/" between them, and after joining the combined column should be named col2. Please help!
col1 col2 col3
B 0.0.0.0 0 0
B 2.145.26.0 24
B 2.145.27.0 24
B 10.0.0.0 8 20
Expected output:
col1 col2
B 0.0.0.0 0/0
B 2.145.26.0/24
B 2.145.27.0/24
B 10.0.0.0 8/20

df.col2 = df.col2.astype(str).str.cat(df.col3.astype(str), sep='/')
See https://pandas.pydata.org/pandas-docs/version/0.23/api.html#string-handling for string operations.
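A minimal, self-contained version of this answer, with the frame reconstructed from the question's sample (dropping col3 at the end is implied by the expected output, not part of the original one-liner):

import pandas as pd

df = pd.DataFrame({'col1': ['B', 'B', 'B', 'B'],
                   'col2': ['0.0.0.0 0', '2.145.26.0', '2.145.27.0', '10.0.0.0 8'],
                   'col3': [0, 24, 24, 20]})

# str.cat requires string operands on both sides, hence astype(str) on col3 as well
df['col2'] = df['col2'].astype(str).str.cat(df['col3'].astype(str), sep='/')
df = df.drop(columns='col3')
print(df)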

IIUC
df['col2']+='/'+df.col3.astype(str)
df
Out[74]:
col1 col2 col3
0 B 0.0.0.0 0/0 0
1 B 2.145.26.0/24 24
2 B 2.145.27.0/24 24
3 B 10.0.0.0 8/20 20
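Note this mutates col2 in place but leaves col3 untouched; the question also wants col3 gone, so follow up with df = df.drop(columns='col3').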

Are you looking for something like this? Not sure it can be labelled as the Pythonic way, though:
import pandas as pd

temp_list = [{'col1': 'B', 'col2': '0.0.0.0 0', 'col3': '0'},
             {'col1': 'B', 'col2': '2.145.26.0', 'col3': '24'},
             {'col1': 'B', 'col2': '2.145.27.0', 'col3': '24'},
             {'col1': 'B', 'col2': '10.0.0.0 8', 'col3': '20'}]
df = pd.DataFrame(data=temp_list)
df['col4'] = df['col2'] + "/" + df['col3']
df.drop(columns=['col2', 'col3'], inplace=True)
df = df.rename(columns={'col4': 'col2'})  # rename returns a copy; assign it back
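A slightly tighter variant of the same idea, using pop so the intermediate col4 and the drop/rename steps are not needed (a sketch, not from the original answer):

# pop removes each source column as it is read; col3 is already a string
# in this frame, otherwise add .astype(str)
df['col2'] = df.pop('col2') + '/' + df.pop('col3')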

Related

Concatenate columns in alphabetical (or numerical) order

Suppose a df:
col1 col2
1 A B
2 A C
3 G A
I want to get:
col1 col2 col3
1 A B AB
2 A C AC
3 G A AG
Is there any short function to achieve that, or do I need to write my own and apply it?
Use a list comprehension, sorting each row with numpy.sort or sorted, then join:
import numpy as np

df['col3'] = [''.join(x) for x in np.sort(df)]
# alternative - iterate over the row values, not the frame itself
# (iterating a DataFrame directly yields column names)
# df['col3'] = [''.join(sorted(x)) for x in df.values]
print (df)
col1 col2 col3
1 A B AB
2 A C AC
3 G A AG
A lambda function with apply is an obvious solution, but slower:
df['col3'] = df.apply(lambda x: ''.join(sorted(x)), axis=1)
You can sort values with numpy.sort and concatenate with str.join:
df['col3'] = list(map(''.join, np.sort(df)))
Output:
col1 col2 col3
1 A B AB
2 A C AC
3 G A AG
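Both answers sort every column of the frame, which is fine for the two-column example; if the frame carries extra columns, subset first (an assumption beyond the question as asked):

df['col3'] = [''.join(x) for x in np.sort(df[['col1', 'col2']])]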

Find name of column which is non nan

I have a DataFrame defined like:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"col1": [1, np.nan, np.nan, np.nan, 2, np.nan, np.nan, np.nan, np.nan],
                    "col2": [np.nan, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, 6],
                    "col3": [np.nan, np.nan, 7, np.nan, np.nan, 8, 9, np.nan, np.nan]})
I want to transform it into a DataFrame like:
df2 = pd.DataFrame({"col_name": ['col1', 'col2', 'col3', 'col2', 'col1',
                                 'col3', 'col3', 'col2', 'col2'],
                    "value": [1, 3, 7, 4, 2, 8, 9, 5, 6]})
If possible, can we reverse this process too? By that I mean convert df2 into df1.
I don't want to go through the DataFrame iteratively as it becomes too computationally expensive.
You can stack it:
out = (df1.stack().astype(int).droplevel(0)
          .rename_axis('col_name').reset_index(name='value'))
Output:
col_name value
0 col1 1
1 col2 3
2 col3 7
3 col2 4
4 col1 2
5 col3 8
6 col3 9
7 col2 5
8 col2 6
To go from out back to df1, you can pivot (recent pandas versions require keyword arguments here):
out1 = out.reset_index().pivot(index='index', columns='col_name', values='value')
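The pivoted frame still carries 'index' and 'col_name' as axis names; stripping them gets you back to a frame that compares equal to df1 (a sketch assuming out1 from above):

out1 = out1.rename_axis(index=None, columns=None)  # drop the leftover axis labels
print(out1.equals(df1))  # should print True: values, NaN positions and order all match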

How to create a DataFrame aggregating (grouping?) a DataFrame containing only strings

I would like to create a dataframe "aggregating" a larger data set.
Starting:
df:
col1 col2
1 A B
2 A C
3 A B
and getting:
df_aggregated:
col1 col2
1 A B
2 A C
without using any calculation (count()).
I would write:
df_aggregated = df.groupby('col1')
but then I do not get anything from:
print(df_aggregated)
"error"
any help appreciated
You can accomplish this by simply dropping the duplicate entries with the df.drop_duplicates function (the default keep='first' retains one row per combination; keep=False would discard every duplicated row outright):
df_aggregated = df.drop_duplicates(subset=['col1', 'col2'])
print(df_aggregated)
col1 col2
1 A B
2 A C
You can use groupby with an aggregation function:
In [849]: df.groupby('col2', as_index=False).max()
Out[849]:
col2 col1
0 B A
1 C A
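If you would rather stay with groupby than drop_duplicates, the group keys themselves are the unique combinations; size() is just a cheap way to materialize them (a sketch, column names assumed from the question):

# size() with as_index=False returns the keys plus a 'size' column, which we discard
df_aggregated = df.groupby(['col1', 'col2'], as_index=False).size().drop(columns='size')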

Change value of column on condition

I have an example DataFrame df:
Col1 Col2
a "some string AXA some string "
b "some string2"
I would like to:
if df.Col2 contains "AXA", change the value to 1; otherwise change it to 0.
So I get:
Col1 Col2
a 1
b 0
I've tried something like:
if "AXA" in df['Col2']:
    df['Col2'] = 1
or if I can do something like
df.loc[df['Col2'] contains "AXA"] = 1
Thank you for your help!
You can use str.contains to build a boolean mask and then cast it to int:
print (df.Col2.str.contains('AXA'))
0 True
1 False
Name: Col2, dtype: bool
df['Col2'] = df.Col2.str.contains('AXA').astype(int)
print (df)
Col1 Col2
0 a 1
1 b 0
EDIT: If you need to build the output from two conditions, the fastest approach is a nested numpy.where:
print (df)
Col1 Col2
0 a some string AXA some string
1 a some string AXE some string
2 b some string2
df['Col2'] = np.where(df.Col2.str.contains('AXA'), 1,
                      np.where(df.Col2.str.contains('AXE'), 2, 0))
print (df)
Col1 Col2
0 a 1
1 a 2
2 b 0
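For more than two patterns, numpy.select keeps the condition/choice pairs readable; this sketch is equivalent to the nested where above:

# conditions are checked in order; the first match wins, default covers the rest
conditions = [df.Col2.str.contains('AXA'),
              df.Col2.str.contains('AXE')]
df['Col2'] = np.select(conditions, [1, 2], default=0)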

Return groupby columns as new dataframe in Python Pandas

Input: CSV with 5 columns.
Expected Output: Unique combinations of 'col1', 'col2', 'col3'.
Sample Input:
col1 col2 col3 col4 col5
0 A B C 11 30
1 A B C 52 10
2 B C A 15 14
3 B C A 1 91
Sample Expected Output:
col1 col2 col3
A B C
B C A
Just expecting this as output. I don't need col4 and col5 in the output, and I don't need any sum, count, mean, etc. I tried using pandas to achieve this but had no luck.
My code:
input_df = pd.read_csv("input.csv")
output_df = input_df.groupby(['col1', 'col2', 'col3'])
This code is returning 'pandas.core.groupby.DataFrameGroupBy object at 0x0000000009134278'.
But I need dataframe like above. Any help much appreciated.
df[['col1', 'col2', 'col3']].drop_duplicates()
First you can use .drop() to delete col4 and col5 as you said you don't need them.
df = df.drop(['col4', 'col5'], axis=1)
Then, you can use .drop_duplicates() to delete the duplicate rows in col1, col2 and col3.
df = df.drop_duplicates(['col1', 'col2', 'col3'])
df
The output:
col1 col2 col3
0 A B C
2 B C A
Notice that in the output the index is 0, 2 instead of 0, 1. To fix that, you can reassign the index:
df.index = range(len(df))
df
The output:
col1 col2 col3
0 A B C
1 B C A
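The same reindexing is more often written with reset_index, which reads as intent rather than mechanics:

df = df.reset_index(drop=True)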
