Change value of column on condition - python

I have an example DataFrame df:
Col1 Col2
a    "some string AXA some string"
b    "some string2"
If df.Col2 contains "AXA", I would like to change the value to 1; if not, change it to 0. So I get:
Col1 Col2
a    1
b    0
I've tried something like:
if "AXA" in df['Col2']:
    df['Col2'] = 1
or something like:
df.loc[df['Col2'] contains "AXA"] = 1
Thank you for your help!

You can use str.contains to get a boolean mask and then cast it to int:
print (df.Col2.str.contains('AXA'))
0 True
1 False
Name: Col2, dtype: bool
df['Col2'] = df.Col2.str.contains('AXA').astype(int)
print (df)
Col1 Col2
0 a 1
1 b 0
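One caveat worth hedging: if Col2 can contain missing values, str.contains returns NaN for those rows and the astype(int) cast fails. Passing na=False treats missing as a non-match (a sketch with hypothetical data, not from the original question):

```python
import pandas as pd

# Hypothetical frame with a missing value in Col2
df = pd.DataFrame({'Col1': ['a', 'b', 'c'],
                   'Col2': ['some string AXA', 'some string2', None]})

# Without na=False, str.contains yields NaN for the missing row and
# .astype(int) raises; na=False counts missing as "no match" (0).
df['Col2'] = df['Col2'].str.contains('AXA', na=False).astype(int)
print(df)
```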
EDIT: If you need to build the output from two conditions, the fastest approach is a nested numpy.where:
print (df)
Col1 Col2
0 a some string AXA some string
1 a some string AXE some string
2 b some string2
df['Col2'] = np.where(df.Col2.str.contains('AXA'), 1,
                      np.where(df.Col2.str.contains('AXE'), 2, 0))
print (df)
Col1 Col2
0 a 1
1 a 2
2 b 0
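If still more conditions were needed, numpy.select is usually cleaner than deeper nesting. This sketch (my addition, not part of the original answer) is equivalent to the nested np.where on the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['a', 'a', 'b'],
                   'Col2': ['some string AXA some string',
                            'some string AXE some string',
                            'some string2']})

# Conditions are checked in order; the first match wins,
# and default covers rows that match nothing.
conditions = [df.Col2.str.contains('AXA'), df.Col2.str.contains('AXE')]
choices = [1, 2]
df['Col2'] = np.select(conditions, choices, default=0)
print(df)
```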

Related

Add a new column with matching values in a list in pandas

I have a list and a dataframe such as:
the_list = ['LjHH', 'Lhy_kd', 'Ljk']
COL1 COL2
A    ADJJDUD878_Lhy_kd
B    Y0_0099JJ_Ljk
C    YTUUDBBDHHD
D    POL0990E_LjHH'
I would like to add a new column COL3 where, if COL2 contains a match with a value in the_list, I put the matching element of the_list in that column.
Expected result:
COL1 COL2              COL3
A    ADJJDUD878_Lhy_kd Lhy_kd
B    Y0_0099JJ_Ljk     Ljk
C    YTUUDBBDHHD       NA
D    POL0990E_LjHH'    LjHH
To get only the first matched value, use Series.str.extract with the list's values joined by | into a regex alternation:
the_list =['LjHH','Lhy_kd','Ljk']
df['COL3'] = df['COL2'].str.extract(f'({"|".join(the_list)})', expand=False)
print (df)
COL1 COL2 COL3
0 A ADJJDUD878_Lhy_kd Lhy_kd
1 B Y0_0099JJ_Ljk Ljk
2 C YTUUDBBDHHD NaN
3 D POL0990E_LjHH' LjHH
To get all matched values (when multiple matches are possible), use Series.str.findall with Series.str.join, and finally replace empty strings with NaN:
the_list = ['LjHH', 'Lhy_kd', 'Ljk']
df['COL3'] = df['COL2'].str.findall('|'.join(the_list)).str.join(',').replace('', np.nan)
print (df)
COL1 COL2 COL3
0 A ADJJDUD878_Lhy_kd Lhy_kd
1 B Y0_0099JJ_Ljk Ljk
2 C YTUUDBBDHHD NaN
3 D POL0990E_LjHH' LjHH
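One assumption hidden in both approaches: the list values contain no regex metacharacters. If they might (for example a literal . or +), escape them with re.escape before joining (a sketch, my addition):

```python
import re
import pandas as pd

df = pd.DataFrame({'COL1': list('ABCD'),
                   'COL2': ['ADJJDUD878_Lhy_kd', 'Y0_0099JJ_Ljk',
                            'YTUUDBBDHHD', "POL0990E_LjHH'"]})

the_list = ['LjHH', 'Lhy_kd', 'Ljk']
# re.escape makes each value match literally inside the alternation
pattern = '|'.join(re.escape(v) for v in the_list)
df['COL3'] = df['COL2'].str.extract(f'({pattern})', expand=False)
print(df)
```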

Concat string Series when some values are empty strings

How can I vectorize the concatenation of strings in two columns when there are empty strings present? Here is the problem:
My columns in DF:
col1 = pd.Series(['text1','','text3'])
col2 = pd.Series(['text1','text2','text3'])
When I do:
new_col = col1.str.cat(col2, sep='/')
it gives:
new_col = pd.Series(['text1/text1','','text3/text3'])
but it should give:
new_col = pd.Series(['text1/text1','/text2','text3/text3'])
How can I do this?
Pandas version 0.24.2
If there are missing values (NaN) instead of empty strings, the na_rep parameter of Series.str.cat is necessary:
col1 = pd.Series(['text1',np.nan,'text3'])
col2 = pd.Series(['text1','text2','text3'])
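A minimal sketch of that case, showing na_rep substituting the missing value:

```python
import numpy as np
import pandas as pd

col1 = pd.Series(['text1', np.nan, 'text3'])
col2 = pd.Series(['text1', 'text2', 'text3'])

# Without na_rep, any row containing NaN yields NaN;
# na_rep='' substitutes an empty string before concatenation.
new_col = col1.str.cat(col2, sep='/', na_rep='')
print(new_col)
```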
because with empty strings it works fine:
col1 = pd.Series(['text1','','text3'])
col2 = pd.Series(['text1','text2','text3'])
new_col = col1.str.cat(col2, sep='/')
print (new_col)
0 text1/text1
1 /text2
2 text3/text3
dtype: object
An alternative is also possible:
new_col = col1 + '/' + col2
print (new_col)
0 text1/text1
1 /text2
2 text3/text3
dtype: object
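Note that the + alternative propagates missing values (NaN + '/' is NaN), so with NaN data you would need fillna first. A sketch assuming NaN in place of the empty string:

```python
import numpy as np
import pandas as pd

col1 = pd.Series(['text1', np.nan, 'text3'])
col2 = pd.Series(['text1', 'text2', 'text3'])

# Fill missing values with '' before concatenating,
# otherwise row 1 would come out as NaN.
new_col = col1.fillna('') + '/' + col2.fillna('')
print(new_col)
```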

Pandas Combining two rows into one [duplicate]

Given the following dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'A', 'A']})
df
COL1 COL2
0 A NaN
1 NaN A
2 A A
I would like to create a column ('COL3') that uses the value from COL1 per row unless that value is null (or NaN). If the value is null (or NaN), I'd like for it to use the value from COL2.
The desired result is:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
Thanks in advance!
In [8]: df
Out[8]:
COL1 COL2
0 A NaN
1 NaN B
2 A B
In [9]: df["COL3"] = df["COL1"].fillna(df["COL2"])
In [10]: df
Out[10]:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
You can use np.where to conditionally set column values.
df = df.assign(COL3=np.where(df.COL1.isnull(), df.COL2, df.COL1))
>>> df
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
If you don't mind mutating the values in COL2, you can update them directly to get your desired result.
df = pd.DataFrame({'COL1': ['A', np.nan,'A'],
'COL2' : [np.nan,'B','B']})
>>> df
COL1 COL2
0 A NaN
1 NaN B
2 A B
df.COL2.update(df.COL1)
>>> df
COL1 COL2
0 A A
1 NaN B
2 A A
Using .combine_first, which gives precedence to non-null values in the Series or DataFrame calling it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'COL1': ['A', np.nan,'A'],
'COL2' : [np.nan,'B','B']})
df['COL3'] = df.COL1.combine_first(df.COL2)
Output:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
If we modify your df slightly, you will see that this works, and in fact it will work for any number of columns so long as each row has at least one valid value:
In [5]:
df = pd.DataFrame({'COL1': ['B', np.nan,'B'],
'COL2' : [np.nan,'A','A']})
df
Out[5]:
COL1 COL2
0 B NaN
1 NaN A
2 B A
In [6]:
df.apply(lambda x: x[x.first_valid_index()], axis=1)
Out[6]:
0 B
1 A
2 B
dtype: object
first_valid_index will return the index value (in this case column) that contains the first non-NaN value:
In [7]:
df.apply(lambda x: x.first_valid_index(), axis=1)
Out[7]:
0 COL1
1 COL2
2 COL1
dtype: object
So we can use this to index into the Series.
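For what it's worth, the apply can also be vectorized (a sketch, my addition rather than part of the original answer): back-filling across columns pulls each row's first valid value into the first column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'COL1': ['B', np.nan, 'B'],
                   'COL2': [np.nan, 'A', 'A']})

# bfill(axis=1) fills each NaN from the next column to the right,
# so column 0 ends up holding the first non-NaN value of each row.
first_valid = df.bfill(axis=1).iloc[:, 0]
print(first_valid)
```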
You can also use mask, which replaces values where COL1 is NaN with the corresponding values from COL2:
In [8]: df.assign(COL3=df['COL1'].mask(df['COL1'].isna(), df['COL2']))
Out[8]:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A

how to create a dataframe aggregating (grouping?) a dataframe containing only strings

I would like to create a dataframe "aggregating" a larger data set.
Starting:
df:
col1 col2
1 A B
2 A C
3 A B
and getting:
df_aggregated:
col1 col2
1 A B
2 A C
without using any calculation (count())
I would write:
df_aggregated = df.groupby('col1')
but I do not get anything:
print(df_aggregated)
"error"
Any help appreciated.
You can accomplish this by simply dropping the duplicate entries using df.drop_duplicates (the default keep='first' retains one row per duplicate group; keep=False would drop both 'A B' rows):
df_aggregated = df.drop_duplicates(subset=['col1', 'col2'])
print(df_aggregated)
col1 col2
1 A B
2 A C
You can use groupby with an aggregation function:
In [849]: df.groupby('col2', as_index=False).max()
Out[849]:
col2 col1
0 B A
1 C A
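As for why the original df.groupby('col1') seemed to print nothing useful: groupby is lazy and returns a GroupBy object, which only yields data once an aggregation is applied. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'A'],
                   'col2': ['B', 'C', 'B']})

g = df.groupby('col1')
print(g)  # prints the GroupBy object's repr, not a table

# An actual aggregation: the unique col2 values per group
print(g['col2'].unique())
```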

Python-Pandas Join two columns by adding a character

There are three columns; col2 and col3 need to be joined with the character "/" between them, and after joining the column should be named col2. Please help!
col1 col2       col3
B    0.0.0.0 0  0
B    2.145.26.0 24
B    2.145.27.0 24
B    10.0.0.0 8 20
Expected output:
col1 col2
B    0.0.0.0 0/0
B    2.145.26.0/24
B    2.145.27.0/24
B    10.0.0.0 8/20
df.col2 = df.col2.astype(str).str.cat(df.col3.astype(str), sep='/')
See https://pandas.pydata.org/pandas-docs/version/0.23/api.html#string-handling for string operations.
IIUC
df['col2']+='/'+df.col3.astype(str)
df
Out[74]:
col1 col2 col3
0 B 0.0.0.00/0 0
1 B 2.145.26.0/24 24
2 B 2.145.27.0/24 24
3 B 10.0.0.08/20 20
Are you looking for something like this? Not sure if it can be labelled as the Pythonic way:
temp_list = [{'col1': 'B', 'col2': '0.0.0.0 0', 'col3': '0'},
             {'col1': 'B', 'col2': '2.145.26.0', 'col3': '24'},
             {'col1': 'B', 'col2': '2.145.27.0', 'col3': '24'},
             {'col1': 'B', 'col2': '10.0.0.0 8', 'col3': '10'}]
df = pd.DataFrame(data=temp_list)
df['col4'] = df['col2'] + "/" + df['col3']
df.drop(columns=['col2', 'col3'], inplace=True)
df = df.rename(columns={'col4': 'col2'})
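A slightly more compact variant of the same idea (a sketch on the same sample data, my addition): pop removes col3 from the frame while returning it, so the join, drop, and rename collapse into one step.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['B', 'B', 'B', 'B'],
                   'col2': ['0.0.0.0 0', '2.145.26.0', '2.145.27.0', '10.0.0.0 8'],
                   'col3': ['0', '24', '24', '20']})

# pop drops col3 from df and returns it, so no separate drop/rename is needed
df['col2'] = df['col2'].astype(str) + '/' + df.pop('col3').astype(str)
print(df)
```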
