Python Pandas: Apply method to column if condition met in another column

I have a pandas dataframe that looks something like the following:
>>> df = pd.DataFrame([["B","X"],["C","Y"],["D","X"]])
>>> df.columns = ["A","B"]
>>> df
   A  B
0  B  X
1  C  Y
2  D  X
How can I apply a method to change the values of column A only if the value in column B is "X"? The desired outcome for example might be:
>>> df
    A  B
0  Bx  X
1   C  Y
2  Dx  X
I thought of combining the two columns (df['C'] = df['A'] + df['B']), but there is probably a better way to perform such a simple operation.

One approach is to use loc:
df.loc[df.B == 'X', 'A'] += 'x'
    A  B
0  Bx  X
1   C  Y
2  Dx  X
EDIT: Based on the question in the comment, is this what you are looking for?
df.loc[df.B == 'X', 'A'] = df.A.str.lower()+'x'
    A  B
0  bx  X
1   C  Y
2  dx  X
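For completeness, here is a minimal self-contained sketch of the same idea using numpy.where; this is an assumed alternative for illustration, not part of the original answer:
import numpy as np
import pandas as pd

df = pd.DataFrame([["B", "X"], ["C", "Y"], ["D", "X"]], columns=["A", "B"])
# Where B == "X", append "x" to A; elsewhere keep A unchanged.
df["A"] = np.where(df["B"] == "X", df["A"] + "x", df["A"])
print(df)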

Related

Renaming columns on slice of dataframe not performing as expected

I was trying to clean up the column names in a dataframe, but only for part of the columns. Somehow it doesn't work when trying to replace the column names on a slice of the dataframe. Why is that?
Let's say we have the following dataframe:
Note: at the bottom there is copy-pasteable code to reproduce the data.
   Value ColAfjkj ColBhuqwa ColCouiqw
0      1        a         e         i
1      2        b         f         j
2      3        c         g         k
3      4        d         h         l
I want to clean up the column names (expected output):
   Value ColA ColB ColC
0      1    a    e    i
1      2    b    f    j
2      3    c    g    k
3      4    d    h    l
Approach 1:
I can get the clean column names like this:
df.iloc[:, 1:].columns.str[:4]
Index(['ColA', 'ColB', 'ColC'], dtype='object')
Or
Approach 2:
s = df.iloc[:, 1:].columns
[col[:4] for col in s]
['ColA', 'ColB', 'ColC']
But when I try to overwrite the column names, nothing happens:
df.iloc[:, 1:].columns = df.iloc[:, 1:].columns.str[:4]
   Value ColAfjkj ColBhuqwa ColCouiqw
0      1        a         e         i
1      2        b         f         j
2      3        c         g         k
3      4        d         h         l
Same for the second approach:
s = df.iloc[:, 1:].columns
cols = [col[:4] for col in s]
df.iloc[:, 1:].columns = cols
   Value ColAfjkj ColBhuqwa ColCouiqw
0      1        a         e         i
1      2        b         f         j
2      3        c         g         k
3      4        d         h         l
This does work, but you have to manually concatenate the name of the first column, which is not ideal:
df.columns = ['Value'] + df.iloc[:, 1:].columns.str[:4].tolist()
   Value ColA ColB ColC
0      1    a    e    i
1      2    b    f    j
2      3    c    g    k
3      4    d    h    l
Is there an easier way to achieve this? Am I missing something?
Dataframe for reproduction:
df = pd.DataFrame({'Value': [1, 2, 3, 4],
                   'ColAfjkj': ['a', 'b', 'c', 'd'],
                   'ColBhuqwa': ['e', 'f', 'g', 'h'],
                   'ColCouiqw': ['i', 'j', 'k', 'l']})
This is because pandas' Index is immutable. If you check the documentation for the class pandas.Index, you'll see that it is defined as:
Immutable ndarray implementing an ordered, sliceable set
In addition, df.iloc[:, 1:] returns a new object, so the attempted assignment only ever touches a temporary copy, never the original df. So in order to modify the column names you'll have to create a new list of column names, for instance with:
df.columns = [df.columns[0]] + list(df.iloc[:, 1:].columns.str[:4])
Another option is to use rename with a dictionary containing the columns to replace:
df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))
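Note that rename returns a new DataFrame unless you pass inplace=True, so assign the result back to keep the change:
df = df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))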
To overwrite column names you can also use the .rename() method.
So, it will look like:
df.rename(columns={'ColAfjkj': 'ColA',
                   'ColBhuqwa': 'ColB',
                   'ColCouiqw': 'ColC'},
          inplace=True)
More info regarding rename here in docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
I had this problem as well and came up with this solution:
First, create a mask of the columns you want to rename
mask = df.iloc[:,1:4].columns
Then, use a list comprehension with a conditional to rename just the columns you want:
df.columns = [x if x not in mask else x[:4] for x in df.columns]
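For reference, here is a self-contained sketch combining the reproduction dataframe with the dictionary-based rename; the mapping variable name is illustrative, not from the original answers:
import pandas as pd

df = pd.DataFrame({'Value': [1, 2, 3, 4],
                   'ColAfjkj': ['a', 'b', 'c', 'd'],
                   'ColBhuqwa': ['e', 'f', 'g', 'h'],
                   'ColCouiqw': ['i', 'j', 'k', 'l']})

# Build an {old: new} mapping for every column except the first;
# rename leaves all unmapped columns untouched.
mapping = {old: old[:4] for old in df.columns[1:]}
df = df.rename(columns=mapping)
print(df.columns.tolist())  # ['Value', 'ColA', 'ColB', 'ColC']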

Conditionally select dataframes from a list of dfs

I'm trying to subset a list of dataframes with a function. This function should return only the dfs which, for example, have a Z-column total > 14 and whose X-column values (rows 0-4) are all within 30% of the average of those 5 values. So, in the example below, df1 would be returned and df2 would not.
Can this be done, evaluating every dataframe with these kinds of conditions? Could anyone point me in the right direction?
N = 5
np.random.seed(0)
df1 = pd.DataFrame({'X': np.random.uniform(0, 5, N),
                    'Y': np.random.uniform(0, 5, N),
                    'Z': np.random.uniform(0, 5, N)})
df2 = pd.DataFrame({'X': np.random.uniform(0, 5, N),
                    'Y': np.random.uniform(0, 5, N),
                    'Z': np.random.uniform(0, 5, N)})
df1.loc['total'] = df1.sum()
df2.loc['total'] = df2.sum()
df_list = (df1, df2)
               X          Y          Z
0       2.744068   3.229471   3.958625
1       3.575947   2.187936   2.644475
2       3.013817   4.458865   2.840223
3       2.724416   4.818314   4.627983
4       2.118274   1.917208   0.355180
total  14.176521  16.611793  14.426486
--------------------------------------
               X          Y          Z
0       0.435646   4.893092   3.199605
1       0.101092   3.995793   0.716766
2       4.163099   2.307397   4.723345
3       3.890784   3.902646   2.609242
4       4.350061   0.591372   2.073310
total  12.940682  15.690299  13.322267
A list comprehension can be used with the two stated conditions.
The Z condition is pretty straightforward and easy to implement. For the X condition, I created a little function that returns True if the dataframe matches the condition, else False.
In [156]: def check_X(df):
     ...:     avg = df.drop('total')['X'].mean()
     ...:     for val in df.drop('total')['X']:
     ...:         if val/avg < 0.7 or val/avg > 1.3:  # 30% below or above
     ...:             return False
     ...:     return True
     ...:
Therefore, we can get the expected result by doing:
In [157]: [df for df in df_list if df.drop('total')['Z'].sum() > 14 and check_X(df)]
Out[157]:
[               X          Y          Z
 0       2.744068   3.229471   3.958625
 1       3.575947   2.187936   2.644475
 2       3.013817   4.458865   2.840223
 3       2.724416   4.818314   4.627983
 4       2.118274   1.917208   0.355180
 total  14.176522  16.611794  14.426486]
Edit: a better, one-liner solution that doesn't use any user-defined function:
In [205]: [df for df in df_list if df['Z'].sum() > 14 and ((df['X'] > df['X'].mean()*0.7) & (df['X'] < df['X'].mean()*1.3)).all()]
Out[205]:
[          X         Y         Z
 0  2.744068  3.229471  3.958625
 1  3.575947  2.187936  2.644475
 2  3.013817  4.458865  2.840223
 3  2.724416  4.818314  4.627983
 4  2.118274  1.917208  0.355180]
For simplicity, I dropped the 'total' row from both df before processing:
In [204]: df_list = [df.drop('total') for df in df_list]
If you have a list of dataframes, you can conditionally select dataframes using a list comprehension, and you can use slicing (iloc[0:-1] to exclude the last row):
new_list = [x for x in df_list if (x.loc['total', 'Z'] > 14) and
            ((x.iloc[0:-1]['X'] > x.iloc[0:-1]['X'].mean()*0.7) &
             (x.iloc[0:-1]['X'] < x.iloc[0:-1]['X'].mean()*1.3)).all()]
Output:
[               X          Y          Z
 0       2.744068   3.229471   3.958625
 1       3.575947   2.187936   2.644475
 2       3.013817   4.458865   2.840223
 3       2.724416   4.818314   4.627983
 4       2.118274   1.917208   0.355180
 total  14.176521  16.611793  14.426486]
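As a reusable variant, the two conditions can be wrapped in a small predicate. This is a sketch for illustration only; the function name select_dfs and the parameters z_min and tol are assumptions, not from the original answers, and df_list comes from the question code above:
def select_dfs(df_list, z_min=14, tol=0.3):
    # Keep dataframes whose Z total exceeds z_min and whose X values
    # (excluding the 'total' row) all lie within tol of the X mean.
    selected = []
    for df in df_list:
        body = df.drop('total')
        mean_x = body['X'].mean()
        z_ok = body['Z'].sum() > z_min
        x_ok = body['X'].between(mean_x * (1 - tol), mean_x * (1 + tol)).all()
        if z_ok and x_ok:
            selected.append(df)
    return selected

new_list = select_dfs(df_list)  # returns [df1] for the data above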

How to lower all the elements in a pandas dataframe?

Just a quick question guys, I have a pandas dataframe:
In [11]: df = pd.DataFrame([['A', 'B', 'D'], ['C', 'E', 'C']], columns=['X', 'Y', 'Z'])
In [12]: df
Out[12]:
   X  Y  Z
0  A  B  D
1  C  E  C
How can I convert all the elements of df to lowercase:
Out[12]:
   X  Y  Z
0  a  b  d
1  c  e  c
I looked over the documentation and I tried the following:
df = [[col.lower() for col in [df["X"],df["Y"], df["Z"]]]]
df
Nevertheless, it doesn't work. How can I lowercase all the elements inside a pandas dataframe?
Either
df.applymap(str.lower)
Out:
   X  Y  Z
0  a  b  d
1  c  e  c
Or
df.apply(lambda col: col.str.lower())
Out:
   X  Y  Z
0  a  b  d
1  c  e  c
The first one is faster and it looks nicer but the second one can handle NaNs.
Using applymap with a lambda will work even if df contains a mix of NaN values and strings:
import pandas as pd
df = df.applymap(lambda x: x.lower() if pd.notnull(x) else x)
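If you are on a recent pandas (2.1 or later), note that DataFrame.applymap has been deprecated in favour of DataFrame.map; a minimal sketch assuming such a version:
import pandas as pd

# DataFrame.map (pandas >= 2.1) is the renamed applymap; lowercase
# string cells and pass anything else (e.g. NaN) through unchanged.
df = df.map(lambda x: x.lower() if isinstance(x, str) else x)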

Creating a new column based on if-elif-else condition

I have a DataFrame df:
   A  B
a  2  2
b  3  1
c  1  3
I want to create a new column based on the following criteria:
if row A == B: 0
if row A > B: 1
if row A < B: -1
so given the above table, it should be:
   A  B  C
a  2  2  0
b  3  1  1
c  1  3 -1
For typical if else cases I do np.where(df.A > df.B, 1, -1), does pandas provide a special syntax for solving my problem with one step (without the necessity of creating 3 new columns and then combining the result)?
To formalize some of the approaches laid out above:
Create a function that operates on the rows of your dataframe like so:
def f(row):
    if row['A'] == row['B']:
        val = 0
    elif row['A'] > row['B']:
        val = 1
    else:
        val = -1
    return val
Then apply it to your dataframe passing in the axis=1 option:
In [1]: df['C'] = df.apply(f, axis=1)
In [2]: df
Out[2]:
   A  B  C
a  2  2  0
b  3  1  1
c  1  3 -1
Of course, this is not vectorized, so performance may not be as good when scaled to a large number of records. Still, I think it is much more readable, especially coming from a SAS background.
Edit
Here is the vectorized version:
df['C'] = np.where(df['A'] == df['B'], 0,
                   np.where(df['A'] > df['B'], 1, -1))
df.loc[df['A'] == df['B'], 'C'] = 0
df.loc[df['A'] > df['B'], 'C'] = 1
df.loc[df['A'] < df['B'], 'C'] = -1
This is easy to solve using indexing. The first line of code reads like so: if column A is equal to column B, then create column C and set it equal to 0.
For this particular relationship, you could use np.sign:
>>> df["C"] = np.sign(df.A - df.B)
>>> df
   A  B  C
a  2  2  0
b  3  1  1
c  1  3 -1
When you have multiple if conditions, numpy.select is the way to go:
In [4102]: import numpy as np
In [4098]: conditions = [df.A.eq(df.B), df.A.gt(df.B), df.A.lt(df.B)]
In [4096]: choices = [0, 1, -1]
In [4100]: df['C'] = np.select(conditions, choices)
In [4101]: df
Out[4101]:
   A  B  C
a  2  2  0
b  3  1  1
c  1  3 -1
Let's say the above is your original dataframe and you want to add a new column 'elderly'. If age is greater than or equal to 50, we mark the row as "yes", otherwise "no".
Step 1: get the indexes of rows whose age is greater than or equal to 50:
row_indexes = df[df['age'] >= 50].index
Step 2: using .loc we can assign a new value to the column:
df.loc[row_indexes, 'elderly'] = "yes"
The same for ages below 50:
row_indexes = df[df['age'] < 50].index
df.loc[row_indexes, 'elderly'] = "no"
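A minimal runnable sketch of these two steps; the age data here is hypothetical, for illustration only:
import pandas as pd

df = pd.DataFrame({'age': [25, 61, 47, 53]})  # hypothetical data
df.loc[df[df['age'] >= 50].index, 'elderly'] = "yes"
df.loc[df[df['age'] < 50].index, 'elderly'] = "no"
print(df)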
You can use the method mask:
df['C'] = np.nan
df['C'] = df['C'].mask(df.A == df.B, 0).mask(df.A > df.B, 1).mask(df.A < df.B, -1)
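Series.mask replaces values where the condition is True, so after the three chained calls every row has received exactly one of 0, 1 and -1. A quick self-contained check, rebuilding the example df for illustration:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 1], 'B': [2, 1, 3]}, index=['a', 'b', 'c'])
df['C'] = np.nan
df['C'] = df['C'].mask(df.A == df.B, 0).mask(df.A > df.B, 1).mask(df.A < df.B, -1)
assert not df['C'].isna().any()  # every row matched exactly one condition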
