Making a new column in pandas based on values of other columns?


Here is the head of my dataframe
df_s['makes'] = df_s['result']
df_s['misses'] = df_s['result']
df_s.loc[(df_s['team'] == 'BOS') & (df_s['shot_distance'] >= 23) &(df_s['result'] == 'made'), 'makes'] = 1
df_s.loc[(df_s['team'] != 'BOS') | (df_s['shot_distance'] < 23) | (df_s['result'] == 'missed') | (df_s['makes'] == 'made'), 'makes'] = 0
df_s.fillna(0, inplace=True)
df_s.loc[(df_s['team'] == 'BOS') & (df_s['shot_distance'] >= 23) & (df_s['result'] == 'missed'), 'misses'] = 1
df_s.loc[(df_s['team'] != 'BOS') | (df_s['shot_distance'] < 23) | (df_s['result'] == 'made'), 'misses'] = 0
df_s.fillna(0, inplace=True)
Is the following a better way to do this, or is there an easier solution?:
>>> df['filter'] = (df['a'] >= 20) & (df['b'] >= 20)
a b c filter
0 1 50 1 False
1 10 60 30 False
2 20 55 1 True
3 3 0 0 False
4 10 0 0 False

A more readable way is to create named masks first:
mask1 = df_s['team'] == 'BOS'
mask2 = df_s['shot_distance'] >= 23
mask3 = df_s['result'] == 'made'
df_s.loc[(mask1 & mask2 & mask3), 'makes'] = 1
df_s.loc[(~mask1 | ~mask2 | ~mask3), 'makes'] = 0
df_s.fillna(0, inplace=True)
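Since each new column is just a 0/1 flag, one possible shortcut is to build the boolean mask once and cast it to int, which avoids the second .loc call and the fillna. This is only a sketch: the sample rows below are made up for illustration, and only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the question.
df_s = pd.DataFrame({
    "team": ["BOS", "BOS", "LAL", "BOS"],
    "shot_distance": [25, 24, 26, 10],
    "result": ["made", "missed", "made", "made"],
})

# Shared part of both conditions: a BOS shot from 23 or further.
bos_long_shot = (df_s["team"] == "BOS") & (df_s["shot_distance"] >= 23)

# True/False becomes 1/0 when cast to int, so no second assignment is needed.
df_s["makes"] = (bos_long_shot & (df_s["result"] == "made")).astype(int)
df_s["misses"] = (bos_long_shot & (df_s["result"] == "missed")).astype(int)

print(df_s)
```

Rows that fail any condition get 0 automatically, so nothing is left as NaN.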

Related

Python: Add a complex conditional column without for loop

I'm trying to add a "conditional" column to my dataframe. I can do it with a for loop but I understand this is not efficient.
Can my code be simplified and made more efficient?
(I've tried masks but I can't get my head around the syntax as I'm a relative newbie to python).
import pandas as pd
path = (r"C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards")
hist_file = r"\x3RC_trnhist.xlsx"
racecard_path = path + hist_file
df = pd.read_excel(racecard_path)
df["Mask"] = df["HxFPos"].copy()   # .copy needs parentheses, otherwise the method object is stored
df["Total"] = df["HxFPos"].copy()
cnt = -1
for trn in df["HxRun"]:
    cnt = cnt + 1
    if df.loc[cnt,"HxFPos"] > 6 or df.loc[cnt,"HxTotalBtn"] > 30:
        df.loc[cnt,"Mask"] = 0
    elif df.loc[cnt,"HxFPos"] < 2 and df.loc[cnt,"HxRun"] < 4 and df.loc[cnt,"HxTotalBtn"] < 10:
        df.loc[cnt,"Mask"] = 1
    elif df.loc[cnt,"HxFPos"] < 4 and df.loc[cnt,"HxRun"] < 9 and df.loc[cnt,"HxTotalBtn"] < 10:
        df.loc[cnt,"Mask"] = 1
    elif df.loc[cnt,"HxFPos"] < 5 and df.loc[cnt,"HxRun"] < 20 and df.loc[cnt,"HxTotalBtn"] < 20:
        df.loc[cnt,"Mask"] = 1
    else:
        df.loc[cnt,"Mask"] = 0
    df.loc[cnt,"Total"] = df.loc[cnt,"Mask"] * df.loc[cnt,"HxFPos"]
df.to_excel(r'C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards\cond_col.xlsx', index = False)
Sample data/output:
HxRun HxFPos HxTotalBtn Mask Total
7 5 8 0 0
13 3 2.75 1 3
12 5 3.75 0 0
11 5 5.75 0 0
11 7 9.25 0 0
11 9 14.5 0 0
10 10 26.75 0 0
8 4 19.5 1 4
8 8 67 0 0

Use df.assign() for a complex vectorized expression
Use vectorized pandas operators and methods, where possible; avoid iterating. You can do a complex vectorized expression/assignment like this with:
.loc[]
df.assign()
or alternatively df.query (if you like SQL syntax)
or, if you insist on doing it by iteration (you shouldn't), you still never need an explicit for-loop with .loc[] as you did; you can use:
df.apply(your_function_or_lambda, axis=1)
or df.iterrows() as a fallback
df.assign() (or df.query) are going to be less grief when you have long column names (as you do) which get used repeatedly in a complex expression.
Solution with df.assign()
Rewrite your formula for clarity
When we remove all the unneeded .loc[] calls your formula boils down to:
HxFPos > 6 or HxTotalBtn > 30:
    Mask = 0
HxFPos < 2 and HxRun < 4 and HxTotalBtn < 10:
    Mask = 1
HxFPos < 4 and HxRun < 9 and HxTotalBtn < 10:
    Mask = 1
HxFPos < 5 and HxRun < 20 and HxTotalBtn < 20:
    Mask = 1
else:
    Mask = 0
pandas didn't have a native case-statement/method when this was written (Series.case_when was added later, in pandas 2.2).
Renaming your variables HxFPos->f, HxRun->r, HxTotalBtn->btn for clarity:
(f > 6) or (btn > 30):
    Mask = 0
(f < 2) and (r < 4) and (btn < 10):
    Mask = 1
(f < 4) and (r < 9) and (btn < 10):
    Mask = 1
(f < 5) and (r < 20) and (btn < 20):
    Mask = 1
else:
    Mask = 0
So really the whole boolean expression for Mask is gated by (f <= 6) and (btn <= 30), which is the negation of your first clause. (Actually your clauses imply you can only have Mask=1 for (f < 5) and (r < 20) and (btn < 20), if you want to optimize further.)
Mask = ((f <= 6) & (btn <= 30)) & ... you_do_the_rest
Vectorize your expressions
So, here's a vectorized rewrite of your first line. Note that comparisons > and < are vectorized, that the vectorized boolean operators are | and & (instead of 'and', 'or'), and you need to parenthesize your comparisons to get the operator precedence right:
>>> (df['HxFPos']>6) | (df['HxTotalBtn']>30)
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 True
dtype: bool
Now that output is a logical expression (a vector of 9 bools, one per row); you can use that directly in df.loc[logical_expression_for_row, 'Mask'].
Similarly:
((df['HxFPos']<2) & (df['HxRun']<4)) & (df['HxTotalBtn']<10)
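To round this off, here is one way the full expression might look once every clause is vectorized and passed to df.assign(). Treat it as a sketch rather than the definitive answer: the rows are copied from the sample data in the question, and the short names f, r, btn, excluded and qualifies are just local aliases chosen for this example.

```python
import pandas as pd

# Sample rows copied from the question (without the Mask/Total columns).
df = pd.DataFrame({
    "HxRun":      [7, 13, 12, 11, 11, 11, 10, 8, 8],
    "HxFPos":     [5, 3, 5, 5, 7, 9, 10, 4, 8],
    "HxTotalBtn": [8, 2.75, 3.75, 5.75, 9.25, 14.5, 26.75, 19.5, 67],
})

# Local aliases so the long column names are typed only once.
f, r, btn = df["HxFPos"], df["HxRun"], df["HxTotalBtn"]

# First clause of the original if/elif chain: these rows are always 0.
excluded = (f > 6) | (btn > 30)

# The three elif clauses that set Mask = 1, OR-ed together.
qualifies = (
    ((f < 2) & (r < 4) & (btn < 10))
    | ((f < 4) & (r < 9) & (btn < 10))
    | ((f < 5) & (r < 20) & (btn < 20))
)

df = df.assign(
    Mask=(~excluded & qualifies).astype(int),
    # A callable sees the frame after Mask has been added.
    Total=lambda d: d["Mask"] * d["HxFPos"],
)

print(df)
```

On these rows the result matches the Mask and Total columns shown in the question.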

Edit - this is where I found an answer: Pandas conditional creation of a series/dataframe column
by @Hossein-Kalbasi
I've just found an answer - please comment if this is not the most efficient.
df.loc[(((df['HxFPos']<3)&(df['HxRun']<5)|(df['HxRun']>4)&(df['HxFPos']<5)&(df['HxRun']<9)|(df['HxRun']>8)&(df['HxFPos']<6)&(df['HxRun']<30))&(df['HxTotalBtn']<30)), 'Mask'] = 1

How do I normalise a Pandas data column with multiple conditionals?

I am trying to create a new pandas column which is normalised data from another column.
I created three separate series and then merged them into one.
While this approach gave me the desired result, I was wondering whether there's a better way to do this.
x = df["Data Col"].copy()
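# If the value is strictly between 30 and 70, take the difference from the previous value.
# Positive difference = 1 & negative difference = -1
# (inclusive="neither" replaces the older inclusive=False, which pandas 2.x no longer accepts)
btw = pd.Series(np.where(x.between(30, 70, inclusive="neither"), x.diff(), 0))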
btw[btw < 0] = -1
btw[btw > 0] = 1
#All values above 70 are -1
up = pd.Series(np.where(x.gt(70), -1, 0))
#All values below 30 are 1
dw = pd.Series(np.where(x.lt(30), 1, 0))
combined = up + dw + btw
df["Normalised Col"] = np.array(combined)
I tried to use functions and loops directly on the Pandas Data Column but I couldn't figure out how to get the .diff()

Use numpy.select, chaining the masks with & for element-wise AND and | for element-wise OR:
import numpy as np
import pandas as pd

np.random.seed(2019)
df = pd.DataFrame({'Data Col':np.random.randint(10, 100, size=10)})
#print (df)
d = df["Data Col"].diff()
# inclusive="neither" replaces the older inclusive=False (rejected by pandas 2.x)
m1 = df["Data Col"].between(30, 70, inclusive="neither")
m2 = d < 0
m3 = d > 0
m4 = df["Data Col"].gt(70)
m5 = df["Data Col"].lt(30)
# First matching condition wins; rows matching neither get the default 0
df["Normalised Col1"] = np.select([(m1 & m2) | m4, (m1 & m3) | m5], [-1, 1], default=0)
print (df)
Data Col Normalised Col1
0 82 -1
1 41 -1
2 47 1
3 98 -1
4 72 -1
5 34 -1
6 39 1
7 25 1
8 22 1
9 26 1
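For comparison, the same rule can probably be written without enumerating the masks, by taking the sign of the difference directly. This is a sketch under a few assumptions: the data values are copied from the printed output above rather than regenerated from the seed, the column name "Normalised Col2" is made up here, and inclusive="neither" needs pandas 1.3 or later.

```python
import numpy as np
import pandas as pd

# Values copied from the printed output above.
df = pd.DataFrame({"Data Col": [82, 41, 47, 98, 72, 34, 39, 25, 22, 26]})
x = df["Data Col"]

# Strictly between 30 and 70: use the sign of the change from the previous row.
inside = x.between(30, 70, inclusive="neither")
trend = np.sign(x.diff()).where(inside, 0).fillna(0)  # first row has no previous value

# Below 30 contributes +1, above 70 contributes -1; the three cases never overlap.
df["Normalised Col2"] = (trend + x.lt(30).astype(int) - x.gt(70).astype(int)).astype(int)

print(df)
```

The three terms are mutually exclusive, so adding them is safe; np.select is still the clearer choice when the conditions can overlap.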

Updating a pandas column to replace values with np.nan if the value occurs once, and then reset once another value occurs

The title is very confusing, so let me explain. I have a pandas column:
x | desired x
1.5 | 1
1 | 1
1 | 1
1 | 1
1 | 1
0 | 0
0 | 0
0 | 0
0 | 0
1 | 0
0 | 0
-1.5|-1
-1 |-1
-1 |-1
-1 |-1
0 | 0
0 | 0
0 | 0
0 | 0
-1 | 0
0 | 0
0 | 0
1.5 | 1
...
Currently, I have solved this using itertuples:
currval = np.nan
for idx in df.itertuples():
    if idx[33] == 1.5:
        currval = 1
    elif idx[33] == -1.5:
        currval = -1
    elif idx[32] != "":   # "<>" only exists in Python 2
        currval = np.nan
    else:
        pass   # the original "next" was a no-op here
    df.loc[idx.Index,'refPos2'] = currval
however, this code is wayyyyyy too slow, and was wondering if anyone had ideas on how to vectorize this.
Thanks!

Based on the problem statement as I understood it from the comments, here is a solution:
for index, item in enumerate(a):  # a is your list [-1.5,1,1,0,1,1.5]
    if item == 1.5:
        a[index] = 1
    elif item == -1.5:
        a[index] = -1
    elif a[index] == 0:
        a[index] = 0
    elif a[index] == 1 and a[index-1] == 0:
        a[index] = 0
    else:
        a[index] = 1
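A vectorized alternative might be to mark only the rows that change the state and forward-fill everything else. The sketch below is based purely on the "x | desired x" table in the question, so it assumes that ±1.5 starts a run, 0 resets it, and any other value keeps the previous state; the names markers and desired are made up for the example. Unlike the loop above, whose final else branch turns a -1 following -1.5 into 1, it keeps negative runs negative.

```python
import numpy as np
import pandas as pd

# The "x" column from the question's table.
x = pd.Series([1.5, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0,
               -1.5, -1, -1, -1, 0, 0, 0, 0, -1, 0, 0, 1.5])

# Rows that set the state: +/-1.5 starts a run with that sign, 0 resets it.
# Every other row is left as NaN so it can inherit the previous state.
markers = pd.Series(
    np.select([x.abs() == 1.5, x == 0], [np.sign(x), 0.0], default=np.nan),
    index=x.index,
)

# Carry the last state forward over the NaN rows.
desired = markers.ffill()

print(pd.DataFrame({"x": x, "desired x": desired}))
```

Rows before the first marker would stay NaN; add a fillna if a default state is wanted there.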

Pandas: Filtering multiple conditions

I'm trying to do boolean indexing with a couple conditions using Pandas. My original DataFrame is called df. If I perform the below, I get the expected result:
temp = df[df["bin"] == 3]
temp = temp[(~temp["Def"])]
temp = temp[temp["days since"] > 7]
temp.head()
However, if I do this (which I think should be equivalent), I get no rows back:
temp2 = df[df["bin"] == 3]
temp2 = temp2[~temp2["Def"] & temp2["days since"] > 7]
temp2.head()
Any idea what accounts for the difference?

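Use () because of operator precedence: & binds more tightly than >, so ~temp2["Def"] & temp2["days since"] > 7 is evaluated as (~temp2["Def"] & temp2["days since"]) > 7.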
temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]
Alternatively, create conditions on separate rows:
cond1 = df["bin"] == 3
cond2 = df["days since"] > 7
cond3 = ~df["Def"]
temp2 = df[cond1 & cond2 & cond3]
Sample:
df = pd.DataFrame({'Def':[True] *2 + [False]*4,
'days since':[7,8,9,14,2,13],
'bin':[1,3,5,3,3,3]})
print (df)
Def bin days since
0 True 1 7
1 True 3 8
2 False 5 9
3 False 3 14
4 False 3 2
5 False 3 13
temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]
print (temp2)
Def bin days since
3 False 3 14
5 False 3 13

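For OR, combine the conditions with |:
df_train[(df_train["fold"]==1) | (df_train["fold"]==2)]
For AND, combine them with & (this particular example returns no rows, because fold cannot be 1 and 2 at the same time):
df_train[(df_train["fold"]==1) & (df_train["fold"]==2)]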

Alternatively, you can use the method query:
df.query('not Def & (`days since` > 7) & (bin == 3)')
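As a quick check that the two spellings agree, the sketch below runs the boolean-mask version and the query version on the sample frame from the answer above. The variable names by_mask and by_query are made up for the example; backtick-quoted column names in query need a reasonably recent pandas (0.25 or later).

```python
import pandas as pd

# Sample frame from the answer above.
df = pd.DataFrame({
    "Def": [True] * 2 + [False] * 4,
    "days since": [7, 8, 9, 14, 2, 13],
    "bin": [1, 3, 5, 3, 3, 3],
})

# Boolean indexing: every comparison needs its own parentheses.
by_mask = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]

# query: column names with spaces go in backticks.
by_query = df.query("not Def & (`days since` > 7) & (bin == 3)")

print(by_query)
print(by_mask.equals(by_query))
```

Both select rows 3 and 5 here, so which one to use is mostly a readability choice.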

If you want to combine several AND groups with OR, parenthesize each group so the precedence is explicit:
Del_Det_5k_top_10 = Del_Det[((Del_Det['State'] == 'NSW') & (Del_Det['route'] == 2)) |
                            ((Del_Det['State'] == 'VIC') & (Del_Det['route'] == 3))]

Pandas categorizing age variable into groups

I have a dataframe df with age and I am working on categorizing the file into age groups with 0s and 1s.
df:
User_ID | Age
35435 22
45345 36
63456 18
63523 55
I tried the following
df['Age_GroupA'] = 0
df['Age_GroupA'][(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
but get this error
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
To avoid it, I am going for .loc
df['Age_GroupA'] = 0
df['Age_GroupA'] = df.loc[(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
However, this marks all ages as 1
This is what I get
User_ID | Age | Age_GroupA
35435 22 1
45345 36 1
63456 18 1
63523 55 1
while this is the goal
User_ID | Age | Age_GroupA
35435 22 1
45345 36 0
63456 18 1
63523 55 0
Thank you

Due to peer pressure (@DSM), I feel compelled to break down your error:
df['Age_GroupA'][(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
This is chained indexing/assignment: df['Age_GroupA'] is evaluated first and may return a copy, so the assignment may never reach the original DataFrame.
What you tried next:
df['Age_GroupA'] = df.loc[(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
is incorrect form. It is a chained Python assignment (a = b = 1), so it sets the whole Age_GroupA column to 1 and then assigns 1 to the rows selected by df.loc[...]. When using loc you want:
df.loc[<boolean mask>, cols of interest] = some scalar or calculated value
like this:
df.loc[(df['Age'] >= 1) & (df['Age'] <= 25), 'Age_GroupA'] = 1
You could also have done this using np.where:
df['Age_GroupA'] = np.where((df['Age'] >= 1) & (df['Age'] <= 25), 1, 0)
There are many ways to do this in one line.

You can convert boolean mask to int - True are 1 and False are 0:
df['Age_GroupA'] = ((df['Age'] >= 1) & (df['Age'] <= 25)).astype(int)
print (df)
User ID Age Age_GroupA
0 35435 22 1
1 45345 36 0
2 63456 18 1
3 63523 55 0
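If more than one age group is needed, repeating the mask for each range gets tedious. One option, sketched here with made-up bin edges and group names (only Age_GroupA and its 1 to 25 range come from the question), is to bin the ages once with pd.cut and derive each 0/1 column from the result.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({"User_ID": [35435, 45345, 63456, 63523],
                   "Age": [22, 36, 18, 55]})

# Hypothetical bin edges and labels; pd.cut intervals are right-inclusive by
# default, so these are (0, 25], (25, 50] and (50, 120].
bins = [0, 25, 50, 120]
labels = ["Age_GroupA", "Age_GroupB", "Age_GroupC"]
groups = pd.cut(df["Age"], bins=bins, labels=labels)

# One 0/1 indicator column per group.
for label in labels:
    df[label] = (groups == label).astype(int)

print(df)
```

For whole-number ages, (0, 25] covers the same values as the Age >= 1 and Age <= 25 test used above.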

This worked for me. Jezrael already explained it.
dataframe['Age_GroupA'] = ((dataframe['Age'] >= 1) & (dataframe['Age'] <= 25)).astype(int)
