I have a dataframe with six columns that are coded 1 for yes and 0 for no, plus a column for year. The output I need is the conditional probability of each pair of columns being coded 1, broken down by year. I tried incorporating some suggestions from this post: Pandas - Conditional Probability of a given specific b, but with no luck. Other things I came up with are inefficient. I am really struggling to find the best way to go about this.
Current dataframe:
Output I am seeking:
To get your wide-formatted data into the long format of the linked post, run melt and then do a self-merge by year for all pairwise combinations (avoiding same keys and reverse duplicates). Then calculate as the linked post shows:
long_df = current_df.melt(
    id_vars = "Year",
    var_name = "Key",
    value_name = "Value"
)

pairwise_df = (
    long_df.merge(
        long_df,
        on = "Year",
        suffixes = ["1", "2"]
    ).query("Key1 < Key2")
     .assign(
         Both_Occur = lambda x: np.where(
             (x["Value1"] == 1) & (x["Value2"] == 1),
             1,
             0
         )
     )
)
prob_df = (
    (pairwise_df.groupby(["Year", "Key1", "Key2"])["Both_Occur"].value_counts() /
     pairwise_df.groupby(["Year", "Key1", "Key2"])["Both_Occur"].count()
    ).to_frame(name = "Prob")
     .reset_index()
     .query("Both_Occur == 1")
     .drop(["Both_Occur"], axis = "columns")
)
To demonstrate with reproducible data:
import numpy as np
import pandas as pd
np.random.seed(112621)
random_df = pd.DataFrame({
    'At least one tree': np.random.randint(0, 2, 100),
    'At least two trees': np.random.randint(0, 2, 100),
    'Clouds': np.random.randint(0, 2, 100),
    'Grass': np.random.randint(0, 2, 100),
    'At least one mountain': np.random.randint(0, 2, 100),
    'Lake': np.random.randint(0, 2, 100),
    'Year': np.random.randint(1983, 1995, 100)
})
# ...same code as above...
prob_df
Year Key1 Key2 Prob
0    1983  At least one mountain   At least one tree  0.555556
2    1983  At least one mountain  At least two trees  0.555556
5    1983  At least one mountain              Clouds  0.416667
6    1983  At least one mountain               Grass  0.555556
8    1983  At least one mountain                Lake  0.555556
.. ... ... ... ...
351 1994 At least two trees Grass 0.490000
353 1994 At least two trees Lake 0.420000
355 1994 Clouds Grass 0.280000
357 1994 Clouds Lake 0.240000
359 1994 Grass Lake 0.420000
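Note that Prob above is the joint probability of both columns being 1 among all rows of a given year. If what you actually want is the conditional probability, P(Key2 = 1 | Key1 = 1), a small variation on pairwise_df gives it (a sketch under that reading of the question):
cond_prob_df = (
    pairwise_df.query("Value1 == 1")                       # condition on Key1 being 1
               .groupby(["Year", "Key1", "Key2"])["Value2"]
               .mean()                                     # share of those rows where Key2 is also 1
               .to_frame(name = "Prob")
               .reset_index()
)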
I have an input dataframe for daily fruit spend which looks like this:
spend_df
Date Apples Pears Grapes
01/01/22 10 47 0
02/01/22 0 22 3
03/01/22 11 0 3
...
For each fruit, I need to apply a function using its respective parameters and input spends. The function uses the previous day's and the current day's spend, as follows:
y = beta(1 - exp(-(theta*previous + current)/alpha))
parameters_df
Parameter Apples Pears Grapes
alpha 132 323 56
beta 424 31 33
theta 13 244 323
My output data frame should look like this (may contain errors):
profit_df
Date Apples Pears Grapes
01/01/22 30.93 4.19 0
02/01/22 265.63 31.00 1.72
03/01/22 33.90 30.99 32.99
...
This is what I attempted:
# First map parameters_df to spend_df
merged_df = input_df.merge(parameters_df, on=['Apples','Pears','Grapes'])
# Apply function to each row
profit_df = merged_df.apply(lambda x: beta(1 - exp(-(theta*x[-1] + x)/alpha)))
It might be easier to read if you extract the necessary variables from parameters_df and spend_df first. Then a simple application of the formula will produce the expected output.
# extract alpha, beta, theta from parameters df
alpha, beta, theta = parameters_df.iloc[:, 1:].values
# select fruit columns
current = spend_df[['Apples', 'Pears', 'Grapes']]
# find previous values of fruit columns
previous = current.shift(fill_value=0)
# calculate profit using formula
y = beta*(1 - np.exp(-(theta*previous + current) / alpha))
profit_df = spend_df[['Date']].join(y)
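As a quick sanity check (a minimal sketch using only the sample data from the question), unpacking the parameter rows shows how the broadcasting lines up with the fruit columns, and evaluating the first day by hand reproduces the expected first row of roughly 30.93 / 4.20 / 0.00:
import numpy as np
import pandas as pd

parameters_df = pd.DataFrame({
    "Parameter": ["alpha", "beta", "theta"],
    "Apples": [132, 424, 13],
    "Pears": [323, 31, 244],
    "Grapes": [56, 33, 323],
})

# each row of values becomes one length-3 array aligned to the fruit columns
alpha, beta, theta = parameters_df.iloc[:, 1:].values
print(alpha)  # [132 323  56]
print(beta)   # [424  31  33]
print(theta)  # [ 13 244 323]

# first day: previous spend is 0 (shift's fill_value), current spend is [10, 47, 0]
current = np.array([10, 47, 0])
previous = np.zeros(3)
print(beta * (1 - np.exp(-(theta * previous + current) / alpha)))
# roughly [30.93  4.20  0.00]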
Another approach uses the pandas rolling function (this is a generalized version that works with as many fruits as necessary):
import pandas as pd
import numpy as np
sdf = pd.DataFrame({
    "Date": ['01/01/22', '02/01/22', '03/01/22'],
    "Apples": [10, 0, 11],
    "Pears": [47, 22, 0],
    "Grapes": [0, 3, 3],
}).set_index("Date")

pdf = pd.DataFrame({
    "Parameter": ['alpha', 'beta', 'theta'],
    "Apples": [132, 424, 13],
    "Pears": [323, 31, 244],
    "Grapes": [56, 33, 323],
}).set_index("Parameter")
def func(r):
    # look up (alpha, beta, theta) for this fruit column by its name
    t = (pdf.loc['alpha', r.name], pdf.loc['beta', r.name], pdf.loc['theta', r.name])
    # rolling window of 2: x[0] is the previous day's spend, x[1] the current day's
    return r.rolling(2).apply(lambda x: t[1]*(1 - np.exp(-(t[2]*x[0] + x[1])/t[0])))

# the first row has no previous day, so compute it separately from a 0-filled shift
r1 = sdf.iloc[0:2,:].shift(fill_value=0).apply(lambda r: func(r), axis=0)
r = sdf.apply(lambda r: func(r), axis=0)
r.iloc[0] = r1.shift(-1).iloc[0]
print(r)
Result
Apples Pears Grapes
Date
01/01/22 30.934651 4.198004 0.000000
02/01/22 265.637775 31.000000 1.721338
03/01/22 33.901168 30.999998 32.999999
I have the following data set. I want to create a dataframe that contains all teams and includes the number of games played, wins, losses, draws, and the average point differential in 2017 (Y = 17).
Date Y HomeTeam AwayTeam HomePoints AwayPoints
2014-08-16 14 Arsenal Crystal Palace 2 1
2014-08-16 14 Leicester Everton 2 2
2014-08-16 14 Man United Swansea 1 2
2014-08-16 14 QPR Hull 0 1
2014-08-16 14 Stoke Aston Villa 0 1
I wrote the following code:
df17 = df[df['Y'] == 17]
df17['differential'] = abs(df['HomePoints'] - df['AwayPoints'])
df17['home_wins'] = np.where(df17['HomePoints'] > df17['AwayPoints'], 1, 0)
df17['home_losses'] = np.where(df17['HomePoints'] < df17['AwayPoints'], 1, 0)
df17['home_ties'] = np.where(df17['HomePoints'] == df17['AwayPoints'], 1, 0)
df17['game_count'] = 1
df17.groupby("HomeTeam").agg({"differential": np.mean, "home_wins": np.sum, "home_losses": np.sum, "home_ties": np.sum, "game_count": np.sum}).sort_values(["differential"], ascending = False)
But I don't think this is correct, as I'm only accounting for the home team. Does someone have a cleaner method?
Melt the dataframe so that each original row yields two rows: one for the HomeTeam and one for the AwayTeam.
Please find the documentation for the melt method here: https://pandas.pydata.org/docs/reference/api/pandas.melt.html
df = pd.melt(df, id_vars=['Date', 'Y', 'HomePoints', 'AwayPoints'], value_vars=['HomeTeam', 'AwayTeam'])
df = df.rename({'value': 'Team', 'variable': 'Home/Away'}, axis=1)
df['Differential'] = df['Home/Away'].replace({'HomeTeam': 1, 'AwayTeam': -1}) * (df['HomePoints'] - df['AwayPoints'])
def count_wins(x):
return (x > 0).sum()
def count_losses(x):
return (x < 0).sum()
def count_draws(x):
return (x == 0).sum()
df = df.groupby('Team')['Differential'].agg(['count', count_wins, count_losses, count_draws, 'sum'])
df = df.rename({'count': 'Number of games', 'count_wins': 'Wins', 'count_losses': 'Losses', 'count_draws': 'Draws', 'sum': 'Differential'}, axis=1)
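Since the question asks specifically about 2017, a quick usage sketch (assuming the column names from the question and the count_wins/count_losses/count_draws helpers defined above) is to filter to Y == 17 first and run the same pipeline; swap 'sum' for 'mean' if you want the average point differential rather than the total:
import pandas as pd

# df as in the question: Date, Y, HomeTeam, AwayTeam, HomePoints, AwayPoints
df17 = df[df['Y'] == 17].copy()
df17 = pd.melt(df17, id_vars=['Date', 'Y', 'HomePoints', 'AwayPoints'],
               value_vars=['HomeTeam', 'AwayTeam'])
df17 = df17.rename({'value': 'Team', 'variable': 'Home/Away'}, axis=1)
df17['Differential'] = (df17['Home/Away'].replace({'HomeTeam': 1, 'AwayTeam': -1})
                        * (df17['HomePoints'] - df17['AwayPoints']))

summary = df17.groupby('Team')['Differential'].agg(
    ['count', count_wins, count_losses, count_draws, 'mean'])
summary = summary.rename({'count': 'Number of games', 'count_wins': 'Wins',
                          'count_losses': 'Losses', 'count_draws': 'Draws',
                          'mean': 'Avg differential'}, axis=1)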
I have a data frame in pandas.
pd.DataFrame({
"category": ["Transport", "Transport : Car", "Transport : Train", "Household", "Household : Utilities", "Household : Utilities : Water", "Household : Utilities : Electric", "Household : Cleaning", "Household : Cleaning : Bathroom", "Household : Cleaning : Kitchen", "Household : Rent", "Living", "Living : Other", "Living : Food", "Living : Something", "Living : Anitsomething"],
"amount": [5000, 4900, 100, 1100, 600, 400, 200, 100, 75, 25, 400, 250, 150, 100, 1000, -1000]
})
Categories and subcategories are split by a colon.
I am trying to sort this data frame in descending amount (absolute value) order, whilst respecting the hierarchical grouping, i.e. the sorted result should look like:
Transport 5000
Transport : Car 4900
Transport : Train 100
Household 1600
Household : Utilities 600
Household : Utilities : Water 400
Household : Utilities : Electric 200
Household : Rent 400
Living 250
Living : Something 1000
Living : Antisomething -1000
Living : Other 150
Living : Food 100
I can do this recursively in an incredibly inefficient manner. Super slow but it works.
def sort_hierachical(self, full_df, name_column, sort_column, parent="", level=0):
    result_df = pd.DataFrame(columns=full_df.columns)
    part_df = full_df.loc[(full_df[name_column].str.count(':') == level) & (full_df[name_column].str.startswith(parent)), :]
    part_df['abs'] = part_df[sort_column].abs()
    part_df = part_df.sort_values('abs', ascending=False)
    for _, row in part_df.iterrows():
        category = row[name_column]
        row_df = pd.DataFrame(columns=full_df.columns).append(row)
        child_rows = self.sort_hierachical(full_df, name_column, sort_column, category, level+1)
        if not child_rows.empty:
            result_df = pd.concat([result_df, row_df], sort=False)
            result_df = pd.concat([result_df, child_rows], sort=False)
        else:
            result_df = pd.concat([result_df, row_df], sort=False)
    return result_df
df = self.sort_hierachical(df, "category", "amount")
My question: is there a nice, performant way to do such a thing in pandas? Some sort of groupby sort or MultiIndex trick?
Good karma will come to the ones who can solve this challenging problem :)
Edit:
This almost works... But the -1000, 1000 messes up the sort order.
def _sort_tree_df(self, df, tree_column, sort_column):
    sort_key = sort_column + '_abs'
    df[sort_key] = df[sort_column].abs()
    df.index = pd.MultiIndex.from_frame(df[tree_column].str.split(":").apply(lambda x: [y.strip() for y in x]).apply(pd.Series))
    sort_columns = [df[tree_column].values]
    sort_columns.append(df[sort_key].values)
    for x in range(df.index.nlevels, 0, -1):
        group_lvl = list(range(0, x))
        sort_columns.append(df.groupby(level=group_lvl)[sort_key].transform('max').values)
    sort_indexes = np.lexsort(sort_columns)
    df_sorted = df.iloc[sort_indexes[::-1]]
    df_sorted.reset_index(drop=True, inplace=True)
    df_sorted = df_sorted.drop(sort_key, axis=1)
    return df_sorted
Edit2:
Ok, I think I've managed to make it work. I'm still quite confused about how lexsort works; I made this work through educated trial and error (a small lexsort sketch follows the code below). If you understand it, please feel free to explain it. Also feel free to post a better method.
def _sort_tree_df(self, df, tree_column, sort_column, delimeter=':'):
    df.index = pd.MultiIndex.from_frame(df[tree_column].str.split(delimeter).apply(lambda x: [y.strip() for y in x]).apply(pd.Series))
    sort_columns = [df[tree_column].values]
    sort_columns.append(df[sort_column].abs().values)
    for x in range(df.index.nlevels, 0, -1):
        group_lvl = list(range(0, x))
        sort_columns.append(df.groupby(level=group_lvl)[sort_column].transform('sum').abs().values)
    sort_indexes = np.lexsort(sort_columns)
    df_sorted = df.iloc[sort_indexes[::-1]]
    df_sorted.reset_index(drop=True, inplace=True)
    return df_sorted
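For what it's worth, here is my rough understanding of np.lexsort as a minimal sketch (not specific to this problem): the last key passed is the primary sort key and earlier keys only break ties, which is why the per-level group keys get appended last above.
import numpy as np

surnames = np.array(['Hertz', 'Galilei', 'Hertz'])
first_names = np.array(['Heinrich', 'Galileo', 'Gustav'])

# the last key (surnames) is the primary sort key; first_names breaks ties
order = np.lexsort((first_names, surnames))
print([surnames[i] + ', ' + first_names[i] for i in order])
# ['Galilei, Galileo', 'Hertz, Gustav', 'Hertz, Heinrich']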
Edit3 :
Actually this doesn't always sort correctly :(
Edit4
The problem is I need a way to make the transform('sum') only apply to items where level = x-1,
i.e. something like:
df['level'] = df[tree_column].str.count(':')
sorting_by = df.groupby(level=group_lvl)[sort_column].transform('sum' if 'level' = x-1).abs().values
or
sorting_by = df.groupby(level=group_lvl).loc['level' = x-1: sort_column].transform('sum').abs().values
both of which are not valid
Anyone know how to do a conditional transform like this on a multi index df?
I am not sure I exactly understood the question, but I think you should split the column into sub-categories and then do a value sort based on the hierarchy you want. Something like the following might do the job.
Use the following to create the new columns:
for _, row in df.iterrows():
    for item, col in zip(row.category.split(':'), ['cat', 'sub_cat', 'sub_sub_cat']):
        df.loc[_, col] = item
and then just sort them
df.sort_values(['cat', 'sub_cat', 'sub_sub_cat', 'amount'])
category amount cat sub_cat sub_sub_cat
3 Household 1100 Household NaN NaN
7 Household : Cleaning 100 Household Cleaning NaN
8 Household : Cleaning : Bathroom 75 Household Cleaning Bathroom
9 Household : Cleaning : Kitchen 25 Household Cleaning Kitchen
10 Household : Rent 400 Household Rent NaN
4 Household : Utilities 600 Household Utilities NaN
6 Household : Utilities : Electric 200 Household Utilities Electric
5 Household : Utilities : Water 400 Household Utilities Water
11 Living 250 Living NaN NaN
15 Living : Antisomething -1000 Living Antisomething NaN
13 Living : Food 100 Living Food NaN
12 Living : Other 150 Living Other NaN
14 Living : Something 1000 Living Something NaN
0 Transport 5000 Transport NaN NaN
1 Transport : Car 4900 Transport Car NaN
2 Transport : Train 100 Transport Train NaN
OK, took a while to nut out, but now I'm pretty sure this works. Much faster than the recursive method too.
def _sort_tree_df(self, df, tree_column, sort_column, delimeter=':'):
    df = df.copy()
    # split the category path into one helper column per level (named 0, 1, 2, ...)
    parts = df[tree_column].str.split(delimeter).apply(lambda x: [y.strip() for y in x]).apply(pd.Series)
    for i, column in enumerate(parts.columns):
        df[column] = parts[column]
    sort_columns = [df[tree_column].values]
    sort_columns.append(df[sort_column].abs().values)
    df['level'] = df[tree_column].str.count(':')
    for x in range(len(parts.columns), 0, -1):
        group_columns = list(range(0, x))
        # only rows sitting exactly at level x-1 contribute their amount to the group key
        sorting_by = df.copy()
        sorting_by.loc[sorting_by['level'] != x-1, sort_column] = np.nan
        sorting_by = sorting_by.groupby(group_columns)[sort_column].transform('sum').abs().values
        sort_columns.append(sorting_by)
    sort_indexes = np.lexsort(sort_columns)
    df_sorted = df.iloc[sort_indexes[::-1]]
    df_sorted.reset_index(drop=True, inplace=True)
    # drop the helper columns from the frame that is actually returned
    df_sorted = df_sorted.drop([column for column in parts.columns] + ['level'], axis=1)
    return df_sorted
I have a dataframe of categories and amounts. Categories can be nested into subcategories to arbitrarily many levels using a colon-separated string. I wish to sort it by descending amount, but in a hierarchical fashion as shown below.
How I need it sorted
CATEGORY AMOUNT
Transport 5000
Transport : Car 4900
Transport : Train 100
Household 1100
Household : Utilities 600
Household : Utilities : Water 400
Household : Utilities : Electric 200
Household : Cleaning 100
Household : Cleaning : Bathroom 75
Household : Cleaning : Kitchen 25
Household : Rent 400
Living 250
Living : Other 150
Living : Food 100
EDIT:
The data frame:
pd.DataFrame({
"category": ["Transport", "Transport : Car", "Transport : Train", "Household", "Household : Utilities", "Household : Utilities : Water", "Household : Utilities : Electric", "Household : Cleaning", "Household : Cleaning : Bathroom", "Household : Cleaning : Kitchen", "Household : Rent", "Living", "Living : Other", "Living : Food"],
"amount": [5000, 4900, 100, 1100, 600, 400, 200, 100, 75, 25, 400, 250, 150, 100]
})
Note: this is the order I want it. It may be in any arbitrary order before the sort.
EDIT2:
If anyone is looking for a similar solution, I posted the one I settled on here: How to sort dataframe in pandas by value in hierarchical category structure
One way could be to first str.split the category column.
df_ = df['category'].str.split(' : ', expand=True)
print (df_.head())
0 1 2
0 Transport None None
1 Transport Car None
2 Transport Train None
3 Household None None
4 Household Utilities None
Then take the amount column; what you want is the maximum amount per group, based on:
the first column alone,
then the first and the second columns,
then the first, second and third columns, ...
You can do this with groupby.transform('max'), concatenating the column created at each level.
s = df['amount']
l_cols = list(df_.columns)
dfa = pd.concat([s.groupby([df_[col] for col in range(0, lv+1)]).transform('max')
                 for lv in l_cols], keys=l_cols, axis=1)
print (dfa)
0 1 2
0 5000 NaN NaN
1 5000 4900.0 NaN
2 5000 100.0 NaN
3 1100 NaN NaN
4 1100 600.0 NaN
5 1100 600.0 400.0
6 1100 600.0 200.0
7 1100 100.0 NaN
8 1100 100.0 75.0
9 1100 100.0 25.0
10 1100 400.0 NaN
11 250 NaN NaN
12 250 150.0 NaN
13 250 100.0 NaN
Now you just need to sort_values on all the columns in the right order (first 0, then 1, then 2...), get the index, and use loc to order df in the expected way.
dfa = dfa.sort_values(l_cols, na_position='first', ascending=False)
dfs = df.loc[dfa.index] #here you can reassign to df directly
print (dfs)
category amount
0 Transport 5000
1 Transport : Car 4900
2 Transport : Train 100
3 Household 1100
4 Household : Utilities 600
5 Household : Utilities : Water 400
6 Household : Utilities : Electric 200
10 Household : Rent 400 #here is the one difference with this data
7 Household : Cleaning 100
8 Household : Cleaning : Bathroom 75
9 Household : Cleaning : Kitchen 25
11 Living 250
12 Living : Other 150
13 Living : Food 100
I packaged @Ben.T's answer into a more generic function; hopefully this is clearer to read!
EDIT: I have made changes to the function to group by columns in order rather than one by one, to address potential issues noted by @Ben.T in the comments.
import pandas as pd

def category_sort_df(df, sep, category_col, numeric_col, ascending=False):
    '''Sorts dataframe by nested categories using `sep` as the delimiter for `category_col`.
    Sorts numeric columns in descending order by default.
    Returns a copy.'''
    df = df.copy()
    try:
        to_sort = pd.to_numeric(df[numeric_col])
    except ValueError:
        print(f'Column `{numeric_col}` is not numeric!')
        raise
    categories = df[category_col].str.split(sep, expand=True)
    # Strips any white space before and after sep
    categories = categories.apply(lambda x: x.str.split().str[0], axis=1)
    levels = list(categories.columns)
    to_concat = []
    for level in levels:
        # Group by columns in order rather than one at a time
        level_by = [categories[col] for col in range(0, level+1)]
        gb = to_sort.groupby(level_by)
        to_concat.append(gb.transform('max'))
    dfa = pd.concat(to_concat, keys=levels, axis=1)
    ixs = dfa.sort_values(levels, na_position='first', ascending=False).index
    df = df.loc[ixs].copy()
    return df
Using Python 3.7.3, pandas 0.24.2
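A quick usage sketch with the sample frame from the question (the df variable name and the plain ':' separator are assumptions):
sorted_df = category_sort_df(df, ':', 'category', 'amount')
print(sorted_df)
# should order the rows hierarchically, matching the output shown in the earlier answer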
To answer my own question: I found a way. Kind of long winded but here it is.
import numpy as np
import pandas as pd
def sort_tree_df(df, tree_column, sort_column):
    sort_key = sort_column + '_abs'
    df[sort_key] = df[sort_column].abs()
    df.index = pd.MultiIndex.from_frame(
        df[tree_column].str.split(":").apply(lambda x: [y.strip() for y in x]).apply(pd.Series))
    sort_columns = [df[tree_column].values, df[sort_key].values] + [
        df.groupby(level=list(range(0, x)))[sort_key].transform('max').values
        for x in range(df.index.nlevels - 1, 0, -1)
    ]
    sort_indexes = np.lexsort(sort_columns)
    df_sorted = df.iloc[sort_indexes[::-1]]
    df_sorted.reset_index(drop=True, inplace=True)
    df_sorted.drop(sort_key, axis=1, inplace=True)
    return df_sorted
sort_tree_df(df, 'category', 'amount')
If you don't mind adding an extra column, you can extract the main category from the category and then sort by main category, amount and category, i.e.:
df['main_category'] = df.category.str.extract(r'^([^ ]+)')
df.sort_values(['main_category', 'amount', 'category'], ascending=False)[['category', 'amount']]
Output:
category amount
0 Transport 5000
1 Transport : Car 4900
2 Transport : Train 100
11 Living 250
12 Living : Other 150
13 Living : Food 100
3 Household 1100
4 Household : Utilities 600
5 Household : Utilities : Water 400
10 Household : Rent 400
6 Household : Utilities : Electric 200
7 Household : Cleaning 100
8 Household : Cleaning : Bathroom 75
9 Household : Cleaning : Kitchen 25
Note that this will work well only if your main categories are single words without spaces. Otherwise you will need to do it in a different way, i.e. extract everything before the first colon and strip the trailing space:
df['main_category'] = df.category.str.extract(r'^([^:]+)')
df['main_category'] = df.main_category.str.rstrip()
I am trying to add a new column "energy_class" to a dataframe "df_energy" which contains the string "high" if the "consumption_energy" value is > 400, "medium" if the "consumption_energy" value is between 200 and 400, and "low" if the "consumption_energy" value is under 200.
I tried to use np.where from numpy, but I see that numpy.where(condition[, x, y]) treats only two conditions, not three like in my case.
Any idea to help me please?
Thank you in advance
Try this:
Using the setup from @MaxU:
col = 'consumption_energy'
conditions = [ df2[col] >= 400, (df2[col] < 400) & (df2[col]> 200), df2[col] <= 200 ]
choices = [ "high", 'medium', 'low' ]
df2["energy_class"] = np.select(conditions, choices, default=np.nan)
consumption_energy energy_class
0 459 high
1 416 high
2 186 low
3 250 medium
4 411 high
5 210 medium
6 343 medium
7 328 medium
8 208 medium
9 223 medium
You can nest np.where calls, which works like a vectorized ternary:
np.where(consumption_energy > 400, 'high',
(np.where(consumption_energy < 200, 'low', 'medium')))
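Applied to the dataframe, a minimal sketch (assuming the df_energy / consumption_energy names from the question):
import numpy as np
import pandas as pd

df_energy = pd.DataFrame({'consumption_energy': [459, 416, 186, 250]})
df_energy['energy_class'] = np.where(df_energy['consumption_energy'] > 400, 'high',
                                     np.where(df_energy['consumption_energy'] < 200, 'low', 'medium'))
print(df_energy)
#    consumption_energy energy_class
# 0                 459         high
# 1                 416         high
# 2                 186          low
# 3                 250       medium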
I like to keep the code clean. That's why I prefer np.vectorize for such tasks.
def conditions(x):
    if x > 400:
        return "High"
    elif x > 200:
        return "Medium"
    else:
        return "Low"
func = np.vectorize(conditions)
energy_class = func(df_energy["consumption_energy"])
Then just add numpy array as a column in your dataframe using:
df_energy["energy_class"] = energy_class
The advantage in this approach is that if you wish to add more complicated constraints to a column, it can be done easily.
Hope it helps.
I would use the cut() method here, which will generate a very efficient and memory-saving category dtype:
In [124]: df
Out[124]:
consumption_energy
0 459
1 416
2 186
3 250
4 411
5 210
6 343
7 328
8 208
9 223
In [125]: pd.cut(df.consumption_energy,
[0, 200, 400, np.inf],
labels=['low','medium','high']
)
Out[125]:
0 high
1 high
2 low
3 medium
4 high
5 medium
6 medium
7 medium
8 medium
9 medium
Name: consumption_energy, dtype: category
Categories (3, object): [low < medium < high]
WARNING: Be careful with NaNs
Always be careful that if your data has missing values np.where may be tricky to use and may give you the wrong result inadvertently.
Consider this situation:
df['cons_ener_cat'] = np.where(df.consumption_energy > 400, 'high',
(np.where(df.consumption_energy < 200, 'low', 'medium')))
# if we do not use this second line, then
# if consumption energy is missing it would be shown medium, which is WRONG.
df.loc[df.consumption_energy.isnull(), 'cons_ener_cat'] = np.nan
Alternatively, you can use one more nested np.where for medium versus nan, which would be ugly.
IMHO, the best way to go is pd.cut. It deals with NaNs and is easy to use.
Examples:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')
# pd.cut
df['age_cat'] = pd.cut(df.age, [0, 20, 60, np.inf], labels=['child','medium','old'])
# manually add another line for nans
df['age_cat2'] = np.where(df.age > 60, 'old', (np.where(df.age <20, 'child', 'medium')))
df.loc[df.age.isnull(), 'age_cat'] = np.nan
# multiple nested where
df['age_cat3'] = np.where(df.age > 60, 'old',
(np.where(df.age <20, 'child',
np.where(df.age.isnull(), np.nan, 'medium'))))
# outptus
print(df[['age','age_cat','age_cat2','age_cat3']].head(7))
age age_cat age_cat2 age_cat3
0 22.0 medium medium medium
1 38.0 medium medium medium
2 26.0 medium medium medium
3 35.0 medium medium medium
4 35.0 medium medium medium
5 NaN NaN medium nan
6 54.0 medium medium medium
Let's start by creating a dataframe with 1000000 random numbers between 0 and 1000 to be used as a test:
import numpy as np
import pandas as pd

df_energy = pd.DataFrame({'consumption_energy': np.random.randint(0, 1000, 1000000)})
[Out]:
consumption_energy
0 683
1 893
2 545
3 13
4 768
5 385
6 644
7 551
8 572
9 822
A bit of a description of the dataframe:
print(df_energy.describe())
[Out]:
consumption_energy
count 1000000.000000
mean 499.648532
std 288.600140
min 0.000000
25% 250.000000
50% 499.000000
75% 750.000000
max 999.000000
There are various ways to achieve that, such as:
Using numpy.where
df_energy['energy_class'] = np.where(df_energy['consumption_energy'] > 400, 'high', np.where(df_energy['consumption_energy'] > 200, 'medium', 'low'))
Using numpy.select
df_energy['energy_class'] = np.select([df_energy['consumption_energy'] > 400, df_energy['consumption_energy'] > 200], ['high', 'medium'], default='low')
Using numpy.vectorize
df_energy['energy_class'] = np.vectorize(lambda x: 'high' if x > 400 else ('medium' if x > 200 else 'low'))(df_energy['consumption_energy'])
Using pandas.cut
df_energy['energy_class'] = pd.cut(df_energy['consumption_energy'], bins=[0, 200, 400, 1000], labels=['low', 'medium', 'high'])
Using Python's built-in functions
def energy_class(x):
if x > 400:
return 'high'
elif x > 200:
return 'medium'
else:
return 'low'
df_energy['energy_class'] = df_energy['consumption_energy'].apply(energy_class)
Using a lambda function
df_energy['energy_class'] = df_energy['consumption_energy'].apply(lambda x: 'high' if x > 400 else ('medium' if x > 200 else 'low'))
Time Comparison
From all the tests that I've done, by measuring time with time.perf_counter() (for other ways to measure time of execution see this), pandas.cut was the fastest approach.
method time
0 np.where() 0.124139
1 np.select() 0.155879
2 numpy.vectorize() 0.452789
3 pandas.cut() 0.046143
4 Python's built-in functions 0.138021
5 lambda function 0.19081
Notes:
For the difference between pandas.cut and pandas.qcut see this: What is the difference between pandas.qcut and pandas.cut?
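For reference, a minimal sketch of how such a timing comparison can be run with time.perf_counter() (the two methods shown are taken from the list above; exact numbers will vary by machine and by pandas/numpy version):
import time
import numpy as np
import pandas as pd

df_energy = pd.DataFrame({'consumption_energy': np.random.randint(0, 1000, 1000000)})

def with_cut(df):
    return pd.cut(df['consumption_energy'], bins=[0, 200, 400, 1000],
                  labels=['low', 'medium', 'high'])

def with_select(df):
    return np.select([df['consumption_energy'] > 400, df['consumption_energy'] > 200],
                     ['high', 'medium'], default='low')

for name, method in [('pandas.cut()', with_cut), ('np.select()', with_select)]:
    start = time.perf_counter()
    method(df_energy)
    print(name, round(time.perf_counter() - start, 6))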
Try this: even if consumption_energy contains nulls, don't worry about it.
def egy_class(x):
    '''
    This function assigns classes as per the energy consumed.
    '''
    return ('high' if x > 400 else
            'low' if x < 200 else 'medium')
chk = df_energy.consumption_energy.notnull()
df_energy['energy_class'] = df_energy.consumption_energy[chk].apply(egy_class)
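A quick sketch of the behaviour when a null is present (hypothetical three-row data, reusing egy_class from above): the row with the missing value simply stays NaN, because the assignment aligns on the index.
import numpy as np
import pandas as pd

df_energy = pd.DataFrame({'consumption_energy': [459, np.nan, 186]})
chk = df_energy.consumption_energy.notnull()
df_energy['energy_class'] = df_energy.consumption_energy[chk].apply(egy_class)
print(df_energy)
#    consumption_energy energy_class
# 0               459.0         high
# 1                 NaN          NaN
# 2               186.0          low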
I second using np.vectorize. It is much faster than np.where and also cleaner code-wise. You can definitely tell the speed-up with larger data sets. You can use a dictionary format for your conditionals as well as for the output of those conditions.
# Vectorizing with numpy
row_dic = {'Condition1': 'high',
           'Condition2': 'medium',
           'Condition3': 'low',
           'Condition4': 'lowest'}

def Conditions(dfSeries_element, dictionary):
    '''
    dfSeries_element is an element from df_series
    dictionary: is the dictionary of your conditions with their outcome
    '''
    if dfSeries_element in dictionary.keys():
        return dictionary[dfSeries_element]

def VectorizeConditions():
    func = np.vectorize(Conditions)
    result_vector = func(df['Series'], row_dic)
    df['new_Series'] = result_vector

# running the below function will apply multi conditional formatting to your df
VectorizeConditions()
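A minimal usage sketch (the df, its 'Series' column, the condition labels, and the otypes argument are illustrative assumptions, not part of the answer above):
import numpy as np
import pandas as pd

row_dic = {'Condition1': 'high', 'Condition2': 'medium',
           'Condition3': 'low', 'Condition4': 'lowest'}
df = pd.DataFrame({'Series': ['Condition1', 'Condition3', 'Condition2']})

# otypes=[object] keeps np.vectorize from fixing the string width to that of the first result
func = np.vectorize(lambda element, dictionary: dictionary.get(element), otypes=[object])
df['new_Series'] = func(df['Series'], row_dic)
print(df)
#        Series new_Series
# 0  Condition1       high
# 1  Condition3        low
# 2  Condition2     medium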
myassign["assign3"]=np.where(myassign["points"]>90,"genius",(np.where((myassign["points"]>50) & (myassign["points"]<90),"good","bad"))
when you wanna use only "where" method but with multiple condition. we can add more condition by adding more (np.where) by the same method like we did above. and again the last two will be one you want.