How to reassign a Pandas dataframe value using the .apply() method? - python

Is there a way to reassign values in a pandas dataframe using the .apply() method?
I have this code:
import pandas as pd

df = pd.DataFrame({'switch': ['ON', 'OFF', 'ON'],
                   'value': [10, 15, 20]})
print(df, '\n')

def myfunc(row):
    if row['switch'] == 'ON':
        row['value'] = 500
    elif row['switch'] == 'OFF':
        row['value'] = 0

df = df.apply(myfunc, axis=1)
print(df)
The code is not working. I am trying to achieve the following output after running the .apply() method:
switch value
0 ON 500
1 OFF 0
2 ON 500
Why is the "row['value'] = 500" assignment not working and how can I rewrite it to make it work?

It's not working because your function needs to return a value. Also, you need to assign the result back to the DataFrame column for it to be present.
def f(row):
    if row['switch'] == 'ON':
        return 500
    elif row['switch'] == 'OFF':
        return 0

df['value'] = df.apply(f, axis=1)
df now has the values:
switch value
0 ON 500
1 OFF 0
2 ON 500
One thing to note here is whether switch can take values other than ON and OFF.
If those are the only permitted values, then you may replace the named function with a lambda expression.
If other values are present, they will currently be set to None, since the if/elif block does not handle them. You would need to return a value for every type of switch, or a default, to end up with a DataFrame without None in value.
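A minimal sketch of that default-handling idea, using a mapping instead of if/elif (the extra DIM state and the -1 default are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'switch': ['ON', 'OFF', 'DIM'],
                   'value': [10, 15, 20]})

# map the known states; anything unmapped becomes NaN, then fill a default
df['value'] = df['switch'].map({'ON': 500, 'OFF': 0}).fillna(-1).astype(int)
print(df)
```

Adding a new state then only requires one more entry in the dict.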

Beyond the missing return value, which is what causes the problem, I would suggest that you not use apply() at all; a vectorized version using np.where() is much faster.
import numpy as np
df['value'] = np.where(df['switch'] == "ON", 500, 0)

Related

Lazy evaluate Pandas dataframe filters

I'm observing a behavior that's weird to me. Can anyone tell me how I can define a filter once and re-use it throughout my code?
>>> df = pd.DataFrame([1,2,3], columns=['A'])
>>> my_filter = df.A == 2
>>> df.loc[1] = 5
>>> df[my_filter]
A
1 5
I expect my_filter to return an empty result, since no value in column A equals 2.
I'm thinking about making a function that returns the filter and re-using that, but is there a more Pythonic (and pandas-idiomatic) way of doing this?
def get_my_filter(df):
    return df.A == 2

df[get_my_filter(df)]
# change df
df[get_my_filter(df)]
Masks are not dynamic: they stay as they were at the moment you defined them.
So if you still need to change the dataframe value, you should swap lines 2 and 3.
That would work.
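A minimal sketch of the function-based approach from the question, which recomputes the mask from the dataframe's current state each time it is needed:

```python
import pandas as pd

def get_my_filter(df):
    # the comparison runs against whatever values df holds *now*
    return df.A == 2

df = pd.DataFrame([1, 2, 3], columns=['A'])
df.loc[1] = 5                  # change df first ...
print(df[get_my_filter(df)])   # ... then build the mask: empty, as expected
```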
The filter was evaluated at the moment you defined it; changing a value in the row afterwards won't affect it.
df = pd.DataFrame([1, 2, 3], columns=['A'])
my_filter = df.A == 2
print(my_filter)
'''
0    False
1     True
2    False
Name: A, dtype: bool
'''
As you can see, it returns a Series that represents the first version of the df; if you change the data afterwards, the mask is not updated. But you can define the filter as a string instead, and achieve what you want by passing that string to the eval() function:
df = pd.DataFrame([1,2,3], columns=['A'])
my_filter = 'df.A == 2'
df.loc[1] = 5
df[eval(my_filter)]
'''
Out[205]:
Empty DataFrame
Columns: [A]
Index: []
'''

How to create conditional columns in Pandas with any?

I'm working with Pandas. I need to create a new column in a dataframe according to conditions on other columns. I try to check, for each value in a series, whether it contains a given value (a condition to return text). This works when the values match exactly, but not when the target is only part of the series value.
Sample data :
df = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])
def conditions(df5):
    if "ores" in df5["Symptom"]:
        return "Things"

df["new_column"] = df.swifter.apply(conditions, axis=1)
It doesn't work because any("something") is always True.
So I tried:
df['new_column'] = np.where(df2["Symptom"].str.contains('ores'), 'yes', 'no') : return "Things"
It doesn't work because it's inside a loop.
I can't use np.select because it needs two separate lists, and my code has to be easily editable (and it can't come from a dict).
It also doesn't work with find_all, and also not with:
df["new_column"] == "ores" is True: return "things"
I don't really understand why nothing works or what I have to do.
Edit :
df5 = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])

def conditions(df5):
    (df5["Symptom"].str.contains('ores'), 'Things')

df5["Deversement Service"] = np.where(conditions)
df5
For the moment I have a length-of-values problem.
To add a new column with a condition, use np.where:

df = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])
df['new'] = np.where(df["Symptom"].str.contains('ores'), 'Things', "")
print(df)

             Symptom     new
0               ores  Things
1  ores + more texts  Things
2      anything else
If you need a single boolean value, use pd.Series.any:

if df["Symptom"].str.contains('ores').any():
    print("Things")
# Things
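One pitfall worth flagging as an aside (not raised in the answer above): str.contains is case-sensitive by default and propagates missing values, and its case= and na= parameters control both:

```python
import pandas as pd

s = pd.Series(["ores", "ORES galore", None])
# na=False treats missing values as "no match"; case=False ignores case
mask = s.str.contains('ores', case=False, na=False)
print(mask.tolist())   # [True, True, False]
```

Without na=False, the None row would yield NaN in the mask, which np.where would treat as falsy but boolean indexing would reject.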

I want to update a pandas dataframe iteratively

I have a dataframe that I need to check some conditions in 2 other columns and update another column iteratively. Basically I want to replace NaNs in smoking_status column with new categories.
Here is my code:
import numpy as np

for i in range(df.shape[0]):
    if df['age'][i] < 15 and df['smoking_status'][i] == np.nan:
        df['smoking_status'][i] = 'never smoked'
    elif df['age'][i] >= 15 and df['smoking_status'][i] == np.nan:
        df['smoking_status'][i] = 'occassional smoker'
The code runs but when I check my updated table I still notice no change. Any help would be appreciated.
Try to use pandas' vectorized functions instead of looping through every row. They are both faster and result in neater code:

cond = df['smoking_status'].isna()
df.loc[cond, 'smoking_status'] = np.where(df.loc[cond, 'age'] < 15, 'never smoked', 'occassional smoker')
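It is worth noting why the original loop never fires: NaN compares unequal to everything, including itself, so `== np.nan` is always False, and `isna()` is the correct check. A quick demonstration:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)        # False: NaN never equals anything, even itself
s = pd.Series([20.0, np.nan])
print((s == np.nan).tolist())  # [False, False] -- the loop's condition, never True
print(s.isna().tolist())       # [False, True]  -- what you actually want
```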

Filling each row of one column of a DataFrame with different values (a random distribution)

I have a DataFrame with aprox. 4 columns and 200 rows. I created a 5th column with null values:
df['minutes'] = np.nan
Then, I want to fill each row of this new column with random inverse log normal values. The code to generate one inverse log normal value:
Note: if the code below is run multiple times it will generate a new result each time, because of the random.random() inside ppf().
df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
What's happening is that it fills all 200 rows of df['minutes'] with the same number, instead of triggering random.random() for each row as I expected.
What do I have to do? I tried using a for loop, but apparently I'm not getting it right (it gives the same result):

for i in range(1, len(df)):
    df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))

What am I doing wrong?
Also, I'll add that later I'll need to change some parameters of the inverse log normal above depending on whether the value of another column is 0 or 1, as in:

if df['type'] == 0:
    df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
elif df['type'] == 1:
    df['minutes'] = df['minutes'].fillna(stats.lognorm(1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int))

Thanks in advance.
The problem with your use of fillna here is that this function takes a value as argument and applies it to every element along the specified axis. So your stat value is calculated once and then distributed into every row.
What you need is your function called for every element on the axis, so your argument must be the function itself and not a value. That's a job for apply, which takes a function and applies it on elements along an axis.
I'm straight jumping to your final requirements:
You could use apply just on the minutes-column (as a pandas.Series method) with a lambda-function and then assign the respective results to the type-column filtered rows of column minutes:
import numpy as np
import pandas as pd
import scipy.stats as stats
import random

# setup
df = pd.DataFrame(np.random.randint(0, 2, size=(8, 4)),
                  columns=list('ABC') + ['type'])
df['minutes'] = np.nan

df.loc[df.type == 0, 'minutes'] = \
    df['minutes'].apply(lambda _: stats.lognorm(
        0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int),
        convert_dtype=False)
df.loc[df.type == 1, 'minutes'] = \
    df['minutes'].apply(lambda _: stats.lognorm(
        1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int),
        convert_dtype=False)
... or you use apply as a DataFrame method with a function wrapping your logic to distinguish between values of type-column and assign the result back to the minutes-column:
def calc_minutes(row):
    if row['type'] == 0:
        return stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int)
    elif row['type'] == 1:
        return stats.lognorm(1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int)

df['minutes'] = df.apply(calc_minutes, axis=1)
Managed to do it in a few steps with a different mindset:
Created 2 lists, each with its own parameters, and appended to them in a loop, so that each row gets a different random number:

lognormal_tone = []
lognormal_ttwo = []
for i in range(len(s)):
    lognormal_tone.append(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
    lognormal_ttwo.append(stats.lognorm(0.4, scale=np.exp(2.7)).ppf(random.random()).astype(int))

Then, included them in the DataFrame along with another previously created list:

df = pd.DataFrame({'arrival': arrival, 'minTypeOne': lognormal_tone, 'minTypeTwo': lognormal_ttwo})
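As a side note, the per-row loop can itself be vectorized: drawing a uniform and pushing it through lognorm(...).ppf is equivalent to sampling a lognormal directly, which NumPy can do in one call per branch. A sketch with made-up stand-in data, assuming the mapping scipy lognorm(s, scale=np.exp(mu)) ↔ lognormal(mean=mu, sigma=s):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng()
df = pd.DataFrame({'type': [0, 1, 0, 1]})  # made-up stand-in data
n = len(df)

# one vectorized draw per branch, then pick per row according to type
df['minutes'] = np.where(df['type'] == 0,
                         rng.lognormal(mean=1.8, sigma=0.5, size=n),
                         rng.lognormal(mean=2.7, sigma=1.2, size=n)).astype(int)
print(df)
```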

Python - Population of PANDAS dataframe column based on conditions met in other dataframes' columns

I have 3 dataframes (df1, df2, df3) which are identically structured (# and labels of rows/columns), but populated with different values.
I want to populate df3 based on values in the associated column/rows in df1 and df2. I'm doing this with a FOR loop and a custom function:
for x in range(len(df3.columns)):
    df3.iloc[:, x] = customFunction(x)
I want to populate df3 using this custom IF/ELSE function:
def customFunction(y):
    if df1.iloc[:, y] <> 1 and df2.iloc[:, y] = 0:
        return "NEW"
    elif df2.iloc[:, y] = 2:
        return "OLD"
    else:
        return "NEITHER"
I understand why I get an error message when I run this, but I can't figure out how to apply this function to a series. I could do it row by row with more complex code, but I'm hoping there's a more efficient solution. I fear my approach is flawed.
v1 = df1.values
v2 = df2.values
df3.loc[:] = np.where(
    (v1 != 1) & (v2 == 0), 'NEW',
    np.where(v2 == 2, 'OLD', 'NEITHER'))
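The nested np.where calls can also be written with np.select, which reads more like the original if/elif/else (a sketch with small made-up frames):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'a': [0, 2, 1]})
df3 = pd.DataFrame({'a': [None, None, None]})

v1, v2 = df1.values, df2.values
# conditions are checked top-down; the first match wins, default covers the rest
df3.loc[:] = np.select([(v1 != 1) & (v2 == 0), v2 == 2],
                       ['NEW', 'OLD'], default='NEITHER')
print(df3)
```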
Yeah, try to avoid loops in pandas; they're inefficient, and pandas is built to be used with the underlying NumPy vectorization.
You want to use the apply function with axis=1 so the function receives rows.
Something like:
df3['new_col'] = df3.apply(lambda row: customFunction(row), axis=1)
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html