This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 2 years ago.
I am a newbie and trying to figure out how to correctly use the .loc function in pandas for slicing a dataframe. Any help is greatly appreciated.
The code is:
df1['Category'] = df[key_column].apply(lambda x: process_df1(x, 'category'))
where df1 is a dataframe,
key_column is a specific column identified to be operated upon
process_df1 is a function defined to run on df1.
The problem is I am trying to avoid the error:
"A value is trying to be set on a copy of a slice from a DataFrame.
Try using
.loc[row_indexer,col_indexer] = value instead"
I don't want to ignore / suppress the warnings or set
pd.options.mode.chained_assignment = None.
Is there an alternative besides these 2?
I have tried using
df.loc[df1['Category'] = df[key_column].apply(lambda x: process_df1(x, 'category'))]
but it still produces the same error. Am I using the .loc incorrectly?
Apologies if it is a confusing question.
df1 = df[:break_index]
df2 = df[break_index:]
Thank you.
The apply method calls the function on each element of the Series you run it on (key_column in this case) and returns a new Series; it does not change the original in place.
If you are trying to create a new column based upon a function using another column as input you can use list comprehension
df1['Category'] = [process_df1(x, 'category') for x in df1[key_column]]
NOTE: I'm assuming, based on your description, that process_df1 takes a single value from the key_column column and returns a new value. If that's not the case, please update your question.
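A likely cause of the warning is that df1 was created as a slice of df (df1 = df[:break_index]), so pandas cannot tell whether the assignment is meant to reach the original frame. A minimal sketch of one way around it, taking explicit copies, is below. The sample data, the value of break_index, and the body of process_df1 are all made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the names df, df1, df2, break_index,
# key_column and process_df1 come from the question.
df = pd.DataFrame({"key": ["a", "b", "c", "d"], "n": [1, 2, 3, 4]})
key_column = "key"
break_index = 2


def process_df1(value, kind):
    # Hypothetical stand-in for the asker's function.
    return f"{kind}:{value}"


# .copy() makes df1 and df2 independent DataFrames, so adding a column
# to df1 is unambiguous and does not touch df.
df1 = df.iloc[:break_index].copy()
df2 = df.iloc[break_index:].copy()

df1["Category"] = df1[key_column].apply(lambda x: process_df1(x, "category"))
print(df1)
```

If the new column is supposed to end up in df itself, assigning with df.loc[rows, 'Category'] = ... on the original frame is the other usual route.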
Unless you give more details on the source data and your expected results, we won't be able to give you a clear answer. For now, here's a small example I put together to show how you can pass two values to a function through apply.
import pandas as pd
df = pd.DataFrame({'Category':['fruit','animal','plant','fruit','animal','plant'],
'Good' :[27, 82, 32, 91, 99, 67],
'Faulty' :[10, 5, 12, 8, 2, 12],
'Region' :['north','north','south','south','north','south']})
def shipment(categ, y):
    # Return a code for each (category, letter) pair; 99 for anything else.
    if (categ, y) == ('fruit', 'a'):
        return 10
    elif (categ, y) == ('fruit', 'b'):
        return 20
    elif (categ, y) == ('animal', 'a'):
        return 30
    elif (categ, y) == ('animal', 'c'):
        return 40
    elif (categ, y) == ('plant', 'a'):
        return 50
    elif (categ, y) == ('plant', 'd'):
        return 60
    else:
        return 99
df['result'] = df['Category'].apply(lambda x: shipment(x,'a'))
print (df)
This question already has answers here:
Pandas apply based on conditional from another column
(2 answers)
Closed 7 months ago.
How can I run through my dataframe and edit the data based on another column?
for people in andrew_lewis:
    if andrew_lewis['MRR'] == 'KC2':
        andrew_lewis['payout'] * 100
    else:
        print('wrong')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
It might not be obvious, but you can use boolean indexing with .loc to do this:
import pandas as pd
data = [
['abc', 100],
['KC2', 200],
['def', 300],
['KC2', 400]
]
df = pd.DataFrame( data, columns=['MRR','QTY'] )
df.loc[df['MRR']=='KC2','QTY'] = df[df['MRR']=='KC2']['QTY'] * 100
print(df)
So, the first argument to .loc selects the rows where MRR is 'KC2', and the second selects the single column to update.
It looks like what you are trying to do is set the value of the column payout to itself times 100 when the column MRR is 'KC2'. If that's not the case, then, well, I've probably got the wrong answer here.
To do this, you need a statement where you have andrew_lewis['payout'] = ..., with the ... to be figured out. For that, you want it to be one thing when a condition is met and another thing when the condition is not met. You can use numpy.where for this.
import numpy as np
andrew_lewis['payout'] = np.where(
andrew_lewis['MRR'] == 'KC2',
andrew_lewis['payout'] * 100,
andrew_lewis['payout'])
np.where is helpful in that it doesn't require any loop over the dataframe, but handles this itself (and more efficiently). It will return an array, and we set that array to just overwrite our original dataframe column.
It takes three parameters:
Condition: we want to test if the value in a column 'MRR' is 'KC2'
True result: what we give when the condition is true. You want it to give the value of 'payout' (in the same row) times 100
False result: what we give when the condition is false. Just give the original value of 'payout'.
You should try really hard to avoid using for-loops with dataframes. I won't say they are never necessary (other people may say that and be right), but you can almost always do what you are trying to do with np.where, np.choose, DataFrame.apply, or another numpy or pandas function.
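To see why the original loop raised a ValueError and how np.where avoids it, here is a minimal runnable sketch. The sample values are made up for illustration; only the column names MRR and payout come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data using the column names from the question.
andrew_lewis = pd.DataFrame({
    "MRR": ["KC1", "KC2", "KC2", "KC3"],
    "payout": [10, 20, 30, 40],
})

# Comparing a column to a value gives one True/False per row, not a single
# boolean, which is why `if andrew_lewis['MRR'] == 'KC2':` is ambiguous.
is_kc2 = andrew_lewis["MRR"] == "KC2"

# np.where picks, row by row, from the second argument where the condition
# is True and from the third argument where it is False.
andrew_lewis["payout"] = np.where(
    is_kc2,
    andrew_lewis["payout"] * 100,
    andrew_lewis["payout"],
)
print(andrew_lewis)
```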
You can use the mask method:
import pandas as pd
df = pd.DataFrame(
[['KC1', 10],
['KC2', 20],
['KC2', 30],
['KC3', 40]],
columns=['MRR','payout'])
df["payout"] = df["payout"].mask(df["MRR"]=="KC2", df["payout"]*100)
print(df)
# result
   MRR  payout
0  KC1      10
1  KC2    2000
2  KC2    3000
3  KC3      40
I'm working with Pandas. I need to create a new column in a dataframe according to conditions in other columns. For each value in a series, I check whether it contains a given string (the condition for returning some text). This works when the values are exactly the same, but not when the string is only part of the value in the series.
Sample data :
df = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])
def conditions(df5):
    if ("ores") in df5["Symptom"]: return "Things"
df["new_column"] = df.swifter.apply(conditions, axis=1)
It doesn't work because any("something") is always True.
So I tried:
df['new_column'] = np.where(df2["Symptom"].str.contains('ores'), 'yes', 'no') : return "Things"
It doesn't work because it's inside a loop.
I can't use np.select because it needs two separate lists, and my code has to be easily editable (and it can't come from a dict).
It also doesn't work with find_all. And also not with :
df["new_column"] == "ores" is True: return "things"
I don't really understand why nothing works. What do I have to do?
Edit :
df5 = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])
def conditions(df5):
    (df5["Symptom"].str.contains('ores'), 'Things')
df5["Deversement Service"] = np.where(conditions)
df5
At the moment I get a "length of values" error.
To add a new column with condition, use np.where:
df = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])
df['new'] = np.where(df["Symptom"].str.contains('ores'), 'Things', "")
print (df)
             Symptom     new
0               ores  Things
1  ores + more texts  Things
2      anything else
If you need a single boolean value, use pd.Series.any:
if df["Symptom"].str.contains('ores').any():
    print ("Things")
# Things
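If several substrings need different labels and the rules have to stay easy to edit, one possible approach is to keep a single list of (pattern, label) pairs and split it into the two lists np.select expects. This is only a sketch: the second rule ("else" mapped to "Other") is invented for illustration, and regex=False and na=False are choices made here so that patterns are treated as plain text and missing values count as no match.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["ores"], ["ores + more texts"], ["anything else"]],
    columns=["Symptom"],
)

# Hypothetical rule table: (substring to look for, label to assign).
# Edit this one list to add or change rules; the first matching rule wins.
rules = [
    ("ores", "Things"),
    ("else", "Other"),
]

# Build one boolean Series per rule and the matching list of labels.
conditions = [
    df["Symptom"].str.contains(pattern, regex=False, na=False)
    for pattern, _ in rules
]
labels = [label for _, label in rules]

# Rows that match no rule get the default (an empty string here).
df["new"] = np.select(conditions, labels, default="")
print(df)
```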
Is there a way to reassign values in a pandas dataframe using the .apply() method?
I have this code:
import pandas as pd
df = pd.DataFrame({'switch': ['ON', 'OFF', 'ON'],
'value': [10, 15, 20]})
print (df, '\n')
def myfunc(row):
    if row['switch'] == 'ON':
        row['value'] = 500
    elif row['switch'] == 'OFF':
        row['value'] = 0
df = df.apply(myfunc, axis=1)
print (df)
The code is not working. I am trying to achieve the following output after running the .apply() method:
  switch  value
0     ON    500
1    OFF      0
2     ON    500
Why is the "row['value'] = 500" assignment not working and how can I rewrite it to make it work?
It's not working because your function needs to return the value. Also, you need to assign the result back to the dataframe column for it to take effect.
def f(row):
    if row['switch'] == 'ON':
        return 500
    elif row['switch'] == 'OFF':
        return 0
df['value'] = df.apply(f, axis=1)
df now has the values:
  switch  value
0     ON    500
1    OFF      0
2     ON    500
One thing to note here is whether switch can have any values other than ON and OFF.
If those are the only permitted values, then you may replace the named function with a lambda expression.
If other values are present, they will currently be set to None, since your if/elif block does not handle them. You would need to set a value for every type of switch, or a default value, to end up with a data frame without None in value.
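One way to handle unexpected switch values without writing an if/elif chain is Series.map with a dictionary, followed by fillna for the default. The sketch below is an alternative to the apply version, not a reproduction of it; the extra 'STANDBY' row and the default of -1 are made up for illustration.

```python
import pandas as pd

# Sample data from the question plus one hypothetical unexpected value.
df = pd.DataFrame({
    "switch": ["ON", "OFF", "ON", "STANDBY"],
    "value": [10, 15, 20, 25],
})

# One entry per known switch state.
mapping = {"ON": 500, "OFF": 0}

# map() yields NaN for keys missing from the dictionary; fillna() replaces
# those with a default, and astype(int) restores an integer column.
df["value"] = df["switch"].map(mapping).fillna(-1).astype(int)
print(df)
```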
Besides the missing return value, which is what causes the error, I would suggest not using apply() at all. Use a vectorized version with np.where() instead, which is much faster.
import numpy as np
df['value'] = np.where(df['switch'] == "ON", 500, 0)
The question I have is closely related to this post.
Assume I have the following dataset:
df = pd.DataFrame({"A":range(1,10), "B":range(5,14), "Group":
[1,1,2,2,2,2,3,3,3],"C":[0,0,10,0,0,16,0,0,22], "last":[0,1,0,0,0,1,0,0,1],
"Want": [19.25,8,91.6,71.05,45.85,16,104.95,65.8,22]})
The last observation for the group is straightforward. This is what the code looks like:
def calculate(df):
    if (df.last == 1):
        value = df.loc["A"] + df.loc["B"]
    else:
        # for all other observations PER GROUP, the row value is calculated as follows:
        value = (df.loc[i-1, "C"] + 3 * df.loc[i, "A"] + 1.65 * df.loc[i, "B"])
    return value
To further clarify, these are the formulas for calculating the Want column for Group 2 using excel: F4="F5+(3*A4)+(1.65*B4)", F5="F6+(3*A5)+(1.65*B5)", F6="F7+(3*A6)+(1.65*B6)", F7="A7+B7". There's some kind of "recursive" nature to it, which is why I thought of the "for loop"
I would really appreciate a solution where it's consistent with the first if statement. That is
value = something
rather than the function returning a data frame or something like that, so that I can call the function using the following
df["value"] = df.apply(calculate, axis=1)
Your help is appreciated. Thanks
You don't need apply here. Usually, apply is very slow and you'll want to avoid that.
Problems with this recursive characteristic, however, are usually hard to vectorize. Thankfully, yours can be solved using a reversed cumsum and np.where
df['Want'] = np.where(df['last'] == 1, df['A'] + df['B'], 3*df['A'] + 1.65*df['B'])
df['Want'] = df[::-1].groupby('Group')['Want'].cumsum()
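As a check, the two lines above can be run against the sample data from the question and compared with its Want column. In this sketch the expected values are copied from the question; the variable names expected and per_row are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question, without the target column.
df = pd.DataFrame({
    "A": range(1, 10),
    "B": range(5, 14),
    "Group": [1, 1, 2, 2, 2, 2, 3, 3, 3],
    "last": [0, 1, 0, 0, 0, 1, 0, 0, 1],
})
expected = [19.25, 8, 91.6, 71.05, 45.85, 16, 104.95, 65.8, 22]

# Step 1: each row's own contribution. The last row of a group is A + B,
# every other row is 3*A + 1.65*B.
per_row = np.where(df["last"] == 1, df["A"] + df["B"], 3 * df["A"] + 1.65 * df["B"])
df["Want"] = per_row

# Step 2: walk each group from its last row upwards and accumulate, so every
# row adds its own contribution to the total of the rows below it.
# The result is aligned back to df by index on assignment.
df["Want"] = df[::-1].groupby("Group")["Want"].cumsum()

print(df)
print(np.allclose(df["Want"], expected))
```

The reversed cumulative sum works here because each row's value is its own term plus the value of the next row in the same group, which is exactly a running total taken from the bottom of the group.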
I encountered this lambda expression today and can't understand how it's used:
data["class_size"]["DBN"] = data["class_size"].apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
The line of code doesn't seem to call the lambda function or pass any arguments into it, so I'm confused about how it does anything at all. The purpose of this is to take two columns, CSD and SCHOOL CODE, and combine the entries in each row into a new column, DBN. So does this lambda expression ever get used?
You're writing your results incorrectly to a column. data["class_size"]["DBN"] is not the correct way to select the column to write to. You've also selected a column to use apply with but you'd want that across the entire dataframe.
data["DBN"] = data.apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
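A minimal runnable sketch of the line above, assuming data is a single flat DataFrame with an integer CSD column and a string SCHOOL CODE column. The sample values are made up. It also shows a vectorized alternative, which is a different technique from apply and is included only for comparison.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
data = pd.DataFrame({
    "CSD": [1, 12],
    "SCHOOL CODE": ["M015", "X200"],
})

# apply with axis=1 calls the lambda once per row, passing the row as x.
data["DBN"] = data.apply(
    lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1
)

# Vectorized alternative: zero-pad CSD as text, then concatenate the columns.
data["DBN_vectorized"] = data["CSD"].astype(str).str.zfill(2) + data["SCHOOL CODE"]
print(data)
```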
The apply method of a pandas Series takes a function as one of its arguments.
Here is a quick example of it in action:
import pandas as pd
data = {"numbers":range(30)}
def cube(x):
    return x**3
df = pd.DataFrame(data)
df['squares'] = df['numbers'].apply(lambda x: x**2)
df['cubes'] = df['numbers'].apply(cube)
print(df)
gives:
   numbers  squares  cubes
0        0        0      0
1        1        1      1
2        2        4      8
3        3        9     27
4        4       16     64
...
As you can see, either defining a function (like cube) or using a lambda function works perfectly well.
As has already been pointed out, if you're having problems with your particular piece of code it's that you have data["class_size"]["DBN"] = ... which is incorrect. I was assuming that was an odd typo because you didn't mention getting a key error, which is what that would result in.
If you're confused about this, consider:
def list_apply(func, mylist):
    newlist = []
    for item in mylist:
        newlist.append(func(item))
    return newlist
This is a (not very efficient) function for applying a function to every item in a list. If you used it with cube as before:
a_list = range(10)
print(list_apply(cube, a_list))
you get:
[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
This is a simplified illustration of what the apply function in pandas does conceptually. I hope that helps?
Are you using a multi-index dataframe (i.e. are there column hierarchies)? It's hard to tell without seeing your data, but I'm presuming that is the case, since just using data["class_size"].apply() would yield a series on a normal dataframe (meaning the lambda wouldn't be able to find the columns you specified, and there would be an error!).
I actually found this answer, which explains the problem of trying to create columns in multi-index dataframes. One confusing thing about multi-index column creation is that you can try to create a column the way you are doing and it will seem to run without any issues, but it won't actually create what you want. Instead, you need to change data["class_size"]["DBN"] = ... to data["class_size", "DBN"] = ... So, in full:
data["class_size","DBN"] = data["class_size"].apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
Of course, if it isn't a multi-index dataframe then this won't help, and you should look at one of the other answers.
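In "{0:02d}{1}", 0:02d formats the first argument (the "CSD" value) as an integer zero-padded to a width of two digits; it does not mean two decimal places. {1} inserts the second argument ("SCHOOL CODE") as it is, so the two values are placed together to form 'DBN'.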
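The format string can be tried on its own, outside pandas. In this small sketch the values 1 and "M015" are invented sample inputs.

```python
# Hypothetical sample values standing in for one row's CSD and SCHOOL CODE.
csd = 1
school_code = "M015"

# {0:02d}: argument 0, formatted as a decimal integer (d), minimum width 2,
# padded on the left with zeros (the leading 0 in "02").
# {1}: argument 1, inserted unchanged.
dbn = "{0:02d}{1}".format(csd, school_code)
print(dbn)

# A value that already has two digits is left as it is.
print("{0:02d}".format(12))
```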