Using an if statement in a dataframe with lambda functions - python

I am trying to add a new column to a dataframe based on an if statement that depends on the values of two other columns, i.e. if column x is None then column y, else column x.
Below is the script I have written, but it doesn't work. Any ideas?
dfCurrentReportResults['Retention'] = dfCurrentReportResults.apply(lambda x : x.Retention_y if x.Retention_x == None else x.Retention_x)
Also I got this error message:
AttributeError: ("'Series' object has no attribute 'Retention_x'", u'occurred at index BUSINESSUNIT_NAME')
FYI: BUSINESSUNIT_NAME is the first column name.
Additional Info:
My data, printed out, looks like this, and I want to add a third column that takes whichever value is present, else keeps NaN.
   Retention_x  Retention_y
0            1          NaN
1          NaN     0.672183
2          NaN     1.035613
3          NaN     0.771469
4          NaN     0.916667
5          NaN          NaN
6          NaN          NaN
7          NaN          NaN
8          NaN          NaN
9          NaN          NaN
UPDATE:
In the end I was having issues referencing null values in my dataframe; the final line of code I used, which also includes axis=1, answered my question.
dfCurrentReportResults['RetentionLambda'] = dfCurrentReportResults.apply(lambda x : x['Retention_y'] if pd.isnull(x['Retention_x']) else x['Retention_x'], axis = 1)
Thanks @EdChum, @strim099 and @aus_lacy for all your input. As my data set gets larger I may switch to the np.where option if I notice performance issues.

Your lambda is operating on axis 0, which is column-wise. Simply add axis=1 to the apply argument list. This is clearly documented.
In [1]: import pandas
In [2]: dfCurrentReportResults = pandas.DataFrame([['a','b'],['c','d'],['e','f'],['g','h'],['i','j']], columns=['Retention_y', 'Retention_x'])
In [3]: dfCurrentReportResults['Retention_x'][1] = None
In [4]: dfCurrentReportResults['Retention_x'][3] = None
In [5]: dfCurrentReportResults
Out[5]:
  Retention_y Retention_x
0           a           b
1           c        None
2           e           f
3           g        None
4           i           j
In [6]: dfCurrentReportResults['Retention'] = dfCurrentReportResults.apply(lambda x : x.Retention_y if x.Retention_x == None else x.Retention_x, axis=1)
In [7]: dfCurrentReportResults
Out[7]:
  Retention_y Retention_x Retention
0           a           b         b
1           c        None         c
2           e           f         f
3           g        None         g
4           i           j         j
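One caveat about the demo above: the == None comparison only matches here because the missing values were set to literal Python None objects in an object-dtype column. With real NaN values, as in the question's data, the comparison is always False, which is why the asker's final version uses pd.isnull. A quick sketch of the pitfall:
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])
print(s == None)     # False, False -- NaN does not compare equal to None
print(s.isnull())    # False, True  -- use isnull()/isna() instead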

Just use np.where:
dfCurrentReportResults['Retention'] = np.where(df.Retention_x == None, df.Retention_y, df.Retention_x)
np.where takes the test condition as the first parameter and sets the value to df.Retention_y where the condition is True, otherwise to df.Retention_x.
Also, avoid using apply where possible, as it just loops over the values; np.where is a vectorised method and will scale much better.
UPDATE
OK, no need to use np.where; just use the following simpler syntax:
dfCurrentReportResults['Retention'] = df.Retention_y.where(df.Retention_x == None, df.Retention_x)
Further update: use isnull() rather than == None, since == None will not match NaN values:
dfCurrentReportResults['Retention'] = df.Retention_y.where(df.Retention_x.isnull(), df.Retention_x)
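For reference, a minimal self-contained sketch (column names from the question, sample values invented) showing the approaches side by side; Series.fillna is another common idiom for this "x unless missing, else y" pattern:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Retention_x': [1.0, np.nan, np.nan],
                   'Retention_y': [np.nan, 0.672183, 1.035613]})

# Row-wise apply: works, but loops in Python
r_apply = df.apply(lambda x: x['Retention_y'] if pd.isnull(x['Retention_x'])
                   else x['Retention_x'], axis=1)

# np.where: vectorised
r_npwhere = np.where(df.Retention_x.isnull(), df.Retention_y, df.Retention_x)

# Series.where: keep Retention_y where the condition is True, else Retention_x
r_where = df.Retention_y.where(df.Retention_x.isnull(), df.Retention_x)

# fillna: arguably the most direct spelling of the intent
r_fillna = df.Retention_x.fillna(df.Retention_y)
All four produce the same result here: 1.0, 0.672183, 1.035613.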

Related

Moving pandas series value by switching column name?

I have a DF; however, the last value of some series should be placed in a different one. This happened because the column names are not standardized, i.e. some are "Wx_y_x_PRED" and some are "Wx_x_y_PRED". I'm having difficulty writing a function that will simply find the columns with >= 225 NaNs and change the column the value is assigned to.
I've written a function that for some reason will sometimes work and sometimes won't. When it does, it creates approximately 850 additional columns in its wake (the original dataframe is around 420 columns with the duplicates). I'm hoping to have something that just reassigns the value. If it automatically deletes the incorrect column, that's awesome too, but I just used .dropna(thresh=2) when my function worked originally.
Here's what it looks like originally:
in: df = pd.DataFrame(data = {'W10_IND_JAC_PRED': ['NaN','NaN','NaN','NaN','NaN',2],
                              'W10_JAC_IND_PRED': [1,2,1,2,1,'NAN']})
out: df
  W10_IND_JAC_PRED W10_JAC_IND_PRED
0              NaN                1
1              NaN                2
2              NaN                1
3              NaN                2
4              NaN                1
5                2              NAN
I wrote this, which occasionally works but most of the time doesn't and i'm not sure why.
def switch_cols(x):
    """Takes mismatched columns (where only the last value != NaN) and changes order of team column names"""
    if x.isna().sum() == 5:
        col_string = x.name.split('_')
        col_to_switch = ('_').join([col_string[0], col_string[2], col_string[1], 'PRED'])
        df[col_to_switch]['row_name'] = x[-1]
    else:
        pass
    return x
Most of the time it just returns to me the exact same DF, but this is the desired outcome.
  W10_IND_JAC_PRED W10_JAC_IND_PRED
0              NaN                1
1              NaN                2
2              NaN                1
3              NaN                2
4              NaN                1
5                2                2
Anyone have any tips or could share why my function works maybe 10% of the time?
Edit:
So this is an ugly for loop I wrote that works. I know there has to be a much more Pythonic way of doing this while preserving the original column names, though.
for i in range(df.shape[1]):
    if df.iloc[:, i].isna().sum() == 5:
        split_nan_col = df.columns[i].split('_')
        correct_col_name = ('_').join([split_nan_col[0], split_nan_col[2], split_nan_col[1], split_nan_col[3]])
        df.loc[5, correct_col_name] = df.loc[5, df.columns[i]]
    else:
        pass
Split the column names on '_' and map them to frozenset (which discards the ordering) before joining them back, so differently ordered names collide; groupby first then returns the first non-null value per group. Notice this solution extends to more columns:
df.columns = df.columns.str.split('_').map(frozenset).map('_'.join)
df.mask(df == 'NaN').groupby(level=0, axis=1).first()  # first returns the first non-null value per group
  PRED_JAC_W10_IND
0                1
1                2
2                1
3                2
4                1
5                2
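For reference, a self-contained version of that answer on the question's toy data. Two caveats: the mask below is extended to catch both the 'NaN' and 'NAN' string placeholders (the answer only masked 'NaN'), and the merged column name comes out in arbitrary order because frozenset is unordered. Note also that groupby(axis=1) is deprecated in recent pandas versions:
import pandas as pd

df = pd.DataFrame({'W10_IND_JAC_PRED': ['NaN', 'NaN', 'NaN', 'NaN', 'NaN', 2],
                   'W10_JAC_IND_PRED': [1, 2, 1, 2, 1, 'NAN']})

# Normalise the names: split on '_', discard ordering via frozenset, rejoin,
# so 'W10_IND_JAC_PRED' and 'W10_JAC_IND_PRED' become the same label
df.columns = df.columns.str.split('_').map(frozenset).map('_'.join)

# Replace the string placeholders with real NaN, then take the first
# non-null value across the now-identically-named columns
out = df.mask(df.isin(['NaN', 'NAN'])).groupby(level=0, axis=1).first()
print(out)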

Set values based on df.query?

I'd like to set the value of a column based on a query. I could probably use .where to accomplish this, but the criteria for .query are strings, which are easier for me to maintain, especially when the criteria become complex.
import numpy as np
import pandas as pd
np.random.seed(51723)
n = 10
df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
I'd like to make a new column, d, and set the value to 1 where these criteria are met:
criteria = '(a < b) & (b < c)'
Among other things, I've tried:
df['d'] = np.nan
df.query(criteria).loc[:,'d'] = 1
But that seems to do nothing except give a SettingWithCopyWarning, even though I'm using .loc.
And passing inplace like this:
df.query(criteria, inplace=True).loc[:,'d'] = 1
Gives AttributeError: 'NoneType' object has no attribute 'loc'
AFAIK df.query() returns a new DF, so try the following approach:
In [146]: df.loc[df.eval(criteria), 'd'] = 1
In [147]: df
Out[147]:
          a         b         c    d
0  0.175155  0.221811  0.808175  1.0
1  0.069033  0.484528  0.841618  1.0
2  0.174685  0.648299  0.904037  1.0
3  0.292404  0.423220  0.897146  1.0
4  0.169869  0.395967  0.590083  1.0
5  0.574394  0.804917  0.746797  NaN
6  0.642173  0.252437  0.847172  NaN
7  0.073629  0.821715  0.859776  1.0
8  0.999789  0.833708  0.230418  NaN
9  0.028163  0.666961  0.582713  NaN
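If you want to keep maintaining the criteria as query-style strings, a small sketch under the question's setup: .query returns a filtered copy, so use its index to write back into the original frame (df.eval and df.query share the same expression syntax, which is why the accepted line above works too):
import numpy as np
import pandas as pd

np.random.seed(51723)
n = 10  # matches the ten-row output above
df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

criteria = '(a < b) & (b < c)'

df['d'] = np.nan
# .query returns a new, filtered DataFrame; its index tells us which rows
# of the original matched, and .loc writes back without the warning
df.loc[df.query(criteria).index, 'd'] = 1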

Pandas logical indexing on a single column of a dataframe to assign values

I am an R programmer looking for a pandas equivalent of something like this in R:
data[data$x > value, y] <- 1
(basically: take all rows where the x column is greater than some value, and set the y column to 1 at those rows)
In pandas it would seem the equivalent would go something like:
data['y'][data['x'] > value] = 1
But this gives a SettingWithCopyWarning.
Equivalent statements I've tried are:
condition = data['x']>value
data.loc(condition,'x')=1
But I'm seriously confused. Maybe I'm thinking too much in R terms and can't wrap my head around what's going on in Python.
What would be equivalent code for this in Python, or workarounds?
Your statement is incorrect; it should be:
data.loc[condition, 'x'] = 1
Example:
In [3]:
df = pd.DataFrame({'a':np.random.randn(10)})
df
Out[3]:
          a
0 -0.063579
1 -1.039022
2 -0.011687
3  0.036160
4  0.195576
5 -0.921599
6  0.494899
7 -0.125701
8 -1.779029
9  1.216818
In [4]:
condition = df['a'] > 0
df.loc[condition, 'a'] = 20
df
Out[4]:
           a
0  -0.063579
1  -1.039022
2  -0.011687
3  20.000000
4  20.000000
5  -0.921599
6  20.000000
7  -0.125701
8  -1.779029
9  20.000000
As you are subscripting the df, you should use square brackets [] rather than parentheses (), which denote a function call. See the docs.
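Putting it together for the exact R one-liner in the question (data, x, y and value are the question's names, with invented sample values), a minimal sketch:
import pandas as pd

data = pd.DataFrame({'x': [1, 5, 3], 'y': [0, 0, 0]})
value = 2

# R: data[data$x > value, "y"] <- 1
data.loc[data['x'] > value, 'y'] = 1
print(data)
#    x  y
# 0  1  0
# 1  5  1
# 2  3  1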

Pandas DataFrame constructor introduces NaN when including the index argument

I'm creating a pandas DataFrame object using the DataFrame constructor. My data is a dict of lists and categorical data Series objects. When I pass an index to the constructor, my categorical data series gets reset with NaN values. What's going on here? Thanks in advance!
Example:
import pandas as pd
import numpy as np
a = pd.Series(['a','b','c'],dtype="category")
b = pd.Series(['a','b','c'],dtype="object")
c = pd.Series(['a','b','cc'],dtype="object")
A = pd.DataFrame({'A':a,'B':[1,2,3]},index=["0","1","2"])
AA = pd.DataFrame({'A':a,'B':[1,2,3]})
B = pd.DataFrame({'A':b,'C':[4,5,6]})
print("DF A:")
print(A)
print("\nDF A, without specifying an index in the constructor:")
print(AA)
print("\nDF B:")
print(B)
This doesn't have anything to do with categories vs. object; it has to do with index alignment.
You're getting NaNs in A because you're telling the constructor you want an index of three strings. But a has an index of its own, consisting of the integers [0, 1, 2]. Since that doesn't match the index you've said you want, the data doesn't align, so you get a DataFrame with the index you asked for and NaNs highlighting that the data is missing. By contrast, the 'B' column is simply a list, so there's no index to ignore, and accordingly pandas assumes the data is given in index-appropriate order.
This might be easier to see than to explain. Regardless of dtype, if the indices don't match, you get NaN:
In [147]: pd.DataFrame({'A': pd.Series(list("abc"), dtype="category"), 'B': [1,2,3]},
                       index=["0","1","2"])
Out[147]:
     A  B
0  NaN  1
1  NaN  2
2  NaN  3
In [148]: pd.DataFrame({'A': pd.Series(list("abc"), dtype="object"), 'B': [1,2,3]},
                       index=["0","1","2"])
Out[148]:
     A  B
0  NaN  1
1  NaN  2
2  NaN  3
If you use a fully-matching index, it works:
In [149]: pd.DataFrame({'A': pd.Series(list("abc"), dtype="object"), 'B': [1,2,3]},
                       index=[0,1,2])
Out[149]:
   A  B
0  a  1
1  b  2
2  c  3
And if you use a partially-matching index, you'll get values where the indices align and NaN where they don't:
In [150]: pd.DataFrame({'A': pd.Series(list("abc"), dtype="object"), 'B': [1,2,3]},
                       index=[0,1,10])
Out[150]:
      A  B
0     a  1
1     b  2
10  NaN  3
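If the goal is to keep the custom index and have the Series data taken positionally, one common workaround (a minimal sketch using the question's variables) is to pass the underlying array, which carries no index to align on:
import pandas as pd

a = pd.Series(['a', 'b', 'c'], dtype="category")

# a.values has no index, so the constructor takes the data positionally
A = pd.DataFrame({'A': a.values, 'B': [1, 2, 3]}, index=["0", "1", "2"])
print(A)
#    A  B
# 0  a  1
# 1  b  2
# 2  c  3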

Pandas. Selection by label. One-row output

I'm trying to select every entry in a pandas DataFrame D corresponding to a certain userid, filling missing etime values with zeros, as follows:
user_entries = D.loc[userid]
user_entries.index = user_entries.etime
user_entries = user_entries.reindex(range(distinct_time_entries_num))
user_entries = user_entries.fillna(0)
The problem is, for some ids there exists exactly one entry, and thus the .loc lookup returns a Series object with an unexpected index:
(Pdb) user_entries.index = user_entries.etime
*** TypeError: Index(...) must be called with a collection of some kind, 388 was passed
(Pdb) user_entries
etime 388
requested 1
rejected 0
Name: 351, dtype: int64
(Pdb) user_entries.index
Index([u'etime', u'requested', u'rejected'], dtype='object')
which is painful to handle. I'd seriously prefer a DataFrame object with one row. Is there any way around it? Thanks.
UPD: I have to apologize for the unintelligible formulation; this is my first post here. I'll try again.
So the deal is: there is a dataframe indexed by userid. Every userid can have up to some number N of corresponding dataframe rows (the columns are 'etime', 'requested', 'rejected'), for which 'etime' is basically the key. For some 'userid' all N corresponding entries exist, but for most of them entries are missing for some 'etime'.
My intention is: for every 'userid', construct an explicit DataFrame object containing all N entries indexed by 'etime', filled with zeros for the missing entries. That's why I'm changing the index to 'etime' and then reindexing the selected row subset with the full 'etime' range.
The problem is: for some 'userid' there is exactly one corresponding 'etime', for which .loc subsetting returns not a dataframe with one row indexed by 'userid' but a Series object indexed by the array:
Index([u'etime', u'requested', u'rejected'], dtype='object')
And that's why changing the index fails. Checking dimensions and the index every time I select some dataframe subset looks pretty ugly. What else can I do about it?
UPD2: here is the script demonstrating the case
full_etime_range = range(10)
df = DataFrame(index=[0, 0, 1],
               columns=['etime', 'requested'],
               data=[[0, 1], [1, 1], [1, 1]])
for i in df.index:
    tmp = df.loc[i]
    tmp.index = tmp['etime']
    tmp = tmp.reindex(full_etime_range, fill_value=0)
    print tmp
So, starting with df being your dataframe, we can do the following safely:
In[215]: df.set_index([df.index, 'etime'], inplace=True)
In[216]: df
Out[216]:
         requested
  etime
0 0              1
  1              1
1 1              1
Now prepare an empty frame indexed by the full etime range to join each user's rows against:
DF = pd.DataFrame(index=full_etime_range, columns=[])
df0 = DF.copy()
In[225]: df0.join(df.loc[0])
Out[225]:
requested
0 1
1 1
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
In[230]: df1 = DF.copy()
In[231]: df1.join(df.loc[1])
Out[231]:
requested
0 NaN
1 1
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
which is technically what you want. But behold, we can do this nicer:
listOfDf = [DF.copy().join(df.loc[i]) for i in df.index.get_level_values(0).unique()]
I wanted to do it even one level nicer, but the following did not work - maybe someone can chip in why.
df.groupby(level=0).apply(lambda x: DF.copy().join(x))
Are you just trying to fill NaNs? Why are you reindexing the dataframe?
Just
user_entries = D.loc[userid]
user_entries = user_entries.fillna(0)
should do the trick. But if you want to fillna just for the etime field, you should do:
user_entries = D.loc[userid]
temp = user_entries["etime"].fillna(0)
user_entries["etime"] = temp
Hope it helps. If not, clarify what you're trying to achieve.
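One more way around the one-row collapse, for what it's worth: indexing .loc with a list of labels always returns a DataFrame, even when only one row matches, so the reindex-by-etime pipeline works uniformly. A minimal sketch on the toy data from UPD2:
import pandas as pd

full_etime_range = range(10)
df = pd.DataFrame(index=[0, 0, 1],
                  columns=['etime', 'requested'],
                  data=[[0, 1], [1, 1], [1, 1]])

for i in df.index.unique():
    # .loc[[i]] -- a list of labels -- always yields a DataFrame, never a Series
    tmp = df.loc[[i]].set_index('etime')
    tmp = tmp.reindex(full_etime_range, fill_value=0)
    print(tmp)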
