Pandas KeyError: 'occurred at index 0' - python

Let's say I have a Pandas dataframe df:
start_time  Event
0           0
1           0
2           0
3           0
4           0
5           0
6           0
7           0
8           0
9           0
I want to set the value of the Event column to -1 when the corresponding start_time lies between two values, so I define this function:
def test(time):
    if (time['start_time'] >= 5) and (time['start_time'] <= 8):
        return -1
    else:
        return time
To apply this to the event column, I do the following:
df[['Event']] = df[['Event']].apply(test,axis=1)
which yields this error: KeyError: ('start_time', 'occurred at index 0')
Why is this happening? Should be a simple fix.

Simply do:
df['Event'] = df.apply(test, axis=1)['Event']

The function you are passing to .apply() uses the start_time field of its input argument (in the conditional (time['start_time'] >= 5) and (time['start_time'] <= 8)), so it must be applied to a DataFrame or Series that has a start_time column.
However, before calling apply you first call df[['Event']], which returns a DataFrame containing only the Event column. So df[['Event']].apply(test, axis=1) passes rows that hold just an Event value. When the function reaches the expression time['start_time'], it looks for a start_time entry, can't find it (because only the Event column was kept), and raises a KeyError.
The solution is to pass a DataFrame or a Series that has a start_time column. In your case you want to apply the function to the entire DataFrame, so replace df[['Event']] with the whole DataFrame df:
df = df.apply(test, axis=1)
and change your function to modify the Event column instead of returning a value: replace return -1 with time['Event'] = -1, drop the else branch, and return time unconditionally so that rows which don't meet the condition pass through unchanged.

Related

Setting value of row in certain column based on a slice of pandas dataframe - using both loc and iloc

I am trying to slice my dataframe based on a certain condition, and select the first row of that slice, and set the value of the column of that first row.
index  COL_A    COL_B    COL_C
0      cond_A1  cond_B1  0
1      cond_A1  cond_B1  0
2      cond_A1  cond_B1  0
3      cond_A2  cond_B2  0
4      cond_A2  cond_B2  0
Neither of the following lines of code I have attempted updates the dataframe:
df.loc[((df['COL_A'] == cond_A1) & (df['COL_B'] == cond_b1)), 'COL_C'].iloc[0] = 1
df[((df['COL_A'] == cond_A1) & (df['COL_B'] == cond_b1))].iloc[0]['COL_C'] = 1
I need to be able to loop through the conditions so that I could apply the same code to the next set of conditions, and update the COL_C row with index 3 based on these new conditions.
You can update only the first row of your slice with the following code:
df.loc[df.loc[(df['COL_A'] == cond_A1) & (df['COL_B'] == cond_b1), 'COL_C'].index[0], 'COL_C'] = 1
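For illustration, a self-contained sketch of that pattern on a frame shaped like the example (the condition values are literal stand-ins for the question's cond_A1/cond_B1):

```python
import pandas as pd

df = pd.DataFrame({'COL_A': ['cond_A1'] * 3 + ['cond_A2'] * 2,
                   'COL_B': ['cond_B1'] * 3 + ['cond_B2'] * 2,
                   'COL_C': [0] * 5})

# Locate the index label of the first row matching both conditions,
# then assign through a single df.loc call so the write hits the
# original frame rather than a temporary copy
mask = (df['COL_A'] == 'cond_A1') & (df['COL_B'] == 'cond_B1')
df.loc[df.loc[mask, 'COL_C'].index[0], 'COL_C'] = 1
```

Because the assignment goes through one df.loc call, it avoids the chained-indexing copies that made the original attempts silently fail.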

Check each row and column in dataframe and replace value with user-defined function

df = pd.DataFrame({'0': ["qwa-abc", "abd-xyz", "abt-Rac", "xyz-0vc"],
                   '1': ["axc-0aa", "abd-xyz", "abt-Rac", "xyz-1avc"],
                   '3': ["abc-aaa", "NaN", "abt-9ac", "xyz-9vc"]})
I have this DataFrame, and I want to check each row and each column for a specific pattern. For example, column '0' holds the four values "qwa-abc", "abd-xyz", "abt-Rac", "xyz-0vc".
For every value I want to check whether it matches xxx-<digit>..., i.e. whether the character right after the hyphen is a digit.
Example:
"qwa-abc" has a letter after the hyphen, so do nothing. When it reaches "xyz-0vc", there is the digit 0 at position 4, so it should run a user-defined function and replace the whole value (xyz-0vc) with whatever that function returns.
NOTE: I tried str.replace, but it only supports a predefined replacement string. Here the user-defined function will connect to a different system and fetch a string, so the replacement is not known in advance.
If you want to change cells throughout your DataFrame you can use DataFrame.apply over the row axis, so your custom function needs to take a pd.Series as one of its parameters. In this example row is that Series.
The generator function below iterates over each cell in the row and checks whether the character at index 4 is numeric. If so, it yields the replacement value; otherwise it yields the cell's own value.
def replace_value(row, value):
    for cell in row:
        if pd.notna(cell) and cell[4].isnumeric():
            yield value
        else:
            yield cell

df.apply(lambda x: pd.Series(replace_value(x, 'myvalue')), axis=1)
You then apply your custom function row-wise (axis=1), wrapping it in a lambda so you can pass additional arguments (value in this case), and call pd.Series on the generator the function returns.
Hope it makes sense.
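As a runnable check, here is the generator applied to the question's data, with a real missing value in place of the string "NaN" (cell[4] on a 3-character string would raise an IndexError):

```python
import pandas as pd

df = pd.DataFrame({'0': ["qwa-abc", "abd-xyz", "abt-Rac", "xyz-0vc"],
                   '1': ["axc-0aa", "abd-xyz", "abt-Rac", "xyz-1avc"],
                   '3': ["abc-aaa", None, "abt-9ac", "xyz-9vc"]})

def replace_value(row, value):
    # Yield the replacement when the character after the hyphen is a digit
    for cell in row:
        if pd.notna(cell) and cell[4].isnumeric():
            yield value
        else:
            yield cell

out = df.apply(lambda x: pd.Series(replace_value(x, 'myvalue')), axis=1)
```

Note that pd.Series numbers the resulting columns 0, 1, 2 positionally; rename them afterwards if you need the original labels back.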
You don't need a separate method, try this:
In [1200]: df.loc[df['0'].str[4].str.isdigit(), '0'] = 'myvalue'
In [1201]: df
Out[1201]:
0 1 3
0 qwa-abc axc-0aa abc-aaa
1 abd-xyz abd-xyz NaN
2 abt-Rac abt-Rac abt-9ac
3 myvalue xyz-1avc xyz-9vc
For doing this in all columns, do this:
In [1242]: def check_digit(cols, new_val):
      ...:     for i in cols:
      ...:         df.loc[(df[i].str[4].str.isdigit()) & (df[i].notna()), i] = new_val
      ...:
In [1243]: df.apply(lambda x: check_digit(df.columns, 'myval'), 1)
In [1244]: df
Out[1244]:
0 1 3
0 qwa-abc myval abc-aaa
1 abd-xyz abd-xyz NaN
2 abt-Rac abt-Rac myval
3 myval myval myval
This answer is based on @NomadMonad's answer above.
string_replacer() is a function that produces the replacement string for any cell value that satisfies the condition:
def replace_value(row, value):
    for cell in row:
        try:
            if pd.notna(cell) and cell[4].isnumeric():
                value = string_replacer(cell)
                yield value
            else:
                yield cell
        except Exception:
            print(row, value)

df.apply(lambda x: pd.Series(replace_value(x, value)), axis=1)
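A runnable sketch of this variant; string_replacer here is a hypothetical stub standing in for the real function that fetches the replacement from another system:

```python
import pandas as pd

df = pd.DataFrame({'0': ["qwa-abc", "xyz-0vc"],
                   '1': ["abd-xyz", "xyz-1avc"]})

def string_replacer(cell):
    # Hypothetical stand-in: the real implementation would query an
    # external system for the replacement string
    return 'fetched-' + cell[:3]

def replace_value(row):
    for cell in row:
        if pd.notna(cell) and cell[4].isnumeric():
            yield string_replacer(cell)
        else:
            yield cell

out = df.apply(lambda x: pd.Series(replace_value(x)), axis=1)
```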

Python PANDAS: Applying a function to a dataframe, with arguments defined within dataframe

I have a dataframe with headers 'Category', 'Factor1', 'Factor2', 'Factor3', 'Factor4', 'UseFactorA', 'UseFactorB'.
The values of 'UseFactorA' and 'UseFactorB' are each one of the strings ['Factor1', 'Factor2', 'Factor3', 'Factor4'], keyed on the value in 'Category'.
I want to generate a column, 'Result', which equals dataframe[UseFactorA]/dataframe[UseFactorB]
Take the below dataframe as an example:
Category  Factor1  Factor2  Factor3  Factor4  UseFactorA  UseFactorB
A         1        2        5        8        'Factor1'   'Factor3'
B         2        7        4        2        'Factor3'   'Factor1'
The 'Result' series should be [.2, 2].
However, I cannot figure out how to feed the values of UseFactorA and UseFactorB into an index to make this happen. If the columns to use were fixed, I would just give
df['Result'] = df['Factor1']/df['Factor2']
However, when I try to give
df['Results'] = df[df['useFactorA']]/df[df['useFactorB']]
I get the error
ValueError: Wrong number of items passed 3842, placement implies 1
Is there a method for doing what I am trying here?
Probably not the prettiest solution (because of the iterrows), but what comes to mind is to iterate through the sets of factors and set the 'Result' value at each index:
for i, factors in df[['UseFactorA', 'UseFactorB']].iterrows():
    df.loc[i, 'Result'] = df.loc[i, factors['UseFactorA']] / df.loc[i, factors['UseFactorB']]
Edit:
Another option:
def factor_calc_for_row(row):
    factorA = row['UseFactorA']
    factorB = row['UseFactorB']
    return row[factorA] / row[factorB]

df['Result'] = df.apply(factor_calc_for_row, axis=1)
Here's the one liner:
df['Results'] = [df[df['UseFactorA'][x]][x]/df[df['UseFactorB'][x]][x] for x in range(len(df))]
How it works:
df[df['UseFactorA']] returns a DataFrame,
df[df['UseFactorA'][x]] returns a Series, and
df[df['UseFactorA'][x]][x] pulls a single value from that Series.
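To make the approaches above concrete, here is the apply version run against the question's example data (with the lookup columns named UseFactorA/UseFactorB as in the question text):

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B'],
                   'Factor1': [1, 2], 'Factor2': [2, 7],
                   'Factor3': [5, 4], 'Factor4': [8, 2],
                   'UseFactorA': ['Factor1', 'Factor3'],
                   'UseFactorB': ['Factor3', 'Factor1']})

def factor_calc_for_row(row):
    # Each row names the columns to divide; look them up on the row itself
    return row[row['UseFactorA']] / row[row['UseFactorB']]

df['Result'] = df.apply(factor_calc_for_row, axis=1)
```

Row A gives 1/5 = 0.2 and row B gives 4/2 = 2.0.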

Keep upper n rows of a pandas dataframe based on condition

How would I delete all rows from a dataframe that come after a certain fulfilled condition? As an example I have the following dataframe:
import pandas as pd
xEnd=1
yEnd=2
df = pd.DataFrame({'x':[1,1,1,2,2,2], 'y':[1,2,3,3,4,3], 'id':[0,1,2,3,4,5]})
How would I get a dataframe that deletes the last 4 rows and keeps the upper 2, given that in row 2 the condition x == xEnd and y == yEnd is fulfilled?
EDITED: I should have mentioned that the dataframe is not necessarily ascending. It could also be descending, and I would still like to get the upper rows.
To slice your dataframe up to the first time a condition across two series is satisfied, first calculate the required index and then slice via iloc.
You can calculate the index via set_index, isin and np.ndarray.argmax:
idx = df.set_index(['x', 'y']).isin((xEnd, yEnd)).values.argmax()
res = df.iloc[:idx+1]
print(res)
x y id
0 1 1 0
1 1 2 1
If you need better performance, see Efficiently return the index of the first value satisfying condition in array.
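For reference, an equivalent runnable version using a plain boolean mask for the pair condition (same idea: argmax gives the position of the first True):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 2],
                   'y': [1, 2, 3, 3, 4, 3],
                   'id': [0, 1, 2, 3, 4, 5]})
xEnd, yEnd = 1, 2

# Position of the first row where both conditions hold
idx = ((df['x'] == xEnd) & (df['y'] == yEnd)).values.argmax()
res = df.iloc[:idx + 1]
```

This works whether the columns are ascending or descending, since it only depends on where the first match occurs.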
Not 100% sure I understand correctly, but you can filter your dataframe like this:
df[(df.x <= xEnd) & (df.y <= yEnd)]
this yields the dataframe:
id x y
0 0 1 1
1 1 1 2
If x and y are not strictly increasing and you want whats above the line that satisfy condition:
df[df.index <= df[(df.x == xEnd) & (df.y == yEnd)].index[0]]
df = df.iloc[0:2, :]
This selects just the first two rows, keeps all columns, and puts the result in a new dataframe.
Or you can reuse the same variable name, as shown.

IF ELSE using Numpy and Pandas

After searching several forums on similar questions, it appears that one way to apply a conditional statement quickly is using Numpy's np.where() function with Pandas. I am having trouble with the following task:
I have a dataset that looks like several rows of:
PatientID  Date1     Date2     ICD
1234       12/14/10  12/12/10  313.2, 414.2, 228.1
3213       8/2/10    9/5/12    232.1, 221.0
I am trying to create a conditional statement such that:
1. if strings '313.2' or '414.2' exist in df['ICD'] return 1
2. if strings '313.2' or '414.2' exist in df['ICD'] and Date1>Date2 return 2
3. Else return 0
Given that Date1 and Date2 are in date-time format and my data frame is coded as df, I have the following code:
df['NewColumn'] = np.where(df.ICD.str.contains('313.2|414.2').astype(int), 1,
                           np.where((df.ICD.str.contains('313.2|414.2').astype(int)) & (df['Date1'] > df['Date2']), 2, 0))
However this code only returns a series with 1's and 0's and does not include a 2. How else can I complete this task?
You almost had it. Note that contains treats the pattern as a regex by default, so pass a raw string (prefix with r); also be aware that an unescaped . matches any character, so use r'313\.2|414\.2' if you need a strictly literal match:
In [115]:
df['NewColumn'] = np.where(df.ICD.str.contains(r'313.2|414.2').astype(int), 1, np.where(((df.ICD.str.contains(r'313.2|414.2').astype(int))&(df['Date1']>df['Date2'])), 2, 0))
df
Out[115]:
PatientID Date1 Date2 ICD NewColumn
0 1234 2010-12-14 2010-12-12 313.2,414.2,228.1 1
1 3213 2010-08-02 2012-09-05 232.1,221.0 0
You get 1 returned because the outer condition is evaluated first and is met; if you want 2 returned, you need to rearrange the order of evaluation:
In [122]:
df['NewColumn'] = np.where( (df.ICD.str.contains(r'313.2|414.2').astype(int)) & ( df['Date1'] > df['Date2'] ), 2 ,
np.where( df.ICD.str.contains(r'313.2|414.2').astype(int), 1, 0 ) )
df
Out[122]:
PatientID Date1 Date2 ICD NewColumn
0 1234 2010-12-14 2010-12-12 313.2,414.2,228.1 2
1 3213 2010-08-02 2012-09-05 232.1,221.0 0
It is much easier to use the pandas functionality itself. Using numpy to do something that pandas already does is a good way to get unexpected behaviour.
Assuming you want to check for a cell value containing 313.2 only (so 2313.25 returns False).
df['ICD'].astype(str) == '313.2'
returns a Series of True or False for each index entry.
So:
boolean = (df['ICD'].astype(str) == '313.2') | (df['ICD'].astype(str) == '414.2')
if boolean.any():
    # do something
    return 1

boolean2 = ((df['ICD'].astype(str) == '313.2') | (df['ICD'].astype(str) == '414.2')) & (df['Date1'] > df['Date2'])
if boolean2.any():
    return 2
etc
Pandas also has the function isin() which can simplify things further.
The docs are here: http://pandas.pydata.org/pandas-docs/stable/indexing.html
Also, you do not get 2 returned because of the order in which you evaluate the conditional statements. In any circumstance where condition 2 evaluates as true, condition 1 must also evaluate as true, so as you test condition 1 first, it always returns 1 before condition 2 is reached.
In short, you need to test condition 2 first, as there is no circumstance where 1 can be false and 2 can be true.
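A self-contained sketch of the reordered np.where, with the dots escaped so they match literally (sample data reconstructed from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date1': pd.to_datetime(['2010-12-14', '2010-08-02']),
                   'Date2': pd.to_datetime(['2010-12-12', '2012-09-05']),
                   'ICD': ['313.2, 414.2, 228.1', '232.1, 221.0']})

has_code = df['ICD'].str.contains(r'313\.2|414\.2')
# Evaluate the narrower condition (code present AND Date1 > Date2) first
# so the broader one cannot shadow it
df['NewColumn'] = np.where(has_code & (df['Date1'] > df['Date2']), 2,
                           np.where(has_code, 1, 0))
```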
