Comparing dataframe cell value to previous cell value - python

I am trying to iterate through each row of a pandas dataframe and compare a certain cell's value to the same cell in the previous row. I imagined this could be done using .shift(), but it is returning a series instead of the cell value.
I have seen some usage of groupby and iloc for accessing the value of a cell but not for iterative comparisons, and using some sort of incrementing counter method or manually storing the value of each cell and then comparing doesn't seem very efficient.
Here is what I imagined would work, but no joy.
for index, row in df.iterrows():
    if row['apm'] > df['apm'].shift(1):
        pass  # do something

You can just create a new column (e.g. flag) to indicate whether or not the boolean check is true.
df = df.assign(flag=df['apm'].gt(df['apm'].shift()))
Then you could perform your action based on the value of this column.
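For instance, a minimal sketch of that idea, assuming a small sample frame with an apm column (the values here are made up purely for illustration):

import pandas as pd

df = pd.DataFrame({'apm': [10, 12, 11, 15]})

# flag is True wherever apm is greater than the previous row's apm
df = df.assign(flag=df['apm'].gt(df['apm'].shift()))

# act only on the flagged rows, e.g. select them
increased = df.loc[df['flag']]
print(increased)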

df['apm'].shift(1)
returns a Series holding the previous row's value for each row, with NaN in the first row.
Thus, the code
df['apm'] > df['apm'].shift(1)
will return a Series of boolean values, True or False for each row.
Would that be enough for your task?
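To illustrate what the shift and the comparison produce, here is a quick sketch on assumed sample values:

import pandas as pd

s = pd.Series([10, 12, 11, 15], name='apm')

print(s.shift(1))      # NaN, 10.0, 12.0, 11.0 -- each row's previous value
print(s > s.shift(1))  # False, True, False, True -- element-wise comparison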

Related

How to return a value from one column of a df if the value of another column is found in a list

I have a DataFrame with 2 columns, A and B, and then a list of values called C.
I need my code to check whether each value in df.B is in list C, and for all that are True then return the corresponding values from df.A to a list.
I've tried using a for loop, and later found the .isin() function but just can't figure out how to utilize either of these in a way that'll give me the result I need.
Use:
df.loc[df['B'].isin(C), 'A'].tolist()
Never use a loop when you can select things natively with pandas.
df[df.B.isin(C)].A.to_list()
If the column name is not a single word (e.g. COLUMN NAME), attribute access won't work; use:
df[df['B'].isin(C)]['A'].to_list()
df.B.isin(C) will return a Series of True/False depending on whether the values in B are in C. Then use this to select the rows in the original dataframe. Finally, select the column A and convert to list.
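A small worked example of that selection, with column names and values assumed for illustration:

import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'z'], 'B': [1, 2, 3]})
C = [2, 3]

# keep rows whose B value appears in C, then collect the matching A values
result = df.loc[df['B'].isin(C), 'A'].tolist()
print(result)  # ['y', 'z']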

limit pandas .loc method output within an iloc range

I am looking for a maximum value within my pandas dataframe but only within certain index range:
df.loc[df['Score'] == df['Score'].iloc[430:440].max()]
This gives me a pandas.core.frame.DataFrame type output with multiple rows.
I specifically need the integer index of the maximum value within iloc[430:440], and only the first index at which the maximum occurs.
Is there any way to limit the range of the .loc method?
Thank you
If you just want the index:
i = df['Score'].iloc[430:440].idxmax()
If you want to get the row as well:
df.loc[i]
If you want to get the first row in the entire dataframe with that value (rather than just within the range you specified originally):
df[df['Score'] == df['Score'].iloc[430:440].max()].iloc[0]
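A compact sketch of the idxmax approach on made-up data, with the 430:440 window shrunk to 2:5 just for illustration:

import pandas as pd

df = pd.DataFrame({'Score': [3, 7, 1, 9, 9, 2]})

# index label of the first maximum within the positional slice 2:5
i = df['Score'].iloc[2:5].idxmax()
print(i)          # 3
print(df.loc[i])  # the full row at that index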

How do I change a cell's contents within an if statement?

Here I'm trying to set the days_employed cell to NaN in the row that has 'retiree' in a different column. As it is, my code sets the entire days_employed column to NaN, whereas I only want that specific cell in that row to be NaN.
for row in df['income_type']:
    if row == 'retiree':
        df['days_employed'] = float('Nan')
Is there something similar to row in df['days_employed'] = float('Nan')?
Since this seems like pandas to me, you can vectorize this by applying a condition on the entire column, then assigning those cells to NaN:
df.loc[df['income_type'].eq('retiree'), 'days_employed'] = np.nan
Or, reassign the column using np.where:
df['days_employed'] = np.where(
    df['income_type'].eq('retiree'), np.nan, df['days_employed'])
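A self-contained sketch of the vectorized assignment, with sample data assumed:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'income_type': ['employee', 'retiree', 'business'],
    'days_employed': [1000.0, 2000.0, 3000.0],
})

# only the retiree row's days_employed becomes NaN
df.loc[df['income_type'].eq('retiree'), 'days_employed'] = np.nan
print(df)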
When you do df[column] = ... you are assigning to the entire column; that is why all of days_employed is being changed to NaN. Try this and let me know if it works:
for row in df['income_type']:
    if row == 'retiree':
        df.loc[x, 'days_employed'] = float('Nan')
Note: x is the row index of the cell in days_employed that should be changed to float('Nan').
If you do not know the index of the cell you want to change, or if it changes often, then simply use a for loop to look up the index of the cell you want to change, and set x to that index.

Missing Value detection fails for completely empty cells python pandas

I have created a function for which the input is a pandas dataframe.
It should return the row-indices of the rows with a missing value.
It works for all the defined missingness values except when the cell is entirely empty, even though I tried to specify this in the missing_values list as [..., ""].
What could be the issue here? Or is there even a more intuitive way to solve this in general?
def missing_values(x):
    df = x
    missing_values = ["NaN","NAN","NA","Na","n/a", "na", "--","-"," ","","None","0","-inf"] #common ways to indicate missingness
    observations = df.shape[0] # Gives number of observations (rows)
    variables = df.shape[1] # Gives number of variables (columns)
    row_index_list = []
    #this goes through each observation in the first row
    for n in range(0,variables): #this iterates over all variables
        column_list = [] #creates a list for each value per variable
        for i in range(0,observations): #now this iterates over every observation per variable
            column_list.append(df.iloc[i,n]) #and adds the values to the list
        for i in range(0,len(column_list)): #now for every value
            if column_list[i] in missing_values: #it is checked, whether the value is a Missing one
                row_index_list.append(column_list.index(column_list[i])) #and if yes, the row index is appended
    finished = list(set(row_index_list)) #set is used to make sure the index only appears once if there are multiple occurences in one row and then it is listed
    return finished
There might be spurious whitespace, so try adding strip() on this line:
if column_list[i].strip() in missing_values: #it is checked, whether the value is a Missing one
Also a simpler way to get the indexes of rows containing missing_values is with isin() and any(axis=1):
x = x.replace(r'\s+', '', regex=True)
row_index_list = x[x.isin(missing_values).any(axis=1)].index
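Putting that together on a tiny assumed example frame (column names and markers invented for illustration):

import pandas as pd

missing_values = ["NaN", "NA", "n/a", "--", "-", "", "None"]

df = pd.DataFrame({'a': ['1', 'NA', '3'], 'b': ['x', 'y', '']})

# rows where any cell matches one of the missing-value markers
row_index_list = df[df.isin(missing_values).any(axis=1)].index
print(list(row_index_list))  # [1, 2]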
When you import a file into pandas using, for example, read_csv or read_excel, a literally empty cell is stored as np.nan (or another NumPy null value), not as an empty string.
(My bad here earlier: checking with np.nan == np.nan does not work, since it evaluates to False.)
You can replace the np.nan value first with:
df = df.replace(np.nan, 'NaN')
then your function can catch it.
Another way is to use isna() in pandas,
df.isna()
This will return a DataFrame of the same shape but with boolean values, True for each cell that is np.nan.
If you do df.isna().any(),
this will return a Series with True for each column that contains a null value.
If you want to retrieve the rows instead, simply add the parameter axis=1 to any():
df.isna().any(axis=1)
This will return a Series with True for every row that contains an np.nan value.
Now you have the boolean values that indicate which rows contain null values. If you put these boolean values in a list and apply it to df.index, this will pull out the index values of the rows containing nulls.
booleanlist = df.isna().any(axis=1).tolist()
null_row_id = df.index[booleanlist]
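For example, with real NaN values this could look like the following (sample frame assumed):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', 'y', None]})

booleanlist = df.isna().any(axis=1).tolist()
null_row_id = df.index[booleanlist]
print(list(null_row_id))  # [1, 2]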

Python (pandas) loop through values in a column, do a calc with each value

I have a data set of dB values in a dataframe and want to do a calc for each row in a specific column. I've tried this:
for i in dataAnti['antilog']:
    x = 10**(i/10)
It gives me the correct value but only loops once. How do I save these new values in a new column or save over the values in the antilog column?
You need to define a new column and simply formulate the calculation you want.
dataAnti['new_column'] = 10**(dataAnti['antilog']/10)
This will automatically take the value in each row, perform the calculation, and assign the result to the same row of new_column.
You can make use of the apply attribute.
dataAnti['result']=dataAnti['antilog'].apply(lambda i: 10**(i/10))
You can pass any function to apply() that takes a single value as input; it is applied to each element of the column and the results fill the new column.
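Both approaches produce the same values; a short sketch with sample dB values assumed:

import pandas as pd

dataAnti = pd.DataFrame({'antilog': [10.0, 20.0, 30.0]})

# vectorized version
dataAnti['new_column'] = 10 ** (dataAnti['antilog'] / 10)

# apply() version, element by element
dataAnti['result'] = dataAnti['antilog'].apply(lambda i: 10 ** (i / 10))

print(dataAnti)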
