I have created a function for which the input is a pandas dataframe.
It should return the row-indices of the rows with a missing value.
It works for all the defined missingness values except when the cell is entirely empty, even though I tried to specify this in the missing_values list as [...,""].
What could be the issue here? Or is there a more intuitive way to solve this in general?
def missing_values(x):
    df = x
    missing_values = ["NaN","NAN","NA","Na","n/a", "na", "--","-"," ","","None","0","-inf"] #common ways to indicate missingness
    observations = df.shape[0] # Gives number of observations (rows)
    variables = df.shape[1] # Gives number of variables (columns)
    row_index_list = []
    #this goes through each observation in the first row
    for n in range(0, variables): #this iterates over all variables
        column_list = [] #creates a list for each value per variable
        for i in range(0, observations): #now this iterates over every observation per variable
            column_list.append(df.iloc[i, n]) #and adds the values to the list
        for i in range(0, len(column_list)): #now for every value
            if column_list[i] in missing_values: #it is checked whether the value is a missing one
                row_index_list.append(column_list.index(column_list[i])) #and if yes, the row index is appended
    finished = list(set(row_index_list)) #set is used to make sure the index only appears once if there are multiple occurrences in one row, and then it is listed
    return finished
There might be spurious whitespace, so try adding strip() on this line:
if column_list[i].strip() in missing_values: #it is checked, whether the value is a Missing one
Also a simpler way to get the indexes of rows containing missing_values is with isin() and any(axis=1):
x = x.replace('\s+', '', regex=True)
row_index_list = x[x.isin(missing_values).any(axis=1)].index
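For example, a minimal sketch of how those two lines fit together, using a hypothetical toy dataframe:
import pandas as pd

# hypothetical toy dataframe; both the whitespace-only cell and "NA" should be flagged
x = pd.DataFrame({"a": [1, "  ", 3], "b": ["NA", 5, 6]})
missing_values = ["NaN", "NA", "", "None"]

x = x.replace(r'\s+', '', regex=True)  # collapse whitespace-only cells to ""
row_index_list = x[x.isin(missing_values).any(axis=1)].index
print(list(row_index_list))  # [0, 1]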
When you import a file into pandas using, for example, read_csv or read_excel, a cell that is literally missing can only be represented as np.nan (or another null type from the numpy library).
(Sorry my bad right here, I was really silly when doing np.nan == np.nan)
You can replace the np.nan value first with:
df = df.replace(np.nan, 'NaN')
then your function can catch it.
Another way is to use isna() in pandas,
df.isna()
This will return a DataFrame of the same shape, but with boolean values: True for each cell that is np.nan.
If you do df.isna().any(),
this will return a Series with True for any column that contains a null value.
If you want to retrieve the row IDs instead, simply add the parameter axis=1 to any():
df.isna().any(axis = 1)
This will return a Series showing which rows contain an np.nan value.
Now you have boolean values that indicate which rows contain null values. If you put these booleans into a list and apply it to df.index, this will pull out the index values of the rows containing nulls.
booleanlist = df.isna().any(axis =1).tolist()
null_row_id = df.index[booleanlist]
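Put together on a hypothetical toy dataframe, the whole approach might look like this:
import numpy as np
import pandas as pd

# hypothetical toy dataframe where the empty cells were read in as np.nan
df = pd.DataFrame({"a": [1, np.nan, 3], "b": ["x", "y", np.nan]})

booleanlist = df.isna().any(axis=1).tolist()  # [False, True, True]
null_row_id = df.index[booleanlist]
print(list(null_row_id))  # [1, 2]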
Here I'm trying to make the cell in the days_employed column NaN for each row that has 'retiree' in a different column. As it is, my code makes the entire days_employed column NaN, whereas I only want that specific cell in that row to be NaN.
for row in df['income_type']:
    if row == 'retiree':
        df['days_employed'] = float('Nan')
Is there something similar to row that I can use in df['days_employed'] = float('Nan') so that only the matching row is changed?
Since this seems like pandas to me, you can vectorize this by applying a condition on the entire column, then assigning those cells to NaN:
df.loc[df['income_type'].eq('retiree'), 'days_employed'] = np.nan
Or, reassign the column using np.where:
df['days_employed'] = np.where(
    df['income_type'].eq('retiree'), np.nan, df['days_employed'])
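For instance, on a hypothetical toy frame the .loc version would behave like this:
import numpy as np
import pandas as pd

# hypothetical toy frame to illustrate the conditional assignment
df = pd.DataFrame({
    "income_type": ["retiree", "employee", "retiree"],
    "days_employed": [100.0, 200.0, 300.0],
})
df.loc[df['income_type'].eq('retiree'), 'days_employed'] = np.nan
print(df['days_employed'].tolist())  # [nan, 200.0, nan]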
When you do df['days_employed'] = ... you are assigning to the entire column; that is why the entire days_employed column is being changed to NaN. Try this and let me know if it works:
for row in df['income_type']:
    if row == 'retiree':
        df['days_employed'][x] = float('Nan')
Note: x is the index of the specific cell in df['days_employed'] that should be changed to float('Nan').
If you do not know the position of the cell you want to change, or if it changes often, then simply use a for loop to look for the index of the cell you want to change and set x to that index, as sketched below.
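A minimal sketch of that loop, assuming you're matching on income_type (using .loc here avoids pandas' chained-assignment warning that df['days_employed'][x] = ... can raise):
for x, row in enumerate(df['income_type']):
    if row == 'retiree':
        df.loc[df.index[x], 'days_employed'] = float('Nan')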
I have a dataframe with a column named rDREB% which contains missing values (the screenshot shows the count of cells with values per column). I tried:
playersData['rDREB%'] = playersData['rDREB%'].fillna(0, inplace=True)
After executing the code, the whole column is empty when I check. Isn't the code supposed to replace only the null values with 0? I am confused.
(Screenshots: before the code / after the code.)
P.S. I am also trying to replace missing values in other columns, i.e. ScoreVal, PlayVal, rORB%, OBPM, BPM...
Using inplace=True means fillna returns None, which you're then assigning to your column. Either remove inplace, or don't assign the return value to the column:
playersData['rDREB%'] = playersData['rDREB%'].fillna(0)
or
playersData['rDREB%'].fillna(0, inplace=True)
The first approach is recommended. See this question for more info: In pandas, is inplace = True considered harmful, or not?
I was wondering how I would be able to expand out a list in a cell without repeating variables in other cells.
The goal is to get it so that the list is expanded but the first column is not repeated. I know how to expand the list out but I would not like to have the first column values repeated if that is possible. Thank you for any help!!
In order to get what you're asking for, you still have to use explode(); you just have to take it a step further and change the values of the first column. Please note that this will destroy the association between the elements of the list and the letter of the row they were first in. You would be creating a third value for the column (an empty string) that would be repeated for every record not beginning with 1.
If you want to eliminate the value from the rows you are talking about but still want to have those records associated with the value that their list was associated with, you can't. It's not logically possible for a value to both be in a given cell but also not be in that cell. So, I will show you the steps for eliminating the original association.
For this example, I named the columns since they are not provided.
import pandas as pd

data = [
    ["a", ["1 hey", "2 hi", "3 hello"]],
    ["b", ["1 what", "2 how", "3 say"]],
]
df = pd.DataFrame(data, columns=["first", "second"])
df = df.explode("second")
df['first'] = df.apply(lambda x: x['first'] if x['second'][0] == '1' else '', axis=1)
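Printing the result should give roughly the following (explode keeps the original row labels, so the index repeats):
print(df)
#   first   second
# 0     a    1 hey
# 0           2 hi
# 0         3 hello
# 1     b   1 what
# 1          2 how
# 1          3 say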
I have to reassign a column value for specific rows based on state. The data frame I am working with has only two columns, SET VALUE and AMOUNT, with STATE being in the index. I need to change the value of SET VALUE to 'YES' for the 3 customers with the highest value in the AMOUNT column for each state. How can I do this in the pandas framework?
I have attempted to use a for loop on the state in the index and then sort by AMOUNT column values and assign 'YES' to the first three rows in the SET VALUE column.
for state in trial.index:
    trial[trial.index == state].sort_values('AMOUNT', ascending=False)['SET VALUE'].iloc[0:3] = 'YES'
    print(trial[trial.index == state])
I am expecting the print portion of this loop to include 3 'YES' values but instead all I get are 'NO' values (the default for the column). It is unclear to me why this is happening.
I would advise against a repeated index for various reasons, this case being one of them: it makes it harder for you to update the rows. Here's what I would do:
# make STATE a column, and index continuous numbers
df = df.reset_index()
# get the actual indexes of the largest amounts
idx = df.groupby('STATE').AMOUNT.nlargest(3).index.get_level_values(1)
# update
df.loc[idx, 'SET VALUE'] = 'YES'
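On a hypothetical toy frame (STATE in the index, as in the question), the whole recipe might look like:
import pandas as pd

trial = pd.DataFrame(
    {"SET VALUE": ["NO"] * 5, "AMOUNT": [10, 40, 30, 20, 50]},
    index=pd.Index(["CA", "CA", "CA", "CA", "TX"], name="STATE"),
)

df = trial.reset_index()
idx = df.groupby('STATE').AMOUNT.nlargest(3).index.get_level_values(1)
df.loc[idx, 'SET VALUE'] = 'YES'
print(df)  # the three largest CA amounts and the single TX amount are now 'YES'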
I have been trying to wrap my head around this for a while now and have yet to come up with a solution.
My question is: how do I change the current values in multiple columns based on the column name, if a criterion is met?
I have survey data which has been read in as a pandas csv dataframe:
import pandas as pd
df = pd.read_csv("survey_data")
I have created a dictionary with column names and the values I want in each column if the current column value is equal to 1. Each column contains 1 or NaN. Basically, any column within the data frame ending in '_SA' should become 5, '_A' 4, '_NO' 3, '_D' 2, and '_SD' stays at the current value of 1. All of the NaN values remain as they are. This is the dictionary:
op_dict = {
    'op_dog_SA': 5,
    'op_dog_A': 4,
    'op_dog_NO': 3,
    'op_dog_D': 2,
    'op_dog_SD': 1,
    'op_cat_SA': 5,
    'op_cat_A': 4,
    'op_cat_NO': 3,
    'op_cat_D': 2,
    'op_cat_SD': 1,
    'op_fish_SA': 5,
    'op_fish_A': 4,
    'op_fish_NO': 3,
    'op_fish_D': 2,
    'op_fish__SD': 1}
I have also created a list, op_cols, of the columns within the data frame that I would like to change if the current column value equals 1. Now I have been trying to use something like this, which iterates through the values in those columns and replaces 1 with the mapped value from the dictionary:
for i in df[op_cols]:
    if i == 1:
        df[op_cols].apply(lambda x: op_dict.get(x, x))
df[op_cols]
It is not spitting out an error, but it is not replacing the 1 values with the corresponding values from the dictionary; they remain as 1.
Any advice/suggestions on why this does not work, or on a more efficient way, would be greatly appreciated.
So if I understand your question, you want to replace all the ones in a column with 1, 2, 3, 4, or 5 depending on the column name?
I think all you need to do is iterate through your list and multiply by the value your dict returns:
for col in op_cols:
    df[col] = df[col] * op_dict[col]
This does what you describe and is far faster than replacing every value. NaNs will still be NaNs; you could handle those in the loop with fillna too if you like, as in the sketch below.
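For example, a minimal sketch assuming you want the remaining NaNs replaced with 0:
for col in op_cols:
    df[col] = (df[col] * op_dict[col]).fillna(0)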