How do I change a cell's contents within an if statement? - python

Here I'm trying to set the days_employed cell to NaN in each row that has 'retiree' in a different column. As written, my code sets the entire days_employed column to NaN, whereas I only want the cell in that specific row to be NaN:
for row in df['income_type']:
    if row == 'retiree':
        df['days_employed'] = float('Nan')
Is there something similar to row that I can use in df['days_employed'] = float('Nan')?

Since this looks like pandas, you can vectorize it by applying a condition to the entire column and then assigning NaN to the matching cells:
df.loc[df['income_type'].eq('retiree'), 'days_employed'] = np.nan
Or, reassign the column using np.where:
df['days_employed'] = np.where(
    df['income_type'].eq('retiree'), np.nan, df['days_employed'])
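For context, here is a minimal, self-contained sketch of the .loc route (the column names follow the question; the data itself is invented):

import numpy as np
import pandas as pd

# invented sample frame matching the question's column names
df = pd.DataFrame({'income_type': ['employee', 'retiree', 'business'],
                   'days_employed': [1200.0, 9000.0, 3400.0]})

# blank out days_employed only where income_type is 'retiree'
df.loc[df['income_type'].eq('retiree'), 'days_employed'] = np.nan
print(df)  # only the retiree row has NaN in days_employed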

When you do df[column] you are referring to the entire column; that is why all of days_employed is being changed to NaN. Try this and let me know if it works:
for row in df['income_type']:
    if row == 'retiree':
        # use .loc here; chained indexing like df['days_employed'][x] can
        # fail silently with a SettingWithCopyWarning
        df.loc[x, 'days_employed'] = float('nan')
Note: x is the index of the specific cell in days_employed that should be changed to float('nan').
If you do not know the index of the cell you want to change, or if it changes often, simply use a for loop to look for it and set x to that index.
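If you do want to keep the loop, here is a rough sketch of that index lookup on toy data (enumerate supplies the position x):

import pandas as pd

df = pd.DataFrame({'income_type': ['employee', 'retiree', 'retiree'],
                   'days_employed': [1200.0, 9000.0, 8500.0]})

# find the position of each matching cell, then assign through .loc
for x, income_type in enumerate(df['income_type']):
    if income_type == 'retiree':
        df.loc[df.index[x], 'days_employed'] = float('nan')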

Related

New column in DataFrame from other columns AND rows

I want to create a new column, V, in an existing DataFrame, df. I would like the value of the new column to be the difference between the value in the 'x' column in that row, and the value of the 'x' column in the row below it.
As an example, if the 'x' values in two consecutive rows are 93.244598 and 93.093285, I want the value of the new column in the first of those rows to be 93.244598 - 93.093285 = 0.151313.
I know how to create a new column based on existing columns in Pandas, but I don't know how to reference other rows using this method. Is there a way to do this that doesn't involve iterating over the rows in the dataframe? (since I have read that this is generally a bad idea)
You can use pandas.DataFrame.shift for your use case.
The last row will not have a row below it to subtract, so you will get NaN for that cell.
df['temp_x'] = df['x'].shift(-1)
df['new_col'] = df['x'] - df['temp_x']
Or as a one-liner:
df['new_col'] = df['x'] - df['x'].shift(-1)
The new_col column will contain the expected data.
An ideal solution is to use diff:
df['new'] = df['x'].diff(-1)
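A quick check with the numbers from the question (the third value is invented to complete the frame):

import pandas as pd

df = pd.DataFrame({'x': [93.244598, 93.093285, 92.930000]})

df['new'] = df['x'].diff(-1)  # x[i] - x[i+1]; the last row gets NaN
print(df['new'])  # 0.151313, 0.163285, NaN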

Missing Value detection fails for completely empty cells python pandas

I have created a function for which the input is a pandas dataframe.
It should return the row-indices of the rows with a missing value.
It works for all the defined missingness values except when the cell is entirely empty, even though I tried to specify this in the missing_values list as [..., ""].
What could be the issue here? Or is there even a more intuitive way to solve this in general?
def missing_values(x):
    df = x
    missing_values = ["NaN","NAN","NA","Na","n/a","na","--","-"," ","","None","0","-inf"] # common ways to indicate missingness
    observations = df.shape[0] # number of observations (rows)
    variables = df.shape[1] # number of variables (columns)
    row_index_list = []
    for n in range(0, variables): # iterate over all variables
        column_list = [] # collects the values of one variable
        for i in range(0, observations): # iterate over every observation of the variable
            column_list.append(df.iloc[i, n]) # and add the values to the list
        for i in range(0, len(column_list)): # now for every value
            if column_list[i] in missing_values: # check whether it is a missing one
                row_index_list.append(i) # and if yes, append the row index
    finished = list(set(row_index_list)) # set ensures each index appears only once even with multiple hits per row
    return finished
There might be spurious whitespace, so try adding strip() on this line:
if column_list[i].strip() in missing_values: # strip whitespace first (assumes string cells)
Also a simpler way to get the indexes of rows containing missing_values is with isin() and any(axis=1):
x = x.replace(r'\s+', '', regex=True)
row_index_list = x[x.isin(missing_values).any(axis=1)].index
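For illustration, a self-contained sketch of the isin/any approach on toy data; note that a truly empty CSV cell arrives as np.nan, which isin only matches if np.nan itself is in the list:

import numpy as np
import pandas as pd

# shortened version of the question's sentinel list, plus np.nan for empty cells
missing_values = ["NaN", "NA", "n/a", "--", "-", " ", "", "None", "0", "-inf", np.nan]

df = pd.DataFrame({'a': ['1', 'NA', '3'], 'b': ['x', 'y', np.nan]})

row_index_list = df[df.isin(missing_values).any(axis=1)].index.tolist()
print(row_index_list)  # [1, 2]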
When you import a file into Pandas using, for example, read_csv or read_excel, a literally missing cell comes in as np.nan (or another numpy null value) rather than as an empty string.
(Sorry, my bad right here: I naively tested np.nan == np.nan, which always evaluates to False.)
You can replace the np.nan value first with:
df = df.replace(np.nan, 'NaN')
then your function can catch it.
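A sketch of that replace step on invented data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
df = df.replace(np.nan, 'NaN')  # truly empty cells become the literal string 'NaN'
# 'NaN' is in the question's missing_values list, so the function now detects it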
Another way is to use isna() in pandas:
df.isna()
This returns the same DataFrame, but each cell holds a boolean value: True wherever the cell is np.nan.
If you do df.isna().any(), it returns a Series with True for any column that contains a null value.
If you want to retrieve the row IDs instead, simply add the parameter axis=1 to any():
df.isna().any(axis=1)
This returns a Series flagging all the rows with an np.nan value.
Now you have boolean values that indicate which rows contain null values. If you put these booleans in a list and apply it to df.index, you get the index values of the rows containing nulls.
booleanlist = df.isna().any(axis=1).tolist()
null_row_id = df.index[booleanlist]
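Those two steps also collapse into one line; a short sketch on an invented frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', 'y']})
null_row_id = df.index[df.isna().any(axis=1)]
print(list(null_row_id))  # [1]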

Comparing dataframe cell value to previous cell value

I am trying to iterate through each row of a pandas dataframe and compare a certain cell's value to the same cell in the previous row. I imagined this could be done using .shift(), but it returns a Series instead of the cell value.
I have seen some usage of groupby and iloc for accessing the value of a cell but not for iterative comparisons, and using some sort of incrementing counter method or manually storing the value of each cell and then comparing doesn't seem very efficient.
Here is what I imagined would work, but no joy.
for index, row in df.iterrows():
    if row['apm'] > df['apm'].shift(1):
        # do something
You can just create a new column (e.g. flag) to indicate whether or not the boolean check is true.
df = df.assign(flag=df['apm'].gt(df['apm'].shift()))
Then you could perform your action based on the value of this column.
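A toy run of that flag column (data invented):

import pandas as pd

df = pd.DataFrame({'apm': [10, 12, 11, 15]})
df = df.assign(flag=df['apm'].gt(df['apm'].shift()))
print(df)
# row 0 is False because it compares against NaN; rows 1 and 3 are True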
df['apm'].shift(1)
returns a Series holding the previous value for each row, with NaN in the first row.
Thus, the code
df['apm'] > df['apm'].shift(1)
will return a Series of boolean values, True or False for each row.
Would that be enough for your task?
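To make that concrete, the raw comparison on invented data:

import pandas as pd

df = pd.DataFrame({'apm': [10, 12, 11]})
print(df['apm'] > df['apm'].shift(1))
# False, True, False -- the first row compares against NaN, and any
# comparison with NaN evaluates to False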

Cleaning Data: Replacing Current Column Values with Values mapped in Dictionary

I have been trying to wrap my head around this for a while now and have yet to come up with a solution.
My question is: how do I change the current values in multiple columns, based on the column name, if a criterion is met?
I have survey data which has been read in as a pandas csv dataframe:
import pandas as pd
df = pd.read_csv("survey_data")
I have created a dictionary of column names and the values I want in each column if the current value equals 1. Each column contains either 1 or NaN. Basically, in any column whose name ends in '_SA' the 1 becomes 5, '_A' becomes 4, '_NO' becomes 3, '_D' becomes 2, and '_SD' keeps the current value 1. All of the NaN values remain as is. This is the dictionary:
op_dict = {
    'op_dog_SA': 5,
    'op_dog_A': 4,
    'op_dog_NO': 3,
    'op_dog_D': 2,
    'op_dog_SD': 1,
    'op_cat_SA': 5,
    'op_cat_A': 4,
    'op_cat_NO': 3,
    'op_cat_D': 2,
    'op_cat_SD': 1,
    'op_fish_SA': 5,
    'op_fish_A': 4,
    'op_fish_NO': 3,
    'op_fish_D': 2,
    'op_fish__SD': 1}
I have also created a list called op_cols of the columns within the data frame that I would like changed when the current value equals 1. I have been trying something like this, which iterates through the values in those columns and replaces a 1 with the mapped value from the dictionary:
for i in df[op_cols]:
    if i == 1:
        df[op_cols].apply(lambda x: op_dict.get(x, x))
df[op_cols]
It is not spitting out an error, but it is not replacing the 1 values with the corresponding values from the dictionary; they remain as 1.
Any advice on why this does not work, or suggestions for a more efficient way, would be greatly appreciated.
So if I understand your question, you want to replace all the ones in a column with 1, 2, 3, 4, or 5 depending on the column name?
I think all you need to do is iterate through your list and multiply by the value your dict returns:
for col in op_cols:
    df[col] = df[col] * op_dict[col]
This does what you describe and is far faster than replacing every value. NaNs will still be NaNs; you could handle those in the loop with fillna too, if you like. A sketch follows.
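Here is the multiply trick on an invented slice of the survey frame, reusing part of the dictionary above:

import numpy as np
import pandas as pd

op_dict = {'op_dog_SA': 5, 'op_dog_A': 4}  # subset of the mapping above
op_cols = ['op_dog_SA', 'op_dog_A']

df = pd.DataFrame({'op_dog_SA': [1, np.nan, 1],
                   'op_dog_A': [np.nan, 1, np.nan]})

for col in op_cols:
    df[col] = df[col] * op_dict[col]  # 1 -> mapped value, NaN stays NaN

print(df)  # op_dog_SA holds 5.0/NaN/5.0, op_dog_A holds NaN/4.0/NaN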

.fillna column if two cells are empty in Pandas

Can somebody tell me why in my for loop
df_all = pd.read_csv("assembly_summary.txt", delimiter='\t', index_col=0)

for row in df_all.index:
    if pd.isnull(df_all.infraspecific_name[row]) and pd.isnull(df_all.isolate[row]):
        df_all.infraspecific_name.fillna('NA', inplace=True)

print(df_all[['infraspecific_name', 'isolate']])
.fillna fills the specified cell even when the column referred to in the second part of the if statement is not null?
I am trying to use .fillna ONLY if both of the cells referred to in my if statement are null.
I also tried changing the second-to-last line to df_all.infraspecific_name[row].fillna('NA', inplace=True), which doesn't work either.
df_all.loc[row, ['infraspecific_name']].fillna('NA', inplace=True) avoids the over-filling, but then when both cells infraspecific_name and isolate ARE null, it doesn't fill the cell with 'NA'.
I am not sure if my lack of understanding is in Python loops or Pandas.
The .csv file I am using can be found at ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
Since you are indexing on your first column, you could use update.
df_all['infraspecific_name']
returns a Series of only the specified column. Appending the following performs .fillna only on the rows where the condition is True:
[(df_all['infraspecific_name'].isnull()) & (df_all['isolate'].isnull())].fillna('NA')
You can achieve all your steps in one line by combining the above and preceding it all with update:
df_all.update(df_all['infraspecific_name'][(df_all['infraspecific_name'].isnull()) & (df_all['isolate'].isnull())].fillna('NA'))
Number of rows changed:
len(df_all[df_all['infraspecific_name'] == 'NA'])
1825
The rest of the dataframe should be intact.
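A toy run of the update route (the frame is invented; update aligns on the index and overwrites only the cells present in the Series you pass in):

import numpy as np
import pandas as pd

df_all = pd.DataFrame({'infraspecific_name': [np.nan, 'strain A', np.nan],
                       'isolate': [np.nan, 'iso 1', 'iso 2']})

subset = df_all['infraspecific_name'][
    df_all['infraspecific_name'].isnull() & df_all['isolate'].isnull()
].fillna('NA')

df_all.update(subset)  # only row 0, where both columns are null, is overwritten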
This should get you what you want:
csvfile = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt'
df_all = pd.read_csv(csvfile, delimiter='\t', index_col=0)
mask = df_all[['infraspecific_name', 'isolate']].isnull().all(axis=1)
df_all.loc[mask, 'infraspecific_name'] = 'NA'
The 3rd line takes the values df_all[['infraspecific_name', 'isolate']], tests each one for null with .isnull(), and then .all(axis=1) checks whether every column in a given row is null.
The 4th line uses that mask to locate the values that need changing.
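The same toy frame as above, run through the mask route:

import numpy as np
import pandas as pd

df_all = pd.DataFrame({'infraspecific_name': [np.nan, 'strain A', np.nan],
                       'isolate': [np.nan, np.nan, 'iso 1']})

mask = df_all[['infraspecific_name', 'isolate']].isnull().all(axis=1)
df_all.loc[mask, 'infraspecific_name'] = 'NA'
# only row 0, where both columns are null, gets filled; rows 1 and 2 are untouched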
