I have to reassign a column value for specific rows based on state. The data frame I am working with has only two columns, SET VALUE and AMOUNT, with STATE in the index. I need to change SET VALUE to 'YES' for the 3 customers with the highest AMOUNT in each state. How can I do this in pandas?
I have attempted to loop over the states in the index, sort by the AMOUNT column, and assign 'YES' to the first three rows of the SET VALUE column:
for state in trial.index:
trial[trial.index == state].sort_values('AMOUNT', ascending = False)['SET VALUE'].iloc[0:3] = 'YES'
print(trial[trial.index == state])
I am expecting the print portion of this loop to show 3 'YES' values, but instead all I get are 'NO' values (the column's default). It is unclear to me why this is happening.
I would advise against a repeated index for various reasons, this case being one: it makes updating rows harder. Note also that your loop assigns into the temporary copy returned by sort_values, not into trial itself, which is why nothing changes. Here's what I would do:
# make STATE a column, and index continuous numbers
df = df.reset_index()
# get the actual indexes of the largest amounts
idx = df.groupby('STATE').AMOUNT.nlargest(3).index.get_level_values(1)
# update
df.loc[idx, 'SET VALUE'] = 'YES'
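Here is the same approach end to end on a small made-up frame (the STATE/AMOUNT/SET VALUE names come from the question; the data is invented for illustration):

```python
import pandas as pd

# Toy frame mimicking the question's layout: STATE in the index,
# SET VALUE defaulting to 'NO', and an AMOUNT per customer
trial = pd.DataFrame(
    {"SET VALUE": ["NO"] * 8,
     "AMOUNT": [10, 40, 30, 20, 50, 5, 60, 25]},
    index=pd.Index(["CA", "CA", "CA", "CA", "TX", "TX", "TX", "TX"],
                   name="STATE"),
)

# Move STATE out of the index so row labels are unique integers
df = trial.reset_index()

# nlargest(3) per state returns a MultiIndex (STATE, original row label);
# level 1 holds the row labels that .loc can update in place
idx = df.groupby("STATE")["AMOUNT"].nlargest(3).index.get_level_values(1)
df.loc[idx, "SET VALUE"] = "YES"

print(df)
```

Each state ends up with exactly three 'YES' rows; the rows with the smallest AMOUNT per state stay 'NO'.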
I want to return only those rows of the DataFrame df where ALL values of df[list of column names] are less than a user-supplied float value.
Note: df[list of column names] is a DataFrame containing a specific list of columns that we cannot hard-code.
For example:
First we filter the specific columns that start with D_BALANCE:
cols= list(df.loc[:,df.columns.str.startswith("D_BALANCE ")])
Then we select all columns except the last one:
rest_except_end_col= cols[0:-1]
Our main master data frame is df. Now I want to check whether df[rest_except_end_col] < 1000.0; only if ALL of these columns are less than 1000.0 in a row do I want to get that record (the complete row) from df.
df[(df[rest_except_end_col] < 1000.0).all(axis=1)]
pd.DataFrame.all():
Return whether all elements are True, potentially over an axis.
Returns True unless there is at least one element within a Series or along a DataFrame axis that is False or equivalent (e.g. zero or empty).
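A quick sketch of the whole pipeline on invented data (the D_BALANCE column names and the 1000.0 threshold are the question's; the values are made up):

```python
import pandas as pd

# Hypothetical frame: three balance columns plus an unrelated column
df = pd.DataFrame({
    "D_BALANCE 1": [500.0, 1500.0, 200.0],
    "D_BALANCE 2": [900.0, 100.0, 2500.0],
    "D_BALANCE 3": [100.0, 50.0, 75.0],
    "NAME": ["a", "b", "c"],
})

# Columns starting with the prefix, minus the last one
cols = list(df.loc[:, df.columns.str.startswith("D_BALANCE ")])
rest_except_end_col = cols[0:-1]

# Keep only rows where every selected column is below the threshold
result = df[(df[rest_except_end_col] < 1000.0).all(axis=1)]
print(result)
```

Only the first row survives here: it is the only one where both D_BALANCE 1 and D_BALANCE 2 are below 1000.0.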
I have created a function for which the input is a pandas dataframe.
It should return the row-indices of the rows with a missing value.
It works for all the defined missingness values except when the cell is entirely empty, even though I tried to specify this in the missing_values list as [..., ""].
What could be the issue here? Or is there even a more intuitive way to solve this in general?
def missing_values(x):
    df = x
    missing_values = ["NaN", "NAN", "NA", "Na", "n/a", "na", "--", "-", " ", "", "None", "0", "-inf"]  # common ways to indicate missingness
    observations = df.shape[0]  # number of observations (rows)
    variables = df.shape[1]  # number of variables (columns)
    row_index_list = []
    for n in range(variables):  # iterate over all variables
        column_list = []  # collect the values of one variable
        for i in range(observations):  # iterate over every observation of that variable
            column_list.append(df.iloc[i, n])
        for i in range(len(column_list)):  # now for every value
            if column_list[i] in missing_values:  # check whether it is a missing one
                row_index_list.append(i)  # append the row index itself; list.index() would always return the first occurrence
    finished = list(set(row_index_list))  # set makes sure each index appears only once even with multiple hits in one row
    return finished
There might be spurious whitespace, so try adding strip() on this line:
if column_list[i].strip() in missing_values: #it is checked, whether the value is a Missing one
Also a simpler way to get the indexes of rows containing missing_values is with isin() and any(axis=1):
x = x.replace(r'\s+', '', regex=True)
row_index_list = x[x.isin(missing_values).any(axis=1)].index
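Putting those two lines together on a small string-typed frame (invented data; the missing_values list is a shortened version of the question's):

```python
import pandas as pd

missing_values = ["NaN", "NA", "n/a", "--", "-", " ", "", "None", "0", "-inf"]

# Sample frame of strings, as in the question's survey data
x = pd.DataFrame({
    "a": ["1", "NA", "3", ""],
    "b": ["x", "y", "--", "z"],
})

# Strip whitespace everywhere, then flag rows holding any missingness marker
x = x.replace(r"\s+", "", regex=True)
row_index_list = x[x.isin(missing_values).any(axis=1)].index
print(list(row_index_list))
```

Rows 1 ("NA"), 2 ("--") and 3 (the empty cell) are flagged; row 0 is clean.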
When you import a file to pandas using, for example, read_csv or read_excel, a literally missing value can then only be specified as np.nan or another null type from the numpy library.
(Sorry, my bad right here; I was being really silly when I did np.nan == np.nan.)
You can replace the np.nan value first with:
df = df.replace(np.nan, 'NaN')
then your function can catch it.
Another way is to use isna() in pandas,
df.isna()
This will return the same DataFrame, but each cell contains a boolean value: True wherever the cell is np.nan.
If you do df.isna().any(),
it will return a Series with True for any column that contains a null value.
If you want to retrieve the row IDs instead, simply add the parameter axis=1 to any():
df.isna().any(axis=1)
This will return a Series flagging all the rows with an np.nan value.
Now you have boolean values that indicate which rows contain nulls. If you put these booleans in a list and apply that to df.index, it will pull out the index values of the rows containing nulls.
booleanlist = df.isna().any(axis=1).tolist()
null_row_id = df.index[booleanlist]
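As a minimal sketch of that isna() route, on invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a NaN and a None in different rows
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", None],
})

# Boolean per row: does any cell hold a null value?
booleanlist = df.isna().any(axis=1).tolist()
null_row_id = df.index[booleanlist]
print(list(null_row_id))
```

Rows 1 and 2 are reported; isna() catches both np.nan and None without any string comparison.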
I have the below data frame,
and I have a variable ID = 1052107168068132864.
How can I filter the values to drop every row after that ID, including the ID itself, and get a result like below?
I then want to update ID to 1052121282324692992 as the current value.
I want to repeat this in a loop, so that every time I get a new data frame the same operation keeps going, and if that ID is the top value then nothing should happen.
Assuming IDs are unique, using iloc:
df.iloc[:df[df.ID == '1052121282324692992'].index.item()]
Or using idxmax:
idx = (df['ID'] == ID).idxmax()
new_df = df.iloc[:idx, :]
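A worked sketch of the idxmax variant on made-up IDs. Note that both snippets assume a default RangeIndex, because idxmax returns a label that iloc then treats as a position:

```python
import pandas as pd

# Hypothetical frame of string IDs, newest first (column name from the question)
df = pd.DataFrame({"ID": ["1052121282324692992",
                          "1052107168068132864",
                          "1052100000000000000"]})

ID = "1052107168068132864"

# idxmax returns the label of the first True, i.e. the first matching row;
# slicing with iloc keeps everything before it, dropping the ID row itself
idx = (df["ID"] == ID).idxmax()
new_df = df.iloc[:idx, :]
print(new_df)
```

Here the match sits at position 1, so only the newer ID above it survives; if the ID were the top row, idx would be 0 and new_df would be empty.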
I have two solutions, but they only work when the index is serial:
df.iloc[:df[df.ID == '1052121282324692992'].index.item()]
or
idx = (df['ID'] == ID).idxmax()
new_df = df.iloc[:idx, :]
I have been trying to wrap my head around this for a while now and have yet to come up with a solution.
My question is: how do I change current column values in multiple columns, based on the column name, if a criterion is met?
I have survey data which has been read in as a pandas csv dataframe:
import pandas as pd
df = pd.read_csv("survey_data")
I have created a dictionary with column names and the values I want in each column if the current column value is equal to 1. Each column contains 1 or NaN. Basically any column within the data frame ending in '_SA' =5, '_A' =4, '_NO' =3, '_D' =2 and '_SD' stays as the current value 1. All of the 'NaN' values remain as is. This is the dictionary:
op_dict = {
'op_dog_SA':5,
'op_dog_A':4,
'op_dog_NO':3,
'op_dog_D':2,
'op_dog_SD':1,
'op_cat_SA':5,
'op_cat_A':4,
'op_cat_NO':3,
'op_cat_D':2,
'op_cat_SD':1,
'op_fish_SA':5,
'op_fish_A':4,
'op_fish_NO':3,
'op_fish_D':2,
'op_fish__SD':1}
I have also created a list, op_cols, of the columns within the data frame I would like changed if the current column value is 1. Now I have been trying something like this that iterates through the values in those columns and replaces 1 with the mapped value from the dictionary:
for i in df[op_cols]:
if i == 1:
df[op_cols].apply(lambda x: op_dict.get(x,x))
df[op_cols]
It does not raise an error, but it is not replacing the 1 values with the corresponding dictionary values; they remain as 1.
Any advice/suggestions on why this would not work or a more efficient way would be greatly appreciated
So if I understand your question, you want to replace all the ones in a column with 1-5 depending on the column name?
(As an aside, your loop doesn't work because iterating over df[op_cols] yields column names, not values, and the result of apply is never assigned back to the frame.)
I think all you need to do is iterate through your list and multiply by the value your dict returns:
for col in op_cols:
df[col] = df[col]*op_dict[col]
This does what you describe and is far faster than replacing every value. NaNs will still be NaNs; you could also handle those in the loop with fillna if you like.
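A small demonstration with a subset of the question's columns (the dog columns; the 1/NaN pattern is invented):

```python
import numpy as np
import pandas as pd

# Survey-style frame: each cell is 1 (box ticked) or NaN
df = pd.DataFrame({
    "op_dog_SA": [1, np.nan],
    "op_dog_A": [np.nan, 1],
    "op_dog_NO": [1, 1],
})

op_dict = {"op_dog_SA": 5, "op_dog_A": 4, "op_dog_NO": 3}
op_cols = list(op_dict)

# 1 * mapped value gives the recoded score; NaN * anything stays NaN,
# so missing answers are left untouched
for col in op_cols:
    df[col] = df[col] * op_dict[col]

print(df)
```

Respondent 0 ends up with 5 in op_dog_SA and 3 in op_dog_NO, while the unanswered op_dog_A cell is still NaN.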