Dropping all the rows after a matching value - Python

I have the data frame below
and a variable ID = 1052107168068132864.
How can I drop every row after (and including) the row whose ID matches that value, so that I get a result like the one below?
I then want to update ID to 1052121282324692992, the new top value.
I want to repeat this in a loop, so that every time I get a new data frame the same operation runs; if the ID is already the top value, nothing should happen.

Assuming IDs are unique, using iloc
df.iloc[:df[df.ID == '1052121282324692992'].index.item()]

Using idxmax
idx = (df['ID'] == ID).idxmax()
new_df = df.iloc[:idx, :]
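Putting the pieces together, a minimal sketch of the loop described in the question, assuming a default RangeIndex, unique integer IDs, and made-up data:
import pandas as pd
# hypothetical frame; in practice each iteration receives fresh data
df = pd.DataFrame({'ID': [1052121282324692992, 1052107168068132864, 1052106143439687680],
                   'text': ['newest', 'middle', 'oldest']})
ID = 1052107168068132864  # the current marker
mask = df['ID'] == ID
if mask.any() and mask.idxmax() > 0:
    df = df.iloc[:mask.idxmax()]  # drop the matching row and everything after it
    ID = df['ID'].iloc[0]         # the new top value becomes the marker
# if the ID is already the top row (or absent), the frame is left untouched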

Related

How can I keep the original index when doing an outer merge and dropping rows?

I have a big dataframe (rates) that contains all the information, and a second dataframe (aig_df) that contains a couple of rows of the first one.
I need a third dataframe that is basically the big one (rates) without the rows in the second one (aig_df), but I need to keep the original index for the rows that remain.
With the code I have now, I get the third dataframe with all the information I need, but with an integer index; I need the index that corresponds to each row (index = stock ticker).
rates = pd.read_sql("SELECT Ticker, Carrier, Product, Name, CDSC, StrategyTerm, ParRate, Spread, Fee, Cap FROM ProductRates", conn).set_index('Ticker')
aig_df = rates.query('Product == "X5 Advantage AnnuitySM"')
competitors_df = pd.merge(rates,
                          aig_df[['Carrier', 'Product', 'Name', 'CDSC', 'StrategyTerm', 'ParRate', 'Spread', 'Fee', 'Cap']],
                          indicator=True, how='outer').query('_merge == "left_only"').drop('_merge', axis=1)
Is there any way to do what I need?
Thanks for your attention.
In your specific case, you don't need a merge to do what you want:
result = rates[rates["Product"] != "X5 Advantage AnnuitySM"]
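If you do need the merge-based approach (for example, to exclude an arbitrary set of rows rather than a single Product), you can keep the Ticker index by turning it into a column before merging and restoring it afterwards. A sketch under that assumption:
# reset_index() turns Ticker into an ordinary column, so the merge cannot discard it
competitors_df = (rates.reset_index()
                       .merge(aig_df.reset_index(), how='outer', indicator=True)
                       .query('_merge == "left_only"')
                       .drop(columns='_merge')
                       .set_index('Ticker'))
And since aig_df is a subset of rates with the same index, rates.drop(aig_df.index) would also do the job in a single call.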

How to assign values to a slice in a pandas data frame

I have to reassign a column value for specific rows based on state. The data frame I am working with has only two columns, SET VALUE and AMOUNT, with STATE in the index. I need to change SET VALUE to 'YES' for the 3 customers with the highest AMOUNT in each state. How can I do this in pandas?
I have attempted to use a for loop over the states in the index, sorting by the AMOUNT column and assigning 'YES' to the first three rows of the SET VALUE column.
for state in trial.index:
    trial[trial.index == state].sort_values('AMOUNT', ascending=False)['SET VALUE'].iloc[0:3] = 'YES'
    print(trial[trial.index == state])
I am expecting the print portion of this loop to include 3 'YES' values but instead all I get are 'NO' values (the default for the column). It is unclear to me why this is happening.
The assignment inside your loop never reaches trial: the chained indexing (the boolean selection, then sort_values, then the column lookup and .iloc) returns a temporary copy, and it is that copy that receives the 'YES' values. Beyond that, I would advise against a repeated index for various reasons, this case being one of them, as it makes the rows harder to update. Here's what I would do:
# make STATE a column, and the index continuous numbers
df = trial.reset_index()
# get the positional indexes of the three largest amounts per state
idx = df.groupby('STATE').AMOUNT.nlargest(3).index.get_level_values(1)
# update
df.loc[idx, 'SET VALUE'] = 'YES'
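A quick demonstration on made-up data (the states and amounts are placeholders):
import pandas as pd
trial = pd.DataFrame({'AMOUNT': [10, 40, 30, 20, 50, 5, 60, 70, 80, 90],
                      'SET VALUE': 'NO'},
                     index=pd.Index(['NY'] * 5 + ['CA'] * 5, name='STATE'))
df = trial.reset_index()
idx = df.groupby('STATE').AMOUNT.nlargest(3).index.get_level_values(1)
df.loc[idx, 'SET VALUE'] = 'YES'
# each state now has exactly three 'YES' rows, on its three largest amounts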

How can I grab rows with max date from Pandas dataframe?

I have a Pandas dataframe that looks like this:
and I want to grab, for each distinct ID, the row with the max date, so that my final result looks something like this:
My date column is of data type 'object'. I have tried grouping and then trying to grab the max like the following:
idx = df.groupby(['ID', 'Item'])['date'].transform('max') == df['date']
df_new = df[idx]
However I am unable to get the desired result.
idxmax
Should work so long as the index is unique, or at least the maximal index label isn't repeated.
df.loc[df.groupby('ID').date.idxmax()]
The transform-based alternative should work as long as the maximal values are unique. Otherwise, you'll get all the rows equal to the maximum.
df[df.groupby('ID')['date'].transform('max') == df['date']]
W-B's go-to solution, which is also very good:
df.sort_values(['ID', 'date']).drop_duplicates('date', keep='last')
The last bit of code from piRSquared's answer is wrong.
We are trying to get distinct IDs, so the column used in drop_duplicates should be 'ID'. keep='last' would then retrieve the last (and max) date for each ID.
df.sort_values(['ID', 'date']).drop_duplicates('ID', keep='last')
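One caveat that applies to all of these solutions: the question says the date column has dtype object. Every approach above compares dates, and string comparison is only reliable for zero-padded ISO-style dates, so it is safer to parse the column first. A one-line sketch, assuming the strings are parseable:
# convert the object column to real datetimes before taking the max
df['date'] = pd.to_datetime(df['date'])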
My answer is a generalization of piRSquared's answer:
manykey is the list of keys from which the mapping is desired (many-to)
onekey is the list of keys to which the mapping is desired (-to-one)
sortkey is a sortable key; asc defaults to True (ascending), as is the Python standard
def get_last(df: pd.DataFrame, manykey: list[str], onekey: list[str], sortkey, asc=True):
    return df.sort_values(sortkey, ascending=asc).drop_duplicates(subset=manykey, keep='last')[manykey + onekey]
In your case the answer should be:
get_last(df, ["ID"], ["Item"], "date")
Note that I pass onekey explicitly because I want to drop the rest of the columns (if there are any in the table) and produce a clean mapping.
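A quick check on a toy frame (hypothetical data), just to show the shape of the result:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 2], 'item': ['a', 'b', 'c'],
                   'date': ['2023-01-01', '2023-02-01', '2023-01-15']})
get_last(df, ['id'], ['item'], 'date')
#    id item
# 2   2    c
# 1   1    b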

Dropping all the rows after a matching value without using the index

I have the data frame below
and a variable ID = 1052107168068132864.
How can I drop every row after (and including) the row whose ID matches that value, so that I get a result like the one below?
I then want to update ID to 1052121282324692992, the new top value.
I want to repeat this in a loop, so that every time I get a new data frame the same operation runs; if the ID is already the top value, nothing should happen.
I have two solutions, but they only work when the index is serial:
df.iloc[:df[df.ID == '1052121282324692992'].index.item()]
or
idx = (df['ID'] == ID).idxmax()
new_df = df.iloc[:idx, :]
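An index-independent alternative is a boolean cumulative-max mask, which works purely on row order, however the index is labelled. A sketch:
# True from the first matching row onward; negating it keeps only the rows above the match
new_df = df[~df['ID'].eq(ID).cummax()]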

Cleaning Data: Replacing Current Column Values with Values mapped in Dictionary

I have been trying to wrap my head around this for a while now and have yet to come up with a solution.
My question is: how do I change the current values in multiple columns, based on the column name, when a criterion is met?
I have survey data which has been read in as a pandas csv dataframe:
import pandas as pd
df = pd.read_csv("survey_data")
I have created a dictionary with column names and the values I want in each column if the current column value is equal to 1. Each column contains 1 or NaN. Basically, any column within the data frame ending in '_SA' should become 5, '_A' 4, '_NO' 3, '_D' 2, and '_SD' stays at the current value 1. All of the 'NaN' values remain as they are. This is the dictionary:
op_dict = {
    'op_dog_SA': 5,
    'op_dog_A': 4,
    'op_dog_NO': 3,
    'op_dog_D': 2,
    'op_dog_SD': 1,
    'op_cat_SA': 5,
    'op_cat_A': 4,
    'op_cat_NO': 3,
    'op_cat_D': 2,
    'op_cat_SD': 1,
    'op_fish_SA': 5,
    'op_fish_A': 4,
    'op_fish_NO': 3,
    'op_fish_D': 2,
    'op_fish_SD': 1}
I have also created a list, op_cols, of the columns within the data frame that I would like to change if the current column value is 1. I have been trying something like this, which iterates through the values in those columns and replaces 1 with the mapped value from the dictionary:
for i in df[op_cols]:
    if i == 1:
        df[op_cols].apply(lambda x: op_dict.get(x, x))
df[op_cols]
It does not raise an error, but it does not replace the 1 values with the corresponding value from the dictionary; they remain 1.
Any advice or suggestions on why this does not work, or on a more efficient way, would be greatly appreciated.
So if I understand your question, you want to replace all the ones in a column with 1, 2, 3, 4 or 5 depending on the column name?
I think all you need to do is iterate through your list and multiply by the value your dict returns:
for col in op_cols:
    df[col] = df[col] * op_dict[col]
This does what you describe and is far faster than replacing every value. NaNs will still be NaNs; you can handle those in the loop with fillna if you like.
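If you would rather skip the explicit loop, the same multiply-by-column-name idea can be vectorized by aligning a Series on the column axis. A sketch, assuming op_cols matches the dictionary keys:
import pandas as pd
# the reindexed Series lines up one factor per column, so every column in
# op_cols is multiplied in a single operation; NaNs stay NaN
df[op_cols] = df[op_cols].mul(pd.Series(op_dict)[op_cols])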
