Drop duplicate rows based on a column value - python

I'm trying to write a small piece of code that drops duplicate rows based on a column's unique values. What I'm trying to accomplish is getting all the unique values from user_id and, for each of them, dropping duplicates with drop_duplicates while keeping the last occurrence, bearing in mind that the column I want to drop duplicates from is date_time.
code:
for i in recommender_train_df['user_id'].unique():
    recommender_train_df.loc[recommender_train_df['user_id'] == i].drop_duplicates(subset='date_time', keep="last", inplace=True)
The problem is that this code literally does nothing; I tried it again and again with the same result, nothing happens.
Quick note: I have 100k different user_id values (all unique), so I need a solution that works as fast as possible for this problem.

The problem is that when you use df.loc, it returns a copy of the original dataframe, so your modification doesn't affect the original dataframe. See python - What rules does Pandas use to generate a view vs a copy? - Stack Overflow for more detail.
If you want to drop duplicates on part of the dataframe, you can get the indices of the duplicated rows and drop based on those indices:
for i in recommender_train_df['user_id'].unique():
    # Mark duplicated date_time values for this user, keeping the last occurrence
    mask = recommender_train_df.loc[recommender_train_df['user_id'] == i].duplicated(subset='date_time', keep="last")
    indices = mask[mask].index
    recommender_train_df.drop(indices, inplace=True)
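Since you mention 100k unique user_id values, a loop-free alternative (just a sketch, assuming you want to keep the last date_time entry per user) is a single drop_duplicates call over both columns:

# Vectorized: one call over the whole frame, no Python loop over user_id
recommender_train_df = recommender_train_df.drop_duplicates(
    subset=['user_id', 'date_time'], keep='last'
)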

Related

Pandas - drop rows based on two conditions on different columns

Although there are several related questions answered in Pandas, I cannot solve this issue. I have a large dataframe (~49000 rows) and want to drop the rows (~120) that meet two conditions at the same time:
For one column: an exact string
For another column: a NaN value
My code is ignoring the conditions and no row is removed.
to_remove = ['string1', 'string2']
df.drop(df[df['Column 1'].isin(to_remove) & (df['Column 2'].isna())].index, inplace=True)
What am I doing wrong? Thanks for any hint!
Instead of calling drop and passing the index, you can create a mask for the condition for which you want to keep the rows, then take only those rows. Also, there seems to be a logic error: you are checking two different conditions combined by AND on the same column values.
df[~(df['Column 1'].isin(to_remove) & (df['Column 2'].isna()))]
Also, if you need to check conditions on the same column, then you probably want to combine them with or, i.e. |.
If needed, you can reset_index at last.
Also, as a side note, your list to_remove has the same string value twice; I'm assuming that's a typo in the question.
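Putting it together (a small runnable sketch with made-up data, using the column names from the question):

import numpy as np
import pandas as pd

# Made-up data shaped like the question's columns
df = pd.DataFrame({
    'Column 1': ['string1', 'string2', 'other', 'string1'],
    'Column 2': [np.nan, 1.0, np.nan, 2.0],
})
to_remove = ['string1', 'string2']

# Keep only rows that do NOT satisfy both conditions, then renumber the index
df = df[~(df['Column 1'].isin(to_remove) & df['Column 2'].isna())].reset_index(drop=True)
print(df)  # rows kept: ('string2', 1.0), ('other', NaN), ('string1', 2.0)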

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' (the ID column) in dataframe DT is a subset of the column 'A' in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns are different, as is the number of rows.
How can I get the rows from dataframe MR whose 'ID' is equal to an 'ID' in dataframe DT, knowing that values in 'ID' can appear several times in the same column?
DT has 1538 rows and MR has 2060 rows.
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods proposed there (and the goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you just want to get a new dataframe of combined records for the same ID, you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.

Changing pandas DataFrame values based on values from the same and previous rows

I have a pandas df that is sorted by 'patient_id', 'StartTime', 'hour_counter'.
I'm looking to perform two conditional operations on the df:
Change the value of the Delta_Value column
Delete the entire row
Where the condition depends on the values of ParameterID or patient_id in the current row and the row before.
I managed to do that using classic programming (i.e. a simple loop in Python), but not using Pandas.
Specifically, I want to change the 'Delta_Value' to 0 or delete the entire row, if the ParameterID in the current row is different from the one at the row before.
I've tried to use .groupby().first(), but that won't work in some cases because the same patient_id can have multiple occurrences of the same ParameterID with a different ParameterID in between those occurrences. For example, record 10 in the df.
And I need the records to be sorted by the StartTime & hour_counter.
Any suggestions?
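One possible sketch (not from the original thread, and assuming hypothetical column names as described above) compares each row's ParameterID with the previous row of the same patient via groupby().shift():

import pandas as pd

# Hypothetical data with the columns described in the question
df = pd.DataFrame({
    'patient_id':   [1, 1, 1, 1, 2, 2],
    'ParameterID':  [10, 10, 20, 10, 30, 30],
    'StartTime':    pd.to_datetime(['2021-01-01'] * 4 + ['2021-01-02'] * 2),
    'hour_counter': [0, 1, 2, 3, 0, 1],
    'Delta_Value':  [5, 6, 7, 8, 9, 10],
})
df = df.sort_values(['patient_id', 'StartTime', 'hour_counter'])

# ParameterID of the previous row within the same patient
prev = df.groupby('patient_id')['ParameterID'].shift()

# Rows whose ParameterID differs from the row before (ignoring each patient's first row)
changed = prev.notna() & (df['ParameterID'] != prev)

# Option 1: set Delta_Value to 0 on those rows
df.loc[changed, 'Delta_Value'] = 0

# Option 2: drop those rows instead
# df = df[~changed]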

Python pandas issues with .drop and a non-unique index

I have a pandas DataFrame, say df, and I'm trying to drop certain rows by an index. Specifically:
myindex = df[df.column2 != myvalue].index
df.drop(myindex, inplace = True)
This seems to work just fine for most DataFrames, but strange things happen with one DataFrame where I get a non-unique index myindex (I am not quite sure why, since the DataFrame has no duplicate rows). To be more precise, many more rows get dropped than there are entries in the index (in the extreme case I actually drop all rows, even though there are several hundred rows where column2 has myvalue). Extracting only the unique values (myindex.unique()) and dropping the rows using that unique index doesn't help either. At the same time,
df = df[df.column2 != myvalue]
works just as I'd like it to. I'd rather use the in-place drop, but more importantly I would like to understand why the results are not the same with the direct assignment and with the drop method using the index.
Unfortunately, I cannot provide the data as those cannot be published, and since I am not sure what exactly is wrong, I cannot simulate it either. However, I suspect it has something to do with myindex being non-unique (which also confuses me, since there are no duplicate rows in df, but it might very well be that I misunderstand the way the index is created).
If there are repeated values in your index, doing reset_index before might help. That will set your current index as a column and add a new sequential index (with unique values) instead.
df = df.reset_index()
The reason the two methods are not the same is that in one case you are passing a Series of booleans that represents which rows to keep and which to drop (index values are not relevant here). In the case with drop, you are passing a list of index values (which can map to several positions).
Finally, to check if your index has duplicates, you shouldn't check for duplicate rows. Simply do:
df.index.has_duplicates
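To illustrate the difference, here is a toy sketch with made-up data (not the asker's frame):

import pandas as pd

# Toy frame with a non-unique index: the label 0 appears twice
df = pd.DataFrame({'column2': ['x', 'y', 'x']}, index=[0, 0, 1])
myvalue = 'y'

# Label-based drop: the rows with column2 != 'y' carry the labels [0, 1],
# and dropping those labels removes every row that shares them -> empty frame
myindex = df[df['column2'] != myvalue].index
print(df.drop(myindex))

# Boolean masking works row by row, so only the matching row survives
print(df[df['column2'] == myvalue])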

pandas dataframe add new column based on calculation on other columns and avoid chained index

I have a pandas dataframe and I need to add a new column, which would be based on a calculation over specific columns indicated by a column 'site'. I have found a way to do this by resorting to numpy, but it always gives a warning about chained indexing. I am sure there should be a better solution; please help if you know one.
df_num_bin1['Chip_id_3'] = np.where(df_num_bin1[key_site_num] == 1, df_num_bin1[WB_89_S1]*0x100 + df_num_bin1[WB_78_S1], df_num_bin1[WB_89_S2]*0x100 + df_num_bin1[WB_78_S2])
df_num_bin1['Chip_id_2'] = np.where(df_num_bin1[key_site_num] == 1, df_num_bin1[WB_67_S1]*0x100 + df_num_bin1[WB_56_S1], df_num_bin1[WB_67_S2]*0x100 + df_num_bin1[WB_56_S2])
df_num_bin1['Chip_id_1'] = np.where(df_num_bin1[key_site_num] == 1, df_num_bin1[WB_45_S1]*0x100 + df_num_bin1[WB_34_S1], df_num_bin1[WB_45_S2]*0x100 + df_num_bin1[WB_34_S2])
df_num_bin1['Chip_id_0'] = np.where(df_num_bin1[key_site_num] == 1, df_num_bin1[WB_23_S1]*0x100 + df_num_bin1[WB_12_S1], df_num_bin1[WB_23_S2]*0x100 + df_num_bin1[WB_12_S2])
df_num_bin1['mac_low'] = (df_num_bin1['Chip_id_1'].map(int) % 0x10000) * 0x100 + df_num_bin1['Chip_id_0'].map(int) // 0x1000000
The code above has two issues:
1: The value of the column [key_site_num] determines which columns I should extract the chip id data from. In this example it is only site 0 or 1, but it could actually be 2 or 3 as well, so I need a general solution.
2: It generates a chained-indexing warning:
C:\Anaconda2\lib\site-packages\ipykernel\__main__.py:35: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Well, I'm not too sure about your first question, but I think that this will help you.
import pandas as pd

reader = pd.read_csv(path, engine='python')
# New column computed directly from two existing columns, no chained indexing
reader['new'] = reader['treasury.maturity.rate'] + reader['bond.yield']
reader.to_csv('test.csv', index=False)
As you can see, you don't need to get the values before operating with them; just reference the column where they are. To do the same for only specific rows, you can filter the dataframe before creating the new column.
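For the "general solution" part of the question (more than two sites), one possible sketch, assuming hypothetical columns named WB_89_S<site>/WB_78_S<site> and a 'site' column as in the question, is np.select together with a plain column assignment on the original frame, which avoids the SettingWithCopyWarning:

import numpy as np
import pandas as pd

# Made-up frame; the real one has one WB_* column group per site
df = pd.DataFrame({
    'site':     [1, 2, 3, 1],
    'WB_89_S1': [1, 2, 3, 4],     'WB_78_S1': [5, 6, 7, 8],
    'WB_89_S2': [9, 10, 11, 12],  'WB_78_S2': [13, 14, 15, 16],
    'WB_89_S3': [17, 18, 19, 20], 'WB_78_S3': [21, 22, 23, 24],
})

# One condition and one value expression per site, picked row-wise by np.select
sites = [1, 2, 3]
conditions = [df['site'] == s for s in sites]
choices = [df[f'WB_89_S{s}'] * 0x100 + df[f'WB_78_S{s}'] for s in sites]

df['Chip_id_3'] = np.select(conditions, choices)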
