My pandas DataFrame has 17543 rows. I want to drop a row only if every column in it contains NaN. I tried the instructions in the linked question "drop rows in for loop",
but they did not help. The following is my code:
NullRows=0
for i in range(len(SetMerge.index)):
    if(SetMerge.iloc[i].isnull().all()):
        df=SetMerge.drop(SetMerge.index[i])
        NullRows +=1
print("total null rows : ", NullRows)
Only one row is dropped: df ends up with 17542 rows, whereas NullRows is 30.
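drop doesn't mutate your SetMerge; it returns a new DataFrame. Each iteration therefore starts again from the full SetMerge and overwrites df with a copy that is missing only one row. You need to reassign SetMerge after drop, or use another function.
This is explained in the answer at the link you posted. You can also specify inplace=True to make drop mutate the DataFrame, but inside this loop that shifts the positions used by iloc, so it is safer to collect the labels first and drop them in one call.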
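As a loop-free alternative, pandas can drop all-NaN rows in a single call. The sketch below uses a small made-up frame in place of SetMerge (the column names and values are assumptions for illustration only); dropna(how="all") removes a row only when every column in it is NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for SetMerge; the real frame has 17543 rows.
SetMerge = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, np.nan],
})

# Boolean Series: True where every column of the row is NaN.
all_nan = SetMerge.isnull().all(axis=1)
NullRows = int(all_nan.sum())

# Drop only the rows in which all values are NaN; partially filled rows stay.
df = SetMerge.dropna(how="all")

print("total null rows : ", NullRows)
```

Row 0 is kept here because only one of its columns is NaN.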
Related
I'm trying to write a small piece of code that drops duplicate rows per user. For each unique value of user_id, I want to call drop_duplicates on that user's rows, keeping the last occurrence. The column I want to deduplicate on is date_time.
code:
for i in recommender_train_df['user_id'].unique():
    recommender_train_df.loc[recommender_train_df['user_id'] == i].drop_duplicates(subset='date_time', keep="last", inplace=True)
The problem with this code is that it does literally nothing. I have tried repeatedly and the result is always the same: nothing happens.
Quick note: I have 100k unique user_id values, so I need a solution that works as fast as possible.
The problem is that indexing with df.loc and a boolean mask returns a copy of the selected rows, so an in-place modification of that result doesn't affect the original dataframe. See python - What rules does Pandas use to generate a view vs a copy? - Stack Overflow for more detail.
If you want to drop duplicates within part of the dataframe, you can get the index of the duplicated rows and drop based on these indices:
for i in recommender_train_df['user_id'].unique():
    # marks duplicated date_time rows of user i, except the last occurrence
    mask = recommender_train_df.loc[recommender_train_df['user_id'] == i].duplicated(subset='date_time', keep="last")
    indices = mask[mask].index
    recommender_train_df.drop(indices, inplace=True)
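With 100k users, a Python-level loop is likely to be slow. Since the goal is "one row per (user_id, date_time) pair, keeping the last", a possible loop-free sketch passes both columns as the subset to drop_duplicates. The sample data and the item column below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample; only user_id and date_time come from the question.
recommender_train_df = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "date_time": ["2020-01-01", "2020-01-01", "2020-01-02",
                  "2020-01-01", "2020-01-01"],
    "item":      ["a", "b", "c", "d", "e"],
})

# Rows count as duplicates only when both user_id and date_time match,
# which is equivalent to deduplicating date_time within each user.
deduped = recommender_train_df.drop_duplicates(
    subset=["user_id", "date_time"], keep="last"
)

print(deduped)
```

The result is assigned to a new name rather than modified in place, which avoids the view-versus-copy question entirely.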
I'd like to select one column only but all the rows except last row.
If I did it like below, the result is empty.
a = data_vaf.loc[:-1, 'Area']
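loc selects by label.
iloc selects by integer position.
The two cannot be mixed in a single call: with loc, the slice :-1 is interpreted as labels rather than positions, which is why the result is empty on a default integer index.
Therefore, exclude the last row with iloc, then select the column Area,
as shown in the comment from @ThePyGuy:
data_vaf.iloc[:-1]['Area']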
The general form is
iloc[row, column]
and
iloc[row] does the same thing as iloc[row, :]
df.iloc[:-1] does the same thing as df[:-1]
There are multiple ways to do this. One, as addressed in the comments, uses iloc:
df['col'].iloc[:-1]
Another drops the last row by its label, then selects the column:
out = data_vaf.drop(data_vaf.index[-1])['Area']
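To see the variants above side by side, here is a small sketch on a made-up three-row frame (the values and the Other column are assumptions; only data_vaf and Area come from the question). It assumes the default integer index.

```python
import pandas as pd

# Hypothetical data with a default RangeIndex (labels 0, 1, 2).
data_vaf = pd.DataFrame({"Area": [10, 20, 30], "Other": ["x", "y", "z"]})

# Label-based slice: no label lies at or before -1, so nothing is selected.
by_loc = data_vaf.loc[:-1, "Area"]

# Position-based slice of the rows, then column selection by name.
by_iloc = data_vaf.iloc[:-1]["Area"]

# Purely positional form: look up the column position first.
by_pos = data_vaf.iloc[:-1, data_vaf.columns.get_loc("Area")]

# Drop the row with the last index label, then select the column.
by_drop = data_vaf.drop(data_vaf.index[-1])["Area"]

print(by_loc.empty, by_iloc.tolist(), by_pos.tolist(), by_drop.tolist())
```

Note that the drop variant removes every row carrying the last label, so it differs from the iloc variants if the index contains duplicate labels.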
I have the following pandas df:
It is sorted by 'patient_id', 'StartTime', 'hour_counter'.
I'm looking to perform two conditional operations on the df:
Change the value of the Delta_Value column
Delete the entire row
Where the condition depends on the values of ParameterID or patient_id in the current row and the row before.
I managed to do that using classic programming (i.e. a simple loop in Python), but not using Pandas.
Specifically, I want to change the 'Delta_Value' to 0 or delete the entire row, if the ParameterID in the current row is different from the one at the row before.
I've tried to use .groupby().first(), but that won't work in some cases because the same patient_id can have multiple occurrences of the same ParameterID with a different ParameterID in between those occurrences, for example record 10 in the df.
I also need the records to stay sorted by StartTime and hour_counter.
Any suggestions?
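One possible approach is to compare each row with the previous one using shift, which avoids the explicit loop and preserves row order. The sketch below is only an illustration: the column names come from the question, but the data is made up, and treating the first row of each patient as "changed" is an assumption about the intended behaviour.

```python
import pandas as pd

# Hypothetical data, already sorted as described in the question.
df = pd.DataFrame({
    "patient_id":  [1, 1, 1, 1, 2, 2],
    "ParameterID": [10, 10, 20, 10, 10, 10],
    "Delta_Value": [5, 3, 4, 2, 7, 1],
})

# Values from the previous row (NaN for the very first row).
prev_param = df["ParameterID"].shift()
prev_patient = df["patient_id"].shift()

# True where ParameterID or patient_id differs from the row before.
changed = (df["ParameterID"] != prev_param) | (df["patient_id"] != prev_patient)

# Option 1: set Delta_Value to 0 on those rows.
zeroed = df.copy()
zeroed.loc[changed, "Delta_Value"] = 0

# Option 2: delete those rows entirely.
dropped = df.loc[~changed].reset_index(drop=True)
```

Because the comparison is with the immediately preceding row, a ParameterID that reappears after a different one (the case that breaks groupby().first()) is flagged again.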
I see a lot of questions related to dropping rows that have a certain value in a column, or dropping the entirety of columns, but pretend we have a Pandas Dataframe like the one below.
In this case, how could one write a line to go through the CSV, and drop all rows like 2 and 4? Thank you.
You could try
~((~df).all(axis=1))
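to get a boolean mask over the rows, assuming the dataframe holds boolean values: True for rows to keep, False for rows in which every value is False. To get the dataframe with just the rows to keep, you would use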
df = df[~((~df).all(axis=1))]
A more detailed explanation is here:
Delete rows from a pandas DataFrame based on a conditional expression involving len(string) giving KeyError
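The double negation above can be hard to read. As a sketch, assuming the frame contains only boolean values (the data below is made up), the same mask can be written as df.any(axis=1): a row is kept when at least one of its values is True.

```python
import pandas as pd

# Hypothetical boolean frame; rows 1 and 3 are False in every column.
df = pd.DataFrame({
    "a": [True, False, True, False],
    "b": [False, False, True, False],
})

# Mask from the answer: False where every value in the row is False.
mask = ~((~df).all(axis=1))

# Equivalent and shorter: True where any value in the row is True.
same = mask.equals(df.any(axis=1))

kept = df[df.any(axis=1)]
print(kept)
```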
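This should help:
column_names = df.columns
for i in list(df.index):
    count = 0
    for column_name in column_names:
        # count the False values in this row
        if df.loc[i, column_name] == False:
            count = count + 1
    # drop the row if every column is False
    if count == len(column_names):
        df.drop(index=i, inplace=True)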
Our objective right now is to drop the duplicate player rows, but keep the row with the highest count in the G column (Games played). What code can we use to achieve this? I've attached a link to the image of our Pandas output here.
You probably want to first sort the dataframe by column G.
df = df.sort_values(by='G', ascending=False)
You can then use drop_duplicates to drop all duplicates except for the first occurrence.
df = df.drop_duplicates(['Player'], keep='first')
There are 2 ways that I can think of
df.groupby('Player', as_index=False)['G'].max()
and
df.sort_values('G').drop_duplicates(['Player'] , keep = 'last')
The first method uses groupby to group rows by Player and returns each player's maximum G; note that the result contains only the Player and G columns. The second one uses the drop_duplicates method of pandas to keep, for each player, the whole row with the highest G.
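If the other columns of the winning row are needed as well, one option is to combine groupby with idxmax and look the rows up by label. The sketch below compares it with the sort-based method on made-up data; the Tm column and all values are assumptions, and it relies on the index labels being unique.

```python
import pandas as pd

# Hypothetical data; only Player and G come from the question.
df = pd.DataFrame({
    "Player": ["A", "A", "B", "B", "C"],
    "Tm":     ["X", "Y", "X", "Z", "Y"],
    "G":      [10, 25, 40, 5, 30],
})

# Sort ascending by G, then keep the last (largest G) row per player.
best_sorted = df.sort_values("G").drop_duplicates(["Player"], keep="last")

# idxmax returns the index label of the maximum G within each group;
# .loc then retrieves the complete rows for those labels.
best_idxmax = df.loc[df.groupby("Player")["G"].idxmax()]

print(best_idxmax)
```

Both variants select the same rows; they differ only in the order of the output.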
Try this.
Assume your dataframe object is df1; then
series = df1.groupby('Player')['G'].max()  # this returns a Series
pd.DataFrame(series)
Let me know whether this works for you.