I have a pandas DataFrame, say df, and I'm trying to drop certain rows by an index. Specifically:
myindex = df[df.column2 == myvalue].index
df.drop(myindex, inplace = True)
This seems to work just fine for most DataFrames, but strange things seem to happen with one DataFrame where I get a non-unique index myindex (I am not quite sure why, since the DataFrame has no duplicate rows). To be more precise, a lot more rows get dropped than there are values in the index (in the extreme case I actually drop all rows, even though there are only several hundred rows where column2 has myvalue). Extracting only unique values (myindex.unique()) and dropping the rows using the unique index doesn't help either. At the same time,
df = df[df.column2 != myvalue]
works just as I'd like it to. I'd rather use the in-place drop; more importantly, though, I would like to understand why the results are not the same with the direct assignment and with the drop method using the index.
Unfortunately, I cannot provide the data, as they cannot be published, and since I am not sure what exactly is wrong, I cannot simulate it either. However, I suspect it probably has something to do with myindex being non-unique (which also confuses me, since there are no duplicate rows in df, but it might very well be that I misunderstand the way the index is created).
If there are repeated values in your index, doing reset_index before might help. That will set your current index as a column and add a new sequential index (with unique values) instead.
df = df.reset_index()
The reason the two methods are not the same is that in one case you are passing a Series of booleans that represents which rows to keep and which ones to drop (index values are not relevant here). With drop, you are passing a list of index values, and a non-unique value maps to several positions.
Finally, to check if your index has duplicates, you shouldn't check for duplicate rows. Simply do:
df.index.has_duplicates
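To see the mechanics, here is a tiny sketch with made-up data in which the index label 0 appears twice: drop removes every row carrying a listed label, while boolean indexing works row by row.

import pandas as pd

# a frame with a duplicated index label (0 appears twice)
df = pd.DataFrame({'column2': ['a', 'b', 'a']}, index=[0, 0, 1])

myindex = df[df.column2 == 'b'].index   # Index([0]): a single label...
df.drop(myindex)                        # ...but BOTH rows labelled 0 are dropped
df[df.column2 != 'b']                   # boolean mask keeps both 'a' rows, as intended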
Although there are several related pandas questions already answered, I cannot solve this issue. I have a large DataFrame (~49,000 rows) and want to drop the rows that meet two conditions at the same time (~120 rows):
For one column: an exact string
For another column: a NaN value
My code is ignoring the conditions and no row is removed.
to_remove = ['string1', 'string1']
df.drop(df[df['Column 1'].isin(to_remove) & (df['Column 2'].isna())].index, inplace=True)
What am I doing wrong? Thanks for any hint!
Instead of calling drop and passing the index, you can create a mask for the condition under which you want to keep the rows, then take only those rows. Also, there seems to be a logic error: you are checking two different conditions, combined by AND, against the same column's values.
df[~(df['Column 1'].isin(to_remove) & df['Column 2'].isna())]
Also, if you need to check within the same column, you probably want to combine the conditions with OR, i.e. |.
If needed, you can reset_index at last.
Also, as a side note, your list to_remove contains the same string value twice; I'm assuming that's a typo in the question.
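To make this concrete, here is a small self-contained sketch with made-up data mirroring the question's shape; only the row matching both conditions is removed.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Column 1': ['string1', 'other', 'string1'],
                   'Column 2': [np.nan, np.nan, 1.0]})
to_remove = ['string1']

# keep every row that does NOT match (string in to_remove AND Column 2 is NaN)
df = df[~(df['Column 1'].isin(to_remove) & df['Column 2'].isna())]
# rows 1 and 2 survive; row 0 is dropped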
I'm trying to write a small piece of code to drop duplicate rows based on a column's unique values. What I'm trying to accomplish is getting all the unique values from user_id and dropping duplicates according to those unique values using drop_duplicates, keeping the last occurrence, with date_time as the column I want to drop duplicates from.
code:
for i in recommender_train_df['user_id'].unique():
    recommender_train_df.loc[recommender_train_df['user_id'] == i].drop_duplicates(subset='date_time', keep="last", inplace=True)
The problem with this code is that it does literally nothing. I tried and tried, and the same result: nothing happens.
Quick note: I have 100k different (unique) user_id values, so I need a solution that works as fast as possible for this problem.
The problem is that df.loc here returns a copy of the original DataFrame, so your modification doesn't affect the original DataFrame. See "What rules does Pandas use to generate a view vs a copy?" on Stack Overflow for more detail.
If you want to drop duplicates over part of the DataFrame, you can get the indices of the duplicated items and drop based on those:
for i in recommender_train_df['user_id'].unique():
    # duplicated date_time values within this user's rows (all but the last occurrence)
    mask = recommender_train_df.loc[recommender_train_df['user_id'] == i].duplicated(subset='date_time', keep="last")
    indices = mask[mask].index
    recommender_train_df.drop(indices, inplace=True)
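That said, with 100k users the per-user loop will be slow. As a sketch of a vectorized alternative, assuming a "duplicate" means the same user_id and the same date_time, the whole loop can be replaced by a single drop_duplicates call over both columns:

recommender_train_df = recommender_train_df.drop_duplicates(
    subset=['user_id', 'date_time'], keep='last'
)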
Given any 2-dimensional DataFrame, you can call e.g. df.sample(frac=0.3) to retrieve a sample. But this sample will have a completely shuffled row order.
Is there a simple way to get a subsample that preserves the row order?
What we can do instead is use df.sample() and then sort the result back into the original row order. Appending a sort_index() call does the trick. Here's my code:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 10))
result = df.sample(frac=0.3).sort_index()
You can even control the order via sort_index's ascending argument; see the sort_index documentation for details.
The way the question is phrased, it sounds like the accepted answer does not provide a valid solution. I'm not sure what the OP really wanted; however, if we don't assume the original index is already sorted, we can't rely on sort_index() to reorder the rows according to their original order.
Assuming we have a DataFrame with an arbitrary index
df = pd.DataFrame(np.random.randn(100, 10), np.random.rand(100))
We can reset the index first to get a RangeIndex, sample, reorder, and reinstate the original index
df_sample = df.reset_index().sample(frac=0.3).sort_index().set_index("index")
And this guarantees we maintain the original order, whatever it was, whatever the index.
Finally, in case there's already a column named "index", we'll need to do something slightly different, such as renaming the index first or keeping it in a separate variable while we sample. But the principle remains the same.
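Alternatively, here is a minimal sketch that sidesteps the index entirely by sampling row positions with numpy and sorting them (the 0.3 fraction is just an example):

import numpy as np

n = int(len(df) * 0.3)
pos = np.sort(np.random.choice(len(df), size=n, replace=False))
df_sample = df.iloc[pos]   # rows come back in their original order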
I'm stuck on a particular Python question here. I have two DataFrames, DF1 and DF2. In both, I have two columns, pID and yID (which are not indexed, just the default). I'm looking to add a column Found to DF1 indicating where the respective values of the columns (pID and yID) were found in DF2. Also, I would like to restrict the lookup to just the values in DF2 where aID == 'Text'.
I believe the below gets me the first part of this question; however, I'm unsure how to incorporate the 'where' condition.
DF1['Found'] = (DF1[['pID', 'yID']] == DF2[['pID','yID']]).all(axis=1).astype(bool)
Suggestions or answers would be most appreciated. Thanks.
You could subset the second DataFrame to the rows where aID == 'Text', then select from that reduced DF the columns to be compared against the first DataFrame.
Use DF.isin() to check whether the values present under these column names match. Then .all(axis=1) returns True for a row only if both columns are True. Convert the resulting boolean series to integers via astype(int) and assign it to the new column, Found.
df1_sub = df1[['pID', 'yID']]
df2_sub = df2.query('aID=="Text"')[['pID', 'yID']]
df1['Found'] = df1_sub.isin(df2_sub).all(axis=1).astype(int)
df1
Demo DFs used:

import pandas as pd

df1 = pd.DataFrame(dict(pID=[1, 2, 3, 4, 5],
                        yID=[10, 20, 30, 40, 50]))
df2 = pd.DataFrame(dict(pID=[1, 2, 8, 4, 5],
                        yID=[10, 12, 30, 40, 50],
                        aID=['Text', 'Best', 'Text', 'Best', 'Text']))
If it does not matter where those matches occur, you can instead merge the two sub-frames on the common 'pID' and 'yID' key columns, carrying df1's index along so that the merge result tells you which of df1's rows found a match.
Access those indices and assign the value 1 to a new column named Found, filling its missing elements with 0's throughout.

matched = df1_sub.reset_index().merge(df2_sub, on=['pID', 'yID'])['index']
df1.loc[matched, 'Found'] = 1
df1['Found'] = df1['Found'].fillna(0).astype(int)

df1 is modified accordingly after the above steps.
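For what it's worth, with the demo frames above the only (pID, yID) pairs present in the aID == 'Text' subset of df2 are (1, 10) and (5, 50), so either approach should end up with Found = [1, 0, 0, 0, 1].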
Let's assume I have a DataFrame df with a MultiIndex and it has the level L.
Is there a way to remove L from the index and add it again?
df.index = df.index.droplevel('L') removes L completely from the DataFrame (unlike df = df.reset_index(), which has a drop argument and can keep L as a column).
I could of course do df = df.reset_index().set_index(everything_but_L).
Now, let us assume the index contains everything but L, and I want to add L.
df.index.insert(0, df.L) doesn't work.
Again, I could of course call df = df.reset_index().set_index(everything_including_L), but it doesn't feel right.
Why do I need this? Since indices need not be unique, it can happen that I want to add a new column so that the index becomes unique. Dropping may be useful in situations where, after splitting the data, one level of the index no longer carries any information (say my index is (A, B) and I operate on a sub-frame with A == x, but I do not want to lose A, which is what index.droplevel('A') would do).
In the current version (0.17.1) it is possible to
df.set_index(column_to_add, append=True, inplace=True)
and
df.reset_index(level=column_to_remove_from_index).
This comes along with a substantial speedup versus resetting n columns and then adding n+1 to the index.
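A minimal round-trip sketch with made-up column names (A and B form the existing index, L is the level being removed and re-added):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': ['x', 'y', 'x'],
                   'L': [10, 20, 30], 'val': [0.1, 0.2, 0.3]}).set_index(['A', 'B'])

df.set_index('L', append=True, inplace=True)   # add L as an extra index level
df = df.reset_index(level='L')                 # move L back out to a regular column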