Adding levels to a MultiIndex and removing them without losing data - python

Let's assume I have a DataFrame df with a MultiIndex and it has the level L.
Is there a way to remove L from the index and add it again?
df.index = df.index.droplevel('L') removes L completely from the DataFrame (unlike df = df.reset_index(), which has a drop argument).
I could of course do df = df.reset_index().set_index(everything_but_L).
Now, let us assume the index contains everything but L, and I want to add L.
df.index.insert(0, df.L) doesn't work.
Again, I could of course call df = df.reset_index().set_index(everything_including_L), but it doesn't feel right.
Why do I need this? Since indices need not be unique, it can happen that I want to add a new column to the index so that the index becomes unique. Dropping may be useful when, after splitting the data, one level of the index no longer carries any information (say my index is A, B and I operate on a df with A=x, but I do not want to lose A, which would happen with index.droplevel('A')).

In the current version (0.17.1) it is possible to
df.set_index(column_to_add, append=True, inplace=True)
and
df.reset_index(level=column_to_remove_from_index).
This also brings a substantial speedup versus resetting n columns and then setting n+1 of them back as the index.
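For example, a minimal sketch (the frame and names are made up for illustration):
import pandas as pd
df = pd.DataFrame({'A': [1, 1], 'B': [2, 3], 'val': [10, 20]}).set_index('A')
df.set_index('B', append=True, inplace=True)  # 'B' becomes the innermost index level
df = df.reset_index(level='B')  # moves level 'B' back out into a regular column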

Related

MultiIndex (multilevel) column names from Dataframe rows

I have a rather messy dataframe in which I need to assign the first 3 rows as multilevel column names.
This is my dataframe and I need index 3, 4 and 5 to be my multiindex column names.
For example, 'MINERAL TOTAL' should be the level 0 until next item; 'TRATAMIENTO (ts)' should be level 1 until 'LEY Cu(%)' comes up.
What I actually need is to emulate what pandas.read_excel does when 'header' is specified with multiple rows.
Please help!
I am trying this, but no luck at all:
pd.DataFrame(data=df.iloc[3:, :].to_numpy(), columns=tuple(df.iloc[:3, :].to_numpy(dtype='str')))
You can pass a list of row indexes to the header argument and pandas will combine them into a MultiIndex.
import pandas as pd
df = pd.read_excel('ExcelFile.xlsx', header=[0,1,2])
By default, pandas reads in the top row as the sole header row. You can pass a header argument to pandas.read_excel() indicating which rows are to be used as headers. This can be either an int or a list of ints. See the pandas.read_excel() documentation for more information.
As you mentioned, you are unable to use pandas.read_excel(). However, if you already have a DataFrame of the data you need, you can use pandas.MultiIndex.from_arrays(). First you would need to specify an array of the header rows, which in your case would look something like:
array = [df.iloc[0].values, df.iloc[1].values, df.iloc[2].values]
df.columns = pd.MultiIndex.from_arrays(array)
The only issue here is this includes the "NaN" values in the new MultiIndex header. To get around this, you could create some function to clean and forward fill the lists that make up the array.
Although not the prettiest, nor the most efficient, this could look something like the following (off the top of my head):
def forward_fill(iterable):
    return pd.Series(iterable).ffill().to_list()

zero = forward_fill(df.iloc[0].to_list())
one = forward_fill(df.iloc[1].to_list())
two = forward_fill(df.iloc[2].to_list())

array = [zero, one, two]
df.columns = pd.MultiIndex.from_arrays(array)
You may also wish to drop the header rows (in this case rows 0, 1 and 2) and reindex the DataFrame.
df.drop(index=[0,1,2], inplace=True)
df.reset_index(drop=True, inplace=True)
Since columns are also indices, you can just transpose, set index levels, and transpose back.
df.T.fillna(method='ffill').set_index([3, 4, 5]).T
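Note that recent pandas releases deprecate the method argument of fillna, so on newer versions the equivalent spelling would be:
df.T.ffill().set_index([3, 4, 5]).T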

Drop duplicate rows based on a column value

I'm trying to write a small piece of code to drop duplicate rows based on a column's unique values. What I'm trying to accomplish is to get all the unique values from user_id and, for each of them, drop duplicates with drop_duplicates while keeping the last occurrence. Keep in mind the column I want to drop duplicates on is date_time.
code:
for i in recommender_train_df['user_id'].unique():
    recommender_train_df.loc[recommender_train_df['user_id'] == i].drop_duplicates(subset='date_time', keep="last", inplace=True)
The problem with this code is that it literally does nothing; I tried and tried, and the same result: nothing happens.
Quick note: I have 100k different (unique) user_id values, so I need a solution that works as fast as possible.
The problem is that df.loc here returns a copy of the original dataframe, so your in-place modification doesn't affect the original. See the Stack Overflow question "What rules does Pandas use to generate a view vs a copy?" for more detail.
If you want to drop duplicates within part of the dataframe, you can get the index of the duplicated rows and drop based on those indices:
for i in recommender_train_df['user_id'].unique():
    mask = recommender_train_df.loc[recommender_train_df['user_id'] == i].duplicated(subset='date_time', keep="last")
    indices = mask[mask].index
    recommender_train_df.drop(indices, inplace=True)
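That said, if the goal is simply to keep the last row for each (user_id, date_time) pair, you can most likely skip the loop entirely and deduplicate on both columns at once, which should be far faster for 100k users:
recommender_train_df.drop_duplicates(subset=['user_id', 'date_time'], keep='last', inplace=True)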

How to sample from Pandas DataFrame while keeping row order

Given any 2-dimensional DataFrame, you can call e.g. df.sample(frac=0.3) to retrieve a sample, but this sample will have completely shuffled row order.
Is there a simple way to get a subsample that preserves the row order?
What we can do is use df.sample() and then sort the resulting sample back into the original row order. Appending a sort_index() call does the trick. Here's my code:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 10))
result = df.sample(frac=0.3).sort_index()
sort_index() sorts in ascending order by default; see the DataFrame.sort_index documentation for the available options.
The way the question is phrased, it sounds like the accepted answer does not provide a valid solution. I'm not sure what the OP really wanted; however, if we don't assume the original index is already sorted, we can't rely on sort_index() to reorder the rows according to their original order.
Assuming we have a DataFrame with an arbitrary index
df = pd.DataFrame(np.random.randn(100, 10), index=np.random.rand(100))
We can reset the index first to get a RangeIndex, sample, reorder, and reinstate the original index
df_sample = df.reset_index().sample(frac=0.3).sort_index().set_index("index")
And this guarantees we maintain the original order, whatever it was, whatever the index.
Finally, in case there's already a column named "index", we'll need to do something slightly different such as rename the index first, or keep it in a separate variable while we sample. But the principle remains the same.
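For instance, one way to sidestep a clash with an existing "index" column is to give the index a unique name before resetting it (the name orig_index here is arbitrary):
df_sample = df.rename_axis('orig_index').reset_index().sample(frac=0.3).sort_index().set_index('orig_index')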

Python pandas issues with .drop and a non-unique index

I have a pandas DataFrame, say df, and I'm trying to drop certain rows by an index. Specifically:
myindex = df[df.column2 != myvalue].index
df.drop(myindex, inplace = True)
This seems to work just fine for most DataFrames, but strange things happen with one DataFrame where I get a non-unique index myindex (I am not quite sure why, since the DataFrame has no duplicate rows). To be more precise, a lot more values get dropped than there are in the index (in the extreme case I actually drop all rows, even though there are several hundred rows where column2 has myvalue). Extracting only unique values (myindex.unique()) and dropping the rows using the unique index doesn't help either. At the same time,
df = df[df.column2 != myvalue]
works just as I'd like it to. I'd rather use the in-place drop, but more importantly I would like to understand why the results are not the same with the direct assignment and with the drop method using the index.
Unfortunately, I cannot provide the data as they cannot be published, and since I am not sure what exactly is wrong, I cannot simulate them either. However, I suspect it has something to do with myindex being non-unique (which also confuses me, since there are no duplicate rows in df, but it might very well be that I misunderstand the way the index is created).
If there are repeated values in your index, doing reset_index before might help. That will set your current index as a column and add a new sequential index (with unique values) instead.
df = df.reset_index()
The reason the 2 methods are not the same is that in one case you are passing a series of booleans that represents which rows to keep and which ones to drop (index values are not relevant here). In the case with the drop, you are passing a list of index values, which can each map to several positions.
Finally, to check if your index has duplicates, you shouldn't check for duplicate rows. Simply do:
df.index.has_duplicates
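A small made-up example of the mechanism: with duplicate labels, drop removes every row carrying a matched label, not only the rows that produced those labels.
import pandas as pd
df = pd.DataFrame({'column2': [1, 2, 1, 3]}, index=[0, 0, 1, 1])
myindex = df[df.column2 != 2].index  # the labels are [0, 1, 1]
df.drop(myindex)  # drops all four rows, since every row is labelled 0 or 1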

pandas not modifying df

New to pandas here. I have a df:
inked = tracker[['A','B','C','D','AA','BB','CC','DD','E','F']]
Single-letter columns contain names, and double-letter columns contain numbers but also NaN.
I am converting all NaN to zeros by using this:
inked.loc[:,'AA':'DD'].fillna(0)
and it works, but when I do
inked.head()
I get the original df with the NaN. How can I make the change permanently in the df?
By default, fillna() is not performed in place. If you were operating directly on the DataFrame, then you could use the inplace=True argument, like this:
inked.fillna(0, inplace=True)
However, if you first select a subset of the columns using loc, the fill happens on a temporary copy and the result is lost.
This was covered here. Basically, you need to re-assign the updated DataFrame back to the original DataFrame. For a list of columns (rather than a range, like you originally tried), you can do this:
inked[['AA','DD']] = inked[['AA','DD']].fillna(0)
In general, when performing dataframe operations, if you want to alter a dataframe you either need to re-assign the result back to it (or to a new variable), or write into the original with .loc. (In my experience at least)
inked.loc[:, 'AA':'DD'] = inked.loc[:, 'AA':'DD'].fillna(0)
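To illustrate on a toy frame (the column names are invented), the first fillna below returns a filled copy and leaves the original untouched, while the loc assignment writes the result back:
import numpy as np
import pandas as pd
inked = pd.DataFrame({'A': ['x', 'y'], 'AA': [1.0, np.nan], 'DD': [np.nan, 2.0]})
inked.loc[:, 'AA':'DD'].fillna(0)  # returns a filled copy; inked is unchanged
inked.loc[:, 'AA':'DD'] = inked.loc[:, 'AA':'DD'].fillna(0)  # writes back into inked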
