How to sample from Pandas DataFrame while keeping row order - python

Given any 2-dimensional DataFrame, you can call e.g. df.sample(frac=0.3) to retrieve a sample. But this sample will have a completely shuffled row order.
Is there a simple way to get a subsample that preserves the row order?

What we can do instead is call df.sample() and then sort the result by its index, which restores the original row order. Appending the sort_index() call does the trick. Here's my code:
df = pd.DataFrame(np.random.randn(100, 10))
result = df.sample(frac=0.3).sort_index()
sort_index() sorts in ascending order by default; see the sort_index() documentation for other options.

The way the question is phrased, it sounds like the accepted answer does not provide a valid solution. I'm not sure what the OP really wanted; however, if we don't assume the original index is already sorted, we can't rely on sort_index() to reorder the rows according to their original order.
Assuming we have a DataFrame with an arbitrary index
df = pd.DataFrame(np.random.randn(100, 10), index=np.random.rand(100))
We can reset the index first to get a RangeIndex, sample, reorder, and reinstate the original index
df_sample = df.reset_index().sample(frac=0.3).sort_index().set_index("index")
And this guarantees we maintain the original order, whatever it was, whatever the index.
Finally, in case there's already a column named "index", we'll need to do something slightly different such as rename the index first, or keep it in a separate variable while we sample. But the principle remains the same.
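For instance, here is a minimal sketch of the rename approach (the index name orig_idx is arbitrary, and the data is made up):
import numpy as np
import pandas as pd

# Hypothetical frame that already has a column named "index"
df = pd.DataFrame(np.random.randn(100, 10), index=np.random.rand(100))
df["index"] = range(100)

# Name the real index so reset_index() creates a non-clashing column
df_sample = (
    df.rename_axis("orig_idx")
      .reset_index()
      .sample(frac=0.3)
      .sort_index()            # the temporary RangeIndex restores the original order
      .set_index("orig_idx")
)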

Related

MultiIndex (multilevel) column names from Dataframe rows

I have a rather messy dataframe in which I need to assign the first 3 rows as multilevel column names.
This is my dataframe and I need index 3, 4 and 5 to be my multiindex column names.
For example, 'MINERAL TOTAL' should be the level 0 until next item; 'TRATAMIENTO (ts)' should be level 1 until 'LEY Cu(%)' comes up.
What I actually need is to emulate what pandas.read_excel does when 'header' is specified with multiple rows.
Please help!
I am trying this, but no luck at all:
pd.DataFrame(data=df.iloc[3:, :].to_numpy(), columns=tuple(df.iloc[:3, :].to_numpy(dtype='str')))
You can pass a list of row indexes to the header argument and pandas will combine them into a MultiIndex.
import pandas as pd
df = pd.read_excel('ExcelFile.xlsx', header=[0,1,2])
By default, pandas will read in the top row as the sole header row. You can pass the header argument into pandas.read_excel() that indicates how many rows are to be used as headers. This can be either an int, or list of ints. See the pandas.read_excel() documentation for more information.
As you mentioned you are unable to use pandas.read_excel(). However, if you do already have a DataFrame of the data you need, you can use pandas.MultiIndex.from_arrays(). First you would need to specify an array of the header rows which in your case would look something like:
array = [df.iloc[0].values, df.iloc[1].values, df.iloc[2].values]
df.columns = pd.MultiIndex.from_arrays(array)
The only issue here is this includes the "NaN" values in the new MultiIndex header. To get around this, you could create some function to clean and forward fill the lists that make up the array.
Although not the prettiest, nor the most efficient, this could look something like the following (off the top of my head):
def forward_fill(iterable):
    return pd.Series(iterable).ffill().to_list()
zero = forward_fill(df.iloc[0].to_list())
one = forward_fill(df.iloc[1].to_list())
two = forward_fill(df.iloc[2].to_list())
array = [zero, one, two]
df.columns = pd.MultiIndex.from_arrays(array)
You may also wish to drop the header rows (in this case rows 0, 1 and 2) and reindex the DataFrame.
df.drop(index=[0,1,2], inplace=True)
df.reset_index(drop=True, inplace=True)
Since columns are also indices, you can just transpose, set index levels, and transpose back.
df.T.fillna(method='ffill').set_index([3, 4, 5]).T
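To illustrate the transpose trick on self-contained toy data (the header rows sit at positions 0-2 here, so we pass [0, 1, 2] rather than [3, 4, 5]; .ffill() is equivalent to fillna(method='ffill')):
import numpy as np
import pandas as pd

# Toy frame: rows 0-2 hold the three header levels, with gaps that need
# forward-filling; the remaining rows hold the data
df = pd.DataFrame([
    ["MINERAL TOTAL", np.nan, "LEY Cu(%)", np.nan],
    ["TRATAMIENTO (ts)", np.nan, "t1", "t2"],
    ["a", "b", "c", "d"],
    [1, 2, 3, 4],
    [5, 6, 7, 8],
])

# Transposing turns the header rows into columns 0-2; forward-fill the gaps,
# promote those columns to a MultiIndex, and transpose back
out = df.T.ffill().set_index([0, 1, 2]).T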

Python pandas issues with .drop and a non-unique index

I have a pandas DataFrame, say df, and I'm trying to drop certain rows by an index. Specifically:
myindex = df[df.column2 != myvalue].index
df.drop(myindex, inplace = True)
This seems to work just fine for most DataFrames, but strange things happen with one DataFrame where myindex is non-unique (I am not quite sure why, since the DataFrame has no duplicate rows). To be more precise, far more values get dropped than there are in the index (in the extreme case all rows are dropped, even though there are several hundred rows where column2 has myvalue). Extracting only the unique values (myindex.unique()) and dropping the rows using that unique index doesn't help either. At the same time,
df = df[df.column2 != myvalue]
works just as I'd like it to. I'd rather use the inplace drop, however; more importantly, I would like to understand why the results are not the same with the direct assignment and with the drop method using the index.
Unfortunately, I cannot provide the data as they cannot be published, and since I am not sure what exactly is wrong, I cannot simulate them either. However, I suspect it has something to do with myindex being non-unique (which also confuses me, since there are no duplicate rows in df, but it might very well be that I misunderstand the way the index is created).
If there are repeated values in your index, doing reset_index before might help. That will set your current index as a column and add a new sequential index (with unique values) instead.
df = df.reset_index()
The reason the two methods are not the same is that in one case you are passing a series of booleans that represents which rows to keep and which to drop (index values are not relevant here). With drop, you are passing a list of index values, and each value can map to several positions.
Finally, to check whether your index has duplicates, you shouldn't check for duplicate rows. Simply do:
df.index.has_duplicates
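To see why drop over-deletes when the index has duplicate labels, here is a minimal sketch with made-up data:
import pandas as pd

# Hypothetical frame whose index contains duplicate labels
df = pd.DataFrame({"column2": [1, 2, 1, 3]}, index=[0, 0, 1, 1])

myindex = df[df.column2 != 1].index   # Index([0, 1]): labels, not positions
print(df.drop(myindex))               # empty! every row shares a label with myindex

# Boolean masking keeps exactly the rows where the condition holds
print(df[df.column2 == 1])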

Sorting Dataframe using pandas. Keeping columns intact

As seen in the image below, I would like to sort the chats by Type in alphabetical order. However, I do not wish to mess up the order of [Date , User_id] within each Chat name. How should I do so given that I have the input dataframe on the left? (Using Pandas in python)
You want to sort the values using a stable sorting algorithm which is mergesort:
df.sort_values(by='Type', kind='mergesort')
From the linked answer:
A sorting algorithm is said to be stable if two objects with equal
keys appear in the same order in sorted output as they appear in the
input array to be sorted.
From pandas docs:
kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’
Choice of sorting algorithm. See also ndarray.np.sort for more
information. mergesort is the only stable algorithm. For DataFrames,
this option is only applied when sorting on a single column or label.
Update: As @ALollz correctly pointed out, it is better to convert all the values to lower case first and then do the sorting (otherwise "Bird" will be placed before "alligator" in the result):
df['temp'] = df['Type'].str.lower()
df = df.sort_values(by='temp', kind='mergesort')
df = df.drop('temp', axis=1)
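On pandas 1.1 or newer you can skip the temporary column altogether by passing a key function to sort_values (a sketch):
# key receives the column as a Series; mergesort keeps the sort stable
df = df.sort_values(by='Type', key=lambda s: s.str.lower(), kind='mergesort')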
df.sort_values(by=['Type']) [1]
You could also write your own sort function [2]; strings can be compared directly, e.g. stringRow2 < stringRow3.
[1] https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
[2] Sort pandas DataFrame with function over column values

Preserving order of occurrence with size() function

I would like to preserve the order of my DataFrame when using the .size() function. My first DataFrame is created by choosing a subset of a larger one:
df_South = df[df['REGION_NAME'] == 'South']
Here is an example of what the DataFrame looks like:
With this DataFrame I count the occurrences of each unique 'TEMPBIN_CONS' variable.
South_Count = df_South.groupby('TEMPBIN_CONS').size()
I would like to maintain the order given by the SORT column. I created this column based on the order in which I would like my 'TEMPBIN_CONS' values to appear after counting. I can't seem to get them to appear in the proper order, though. I've tried using .sort_index() on South_Count and it does not change the order that groupby() creates.
Ultimately this is my solution for fixing the axis ordering of a bar plot I am creating of South_Count. As it is the ordering is very difficult to read and would like it to appear in a logical order.
For reference, South_Count, and subsequently the axis of my bar plot, appears in this order:
Try this:
South_Count = df_South.groupby('TEMPBIN_CONS', sort=False).size()
It looks as though your data is being sorted as strings.
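A minimal sketch with made-up bin labels shows the difference:
import pandas as pd

# Hypothetical bins, listed in the order they should appear
df_South = pd.DataFrame(
    {"TEMPBIN_CONS": ["<0F", "0-10F", "10-20F", "0-10F", "<0F"]})

# The default sorts the group keys lexicographically: 0-10F, 10-20F, <0F
print(df_South.groupby("TEMPBIN_CONS").size())

# sort=False keeps the groups in order of first appearance: <0F, 0-10F, 10-20F
print(df_South.groupby("TEMPBIN_CONS", sort=False).size())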

Adding levels to MultiIndex, removing without losing

Let's assume I have a DataFrame df with a MultiIndex and it has the level L.
Is there a way to remove L from the index and add it again?
df = df.index.drop('L') removes L completely from the DataFrame (unlike df = df.reset_index(), which has a drop argument).
I could of course do df = df.reset_index().set_index(everything_but_L).
Now, let us assume the index contains everything but L, and I want to add L.
df.index.insert(0, df.L) doesn't work.
Again, I could of course call df = df.reset_index().set_index(everything_including_L), but it doesn't feel right.
Why do I need this? Since indices need not be unique, it can occur that I want to add a new column so the index becomes unique. Dropping may be useful in situations where after splitting data one level of the index does not contain any information anymore (say my index is A,B and I operate on a df with A=x but I do not want to lose A which would occur with index.droplevel('A')).
In the current version (0.17.1) it is possible to
df.set_index(column_to_add, append=True, inplace=True)
and
df.reset_index(level=column_to_remove_from_index)
This comes along with a substantial speedup versus resetting n columns and then adding n+1 to the index.
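A short sketch with hypothetical level names A and B and column L:
import pandas as pd

# Frame indexed by A and B, with L as a regular column
df = pd.DataFrame({"A": ["x", "x", "y"], "B": [1, 2, 1],
                   "L": [10, 20, 30], "val": [0.1, 0.2, 0.3]}).set_index(["A", "B"])

df.set_index("L", append=True, inplace=True)   # index is now (A, B, L)
df.reset_index(level="L", inplace=True)        # back to (A, B); L is a column again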
