Preserving order of occurrence with size() function - python

I would like to preserve the order of my DataFrame when using the .size() function. My first DataFrame is created by choosing a subset of a larger one:
df_South = df[df['REGION_NAME'] == 'South']
Here is an example of what the DataFrame looks like:
With this DataFrame I count the occurrences of each unique 'TEMPBIN_CONS' variable.
South_Count = df_South.groupby('TEMPBIN_CONS').size()
I would like to maintain the order that exists in the SORT column. I created this column based on the order in which I would like my 'TEMPBIN_CONS' variable to appear after counting. I can't seem to get it to appear in the proper order, though. I've tried using .sort_index() on South_Count, and it does not change the order that groupby() creates.
Ultimately, this is my way of fixing the axis ordering of a bar plot I am creating from South_Count. As it stands, the ordering is very difficult to read, and I would like it to appear in a logical order.
For reference, South_Count, and subsequently the axis of my bar plot, appears in this order:

Try this:
South_Count = df_South.groupby('TEMPBIN_CONS', sort=False ).size()
It looks as though your data is being sorted as strings. Passing sort=False keeps the groups in the order in which they first appear in df_South.
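If the rows of df_South are not already in SORT order, sort=False alone will not be enough, since the groups come out in order of first appearance. A minimal sketch, assuming made-up bin labels and SORT values (the real data is not shown in the question), is to sort by SORT first and then group:

```python
import pandas as pd

# Hypothetical stand-in for df_South; labels and SORT values are invented.
df_South = pd.DataFrame({
    "TEMPBIN_CONS": ["10 to 20", "5 to 10", "100 to 110", "5 to 10",
                     "10 to 20", "100 to 110", "10 to 20"],
    "SORT": [2, 1, 3, 1, 2, 3, 2],
})

# Default groupby sorts the string labels lexicographically.
default_count = df_South.groupby("TEMPBIN_CONS").size()

# Sort rows by SORT, then keep groups in order of first appearance.
South_Count = (
    df_South.sort_values("SORT", kind="mergesort")
            .groupby("TEMPBIN_CONS", sort=False)
            .size()
)
```

South_Count can then be passed to the bar plot as before, and the axis follows the SORT order rather than the string order.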

Related

How to Index a dataframe based on an applied function? -Pandas

I have a dataframe that I created from a master table in SQL. That new dataframe is then grouped by type as I want to find the outliers for each group in the master table.
The function finds the outliers, showing where in the group DataFrame the outliers occur. How do I see these outliers as part of the original dataframe? Not just volume but also location, SKU, group, etc.
dataframe: HOSIERY_df
Code:
##Sku Group Data Frames
grouped_skus = sku_volume.groupby('SKUGROUP')
HOSIERY_df = grouped_skus.get_group('HOSIERY')
hosiery_outliers = find_outliers_IQR(HOSIERY_df['VOLUME'])
hosiery_outliers
#.iloc[[hosiery_outliers]]
#hosiery_outliers
Picture to show code and output:
I know enough to see that I need to find the rows based on the location of the index, like VLOOKUP in Excel, but I need to do it in Python. I'm not sure how to pull only the 5th, 6th, 7th...3888th and 4482nd rows of HOSIERY_df.
You can pass a list of integer positions to iloc, which it looks like you tried in your commented-out code. Make sure find_outliers_IQR returns a list of int so that it works with iloc, or convert its output. Note that iloc selects by position while loc selects by index label; because get_group keeps the index labels of the original frame, labels taken from the outlier result should go to loc rather than iloc.
It looks like it's currently returning a DataFrame. You can get the index of that frame as a list like this:
hosiery_outliers.index.tolist()
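As a rough sketch of the whole lookup: the find_outliers_IQR below is a guess at what the asker's function does (its body is not shown in the question), and the sample data, SKU and LOCATION values are invented. It uses loc because the index of the outlier result holds labels from the original frame, not positions:

```python
import pandas as pd

def find_outliers_IQR(series):
    # Hypothetical implementation: values beyond 1.5 * IQR from the quartiles.
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)]

# Invented sample data standing in for the SQL master table.
sku_volume = pd.DataFrame({
    "SKUGROUP": ["SOCKS", "SOCKS"] + ["HOSIERY"] * 6,
    "SKU": ["S1", "S2", "H1", "H2", "H3", "H4", "H5", "H6"],
    "LOCATION": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "VOLUME": [7, 8, 10, 12, 11, 13, 12, 500],
})

HOSIERY_df = sku_volume.groupby("SKUGROUP").get_group("HOSIERY")
hosiery_outliers = find_outliers_IQR(HOSIERY_df["VOLUME"])

# Look up the full rows (SKU, location, group, volume) by index label.
outlier_rows = HOSIERY_df.loc[hosiery_outliers.index]
```

Here the outlier sits at label 7 but position 5 in HOSIERY_df, so iloc with the label would point at the wrong row or fail.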

How to filter columns containing missing values

I am using the following code:
sns.displot(
data=df.isna().melt(value_name="missing"),
y="variable",
hue="missing",
multiple="fill",
height=16
)
plt.show()
to create a heatmap of missing values of the df. However since my df has a lot of columns, the chart has to be very tall in order to accommodate all the information. I tried altering the data argument to be something like this:
data = df[df.columns.values.isna()].isna() or data = df[df.isna().sum() > 0].isna(). Basically, I want to filter the dataframe so that it has only the columns with at least one missing value. I looked for a correct answer but couldn't find one.
Nearly there. To select all columns with at least one missing value, use:
df[df.columns[df.isna().any()]]
Alternatively, you could use .sum() and then choose some threshold:
threshold = 0
df[df.columns[df.isna().sum() > threshold]]
And then append .isna().melt(value_name="missing") for your data var.
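Putting the pieces together, here is a small sketch with an invented three-column frame (the column names are placeholders) that builds the data argument for the plot:

```python
import numpy as np
import pandas as pd

# Invented example frame; "b" has no missing values and should be dropped.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1, 2, 3],
    "c": ["x", None, None],
})

# Keep only the columns with at least one missing value.
cols_with_missing = df.columns[df.isna().any()]

# Long-format frame with a "variable" column and a boolean "missing" column.
data = df[cols_with_missing].isna().melt(value_name="missing")
```

The resulting data can be passed as data=data to the sns.displot call from the question.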

How to sort all columns independently in ascending order in a pandas dataframe?

I would like to sort in ascending order all the columns in a dataframe independently. My data frame is as follows:
date,A,B,C,D
1989-12-31,540.8,497.351,757.9,649.811
1990-12-31,388.9,453.65,454.2,714.898
1991-12-31,796.3,170.308,1080.4,274.678
1992-12-31,427.7,304.587,695.6,414.898
I have tried manually:
df1 = df.sort_values(by=['A','B','C','D'], axis=0)
date,A,B,C,D
1990-12-31,388.9,453.65,454.2,714.898
1992-12-31,427.7,304.587,695.6,414.898
1989-12-31,540.8,497.351,757.9,649.811
1991-12-31,796.3,170.308,1080.4,274.678
But as you can see it works only with column 'A'.
Do I have to do a loop on each column?
Is there a simpler way? I have had a look in the sort manual but I am not able to figure it out.
Thanks
Your code worked, but sort_values with several columns sorts hierarchically: it sorts on column A first, then uses column B only to break ties among rows that share the same value in A, and column C only among rows where both A and B are equal.
All the values shown in A are different from one another, so this method sorts on A alone.
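To sort every column independently, one option (not part of the answer above, just a sketch) is to sort the underlying array along the row axis with numpy. Note that this deliberately breaks the link between values and dates, so the date index is dropped here:

```python
import io

import numpy as np
import pandas as pd

csv_text = """date,A,B,C,D
1989-12-31,540.8,497.351,757.9,649.811
1990-12-31,388.9,453.65,454.2,714.898
1991-12-31,796.3,170.308,1080.4,274.678
1992-12-31,427.7,304.587,695.6,414.898
"""
df = pd.read_csv(io.StringIO(csv_text), index_col="date")

# np.sort with axis=0 sorts each column on its own, in ascending order.
sorted_df = pd.DataFrame(np.sort(df.to_numpy(), axis=0), columns=df.columns)
```

No explicit loop over the columns is needed; each column of sorted_df is ascending regardless of the others.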

How to sample from Pandas DataFrame while keeping row order

Given any two-dimensional DataFrame, you can call e.g. df.sample(frac=0.3) to retrieve a sample. But this sample will have a completely shuffled row order.
Is there a simple way to get a subsample that preserves the row order?
What we can do instead is use df.sample(), and then sort the resultant index by the original row order. Appending the sort_index() call does the trick. Here's my code:
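import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 10))
result = df.sample(frac=0.3).sort_index()
sort_index() sorts in ascending order by default; pass ascending=False for descending order. See the pandas documentation for DataFrame.sort_index.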
The way the question is phrased, it sounds like the accepted answer does not provide a valid solution. I'm not sure what the OP really wanted; however, if we don't assume the original index is already sorted, we can't rely on sort_index() to reorder the rows according to their original order.
Assuming we have a DataFrame with an arbitrary index
df = pd.DataFrame(np.random.randn(100, 10), np.random.rand(100))
We can reset the index first to get a RangeIndex, sample, reorder, and reinstate the original index
df_sample = df.reset_index().sample(frac=0.3).sort_index().set_index("index")
And this guarantees we maintain the original order, whatever it was, whatever the index.
Finally, in case there's already a column named "index", we'll need to do something slightly different such as rename the index first, or keep it in a separate variable while we sample. But the principle remains the same.
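One way to sidestep the "index" column clash entirely is to sample row positions instead of rows. This is a sketch of a swapped-in technique rather than the approach above: it draws positions with numpy, sorts them, and selects with iloc, so neither the index values nor the column names matter. The seed and column names are arbitrary:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Arbitrary unsorted index, plus a column that happens to be named "index".
df = pd.DataFrame(
    {"index": rng.random(100), "value": rng.random(100)},
    index=rng.random(100),
)

# Draw 30% of the row positions without replacement, then restore row order.
n = int(round(0.3 * len(df)))
positions = np.sort(rng.choice(len(df), size=n, replace=False))
df_sample = df.iloc[positions]
```

Because the positions are sorted before selection, df_sample keeps the rows in their original relative order.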

Sorting Dataframe using pandas. Keeping columns intact

As seen in the image below, I would like to sort the chats by Type in alphabetical order. However, I do not wish to mess up the order of [Date , User_id] within each Chat name. How should I do so given that I have the input dataframe on the left? (Using Pandas in python)
You want to sort the values using a stable sorting algorithm, such as mergesort:
df.sort_values(by='Type', kind='mergesort')
From the linked answer:
A sorting algorithm is said to be stable if two objects with equal
keys appear in the same order in sorted output as they appear in the
input array to be sorted.
From pandas docs:
kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’
Choice of sorting algorithm. See also ndarray.np.sort for more
information. mergesort is the only stable algorithm. For DataFrames,
this option is only applied when sorting on a single column or label.
Update: As @ALollz correctly pointed out, it is better to convert all the values to lower case first and then sort (otherwise "Bird" will be placed before "alligator" in the result):
df['temp'] = df['Type'].str.lower()
df = df.sort_values(by='temp', kind='mergesort')
df = df.drop('temp', axis=1)
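On pandas 1.1 or later, the temporary column can probably be avoided with the key argument of sort_values. A small sketch with invented chat data (the column names other than Type are placeholders, since the original image is not available):

```python
import pandas as pd

# Invented sample; Date is already in the desired order within each Type.
df = pd.DataFrame({
    "Type": ["bird", "Alligator", "Bird", "alligator", "cat"],
    "Date": [1, 2, 3, 4, 5],
    "User_id": ["u1", "u2", "u3", "u4", "u5"],
})

# Case-insensitive, stable sort: ties keep their original relative order.
result = df.sort_values(by="Type", key=lambda s: s.str.lower(), kind="mergesort")
```

Rows with the same lower-cased Type stay in their input order, so the Date and User_id sequence within each group is untouched.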
df.sort_values(by=['Type']) [1]
You could also write your own sort function [2]; strings can be compared directly, e.g. stringRow2 < stringRow3.
[1] https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
[2] Sort pandas DataFrame with function over column values
