As seen in the image below, I would like to sort the chats by Type in alphabetical order. However, I do not wish to mess up the order of [Date , User_id] within each Chat name. How should I do so given that I have the input dataframe on the left? (Using Pandas in python)
You want to sort the values using a stable sorting algorithm which is mergesort:
df.sort_values(by='Type', kind='mergesort')
From the linked answer:
A sorting algorithm is said to be stable if two objects with equal
keys appear in the same order in sorted output as they appear in the
input array to be sorted.
From pandas docs:
kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’
Choice of sorting algorithm. See also ndarray.np.sort for more
information. mergesort is the only stable algorithm. For DataFrames,
this option is only applied when sorting on a single column or label.
Update: As @ALollz correctly pointed out, it is better to convert all the values to lower case first and then sort (otherwise "Bird" would be placed before "alligator" in the result):
df['temp'] = df['Type'].str.lower()
df = df.sort_values(by='temp', kind='mergesort')
df = df.drop('temp', axis=1)
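If you are on pandas 1.1 or newer, the key= argument of sort_values avoids the temporary column entirely; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Type': ['Bird', 'alligator', 'Bird', 'cat'],
                   'User_id': [1, 2, 3, 4]})

# sort_values accepts a key= callable applied to the column before
# sorting, and kind='mergesort' keeps the original relative order of
# rows whose (lowercased) keys compare equal.
result = df.sort_values(by='Type', key=lambda s: s.str.lower(),
                        kind='mergesort')
```

The two 'Bird' rows (User_id 1 and 3) come out in their original order, confirming the sort is stable.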
df.sort_values(by=['Type']) [1]
You could also write your own sort function[2]; strings can be compared directly, e.g. stringRow2 < stringRow3.
[1] https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
[2] Sort pandas DataFrame with function over column values
Related
I have a program that uses a mask, similar to the check-marked answer shown here, to create multiple sets of random numbers in a dataframe, df.
Create random.randint with condition in a group by?
My code:
for city in state:
    mask = df['City'] == city
    df.loc[mask, 'Random'] = np.random.randint(1, 200, mask.sum())
This takes quite some time the bigger dataframe df is. Is there a way to speed this up with groupby?
You can try:
df['Random'] = df.assign(Random=0).groupby(df['City'])['Random'] \
                 .transform(lambda x: np.random.randint(1, 200, len(x)))
I've figured out a much quicker way to do this. I'll keep it general, since the application may differ depending on what you want to achieve, and leave Corralien's answer as the accepted one.
Instead of creating a mask or group and using .loc to update the dataframe in place, I sorted the dataframe by the 'City' then created a list of unique values from my 'City' column.
Looping over the unique list (i.e.; the grouping), I generated the random numbers for each grouping, putting them in a new list using the .extend() function. I then added the 'Random' column from this list, and sorted the dataframe back using the index.
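The steps above can be sketched as follows (the data here is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'City': ['B', 'A', 'B', 'A', 'C']})

# Sort by 'City' so each city's rows are contiguous, generate one
# batch of random numbers per city, then restore the original order.
df = df.sort_values('City')
random_values = []
for city in df['City'].unique():
    n = (df['City'] == city).sum()      # rows in this city's group
    random_values.extend(np.random.randint(1, 200, n))
df['Random'] = random_values            # assigned positionally
df = df.sort_index()                    # back to the original row order
```

This generates only one randint call per city rather than one boolean mask per loop iteration over the full dataframe.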
I would like to sort in ascending order all the columns in a dataframe independently. My data frame is as follows:
date,A,B,C,D
1989-12-31,540.8,497.351,757.9,649.811
1990-12-31,388.9,453.65,454.2,714.898
1991-12-31,796.3,170.308,1080.4,274.678
1992-12-31,427.7,304.587,695.6,414.898
I have tried manually:
df1=df.sort_values(by=['A','B','C','D'],axis=0, inplace=True)
date,A,B,C,D
1990-12-31,388.9,453.65,454.2,714.898
1992-12-31,427.7,304.587,695.6,414.898
1989-12-31,540.8,497.351,757.9,649.811
1991-12-31,796.3,170.308,1080.4,274.678
But as you can see it works only with column 'A'.
Do I have to do a loop on each column?
Is there a simpler way? I have had a look in the sort manual but I am not able to figure it out.
Thanks
Your code worked, but sort_values sorts hierarchically: it sorts first on column A, falls back to column B only where multiple rows share the same value in A, and to column C only where both A and B are tied. Since every value shown in A is different, the result is effectively sorted on A alone.
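If the goal really is to sort every column independently (which deliberately breaks the row-wise association with the date index), one option is to sort the underlying array column-wise; a sketch using two of the columns above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [540.8, 388.9, 796.3, 427.7],
                   'B': [497.351, 453.65, 170.308, 304.587]})

# np.sort with axis=0 sorts each column independently; the index and
# column labels are reattached afterwards.
df1 = pd.DataFrame(np.sort(df.to_numpy(), axis=0),
                   index=df.index, columns=df.columns)
```

Note that after this, a row no longer represents a single original observation, which is why sort_values refuses to do it directly.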
Given any 2-dimensional DataFrame, you can call e.g. df.sample(frac=0.3) to retrieve a sample. But this sample will have a completely shuffled row order.
Is there a simple way to get a subsample that preserves the row order?
What we can do instead is use df.sample(), and then sort the resultant index by the original row order. Appending the sort_index() call does the trick. Here's my code:
df = pd.DataFrame(np.random.randn(100, 10))
result = df.sample(frac=0.3).sort_index()
sort_index returns the rows in ascending index order by default (pass ascending=False for descending). Documentation here.
The way the question is phrased, it sounds like the accepted answer does not provide a valid solution. I'm not sure what the OP really wanted; however, if we don't assume the original index is already sorted, we can't rely on sort_index() to reorder the rows according to their original order.
Assuming we have a DataFrame with an arbitrary index
df = pd.DataFrame(np.random.randn(100, 10), np.random.rand(100))
We can reset the index first to get a RangeIndex, sample, reorder, and reinstate the original index
df_sample = df.reset_index().sample(frac=0.3).sort_index().set_index("index")
And this guarantees we maintain the original order, whatever it was, whatever the index.
Finally, in case there's already a column named "index", we'll need to do something slightly different such as rename the index first, or keep it in a separate variable while we sample. But the principle remains the same.
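An equivalent sketch that never touches the index at all: sample row positions, sort the positions, and select with iloc.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 10), index=np.random.rand(100))

# Sample 30% of the row *positions* without replacement, then sort the
# positions so the selected rows come out in their original relative
# order, whatever the index contains.
positions = np.sort(np.random.choice(len(df), size=30, replace=False))
df_sample = df.iloc[positions]
```

Because the selection is purely positional, this sidesteps the reset_index/set_index dance and any clash with an existing "index" column.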
I want to sort groups of rows based on a column (in my example, 'Group' is the column to group by), sorting the groups while maintaining the in-group row order. I can't sort by index because the index is purposefully out of order as a result of previous operations.
df = pd.DataFrame({
    'Group': [5, 5, 5, 9, 9, 777, 777, 1, 2, 2],
    'V1': ['a', 'b', 'a', 3, 6, 1, None, 10, 3, None],
    'V2': ['blah', 'blah', 'blah', 'dog', 'cat', 'cat', 'na', 'first', 'last', 'nada'],
    'V3': [1, 2, 3, 4, 5, 5, 4, 3, 2, 1],
})
And want it to look like this:
I've tried various things like
df.groupby(['Group'])['Group'].aggregate({'min grp': 'min'}).sort_values(by=['min grp'], ascending=True)
If it helps, the original df was created via pd.concat(list-of-dataframes) and when I sorted them afterwards by Group it also sorted the rows within the Group based on the index, which does not work for my specific problem.
You need to use sort_values with option kind='mergesort'. From pandas docs:
kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’
Choice of sorting algorithm. See also ndarray.np.sort for more
information. mergesort is the only stable algorithm. For DataFrames,
this option is only applied when sorting on a single column or label.
A sorting algorithm is called stable when two elements with equal keys appear in the output in the same order as in the input. Stable sorts include: insertion sort, merge sort, bubble sort, timsort, and counting sort.
So you need:
df = df.sort_values('Group', kind='mergesort')
When you call sort_values without kind, the default is ‘quicksort’, and quicksort is not stable.
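A quick check with the example data (only 'Group' and 'V3' shown) that the in-group order survives a stable sort:

```python
import pandas as pd

df = pd.DataFrame({
    'Group': [5, 5, 5, 9, 9, 777, 777, 1, 2, 2],
    'V3':    [1, 2, 3, 4, 5, 5, 4, 3, 2, 1],
})

# mergesort is stable, so rows sharing a Group value keep their
# original relative order (e.g. the three Group-5 rows stay 1, 2, 3).
out = df.sort_values('Group', kind='mergesort')
```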
If I understand your question correctly, you don't want to group-by, but to sort by the values of your column Group. You can do it with pandas.sort_values()
df.sort_values('Group', inplace=True)
What exactly is the lexsort_depth of a multi-index dataframe? Why does it have to be sorted for indexing?
For example, I have noticed that, after manually building a multi-index dataframe df with columns organized in three levels, if I try to do:
idx = pd.IndexSlice
df[idx['foo', 'bar']]
I get:
KeyError: 'Key length (2) was greater than MultiIndex lexsort depth (0)'
and at this point, df.columns.lexsort_depth is 0
However, if I do, as recommended here and here:
df = df.sortlevel(0,axis=1)
then the cross-section indexing works. Why? What exactly is lexsort_depth, and why sorting with sortlevel fixes this type of indexing?
lexsort_depth is the number of levels of a multi-index that are sorted lexically. That is, in an a-b-c-1-2-3 order (normal sort order).
So element indexing will work even if a multi-index is not sorted, but the lookups may be quite a bit slower (in 0.15.2, these kinds of lookups show a PerformanceWarning; see here).
The reason sorting is in general a good idea is that pandas can use hash-based indexing to figure out the location within each level independently; those per-level indexers can then be combined to find the final locations. Pandas takes advantage of np.searchsorted to find these locations when the index is sorted; if it is not sorted, pandas has to fall back to different (slower) methods.
here is the code that does this.
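To reproduce this on a current pandas (where sortlevel has been removed and sort_index(axis=1) is its replacement), a minimal sketch with made-up column labels:

```python
import numpy as np
import pandas as pd

# A column MultiIndex built in a deliberately unsorted order; its
# lexsort depth is 0, so partial slicing would raise the KeyError
# quoted above.
cols = pd.MultiIndex.from_tuples(
    [('foo', 'two'), ('foo', 'one'), ('bar', 'one')])
df = pd.DataFrame(np.zeros((2, 3)), columns=cols)

# Sorting the column index lexically restores fast, warning-free
# cross-section indexing.
df = df.sort_index(axis=1)
sub = df.loc[:, ('foo', 'one')]
```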