How to save the sort order in a DataFrame? - python

I want to add a new column that stores the sort order of the DataFrame when it is sorted by one of its columns. For example, I would like to sort by column 'B' (ascending) and add a new column 'C' to hold the sort order; that is, I want column 'C' to be [4, 3, 1, 2, 4, 2]:
df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6], "B": [5, 2, 0, 1, 5, 1]})

Try rank with method='dense', so that the rank always increases by 1 between groups of equal values:
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6], "B": [5, 2, 0, 1, 5, 1]})
# rank returns floats by default; cast to int to match the expected output
df['C'] = df['B'].rank(method='dense').astype(int)
df
Output:
   A  B  C
0  1  5  4
1  2  2  3
2  3  0  1
3  4  1  2
4  5  5  4
5  6  1  2
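For comparison, rank offers other tie-handling methods; a quick sketch on the same column (the commented lists are what these methods should produce here):
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6], "B": [5, 2, 0, 1, 5, 1]})

# 'dense': ties share a rank and the next rank is always +1
print(df["B"].rank(method="dense").astype(int).tolist())  # [4, 3, 1, 2, 4, 2]

# 'min': ties share the lowest rank of the group, leaving gaps after ties
print(df["B"].rank(method="min").astype(int).tolist())    # [5, 4, 1, 2, 5, 2]

# 'first': ties broken by position, giving a plain sort order without repeats
print(df["B"].rank(method="first").astype(int).tolist())  # [5, 4, 1, 2, 6, 3]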

Sliding minimum value in a pandas column

I am working with a pandas DataFrame that has two columns: "personID" and "points". I would like to create a third column ("localMin") that stores, for each row, the minimum value of "points" over that row and all previous rows for the same personID.
Does anyone have an idea how to achieve this efficiently? I have approached this problem using shift() with different period sizes, but shift is sensitive to variations in the sequence and doesn't always produce the output I would expect.
Thank you in advance!
Use groupby.cummin:
df['localMin'] = df.groupby('personID')['points'].cummin()
Example:
df = pd.DataFrame({'personID': list('AAAAAABBBBBB'),
                   'points': [3, 4, 2, 6, 1, 2, 4, 3, 1, 2, 6, 1]})
df['localMin'] = df.groupby('personID')['points'].cummin()
Output:
    personID  points  localMin
0          A       3         3
1          A       4         3
2          A       2         2
3          A       6         2
4          A       1         1
5          A       2         1
6          B       4         4
7          B       3         3
8          B       1         1
9          B       2         1
10         B       6         1
11         B       1         1
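Note that cummin is an expanding (running) minimum over all previous rows. If "sliding" is meant as a fixed-size window instead, a per-person rolling minimum is a small variation; a sketch assuming a window of 3 rows:
import pandas as pd

df = pd.DataFrame({'personID': list('AAAAAABBBBBB'),
                   'points': [3, 4, 2, 6, 1, 2, 4, 3, 1, 2, 6, 1]})

# minimum over the last 3 rows within each person; min_periods=1 keeps the
# first rows of each group from becoming NaN
df['rollingMin'] = (df.groupby('personID')['points']
                      .rolling(window=3, min_periods=1)
                      .min()
                      .reset_index(level=0, drop=True))
print(df)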

Pandas merge multiple value columns into a value and type column

I have a pandas dataframe where there are multiple integer value columns denoting a count. I want to transform this dataframe such that the value columns are merged into one column but another column is created denoting the column the value was taken from.
Input
   a  b   c
0  2  5   8
1  3  6   9
2  4  7  10
Output
   count type
0      2    a
1      3    a
2      4    a
3      5    b
4      6    b
5      7    b
6      8    c
7      9    c
8     10    c
I'm sure this is possible by looping over the entries and creating however many rows for each original row, but I'm also sure there is a pandas way to achieve this, and I would like to know what it is called.
You could do that with melt:
pd.melt(df, value_vars=['a','b','c'], value_name='count', var_name='type')
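As a self-contained sketch (the input frame rebuilt from the table above), with the columns reordered at the end to match the expected output, since melt puts the type column first:
import pandas as pd

df = pd.DataFrame({'a': [2, 3, 4], 'b': [5, 6, 7], 'c': [8, 9, 10]})

# melt stacks the value columns into one; var_name/value_name label the result
out = pd.melt(df, value_vars=['a', 'b', 'c'],
              value_name='count', var_name='type')[['count', 'type']]
print(out)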

Groupby by index in Pandas

How can I group by the index values (1, 2, 3; they always repeat in the same order) and get the sum of the score column for each repetition of the index sequence? Basically I have this:
index  score
1      2
2      2
3      2
1      3
2      3
3      3
What I want:
index  score  sum
1      2      6
2      2      9
3      2
1      3
2      3
3      3
I understand it has to be something like this:
df = df.groupby(['Year'])['Score'].sum()
but instead of a Year, to somehow do it by indexes?
Per the comments, you can group by the index and take cumcount(), storing the result in a new object s. Then group by s and take the sum(). This assumes 'index' is really the index in your example and not a column called 'index'; if it is a column, first do df = df.set_index('index'):
s = df.groupby(level=0).cumcount()
df.groupby(s)['score'].sum()
0 6
1 9
Name: score, dtype: int64
If you print out s, then s looks like this:
index
1    0
2    0
3    0
1    1
2    1
3    1
dtype: int64
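Put together as a runnable sketch (the index built explicitly to mirror the example):
import pandas as pd

df = pd.DataFrame({'score': [2, 2, 2, 3, 3, 3]}, index=[1, 2, 3, 1, 2, 3])
df.index.name = 'index'

# cumcount numbers each repetition of an index label: the first 1-2-3 block
# gets 0, the second block gets 1
s = df.groupby(level=0).cumcount()
print(df.groupby(s)['score'].sum())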

Is there a function in pandas to help me count each string from each row list?

I have a dataframe like this:
df1
   a  b                   c
0  1  2  [bg10, ng45, fg56]
1  4  5        [cv10, fg56]
2  7  8  [bg10, ng45, fg56]
3  7  8        [fg56, fg56]
4  7  8              [bg10]
I would like to count the total occurrences of each value in column 'c'. I would then like to return the value of column 'b' for the values in column 'c' that have a total count of 1.
The expected output is something like this:
      c  b  total_count
0  bg10  2            2
0  ng45  2            2
0  fg56  2            5
1  cv10  5            1
1  fg56  5            5
I have tried the 'collections' library and a 'for' loop (I understand it's not best practice in pandas), but I think I'm missing some fundamental understanding of lists within cells, and how to perform analyses like these.
Thank you for taking my question into consideration.
I would use apply in the following way.
First, create the df:
df1 = pd.DataFrame({"b": [2, 5, 8, 8],
                    "c": [['bg10', 'ng45', 'fg56'], ['cv10', 'fg56'],
                          ['bg10', 'ng45', 'fg56'], ['fg56', 'fg56']]})
Next, use apply to count the number of (non-unique) items in each list and save it in a new column:
df1["count_c"]=df1.c.apply(lambda x: len(x))
you will get the following:
   b                   c  count_c
0  2  [bg10, ng45, fg56]        3
1  5        [cv10, fg56]        2
2  8  [bg10, ng45, fg56]        3
3  8        [fg56, fg56]        2
To get the rows where the count is larger than a threshold:
df1[df1["count_c"] > 2]["b"]
Note: if you want to count only the unique values in each list in column c, use:
df1["count_c"] = df1.c.apply(lambda x: len(set(x)))
EDIT
In order to count the total number of occurrences of each item, I would try this.
First, let's "unpack" all the lists into columns, then stack them into one row per item:
new_df1 = (df1.c.apply(pd.Series)
              .stack()
              .reset_index(level=1, drop=True)
              .to_frame("c")
              .join(df1[["b"]], how="left"))
Then get the total count of each item and add it as a new column:
counts_dict=new_df1.c.value_counts().to_dict()
new_df1["total_count_c"]=new_df1.c.map(counts_dict)
new_df1.head()
      c  b  total_count_c
0  bg10  2              2
0  ng45  2              2
0  fg56  2              5
1  cv10  5              1
1  fg56  5              5
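On pandas 0.25+, explode makes the unpacking step more direct; a sketch of the same idea, ending with the 'b' values for items whose total count is 1:
import pandas as pd

df1 = pd.DataFrame({"b": [2, 5, 8, 8],
                    "c": [['bg10', 'ng45', 'fg56'], ['cv10', 'fg56'],
                          ['bg10', 'ng45', 'fg56'], ['fg56', 'fg56']]})

# one row per list element, 'b' stays aligned with each item
exploded = df1.explode("c")

# total occurrences of each item across the whole column
exploded["total_count"] = exploded.groupby("c")["c"].transform("count")

# 'b' for items that occur exactly once
print(exploded.loc[exploded["total_count"] == 1, ["c", "b"]])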

Making a Multiindexed Pandas Dataframe Non-Symmetric

I have a multi-indexed dataframe which looks roughly like this:
import pandas as pd
test = pd.DataFrame({('A', 'a'): [1, 2, 3, 4, 5], ('A', 'b'): [5, 4, 3, 2, 1],
                     ('B', 'a'): [5, 2, 3, 4, 1], ('B', 'b'): [1, 4, 3, 2, 5]})
Output:
   A     B
   a  b  a  b
0  1  5  5  1
1  2  4  2  4
2  3  3  3  3
3  4  2  4  2
4  5  1  1  5
In this DataFrame, row 0 and row 4 are symmetric in the sense that if the entire A and B blocks of row 0 are swapped, it becomes identical to row 4. Similarly, row 2 is symmetric with itself.
I am planning to remove such rows from my original DataFrame, thus making it 'non-symmetric'. The specific plans are as follows:
If a row with a higher index is symmetric with a row with a lower index, keep the lower one and remove the higher one. For example, from the above DataFrame, keep row 0 and remove row 4.
If a row is symmetric with itself, remove that row. For example, from the above DataFrame, remove row 2.
My attempt was to first zip the four columns into a list of tuples, remove the symmetric tuples with a simple if statement, unzip them, and merge them back into a DataFrame. However, this turned out to be inefficient and does not scale to large DataFrames.
How can I achieve this efficiently? I suspect several built-in pandas methods are needed, but it seems quite complicated.
Namudon'tdie,
Try this solution:
import pandas as pd
test = pd.DataFrame({('A', 'a'): [1, 2, 3, 4, 5], ('A', 'b'): [5, 4, 3, 2, 1],
                     ('B', 'a'): [5, 2, 3, 4, 1], ('B', 'b'): [1, 4, 3, 2, 5]})
test['idx'] = test.index * 2           # auxiliary column 'idx' (all even)
test2 = test.iloc[:, [2, 3, 0, 1, 4]]  # flipped copy: B columns before A
test2.columns = test.columns           # restore the original column names
test2['idx'] = test2.index * 2 + 1     # for the flipped copy, 'idx' is all odd
df = pd.concat([test, test2])
df = df.sort_values(by='idx')
df = df.set_index('idx')
print(df)
     A     B
     a  b  a  b
idx
0    1  5  5  1
1    5  1  1  5
2    2  4  2  4
3    2  4  2  4
4    3  3  3  3
5    3  3  3  3
6    4  2  4  2
7    4  2  4  2
8    5  1  1  5
9    1  5  5  1
df = df.drop_duplicates() # remove rows with duplicates
df = df[df.index%2 == 0] # remove rows with odd idx (flipped)
df = df.reset_index()[['A', 'B']]
print(df)
   A     B
   a  b  a  b
0  1  5  5  1
1  2  4  2  4
2  3  3  3  3
3  4  2  4  2
The idea is to create flipped rows with odd indices so that they land directly under their original rows after reindexing, then drop duplicates, keeping the rows with lower indices. For cleanup, simply delete the remaining rows with odd indices.
Note that row [3,3,3,3] stayed. A separate filter is needed for self-symmetric rows. Since the definition of self-symmetric is ambiguous here (other rows have a certain degree of symmetry too), that part is left to you; one possible reading is sketched below.
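If self-symmetric is read as "the A block equals the B block", one possible filter (a sketch under that assumption; note that by this reading rows 1 and 3 of the sample qualify too, which is exactly the ambiguity mentioned above):
import pandas as pd

test = pd.DataFrame({('A', 'a'): [1, 2, 3, 4, 5], ('A', 'b'): [5, 4, 3, 2, 1],
                     ('B', 'a'): [5, 2, 3, 4, 1], ('B', 'b'): [1, 4, 3, 2, 5]})

# a row equals its own flip exactly when its A block equals its B block
self_sym = (test['A'].values == test['B'].values).all(axis=1)
print(test[~self_sym])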
