How to sort dataframe columns based on 2 indexes? - python

I have a data frame like this
df:
Index C-1 C-2 C-3 C-4 ........
Ind-1 3 9 5 4
Ind-2 5 2 8 3
Ind-3 0 1 1 0
.
.
The data frame has more than a hundred columns and rows, with whole numbers (0-60) as values.
The first two rows (indexes) have values in the range 2-12.
I want to sort the columns based on the values in the first and second rows (indexes), in ascending order. I don't care about the sort order within the remaining rows.
Can anyone help me with this?

pandas.DataFrame.sort_values
Pass the rows you want to sort on as the first argument (by), and axis=1 to sort across columns.
Note that the order in by sets the priority. If you want to sort first by the second row and then by the first, pass ['Ind-2','Ind-1']; this will give a different result.
df.sort_values(by=['Ind-1','Ind-2'], axis=1)
Output
C-1 C-4 C-3 C-2
Index
Ind-1 3 4 5 9
Ind-2 5 3 8 2
Ind-3 0 0 1 1
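For completeness, here is a minimal runnable version of the example above, with the data reconstructed from the question:

```python
import pandas as pd

# Small dataframe reconstructed from the question
df = pd.DataFrame(
    {'C-1': [3, 5, 0], 'C-2': [9, 2, 1], 'C-3': [5, 8, 1], 'C-4': [4, 3, 0]},
    index=['Ind-1', 'Ind-2', 'Ind-3'],
)

# Sort the columns by the values in row 'Ind-1', breaking ties with 'Ind-2'
sorted_df = df.sort_values(by=['Ind-1', 'Ind-2'], axis=1)
print(list(sorted_df.columns))  # ['C-1', 'C-4', 'C-3', 'C-2']
```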

Related

Put level of dataframe index at the same level of columns on a Multi-Index Dataframe

Context: I'd like to "bump" the index level of a multi-index dataframe up. In other words, I'd like to put the index level of the dataframe at the same level as the columns of the multi-indexed dataframe.
Let's say we have this dataframe:
tt = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
tt.index.name = 'Index Column'
And we perform this change to add a multi-index level (like a label for the table):
tt = pd.concat([tt],keys=['Multi-Index Table Label'], axis=1)
Which results in this:
Multi-Index Table Label
A B C
Index Column
0 1 4 7
1 2 5 8
2 3 6 9
Desired Output: How can I make it so that the dataframe looks like this instead (notice the removal of the empty level on the dataframe/table):
Multi-Index Table Label
Index Column A B C
0 1 4 7
1 2 5 8
2 3 6 9
Attempts: I was testing something out, and you can essentially remove the index level by doing this:
tt.index.name = None
Which would result in:
Multi-Index Table Label
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Essentially removing that extra level/empty line. But the thing is that I do want to keep 'Index Column', as it gives information about the type of data present in the index (which in this example is just 0, 1, 2, but could be years, dates, etc.).
How could I do that?
Thank you all in advance :)
How about this:
tt = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
tt.insert(loc=0, column='Index Column', value=tt.index)
tt = pd.concat([tt],keys=['Multi-Index Table Label'], axis=1)
tt = tt.style.hide_index()
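As a side note, Styler.hide_index() was deprecated in pandas 1.4 in favor of Styler.hide(axis='index'). If a plain DataFrame is acceptable (a Styler only changes the display), a sketch of an equivalent approach is to move the index into a regular column before adding the label level:

```python
import pandas as pd

tt = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
tt.index.name = 'Index Column'

# Turn the named index into an ordinary first column,
# then prepend the top-level label as before
tt = tt.reset_index()
tt = pd.concat([tt], keys=['Multi-Index Table Label'], axis=1)
print(tt.columns.tolist())
```

The default RangeIndex still prints on the left; hiding that for display is exactly what the Styler trick in the answer does.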

Why do we need to add : when defining a new column using .iloc?

When we make a new column in a dataset in pandas:
df["Max"] = df.iloc[:, 5:7].sum(axis=1)
If we are only getting the columns from index 5 to index 7, why do we need to pass : to select all the rows?
pandas.DataFrame.iloc is purely integer-location based indexing for selection by position (read here for documentation). The : means all rows in the selected columns, here column indices 5 and 6 (iloc is not inclusive of the last index).
You are using .iloc to take a slice out of the dataframe and apply an aggregate function across the columns of the slice.
Consider an example:
df = pd.DataFrame({"a":[0,1,2],"b":[2,3,4],"c":[4,5,6]})
df
would produce the following dataframe
a b c
0 0 2 4
1 1 3 5
2 2 4 6
You are using iloc to avoid dealing with named columns, so that
df.iloc[:,1:3]
would look as follows
b c
0 2 4
1 3 5
2 4 6
Now a slight modification of your code gets you a new column containing sums across columns:
df.iloc[:,1:3].sum(axis=1)
0 6
1 8
2 10
Alternatively you could use function application:
df.apply(lambda x: x.iloc[1:3].sum(), axis=1)
0 6
1 8
2 10
Thus you explicitly tell pandas to apply sum across columns. However, your syntax is more succinct and preferable to explicit function application. The result is the same either way.
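Putting it together, here is the question's line adapted to the small three-column example (positions 1:3 instead of 5:7):

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2], "b": [2, 3, 4], "c": [4, 5, 6]})

# ':' selects all rows; 1:3 selects the columns at positions 1 and 2 ('b' and 'c')
df["Max"] = df.iloc[:, 1:3].sum(axis=1)
print(df["Max"].tolist())  # [6, 8, 10]
```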

Divide most of the columns by the value of one of them

I have a dataframe like this one:
Name Team Shots Goals Actions Games Minutes
1 Player 1 ABC 5 3 20 2 15
2 Player 2 ATL 6 2 15 1 30
3 Player 3 RMA 3 3 16 1 20
4 Player 4 BAR 9 0 22 3 28
5 Player 5 ATL 8 1 19 2 32
Actually, my df has around 120 columns, but this example shows what the solution should look like. I need the same df but with the values of most of the columns divided by one of them. In this case I would like the values of 'Shots', 'Goals' and 'Actions' divided by 'Minutes', but I don't want to apply this to 'Games' (and some 3 or 4 other specific columns in my real case).
Do you know any code that applies the division while letting me name the exception columns it should skip?
try:
exclude = ['Games', 'Minutes']
# list of columns to exclude from the division
cols = df.columns[df.dtypes != 'O']
# keep only columns of type int/float
cols = cols[~cols.isin(exclude)]
# drop the columns that appear in the exclude list
Finally:
out = df[cols].div(df['Minutes'], axis=0)
Update:
If you want the complete, final df including the excluded columns and their original values, you can use the join() method:
finaldf = out.join(df[exclude])
# if you want to join only the excluded columns
OR
cols = df.columns[df.dtypes == 'O'].tolist() + exclude
finaldf = out.join(df[cols])
# if you want all the excluded columns plus the string ones
You can use df.div() to divide multiple columns by one column (note this returns a new frame, so assign the result back if you want to keep it in df):
df[['Shots','Goals','Actions']] = df[['Shots','Goals','Actions']].div(df.Minutes, axis=0)
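A self-contained sketch combining the steps above, using a cut-down version of the question's table:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Player 1', 'Player 2'],
    'Team': ['ABC', 'ATL'],
    'Shots': [5, 6], 'Goals': [3, 2], 'Actions': [20, 15],
    'Games': [2, 1], 'Minutes': [15, 30],
})

exclude = ['Games', 'Minutes']
cols = df.columns[df.dtypes != 'O']   # numeric columns only
cols = cols[~cols.isin(exclude)]      # drop the excluded ones

out = df[cols].div(df['Minutes'], axis=0)   # per-minute rates
finaldf = out.join(df[exclude])             # re-attach the untouched columns
```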

What is the best way to create new pandas dataframe consisting of specific rows of an existing dataframe that match criteria?

I have a pandas dataframe with 6 columns and several rows, each row being data from a specific participant in an experiment. Each column is a particular scale that the participant responded to and contains their scores. I want to create a new dataframe that has only data from those participants whose score for one particular measure matches a criteria.
The criteria is that it has to match one of the items from a list that I have generated separately.
To paraphrase: I have the data in a dataframe, and I want to isolate participants whose score in any of the 6 measures matches a list of scores of interest. I want all 6 columns in the new dataframe, with just the rows of the participants of interest. Hope this is clear.
I tried using the groupby function, but it doesn't offer enough specificity in the criteria, or at least I don't know the syntax if such methods exist. I'm fairly new to pandas.
You could use isin() and any() to isolate the participants getting a particular score in the tests.
Here's a small example DataFrame showing the scores of five participants in three tests:
>>> df = pd.DataFrame(np.random.randint(1,6,(5,3)), columns=['Test1','Test2','Test3'])
>>> df
Test1 Test2 Test3
0 3 3 5
1 5 5 2
2 5 3 4
3 1 3 3
4 2 1 1
If you want a DataFrame with the participants scoring a 1 or 2 in any of the three tests, you could do the following:
>>> score = [1, 2]
>>> df[df.isin(score).any(axis=1)]
Test1 Test2 Test3
1 5 5 2
3 1 3 3
4 2 1 1
Here df.isin(score) creates a boolean DataFrame showing whether each value of df is in the list score. any(axis=1) checks each row for at least one True value, creating a boolean Series. This Series is then used to index the DataFrame df.
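The same pattern with a seeded generator so the result is reproducible (the test names follow the example above; the data is random but fixed by the seed):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.integers(1, 6, (5, 3)), columns=['Test1', 'Test2', 'Test3'])

score = [1, 2]
mask = df.isin(score).any(axis=1)   # True for rows containing a 1 or 2
subset = df[mask]
print(subset)
```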
If I understood your question correctly, you want to filter a dataframe by whether its entries appear in a list.
Say you have a "results" df like
df = pd.DataFrame({'score1': np.random.randint(0,10,5),
                   'score2': np.random.randint(0,10,5)})
score1 score2
0 7 2
1 9 9
2 9 3
3 9 3
4 0 4
and a set of positive outcomes
positive_outcomes = [1,5,7,3]
then you can query the df like
df_final = df[df.score1.isin(positive_outcomes) | df.score2.isin(positive_outcomes)]
to get
score1 score2
0 7 2
2 9 3
3 9 3

Sorting DateTimeSeries duplicates in DataFrame

Alright, so I hope we don't need an example for this, but let's say we have a DataFrame with 100k rows and 50+ instances of the index being the exact same DateTime.
What would be the fastest way to sort my DataFrame by Time and, if there is a tie, sort by a second column?
So:
Sort By Time
If Duplicate Time, sort by 'Cost'
If you pass a list of the columns in the order you want them sorted by, it will sort by the first column and then the second column:
df = DataFrame({'a':[1,2,1,1,3], 'b':[1,2,2,2,1]})
df
Out[11]:
a b
0 1 1
1 2 2
2 1 2
3 1 2
4 3 1
In [13]:
df.sort_values(by=['a','b'], inplace=True)
df
Out[13]:
a b
0 1 1
2 1 2
3 1 2
1 2 2
4 3 1
So for your example
df.sort_values(by=['Time', 'Cost'], inplace=True)
would work
EDIT
It has been pointed out (by @AndyHayden) that there is a bug if you have NaN values in the supplementary columns; see this SO question and the related GitHub issue. This may not be an issue in your case, but it is something to be aware of.
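A small sketch of the same idea with current pandas (the 'Time' and 'Cost' names follow the question; the data itself is made up):

```python
import pandas as pd

df = pd.DataFrame({
    'Time': pd.to_datetime(['2024-01-01 09:00', '2024-01-01 09:00', '2024-01-01 08:00']),
    'Cost': [5.0, 2.0, 7.0],
})

# Sort by Time first; rows with duplicate timestamps fall back to Cost
df = df.sort_values(by=['Time', 'Cost']).reset_index(drop=True)
print(df['Cost'].tolist())  # [7.0, 2.0, 5.0]
```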
