Alright, so I hope we don't need an example for this, but let's say we have a DataFrame with 100k rows and 50+ instances where the Index is the exact same DateTime.
What would be the fastest way to sort my DataFrame by Time and, if there is a tie, sort by a second column?
So:
Sort By Time
If Duplicate Time, sort by 'Cost'
If you pass a list of the columns in the order you want them sorted by, it will sort by the first column and then by the second column:
df = pd.DataFrame({'a': [1, 2, 1, 1, 3], 'b': [1, 2, 2, 2, 1]})
df
Out[11]:
a b
0 1 1
1 2 2
2 1 2
3 1 2
4 3 1
In [13]:
df.sort(columns=['a','b'], inplace=True)
df
Out[13]:
a b
0 1 1
2 1 2
3 1 2
1 2 2
4 3 1
So for your example
df.sort(columns=['Time', 'Cost'],inplace=True)
would work
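Note: in modern pandas (0.20 and later) DataFrame.sort has been removed, so the equivalent call today would be sort_values:
df.sort_values(by=['Time', 'Cost'], inplace=True)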
EDIT
It has been pointed out (by @AndyHayden) that there is a bug if you have NaNs in the supplementary columns; see this SO question and the related GitHub issue. This may not be an issue in your case, but it is something to be aware of.
Related
I have a data frame like this
df:
Index C-1 C-2 C-3 C-4 ........
Ind-1 3 9 5 4
Ind-2 5 2 8 3
Ind-3 0 1 1 0
.
.
The data frame has more than a hundred columns and rows, with whole numbers (0-60) as values.
The first two rows (indexes) have values in the range 2-12.
I want to sort the columns based on the values in the first and second rows (indexes), in ascending order. I don't care about the ordering within the remaining rows.
Can anyone help me with this?
pandas.DataFrame.sort_values
In the by argument you pass the rows you need to sort on, and axis=1 to sort the columns.
Mind that the order within by sets the priority: if you want to sort first by the second row and then by the first, you should pass ['Ind-2','Ind-1'], which will give a different result.
df.sort_values(by=['Ind-1','Ind-2'],axis=1)
Output
C-1 C-4 C-3 C-2
Index
Ind-1 3 4 5 9
Ind-2 5 3 8 2
Ind-3 0 0 1 1
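To make the ordering point concrete, here is a small sketch that rebuilds the frame from the values in your question and sorts the columns both ways:
import pandas as pd

# Frame reconstructed from the question (only the first four columns/rows shown there)
df = pd.DataFrame({'C-1': [3, 5, 0], 'C-2': [9, 2, 1], 'C-3': [5, 8, 1], 'C-4': [4, 3, 0]},
                  index=pd.Index(['Ind-1', 'Ind-2', 'Ind-3'], name='Index'))

# Sort columns by the first row, breaking ties with the second row
df.sort_values(by=['Ind-1', 'Ind-2'], axis=1)   # column order: C-1, C-4, C-3, C-2

# Swapping the order sorts primarily by the second row, giving a different column order
df.sort_values(by=['Ind-2', 'Ind-1'], axis=1)   # column order: C-2, C-4, C-1, C-3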
When we make a new column in a dataset in pandas
df["Max"] = df.iloc[:, 5:7].sum(axis=1)
If we are only getting the columns from index 5 to index 7, why do we also need to pass :, as if selecting all the columns?
pandas.DataFrame.iloc is used purely for integer-location based indexing, i.e. selection by position (see the documentation). The : means all rows in the selected columns, here column indices 5 and 6 (iloc is not inclusive of the last index).
You are using .iloc to take a slice out of the dataframe and apply an aggregate function across the columns of that slice.
Consider an example:
df = pd.DataFrame({"a":[0,1,2],"b":[2,3,4],"c":[4,5,6]})
df
would produce the following dataframe
a b c
0 0 2 4
1 1 3 5
2 2 4 6
You are using iloc to avoid dealing with named columns, so that
df.iloc[:,1:3]
would look as follows
b c
0 2 4
1 3 5
2 4 6
Now a slight modification of your code gives the sum across those columns for each row:
df.iloc[:,1:3].sum(axis=1)
0 6
1 8
2 10
Alternatively you could use function application:
df.apply(lambda x: x.iloc[1:3].sum(), axis=1)
0 6
1 8
2 10
Here you explicitly tell pandas to apply the sum across columns. However, your original syntax is more succinct and preferable to explicit function application; the result is the same as one would expect.
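To actually attach the result as a new column, as in your original line, you just assign it back (the column name sum_bc below is purely illustrative):
df["sum_bc"] = df.iloc[:, 1:3].sum(axis=1)
df
   a  b  c  sum_bc
0  0  2  4       6
1  1  3  5       8
2  2  4  6      10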
I have a dataframe like the following:
This dataframe is a result of unstacking another dataframe.
cost     10  20  30
cycles
1         2   4   6
2         1   2   3
3         3   6   9
4         1   0   5
I want something like this:
cycles   10  20  30
1         2   4   6
2         1   2   3
3         3   6   9
4         1   0   5
I am a little confused about pivoting in unstacked dataframes. I went through a couple of similar posts, but I couldn't really understand them. I want to perform regression on every column of this dataframe, and I figured it would be difficult to access the cycles column in the first dataframe, so I would really appreciate it if someone could shed some light on this. Thanks in advance!
Do you want this?
df.reset_index()
Edit: To drop the axis name cost, you need to use:
df.rename_axis(None, axis=1).reset_index()
This will still return an index; that is just the way pandas works, but it will not have a label floating over it. If you want cycles as the index without the cost label, you can just use the first part:
df.rename_axis(None, axis=1)
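A minimal sketch with a frame shaped like the one in the question (the values are taken from your post; building it by hand here just stands in for the unstack result):
import pandas as pd

df = pd.DataFrame({10: [2, 1, 3, 1], 20: [4, 2, 6, 0], 30: [6, 3, 9, 5]},
                  index=pd.Index([1, 2, 3, 4], name='cycles'))
df.columns.name = 'cost'

# Drop the 'cost' label on the column axis and turn 'cycles' into a regular column
df.rename_axis(None, axis=1).reset_index()
#    cycles  10  20  30
# 0       1   2   4   6
# 1       2   1   2   3
# 2       3   3   6   9
# 3       4   1   0   5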
I have a pandas dataframe with 6 columns and several rows, each row being data from a specific participant in an experiment. Each column is a particular scale that the participant responded to and contains their scores. I want to create a new dataframe that has only data from those participants whose score for one particular measure matches a criteria.
The criteria is that it has to match one of the items from a list that I have generated separately.
To paraphrase, I have the data in a dataframe and I want to isolate participants who scored a certain score in one of the 6 measures that matches a list of scores that are of interest. I want to have all the 6 columns in the new dataframe with just the rows of participants of interest. Hope this is clear.
I tried using the groupby function but it doesn't offer enough specificity in specifying the criteria, or at least I don't know the syntax if such methods exist. I'm fairly new to pandas.
You could use isin() and any() to isolate the participants getting a particular score in the tests.
Here's a small example DataFrame showing the scores of five participants in three tests:
>>> df = pd.DataFrame(np.random.randint(1,6,(5,3)), columns=['Test1','Test2','Test3'])
>>> df
Test1 Test2 Test3
0 3 3 5
1 5 5 2
2 5 3 4
3 1 3 3
4 2 1 1
If you want a DataFrame with the participants scoring a 1 or 2 in any of the three tests, you could do the following:
>>> score = [1, 2]
>>> df[df.isin(score).any(axis=1)]
Test1 Test2 Test3
1 5 5 2
3 1 3 3
4 2 1 1
Here df.isin(score) creates a boolean DataFrame showing whether each value of df is in the list score. any(axis=1) then checks each row for at least one True value, producing a boolean Series, which is used to index the DataFrame df.
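If instead you only care about one particular measure, apply isin to that single column (the column name here is just an example):
>>> df[df['Test2'].isin(score)]
Test1 Test2 Test3
4 2 1 1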
If I understood your question correctly, you want to query a dataframe for rows whose entries appear in a list.
Like, you have a "results" df like
df = pd.DataFrame({'score1': np.random.randint(0, 10, 5),
                   'score2': np.random.randint(0, 10, 5)})
score1 score2
0 7 2
1 9 9
2 9 3
3 9 3
4 0 4
and a set of positive outcomes
positive_outcomes = [1,5,7,3]
then you can query the df like
df_final = df[df.score1.isin(positive_outcomes) | df.score2.isin(positive_outcomes)]
to get
score1 score2
0 7 2
2 9 3
3 9 3
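If you have more than a couple of score columns, chaining | terms gets unwieldy; an equivalent row filter over a subset of columns would be:
score_cols = ['score1', 'score2']  # extend with any further score columns
df_final = df[df[score_cols].isin(positive_outcomes).any(axis=1)]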
I have a pandas DataFrame df with a list of unique ids, id, and a DataFrame master_df with a master list of all known ids, master_df.id. I'm trying to figure out the best way to perform an isin that also returns the index where the value is located. So if master_df was
index id
1 1
2 2
3 3
and df was
index id
1 3
2 4
3 1
I want something like (3, False, 1).
I'm currently doing an isin and then brute-forcing the lookup with a loop, but I'm sure there is a much better way to do it.
One way is to do a merge:
In [11]: df.merge(mdf, on='id', how='left')
Out[11]:
index_x id index_y
0 1 3 3
1 2 4 NaN
2 3 1 1
and column index_y is the desired result*:
In [12]: df.merge(mdf, on='id', how='left').index_y
Out[12]:
0 3
1 NaN
2 1
Name: index_y, dtype: float64
* Except for NaN vs. False, but I think NaN is what you really want here. As @DSM points out, in Python False == 0, so you may get into trouble using False to represent "missing" versus a match found with id 0. (If you still want to do it, replace the NaN with 0 using .fillna(0).)
Note: it's possible it will be more efficient to just take the columns you care about:
df[['id']].merge(mdf[['id', 'index']], on='id', how='left')
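For completeness, a small self-contained sketch of the frames assumed above (with the index stored as an ordinary column, as in the question):
import pandas as pd

mdf = pd.DataFrame({'index': [1, 2, 3], 'id': [1, 2, 3]})
df = pd.DataFrame({'index': [1, 2, 3], 'id': [3, 4, 1]})

# Left-merge on id; ids missing from the master list get NaN in index_y
result = df.merge(mdf, on='id', how='left')
result['index_y']                          # 3.0, NaN, 1.0
result['index_y'].fillna(0).astype(int)    # 3, 0, 1 if you prefer 0 over NaN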