I have a dataframe like the following:
This dataframe is a result of unstacking another dataframe.
cost     10   20   30
---------------------
cycles
---------------------
1         2    4    6
2         1    2    3
3         3    6    9
4         1    0    5
I want something like this:
cycles   10   20   30
---------------------
1         2    4    6
2         1    2    3
3         3    6    9
4         1    0    5
I am a little confused about pivoting unstacked dataframes. I went through a couple of similar posts but couldn't really follow them. I want to perform regression on every column of this dataframe, and I figured it would be difficult to access the cycles column in the first dataframe, so I would really appreciate it if someone could shed some light on this. Thanks in advance!
Do you want this?
df.reset_index()
Edit: To drop the axis name cost, you need to use:
df.rename_axis(None, axis=1).reset_index()
This will still have an index, that is just how pandas works, but it will not have a label floating over it. If you want cycles as the index without the cost label, you can just use the first part:
df.rename_axis(None, axis=1)
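As a minimal, self-contained sketch of the two calls together (rebuilding a frame shaped like the one above, with the same values):

```python
import pandas as pd

# Rebuild a frame shaped like the unstacked result above:
# 'cycles' is the row index, and the columns axis is named 'cost'.
df = pd.DataFrame(
    {10: [2, 1, 3, 1], 20: [4, 2, 6, 0], 30: [6, 3, 9, 5]},
    index=pd.Index([1, 2, 3, 4], name="cycles"),
)
df.columns.name = "cost"

# Drop the floating 'cost' label, then turn 'cycles' into a regular column.
out = df.rename_axis(None, axis=1).reset_index()
print(out)  # columns are now: cycles, 10, 20, 30 (no axis label)
```

After this, `out["cycles"]` is an ordinary column, which makes it easy to use as a regressor.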
I have a data frame like this
df:
Index  C-1  C-2  C-3  C-4 ........
Ind-1    3    9    5    4
Ind-2    5    2    8    3
Ind-3    0    1    1    0
.
.
The data frame has more than a hundred columns and rows, with whole numbers (0-60) as values.
The first two rows (indexes) have values in the range 2-12.
I want to sort the columns based on the values in the first and second rows (indexes) in ascending order; the order within the remaining rows does not matter.
Can anyone help me with this?
pandas.DataFrame.sort_values
In the first argument you pass the rows you need sorting on, and then the axis to sort through columns.
Note that order matters: if you want to sort first by the second row and then by the first, you should pass ['Ind-2','Ind-1'], which will give a different result.
df.sort_values(by=['Ind-1','Ind-2'],axis=1)
Output
       C-1  C-4  C-3  C-2
Index
Ind-1    3    4    5    9
Ind-2    5    3    8    2
Ind-3    0    0    1    1
When we make a new column in a dataset in pandas
df["Max"] = df.iloc[:, 5:7].sum(axis=1)
If we are only getting the columns from index 5 to index 7, why do we need to pass the : as well?
pandas.DataFrame.iloc is used purely for integer-location based indexing, i.e. selection by position (see the documentation). The : means all rows in the selected columns, here column indices 5 and 6 (an iloc slice is not inclusive of its last index, so 5:7 selects columns 5 and 6 only).
You are using .iloc to take a slice out of the dataframe and apply an aggregate function across the columns of that slice.
Consider an example:
df = pd.DataFrame({"a":[0,1,2],"b":[2,3,4],"c":[4,5,6]})
df
would produce the following dataframe
   a  b  c
0  0  2  4
1  1  3  5
2  2  4  6
You are using iloc to avoid dealing with named columns, so that
df.iloc[:,1:3]
would look as follows
   b  c
0  2  4
1  3  5
2  4  6
Now a slight modification of your code gets you the row-wise sums across those columns:
df.iloc[:,1:3].sum(axis=1)
0 6
1 8
2 10
Alternatively you could use function application:
df.apply(lambda x: x.iloc[1:3].sum(), axis=1)
0 6
1 8
2 10
This explicitly tells pandas to apply sum across columns. However, your syntax is more succinct and preferable to explicit function application; the result is the same either way.
I have a pandas DataFrame that looks like the following
            A_value  A_avg  B_value  B_avg
date
2020-01-01        1      2        3      4
2020-02-01        5      6        7      8
and my goal is to create a multiindex Dataframe that looks like that:
                A           B
            value  avg  value  avg
date
2020-01-01      1    2      3    4
2020-02-01      5    6      7    8
So the part of the column name before the '_' should become the first level of the column index and the part after it the second level. The first part is unstructured; the second part is always one of the same four endings.
I tried to solve it with pd.wide_to_long() but I think that is the wrong path, as I don't want to change the df itself. The real df is much larger, so creating it manually is not an option. I'm stuck here and did not find a solution.
You can split the column names on the delimiter and expand to create a MultiIndex:
df.columns = df.columns.str.split("_", expand=True)
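A minimal, self-contained sketch of this approach, using a small frame shaped like the example above:

```python
import pandas as pd

# A small frame shaped like the example above.
df = pd.DataFrame(
    {"A_value": [1, 5], "A_avg": [2, 6], "B_value": [3, 7], "B_avg": [4, 8]},
    index=pd.Index(["2020-01-01", "2020-02-01"], name="date"),
)

# Split each column name on '_' into two levels -> MultiIndex columns.
df.columns = df.columns.str.split("_", expand=True)

# Selecting the first level now returns all of A's sub-columns at once.
print(df["A"])
```

This works however many first-level prefixes there are, as long as every column name contains exactly one '_'.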
What I wanna do:
Column 'angle' has tracked about 20 angles per second (this can vary), but my 'Time' timestamp only has an accuracy of 1 s, so roughly ~20 rows share the same timestamp (the dataframe has over 1 million rows in total).
My result should be a new dataframe with one row per timestamp. The angle for each timestamp should be the median of the ~20 angle values in that interval.
My Idea:
I iterate through the rows and check if the timestamp has changed.
If so, I select all timestamps until it changes, calculate the median, and append it to a new dataframe.
However, I have many big data files, and I am wondering if there is a faster way to achieve this.
Right now my code is the following (see below).
It is not fast and I think there must be a better way to do that with pandas/numpy (or something else?).
a = 0
for i in range(1, len(df1.index)):
    if df1.iloc[[a], [1]].iloc[0][0] == df1.iloc[[i], [1]].iloc[0][0]:
        continue
    else:
        if a == 0:
            df_result = df1[a:i-1].median()
        else:
            df_result = df_result.append(df1[a:i-1].median(), ignore_index=True)
        a = i
You can use groupby here. Below, I made a simple dummy dataframe.
import pandas as pd
df1 = pd.DataFrame({'time':  [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                    'angle': [8, 9, 7, 1, 4, 5, 11, 4, 3, 8, 7, 6]})
df1
    time  angle
0      1      8
1      1      9
2      1      7
3      1      1
4      1      4
5      1      5
6      2     11
7      2      4
8      2      3
9      2      8
10     2      7
11     2      6
Then, we group by the timestamp and take the median of the angle column within that group, and convert the result to a pandas dataframe.
df2 = pd.DataFrame(df1.groupby('time')['angle'].median())
df2 = df2.reset_index()
df2
   time  angle
0     1    6.0
1     2    6.5
You can use .agg after grouping to choose the aggregation operation per column:
df1.groupby('Time', as_index=False).agg({"angle":"median"})
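On the dummy frame from the first answer (where the column is lower-case 'time'), that looks like:

```python
import pandas as pd

# Same dummy data as above; note the column is lower-case 'time' here.
df1 = pd.DataFrame({'time':  [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                    'angle': [8, 9, 7, 1, 4, 5, 11, 4, 3, 8, 7, 6]})

# as_index=False keeps 'time' as a regular column instead of the index.
out = df1.groupby('time', as_index=False).agg({'angle': 'median'})
print(out)
#    time  angle
# 0     1    6.0
# 1     2    6.5
```

The dict passed to .agg makes it easy to add more columns later, each with its own aggregation.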
Alright so I hope we don't need an example for this, but let's say we have a DataFrame with 100k rows and 50+ instances of the Index being the exact same DateTime.
What would be the fastest way to sort my DataFrame by Time, and then, if there is a tie, sort by a second column?
So:
Sort By Time
If Duplicate Time, sort by 'Cost'
If you pass a list of the columns in the order you want them sorted by, it will sort by the first column and then by the second:
df = DataFrame({'a':[1,2,1,1,3], 'b':[1,2,2,2,1]})
df
Out[11]:
   a  b
0  1  1
1  2  2
2  1  2
3  1  2
4  3  1
In [13]:
df.sort_values(by=['a','b'], inplace=True)
df
Out[13]:
   a  b
0  1  1
2  1  2
3  1  2
1  2  2
4  3  1
So for your example
df.sort_values(by=['Time', 'Cost'], inplace=True)
would work
EDIT
It has been pointed out (by @AndyHayden) that there is a bug if you have NaN values in the supplementary columns, see this SO question and the related GitHub issue. This may not be an issue in your case, but it is something to be aware of.
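Note that the old DataFrame.sort method has since been removed from pandas; sort_values is the current API. A minimal sketch with hypothetical Time and Cost data:

```python
import pandas as pd

# Hypothetical data: two rows share the same timestamp.
df = pd.DataFrame({
    'Time': pd.to_datetime(['2021-01-01 00:00:01',
                            '2021-01-01 00:00:01',
                            '2021-01-01 00:00:00']),
    'Cost': [5, 2, 9],
})

# Sort by Time first; rows with equal Time are ordered by Cost.
out = df.sort_values(by=['Time', 'Cost']).reset_index(drop=True)
print(out)  # 00:00:00 row first, then the two 00:00:01 rows with Cost 2, 5
```

Listing both keys in one call avoids any need for a separate tie-breaking pass.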