Organizing results of experiments with pandas - python

I have the following input data. Each line is the result of one experiment:
instance algo profit time
x A 10 0.5
y A 20 0.1
z A 13 0.7
x B 39 0.9
y B 12 1.2
z B 14 0.6
And I would like to generate the following table:
A B
instance profit time profit time
x 10 0.5 39 0.9
y 20 0.1 12 1.2
z 13 0.7 14 0.6
I have tried using pivot and pivot_table with no success. Is there any way to achieve this result with pandas?

First melt to get 'profit' and 'time' in the same column, and then use a pivot table with multiple column levels:
(df.melt(id_vars=['instance', 'algo'])
.pivot_table(index='instance', columns=['algo', 'variable'], values='value'))
#algo A B
#variable profit time profit time
#instance
#x 10.0 0.5 39.0 0.9
#y 20.0 0.1 12.0 1.2
#z 13.0 0.7 14.0 0.6
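For reference, here is a self-contained sketch of the melt plus pivot_table approach. The DataFrame is rebuilt from the data in the question; the names `long` and `wide` are only illustrative.

```python
import pandas as pd

# Data copied from the question: one row per (instance, algo) experiment.
df = pd.DataFrame({
    "instance": ["x", "y", "z", "x", "y", "z"],
    "algo": ["A", "A", "A", "B", "B", "B"],
    "profit": [10, 20, 13, 39, 12, 14],
    "time": [0.5, 0.1, 0.7, 0.9, 1.2, 0.6],
})

# Step 1: stack 'profit' and 'time' into a single 'value' column,
# with the original column name kept in 'variable'.
long = df.melt(id_vars=["instance", "algo"])

# Step 2: spread (algo, variable) pairs out into two column levels.
wide = long.pivot_table(index="instance",
                        columns=["algo", "variable"],
                        values="value")
print(wide)
```

Note that pivot_table aggregates with the mean by default, so if an (instance, algo) pair appeared more than once, its rows would be averaged rather than raising an error.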

set_index and unstack:
df.set_index(['instance', 'algo']).unstack()
profit time
algo A B A B
instance
x 10 39 0.5 0.9
y 20 12 0.1 1.2
z 13 14 0.7 0.6
Swapping the column levels and sorting then puts algo on the outer level:
(df.set_index(['instance', 'algo'])
.unstack()
.swaplevel(1, 0, axis=1)
.sort_index(axis=1))
algo A B
profit time profit time
instance
x 10 0.5 39 0.9
y 20 0.1 12 1.2
z 13 0.7 14 0.6
Another option is using pivot and swaplevel:
(df.pivot(index='instance', columns='algo', values=['profit', 'time'])
.swaplevel(1, 0, axis=1)
.sort_index(axis=1))
algo A B
profit time profit time
instance
x 10.0 0.5 39.0 0.9
y 20.0 0.1 12.0 1.2
z 13.0 0.7 14.0 0.6
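As a quick check, the sketch below rebuilds the question's data and confirms that the set_index/unstack route and the pivot route give the same table. The variable names `via_unstack` and `via_pivot` are made up for this example, and pivot is called with keyword arguments, which recent pandas versions require.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "instance": ["x", "y", "z", "x", "y", "z"],
    "algo": ["A", "A", "A", "B", "B", "B"],
    "profit": [10, 20, 13, 39, 12, 14],
    "time": [0.5, 0.1, 0.7, 0.9, 1.2, 0.6],
})

# Route 1: move both keys into the index, then unstack 'algo' into columns.
via_unstack = (df.set_index(["instance", "algo"])
                 .unstack()
                 .swaplevel(1, 0, axis=1)   # put 'algo' on the outer level
                 .sort_index(axis=1))       # group the columns by algo

# Route 2: pivot directly, then reorder the column levels the same way.
via_pivot = (df.pivot(index="instance", columns="algo",
                      values=["profit", "time"])
               .swaplevel(1, 0, axis=1)
               .sort_index(axis=1))

print(via_unstack)
print(via_pivot)
```

Unlike pivot_table, both routes raise an error if an (instance, algo) pair is duplicated, which can be a useful sanity check on experiment logs.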

Related

Using a two column dataframe to have a time counter, but reset on a certain condition

I currently have two columns in a dataframe, one called Total Time and one called Cycle. Total Time is the time between each instance in the dataframe occurring, and Cycle indicates which cycle that time belongs to. I want to create a time column, Cycle Time, that shows the accumulation of total time during each cycle. I have code that almost works, with one exception: it adds on the time between cycles, which I don't want (when the cycle changes, I want the counter to reset completely). Here is my current code, to better show what I'm aiming to achieve:
import pandas as pd
df = pd.DataFrame({"Cycle": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5],
"Total Time": [0,0.2,0.2,0.4,0.4,0.7,0.7,1.0,1.0,1.2,1.3,1.3,1.5,1.6,1.6,1.8,1.8]})
df['Cycle Time'] = df['Total Time'].diff().fillna(0).groupby(df['Cycle']).cumsum()
print(df['Cycle Time'])
0 0.0
1 0.2
2 0.2
3 0.2
4 0.2
5 0.5
6 0.5
7 0.3
8 0.3
9 0.5
10 0.1
11 0.1
12 0.3
13 0.1
14 0.1
15 0.3
16 0.3
As there is time between each new cycle, the cycle resets so there is no time difference between the first two instances of the new cycle (except in the first cycle). This also occurs at certain stages in the total time, where the time remains the same. Ideally, my output would look like this:
0 0.0
1 0.2
2 0.2
3 0.0
4 0.0
5 0.3
6 0.3
7 0.0
8 0.0
9 0.2
10 0.0
11 0.0
12 0.2
13 0.0
14 0.0
15 0.2
16 0.2
Basically, I'd like to create a counter that adds up all the time of each cycle, but resets to zero at the first instance of the new cycle in the dataframe.
What you describe is:
df['Cycle Time'] = (df.groupby('Cycle', group_keys=False)['Total Time']
.apply(lambda s: s.diff().fillna(0).cumsum())
)
But this is inefficient: it takes the diff only to undo it again with the cumsum.
What you want is equivalent to just subtracting the initial value per group:
df['Cycle Time'] = df['Total Time'].sub(
df.groupby('Cycle')['Total Time'].transform('first')
)
output:
    Cycle  Total Time  Cycle Time
0       1         0.0         0.0
1       1         0.2         0.2
2       1         0.2         0.2
3       2         0.4         0.0
4       2         0.4         0.0
5       2         0.7         0.3
6       2         0.7         0.3
7       3         1.0         0.0
8       3         1.0         0.0
9       3         1.2         0.2
10      4         1.3         0.0
11      4         1.3         0.0
12      4         1.5         0.2
13      5         1.6         0.0
14      5         1.6         0.0
15      5         1.8         0.2
16      5         1.8         0.2
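A runnable sketch of the "subtract the first value per group" idea, using the data from the question. The `start` helper and the rounding step are additions for illustration; the rounding only hides floating-point noise such as 0.19999999999999996.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "Cycle": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5],
    "Total Time": [0, 0.2, 0.2, 0.4, 0.4, 0.7, 0.7, 1.0, 1.0,
                   1.2, 1.3, 1.3, 1.5, 1.6, 1.6, 1.8, 1.8],
})

# For every row, look up the Total Time of the first row in its cycle.
start = df.groupby("Cycle")["Total Time"].transform("first")

# Time elapsed since the cycle began; resets to 0 at each new cycle.
df["Cycle Time"] = (df["Total Time"] - start).round(10)
print(df)
```

This relies on the rows already being in time order within each cycle, as they are in the question.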

Number Of Rows Since Positive/Negative in Pandas

I have a DataFrame similar to this:
MACD
0 -2.3
1 -0.3
2 0.8
3 0.1
4 0.6
5 -0.7
6 1.1
7 2.4
How can I add an extra column showing the number of rows since MACD was on the opposite side of the origin (positive/negative)?
Desired Outcome:
MACD RowsSince
0 -2.3 NaN
1 -0.3 NaN
2 0.8 1
3 0.1 2
4 0.6 3
5 -0.7 1
6 1.1 1
7 2.4 2
We can use np.sign with diff to label each run of same-sign values, then groupby + cumcount to count the rows within each run:
import numpy as np
s = np.sign(df['MACD']).diff().ne(0).cumsum()
df['new'] = (df.groupby(s).cumcount()+1).mask(s.eq(1))
df
Out[80]:
MACD new
0 -2.3 NaN
1 -0.3 NaN
2 0.8 1.0
3 0.1 2.0
4 0.6 3.0
5 -0.7 1.0
6 1.1 1.0
7 2.4 2.0
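To make the steps easier to follow, here is the same idea spelled out with intermediate variables on the data from the question. The names `sign` and `run_id` are illustrative only.

```python
import numpy as np
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({"MACD": [-2.3, -0.3, 0.8, 0.1, 0.6, -0.7, 1.1, 2.4]})

# -1 for negative values, +1 for positive values.
sign = np.sign(df["MACD"])

# A new run starts wherever the sign differs from the previous row;
# the cumulative sum turns those start flags into run labels 1, 2, 3, ...
run_id = sign.diff().ne(0).cumsum()

# Count rows within each run (1-based), and blank out the very first run,
# because nothing is known about the sign before the data starts.
df["RowsSince"] = (df.groupby(run_id).cumcount() + 1).mask(run_id.eq(1))
print(df)
```

For this data run_id is 1, 1, 2, 2, 2, 3, 4, 4, so the first two rows are masked to NaN and each later run counts up from 1.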

Create a new column based on conditions from other columns in Python

I want to do something in Python very similar to this question from R users. My intention is to create a new column whose values are based on conditions on other columns.
For example:
d = {'year': [2010, 2011,2013, 2014], 'PD': [0.5, 0.8, 0.9, np.nan], 'PD_thresh': [0.7, 0.8, 0.9, 0.7]}
df_temp = pd.DataFrame(data=d)
Now I want to create a condition that says:
pseudo-code:
if for year X the value of PD is greater or equal to the value of PD_thresh
then set 0 in a new column y_pseudo
otherwise set 1
My expected outcome is this:
df_temp
Out[57]:
year PD PD_thresh y_pseudo
0 2010 0.5 0.7 0.0
1 2011 0.6 0.7 0.0
2 2013 0.9 0.8 1.0
3 2014 NaN 0.7 NaN
Use numpy.select with isna and ge:
m1 = df_temp['PD'].isna()
m2 = df_temp['PD'].ge(df_temp['PD_thresh'])
df_temp['y_pseudo'] = np.select([m1, m2], [np.nan, 1], default=0)
print (df_temp)
year PD PD_thresh y_pseudo
0 2010 0.5 0.7 0.0
1 2011 0.6 0.8 0.0
2 2013 0.9 0.9 1.0
3 2014 NaN 0.7 NaN
Another solution is to convert the mask to integers (True/False map to 1/0) and assign only the non-missing rows, selected with notna:
m2 = df_temp['PD'].ge(df_temp['PD_thresh'])
m3 = df_temp['PD'].notna()
df_temp.loc[m3, 'y_pseudo'] = m2[m3].astype(int)
print (df_temp)
year PD PD_thresh y_pseudo
0 2010 0.5 0.7 0.0
1 2011 0.6 0.8 0.0
2 2013 0.9 0.9 1.0
3 2014 NaN 0.7 NaN
Your data d is different from your outcome, and I think you meant 1 if greater than the threshold, not the other way around, so I have this:
y = [a if np.isnan(a) else 1 if a>=b else 0 for a,b in zip(df_temp.PD,df_temp.PD_thresh)]
df_temp['y_pseudo'] = y
Output:
year PD PD_thresh y_pseudo
0 2010 0.5 0.7 0.0
1 2011 0.8 0.8 1.0
2 2013 0.9 0.9 1.0
3 2014 NaN 0.7 NaN
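For completeness, a self-contained version of the numpy.select approach, run on the dictionary d exactly as given in the question. It assumes the reading adopted in the answers (1 when PD is at or above the threshold, 0 otherwise, NaN when PD is missing); the names `missing` and `meets` are illustrative.

```python
import numpy as np
import pandas as pd

# Data copied from the question's dictionary d.
df_temp = pd.DataFrame({
    "year": [2010, 2011, 2013, 2014],
    "PD": [0.5, 0.8, 0.9, np.nan],
    "PD_thresh": [0.7, 0.8, 0.9, 0.7],
})

missing = df_temp["PD"].isna()                  # no PD value available
meets = df_temp["PD"].ge(df_temp["PD_thresh"])  # PD >= threshold (False for NaN)

# Conditions are checked in order, so 'missing' wins over 'meets';
# rows matching neither condition get the default.
df_temp["y_pseudo"] = np.select([missing, meets], [np.nan, 1.0], default=0.0)
print(df_temp)
```

The order of the conditions matters here: the missing-value check has to come first, because a comparison against NaN is simply False and would otherwise fall through to the default of 0.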

How to incrementally add linear regression column to pandas dataframe

I got an example pandas dataframe like this:
a b
0 6.0 0.6
1 1.0 0.3
2 3.0 0.8
3 5.0 0.1
4 7.0 0.4
5 2.0 0.2
6 0.0 0.9
7 4.0 0.7
8 8.0 0.0
9 9.0 0.5
I want to add a new column, linear, to the dataframe, holding the linear regression output of fitting b on a. So far I have:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
X = df['a'].to_numpy().reshape(-1, 1)  # as_matrix() was removed in pandas 1.0
reg.fit(X, df['b'].to_numpy().reshape(-1, 1))
reg.predict(X) # This will give the linear regression outcome for the whole column
Now I want to do the linear regression incrementally on series a, so the first entry of linear will be b[0], the second will be b[0]/a[0]*a[1], the third will be the linear regression outcome of the first two entries, and so on. I have no clue how to do that with pandas except by iterating through all the entries. Is there a better way?
You can use expanding with some custom apply functions. Interesting way to do LR...
from io import StringIO
import pandas as pd
import numpy as np
df = pd.read_table(StringIO(""" a b
0 6.0 0.6
1 1.0 0.3
2 3.0 0.8
3 5.0 0.1
4 7.0 0.4
5 2.0 0.2
6 0.0 0.9
7 4.0 0.7
8 8.0 0.0
9 9.0 0.5
10 10.0 0.4
11 11.0 0.35
12 12.0 0.3
13 13.0 0.28
14 14.0 0.27
15 15.0 0.22"""), sep=r'\s+')
df = df.sort_values(by='a')
ax = df.plot(x='a',y='b',kind='scatter')
m, b = np.polyfit(df['a'],df['b'],1)
lin_reg = lambda x, m, b : m*x + b
df['lin'] = lin_reg(df['a'], m, b)
def make_m(x):
    # Slope of the fit on the first len(x) rows
    y = df['b'].iloc[0:len(x)]
    return np.polyfit(x, y, 1)[0]
def make_b(x):
    # Intercept of the fit on the first len(x) rows
    y = df['b'].iloc[0:len(x)]
    return np.polyfit(x, y, 1)[1]
df['new'] = df['a'].expanding().apply(make_m, raw=True)*df['a'] + df['a'].expanding().apply(make_b, raw=True)
# df = df.sort_values(by='a')
ax.plot(df.a,df.lin)
ax.plot(df.a,df.new)
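An alternative sketch that avoids calling polyfit once per row: the ordinary least-squares slope and intercept can be written in terms of running sums, so cumsum gives every expanding fit in one pass. This swaps in the closed-form OLS formulas rather than the expanding().apply approach above, and the column name `linear` and the helper names are assumptions. Row k is fitted on rows 0..k inclusive, and the first row is left as NaN because a line cannot be fitted to a single point.

```python
import numpy as np
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "a": [6.0, 1.0, 3.0, 5.0, 7.0, 2.0, 0.0, 4.0, 8.0, 9.0],
    "b": [0.6, 0.3, 0.8, 0.1, 0.4, 0.2, 0.9, 0.7, 0.0, 0.5],
})
x, y = df["a"], df["b"]

# Running counts and sums over rows 0..k.
n = pd.Series(np.arange(1, len(df) + 1), index=df.index, dtype=float)
sx, sy = x.cumsum(), y.cumsum()
sxx, sxy = (x * x).cumsum(), (x * y).cumsum()

# Closed-form OLS: slope = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2).
denom = n * sxx - sx ** 2
slope = ((n * sxy - sx * sy) / denom).where(denom != 0)  # NaN if undetermined
intercept = (sy - slope * sx) / n

# Prediction at row k from the fit on rows 0..k.
df["linear"] = slope * x + intercept
print(df)
```

The running-sum form can lose precision when the values are large relative to their spread, so for such data a per-window fit may be the safer choice.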

How to use boolean indexing with Pandas

I have a dataframe:
df =
time time b
0 0.0 1.1 21
1 0.1 2.2 22
2 0.2 3.3 23
3 0.3 4.4 24
4 0.4 5.5 24
I also have a series for my units, defined as
su =
time sal
time zulu
b m/s
Now, I want to set df.index equal to the "time (sal)" values. Those values can be in any column and I will need to check.
I can do this as:
df.index = df.values[:,(df.columns == 'time') & (su.values == 'sal')]
But, my index looks like:
array([[0.0],
[0.1],
[0.2],
[0.3],
[0.4]])
However, this is an array of arrays. In bigger datasets, plot seems to take longer. If I hardcode the value, I get just an array:
df.index = df[0,0]
array([0.0, 0.1, 0.2, 0.3, 0.4])
I can also do the following:
inx = ((df.columns == 'time') & (su.values == 'sal')).tolist().index(True)
This sets "inx" to 0 and then gets a single array
df.index=df.values[0,inx]
However, I shouldn't have to do this. Am I using pandas and boolean indexing incorrectly?
I want:
df =
time time b
0.0 0.0 1.1 21
0.1 0.1 2.2 22
0.2 0.2 3.3 23
0.3 0.3 4.4 24
0.4 0.4 5.5 24
As I understand it, this is what you expected. However, I renamed the two time columns to time1 and time2, because a dictionary cannot hold two keys with the same name.
import pandas as pd
df = {'time1': [0.0,0.1,0.2,0.3,0.4], 'time2': [1.1,2.2,3.3,4.4,5.5],'b':[21,22,23,24,24]}
su = {'time1':'sal', 'time2':'zulu', 'b':'m/s'}
# Find the column whose unit is 'sal' (dict views need list() on Python 3)
indexes = df[list(su.keys())[list(su.values()).index('sal')]]
df = pd.DataFrame(df, index=indexes, columns=['time1', 'time2', 'b'])
print(df)
Your original DataFrame has duplicate column names, which complicates things.
Try renaming the columns.
Sample Code
unit = pd.Series(['sal', 'zulu', 'm/s'], index=['time', 'time', 'b'])
>>> df
time time b
0 0.0 1.1 21.0
1 0.1 2.2 22.0
2 0.2 3.3 23.0
3 0.3 4.4 24.0
4 0.4 5.5 25.0
new_col = ['{}({})'.format(col, u) for col, u in zip(df.columns, unit)]
>>> new_col
['time(sal)', 'time(zulu)', 'b(m/s)']
>>> df.columns = new_col
>>> df
time(sal) time(zulu) b(m/s)
0 0.0 1.1 21.0
1 0.1 2.2 22.0
2 0.2 3.3 23.0
3 0.3 4.4 24.0
4 0.4 5.5 25.0
>>> df.index = df['time(sal)'].values
>>> df
time(sal) time(zulu) b(m/s)
0.0 0.0 1.1 21.0
0.1 0.1 2.2 22.0
0.2 0.2 3.3 23.0
0.3 0.3 4.4 24.0
0.4 0.4 5.5 25.0
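If the duplicate column names have to stay, one possible sketch is to keep the boolean mask from the question but use it to select a column and then flatten that column to one dimension before assigning it as the index. The DataFrame and the units series are rebuilt from the question; the names `mask` and `selected` are illustrative.

```python
import pandas as pd

# Data copied from the question, including the duplicate 'time' column name.
df = pd.DataFrame(
    [[0.0, 1.1, 21], [0.1, 2.2, 22], [0.2, 3.3, 23],
     [0.3, 4.4, 24], [0.4, 5.5, 24]],
    columns=["time", "time", "b"],
)
su = pd.Series(["sal", "zulu", "m/s"], index=["time", "time", "b"])

# One boolean per column: True only where the name is 'time' and the unit is 'sal'.
mask = (df.columns == "time") & (su.to_numpy() == "sal")

# A boolean mask always selects a 2-D block, even when only one column matches.
selected = df.loc[:, mask]
if selected.shape[1] != 1:
    raise ValueError("expected exactly one 'time (sal)' column")

# Take that single column as a 1-D array and use it as the index.
df.index = selected.iloc[:, 0].to_numpy()
print(df)
```

The array of arrays in the question comes from the mask selecting a 2-D block; taking the first column with iloc is what reduces it to the flat index.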
