I have the following table:
Site Peril ReturnPeriod Min Max Mean
0 one river 20 0.0 0.1 0.05
1 one river 100 0.0 0.1 0.05
2 one coast 20 2.0 5.3 4.00
3 one coast 100 2.0 5.3 4.00
4 two river 20 0.1 0.5 0.90
5 two coast 20 0.3 0.5 0.80
I'm trying to reshape it to get to this:
Peril: river coast
Site ReturnPeriod Min Max Mean Min Max Mean
0 one 20 0.0 0.1 0.05 2.0 5.3 4.00
1 one 100 0.0 0.1 0.05 2.0 5.3 4.00
2 two 20 0.1 0.5 0.90 0.3 0.5 0.80
I think melt can take me halfway there but I'm having trouble getting the final output. Any ideas?
I think that this may actually be possible with just a call to pivot_table (using the current keyword names index= and columns=; very old pandas spelled these rows= and cols=):
df.pivot_table(values=['Min', 'Mean', 'Max'], index=['Site', 'ReturnPeriod'], columns='Peril')
I need to check it more thoroughly though.
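For reference, here is a self-contained sketch of that check, with the frame retyped from the table above (treat the literals as illustrative):
import pandas as pd
df = pd.DataFrame({
    'Site': ['one', 'one', 'one', 'one', 'two', 'two'],
    'Peril': ['river', 'river', 'coast', 'coast', 'river', 'coast'],
    'ReturnPeriod': [20, 100, 20, 100, 20, 20],
    'Min': [0.0, 0.0, 2.0, 2.0, 0.1, 0.3],
    'Max': [0.1, 0.1, 5.3, 5.3, 0.5, 0.5],
    'Mean': [0.05, 0.05, 4.00, 4.00, 0.90, 0.80],
})
# one row per (Site, ReturnPeriod), one column block per statistic and Peril
out = df.pivot_table(values=['Min', 'Max', 'Mean'],
                     index=['Site', 'ReturnPeriod'],
                     columns='Peril')
# to put Peril on the top column level like the desired output:
out = out.swaplevel(0, 1, axis=1).sort_index(axis=1)
print(out)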
I currently have two columns in a dataframe, one called Total Time and one called Cycle. Total Time is the time between each instance in the dataframe occurring, and Cycle indicates which cycle the time belongs to. I want to create a time column, Cycle Time, that shows the accumulation of Total Time during each cycle. I have code that almost works, but with one exception: it adds the time on between each cycle, which I don't want (when the cycle changes, I want the counter to reset completely). Here is my current code, to better understand what I'm aiming to achieve:
import pandas as pd
df = pd.DataFrame({"Cycle": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5],
"Total Time": [0,0.2,0.2,0.4,0.4,0.7,0.7,1.0,1.0,1.2,1.3,1.3,1.5,1.6,1.6,1.8,1.8]})
df['Cycle Time'] = df['Total Time'].diff().fillna(0).groupby(df['Cycle']).cumsum()
print(df['Cycle Time'])
0 0.0
1 0.2
2 0.2
3 0.2
4 0.2
5 0.5
6 0.5
7 0.3
8 0.3
9 0.5
10 0.1
11 0.1
12 0.3
13 0.1
14 0.1
15 0.3
16 0.3
As there is a time gap between cycles, the counter should reset at each new cycle, so there is no time difference between the first two instances of the new cycle (except in the first cycle). The same happens at certain points within a cycle, where the total time remains unchanged. Ideally, my output would look like this:
0 0.0
1 0.2
2 0.2
3 0.0
4 0.0
5 0.3
6 0.3
7 0.0
8 0.0
9 0.2
10 0.0
11 0.0
12 0.2
13 0.0
14 0.0
15 0.2
16 0.2
Basically, I'd like to create a counter that adds up all the time of each cycle, but resets to zero at the first instance of the new cycle in the dataframe.
What you describe is:
df['Cycle Time'] = (df.groupby('Cycle')['Total Time']
                      .transform(lambda s: s.diff().fillna(0).cumsum()))
But this is not so efficient: you compute the diff only to take its cumsum right after, and the two operations telescope.
What you want is equivalent to just subtracting the initial value of each group:
df['Cycle Time'] = df['Total Time'].sub(
df.groupby('Cycle')['Total Time'].transform('first')
)
output:
Cycle Total Time Cycle Time
0 1 0.0 0.0
1 1 0.2 0.2
2 1 0.2 0.2
3 2 0.4 0.0
4 2 0.4 0.0
5 2 0.7 0.3
6 2 0.7 0.3
7 3 1.0 0.0
8 3 1.0 0.0
9 3 1.2 0.2
10 4 1.3 0.0
11 4 1.3 0.0
12 4 1.5 0.2
13 5 1.6 0.0
14 5 1.6 0.0
15 5 1.8 0.2
16 5 1.8 0.2
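As a quick sanity check, the two variants can be compared directly on the toy frame from the question:
import numpy as np
# cumulative-sum-of-diffs vs. subtract-first-value: same result up to float noise
diffsum = (df.groupby('Cycle')['Total Time']
             .transform(lambda s: s.diff().fillna(0).cumsum()))
firstsub = df['Total Time'] - df.groupby('Cycle')['Total Time'].transform('first')
assert np.allclose(diffsum, firstsub)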
I have the following input data. Each line is the result of one experiment:
instance algo profit time
x A 10 0.5
y A 20 0.1
z A 13 0.7
x B 39 0.9
y B 12 1.2
z B 14 0.6
And I would like to generate the following table:
A B
instance profit time profit time
x 10 0.5 39 0.9
y 20 0.1 12 1.2
z 13 0.7 14 0.6
I have tried using pivot and pivot_table with no success. Is there any way to achieve this result with pandas?
First melt to get 'profit' and 'time' into the same column, and then use a pivot table with multiple column levels:
(df.melt(id_vars=['instance', 'algo'])
.pivot_table(index='instance', columns=['algo', 'variable'], values='value'))
#algo A B
#variable profit time profit time
#instance
#x 10.0 0.5 39.0 0.9
#y 20.0 0.1 12.0 1.2
#z 13.0 0.7 14.0 0.6
set_index and unstack:
df.set_index(['instance', 'algo']).unstack()
profit time
algo A B A B
instance
x 10 39 0.5 0.9
y 20 12 0.1 1.2
z 13 14 0.7 0.6
(df.set_index(['instance', 'algo'])
.unstack()
.swaplevel(1, 0, axis=1)
.sort_index(axis=1))
algo A B
profit time profit time
instance
x 10 0.5 39 0.9
y 20 0.1 12 1.2
z 13 0.7 14 0.6
Another option is using pivot and swaplevel:
(df.pivot(index='instance', columns='algo', values=['profit', 'time'])
   .swaplevel(1, 0, axis=1)
   .sort_index(axis=1))
algo A B
profit time profit time
instance
x 10.0 0.5 39.0 0.9
y 20.0 0.1 12.0 1.2
z 13.0 0.7 14.0 0.6
How would I bin some data based on its index, in Python 3?
Let's say I have the following data
1 0.5
3 0.6
5 0.7
6 0.8
8 0.9
10 1
11 1.1
12 1.2
14 1.3
15 1.4
17 1.5
18 1.6
19 1.7
20 1.8
22 1.9
24 2
25 2.1
28 2.2
31 2.3
35 2.4
How would I take this data and bin both columns such that each bin has n values in it, then average the numbers in each bin and output them?
For example, if I wanted to bin the values in groups of four,
I would take the first four data points:
1 0.5
3 0.6
5 0.7
6 0.8
and the averages of these would be: 3.75 0.65
I would continue down the columns, taking the next set of four, and so on,
until I had averaged all of the sets of four to get this:
3.75 0.65
10.25 1.05
16 1.45
21.25 1.85
29.75 2.25
How can I do this using Python?
Based on numpy reshape (this assumes len(df) is a multiple of 4):
pd.DataFrame([np.mean(x.reshape(len(df) // 4, -1), axis=1) for x in df.values.T]).T
0 1
0 3.75 0.65
1 10.25 1.05
2 16.00 1.45
3 21.25 1.85
4 29.75 2.25
You can "bin" the index into groups of 4 and call groupby in the index.
df.groupby(df.index // 4).mean()
0 1
0 3.75 0.65
1 10.25 1.05
2 16.00 1.45
3 21.25 1.85
4 29.75 2.25
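One caveat: df.index // 4 leans on the default RangeIndex. If the frame carries any other index, a positional grouper is the safer spelling; a minimal sketch:
import numpy as np
# group rows purely by position, four at a time, ignoring the index labels
df.groupby(np.arange(len(df)) // 4).mean()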
I have the following pandas data frame:
code:
df = pd.DataFrame({'A1': [0.1, 0.5, 3.0, 9.0], 'A2': [2.0, 4.5, 1.2, 9.0],
                   'Random data': [300, 4500, 800, 900],
                   'Random data2': [3000, 450, 80, 90]})
output:
A1 A2 Random data Random data2
0 0.1 2.0 300 3000
1 0.5 4.5 4500 450
2 3.0 1.2 800 80
3 9.0 9.0 900 90
It is only showing A1 and A2, but the data actually goes from A1 to A30. I want to calculate the average and standard deviation for each row, but only over columns A1 to A30 (not including the columns Random data and Random data2), and add 2 new columns with the average and standard deviation as shown below.
A1 A2 Random data Random data2 Average Stddev
0 0.1 2.0 300 3000
1 0.5 4.5 4500 450
2 3.0 1.2 800 80
3 9.0 9.0 900 90
Preferred Approach
Use pd.DataFrame.filter
Your choice of regex pattern can be as explicit as you'd like. In this case, I specified that the column must start with 'A' and be followed by one or more digits.
d = df.filter(regex=r'^A\d+')
df.assign(Average=d.mean(axis=1), Stddev=d.std(axis=1))
A1 A2 Random data Random data2 Average Stddev
0 0.1 2.0 300 3000 1.05 1.343503
1 0.5 4.5 4500 450 2.50 2.828427
2 3.0 1.2 800 80 2.10 1.272792
3 9.0 9.0 900 90 9.00 0.000000
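If the A-columns are known to be contiguous, plain label slicing is an alternative to the regex; a sketch assuming the real frame is ordered A1 through A30 before the random columns:
d = df.loc[:, 'A1':'A2']  # on the real data this would be 'A1':'A30'
df.assign(Average=d.mean(axis=1), Stddev=d.std(axis=1))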
Alt 1
This is trying too hard.
rnm = dict(mean='Average', std='Stddev')
df.join(df.T[df.columns.str.startswith('A')].agg(['mean', 'std']).T.rename(columns=rnm))
A1 A2 Random data Random data2 Average Stddev
0 0.1 2.0 300 3000 1.05 1.343503
1 0.5 4.5 4500 450 2.50 2.828427
2 3.0 1.2 800 80 2.10 1.272792
3 9.0 9.0 900 90 9.00 0.000000
This is my first question on stackoverflow. Go easy on me!
I have two data sets acquired simultaneously by different acquisition systems with different sampling rates. One is very regular, and the other is not. I would like to create a single dataframe containing both data sets, using the regularly spaced timestamps (in seconds) as the reference for both. The irregularly sampled data should be interpolated on the regularly spaced timestamps.
Here's some toy data demonstrating what I'm trying to do:
import pandas as pd
import numpy as np
# evenly spaced times
t1 = np.array([0,0.5,1.0,1.5,2.0])
y1 = t1
# unevenly spaced times
t2 = np.array([0,0.34,1.01,1.4,1.6,1.7,2.01])
y2 = 3*t2
df1 = pd.DataFrame(data={'y1':y1,'t':t1})
df2 = pd.DataFrame(data={'y2':y2,'t':t2})
df1 and df2 look like this:
df1:
t y1
0 0.0 0.0
1 0.5 0.5
2 1.0 1.0
3 1.5 1.5
4 2.0 2.0
df2:
t y2
0 0.00 0.00
1 0.34 1.02
2 1.01 3.03
3 1.40 4.20
4 1.60 4.80
5 1.70 5.10
6 2.01 6.03
I'm trying to merge df1 and df2, interpolating y2 on df1.t. The desired result is:
df_combined:
t y1 y2
0 0.0 0.0 0.0
1 0.5 0.5 1.5
2 1.0 1.0 3.0
3 1.5 1.5 4.5
4 2.0 2.0 6.0
I've been reading documentation for pandas.resample, as well as searching previous stackoverflow questions, but haven't been able to find a solution to my particular problem. Any ideas? Seems like it should be easy.
UPDATE:
I figured out one possible solution: interpolate the second series first, then append to the first data frame:
from scipy.interpolate import interp1d
f2 = interp1d(t2,y2,bounds_error=False)
df1['y2'] = f2(df1.t)
which gives:
df1:
t y1 y2
0 0.0 0.0 0.0
1 0.5 0.5 1.5
2 1.0 1.0 3.0
3 1.5 1.5 4.5
4 2.0 2.0 6.0
That works, but I'm still open to other solutions if there's a better way.
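For completeness, the same linear interpolation can be done with plain numpy, avoiding the scipy dependency; a sketch using the df1/df2 defined above:
import numpy as np
# np.interp(x, xp, fp): evaluate the piecewise-linear curve (xp, fp) at x
df1['y2'] = np.interp(df1['t'], df2['t'], df2['y2'])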
If you construct a single DataFrame from Series, using time values as index, like this:
>>> t1 = np.array([0, 0.5, 1.0, 1.5, 2.0])
>>> y1 = pd.Series(t1, index=t1)
>>> t2 = np.array([0, 0.34, 1.01, 1.4, 1.6, 1.7, 2.01])
>>> y2 = pd.Series(3*t2, index=t2)
>>> df = pd.DataFrame({'y1': y1, 'y2': y2})
>>> df
y1 y2
0.00 0.0 0.00
0.34 NaN 1.02
0.50 0.5 NaN
1.00 1.0 NaN
1.01 NaN 3.03
1.40 NaN 4.20
1.50 1.5 NaN
1.60 NaN 4.80
1.70 NaN 5.10
2.00 2.0 NaN
2.01 NaN 6.03
You can simply interpolate it and select only the rows where y1 is defined:
>>> df.interpolate(method='index').reindex(y1.index)
y1 y2
0.0 0.0 0.0
0.5 0.5 1.5
1.0 1.0 3.0
1.5 1.5 4.5
2.0 2.0 6.0
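If you would rather keep the two original DataFrames, pd.merge_ordered gives the same union-then-interpolate flow; a sketch using df1, df2, and t1 from the question:
df_combined = (pd.merge_ordered(df1, df2, on='t')  # sorted union of both time axes
                 .set_index('t')
                 .interpolate(method='index')      # interpolate linearly in t
                 .loc[t1]                          # keep only the regular timestamps
                 .reset_index())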
It's not exactly clear to me how you're getting rid of some of the values in y2, but it seems that if there is more than one for a given timepoint, you only want the first one. Also, it seems like your time values should be in the index. I also added column labels. It looks like this:
import pandas as pd
# evenly spaced times
t1 = [0,0.5,1.0,1.5,2.0]
y1 = t1
# unevenly spaced times
t2 = [0,0.34,1.01,1.4,1.6,1.7,2.01]
# round t2 values to the nearest half
new_t2 = [round(num * 2)/2 for num in t2]
# set y2 values
y2 = [3*z for z in new_t2]
# eliminate entries that have the same index value
# (walk backwards so the deletions don't shift upcoming positions)
for x in range(len(new_t2) - 1, 0, -1):
    if new_t2[x] == new_t2[x-1]:
        del new_t2[x]
        del y2[x]
ser1 = pd.Series(y1, index=t1)
ser2 = pd.Series(y2, index=new_t2)
df = pd.concat((ser1, ser2), axis=1)
df.columns = ('Y1', 'Y2')
print(df)
This prints:
Y1 Y2
0.0 0.0 0.0
0.5 0.5 1.5
1.0 1.0 3.0
1.5 1.5 4.5
2.0 2.0 6.0
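As an aside, the explicit dedup loop can be replaced with pandas' own duplicate-label handling; a sketch of that variant, building the Series straight from the rounded (still duplicated) lists:
ser2 = pd.Series(y2, index=new_t2)                 # duplicate 1.5 labels still present
ser2 = ser2[~ser2.index.duplicated(keep='first')]  # keep the first entry per timepoint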