Python dataframe interpolation - adding a new row to a dataframe - python

I have a dataframe to which I would like to add a new row where EVM equals a specific value (-30), filling the other columns of that row by linear interpolation.
Index PwrOut EVM PwrGain Vout
0 -0.760031 -58.322902 32.239969 134.331851
1 3.242575 -58.073389 32.242575 134.332376
2 7.246203 -57.138122 32.246203 134.343538
3 11.251078 -54.160870 32.251078 134.383609
4 15.257129 -48.624869 32.257129 134.487430
5 17.260618 -45.971596 32.260618 134.586753
6 18.263079 -44.319692 32.263079 134.656616
7 19.266674 -41.532695 32.266674 134.743599
8 20.271934 -37.546253 32.271934 134.849050
9 21.278990 -33.239208 32.278990 134.972439
10 22.286989 -29.221786 32.286989 135.111068
11 23.293533 -25.652448 32.293533 135.261357
For example, EVM = -30 (the third column) lies between rows 9 and 10 above. How can I insert a new row between rows 9 and 10 that has EVM = -30, and fill the other columns (in this new row only) by linear interpolation based on where -30 falls between the EVM values of rows 9 and 10?
It would be great to be able to search for the rows that EVM = -30 lies between.
Is it possible to apply linear interpolation to some columns but nonlinear interpolation to others?
Thanks!

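Interpolation is by far the easiest part. Here is one approach.
First, find where the missing rows belong and insert them all at once (this assumes EVM is sorted in ascending order, as it is here):
import numpy as np
import pandas as pd
targets = [-50, -40, -30]  # EVM values to add (arbitrary, ascending)
idxs = df.EVM.searchsorted(targets)  # positions in the original frame where each target belongs
# one new row per target: EVM is set, the other columns are NaN for now
new_rows = [[np.nan, target, np.nan, np.nan] for target in targets]
# np.insert reads idxs relative to the original array, so earlier insertions do not shift later ones
arr = np.insert(df.values, idxs, new_rows, axis=0)
df1 = pd.DataFrame(arr, columns=df.columns)
Then you can actually interpolate:
df2 = df1.interpolate('linear')
Note that 'linear' treats the rows as equally spaced, so each new row is filled with the midpoint of its two neighbours rather than a value weighted by EVM.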
Output:
PwrOut EVM PwrGain Vout
0 -0.760031 -58.322902 32.239969 134.331851
1 3.242575 -58.073389 32.242575 134.332376
2 7.246203 -57.138122 32.246203 134.343538
3 11.251078 -54.160870 32.251078 134.383609
4 13.254103 -50.000000 32.254103 134.435519
5 15.257129 -48.624869 32.257129 134.487430
6 17.260618 -45.971596 32.260618 134.586753
7 18.263079 -44.319692 32.263079 134.656616
8 19.266674 -41.532695 32.266674 134.743599
9 19.769304 -40.000000 32.269304 134.796324
10 20.271934 -37.546253 32.271934 134.849050
11 21.278990 -33.239208 32.278990 134.972439
12 21.782989 -30.000000 32.282989 135.041753
13 22.286989 -29.221786 32.286989 135.111068
14 23.293533 -25.652448 32.293533 135.261357
If you want a different interpolation method for particular columns, handle them individually (methods such as 'cubic' require SciPy), e.g.:
df2.PwrOut = df1.PwrOut.interpolate('cubic')
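The question asks for values weighted by where EVM = -30 falls between its neighbours, which purely positional interpolation does not give. A minimal sketch of one way to get that, assuming EVM is strictly increasing: make EVM the index and call interpolate(method='index'). Only rows 9 and 10 of the question's table are used here, and the names target, with_gap and result are illustrative.

```python
import numpy as np
import pandas as pd

# Rows 9 and 10 of the question's table (enough to bracket EVM = -30)
df = pd.DataFrame({
    "PwrOut": [21.278990, 22.286989],
    "EVM": [-33.239208, -29.221786],
    "PwrGain": [32.278990, 32.286989],
    "Vout": [134.972439, 135.111068],
})

target = -30.0

# Append a row that only has EVM filled in (other columns become NaN),
# then sort so the new row sits between its neighbours.
new_row = pd.DataFrame({"EVM": [target]})
with_gap = pd.concat([df, new_row], ignore_index=True).sort_values("EVM")

# method="index" weights by the index values, so EVM must be the index.
result = (
    with_gap.set_index("EVM")
    .interpolate(method="index")
    .reset_index()[df.columns]  # restore the original column order
)
print(result)
```

With this variant the new row lands closer to row 10 than to row 9, because -30 is closer to -29.22 than to -33.24.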

Related

Take average of window in pandas

I have a large pandas dataframe. I want to average the first 12 rows, then the next 12 rows, and so on. I wrote a for loop for this task:
df_list = []
for i in range(0, len(df), 12):
    print(i, i + 12)
    df_list.append(df.iloc[i:i + 12].mean())
pd.concat(df_list, axis=1).T
Is there an efficient way to do this without a for loop?
You can integer-divide the index by N (12 in your case), group the dataframe by the quotient, and finally call mean on these groups. This relies on the default integer index 0, 1, 2, ...:
# Random dataframe of shape 120,4
>>> df=pd.DataFrame(np.random.randint(10,100,(120,4)), columns=list('ABCD'))
>>> df.groupby(df.index//12).mean()
A B C D
0 49.416667 52.583333 63.833333 47.833333
1 60.166667 61.666667 53.750000 34.583333
2 49.916667 54.500000 50.583333 64.750000
3 51.333333 51.333333 56.333333 60.916667
4 51.250000 51.166667 50.750000 50.333333
5 56.333333 50.916667 51.416667 59.750000
6 53.750000 57.000000 45.916667 59.250000
7 48.583333 59.750000 49.250000 50.750000
8 53.750000 48.750000 51.583333 68.000000
9 54.916667 48.916667 57.833333 43.333333
I believe you want to split your dataframe into separate chunks of 12 rows. You can use np.arange inside groupby to take the mean of each chunk; unlike df.index // 12, this works whatever the index is:
df.groupby(np.arange(len(df)) // 12).mean()
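To illustrate the point about the index, here is a small sketch on a dataframe with a date index (an assumption made for the example, as are the column names and the seed). It checks that the np.arange grouping matches the loop from the question, including a shorter final chunk.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(10, 100, size=(30, 2)),
    columns=list("AB"),
    index=pd.date_range("2020-01-01", periods=30),  # not a 0..n-1 index
)

n = 12
# Position-based group labels: 0 for rows 0-11, 1 for rows 12-23, 2 for the rest
chunk_means = df.groupby(np.arange(len(df)) // n).mean()

# The loop from the question, for comparison
loop_means = pd.concat(
    [df.iloc[i:i + n].mean() for i in range(0, len(df), n)], axis=1
).T

print(chunk_means)
```

The last group simply averages the 6 remaining rows, the same as the loop does.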

Calculate row-wise dot products based on previous row and next row in pandas

I have a pandas dataframe like below:
Coordinate
1 (1150.0,1760.0)
28 (1260.0,1910.0)
6 (1030.0,2070.0)
12 (1170.0,2300.0)
9 (790.0,2260.0)
5 (750.0,2030.0)
26 (490.0,2130.0)
29 (360.0,1980.0)
3 (40.0,2090.0)
2 (630.0,1660.0)
20 (590.0,1390.0)
Now, I want to create a new column 'dotProduct' by applying the formula
np.dot((b-a),(b-c)), where a, b and c are the coordinates of the previous, current and next row. For the second row, for example, b is the coordinate (1260.0,1910.0) at index 28, a is (1150.0,1760.0) at index 1 and c is (1030.0,2070.0) at index 6. So for every row I need the previous row's value and the next row's value. I am quite new to pandas and still learning, so please guide me a bit.
Thanks a lot for the help.
I assume that the elements of your 'Coordinate' column are already tuples of float values.
import numpy as np
# Convert elements of 'Coordinate' into numpy arrays
df.Coordinate = df.Coordinate.apply(np.array)
# Differences to the previous row (b - a) and to the next row (b - c)
to_prev = df.Coordinate - df.Coordinate.shift(1)
to_next = df.Coordinate - df.Coordinate.shift(-1)
# Row-wise dot product; the first and last rows lack a neighbour on one side and give NaN
df['dotProduct'] = [np.dot(x, y) for x, y in zip(to_prev, to_next)]
# Make 'Coordinate' tuples again (if you want)
df.Coordinate = df.Coordinate.apply(tuple)
Now I get this as df:
Coordinate dotProduct
1 (1150.0, 1760.0) NaN
28 (1260.0, 1910.0) 1300.0
6 (1030.0, 2070.0) -4600.0
12 (1170.0, 2300.0) 62400.0
9 (790.0, 2260.0) -24400.0
5 (750.0, 2030.0) 12600.0
26 (490.0, 2130.0) -18800.0
29 (360.0, 1980.0) -25100.0
3 (40.0, 2090.0) 236100.0
2 (630.0, 1660.0) -92500.0
20 (590.0, 1390.0) NaN
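For longer frames, the same calculation can be done on a plain NumPy array instead of an object column. This is only a sketch of that alternative, using the first four rows of the question's data; the names pts, to_prev and to_next are illustrative.

```python
import numpy as np
import pandas as pd

coords = [(1150.0, 1760.0), (1260.0, 1910.0), (1030.0, 2070.0), (1170.0, 2300.0)]
df = pd.DataFrame({"Coordinate": coords}, index=[1, 28, 6, 12])

pts = np.array(df["Coordinate"].tolist())  # shape (n, 2)

# First and last rows have no neighbour on one side, so they stay NaN
dots = np.full(len(pts), np.nan)
to_prev = pts[1:-1] - pts[:-2]  # b - a for every interior row
to_next = pts[1:-1] - pts[2:]   # b - c for every interior row
dots[1:-1] = (to_prev * to_next).sum(axis=1)  # row-wise dot product

df["dotProduct"] = dots
print(df)
```

The interior values agree with the table above (1300.0 at index 28 and -4600.0 at index 6); index 12 is NaN here only because the sample stops after four rows.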

how to construct an index from percentage change time series?

Consider the values below:
import numpy as np
import pandas as pd
array1 = np.array([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44, 532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
I convert these into a pandas Series object
df = pd.Series(array1)
And compute the percentage change as
df = (1+df.pct_change(periods=1))
From here, how do I construct an index (base=100)? My desired output is:
0 100.00
1 100.43
2 101.82
3 101.82
4 101.82
5 101.43
6 102.19
7 101.68
8 101.07
9 101.02
10 101.01
11 101.01
12 100.88
13 100.54
14 99.95
15 99.45
I can achieve this with an iterative (loop) solution, but that may not be practical if the data is large in depth and breadth. Secondly, is there a way to do this in a single step on multiple columns? Thank you all for any guidance.
An index (base=100) is the relative change of a series in relation to its first element. So there's no need to take a detour via relative changes and recalculate the index from them when you can get it directly by
df = pd.Series(array1)/array1[0]*100
As far as I know, there is still no off-the-shelf expanding_window version for pct_change(). You can avoid the for-loop by using apply:
# generate data
import pandas as pd
series = pd.Series([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
# compute the percentage change with respect to the first value, then add 100
series.apply(lambda x: ((x / series.iloc[0]) - 1) * 100) + 100
Output:
0 100.000000
1 100.434873
2 101.823050
3 101.821151
4 101.821151
5 101.433753
6 102.193357
7 101.680624
8 101.067244
9 101.015971
10 101.006476
11 101.006476
12 100.881141
13 100.535521
14 99.946828
15 99.445489
dtype: float64
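The question also asks about doing this on multiple columns in one step. A minimal sketch, assuming a DataFrame of price levels: dividing by the first row rebases every column at once, and if only the growth factors (1 + pct_change) are available, a cumulative product gives the same result. Column "b" is made-up data for illustration.

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame({
    "a": [526.59, 528.88, 536.19, 536.18],  # first values from the question
    "b": [10.0, 11.0, 12.1, 12.1],          # hypothetical second column
})

# Directly from the levels: every column divided by its own first value
index_direct = prices.div(prices.iloc[0]).mul(100)

# From growth factors: the first factor is undefined (NaN), so treat it as 1
growth = 1 + prices.pct_change()
index_from_growth = growth.fillna(1).cumprod().mul(100)

print(index_direct.round(2))
```

Both routes give the same frame up to floating-point error; the direct division is simpler when the levels are at hand.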

Is there a pandas way of getting the averages between consecutive rows?

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(30,3))
df.head()
which gives:
0 1 2
0 0.741955 0.913681 0.110109
1 0.079039 0.662438 0.510414
2 0.469055 0.201658 0.259958
3 0.371357 0.018394 0.485339
4 0.850254 0.808264 0.469885
Say I want to add another column holding the averages of consecutive values in column 2: between indices (0,1), (1,2), ..., (28,29).
I imagine this is a common task: column 2 holds the x-axis positions, and I want the categorical labels on a plot to appear midway between each pair of points on the x axis.
So I was wondering if there is a pandas way to do this:
averages = []
for index, item in enumerate(df[2]):
    if index < df[2].shape[0] - 1:
        averages.append((item + df[2].iloc[index + 1]) / 2)
df["averages"] = pd.Series(averages)
df.head()
which gives:
0 1 2 averages
0 0.997044 0.965708 0.211980 0.318781
1 0.716349 0.724811 0.425583 0.378653
2 0.729991 0.985072 0.331723 0.333138
3 0.996487 0.272300 0.334554 0.586686
As you can see, 0.318781 is the average of 0.211980 and 0.425583.
Thanks!
I think you can do this with rolling (here Series.rolling, since df[2] is a single column). Using the head of your dataframe as an example:
df['averages'] = df[2].rolling(2).mean().shift(-1)
returns:
>>> df
0 1 2 averages
0 0.997044 0.965708 0.211980 0.318781
1 0.716349 0.724811 0.425583 0.378653
2 0.729991 0.985072 0.331723 0.333139
3 0.996487 0.272300 0.334554 NaN
The NaN at the end is there because this example has no row after index 3. In your full dataframe the averages run up to the second-to-last row (the average of the values at indices 28 and 29, i.e. your 29th and 30th values). I used the exact data you provided to show that this gives the same values as your desired output. (For future reference: if you want to provide a reproducible dataframe built from random numbers, set and show a random seed, such as np.random.seed(42), before creating the df; that way we'll all have the same one.)
Breaking it down:
df[2] selects column 2, the one you're interested in.
.rolling(2) takes windows of 2 values (for the mean of 3 values, use .rolling(3), etc.).
.mean() is the function applied to each window (in your case, the mean).
.shift(-1) moves the result up one row, so each row holds the mean of its own value and the value below; without it, each row would hold the mean of its own value and the value above.
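For a window of exactly two values, the same column can also be written as a plain sum with a shifted copy. A small sketch comparing the two, using the four values of column 2 shown above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.211980, 0.425583, 0.331723, 0.334554])

# Rolling mean over 2 values, moved up one row
via_rolling = s.rolling(2).mean().shift(-1)

# Average of each value with the one below it
via_shift = (s + s.shift(-1)) / 2

print(pd.DataFrame({"rolling": via_rolling, "shift": via_shift}))
```

Both leave NaN in the last row, since there is no following value to pair it with.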
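This is one way, though slightly loopy. But @sacul's solution is better. I leave this here for reference only. Note that the last row is paired with itself (fillvalue=v[-1]), so its average is simply its own value rather than NaN.
import pandas as pd
import numpy as np
from itertools import zip_longest
df = pd.DataFrame(np.random.rand(30, 3))
v = df.values[:, -1]
# pair each value with the next one; the last value is paired with itself
pair_avg = [np.mean([i, j]) for i, j in zip_longest(v, v[1:], fillvalue=v[-1])]
df = df.join(pd.DataFrame(pair_avg, columns=['2_pair_avg']))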
# 0 1 2 2_pair_avg
# 0 0.382656 0.228837 0.053199 0.373678
# 1 0.812690 0.255277 0.694156 0.697738
# 2 0.040521 0.211511 0.701320 0.491044
# 3 0.558739 0.697916 0.280768 0.615398
# 4 0.262771 0.912669 0.950029 0.489550
# 5 0.217489 0.405125 0.029071 0.101794
# 6 0.577929 0.933565 0.174517 0.214530
# 7 0.067030 0.452027 0.254544 0.613225
# 8 0.580869 0.556112 0.971907 0.582547
# 9 0.483528 0.951537 0.193188 0.175215
# 10 0.481141 0.589833 0.157242 0.159363
# 11 0.087057 0.823691 0.161485 0.108634
# 12 0.319516 0.161386 0.055784 0.285276
# 13 0.901529 0.365992 0.514768 0.386599
# 14 0.270118 0.454583 0.258430 0.245463
# 15 0.379739 0.299569 0.232497 0.214943
# 16 0.017621 0.182647 0.197389 0.538386
# 17 0.720688 0.147093 0.879383 0.732239
# 18 0.859594 0.538390 0.585096 0.503846
# 19 0.360718 0.571567 0.422596 0.287384
# 20 0.874800 0.391535 0.152171 0.239078
# 21 0.935150 0.379871 0.325984 0.294485
# 22 0.269607 0.891331 0.262986 0.212050
# 23 0.140976 0.414547 0.161115 0.542682
# 24 0.851434 0.059209 0.924250 0.801210
# 25 0.389025 0.774885 0.678170 0.388856
# 26 0.679247 0.982517 0.099542 0.372649
# 27 0.670354 0.279138 0.645756 0.336031
# 28 0.393414 0.970737 0.026307 0.343947
# 29 0.479611 0.349401 0.661587 0.661587

Selection of a Series from pandas dataframe by interpolating column labels

I have a pandas dataframe that contains for multiple positions (defined by coordinate x) a value for different timesteps. I want to create a pandas.Series object that contains the value at a given position x for all timesteps (so all index-values of the dataframe). If x is not one of the column labels, I want to interpolate between the two nearest x values.
An excerpt from the dataframe object (min(x)=0 and max(x)=0.28):
0.000000 0.007962 0.018313 0.031770 0.049263 0.072004
time (s)
15760800 0.500481 0.500481 0.500481 0.500481 0.500481 0.500481
15761400 1.396126 0.487198 0.498765 0.501326 0.500234 0.500544
15762000 1.455313 0.542441 0.489421 0.502851 0.499945 0.500597
15762600 1.492908 0.592022 0.487835 0.502233 0.500139 0.500527
15763200 1.521089 0.636743 0.490874 0.500704 0.500485 0.500423
15763800 1.542632 0.675589 0.496401 0.499065 0.500788 0.500335
I can find ways to slice the dataframe by available column labels. But is there an elegant way to do the interpolation?
In the end I want a function that looks something like this: result = sliceDataframe(dataframe=dfin, x=0.01), with result a pandas.Series object, so I can call it in one line (or maybe two) in another postprocessing script.
I think you would be best off writing a simple function yourself. Something like:
import numpy as np
def sliceDataframe(df, x):
    # assumes the column labels are sorted and min(label) < x <= max(label)
    pos = np.searchsorted(df.columns.values, x)
    # select the two neighbouring column labels:
    left = df.columns[pos-1]
    right = df.columns[pos]
    # simple linear interpolation
    interpolated = df[left] + (df[right] - df[left])/(right - left) * (x - left)
    interpolated.name = x
    return interpolated
Another option is to use the interpolate method, but for that you first need to add a column of NaNs with the label you want. Keep in mind that the default linear method treats the columns as equally spaced; in the example below that is harmless, because 0.5 lies exactly halfway between the labels 0 and 1.
With the function above:
In [105]: df = pd.DataFrame(np.random.randn(8,4))
In [106]: df.columns = df.columns.astype(float)
In [107]: df
Out[107]:
0 1 2 3
0 -0.336453 1.219877 -0.912452 -1.047431
1 0.842774 -0.361236 -0.245771 0.014917
2 -0.974621 1.050503 0.367389 0.789570
3 1.091484 1.352065 1.215290 0.393900
4 -0.100972 -0.250026 -1.135837 -0.339204
5 0.503436 -0.764224 -1.099864 0.962370
6 -0.599090 0.908235 -0.581446 0.662604
7 -2.234131 0.512995 -0.591829 -0.046959
In [108]: sliceDataframe(df, 0.5)
Out[108]:
0 0.441712
1 0.240769
2 0.037941
3 1.221775
4 -0.175499
5 -0.130394
6 0.154572
7 -0.860568
Name: 0.5, dtype: float64
With the interpolate method:
In [109]: df[0.5] = np.nan
In [110]: df.sort_index(axis=1).interpolate(axis=1)
Out[110]:
0.0 0.5 1.0 2.0 3.0
0 -0.336453 0.441712 1.219877 -0.912452 -1.047431
1 0.842774 0.240769 -0.361236 -0.245771 0.014917
2 -0.974621 0.037941 1.050503 0.367389 0.789570
3 1.091484 1.221775 1.352065 1.215290 0.393900
4 -0.100972 -0.175499 -0.250026 -1.135837 -0.339204
5 0.503436 -0.130394 -0.764224 -1.099864 0.962370
6 -0.599090 0.154572 0.908235 -0.581446 0.662604
7 -2.234131 -0.860568 0.512995 -0.591829 -0.046959
In [111]: df.sort_index(axis=1).interpolate(axis=1)[0.5]
Out[111]:
0 0.441712
1 0.240769
2 0.037941
3 1.221775
4 -0.175499
5 -0.130394
6 0.154572
7 -0.860568
Name: 0.5, dtype: float64
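The hand-written function breaks at the edges: x equal to the smallest label picks the last column as its left neighbour, and x above the largest label raises an IndexError. A sketch of a guarded variant follows, assuming sorted numeric column labels; slice_at is a hypothetical name, and the sample data is taken from the first three columns of the question's excerpt.

```python
import numpy as np
import pandas as pd

def slice_at(df, x):
    """Interpolate linearly between the two columns whose labels bracket x."""
    cols = df.columns.to_numpy(dtype=float)
    if not (cols[0] <= x <= cols[-1]):
        raise ValueError(f"x={x} is outside the column range [{cols[0]}, {cols[-1]}]")
    # Clamp so that pos-1 and pos are always valid neighbours, even at the edges
    pos = int(np.clip(np.searchsorted(cols, x), 1, len(cols) - 1))
    left, right = cols[pos - 1], cols[pos]
    weight = (x - left) / (right - left)
    result = df.iloc[:, pos - 1] * (1 - weight) + df.iloc[:, pos] * weight
    return result.rename(x)

df = pd.DataFrame(
    [[0.500481, 0.500481, 0.500481],
     [1.396126, 0.487198, 0.498765]],
    index=pd.Index([15760800, 15761400], name="time (s)"),
    columns=[0.0, 0.007962, 0.018313],
)

result = slice_at(df, 0.01)
print(result)
```

Unlike interpolate with its default method, this weights by the actual column labels, which matters here because the x positions are not evenly spaced.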
