Selection of a Series from pandas dataframe by interpolating column labels - python

I have a pandas dataframe that contains for multiple positions (defined by coordinate x) a value for different timesteps. I want to create a pandas.Series object that contains the value at a given position x for all timesteps (so all index-values of the dataframe). If x is not one of the column labels, I want to interpolate between the two nearest x values.
An excerpt from the dataframe object (min(x)=0 and max(x)=0.28):
0.000000 0.007962 0.018313 0.031770 0.049263 0.072004
time (s)
15760800 0.500481 0.500481 0.500481 0.500481 0.500481 0.500481
15761400 1.396126 0.487198 0.498765 0.501326 0.500234 0.500544
15762000 1.455313 0.542441 0.489421 0.502851 0.499945 0.500597
15762600 1.492908 0.592022 0.487835 0.502233 0.500139 0.500527
15763200 1.521089 0.636743 0.490874 0.500704 0.500485 0.500423
15763800 1.542632 0.675589 0.496401 0.499065 0.500788 0.500335
I can find ways to slice the dataframe by available column labels. But is there an elegant way to do the interpolation?
In the end I want a function that looks something like this: result = sliceDataframe(dataframe=dfin, x=0.01), with result a pandas.Series object, so I can call it in one line (or maybe two) from another postprocessing script.

I think you would be best off writing a simple function yourself. Something like:
import numpy as np

def sliceDataframe(df, x):
    # supposing the column labels are sorted:
    pos = np.searchsorted(df.columns.values, x)
    # select the two neighbouring column labels:
    left = df.columns[pos-1]
    right = df.columns[pos]
    # simple linear interpolation between the two columns:
    interpolated = df[left] + (df[right] - df[left]) / (right - left) * (x - left)
    interpolated.name = x
    return interpolated
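One caveat (my note, not part of the original answer): as written, the function assumes x lies strictly between two existing labels. A minimal sketch of guards that could go at the top of the function; the exact-match shortcut and the error behaviour are assumptions:
cols = df.columns.values
if x in cols:                        # exact hit: return the column unchanged
    return df[x]
if not (cols[0] < x < cols[-1]):     # refuse to extrapolate outside the data
    raise ValueError(f"x={x} is outside [{cols[0]}, {cols[-1]}]")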
Another option is to use the interpolate method, but for that you first have to add a column of NaNs with the label you want.
With the function above:
In [105]: df = pd.DataFrame(np.random.randn(8,4))
In [106]: df.columns = df.columns.astype(float)
In [107]: df
Out[107]:
0 1 2 3
0 -0.336453 1.219877 -0.912452 -1.047431
1 0.842774 -0.361236 -0.245771 0.014917
2 -0.974621 1.050503 0.367389 0.789570
3 1.091484 1.352065 1.215290 0.393900
4 -0.100972 -0.250026 -1.135837 -0.339204
5 0.503436 -0.764224 -1.099864 0.962370
6 -0.599090 0.908235 -0.581446 0.662604
7 -2.234131 0.512995 -0.591829 -0.046959
In [108]: sliceDataframe(df, 0.5)
Out[108]:
0 0.441712
1 0.240769
2 0.037941
3 1.221775
4 -0.175499
5 -0.130394
6 0.154572
7 -0.860568
Name: 0.5, dtype: float64
With the interpolate method:
In [109]: df[0.5] = np.nan
In [110]: df.sort_index(axis=1).interpolate(axis=1)
Out[110]:
0.0 0.5 1.0 2.0 3.0
0 -0.336453 0.441712 1.219877 -0.912452 -1.047431
1 0.842774 0.240769 -0.361236 -0.245771 0.014917
2 -0.974621 0.037941 1.050503 0.367389 0.789570
3 1.091484 1.221775 1.352065 1.215290 0.393900
4 -0.100972 -0.175499 -0.250026 -1.135837 -0.339204
5 0.503436 -0.130394 -0.764224 -1.099864 0.962370
6 -0.599090 0.154572 0.908235 -0.581446 0.662604
7 -2.234131 -0.860568 0.512995 -0.591829 -0.046959
In [111]: df.sort_index(axis=1).interpolate(axis=1)[0.5]
Out[111]:
0 0.441712
1 0.240769
2 0.037941
3 1.221775
4 -0.175499
5 -0.130394
6 0.154572
7 -0.860568
Name: 0.5, dtype: float64
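One caveat with the interpolate approach (my note, not from the original answer): the default method='linear' treats the columns as equally spaced and ignores the actual label values. That happens to give the right result here because the labels 0-3 are uniformly spaced, but for the non-uniform labels in the question you would want method='index' (a.k.a. 'values'), which interpolates against the labels themselves. A minimal sketch, assuming sorted float column labels:
import numpy as np

def slice_by_label(df, x):
    out = df.copy()
    out[x] = np.nan                 # add an empty column at the target label
    return out.sort_index(axis=1).interpolate(method='index', axis=1)[x]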

Related

Python dataframe interpolation - adding a new row to a dataframe

I have a dataframe to which I would like to add a new row where EVM equals a specific value (-30), updating the other columns with linear interpolation.
Index PwrOut EVM PwrGain Vout
0 -0.760031 -58.322902 32.239969 134.331851
1 3.242575 -58.073389 32.242575 134.332376
2 7.246203 -57.138122 32.246203 134.343538
3 11.251078 -54.160870 32.251078 134.383609
4 15.257129 -48.624869 32.257129 134.487430
5 17.260618 -45.971596 32.260618 134.586753
6 18.263079 -44.319692 32.263079 134.656616
7 19.266674 -41.532695 32.266674 134.743599
8 20.271934 -37.546253 32.271934 134.849050
9 21.278990 -33.239208 32.278990 134.972439
10 22.286989 -29.221786 32.286989 135.111068
11 23.293533 -25.652448 32.293533 135.261357
For example, (in the 3rd column) EVM = -30 lies between rows 9 and 10 above. How can I include a new row (between rows 9 and 10) that has EVM = -30, and then update the other columns (in this new row only) with linear interpolation based on where EVM = -30 falls between the values in rows 9 and 10?
It would be great to be able to search and find the rows that EVM = -30 lies between.
Is it possible to apply linear interpolation to some columns but nonlinear interpolation to others?
Thanks!
Interpolation is by far the easiest part. Here is one approach.
First, find the missing rows and add them one by one:
import numpy as np
import pandas as pd

targets = (-50, -40, -30)  # arbitrary EVM values to insert
idxs = df.EVM.searchsorted(targets)  # find the insertion positions (assumes EVM is sorted ascending)
arr = df.values
# each earlier insert shifts the later positions by one, hence the offset:
for offset, (idx, target) in enumerate(zip(idxs, targets)):
    arr = np.insert(arr, idx + offset, [np.nan, target, np.nan, np.nan], axis=0)
df1 = pd.DataFrame(arr, columns=df.columns)
Then you can actually interpolate:
df2 = df1.interpolate('linear')
Output:
PwrOut EVM PwrGain Vout
0 -0.760031 -58.322902 32.239969 134.331851
1 3.242575 -58.073389 32.242575 134.332376
2 7.246203 -57.138122 32.246203 134.343538
3 11.251078 -54.160870 32.251078 134.383609
4 13.254103 -50.000000 32.254103 134.435519
5 15.257129 -48.624869 32.257129 134.487430
6 17.260618 -45.971596 32.260618 134.586753
7 18.263079 -44.319692 32.263079 134.656616
8 19.266674 -41.532695 32.266674 134.743599
9 19.769304 -40.000000 32.269304 134.796324
10 20.271934 -37.546253 32.271934 134.849050
11 21.278990 -33.239208 32.278990 134.972439
12 21.782989 -30.000000 32.282989 135.041753
13 22.286989 -29.221786 32.286989 135.111068
14 23.293533 -25.652448 32.293533 135.261357
If you want custom interpolation methods for particular columns, apply them column by column, e.g.:
df2.PwrOut = df1.PwrOut.interpolate('cubic')
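A side note (my addition, not from the original answer): interpolate('linear') on a default RangeIndex treats rows as equally spaced, so the filled values above are simple midpoints of their neighbours. If you want values weighted by where the target EVM actually falls between its neighbours, as the question seems to ask, one option is to interpolate against EVM itself by making it the index. A sketch, assuming EVM is monotonically increasing as in the sample data:
df2 = (df1.set_index('EVM')
       .interpolate(method='index')   # weight by the EVM values, not by row position
       .reset_index()[df1.columns])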

Pandas how to truncate small float values

I'm using pandas.DataFrame.round to truncate columns on a DataFrame, but I have a column of p-values with small values, which are being rounded to zero. For example, all the values below are being rounded to 0.
p-value
2.298564e-17
6.848231e-91
1.089847e-10
9.390048e-04
5.628517e-35
4.621786e-19
4.601818e-54
9.639073e-19
I want something like
p-value
2.29e-17
6.84e-91
1.08e-10
9.39e-04
5.62e-35
4.62e-19
4.60e-54
9.63e-19
Numpy has functions for this.
data = """p-value
2.298564e-17
6.848231e-91
1.089847e-10
9.390048e-04
5.628517e-35
4.621786e-19
4.601818e-54
9.639073e-19"""
a = [x for x in data.split("\n")]
df = pd.DataFrame({"p-value":a[1:]})
df["p-value"] = df["p-value"].astype(np.float)
df["p-value"].apply(lambda x: np.format_float_scientific(x, precision=2))
Output:
0 2.3e-17
1 6.85e-91
2 1.09e-10
3 9.39e-04
4 5.63e-35
5 4.62e-19
6 4.60e-54
7 9.64e-19
Name: p-value, dtype: object
Not quite truncate, but rather round:
df['p-value'].apply(lambda x: f'{x:.2e}')
Output:
0 2.30e-17
1 6.85e-91
2 1.09e-10
3 9.39e-04
4 5.63e-35
5 4.62e-19
6 4.60e-54
7 9.64e-19
Name: p-value, dtype: object
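Both answers round the last kept digit. If you genuinely need truncation (always toward zero), neither np.format_float_scientific nor f-string formatting does that directly; a small helper along these lines could work (my sketch, with the usual floating-point caveats):
import math

def truncate_sig(x, digits=3):
    # keep `digits` significant figures, truncating instead of rounding
    if x == 0:
        return 0.0
    exp = math.floor(math.log10(abs(x)))   # exponent of the leading digit
    factor = 10.0 ** (digits - 1 - exp)
    return math.trunc(x * factor) / factor

truncate_sig(2.298564e-17)  # -> 2.29e-17 (rounding would give 2.3e-17)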

Comparing elements between two dataframes and adding columns in case of equality

Considering two dataframes as follows:
import pandas as pd

df_rp = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8],
                      'res': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
df_cdr = pd.DataFrame({'id': [1, 2, 5, 6, 7, 1, 2, 3, 8, 9, 3, 4, 8],
                       'LATITUDE': [-22.98, -22.97, -22.92, -22.87, -22.89, -22.84, -22.98,
                                    -22.14, -22.28, -22.42, -22.56, -22.70, -22.13],
                       'LONGITUDE': [-43.19, -43.39, -43.24, -43.28, -43.67, -43.11, -43.22,
                                     -43.33, -43.44, -43.55, -43.66, -43.77, -43.88]})
What I have to do:
Compare each df_rp['id'] element with each df_cdr['id'] element;
If they match, collect, in some data structure (list, Series, etc.), the latitudes and longitudes on the same rows as that id, without repeating the id.
Below is an example of how I need the data to be grouped:
1:[-22.98,-43.19],[-22.84,-43.11]
2:[-22.97,-43.39],[-22.98,-43.22]
3:[-22.14,-43.33],[-22.56,-43.66]
4:[-22.70,-43.77]
5:[-22.92,-43.24]
6:[-22.87,-43.28]
7:[-22.89,-43.67]
8:[-22.28,-43.44],[-22.13,-43.88]
I'm having a hard time choosing which data structure is best for the situation (the way I wrote the example looks like a dictionary, but there would be several dictionaries) and figuring out how to add the latitude and longitude pairs without repeating the id. I appreciate any help.
We need to aggregate the second df, then reindex and assign it back:
df_rp['L$L'] = (df_cdr.drop(columns='id').apply(tuple, axis=1)
                .groupby(df_cdr.id).agg(list).reindex(df_rp.id).to_numpy())
df_rp
Out[59]:
id res L$L
0 1 a [(-22.98, -43.19), (-22.84, -43.11)]
1 2 b [(-22.97, -43.39), (-22.98, -43.22)]
2 3 c [(-22.14, -43.33), (-22.56, -43.66)]
3 4 d [(-22.7, -43.77)]
4 5 e [(-22.92, -43.24)]
5 6 f [(-22.87, -43.28)]
6 7 g [(-22.89, -43.67)]
7 8 h [(-22.28, -43.44), (-22.13, -43.88)]
df_cdr['lat_long'] = df_cdr.apply(lambda x: [x['LATITUDE'], x['LONGITUDE']], axis=1)
df_cdr = df_cdr.drop(columns=['LATITUDE', 'LONGITUDE'])
df_cdr = df_cdr.groupby('id').agg(lambda x: x.tolist())
Output
lat_long
id
1 [[-22.98, -43.19], [-22.84, -43.11]]
2 [[-22.97, -43.39], [-22.98, -43.22]]
3 [[-22.14, -43.33], [-22.56, -43.66]]
4 [[-22.7, -43.77]]
5 [[-22.92, -43.24]]
6 [[-22.87, -43.28]]
7 [[-22.89, -43.67]]
8 [[-22.28, -43.44], [-22.13, -43.88]]
9 [[-22.42, -43.55]]
Assume df_rp.id is unique and sorted as in your sample. I came up with a solution using set_index and loc to drop the ids that appear in df_cdr but not in df_rp. Then call groupby on the index level with a lambda that returns arrays:
s = (df_cdr.set_index('id').loc[df_rp.id]
     .groupby(level=0).apply(lambda x: x.to_numpy()))
Out[709]:
id
1 [[-22.98, -43.19], [-22.84, -43.11]]
2 [[-22.97, -43.39], [-22.98, -43.22]]
3 [[-22.14, -43.33], [-22.56, -43.66]]
4 [[-22.7, -43.77]]
5 [[-22.92, -43.24]]
6 [[-22.87, -43.28]]
7 [[-22.89, -43.67]]
8 [[-22.28, -43.44], [-22.13, -43.88]]
dtype: object
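If you literally want the dictionary-like structure shown in the question, a plain dict of lists is another option (a minimal sketch, keeping only the ids present in df_rp):
wanted = set(df_rp['id'])
coords = {k: g[['LATITUDE', 'LONGITUDE']].to_numpy().tolist()
          for k, g in df_cdr.groupby('id') if k in wanted}
# coords[1] -> [[-22.98, -43.19], [-22.84, -43.11]]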

Calculate row-wise dot products based on previous row and next row in pandas

I have a pandas dataframe like below:
Coordinate
1 (1150.0,1760.0)
28 (1260.0,1910.0)
6 (1030.0,2070.0)
12 (1170.0,2300.0)
9 (790.0,2260.0)
5 (750.0,2030.0)
26 (490.0,2130.0)
29 (360.0,1980.0)
3 (40.0,2090.0)
2 (630.0,1660.0)
20 (590.0,1390.0)
Now, I want to create a new column 'dotProduct' by applying the formula
np.dot((b-a),(b-c)), where b is the coordinate (1260.0, 1910.0) at index 28, a is the coordinate at index 1, and c is the coordinate at index 6 (i.e. (1030.0, 2070.0)). The calculated product is for the second row. So, in a way, I have to get the previous row's value and the next row's value too, and calculate this for the entire 'Coordinate' column. I am quite new to pandas and still learning, so please guide me a bit.
Thanks a lot for the help.
I assume that your 'Coordinate' column elements are already tuples of float values.
import numpy as np

# Convert elements of 'Coordinate' into numpy arrays
df.Coordinate = df.Coordinate.apply(np.array)
# Subtract +/- 1 shifted values from original 'Coordinate'
a = df.Coordinate - df.Coordinate.shift(1)
b = df.Coordinate - df.Coordinate.shift(-1)
# take row-wise dot product based on the arrays a, b
df['dotProduct'] = [np.dot(x, y) for x, y in zip(a, b)]
# make 'Coordinate' tuple again (if you want)
df.Coordinate = df.Coordinate.apply(tuple)
Now I get this as df:
Coordinate dotProduct
1 (1150.0, 1760.0) NaN
28 (1260.0, 1910.0) 1300.0
6 (1030.0, 2070.0) -4600.0
12 (1170.0, 2300.0) 62400.0
9 (790.0, 2260.0) -24400.0
5 (750.0, 2030.0) 12600.0
26 (490.0, 2130.0) -18800.0
29 (360.0, 1980.0) -25100.0
3 (40.0, 2090.0) 236100.0
2 (630.0, 1660.0) -92500.0
20 (590.0, 1390.0) NaN
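For larger frames, the same computation can be done without the Python-level loop by stacking the coordinates into a 2-D array first (my sketch, equivalent to the list comprehension above):
import numpy as np

pts = np.array(df.Coordinate.tolist())      # shape (n, 2)
prev_diff = pts[1:-1] - pts[:-2]            # b - a
next_diff = pts[1:-1] - pts[2:]             # b - c
dots = (prev_diff * next_diff).sum(axis=1)  # row-wise dot products
df['dotProduct'] = np.concatenate([[np.nan], dots, [np.nan]])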

Is there a pandas way of getting the averages between consecutive rows?

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(30,3))
df.head()
which gives:
0 1 2
0 0.741955 0.913681 0.110109
1 0.079039 0.662438 0.510414
2 0.469055 0.201658 0.259958
3 0.371357 0.018394 0.485339
4 0.850254 0.808264 0.469885
Say I want to add another column that holds the averages of consecutive values in column 2: between indices (0,1), (1,2), ..., (28,29).
I imagine this is a common task, as column 2 holds the x-axis positions and I want the categorical labels on a plot to appear midway between the two points on the x axis.
So I was wondering if there is a pandas way to do this:
averages = []
for index, item in enumerate(df[2]):
    if index < df[2].shape[0] - 1:
        averages.append((item + df[2].iloc[index + 1]) / 2)
df["averages"] = pd.Series(averages)
df.head()
which gives:
0 1 2 averages
0 0.997044 0.965708 0.211980 0.318781
1 0.716349 0.724811 0.425583 0.378653
2 0.729991 0.985072 0.331723 0.333138
3 0.996487 0.272300 0.334554 0.586686
As you can see, 0.31 is the average of 0.21 and 0.42.
Thanks!
I think that you can do this with pandas.DataFrame.rolling. Using your dataframe head as an example:
df['averages'] = df[2].rolling(2).mean().shift(-1)
returns:
>>> df
0 1 2 averages
0 0.997044 0.965708 0.211980 0.318781
1 0.716349 0.724811 0.425583 0.378653
2 0.729991 0.985072 0.331723 0.333139
3 0.996487 0.272300 0.334554 NaN
The NaN at the end is there because there is no row indexed 4; but in your full dataframe, it would go on until the second to last row (the average of value at indices 28 and 29, i.e. your 29th and 30th values). I just wanted to show that this gives the same values as your desired output, so I used the exact data you provided. (for future reference, if you want to provide a reproducible dataframe for us from random numbers, use and show us a random seed such as np.random.seed(42) before creating the df, that way, we'll all have the same one.)
Breaking it down:
df[2] is there because you're interested in column 2; .rolling(2) is there because you want the mean of 2 values (for the mean of 3 values, use .rolling(3), etc.); .mean() is whatever function you want (in your case, the mean); finally, .shift(-1) makes sure the new column is in the proper place (i.e., shows the mean of each value in column 2 and the value below it, since the default would pair it with the value above).
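An equivalent, arguably even simpler spelling of the same idea pairs each value with the one below it via shift:
df['averages'] = (df[2] + df[2].shift(-1)) / 2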
This is one way, though slightly loopy. But @sacul's solution is better. I leave this here for reference only.
import pandas as pd
import numpy as np
from itertools import zip_longest

df = pd.DataFrame(np.random.rand(30, 3))
v = df.values[:, -1]
df = df.join(pd.DataFrame(
    np.array([np.mean([i, j], axis=0)
              for i, j in zip_longest(v, v[1:], fillvalue=v[-1])]),
    columns=['2_pair_avg']))
# 0 1 2 2_pair_avg
# 0 0.382656 0.228837 0.053199 0.373678
# 1 0.812690 0.255277 0.694156 0.697738
# 2 0.040521 0.211511 0.701320 0.491044
# 3 0.558739 0.697916 0.280768 0.615398
# 4 0.262771 0.912669 0.950029 0.489550
# 5 0.217489 0.405125 0.029071 0.101794
# 6 0.577929 0.933565 0.174517 0.214530
# 7 0.067030 0.452027 0.254544 0.613225
# 8 0.580869 0.556112 0.971907 0.582547
# 9 0.483528 0.951537 0.193188 0.175215
# 10 0.481141 0.589833 0.157242 0.159363
# 11 0.087057 0.823691 0.161485 0.108634
# 12 0.319516 0.161386 0.055784 0.285276
# 13 0.901529 0.365992 0.514768 0.386599
# 14 0.270118 0.454583 0.258430 0.245463
# 15 0.379739 0.299569 0.232497 0.214943
# 16 0.017621 0.182647 0.197389 0.538386
# 17 0.720688 0.147093 0.879383 0.732239
# 18 0.859594 0.538390 0.585096 0.503846
# 19 0.360718 0.571567 0.422596 0.287384
# 20 0.874800 0.391535 0.152171 0.239078
# 21 0.935150 0.379871 0.325984 0.294485
# 22 0.269607 0.891331 0.262986 0.212050
# 23 0.140976 0.414547 0.161115 0.542682
# 24 0.851434 0.059209 0.924250 0.801210
# 25 0.389025 0.774885 0.678170 0.388856
# 26 0.679247 0.982517 0.099542 0.372649
# 27 0.670354 0.279138 0.645756 0.336031
# 28 0.393414 0.970737 0.026307 0.343947
# 29 0.479611 0.349401 0.661587 0.661587
