I have sensor data captured at different frequencies (the data below is invented to keep the example simple). I want to upsample the voltage data so that, after interpolation, I have 16 data points instead of 12.
Pandas has resample/upsample functionality, but I can only find examples where people go from weekly data to daily data (adding six daily data points by interpolation between two weekly data points).
time (pressure)    pressure
0.05               1
0.1                1.1
0.15               1.2
0.2                1.3
0.25               1.4
0.3                1.5
0.35               1.6
0.4                1.7
0.45               1.8
0.5                1.9
0.55               2
0.6                2.1
0.65               2.2
0.7                2.3
0.75               2.4
0.8                2.5
time (voltage)    voltage
0.07              2.2
0.14              2.5
0.21              2.8
0.28              3.1
0.35              3.4
0.42              3.7
0.49              4
0.56              4.3
0.63              4.6
0.7               4.9
0.77              5.2
0.84              5.5
I would like my voltage to have 16 samples instead of 12 with the missing values interpolated. Thanks!
Let's assume two Series, "pressure" and "voltage":
import pandas as pd

pressure = pd.Series({0.05: 1.0, 0.1: 1.1, 0.15: 1.2, 0.2: 1.3, 0.25: 1.4, 0.3: 1.5, 0.35: 1.6, 0.4: 1.7, 0.45: 1.8,
                      0.5: 1.9, 0.55: 2.0, 0.6: 2.1, 0.65: 2.2, 0.7: 2.3, 0.75: 2.4, 0.8: 2.5}, name='pressure')
voltage = pd.Series({0.07: 2.2, 0.14: 2.5, 0.21: 2.8, 0.28: 3.1, 0.35: 3.4, 0.42: 3.7,
                     0.49: 4.0, 0.56: 4.3, 0.63: 4.6, 0.7: 4.9, 0.77: 5.2, 0.84: 5.5}, name='voltage')
You can either use pandas.merge_asof:
pd.merge_asof(pressure, voltage, left_index=True, right_index=True)
output:
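By default merge_asof matches each pressure timestamp to the most recent earlier voltage timestamp (direction='backward'); when aligning two sensors, direction='nearest' is often a better fit:

pd.merge_asof(pressure, voltage, left_index=True, right_index=True, direction='nearest')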
or pandas.concat+interpolate:
(pd.concat([pressure, voltage], axis=1)
.sort_index()
.apply(pd.Series.interpolate)
#.plot(x='pressure', y='voltage', marker='o') # uncomment to plot
)
output:
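One caveat: Series.interpolate defaults to method='linear', which treats the rows as equally spaced and ignores the actual time values in the irregular merged index. Interpolating by index value is closer to true time interpolation:

(pd.concat([pressure, voltage], axis=1)
 .sort_index()
 .interpolate(method='index')  # interpolate using the time values in the index
)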
Finally, to interpolate only on voltage, drop NAs on pressure first:
(pd.concat([pressure, voltage], axis=1)
.sort_index()
.dropna(subset=['pressure'])
.apply(pd.Series.interpolate)
)
output:
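Finally, if the goal is literally a 16-sample voltage series aligned to the pressure timestamps, you can make that explicit (a minimal sketch; timestamps outside the measured voltage range, such as 0.05, remain NaN since no extrapolation is done):

target = pressure.index
voltage_16 = (voltage.reindex(voltage.index.union(target))
                     .interpolate(method='index')  # interpolate by actual time value
                     .reindex(target))             # keep the 16 pressure timestamps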
I have many NaN values in my output data, and I have padded those positions with zeros. Please don't suggest deleting the NaNs or imputing them with some other number; I want the model to skip those positions.
example:
import numpy as np

x = np.arange(0.5, 30)
x.shape = [10, 3]
x = [[ 0.5 1.5 2.5]
[ 3.5 4.5 5.5]
[ 6.5 7.5 8.5]
[ 9.5 10.5 11.5]
[12.5 13.5 14.5]
[15.5 16.5 17.5]
[18.5 19.5 20.5]
[21.5 22.5 23.5]
[24.5 25.5 26.5]
[27.5 28.5 29.5]]
y = np.arange(2, 10, 0.8)
y.shape = [10, 1]
y[4, 0] = 0.0
y[6, 0] = 0.0
y[7, 0] = 0.0
y = [[2. ]
[2.8]
[3.6]
[4.4]
[0. ]
[6. ]
[0. ]
[0. ]
[8.4]
[9.2]]
I expect the Keras deep learning model to predict zeros for the 5th, 7th, and 8th rows, matching the padded values in 'y'.
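One common way to make training skip the padded positions (a minimal sketch, not a full solution; the model architecture here is purely illustrative) is to pass per-sample weights of zero wherever y was padded, so those rows contribute nothing to the loss:

import numpy as np
from tensorflow import keras

# Weight 0 where y was padded with zeros, 1 elsewhere,
# so the padded rows contribute nothing to the loss.
weights = (y != 0.0).astype(float).ravel()

# Illustrative regression model; layer sizes are arbitrary.
model = keras.Sequential([
    keras.layers.Dense(16, activation='relu', input_shape=(3,)),
    keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(x, y, sample_weight=weights, epochs=100, verbose=0)

Note that this makes the model ignore those rows rather than learn to output zero there.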
I found code on this forum that does what I'm interested in, but it's not working for my DataFrame.
INPUT:
x   , y   , value , value2
1.0 , 1.0 , 12.33 , 1.23367543
2.0 , 2.0 , 11.5  , 1.1523123
4.0 , 2.0 , 22.11 , 2.2112312
5.0 , 5.0 , 78.13 , 7.8131239
6.0 , 6.0 , 33.68 , 3.3681231
I need to delete rows that lie within a distance of 1 of each other (in both x and y) and keep only the one with the highest "value".
RESULT to get:
1.0 , 1.0 , 12.33 , 1.23367543
4.0 , 2.0 , 22.11 , 2.2112312
5.0 , 5.0 , 78.13 , 7.8131239
CODE:
def dist_value_comp(row):
    # Rows within a distance of 1 in both x and y form the neighborhood
    x_dist = abs(df['x'] - row['x']) <= 1
    y_dist = abs(df['y'] - row['y']) <= 1
    xy_dist = x_dist & y_dist
    # Keep the row only if it holds the neighborhood's maximum
    max_value = df.loc[xy_dist, 'value2'].max()
    return row['value2'] == max_value

df['keep_row'] = df.apply(dist_value_comp, axis=1)
df.loc[df['keep_row'], ['x', 'y', 'value', 'value2']]
PROBLEM:
When I add the 4th column value2, whose values have more digits after the decimal point, the code shows me only the row with the highest value2, but the result should be the same as for value.
UPDATE:
It works when I use an old PyCharm with Python 2.7; on a newer version it doesn't. Any idea why?
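One thing worth checking (an assumption, not a confirmed diagnosis): the filter relies on an exact float equality, row['value2'] == max_value, and tiny representation differences can make that test miss the maximum row. A tolerant comparison with numpy.isclose sidesteps the issue:

import numpy as np

def dist_value_comp(row):
    x_dist = abs(df['x'] - row['x']) <= 1
    y_dist = abs(df['y'] - row['y']) <= 1
    xy_dist = x_dist & y_dist
    max_value = df.loc[xy_dist, 'value2'].max()
    # Tolerant float comparison instead of exact equality
    return np.isclose(row['value2'], max_value)

df['keep_row'] = df.apply(dist_value_comp, axis=1)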
I have a text file with this format:
1 10.0e+08 1.0e+04 1.0
2 9.0e+07 9.0e+03 0.9
2 8.0e+07 8.0e+03 0.8
3 7.0e+07 7.0e+03 0.7
I would like to preserve the first variable of every line and then normalize the data on all lines by the data on the first line. The end result would look something like:
1 1.0 1.0 1.0
2 0.9 0.9 0.9
2 0.8 0.8 0.8
3 0.7 0.7 0.7
so essentially, we are doing the following:
1 10.0e+08/10.0e+08 1.0e+04/1.0e+04 1.0/1.0
2 9.0e+07/10.0e+08 9.0e+03/1.0e+04 0.9/1.0
2 8.0e+07/10.0e+08 8.0e+03/1.0e+04 0.8/1.0
3 7.0e+07/10.0e+08 7.0e+03/1.0e+04 0.7/1.0
I'm still researching and reading about how to do this; I'll upload my attempt shortly. Also, can anyone point me to a place where I can learn more about manipulating data files?
Read your file into a NumPy array and use NumPy's broadcasting feature (note that 9.0e+07 / 10.0e+08 is 0.09, not 0.9, so the second column below differs from the expected result in the question):
import numpy as np
data = np.loadtxt('foo.txt')
data = data / data[0]
#array([[ 1. , 1. , 1. , 1. ],
# [ 2. , 0.09, 0.9 , 0.9 ],
# [ 2. , 0.08, 0.8 , 0.8 ],
# [ 3. , 0.07, 0.7 , 0.7 ]])
np.savetxt('new.txt', data)
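This preserves the first column only because its first entry happens to be 1, so dividing by data[0] leaves the labels unchanged. To keep the label column intact regardless of its values, normalize just the remaining columns (a small sketch under the same file layout):

import numpy as np

data = np.loadtxt('foo.txt')
# Divide every column except the first by the first row's values;
# the (3,) row broadcasts across all rows of the (n, 3) slice.
data[:, 1:] = data[:, 1:] / data[0, 1:]
np.savetxt('new.txt', data)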
Edit: The original question was flawed, but I am leaving it here for transparency.
Original:
I have some x, y, z data where x and y are coordinates of a 2D grid and z is a scalar value corresponding to (x, y).
>>> import numpy as np
>>> # Dummy example data
>>> x = np.arange(0.0, 5.0, 0.5)
>>> y = np.arange(1.0, 2.0, 0.1)
>>> z = np.sin(x)**2 + np.cos(y)**2
>>> print "x = ", x, "\n", "y = ", y, "\n", "z = ", z
x = [ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
y = [ 1. 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9]
z = [ 0.29192658 0.43559829 0.83937656 1.06655187 0.85571064 0.36317266
0.02076747 0.13964978 0.62437081 1.06008127]
Using xx, yy = np.meshgrid(x, y) I can get two grids containing x and y values corresponding to each grid position.
>>> xx, yy = np.meshgrid(x, y)
>>> print xx
[[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]]
>>> print yy
[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. ]
[ 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1]
[ 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2]
[ 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3]
[ 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4]
[ 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5]
[ 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6]
[ 1.7 1.7 1.7 1.7 1.7 1.7 1.7 1.7 1.7 1.7]
[ 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8]
[ 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9]]
Now I want an array of the same shape for z, where the grid values correspond to the matching x and y values in the original data! But I cannot find an elegant, built-in solution where I do not need to re-grid the data, and I think I am missing some understanding of how I should approach it.
I have tried following this solution (with my real data, not this simple example data, but it should have the same result) but my final grid was not fully populated.
Please help!
Corrected question:
As was pointed out by commenters, my original dummy data was unsuitable for the question I am asking. Here is an improved version of the question:
I have some x, y, z data where x and y are coordinates of a 2D grid and z is a scalar value corresponding to (x, y). The data is read from a text file "data.txt":
#x y z
1.4 0.2 1.93164166734
1.4 0.3 1.88377897779
1.4 0.4 1.81946452501
1.6 0.2 1.9596778849
1.6 0.3 1.91181519535
1.6 0.4 1.84750074257
1.8 0.2 1.90890970517
1.8 0.3 1.86104701562
1.8 0.4 1.79673256284
2.0 0.2 1.78735230743
2.0 0.3 1.73948961789
2.0 0.4 1.67517516511
Loading the text:
>>> import numpy as np
>>> inFile = r'C:\data.txt'  # raw string, so the backslash is not treated as an escape
>>> x, y, z = np.loadtxt(inFile, unpack=True, usecols=(0, 1, 2), comments='#', dtype=float)
>>> print x
[ 1.4 1.4 1.4 1.6 1.6 1.6 1.8 1.8 1.8 2. 2. 2. ]
>>> print y
[ 0.2 0.3 0.4 0.2 0.3 0.4 0.2 0.3 0.4 0.2 0.3 0.4]
>>> print z
[ 1.93164167 1.88377898 1.81946453 1.95967788 1.9118152 1.84750074
1.90890971 1.86104702 1.79673256 1.78735231 1.73948962 1.67517517]
Using xx, yy = np.meshgrid(np.unique(x), np.unique(y)) I can get two grids containing the x and y values corresponding to each grid position.
>>> xx, yy = np.meshgrid(np.unique(x), np.unique(y))
>>> print xx
[[ 1.4 1.6 1.8 2. ]
[ 1.4 1.6 1.8 2. ]
[ 1.4 1.6 1.8 2. ]]
>>> print yy
[[ 0.2 0.2 0.2 0.2]
[ 0.3 0.3 0.3 0.3]
[ 0.4 0.4 0.4 0.4]]
Now each corresponding cell position in xx and yy corresponds to one of the original grid-point locations.
I simply need an equivalent array where the grid values are the matching z values from the original data!
"""e.g.
[[ 1.93164166734 1.9596778849 1.90890970517 1.78735230743]
[ 1.88377897779 1.91181519535 1.86104701562 1.73948961789]
[ 1.81946452501 1.84750074257 1.79673256284 1.67517516511]]"""
But I cannot find an elegant, built-in solution where I do not need to re-grid the data, and I think I am missing some understanding of how I should approach it. For example, using xx, yy, zz = np.meshgrid(x, y, z) returns three 3D arrays that I don't think I can use.
Please help!
Edit:
I managed to make this example work thanks to the solution from Jaime: Fill 2D numpy array from three 1D numpy arrays
>>> x_vals, x_idx = np.unique(x, return_inverse=True)
>>> y_vals, y_idx = np.unique(y, return_inverse=True)
>>> vals_array = np.empty(x_vals.shape + y_vals.shape)
>>> vals_array.fill(np.nan) # or whatever your desired missing data flag is
>>> vals_array[x_idx, y_idx] = z
>>> zz = vals_array.T
>>> print zz
[[ 1.93164167  1.95967788  1.90890971  1.78735231]
 [ 1.88377898  1.9118152   1.86104702  1.73948962]
 [ 1.81946453  1.84750074  1.79673256  1.67517517]]
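Incidentally, because the file is sorted with x varying slowest and y fastest, a plain reshape gives the same grid (a sketch that assumes exactly that sorting):

>>> zz = z.reshape(len(np.unique(x)), len(np.unique(y))).T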
But the code (with my real input data) that led me down this path was still failing. I have now found the problem: I had been using scipy.ndimage.zoom to resample my gridded data to a higher resolution before generating zz.
>>> import scipy.ndimage
>>> zoom = 2
>>> x = scipy.ndimage.zoom(x, zoom)
>>> y = scipy.ndimage.zoom(y, zoom)
>>> z = scipy.ndimage.zoom(z, zoom)
This produced an array containing many nan entries, because zooming the 1D x and y arrays independently creates interpolated coordinate values that no longer pair up on a regular grid, so np.unique returns far more values and most cells of vals_array are never assigned:
array([[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
...,
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan]])
When I skip the zoom stage, the correct array is produced:
array([[-22365.93400183, -22092.31794674, -22074.21420168, ...,
-14513.89091599, -12311.97437017, -12088.07062786],
[-29264.34039242, -28775.79743097, -29021.31886353, ...,
-21354.6799064 , -21150.76555669, -21046.41225097],
[-39792.93758344, -39253.50249278, -38859.2562673 , ...,
-24253.36838785, -25714.71895023, -29237.74277727],
...,
[ 44829.24733543, 44779.37084337, 44770.32987311, ...,
21041.42652441, 20777.00408692, 20512.58162671],
[ 44067.26616067, 44054.5398901 , 44007.62587598, ...,
21415.90416488, 21151.48168444, 20887.05918082],
[ 43265.35371973, 43332.5983711 , 43332.21743471, ...,
21780.32283309, 21529.39770759, 21278.47255848]])
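If a higher-resolution grid is still needed, a safer order of operations (a sketch, assuming the regular 2D grid zz built above) is to grid the data first and zoom the 2D array, rather than zooming the 1D coordinate and value arrays:

import scipy.ndimage

# Zoom the already-gridded 2D array; the x and y axes can be
# rebuilt at the new resolution with np.linspace if needed.
zz_hi = scipy.ndimage.zoom(zz, 2)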