I'm currently working on a project to estimate flow meter uncertainty. The meter uncertainty is based on four different values:
Liquid Flowrate (liq)
Fluid Viscosity (cP)
Water Liquid Ratio (wlr)
Gas Volume Fraction (gvf)
A third party provides tables for the meter at a number of predefined values of liq, cP, wlr, and gvf. As you can guess, the data from the meter never falls exactly on one of the predefined values. For example, a minute of data may read:
Liquid Flowrate: 6532
Fluid Viscosity: 22
Water Liquid Ratio: 0.412
Gas Volume Fraction: 0.634
With the data above, a four-way interpolation on the tables is performed to find the uncertainty.
I've come up with a solution, but it seems clunky and I'm wondering if anyone has any ideas. I'm still new to the pandas game and really appreciate seeing other people's solutions.
Initially I filter the data to reduce the table down to the values just above and below the actual point I'm looking for.
aliq = 6532 # stbpd
avisc = 22 # centipoise
awlr = 0.412 # water liquid ratio
agvf = 0.634 # gas volume fraction
def findclose(num, colm):
    # colm can be a pandas column or a numpy array
    arr = np.unique(np.asarray(colm))
    if num in arr:
        clslo = num
        clshi = num
    else:
        clslo = arr[arr < num].max()  # closest table value below num
        clshi = arr[arr > num].min()  # closest table value above num
    return [clslo, clshi]
df = tbl_vx52[
(tbl_vx52['liq'].isin(findclose(aliq,tbl_vx52['liq']))) &
(tbl_vx52['visc'].isin(findclose(avisc,tbl_vx52['visc']))) &
(tbl_vx52['wlr'].isin(findclose(awlr,tbl_vx52['wlr']))) &
(tbl_vx52['gvf'].isin(findclose(agvf,tbl_vx52['gvf'])))
].reset_index(drop=True)
The table is reduced from 2240 rows down to 16. Instead of including all of the data (tbl_vx52), I've included code below that builds the resulting sub-dataframe, called df, containing just the values above and below the target point for this example.
df = pd.DataFrame({'liq':[5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 7000, 7000, 7000, 7000, 7000, 7000, 7000, 7000],
'visc':[10, 10, 10, 10, 30, 30, 30, 30, 10, 10, 10, 10, 30, 30, 30, 30],
'wlr':[0.375, 0.375, 0.5, 0.5, 0.375, 0.375, 0.5, 0.5, 0.375, 0.375, 0.5, 0.5, 0.375, 0.375, 0.5, 0.5],
'gvf':[0.625, 0.75, 0.625, 0.75, 0.625, 0.75, 0.625, 0.75, 0.625, 0.75, 0.625, 0.75, 0.625, 0.75, 0.625, 0.75],
'uncert':[0.0707, 0.0992, 0.0906, 0.1278, 0.0705, 0.0994, 0.091, 0.128, 0.0702, 0.0991, 0.0905, 0.1279, 0.0704, 0.0992, 0.0904, 0.1283],
})
Some pretty crude looping is then done to pair up rows based on the individual inputs (liq, visc, wlr, or gvf) and interpolate one dimension at a time. Shown below is the first loop, on gvf.
pairs = [
slice(0,1),
slice(2,3),
slice(4,5),
slice(6,7),
slice(8,9),
slice(10,11),
slice(12,13),
slice(14,15)]
for pair in pairs:
    df.loc[pair, 'uncert'] = np.interp(
        agvf,
        df.loc[pair, 'gvf'],
        df.loc[pair, 'uncert']
    )
    df.loc[pair, 'gvf'] = agvf

df = df.drop_duplicates().reset_index(drop=True)
The duplicate rows are dropped, reducing the table from 16 rows to 8. This is then repeated for wlr.
pairs = [
slice(0,1),
slice(2,3),
slice(4,5),
slice(6,7)
]
for pair in pairs:
    df.loc[pair, 'uncert'] = np.interp(
        awlr,
        df.loc[pair, 'wlr'],
        df.loc[pair, 'uncert']
    )
    df.loc[pair, 'wlr'] = awlr

df = df.drop_duplicates().reset_index(drop=True)
The structure above is repeated for visc (four rows down to two) and finally liq (two rows down to one) until only one row is left in the sub-array, which gives the meter uncertainty at the operating point; a sketch of those last two passes is shown below.
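For completeness, the last two passes would look something like this (a sketch only, assuming df now holds the four rows left after the wlr pass, still ordered by liq and then visc):
pairs = [slice(0, 1), slice(2, 3)]
for pair in pairs:
    df.loc[pair, 'uncert'] = np.interp(avisc, df.loc[pair, 'visc'], df.loc[pair, 'uncert'])
    df.loc[pair, 'visc'] = avisc
df = df.drop_duplicates().reset_index(drop=True)  # four rows down to two

df['uncert'] = np.interp(aliq, df['liq'], df['uncert'])
df['liq'] = aliq
df = df.drop_duplicates().reset_index(drop=True)  # two rows down to one

uncertainty = df['uncert'].iloc[0]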
I know it's pretty clunky. Any input or thoughts on different methods are appreciated.
Alright, I was able to find and apply a matrix-based solution. It is based on the matrix method for trilinear interpolation, which can be extended to quad-linear interpolation. Wikipedia provides a good write-up on trilinear interpolation. The 8x8 matrix in the Wikipedia article can be expanded to a 16x16 matrix for quad-linear interpolation. A single function is written below to make each row of the matrix.
def quad_row(x, y, z, k):
    """
    Generate a row for the quad interpolation matrix.
    x, y, z, k are scalar input values.
    """
    qrow = [1,
            x, y, z, k,
            x*y, x*z, x*k, y*z, y*k, z*k,
            x*y*z, x*y*k, x*z*k, y*z*k,
            x*y*z*k]
    return qrow
It should be evident that this is just an extension of the rows inside the trilinear matrix. The function can be looped over sixteen times to generate the entire matrix.
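As a rough sketch (assuming df is the 16-row sub-dataframe built earlier, with columns liq, visc, wlr, gvf and uncert), that loop could look like:
# one quad_row per corner point of the reduced table -> a 16x16 matrix
quad_matrix = np.array([quad_row(r.liq, r.visc, r.wlr, r.gvf) for r in df.itertuples()])
c_vector = df['uncert'].to_numpy()  # the 16 known uncertainty values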
Side Note: If you want to get fancy you can accomplish the quad_row function using itertools combinations. The advantage is that you can input an array of any size and it returns the properly formatted row for the interpolation matrix. The function is more flexible, but ultimately slower.
import numpy as np
from itertools import combinations

def interp_row(values):
    values = np.asarray(values)
    n = len(values)
    intp_row = [1]
    for i in range(1, n + 1):
        intp_row.extend([np.prod(x) for x in combinations(values, i)])
    return intp_row
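For example, with some arbitrary test values (chosen here purely for illustration), both row builders should agree:
print(quad_row(2.0, 3.0, 5.0, 7.0) == interp_row([2.0, 3.0, 5.0, 7.0]))  # True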
The function that accepts an input table, finds the values close to your interpolated values, builds the interpolation matrix and performs the matrix math is shown below.
def quad_interp(values, table):
    """
    values - four points to interpolate at, passed as a list or numpy array
    table  - lookup data, four input columns and one output column
    """
    table = np.asarray(table)
    A, B, C, D, E = np.transpose(table)
    a, b, c, d = values
    in_vector = quad_row(a, b, c, d)
    # keep only the 16 corner points that bracket the requested values
    mask = (
        np.isin(A, findclose(a, A)) &
        np.isin(B, findclose(b, B)) &
        np.isin(C, findclose(c, C)) &
        np.isin(D, findclose(d, D)))
    quad_matrix = []
    c_vector = []
    for row in table[mask]:
        x, y, z, v, w = row
        quad_matrix.append(quad_row(x, y, z, v))
        c_vector.append(w)
    quad_matrix = np.asarray(quad_matrix)
    c_vector = np.asarray(c_vector)
    # solve for the 16 polynomial coefficients, then evaluate at the input point
    a_vector = np.linalg.solve(quad_matrix, c_vector)
    return float(np.dot(a_vector, in_vector))
For example, calling the function would look like this.
df = pd.DataFrame({'liq':[5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 7000, 7000, 7000, 7000, 7000, 7000, 7000, 7000],
'visc':[10, 10, 10, 10, 30, 30, 30, 30, 10, 10, 10, 10, 30, 30, 30, 30],
'wlr':[0.375, 0.375, 0.5, 0.5, 0.375, 0.375, 0.5, 0.5, 0.375, 0.375, 0.5, 0.5, 0.375, 0.375, 0.5, 0.5],
'gvf':[0.625, 0.75, 0.625, 0.75, 0.625, 0.75, 0.625, 0.75, 0.625, 0.75, 0.625, 0.75, 0.625, 0.75, 0.625, 0.75],
'uncert':[0.0707, 0.0992, 0.0906, 0.1278, 0.0705, 0.0994, 0.091, 0.128, 0.0702, 0.0991, 0.0905, 0.1279, 0.0704, 0.0992, 0.0904, 0.1283],
})
values = [6532, 22, 0.412, 0.634]
quad_interp(values, df)
As seen, no error handling exists for the above function (a minimal guard sketch follows this list). It will break down if either of the following is attempted:
1. Interpolating at values outside the table boundaries.
2. Inputting lookup values that are already in the table, resulting in fewer than 16 points being selected.
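One possible minimal guard simply rejects those two cases up front instead of letting the matrix solve fail; check_inputs is a hypothetical helper, not part of the original code:
import numpy as np

def check_inputs(values, table):
    """Raise early for the two unsupported cases listed above (sketch only)."""
    table = np.asarray(table, dtype=float)
    for j, v in enumerate(values):
        col = table[:, j]
        if not (col.min() < v < col.max()):
            raise ValueError(f"value {v} is outside the table range for column {j}")
        if np.any(col == v):
            raise ValueError(f"value {v} is an exact table point; "
                             f"fewer than 16 corner rows would be selected")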
Also, I acknowledge the following:
1. The naming convention could have been better.
2. A faster way may exist for creating the mask.
The function findclose() is shown in the original question.
Please let me know if you have any feedback or see room for improvement.
I have a dataframe (see the sample below). I would like to calculate the average distance between each unordered pair of rows for each firm in each quarter. *Edit: the actual dataframe has over 700,000 rows, so if you know a fast solution, that would be greatly appreciated!
df = pd.DataFrame(
    np.array([[0.1, 0.2, 0.1, 'facebook', 2021_01], [0.4, 0.5, 0, 'facebook', 2021_01],
              [0.2, 0.4, 0.3, 'facebook', 2021_01], [0.3, 0.1, 0.2, 'facebook', 2021_02],
              [0.4, 0.2, 0.2, 'facebook', 2021_02], [0.2, 0.4, 0.2, 'facebook', 2021_02],
              [0.1, 0.2, 0.1, 'apple', 2021_01], [0.1, 0.5, 0.4, 'apple', 2021_01],
              [0.5, 0.2, 0.1, 'apple', 2021_01], [0.1, 0.2, 0.1, 'apple', 2021_02],
              [0.2, 0.2, 0.9, 'apple', 2021_02], [0.2, 0.6, 0.5, 'apple', 2021_02]]),
    columns=['a', 'b', 'c', 'firm', 'quarter_year'])
To calculate the distance between unordered pairs regardless of firm or quarter in the FULL dataframe, I use the Jensen-Shannon divergence from scipy as below:
from itertools import combinations
from scipy.spatial.distance import jensenshannon as js

prob = df.iloc[:, 0:3]
output = []
for i, j in combinations(df.index.tolist(), 2):
    # J-S divergence for this pair of rows
    sim_cult = js(prob.loc[i], prob.loc[j])
    output.append([i, j, sim_cult])
However, I'm having a hard time adding groupby(['firm','quarter_year']) to the loop above. In essence, I'd like to get another column with the average divergence score per company per quarter. For example, for facebook in 202101, the average would be based on the pairwise distances between row 0, row 1, and row 2.
How do I run the above code, but for each group (firm + quarter)?
You can simply wrap the code to compute the average distance in a function, and apply it to your dataframe after grouping.
def mean_js_dist(grp):
    prob = grp.iloc[:, 0:3]
    output = []
    for i, j in combinations(grp.index.tolist(), 2):
        sim_cult = js(prob.loc[i], prob.loc[j])
        output.append([i, j, sim_cult])
    return np.mean(np.array(output), axis=0)

df.groupby(['firm', 'quarter_year']).apply(mean_js_dist)
This is the general idea, but I'm not sure exactly what format you want the output in, or how you want to handle the three return values. Feel free to specify in the comments.
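If what you ultimately want is a single average score per (firm, quarter) merged back in as a column, a variant along these lines may work; this is just a sketch, mean_js_dist_scalar and avg_js are illustrative names, and it assumes the a/b/c columns are numeric:
import numpy as np
from itertools import combinations
from scipy.spatial.distance import jensenshannon as js

def mean_js_dist_scalar(grp):
    # average pairwise Jensen-Shannon distance within one (firm, quarter) group
    prob = grp.iloc[:, 0:3].astype(float)
    dists = [js(prob.loc[i], prob.loc[j]) for i, j in combinations(grp.index, 2)]
    return np.mean(dists)

avg = (df.groupby(['firm', 'quarter_year'])
         .apply(mean_js_dist_scalar)
         .reset_index(name='avg_js'))
df = df.merge(avg, on=['firm', 'quarter_year'])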
I'm supposed to normalize an array. I've read about normalization and came across this formula (min-max scaling):
normalized_x = (x - min(x)) / (max(x) - min(x))
I wrote the following function for it:
def normalize_list(list):
    max_value = max(list)
    min_value = min(list)
    for i in range(0, len(list)):
        list[i] = (list[i] - min_value) / (max_value - min_value)
That is supposed to normalize an array of elements.
Then I have come across this: https://stackoverflow.com/a/21031303/6209399
Which says you can normalize an array by simply doing this:
def normalize_list_numpy(list):
    normalized_list = list / np.linalg.norm(list)
    return normalized_list
If I normalize this test array test_array = [1, 2, 3, 4, 5, 6, 7, 8, 9] with my own function and with the numpy method, I get these answers:
My own function: [0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]
The numpy way: [0.059234887775909233, 0.11846977555181847, 0.17770466332772769, 0.23693955110363693, 0.29617443887954614, 0.35540932665545538, 0.41464421443136462, 0.47387910220727386, 0.5331139899831830]
Why do the functions give different answers? Are there other ways to normalize an array of data? What does numpy.linalg.norm(list) do? What am I getting wrong?
There are different types of normalization. You are using min-max normalization. The min-max normalization from scikit-learn is as follows.
import numpy as np
from sklearn.preprocessing import minmax_scale

# your function
def normalize_list(list_normal):
    max_value = max(list_normal)
    min_value = min(list_normal)
    for i in range(len(list_normal)):
        list_normal[i] = (list_normal[i] - min_value) / (max_value - min_value)
    return list_normal

# scikit-learn version
def normalize_list_numpy(list_numpy):
    normalized_list = minmax_scale(list_numpy)
    return normalized_list
test_array = [1, 2, 3, 4, 5, 6, 7, 8, 9]
test_array_numpy = np.array(test_array)
print(normalize_list(test_array))
print(normalize_list_numpy(test_array_numpy))
Output:
[0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]
[0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]
MinMaxScaler (and the minmax_scale function used above) uses exactly your formula for normalization/scaling:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html
@OuuGiii: Note that it is not a good idea to use Python built-in function names as variable names. list() is a Python built-in, so its use as a variable name should be avoided.
The question/answer that you reference doesn't explicitly relate your own formula to the np.linalg.norm(list) version that you use here.
One NumPy solution would be this:
import numpy as np
def normalize(x):
    x = np.asarray(x)
    return (x - x.min()) / np.ptp(x)
print(normalize(test_array))
# [ 0. 0.125 0.25 0.375 0.5 0.625 0.75 0.875 1. ]
Here np.ptp is peak-to-peak, i.e.
Range of values (maximum - minimum) along an axis.
This approach scales the values to the interval [0, 1], as pointed out by @phg.
The more traditional definition of normalization would be to scale to zero mean and unit variance:
x = np.asarray(test_array)
res = (x - x.mean()) / x.std()
print(res.mean(), res.std())
# 0.0 1.0
Or use sklearn.preprocessing.scale as a pre-canned function for this (note that sklearn.preprocessing.normalize is something different: it rescales to unit norm rather than to zero mean and unit variance).
Using test_array / np.linalg.norm(test_array) creates a result that is of unit length; you'll see that np.linalg.norm(test_array / np.linalg.norm(test_array)) equals 1. So you're talking about two different fields here, one being statistics and the other being linear algebra.
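A short demo of the difference, assuming scikit-learn is available:
import numpy as np
from sklearn.preprocessing import scale

test_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)

# statistics sense: zero mean, unit variance
standardized = scale(test_array)
print(standardized.mean(), standardized.std())  # ~0.0 1.0

# linear algebra sense: unit length
unit = test_array / np.linalg.norm(test_array)
print(np.linalg.norm(unit))  # 1.0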
The power of NumPy is broadcasting, which allows you to perform vectorized array operations without explicit looping. So you do not need to write a function with an explicit for loop, which is slow, especially if your dataset is big.
The pythonic way of doing min-max normalization is
test_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
normalized_test_array = (test_array - min(test_array)) / (max(test_array) - min(test_array))
output >> [ 0., 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1. ]
I have a list with a series of random floats that go from negative to positive, like:
values = [0.001, 0.05, 0.09, 0.1, 0.4, 0.8, 0.9, 0.95, 0.99]
I want to find the index of the first value that meets a greater-than/less-than condition. For example, if I want the closest value less than 0.1 I would get an index of 2, and if I want the first value greater than 0.9 I'd get 7.
I have a find_nearest method that I am using, but since this dataset is randomized, it is not ideal.
EDIT: Figured out a solution.
low = len(values) - 1 - next(i for i, v in enumerate(reversed(values)) if v < 0.1)  # -> 2
high = next(i for i, v in enumerate(values) if v > 0.9)  # -> 7
If the values list gets long, you may want the bisect module from the standard library.
bisect_left and bisect_right can serve as the <, > tests:
import bisect
values = [0.001, 0.05, 0.09, 0.1, 0.4, 0.8, 0.9, 0.95, 0.99]
bisect.bisect_left(values, .1)
Out[226]: 3
bisect.bisect_right(values, .1)
Out[227]: 4
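Tying that back to the indices in the question (continuing the snippet above, and assuming values stays sorted in ascending order), the low/high positions could be read off as:
low = bisect.bisect_left(values, 0.1) - 1  # index of last value strictly below 0.1 -> 2
high = bisect.bisect_right(values, 0.9)    # index of first value strictly above 0.9 -> 7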
I have a NumPy array, let's say one with 4 rows and 6 columns (always an even number of columns):
m = np.round(np.random.rand(4, 6), 2)
array([[ 0.99, 0.48, 0.05, 0.26, 0.92, 0.44],
[ 0.81, 0.54, 0.19, 0.38, 0.5 , 0.02],
[ 0.11, 0.96, 0.04, 0.69, 0.78, 0.31],
[ 0.5 , 0.53, 0.94, 0.77, 0.6 , 0.75]])
I now want to plot graphs according to the column pairs, in this case
Graph 1: x-values=m[:,1] and y-values=m[:,0]
Graph 2: x-values=m[:,3] and y-values=m[:,2]
Graph 3: x-values=m[:,5] and y-values=m[:,4]
The first two columns are basically a pair of values, the next two are another pair of values and the last two also are a pair of values.
All the graphs should be in the same plot!
I need a general solution for plotting multiple graphs like this with an undefined but EVEN number of columns of the array. Something like a loop!
Hope somebody can help me :)
You can loop over all of the column pairs:
import matplotlib.pyplot as plt

i = 1
while i < len(m[0]):
    x = m[:, i]
    y = m[:, i-1]
    plt.plot(x, y)
    plt.savefig('placeholderName_%d.png' % i)
    plt.close()
    i = i + 2
Note that I'm starting at 1 and incrementing by two; this conforms to the example you presented.
This gives terrible results with the m array you specified, but if that was just a sample and your real data is more realistic, the following should do:
for i in range(m.shape[1] // 2):
    plt.figure()
    plt.plot(m[:, 2 * i], m[:, 2 * i + 1])
If you want all the plots on the same figure, just move the plt.figure() out of the loop:
plt.figure()
for i in range(m.shape[1] // 2):
    plt.plot(m[:, 2 * i], m[:, 2 * i + 1])
I have periodic data with the index being a floating point number like so:
from pandas import DataFrame

time = [0, 0.1, 0.21, 0.31, 0.40, 0.49, 0.51, 0.6, 0.71, 0.82, 0.93]
voltage = [1, -1, 1.1, -0.9, 1, -1, 0.9, -1.2, 0.95, -1.1, 1.11]
df = DataFrame(data=voltage, index=time, columns=['voltage'])
df.plot(marker='o')
I want to create a cross(df, y_val, direction='rise' | 'fall' | 'cross') function that returns an array of times (indexes) with all the interpolated points where the voltage values equal y_val. For 'rise' only the values where the slope is positive are returned; for 'fall' only the values with a negative slope are returned; for 'cross' both are returned. So if y_val=0 and direction='cross', then an array with 10 values would be returned with the x values of the crossing points (the first one being about 0.05).
I was thinking this could be done with an iterator but was wondering if there was a better way to do this.
Thanks. I'm loving Pandas and the Pandas community.
To do this I ended up with the following. It is a vectorized version which is 150x faster than one that uses a loop.
def cross(series, cross=0, direction='cross'):
    """
    Given a Series, return all the index values where the data values equal
    the 'cross' value.

    Direction can be 'rising' (for rising edge), 'falling' (for only falling
    edge), or 'cross' for both edges.
    """
    # Find if values are above or below the yvalue crossing:
    above = series.values > cross
    below = np.logical_not(above)
    left_shifted_above = above[1:]
    left_shifted_below = below[1:]

    # Find indexes on the left side of each crossing point
    if direction == 'rising':
        idxs = (left_shifted_above & below[0:-1]).nonzero()[0]
    elif direction == 'falling':
        idxs = (left_shifted_below & above[0:-1]).nonzero()[0]
    else:
        rising = left_shifted_above & below[0:-1]
        falling = left_shifted_below & above[0:-1]
        idxs = (rising | falling).nonzero()[0]

    # Calculate x crossings with interpolation using the formula for a line:
    x1 = series.index.values[idxs]
    x2 = series.index.values[idxs + 1]
    y1 = series.values[idxs]
    y2 = series.values[idxs + 1]
    x_crossings = (cross - y1) * (x2 - x1) / (y2 - y1) + x1

    return x_crossings
# Test it out:
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame

time = [0, 0.1, 0.21, 0.31, 0.40, 0.49, 0.51, 0.6, 0.71, 0.82, 0.93]
voltage = [1, -1, 1.1, -0.9, 1, -1, 0.9, -1.2, 0.95, -1.1, 1.11]
df = DataFrame(data=voltage, index=time, columns=['voltage'])

x_crossings = cross(df['voltage'])
y_crossings = np.zeros(x_crossings.shape)
plt.plot(time, voltage, '-ob', x_crossings, y_crossings, 'or')
plt.grid(True)
It was quite satisfying when this worked. Any improvements that can be made?