How to calculate mean signed error using pandas dataframes? - python

Let's say I have following dataframe:
import pandas as pd

df = pd.DataFrame([['x', 42, 50, 68, 12],
                   ['y', 51, 60, 79, 22],
                   ['z', 43, 50, 58, 12],
                   ['w', 46, 70, 88, 22],
                   ['xy', 38, 40, 69, 22],
                   ['xz', 39, 40, 49, 12]],
                  columns=['system', 'Experimental', 'Prediction1', 'Prediction2', 'Prediction3'])
How can I calculate signed error? I could not find any info regarding this at all.

I assume you want to calculate the mean signed deviation. If so, you can define your own function and then pass in your predictions to compare against the observed values.
You did not mention which column in df holds the observed values of the dependent variable, so I am assuming it's Experimental, and I'm comparing it against Prediction1.
# Define custom function
def msd(y_true, y_pred):
    return (y_pred - y_true).mean()

# Mean Signed Deviation of 'Experimental' and 'Prediction1'
msd(df['Experimental'], df['Prediction1'])
> 8.5
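If you want the same statistic for every prediction column at once, a small extension of the same idea is a dict comprehension over the Prediction* columns (a sketch using the df from the question):

```python
import pandas as pd

df = pd.DataFrame([['x', 42, 50, 68, 12],
                   ['y', 51, 60, 79, 22],
                   ['z', 43, 50, 58, 12],
                   ['w', 46, 70, 88, 22],
                   ['xy', 38, 40, 69, 22],
                   ['xz', 39, 40, 49, 12]],
                  columns=['system', 'Experimental', 'Prediction1', 'Prediction2', 'Prediction3'])

def msd(y_true, y_pred):
    return (y_pred - y_true).mean()

# mean signed deviation of each prediction column against 'Experimental'
errors = {col: msd(df['Experimental'], df[col])
          for col in df.columns if col.startswith('Prediction')}
# errors -> {'Prediction1': 8.5, 'Prediction2': 25.33..., 'Prediction3': -26.16...}
```

The sign tells you the direction of the bias: Prediction1 and Prediction2 overestimate on average, Prediction3 underestimates.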

Related

Read a text file from s3 Node.js - Missing some part at the beginning

I'm new to AWS and Node.js.
I wrote a 224x224x3 array into a text file and saved it in an S3 bucket. I need to read that text file in another Lambda function written in Node.js and assign the array to a variable.
But sometimes part of the array seems to be missing from the beginning.
Below is the Python code I used to store the array into a text file.
Please note that the returnArr is a 224x224x3 list.
with open("/tmp/out.txt", "w") as file:
    file.write(str(returnArr))  # the with block closes the file automatically

s3 = boto3.resource('s3')
BUCKET = "tempimagebucket1"
s3.Bucket(BUCKET).upload_file("/tmp/out.txt", "out.txt")
Below is the Node.js code I used to read the array stored in the text file.
const AWS = require('aws-sdk');
var s3 = new AWS.S3();

exports.handler = async (event, ctx, callback) => {
    var params1 = { Bucket: 'tempimagebucket1', Key: 'out.txt' };
    await s3.getObject(params1, function (err, data1) {
        if (err) {
            console.log(err);
        } else {
            console.log("array : ", data1.Body.toString()); // Line 1
            img = data1.Body.toString();
        }
    }).promise();
}
Issue - Sometimes Line 1 prints an incomplete array (part of the array is missing from the beginning).
Sometimes it gives the complete array successfully.
Please note that the array stored in the text file is always complete. So the issue should be in the Node.js Lambda function.
Some of the incomplete outputs are below (Line 1 output).
The array that is read differs between runs.
array : 57, 27], [92, 43, 13], [89, 41, 11], [90, 43, 14], [89, 44, 18], [79, 37, 15], [59, 26, 8], [48, 27, 11], [111, 66, 33], [118, 77, 40] ...
array : 72], [164, 120, 73], [153, 105, 61], [140, 89, 44], [142, 91, 43], [156, 107, 60], [156, 108, 62], [164, 116, 69], [171, 123, 77], [161, 112, 67], [160, 111, 67] ...
But it should be something like below - Output should be a complete 3d array
array : [[[107, 90, 72], [96, 79, 62], [86, 69, 52], [59, 43, 28], [43, 26, 15], [50, 32, 20], [58, 38, 21], [77, 46, 25], [81, 50, 22], [96, 65, 39]...
I can't figure out the reason for that.
Can someone please help me?
Thank you.
Your function and text file are correct. However, I see what you mean by the missing beginning:
this is just the console trimming the output. Press Load more and you will see the full text:

Spark UDF: Apply np.sum over a list of values in a data frame and filter values based on threshold

I'm very new to using Spark for data manipulation and UDFs. I have a sample df with different test scores; there are 50 columns like this one. I am trying to define a custom UDF to count the values in each row that are greater than 80.
test_scores
[65, 92, 96, 72, 70, 85, 72, 74, 79, 10, 82]
[59, 81, 91, 69, 66, 75, 65, 61, 71, 85, 69]
Below is what I am trying:
customfunc = udf(lambda val: np.sum(val > 30))
df2 = df.withColumn('scores', customfunc('test_scores'))
Getting the below error:
TypeError: '>' not supported between instances of 'tuple' and 'str'
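The TypeError suggests the lambda is not receiving a numeric array, and `np.sum(val > 30)` relies on NumPy semantics the UDF input may not have. One way around this, sketched below, is to write the counting logic as a plain Python function and only then wrap it as a typed UDF; the Spark wrapping is shown in comments and assumes `test_scores` is an array-of-integers column (names are illustrative, not from the original post):

```python
# Core logic as a plain function: count the scores above a threshold.
def count_above(scores, threshold=80):
    return sum(1 for s in scores if s > threshold)

print(count_above([65, 92, 96, 72, 70, 85, 72, 74, 79, 10, 82]))  # -> 4
print(count_above([59, 81, 91, 69, 66, 75, 65, 61, 71, 85, 69]))  # -> 3

# Hypothetical Spark wrapping (assumes 'test_scores' is ArrayType(IntegerType())):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import IntegerType
# customfunc = udf(count_above, IntegerType())
# df2 = df.withColumn('scores', customfunc('test_scores'))
```

Declaring the return type (IntegerType) matters: without it, Spark treats the UDF result as a string.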

scipy.stats.binned_statistic_dd() bin numbering has lots of extra bins

I'm struggling to deal with a scipy.stats.binned_statistic_dd() result. I have an array of positions and another array of ids that I'm binning in 3 directions. I'm providing a list of the bin edges as input rather than a number of bins in each direction coupled with a range option. I have 3 bins in x, 2 in y, and 3 in z, or 18 bins.
However, when I check the binnumbers listed, they are all in a range greater than 20. How do I get the bin numbers to reflect the number of bins provided and get rid of all the extra bins?
I've tried to follow what was suggested in this post (Output in scipy.stats.binned_statistic_dd()), which deals with something similar, but I can't understand how to apply it to my case. As usual, the documentation is as cryptic as ever.
Any help on getting my binnumbers between 1-18 in this example would be greatly appreciated!
import numpy as np
from scipy import stats

pos = np.array([[-0.02042167, -0.0223282 ,  0.00123734],
                [-0.0420364 ,  0.01196078,  0.00694259],
                [-0.09625651, -0.00311446,  0.06125461],
                [-0.07693234, -0.02749618,  0.03617278],
                [-0.07578646,  0.01199925,  0.02991888],
                [-0.03258293, -0.00371765,  0.04245596],
                [-0.06765955,  0.02798434,  0.07075846],
                [-0.02431445,  0.02774102,  0.06719837],
                [ 0.02798265, -0.01096739, -0.01658691],
                [-0.00584252,  0.02043389, -0.00827088],
                [ 0.00623063, -0.02642285,  0.03232817],
                [ 0.00884222,  0.01498996,  0.02912483],
                [ 0.07189474, -0.01541584,  0.01916607],
                [ 0.07239394,  0.0059483 ,  0.0740187 ],
                [-0.08519159, -0.02894125,  0.10923724],
                [-0.10803509,  0.01365444,  0.09555333],
                [-0.0442866 , -0.00845725,  0.10361843],
                [-0.04246779,  0.00396127,  0.1418258 ],
                [-0.08975861,  0.02999023,  0.12713186],
                [ 0.01772454, -0.0020405 ,  0.08824418]])
ids = np.array([16, 9, 6, 19, 1, 4, 10, 5, 18, 11, 2, 12, 13, 8, 3, 17, 14,
                15, 20, 7])
xbinEdges = np.array([-0.15298488, -0.05108961, 0.05080566, 0.15270093])
ybinEdges = np.array([-0.051, 0., 0.051])
zbinEdges = np.array([-0.053, 0.049, 0.151, 0.253])
ret = stats.binned_statistic_dd(pos, ids, bins=[xbinEdges, ybinEdges, zbinEdges],
                                statistic='count', expand_binnumbers=False)
bincounts = ret.statistic
binnumber = ret.binnumber.T
>>> binnumber = array([46, 51, 27, 26, 31, 46, 32, 52, 46, 51, 46, 51, 66, 72, 27, 32, 47,
                       52, 32, 47], dtype=int64)
ranges = [[-0.15298488071, 0.15270092971],
          [-0.051000000000000004, 0.051000000000000004],
          [-0.0530000000000001, 0.25300000000000006]]
ret3 = stats.binned_statistic_dd(pos, ids, bins=(3, 2, 3), statistic='count',
                                 expand_binnumbers=False, range=ranges)
bincounts = ret3.statistic
binnumber = ret3.binnumber.T
>>> binnumber = array([46, 51, 27, 26, 31, 46, 32, 52, 46, 51, 46, 51, 66, 72, 27, 32, 47,
                       52, 32, 47], dtype=int64)
Ok, after several days of background thinking and a quick scour through the binned_statistic_dd() source code, I think I've come to the correct answer, and it's pretty simple.
It seems binned_statistic_dd() adds an extra set of outlier bins in the binning phase and then removes them when returning the histogram results, but leaves the bin numbers untouched (I think this is in case you want to reuse the result for further statistics outputs).
So if you export the expanded bin numbers (expand_binnumbers=True) and then subtract 1 from each index to re-adjust for the outlier bins, you can calculate the "correct" bin ids.
ret2 = stats.binned_statistic_dd(pos, ids, bins=[xbinEdges, ybinEdges, zbinEdges],
                                 statistic='count', expand_binnumbers=True)
bincounts2 = ret2.statistic
binnumber2 = ret2.binnumber  # shape (3, N): one row of 1-based bin indices per dimension
indxnum2 = binnumber2 - 1    # shift to 0-based indices within the real (non-outlier) bins
numX, numY, numZ = 3, 2, 3   # number of bins in each direction
corrected_bin_ids = np.ravel_multi_index(indxnum2, (numX, numY, numZ))
Quick and simple in the end!
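To see why the subtract-then-ravel step works, here is a tiny illustrative example with made-up expanded bin indices (not the data from the question): with bins (3, 2, 3), the 1-based per-dimension indices shift to 0-based and ravel into flat ids 0-17.

```python
import numpy as np

# Expanded bin indices as binned_statistic_dd reports them with
# expand_binnumbers=True: one row per dimension, 1-based because
# index 0 is reserved for the "below range" outlier bin.
expanded = np.array([[1, 3],   # x-bin of each of two sample points
                     [2, 1],   # y-bin
                     [1, 2]])  # z-bin

# Subtract 1 to drop the outlier offset, then ravel to flat ids.
flat_ids = np.ravel_multi_index(expanded - 1, (3, 2, 3))
print(flat_ids)  # [ 3 13]
```

Point 1 sits at (x=0, y=1, z=0), so its flat id is 0*6 + 1*3 + 0 = 3; point 2 at (2, 0, 1) gives 2*6 + 0*3 + 1 = 13.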

Finding the closest to value in two datasets using a for loop

In MATLAB, I am able to identify the values in data_b that come closest to the values in data_a, along with the indices indicating where in the matrix they occur, using the following code:
clear all; close all; clc;

data_a = [0; 15; 30; 45; 60; 75; 90];
data_b = randi([0, 90], [180, 101]);
[rows_a, cols_a] = size(data_a);
[rows_b, cols_b] = size(data_b);
val1 = zeros(rows_a, cols_b);
ind1 = zeros(rows_a, cols_b);
for i = 1:cols_b
    for j = 1:rows_a
        [val1(j,i), ind1(j,i)] = min(abs(data_b(:,i) - data_a(j)));
    end
end
Since I would like to phase out MATLAB (I will be out of a license eventually), I decided to try the same in python, without any luck:
import numpy as np

data_a = np.array([[0], [15], [30], [45], [60], [75], [90]])
data_b = np.random.randint(91, size=(180, 101))
[rows_a, cols_a] = data_a.shape
[rows_b, cols_b] = data_b.shape
val1 = np.zeros((rows_a, cols_b))
ind1 = np.zeros((rows_a, cols_b))
for i in range(cols_b):
    for j in range(rows_a):
        [val1[j][i], ind1[j][i]] = np.amin(np.abs(data_b[:][i] - data_a[j]))
The code also produced an error that made me none the wiser:
TypeError: cannot unpack non-iterable numpy.int32 object
If anyone could find time to explain why I am an ignorant fool by indicating what I did wrong, and what I could do to fix it, I would be grateful, as this has proven to be a major obstacle to my progress.
Thank you.
I think you are facing two problems:
1. incorrect use of indexing for multidimensional arrays: use [i, j] instead of [i][j];
2. an improper translation of min() from MATLAB to NumPy: you have to use both argmin() and min().
Your fixed code would look like:
import numpy as np

# just to make it reproducible in testing; can be removed for production
np.random.seed(0)

data_a = np.array([[0], [15], [30], [45], [60], [75], [90]])
data_b = np.random.randint(91, size=(180, 101))
rows_a, cols_a = data_a.shape
rows_b, cols_b = data_b.shape
val1 = np.zeros((rows_a, cols_b), dtype=int)
ind1 = np.zeros((rows_a, cols_b), dtype=int)
for i in range(cols_b):
    for j in range(rows_a):
        diffs = np.abs(data_b[:, i] - data_a[j])
        ind1[j, i] = np.argmin(diffs)
        val1[j, i] = diffs[ind1[j, i]]
However, I would avoid direct looping here and I would make good use of broadcasting:
import numpy as np

# just to make it reproducible in testing; can be removed for production
np.random.seed(0)

data_a = np.arange(0, 90 + 1, 15).reshape((-1, 1, 1))   # shape (7, 1, 1)
data_b = np.random.randint(90 + 1, size=(1, 180, 101))  # shape (1, 180, 101)
tmp_arr = np.abs(data_a - data_b)  # broadcasts to shape (7, 180, 101)
min_idxs = np.argmin(tmp_arr, axis=1)
min_vals = np.min(tmp_arr, axis=1)
del tmp_arr  # optional: free the temporary array if you no longer need it
where now ind1 == min_idxs and val1 == min_vals, i.e.:
print(np.all(min_idxs == ind1))
# True
print(np.all(min_vals == val1))
# True
Your error has to do with [val1[j][i], ind1[j][i]] = (a single number): np.amin returns a single value, and you are trying to unpack it into two targets, which doesn't work in Python. What about this?
import numpy as np

data_a = np.array([[0], [15], [30], [45], [60], [75], [90]])
data_b = np.random.randint(91, size=(180, 101))
[rows_a, cols_a] = data_a.shape
[rows_b, cols_b] = data_b.shape
val1 = np.zeros((rows_a, cols_b))
ind1 = np.zeros((rows_a, cols_b))
for i in range(cols_b):
    for j in range(rows_a):
        # note: data_b[:][i] selects the i-th *row* (same as data_b[i]);
        # use data_b[:, i] for the i-th column, as in the MATLAB data_b(:,i)
        array = np.abs(data_b[:][i] - data_a[j])
        val = np.amin(array)
        val1[j][i] = val
        ind1[j][i] = np.where(val == array)[0][0]
np.amin does not return an index, so you need to recover it using np.where. This example does not store the full index, only the index of the first occurrence of the minimum in the row. You can then pull the value out, since the ordering in ind1 matches that of data_b. For instance, on the first iteration:
In [2]: np.abs(data_b[:][0] - data_a[0])
Out[2]:
array([ 3, 31, 19, 53, 28, 81, 10, 11, 89, 15, 50, 22, 40, 81, 43, 29, 63,
72, 22, 37, 54, 12, 19, 78, 85, 78, 37, 81, 41, 24, 29, 56, 37, 86,
67, 7, 38, 27, 83, 81, 66, 32, 68, 29, 71, 26, 12, 27, 45, 58, 17,
57, 54, 55, 23, 21, 46, 58, 75, 10, 25, 85, 70, 76, 0, 11, 19, 83,
81, 68, 8, 63, 72, 48, 18, 29, 0, 47, 85, 79, 72, 85, 28, 28, 7,
41, 80, 56, 59, 44, 82, 33, 42, 23, 42, 89, 58, 52, 44, 65, 65])
In [3]: np.amin(array)
Out[3]: 0
In [4]: val
Out[4]: 0
In [5]: np.where(val == array)[0][0]
Out[5]: 69
In [6]: data_b[0,69]
Out[6]: 0

Validating t-test results using Python scipy

I have a simple Python function:
from scipy.stats import ttest_1samp

def tTest(expectedMean, sampleSet, alpha=0.05):
    # T-value and P-value
    tv, pv = ttest_1samp(sampleSet, expectedMean)
    print(tv, pv)
    return pv >= alpha

if __name__ == '__main__':
    # Expected mean is 10
    print(tTest(10.0, [99, 99, 22, 77, 99, 55, 44, 33, 20, 9999, 99, 99, 99]))
My expectation is that the t-test should fail for this sample, as it is nowhere near the expected population mean of 10. However, the program produces this result:
1.0790344826428238 0.3017839504736506
True
I.e. the p-value is ~30%, which is far too high to reject the hypothesis. I am not very knowledgeable about the maths behind the t-test, but I don't understand how this result can be correct. Does anyone have any ideas?
I performed the test using R just to check if the results are the same and they are:
t.test(x = c(99, 99, 22, 77, 99, 55, 44, 33, 20, 9999, 99, 99, 99), alternative = "two.sided",
       mu = 10, paired = FALSE, var.equal = FALSE, conf.level = 0.95)

data:  c(99, 99, 22, 77, 99, 55, 44, 33, 20, 9999, 99, 99, 99)
t = 1.079, df = 12, p-value = 0.3018
alternative hypothesis: true mean is not equal to 10
95 percent confidence interval:
 -829.9978 2498.3055
sample estimates:
mean of x
 834.1538
You can see that the p-value is 0.3.
This is a really interesting problem; I have a lot of issues with hypothesis testing myself. First of all, the sample size has a big influence: with a large sample, say 5000 values, even minor deviations from the expected value will drive the p-value down, so you will reject the null hypothesis most of the time; small samples do the opposite.
What is happening here is that you have a very high variance in the data.
If you replace your data, from
[99, 99, 22, 77, 99, 55, 44, 33, 20, 9999, 99, 99, 99]
to
[99, 99, 99, 99, 100, 99, 99, 99, 99, 100, 99, 100, 100]
which has a really small variance, your p-value will be a lot smaller, even though the mean of this sample is probably closer to 10.
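You can see the variance effect directly from the one-sample t statistic, t = (x̄ - μ) / (s / √n): the distance from μ = 10 is divided by the standard error, so a huge standard deviation shrinks t no matter how far the mean is. A quick stdlib-only sketch (no scipy needed) comparing the two samples:

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, mu):
    # one-sample t statistic: (sample mean - mu) / standard error
    n = len(sample)
    return (mean(sample) - mu) / (stdev(sample) / sqrt(n))

noisy = [99, 99, 22, 77, 99, 55, 44, 33, 20, 9999, 99, 99, 99]
tight = [99, 99, 99, 99, 100, 99, 99, 99, 99, 100, 99, 100, 100]

print(t_statistic(noisy, 10))  # ~1.08: the huge variance swamps the distance from 10
print(t_statistic(tight, 10))  # in the hundreds: tiny variance, overwhelming evidence
```

The noisy sample reproduces the t = 1.079 reported by both scipy and R; the low-variance sample gives a t statistic hundreds of times larger, hence a vanishing p-value.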
