I am trying to subtract every element in the column from its mean and divide by the standard deviation. I did it in two different ways (numeric_data1 and numeric_data2):
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
numeric_data = data.drop("color", 1)
numeric_data1 = ((numeric_data - numeric_data.mean()) /
numeric_data.std())
numeric_data2 = ((numeric_data - np.mean(numeric_data, axis=0)) /
np.std(numeric_data, axis=0))
type(numeric_data1) # -> pandas.core.frame.DataFrame
type(numeric_data2) # -> pandas.core.frame.DataFrame
Both are pandas dataframes and they should have the same result. However, I get different results:
numeric_data2 == numeric_data1 # -> False
I think the problem stems from how numpy and pandas handle numeric precision:
numeric_data.mean() == np.mean(numeric_data, axis=0) # -> True
numeric_data.std(axis=0) == np.std(numeric_data, axis=0) # -> False
For mean numpy and pandas gave me the same thing, but for standard deviation, I got little different results.
Is my assessment correct or am I making some blunder?
When calculating the standard deviation it matters whether you are estimating the standard deviation of an entire population with a smaller sample of that population or are you calculating the standard deviation of the entire population.
If it is a smaller sample of a larger population, you need what is called the sample standard deviation. As it turns out, when you divide the sum of squared differences from the mean by the number of observations, you end up with a biased estimator. We correct for that by dividing by one less than the number of observations. We control for this with the argument ddof=1 for sample standard deviation or ddof=0 for population standard deviation.
Truth is, it doesn't matter much if your sample size is large. But you will see small differences.
Use the degrees of freedom argument in your pandas.DataFrame.std call:
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
numeric_data = data.drop("color", 1)
numeric_data1 = ((numeric_data - numeric_data.mean()) /
numeric_data.std(ddof=0)) # <<<
numeric_data2 = ((numeric_data - np.mean(numeric_data, axis=0)) /
np.std(numeric_data, axis=0))
np.isclose(numeric_data1, numeric_data2).all() # -> True
Or in the np.std call:
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
numeric_data = data.drop("color", 1)
numeric_data1 = ((numeric_data - numeric_data.mean()) /
numeric_data.std())
numeric_data2 = ((numeric_data - np.mean(numeric_data, axis=0)) /
np.std(numeric_data, axis=0, ddof=1)) # <<<
np.isclose(numeric_data1, numeric_data2).all() # -> True
Related
I'm trying to create some material for introductory statistics for a seminar. The above code computes a 95% confidence interval for estimating the mean, but the result is not the same from the one implemented in Python. Is there something wrong with my math / code? Thanks.
EDIT:
Data was sampled from here
import pandas as pd
import numpy as np
x = np.random.normal(60000,15000,200)
income = pd.DataFrame()
income = pd.DataFrame()
income['Data Scientist'] = x
# Manual Implementation
sample_mean = income['Data Scientist'].mean()
sample_std = income['Data Scientist'].std()
standard_error = sample_std / (np.sqrt(income.shape[0]))
print('Mean',sample_mean)
print('Std',sample_std)
print('Standard Error',standard_error)
print('(',sample_mean-2*standard_error,',',sample_mean+2*standard_error,')')
# Python Library
import scipy.stats as st
se = st.sem(income['Data Scientist'])
a = st.t.interval(0.95, len(income['Data Scientist'])-1, loc=sample_mean, scale=se)
print(a)
print('Standard Error from this code block',se)
You've got 2 errors.
First, you are using 2 for the multiplier for the CI. The more accurate value is 1.96. "2" is just a convenient estimator. That is making your CI generated manually too fat.
Second, you are comparing a normal distribution to the t-distribution. This probably isn't causing more than decimal dust in difference because you have 199 degrees of freedom for the t-dist, which is basically the normal.
Below is the z-score of 1.96 and computation of CI with apples-to-apples comparison to the norm distribution vs. t.
In [45]: st.norm.cdf(1.96)
Out[45]: 0.9750021048517795
In [46]: print('(',sample_mean-1.96*standard_error,',',sample_mean+1.96*standard_error,')')
( 57558.007862202685 , 61510.37559873406 )
In [47]: st.norm.interval(0.95, loc=sample_mean, scale=se)
Out[47]: (57558.044175045005, 61510.33928589174)
The problem was to create a normal distribution with mean 32 and standard deviation 4.5, setting the random seed to 1, and create a random sample of 100 elements from the above defined distribution.Finally, compute the absolute difference between the sample mean and the distribution mean.
This is some of the beginner stats problems in the course. I have had experience in Python but not in stats.
x = stats.norm(loc=32,state=4.5)
y = np.random.seed(1)
mean1 = np.mean(x)
mean2 = np.mean(y)
diff = abs(mean1 - mean2)
The error I've been encountering is x has a frozen value and y has a value of None.
random.seed(1) sets the state of the pseudorandom numbers generator so that every run of this script will give the same output - and give identical results for all students...
You need to execute this before generating your random numbers. The seed function doesn't have anything to return, so it return None. This is the default return value in Python for functions that don't return anything specific.
Then you create your sample of size 100, and calculate its mean. As it is a sample, its mean will differ from the mean of the distribution (32): we calculate the absolute difference between these means.
You can experiment with different sample sizes, and see how the difference tends towards 0 when the size of the sample grows - you'll learn more about it in your course!
from scipy.stats import norm
import numpy as np
np.random.seed(1)
distribution_mean = 32
sample = norm.rvs(loc=distribution_mean, scale=4.5, size=100)
sample_mean = np.mean(sample)
print('sample:', sample)
print('sample mean:', sample_mean)
abs_diff = abs(sample_mean - distribution_mean)
print('absolute difference:', abs_diff)
Output:
sample: [39.30955414 29.24709614 29.62322711 27.1716412 35.89433433 21.64307586
39.85165294 28.57456895 33.43567593 30.87783331 38.57948572 22.72936681
30.54912258 30.2717554 37.10196249 27.0504893 31.22407307 28.04963712
32.18996186 34.62266846 27.0472137 37.15125669 36.05715824 34.26122453
36.05385177 28.92322463 31.44699399 27.78903755 30.79450364 34.3865996
28.88752662 30.21460913 28.90772285 28.19657461 28.97939241 31.9430093
26.97210343 33.05487064 39.4691098 35.33919872 31.13674001 28.00566966
28.63778768 39.6160457 32.2286349 29.13351959 32.85911968 41.45114811
32.54071529 34.77741399 33.35076644 30.41487569 26.85866811 30.42795775
31.05997595 34.63980436 35.77542536 36.18995937 33.28514296 35.98313524
28.60520927 37.6379067 34.30818419 30.65858224 34.19833166 31.65992729
37.09233224 38.83917567 41.83508933 25.71576649 25.50148788 29.72990362
32.72016681 35.94276015 33.42035726 22.90009453 30.62208194 35.72588589
33.03542631 35.42905031 30.99952336 31.09658869 32.83952626 33.84523241
32.89234874 32.53553891 28.98201971 33.69903704 32.54819572 37.08267759
37.39513046 32.83320388 30.31121772 29.12571317 33.90572459 32.34803031
30.45265846 32.19618586 29.2099962 35.14114415]
sample mean: 32.27262283434065
absolute difference: 0.2726228343406518
Im using the following code to scale values from one interval to another. "outputs_max", "outputs_min" are numpy arrays, so are (as a result) "slope" and "intercept".
For higher clarity when displaying the result "scaled_outputs", I used pandas to create a DataFrame of the file "out.npy" which I called "output_array". The resulting array "scaled_outputs" hence is displayed in a DataFrame too and later on stored as a numpy file.
import pandas as pd
import numpy as np
output_file = np.load("U:\\out.npy")
output_array = pd.DataFrame(output_file)
desired_upper_bound = 1
desired_lower_bound = 0
slope = (desired_upper_bound - desired_lower_bound) / (outputs_max - outputs_min)
intercept = desired_upper_bound - (slope * rounded_outputs_max)
scaled_outputs = slope * output_array + intercept
np.save("U:\\scaled_outputs.npy", scaled_outputs)
Am I losing accuracy of the values by creating a DataFrame and passing it into the equation? Would it be better to pass the numpy array "output_file" and creating a DataFrame of "scaled_outputs"?
The result in the console is displayed with 5 decimals at max, which is why I'm asking.
No, you're not losing precision or accuracy by using a dataframe in your equation. What you're seeing on the console is a result of display precision. You can change the display.precision property to see more digits when a dataframe is displayed.
pandas.set_option("display.precision", 10)
When doing some estimations, calculations, and other fun stuff in Python I came across something really weird and upsetting.
I have this thing where I estimate some parameters using ML-estimation, and have previously assumed that everything was peachy and fine. I read csv-data with pandas, and use the subsequent data for the estimation. Therefore, the data has originally been passed down to the ML-estimation function as Pandas Series. Today I wanted to try some matrix-operations on a thing in the calculation for kicks-and-giggles, and converted the input-data to numpy arrays. However, when I ran the code, the estimation results were different. After restoring some of the multiplications, it was still different. Then I changed back to using Pandas series, and it returned to the previously expected result.
This is where I got curious and now turn to you. Is it so that there is a rounding error between float64 Numpy arrays and float64 Pandas Series so different that when doing my calculations, they get so drastically different?
Consider the following code-example containing a sample from my ML-estimator
import pandas as pd
import numpy as np
import math
values = [3.41527085753, 3.606855606852, 3.5550625070226231, 3.680327020956565, \
3.30270511221, 3.704752803295, 3.6307205395804001, 3.200863997609199, \
2.90599272353, 3.555062501231, 2.8047528032711295, 3.415270760685753, \
3.50599277872, 3.445622506242, 3.3047528084632258, 3.219431344191376, \
3.68032756565, 3.451542245654, 3.2244456543387564, 2.999848273256456]
Ps = pd.Series(values, dtype=np.float64)
Narr = np.array(values, dtype=np.float64)
def getLambda(S, delta = 1/255):
n = len(S) - 1
Sx = sum( S[0:-1] )
Sy = sum( S[1:] )
Sxx = sum( S[0:-1]**2 )
Sxy = sum( S[0:-1]*S[1:] )
mu = (Sy*Sxx - Sx*Sxy) / ( n*(Sxx - Sxy) - (Sx**2 - Sx*Sy) )
lambd = np.log((Sxy - mu*Sx - mu*Sy + n*mu**2) / (Sxx -2*mu*Sx + n*mu**2) )/ delta
a = math.exp(-lambd*delta)
return mu, a, lambd
print("Numpy Array calculation gives me mu = {}, alpha = {} and Lambda = {}".format(getLambda(Narr)[0], getLambda(Narr)[1], getLambda(Narr)[2]))
print("Pandas Series calculation gives me mu = {}, alpha = {} and Lambda = {}".format(getLambda(Ps)[0], getLambda(Ps)[1], getLambda(Ps)[2]))
The values are just some random value picked from my original data in my larger project.
This will, atleast for me, print:
>> Numpy Array calculation gives me mu = 3.378432651661709, alpha = 102.09644146650535 and Lambda = -1179.6090571432392
>> Pandas Series calculation gives me mu = 3.3981019891871247, alpha = nan and Lambda = nan
The procedure, method, and original data is identical, and it still gets a difference of about 0.019669 already in the calculation of mu, which is for me really weird and upsetting.
If this is due to difference (keep in mind that I explicity stated that it should be float64 in both cases) in rounding between the two methods of handling data its weird as this just makes me question which and why I should use any of them. Otherwise, there has to be a bug in one of them? Or is there a third alternative which explains everything and was something that I did not know of to begin with?
I'm trying to use some Time Series Analysis in Python, using Numpy.
I have two somewhat medium-sized series, with 20k values each and I want to check the sliding correlation.
The corrcoef gives me as output a Matrix of auto-correlation/correlation coefficients. Nothing useful by itself in my case, as one of the series contains a lag.
The correlate function (in mode="full") returns a 40k elements list that DO look like the kind of result I'm aiming for (the peak value is as far from the center of the list as the Lag would indicate), but the values are all weird - up to 500, when I was expecting something from -1 to 1.
I can't just divide it all by the max value; I know the max correlation isn't 1.
How could I normalize the "cross-correlation" (correlation in "full" mode) so the return values would be the correlation on each lag step instead those very large, strange values?
You are looking for normalized cross-correlation. This option isn't available yet in Numpy, but a patch is waiting for review that does just what you want. It shouldn't be too hard to apply it I would think. Most of the patch is just doc string stuff. The only lines of code that it adds are
if normalize:
a = (a - mean(a)) / (std(a) * len(a))
v = (v - mean(v)) / std(v)
where a and v are the inputted numpy arrays of which you are finding the cross-correlation. It shouldn't be hard to either add them into your own distribution of Numpy or just make a copy of the correlate function and add the lines there. I would do the latter personally if I chose to go this route.
Another, quite possibly better, alternative is to just do the normalization to the input vectors before you send it to correlate. It's up to you which way you would like to do it.
By the way, this does appear to be the correct normalization as per the Wikipedia page on cross-correlation except for dividing by len(a) rather than (len(a)-1). I feel that the discrepancy is akin to the standard deviation of the sample vs. sample standard deviation and really won't make much of a difference in my opinion.
According to this slides, I would suggest to do it this way:
def cross_correlation(a1, a2):
lags = range(-len(a1)+1, len(a2))
cs = []
for lag in lags:
idx_lower_a1 = max(lag, 0)
idx_lower_a2 = max(-lag, 0)
idx_upper_a1 = min(len(a1), len(a1)+lag)
idx_upper_a2 = min(len(a2), len(a2)-lag)
b1 = a1[idx_lower_a1:idx_upper_a1]
b2 = a2[idx_lower_a2:idx_upper_a2]
c = np.correlate(b1, b2)[0]
c = c / np.sqrt((b1**2).sum() * (b2**2).sum())
cs.append(c)
return cs
For a full mode, would it make sense to compute corrcoef directly on the lagged signal/feature? Code
from dataclasses import dataclass
from typing import Any, Optional, Sequence
import numpy as np
ArrayLike = Any
#dataclass
class XCorr:
cross_correlation: np.ndarray
lags: np.ndarray
def cross_correlation(
signal: ArrayLike, feature: ArrayLike, lags: Optional[Sequence[int]] = None
) -> XCorr:
"""
Computes normalized cross correlation between the `signal` and the `feature`.
Current implementation assumes the `feature` can't be longer than the `signal`.
You can optionally provide specific lags, if not provided `signal` is padded
with the length of the `feature` - 1, and the `feature` is slid/padded (creating lags)
with 0 padding to match the length of the new signal. Pearson product-moment
correlation coefficients is computed for each lag.
See: https://en.wikipedia.org/wiki/Cross-correlation
:param signal: observed signal
:param feature: feature you are looking for
:param lags: optional lags, if not provided equals to (-len(feature), len(signal))
"""
signal_ar = np.asarray(signal)
feature_ar = np.asarray(feature)
if np.count_nonzero(feature_ar) == 0:
raise ValueError("Unsupported - feature contains only zeros")
assert (
signal_ar.ndim == feature_ar.ndim == 1
), "Unsupported - only 1d signal/feature supported"
assert len(feature_ar) <= len(
signal
), "Unsupported - signal should be at least as long as the feature"
padding_sz = len(feature_ar) - 1
padded_signal = np.pad(
signal_ar, (padding_sz, padding_sz), "constant", constant_values=0
)
lags = lags if lags is not None else range(-padding_sz, len(signal_ar), 1)
if np.max(lags) >= len(signal_ar):
raise ValueError("max positive lag must be shorter than the signal")
if np.min(lags) <= -len(feature_ar):
raise ValueError("max negative lag can't be longer than the feature")
assert np.max(lags) < len(signal_ar), ""
lagged_patterns = np.asarray(
[
np.pad(
feature_ar,
(padding_sz + lag, len(signal_ar) - lag - 1),
"constant",
constant_values=0,
)
for lag in lags
]
)
return XCorr(
cross_correlation=np.corrcoef(padded_signal, lagged_patterns)[0, 1:],
lags=np.asarray(lags),
)
Example:
signal = [0, 0, 1, 0.5, 1, 0, 0, 1]
feature = [1, 0, 0, 1]
xcorr = cross_correlation(signal, feature)
assert xcorr.lags[xcorr.cross_correlation.argmax()] == 4