Expected mean of correlated data in Python

I have successfully generated three correlated random variables with a Cholesky decomposition. I use the same mean (10) and the same standard deviation (5) for all of them. However, when I calculate the means of the correlated variables, I get unexpected results, and I can't figure out where exactly the problem is. Here is working code:
import numpy as np
import pandas as pd
corr = np.array([[1,0.7,0.7], [0.7,1,0.7],[0.7,0.7,1]])
chol = np.linalg.cholesky(corr)
N=1000
rand_data = np.random.normal(10, 5, size=(3,N))
# generate uncorrelated data
uncorrelated_data = pd.DataFrame(rand_data, index=['A','B','C']).T/100
uncorrelated_data.corr() # shows barely any correlation as it should
uncorrelated_data.mean()*100 # shows each mean around 10
Output
A 10.308595
B 9.931958
C 10.165347
Generating correlation among them
x = np.dot(chol, rand_data) # cholesky
correlated_data = pd.DataFrame(x, index=['A','B','C']).T/100
print(correlated_data.corr()) # shows there are correlations among variable
correlated_data.mean()*100 # the means keep increasing across the variables
Output:
A 10.308595
B 14.308853
C 16.752117
The means of the uncorrelated variables were as expected but the mean of the correlated variables keeps increasing from the first variable to the last variable. My expectation is that each mean will be around the actual mean. Please could my noble seniors help me figure out the problem or suggest an alternative solution?
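One possible explanation, offered as a sketch rather than a definitive diagnosis: np.dot(chol, rand_data) mixes variables that already have mean 10, so each output mean becomes 10 times the corresponding row sum of chol (roughly 10, 14.1 and 16.4 for this matrix), which is close to the drift shown above. A common workaround is to apply the Cholesky factor to standard normal draws and only then scale and shift. The names rng, z and drifted_means, the seed, and the larger N are illustrative choices, not part of the original code.

```python
import numpy as np

corr = np.array([[1, 0.7, 0.7], [0.7, 1, 0.7], [0.7, 0.7, 1]])
chol = np.linalg.cholesky(corr)
mean, std, N = 10.0, 5.0, 100_000

# Why the original means drift: E[chol @ data] = chol @ E[data],
# so each mean is 10 times a row sum of the lower-triangular factor.
drifted_means = mean * chol.sum(axis=1)
print(drifted_means)

# Workaround: correlate zero-mean, unit-variance draws first,
# then apply the desired standard deviation and mean.
rng = np.random.default_rng(0)  # seed chosen arbitrarily for the sketch
z = rng.standard_normal(size=(3, N))
x = mean + std * (chol @ z)

print(x.mean(axis=1))   # each close to 10
print(np.corrcoef(x))   # off-diagonal entries close to 0.7
```

Because each row of chol has unit Euclidean norm for a correlation matrix, the standard deviation of every variable stays at 5 with this construction.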

Related

Numpy random draw gives different results

I am trying to draw 20 data points from a random distribution. It should be straightforward, but the two methods below give different results, even though the seed is the same.
Method 1
np.random.seed(1)
x1 = np.random.uniform(low=-10, high=10, size=20)
y1 = np.random.uniform(low=20, high=80, size=20)
The result is:
x1 = [-1.65955991 4.40648987 -9.9977125 -3.95334855 -7.06488218 -8.1532281
-6.27479577 -3.08878546 -2.06465052 0.77633468 -1.61610971 3.70439001
-5.91095501 7.56234873 -9.45224814 3.4093502 -1.65390395 1.17379657
-7.19226123 -6.03797022]
and
y1 = [68.04467412 78.09569454 38.80545069 61.53935694 72.58334914 73.67639981
25.10265268 22.34328699 30.18982517 72.68855021 25.90081003 45.2664575
77.47337181 51.9899171 61.51262684 38.93093786 61.19005566 70.07754031
21.09729664 65.0086589 ]
Method 2
N = 20
np.random.seed(1)
points = [(np.random.uniform(-10,10), np.random.uniform(20,80)) for i in range(N)]
The result is:
[(-1.6595599059485195, 63.219469606529486), (-9.997712503653101, 38.13995435791038), (-7.064882183657739, 25.54031568612787), (-6.274795772446582, 40.73364362258286), (-2.0646505153866013, 52.32900404020142), (-1.6161097119341044, 61.11317002380557), (-5.910955005369651, 72.68704618345672), (-9.452248136041476, 60.22805061070413), (-1.6539039526574602, 53.5213897067451), (-7.192261228095324, 31.886089345092728), (6.014891373510732, 78.09569454316386), (-3.7315164368151432, 61.53935694015885), (7.527783045920767, 73.67639981023083), (-8.299115772604441, 22.343286993972942), (-6.6033916087086215, 72.68855020576478), (-8.033063323338999, 45.26645750030313), (9.157790603010039, 51.98991709838103), (3.837542279009467, 38.93093786036378), (3.7300185536316732, 70.07754031384238), (-9.634234453116164, 65.00865889669805)]
Could anyone help with explaining the difference?
The first method first generates 20 numbers from the first distribution, followed by 20 numbers from the second distribution. In the second method, you alternate the distribution from which numbers are being generated. These methods do not generate corresponding numbers in the same order, so you should not expect to get the same results. Each time you generate a random number, the internal state of the generator changes, and that change affects all subsequent invocations of the generator, regardless of whether it is applied to the same distribution. numpy.random methods all use the same global internal state.
As an aside, NumPy recommends the use of numpy.random.Generator methods instead of numpy.random or numpy.random.RandomState methods (see here).
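The ordering argument above can be checked with a short sketch. It assumes the legacy np.random.RandomState interface (the one behind np.random.seed) and uses illustrative names u, x2 and y2: both methods consume the same underlying stream of uniform numbers, only in a different order.

```python
import numpy as np

N = 20

# The raw stream of 40 uniform numbers in [0, 1) produced after seeding with 1.
u = np.random.RandomState(1).uniform(size=2 * N)

# Method 1: 20 draws for x, then 20 draws for y.
rs = np.random.RandomState(1)
x1 = rs.uniform(low=-10, high=10, size=N)
y1 = rs.uniform(low=20, high=80, size=N)

# Method 2: alternate between the two distributions.
rs = np.random.RandomState(1)
points = [(rs.uniform(-10, 10), rs.uniform(20, 80)) for _ in range(N)]
x2 = np.array([p[0] for p in points])
y2 = np.array([p[1] for p in points])

# Method 1 uses the first half of the stream for x and the second half for y.
print(np.allclose(x1, -10 + 20 * u[:N]), np.allclose(y1, 20 + 60 * u[N:]))
# Method 2 uses even positions for x and odd positions for y.
print(np.allclose(x2, -10 + 20 * u[0::2]), np.allclose(y2, 20 + 60 * u[1::2]))
```

This is why the first x value agrees in both methods while everything after it differs.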

Numerical vs. categorical vars: Why 100% correlation for categorical variable with high cardinality?

I am new to data science and trying to get a grip on exploratory data analysis. My goal is to get a correlation matrix between all the variables. For numerical variables I use Pearson's R, for categorical variables I use the corrected Cramer's V. The issue now is to get a meaningful correlation between categorical and numerical variables. For that I use the correlation ratio, as outlined here. The issue with that is that categorical variables with high cardinality show a high correlation no matter what:
correlation matrix cat vs. num
This seems nonsensical, since the result practically reflects the cardinality of the categorical variable rather than its correlation with the numerical variable. The question is: how can I deal with this issue in order to get a meaningful correlation?
The Python code below shows how I implemented the correlation ratio:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
train = pd.DataFrame({
'id': [0,1,2,3,4,5,6,7,8,9,10,11], 'num3': [6,3,3,9,6,9,9,3,6,3,6,9],
'cat2': [0,1,0,1,0,1,0,1,0,1,0,1], 'cat3': [0,1,2,0,1,2,0,1,2,0,1,2],
'cat6': [0,4,8,2,6,10,0,4,8,2,6,10], 'cat12': [0,7,2,9,4,11,6,1,8,3,10,5],
})
cat_cols, num_cols = ['cat2','cat3','cat6','cat12'], ['id','num3']
def corr_ratio(cats, nums):
    avgtotal = nums.mean()
    elements_avg, elements_count = np.zeros(len(cats.index)), np.zeros(len(cats.index))
    cu = cats.unique()
    for i in range(cu.size):
        cn = cu[i]
        filt = cats == cn
        elements_count[i] = filt.sum()
        elements_avg[i] = nums[filt].mean(axis=0)
    numerator = np.sum(np.multiply(elements_count, np.power(np.subtract(elements_avg, avgtotal), 2)))
    denominator = np.sum(np.power(np.subtract(nums, avgtotal), 2)) # total variance
    return 0.0 if numerator == 0 else np.sqrt(numerator / denominator)
rows = []
for cat in cat_cols:
    col = []
    for num in num_cols:
        col.append(round(corr_ratio(train[cat], train[num]), 2))
    rows.append(col)
df = pd.DataFrame(np.array(rows), columns=num_cols, index=cat_cols)
sns.heatmap(df)
plt.tight_layout()
plt.show()
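To illustrate why high cardinality pushes the correlation ratio towards 1, here is a minimal sketch using the same num3, cat2 and cat12 values as the question. The groupby-based corr_ratio below is an assumed equivalent rewrite, not the original implementation. When every category contains a single row, each group mean equals the observation itself, so the between-group sum of squares equals the total sum of squares.

```python
import numpy as np
import pandas as pd

def corr_ratio(cats, nums):
    """Correlation ratio (eta): sqrt(between-group SS / total SS)."""
    groups = nums.groupby(cats)
    grand_mean = nums.mean()
    between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
    total = ((nums - grand_mean) ** 2).sum()
    return 0.0 if total == 0 else float(np.sqrt(between / total))

num3 = pd.Series([6, 3, 3, 9, 6, 9, 9, 3, 6, 3, 6, 9])
cat2 = pd.Series([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
cat12 = pd.Series([0, 7, 2, 9, 4, 11, 6, 1, 8, 3, 10, 5])

eta_cat2 = corr_ratio(cat2, num3)    # both groups have mean 6, so eta is 0
eta_cat12 = corr_ratio(cat12, num3)  # one row per category, so eta is 1
print(eta_cat2, eta_cat12)
```

With 12 rows and 12 distinct categories the ratio is 1 by construction, whatever the numerical column contains.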
I think this could be because you are visualising something more closely related to chi-2 in your seaborn plot. Cramer's V is a number derived from chi-2 but is not equivalent to it. This means you could have a high value for a specific cell but a more relevant value for Cramer's V. I'm not even sure it makes sense to compare raw modality values, because they could be on totally different orders of magnitude.
Chi 2 formula
Cramer's V formula
If I am not mistaken, there is another method called Theil’s U. How about trying this out and see if the same problem will occur?
You can use this:
# assumption: associations comes from the dython package
from dython.nominal import associations
num_cols = your_df.select_dtypes(include=['number']).columns.to_list()
cat_target_cols = your_df.select_dtypes(include=['object']).columns.to_list()
corr_df = pd.DataFrame(associations(dataset=your_df, numerical_columns=num_cols, nom_nom_assoc='theil', figsize=(20, 20), nominal_columns=cat_target_cols).get('corr'))

How to fix: frozen and None value

The problem was to create a normal distribution with mean 32 and standard deviation 4.5, set the random seed to 1, and create a random sample of 100 elements from that distribution. Finally, compute the absolute difference between the sample mean and the distribution mean.
This is one of the beginner stats problems in the course. I have experience in Python but not in stats.
x = stats.norm(loc=32,state=4.5)
y = np.random.seed(1)
mean1 = np.mean(x)
mean2 = np.mean(y)
diff = abs(mean1 - mean2)
The error I've been encountering is x has a frozen value and y has a value of None.
np.random.seed(1) sets the state of the pseudorandom number generator so that every run of this script gives the same output, and gives identical results for all students.
You need to execute this before generating your random numbers. The seed function doesn't have anything to return, so it returns None. This is the default return value in Python for functions that don't return anything specific.
Then you create your sample of size 100, and calculate its mean. As it is a sample, its mean will differ from the mean of the distribution (32): we calculate the absolute difference between these means.
You can experiment with different sample sizes, and see how the difference tends towards 0 when the size of the sample grows - you'll learn more about it in your course!
from scipy.stats import norm
import numpy as np
np.random.seed(1)
distribution_mean = 32
sample = norm.rvs(loc=distribution_mean, scale=4.5, size=100)
sample_mean = np.mean(sample)
print('sample:', sample)
print('sample mean:', sample_mean)
abs_diff = abs(sample_mean - distribution_mean)
print('absolute difference:', abs_diff)
Output:
sample: [39.30955414 29.24709614 29.62322711 27.1716412 35.89433433 21.64307586
39.85165294 28.57456895 33.43567593 30.87783331 38.57948572 22.72936681
30.54912258 30.2717554 37.10196249 27.0504893 31.22407307 28.04963712
32.18996186 34.62266846 27.0472137 37.15125669 36.05715824 34.26122453
36.05385177 28.92322463 31.44699399 27.78903755 30.79450364 34.3865996
28.88752662 30.21460913 28.90772285 28.19657461 28.97939241 31.9430093
26.97210343 33.05487064 39.4691098 35.33919872 31.13674001 28.00566966
28.63778768 39.6160457 32.2286349 29.13351959 32.85911968 41.45114811
32.54071529 34.77741399 33.35076644 30.41487569 26.85866811 30.42795775
31.05997595 34.63980436 35.77542536 36.18995937 33.28514296 35.98313524
28.60520927 37.6379067 34.30818419 30.65858224 34.19833166 31.65992729
37.09233224 38.83917567 41.83508933 25.71576649 25.50148788 29.72990362
32.72016681 35.94276015 33.42035726 22.90009453 30.62208194 35.72588589
33.03542631 35.42905031 30.99952336 31.09658869 32.83952626 33.84523241
32.89234874 32.53553891 28.98201971 33.69903704 32.54819572 37.08267759
37.39513046 32.83320388 30.31121772 29.12571317 33.90572459 32.34803031
30.45265846 32.19618586 29.2099962 35.14114415]
sample mean: 32.27262283434065
absolute difference: 0.2726228343406518
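The suggestion above to experiment with different sample sizes could look like the following sketch. It uses numpy.random.Generator instead of the global seed, so the numbers will not match the output above; the sample sizes and the diffs dictionary are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
distribution_mean, distribution_std = 32.0, 4.5

# Absolute difference between sample mean and distribution mean per sample size.
diffs = {}
for n in (100, 10_000, 1_000_000):
    sample = rng.normal(loc=distribution_mean, scale=distribution_std, size=n)
    diffs[n] = abs(sample.mean() - distribution_mean)
    print(n, diffs[n])
```

The typical size of the difference shrinks roughly like 4.5 / sqrt(n), although any single run can deviate from that trend.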

How can I create a function from this data?

I have a dataset in the form of a table:
Score Percentile
381 1
382 2
383 2
...
569 98
570 99
The complete table is here as a Google spreadsheet.
Currently, I am computing a score and then doing a lookup on this dataset (table) to find the corresponding percentile rank.
Is it possible to create a function to calculate the corresponding percentile rank for a given score using a formula instead of looking it up in the table?
It's impossible to recreate the function that generated a given table of data, if no information is provided about the process behind that data.
That being said, we can make some speculation.
Since it's a "percentile" function, it probably represents the cumulative value of a probability distribution of some sort. A very common probability distribution is the normal distribution, whose cumulative distribution function (i.e. the integral of its density) can be written in terms of the so-called "error function" ("erf").
In fact, your tabulated data looks a lot like an error function for a variable whose average value is 473.09:
your dataset: orange; fitted error function (erf): blue
However, the agreement is not perfect, and that could be for three reasons:
1) the fitting procedure I've used to generate the parameters for the error function didn't use the right constraints (because I have no idea what I'm modelling!)
2) your dataset doesn't represent an exact normal distribution, but rather real world data whose underlying distribution is the normal distribution. The features of your sample data that deviate from the model are being ignored altogether.
3) the underlying distribution is not a normal distribution at all, its integral just happens to look like the error function by chance.
There is literally no way for me to tell!
If you want to use this function, this is its definition:
import numpy as np
from scipy.special import erf
def fitted_erf(x):
    c = 473.09090474
    w = 37.04826334
    return 50+50*erf((x-c)/(w*np.sqrt(2)))
Tests:
In [2]: fitted_erf(439) # 17 from the table
Out[2]: 17.874052406601457
In [3]: fitted_erf(457) # 34 from the table
Out[3]: 33.20270318344252
In [4]: fitted_erf(474) # 51 from the table
Out[4]: 50.97883169390196
In [5]: fitted_erf(502) # 79 from the table
Out[5]: 78.23955071273468
However, I'd strongly advise you to check whether a fitted function, made without knowledge of your data source, is the right tool for your task.
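For reference, the fitted expression is the normal cumulative distribution function scaled to the range 0 to 100, so it can also be written with scipy.stats.norm. This is a sketch using the fitted parameters quoted above; the name percentile_from_score is hypothetical.

```python
import numpy as np
from scipy.special import erf
from scipy.stats import norm

c, w = 473.09090474, 37.04826334  # fitted centre and width from the answer

def fitted_erf(x):
    return 50 + 50 * erf((x - c) / (w * np.sqrt(2)))

def percentile_from_score(x):
    # Same curve: 100 * Phi((x - c) / w), with Phi the standard normal CDF.
    return 100 * norm.cdf(x, loc=c, scale=w)

scores = np.array([439, 457, 474, 502])
print(fitted_erf(scores))
print(percentile_from_score(scores))
```

Both functions return the same values up to floating-point rounding, so the caveats about the fit apply equally to either form.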
P.S.
In case you're interested, this is the code used to obtain the parameters:
import numpy as np
from scipy.special import erf
from scipy.optimize import curve_fit
tab=np.genfromtxt('table.csv', delimiter=',', skip_header=1)
# using a 'table.csv' file generated by Google Spreadsheets
x = tab[:,0]
y = tab[:,1]
def parametric_erf(x, c, w):
    return 50+50*erf((x-c)/(w*np.sqrt(2)))
pars, j = curve_fit(parametric_erf, x, y, p0=[475,10])
print(pars)
# outputs [ 473.09090474, 37.04826334]
and to generate the plot
import matplotlib.pyplot as plt
plt.plot(x,parametric_erf(x,*pars))
plt.plot(x,y)
plt.show()
Your question is quite vague, but it seems that whatever calculation you do ends up with a number in the range 381-570. Is this correct? Do you have a multiline calculation which gives this number? I'm guessing you are repeating it in many places in your code, which is why you want to turn it into a function.
For any calculation you can wrap it in a function. For instance:
answer = variable_1 * variable_2 + variable_3
can be written as:
def calculate(v1, v2, v3):
    ''' calculate the result from the inputs
    '''
    return v1 * v2 + v3
answer = calculate(variable_1, variable_2, variable_3)
If you would like a definitive answer, simply post your calculation and I can make it into a function for you.

Python Numpy Random Numbers - inconsistent?

I am trying to generate log-normally distributed random numbers in python (for later MC simulation), and I find the results to be quite inconsistent when parameters are a bit larger.
Below I am generating a series of LogNormals from Normals (and then using Exp) and directly from LogNormals.
The resulting means are bearable, but the variances are quite imprecise. This also holds for mu = 4, 5, ...
If you re-run the below code a couple of times - the results come back quite different.
Code:
import numpy as np
mu = 10
tmp1 = np.random.normal(loc=-mu, scale=np.sqrt(mu*2), size=int(1e7)) # size must be an integer
tmp1 = np.exp(tmp1)
print(tmp1.mean(), tmp1.var())
tmp2 = np.random.lognormal(mean=-mu, sigma=np.sqrt(mu*2), size=int(1e7))
print(tmp2.mean(), tmp2.var())
print('True Mean:', np.exp(0), 'True Var:', (np.exp(mu*2)-1))
Any advice how to fix this?
I've tried this also on Wakari.io - so the result is consistent there as well
Update:
I've taken the 'True' Mean and Variance formula from Wikipedia: https://en.wikipedia.org/wiki/Log-normal_distribution
Snapshots of results:
1)
0.798301881219 57161.0894726
1.32976988569 2651578.69947
True Mean: 1.0 True Var: 485165194.41
2)
1.20346203176 315782.004309
0.967106664211 408888.403175
True Mean: 1.0 True Var: 485165194.41
3) Last one with n=1e8 random numbers
1.17719369919 2821978.59163
0.913827160458 338931.343819
True Mean: 1.0 True Var: 485165194.41
Even with the large sample size that you have, with these parameters, the estimated variance is going to change wildly from run to run. That's just the nature of the fat-tailed lognormal distribution. Try running the np.exp(np.random.normal(...)).var() several times. You will see a similar swing of values as np.random.lognormal(...).var().
In any case, np.random.lognormal() is just implemented as np.exp(np.random.normal()) (well, the C equivalent).
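Both points in this answer can be probed with a short sketch. It uses numpy.random.Generator with arbitrary seeds and a smaller sample size than the question, and the names true_mean, true_var and same_stream are illustrative. The first part compares exp(normal) with lognormal under the same seed; the second prints the sample variance for a few seeds next to the theoretical value.

```python
import numpy as np

mu, sigma = -10.0, np.sqrt(20.0)
N = 1_000_000

# Theoretical moments of the log-normal distribution.
true_mean = np.exp(mu + sigma**2 / 2)                        # exp(0) = 1
true_var = (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2)  # exp(20) - 1

# With the same seed, exp(normal draws) and lognormal draws should coincide.
a = np.exp(np.random.default_rng(0).normal(mu, sigma, size=1000))
b = np.random.default_rng(0).lognormal(mu, sigma, size=1000)
same_stream = np.allclose(a, b, rtol=1e-12, atol=0.0)
print('same values:', same_stream)

# The sample variance is dominated by a handful of extreme draws,
# so it changes a lot from seed to seed.
print('true mean:', true_mean, 'true var:', true_var)
for seed in range(5):
    sample = np.random.default_rng(seed).lognormal(mu, sigma, size=N)
    print(seed, sample.mean(), sample.var())
```

No expected numbers are listed for the loop, since the spread between seeds is the point of the experiment.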
Ok, as you have just built the sample, and using the notation in wikipedia (first section, mu and sigma) and the example given by you:
from numpy import log, exp, sqrt
import numpy as np
mu = -10
scale = sqrt(2*10) # scale is sigma, not variance
tmp1 = np.random.normal(loc=mu, scale=scale, size=int(1e8)) # size must be an integer
# Just checking
print(tmp1.mean(), tmp1.std())
# 10.0011028634 4.47048010775, perfectly accurate
tmp1_exp = exp(tmp1) # Not sensible to use the same name for two samples
# WIKIPEDIA NOTATION!
m = tmp1_exp.mean() # until proven wrong, this is a measure of the mean
v = tmp1_exp.var() # again, until proven wrong, this is sigma**2
# Now, according to wikipedia
print("This: ", log(m**2/sqrt(v+m**2)), "should be similar to", mu)
# I get This: 13.9983309499 should be similar to 10
print("And this:", sqrt(log(1+v/m**2)), "should be similar to", scale)
# I get And this: 3.39421327037 should be similar to 4.472135955
So, even if the values are not exactly perfect, I wouldn't claim that they are completely wrong.
