R `summary` function closest equivalent in Python

Is there any kind of helper to find the min, max (and ideally standard deviation) of each dimension in a multidimensional array within numpy? I'm looking for something like the summary() function in R.
My data is essentially a huge 2D array (list of lists), in which the sublists contain n-dimensional values. E.g. currently I have data with 3-dimensional attributes x, y, z:
a = np.random.rand(100,3)
For each of those dimensions (x,y,z) I want to know the min, max, mean, and std.
I know one can loop through the axes and measure these values, e.g.:
for i in range(a.shape[-1]):
    vals = a[:, i]
    print(np.min(vals), np.max(vals), np.std(vals))
I find myself writing the code to do that almost every time I have a new dataset. Any way to expedite this operation would be hugely helpful!

Without pandas:
from scipy import stats
import numpy as np
a = np.random.rand(100,3)
summary = stats.describe(a, axis=0)
print(summary.mean)
print(summary.minmax)
...
Using pandas:
import pandas as pd
summary_across_rows = pd.DataFrame(a).describe() # across axis=0
print(summary_across_rows)
0 1 2
count 100.000000 100.000000 100.000000
mean 0.495204 0.573827 0.476202
std 0.275131 0.246189 0.271626
min 0.005202 0.037195 0.023595
25% 0.295210 0.399358 0.258712
50% 0.512023 0.562181 0.417322
75% 0.710216 0.790970 0.712047
max 0.998371 0.997717 0.980840
Note: for the summary across the other dimension you need:
summary_across_columns = pd.DataFrame(a.T).describe() # across axis=1

Without pandas:
from scipy import stats
stats.describe(lst)
stats.scoreatpercentile(lst,(5,10,50,90,95))
Here is an example:
from scipy import stats
import numpy as np
stdev = 10
mu = 10
a = stdev * np.random.randn(100) + mu
stats.describe(a)
[OUT1]: DescribeResult(nobs=100, minmax=(-13.180682481878286, 40.6109521437826), mean=10.352380786199149, variance=103.27168865119998, skewness=0.13852516641657087, kurtosis=0.2691915766145532)
stats.scoreatpercentile(a,(5,10,50,90,95))
[OUT2]: array([-7.21731609, -3.22696662, 10.39364637, 21.78527621, 24.20685179])

Related

Pandas correlation on just one column containing np arrays

I'm working with a dataframe with a column containing an np.array per row (in this case representing the mean waveform of brain recordings through time). I want to calculate the Pearson correlation of this column (array by array).
This is my code:
import numpy as np
from scipy import stats

length = len(df.Mean)
Mean = []
for i in range(len(df.Mean)):
    Mean.append(df.Mean[i])
Correlation_p = np.zeros((length, length))
P_Value_p = np.zeros((length, length))
for i in range(length):
    for j in range(length):
        Correlation_p[i][j], P_Value_p[i][j] = stats.pearsonr(df.Mean[i], df.Mean[j])
This works, but I want to know if there is a more Pythonic way to do it, maybe using df.corr(). I tried but could not figure out how.
EDIT: the output of df.Mean.head()
0 [-0.2559348091247745, 0.02743063113723536, 0.3...
1 [-0.37025615099744325, -0.11299328141596175, 0...
2 [-1.0543681894876467, -0.8452798699354909, -0....
3 [-0.23527437766943646, -0.28657810260136585, -...
4 [0.45557980303095674, 0.6055674269814991, 0.74...
Name: Mean, dtype: object
The arrays that you would like to correlate seem to sit in single cells of the DataFrame, if I am not mistaken. The following brings them into a format where each array occupies a single column.
I made a data example that resembles the format of df.Mean.head():
df = pd.DataFrame({'x':[np.random.randint(0,5,10), np.random.randint(0,5,10), np.random.randint(0,5,10)]})
You can turn these arrays into columns using this:
df = pd.DataFrame(np.array(df['x'].tolist()).transpose())
Adapt this reshaping step to your own dimensions.
From there, it would be fairly straightforward.
A correlation matrix can be created by:
df.corr()
A visualization of the correlation matrix:
import matplotlib.pyplot as plt
plt.matshow(df.corr())
plt.show()
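Putting the pieces together for the df.Mean column from the question, a minimal sketch (assuming every cell of df.Mean holds a 1-D array of the same length) could be:
import numpy as np
import pandas as pd

# Each cell of df.Mean is assumed to hold a 1-D array of equal length.
waveforms = pd.DataFrame(np.array(df['Mean'].tolist()).T)  # one column per waveform
corr_matrix = waveforms.corr()  # pairwise correlations (Pearson is the default method)
Note that df.corr() only returns the correlation coefficients; if the p-values are also needed, scipy.stats.pearsonr (as in the original loop) is still required.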

How to calculate covariance on 2 columns out of multiple columns in python?

I've provided a sample dataset below. It is an 8x10 matrix containing two-dimensional normal distributions. For example, col1 and col2 form one set, col3/col4 another, and so on. I'm trying to calculate the covariance of each individual set in Python. So far I've been unsuccessful, and I'm new to Python. However, below is what I've tried:
import pandas
import numpy
import matplotlib.pyplot as plg
data = pandas.read_excel("testfile.xlsx", header=None)
dataNpy = pandas.DataFrame.to_numpy(data)
mean = numpy.mean(dataNpy, axis=0)
dataAWithoutMean = dataNpy - mean
covB = numpy.cov(dataAWithoutMean)
print("cov is: " + str(covB))
I've been tasked to calculate 4 separate covariance matrices and plot the covariance value for each set. In addition, plot the variance of each set.
dataset:
5.583566716 -0.441667252 -0.663300181 -1.249623134 -6.530464227 -4.984165997 2.594874802 2.646629654
6.129721509 2.374902708 -2.583949571 -2.224729817 0.279965502 -0.850298098 -1.542499771 -2.686894831
5.793226266 1.133844629 -1.939493549 1.570726544 -2.125423302 -1.33966397 -0.42901856 -0.09814741
3.413049714 -0.1133744 -0.032092831 -0.122147373 2.063549449 0.685517481 5.887909556 4.056242954
-2.639701885 -0.716557389 -0.851273969 -0.522784614 -7.347432606 -2.653482175 1.043389849 0.774192416
-1.84827484 -0.636893709 -2.223488277 -1.227420764 0.253999505 0.540299783 -1.593071594 -0.70980532
0.754029441 1.427571018 5.486147486 2.956320758 2.054346142 1.939929175 -3.559875405 -3.074861749
2.009806308 1.916796155 7.820990369 2.953681659 2.071682641 0.105056782 -1.120995825 -0.036335483
1.875128481 1.785216268 -2.607698929 0.244415372 -0.793431956 -1.598343481 -2.120852679 -2.777871862
0.168442246 0.324606905 0.53741174 0.274617158 -2.99037756 -3.323958514 -3.288399345 -2.482277047
Thanks for helping in advance :)
Is this what you need?
import pandas
import numpy
import matplotlib.pyplot as plt
data = pandas.read_excel("Book1.xlsx", header=None)
mean = data.mean(axis=0)
dataAWithoutMean = data - mean
# Variance of each set
dataAWithoutMean.var()
# Covariance matrix
cov = dataAWithoutMean.cov()
plt.matshow(cov)
plt.show()
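Since the task is to compute a separate covariance matrix per pair of columns (col1/col2, col3/col4, and so on), one possible extension of the above is the following sketch, assuming the pairs are adjacent columns 0/1, 2/3, 4/5 and 6/7:
import pandas
import matplotlib.pyplot as plt

data = pandas.read_excel("Book1.xlsx", header=None)

# One 2x2 covariance matrix per pair of adjacent columns
pair_covs = [data.iloc[:, [i, i + 1]].cov() for i in range(0, data.shape[1], 2)]
for i, cov in enumerate(pair_covs):
    print("Covariance of set", i + 1)
    print(cov)

# Variance of each column, shown as a bar plot
data.var().plot(kind='bar')
plt.show()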

Pearson correlation of two 3D arrays with NaN in python

I have two three-dimensional arrays a and b with [time,lat,lon]. I want to correlate the time series of each grid cell like correlate(a[:,0,0],b[:,0,0]), correlate(a[:,0,1],b[:,0,1]), ... . I'm aiming for two correlations. One with the entire time series and one only where array a surpasses a certain threshold.
The datasets also include some missing values in the time series and I read in both datasets with Xarray. Correlations and masking are done using numpy.
At the moment I walk through each latitude and longitude, grab the time series, mask it to account for NaNs and the threshold, and correlate them. My code looks like this:
import numpy as np
import xarray as xr

def correlate(A, B, var1, var2, TH):
    name = "corr_" + var1 + "_" + var2 + "_TH_" + str(TH) + ".nc"
    a = xr.open_dataset(A).sel(time=slice('1950-03', '2013-12'))
    b = xr.open_dataset(B).sel(time=slice('1950-03', '2013-12'))
    corr = np.empty([a[var1].shape[1], a[var1].shape[2]], dtype=float)
    corr_TH = np.empty_like(corr)  # separate array (plain "corr_TH = corr" would alias corr)
    varname_TH = "r_TH_" + str(TH)
    for lt in range(corr.shape[0]):
        for ln in range(corr.shape[1]):
            corr[lt, ln] = np.ma.corrcoef(a[var1][:, lt, ln], b[var2][:, lt, ln], rowvar=True)[0, 1]
            corr_TH[lt, ln] = np.ma.corrcoef(np.ma.masked_greater(a[var1][:, lt, ln], TH), b[var2][:, lt, ln], rowvar=True)[0, 1]
    # save whole correlations
    ds = xr.Dataset({'r': (['lat', 'lon'], corr), varname_TH: (['lat', 'lon'], corr_TH)},
                    coords={'lon': a['lon'], 'lat': a['lat']})
    return ds
This works in general but is super slow. I found the Xarray function array.stack() to flatten the arrays and tried something like:
A_stack = A.var1.stack(z=('lat','lon'))
B_stack = B.var2.stack(z=('lat','lon'))
cov = ((A_stack - A_stack.mean(axis=0))* (B_stack - B_stack.mean(axis=0))).mean(axis=0)
corr = cov / (A_stack.std(axis=0) * B_stack.std(axis=0))
The multi-index 'z' over which the array is stacked is retained through the process; however, the resulting correlation array is empty. I suppose that's because of the NaNs.
Does anyone have an idea of how to do this?
Thanks.
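For reference, a vectorized, NaN-aware alternative to the explicit loop is xarray.corr (available in xarray 0.16 and later), which correlates two DataArrays along a chosen dimension while skipping missing values. A minimal sketch, assuming the same variable names as in the function above:
import xarray as xr

a = xr.open_dataset(A)[var1].sel(time=slice('1950-03', '2013-12'))
b = xr.open_dataset(B)[var2].sel(time=slice('1950-03', '2013-12'))

# Correlation over the full time series, one value per (lat, lon) cell
r = xr.corr(a, b, dim='time')

# Correlation only where `a` stays at or below the threshold TH
# (mirrors np.ma.masked_greater in the loop version)
r_th = xr.corr(a.where(a <= TH), b, dim='time')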

How to discretize a signal?

If I have a function like below:
G(s) = C/(s - p), where s = jω, and C and p are constants.
Also, the available frequency is wa = 100000 rad/s. How can I discretize the signal at ∆w = 0.0001·wa in Python?
Use numpy.arange to accomplish this:
import numpy as np
wa = 100000
# np.arange will generate every discrete value given the start, end and the step value
discrete_wa = np.arange(0, wa, 0.0001*wa)
# let's say you have previously defined your_function
g_s = [your_function(value) for value in discrete_wa]
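To make this concrete, here is a sketch of what your_function could look like for G(s) = C/(s - p) evaluated at s = jω over the discretized frequencies (the values of C and p below are placeholders):
import numpy as np
import matplotlib.pyplot as plt

C, p = 1.0, -10.0                    # placeholder constants
wa = 100000
w = np.arange(0, wa, 0.0001 * wa)    # discrete frequencies, step = 0.0001 * wa

G = C / (1j * w - p)                 # evaluate G(s) at s = j*omega, vectorized over the grid

plt.plot(w, np.abs(G))               # magnitude of G over the frequency grid
plt.xlabel("omega [rad/s]")
plt.ylabel("|G(j omega)|")
plt.show()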

How to create np array random data on age vs time?

My aim is to create a scatter plot representing random data on age vs. time spent watching TV.
from pylab import randn
import matplotlib.pyplot as plt

X = randn(500)
Y = randn(500)
plt.scatter(X, Y)
plt.show()
I want ages between 18 and 50 and times between 0 and 24 hours.
You can try:
import random
import numpy as np
age=np.array(random.sample(list(range(18,51)),10))
time=np.array(random.sample(list(range(0,24)),10))
random.sample takes a list of elements as first argument and the number of samples you want as the second argument.
That gives:
age : [47 45 37 19 23 34 39 24 32 42]
time : [18 12 13 1 15 21 23 22 3 17]
On plotting it :
import matplotlib.pyplot as plt
plt.scatter(age, time)
plt.show()
To recreate the same random numbers every time you run it, you can use random.seed()
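For example, a seeded version of the sampling above (the seed value is arbitrary):
import random
import numpy as np

random.seed(42)  # any fixed seed makes the samples reproducible
age = np.array(random.sample(list(range(18, 51)), 10))
time = np.array(random.sample(list(range(0, 24)), 10))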
It's super easy with numpy. You can use the numpy library to do this:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
age = np.random.randint(18, 50, 20)   # 20 random integers in [18, 50)
time = np.random.randint(0, 24, 20)   # 20 random integers in [0, 24)
plt.scatter(age, time)
plt.show()
Column-wise multiplication in numpy
You can easily create custom-sized random arrays with numpy with the commands numpy.random.rand(d0, d1, …, dn) for uniform distributions or numpy.random.randn(d0, d1, …, dn) for normal distributions, where dn is the number of samples in the nth dimension. In your case you'll have d0=500 and d1=2.
However, the values will be sampled from the interval [0, 1) by numpy.random.rand(d0, d1, …, dn), or from the standard normal distribution (i.e. mean = 0 and variance = 1) by numpy.random.randn(d0, d1, …, dn).
A nice workaround for this is to scale and shift the arrays column-wise to move the distributions to the desired values. To multiply an array arr column-wise by a vector vec you can use this small snippet of code: arr.dot(np.diag(vec)). Be careful: vec should have as many elements as arr has columns.
This snippet works by turning vec into a diagonal matrix (i.e. a matrix where everything is zero except the main diagonal) and then multiplying arr by that diagonal matrix.
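As a side note, NumPy broadcasting achieves the same column-wise scaling without building a diagonal matrix, which some may find simpler; a small sketch with made-up scale and offset values:
import numpy as np

arr = np.random.rand(500, 2)
vec = np.array([32.0, 24.0])     # per-column scale factors (example values)
offset = np.array([18.0, 0.0])   # per-column offsets (example values)

# Broadcasting multiplies each column of `arr` by the matching entry of `vec`
# (equivalent to arr.dot(np.diag(vec))) and then shifts each column by `offset`.
sample = arr * vec + offset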
For uniform distributions
Remember that to turn a sample x from a uniform distribution on [0, 1) into one on [min, max), you do new_x = (max - min) * x + min. So if you want a uniform distribution and you know the max and min limits for both variables, you can use the following code:
import numpy as np
n_samples = 500
max_age, min_age = 80, 10
max_hours, min_hours = 10, 0
array = np.random.rand(n_samples, 2) #returns samples from the uniform distribution
range_vector = np.array([max_age - min_age, max_hours - min_hours])
min_vector = np.array([min_age, min_hours])
sample = array.dot(np.diag(range_vector)) + np.ones(array.shape).dot(np.diag(min_vector))
Normal distributions
If you want a normal distribution and you know the mean and variance of both columns, use the following code. Remember that to shift a sample x from a standard normal distribution to a distribution with a different mean and standard deviation, you do new_x = deviation * x + mean.
import numpy as np
n_samples = 500
mean_age, deviation_age = 40, 20
mean_hours, deviation_hours = 5, 2
array = np.random.randn(n_samples, 2) #returns samples from the standard normal distribution
deviation_vector = np.array([deviation_age, deviation_hours])
mean_vector = np.array([mean_age, mean_hours])
sample = array.dot(np.diag(deviation_vector)) + np.ones(array.shape).dot(np.diag(mean_vector))
Be careful, however: with normal distributions you can end up with negative values.
You can also have a look at all the documentation numpy has on random variables: https://docs.scipy.org/doc/numpy/reference/routines.random.html
Finally, please note that this column-wise multiplication only works when you want both samples to be independent.
