I understand that I have to put this all into a function and then call the function from a for loop ten times but I'm not sure how. Any help would be deeply appreciated.
import random
import matplotlib.pyplot as plt
import statistics as stats
import numpy as np
import scipy.stats
plt.hist(list1, bins=100, alpha=0.5)
array1 = np.array(list1)
array2 = np.array(list2)
array3 = np.array(list3)
# Run the t-test using scipy library
scipy.stats.ttest_ind(array1, array2)
Use range. A loop written as for x in range(0, 10) runs ten times:
import random
import matplotlib.pyplot as plt
import statistics as stats
import numpy as np
# Library for scientific statistics
import scipy.stats
for x in range(0, 10):
    print(x)
# Create two lists of random numbers that follow a normal ("Gaussian") distribution
# Start with an empty list named "list1"
list1 = []
# Loop that runs 30 times
for x in range(30):
    # Random numbers drawn from a pool with a mean of 12 and a standard deviation of 5
    value1 = random.gauss(12, 5)
    # Add the random value to the first list, list1
    list1.append(value1)
print(list1)
# Do the same with a second list
list2 = []
for x in range(30):
    # Random numbers drawn from a pool with a mean of 14 and a standard deviation of 4
    value2 = random.gauss(14, 4)
    list2.append(value2)
print(list2)
# Create a histogram of the two lists using the matplotlib library
plt.hist(list1, bins=50, alpha=0.5)
plt.hist(list2, bins=50, alpha=0.5)
# Run a t-test on the two sets of data
array1 = np.array(list1)
array2 = np.array(list2)
# Run the t-test using the scipy library
scipy.stats.ttest_ind(array1, array2)
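To tie this back to the original question: one way (a sketch, not the only option) is to wrap the whole experiment in a function and call that function from a loop that runs ten times. The function name run_experiment is just an illustrative choice.
import random
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt

def run_experiment():
    # Draw 30 values from each of the two normal distributions
    list1 = [random.gauss(12, 5) for _ in range(30)]
    list2 = [random.gauss(14, 4) for _ in range(30)]
    # Overlay the two samples as semi-transparent histograms
    plt.hist(list1, bins=50, alpha=0.5)
    plt.hist(list2, bins=50, alpha=0.5)
    # Run the independent-samples t-test and return its result
    return scipy.stats.ttest_ind(np.array(list1), np.array(list2))

# Call the function ten times and print each t-test result
for x in range(10):
    print(run_experiment())
plt.show()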
I'm trying to create a simulation that samples from two different normal distributions at specified probabilities. I want the simulation to choose a new value from the distribution during each simulation. I created the code below, but it picks a random value on each distribution one time, and then simulates it 50 times. How can I get new values from each distribution during each iteration of the simulation?
import numpy as np
from numpy.random import normal
number_simulations = 50
P1 = normal(loc=75, scale=5)
P2 = normal(loc=25, scale=5)
elements = [P1, P2]
probabilities = [.80, .20]
simulation = np.random.choice(elements, number_simulations, p=probabilities)
print(simulation)
[26.40889965 71.60833802 71.60833802 26.40889965 71.60833802, etc]
You could generate all 50 samples for each P up front using size. Then, for each simulation, use np.random.choice to pick either index 0 of elements (P1) or index 1 (P2) according to the probabilities, and draw a random sample from the chosen distribution. A list comprehension generates your 50 simulations.
import numpy as np
from numpy.random import normal
number_simulations = 50
P1 = normal(loc=75, scale=5, size=number_simulations)
P2 = normal(loc=25, scale=5, size=number_simulations)
elements = [P1, P2]
probabilities = [.80, .20]
[np.random.choice(elements[np.random.choice([0,1], p=probabilities)]) for x in range(number_simulations)]
Maybe a bit smoother would be to generate 50 samples with mean 50 and then either add or subtract 25 depending on the result:
import numpy as np
number_simulations = 50
probabilities = [.20, .80]
x = np.random.normal(loc = 50, scale = 5, size = number_simulations)
a = np.random.choice([-25,25], p = probabilities, size = number_simulations)
print(list(x+a))
I have a long list of values that is triple-indexed by (i, j, t). For every i in I and j in J I have to extract all t values and calculate the coefficient of variation (cv) successively. The length of the cv list is len(I)*len(J). Then I plot the cv list and check whether the cv converges to some number.
Right now I am looping, which is rather inefficient (see the example). Is there another possibility that avoids the loops?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
iN = 10
jN = 10
tN = 20
I = range(iN)
J = range(jN)
T = range(tN)
idx = pd.MultiIndex.from_product([I,J,T])
data = np.random.normal(loc=1, size=iN*jN*tN)
df = pd.DataFrame(data, index=idx, columns=['value'])
values_lst = []
cv_lst = []
for i in I:
    for j in J:
        values_lst.extend(df.loc[(i, j, slice(None)), 'value'])
        sd = np.std(values_lst, ddof=1)
        mean = np.mean(values_lst)
        cv_lst.append(sd/mean)
plt.plot(cv_lst)
plt.show()
I'm posting this on the assumption that you can easily extract your data into an (i, j, t)-sized numpy array. In that case you can let numpy do its magic:
cv_list = np.std(data, axis=2, ddof=1)/np.mean(data, axis=2)
axis=2 means the mean and std are taken over the t axis. The result is still a 2D array, which you can reshape or flatten as you like.
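Assuming the DataFrame from the question (built with from_product, so the rows are already ordered by (i, j, t)), one way to obtain such an array is a plain reshape of the value column; this is only a sketch of that extraction step.
# Reshape the flat 'value' column into an (iN, jN, tN) array, then vectorize the cv
data3d = df['value'].to_numpy().reshape(iN, jN, tN)
cv_list = np.std(data3d, axis=2, ddof=1) / np.mean(data3d, axis=2)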
This is what I have tried so far
import itertools
import numpy as np
import matplotlib.pyplot as plt
with open('base.txt', 'r') as f:
    vst = map(int, itertools.imap(float, f))
v1=vst[::3]
print type(v1)
a=np.asarray(v1)
print len(a)
a11=a.reshape(50,100)
plt.imshow(a11, cmap='hot')
plt.colorbar()
plt.show()
I have a (50, 100) array and each element has a numerical value (range 1200-5400). I would like to have an image that represents the array, but I got this instead.
What should I change to get a proper image?
I don't have the data from base.txt.
However, in order to simulate your problem, I created random numbers between 1200 and 5500 and built a 50 x 100 numpy array, which I believe is close to your data and requirement.
Then I simply plotted the data with your plot code.
I am getting a true representation of the array.
See if this helps.
Demo Code
#import itertools
import numpy as np
from numpy import array
import matplotlib.pyplot as plt
import random
# Generate a list of 5000 ints between 1200 and 5500
M = 5000
myList = [random.randrange(1200, 5500) for i in range(0, M)]
#Convert to 50 x 100 list
n = 50
newList = [myList[i:i+n] for i in range(0, len(myList), n)]
#Convert to 50 x 100 numpy array
nArray = array(newList)
print(nArray)
a11=nArray.reshape(50,100)
plt.imshow(a11, cmap='hot')
plt.colorbar()
plt.show()
Plot
I am new to Python. I have a numpy.array whose size is 66049x1 (66049 rows and 1 column). The values are sorted smallest to largest and are of float type, with some of them repeated.
I need to determine the frequency of occurrence of each value (the number of times a given value is equalled but not surpassed, i.e. X <= x in statistical terms), in order to later plot the sample cumulative distribution function.
The code I am currently using is below, but it is extremely slow, as it has to loop 66049x66049 = 4,362,470,401 times. Is there any way to speed up this piece of code? Would using dictionaries perhaps help? Unfortunately I cannot change the size of the arrays I am working with.
+++Function header+++
...
...
directoryPath=raw_input('Directory path for native csv file: ')
csvfile = numpy.genfromtxt(directoryPath, delimiter=",")
x=csvfile[:,2]
x1=numpy.delete(x, 0, 0)
x2=numpy.zeros((x1.shape[0]))
x2=sorted(x1)
x3=numpy.around(x2, decimals=3)
count=numpy.zeros(len(x3))
#Iterates over the x3 array to find the number of occurrences of each value
for i in range(len(x3)):
    temp = x3[i]
    for j in range(len(x3)):
        if (temp <= x3[j]):
            count[j] = count[j] + 1
#Creates a 2D array with (value, occurrences)
x4=numpy.zeros((len(x3), 2))
for i in range(len(x3)):
    x4[i, 0] = x3[i]
    x4[i, 1] = numpy.around((count[i]/x1.shape[0]), decimals=3)
...
...
+++Function continues+++
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data'])
df_p.T.plot(kind='hist')
plt.show()
That whole script executed in a very short time (~2 s) for a (100,000 x 1) array. I didn't time it precisely, but if you provide the time yours took we can compare.
I used Counter from collections to count the number of occurrences; my experience with it has always been great (timewise). I converted it into a DataFrame to plot and used T to transpose.
Your data does repeat a bit, but you can try to refine it some more. As it is, it's pretty fast.
Edit
Create CDF using cumsum()
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data']).T.sort_index()  # sort by value so the cumulative sum forms a proper CDF
df_p['cumu'] = df_p['data'].cumsum()
df_p['cumu'].plot(kind='line')
plt.show()
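If you want the curve to end at 1 like a proper CDF rather than a cumulative count, you could additionally normalize by the total count (a small tweak, not part of the original code):
df_p['cumu'] = df_p['data'].cumsum() / df_p['data'].sum()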
Edit 2
For a scatter() plot you must specify x and y explicitly. Also, df_p['cumu'] is a Series, not a DataFrame.
To properly display a scatter plot you'll need the following:
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data']).T.sort_index()  # sort by value before taking the cumulative sum
df_p['cumu'] = df_p['data'].cumsum()
df_p.plot(kind='scatter', x='data', y='cumu')
plt.show()
You should use np.where and then count the length of the obtained vector of indices:
indices = np.where(x3 <= value)
count = len(indices[0])
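Equivalently, since only the count is needed, you could skip np.where and count the matches directly; both forms below use only standard numpy:
count = np.count_nonzero(x3 <= value)
# or: count = (x3 <= value).sum()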
If efficiency counts, you can use the numpy function bincount, which needs integers:
import numpy as np
a=np.random.rand(66049).reshape((66049,1)).round(3)
z=np.bincount(np.int32(1000*a[:,0]))
It takes about 1 ms.
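Since the question ultimately wants the number of values at or below each value (to build the sample CDF), you could follow the bincount with a cumulative sum; a small sketch building on z and a from above:
cumulative = np.cumsum(z)            # cumulative[k] ~ number of values <= k/1000
ecdf = cumulative / float(len(a))    # normalize to obtain the sample CDF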
# for counting a single value
mask = (my_np_array == value_to_count).astype('uint8')
# or a condition
mask = (my_np_array <= max_value).astype('uint8')
count = np.sum(mask)
Is there a more efficient way to take the average of an array in prespecified bins? For example, I have an array of numbers and an array corresponding to bin start and end positions in that array, and I just want to take the mean in those bins. I have code that does this below, but I am wondering how it can be cut down and improved. Thanks.
from scipy import *
from numpy import *
def get_bin_mean(a, b_start, b_end):
    ind_upper = nonzero(a >= b_start)[0]
    a_upper = a[ind_upper]
    a_range = a_upper[nonzero(a_upper < b_end)[0]]
    mean_val = mean(a_range)
    return mean_val
data = rand(100)
bins = linspace(0, 1, 10)
binned_data = []
n = 0
for n in range(0, len(bins)-1):
    b_start = bins[n]
    b_end = bins[n+1]
    binned_data.append(get_bin_mean(data, b_start, b_end))
print binned_data
It's probably faster and easier to use numpy.digitize():
import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
An alternative to this is to use numpy.histogram():
bin_means = (numpy.histogram(data, bins, weights=data)[0] /
numpy.histogram(data, bins)[0])
Try for yourself which one is faster... :)
The Scipy (>=0.11) function scipy.stats.binned_statistic specifically addresses the above question.
For the same example as in the previous answers, the Scipy solution would be
import numpy as np
from scipy.stats import binned_statistic
data = np.random.rand(100)
bin_means = binned_statistic(data, data, bins=10, range=(0, 1))[0]
Not sure why this thread got necroed, but here is a 2014-approved answer, which should be far faster:
import numpy as np
data = np.random.rand(100)
bins = 10
slices = np.linspace(0, 100, bins+1, True).astype(int)
counts = np.diff(slices)
mean = np.add.reduceat(data, slices[:-1]) / counts
print(mean)
The numpy_indexed package (disclaimer: I am its author) contains functionality to efficiently perform operations of this type:
import numpy_indexed as npi
print(npi.group_by(np.digitize(data, bins)).mean(data))
This is essentially the same solution as the one I posted earlier; but now wrapped in a nice interface, with tests and all :)
I would add, also to answer the question find mean bin values using histogram2d python, that scipy also has a function specifically designed to compute a two-dimensional binned statistic for one or more sets of data:
import numpy as np
from scipy.stats import binned_statistic_2d
x = np.random.rand(100)
y = np.random.rand(100)
values = np.random.rand(100)
bin_means = binned_statistic_2d(x, y, values, bins=10).statistic
The function scipy.stats.binned_statistic_dd is a generalization of this function for higher-dimensional datasets.
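As a rough sketch of how that generalization is called (the three coordinate arrays here are made up purely for illustration), binned_statistic_dd takes an (N, D) array of sample points plus the values to aggregate:
import numpy as np
from scipy.stats import binned_statistic_dd

x = np.random.rand(100)
y = np.random.rand(100)
z = np.random.rand(100)
values = np.random.rand(100)
# Stack the coordinates into an (N, 3) sample array and average the values per 3D bin
sample = np.column_stack([x, y, z])
bin_means = binned_statistic_dd(sample, values, statistic='mean', bins=5).statistic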
Another alternative is to use ufunc.at. This method applies a desired operation in place at specified indices.
We can get the bin position for each data point using the searchsorted method.
Then we can use at to increment the histogram by 1 at the position given by bin_indexes, each time an index appears in bin_indexes.
import numpy as np
np.random.seed(1)
data = np.random.random(100) * 100
bins = np.linspace(0, 100, 10)
histogram = np.zeros_like(bins)
bin_indexes = np.searchsorted(bins, data)
np.add.at(histogram, bin_indexes, 1)
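Since the question asks for per-bin means rather than raw counts, the same at trick can be extended by also accumulating per-bin sums and dividing by the counts; a minimal sketch building on the arrays above (bins that received no data are left as NaN):
bin_sums = np.zeros_like(bins)
np.add.at(bin_sums, bin_indexes, data)
# Divide sum by count only where the count is non-zero
bin_means = np.divide(bin_sums, histogram,
                      out=np.full_like(bins, np.nan), where=histogram > 0)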