Tukey five number summary in Python


I have been unable to find this function in any of the standard packages, so I wrote the one below. Before throwing it toward the Cheeseshop, however, does anyone know of an already published version? Alternatively, please suggest any improvements. Thanks.
def fivenum(v):
    """Returns Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum) for the input vector, a list or array of numbers based on 1.5 times the interquartile distance"""
    import numpy as np
    from scipy.stats import scoreatpercentile
    try:
        np.sum(v)
    except TypeError:
        print('Error: you must provide a list or array of only numbers')
    q1 = scoreatpercentile(v,25)
    q3 = scoreatpercentile(v,75)
    iqd = q3-q1
    md = np.median(v)
    whisker = 1.5*iqd
    return np.min(v), md-whisker, md, md+whisker, np.max(v),

pandas Series and DataFrame have a describe method, which is similar to R's summary:
In [3]: import numpy as np
In [4]: import pandas as pd
In [5]: s = pd.Series(np.random.rand(100))
In [6]: s.describe()
Out[6]:
count    100.000000
mean       0.540376
std        0.296250
min        0.002514
25%        0.268722
50%        0.593436
75%        0.831067
max        0.991971
NaNs are handled correctly.
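As a small check of the NaN behaviour, here is a sketch with made-up data (the values are purely illustrative). It assumes only that pandas and NumPy are installed; describe() leaves the missing entry out of the count and the statistics.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
s = pd.Series([1.0, 2.0, np.nan, 4.0])
summary = s.describe()

# count reflects only the non-missing entries; min/median/max ignore the NaN
print(summary[["count", "min", "50%", "max"]])
```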

I would get rid of these two things:
import numpy as np
from scipy.stats import scoreatpercentile
You should be importing at the module level. This means that users will be aware of missing dependencies as soon as they import your module, rather than when they call the function.
try:
    sum(v)
except TypeError:
    print('Error: you must provide a list or array of only numbers')
Several problems with this:
Don't type check in Python. Document what the function takes.
How do you know callers will see this? They might not be running at a console, and even if they are, they might not want your error message interfering with their output.
Don't type check in Python.
If you do want to raise some sort of exception for invalid data (not type checking), either let an existing exception propagate, or wrap it in your own exception type.
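To illustrate the last two points, here is a minimal sketch with module-level imports and a wrapped exception instead of a print. The names InvalidDataError and median_of are made up for this example and are not part of any library.

```python
import numpy as np  # module-level import: a missing dependency fails at import time


class InvalidDataError(ValueError):
    """Hypothetical exception type for data that cannot be summarised."""


def median_of(v):
    """Return the median of a sequence of numbers."""
    try:
        arr = np.asarray(v, dtype=float)
    except (TypeError, ValueError) as exc:
        # Wrap the low-level error so callers can catch one well-defined type
        raise InvalidDataError("expected a sequence of numbers") from exc
    return float(np.median(arr))
```

Callers who care can catch InvalidDataError; everyone else just sees a normal traceback, and nothing is printed to their console.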

In case anybody ever needs a version that works with NaN in the data, here is my modification. I didn't want to edit the original poster's code, to avoid confusion.
import numpy as np
from scipy.stats import scoreatpercentile
def fivenum(v):
    """Returns Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum) for the input vector, a list or array of numbers based on 1.5 times the interquartile distance"""
    try:
        np.sum(v)
    except TypeError:
        print('Error: you must provide a list or array of only numbers')
    v = np.asarray(v, dtype=float)  # the boolean mask below needs an array
    clean = v[~np.isnan(v)]
    q1 = scoreatpercentile(clean,25)
    q3 = scoreatpercentile(clean,75)
    iqd = q3-q1
    md = np.nanmedian(v)  # scipy.stats.nanmedian has been removed from SciPy
    whisker = 1.5*iqd
    return np.nanmin(v), md-whisker, md, md+whisker, np.nanmax(v),

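Try this:
import numpy as np
from scipy.stats import scoreatpercentile
data = np.random.randn(5)
q1 = scoreatpercentile(data, 25)
q3 = scoreatpercentile(data, 75)
md = np.median(data)
whisker = 1.5 * (q3 - q1)  # md and whisker defined as in the question
print((min(data), md - whisker, md, md + whisker, max(data)))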

I am new to Python, but the return value is calculated incorrectly: it should be max(min(v), q1-whisker) for the lower bound and min(max(v), q3+whisker) for the upper bound. That is how it's done in R (the summary() function), and that's what shows up on the boxplots in matplotlib.pyplot and in R.
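The clamping described here could be sketched as follows. This is only an illustration of the two formulas in this answer, using np.percentile with its default interpolation for the quartiles; the function name whisker_bounds and the sample data are made up.

```python
import numpy as np


def whisker_bounds(v, k=1.5):
    """Clamp the whisker ends to the data range, as described above."""
    v = np.asarray(v, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    reach = k * (q3 - q1)              # 1.5 times the interquartile range
    lower = max(v.min(), q1 - reach)   # never below the smallest value
    upper = min(v.max(), q3 + reach)   # never above the largest value
    return lower, upper


print(whisker_bounds([1, 2, 3, 4, 100]))
```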

Minimal, but it gets the job done. :)
import numpy as np
[round(np.percentile(results[:,4], i), 1) for i in [1, 2, 5, 10, 25, 50]]

import numpy as np
# np_array = np.array(np.random.random(100))
np.percentile(np_array, [0, 25, 50, 75, 100])
Percentile selection can be configured with the interpolation argument (renamed to method in NumPy 1.22), which is linear by default.
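A quick comparison of the default against one alternative, as a sketch that assumes NumPy 1.22 or later (where the keyword is spelled method); the four-element array is made up for illustration.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])
probs = [0, 25, 50, 75, 100]

# Default behaviour: linear interpolation between neighbouring data points
linear = np.percentile(data, probs)

# "lower" always picks the data point at or below the requested position
# (assumes NumPy >= 1.22; older versions spell this interpolation="lower")
lower = np.percentile(data, probs, method="lower")

print(linear)
print(lower)
```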

import pandas as pd
def fivenum(x):
    series=pd.Series(x)
    mi = series.min()
    q1 = series.quantile(q=0.25, interpolation='nearest')
    me = series.median()
    q3 = series.quantile(q=0.75, interpolation='nearest')
    ma = series.max()
    return pd.Series([mi, q1, me, q3, ma], index=['min', 'q1', 'median', 'q3', 'max'])
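For comparison, here is a dependency-free sketch that computes the hinges by position, following the index arithmetic of R's fivenum(). The name fivenum_hinges is made up, and dropping NaN values up front is an assumption of this sketch.

```python
import math


def fivenum_hinges(values):
    """Minimum, lower hinge, median, upper hinge, maximum (NaNs dropped)."""
    x = sorted(float(v) for v in values if not math.isnan(v))
    n = len(x)
    if n == 0:
        raise ValueError("need at least one non-NaN value")
    # 1-based positions of the five numbers; half positions average two neighbours
    n4 = math.floor((n + 3) / 2) / 2
    positions = [1, n4, (n + 1) / 2, n + 1 - n4, n]
    return [0.5 * (x[math.floor(p) - 1] + x[math.ceil(p) - 1]) for p in positions]


print(fivenum_hinges([1, 2, 3, 4]))
```

Hinges and quartiles can differ slightly for small samples, which is why the quantile-based versions above may not match R's fivenum() exactly.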

Related

How can I generate in python an array like this?

I am trying to figure out how to implement a function in Python that generates arrays like this (please ignore the fact that it is the same number repeated):
[[0101111010, 0101111010,0101111010, 0101111010,0101111010,0101111010,0101111010],
[0101111010,0101111010,0101111010,0101111010,0101111010,0101111010,0101111010]]
I wrote this code, but I don't know if it's the best approach:
import numpy as np
import random
sol_per_pop = 20
num_genes = 10
pop_size = (sol_per_pop,num_genes)
population = np.random.randint(2, size=pop_size)
print(population)
I don't want to use strings. I want to find the best solution. Thank you!

I don't really see what this would be useful for. But it might still be fun.
import random
int("{:b}".format(random.randint(0,1<<64)))
Or:
import random
r = random.randint(0,1<<64)
sum(10**i*((r>>i)&1) for i in range(64))
Or, if we must use numpy:
import numpy as np
significance = 10**np.cumsum(np.ones(64))
np.sum(significance*np.random.default_rng().integers(0, 2, 64))
Yet another idea:
import numpy as np
rng = np.random.default_rng()
significance = 10**np.cumsum(np.ones(64))
result = np.zeros(64)
result[rng.choice(range(64), rng.integers(0, 64))] = 1
np.sum(significance*result)
As you can see, there are many approaches to solve the same problem.
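One more variation, sketched under the assumption that each row of the 0/1 population from the question should become a single integer whose decimal digits are those bits. The variable names follow the question; place_values and as_digits are made up here.

```python
import numpy as np

rng = np.random.default_rng()
sol_per_pop, num_genes = 20, 10

# One row per solution, one 0/1 gene per column (as in the question)
population = rng.integers(0, 2, size=(sol_per_pop, num_genes))

# Weight each column by a power of ten so the bits become decimal digits
place_values = 10 ** np.arange(num_genes - 1, -1, -1, dtype=np.int64)
as_digits = population @ place_values

print(as_digits[:3])
```

Leading zeros cannot survive in an integer, so keeping the 2-D population array itself is usually the more practical representation.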

How to use a function argument as npy method, if possible

I am trying to create a function that takes the name of a NumPy distribution as an argument and plots a histogram of random data points drawn from that distribution.
It is okay if it only works on NumPy distributions that require one argument. I am just really stuck trying to build the np.random.distribution() call...
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
#Define a function (Fnc) that produces random numpy distributions (dstr)
#Fnc args: [npy dstr name as lst of str], [num of data pts]
def get_rand_dstr(dstr_name):
    npdstr = dstr_name
    dstr = np.random.npdstr(user.input("How many datapoints?"))
    #here pass each dstr from dstr_name through for loop
    #for loop will prompt user for required args of dstr (nbr of desired datapoints)
    return plt.hist(df)
get_rand_dstr('chisquare')

Try this code; it might help.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
def get_rand_dstr(dstr_name):
    # npdstr = dstr_name
    dstr = 'np.random.{}({})'.format(dstr_name, (input("How many datapoints?"))) # to use other distributions, this line needs adjusting,
    # because the number of args differs between distributions
    print(dstr)
    df = eval(dstr)
    print(df)
    # dstr1 = np.random.chisquare(int(input("How many datapoints?")))
    # print(dstr1)
    return plt.hist(df)
# get_rand_dstr('geometric')
get_rand_dstr('chisquare')

The accepted answer is incorrect and does not work. The problem is that NumPy's random distributions take different required arguments, so it's a little fiddly to pass size to all of them because it's a kwarg. (That's why the example in the accepted solution returns the wrong number of samples — only 1, not the 5 that were requested. That's because the first argument for chisquare is df, not size.)
It's common to want to invoke functions by name. As well as not working, the accepted answer uses eval() which is a common suggested solution to the issue. But it's generally accepted to be a bad idea, for various reasons.
A better way to achieve what you want is to define a dictionary that maps strings representing the names of functions to the functions themselves. For example:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
DISTRIBUTIONS = {
    'standard_cauchy': np.random.standard_cauchy,
    'standard_exponential': np.random.standard_exponential,
    'standard_normal': np.random.standard_normal,
    'chisquare': lambda size: np.random.chisquare(df=1, size=size),
}
def get_rand_dstr(dstr_name):
    npdstr = DISTRIBUTIONS[dstr_name]
    size = int(input("How many datapoints?"))
    dstr = npdstr(size=size)
    return plt.hist(dstr)
get_rand_dstr('chisquare')
This works fine — for the functions I made keys for. You could make more — there are 35 I think — but the problem is that they don't all have the same API. In other words, you can't call them all just with size as an argument. For example, np.random.chisquare() requires the parameter df or 'degrees of freedom'. Other functions require other things. You could make assumptions about those things and wrap all of the function calls (like I did, above, for chisquare)... if that's what you want to do?
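If you do go down the wrapping route, functools.partial is one way to pin the extra parameters without writing a lambda for each entry. This is only a sketch: the parameter values (df=1, lam=3.0), the seed, and the helper name sample are assumptions made for illustration, and it uses NumPy's Generator API rather than the legacy np.random functions.

```python
from functools import partial

import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the example repeatable

# Every entry is callable as f(size=...); the pinned parameters are assumptions
DISTRIBUTIONS = {
    "standard_normal": rng.standard_normal,
    "chisquare": partial(rng.chisquare, df=1),
    "poisson": partial(rng.poisson, lam=3.0),
}


def sample(name, size):
    """Draw `size` samples from the named distribution."""
    try:
        draw = DISTRIBUTIONS[name]
    except KeyError:
        raise ValueError(f"unknown distribution: {name!r}") from None
    return draw(size=size)


print(sample("chisquare", 5))
```

Passing the result to plt.hist() works the same way as in the function above.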

Python: faster way of counting occurrences in numpy arrays (large dataset)

I am new to Python. I have a numpy.array whose size is 66049x1 (66049 rows and 1 column). The values are sorted from smallest to largest and are of float type, with some of them repeated.
I need to determine the frequency of occurrence of each value (the number of times a given value is equalled but not surpassed, i.e. X<=x in statistical terms), in order to later plot the sample cumulative distribution function.
The code I am currently using is below, but it is extremely slow, as it has to loop 66049x66049=4362470401 times. Is there any way to speed up this piece of code? Would using dictionaries help in any way? Unfortunately I cannot change the size of the arrays I am working with.
+++Function header+++
...
...
directoryPath=raw_input('Directory path for native csv file: ')
csvfile = numpy.genfromtxt(directoryPath, delimiter=",")
x=csvfile[:,2]
x1=numpy.delete(x, 0, 0)
x2=numpy.zeros((x1.shape[0]))
x2=sorted(x1)
x3=numpy.around(x2, decimals=3)
count=numpy.zeros(len(x3))
#Iterates over the x3 array to find the number of occurrences of each value
for i in range(len(x3)):
    temp=x3[i]
    for j in range(len(x3)):
        if (temp<=x3[j]):
            count[j]=count[j]+1
#Creates a 2D array with (value, occurrences)
x4=numpy.zeros((len(x3), 2))
for i in range(len(x3)):
    x4[i,0]=x3[i]
    x4[i,1]=numpy.around((count[i]/x1.shape[0]),decimals=3)
...
...
+++Function continues+++

import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data'])
df_p.T.plot(kind='hist')
plt.show()
That whole script took a very short time to execute (~2s) for a (100,000x1) array. I didn't time it precisely, but if you provide the time yours took, we can compare.
I used Counter from collections to count the number of occurrences; my experience with it has always been great (time-wise). I converted it into a DataFrame to plot and used T to transpose.
Your data does repeat a bit, but you can try to refine it some more. As it is, it's pretty fast.
Edit
Create CDF using cumsum()
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data']).T
df_p['cumu'] = df_p['data'].cumsum()
df_p['cumu'].plot(kind='line')
plt.show()
Edit 2
For scatter() plot you must specify the (x,y) explicitly. Also, calling df_p['cumu'] will result in a Series, not a DataFrame.
To properly display a scatter plot you'll need the following:
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data']).T
df_p['cumu'] = df_p['data'].cumsum()
df_p.plot(kind='scatter', x='data', y='cumu')
plt.show()

You should use np.where and then count the length of the obtained vector of indices:
indices = np.where(x3 <= value)
count = len(indices[0])
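Since the array in the question is already sorted, np.searchsorted can return the "how many values are <= x" count for every element in one call, with no Python loop. A small sketch with made-up data:

```python
import numpy as np

# Hypothetical stand-in for the sorted, rounded array from the question
x3 = np.array([0.1, 0.1, 0.2, 0.5, 0.5, 0.5])

# side="right" gives, for each value, the number of elements <= that value
counts = np.searchsorted(x3, x3, side="right")

print(counts)
```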

If efficiency counts, you can use the NumPy function bincount, which needs integers:
import numpy as np
a=np.random.rand(66049).reshape((66049,1)).round(3)
z=np.bincount(np.int32(1000*a[:,0]))
It takes about 1 ms.
Regards.

# for counting a single value
mask = (my_np_array == value_to_count).astype('uint8')
# or a condition
mask = (my_np_array <= max_value).astype('uint8')
count = np.sum(mask)
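For the cumulative distribution the question is ultimately after, one possible sketch uses np.unique with return_counts=True and a cumulative sum. The function name ecdf and the rounding to three decimals (taken from the question's code) are assumptions of this example.

```python
import numpy as np


def ecdf(values, decimals=3):
    """Return (unique values, fraction of samples <= each unique value)."""
    x = np.round(np.asarray(values, dtype=float).ravel(), decimals)
    uniq, counts = np.unique(x, return_counts=True)  # uniq comes back sorted
    return uniq, np.cumsum(counts) / x.size


values, cdf = ecdf([1.0, 1.0, 2.0, 3.0])
print(values, cdf)
```

This does one sort instead of the nested loop, so it should scale comfortably to the 66049-element array.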

Plot using pandas

I have some event times in a list and I would like to plot an exponentially weighted moving average of them. I can do this using the following code.
import numpy as np
import matplotlib.pyplot as plt
print("Code running")
a=0.01
l = [3.0,7.0,10.0,20.0,200.0]
y = np.zeros(1000)
for item in l:
    y[int(item)]=1  # array indices must be integers
s = np.zeros(1000)
x = np.linspace(0,1000,1000)
for i in range(1000):
    s[i] = a*y[i-1]+(1-a)*s[i-1]
plt.plot(x, s)
plt.show()
This is clearly a horrible way to use Python, however. What's the right way to do this? Is it possible to do it without making all these extra sparse arrays?
The output should look like this.

Pandas comes to mind for this task:
import numpy as np
import pandas as pd
l = [3.0,7.0,10.0,20.0,200.0]
s = pd.Series(np.ones_like(l), index=l)
y = s.reindex(range(1000), fill_value=0)
y.ewm(span=199).mean().plot()  # pd.ewma was removed from pandas; Series.ewm replaces it
The span n=199 is related to your parameter a=0.01 by a=2/(n+1), i.e. n=2/a-1.

AFAIK there's not a very good way to do this with numpy or the scipy.sparse module -- the sparse matrices in scipy.sparse are designed to be 2D matrices, and to create one in the first place you'd basically need to use the code you've already written in your first loop (i.e., to set all of the nonzero locations in a sparse matrix), with the additional complexity of always having to specify two index values.
As if that's not bad enough, np.convolve doesn't work with sparse arrays, so you'd still need to write out the computation in your second loop to compute the moving average.
My recommendation, which probably isn't much help if you're looking for a fancy numpy version, is to fall back on Python's excellent support as a general-purpose language:
import numpy as np
import matplotlib.pyplot as plt
a=0.01
l = set([3, 7, 10, 20, 200])
s = np.zeros(1000)
for i in range(len(s)):
    s[i] = a * int(i-1 in l) + (1-a) * s[i-1]
plt.plot(s)
plt.show()
Here, I've stored the event index values in l, just as you did, but I used a set to make lookup times O(1) -- though if len(l) isn't very large, you might even be better off with a plain list or tuple, you'd need to measure it to be sure. Then you can avoid creating the y array and just rely on Iverson's convention to convert the Boolean value x in y into an int. You might not even need the explicit cast, but I find it helpful to be explicit.
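As a cross-check of the recursion above, here is a sketch that reproduces it with pandas. It assumes Series.ewm with adjust=False, which applies the plain recursive form, and uses shift(1) to supply the one-step lag from y[i-1]; the names s_loop and s_ewm are made up.

```python
import numpy as np
import pandas as pd

a = 0.01
events = [3, 7, 10, 20, 200]
y = np.zeros(1000)
y[events] = 1

# Reference: the explicit recursion s[i] = a*y[i-1] + (1-a)*s[i-1]
s_loop = np.zeros(1000)
for i in range(1, 1000):
    s_loop[i] = a * y[i - 1] + (1 - a) * s_loop[i - 1]

# Same recursion via pandas; adjust=False selects the recursive definition
lagged = pd.Series(y).shift(1, fill_value=0.0)
s_ewm = lagged.ewm(alpha=a, adjust=False).mean().to_numpy()

print(np.allclose(s_loop, s_ewm))
```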

I think you're looking for something like this:
import numpy as np
import matplotlib.pyplot as plt
from scikits.timeseries.lib.moving_funcs import mov_average_expw
l = [ 3.0, 7.0, 10.0, 20.0, 200.0 ]
y = np.zeros(1000)
y[[l]] = 1
emav = mov_average_expw(y, 199)
plt.plot(emav)
plt.show()
This makes use of mov_average_expw from scikits.timeseries. Check that method's documentation to see how I came up with the span parameter based on your code's a variable.

While loops for lists?

I'm pretty new to programming and I have a quick question. I am trying to make a Gaussian function for a range of stars. However, I want the size of undercurve to be 100 for all the stars. I was thinking of using a while loop that keeps going while the total length of undercurve is less than 100. However, I get an error, and I'm guessing it has something to do with it being a list. I'm showing you my code to see if you can help me out. Thanks!
I get a syntax error: can't assign to call function
import numpy
import random
import math
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import scipy
from scipy import stats
from math import sqrt
from numpy import zeros
from numpy import numarray
variance = input("Input variance of the star:")
mean = input("Input mean of the star:")
space=numpy.linspace(-4,1,1000)
sigma = sqrt(variance)
Max = max(mlab.normpdf(space,mean,sigma))
normalized = (mlab.normpdf(space,mean,sigma))/Max
def random_y_pt():
    return random.uniform(0,1)
def random_x_pt():
    return random.uniform(-4,1)
import random
def undercurve(size):
    result = []
    for i in range(0,size):
        y = random_y_pt()
        x = random_x_pt()
        if y < scipy.stats.norm(scale=variance,loc=mean).pdf(x):
            result.append((x))
    return result
size = 1
while len(undercurve(size)) < 100:
    undercurve(size) = undercurve(1)+undercurve(size)
print undercurve(size)
plt.hist(undercurve(size),bins=20)
plt.show()

If your error is something like SyntaxError: can't assign to function call then that's because of your line
undercurve(size) = undercurve(1)+undercurve(size)
Which is trying to set the output of the right-hand side as the value of undercurve(size), which you cannot do.
It sounds like you actually want to see just the first 100 items in the list returned by undercurve(size). For that, use
undercurve(size)[:100]
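If the goal is to end up with exactly 100 accepted points, another option is to keep one list and loop until it is long enough, rather than calling undercurve() repeatedly. The sketch below assumes Python 3.8+ for statistics.NormalDist; the function name, the default range of -4 to 1 (taken from the question), and the example mean and variance are illustrative only.

```python
import random
from statistics import NormalDist


def sample_under_curve(mean, variance, target=100, lo=-4.0, hi=1.0, seed=None):
    """Rejection-sample x values under a normal pdf until `target` are kept."""
    rng = random.Random(seed)
    dist = NormalDist(mu=mean, sigma=variance ** 0.5)
    peak = dist.pdf(mean)                # highest point of the curve
    accepted = []
    while len(accepted) < target:        # loop on the list, not on a function call
        x = rng.uniform(lo, hi)
        y = rng.uniform(0, peak)
        if y < dist.pdf(x):              # keep points that fall under the curve
            accepted.append(x)
    return accepted


points = sample_under_curve(mean=-1.5, variance=0.25, seed=0)
print(len(points))
```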
