Difference in output between numpy linspace and numpy logspace - python

NumPy linspace returns evenly spaced numbers over a specified interval. NumPy logspace returns numbers spaced evenly on a log scale.
I don't understand why numpy logspace often returns values "out of range" from the bounds I set. Take numbers between 0.02 and 2.0:
import numpy as np
print(np.linspace(0.02, 2.0, num=20))
print(np.logspace(0.02, 2.0, num=20))
The output for the first is:
[ 0.02 0.12421053 0.22842105 0.33263158 0.43684211 0.54105263
0.64526316 0.74947368 0.85368421 0.95789474 1.06210526 1.16631579
1.27052632 1.37473684 1.47894737 1.58315789 1.68736842 1.79157895
1.89578947 2. ]
That looks correct. However, the output for np.logspace() is wrong:
[ 1.04712855 1.33109952 1.69208062 2.15095626 2.73427446
3.47578281 4.41838095 5.61660244 7.13976982 9.07600522
11.53732863 14.66613875 18.64345144 23.69937223 30.12640904
38.29639507 48.68200101 61.88408121 78.6664358 100. ]
Why does it output 1.047 to 100.0?

2017 update: NumPy 1.12 includes a function that does exactly what the original question asked for, i.e. it returns a range between two values evenly sampled in log space.
The function is numpy.geomspace:
>>> np.geomspace(0.02, 2.0, 20)
array([ 0.02 , 0.0254855 , 0.03247553, 0.04138276, 0.05273302,
0.06719637, 0.08562665, 0.1091119 , 0.13903856, 0.17717336,
0.22576758, 0.28768998, 0.36659614, 0.46714429, 0.59527029,
0.75853804, 0.96658605, 1.23169642, 1.56951994, 2. ])

logspace computes its start and end points as base**start and base**stop respectively. The base value can be specified, but is 10.0 by default.
For your example you have a start value of 10**0.02 == 1.047 and a stop value of 10**2 == 100.
You could use the following parameters (calculated with np.log10) instead:
>>> np.logspace(np.log10(0.02) , np.log10(2.0) , num=20)
array([ 0.02 , 0.0254855 , 0.03247553, 0.04138276, 0.05273302,
0.06719637, 0.08562665, 0.1091119 , 0.13903856, 0.17717336,
0.22576758, 0.28768998, 0.36659614, 0.46714429, 0.59527029,
0.75853804, 0.96658605, 1.23169642, 1.56951994, 2. ])

This is pretty simple.
NumPy gives you numbers evenly distributed in log space, i.e. 10^(value), where value is evenly spaced between your start and stop values.
You'll note that 10^0.02 is 1.04712... and 10^2 is 100.

From documentation for numpy.logspace() -
numpy.logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None)
Return numbers spaced evenly on a log scale.
In linear space, the sequence starts at base ** start (base to the
power of start) and ends with base ** stop (see endpoint below).
For your case, base is defaulting to 10, so it's going from 10 raised to 0.02 up to 10 raised to 2 (i.e. 100).
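In other words, logspace(start, stop, num) essentially returns base raised to the values of linspace(start, stop, num). A quick sketch to confirm that relationship:

import numpy as np

# logspace raises the base to each of the evenly spaced exponents from linspace
exponents = np.linspace(0.02, 2.0, num=20)
print(np.allclose(np.logspace(0.02, 2.0, num=20), 10.0 ** exponents))  # True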

Related

Which dtype would be correct to prevent numpy.arange() from getting the wrong length?

I am trying to get a shifting array containing 200 values in a range with a difference of 40.
Therefore I am using numpy.arange(a, b, 0.2) with starting values a=0 and b=40 and going upwards (a=0.2, b=40.2, then a=0.4, b=40.4, and so on).
When I reach numpy.arange(25.4, 65.4, 0.2), however, I suddenly get an array of length 201:
import numpy
x = numpy.arange(25.2, 65.2, 0.2)
print(len(x))
Returns 200
x = numpy.arange(25.4, 65.4, 0.2)
print(len(x))
Returns 201
I have got so far as to notice that this probably happens due to rounding issues caused by the data type...
I know there is a 'dtype' option in numpy.arange():
numpy.arange(start, stop, step, dtype)
The question is which data type would fit this problem and why? (I am not so confident with data types yet, and https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html#numpy.dtype hasn't helped me get this issue resolved.) Please help!
np.arange is most useful when you want to precisely control the difference between adjacent elements. np.linspace, on the other hand, gives you precise control over the total number of elements. It sounds like you want to use np.linspace instead:
import numpy as np
offset = 25.4
x = np.linspace(offset, offset + 40, 200)
print(x)
print(len(x))
Here's the documentation page for np.linspace: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html
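If you need the whole family of shifted windows from the question, a minimal sketch along these lines should work (endpoint=False mirrors arange's exclusive stop, keeping the step at exactly 40/200 = 0.2; the range of offsets is an assumption based on the question):

import numpy as np

# Each window has exactly 200 elements, regardless of floating-point rounding in the offsets.
windows = [np.linspace(a, a + 40, num=200, endpoint=False) for a in np.arange(0, 25.6, 0.2)]
print(all(len(w) == 200 for w in windows))  # True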

Unable to extract factor loadings from sklearn PCA

I want factor loadings to see which factor loads to which variables. I am referring to following link:
Factor Loadings using sklearn
Here is my code where input_data is the master_data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

X = master_data_predictors.values
# Scaling the values
X = scale(X)
# taking equal number of components as number of variables
# initially we have 9 variables
pca = PCA(n_components=9)
pca.fit(X)
# The amount of variance that each PC explains
var = pca.explained_variance_ratio_
# Cumulative variance explained
var1 = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
print(var1)
[ 74.75 85.85 94.1 97.8 98.87 99.4 99.75 100. 100. ]
# Retaining 4 components as they explain 98% of variance
pca = PCA(n_components=4)
pca.fit(X)
X1 = pca.fit_transform(X)
print(pca.components_)
array([[ 0.38454129, 0.37344315, 0.2640267 , 0.36079567, 0.38070046,
0.37690887, 0.32949014, 0.34213449, 0.01310333],
[ 0.00308052, 0.00762985, -0.00556496, -0.00185015, 0.00300425,
0.00169865, 0.01380971, 0.0142307 , -0.99974635],
[ 0.0136128 , 0.04651786, 0.76405944, 0.10212738, 0.04236969,
0.05690046, -0.47599931, -0.41419841, -0.01629199],
[-0.09045103, -0.27641087, 0.53709146, -0.55429524, 0.058524 ,
-0.19038107, 0.4397584 , 0.29430344, 0.00576399]])
import math
loadings = pca.components_.T * math.sqrt(pca.explained_variance_)
It gives me the following error: 'only length-1 arrays can be converted to Python scalars'.
I understand the problem: I have to traverse the pca.components_ and pca.explained_variance_ arrays, something like:
## just a thought
Loading = np.empty((8, 4))
for i, j in (pca.components_, pca.explained_variance_):
    loading = i * math.sqrt(j)
    Loading = Loading.append(loading)
## unable to proceed further
## something wrong here
This is simply a problem of mixing modules. For numpy arrays, use np.sqrt instead of math.sqrt (which only works on single values, not arrays).
Your last line should thus read:
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
This is a mistake in the original answers you linked to. I have edited them accordingly.
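For reference, here is a minimal self-contained sketch (random data with 9 hypothetical variables, not the asker's master_data) showing the shape of the resulting loadings matrix:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
X = scale(rng.normal(size=(100, 9)))  # stand-in for the scaled predictor matrix

pca = PCA(n_components=4).fit(X)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print(loadings.shape)  # (9, 4): one row per variable, one column per retained component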

Rescale price list from a longer length to a smaller length

Given the following pandas data frame with 60 elements.
import pandas as pd
data = [60,62.75,73.28,75.77,70.28
,67.85,74.58,72.91,68.33,78.59
,75.58,78.93,74.61,85.3,84.63
,84.61,87.76,95.02,98.83,92.44
,84.8,89.51,90.25,93.82,86.64
,77.84,76.06,77.75,72.13,80.2
,79.05,76.11,80.28,76.38,73.3
,72.28,77,69.28,71.31,79.25
,75.11,73.16,78.91,84.78,85.17
,91.53,94.85,87.79,97.92,92.88
,91.92,88.32,81.49,88.67,91.46
,91.71,82.17,93.05,103.98,105]
data_pd = pd.DataFrame(data, columns=["price"])
Is there a formula to rescale this in such a way that, for each window bigger than 20 elements (starting from index 0 to index i+1), the data is rescaled down to 20 elements?
Here is a loop that creates the windows with the data for rescaling; I just do not know any way of doing the rescaling itself for this problem at hand. Any suggestions on how this might be done?
miniLenght = 20
rescaledData = []
for i in range(len(data_pd)):
    if(i >= miniLenght):
        dataForScaling = data_pd[0:i]
        scaledDataToMinLenght = dataForScaling  # do the scaling here so that the length of the rescaled data is always equal to miniLenght
        rescaledData.append(scaledDataToMinLenght)
Basically after the rescaling the rescaledData should have 40 arrays, each with a length of 20 prices.
From reading the paper, it looks like you are resizing the list back to 20 indices, then interpolating the data at your 20 indices.
We'll make the indices like they do (range(0, len(large), step=len(large)/miniLenght)), then use NumPy's interp - there are a million ways of interpolating data. np.interp uses linear interpolation, so if you ask for e.g. index 1.5, you get the mean of points 1 and 2, and so on.
So, here's a quick modification of your code to do it (NB: we could probably fully vectorize this using 'rolling'):
import numpy as np

miniLenght = 20
rescaledData = []
for i in range(len(data_pd)):
    if(i >= miniLenght):
        dataForScaling = data_pd['price'][0:i]
        # figure out how many 'steps' we have
        steps = len(dataForScaling)
        # make indices where the data needs to be sliced to get 20 points
        indices = np.arange(0, steps, step=steps/miniLenght)
        # use np.interp at those points, with the original values as given
        rescaledData.append(np.interp(indices, np.arange(steps), dataForScaling))
And the output is as expected:
[array([ 60. , 62.75, 73.28, 75.77, 70.28, 67.85, 74.58, 72.91,
68.33, 78.59, 75.58, 78.93, 74.61, 85.3 , 84.63, 84.61,
87.76, 95.02, 98.83, 92.44]),
array([ 60. , 63.2765, 73.529 , 74.9465, 69.794 , 69.5325,
74.079 , 71.307 , 72.434 , 77.2355, 77.255 , 76.554 ,
81.024 , 84.8645, 84.616 , 86.9725, 93.568 , 98.2585,
93.079 , 85.182 ]),.....
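As a quick illustration of the interpolation behaviour described above (a made-up three-point series, not taken from the original data):

import numpy as np

# Fractional index 1.5 falls halfway between points 1 and 2, so np.interp returns their mean.
y = np.array([10.0, 20.0, 40.0])
print(np.interp(1.5, np.arange(len(y)), y))  # 30.0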

Limiting a sequence of ratios to a range whilst maintaining overall increase/decrease of values they are multiplying

Sorry my maths isn't fantastic so you'll have to bear with me.
Let's say I have a ratio limit of 3.
I have a numpy array of sizes that are to be multiplied by the ratios and a numpy array of the ratios, some of which are within the limit, some of which aren't.
I need the ratios that are above the limit to be set to the limit, and the ratios that are below the limit to be increased to account for the reduction of the ratios that were over the limit. The result would be that the sum of the sizes is still the same, but the individual sizes haven't been altered by more than the limit.
In [1]: import numpy as np
In [2]: sizes = np.array([2.0,4.0,6.0,8.0,10.0])
In [3]: ratios = np.array([0.5, 0.5, 5.0, 4.0, 0.5])
In [4]: print(np.sum(sizes * ratios))
70.0
#result after limiting ratios would still be 70
Edit:
So in the example above the resulting ratios would be:
np.array([1.75, 1.75, 3.0, 3.0, 1.75])
In [4]: print(np.sum(sizes * ratios))
70.0
The ratios that were previously above the limit have been reduced and the ratios that were below have been raised to compensate.
I think you are looking for something like this:
import numpy as np

def Spread_Ratios(ratios, sizes):
    if np.dot(ratios, sizes)/np.sum(sizes) > 3.:
        print('There is no solution!\n')
        return None
    if np.any(ratios > 3.):
        score = np.dot(sizes, ratios)
        ratios_reduced = np.where(ratios > 3., 3., ratios)
        score_reduced = np.dot(sizes, ratios_reduced)
        delta_ratios = (score - score_reduced) / np.sum(sizes[ratios < 3.])
        new_ratios = ratios_reduced + np.where(ratios < 3., delta_ratios, 0.)
        return Spread_Ratios(new_ratios, sizes)
    else:
        return ratios, sizes
The recursive call is necessary since it is possible that a ratio below 3 (but close to it) is lifted above 3 by the redistribution.
Furthermore, it is possible that no solution exists at all. That case is handled by the first if condition.
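Applied to the arrays from the question (reusing the function and the numpy import above), a quick check; the exact printed formatting may differ:

sizes = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
ratios = np.array([0.5, 0.5, 5.0, 4.0, 0.5])

new_ratios, _ = Spread_Ratios(ratios, sizes)
print(new_ratios)                  # [1.75 1.75 3.   3.   1.75]
print(np.sum(sizes * new_ratios))  # 70.0, unchanged from the original total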

NumPy or SciPy to calculate weighted median

I'm trying to automate a process that JMP does (Analyze->Distribution, entering column A as the "Y value", using subsequent columns as the "weight" value). In JMP you have to do this one column at a time - I'd like to use Python to loop through all of the columns and create an array showing, say, the median of each column.
For example, if the mass array is [0, 10, 20, 30], and the weight array for column 1 is [30, 191, 9, 0], the weighted median of the mass array should be 10. However, I'm not sure how to arrive at this answer.
So far I've imported the csv showing the weights as an array (masking values of 0), and created an array of the "Y value" the same shape and size as the weights array (113x32). I'm not entirely sure I need to do this, but thought it would be easier than a for loop for the purpose of weighting.
I'm not sure exactly where to go from here. Basically the "Y value" is a range of masses, and all of the columns in the array represent the number of data points found for each mass. I need to find the median mass, based on the frequency with which they were reported.
I'm not an expert in Python or statistics, so if I've omitted any details that would be useful let me know!
Update: here's some code for what I've done so far:
# Boilerplate & import files
import csv
import scipy as sp
from scipy import stats
from scipy.stats import norm
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt

inputFile = '/Users/cl/prov.csv'
origArray = genfromtxt(inputFile, delimiter=",")
nArray = np.array(origArray)
dimensions = nArray.shape
shape = np.asarray(dimensions)

# Mask values == 0
maTest = np.ma.masked_equal(nArray, 0)

# Create array of masses the same shape as the weights (nArray)
fieldLength = shape[0]
rowLength = shape[1]
massArr = []  # the list needs to exist before the loop appends to it
for i in range(rowLength):
    createArr = np.arange(0, fieldLength*10, 10)
    nCreateArr = np.array(createArr)
    massArr.append(nCreateArr)
nCreateArr = np.array(massArr)
nmassArr = nCreateArr.transpose()
What we can do, if I understood your problem correctly, is sum up the observations; dividing that total by 2 gives us the observation number corresponding to the median. From there we need to figure out which mass this observation falls on.
One trick here is to calculate the observation sums with np.cumsum, which gives us a running cumulative sum.
Example:
np.cumsum([1,2,3,4]) -> [ 1, 3, 6, 10]
Each element is the sum of all previous elements and itself. We have 10 observations here, so the median would be the 5th observation. (We get 5 by dividing the last element by 2.)
Now looking at the cumsum result, we can easily see that this must be the observation between the second and third elements (between cumulative counts 3 and 6).
So all we need to do is figure out the index of where the median (5) would fit.
np.searchsorted does exactly what we need: it finds the index at which to insert an element into an array so that it stays sorted.
The code to do it looks like this:
import numpy as np

# my test data
freq_count = np.array([[30, 191, 9, 0], [10, 20, 300, 10], [10, 20, 30, 40], [100, 10, 10, 10], [1, 1, 1, 100]])

c = np.cumsum(freq_count, axis=1)
indices = [np.searchsorted(row, row[-1]/2.0) for row in c]
masses = [i * 10 for i in indices]  # correct if the masses are indeed 0, 10, 20, ...

# This is just for explanation.
print("median masses is:", masses)
print(freq_count)
print(np.hstack((c, c[:, -1, np.newaxis]/2.0)))
Output will be:
median masses is: [10 20 20 0 30]
[[ 30 191 9 0] <- The test data
[ 10 20 300 10]
[ 10 20 30 40]
[100 10 10 10]
[ 1 1 1 100]]
[[ 30. 221. 230. 230. 115. ] <- cumsum results with median added to the end.
[ 10. 30. 330. 340. 170. ] you can see from this where they fit in.
[ 10. 30. 60. 100. 50. ]
[ 100. 110. 120. 130. 65. ]
[ 1. 2. 3. 103. 51.5]]
wquantiles is a small python package that will do exactly what you need. It just uses np.cumsum() and np.interp() under the hood.
Since this is the top hit on Google for weighted median in NumPy, I will add my minimal function to select the weighted median from two arrays without changing their contents, and with no assumptions about the order of the values (on the off-chance that anyone else comes here looking for a quick recipe for the same exact pre-conditions).
def weighted_median(values, weights):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, 0.5 * c[-1])]]
Using argsort lets us maintain the alignment between the two arrays without changing or copying their content. It should be straightforward to extend it to an arbitrary number of arbitrary quantiles.
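A quick usage sketch with made-up numbers (assumes numpy imported as np and the weighted_median function above):

values = np.array([3.0, 1.0, 2.0])
weights = np.array([1.0, 1.0, 10.0])

# The heavy weight on 2.0 pulls the weighted median to 2.0.
print(weighted_median(values, weights))  # 2.0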
Update
Since it may not be fully obvious at first blush exactly how easy it is to extend to arbitrary quantiles, here is the code:
def weighted_quantiles(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, np.array(quantiles) * c[-1])]]
This defaults to the median, but you can pass in any quantile, or a list of quantiles. The return type is equivalent to what you pass in as quantiles, with lists promoted to NumPy arrays. With enough uniformly distributed values, the computed quantiles track the requested quantiles quite closely:
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99])
array([0.01235101, 0.05341077, 0.25355715, 0.50678338, 0.75697424, 0.94962936, 0.98980785])
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), 0.5)
0.5036283072043176
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.5])
array([0.49851076])
Update 2
In small data sets where the median/quantile is not actually observed, it may be important to be able to interpolate a point between two observations. This can be fairly easily added by calculating the midpoint between two numbers in the case where the weight mass is equally (or quantile/1-quantile) divided between them. Due to the need for a conditional, this function always returns a NumPy array, even when quantiles is a single scalar. The inputs also need to be NumPy arrays now (except quantiles, which may still be a single number).
def weighted_quantiles_interpolate(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    q = np.searchsorted(c, quantiles * c[-1])
    return np.where(c[q]/c[-1] == quantiles, 0.5 * (values[i[q]] + values[i[q+1]]), values[i[q]])
This function will fail with arrays smaller than 2 (the original would handle non-empty arrays).
>>> weighted_quantiles_interpolate(np.array([2, 1]), np.array([1, 1]), 0.5)
array(1.5)
Note that this extension is fairly unlikely to be needed when working with actual data sets where we typically have (a) large data sets, and (b) real-values weights that make the odds of ending up exactly at a quantile edge very long, and probably due to rounding errors when it does happen. Including it for completeness nonetheless.
I ended up writing this function based on @muzzle's and @maesers' replies:
def weighted_quantiles(values, weights, quantiles=0.5, interpolate=False):
    i = values.argsort()
    sorted_weights = weights[i]
    sorted_values = values[i]
    Sn = sorted_weights.cumsum()
    if interpolate:
        Pn = (Sn - sorted_weights/2) / Sn[-1]
        return np.interp(quantiles, Pn, sorted_values)
    else:
        return sorted_values[np.searchsorted(Sn, quantiles * Sn[-1])]
The difference between interpolate True and False is as follows:
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4))
> 2
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), interpolate=True)
> 2.5
(there is no difference for odd-length arrays such as [1, 2, 3, 4, 5])
Speed tests show it is just as performant as @maesers' function in the uninterpolated case, and twice as performant in the interpolated case.
Sharing some code that I got a hand with. This allows you to run stats on each column of an Excel spreadsheet.
import xlrd
import sys
import csv
import numpy as np
import itertools
from itertools import chain

book = xlrd.open_workbook('/filepath/workbook.xlsx')
sh = book.sheet_by_name("Sheet1")
ofile = '/outputfilepath/workbook.csv'

masses = sh.col_values(0, start_rowx=1)  # first column has mass
ages = sh.row_values(0, start_colx=1)    # first row has age ranges

# read one weight column per age range
count = 1
age = []
for a in ages:
    age.append(sh.col_values(count, start_rowx=1))
    count += 1

stats = []
count = 0
for a in ages:
    expanded = []
    # create a tuple with the mass vector
    age_mass = zip(masses, age[count])
    count += 1
    # replicate element[0] for element[1] times
    expanded = list(list(itertools.repeat(am[0], int(am[1]))) for am in age_mass)
    # separate into one big list
    medianlist = [x for t in expanded for x in t]
    # convert to array and mask out zeroes
    npa = np.array(medianlist)
    npa = np.ma.masked_equal(npa, 0)
    median = np.median(npa)
    meanMass = np.average(npa)
    maxMass = np.max(npa)
    minMass = np.min(npa)
    stdev = np.std(npa)
    stats1 = [median, meanMass, maxMass, minMass, stdev]
    print(stats1)
    stats.append(stats1)

np.savetxt(ofile, (stats), fmt="%d")
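For comparison, here is a minimal sketch (using a hypothetical two-column frequency table rather than the spreadsheet) of getting the weighted median without expanding each mass by its count, reusing the cumsum/searchsorted idea from the earlier answer:

import numpy as np

masses = np.arange(0, 40, 10)        # hypothetical masses 0, 10, 20, 30
counts = np.array([30, 191, 9, 0])   # frequency of each mass, as in the question's example

c = np.cumsum(counts)
median_mass = masses[np.searchsorted(c, c[-1] / 2.0)]
print(median_mass)  # 10, matching the expected weighted median from the question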
