I'm trying to space a number of points between a start and an end frequency, as shown below:
Startfreq = 1 Hz ( variable )
Stopfreq = 5402 Hz ( also variable )
How I want it to look:
1 - 2 - 3 - 4.. 10 - 20 - 30..100 - 200 - 300.. 1000 - 2000 - 3000 - 4000 - 5000 - 5402
1 - steps based on the stepsperdecade - 10 - steps based on the stepsperdecade - 100 .. 1000 - steps based on the stepsperdecade 5402.
So I want the spacing to stay the same within each decade until it reaches the end frequency.
I tried to do it in the following way in Python.
from math import log10
import numpy as np
startfreq = 1
endfreq = 10000
points_per_decade = 10
numberdecades = log10(endfreq) - log10(startfreq)
points = int(numberdecades) * points_per_decade
points = np.logspace(log10(startfreq), log10(endfreq), num=points, endpoint=True, base=10)
But this doesn't give me the 10 - 100 - 1000 I want in between the steps.
Does anyone know how to do this, or could someone point me in the right direction?
I don't know if this works for you, but using some basic maths I created this while-loop snippet:
from math import log10
startfreq = 1
endfreq = 5402
points_per_decade = 10
points = [startfreq]
ndig = int(log10(startfreq))
point = startfreq - startfreq % 10 ** ndig + 10 ** ndig
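while point < endfreq:
    points.append(point)
    # the step size is the power of ten of the current decade
    ndig = int(log10(point))
    point = round(point + 10 ** ndig, ndigits=-ndig)
# finish with the exact end frequency
points.append(endfreq)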
I edited the answer to fix certain values, like startfreq = 175 should produce 200 as the next value, then continue in steps of +100: [175, 200, 300...]
You could do this comfortably with numpy arrays, by taking an outer product:
import numpy as np
exponents = np.arange(0, 4)  # -> [0, 1, 2, 3]
prefactors = np.arange(1, 10)  # -> [1, 2, ..., 9]
factor_matrix = np.outer(10**exponents, prefactors)
This will give you what you want in matrix form:
[[1, 2, ..., 9],
 [10, 20, ..., 90],
 [100, 200, ..., 900],
 [1000, 2000, ..., 9000]]
Of course, you want a flat array that stops before endpoint=5402, then append endpoint manually:
endpoint = 5402
flattened_array = factor_matrix.flatten()
# keep only the values below the end frequency, then add it as the last point
flattened_array = flattened_array[flattened_array < endpoint]
flattened_array = np.append(flattened_array, endpoint)
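The outer-product idea can also be wrapped in a function so that the decades are derived from the two frequencies instead of being hard-coded. This is only a minimal sketch: the name decade_grid is made up for illustration, and it assumes both frequencies are positive.

```python
import math
import numpy as np

def decade_grid(startfreq, endfreq):
    """Return startfreq, every 1-9 multiple of a power of ten in between, then endfreq."""
    # lowest and highest decade touched by the range
    lo = math.floor(math.log10(startfreq))
    hi = math.floor(math.log10(endfreq))
    exponents = np.arange(lo, hi + 1)
    # one row per decade: [1, 2, ..., 9] * 10**exponent, flattened in ascending order
    grid = np.outer(10.0 ** exponents, np.arange(1, 10)).ravel()
    # keep only the grid values strictly between the two frequencies
    inside = grid[(grid > startfreq) & (grid < endfreq)]
    return np.concatenate(([startfreq], inside, [endfreq]))

points = decade_grid(1, 5402)
print(points)
```

With startfreq = 175 the same function should start at 175 and then continue with 200, 300 and so on, because the grid values at or below the start frequency are filtered out.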
I'm trying to produce a geometric sequence, something similar to 1, 2, 4, 8...
I have the following code:
import numpy as np
lower_price = 1
upper_price = 2
total_grids = 10
grid_box = np.linspace(lower_price , upper_price, total_grids, retstep=True)
This outputs:
(array([1. , 1.11111111, 1.22222222, 1.33333333, 1.44444444,
1.55555556, 1.66666667, 1.77777778, 1.88888889, 2. ]), 0.1111111111111111)
This code creates an arithmetic, rather than a geometric sequence. How can I fix this code to produce the latter as opposed to the former?
You're looking for np.logspace, not np.linspace:
For example,
# Lower bound is 2**0 == 1
# Upper bound is 2**10 == 1024
np.logspace(0, 10, 10, base=2)
[1.00000000e+00 2.16011948e+00 4.66611616e+00 1.00793684e+01
2.17726400e+01 4.70315038e+01 1.01593667e+02 2.19454460e+02
4.74047853e+02 1.02400000e+03]
If you're trying to get 10 values between 1 and 2, use:
# Lower bound is 2**0 == 1
# Upper bound is 2**1 == 2
np.logspace(0, 1, 10, base=2)
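If you would rather pass the prices themselves than their logarithms, np.geomspace might be more convenient. A small sketch using the variable names from the question; the ratios array is only there to check that the sequence is geometric:

```python
import numpy as np

lower_price = 1
upper_price = 2
total_grids = 10

# geomspace takes the endpoints themselves, not their exponents
grid_box = np.geomspace(lower_price, upper_price, total_grids)

# in a geometric sequence the ratio of consecutive terms is constant
ratios = grid_box[1:] / grid_box[:-1]
print(grid_box)
print(ratios)
```

Every entry of ratios should equal (upper_price / lower_price) ** (1 / (total_grids - 1)), up to floating-point rounding.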
'percentage' gives the increment between consecutive values as a percentage of the full range. It stays constant for a fixed total_grids and changes only when total_grids changes.
import numpy as np
lower_price = 10
upper_price = 2000
total_grids = 10
grid_box = np.linspace(lower_price , upper_price, total_grids, retstep=True)
full_range = upper_price - lower_price
correctedStartValue = grid_box[0][1] - lower_price
percentage = (correctedStartValue * 100) / full_range
How can I extract the maximum value from four points in the neighborhood of a specified coordinate?
import xarray as xr
import numpy as np
lat = [0, 10, 20]
lon = [50, 60, 70, 80]
#sample data
test_data = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
#to xarray
data_xarray = xr.DataArray(test_data, dims=("lat","lon"), coords={"lat":lat, "lon":lon})
#<xarray.DataArray (lat: 3, lon: 4)>
#array([[ 1, 2, 3, 4],
# [ 5, 6, 7, 8],
# [ 9, 10, 11, 12]])
# * lat (lat) int64 0 10 20
# * lon (lon) int64 50 60 70 80
What I want to implement
When 5.5 and 52 are specified for lat and lon respectively, extract 6, the maximum value of the four surrounding points (1, 2, 5 and 6).
Labeled indexing using sel supports nearest neighbour lookup.
You can use that to look up the four values of interest, reconcatenate them and then compute the max:
lat_search = 5.5
lon_search = 52
# Select the four nearest values
llat_llon = data_xarray.sel(lat=lat_search, lon=lon_search, method="pad")
ulat_ulon = data_xarray.sel(lat=lat_search, lon=lon_search, method="backfill")
ulat_llon = data_xarray.sel(lat=lat_search, method="backfill").sel(
    lon=lon_search, method="pad"
)
llat_ulon = data_xarray.sel(lat=lat_search, method="pad").sel(
    lon=lon_search, method="backfill"
)
# Combine the four values providing them in the correct order
ds_grid = [[llat_llon, ulat_llon], [llat_ulon, ulat_ulon]]
neighbours = xr.combine_nested(ds_grid, concat_dim=("lon", "lat"))
# Alternatively, combine them automatically
neighbours = xr.combine_by_coords(
    [
        x.to_dataset(name="foo").expand_dims(["lat", "lon"])
        for x in [llat_llon, ulat_llon, llat_ulon, ulat_ulon]
    ]
)
# Compute the maximum value
neighbours.max()
I admit that selecting the four values manually and recombining them is not very elegant (particularly if you would like to scale that to more than two dimensions).
I don't see a general way to retrieve both neighbours at the same time using sel.
If you have a regularly spaced grid of coordinates, you can select all neighbours at the same time passing a slice to sel:
delta_lat = 10
delta_lon = 10
neighbours = data_xarray.sel(
    lat=slice(lat_search - delta_lat, lat_search + delta_lat),
    lon=slice(lon_search - delta_lon, lon_search + delta_lon),
)
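For comparison, the same 2x2 window can be found on the underlying arrays with np.searchsorted. This is a different technique from sel, shown with plain NumPy rather than xarray. The function name max_of_neighbours is hypothetical, and the sketch assumes both coordinate arrays are sorted in ascending order.

```python
import numpy as np

lat = np.array([0, 10, 20])
lon = np.array([50, 60, 70, 80])
test_data = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

def max_of_neighbours(data, lat, lon, lat_search, lon_search):
    # index of the first coordinate that is >= the search value
    i = np.searchsorted(lat, lat_search)
    j = np.searchsorted(lon, lon_search)
    # clip so that points on or beyond the edge still give a valid 2x2 window
    i = int(np.clip(i, 1, len(lat) - 1))
    j = int(np.clip(j, 1, len(lon) - 1))
    # rows i-1..i and columns j-1..j surround the search point
    return data[i - 1:i + 1, j - 1:j + 1].max()

print(max_of_neighbours(test_data, lat, lon, 5.5, 52))
```

If the search value coincides exactly with a grid coordinate, this sketch uses that coordinate and the one below it, which may or may not be the window you want.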
For a data set consisting of:
- coordinates x, y
- depth z
- a certain value c
I would like to do the following more efficiently:
- bin the data set in 2D bins based on the coordinates (x, y)
- take the 10 deepest data points (z) per bin
- calculate the mean value of c of these 10 data points per bin
Finally, show a 2D heatmap with the calculated mean values.
I have found a working solution, but this takes too long for small bins and/or large data sets.
Is there a more efficient way of achieving the same result?
Current working example
Example dataframe:
import numpy as np
from numpy.random import rand
import pandas as pd
import math
import matplotlib.pyplot as plt
n = 10000
df = pd.DataFrame({'x':rand(n), 'y':rand(n), 'z':rand(n), 'c':rand(n)})
Bin the data set:
cell_size = 0.01
nx = math.ceil((max(df['x']) - min(df['x'])) / cell_size)
ny = math.ceil((max(df['y']) - min(df['y'])) / cell_size)
x_range = np.arange(0, nx)
y_range = np.arange(0, ny)
df['xbin'], x_edges = pd.cut(x=df['x'], bins=nx, labels=x_range, retbins=True)
df['ybin'], y_edges = pd.cut(x=df['y'], bins=ny, labels=y_range, retbins=True)
Code that currently takes too long:
df = df.groupby(['xbin', 'ybin']).apply(
lambda d: d.sort_values('z').head(10).mean())
Update an empty DataFrame for the bins without data and show result:
index = pd.MultiIndex.from_product([x_range, y_range],
names=['xbin', 'ybin'])
tot_df = pd.DataFrame(index=index, columns=['z', 'c'])
zval = tot_df['c'].astype('float').values
zval = zval.reshape((nx, ny))
zval = zval.T
zval = np.flipud(zval)
extent = [min(x_edges), max(x_edges), min(y_edges), max(y_edges)]
plt.matshow(zval, aspect='auto', extent=extent)
You can use np.searchsorted to bin the rows by x and y, then use groupby to take the 10 deepest values per bin and compute their means. Because groupby preserves the row order within each group, you can sort by z before assigning the bins. groupby also performs better without apply.
df = pd.DataFrame({'x':rand(n), 'y':rand(n), 'z':rand(n), 'c':rand(n)})
df = df.sort_values("z", ascending=False)
bins = np.linspace(0, 1, 11)
df["bin_x"] = np.searchsorted(bins, df['x'].values) - 1
df["bin_y"] = np.searchsorted(bins, df['y'].values) - 1
result = df.groupby(["bin_x", "bin_y"]).head(10)
result.groupby(["bin_x", "bin_y"])["c"].mean()
bin_x  bin_y
0      0        0.369531
       1        0.601803
       2        0.554452
       3        0.575464
       4        0.455198
                  ...
9      5        0.469838
       6        0.420772
       7        0.367549
       8        0.379200
       9        0.523083
Name: c, Length: 100, dtype: float64
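The sort-then-take-the-first-10 idea can also be sketched with NumPy alone, which returns the 2D array for the heatmap directly. This is an illustration under assumptions, not a benchmarked replacement: all names are made up, the coordinates are assumed to lie in [0, 1), and "deepest" is taken to mean the largest z, as in the snippet above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, nbins, k = 10_000, 10, 10
x, y, z, c = rng.random((4, n))

# integer bin index per axis, combined into one cell id per row
bx = (x * nbins).astype(int)
by = (y * nbins).astype(int)
cell = bx * nbins + by

# sort by cell first and by descending z within each cell
order = np.lexsort((-z, cell))
cell_sorted = cell[order]
c_sorted = c[order]

# rank of every row inside its own cell (0 = deepest)
starts = np.flatnonzero(np.r_[True, cell_sorted[1:] != cell_sorted[:-1]])
counts = np.diff(np.r_[starts, n])
rank = np.arange(n) - np.repeat(starts, counts)
keep = rank < k

# mean of c over the kept rows of each cell; empty cells stay NaN
sums = np.bincount(cell_sorted[keep], weights=c_sorted[keep], minlength=nbins * nbins)
hits = np.bincount(cell_sorted[keep], minlength=nbins * nbins)
mean_c = np.full(nbins * nbins, np.nan)
np.divide(sums, hits, out=mean_c, where=hits > 0)
heat = mean_c.reshape(nbins, nbins)  # indexed as heat[bin_x, bin_y]
```

The heat array can then be passed to plt.matshow, with the same transpose and flip as in the question if you want the y axis pointing up.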
I've got two music files: a lossless one with a short gap at the beginning (currently just silence, but it could be anything: a sinusoid or just some noise) and an mp3:
In [1]: plt.plot(y[:100000])
In [2]: plt.plot(y2[:100000])
These lists are similar but not identical, so to cut the gap I need to find the first occurrence of one list in the other with the lowest delta error.
And here's my solution (5.7065 sec.):
error = []
for i in range(25000):
    y_n = y[i:100000]
    y2_n = y2[:100000-i]
    error.append(abs(y_n - y2_n).mean())
start = np.array(error).argmin()
print(start, error[start]) #23057 0.0100046
Is there any pythonic way to solve this?
After calculating the mean distance between special points (e.g. where data == 0.5) I reduce the area of search from 25000 to 2000. This gives me reasonable time of 0.3871s:
a = np.where(y[:100000].round(1) == 0.5)[0]
b = np.where(y2[:100000].round(1) == 0.5)[0]
mean = int((a - b[:len(a)]).mean())
delta = 1000
error = []
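for i in range(mean - delta, mean + delta):
    # same comparison as in the full search above
    y_n = y[i:100000]
    y2_n = y2[:100000-i]
    error.append(abs(y_n - y2_n).mean())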
What you are trying to do is a cross-correlation of the two signals.
This can be done easily using signal.correlate from the scipy library:
import scipy.signal
import numpy as np
# limit your signal length to speed things up
lim = 25000
# do the actual correlation
corr = scipy.signal.correlate(y[:lim], y2[:lim], mode='full')
# The offset is the maximum of your correlation array,
# itself being offset by (lim - 1):
offset = np.argmax(corr) - (lim - 1)
You might want to take a look at this answer to a similar problem.
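To check the sign convention of the offset before running this on real audio, it can help to try it on synthetic data with a known gap. The sketch below uses np.correlate in place of scipy.signal.correlate; for two equal-length 1-D inputs in 'full' mode the offset is read off the same way. The names clean and delayed and the gap of 37 samples are made up for the test.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gap = 500, 37

# stands in for y2: the signal without a leading gap
clean = rng.standard_normal(n)
# stands in for y: the same signal preceded by `gap` samples of silence
delayed = np.concatenate([np.zeros(gap), clean])[:n]
# small differences between the two, as a lossy encoding would introduce
delayed = delayed + 0.05 * rng.standard_normal(n)

corr = np.correlate(delayed, clean, mode="full")
# index n - 1 of the 'full' output corresponds to zero lag
offset = int(np.argmax(corr)) - (n - 1)
print(offset)
```

A positive offset here means the first argument lags the second, so delayed[offset:] lines up with clean.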
Let's generate some data first
N = 1000
y1 = np.random.randn(N)
y2 = y1 + np.random.randn(N) * 0.05
y2[0:int(N / 10)] = 0
In these data, y1 and y2 are almost the same (note the small added noise), but the first 10% of y2 is zeroed out (similar to your example).
We can now calculate the absolute difference between the two vectors and find the first element for which the absolute difference is below a sensitivity threshold:
THRESHOLD = 0.2  # assumed value; tune it to your noise level
abs_delta = np.abs(y1 - y2)
sel = abs_delta < THRESHOLD
ix_start = np.where(sel)[0][0]
fig, axes = plt.subplots(3, 1)
ax = axes[0]
ax.plot(y1, '-')
ax.axvline(ix_start, color='red')
ax = axes[1]
ax.plot(y2, '-')
ax.axvline(ix_start, color='red')
ax = axes[2]
ax.plot(abs_delta, '-')
ax.axvline(ix_start, color='red')
ax.set_title('abs diff')
This method works if the overlapping parts are indeed "almost identical". You will have to think of smarter alignment ways if the similarity is low.
I think what you are looking for is correlation. Here is a small example.
import numpy as np
equal_part = [0, 1, 2, 3, -2, -4, 5, 0]
y1 = equal_part + [0, 1, 2, 3, -2, -4, 5, 0]
y2 = [1, 2, 4, -3, -2, -1, 3, 2]+y1
np.argmax(np.correlate(y1, y2, 'same'))
So this returns the time-difference, where the correlation between both signals is at its maximum. As you can see, in the example the time difference should be 8, but this depends on your data...
Also note that both signals have the same length.
I have a set of data and want to make a histogram of it. I need the bins to have the same size, by which I mean that they must contain the same number of objects, rather than the more common (numpy.histogram) case of equally spaced bins.
This will naturally come at the expense of the bin widths, which can - and in general will - be different.
I will specify the number of desired bins and the data set, obtaining the bin edges in return.
data = numpy.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
bins_edges = somefunc(data, nbins=3)
>> [1.,1.3,2.1,2.12]
So the bins all contain 2 points, but their widths (0.3, 0.8, 0.02) are different.
There are two limitations:
- if a group of data is identical, the bin containing them could be bigger.
- if there are N data and M bins are requested, there will be N/M bins plus one if N%M is not 0.
This piece of code is some cruft I've written, which worked nicely for small data sets. What if I have 10**9+ points and want to speed up the process?
import numpy as np
def def_equbin(in_distr, binsize=None, bin_num=None):
    distr_size = len(in_distr)
    # integer division: number of points per bin, and the leftover points
    bin_size = distr_size // bin_num
    odd_bin_size = distr_size % bin_num
    args = in_distr.argsort()
    hist = np.zeros((bin_num, bin_size))
    for i in range(bin_num):
        hist[i, :] = in_distr[args[i * bin_size: (i + 1) * bin_size]]
    if odd_bin_size == 0:
        odd_bin = None
        bins_limits = np.arange(bin_num) * bin_size
        bins_limits = args[bins_limits]
        bins_limits = np.concatenate((in_distr[bins_limits],
                                      [in_distr[args[-1]]]))
    else:
        # the leftover points go into one extra bin
        odd_bin = in_distr[args[bin_num * bin_size:]]
        bins_limits = np.arange(bin_num + 1) * bin_size
        bins_limits = args[bins_limits]
        bins_limits = in_distr[bins_limits]
        bins_limits = np.concatenate((bins_limits, [in_distr[args[-1]]]))
    return (hist, odd_bin, bins_limits)
Using your example case (bins of 2 points, 6 total data points):
from scipy import stats
bin_edges = stats.mstats.mquantiles(data, [0, 2./6, 4./6, 1])
>> array([1. , 1.24666667, 2.05333333, 2.12])
I would like to mention also the existence of pandas.qcut, which does equi-populated binning in quite an efficient way. In your case it would work something like
data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
# parameter q specifies the number of bins
qc = pd.qcut(data, q=3, precision=1)
# bin definition
bins = qc.categories
>> Index(['[1, 1.3]', '(1.3, 2.03]', '(2.03, 2.1]'], dtype='object')
# bin corresponding to each point in data
codes = qc.codes
>> array([0, 0, 1, 1, 2, 2], dtype=int8)
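If neither SciPy nor pandas is at hand, a similar result can probably be had from np.quantile with evenly spaced probabilities. A minimal sketch on the question's data; note that the default linear interpolation places the inner edges between data points, so they differ slightly from the edges requested in the question:

```python
import numpy as np

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
nbins = 3

# nbins + 1 evenly spaced probabilities give nbins equally populated bins
bins_edges = np.quantile(data, np.linspace(0, 1, nbins + 1))

# check the population of each bin
counts, _ = np.histogram(data, bins=bins_edges)
print(bins_edges)
print(counts)
```

For this data set each of the three bins should end up with two points.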
Update for skewed distributions:
I came across the same problem as @astabada, wanting to create bins each containing an equal number of samples. When applying the solution proposed by @aganders3, I found that it didn't work particularly well for skewed distributions. In the case of skewed data (for example something with a whole lot of zeros), stats.mstats.mquantiles for a predefined number of quantiles will not guarantee an equal number of samples in each bin. You will get bin edges that look like this:
[0. 0. 4. 9.]
In which case the first bin will be empty.
In order to deal with skewed cases, I created a function that calls stats.mstats.mquantiles and then dynamically modifies the number of bins if samples are not equal within a certain tolerance (30% of the smallest sample size in the example code). If samples are not equal between bins, the code reduces the number of equally-spaced quantiles by 1 and calls stats.mstats.mquantiles again until sample sizes are equal or only one bin exists.
I hard coded the tolerance in the example, but this could be modified to a keyword argument if desired.
I also prefer giving the number of equally spaced quantiles as an argument to my function instead of giving user defined quantiles to stats.mstats.mquantiles in order to reduce accidental errors (i.e. something like [0., 0.25, 0.7, 1.]).
Here's the code:
import numpy as np
from scipy import stats
def equibins(dat, binnum, **kwargs):
    numin = binnum
    while numin > 1:
        qtls = np.linspace(0., 1.0, num=numin, endpoint=False)
        ebins = stats.mstats.mquantiles(dat, qtls, alphap=kwargs['alpha'], betap=kwargs['beta'])
        allhist, allbin = np.histogram(dat, bins=ebins)
        if (np.unique(ebins).shape != ebins.shape or not tolerence(allhist, 0.3)) and numin > 2:
            # edges repeat or the bins are too unequal: retry with one quantile fewer
            numin = numin - 1
            del qtls, ebins
        else:
            # bins are acceptable, or no further reduction is possible
            break
    return ebins
def tolerence(narray, percent):
    # accept the tolerance either as a fraction (0.3) or as a percentage (30)
    if percent > 1.0:
        per = percent / 100.
    else:
        per = percent
    lev_tol = per * narray.min()
    tolerate = np.all(narray[1:] - narray[0] < lev_tol)
    return tolerate
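The repeated-edge situation that triggers the reduction above is easy to reproduce. The sketch below uses np.quantile in place of stats.mstats.mquantiles purely for illustration, with a made-up array dominated by zeros:

```python
import numpy as np

# skewed sample: most of the values are zero
data = np.array([0, 0, 0, 0, 0, 0, 1, 4, 9])
nbins = 3

edges = np.quantile(data, np.linspace(0, 1, nbins + 1))

# the same test as np.unique(ebins).shape != ebins.shape in equibins
has_repeated_edges = np.unique(edges).size < edges.size
print(edges, has_repeated_edges)
```

Here the first two edges are both 0, so the first bin has zero width, which is the case where asking for fewer bins is a reasonable fallback.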
Just sort the data, and divide it into fixed bins by length! Obviously you can never divide into exactly equally populated bins, if the number of samples does not divide exactly by the number of bins.
import math
import numpy as np
data = np.array([2,3,5,6,8,5,5,6,3,2,3,7,8,9,8,6,6,8,9,9,0,7,5,3,3,4,5,6,7])
data_sorted = np.sort(data)
nbins = 3
step = math.ceil(len(data_sorted)//nbins+1)
binned_data = []
for i in range(0,len(data_sorted),step):