Building a weighted histogram using two binary files - python

I have two binary files that I need to iterate through simultaneously so that the value yielded in one file corresponds correctly (same location) to the value yielded in the other. I'm sorting values into histogram bins and the value from one file corresponds to the weight of the value from the other file.
I tried the following syntax:
import math
import numpy as np
import struct
import matplotlib.pyplot as plt
low = np.inf
high = -np.inf
struct_fmt = 'f'
struct_len = struct.calcsize(struct_fmt)
struct_unpack = struct.Struct(struct_fmt).unpack_from
file = "/projects/current/real-core-snaps/core4_256_velx_0009.bin"
file2 = "/projects/current/real-core-snaps/core4_256_dens_0009.bin"
def read_chunks(f, length):
    while True:
        data = f.read(length)
        if not data: break
        yield data
loop = 0
with open(file,"rb") as f:
    for chunk in read_chunks(f, struct_len):
        x = struct_unpack(chunk)
        low = np.minimum(x, low)
        high = np.maximum(x, high)
        loop += 1
nbins = math.ceil(math.sqrt(loop))
bin_edges = np.linspace(low, high, nbins + 1)
total = np.zeros(nbins, np.int64)
f = open(file,"rb")
f2 = open(file2,"rb")
for chunk1,chunk2 in zip(read_chunks(f, struct_len),read_chunks(f2, struct_len)):
    subtotal,e = np.histogram(struct_unpack(chunk1),bins=bin_edges,weights=struct_unpack(chunk2))
    total = np.add(total,subtotal,out=total,casting="unsafe")
plt.hist(bin_edges[:-1], bins=bin_edges, weights=total)
plt.savefig('hist-veldens.svg')
but the histogram produced is ridiculous (see below). What am I doing wrong?
The data files are located at https://drive.google.com/file/d/1fhia2CGzl_aRX9Q9Ng61W-4XJGQe1OCV/view?usp=sharing and https://drive.google.com/file/d/1CrhQjyG2axSFgK9LGytELbxjy3Ndon1S/view?usp=sharing.

The mistake is that total = np.zeros(nbins, np.int64) gives every element of the array total an integer type. In a weighted histogram, subtotal does not hold counts but sums of weights, which are floats, so total should also be of type float. With an integer array and casting="unsafe", the fractional part of each weighted sum is truncated on every addition.
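A minimal sketch of the fix, assuming the two files can be read as matching chunks of values and weights. The function name chunked_weighted_histogram and the synthetic data are illustrative and not taken from the question; the only change that matters is the float64 accumulator.

```python
import numpy as np


def chunked_weighted_histogram(value_chunks, weight_chunks, bin_edges):
    """Accumulate a weighted histogram chunk by chunk into a float array."""
    # Float accumulator: weighted bin sums are not integers.
    total = np.zeros(len(bin_edges) - 1, dtype=np.float64)
    for values, weights in zip(value_chunks, weight_chunks):
        subtotal, _ = np.histogram(
            values, bins=bin_edges, weights=np.asarray(weights, dtype=np.float64)
        )
        total += subtotal
    return total


# Synthetic stand-ins for the velocity and density files (assumption).
rng = np.random.default_rng(0)
values = rng.normal(size=10_000)
weights = rng.uniform(0.1, 2.0, size=10_000)
bin_edges = np.linspace(values.min(), values.max(), 101)

total = chunked_weighted_histogram(
    np.array_split(values, 10), np.array_split(weights, 10), bin_edges
)

# Reference: the same histogram computed in one pass.
reference, _ = np.histogram(values, bins=bin_edges, weights=weights)
```

Reading larger chunks than one 4-byte value at a time (for example with np.fromfile and a count argument) would also cut the number of np.histogram calls considerably.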

Related

How can I get the start and end indices of a note in a volume graph?

I am trying to make a program that tells me when a note has been pressed.
I have the following notes exported as a .wav file (The C Major Scale 4 times with different rhythms, dynamics and in different octaves):
I can get the volumes of my sound file using the following code:
from scipy.io import wavfile
def get_volume(file):
    sr, data = wavfile.read(file)
    if data.ndim > 1:
        data = data[:, 0]
    return data
volumes = get_volume("FILE")
Here is some information about the output:
Max: 27851
Min: -25664
Mean: -0.7569383391943734
A Sample from the array: [ -7987 -8615 -8983 -9107 -9019 -8750 -8324 -7752 -7033 -6156
-5115 -3920 -2610 -1245 106 1377 2520 3515 4364 5077
5659 6113 6441 6639 6708 6662 6518 6288 5962 5525
4963 4265 3420 2418 1264 -27 -1429 -2901 -4388 -5814
-7101 -8186 -9028 -9614 -9955 -10077 -10012 -9785 -9401 -8846]
And here is what I get when I plot the volumes array (x is the index, y is the volume):
I want to get the indices of the start and end of each note, like the ones marked in the image (I marked them by hand, so they are not accurate):
When I looked at the data I realized that it is a 1D array, and I also noticed that when a note gets louder or quieter the change is not smooth. It zigzags, but there is still a trend. So I can't just take the gradient (slope) at each point. I thought about grouping the samples into batches, computing the average gradient of each batch, and doing the calculations with that, like so:
def get_average_gradient(arr):
    # Calculates average gradient
    return sum([i - (sum(arr) / len(arr)) for i in arr]) / len(arr)
def get_note_start_end(arr_size, batch_size, arr):
    # Finds start and end indices
    ranges = []
    curr_range = [0]
    prev_slope = curr_slope = "NO SLOPE"
    has_ended = False
    for i, j in enumerate(arr):
        if j > 0:
            curr_slope = "INCREASING"
        elif j < 0:
            curr_slope = "DECREASING"
        else:
            curr_slope = "NO SLOPE"
        if prev_slope == "DECREASING" and not has_ended:
            if i == len(arr) - 1 or arr[i + 1] < 0:
                if curr_slope != "DECREASING":
                    curr_range.append((i + 1) * batch_size + batch_size)
                    ranges.append(curr_range)
                    curr_range = [(i + 1) * batch_size + batch_size + 1]
                    has_ended = True
        if has_ended and curr_slope == "INCREASING":
            has_ended = False
        prev_slope = curr_slope
    ranges[-1][-1] = arr_size - 1
    return ranges
def get_notes(batch_size, arr):
    # Gets the gradients of the batches
    out = []
    for i in range(0, len(arr), batch_size):
        if i + batch_size > len(arr):
            gradient = get_average_gradient(arr[i:])
        else:
            gradient = get_average_gradient(arr[i: i+batch_size])
        # print(gradient, i)
        out.append(gradient)
    return get_note_start_end(len(arr), batch_size, out)
notes = get_notes(128, volumes)
The problem with this is that if the batch size is too small, it returns the indices of small peaks, which aren't notes on their own. If the batch size is too big, the program misses the start and end indices.
I also tried to get the notes, by using the silence.
Here is the code I used:
from pydub import AudioSegment, silence
audio = intro = AudioSegment.from_wav("C - Major - Test.wav")
dBFS = audio.dBFS
notes = silence.detect_nonsilent(audio, min_silence_len=50, silence_thresh=dBFS-10)
This worked the best, but it still wasn't good enough. Here is what I got:
It identified some notes pretty well, but it wasn't able to identify notes accurately if they didn't become very quiet before the next one was played (like in the second and fourth scales).
I have been thinking about this problem for days and I have basically tried most if not all of the good(?) ideas I had. I am new to analysing audio files. Maybe I am using the wrong data to do what I want to do. Maybe I need to use the frequency data (I tried getting it, but couldn't make sense of it)
Frequency code:
from scipy.fft import *
from scipy.io import wavfile
import matplotlib.pyplot as plt
def get_freq(file, start_time, end_time):
    sr, data = wavfile.read(file)
    if data.ndim > 1:
        data = data[:, 0]
    else:
        pass
    # Fourier Transform
    N = len(data)
    yf = rfft(data)
    xf = rfftfreq(N, 1 / sr)
    return xf, yf
FILE = "C - Major - Test.wav"
plt.plot(*get_freq(FILE, 0, 10))
plt.show()
And the frequency graph:
And here is the .wav file:
https://drive.google.com/file/d/1CERH-eovu20uhGoV1_O3B2Ph-4-uXpiP/view?usp=sharing
Any help is appreciated :)
I think this is what you need:
First, take the absolute value of the samples and smooth the curve to eliminate noise. To find the low points (valleys), run the same peak search on the negated curve.
from scipy.io import wavfile
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
import numpy as np
from scipy.signal import savgol_filter
def get_volume(file):
    sr, data = wavfile.read(file)
    if data.ndim > 1:
        data = data[:, 0]
    return data
v1 = abs(get_volume("test.wav"))
#Smooth the curve (odd window length, which older SciPy versions require)
volumes=savgol_filter(v1,10001 , 3)
lv=volumes*-1
#find peaks
peaks,_ = find_peaks(volumes,distance=8000,prominence=300)
lpeaks,_= find_peaks(lv,distance=8000,prominence=300)
# plot them
plt.plot(volumes)
plt.plot(peaks,volumes[peaks],"x")
plt.plot(lpeaks,volumes[lpeaks],"o")
plt.plot(np.zeros_like(volumes), "--", color="gray")
plt.show()
Plot with your test file, x marks the high peaks and o the lower peaks
This article presents two python libraries (Aubio, librosa) to achieve what you need and includes examples of how to use them: How to Use Python to Detect Music Onsets by Lynn Zheng
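The smoothing idea can also be used to read off start and end indices directly. The following is a minimal sketch, assuming a note counts as "on" while a moving-average envelope stays above a fraction of its maximum. The function name note_ranges, the window length and the threshold are illustrative choices, and the signal is synthetic rather than the posted .wav file.

```python
import numpy as np


def note_ranges(samples, window=2048, rel_threshold=0.2):
    """Return (start, end) index pairs where the smoothed envelope is above a threshold."""
    # Moving average of the rectified signal as a simple loudness envelope.
    envelope = np.convolve(
        np.abs(samples.astype(float)), np.ones(window) / window, mode="same"
    )
    active = envelope > rel_threshold * envelope.max()
    # Pad with False so every active run has a rising and a falling edge.
    edges = np.diff(np.concatenate(([False], active, [False])).astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1) - 1
    return list(zip(starts.tolist(), ends.tolist()))


# Synthetic test signal: silence, a loud tone, silence, a quieter tone, silence.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
signal = np.concatenate(
    [np.zeros(sr), tone, np.zeros(sr), 0.5 * tone, np.zeros(sr)]
)

ranges = note_ranges(signal)
```

A plain threshold like this still cannot separate notes that run into each other without a dip in loudness, which is where dedicated onset detection (as in the libraries mentioned above) is likely to do better.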

How to speed up a high dimensional loop in python with numpy instead of pandas?

This loop takes 5 hours to run. How can I speed it up? I read something about using numpy functions instead of pandas. I tried, as you can see, but I am too new to Python to do it right. The main difficulty is the high-dimensional data with 6000 columns. All the data is static except for the random weights. How do I write better code?
import numpy as np
import pandas as pd
import os
#Covariance matrix in Pandas DataFrame 6000 columns x 6000 rows
cov = input_table_1.copy()
#Mean returns Pandas DataFrame 6000 columns x 1800 rows
mean_returns = input_table_2.copy().squeeze()
#Looping number
num_portfolios = 100000
#Empty results matrix
results_matrix = np.zeros((len(cov.columns)+1, num_portfolios))
rf=0
#Loop corpus
for i in range(num_portfolios):
    #Random numbers between 0 and 1 for every column
    weights = np.random.uniform(0,1,len(cov.columns))
    #Ensure sum of all random numbers is = 1
    weights /= np.sum(weights)
    #Some easy math operations
    portfolio_return = np.sum(mean_returns * weights) * 252
    portfolio_std = np.sqrt(np.dot(weights.T, np.dot(cov, weights))) * np.sqrt(252)
    sharpe_ratio = (portfolio_return - rf) / portfolio_std
    #write sharpe_ratio in result matrix as result for every loop
    results_matrix[0,i] = sharpe_ratio
    #iterate through the weight vector and add data to results array
    for j in range(len(weights)):
        results_matrix[j+1,i] = weights[j]
#output table as pandas data frame
output_table = pd.DataFrame(results_matrix.T,columns=['sharpe'] + [ticker for ticker in list(cov.columns)] )
There is no generic way to do that. First identify where your code is slow; after that you can apply optimizations.
You have a nested loop, so the complexity is O(n^2), but that is not a big deal here because a lot of the work can be done with a vectorized approach.
In Python, creating new objects is slow, so, for example, if the result fits in RAM, np.random.uniform can be called once for all portfolios and the result consumed inside the loop.
The nested loop that copies the weights can be replaced with a vectorized assignment; this seems the best candidate for a performance gain.
In any case, I suggest using a tool like perf_tool, which will point you to exactly the slow piece of code [*]
[*] I'm the main developer of this tool.
@AmilaMGunawardana Here is my first try with tensorflow, but it is not fast enough. In the end I waited 5 hours for 100,000 rounds. Maybe I have to do something better?
PerfTool showed me that everything in the code is fast, except this part:
vol_arr[x] = tnp.sqrt(tnp.dot(multi_randoms[x].T, np.dot(covData*252, multi_randoms[x]))) --> This part takes 90% of the execution time.
import numpy as np
import pandas as pd
import tensorflow.experimental.numpy as tnp  # assumed: the post does not show how tnp was imported
from perf_tool import PerfTool
covData = input_table_1.copy()
#Mean returns Pandas DataFrame 6000 columns x 1800 rows
returns = input_table_2.copy().squeeze()
#Looping number
num_ports = 100000
rf=0
#print("mean_returns: ", mean_returns)
#print("cov2: ", cov2)
#print("cov: ", cov)
all_weights = np.zeros((num_ports, len(returns.columns))) #tnp.zeros([num_ports,len(returns.columns)], dtype=tnp.float32) #np.zeros((num_ports, len(returns.columns)))
ret_arr = pd.to_numeric(np.zeros(num_ports))#tnp.zeros(num_ports, dtype=tnp.float32)# pd.to_numeric(np.zeros(num_ports))
vol_arr = pd.to_numeric(np.zeros(num_ports))#tnp.zeros(num_ports, dtype=tnp.float32)
sharpe_arr = pd.to_numeric(np.zeros(num_ports))#tnp.zeros(num_ports, dtype=tnp.float32)
multi_randoms = np.random.normal(0, 1., size=(num_ports,len(covData.columns) ))
#perf_tool('main')
def main():
    for x in range(num_ports):
        with PerfTool('preparation1'):
            # Save weights
            all_weights[x,:] = multi_randoms[x]
        with PerfTool('preparation2'):
            # Expected return
            ret_arr[x] = tnp.sum( (returns * multi_randoms[x] * 252))
        with PerfTool('preparation3'):
            # Expected volatility
            vol_arr[x] = tnp.sqrt(tnp.dot(multi_randoms[x].T, np.dot(covData*252, multi_randoms[x])))
        with PerfTool('preparation4'):
            # Sharpe Ratio (parentheses added so rf is subtracted before dividing)
            sharpe_arr[x] = (ret_arr[x] - rf) / vol_arr[x]
PerfTool.set_enabled()
main()
PerfTool.show_stats_if_enabled()
This shows one way of improving things by running the iterations in parallel. How could I get rid of the loop? Is there a way to do these calculations in a single step, using the all_weights array once instead of looping over it?
import pandas as pd
import numpy as np
from perf_tool import PerfTool, perf_tool
from joblib import Parallel, delayed, parallel_backend
#Covariance matrix in Pandas DataFrame 6000 columns x 6000 rows
covData = input_table_1.copy()
#Mean returns Pandas DataFrame 6000 columns x 1800 rows
mean_returns = input_table_2.copy().squeeze()
#Looping number
num_ports = 100000
all_weights = np.zeros((num_ports, len(mean_returns.columns)))
#multi_randoms = np.random.random(size=(len(df.columns) ))
for x in range(num_ports):
    weights = np.array(np.random.random(len(mean_returns.columns)))
    weights = weights/np.sum(weights)
    all_weights[x,:] = weights
    #print(weights)
#weights = np.array(np.random.random(len(returns.columns)))
#print(all_weights)
#print("cov2 type: ", type(cov2))
#cov = pd.DataFrame(np.random.normal(0, 1., size=(600,600)))
#print("cov type: ", type(cov))
rf=0
#print("mean_returns: ", mean_returns)
#print("cov2: ", cov2)
#print("cov: ", cov)
#all_weights = np.zeros((num_ports, len(returns.columns)))
ret_arr = pd.to_numeric(np.zeros(num_ports))
vol_arr = pd.to_numeric(np.zeros(num_ports))
sharpe_arr = pd.to_numeric(np.zeros(num_ports))
##perf_tool('main')
##jit(parallel=True)
def test(x):
    #for x in range(num_ports):
    #with PerfTool('preparation1'):
    # Weights
    #weights = np.array(np.random.random(len(returns.columns)))
    #with PerfTool('preparation2'):
    #weights = weights/np.sum(weights)
    #with PerfTool('preparation3'):
    # Save weights
    weights= all_weights[x]
    #with PerfTool('preparation4'):
    # Expected return
    ret_arr[x] = np.sum( (mean_returns * weights * 252))
    #with PerfTool('preparation5'):
    # Expected volatility
    vol_arr[x] = np.sqrt(np.dot(weights.T, np.dot(covData*252, weights)))
    #with PerfTool('preparation6'):
    # Sharpe Ratio (parentheses added so rf is subtracted before dividing)
    return x, (ret_arr[x] - rf) / vol_arr[x]
#sharpe_arr[x] = (np.sum( (mean_returns * all_weights * 252)) - rf) /(np.sqrt(np.dot(all_weights.T, np.dot(covData*252, all_weights))))
#PerfTool.set_enabled()
sharpe= []
weighttable= []
weighttable, sharpe= zip(*Parallel(n_jobs=-1)([delayed(test)(i) for i in range(num_ports)]))
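One possible way to drop the loop entirely, sketched here on small synthetic data rather than the 6000-column tables: stack all weight vectors in a matrix and compute every return and variance with matrix products. The sizes, the random mean_returns and cov, and the variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_assets, n_ports = 50, 1000  # small illustrative sizes (assumption)

# Synthetic stand-ins for the mean-return vector and covariance matrix.
mean_returns = rng.normal(0.0005, 0.001, size=n_assets)
a = rng.normal(size=(n_assets, n_assets))
cov = a @ a.T / n_assets * 1e-4  # symmetric and positive definite
rf = 0.0

# All random weight vectors at once, one portfolio per row, each row summing to 1.
weights = rng.uniform(0, 1, size=(n_ports, n_assets))
weights /= weights.sum(axis=1, keepdims=True)

# Annualised return of every portfolio in one matrix-vector product.
port_return = weights @ mean_returns * 252

# w^T C w for every row w: one matrix product followed by a row-wise dot product.
port_var = np.einsum("ij,ij->i", weights @ cov, weights) * 252
port_std = np.sqrt(port_var)

sharpe = (port_return - rf) / port_std
```

With pandas inputs, converting once with .to_numpy() before these operations avoids repeated DataFrame overhead. At the sizes in the question the weight matrix alone is 100000 x 6000 float64 values, about 4.8 GB, so processing the portfolios in batches of a few thousand rows is probably necessary.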

Python program uses too much memory

Function coinT() tests if two time series are stationary using ADF test and Hurst exponent. Time series are stored in 1511x6 CSV files, but for testing only a vector of the 5th column is returned by the function stock(). There are 50 files in total. It seems that the program is using too much memory as it makes the PC crash after running for ~30 secs. It works fine on 15 files, but crashes on larger sets(>50).
Can somebody please help me find out what's using so much memory? I've tried splitting computations into multiple functions and deleting the object, but it didn't help much.
import numpy as np
import pandas as pd
import statsmodels.tsa.stattools as ts
import csv
import timeit
from numpy import log, polyfit, sqrt, std, subtract
from pandas.stats.api import ols
import os
src = 'C:/Users/PC/Desktop/Magistr/Ibpython/testing/'
filenames = next(os.walk(src))[2] #load all stock file names into array
cointegratedPairs = []
def hurst(ts):
    """Returns the Hurst Exponent of the time series vector ts
    H<0.5 - The time series is mean reverting
    H=0.5 - The time series is a Geometric Brownian Motion
    H>0.5 - The time series is trending"""
    # Create the range of lag values
    lags = range(2, 100)
    # Calculate the array of the variances of the lagged differences
    tau = [sqrt(std(subtract(ts[lag:], ts[:-lag]))) for lag in lags]
    # Use a linear fit to estimate the Hurst Exponent
    poly = polyfit(log(lags), log(tau), 1)
    del lags
    del tau
    # Return the Hurst exponent from the polyfit output
    return poly[0]*2.0
#Convert file into an array
def stock(filename):
    #read file into array and get its length
    delimiter = ","
    with open(src + filename,'r') as dest_f:
        data_iter = csv.reader(dest_f,
                               delimiter = delimiter,
                               quotechar = '"')
        data = [data for data in data_iter]
    data_array = np.asarray(data)[:,5]
    return data_array
    del data
    del data_array
#Check if two time series are cointegrated
def coinTest(itemX, itemY):
    indVar = map(float, stock(itemX)[0:1000]) #2009.05.22 - 2013.05.14
    depVar = map(float, stock(itemY)[0:1000]) #2009.05.22 - 2013.05.14
    #Calculate optimal hedge ratio "beta"
    df = pd.DataFrame()
    df[itemX] = indVar
    df[itemY] = depVar
    res = ols(y=df[itemY], x=df[itemX])
    beta_hr = res.beta.x
    alpha = res.beta.intercept
    df["res"] = df[itemY] - beta_hr*df[itemX] - alpha
    #Calculate the CADF test on the residuals
    cadf = ts.adfuller(df["res"])
    #Reject the null hypothesis at 1% confidence level
    if cadf[4]['1%'] > cadf[0]:
        #Hurst exponent test if residuals are mean reverting
        if hurst(df["res"]) < 0.4:
            cointegratedPairs.append((itemY,itemX))
    del indVar
    del depVar
    del df[itemX]
    del df[itemY]
    del df["res"]
    del cadf
#Main function
def coinT():
    limit = 0
    TotalPairs = 0
    for itemX in filenames:
        for itemY in filenames[limit:]:
            TotalPairs +=1
            if itemX == itemY:
                continue
            else:
                coinTest(itemX, itemY)
        limit +=1
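No answer is preserved for this question, so the following is only a sketch of one thing worth trying, not a diagnosis of the crash. coinTest() calls stock() twice per pair, so every CSV file is read and fully parsed many times over. Assuming the needed column of all files fits in memory, each file can be parsed once and the pairs enumerated with itertools.combinations. The helper name load_column and the in-memory CSV text are hypothetical stand-ins for the real files.

```python
import csv
import io
from itertools import combinations

import numpy as np


def load_column(handle, column=5, limit=1000):
    """Parse one CSV stream, keeping only `column` of the first `limit` rows as floats."""
    reader = csv.reader(handle, delimiter=",", quotechar='"')
    values = [float(row[column]) for _, row in zip(range(limit), reader)]
    return np.asarray(values, dtype=float)


# Hypothetical stand-ins for the stock files: three tiny 6-column CSVs held in memory.
fake_files = {
    name: io.StringIO("\n".join(f"d{i},1,2,3,4,{base + i}" for i in range(20)))
    for name, base in [("AAA.csv", 10.0), ("BBB.csv", 20.0), ("CCC.csv", 30.0)]
}

# Parse every file exactly once instead of once per pair.
series = {name: load_column(handle) for name, handle in fake_files.items()}

# combinations() yields each unordered pair once and never pairs a file with itself.
pairs = list(combinations(sorted(series), 2))
for name_x, name_y in pairs:
    x, y = series[name_x], series[name_y]
    # The cointegration test on x and y would go here.
```

With real files, each handle would come from open(path, newline="") inside a with block, so that the file is closed as soon as its column has been extracted.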

Python Joint Distribution of N Variables

So I need to calculate the joint probability distribution for N variables. I have code for two variables, but I am having trouble generalizing it to higher dimensions. I imagine there is some sort of pythonic vectorization that could be helpful, but, right now my code is very C like (and yes I know that is not the right way to write Python). My 2D code is below:
import numpy
import math
feature1 = numpy.array([1.1,2.2,3.0,1.2,5.4,3.4,2.2,6.8,4.5,5.6,1.9,2.8,3.7,4.4,7.3,8.3,8.1,7.0,8.0,6.8,6.2,4.9,5.7,6.3,3.7,2.4,4.5,8.5,9.5,9.9]);
feature2 = numpy.array([11.1,12.8,13.0,11.6,15.2,13.8,11.1,17.8,12.5,15.2,11.6,20.8,14.7,14.4,15.3,18.3,11.4,17.0,16.0,16.8,12.2,14.9,15.7,16.3,13.7,12.4,14.2,18.5,19.8,19.0]);
#===Concatenate All Features===#
numFrames = len(feature1);
allFeatures = numpy.zeros((2,numFrames));
allFeatures[0,:] = feature1;
allFeatures[1,:] = feature2;
#===Create the Array to hold all the Bins===#
numBins = int(0.25*numFrames);
allBins = numpy.zeros((allFeatures.shape[0],numBins+1));
#===Find the maximum and minimum of each feature===#
allRanges = numpy.zeros((allFeatures.shape[0],2));
for f in range(allFeatures.shape[0]):
    allRanges[f,0] = numpy.amin(allFeatures[f,:]);
    allRanges[f,1] = numpy.amax(allFeatures[f,:]);
#===Create the Array to hold all the individual feature probabilities===#
allIndividualProbs = numpy.zeros((allFeatures.shape[0],numBins));
#===Grab all the Individual Probs and the Bins===#
for f in range(allFeatures.shape[0]):
    freqhist, binedges = numpy.histogram(allFeatures[f,:],bins=numBins,range=[allRanges[f,0],allRanges[f,1]],density=False);
    allBins[f,:] = binedges;
    allIndividualProbs[f,:] = freqhist;
#===Create the joint probability array===#
jointProbs = numpy.zeros((numBins,numBins));
#===Compute the joint probability distribution===#
numElements = 0;
for b1 in range(numBins):
    for b2 in range(numBins):
        for f1 in range(numFrames):
            for f2 in range(numFrames):
                if ( ( (feature1[f1] >= allBins[0,b1]) and (feature1[f1] <= allBins[0,b1+1]) ) and ((feature2[f2] >= allBins[1,b2]) and (feature2[f2] <= allBins[1,b2+1])) ):
                    jointProbs[b1,b2] += 1;
                    numElements += 1;
jointProbs /= numElements;
#===But what if I add the following===#
feature3 = numpy.array([21.1,21.8,23.5,27.6,25.2,23.8,22.1,22.8,26.5,25.2,28.6,20.8,24.7,24.4,29.3,28.3,27.4,26.0,26.2,26.1,25.9,24.0,22.7,22.3,23.7,26.4,24.2,28.5,29.8,29.0]);
How can I generalize the large loop? For N variables (features) this loop would be enormous. Is there a Pythonic way to do this easily?
Check out the function numpy.histogramdd. This function can compute histograms in an arbitrary number of dimensions. If you set the parameter density=True (spelled normed=True in older NumPy releases), it returns a probability density: the bin count divided by the total number of samples and by the bin hypervolume. If you'd prefer something more like a probability mass function (where everything sums to 1), just normalize the raw counts yourself. All together, you'll have something like:
import numpy as np
numBins = 10 # number of bins in each dimension
data = np.random.randn(100000, 3) # generate 100000 3-d random data points
jointProbs, edges = np.histogramdd(data, bins=numBins)
jointProbs /= jointProbs.sum()
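Applied to the three feature vectors from the question, the same call could look like the sketch below. It assumes each frame (index) is one joint observation, so the features are stacked as columns of an (N, D) array; the bin count reuses the question's int(0.25*numFrames) rule.

```python
import numpy as np

feature1 = np.array([1.1,2.2,3.0,1.2,5.4,3.4,2.2,6.8,4.5,5.6,1.9,2.8,3.7,4.4,7.3,
                     8.3,8.1,7.0,8.0,6.8,6.2,4.9,5.7,6.3,3.7,2.4,4.5,8.5,9.5,9.9])
feature2 = np.array([11.1,12.8,13.0,11.6,15.2,13.8,11.1,17.8,12.5,15.2,11.6,20.8,
                     14.7,14.4,15.3,18.3,11.4,17.0,16.0,16.8,12.2,14.9,15.7,16.3,
                     13.7,12.4,14.2,18.5,19.8,19.0])
feature3 = np.array([21.1,21.8,23.5,27.6,25.2,23.8,22.1,22.8,26.5,25.2,28.6,20.8,
                     24.7,24.4,29.3,28.3,27.4,26.0,26.2,26.1,25.9,24.0,22.7,22.3,
                     23.7,26.4,24.2,28.5,29.8,29.0])

# One row per frame, one column per feature: shape (numFrames, numFeatures).
data = np.column_stack([feature1, feature2, feature3])
numBins = int(0.25 * data.shape[0])

# Raw joint counts over a numBins x numBins x numBins grid, then normalise to sum to 1.
counts, edges = np.histogramdd(data, bins=numBins)
jointProbs = counts / counts.sum()
```

Adding a fourth feature only means adding another column to data. Note that this counts values occurring together in the same frame, whereas the nested loops in the question pair every f1 with every f2 and therefore produce the product of the two marginal histograms.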

Python 2D array -- How to plug in x and retrieve y value?

I have been looking for an answer since yesterday but no luck. So I have a 1D spectrum (.fits) file with flux value at each wavelength. I have converted them into a 2D array (x,y)=(wavelength, flux) and want to write a program which will return flux(y) at some assigned wavelengths(x). I have tried this:
#modules
import scipy
import numpy as np
import pyfits as pf
#Target Global Variables
hdulist_tg = pf.open('cutmask1-2.0001.fits')
hdr_tg = hdulist_tg[0].header
flux_tg = hdulist_tg[0].data
crval_tg = hdr_tg['CRVAL1'] #Starting wavelength
cdel_tg = hdr_tg['CDELT1'] #Wavelength axis width
wave_tg = crval_tg + np.arange(3183)*cdel_tg #Create an x-axis
wavelist = [6207,6315,6369,6438,6490,6565,6588]
wave_flux=[]
diff = 10
for wave in wave_tg:
    for flux in flux_tg:
        wave_flux.append((wave,flux))
for item in wave_flux:
    wave = item[0]
    flux = item[1]
    #Where I got my actual wavelength that exists in wave_tg
    diffmatch = np.abs(wave - wavelist[0])
    if diffmatch < diff:
        flux_wave = flux
        diff = diffmatch
        wavematch = wave
print wavelist[0],flux_wave,wavematch
but the program always returns the same flux value even though the wavelength is different. Please help...
I would skip the creation of the two dimensional table altogether and just use interp:
fluxvalues = np.interp(wavelist, wave_tg, flux_tg)
For the file you posted, the code you posted doesn't work due to the hard-coded length of the wave_tg array. I would therefore recommend you rather use
wave_tg = crval_tg + np.arange(len(flux_tg))*cdel_tg
Also, for some reason it seems that the file you posted doesn't actually go up to the wavelengths you are looking up. You might need to check that you are calculating the corresponding wavelengths correctly or check that you are looking up the right wavelengths.
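A self-contained sketch of the np.interp lookup on a synthetic spectrum, so it runs without the FITS file. The values of crval and cdelt and the linear flux array are made up for illustration; with the real file they would come from the header and data as in the question.

```python
import numpy as np

# Hypothetical stand-ins for CRVAL1, CDELT1 and the flux array.
crval, cdelt = 6000.0, 0.5
flux = np.linspace(100.0, 200.0, 1201)

# Wavelength axis sized from the data instead of a hard-coded length.
wave = crval + np.arange(len(flux)) * cdelt

wavelist = [6207, 6315, 6369, 6438, 6490, 6565, 6588]

# Linear interpolation of the flux at each requested wavelength.
flux_at = np.interp(wavelist, wave, flux)

# Alternative: take the nearest sample on the grid instead of interpolating.
nearest = np.abs(wave[:, None] - np.asarray(wavelist)[None, :]).argmin(axis=0)
flux_nearest = flux[nearest]
```

np.interp expects the wavelength axis to be increasing and, by default, returns the first or last flux value for wavelengths outside the covered range, so out-of-range lookups fail silently rather than raising an error.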
I've made some changes to your code:
using numpy to create wave_flux as an ndarray with np.vstack(), np.repeat() and np.tile()
using fancy indexing (a boolean mask) to get the values matching your search
The resulting code is:
#modules
import scipy
import numpy as np
import pyfits as pf
#Target Global Vaiables
hdulist_tg = pf.open('cutmask1-2.0001.fits')
hdr_tg = hdulist_tg[0].header
flux_tg = hdulist_tg[0].data
crval_tg = hdr_tg['CRVAL1'] #Starting wavelength
cdel_tg = hdr_tg['CDELT1'] #Wavelength axis width
wave_tg = crval_tg + np.arange(3183)*cdel_tg #Create an x-axis
wavelist = [6207,6315,6369,6438,6490,6565,6588]
wave_flux = np.vstack(( np.repeat(wave_tg, len(flux_tg)),
                        np.tile(flux_tg, len(wave_tg)) )).transpose()
wave_ref = wavelist[0]
diff = 10
print wave_flux[ np.abs(wave_flux[:,0]-wave_ref) < diff ]
Which will return a sub-group of wave_flux with the wave values in column 0 and flux values in column 1:
[[ 6197.10300138 500.21020508]
[ 6197.10300138 523.24102783]
[ 6197.10300138 510.6390686 ]
...,
[ 6216.68436446 674.94732666]
[ 6216.68436446 684.74255371]
[ 6216.68436446 712.20098877]]
