Method for avoiding random number repetition - python

Method for avoiding random number repetition - python - python

I am using the random number routines in python in the following code in order to create a noise signal.
res = 10
# Add noise to each X bin accross the signal
X = np.arange(-600,600,res)
for i in range(10000):
noise = [random.uniform(-2,2) for i in xrange(len(X))]
# custom module to save output of X and noise to .fits file
wp.save_fits('test10000', X, noise)
plt.plot(V, I)
plt.show()
In this example I am generate 10,000 'noise.fits' files, that I then wish to co-add together in order to show the expected 1/sqrt(N) dependence of the stacked noise root-mean-square (rms) as a function of the number of objects co-added.
My problem is that the rms follows this dependancy up until ~1000 objects, at which point it deviates upwards, suggesting that the random number generator.
Is there a routine or way to structure the code which will avoid or minimise this repetition? (Ideally with the number as a float in between a max and min value >1 and <-1)?
Here is the output of the co-adding code as well as the code pasted at the bottom for reference.
If I use the module random.random() the result is worse.
Here is my code which adds the noise signal files together, averaging over the number of objects.
import os
import numpy as np
from astropy.io import fits
import matplotlib.pyplot as plt
import glob
rms_arr =[]
#vel_w_arr = []
filelist = glob.glob('/Users/thbrown/Documents/HI_stacking/mockcat/testing/test10000/M*.fits')
filelist.sort()
for i in (filelist[:]):
print(i)
#open an existing FITS file
hdulist = fits.open(str(i))
# assuming the first extension is the table we assign data to record array
tbdata = hdulist[1].data
#index = np.arange(len(filelist))
# Access the signal column
noise = tbdata.field(1)
# access the vel column
X = tbdata.field(0)
if i == filelist[0]:
stack = np.zeros(len(noise))
tot_rms = 0
#print len(stack)
# sum signal in loop
stack = (stack + noise)
rms = np.std(stack)
rms_arr = np.append(rms_arr, rms)
numgal = np.arange(1, np.size(filelist)+1)
avg_rms = rms_arr / numgal

Related

Python DGL increasing node degree after removals

I'm studying effects of node removals on the model training accuracy, using several heuristics, the CORA dataset and DGL graph library. The issue is when I try to remove by reversed order of degrees, such as nodes with higher degree are removed first. I extract the graph degree array, which is indexed by id's, and reverse argsort it. This would be the largest degree node ids, in decreasing order.
Finally, I remove the desired amount from the graph, returing the modified graph.
After a few iterations, I noticed the largest degree present tends to increase, something that should not happen from my algorithm, as I reverse argsort the degrees indexes, slice and remove.
I've inserted some prints inside the code to show progress and how the degrees change over time. To avoid having to clone the code, I saved the output inside the repository.
Here is the minimum reproducible example: github repository
import math
import random
import secrets
import time
import numpy as np
import torch
import dgl
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import pdb
from dgl.data import CoraGraphDataset
import gnn
def remove_nodes(g, total):
degreeArray = g.in_degrees().numpy()
print('Mean of degrees: ', degreeArray.sum()/len(degreeArray))
print("Size of degree array: ", len(degreeArray))
print("__________")
#sort indexes and reverse, to get greater degrees first
sortIndexes = np.argsort(degreeArray)[::-1].copy()
#print("Sorted indexes: ", sortIndexes.tolist())
#2nd step: get degree value info
debug_sorted_degrees = np.array(degreeArray)[sortIndexes]
# indexes and degrees of 10 to be removed nodes
degreeDict = list(zip(sortIndexes, debug_sorted_degrees))[0:10]
#print("DegreeDict: ", degreeDict)
#take all degrees in graph to dataframe and group by degree
hist = pd.DataFrame(debug_sorted_degrees)
hist.columns = ['degrees in graph, grouped']
y = hist.groupby("degrees in graph, grouped").size()
print("number of nodes to be removed in round: ", total)
print(y)
#slice the desired number of nodes from sorted indexes
nodes = sortIndexes[0:total].copy()
#print(nodes.tolist())
removedNodesSearchedInGraph = g.in_degrees(torch.tensor(nodes)).numpy().tolist()
maiorGrau = max(removedNodesSearchedInGraph)
menorGrau = min(removedNodesSearchedInGraph)
print("\nSorted degree removals: ")
print(removedNodesSearchedInGraph[0:total], sep='\t')
print(f"Largest degree removed: {maiorGrau}")
print(f"Smallest degree removed: {menorGrau}")
g.remove_nodes(torch.tensor(nodes, dtype=torch.int64), store_ids=True)
return g, nodes
dataset = CoraGraphDataset()[0]
precision = []
trainingEpochs = 60
nodeRemovalsPerRound = 50
for i in range(7):
print(f"\n______________ITERATION #{i}______________________")
g, removedNodes = remove_nodes(dataset, nodeRemovalsPerRound)
currentPrecision = gnn.train(dataset, trainingEpochs)
precision.append(currentPrecision)
for i in range(len(precision)):
print(f"Precision of iteration {i+1}: {precision[i]}")
And this is the output I get from running the code
After the first iteration, it starts removing from the lowest nodes, not the highest ones.
What is it that I'm missing?

Time series dBFS plot output modification - current output plot not as expected (matplotlib)

I'm trying to plot the Amplitude (dBFS) vs. Time (s) plot of an audio (.wav) file using matplotlib. I managed to do that with the following code:
def convert_to_decibel(sample):
ref = 32768 # Using a signed 16-bit PCM format wav file. So, 2^16 is the max. value.
if sample!=0:
return 20 * np.log10(abs(sample) / ref)
else:
return 20 * np.log10(0.000001)
from scipy.io.wavfile import read as readWav
from scipy.fftpack import fft
import matplotlib.pyplot as gplot1
import matplotlib.pyplot as gplot2
import numpy as np
import struct
import gc
wavfile1 = '/home/user01/audio/speech.wav'
wavsamplerate1, wavdata1 = readWav(wavfile1)
wavdlen1 = wavdata1.size
wavdtype1 = wavdata1.dtype
gplot1.rcParams['figure.figsize'] = [15, 5]
pltaxis1 = gplot1.gca()
gplot1.axhline(y=0, c="black")
gplot1.xticks(np.arange(0, 10, 0.5))
gplot1.yticks(np.arange(-200, 200, 5))
gplot1.grid(linestyle = '--')
wavdata3 = np.array([convert_to_decibel(i) for i in wavdata1], dtype=np.int16)
yvals3 = wavdata3
t3 = wavdata3.size / wavsamplerate1
xvals3 = np.linspace(0, t3, wavdata3.size)
pltaxis1.set_xlim([0, t3 + 2])
pltaxis1.set_title('Amplitude (dBFS) vs Time(s)')
pltaxis1.plot(xvals3, yvals3, '-')
which gives the following output:
I had also plotted the Power Spectral Density (PSD, in dBm) using the code below:
from scipy.signal import welch as psd # Computes PSD using Welch's method.
fpsd, wPSD = psd(wavdata1, wavsamplerate1, nperseg=1024)
gplot2.rcParams['figure.figsize'] = [15, 5]
pltpsdm = gplot2.gca()
gplot2.axhline(y=0, c="black")
pltpsdm.plot(fpsd, 20*np.log10(wPSD))
gplot2.xticks(np.arange(0, 4000, 400))
gplot2.yticks(np.arange(-150, 160, 10))
pltpsdm.set_xlim([0, 4000])
pltpsdm.set_ylim([-150, 150])
gplot2.grid(linestyle = '--')
which gives the output as:
The second output above, using the Welch's method plots a more presentable output. The dBFS plot though informative is not very presentable IMO. Is this because of:
the difference in the domains (time in case of 1st output vs frequency in the 2nd output)?
the way plot function is implemented in pyplot?
Also, is there a way I can plot my dBFS output as a peak-to-peak style of plot just like in my PSD (dBm) plot rather than a dense stem plot?
Would be much helpful and would appreciate any pointers, answers or suggestions from experts here as I'm just a beginner with matplotlib and plots in python in general.

TLNR
This has nothing to do with pyplot.
The frequency domain is different from the time domain, but that's not why you didn't get what you want.
The calculation of dbFS in your code is wrong.
You should frame your data, calculate RMSs or peaks in every frame, and then convert that value to dbFS instead of applying this transformation to every sample point.
When we talk about the amplitude, we are talking about a periodic signal. And when we read in a series of data from a sound file, we read in a series of sample points of a signal(may be or be not periodic). The value of every sample point represents a, say, voltage value, or sound pressure value sampled at a specific time.
We assume that, within a very short time interval, maybe 10ms for example, the signal is stationary. Every such interval is called a frame.
Some specific function is applied to each frame usually, to reduce the sudden change at the edge of this frame, and these functions are called window functions. If you did nothing to every frame, you added rectangle windows to them.
An example: when the sampling frequency of your sound is 44100Hz, in a 10ms-long frame, there are 44100*0.01=441 sample points. That's what the nperseg argument means in your psd function but it has nothing to do with dbFS.
Given the knowledge above, now we can talk about the amplitude.
There are two methods a get the value of amplitude in every frame:
The most straightforward one is to get the maximum(peak) values in every frame.
Another one is to calculate the RMS(Root Mean Sqaure) of every frame.
After that, the peak values or RMS values can be converted to dbFS values.
Let's start coding:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
# Determine full scall(maximum possible amplitude) by bit depth
bit_depth = 16
full_scale = 2 ** bit_depth
# dbFS function
to_dbFS = lambda x: 20 * np.log10(x / full_scale)
# Read in the wave file
fname = "01.wav"
fs,data = wavfile.read(fname)
# Determine frame length(number of sample points in a frame) and total frame numbers by window length(how long is a frame in seconds)
window_length = 0.01
signal_length = data.shape[0]
frame_length = int(window_length * fs)
nframes = signal_length // frame_length
# Get frames by broadcast. No overlaps are used.
idx = frame_length * np.arange(nframes)[:,None] + np.arange(frame_length)
frames = data[idx].astype("int64") # Convert to in 64 to avoid integer overflow
# Get RMS and peaks
rms = ((frames**2).sum(axis=1)/frame_length)**.5
peaks = np.abs(frames).max(axis=1)
# Convert them to dbfs
dbfs_rms = to_dbFS(rms)
dbfs_peak = to_dbFS(peaks)
# Let's start to plot
# Get time arrays of every sample point and ever frame
frame_time = np.arange(nframes) * window_length
data_time = np.linspace(0,signal_length/fs,signal_length)
# Plot
f,ax = plt.subplots()
ax.plot(data_time,data,color="k",alpha=.3)
# Plot the dbfs values on a twin x Axes since the y limits are not comparable between data values and dbfs
tax = ax.twinx()
tax.plot(frame_time,dbfs_rms,label="RMS")
tax.plot(frame_time,dbfs_peak,label="Peak")
tax.legend()
f.tight_layout()
# Save serval details
f.savefig("whole.png",dpi=300)
ax.set_xlim(1,2)
f.savefig("1-2sec.png",dpi=300)
ax.set_xlim(1.295,1.325)
f.savefig("1.2-1.3sec.png",dpi=300)
The whole time span looks like(the unit of the right axis is dbFS):
And the voiced part looks like:
You can see that the dbFS values become greater while the amplitudes become greater at the vowel start point:

How do I generate 1000 data points?

I am a bit confused since I am trying to learn python.
My question is how can I generate 1000 datapoints for a noisy S-curve and then save it to a .txt file?

You may consider using the random module to generate a large list of random values
COUNT = 1000 # Number of data points
UPPER_BOUND = 100 # The domain they occupy, exclusive at the upper bound
LOWER_BOUND = 0
data_points = []
for _ in range(COUNT):
data_points.append(random.randint(LOWER_BOUND, UPPER_BOUND))
To save this to a text file, use the open() method with the "w" value to write into a file:
with open("filename.txt", "w") as f:
f.write(data_points)
The use of the with clause removes the need to call close() on the file after it is used.

You can use scipy.stats.logistic for the "S-shaped" curve and numpy.random.uniform for the noise:
import numpy as np
from scipy.stats import logistic
N = 1000
x = np.linspace(-10,10, num=N)
noise = np.random.uniform(0, 0.1, size=N)
points = logistic.cdf(x)+noise
np.savetxt('points.txt', points)
content of points.txt (first lines):
5.163273718724530059e-02
2.404908177729772611e-02
7.221953948290879555e-02
3.023476195714707923e-02
4.972362503720893084e-02
8.986980537557204274e-02
9.878733026764449643e-02
9.584209234526251675e-02
7.709992266714442433e-02
1.367468690439026940e-02
How the data looks like:
import matplotlib.pyplot as plt
plt.plot(x, points)

Efficiently using 1-D pyfftw on small slices of a 3-D numpy array

I have a 3D data cube of values of size on the order of 10,000x512x512. I want to parse a window of vectors (say 6) along dim[0] repeatedly and generate the fourier transforms efficiently. I think I'm doing an array copy into the pyfftw package and it's giving me massive overhead. I'm going over the documentation now since I think there is an option I need to set, but I could use some extra help on the syntax.
This code was originally written by another person with numpy.fft.rfft and accelerated with numba. But the implementation wasn't working on my workstation so I re-wrote everything and opted to go for pyfftw instead.
import numpy as np
import pyfftw as ftw
from tkinter import simpledialog
from math import ceil
import multiprocessing
ftw.config.NUM_THREADS = multiprocessing.cpu_count()
ftw.interfaces.cache.enable()
def runme():
# normally I would load a file, but for Stack Overflow, I'm just going to generate a 3D data cube so I'll delete references to the binary saving/loading functions:
# load the file
dataChunk = np.random.random((1000,512,512))
numFrames = dataChunk.shape[0]
# select the window size
windowSize = int(simpledialog.askstring('Window Size',
'How many frames to demodulate a single time point?'))
numChannels = windowSize//2+1
# create fftw arrays
ftwIn = ftw.empty_aligned(windowSize, dtype='complex128')
ftwOut = ftw.empty_aligned(windowSize, dtype='complex128')
fftObject = ftw.FFTW(ftwIn,ftwOut)
# perform DFT on the data chunk
demodFrames = dataChunk.shape[0]//windowSize
channelChunks = np.zeros([numChannels,demodFrames,
dataChunk.shape[1],dataChunk.shape[2]])
channelChunks = getDFT(dataChunk,channelChunks,
ftwIn,ftwOut,fftObject,windowSize,numChannels)
return channelChunks
def getDFT(data,channelOut,ftwIn,ftwOut,fftObject,
windowSize,numChannels):
frameLen = data.shape[0]
demodFrames = frameLen//windowSize
for yy in range(data.shape[1]):
for xx in range(data.shape[2]):
index = 0
for i in range(0,frameLen-windowSize+1,windowSize):
ftwIn[:] = data[i:i+windowSize,yy,xx]
fftObject()
channelOut[:,index,yy,xx] = 2*np.abs(ftwOut[:numChannels])/windowSize
index+=1
return channelOut
if __name__ == '__main__':
runme()
What happens is I get a 4D array; the variable channelChunks. I am saving out each channel to a binary (not included in the code above, but the saving part works fine).
This process is for a demodulation project we have, the 4D data cube channelChunks is then parsed into eval(numChannel) 3D data cubes (movies) and from that we are able to separate a movie by color given our experimental set up. I was hoping I could circumvent writing a C++ function that calls the fft on the matrix via pyfftw.
Effectively, I am taking windowSize=6 elements along the 0 axis of dataChunk at a given index of 1 and 2 axis and performing a 1D FFT. I need to do this throughout the entire 3D volume of dataChunk to generate the demodulated movies. Thanks.

The FFTW advanced plans can be automatically built by pyfftw.
The code could be modified in the following way:
Real to complex transforms can be used instead of complex to complex transform.
Using pyfftw, it typically writes:
ftwIn = ftw.empty_aligned(windowSize, dtype='float64')
ftwOut = ftw.empty_aligned(windowSize//2+1, dtype='complex128')
fftObject = ftw.FFTW(ftwIn,ftwOut)
Add a few flags to the FFTW planner. For instance, FFTW_MEASURE will time different algorithms and pick the best. FFTW_DESTROY_INPUT signals that the input array can be modified: some implementations tricks can be used.
fftObject = ftw.FFTW(ftwIn,ftwOut, flags=('FFTW_MEASURE','FFTW_DESTROY_INPUT',))
Limit the number of divisions. A division costs more than a multiplication.
scale=1.0/windowSize
for ...
for ...
2*np.abs(ftwOut[:,:,:])*scale #instead of /windowSize
Avoid multiple for loops by making use of FFTW advanced plan through pyfftw.
nbwindow=numFrames//windowSize
# create fftw arrays
ftwIn = ftw.empty_aligned((nbwindow,windowSize,dataChunk.shape[2]), dtype='float64')
ftwOut = ftw.empty_aligned((nbwindow,windowSize//2+1,dataChunk.shape[2]), dtype='complex128')
fftObject = ftw.FFTW(ftwIn,ftwOut, axes=(1,), flags=('FFTW_MEASURE','FFTW_DESTROY_INPUT',))
...
for yy in range(data.shape[1]):
ftwIn[:] = np.reshape(data[0:nbwindow*windowSize,yy,:],(nbwindow,windowSize,data.shape[2]),order='C')
fftObject()
channelOut[:,:,yy,:]=np.transpose(2*np.abs(ftwOut[:,:,:])*scale, (1,0,2))
Here is the modifed code. I also, decreased the number of frame to 100, set the seed of the random generator to check that the outcome is not modifed and commented tkinter. The size of the window can be set to a power of two, or a number made by multiplying 2,3,5 or 7, so that the Cooley-Tuckey algorithm can be efficiently applied. Avoid large prime numbers.
import numpy as np
import pyfftw as ftw
#from tkinter import simpledialog
from math import ceil
import multiprocessing
import time
ftw.config.NUM_THREADS = multiprocessing.cpu_count()
ftw.interfaces.cache.enable()
ftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'
def runme():
# normally I would load a file, but for Stack Overflow, I'm just going to generate a 3D data cube so I'll delete references to the binary saving/loading functions:
# load the file
np.random.seed(seed=42)
dataChunk = np.random.random((100,512,512))
numFrames = dataChunk.shape[0]
# select the window size
#windowSize = int(simpledialog.askstring('Window Size',
# 'How many frames to demodulate a single time point?'))
windowSize=32
numChannels = windowSize//2+1
nbwindow=numFrames//windowSize
# create fftw arrays
ftwIn = ftw.empty_aligned((nbwindow,windowSize,dataChunk.shape[2]), dtype='float64')
ftwOut = ftw.empty_aligned((nbwindow,windowSize//2+1,dataChunk.shape[2]), dtype='complex128')
#ftwIn = ftw.empty_aligned(windowSize, dtype='complex128')
#ftwOut = ftw.empty_aligned(windowSize, dtype='complex128')
fftObject = ftw.FFTW(ftwIn,ftwOut, axes=(1,), flags=('FFTW_MEASURE','FFTW_DESTROY_INPUT',))
# perform DFT on the data chunk
demodFrames = dataChunk.shape[0]//windowSize
channelChunks = np.zeros([numChannels,demodFrames,
dataChunk.shape[1],dataChunk.shape[2]])
channelChunks = getDFT(dataChunk,channelChunks,
ftwIn,ftwOut,fftObject,windowSize,numChannels)
return channelChunks
def getDFT(data,channelOut,ftwIn,ftwOut,fftObject,
windowSize,numChannels):
frameLen = data.shape[0]
demodFrames = frameLen//windowSize
printed=0
nbwindow=data.shape[0]//windowSize
scale=1.0/windowSize
for yy in range(data.shape[1]):
#for xx in range(data.shape[2]):
index = 0
ftwIn[:] = np.reshape(data[0:nbwindow*windowSize,yy,:],(nbwindow,windowSize,data.shape[2]),order='C')
fftObject()
channelOut[:,:,yy,:]=np.transpose(2*np.abs(ftwOut[:,:,:])*scale, (1,0,2))
#for i in range(nbwindow):
#channelOut[:,i,yy,xx] = 2*np.abs(ftwOut[i,:])*scale
if printed==0:
for j in range(channelOut.shape[0]):
print j,channelOut[j,0,yy,0]
printed=1
return channelOut
if __name__ == '__main__':
seconds=time.time()
runme()
print "time: ", time.time()-seconds
Let us know how much it speeds up your computations! I went from 24s to less than 2s on my computer...

Save results of a Loop in a Matrix

I am currently programming a Python tool for performing a Geometric Brownian motion. The loop for performing the motion is done and works as intended. Now I have problems saving the various results of the simulations in a big matrix and to plot it then.
I tried to use the append function but it turns out that the result I get then is a list with another array for each simulation rather than a big matrix.
My Code:
import matplotlib.pyplot as plt
import numpy as np
T = 2
mu = 0.15
sigma = 0.10
S0 = 20
dt = 0.01
N = round(T/dt) ### Paths
simu = 20 ### number of simulations
i = 1
## creates an array with values from 0 to T with N elementes (T/dt)
t = np.linspace(0, T, N)
## empty Matrix for the end results
res = []
while i < simu + 1:
## random number showing the Wiener process
W = np.random.standard_normal(size = N)
W = np.cumsum(W)*np.sqrt(dt) ### standard brownian motion ###
X = (mu-0.5*sigma**2)*t + sigma*W
S = S0*np.exp(X) ### new Stock prices based on the simulated returns ###
res.append(S) #appends the resulting array to the result table
i += 1
#plotting of the result Matrix
plt.plot(t, res)
plt.show()
I would be very pleased if someone could help me with this problem since I intend to plot the time with the different paths (which are stored in the big matrix).
Thank you in advance,
Nick

To completely avoid the loop and use fast and clean pythonic vectorized operations, you can write your operation like this:
import matplotlib.pyplot as plt
import numpy as np
T = 2
mu = 0.15
sigma = 0.10
S0 = 20
dt = 0.01
N = round(T/dt) ### Paths
simu = 20 ### number of simulations
i = 1
## creates an array with values from 0 to T with N elementes (T/dt)
t = np.linspace(0, T, N)
## result matrix creation not needed, thanks to gboffi for the hint :)
## random number showing the Wiener process
W = np.random.standard_normal(size=(simu, N))
W = np.cumsum(W, axis=1)*np.sqrt(dt) ### standard brownian motion ###
X = (mu-0.5*sigma**2)*t + sigma*W
res = S0*np.exp(X) ### new Stock prices based on the simulated returns ###
Now your results are stored in a real matrix, or correctly a np.ndarray. np.ndarray is the standard array format of numpy and thus the most widely used and supported array format.
To plot it, you need to give further information, like: Do you want to plot each row of the result array? This would then look like:
for i in range(simu):
plt.plot(t, res[i])
plt.show()
If you want to check the shape for consistency after calculation, you can do the following:
assert res.shape == (simu, N), 'Calculation faulty!'

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Method for avoiding random number repetition - python - python

Related

Python DGL increasing node degree after removals

Time series dBFS plot output modification - current output plot not as expected (matplotlib)

How do I generate 1000 data points?

Efficiently using 1-D pyfftw on small slices of a 3-D numpy array

Save results of a Loop in a Matrix

Categories

Resources