pnoise2() function in noise module produces SegFault python 3.8

pnoise2() function in noise module produces SegFault python 3.8 - python

I am trying to procedurally generate some maps for a game I am working on, and I am trying to use the perlin noise function from the noise module.
By following some tutorials on the internet, I found out that I had to use it this way:
import random
import noise
from noise import pnoise2
factor = 15 #it doesn't need to be 15, it can be anything different from greater than 1 or lower than -1, otherwise it will return 0.0
oct=1
seed=random.randint(0, 100)
print(pnoise2(x/factor, y/factor, octaves=oct, base=seed))
I can pass it the values x and y but if the factor meets the conditions in the comment, the program will throw a fatal error, segfault, when calling the pnoise2 function. What am I doing wrong?
I am also using pygame for the game, and the exact error I get is the following:
Fatal Python error: (pygame parachute) Segmentation Fault
Python runtime state: initialized
Current thread 0x00007fe3247f2740 (most recent call first):
File "scripts/world_manager.py", line 78 in generate
File "main.py", line 153 in reload_chunks
File "main.py", line 257 in update
File "main.py", line 207 in run
File "main.py", line 406 in <module>
Aborted
Here is the generate() function in world_manager.py:
# Generate a chunk at given coordinates using pnoise2 and adding it to the chunk list
def generate(self, chunkx, chunky):
# print("Generating chunk at", chunkx, chunky)
GRASS = "grass"
MOUNTAIN = "mountain"
EMPTY = "void"
floor_void_diff = 0.3
mountain = floor_void_diff + 0.2
factor = 3
floor = {}
items = {}
chunk = (chunkx, chunky)
if chunk not in self.chunks:
# print("Generating chunk at {}".format(chunk))
for y in range(chunky * CHUNKSIZE, chunky * CHUNKSIZE + CHUNKSIZE):
for x in range(chunkx * CHUNKSIZE, chunkx * CHUNKSIZE + CHUNKSIZE):
i = pnoise2(x/factor, y/factor, base=int(self.seed))
print(i)
if i >= floor_void_diff:
if i > mountain:
floor.update({(x, y): MOUNTAIN})
else:
floor.update({(x, y): GRASS})
spawner = random.randint(-1, ITEM_SPAWN_RATIO)
random_item = random.randint(0, len(ITEM_LIST)-1)
if spawner == 0:
items.update({(x, y): ITEM_LIST[random_item]})
elif i < floor_void_diff:
floor.update({(x, y): EMPTY})
else:
floor.update({(x, y): EMPTY})
self.chunks.update({chunk: {"floor": floor, "items": items}})
self.unsaved += 1
CHUNKSIZE is defined in a settings.py file, and right now CHUNKSIZE=4

I have not found a way to solve this problem, but an alternative. I turns out that the way I was using the noise module breaks it some how, so I started looking for perlin noise functions and I found a pretty good one at https://rosettacode.org/wiki/Perlin_noise#Python

It looks like you're passing seed in, for what is actually the octaves parameter. I don't how how this causes a segmentation fault, but a large number would cause the addition loop to take on a wild number of iterations. I don't know that this library (if it's indeed https://github.com/caseman/noise) actually supports seeding, but maybe I'm missing where it's supported.
Further, why not use the snoise2 function? pnoise2 (actual Perlin noise) creates a lot of 45 and 90 degree bias (image comparison, Perlin on top). The "Perlin" algorithm has the iconic name, but it's probably rarely the best choice for noise anymore. The snoise2 (Simplex noise 2D) is more subtle about grid alignment. This library (indirect shameless plug) does generate seedable noise in the Simplex category, and the output is a bit higher quality than the one in the other lib I would say, but a downside might be that it's implemented directly in Python instead of wrapping a potentially much faster C implementation. You would also need to implement octave summation / fractal brownian motion yourself if you were after that effect. The other lib provides it with the octaves parameter.

Related

Change the melody of human speech using FFT and polynomial interpolation

I'm trying to do the following:
Extract the melody of me asking a question (word "Hey?" recorded to
wav) so I get a melody pattern that I can apply to any other
recorded/synthesized speech (basically how F0 changes in time).
Use polynomial interpolation (Lagrange?) so I get a function that describes the melody (approximately of course).
Apply the function to another recorded voice sample. (eg. word "Hey." so it's transformed to a question "Hey?", or transform the end of a sentence to sound like a question [eg. "Is it ok." => "Is it ok?"]). Voila, that's it.
What I have done? Where am I?
Firstly, I have dived into the math that stands behind the fft and signal processing (basics). I want to do it programatically so I decided to use python.
I performed the fft on the entire "Hey?" voice sample and got data in frequency domain (please don't mind y-axis units, I haven't normalized them)
So far so good. Then I decided to divide my signal into chunks so I get more clear frequency information - peaks and so on - this is a blind shot, me trying to grasp the idea of manipulating the frequency and analyzing the audio data. It gets me nowhere however, not in a direction I want, at least.
Now, if I took those peaks, got an interpolated function from them, and applied the function on another voice sample (a part of a voice sample, that is also ffted of course) and performed inversed fft I wouldn't get what I wanted, right?
I would only change the magnitude so it wouldn't affect the melody itself (I think so).
Then I used spec and pyin methods from librosa to extract the real F0-in-time - the melody of asking question "Hey?". And as we would expect, we can clearly see an increase in frequency value:
And a non-question statement looks like this - let's say it's moreless constant.
The same applies to a longer speech sample:
Now, I assume that I have blocks to build my algorithm/process but I still don't know how to assemble them beacause there are some blanks in my understanding of what's going on under the hood.
I consider that I need to find a way to map the F0-in-time curve from the spectrogram to the "pure" FFT data, get an interpolated function from it and then apply the function on another voice sample.
Is there any elegant (inelegant would be ok too) way to do this? I need to be pointed in a right direction beceause I can feel I'm close but I'm basically stuck.
The code that works behind the above charts is taken just from the librosa docs and other stackoverflow questions, it's just a draft/POC so please don't comment on style, if you could :)
fft in chunks:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
import os
file = os.path.join("dir", "hej_n_nat.wav")
fs, signal = wavfile.read(file)
CHUNK = 1024
afft = np.abs(np.fft.fft(signal[0:CHUNK]))
freqs = np.linspace(0, fs, CHUNK)[0:int(fs / 2)]
spectrogram_chunk = freqs / np.amax(freqs * 1.0)
# Plot spectral analysis
plt.plot(freqs[0:250], afft[0:250])
plt.show()
spectrogram:
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
import os
file = os.path.join("/path/to/dir", "hej_n_nat.wav")
y, sr = librosa.load(file, sr=44100)
f0, voiced_flag, voiced_probs = librosa.pyin(y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'))
times = librosa.times_like(f0)
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
fig, ax = plt.subplots()
img = librosa.display.specshow(D, x_axis='time', y_axis='log', ax=ax)
ax.set(title='pYIN fundamental frequency estimation')
fig.colorbar(img, ax=ax, format="%+2.f dB")
ax.plot(times, f0, label='f0', color='cyan', linewidth=2)
ax.legend(loc='upper right')
plt.show()
Hints, questions and comments much appreciated.

The problem was that I didn't know how to modify the fundamental frequency (F0). By modifying it I mean modify F0 and its harmonics, as well.
The spectrograms in question show frequencies at certain points in time with power (dB) of certain frequency point.
Since I know which time bin holds which frequency from the melody (green line below) ...
....I need to compute a function that represents that green line so I can apply it to other speech samples.
So I need to use some interpolation method which takes as parameters the sample F0 function points.
One need to remember that degree of the polynomial should equal to the number of points. The example doesn't have that unfortunately, but the effect is somehow ok as for the prototype.
def _get_bin_nr(val, bins):
the_bin_no = np.nan
for b in range(0, bins.size - 1):
if bins[b] <= val < bins[b + 1]:
the_bin_no = b
elif val > bins[bins.size - 1]:
the_bin_no = bins.size - 1
return the_bin_no
def calculate_pattern_poly_coeff(file_name):
y_source, sr_source = librosa.load(os.path.join(ROOT_DIR, file_name), sr=sr)
f0_source, voiced_flag, voiced_probs = librosa.pyin(y_source, fmin=librosa.note_to_hz('C2'),
fmax=librosa.note_to_hz('C7'), pad_mode='constant',
center=True, frame_length=4096, hop_length=512, sr=sr_source)
all_freq_bins = librosa.core.fft_frequencies(sr=sr, n_fft=n_fft)
f0_freq_bins = list(filter(lambda x: np.isfinite(x), map(lambda val: _get_bin_nr(val, all_freq_bins), f0_source)))
return np.polynomial.polynomial.polyfit(np.arange(0, len(f0_freq_bins), 1), f0_freq_bins, 3)
def calculate_pattern_poly_func(coefficients):
return np.poly1d(coefficients)
Method calculate_pattern_poly_coeff calculates polynomial coefficients.
Using pythons poly1d lib I can compute function which can modify the speech. How to do that?
I just need to move up or down all values vertically at certain point in time.
for instance I want to move all frequencies at time bin 0,75 seconds up 3 times -> it means that frequency will be increased and the melody at that point will sound higher.
Code:
def transform(sentence_audio_sample, mode=None, show_spectrograms=False, frames_from_end_to_transform=12):
# cutting out silence
y_trimmed, idx = librosa.effects.trim(sentence_audio_sample, top_db=60, frame_length=256, hop_length=64)
stft_original = librosa.stft(y_trimmed, hop_length=hop_length, pad_mode='constant', center=True)
stft_original_roll = stft_original.copy()
rolled = stft_original_roll.copy()
source_frames_count = np.shape(stft_original_roll)[1]
sentence_ending_first_frame = source_frames_count - frames_from_end_to_transform
sentence_len = np.shape(stft_original_roll)[1]
for i in range(sentence_ending_first_frame + 1, sentence_len):
if mode == 'question':
by = int(_question_pattern(i) / 500)
elif mode == 'exclamation':
by = int(_exclamation_pattern(i) / 500)
else:
by = 0
rolled = _roll_column(rolled, i, by)
transformed_data = librosa.istft(rolled, hop_length=hop_length, center=True)
def _roll_column(two_d_array, column, shift):
two_d_array[:, column] = np.roll(two_d_array[:, column], shift)
return two_d_array
In this case I am simply rolling up or down frequencies referencing certain time bin.
This needs to be polished as it doesn't take into consideration an actual state of the transformed sample. It just rolls it up/down according to the factor calculated using the polynomial function computer earlier.
You can check full code of my project at github, "audio" package contains pattern calculator and audio transform algorithm described above.
Feel free to ask if something's unclear :)

Using find_peaks in loop to find peak heights from data

This problem is about using scipy.signal.find_peaks for extracting mean peak height from data files efficiently. I am a beginner with Python (3.7), so I am not sure if I have written my code in the most optimal way, with regard to speed and code quality.
I have a set of measurement files containing one million data points (30MB) each. The graph of this data is a signal with peaks at regular intervals, and with noise. Also, the baseline and the amplitude of the signal vary across parts of the signal. I attached an image of an example. The signal can be much less clean.
My goal is to calculate the mean height of the peaks for each file. In order to do this, first I use find_peaks to locate all the peaks. Then, I loop over each peak location and detect the peak in a small interval around the peak, to make sure I get the local height of the peak.
I then put all these heights in numpy arrays and calculate their mean and standard deviation afterwards.
Here is a barebone version of my code, it is a bit long but I think that might also be because I am doing something wrong.
import numpy as np
from scipy.signal import find_peaks
# Allocate empty lists for values
mean_heights = []
std_heights = []
mean_baselines = []
std_baselines = []
temperatures = []
# Loop over several files, read them in and process data
for file in file_list:
temperatures.append(file)
# Load in data from a file of 30 MB
t_dat, x_dat = np.loadtxt(file, delimiter='\t', unpack=True)
# Find all peaks in this file
peaks, peak_properties = find_peaks(x_dat, prominence=prom, width=0)
# Calculate window size, make sure it is even
if round(len(t_dat)/len(peaks)) % 2 == 0:
n_points = len(t_dat) // len(peaks)
else:
n_points = len(t_dat) // len(peaks) + 1
t_slice = t_dat[-1] / len(t_dat)
# Allocate np arrays for storing heights
baseline_list = np.zeros(len(peaks) - 2)
height_list = np.zeros(len(peaks) - 2)
# Loop over all found peaks, and re-detect the peak in a window around the peak to be able
# to detect its local height without triggering to a baseline far away
for i in range(len(peaks) - 2):
# Making a window around a peak_properties
sub_t = t_dat[peaks[i+1] - n_points // 2: peaks[i+1] + n_points // 2]
sub_x = x_dat[peaks[i+1] - n_points // 2: peaks[i+1] + n_points // 2]
# Detect the peaks (2 version, specific to the application I have)
h_min = max(sub_x) - np.mean(sub_x)
_, baseline_props = find_peaks(
sub_x, prominence=h_min, distance=n_points - 1, width=0)
_, height_props = find_peaks(np.append(
min(sub_x) - 1, sub_x), prominence=h_min, distance=n_points - 1, width=0)
# Add the heights to the np arrays storing the heights
baseline_list[i] = baseline_props["prominences"]
height_list[i] = height_props["prominences"]
# Fill lists with values, taking the stdev and mean of the np arrays with the heights
mean_heights.append(np.mean(height_list))
std_heights.append(np.std(height_list))
mean_baselines.append(np.mean(baseline_list))
std_baselines.append(np.std(baseline_list))
It takes ~30 s to execute. Is this normal or too slow? If so, can it be optimised?

In the meantime I have improved the speed by getting rid of various inefficiencies, that I found by using the Python profiler. I will list the optimisations here ordered by significance for the speed:
Using pandas pd.read_csv() for I/O instead of np.loadtxt() cut off about 90% of the runtime. As also mentioned here, this saves a lot of time. This means changing this:
t_dat, x_dat = np.loadtxt(file, delimiter='\t', unpack=True)
to this:
data = pd.read_csv(file, delimiter = "\t", names=["t_dat", "x_dat"])
t_dat = data.values[:,0]
x_dat = data.values[:,1]
Removing redundant len() calls. I noticed that len() was called many times, and then noticed that this happened unnecessary. Changing this:
if round(len(t_dat) / len(peaks)) % 2 == 0:
n_points = int(len(t_dat) / len(peaks))
else:
n_points = int(len(t_dat) / len(peaks) + 1)
to this:
n_points = round(len(t_dat) / len(peaks))
if n_points % 2 != 0:
n_points += 1
proved to be also a significant improvement.
Lastly, a disproportionally component of the computational time (about 20%) was used by the built in Python functions min(), max() and sum(). Since I was using numpy arrays already, switching to the numpy equivalents for these functions, resulted in a 84% improvement on this part. This means for example changing max(sub_x) to sub_x.max().
These are all unrelated optimisations that I still think might be useful for Python beginner like myself, and they do help a lot.

Shogun / quadratic MMD error caused by varying train_test_ratio

I'm using Shogun to run MMD (quadratic) and compare two nonparametric distributions based on their samples (code below is for 1D, but I've also looked at 2D samples). In the toy problem shown below, I try to change the ratio between training and testing samples in the process of selecting an optimized kernel (KSM_MAXIMIZE_MMD is the selection strategy; I've also used KSM_MEDIAN_HEURISTIC). It appears that any ratio other than 1 yields an error.
Am I allowed to change this ratio in this setting?
(I see that it is used at: http://www.shogun-toolbox.org/examples/latest/examples/statistical_testing/quadratic_time_mmd.html, but it is set to 1 there)
Concise version of the my code (inspired by the notebook available at: http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html):
import shogun as sg
import numpy as np
from scipy.stats import laplace, norm
n = 220
mu = 0.0
sigma2 = 1
b=np.sqrt(0.5)
X = sg.RealFeatures((norm.rvs(size=n) * np.sqrt(sigma2) + mu).reshape(1,-1))
Y = sg.RealFeatures(laplace.rvs(size=n, loc=mu, scale=b).reshape(1,-1))
mmd = sg.QuadraticTimeMMD(X, Y)
mmd.add_kernel(sg.GaussianKernel(10, 1.0))
mmd.set_kernel_selection_strategy(sg.KSM_MAXIMIZE_MMD)
mmd.set_train_test_mode(True)
mmd.set_train_test_ratio(1)
mmd.select_kernel()
mmd_kernel = sg.GaussianKernel.obtain_from_generic(mmd.get_kernel())
kernel_width = mmd_kernel.get_width()
statistic = mmd.compute_statistic()
p_value = mmd.compute_p_value(statistic)
print p_value
This exact version runs and prints p-values just fine.
If I change the argument passed to mmd.set_train_test_ratio() from 1 to 2, I get:
SystemErrorTraceback (most recent call last)
<ipython-input-30-dd5fcb933287> in <module>()
25 kernel_width = mmd_kernel.get_width()
26
---> 27 statistic = mmd.compute_statistic()
28 p_value = mmd.compute_p_value(statistic)
29
SystemError: [ERROR] In file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/statistical_testing/internals/mmd/ComputeMMD.h line 90: assertion kernel_matrix.num_rows==size && kernel_matrix.num_cols==size failed in float32_t shogun::internal::mmd::ComputeMMD::operator()(const shogun::SGMatrix<T>&) const [with T = float; float32_t = float] file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/statistical_testing/internals/mmd/ComputeMMD.h line 90
It gets worse, if I use the value below 1. In addition to the following error,
jupyter notebook kernel crashes every time (after which I need to rerun the entire notebook; the message says: "The kernel appears to have died. It will restart automatically.").
SystemErrorTraceback (most recent call last)
<ipython-input-31-cb4a5224f4ef> in <module>()
20 mmd.set_train_test_ratio(0.5)
21
---> 22 mmd.select_kernel()
23
24 mmd_kernel = sg.GaussianKernel.obtain_from_generic(mmd.get_kernel())
SystemError: [ERROR] In file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/kernel/Kernel.h line 210: GaussianKernel::kernel(): index out of Range: idx_a=146/146 idx_b=0/146
Complete code (in a jypyter notebook) can be found at: http://nbviewer.jupyter.org/url/dmitry.duplyakin.org/p/jn/kernel-minimal.ipynb
Please let me know if I am missing a step or need to try a different approach.
Side questions:
Both http://www.shogun-toolbox.org/examples/latest/examples/statistical_testing/quadratic_time_mmd.html and http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html show examples of using sg.GaussianKernel(10, <width>). I couldn't find more information about the 1st parameter other than its name, cache size. How and when am I supposed to change it?
As mentioned in the referenced notebook, mmd.get_kernel_selection_strategy().get_name() returns only the generic name, specifically KernelSelectionStrategy. How can I obtain a more specific name for the selected strategy (e.g., KSM_MEDIAN_HEURISTIC) from an instance of the sg.QuadraticTimeMMD class?
Any relevant information or references will be greatly appreciated.
Shogun version: v6.1.3_2017-12-7_19:14

The train_test_ratio attribute is the ratio between the number of samples used in training and the number of samples used in testing. When you have train_test_mode turned on, the way it decides how many samples to fetch in each mode goes something like this.
num_training_samples = m_num_samples * train_test_ratio / (train_test_ratio + 1)
num_testing_samples = m_num_samples / (train_test_ratio + 1)
It implicitly assumes the divisibility. A train_test_ratio of 2 would, therefore, try to use 2/3rd of the data for training and 1/3rd of the data for testing, which is problematic for the total number of samples you have, 220. By the logic, it sets num_training_samples = 146 and num_testing_samples = 73, which doesn't add up to 220. Similar issues arise when using 0.5 as the train-test ratio. If you use some other values for the train_test_ratio which splits the total number of samples perfectly, I think these errors would go away.
I am not totally sure but I think the cache makes sense when you're using SVMLight with Shogun. Please check http://svmlight.joachims.org/ for details. From their page
-m [5..] - size of cache for kernel evaluations in MB (default 40)
The larger the faster...
There's no pretty-print for the kernel-selection strategy being used, but you could do mmd.get_kernel_selection_strategy().get_method() which returns you the enum value (of type EKernelSelectionMethod) which might be helpful. Since it's not documented yet in Shogun api-doc, here's the C++ equivalent for this that you might use.
enum EKernelSelectionMethod
{
KSM_MEDIAN_HEURISTIC,
KSM_MAXIMIZE_MMD,
KSM_MAXIMIZE_POWER,
KSM_CROSS_VALIDATION,
KSM_AUTO = KSM_MAXIMIZE_POWER
};

Summary (from comments):
The bug does not show up in the latest code
Solution is in: https://github.com/shogun-toolbox/shogun/pull/4134

Numerical Stability of Forward Substitution in Python

I am implementing some basic linear equation solvers in Python.
I have currently implemented forward and backward substitution for triangular systems of equations (so very straightforward to solve!), but the precision of the solutions becomes very poor even with systems of about 50 equations (50x50 coefficient matrix).
The following code performs the forward/backward substitution:
FORWARD_SUBSTITUTION = 1
BACKWARD_SUBSTITUTION = 2
def solve_triang_subst(A: np.ndarray, b: np.ndarray,
substitution=FORWARD_SUBSTITUTION) -> np.ndarray:
"""Solves a triangular system via
forward or backward substitution.
A must be triangular. FORWARD_SUBSTITUTION means A should be
lower-triangular, BACKWARD_SUBSTITUTION means A should be upper-triangular.
"""
rows = len(A)
x = np.zeros(rows, dtype=A.dtype)
row_sequence = reversed(range(rows)) if substitution == BACKWARD_SUBSTITUTION else range(rows)
for row in row_sequence:
delta = b[row] - np.dot(A[row], x)
cur_x = delta / A[row][row]
x[row] = cur_x
return x
I am using numpy and 64-bit floats.
Simple Testing Tool
I have set up a simple test suite which generates coefficient matrices and x vectors, computes the b, and then uses forward or backward substitution to recover the x, comparing it to the its known value for validity.
The following code performs these checks:
import numpy as np
import scipy.linalg as sp_la
RANDOM_SEED = 1984
np.random.seed(RANDOM_SEED)
def check(sol: np.ndarray, x_gt: np.ndarray, description: str) -> None:
if not np.allclose(sol, x_gt, rtol=0.1):
print("Found inaccurate solution:")
print(sol)
print("Ground truth (not achieved...):")
print(x_gt)
raise ValueError("{} did not work!".format(description))
def fuzz_test_solving():
N_ITERATIONS = 100
refine_result = True
for mode in [FORWARD_SUBSTITUTION, BACKWARD_SUBSTITUTION]:
print("Starting mode {}".format(mode))
for iteration in range(N_ITERATIONS):
N = np.random.randint(3, 50)
A = np.random.uniform(0.0, 1.0, [N, N]).astype(np.float64)
if mode == BACKWARD_SUBSTITUTION:
A = np.triu(A)
elif mode == FORWARD_SUBSTITUTION:
A = np.tril(A)
else:
raise ValueError()
x_gt = np.random.uniform(0.0, 1.0, N).astype(np.float64)
b = np.dot(A, x_gt)
x_est = solve_triang_subst(A, b, substitution=mode,
refine_result=refine_result)
# TODO report error and count, don't throw!
# Keep track of error norm!!
check(x_est, x_gt,
"Mode {} custom triang iteration {}".format(mode, iteration))
if __name__ == '__main__':
fuzz_test_solving()
Note that the maximum size of a test matrix is 49x49. Even in this case, the system cannot always compute decent solutions, and fails by more than a margin of 0.1. Here's an example of such a failure (this is doing backward substitution, so the biggest error is in the 0th coefficient; all the test data are sampled uniformly from [0, 1[):
Solution found with Mode 2 custom triang iteration 24:
[ 0.27876067 0.55200497 0.49499509 0.3259397 0.62420183 0.47041149
0.63557676 0.41155446 0.47191956 0.74385864 0.03002819 0.4700286
0.37989592 0.56527691 0.15072607 0.05659282 0.52587574 0.82252197
0.65662833 0.50250729 0.74139748 0.10852731 0.27864265 0.42981232
0.16327331 0.74097937 0.24411709 0.96934199 0.890266 0.9183985
0.14842446 0.51806495 0.36966843 0.18227989 0.85399593 0.89615663
0.39819336 0.90445931 0.21430972 0.61212349 0.85205597 0.66758689
0.1793689 0.38067267 0.39104614 0.6765885 0.4118123 ]
Ground truth (not achieved...)
[ 0.20881608 0.71009766 0.44735271 0.31169033 0.63982328 0.49075813
0.59669585 0.43844108 0.47764942 0.72222069 0.03497499 0.4707452
0.37679884 0.56439738 0.15120397 0.05635977 0.52616387 0.82230625
0.65670245 0.50251426 0.74139956 0.10845974 0.27864289 0.42981226
0.1632732 0.74097939 0.24411707 0.96934199 0.89026601 0.91839849
0.14842446 0.51806495 0.36966843 0.18227989 0.85399593 0.89615663
0.39819336 0.90445931 0.21430972 0.61212349 0.85205597 0.66758689
0.1793689 0.38067267 0.39104614 0.6765885 0.4118123 ]
I have also implemented the iterative refinement method described in Section 2.5 of [0], and while it did help a little, the results are still poor for larger matrices.
MATLAB Sanity Check
I also did this experiment in MATLAB, and even there, once there are more than 100 equations, the estimation error shoots up exponentially.
Here is the MATLAB code I used for this experiment:
err_norms = [];
range = 1:3:120;
for size=range
A = rand(size, size);
A = tril(A);
x_gt = rand(size, 1);
b = A * x_gt;
x_sol = A\b;
err_norms = [err_norms, norm(x_gt - x_sol)];
end
plot(range, err_norms);
set(gca, 'YScale', 'log')
And here is the resulting plot:
Main Question
My question is: Is this normal behavior, seeing as there is essentially no structure in the problem, given that I randomly generate the A matrix and x?
What about solving linear systems of 100s of equations for various practical applications? Are these limitations simply an accepted fact, and e.g., optimization algorithms are just naturally robust to these issues? Or am I missing some important facets of this problem?
[0]: Press, William H. Numerical recipes 3rd edition: The art of scientific computing. Cambridge university press, 2007.

There are no limitations. This is a very fruitful exercise that we all came to realize; writing linear solvers are not that easy and that's why almost always LAPACK or its cousins in other languages are used with full confidence.
You are hit by almost singular matrices and because you are using matlab's backslash you don't see that matlab is switching to least squares solutions behind the scenes when near singularity is hit. If you just change A\b to linsolve(A,b) hence you restrict the solver to solve square systems you'll probably see lots of warnings on your console.
I didn't test it because I don't have a license anymore but if I write blindly this should show you the condition numbers of the matrices at each step.
err_norms = [];
range = 1:3:120;
for i=1:40
size = range(i);
A = rand(size, size);
A = tril(A);
x_gt = rand(size, 1);
b = A * x_gt;
x_sol = linsolve(A,b);
err_norms = [err_norms, norm(x_gt - x_sol)];
zzz(i) = rcond(A);
end
semilogy(range, err_norms);
figure,semilogy(range,zzz);
Note that because you are picking up numbers from a uniform distribution it becomes more and more likely to hit ill-conditioned matrices (wrt to inversion) as the rows have more probability to have rank deficiency. That's why the error becomes bigger and bigger. Sprinkle some identity matrix times a scalar and all errors should come back to eps*n levels.
But best, leave this to expert algorithms which have been tested through decades. It is really not that trivial to write any of these. You can read the Fortran codes, for example, dtrsm solves the triangular system.
On the Python side, you can use scipy.linalg.solve_triangular which uses ?trtrs routines from LAPACK.

python numpy data type error and extremely inefficient use of pyplot :(

[Using windows 10 and python 3.5 with newest modules]
Hello!
I have two slightly different problems that belong together because one is the buggy solution of the other. The first function here is extremely slow with datapoints over 75000 and does not work with 150000. This on does exactly what I want though.
#I call the functions like this:
plt.plot(logtime[:recmax-(degree*2-1)] - (logtime[0]-degree), smoothListTriangle(cpm, degree), color="green", linewidth=2, label="Smoothed n="+degree)
plt.plot(logtime[:recmax] - logtime[0], smoothListGaussian2(str(cpm), degree), color="lime", linewidth=5, label="")
#And cpm is always:
cpm = cpm.astype(int) #Array of big number of values
def smoothListTriangle(cpm,degree): #Thank you Scott from swharden.com!
weight=[]
window=degree*2-1
smoothed=[0.0]*(len(cpm)-window)
for x in range(1,2*degree):
weight.append(degree-abs(degree-x))
w=np.array(weight)
for i in range(len(smoothed)):
smoothed[i]=sum(np.array(cpm[i:i+window])*w)/float(sum(w))
#Very, VERY slow...
return smoothed
The higher "degree" is the longer it takes. But with lesser degree it would not look good.
...
The second function here should be (way?) more efficient, but i cant resolve the data type error:
def smoothListGaussian2(myarray, degree):
myarray = np.pad(myarray, (degree-1,degree-1), mode='edge')
window = degree*2-1
weight = np.arange(-degree+1, degree)/window
weight = np.exp(-(16*weight**2))
weight /= sum(weight)
#weight = weight.astype(int) #Does throw the "invalid literal" error
smoothed = np.convolve(myarray, weight, mode='valid')
return smoothed
#TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'
Im desperately trying to resolve this data type error here with numpy. Its killing me! IT seems to be the array "weight" thats the one who's float64, but converting it throws more errors like:
ValueError: invalid literal for int() with base 10: '[31 31 33 ..., 48 49 51]'
So... Im new to python and use this to log data from my geiger counter. Do you have any idea how to either make the first function WAY more efficient or resolve the error in the second? Im at a loss here.
I found the scripts here: http://www.swharden.com/wp/2008-11-17-linear-data-smoothing-in-python/#comments (I found Scotts other triangle-smooth-function on this site, but i couldnt get this to work either. Its more complicated)
Note that the number of data points are depending on the length in seconds of the measurement and this length can very well be several days. I guess one million data points and more are not unusual.
Thank you!

I just had a revelation of some sort. All i had to do is convert the "myarray" to float before convolving.
I had to do so many conversions to make the whole code work correctly, its ridiculous! I thought this is easy in Python, but no.. :(( Seems to me that c++ is better in that case.
def smoothListGaussian2(myarray, degree):
myarray = np.pad(myarray, (degree - 1, degree - 1), mode='edge')
window = degree * 2 - 1
weight = np.arange(-degree + 1, degree) / window
weight = np.exp(-(16 * weight ** 2))
weight /= sum(weight)
myarray = myarray.astype(float)
smoothed = np.convolve(myarray, weight, mode='valid')
return smoothed
Since this works now I coud test the speed and its pretty fast. I cant see a difference in speed between 40k and 150k data points anymore. cool

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.