I'm working on an application that, to make my life easier, needs a converter from notes to frequencies, played at a certain number of notes per second, including chords.
I found this article which highlighted the frequencies of each note, which I manually blended (with pyaudio) to make my own rendition of Smoke On The Water, using the article's frequency for each note.
This would work, and I could create chords by creating parallel processes, though I have no way of converting a note or tab number into a specific pitch. Most of my data is in the form of:
0 3 5 0 3 6 5 0 3 5 3 0
Essentially, I require an equation or function which can return the frequency for the input, with 0 being the open low E string and each increase of 1 being one fret up the fretboard (1 = F).
Isn't there a blatant pattern?
I wish there were, but I suspect sine waves are the culprit. The difference from E to F is 5.1, from F to F# is 5.2, and from F# to G is 5.5.
Thanks for any help, it's greatly appreciated.
Isn't there a blatant pattern?
Yes, for music in general there is. Two adjacent notes are separated by a factor of 2^(1/12); see the Wikipedia articles "Twelfth root of two" and "Semitone". I tried this out on the numbers in your linked article, and the pattern fit perfectly to the number of significant digits shown in the article.
EDIT
OP asked for some code. Here's a quick -- but verbosely documented -- shot at that:
# A semitone (half-step) is the twelfth root of two
# https://en.wikipedia.org/wiki/Semitone
# https://en.wikipedia.org/wiki/Twelfth_root_of_two
SEMITONE_STEP = 2 ** (1/12)
# Standard tuning for a guitar - EADGBE
LOW_E_FREQ = 82.4 # Baseline - low 'E' is 82.4Hz
# In standard tuning, we use the fifth fret to tune the next string
# except for the next-to-highest string where we use the fourth fret.
STRING_STEPS = [5, 5, 5, 4, 5]
# Number of frets can vary; we compute 24 positions per string (open string plus frets 1-23)
N_FRETS = 24
# This will be a list of the frequencies of all six strings,
# a list of six lists, where each list is that string's frequencies at each fret
fret_freqs = []
# Start with the low string as our reference point
# We just shorthand the math of multiplying by SEMITONE_STEP over and over
fret_freqs.append([LOW_E_FREQ * (SEMITONE_STEP ** n) for n in range(N_FRETS)])
# Now go through the upper strings and base each one off the lower string's fret, just like
# when we are tuning a guitar
for tuning_fret in STRING_STEPS:
    # Pick off the nth fret of the previous string and use it as our base frequency
    base_freq = fret_freqs[-1][tuning_fret]
    fret_freqs.append([base_freq * (SEMITONE_STEP ** n) for n in range(N_FRETS)])
for stringFreqs in fret_freqs:
    # We don't need 14 decimal places of precision, thank you very much.
    print(["{:.1f}".format(f) for f in stringFreqs])
Output of this:
['82.4', '87.3', '92.5', '98.0', '103.8', '110.0', '116.5', '123.5', '130.8', '138.6', '146.8', '155.6', '164.8', '174.6', '185.0', '196.0', '207.6', '220.0', '233.1', '246.9', '261.6', '277.2', '293.6', '311.1']
['110.0', '116.5', '123.5', '130.8', '138.6', '146.8', '155.6', '164.8', '174.6', '185.0', '196.0', '207.6', '220.0', '233.1', '246.9', '261.6', '277.2', '293.6', '311.1', '329.6', '349.2', '370.0', '392.0', '415.3']
['146.8', '155.6', '164.8', '174.6', '185.0', '196.0', '207.6', '220.0', '233.1', '246.9', '261.6', '277.2', '293.6', '311.1', '329.6', '349.2', '370.0', '392.0', '415.3', '440.0', '466.1', '493.8', '523.2', '554.3']
['196.0', '207.6', '220.0', '233.1', '246.9', '261.6', '277.2', '293.6', '311.1', '329.6', '349.2', '370.0', '392.0', '415.3', '440.0', '466.1', '493.8', '523.2', '554.3', '587.3', '622.2', '659.2', '698.4', '739.9']
['246.9', '261.6', '277.2', '293.6', '311.1', '329.6', '349.2', '370.0', '392.0', '415.3', '440.0', '466.1', '493.8', '523.2', '554.3', '587.3', '622.2', '659.2', '698.4', '739.9', '783.9', '830.5', '879.9', '932.2']
['329.6', '349.2', '370.0', '392.0', '415.3', '440.0', '466.1', '493.8', '523.2', '554.3', '587.3', '622.2', '659.2', '698.4', '739.9', '783.9', '830.5', '879.9', '932.2', '987.7', '1046.4', '1108.6', '1174.6', '1244.4']
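For the tab-number input in the question, the same rule reduces to one formula, f(n) = f0 * 2^(n/12). Here is a minimal sketch, assuming the same 82.4 Hz baseline for the open low E string; the name fret_to_freq is only illustrative and not part of the code above.

```python
# Illustrative helper: frequency of fret n on a single string, using the
# equal-temperament semitone ratio 2 ** (1/12).
LOW_E_FREQ = 82.4  # open low E string in Hz (same baseline as above)


def fret_to_freq(fret, open_freq=LOW_E_FREQ):
    """Return the frequency in Hz of `fret`, where 0 is the open string."""
    return open_freq * 2 ** (fret / 12)


tab = [0, 3, 5, 0, 3, 6, 5, 0, 3, 5, 3, 0]  # sequence from the question
freqs = [round(fret_to_freq(n), 1) for n in tab]
print(freqs)
```

The rounded values agree with the first row of the table above (fret 3 is 98.0, fret 5 is 110.0, fret 6 is 116.5). Passing a different open_freq would cover the other strings.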
I'm trying to get a subset of combinations such that every option is used the same amount of times, or close to it, from the total set of combinations without repetition. For example, I have 8 options (let's say A-H) and I need combinations of 4 letters where order doesn't matter. That would give me 70 possible combinations. I would like to take a subset of those combinations such that A appears as much as each other letter does, and A appears with B as much as C appears with D, etc. I know there are subsets where it is impossible to have each letter appear the same amount of times and appear with another letter the same amount of times so when I say "same amount of times" in this post, I mean the same amount or close to it.
If the options are written out in an organized list as is shown below, I couldn't just select the first N options because that would give A far more use than it would H. Also, A and B would appear together more than C and D. The main idea is to get as evenly distributed use of each letter combination as possible.
ABCD ABCE ABCF ABCG ABCH ABDE ABDF ABDG ABDH ABEF ABEG ABEH ABFG ABFH ABGH ACDE ACDF ACDG ACDH ACEF ACEG ACEH ACFG ACFH ACGH ADEF ADEG ADEH ADFG ADFH ADGH AEFG AEFH AEGH AFGH BCDE BCDF BCDG BCDH BCEF BCEG BCEH BCFG BCFH BCGH BDEF BDEG BDEH BDFG BDFH BDGH BEFG BEFH BEGH BFGH CDEF CDEG CDEH CDFG CDFH CDGH CEFG CEFH CEGH CFGH DEFG DEFH DEGH DFGH EFGH
I could take a random sample but being random, it doesn't exactly meet my requirements of taking a subset intentionally to get an even distribution. It could randomly choose a very uneven distribution.
Is there a tool or a mathematical formula to generate a list like I'm asking for? Building one in Python or some other coding language is an option if I had an idea of how to go about it.
You are asking the dealer to shuffle the deck.
The Python standard library has a module named random, which contains a shuffle function. Present your eight options, shuffle them, and return the first four or however many you need. It will be random, obeying the distribution that you desire.
EDIT
I'm not sure how I could have expressed "shuffle" more clearly
but I will try, in math, in English and in code.
Draw a random permutation of 8 distinct elements and select the first 4.
Take a shuffled deck of 8 distinct cards, deal 4 of them, discard the rest.
#! /usr/bin/env python
from pprint import pp
import random
import matplotlib.pyplot as plt
import pandas as pd
import typer
class Options:
    def __init__(self, all_options, k=4):
        self.all_options = all_options
        self.k = k

    def new_deck(self):
        deck = self.all_options.copy()
        random.shuffle(deck)
        return deck

    def choose_options(self):
        return self.new_deck()[: self.k]

    def choose_many_options(self, n):
        for _ in range(n):
            yield "".join(self.choose_options())


def main(n: int = 10_000_000):
    opt = Options(list("ABCDEFGH"))
    demo = list(opt.choose_many_options(3))
    pp(demo, width=22)
    df = pd.DataFrame(opt.choose_many_options(n), columns=["opt"])
    df["cnt"] = 1
    with pd.option_context("display.min_rows", 16):
        print(df.groupby("opt").sum())
    cnts = df.groupby("opt").sum().cnt.tolist()
    plt.plot(range(len(cnts)), cnts)
    plt.gca().set_xlim((0, 1700))
    plt.gca().set_ylim((0, None))
    plt.gca().set_xlabel("combination of options")
    plt.gca().set_ylabel("number of occurrences")
    plt.show()


if __name__ == "__main__":
    typer.run(main)
output:
['FABE',
'GEDC',
'FBAC']
cnt
opt
ABCD 6041
ABCE 5851
ABCF 6111
ABCG 5917
ABCH 6050
ABDC 5885
ABDE 5935
ABDF 5937
... ...
HGEC 5796
HGED 5922
HGEF 5859
HGFA 5936
HGFB 5880
HGFC 5869
HGFD 5942
HGFE 6049
[1680 rows x 1 columns]
P(n, k)
= P(8, 4) = n! / (n - k)!
= 40,320 / 24
= 1680
Every one of the 1680 possible ordered draws of options has come up.
Here is the number of occurrences of each distinct draw.
Note that 5952 occurrences × 1680 gets us to ~ 10 million.
The PRNG arranged matters
"such that every option is used the same amount of times, or close to it."
Having repeatedly rolled a many-sided die,
we see the anticipated mean and standard deviation
show up in the experimental results.
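To judge how even any particular subset is, whether drawn at random or picked by hand, one option is to tally letter and pair usage directly. This is a small sketch; usage_counts is a hypothetical helper, not part of the script above.

```python
from collections import Counter
from itertools import combinations


def usage_counts(selection):
    """Count how often each letter and each unordered pair occurs."""
    letters = Counter()
    pairs = Counter()
    for combo in selection:
        members = sorted(set(combo))  # order within a draw does not matter
        letters.update(members)
        pairs.update(combinations(members, 2))
    return letters, pairs


# Sanity check on the full set of 70 combinations, which is perfectly balanced.
all_combos = ["".join(c) for c in combinations("ABCDEFGH", 4)]
letters, pairs = usage_counts(all_combos)
print(set(letters.values()), set(pairs.values()))
```

On the full set every letter appears 35 times and every pair 15 times. For a smaller subset, the spread between the largest and smallest counts gives a direct measure of how uneven it is.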
I have a load of 3 hour MP3 files, and every ~15 minutes a distinct 1 second sound effect is played, which signals the beginning of a new chapter.
Is it possible to identify each time this sound effect is played, so I can note the time offsets?
The sound effect is similar every time, but because it's been encoded in a lossy file format, there will be a small amount of variation.
The time offsets will be stored in the ID3 Chapter Frame MetaData.
Example Source, where the sound effect plays twice.
ffmpeg -ss 0.9 -i source.mp3 -t 0.95 -acodec copy -y sample1.mp3
Sample 1 (Spectrogram)
ffmpeg -ss 4.5 -i source.mp3 -t 0.95 -acodec copy -y sample2.mp3
Sample 2 (Spectrogram)
I'm very new to audio processing, but my initial thought was to extract a sample of the 1 second sound effect, then use librosa in python to extract a floating point time series for both files, round the floating point numbers, and try to get a match.
import numpy
import librosa
print("Load files")
source_series, source_rate = librosa.load('source.mp3') # 3 hour file
sample_series, sample_rate = librosa.load('sample.mp3') # 1 second file
print("Round series")
source_series = numpy.around(source_series, decimals=5);
sample_series = numpy.around(sample_series, decimals=5);
print("Process series")
source_start = 0
sample_matching = 0
sample_length = len(sample_series)
for source_id, source_sample in enumerate(source_series):
    if source_sample == sample_series[sample_matching]:
        sample_matching += 1
        if sample_matching >= sample_length:
            print(float(source_start) / source_rate)
            sample_matching = 0
        elif sample_matching == 1:
            source_start = source_id
    else:
        sample_matching = 0
This does not work with the MP3 files above, but did with an MP4 version - where it was able to find the sample I extracted, but it was only that one sample (not all 12).
I should also note this script takes just over 1 minute to process the 3 hour file (which includes 237,426,624 samples). So I can imagine that some kind of averaging on every loop would cause this to take considerably longer.
Trying to directly match waveform samples in the time domain is not a good idea. The MP3 signal will preserve the perceptual properties, but it is quite likely that the phases of the frequency components will be shifted, so the sample values will not match.
You could try to match the volume envelopes of your effect and your sample.
This is less likely to be affected by the MP3 process.
First, normalise your sample so the embedded effects are at the same level as your reference effect. Construct new waveforms from the effect and the sample by using the average of the peak values over time frames that are just short enough to capture the relevant features. Better still, use overlapping frames. Then use cross-correlation in the time domain.
If this does not work, you could analyze each frame using an FFT, which gives you a feature vector for each frame. You then try to find matches of the sequence of features in your effect with the sample, similar to jonnor's suggestion (https://stackoverflow.com/users/1967571/jonnor). MFCC is used in speech recognition, but since you are not detecting speech, FFT is probably OK.
I am assuming the effect plays by itself (no background noise) and is added to the recording electronically (as opposed to being recorded via a microphone). If this is not the case, the problem becomes more difficult.
This is an Audio Event Detection problem. If the sound is always the same and there are no other sounds at the same time, it can probably be solved with a Template Matching approach, at least if there are no similar-sounding sounds that carry other meanings.
The simplest kind of template matching is to compute the cross-correlation between your input signal and the template.
Cut out an example of the sound to detect (using Audacity). Take as much as possible, but avoid the start and end. Store this as a .wav file.
Load the .wav template using librosa.load().
Chop up the input file into a series of overlapping frames. The length should be the same as your template. This can be done with librosa.util.frame.
Iterate over the frames, and compute the cross-correlation between frame and template using numpy.correlate.
High values of cross-correlation indicate a good match. A threshold can be applied in order to decide what is an event or not, and the frame number can be used to calculate the time of the event.
You should probably prepare some shorter test files which have both some examples of the sound to detect as well as other typical sounds.
If the volume of the recordings is inconsistent you'll want to normalize that before running detection.
If cross-correlation in the time-domain does not work, you can compute the melspectrogram or MFCC features and cross-correlate that. If this does not yield OK results either, a machine learning model can be trained using supervised learning, but this requires labeling a bunch of data as event/not-event.
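The template-matching steps above can be sketched with NumPy alone. This is an illustration on synthetic data, not a tested solution for the MP3 files in the question: the helper names match_scores and detect_events, the chirp used as a stand-in template, and the 0.7 threshold are all assumptions. The correlation is normalized per window so that volume differences do not change the score.

```python
import numpy as np


def match_scores(signal, template):
    """Normalized cross-correlation of `template` at every offset in `signal`.

    Scores lie roughly in [-1, 1]; values near 1 mean the window has the
    same shape as the template, regardless of volume.
    """
    signal = np.asarray(signal, dtype=float)
    template = np.asarray(template, dtype=float)
    n = len(template)
    t = template - template.mean()
    # Because t sums to zero, this equals correlating mean-removed windows.
    raw = np.correlate(signal, t, mode="valid")
    # Energy of each mean-removed window, computed from running sums.
    csum = np.concatenate(([0.0], np.cumsum(signal)))
    csum2 = np.concatenate(([0.0], np.cumsum(signal ** 2)))
    win_sum = csum[n:] - csum[:-n]
    win_sq = csum2[n:] - csum2[:-n]
    energy = np.maximum(win_sq - win_sum ** 2 / n, 0.0)
    denom = np.sqrt(energy) * np.linalg.norm(t)
    scores = np.zeros_like(raw)
    np.divide(raw, denom, out=scores, where=denom > 1e-12)
    return scores


def detect_events(signal, template, rate, threshold=0.7):
    """Return event times in seconds, one per template-length neighbourhood."""
    scores = match_scores(signal, template)
    n = len(template)
    events = []
    next_allowed = 0
    for idx in np.flatnonzero(scores >= threshold):
        if idx < next_allowed:
            continue
        # Take the best-scoring offset close to the first threshold crossing.
        best = int(idx) + int(np.argmax(scores[idx:idx + n]))
        events.append(best / rate)
        next_allowed = best + n
    return events


# Synthetic demo: a decaying chirp hidden twice in low-level noise.
rate = 1000
rng = np.random.default_rng(0)
t = np.arange(200) / rate
template = np.sin(2 * np.pi * (50 + 400 * t) * t) * np.exp(-8 * t)
signal = 0.01 * rng.standard_normal(5000)
for start in (1000, 3500):
    signal[start:start + 200] += 0.5 * template
print(detect_events(signal, template, rate))
```

The demo should report events close to 1.0 s and 3.5 s. For a 3 hour file the same scoring would need to run in chunks, and the threshold would have to be tuned on short test files as suggested above.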
To follow up on the answers by @jonnor and @paul-john-leonard: they are both correct. By using frames (FFT) I was able to do Audio Event Detection.
I've written up the full source code at:
https://github.com/craigfrancis/audio-detect
Some notes though:
To create the templates, I used ffmpeg:
ffmpeg -ss 13.15 -i source.mp4 -t 0.8 -acodec copy -y templates/01.mp4;
I decided to use librosa.core.stft, but I needed to make my own implementation of this stft function for the 3 hour file I'm analysing, as it's far too big to keep in memory.
When using stft I tried using a hop_length of 64 at first, rather than the default (512), as I assumed that would give me more data to work with... the theory might be true, but 64 was far too detailed, and caused it to fail most of the time.
I still have no idea how to get cross-correlation between frame and template to work (via numpy.correlate)... instead I took the results per frame (the 1025 buckets, not 1024, which I believe relate to the Hz frequencies found) and did a very simple average difference check, then ensured that average was above a certain value (my test case worked at 0.15, the main files I'm using this on required 0.55 - presumably because the main files had been compressed quite a bit more):
hz_score = abs(source[0:1025,x] - template[2][0:1025,y])
hz_score = sum(hz_score)/float(len(hz_score))
When checking these scores, it's really useful to show them on a graph. I often used something like the following:
import matplotlib.pyplot as plt
plt.figure(figsize=(30, 5))
plt.axhline(y=hz_match_required_start, color='y')
while x < source_length:
    debug.append(hz_score)
    if x == mark_frame:
        plt.axvline(x=len(debug), ymin=0.1, ymax=1, color='r')
plt.plot(debug)
plt.show()
When you create the template, you need to trim off any leading silence (to avoid bad matching), and an extra ~5 frames (it seems that the compression / re-encoding process alters this)... likewise, remove the last 2 frames (I think the frames include a bit of data from their surroundings, where the last one in particular can be a bit off).
When you start finding a match, you might find it's ok for the first few frames, then it fails... you will probably need to try again a frame or two later. I found it easier having a process that supported multiple templates (slight variations on the sound), and would check their first testable (e.g. 6th) frame and if that matched, put them in a list of potential matches. Then, as it progressed on to the next frames of the source, it could compare it to the next frames of the template, until all frames in the template had been matched (or failed).
This might not be an answer, it's just where I got to before I started researching the answers by @jonnor and @paul-john-leonard.
I was looking at the Spectrograms you can get by using librosa stft and amplitude_to_db, and thinking that if I take the data that goes into the graphs, with a bit of rounding, I could potentially find the 1 second sound effect being played:
https://librosa.github.io/librosa/generated/librosa.display.specshow.html
The code I've written below kind of works; although it:
Does return quite a few false positives, which might be fixed by tweaking the parameters of what is considered a match.
I would need to replace the librosa functions with something that can parse, round, and do the match checks in one pass; as a 3 hour audio file causes python to run out of memory on a computer with 16GB of RAM after ~30 minutes before it even got to the rounding bit.
import sys
import numpy
import librosa
#--------------------------------------------------
if len(sys.argv) == 3:
    source_path = sys.argv[1]
    sample_path = sys.argv[2]
else:
    print('Missing source and sample files as arguments')
    sys.exit()
#--------------------------------------------------
print('Load files')
source_series, source_rate = librosa.load(source_path) # The 3 hour file
sample_series, sample_rate = librosa.load(sample_path) # The 1 second file
source_time_total = float(len(source_series) / source_rate);
#--------------------------------------------------
print('Parse Data')
source_data_raw = librosa.amplitude_to_db(abs(librosa.stft(source_series, hop_length=64)))
sample_data_raw = librosa.amplitude_to_db(abs(librosa.stft(sample_series, hop_length=64)))
sample_height = sample_data_raw.shape[0]
#--------------------------------------------------
print('Round Data') # Also switches X and Y indexes, so X becomes time.
def round_data(raw, height):
    length = raw.shape[1]
    data = []
    range_length = range(1, (length - 1))
    range_height = range(1, (height - 1))
    for x in range_length:
        x_data = []
        for y in range_height:
            # neighbours = []
            # for a in [(x - 1), x, (x + 1)]:
            #     for b in [(y - 1), y, (y + 1)]:
            #         neighbours.append(raw[b][a])
            #
            # neighbours = (sum(neighbours) / len(neighbours));
            #
            # x_data.append(round(((raw[y][x] + raw[y][x] + neighbours) / 3), 2))
            x_data.append(round(raw[y][x], 2))
        data.append(x_data)
    return data
source_data = round_data(source_data_raw, sample_height)
sample_data = round_data(sample_data_raw, sample_height)
#--------------------------------------------------
sample_data = sample_data[50:268] # Temp: Crop the sample_data (318 to 218)
#--------------------------------------------------
source_length = len(source_data)
sample_length = len(sample_data)
sample_height -= 2;
source_timing = float(source_time_total / source_length);
#--------------------------------------------------
print('Process series')
hz_diff_match = 18 # For every comparison, how much of a difference is still considered a match - With the Source, using Sample 2, the maximum diff was 66.06, with an average of ~9.9
hz_match_required_switch = 30 # After matching "start" for X, drop to the lower "end" requirement
hz_match_required_start = 850 # Out of a maximum match value of 1023
hz_match_required_end = 650
hz_match_required = hz_match_required_start
source_start = 0
sample_matched = 0
x = 0;
while x < source_length:
    hz_matched = 0
    for y in range(0, sample_height):
        diff = source_data[x][y] - sample_data[sample_matched][y]
        if diff < 0:
            diff = 0 - diff
        if diff < hz_diff_match:
            hz_matched += 1
    # print(' {} Matches - {} # {}'.format(sample_matched, hz_matched, (x * source_timing)))
    if hz_matched >= hz_match_required:
        sample_matched += 1
        if sample_matched >= sample_length:
            print(' Found # {}'.format(source_start * source_timing))
            sample_matched = 0 # Prep for next match
            hz_match_required = hz_match_required_start
        elif sample_matched == 1: # First match, record where we started
            source_start = x
        if sample_matched > hz_match_required_switch:
            hz_match_required = hz_match_required_end # Go to a weaker match requirement
    elif sample_matched > 0:
        # print(' Reset {} / {} # {}'.format(sample_matched, hz_matched, (source_start * source_timing)))
        x = source_start # Matched something, so try again with x+1
        sample_matched = 0 # Prep for next match
        hz_match_required = hz_match_required_start
    x += 1
#--------------------------------------------------
I have a list of ids called users and want to split them randomly into two groups in an 80:20 ratio.
For example, I have a list of 100 user ids and want to randomly put 80 users into group1 and the remaining 20 into group2.
def getLevelForIncrementality(Object[] args) {
    try {
        if (args.length >= 1 && args[0]!="") {
            String seed = args[0] + "Testing";
            int rnd = Math.abs(seed.hashCode() % 100);
            return (rnd >= 80 ? 2 : 1);
        }
    } catch (Exception e) { }
    return 3;
}
I tried the Groovy code above, which gives me a ratio of about 82:18.
Can someone give me some insights, suggestions, or algorithms that can solve this problem for millions of user ids?
You can use random.sample to randomly extract the needed number of elements:
import random
a = list(range(1000))
b = random.sample(a, int(len(a) * 0.8))
len(b)
800
If you have unique IDs, you can convert both lists to sets and take the difference to get the remaining 20%:
c = list(set(a) - set(b))
It can also be done using train_test_split from sklearn:
import numpy as np
from sklearn.model_selection import train_test_split
X = list(np.arange(1000))
x_80_percent, x_20_percent = train_test_split(X, test_size =.20, shuffle = True)
In order to distribute data "on the fly" without creating large lists, you can use a small control list that will tell you how to part users into the two groups (by chunks of 5).
spread = []
while getNextUser():
    if not spread:
        spread = [1,1,1,1,0] # number of 1s and 0s is 4 vs 1 (80%)
        random.shuffle(spread)
    if spread.pop():
        pass # place on 80% side
    else:
        pass # place on 20% side
This will ensure a perfect 80:20 split every fifth user through with a maximum imbalance of 4. As more users are processed this imbalance will become less and less significant.
Worst cases:
19.2% instead of 20% after 99 users, corrects to perfect 20% at 100
19.9% after 999 users, corrects to perfect 20% at 1000
19.99% after 9999 users, corrects to perfect 20% at 10000
Note: you can change the number of 1s and 0s in the spread list to get a different proportion. e.g. [1,1,0] will give you 2 vs 1; [1,1,1,0] is 3 vs 1 (75:25); [1]*13+[0]*7 is 13 vs 7 (65:35)
You can generalize this into a generator that will do the proper calculations and initializations for you:
import random
from math import gcd
def spreadRatio(a,b):
    d = gcd(a,b)
    base = [True]*(a//d)+[False]*(b//d)
    spread = []
    while True:
        if not spread:
            spread = base.copy()
            random.shuffle(spread)
        yield spread.pop()

pareto = spreadRatio(80,20)
while getNextUser():
    if next(pareto):
        pass # place on 80% side
    else:
        pass # place on 20% side
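As a quick check of the generator idea, here is a self-contained sketch that runs it over an in-memory range of IDs. The range(1000) is a hypothetical stand-in for the getNextUser() stream, and the generator is restated under the name spread_ratio so the block runs on its own.

```python
import random
from math import gcd


def spread_ratio(a, b):
    """Yield True/False in shuffled blocks that hold an exact a:b ratio."""
    d = gcd(a, b)
    base = [True] * (a // d) + [False] * (b // d)
    spread = []
    while True:
        if not spread:
            # Start a fresh shuffled block once the previous one is used up.
            spread = base.copy()
            random.shuffle(spread)
        yield spread.pop()


user_ids = range(1000)  # hypothetical stand-in for the getNextUser() stream
picker = spread_ratio(80, 20)
group1, group2 = [], []
for uid in user_ids:
    (group1 if next(picker) else group2).append(uid)
print(len(group1), len(group2))
```

Because 1000 is a multiple of the block size (5), the split comes out at exactly 800 and 200 on every run, while which IDs land in each group changes from run to run.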
This also works for splitting a list:
A = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] ## Sample List
l = (len(A)/10) *8 ## making 80 %
B = A[:int(l)] ## Getting 80% of list
C = A[int(l):] ## Getting remaining list
Given the following (arbitrary) lap times:
John: 47.20
Mark: 51.14
Shellie: 49.95
Scott: 48.80
Jack: 46.60
Cheryl: 52.70
Martin: 57.65
Karl: 55.45
Yong: 52.30
Lynetta: 59.90
Sueann: 49.24
Tempie: 47.88
Mack: 51.11
Kecia: 53.20
Jayson: 48.90
Sanjuanita: 45.90
Rosita: 54.43
Lyndia: 52.38
Deloris: 49.90
Sophie: 44.31
Fleta: 58.12
Tai: 61.23
Cassaundra: 49.38
Oren: 48.39
We're doing a go-kart endurance race, and the idea, rather than allowing team picking is to write a tool to process the initial qualifying times and then spit out the closest-matched groupings.
My initial investigation makes me feel like this is a clique graphing type situation, but having never played with graphing algorithms I feel rather out of my depth.
What would be the fastest/simplest method of generating groups of 3 people with the closest overall average lap time, so as to remove overall advantage/difference between them?
Is this something I can use networkx to achieve, and if so, how would I best define the graph given the dataset above?
When you're faced with a problem like this, one approach is always to leverage randomness.
While other folks say they think X or Y should work, I know my algorithm will converge to at least a local optimum. If you can show that any state can be reached from any other via pairwise swapping (a property that is true for, say, the Travelling Salesperson Problem), then the algorithm will find the global optimum (given time).
Further, the algorithm attempts to minimize the standard deviation of the average times across the groups, so it provides a natural metric of how good an answer you're getting: Even if the result is non-exact, getting a standard deviation of 0.058 is probably more than close enough for your purposes.
Put another way: there may be an exact solution, but a randomized solution is usually easy to imagine, doesn't take long to code, can converge nicely, and is able to produce acceptable answers.
#!/usr/bin/env python3
import numpy as np
import copy
import random
data = [
(47.20,"John"),
(51.14,"Mark"),
(49.95,"Shellie"),
(48.80,"Scott"),
(46.60,"Jack"),
(52.70,"Cheryl"),
(57.65,"Martin"),
(55.45,"Karl"),
(52.30,"Yong"),
(59.90,"Lynetta"),
(49.24,"Sueann"),
(47.88,"Tempie"),
(51.11,"Mack"),
(53.20,"Kecia"),
(48.90,"Jayson"),
(45.90,"Sanjuanita"),
(54.43,"Rosita"),
(52.38,"Lyndia"),
(49.90,"Deloris"),
(44.31,"Sophie"),
(58.12,"Fleta"),
(61.23,"Tai"),
(49.38 ,"Cassaundra"),
(48.39,"Oren")
]
#Divide into initial groupings
NUM_GROUPS = 8
groups = []
for x in range(NUM_GROUPS): #Number of groups desired
    groups.append(data[x*len(data)//NUM_GROUPS:(x+1)*len(data)//NUM_GROUPS])
#Ensure all groups have the same number of members
assert all(len(groups[0])==len(x) for x in groups)
#Get average time of a single group
def FitnessGroup(group):
    return np.average([x[0] for x in group])
#Get standard deviation of all groups' average times
def Fitness(groups):
    avgtimes = [FitnessGroup(x) for x in groups] #Get all average times
    return np.std(avgtimes) #Return standard deviation of average times
#Initially, the best grouping is just the data
bestgroups = copy.deepcopy(groups)
bestfitness = Fitness(groups)
#Generate mutations of the best grouping by swapping two randomly chosen members
#between their groups
for x in range(10000): #Run a large number of times
    groups = copy.deepcopy(bestgroups) #Always start from the best grouping
    g1 = random.randint(0,len(groups)-1) #Choose a random group A
    g2 = random.randint(0,len(groups)-1) #Choose a random group B
    m1 = random.randint(0,len(groups[g1])-1) #Choose a random member from group A
    m2 = random.randint(0,len(groups[g2])-1) #Choose a random member from group B
    groups[g1][m1], groups[g2][m2] = groups[g2][m2], groups[g1][m1] #Swap 'em
    fitness = Fitness(groups) #Calculate fitness of new grouping
    if fitness<bestfitness: #Is it a better fitness?
        bestfitness = fitness #Save fitness
        bestgroups = copy.deepcopy(groups) #Save grouping
#Print the results
for g in bestgroups:
    for m in g:
        print("{0:15}".format(m[1]), end='')
    print("{0:15.3f}".format(FitnessGroup(g)), end='')
    print("")
print("Standard deviation of teams: {0:.3f}".format(bestfitness))
Running this a couple of times gives a standard deviation of 0.058:
Cheryl Kecia Oren 51.430
Tempie Mark Karl 51.490
Fleta Deloris Jack 51.540
Lynetta Scott Sanjuanita 51.533
Mack Rosita Sueann 51.593
Shellie Lyndia Yong 51.543
Jayson Sophie Tai 51.480
Martin Cassaundra John 51.410
Standard deviation of teams: 0.058
If I understand correctly, just sort the list of times and group the first three, next three, up through the top three.
EDIT: I didn't understand correctly
So, the idea is to take the N people and group them into N/3 teams, making the average times of the N/3 teams [rather than of the 3 people within each team, as I mistakenly interpreted] as close as possible. In this case, I think you could still start by sorting the N drivers in decreasing order of times. Then, initialize an empty list of N/3 teams. Then for each driver in decreasing order of lap time, assign them to the team with the smallest current total lap time (or one of these teams, in case of ties). This is a variant of a simple bin packing algorithm.
Here is a simple Python implementation:
times = [47.20, 51.14, 49.95, 48.80, 46.60, 52.70, 57.65, 55.45, 52.30, 59.90, 49.24, 47.88, 51.11, 53.20, 48.90, 45.90, 54.43, 52.38, 49.90, 44.31, 58.12, 61.23, 49.38, 48.39]
Nteams = len(times) // 3  # integer division: a list length must be an int in Python 3
team_times = [0] * Nteams
team_members = [[]] * Nteams
times = sorted(times,reverse=True)
for m in range(len(times)):
    i = team_times.index(min(team_times))
    team_times[i] += times[m]
    team_members[i] = team_members[i] + [m]
for i in range(len(team_times)):
    print(str(team_members[i]) + ": avg time " + str(round(team_times[i]/3,3)))
whose output is
[0, 15, 23]: avg time 51.593
[1, 14, 22]: avg time 51.727
[2, 13, 21]: avg time 51.54
[3, 12, 20]: avg time 51.6
[4, 11, 19]: avg time 51.48
[5, 10, 18]: avg time 51.32
[6, 9, 17]: avg time 51.433
[7, 8, 16]: avg time 51.327
(Note that the team members numbers refer to them in descending order of lap time, starting from 0, rather than to their original ordering).
One issue with this is that if the times varied dramatically, there is no hard restriction to make the number of players on each team exactly 3. However, for your purposes, maybe that's OK if it makes the relay close, and it's probably a rare occurrence when the spread in times is much less than the average time.
EDIT
If you do just want 3 players on each team, in all cases, then the code can be trivially modified to at each step find the team with the least total lap time that doesn't already have three assigned players. This requires a small modification in the main code block:
times = sorted(times,reverse=True)
for m in range(len(times)):
    idx = -1
    for i in range(Nteams):
        if len(team_members[i]) < 3:
            if (idx == -1) or (team_times[i] < team_times[idx]):
                idx = i
    team_times[idx] += times[m]
    team_members[idx] = team_members[idx] + [m]
For the example problem in the question, the above solution is of course identical, because it did not try to fit more or less than 3 players per team.
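Since the output above refers to drivers by their position in the sorted list, a variant that carries the names along may be easier to use. This is a sketch of the same size-capped greedy assignment; the function name balanced_teams is an illustrative choice, and it assumes the number of drivers is a multiple of the team size.

```python
laps = {
    "John": 47.20, "Mark": 51.14, "Shellie": 49.95, "Scott": 48.80,
    "Jack": 46.60, "Cheryl": 52.70, "Martin": 57.65, "Karl": 55.45,
    "Yong": 52.30, "Lynetta": 59.90, "Sueann": 49.24, "Tempie": 47.88,
    "Mack": 51.11, "Kecia": 53.20, "Jayson": 48.90, "Sanjuanita": 45.90,
    "Rosita": 54.43, "Lyndia": 52.38, "Deloris": 49.90, "Sophie": 44.31,
    "Fleta": 58.12, "Tai": 61.23, "Cassaundra": 49.38, "Oren": 48.39,
}


def balanced_teams(laps, team_size=3):
    """Greedy assignment: slowest driver first, to the lightest open team."""
    n_teams = len(laps) // team_size
    teams = [[] for _ in range(n_teams)]
    totals = [0.0] * n_teams
    for name, lap in sorted(laps.items(), key=lambda kv: kv[1], reverse=True):
        # Only teams that still have a free seat are candidates.
        open_teams = [i for i in range(n_teams) if len(teams[i]) < team_size]
        i = min(open_teams, key=lambda j: totals[j])
        teams[i].append(name)
        totals[i] += lap
    return teams, [total / team_size for total in totals]


teams, averages = balanced_teams(laps)
for members, avg in zip(teams, averages):
    print(", ".join(members), "avg", round(avg, 3))
```

For this data the team averages are the same as in the index-based output above, with names in place of positions.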
The following algorithm appears to work pretty well. It takes the fastest and slowest people remaining and then finds the person in the middle so that the group average is closest to the global average. Since the extreme values are being used up first, the averages at the end shouldn't be that far off despite the limited selection pool.
from bisect import bisect
times = sorted([47.20, 51.14, 49.95, 48.80, 46.60, 52.70, 57.65, 55.45, 52.30, 59.90, 49.24, 47.88, 51.11, 53.20, 48.90, 45.90, 54.43, 52.38, 49.90, 44.31, 58.12, 61.23, 49.38, 48.39])
average = lambda c: sum(c)/len(c)
groups = []
average_time = average(times)
while times:
    group = [times.pop(0), times.pop()]
    # target value for the third person for best average
    target = average_time * 3 - sum(group)
    index = min(bisect(times, target), len(times) - 1)
    # adjust if the left value is better than the right
    if index and abs(target - times[index-1]) < abs(target - times[index]):
        index -= 1
    group.append(times.pop(index))
    groups.append(group)
# [44.31, 61.23, 48.9]
# [45.9, 59.9, 48.8]
# [46.6, 58.12, 49.9]
# [47.2, 57.65, 49.38]
# [47.88, 55.45, 51.14]
# [48.39, 54.43, 51.11]
# [49.24, 53.2, 52.3]
# [49.95, 52.7, 52.38]
The sorting and the iterated binary search are both O(n log n), so the total complexity is O(n log n) if you ignore the O(n) cost of each list.pop from the front or middle of a Python list. Unfortunately, expanding this to larger groups might be tough.
The simplest would probably be to just create 3 buckets--a fast bucket, a medium bucket, and a slow bucket--and assign entries to the buckets by their qualifying times.
Then team together the slowest of the slow, the fastest of the fast, and the median or mean of the mediums. (Not sure whether median or mean is the best choice off the top of my head.) Repeat until you're out of entries.
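A rough sketch of the bucket idea, under a few assumptions that the description leaves open: the buckets are equal thirds of the sorted times, the "median of the mediums" is the middle element of what is left in the medium bucket, and the number of drivers is a multiple of 3. The name bucket_teams is illustrative.

```python
times = [47.20, 51.14, 49.95, 48.80, 46.60, 52.70, 57.65, 55.45,
         52.30, 59.90, 49.24, 47.88, 51.11, 53.20, 48.90, 45.90,
         54.43, 52.38, 49.90, 44.31, 58.12, 61.23, 49.38, 48.39]


def bucket_teams(times):
    """Split sorted lap times into fast/medium/slow thirds and team them up."""
    ordered = sorted(times)
    third = len(ordered) // 3
    fast = ordered[:third]
    medium = ordered[third:2 * third]
    slow = ordered[2 * third:]
    teams = []
    while fast:
        middle = medium.pop(len(medium) // 2)  # median of the remaining mediums
        # Fastest of the fast (lowest time) with slowest of the slow (highest).
        teams.append((fast.pop(0), middle, slow.pop()))
    return teams


for team in bucket_teams(times):
    print(team, "avg", round(sum(team) / 3, 3))
```

Printing the averages makes it easy to compare the spread against the other approaches in this thread.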
I'm currently having a little issue with a fits file. The data is in table format, a format I haven't previously used. I'm a Python user, and rely heavily on astropy.io.fits to manipulate fits images. A quick output of the info gives:
No.    Name       Type           Cards   Dimensions    Format
0      PRIMARY    PrimaryHDU        60   ()
1                 BinTableHDU       29   3072R x 2C    [1024E, 1024E]
The header for the BinTableHDU is as follows:
XTENSION= 'BINTABLE' /Written by IDL: Mon Jun 22 23:28:21 2015
BITPIX = 8 /
NAXIS = 2 /Binary table
NAXIS1 = 8192 /Number of bytes per row
NAXIS2 = 3072 /Number of rows
PCOUNT = 0 /Random parameter count
GCOUNT = 1 /Group count
TFIELDS = 2 /Number of columns
TFORM1 = '1024E ' /Real*4 (floating point)
TFORM2 = '1024E ' /Real*4 (floating point)
TTYPE1 = 'COUNT_RATE' /
TUNIT1 = '1e-6cts/s/arcmin^2' /
TTYPE2 = 'UNCERTAINTY' /
TUNIT2 = '1e-6cts/s/arcmin^2' /
HISTORY g000m90r1b120pm.fits created on 10/08/97. PI channel range: 8: 19
PIXTYPE = 'HEALPIX ' / HEALPIX pixelisation
ORDERING= 'NESTED ' / Pixel ordering scheme, either RING or NESTED
NSIDE = 512 / Healpix resolution parameter
NPIX = 3145728 / Total number of pixels
OBJECT = 'FULLSKY ' / Sky coverage, either FULLSKY or PARTIAL
FIRSTPIX= 0 / First pixel # (0 based)
LASTPIX = 3145727 / Last pixel # (zero based)
INDXSCHM= 'IMPLICIT' / indexing : IMPLICIT or EXPLICIT
GRAIN = 0 / GRAIN = 0: No index,
COMMENT GRAIN =1: 1 pixel index for each pixel,
COMMENT GRAIN >1: 1 pixel index for Grain consecutive pixels
BAD_DATA= -1.63750E+30 / Sentinel value given to bad pixels
COORDSYS= 'G ' / Pixelization coordinate system
COMMENT G = Galactic, E = ecliptic, C = celestial = equatorial
END
I'd like to access the fits image which is stored within the TTYPE labeled 'COUNT_RATE', and then have this in a format that I can add to other count-rate arrays with the same dimensions.
I started with my usual procedure for opening a fits file:
hdulist_RASS_SXRB_R1 = fits.open('/Users/.../RASS_SXRB_R1.fits')
hdulist_RASS_SXRB_R1.info()
image_XRAY_SKYVIEW_R1 = hdulist_RASS_SXRB_R1[1].data
image_XRAY_SKYVIEW_R1 = numpy.array(image_XRAY_SKYVIEW_R1)
image_XRAY_SKYVIEW_header_R1 = hdulist_RASS_SXRB_R1[1].header
But this is coming back with IndexError: too many indices for array. I've had a look at accessing table data in the astropy documentation here (Accessing data stored as a table in a multi-extension FITS (MEF) file)
If anyone has a tried and tested method for accessing such images from a fits table I'd be very grateful! Many thanks.
I can't be sure without seeing the full traceback but I think the exception you're getting is from this:
image_XRAY_SKYVIEW_R1 = numpy.array(image_XRAY_SKYVIEW_R1)
There's no reason to manually wrap numpy.array() around the array. It's already a Numpy array. But in this case it's a structured array (see http://docs.scipy.org/doc/numpy/user/basics.rec.html).
@Andromedae93's answer is the right one. For general documentation on this, see also: http://docs.astropy.org/en/stable/io/fits/index.html#working-with-table-data
However, the way you're working (which is fine for images) of manually calling fits.open, accessing the .data attribute of the HDU, etc. is fairly low level, and Numpy structured arrays are good at representing tables, but not great for manipulating them.
You're better off generally using Astropy's higher-level Table interface. A FITS table can be read directly into an Astropy Table object with Table.read(): http://docs.astropy.org/en/stable/io/unified.html#fits
The only reason the same thing doesn't exist for FITS images is that there's no generic "Image" class yet.
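To illustrate what selecting the column by name gives you, here is a NumPy-only sketch on a tiny synthetic structured array. It is a stand-in for hdulist[1].data, with 3 rows of 4 floats in place of the real 3072 rows of 1024 floats (3072 x 1024 = 3145728, which matches NPIX in the header). Flattening row by row assumes that consecutive rows hold consecutive pixels, which is what INDXSCHM = 'IMPLICIT' suggests but is not verified here.

```python
import numpy as np

BAD_DATA = -1.63750e30  # sentinel value from the header above

# Tiny synthetic stand-in for the binary table: each cell holds 4 floats.
dtype = [("COUNT_RATE", "f4", (4,)), ("UNCERTAINTY", "f4", (4,))]
table = np.zeros(3, dtype=dtype)
table["COUNT_RATE"] = np.arange(12, dtype="f4").reshape(3, 4)
table["COUNT_RATE"][1, 2] = BAD_DATA  # mark one pixel as bad

# Select the column by name, then flatten the rows into one pixel array.
count_rate = np.asarray(table["COUNT_RATE"], dtype=float).ravel()
# Replace the bad-pixel sentinel with NaN so it cannot leak into sums.
count_rate[np.isclose(count_rate, BAD_DATA)] = np.nan
print(count_rate.shape, np.nansum(count_rate))
```

Arrays prepared this way from several files with the same dimensions can be added element by element, with NaN marking pixels that are bad in any of the inputs.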
I used astropy.io.fits during my internship in astrophysics, and this is my process for opening a .fits file and doing some operations:
# Opening the .fits file which is named SMASH.fits
field = fits.open('SMASH.fits')
# Data fits reading
tbdata = field[1].data
Now, with this kind of method, tbdata is a numpy.array and you can do lots of things.
For example, if you have data like :
ID, Name, Object
1, HD 1527, Star
2, HD 7836, Star
3, NGC 6739, Galaxy
If you want to print the data from one column:
Data_name = tbdata['Name']
You will get :
HD 1527
HD 7836
NGC 6739
I don't know exactly what you want to do with your data, but I can help you ;)