I have a Python project where I have to make sure that there are no repeated songs across different folders, even if they have different extensions (I would prefer to convert all m4a or wav files into mp3, but it's not a must).
I don't need to rebuild Shazam or any other machine-learning algorithm for this project, since I already have the files.
Things I have tried:
compare_mp3 is a library I've found that should be able to do this job for me, but it needs the files in mp3 format, so I tried to convert my files using pydub and kept getting FileNotFoundError even though I've installed ffmpeg.
First you need to decode them into PCM and ensure they have a specific sample rate, which you can choose beforehand (e.g. 16 kHz). You'll need to resample songs that have a different sample rate. A high sample rate is not required since you need a fuzzy comparison anyway, but too low a sample rate will lose too much detail.
You can use the following commands for that (the -ar 16000 flag resamples to 16 kHz; use whatever rate you settled on):
ffmpeg -i audio1.mkv -ar 16000 -c:a pcm_s24le output1.wav
ffmpeg -i audio2.mkv -ar 16000 -c:a pcm_s24le output2.wav
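If you also want the optional mp3 conversion from the question, here is a minimal pydub sketch (assuming ffmpeg is installed; a FileNotFoundError from pydub usually means it cannot find the ffmpeg executable on PATH; file names and the path below are hypothetical):

from pydub import AudioSegment

# point pydub at ffmpeg explicitly if it is not on PATH (hypothetical path)
# AudioSegment.converter = r"C:\ffmpeg\bin\ffmpeg.exe"

song = AudioSegment.from_file("some_song.m4a", format="m4a")  # hypothetical file
song.export("some_song.mp3", format="mp3")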
And below there's code to get a number from 0 to 100 for the similarity of two audio files using Python. It works by generating fingerprints from the audio files and comparing them using cross-correlation.
It requires Chromaprint and FFmpeg to be installed, and it doesn't work for short audio files. If that is a problem, you can always reduce the speed of the audio as in this guide; be aware this is going to add a little noise.
# correlation.py
import subprocess
import numpy

# seconds to sample audio file for
sample_time = 500
# number of points to scan cross correlation over
span = 150
# step size (in points) of cross correlation
step = 1
# minimum number of points that must overlap in cross correlation
# exception is raised if this cannot be met
min_overlap = 20
# report match when cross correlation has a peak exceeding threshold
threshold = 0.5

# calculate fingerprint
def calculate_fingerprints(filename):
    fpcalc_out = subprocess.getoutput('fpcalc -raw -length %i %s' % (sample_time, filename))
    fingerprint_index = fpcalc_out.find('FINGERPRINT=') + 12
    # convert fingerprint to list of integers
    fingerprints = list(map(int, fpcalc_out[fingerprint_index:].split(',')))
    return fingerprints

# returns correlation between lists
def correlation(listx, listy):
    if len(listx) == 0 or len(listy) == 0:
        # Error checking in main program should prevent us from ever being
        # able to get here.
        raise Exception('Empty lists cannot be correlated.')
    if len(listx) > len(listy):
        listx = listx[:len(listy)]
    elif len(listx) < len(listy):
        listy = listy[:len(listx)]

    covariance = 0
    for i in range(len(listx)):
        covariance += 32 - bin(listx[i] ^ listy[i]).count("1")
    covariance = covariance / float(len(listx))
    return covariance / 32

# return cross correlation, with listy offset from listx
def cross_correlation(listx, listy, offset):
    if offset > 0:
        listx = listx[offset:]
        listy = listy[:len(listx)]
    elif offset < 0:
        offset = -offset
        listy = listy[offset:]
        listx = listx[:len(listy)]
    if min(len(listx), len(listy)) < min_overlap:
        # Error checking in main program should prevent us from ever being
        # able to get here.
        return
        #raise Exception('Overlap too small: %i' % min(len(listx), len(listy)))
    return correlation(listx, listy)

# cross correlate listx and listy with offsets from -span to span
def compare(listx, listy, span, step):
    if span > min(len(listx), len(listy)):
        # Error checking in main program should prevent us from ever being
        # able to get here.
        raise Exception('span >= sample size: %i >= %i\n' % (span, min(len(listx), len(listy)))
                        + 'Reduce span, reduce crop or increase sample_time.')
    corr_xy = []
    for offset in numpy.arange(-span, span + 1, step):
        corr_xy.append(cross_correlation(listx, listy, offset))
    return corr_xy

# return index of maximum value in list
def max_index(listx):
    max_index = 0
    max_value = listx[0]
    for i, value in enumerate(listx):
        if value > max_value:
            max_value = value
            max_index = i
    return max_index

def get_max_corr(corr, source, target):
    max_corr_index = max_index(corr)
    max_corr_offset = -span + max_corr_index * step
    print("max_corr_index = ", max_corr_index, "max_corr_offset = ", max_corr_offset)
    # report matches
    if corr[max_corr_index] > threshold:
        print('%s and %s match with correlation of %.4f at offset %i' % (source, target, corr[max_corr_index], max_corr_offset))

def correlate(source, target):
    fingerprint_source = calculate_fingerprints(source)
    fingerprint_target = calculate_fingerprints(target)
    corr = compare(fingerprint_source, fingerprint_target, span, step)
    max_corr_offset = get_max_corr(corr, source, target)

if __name__ == "__main__":
    correlate(SOURCE_FILE, TARGET_FILE)
Code converted into python 3 from: https://shivama205.medium.com/audio-signals-comparison-23e431ed2207
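To apply this to the original task of finding duplicates across folders, a minimal sketch could walk the folders and compare every pair (the root folder name is hypothetical; it assumes the code above is saved as correlation.py, and that file paths contain no spaces since they are passed to fpcalc unquoted):

import itertools
import os

from correlation import correlate  # the script above, saved as correlation.py

AUDIO_EXTENSIONS = {".mp3", ".m4a", ".wav"}

def collect_audio_files(root):
    # walk all sub-folders and yield anything that looks like an audio file
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in AUDIO_EXTENSIONS:
                yield os.path.join(dirpath, name)

if __name__ == "__main__":
    files = list(collect_audio_files("music"))  # hypothetical root folder
    for source, target in itertools.combinations(files, 2):
        correlate(source, target)  # prints a line when the pair exceeds the threshold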
Related
I have a lot of these small .mp3 files, and what I want to do here is check whether two audio clips speak the same letter of the alphabet.
For example:
if audio_is_same("file1.mp3", "file2.mp3"):
    print("Same")
else:
    print("Different")
And here are some Audio Samples (some folders are empty).
Since these audio clips are almost the same, I think it should be possible to do this in a simple way.
Or would training an audio-recognition model be simpler?
Audio files are just binary data when you open them, so you can simply compare the files after reading them in.
def compare_audio(file1, file2):
    is_same = open(file1, "rb").read() == open(file2, "rb").read()
    if is_same:
        print('Same')
    else:
        print('Different')
If you have large files, compare in chunks as mentioned in the link below.
https://www.quora.com/How-do-I-compare-two-binary-files-in-Python
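For example, a chunked comparison might look roughly like this (the chunk size is an arbitrary choice):

def files_identical(file1, file2, chunk_size=64 * 1024):
    # read both files in fixed-size chunks so large files never sit fully in memory
    with open(file1, "rb") as f1, open(file2, "rb") as f2:
        while True:
            chunk1 = f1.read(chunk_size)
            chunk2 = f2.read(chunk_size)
            if chunk1 != chunk2:
                return False
            if not chunk1:  # both files exhausted at the same point
                return True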
If you want some sort of similarity measure between the two, you can use a built-in similarity function or some kind of model:
from difflib import SequenceMatcher

threshold = 0.8

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def compare_audio(file1, file2):
    data1 = open(file1, "rb").read()
    data2 = open(file2, "rb").read()
    sim_ratio = similar(data1, data2)
    if sim_ratio > threshold:
        print('Same')
    else:
        print('Different')
You would need to decide what an appropriate threshold is.
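As a rough illustration of how the ratio behaves (hypothetical byte strings and file names):

from difflib import SequenceMatcher

# two byte strings differing in one position out of six
print(SequenceMatcher(None, b"abcdef", b"abcdxf").ratio())  # ~0.83, just above the 0.8 threshold

# compare_audio("clip_a.mp3", "clip_b.mp3")  # hypothetical files; prints Same or Different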
I don't know specifically what differences you are looking for, but the fingerprint-based approach from the first answer above (correlation.py, requiring Chromaprint and FFmpeg) applies here as well: it gives a 0 to 100 similarity score by cross-correlating the fingerprints of the two files. The same caveat about short audio files applies; slowing the audio down works around it at the cost of a little added noise.
I am trying to create a script to compare two audio samples and determine whether they are similar (factoring in things like background noise would mean a lower correlation, so a "pass" would be arbitrarily determined by some rule of thumb). I have a piece of code written, but it produces errors that I am not sure how to debug.
Specifically, the code raises the following error in the correlation function: ValueError: object of too small depth for the desired array.
I am not sure if this is because I am passing too few arguments to the function, but I really can't trace the problem. The code appears below and should be relatively complete.
# compare.py
import argparse
from numpy import correlate

def initialize():
    parser = argparse.ArgumentParser()
    parser.add_argument("-i ", "--source-file", help="source file")
    parser.add_argument("-o ", "--target-file", help="target file")
    args = parser.parse_args()

    SOURCE_FILE = args.source_file if args.source_file else None
    TARGET_FILE = args.target_file if args.target_file else None
    SOURCE_FILE = "Comparison1.wav"
    TARGET_FILE = "Comparison2.wav"

    if not SOURCE_FILE or not TARGET_FILE:
        raise Exception("Source or Target files not specified.")
    return SOURCE_FILE, TARGET_FILE

if __name__ == "__main__":
    SOURCE_FILE, TARGET_FILE = initialize()
    correlate(SOURCE_FILE, TARGET_FILE)
# correlation.py
import commands
import numpy

# seconds to sample audio file for
sample_time = 500
# number of points to scan cross correlation over
span = 150
# step size (in points) of cross correlation
step = 1
# minimum number of points that must overlap in cross correlation
# exception is raised if this cannot be met
min_overlap = 20
# report match when cross correlation has a peak exceeding threshold
threshold = 0.5

# calculate fingerprint
def calculate_fingerprints(filename):
    fpcalc_out = commands.getoutput('fpcalc -raw -length %i %s' % (sample_time, filename))
    fingerprint_index = fpcalc_out.find('FINGERPRINT=') + 12
    # convert fingerprint to list of integers
    fingerprints = map(int, fpcalc_out[fingerprint_index:].split(','))
    return fingerprints

# returns correlation between lists
def correlation(listx, listy):
    if len(listx) == 0 or len(listy) == 0:
        # Error checking in main program should prevent us from ever being
        # able to get here.
        raise Exception('Empty lists cannot be correlated.')
    if len(listx) > len(listy):
        listx = listx[:len(listy)]
    elif len(listx) < len(listy):
        listy = listy[:len(listx)]

    covariance = 0
    for i in range(len(listx)):
        covariance += 32 - bin(listx[i] ^ listy[i]).count("1")
    covariance = covariance / float(len(listx))
    return covariance / 32

# return cross correlation, with listy offset from listx
def cross_correlation(listx, listy, offset):
    if offset > 0:
        listx = listx[offset:]
        listy = listy[:len(listx)]
    elif offset < 0:
        offset = -offset
        listy = listy[offset:]
        listx = listx[:len(listy)]
    if min(len(listx), len(listy)) < min_overlap:
        # Error checking in main program should prevent us from ever being
        # able to get here.
        return
        #raise Exception('Overlap too small: %i' % min(len(listx), len(listy)))
    return correlation(listx, listy)

# cross correlate listx and listy with offsets from -span to span
def compare(listx, listy, span, step):
    if span > min(len(listx), len(listy)):
        # Error checking in main program should prevent us from ever being
        # able to get here.
        raise Exception('span >= sample size: %i >= %i\n' % (span, min(len(listx), len(listy))) +
                        'Reduce span, reduce crop or increase sample_time.')
    corr_xy = []
    for offset in numpy.arange(-span, span + 1, step):
        corr_xy.append(cross_correlation(listx, listy, offset))
    return corr_xy

# return index of maximum value in list
def max_index(listx):
    max_index = 0
    max_value = listx[0]
    for i, value in enumerate(listx):
        if value > max_value:
            max_value = value
            max_index = i
    return max_index

def get_max_corr(corr, source, target):
    max_corr_index = max_index(corr)
    max_corr_offset = -span + max_corr_index * step
    print("max_corr_index = ", max_corr_index, "max_corr_offset = ", max_corr_offset)
    # report matches
    if corr[max_corr_index] > threshold:
        print('%s and %s match with correlation of %.4f at offset %i' %
              (source, target, corr[max_corr_index], max_corr_offset))

def correlate(source, target):
    fingerprint_source = calculate_fingerprints(source)
    fingerprint_target = calculate_fingerprints(target)
    corr = compare(fingerprint_source, fingerprint_target, span, step)
    max_corr_offset = get_max_corr(corr, source, target)
You have a small error in your code, more specifically:
SOURCE_FILE = "Comparison1.wav"
TARGET_FILE = "Comparison2.wav"
Here you assign these variables to plain strings (not audio data), so from then on they are processed as the strings "Comparison1.wav" and "Comparison2.wav" respectively. In particular, because compare.py imports correlate from numpy rather than the fingerprint-based correlate in correlation.py, those two strings are passed straight to numpy.correlate, which expects numeric arrays; that is where the ValueError comes from.
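A minimal corrected sketch of compare.py, assuming the fingerprint-based correlate from correlation.py is what you actually want to call (and that correlation.py's commands.getoutput is swapped for subprocess.getoutput on Python 3):

# compare.py (sketch)
import argparse

from correlation import correlate  # the fingerprint-based correlate, not numpy's

def initialize():
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--source-file", help="source file", required=True)
    parser.add_argument("-o", "--target-file", help="target file", required=True)
    args = parser.parse_args()
    return args.source_file, args.target_file

if __name__ == "__main__":
    source_file, target_file = initialize()
    correlate(source_file, target_file)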
I'm trying to implement a filter with Python to sort out the points in a point cloud generated by Agisoft PhotoScan. PhotoScan is a photogrammetry package designed to be user friendly, but it also exposes Python commands through an API.
Below is my code so far, and I'm pretty sure there is a better way to write it, as I'm missing something. The code runs inside PhotoScan.
Objective:
Select and remove 10% of points at a time with error within a defined range of 50 to 10. Also remove any points within the error range amounting to less than 10% of the total, once the initial steps of selecting and removing 10% at a time are done. Immediately after every point removal, an optimization procedure should be done. It should stop when no points are selectable, or when the selectable points count is less than 1% of the current total and it is not worth removing them.
Actual code under construction (3 updates - see below for details):
import PhotoScan as PS
import math

doc = PS.app.document
chunk = doc.chunk

# float range generator; with i = 1 it steps 0.1 at a time
def precrange(a, b, i):
    if a < b:
        p = 10**i
        sr = a*p
        er = (b*p) + 1
        p = float(p)
        for n in range(sr, er):
            x = n/p
            yield x
    else:
        p = 10**i
        sr = b*p
        er = (a*p) + 1
        p = float(p)
        for n in range(sr, er):
            x = n/p
            yield x

"""
Determine if x is close to y:
x relates to the nselected variable, y to the p10 variable.
math.isclose() returns True if the values a and b are close to each other and
False otherwise.
var is the tolerance, set here as a relative tolerance:
rel_tol is the relative tolerance, i.e. the maximum allowed difference
between a and b, relative to the larger absolute value of a or b. For example,
to set a tolerance of 5%, pass rel_tol=0.05. The default tolerance is 1e-09,
which assures that the two values are the same within about 9 decimal digits.
rel_tol must be greater than zero.
"""
def test_isclose(x, y, var):
    if math.isclose(x, y, rel_tol=var):  # if the values are close, return True
        return True
    else:
        return False
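# For example, with rel_tol=0.1 a selection of 950 points counts as "close" to a
# 10% target of 1000 points (|950 - 1000| <= 0.1 * 1000), while 1200 does not
# (|1200 - 1000| > 0.1 * 1200):
#   math.isclose(950, 1000, rel_tol=0.1)   -> True
#   math.isclose(1200, 1000, rel_tol=0.1)  -> False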
# 1. define filter limits
f_ReconstUncert = precrange(50, 10, 1)

# 2. count initial point number
tiePoints_0 = len(chunk.point_cloud.points)  # storing info for later

# 3. call Filter() and init it
f = PS.PointCloud.Filter()
f.init(chunk, criterion=PS.PointCloud.Filter.ReconstructionUncertainty)

a = 0

"""
Way to restart a for loop:
should_restart = True
while should_restart:
    should_restart = False
    for i in xrange(10):
        print i
        if i == 5:
            should_restart = True
            break
"""
restartLoop = True
while restartLoop:
    restartLoop = False
    for count, i in enumerate(f_ReconstUncert):  # for each threshold value
        # count points for every i
        tiePoints = len(chunk.point_cloud.points)
        p10 = int(round((10 / 100) * tiePoints, 0))  # 10% of the total
        f.selectPoints(i)  # selects points
        nselected = len([p for p in chunk.point_cloud.points if p.selected])
        percent = round(nselected * 100 / tiePoints, 2)
        if nselected == 0:
            print("For threshold {} there's no selectable points".format(i))
            break
        elif test_isclose(nselected, p10, 0.1):
            a += 1
            print("Threshold found in iteration: ", count)
            print("----------------------------------------------")
            print("# {} Removing points from cloud ".format(a))
            print("----------------------------------------------")
            print("# {}. Reconstruction Uncertainty:"
                  " {:.2f}".format(a, i))
            print("{} - {}"
                  " ({:.1f} %)\n".format(tiePoints,
                                         nselected, percent))
            f.removePoints(i)  # removes points
            # optimization procedure needed to refine camera positions
            print("--------------Optimizing cameras-------------\n")
            chunk.optimizeCameras(fit_f=True, fit_cx=True,
                                  fit_cy=True, fit_b1=False,
                                  fit_b2=False, fit_k1=True,
                                  fit_k2=True, fit_k3=True,
                                  fit_k4=False, fit_p1=True,
                                  fit_p2=True, fit_p3=False,
                                  fit_p4=False, adaptive_fitting=False)
            # count again number of points in point cloud
            tiePoints = len(chunk.point_cloud.points)
            print("= {} remaining points after"
                  " {} removal".format(tiePoints, a))
            # reassigning variable to get new 10% of remaining points
            p10 = int(round((10 / 100) * tiePoints, 0))
            percent = round(nselected * 100 / tiePoints, 2)
            print("----------------------------------------------\n\n")
            # restart loop to investigate from range start
            restartLoop = True
            break
        else:
            f.resetSelection()
            continue  # continue to next i
    else:
        f.resetSelection()
        print("for loop didn't work out")
        print("{} iterations done!".format(count))

tiePoints = len(chunk.point_cloud.points)
print("Tiepoints 0: ", tiePoints_0)
print("Tiepoints 1: ", tiePoints)
Problems:
A. Currently I'm stuck in endless processing because of a loop. I know it's down to my bad coding, but how do I implement my objective and avoid the infinite loop? ANSWER: Made the code less confusing and updated it above.
B. How do I start over (or restart) my search for valid threshold values in the range(50, 20) after finding one of them? ANSWER: Stack Exchange: how to restart a for loop
C. How do I make the code more Pythonic?
IMPORTANT UPDATE 1: altered above
Using a better float range, with the solution adapted from Stack Overflow: how-to-use-a-decimal-range-step-value
# using float with range; by setting i = 1 it steps 0.1 at a time
def precrange(a, b, i):
    if a < b:
        p = 10**i
        sr = a*p
        er = (b*p) + 1
        p = float(p)
        return map(lambda x: x/p, range(sr, er))
    else:
        p = 10**i
        sr = b*p
        er = (a*p) + 1
        p = float(p)
        return map(lambda x: x/p, range(sr, er))
# some code
f_ReconstUncert = precrange(50, 20, 1)
And also using math.isclose() to determine whether the number of selected points is close to the 10% target, instead of a manual solution through assigning new variables. This was implemented as follows:
"""
Determine if x is close to y:
x relates to nselected variable
y to p10 variable
math.isclose() Return True if the values a and b are close to each other and
False otherwise
var is the tolerance here setted as a relative tolerance:
rel_tol is the relative tolerance – it is the maximum allowed difference
between a and b, relative to the larger absolute value of a or b. For example,
to set a tolerance of 5%, pass rel_tol=0.05. The default tolerance is 1e-09,
which assures that the two values are the same within about 9 decimal digits.
rel_tol must be greater than zero.
"""
def test_threshold(x, y, var):
if math.isclose(x, y, rel_tol=var): # if variables are close return True
return True
else:
False
# some code
if test_threshold(nselected, p10, 0.1):
# if true then a valid threshold is found
# some code
UPDATE 2: altered in the code under construction
Minor fixes, and I got the for loop to restart from the beginning by following guidance from another Stack Exchange post on the subject. I still have to improve the range, or alter isclose(), to get more values.
restartLoop = True
while restartLoop:
    restartLoop = False
    for i in range(0, 10):
        if condition:
            restartLoop = True
            break
UPDATE 3: Code structure to achieve listed objectives:
threshold = range(0, 11, 1)
listx = []
for i in threshold:
    listx.append(i)

restart = 0
restartLoop = True
while restartLoop:
    restartLoop = False
    for idx, i in enumerate(listx):
        print("do something as printing i:", i)
        if i > 5:  # if this condition, restart loop
            print("found value for condition: ", i)
            del listx[idx]
            restartLoop = True
            print("RESTARTING LOOP\n")
            restart += 1
            break  # break the for loop and restart it via the while
        else:
            # continue if the inner loop wasn't broken
            continue
    else:
        continue

print("restart - outer while", restart)
I have a huge list (45M+ data points) with numerical values:
[78,0,5,150,9000,5,......,25,9,78422...]
I can easily get the maximum and minimum values, the number of values, and their sum:
file_handle = open('huge_data_file.txt', 'r')
sum_values = 0
min_value = None
max_value = None
for i, line in enumerate(file_handle):
    value = int(line[:-1])
    if min_value is None or value < min_value:
        min_value = value
    if max_value is None or value > max_value:
        max_value = value
    sum_values += value
average_value = float(sum_values) / (i + 1)  # enumerate starts at 0
However, this is not what I need. I need a list of 10 numbers where the number of data points between each two consecutive boundaries is equal, for example
decile boundaries: [0,30,120,325,912,1570,2522,5002,7025,78422]
so that the number of data points between 0 and 30, or between 30 and 120, is almost 4.5 million.
How can we do this?
=============================
EDIT:
I am well aware that we will need to sort the data. The problem is that I cannot fit all of this data in memory at once; I need to read it sequentially from a generator (file_handle).
If you are happy with an approximation, here is a great (and fairly easy to implement) algorithm for computing quantiles from stream data: "Space-Efficient Online Computation of Quantile Summaries" by Greenwald and Khanna.
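Not Greenwald-Khanna itself, but as a much simpler single-pass approximation in the same spirit, a reservoir-sampling sketch (the sample size and file name are arbitrary/hypothetical):

import random

def approx_decile_boundaries(path, sample_size=100000, seed=0):
    # keep a uniform random sample of the stream, then read deciles off the sample
    rng = random.Random(seed)
    reservoir = []
    with open(path) as f:
        for n, line in enumerate(f):
            value = int(line)
            if len(reservoir) < sample_size:
                reservoir.append(value)
            else:
                j = rng.randrange(n + 1)
                if j < sample_size:
                    reservoir[j] = value
    reservoir.sort()
    # 11 boundary values -> 10 roughly equal-count bins (min, 10%, ..., 90%, max)
    step = len(reservoir) // 10
    return [reservoir[k * step] for k in range(10)] + [reservoir[-1]]

# boundaries = approx_decile_boundaries("huge_data_file.txt")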
The silly numpy approach:
import numpy as np
# example data (produced by numpy but converted to a simple list)
datalist = list(np.random.randint(0, 10000000, 45000000))
# converted back to numpy array (start here with your data)
arr = np.array(datalist)
np.percentile(arr, 10), np.percentile(arr, 20), np.percentile(arr, 30)
# ref:
# http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html
You can also hack something together yourself along these lines:
arr.sort()
# And then select the 10%, 20% etc. value; add some check for an equal amount of
# numbers within a bin and then calculate the average, exercise for the reader :-)
The thing is that calling np.percentile several times will slow things down, so really, just sort the array once and then select the elements yourself.
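For instance, sorting once and indexing directly (reusing arr and np from the snippet above):

arr.sort()
n = len(arr)
# one value per decile boundary: 10%, 20%, ..., 90%
deciles = [arr[k * n // 10] for k in range(1, 10)]
# or, in a single vectorised call instead of nine separate ones:
deciles_np = np.percentile(arr, range(10, 100, 10))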
As you said in the comments that you want a solution that can scale to datasets larger than can be stored in RAM, feed the data into an SQLite3 database. Even if your data set is 10GB and you only have 8GB of RAM, an SQLite3 database should still be able to sort the data and give it back to you in order.
The SQLite3 database gives you a generator over your sorted data.
You might also want to look beyond Python and take up some other database solution.
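A minimal sqlite3 sketch of that idea (the table, database, and data file names are hypothetical):

import sqlite3

def decile_boundaries_sqlite(data_path, db_path="values.db"):
    # load the values into a table, then let SQLite do the out-of-core sorting
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS vals (v INTEGER)")
    with open(data_path) as f:
        conn.executemany("INSERT INTO vals (v) VALUES (?)", ((int(line),) for line in f))
    conn.commit()
    (n,) = conn.execute("SELECT COUNT(*) FROM vals").fetchone()
    boundaries = []
    for k in range(1, 10):
        (v,) = conn.execute("SELECT v FROM vals ORDER BY v LIMIT 1 OFFSET ?",
                            (k * n // 10,)).fetchone()
        boundaries.append(v)
    conn.close()
    return boundaries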
Here's a pure-python implementation of the partitioned-on-disk sort. It's slow, ugly code, but it works and hopefully each stage is relatively clear (the merge stage is really ugly!).
#!/usr/bin/env python
import os

def get_next_int_from_file(f):
    l = f.readline()
    if not l:
        return None
    return int(l.strip())

MAX_SAMPLES_PER_PARTITION = 1000000
PARTITION_FILENAME = "_{}.txt"

# Partition data set
part_id = 0
eof = False
with open("data.txt", "r") as fin:
    while not eof:
        print "Creating partition {}".format(part_id)
        with open(PARTITION_FILENAME.format(part_id), "w") as fout:
            for _ in range(MAX_SAMPLES_PER_PARTITION):
                line = fin.readline()
                if not line:
                    eof = True
                    break
                fout.write(line)
        part_id += 1
num_partitions = part_id

# Sort each partition
for part_id in range(num_partitions):
    print "Reading unsorted partition {}".format(part_id)
    with open(PARTITION_FILENAME.format(part_id), "r") as fin:
        samples = [int(line.strip()) for line in fin.readlines()]
    print "Disk-Deleting unsorted {}".format(part_id)
    os.remove(PARTITION_FILENAME.format(part_id))
    print "In-memory sorting partition {}".format(part_id)
    samples.sort()
    print "Writing sorted partition {}".format(part_id)
    with open(PARTITION_FILENAME.format(part_id), "w") as fout:
        fout.writelines(["{}\n".format(sample) for sample in samples])

# Merge-sort the partitions
# NB This is a very inefficient implementation!
print "Merging sorted partitions"
part_files = []
part_next_int = []
num_lines_out = 0

# Setup data structures for the merge
for part_id in range(num_partitions):
    fin = open(PARTITION_FILENAME.format(part_id), "r")
    next_int = get_next_int_from_file(fin)
    if next_int is None:
        continue
    part_files.append(fin)
    part_next_int.append(next_int)

with open("data_sorted.txt", "w") as fout:
    while part_files:
        # Find the smallest number across all files
        min_number = None
        min_idx = None
        for idx in range(len(part_files)):
            if min_number is None or part_next_int[idx] < min_number:
                min_number = part_next_int[idx]
                min_idx = idx
        # Now add that number, and move the relevant file along
        fout.write("{}\n".format(min_number))
        num_lines_out += 1
        if num_lines_out % MAX_SAMPLES_PER_PARTITION == 0:
            print "Merged samples: {}".format(num_lines_out)
        next_int = get_next_int_from_file(part_files[min_idx])
        if next_int is None:
            # Remove this partition, it's now finished
            del part_files[min_idx:min_idx + 1]
            del part_next_int[min_idx:min_idx + 1]
        else:
            part_next_int[min_idx] = next_int

# Cleanup partition files
for part_id in range(num_partitions):
    os.remove(PARTITION_FILENAME.format(part_id))
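Once data_sorted.txt exists, the decile boundaries can be pulled out in one more streaming pass (a small sketch; n is the total number of samples, which the merge loop above already tracks as num_lines_out):

def decile_boundaries(sorted_path, n):
    # indices of the 10%, 20%, ..., 90% elements in the sorted file
    targets = set(k * n // 10 for k in range(1, 10))
    boundaries = []
    with open(sorted_path) as f:
        for idx, line in enumerate(f):
            if idx in targets:
                boundaries.append(int(line))
    return boundaries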
My code is a proposal for finding the result without needing much space. In testing, it found a quantile value in 7 minutes 51 seconds for a dataset of size 45,000,000.
import random
from bisect import bisect_left

class data():
    def __init__(self, values):
        random.shuffle(values)
        self.values = values

    def __iter__(self):
        for i in self.values:
            yield i

    def __len__(self):
        return len(self.values)

    def sortedValue(self, percentile):
        val = list(self)
        val.sort()
        num = int(len(self)*percentile)
        return val[num]

def init():
    numbers = data([x for x in range(1, 1000000)])
    print(seekPercentile(numbers, 0.1))
    print(numbers.sortedValue(0.1))

def seekPercentile(numbers, percentile):
    lower, upper = minmax(numbers)
    maximum = upper
    approx = _approxPercentile(numbers, lower, upper, percentile)
    return neighbor(approx, numbers, maximum)

def minmax(list):
    minimum = float("inf")
    maximum = float("-inf")
    for num in list:
        if num > maximum:
            maximum = num
        if num < minimum:
            minimum = num
    return minimum, maximum

def neighbor(approx, numbers, maximum):
    dif = maximum
    for num in numbers:
        if abs(approx - num) < dif:
            result = num
            dif = abs(approx - num)
    return result

def _approxPercentile(numbers, lower, upper, percentile):
    middles = []
    less = []
    magicNumber = 10000
    step = (upper - lower)/magicNumber
    less = []
    for i in range(1, magicNumber - 1):
        middles.append(lower + i * step)
        less.append(0)
    for num in numbers:
        index = bisect_left(middles, num)
        if index < len(less):
            less[index] += 1
    summing = 0
    for index, testVal in enumerate(middles):
        summing += less[index]
        if summing/len(numbers) < percentile:
            print(" Change lower from " + str(lower) + " to " + str(testVal))
            lower = testVal
        if summing/len(numbers) > percentile:
            print(" Change upper from " + str(upper) + " to " + str(testVal))
            upper = testVal
            break
    precision = 0.01
    if (lower + precision) > upper:
        return lower
    else:
        return _approxPercentile(numbers, lower, upper, percentile)

init()
I edited my code a bit and I now think that this way works at least decently even when it's not optimal.
I've written some Python code to calculate a certain quantity from a cosmological simulation. It does this by checking whether a particle is contained within a box of size 8,000^3 starting at the origin, and advancing the box once all particles contained within it are found. As I am counting ~2 million particles altogether, and the total size of the simulation volume is 150,000^3, this is taking a long time.
I'll post my code below; does anybody have any suggestions on how to improve it?
Thanks in advance.
from __future__ import division
import numpy as np

def check_range(pos, i, j, k):
    a = 0
    if i <= pos[2] < i+8000:
        if j <= pos[3] < j+8000:
            if k <= pos[4] < k+8000:
                a = 1
    return a

def sigma8(data):
    N = []
    to_do = data
    print 'Counting number of particles per cell...'
    for k in range(0, 150001, 8000):
        for j in range(0, 150001, 8000):
            for i in range(0, 150001, 8000):
                temp = []
                n = []
                for count in range(len(to_do)):
                    n.append(check_range(to_do[count], i, j, k))
                    to_do[count][1] = n[count]
                    if to_do[count][1] == 0:
                        temp.append(to_do[count])
                # Only particles that have not been found are
                # searched for again
                to_do = temp
                N.append(sum(n))
            print 'Next row'
        print 'Next slice, %i still to find' % len(to_do)
    print 'Calculating sigma8...'
    if not sum(N) == len(data):
        return 'Error!\nN measured = {0}, total N = {1}'.format(sum(N), len(data))
    else:
        return 'sigma8 = %.4f, variance = %.4f, mean = %.4f' % (np.sqrt(sum((N-np.mean(N))**2)/len(N))/np.mean(N), np.var(N), np.mean(N))
I'll try to post some code, but my general idea is the following: create a Particle class that knows about the box that it lives in, which is calculated in the __init__. Each box should have a unique name, which might be the coordinate of the bottom left corner (or whatever you use to locate your boxes).
Get a new instance of the Particle class for each particle, then use a Counter (from the collections module).
Particle class looks something like:
# static consts - outside so that every instance of Particle doesn't take them along
# for the ride...
MAX_X = 150000
X_STEP = 8000
# etc.

class Particle(object):
    def __init__(self, data):
        self.x = data[xvalue]
        self.y = data[yvalue]
        self.z = data[zvalue]
        self.compute_box_label()

    def compute_box_label(self):
        import math
        x_label = math.floor(self.x / X_STEP)
        y_label = math.floor(self.y / Y_STEP)
        z_label = math.floor(self.z / Z_STEP)
        self.box_label = str(x_label) + '-' + str(y_label) + '-' + str(z_label)
Anyway, I imagine your sigma8 function might look like:
def sigma8(data):
    import collections as col
    particles = [Particle(x) for x in data]
    boxes = col.Counter([x.box_label for x in particles])
    counts = boxes.most_common()
    #some other stuff
counts will be a list of tuples which map a box label to the number of particles in that box. (Here we're treating particles as indistinguishable.)
Using list comprehensions is much faster than explicit loops; I think the reason is that you're basically relying more on the underlying C, but I'm not the person to ask. Counter is (supposedly) highly optimized as well.
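As a tiny illustration of what most_common() returns (hypothetical labels):

from collections import Counter

boxes = Counter(["0-0-0", "0-0-0", "1-0-0"])
print(boxes.most_common())                           # [('0-0-0', 2), ('1-0-0', 1)]
counts_per_box = [n for _, n in boxes.most_common()]  # e.g. [2, 1]
# note: boxes containing zero particles never appear in the Counter at all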
Note: None of this code has been tested, so you shouldn't try the cut-and-paste-and-hope-it-works method here.