If I have a sample of 10,000 data points that I know is binomially distributed, what is the best way to plot its PMF? That is, I need to find the parameters n and p, but I don't really know the easiest way to do that.
Is there an easy way to accomplish this? Here is what I get when trying to "guess" the parameters, using the following code:
import numpy as np
import scipy.stats as sps
import matplotlib.pyplot as plt

def binom_dist(data_list):
    pmf_guess = sps.binom.pmf(data_list, 35, 0.3)
    pmf = sps.binom.pmf(data_list, max(data_list), np.std(data_list) / max(data_list))
    return pmf_guess, pmf
def plot_histo(data_list, bin_count, pmf, pmf_guess):
    plt.hist(data_list, bins=bin_count)
    plt.plot(data_list, pmf, color='red')
    plt.plot(data_list, pmf_guess, color='green')
    return plt.show()
pmf_guess, pmf = binom_dist(data_3)
plot_3 = plot_histo(data_3, 100, pmf, pmf_guess)
Here is a shortened version of the data:
data_3 = ([1.0] * 1
          + [2.0] * 11
          + [3.0] * 44
          + [4.0] * 97
          + [5.0] * 47)  # 200 points in total
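Rather than guessing, a standard option is a method-of-moments fit: for X ~ Binomial(n, p) the mean is n*p and the variance is n*p*(1-p), so p can be estimated as 1 - variance/mean and n as mean/p. A minimal sketch of that idea (the helper name estimate_binom_params is mine, not from the question):
import numpy as np

def estimate_binom_params(data):
    # method of moments: mean = n*p and var = n*p*(1-p),
    # hence p = 1 - var/mean and n = mean/p
    # (assumes var < mean, as expected for binomial data)
    mean = np.mean(data)
    var = np.var(data)
    p_hat = 1.0 - var / mean
    n_hat = int(round(mean / p_hat))
    return n_hat, p_hat

n_hat, p_hat = estimate_binom_params(data_3)
print(n_hat, p_hat)  # plug these into sps.binom.pmf instead of the guessed 35 and 0.3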
This is part of my output. I want to find the average of the list within this dictionary:
{'Radial Velocity': {'number': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 3.0, 3.0, 3.0, 1.0, 5.0, 5.0, 5.0, 5.0, 5.0, 3.0, 3.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 2.0, 1.0, 3.0, 3.0, 3.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0, 1.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 1.0, 3.0, 3.0, 3.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]}}
Here's how to solve it using plain Python:
sumOfNumbers = 0
for number in dictionary['Radial Velocity']['number']:
    sumOfNumbers += number
avg = sumOfNumbers / len(dictionary['Radial Velocity']['number'])
print(avg)
You can use the mean() function from numpy:
import numpy as np
output = {'Radial Velocity': {'number': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 3.0, 3.0, 3.0, 1.0, 5.0, 5.0, 5.0, 5.0, 5.0, 3.0, 3.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 2.0, 1.0, 3.0, 3.0, 3.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0, 1.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 1.0, 3.0, 3.0, 3.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]}}
print(np.mean(output['Radial Velocity']['number']))
Output:
2.1607142857142856
Python has a statistics module in its standard library. It has, among other useful things, a mean() function to which you can pass a list and get the average:
from statistics import mean
print(mean(output['Radial Velocity']['number']))
I have to run soak tests for long durations, capture 3 datasets (before the run, in between the run, and after the run), plot them, and manually analyze the plots.
The datasets span a very large range (0 to 10^5), so when I plot this data using matplotlib's bar function, the bars for smaller values are too small to analyze.
import matplotlib
matplotlib.use('Agg')
import sys, os, argparse, json, string, numpy
from datetime import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
bx = ('smmpg_b1024k', 'smmpg_b10k', 'smmpg_b11k', 'smmpg_b128', 'smmpg_b128k', 'smmpg_b12k', 'smmpg_b13k', 'smmpg_b14k', 'smmpg_b15k', 'smmpg_b160', 'smmpg_b16k', 'smmpg_b17k', 'smmpg_b18k', 'smmpg_b192', 'smmpg_b192k', 'smmpg_b19k', 'smmpg_b1k', 'smmpg_b20k', 'smmpg_b21k', 'smmpg_b224', 'smmpg_b22k', 'smmpg_b23k', 'smmpg_b24k', 'smmpg_b256', 'smmpg_b256k', 'smmpg_b25k', 'smmpg_b26k', 'smmpg_b27k', 'smmpg_b288', 'smmpg_b28k', 'smmpg_b29k', 'smmpg_b2k', 'smmpg_b30k', 'smmpg_b31k', 'smmpg_b32', 'smmpg_b320', 'smmpg_b320k', 'smmpg_b32k', 'smmpg_b33k', 'smmpg_b34k', 'smmpg_b352', 'smmpg_b35k', 'smmpg_b36k', 'smmpg_b37k', 'smmpg_b384', 'smmpg_b384k', 'smmpg_b38k', 'smmpg_b39k', 'smmpg_b3k', 'smmpg_b40k', 'smmpg_b416', 'smmpg_b41k', 'smmpg_b42k', 'smmpg_b43k', 'smmpg_b448', 'smmpg_b448k', 'smmpg_b44k', 'smmpg_b45k', 'smmpg_b46k', 'smmpg_b47k', 'smmpg_b480', 'smmpg_b48k', 'smmpg_b49k', 'smmpg_b4k', 'smmpg_b50k', 'smmpg_b512', 'smmpg_b512k', 'smmpg_b51k', 'smmpg_b52k', 'smmpg_b53k', 'smmpg_b544', 'smmpg_b54k', 'smmpg_b55k', 'smmpg_b56k', 'smmpg_b576', 'smmpg_b576k', 'smmpg_b57k', 'smmpg_b58k', 'smmpg_b59k', 'smmpg_b5k', 'smmpg_b608', 'smmpg_b60k', 'smmpg_b61k', 'smmpg_b62k', 'smmpg_b63k', 'smmpg_b64', 'smmpg_b640', 'smmpg_b640k', 'smmpg_b64k', 'smmpg_b672', 'smmpg_b6k', 'smmpg_b704', 'smmpg_b704k', 'smmpg_b736', 'smmpg_b768', 'smmpg_b768k', 'smmpg_b7k', 'smmpg_b800', 'smmpg_b832', 'smmpg_b832k', 'smmpg_b864', 'smmpg_b896', 'smmpg_b896k', 'smmpg_b8k', 'smmpg_b928', 'smmpg_b96', 'smmpg_b960', 'smmpg_b960k', 'smmpg_b992', 'smmpg_b9k', 'smmpg_ccb', 'smmpg_msb', 'smmpg_twomb', 'total-pages', 'total-size')
before = (0.0, 2.0, 2.0, 4.0, 8.0, 2.0, 2.0, 2.0, 2.0, 6.0, 2.0, 4.0, 44.0, 76.0, 6.0, 2.0, 2.0, 2.0, 18.0, 2.0, 18.0, 30.0, 32.0, 2.0, 12.0, 2.0, 170.0, 0.0, 4.0, 2.0, 0.0, 24.0, 0.0, 2.0, 10.0, 2.0, 12.0, 2.0, 36.0, 0.0, 2.0, 0.0, 0.0, 0.0, 12.0, 22.0, 2.0, 0.0, 272.0, 2.0, 4.0, 2.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 4.0, 0.0, 2.0, 2.0, 2.0, 0.0, 0.0, 8.0, 2.0, 0.0, 2.0, 2.0, 6.0, 0.0, 0.0, 0.0, 34.0, 2.0, 0.0, 2.0, 0.0, 2.0, 92.0, 2.0, 0.0, 2.0, 2.0, 40.0, 2.0, 0.0, 2.0, 2.0, 0.0, 14.0, 2.0, 4.0, 2.0, 2.0, 2.0, 0.0, 18.0, 2.0, 28.0, 4.0, 0.0, 2.0, 2.0, 6.0, 214.0, 26226.0, 13813.0, 27626.0)
intermediate = (0.0, 2.0, 2.0, 4.0, 8.0, 2.0, 2.0, 2.0, 2.0, 6.0, 2.0, 4.0, 44.0, 76.0, 6.0, 2.0, 2.0, 2.0, 18.0, 2.0, 18.0, 30.0, 32.0, 2.0, 12.0, 2.0, 170.0, 0.0, 4.0, 2.0, 0.0, 24.0, 0.0, 2.0, 10.0, 2.0, 12.0, 2.0, 36.0, 0.0, 2.0, 0.0, 0.0, 0.0, 12.0, 22.0, 2.0, 0.0, 272.0, 2.0, 4.0, 2.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 4.0, 0.0, 2.0, 2.0, 2.0, 0.0, 0.0, 8.0, 2.0, 0.0, 2.0, 2.0, 6.0, 0.0, 0.0, 0.0, 34.0, 2.0, 0.0, 2.0, 0.0, 2.0, 92.0, 2.0, 0.0, 2.0, 2.0, 40.0, 2.0, 0.0, 2.0, 2.0, 0.0, 14.0, 2.0, 4.0, 2.0, 2.0, 2.0, 0.0, 18.0, 2.0, 28.0, 4.0, 0.0, 2.0, 2.0, 6.0, 214.0, 26226.0, 13813.0, 27626.0)
after = (0.0, 2.0, 2.0, 4.0, 8.0, 2.0, 2.0, 2.0, 2.0, 6.0, 2.0, 4.0, 44.0, 76.0, 6.0, 2.0, 2.0, 2.0, 18.0, 2.0, 18.0, 30.0, 32.0, 2.0, 12.0, 2.0, 170.0, 0.0, 4.0, 2.0, 0.0, 24.0, 0.0, 2.0, 10.0, 2.0, 12.0, 2.0, 36.0, 0.0, 2.0, 0.0, 0.0, 0.0, 12.0, 22.0, 2.0, 0.0, 272.0, 2.0, 4.0, 2.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 4.0, 0.0, 2.0, 2.0, 2.0, 0.0, 0.0, 8.0, 2.0, 0.0, 2.0, 2.0, 6.0, 0.0, 0.0, 0.0, 34.0, 2.0, 0.0, 2.0, 0.0, 2.0, 92.0, 2.0, 0.0, 2.0, 2.0, 40.0, 2.0, 0.0, 2.0, 2.0, 0.0, 14.0, 2.0, 4.0, 2.0, 2.0, 2.0, 0.0, 18.0, 2.0, 28.0, 4.0, 0.0, 2.0, 2.0, 6.0, 214.0, 26226.0, 13813.0, 27626.0)
x_locations = numpy.arange(len(bx))
width = 0.27
fig = plt.figure(figsize=(50, 20))
ax = fig.add_subplot(111)
before_test_mempools_bar = ax.bar(x_locations, list(before), width, color='r')
intermediate_test_mempools_bar = ax.bar(x_locations + width, list(intermediate), width, color='g')
after_test_mempools_bar = ax.bar(x_locations + width * 2, list(after), width, color='b')
ax.set_ylabel('Memory')
ax.set_xticks(x_locations + width)
ax.set_xticklabels(bx, rotation=90)
ax.legend((before_test_mempools_bar[0], intermediate_test_mempools_bar[0], after_test_mempools_bar[0]),
          ('BEFORE', 'INTERMEDIATE', 'AFTER'))
fig.savefig("plot.png")
plt.close()
The above code produces the following plot:
Goal:
My goal is to fit all the data into a visually clean plot that can be analyzed by any tester on the team.
Currently, it's hard to see what happened in the smaller range of values.
One possible approach would be normalization, but I'm not sure whether the original data would be preserved.
Any possible solutions are appreciated.
Transcribing @Alexander Reynold's comment into an answer:
Use a logarithmic y-axis, i.e. instead of plot() use semilogy(). You can change the base depending on the dynamic range you need to display.
I didn't know that there is already an argument in the bar function to change the scale of the y-axis.
After adding the log=True argument to all the bar calls, as below,
before_test_mempools_bar = ax.bar(x_locations, list(before), width, color='r', log=True)
intermediate_test_mempools_bar = ax.bar(x_locations + width, list(intermediate), width, color='g', log=True)
after_test_mempools_bar = ax.bar(x_locations + width * 2, list(after), width, color='b', log=True)
my plot looks much nicer now and is easy to analyze.
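Equivalently, the log scale can be set once on the Axes object instead of in every bar() call. A minimal sketch of the same idea, reusing bx, before, intermediate, and after from the question (one caveat: a log axis cannot represent zero-height bars, so matplotlib simply omits them):
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(50, 20))
x = np.arange(len(bx))
width = 0.27
ax.bar(x, before, width, color='r', label='BEFORE')
ax.bar(x + width, intermediate, width, color='g', label='INTERMEDIATE')
ax.bar(x + width * 2, after, width, color='b', label='AFTER')
ax.set_yscale('log')  # same effect as passing log=True to each bar() call
ax.set_ylabel('Memory')
ax.set_xticks(x + width)
ax.set_xticklabels(bx, rotation=90)
ax.legend()
fig.savefig("plot.png")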
If I may, I think your problem is not technical: rather, you haven't thought enough about what you want to show and what you want people to look at, because the plot you're showing seems to contain a lot of "noise", i.e. areas of the graphic that give little or no information.
So, even though you only provided simulated data, there seems to be room for improvement toward a more readable and "to the point" visualization.
For example, you could:
remove uninteresting information (maybe the categories at 0.0, or those that haven't changed?)
regroup some categories (what about creating new aggregated categories? or showing the data in a totally different way, with values on the x-axis and category names on the y-axis?)
Also, maybe you're mixing different kinds of things (shouldn't those last 3 bx categories ('smmpg_twomb', 'total-pages' & 'total-size') be put in a graph of their own?)
Use a data structure like pandas' DataFrame to better handle and clean your data, which makes the three previous suggestions easier to implement.
These are just a few suggestions, but maybe they will help.
Here is an example of what you could do, just to illustrate:
import matplotlib
matplotlib.use('Agg')
import sys, os, argparse, json, string, numpy
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
bx = ('smmpg_b1024k', 'smmpg_b10k', 'smmpg_b11k', 'smmpg_b128', 'smmpg_b128k', 'smmpg_b12k', 'smmpg_b13k',
'smmpg_b14k', 'smmpg_b15k', 'smmpg_b160', 'smmpg_b16k', 'smmpg_b17k', 'smmpg_b18k', 'smmpg_b192',
'smmpg_b192k', 'smmpg_b19k', 'smmpg_b1k', 'smmpg_b20k', 'smmpg_b21k', 'smmpg_b224', 'smmpg_b22k',
'smmpg_b23k', 'smmpg_b24k', 'smmpg_b256', 'smmpg_b256k', 'smmpg_b25k', 'smmpg_b26k', 'smmpg_b27k',
'smmpg_b288', 'smmpg_b28k', 'smmpg_b29k', 'smmpg_b2k', 'smmpg_b30k', 'smmpg_b31k', 'smmpg_b32',
'smmpg_b320', 'smmpg_b320k', 'smmpg_b32k', 'smmpg_b33k', 'smmpg_b34k', 'smmpg_b352', 'smmpg_b35k',
'smmpg_b36k', 'smmpg_b37k', 'smmpg_b384', 'smmpg_b384k', 'smmpg_b38k', 'smmpg_b39k', 'smmpg_b3k',
'smmpg_b40k', 'smmpg_b416', 'smmpg_b41k', 'smmpg_b42k', 'smmpg_b43k', 'smmpg_b448', 'smmpg_b448k',
'smmpg_b44k', 'smmpg_b45k', 'smmpg_b46k', 'smmpg_b47k', 'smmpg_b480', 'smmpg_b48k', 'smmpg_b49k',
'smmpg_b4k', 'smmpg_b50k', 'smmpg_b512', 'smmpg_b512k', 'smmpg_b51k', 'smmpg_b52k', 'smmpg_b53k',
'smmpg_b544', 'smmpg_b54k', 'smmpg_b55k', 'smmpg_b56k', 'smmpg_b576', 'smmpg_b576k', 'smmpg_b57k',
'smmpg_b58k', 'smmpg_b59k', 'smmpg_b5k', 'smmpg_b608', 'smmpg_b60k', 'smmpg_b61k', 'smmpg_b62k',
'smmpg_b63k', 'smmpg_b64', 'smmpg_b640', 'smmpg_b640k', 'smmpg_b64k', 'smmpg_b672', 'smmpg_b6k',
'smmpg_b704', 'smmpg_b704k', 'smmpg_b736', 'smmpg_b768', 'smmpg_b768k', 'smmpg_b7k', 'smmpg_b800',
'smmpg_b832', 'smmpg_b832k', 'smmpg_b864', 'smmpg_b896', 'smmpg_b896k', 'smmpg_b8k', 'smmpg_b928',
'smmpg_b96', 'smmpg_b960', 'smmpg_b960k', 'smmpg_b992', 'smmpg_b9k', 'smmpg_ccb', 'smmpg_msb',
'smmpg_twomb', 'total-pages', 'total-size')
before = (0.0, 2.0, 2.0, 4.0, 8.0, 2.0, 2.0, 2.0, 2.0, 6.0, 2.0, 4.0, 44.0, 76.0, 6.0, 2.0, 2.0, 2.0, 18.0, 2.0, 18.0, 30.0, 32.0, 2.0, 12.0, 2.0, 170.0, 0.0, 4.0, 2.0, 0.0, 24.0, 0.0, 2.0, 10.0, 2.0, 12.0, 2.0, 36.0, 0.0, 2.0, 0.0, 0.0, 0.0, 12.0, 22.0, 2.0, 0.0, 272.0, 2.0, 4.0, 2.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 4.0, 0.0, 2.0, 2.0, 2.0, 0.0, 0.0, 8.0, 2.0, 0.0, 2.0, 2.0, 6.0, 0.0, 0.0, 0.0, 34.0, 2.0, 0.0, 2.0, 0.0, 2.0, 92.0, 2.0, 0.0, 2.0, 2.0, 40.0, 2.0, 0.0, 2.0, 2.0, 0.0, 14.0, 2.0, 4.0, 2.0, 2.0, 2.0, 0.0, 18.0, 2.0, 28.0, 4.0, 0.0, 2.0, 2.0, 6.0, 214.0, 26226.0, 13813.0, 27626.0)
intermediate = (0.0, 2.0, 2.0, 4.0, 8.0, 2.0, 2.0, 2.0, 2.0, 6.0, 2.0, 4.0, 44.0, 76.0, 6.0, 2.0, 2.0, 2.0, 18.0, 2.0, 18.0, 30.0, 32.0, 2.0, 12.0, 2.0, 170.0, 0.0, 4.0, 2.0, 0.0, 24.0, 0.0, 2.0, 10.0, 2.0, 12.0, 2.0, 36.0, 0.0, 2.0, 0.0, 0.0, 0.0, 12.0, 22.0, 2.0, 0.0, 272.0, 2.0, 4.0, 2.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 4.0, 0.0, 2.0, 2.0, 2.0, 0.0, 0.0, 8.0, 2.0, 0.0, 2.0, 2.0, 6.0, 0.0, 0.0, 0.0, 34.0, 2.0, 0.0, 2.0, 0.0, 2.0, 92.0, 2.0, 0.0, 2.0, 2.0, 40.0, 2.0, 0.0, 2.0, 2.0, 0.0, 14.0, 2.0, 4.0, 2.0, 2.0, 2.0, 0.0, 18.0, 2.0, 28.0, 4.0, 0.0, 2.0, 2.0, 6.0, 214.0, 26226.0, 13813.0, 27626.0)
after = (0.0, 2.0, 2.0, 4.0, 8.0, 2.0, 2.0, 2.0, 2.0, 6.0, 2.0, 4.0, 44.0, 76.0, 6.0, 2.0, 2.0, 2.0, 18.0, 2.0, 18.0, 30.0, 32.0, 2.0, 12.0, 2.0, 170.0, 0.0, 4.0, 2.0, 0.0, 24.0, 0.0, 2.0, 10.0, 2.0, 12.0, 2.0, 36.0, 0.0, 2.0, 0.0, 0.0, 0.0, 12.0, 22.0, 2.0, 0.0, 272.0, 2.0, 4.0, 2.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 4.0, 0.0, 2.0, 2.0, 2.0, 0.0, 0.0, 8.0, 2.0, 0.0, 2.0, 2.0, 6.0, 0.0, 0.0, 0.0, 34.0, 2.0, 0.0, 2.0, 0.0, 2.0, 92.0, 2.0, 0.0, 2.0, 2.0, 40.0, 2.0, 0.0, 2.0, 2.0, 0.0, 14.0, 2.0, 4.0, 2.0, 2.0, 2.0, 0.0, 18.0, 2.0, 28.0, 4.0, 0.0, 2.0, 2.0, 6.0, 214.0, 26226.0, 13813.0, 27626.0)
# Put your data in a DataFrame:
df = pd.DataFrame({'before': before,
'intermediate': intermediate,
'after': after, 'bx': bx,
'x_locations': numpy.arange(len(bx))
})
# filter columns - you can put them in another graph!
df_filt_cat = df.loc[(df.bx != 'smmpg_twomb') & (df.bx != 'total-pages') & (df.bx != 'total-size')]
# filter out categories that stay 0 all the way through
df_filt_zero = df_filt_cat.loc[(df_filt_cat.before != 0) | (df_filt_cat.intermediate != 0) | (df_filt_cat.after != 0)]
width = 0.27
fig = plt.figure(figsize=(50, 20))
ax = fig.add_subplot(111)
before_test_mempools_bar = ax.bar(df_filt_zero.x_locations, df_filt_zero.before, width, color='r')
intermediate_test_mempools_bar = ax.bar(df_filt_zero.x_locations + width, df_filt_zero.intermediate, width, color='g')
after_test_mempools_bar = ax.bar(df_filt_zero.x_locations + width *2, df_filt_zero.after, width, color='b')
ax.set_ylabel('Memory')
ax.set_xticks(df_filt_zero.x_locations + width)
ax.set_xticklabels(df_filt_zero.bx, rotation=90)
ax.legend((before_test_mempools_bar[0],intermediate_test_mempools_bar[0],after_test_mempools_bar[0]),('BEFORE','INTERMEDIATE','AFTER'))
# just to show the result, I commented this line out
#fig.savefig("plot.png")
# and put this one instead:
plt.show()
It obviously still needs improvement, but it's already a bit more readable.
Problem
For a computational engineering model, I want to do a grid search over all feasible parameter combinations. Each parameter has a certain permissible range, e.g. (0 … 100), and the parameter combination must fulfil the condition a+b+c=100. An example:
ranges = {
    'a': (95, 99),
    'b': (1, 4),
    'c': (1, 2)}
increment = 1.0
target = 100.0
So the combinations that fulfil the condition a+b+c=100 are:
[(95, 4, 1), (95, 3, 2), (96, 2, 2), (96, 3, 1), (97, 1, 2), (97, 2, 1), (98, 1, 1)]
This algorithm should run with any number of parameters, range lengths, and increments.
My solutions (so far)
The solutions I have come up with all brute-force the problem: they calculate every combination and then discard the ones that do not fulfil the given condition:
import itertools

import numpy as np
import pandas as pd

def solution1(ranges, increment, target):
    combinations = []
    for parameter in ranges:
        combinations.append(list(np.arange(ranges[parameter][0], ranges[parameter][1], increment)))
        # np.arange() is exclusive of the upper bound, let's fix that
        if combinations[-1][-1] != ranges[parameter][1]:
            combinations[-1].append(ranges[parameter][1])
    combinations = list(itertools.product(*combinations))
    df = pd.DataFrame(combinations, columns=ranges.keys())
    # using np.isclose() so that the algorithm works for floats
    return df[np.isclose(df.sum(axis=1), target)]
Since I ran into RAM problems with solution1(), I switched to consuming itertools.product as an iterator:
def solution2(ranges, increment, target):
    combinations = []
    for parameter in ranges:
        combinations.append(list(np.arange(ranges[parameter][0], ranges[parameter][1], increment)))
        # np.arange() is exclusive of the upper bound, let's fix that
        if combinations[-1][-1] != ranges[parameter][1]:
            combinations[-1].append(ranges[parameter][1])
    result = []
    for combination in itertools.product(*combinations):
        # using np.isclose() so that the algorithm works for floats
        if np.isclose(sum(combination), target):
            result.append(combination)
    df = pd.DataFrame(result, columns=ranges.keys())
    return df
However, this quickly takes days to compute. Hence, neither solution is viable for a large number of parameters and ranges. For instance, one set that I am trying to solve (shown as the already unpacked combinations variable) is:
[[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0], [22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0], [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0], [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0], [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0], [0.0, 1.0, 2.0], [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0], [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], [0.0], [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0], [0.0]]
This results in memory use of >40 GB for solution1() and a calculation time of >400 hours for solution2().
Question
Do you see a solution that is either faster or more intelligent, i.e. not trying to brute-force the problem?
P.S.: I am not 100% sure if this question would be a better fit on one of the other Stackexchange sites. Please suggest in the comments if you think it should be moved and I will delete it here.
Here is a recursive solution:
a = [95, 100]
b = [1, 4]
c = [1, 2]
Params = (a, b, c)

def GetValidParamValues(Params, constraintSum, prevVals):
    validParamValues = []
    if len(Params) == 1:
        if Params[0][0] <= constraintSum <= Params[0][1]:
            validParamValues.append([constraintSum])
        for v in validParamValues:
            print(prevVals + v)
        return
    sumOfLowParams = sum(Params[x][0] for x in range(1, len(Params)))
    sumOfHighParams = sum(Params[x][1] for x in range(1, len(Params)))
    # restrict this parameter so the remaining ones can still reach the sum
    lowEnd = max(Params[0][0], constraintSum - sumOfHighParams)
    highEnd = min(Params[0][1], constraintSum - sumOfLowParams) + 1
    if len(Params) == 2:
        for av in range(lowEnd, highEnd):
            bv = constraintSum - av
            if bv <= Params[1][1]:
                validParamValues.append([av, bv])
        for v in validParamValues:
            print(prevVals + v)
        return
    for av in range(lowEnd, highEnd):
        nextPrevVals = prevVals + [av]
        subParams = Params[1:]
        GetValidParamValues(subParams, constraintSum - av, nextPrevVals)

GetValidParamValues(Params, 100, [])
The idea is that if there were 2 parameters, a and b, we could list all the valid pairs by stepping through the values of a, taking (ai, S - ai), and just checking whether S - ai is a valid value for b.
This improves on brute force because we can calculate ahead of time which values of ai will make S - ai a valid value for b, so we never check values that don't work.
When the number of parameters is more than 2, we can again look at every valid value of ai, and we know the sum of the other numbers must be S - ai. So the only thing we need is every possible way for the other numbers to add up to S - ai, which is the same problem with one fewer parameter. By using recursion we can take it all the way down to size 2 and solve it.
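For completeness, here is a sketch of the same pruning idea that returns the combinations instead of printing them and supports non-integer increments, which the question requires (the function name constrained_combinations and the tolerance handling are my additions, not part of the answer above):
import math

def constrained_combinations(bounds, target, increment=1.0, tol=1e-9):
    # bounds is a list of (low, high) pairs; same pruning as above:
    # each value is clamped so the remaining parameters can still reach target
    low, high = bounds[0]
    if len(bounds) == 1:
        # the last parameter must take exactly the remaining sum
        if low - tol <= target <= high + tol:
            return [(target,)]
        return []
    rest_low = sum(b[0] for b in bounds[1:])
    rest_high = sum(b[1] for b in bounds[1:])
    lo = max(low, target - rest_high)
    hi = min(high, target - rest_low)
    # snap the lower end up onto this parameter's grid (low + k * increment)
    k = max(0, math.ceil((lo - low - tol) / increment))
    v = low + k * increment
    results = []
    while v <= hi + tol:
        for tail in constrained_combinations(bounds[1:], target - v, increment, tol):
            results.append((v,) + tail)
        v += increment
    return results

print(constrained_combinations([(95, 99), (1, 4), (1, 2)], 100.0))
# yields the 7 combinations from the question's example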