How to find the parameters of a sample of data (Binomial distribution) - python

If I have a sample of 10,000 data points that I know is binomially distributed, what is the best way to plot its pmf? That is, I need to find the parameters n and p, but I don't know the easiest way to do that.
Is there an easy way to accomplish this? Here is what I get when trying to "guess" the parameters, using the following code:
def binom_dist(data_list):
    pmf_guess = sps.binom.pmf(data_list, 35, 0.3)
    pmf = sps.binom.pmf(data_list, max(data_list), np.std(data_list) / max(data_list))
    return pmf_guess, pmf

def plot_histo(data_list, bin_count, pmf, pmf_guess):
    plt.hist(data_list, bins=bin_count)
    plt.plot(data_list, pmf, color='red')
    plt.plot(data_list, pmf_guess, color='green')
    return plt.show()

pmf_guess, pmf = binom_dist(data_3)
plot_3 = plot_histo(data_3, 100, pmf, pmf_guess)
Here is a shortened version of the data
data_3 = ([1.0] * 1      # one 1.0
          + [2.0] * 11   # eleven 2.0s
          + [3.0] * 44   # forty-four 3.0s
          + [4.0] * 97   # ninety-seven 4.0s
          + [5.0] * 47)  # forty-seven 5.0s -- 200 values in total
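For reference, one standard way to estimate the parameters is the method of moments: for a binomial distribution the mean is n*p and the variance is n*p*(1-p), so p ≈ 1 - var/mean and n ≈ mean/p. A minimal sketch along those lines (fit_binomial_mom is a hypothetical helper name, and this assumes the sample variance is smaller than the mean, as it must be for binomial-looking data):
import numpy as np
import scipy.stats as sps
import matplotlib.pyplot as plt

def fit_binomial_mom(data_list):
    # Method of moments: mean = n*p, var = n*p*(1-p)
    m = np.mean(data_list)
    v = np.var(data_list)
    p_hat = 1.0 - v / m            # valid only while v < m
    n_hat = int(round(m / p_hat))  # n must be an integer
    return n_hat, p_hat

n_hat, p_hat = fit_binomial_mom(data_3)
ks = np.arange(0, n_hat + 1)
plt.hist(data_3, bins=np.arange(-0.5, n_hat + 1.5), density=True)
plt.plot(ks, sps.binom.pmf(ks, n_hat, p_hat), 'ro-')
plt.show()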

Related

Calculating mean of list within dictionary in Python

This is a part of my output. I want to find the average of the list within this dictionary:
{'Radial Velocity': {'number': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 3.0, 3.0, 3.0, 1.0, 5.0, 5.0, 5.0, 5.0, 5.0, 3.0, 3.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 2.0, 1.0, 3.0, 3.0, 3.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0, 1.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 1.0, 3.0, 3.0, 3.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]}}
Here's how to solve it using custom code:
sumOfNumbers = 0
for number in dictionary['Radial Velocity']['number']:
    sumOfNumbers += number
avg = sumOfNumbers / len(dictionary['Radial Velocity']['number'])
print(avg)
You can use the function mean() from numpy:
import numpy as np
output = {'Radial Velocity': {'number': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 3.0, 3.0, 3.0, 1.0, 5.0, 5.0, 5.0, 5.0, 5.0, 3.0, 3.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 2.0, 1.0, 3.0, 3.0, 3.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0, 1.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 1.0, 3.0, 3.0, 3.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]}}
print(np.mean(output['Radial Velocity']['number']))
Output:
2.1607142857142856
Python has a statistics module in its standard library. It has, among other useful things, a mean() function to which you can pass a list and get the average:
from statistics import mean
mean(output['Radial Velocity']['number'])
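All three approaches give the same result; as a quick check (reusing the output dict defined above):
import numpy as np
from statistics import mean

numbers = output['Radial Velocity']['number']
print(sum(numbers) / len(numbers))  # 2.1607142857142856
print(np.mean(numbers))             # same value
print(mean(numbers))                # same value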

Relabelling ticks on Seaborn axes?

I'm doing a log-log plot with Seaborn; the data is actually derived from a StackOverflow developer survey. I tried using the built-in log scale, but the results didn't make sense, so this simply calculates the logs before plotting.
df = pd.DataFrame( {'company_size_range': {7800: 7.0, 7801: 700.0, 7802: 7.0, 7803: 20000.0, 7805: 200.0, 7806: 20000.0, 7808: 2000.0, 7809: 2000.0, 7810: 7.0, 7811: 200.0, 7812: 50.0, 7813: 20000.0, 7816: 2.0, 7819: 200.0, 7820: 2000.0, 7824: 2.0, 7825: 2.0, 7827: 2.0, 7828: 50.0, 7830: 14.0, 7831: 50.0, 7833: 200.0, 7834: 50.0, 7835: 50.0, 7838: 2.0, 7840: 50.0, 7841: 50.0, 7842: 7000.0, 7843: 20000.0, 7844: 14.0, 7846: 2.0, 7850: 20000.0, 7851: 700.0, 7852: 200.0, 7853: 200.0, 7855: 200.0, 7856: 7.0, 7857: 50.0, 7858: 700.0, 7861: 20000.0, 7863: 20000.0, 7865: 20000.0, 7867: 700.0, 7868: 20000.0, 7870: 50.0, 7871: 2000.0, 7872: 50.0, 7873: 20000.0, 7874: 200.0, 7876: 14.0, 7877: 20000.0, 7879: 50.0, 7880: 50.0 }, 'team_size_range': {7800: 7.0, 7801: 7.0, 7802: 7.0, 7803: 2.0, 7805: 7.0, 7806: 2.0, 7808: 7.0, 7809: 7.0, 7810: 2.0, 7811: 17.0, 7812: 7.0, 7813: 2.0, 7816: 2.0, 7819: 7.0, 7820: 30.0, 7824: 2.0, 7825: 2.0, 7827: 2.0, 7828: 2.0, 7830: 2.0, 7831: 7.0, 7833: 2.0, 7834: 2.0, 7835: 7.0, 7838: 2.0, 7840: 7.0, 7841: 30.0, 7842: 7.0, 7843: 7.0, 7844: 2.0, 7846: 2.0, 7850: 7.0, 7851: 11.0, 7852: 7.0, 7853: 7.0, 7855: 2.0, 7856: 7.0, 7857: 7.0, 7858: 11.0, 7861: 7.0, 7863: 2.0, 7865: 30.0, 7867: 7.0, 7868: 7.0, 7870: 2.0, 7871: 17.0, 7872: 7.0, 7873: 17.0, 7874: 7.0, 7876: 2.0, 7877: 7.0, 7879: 17.0, 7880: 7.0}} )
g = sns.jointplot(x=np.log10(df['company_size_range'] + 1),
                  y=np.log10(df['team_size_range'] + 1), kind='kde', color='g')
That's fine, but the axes show the log values, not the underlying values. The X-axis, for example, is:
-1, 1, 2, 3, 4, 5, 6
So I added this to fix it, using the X position of the labels as the X values:
g.ax_joint.set_xticklabels(["{:.0f}".format(10**label.get_position()[0] - 1)
                            for label in g.ax_joint.get_xticklabels()])
The trouble is the resulting X-axis labels are nonsense:
1, 2, 3, 5, 9, 0, 0, 0
What is going on, and how best to fix it, please?
You could make use of a FuncFormatter. The benefit is that the ticks are also rendered correctly after resizing the window.
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import numpy as np
import pandas as pd
import seaborn as sns
def tickformat_pow10(value, tick_number):
    return f'{10**value:,.0f}'

# df = ...
g = sns.jointplot(x=np.log10(df['company_size_range'] + 1),
                  y=np.log10(df['team_size_range'] + 1), kind='kde', color='g')
g.ax_joint.xaxis.set_major_formatter(FuncFormatter(tickformat_pow10))
g.ax_joint.yaxis.set_major_formatter(FuncFormatter(tickformat_pow10))
Try the following, calling canvas.draw() first; tick labels are only populated once the figure has actually been drawn, so get_xticklabels() returns useful text only after a draw. (Also, I do not understand why you are subtracting 1.)
g.fig.canvas.draw()
g.ax_joint.set_xticklabels(["{:.0f}".format(10**label.get_position()[0] - 1)
                            for label in g.ax_joint.get_xticklabels()])
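For what it's worth, the reason the computed labels come out as nonsense without a draw is that matplotlib only fills in tick label text when the figure is rendered; before that, get_xticklabels() returns Text objects whose strings are empty or stale. A minimal demonstration (plain matplotlib, made-up data):
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])
print([t.get_text() for t in ax.get_xticklabels()])  # typically empty strings
fig.canvas.draw()                                    # force a render
print([t.get_text() for t in ax.get_xticklabels()])  # now the actual labels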

How to plot large range values with matplotlib?

I have to run soak tests for a long duration and capture 3 datasets (before the run, during the run, after the run), plot them and manually analyze the plots.
All the datasets span a very large range (0 to 10^5). So, when I plot this data using matplotlib's bar function, the bars for smaller values are too small to be analyzed.
import matplotlib
matplotlib.use('Agg')
import sys,os,argparse,json,string,numpy
from datetime import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
bx = ('smmpg_b1024k', 'smmpg_b10k', 'smmpg_b11k', 'smmpg_b128', 'smmpg_b128k', 'smmpg_b12k', 'smmpg_b13k', 'smmpg_b14k', 'smmpg_b15k', 'smmpg_b160', 'smmpg_b16k', 'smmpg_b17k', 'smmpg_b18k', 'smmpg_b192', 'smmpg_b192k', 'smmpg_b19k', 'smmpg_b1k', 'smmpg_b20k', 'smmpg_b21k', 'smmpg_b224', 'smmpg_b22k', 'smmpg_b23k', 'smmpg_b24k', 'smmpg_b256', 'smmpg_b256k', 'smmpg_b25k', 'smmpg_b26k', 'smmpg_b27k', 'smmpg_b288', 'smmpg_b28k', 'smmpg_b29k', 'smmpg_b2k', 'smmpg_b30k', 'smmpg_b31k', 'smmpg_b32', 'smmpg_b320', 'smmpg_b320k', 'smmpg_b32k', 'smmpg_b33k', 'smmpg_b34k', 'smmpg_b352', 'smmpg_b35k', 'smmpg_b36k', 'smmpg_b37k', 'smmpg_b384', 'smmpg_b384k', 'smmpg_b38k', 'smmpg_b39k', 'smmpg_b3k', 'smmpg_b40k', 'smmpg_b416', 'smmpg_b41k', 'smmpg_b42k', 'smmpg_b43k', 'smmpg_b448', 'smmpg_b448k', 'smmpg_b44k', 'smmpg_b45k', 'smmpg_b46k', 'smmpg_b47k', 'smmpg_b480', 'smmpg_b48k', 'smmpg_b49k', 'smmpg_b4k', 'smmpg_b50k', 'smmpg_b512', 'smmpg_b512k', 'smmpg_b51k', 'smmpg_b52k', 'smmpg_b53k', 'smmpg_b544', 'smmpg_b54k', 'smmpg_b55k', 'smmpg_b56k', 'smmpg_b576', 'smmpg_b576k', 'smmpg_b57k', 'smmpg_b58k', 'smmpg_b59k', 'smmpg_b5k', 'smmpg_b608', 'smmpg_b60k', 'smmpg_b61k', 'smmpg_b62k', 'smmpg_b63k', 'smmpg_b64', 'smmpg_b640', 'smmpg_b640k', 'smmpg_b64k', 'smmpg_b672', 'smmpg_b6k', 'smmpg_b704', 'smmpg_b704k', 'smmpg_b736', 'smmpg_b768', 'smmpg_b768k', 'smmpg_b7k', 'smmpg_b800', 'smmpg_b832', 'smmpg_b832k', 'smmpg_b864', 'smmpg_b896', 'smmpg_b896k', 'smmpg_b8k', 'smmpg_b928', 'smmpg_b96', 'smmpg_b960', 'smmpg_b960k', 'smmpg_b992', 'smmpg_b9k', 'smmpg_ccb', 'smmpg_msb', 'smmpg_twomb', 'total-pages', 'total-size')
before = (0.0, 2.0, 2.0, 4.0, 8.0, 2.0, 2.0, 2.0, 2.0, 6.0, 2.0, 4.0, 44.0, 76.0, 6.0, 2.0, 2.0, 2.0, 18.0, 2.0, 18.0, 30.0, 32.0, 2.0, 12.0, 2.0, 170.0, 0.0, 4.0, 2.0, 0.0, 24.0, 0.0, 2.0, 10.0, 2.0, 12.0, 2.0, 36.0, 0.0, 2.0, 0.0, 0.0, 0.0, 12.0, 22.0, 2.0, 0.0, 272.0, 2.0, 4.0, 2.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 4.0, 0.0, 2.0, 2.0, 2.0, 0.0, 0.0, 8.0, 2.0, 0.0, 2.0, 2.0, 6.0, 0.0, 0.0, 0.0, 34.0, 2.0, 0.0, 2.0, 0.0, 2.0, 92.0, 2.0, 0.0, 2.0, 2.0, 40.0, 2.0, 0.0, 2.0, 2.0, 0.0, 14.0, 2.0, 4.0, 2.0, 2.0, 2.0, 0.0, 18.0, 2.0, 28.0, 4.0, 0.0, 2.0, 2.0, 6.0, 214.0, 26226.0, 13813.0, 27626.0)
intermediate = (0.0, 2.0, 2.0, 4.0, 8.0, 2.0, 2.0, 2.0, 2.0, 6.0, 2.0, 4.0, 44.0, 76.0, 6.0, 2.0, 2.0, 2.0, 18.0, 2.0, 18.0, 30.0, 32.0, 2.0, 12.0, 2.0, 170.0, 0.0, 4.0, 2.0, 0.0, 24.0, 0.0, 2.0, 10.0, 2.0, 12.0, 2.0, 36.0, 0.0, 2.0, 0.0, 0.0, 0.0, 12.0, 22.0, 2.0, 0.0, 272.0, 2.0, 4.0, 2.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 4.0, 0.0, 2.0, 2.0, 2.0, 0.0, 0.0, 8.0, 2.0, 0.0, 2.0, 2.0, 6.0, 0.0, 0.0, 0.0, 34.0, 2.0, 0.0, 2.0, 0.0, 2.0, 92.0, 2.0, 0.0, 2.0, 2.0, 40.0, 2.0, 0.0, 2.0, 2.0, 0.0, 14.0, 2.0, 4.0, 2.0, 2.0, 2.0, 0.0, 18.0, 2.0, 28.0, 4.0, 0.0, 2.0, 2.0, 6.0, 214.0, 26226.0, 13813.0, 27626.0)
after = (0.0, 2.0, 2.0, 4.0, 8.0, 2.0, 2.0, 2.0, 2.0, 6.0, 2.0, 4.0, 44.0, 76.0, 6.0, 2.0, 2.0, 2.0, 18.0, 2.0, 18.0, 30.0, 32.0, 2.0, 12.0, 2.0, 170.0, 0.0, 4.0, 2.0, 0.0, 24.0, 0.0, 2.0, 10.0, 2.0, 12.0, 2.0, 36.0, 0.0, 2.0, 0.0, 0.0, 0.0, 12.0, 22.0, 2.0, 0.0, 272.0, 2.0, 4.0, 2.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 4.0, 0.0, 2.0, 2.0, 2.0, 0.0, 0.0, 8.0, 2.0, 0.0, 2.0, 2.0, 6.0, 0.0, 0.0, 0.0, 34.0, 2.0, 0.0, 2.0, 0.0, 2.0, 92.0, 2.0, 0.0, 2.0, 2.0, 40.0, 2.0, 0.0, 2.0, 2.0, 0.0, 14.0, 2.0, 4.0, 2.0, 2.0, 2.0, 0.0, 18.0, 2.0, 28.0, 4.0, 0.0, 2.0, 2.0, 6.0, 214.0, 26226.0, 13813.0, 27626.0)
x_locations = numpy.arange(len(bx))
width = 0.27
fig = plt.figure(figsize=(50, 20))
ax = fig.add_subplot(111)
before_test_mempools_bar = ax.bar(x_locations, list(before), width, color='r')
intermediate_test_mempools_bar = ax.bar(x_locations + width, list(intermediate), width, color='g')
after_test_mempools_bar = ax.bar(x_locations + width * 2, list(after), width, color='b')
ax.set_ylabel('Memory')
ax.set_xticks(x_locations + width)
ax.set_xticklabels(bx, rotation=90)
ax.legend((before_test_mempools_bar[0], intermediate_test_mempools_bar[0], after_test_mempools_bar[0]),
          ('BEFORE', 'INTERMEDIATE', 'AFTER'))
fig.savefig("plot.png")
plt.close()
The above code produces the following plot:
Goal:
My goal is to accommodate all the data in the plot that is visually nice and so the plot can be analyzed by any tester in the team.
Currently, it's hard to see what's happened with a smaller range of values.
One possible approach would be normalization, but I'm not sure the original data would be preserved.
Any possible solutions are appreciated.
Transcribing @Alexander Reynold's comment into an answer:
Use a logarithmic y-axis, i.e. instead of plot() use semilogy(). You can change the base depending on the dynamic range you need to display.
I didn't know that the bar function already has a parameter to change the scale of the y-axis.
After adding the log=True argument to all the bar calls, as below,
before_test_mempools_bar = ax.bar(x_locations, list(before), width, color='r', log=True)
intermediate_test_mempools_bar = ax.bar(x_locations + width, list(intermediate), width, color='g', log=True)
after_test_mempools_bar = ax.bar(x_locations + width * 2, list(after), width, color='b', log=True)
my plot looks much nicer now and is easy to analyze.
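An equivalent alternative, for the record, is to set the scale on the axis itself instead of passing log=True to each bar call; matplotlib's 'symlog' scale can be useful here because a pure log scale cannot represent the 0.0 counts present in this data:
# Instead of log=True on every bar call:
ax.set_yscale('log')     # zero-height bars simply disappear
# or, to keep the zeros visible:
ax.set_yscale('symlog')  # linear near zero, logarithmic further out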
If I may, I think your problem is not technical: it's that you haven't decided what you want to show and what you want people to look at, because the graphic you're showing seems to have a lot of "noise", i.e. areas that give little or even no information.
So, even though you only provided simulated data, there seems to be some room for improvement towards a more readable and "to the point" visualization.
For example you could:
remove uninteresting information (maybe the categories at 0.0, or those that haven't evolved?)
regroup some categories (what about creating new aggregated categories? or showing the data in a totally different way, with values on the x-axis and category names on the y-axis?)
Also, maybe you're putting together different kinds of things (shouldn't those last 3 bx categories ('smmpg_twomb', 'total-pages' and 'total-size') be put in a graph of their own?)
Use a data structure like pandas' DataFrame to better handle and clean your data, in order to do all three of the previous suggestions.
It's just a few suggestions, but maybe it will help.
Here is an example of what you could do, just to illustrate:
import matplotlib
matplotlib.use('Agg')
import sys, os, argparse, json, string, numpy
import pandas as pd  # needed for pd.DataFrame below
from datetime import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
bx = ('smmpg_b1024k', 'smmpg_b10k', 'smmpg_b11k', 'smmpg_b128', 'smmpg_b128k', 'smmpg_b12k', 'smmpg_b13k',
'smmpg_b14k', 'smmpg_b15k', 'smmpg_b160', 'smmpg_b16k', 'smmpg_b17k', 'smmpg_b18k', 'smmpg_b192',
'smmpg_b192k', 'smmpg_b19k', 'smmpg_b1k', 'smmpg_b20k', 'smmpg_b21k', 'smmpg_b224', 'smmpg_b22k',
'smmpg_b23k', 'smmpg_b24k', 'smmpg_b256', 'smmpg_b256k', 'smmpg_b25k', 'smmpg_b26k', 'smmpg_b27k',
'smmpg_b288', 'smmpg_b28k', 'smmpg_b29k', 'smmpg_b2k', 'smmpg_b30k', 'smmpg_b31k', 'smmpg_b32',
'smmpg_b320', 'smmpg_b320k', 'smmpg_b32k', 'smmpg_b33k', 'smmpg_b34k', 'smmpg_b352', 'smmpg_b35k',
'smmpg_b36k', 'smmpg_b37k', 'smmpg_b384', 'smmpg_b384k', 'smmpg_b38k', 'smmpg_b39k', 'smmpg_b3k',
'smmpg_b40k', 'smmpg_b416', 'smmpg_b41k', 'smmpg_b42k', 'smmpg_b43k', 'smmpg_b448', 'smmpg_b448k',
'smmpg_b44k', 'smmpg_b45k', 'smmpg_b46k', 'smmpg_b47k', 'smmpg_b480', 'smmpg_b48k', 'smmpg_b49k',
'smmpg_b4k', 'smmpg_b50k', 'smmpg_b512', 'smmpg_b512k', 'smmpg_b51k', 'smmpg_b52k', 'smmpg_b53k',
'smmpg_b544', 'smmpg_b54k', 'smmpg_b55k', 'smmpg_b56k', 'smmpg_b576', 'smmpg_b576k', 'smmpg_b57k',
'smmpg_b58k', 'smmpg_b59k', 'smmpg_b5k', 'smmpg_b608', 'smmpg_b60k', 'smmpg_b61k', 'smmpg_b62k',
'smmpg_b63k', 'smmpg_b64', 'smmpg_b640', 'smmpg_b640k', 'smmpg_b64k', 'smmpg_b672', 'smmpg_b6k',
'smmpg_b704', 'smmpg_b704k', 'smmpg_b736', 'smmpg_b768', 'smmpg_b768k', 'smmpg_b7k', 'smmpg_b800',
'smmpg_b832', 'smmpg_b832k', 'smmpg_b864', 'smmpg_b896', 'smmpg_b896k', 'smmpg_b8k', 'smmpg_b928',
'smmpg_b96', 'smmpg_b960', 'smmpg_b960k', 'smmpg_b992', 'smmpg_b9k', 'smmpg_ccb', 'smmpg_msb',
'smmpg_twomb', 'total-pages', 'total-size')
before = (0.0, 2.0, 2.0, 4.0, 8.0, 2.0, 2.0, 2.0, 2.0, 6.0, 2.0, 4.0, 44.0, 76.0, 6.0, 2.0, 2.0, 2.0, 18.0, 2.0, 18.0, 30.0, 32.0, 2.0, 12.0, 2.0, 170.0, 0.0, 4.0, 2.0, 0.0, 24.0, 0.0, 2.0, 10.0, 2.0, 12.0, 2.0, 36.0, 0.0, 2.0, 0.0, 0.0, 0.0, 12.0, 22.0, 2.0, 0.0, 272.0, 2.0, 4.0, 2.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 4.0, 0.0, 2.0, 2.0, 2.0, 0.0, 0.0, 8.0, 2.0, 0.0, 2.0, 2.0, 6.0, 0.0, 0.0, 0.0, 34.0, 2.0, 0.0, 2.0, 0.0, 2.0, 92.0, 2.0, 0.0, 2.0, 2.0, 40.0, 2.0, 0.0, 2.0, 2.0, 0.0, 14.0, 2.0, 4.0, 2.0, 2.0, 2.0, 0.0, 18.0, 2.0, 28.0, 4.0, 0.0, 2.0, 2.0, 6.0, 214.0, 26226.0, 13813.0, 27626.0)
intermediate = (0.0, 2.0, 2.0, 4.0, 8.0, 2.0, 2.0, 2.0, 2.0, 6.0, 2.0, 4.0, 44.0, 76.0, 6.0, 2.0, 2.0, 2.0, 18.0, 2.0, 18.0, 30.0, 32.0, 2.0, 12.0, 2.0, 170.0, 0.0, 4.0, 2.0, 0.0, 24.0, 0.0, 2.0, 10.0, 2.0, 12.0, 2.0, 36.0, 0.0, 2.0, 0.0, 0.0, 0.0, 12.0, 22.0, 2.0, 0.0, 272.0, 2.0, 4.0, 2.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 4.0, 0.0, 2.0, 2.0, 2.0, 0.0, 0.0, 8.0, 2.0, 0.0, 2.0, 2.0, 6.0, 0.0, 0.0, 0.0, 34.0, 2.0, 0.0, 2.0, 0.0, 2.0, 92.0, 2.0, 0.0, 2.0, 2.0, 40.0, 2.0, 0.0, 2.0, 2.0, 0.0, 14.0, 2.0, 4.0, 2.0, 2.0, 2.0, 0.0, 18.0, 2.0, 28.0, 4.0, 0.0, 2.0, 2.0, 6.0, 214.0, 26226.0, 13813.0, 27626.0)
after = (0.0, 2.0, 2.0, 4.0, 8.0, 2.0, 2.0, 2.0, 2.0, 6.0, 2.0, 4.0, 44.0, 76.0, 6.0, 2.0, 2.0, 2.0, 18.0, 2.0, 18.0, 30.0, 32.0, 2.0, 12.0, 2.0, 170.0, 0.0, 4.0, 2.0, 0.0, 24.0, 0.0, 2.0, 10.0, 2.0, 12.0, 2.0, 36.0, 0.0, 2.0, 0.0, 0.0, 0.0, 12.0, 22.0, 2.0, 0.0, 272.0, 2.0, 4.0, 2.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 4.0, 0.0, 2.0, 2.0, 2.0, 0.0, 0.0, 8.0, 2.0, 0.0, 2.0, 2.0, 6.0, 0.0, 0.0, 0.0, 34.0, 2.0, 0.0, 2.0, 0.0, 2.0, 92.0, 2.0, 0.0, 2.0, 2.0, 40.0, 2.0, 0.0, 2.0, 2.0, 0.0, 14.0, 2.0, 4.0, 2.0, 2.0, 2.0, 0.0, 18.0, 2.0, 28.0, 4.0, 0.0, 2.0, 2.0, 6.0, 214.0, 26226.0, 13813.0, 27626.0)
# Put your data in a DataFrame:
df = pd.DataFrame({'before': before,
                   'intermediate': intermediate,
                   'after': after,
                   'bx': bx,
                   'x_locations': numpy.arange(len(bx))})
#filter columns - you can put them in another graph!
df_filt_cat = df.loc[(df.bx != 'smmpg_twomb') & (df.bx != 'total-pages') & (df.bx != 'total-size')]
# filter categories that stay 0 all the way
df_filt_zero = df_filt_cat.loc[(df_filt_cat.before != 0) & (df_filt_cat.intermediate != 0) & (df_filt_cat.after != 0)]
x_locations = numpy.arange(len(bx))
width = 0.27
fig = plt.figure(figsize=(50, 20))
ax = fig.add_subplot(111)
before_test_mempools_bar = ax.bar(df_filt_zero.x_locations, df_filt_zero.before, width, color='r')
intermediate_test_mempools_bar = ax.bar(df_filt_zero.x_locations + width, df_filt_zero.intermediate, width, color='g')
after_test_mempools_bar = ax.bar(df_filt_zero.x_locations + width * 2, df_filt_zero.after, width, color='b')
ax.set_ylabel('Memory')
# use the filtered positions and labels so the ticks line up with the bars
ax.set_xticks(df_filt_zero.x_locations + width)
ax.set_xticklabels(df_filt_zero.bx, rotation=90)
ax.legend((before_test_mempools_bar[0], intermediate_test_mempools_bar[0], after_test_mempools_bar[0]),
          ('BEFORE', 'INTERMEDIATE', 'AFTER'))
# just to show the result I commented this line
#fig.savefig("plot.png")
# and put this one instead:
plt.show()
It obviously still needs improvement but it's already a bit more readable.

Tensorflow reuse neural network

I'm new to tensorflow and I've been training a simple neural network, but once it's trained, I don't know how to reuse the NN to get the output for a given input.
def train_neural_network(x, y, aDataTrain, aTargetTrain, aDataTest, aTargetTest):
    batch_size = 500
    prediction = neural_network_model(x, len(aDataTrain[0]))
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    hm_epochs = 1
    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(hm_epochs):
            epoch_loss = 0
            i = 0
            while i < len(aDataTrain):
                start = i
                end = i + batch_size
                batch_x = np.array(aDataTrain[start:end])
                batch_y = np.array(aTargetTrain[start:end])
                _, c = sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
                epoch_loss += c
                i += batch_size
            print("Epoch", epoch, "completed out of", hm_epochs, "loss", epoch_loss)
        correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
        accurracy = tf.reduce_mean(tf.cast(correct, 'float'))
        finalAcc = accurracy.eval({x: aDataTest, y: aTargetTest})
        saver.save(sess, 'model/model.ckpt')
        print("Accuracy:", finalAcc)
So, once I've saved the model and restored it, I don't know how to continue to get the output of the NN for input_data.
def execute_neural_network(x, y, aDataTrain, aTargetTrain, aDataTest, aTargetTest):
    batch_size = 1
    y_pred = []
    prediction = neural_network_model(x, len(aDataTrain[0]))
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    input_data = [5.0, 3.0, 1.0, 5.0, 6.0, 5.0, 2.0, 4.0, 7.0, 6.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 3.0, 2.0, 3.0, 2.0, 3.0, 3.0, 4.0, 3.0, 3.0, 2.0, 4.0, 3.0, 3.0, 2.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 61.0, 21.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 75.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 6.0, 35.0, 11.0, 10.0, 33.0, 24.0, 6.0, 2.0, 2.0, 3.0, 4.0, 3.0, 3.0, 8.0, 6.0, 5.0, 6.0, 5.0, 8.0, 9.0, 13.0, 7.0, 25.0, 11.0, 2.0, 2.0, 2.0, 2.0, 2.0]
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, 'model/model.ckpt')
        # Get neural network output from input_data
Assuming you create your graph/network's model somehow like this:
with tf.Session() as sess:
    # do other stuff
    predictionOp = tf.argmax(py_x, 1)
    saver.save(sess, 'model')
where predictionOp is the variable holding the output of your network.
You can add something like this afterwards: tf.add_to_collection("predictionOp", predictionOp)
to give predictionOp a name, making it easier to find. Then, you can reload your model and get the predictions with:
with tf.Session() as sess:
    new_saver = tf.train.import_meta_graph('model.meta')
    new_saver.restore(sess, 'model')
    predictionOp = tf.get_collection("predictionOp")[0]
    # get the prediction
    prediction = sess.run(predictionOp, feed_dict={"x:0": input_data})
For more information, please take a look at the tensorflow documentation and here for more on the basics. Also, there are some other threads that deal with similar problems, like this and this one.
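Putting that together with the code in the question, a minimal sketch could look like the following. It assumes, and these are assumptions not shown in the question, that the placeholder was created as tf.placeholder(..., name='x') and that tf.add_to_collection("predictionOp", predictionOp) was called before saver.save(sess, 'model/model.ckpt'):
import tensorflow as tf

with tf.Session() as sess:
    # import_meta_graph rebuilds the graph from the .meta file written by Saver.save()
    new_saver = tf.train.import_meta_graph('model/model.ckpt.meta')
    new_saver.restore(sess, 'model/model.ckpt')
    predictionOp = tf.get_collection("predictionOp")[0]
    # feed input_data (from the question) as a batch of one example
    output = sess.run(predictionOp, feed_dict={"x:0": [input_data]})
    print(output)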

import from file data error in Python

I'm having some issues importing data from a file in Python. I am quite new to Python, so my error is probably quite simple.
I am reading in 3 column, tab-delimited text files with no headers. I am creating 3 instances of the data file using three different datafiles.
I can see that each object is referencing a different memory location, so they are separate.
When I look at the data stored in each instance, each instance has the same contents, consisting of the three datafiles appended to each other.
What have I done wrong?
The class to read in the data is:
class Minimal:
    def __init__(self, data=[]):
        self.data = data

    def readFile(self, filename):
        f = open(filename, 'r')
        for line in f:
            line = line.strip()
            columns = line.split()
            # creates a list of angle, intensity and error and appends it to the diffraction pattern
            self.data.append([float(columns[0]), float(columns[1]), float(columns[2])])
        f.close()

    def printData(self):
        for dataPoint in self.data:
            print str(dataPoint)
The datafiles look like:
1 4 2
2 5 2.3
3 4 2
4 6 2.5
5 8 5
6 10 3
The program I am using to actually create the instances of Minimal is:
from minimal import Minimal
d1 = Minimal()
d1.readFile("data1.xye")
d2 = Minimal()
d2.readFile("data2.xye")
d3 = Minimal()
d3.readFile("data3.xye")
print "Data1"
print d1
d1.printData()
print "\nData2"
print d2
d2.printData()
print "\nData3"
print d3
d3.printData()
The output is:
Data1
<minimal.Minimal instance at 0x016A35F8>
[1.0, 4.0, 2.0]
[2.0, 5.0, 2.3]
[3.0, 4.0, 2.0]
[4.0, 6.0, 2.5]
[5.0, 8.0, 5.0]
[6.0, 10.0, 3.0]
[2.0, 4.0, 2.0]
[3.0, 5.0, 2.3]
[4.0, 4.0, 2.0]
[5.0, 6.0, 2.5]
[6.0, 8.0, 5.0]
[7.0, 10.0, 3.0]
[3.0, 4.0, 2.0]
[4.0, 5.0, 2.3]
[5.0, 4.0, 2.0]
[6.0, 6.0, 2.5]
[7.0, 8.0, 5.0]
[8.0, 10.0, 3.0]
Data2
<minimal.Minimal instance at 0x016A3620>
[1.0, 4.0, 2.0]
[2.0, 5.0, 2.3]
[3.0, 4.0, 2.0]
[4.0, 6.0, 2.5]
[5.0, 8.0, 5.0]
[6.0, 10.0, 3.0]
[2.0, 4.0, 2.0]
[3.0, 5.0, 2.3]
[4.0, 4.0, 2.0]
[5.0, 6.0, 2.5]
[6.0, 8.0, 5.0]
[7.0, 10.0, 3.0]
[3.0, 4.0, 2.0]
[4.0, 5.0, 2.3]
[5.0, 4.0, 2.0]
[6.0, 6.0, 2.5]
[7.0, 8.0, 5.0]
[8.0, 10.0, 3.0]
Data3
<minimal.Minimal instance at 0x016A3648>
[1.0, 4.0, 2.0]
[2.0, 5.0, 2.3]
[3.0, 4.0, 2.0]
[4.0, 6.0, 2.5]
[5.0, 8.0, 5.0]
[6.0, 10.0, 3.0]
[2.0, 4.0, 2.0]
[3.0, 5.0, 2.3]
[4.0, 4.0, 2.0]
[5.0, 6.0, 2.5]
[6.0, 8.0, 5.0]
[7.0, 10.0, 3.0]
[3.0, 4.0, 2.0]
[4.0, 5.0, 2.3]
[5.0, 4.0, 2.0]
[6.0, 6.0, 2.5]
[7.0, 8.0, 5.0]
[8.0, 10.0, 3.0]
The default value for data is evaluated only once, so the data attributes of all Minimal instances reference the same list.
>>> class Minimal:
...     def __init__(self, data=[]):
...         self.data = data
...
>>> a1 = Minimal()
>>> a2 = Minimal()
>>> a1.data is a2.data
True
Replace it as follows:
>>> class Minimal:
...     def __init__(self, data=None):
...         self.data = data or []
...
>>> a1 = Minimal()
>>> a2 = Minimal()
>>> a1.data is a2.data
False
See “Least Astonishment” in Python: The Mutable Default Argument.
Consider the following:
def d():
    print("d() invoked")
    return 1

def f(p=d()):
    pass

print("Start")
f()
f()
It prints
d() invoked
Start
Not
Start
d() invoked
d() invoked
Why? Because default arguments are computed once, when the function is defined (and stored for reuse on every subsequent call); they are not computed on each function invocation.
In other words, the definition behaves more or less like:
_f_p_default = d()

def f(p=None):
    if p is None:
        p = _f_p_default
    pass
Make the above substitution in your code, and you will understand the problem immediately.
The correct form for your code was already provided by @falsetru. I'm just trying to explain the rationale.
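Putting both answers together, a corrected sketch of the original class (same behaviour, but with a per-instance list; the is-None test also preserves an explicitly passed empty list, unlike data or []):
class Minimal:
    def __init__(self, data=None):
        # a fresh list per instance; a mutable default [] would be shared
        self.data = data if data is not None else []

    def readFile(self, filename):
        f = open(filename, 'r')
        for line in f:
            columns = line.strip().split()
            # angle, intensity, error
            self.data.append([float(columns[0]), float(columns[1]), float(columns[2])])
        f.close()

    def printData(self):
        for dataPoint in self.data:
            print(dataPoint)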
