Python: How to plot a conditional cumulative frequency histogram?

I have a list of data that I would like to plot as a histogram. However, the graph is not very readable for large values on the X axis, and those values are not important enough to keep.
Here is a sub sample of my data:
print(v)
1 1738  # the values I want to plot on the histogram
2 2200
3 1338
4 1222
5 939
6 898
I calculated the cumulative frequency as follows:
v = x.cumsum()
t = [round(100*v/x.sum(),2)]
t
and the output is:
1 9.90
2 22.44
3 30.06
4 37.02
5 42.37
How can I represent on the histogram only the data for which the cumulative frequency is less than or equal to 40%?
I don't know how to do this in Python; thanks in advance for your help.

The short answer is: Slice the numpy array to filter values <= 40%. For example, if a is a 1D numpy array:
a[a <= 40]
A longer answer is provided by the example below, which shows:
Generation of normally distributed random data (as the provided dataset is very small)
Performing your calculation on the numpy array
Slicing the array to return values which are <= 40%
Plotting the results using the Plotly library (low-level API only)
Example code:
import numpy as np
import plotly.io as pio
# Generate random dataset (for demo only).
np.random.seed(1)
X = np.random.normal(0, 1, 10000)
# Calculate the cumulative frequency.
X_ = np.cumsum(X)*100/X.sum()
data = X_[X_ <= 40]
# Plot the histogram.
pio.show({'data': {'x': data,
                   'type': 'histogram',
                   'marker': {'line': {'width': 0.5}}},
          'layout': {'title': 'Cumulative Frequency Demo'}})
Output:
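If you prefer matplotlib over Plotly, a minimal sketch of the same idea (not part of the original answer; it just regenerates the demo data and re-applies the filter) could look like this:
import numpy as np
import matplotlib.pyplot as plt

# Same demo data and filtering as in the Plotly example above.
np.random.seed(1)
X = np.random.normal(0, 1, 10000)
X_ = np.cumsum(X) * 100 / X.sum()
data = X_[X_ <= 40]

# Histogram of the filtered values.
plt.hist(data, bins=50, edgecolor='black', linewidth=0.5)
plt.title('Cumulative Frequency Demo')
plt.show()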

Related

How do I format a y-axis 'y' in matplotlib going between pandas dataframes and simple variables?

CSV1only is a dataframe loaded from a CSV file.
Suppose CSV1only has a single column such that:
TRADINGITEMID:
1233
2455
3123
1235
5098
as a small example
How can I plot a scatterplot accordingly, specifically the y-axis?
I tried:
import pandas as pd
import matplotlib.pyplot as plt
CSV1only.plot(kind='scatter',x='TRADINGITEMID', y= [1,2], color='b')
plt.xlabel('TRADINGITEMID Numbers')
plt.ylabel('Range')
plt.title('Distribution of ItemIDNumbers')
and it doesn't work because of the y argument.
So, my main question is just how I can get a 0, 1, 2 y-axis for this scatter plot, as I want to make a distribution graph.
The following code doesn't work because it doesn't match the amount of rows included in the original TRADINGITEMID column, which has 5000 rows:
newcolumn_values = [1, 2]
CSV1only['un et deux'] = newcolumn_values
#and then I changed the y = [1,2] from before into y = ['un et deux']
Therefore, the solution would need to work for any integer from 1 to N, N being the number of rows. Yet the y-axis would only have a range of [0, 2] or some [0, m], m being an arbitrary integer.
You don't need to worry about the actual pandas dataframe CSV1only.
The 'TRADINGITEMIDNUMBERS' column contains 5000 rows of unique numbers, so I just want to plot those numbers on a line, with the y-axis being instances (which will never pass 1 since the values are unique).
I think you are looking for the following: you need to generate y-values starting from 0 up to n-1, where n is the total number of rows:
y = np.arange(len(CSV1only['TRADINGITEMID']))
plt.scatter(CSV1only['TRADINGITEMID'], y, c='DarkBlue')
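For a self-contained illustration (the real CSV1only dataframe is not shown in the question, so the IDs below are made up), a minimal sketch could be:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for CSV1only: 5000 unique trading item IDs.
CSV1only = pd.DataFrame({'TRADINGITEMID': np.random.permutation(5000) + 1000})

# One y-value per row, from 0 to n-1.
y = np.arange(len(CSV1only['TRADINGITEMID']))

plt.scatter(CSV1only['TRADINGITEMID'], y, c='DarkBlue')
plt.xlabel('TRADINGITEMID Numbers')
plt.ylabel('Row index')
plt.title('Distribution of ItemIDNumbers')
plt.show()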

Compact way of visualizing heat maps of correlated data

I am trying to visualize the correlation of the Result column with every other column.
A_B A_C B_C Result
0 0.318182 0.925311 0.860465 91
1 -0.384030 0.991803 0.996344 12
2 -0.818182 0.411765 0.920000 53
3 0.444444 0.978261 0.944444 64
A_B = (A-B)/(A+B), and correspondingly for all the other column pairs.
This works for a small number of columns, but if I increase the number of columns, the number of rows in the heatmap keeps stacking up. Is there any compact way to represent it?
The following code will reproduce the output:
import pandas as pd
import seaborn as sns

data = {'A': [232, 243, 12, 546, 67, 12, 78, 11, 245],
        'B': [120, 546, 120, 210, 56, 120, 56, 89, 12],
        'C': [9, 1, 5, 6, 7, 43, 7, 12, 64],
        'Result': [91, 12, 53, 64, 71, 436, 74, 123, 641],
        }
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'Result'])

# Responsible for (A-B)/(A+B), (A-C)/(A+C) and similarly
colnames = df.columns.tolist()[:-1]
for i, c in enumerate(colnames):
    if i != len(colnames):
        for k in range(i + 1, len(colnames)):
            df[c + '_' + colnames[k]] = (df[c] - df[colnames[k]]) / (df[c] + df[colnames[k]])

newdf = df[['A_B', 'A_C', 'B_C', 'Result']].copy()

# Plotting A_B, A_C, B_C by ignoring the correlation of Result with itself
plot = pd.DataFrame(newdf.corr().iloc[:-1, -1])
sns.heatmap(plot, annot=True)
A technique which I heard of, but for which I am unable to find any source, is representing each correlation factor in mini-rectangles, like this:
So according to it, considering the given map as a 3*3 matrix with (0,0) starting from the bottom-left, A_B would be represented at (1,1),
A_C at (2,1) and B_C at (2,2).
But I am not getting how to do it.
You can plot the correlation of each column against the Result column, and against the other columns as well. Below is one way to do so. Providing the x- and y-tick labels makes it easier to compare the correlations. You can also annotate the correlation values so that they are displayed on the heat map.
cor = newdf.corr()
sns.heatmap(cor, xticklabels=cor.columns.values,
            yticklabels=cor.columns.values, annot=True)
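If the goal is specifically a compact view of how every column correlates with Result, one option (a sketch, not from the original answer) is to keep only that single row of the correlation matrix and plot it as a one-row heatmap, which stays readable as the number of columns grows:
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation of every ratio column with Result, kept as a single row.
cor_with_result = newdf.corr()[['Result']].drop('Result').T  # shape: 1 x number of columns

sns.heatmap(cor_with_result, annot=True, vmin=-1, vmax=1, cmap='coolwarm')
plt.show()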

How to create np array random data on age vs time?

My aim is to create a scatter plot representing random data on age vs. time spent watching TV.
from pylab import randn
import matplotlib.pyplot as plt

X = randn(500)
Y = randn(500)
plt.scatter(X, Y)
plt.show()
I want age between 18 and 50 and time between 0 and 24 hours.
You can try:
import random
import numpy as np
age=np.array(random.sample(list(range(18,51)),10))
time=np.array(random.sample(list(range(0,24)),10))
random.sample takes a list of elements as its first argument and the number of samples you want as the second argument.
That gives:
age : [47 45 37 19 23 34 39 24 32 42]
time : [18 12 13 1 15 21 23 22 3 17]
On plotting it:
import matplotlib.pyplot as plt
plt.scatter(age, time)
plt.show()
To recreate the same random numbers every time you run it, you can use random.seed()
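For example (the seed value here is arbitrary):
import random
random.seed(42)  # call once before random.sample so the same values are drawn on every run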
It's super easy with numpy. You can use the numpy library to do this:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

age = np.random.randint(18, 50, 20)   # 20 random ages (note: the upper bound 50 is exclusive)
time = np.random.randint(0, 24, 20)   # 20 random hour values in [0, 24)
plt.scatter(age, time)
plt.show()
Column-wise multiplication in numpy
You can easily create custom-sized random arrays with numpy with the commands numpy.random.rand(d0, d1, …, dn) for uniform distributions or numpy.random.randn(d0, d1, …, dn) for normal distributions, where dn is the number of samples in the nth dimension. In your case you'll have d0=500 and d1=2.
However, the values will be sampled from the interval [0, 1) with numpy.random.rand(d0, d1, …, dn), or from the standard normal distribution (i.e. mean = 0 and variance = 1) with numpy.random.randn(d0, d1, …, dn).
A nice workaround for this is to scale and shift the arrays column-wise to move the distributions to the desired values. To multiply an array arr column-wise by a vector vec you can use this small snippet of code: arr.dot(np.diag(vec)). Be careful: vec should have as many elements as arr has columns.
This snippet works by turning vec into a diagonal matrix (i.e. a matrix where everything is zero except the main diagonal) and then multiplying arr by that diagonal matrix.
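As a quick illustration of that snippet (the array and vector values here are arbitrary):
import numpy as np

arr = np.array([[1.0, 2.0],
                [3.0, 4.0]])
vec = np.array([10.0, 100.0])   # one scale factor per column

# Each column of arr is multiplied by the matching entry of vec.
print(arr.dot(np.diag(vec)))
# [[ 10. 200.]
#  [ 30. 400.]]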
For uniform distributions
Remember that to turn a sample x from a uniform distribution on [0, 1) into one on [min, max), you do new_x = (max - min) * x + min. So if you want a uniform distribution and you know the max and min limits for both variables, you can use the following code:
import numpy as np
n_samples = 500
max_age, min_age = 80, 10
max_hours, min_hours = 10, 0
array = np.random.rand(n_samples, 2) #returns samples from the uniform distribution
range_vector = np.array([max_age - min_age, max_hours - min_hours])
min_vector = np.array([min_age, min_hours])
sample = array.dot(np.diag(range_vector)) + np.ones(array.shape).dot(np.diag(min_vector))
Normal distributions
If you want a normal distribution and you know the mean and variance of both columns, use the following code. Remember that to shift a sample x from a standard normal distribution to a distribution with a different mean and standard deviation, you do new_x = deviation * x + mean.
import numpy as np
n_samples = 500
mean_age, deviation_age = 40, 20
mean_hours, deviation_hours = 5, 2
array = np.random.randn(n_samples, 2)  # returns samples from the standard normal distribution
deviation_vector = np.array([deviation_age, deviation_hours])
mean_vector = np.array([mean_age, mean_hours])
sample = array.dot(np.diag(deviation_vector)) + np.ones(array.shape).dot(np.diag(mean_vector))
Be careful, however: with the normal distributions you can end up with negative values.
You can also have a look at all the documentation numpy has on random variables: https://docs.scipy.org/doc/numpy/reference/routines.random.html
Finally, please notice that column-wise multiplication only works when you want both samples to be independent.

Unequal width binned histogram in python

I have an array with probability values stored in it. Some values are 0. I need to plot a histogram such that there is an equal number of elements in each bin. I tried using matplotlib's hist function, but that only lets me decide the number of bins. How do I go about plotting this? (A normal plot and hist work, but they're not what is needed.)
I have 10000 entries. Only 200 have values greater than 0, and those lie between 0.0005 and 0.2. The distribution isn't even: only one element has the value 0.2, whereas approximately 2000 have the value 0.0005. So plotting it was an issue, as the bins had to be of unequal width with an equal number of elements.
The task does not make much sense to me, but the following code does what I understood as the thing to do.
I also think the last lines of the code are what you really wanted to do: using different bin widths to improve the visualization (without targeting an equal number of samples within each bin). I used astroML's hist with bins='blocks' (astropy supports this too).
Code
# Python 3 -> beware the // operator!
import numpy as np
import matplotlib.pyplot as plt
from astroML import plotting as amlp
N_VALUES = 1000
N_BINS = 100
# Create fake data
prob_array = np.random.randn(N_VALUES)
prob_array /= np.max(np.abs(prob_array),axis=0) # scale a bit
# Sort array
prob_array = np.sort(prob_array)
# Calculate bin-borders,
bin_borders = [np.amin(prob_array)] + [prob_array[(N_VALUES // N_BINS) * i] for i in range(1, N_BINS)] + [np.amax(prob_array)]
print('SAMPLES: ', prob_array)
print('BIN-BORDERS: ', bin_borders)
# Plot hist
counts, x, y = plt.hist(prob_array, bins=bin_borders)
plt.xlim(bin_borders[0], bin_borders[-1] + 1e-2)
print('COUNTS: ', counts)
plt.show()
# And this is, what i think, what you really want
fig, (ax1, ax2) = plt.subplots(2)
left_blob = np.random.randn(N_VALUES // 10) + 3  # integer division so the array size is an int
right_blob = np.random.randn(N_VALUES) + 110
both = np.hstack((left_blob, right_blob)) # data is hard to visualize with equal bin-widths
ax1.hist(both)
amlp.hist(both, bins='blocks', ax=ax2)
plt.show()
Output
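As a side note (not part of the original answer), the same equal-count bin borders can also be computed directly with numpy's quantile function, which may read a little more clearly:
import numpy as np

# Hypothetical example: 1000 samples, 100 bins with roughly equal counts.
prob_array = np.random.randn(1000)
bin_borders = np.quantile(prob_array, np.linspace(0, 1, 100 + 1))
# bin_borders can then be passed to plt.hist(prob_array, bins=bin_borders) as above.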

Applying pandas qcut bins to new data

I am using pandas qcut to split some data into 20 bins as part of data prep for training of a binary classification model like so:
data['VAR_BIN'] = pd.qcut(cc_data[var], 20, labels=False)
My question is, how can I apply the same binning logic derived from the qcut statement above to a new set of data, say for model validation purposes. Is there an easy way to do this?
Thanks
You can do it by passing retbins=True.
Consider the following DataFrame:
import pandas as pd
import numpy as np
prng = np.random.RandomState(0)
df = pd.DataFrame(prng.randn(100, 2), columns = ["A", "B"])
pd.qcut(df["A"], 20, retbins=True, labels=False) returns a tuple whose second element is the bins. So you can do:
ser, bins = pd.qcut(df["A"], 20, retbins=True, labels=False)
ser is the categorical series and bins are the break points. Now you can pass bins to pd.cut to apply the same grouping to the other column:
pd.cut(df["B"], bins=bins, labels=False, include_lowest=True)
Out[38]:
0 13
1 19
2 3
3 9
4 13
5 17
...
User @Karen said:
By using this logic, I am getting NaN values in my validation set. Is there some way to solve it?
If this is happening to you, it most likely means that the validation set has values below (or above) the smallest (or greatest) value from the training data. Therefore, some values will fall out of range and will therefore not be assigned a bin.
You can solve this problem by extending the range of the training data:
# Make smallest value arbitrarily smaller
train.loc[train['value'].eq(train['value'].min()), 'value'] = train['value'].min() - 100
# Make greatest value arbitrarily greater
train.loc[train['value'].eq(train['value'].max()), 'value'] = train['value'].max() + 100
# Make bins from training data
s, b = pd.qcut(train['value'], 20, retbins=True)
# Cut validation data
test['bin'] = pd.cut(test['value'], b)
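An alternative (a sketch, not from the original answer) is to leave the training data untouched and instead clip the validation values into the range spanned by the training bins, so out-of-range values land in the edge bins rather than becoming NaN:
import pandas as pd

# Assumes train['value'] and test['value'] exist as in the snippet above.
s, b = pd.qcut(train['value'], 20, retbins=True, labels=False)

# Clip validation values to the training range, then reuse the same bin edges.
clipped = test['value'].clip(lower=b[0], upper=b[-1])
test['bin'] = pd.cut(clipped, bins=b, labels=False, include_lowest=True)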
