I have to add a Weight column to the Titanic dataset to generate adult passengers' weights from a normal distribution with std = 20 and mean = 70 kg. I have tried this code:
df['Weight'] = np.random.normal(20, 70, size=891)
df['Weight'].fillna(df['Weight'].iloc[0], inplace=True)
but I am concerned about two things:
It generates negative values, not just positive ones; how can a negative value be considered a normal weight? Is there anything I can change in the code to generate only positive values?
Since I am targeting the adult age group, what about children? Some rows also get abnormal weight values, such as 7 kg for an adult or 30 kg for a child; how can this be solved?
I appreciate any help you can provide.
Edit:
This code worked for me:
Weight = np.random.normal(80, 20, 718)
adults['Weight'] = Weight
Now I have to calculate the probability of a passenger weighing less than 70 kg, and of weighing between 70 and 100 kg.
I have tried the following code, but it raises an error: TypeError: unsupported operand type(s) for -: 'str' and 'int'.
import pandas as pd
import numpy as np
import scipy.stats
adults = df[(df['Age'] >= 20) & (df['Age'] <= 70)]
Weight = np.random.normal(80, 20, 718)
adults['Weight'] = Weight
p1 = adults['Weight'] < 70
p2 = adults[(adults['Weight'] > 70) & (adults['Weight'] < 100)]
scipy.stats.norm.pdf(p1)
scipy.stats.norm.pdf(p2)
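A minimal sketch of one way to get those two probabilities (the TypeError comes from passing a whole DataFrame, including its string columns, into norm.pdf; what is needed is the CDF of the weight distribution, here assumed to be the mean-80, std-20 normal used above, or simply the empirical proportion of rows):
import numpy as np
from scipy.stats import norm
mean, std = 80, 20  # the parameters used to generate adults['Weight'] above
# theoretical probabilities from the normal CDF
p_below_70 = norm.cdf(70, loc=mean, scale=std)
p_between_70_100 = norm.cdf(100, loc=mean, scale=std) - norm.cdf(70, loc=mean, scale=std)
# empirical probabilities from the generated column itself
emp_below_70 = (adults['Weight'] < 70).mean()
emp_between_70_100 = adults['Weight'].between(70, 100).mean()
print(p_below_70, p_between_70_100)
print(emp_below_70, emp_between_70_100)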
The range of a normal distribution is not restricted; it spans all of the real numbers. (Also note that np.random.normal takes (mean, std, size), so your call actually uses mean = 20 and std = 70.) If you want to restrict the range, you have to do it manually or use another distribution, for example by clamping to chosen bounds:
min_value, max_value = 40, 120  # illustrative bounds -- pick ones that make sense for your data
df['Weight'] = np.random.normal(70, 20, size=891)  # mean=70, std=20
df.loc[df['Weight'] < min_value, 'Weight'] = min_value
df.loc[df['Weight'] > max_value, 'Weight'] = max_value
Since the weights of children and adults do not come from the same distribution, you should sample them from different distributions:
# use different distributions
df.loc[df['person_type'] == 'child', 'Weight'] = np.random.normal(x1, y1, size=children_size)
df.loc[df['person_type'] == 'adult', 'Weight'] = np.random.normal(x2, y2, size=adult_size)
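A minimal sketch of how that could look for the Titanic frame, splitting on Age and using purely illustrative means and standard deviations (30 ± 10 kg for children, 70 ± 20 kg for adults):
import numpy as np
# illustrative split and parameters -- adjust to your data
is_child = df['Age'] < 18   # rows with a missing Age end up in the adult group here
is_adult = ~is_child
df.loc[is_child, 'Weight'] = np.random.normal(30, 10, size=is_child.sum())
df.loc[is_adult, 'Weight'] = np.random.normal(70, 20, size=is_adult.sum())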
You can use a truncated normal distribution if you want to avoid negative values. For example, to get a vector with mean 70 and sd 20, you can do:
myclip_a = 0
myclip_b = +np.Inf
my_mean = 70
my_std = 20
a, b = (myclip_a - my_mean) / my_std, (myclip_b - my_mean) / my_std
We set the seed, and you can see that the truncated normal has all values above zero, while the plain normal you have used does not:
from scipy.stats import truncnorm, norm
np.random.seed(100)
x1 = truncnorm.rvs(a=a, b=b, size=50000, loc=70, scale=20)
np.sum(x1<0)
0
x2 = norm.rvs(loc=70,scale=20,size=50000)
np.sum(x2<0)
10
I'm not very sure how you are filling in the NAs; I would need the data frame to address that, but I suspect it's another question altogether.
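To plug the truncated normal back into the frame, a sketch (adults here is the Age-filtered subset from the question's edit, and a, b are the standardized bounds computed above):
adults = df[(df['Age'] >= 20) & (df['Age'] <= 70)].copy()  # .copy() avoids SettingWithCopyWarning
adults['Weight'] = truncnorm.rvs(a=a, b=b, loc=70, scale=20, size=len(adults))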
Related
I am trying to generate random data with Pandas.
The data needs to be stored in two columns. The first column needs to contain categorical variables (Stratum_1 through Stratum_19); each of these strata can contain a random number of values.
The second column needs to have data in the range between 1 and 180000000, with a standard deviation of 453210, a mean of 170000, and 100000 rows.
I tried:
categorical = {'name': ['Stratum_1','Stratum_2','Stratum_3','Stratum_4','Stratum_5','Stratum_6','Stratum_7','Stratum_8','Stratum_9',
'Stratum_10','Stratum_11','Stratum_12','Stratum_13','Stratum_14','Stratum_15','Stratum_16','Stratum_17','Stratum_18','Stratum_19']}
desired_mean = 170000
desired_std_dev = 453210
df = pd.DataFrame(np.random.randint(0,180000000,size=(100000, 1)),columns=list('1'))
I tried the code above, but I don't know how to combine the categorical and numerical values while hitting the desired mean and standard deviation. Can anybody help me solve this problem?
I decided to use the gamma distribution to generate your desired sample, since a normal with that mean and standard deviation would produce a large fraction of values below the required lower bound of 1.
Code
import numpy as np
import pandas as pd
# desired parameters
n_rows = 100000
lower, upper = 1, 180000000
mu, sigma = 170000, 453210
# amount of shift
delta = lower
# parameters for the gamma distribution
shape = ((mu - delta) / sigma) ** 2
scale = sigma**2 / (mu - delta)
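# (moment matching: shape * scale = mu - delta and shape * scale**2 = sigma**2)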
# Create a dataframe
categories = {'name': [f'Stratum_{i}' for i in range(1, 19 + 1)]}
df = pd.DataFrame(categories).sample(n=n_rows, replace=True).reset_index(drop=True)
# Generate samples along with your desired parameters
generator = np.random.default_rng()
while True:
    df['value'] = generator.gamma(shape=shape, scale=scale, size=n_rows) + delta
    if df.value.max() <= upper:
        break
# Show statistics
print(df.describe())
Output
              value
count       100,000
mean        169,403  (Target: 170,000)
std         449,668  (Target: 453,210)
min               1
25%         39.4267
50%         5529.28
75%         105,748
max     9.45114e+06
Try:
import numpy as np
import pandas as pd
categorical = {'name': ['Stratum_1','Stratum_2','Stratum_3','Stratum_4','Stratum_5','Stratum_6','Stratum_7','Stratum_8','Stratum_9',
'Stratum_10','Stratum_11','Stratum_12','Stratum_13','Stratum_14','Stratum_15','Stratum_16','Stratum_17','Stratum_18','Stratum_19']}
desired_mean = 170000
desired_std_dev = 453210
df = pd.DataFrame({'num':np.random.normal(170000, 453210,size=(300000, 1)).reshape(-1), 'cat':np.random.choice(categorical['name'], 300000)})
df[(0<df['num'])&(df['num']<180000000)].sample(100000)
result: (a 100,000-row sample with the stratum names in 'cat' and the bounded normal draws in 'num')
I am preparing a dataset for regression modelling, and I would like to remove all outliers beforehand. The dataset has 7 variables, all continuous. Five of the variables can be handled universally; however, two of them, height and weight, need to be split between male and female participants first. Clearly these two measurements differ between males and females, so to find their outliers I need to split the data by sex, assess and remove the outliers for height and weight within each group, and then merge that data back with the data I have already prepared. Is there a simple way of doing this? So far I have been using the interquartile range on the other five variables, which do not need to be split by sex, with this code for each variable:
Q1 = df["Variable"].quantile(0.25)
Q3 = df["Variable"].quantile(0.75)
IQR = Q3-Q1
Lower_Fence = Q1 - (1.5*IQR)
Upper_Fence = Q3 + (1.5*IQR)
print(Lower_Fence)
print(Upper_Fence)
df[((df["Variable"] < Lower_Fence) | (df["Variable"] > Upper_Fence))] # Detection of outliers
df[~((df["Variable"] < Lower_Fence) | (df["Variable"] > Upper_Fence))]` # Removal of outliers
I am relatively new to python.
You can define a function for your "outlier" logic, then apply that repeatedly for all columns, with or without groupby:
import numpy as np
import pandas as pd

def is_outlier(s, quantiles=[.25, .75], thresholds=[-.5, .5]):
    # change the thresholds to [-1.5, 1.5] to reflect IQR as per your question
    a, b = s.quantile(quantiles)
    iqr = b - a
    lo, hi = np.array(thresholds) * iqr + [a, b]
    return (s < lo) | (s > hi)
Simple test:
n = 20
np.random.seed(0)
df = pd.DataFrame(dict(
    status=np.random.choice(['dead', 'alive'], n),
    gender=np.random.choice(['M', 'F'], n),
    weight=np.random.normal(150, 40, n),
    diastolic=np.random.normal(80, 10, n),
    cholesterol=np.random.normal(200, 20, n),
))
Example usage:
mask = is_outlier(df['diastolic']) # overall outliers
# or
mask = df.groupby('gender')['weight'].apply(is_outlier) # per gender group
Usage to filter out data:
mask = False
# overall outliers
for k in ['diastolic', 'cholesterol']:  # etc
    mask |= is_outlier(df[k])
# per-gender outliers
gb = df.groupby('gender')
for k in ['weight']:  # and any other columns needed for per-gender
    mask |= gb[k].apply(is_outlier)
# finally, select the non-outliers
df_filtered = df.loc[~mask]
BTW, note how per-gender outliers are different than overall, e.g. for 'weight':
df.groupby('gender')['weight'].apply(is_outlier) == is_outlier(df['weight'])
I have an existing distribution of values and I want to draw samples of size 5, but those 5 samples need to have a std of X within some tolerance. For example, I need 5 samples that have a std of 10 (even though the overall distribution has std ≈ 32).
The example code below somewhat works, but it is quite slow for large datasets. It randomly samples the distribution until it finds something close to the target std, then removes those elements so they can't be drawn again.
Is there a smarter way to do this properly and faster? It works OK for some values of target_std (above 6), but it isn't accurate below 6.
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(23)
# Create a distribution
d1 = np.random.normal(95, 5, 200)
d2 = np.random.normal(125, 5, 200)
d3 = np.random.normal(115, 10, 200)
d4 = np.random.normal(70, 10, 100)
d5 = np.random.normal(160, 5, 200)
d6 = np.random.normal(170, 20, 100)
dist = np.concatenate((d1, d2, d3, d4, d5, d6))
print(f"Full distribution: len={len(dist)}, mean={np.mean(dist)}, std={np.std(dist)}")
plt.hist(dist, bins=100)
plt.title("Full Distribution")
plt.show();
batch_size = 5
num_batches = math.ceil(len(dist)/batch_size)
target_std = 10
tolerance = 1
# how many samples to search
num_samples = 100
result = []
# Find samples of batch_size that are closest to target_std
for i in range(num_batches):
    samples = []
    idxs = np.arange(len(dist))
    for j in range(num_samples):
        indices = np.random.choice(idxs, size=batch_size, replace=False)
        sample = dist[indices]
        std = sample.std()
        err = abs(std - target_std)
        samples.append((sample, indices, std, err, np.mean(sample), max(sample), min(sample)))
        if err <= tolerance:
            # close enough, stop sampling
            break
    # sort by smallest err first, then take the first/best result
    samples = sorted(samples, key=lambda x: x[3])
    best = samples[0]
    if i % 100 == 0:
        print(f"{i}, std={best[2]}, err={best[3]}, nsamples={num_samples}")
    result.append(best)
    # remove the data from our source
    dist = np.delete(dist, best[1])
df_samples = pd.DataFrame(result, columns=["sample", "indices", "std", "err", "mean", "max", "min"])
df_samples["err"].plot(title="Errors (target_std - batch_std)")
batch_std = df_samples["std"].mean()
batch_err = df_samples["err"].mean()
print(f"RESULT: Target std: {target_std}, Mean batch std: {batch_std}, Mean batch err: {batch_err}")
Since your problem is not restricted to a particular distribution, I use a normal distribution here, but this should work for any distribution. However, the run time will depend on the population size.
population = np.random.randn(1000)*32
std = 10.
tol = 1.
n_samples = 5
samples = list(np.random.choice(population, n_samples))
while True:
    center = np.mean(samples)
    dis = [abs(i - center) for i in samples]
    if np.std(samples) > (std + tol):
        samples.pop(dis.index(max(dis)))
    elif np.std(samples) < (std - tol):
        samples.pop(dis.index(min(dis)))
    else:
        break
    samples.append(np.random.choice(population, 1)[0])
Here is how the code works.
First, we draw n_samples; most likely their std is not yet in the range you want, so we calculate the mean and the absolute distance of each sample from that mean. If the std is larger than the desired value plus the tolerance, we kick out the furthest sample and draw a new one, and vice versa.
Note that if this takes too much time for your data, then after kicking the outlier out you can calculate the range the next element should fall in and draw from that part of the population, instead of drawing one at random. Hopefully this works for you.
DISCLAIMER: This is not a random draw anymore, and you should be aware that the draw is biased and is not representative of the population.
I am trying to simulate a pandas dataframe, using random values, with a combination of hard upper/lower values. I am using np.random.normal, as the original data is fairly normally distributed.
The code I am using to create the dataframe is:
df = pd.DataFrame({
    "Temp": np.random.normal(6.809892, 2.975827, 93),
    "Sun": np.random.normal(1.615054, 2.053996, 93),
    "Rel Hum": np.random.normal(87.153118, 5.529958, 93)
})
In the above example, I would like there to be hard lower and upper bounds for all three values. For example, Rel Hum could not go below 0 or above 100. (Edit: the three values would not all have the same bounds, either upper or lower; Temp can go negative, while Sun would be bounded between 0 and 24.)
How can I enforce these bounds while keeping a roughly normal distribution and passing the values to the dataframe at the same time?
Edit: note that this samples from a truncated normal with the given parameters, so the result will most likely not be truly normally distributed; sorry for the confusion.
Use SciPy's truncated normal, defined as:
"The standard form of this distribution is a standard normal truncated to the range [a, b]"
from scipy.stats import truncnorm
low_bound = 0
upper_bound = 100
mean = 8
std = 2
a, b = (low_bound - mean) / std, (upper_bound - mean) / std
n_samples = 1000
samples = truncnorm.rvs(a=a, b=b,
                        loc=mean, scale=std,
                        size=n_samples)
Thanks to ALollz for the corrections !
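Since the question notes that the three columns need different bounds, here is a minimal sketch of building the whole frame with per-column truncated normals; the bounds, means/stds, and the 93-row size are taken from the question, Temp is left unbounded, and the trunc_normal helper is purely for illustration:
import numpy as np
import pandas as pd
from scipy.stats import truncnorm

def trunc_normal(mean, std, low, high, size):
    # convert absolute bounds into the standardized a, b that truncnorm expects
    a, b = (low - mean) / std, (high - mean) / std
    return truncnorm.rvs(a=a, b=b, loc=mean, scale=std, size=size)

df = pd.DataFrame({
    "Temp": np.random.normal(6.809892, 2.975827, 93),         # may go negative
    "Sun": trunc_normal(1.615054, 2.053996, 0, 24, 93),        # bounded to [0, 24]
    "Rel Hum": trunc_normal(87.153118, 5.529958, 0, 100, 93)   # bounded to [0, 100]
})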
Try the clip() function to bound the values. Example:
>>> df[df['Rel Hum']>100].head()
Temp Sun Rel Hum
32 4.734005 4.102939 100.064077
>>> df['Rel Hum'].clip(0, 100, inplace=True) # assigns values outside boundary to 0 and 100
>>> df.head()
Temp Sun Rel Hum
0 9.714943 6.255931 93.105135
1 0.551001 3.063972 85.923184
2 7.780588 3.580514 79.124139
3 3.766066 3.684801 84.543149
4 8.541507 -3.066196 83.598925
>>> df[df['Rel Hum']>100].head()
Empty DataFrame
Columns: [Temp, Sun, Rel Hum]
Index: []
Just do a clip:
df = pd.DataFrame({
    "Temp": np.random.normal(6.809892, 2.975827, 93),
    "Sun": np.random.normal(1.615054, 2.053996, 93),
    "Rel Hum": np.random.normal(87.153118, 5.529958, 93)
}).clip(0, 100)
And plot:
df.plot.density(subplots=True);
gives: (density plots of the three clipped columns)
You can clip, though this leaves you with a spike at the edges:
import pandas as pd
import numpy as np
N = 10**5
df = pd.DataFrame({"Rel Hum": np.random.normal(87.153118,5.529958, N)})
df['Rel Hum'].clip(lower=0, upper=100).plot(kind='hist', bins=np.arange(60,101,1))
If you want to avoid that spike redraw out of bounds points until everything is within bounds:
while not df['Rel Hum'].between(0, 100).all():
    m = ~df['Rel Hum'].between(0, 100)
    df.loc[m, 'Rel Hum'] = np.random.normal(87.153118, 5.529958, m.sum())
df['Rel Hum'].plot(kind='hist', bins=np.arange(60,101,1))
So I need to calculate the joint probability distribution for N variables. I have code for two variables, but I am having trouble generalizing it to higher dimensions. I imagine there is some sort of Pythonic vectorization that could be helpful, but right now my code is very C-like (and yes, I know that is not the right way to write Python). My 2D code is below:
import numpy
import math
feature1 = numpy.array([1.1,2.2,3.0,1.2,5.4,3.4,2.2,6.8,4.5,5.6,1.9,2.8,3.7,4.4,7.3,8.3,8.1,7.0,8.0,6.8,6.2,4.9,5.7,6.3,3.7,2.4,4.5,8.5,9.5,9.9]);
feature2 = numpy.array([11.1,12.8,13.0,11.6,15.2,13.8,11.1,17.8,12.5,15.2,11.6,20.8,14.7,14.4,15.3,18.3,11.4,17.0,16.0,16.8,12.2,14.9,15.7,16.3,13.7,12.4,14.2,18.5,19.8,19.0]);
#===Concatenate All Features===#
numFrames = len(feature1);
allFeatures = numpy.zeros((2,numFrames));
allFeatures[0,:] = feature1;
allFeatures[1,:] = feature2;
#===Create the Array to hold all the Bins===#
numBins = int(0.25*numFrames);
allBins = numpy.zeros((allFeatures.shape[0],numBins+1));
#===Find the maximum and minimum of each feature===#
allRanges = numpy.zeros((allFeatures.shape[0],2));
for f in range(allFeatures.shape[0]):
    allRanges[f,0] = numpy.amin(allFeatures[f,:]);
    allRanges[f,1] = numpy.amax(allFeatures[f,:]);
#===Create the Array to hold all the individual feature probabilities===#
allIndividualProbs = numpy.zeros((allFeatures.shape[0],numBins));
#===Grab all the Individual Probs and the Bins===#
for f in range(allFeatures.shape[0]):
    freqhist, binedges = numpy.histogram(allFeatures[f,:],bins=numBins,range=[allRanges[f,0],allRanges[f,1]],density=False);
    allBins[f,:] = binedges;
    allIndividualProbs[f,:] = freqhist;
#===Create the joint probability array===#
jointProbs = numpy.zeros((numBins,numBins));
#===Compute the joint probability distribution===#
numElements = 0;
for b1 in range(numBins):
    for b2 in range(numBins):
        for f1 in range(numFrames):
            for f2 in range(numFrames):
                if ( ( (feature1[f1] >= allBins[0,b1]) and (feature1[f1] <= allBins[0,b1+1]) ) and ((feature2[f2] >= allBins[1,b2]) and (feature2[f2] <= allBins[1,b2+1])) ):
                    jointProbs[b1,b2] += 1;
                    numElements += 1;
jointProbs /= numElements;
#===But what if I add the following===#
feature3 = numpy.array([21.1,21.8,23.5,27.6,25.2,23.8,22.1,22.8,26.5,25.2,28.6,20.8,24.7,24.4,29.3,28.3,27.4,26.0,26.2,26.1,25.9,24.0,22.7,22.3,23.7,26.4,24.2,28.5,29.8,29.0]);
How can I generalize the large loop? For N variables (features) this loop would be enormous. Is there a Pythonic way to do this easily?
Check out the function numpy.histogramdd. This function can compute histograms in arbitrary numbers of dimensions. If you set the parameter density=True (called normed in older NumPy versions), it returns a normalized probability density. If you'd prefer something more like a probability mass function (where everything sums to 1), just normalize the raw counts yourself. All together, you'll have something like:
import numpy as np
numBins = 10 # number of bins in each dimension
data = np.random.randn(100000, 3) # generate 100000 3-d random data points
jointProbs, edges = np.histogramdd(data, bins=numBins)
jointProbs /= jointProbs.sum()
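Applied to the features from the question, a sketch of the same idea; column_stack builds the (numFrames, 3) sample matrix that histogramdd expects:
import numpy as np
# stack the 1-D feature arrays as columns: one row per frame, one column per feature
allFeatures = np.column_stack((feature1, feature2, feature3))
numBins = int(0.25 * len(feature1))
jointProbs, edges = np.histogramdd(allFeatures, bins=numBins)
jointProbs /= jointProbs.sum()  # normalize so the whole table sums to 1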