Python - Create a data set with correlating numeric variables - python

I want to create a dataset where I have years of experience from 1 to 10 and have salary from 30k to 100k. I want these salaries to be random and to follow the years of experience. Sometimes a person with more experience may make less than a person with less experience.
For example:
years of experience | Salary
1 | 30050
2 | 28500
3 | 36000
...
10 | 100,500
Here is what I have done so far:
import numpy as np
import random
import pandas as pd
years = np.linspace(1.0, 10.0, num=10)
salary = np.linspace(30000.0, 100000.0, num=10) + random.uniform(-1,1)*5000#plus/minus 5k
data = pd.DataFrame({'experience' : years, 'salary': salary})
print (data)
Which gives me:
experience salary
0 1.0 31060.903965
1 2.0 38838.681742
2 3.0 46616.459520
3 4.0 54394.237298
4 5.0 62172.015076
5 6.0 69949.792853
6 7.0 77727.570631
7 8.0 85505.348409
8 9.0 93283.126187
9 10.0 101060.903965
We can see that there are no records where a person with more experience made less than a person with less experience. How can I fix this? I also want to scale this up to 1000 rows.

scikit-learn comes with some useful functions to generate correlated data, such as make_regression.
You could for example do:
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
np.random.seed(0)
n_samples = 1000
X, y = make_regression(n_samples=n_samples, n_features=1, n_informative=1,
noise=80, random_state=0)
# Scale X (years of experience) to 0..10 range
X = np.interp(X, (X.min(), X.max()), (0, 10))
# Scale y (salary) to 30000..100000 range
y = np.interp(y, (y.min(), y.max()), (30000, 100000))
# To dataframe
df = pd.DataFrame({'experience': X.flatten(), 'salary': y})
print(df.head(10))
From what you describe, it seems as though you want to add some variance to the response. This can be done by adjusting the noise parameter. Let's plot it to make it more obvious:
from matplotlib import pyplot as plt
plt.scatter(X, y, color='blue', marker='.', label='Salary')
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()
For example, using noise=80:
Or using noise=250:
As a side note: This generates continuous values for "years of experience". If you instead want them rounded to integers, you can do that using X = np.rint(X)

You can define the salary to be equal to the number of years times some coefficient, plus some constant value, plus some random value.
import numpy as np
import random
import pandas as pd
N = 1000
intercept = 30000
coeff = 7000
years = np.random.uniform(low=1, high=10, size=N)
salary = intercept + years*coeff + np.random.normal(loc=0, scale=10000, size=N)
data = pd.DataFrame({'experience' : years, 'salary': salary})
data.plot.scatter(x='experience', y='salary', alpha=0.3)
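A quick way to check that this produces the occasional "more experience, lower salary" case is to sort by experience and count how often the salary drops from one row to the next. The sketch below reuses the coefficients above; the seed and the use of np.random.default_rng are assumptions made only for reproducibility.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
N = 1000
years = rng.uniform(low=1, high=10, size=N)
salary = 30000 + 7000 * years + rng.normal(loc=0, scale=10000, size=N)
data = pd.DataFrame({'experience': years, 'salary': salary})

# Pearson correlation between experience and salary
corr = data['experience'].corr(data['salary'])

# Sort by experience; a negative step means the more experienced
# person earns less than the one just before them.
ordered = data.sort_values('experience')['salary'].to_numpy()
inversion_share = (np.diff(ordered) < 0).mean()

print(f"correlation: {corr:.2f}, adjacent inversions: {inversion_share:.0%}")
```

Lowering the scale of the noise raises the correlation and makes inversions between distant experience levels rarer.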

In this case I would change the line:
salary = np.linspace(30000.0, 100000.0, num=10) + random.uniform(-1,1)*5000#plus/minus 5k
I think it is better to keep the random part separate; that way you can change it easily and adjust it to reach the values you want.
Here is something I did:
import numpy as np
import random
import pandas as pd
years = np.linspace(1.0, 10.0, num=10)
random_list = [random.random()*1000*_*5 for _ in range(10)]
print(random_list)
salary = np.linspace(30000.0, 100000.0, num=10)- random_list
data = pd.DataFrame({'experience' : years, 'salary': salary})
print (data)
The random component has more variance as the salary grows.

random.uniform(-1,1)*5000 means that your salary value will be shifted by somewhere between -5k and +5k, but because uniform is continuous in its output, the shift may well be very small.
Seeing how the salary without the random element changes by 7777.77... per step up in experience, it is quite unlikely to get a lower salary for a higher experience. I would suggest you increase the factor behind your random element.
Try random.uniform(-1,1) * 10000, for example. How high you crank that randomness is up to you; it depends on how likely it should be to get an overpaid inexperienced person.
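One more thing worth noting about the code in the question: random.uniform(-1, 1) returns a single float, so the same offset is added to every row and the salaries stay in order whatever the factor is. A minimal sketch of drawing one offset per row with np.random.uniform and its size argument follows; the factor of 10000 is the one suggested above, while the seed and the variable names base and noise are assumptions for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed, only for reproducibility
n = 1000

years = np.linspace(1.0, 10.0, num=n)
base = np.linspace(30000.0, 100000.0, num=n)

# One independent offset per row, each between -10k and +10k
noise = np.random.uniform(-1, 1, size=n) * 10000
salary = base + noise

data = pd.DataFrame({'experience': years, 'salary': salary})
print(data.head())
```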

import numpy as np
import random
import pandas as pd
years = np.linspace(1.0, 10.0, num=10)
salary = np.random.randint(30000, 100000, 10)
data = pd.DataFrame({'experience' : years, 'salary': salary})
print (data)

Related

How to graph a mathematical function for "Distance and Speed over Time" in Python?

I'm struggling with some Python homework.
I'm really new to Python, and coding in general. I have really basic knowledge in Python, and somewhat acceptable level in JavaScript.
My issue: I have to make a graph to represent these two functions:
distance = (x**2/2 - np.cos(5*x) - 7)
speed = (x + 5*np.sin(5*x))
Between the timestamps 3 and 6 (inclusive)
I know I have to use Pandas to make a DataFrame, I know I have to use MatPlotLib to make the actual plot, and I have to use Numpy for the math to work, but I can't get the math to be recognised as mathematical functions because I simply don't know how.
This is what the graph should look like:
Graph for Distance and Speed over Time
This is what my code looks for now:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
x = 10
time = [3, 6]
distance = (x**2/2 - np.cos(5*x) - 7)
speed = (x + 5*np.sin(5*x))
values = {'Distance': distance, 'Speed': speed, 'Time': time}
df = pd.DataFrame(data= values)
df.plot(title='Distance and speed', xlabel='Time (hours)', ylabel='Distance (km) / Speed (km/h)', x='Time')
plt.show()
x = 10 I know shouldn't be included, but since I'm missing the part that makes the math work, I have to include it to make it "work" and not get an error.
I have a vague idea that using Numpy is the answer to my problem, but I don't know how (for now, hopefully).
How wrong am I? Can anyone help me?
Corrections to posted version
Use variable t for time (rather than x)
np.arange(3, 6.01, 0.01) to get time from 3 to 6 inclusive
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# time values from 3 to 6 inclusive in steps of 0.01 (use 6.01 to include 6)
t = np.arange(3, 6.01, 0.01) # t for time
# Use NumPy array operations to compute distance and speed at all time values (i.e. x axis)
distance = (t**2/2 - np.cos(5*t) - 7)
speed = (t + 5*np.sin(5*t))
values = {'Distance': distance, 'Speed': speed, 'Time': t} # x is time t
df = pd.DataFrame(data= values)
df.plot(title='Distance and speed', xlabel='Time (hours)', ylabel='Distance (km) / Speed (km/h)', x='Time')
plt.show()
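Because np.arange with a floating-point step can be sensitive to rounding near the end point, np.linspace is a possible alternative that includes both ends by construction. A small sketch, assuming 301 points to reproduce the 0.01 step:

```python
import numpy as np
import pandas as pd

# 301 evenly spaced points from 3 to 6, both ends included
t = np.linspace(3, 6, 301)

distance = t**2 / 2 - np.cos(5 * t) - 7
speed = t + 5 * np.sin(5 * t)

df = pd.DataFrame({'Time': t, 'Distance': distance, 'Speed': speed})
print(df.iloc[[0, -1]])  # first and last rows
```

The resulting DataFrame can be passed to df.plot(x='Time', ...) exactly as above.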

Incorrect results from a numpy Monte Carlo simulation of investment returns

How much money will you get by investing $10,000 periodically at an average rate of 8% annually for 30 years and using monte carlo simulations?
I am trying to solve a problem like the one described above by using Monte Carlo simulations for the interest rate in Python. I came up with the following code, and it seems right, but the result is terribly skewed and I suspect that I did something wrong. The code is below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
def sni(i,n):
    sni = round(((1+i)**n-1)/i,2)
    return sni
df = pd.DataFrame()
investment = 10000
for p in range(1000):
    i = np.random.normal(0.08,0.18)
    lst = []
    for n in range(30):
        final = investment * sni(i,n)
        lst.append(final)
    df[p]=lst
I can't recall the equations off the top of my head, but I suspect that in any case the interest rate varies from year to year, so the closed-form exponential formula wouldn't apply.
So I simply did it like this:
import numpy as np
import pylab
years = 30
investment = 10000.0
def one_run():
    account = 0
    for n in range(years):
        interest = np.random.normal(0.08, 0.018)
        account = account * (1 + interest) + investment
    return account
df = [one_run() for _ in range(10000)]
# ****** everything below here is just plotting
p, b = np.histogram(df,50, density=True)
pylab.plot(b[:-1], p)
pylab.grid()
pylab.xlabel("Return [$]")
pylab.ylabel("Probability density [1/$]")
pylab.show()
Also, you varied the interest rate with a scale of 0.18 around 0.08, so it would be negative quite often. I took the liberty of inserting another zero here (giving values of 0.08 ± 0.018).
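As a sanity check on the loop: with a constant rate i, the recurrence account = account * (1 + i) + investment has the closed form investment * ((1 + i)**n - 1) / i, which is the sni factor from the question. The sketch below compares the two for a fixed 8% rate; the helper function names are made up for illustration.

```python
def future_value_loop(investment, rate, years):
    """Apply the yearly recurrence with a constant interest rate."""
    account = 0.0
    for _ in range(years):
        account = account * (1 + rate) + investment
    return account


def future_value_closed_form(investment, rate, years):
    """Closed form of the same recurrence (geometric series)."""
    return investment * ((1 + rate) ** years - 1) / rate


loop_value = future_value_loop(10000.0, 0.08, 30)
formula_value = future_value_closed_form(10000.0, 0.08, 30)
print(round(loop_value, 2), round(formula_value, 2))
```

The two agree only because the rate is constant; once the rate is drawn anew each year, the loop is the straightforward way to simulate it.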

How to create np array random data on age vs time?

My aim is to create a scatter plot representing random data on age vs. time spent watching TV.
from pylab import randn
X = randn(500)
Y = randn(500)
plt.scatter(X,Y)
plt.show()
I want age between 18 and 50 and time between 0 to 24 hours
You can try :
import random
import numpy as np
age=np.array(random.sample(list(range(18,51)),10))
time=np.array(random.sample(list(range(0,24)),10))
random.sample takes a list of elements as first argument and the number of samples you want as the second argument.
That gives :
age : [47 45 37 19 23 34 39 24 32 42]
time : [18 12 13 1 15 21 23 22 3 17]
On plotting it :
import matplotlib.pyplot as plt
plt.scatter(age, time)
plt.show()
To recreate the same random numbers every time you run it, you can use random.seed()
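It's super easy with numpy:
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline  # Jupyter-only magic; enable it inside a notebook
age = np.random.randint(18, 51, 20)  # upper bound is exclusive, so 51 allows age 50
time = np.random.randint(0, 25, 20)  # upper bound is exclusive, so 25 allows 24 hours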
plt.scatter(age, time)
plt.show()
Column-wise multiplication in numpy
You can easily create custom-sized random arrays with numpy with the commands numpy.random.rand(d0, d1, …, dn) for uniform distributions or numpy.random.randn(d0, d1, …, dn) for normal distributions, where dn is the number of samples in the nth dimension. In your case you'll have d0=500 and d1=2.
However, the values will be sampled from the interval [0, 1) for numpy.random.rand(d0, d1, …, dn), or from the standard normal distribution for numpy.random.randn(d0, d1, …, dn) (i.e. mean = 0 and variance = 1).
A nice workaround for this is to multiply and add to the arrays column-wise to shift the distributions to the desired values. To multiply an array arr column-wise by a vector vec, you can use this small snippet of code: arr.dot(np.diag(vec)). Be careful: vec should have as many elements as arr has columns.
This snippet works by turning vec into a diagonal matrix (i.e. a matrix where everything is zero except the main diagonal) and then multiplying arr by the diagonal matrix.
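The same column-wise scaling can also be written with NumPy broadcasting, which avoids building the diagonal matrix. A small sketch comparing the two forms on made-up numbers:

```python
import numpy as np

arr = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]])
vec = np.array([10.0, 100.0])

# Diagonal-matrix form described above
via_diag = arr.dot(np.diag(vec))

# Broadcasting: vec is stretched across the rows, scaling each column
via_broadcast = arr * vec

print(via_broadcast)
```

With broadcasting, the shift is just as short: arr * range_vector + min_vector gives the same result as the dot-product form.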
For uniform distributions
Remember that to turn a sample x from a uniform distribution on [0, 1) into one on [min, max), you compute new_x = (max - min) * x + min. So if you want a uniform distribution and you know the max and min limits for both variables, you can use the following code:
import numpy as np
n_samples = 500
max_age, min_age = 80, 10
max_hours, min_hours = 10, 0
array = np.random.rand(n_samples, 2) #returns samples from the uniform distribution
range_vector = np.array([max_age - min_age, max_hours - min_hours])
min_vector = np.array([min_age, min_hours])
sample = array.dot(np.diag(range_vector)) + np.ones(array.shape).dot(np.diag(min_vector))
Normal distributions
If you want a normal distribution and you know the mean and standard deviation of both columns, use the following code. Remember that to shift a sample x from a standard normal distribution to a distribution with a different mean and standard deviation, you compute new_x = deviation * x + mean.
import numpy as np
n_samples = 500
mean_age, deviation_age = 40, 20
mean_hours, deviation_hours = 5, 2
array = np.random.randn(n_samples, 2) #returns samples from the standard normal distribution
deviation_vector = np.array([deviation_age, deviation_hours])
mean_vector = np.array([mean_age, mean_hours])
sample = array.dot(np.diag(deviation_vector)) + np.ones(array.shape).dot(np.diag(mean_vector))
Be careful, however: with normal distributions you can end up with negative values.
You can also have a look at all the documentation numpy has on random variables: https://docs.scipy.org/doc/numpy/reference/routines.random.html
Finally, please notice that column-wise multiplication only works when you want both samples to be independent.
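If the two columns should instead be correlated, one option is to draw both at once from a bivariate normal distribution. This is a sketch only: it assumes the means and deviations from the normal example above, and the correlation of 0.6 is a hypothetical value picked purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n_samples = 5000
mean_age, deviation_age = 40, 20
mean_hours, deviation_hours = 5, 2
rho = 0.6  # hypothetical target correlation between age and hours

# Covariance matrix built from the standard deviations and rho
covariance = np.array([
    [deviation_age**2, rho * deviation_age * deviation_hours],
    [rho * deviation_age * deviation_hours, deviation_hours**2],
])

sample = rng.multivariate_normal([mean_age, mean_hours], covariance, size=n_samples)

# Empirical correlation of the two generated columns
observed_rho = np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]
print(sample.shape, round(observed_rho, 2))
```

As with the independent normal case, negative values can still occur and may need clipping or resampling.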

Empty Plot when dealing with a huge number of rows

I attempted to use the code below to plot a graph to show the Speed per hour by days.
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import glob, os
taxi_df = pd.read_csv('ChicagoTaxi.csv')
taxi_df['trip_start_timestamp'] = pd.to_datetime(taxi_df['trip_start_timestamp'], format = '%Y-%m-%d %H:%M:%S', errors = 'raise')
taxi_df['trip_end_timestamp'] = pd.to_datetime(taxi_df['trip_end_timestamp'], format = '%Y-%m-%d %H:%M:%S', errors = 'raise')
#For filtering away any zero values when trip_Seconds or trip_miles = 0
filterZero = taxi_df[(taxi_df.trip_seconds != 0) & (taxi_df.trip_miles != 0)]
filterZero['trip_seconds'] = filterZero['trip_seconds']/60
filterZero['trip_seconds'] = filterZero['trip_seconds'].apply(lambda x: round(x,0))
filterZero['speed'] = filterZero['trip_miles']/filterZero['trip_seconds']
filterZero['speed'] *= 60
filterZero = filterZero.reset_index(drop=True)
filterZero.groupby(filterZero['trip_start_timestamp'].dt.strftime('%w'))['speed'].mean().plot()
plt.xlabel('Day')
plt.ylabel('Speed(Miles per Minutes)')
plt.title('Mean Miles per Hour By Days')
plt.show() #Not working
Example rows
0 2016-01-13 06:15:00 8.000000
1 2016-01-22 09:30:00 10.500000
Small Dataset : [1250219 rows x 2 columns]
Big Dataset: [15172212 rows x 2 columns]
For a smaller dataset the code works perfectly and the plot is shown. However, when I attempted to use a dataset with 15 million rows, the plot shown was empty because the values were "inf" despite running mean(). Am I doing something wrong here?
0 inf
1 inf
...
5 inf
6 inf
The speed is "Miles Per Hour" by day! I was trying out all time format so there is a mismatch in the picture sorry.
Image of failed Plotting(Larger Dataset):
Image of successful Plotting(Smaller Dataset):
I can't really be sure because you do not provide a real example of your dataset, but I'm pretty sure your problem comes from the column trip_seconds.
See these two lines:
filterZero['trip_seconds'] = filterZero['trip_seconds']/60
filterZero['trip_seconds'] = filterZero['trip_seconds'].apply(lambda x: round(x,0))
If some of your values in the column trip_seconds are ≤ 30, then dividing by 60 and rounding will turn them into 0.0.
filterZero['speed'] = filterZero['trip_miles']/filterZero['trip_seconds']
Therefore the speed column will contain some inf values (as any nonzero number / 0.0 = inf). Taking the mean() of an array that contains inf returns inf regardless of the other values.
Two things to consider:
If the values in the column trip_seconds are actually in seconds, then after dividing by 60 they are in minutes, so miles / minutes is miles per minute; it only becomes miles per hour because of the later *= 60.
You should try without rounding the times.
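The effect can be reproduced on a tiny made-up table. The sketch below uses the column names from the question, but the three rows are invented for illustration; it compares the rounded-minutes calculation with one that converts seconds to hours directly.

```python
import numpy as np
import pandas as pd

# Invented rows; the first trip lasts only 20 seconds
trips = pd.DataFrame({
    'trip_seconds': [20, 600, 1200],
    'trip_miles': [0.1, 2.0, 5.0],
})

# Approach from the question: minutes rounded to whole numbers
rounded_minutes = (trips['trip_seconds'] / 60).round(0)   # 20 s becomes 0.0
speed_rounded = trips['trip_miles'] / rounded_minutes * 60

# Without rounding: convert seconds to hours and divide once
speed_exact = trips['trip_miles'] / (trips['trip_seconds'] / 3600)

print(speed_rounded.mean())  # dominated by the single inf value
print(speed_exact.mean())
```

A single very short trip is enough to make the mean infinite, which would explain why the problem only shows up in the larger dataset.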

Classification of continuous data

I've got a Pandas df that I use for Machine Learning in Scikit for Python.
One of the columns is a target value which is continuous data (varying from -10 to +10).
From the target-column, I want to calculate a new column with 5 classes where the number of rows per class is the same, i.e. if I have 1000 rows I want to distribute into 5 classes with roughly 200 in each class.
So far, I have done this in Excel, separate from my Python code, but as the data has grown it's getting impractical.
In Excel I have calculated the percentiles and then used some logic to build the classes.
How to do this in Python?
#create data
import numpy as np
import pandas as pd
df = pd.DataFrame(20*np.random.rand(50, 1)-10, columns=['target'])
#find quantiles
quantiles = df['target'].quantile([.2, .4, .6, .8])
#labeling of groups (use .loc to avoid chained assignment)
df['group'] = 5
df.loc[df['target'] < quantiles[.8], 'group'] = 4
df.loc[df['target'] < quantiles[.6], 'group'] = 3
df.loc[df['target'] < quantiles[.4], 'group'] = 2
df.loc[df['target'] < quantiles[.2], 'group'] = 1
While looking for an answer to a similar question, I found this post and the following tip: What is the difference between pandas.qcut and pandas.cut?
import numpy as np
import pandas as pd
#generate 1000 rows of uniform distribution between -10 and 10
rows = np.random.uniform(-10, 10, size = 1000)
#generate the discretization in 5 classes
rows_cut = pd.qcut(rows, 5)
classes = rows_cut.factorize()[0]
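Applied to a DataFrame column, this could look like the sketch below. It assumes the column is called target, as in the question, and passes labels=False so that pd.qcut returns integer bin codes ordered from the lowest to the highest quantile; the +1 just shifts them to the 1 to 5 range.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame({'target': rng.uniform(-10, 10, size=1000)})

# labels=False yields codes 0..4 in quantile order; shift to 1..5
df['group'] = pd.qcut(df['target'], q=5, labels=False) + 1

# Number of rows that landed in each class
counts = df['group'].value_counts().sort_index()
print(counts)
```

Heavily repeated target values can make the bin edges coincide, in which case pd.qcut raises an error unless duplicates='drop' is passed.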
