Classification of continuous data - Python

I've got a Pandas df that I use for Machine Learning in Scikit for Python.
One of the columns is a target value which is continuous data (varying from -10 to +10).
From the target column, I want to calculate a new column with 5 classes where the number of rows per class is the same, i.e. if I have 1000 rows I want to distribute them into 5 classes with roughly 200 rows in each class.
So far I have done this in Excel, separate from my Python code, but as the data has grown it's getting impractical.
In Excel I have calculated the percentiles and then used some logic to build the classes.
How to do this in Python?

#create data
import numpy as np
import pandas as pd
df = pd.DataFrame(20*np.random.rand(50, 1)-10, columns=['target'])
#find quantiles
quantiles = df['target'].quantile([.2, .4, .6, .8])
#label the groups (use .loc to avoid chained-assignment issues)
df['group'] = 5
df.loc[df['target'] < quantiles[.8], 'group'] = 4
df.loc[df['target'] < quantiles[.6], 'group'] = 3
df.loc[df['target'] < quantiles[.4], 'group'] = 2
df.loc[df['target'] < quantiles[.2], 'group'] = 1

Looking for an answer to a similar question, I found this post and the following tip: What is the difference between pandas.qcut and pandas.cut?
import numpy as np
import pandas as pd
#generate 1000 rows of uniform distribution between -10 and 10
rows = np.random.uniform(-10, 10, size = 1000)
#generate the discretization in 5 classes
rows_cut = pd.qcut(rows, 5)
classes = rows_cut.factorize()[0]
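For completeness, pd.qcut can also be applied directly to the DataFrame column from the first snippet, so the manual quantile lookup is not needed; a minimal sketch (the integer labels 1-5 simply mirror the groups above):
import numpy as np
import pandas as pd
df = pd.DataFrame(20*np.random.rand(50, 1)-10, columns=['target'])
# 5 equal-frequency bins, labelled 1..5 from lowest to highest target values
df['group'] = pd.qcut(df['target'], 5, labels=[1, 2, 3, 4, 5])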

Outlier detection based on the moving mean in Python

I am trying to translate an algorithm from MATLAB to Python. The algorithm works with large datasets and needs an outlier detection and elimination technique to be applied.
In the MATLAB code, the outlier deletion technique I use is movmedian:
Outlier_T=isoutlier(Data_raw.Temperatura,'movmedian',3);
Data_raw(find(Outlier_T),:)=[]
This detects outliers with a rolling median, by finding disproportionate values in the centre of a three-value moving window. So if I have a column "Temperatura" with a 40 on row 3, it is detected and the entire row is deleted.
Temperatura Date
1 24.72 2.3
2 25.76 4.6
3 40 7.0
4 25.31 9.3
5 26.21 15.6
6 26.59 17.9
... ... ...
To my understanding, this is achieved with pandas.DataFrame.rolling. I have seen several posts exemplifying its use, but I am not managing to make it work with my code:
Attempt A:
Dataframe.rolling(df["t_new"]))
Attempt B:
df-df.rolling(3).median().abs()>200
#based on #Ami Tavory's answer
Am I missing something obvious here? What is the right way of doing this?
Thank you for your time.
The code below drops rows based on a threshold, which can be adjusted as needed. Not sure if it replicates the MATLAB code, though.
# Import Libraries
import pandas as pd
import numpy as np
# Create DataFrame
df = pd.DataFrame({
    'Temperatura': [24.72, 25.76, 40, 25.31, 26.21, 26.59],
    'Date': [2.3, 4.6, 7.0, 9.3, 15.6, 17.9]
})
# Set threshold for difference with rolling median
upper_threshold = 1
lower_threshold = -1
# Calculate rolling median
df['rolling_temp'] = df['Temperatura'].rolling(window=3).median()
# Calculate difference
df['diff'] = df['Temperatura'] - df['rolling_temp']
# Flag rows to be dropped as `1`
df['drop_flag'] = np.where((df['diff']>upper_threshold)|(df['diff']<lower_threshold),1,0)
# Drop flagged rows
df = df[df['drop_flag']!=1]
df = df.drop(['rolling_temp', 'diff', 'drop_flag'], axis=1)
Output
print(df)
Temperatura Date
0 24.72 2.3
1 25.76 4.6
3 25.31 9.3
4 26.21 15.6
5 26.59 17.9
Late to the party; this is based on Nilesh Ingle's answer, modified to be more general, more verbose (graphs!), and to use a percentage threshold instead of the data's real values.
# Calculate rolling median
df["Temp_Rolling"] = df["Temp"].rolling(window=3).median()
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df["Temp_Scaled"] = scaler.fit_transform(df["Temp"].values.reshape(-1, 1))
df["Temp_Rolling"] = scaler.fit_transform(df["Temp_Rolling"].values.reshape(-1, 1))
# Calculate difference
df["Temp_Diff"] = df["Temp_Scaled"] - df["Temp_Rolling"]
import numpy as np
import matplotlib.pyplot as plt
# Set threshold for difference with rolling median
upper_threshold = 0.4
lower_threshold = -0.4
# Flag rows to keep as True
df["Temp_Keep_Flag"] = np.where( (df["Temp_Diff"] > upper_threshold) | (df["Temp_Diff"] < lower_threshold), False, True)
# Keep flagged rows
print('dropped rows')
print(df[~df["Temp_Keep_Flag"]].index)
print('Your new graph')
df_result = df[df["Temp_Keep_Flag"].values]
df_result["Temp"].plot()
Once you're satisfied with the data cleaning
# Satisfied, replace data
df = df[df["Temp_Keep_Flag"].values]
df.drop(columns=["Temp_Rolling", "Temp_Diff", "Temp_Keep_Flag"], inplace=True)
df.plot()
Nilesh's answer works perfectly; to iterate on his code you could also do:
upper_threshold = 1
lower_threshold = -1
# Calculate rolling median
df['rolling_temp'] = df['Temp'].rolling(window=3).median()
# all in one line
df = df.drop(df[(df['Temp']-df['rolling_temp']>upper_threshold)|(df['Temp']- df['rolling_temp']<lower_threshold)].index)
# if you want to drop the column as well
del df["rolling_temp"]
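If the goal is to stay closer to MATLAB's isoutlier(...,'movmedian',3), which (to my understanding) flags values more than three scaled MADs away from the local median of a centred window, a rough pandas sketch would be the following; the 1.4826 factor and the factor of 3 are assumptions based on MATLAB's documented defaults:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Temperatura': [24.72, 25.76, 40, 25.31, 26.21, 26.59],
    'Date': [2.3, 4.6, 7.0, 9.3, 15.6, 17.9]
})
roll = df['Temperatura'].rolling(window=3, center=True, min_periods=1)
local_median = roll.median()
# scaled MAD per window; 1.4826 makes the MAD comparable to a standard deviation
local_mad = 1.4826 * roll.apply(lambda x: np.median(np.abs(x - np.median(x))), raw=True)
outliers = (df['Temperatura'] - local_median).abs() > 3 * local_mad
df = df[~outliers]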

Optimally sampling n rows from location Pandas dataframe

I have a dataframe of latitudes and longitudes. I want to get a sample of size n_samples that covers the majority of the region in the dataframe. My first idea was to sort the dataframe by latitude and longitude and then use modular arithmetic to select evenly spread-out rows. However, this does not work in cases where I wish to sample, say, 27 out of 100 rows, since 100 % 27 is not 0. Note this problem gets even worse when trying to use n_samples=80 (since it would sample all 100 rows). So it will not be useful to simply adjust the number of rows afterward.
import random
import pandas as pd
n_samples = 27
lat = [random.uniform(30, 50) for i in range(100)]
lon = [random.uniform(-130, -100) for i in range(100)]
loc_df = pd.DataFrame([lat, lon]).T
loc_df.columns = ['lat', 'lon']
# Sort loc_df by lat/lon
loc_df = loc_df.sort_values(['lat', 'lon'])
# Sample every n rows
# Tends to sample either too many or too few rows
# In this case, we will be sampling 25 instead of 27 rows
every_n = round(loc_df.shape[0]/n_samples)
sample_df = loc_df[::every_n].reset_index(drop=True)
The first thing that comes to mind is 'train_test_split' from sklearn.
from sklearn.model_selection import train_test_split
test_pct = 1 - (n_samples / len(loc_df))
X = loc_df.iloc[:,0]
Y = loc_df.iloc[:,1]
X_sample, X_remain, Y_sample, Y_remain = train_test_split( X, Y, test_size=test_pct, random_state=0)
sample_df = X_sample.to_frame().join(Y_sample).reset_index(drop=True)
This assigns a test_pct variable based on your n_samples and the size of the dataframe, creates X (lat) and Y (lon) arrays from your loc_df, uses train_test_split to randomly select your n_samples, and then creates your sample_df dataframe containing n_samples rows of data.
Will that work for your purposes?
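If the aim is spatial coverage rather than a purely random sample, another option (not part of the answer above) is to cluster the points into n_samples groups and keep the row nearest each cluster centre; a sketch assuming scikit-learn is available:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

# loc_df with 'lat'/'lon' columns and n_samples as defined in the question
coords = loc_df[['lat', 'lon']].to_numpy()
km = KMeans(n_clusters=n_samples, n_init=10, random_state=0).fit(coords)
# index of the row closest to each cluster centre -> exactly n_samples rows
nearest = pairwise_distances_argmin(km.cluster_centers_, coords)
sample_df = loc_df.iloc[nearest].reset_index(drop=True)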

Python - Create a data set with correlating numeric variables

I want to create a dataset with years of experience from 1 to 10 and salaries from 30k to 100k. I want the salaries to be random but to roughly follow the years of experience; sometimes a person with more experience may make less than a person with less experience.
For example:
years of experience | Salary
1 | 30050
2 | 28500
3 | 36000
...
10 | 100,500
Here is what I have done so far:
import numpy as np
import random
import pandas as pd
years = np.linspace(1.0, 10.0, num=10)
salary = np.linspace(30000.0, 100000.0, num=10) + random.uniform(-1,1)*5000#plus/minus 5k
data = pd.DataFrame({'experience' : years, 'salary': salary})
print (data)
Which gives me:
experience salary
0 1.0 31060.903965
1 2.0 38838.681742
2 3.0 46616.459520
3 4.0 54394.237298
4 5.0 62172.015076
5 6.0 69949.792853
6 7.0 77727.570631
7 8.0 85505.348409
8 9.0 93283.126187
9 10.0 101060.903965
We can see that we do not get any records where a person with higher experience makes less than a person with lower experience. How can I fix this? Of course, I want to scale this up to give me 1000 rows.
scikit-learn comes with some useful functions to generate correlated data, such as make_regression.
You could for example do:
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
np.random.seed(0)
n_samples = 1000
X, y = make_regression(n_samples=n_samples, n_features=1, n_informative=1,
noise=80, random_state=0)
# Scale X (years of experience) to 0..10 range
X = np.interp(X, (X.min(), X.max()), (0, 10))
# Scale y (salary) to 30000..100000 range
y = np.interp(y, (y.min(), y.max()), (30000, 100000))
# To dataframe
df = pd.DataFrame({'experience': X.flatten(), 'salary': y})
print(df.head(10))
From what you describe, it seems as though you want to add some variance to the response. This can be done by adjusting the noise parameter. Let's plot it to make it more obvious:
from matplotlib import pyplot as plt
plt.scatter(X, y, color='blue', marker='.', label='Salary')
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()
For example, compare the scatter with noise=80 to the same plot with noise=250: the larger noise value gives much more spread around the underlying trend.
As a side note: This generates continuous values for "years of experience". If you instead want them rounded to integers, you can do that using X = np.rint(X)
You can define the salary to be equal to the number of years times some coefficient, plus some constant value, plus some random value.
import numpy as np
import random
import pandas as pd
N = 1000
intercept = 30000
coeff = 7000
years = np.random.uniform(low=1, high=10, size=N)
salary = intercept + years*coeff + np.random.normal(loc=0, scale=10000, size=N)
data = pd.DataFrame({'experience' : years, 'salary': salary})
data.plot.scatter(x='experience', y='salary', alpha=0.3)
In this case I would change the line:
salary = np.linspace(30000.0, 100000.0, num=10) + random.uniform(-1,1)*5000#plus/minus 5k
I think it is better to keep the random part separate; this way you can change it easily and adjust it depending on the values you want to reach.
Here is something I did:
import numpy as np
import random
import pandas as pd
years = np.linspace(1.0, 10.0, num=10)
random_list = [random.random() * 5000 * i for i in range(10)]
print(random_list)
salary = np.linspace(30000.0, 100000.0, num=10)- random_list
data = pd.DataFrame({'experience' : years, 'salary': salary})
print (data)
The random component has more variance as the salary grows.
random.uniform(-1,1)*5000 means that your salary value will be shifted by something in the range -5k to +5k, but since uniform is continuous in its output, it could well be that the salary is changed by only a very small amount.
Seeing how the salary without the random element changes by 7777.77... per step up in experience, it is quite unlikely to get a lower salary for a higher experience. I would suggest you increase the factor behind your random element.
Try random.uniform(-1,1) * 10000, for example. How high you crank that randomness is up to you; it depends on how likely it should be to get an overpaid inexperienced person.
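Note also that in the original snippet random.uniform(-1,1)*5000 is drawn once and added to every row, so all ten salaries shift by the same amount; drawing one value per row is what actually produces row-to-row variation. A minimal sketch of that change (the 10000 factor is just the suggestion above):
import numpy as np
import pandas as pd

years = np.linspace(1.0, 10.0, num=10)
# one independent draw per row instead of a single scalar for the whole column
salary = np.linspace(30000.0, 100000.0, num=10) + np.random.uniform(-1, 1, size=10) * 10000
data = pd.DataFrame({'experience': years, 'salary': salary})
print(data)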
import numpy as np
import random
import pandas as pd
years = np.linspace(1.0, 10.0, num=10)
salary = np.random.randint(30000, 100000, 10)
data = pd.DataFrame({'experience' : years, 'salary': salary})
print (data)

Applying pandas qcut bins to new data

I am using pandas qcut to split some data into 20 bins as part of data prep for training of a binary classification model like so:
data['VAR_BIN'] = pd.qcut(cc_data[var], 20, labels=False)
My question is: how can I apply the same binning logic derived from the qcut statement above to a new set of data, say for model validation purposes? Is there an easy way to do this?
Thanks
You can do it by passing retbins=True.
Consider the following DataFrame:
import pandas as pd
import numpy as np
prng = np.random.RandomState(0)
df = pd.DataFrame(prng.randn(100, 2), columns = ["A", "B"])
pd.qcut(df["A"], 20, retbins=True, labels=False) returns a tuple whose second element is the bins. So you can do:
ser, bins = pd.qcut(df["A"], 20, retbins=True, labels=False)
ser is the categorical series and bins are the break points. Now you can pass bins to pd.cut to apply the same grouping to the other column:
pd.cut(df["B"], bins=bins, labels=False, include_lowest=True)
Out[38]:
0 13
1 19
2 3
3 9
4 13
5 17
...
User @Karen said:
By using this logic, I am getting Na values in my validation set. Is there some way to solve it?
If this is happening to you, it most likely means that the validation set has values below (or above) the smallest (or greatest) value from the training data. Therefore, some values will fall out of range and will therefore not be assigned a bin.
You can solve this problem by extending the range of the training data:
# Make smallest value arbitrarily smaller
train.loc[train['value'].eq(train['value'].min()), 'value'] = train['value'].min() - 100
# Make greatest value arbitrarily greater
train.loc[train['value'].eq(train['value'].max()), 'value'] = train['value'].max() + 100
# Make bins from training data
s, b = pd.qcut(train['value'], 20, retbins=True)
# Cut validation data
test['bin'] = pd.cut(test['value'], b)
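An alternative (not from the original answer) is to leave the training data untouched and simply widen the outer bin edges before cutting, so out-of-range validation values still fall into the first or last bin; a sketch reusing the train/test names from the snippet above:
import numpy as np
import pandas as pd

# bins learned from the training data
s, b = pd.qcut(train['value'], 20, retbins=True)
# open up the outer edges so pd.cut covers any value in the validation data
b[0] = -np.inf
b[-1] = np.inf
test['bin'] = pd.cut(test['value'], bins=b)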

Python pandas grouping for correlation analysis

Assume two dataframes, each with a datetime index, and each with one column of unnamed data. The dataframes are of different lengths and the datetime indexes may or may not overlap.
df1 is length 20. df2 is length 400. The data column consists of random floats.
I want to iterate through df2 taking 20 units per iteration, with each iteration incrementing the start vector by one unit - and similarly the end vector by one unit. On each iteration I want to calculate the correlation between the 20 units of df1 and the 20 units I've selected for this iteration of df2. This correlation coefficient and other statistics will then be recorded.
Once the loop is complete I want to plot df1 with the 20-unit vector of df2 that satisfies my statistical search - thus needing to keep up with some level of indexing to reacquire the vector once analysis has been completed.
Any thoughts?
Without knowing more specifics of the question, such as why you are doing this or whether dates matter, this will do what you asked. I'm happy to update based on your feedback.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
df1 = pd.DataFrame({'a':[random.randint(0, 20) for x in range(20)]}, index = pd.date_range(start = '2013-01-01',periods = 20, freq = 'D'))
df2 = pd.DataFrame({'b':[random.randint(0, 20) for x in range(400)]}, index = pd.date_range(start = '2013-01-10',periods = 400, freq = 'D'))
results = []
t0 = df1.reset_index()['a']                   # the 20 values from df1
for i in range(len(df2) - len(df1) + 1):      # slide a 20-unit window over df2, one day at a time
    t1 = df2.iloc[i:i+20].reset_index()['b']  # grab the 20 days for this window
    t2 = df2.index[i]                         # first day of this df2 window
    results.append(pd.DataFrame({'corr': t0.corr(t1)}, index=[t2]))  # correlation for this window
corr = pd.concat(results)                     # DataFrame.append was removed in pandas 2.x
# plot it and save the graph
corr.plot()
plt.title("Correlation Graph")
plt.ylabel("(%)")
plt.grid(True)
plt.savefig('corr.png')                       # save before show(), which clears the figure
plt.show()
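To address the last part of the question (reacquiring the window once the search is done): the corr frame above is indexed by the first day of each df2 window, so the best-matching slice can be pulled back out by position. A sketch, assuming "best" means highest correlation and continuing from the code above:
# timestamp of the window with the highest correlation
best_start = corr['corr'].idxmax()
i = df2.index.get_loc(best_start)
best_window = df2.iloc[i:i+20]
# plot df1 against the matching 20-unit slice of df2
df1.reset_index(drop=True)['a'].plot(label='df1')
best_window.reset_index(drop=True)['b'].plot(label='best df2 window')
plt.legend()
plt.show()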
