Turning clusters assigned through numpy array into separate datasets - python

I have a very large dataset that I am running a clustering model on. The clustering outputs a numpy array formatted as such:
[ 0 1 2 1 1 0 0 0 1 2 1 0 2 0 1 2 1 0 2 2 0 0 1 ... ]
I want to take the original dataset, and create three datasets based on the array. How would I go about this?
Initial Dataset Work:
import pandas as pd
pd.options.mode.chained_assignment = None
raw_data = pd.read_csv("LendingClub2012to2013.csv", low_memory = False, skiprows=[0])
# Some cleaning done, target leakage removed, dummies created, imputation, etc.
clean_data = raw_data.drop(text2d + leakage2d + noinfo2d + irr2d, axis = 1)

I assume that your dataset is a numpy array. Try creating masks to select the elements you want from the original data set. Some verbose code:
# Your original data set (2d numpy array)
orig_data = ...
# The cluster assignments output by the algorithm (1d numpy array)
cluster_assignments = ...
clusters = []
for cluster_id in range(3):  # use xrange on Python 2
    mask = (cluster_assignments == cluster_id)
    clusters.append(orig_data[mask])
A more concise version:
clusters = [orig_data[cluster_assignments == cid] for cid in range(3)]
If your dataset is a pandas DataFrame rather than a numpy array, simply replace orig_data[...] with orig_data.loc[...].
The output of this code is a list clusters in which each element is a dataset with the data for just one of the clusters.
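For the pandas case in the question, the same masking idea can be sketched as follows. This is only a minimal sketch: the small DataFrame and its column names are made up to stand in for clean_data, and it assumes cluster_assignments has one entry per row, in the same row order as the DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for clean_data (column names are illustrative only)
clean_data = pd.DataFrame({
    "loan_amnt": [1000, 2500, 4000, 1200, 800, 3000],
    "int_rate": [0.10, 0.15, 0.20, 0.12, 0.09, 0.18],
})
# One cluster label per row, as returned by the clustering model
cluster_assignments = np.array([0, 1, 2, 1, 0, 2])

# One DataFrame per cluster id, selected with a positional boolean mask
clusters = {
    int(cid): clean_data[cluster_assignments == cid]
    for cid in np.unique(cluster_assignments)
}

for cid, subset in clusters.items():
    print(cid, len(subset))
```

Using np.unique instead of a hard-coded range(3) avoids having to know the number of clusters in advance.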

Related

How to split random generated array values in two separate array using numpy and random?

How do I create a vector called row_min that contains the minimum value of each of the 25 rows (so its shape will be (25,)), and a vector called col_max that contains the maximum value of each of the 8 columns (so its shape will be (8,))?
I have written the code below, but I'm new to the vector concept and need some suggestions.
import random
import numpy
c = numpy.random.rand(25,8)
print("Random float array 25X8 between range of 0.0 to 1.0 \n")
print(c,"\n")
I couldn't find a source that explains the concept.
You have to specify the axis that np.max(..., axis=...) should operate along:
import numpy as np
c = np.random.rand(5,3) # smaller for less output
print(c,"\n")
print( np.max(c, axis=0)) # maximum of each column
print( np.max(c, axis=1)) # maximum of each row
Output:
[[0.47894278 0.80356294 0.34453725]
[0.33802491 0.82795648 0.28438504]
[0.46838701 0.73664987 0.82215448]
[0.66245476 0.59981989 0.43837083]
[0.28515865 0.86093323 0.92248524]]
# axis 0 (columns)
[0.66245476 0.86093323 0.92248524]
# axis 1 (rows)
[0.80356294 0.82795648 0.82215448 0.66245476 0.92248524]
See the documentation for matrix.max(); min() works the same way.
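Applied to the 25x8 array from the question, the same idea gives the two requested vectors. This is a short sketch; only the names row_min and col_max come from the question.

```python
import numpy as np

c = np.random.rand(25, 8)

# Minimum of each row: reduce along axis 1 (across the 8 columns)
row_min = c.min(axis=1)
# Maximum of each column: reduce along axis 0 (across the 25 rows)
col_max = c.max(axis=0)

print(row_min.shape)
print(col_max.shape)
```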

How to Include continuous and categorical predictors in Keras LSTM?

I want to use Keras LSTM (or similar) to forecast energy consumption of businesses based on:
1. historical consumption data
2. some numerical features (e.g. total yearly consumption)
3. some categorical features (e.g. business type)
This is a cold-start problem because, while 2. and 3. are present both for the training and the test set, 1. is not, i.e. I am trying to predict consumption of new businesses for which there is no historical data.
My question is: how should I structure the dataframe and the RNN to accommodate both 2. (numerical features) and 3. (categorical data) as my predictors?
Here is a made-up example of the data:
# generate x (predictors dataframe)
import pandas as pd
x = pd.DataFrame({'ID':[0,1,2,3],'business_type':[0,2,2,1], 'contract_type':[0,0,2,1], 'yearly_consumption':[1000,200,300,900], 'n_sites':[9,1,2,5]})
print(x)
# note: the first 2 are categorical and the second 2 are numerical
   ID  business_type  contract_type  yearly_consumption  n_sites
0   0              0              0                1000        9
1   1              2              0                 200        1
2   2              2              2                 300        2
3   3              1              1                 900        5
# generate y (timeseries data)
import numpy as np
time_series = []
data_length = 6
period = 1
for k in range(4):
    level = 10 * np.random.rand()
    seas_amplitude = (0.1 + 0.3*np.random.rand()) * level
    sig = 0.05 * level # noise parameter (constant in time)
    time_ticks = np.array(range(data_length))
    source = level + seas_amplitude*np.sin(time_ticks*(2*np.pi)/period)
    noise = sig*np.random.randn(data_length)
    data = source + noise
    # the original also built an unused DatetimeIndex from t0 and freq, which are not defined in this snippet
    time_series.append(pd.Series(data=data, index=['t0','t1','t2','t3','t4','t5']))
y = pd.DataFrame(time_series)
print(y)
         t0        t1        t2        t3        t4        t5
0  9.611984  8.453227  8.153665  8.801166  8.208920  8.399184
1  2.139507  2.118636  2.160479  2.216049  1.943978  2.008407
2  0.131757  0.133401  0.135168  0.141212  0.136568  0.123730
3  5.990021  6.219840  6.637837  6.745850  6.648507  5.968953
# note: the real data has thousands of data points (one year with half hourly frequency)
# note: the first row belongs to ID = 0 in x, the second row to ID = 1 etc.
I have looked extensively online, and there seems to be no example where categorical, numerical and time-series data are all used together. For a simple forecasting problem, this post explains that in order to learn from the previous time period, the LSTM must be fed something like this:
# process df for a classical forecasting problem for first ID
y_lstm = pd.DataFrame(y.iloc[0,:])
y_lstm.columns = ['t']
y_lstm['t-1'] = y_lstm['t'].shift()
print(y_lstm)
           t       t-1
t0  9.611984       NaN
t1  8.453227  9.611984
t2  8.153665  8.453227
t3  8.801166  8.153665
t4  8.208920  8.801166
t5  8.399184  8.208920
# note: t-1 represents the previous time point
However, while this works for a single timeseries, it is unclear how to structure the dataset when there are multiple timeseries, and how to include the rest of the predictors in this structure.
This post talks about how to include both categorical and numerical variables through embedding, but does not fit my problem, where timeseries data also has to be included. This post discusses the choice between one-hot encoding and embedding without any example code and does not answer my question.
Could anyone please provide me with example code on how to structure the data appropriately for the RNN and/or what a simple LSTM structure with Keras would look like? Note that this structure should be able to use the timeseries data for training, but not for predictions (i.e. only x and not y is available for the test set).
Thank you very much in advance.
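One possible way to lay out the data for this cold-start setting, offered only as a sketch and not as an established recipe: one-hot encode the categorical columns, then repeat each business's static feature vector at every time step. That gives a 3D array of shape (samples, timesteps, features), which is the input shape recurrent layers expect, with the series itself used only as the training target. The names static, X_seq and y_seq are illustrative, the y values here are random placeholders, and feature scaling is left out.

```python
import numpy as np
import pandas as pd

# Static predictors from the question
x = pd.DataFrame({'ID': [0, 1, 2, 3],
                  'business_type': [0, 2, 2, 1],
                  'contract_type': [0, 0, 2, 1],
                  'yearly_consumption': [1000, 200, 300, 900],
                  'n_sites': [9, 1, 2, 5]})
# Placeholder time series: one row per ID, six time steps
y = pd.DataFrame(np.random.rand(4, 6), columns=['t0', 't1', 't2', 't3', 't4', 't5'])

# One-hot encode the categorical columns; numerical columns pass through unchanged
static = pd.get_dummies(x.drop(columns='ID'),
                        columns=['business_type', 'contract_type'],
                        dtype=float)

n_steps = y.shape[1]
# Repeat the static features at every time step: (samples, timesteps, features)
X_seq = np.repeat(static.to_numpy(dtype=float)[:, None, :], n_steps, axis=1)
# Targets as (samples, timesteps, 1)
y_seq = y.to_numpy()[:, :, None]

print(X_seq.shape, y_seq.shape)
```

Because X_seq is built from x alone, the same transformation can be applied to new businesses that have no history.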

What is the meaning of normalization in machine learning language? Does it correspond to one sample?

I am dealing with a classification problem in which I want to classify data into 2 classes. I generate 1000 samples at each of several temperatures ranging from 1 to 5, and I load the data using the function load_data below. Here "data" is a 2-dimensional array of shape (1000, 16): the rows are the samples stored in "1.0.npy" (and similarly for the other temperatures), and 16 is the number of features. I pick the max and min values of each sample in a for loop. However, I'm afraid that my normalization is not correct, because I'm not sure what the usual normalization strategy in machine learning is. Should I use np.amax of each sample, or np.amax over all 1000 samples contained in the file 1.0.npy? My goal is to normalize the data to between 0 and 1.
import os
import numpy as np

def load_data():
    path = "./directory"
    files = sorted(os.listdir(path))  # {1.0.npy, 2.0.npy, ..., 5.0.npy}
    dictData = {}
    for df in sorted(files):
        print(df)
        data = np.load(os.path.join(path, df))
        a = data
        lis = []
        for i in range(len(data)):
            # min-max scale each sample (row) separately
            old_range = np.amax(a[i]) - np.amin(a[i])
            new_range = 1 - 0
            f = ((a[i] - np.amin(a[i])) / old_range)*new_range + 0
            lis.append(f)
After normalization I get the following result, in which the first value of every sample is 0 and the last value is 1.
[0, ...., 1] #first sample
[0,.....,1] #second sample
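The two options in the question can be compared side by side. The sketch below uses random data in place of the contents of 1.0.npy (an assumption, since the real files are not shown) and contrasts per-sample scaling, which is what the loop above does, with per-feature scaling, where each of the 16 columns is scaled using its min and max over all samples. Per-feature scaling is what tools such as scikit-learn's MinMaxScaler do; which one is appropriate depends on what the features mean.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 16))  # stand-in for the array loaded from 1.0.npy

# Per-sample scaling: each row is mapped to [0, 1] using its own min and max
row_min = data.min(axis=1, keepdims=True)
row_max = data.max(axis=1, keepdims=True)
per_sample = (data - row_min) / (row_max - row_min)

# Per-feature scaling: each column is mapped to [0, 1] using its min and max over all samples
col_min = data.min(axis=0)
col_max = data.max(axis=0)
per_feature = (data - col_min) / (col_max - col_min)

print(per_sample.min(axis=1)[:3], per_sample.max(axis=1)[:3])
print(per_feature.min(axis=0)[:3], per_feature.max(axis=0)[:3])
```

With per-sample scaling every row contains a 0 and a 1, so the relative magnitude between samples is lost; per-feature scaling keeps it.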

Generate laplacian matrix from non-square dataset

I have a dataset like the following, where the first and second columns indicate node connections (from, to):
fromNode toNode
0 1
0 2
0 31
0 73
1 3
1 56
2 10
...
I want to generate the Laplacian matrix from this dataset. I use the following code to do so, but it complains because the dataset itself is not a square matrix. Is there a function that accepts this type of dataset and generates the matrix?
from numpy import genfromtxt
from scipy.sparse import csgraph
import csv
G = genfromtxt('./data.csv', delimiter='\t').astype(int)
dataset = csgraph.laplacian(G, normed=False)
Rather than find a function that will accept your data, process your data into the correct format.
Fake data: f simulates a file object. Use io.StringIO on Python 3 (io.BytesIO with a plain string only works on Python 2).
import io
data = '''0 1
0 2
0 31
0 73
1 3
1 56
2 10'''
f = io.StringIO(data)
Read each line of the data and process it into a list of edges of the form (node1, node2).
edges = []
for line in f:
    line = line.strip()
    (node1, node2) = map(int, line.split())
    edges.append((node1,node2))
Find the highest node number and create a square numpy ndarray based on it. You need to be aware of your node numbering: is it zero based?
import numpy as np
N = max(x for edge in edges for x in edge)
G = np.zeros((N+1,N+1), dtype = np.int64)
Iterate over the edges and assign the edge weight to the graph:
for row, column in edges:
    G[row,column] = 1
Here is a solution making use of numpy integer array indexing.
f.seek(0)  # rewind: the loop above has already consumed f
z = np.genfromtxt(f, dtype = np.int64)
n = z.max() + 1
g = np.zeros((n,n), dtype = np.int64)
rows, columns = z.T
g[rows, columns] = 1
Of course both of those assume all edge weights are equal.
See Graph Representations in the scipy docs. I couldn't try this graph to see if it is valid: I'm getting an import error for csgraph and probably need to update.
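Putting the steps together, a minimal end-to-end sketch might look like this. It assumes the graph is undirected, so the adjacency matrix is symmetrized before the Laplacian is computed; that assumption is not stated in the question, and for a directed graph the second assignment should be dropped. The edge list is the sample data from the question.

```python
import numpy as np
from scipy.sparse import csgraph

# Edge list (fromNode, toNode), taken from the sample rows in the question
edges = np.array([[0, 1], [0, 2], [0, 31], [0, 73], [1, 3], [1, 56], [2, 10]])

n = edges.max() + 1              # nodes are assumed to be numbered from zero
adj = np.zeros((n, n), dtype=float)
rows, cols = edges.T
adj[rows, cols] = 1              # all edge weights are taken to be 1
adj[cols, rows] = 1              # assumption: undirected graph, so mirror each edge

# Unnormalized graph Laplacian: degree matrix minus adjacency matrix
lap = csgraph.laplacian(adj, normed=False)

print(lap.shape)
print(lap[0, 0])                 # degree of node 0
```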

Classification of continuous data

I've got a Pandas df that I use for Machine Learning in Scikit for Python.
One of the columns is a target value which is continuous data (varying from -10 to +10).
From the target-column, I want to calculate a new column with 5 classes where the number of rows per class is the same, i.e. if I have 1000 rows I want to distribute into 5 classes with roughly 200 in each class.
So far, I have done this in Excel, separate from my Python code, but as the data has grown it's getting impractical.
In Excel I have calculated the percentiles and then used some logic to build the classes.
How to do this in Python?
#create data
import numpy as np
import pandas as pd
df = pd.DataFrame(20*np.random.rand(50, 1)-10, columns=['target'])
#find quantiles
quantiles = df['target'].quantile([.2, .4, .6, .8])
#labeling of groups
df['group'] = 5
# use .loc rather than chained indexing, so the assignment writes to df itself
df.loc[df['target'] < quantiles[.8], 'group'] = 4
df.loc[df['target'] < quantiles[.6], 'group'] = 3
df.loc[df['target'] < quantiles[.4], 'group'] = 2
df.loc[df['target'] < quantiles[.2], 'group'] = 1
Looking for an answer to a similar question, I found this post and the following tip: What is the difference between pandas.qcut and pandas.cut?
import numpy as np
import pandas as pd
#generate 1000 rows of uniform distribution between -10 and 10
rows = np.random.uniform(-10, 10, size = 1000)
#generate the discretization in 5 classes
rows_cut = pd.qcut(rows, 5)
classes = rows_cut.factorize()[0]
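As a variation on the same idea, pd.qcut can return the integer bin index directly when called with labels=False, ordered from the lowest to the highest bin. The sketch below applies it to a DataFrame column like the one in the question; the column name group and the fixed seed are choices made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'target': rng.uniform(-10, 10, size=1000)})

# labels=False gives bin indices 0..4; add 1 to get classes 1..5
df['group'] = pd.qcut(df['target'], q=5, labels=False) + 1

# Number of rows in each class
print(df['group'].value_counts().sort_index())
```

Since the bin edges are quantiles of the data, the classes have roughly equal sizes; a higher class number always corresponds to higher target values.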
