Preparing variable-length data for sklearn - python

Since this is a complicated problem (at least for me), I will try to keep this as brief as possible.
My data is of the form
import pandas as pd
import numpy as np
# edit: a1 and a2 are linked as they are part of the same object
a1 = np.array([[1, 2, 3], [4, 5], [7, 8, 9, 10]])
a2 = np.array([[5, 6, 5], [2, 3], [3, 4, 8, 1]])
b = np.array([6, 15, 24])
y = np.array([0, 1, 1])
df = pd.DataFrame(dict(a1=a1.tolist(),a2=a2.tolist(), b=b, y=y))
a1 a2 b y
0 [1, 2, 3] [5, 6, 5] 6 0
1 [4, 5] [2, 3] 15 1
2 [7, 8, 9, 10] [3, 4, 8, 1] 24 1
which I would like to use in sklearn for classification, e.g.
from sklearn import tree
X = df[['a1', 'a2', 'b']]
Y = df['y']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
print(clf.predict([[2., 2.]]))
However, while pandas can handle lists as entries, sklearn, by design, cannot. In this example the clf.fit will result in ValueError: setting an array element with a sequence. to which you can find plenty of answers.
But how do you deal with such data?
I tried to split the data up into multiple columns (i.e. a1[0] ... a1[3] - code for that is a bit lengthy), but a1[3] will be empty (NaN, 0 or whatever invalid value you think of). Imputation does not make sense here, since no value is supposed to be there.
Of course, such a procedure has an impact on the result of the classification as the algorithm might pick up the "zero" value as something meaningful.
If the dataset is large enough, so I thought, it might be worth splitting it up in equal lengths of a1. But this procedure can reduce the power of the classification algorithm, since the length of a1 might help to distinguish between classes.
I also thought of using warm start for algorithms that support (e.g. Perceptron) and fit it to data split by the length of a1. But this would surely fail, would it not? The datasets would have different number of features, so I assume that something would go wrong.
Solutions to this problem surely must exist and I've simply not found the right place in the documentation.

Lets assume for a second those numbers are numerical categories.
What you can do is transform column 'a' into a set of binary columns, of which each corresponds to a possible value of 'a'.
Taking your example code, we would:
import pandas as pd
import numpy as np
a = np.array([[1, 2, 3], [4, 5], [7, 8, 9, 10]])
b = np.array([6, 15, 24])
y = np.array([0, 1, 1])
df = pd.DataFrame(dict(a=a.tolist(),b=b,y=y))
from sklearn.preprocessing import MultiLabelBinarizer
MLB = MultiLabelBinarizer()
df_2 = pd.DataFrame(MLB.fit_transform(df['a']), columns=MLB.classes_)
df_2
1 2 3 4 5 7 8 9 10
0 1 1 1 0 0 0 0 0 0
1 0 0 0 1 1 0 0 0 0
2 0 0 0 0 0 1 1 1 1
Than, we can just concat the old and new data:
new_df = pd.concat([df_2, df.drop('a',1)],1)
1 2 3 4 5 7 8 9 10 b y
0 1 1 1 0 0 0 0 0 0 6 0
1 0 0 0 1 1 0 0 0 0 15 1
2 0 0 0 0 0 1 1 1 1 24 1
Please do notice that if you have a training and a test set, it would be wise to first concat em, do the transform, and than separate 'em. Thats because one of the data sets can contain terms that do not belong to the other.
Hope that helps
Edit:
If you are worried that might make your df too big, its perfectly okay to apply PCA to the binarized variables. It will reduce cardinality while maintaining an arbitrary amount of variance/correlation.

Sklearn likes the data in 2d array i.e. shape (batch_size, features)
The simplest solution is to prepare one feature vector by concatenating the arrays using numpy.concatenate. The pass this feature vector to sklearn. Since the length of each column is fixed this should work.

Related

how to count the match number between two dataframe fast?

I'm writing some programs on calculate the match item number between two dataframes.
for example,
A is the dataframe as : A = pd.DataFrame({'pick_num1':[1, 2, 3], 'pick_num2':[2, 3, 4], 'pick_num3':[4, 5, 6]})
B is the answer I want to match, like:
B = pd.DataFrame({'ans_num1':[1, 2, 3], 'ans_num2':[2, 3, 4], 'ans_num3':[4, 5, 6], 'ans_num4':[7, 8, 1], 'ans_num5':[9, 1, 9]})
DataFrame A
pick_num1 pick_num2 pick_num3 match_num
0 1 2 4 2
1 2 3 5 2
2 3 4 6 2
DataFrame B
ans_num1 ans_num2 ans_num3 ans_num4 ans_num5
0 1 2 4 7 9
1 2 3 5 8 1
2 3 4 6 1 9
and I want to append a new column of ['match_num'] at the end of A.
Now I have tried to write a mapping function to compare and calculate, and I found the speed is not that fast while the dataframe is huge, the functions are below:
def win_prb_func(df1, p_name):
df1['match_num'] += np.sum(pd.concat([df1[p_name]]*5, axis=1).values==df1[open_ball_name_ls].values, 1)
return df1
def compute_win_prb(df1):
return list(map(lambda p_name: win_prb_func(df1, p_name), pick_name_ls))
df1 = pd.concat([A, B], axis=1)
df1['win prb.'] = 0
result_df = compute_win_prb(df1)
where pick_name_ls is ['pick_num1', 'pick_num2', 'pick_num3'], and open_ball_name_ls is ['ans_num1', 'ans_num2', 'ans_num3', 'ans_num4', 'ans_num5'].
I'm wondering is it possible to make the computation more fast or smart than I did?
now the performance would is: 0.015626192092895508 seconds
Thank you for helping me!
You can use broadcasting instead of concatenating the columns:
def win_prb_func(df1, p_name):
df1['match_num'] += np.sum(df1[p_name].values[:, np.newaxis] == df1[open_ball_name_ls].values, 1)
return df1
Since df1[p_name].values will return an 1-D array, you have to convert it into the column vector by adding a new axis. It only takes me 0.004 second.

How to Find Closest data points for a query data point in a pandas dataframe?

I have a query data point with 15 columns and I have a pandas data frame with same columns(15) and i want to find closest data points present in data frame to my query data point. can some one guide me on this ?
Example:
query data point
[1, 2, 3, 4]
df
1 3 5 6
2 7 9 1
2 8 1 8
5 4 9 0
2 4 6 7
here, below rows are closest , in the same way i want to retrieve first n closest data points to my query point.
1 3 5 6
2 4 6 7
I tried clustering but it was too complex for me to understand and KNN is expecting a target variable, so need your help .Thank you!
You can use the Euclidean distance or L2Norm to calculate the distance between each row of your dataframe and your query point.
df = pd.DataFrame([[1, 3, 5, 6],
[2, 7, 9, 1],
[2, 8, 1, 8],
[5, 4, 9, 0],
[2, 4, 6, 7]])
vec = [1, 2, 3, 4]
dist = df.sub(vec, axis=1).pow(2).sum(axis=1).pow(.5)
This gives the output,
0 3.000000
1 8.426150
2 7.549834
3 8.485281
4 4.795832
dtype: float64
You can select the shortest n distances, which give you the positions of n-closest data points to your query points.
Or you can use the np.linlag.norm
dist = np.linalg.norm(source.to_numpy() - vec, axis=1)
which gives you the output
array([3. , 8.42614977, 7.54983444, 8.48528137, 4.79583152])
Check out the answers to this question.
You can try:
query_point = [1, 2, 3, 4]
n = 2
n_closest_points = df.loc[(df - query_point).pow(2).sum(axis=1).nsmallest(n).index]
gives
0 1 2 3
0 1 3 5 6
4 2 4 6 7
We take the sum of squared distance between each row and the query_point by chaining subtraction (which broadcasts), taking square (pow) and summing (sum). Then we require the n closest rows via getting the rows that have the smallest distance (nsmallest). Then this gives a series with values being the squared distance and index indicating the desired rows, so we take its index and look them into the original df (.loc).

Appending seperate dataframes, each as column

I am currently working on the following:
data - with the correct index
for i in range(1, 11):
kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(data_values)
wcss.append(kmeans.inertia_)
kmeans = KMeans(n_clusters=2).fit(data_values)
y = kmeans.fit_predict(data_values) # prediction of k
df= pd.DataFrame(y,index = data.index)
....
#got here multiple dicts
Example of y:
[1 2 3 4 5 2 2 5 1 0 0 1 0 0 1 0 1 4 4 4 3 1 0 0 1 0 0 ...]
f = pd.DataFrame(y, columns = [buster] )
f.to_csv('busters.csv, mode = 'a')
y = clusters after determination
I dont know how did I stuck on this.. I am iterating over 20 dataframes, each one consists of one columns and values from 1-9. The index is irrelevent. I am trying to append all frame together but instead it just prints them one after the other. If I put ".T" to transpose it , I still got rows with irrelevent values as index, which I cant remove them because they are actually headers.
Needed result
If the dicts produced in each iteration look like {'Buster1': [0, 2, 2, 4, 5]}, {'Buster2': [1, 2, 3, 4, 5]} ..., using 5 elements here for illustration purposes, and all the lists, i.e., values in the dicts, have the same number of elements (as it is the case in your example), you could create a single dict and use pd.DataFrame directly. (You may also want to take a look at pandas.DataFrame.from_dict.)
You may have lists with more than 5 elements, more than 3 dicts (and thus columns), and you will be generating the dicts with a loop, but the code below should be sufficient for getting the idea.
>>> import pandas as pd
>>>
>>> d = {}
>>> # update d in every iteration
>>> d.update({'Buster 1': [0, 2, 2, 4, 5]})
>>> d.update({'Buster 2': [1, 2, 3, 4, 5]})
>>> # ...
>>> d.update({'Buster n': [0, 9, 3, 0, 0]})
>>>
>>> pd.DataFrame(d, columns=d.keys())
Buster 1 Buster 2 Buster n
0 0 1 0
1 2 2 9
2 2 3 3
3 4 4 0
4 5 5 0
If you have the keys, e.g., 'Buster 1', and values, e.g., [0, 2, 2, 4, 5], separated, as I believe is the case, you can simplify the above (and make it more efficient) by replacing d.update({'Buster 1': [0, 2, 2, 4, 5]}) with d['Buster 1']=[0, 2, 2, 4, 5].
I included columns=d.keys() because depending on your Python and pandas version the ordering of the columns may not be as you expect it to be. You can specify the ordering of the columns through specifying the order in which you provide the keys. For example:
>>> pd.DataFrame(d, columns=sorted(d.keys(),reverse=True))
Buster n Buster 2 Buster 1
0 0 1 0
1 9 2 2
2 3 3 2
3 0 4 4
4 0 5 5
Although it may not apply to your use case, if you do not want to print the index, you can take a look at How to print pandas DataFrame without index.

Python Machine Learning Algorithm to Recognize Known Events

I have two sets of data. These data are logged voltages of two points A and B in a circuit. Voltage A is the main component of the circuit, and B is a sub-circuit. Every positive voltage in B is (1) considered a B event and (2) known to be composite of A. I have included sample data where there is a B voltage event, 4,4,0,0,4,4. A real training data set would have many more available data.
How can I train a Python machine learning algorithm to recognize B events given only A data?
Example data:
V(A), V(B)
0, 0
2, 0
5, 4
3, 4
1, 0
3, 4
4, 4
1, 0
0, 0
2, 0
5, 0
7, 0
2, 0
5, 4
9, 4
3, 0
5, 0
4, 4
6, 4
3, 0
2, 0
An idea:
from sklearn.ensemble import RandomForestClassifier
n = 5
X = [df.A.iloc[i:i+n] for i in df.index[:-n+1]]
labels = (df.B > 0)[n-1:]
model = RandomForestClassifier()
model.fit(X, labels)
model.predict(X)
What this does is, it takes the previous n observations as predictors for the 'B' value. On this small data set it achieves 0.94 accuracy (could be overfitting).
EDIT: Corrected a small alignment error.

Python and creating 2d numpy arrays from list

I'm running into some trouble converting values in columns into a Numpy 2d array. What I have as an output from my code is something the following:
38617.0 0 0
40728.0 0 1
40538.0 0 2
40500.5 0 3
40214.0 0 4
40545.0 0 5
40352.5 0 6
40222.5 0 7
40008.0 0 8
40017.0 0 9
40126.0 0 10
40029.0 0 11
39681.5 0 12
39973.0 0 13
39903.0 0 14
39766.5 0 15
39784.0 0 16
39528.5 0 17
39513.5 0 18
And this continues for ~300,000 lines. The coords of the data are arranged as (z,x,y), and I want to convert it into a 2d array with dimensions 765X510 (x,y) so that the z-coordinates are sitting at their respective (x,y) coordinates so that I may write it to an image file.
Any ideas? I've been looking around and I haven't found anything on the matter.
EDIT:
This is the while-loop that's creating the above columns of data (it's actually two, a function is called within another while-loop):
def make_median_image(x,y):
while y < 509:
y = y + 1 # Makes the first value (x,0), b/c Python is indexed at 0
median_first_row0 = sc.median([a11[y,x],a22[y,x],a33[y,x],a44[y,x],a55[y,x],a66[y,x],a77[y,x],a88[y,x],a99[y,x],a1010[y,x]])
print median_first_row0,x,y
list1 = [median_first_row0,x,y]
list = list1.append(
while x < 764:
x = x + 1
make_median_image(x,y)
import numpy as np
l = [[1,2,3], [4,5,6], [7,8,9], [0,0,0]]
You can directly pass a python 2D list into a numpy array.
>>> np.array(l)
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[0, 0, 0]])
If you only want the latter two columns (which are your x,y values)
>>> np.array([[i[1],i[2]] for i in l])
array([[2, 3],
[5, 6],
[8, 9],
[0, 0]])
Would it be possible to create an array from the data below like so?
Data:
38617.0 0 0
40728.0 0 1
40538.0 0 2
40500.5 0 3
40214.0 0 4
40545.0 0 5
40352.5 0 6
40222.5 0 7
40008.0 0 8
40017.0 0 9
40126.0 0 10
40029.0 0 11
39681.5 0 12
39973.0 0 13
39903.0 0 14
39766.5 0 15
39784.0 0 16
39528.5 0 17
39513.5 0 18
... (continues for ~100,000 lines, so you can guess why I'm adamant to find an answer)
What I would like:
numpy_ndarray = [[38617.0, 40728.0, 40538.0, 40500.5, 40214.0, 40545.0, 40352.5, ... (continues until the last column value in the data above is 764) ], [begin next line, when x = 1, ... (until last y-value is 764)], ... [ ... (some last pixel value)]]
So it basically builds a matrix/image grid out of the pixel values in the first column of data that's associated with the (x,y) coordinate in the second and third columns.
lets say your array is l
import numpy as np
l = [[1,2,3], [4,5,6], [7,8,9], [0,0,0]]
>>> np.array(l)
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[0, 0, 0]])
this should work
>>> np.array([[i[1] for i in l],[i[2] for i in l]])
array([[2, 5, 8, 0],
[3, 6, 9, 0]])

Categories