Classification based on categorical data - python

I have a dataset
Inp1 Inp2 Output
A,B,C AI,UI,JI Animals
L,M,N LI,DO,LI Noun
X,Y AI,UI Extras
For these values, I need to apply an ML algorithm. Which algorithm would be best suited to find relations between these groups and assign an output class to them?

Assuming each cell is a list (as you have multiple strings stored in each) and that you are not looking for a specific encoding, the following should work. It can also be adjusted to suit different encodings.
import pandas as pd

A = [["Inp1", "Inp2", "Inp3", "Output"],
     [["A", "B", "C"], ["AI", "UI", "JI"], ["Apple", "Bat", "Dog"], ["Animals"]],
     [["L", "M", "N"], ["LI", "DO", "LI"], ["Lawn", "Moon", "Noon"], ["Noun"]]]
dataframe = pd.DataFrame(A[1:], columns=A[0])

def my_encoding(row):
    encoded_row = []
    for ls in row:
        encoded_ls = []
        for s in ls:
            # Interpret the UTF-8 bytes of each string as a little-endian integer
            sbytes = s.encode('utf-8')
            sint = int.from_bytes(sbytes, 'little')
            encoded_ls.append(sint)
        encoded_row.append(encoded_ls)
    return encoded_row

print(dataframe.apply(my_encoding))
Output:
Inp1 ... Output
0 [65, 66, 67] ... [32488788024979009]
1 [76, 77, 78] ... [1853189966]
If my assumptions are incorrect or this is not what you're looking for, let me know.

Since you mention that you are going to apply an ML algorithm (say, classification), I think one-hot encoding (OHE) is what you are looking for.
Requested format:
Inp1 Inp2 Inp3 Output
7,44,87 4,65,2 47,36,20 45
This format can't be used to train your model directly, because it packs multiple labels into a single cell; you would have to pre-process it again, e.g. with OHE.
Suggested format:
A B C L M N X Y AI DO JI LI UI Apple Bat Dog Lawn Moon Noon Yemen Zombie
1 1 1 0 0 0 0 0 1 0 1 0 1 1 1 1 0 0 0 0 0
0 0 0 1 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0
0 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 1
After that you can label-encode / one-hot-encode the output field as your model requires.
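If it helps, here is a minimal sketch of producing that multi-hot layout with scikit-learn's MultiLabelBinarizer (the toy rows below are reconstructed from the question, so treat them as placeholders for your real data):
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

rows = [
    {"Inp1": ["A", "B", "C"], "Inp2": ["AI", "UI", "JI"], "Output": "Animals"},
    {"Inp1": ["L", "M", "N"], "Inp2": ["LI", "DO", "LI"], "Output": "Noun"},
    {"Inp1": ["X", "Y"], "Inp2": ["AI", "UI"], "Output": "Extras"},
]
df = pd.DataFrame(rows)

# Multi-hot encode each list column, then concatenate the indicator columns
parts = []
for column in ["Inp1", "Inp2"]:
    mlb = MultiLabelBinarizer()
    encoded = pd.DataFrame(mlb.fit_transform(df[column]), columns=mlb.classes_)
    parts.append(encoded)
X = pd.concat(parts, axis=1)  # one 0/1 column per distinct label
y = df["Output"]              # label-encode or OHE this as your model requires
print(X)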
Happy learning!

BCE is for multi-label classification, whereas categorical CE is for multi-class classification where each example belongs to a single class. For your task you need to decide whether a single example ends up in exactly one class (CE) or may end up in multiple classes (BCE). Probably the second is true, since an animal can be a noun. ;)
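To make the distinction concrete, here is a sketch of the two output heads in Keras (the layer sizes and the 21-feature input width are illustrative assumptions):
from tensorflow import keras
from tensorflow.keras import layers

n_classes = 3  # e.g. Animals, Noun, Extras

# Multi-class: each example belongs to exactly one class -> softmax + categorical CE
single_label = keras.Sequential([
    keras.Input(shape=(21,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_classes, activation="softmax"),
])
single_label.compile(optimizer="adam", loss="categorical_crossentropy")

# Multi-label: an example may belong to several classes -> sigmoid + binary CE
multi_label = keras.Sequential([
    keras.Input(shape=(21,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_classes, activation="sigmoid"),
])
multi_label.compile(optimizer="adam", loss="binary_crossentropy")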

Related

How to prioritize certain features with max_features parameter in countvectorizer

I have a working program, but I realized that some important n-grams in the test data were not among the 6500 max_features I had allowed in the training data. Is it possible to add features like "liar" or "pathetic" that I would train with my training data?
This is what I currently have for making the vectorizer:
vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=6500)
X = vectorizer.fit_transform(train['text'])
feature_names = vectorizer.get_feature_names()
This is hacky, and you probably cannot count on it working in the future, but CountVectorizer primarily relies on the learned attribute vocabulary_, which is a dictionary with tokens as keys and "feature index" as values. You can add to that dictionary and everything appears to work as intended; borrowing from the example in the docs:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
print(X2.toarray())
## Output:
# [[0 0 1 1 0 0 1 0 0 0 0 1 0]
# [0 1 0 1 0 1 0 1 0 0 1 0 0]
# [1 0 0 1 0 0 0 0 1 1 0 1 0]
# [0 0 1 0 1 0 1 0 0 0 0 0 1]]
# Now we tweak:
vocab_len = len(vectorizer2.vocabulary_)
vectorizer2.vocabulary_['new token'] = vocab_len # append to end
print(vectorizer2.transform(["And this document has a new token"]).toarray())
## Output
# [[1 0 0 0 0 0 0 0 0 0 1 0 0 1]]

How to create an adjacency matrix in pandas such that the labels are preserved when rows and cols are rearranged

I have never used pandas or numpy for this purpose before and am wondering what's the idiomatic way to construct labeled adjacency matrices in pandas.
My data comes in a shape similar to this. Each "uL22"-type thing is a protein, and the arrays are the neighbors of that protein. Hence (in the example below) an adjacency matrix would have 1s in the bL31 row, uL5 column, and conversely, etc.
My problem is twofold:
The actual dimension of the adjacency matrix is dictated by a set of protein names that is generally much larger than those contained in the nbrtree, so I'm wondering what's the best way to map my nbrtree data to that set, say a 100-by-100 matrix corresponding to the neighborhood relationships of 100 proteins.
I'm not quite sure how to "bind" the names (i.e. uL32 etc.) of those 100 proteins to the rows and columns of this matrix such that when I start moving rows around the names move accordingly (I'm planning to rearrange the adjacency matrix to have a block-diagonal structure).
"nbrtree": {
"bL31": ["uL5"],
"uL5": ["bL31"],
"bL32": ["uL22"],
"uL22": ["bL32","bL17"],
...
"bL33": ["bL35"],
"bL35": ["bL33","uL15"],
"uL13": ["bL20"],
"bL20": ["uL13","bL21"]
}
>>> len(nbrtree)
40
I'm sure this is a manipulation that people perform daily; I'm just not quite familiar with how dataframes work, so I'm probably looking for something very obvious.
Thank you so much!
I don't fully understand your question, but from what I gather, try out this code.
from pprint import pprint as pp
import pandas as pd

dic = {"first": {
    "a": ["b", "d"],
    "b": ["a", "h"],
    "c": ["d"],
    "d": ["c", "g"],
    "e": ["f"],
    "f": ["e", "d"],
    "g": ["h", "a"],
    "h": ["g", "b"]
}}

col = list(dic['first'].keys())
data = pd.DataFrame(0, index=col, columns=col, dtype=int)
# Set a 1 in row x for every neighbor listed in y
for x, y in dic['first'].items():
    data.loc[x, y] = 1
pp(data)
The output from this code being
a b c d e f g h
a 0 1 0 1 0 0 0 0
b 1 0 0 0 0 0 0 1
c 0 0 0 1 0 0 0 0
d 0 0 1 0 0 0 1 0
e 0 0 0 0 0 1 0 0
f 0 0 0 1 1 0 0 0
g 1 0 0 0 0 0 0 1
h 0 1 0 0 0 0 1 0
Note that the adjacency matrix here is not symmetric, as I have used some random data.
To absorb your labels into the dataframe, change to the following:
data = pd.DataFrame(0, index = ['index']+col, columns = ['column']+col, dtype = int)
data.loc['index'] = [0]+col
data.loc[:, 'column'] = ['*']+col
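To address the twofold problem more directly, here is a sketch under my reading of the question: build the frame over the full (larger) protein set, fill it from nbrtree, and reorder by label so the names travel with the data. The full_set below is a stand-in for your real list of 100 names:
import pandas as pd

nbrtree = {  # truncated version of the question's data
    "bL31": ["uL5"], "uL5": ["bL31"],
    "bL32": ["uL22"], "uL22": ["bL32", "bL17"],
}

# The full protein set, generally larger than the names in nbrtree
full_set = sorted({"bL31", "uL5", "bL32", "uL22", "bL17", "bL33", "bL35", "uL15"})

adj = pd.DataFrame(0, index=full_set, columns=full_set, dtype=int)
for protein, neighbors in nbrtree.items():
    adj.loc[protein, neighbors] = 1

# Rearranging by label keeps the names bound to their rows/columns,
# e.g. an order coming from a clustering step for block-diagonal structure:
order = list(reversed(full_set))
rearranged = adj.loc[order, order]
print(rearranged)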

How can I build an LSTM AutoEncoder with PyTorch?

I have my data as a DataFrame:
dOpen dHigh dLow dClose dVolume day_of_week_0 day_of_week_1 ... month_6 month_7 month_8 month_9 month_10 month_11 month_12
639 -0.002498 -0.000278 -0.005576 -0.002228 -0.002229 0 0 ... 0 0 1 0 0 0 0
640 -0.004174 -0.005275 -0.005607 -0.005583 -0.005584 0 0 ... 0 0 1 0 0 0 0
641 -0.002235 0.003070 0.004511 0.008984 0.008984 1 0 ... 0 0 1 0 0 0 0
642 0.006161 -0.000278 -0.000281 -0.001948 -0.001948 0 1 ... 0 0 1 0 0 0 0
643 -0.002505 0.001113 0.005053 0.002788 0.002788 0 0 ... 0 0 1 0 0 0 0
644 0.004185 0.000556 -0.000559 -0.001668 -0.001668 0 0 ... 0 0 1 0 0 0 0
645 0.002779 0.003056 0.003913 0.001114 0.001114 0 0 ... 0 0 1 0 0 0 0
646 0.000277 0.004155 -0.002227 -0.002782 -0.002782 1 0 ... 0 0 1 0 0 0 0
647 -0.005540 -0.007448 -0.003348 0.001953 0.001953 0 1 ... 0 0 1 0 0 0 0
648 0.001393 -0.000278 0.001960 -0.003619 -0.003619 0 0 ... 0 0 1 0 0 0 0
My input will be 10 rows (already one-hot encoded). I want to create an n-dimensional auto-encoded representation. So, as I understand it, my input and output should be the same.
I've seen some examples of constructing this, but am still stuck on the first step. Is my training data just a lot of those samples, stacked into a matrix? What then?
I apologize for the general nature of the question. Any questions, just ask and I will clarify in the comments.
Thank you.
It isn't quite clear from the question what you are trying to achieve. Based on what you wrote, you want to create an autoencoder with the same input and output, and that doesn't quite make sense to me when I look at your data set. In the common case, the encoder part of an autoencoder is a model that compresses a large set of input features into a small output vector, and the decoder performs the inverse operation, reconstructing plausible input features from that small vector. The result of using an autoencoder is an enhanced input (in some sense, e.g. with noise removed).
You can find a few examples here, with the third use case providing code for sequence data (learning a random-number-generation model). Here is another example, which looks closer to your application: a sequential model is constructed to encode a large data set with information loss. If that is what you are trying to achieve, you'll find the code there.
If the goal is sequence prediction (like future stock prices), this and that example seem more appropriate, as you likely only want to predict a handful of values in your data sequence (say dHigh and dLow) and you don't need to predict day_of_week_n or month_n (even though that part of the autoencoder model will probably train much more reliably, as the pattern is pretty clear). This approach will let you predict a single consequent output feature value (tomorrow's dHigh and dLow).
If you want to predict a sequence of future outputs, you can use a sequence of outputs rather than a single one in your model.
In general, the structure of inputs and outputs is totally up to you.
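If you do want the classic reconstruction setup as a starting point, here is a minimal PyTorch sketch (the class name, layer sizes, and feature count are my assumptions, not code from the linked examples): each 10-row window is encoded into a small vector z, decoded back, and trained against itself with MSE.
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, n_features, hidden_size=64, bottleneck=8):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.to_z = nn.Linear(hidden_size, bottleneck)
        self.from_z = nn.Linear(bottleneck, hidden_size)
        self.decoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_features)

    def forward(self, x):                    # x: (batch, 10, n_features)
        _, (h, _) = self.encoder(x)          # h: (1, batch, hidden_size)
        z = self.to_z(h[-1])                 # (batch, bottleneck): the representation
        rep = self.from_z(z).unsqueeze(1)    # (batch, 1, hidden_size)
        rep = rep.repeat(1, x.size(1), 1)    # repeat the code along the sequence
        out, _ = self.decoder(rep)
        return self.head(out), z             # reconstruction and code

# Training data: a batch of sliding 10-row windows cut from the dataframe.
# n_features=17 is a stand-in for however many columns your frame has.
model = LSTMAutoencoder(n_features=17)
opt = torch.optim.Adam(model.parameters())
x = torch.randn(32, 10, 17)                  # stand-in for real windows
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)      # input == target
loss.backward()
opt.step()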

Keras predict(..) output interpretation

I currently use a Keras model for text classification. Calling the evaluate method, I often get accuracies around 90 percent. However, the output of the predict function does not seem interpretable to me. I am using binary_crossentropy. I do not know which value will trigger the neurons to be active, or how to see that at all.
I attached some outputs (the binary ones are the actual classes). How does evaluate compute the accuracy?
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0.02632797 0.02205164 0.00884359 0.00948936 0.21821289 0.02533042
0.07450009 0.01911888 0.22753781 0.00904192 0.0023979 0.03065717
0.0049532 0.09980826 0.0047154 ]
[1 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0.17915486 0.1063956 0.05139401 0.01718497 0.06058983 0.11605757
0.11845534 0.03865225 0.6665891 0.01648878 0.02570258 0.14659531
0.01044943 0.04226198 0.02007598]
[1 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0.07659172 0.07020403 0.00733146 0.01322867 0.43747708 0.02796873
0.03419256 0.03095324 0.15433209 0.02747604 0.01686232 0.0165229
0.0226498 0.01947697 0.07312528]
Use 'categorical_crossentropy' instead of 'binary_crossentropy'.
Check if you are normalizing the training data (for example X_train/255) and not normalizing the test data.
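On the interpretation question itself (my reading, not part of the answers above): with binary_crossentropy, Keras resolves the 'accuracy' metric to binary_accuracy, which thresholds each sigmoid output at 0.5 and averages the element-wise matches. A sketch on the first attached row:
import numpy as np

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
probs = np.array([0.02632797, 0.02205164, 0.00884359, 0.00948936, 0.21821289,
                  0.02533042, 0.07450009, 0.01911888, 0.22753781, 0.00904192,
                  0.0023979, 0.03065717, 0.0049532, 0.09980826, 0.0047154])

y_pred = (probs >= 0.5).astype(int)   # threshold the sigmoid outputs
print(np.mean(y_pred == y_true))      # 14/15 ~ 0.933: every output is below 0.5,
                                      # so only the true 1 is missed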

regression coefficient calculation in python

I have a dataframe and an input text file of activity. The dataframe is produced via pandas. I want to find the regression coefficient of each term using the following formula:
Y = C1a*X1a + C1b*X1b + ... + C2a*X2a + C2b*X2b + ... + C0,
where Y is the activity, Cna the regression coefficient for residue choice a at position n, Xna the dummy variable coding (Xna = 1 or 0) corresponding to the presence or absence of residue choice a at position n, and C0 the mean value of the activity.
My dataframe look likes
2u 2s 4r 4n 4m 7h 7v
0 1 1 0 0 0 1
0 1 0 1 0 0 1
1 0 0 1 0 1 0
1 0 0 0 1 1 0
1 0 1 0 0 1 0
Here 1 and 0 represents the presence and absence of residues respectively.
Using MLR (multiple linear regression), how can I find the regression coefficient of each residue, i.e. 2u, 2s, 4r, 4n, 4m, 7h, 7v?
C1a represents the regression coefficient of residue a at the 1st position (here 1a is 2u, 1b is 2s, 2a is 4r, ...), and X1a represents the dummy value, i.e. 0 or 1, corresponding to 1a.
Activity file contain following data
6.5
5.9
5.7
6.4
5.2
So the first equation will look like:
6.5 = C1a*0 + C1b*1 + C2a*1 + C2b*0 + C2c*0 + C3a*0 + C3b*1 + C0
…
Can I get the regression coefficients using numpy? Please help me; all suggestions will be appreciated.
Let A be your dataframe (you can get it as a plain numpy array; read it in using np.loadtxt if it's CSV) and y be your activity file (again, a numpy array), and use np.linalg.lstsq:
import numpy as np

DF = """0 1 1 0 0 0 1
0 1 0 1 0 0 1
1 0 0 1 0 1 0
1 0 0 0 1 1 0
1 0 1 0 0 1 0"""
res = "6.5 5.9 5.7 6.4 5.2"

A = np.fromstring(DF, sep=" ").reshape((5, 7))
y = np.fromstring(res, sep=" ")

x, residuals, rank, svals = np.linalg.lstsq(A, y, rcond=None)
print(x)
# [2.115625, 2.490625, 1.24375, 1.19375, 2.16875, 2.115625, 2.490625]
print(A.dot(x))  # predictions
# [6.225, 6.175, 5.425, 6.4, 5.475]
print(np.sum((A.dot(x) - y) ** 2))  # sum of squared residuals
# 0.3025
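One caveat, continuing from the block above (my addition, not part of the original answer): the question's formula includes an intercept C0, which the plain lstsq call omits. A common way to include it is to append a column of ones to A:
A1 = np.hstack([A, np.ones((A.shape[0], 1))])   # extra all-ones column for C0
coeffs, *_ = np.linalg.lstsq(A1, y, rcond=None)
# coeffs[:-1] are the residue coefficients, coeffs[-1] is C0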
