How to find patterns between numerous causes and the result in Python? - python

For each instance I have a set of problems and a result, like this:
df = pd.DataFrame({
"problems": [[1,2,3], [1,2,4], [1,4,5], [3,4,5], [1,5,6]],
"results": ["A", "A", "C", "C", "A"]
})
I want to find patterns in the relationship between the problems and the result.
My first thought was association rule mining, but that is more for finding patterns within the problems themselves. I guess machine learning could help somehow, but I'm not interested solely in predicting the result; I'm interested in the patterns that lead to that prediction.
I would be interested in patterns like
Problem 1 causes result A
The combination of problems 4 and 5 causes result C
Any thoughts on that?
As I'd implement this in Python, hints about suitable packages are welcome, too.
Thanks a lot!

I was curious and did some experiments in TensorFlow 2.0 with Keras, based on Daniel Möller's comment in this thread:
Update: Make the order not matter anymore:
To make the order not matter anymore, we need to remove the order information from our dataset. To do this, we first convert it to one-hot vectors, then take the max() over the position axis to collapse the three positions into a single vector per sample:
x_no_order = tf.keras.utils.to_categorical(x)
This gives us a one-hot vector looking like this:
array([[[0., 1., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0.]],
[[0., 1., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0.]],
[[0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 1., 0.]],
[[0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 1., 0.]],
[[0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 1.]]], dtype=float32)
Taking the max() of that array along axis 1 gives us one vector per sample that only knows which numbers occur, without any information about their position:
x_no_order.max(axis=1)
array([[0., 1., 1., 1., 0., 0., 0.],
[0., 1., 1., 0., 1., 0., 0.],
[0., 1., 0., 0., 1., 1., 0.],
[0., 0., 0., 1., 1., 1., 0.],
[0., 1., 0., 0., 0., 1., 1.]], dtype=float32)
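The same order-free encoding can also be sketched with NumPy alone, which may help to see what the two steps above do. This is only an illustration: indexing an identity matrix stands in for to_categorical here, and the variable names are made up for the example.

```python
import numpy as np

problems = [[1, 2, 3], [1, 2, 4], [1, 4, 5], [3, 4, 5], [1, 5, 6]]
x = np.array(problems)

# One column per possible value 0..6 (column 0 stays unused, as above).
num_classes = x.max() + 1

# Indexing an identity matrix yields one one-hot row per entry: shape (5, 3, 7).
one_hot = np.eye(num_classes, dtype=np.float32)[x]

# The maximum over the position axis keeps only "which numbers occur": shape (5, 7).
multi_hot = one_hot.max(axis=1)
print(multi_hot)
```

Each row of multi_hot is now the same for any ordering of the same three problems.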
First create the dataframe and the training data.
That's a multiclass classification task, so I use the tokenizer to encode the labels (there are surely better approaches, since it is meant for text):
import tensorflow as tf
import numpy as np
import pandas as pd
df = pd.DataFrame({
"problems": [[1,2,3], [1,2,4], [1,4,5], [3,4,5], [1,5,6]],
"results": ["A", "A", "C", "C", "A"]
})
x = df['problems']
y = df['results']
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(y)
y_train = tokenizer.texts_to_sequences(y)
x = np.array([np.array(i,dtype=np.int32) for i in x])
y_train = np.array(y_train, dtype=np.int32)
Then create the model
input_layer = tf.keras.layers.Input(shape=(3,))  # shape must be a tuple; (3) is just the int 3
dense_layer = tf.keras.layers.Dense(6)(input_layer)
dense_layer2 = tf.keras.layers.Dense(20)(dense_layer)
out_layer = tf.keras.layers.Dense(3, activation="softmax")(dense_layer2)
model = tf.keras.Model(inputs=[input_layer], outputs=[out_layer])
model.compile(optimizer="Nadam", loss="sparse_categorical_crossentropy",metrics=["accuracy"])
Train the model by fitting it
hist = model.fit(x,y_train, epochs=100)
Then, following Daniel's comment, take the sequence you want to test and mask out individual values to test their influence:
arr =np.reshape(np.array([1,2,3]), (1,3))
print(model.predict(arr))
arr =np.reshape(np.array([0,2,3]), (1,3))
print(model.predict(arr))
arr =np.reshape(np.array([1,0,3]), (1,3))
print(model.predict(arr))
arr =np.reshape(np.array([1,2,0]), (1,3))
print(model.predict(arr))
This prints the following result. Keep in mind that since the label indices in y start at 1, the first value is a placeholder, so the second value stands for "A":
[[0.00441748 0.7981055 0.19747704]]
[[0.00103579 0.9863035 0.01266076]]
[[0.0031549 0.9953074 0.00153765]]
[[0.01631758 0.00633342 0.977349 ]]
There we can see that, for the unmasked input, "A" is correctly predicted with a probability of 0.7981.
When we change the 3 in [1,2,3] to 0, giving [1,2,0], the model suddenly predicts "C". So the value 3 at position 3 has the biggest influence. If you put this into a function, you can run it over all the training data you have and build statistics to analyze that further.
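A minimal sketch of such a function is shown here. It assumes that predict behaves like model.predict (a batch in, one row of class probabilities per sample out); masking_influence and toy_predict are hypothetical names, and toy_predict is only a stand-in so the snippet runs without a trained model.

```python
import numpy as np

def masking_influence(predict, sample, mask_value=0):
    """Mask one position at a time and report how far the top-class probability drops."""
    sample = np.asarray(sample)
    base = predict(sample[np.newaxis, :])[0]
    top = int(np.argmax(base))  # class predicted for the unmasked sample
    drops = []
    for pos in range(sample.shape[0]):
        masked = sample.copy()
        masked[pos] = mask_value
        prob = predict(masked[np.newaxis, :])[0][top]
        drops.append(float(base[top] - prob))
    return top, drops

def toy_predict(batch):
    # Hypothetical stand-in for model.predict: class 1 is likely only if a 3 is present.
    p = np.where((batch == 3).any(axis=1), 0.9, 0.1)
    return np.stack([np.zeros_like(p), p, 1.0 - p], axis=1)

top, drops = masking_influence(toy_predict, [1, 2, 3])
print("predicted class:", top)
print("probability drop per masked position:", drops)
```

With a real model you would pass model.predict instead of toy_predict and collect the drops over all training samples.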
This is just a very simple approach; keep in mind that it belongs to a large research field called sensitivity analysis. You might want to take a deeper look at that topic if you are interested.
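As a model-free cross-check, one could also simply count which results co-occur with each problem and each pair of problems. The sketch below is an assumption-laden baseline rather than part of the approach above: result_distribution is a made-up helper, and with only five rows the counts show co-occurrence, not causation.

```python
from collections import Counter, defaultdict
from itertools import combinations

problems = [[1, 2, 3], [1, 2, 4], [1, 4, 5], [3, 4, 5], [1, 5, 6]]
results = ["A", "A", "C", "C", "A"]

def result_distribution(problems, results, max_size=2):
    """Count results for every problem combination of up to max_size items."""
    counts = defaultdict(Counter)
    for probs, res in zip(problems, results):
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(probs), size):
                counts[combo][res] += 1
    return counts

counts = result_distribution(problems, results)
for combo, dist in sorted(counts.items()):
    support = sum(dist.values())            # number of rows containing the combination
    label, hits = dist.most_common(1)[0]    # most frequent result for it
    print(combo, "support:", support, "most common:", label, "share:", hits / support)
```

On this data, the pair (4, 5) only ever appears with "C", while problem 1 appears with "A" in three of four rows.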

Related

How to replace the value of multiple cells in multiple rows in a Pytorch tensor?

I have a tensor
import torch
torch.zeros((5,10))
>>> tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
How can I replace the values of X random cells in each row with random inputs (torch.rand())?
That is, if X = 2, in each row, 2 random cells should be replaced with torch.rand().
Since I need it to not break backpropagation I found here that replacing the .data attribute of the cells should work.
The only familiar thing to me is using a for loop but it's not efficient for a large tensor
You can try tensor.scatter_().
x = torch.zeros(3,4)
n_replace = 3 # number of cells to be replaced with random number
src = torch.randn(x.size())
index = torch.stack([torch.randperm(x.size()[1]) for _ in range(x.size()[0])])[:,:n_replace]
x.scatter_(1, index, src)
Out[22]:
tensor([[ 0.0000, 0.5769, 0.7432, -0.1776],
[-2.1673, -1.0802, 0.0000, 0.6241],
[-0.6421, 0.1315, 0.0000, -2.7224]])
To avoid repetition,
perm = torch.randperm(tensor.size(0))
idx = perm[:k]
samples = tensor[idx]

Vectorize Sequences explanation

Studying Deep Learning with Python, I can't comprehend the following simple piece of code, which encodes the integer sequences into a binary matrix.
def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
x_train = vectorize_sequences(train_data)
And the output of x_train is something like
x_train[0]
array([ 0., 1.,1., ...,0.,0.,0.])
Can someone shed some light on why there are 0.'s in the x_train array when only 1.'s are assigned in each iteration i?
I mean, shouldn't it be all 1's?
The script transforms your dataset into a binary vector space model. Let's dissect things one by one.
First, if we examine the content of train_data, we see that each review is represented as a sequence of word ids. Each word id corresponds to one specific word:
print(train_data[0]) # print the first review
[1, 14, 22, 16, 43, 530, 973, ..., 5345, 19, 178, 32]
Now, this would be very difficult to feed to the network. The lengths of the reviews vary, and fractional values between the integers have no meaning (e.g. if we get 43.5 on the output, what does it mean?).
So what we can do is create a single long vector, the size of the entire dictionary (dimension=10000 in your example). We then associate each element/index of this vector with one word/word_id. So the word represented by word id 14 will now be represented by the element at index 14 of this vector.
Each element will either be 0 (word is not present in the review) or 1 (word is present in the review). And we can treat this as a probability, so we even have meaning for values in between 0 and 1. Furthermore, every review will now be represented by this very long (sparse) vector which has a constant length for every review.
So on a smaller scale if:
word word_id
I -> 0
you -> 1
he -> 2
be -> 3
eat -> 4
happy -> 5
sad -> 6
banana -> 7
a -> 8
the sentences would then be processed in the following way:
I be happy -> [0,3,5] -> [1,0,0,1,0,1,0,0,0]
I eat a banana. -> [0,4,8,7] -> [1,0,0,0,1,0,0,1,1]
Now, I highlighted the word sparse. That means there will be A LOT MORE zeros than ones. We can take advantage of that. Instead of checking, for every word, whether it is contained in a review or not, we check a substantially smaller list of only those words that DO appear in our review.
Therefore, we can make things easy for ourselves and create a reviews × vocabulary matrix of zeros right away with np.zeros((len(sequences), dimension)), and then just go through the words in each review and flip the indicator to 1.0 at the position corresponding to that word:
result[review_id][word_id] = 1.0
So instead of doing 25000 x 10000 = 250 000 000 operations, we only did number of words = 5 967 841. That's just ~2.5% of original amount of operations.
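The small-scale example above can be run directly. In this sketch the 9-word vocabulary and the word ids come from the table above, while the third sequence is an extra made-up input that shows a repeated word id still sets only one cell.

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Start from all zeros and flip only the cells whose word ids occur.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0
    return results

sentences = [
    [0, 3, 5],     # "I be happy"
    [0, 4, 8, 7],  # "I eat a banana."
    [3, 3, 3],     # made-up input with a repeated word id
]
encoded = vectorize_sequences(sentences, dimension=9)
print(encoded.astype(int))
print(int(encoded.sum()), "of", encoded.size, "cells are set")
```

Every cell that is never indexed keeps its initial 0, which is where the zeros in x_train come from.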
The for loop here does not process the whole matrix. As you can see, it enumerates the elements of sequences, so it loops over only one dimension.
Let's take a simple example:
t = np.array([1,2,3,4,5,6,7,8,9])
r = np.zeros((len(t), 10))
Output
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
Then we modify the elements the same way you do:
for i, s in enumerate(t):
    r[i, s] = 1.
array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
You can see that the for loop modified only len(t) elements, namely those at index [i, s] (in this case (0, 1), (1, 2), (2, 3), and so on).
import numpy as np
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

Issues using Keras np_utils.to_categorical

I'm trying to convert an array of integer one-hot vectors into an array of one-hot vectors that Keras will be able to use to fit my model. Here's the relevant part of the code:
Y_train = np.hstack(np.asarray(dataframe.output_vector)).reshape(len(dataframe),len(output_cols))
dummy_y = np_utils.to_categorical(Y_train)
Below is an image showing what Y_train and dummy_y actually are.
I couldn't find any documentation for to_categorical that could help me.
Thanks in advance.
np_utils.to_categorical is used to convert an array of labeled data (from 0 to nb_classes - 1) to one-hot vectors.
The official doc with an example.
In [1]: from keras.utils import np_utils # from keras import utils as np_utils
Using Theano backend.
In [2]: np_utils.to_categorical?
Signature: np_utils.to_categorical(y, num_classes=None)
Docstring:
Convert class vector (integers from 0 to nb_classes) to binary class matrix, for use with categorical_crossentropy.
# Arguments
y: class vector to be converted into a matrix
nb_classes: total number of classes
# Returns
A binary matrix representation of the input.
File: /usr/local/lib/python3.5/dist-packages/keras/utils/np_utils.py
Type: function
In [3]: y_train = [1, 0, 3, 4, 5, 0, 2, 1]
In [4]: """ Assuming the labeled dataset has total six classes (0 to 5), y_train is the true label array """
In [5]: np_utils.to_categorical(y_train, num_classes=6)
Out[5]:
array([[ 0., 1., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 1., 0.],
[ 0., 0., 0., 0., 0., 1.],
[ 1., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0.],
[ 0., 1., 0., 0., 0., 0.]])
from keras.utils.np_utils import to_categorical
UPDATED --- keras.utils.np_utils doesn't work in newer versions; if so use:
from tensorflow.keras.utils import to_categorical
In both cases
to_categorical(0, max_value_of_array)
It assumes the class values were strings that you have label-encoded, so they always run from 0 to n_classes - 1.
For the same example, consider an array of {1,2,3,4,2}.
The columns of the output correspond to the values [zero, one, two, three, four]:
array([[ 0., 1., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 0., 0., 1., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 1., 0., 0.]])
Let's look at another example.
Again, for an array having 3 classes, Y = {4, 8, 9, 4, 9},
to_categorical(Y) will output 10 columns, one for each value from 0 to 9:
array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0. ],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 0. ],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 1. ],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 0. ],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 1. ]])
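The column count in the last example follows from the largest label, not from the number of distinct classes. A NumPy sketch of that behaviour is given here; one_hot is a hypothetical helper written for illustration, not the Keras implementation, and np.unique is used as one possible way to re-encode the labels to 0..2 first.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Illustrative NumPy stand-in for the behaviour described above."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        num_classes = labels.max() + 1  # columns run from 0 up to the largest label
    return np.eye(num_classes)[labels]

Y = np.array([4, 8, 9, 4, 9])

wide = one_hot(Y)  # 10 columns, most of them never used

# Re-encode the three distinct values as 0, 1, 2 before one-hot encoding.
classes, encoded = np.unique(Y, return_inverse=True)
compact = one_hot(encoded)  # 3 columns, one per distinct class

print(wide.shape, compact.shape)
print(classes, encoded)
```

The classes array keeps the mapping back to the original labels, so predictions on the compact encoding can be translated again.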

Calculate the area of two separate geometries in Python

I have been stumped on this problem for a while now and was wondering if anyone would be able to help. So let's say I have a binary image as shown below and I would like to count the black elements (zero). The problem is I want to know the number of elements associated with 'background' and 'trapezoid' in the middle individually, so output two values. What would be the easiest way to approach this? I have been trying to do it without using a mask but is that even possible? I have the numpy and scipy libraries if that helps.
You can use two functions from scipy.ndimage.measurements: label and find_objects.
First you invert the array, because the label function considers zero to be the background.
inverted = 1 - binary_image_array
Then you call label to find the different regions:
labeled_array, num_features = scipy.ndimage.measurements.label(inverted)
So, for this particular array, where you already know there are exactly two black blobs, you have the two regions in labeled_array.
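To get the two pixel counts from the labeled array, one option is np.bincount. The sketch below uses a made-up 10x10 stand-in for the image (a white ring enclosing a small black square, with 0 meaning black) and calls label through scipy.ndimage directly, which is where recent SciPy versions expose it.

```python
import numpy as np
from scipy import ndimage

# Made-up stand-in for the binary image: 0 = black, 1 = white.
binary_image = np.zeros((10, 10), dtype=int)
binary_image[2:8, 2:8] = 1   # white block
binary_image[4:6, 4:6] = 0   # enclosed black region

inverted = 1 - binary_image  # black pixels become the foreground for label
labeled_array, num_features = ndimage.label(inverted)

# bincount index 0 counts the (now white) background of the inverted image; skip it.
sizes = np.bincount(labeled_array.ravel())[1:]
print(num_features, sizes)
```

Each entry of sizes is the number of black pixels in one connected region, so the outer background and the enclosed shape are reported separately.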
Obviously, the scipy approach is a good answer.
I was thinking that you might be able to work with numpy.cumsum and numpy.diff to find an enclosed area.
The cumulative sum will be zero while you are in the black area, then increase by one for every pixel in the white area, be stable again while you traverse the enclosed area, then start increasing again, etc.
The second order difference then finds places where the jumps occur, and you are left with a "classified" map. No guarantee that this generalizes, just an idea.
a = numpy.zeros((10,10))
a[3:7,3:7] = 1
a[4:6, 4:6] = 0
y = numpy.cumsum(a, axis=0)
x = numpy.cumsum(a, axis=1)
yy= numpy.diff(y, n=2, axis=0)
xx = numpy.diff(x, n=2, axis=1)
numpy.dot(xx,yy)
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 2., 2., 2., 2., 0., 0., 0.],
[ 0., 0., 0., 2., 4., 4., 2., 0., 0., 0.],
[ 0., 0., 0., 2., 4., 4., 2., 0., 0., 0.],
[ 0., 0., 0., 2., 2., 2., 2., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

Latent Semantic Analysis in Python discrepancy

I'm trying to follow the Wikipedia Article on latent semantic indexing in Python using the following code:
documentTermMatrix = array([[ 0., 1., 0., 1., 1., 0., 1.],
[ 0., 1., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 1.],
[ 0., 0., 0., 1., 0., 0., 0.],
[ 0., 1., 1., 0., 0., 0., 0.],
[ 1., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 1., 0.],
[ 0., 0., 1., 1., 0., 0., 0.],
[ 1., 0., 0., 1., 0., 0., 0.]])
u,s,vt = linalg.svd(documentTermMatrix, full_matrices=False)
sigma = diag(s)
## remove extra dimensions...
numberOfDimensions = 4
for i in range(4, len(sigma) -1):
    sigma[i][i] = 0
queryVector = array([[ 0.], # same as first column in documentTermMatrix
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 1.],
[ 0.],
[ 0.],
[ 1.]])
How the math says it should work:
dtMatrixToQueryAgainst = dot(u, dot(s,vt))
queryVector = dot(inv(s), dot(transpose(u), queryVector))
similarityToFirst = cosineDistance(queryVector, dtMatrixToQueryAgainst[:,0]
# gives 'matrices are not aligned' error. should be 1 because they're the same
What does work, with math that looks incorrect: ( from here)
dtMatrixToQueryAgainst = dot(s, vt)
queryVector = dot(transpose(u), queryVector)
similarityToFirst = cosineDistance(queryVector, dtMatrixToQueryAgainsst[:,0])
# gives 1, which is correct
Why does the second route work and the first not, when everything I can find about the math of LSA shows the first as correct? I feel like I'm missing something obvious...
There are several inconsistencies in your code that cause errors before your point of confusion. This makes it difficult to understand exactly what you tried and why you are confused (clearly you did not run the code as it is pasted, or it would have thrown an exception earlier).
That said, if I follow your intent correctly, your first approach is nearly correct. Consider the following code:
from numpy import array, diag, dot, linalg
from numpy.linalg import inv
documentTermMatrix = array([[ 0., 1., 0., 1., 1., 0., 1.],
[ 0., 1., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 1.],
[ 0., 0., 0., 1., 0., 0., 0.],
[ 0., 1., 1., 0., 0., 0., 0.],
[ 1., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 1., 0.],
[ 0., 0., 1., 1., 0., 0., 0.],
[ 1., 0., 0., 1., 0., 0., 0.]])
numDimensions = 4
u, s, vt = linalg.svd(documentTermMatrix, full_matrices=False)
u = u[:, :numDimensions]
sigma = diag(s)[:numDimensions, :numDimensions]
vt = vt[:numDimensions, :]
lowRankDocumentTermMatrix = dot(u, dot(sigma, vt))
queryVector = documentTermMatrix[:, 0]
lowDimensionalQuery = dot(inv(sigma), dot(u.T, queryVector))
lowDimensionalQuery
vt[:,0]
You should see that lowDimensionalQuery and vt[:,0] are equal. Think of vt as a representation of the documents in a low-dimensional subspace. First we map our query into that subspace to get lowDimensionalQuery, and then we compare it with the corresponding column of vt. Your mistake was trying to compare the transformed query to the document vector from lowRankDocumentTermMatrix, which lives in the original space. Since the transformed query has fewer elements than the "reconstructed" document, Python complained.
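To turn that comparison into a similarity score, the folded-in query can be compared with every column of the truncated vt. The following is a sketch under assumptions: cosine_similarity is a hypothetical helper (the question's cosineDistance is not shown), and dividing by the singular values replaces the explicit inv(sigma) product.

```python
import numpy as np

A = np.array([[0., 1., 0., 1., 1., 0., 1.],
              [0., 1., 1., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0., 1., 1.],
              [0., 0., 0., 1., 0., 0., 0.],
              [0., 1., 1., 0., 0., 0., 0.],
              [1., 0., 0., 1., 0., 0., 0.],
              [0., 0., 0., 0., 1., 1., 0.],
              [0., 0., 1., 1., 0., 0., 0.],
              [1., 0., 0., 1., 0., 0., 0.]])
k = 4

u, s, vt = np.linalg.svd(A, full_matrices=False)
u_k, s_k, vt_k = u[:, :k], s[:k], vt[:k, :]

# Fold the query (here: the first document) into the k-dimensional space.
query = A[:, 0]
q_low = (u_k.T @ query) / s_k  # same as inv(diag(s_k)) @ u_k.T @ query

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare against each document's low-dimensional representation (columns of vt_k).
sims = [cosine_similarity(q_low, vt_k[:, j]) for j in range(A.shape[1])]
print(np.round(sims, 3))
```

The first similarity comes out as 1 up to rounding, because the query is exactly the first document.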
