I am working on tensorflow 1.01.
I am trying to modify an example found at:
My model is a simple linear model
x_data = tf.placeholder(shape=[None, 3], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
# Create variables for linear regression
A = tf.Variable(tf.random_normal(shape=[3,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
# Declare model operations
model_output = tf.add(tf.matmul(x_data, A), b)
Specifically, I would like to add another L0 penalty term to the model loss, same way as done with L2 norm:
l2_a_loss = tf.reduce_mean(tf.square(A))
elastic_param2 = tf.constant(1.)
e2_term = tf.multiply(elastic_param2, l2_a_loss)
However, I can not compute a loss using L0 norm
elastic_param0= tf.constant(1.)
l0_a_loss= tf.reduce_mean(tf.norm(A,ord=0))
e0_term= tf.multiply(elastic_param0, l0_a_loss)
Plugging the additional term into the model loss
loss = tf.expand_dims(tf.add(tf.add(tf.reduce_mean(tf.square(y_target - model_output)), e0_term), e2_term), 0)
fails with
ValueError: 'ord' must be a supported vector norm, got 0.
I was hoping that changing the axis argument would fix it, but with
l0_a_loss= tf.reduce_mean(tf.norm(A,ord=0,axis=(0,1)))
I still get
ValueError: 'ord' must be a supported matrix norm in ['euclidean', 'fro', 1, inf], got 0
How can I minimize the L0 norm of A in this model?
The tensorflow documentation is wrong (even in current 1.3 version).
As you can see from this commit:
Fix description of tf.norm as it doesn't actually accept ord=0.
This means that you have to implement the L0 norm yourself; you can't use tf.norm.
I have temporarily solved this with:
l0_a_loss=tf.cast( tf.count_nonzero(A), tf.float32)
Looking forward to official documentation/code update in tensorflow
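One caveat with the count_nonzero workaround: the count is piecewise constant, so it gives an optimizer no useful gradient. The NumPy sketch below illustrates this on a made-up 3x1 matrix. The matrix values and the L1 surrogate are illustrative assumptions, not part of the original answer.

```python
import numpy as np

# Hypothetical 3x1 weight matrix, standing in for A in the question.
A = np.array([[0.5], [0.0], [-1.2]])

# "L0 norm": the number of non-zero entries.
l0 = np.count_nonzero(A)

# A small perturbation of the non-zero entries leaves the count unchanged,
# so the L0 term is flat almost everywhere (zero gradient).
eps = 1e-3
l0_perturbed = np.count_nonzero(A + eps * (A != 0))

# The L1 norm is a common differentiable surrogate that still favours sparsity.
l1 = np.abs(A).sum()

print(l0, l0_perturbed, l1)
```

If the term is only meant to be monitored rather than minimized, the plain count is sufficient.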
As said in the title, I am trying to create a mixture of multivariate normal distributions using tensorflow probability package.
In my original project, I am feeding the weights of the categorical, the loc and the variance from the output of a neural network. However, when creating the graph, I get the following error:
components[0] batch shape must be compatible with cat shape and other component batch shapes
I recreated the same problem using placeholders:
import tensorflow as tf
import tensorflow_probability as tfp # dist= tfp.distributions
sess = tf.compat.v1.InteractiveSession()
l1 = tf.compat.v1.placeholder(dtype=tf.float32, shape=[None, 2], name='observations_1')
l2 = tf.compat.v1.placeholder(dtype=tf.float32, shape=[None, 2], name='observations_2')
log_std = tf.compat.v1.get_variable('log_std', [1, 2], dtype=tf.float32,
mix = tf.compat.v1.placeholder(dtype=tf.float32, shape=[None,1], name='weights')
cat = tfp.distributions.Categorical(probs=[mix, 1.-mix])
components = [
    tfp.distributions.MultivariateNormalDiag(loc=l1, scale_diag=tf.exp(log_std)),
    tfp.distributions.MultivariateNormalDiag(loc=l2, scale_diag=tf.exp(log_std)),
]
bimix_gauss = tfp.distributions.Mixture(cat=cat, components=components)
So, my question is, what am I doing wrong? I looked into the error and it seems tensorshape_util.is_compatible_with is what raises the error but I don't see why.
When the components are the same type, MixtureSameFamily should be more performant.
There you only pass a single Categorical instance (with .batch_shape [b1,b2,...,bn]) and a single MVNDiag instance (with .batch_shape [b1,b2,...,bn,numcats]).
For only two classes, I wonder if Bernoulli would work?
It seems you provided a mis-shaped input to tfp.distributions.Categorical. Its probs parameter should be of shape [batch_size, cat_size], while the one you provide is rather [cat_size, batch_size, 1]. So maybe try to parametrize probs with tf.concat([mix, 1-mix], 1).
There may also be a problem with your log_std, which doesn't have the same shape as l1 and l2. In case MultivariateNormalDiag doesn't properly broadcast it, try to specify its shape as (None, 2) or to tile it so that its first dimension corresponds to that of your location parameters.
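The shape difference described above can be checked without TensorFlow. This is a minimal NumPy sketch, assuming a batch of four samples and a made-up mixing weight of 0.3. NumPy's stacking and concatenation follow the same shape rules as wrapping tensors in a Python list versus calling tf.concat.

```python
import numpy as np

batch_size = 4  # assumed batch size for illustration
mix = np.full((batch_size, 1), 0.3, dtype=np.float32)  # one mixing weight per sample

# Wrapping the two arrays in a list stacks them on a new leading axis.
stacked = np.array([mix, 1.0 - mix])
print(stacked.shape)  # (2, 4, 1)

# Concatenating along axis 1 gives one row of class probabilities per sample.
probs = np.concatenate([mix, 1.0 - mix], axis=1)
print(probs.shape)  # (4, 2)
```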
After the introduction of TensorFlow 2.0, the SciPy interface (tf.contrib.opt.ScipyOptimizerInterface) has been removed. However, I would still like to use the SciPy optimizer scipy.optimize.minimize(method='L-BFGS-B') to train a neural network (a Keras Sequential model). For the optimizer to work, it requires as input a function fun(x0), with x0 being an array of shape (n,). Therefore, the first step is to "flatten" the weight matrices into a vector with the required shape. To this end, I modified the code provided by https://pychao.com/2019/11/02/optimize-tensorflow-keras-models-with-l-bfgs-from-tensorflow-probability/. This provides a function factory meant to create such a function fun(x0). However, the code does not seem to work and the loss does not decrease. I would be really grateful if someone could help me work this out.
Here is the piece of code I am using:
func = function_factory(model, loss_function, x_u_train, u_train)
# convert initial model parameters to a 1D tf.Tensor
init_params = tf.dynamic_stitch(func.idx, model.trainable_variables)
init_params = tf.cast(init_params, dtype=tf.float32)
# train the model with L-BFGS solver
results = scipy.optimize.minimize(fun=func, x0=init_params, method='L-BFGS-B')
def loss_function(x_u_train, u_train, network):
    u_pred = tf.cast(network(x_u_train), dtype=tf.float32)
    loss_value = tf.reduce_mean(tf.square(u_train - u_pred))
    return tf.cast(loss_value, dtype=tf.float32)
def function_factory(model, loss_f, x_u_train, u_train):
    """A factory to create a function required by tfp.optimizer.lbfgs_minimize.

    Args:
        model [in]: an instance of `tf.keras.Model` or its subclasses.
        loss [in]: a function with signature loss_value = loss(pred_y, true_y).
        train_x [in]: the input part of training data.
        train_y [in]: the output part of training data.

    Returns:
        A function that has a signature of:
            loss_value, gradients = f(model_parameters).
    """
    # obtain the shapes of all trainable parameters in the model
    shapes = tf.shape_n(model.trainable_variables)
    n_tensors = len(shapes)
    # we'll use tf.dynamic_stitch and tf.dynamic_partition later, so we need to
    # prepare required information first
    count = 0
    idx = []  # stitch indices
    part = []  # partition indices
    for i, shape in enumerate(shapes):
        n = np.prod(shape)
        idx.append(tf.reshape(tf.range(count, count+n, dtype=tf.int32), shape))
        part.extend([i]*n)
        count += n
    part = tf.constant(part)

    def assign_new_model_parameters(params_1d):
        """A function updating the model's parameters with a 1D tf.Tensor.

        Args:
            params_1d [in]: a 1D tf.Tensor representing the model's trainable parameters.
        """
        params = tf.dynamic_partition(params_1d, part, n_tensors)
        for i, (shape, param) in enumerate(zip(shapes, params)):
            model.trainable_variables[i].assign(tf.cast(tf.reshape(param, shape), dtype=tf.float32))

    # now create a function that will be returned by this factory
    def f(params_1d):
        """This function is created by function_factory.

        Args:
            params_1d [in]: a 1D tf.Tensor.

        Returns:
            A scalar loss.
        """
        # update the parameters in the model
        assign_new_model_parameters(params_1d)
        # calculate the loss
        loss_value = loss_f(x_u_train, u_train, model)
        # print out iteration & loss
        f.iter.assign_add(1)
        tf.print("Iter:", f.iter, "loss:", loss_value)
        return loss_value

    # store these information as members so we can use them outside the scope
    f.iter = tf.Variable(0)
    f.idx = idx
    f.part = part
    f.shapes = shapes
    f.assign_new_model_parameters = assign_new_model_parameters
    return f
Here model is a tf.keras.Sequential object.
Thank you in advance for any help!
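The "flattening" step described in the question can be illustrated independently of TensorFlow. This is a minimal NumPy sketch of the round trip between a list of parameter arrays and a single vector of shape (n,). The parameter values and the helper names flatten and unflatten are hypothetical and only play the role that tf.dynamic_stitch and tf.dynamic_partition play in the function factory.

```python
import numpy as np

# Hypothetical parameter list of a small dense layer: kernel and bias.
params = [np.arange(6, dtype=np.float64).reshape(2, 3),
          np.array([10.0, 11.0, 12.0])]
shapes = [p.shape for p in params]
sizes = [p.size for p in params]

def flatten(arrays):
    """Concatenate all arrays into one vector of shape (n,)."""
    return np.concatenate([a.ravel() for a in arrays])

def unflatten(vector):
    """Split a flat vector back into arrays with the original shapes."""
    pieces = np.split(vector, np.cumsum(sizes)[:-1])
    return [piece.reshape(shape) for piece, shape in zip(pieces, shapes)]

x0 = flatten(params)
restored = unflatten(x0)
print(x0.shape)  # (9,)
```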
Changing from tf1 to tf2 I was exposed to the same question and after a little bit of experimenting I found the solution below that shows how to establish the interface between a function decorated with tf.function and a scipy optimizer. The important changes compared to the question are:
1. As mentioned by Ives, scipy's lbfgs needs to get the function value and gradient, so you need to provide a function that delivers both and then set jac=True.
2. scipy's lbfgs is a Fortran function that expects the interface to provide np.float64 arrays, while tensorflow tf.function uses tf.float32, so one has to cast input and output.
I provide an example of how this can be done for a toy problem here below.
import tensorflow as tf
import numpy as np
import scipy.optimize as sopt
def model(x):
    return tf.reduce_sum(tf.square(x - tf.constant(2, dtype=tf.float32)))

@tf.function
def val_and_grad(x):
    with tf.GradientTape() as tape:
        tape.watch(x)  # x is a constant tensor, so it must be watched explicitly
        loss = model(x)
    grad = tape.gradient(loss, x)
    return loss, grad

def func(x):
    # cast to float32 for TensorFlow and back to float64 for scipy's Fortran routine
    return [vv.numpy().astype(np.float64) for vv in val_and_grad(tf.constant(x, dtype=tf.float32))]

resdd = sopt.minimize(fun=func, x0=np.ones(5),
                      jac=True, method='L-BFGS-B')
print(resdd)
fun: 7.105427357601002e-14
hess_inv: <5x5 LbfgsInvHessProduct with dtype=float64>
jac: array([-2.38418579e-07, -2.38418579e-07, -2.38418579e-07, -2.38418579e-07,
nfev: 3
nit: 2
status: 0
success: True
x: array([1.99999988, 1.99999988, 1.99999988, 1.99999988, 1.99999988])
For comparing speed, I use the lbfgs optimizer for a style transfer problem (see here for the network). Note that for this problem the network parameters are fixed and the input signal is adapted. As the optimized parameters (the input signal) are 1D, the function factory is not needed.
I compare four implementations:
TF1.12: TF1 with ScipyOptimizerInterface
TF2.0 (E): the approach above without using tf.function decorators
TF2.0 (G): the approach above using tf.function decorators
TF2.0/TFP: using the lbfgs minimizer from tensorflow_probability
For this comparison the optimization is stopped after 300 iterations (generally for convergence the problem requires 3000 iterations)
Method      runtime (300 it)   final loss
TF1.12      240s               0.045 (baseline)
TF2.0 (E)   299s               0.045
TF2.0 (G)   233s               0.045
TF2.0/TFP   226s               0.053
The TF2.0 eager mode (TF2.0 (E)) works correctly but is about 25% slower than the TF1.12 baseline version. TF2.0 (G) with tf.function works fine and is marginally faster than TF1.12, which is a good thing to know.
The optimizer from tensorflow_probability (TF2.0/TFP) is slightly faster than TF2.0 (G) using scipy's lbfgs but does not achieve the same error reduction. In fact, the decrease of the loss over time is not monotonic, which seems a bad sign. Comparing the two implementations of lbfgs (scipy and tensorflow_probability = TFP), it is clear that the Fortran code in scipy is significantly more complex.
So either the simplification of the algorithm in TFP is harmful here, or the fact that TFP performs all calculations in float32 is a problem.
Here is a simple solution using a library (autograd_minimize) that I wrote building on the answer of Roebel:
import numpy as np
import tensorflow as tf
from autograd_minimize import minimize
def rosen_tf(x):
    return tf.reduce_sum(100.0*(x[1:] - x[:-1]**2.0)**2.0 + (1 - x[:-1])**2.0)
res = minimize(rosen_tf, np.array([0.,0.]))
>>> array([0.99999912, 0.99999824])
It also works with keras models as shown with this naive example of linear regression:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from autograd_minimize.tf_wrapper import tf_function_factory
from autograd_minimize import minimize
import tensorflow as tf
#### Prepares data
X = np.random.random((200, 2))
y = X[:,:1]*2+X[:,1:]*0.4-1
#### Creates model
model = keras.Sequential([keras.Input(shape=2),
                          layers.Dense(1)])
# Transforms model into a function of its parameter
func, params = tf_function_factory(model, tf.keras.losses.MSE, X, y)
# Minimization
res = minimize(func, params, method='L-BFGS-B')
>>> [array([[2.0000016 ],
[0.40000062]]), array([-1.00000164])]
I guess SciPy does not know how to calculate gradients of TensorFlow objects. Try to use the original function factory (i.e., the one that also returns the gradients together with the loss), and set jac=True in scipy.optimize.minimize.
I tested the python code from the original Gist and replaced tfp.optimizer.lbfgs_minimize with SciPy optimizer. It worked with BFGS method:
results = scipy.optimize.minimize(fun=func, x0=init_params, jac=True, method='BFGS')
jac=True means SciPy knows that func also returns gradients.
For L-BFGS-B, however, it's tricky. After some effort, I finally made it work. I had to comment out the @tf.function lines and let func return grads.numpy() instead of the raw TF Tensor. I guess that's because the underlying implementation of L-BFGS-B is a Fortran function, so there might be some issue converting data from tf.Tensor -> numpy array -> Fortran array. Forcing func to return the ndarray version of the gradients resolves the problem, but then it's not possible to use @tf.function.
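The contract that jac=True sets up can be seen with a plain NumPy objective, leaving TensorFlow out entirely. This is a minimal sketch, assuming a made-up quadratic loss with a hypothetical target vector. The callable returns the loss and its gradient together, both as float64.

```python
import numpy as np
import scipy.optimize as sopt

target = np.array([1.0, 2.0, 3.0])  # hypothetical optimum for illustration

def value_and_grad(x):
    """Return (loss, gradient), both float64, as L-BFGS-B expects."""
    diff = x - target
    loss = np.sum(diff ** 2)
    grad = 2.0 * diff
    return np.float64(loss), grad.astype(np.float64)

# jac=True tells SciPy that the callable returns the gradient as well.
result = sopt.minimize(fun=value_and_grad, x0=np.zeros(3),
                       jac=True, method="L-BFGS-B")
print(result.x)
```

With a TensorFlow model, the body of value_and_grad would be replaced by a GradientTape computation whose outputs are converted with .numpy() and cast to float64.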
(Similar Question to: Is there a tf.keras.optimizers implementation for L-BFGS?)
While this is not from anywhere as legit as tf.contrib, it's an implementation of L-BFGS (and any other scipy.optimize.minimize solver) for your consideration in case it fits your use case:
The package has models that extend keras.Model and keras.Sequential models, and can be compiled with .compile(..., optimizer="L-BFGS-B") to use L-BFGS in TF2, or compiled with any of the other standard optimizers (because flipping between stochastic & deterministic should be easy!):
I want to create a network where in the input layer nodes are just connected to some nodes in the next layer. Here is a small example:
My solution so far is that I set the weight of the edge between i1 and h1 to zero and after every optimization step I multiply the weights with a matrix (I call this matrix mask matrix) in which every entry is 1 except the entry of the weight of the edge between i1 and h1.
(See code below)
Is this approach right? Or does it have an effect on the gradient descent? Is there another approach to create this kind of network in TensorFlow?
import tensorflow as tf
import tensorflow.contrib.eager as tfe
import numpy as np
model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, activation=tf.sigmoid, input_shape=(2,)),  # input shape required
    tf.keras.layers.Dense(2, activation=tf.sigmoid)
])
#set the weights
weights=[np.array([[0, 0.25],[0.2,0.3]]),np.array([0.35,0.35]),np.array([[0.4,0.5],[0.45, 0.55]]),np.array([0.6,0.6])]
model.set_weights(weights)
features = tf.convert_to_tensor([[0.05,0.10 ]])
labels = tf.convert_to_tensor([[0.01,0.99 ]])
mask =np.array([[0, 1],[1,1]])
#define the loss function
def loss(model, x, y):
    y_ = model(x)
    return tf.losses.mean_squared_error(labels=y, predictions=y_)
#define the gradient calculation
def grad(model, inputs, targets):
    with tf.GradientTape() as tape:
        loss_value = loss(model, inputs, targets)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)
#create optimizer and global step
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
global_step = tf.train.get_or_create_global_step()
#optimization step
loss_value, grads = grad(model, features, labels)
optimizer.apply_gradients(zip(grads, model.variables),global_step)
#masking the optimized weights
If you are looking for a solution for the specific example you provided, you can simply use tf.keras Functional API and define two Dense layers where one is connected to both neurons in the previous layer and the other one is only connected to one of the neurons:
from tensorflow.keras.layers import Input, Lambda, Dense, concatenate
from tensorflow.keras.models import Model
inp = Input(shape=(2,))
inp2 = Lambda(lambda x: x[:,1:2])(inp) # get the second neuron
h1_out = Dense(1, activation='sigmoid')(inp2) # only connected to the second neuron
h2_out = Dense(1, activation='sigmoid')(inp) # connected to both neurons
h_out = concatenate([h1_out, h2_out])
out = Dense(2, activation='sigmoid')(h_out)
model = Model(inp, out)
# simply train it using `fit`
The problem with your solution and some others suggested by other answers in this post is that they do not prevent training of this weight. They allow gradient descent to train the non-existent weight and then overwrite it retrospectively. This results in a network that has a zero in this location as desired, but it negatively affects your training process: the back-propagation calculation does not see the masking step, because it is not part of the TensorFlow graph, so gradient descent follows a path that assumes this weight has an effect on the outcome (it does not).
A better solution is to include the masking step as part of your TensorFlow graph, so that it is factored into the gradient descent. Since the masking step is simply an element-wise multiplication by your sparse, binary mask matrix, you can include the mask as an element-wise multiplication in the graph definition using tf.multiply.
Sadly this means saying goodbye to the user-friendly keras.layers methods and embracing a more nuts-and-bolts approach to TensorFlow. I can't see an obvious way to do it using the layers API.
See the implementation below, I have tried to provide comments explaining what is happening at each stage.
import tensorflow as tf
## Graph definition for model
# set up tf.placeholders for inputs x, and outputs y_
# these remain fixed during training and can have values fed to them during the session
with tf.name_scope("Placeholders"):
    x = tf.placeholder(tf.float32, shape=[None, 2], name="x")  # input layer
    y_ = tf.placeholder(tf.float32, shape=[None, 2], name="y_")  # output layer
# set up tf.Variables for the weights at each layer from l1 to l3, and set up feeding of initial values
# also set up mask as a variable and set it to be untrainable
with tf.name_scope("Variables"):
    w_l1_values = [[0, 0.25],[0.2,0.3]]
    w_l1 = tf.Variable(w_l1_values, name="w_l1")
    w_l2_values = [[0.4,0.5],[0.45, 0.55]]
    w_l2 = tf.Variable(w_l2_values, name="w_l2")
    mask_values = [[0., 1.], [1., 1.]]
    mask = tf.Variable(mask_values, trainable=False, name="mask")
# link each set of weights as matrix multiplications in the graph. Include an elementwise multiplication by mask.
# Sequence takes us from inputs x to output final_out, which will be compared to labels fed to placeholder y_
l1_out = tf.nn.relu(tf.matmul(x, tf.multiply(w_l1, mask)), name="l1_out")
final_out = tf.nn.relu(tf.matmul(l1_out, w_l2), name="output")
## define loss function and training operation
with tf.name_scope("Loss"):
    # some loss defined as a function of graph output: final_out and labels: y_
    loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=final_out, labels=y_, name="loss")
with tf.name_scope("Train"):
    # some optimisation strategy, arbitrary learning rate
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001, name="optimizer_adam")
    train_op = optimizer.minimize(loss, name="train_op")
# create session, initialise variables and train according to inputs and corresponding labels
# This should show that the values of the first layer weights change, but the one set to 0 remains at 0
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # variables must be initialised before use
    initial_l1_weights = sess.graph.get_tensor_by_name("Variables/w_l1:0")
    inputs = [[0.05, 0.10]]
    labels = [[0.01, 0.99]]
    ans = sess.run(train_op, feed_dict={"Placeholders/x:0": inputs, "Placeholders/y_:0": labels})
    train_steps = 1
    for i in range(train_steps):
        initial_l1_weights = sess.graph.get_tensor_by_name("Variables/w_l1:0")
Or use the answer provided by today for a keras friendly option.
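Why masking inside the forward pass keeps the masked weight out of training can be checked numerically. The NumPy sketch below assumes a single linear layer and a made-up squared-sum loss (both simplifications, not the network above). It compares the chain-rule gradient with a finite-difference estimate for the masked entry.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 2))                 # one input sample
W = rng.normal(size=(2, 2))                 # first-layer weights
mask = np.array([[0.0, 1.0], [1.0, 1.0]])   # 0 removes the i1 -> h1 edge

def loss(W):
    # Linear layer with the mask applied inside the forward pass.
    h = x @ (W * mask)
    return 0.5 * np.sum(h ** 2)

# Analytic gradient via the chain rule: dL/dW = (x^T h) * mask.
h = x @ (W * mask)
grad = (x.T @ h) * mask

# Finite-difference check for the masked entry W[0, 0].
eps = 1e-6
W_plus = W.copy()
W_plus[0, 0] += eps
numeric_masked = (loss(W_plus) - loss(W)) / eps

# Same check for an unmasked entry, W[1, 1].
W_plus = W.copy()
W_plus[1, 1] += eps
numeric_free = (loss(W_plus) - loss(W)) / eps

print(grad[0, 0], numeric_masked)
print(grad[1, 1], numeric_free)
```

The masked entry receives a zero gradient, so an optimizer never moves it, while the unmasked entries are updated as usual.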
You have multiple options here.
First, you could use the dynamic masking approach in your example. I believe this will work as expected since the gradients w.r.t. the masked-out parameters will be zero (the output is constant when you change the unused parameters). This approach is simple and it can be used even when your mask is not constant during the training.
Second, if you know beforehand which weights will be always zero, you can compose your weight matrix using tf.get_variable to get a submatrix, and then concatenate it with a tf.constant tensor, e.g.:
weights_sub = tf.get_variable("w", [dim_in, dim_out - 1])
zeros = tf.zeros([dim_in, 1])
weights = tf.concat([weights_sub, zeros], axis=1)
This example makes one column of your weight matrix always zero.
Finally, if your mask is more complex, you can use tf.get_variable on a flattened vector and then compose a tf.SparseTensor with the variable values on the used indices:
weights_used = tf.get_variable("w", [num_used_vars])
indices = ... # get your indices in a 2-D matrix of shape [num_used_vars, 2]
dense_shape = tf.constant([dim_in, dim_out]) # this is the final shape of the weight matrix
weights = tf.SparseTensor(indices, weights_used, dense_shape)
EDIT: This probably won't work in combination with Keras' set_weights method, as it expects Numpy arrays, not Tensors.
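The idea behind the last option, keeping only the used weights as free values and scattering them into the full matrix, can be sketched in NumPy. The indices and values below are assumptions chosen to reproduce the first-layer weights from the question. In TensorFlow the free values would be the trainable variable.

```python
import numpy as np

dim_in, dim_out = 2, 2
# Indices of the weights that exist; the i1 -> h1 entry (0, 0) is left out.
indices = np.array([[0, 1], [1, 0], [1, 1]])
weights_used = np.array([0.25, 0.2, 0.3])  # only these values would be trained

# Scatter the free values into an otherwise-zero dense matrix.
weights = np.zeros((dim_in, dim_out))
weights[indices[:, 0], indices[:, 1]] = weights_used
print(weights)
```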
I have written the following binary classification program in TensorFlow that is buggy. The cost is always zero no matter what the input is. I am trying to debug a larger program which is not learning anything from the data, and I have narrowed down at least one bug to the cost function always returning zero. The given program uses some random inputs and has the same problem. self.X_train and self.y_train are originally supposed to be read from files, and the function self.predict() has more layers forming a feedforward neural network.
import numpy as np
import tensorflow as tf
class annClassifier():
def __init__(self):
with tf.variable_scope("Input"):
self.X = tf.placeholder(tf.float32, shape=(100, 11))
with tf.variable_scope("Output"):
self.y = tf.placeholder(tf.float32, shape=(100, 1))
self.X_train = np.random.rand(100, 11)
self.y_train = np.random.randint(0,2, size=(100, 1))
def predict(self):
with tf.variable_scope('OutputLayer'):
weights = tf.get_variable(name='weights',
shape=[11, 1],
bases = tf.get_variable(name='bases',
final_output = tf.matmul(self.X, weights) + bases
return final_output
def train(self):
prediction = self.predict()
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=self.y))
with tf.Session() as sess:
print(sess.run(cost, feed_dict={self.X:self.X_train, self.y:self.y_train}))
with tf.Graph().as_default():
classifier = annClassifier()
If someone could please figure out what I am doing wrong in this, I can try making the same change in my original program. Thanks a lot!
The only problem is the invalid cost function. softmax_cross_entropy_with_logits should be used only if you have at least two output units (one per class), as the softmax of a single output always returns 1. It is defined as:
softmax(x)_i = exp(x_i) / SUM_j exp(x_j)
so for a single number (one dimensional output)
softmax(x) = exp(x) / exp(x) = 1
Furthermore, for softmax output TF expects one-hot encoded labels, so if you provide only 0 or 1, there are two possibilities:
True label is 0, so the cost is -0*log(1) = 0
True label is 1, so the cost is -1*log(1) = 0
TensorFlow has a separate function to handle binary classification, tf.nn.sigmoid_cross_entropy_with_logits, which applies sigmoid instead (note that the same function for more than one output would apply sigmoid independently on each dimension, which is what multi-label classification would expect).
Just switch to this cost and you are good to go; you do not have to encode anything as one-hot either, as this function is designed for exactly this use case.
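The argument above can be reproduced numerically. This is a minimal NumPy sketch with made-up logits and labels. It writes out the two formulas directly rather than calling the TensorFlow functions, and without the numerical-stability tricks those functions use.

```python
import numpy as np

logits = np.array([[-3.0], [0.5], [4.0]])   # one output unit per sample
labels = np.array([[0.0], [1.0], [1.0]])

# Softmax over a single unit is always 1, so the cross-entropy is always 0.
softmax = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
softmax_cost = -(labels * np.log(softmax)).sum(axis=1)

# Sigmoid cross-entropy treats the single unit as P(class 1).
p = 1.0 / (1.0 + np.exp(-logits))
sigmoid_cost = -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))

print(softmax_cost)          # zero for every sample
print(sigmoid_cost.ravel())  # strictly positive
```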
The only missing bit is that your code does not have an actual training routine: you need to define an optimiser, ask it to minimise the loss, and then run the train op in a loop. In your current setting you just predict over and over with a network that never changes.
In particular, please refer to Cross Entropy Jungle question on SO which provides more detailed description of all these different helper functions in TF (and other libraries), which have different requirements/use cases.
softmax_cross_entropy_with_logits is basically a numerically stable implementation of these two steps:
softmax = tf.nn.softmax(prediction)
cost = -tf.reduce_sum(labels * tf.log(softmax), 1)
Now in your example, prediction is a single value, so when you apply softmax to it, the result is always 1 irrespective of the value (exp(prediction)/exp(prediction) = 1), and so the tf.log(softmax) term becomes 0. That's why your cost is always zero.
Either apply sigmoid to get probabilities between 0 and 1, or, if you want to use softmax, encode the labels as [1, 0] for class 0 and [0, 1] for class 1.
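For the softmax route, the label encoding can be sketched as follows. The labels and two-unit logits are made up for illustration, and np.eye is used here as a simple one-hot encoder.

```python
import numpy as np

y = np.array([0, 1, 1])    # integer class labels
one_hot = np.eye(2)[y]     # [1, 0] for class 0, [0, 1] for class 1

logits = np.array([[2.0, -1.0], [0.0, 0.0], [-1.0, 3.0]])  # two units per sample
shifted = logits - logits.max(axis=1, keepdims=True)       # for numerical stability
softmax = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
cost = -(one_hot * np.log(softmax)).sum(axis=1)
print(cost)
```

With two output units the cost depends on the logits again. For example, the second sample has equal logits and therefore a cost of log(2).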
So I tried implementing the neural network from:
but using TensorFlow instead. I printed out the cost function twice during training and the error appears to be getting smaller, yet all the values in the output layer are close to 1 when only two of them should be. I imagine it might be something wrong with my maths, but I'm not sure. There is no difference when I try with a hidden layer or use squared error as the cost function. Here is my code:
import tensorflow as tf
import numpy as np
input_layer_size = 3
output_layer_size = 1
x = tf.placeholder(tf.float32, [None, input_layer_size]) #holds input values
y = tf.placeholder(tf.float32, [None, output_layer_size]) # holds true y values
input_weights = tf.Variable(tf.random_normal([input_layer_size, output_layer_size]))
input_bias = tf.Variable(tf.random_normal([1, output_layer_size]))
output_layer_vals = tf.nn.sigmoid(tf.matmul(x, input_weights) + input_bias)
cross_entropy = -tf.reduce_sum(y * tf.log(output_layer_vals))
training = tf.train.AdamOptimizer(0.1).minimize(cross_entropy)
x_data = np.array(
y_data = np.reshape(np.array([0,0,1,1]).T, (4, 1))
with tf.Session() as ses:
    init = tf.initialize_all_variables()
    ses.run(init)  # variables must be initialised before training
    for _ in range(1000):
        ses.run(training, feed_dict={x: x_data, y:y_data})
        if _ % 500 == 0:
            print(ses.run(output_layer_vals, feed_dict={x: x_data}))
            print(ses.run(cross_entropy, feed_dict={x: x_data, y:y_data}))
And this is what it outputs:
[[ 0.82036656]
[ 0.96750367]
[ 0.87607527]
[ 0.97876281]]
0.21947 #first cross_entropy error
[[ 0.99937409]
[ 0.99998224]
[ 0.99992537]
[ 0.99999785]]
0.00062825 #second cross_entropy error, as you can see, it's smaller
First of all: you have no hidden layer. As far as I remember, a basic perceptron cannot model the XOR problem without some adjustments. Artificial neural networks are inspired by biology, but they do not model real neural networks exactly. Thus, you have to build at least an MLP (multilayer perceptron), which consists of at least one input, one hidden and one output layer. The XOR problem needs at least two neurons plus bias in the hidden layer to be solved correctly (with high precision).
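The claim that two hidden neurons plus bias are enough for XOR can be illustrated with hand-picked weights. This is a sketch with a step activation and weights chosen by hand (an assumption for illustration, not a trained network).

```python
def step(z):
    """Heaviside step activation."""
    return 1 if z > 0 else 0

def xor_net(a, b):
    # Hidden layer: two neurons, each with a bias.
    h_or = step(a + b - 0.5)    # fires if at least one input is 1
    h_and = step(a + b - 1.5)   # fires only if both inputs are 1
    # Output neuron: "OR and not AND".
    return step(h_or - h_and - 0.5)

outputs = [xor_net(a, b) for a in (0, 1) for b in (0, 1)]
print(outputs)  # [0, 1, 1, 0]
```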
Additionally, your learning rate is too high. 0.1 is a very high learning rate. To put it simply, it means that you adapt your current state by 10% of the gradient in every single learning step. This lets your network quickly forget invariants it has already learned. Usually the learning rate is somewhere between 1e-2 and 1e-6, depending on your problem, network size and general architecture.
Moreover you implemented the "simplified/short" version of cross-entropy. See wikipedia for the full version: cross-entropy. However, to avoid some edge cases TensorFlow already has its own version of cross-entropy: for example tf.nn.softmax_cross_entropy_with_logits.
Finally, you should remember that the cross-entropy error is a logistic loss function that operates on the probabilities of your classes. Although your sigmoid function squashes the output layer into the interval [0, 1], this only works in your case because you have a single output neuron. As soon as you have more than one output neuron, you also need the sum of the output layer to be exactly 1.0 in order to really describe probabilities for every class on the output layer.
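The difference between the short cross-entropy used in the question and the full binary version can be made concrete. The NumPy sketch below uses the targets from the question and two made-up prediction vectors: one that is close to 1 everywhere and one that is correct.

```python
import numpy as np

y = np.array([0.0, 0.0, 1.0, 1.0])                    # targets from the question
all_ones = np.full(4, 0.9999)                         # predicts ~1 for every sample
correct = np.array([0.0001, 0.0001, 0.9999, 0.9999])  # matches the targets

def short_ce(y, p):
    # Version used in the question: only penalises positive targets.
    return -np.sum(y * np.log(p))

def full_ce(y, p):
    # Binary cross-entropy with the (1 - y) term included.
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

print(short_ce(y, all_ones), short_ce(y, correct))
print(full_ce(y, all_ones), full_ce(y, correct))
```

The short version cannot tell the two predictions apart, which is consistent with the question's outputs drifting towards 1 while the reported error shrinks. The full version penalises the all-ones prediction heavily.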