Automatic differentiation and gradient tape
In this tutorial we will cover automatic differentiation, a key technique for optimizing machine learning models.
Gradient Tapes
TensorFlow provides the tf$GradientTape API for automatic differentiation, that is, computing the gradient of a computation with respect to its input variables. TensorFlow “records” all operations executed inside the context of a tf$GradientTape onto a “tape”. TensorFlow then uses that tape and the gradients associated with each recorded operation to compute the gradients of a “recorded” computation using reverse-mode differentiation.
For example:
x <- tf$ones(shape(2, 2))
with(tf$GradientTape() %as% t, {
  t$watch(x)
  y <- tf$reduce_sum(x)
  z <- tf$multiply(y, y)
})
# Derivative of z with respect to the original input tensor x
dz_dx <- t$gradient(z, x)
dz_dx
## tf.Tensor(
## [[8. 8.]
## [8. 8.]], shape=(2, 2), dtype=float32)
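Note that the explicit t$watch() call is only needed for plain tensors like x above; trainable variables created with tf$Variable are watched by the tape automatically. A minimal sketch of the same computation using a variable (the name v is ours, not from the tutorial):
v <- tf$Variable(tf$ones(shape(2, 2)))
with(tf$GradientTape() %as% t, {
  # no t$watch(v) needed: trainable variables are watched automatically
  y <- tf$reduce_sum(v)
  z <- tf$multiply(y, y)
})
t$gradient(z, v) # same 2 x 2 tensor of 8s as above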
You can also request gradients of the output with respect to intermediate values computed during a “recorded” tf$GradientTape context.
x <- tf$ones(shape(2, 2))
with(tf$GradientTape() %as% t, {
  t$watch(x)
  y <- tf$reduce_sum(x)
  z <- tf$multiply(y, y)
})
# Use the tape to compute the derivative of z with respect to the
# intermediate value y.
dz_dy <- t$gradient(z, y)
dz_dy
## tf.Tensor(8.0, shape=(), dtype=float32)
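If you need gradients with respect to several tensors at once, gradient() also accepts a list of sources and returns the gradients in the same order. A hedged sketch (relying on the usual reticulate conversion of an R list to a Python list of sources):
x <- tf$ones(shape(2, 2))
with(tf$GradientTape() %as% t, {
  t$watch(x)
  y <- tf$reduce_sum(x)
  z <- tf$multiply(y, y)
})
# One call, two gradients: dz/dx and dz/dy
grads <- t$gradient(z, list(x, y))
grads[[1]] # 2 x 2 tensor of 8s
grads[[2]] # scalar 8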
By default, the resources held by a GradientTape are released as soon as the GradientTape$gradient() method is called. To compute multiple gradients over the same computation, create a persistent gradient tape. This allows multiple calls to the gradient() method, since resources are released only when the tape object is garbage collected. For example:
x <- tf$constant(3)
with(tf$GradientTape(persistent = TRUE) %as% t, {
  t$watch(x)
  y <- x * x
  z <- y * y
})
# Use the tape to compute several derivatives over the same computation:
# z with respect to x, and the intermediate value y with respect to x.
dz_dx <- t$gradient(z, x) # 108.0 (4*x^3 at x = 3)
dy_dx <- t$gradient(y, x) # 6.0 (2*x at x = 3)
dz_dx
## tf.Tensor(108.0, shape=(), dtype=float32)
dy_dx
## tf.Tensor(6.0, shape=(), dtype=float32)
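Once you are done with a persistent tape, you can drop the reference to it so that the resources it holds can be reclaimed:
# Drop the reference to the persistent tape; its resources are released
# when the object is garbage collected
rm(t)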
Recording control flow
Because tapes record operations as they are executed, R control flow (if statements and for loops, for example) is naturally handled:
f <- function(x, y) {
  output <- 1
  for (i in seq_len(y)) {
    if (i > 2 && i <= 5)
      output <- tf$multiply(output, x)
  }
  output
}
grad <- function(x, y) {
  with(tf$GradientTape() %as% t, {
    t$watch(x)
    out <- f(x, y)
  })
  t$gradient(out, x)
}
x <- tf$constant(2)
grad(x, 6) # i = 3, 4, 5 multiply by x, so f is x^3 and the gradient is 3*x^2 = 12
## tf.Tensor(12.0, shape=(), dtype=float32)
grad(x, 5) # same three multiplications, so the gradient is again 12
## tf.Tensor(12.0, shape=(), dtype=float32)
grad(x, 4) # only i = 3, 4 multiply by x, so f is x^2 and the gradient is 2*x = 4
## tf.Tensor(4.0, shape=(), dtype=float32)
Higher-order gradients
Operations inside the GradientTape context manager are recorded for automatic differentiation. If gradients are computed in that context, then the gradient computation is recorded as well, so the exact same API works for higher-order gradients. For example:
x <- tf$Variable(1.0) # Create a TensorFlow variable initialized to 1.0
with(tf$GradientTape() %as% t, {
  with(tf$GradientTape() %as% t2, {
    y <- x * x * x
  })
  # Compute the gradient inside the 't' context manager
  # which means the gradient computation is differentiable as well.
  dy_dx <- t2$gradient(y, x)
})
d2y_dx <- t$gradient(dy_dx, x)
dy_dx  # 3*x^2 at x = 1
## tf.Tensor(3.0, shape=(), dtype=float32)
d2y_dx # 6*x at x = 1
## tf.Tensor(6.0, shape=(), dtype=float32)
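These gradients are what drive model optimization: once a tape has produced a gradient, you can use it to update a variable. The snippet below is a minimal, illustrative sketch (the parameter w, learning rate lr, and quadratic loss are our own choices, not from the tutorial) of a single gradient-descent step:
w <- tf$Variable(5)  # parameter to optimize
lr <- 0.1            # learning rate
with(tf$GradientTape() %as% t, {
  loss <- w * w      # simple quadratic loss, minimized at w = 0
})
dloss_dw <- t$gradient(loss, w) # 2*w = 10 at w = 5
# One gradient-descent step: w <- w - lr * dloss/dw
w$assign_sub(lr * dloss_dw)
w # now 4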