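Recall from Section 2.4 that calculating derivatives is the crucial step in all of the optimization algorithms that we will use to train deep networks. While the calculations are straightforward, working them out by hand can be tedious and error-prone, and this problem only grows as our models become more complex.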

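Fortunately, all modern deep learning frameworks take this work off our plates by offering automatic differentiation (often shortened to autograd). As we pass data through each successive function, the framework builds a computational graph that tracks how each value depends on the others. To calculate derivatives, automatic differentiation works backwards through this graph, applying the chain rule. The computational algorithm for applying the chain rule in this fashion is called backpropagation.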
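To make the idea of a computational graph concrete, here is a minimal sketch of reverse-mode differentiation for scalars in plain Python. The `Value` class and its methods are hypothetical names invented for this illustration; real frameworks work on tensors and are far more elaborate, so this should be read as a toy model of the principle rather than as how any library is implemented.

```python
class Value:
    """A scalar that remembers how it was computed (hypothetical helper)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        # Each entry pairs a parent node with the local derivative
        # d(self)/d(parent), recorded during the forward pass.
        self.parents = parents

    def __add__(self, other):
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    def backward(self):
        # Order the graph so that every node appears after its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule.
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad


x = Value(3.0)
two = Value(2.0)
y = two * x * x  # y = 2 * x^2, so dy/dx = 4 * x
y.backward()
print(y.data, x.grad)  # 18.0 12.0
```

Note that `x` feeds into the graph twice, and the two contributions to `x.grad` are summed, which is exactly what the chain rule requires when a value is reused.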

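While autograd libraries have become a hot topic over the past decade, they have a long history. In fact, the earliest references to autograd date back over half a century (Wengert, 1964). The core ideas behind modern backpropagation date to a PhD thesis from 1980 (Speelpenning, 1980) and were further developed in the late 1980s (Griewank, 1989). While backpropagation has become the default method for computing gradients, it is not the only option. For instance, the Julia programming language employs forward propagation (Revels et al., 2016). Before exploring methods, let's first master the autograd package.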

import torch

from mxnet import autograd, np, npx

npx.set_np()

from jax import numpy as jnp

import tensorflow as tf

2.5.1. A Simple Function

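Let's assume that we are interested in differentiating the function $y = 2\mathbf{x}^{\top}\mathbf{x}$ with respect to the column vector $\mathbf{x}$. To start, we assign x an initial value.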

x = torch.arange(4.0)
x

tensor([0., 1., 2., 3.])

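Before we calculate the gradient of y with respect to x, we need a place to store it. In general, we avoid allocating new memory every time we take a derivative because deep learning requires successively computing derivatives with respect to the same parameters thousands or millions of times, and we might risk running out of memory. Note that the gradient of a scalar-valued function with respect to a vector x is vector-valued and has the same shape as x.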

# Can also create x = torch.arange(4.0, requires_grad=True)
x.requires_grad_(True)
x.grad # The gradient is None by default

x = np.arange(4.0)
x

array([0., 1., 2., 3.])

Before we calculate the gradient of y with respect to x, we need a place to store it. In general, we avoid allocating new memory every time we take a derivative because deep learning requires successively computing derivatives with respect to the same parameters thousands or millions of times, and we might risk running out of memory. Note that the gradient of a scalar-valued function with respect to a vector x is vector-valued and has the same shape as x.

# We allocate memory for a tensor's gradient by invoking `attach_grad`
x.attach_grad()
# After we calculate a gradient taken with respect to `x`, we will be able to
# access it via the `grad` attribute, whose values are initialized with 0s
x.grad

array([0., 0., 0., 0.])

x = jnp.arange(4.0)
x

No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

Array([0., 1., 2., 3.], dtype=float32)

x = tf.range(4, dtype=tf.float32)
x

x = tf.Variable(x)

We now calculate our function of x and assign the result to y.

y = 2 * torch.dot(x, x)
y

tensor(28., grad_fn=<MulBackward0>)

We can now take the gradient of y with respect to x by calling its backward method. Next, we can access the gradient via x's grad attribute.

y.backward()
x.grad

tensor([ 0., 4., 8., 12.])

# Our code is inside an `autograd.record` scope to build the computational
# graph
with autograd.record():
  y = 2 * np.dot(x, x)
y

array(28.)

We can now take the gradient of y with respect to x by calling its backward method. Next, we can access the gradient via x’s grad attribute.

y.backward()
x.grad

[09:38:36] src/base.cc:49: GPU context requested, but no GPUs found.

array([ 0., 4., 8., 12.])

y = lambda x: 2 * jnp.dot(x, x)
y(x)

Array(28., dtype=float32)

We can now take the gradient of y with respect to x by passing through the grad transform.

from jax import grad

# The `grad` transform returns a Python function that
# computes the gradient of the original function
x_grad = grad(y)(x)
x_grad

Array([ 0., 4., 8., 12.], dtype=float32)

# Record all computations onto a tape
with tf.GradientTape() as t:
  y = 2 * tf.tensordot(x, x, axes=1)
y

We can now calculate the gradient of y with respect to x by calling the gradient method.

x_grad = t.gradient(y, x)
x_grad

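We already know that the gradient of the function $y = 2\mathbf{x}^{\top}\mathbf{x}$ with respect to $\mathbf{x}$ should be $4\mathbf{x}$. We can now verify that the automatic gradient computation and the expected result are identical.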

x.grad == 4 * x

tensor([True, True, True, True])
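The expected value follows from writing the function out by components: $y = 2\sum_i x_i^2$, so $\partial y / \partial x_k = 4 x_k$. As an independent sanity check that does not rely on any autograd library, the sketch below compares this analytic gradient with a central finite-difference estimate. The helper name `numerical_grad` and the step size `eps` are assumptions made for this illustration.

```python
def y(x):
    """y = 2 * (x . x) for a list of floats x."""
    return 2.0 * sum(xi * xi for xi in x)


def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


x = [0.0, 1.0, 2.0, 3.0]
approx = numerical_grad(y, x)
exact = [4.0 * xi for xi in x]
# Compare with a tolerance, since finite differences are only approximate.
print(all(abs(a - e) < 1e-4 for a, e in zip(approx, exact)))
```

Finite differences need two function evaluations per input component, which is one reason they serve as a test rather than as a replacement for automatic differentiation.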

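Now let's calculate another function of x and take its gradient. Note that PyTorch does not automatically reset the gradient buffer when we record a new gradient. Instead, the new gradient is added to the gradient that is already stored. This behavior comes in handy when we want to optimize the sum of multiple objective functions. To reset the gradient buffer, we can call x.grad.zero_() as follows: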

x.grad.zero_() # Reset the gradient
y = x.sum()
y.backward()
x.grad

tensor([1., 1., 1., 1.])
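The difference between accumulating and resetting can be mimicked with a plain Python list standing in for the gradient buffer. This is only a framework-free sketch: the names `buffer`, `grad_dot`, and `grad_sum` are hypothetical, and the gradients are written out analytically instead of being produced by autograd.

```python
def grad_dot(x):
    """Analytic gradient of 2 * (x . x), namely 4 * x."""
    return [4.0 * xi for xi in x]


def grad_sum(x):
    """Analytic gradient of sum(x), namely a vector of ones."""
    return [1.0 for _ in x]


x = [0.0, 1.0, 2.0, 3.0]

# Two backward passes without a reset: the contributions add up,
# giving the gradient of the sum of both objectives.
buffer = [0.0] * len(x)
for g in (grad_dot(x), grad_sum(x)):
    buffer = [b + gi for b, gi in zip(buffer, g)]
accumulated = buffer
print(accumulated)  # [1.0, 5.0, 9.0, 13.0]

# Zeroing the buffer first isolates the gradient of the second function.
buffer = [0.0] * len(x)
buffer = [b + gi for b, gi in zip(buffer, grad_sum(x))]
fresh = buffer
print(fresh)  # [1.0, 1.0, 1.0, 1.0]
```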

x.grad == 4 * x

array([ True, True, True, True])

Now let’s calculate another function of x and take its gradient. Note that MXNet resets the gradient buffer whenever we record a new gradient.

with autograd.record():
  y = x.sum()
y.backward()
x.grad # Overwritten by the newly calculated gradient

array([1., 1., 1., 1.])

x_grad == 4 * x

Array([ True, True, True, True], dtype=bool)

y = lambda x: x.sum()
grad(y)(x)

Array([1., 1., 1., 1.], dtype=float32)

x_grad == 4 * x

Now let’s calculate another function of x and take its gradient. Note that TensorFlow resets the gradient buffer whenever we record a new gradient.

with tf.GradientTape() as t:
  y = tf.reduce_sum(x)
t.gradient(y, x) # Overwritten by the newly calculated gradient

2.5.2. Backward for Non-Scalar Variables

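When y is a vector, the most natural representation of the derivative of y with respect to a vector x is a matrix called the Jacobian, which contains the partial derivatives of each component of y with respect to each component of x. Likewise, for higher-order y and x, the result of differentiation could be an even higher-order tensor.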

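While Jacobians do show up in some advanced machine learning techniques, more commonly we want to sum up the gradients of each component of y with respect to the full vector x, yielding a vector of the same shape as x. For example, we often have a vector representing the value of our loss function calculated separately for each example among a batch of training examples. Here, we just want to sum up the gradients computed individually for each example.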

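Because deep learning frameworks vary in how they interpret gradients of non-scalar tensors, PyTorch takes some steps to avoid confusion. Invoking backward on a non-scalar elicits an error unless we tell PyTorch how to reduce the object to a scalar. More formally, we need to provide some vector $\mathbf{v}$ such that backward will compute $\mathbf{v}^{\top} \partial_{\mathbf{x}} \mathbf{y}$ rather than $\partial_{\mathbf{x}} \mathbf{y}$. This next part may be confusing, but for reasons that will become clear later, this argument (representing $\mathbf{v}$) is named gradient. For a more detailed description, see Yang Zhang's Medium post.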

x.grad.zero_()
y = x * x
y.backward(gradient=torch.ones(len(y))) # Faster: y.sum().backward()
x.grad

tensor([0., 2., 4., 6.])
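To see why passing a vector of ones reproduces the gradient of the sum, it may help to build the Jacobian explicitly for this small case. The sketch below does so in plain Python for the elementwise function y = x * x; the helper names `jacobian_of_square` and `vjp` are hypothetical, and a real framework would never materialize the full Jacobian like this.

```python
def jacobian_of_square(x):
    """Jacobian of y = x * x (elementwise): J[i][j] = d y_i / d x_j."""
    n = len(x)
    return [[2.0 * x[i] if i == j else 0.0 for j in range(n)]
            for i in range(n)]


def vjp(v, jac):
    """Vector-Jacobian product: the row vector v times the matrix jac."""
    n_cols = len(jac[0])
    return [sum(v[i] * jac[i][j] for i in range(len(v)))
            for j in range(n_cols)]


x = [0.0, 1.0, 2.0, 3.0]
J = jacobian_of_square(x)

# v = ones sums the rows of J, i.e. the gradient of sum(y).
ones = [1.0] * len(x)
summed = vjp(ones, J)
print(summed)  # [0.0, 2.0, 4.0, 6.0]

# A one-hot v instead selects the gradient of a single component of y.
one_hot = [0.0, 0.0, 1.0, 0.0]
single = vjp(one_hot, J)
print(single)  # [0.0, 0.0, 4.0, 0.0]
```

Other choices of v weight the components of y differently, which is how a per-example loss vector can be reduced with arbitrary weights.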

MXNet handles this problem by reducing all tensors to scalars by summing before computing a gradient. In other words, rather than returning the Jacobian $\partial_{\mathbf{x}} \mathbf{y}$, it returns the gradient of the sum $\partial_{\mathbf{x}} \sum_i y_i$.

with autograd.record():
  y = x * x
y.backward()
x.grad # Equals the gradient of y = sum(x * x)

array([0., 2., 4., 6.])

y = lambda x: x * x
# grad is only defined for scalar output functions
grad(lambda x: y(x).sum())(x)

Array([0., 2., 4., 6.], dtype=float32)

By default, TensorFlow returns the gradient of the sum. In other words, rather than returning the Jacobian $\partial_{\mathbf{x}} \mathbf{y}$, it returns the gradient of the sum $\partial_{\mathbf{x}} \sum_i y_i$.

with tf.GradientTape() as t:
  y = x * x
t.gradient(y, x) # Same as y = tf.reduce_sum(x * x)

2.5.3. Detaching Computation

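Sometimes, we wish to move some calculations outside of the recorded computational graph. For example, say that we use the input to create some auxiliary intermediate terms for which we do not want to compute a gradient. In this case, we need to detach the respective computational graph from the final result. The following toy example makes this clearer: suppose we have z = x * y and y = x * x, but we want to focus on the direct influence of x on z rather than the influence conveyed via y. In this case, we can create a new variable u that takes the same value as y but whose provenance (how it was created) has been wiped out. Thus u has no ancestors in the graph, and gradients do not flow through u to x. For example, taking the gradient of z = x * u will yield the result u (not 3 * x * x, as you might have expected since z = x * x * x).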

x.grad.zero_()
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
x.grad == u

tensor([True, True, True, True])
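The effect of detaching can be reproduced numerically for a single scalar input. In the sketch below, `u` is computed once and then treated as a constant, which plays the role of the detached value, while `z_full` recomputes the intermediate term from its argument. The function names and the finite-difference helper are assumptions made for this illustration.

```python
def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of df/dx for a scalar x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


x = 2.0
u = x * x  # computed once; treated as a constant from here on


def z_detached(t):
    # u does not change when t moves, so only the direct path counts.
    return u * t


def z_full(t):
    # Here the intermediate t * t is recomputed, so both paths count.
    return (t * t) * t


g_detached = numerical_grad(z_detached, x)
g_full = numerical_grad(z_full, x)
print(round(g_detached, 4), round(g_full, 4))  # 4.0 12.0
```

The first number equals u = x * x, and the second equals 3 * x * x, matching the two cases described in the text.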

with autograd.record():
  y = x * x
  u = y.detach()
  z = u * x
z.backward()
x.grad == u

array([ True, True, True, True])

import jax

y = lambda x: x * x
# jax.lax primitives are Python wrappers around XLA operations
u = jax.lax.stop_gradient(y(x))
z = lambda x: u * x

grad(lambda x: z(x).sum())(x) == y(x)

Array([ True, True, True, True], dtype=bool)

# Set persistent=True to preserve the compute graph.
# This lets us run t.gradient more than once
with tf.GradientTape(persistent=True) as t:
  y = x * x
  u = tf.stop_gradient(y)
  z = u * x

x_grad = t.gradient(z, x)
x_grad == u

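Note that while this procedure detaches y's ancestors from the graph leading to z, the computational graph leading to y persists, and thus we can calculate the gradient of y with respect to x.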

x.grad.zero_()
y.sum().backward()
x.grad == 2 * x

tensor([True, True, True, True])

y.backward()
x.grad == 2 * x

array([ True, True, True, True])

grad(lambda x: y(x).sum())(x) == 2 * x

Array([ True, True, True, True], dtype=bool)

t.gradient(y, x) == 2 * x

2.5.4. Gradients and Python Control Flow

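So far we have reviewed cases where the path from input to output was well defined via a function such as z = x * x * x. Programming offers us a lot more freedom in how we compute results. For instance, we can make them depend on auxiliary variables or condition choices on intermediate results. One benefit of using automatic differentiation is that even if building the computational graph of a function requires passing through a maze of Python control flow (e.g., conditionals, loops, and arbitrary function calls), we can still calculate the gradient of the resulting variable. To illustrate this, consider the following code snippet, where the number of iterations of the while loop and the evaluation of the if statement both depend on the value of the input a.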

def f(a):
  b = a * 2
  while b.norm() < 1000:
    b = b * 2
  if b.sum() > 0:
    c = b
  else:
    c = 100 * b
  return c

def f(a):
  b = a * 2
  while np.linalg.norm(b) < 1000:
    b = b * 2
  if b.sum() > 0:
    c = b
  else:
    c = 100 * b
  return c

def f(a):
  b = a * 2
  while jnp.linalg.norm(b) < 1000:
    b = b * 2
  if b.sum() > 0:
    c = b
  else:
    c = 100 * b
  return c

def f(a):
  b = a * 2
  while tf.norm(b) < 1000:
    b = b * 2
  if tf.reduce_sum(b) > 0:
    c = b
  else:
    c = 100 * b
  return c

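Below, we call this function, passing in a random value as input. Since the input is a random variable, we do not know what form the computational graph will take. However, whenever we execute f(a) on a specific input, we realize a specific computational graph and can subsequently run backward.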

a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()

a = np.random.normal()
a.attach_grad()
with autograd.record():
  d = f(a)
d.backward()

from jax import random

a = random.normal(random.PRNGKey(1), ())
d = f(a)
d_grad = grad(f)(a)

a = tf.Variable(tf.random.normal(shape=()))
with tf.GradientTape() as t:
  d = f(a)
d_grad = t.gradient(d, a)
d_grad

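Even though our function f is a bit contrived for demonstration purposes, its dependence on the input is quite simple: it is a linear function of a with a piecewise-defined scale. As such, f(a) / a is a vector of constant entries and, moreover, f(a) / a needs to match the gradient of f(a) with respect to a.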

a.grad == d / a

tensor(True)
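The claim that f(a) / a equals the gradient can also be checked without autograd. The following sketch restates f for plain Python floats and compares f(a) / a with a finite-difference slope at a few hand-picked inputs. The sample values and the `numerical_grad` helper are assumptions for this illustration; the inputs are chosen nonzero (a = 0 would loop forever) and away from the points where the number of doublings changes.

```python
def f(a):
    """Scalar version of the control-flow function from the text."""
    b = a * 2
    while abs(b) < 1000:
        b = b * 2
    if b > 0:
        c = b
    else:
        c = 100 * b
    return c


def numerical_grad(fn, a, eps=1e-6):
    """Central-difference estimate of d fn / d a for a scalar a."""
    return (fn(a + eps) - fn(a - eps)) / (2 * eps)


results = []
for a in (0.3, -1.7, 5.0):
    ratio = f(a) / a
    slope = numerical_grad(f, a)
    results.append((a, ratio, slope))
    print(a, ratio, round(slope, 3))
```

Each input takes a different route through the loop and the branch, yet in every case the slope agrees with f(a) / a, because along any fixed route f simply multiplies a by a constant.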

a.grad == d / a

array(True)

d_grad == d / a

Array(True, dtype=bool)

d_grad == d / a

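Dynamic control flow is very common in deep learning. For instance, when processing text, the computational graph depends on the length of the input. In these cases, automatic differentiation becomes vital for statistical modeling since it is impossible to compute the gradient a priori.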

2.5.5. Discussion

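You have now gotten a taste of the power of automatic differentiation. The development of libraries for calculating derivatives both automatically and efficiently has been a massive productivity booster for deep learning practitioners, liberating them to focus on higher-level concerns. Moreover, autograd permits us to design massive models for which pen-and-paper gradient computations would be prohibitively time-consuming. Interestingly, while we use autograd to optimize models (in a statistical sense), the optimization of autograd libraries themselves (in a computational sense) is a rich subject of vital interest to framework designers. Here, tools from compilers and graph manipulation are leveraged to compute results in the most expedient and memory-efficient manner.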

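For now, try to remember these basics: (i) attach gradients to those variables with respect to which we desire derivatives; (ii) record the computation of the target value; (iii) execute the backpropagation function; and (iv) access the resulting gradient.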

2.5.6. Exercises

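Why is the second derivative much more expensive to compute than the first derivative?

After running the function for backpropagation, immediately run it again and see what happens. Why?

In the control flow example where we calculate the derivative of d with respect to a, what would happen if we changed the variable a to a random vector or a matrix? At this point, the result of the calculation f(a) is no longer a scalar. What happens to the result? How do we analyze this?

Let $f(x) = \sin(x)$. Plot the graph of $f$ and of its derivative $f'$. Do not exploit the fact that $f'(x) = \cos(x)$ but rather use automatic differentiation to get the result.

Let $f(x) = ((\log x^2) \cdot \sin x) + x^{-1}$. Write out a dependency graph tracing results from $x$ to $f(x)$.

Use the chain rule to compute the derivative $\frac{df}{dx}$ of the aforementioned function, placing each term on the dependency graph that you constructed previously.

Given the graph and the intermediate derivative results, you have a number of options when computing the gradient. Evaluate the result once starting from $x$ to $f$ and once from $f$ tracing back to $x$. The path from $x$ to $f$ is commonly known as forward differentiation, whereas the path from $f$ to $x$ is known as backward differentiation.

When might you want to use forward differentiation, and when backward differentiation? Hint: consider the amount of intermediate data needed, the ability to parallelize steps, and the size of the matrices and vectors involved.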
