
英语翻译练习

翻译练习 1(11 月 20 日)

深度前馈网络(deep feedforward network),也叫作前馈神经网络(feedforward neural network)或者 多层感知机(multilayer perceptron, MLP),是典型的深度学习模型。

A deep feedforward network, also called a feedforward neural network or multilayer perceptron (MLP), is a typical deep learning model.

Deep feedforward networks, also often called feedforward neural networks, or multilayer perceptrons (MLPs), are the quintessential deep learning models.

前馈网络的目标是近似某个函数 $f^*$

The goal of the feedforward network is to approximate a function $f^*$.

The goal of a feedforward network is to approximate some function $f^*$.

tip

The goal of xxx is to: xxxx 的目标是 xxxx

某个可以翻译为 some, 某个函数: some function

例如,对于分类器,$y = f^*(x)$ 将输入 $x$ 映射到一个类别 $y$

For a classifier, $y = f^*(x)$ maps the input $x$ to a class $y$.

For example, for a classifier, $y = f^*(x)$ maps an input $x$ to a category $y$.

前馈网络定义了一个映射 $y = f(x; \theta)$,并且学习参数 $\theta$ 的值,使它能够得到最佳的函数近似。

A deep feedforward network defines a mapping $y = f(x; \theta)$ and learns the parameters $\theta$ to get the best function approximation.

A feedforward network defines a mapping $y = f(x; \theta)$ and learns the value of the parameters $\theta$ that result in the best function approximation.

tip

result in: 得到,产生

这种模型被称为前向(feedforward)的,是因为信息流过 $x$ 的函数,流经用于定义 $f$ 的中间计算过程,最终到达输出 $y$

This model is called feedforward because information flows from $x$, through the computations that define $f$, to the output $y$.

These models are called feedforward because information flows through the function being evaluated from $x$, through the intermediate computations used to define $f$, and finally to the output $y$.

在模型的输出和模型本身之间没有反馈 (feedback) 连接。

The model's outputs and the model itself are not connected by feedback.

There are no feedback connections in which outputs of the model are fed back into itself.

当前馈神经网络被扩展成包含反馈连接时,它们被称为循环神经网络(recurrent neural network),在第十章介绍。

Feedforward neural networks are called recurrent neural networks when they are extended to include feedback connections.

When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks.

前馈网络对于机器学习的从业者是极其重要的。

Feedforward networks are very important for machine learning practitioners.

Feedforward networks are of extreme importance to machine learning practitioners.

tip

xxx is of extreme importance to xxx : xxx 对于 xxx 是极其重要的

它们是许多重要商业应用的基础。

They are the basis of many business applications.

They form the basis of many important commercial applications.

tip

something form the basis of something: xxx 是 xxx 的基础 / xxx 奠定了 xxx 的基础

例如,用于对照片中的对象进行识别的卷积神经网络就是一种专门的前馈网络。

For example, a CNN is a specialized feedforward network used to recognize objects in photos.

For example, the convolutional networks used for object recognition from photos are a specialized kind of feedforward network.

tip

a specialized kind of: 专门的一种

前馈网络是通往循环网络之路的概念基石,后者在自然语言的许多应用中发挥着巨大作用。

Feedforward neural networks are a conceptual basis for recurrent neural networks, which play an important role in many natural language applications.

Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications.

tip

power: 发挥巨大作用

翻译练习 2(11 月 21 日)

前馈神经网络被称作网络 (network) 是因为它们通常用许多不同函数复合在一起来表示。

The reason why feedforward neural networks are called networks is that they typically combine many different functions.

Feedforward neural networks are called networks because they are typically represented by composing together many different functions.

例如, 我们有三个函数 $f^{(1)}$, $f^{(2)}$, $f^{(3)}$ 连接在一个链上以形成 $f(\boldsymbol{x})=f^{(3)}\left(f^{(2)}\left(f^{(1)}(\boldsymbol{x})\right)\right)$

For example, we have three functions, $f^{(1)}$, $f^{(2)}$ and $f^{(3)}$, joined in a chain to form $f(\boldsymbol{x})=f^{(3)}\left(f^{(2)}\left(f^{(1)}(\boldsymbol{x})\right)\right)$.

For example, we might have three functions $f^{(1)}$, $f^{(2)}$ and $f^{(3)}$ connected in a chain, to form $f(\boldsymbol{x})=f^{(3)}\left(f^{(2)}\left(f^{(1)}(\boldsymbol{x})\right)\right)$.
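In code, this chain is just nested function calls. A minimal sketch (the layer functions and weights below are made up for illustration; a real network would learn them):

```python
import numpy as np

# Hypothetical layer parameters, chosen at random here.
rng = np.random.default_rng(0)
W1, W2, w3 = rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), rng.normal(size=3)

def f1(x):            # first layer
    return np.maximum(0, W1 @ x)

def f2(h):            # second layer
    return np.maximum(0, W2 @ h)

def f3(h):            # output layer; the depth of this chain is 3
    return w3 @ h

x = np.array([1.0, -2.0])
y = f3(f2(f1(x)))     # f(x) = f3(f2(f1(x)))
print(y)
```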

这些链式结构是神经网络中最常用的结构。

These chain structures are the most commonly used structures in neural networks.

These chain structures are the most commonly used structures of neural networks.

在这种情况下, $f^{(1)}$ 被称为网络的 第一层 ( first layer), $f^{(2)}$ 被称为 第二层 ( second layer ), 以此类推。

In this situation, $f^{(1)}$ is called the first layer, $f^{(2)}$ is called the second layer and so on.

In this case, $f^{(1)}$ is called the first layer, $f^{(2)}$ is called the second layer and so on.

链的全长称为模型的深度 ( depth )。

The length of the chain is called the depth of the model.

The overall length of the chain gives the depth of the model.

正是因为这个术语才出现了 “深度学习" 这个名字。

The name "deep learning" comes from this term.

It is from this terminology that the name "deep learning" arises.

前馈网络的最后一层被称为 输出层 ( output layer )。

The last layer of the feedforward network is called the output layer.

The final layer of a feedforward network is called the output layer.

在神经网络训练的过程中, 我们让 $f(\boldsymbol{x})$ 去匹配 $f^*(\boldsymbol{x})$ 的值。

During the training of neural networks, we let $f(\boldsymbol{x})$ match the values of $f^*(\boldsymbol{x})$.

During neural network training, we drive $f(\boldsymbol{x})$ to match $f^*(\boldsymbol{x})$.

训练数据为我们提供了在不同训练点上取值的、含有噪声的 $f^*(\boldsymbol{x})$ 的近似实例。

The training data provides us with approximate examples of $f^*(\boldsymbol{x})$.

The training data provides us with noisy, approximate examples of $f^*(\boldsymbol{x})$ evaluated at different training points.

每个样本 $\boldsymbol{x}$ 都伴随着一个标签 $y \approx f^*(\boldsymbol{x})$

Each sample $\boldsymbol{x}$ has a label $y \approx f^*(\boldsymbol{x})$.

Each example $\boldsymbol{x}$ is accompanied by a label $y \approx f^*(\boldsymbol{x})$.

训练样本直接指明了输出层在每一点 $x$ 上必须做什么; 它必须产生一个接近 $y$ 的值。

The training data directly indicates what the output layer should do at each point $x$; it must produce a value close to $y$.

The training examples specify directly what the output layer must do at each point $x$; it must produce a value that is close to $y$.

翻译练习 3(11 月 22 日)

学习算法必须决定如何使用这些层来产生想要的输出,但是训练数据并没有说每个单独的层应该做什么。

Learning algorithms must decide how to use those layers to produce the desired outputs, but the training data does not directly indicate what each individual layer should do.

The learning algorithm must decide how to use those layers to produce the desired output, but the training data does not say what each individual layer should do.

相反,学习算法必须决定如何使用这些层来最好地实现 $f^*$ 的近似。

On the contrary, the learning algorithm must decide how to use these layers to get the best approximation of $f^*$.

Instead, the learning algorithm must decide how to use these layers to best implement an approximation of $f^*$.

因为训练数据并没有给出这些层中的每一层所需的输出,所以这些层被称为隐藏层(hidden layer)。

Because the training data does not give the desired output for each of these layers, they are called hidden layers.

Because the training data does not show the desired output for each of these layers, these layers are called hidden layers.

最后,这些网络被称为神经网络是因为它们或多或少地受到神经科学的启发。

Finally, these networks are called neural networks because they're more or less inspired by neuroscience.

Finally, these networks are called neural because they are loosely inspired by neuroscience.

网络中的每个隐藏层通常都是向量值的。

Each hidden layer of the network is usually vector valued.

Each hidden layer of the network is typically vector-valued.

这些隐藏层的维数决定了模型的宽度 (width)。

These hidden layers' dimensionality gives the model's width.

The dimensionality of these hidden layers determines the width of the model.

tip

描述 A 的 B 时,多用 B of A, 而不是用 A's B.

向量的每个元素都可以被视为起到类似一个神经元的作用。

Each element of the vector can be seen as playing a similar role to a neuron.

Each element of the vector may be interpreted as playing a role analogous to a neuron.

翻译练习 3(11 月 24 日)

一种理解前馈网络的方式是从线性模型开始, 并考虑如何克服它的局限性。

One way to understand the feedforward network is to start from the linear model and consider how to overcome its limitations.

One way to understand feedforward networks is to begin with linear models and consider how to overcome their limitations.

线性模型, 例如逻辑回归和线性回归, 是非常吸引人的, 因为无论是通过闭解形式还是使用凸优化, 它们都能高效且可靠地拟合。

Linear models, such as linear regression and logistic regression, are very attractive because, whether in closed form or through convex optimization, they can be fit efficiently and reliably.

Linear models, such as logistic regression and linear regression, are appealing because they may be fit efficiently and reliably, either in closed form or with convex optimization.

线性模型也有明显的缺陷, 那就是该模型的能力被局限在线性函数里, 所以它无法理解任何两个输入变量间的相互作用。

Linear models also have an obvious defect: the capacity of the model is limited to linear functions, so it cannot understand the interaction between any two input variables.

Linear models also have the obvious defect that the model capacity is limited to linear functions, so the model cannot understand the interaction between any two input variables.

为了扩展线性模型来表示 $\boldsymbol{x}$ 的非线性函数, 我们可以不把线性模型用于 $\boldsymbol{x}$ 本身, 而是用在一个变换后的输入 $\phi(\boldsymbol{x})$ 上, 这里 $\phi$ 是一个非线性变换。

In order to extend linear models to represent nonlinear functions of $\boldsymbol{x}$, we use a transformed input $\phi(\boldsymbol{x})$ instead of applying the linear model to $\boldsymbol{x}$ itself, where $\phi$ is a nonlinear transformation.

To extend linear models to represent nonlinear functions of $\boldsymbol{x}$, we can apply the linear model not to $\boldsymbol{x}$ itself but to a transformed input $\phi(\boldsymbol{x})$, where $\phi$ is a nonlinear transformation.

tip

not to …… but to ... 不 ... 而是 ...

同样, 我们可以使用第 5.7.2 节中描述的核技巧, 来得到一个基于隐含地使用 $\phi$ 映射的非线性学习算法。

Likewise, we can use the kernel trick described in Section 5.7.2 to get a nonlinear learning algorithm based on implicitly using the $\phi$ mapping.

Equivalently, we can apply the kernel trick described in Sec. 5.7.2 to obtain a nonlinear learning algorithm based on implicitly applying the $\phi$ mapping.

我们可以认为 $\phi$ 提供了一组描述 $\boldsymbol{x}$ 的特征, 或者认为它提供了 $x$ 的一个新的表示。

We can also think that $\phi$ gives a set of features describing $\boldsymbol{x}$, or provides a new representation of $\boldsymbol{x}$.

We can think of $\phi$ as providing a set of features describing $\boldsymbol{x}$, or providing a new representation for $\boldsymbol{x}$.

tip

think of sth as ... 将 ... 看作 ...
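A small sketch of this idea with a hand-chosen, hypothetical transformation: a linear fit on $x$ alone cannot represent the target $y = x^2$, but the same linear fit on $\phi(x) = [x, x^2]$ represents it exactly.

```python
import numpy as np

x = np.linspace(-2, 2, 21)
y = x ** 2                                    # a nonlinear target

# Linear model on x itself: large residual.
A_lin = np.column_stack([x, np.ones_like(x)])
w_lin, *_ = np.linalg.lstsq(A_lin, y, rcond=None)

# The same linear model, applied to the transformed input phi(x) = [x, x^2].
phi = np.column_stack([x, x ** 2, np.ones_like(x)])
w_phi, *_ = np.linalg.lstsq(phi, y, rcond=None)

print(np.abs(A_lin @ w_lin - y).max())        # poor fit
print(np.abs(phi @ w_phi - y).max())          # ~0: linear in phi(x) captures the target
```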

翻译练习 4(11 月 26 日)

卷积网络(convolutional network),也叫做 卷积神经网络(convolutional neural network, CNN),是一种专门用来处理具有类似网格结构的数据的神经网络

Convolutional networks, also called convolutional neural networks (CNNs), are a special type of neural network used to process data with a grid-like structure.

Convolutional networks, also known as convolutional neural networks or CNNs, are a specialized kind of neural network for processing data that has a known, grid-like topology.

例如时间序列数据(可以认为是在时间轴上有规律地采样形成的一维网格)和图像数据(可以看作是二维的像素网格)。

For example, time-series data can be seen as a one-dimensional grid formed by sampling regularly on the time axis, and image data can be seen as a two-dimensional grid of pixels.

Examples include time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and image data, which can be thought of as a 2D grid of pixels.

卷积网络在诸多应用领域都表现优异。

Convolutional networks perform outstandingly well in many application areas.

Convolutional networks have been tremendously successful in practical applications.

"卷积神经网络" 一词表明该网络使用了卷积(convolution)这种数学运算。

The word "Convolutional networks" indicates that networks use the convolution mathematical operation.

The name "convolutional neural network" indicates that the network employs a mathematical operation called convolution.

tip

使用可以翻译为 employ

卷积是一种特殊的线性运算。

Convolution is a special linear operation.

Convolution is a specialized kind of linear operation.

tip

一种可以翻译为 kind of

卷积网络是指那些至少在网络的一层中使用卷积运算来替代一般的矩阵乘法运算的神经网络。

Convolutional networks are those networks that use the convolution operation in place of general matrix multiplication in at least one layer.

Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

本章,我们首先说明什么是卷积运算。

In this chapter, we first show what the convolution operation is.

In this chapter, we will first describe what convolution is.

接着,我们会解释在神经网络中使用卷积运算的动机。

Then, we will explain the motivation for using convolution in neural networks.

Next, we will explain the motivation behind using convolution in neural networks.

然后我们会介绍池化(pooling),这是一种几乎所有的卷积网络都会用到的操作。

Next, we will introduce pooling, an operation that almost all convolutional networks use.

We will then describe an operation called pooling, which almost all convolutional networks employ.

通常来说,卷积神经网络中用到的卷积运算和其他领域(例如工程领域以及纯数学领域)中的定义并不完全一致。

Generally speaking, the convolution used in convolutional neural networks is different from the definition used in other fields, such as engineering and pure mathematics.

Usually, the operation used in convolutional neural networks does not correspond precisely to the definition of convolution as used in other fields such as engineering or pure mathematics.

我们会对神经网络实践中广泛应用的几种卷积函数的变体进行说明。

We will explain several variants of the convolution function that are widely used in neural network applications.

We will describe several variants on the convolution function that are widely used in practice for neural networks.

我们也会说明如何在多种不同维数的数据上使用卷积运算。

We will also show how the convolution operation can be used on data with different numbers of dimensions.

We will also show how convolution may be applied to many kinds of data, with different numbers of dimensions.

之后我们讨论使得卷积运算更加高效的一些方法。

And then, we will discuss some methods that can make convolution more efficient.

We then discuss means of making convolution more efficient.

tip

可以用 means 去替换 method

翻译练习 5(11 月 27 日)

之后我们讨论使得卷积运算更加高效的一些方法。

We then discuss means of making convolution more efficient.

卷积网络是神经科学原理影响深度学习的典型代表。

Convolutional networks are a typical example of neuroscientific principles influencing deep learning.

Convolutional networks stand out as an example of neuroscientific principles influencing deep learning.

我们之后也会讨论这些神经科学的原理,并对卷积网络在深度学习发展史中的作用作出评价。

We then also discuss those principles of neuroscience and give an assessment of the role convolutional networks have played in the history of deep learning.

We will discuss these neuroscientific principles, then conclude with comments about the role convolutional networks have played in the history of deep learning.

本章没有涉及如何为你的卷积网络选择合适的结构

This chapter does not cover how to choose a suitable architecture for your convolutional network.

One topic this chapter does not address is how to choose the architecture of your convolutional network.

本章的目标是说明卷积网络提供的各种工具, 第十一章将会对如何在具体环境中选择使用相应的工具给出通用的准则。

Chapter 11 will give general principles for how to choose suitable tools in specific circumstances.

The goal of this chapter is to describe the kinds of tools that convolutional networks provide, while Chapter 11 describes general guidelines for choosing which tools to use in which circumstances.

对于卷积网络结构的研究进展得如此迅速,以至于针对特定基准 (benchmark),数月甚至几周就会公开一个新的最优的网络结构,甚至在写这本书时也不好描述究竟哪种结构是最好的。

Research into convolutional network architectures proceeds so rapidly that a new best architecture for a given benchmark is announced every few weeks to months, rendering it impractical to describe the best architecture in print.

然而,最好的结构也是由本章所描述的基本部件逐步搭建起来的。

However, even the best architectures are built up from the basic components described in this chapter.

However, the best architectures have consistently been composed of the building blocks described here.

翻译练习 6(11 月 29 日)

在通常形式中, 卷积是对两个实变函数的一种数学运算。

In general, convolution is a mathematical operation on two functions of a real-valued argument.

In its most general form, convolution is an operation on two functions of a real-valued argument.

为了给出卷积的定义, 我们从两个可能会用到的函数的例子出发。

In order to give the definition of convolution, we start from examples of two functions we might use.

To motivate the definition of convolution, we start with examples of two functions we might use.

假设我们正在用激光传感器追踪一艘宇宙飞船的位置。

Suppose we are using a laser sensor to track the position of a spaceship.

Suppose we are tracking the location of a spaceship with a laser sensor.

我们的激光传感器给出一个单独的输出 $x(t)$, 表示宇宙飞船在时刻 $t$ 的位置。

Our laser sensor gives a single output $x(t)$, representing the position of the spaceship at time $t$.

Our laser sensor provides a single output $x(t)$, the position of the spaceship at time $t$.

xxtt 都是实值的, 这意味 着我们可以在任意时刻从传感器中读出飞船的位置。

xx and tt are both real values, which means that we can get the position of the space ship from the sensor at any monument.

Both xx and tt are real-valued, i.e., we can get a different reading from the laser sensor at any instant in time.

现在假设我们的传感器受到一定程度的噪声干扰。

Now suppose our sensor is disturbed by a certain amount of noise.

Now suppose that our laser sensor is somewhat noisy.

为了得到飞船位置的低噪声估计, 我们对得到的测量结果进行平均。

In order to obtain a low-noise estimate of the spaceship's position, we need to average the measurement results.

To obtain a less noisy estimate of the spaceship's position, we would like to average together several measurements.

显然, 时间上越近的测量结果越相关, 所以我们采用一种加权平均的方法,对于最近的测量结果赋予更高的权重。

Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements.

我们可以采用一个加权函数 $w(a)$ 来实现, 其中 $a$ 表示测量结果距当前时刻的时间间隔。

We can use a weighting function $w(a)$, where $a$ is the time interval between the current time and the time at which the measurement was taken.

We can use a weighting function $w(a)$, where $a$ is the age of a measurement.

如果我们对任意时刻都采用这种加权平均的操作, 就得到了一个新的对于飞船位置的平滑估计函数 $s$

If we apply this weighted average operation at every moment, we get a new function $s$ that gives a smoothed estimate of the position of the spaceship.

If we apply such a weighted average operation at every moment, we obtain a new function $s$ providing a smoothed estimate of the position of the spaceship.

这种运算就叫做卷积 (convolution)。卷积运算通常用星号表示: $s(t)=(x * w)(t)$.

This operation is called convolution. The convolution operation is usually denoted with an asterisk: $s(t)=(x * w)(t)$.

This operation is called convolution. The convolution operation is typically denoted with an asterisk: $s(t)=(x * w)(t)$.
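Written out in full, the weighted average that this asterisk denotes is the integral

$$s(t) = (x * w)(t) = \int x(a)\, w(t - a)\, da$$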

在我们的例子中, ww 必须是一个有效的概率密度函数, 否则输出就不再是一个加权平均。

In our case, $w$ must be a valid probability density function; otherwise the output is no longer a weighted average.

In our example, $w$ needs to be a valid probability density function, or the output is not a weighted average.

另外, 在参数为负值时, $w$ 的取值必须为 0 , 否则它会预测到未来, 这不是我们能够推测得了的。但这些限制仅仅是对我们这个例子来说。

In addition, when its argument is negative, $w$ must be 0; otherwise it will predict the future, which is beyond our abilities. But these limitations apply only to this example.

Also, $w$ needs to be 0 for all negative arguments, or it will look into the future, which is presumably beyond our capabilities. These limitations are particular to our example though.

通常, 卷积被定义在满足上述积分式的任意函数上, 并且也可能被用于加权平均以外的目的。

In general, the convolution operation is defined for any functions for which the above integral is defined, and it can be used for aims other than taking a weighted average.

In general, convolution is defined for any functions for which the above integral is defined, and may be used for other purposes besides taking weighted averages.

在卷积网络的术语中, 卷积的第一个参数 (在这个例子中, 函数 $x$ ) 通常叫做输入 (input), 第二个参数 (函数 $w$ ) 叫做核函数 (kernel function)。输出有时被称作特征映射 ( feature map )。

In the terminology of convolutional networks, the first parameter of the convolution (here, the function $x$) is called the input, and the second parameter (the function $w$) is called the kernel function. Sometimes the output is called the feature map.

In convolutional network terminology, the first argument to the convolution is often referred to as the input and the second argument as the kernel function. The output is sometimes referred to as the feature map.

在本例中, 激光传感器在每个瞬间反馈测量结果的想法是不切实际的。

In this case, the idea of a laser sensor giving back measurements at every instant is impractical.

In our example, the idea of a laser sensor that can provide measurements at every instant in time is not realistic.

一般地, 当我们用计算机处理数据时, 时间会被离散化, 传感器会定期地反馈数据。所以在我们的例子中, 假设传感器每秒反馈一次测量结果是比较现实的。这样, 时刻 $t$ 只能取整数值。如果我们假设 $x$ 和 $w$ 都定义在整数时刻 $t$ 上, 就可以定义离散形式的卷积。

In general, when we use a computer to process data, time will be discretized, and sensors will provide data periodically. So in our example, it is realistic to assume that the sensor provides a measurement every second. In this case, the time $t$ can only take integer values. If we assume that $x$ and $w$ are defined on integer time $t$, we can define the discrete convolution.

Usually, when we work with data on a computer, time will be discretized and our sensor will provide data at regular intervals. In our example, it might be more realistic to assume that our laser provides a measurement once per second. The time index $t$ can then take on only integer values. If we now assume that $x$ and $w$ are defined only on integer $t$, we can define the discrete convolution.
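In discrete form the integral becomes a sum, $s(t)=\sum_{a} x(a)\,w(t-a)$. A minimal sketch of the smoothing example, with made-up measurements and weights that favor recent readings:

```python
import numpy as np

x = np.array([0.0, 1.1, 2.9, 4.2, 3.8, 5.1])  # noisy positions, one reading per second
w = np.array([0.6, 0.3, 0.1])                  # w(0), w(1), w(2): recent readings weigh more

def smooth(x, w):
    # discrete convolution s(t) = sum_a x(t - a) * w(a), using only past readings
    s = np.zeros_like(x)
    for t in range(len(x)):
        for a in range(len(w)):
            if t - a >= 0:
                s[t] += x[t - a] * w[a]
    return s

print(smooth(x, w))                 # smoothed estimate of the trajectory
print(np.convolve(x, w)[:len(x)])   # the same values via NumPy's convolution
```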

翻译练习 6(12 月 2 日)

为了使前馈网络的想法更加具体, 我们首先从一个可以完整工作的前馈网络说起。这个例子解决一个非常简单的任务:学习 XOR 函数。

To make the idea of feedforward networks more concrete, we begin with an example of a fully functioning feedforward network. This example solves a very simple task: learning the XOR function.

To make the idea of a feedforward network more concrete, we begin with an example of a fully functioning feedforward network on a very simple task: learning the XOR function.

XOR 函数 ( "异或" 逻辑) 是两个二进制值 $x_1$ 和 $x_2$ 的运算。

The XOR function is an operation between two binary values $x_1$ and $x_2$.

The XOR function is an operation on two binary values $x_1$ and $x_2$.

当这些二进制值 中恰好有一个为 1 时, XOR 函数返回值为 1 。其余情况下返回值为 0 。

When there is only one value equal to one in those binary values, the XOR function will return 1.

When exactly one of those binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0.

XOR 函数提供了我们想要学习的目标函数 $y=f^*(\boldsymbol{x})$ 。

The XOR function provides the objective function $y=f^*(\boldsymbol{x})$ that we want to learn.

The XOR function provides the target function $y=f^*(\boldsymbol{x})$ that we want to learn.

我们的模型给出了一个函数 $y=f(\boldsymbol{x} ; \boldsymbol{\theta})$ 并且我们的学习算法会不断调整参数 $\boldsymbol{\theta}$ 来使得 $f$ 尽可能接近 $f^*$

Our model gives a function $y=f(\boldsymbol{x} ; \boldsymbol{\theta})$ and our learning algorithm will keep adjusting the parameters $\boldsymbol{\theta}$ to make $f$ as similar as possible to $f^*$.

Our model provides a function $y=f(\boldsymbol{x} ; \boldsymbol{\theta})$ and our learning algorithm will adapt the parameters $\boldsymbol{\theta}$ to make $f$ as similar as possible to $f^*$.

在这个简单的例子中, 我们不会关心统计泛化。

In this simple example, we don't care about statistical generalization.

In this simple example, we will not be concerned with statistical generalization.

我们希望网络在这四个点 $\mathbb{X}=\left\{[0,0]^{\top},[0,1]^{\top},[1,0]^{\top},[1,1]^{\top}\right\}$ 上表现正确。我们会用全部这四个点来训练我们的网络, 唯一的挑战是拟合训练集。

We hope the network can perform correctly on the four points $\mathbb{X}=\left\{[0,0]^{\top},[0,1]^{\top},[1,0]^{\top},[1,1]^{\top}\right\}$. We will train our network with those four points; the only challenge is to fit the training set.

We want our network to perform correctly on the four points $\mathbb{X}=\left\{[0,0]^{\top},[0,1]^{\top},[1,0]^{\top},[1,1]^{\top}\right\}$. We will train the network on all four of these points. The only challenge is to fit the training set.

我们可以把这个问题当作是回归问题, 并使用均方误差损失函数。

We can treat this problem as a regression problem and use a mean squared error loss function.

We can treat this problem as a regression problem and use a mean square error loss function.

我们选择这个损失函数是为了尽可能简化本例中用到的数学。在应用领域, 对于二进制数据建 模时, MSE 通常并不是一个合适的损失函数。更加合适的方法将在第 6.2.2.2 节中讨论。

We choose this loss function to simplify the math used in this example. In practical applications, MSE is usually not a suitable loss function for modeling binary data. More suitable approaches will be discussed in Sec. 6.2.2.2.

We choose this loss function to simplify the math for this example as much as possible. We will see later that there are other, more appropriate approaches for modeling binary data.

tip

合适的:appropriate

方法:approach, means

评估整个训练集上表现的 MSE 损失函数为

Evaluated on our whole training set, the MSE loss function is:
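Written out as the average squared error over the four training points, this is

$$J(\boldsymbol{\theta}) = \frac{1}{4} \sum_{\boldsymbol{x} \in \mathbb{X}} \left( f^*(\boldsymbol{x}) - f(\boldsymbol{x}; \boldsymbol{\theta}) \right)^2$$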

我们现在必须要选择我们模型 $f(\boldsymbol{x} ; \boldsymbol{\theta})$ 的形式。假设我们选择一个线性模型, $\boldsymbol{\theta}$ 包含 $\boldsymbol{w}$ 和 $b$, 那么我们的模型被定义成

Now we must choose the format of our model $f(\boldsymbol{x} ; \boldsymbol{\theta})$. Suppose we choose a linear model, with $\boldsymbol{\theta}$ consisting of $\boldsymbol{w}$ and $b$. Our model is defined to be

Now we must choose the form of our model, $f(\boldsymbol{x} ; \boldsymbol{\theta})$. Suppose that we choose a linear model, with $\boldsymbol{\theta}$ consisting of $\boldsymbol{w}$ and $b$. Our model is defined to be
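the linear form with weights $\boldsymbol{w}$ and bias $b$:

$$f(\boldsymbol{x}; \boldsymbol{w}, b) = \boldsymbol{x}^{\top}\boldsymbol{w} + b$$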

解正规方程以后, 我们得到 $\boldsymbol{w}=0$ 以及 $b=\frac{1}{2}$

After solving the normal equations, we can get $\boldsymbol{w}=0$ and $b=\frac{1}{2}$.

After solving the normal equations, we obtain $\boldsymbol{w}=0$ and $b=\frac{1}{2}$.

tip

得到:obtain, result in
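A quick numerical check of this result, as a sketch: solving the same least-squares problem on the four XOR points with NumPy's solver gives exactly $\boldsymbol{w}=0$, $b=\frac{1}{2}$.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # the four training points
y = np.array([0, 1, 1, 0], dtype=float)                      # XOR labels

A = np.column_stack([X, np.ones(4)])          # append a column of ones for the bias b
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = theta[:2], theta[2]

print(w, b)        # -> w ≈ [0, 0], b = 0.5
print(A @ theta)   # the fitted model outputs 0.5 at every point
```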

线性模型仅仅是在任意一点都输出 0.5 。为什么会发生这种事?

The linear model just outputs 0.5 for any point. Why does this happen?

The model simply outputs 0.5 everywhere. Why does this happen?

图 6.1 演示了线性模型为什么不能用来表示 XOR 函数。解决这个问题的其中一种方法是使用一个模型来学习一个不同的特征空间, 在这个空间上线性模型能够表示这个解。

Fig. 6.1 shows how a linear model is not able to represent the XOR function. To solve this problem, one solution is to use a model that learns a different feature space in which a linear model is able to represent the solution.

Fig. 6.1 shows how a linear model is not able to represent the XOR function. One way to solve this problem is to use a model that learns a different feature space in which a linear model is able to represent the solution.

具体来说, 我们这里引入一个非常简单的前馈神经网络, 它有一层隐藏层并且隐藏层中包含两个单元。

Specifically, here we bring in a very simple feedforward neural network. It has one hidden layer containing two hidden units.

Specifically, we will introduce a very simple feedforward neural network with one hidden layer containing two hidden units.

tip

具体来说: Specifically

包含: contain

见图 6.2 中对该模型的解释。这个前馈网络有一个通过函数 $f^{(1)}(\boldsymbol{x} ; \boldsymbol{W}, \boldsymbol{c})$ 计算得到的隐藏单元的向量 $\boldsymbol{h}$ 。这些隐藏单元的值随后被用作第二层的输入。第二层就是这个网络的输出层。

See Fig. 6.2 for an illustration of this model. This feedforward network has a vector of hidden units $\boldsymbol{h}$ computed by the function $f^{(1)}(\boldsymbol{x} ; \boldsymbol{W}, \boldsymbol{c})$. The values of those hidden units are then used as the input of the second layer. The second layer is the output layer of this network.

See Fig. 6.2 for an illustration of this model. This feedforward network has a vector of hidden units $\boldsymbol{h}$ that are computed by a function $f^{(1)}(\boldsymbol{x} ; \boldsymbol{W}, \boldsymbol{c})$. The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network.
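As a concrete sketch of this two-hidden-unit network, the hand-picked weights below (the standard textbook solution for this architecture, with ReLU hidden units) compute XOR exactly on all four points:

```python
import numpy as np

W = np.array([[1.0, 1.0],
              [1.0, 1.0]])      # first-layer weights
c = np.array([0.0, -1.0])       # first-layer biases
w = np.array([1.0, -2.0])       # output-layer weights
b = 0.0                         # output-layer bias

def f(x):
    h = np.maximum(0, W.T @ x + c)   # hidden layer: h = max(0, W^T x + c)
    return w @ h + b                 # the output layer is linear in h

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, f(np.array(x, dtype=float)))   # prints 0, 1, 1, 0: the XOR function
```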

翻译练习 6(1 月 8 日)

卷积运算通过三个重要的思想来帮助改进机器学习系统: 稀疏交互 (sparse interactions )、参数共享 ( parameter sharing)、等变表示 ( equivariant representations )。

Convolution helps improve machine learning systems through three important ideas: sparse interactions, parameter sharing and equivariant representations.

Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing and equivariant representations.

另外, 卷积提供了一种处理大小可变的输入的方法。我们下面依次介绍这些思想。

Moreover, convolution provides a method for processing inputs of different sizes. We now introduce those ideas in turn.

Moreover, convolution provides a means for working with inputs of variable size.

tip

provide a means for doing sth.

传统的神经网络使用矩阵乘法来建立输入与输出的连接关系。其中, 参数矩阵中每一个单独的参数都描述了一个输入单元与一个输出单元间的交互。

Traditional neural networks use matrix multiplication to construct the relationship between inputs and outputs, with a matrix of parameters in which a separate parameter describes the interaction between each input unit and each output unit.

这意味着每一个输出单元与每一个输入单元都产生交互。

This means every output unit interacts with every input unit.

然而, 卷积网络具有 稀疏交互 ( sparse interactions)(也叫做稀疏连接(sparse connectivity)或者稀疏权重 ( sparse weights )) 的特征。

However, convolutional networks have a feature called sparse interactions.

Convolutional networks, however, typically have sparse interactions.

这是使核的大小远小于输入的大小来达到的。

This is achieved by making the kernel size much smaller than the input size.

This is accomplished by making the kernel smaller than the input.

举个例子, 当处理一张图像时, 输入的图像可能包含成千上万个像素点, 但是我们可以通过只占用几十到上百个像素点的核来检测一些小的有意义的特征, 例如图像的边缘。

For example, when processing an image, the input image may contain thousands of pixels, but we can detect small, meaningful features such as edges using kernels that occupy only tens or hundreds of pixels.

For example, when processing an image, the input image might have thousands of pixels, but we can detect small meaningful features with kernels that occupy only tens or hundreds of pixels.

这意味着我们需要存储的参数更少, 不仅减少了模型的存储需求, 而且提高了它的统计效率。

This means that we need to store fewer parameters, which not only reduces the memory requirements of the model but also improves its statistical efficiency.

这也意味着为了得到输出我们只需要更少的计算量。这些效率上的提高往往是很显著的。

This also means that computing the outputs requires less computation.

It also means that computing the output requires fewer operations. These improvements in efficiency are usually quite large.

如果有 $m$ 个输入和 $n$ 个输出, 那么矩阵乘法需要 $m \times n$ 个参数并且相应算法的时间复杂度为 $O(m \times n)$

If there are $m$ inputs and $n$ outputs, matrix multiplication needs $m \times n$ parameters and the time complexity is $O(m \times n)$.

If there are $m$ inputs and $n$ outputs, then matrix multiplication requires $m \times n$ parameters and the algorithms used in practice have $O(m \times n)$ runtime.

如果我们限制每一个输出拥有的连接数为 $k$, 那么稀疏的连接方法只需要 $k \times n$ 个参数以及 $O(k \times n)$ 的运行时间。在很多实际应用中, 只需保持 $k$ 比 $m$ 小几个数量级, 就能在机器学习的任务中取得好的表现。稀疏连接的图形化解释如图 9.2 和图 9.3 所示。

If we limit the number of connections each output may have to $k$, then the sparsely connected approach requires only $k \times n$ parameters and $O(k \times n)$ runtime. For many practical applications, it is possible to obtain good performance on the machine learning task while keeping $k$ several orders of magnitude smaller than $m$.
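A rough sense of scale, with made-up sizes (an image-sized input and output layer, and $k = 9$ connections per output unit):

```python
m = 320 * 280          # number of input units (one per pixel)
n = 320 * 280          # number of output units
k = 9                  # connections per output with a small kernel

dense_params = m * n   # fully connected: one parameter per input-output pair
sparse_params = k * n  # sparse connectivity: each output looks at only k inputs

print(f"dense:  {dense_params:,} parameters")   # ~8 billion
print(f"sparse: {sparse_params:,} parameters")  # ~800 thousand
```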

在深度卷积网络中, 处在网络深层的单元可能与绝大部分输入是间接交互的, 如图 9.4 所示。这允许网络可以通过只描述稀疏交互的基石来高效地描述多个变量的复杂交互。

In deep convolutional networks, units in the deeper layers may interact indirectly with most of the input, as shown in Figure 9.4. This allows the network to efficiently describe complicated interactions between many variables by constructing such interactions from simple building blocks that each describe only sparse interactions.

翻译练习 6(1 月 16 日)

参数共享 ( parameter sharing ) 是指在一个模型的多个函数中使用相同的参数。

Parameter sharing means using the same parameters for multiple functions in a model.

Parameter sharing refers to using the same parameter for more than one function in a model.

在传统的神经网络中, 当计算一层的输出时, 权重矩阵的每一个元素只使用一次, 当它乘以输人的一个元素后就再也不会用到了。

In traditional neural networks, when computing the output of a layer, each element of the weight matrix is used only once; after being multiplied by one element of the input, it is never used again.

In a traditional neural net, each element of the weight matrix is used exactly once when computing the output of a layer. It is multiplied by one element of the input and then never revisited.

作为参数共享的同义词, 我们可以说一个网络含有绑定的权重 ( tied weights ), 因为用于一个输人的权重也会被绑定在其他的权重上。

As a synonym for parameter sharing, we can say that a network contains tied weights, because the weight used for one input is tied to the weights used elsewhere.

As a synonym for parameter sharing, one can say that a network has tied weights, because the value of the weight applied to one input is tied to the value of a weight applied elsewhere.

在卷积神经网络中, 核的每一个元素都作用在输入的每一位置上(是否考虑边界像素取决于对边界决策的设计)。

In a convolutional neural net, each element of the kernel is applied to each position of the input (except perhaps some of the boundary pixels, depending on the design decisions regarding the boundary).

tip

is applied to / is used at: 作用在

卷积运算中的参数共享保证了我们只需要学习一个参数集合, 而不是对于每一位置都需要学习一个单独的参数集合。

The parameter sharing in the convolution operation ensures that we only need to learn one parameter set instead of learning an individual parameter set for each position.

The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set.

tip

rather than: 相比于 xxx, 只需要 xxx

这虽然没有改变前向传播的运行时间 (仍然是 $O(k \times n)$ ), 但它显著地把模型的存储需求降低至 $k$ 个参数, 并且 $k$ 通常要比 $m$ 小很多个数量级。

Although it does not change the runtime of forward propagation (which is still $O(k \times n)$), it significantly reduces the storage requirements of the model to $k$ parameters, and $k$ is usually several orders of magnitude smaller than $m$.

This does not affect the runtime of forward propagation, which is still $O(k \times n)$, but it does further reduce the storage requirements of the model to $k$ parameters.

因为 mmnn 通常有着大致相同的大小, kk 在实际中相对于 m×nm \times n 是很小的。因此, 卷积在存储需求和统计效率方面极大地优于稠密矩阵的乘法运算。

Since m and n are usually roughly the same size, kk is practically insignificant compared to m×nm \times n.

tip

差不多大小: roughly the same size

和 xxx 相比 xxx 是微不足道的: xxx practically insignificant compared to xxx

翻译练习 7(1 月 18 日)

对于卷积, 参数共享的特殊形式使得神经网络层具有对平移等变 ( equivariance) 的性质。

For convolution, the special form of parameter sharing makes the neural network layer have a property called equivariance to translation.

In the case of convolution, the particular form of parameter sharing causes the layer to have a property called equivariance to translation.

如果一个函数满足输人改变, 输出也以同样的方式改变这一性质, 我们就说它是等变 (equivariant) 的。

To say a function is equivariant means that if the input changes, the output changes in the same way.

特别地, 如果函数 $f(x)$ 和 $g(x)$ 满足 $f(g(x))=g(f(x))$, 我们就说 $f(x)$ 对于变换 $g$ 具有等变性。

Specifically, if functions $f(x)$ and $g(x)$ satisfy $f(g(x))=g(f(x))$, we say that $f(x)$ is equivariant to the transformation $g$.

Specifically, a function $f(x)$ is equivariant to a function $g$ if $f(g(x))=g(f(x))$.

对于卷积来说, 如果令 $g$ 是输入的任意平移函数, 那么卷积函数对于 $g$ 具有等变性。

In the case of convolution, if $g$ is any function that translates the input, then the convolution function is equivariant to $g$.

In the case of convolution, if we let $g$ be any function that translates the input, then the convolution function is equivariant to $g$.

举个例子, 令 $I$ 表示图像在整数坐标上的亮度函数, $g$ 表示图像函数的变换函数 (把一个图像函数映射到另一个图像函数的函数)。

For example, let $I$ be a function giving image brightness at integer coordinates, and let $g$ be a function mapping one image function to another image function.

这个函数把 $I$ 中的每个像素向右移动一个单位。

This shifts every pixel of I one unit to the right.

tip

shift:移动

shift sth to

如果我们先对 $I$ 进行这种变换然后进行卷积操作所得到的结果, 与先对 $I$ 进行卷积然后再对输出使用平移函数 $g$ 得到的结果是一样的。

If we apply this transformation to $I$, then apply convolution, the result will be the same as if we applied convolution to $I$, then applied the transformation $g$ to the output.

当处理时间序列数据时,这意味着通过卷积可以得到一个由输入中出现不同特征的时刻所组成的时间轴。

When processing time series data, this means that convolution produces a sort of timeline that shows when different features appear in the input.

如果我们把输入中的一个事件向后延时, 在输出中仍然会有完全相同的表示。

If we move an event later in time in the input, the exact same representation of it will appear in the output.
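A small numerical check of this property, as a sketch: using a circular shift and a circular 1-D convolution (so the equality is exact at the boundaries), shifting then convolving gives the same result as convolving then shifting. The signal and kernel values are made up.

```python
import numpy as np

def circ_conv(x, w):
    # circular 1-D convolution: s(t) = sum_a x(t - a) * w(a), with indices wrapping around
    n = len(x)
    return np.array([sum(x[(t - a) % n] * w[a] for a in range(len(w))) for t in range(n)])

x = np.array([0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 1.0, 0.0])  # a signal with an "event" near t = 2
w = np.array([0.6, 0.3, 0.1])                           # a small smoothing kernel

shift_then_conv = circ_conv(np.roll(x, 2), w)   # delay the event, then convolve
conv_then_shift = np.roll(circ_conv(x, w), 2)   # convolve, then delay the output

print(np.allclose(shift_then_conv, conv_then_shift))   # True: convolution is shift-equivariant
```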

图像与之类似, 卷积产生了一个 2 维映射来表明某些特征在输入中出现的位置。

Similarly with images, convolution creates a 2-D map of where certain features appear in the input.

如果我们移动输入中的对象, 它的表示也会在输出中移动同样的量。当处理多个输入位置时, 一些作用在邻居像素的函数是很有用的。

If we move the object in the input, its representation will move the same amount in the output.

例如在处理图像时, 在卷积网络的第一层进行图像的边缘检测是很有用的。

For example, when processing images, it is very useful to perform edge detection in the first layer of the convolutional network.

For example, when processing images, it is useful to detect edges in the first layer of a convolutional network.

相同的边缘或多或少地散落在图像的各处, 所以应当对整个图像进行参数共享。

The same edges appear more or less everywhere in the image, so it is practical to share parameters across the entire image.

tip

it is practical to do sth. 做 xxx 是很合适的/很好的/应该的

more or less: 或多或少

但在某些情况下, 我们并不希望对整幅图进行参数共享。

In some cases, we don't want to share parameters across the entire image.

In some cases, we may not wish to share parameters across the entire image.