A Deep Learning Tutorial: From Perceptrons to Deep Networks

Tags: deep learning
Posted by admin on 2017-02-21 16:01:12

In recent years, there has been a resurgence in the field of artificial intelligence. It has spread beyond the academic world, with major players like Google, Microsoft, and Facebook building their own research teams and making some impressive acquisitions.

Part of this can be attributed to the abundance of raw data generated by social network users, much of which needs to be analyzed, and part to the cheap computational power made available through GPGPUs.

But beyond these phenomena, the resurgence of AI has been driven in no small part by a new trend in the field, specifically in machine learning, known as "deep learning". In this tutorial, I'll introduce you to the key concepts and algorithms behind deep learning, beginning with the simplest unit of composition and working up to the concepts of machine learning in Java.

(For full disclosure: I'm also the author of a Java deep learning library, available here, and the examples in this article are implemented using that library. If you like it, you can show your support by giving it a star on GitHub, for which I would be thankful. Usage instructions are available on the homepage.)


Machine Learning in 30 Seconds

If you're not familiar with machine learning, take a look at this introduction to machine learning:

The general machine learning procedure is as follows:

  1. Say we have some machine learning algorithm and 20 human-labeled training samples (images): 10 images of dogs labeled 1 and 10 images of things that aren't dogs labeled 0 (this example uses supervised learning and binary classification).
  2. The algorithm "learns" to identify images of dogs and, when fed a new image afterwards, outputs a label (1 if the image contains a dog, and 0 otherwise).

Easy enough, right? Of course, your data could also be symptoms of diseases, with the labels being the diseases themselves; or your data could be images of handwritten characters, with the labels being the characters they represent.


Perceptrons: Early Deep Learning Algorithms

The perceptron is one of the earliest supervised learning algorithms, and it is a basic building block of neural networks.

Say we have n points in the plane, each labeled '0' or '1'. We're given a new point and we want to guess its label (this is akin to the "dog" and "not dog" scenario above). How do we do it?

One approach might be to look at the closest neighboring point and return that point's label. But a slightly smarter way would be to pick a line that best separates the labeled data and use that as your classifier.

[Figure: input data in relation to a linear classifier.]

In this case, each piece of input data is represented as a vector x = (x_1, x_2), and our function outputs '0' if the point lies below the line and '1' if it lies above.

To represent this mathematically, we let our separator be defined by a vector of weights w and a vertical offset (or bias) b. Then, our function combines the inputs and weights with a weighted-sum transfer function:

$$f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$$

The result of this transfer function is then fed into an activation function to produce a labeling. In the example above, our activation function is a threshold cutoff (e.g., output 1 if the sum is greater than some value):

$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } f(\mathbf{x}) > \theta \\ 0 & \text{otherwise} \end{cases}$$
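The two formulas above can be sketched in a few lines of Java. This is a minimal illustration of my own (not taken from the library mentioned earlier), using a separator for the line x2 = x1:

```java
// Minimal perceptron: weighted-sum transfer followed by a threshold activation.
public class Perceptron {
    // f(x) = w . x + b
    static double transfer(double[] w, double b, double[] x) {
        double sum = b;
        for (int i = 0; i < w.length; i++) sum += w[i] * x[i];
        return sum;
    }

    // Threshold activation: label 1 if the weighted sum exceeds theta, else 0.
    static int predict(double[] w, double b, double theta, double[] x) {
        return transfer(w, b, x) > theta ? 1 : 0;
    }

    public static void main(String[] args) {
        // A separator for points above/below the line x2 = x1: w = (-1, 1), b = 0.
        double[] w = {-1.0, 1.0};
        System.out.println(predict(w, 0.0, 0.0, new double[]{1.0, 3.0})); // above the line -> 1
        System.out.println(predict(w, 0.0, 0.0, new double[]{3.0, 1.0})); // below the line -> 0
    }
}
```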



Training the Perceptron


Training the perceptron consists of feeding it multiple training samples and calculating the output for each of them. After each sample, the weights w are adjusted in such a way as to minimize the output error, defined as the difference between the desired (target) output and the actual output. There are other error functions, like the mean squared error, but the basic principle of training remains the same.
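The adjustment just described is often written as the perceptron learning rule, w_i ← w_i + α(t − y)x_i. A minimal sketch follows; the learning rate, epoch count, and the logical-AND data set are illustrative choices of mine:

```java
// Perceptron learning rule: after each sample, nudge every weight by
// alpha * (target - output) * input, shrinking the output error.
public class PerceptronTraining {
    static double[] w = {0.0, 0.0};
    static double b = 0.0;

    static int predict(double[] x) {
        double sum = b + w[0] * x[0] + w[1] * x[1];
        return sum > 0 ? 1 : 0;
    }

    static void train(double[][] xs, int[] ts, double alpha, int epochs) {
        for (int e = 0; e < epochs; e++) {
            for (int s = 0; s < xs.length; s++) {
                int error = ts[s] - predict(xs[s]);   // desired minus actual output
                w[0] += alpha * error * xs[s][0];
                w[1] += alpha * error * xs[s][1];
                b    += alpha * error;                // bias treated as a weight on a constant 1 input
            }
        }
    }

    public static void main(String[] args) {
        double[][] xs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        int[] ts = {0, 0, 0, 1};                      // logical AND: linearly separable
        train(xs, ts, 0.1, 20);
        for (double[] x : xs) System.out.println(predict(x)); // -> 0 0 0 1
    }
}
```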


Single Perceptron Drawbacks


The single-perceptron approach to deep learning has one major drawback: it can only learn linearly separable functions. How major is this drawback? Take XOR, a relatively simple function, and notice that it can't be classified by a linear separator (note the failed attempt, below):

[Figure: XOR cannot be classified by a linear separator; a failed attempt is shown.]


To address this problem, we'll need to use a multilayer perceptron, also known as a feedforward neural network: in effect, we'll compose a bunch of these perceptrons together to create a more powerful mechanism for learning.
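To see why composing perceptrons helps, note that XOR can be computed by two threshold units feeding a third. With hand-picked weights (my own illustrative choice): one hidden unit computes OR, the other computes AND, and the output unit fires when OR is on but AND is off:

```java
// XOR via a two-layer network of threshold perceptrons:
// h1 = OR(x1, x2), h2 = AND(x1, x2), out = h1 AND NOT h2.
public class XorNetwork {
    static int step(double z) { return z > 0 ? 1 : 0; }

    static int xor(int x1, int x2) {
        int h1 = step(x1 + x2 - 0.5);        // OR gate
        int h2 = step(x1 + x2 - 1.5);        // AND gate
        return step(h1 - h2 - 0.5);          // fires only for (h1, h2) = (1, 0)
    }

    public static void main(String[] args) {
        System.out.println(xor(0, 0) + " " + xor(0, 1) + " " + xor(1, 0) + " " + xor(1, 1));
        // -> 0 1 1 0
    }
}
```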



Feedforward Neural Networks for Deep Learning

A neural network is really just a composition of perceptrons, connected in different ways and operating on different activation functions.

[Figure: a feedforward neural network with a 3-unit input layer, a 4-unit hidden layer, and a 2-unit output layer.]

For starters, we'll look at the feedforward neural network, which has the following properties:


  • An input layer, an output layer, and one or more hidden layers. The figure above shows a network with a 3-unit input layer, a 4-unit hidden layer, and a 2-unit output layer (the terms "unit" and "neuron" are interchangeable here).
  • Each unit is a single perceptron, as described above.
  • The units of the input layer serve as inputs for the units of the hidden layer, and likewise the hidden layer serves as input to the output layer.
  • Each connection between two neurons has a weight w (similar to the perceptron weights).
  • Each unit of layer t is typically connected to every unit of the previous layer t - 1 (although you could disconnect them by setting their weight to 0).
  • To process input data, you "clamp" the input vector to the input layer, setting the values of the vector as outputs for each of the input units. In this particular case, the network can process a 3-dimensional input vector (because of the three input units). For example, if the input vector is [7, 1, 2], you'd set the output of the top input unit to 7, the middle unit to 1, and so on. These values are then propagated forward to the hidden units using the weighted-sum transfer function of each hidden unit (hence the term forward propagation), which in turn calculate their outputs (via their activation functions).
  • The output layer calculates its output in the same way as the hidden layer. The result of the output layer is the output of the network.
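The clamping-and-propagation procedure in the list above can be sketched as two layer computations; the 3-4-2 shape matches the figure, while the weights here are arbitrary values of mine:

```java
// Forward propagation through a 3-4-2 feedforward network:
// clamp the input vector, then compute weighted sums and sigmoid activations layer by layer.
public class ForwardPass {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // One layer: out[j] = sigmoid(sum_i w[j][i] * in[i] + bias[j])
    static double[] layer(double[][] w, double[] bias, double[] in) {
        double[] out = new double[w.length];
        for (int j = 0; j < w.length; j++) {
            double z = bias[j];
            for (int i = 0; i < in.length; i++) z += w[j][i] * in[i];
            out[j] = sigmoid(z);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] input = {7, 1, 2};                     // clamped to the 3 input units
        double[][] wHidden = {                          // 4 hidden units x 3 inputs
            {0.1, -0.2, 0.3}, {-0.1, 0.2, 0.1}, {0.2, 0.1, -0.3}, {0.0, 0.3, 0.1}
        };
        double[][] wOut = {                             // 2 output units x 4 hidden units
            {0.5, -0.5, 0.3, 0.1}, {-0.3, 0.4, 0.2, -0.1}
        };
        double[] hidden = layer(wHidden, new double[4], input);
        double[] output = layer(wOut, new double[2], hidden);
        System.out.println(output.length);              // 2 values, one per output unit
    }
}
```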

Beyond Linearity

What if each of our perceptrons were only allowed to use a linear activation function? Then the final output of our network would still be some linear function of the inputs, just adjusted with a ton of different weights collected throughout the network. In other words, a linear composition of a bunch of linear functions is still just a linear function, so if we're restricted to linear activation functions, the feedforward neural network is no more powerful than the perceptron, no matter how many layers it has. Because of this, most networks use non-linear activation functions like the logistic, tanh, binary, or rectifier functions. Without them, the network can only learn functions which are linear combinations of its inputs.

Training

The most common deep learning algorithm for supervised training of multilayer perceptrons is known as backpropagation. The basic procedure:

1. A training sample is presented and propagated forward through the network.

2. The output error is calculated, typically as the mean squared error:

      $$E = \frac{1}{2}(t - y)^2$$

      where t is the target value and y is the actual network output. Other error calculations are acceptable, but the MSE is a good choice.

3. Network error is minimized using a method called stochastic gradient descent.

[Figure: gradient descent on the error surface, with local and global minima.]

      Gradient descent is universal, but in the case of neural networks, this would be a graph of the training error as a function of the input parameters (above). The optimal value for each weight is the one at which the error reaches its global minimum. During the training phase, the weights are updated in small steps (after each training sample, or after a mini-batch of several samples) so that they keep moving toward the global minimum, but this is no easy task, as you often end up in a local minimum, like the one on the right of the figure. For example, if a weight has a value of 0.6, it needs to be changed towards 0.4.

      This figure represents the simplest case, in which the error depends on a single parameter. However, the network error depends on every network weight, and the error function is much, much more complex.

      Thankfully, backpropagation provides a method for updating each weight between two neurons with respect to the output error. The derivation itself is quite complicated, but the weight update for a given node has the following (simple) form:

      $$\Delta w_i = -\alpha \frac{\partial E}{\partial w_i}$$

      where E is the output error and w_i is the weight of input i to the neuron.
      Essentially, the goal is to move in the direction of the gradient with respect to weight i. The key term is, of course, the derivative of the error, which isn't always easy to calculate: how would you find this derivative for a random weight of a random hidden node in the middle of a large network?
      The answer: through backpropagation. The errors are first calculated at the output units, where the formula is quite simple (based on the difference between the target and predicted values), and are then propagated back through the network, allowing us to efficiently update our weights during training and (hopefully) reach a minimum.
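For a single sigmoid output unit with squared error E = ½(t − y)², the derivative in the update rule works out to ∂E/∂w_i = (y − t) · y · (1 − y) · x_i. The sketch below (with arbitrary numbers of my own) checks that analytic gradient against a finite-difference estimate:

```java
public class GradientCheck {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // E = 0.5 * (t - y)^2 for a single sigmoid neuron y = sigmoid(w . x)
    static double error(double[] w, double[] x, double t) {
        double z = 0;
        for (int i = 0; i < w.length; i++) z += w[i] * x[i];
        double y = sigmoid(z);
        return 0.5 * (t - y) * (t - y);
    }

    public static void main(String[] args) {
        double[] w = {0.6, -0.4};
        double[] x = {1.0, 2.0};
        double t = 1.0;

        // Analytic gradient for w[0]: dE/dw0 = (y - t) * y * (1 - y) * x0
        double z = w[0] * x[0] + w[1] * x[1];
        double y = sigmoid(z);
        double analytic = (y - t) * y * (1 - y) * x[0];

        // Finite-difference estimate of the same derivative.
        double eps = 1e-6;
        double numeric = (error(new double[]{w[0] + eps, w[1]}, x, t)
                        - error(new double[]{w[0] - eps, w[1]}, x, t)) / (2 * eps);

        System.out.println(Math.abs(analytic - numeric) < 1e-8); // the two agree
    }
}
```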


The Hidden Layer


Hidden layers are of particular interest. By the universal approximation theorem, a single-hidden-layer network with a finite number of neurons can be trained to approximate an arbitrary function. In other words, a single hidden layer is powerful enough to learn any function. That said, in practice we often learn better with multiple hidden layers (i.e., deeper nets).

Hidden layers are where the network stores its internal abstract representation of the training data, similar to the way a human brain (greatly simplified analogy) has an internal representation of the real world. Going forward in the tutorial, we'll look at different ways to play around with hidden layers.



An Example Network

  You can see a simple (4-2-3 layer) feedforward neural network that classifies the IRIS data set, implemented in Java via the testMLPSigmoidBP method. The data set contains three classes of iris plants with features like sepal length, petal length, etc. Each class provides 50 samples for training the network. The features are clamped to the input units, while each output unit corresponds to a single class of the data set ("1/0/0" indicates that the plant is of class Setosa, "0/1/0" indicates Versicolour, and "0/0/1" indicates Virginica). The classification error is 2/150 (i.e., it misclassifies 2 samples out of 150).

Problems with Large Networks

  A neural network can have more than one hidden layer: in that case, the higher layers are "building" new abstractions on top of the previous layers. And, as mentioned before, larger networks can often learn better. However, increasing the number of hidden layers leads to two known issues:

Vanishing gradients: as we add more and more hidden layers, backpropagation becomes less and less useful in passing information to the lower layers. In effect, as information is passed back, the gradients begin to vanish and become small relative to the weights of the network.

Overfitting: perhaps the central problem in machine learning. Briefly, overfitting describes the phenomenon of fitting the training data too closely, with a hypothesis that is too complex. In such a case, your learner ends up fitting the training data really well, but will perform much more poorly on real examples.

  Let's look at how some deep learning algorithms address these issues.

 


Autoencoders


Most introductory machine learning classes tend to stop with feedforward neural networks. But the space of possible nets is far richer—so let’s continue.

An autoencoder is typically a feedforward neural network which aims to learn a compressed, distributed representation (encoding) of a dataset.


[Figure: an autoencoder network that learns a compressed representation of its input.]

Conceptually, the network is trained to “recreate” the input, i.e., the input and the target data are the same. In other words: you’re trying to output the same thing you were input, but compressed in some way. This is a confusing approach, so let’s look at an example.

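Concretely, training with target = input can be sketched end to end. Below, a 2-1-2 linear autoencoder (the weights, learning rate, and toy data set are all illustrative choices of mine) learns to recreate 2-D points that lie on a line through a single hidden unit:

```java
// A minimal linear autoencoder: 2 inputs -> 1 hidden unit -> 2 outputs,
// trained by gradient descent so that the output recreates the input.
public class TinyAutoencoder {
    static double trainAndError() {
        double[][] data = {{1, 2}, {2, 4}, {-1, -2}, {-2, -4}}; // points on the line x2 = 2*x1
        double[] enc = {0.1, 0.2};   // encoder weights: hidden = enc . x
        double[] dec = {0.1, 0.1};   // decoder weights: xhat = dec * hidden
        double alpha = 0.01;

        for (int epoch = 0; epoch < 5000; epoch++) {
            for (double[] x : data) {
                double h = enc[0] * x[0] + enc[1] * x[1];       // encode
                // Reconstruction error (the target is the input itself).
                double[] e = {dec[0] * h - x[0], dec[1] * h - x[1]};
                double dh = e[0] * dec[0] + e[1] * dec[1];       // backpropagated error
                dec[0] -= alpha * e[0] * h;
                dec[1] -= alpha * e[1] * h;
                enc[0] -= alpha * dh * x[0];
                enc[1] -= alpha * dh * x[1];
            }
        }
        // The single hidden unit suffices: the data is one-dimensional.
        double h = enc[0] * 2 + enc[1] * 4;
        return Math.abs(dec[0] * h - 2) + Math.abs(dec[1] * h - 4);
    }

    public static void main(String[] args) {
        System.out.println(trainAndError() < 0.1); // reconstruction is nearly exact
    }
}
```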


Compressing the Input: Grayscale Images


Say that the training data consists of 28x28 grayscale images and the value of each pixel is clamped to one input layer neuron (i.e., the input layer will have 784 neurons). Then, the output layer would have the same number of units (784) as the input layer and the target value for each output unit would be the grayscale value of one pixel of the image.


The intuition behind this architecture is that the network will not learn a “mapping” between the training data and its labels, but will instead learn the internal structure and features of the data itself. (Because of this, the hidden layer is also called feature detector.) Usually, the number of hidden units is smaller than the input/output layers, which forces the network to learn only the most important features and achieves a dimensionality reduction.



In effect, we want a few small nodes in the middle to really learn the data at a conceptual level, producing a compact representation that in some way captures the core features of our input.



The Flu Illness

    To demonstrate autoencoders further, let's look at one more application.

    In this case, we'll use a simple data set consisting of flu symptoms. If you're interested, the code for this example can be found here.

    Here's how the data set breaks down:

  • There are six binary input features.
  • The first three are symptoms of the illness. For example, 1 0 0 0 0 0 indicates that this patient has a high temperature, 0 1 0 0 0 0 indicates coughing, 1 1 0 0 0 0 indicates coughing and a high temperature, etc.
  • The final three features are "counter" symptoms; when a patient has one of these, it's less likely that he or she is sick. For example, 0 0 0 1 0 0 indicates that this patient has a flu vaccine. The two sets of features can also be combined: 0 1 0 1 0 0 indicates a vaccinated patient with a cough, and so forth.

    We'll consider a patient to be sick when he or she has at least two of the first three features, and healthy when he or she has at least two of the second three, for example:

  • 111000, 101000, 110000, 011000, 011100 = Sick
  • 000111, 001110, 000101, 000011, 000110 = Healthy

    We'll train an autoencoder (using backpropagation) with six input units and six output units, but only two hidden units.

    After several hundred iterations, we observe that when each of the "sick" samples is presented to the network, one of the two hidden units (the same unit for every "sick" sample) always exhibits a higher activation value than the other. Conversely, when a "healthy" sample is presented, the other hidden unit has the higher activation.
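The sick/healthy rule above is easy to state in code. A small sketch (illustrative only) that labels the six-bit records:

```java
// Label a six-bit symptom record: "sick" when at least two of the first three
// (symptom) bits are set, "healthy" when at least two of the last three
// (counter-symptom) bits are set, otherwise "unclear".
public class FluLabels {
    static int countOnes(String bits, int from, int to) {
        int n = 0;
        for (int i = from; i < to; i++) if (bits.charAt(i) == '1') n++;
        return n;
    }

    static String label(String bits) {
        if (countOnes(bits, 0, 3) >= 2) return "sick";
        if (countOnes(bits, 3, 6) >= 2) return "healthy";
        return "unclear";
    }

    public static void main(String[] args) {
        System.out.println(label("111000")); // -> sick
        System.out.println(label("000110")); // -> healthy
        System.out.println(label("010100")); // vaccinated patient with a cough -> unclear
    }
}
```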


Going Back to Machine Learning

Essentially, our two hidden units have learned a compact representation of the flu symptom data set. To see how this relates to learning, we return to the problem of overfitting. By training our net to learn a compact representation of the data, we’re favoring a simpler representation rather than a highly complex hypothesis that overfits the training data.

In a way, by favoring these simpler representations, we’re attempting to learn the data in a truer sense.


Restricted Boltzmann Machines

The next logical step is to look at the Restricted Boltzmann machine (RBM), a generative stochastic neural network that can learn a probability distribution over its set of inputs.

[Figure: a Restricted Boltzmann Machine, composed of visible and hidden units.]

RBMs are composed of a hidden, visible, and bias layer. Unlike the feedforward networks, the connections between the visible and hidden layers are undirected (the values can be propagated in both the visible-to-hidden and hidden-to-visible directions) and fully connected (each unit from a given layer is connected to each unit in the next—if we allowed any unit in any layer to connect to any other layer, then we’d have a Boltzmann (rather than a restricted Boltzmann) machine).

The standard RBM has binary hidden and visible units: that is, the unit activation is 0 or 1 under a Bernoulli distribution, but there are variants with other non-linearities.

While researchers have known about RBMs for some time now, the recent introduction of the contrastive divergence unsupervised training algorithm has renewed interest.




Contrastive Divergence

The single-step contrastive divergence algorithm (CD-1) works like this:

  1. Positive phase:
    • An input sample v is clamped to the input layer.
    • v is propagated to the hidden layer in a similar manner to the feedforward networks. The result of the hidden layer activations is h.
  2. Negative phase:
    • Propagate h back to the visible layer with result v’ (the connections between the visible and hidden layers are undirected and thus allow movement in both directions).
    • Propagate the new v’ back to the hidden layer with activations result h’.
  3. Weight update:

    $$w_{\text{new}} = w + a(v h^{T} - v' h'^{T})$$

    Where a is the learning rate and v, v’, h, h’, and w are vectors.

The intuition behind the algorithm is that the positive phase (h given v) reflects the network’s internal representation of the real world data. Meanwhile, the negative phase represents an attempt to recreate the data based on this internal representation (v’ given h). The main goal is for the generated data to be as close as possible to the real world and this is reflected in the weight update formula.

In other words, the net has some perception of how the input data can be represented, so it tries to reproduce the data based on this perception. If its reproduction isn’t close enough to reality, it makes an adjustment and tries again.
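The three phases above can be sketched directly. This is a minimal, real-valued illustration of CD-1 that uses sigmoid probabilities instead of sampled binary states; the network shape, input sample, and learning rate are arbitrary choices of mine:

```java
// One CD-1 step for a tiny RBM: positive phase v -> h, negative phase
// h -> v' -> h', then the weight update w += a * (v h^T - v' h'^T).
public class ContrastiveDivergenceStep {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // Visible-to-hidden propagation: h[j] = sigmoid(sum_i w[i][j] * v[i]).
    static double[] up(double[][] w, double[] v) {
        double[] h = new double[w[0].length];
        for (int j = 0; j < h.length; j++) {
            double z = 0;
            for (int i = 0; i < v.length; i++) z += w[i][j] * v[i];
            h[j] = sigmoid(z);
        }
        return h;
    }

    // Hidden-to-visible propagation along the same undirected connections.
    static double[] down(double[][] w, double[] h) {
        double[] v = new double[w.length];
        for (int i = 0; i < v.length; i++) {
            double z = 0;
            for (int j = 0; j < h.length; j++) z += w[i][j] * h[j];
            v[i] = sigmoid(z);
        }
        return v;
    }

    public static void main(String[] args) {
        double[][] w = {{0.1, -0.1}, {0.0, 0.2}, {-0.2, 0.1}}; // 3 visible x 2 hidden
        double[] v = {1, 0, 1};                                // clamped input sample
        double a = 0.1;                                        // learning rate

        double[] h  = up(w, v);     // positive phase
        double[] v2 = down(w, h);   // negative phase: reconstruction v'
        double[] h2 = up(w, v2);    // h'

        for (int i = 0; i < w.length; i++)
            for (int j = 0; j < w[0].length; j++)
                w[i][j] += a * (v[i] * h[j] - v2[i] * h2[j]);

        System.out.println(w.length + "x" + w[0].length);      // weights nudged toward recreating v
    }
}
```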




Returning to the Flu

To demonstrate contrastive divergence, we’ll use the same symptoms data set as before. The test network is an RBM with six visible and two hidden units. We’ll train the network using contrastive divergence with the symptoms v clamped to the visible layer. During testing, the symptoms are again presented to the visible layer; then, the data is propagated to the hidden layer. The hidden units represent the sick/healthy state, a very similar architecture to the autoencoder (propagating data from the visible to the hidden layer).

After several hundred iterations, we can observe the same result as with autoencoders: one of the hidden units has a higher activation value when any of the “sick” samples is presented, while the other is always more active for the “healthy” samples.

You can see this example in action in the testContrastiveDivergence method.




Deep Networks

We’ve now demonstrated that the hidden layers of autoencoders and RBMs act as effective feature detectors; but it’s rare that we can use these features directly. In fact, the data set above is more an exception than a rule. Instead, we need to find some way to use these detected features indirectly.

Luckily, it was discovered that these structures can be stacked to form deep networks. These networks can be trained greedily, one layer at a time, to help to overcome the vanishing gradient and overfitting problems associated with classic backpropagation.

The resulting structures are often quite powerful, producing impressive results. Take, for example, Google’s famous “cat” paper in which they use a special kind of deep autoencoder to “learn” human and cat face detection based on unlabeled data.

Let’s take a closer look.




Stacked Autoencoders

As the name suggests, this network consists of multiple stacked autoencoders.

Stacked Autoencoders have a series of inputs, outputs, and hidden layers that contribute to machine learning outcomes.

The hidden layer of autoencoder t acts as an input layer to autoencoder t + 1. The input layer of the first autoencoder is the input layer for the whole network. The greedy layer-wise training procedure works like this:

  1. Train the first autoencoder (t=1, or the red connections in the figure above, but with an additional output layer) individually using the backpropagation method with all available training data.
  2. Train the second autoencoder t=2 (green connections). Since the input layer for t=2 is the hidden layer of t=1 we are no longer interested in the output layer of t=1 and we remove it from the network. Training begins by clamping an input sample to the input layer of t=1, which is propagated forward to the output layer of t=2. Next, the weights (input-hidden and hidden-output) of t=2 are updated using backpropagation. t=2 uses all the training samples, similar to t=1.
  3. Repeat the previous procedure for all the layers (i.e., remove the output layer of the previous autoencoder, replace it with yet another autoencoder, and train with back propagation).
  4. Steps 1-3 are called pre-training and leave the weights properly initialized. However, there’s no mapping between the input data and the output labels. For example, if the network is trained to recognize images of handwritten digits it’s still not possible to map the units from the last feature detector (i.e., the hidden layer of the last autoencoder) to the digit type of the image. In that case, the most common solution is to add one or more fully connected layer(s) to the last layer (blue connections). The whole network can now be viewed as a multilayer perceptron and is trained using backpropagation (this step is also called fine-tuning).

Stacked autoencoders, then, are all about providing an effective pre-training method for initializing the weights of a network, leaving you with a complex, multi-layer perceptron that’s ready to train (or fine-tune).
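The greedy procedure boils down to a loop: train an autoencoder on the current representation, then feed its hidden activations forward as the training data for the next autoencoder. The sketch below shows only that data flow; the `encode` step is a fixed stand-in of my own (a deterministic projection) for a hidden layer actually learned by backpropagation:

```java
import java.util.ArrayList;
import java.util.List;

// Greedy layer-wise pre-training, structurally: each stage's hidden
// activations become the next stage's training data.
public class GreedyStacking {
    // Placeholder encoder: a deterministic projection to `hiddenUnits` values.
    // In a real stacked autoencoder this would be the hidden layer learned
    // by training with target = input.
    static double[][] encode(double[][] data, int hiddenUnits) {
        double[][] out = new double[data.length][hiddenUnits];
        for (int s = 0; s < data.length; s++)
            for (int j = 0; j < hiddenUnits; j++)
                for (int i = 0; i < data[s].length; i++)
                    out[s][j] += data[s][i] * Math.sin(i + 1) * (j + 1) * 0.1;
        return out;
    }

    public static void main(String[] args) {
        double[][] data = {{1, 0, 1, 0, 1, 0}, {0, 1, 0, 1, 0, 1}}; // 6 input features
        int[] layerSizes = {4, 3, 2};           // hidden sizes of autoencoders t = 1, 2, 3

        List<Integer> widths = new ArrayList<>();
        widths.add(data[0].length);
        for (int size : layerSizes) {
            data = encode(data, size);          // hidden layer of t feeds autoencoder t + 1
            widths.add(data[0].length);
        }
        System.out.println(widths);             // -> [6, 4, 3, 2]
    }
}
```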




Deep Belief Networks

As with autoencoders, we can also stack Boltzmann machines to create a class known as deep belief networks (DBNs).

[Figure: a deep belief network as a stack of Boltzmann machines.]

In this case, the hidden layer of RBM t acts as a visible layer for RBM t+1. The input layer of the first RBM is the input layer for the whole network, and the greedy layer-wise pre-training works like this:

  1. Train the first RBM t=1 using contrastive divergence with all the training samples.
  2. Train the second RBM t=2. Since the visible layer for t=2 is the hidden layer of t=1, training begins by clamping the input sample to the visible layer of t=1, which is propagated forward to the hidden layer of t=1. This data then serves to initiate contrastive divergence training for t=2.
  3. Repeat the previous procedure for all the layers.
  4. Similar to the stacked autoencoders, after pre-training the network can be extended by connecting one or more fully connected layers to the final RBM hidden layer. This forms a multi-layer perceptron which can then be fine tuned using backpropagation.

This procedure is akin to that of stacked autoencoders, but with the autoencoders replaced by RBMs and backpropagation replaced with the contrastive divergence algorithm.

(Note: for more on constructing and training stacked autoencoders or deep belief networks, check out the sample code here.)




Convolutional Networks

As a final deep learning architecture, let’s take a look at convolutional networks, a particularly interesting and special class of feedforward networks that are very well-suited to image recognition.

[Figure: a convolutional network. Image via DeepLearning.net]

Before we look at the actual structure of convolutional networks, we first define an image filter, or a square region with associated weights. A filter is applied across an entire input image, and you will often apply multiple filters. For example, you could apply four 6x6 filters to a given input image. Then, the output pixel with coordinates 1,1 is the weighted sum of a 6x6 square of input pixels with top left corner 1,1 and the weights of the filter (which is also 6x6 square). Output pixel 2,1 is the result of input square with top left corner 2,1 and so on.

With that covered, these networks are defined by the following properties:

  • Convolutional layers apply a number of filters to the input. For example, the first convolutional layer of the image could have four 6x6 filters. The result of one filter applied across the image is called a feature map (FM), and the number of feature maps is equal to the number of filters. If the previous layer is also convolutional, the filters are applied across all of its FMs with different weights, so each input FM is connected to each output FM. The intuition behind the shared weights across the image is that the features will be detected regardless of their location, while the multiplicity of filters allows each of them to detect a different set of features.
  • Subsampling layers reduce the size of the input. For example, if the input consists of a 32x32 image and the layer has a subsampling region of 2x2, the output value would be a 16x16 image, which means that 4 pixels (each 2x2 square) of the input image are combined into a single output pixel. There are multiple ways to subsample, but the most popular are max pooling, average pooling, and stochastic pooling.
  • The last subsampling (or convolutional) layer is usually connected to one or more fully connected layers, the last of which represents the target data.
  • Training is performed using modified backpropagation that takes the subsampling layers into account and updates the convolutional filter weights based on all values to which that filter is applied.

You can see several examples of convolutional networks trained (with backpropagation) on the MNIST data set (grayscale images of handwritten digits) here, specifically in the testLeNet* methods (I would recommend testLeNetTiny2 as it achieves a low error rate of about 2% in a relatively short period of time). There’s also a nice JavaScript visualization of a similar network here.
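The filter and subsampling mechanics described above are easy to make concrete. The sketch below uses a 2x2 filter and 2x2 max pooling for brevity (instead of the 6x6 filters in the example); the image values are arbitrary:

```java
// Valid 2D convolution and 2x2 max pooling, as in the convolutional and
// subsampling layers described above.
public class ConvAndPool {
    // out[r][c] = weighted sum of the input square with top-left corner (r, c).
    static double[][] convolve(double[][] in, double[][] filter) {
        int k = filter.length;
        double[][] out = new double[in.length - k + 1][in[0].length - k + 1];
        for (int r = 0; r < out.length; r++)
            for (int c = 0; c < out[0].length; c++)
                for (int i = 0; i < k; i++)
                    for (int j = 0; j < k; j++)
                        out[r][c] += in[r + i][c + j] * filter[i][j];
        return out;
    }

    // Each non-overlapping 2x2 square collapses to its maximum value.
    static double[][] maxPool2x2(double[][] in) {
        double[][] out = new double[in.length / 2][in[0].length / 2];
        for (int r = 0; r < out.length; r++)
            for (int c = 0; c < out[0].length; c++)
                out[r][c] = Math.max(
                    Math.max(in[2 * r][2 * c], in[2 * r][2 * c + 1]),
                    Math.max(in[2 * r + 1][2 * c], in[2 * r + 1][2 * c + 1]));
        return out;
    }

    public static void main(String[] args) {
        double[][] image = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}};
        double[][] filter = {{1, 1}, {1, 1}};      // sums each 2x2 neighborhood
        double[][] fm = convolve(image, filter);   // 3x3 feature map; fm[0][0] = 1+2+5+6 = 14
        double[][] pooled = maxPool2x2(image);     // 2x2 result: {{6, 8}, {14, 16}}
        System.out.println(fm[0][0] + " " + pooled[1][1]); // -> 14.0 16.0
    }
}
```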




Implementation

Now that we’ve covered the most common neural network variants, I thought I’d write a bit about the challenges posed during implementation of these deep learning structures.

Broadly speaking, my goal in creating a Deep Learning library was (and still is) to build a neural network-based framework that satisfied the following criteria:

  • A common architecture that is able to represent diverse models (all the variants on neural networks that we’ve seen above, for example).
  • The ability to use diverse training algorithms (back propagation, contrastive divergence, etc.).
  • Decent performance.

To satisfy these requirements, I took a tiered (or modular) approach to the design of the software.




Structure

Let’s start with the basics:

  • NeuralNetworkImpl is the base class for all neural network models.
  • Each network contains a set of layers.
  • Each layer has a list of connections, where a connection is a link between two layers such that the network is a directed acyclic graph.

This structure is agile enough to be used for classic feedforward networks, as well as for RBMs and more complex architectures like ImageNet.

It also allows a layer to be part of more than one network. For example, the layers in a Deep Belief Network are also layers in their corresponding RBMs.

In addition, this architecture allows a DBN to be viewed as a list of stacked RBMs during the pre-training phase and a feedforward network during the fine-tuning phase, which is both intuitively nice and programmatically convenient.




Data Propagation

The next module takes care of propagating data through the network, a two-step process:

  1. Determine the order of the layers. For example, to get the results from a multilayer perceptron, the data is “clamped” to the input layer (hence, this is the first layer to be calculated) and propagated all the way to the output layer. In order to update the weights during backpropagation, the output error has to be propagated through every layer in breadth-first order, starting from the output layer. This is achieved using various implementations of LayerOrderStrategy, which takes advantage of the graph structure of the network, employing different graph traversal methods. Some examples include the breadth-first strategy and the targeting of a specific layer. The order is actually determined by the connections between the layers, so the strategies return an ordered list of connections.
  2. Calculate the activation value. Each layer has an associated ConnectionCalculator which takes its list of connections (from the previous step) and input values (from other layers) and calculates the resulting activation. For example, in a simple sigmoidal feedforward network, the hidden layer’s ConnectionCalculator takes the values of the input and bias layers (which are, respectively, the input data and an array of 1s) and the weights between the units (in case of fully connected layers, the weights are actually stored in a FullyConnected connection as a Matrix), calculates the weighted sum, and feeds the result into the sigmoid function. The connection calculators implement a variety of transfer (e.g., weighted sum, convolutional) and activation (e.g., logistic and tanh for multilayer perceptron, binary for RBM) functions. Most of them can be executed on a GPU using Aparapi and are usable with mini-batch training.



GPU Computation with Aparapi

As I mentioned earlier, one of the reasons that neural networks have made a resurgence in recent years is that their training methods are highly conducive to parallelism, allowing you to speed up training significantly with the use of a GPGPU. In this case, I chose to work with the Aparapi library to add GPU support.

Aparapi imposes some important restrictions on the connection calculators:

  • Only one-dimensional arrays (and variables) of primitive data types are allowed.
  • Only member-methods of the Aparapi Kernel class itself are allowed to be called from the GPU executable code.

As such, most of the data (weights, input, and output arrays) is stored in Matrix instances, which use one-dimensional float arrays internally. All Aparapi connection calculators use either AparapiWeightedSum (for fully connected layers and weighted sum input functions), AparapiSubsampling2D (for subsampling layers), or AparapiConv2D (for convolutional layers). Some of these limitations can be overcome with the introduction of Heterogeneous System Architecture. Aparapi also allows the same code to be run on both the CPU and the GPU.




Training

The training module implements various training algorithms. It relies on the previous two modules. For example, BackPropagationTrainer (all the trainers use the Trainer base class) uses the feedforward layer calculator for the feedforward phase and a special breadth-first layer calculator for propagating the error and updating the weights.

My latest work adds Java 8 support and some other improvements, which will soon be merged into master.

Conclusion

The aim of this Java deep learning tutorial was to give you a brief introduction to the field of deep learning algorithms, beginning with the most basic unit of composition (the perceptron) and progressing through various effective and popular architectures, like that of the restricted Boltzmann machine.

The ideas behind neural networks have been around for a long time; but today, you can’t step foot in the machine learning community without hearing about deep networks or some other take on deep learning. Hype shouldn’t be mistaken for justification, but with the advances of GPGPU computing and the impressive progress made by researchers like Geoffrey Hinton, Yoshua Bengio, Yann LeCun and Andrew Ng, the field certainly shows a lot of promise. There’s no better time to get familiar and get involved than the present.

