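Implementing an Attention Mechanism Step by Step with RoI Pooling in TensorFlow + Keras

Baijia author: 机器之心 (Synced), 2019-05-10 06:35:47

Selected from Medium

Author: Jaime Sevilla

Compiled by 机器之心 (Synced)

Contributors: Geek AI, Chita

In this article, the author explains the basic concept and typical use of region-of-interest pooling (RoI pooling) and how it can be used to implement an attention mechanism. He then walks through an implementation of RoI pooling in Keras and TensorFlow, step by step.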

Project link: https://gist.github.com/Jsevillamol/0daac5a6001843942f91f2a3daea27a7

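Understanding RoI pooling

RoI pooling was proposed by Ross Girshick in the paper "Fast R-CNN" as part of his object recognition pipeline.

In the typical use case for RoI pooling, we have an image-like object and several regions of interest specified by bounding boxes. We want to generate an embedding from each RoI.

For example, in an R-CNN setting we have an image and a proposal mechanism that generates bounding boxes for potentially interesting parts of the image. We then want to generate an embedding for each proposed image patch.

Simply cropping each proposed region will not work, because we want to stack the resulting embeddings on top of each other, and the proposed regions do not necessarily have the same shape!

So we need a way to transform each image patch into an embedding of a predefined shape. How can we do that?

In computer vision, pooling is a standard way to reduce the size of an image.

The most common pooling operation is max pooling. Here, we divide the input image into areas of the same shape (usually non-overlapping) and take the maximum of each area to form the output.

Max pooling divides each region into pooling areas of the same size.

This does not directly solve our problem: image patches of different shapes would be divided into different numbers of same-shaped areas, producing outputs of different shapes.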
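To make the problem concrete, here is a minimal sketch of non-overlapping max pooling in plain Python, using nested lists rather than tensors. The helper names `max_pool_2d` and `shape`, the window size and the patch values are illustrative assumptions, not part of the original code. Two patches of different shapes come out with different shapes:

```python
def max_pool_2d(image, window):
    """Non-overlapping max pooling over a 2D list with a square window.

    Rows and columns that do not fill a complete window are dropped.
    """
    height, width = len(image), len(image[0])
    return [
        [
            max(image[r][c]
                for r in range(top, top + window)
                for c in range(left, left + window))
            for left in range(0, width - window + 1, window)
        ]
        for top in range(0, height - window + 1, window)
    ]


def shape(grid):
    """Return (rows, columns) of a rectangular nested list."""
    return (len(grid), len(grid[0]))


# Two hypothetical candidate patches with different shapes.
patch_a = [[r * 4 + c for c in range(4)] for r in range(4)]   # 4 x 4
patch_b = [[r * 6 + c for c in range(6)] for r in range(2)]   # 2 x 6

pooled_a = max_pool_2d(patch_a, window=2)
pooled_b = max_pool_2d(patch_b, window=2)

print(shape(pooled_a))  # (2, 2)
print(shape(pooled_b))  # (1, 3)
```

Because the two results have different shapes, they cannot be stacked into a single tensor.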

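But it gives us an idea. What if we divide each region of interest into the same number of areas, whose shapes may differ, and take the maximum of each area?

RoI pooling divides every region into a grid with the same number of pooling areas.

This is exactly what the RoI pooling layer does.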
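The idea can be sketched in plain Python before bringing in TensorFlow. This is only an illustration under stated assumptions: `roi_max_pool` is a hypothetical helper working on nested lists, and it assumes every patch has at least as many rows and columns as the target grid. The last row and column of cells absorb any leftover pixels:

```python
def roi_max_pool(patch, pooled_height, pooled_width):
    """Max-pool a 2D list into a fixed pooled_height x pooled_width grid.

    Assumes the patch has at least pooled_height rows and
    pooled_width columns.
    """
    height, width = len(patch), len(patch[0])
    h_step, w_step = height // pooled_height, width // pooled_width
    pooled = []
    for i in range(pooled_height):
        top = i * h_step
        bottom = (i + 1) * h_step if i + 1 < pooled_height else height
        row = []
        for j in range(pooled_width):
            left = j * w_step
            right = (j + 1) * w_step if j + 1 < pooled_width else width
            row.append(max(patch[r][c]
                           for r in range(top, bottom)
                           for c in range(left, right)))
        pooled.append(row)
    return pooled


# Two hypothetical patches with different shapes.
patch_a = [[r * 4 + c for c in range(4)] for r in range(4)]   # 4 x 4
patch_b = [[r * 6 + c for c in range(6)] for r in range(2)]   # 2 x 6

pooled_a = roi_max_pool(patch_a, 2, 2)
pooled_b = roi_max_pool(patch_b, 2, 2)

print(pooled_a)  # [[5, 7], [13, 15]]
print(pooled_b)  # [[2, 5], [8, 11]]
```

Both results are 2 x 2 even though the patches differ in shape, so they can be stacked.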

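The benefits of attention mechanisms

RoI pooling implements a so-called "attention mechanism", which lets our model focus on specific features of the input.

In the context of object recognition, it lets us split the pipeline into two parts (region proposal and region classification) while keeping an end-to-end differentiable architecture.

The Fast R-CNN architecture, showing the RoI pooling layer. Source: Ross Girshick's paper "Fast R-CNN".

RoI pooling is a general-purpose attention tool that can be used for other tasks, such as one-shot, context-aware classification of preselected regions in an image. In other words, it lets us label different regions of the same image in a single pass.

More generally, attention mechanisms were inspired by work in neuroscience on visual stimuli (see Desimone and Duncan's 1995 paper "Neural Mechanisms of Selective Visual Attention").

Today, applications of attention go beyond computer vision, and it is also popular in sequence-processing tasks. I recommend that readers look at OpenAI's attention-based model described in "Better Language Models and their Implications", which has been used successfully on a variety of natural language understanding tasks.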

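Type signature of the RoI layer

Before diving into the implementation details, let us think about the type signature of the RoI layer.

The RoI layer takes two input tensors:

A batch of images. To process them together, all images must have the same shape. The resulting tensor has shape (batch_size, img_height, img_width, n_channels).
A batch of candidate regions of interest (RoIs). If we want to stack them in one tensor, the number of candidate regions per image must be fixed. Since each bounding box is specified by 4 coordinates, this tensor has shape (batch_size, n_rois, 4).

The output of the RoI layer should be:

A list of embeddings for each image, encoding the region specified by each RoI. The corresponding shape is (batch_size, n_rois, pooled_height, pooled_width, n_channels).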

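Keras code

Keras lets us implement custom layers by subclassing the base Layer class.

The "tf.keras" documentation recommends implementing the "__init__", "build" and "call" methods for a custom layer. However, since the purpose of "build" is to add weights to the layer and our RoI layer has no weights, we do not need to override it. We will also implement the convenience method "compute_output_shape".

We will code each part separately and put them together at the end.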

def __init__(self, pooled_height, pooled_width, **kwargs):
    self.pooled_height = pooled_height
    self.pooled_width = pooled_width
    super(ROIPoolingLayer, self).__init__(**kwargs)

The class constructor is straightforward. We need to specify the target height and width of the embeddings to generate. On the last line of the constructor, we call the parent constructor to initialize the remaining class attributes.

def compute_output_shape(self, input_shape):
    """ Returns the shape of the ROI Layer output
    """
    feature_map_shape, rois_shape = input_shape
    assert feature_map_shape[0] == rois_shape[0]
    batch_size = feature_map_shape[0]
    n_rois = rois_shape[1]
    n_channels = feature_map_shape[3]
    return (batch_size, n_rois, self.pooled_height,
            self.pooled_width, n_channels)

"compute_output_shape" is a handy utility function that tells us what shape the output of the RoI layer will have for a given input.
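As a quick sanity check, the same shape logic can be tried out without building a layer. The sketch below rewrites it as a free function on plain tuples; `roi_output_shape` is a hypothetical name used only for illustration, and the shapes are those of the test at the end of the article:

```python
def roi_output_shape(feature_map_shape, rois_shape,
                     pooled_height, pooled_width):
    """Shape logic of compute_output_shape, on plain tuples."""
    assert feature_map_shape[0] == rois_shape[0], "batch sizes must match"
    batch_size = feature_map_shape[0]
    n_rois = rois_shape[1]
    n_channels = feature_map_shape[3]
    return (batch_size, n_rois, pooled_height, pooled_width, n_channels)


# One 200 x 100 single-channel image with two RoIs, pooled to 3 x 7.
out = roi_output_shape((1, 200, 100, 1), (1, 2, 4),
                       pooled_height=3, pooled_width=7)
print(out)  # (1, 2, 3, 7, 1)
```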

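Next, we need to implement the "call" method. "call" is where the logic of the RoI pooling layer lives. It should take the two tensors holding the layer's inputs and return a tensor with the embeddings.

Before implementing this method, we need a simpler function that takes a single image and a single RoI as input and returns the corresponding embedding.

Let us implement it step by step.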

@staticmethod
def _pool_roi(feature_map, roi, pooled_height, pooled_width):
  """ Applies ROI Pooling to a single image and a single ROI
  """
  # Compute the region of interest
  feature_map_height = int(feature_map.shape[0])
  feature_map_width  = int(feature_map.shape[1])

  h_start = tf.cast(feature_map_height * roi[0], 'int32')
  w_start = tf.cast(feature_map_width  * roi[1], 'int32')
  h_end   = tf.cast(feature_map_height * roi[2], 'int32')
  w_end   = tf.cast(feature_map_width  * roi[3], 'int32')

  region = feature_map[h_start:h_end, w_start:w_end, :]
...

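The first six statements of the function compute where the RoI starts and ends in the image.

We specify that the coordinates of each RoI are given as relative numbers between 0 and 1. Specifically, each RoI is a 4-dimensional tensor holding four relative coordinates (x_min, y_min, x_max, y_max). Note that in the code above the first and third coordinates are multiplied by the feature map height and the second and fourth by its width.

We could also specify the RoI in absolute coordinates, but that is usually a worse choice: the input image passes through convolutional layers that change its shape before it reaches the RoI pooling layer, which would force us to track how the shape changes and rescale the RoI bounding boxes accordingly.

The seventh statement crops the image to the RoI directly, using the powerful tensor slicing syntax that TensorFlow provides.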
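The advantage of relative coordinates can be illustrated with a small plain-Python sketch. Here `roi_to_pixels` is a hypothetical helper that mirrors the scaling above with `int()` in place of `tf.cast`, and the RoI values and the 50 x 25 feature map size are made up for the example. The same RoI is mapped onto an input image and onto a smaller feature map without any bookkeeping:

```python
def roi_to_pixels(roi, height, width):
    """Scale a relative RoI to integer pixel bounds by truncation.

    Follows the index order of the layer code: entries 0 and 2 are
    scaled by the height, entries 1 and 3 by the width.
    """
    h_start = int(height * roi[0])
    w_start = int(width * roi[1])
    h_end = int(height * roi[2])
    w_end = int(width * roi[3])
    return (h_start, w_start, h_end, w_end)


# Hypothetical RoI, chosen so that the products are exact in floating point.
roi = (0.5, 0.25, 0.75, 0.5)

on_image = roi_to_pixels(roi, height=200, width=100)
on_feature_map = roi_to_pixels(roi, height=50, width=25)

print(on_image)        # (100, 25, 150, 50)
print(on_feature_map)  # (25, 6, 37, 12)
```

With absolute coordinates, the second result would have to be derived by tracking how each convolutional layer rescales the image.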

...
# Divide the region into non overlapping areas
region_height = h_end - h_start
region_width  = w_end - w_start
h_step = tf.cast(region_height / pooled_height, 'int32')
w_step = tf.cast(region_width  / pooled_width , 'int32')

areas = [[(
           i*h_step,
           j*w_step,
           (i+1)*h_step if i+1 < pooled_height else region_height,
           (j+1)*w_step if j+1 < pooled_width else region_width
          )
          for j in range(pooled_width)]
         for i in range(pooled_height)]
...

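In the next four lines, we compute the shape of each area to be pooled within the RoI.

Next, we create a two-dimensional array of tensors, where each element is a tuple giving the start and end coordinates of an area from which we will take the maximum.

The code that generates the grid of area coordinates looks overly complicated, but note that if we simply divided the RoI into areas of shape (region_height / pooled_height, region_width / pooled_width), rounded down, some pixels of the RoI would not fall into any area.

We fix this by extending the rightmost and bottommost areas to take in the leftover pixels that would otherwise fall into no area. In the code, this is done by setting the end coordinate of the last area in each direction to region_height or region_width.

The result of this part is a 2D list of bounding boxes.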
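To see the leftover-pixel handling at work, the grid can be computed with ordinary integers. This is a sketch under the assumption that plain Python ints stand in for the tensors; `pooling_areas` is a hypothetical name, and the 200 x 100 region with a 3 x 7 grid matches the full-image RoI used in the test later in the article:

```python
def pooling_areas(region_height, region_width, pooled_height, pooled_width):
    """Grid of (h_start, w_start, h_end, w_end) tuples covering a region."""
    h_step = int(region_height / pooled_height)
    w_step = int(region_width / pooled_width)
    return [[(i * h_step,
              j * w_step,
              (i + 1) * h_step if i + 1 < pooled_height else region_height,
              (j + 1) * w_step if j + 1 < pooled_width else region_width)
             for j in range(pooled_width)]
            for i in range(pooled_height)]


areas = pooling_areas(200, 100, pooled_height=3, pooled_width=7)

print(areas[0][0])  # (0, 0, 66, 14)
print(areas[2][6])  # (132, 84, 200, 100)
```

The step sizes are 66 and 14, so three rows of 66 would only reach pixel 198 and seven columns of 14 only pixel 98. The bottom-right area is therefore stretched to 68 x 16 to cover the remaining two rows and two columns.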

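...
# Take the maximum of each area and stack the result
def pool_area(x):
  return tf.math.reduce_max(region[x[0]:x[2], x[1]:x[3], :], axis=[0,1])

pooled_features = tf.stack([[pool_area(x) for x in row] for row in areas])
return pooled_features

The lines above are quite neat. We define a helper function "pool_area" that takes a bounding box, given by one of the tuples we just created, and returns the maximum of each channel within that area.

We map "pool_area" over every declared area using a list comprehension.

This gives us a tensor of shape (pooled_height, pooled_width, n_channels) holding the pooling result for one RoI of one image.

Next, we pool multiple RoIs of a single image. A helper function makes this straightforward. We use "tf.map_fn" to generate a tensor of shape (n_rois, pooled_height, pooled_width, n_channels).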

@staticmethod
def _pool_rois(feature_map, rois, pooled_height, pooled_width):
  """ Applies ROI pooling for a single image and various ROIs
  """
  def curried_pool_roi(roi):
    return ROIPoolingLayer._pool_roi(feature_map, roi,
                                     pooled_height, pooled_width)

  pooled_areas = tf.map_fn(curried_pool_roi, rois, dtype=tf.float32)
  return pooled_areas

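Finally, we need to implement the batch-level iteration. If we pass a sequence of tensors (such as our input x) to "tf.map_fn", it iterates over them together along the first dimension, so each call receives one image's feature map paired with that image's RoIs.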
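The pairing behaviour can be pictured with a plain-Python analogy. This is not how "tf.map_fn" is implemented; `map_over_batch` and `describe` are hypothetical helpers, and strings stand in for tensors. The sketch only shows which elements end up together in each call:

```python
def map_over_batch(fn, inputs):
    """Plain-Python stand-in for mapping over the leading (batch) axis.

    `inputs` is a list of equally long sequences; `fn` receives one
    tuple per batch index, holding the matching element of every input.
    """
    return [fn(elems) for elems in zip(*inputs)]


# Stand-ins for a batch of two feature maps and their RoIs.
feature_maps = ["fmap_0", "fmap_1"]
roiss = [["roi_0a", "roi_0b"], ["roi_1a", "roi_1b"]]


def describe(x):
    # x[0] is one feature map, x[1] is the list of RoIs for that image.
    return (x[0], len(x[1]))


paired = map_over_batch(describe, [feature_maps, roiss])
print(paired)  # [('fmap_0', 2), ('fmap_1', 2)]
```

This is why the function passed to "tf.map_fn" in the layer can read the feature map as x[0] and the RoIs as x[1].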

def call(self, x):
  """ Maps the input tensor of the ROI layer to its output
  """
  def curried_pool_rois(x):
    return ROIPoolingLayer._pool_rois(x[0], x[1],
                                      self.pooled_height,
                                      self.pooled_width)

  pooled_areas = tf.map_fn(curried_pool_rois, x, dtype=tf.float32)
  return pooled_areas

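Note that we must specify the "dtype" parameter of "tf.map_fn" whenever its expected output does not match the data type of its input. In general, it is a good idea to specify it as often as possible, to make explicit how types change through the TensorFlow computation graph.

Now let us put it all together: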

import tensorflow as tf
from tensorflow.keras.layers import Layer

class ROIPoolingLayer(Layer):
    """ Implements Region Of Interest Max Pooling
        for channel-last images and relative bounding box coordinates

        # Constructor parameters
            pooled_height, pooled_width (int) --
              specify height and width of layer outputs

        Shape of inputs
            [(batch_size, img_height, img_width, n_channels),
             (batch_size, num_rois, 4)]

        Shape of output
            (batch_size, num_rois, pooled_height, pooled_width, n_channels)

    """
    def __init__(self, pooled_height, pooled_width, **kwargs):
        self.pooled_height = pooled_height
        self.pooled_width = pooled_width

        super(ROIPoolingLayer, self).__init__(**kwargs)

    def compute_output_shape(self, input_shape):
        """ Returns the shape of the ROI Layer output
        """
        feature_map_shape, rois_shape = input_shape
        assert feature_map_shape[0] == rois_shape[0]
        batch_size = feature_map_shape[0]
        n_rois = rois_shape[1]
        n_channels = feature_map_shape[3]
        return (batch_size, n_rois, self.pooled_height,
                self.pooled_width, n_channels)

    def call(self, x):
        """ Maps the input tensor of the ROI layer to its output

            # Parameters
                x[0] -- Convolutional feature map tensor,
                        shape (batch_size, img_height, img_width, n_channels)
                x[1] -- Tensor of regions of interest from candidate bounding boxes,
                        shape (batch_size, num_rois, 4)
                        Each region of interest is defined by four relative
                        coordinates (x_min, y_min, x_max, y_max) between 0 and 1
            # Output
                pooled_areas -- Tensor with the pooled region of interest, shape
                    (batch_size, num_rois, pooled_height, pooled_width, n_channels)
        """
        def curried_pool_rois(x):
            return ROIPoolingLayer._pool_rois(x[0], x[1],
                                              self.pooled_height,
                                              self.pooled_width)

        pooled_areas = tf.map_fn(curried_pool_rois, x, dtype=tf.float32)

        return pooled_areas

    @staticmethod
    def _pool_rois(feature_map, rois, pooled_height, pooled_width):
        """ Applies ROI pooling for a single image and various ROIs
        """
        def curried_pool_roi(roi):
            return ROIPoolingLayer._pool_roi(feature_map, roi,
                                             pooled_height, pooled_width)

        pooled_areas = tf.map_fn(curried_pool_roi, rois, dtype=tf.float32)
        return pooled_areas

    @staticmethod
    def _pool_roi(feature_map, roi, pooled_height, pooled_width):
        """ Applies ROI pooling to a single image and a single region of interest
        """

        # Compute the region of interest
        feature_map_height = int(feature_map.shape[0])
        feature_map_width  = int(feature_map.shape[1])

        h_start = tf.cast(feature_map_height * roi[0], 'int32')
        w_start = tf.cast(feature_map_width  * roi[1], 'int32')
        h_end   = tf.cast(feature_map_height * roi[2], 'int32')
        w_end   = tf.cast(feature_map_width  * roi[3], 'int32')

        region = feature_map[h_start:h_end, w_start:w_end, :]

        # Divide the region into non overlapping areas
        region_height = h_end - h_start
        region_width  = w_end - w_start
        h_step = tf.cast(region_height / pooled_height, 'int32')
        w_step = tf.cast(region_width  / pooled_width , 'int32')

        areas = [[(
                    i*h_step,
                    j*w_step,
                    (i+1)*h_step if i+1 < pooled_height else region_height,
                    (j+1)*w_step if j+1 < pooled_width else region_width
                   )
                   for j in range(pooled_width)]
                  for i in range(pooled_height)]

        # Take the maximum of each area and stack the result
        def pool_area(x):
            return tf.math.reduce_max(region[x[0]:x[2], x[1]:x[3], :], axis=[0,1])

        pooled_features = tf.stack([[pool_area(x) for x in row] for row in areas])
        return pooled_features

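Now let us test our implementation! We will use a single-channel image with height and width 200x100 and extract 2 RoIs, each pooled to a 3x7 (height x width) patch. The image can have up to 4 labels for classifying the regions, although the test below does not use them. Every pixel of the example feature map is 1, except for a single pixel at position (height-1, width-3), which is set to 50.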

import numpy as np
# Uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session)
# Define parameters
batch_size = 1
img_height = 200
img_width = 100
n_channels = 1
n_rois = 2
pooled_height = 3
pooled_width = 7
# Create feature map input
feature_maps_shape = (batch_size, img_height, img_width, n_channels)
feature_maps_tf = tf.placeholder(tf.float32, shape=feature_maps_shape)
feature_maps_np = np.ones(feature_maps_shape, dtype='float32')
feature_maps_np[0, img_height-1, img_width-3, 0] = 50
print(f"feature_maps_np.shape = {feature_maps_np.shape}")
# Create batch of RoIs
roiss_tf = tf.placeholder(tf.float32, shape=(batch_size, n_rois, 4))
roiss_np = np.asarray([[[0.5,0.2,0.7,0.4], [0.0,0.0,1.0,1.0]]], dtype='float32')
print(f"roiss_np.shape = {roiss_np.shape}")
# Create layer
roi_layer = ROIPoolingLayer(pooled_height, pooled_width)
pooled_features = roi_layer([feature_maps_tf, roiss_tf])
print(f"output shape of layer call = {pooled_features.shape}")
# Run tensorflow session
with tf.Session() as session:
    result = session.run(pooled_features,
                         feed_dict={feature_maps_tf:feature_maps_np,
                                    roiss_tf:roiss_np})

print(f"result.shape = {result.shape}")
print(f"first  roi embedding=\n{result[0,0,:,:,0]}")
print(f"second roi embedding=\n{result[0,1,:,:,0]}")

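The lines above define a test input for the layer, build the corresponding tensors and run a TensorFlow session so that we can inspect its output.

Running the script produces the following output: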

feature_maps_np.shape = (1, 200, 100, 1)
roiss_np.shape = (1, 2, 4)
output shape of layer call = (1, 2, 3, 7, 1)
result.shape = (1, 2, 3, 7, 1)
first  roi embedding=
[[1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1.]]
second roi embedding=
[[ 1.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1. 50.]]
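The position of the 50 in the second embedding can be checked by hand. The sketch below, in which `cell_index` is a hypothetical helper written only for this check, works out which pooling cell of the full-image RoI contains the pixel at (height-1, width-3):

```python
def cell_index(pixel, step, n_cells):
    """Index of the pooling cell that contains `pixel` along one axis.

    The last cell absorbs leftover pixels, so the index is capped
    at n_cells - 1.
    """
    return min(pixel // step, n_cells - 1)


img_height, img_width = 200, 100
pooled_height, pooled_width = 3, 7
h_step = img_height // pooled_height   # 66
w_step = img_width // pooled_width     # 14

row = cell_index(img_height - 1, h_step, pooled_height)
col = cell_index(img_width - 3, w_step, pooled_width)
print(row, col)  # 2 6
```

Row 2 and column 6 are the bottom-right cell of the 3 x 7 grid, which is where the output shows 50. The first RoI covers rows 100 to 139 and columns 20 to 39, so it never sees that pixel and all of its cells are 1.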