
MLP-Mixer (Theory)

Published on 2024-11-08

TL;DR - This is the first article I am writing to report on my journey studying the MLP-Mixer architecture. It will cover the basics up to an intermediate level. The goal is not to reach an advanced level.

Reference: https://arxiv.org/abs/2105.01601

Introduction

The original paper states:

We propose the MLP-Mixer architecture (or “Mixer” for short), a competitive but conceptually and technically simple alternative, that does not use convolutions or self-attention. Instead, Mixer’s architecture is based entirely on multi-layer perceptrons (MLPs) that are repeatedly applied across either spatial locations or feature channels. Mixer relies only on basic matrix multiplication routines, changes to data layout (reshapes and transpositions), and scalar nonlinearities.

At first, this explanation wasn’t very intuitive to me. With that in mind, I decided to investigate other resources that could provide an easier explanation. In any case, let’s keep in mind that the proposed architecture is shown in the image below.

[Figure: MLP-Mixer architecture]


MLP-Mixer consists of per-patch linear embeddings, Mixer layers, and a classifier head. Mixer layers contain one token-mixing MLP and one channel-mixing MLP, each consisting of two fully-connected layers and a GELU nonlinearity. Other components include: skip-connections, dropout, and layer norm on the channels.


Unlike CNNs, which focus on local image regions through convolutions, and transformers, which use attention to capture relationships between image patches, the MLP-Mixer uses only two types of operations:

  • Patch Mixing (Spatial Mixing)

  • Channel Mixing

Image Patches

In general, from the figure below, we can see that this architecture is a "fancy" classification model, where the output of the fully connected layer is a class. The input for this architecture is an image, which is divided into patches.

In case you're not 100% sure what a patch is: when an image is divided into 'patches,' it means the image is split into smaller, equally sized square (or rectangular) sections, known as patches. Each patch represents a small, localized region of the original image. Instead of processing the entire image as a whole, these smaller patches are analyzed independently or in groups, often for tasks like object detection, classification, or feature extraction. Each patch is then provided to a "Per-patch Fully-connected" layer and subsequently to the "N × (Mixer Layer)" stack.
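To make this concrete, here is a tiny, illustrative sketch in plain NumPy (the 4×4 single-channel "image" and the 2×2 patch size are made-up toy values, not from the paper) that splits an image into equally sized square patches using only reshapes and transpositions:

import numpy as np

image = np.arange(16).reshape(4, 4)   # a toy 4×4 single-channel "image"
p = 2                                  # patch side length

# split into a 2×2 grid of 2×2 patches, then flatten each patch into a vector
patches = (image.reshape(4 // p, p, 4 // p, p)
                .transpose(0, 2, 1, 3)
                .reshape(-1, p * p))
print(patches)
# [[ 0  1  4  5]
#  [ 2  3  6  7]
#  [ 8  9 12 13]
#  [10 11 14 15]]

Each row of the result is one flattened patch; the same reshape/transpose idea scales directly to the 224×224 RGB example discussed below.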

[Figure: MLP-Mixer]

How the MLP-Mixer Processes Patches

For a concrete example, let's take a 224×224 pixel image and divide it into 16×16 pixel patches. Dividing the image side length by the patch side length gives

224 / 16 = 14

So, the 224×224 pixel image will be divided into a grid of 14 patches along the width and 14 patches along the height. This results in a 14×14 grid of patches, for a total of 14 × 14 = 196 patches. After dividing the image into patches, each patch will contain 16 × 16 = 256 pixels. These pixel values can be flattened into a 1D vector of length 256.

It's crucial to note that if the image has multiple channels (like RGB with 3 channels), each patch will actually have 3 × 16 × 16 = 768 values, because each pixel in an RGB image has three color channels.

Thus, for an RGB image:

  • Each patch can be represented as a vector of 768 values

  • We then have 196 patches, each represented by a 768-dimensional vector
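These counts are easy to double-check with a few lines of plain Python (the sizes are just the running example from this section):

image_size = 224     # image height and width in pixels
patch_size = 16      # patch height and width in pixels
channels = 3         # RGB

patches_per_side = image_size // patch_size             # 224 / 16 = 14
num_patches = patches_per_side ** 2                     # 14 * 14 = 196
values_per_patch = channels * patch_size * patch_size   # 3 * 16 * 16 = 768

print(num_patches, values_per_patch)                    # 196 768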

Each patch (a 768-dimensional vector in our example) is projected into a higher-dimensional space using a linear layer (MLP). This essentially gives each patch an embedding that is used as the input to the MLP-Mixer. More specifically, after dividing the image into patches, each patch is processed by the Per-patch Fully-connected layer (depicted in the second row).

Per-patch Fully Connected Layer

Each patch is essentially treated as a vector of values (a 768-dimensional vector in our example, as explained above). These values are passed through a fully connected linear layer, which transforms them into another vector. This is done for each patch independently. The role of this fully connected layer is to map each patch to a higher-dimensional embedding space, similar to how token embeddings are used in transformers.
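Here is a minimal sketch of this step, assuming PyTorch (the hidden width of 1024 is an arbitrary illustrative choice, not a value fixed by the architecture):

import torch
import torch.nn as nn

num_patches, patch_dim, hidden_dim = 196, 768, 1024

per_patch_fc = nn.Linear(patch_dim, hidden_dim)   # one shared linear layer for all patches

patches = torch.randn(num_patches, patch_dim)     # flattened patches, one row per patch
embeddings = per_patch_fc(patches)                # the same layer is applied to every row
print(embeddings.shape)                           # torch.Size([196, 1024])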

N x Mixer Layers

This part shows a set of stacked mixer layers that alternate between two different types of MLP-based operations:

1. Patch Mixing: In this layer, the model mixes information between patches. This means it looks at the relationships between patches in the image (i.e., across different spatial locations). It's achieved through an MLP that treats each patch as a separate entity and computes the interactions between them.

2. Channel Mixing: After patch mixing, the channel mixing layer processes the internal information of each patch independently. It looks at the relationships between different pixel values (or channels) within each patch by applying another MLP.

These mixer layers alternate between patch mixing and channel mixing. They are applied N times, where N is a hyperparameter to configure the number of times the layers are repeated. The goal of these layers is to mix both spatial (patch-wise) and channel-wise information across the entire image.

Global Average Pooling

After passing through several Mixer layers, a global average pooling layer is applied. It averages the activations across all patches, summarizing the information from the whole image into a single vector. This reduces the overall dimensionality and prepares the data for the final classification step, aggregating the learned features from the entire image in a compact way.

Fully Connected Layer for Classification

This layer is responsible for taking the output of the global average pooling and mapping it to a classification label. The fully connected layer takes the averaged features from the global pooling layer and uses them to make a prediction about the class of the image. The number of output units in this layer corresponds to the number of classes in the classification task.
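A minimal sketch of these last two steps, again assuming PyTorch (the 1024-dimensional features and 10 classes are illustrative values):

import torch
import torch.nn as nn

num_patches, hidden_dim, num_classes = 196, 1024, 10

x = torch.randn(num_patches, hidden_dim)    # output of the last Mixer layer, one row per patch

pooled = x.mean(dim=0)                      # global average pooling over all patches
head = nn.Linear(hidden_dim, num_classes)   # fully connected classification layer
logits = head(pooled)                       # one score per class
print(pooled.shape, logits.shape)           # torch.Size([1024]) torch.Size([10])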

Quick Recap Until Now

  1. Input Image: The image is divided into small patches

  2. Per-patch Fully Connected Layer: Each patch is processed independently by a fully connected layer to create patch embeddings

  3. Mixer Layers: The patches are then passed through a series of mixer layers, where information is mixed spatially (between patches) and channel-wise (within patches)

  4. Global Average Pooling: The features from all patches are averaged to summarize the information

  5. Fully Connected Layer: Finally, the averaged features are used to predict the class label of the image

Overview of the Mixer Layer Architecture

As stated before, the Mixer Layer architecture consists of two alternating operations:

  1. Patch Mixing

  2. Channel Mixing

Each mixing step is handled by an MLP, and there are also skip-connections and layer normalization steps included. Let's go step by step through how the data is processed, using the image shown below as a reference:

[Figure: Mixer Layer]

Note that the image itself is not provided as an input to this diagram. The image has already been divided into patches, as explained before, and each patch has been processed by the per-patch fully connected layer (linear embedding). The output of that stage is what enters the Mixer layer. Here is how it goes:

  • Before reaching this layer, the image has already been divided into patches. These patches are flattened and embedded as vectors (after the per-patch fully connected layer)

  • The input to this diagram consists of tokens (one for each patch) and channels (the feature dimensions for each patch)

  • So the input is structured as a 2D tensor with: Patches (one patch per token in the sequence) and Channels (features within each patch)

  • In simple terms, as shown in the image we can think of the input as a matrix where: Each row represents a patch, and each column represents a channel (feature dimension)
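In code, that input is simply a 2D tensor; a tiny illustrative example (assuming PyTorch, with 196 patches and a hypothetical hidden width of 1024):

import torch

num_patches, num_channels = 196, 1024
x = torch.randn(num_patches, num_channels)   # row i = the embedding of patch i
                                             # column j = channel (feature) j across all patches
print(x.shape)                               # torch.Size([196, 1024])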

Processing Inside the Mixer Layer

Now, let's break down the key operations in this Mixer Layer, which processes the patches that come from the previous stage:

1. Layer Normalization

  • The first step is the normalization of the input data to improve training stability. This happens before any mixing occurs.

2. Patch Mixing (First MLP Block)

  • After normalization, the first MLP block is applied. In this operation, the patches are mixed together. The idea here is to capture the relationships between different patches.

  • This operation transposes the input so that it focuses on the patch dimension

  • Then, an MLP is applied along this patch dimension. This allows the model to exchange information between patches and learn how patches relate spatially.

  • Once this mixing is done, the input is transposed back to its original layout, where the channels are in focus again.

3. Skip Connection

  • The architecture uses a skip-connection (a residual connection that lets the input bypass the block and flow directly to its output), which adds the original input (before patch mixing) back to the output of the patch mixing block. This helps avoid degradation of information during training.

4. Layer Normalization

  • Another layer normalization step is applied to the output of the token mixing operation before the next operation (channel mixing) is applied

5. Channel Mixing (Second MLP Block)

  • The second MLP block performs channel mixing. Here, the same MLP is applied to each patch independently (its weights are shared across all patches). The goal is to model relationships between the different channels (features) within each patch.

  • The input is processed along the channels dimension (without mixing information between different patches). The MLP learns to capture dependencies between the various features within each patch

6. Skip Connection (for Channel Mixing)

  • Similar to patch mixing, there's a skip-connection in the channel mixing block as well. This allows the model to retain the original input after the channel mixing operation and helps in stabilizing the learning process.

With the step-by-step explanation above, we can summarize the key points:

  • Patch Mixing - this part processes the relationships between different patches (spatial mixing), allowing the model to understand global spatial patterns across the image

  • Channel Mixing - this part processes the relationships within each patch (channel-wise mixing), learning to capture dependencies between the features (such as pixel intensities or feature maps) within each patch

  • Skip Connections - The skip connections help the network retain the original input and prevent vanishing gradient problems, especially in deep networks
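Here is a compact sketch of one Mixer layer following the steps above, assuming PyTorch (the class name and the MLP widths are illustrative choices rather than the paper's exact configuration; the Linear → GELU → Linear pattern used for both MLPs is the MLP block discussed in the next section):

import torch
import torch.nn as nn

class MixerLayer(nn.Module):
    def __init__(self, num_patches=196, hidden_dim=1024,
                 tokens_mlp_dim=512, channels_mlp_dim=4096):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        # patch (token) mixing MLP: operates along the patch dimension
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, tokens_mlp_dim),
            nn.GELU(),
            nn.Linear(tokens_mlp_dim, num_patches),
        )
        self.norm2 = nn.LayerNorm(hidden_dim)
        # channel mixing MLP: operates along the channel dimension, per patch
        self.channel_mlp = nn.Sequential(
            nn.Linear(hidden_dim, channels_mlp_dim),
            nn.GELU(),
            nn.Linear(channels_mlp_dim, hidden_dim),
        )

    def forward(self, x):                      # x: (num_patches, hidden_dim)
        # steps 1-3: layer norm, patch mixing (via transposes), skip connection
        y = self.norm1(x).transpose(0, 1)      # (hidden_dim, num_patches)
        y = self.token_mlp(y).transpose(0, 1)  # back to (num_patches, hidden_dim)
        x = x + y
        # steps 4-6: layer norm, channel mixing, skip connection
        x = x + self.channel_mlp(self.norm2(x))
        return x

x = torch.randn(196, 1024)        # embedded patches entering the layer
print(MixerLayer()(x).shape)      # torch.Size([196, 1024]) -- the shape is preserved

Because the output shape matches the input shape, N of these layers can simply be stacked one after another, which is exactly what the full architecture does.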

Inside the MLP Block

The MLP block used in the MLP-Mixer architecture can be seen in the image below:

[Figure: MLP block architecture]

1. Fully Connected Layer

  • The input (whether it's patch embeddings or channels, depending on the context) first passes through a fully connected layer. This layer performs a linear transformation of the input, meaning it multiplies the input by a weight matrix and adds a bias term.

2. GELU (Gaussian Error Linear Unit) Activation

  • After the first fully connected layer, a GELU activation function is applied. GELU is a non-linear activation function that is smoother than the ReLU. It allows for a more fine-grained activation, as it approximates the behavior of a normal distribution. The formula for GELU is given by
GELU(x) = x · Φ(x)

where Φ(x) is the cumulative distribution function of the standard Gaussian distribution. After the GELU activation, a second fully connected layer follows, completing the two-layer MLP block. The choice of GELU in MLP-Mixer is intended to improve the model's ability to handle non-linearities in the data, which helps the model learn more complex patterns.
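As a quick numerical sanity check of this formula, assuming PyTorch, GELU(x) = x · Φ(x) can be computed directly by writing Φ via the error function and comparing it with the built-in activation:

import math
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

# exact GELU: x * Phi(x), where Phi is the standard normal CDF written via erf
phi = 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
gelu_manual = x * phi

print(torch.allclose(gelu_manual, F.gelu(x), atol=1e-6))   # True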

How it works in context:

  • In the Token Mixing MLP, this block is applied across the patch (token) dimension, mixing information across different patches

  • In the Channel Mixing MLP, the same structure is applied across the channels dimension, mixing the information within each patch independently.

Intuition - The MLP block acts as the core computational unit within the MLP-Mixer. By using two fully connected layers with an activation function in between, it learns to capture both linear and non-linear relationships within the data, depending on where it is applied (either for mixing patches or channels).
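A minimal sketch of this block, assuming PyTorch (the class name MlpBlock and the widths below are illustrative; the same Linear → GELU → Linear structure was used for both MLPs in the Mixer-layer sketch earlier):

import torch
import torch.nn as nn

class MlpBlock(nn.Module):
    """Fully connected -> GELU -> fully connected, applied along the last dimension."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

x = torch.randn(196, 1024)                  # (patches, channels)

# channel mixing: the block acts on the channel dimension, independently for each patch
channel_mlp = MlpBlock(dim=1024, hidden_dim=4096)
print(channel_mlp(x).shape)                 # torch.Size([196, 1024])

# patch (token) mixing: transpose first so the block acts on the patch dimension instead
token_mlp = MlpBlock(dim=196, hidden_dim=512)
print(token_mlp(x.transpose(0, 1)).transpose(0, 1).shape)   # torch.Size([196, 1024])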

Conclusion

The MLP-Mixer offers a unique approach to image classification by leveraging simple multilayer perceptrons (MLPs) for both spatial and channel-wise interactions. By dividing an image into patches and alternating between patch mixing and channel mixing layers, the MLP-Mixer efficiently captures global spatial dependencies and local pixel relationships without relying on traditional convolutional operations. This architecture provides a streamlined yet powerful method for extracting meaningful patterns from images, demonstrating that, even with minimal reliance on complex operations, impressive performance can be achieved in deep learning tasks. It is important to note, however, that this architecture can be applied to different fields, such as time series predictions, even though it was initially proposed for vision tasks.


If you've made it this far, I want to express my gratitude. I hope this has been helpful to someone other than just me!

✧⁺⸜(^-^)⸝⁺✧

Originally published at: https://dev.to/igor1740/mlp-mixer-theory-2dje