」工欲善其事,必先利其器。「—孔子《論語.錄靈公》
首頁 > 程式設計 > MLP-混合器(理論)

MLP-混合器(理論)

發佈於2024-11-08
瀏覽:985

TL;DR - This is the first article I am writing to report on my journey studying the MPL-Mixer architecture. It will cover the basics up to an intermediate level. The goal is not to reach an advanced level.

Reference: https://arxiv.org/abs/2105.01601

Introduction

From the original paper it's stated that

We propose the MLP-Mixer architecture (or “Mixer” for short), a competitive but conceptually and technically simple alternative, that does not use convolutions or self-attention. Instead, Mixer’s architecture is based entirely on multi-layer perceptrons (MLPs) that are repeatedly applied across either spatial locations or feature channels. Mixer relies only on basic matrix multiplication routines, changes to data layout (reshapes and transpositions), and scalar nonlinearities.

At first, this explanation wasn’t very intuitive to me. With that in mind, I decided to investigate other resources that could provide an easier explanation. In any case, let’s keep in mind that the proposed architecture is shown in the image below.

MLP-Mixer (Theory)


MLP-Mixer consists of per-patch linear embeddings, Mixer layers, and a classifier head. Mixer layers contain one token-mixing MLP and one channel-mixing MLP, each consisting of two fully-connected layers and a GELU nonlinearity. Other components include: skip-connections, dropout, and layer norm on the channels.

MLP-Mixer consists of per-patch linear embeddings, Mixer layers, and a classifier head. Mixer layers contain one token-mixing MLP and one channel-mixing MLP, each consisting of two fully-connected layers and a GELU nonlinearity. Other components include: skip-connections, dropout, and layer norm on the channels.

Unlike CNNs, which focus on local image regions through convolutions, and transformers, which use attention to capture relationships between image patches, the MLP-Mixer uses only two types of operations:

  • Patch Mixing (Spatial Mixing)

  • Channel Mixing

Image Patches

In general, from the figure below, we can see that this architecture is a "fancy" classification model, where the output of the fully connected layer is a class. The input for this architecture is an image, which is divided into patches.

In case you're not 100% sure what a patch is, when an image is divided into 'patches,' it means the image is split into smaller, equally sized square (or rectangular) sections, known as patches. Each patch represents a small, localized region of the original image. Instead of processing the entire image as a whole, these smaller patches are analyzed independently or in groups, often for tasks like object detection, classification, or feature extraction. Each patch is then provided to a "Per-patch Fully-connected" layer and subsequently a " N×N \times (Mixer Layer)."

MLP-Mixer (Theory)
MLP-Mixer

How the MLP-Mixer Processes Patches

For a clear example, let's take a 224x224 pixel image which can be divided into 16x16 pixel patches, when diving the image we'll have

22416=14 \frac{224}{16} = 14 16224=14

So, the 224x224 pixel image will be divided into a grid of 14 patches along the width and 14 patched along the height. This results in a 14x14 grid of patches. The total number of patches is then calculated by 14×14=19614 \times 14 = 196 14×14=196 patches. After diving the image into patches, each patch will contain 16×16=25616 \times 16 = 256 16×16=256 pixels. These pixel values can be flattened into a 1D vector of length 256.

It's crucial to note that if the image has multiple channels (like RGB with 3 channels), each patch will actually have 3×16×16=7683 \times 16 \times 16 = 768 16×16=768 values because each pixel in a RGB image has three color channels.

Thus, for an RGB image:

  • Each patch can be represented as a vector of 768 values

  • We then have 196 patches, each represented by a 768-dimensional vector

Each patch (a 768-dimensional vector in our example) is projected into a higher-dimensional space using a linear layer (MLP). This essentially gives each patch an embedding that is used as the input to the MLP-Mixer. More specifically, after dividing the image into patches, each patch is processed by the Per-patch Fully-connected layer (depicted in the second row).

Per-patch Fully Connected Layer

Each patch is essentially treated as a vector of values, as we have explained before, a 768-dimensional vector in our example. These values are passed through a fully connected linear layer, which transforms them into another vector. This is done for each patch independently. The role of this fully connected layer is to map each patch to a higher-dimensional embedding space, similar to how token embeddings are used in transformers.

N x Mixer Layers

This part shows a set of stacked mixer layers that alternate between two different types of MLP-based operations:

1. Patch Mixing: In this layer, the model mixes information between patches. This means it looks at the relationships between patches in the image (i.e., across different spatial locations). It's achieved through an MLP that treats each patch as a separate entity and computes the interactions between them.

2. Channel Mixing: After patch mixing, the channel mixing layer processes the internal information of each patch independently. It looks at the relationships between different pixel values (or channels) within each patch by applying another MLP.

These mixer layers alternate between patch mixing and channel mixing. They are applied N times, where N is a hyperparameter to configure the number of times the layers are repeated. The goal of these layers is to mix both spatial (patch-wise) and channel-wise information across the entire image.

Global Average Pooling

After passing through several mixer layers, a global average pooling layer is applied. This layer helps reduce the dimensionality of the output by averaging the activations across all patches. The global average pooling layer computes the average value across all patches, essentially summarizing the information from all patches into a single vector. This reduces the overall dimensionality and prepares the data for the final classification step. Additionally, it helps aggregate the learned features from the entire image in a more compact way.

Fully Connected Layer for Classification

This layer is responsible for taking the output of the global average pooling and mapping it to a classification label. The fully connected layer takes the averaged features from the global pooling layer and uses them to make a prediction about the class of the image. The number of output units in this layer corresponds to the number of classes in the classification task.

Quick Recap Until Now

  1. Input Image: the image is divided into small patches

  2. Per-patch Fully Connected Layer: Each patch is processed independently by a fully connected layer to create patch embeddings

  3. Mixer Layers: The patches are then passed through a series of mixer layers, where information is mixed spatially (between patches) and channel-wise (within patches)

  4. Global Average Pooling: The features from all patches are averaged to summarize the information

  5. Fully Connected Layer: Finally, the averaged features are used to predict the class label of the image

Overview of the Mixer Layer Architecture

As stated before, the Mixer Layer architecture consists of two alternating operations:

  1. Patch Mixing

  2. Channel Mixing

Each mixing state is handled by a MLP, and there are also skip-connections and layer normalization steps included. Let's go step by step through how an image is processed using as a reference the image shown below:

MLP-Mixer (Theory)
Mixer Layer

Note, that the image itself is not provided as an input to this diagram. Actually, the image has already been divided into patches as explained before and each patch is processed by a per-patch fully connected layer (linear embedding). The output of that state is what enters the Mixer Layer. Here is how it goes:

  • Before reaching this layer, the image has already been divided into patches. These patches are flattened and embedded as vectors (after the per-patch fully connected layer)

  • The input to this diagram consists of tokens (one for each patch) and channels (the feature dimensions for each patch)

  • So the input is structured as a 2D tensor with: Patches (one patch per token in the sequence) and Channels (features within each patch)

  • In simple terms, as shown in the image we can think of the input as a matrix where: Each row represents a patch, and each column represents a channel (feature dimension)

Processing Inside the Mixer Layer

Now, let's break down the key operations in this Mixer Layer, which processes the patches that come from the previous stage:

1. Layer Normalization

  • The first step is the normalization of the input data to improve training stability. This happens before any mixing occurs.

2. Patch Mixing (First MLP Block)

  • After normalization, the first MLP block is applied. In this operations, the patches are mixed together. The idea here is to capture the relationships between different patches.

  • This operation transposes the input so that it focuses on the patches dimension

  • Then, a MLP is applied along this patches dimension. This is done to allow the model to exchange information between patches and learn how patches relate spatially.

  • Once this mixing is done, the input is transposed back to its original format, where channels are in the focus again.

3. Skip Connection

  • The architecture uses a skip-connection (bypass residual layers, allowing any layer to flow directly to any subsequent layer) that adds the original input (before patch mixing) back to the output of the patch mixing block. This helps avoid degradation of information during training.

4. Layer Normalization

  • Another layer normalization step is applied to the output of the token mixing operation before the next operation (channel mixing) is applied

5. Channel Mixing (Second MLP Block)

  • The second MLP block performs channel mixing. Here, each patch is processed independently by a separate MLP. The goal is to model relationships between the different channels (features) within each patch.

  • The input is processed along the channels dimension (without mixing information between different patches). The MLP learns to capture dependencies between the various features within each patch

6. Skip Connection (for Channel Mixing)

  • Similar to patch mixing, there's a skip-connection in the channel mixing block as well. This allows the model to retain the original input after the channel mixing operation and helps in stabilizing the learning process.

With the step by step explanation above we can summarize it with some key points:

  • Patch Mixing - this part processes the relationships between different patches (spatial mixing), allowing the model to understand global spatial patterns across the image

  • Channel Mixing - this part processes the relationships within each patch (channel-wise mixing), learning to capture dependencies between the features (such as pixel intensities or features maps) within each patch

  • Skip Connections - The skip connections help the network retain the original input and prevent vanishing gradient problems, especially in deep networks

Inside the MLP Block

The MLP block used in the MLP-Mixer architecture can be seen in the image below:

MLP-Mixer (Theory)
MLP Architecture

1. Fully Connected Layer

  • The input (whether it's patch embeddings or channels, depending on the context) first passes through a fully connected layer. This layer performs a linear transformation of the input, meaning it multiplies the input by a weight matrix and adds a bias term.

2. GELU (Gaussian Error Linear Unit) Activation

  • After the first fully connected layer, a GELU activation function is applied. GELU is a non-linear activation function that is smoother than the ReLU. It allows for a more fine-grained activation, as it approximates the behavior of a normal distribution. The formula for GELU is given by
GELU(x)=xΦ(x) GELU(x) = x \cdot \Phi(x) GELU(x)=x⋅Φ(x)

where Φ(x)\Phi(x) Φ(x) is the cumulative distribution function of a standard Gaussian distribution. The choice of GELU in MLP-Mixer is intended to improve the model’s ability to handle non-linearities in the data, which helps the model learn more complex patterns.

How it works in context:

  • In the Token Mixing MLP, this block is applied across the patch (token) dimension, mixing information across different patches

  • In the Channel Mixing MLP, the same structure is applied across the channels dimension, mixing the information within each patch independently.

Intuition - The MLP block acts as the core computational unit within the MLP-Mixer. By using two fully connected layers with an activation function in between, it learns to capture both linear and non-linear relationships within the data, depending on where it is applied (either for mixing patches or channels).

Conclusion

The MLP-Mixer offers a unique approach to image classification by leveraging simple multilayer perceptrons (MLPs) for both spatial and channel-wise interactions. By dividing an image into patches and alternating between patch mixing and channel mixing layers, the MLP-Mixer efficiently captures global spatial dependencies and local pixel relationships without relying on traditional convolutional operations. This architecture provides a streamlined yet powerful method for extracting meaningful patterns from images, demonstrating that, even with minimal reliance on complex operations, impressive performance can be achieved in deep learning tasks. It is important to note, however, that this architecture can be applied to different fields, such as time series predictions, even though it was initially proposed for vision tasks.


If you've made it this far, I want to express my gratitude. I hope this has been helpful to someone other than just me!

✧⁺⸜(^-^)⸝⁺✧

版本聲明 本文轉載於:https://dev.to/igor1740/mlp-mixer-theory-2dje?1如有侵犯,請洽[email protected]刪除
最新教學 更多>
  • 如何使用PHP從XML文件中有效地檢索屬性值?
    如何使用PHP從XML文件中有效地檢索屬性值?
    從php $xml = simplexml_load_file($file); foreach ($xml->Var[0]->attributes() as $attributeName => $attributeValue) { echo $attributeName,...
    程式設計 發佈於2025-03-20
  • 如何在JavaScript對像中動態設置鍵?
    如何在JavaScript對像中動態設置鍵?
    在嘗試為JavaScript對象創建動態鍵時,如何使用此Syntax jsObj['key' i] = 'example' 1;不工作。正確的方法採用方括號: jsobj ['key''i] ='example'1; 在JavaScript中,數組是一...
    程式設計 發佈於2025-03-20
  • 為什麼我的CSS背景圖像出現?
    為什麼我的CSS背景圖像出現?
    故障排除:CSS背景圖像未出現 ,您的背景圖像儘管遵循教程說明,但您的背景圖像仍未加載。圖像和样式表位於相同的目錄中,但背景仍然是空白的白色帆布。 而不是不棄用的,您已經使用了CSS樣式: bockent {背景:封閉圖像文件名:背景圖:url(nickcage.jpg); 如果您的html,cs...
    程式設計 發佈於2025-03-20
  • 大批
    大批
    [2 數組是對象,因此它們在JS中也具有方法。 切片(開始):在新數組中提取部分數組,而無需突變原始數組。 令ARR = ['a','b','c','d','e']; // USECASE:提取直到索引作...
    程式設計 發佈於2025-03-20
  • 如何使用組在MySQL中旋轉數據?
    如何使用組在MySQL中旋轉數據?
    在關係數據庫中使用mySQL組使用mySQL組進行查詢結果,在關係數據庫中使用MySQL組,轉移數據的數據是指重新排列的行和列的重排以增強數據可視化。在這裡,我們面對一個共同的挑戰:使用組的組將數據從基於行的基於列的轉換為基於列。 Let's consider the following ...
    程式設計 發佈於2025-03-20
  • 為什麼使用Firefox後退按鈕時JavaScript執行停止?
    為什麼使用Firefox後退按鈕時JavaScript執行停止?
    導航歷史記錄問題:JavaScript使用Firefox Back Back 此行為是由瀏覽器緩存JavaScript資源引起的。要解決此問題並確保在後續頁面訪問中執行腳本,Firefox用戶應設置一個空功能。 警報'); }; alert('inline Alert')...
    程式設計 發佈於2025-03-20
  • Python讀取CSV文件UnicodeDecodeError終極解決方法
    Python讀取CSV文件UnicodeDecodeError終極解決方法
    在試圖使用已內置的CSV模塊讀取Python中時,CSV文件中的Unicode Decode Decode Decode Decode decode Error讀取,您可能會遇到錯誤的錯誤:無法解碼字節 在位置2-3中:截斷\ uxxxxxxxx逃脫當CSV文件包含特殊字符或Unicode的路徑逃...
    程式設計 發佈於2025-03-20
  • 如何在Java字符串中有效替換多個子字符串?
    如何在Java字符串中有效替換多個子字符串?
    在java 中有效地替換多個substring,需要在需要替換一個字符串中的多個substring的情況下,很容易求助於重複應用字符串的刺激力量。 However, this can be inefficient for large strings or when working with nu...
    程式設計 發佈於2025-03-20
  • HTML格式標籤
    HTML格式標籤
    HTML 格式化元素 **HTML Formatting is a process of formatting text for better look and feel. HTML provides us ability to format text without us...
    程式設計 發佈於2025-03-20
  • 如何為PostgreSQL中的每個唯一標識符有效地檢索最後一行?
    如何為PostgreSQL中的每個唯一標識符有效地檢索最後一行?
    postgresql:為每個唯一標識符在postgresql中提取最後一行,您可能需要遇到與數據集合中每個不同標識的信息相關的信息。考慮以下數據:[ 1 2014-02-01 kjkj 在數據集中的每個唯一ID中檢索最後一行的信息,您可以在操作員上使用Postgres的有效效率: id dat...
    程式設計 發佈於2025-03-20
  • 可以在純CS中將多個粘性元素彼此堆疊在一起嗎?
    可以在純CS中將多個粘性元素彼此堆疊在一起嗎?
    [2这里: https://webthemez.com/demo/sticky-multi-header-scroll/index.html </main> <section> { display:grid; grid-template-...
    程式設計 發佈於2025-03-20
  • 如何使用FormData()處理多個文件上傳?
    如何使用FormData()處理多個文件上傳?
    )處理多個文件輸入時,通常需要處理多個文件上傳時,通常是必要的。 The fd.append("fileToUpload[]", files[x]); method can be used for this purpose, allowing you to send multi...
    程式設計 發佈於2025-03-20
  • 如何使用Regex在PHP中有效地提取括號內的文本
    如何使用Regex在PHP中有效地提取括號內的文本
    php:在括號內提取文本在處理括號內的文本時,找到最有效的解決方案是必不可少的。一種方法是利用PHP的字符串操作函數,如下所示: 作為替代 $ text ='忽略除此之外的一切(text)'; preg_match('#((。 &&& [Regex使用模式來搜索特...
    程式設計 發佈於2025-03-20
  • 如何從PHP中的數組中提取隨機元素?
    如何從PHP中的數組中提取隨機元素?
    從陣列中的隨機選擇,可以輕鬆從數組中獲取隨機項目。考慮以下數組:; 從此數組中檢索一個隨機項目,利用array_rand( array_rand()函數從數組返回一個隨機鍵。通過將$項目數組索引使用此鍵,我們可以從數組中訪問一個隨機元素。這種方法為選擇隨機項目提供了一種直接且可靠的方法。
    程式設計 發佈於2025-03-20
  • 如何限制動態大小的父元素中元素的滾動範圍?
    如何限制動態大小的父元素中元素的滾動範圍?
    在交互式接口中實現垂直滾動元素的CSS高度限制問題:考慮一個佈局,其中我們具有與用戶垂直滾動一起移動的可滾動地圖div,同時與固定的固定sidebar保持一致。但是,地圖的滾動無限期擴展,超過了視口的高度,阻止用戶訪問頁面頁腳。 $("#map").css({ margin...
    程式設計 發佈於2025-03-20

免責聲明: 提供的所有資源部分來自互聯網,如果有侵犯您的版權或其他權益,請說明詳細緣由並提供版權或權益證明然後發到郵箱:[email protected] 我們會在第一時間內為您處理。

Copyright© 2022 湘ICP备2022001581号-3