Migrating from MySQL to PostgreSQL

Migrating a database from MySQL to Postgres is a challenging process.

While MySQL and Postgres do a similar job, there are fundamental differences between them, and those differences can create issues that must be addressed for the migration to succeed.

Where to start?

pgloader is a tool that can move your data to PostgreSQL. It's not perfect, but it can work well in some cases, and it's worth evaluating to see whether it's the direction you want to go.

Another approach to take is to create custom scripts.

Custom scripts offer greater flexibility and scope to address issues specific to your dataset.

For this article, custom scripts were built to handle the migration process.

Exporting the data

How the data is exported is critical to a smooth migration. Using mysqldump in its default setup will lead to a more difficult process.

Use the --compatible=ansi option so mysqldump exports the data in a format closer to what PostgreSQL expects.

To make the migration easier to handle, split up the schema and data dumps so they can be processed separately. The processing requirements for each file are very different and creating a script for each will make it more manageable.
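As a sketch of what that might look like (database name and credentials are placeholders), mysqldump can produce the two files separately:

mysqldump -u user -p --compatible=ansi --no-data my_database > schema.sql
mysqldump -u user -p --compatible=ansi --no-create-info my_database > data.sql

--no-data dumps only the table definitions, while --no-create-info dumps only the INSERT statements.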

Schema differences

Data Types

There are differences in which data types are available in MySQL and PostgreSQL, which means that when processing your schema you will need to decide which field types work best for your data.

| Category | MySQL | PostgreSQL |
|---|---|---|
| Numeric | INT, TINYINT, SMALLINT, MEDIUMINT, BIGINT, FLOAT, DOUBLE, DECIMAL | INTEGER, SMALLINT, BIGINT, NUMERIC, REAL, DOUBLE PRECISION, SERIAL, SMALLSERIAL, BIGSERIAL |
| String | CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT | CHAR, VARCHAR, TEXT |
| Date and Time | DATE, TIME, DATETIME, TIMESTAMP, YEAR | DATE, TIME, TIMESTAMP, INTERVAL, TIMESTAMPTZ |
| Binary | BINARY, VARBINARY, TINYBLOB, BLOB, MEDIUMBLOB, LONGBLOB | BYTEA |
| Boolean | BOOLEAN (TINYINT(1)) | BOOLEAN |
| Enum and Set | ENUM, SET | ENUM (no SET equivalent) |
| JSON | JSON | JSON, JSONB |
| Geometric | GEOMETRY, POINT, LINESTRING, POLYGON | POINT, LINE, LSEG, BOX, PATH, POLYGON, CIRCLE |
| Network Address | No built-in types | CIDR, INET, MACADDR |
| UUID | No built-in type (can use CHAR(36)) | UUID |
| Array | No built-in support | Arrays of any data type |
| XML | No built-in type | XML |
| Range Types | No built-in support | int4range, int8range, numrange, tsrange, tstzrange, daterange |
| Composite Types | No built-in support | User-defined composite types |

Tinyint field type

Tinyint doesn't exist in PostgreSQL. You have the choice of replacing it with smallint or boolean. Choose the data type that best matches the current dataset.

 $line =~ s/\btinyint(?:\(\d+\))?/smallint/gi;
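If a tinyint(1) column is really a flag, boolean may be the better fit. A minimal sketch of that variant, which must run before the smallint rule so it wins for tinyint(1):

 $line =~ s/\btinyint\(1\)/boolean/gi;

Note that the data side then needs its 0/1 values quoted, since PostgreSQL accepts '0' and '1' but not bare integers as boolean input.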

Enum Field type

Enum fields are a little more complex: while enums exist in PostgreSQL, they require creating custom types.

To avoid duplicating custom types, it is better to plan out what enum types are required and create the minimum number of custom types needed for your schema. Custom types are not table specific; one custom type can be used across multiple tables.

CREATE TYPE color_enum AS ENUM ('blue', 'green');

...
"shirt_color" color_enum NOT NULL DEFAULT 'blue',
"pant_color" color_enum NOT NULL DEFAULT 'green',
...

The creation of the types would need to be done before the SQL is imported. The script could then be adjusted to use the custom types that have been created.

If there are multiple fields using enum('blue','green'), these should all be using the same enum custom type. Creating custom types for each individual field would not be good database design.

if ( $line =~ /"([^"]+)"\s+enum\(([^)]+)\)/ ) {
    my $column_name = $1;
    my $enum_values = $2;
    if ( $enum_values !~ /''/ ) {
        # MySQL enums can hold '' (invalid values in non-strict mode),
        # so make sure every custom type accepts it too
        $enum_values .= ",''";
    }

    my @items = $enum_values =~ /'([^']*)'/g;

    my $sorted_enum_values = join( ',', sort @items );

    my $enum_type_name;
    if ( exists $enum_types{$sorted_enum_values} ) {
        $enum_type_name = $enum_types{$sorted_enum_values};
    }
    else {
        $enum_type_name = create_enum_type_name($sorted_enum_values);
        $enum_types{$sorted_enum_values} = $enum_type_name;

        # Add CREATE TYPE statement to post-processing
        push @enum_lines,
        "CREATE TYPE $enum_type_name AS ENUM ($enum_values);\n";
    }

    # Replace the line with the new ENUM type
    $line =~ s/enum\([^)]+\)/$enum_type_name/;
}
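The create_enum_type_name helper isn't shown in the original script. A minimal sketch that derives a deterministic name from the sorted values might look like this (the naming scheme is an assumption; any unique, valid identifier works):

sub create_enum_type_name {
    my ($sorted_values) = @_;

    # Build an identifier from the values themselves, e.g. "blue,green" -> enum_blue_green
    my $name = lc $sorted_values;
    $name =~ s/[^a-z0-9]+/_/g;        # squash quotes, commas, and spaces
    $name =~ s/^_+|_+$//g;            # trim leading/trailing underscores

    # Keep well under PostgreSQL's 63-byte identifier limit;
    # very long value lists could collide here and would need a suffix
    $name = substr( $name, 0, 50 );

    return "enum_$name";
}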

Indexes

There are differences in how indexes are created. The dump contains two variations: indexes with a column prefix length and indexes without one. Both of these need to be handled, removed from the SQL, and put into a separate SQL file to be run after the import is complete (run_after.sql).

if ($line =~ /^\s*KEY\s+/i) {
    if ($line =~ /KEY\s+"([^"]+)"\s+\("([^"]+)"\)/) {
        my $index_name = $1;
        my $column_name = $2;
        push @post_process_lines, "CREATE INDEX idx_${current_table}_$index_name ON \"$current_table\" (\"$column_name\");\n";
    } elsif ($line =~ /KEY\s+"([^"]+)"\s+\("([^"]+)"\((\d+)\)\)/i) {
        my $index_name = $1;
        my $column_name = $2;
        my $prefix_length = $3;
        push @post_process_lines, "CREATE INDEX idx_${current_table}_$index_name ON \"$current_table\" (LEFT(\"$column_name\", $prefix_length));\n";
    }
    next;
}

Full text indexes work quite differently in PostgreSQL. To create a full text index, the data must first be converted into a vector.

The vector can then be indexed. There are two index types to choose from when indexing vectors: GIN and GiST. Both have pros and cons, but generally GIN is preferred over GiST: while GIN is slower to build, it's faster for lookups.

if ( $line =~ /^\s*FULLTEXT\s+KEY\s+"([^"]+)"\s+\("([^"]+)"\)/i ) {
    my $index_name  = $1;
    my $column_name = $2;
    push @post_process_lines,
    "CREATE INDEX idx_fts_${current_table}_$index_name ON \"$current_table\" USING GIN (to_tsvector('english', \"$column_name\"));\n";
    next;
}
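A query only benefits from this index when it uses the same expression as the index definition. A hedged example (my_table and my_column are placeholders):

SELECT * FROM "my_table"
WHERE to_tsvector('english', "my_column") @@ to_tsquery('english', 'searchterm');

If the to_tsvector expression in the query doesn't match the one in the index, the planner falls back to a sequential scan.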

Auto increment

PostgreSQL doesn't support MySQL's AUTO_INCREMENT keyword; instead it uses GENERATED ALWAYS AS IDENTITY.

There is a catch with using GENERATED ALWAYS AS IDENTITY while importing data: it is not designed for importing IDs. When inserting a row into the table, the ID field cannot be specified; the ID value will be auto-generated, and trying to insert your own IDs will produce an error.

To work around this issue, the ID field can be set to the SERIAL type instead of int GENERATED ALWAYS AS IDENTITY. SERIAL is much more flexible for imports, but it is not recommended to leave the field as SERIAL afterwards.

An alternative to using this method would be to add OVERRIDING SYSTEM VALUE into the insert query.

INSERT INTO table (id, name)
OVERRIDING SYSTEM VALUE
VALUES (100, 'A Name');

If you use SERIAL, some queries will need to be written into run_after.sql to change the SERIAL to GENERATED ALWAYS AS IDENTITY and reset the internal counter after the schema is created and the data has been inserted.

if ( $line =~ /^\s*"(\w+)"\s+(int|bigint)\s+NOT\s+NULL\s+AUTO_INCREMENT\s*,/i ) {
    my $column_name = $1;
    $line =~ s/^\s*"$column_name"\s+(int|bigint)\s+NOT\s+NULL\s+AUTO_INCREMENT\s*,/"$column_name" SERIAL,/;

    push @post_process_lines, "ALTER TABLE \"$current_table\" ALTER COLUMN \"$column_name\" DROP DEFAULT;\n";

    push @post_process_lines, "DROP SEQUENCE ${current_table}_${column_name}_seq;\n";

    push @post_process_lines, "ALTER TABLE \"$current_table\" ALTER COLUMN \"$column_name\" ADD GENERATED ALWAYS AS IDENTITY;\n";

    push @post_process_lines, "SELECT setval('${current_table}_${column_name}_seq', (SELECT COALESCE(MAX(\"$column_name\"), 1) FROM \"$current_table\"));\n\n";

}

Schema results

Original schema after exporting from MySQL

DROP TABLE IF EXISTS "address_book";
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE "address_book" (
  "id" int NOT NULL AUTO_INCREMENT,
  "user_id" varchar(50) NOT NULL,
  "common_name" varchar(50) NOT NULL,
  "display_name" varchar(50) NOT NULL,
  PRIMARY KEY ("id"),
  KEY "user_id" ("user_id")
);

Processed main SQL file

DROP TABLE IF EXISTS "address_book";
CREATE TABLE "address_book" (
  "id" SERIAL,
  "user_id" varchar(85) NOT NULL,
  "common_name" varchar(85) NOT NULL,
  "display_name" varchar(85) NOT NULL,
  PRIMARY KEY ("id")
);

Run_after.sql

ALTER TABLE "address_book" ALTER COLUMN "id" DROP DEFAULT;
DROP SEQUENCE address_book_id_seq;
ALTER TABLE "address_book" ALTER COLUMN "id" ADD GENERATED ALWAYS AS IDENTITY;
SELECT setval('address_book_id_seq', (SELECT COALESCE(MAX("id"), 1) FROM "address_book"));
CREATE INDEX idx_address_book_user_id ON "address_book" ("user_id");

It's worth noting the index naming convention used in the migration: the index name includes both the table name and the field name. Index names have to be unique, not just within the table the index was added to but across the whole schema, so adding the table name and the column name reduces the chance of duplicates in your script.

Data processing

The biggest hurdle in migrating your database is getting the data into a format PostgreSQL accepts. There are some differences in how PostgreSQL stores data that require extra attention.

Character sets

The dataset used for this article predated utf8mb4 and uses the old default of Latin1. That charset is not compatible with PostgreSQL's default charset, UTF8, and it should be noted that PostgreSQL's UTF8 also differs from MySQL's utf8mb4.

The issue with migrating from Latin1 to UTF8 is how the data is stored. In Latin1 each character is a single byte, while in UTF8 the characters can be multibyte, up to 4 bytes.

An example of this is the word café: in Latin1 it is stored as 4 bytes, in UTF8 as 5 bytes. During character set migration this change in byte length is taken into account and can push values past a column's limit; rather than silently truncating the data, PostgreSQL raises an error.

To avoid truncation, increase the length of affected varchar fields, as in the varchar(50) to varchar(85) change in the schema above.
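The padding code isn't shown in the original. A minimal sketch, assuming a flat 1.7x growth factor (which matches the varchar(50) to varchar(85) change above):

 $line =~ s/\bvarchar\((\d+)\)/'varchar(' . int($1 * 1.7) . ')'/gie;

The right factor depends on your data; auditing the actual byte lengths in the source tables would give a tighter bound.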

It's worth noting that this same truncation issue could occur if you were changing character sets within MySQL.

Character Escaping

It's not uncommon to see backslash-escaped single quotes in a MySQL dump.

However, PostgreSQL doesn't support this by default. Instead, the ANSI SQL standard method of using double single quotes is used.

If a varchar field contains It\'s, it would need to be changed to It''s.

 $line =~ s/\\'/''/g;
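MySQL dumps also backslash-escape double quotes and backslashes themselves, so depending on your data a few more substitutions may be needed. A sketch (the ordering matters, so escaped backslashes are stashed first):

 $line =~ s/\\\\/\x{1}/g;   # stash \\ behind a sentinel byte
 $line =~ s/\\'/''/g;       # \'  becomes ''
 $line =~ s/\\"/"/g;        # \"  needs no escaping inside single-quoted strings
 $line =~ s/\x{1}/\\/g;     # restore each stashed \\ as a single literal backslash

This assumes the sentinel byte (0x01) never appears in the data.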

Table Locking

In MySQL dumps there are table locking calls before each table's inserts.

LOCK TABLES "address_book" WRITE;

Generally it is unnecessary to manually lock a table in PostgreSQL.

PostgreSQL handles transactions using Multi-Version Concurrency Control (MVCC). When a row is updated, a new version is created; once the old version is no longer in use, it is removed. This means table locking is often not needed. PostgreSQL will take locks alongside MVCC where required, and manually setting locks can actually hurt concurrency.

For this reason, removing the manual locks from the SQL dump and letting PostgreSQL handle the locks as needed is the better choice.
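In the processing script this can be as simple as dropping the lock statements; a minimal sketch:

 next if $line =~ /^\s*(?:UN)?LOCK\s+TABLES/i;   # skip LOCK TABLES ... and UNLOCK TABLES lines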

Importing data

The next step in the migration process is running the SQL files generated by the scripts. If the previous steps were done correctly, this part should go smoothly. What actually happens is that the import surfaces problems that went unseen in the prior steps, requiring you to go back, adjust the scripts, and try again.

To run the SQL files, sign into the Postgres database using psql and run the \i meta-command:

\i /path/to/converted_schema.sql
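Running the file non-interactively also works, and stopping at the first error makes failures easier to pin down. As a sketch (user and database names are placeholders):

psql -U myuser -d mydb -v ON_ERROR_STOP=1 -f /path/to/converted_schema.sql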

The two main errors to watch out for:

ERROR: value too long for type character varying(50)

This can be fixed by increasing the varchar field's character length, as mentioned earlier.

ERROR: invalid command \n

This error can be caused by stray escaped single quotes, or other incompatible data values. To fix these, regex may need to be added to the data processing script to target the specific problem area.

Some of these errors require a harder look at the insert statements to find where the issues are. This can be challenging in a large SQL file. To help with this, write out the INSERT statements that were erroring to a separate, much smaller SQL file, which can more easily be studied to find the issues.

my %lines_to_debug = map { $_ => 1 } (1148, 1195);   # line numbers taken from the import errors
 ...
if (exists $lines_to_debug{$current_line_number}) {
    print $debug_data "$line";   # $debug_data is the filehandle for the smaller debug SQL file
}

Chunking Data

Regardless of what scripting language you choose to use for your migration, chunking data is going to be important on large SQL files.

For this script, the data was chunked into 1 MB chunks, which helped keep the script efficient. You should pick a chunk size that makes sense for your dataset.

my $bytes_read = read( $original_data, $chunk, $chunk_size );
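The read call above only makes sense inside a loop that also deals with lines split across chunk boundaries. A minimal sketch, where process_line is a hypothetical stand-in for the per-line transformations above:

my $chunk_size = 1024 * 1024;   # 1 MB
my $carry = '';                 # partial line held over from the previous chunk
while ( my $bytes_read = read( $original_data, my $chunk, $chunk_size ) ) {
    $chunk = $carry . $chunk;
    $carry = $chunk =~ s/([^\n]*)\z// ? $1 : '';   # keep the trailing partial line
    process_line("$_\n") for split /\n/, $chunk;
}
process_line($carry) if length $carry;             # flush whatever remained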

Verifying Data

There are a few methods of verifying the data

Row Count

Doing a row count is an easy way to ensure at least all the rows were inserted. Count the rows in the old database and compare that to the rows in the new database.

SELECT count(*) FROM address_book

Checksum

Running a checksum across the columns may help, but bear in mind that some fields, especially varchar fields, could have been changed to the ANSI standard format during processing. So while this will work for some fields, it won't be accurate for all of them.

For MySQL

SELECT MD5(GROUP_CONCAT(COALESCE(user_id, '') ORDER BY id)) FROM address_book
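One caveat on the MySQL side: GROUP_CONCAT silently truncates its result at group_concat_max_len (1024 bytes by default), which would quietly skew the checksum on larger tables. Raise the limit first:

SET SESSION group_concat_max_len = 1073741824;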

For PostgreSQL

SELECT MD5(STRING_AGG(COALESCE(user_id, ''), '' ORDER BY id)) FROM address_book

Manual Data Check

You will also want to verify the data through a manual process. Run some queries that make sense for your dataset, ones likely to pick up issues with the import.

Final thoughts

Migrating databases is a large undertaking, but with careful planning and a good understanding of both your dataset and the differences between the two database systems, it can be completed successfully.

There is more to migrating to a new database than just the import, but a solid dataset migration will put you in a good place for the rest of the transition.


Scripts created for this migration can be found on GitHub.
