Migrating from MySQL to PostgreSQL

Migrating a database from MySQL to Postgres is a challenging process.

While MySQL and Postgres do a similar job, there are fundamental differences between them, and those differences can create issues that need addressing for the migration to be successful.

Where to start?

pgloader is a tool that can be used to move your data to PostgreSQL. It isn't perfect, but it can work well in some cases, so it's worth a look to see if it's the direction you want to go.
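For reference, a basic pgloader run takes a source and a target connection string and migrates schema and data in one pass (the credentials and database names below are placeholders):

pgloader mysql://user:pass@localhost/source_db postgresql://user:pass@localhost/target_db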

Another approach to take is to create custom scripts.

Custom scripts offer greater flexibility and scope to address issues specific to your dataset.

For this article, custom scripts were built to handle the migration process.

Exporting the data

How the data is exported is critical to a smooth migration. Using mysqldump in its default setup will lead to a more difficult process.

Use the --compatible=ansi option to export the data in a format much closer to what PostgreSQL expects.

To make the migration easier to handle, split up the schema and data dumps so they can be processed separately. The processing requirements for each file are very different and creating a script for each will make it more manageable.
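A sketch of what the two exports might look like, assuming a database named mydb (--no-data dumps only the schema, --no-create-info dumps only the rows):

mysqldump --compatible=ansi --no-data mydb > schema.sql
mysqldump --compatible=ansi --no-create-info mydb > data.sql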

Schema differences

Data Types

There are differences in which data types are available in MySQL and PostgreSQL. This means that when processing your schema, you will need to decide which field data types work best for your data.

Category | MySQL | PostgreSQL
Numeric | INT, TINYINT, SMALLINT, MEDIUMINT, BIGINT, FLOAT, DOUBLE, DECIMAL | INTEGER, SMALLINT, BIGINT, NUMERIC, REAL, DOUBLE PRECISION, SERIAL, SMALLSERIAL, BIGSERIAL
String | CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT | CHAR, VARCHAR, TEXT
Date and Time | DATE, TIME, DATETIME, TIMESTAMP, YEAR | DATE, TIME, TIMESTAMP, INTERVAL, TIMESTAMPTZ
Binary | BINARY, VARBINARY, TINYBLOB, BLOB, MEDIUMBLOB, LONGBLOB | BYTEA
Boolean | BOOLEAN (TINYINT(1)) | BOOLEAN
Enum and Set | ENUM, SET | ENUM (no SET equivalent)
JSON | JSON | JSON, JSONB
Geometric | GEOMETRY, POINT, LINESTRING, POLYGON | POINT, LINE, LSEG, BOX, PATH, POLYGON, CIRCLE
Network Address | No built-in types | CIDR, INET, MACADDR
UUID | No built-in type (can use CHAR(36)) | UUID
Array | No built-in support | Supports arrays of any data type
XML | No built-in type | XML
Range Types | No built-in support | int4range, int8range, numrange, tsrange, tstzrange, daterange
Composite Types | No built-in support | User-defined composite types

Tinyint field type

Tinyint doesn't exist in PostgreSQL. You have the choice of smallint or boolean to replace it with; choose whichever data type best fits the current dataset.

 $line =~ s/\btinyint(?:\(\d+\))?/smallint/gi;
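If a column is really a flag, the MySQL convention is tinyint(1); running a more specific rule first can map those to boolean instead (a sketch, assuming the tinyint(1) convention holds in your schema):

 $line =~ s/\btinyint\(1\)/boolean/gi;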

Enum field type

Enum fields are a little more complex: while enums exist in PostgreSQL, they require creating custom types.

To avoid duplicating custom types, it is better to plan out what enum types are required and create the minimum number of custom types needed for your schema. Custom types are not table specific; one custom type can be used on multiple tables.

CREATE TYPE color_enum AS ENUM ('blue', 'green');

...
"shirt_color" color_enum NOT NULL DEFAULT 'blue',
"pant_color" color_enum NOT NULL DEFAULT 'green',
...

The creation of the types would need to be done before the SQL is imported. The script could then be adjusted to use the custom types that have been created.

If there are multiple fields using enum('blue','green'), these should all be using the same enum custom type. Creating custom types for each individual field would not be good database design.

if ( $line =~ /"([^"] )"\s enum\(([^)] )\)/ ) {
    my $column_name = $1;
    my $enum_values = $2;
    # MySQL can store '' in an enum column for invalid values, so make
    # sure the PostgreSQL type accepts it too
    if ( $enum_values !~ /''/ ) {
        $enum_values .= ",''";
    }

    my @items = $enum_values =~ /'([^']*)'/g;

    my $sorted_enum_values = join( ',', sort @items );

    my $enum_type_name;
    if ( exists $enum_types{$sorted_enum_values} ) {
        $enum_type_name = $enum_types{$sorted_enum_values};
    }
    else {
        $enum_type_name = create_enum_type_name($sorted_enum_values);
        $enum_types{$sorted_enum_values} = $enum_type_name;

        # Add CREATE TYPE statement to post-processing
        push @enum_lines,
        "CREATE TYPE $enum_type_name AS ENUM ($enum_values);\n";
    }

    # Replace the line with the new ENUM type
    $line =~ s/enum\([^)]+\)/$enum_type_name/;
}
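The create_enum_type_name() helper isn't shown here; one hypothetical implementation derives a deterministic name from the sorted, comma-joined values, so identical enums always map to the same type:

# Hypothetical helper: 'blue,green' becomes enum_blue_green
sub create_enum_type_name {
    my ($sorted_values) = @_;
    my $name = lc $sorted_values;
    $name =~ s/[^a-z0-9]+/_/g;    # collapse commas and stray characters to underscores
    $name =~ s/^_+|_+$//g;        # trim leading/trailing underscores
    return "enum_$name";
}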

Indexes

There are differences in how indexes are created. Two variations had to be handled: indexes with a prefix length limit and indexes without one. Both needed to be removed from the schema SQL and put into a separate SQL file (run_after.sql) to be run after the import is complete.

if ($line =~ /^\s*KEY\s+/i) {
    if ($line =~ /KEY\s+"([^"]+)"\s+\("([^"]+)"\)/) {
        my $index_name = $1;
        my $column_name = $2;
        push @post_process_lines, "CREATE INDEX idx_${current_table}_$index_name ON \"$current_table\" (\"$column_name\");\n";
    } elsif ($line =~ /KEY\s+"([^"]+)"\s+\("([^"]+)"\((\d+)\)\)/i) {
        my $index_name = $1;
        my $column_name = $2;
        my $prefix_length = $3;
        push @post_process_lines, "CREATE INDEX idx_${current_table}_$index_name ON \"$current_table\" (LEFT(\"$column_name\", $prefix_length));\n";
    }
    next;
}

Full text indexes work quite differently in PostgreSQL. To create a full text index, the data must first be converted into a text search vector.

The vector can then be indexed. There are two index types to choose from when indexing vectors: GIN and GiST. Both have pros and cons; generally GIN is preferred over GiST because, while it is slower to build, it's faster for lookups.

if ( $line =~ /^\s*FULLTEXT\s+KEY\s+"([^"]+)"\s+\("([^"]+)"\)/i ) {
    my $index_name  = $1;
    my $column_name = $2;
    push @post_process_lines,
    "CREATE INDEX idx_fts_${current_table}_$index_name ON \"$current_table\" USING GIN (to_tsvector('english', \"$column_name\"));\n";
    next;
}
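The index is only used when a query is written against the same expression it indexes; a lookup against the index above would look something like this (table and column names are placeholders):

SELECT * FROM "articles"
WHERE to_tsvector('english', "body") @@ to_tsquery('english', 'shirt');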

Auto increment

PostgreSQL doesn't have MySQL's AUTO_INCREMENT keyword; instead it uses GENERATED ALWAYS AS IDENTITY.

There is a catch with using GENERATED ALWAYS AS IDENTITY while importing data: it is not designed for importing IDs. When inserting a row into a table, the ID field cannot be specified; the ID value will be auto generated, and trying to insert your own IDs into the row will produce an error.

To work around this issue, the ID field can be set to the SERIAL type instead of int with GENERATED ALWAYS AS IDENTITY. SERIAL is much more flexible for imports, but it is not recommended to leave the field as SERIAL.

An alternative to using this method would be to add OVERRIDING SYSTEM VALUE into the insert query.

INSERT INTO table (id, name)
OVERRIDING SYSTEM VALUE
VALUES (100, 'A Name');

If you use SERIAL, some queries will need to be written into run_after.sql to change the SERIAL to GENERATED ALWAYS AS IDENTITY and reset the internal counter after the schema is created and the data has been inserted.

if ( $line =~ /^\s*"(\w )"\s (int|bigint)\s NOT\s NULL\s AUTO_INCREMENT\s*,/i ) {
    my $column_name = $1;
    $line =~ s/^\s*"$column_name"\s (int|bigint)\s NOT\s NULL\s AUTO_INCREMENT\s*,/"$column_name" SERIAL,/;

    push @post_process_lines, "ALTER TABLE \"$current_table\" ALTER COLUMN \"$column_name\" DROP DEFAULT;\n";

    push @post_process_lines, "DROP SEQUENCE ${current_table}_${column_name}_seq;\n";

    push @post_process_lines, "ALTER TABLE \"$current_table\" ALTER COLUMN \"$column_name\" ADD GENERATED ALWAYS AS IDENTITY;\n";

    push @post_process_lines, "SELECT setval('${current_table}_${column_name}_seq', (SELECT COALESCE(MAX(\"$column_name\"), 1) FROM \"$current_table\"));\n\n";

}

Schema results

Original schema after exporting from MySQL

DROP TABLE IF EXISTS "address_book";
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE "address_book" (
  "id" int NOT NULL AUTO_INCREMENT,
  "user_id" varchar(50) NOT NULL,
  "common_name" varchar(50) NOT NULL,
  "display_name" varchar(50) NOT NULL,
  PRIMARY KEY ("id"),
  KEY "user_id" ("user_id")
);

Processed main SQL file

DROP TABLE IF EXISTS "address_book";
CREATE TABLE "address_book" (
  "id" SERIAL,
  "user_id" varchar(85) NOT NULL,
  "common_name" varchar(85) NOT NULL,
  "display_name" varchar(85) NOT NULL,
  PRIMARY KEY ("id")
);

run_after.sql

ALTER TABLE "address_book" ALTER COLUMN "id" DROP DEFAULT;
DROP SEQUENCE address_book_id_seq;
ALTER TABLE "address_book" ALTER COLUMN "id" ADD GENERATED ALWAYS AS IDENTITY;
SELECT setval('address_book_id_seq', (SELECT COALESCE(MAX("id"), 1) FROM "address_book"));
CREATE INDEX idx_address_book_user_id ON "address_book" ("user_id");

It's worth noting the index naming convention used in the migration: the index name includes both the table name and the field name. Index names have to be unique, not only within the table the index was added to, but across the whole schema, and adding the table name and the column name reduces the chance of duplicates in your script.

Data processing

The biggest hurdle in migrating your database is getting the data into a format PostgreSQL accepts. There are some differences in how PostgreSQL stores data that require extra attention.

Character sets

The dataset used for this article predates utf8mb4 and uses the old default of latin1. That charset is not compatible with PostgreSQL's default charset, UTF8. It should be noted that PostgreSQL's UTF8 also differs from MySQL's utf8mb4.

The issue with migrating from Latin1 to UTF8 is how the data is stored. In Latin1 each character is a single byte, while in UTF8 the characters can be multibyte, up to 4 bytes.

An example of this is the word café: in latin1 it is stored as 4 bytes, while in UTF8 it takes 5 bytes. During a character set migration the byte length is taken into account, which can lead to truncated data in UTF8. PostgreSQL will error on this truncation.

To avoid truncation, add padding to the affected varchar fields.
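A minimal sketch of one way to do this in the schema script, assuming a blanket widening factor; the 1.7 used here matches the varchar(50) to varchar(85) change visible in the schema results earlier:

use POSIX qw(ceil);

# Widen every varchar to leave room for multibyte characters (50 -> 85)
$line =~ s/\bvarchar\((\d+)\)/'varchar(' . ceil($1 * 1.7) . ')'/gie;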

It's worth noting that this same truncation issue could occur if you were changing character sets within MySQL.

Character Escaping

It's not uncommon to see backslash escaped single quotes stored in a database.

However, PostgreSQL doesn't support this by default. Instead, the ANSI SQL standard method of using double single quotes is used.

If a varchar field contains It\'s, it would need to be changed to It''s.

 $line =~ s/\\'/''/g;

Table Locking

In SQL dumps there are table locking calls before each insert.

LOCK TABLES "address_book" WRITE;

Generally it is unnecessary to manually lock a table in PostgreSQL.

PostgreSQL handles transactions using Multi-Version Concurrency Control (MVCC). When a row is updated, a new version is created; once the old version is no longer in use, it will be removed. This means that table locking is often not needed. PostgreSQL will use locks alongside MVCC to improve concurrency, and manually setting locks can negatively affect it.

For this reason, removing the manual locks from the SQL dump and letting PostgreSQL handle the locks as needed is the better choice.
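In the data-processing script the lock statements can simply be skipped; a minimal sketch:

# Drop MySQL's LOCK TABLES / UNLOCK TABLES statements entirely
next if $line =~ /^\s*(?:UN)?LOCK\s+TABLES/i;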

Importing data

The next step in the migration process is running the SQL files generated by the script. If the previous steps were done correctly, this part should go smoothly. What actually happens is that the import picks up problems that went unseen in the prior steps, and requires going back, adjusting the scripts, and trying again.

To run the SQL files, sign in to the Postgres database using psql and run the import meta-command:

\i /path/to/converted_schema.sql
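By default psql keeps executing after an error, which can bury the first real failure under a flood of follow-on messages. Setting ON_ERROR_STOP before the import makes it halt at the first problem:

\set ON_ERROR_STOP on
\i /path/to/converted_schema.sql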

The two main errors to watch out for:

ERROR: value too long for type character varying(50)

This can be fixed by increasing varchar field character length as mentioned earlier.

ERROR: invalid command \n

This error can be caused by stray escaped single quotes, or other incompatible data values. To fix these, regex may need to be added to the data processing script to target the specific problem area.

Some of these errors require a harder look at the insert statements to find where the issues are. This can be challenging in a large SQL file. To help with this, write out the INSERT statements that were erroring to a separate, much smaller SQL file, which can more easily be studied to find the issues.

# Line numbers of the INSERT statements that produced errors
my %lines_to_debug = map { $_ => 1 } (1148, 1195);
 ...
# Write the problem lines to a separate debug file for closer inspection
if ( exists $lines_to_debug{$current_line_number} ) {
    print $debug_data $line;
}

Chunking Data

Regardless of what scripting language you choose to use for your migration, chunking data is going to be important on large SQL files.

For this script, the data was chunked into 1 MB chunks, which helped keep the script efficient. You should pick a chunk size that makes sense for your dataset.

my $bytes_read = read( $original_data, $chunk, $chunk_size );
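A minimal sketch of the full read loop, assuming line-oriented processing; the file name and the process_line() hook are placeholders, not from the original script:

use strict;
use warnings;

my $chunk_size = 1024 * 1024;    # 1 MB chunks, as used for this migration
my $carry      = '';             # partial trailing line carried between chunks

open my $original_data, '<', 'converted_data.sql' or die "Cannot open: $!";
while ( read( $original_data, my $chunk, $chunk_size ) ) {
    $chunk = $carry . $chunk;

    # Hold back any incomplete final line until the next chunk arrives
    $carry = $chunk =~ s/([^\n]*)\z// ? $1 : '';

    process_line($_) for split /\n/, $chunk;
}
process_line($carry) if length $carry;    # flush the remainder
close $original_data;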

Verifying Data

There are a few methods of verifying the data

Row Count

Doing a row count is an easy way to ensure at least all the rows were inserted. Count the rows in the old database and compare that to the rows in the new database.

SELECT count(*) FROM address_book

Checksum

Running a checksum across the columns may help, but bear in mind that some fields, especially varchar fields, could have been altered during processing to the ANSI standard format. So while this will work on some fields, it won't be accurate on all of them.

For MySQL

SELECT MD5(GROUP_CONCAT(COALESCE(user_id, '') ORDER BY id)) FROM address_book
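One caveat worth knowing: MySQL silently truncates GROUP_CONCAT() results at group_concat_max_len, which defaults to only 1024 bytes, so raise it for the session before checksumming a large table:

SET SESSION group_concat_max_len = 1000000;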

For PostgreSQL

SELECT MD5(STRING_AGG(COALESCE(user_id, ''), '' ORDER BY id)) FROM address_book

Manual Data Check

You are also going to want to verify the data through a manual process. Run some queries that make sense for your dataset, queries that would be likely to pick up issues with the import.
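As a sketch, a grouped count like the one below, using the address_book table from earlier, runs unchanged on both MySQL and PostgreSQL; running it on each side and comparing the output points to where rows were lost:

SELECT user_id, COUNT(*) AS rows_per_user
FROM address_book
GROUP BY user_id
ORDER BY user_id;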

Final thoughts

Migrating databases is a large undertaking, but with careful planning and a good understanding of both your dataset and the differences between the two database systems, it can be completed successfully.

There is more to migrating to a new database than just the import, but a solid dataset migration will put you in a good place for the rest of the transition.


Scripts created for this migration can be found on GitHub.

Originally published at: https://dev.to/mrpercival/migrating-from-mysql-to-postgresql-1oh7