Migrando de MySQL a PostgreSQL

Página delantera > Programación > Migrando de MySQL a PostgreSQL

Migrando de MySQL a PostgreSQL

Publicado el 2024-07-29

Navegar:354

Migrating from MySQL to PostgreSQL

Migrating a database from MySQL to Postgres is a challenging process.

While MySQL and Postgres do a similar job, there are some fundamental differences between them and those differences can create issues that need addressing for the migration to be successful.

Where to start?

Pg Loader is a tool that can be used to move your data to PostgreSQL, however, it's not perfect, but can work well in some cases. It's worth looking at to see if it's the direction you want to go.

Another approach to take is to create custom scripts.

Custom scripts offer greater flexibility and scope to address issues specific to your dataset.

For this article, custom scripts were built to handle the migration process.

Exporting the data

How the data is exported is critical to a smooth migration. Using mysqldump in its default setup will lead to a more difficult process.

Use the --compatible=ansi option to export the data in a format PostgreSQL requires.

To make the migration easier to handle, split up the schema and data dumps so they can be processed separately. The processing requirements for each file are very different and creating a script for each will make it more manageable.

Schema differences

Data Types

There are differences in what data types are available in MySQL and PostgreSQL, this means when processing your schema you are going to need to decide what field data types work best for your data.

Category	MySQL	PostgreSQL
Numeric	INT, TINYINT, SMALLINT, MEDIUMINT, BIGINT, FLOAT, DOUBLE, DECIMAL	INTEGER, SMALLINT, BIGINT, NUMERIC, REAL, DOUBLE PRECISION, SERIAL, SMALLSERIAL, BIGSERIAL
String	CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT	CHAR, VARCHAR, TEXT
Date and Time	DATE, TIME, DATETIME, TIMESTAMP, YEAR	DATE, TIME, TIMESTAMP, INTERVAL, TIMESTAMPTZ
Binary	BINARY, VARBINARY, TINYBLOB, BLOB, MEDIUMBLOB, LONGBLOB	BYTEA
Boolean	BOOLEAN (TINYINT(1))	BOOLEAN
Enum and Set	ENUM, SET	ENUM (no SET equivalent)
JSON	JSON	JSON, JSONB
Geometric	GEOMETRY, POINT, LINESTRING, POLYGON	POINT, LINE, LSEG, BOX, PATH, POLYGON, CIRCLE
Network Address	No built-in types	CIDR, INET, MACADDR
UUID	No built-in type (can use CHAR(36))	UUID
Array	No built-in support	Supports arrays of any data type
XML	No built-in type	XML
Range Types	No built-in support	int4range, int8range, numrange, tsrange, tstzrange, daterange
Composite Types	No built-in support	User-defined composite types

Tinyint field type

Tinyint doesn't exist in PostgreSQL. You have the choice of smallint or boolean to replace it with. Choose the data type most like the current dataset.

 $line =~ s/\btinyint(?:\(\d \))?\b/smallint/gi;

Enum Field type

Enum fields are a little more complex, while enums exist in PostgreSQL, they require creating custom types.

To avoid duplicating custom types, it is better to plan out what enum types are required and create the minimum number of custom types needed for your schema. Custom types are not table specific, one custom type can be used on multiple tables.

CREATE TYPE color_enum AS ENUM ('blue', 'green');

...
"shirt_color" color_enum NOT NULL DEFAULT 'blue',
"pant_color" color_enum NOT NULL DEFAULT 'green',
...

The creation of the types would need to be done before the SQL is imported. The script could then be adjusted to use the custom types that have been created.

If there are multiple fields using enum('blue','green'), these should all be using the same enum custom type. Creating custom types for each individual field would not be good database design.

if ( $line =~ /"([^"] )"\s enum\(([^)] )\)/ ) {
    my $column_name = $1;
    my $enum_values = $2;
    if ( $enum_values !~ /''/ ) {
        $enum_values .= ",''";
    }

    my @items = $enum_values =~ /'([^']*)'/g;

    my $sorted_enum_values = join( ',', sort @items );

    my $enum_type_name;
    if ( exists $enum_types{$sorted_enum_values} ) {
        $enum_type_name = $enum_types{$sorted_enum_values};
    }
    else {
        $enum_type_name = create_enum_type_name($sorted_enum_values);
        $enum_types{$sorted_enum_values} = $enum_type_name;

        # Add CREATE TYPE statement to post-processing
        push @enum_lines,
        "CREATE TYPE $enum_type_name AS ENUM ($enum_values);\n";
    }

    # Replace the line with the new ENUM type
    $line =~ s/enum\([^)] \)/$enum_type_name/;
}

Indexes

There are differences in how indexes are created. There are two variations of indexes, Indexes with character limitations and indexes without character limitations. Both of these needed to be handled and removed from the SQL and put into a separate SQL file to be run after the import is complete (run_after.sql).

if ($line =~ /^\s*KEY\s /i) {
    if ($line =~ /KEY\s "([^"] )"\s \("([^"] )"\)/) {
        my $index_name = $1;
        my $column_name = $2;
        push @post_process_lines, "CREATE INDEX idx_${current_table}_$index_name ON \"$current_table\" (\"$column_name\");\n";
    } elsif ($line =~ /KEY\s "([^"] )"\s \("([^"] )"\((\d )\)\)/i) {
        my $index_name = $1;
        my $column_name = $2;
        my $prefix_length = $3;
        push @post_process_lines, "CREATE INDEX idx_${current_table}_$index_name ON \"$current_table\" (LEFT(\"$column_name\", $prefix_length));\n";
    }
    next;
}

Full text indexes work quite differently in PostgreSQL. To create full text index the index must convert the data into a vector.

The vector can then be indexed. There are two index types to choose from when indexing vectors. GIN and GiST. Both have pros and cons. Generally GIN is preferred over GiST. While GIN is slower building the index, it's faster for lookups.

if ( $line =~ /^\s*FULLTEXT\s KEY\s "([^"] )"\s \("([^"] )"\)/i ) {
    my $index_name  = $1;
    my $column_name = $2;
    push @post_process_lines,
    "CREATE INDEX idx_fts_${current_table}_$index_name ON \"$current_table\" USING GIN (to_tsvector('english', \"$column_name\"));\n";
    next;
}

Auto increment

PostgreSQL doesn't use the AUTOINCREMENT keyword, instead it uses GENERATED ALWAYS AS IDENTITY.

There is a catch with using GENERATED ALWAYS AS IDENTITY while importing data. GENERATED ALWAYS AS IDENTITY is not designed for importing IDs, When inserting a row into a table, the ID field cannot be specified. The ID value will be auto generated. Trying to insert your own IDs into the row will produce an error.

To work around this issue, the ID field can be set as SERIAL type instead of int GENERATED ALWAYS AS IDENTITY. SERIAL is much more flexible for imports, but it is not recommended to leave the field as SERIAL.

An alternative to using this method would be to add OVERRIDING SYSTEM VALUE into the insert query.

INSERT INTO table (id, name)
OVERRIDING SYSTEM VALUE
VALUES (100, 'A Name');

If you use SERIAL, some queries will need to be written into run_after.sql to change the SERIAL to GENERATED ALWAYS AS IDENTITY and reset the internal counter after the schema is created and the data has been inserted.

if ( $line =~ /^\s*"(\w )"\s (int|bigint)\s NOT\s NULL\s AUTO_INCREMENT\s*,/i ) {
    my $column_name = $1;
    $line =~ s/^\s*"$column_name"\s (int|bigint)\s NOT\s NULL\s AUTO_INCREMENT\s*,/"$column_name" SERIAL,/;

    push @post_process_lines, "ALTER TABLE \"$current_table\" ALTER COLUMN \"$column_name\" DROP DEFAULT;\n";

    push @post_process_lines, "DROP SEQUENCE ${current_table}_${column_name}_seq;\n";

    push @post_process_lines, "ALTER TABLE \"$current_table\" ALTER COLUMN \"$column_name\" ADD GENERATED ALWAYS AS IDENTITY;\n";

    push @post_process_lines, "SELECT setval('${current_table}_${column_name}_seq', (SELECT COALESCE(MAX(\"$column_name\"), 1) FROM \"$current_table\"));\n\n";

}

Schema results

Original schema after exporting from MySQL

DROP TABLE IF EXISTS "address_book";
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE "address_book" (
  "id" int NOT NULL AUTO_INCREMENT,
  "user_id" varchar(50) NOT NULL,
  "common_name" varchar(50) NOT NULL,
  "display_name" varchar(50) NOT NULL,
  PRIMARY KEY ("id"),
  KEY "user_id" ("user_id")
);

Processed main SQL file

DROP TABLE IF EXISTS "address_book";
CREATE TABLE "address_book" (
  "id" SERIAL,
  "user_id" varchar(85) NOT NULL,
  "common_name" varchar(85) NOT NULL,
  "display_name" varchar(85) NOT NULL,
  PRIMARY KEY ("id")
);

Run_after.sql

ALTER TABLE "address_book" ALTER COLUMN "id" DROP DEFAULT;
DROP SEQUENCE address_book_id_seq;
ALTER TABLE "address_book" ALTER COLUMN "id" ADD GENERATED ALWAYS AS IDENTITY;
SELECT setval('address_book_id_seq', (SELECT COALESCE(MAX("id"), 1) FROM "address_book"));
CREATE INDEX idx_address_book_user_id ON "address_book" ("user_id");

Its worth noting the index naming convention used in the migration. The index name includes both the table name and the field name. Index names have to be unique, not only within the table the index was added to, but the entire database, adding the table name and the column name reduces the chances of duplicates in your script.

Data processing

The biggest hurdle in migrating your database is getting the data into a format PostgreSQL accepts. There are some differences in how PostgreSQL stores data that requires extra attention.

Character sets

The dataset used for this article predated utf8mb4 and uses the old default of Latin1, the charset is not compatible with PostgreSQL default charset UTF8, it should be noted that PostgreSQL UTF8 also differs from MySQL's UTF8mb4.

The issue with migrating from Latin1 to UTF8 is how the data is stored. In Latin1 each character is a single byte, while in UTF8 the characters can be multibyte, up to 4 bytes.

An example of this is the word café

in Latin1 the data is stored as 4 bytes and in UTF8 as 5 bytes. During migration of character sets, the byte value is taken into account and can lead to truncated data in UTF8. PostgreSQL will error on this truncation.

To avoid truncation, add padding to affected Varchar fields.

It's worth noting that this same truncation issue could occur if you were changing character sets within MySQL.

Character Escaping

It's not uncommon to see backslash escaped single quotes stored in a database.

However, PostgreSQL doesn't support this by default. Instead, the ANSI SQL standard method of using double single quotes is used.

If the varchar field contains It\'s it would need to be changed to it''s

 $line =~ s/\\'/\'\'/g;

Table Locking

In SQL dumps there are table locking calls before each insert.

LOCK TABLES "address_book" WRITE;

Generally it is unnecessary to manually lock a table in PostgreSQL.

PostgreSQL handles transactions by using Multi-Version Concurrency Control (MVCC). When a row is updated, it creates a new version. Once the old version is no longer in use, it will be removed. This means that table locking is often not needed. PostgreSQL will use locks along side MVCC to improve concurrency. Manually setting locks can negatively affect concurrency.

For this reason, removing the manual locks from the SQL dump and letting PostgreSQL handle the locks as needed is the better choice.

Importing data

The next step in the migration process is running the SQL files generated by the script. If the previous steps were done correctly this part should be a smooth action. What actually happens is the import picks up problems that went unseen in the prior steps, and requires going back and adjusting the scripts and trying again.

To run the SQL files sign into the Postgres database using Psql and run the import function

\i /path/to/converted_schema.sql

The two main errors to watch out for:

ERROR: value too long for type character varying(50)

This can be fixed by increasing varchar field character length as mentioned earlier.

ERROR: invalid command \n

This error can be caused by stray escaped single quotes, or other incompatible data values. To fix these, regex may need to be added to the data processing script to target the specific problem area.

Some of these errors require a harder look at the insert statements to find where the issues are. This can be challenging in a large SQL file. To help with this, write out the INSERT statements that were erroring to a separate, much smaller SQL file, which can more easily be studied to find the issues.

my %lines_to_debug = map { $_ => 1 } (1148, 1195); 
 ...
if (exists $lines_to_debug{$current_line_number}) {
    print $debug_data "$line";  
}

Chunking Data

Regardless of what scripting language you choose to use for your migration, chunking data is going to be important on large SQL files.

For this script, the data was chunked into 1Mb chunks, which helped kept the script efficient. You should pick a chunk size that makes sense for your dataset.

my $bytes_read = read( $original_data, $chunk, $chunk_size );

Verifying Data

There are a few methods of verifying the data

Row Count

Doing a row count is an easy way to ensure at least all the rows were inserted. Count the rows in the old database and compare that to the rows in the new database.

SELECT count(*) FROM address_book

Checksum

Running a checksum across the columns may help, but bear in mind that some fields, especially varchar fields, could have been changed to ANSI standard format. So while this will work on some fields, it won't be accurate on all fields.

For MySQL

SELECT MD5(GROUP_CONCAT(COALESCE(user_id, '') ORDER BY id)) FROM address_book

For PostgreSQL

SELECT MD5(STRING_AGG(COALESCE(user_id, ''), '' ORDER BY id)) FROM address_book

Manual Data Check

You are going to want to verify the data through a manual process also. Run some queries that make sense, queries that would be likely to pick up issues with the import.

Final thoughts

Migrating databases is a large undertaking, but with careful planning and a good understanding of both your dataset and the differences between the two database systems, it can be completed successfully.

There is more to migrating to a new database than just the import, but a solid dataset migration will put you in a good place for the rest of the transition.

Scripts created for this migration can be found on Git Hub.

Declaración de liberación Este artículo se reproduce en: https://dev.to/mrpercival/migrate-from-mysql-to-postgresql-1oh7?1 Si hay alguna infracción, comuníquese con [email protected] para eliminarla.

Último tutorial Más>

¿Por qué las uniones de la izquierda parecen intraesiones al filtrarse en la cláusula WHERE en la mesa derecha?
Left endrum: Horas de brujería cuando se convierte en una unión interna en el ámbito de un mago de la base de datos, realizar recuperaciones de ...

Programación Publicado el 2025-04-22
Async void vs. async tarea en ASP.NET: ¿Por qué el método de async void a veces arroja excepciones?
comprensión de la distinción entre la tarea async void y async en asp.net en aplicaciones ASP.NET, la programación asíncrona juega un papel cr...

Programación Publicado el 2025-04-22
¿Cómo convertir una columna Pandas DataFrame a formato de fecha y hora de filtrar por fecha?
transformar la columna Pandas DataFrame en formato de Datetime escenario: datos dentro de un marco de datos PANDAS a menudo existe en varios...

Programación Publicado el 2025-04-22
¿Cómo convertir eficientemente las zonas horarias en PHP?
Conversión de zona horaria eficiente en php en PHP, el manejo de las zonas horarias puede ser una tarea directa. Esta guía proporcionará un méto...

Programación Publicado el 2025-04-22
¿Cómo repetir eficientemente los caracteres de cadena para la sangría en C#?
repitiendo una cadena para la indentación al sangrar una cadena basada en la profundidad de un elemento, es conveniente tener una forma eficie...

Programación Publicado el 2025-04-22
Formación
Los métodos son fns que se pueden llamar a los objetos Las matrices son objetos, por lo tanto, también tienen métodos en js. Slice (Begi...

Programación Publicado el 2025-04-22
¿Cómo establecer dinámicamente las claves en los objetos JavaScript?
cómo crear una clave dinámica para una variable de objeto JavaScript al intentar crear una clave dinámica para un objeto JavaScript, usando esta...

Programación Publicado el 2025-04-22
¿Puedo migrar mi cifrado de MCRYPT a OpenSSL y descifrar datos cifrados de MCRYPT usando OpenSSL?
actualizando mi biblioteca de cifrado de MCRYP En OpenSSL, ¿es posible descifrar datos encriptados con MCRYPT? Dos publicaciones diferentes propo...

Programación Publicado el 2025-04-22
Método para convertir correctamente los caracteres LATIN1 en UTF8 en UTF8 MySQL Table
converse los caracteres latin1 en una tabla utf8 a utf8 ha encontrado un problema donde los caracteres con diacrísos "mysql_set_charset (...

Programación Publicado el 2025-04-22
¿Cómo puedo generar eficientemente las babosas amigables con la URL a partir de cuerdas Unicode en PHP?
elaborando una función para una generación de babosas eficiente creando babosas, representaciones simplificadas de las cadenas unicode utiliza...

Programación Publicado el 2025-04-22
¿Cómo acceder dinámicamente a las variables globales en JavaScript?
Acceder a variables globales dinámicamente por nombre en javascript a las variables globales durante el tiempo de ejecución puede ser un requisi...

Programación Publicado el 2025-04-22
¿Cómo puedo manejar los nombres de archivo UTF-8 en las funciones del sistema de archivos de PHP?
manejando los nombres de archivo UTF-8 en las funciones del sistema de archivos de PHP al crear carpetas que contienen caracteres UTF-8 utiliz...

Programación Publicado el 2025-04-22
¿Por qué Microsoft Visual C ++ no implementa correctamente la instanciación de la plantilla de dos fases?
El misterio de la plantilla de dos fases "roto" instanciación en Microsoft Visual c declaración de problemas: usuarios comúnmente ...

Programación Publicado el 2025-04-22
¿Por qué el DateTime de PHP :: Modify ('+1 mes') produce resultados inesperados?
modificando meses con php datetime: descubrir el comportamiento previsto cuando se trabaja con la clase de datetime de PHP, suma o ritir meses...

Programación Publicado el 2025-04-22
Resolver el error MySQL 1153: el paquete excede el límite 'max_allowed_packet'
MySql Error 1153: la solución de problemas obtuvo un paquete más grande que 'max_allowed_packet' bytes frente al error enigmático mysq...

Programación Publicado el 2025-04-22

Clasificación Más>

Aprende japonés Aprender coreano Aprender chino Aprender idioma extranjero Juego Problema comun Periféricos tecnológicos AI Tutoriales de software Programación Artículo