Source | OSCHINA Community

Author |  PostgreSQLChina

Original link: https://my.oschina.net/postgresqlchina/blog/5568852


When using PostgreSQL, we sometimes need to load a large amount of data into the database, for example when importing test data or business data. This article describes several options for optimizing such bulk imports; you can choose the ones that fit your situation.
1. Turn off auto-commit
Turn off autocommit and commit only once at the end of each batch (data copy).
If each insert commits independently, PostgreSQL does a lot of work for every row that is added. An additional advantage of doing all the inserts in one transaction is that if one record fails to insert, all records inserted up to that point are rolled back, so you are never left with partially imported, incomplete data.
postgres=# \echo :AUTOCOMMIT
on
postgres=# \set AUTOCOMMIT off
postgres=# \echo :AUTOCOMMIT
off
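Alternatively, wrap the whole batch in an explicit transaction. A minimal sketch, using the placeholder table tab_1 and example values:

BEGIN;
INSERT INTO tab_1 VALUES (1, 'a');
INSERT INTO tab_1 VALUES (2, 'b');
-- ... remaining rows of the batch ...
COMMIT;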

2. Create indexes after the import, or drop existing indexes during the import

If you are loading a freshly created table, the fastest method is to create the table, bulk-load the data with COPY, and then create the indexes the table needs. Creating an index on pre-existing data is faster than updating it incrementally as each row is loaded.
If you are adding a lot of data to an existing table, you can drop the indexes first, import the data, and then recreate the indexes, as sketched below. Of course, while the indexes are missing, query performance for other users of the database may suffer. You should also think carefully before dropping a unique index, because the error checking provided by the unique constraint is lost while the index is gone. (Carefully weigh the impact of indexes.)
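A sketch of the drop-and-recreate workflow; the index name idx_tab_1_col1, table tab_1, and column col1 are placeholders:

DROP INDEX IF EXISTS idx_tab_1_col1;
-- bulk-load the data here, e.g. with COPY
CREATE INDEX idx_tab_1_col1 ON tab_1 (col1);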

3. Drop foreign key constraints

As with indexes, it is more efficient to check a foreign key constraint in bulk than to check it row by row. So it also pays off to drop foreign key constraints, import the table data, and then recreate the constraints.
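A sketch of the same approach for a foreign key; the constraint name tab_1_ref_id_fkey, the column ref_id, and the referenced table ref_table are placeholders:

ALTER TABLE tab_1 DROP CONSTRAINT tab_1_ref_id_fkey;
-- bulk-load the data here
ALTER TABLE tab_1 ADD CONSTRAINT tab_1_ref_id_fkey
    FOREIGN KEY (ref_id) REFERENCES ref_table (id);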

4. Increase maintenance_work_mem

Temporarily increasing maintenance_work_mem can improve performance when loading large amounts of data. This parameter can also help speed up the CREATE INDEX and ALTER TABLE ADD FOREIGN KEY commands. It won't do much for COPY itself, but it can speed up the creation of indexes and foreign key constraints.
postgres=# show maintenance_work_mem;
 maintenance_work_mem
----------------------
 64MB
(1 row)
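The parameter can be raised for just the current session, so other workloads are unaffected. A sketch, with 1GB chosen purely as an illustration:

SET maintenance_work_mem = '1GB';
-- create indexes / add foreign key constraints here
RESET maintenance_work_mem;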
5. Change single-value inserts to multi-value inserts
Inserting several rows per INSERT statement reduces per-row SQL parsing and planning overhead.
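For example, with the placeholder table tab_1:

-- three separate statements, parsed and planned three times:
INSERT INTO tab_1 VALUES (1, 'a');
INSERT INTO tab_1 VALUES (2, 'b');
INSERT INTO tab_1 VALUES (3, 'c');
-- one multi-value statement, parsed and planned once:
INSERT INTO tab_1 VALUES (1, 'a'), (2, 'b'), (3, 'c');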
6. Turn off archive mode and lower the WAL level
When loading large amounts of data into an installation that uses WAL archiving or streaming replication, it can be faster to take a new base backup after the load has finished than to process the large volume of incremental WAL it would generate.
To avoid incremental WAL logging during the load, you can temporarily set wal_level to minimal, archive_mode to off, and max_wal_senders to 0, which disables archiving and streaming replication. Note that changing these settings requires a server restart.
postgres=# show wal_level;
 wal_level
-----------
 minimal
(1 row)

postgres=# show archive_mode;
 archive_mode
--------------
 off
(1 row)

postgres=# show max_wal_senders;
 max_wal_senders
-----------------
 0
(1 row)
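The same settings can also be written with ALTER SYSTEM instead of editing postgresql.conf by hand; either way they only take effect after a restart, and the original values should be restored (followed by a fresh base backup) once the import is done:

ALTER SYSTEM SET wal_level = 'minimal';
ALTER SYSTEM SET archive_mode = 'off';
ALTER SYSTEM SET max_wal_senders = 0;
-- restart the server for these settings to take effect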
7. Increase max_wal_size
Temporarily increasing the max_wal_size configuration variable can also speed up large data loads. Loading a large amount of data into PostgreSQL causes checkpoints to occur more often than the normal checkpoint frequency (specified by the checkpoint_timeout configuration variable).
When a checkpoint occurs, all dirty pages must be flushed to disk. Temporarily increasing max_wal_size during the bulk load reduces the number of checkpoints.
postgres=# show max_wal_size;
 max_wal_size
--------------
 1GB
(1 row)
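Unlike the settings in the previous section, max_wal_size can be changed without a restart; a configuration reload is enough. A sketch, with 10GB chosen purely as an illustration:

ALTER SYSTEM SET max_wal_size = '10GB';
SELECT pg_reload_conf();
-- ... run the bulk load, then restore the original value:
ALTER SYSTEM RESET max_wal_size;
SELECT pg_reload_conf();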
8. Use COPY instead of INSERT
COPY is optimized for bulk data loading.
The COPY command is optimized for loading large numbers of rows; it is less flexible than INSERT, but for large loads it incurs far less overhead. Because COPY is a single statement, there is no need to turn off autocommit when populating the table this way.
If you cannot use COPY, it is more efficient to use PREPARE to create a prepared INSERT statement and then EXECUTE it as many times as needed. This avoids the overhead of repeatedly parsing and planning the INSERT.
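A sketch of both approaches; the file path, the prepared-statement name bulk_ins, and the column types are placeholders:

-- server-side COPY from a file readable by the PostgreSQL server process
COPY tab_1 FROM '/path/to/data.csv' WITH (FORMAT csv);

-- fallback when COPY is not an option: a prepared INSERT executed repeatedly
PREPARE bulk_ins (int, text) AS INSERT INTO tab_1 VALUES ($1, $2);
EXECUTE bulk_ins(1, 'a');
EXECUTE bulk_ins(2, 'b');
DEALLOCATE bulk_ins;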
9. Disable triggers
Before importing the data, disable the triggers on the affected tables and re-enable them after the import completes.
ALTER TABLE tab_1 DISABLE TRIGGER ALL;
-- import the data
ALTER TABLE tab_1 ENABLE TRIGGER ALL;
10. A related tool: pg_bulkload
pg_bulkload is a high-speed data loading tool for PostgreSQL and an alternative to the COPY command. Its biggest advantage is speed: in direct mode, pg_bulkload bypasses the shared buffers and WAL buffers and writes directly to the data files. It also provides recovery functionality, so a failed import can be recovered.
Address: https://github.com/ossc-db/pg_bulkload
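As a rough illustration only: pg_bulkload is typically driven by a control file passed on the command line together with libpq-style connection options. The control-file keys, paths, and database name below are assumptions modeled on the project's sample files; check them against the pg_bulkload documentation before use.

# sample.ctl (illustrative; key names are assumptions, not verified)
OUTPUT = public.tab_1        # target table
INPUT = /path/to/data.csv    # input data file
TYPE = CSV                   # input format
WRITER = DIRECT              # direct mode: bypass shared buffers and WAL

$ pg_bulkload sample.ctl -d mydb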
11. After importing the data, run ANALYZE
Running ANALYZE or VACUUM ANALYZE ensures that the planner has up-to-date statistics on table data.

If there are no statistics or the statistics are too old, the planner may choose an execution plan with poor performance, resulting in poor query performance for the table.
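For example, with the placeholder table tab_1:

ANALYZE tab_1;
-- or, to also reclaim space from any dead tuples:
VACUUM ANALYZE tab_1;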
