I am testing Postgres insertion performance. I have a table with one column with number as its data type, and there is an index on it. I filled the database using this query:
insert into aNumber (id) values (564),(43536),(34560) ...
I inserted 4 million rows very quickly, 10,000 at a time with the query above. After the database reached 6 million rows, performance drastically declined to 1 million rows every 15 minutes. Are there any tricks to increase insertion performance? I need optimal insertion performance on this project.
Using Windows 7 Pro on a machine with 5 GB of RAM.
Current answer
See Populating a Database in the PostgreSQL manual, depesz's excellent article on the topic, and this SO question.
(Note that this answer is about bulk-loading data into an existing DB or creating a new one. If you're interested in DB restore performance with pg_restore or psql execution of pg_dump output, much of this doesn't apply, since pg_dump and pg_restore already do things like creating triggers and indexes after they finish a schema + data restore.)
There's a lot you can do. The ideal solution is to import into an UNLOGGED table without indexes, then change it to logged and add the indexes. Unfortunately, in PostgreSQL 9.4 there is no support for changing a table from UNLOGGED to logged. 9.5 adds ALTER TABLE ... SET LOGGED to permit you to do this.
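As a minimal sketch of that workflow on 9.5+ using psycopg2 (the connection string, input file, and index name are placeholders, not from the question):

import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder connection string
conn.autocommit = True                  # run each statement immediately
cur = conn.cursor()

cur.execute("CREATE UNLOGGED TABLE aNumber (id numeric)")           # no index yet
with open("ids.csv") as f:                                          # hypothetical input file
    cur.copy_expert("COPY aNumber (id) FROM STDIN WITH (FORMAT csv)", f)
cur.execute("ALTER TABLE aNumber SET LOGGED")                       # PostgreSQL 9.5+
cur.execute("CREATE INDEX aNumber_id_idx ON aNumber (id)")          # build the index in one pass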
If you can take the database offline for the bulk import, use pg_bulkload.
Otherwise:
- Disable any triggers on the table
- Drop indexes before starting the import, re-create them afterwards. (It takes much less time to build an index in one pass than it does to add the same data to it progressively, and the resulting index is much more compact).
- If doing the import within a single transaction, it's safe to drop foreign key constraints, do the import, and re-create the constraints before committing. Do not do this if the import is split across multiple transactions as you might introduce invalid data.
- If possible, use COPY instead of INSERTs
- If you can't use COPY consider using multi-valued INSERTs if practical. You seem to be doing this already. Don't try to list too many values in a single VALUES though; those values have to fit in memory a couple of times over, so keep it to a few hundred per statement.
- Batch your inserts into explicit transactions, doing hundreds of thousands or millions of inserts per transaction. There's no practical limit AFAIK, but batching will let you recover from an error by marking the start of each batch in your input data. Again, you seem to be doing this already. (A short sketch combining batching, multi-valued INSERTs and synchronous_commit follows this list.)
- Use synchronous_commit=off and a huge commit_delay to reduce fsync() costs. This won't help much if you've batched your work into big transactions, though.
- INSERT or COPY in parallel from several connections. How many depends on your hardware's disk subsystem; as a rule of thumb, you want one connection per physical hard drive if using direct attached storage.
- Set a high max_wal_size value (checkpoint_segments in older versions) and enable log_checkpoints. Look at the PostgreSQL logs and make sure it's not complaining about checkpoints occurring too frequently.
- If and only if you don't mind losing your entire PostgreSQL cluster (your database and any others on the same cluster) to catastrophic corruption if the system crashes during the import, you can stop Pg, set fsync=off, start Pg, do your import, then (vitally) stop Pg and set fsync=on again. See WAL configuration. Do not do this if there is already any data you care about in any database on your PostgreSQL install. If you set fsync=off you can also set full_page_writes=off; again, just remember to turn it back on after your import to prevent database corruption and data loss. See non-durable settings in the Pg manual.
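As a rough sketch of batched, multi-valued inserts with session-level synchronous_commit=off (psycopg2 again; the connection string, generated data, and batch sizes are assumptions for illustration):

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=test")        # placeholder connection string
cur = conn.cursor()
cur.execute("SET synchronous_commit TO off")  # fewer fsync() waits; a crash can lose the last few commits but won't corrupt data

rows = [(i,) for i in range(1, 1_000_001)]    # hypothetical data to load
batch = 100_000                               # one explicit transaction per batch
for start in range(0, len(rows), batch):
    execute_values(cur, "INSERT INTO aNumber (id) VALUES %s",
                   rows[start:start + batch], page_size=500)  # a few hundred values per statement
    conn.commit()                             # each commit marks a recoverable point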
You should also look at tuning your system:
- Use good quality SSDs for storage as much as possible. Good SSDs with reliable, power-protected write-back caches make commit rates incredibly faster. They're less beneficial when you follow the advice above - which reduces disk flushes / number of fsync()s - but can still be a big help. Do not use cheap SSDs without proper power-failure protection unless you don't care about keeping your data.
- If you're using RAID 5 or RAID 6 for direct attached storage, stop now. Back your data up, restructure your RAID array to RAID 10, and try again. RAID 5/6 are hopeless for bulk write performance - though a good RAID controller with a big cache can help.
- If you have the option of using a hardware RAID controller with a big battery-backed write-back cache this can really improve write performance for workloads with lots of commits. It doesn't help as much if you're using async commit with a commit_delay or if you're doing fewer big transactions during bulk loading.
- If possible, store WAL (pg_wal, or pg_xlog in old versions) on a separate disk / disk array. There's little point in using a separate filesystem on the same disk. People often choose to use a RAID1 pair for WAL. Again, this has more effect on systems with high commit rates, and it has little effect if you're using an unlogged table as the data load target.
You may also be interested in Optimise PostgreSQL for fast testing.
Other answers
In addition to Craig Ringer's post and depesz's blog post, if you would like to speed up your inserts through the ODBC (psqlodbc) interface by using prepared-statement inserts inside a transaction, there are a few extra things you need to do to make it work fast:
- Set the level-of-rollback-on-errors to "Transaction" by specifying Protocol=-1 in the connection string. By default psqlodbc uses "Statement" level, which creates a SAVEPOINT for each statement rather than an entire transaction, making inserts slower.
- Use server-side prepared statements by specifying UseServerSidePrepare=1 in the connection string. Without this option the client sends the entire insert statement along with each row being inserted.
- Disable auto-commit on each statement using SQLSetConnectAttr(conn, SQL_ATTR_AUTOCOMMIT, reinterpret_cast<SQLPOINTER>(SQL_AUTOCOMMIT_OFF), 0);
- Once all rows have been inserted, commit the transaction using SQLEndTran(SQL_HANDLE_DBC, conn, SQL_COMMIT);. There is no need to explicitly open a transaction. (A rough pyodbc illustration of these steps follows this list.)
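For illustration only, here is roughly how those settings might map onto a Python client through pyodbc (the driver name, host, credentials, and data are assumptions; pyodbc's autocommit=False and commit() correspond to the SQLSetConnectAttr and SQLEndTran calls above, and executemany() should prepare the statement once and re-execute it per row):

import pyodbc

# Protocol=-1 and UseServerSidePrepare=1 are the psqlodbc options described above;
# the driver name, host and credentials below are placeholders.
conn_str = ("Driver={PostgreSQL Unicode};Server=localhost;Port=5432;"
            "Database=test;Uid=postgres;Pwd=secret;"
            "Protocol=-1;UseServerSidePrepare=1")

cnxn = pyodbc.connect(conn_str, autocommit=False)   # SQL_ATTR_AUTOCOMMIT off
cur = cnxn.cursor()
cur.executemany("INSERT INTO aNumber (id) VALUES (?)",
                [(i,) for i in range(1, 100_001)])  # prepared once, executed per row
cnxn.commit()                                       # SQLEndTran(..., SQL_COMMIT)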
Unfortunately, psqlodbc "implements" SQLBulkOperations by issuing a series of unprepared insert statements, so to achieve the fastest inserts you need to code up the above steps manually.
Use COPY table ... WITH BINARY, which according to the documentation is "somewhat faster than the text and CSV formats." Only do this if you have millions of rows to insert and if you are comfortable with binary data.
Here is an example Python recipe using psycopg2 with binary input.
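The linked recipe isn't reproduced here; as a rough illustration of the binary COPY wire format (signature and header, per-row field count and field length, trailer), here is a minimal psycopg2 sketch that assumes the target column is bigint rather than the question's numeric, since the numeric binary encoding is considerably more involved:

import io
import struct
import psycopg2

def int8_binary_copy(values):
    # Build an in-memory stream in PostgreSQL's binary COPY format for one int8 column.
    buf = io.BytesIO()
    buf.write(b"PGCOPY\n\xff\r\n\x00")           # 11-byte signature
    buf.write(struct.pack(">ii", 0, 0))          # flags field, header extension length
    for v in values:
        buf.write(struct.pack(">hiq", 1, 8, v))  # 1 field, 8-byte length, int8 value
    buf.write(struct.pack(">h", -1))             # file trailer
    buf.seek(0)
    return buf

conn = psycopg2.connect("dbname=test")           # placeholder connection string
with conn, conn.cursor() as cur:
    cur.copy_expert("COPY aNumber (id) FROM STDIN WITH (FORMAT binary)",
                    int8_binary_copy(range(1, 1_000_001)))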
I spent around 6 hours on the same issue today. Inserts go at a "regular" speed (less than 3 sec per 100K) up until 5MI (out of a total of 30MI) rows, and then the performance drops drastically (all the way down to 1 min per 100K).
I won't list all of the things that did not work and will cut straight to the point.
I dropped the primary key on the target table (which was a GUID) and my 30MI rows happily flowed to their destination at a constant speed of less than 3 sec per 100K.
I also ran into this insert performance problem. My solution is to spawn some Go routines to finish the insert work. In the meantime, SetMaxOpenConns should be given a proper number, otherwise too many open connection errors will be raised.
// Assumes: import ("database/sql"; "fmt"; "log"; "sync") and a registered Postgres
// driver such as _ "github.com/lib/pq"; connStr, maxOpenConns and queries are
// placeholders supplied by the caller.
db, err := sql.Open("postgres", connStr)
if err != nil {
	log.Fatal(err)
}
defer db.Close()
db.SetMaxOpenConns(maxOpenConns) // some config integer; caps concurrent connections so Postgres doesn't reject them

var wg sync.WaitGroup
for _, query := range queries {
	wg.Add(1)
	go func(msg string) { // one goroutine per insert statement
		defer wg.Done()
		if _, err := db.Exec(msg); err != nil {
			fmt.Println(err)
		}
	}(query)
}
wg.Wait()
Loading speed was much faster for my project. This code snippet just gives an idea of how it works; readers should be able to modify it easily.