我正在测试Postgres插入性能。我有一个表,其中有一列,其数据类型为数字。还有一个索引。我用这个查询填充了数据库:
insert into aNumber (id) values (564),(43536),(34560) ...
我非常快地插入了400万行,使用上面的查询一次插入10,000行。在数据库达到600万行之后,性能急剧下降到每15分钟100万行。有什么技巧可以提高插入性能吗?我需要在这个项目上的最佳插入性能。
在内存为5gb的机器上使用Windows 7 Pro。
我正在测试Postgres插入性能。我有一个表,其中有一列,其数据类型为数字。还有一个索引。我用这个查询填充了数据库:
insert into aNumber (id) values (564),(43536),(34560) ...
我非常快地插入了400万行,使用上面的查询一次插入10,000行。在数据库达到600万行之后,性能急剧下降到每15分钟100万行。有什么技巧可以提高插入性能吗?我需要在这个项目上的最佳插入性能。
在内存为5gb的机器上使用Windows 7 Pro。
当前回答
我今天花了大约6个小时在同一个问题上。插入以“常规”速度(每100K小于3秒)进行,直到5MI(总共30MI)行,然后性能急剧下降(一直下降到每100K 1分钟)。
我不会列出所有不起作用的事情,直接切入正题。
我在目标表上放置了一个主键(这是一个GUID),我的30MI或行愉快地以每100K不到3秒的恒定速度流到目的地。
其他回答
除了Craig Ringer的文章和depesz的博客文章外,如果您想通过在事务中使用预处理语句插入来加快通过ODBC (psqlodbc)接口的插入速度,还需要做一些额外的事情来使其快速工作:
Set the level-of-rollback-on-errors to "Transaction" by specifying Protocol=-1 in the connection string. By default psqlodbc uses "Statement" level, which creates a SAVEPOINT for each statement rather than an entire transaction, making inserts slower. Use server-side prepared statements by specifying UseServerSidePrepare=1 in the connection string. Without this option the client sends the entire insert statement along with each row being inserted. Disable auto-commit on each statement using SQLSetConnectAttr(conn, SQL_ATTR_AUTOCOMMIT, reinterpret_cast<SQLPOINTER>(SQL_AUTOCOMMIT_OFF), 0); Once all rows have been inserted, commit the transaction using SQLEndTran(SQL_HANDLE_DBC, conn, SQL_COMMIT);. There is no need to explicitly open a transaction.
不幸的是,psqlodbc通过发出一系列未准备好的插入语句来“实现”SQLBulkOperations,因此为了实现最快的插入,需要手动编写上述步骤。
使用COPY表…用二进制,根据文档“比文本和CSV格式快一些。”只有当您有数百万行要插入,并且您对二进制数据感到满意时才这样做。
下面是一个使用psycopg2和二进制输入的Python食谱示例。
我今天花了大约6个小时在同一个问题上。插入以“常规”速度(每100K小于3秒)进行,直到5MI(总共30MI)行,然后性能急剧下降(一直下降到每100K 1分钟)。
我不会列出所有不起作用的事情,直接切入正题。
我在目标表上放置了一个主键(这是一个GUID),我的30MI或行愉快地以每100K不到3秒的恒定速度流到目的地。
如果你碰巧插入带有uuid的列(这并不完全是你的情况)并添加到@Dennis的答案(我还不能评论),建议使用gen_random_uuid()(需要PG 9.4和pgcrypto模块)比uuid_generate_v4()快(很多)
=# explain analyze select uuid_generate_v4(),* from generate_series(1,10000);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series (cost=0.00..12.50 rows=1000 width=4) (actual time=11.674..10304.959 rows=10000 loops=1)
Planning time: 0.157 ms
Execution time: 13353.098 ms
(3 filas)
vs
=# explain analyze select gen_random_uuid(),* from generate_series(1,10000);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series (cost=0.00..12.50 rows=1000 width=4) (actual time=252.274..418.137 rows=10000 loops=1)
Planning time: 0.064 ms
Execution time: 503.818 ms
(3 filas)
而且,这是建议的官方方式
请注意 如果您只需要随机生成的uuid(版本4),可以考虑使用pgcrypto模块中的gen_random_uuid()函数。
这将3.7M行的插入时间从大约2小时降低到大约10分钟。
请参阅PostgreSQL手册中的填充数据库,depesz关于该主题的优秀文章,以及这个SO问题。
(请注意,这个答案是关于批量加载数据到现有的DB或创建一个新的DB。如果你感兴趣的是数据库恢复性能与pg_restore或pg_dump输出的psql执行,大部分这并不适用,因为pg_dump和pg_restore已经做的事情,如创建触发器和索引后,它完成一个模式+数据恢复)。
有很多事情要做。理想的解决方案是导入到一个没有索引的UNLOGGED表中,然后将其更改为logged并添加索引。不幸的是,在PostgreSQL 9.4中,不支持将表从UNLOGGED更改为logged。9.5添加ALTER TABLE…SET LOGGED允许你这样做。
如果可以让数据库脱机进行批量导入,请使用pg_bulkload。
否则:
Disable any triggers on the table Drop indexes before starting the import, re-create them afterwards. (It takes much less time to build an index in one pass than it does to add the same data to it progressively, and the resulting index is much more compact). If doing the import within a single transaction, it's safe to drop foreign key constraints, do the import, and re-create the constraints before committing. Do not do this if the import is split across multiple transactions as you might introduce invalid data. If possible, use COPY instead of INSERTs If you can't use COPY consider using multi-valued INSERTs if practical. You seem to be doing this already. Don't try to list too many values in a single VALUES though; those values have to fit in memory a couple of times over, so keep it to a few hundred per statement. Batch your inserts into explicit transactions, doing hundreds of thousands or millions of inserts per transaction. There's no practical limit AFAIK, but batching will let you recover from an error by marking the start of each batch in your input data. Again, you seem to be doing this already. Use synchronous_commit=off and a huge commit_delay to reduce fsync() costs. This won't help much if you've batched your work into big transactions, though. INSERT or COPY in parallel from several connections. How many depends on your hardware's disk subsystem; as a rule of thumb, you want one connection per physical hard drive if using direct attached storage. Set a high max_wal_size value (checkpoint_segments in older versions) and enable log_checkpoints. Look at the PostgreSQL logs and make sure it's not complaining about checkpoints occurring too frequently. If and only if you don't mind losing your entire PostgreSQL cluster (your database and any others on the same cluster) to catastrophic corruption if the system crashes during the import, you can stop Pg, set fsync=off, start Pg, do your import, then (vitally) stop Pg and set fsync=on again. See WAL configuration. Do not do this if there is already any data you care about in any database on your PostgreSQL install. If you set fsync=off you can also set full_page_writes=off; again, just remember to turn it back on after your import to prevent database corruption and data loss. See non-durable settings in the Pg manual.
你还应该考虑优化你的系统:
Use good quality SSDs for storage as much as possible. Good SSDs with reliable, power-protected write-back caches make commit rates incredibly faster. They're less beneficial when you follow the advice above - which reduces disk flushes / number of fsync()s - but can still be a big help. Do not use cheap SSDs without proper power-failure protection unless you don't care about keeping your data. If you're using RAID 5 or RAID 6 for direct attached storage, stop now. Back your data up, restructure your RAID array to RAID 10, and try again. RAID 5/6 are hopeless for bulk write performance - though a good RAID controller with a big cache can help. If you have the option of using a hardware RAID controller with a big battery-backed write-back cache this can really improve write performance for workloads with lots of commits. It doesn't help as much if you're using async commit with a commit_delay or if you're doing fewer big transactions during bulk loading. If possible, store WAL (pg_wal, or pg_xlog in old versions) on a separate disk / disk array. There's little point in using a separate filesystem on the same disk. People often choose to use a RAID1 pair for WAL. Again, this has more effect on systems with high commit rates, and it has little effect if you're using an unlogged table as the data load target.
你可能也会对优化PostgreSQL进行快速测试感兴趣。