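I need to programmatically insert tens of millions of records into a Postgres database. Currently, I am executing thousands of INSERT statements in a single query.
Is there a better way to do this, some bulk insert statement I don't know about?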
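Current answer
It mostly depends on the (other) activity in the database. Operations like this effectively freeze the entire database for other sessions. Another consideration is the data model and the presence of constraints, triggers, etc.
My first approach is always: create a (temp) table with a structure similar to the target table (CREATE TABLE tmp AS SELECT * FROM target WHERE 1=0), and start by reading the file into the temp table. Then I check what can be checked: duplicates, keys that already exist in the target, etc.
Then I just do an INSERT INTO target SELECT * FROM tmp or similar.
If this fails, or takes too long, I abort it and consider other methods (temporarily dropping indexes/constraints, etc.).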
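The staging-table workflow described in this answer can be sketched from Python as follows. This is only an illustration under stated assumptions: the helper names `staging_load_statements` and `run_staging_load`, the key column `id`, and the CSV input format are hypothetical, and the cursor is assumed to be a psycopg2 cursor (for its `copy_expert` method). Duplicates inside the file itself are not handled here.

```python
import re

# Only plain identifiers are accepted, because names are interpolated into SQL.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def staging_load_statements(target, staging, key):
    """Return the SQL statements for loading `target` through an empty staging copy."""
    for name in (target, staging, key):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return [
        # 1. Empty table with the same columns as the target (no indexes, no constraints).
        f"CREATE TEMP TABLE {staging} AS SELECT * FROM {target} WHERE 1=0",
        # 2. Bulk-load the file into the staging table; the client streams the data.
        f"COPY {staging} FROM STDIN WITH (FORMAT csv)",
        # 3. Check what can be checked: drop rows whose key already exists in the target.
        f"DELETE FROM {staging} s USING {target} t WHERE s.{key} = t.{key}",
        # 4. Move the remaining rows in a single statement.
        f"INSERT INTO {target} SELECT * FROM {staging}",
    ]


def run_staging_load(cursor, csv_file, target, staging="tmp", key="id"):
    """Run the four steps on a psycopg2 cursor (assumed); csv_file is an open file object."""
    create, copy, dedupe, insert = staging_load_statements(target, staging, key)
    cursor.execute(create)
    cursor.copy_expert(copy, csv_file)
    cursor.execute(dedupe)
    cursor.execute(insert)
```

Running all four steps in one transaction means an abort leaves the target untouched, which fits the "abort it and consider other methods" fallback.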
Other answers
I implemented a very fast PostgreSQL data loader using native libpq methods. Try my package: https://www.nuget.org/packages/NpgsqlBulkCopy/
I may be late already, but there is a Java library called pgbulkinsert by Bytefish. My team and I were able to bulk insert 1 million records in 15 seconds. Of course, that included other operations as well: reading 1M+ records from a file sitting on Minio, doing some processing on top of those records, filtering out duplicates, and finally inserting 1M records into the Postgres database. All of these steps completed within 15 seconds. I don't remember exactly how long the DB operation itself took, but I think it was less than 5 seconds. More details at https://www.bytefish.de/blog/pgbulkinsert_bulkprocessor.html
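As others have noted, when importing data into Postgres, things are slowed down by the checks that Postgres is designed to do for you. Also, you often need to manipulate the data in one way or another so that it is suitable for use. Any of this that can be done outside of the Postgres process means that you can import using the COPY protocol.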
For my own use I regularly import data from the httparchive.org project using pgloader. As the source files are created by MySQL, you need to be able to handle some MySQL oddities, such as the use of \N for an empty value, along with encoding problems. The files are also so large that, at least on my machine, using FDW runs out of memory. pgloader makes it easy to create a pipeline that lets you select the fields you want, cast them to the relevant data types, and do any additional work before the data goes into your main database, so that index updates, etc. are minimal.
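As a rough illustration of preparing data outside Postgres and then importing it over the COPY protocol, here is a minimal Python sketch. The helper names `rows_to_csv_buffer` and `copy_rows` are hypothetical, the cursor is assumed to be a psycopg2 cursor (its `copy_expert` method sends the file object to `COPY ... FROM STDIN`), and table and column names are assumed to come from trusted code because they are interpolated unquoted.

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Serialise an iterable of row tuples into an in-memory CSV file object."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        # csv.writer emits None as an empty field. In rows with more than one
        # column that field is unquoted, which COPY in CSV format reads as NULL
        # by default. Empty strings are not distinguished from None here.
        writer.writerow(row)
    buf.seek(0)  # rewind so COPY reads from the start
    return buf


def copy_rows(cursor, table, columns, rows):
    """Stream all rows into `table` with one COPY statement (psycopg2 cursor assumed)."""
    sql = "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(table, ", ".join(columns))
    cursor.copy_expert(sql, rows_to_csv_buffer(rows))
```

For tens of millions of rows, the buffer would likely need to be filled and sent in chunks rather than built in memory all at once.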
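The query below creates a test table with a generate_series column that has 10000 rows. I usually create test tables like this to test query performance; see generate_series() for details: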
CREATE TABLE test AS SELECT generate_series(1, 10000);
postgres=# SELECT count(*) FROM test;
count
-------
10000
(1 row)
postgres=# SELECT * FROM test;
generate_series
-----------------
1
2
3
4
5
6
-- More --
And, if you already have the test table, run the query below to insert 10000 rows:
INSERT INTO test (generate_series) SELECT generate_series(1, 10000);
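The UNNEST function with arrays can be used along with the multi-row VALUES syntax. I think this method is slower than using COPY, but it is useful to me when working with psycopg and Python (a Python list passed to cursor.execute becomes a pg ARRAY):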
INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
VALUES (
UNNEST(ARRAY[1, 2, 3]),
UNNEST(ARRAY[100, 200, 300]),
UNNEST(ARRAY['a', 'b', 'c'])
);
Without VALUES, using a subselect with an additional existence check:
INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
SELECT * FROM (
SELECT UNNEST(ARRAY[1, 2, 3]) AS fieldname1,
UNNEST(ARRAY[100, 200, 300]) AS fieldname2,
UNNEST(ARRAY['a', 'b', 'c']) AS fieldname3
) AS temptable
WHERE NOT EXISTS (
SELECT 1 FROM tablename tt
WHERE tt.fieldname1=temptable.fieldname1
);
The same syntax works for bulk updates:
UPDATE tablename
SET fieldname1=temptable.data
FROM (
SELECT UNNEST(ARRAY[1,2]) AS id,
UNNEST(ARRAY['a', 'b']) AS data
) AS temptable
WHERE tablename.id=temptable.id;
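The remark above about Python lists becoming PostgreSQL arrays can be sketched as follows. This is an assumption-laden example, not the answer author's code: `rows_to_columns` and `insert_via_unnest` are hypothetical helpers, the cursor is assumed to be a psycopg2 cursor, the column types (`int`, `int`, `text`) are guessed from the sample values, and the multi-argument form of `unnest` in the FROM clause is used instead of one UNNEST call per column.

```python
def rows_to_columns(rows):
    """Transpose a list of row tuples into one Python list per column."""
    return tuple(list(col) for col in zip(*rows))


def insert_via_unnest(cursor, rows):
    """Insert all rows with a single statement by passing one array per column."""
    if not rows:
        return  # nothing to insert; zip(*rows) would yield no columns to unpack
    ids, numbers, labels = rows_to_columns(rows)
    cursor.execute(
        # psycopg2 adapts each Python list to a PostgreSQL ARRAY; the casts
        # pin down the element types (assumed here).
        "INSERT INTO tablename (fieldname1, fieldname2, fieldname3) "
        "SELECT * FROM unnest(%s::int[], %s::int[], %s::text[])",
        (ids, numbers, labels),
    )
```

This keeps the number of query parameters fixed at one per column, no matter how many rows are sent.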