我需要通过编程的方式将数千万条记录插入Postgres数据库。目前,我在一个查询中执行了数千条插入语句。

有没有更好的方法来做到这一点,一些我不知道的批量插入语句?


当前回答

带有数组的UNNEST函数可以与多行VALUES语法一起使用。我认为这个方法比使用COPY慢,但它对我在psycopg和python (python列表传递给游标)的工作中很有用。execute变成pg ARRAY

INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
VALUES (
    UNNEST(ARRAY[1, 2, 3]), 
    UNNEST(ARRAY[100, 200, 300]), 
    UNNEST(ARRAY['a', 'b', 'c'])
);

没有值使用subselect额外的存在性检查:

INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
SELECT * FROM (
    SELECT UNNEST(ARRAY[1, 2, 3]), 
           UNNEST(ARRAY[100, 200, 300]), 
           UNNEST(ARRAY['a', 'b', 'c'])
) AS temptable
WHERE NOT EXISTS (
    SELECT 1 FROM tablename tt
    WHERE tt.fieldname1=temptable.fieldname1
);

同样的语法用于批量更新:

UPDATE tablename
SET fieldname1=temptable.data
FROM (
    SELECT UNNEST(ARRAY[1,2]) AS id,
           UNNEST(ARRAY['a', 'b']) AS data
) AS temptable
WHERE tablename.id=temptable.id;

其他回答

我刚刚遇到了这个问题,建议将csvsql(发行版)批量导入到Postgres。要执行批量插入,只需创建b,然后使用csvsql,它连接到数据库,并为整个csv文件夹创建单独的表。

$ createdb test 
$ csvsql --db postgresql:///test --insert examples/*.csv

May be I'm late already. But, there is a Java library called pgbulkinsert by Bytefish. Me and my team were able to bulk insert 1 Million records in 15 seconds. Of course, there were some other operations that we performed like, reading 1M+ records from a file sitting on Minio, do couple of processing on the top of 1M+ records, filter down records if duplicates, and then finally insert 1M records into the Postgres Database. And all these processes were completed within 15 seconds. I don't remember exactly how much time it took to do the DB operation, but I think it was around less then 5 seconds. Find more details from https://www.bytefish.de/blog/pgbulkinsert_bulkprocessor.html

带有数组的UNNEST函数可以与多行VALUES语法一起使用。我认为这个方法比使用COPY慢,但它对我在psycopg和python (python列表传递给游标)的工作中很有用。execute变成pg ARRAY

INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
VALUES (
    UNNEST(ARRAY[1, 2, 3]), 
    UNNEST(ARRAY[100, 200, 300]), 
    UNNEST(ARRAY['a', 'b', 'c'])
);

没有值使用subselect额外的存在性检查:

INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
SELECT * FROM (
    SELECT UNNEST(ARRAY[1, 2, 3]), 
           UNNEST(ARRAY[100, 200, 300]), 
           UNNEST(ARRAY['a', 'b', 'c'])
) AS temptable
WHERE NOT EXISTS (
    SELECT 1 FROM tablename tt
    WHERE tt.fieldname1=temptable.fieldname1
);

同样的语法用于批量更新:

UPDATE tablename
SET fieldname1=temptable.data
FROM (
    SELECT UNNEST(ARRAY[1,2]) AS id,
           UNNEST(ARRAY['a', 'b']) AS data
) AS temptable
WHERE tablename.id=temptable.id;

PostgreSQL有一个关于如何最好地初始填充数据库的指南,他们建议使用COPY命令批量加载行。该指南还提供了其他一些关于如何加快处理速度的好技巧,比如在加载数据之前删除索引和外键(然后再将它们添加回来)。

使用COPY还有一种替代方法,即Postgres支持的多行值语法。从文档中可以看到:

INSERT INTO films (code, title, did, date_prod, kind) VALUES
    ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
    ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');

上面的代码插入了两行,但是您可以任意扩展它,直到达到预处理语句令牌的最大数量(可能是999美元,但我不能100%确定)。有时不能使用COPY,对于这些情况,这是一个有价值的替代品。