I wrote a Python program that acts on a large input file to create a few million objects representing triangles. The algorithm is:

1. Read an input file.
2. Process the file and create a list of triangles, represented by their vertices.
3. Output the vertices in the OFF format: a list of vertices followed by a list of triangles. The triangles are represented by indices into the list of vertices.

The requirement of OFF that I print out the complete list of vertices before I print out the triangles means that I have to hold the list of triangles in memory before I write the output to the file. In the meanwhile, I'm getting memory errors because of the sizes of the lists.
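
For context, here is a minimal sketch of the OFF-writing step (write_off, vertices, and triangles are hypothetical names, not from the original program); it shows why both lists must be complete before anything can be written:

# Minimal sketch of the OFF output step; "vertices" and "triangles"
# are hypothetical placeholders for the data built by the algorithm.
def write_off(path, vertices, triangles):
    with open(path, "w") as out:
        out.write("OFF\n")
        # the header needs both counts, so both lists must already be complete
        out.write(f"{len(vertices)} {len(triangles)} 0\n")
        for x, y, z in vertices:
            out.write(f"{x} {y} {z}\n")
        for a, b, c in triangles:  # indices into the vertex list
            out.write(f"3 {a} {b} {c}\n")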

What is the best way to tell Python that I no longer need some of the data, and that it can be freed?


Current answer

I had a similar problem reading a graph from a file. The processing included computing a 200,000 x 200,000 float matrix (one line at a time) that did not fit into memory. Trying to free the memory between computations using gc.collect() fixed the memory-related aspect of the problem, but it resulted in performance issues: I don't know why, but even though the amount of used memory remained constant, each new call to gc.collect() took a little more time than the previous one, so the garbage collecting quickly took up most of the computation time.

To solve both the memory and performance issues, I switched to a multithreading trick I once read somewhere (sorry, I cannot find the related post anymore). Before, I was reading each line of the file in a big for loop, processing it, and running gc.collect() once in a while to free memory space. Now I call a function that reads and processes a chunk of the file in a new thread. Once the thread ends, the memory is automatically freed, without the strange performance issue.
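
For contrast, the earlier pattern described above might have looked roughly like this sketch (process() and the file name are placeholders, not from the original code):

import gc

def process(line):
    pass  # stand-in for the real per-line computation

with open("graph.txt") as f:  # "graph.txt" is a placeholder file name
    for i, line in enumerate(f):
        process(line)
        if i and i % 10_000 == 0:
            gc.collect()  # freed memory, but each call grew slower over time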

Practically speaking, it works like this:

from dask import delayed  # this module wraps the multithreading
def f(storage, index, chunk_size):  # the processing function
    # read the chunk of size chunk_size starting at index in the file
    # process it using data in storage if needed
    # append data needed for further computations to storage
    return storage

partial_result = delayed([])  # pass the constructor for your data structure into delayed()
# I personally use "delayed(nx.Graph())" since I am creating a networkx Graph
chunk_size = 100  # ideally you want this as big as possible while still enabling the computations to fit in memory
for index in range(0, len(file), chunk_size):  # "file" is a placeholder for however you measure the input's length
    # we indicate to dask that we will want to apply f to the parameters partial_result, index, chunk_size
    partial_result = delayed(f)(partial_result, index, chunk_size)

    # no computations are done yet!
    # dask will spawn a thread to run f(partial_result, index, chunk_size) once we call partial_result.compute()
    # passing the previous "partial_result" variable in the parameters ensures a chunk will only be processed after the previous one is done
    # it also allows you to use the results of the processing of the previous chunks in the file if needed

# this launches all the computations
result = partial_result.compute()

# one thread is spawned for each "delayed" one at a time to compute its result
# dask then closes the thread, which solves the memory freeing issue
# the strange performance issue with gc.collect() is also avoided

Other answers

As other answers have already said, Python can refrain from releasing memory to the operating system even though it's no longer being used by Python code (so gc.collect() doesn't free anything), especially in a long-running program. Anyway, if you're on Linux, you can try to release memory by invoking directly the libc function malloc_trim (man page). Something like:

import ctypes

# load the C standard library and ask glibc to return free heap memory
# to the operating system (Linux/glibc only)
libc = ctypes.CDLL("libc.so.6")
libc.malloc_trim(0)

The del statement might be of use, but IIRC it isn't guaranteed to free the memory. The docs are here… and a reason why it isn't released is here.
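
A minimal sketch of that pattern, assuming CPython, where dropping the last reference usually frees the object right away (though the allocator may hold on to the pages):

import gc

triangles = [(i, i + 1, i + 2) for i in range(1_000_000)]
# ... use triangles ...

del triangles  # remove the name; the list is freed once nothing references it
gc.collect()   # optionally also collect unreachable reference cycles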

I've heard of people on Linux and Unix-type systems spawning a python process to do some work, getting the result, and then killing it.
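
A portable way to get the same effect is the multiprocessing module; here is a minimal sketch, where build_triangles is a hypothetical stand-in for the memory-hungry work:

import multiprocessing as mp

def build_triangles(path):
    # hypothetical stand-in for the memory-hungry processing;
    # "path" is just a placeholder argument
    data = [(i, i + 1, i + 2) for i in range(1_000_000)]
    return len(data)

if __name__ == "__main__":
    with mp.Pool(processes=1) as pool:
        result = pool.apply(build_triangles, ("input.txt",))
    # the worker process has exited here, so all memory it used
    # has been returned to the operating system
    print(result)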

This article describes the Python garbage collector, but I think the lack of memory control is the downside of managed memory.

You can't explicitly free memory. What you need to do is make sure you don't keep references to objects. They will then be garbage collected, freeing the memory.

In your case, when you need the large lists, you typically need to reorganize the code, usually using generators/iterators instead. That way you don't need to have the large lists in memory.
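
A minimal sketch of the generator approach, assuming a hypothetical input format of three vertex indices per line:

def parse_triangle(line):
    # hypothetical format: three vertex indices per line
    a, b, c = map(int, line.split())
    return (a, b, c)

def triangles(path):
    # yield triangles one at a time instead of building the whole list
    with open(path) as f:
        for line in f:
            yield parse_triangle(line)

for tri in triangles("triangles.txt"):  # placeholder file name
    pass  # process each triangle here; only one is alive at a time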

If you don't care about vertex reuse, you could have two output files: one for vertices and one for triangles. Then append the triangle file to the vertex file when you are done.
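
A minimal sketch of that two-file approach, assuming a hypothetical get_triangles() generator and temporary file names; the vertex and face counts still need to be tracked for the OFF header:

import shutil

def get_triangles():
    # hypothetical source yielding one triangle (three vertices) at a time
    for i in range(3):
        yield ((i, 0, 0), (i, 1, 0), (i, 0, 1))

n_vertices = n_triangles = 0
with open("vertices.tmp", "w") as vf, open("faces.tmp", "w") as tf:
    for tri in get_triangles():
        indices = []
        for x, y, z in tri:
            vf.write(f"{x} {y} {z}\n")
            indices.append(n_vertices)
            n_vertices += 1
        tf.write("3 {} {} {}\n".format(*indices))
        n_triangles += 1

# stitch the two files together under the OFF header
with open("output.off", "w") as out:
    out.write(f"OFF\n{n_vertices} {n_triangles} 0\n")
    for tmp in ("vertices.tmp", "faces.tmp"):
        with open(tmp) as part:
            shutil.copyfileobj(part, out)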
