I work in an environment where computational resources are shared: we have a few server machines, each equipped with a few Nvidia Titan X GPUs.
For small to moderate size models, the 12 GB of the Titan X is usually enough for 2–3 people to run training concurrently on the same GPU. If the models are small enough that a single model does not take full advantage of all the computational units of the GPU, this can actually result in a speedup compared with running one training process after the other. Even in cases where the concurrent access to the GPU does slow down the individual training time, it is still nice to have the flexibility of having multiple users simultaneously train on the GPU.
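The problem with TensorFlow is that, by default, it allocates the full amount of available GPU memory when it is launched. Even for a small two-layer neural network, I see that all 12 GB of the GPU memory is used up.
Is there a way to make TensorFlow allocate only, say, 4 GB of GPU memory, if one knows that this is enough for a given model?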
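All the answers above assume execution with a sess.run() call, which is becoming the exception rather than the rule in recent versions of TensorFlow.
When using the tf.estimator framework (TensorFlow 1.4 and above), the way to pass the fraction along to the implicitly created MonitoredTrainingSession is: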
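import tensorflow as tf

# Cap this process at roughly one third of each visible GPU's memory.
opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
conf = tf.ConfigProto(gpu_options=opts)
# Hand the session config to the Estimator through its RunConfig.
trainingConfig = tf.estimator.RunConfig(session_config=conf, ...)
tf.estimator.Estimator(model_fn=...,
                       config=trainingConfig)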
Similarly, in Eager mode (TensorFlow 1.5 and above):
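import tensorflow as tf
import tensorflow.contrib.eager as tfe  # tfe is the tf.contrib.eager module

opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
conf = tf.ConfigProto(gpu_options=opts)
tfe.enable_eager_execution(config=conf)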
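For context, the 0.333 used in the snippets above is simply the desired limit divided by the card's total memory: about 4 GB out of the Titan X's 12 GB from the question. A minimal sketch of that arithmetic follows; the helper name memory_fraction is hypothetical and not part of TensorFlow, and the result is what you would pass as per_process_gpu_memory_fraction.

```python
def memory_fraction(limit_gb, total_gb):
    """Return the share of a GPU's total memory that limit_gb represents.

    Hypothetical helper for illustration; it is not a TensorFlow API.
    """
    if not 0 < limit_gb <= total_gb:
        raise ValueError("limit_gb must be greater than 0 and at most total_gb")
    return limit_gb / total_gb


# 4 GB out of a 12 GB Titan X is one third of the card.
fraction = memory_fraction(4, 12)
print(round(fraction, 3))  # 0.333
```

Keep in mind that the fraction is applied to each GPU visible to the process, so on a machine with differently sized cards the same value gives a different absolute limit on each one.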
Edit: 11-04-2018
As an example, if you are to use tf.contrib.gan.train, then you can use something similar to the following:
tf.contrib.gan.gan_train(........, config=conf)
TensorFlow 2.0 Beta and (probably) beyond
The API changed again. It can now be found in:
tf.config.experimental.set_memory_growth(
device,
enable
)
Aliases:
tf.compat.v1.config.experimental.set_memory_growth
tf.compat.v2.config.experimental.set_memory_growth
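As a rough sketch of how this might be used in TensorFlow 2.x: the function below either turns on memory growth for every visible GPU or, if a limit is given, caps each GPU with a single virtual device. The helper name configure_gpu_memory and its memory_limit_mb parameter are hypothetical. The virtual-device calls are taken from the TensorFlow GPU guide of that era rather than from this answer, and newer releases expose them under non-experimental names, so treat this as an assumption to check against your version.

```python
import tensorflow as tf


def configure_gpu_memory(memory_limit_mb=None):
    """Hypothetical helper; call once, before any op initializes the GPUs.

    With no limit, memory growth is enabled so allocation starts small and
    expands on demand. With a limit, each GPU is exposed as one virtual
    device capped at memory_limit_mb. The two modes are not combined on the
    same device.
    """
    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        if memory_limit_mb is None:
            tf.config.experimental.set_memory_growth(gpu, True)
        else:
            tf.config.experimental.set_virtual_device_configuration(
                gpu,
                [tf.config.experimental.VirtualDeviceConfiguration(
                    memory_limit=memory_limit_mb)],
            )
    return gpus
```

For the 4 GB case from the question, the call would be configure_gpu_memory(memory_limit_mb=4096). Memory growth alone does not impose an upper bound; it only avoids grabbing all of the memory up front.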
References:
https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/config/experimental/set_memory_growth
https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth
See also:
TensorFlow guide, Use a GPU: https://www.tensorflow.org/guide/gpu
For TensorFlow 2.0 Alpha, see this answer