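How to get current available GPUs in TensorFlow?

I plan to use distributed TensorFlow, and I saw that TensorFlow can use GPUs for training and testing. In a cluster environment, each machine could have 0, 1, or more GPUs, and I want to run my TensorFlow graph on GPUs on as many machines as possible.

I found that when running tf.Session(), TensorFlow prints information about the GPU in log messages like the following: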

I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)

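My question is: how do I get information about the currently available GPUs from TensorFlow? I can read the loaded GPU information from the log, but I want to do it in a more sophisticated, programmatic way. I may also restrict GPUs intentionally with the CUDA_VISIBLE_DEVICES environment variable, so I am not looking for a way to get GPU information from the OS kernel.

In short, I want a function like tf.get_available_gpu() that returns ['/gpu:0', '/gpu:1'] if two GPUs are available on the machine. How can I implement this?

Current answer

In TensorFlow Core v2.3.0, the following code should work.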

import tensorflow as tf
visible_devices = tf.config.get_visible_devices()
for device in visible_devices:
  print(device)

Depending on your environment, this code will produce output like the following.

PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
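
The question asks for names in the '/gpu:0' form, while the devices printed above are named like '/physical_device:GPU:0'. A minimal sketch of the conversion follows, assuming TensorFlow 2.x. The helper names short_device_name and get_available_gpus are hypothetical and not part of the TensorFlow API, and the index mapping is an assumption that may not hold if virtual (logical) devices have been configured.

```python
def short_device_name(physical_name):
    """Turn '/physical_device:GPU:0' into '/gpu:0' (hypothetical helper)."""
    # The name has the form '/physical_device:<TYPE>:<INDEX>'.
    _, device_type, index = physical_name.rsplit(":", 2)
    return "/{}:{}".format(device_type.lower(), index)


def get_available_gpus():
    """Return short names of the GPUs visible to TensorFlow (assumes TF 2.x)."""
    # Imported lazily so short_device_name can be used without TensorFlow.
    import tensorflow as tf

    return [short_device_name(d.name)
            for d in tf.config.get_visible_devices("GPU")]
```

Because tf.config.get_visible_devices respects CUDA_VISIBLE_DEVICES, GPUs hidden through that variable should not appear in the returned list.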

2020-11-19 07:58:03

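Other answers

I am working with TF 2.1 and torch, so I do not want to tie this automatic selection to any one ML framework. I just use plain nvidia-smi and os.environ to find a vacant GPU.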

import os
import subprocess


def auto_gpu_selection(usage_max=0.01, mem_max=0.05):
    """Auto set CUDA_VISIBLE_DEVICES for gpu

    :param usage_max: max fraction of GPU utilization for a GPU to count as vacant
    :param mem_max: max fraction of GPU memory in use for a GPU to count as vacant
    :return: None
    """
    os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
    # Skip the table header; this relies on the nvidia-smi layout with 3 rows per GPU
    log = str(subprocess.check_output("nvidia-smi", shell=True)).split(r"\n")[6:-1]
    gpu = 0

    # Maximum of GPUS, 8 is enough for most
    for i in range(8):
        idx = i*3 + 2
        if idx > len(log) - 1:
            break
        inf = log[idx].split("|")
        # inf[3] is read below, so at least 4 fields are needed
        if len(inf) < 4:
            break
        usage = int(inf[3].split("%")[0].strip())
        mem_now = int(str(inf[2].split("/")[0]).strip()[:-3])
        mem_all = int(str(inf[2].split("/")[1]).strip()[:-3])
        # print("GPU-%d : Usage:[%d%%]" % (gpu, usage))
        if usage < 100*usage_max and mem_now < mem_max*mem_all:
            os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
            print("\nAuto choosing vacant GPU-%d : Memory:[%dMiB/%dMiB] , GPU-Util:[%d%%]\n" %
                  (gpu, mem_now, mem_all, usage))
            return
        print("GPU-%d is busy: Memory:[%dMiB/%dMiB] , GPU-Util:[%d%%]" %
              (gpu, mem_now, mem_all, usage))
        gpu += 1
    print("\nNo vacant GPU, use CPU instead\n")
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

If a vacant GPU is found, it sets CUDA_VISIBLE_DEVICES to that GPU's index (in PCI bus order):

GPU-0 is busy: Memory:[5738MiB/11019MiB] , GPU-Util:[60%]
GPU-1 is busy: Memory:[9688MiB/11019MiB] , GPU-Util:[78%]

Auto choosing vacant GPU-2 : Memory:[1MiB/11019MiB] , GPU-Util:[0%]

Otherwise, it sets the variable to -1 so that the CPU is used:

GPU-0 is busy: Memory:[8900MiB/11019MiB] , GPU-Util:[95%]
GPU-1 is busy: Memory:[4674MiB/11019MiB] , GPU-Util:[35%]
GPU-2 is busy: Memory:[9784MiB/11016MiB] , GPU-Util:[74%]

No vacant GPU, use CPU instead

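Note: call this function before importing any ML framework that needs a GPU, and it will then pick a GPU automatically. It also makes it easy to launch multiple tasks.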
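
Parsing the human-readable nvidia-smi table by fixed row offsets depends on the exact layout, which differs between driver versions. As a hedged alternative, nvidia-smi also has a machine-readable query mode (--query-gpu with --format=csv,noheader,nounits). The sketch below applies the same thresholds as the function above; the names parse_gpu_stats, pick_vacant_gpu and query_gpu_stats are hypothetical, and it assumes every queried field is reported as a plain number (some GPUs report values such as [N/A], which this sketch does not handle).

```python
import subprocess

# Machine-readable query: one CSV line per GPU, no header, no units.
QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_stats(csv_text):
    """Parse lines like '0, 60, 5738, 11019' into (index, util, used, total)."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, used, total = (int(field) for field in line.split(","))
        stats.append((index, util, used, total))
    return stats


def pick_vacant_gpu(stats, usage_max=0.01, mem_max=0.05):
    """Return the index of the first GPU under both thresholds, or None."""
    for index, util, used, total in stats:
        if util < 100 * usage_max and used < mem_max * total:
            return index
    return None


def query_gpu_stats():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    return parse_gpu_stats(subprocess.check_output(QUERY).decode())
```

To use it, set os.environ["CUDA_VISIBLE_DEVICES"] to str(index) when pick_vacant_gpu(query_gpu_stats()) returns an index, or to "-1" when it returns None, before importing the ML framework.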

2020-08-02 07:59:49

Run the following command in any shell:

python -c "import tensorflow as tf; print(\"Num GPUs Available: \", len(tf.config.list_physical_devices('GPU')))"

2022-04-03 20:48:48

My machine has an NVIDIA GeForce GTX 1650 Ti GPU, with tensorflow-gpu==2.2.0.

Run the following two lines of code:

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Output:

Num GPUs Available:  1

2020-05-30 10:57:00

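There is an undocumented method called device_lib.list_local_devices() that lets you list the devices available in the local process. (Note: as an undocumented method, it is subject to backwards-incompatible changes.) The function returns a list of DeviceAttributes protocol buffer objects. You can extract a list of string device names for the GPU devices as follows: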

from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

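Note that (at least up to TensorFlow 1.4) calling device_lib.list_local_devices() runs some initialization code that, by default, allocates all of the GPU memory on all of the devices (GitHub issue). To avoid this, first create a session with an explicitly small per_process_gpu_memory_fraction, or with allow_growth=True, so that not all of the memory is allocated. See this question for more details.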
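
The workaround described above can be sketched as follows. This is only an illustration under stated assumptions: it uses the tf.compat.v1 aliases (on older 1.x releases, tf.ConfigProto and tf.Session would be used instead), and the function name get_available_gpus_with_growth is hypothetical.

```python
def get_available_gpus_with_growth():
    """List GPU names after opting in to on-demand GPU memory allocation."""
    # Imported lazily; both modules require TensorFlow to be installed.
    import tensorflow as tf
    from tensorflow.python.client import device_lib

    # Ask TensorFlow to grow GPU memory as needed instead of reserving it all.
    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True

    # Create the session first so the devices are initialized with this config.
    with tf.compat.v1.Session(config=config):
        devices = device_lib.list_local_devices()
    return [d.name for d in devices if d.device_type == "GPU"]
```

Setting config.gpu_options.per_process_gpu_memory_fraction to a small value instead of enabling allow_growth is the other option mentioned above.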

2016-07-26 02:34:21

There is also a method in the test utilities, so all that has to be done is:

tf.test.is_gpu_available()

and/or

tf.test.gpu_device_name()

Look up the arguments in the TensorFlow docs.

2018-06-22 06:06:09
