Testing GPU functionality
#########################

This section provides information and links that help with testing
:ref:`CoreNEURON`'s GPU support.
Other sections of the documentation that may be relevant are:

- The :ref:`getting-coreneuron` section, which documents both building from
  source with CoreNEURON support and installing Python wheels.
- The :ref:`coreneuron-running-a-simulation` section, which explains the
  basics of porting a NEURON model to use CoreNEURON.
- The :ref:`Running GPU benchmarks` section, which outlines how to use
  profiling tools such as Caliper, NVIDIA Nsight Systems and NVIDIA Nsight
  Compute.

This section aims to add some basic information about how to check that GPU
execution is working.
This might be useful if, for example, you need to test GPU execution on a new
system.

Accessing GPU resources
***********************

If your local system has an (NVIDIA) GPU installed then you can probably skip
this section.
The ``nvidia-smi`` tool may be useful to check this; it lists the GPUs
attached to the system:

.. code-block:: console

   $ nvidia-smi
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  Quadro P2200        Off  | 00000000:01:00.0 Off |                  N/A |
   | 45%   33C    P8     4W /  75W |     71MiB /  5049MiB |      2%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+

On a university cluster or supercomputer you will typically need to pass some
kind of extra constraint to the job scheduler.
For example, on the BlueBrain5 system, which uses Slurm, you can allocate a
GPU node using the ``volta`` constraint:

.. code-block:: console

   [login node] $ salloc -A <project> -C volta
   salloc: Granted job allocation 294001
   ...
   [compute node] $ nvidia-smi
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                  Off |
   ...

Running NEURON tests
********************

If you have configured NEURON with CoreNEURON, CoreNEURON GPU support and the
tests (``-DNRN_ENABLE_TESTS=ON``) enabled, then simply running

.. code-block:: console

   $ ctest --output-on-failure

in your CMake build directory will execute a large number of tests, many of
which include GPU execution.
You can filter which tests are run by name using the ``-R`` option to CTest,
for example:

.. code-block:: console

   $ ctest --output-on-failure -R gpu
   Test project /path/to/your/build
         Start  42: coreneuron_modtests::direct_py_gpu
    1/53 Test  #42: coreneuron_modtests::direct_py_gpu .............................   Passed    1.98 sec
         Start  43: coreneuron_modtests::direct_hoc_gpu
    2/53 Test  #43: coreneuron_modtests::direct_hoc_gpu ............................   Passed    1.03 sec
         Start  44: coreneuron_modtests::spikes_py_gpu
   ...
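
If you have not yet configured such a build, the sketch below shows roughly
what the configuration step might look like.
This is an illustration rather than an authoritative recipe: the options
shown exist in recent NEURON versions, but GPU builds also require a suitable
compiler (typically ``nvc``/``nvc++`` from the NVIDIA HPC SDK), so consult
the :ref:`getting-coreneuron` section for the details that apply to your
version:

.. code-block:: console

   $ # Sketch only; adjust compilers and options to your system.
   $ cmake .. -DCMAKE_C_COMPILER=nvc \
              -DCMAKE_CXX_COMPILER=nvc++ \
              -DNRN_ENABLE_CORENEURON=ON \
              -DCORENRN_ENABLE_GPU=ON \
              -DNRN_ENABLE_TESTS=ON
   $ cmake --build . --parallel

After configuring and building along these lines, the ``*_gpu`` tests shown
above should appear in the ``ctest`` list.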

Running tests manually
**********************

It is sometimes convenient to run basic tests outside the CTest
infrastructure.
A particularly useful test case is the ``ringtest`` that is included in the
CoreNEURON repository.
This is convenient because the binary input data files for CoreNEURON are
committed to the repository -- meaning that the test can be run without
NEURON, Python, HOC and friends -- and the required mechanisms are compiled
as part of the standard NEURON build.
To run this test on the CPU, from your build directory, run:

.. code-block:: console

   $ ./bin/x86_64/special-core -d ../external/coreneuron/tests/integration/ring
   ...

where it is assumed that ``..`` is the source directory.
To enable GPU execution, add the ``--gpu`` option:

.. code-block:: console

   $ ./bin/x86_64/special-core -d ../external/coreneuron/tests/integration/ring --gpu
   Info : 4 GPUs shared by 1 ranks per node
   ...

You should see that the statistics printed at the end of the simulation are
the same in both cases.

It can also be useful to enable some basic profiling, for example using
NVIDIA's Nsight Systems utility ``nsys``:

.. code-block:: console

   $ nsys nvprof ./bin/x86_64/special-core -d ../external/coreneuron/tests/integration/ring --gpu
   WARNING: special-core and any of its children processes will be profiled.
   Collecting data...
   Info : 4 GPUs shared by 1 ranks per node
   ...
   Number of spikes: 37
   Number of spikes with non negative gid-s: 37
   Processing events...
   ...
   CUDA API Statistics:

    Time(%)  Total Time (ns)  Num Calls  Average (ns)  Minimum (ns)  Maximum (ns)  StdDev (ns)            Name
    -------  ---------------  ---------  ------------  ------------  ------------  -----------  --------------------------
       42.7    2,127,723,623    136,038      15,640.7         3,630    10,224,640     59,860.5  cuLaunchKernel
   ...
   CUDA Kernel Statistics:

    Time(%)  Total Time (ns)  Instances  Average (ns)  Minimum (ns)  Maximum (ns)  StdDev (ns)  Name
    -------  ---------------  ---------  ------------  ------------  ------------  -----------  ----------------------------------------------------------------------------------
       32.3      346,133,763      8,000      43,266.7        42,175        50,080      1,435.3  nvkernel__ZN10coreneuron18solve_interleaved1Ei_F1L653_4
       12.7      136,155,806      8,002      17,015.2         3,615     1,099,738     90,544.0  nvkernel__ZN10coreneuron14nrn_cur_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_F1L375_7
       10.4      111,258,439      8,002      13,903.8         3,199     1,314,489     73,556.3  nvkernel__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_F1L274_4
       10.1      108,647,844      8,000      13,581.0         3,391     1,274,394     70,309.4  nvkernel__ZN10coreneuron16nrn_state_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_F1L418_10
   ...

This can be helpful to confirm that compute kernels really are being launched
on the GPU.
Substrings such as ``solve_interleaved1``, ``solve_interleaved2``,
``nrn_cur_`` and ``nrn_state_`` in these kernel names indicate that the
computationally heavy parts of the simulation are indeed being executed on
the GPU.
This test dataset is extremely small, so you should not pay much attention to
the simulation time in this case.

.. note::
   The kernel names, which start with ``nvkernel__ZN10coreneuron`` above, are
   implementation details of the OpenACC or OpenMP backend being used.
   They can also depend on whether you use MOD2C or NMODL to translate MOD
   files.
   If you want to do more sophisticated profiling, you should use a tool such
   as Caliper that can access the well-defined, human-readable names for
   these kernels that NEURON and CoreNEURON define.
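
Finally, it can be convenient to check GPU execution from a NEURON script
rather than through ``special-core``.
The following Python sketch assumes a GPU-enabled build; the model itself (a
single passive compartment driven by a current pulse with hypothetical
parameter values) exists only to give CoreNEURON something to simulate:

.. code-block:: python

   from neuron import h, coreneuron

   h.load_file("stdrun.hoc")

   # Trivial test model: one passive compartment plus a current pulse.
   soma = h.Section(name="soma")
   soma.insert("pas")
   stim = h.IClamp(soma(0.5))
   stim.delay, stim.dur, stim.amp = 1.0, 5.0, 0.1  # hypothetical values

   h.cvode.cache_efficient(True)  # CoreNEURON requires cache-efficient data
   coreneuron.enable = True       # hand the simulation over to CoreNEURON
   coreneuron.gpu = True          # ...and request GPU execution

   pc = h.ParallelContext()
   pc.set_maxstep(10)
   h.finitialize(-65)
   pc.psolve(10.0)                # run to t = 10 ms through CoreNEURON

   print(f"t = {h.t:g} ms, v(soma) = {soma(0.5).v:g} mV")

Depending on the version and how NEURON was built, you may need to launch
such a script via ``./x86_64/special -python script.py`` rather than plain
``python``.
Running it under ``nsys nvprof`` should show the same kinds of ``nrn_cur_*``
and ``nrn_state_*`` kernels as above, while a build without GPU support is
expected to report an error at ``pc.psolve`` time when ``coreneuron.gpu`` is
set.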