RkBlog

Nvidia Modular diagnostic software - MODS

Nvidia MODS (Modular Diagnostic Software) is an internal Nvidia set of tools for GPU diagnostics. The tools leaked and are now used by third-party repair shops to troubleshoot broken GPUs. Let's take a look at what MODS can do and how to use it.

Modular diagnostic software overview

MODS is a collection of tools that run tests covering most aspects of a graphics card, from the VRAM chips to the GPU chip itself. It's used by OEMs to validate a card and by repair technicians to track down the broken part of a product.

The software versions that went public are distributed as ZIP archives containing a miniature Linux distribution with all dependencies and drivers. The intent is to boot it from a flash drive, execute tests, and look at the results (which are also saved as a text file on the flash drive). MODS comes with a PDF file containing full documentation on how to use it.

Creating bootable MODS flash drive

You will have to search for sites offering MODS for download. These are usually Russian forums or sites related to third-party repair. You will also find tutorials and repair examples on YouTube. You should not attempt a repair if you have no experience with this.

From what I could find there are two versions: 367.38.1, with all tools and documentation, and a partial 400.184 containing only the mods and mats tools (there could be a newer version as well). The 367.38.1 version does not support Turing cards, so if you have a Turing card (RTX 20XX or GTX 16XX) you will also need the two newer files (from what I see, only mats works).

Assuming you have the ZIP file, you can create the bootable flash drive:

copy c:\mods\367381.pkg c:\mods\pkgname
copy c:\mods\runmods.rbt c:\mods\runmods

\grub --config-file="find --set-root /tiny/kernel; configfile /dos2lin/dos2lin.lst"

Note that if you have a different version of the tools, you have to adjust the 367381.pkg file name in the first command accordingly. This will give you a bootable MODS flash drive that boots into Linux via FreeDOS.
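If you prepare drives for more than one MODS version, the two copy commands can be wrapped in a small script that picks up whatever .pkg file is present. This is a minimal Python sketch, not part of MODS: prepare_mods_dir is a made-up helper name, and it assumes the folder contains exactly one .pkg file plus runmods.rbt, as in the commands above.

```python
import shutil
from pathlib import Path


def prepare_mods_dir(mods_dir):
    """Mirror the two copy commands for whatever .pkg version is present.

    Assumption: mods_dir holds exactly one *.pkg file and a runmods.rbt file.
    """
    mods_dir = Path(mods_dir)
    packages = sorted(mods_dir.glob("*.pkg"))
    if len(packages) != 1:
        raise ValueError(f"expected exactly one .pkg file, found {len(packages)}")
    # copy <version>.pkg -> pkgname
    shutil.copyfile(packages[0], mods_dir / "pkgname")
    # copy runmods.rbt -> runmods
    shutil.copyfile(mods_dir / "runmods.rbt", mods_dir / "runmods")
    return packages[0].name
```

Calling prepare_mods_dir(r"c:\mods") on the machine where you build the drive would then stand in for the two copy commands, and it returns the name of the package it used so you can double-check the version.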

On boot, it executes the tests defined in the /mods/ARGS file, for example:

gputest.js
-test 3
-mfg
-null_display
-poll_interrupts
-pstate 0.max
-no_thermal_slowdown
-matsinfo

You can edit this file to set a preferred set of tests, or run them manually after the system boots. For more options on usage and customization of the software stack, you can watch the video "Using Nvidia MODS".
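Editing the ARGS file by hand is easy enough, but if you swap test sets often, a small script can generate it. The Python sketch below is only an illustration: it assumes the one-argument-per-line layout from the ARGS example above is what MODS expects, and build_args is a hypothetical helper, not something that ships with MODS.

```python
def build_args(tests, extra_flags=()):
    """Return ARGS file text: script name, one '-test N' line per test, then flags.

    Assumption: MODS reads one argument per line, as in the sample ARGS file.
    """
    lines = ["gputest.js"]
    lines += [f"-test {number}" for number in tests]
    lines += list(extra_flags)
    return "\n".join(lines) + "\n"
```

For example, build_args([3], ["-mfg", "-null_display"]) produces the first lines of the sample file above; the returned text can then be written to the ARGS file on the flash drive.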

How to use MODS

There are two main tools in this software stack: mods and mats. The first one tests the GPU, the second one tests the VRAM chips. Weird artifacts, or the famous Turing xd artifacts, are usually associated with damaged VRAM chips. Other symptoms may be related to the GPU chip itself or to some component on the board. mods won't tell you everything, but if you are a repair specialist it should help.

End users and gamers can use these tools for a quick check that a GPU is working correctly, especially when buying a used card.

mods can run individual tests from a list (check the PDF for details) or one of two predefined test sets, a quicker OEM one or the full suite:

mods gputest.js -mfg (for CEM testing)
mods gputest.js -oqa (for OEM outgoing QA testing)

Running mods creates a mods.log file containing the output of every test that was run, for example:

MODS start: Thu Nov 12 17:11:37 2020 

 Warning : test specifications should be used to control p-states 

Command Line : gputest.js -test 3 -test 18 -test 19 -test 52 -test 111 -test 112 -test 143 -mfg -null_display -poll_interrupts -pstate 0.max -no_thermal_slowdown -matsinfo 

CPU
Foundry   : GenuineIntel
Name      : Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz
Family    : 6
Model     : 14
Stepping  : 10

Version
MODS           : 367.38
OperatingSystem: Linux (x86_64)
Kernel         : 4.1.2-gentoo
KernelDriver   : 3.63
HostName       : tinylinux
Smbios version [0x302] is not supported

                 gpu 0  dev.sub 0.0          
                 ---------------------------       
Device Id      : GP104   
...

mats can be used to test VRAM chips, for example:

./mats -e 10

While the test runs, the screen displays weird colors. Once it's done, it prints a report (and saves it as report.txt). The result can look like this:

mats version 400.184.  Testing TU106 with 50 MB of memory starting with 0 MB.

Read    Error Count: 0
Write   Error Count: 0
Unknown Error Count: 0

=== MEMORY ERRORS BY SUBPARTITION ===
SUBPART READ ERRORS WRITE ERRORS UNKNOWN ERRS
------- ----------- ------------ ------------
FBIOA0            0            0            0
FBIOA1            0            0            0
FBIOB0            0            0            0
FBIOB1            0            0            0
FBIOC0            0            0            0
FBIOC1            0            0            0
FBIOD0            0            0            0
FBIOD1            0            0            0

Failing Bits: 
   None



Error Code = 00000000 (OK)

                                        
 #######     ####     ######    ######  
 ########   ######   ########  ######## 
 ##    ##  ##    ##  ##     #  ##     # 
 ##    ##  ##    ##   ###       ###     
 ########  ########    ####      ####   
 #######   ########      ###       ###  
 ##        ##    ##  #     ##  #     ## 
 ##        ##    ##  ########  ######## 
 ##        ##    ##   ######    ######

The report lists every memory channel (FBIO), and so every chip, together with the errors recorded for it, if any. Starting from the bottom right chip, you can match each VRAM chip to its subpart label (starting with the higher bits first):

VRAM chip labels on TU106

If errors show up for some of the chips, that could indicate a problem with those chips, with the memory controller on the GPU, or with the circuitry leading to the memory chips.
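To pick out the affected subpartitions from a saved report.txt without reading the whole table, something like the Python sketch below could be used. It assumes the table rows always look like the ones in the sample report above (a label starting with FBIO followed by three counters); I have only the clean report to go by, so a report with real errors may need adjustments. failing_subpartitions is a made-up helper name.

```python
import re

# One table row: label, read errors, write errors, unknown errors.
ROW = re.compile(r"^(FBIO\w+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def failing_subpartitions(report_text):
    """Return {subpartition: (read, write, unknown)} for rows with any non-zero count."""
    failing = {}
    for line in report_text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # header, separator, or unrelated line
        counts = tuple(int(value) for value in match.groups()[1:])
        if any(counts):
            failing[match.group(1)] = counts
    return failing
```

An empty result means no subpartition reported errors. Any label that does come back can then be matched to a physical chip using the label picture above.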

If you are interested in GPU repair or in analyzing graphics card state, power lines, and the like, I would recommend checking YouTube videos where repair specialists go over fixing broken GPUs. I would not recommend attempting to fix a valuable GPU if you have no prior experience with it.
