Automatic NaN checking as a debugging feature #4181

glwagner · 2025-03-07T20:33:35Z

After chatting with @charleskawczynski, I was wondering if a feature that automatically checks for and reports NaNs that appear after a kernel launch might be useful.

It's not too hard to implement such a feature.

Basically, it just means inserting a check into launch!:

Oceananigans.jl/src/Utils/kernel_launching.jl

Lines 275 to 302 in 795de5e

    
@inline function _launch!(arch, grid, workspec, kernel!, first_kernel_arg, other_kernel_args...;
                          exclude_periphery = false,
                          reduced_dimensions = (),
                          active_cells_map = nothing)

    location = Oceananigans.location(first_kernel_arg)

    loop!, worksize = configure_kernel(arch, grid, workspec, kernel!;
                                       location,
                                       exclude_periphery,
                                       reduced_dimensions,
                                       active_cells_map)

    # Don't launch kernels with no size
    haswork = if worksize isa OffsetStaticSize
        length(worksize) > 0
    elseif worksize isa Number
        worksize > 0
    else
        true
    end

    if haswork
        loop!(first_kernel_arg, other_kernel_args...)
    end

    return nothing
end

after loop!, which would be something like

if check_for_nans
    args = (first_kernel_arg, other_kernel_args...)
    for n in 1:length(args)
        if args[n] isa AbstractArray
            # NaN == NaN is false, so test with isnan rather than comparing to NaN
            found_nan = any(isnan, args[n])
            found_nan && error("Found a NaN in argument $n to $(kernel!)")
        end
    end
end

In terms of how to implement this, I think the least invasive way is through a global variable, sort of like a log level.
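To make the global-variable option concrete, here is a minimal sketch. The names `CHECK_FOR_NANS`, `enable_nan_checking!`, and `check_kernel_args` are hypothetical and not part of Oceananigans; the check is written as a standalone function so it can be tried outside of `_launch!`.

```julia
# Hypothetical global switch, analogous to a log level (an assumption, not Oceananigans API).
const CHECK_FOR_NANS = Ref(false)

enable_nan_checking!(flag::Bool=true) = (CHECK_FOR_NANS[] = flag; nothing)

# Stand-in for the post-launch check; `kernel_name` is only used in the error message.
function check_kernel_args(kernel_name, args...)
    CHECK_FOR_NANS[] || return nothing   # cheap early exit when checking is disabled
    for (n, arg) in enumerate(args)
        if arg isa AbstractArray{<:AbstractFloat} && any(isnan, arg)
            error("Found a NaN in argument $n of $kernel_name")
        end
    end
    return nothing
end
```

With the switch off, the only cost per launch is reading a `Ref`. After `enable_nan_checking!()`, a call like `check_kernel_args("demo!", [1.0, NaN])` would throw.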

But a more general design would add info to arch. One could even allow general callbacks in arch:

struct CPU{C}
    launch_callback :: C
end

CPU() = CPU(nothing)
has_callback(::CPU{Nothing}) = false
has_callback(::CPU) = true

so that within _launch!,

if has_callback(arch)
    arch.launch_callback(same_args_that_launch_gets...)
end
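Putting the callback pieces together, a self-contained toy version might look like the following. `ToyCPU`, `toy_launch!`, `nan_callback`, and `divide_by_self!` are made-up stand-ins for illustration only; the real `_launch!` also handles grids, workspecs, and kernel configuration, all of which are skipped here.

```julia
# Toy architecture carrying an optional callback; a stand-in for the CPU{C} idea,
# not the actual Oceananigans type.
struct ToyCPU{C}
    launch_callback :: C
end

ToyCPU() = ToyCPU(nothing)
has_callback(::ToyCPU{Nothing}) = false
has_callback(::ToyCPU) = true

# Hypothetical callback: throws if any floating-point array argument contains a NaN.
function nan_callback(kernel!, args...)
    for (n, arg) in enumerate(args)
        if arg isa AbstractArray{<:AbstractFloat} && any(isnan, arg)
            error("Found a NaN in argument $n after running $(kernel!)")
        end
    end
    return nothing
end

# Simplified stand-in for _launch!: run the kernel, then the callback if there is one.
function toy_launch!(arch, kernel!, args...)
    kernel!(args...)
    has_callback(arch) && arch.launch_callback(kernel!, args...)
    return nothing
end

# Example "kernel": 0.0 / 0.0 produces a NaN wherever the input is zero.
divide_by_self!(out, inp) = (out .= inp ./ inp; nothing)
```

Because `has_callback` dispatches on the type parameter, the no-callback branch should compile away for `ToyCPU{Nothing}`. `toy_launch!(ToyCPU(nan_callback), divide_by_self!, zeros(3), [1.0, 0.0, 2.0])` would error and name the kernel, while the same call with `ToyCPU()` silently writes the NaN.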

It'd also be fun to print the index that the NaN(s) were found at.
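For reporting indices, `findall(isnan, a)` from Base already does most of the work. A rough sketch, with hypothetical helper names `nan_indices` and `report_nans`, and assuming a plain CPU array (for a GPU array one would probably copy to the host first, which is not tested here):

```julia
# Hypothetical helper: indices of every NaN in `a`
# (CartesianIndex for multidimensional arrays, Int for vectors).
nan_indices(a::AbstractArray) = findall(isnan, a)

# Warn with the number of NaNs and the first few indices; returns all indices found.
function report_nans(name, a::AbstractArray; max_shown=5)
    inds = nan_indices(a)
    isempty(inds) && return inds
    shown = join(first(inds, max_shown), ", ")
    @warn "Found $(length(inds)) NaN(s) in $name; first indices: $shown"
    return inds
end
```

Capping the printed indices with `max_shown` keeps the message readable once a NaN has spread through a whole field.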


ali-ramadhan · 2025-03-10T13:56:22Z

This would be quite helpful for debugging especially on the GPU!

Yeah unfortunately the NaNChecker can slow down your simulation and it doesn't actually catch when/where the NaN occurs even if run every iteration. It just tells you that one of the fields it checks has a NaN.

charleskawczynski · 2025-03-10T14:26:56Z

Doing a generic callback can be especially useful, because then users can inject calls to Infiltrator's @exfiltrate, and gain interactive access to all of the input variables to the kernel (this is why all of the kernel arguments are passed to our kernel callback).

glwagner · 2025-03-10T19:07:24Z

> This would be quite helpful for debugging especially on the GPU!
>
> Yeah unfortunately the NaNChecker can slow down your simulation and it doesn't actually catch when/where the NaN occurs even if run every iteration. It just tells you that one of the fields it checks has a NaN.

Also, with an arch callback you get the info of the specific kernel where the NaN arose; that specificity might help a lot.

charleskawczynski · 2025-04-25T16:46:27Z

Another thing you could do here would be to log the norm of the output (for example, in a global constant dict) together with an iteration counter. This might be useful when comparing two simulations, since you might be able to see when they diverge.
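A rough sketch of what such a log could look like, assuming a global dict keyed by kernel name where the position in each vector plays the role of the counter. It counts launches rather than model iterations, and `NORM_LOG`, `log_norm!`, and `first_divergence` are hypothetical names:

```julia
using LinearAlgebra: norm

# Hypothetical global log: kernel name => norms recorded on successive launches.
const NORM_LOG = Dict{String, Vector{Float64}}()

# Append the norm of `output`; the vector index acts as a per-kernel launch counter.
function log_norm!(kernel_name::AbstractString, output::AbstractArray)
    push!(get!(NORM_LOG, String(kernel_name), Float64[]), norm(output))
    return nothing
end

# First launch index at which two logs differ beyond `rtol`, or `nothing` if they agree.
function first_divergence(a::Vector{Float64}, b::Vector{Float64}; rtol=1e-12)
    for i in 1:min(length(a), length(b))
        isapprox(a[i], b[i]; rtol=rtol) || return i
    end
    return nothing
end
```

One would save `NORM_LOG` from each of the two runs and pass the vectors for a given kernel to `first_divergence` to find the launch where they first disagree.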

glwagner · 2025-04-25T20:44:48Z

Hmm yes, though that kind of thing is pretty easy to code up manually on a case-by-case basis. I guess I may not find that super useful since I am usually able to pinpoint issues quickly, but maybe it would help new developers. It would also have to be a model feature rather than a kernel feature, because model iterations are not generically available in launch! (so a different design than this PR).

charleskawczynski · 2025-04-25T21:28:21Z

(I had debugging Julia 1.11 in mind)

glwagner · 2025-04-25T21:50:09Z

True! Pinpointing the issue there has proven elusive indeed

glwagner added the feature 🌟 label Mar 7, 2025
