Automatic NaN checking as a debugging feature #4181

Open
glwagner opened this issue Mar 7, 2025 · 7 comments
Labels
feature 🌟 Something new and shiny

Comments

@glwagner
Member

glwagner commented Mar 7, 2025

After chatting with @charleskawczynski, I was wondering if a feature that automatically checks for and reports NaNs that appear after a kernel launch might be useful.

It's not too hard to implement such a feature.

Basically, it just means inserting a check into launch!:

@inline function _launch!(arch, grid, workspec, kernel!, first_kernel_arg, other_kernel_args...;
                          exclude_periphery = false,
                          reduced_dimensions = (),
                          active_cells_map = nothing)
    location = Oceananigans.location(first_kernel_arg)
    loop!, worksize = configure_kernel(arch, grid, workspec, kernel!;
                                       location,
                                       exclude_periphery,
                                       reduced_dimensions,
                                       active_cells_map)
    # Don't launch kernels with no size
    haswork = if worksize isa OffsetStaticSize
        length(worksize) > 0
    elseif worksize isa Number
        worksize > 0
    else
        true
    end
    if haswork
        loop!(first_kernel_arg, other_kernel_args...)
    end
    return nothing
end

after loop!, which would be something like

if check_for_nans
    args = (first_kernel_arg, other_kernel_args...)
    for n in 1:length(args)
        if args[n] isa AbstractArray
            # NaN == NaN is false, so test with isnan rather than ==
            found_nan = any(isnan, args[n])
            found_nan && error("Found a NaN in argument $n to $(kernel!)")
        end
    end
end
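
One detail matters for a check like this: NaN compares unequal to every value, itself included, so detection has to go through `isnan` rather than `==`. A minimal standalone sketch, where `check_args_for_nans` is a hypothetical helper and not an Oceananigans function:

```julia
# Hypothetical helper (not part of Oceananigans): scan kernel arguments for NaNs.
function check_args_for_nans(kernel_name, args...)
    for (n, arg) in enumerate(args)
        # Only numeric arrays can hold NaN; skip everything else.
        if arg isa AbstractArray{<:Number} && any(isnan, arg)
            error("Found a NaN in argument $n to $kernel_name")
        end
    end
    return nothing
end

a = [1.0, NaN, 3.0]

# An equality test never detects NaN, because NaN == NaN is false.
equality_detects = any(a .== NaN)

# isnan is the reliable test.
isnan_detects = any(isnan, a)
```

Here `equality_detects` is `false` while `isnan_detects` is `true`, and `check_args_for_nans("demo!", a)` throws.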

In terms of how to implement this, I think the least invasive way is through a global variable, sort of like a log level.
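
A rough sketch of the global-variable approach, assuming a `Ref{Bool}` switch. All names here (`NAN_CHECKING`, `enable_nan_checking!`, `toy_launch!`, `divide_kernel!`) are made up for illustration, and `toy_launch!` is a plain-Julia stand-in for the real launch machinery:

```julia
# Hypothetical global switch, analogous to a log level.
const NAN_CHECKING = Ref(false)

enable_nan_checking!() = (NAN_CHECKING[] = true; nothing)
disable_nan_checking!() = (NAN_CHECKING[] = false; nothing)

# Toy stand-in for a kernel launch: run the kernel, then optionally check its arguments.
function toy_launch!(kernel!, args...)
    kernel!(args...)
    if NAN_CHECKING[]
        for (n, arg) in enumerate(args)
            if arg isa AbstractArray{<:Number} && any(isnan, arg)
                error("Found a NaN in argument $n to $(kernel!)")
            end
        end
    end
    return nothing
end

# Toy "kernel": elementwise division, which yields NaN for 0.0 / 0.0.
divide_kernel!(out, a, b) = (out .= a ./ b; nothing)

out = zeros(2)
toy_launch!(divide_kernel!, out, [0.0, 1.0], [0.0, 1.0])  # silent: checking is off by default
enable_nan_checking!()                                    # the same launch would now throw
```

The cost of the check is only paid when the switch is on, at the price of one global read per launch.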

But a more general design would add info to arch. One could even allow general callbacks in arch:

struct CPU{C}
    launch_callback :: C
end

CPU() = CPU(nothing)
has_callback(::CPU{Nothing}) = false
has_callback(::CPU) = true

so that within _launch!,

if has_callback(arch)
    arch.launch_callback(same_args_that_launch_gets...)
end
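
To make the callback idea concrete, here is a self-contained toy version. `ToyCPU`, `launch_with_callback!`, `record_nans`, and `fill_kernel!` are all hypothetical names, and the callback signature (kernel first, then the kernel arguments) is an assumption rather than a settled design:

```julia
# Hypothetical stand-in for an architecture type carrying an optional callback.
struct ToyCPU{C}
    launch_callback :: C
end

ToyCPU() = ToyCPU(nothing)
has_callback(::ToyCPU{Nothing}) = false
has_callback(::ToyCPU) = true

# Toy launch: run the kernel, then hand the kernel and its arguments to the callback.
function launch_with_callback!(arch, kernel!, args...)
    kernel!(args...)
    if has_callback(arch)
        arch.launch_callback(kernel!, args...)
    end
    return nothing
end

# Example callback: record which kernel ran and whether any array argument holds a NaN.
const launch_record = Tuple{Symbol, Bool}[]

function record_nans(kernel!, args...)
    found = any(a -> a isa AbstractArray{<:Number} && any(isnan, a), args)
    push!(launch_record, (nameof(kernel!), found))
    return nothing
end

fill_kernel!(out, value) = (fill!(out, value); nothing)

launch_with_callback!(ToyCPU(), fill_kernel!, zeros(3), 1.0)             # no callback, nothing recorded
launch_with_callback!(ToyCPU(record_nans), fill_kernel!, zeros(3), 1.0)  # records a clean launch
launch_with_callback!(ToyCPU(record_nans), fill_kernel!, zeros(3), NaN)  # records a launch with NaNs
```

Because `has_callback` dispatches on the type parameter, the no-callback branch can be resolved at compile time.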

It'd also be fun to print the index that the NaN(s) were found at.
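
For plain arrays, a sketch of index reporting could lean on `findall`. The helper name `nan_indices` is hypothetical, and how the resulting indices would map onto grid indices (for example with halo regions) is left open here:

```julia
# Hypothetical helper: list the indices at which an array holds NaN.
nan_indices(a::AbstractArray) = findall(isnan, a)

u = zeros(3, 2)
u[2, 1] = NaN
u[3, 2] = NaN

indices = nan_indices(u)   # a Vector of CartesianIndex{2}, in column-major order
message = "Found $(length(indices)) NaN(s), first at $(Tuple(first(indices)))"
```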

@glwagner glwagner added the feature 🌟 Something new and shiny label Mar 7, 2025
@ali-ramadhan
Member

This would be quite helpful for debugging especially on the GPU!

Yeah unfortunately the NaNChecker can slow down your simulation and it doesn't actually catch when/where the NaN occurs even if run every iteration. It just tells you that one of the fields it checks has a NaN.

@charleskawczynski
Member

Doing a generic callback can be especially useful, because then users can inject calls to Infiltrator's @exfiltrate, and gain interactive access to all of the input variables to the kernel (this is why all of the kernel arguments are passed to our kernel callback).

@glwagner
Member Author

> This would be quite helpful for debugging especially on the GPU!
>
> Yeah unfortunately the NaNChecker can slow down your simulation and it doesn't actually catch when/where the NaN occurs even if run every iteration. It just tells you that one of the fields it checks has a NaN.

Also with an arch callback you get the info of the specific kernel where the NaN arose; that specificity might help a lot

@charleskawczynski
Member

Another thing you could do here would be to log the norm of the output (for example, in a global constant dict) along with an iteration counter. This might be useful for comparing two simulations, since you might be able to see when they diverge.
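
A minimal sketch of that kind of logging, under stated assumptions: `NORM_LOG`, `LAUNCH_COUNTER`, `log_norm!`, and `first_divergence` are hypothetical names, and the counter counts logged launches rather than model iterations:

```julia
using LinearAlgebra: norm

# Hypothetical global log: kernel name => norms recorded at successive launches.
const NORM_LOG = Dict{Symbol, Vector{Float64}}()
const LAUNCH_COUNTER = Ref(0)

function log_norm!(kernel_name::Symbol, output::AbstractArray)
    LAUNCH_COUNTER[] += 1
    push!(get!(NORM_LOG, kernel_name, Float64[]), norm(output))
    return nothing
end

# Index of the first entry at which two logged histories differ, or nothing if none does.
function first_divergence(a::Vector{Float64}, b::Vector{Float64}; rtol = 1e-12)
    for i in 1:min(length(a), length(b))
        isapprox(a[i], b[i]; rtol = rtol) || return i
    end
    return nothing
end

log_norm!(:step!, [3.0, 4.0])
log_norm!(:step!, [6.0, 8.0])

run_a = NORM_LOG[:step!]
run_b = [5.0, 10.5]   # pretend history from a second simulation
diverged_at = first_divergence(run_a, run_b)
```

Comparing the two histories entry by entry gives the first launch at which the runs disagree beyond the tolerance.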

@glwagner
Member Author

glwagner commented Apr 25, 2025

Hmm yes, though that kind of thing is pretty easy to code up manually on a case-by-case basis. I guess I may not find that super useful since I am usually able to pinpoint issues quickly, but maybe it would help new developers. It would also have to be a model feature rather than a kernel feature, because model iterations are not generically available in launch! (so a different design than this PR)

@charleskawczynski
Member

(I had debugging Julia 1.11 in mind)

@glwagner
Member Author

True! Pinpointing the issue there has proven elusive indeed
