Automatic NaN checking as a debugging feature #4181
Comments
This would be quite helpful for debugging, especially on the GPU! Yeah, unfortunately the […]
Doing a generic callback can be especially useful, because then users can inject calls to Infiltrator's […]
Also with an […]
Another thing you could do here would be to log the norm of the output (for example) in a global constant dict, along with an iteration counter. This might be useful for comparing two simulations, in that you might be able to see when they diverge.
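A minimal sketch of that norm-logging idea, using only the Julia standard library. The names `NORM_LOG`, `LAUNCH_COUNTER`, `record_norm!` and `first_divergence` are hypothetical and not part of Oceananigans; the counter here counts calls to the recorder, not model iterations.

```julia
using LinearAlgebra: norm

# Hypothetical global log: kernel name => norm of the output at each recorded call.
const NORM_LOG = Dict{Symbol, Vector{Float64}}()

# Hypothetical global counter of recorded calls (not a model iteration).
const LAUNCH_COUNTER = Ref(0)

# Record the norm of `output` under `name` and bump the counter.
function record_norm!(name::Symbol, output::AbstractArray)
    LAUNCH_COUNTER[] += 1
    push!(get!(NORM_LOG, name, Float64[]), norm(output))
    return nothing
end

# Return the first position at which two norm logs disagree, or `nothing`.
# A NaN entry never compares as approximately equal, so it counts as a divergence.
function first_divergence(a::Vector{Float64}, b::Vector{Float64}; rtol=1e-12)
    for i in 1:min(length(a), length(b))
        isapprox(a[i], b[i]; rtol=rtol) || return i
    end
    return nothing
end
```

Saving `NORM_LOG` from two runs and passing the matching vectors to `first_divergence` would then point at the first recorded call where the runs differ.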
Hmm yes, though that kind of thing is pretty easy to code up manually on a case-by-case basis. I guess I may not find that super useful since I am usually able to pinpoint issues quickly, but maybe it would help new developers. It would also have to be a model feature rather than a kernel feature because model iterations are not generically available in […]
(I had debugging Julia 1.11 in mind)
True! Pinpointing the issue there has proven elusive indeed
After chatting with @charleskawczynski, I was wondering if a feature that automatically checks for and reports NaNs that appear after a kernel launch might be useful.
It's not too hard to implement such a feature.
Basically, it just means inserting a check into `launch!` (Oceananigans.jl/src/Utils/kernel_launching.jl, lines 275 to 302 in 795de5e), right after `loop!`.
In terms of how to implement this, I think the least invasive way is through a global variable, sort of like a log level.
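For illustration, a check gated by a global switch might look roughly like the sketch below. This is not the Oceananigans implementation: `CHECK_FOR_NANS` and `toy_launch!` are hypothetical names, and the broadcast stands in for the real `loop!` call.

```julia
# Hypothetical global switch, in the spirit of a log level. Off by default.
const CHECK_FOR_NANS = Ref(false)

# Stand-in for a kernel launch: fills `output` by applying `f` to each element of `input`.
function toy_launch!(f, output::AbstractArray, input::AbstractArray)
    output .= f.(input)    # plays the role of `loop!(...)` in the real `launch!`

    # The debugging check: only runs when the global switch is on.
    if CHECK_FOR_NANS[] && any(isnan, output)
        error("NaN detected in output after launching $(nameof(f))")
    end

    return nothing
end
```

With the switch off, the only added cost is reading a `Ref`; turning it on with `CHECK_FOR_NANS[] = true` adds one reduction over the output per launch.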
But a more general design would add info to `arch`. One could even allow general callbacks in `arch`, so that they can be invoked within `_launch!`.
It'd also be fun to print the index that the NaN(s) were found at.
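As a rough sketch of the callback design, assuming a hypothetical wrapper type `DebugArch` that carries a tuple of post-launch callbacks (neither it nor `debug_launch!` and `report_nans` exist in Oceananigans), something like this would also report the indices of any NaNs:

```julia
# Hypothetical architecture wrapper carrying post-launch callbacks.
# Each callback is called as `callback(kernel_name, output)`.
struct DebugArch{C}
    callbacks::C
end

# Example callback: warn about every index that holds a NaN and return those indices.
function report_nans(kernel_name, output)
    bad = findall(isnan, output)
    isempty(bad) || @warn "NaN found after $kernel_name" indices=bad
    return bad
end

# Stand-in for `_launch!`: run the "kernel", then invoke each callback on its output.
function debug_launch!(arch::DebugArch, f, output, input)
    output .= f.(input)    # plays the role of the real kernel loop
    for callback in arch.callbacks
        callback(nameof(f), output)
    end
    return nothing
end
```

A user could then build `DebugArch((report_nans,))` for NaN reporting, or pass any other function with the same two-argument signature, such as one that drops into a debugger.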