
ModuleNotFoundError: No module named 'callbacks' while running benchmark under large_language_model_pretraining/nemo #791


Open
samrock001 opened this issue Apr 28, 2025 · 3 comments

Comments

@samrock001

I am seeing ModuleNotFoundError: No module named 'callbacks' in the Slurm logs while running the benchmark under large_language_model_pretraining/nemo:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 66, in <module>
    fdl_runner_app()
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 338, in __call__
    raise e
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 321, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 665, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 197, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 703, in wrapper
    return callback(**use_params)
  File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 55, in fdl_direct_run
    fdl_buildable: fdl.Buildable = ZlibJSONSerializer().deserialize(fdl_config)
  File "/opt/NeMo-Run/src/nemo_run/core/serialization/zlib_json.py", line 41, in deserialize
    return serialization.load_json(
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 850, in load_json
    return Deserialization(json.loads(serialized_value), pyref_policy).result
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 736, in __init__
    self._result = self._deserialize(self._root)
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 776, in _deserialize
    return self._deserialize_ref(serialized_object)
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 749, in _deserialize_ref
    deserialized_object = self._deserialize(self._serialized_objects[key])
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 797, in _deserialize
    values = [value for _, value in self._deserialize(serialized_items)]
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 766, in _deserialize
    return [self._deserialize(x) for x in serialized_object]
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 766, in <listcomp>
    return [self._deserialize(x) for x in serialized_object]
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 766, in _deserialize
    return [self._deserialize(x) for x in serialized_object]
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 766, in <listcomp>
    return [self._deserialize(x) for x in serialized_object]
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 776, in _deserialize
    return self._deserialize_ref(serialized_object)
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 749, in _deserialize_ref
    deserialized_object = self._deserialize(self._serialized_objects[key])
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 797, in _deserialize
    values = [value for _, value in self._deserialize(serialized_items)]
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 766, in _deserialize
    return [self._deserialize(x) for x in serialized_object]
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 766, in <listcomp>
    return [self._deserialize(x) for x in serialized_object]
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 766, in _deserialize
    return [self._deserialize(x) for x in serialized_object]
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 766, in <listcomp>
    return [self._deserialize(x) for x in serialized_object]
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 776, in _deserialize
    return self._deserialize_ref(serialized_object)
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 749, in _deserialize_ref
    deserialized_object = self._deserialize(self._serialized_objects[key])
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 798, in _deserialize
    metadata = self._deserialize(serialized_object[_METADATA_KEY])
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 776, in _deserialize
    return self._deserialize_ref(serialized_object)
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 749, in _deserialize_ref
    deserialized_object = self._deserialize(self._serialized_objects[key])
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 797, in _deserialize
    values = [value for _, value in self._deserialize(serialized_items)]
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 766, in _deserialize
    return [self._deserialize(x) for x in serialized_object]
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 766, in <listcomp>
    return [self._deserialize(x) for x in serialized_object]
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 766, in _deserialize
    return [self._deserialize(x) for x in serialized_object]
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 766, in <listcomp>
    return [self._deserialize(x) for x in serialized_object]
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 778, in _deserialize
    return self._deserialize_pyref(serialized_object)
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 755, in _deserialize_pyref
    return import_symbol(self._pyref_policy, pyref[_MODULE_KEY],
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 293, in import_symbol
    with reraised_exception.try_with_lazy_message(make_message):
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/reraised_exception.py", line 82, in try_with_lazy_message
    raise decorate_exception(exc, message) from None
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/reraised_exception.py", line 74, in try_with_lazy_message
    yield
  File "/usr/local/lib/python3.10/dist-packages/fiddle/_src/experimental/serialization.py", line 294, in import_symbol
    value = importlib.import_module(module)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'callbacks'
Fiddle context: Error occurred while importing pyref to 'PreemptiveStop' from 'callbacks'.
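The final frames show Fiddle resolving a serialized reference ("pyref") to PreemptiveStop by calling importlib.import_module('callbacks'), so the failure is an ordinary Python import of a top-level module named callbacks. The thread does not say where that module is supposed to come from. Assuming it is a local callbacks.py shipped alongside the benchmark scripts, a minimal diagnostic sketch like the one below, run with the same interpreter, working directory and environment as the Slurm job, shows whether the name resolves and which sys.path entries were searched. check_importable is a hypothetical helper, not part of NeMo-Run or Fiddle, and it only diagnoses the import path; it does not fix it.

```python
import importlib
import importlib.util
import sys
from typing import Optional


def check_importable(module_name: str, symbol: Optional[str] = None) -> bool:
    """Hypothetical diagnostic: can `module_name` (and `symbol` in it) be imported?

    Mirrors the importlib.import_module call at the bottom of the traceback.
    """
    # find_spec returns None (rather than raising) for a missing top-level module.
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        print(f"{module_name!r} not found; searched sys.path entries:")
        for entry in sys.path:
            print("   ", entry)
        return False
    module = importlib.import_module(module_name)
    if symbol is not None and not hasattr(module, symbol):
        print(f"{module_name!r} found at {spec.origin} but has no {symbol!r}")
        return False
    print(f"{module_name!r} resolves to {spec.origin}")
    return True


if __name__ == "__main__":
    # The module/symbol pair named in the Fiddle context line of the traceback.
    check_importable("callbacks", "PreemptiveStop")
```

If the check fails inside the job but passes from the directory holding the scripts, the module is most likely not on the job's import path, which points at how the code is packaged or launched rather than at Fiddle itself.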

@ShriyaRishab
Contributor

@samrock001 Are you running within Docker and using the Dockerfile provided in the repo?

Can you also share the command used to reproduce this error?

@samrock001
Author

samrock001 commented Apr 29, 2025

Yes, I am using the Dockerfile provided in the repo, and running the following from inside the built Docker container:

root@4e084e2dbd2a:/workspace/llama31# cd large_language_model_pretraining/nemo/
root@4e084e2dbd2a:/workspace/llama31/large_language_model_pretraining/nemo# source config.sh
root@4e084e2dbd2a:/workspace/llama31/large_language_model_pretraining/nemo# ./run_llama31.sh

There is another problem: on every re-run, build_pretraining_datamodule runs again and spends a long time regenerating the index mapping cache files. Would it be possible to modify the code to reuse the index mapping cache files created by prior runs, or to accept an input argument for them?

build_data_index = run.Partial(
    build_pretraining_datamodule,
    datamodule=datamodule,
    trainer_max_steps=pretrain.trainer.max_steps,
    trainer_val_check_interval=pretrain.trainer.val_check_interval,
    trainer_limit_val_batches=pretrain.trainer.limit_val_batches,
    trainer_limit_test_batches=pretrain.trainer.limit_test_batches,
)

@ShriyaRishab
Contributor

ShriyaRishab commented Apr 29, 2025

> Would it be possible to modify the code to re-use index mapping cache files created from prior runs or accept an input argument?

The index file depends on the seed, which must be randomly generated for every run according to the MLPerf rules. For testing purposes, you can reuse the same seed, which causes a previously generated index file to be reused. Note, however, that runs sharing a seed cannot be submitted to the MLPerf Training benchmark, because each run must be generated with a different random seed.
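To illustrate why a fixed seed lets a run pick up an existing index file: a minimal sketch, assuming the cache file name is derived from a hash of every parameter that affects the index, including the seed. The function names, the JSON file format and the hashing scheme are hypothetical, not the actual Megatron-Core or NeMo implementation, and a seeded shuffle stands in for the expensive index build.

```python
import hashlib
import json
import random
import tempfile
from pathlib import Path
from typing import List, Tuple


def index_cache_path(cache_dir: Path, seed: int, num_samples: int) -> Path:
    """Hypothetical naming scheme: hash every parameter that changes the index."""
    description = json.dumps({"seed": seed, "num_samples": num_samples}, sort_keys=True)
    digest = hashlib.sha256(description.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}-shuffle_index.json"


def load_or_build_index(cache_dir: Path, seed: int, num_samples: int) -> Tuple[List[int], bool]:
    """Return (shuffle index, whether it was reused from the cache)."""
    path = index_cache_path(cache_dir, seed, num_samples)
    if path.exists():
        return json.loads(path.read_text()), True
    # Stand-in for the expensive index build: a shuffle driven only by the seed.
    index = list(range(num_samples))
    random.Random(seed).shuffle(index)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(index))
    return index, False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp)
        _, reused = load_or_build_index(cache, seed=1234, num_samples=10)
        print("first run reused cache:", reused)  # -> False
        _, reused = load_or_build_index(cache, seed=1234, num_samples=10)
        print("same seed reused cache:", reused)  # -> True
        _, reused = load_or_build_index(cache, seed=5678, num_samples=10)
        print("new seed reused cache:", reused)  # -> False
```

Because the seed is part of the cache key, a new random seed always misses the cache, which is the trade-off between fast local testing and valid submission runs.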

@Elnifio can you please help with "ModuleNotFoundError: No module named 'callbacks'"?

Labels: None yet
Projects: None yet
Development: No branches or pull requests
2 participants