Pull requests: NVIDIA/Megatron-LM
param_copy_back_gpu_hook should sync to h2d stream
#1543
opened Apr 16, 2025 by
ariverhorse
Fix parameter error in text_generation_server.py file
#1542
opened Apr 16, 2025 by
xichengpro
[BUGFIX] Save dist_checkpointing metadata on all nodes for multi-node training
#1531
opened Apr 13, 2025 by
Pranaykarvi
added fix to avoid overflow with new numpy casting behaviour (Issue: #1519)
#1520
opened Apr 4, 2025 by
Apsod
Add full support for Local mode without Apex/TE, and add support for Open XLA on CUDA
#1510
opened Mar 31, 2025 by
ajayvohra2005
[BUG]: Updating the logic for reducing the load_balancing_loss during logging, such that the correct value is logged while using CUDA Graphs
#1507
opened Mar 27, 2025 by
arjun-choudhry
fix for group_limited_topk: K_r is moe_router_topk instead of moe_router_num_groups
#1502
opened Mar 25, 2025 by
ladyrick
[Bug Fix] fix p2p communication order error and stuck problems when pp 2 and vpp 2 with remove pad
#1495
opened Mar 22, 2025 by
ETOgaosion
Fix llama_mistral loader by using args.true_vocab_size
#1491
opened Mar 20, 2025 by
zhuzilin
Build dataset for all GPUs with tp_rank=0 and pp_rank=0 or -1 in multi-machine training.
#1480
opened Mar 14, 2025 by
wan-nan
Enabling variable_seq_lengths when encoder has Different TP Size
#1470
opened Mar 12, 2025 by
xiaojunjie
fix(moe): the missing argument 'router_dtype' of _DeepepManager.__init__
#1463
opened Mar 11, 2025 by
AsakusaRinne