📌 Multi-Head Attention Shape Transformations (Cheat Sheet) #595
talk2jaydip started this conversation in Show and tell
Replies: 2 comments · 1 reply
-
This is a really nice and organized figure/cheatsheet! Thanks a lot for sharing!
-
Helpful!
- The cheat-sheet figure traces the tensor shapes through a standard multi-head attention forward pass:

  | Step | Resulting shape | Dimension change |
  | --- | --- | --- |
  | Input `x` (fed to the query, key, and value projections) | `(batch_size, num_tokens, d_in)` | |
  | Linear projection | `(batch_size, num_tokens, d_out)` | `d_in → d_out` |
  | `.view()` to split into heads | `(batch_size, num_tokens, num_heads, head_dim)` | `d_out → num_heads × head_dim` |
  | `.transpose(1, 2)` | `(batch_size, num_heads, num_tokens, head_dim)` | `num_tokens ↔ num_heads` |
  | Attention scores: queries times keys transposed to `(batch_size, num_heads, head_dim, num_tokens)` | `(batch_size, num_heads, num_tokens, num_tokens)` | `head_dim → num_tokens` |
  | Softmax over the last dimension | `(batch_size, num_heads, num_tokens, num_tokens)` | unchanged |
  | Context vectors: attention weights times values of shape `(batch_size, num_heads, num_tokens, head_dim)` | `(batch_size, num_heads, num_tokens, head_dim)` | `num_tokens → head_dim` |
  | Transpose back and merge heads | `(batch_size, num_tokens, d_out)` | `num_heads × head_dim → d_out` |
  | Output projection | `(batch_size, num_tokens, d_out)` | |
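  To make the table concrete, here is a minimal PyTorch sketch of a from-scratch multi-head attention module with the cheat-sheet shapes annotated as comments. The class and attribute names (`MultiHeadAttention`, `W_query`, `out_proj`, etc.) are illustrative assumptions, and causal masking and dropout are omitted so that only the shape flow is shown.

  ```python
  import torch
  import torch.nn as nn

  class MultiHeadAttention(nn.Module):
      """Minimal multi-head attention sketch; comments mirror the cheat sheet."""
      def __init__(self, d_in, d_out, num_heads):
          super().__init__()
          assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
          self.num_heads = num_heads
          self.head_dim = d_out // num_heads
          self.d_out = d_out
          self.W_query = nn.Linear(d_in, d_out, bias=False)
          self.W_key = nn.Linear(d_in, d_out, bias=False)
          self.W_value = nn.Linear(d_in, d_out, bias=False)
          self.out_proj = nn.Linear(d_out, d_out)

      def forward(self, x):
          b, num_tokens, _ = x.shape                 # (batch_size, num_tokens, d_in)

          # Linear projections: d_in -> d_out
          queries = self.W_query(x)                  # (batch_size, num_tokens, d_out)
          keys = self.W_key(x)                       # (batch_size, num_tokens, d_out)
          values = self.W_value(x)                   # (batch_size, num_tokens, d_out)

          # Split d_out -> num_heads x head_dim
          queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
          keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
          values = values.view(b, num_tokens, self.num_heads, self.head_dim)

          # Swap num_tokens <-> num_heads: (batch_size, num_heads, num_tokens, head_dim)
          queries = queries.transpose(1, 2)
          keys = keys.transpose(1, 2)
          values = values.transpose(1, 2)

          # (b, h, t, head_dim) @ (b, h, head_dim, t)
          # -> (batch_size, num_heads, num_tokens, num_tokens)
          attn_scores = queries @ keys.transpose(2, 3)
          attn_weights = torch.softmax(attn_scores / self.head_dim ** 0.5, dim=-1)

          # (b, h, t, t) @ (b, h, t, head_dim)
          # -> (batch_size, num_heads, num_tokens, head_dim)
          context = attn_weights @ values

          # Transpose back and merge heads: num_heads x head_dim -> d_out
          context = context.transpose(1, 2).contiguous().view(b, num_tokens, self.d_out)
          return self.out_proj(context)              # (batch_size, num_tokens, d_out)


  x = torch.randn(2, 6, 32)                          # (batch_size=2, num_tokens=6, d_in=32)
  mha = MultiHeadAttention(d_in=32, d_out=64, num_heads=4)
  print(mha(x).shape)                                # torch.Size([2, 6, 64])
  ```

  With `d_in=32`, `d_out=64`, and `num_heads=4`, each head works with `head_dim = 16`, and the final output shape `(2, 6, 64)` matches the last row of the table.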