Replies: 5 comments 2 replies
-
Another idea I had is to leverage JAX's abstract evaluation to extract all the parameter shapes without doing any actual computation. This sounds like a much less hacky approach, but I'm not sure how feasible it is.
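For what it's worth, `jax.eval_shape` already does this kind of abstract evaluation: it traces a function and returns only the shapes/dtypes of its outputs, performing no FLOPs and no actual RNG work. A minimal sketch (the two-layer initializer here is a made-up stand-in, not anyone's actual model):

```python
import jax


def init_params(rng):
    # Hypothetical initializer: weights for two dense layers.
    k1, k2 = jax.random.split(rng)
    return {
        "dense1": jax.random.normal(k1, (784, 512)),
        "dense2": jax.random.normal(k2, (512, 10)),
    }


# eval_shape traces init_params abstractly: no computation happens,
# we only get back ShapeDtypeStructs describing what it would return.
shapes = jax.eval_shape(init_params, jax.random.PRNGKey(0))
print(shapes["dense1"].shape)  # (784, 512)
```

Since nothing is compiled or executed, this sidesteps the XLA compile entirely, though it only gives you shapes, not actual parameter values.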
-
This problem is caused by compiling the random init routines over and over. XLA inlines everything, so no code is shared between initializer calls that create the same shapes. A lazy init is indeed an option. There is a util in Flax that can do this: https://flax.readthedocs.io/en/latest/flax.jax_utils.html#flax.jax_utils.partial_eval_by_shape Feel free to try it and see if it speeds up the call.
-
@jheek Thanks for pointing out that function.
-
Thanks for creating this issue @thisiscam! I agree the issues you mentioned do not explain it sufficiently, and I personally would like to understand this better as well. @jheek your solution is to use `partial_eval_by_shape`, which, according to the comment on `Module.init` (which I think we added together), initializes the model lazily using only the shape of the arguments. So will XLA inline everything in this case as well, or will compiled code blocks be shared? (Which is what we want.)
-
Maybe I should ask a rather more basic question: why should we use `jit` for initialization at all? It seems like, from the linked issues, `jit` is there so that XLA's dead code elimination can drop unnecessary computation. However, if the `jit` compilation itself is this slow, the tradeoff is unclear to me.
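To make the dead-code-elimination motivation concrete, here is a small sketch (the initializer is hypothetical): op-by-op execution materializes every intermediate, while under `jit` XLA can drop values that never reach an output.

```python
import jax


def init(rng):
    k1, k2 = jax.random.split(rng)
    w = jax.random.normal(k1, (8, 8))
    _unused = jax.random.normal(k2, (8192, 8192))  # dead: never returned
    return w


# Op-by-op, _unused (a 256 MB array) would actually be computed; under
# jit, XLA's dead code elimination removes it from the compiled program.
w = jax.jit(init)(jax.random.PRNGKey(0))
```

So the benefit of `jit`-ing init is avoiding wasted work like `_unused`, at the cost of paying for XLA compilation up front.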
-
I'm experiencing very slow `jit` compilation for a very small 3-layer Transformer model on TPU. Without `jit` (using op-by-op mode), I can initialize my parameters within a minute. I wonder why that's the case? Here's my guess: with `jit`, all the Dead Code Elimination is done by the XLA compiler, which is known to be slow. I see several related issues/PRs: #879, #910, #1277, but it seems like none of them really documents how to resolve this slowness.
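One way to confirm where the time goes is JAX's ahead-of-time API, which separates tracing/lowering from the XLA compile so the compile step can be timed on its own. A sketch, where `init_fn` is a hypothetical stand-in for the model's parameter init:

```python
import time

import jax


def init_fn(rng):
    # Hypothetical stand-in for a model's parameter init.
    return [jax.random.normal(k, (256, 256)) for k in jax.random.split(rng, 8)]


# Trace and lower first, then time only the XLA compilation step.
lowered = jax.jit(init_fn).lower(jax.random.PRNGKey(0))
t0 = time.perf_counter()
compiled = lowered.compile()
print(f"XLA compile took {time.perf_counter() - t0:.3f}s")
params = compiled(jax.random.PRNGKey(0))
```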
Here's my hacky solution: `jit` the initialization on CPU, then transfer the parameters to TPU. My guess is that the XLA CPU backend doesn't attempt the aggressive optimizations the TPU backend does, so the compilation is much, much faster (< 3 mins).
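Assuming the approach above, a minimal sketch using `jax.default_device` to pin the init to CPU and `jax.device_put` for the transfer (`init_fn` is a hypothetical stand-in for `model.init`):

```python
import jax


def init_fn(rng):
    # Hypothetical stand-in for model.init.
    return {"w": jax.random.normal(rng, (512, 512))}


# Compile and run the init on CPU, where XLA compilation is cheaper...
cpu = jax.devices("cpu")[0]
with jax.default_device(cpu):
    params = jax.jit(init_fn)(jax.random.PRNGKey(0))

# ...then move the finished params to the default accelerator
# (TPU/GPU if present, otherwise this is a no-op on CPU).
params = jax.device_put(params, jax.devices()[0])
```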
It is a cumbersome solution, and the reasoning behind it is mostly based on my guesses. Could someone help me understand if there are any drawbacks to this approach?
Thanks!