
Commit 90503ec

pcoet authored and copybara-github committed
PR #1519: copy edited XLA architecture doc
Imported from GitHub PR #1519

Copybara import of the project:

--
576709a by David Huntsperger <[email protected]>:

copy edited XLA architecture doc

Merging this change closes #1519

COPYBARA_INTEGRATE_REVIEW=#1519 from pcoet:doc-architecture 576709a
PiperOrigin-RevId: 512713359
1 parent 4f460f9 commit 90503ec

File tree

1 file changed: +51 -51 lines changed

docs/architecture.md

+51 -51
@@ -1,71 +1,71 @@
# XLA architecture

-XLA is a machine learning (ML) compiler that optimizes
-linear algebra (XLA = accelerated linear algebra), providing improvements in
-execution speed and memory usage. This page provides a brief overview of the
-the XLA compiler's objectives and architecture.
+XLA (Accelerated Linear Algebra) is a machine learning (ML) compiler that
+optimizes linear algebra, providing improvements in execution speed and memory
+usage. This page provides a brief overview of the objectives and architecture of
+the XLA compiler.

## Objectives

-Today, XLA supports several ML framework frontends (such as PyTorch, TensorFlow,
-and JAX) and is part of the OpenXLA project&mdash;an ecosystem of open-source compiler
-technologies for ML that's developed collaboratively by leading ML
-hardware and software organizations. Before the OpenXLA project was created, XLA
-was developed inside the TensorFlow project, but the fundamental
-objectives have remained the same:
+Today, XLA supports several ML framework frontends (including PyTorch,
+TensorFlow, and JAX) and is part of the OpenXLA project &ndash; an ecosystem of
+open-source compiler technologies for ML that's developed collaboratively by
+leading ML hardware and software organizations. Before the OpenXLA project was
+created, XLA was developed inside the TensorFlow project, but the fundamental
+objectives remain the same:

-* **Improve execution speed.** Compile subgraphs to reduce the execution time
-of short-lived ops to eliminate overhead from the execution runtime, fuse
-pipelined operations to reduce memory overhead, and specialize known
-tensor shapes to allow for more aggressive constant propagation.
+* **Improve execution speed.** Compile subgraphs to reduce the execution time
+of short-lived ops and eliminate overhead from the runtime, fuse pipelined
+operations to reduce memory overhead, and specialize known tensor shapes to
+allow for more aggressive constant propagation.

-* **Improve memory usage.** Analyze and schedule memory usage,
-eliminating many intermediate storage buffers.
+* **Improve memory usage.** Analyze and schedule memory usage, eliminating
+many intermediate storage buffers.

-* **Reduce reliance on custom ops.** Remove the need for many custom ops by
-improving the performance of automatically fused low-level ops to match the
-performance of custom ops that were originally fused by hand.
+* **Reduce reliance on custom ops.** Remove the need for many custom ops by
+improving the performance of automatically fused low-level ops to match the
+performance of custom ops that were originally fused by hand.

-* **Improve portability.** Make it relatively easy to write a new backend for
-novel hardware, at which point a large fraction of ML models can
-run unmodified on that hardware. This is in contrast with the approach of
-specializing individual monolithic ops for new hardware, which requires
-models be rewritten to make use of those ops.
+* **Improve portability.** Make it relatively easy to write a new backend for
+novel hardware, so that a large fraction of ML models can run unmodified on
+that hardware. This is in contrast with the approach of specializing
+individual monolithic ops for new hardware, which requires models to be
+rewritten to make use of those ops.

## How it works

-The XLA Compiler takes model graphs from ML frameworks defined in
+The XLA compiler takes model graphs from ML frameworks defined in
[StableHLO](https://github.com/openxla/stablehlo) and compiles them into machine
-instructions for various architectures. StableHLO defines a versioned
-operation set (HLO = high level operations) that provides a
-portability layer between ML frameworks and the compiler.
+instructions for various architectures. StableHLO defines a versioned operation
+set (HLO = high level operations) that provides a portability layer between ML
+frameworks and the compiler.

In general, the compilation process that converts the model graph into a
target-optimized executable includes these steps:

-1. XLA performs several built-in optimization and analysis passes on the
-StableHLO graph that are target-independent, such as
-[CSE](https://en.wikipedia.org/wiki/Common_subexpression_elimination),
-target-independent operation fusion, and buffer analysis for allocating runtime
-memory for the computation. During this optimization stage, XLA also converts
-the StableHLO dialect into an internal HLO dialect.
+1. XLA performs several built-in optimization and analysis passes on the
+StableHLO graph that are target-independent, such as
+[CSE](https://en.wikipedia.org/wiki/Common_subexpression_elimination),
+target-independent operation fusion, and buffer analysis for allocating
+runtime memory for the computation. During this optimization stage, XLA also
+converts the StableHLO dialect into an internal HLO dialect.

-2. XLA sends the HLO computation to a
-backend for further HLO-level optimizations, this time with target-specific
-information and needs in mind. For example, the GPU backend may perform
-operation fusions that are beneficial specifically for the GPU programming model
-and determine how to partition the computation into streams. At this stage,
-backends may also pattern-match certain operations or combinations thereof to
-optimized library calls.
+2. XLA sends the HLO computation to a backend for further HLO-level
+optimizations, this time with target-specific information and needs in mind.
+For example, the GPU backend may perform operation fusions that are
+beneficial specifically for the GPU programming model and determine how to
+partition the computation into streams. At this stage, backends may also
+pattern-match certain operations or combinations thereof to optimized
+library calls.

-3. The backend then performs target-specific code generation. The CPU and GPU
-backends included with XLA use [LLVM](http://llvm.org) for low-level
-IR, optimization, and code-generation. These backends emit the LLVM IR necessary
-to represent the HLO computation in an efficient manner, and then invoke LLVM to
-emit native code from this LLVM IR.
+3. The backend then performs target-specific code generation. The CPU and GPU
+backends included with XLA use [LLVM](http://llvm.org) for low-level IR,
+optimization, and code generation. These backends emit the LLVM IR necessary
+to represent the HLO computation in an efficient manner, and then invoke
+LLVM to emit native code from this LLVM IR.

-Within this process, the XLA Compiler is modular in the sense that it is easy to
-slot-in an alternative backend to [target some novel HW
-architecture](./developing_new_backend.md). The GPU backend currently supports
-NVIDIA GPUs via the LLVM NVPTX backend; the CPU backend supports multiple CPU
-ISAs.
+Within this process, the XLA compiler is modular in the sense that it is easy to
+slot in an alternative backend to
+[target some novel HW architecture](./developing_new_backend.md). The GPU
+backend currently supports NVIDIA GPUs via the LLVM NVPTX backend. The CPU
+backend supports multiple CPU ISAs.
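The doc above names JAX among the frontends that hand XLA a StableHLO module. A minimal sketch of that handoff, assuming a recent JAX release that lowers to StableHLO and using JAX's `jax.jit(...).lower(...)` and `as_text()` inspection helpers (JAX APIs, not part of XLA itself):

```python
# Sketch: produce the StableHLO module that XLA receives from a JAX frontend.
import jax
import jax.numpy as jnp

def predict(w, x):
    # A tiny "model graph": a matrix multiply followed by a nonlinearity.
    return jnp.tanh(x @ w)

w = jnp.ones((4, 2), dtype=jnp.float32)
x = jnp.ones((3, 4), dtype=jnp.float32)

# `lower` stops before XLA's optimization passes; `as_text` prints the
# versioned StableHLO ops (for example dot_general and tanh) that form the
# portability layer described above.
print(jax.jit(predict).lower(w, x).as_text())
```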
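Steps 1 and 2 in the doc describe target-independent passes followed by backend-specific HLO optimizations. A sketch of inspecting their combined result, assuming JAX's `compile()` and `as_text()` helpers on the lowered module; the exact optimized-HLO text, and which fused regions appear, depend on the backend and XLA version:

```python
# Sketch: run XLA's optimization passes and inspect the resulting HLO.
# Fused regions typically appear as `fusion` ops; details vary by backend.
import jax
import jax.numpy as jnp

def loss(w, x):
    y = jnp.tanh(x @ w)
    return jnp.sum(y * y)

w = jnp.ones((4, 2), dtype=jnp.float32)
x = jnp.ones((3, 4), dtype=jnp.float32)

compiled = jax.jit(loss).lower(w, x).compile()  # runs the passes for the default backend
print(compiled.as_text())                       # optimized internal HLO, after fusion
```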
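Step 3 covers LLVM-based code generation in the bundled CPU and GPU backends. One way to observe the whole pipeline end to end is XLA's dump facility; the sketch below assumes the `--xla_dump_to` flag passed through the `XLA_FLAGS` environment variable, and the exact set of dumped files (HLO before and after optimization, backend IR) varies by backend and version:

```python
# Sketch: ask XLA to dump its intermediate artifacts for every compilation.
# XLA_FLAGS must be set before the XLA backend initializes, i.e. before
# importing jax in a fresh process.
import os
os.environ["XLA_FLAGS"] = "--xla_dump_to=/tmp/xla_dump"

import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    return jnp.sum(x * x)

f(jnp.arange(8, dtype=jnp.float32))  # triggers compilation
# /tmp/xla_dump now holds per-module files such as the HLO before and after
# optimization; on the LLVM-based backends, IR and native code can be dumped
# with additional flags.
```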
