Description
Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04 x86_64): Linux x86_64 in a Docker container
- TensorFlow installed from (source or binary):
- TensorFlow version (use command below): 1.0.0-RC.2
- Java version (i.e., the output of java -version): openjdk version "21.0.4"
- Java command line flags (e.g., GC parameters):
- Python version (if transferring a model trained in Python): 3.9
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version:
- GPU model and memory:
Describe the current behavior
I am using TensorFlow in a Spring Boot application, which exposes an endpoint for NER processing. The TensorFlow model is trained in Python and loaded into the Java application for inference.
To optimize performance, I initialize the TensorFlow session once during application startup using a @PostConstruct method and store it in a private field:
private Session session;

@PostConstruct
private void initialize() throws IOException {
    byte[] bytes = Files.readAllBytes(Paths.get("/path/to/model/"));
    Graph graph = new Graph();
    graph.importGraphDef(GraphDef.parseFrom(bytes), "PREFIX");
    session = new Session(graph);
}
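On the teardown side: the native memory behind Graph and Session is only reclaimed when close() is called, so a session held for the application's lifetime needs a matching shutdown hook (a @PreDestroy method in Spring). A self-contained sketch of that pairing, using a stand-in AutoCloseable instead of the real TensorFlow types so it runs without any dependencies:

```java
// Stand-in for a native-backed resource such as org.tensorflow.Session:
// its off-heap memory is only reclaimed when close() is invoked.
class NativeResource implements AutoCloseable {
    boolean closed = false;

    @Override
    public void close() {
        closed = true;
    }
}

public class LifecycleSketch {
    NativeResource session; // stands in for the long-lived Session field

    void initialize() {     // @PostConstruct in the real Spring bean
        session = new NativeResource();
    }

    void shutdown() {       // @PreDestroy counterpart: close on app exit
        if (session != null) {
            session.close();
        }
    }

    public static void main(String[] args) {
        LifecycleSketch app = new LifecycleSketch();
        app.initialize();
        app.shutdown();
        System.out.println(app.session.closed); // prints "true"
    }
}
```

Note that closing the session only at shutdown bounds its lifetime but would not by itself explain growth during steady-state request traffic.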
The session is reused in a public method for running predictions:
public Result predict(String input) {
    try (Tensor textTensor = Tensor.of(TInt32.class, ...);
         Result result = session.runner()
                 .feed("otherOperationName", textTensor)
                 .fetch("operationName")
                 .run()) {
        // Process the result here
    }
}
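As a sanity check on the per-call pattern above: try-with-resources closes every declared resource in reverse declaration order, even if the body throws, so the input tensor and the run result should both be released on every request. A self-contained sketch with tracked stand-in resources (no TensorFlow types):

```java
import java.util.ArrayList;
import java.util.List;

public class TryWithResourcesSketch {

    // Records the order in which resources are closed.
    static final List<String> closedOrder = new ArrayList<>();

    static class Tracked implements AutoCloseable {
        final String name;

        Tracked(String name) {
            this.name = name;
        }

        @Override
        public void close() {
            closedOrder.add(name);
        }
    }

    // Mirrors the shape of the predict() method:
    // try (Tensor t = ...; Result r = ...) { ... }
    static void predictShaped() {
        try (Tracked tensor = new Tracked("tensor");
             Tracked result = new Tracked("result")) {
            // process the result here
        }
    }

    public static void main(String[] args) {
        predictShaped();
        // Closed in reverse declaration order.
        System.out.println(closedOrder); // prints "[result, tensor]"
    }
}
```

If this pattern is followed for every tensor and result, the remaining suspects are resources created outside the try block and the native memory of the session/graph themselves.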
During performance testing, I monitored the JVM heap and found no significant issues. However, when the application runs in a Docker container, it crashes after a while, regardless of how much memory is allocated to the container (even 120 GB). The following warning appears in the logs before the crash:
W external/local_tsl//framework/cpu_allocator_impl.cc:83] Allocation of 34891293 exceeds 10% of free system memory.
Is it possible that the memory leak is caused by the session being stored in a private field and never explicitly closed, even though all tensors and intermediate results are properly closed in the predict method?
Describe the expected behavior
The application should not exhibit memory leaks or crashes when deployed in a Docker container, regardless of memory allocation.