Study accuracy and performance penalty in byte size calculation of Event/Batch size #17736

Open
@andsel

Description

In the parent issue #7417 we need to measure the size in bytes of the batch and hence also the size of each event. This poses a couple of problems:

  • what does the size of an event/object mean?
  • should we have a byte-exact value, or is an estimation acceptable?
  • how does the estimation influence accuracy and performance compared to a byte-exact value?

What does the size of an event/object mean?

In the JVM, as in any other language runtime, every object instance needs some space to be managed by the runtime. For example, a plain Object instance takes 16 bytes on a 64-bit CPU and 8 on a 32-bit one (https://shipilev.net/jvm/objects-inside-out/#_observation_32_bit_vms_improve_footprint). So, considering for example a string, what is its byte size? Is it the sequence of bytes that encode the string (call it the payload), or does it also include the memory occupied by the String object instance itself?
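To make the distinction concrete, here is a minimal sketch. The method `estimateShallowStringSize` and all of its constants are illustrative assumptions (a 64-bit HotSpot JVM with compressed oops and compact Latin-1 strings), not exact values for any particular JVM. It contrasts a string's payload size with a rough footprint of the String object plus its backing array:

```java
import java.nio.charset.StandardCharsets;

public class StringFootprint {
    // Hypothetical shallow-size estimate for a compact (Latin-1) String on a
    // 64-bit HotSpot JVM with compressed oops; real numbers vary by JVM
    // version and flags. The per-field costs below are assumptions.
    static long estimateShallowStringSize(String s) {
        long header = 12;             // object header with compressed oops
        long fields = 4 + 4 + 1 + 4;  // value ref, cached hash, coder, flags
        long stringObject = align8(header + fields);
        // backing byte[]: 16-byte array header plus one byte per Latin-1 char
        long backingArray = align8(16 + s.getBytes(StandardCharsets.ISO_8859_1).length);
        return stringObject + backingArray;
    }

    // HotSpot aligns object sizes to 8-byte boundaries
    static long align8(long n) { return (n + 7) & ~7L; }

    public static void main(String[] args) {
        String s = "hello";
        // the "payload": just the bytes that encode the text
        int payload = s.getBytes(StandardCharsets.UTF_8).length;
        System.out.println("payload=" + payload
                + " estimatedFootprint=" + estimateShallowStringSize(s));
    }
}
```

Even in this toy model, the 5-byte payload of "hello" is dominated by object headers, field slots, and alignment padding, which is exactly why the two notions of "size" diverge.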

Should we have a byte-exact value, or is an estimation acceptable?

If we decide to use an approximation of the size, how good is it compared to the byte-exact value? How much does it cost?
Computing the exact byte size of an object isn't always feasible. Even the JOL library, a tool that computes the memory consumption of a single instance/class on a HotSpot JVM, doesn't solve the problem completely: it doesn't calculate the full retained size of an instance, that is, the instance itself plus the full reference graph reachable from it. To do that we would need to navigate the graph, which isn't always easy or feasible (consider private fields in an instance).

Computing the size would require navigating the object graph of an event instance, which is mostly a maps-of-maps structure. Alternatively, as an estimation, we could serialise the event with CBOR and take the size of the resulting byte array as an approximation of the size. How accurate are these approximations?
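The graph-walk idea can be sketched as a recursive estimator over the maps-of-maps structure. All the per-type constants here are made-up placeholders, not measured HotSpot costs; the point is the shape of the traversal, not the numbers:

```java
import java.util.List;
import java.util.Map;

public class EventSizeEstimator {
    // Sketch: walk a maps-of-maps event and sum rough per-value costs.
    // The constants are illustrative assumptions, not HotSpot-exact numbers.
    static long estimate(Object value) {
        if (value == null) return 0;
        if (value instanceof String) return 40 + ((String) value).length();
        if (value instanceof Number || value instanceof Boolean) return 24;
        if (value instanceof Map) {
            long sum = 48; // rough fixed cost of the map itself
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                // per-entry node cost plus the key and value subtrees
                sum += 32 + estimate(e.getKey()) + estimate(e.getValue());
            }
            return sum;
        }
        if (value instanceof List) {
            long sum = 48;
            for (Object o : (List<?>) value) sum += estimate(o);
            return sum;
        }
        return 16; // unknown type: assume a bare object header
    }

    public static void main(String[] args) {
        Map<String, Object> event = Map.of(
                "message", "hello world",
                "meta", Map.of("seq", 42));
        System.out.println("estimated bytes: " + estimate(event));
    }
}
```

Note that this walk only sees the public Map/List structure; shared references, private fields, and container internals are invisible to it, which is precisely the accuracy gap the issue asks about.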

How does the estimation influence accuracy and performance vs byte-exact?

Given that the byte-exact size is difficult to obtain, even from HPROF heap dumps, how do the proposed estimations behave with respect to accuracy and performance?
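One way to start answering the cost question is a rough timing sketch like the one below. It compares a cheap length-based guess against the size of the JDK-serialized form, used here as a stand-in for a serialization-based estimate such as CBOR (which would need an external library). This is an assumption-laden illustration, not a rigorous benchmark (no warmup, no JMH):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;

public class EstimationCost {
    // Size of the JDK-serialized form; a proxy for "serialize and measure"
    // approaches such as CBOR. Returns -1 if serialization fails.
    static int serializedSize(Serializable obj) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        } catch (IOException e) {
            return -1;
        }
        return bos.size();
    }

    public static void main(String[] args) {
        HashMap<String, Object> event = new HashMap<>();
        event.put("message", "hello world");

        long t0 = System.nanoTime();
        int cheap = ((String) event.get("message")).length(); // trivial guess
        long t1 = System.nanoTime();
        int serialized = serializedSize(event);
        long t2 = System.nanoTime();

        System.out.println("cheap=" + cheap + " bytes in " + (t1 - t0) + " ns");
        System.out.println("serialized=" + serialized + " bytes in " + (t2 - t1) + " ns");
    }
}
```

A proper comparison for this issue would run estimators like these on representative Logstash events under JMH, measuring both throughput cost and deviation from a reference size.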
