[RIP‐74] Flink‐Connector‐RocketMQ Dynamic Load Balancing

Status

Current State: Accept
Authors: hqbfz
Shepherds: ferrirW
Mailing List discussion: [email protected]
Pull Request: https://github.com/apache/rocketmq-flink/pull/126
Test doc: https://shimo.im/docs/loqeMyDZKOCzr0qn
Released: No

Background & Motivation

What do we need to do

rocketmq-flink implements dynamic load balancing of queues at runtime rocketmq-flink 实现运行时队列动态负载均衡
rocketmq-flink queue allocation algorithm optimization rocketmq-flink 队列分配算法优化

Why should we do that

In the original version, rocketmq-flink's queue allocation algorithm could not be averaged and could exacerbate unevenness when reduced. When the online environment experiences frequent scale-down, the situation becomes worse, and if you want to alleviate this situation, you can only restart the job, but cause large pauses. Therefore, we need to implement runtime dynamic allocation of queues. 在原版本中，rocketmq-flink 的队列分配算法无法做到平均，并且当缩容之后可能加重不均匀情况。在线上环境经历频繁扩缩容之后，情况会变得更加糟糕，而且如果想要缓解这种情况只能重启作业，但会造成较大停顿。因此我们需要实现队列的运行时动态分配。

Goals

Completely solve the problem of uneven queue allocation 彻底解决队列分配不均情况

Non-Goals

Are there any limits of this proposal?

Dynamic load balancing involves queue migration, how to maintain accurate consumption points 动态负载均衡涉及队列迁移，如何维护准确消费位点
Communication loss caused by the network environment 网络环境导致的通信丢失
How are the mappings maintained by jobManager consistent with actual assignments JobManager 维护的映射关系如何和实际分配强一致

Changes

Implement the expansion and shrinkage logic of jobManager and maintain the mapping relationship accurately 对 jobManager 进行扩缩容逻辑实现，以及对映射关系的准确维护
taskManager implements scaling logical adaptation and site reporting taskManager 实现扩缩容逻辑适配以及位点的上报

Architecture

需要额外设计的事件通知：

TM 位点更新的及时上报
TM 分配结果上报校验分配正确
由于初始化时的内部分配导致的jm维护的映射关系与实际分配不一致，需要tm上报初始化分配结果，jb更新本地维护的映射关系

Event notifications requiring additional design:

tm site update is reported in time tm 位点更新的及时上报
tm Assignment result is reported for verification. The assignment is correct tm 分配结果上报校验分配正确
The mappings maintained by jm are inconsistent with the actual mappings because of the internal allocation during initialization. tm needs to report the initial allocation result and jb needs to update the mappings maintained locally 由于初始化时的内部分配导致的jm维护的映射关系与实际分配不一致，需要tm上报初始化分配结果，jb更新本地维护的映射关系

Implementation Outline

Achieve the initial dynamic capacity expansion and contraction 实现初始动态容量扩展和收缩
The mapping relationship between tasks and queues maintained by jobManager is consistent with the actual allocation jobManager 维护的任务与队列之间的映射关系与实际分配一致
The guarantee service for special cases is realized 特殊情况下的保障服务得以实现

How does alternatives solve the issue you proposed?

Dynamic load balancing involves queue migration and site maintenance depends on the broker. 动态负载均衡会涉及队列迁移，位点的维护依赖于broker;
When restarting, flink's internal automatic allocation by checkpoint is abandoned. 重启时抛弃flink内部根据checkpoint的自动分配;

Pros and Cons of alternatives?

Pros

Flink internal automatic distribution ensures uniform initial distribution no matter how the number of tm changes, so internal distribution can reduce our additional design; Flink内部自动分配会保证无论tm数量怎么变化，都可以实现均匀初始分配，所以内部分配可以减少我们的额外设计；
Loci dependent Broker can report loci through a scheduled task, simple implementation 位点依赖于Broker可以通过定时任务上报位点即可，实现简单

Cons

After Flink's internal automatic allocation is deprecated, we need to design additional initialization cases and tm count changes to increase system complexity. Flink内部自动分配弃用之后，我们需要额外设计初始化情况以及tm数量变动情况等等问题，增加系统复杂性
Loci dependency on Broker can lead to: different jobs using the same Group and topic, resulting in two job sites covering each other, resulting in consumption duplication or loss; 位点依赖于Broker可能导致：不同作业使用相同Group和topic，导致两个作业位点互相覆盖，使消费重复或丢失等问题；

Home
RocketMQ Improvement Proposal
- RIP
User Guide
- FAQ
Community
- Release Policy

[RIP‐74] Flink‐Connector‐RocketMQ Dynamic Load Balancing

Status

Background & Motivation

What do we need to do

Why should we do that

Goals

Non-Goals

Are there any limits of this proposal?

Changes

Architecture

Implementation Outline

How does alternatives solve the issue you proposed?

Pros and Cons of alternatives?

Pros

Cons

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally