[Enhancement] Add a parameter that controls the number of StreamLoad tasks committed per partition #92

@baishaoisde

Search before asking

  • I have searched the issues and found no similar issues.

Description

If a partition holds more data than INSERT_BARCH_SIZE, its task submits multiple StreamLoad requests. If the task fails and is retried, all data in the partition is resubmitted to StreamLoad, including the batches that were already written successfully. The result is duplicated data.
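
The failure mode described above can be reproduced with a small simulation. This is only a sketch, not the connector's code: `write_partition`, `simulate_retry`, and the in-memory `table` are hypothetical stand-ins, and it assumes that each batch load either fully succeeds or fully fails and that a retried task replays its partition from the first row.

```python
def write_partition(rows, batch_size, stream_load):
    """Send a partition in batches; each batch is one (simulated) StreamLoad."""
    for start in range(0, len(rows), batch_size):
        stream_load(rows[start:start + batch_size])


def simulate_retry(rows, batch_size, fail_on_load, max_attempts=2):
    """Run the partition task, retrying the whole partition after a failure.

    fail_on_load is the 1-based index of the load call that fails once.
    Returns the rows that ended up in the (simulated) target table.
    """
    table = []            # stand-in for the target table
    calls = {"n": 0}      # total number of load calls across all attempts

    def stream_load(batch):
        calls["n"] += 1
        if calls["n"] == fail_on_load:
            raise RuntimeError("simulated StreamLoad failure")
        table.extend(batch)  # a successful load is committed immediately

    for _ in range(max_attempts):
        try:
            write_partition(rows, batch_size, stream_load)
            return table
        except RuntimeError:
            continue  # the retry replays the partition from the first row
    raise RuntimeError("task failed after all attempts")


# Five rows, batches of two, second load fails on the first attempt.
result = simulate_retry(list(range(5)), batch_size=2, fail_on_load=2)
print(result)  # [0, 1, 0, 1, 2, 3, 4]
```

Rows 0 and 1 were committed by the first load before the failure, and the retry commits them a second time.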

Solution

My suggestion is to add a parameter that, when enabled, forces each partition to submit exactly one StreamLoad, so that a retry cannot resubmit data that has already been committed.
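
A minimal sketch of how such a switch could behave, under stated assumptions: the flag name `single_load_per_partition` and the helpers `plan_loads` and `run_task` are hypothetical and are not taken from the connector, and each simulated load is treated as all-or-nothing.

```python
def plan_loads(rows, batch_size, single_load_per_partition=False):
    """Split a partition into the batches that would each become one StreamLoad."""
    if single_load_per_partition:
        # Hypothetical switch: ignore batch_size and send everything at once.
        return [list(rows)] if rows else []
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def run_task(rows, batch_size, single_load_per_partition, fail_on_load,
             max_attempts=2):
    """Simulate one partition task with whole-partition retry.

    fail_on_load is the 1-based index of the load call that fails once.
    Returns the rows committed to the (simulated) target table.
    """
    table = []
    calls = 0
    for _ in range(max_attempts):
        try:
            for batch in plan_loads(rows, batch_size, single_load_per_partition):
                calls += 1
                if calls == fail_on_load:
                    raise RuntimeError("simulated StreamLoad failure")
                table.extend(batch)  # committed only if the load succeeded
            return table
        except RuntimeError:
            continue  # retry the whole partition
    raise RuntimeError("task failed after all attempts")


rows = list(range(5))
batched = run_task(rows, 2, single_load_per_partition=False, fail_on_load=2)
single = run_task(rows, 2, single_load_per_partition=True, fail_on_load=1)
print(len(batched), len(single))  # 7 5
```

With the switch on, a failed attempt commits nothing, so the retry writes each row once. A likely trade-off is that a whole partition must fit into a single request, so partition sizes may need tuning when the option is enabled.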

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
