StarRocks的 broker load导入性能测试

StarRocks的 broker load导入性能测试

        --2021-11-25  刘春雷

1、前言

StarRocks目前我们已经使用了一阵了,由于公司的架构问题,我们之前没有打通HDFS 的broker load 的导入流程。此处全是辛酸泪~

近期终于打通了HDFS 的broker load 流程,就进行了相关导入性能的测试。

2、测试结果汇总

1.1、测试结果

【结论】:

  • broker load 导入速度比stream load 快
  • BE的节点数越多,broker load的速度越快
  • 文件格式对broker load的速度影响不大(此处对比csv与parquet)
  • broker 服务推荐部署在BE节点,可以减少网络传输
  • 如果导入的文件比较少,broker 服务的数量对broker load的速度影响不大,但推荐所有BE均部署broker
  • FE:load_parallel_instance_num 等参数会影响broker load的速度,但不宜调整过大,会导致冲突严重
  • BE:flush_thread_num_per_store 与 olap_table_sink_send_interval_ms 参数会影响broker load的速度,但影响有限
  • BE服务器CPU核数对导入速度有影响,核数越多越快
  • BE服务器CPU主频对导入速度有影响,主频数越多越快
  • 本次测试最快172w条/s

【总结】:

  • 为提高broker load的速度: HDFS的文件格式均可,设置FE、BE相关参数,增加BE节点数,broker 与 BE混合部署,且BE均部署broker 服务,提高单个BE的配置:cpu核数、主频

【测试详细结果如下】:







1.2、集群信息

1.3、HDFS导入架构

【公司的HDFS】: 有IP白名单限制

3个单独的物理机器已经开通了HDFS的IP白名单

【DBA组内的HDFS】: 同样有IP白名单限制

但是FE/BE/Broker的IP均已经开通

1.4、测试的建表SQL

CREATE TABLE xxx (
slot varchar(64) NULL COMMENT “”,
platform int(11) NULL COMMENT “”,
ad_local1 varchar(20) NULL COMMENT “”,
ad_cate2 varchar(20) NULL COMMENT “”,
ad_id bigint(20) NULL COMMENT “”,
parthasinfo bigint(20) NULL COMMENT “”,
sloc1 varchar(20) NULL COMMENT “”,
scate2 varchar(20) NULL COMMENT “”,
sellingcitytag varchar(20) NULL COMMENT “”,
requesttype varchar(20) NULL COMMENT “”,
ad_viptag int(11) NULL COMMENT “”,
action_type varchar(32) NULL COMMENT “”,
data_time datetime NULL COMMENT “”
) ENGINE=OLAP
DUPLICATE KEY(slot, platform, ad_local1, ad_cate2)
COMMENT “测试表”
PARTITION BY RANGE(data_time)
(
PARTITION p20211103 VALUES [(‘2021-11-03 00:00:00’), (‘2021-11-04 00:00:00’)

),
DISTRIBUTED BY HASH(ad_id, slot) BUCKETS 48
PROPERTIES (
“replication_num” = “3”,
“dynamic_partition.enable” = “true”,
“dynamic_partition.time_unit” = “DAY”,
“dynamic_partition.time_zone” = “Asia/Shanghai”,
“dynamic_partition.start” = “-30”,
“dynamic_partition.end” = “1”,
“dynamic_partition.prefix” = “p”,
“dynamic_partition.buckets” = “48”,
“in_memory” = “false”,
“storage_format” = “DEFAULT”
);

1.5、HDFS准备文件

【公司的】:

总条数:226229635

总大小:3G

总文件数:12个

文件格式: parquet

单个文件:256M

【公司的】:

总条数:226229635

总大小:19G

总文件数:25个

文件格式: csv

单个文件:782M

【DBA组内的】:

总条数:226229635

总大小:19G

总文件数:25个

文件格式: csv

单个文件:782M

2、DBA的工具导入

2.1、实现方式

公司HDFS–DBA程序实现web方式下载–循环顺序 stream load 上传到StarRocks

2.2、时间情况

下载耗时 :203s

stream load耗时 :712s

行数:226229635

导入速度:226229635/712= 317738 条/s

3、broker load 导入

3.1、broker导入任务-1

【信息】:

集群:A

公司HDFS parquet格式

1个独立的broker

FE相关参数默认

BE数量:6个

创建导入任务

LOAD LABEL xx_202111091625
(
DATA INFILE(“viewfs://xxx/xxx/xxx/xxx/xxx/xx/xx/dt=20211103/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “\x01”
FORMAT AS “parquet”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xx”
)
PROPERTIES
(
“timeout” = “3600”
);

【任务结果】:

条数:226229635

速度: 457954 条/s

3.2、broker导入任务-2

【信息】:

集群:A

公司HDFS csv格式

1个独立的broker

FE相关参数默认

BE数量:6个

创建导入任务

LOAD LABEL xx_2_202111121045
(
DATA INFILE(“viewfs://xx/xx/xx/xx/xx/temp/xx/dt=20211103/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xx”
)
PROPERTIES
(
“timeout” = “6000”
);

【任务结果】:

条数:226229635

速度: 424445 条/s

3.3、 broker导入任务-3

【信息】:

集群:A

公司HDFS csv格式

3个独立的broker

FE参数:

load_parallel_instance_num = 16
max_broker_concurrency = 100 默认

BE数量:6个

创建导入任务

LOAD LABEL xx_2ck_2_202111131100
(
DATA INFILE(“viewfs://xx/xx/xx/xx/xx/x/xxx/dt=20211103/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xx”
)
PROPERTIES
(
“timeout” = “6000”
);

【任务结果】:

条数:226229635

速度: 557215 条/s

3.4、 broker导入任务-4

【信息】:

集群:A

公司HDFS csv格式

3个独立的broker

FE参数:

load_parallel_instance_num = 32
max_broker_concurrency = 400

BE数量:6个

创建导入任务

LOAD LABEL xx_202111131417
(
DATA INFILE(“viewfs://xx/xx/xx/xx/xx/xx/xx_test_csv/dt=20211103/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xx”
)
PROPERTIES
(
“timeout” = “6000”
);

【任务结果】:

条数:226229635

速度: 590677 条/s

3.5、broker导入任务-5

【信息】:

集群:A

组内的HDFS csv格式

FE,BE均混合部署broker,共12个

FE参数:

load_parallel_instance_num = 32
max_broker_concurrency = 400

BE数量:6个

【创建导入任务】

LOAD LABEL xx_202111141354
(
DATA INFILE(“hdfs://xxx:xx/StarRocks/xxx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”,
“password” = “xxx”
)
PROPERTIES
(
“timeout” = “6000”
);

【任务结果】:

条数:226229635

速度: 586087 条/s

3.6、broker导入任务-6

【信息】:

集群:A

组内的 HDFS csv格式

FE,BE均混合部署broker,共12个

FE参数:

load_parallel_instance_num = 32
max_broker_concurrency = 400

be参数:

flush_thread_num_per_store=8

BE数量:6个

创建导入任务

LOAD LABEL xx_202111141454
(
DATA INFILE(“hdfs://xx:xx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xx”,
“password” = “xx”
)
PROPERTIES
(
“timeout” = “6000”
);

【任务结果】:

条数:226229635

速度: 608144 条/s

3.7、 broker导入任务-7

【信息】:

集群:A

组内的 HDFS csv格式

BE均混合部署broker,共6个

FE参数:

load_parallel_instance_num = 32
max_broker_concurrency = 400

be参数:

flush_thread_num_per_store=8

BE数量:6个

导入任务参数:

“load_mem_limit” = “12884901888” (12G)

创建导入任务

LOAD LABEL xx_2ck_2_202111150918
(
DATA INFILE(“hdfs://xxx:xxx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”,
“password” = “xxx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);

【任务结果】:

条数:226229635

速度: 589139 条/s

3.8、 broker导入任务-8

【信息】:

集群:A

组内的 HDFS csv格式

BE均混合部署broker,共6个

FE参数:

load_parallel_instance_num = 32
max_broker_concurrency = 400

be参数:

flush_thread_num_per_store=8

olap_table_sink_send_interval_ms=0

BE数量:6个

导入任务参数:

“load_mem_limit” = “12884901888” (12G)

创建导入任务

LOAD LABEL xx2_202111161526
(
DATA INFILE(“hdfs://xxx:xxx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”,
“password” = “xxx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);

【任务结果】:

条数:226229635

速度: 562760 条/s

3.9、 broker导入任务-9

【信息】:

集群:A

组内的HDFS csv格式

BE均混合部署broker,共6个

FE参数:

load_parallel_instance_num = 2
max_broker_concurrency = 400

be参数:

flush_thread_num_per_store=8

olap_table_sink_send_interval_ms=0

BE数量:6个

导入任务参数:

“load_mem_limit” = “12884901888” (12G)

【创建导入任务】

LOAD LABEL xx_2_202111161640
(
DATA INFILE(“hdfs://xxx:xxx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”,
“password” = “xxx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);

【任务结果】:

条数:226229635

速度: 509526 条/s

3.10、 broker导入任务-10

【信息】:

集群:B

组内的HDFS csv格式

FE均混合部署broker,共3个

FE参数:

load_parallel_instance_num = 32
max_broker_concurrency = 400

be参数:

flush_thread_num_per_store=8

olap_table_sink_send_interval_ms=0

BE数量:6个

导入任务参数:

“load_mem_limit” = “12884901888” (12G)

【创建导入任务】

LOAD LABEL xx_202111181741
(
DATA INFILE(“hdfs://xx:xx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xx”,
“password” = “xx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);

【任务结果】:

条数:226229635

速度: 501617 条/s

3.11、 broker导入任务-11

【信息】:

集群:B

组内的 HDFS csv格式

2个BE均混合部署broker,共2个

FE参数:

load_parallel_instance_num = 32
max_broker_concurrency = 400

be参数:

flush_thread_num_per_store=8

olap_table_sink_send_interval_ms=0

BE数量:6个

导入任务参数:

“load_mem_limit” = “12884901888” (12G)

创建导入任务

LOAD LABEL xx_202111191628
(
DATA INFILE(“hdfs://xx:xx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xx”,
“password” = “xx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);

【任务结果】:

条数:226229635

速度: 844140 条/s

3.12、 broker导入任务-12

【信息】:

集群:B

组内的 HDFS csv格式

6个BE均混合部署broker,共6个

FE参数:

load_parallel_instance_num = 32
max_broker_concurrency = 400

be参数:

flush_thread_num_per_store=8

olap_table_sink_send_interval_ms=0

BE数量:6个

导入任务参数:

“load_mem_limit” = “12884901888” (12G)

创建导入任务

LOAD LABEL xx_2_202111191647
(
DATA INFILE(“hdfs://xx:xx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”,
“password” = “xxx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);

【任务结果】:

条数:226229635

速度: 883709 条/s

3.13、 broker导入任务-13

【信息】:

集群:B

组内的 HDFS csv格式

6个BE混合部署broker,共6个

FE参数:

load_parallel_instance_num = 32
max_broker_concurrency = 400

be参数:

flush_thread_num_per_store=8

olap_table_sink_send_interval_ms=0

BE数量:9个

导入任务参数:

“load_mem_limit” = “12884901888” (12G)

创建导入任务

LOAD LABEL xx_202111231755
(
DATA INFILE(“hdfs://xxx:xxx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”,
“password” = “xxx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);
【任务结果】:

条数:226229635

速度: 1508197 条/s

3.14、 broker导入任务-14

【信息】:

集群:B

组内的 HDFS csv格式

9个BE混合部署broker,共9个

FE参数:

load_parallel_instance_num = 32
max_broker_concurrency = 400

be参数:

flush_thread_num_per_store=8

olap_table_sink_send_interval_ms=0

BE数量:9个

导入任务参数:

“load_mem_limit” = “12884901888” (12G)

创建导入任务

LOAD LABEL xx_202111231806
(
DATA INFILE(“hdfs://xxx:xxx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”,
“password” = “xxx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);
【任务结果】:

条数:226229635

速度: 1459546 条/s

3.15、 broker导入任务-15

【信息】:

集群:B

组内的 HDFS csv格式

9个BE混合部署broker,共9个

FE参数:

load_parallel_instance_num = 32
max_broker_concurrency = 400

be参数:

flush_thread_num_per_store=8

olap_table_sink_send_interval_ms=0

BE数量:11个

导入任务参数:

“load_mem_limit” = “12884901888” (12G)

创建导入任务

LOAD LABELxx202111231824
(
DATA INFILE(“hdfs://xxx:xxx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xx”,
“password” = “xx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);

【任务结果】:

条数:226229635

速度: 1753718 条/s

3.16、 broker导入任务-16

【信息】:

集群:B

组内的 HDFS csv格式

11个BE混合部署broker,共11个

FE参数:

load_parallel_instance_num = 32
max_broker_concurrency = 400

be参数:

flush_thread_num_per_store=8

olap_table_sink_send_interval_ms=0

BE数量:11个

导入任务参数:

“load_mem_limit” = “12884901888” (12G)

创建导入任务

LOAD LABEL xx_2_202111231830
(
DATA INFILE(“hdfs://xxx:xx/StarRocks/xx_csv/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “,”
FORMAT AS “csv”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”,
“password” = “xxx”
)
PROPERTIES
(
“timeout” = “6000”,
“load_mem_limit” = “12884901888”
);

【任务结果】:

条数:226229635

速度: 1726943 条/s

3.17、 broker导入任务-17

【信息】:

集群:B

公司的 HDFS parquet 格式

3个独立部署的broker

FE参数:

load_parallel_instance_num = 32
max_broker_concurrency = 400

be参数:

flush_thread_num_per_store=8

olap_table_sink_send_interval_ms=0

BE数量:11个

导入任务参数:

“load_mem_limit” = “12884901888” (12G)

创建导入任务

LOAD LABEL xx_202111251452
(
DATA INFILE(“viewfs://xxx/xx/xx/xx/xx/xx/xx/dt=20211103/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “\x01”
FORMAT AS “parquet”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “hdp_lbg_supin”
)
PROPERTIES
(
“timeout” = “3600”,
“load_mem_limit” = “12884901888”
);

【任务结果】:

条数:226229635

速度: 1405153 条/s

3.18、 broker导入任务-18

【信息】:

集群:C

公司的 HDFS parquet 格式

3个独立部署的broker

FE参数:

load_parallel_instance_num = 32
max_broker_concurrency = 400

be参数:

flush_thread_num_per_store=8

olap_table_sink_send_interval_ms=0

BE数量:3个

导入任务参数:

“load_mem_limit” = “12884901888” (12G)

创建导入任务

LOAD LABEL xx_202111251649
(
DATA INFILE(“viewfs://xxx/xx/xx/xxx/xx/xx/xx/dt=20211103/*”)
INTO TABLE xx
COLUMNS TERMINATED BY “\x01”
FORMAT AS “parquet”
(ad_id,parthasinfo,platform,slot,sloc1,scate2,sellingcitytag,requesttype,ad_viptag,ad_cate2,ad_local1,action_type,data_time)
)
WITH BROKER ‘broker’
(
“username” = “xxx”
)
PROPERTIES
(
“timeout” = “3600”,
“load_mem_limit” = “12884901888”
);

【任务结果】:

条数:226229635

速度: 526115 条/s

2赞

将 load_parallel_instance_num 变大后,show load 返回的JobDetails 中 taskNumber 数量会变大么