StarRocks: datax vs. Broker Load import performance comparison

                                 --2021-11-01 刘春雷

In many cases the source data lives in HDFS. To pick a good way to get HDFS data into StarRocks, this post benchmarks datax against Broker Load importing the same HDFS files.
Note: I have only used datax for a short while, so this is a simple test; if the configuration does not show datax at its full potential, my apologies.

Data source: HDFS

Target database: StarRocks
datax + StarRocksWriter vs. Broker Load

1. Basic information
Machine: 10.x.x.1
datax: DATAX-OPENSOURCE-3.0
StarRocksWriter download:
https://docs.starrocks.com/zh-cn/main/loading/DataX-starrocks-writer

2. Environment
2.1 datax
ssh 10.x.x.1
cd /opt/soft/
tar zxvf starrockswriter.tar.gz
tar zxvf datax.tar.gz
cp -r /opt/soft/starrockswriter /opt/soft/datax/plugin/writer
2.2 HDFS
Prepare the test data.
Sample rows:
aa0,0
aa1,1
aa2,2
aa3,3
aa4,4
aa5,5

hadoop fs -du -s -h hdfs://10.x.x.2:666/StarRocks/test_1.csv
889.7 M 1.7 G hdfs://10.x.x.2:666/StarRocks/test_1.csv
Row count: 79,207,920

hadoop fs -du -s -h hdfs://10.x.x.2:666/StarRocks/test_3.csv
7.6 G 15.3 G hdfs://10.x.x.2:666/StarRocks/test_3.csv
Row count: 697,029,696
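The article does not show how the test files were generated. A minimal sketch that produces rows in the same `aaN,N` shape (the generator, the `/tmp` file name, and the row count here are illustrative assumptions, not the originals):

```shell
# Hypothetical generator for data shaped like the sample above ("aa0,0", "aa1,1", ...).
# The real generator for test_1.csv / test_3.csv is not shown in the article.
gen_rows() {
  seq 0 "$(($1 - 1))" | awk '{ printf "aa%d,%d\n", $1, $1 }'
}

gen_rows 6 > /tmp/test_sample.csv
cat /tmp/test_sample.csv

# Upload to the HDFS path used in the tests:
# hadoop fs -put /tmp/test_sample.csv hdfs://10.x.x.2:666/StarRocks/test_1.csv
```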

2.3 StarRocks table DDL
CREATE TABLE lcl_test (
name varchar(100) NULL COMMENT "",
age varchar(64) NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(name)
COMMENT "test"
DISTRIBUTED BY HASH(name) BUCKETS 16
PROPERTIES (
"replication_num" = "3",
"in_memory" = "false",
"storage_format" = "DEFAULT"
);
2.4 StarRocks cluster
Version: 1.19.0
3. datax import test
3.1 Job configuration
3.1.1 Job 1
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "column": [
                            { "index": 0, "type": "string" },
                            { "index": 1, "type": "string" }
                        ],
                        "defaultFS": "hdfs://10.x.x.2:666",
                        "encoding": "UTF-8",
                        "fieldDelimiter": ",",
                        "fileType": "text",
                        "path": "/StarRocks/test_1.csv"
                    }
                },
                "writer": {
                    "name": "starrockswriter",
                    "parameter": {
                        "column": ["name", "age"],
                        "jdbcUrl": "jdbc:mysql://xxx:xxx/",
                        "loadUrl": ["xxx:xxx"],
                        "table": "lcl_test",
                        "database": "lcl",
                        "username": "xxx",
                        "password": "xxx",
                        "preSql": [],
                        "postSql": [],
                        "loadProps": {"column_separator": ",", "row_delimiter": "\x02"}
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "10"
            }
        }
    }
}
3.1.2 Job 2
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "column": [
                            { "index": 0, "type": "string" },
                            { "index": 1, "type": "string" }
                        ],
                        "defaultFS": "hdfs://10.x.x.2:666",
                        "encoding": "UTF-8",
                        "fieldDelimiter": ",",
                        "fileType": "text",
                        "path": "/StarRocks/test_3.csv"
                    }
                },
                "writer": {
                    "name": "starrockswriter",
                    "parameter": {
                        "column": ["name", "age"],
                        "jdbcUrl": "jdbc:mysql://xxx:xxx/",
                        "loadUrl": ["xxx:xxx"],
                        "table": "lcl_test_3",
                        "database": "lcl",
                        "username": "xxx",
                        "password": "xxx",
                        "preSql": [],
                        "postSql": [],
                        "loadProps": {"column_separator": ",", "row_delimiter": "\x02"}
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "40",
                "record": 400000,
                "byte": 1073741824
            }
        }
    }
}
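The configs above are saved as JSON files and launched through datax's Python entry point. A sketch of the workflow (the `/tmp/hdfs2sr_task.json` name and its stand-in contents are assumptions; the actual launch line is commented out since it requires a datax install):

```shell
# Stand-in config: just the speed settings from Job 1. The real job files
# hold the full reader/writer blocks shown in sections 3.1.1 and 3.1.2.
cat > /tmp/hdfs2sr_task.json <<'EOF'
{ "job": { "setting": { "speed": { "channel": "10" } }, "content": [] } }
EOF

# datax reports malformed JSON with long stack traces, so validate it first
python3 -m json.tool /tmp/hdfs2sr_task.json > /dev/null && echo "config OK"

# Launch (on the datax host from section 2.1):
# python /opt/soft/datax/bin/datax.py /tmp/hdfs2sr_task.json
```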
3.2 Running the datax imports
3.2.1 Job 1 import
2021-11-02 15:00:42.573 [job-0] INFO StandAloneJobContainerCommunicator - Total 79207920 records, 774512640 bytes | Speed 1.61MB/s, 172191 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 406.710s | All Task WaitReaderTime 1.625s | Percentage 100.00%
2021-11-02 15:00:42.574 [job-0] INFO JobContainer -
Job start time      : 2021-11-02 14:53:01
Job end time        : 2021-11-02 15:00:42
Total elapsed time  : 461s
Average throughput  : 1.61MB/s
Write speed         : 172191rec/s
Records read        : 79207920
Read/write failures : 0

3.2.2 Job 2 import
2021-11-02 18:28:14.452 [job-0] INFO StandAloneJobContainerCommunicator - Total 697029696 records, 6815711232 bytes | Speed 1.64MB/s, 175928 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 3,539.709s | All Task WaitReaderTime 27.402s | Percentage 100.00%
2021-11-02 18:28:14.454 [job-0] INFO JobContainer -
Job start time      : 2021-11-02 17:22:10
Job end time        : 2021-11-02 18:28:14
Total elapsed time  : 3963s
Average throughput  : 1.64MB/s
Write speed         : 175928rec/s
Records read        : 697029696
Read/write failures : 0
4. Broker Load import test
4.1 Create the load jobs
4.1.1 Job 1
LOAD LABEL lcl.label2
(
DATA INFILE("hdfs://10.x.x.2:666/StarRocks/test_1.csv")
INTO TABLE lcl_test_2
COLUMNS TERMINATED BY ","
)
WITH BROKER 'broker'
(
"username" = "xxx",
"password" = "xxx"
)
PROPERTIES
(
"timeout" = "3600"
);
4.1.2 Job 2
LOAD LABEL lcl.label4
(
DATA INFILE("hdfs://10.x.x.2:666/StarRocks/test_3.csv")
INTO TABLE lcl_test_4
COLUMNS TERMINATED BY ","
)
WITH BROKER 'broker'
(
"username" = "xxx",
"password" = "xxx"
)
PROPERTIES
(
"timeout" = "3600"
);

4.2 Check job status
4.2.1 Job 1
show load where label = 'label2'\G
*************************** 1. row ***************************
JobId: 10262
Label: label2
State: FINISHED
Progress: ETL:100%; LOAD:100%
Type: BROKER
EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=79207920
TaskInfo: cluster:N/A; timeout(s):3600; max_filter_ratio:0.0
ErrorMsg: NULL
CreateTime: 2021-11-02 15:10:04
EtlStartTime: 2021-11-02 15:10:09
EtlFinishTime: 2021-11-02 15:10:09
LoadStartTime: 2021-11-02 15:10:09
LoadFinishTime: 2021-11-02 15:11:47
URL: NULL
JobDetails: {"Unfinished backends":{"15b09ab5-ed2e-4ae6-b69e-3b4cb9f001cb":[]},"ScannedRows":79207920,"TaskNumber":1,"All backends":{"15b09ab5-ed2e-4ae6-b69e-3b4cb9f001cb":[10002,10003,10004]},"FileNumber":1,"FileSize":932928480}
1 row in set (0.01 sec)
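Broker Load does not print an elapsed-time summary the way datax does, but it can be derived from the CreateTime and LoadFinishTime fields above. A small sketch using GNU `date`, with the Job 1 timestamps copied from the `show load` output:

```shell
# Timestamps from the `show load` output for Job 1 above
start=$(date -d "2021-11-02 15:10:04" +%s)
end=$(date -d "2021-11-02 15:11:47" +%s)
rows=79207920

elapsed=$((end - start))
echo "elapsed: ${elapsed}s, ~$((rows / elapsed)) rows/s"
```

This comes out to 103 s, i.e. roughly 769k rows/s, versus datax's ~172k rows/s on the same file.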

4.2.2 Job 2
show load where label = 'label4'\G
*************************** 1. row ***************************
JobId: 12546
Label: label4
State: FINISHED
Progress: ETL:100%; LOAD:100%
Type: BROKER
EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=697029696
TaskInfo: cluster:N/A; timeout(s):3600; max_filter_ratio:0.0
ErrorMsg: NULL
CreateTime: 2021-11-03 09:07:29
EtlStartTime: 2021-11-03 09:07:31
EtlFinishTime: 2021-11-03 09:07:31
LoadStartTime: 2021-11-03 09:07:31
LoadFinishTime: 2021-11-03 09:21:34
URL: NULL
JobDetails: {"Unfinished backends":{"3b898b84-5973-490e-9985-a64afe9165db":[]},"ScannedRows":697029696,"TaskNumber":1,"All backends":{"3b898b84-5973-490e-9985-a64afe9165db":[10002,10003,10004]},"FileNumber":1,"FileSize":8209770624}
5. Comparison

Summarizing the results above (datax elapsed times are from the job summaries in section 3.2; Broker Load elapsed times are derived as LoadFinishTime - CreateTime from section 4.2):

File         Rows          datax    Broker Load
test_1.csv   79,207,920    461s     103s  (~769k rows/s)
test_3.csv   697,029,696   3963s    845s  (~825k rows/s)

On this cluster, Broker Load imported the same HDFS files roughly 4.5x faster than datax. Note that the datax jobs ran with only lightly tuned channel settings, so this should not be read as datax's ceiling.