Primary key table: ALTER TABLE xxx ADD COLUMN fails

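[Details] Adding a column to a primary key table on 1.19.5-a356769 fails with State: CANCELLED. After each retry, the values listed under Error replicas increase.
[Background] ALTER TABLE o_order_index ADD COLUMN amount DECIMAL(10,2) COMMENT '交易金额';
[Business impact]
[StarRocks version] 1.19.5-a356769
[Cluster size] e.g. 3 FE (1 follower + 2 observers) + 3 BE (FE and BE co-located)
[Machine info] CPU virtual cores / memory / NIC, e.g. 8C/32G/10 GbE
[Attachments]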

SHOW ALTER TABLE COLUMN WHERE TableName = "o_order_index" ORDER BY CreateTime DESC LIMIT 1\G;
*************************** 1. row ***************************
JobId: 224336
TableName: o_order_index
CreateTime: 2022-05-16 11:53:55
FinishTime: 2022-05-16 11:55:00
IndexName: o_order_index
IndexId: 224337
OriginIndexId: 21808
SchemaVersion: 1:219964650
TransactionId: -1
State: CANCELLED
Msg: Create replicas failed. Error: Error replicas:10002=245762, 10002=245766, 10002=245770
Progress: NULL
Timeout: 86400
1 row in set (0.00 sec)

ERROR: No query specified

After each retry, the number following "10002=" increases.
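
The Msg field appears to list the failed replicas as backendId=signature pairs (245762 is the same value that shows up as signature= in the BE task log further down). A minimal Python sketch, assuming that format, for pulling the pairs out of the message so the values can be compared across retries. parse_error_replicas is a hypothetical helper, not part of StarRocks:

```python
import re

def parse_error_replicas(msg):
    """Return (backend_id, signature) pairs from an 'Error replicas:' message.

    Assumption: each entry has the form <backendId>=<signature>,
    as in the Msg field shown above.
    """
    # Keep only the text after the marker; empty string if the marker is absent.
    _, _, tail = msg.partition("Error replicas:")
    return [(int(b), int(s)) for b, s in re.findall(r"(\d+)=(\d+)", tail)]

msg = ("Create replicas failed. Error: Error replicas:"
       "10002=245762, 10002=245766, 10002=245770")
pairs = parse_error_replicas(msg)
print(pairs)
```

Running this on the Msg from two consecutive retries makes it easy to confirm that the backend ID stays the same while the numbers after it grow.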

Searching the FE log fe.log for "Create replicas failed" gives:

grep 'Create replicas failed' fe.log
2022-05-16 11:55:00,737 INFO (schema change|15) [SchemaChangeJobV2.cancelImpl():624] cancel SCHEMA_CHANGE job 224336, err: Create replicas failed. Error: Error replicas:10002=245762, 10002=245766, 10002=245770

grep 'schema change|15' fe.log
2022-05-16 11:54:00,675 INFO (schema change|15) [SchemaChangeJobV2.runPendingJob():200] begin to send create replica tasks. job: 224336
2022-05-16 11:54:00,681 INFO (schema change|15) [AlterJobV2.checkTableStable():216] table 21807 is stable, start job224336, type SCHEMA_CHANGE
2022-05-16 11:55:00,715 WARN (schema change|15) [SchemaChangeJobV2.runPendingJob():304] failed to create replicas for job: 224336, Error replicas:10002=245762, 10002=245766, 10002=245770
2022-05-16 11:55:00,737 INFO (schema change|15) [SchemaChangeJobV2.cancelImpl():624] cancel SCHEMA_CHANGE job 224336, err: Create replicas failed. Error: Error replicas:10002=245762, 10002=245766, 10002=245770
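
The timestamps above show the job failing almost exactly one minute after the create-replica tasks were sent, which looks more like a timeout expiring than an immediate error reported by the BE. A small Python sketch of that arithmetic, using only the two timestamps from the log. That a limit of about 60 seconds was in effect is an inference from these lines, not something the log states:

```python
from datetime import datetime

# Timestamp layout used by fe.log, e.g. "2022-05-16 11:54:00,675"
FMT = "%Y-%m-%d %H:%M:%S,%f"

def elapsed_seconds(start, end):
    """Seconds between two fe.log timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()

sent = "2022-05-16 11:54:00,675"    # begin to send create replica tasks
failed = "2022-05-16 11:55:00,715"  # failed to create replicas
wait = elapsed_seconds(sent, failed)
print(f"{wait:.2f} s")
```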

Searching the log on BE 10002 gives:

grep '245762' be.INFO
I0516 11:54:01.219380 16615 task_worker_pool.cpp:178] submitting task. type=CREATE, signature=245762
I0516 11:54:01.219386 16615 task_worker_pool.cpp:190] success to submit task. type=CREATE, signature=245762, task_count_in_queue=5354
I0516 11:55:36.245762 2605 tablet_manager.cpp:191] Creating tablet_id=236098 schema_hash=219964650
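
Note that the third line has no signature field. It seems to match the grep only because the microsecond part of its timestamp (11:55:36.245762) happens to contain the same digits, so it may belong to an unrelated tablet creation. A minimal Python sketch, assuming the log layout shown above, that keeps only lines whose signature= field equals the value. lines_for_signature is a hypothetical helper:

```python
import re

def lines_for_signature(lines, signature):
    """Keep only log lines that carry signature=<signature> as a field."""
    pattern = re.compile(rf"\bsignature={signature}\b")
    return [line for line in lines if pattern.search(line)]

be_log = [
    "I0516 11:54:01.219380 16615 task_worker_pool.cpp:178] submitting task. type=CREATE, signature=245762",
    "I0516 11:54:01.219386 16615 task_worker_pool.cpp:190] success to submit task. type=CREATE, signature=245762, task_count_in_queue=5354",
    "I0516 11:55:36.245762 2605 tablet_manager.cpp:191] Creating tablet_id=236098 schema_hash=219964650",
]
matches = lines_for_signature(be_log, 245762)
for line in matches:
    print(line)
```

With that line set aside, what the BE log shows for this task is that it was accepted into a queue that already held 5354 tasks.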

BE 10002 is in a normal state:

show backends\G;
*************************** 1. row ***************************
BackendId: 10002
Cluster: default_cluster
IP: 192.168.153.119
HeartbeatPort: 9050
BePort: 9060
HttpPort: 8040
BrpcPort: 8060
LastStartTime: 2022-05-10 11:44:57
LastHeartbeat: 2022-05-16 13:13:48
Alive: true
SystemDecommissioned: false
ClusterDecommissioned: false
TabletNum: 10948
DataUsedCapacity: 64.072 GB
AvailCapacity: 1.870 TB
TotalCapacity: 1.952 TB
UsedPct: 4.20 %
MaxDiskUsedPct: 4.20 %
ErrMsg:
Version: 1.19.5-a356769
Status: {"lastSuccessReportTabletsTime":"2022-05-16 13:13:04"}
DataTotalCapacity: 1.933 TB
DataUsedPct: 3.24 %
*************************** 2. row ***************************

ps -ef | grep be
root 2361 1 3 May10 ? 05:48:23 /xdd/soft/StarRocks/be/lib/starrocks_be

Run pstack on the BE, then restart the BE:

pstack 2361 >> /tmp/pstack.log
cat /tmp/pstack.log

For the pstack output, see the attachment pstack.log (171.3 KB)


Thanks to 景丹 for the support. The problem was resolved by increasing the tablet creation timeout parameter:

ADMIN SET FRONTEND CONFIG ("tablet_create_timeout_second" = "5");