3.1.9 sql Invalid plan 报错

为了更快的定位您的问题,请提供以下信息,谢谢
【详述】问题详细描述
集群从3.1.6升级到3.1.9后,有个sql 查询报错了,报错的日志如下,不是必现的,重复跑几次会出现

2024-03-11 14:03:42,861 WARN (starrocks-mysql-nio-pool-469|48472) [StmtExecutor.execute():674] execute Exception, sql SELECT count(*) as total,sum(purchase_quantity) as purchase_quantity,sum(receive_quantity) as receive_quantity,sum(quantity_diff) as quantity_diff,sum(over_receive_num) as over_receive_num,sum(wait_quantity) as wait_quantity,sum(exchange_quantity) as exchange_quantity,sum(return_quantity) as return_quantity,sum(good_quantity) as good_quantity,sum(bad_quantity) as bad_quantity,sum(purchase_amount) as purchase_amount,sum(receive_amount) as receive_amount,sum(wait_amount) as wait_amount,sum(return_amount) as return_amount,sum(order_count) as order_count,if(sum(receive_quantity + return_quantity) > 0, round(sum(bad_quantity + return_bad_quantity) / sum(receive_quantity + return_quantity), 4), 0) as bad_rate,if(sum(receive_quantity + return_quantity) > 0, round(sum(return_quantity)/sum(receive_quantity + return_quantity), 4), 0) as return_rate FROM ( SELECT max(pri.pri_id) as pri_id,sum(pri.purchase_quantity) as purchase_quantity,sum(pri.receive_quantity) as receive_quantity,sum(pri.quantity_diff) as quantity_diff,sum(pri.over_receive_num) as over_receive_num,sum(pri.wait_quantity) as wait_quantity,sum(pri.exchange_quantity) as exchange_quantity,sum(pri.return_quantity) as return_quantity,sum(pri.good_quantity) as good_quantity,sum(pri.bad_quantity) as bad_quantity,sum(pri.return_good_quantity) as return_good_quantity,sum(pri.return_bad_quantity) as return_bad_quantity,sum(pri.purchase_amount) as purchase_amount,sum(pri.receive_quantity * pri.price) as receive_amount,sum(pri.wait_amount) as wait_amount,sum(pri.return_amount) as return_amount,count(distinct pri.type, pri.purchase_order_id) as order_count FROM purchase_report_item pri WHERE pri.company_id = 901202594714743296 AND pri.type IN (1,3,5) AND pri.status = 10 AND ( pri.order_time >= 1690819200 AND pri.order_time < 1693497600 ) GROUP BY pri.buyer_uid ) as pri
com.starrocks.sql.common.StarRocksPlannerException: Invalid plan:
PhysicalHashAggregate {type=GLOBAL, groupBy=[], partitionBy=[] ,aggregations={97: sum=sum(97: sum), 98: sum=sum(98: sum), 99: sum=sum(99: sum), 83: count=count(83: count), 84: sum=sum(84: sum), 85: sum=sum(85: sum), 86: sum=sum(86: sum), 87: sum=sum(87: sum), 88: sum=sum(88: sum), 89: sum=sum(89: sum), 90: sum=sum(90: sum), 91: sum=sum(91: sum), 92: sum=sum(92: sum), 93: sum=sum(93: sum), 94: sum=sum(94: sum), 95: sum=sum(95: sum), 96: sum=sum(96: sum)}}
-> PhysicalDistributionOperator {distributionSpec=GATHER ,globalDict=[]}
-> PhysicalHashAggregate {type=LOCAL, groupBy=[], partitionBy=[] ,aggregations={97: sum=sum(80: count), 98: sum=sum(add(66: sum, 71: sum)), 99: sum=sum(add(73: sum, 75: sum)), 83: count=count(), 84: sum=sum(65: sum), 85: sum=sum(66: sum), 86: sum=sum(67: sum), 87: sum=sum(68: sum), 88: sum=sum(69: sum), 89: sum=sum(70: sum), 90: sum=sum(71: sum), 91: sum=sum(72: sum), 92: sum=sum(73: sum), 93: sum=sum(76: sum), 94: sum=sum(77: sum), 95: sum=sum(78: sum), 96: sum=sum(79: sum)}}
-> PhysicalHashAggregate {type=GLOBAL, groupBy=[16: buyer_uid], partitionBy=[16: buyer_uid] ,aggregations={65: sum=sum(65: sum), 66: sum=sum(66: sum), 67: sum=sum(67: sum), 68: sum=sum(68: sum), 69: sum=sum(69: sum), 70: sum=sum(70: sum), 71: sum=sum(71: sum), 72: sum=sum(72: sum), 73: sum=sum(73: sum), 75: sum=sum(75: sum), 76: sum=sum(76: sum), 77: sum=sum(77: sum), 78: sum=sum(78: sum), 79: sum=sum(79: sum), 80: count=count(if(10: type IS NULL, null, 8: purchase_order_id))}}
-> PhysicalHashAggregate {type=DISTINCT_GLOBAL, groupBy=[8: purchase_order_id, 10: type, 16: buyer_uid], partitionBy=[8: purchase_order_id, 10: type, 16: buyer_uid] ,aggregations={65: sum=sum(65: sum), 66: sum=sum(66: sum), 67: sum=sum(67: sum), 68: sum=sum(68: sum), 69: sum=sum(69: sum), 70: sum=sum(70: sum), 71: sum=sum(71: sum), 72: sum=sum(72: sum), 73: sum=sum(73: sum), 75: sum=sum(75: sum), 76: sum=sum(76: sum), 77: sum=sum(77: sum), 78: sum=sum(78: sum), 79: sum=sum(79: sum)}}
-> PhysicalDistributionOperator {distributionSpec=SHUFFLE_AGG[8(true), 10(true), 16(true)] ,globalDict=[]}
-> PhysicalHashAggregate {type=LOCAL, groupBy=[8: purchase_order_id, 10: type, 16: buyer_uid], partitionBy=[16: buyer_uid] ,aggregations={65: sum=sum(26: purchase_quantity), 66: sum=sum(27: receive_quantity), 67: sum=sum(28: quantity_diff), 68: sum=sum(29: over_receive_num), 69: sum=sum(30: wait_quantity), 70: sum=sum(31: exchange_quantity), 71: sum=sum(32: return_quantity), 72: sum=sum(33: good_quantity), 73: sum=sum(34: bad_quantity), 75: sum=sum(40: return_bad_quantity), 76: sum=sum(45: purchase_amount), 77: sum=sum(63: expr), 78: sum=sum(47: wait_amount), 79: sum=sum(49: return_amount)}}
-> PhysicalOlapScanOperator {table=15567036, selectedPartitionId=[15567035], outputColumns=[33: good_quantity, 34: bad_quantity, 8: purchase_order_id, 40: return_bad_quantity, 10: type, 42: price, 45: purchase_amount, 47: wait_amount, 16: buyer_uid, 49: return_amount, 26: purchase_quantity, 27: receive_quantity, 28: quantity_diff, 29: over_receive_num, 30: wait_quantity, 31: exchange_quantity, 32: return_quantity], projection=[33: good_quantity, 34: bad_quantity, 40: return_bad_quantity, 8: purchase_order_id, 10: type, 45: purchase_amount, 47: wait_amount, 16: buyer_uid, 49: return_amount, 26: purchase_quantity, 27: receive_quantity, 28: quantity_diff, 29: over_receive_num, 30: wait_quantity, 31: exchange_quantity, multiply(cast(27: receive_quantity as DECIMAL128(9,0)), cast(42: price as DECIMAL128(18,6))), 32: return_quantity], predicate=4: company_id = 901202594714743296 AND 10: type IN (1, 3, 5) AND 54: status = 10 AND 57: order_time >= 1690819200 AND 57: order_time < 1693497600, limit=-1}
Type check failed. the type of arg 65: sum in expr ‘sum(65: sum)’ is defined as INT, but the actual type is BIGINT
at com.starrocks.sql.optimizer.validate.PlanValidator.validatePlan(PlanValidator.java:64) ~[starrocks-fe.jar:?]
at com.starrocks.sql.optimizer.Optimizer.optimizeByCost(Optimizer.java:235) ~[starrocks-fe.jar:?]
at com.starrocks.sql.optimizer.Optimizer.optimize(Optimizer.java:136) ~[starrocks-fe.jar:?]
at com.starrocks.sql.StatementPlanner.createQueryPlan(StatementPlanner.java:151) ~[starrocks-fe.jar:?]
at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:99) ~[starrocks-fe.jar:?]
at com.starrocks.sql.StatementPlanner.plan(StatementPlanner.java:62) ~[starrocks-fe.jar:?]
at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:482) ~[starrocks-fe.jar:?]
at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:375) ~[starrocks-fe.jar:?]
at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:481) ~[starrocks-fe.jar:?]
at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:767) ~[starrocks-fe.jar:?]
at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:69) ~[starrocks-fe.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]

【背景】做过哪些操作?

【业务影响】
【是否存算分离】
【StarRocks版本】例如:3.1.9
【集群规模】例如:3fe(3 follower)+ 11BE
【机器信息】CPU虚拟核/内存/网卡,例如:48C/64G/万兆
【联系方式】社区3群-杨荣

@U_1670552093746_8532 最小化了一个复现的 sql

CREATE TABLE sql_bug_test_2 (
pri_id bigint(20) NOT NULL COMMENT “”,
purchase_order_id varchar(930) NULL DEFAULT “” COMMENT “”,
type tinyint(4) NULL DEFAULT “0” COMMENT “”,
supplier_id bigint(20) NULL DEFAULT “0” COMMENT “”,
purchase_quantity int(11) NULL DEFAULT “0” COMMENT “”
) ENGINE=OLAP
PRIMARY KEY(pri_id)
COMMENT “”
DISTRIBUTED BY HASH(pri_id) BUCKETS 1
PROPERTIES (
“replication_num” = “3”,
“in_memory” = “false”,
“enable_persistent_index” = “true”,
“replicated_storage” = “true”,
“compression” = “LZ4”
);

insert into sql_bug_test_2
select 220288354389442561, ‘1_PO220928001’,1,1,1
union select 220288354389763074, ‘1_PO220928001’,1,1,0
union select 220288354389864449, ‘1_PO220928002’,1,1,1
union select 220288354390050305, ‘1_PO220928002’,1,1,0
union select 220288354390151169, ‘1_PO221124001’,1,1,20
union select 220288354390349312, ‘1_PO221124001’,1,1,0
union select 220288354390443522, ‘1_PO221124002’,1,1,10
union select 220288354390619136, ‘1_PO221124002’,1,1,0
union select 220288354390736899, ‘1_PO221202001’,1,1,100
union select 220288354390736900, ‘1_PO221202001’,1,1,100
;

SELECT
pri.supplier_id,
sum(pri.purchase_quantity) as purchase_quantity,
count(distinct pri.type, pri.purchase_order_id) as order_count
FROM
sql_bug_test_2 pri
GROUP BY
pri.supplier_id
;

https://github.com/StarRocks/starrocks/pull/41228/files
可能是这个pr 引进的问题,有空可以看看

多谢, 我们看下

我测试了下,通过调整 set new_planner_agg_stage=2; 可以规避这个报错
不知道这个调整会不会有什么副作用

表经过谓词过滤后数据量不大的话, 没什么差别. 请问你们purchase_report_item这张表是没有分区只有一个bucket是吗?

是的,只有1个bucket

嗯, 这里plan checker处理的有点问题. 规避的话一是可以使用你上边的方法, 二是可以临时关闭plan checker. set enable_plan_validation = false. 我们修复下

嗯,了解了,感谢

Fix: https://github.com/StarRocks/starrocks/pull/42454