SQL count over a large external table crashes a BE node

【Details】Running a SQL count over a large external Hive table crashes a BE node.
mysql> select count(*) from hive_catalog_cluster_1.cu.idc_active_resources_url ;
ERROR 1064 (HY000): Backend node not found. Check if any backend node is down.
【Background】Using SQL to count the rows of a specific external Hive table.
【Business impact】Tested queries against all three BE nodes; only the first node becomes unavailable.
【Shared-data (separated storage/compute)?】No
【StarRocks version】Stable-3.1
【Cluster size】3 FE (3 followers) + 3 BE (FE and BE co-located)
【Machine specs】128 vCPU / 380 GB RAM / 10 GbE NIC
【联系方式】649802003@qq.com
【Attachments】
The be.out log is as follows:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data01/StarRocks-3.1.1/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data01/StarRocks-3.1.1/be/lib/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
3.1.1 RELEASE (build e022963)
query_id:c7f90b7f-eb37-11ee-ac9a-e8ebd3287c2c, fragment_instance:c7f90b7f-eb37-11ee-ac9a-e8ebd3287c2d
Hive file path: 000009_0.gz, partition id: 343, length: 33894202, offset: 0
tracker:process consumption: 8232576816
tracker:query_pool consumption: 4221381584
tracker:load consumption: 0
tracker:metadata consumption: 15326366
tracker:tablet_metadata consumption: 6990900
tracker:rowset_metadata consumption: 7849832
tracker:segment_metadata consumption: 88880
tracker:column_metadata consumption: 396754
tracker:tablet_schema consumption: 12340
tracker:segment_zonemap consumption: 62294
tracker:short_key_index consumption: 2050
tracker:column_zonemap_index consumption: 116354
tracker:ordinal_index consumption: 126896
tracker:bitmap_index consumption: 0
tracker:bloom_filter_index consumption: 0
tracker:compaction consumption: 0
tracker:schema_change consumption: 0
tracker:column_pool consumption: 2916461492
tracker:page_cache consumption: 63767136
tracker:update consumption: 0
tracker:chunk_allocator consumption: 137739184
tracker:clone consumption: 0
tracker:consistency consumption: 0
*** Aborted at 1711433537 (unix time) try "date -d @1711433537" if you are using GNU date ***
PC: @ 0x7f13616bb040 __memchr
*** SIGSEGV (@0x0) received by PID 53854 (TID 0x7f101cba8700) from PID 0; stack trace: ***
@ 0x5fffa42 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f1362c51852 os::Linux::chained_handler()
@ 0x7f1362c58676 JVM_handle_linux_signal
@ 0x7f1362c4e653 signalHandler()
@ 0x7f1362117630 (unknown)
@ 0x7f13616bb040 __memchr
@ 0x7f13616c1da1 __GI_memmem
@ 0x4107d44 starrocks::CSVReader::next_record()
@ 0x52e39df starrocks::HdfsScannerCSVReader::next_record()
@ 0x52e53e8 starrocks::HdfsTextScanner::parse_csv()
@ 0x52e60d2 starrocks::HdfsTextScanner::do_get_next()
@ 0x52d2d31 starrocks::HdfsScanner::get_next()
@ 0x523c7cf starrocks::connector::HiveDataSource::get_next()
@ 0x293789d starrocks::pipeline::ConnectorChunkSource::_read_chunk()
@ 0x2be8357 starrocks::pipeline::ChunkSource::buffer_next_batch_chunks_blocking()
@ 0x2933f1b _ZZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS_12RuntimeStateEiENKUlvE_clEv
@ 0x293c831 starrocks::workgroup::ScanExecutor::worker_thread()
@ 0x4c1fd22 starrocks::ThreadPool::dispatch_thread()
@ 0x4c1a81a starrocks::thread::supervise_thread()
@ 0x7f136210fea5 start_thread
@ 0x7f136172a8dd __clone
@ 0x0 (unknown)

Could you share the CREATE TABLE statement for this Hive table?

Search fe.audit.log for this query id and check whether the corresponding SQL is the `select count(*) from hive_catalog_cluster_1.cu.idc_active_resources_url;` above.
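A quick way to do that search (a minimal sketch; the function name is made up and the log path is an assumption, so adjust it to your FE deployment's log directory):

```python
# Minimal sketch: scan an fe.audit.log for lines mentioning a query id.
def find_query(log_path, query_id):
    """Return audit-log lines that contain the given query id."""
    matches = []
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if query_id in line:
                matches.append(line.rstrip("\n"))
    return matches

# Example (hypothetical path; adjust to where your FE writes its logs):
# for line in find_query("fe/log/fe.audit.log",
#                        "c7f90b7f-eb37-11ee-ac9a-e8ebd3287c2c"):
#     print(line)
```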

2024-03-26 14:12:46,198 [query] |Client=10.11.231.1:56202|User=root|AuthorizedUser='root'@'%'|ResourceGroup=default_wg|Catalog=default_catalog|Db=|State=ERR|ErrorCode=CANCELLED|Time=35620|ScanBytes=0|ScanRows=0|ReturnRows=0|StmtId=1907|QueryId=c7f90b7f-eb37-11ee-ac9a-e8ebd3287c2c|IsQuery=true|feIp=10.11.231.1|Stmt=select count(*) from hive_catalog_cluster_1.cu.idc_active_resources_url|Digest=|PlanCpuCost=5.0|PlanMemCost=10.0
2024-03-26 14:12:46,198 [slow_query] |Client=10.11.231.1:56202|User=root|AuthorizedUser='root'@'%'|ResourceGroup=default_wg|Catalog=default_catalog|Db=|State=ERR|ErrorCode=CANCELLED|Time=35620|ScanBytes=0|ScanRows=0|ReturnRows=0|StmtId=1907|QueryId=c7f90b7f-eb37-11ee-ac9a-e8ebd3287c2c|IsQuery=true|feIp=10.11.231.1|Stmt=select count(*) from hive_catalog_cluster_1.cu.idc_active_resources_url|Digest=|PlanCpuCost=5.0|PlanMemCost=10.0

Yes, the corresponding statement is there, but it ends with an error (State=ERR, ErrorCode=CANCELLED).

CREATE TABLE idc_active_resources_url (
houseid varchar(1048576) DEFAULT NULL,
url varchar(1048576) DEFAULT NULL,
ip varchar(1048576) DEFAULT NULL,
first_found_time bigint(20) DEFAULT NULL,
last_found_time bigint(20) DEFAULT NULL,
is_block int(11) DEFAULT NULL,
visit bigint(20) DEFAULT NULL,
traffic_type int(11) DEFAULT NULL,
application_protocal int(11) DEFAULT NULL,
referer varchar(1048576) DEFAULT NULL,
location varchar(1048576) DEFAULT NULL,
title varchar(1048576) DEFAULT NULL,
content varchar(1048576) DEFAULT NULL,
idc varchar(1048576) DEFAULT NULL,
dt bigint(20) DEFAULT NULL,
hour int(11) DEFAULT NULL
)
PARTITION BY ( idc, dt, hour )
PROPERTIES ("location" = "hdfs://nameservice1/user/hive/warehouse/cu.db/idc_active_resources_url");
That's the create-table statement. The count can be computed, but the first BE node goes down.

Does only one BE crash?

Also, at crash time is the file always this one: Hive file path: 000009_0.gz?

Right, only one BE crashes, but the 000009_0.gz file is not always the same one.

I tested the query with a LIMIT added, and no BE node crashes. Counting the full table crashes one BE node. The table holds about 500-600 million rows, and the query takes around 45 seconds.

Is it always the same BE that crashes?

Yes. I connected to the other two nodes and ran the query, and it is still that same fixed node that goes down.

Are all three BEs on the same version?

The versions are all identical; we would never deploy different versions within one cluster.

Can you show the Hive create-table statement?

The create-table statement was already posted in a comment above.

I mean SHOW CREATE TABLE run inside Hive, not the StarRocks one.

That is the definition of the linked external table, not a StarRocks table.

Doesn't it show the field delimiter or anything like that?

The create-table statement doesn't include that.

Can you check whether the text file that triggers the crash differs from the others in some way?

For example, different line endings, or whether it contains a UTF-8 BOM.
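A minimal sketch of such a check (assuming you can copy the .gz file down from HDFS first; the helper name is made up for illustration):

```python
import gzip

def inspect_gz_text(path, sample_bytes=4096):
    """Check a gzipped text file for a UTF-8 BOM and its line-ending style."""
    with gzip.open(path, "rb") as f:
        head = f.read(sample_bytes)
    has_bom = head.startswith(b"\xef\xbb\xbf")  # UTF-8 BOM bytes
    if b"\r\n" in head:
        line_ending = "CRLF"
    elif b"\n" in head:
        line_ending = "LF"
    elif b"\r" in head:
        line_ending = "CR"
    else:
        line_ending = "none in sample"
    return has_bom, line_ending

# Example (hypothetical local copy of the crashing file):
# print(inspect_gz_text("000009_0.gz"))
```

Running this over the crashing file and a healthy one from the same partition would show whether their encodings or row delimiters differ.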