StarRocks 2.1.3: BE nodes occasionally go down

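Background:
After upgrading from 1.18.4 to 2.1.3, base compaction causes the BE to go down. Version 1.18.4 did not have this problem. Version 1.19.7 has it as well, but it occurs less often than on 2.1.3.
be.out: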
tcmalloc: large alloc 2006474752 bytes == 0x1da6c2000 @ 0x4e98709 0x50163dc 0x1ceab18 0x4f66c75 0x20de5d0 0x17e5761 0x19ab5c7 0x184ad59 0x1c1a159 0x18ed52f 0x18f1c4d 0x18f308b 0x18e84ee 0x176e009 0x175a0af 0x4fe0870
tcmalloc: large alloc 2844852224 bytes == 0x1da6c2000 @ 0x4e98709 0x50163dc 0x1ceab18 0x4f66c75 0x20de5d0 0x17e5761 0x19ab5c7 0x184ad59 0x1c1a159 0x18ed52f 0x18f1c4d 0x18f308b 0x18e84ee 0x176e009 0x175a0af 0x4fe0870
tcmalloc: large alloc 4348510208 bytes == 0x238bd6000 @ 0x4e98709 0x50163dc 0x1ceab18 0x4f66c75 0x20de5d0 0x17e5761 0x19ab5c7 0x184ad59 0x1c1a159 0x18ed52f 0x18f1c4d 0x18f308b 0x18e84ee 0x176e009 0x175a0af 0x4fe0870
tcmalloc: large alloc 4388855808 bytes == 0x238bba000 @ 0x4e98709 0x50163dc 0x1ceab18 0x4f66c75 0x20de5d0 0x17e5761 0x19ab5c7 0x184ad59 0x1c1a159 0x18ed52f 0x18f1c4d 0x18f308b 0x18e84ee 0x176e009 0x175a0af 0x4fe0870
tcmalloc: large alloc 4429209600 bytes == 0x238bba000 @ 0x4e98709 0x50163dc 0x1ceab18 0x4f66c75 0x20de5d0 0x17e5761 0x19ab5c7 0x184ad59 0x1c1a159 0x18ed52f 0x18f1c4d 0x18f308b 0x18e84ee 0x176e009 0x175a0af 0x4fe0870
tcmalloc: large alloc 4429209600 bytes == 0x238bba000 @ 0x4e98709 0x50163dc 0x1ceab18 0x4f66c75 0x20de5d0 0x17e5761 0x19ab5c7 0x184ad59 0x1c1a159 0x18ed52f 0x18f1c4d 0x18f308b 0x18e84ee 0x176e009 0x175a0af 0x4fe0870
tcmalloc: large alloc 4469563392 bytes == 0x238bba000 @ 0x4e98709 0x50163dc 0x1ceab18 0x4f66c75 0x20de5d0 0x17e5761 0x19ab5c7 0x184ad59 0x1c1a159 0x18ed52f 0x18f1c4d 0x18f308b 0x18e84ee 0x176e009 0x175a0af 0x4fe0870
tcmalloc: large alloc 4469563392 bytes == 0x238bb4000 @ 0x4e98709 0x50163dc 0x1ceab18 0x4f66c75 0x20de5d0 0x17e5761 0x19ab5c7 0x184ad59 0x1c1a159 0x18ed52f 0x18f1c4d 0x18f308b 0x18e84ee 0x176e009 0x175a0af 0x4fe0870
tcmalloc: large alloc 4509917184 bytes == 0x238bb0000 @ 0x4e98709 0x50163dc 0x1ceab18 0x4f66c75 0x20de5d0 0x17e5761 0x19ab5c7 0x184ad59 0x1c1a159 0x18ed52f 0x18f1c4d 0x18f308b 0x18e84ee 0x176e009 0x175a0af 0x4fe0870
tcmalloc: large alloc 4550270976 bytes == 0x238ba8000 @ 0x4e98709 0x50163dc 0x1ceab18 0x4f66c75 0x20de5d0 0x17e5761 0x19ab5c7 0x184ad59 0x1c1a159 0x18ed52f 0x18f1c4d 0x18f308b 0x18e84ee 0x176e009 0x175a0af 0x4fe0870
tcmalloc: large alloc 4590624768 bytes == 0x238ba8000 @ 0x4e98709 0x50163dc 0x1ceab18 0x4f66c75 0x20de5d0 0x17e5761 0x19ab5c7 0x184ad59 0x1c1a159 0x18ed52f 0x18f1c4d 0x18f308b 0x18e84ee 0x176e009 0x175a0af 0x4fe0870
tcmalloc: large alloc 4630978560 bytes == 0x238ba8000 @ 0x4e98709 0x50163dc 0x1ceab18 0x4f66c75 0x20de5d0 0x17e5761 0x19ab5c7 0x184ad59 0x1c1a159 0x18ed52f 0x18f1c4d 0x18f308b 0x18e84ee 0x176e009 0x175a0af 0x4fe0870
tcmalloc: large alloc 4630978560 bytes == 0x238ba4000 @ 0x4e98709 0x50163dc 0x1ceab18 0x4f66c75 0x20de5d0 0x17e5761 0x19ab5c7 0x184ad59 0x1c1a159 0x18ed52f 0x18f1c4d 0x18f308b 0x18e84ee 0x176e009 0x175a0af 0x4fe0870
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
*** Aborted at 1650236616 (unix time) try "date -d @1650236616" if you are using GNU date ***
PC: @ 0x7f9279c48207 __GI_raise
*** SIGABRT (@0x3ea00017286) received by PID 94854 (TID 0x7f91f512d700) from PID 94854; stack trace: ***
@ 0x3503022 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f927a9125d0 (unknown)
@ 0x7f9279c48207 __GI_raise
@ 0x7f9279c498f8 __GI_abort
@ 0x15de049 _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
@ 0x4f66726 __cxxabiv1::__terminate()
@ 0x4f66791 std::terminate()
@ 0x4f668e4 __cxa_throw
@ 0x15ddf50 _Znwm.cold
@ 0x1d5b8c9 starrocks::faststring::GrowArray()
@ 0x19f9533 starrocks::PageIO::compress_page_body()
@ 0x19a6e43 starrocks::ScalarColumnWriter::finish_current_page()
@ 0x19a8b8c starrocks::ScalarColumnWriter::append()
@ 0x19ab612 starrocks::ScalarColumnWriter::append()
@ 0x184ad59 starrocks::SegmentWriter::append_chunk()
@ 0x1c1a159 starrocks::HorizontalBetaRowsetWriter::add_chunk()
@ 0x18ed52f starrocks::vectorized::Compaction::_merge_rowsets_horizontally()
@ 0x18f1c4d starrocks::vectorized::Compaction::do_compaction_impl()
@ 0x18f308b starrocks::vectorized::Compaction::do_compaction()
@ 0x18e84ee starrocks::vectorized::BaseCompaction::compact()
@ 0x176e009 starrocks::StorageEngine::_perform_base_compaction()
@ 0x175a0af starrocks::StorageEngine::_base_compaction_thread_callback()
@ 0x4fe0870 execute_native_thread_routine
@ 0x7f927a90add5 start_thread
@ 0x7f9279d0fead __clone
@ 0x0 (unknown)

Host configuration: 16c31g (16 cores, 31 GB of memory)
The tablet undergoing base compaction is 8 GB in size.
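
To put the `tcmalloc: large alloc` lines from be.out in perspective, the byte counts can be converted to GiB. Below is a minimal Python sketch, assuming the log text is available as a string. The helper name `large_alloc_sizes` and the trimmed sample are illustrative only and are not part of StarRocks.

```python
import re

# Matches the byte count in lines such as
# "tcmalloc: large alloc 2006474752 bytes == 0x1da6c2000 @ ..."
LARGE_ALLOC = re.compile(r"tcmalloc: large alloc (\d+) bytes")

GIB = 1024 ** 3


def large_alloc_sizes(log_text):
    """Return the requested sizes (in bytes) of all large allocations, in log order."""
    return [int(m.group(1)) for m in LARGE_ALLOC.finditer(log_text)]


# First and last tcmalloc lines from the be.out excerpt (stack addresses trimmed).
sample = """\
tcmalloc: large alloc 2006474752 bytes == 0x1da6c2000 @ 0x4e98709
tcmalloc: large alloc 4630978560 bytes == 0x238ba4000 @ 0x4e98709
"""

sizes = large_alloc_sizes(sample)
for size in sizes:
    print(f"{size} bytes = {size / GIB:.2f} GiB")
```

The requests grow from about 1.87 GiB to about 4.31 GiB before `std::bad_alloc` is thrown. Each one is a single allocation request, and it has to be satisfied on top of whatever the BE process already holds on the 31g host.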

We will look into this first and will update you as soon as we have any news.

Has this issue been resolved? We are seeing the same problem on our side.

W0501 05:17:00.676827 15865 stream_load.cpp:460] plan streaming load failed. errmsg=tablet 410959 has few replicas: 1, quorum: 2, cluster: 1792512149id=c64bdef15f54754e-7434b7df116b1c81, job_id=-1, txn_id=532505, label=e08edc15-c3de-43a5-825b-ad8f5e9bb4a2
E0501 06:05:00.555640 15601 olap_scan_node.cpp:259] [TUniqueId(hi=-6733773761950772756, lo=-4633915612094799306)] Cancelled: canceled state
E0501 06:05:00.555687 15605 olap_scan_node.cpp:259] [TUniqueId(hi=-6733773761950772756, lo=-4633915612094799306)] Cancelled: canceled state
E0501 06:05:00.555814 15594 olap_scan_node.cpp:259] [TUniqueId(hi=-6733773761950772756, lo=-4633915612094799306)] Cancelled: canceled state
E0501 06:05:00.555845 15626 olap_scan_node.cpp:259] [TUniqueId(hi=-6733773761950772756, lo=-4633915612094799306)] Cancelled: canceled state
E0501 06:05:00.555944 15631 olap_scan_node.cpp:259] [TUniqueId(hi=-6733773761950772756, lo=-4633915612094799306)] Cancelled: canceled state
E0501 06:05:00.555987 15588 olap_scan_node.cpp:259] [TUniqueId(hi=-6733773761950772756, lo=-4633915612094799306)] Cancelled: canceled state
E0501 06:05:00.556013 15615 olap_scan_node.cpp:259] [TUniqueId(hi=-6733773761950772756, lo=-4633915612094799306)] Cancelled: canceled state
E0501 06:05:00.556115 15628 olap_scan_node.cpp:259] [TUniqueId(hi=-6733773761950772756, lo=-4633915612094799306)] Cancelled: canceled state
E0501 06:05:00.556125 15601 olap_scan_node.cpp:259] [TUniqueId(hi=-6733773761950772756, lo=-4633915612094799306)] Cancelled: canceled state
E0501 06:05:00.556185 15608 olap_scan_node.cpp:259] [TUniqueId(hi=-6733773761950772756, lo=-4633915612094799306)] Cancelled: canceled state
E0501 06:05:00.556288 15629 olap_scan_node.cpp:259] [TUniqueId(hi=-6733773761950772756, lo=-4633915612094799306)] Cancelled: canceled state
E0501 06:05:00.556391 15610 olap_scan_node.cpp:259] [TUniqueId(hi=-6733773761950772756, lo=-4633915612094799306)] Cancelled: canceled state
E0501 06:05:00.556435 15586 olap_scan_node.cpp:259] [TUniqueId(hi=-6733773761950772756, lo=-4633915612094799306)] Cancelled: canceled state
E0501 06:05:00.556465 15609 olap_scan_node.cpp:259] [TUniqueId(hi=-6733773761950772756, lo=-4633915612094799306)] Cancelled: canceled state
E0501 06:05:00.556489 15617 olap_scan_node.cpp:259] [TUniqueId(hi=-6733773761950772756, lo=-4633915612094799306)] Cancelled: canceled state
E0501 06:05:00.556664 15630 olap_scan_node.cpp:259] [TUniqueId(hi=-6733773761950772756, lo=-4633915612094799306)] Cancelled: canceled state
W0501 06:05:00.561292 15694 fragment_mgr.cpp:193] Fail to open fragment a28cd470-c914-11ec-bfb1-064d65f3e238: Cancelled: canceled state
/root/starrocks/be/src/exec/vectorized/project_node.cpp:122 _children[0]->get_next(state, chunk, eos)
/root/starrocks/be/src/runtime/plan_fragment_executor.cpp:329 _plan->get_next(_runtime_state, &_chunk, &_done)
/root/starrocks/be/src/runtime/plan_fragment_executor.cpp:217 _get_next_internal_vectorized(&chunk)
W0501 06:49:24.338397 15862 stream_load_executor.cpp:155] begin transaction failed, errmsg=Label [ed851ff8-0c4c-43dc-84f3-0b4c64ac611d] has already been used.id=b448d5ae90f02010-1b82d94a7c77aab1, job_id=-1, txn_id=-1, label=ed851ff8-0c4c-43dc-84f3-0b4c64ac611d
W0501 06:49:24.339902 15859 stream_load_executor.cpp:155] begin transaction failed, errmsg=Label [ecb8614e-e3f5-45cf-81c5-487ff8dc07b3] has already been used.id=bc4bfa340328c732-2e7746c1d790e8b0, job_id=-1, txn_id=-1, label=ecb8614e-e3f5-45cf-81c5-487ff8dc07b3
W0501 06:49:25.335152 15867 stream_load_executor.cpp:155] begin transaction failed, errmsg=Label [d39e9df4-7705-4736-a965-505d3c9cc7a4] has already been used.id=b34d39230daa6a3a-1c3fd500f52dbdb6, job_id=-1, txn_id=-1, label=d39e9df4-7705-4736-a965-505d3c9cc7a4
W0501 06:49:25.335196 15857 stream_load_executor.cpp:155] begin transaction failed, errmsg=Label [e8c32d31-ee57-4598-980a-a703f0f0bcfd] has already been used.id=3f43ecf8b510a1a0-b7b0159e134695b5, job_id=-1, txn_id=-1, label=e8c32d31-ee57-4598-980a-a703f0f0bcfd
W0501 06:49:25.392858 15860 stream_load_executor.cpp:155] begin transaction failed, errmsg=Label [e03de786-ed8a-43c0-8eaf-26f1bbf20602] has already been used.id=f84f5d836c8ee4fe-a0f42a58d2152b91, job_id=-1, txn_id=-1, label=e03de786-ed8a-43c0-8eaf-26f1bbf20602
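
A log excerpt like this is easier to read once the repeated lines are grouped. Below is a minimal Python sketch that counts lines by severity and source location, assuming the usual glog prefix (severity letter, MMDD, time, thread id, `file:line]`). The function name `summarize` is hypothetical, and the sample messages are abbreviated from the excerpt.

```python
import re
from collections import Counter

# glog prefix: severity letter, MMDD, HH:MM:SS.micros, thread id, "file:line]"
GLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\s\]]+:\d+)\] (?P<msg>.*)$"
)


def summarize(log_text):
    """Count log lines per (severity, source file:line); other lines are skipped."""
    counts = Counter()
    for line in log_text.splitlines():
        m = GLOG_LINE.match(line)
        if m:
            counts[(m.group("sev"), m.group("src"))] += 1
    return counts


# Abbreviated lines from the excerpt; the stack-frame line has no glog prefix.
sample = """\
W0501 05:17:00.676827 15865 stream_load.cpp:460] plan streaming load failed.
E0501 06:05:00.555640 15601 olap_scan_node.cpp:259] Cancelled: canceled state
E0501 06:05:00.555687 15605 olap_scan_node.cpp:259] Cancelled: canceled state
/root/starrocks/be/src/exec/vectorized/project_node.cpp:122 _children[0]->get_next(state, chunk, eos)
W0501 06:49:24.338397 15862 stream_load_executor.cpp:155] begin transaction failed
"""

counts = summarize(sample)
for (sev, src), n in counts.most_common():
    print(f"{sev} {src}: {n}")
```

Run against the full excerpt, this would show that most of the lines are the same `Cancelled: canceled state` error from `olap_scan_node.cpp:259`, alongside a handful of stream load warnings.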

Same version here, and we have run into this problem too.

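Hi, we recommend upgrading to version 2.1.5, following the upgrade notes. For any tablet on which base compaction fails, the corresponding table or partition needs to be truncated and its data reloaded.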
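
As a rough illustration of the truncate-and-reload step, the sketch below only builds the SQL text and does not connect to a cluster. It assumes `SHOW TABLET <id>` is used to find the table and partition that own a tablet, and `TRUNCATE TABLE ... PARTITION(...)` to clear them. The tablet ID, database, table, and partition names are hypothetical placeholders, and the helper functions are not part of any StarRocks client.

```python
def quote_ident(name):
    """Wrap an identifier in backticks; reject names that contain a backtick."""
    if "`" in name:
        raise ValueError(f"unsupported identifier: {name!r}")
    return f"`{name}`"


def show_tablet_sql(tablet_id):
    """Statement that reports which database/table/partition a tablet belongs to."""
    return f"SHOW TABLET {int(tablet_id)};"


def truncate_sql(db, table, partitions=None):
    """Statement that empties a whole table, or only the listed partitions."""
    target = f"{quote_ident(db)}.{quote_ident(table)}"
    if partitions:
        plist = ", ".join(quote_ident(p) for p in partitions)
        return f"TRUNCATE TABLE {target} PARTITION({plist});"
    return f"TRUNCATE TABLE {target};"


# Hypothetical values: replace with the tablet ID from the compaction log
# and the names reported by SHOW TABLET.
print(show_tablet_sql(12345))
print(truncate_sql("example_db", "example_tbl", ["p20220417"]))
```

Truncating removes the data in the target, so the statements should only be run once the source data for the reload is confirmed to be available.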

After we added memory to the BE nodes, the problem of BE nodes going down did not occur again.

After we expanded the BE nodes' memory, they did not go down again.