[Stubborn electronic ailment (self-healed)] Ubuntu 24 freezes at irregular intervals, requesting a consultation from the cyber-doctors

WiiGe posted on 2025-6-1 02:34

Last edited by WiiGe on 2025-8-11 03:21

Log from the most recent freeze: https://share.wiige.top/upload/er5yBd

=================================================================================

Updated log, lightly sanitized: https://share.wiige.top/p/aAAYza

=================================================================================

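I set up an HP Z8G4 workstation as a home server. It runs Ubuntu 24 LTS, and every application is dockerized and isolated from the host system: mainly Nextcloud, Komga, Plex, GitLab and the like.

Configuration:
Z8G4: https://support.hp.com/us-en/dri ... orkstation/16449803
Memory is 128 GB DDR4 ECC (empty-slot entries omitted):
$ sudo dmidecode -t memory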
# dmidecode 3.5
Getting SMBIOS data from sysfs.
SMBIOS 3.2.0 present.

Handle 0x000F, DMI type 16, 23 bytes
Physical Memory Array
   Location: System Board Or Motherboard
   Use: System Memory
   Error Correction Type: Single-bit ECC
   Maximum Capacity: 4608 GB
   Error Information Handle: Not Provided
   Number Of Devices: 12

Handle 0x0010, DMI type 17, 40 bytes
Memory Device
   Array Handle: 0x000F
   Error Information Handle: Not Provided
   Total Width: 72 bits
   Data Width: 64 bits
   Size: 64 GB
   Form Factor: DIMM
   Set: None
   Locator: CPU0-DIMM1
   Bank Locator: CPU0
   Type: DDR4
   Type Detail: Synchronous LRDIMM
   Speed: 2666 MT/s
   Manufacturer: Samsung
   Serial Number: 398E97A7
   Asset Tag: Not Specified
   Part Number: M386A8K40BM2-CTD
   Rank: 4
   Configured Memory Speed: 2666 MT/s
   Minimum Voltage: 1.2 V
   Maximum Voltage: 1.2 V
   Configured Voltage: 1.2 V

Handle 0x001D, DMI type 16, 23 bytes
Physical Memory Array
   Location: System Board Or Motherboard
   Use: System Memory
   Error Correction Type: Single-bit ECC
   Maximum Capacity: 4608 GB
   Error Information Handle: Not Provided
   Number Of Devices: 12

Handle 0x001E, DMI type 17, 40 bytes
Memory Device
   Array Handle: 0x001D
   Error Information Handle: Not Provided
   Total Width: 72 bits
   Data Width: 64 bits
   Size: 64 GB
   Form Factor: DIMM
   Set: None
   Locator: CPU1-DIMM1
   Bank Locator: CPU1
   Type: DDR4
   Type Detail: Synchronous LRDIMM
   Speed: 2666 MT/s
   Manufacturer: Samsung
   Serial Number: 398F3D61
   Asset Tag: Not Specified
   Part Number: M386A8K40BM2-CTD
   Rank: 4
   Configured Memory Speed: 2666 MT/s
   Minimum Voltage: 1.2 V
   Maximum Voltage: 1.2 V
   Configured Voltage: 1.2 V

A few SSDs:

$ sudo smartctl --all /dev/nvme0

=== START OF INFORMATION SECTION ===
Model Number:                   INTEL SSDPED1D280GA
Serial Number:                   PHMB75160057280CGN
Firmware Version:                E2010480
PCI Vendor/Subsystem ID:          0x8086
IEEE OUI Identifier:             0x5cd2e4
Controller ID:                   0
NVMe Version:                   <1.2
Number of Namespaces:             1
Namespace 1 Size/Capacity:       280,065,171,456
Namespace 1 Formatted LBA Size: 512
Local Time is:                   Sun Jun  1 01:58:46 2025 CST
Firmware Updates (0x02):          1 Slot
Optional Admin Commands (0x0007): Security Format Frmw_DL
Optional NVM Commands (0x0006): Wr_Unc DS_Mngmt
Log Page Attributes (0x0a):       Cmd_Eff_Lg Telmtry_Lg
Maximum Data Transfer Size:       32 Pages

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

$ sudo smartctl --all /dev/nvme1

=== START OF INFORMATION SECTION ===
Model Number:                   MEMBLAZE P6530CH0384M00
Serial Number:                   SH220600806
Firmware Version:                0A101Z00
PCI Vendor/Subsystem ID:          0x1c5f
IEEE OUI Identifier:             0x00e004
Total NVM Capacity:             3,840,755,982,336
Unallocated NVM Capacity:       0
Controller ID:                   1
NVMe Version:                   1.4
Number of Namespaces:             1
Namespace 1 Size/Capacity:       3,840,755,982,336
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64:          38b19e 7418032601
Local Time is:                   Sun Jun  1 01:59:07 2025 CST
Firmware Updates (0x17):          3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x001f): Security Format Frmw_DL NS_Mngmt Self_Test
Optional NVM Commands (0x0054): DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e):       Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:       128 Pages
Warning  Comp. Temp. Threshold: 70 Celsius
Critical Comp. Temp. Threshold: 77 Celsius
Namespace 1 Features (0x08):    No_ID_Reuse

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W       -        -    0  0  0  0        0       0
 1 +    14.00W       -        -    0  0  0  0        0       0
 2 +    13.00W       -        -    0  0  0  0        0       0
 3 +    12.00W       -        -    0  0  0  0        0       0
 4 +    11.00W       -        -    0  0  0  0        0       0
 5 +    10.00W       -        -    0  0  0  0        0       0
 6 +     9.00W       -        -    0  0  0  0        0       0
 7 +     8.00W       -        -    0  0  0  0        0       0
 8 +     7.00W       -        -    0  0  0  0        0       0
 9 +     6.00W       -        -    0  0  0  0        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         0
 2 -     512       8         2
 3 -    4096       8         0
 4 -    4096      64         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

$ sudo smartctl --all /dev/nvme2
=== START OF INFORMATION SECTION ===
Model Number:                   Fanxiang S500Pro 2TB
Serial Number:                   FXS500Pro243954233
Firmware Version:                SN12517
PCI Vendor/Subsystem ID:          0x1e4b
IEEE OUI Identifier:             0x000000
Total NVM Capacity:             2,048,408,248,320
Unallocated NVM Capacity:       0
Controller ID:                   0
NVMe Version:                   1.4
Number of Namespaces:             1
Namespace 1 Size/Capacity:       2,048,408,248,320
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64:          000000 0243954233
Local Time is:                   Sun Jun  1 01:59:09 2025 CST
Firmware Updates (0x1a):          5 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x02):       Cmd_Eff_Lg
Maximum Data Transfer Size:       128 Pages
Warning  Comp. Temp. Threshold: 90 Celsius
Critical Comp. Temp. Threshold: 95 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.50W       -        -    0  0  0  0        0       0
 1 +     5.80W       -        -    1  1  1  1        0       0
 2 +     3.60W       -        -    2  2  2  2        0       0
 3 -   0.7460W       -        -    3  3  3  3     5000   10000
 4 -   0.7260W       -        -    4  4  4  4     8000   45000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

$ sudo smartctl --all /dev/nvme3

=== START OF INFORMATION SECTION ===
Model Number:                   MEMBLAZE P6530CH0384M00
Serial Number:                   SH220600832
Firmware Version:                0A101Z00
PCI Vendor/Subsystem ID:          0x1c5f
IEEE OUI Identifier:             0x00e004
Total NVM Capacity:             3,840,755,982,336
Unallocated NVM Capacity:       0
Controller ID:                   1
NVMe Version:                   1.4
Number of Namespaces:             1
Namespace 1 Size/Capacity:       3,840,755,982,336
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64:          38b19e 7418034001
Local Time is:                   Sun Jun  1 01:59:11 2025 CST
Firmware Updates (0x17):          3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x001f): Security Format Frmw_DL NS_Mngmt Self_Test
Optional NVM Commands (0x0054): DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e):       Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:       128 Pages
Warning  Comp. Temp. Threshold: 70 Celsius
Critical Comp. Temp. Threshold: 77 Celsius
Namespace 1 Features (0x08):    No_ID_Reuse

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W       -        -    0  0  0  0        0       0
 1 +    14.00W       -        -    0  0  0  0        0       0
 2 +    13.00W       -        -    0  0  0  0        0       0
 3 +    12.00W       -        -    0  0  0  0        0       0
 4 +    11.00W       -        -    0  0  0  0        0       0
 5 +    10.00W       -        -    0  0  0  0        0       0
 6 +     9.00W       -        -    0  0  0  0        0       0
 7 +     8.00W       -        -    0  0  0  0        0       0
 8 +     7.00W       -        -    0  0  0  0        0       0
 9 +     6.00W       -        -    0  0  0  0        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         0
 2 -     512       8         2
 3 -    4096       8         0
 4 -    4096      64         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

The CPUs are salvaged second-hand Xeon Gold 6138s:
$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
     80  Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz

I also added an RTX 3070 8 GB blower card to run some small models:
$ lspci | grep -i nvidia

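Symptoms:
The machine runs 24/7 but freezes from time to time. The interval between freezes ranges from a dozen or so hours to several months. By my count, about 70% of the freezes happen between 00:00 and 09:00, and the remaining 30% are scattered randomly across the rest of the day. That 00:00-09:00 window is mostly when Plex runs background jobs such as audio fingerprint analysis and media analysis.
During a freeze the directly attached monitor shows no blinking cursor, the directly attached keyboard does not respond (Caps Lock and Scroll Lock do not toggle), SSH over the LAN does not respond, and Intel AMT does not respond.

No fault LEDs are lit inside the chassis (the interior really is gorgeous):

During a freeze the fans audibly ramp up to at least 80% speed, which stands out because the Z8G4 is normally very quiet.
Apart from a hard reset, I have never found anything that actually resolves it.
It started roughly 1-2 months after I got this setup and has now been going on for over a year.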
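
The 70% figure above can be rechecked from a list of freeze timestamps. A minimal Python sketch, assuming the timestamps have already been collected by hand; the function name `share_in_window` and the example values are hypothetical placeholders, not data from this machine:

```python
from datetime import datetime


def share_in_window(crash_times, start_hour=0, end_hour=9):
    """Return the fraction of timestamps whose hour lies in [start_hour, end_hour)."""
    if not crash_times:
        return 0.0
    hits = sum(1 for t in crash_times if start_hour <= t.hour < end_hour)
    return hits / len(crash_times)


# Hypothetical timestamps, for illustration only.
example = [
    datetime(2025, 1, 3, 2, 20),
    datetime(2025, 1, 9, 5, 41),
    datetime(2025, 1, 17, 14, 5),
    datetime(2025, 1, 28, 3, 9),
]

print(f"{share_in_window(example):.0%} of the example freezes fall in 00:00-09:00")
```

Using the last log timestamp before each unclean boot as the "freeze time" is itself an approximation, since the machine may have hung some time after the final line was written.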

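What I have tried:
Checked the temperature inside the chassis: the peak was only 62°C. Average CPU utilization is below 47%, though some individual cores sit at 100% for long stretches (understandable given the background transcoding).
Suspected the GPU was overheating and dropping display output: nvitop showed it under 60°C.
Suspected swap filled up and hung the box: found no OOM entries in the kernel log, and expanding swap to 64 GB changed nothing.
Suspected the NIC overheated and dropped out: but display output freezes too, so when this machine dies, everything dies together.
Suspected the room was too hot: moved it to a bedroom with air conditioning on 24/7, same symptoms, freeze frequency as unpredictable as ever.
Suspected the no-name rice-cooker-style power cord could not handle the 1800 W PSU: replaced it with a genuine spare power cord, no change.
Suspected unstable memory: ran the memory test in the BIOS for 3 hours, all green.
Suspected the PSU: looked up the price of a genuine replacement and was scared off.
Suspected the motherboard, but I have no evidence. The seller's answer: "keep it well cooled and stop guessing."

Scratching my head: could a single core at full load really have cooked itself to death?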

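Similar cases:
https://h30471.www3.hp.com/t5/tai-shi-dian-nao/z8g4jing-chang-si-ji/m-p/1134658 -> I don't have fault LEDs like that one
https://h30471.www3.hp.com/t5/tai-shi-dian-nao/HP-Z8G4-928-PCIE-yan-zhong-chao-shi-si-ji-zhong-qi/m-p/1261763 -> I don't get POST errors like that one
https://h30471.www3.hp.com/t5/tai-shi-dian-nao/hui-puZ8G4gong-zuo-zhanCPU-wen-du-guo-gao-jing-chang-si-ji/m-p/1264789 -> from what I observed, my temperatures are nowhere near that high (data from Cockpit)
https://h30471.www3.hp.com/t5/tai-shi-dian-nao/HPZ8G4-pin-fan-si-ji/m-p/1077573 -> very similar to my situation, but I checked and there is no dust buildup affecting cooling either

I'm out of ideas now, so I'm posting here to pool everyone's thinking and see whether anyone has a good way to track down the cause. By rights this machine has strong cooling and a stable design, built purely to let the hardware run flat out, so why does it keep freezing for no reason? I don't get it (in a Fenghua accent)
If needed I can post the output of dmesg and journalctl -k, but it is long and tedious and I doubt anyone wants to read it......?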

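Realplayer posted on 2025-6-1 07:44

Install Windows and check for WHEA errors; if there are any, replace the PSU.

WiiGe posted on 2025-6-1 10:16

Services are still running on it, so can I avoid installing another OS...... Can this be diagnosed directly with something like dmesg? If your guess is the PSU, I did see a few ACPI errors in dmesg: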

[ 7.234940] i2c i2c-0: Systems with more than 8 memory slots not supported yet, not instantiating SPD
[ 7.234950] ioatdma 0000:00:04.1: enabling device (0004 -> 0006)
[ 7.259027] ioatdma 0000:00:04.2: enabling device (0004 -> 0006)
[ 7.280643] ioatdma 0000:00:04.3: enabling device (0004 -> 0006)
[ 7.298799] ioatdma 0000:00:04.4: enabling device (0004 -> 0006)
[ 7.316247] ioatdma 0000:00:04.5: enabling device (0004 -> 0006)
[ 7.330358] ioatdma 0000:00:04.6: enabling device (0004 -> 0006)
[ 7.342621] ioatdma 0000:00:04.7: enabling device (0004 -> 0006)
[ 7.345615] gnss: GNSS driver registered with major 236
[ 7.347248] ACPI Error: Needed , found 00000000bbfdd616 (20230628/exresop-557)
[ 7.348108] ACPI Error: AE_AML_OPERAND_TYPE, While resolving operands for (20230628/dswexec-433)

[ 7.348125]
            Initialized Local Variables for Method :
[ 7.348127] Local1: 00000000b7f14b0c <Obj>       Buffer(12) 00 00 00 00 00 00 00 00

[ 7.348140] Initialized Arguments for Method :(2 arguments defined for method invocation)
[ 7.348142] Arg0: 000000009fb42d8e <Obj>       Integer 0000000000000004
[ 7.348147] Arg1: 00000000bbfdd616 <Obj>       Integer 0000000000000000

[ 7.348156] ACPI Error: Aborting method \_SB.WMIV.WVPO due to previous error (AE_AML_OPERAND_TYPE) (20230628/psparse-529)
[ 7.348166] ACPI Error: Aborting method \_SB.WMIV.WMPV due to previous error (AE_AML_OPERAND_TYPE) (20230628/psparse-529)
[ 7.357943] ACPI Error: Needed , found 000000004f09eef4 (20230628/exresop-557)
[ 7.358106] ioatdma 0000:80:04.0: enabling device (0004 -> 0006)
[ 7.359974] ACPI Error: AE_AML_OPERAND_TYPE, While resolving operands for (20230628/dswexec-433)

[ 7.361874]
            Initialized Local Variables for Method :
[ 7.361879] Local1: 00000000ea3c534d <Obj>       Buffer(12) 00 00 00 00 00 00 00 00

[ 7.361921] Initialized Arguments for Method :(2 arguments defined for method invocation)
[ 7.361926] Arg0: 00000000ac004b87 <Obj>       Integer 0000000000000004
[ 7.361943] Arg1: 000000004f09eef4 <Obj>       Integer 0000000000000000

[ 7.361972] ACPI Error: Aborting method \_SB.WMIV.WVPO due to previous error (AE_AML_OPERAND_TYPE) (20230628/psparse-529)
[ 7.363854] ACPI Error: Aborting method \_SB.WMIV.WMPV due to previous error (AE_AML_OPERAND_TYPE) (20230628/psparse-529)
[ 7.369425] ACPI Error: Needed , found 00000000305152d8 (20230628/exresop-557)
[ 7.370762] ACPI Error: AE_AML_OPERAND_TYPE, While resolving operands for (20230628/dswexec-433)

[ 7.372121]
            Initialized Local Variables for Method :
[ 7.372124] Local1: 0000000051f950d8 <Obj>       Buffer(136) 00 00 00 00 00 00 00 00

[ 7.372162] Initialized Arguments for Method :(2 arguments defined for method invocation)
[ 7.372167] Arg0: 000000007cce02b3 <Obj>       Integer 0000000000000080
[ 7.372183] Arg1: 00000000305152d8 <Obj>       Integer 0000000000000000

[ 7.372206] ACPI Error: Aborting method \_SB.WMIV.WVPO due to previous error (AE_AML_OPERAND_TYPE) (20230628/psparse-529)
[ 7.372276] workqueue: work_for_cpu_fn hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[ 7.373520] ACPI Error: Aborting method \_SB.WMIV.WMPV due to previous error (AE_AML_OPERAND_TYPE) (20230628/psparse-529)
[ 7.374961] input: HP WMI hotkeys as /devices/virtual/input/input9
[ 7.378132] ACPI Error: Needed , found 00000000e263359f (20230628/exresop-557)
[ 7.379034] ACPI Error: AE_AML_OPERAND_TYPE, While resolving operands for (20230628/dswexec-433)

[ 7.379951]
            Initialized Local Variables for Method :
[ 7.379953] Local1: 00000000c51b5c99 <Obj>       Buffer(136) 00 00 00 00 00 00 00 00

[ 7.379973] Initialized Arguments for Method :(2 arguments defined for method invocation)
[ 7.379975] Arg0: 00000000e5261626 <Obj>       Integer 0000000000000080
[ 7.379983] Arg1: 00000000e263359f <Obj>       Integer 0000000000000000

[ 7.379995] ACPI Error: Aborting method \_SB.WMIV.WVPO due to previous error (AE_AML_OPERAND_TYPE) (20230628/psparse-529)
[ 7.380921] ACPI Error: Aborting method \_SB.WMIV.WMPV due to previous error (AE_AML_OPERAND_TYPE) (20230628/psparse-529)
[ 7.389494] ioatdma 0000:80:04.1: enabling device (0004 -> 0006)
[ 7.405383] ioatdma 0000:80:04.2: enabling device (0004 -> 0006)
[ 7.419566] ioatdma 0000:80:04.3: enabling device (0004 -> 0006)
[ 7.429298] ice: Intel(R) Ethernet Connection E800 Series Linux Driver
[ 7.429301] ice: Copyright (c) 2018, Intel Corporation.
[ 7.432372] ioatdma 0000:80:04.4: enabling device (0004 -> 0006)
[ 7.433469] Creating 1 MTD partitions on "0000:00:1f.5":
[ 7.433498] 0x000000000000-0x000002000000 : "BIOS"
[ 7.444463] ioatdma 0000:80:04.5: enabling device (0004 -> 0006)
[ 7.456213] ioatdma 0000:80:04.6: enabling device (0004 -> 0006)
[ 7.467575] ioatdma 0000:80:04.7: enabling device (0004 -> 0006)
[ 7.468405] snd_hda_intel 0000:00:1f.3: enabling device (0140 -> 0142)
[ 7.469170] snd_hda_intel 0000:2d:00.1: enabling device (0140 -> 0142)
[ 7.469332] snd_hda_intel 0000:2d:00.1: Disabling MSI
[ 7.469397] snd_hda_intel 0000:2d:00.1: Handle vga_switcheroo audio client

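Can kernel logs like these pin it down as a PSU problem?

indtability posted on 2025-6-1 12:11

Is there nothing suspicious, or nothing in common, in the logs right before each freeze? Start from the pre-freeze state and try to reproduce it. If you really have no leads, I also think switching the OS, or at least the kernel, is worth a try. Dockerized applications should be easy to migrate.

WiiGe posted on 2025-6-1 12:43

Last edited by WiiGe on 2025-6-1 12:53

indtability posted on 2025-6-1 12:11
Is there nothing suspicious, or nothing in common, in the logs right before each freeze? Start from the pre-freeze state and try to reproduce it. If you really have no leads, I also think switching the OS ...
Let me post the journalctl -p err output for you to look at.

WiiGe posted on 2025-6-1 12:48

I cannot reproduce the freeze reliably. It is only because the symptoms are identical every time that I believe this is a single problem.
By the way, are there other tools for viewing historical logs? journalctl and dmesg are all I know......
Doc, I can post whatever you need.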

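chachi posted on 2025-6-1 13:16

Run hardware tests first, e.g. a memory test.
Then look at software.

indtability posted on 2025-6-1 13:17

The log downloads as only 8k and I cannot open it. But if the logs show no obvious errors and no pattern to the freezes, I don't think there is a good approach. All I can suggest is trying a more upstream kernel or another distribution and hoping for the best.

WiiGe posted on 2025-6-1 13:51

chachi posted on 2025-6-1 13:16
Run hardware tests first, e.g. a memory test.
Then look at software.

The memory test already passed, all green.

QUI posted on 2025-6-1 16:15

How about running a GPU stress test?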

calmer posted on 2025-6-1 17:47

Those ACPI messages at boot don't mean much. Try journalctl -b -1 -k and see whether it captured the scene of the last freeze.

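WiiGe posted on 2025-6-1 22:32

Last edited by WiiGe on 2025-6-1 23:24

calmer posted on 2025-6-1 17:47
Those ACPI messages at boot don't mean much. Try journalctl -b -1 -k and see whether it captured the scene of the last freeze.
You're right, I also feel these ACPI errors are probably not the key issue. However, journalctl -b -1 -k only gives me entries from May 26 onward, so I tried journalctl > all.log and ended up with a 500 MB log.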

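WiiGe posted on 2025-6-1 22:40

Some observations of mine:
The most frequent error in the log looks like this:
Feb 26 00:23:47 z8g4-portal dockerd: time="2025-02-25T16:23:47.091392377Z" level=error msg=" failed to query external DNS server" client-addr="udp:127.0.0.1:42370" dns-server="udp:127.0.0.53:53" error="read udp 127.0.0.1:42370->127.0.0.53:53: i/o timeout" question=";tracker.tiny-**.com.\tIN\t A"
After one particular reboot there are over 40,000 lines of it (line numbers 1858635-1890899):

This kind of error has occurred many times. After almost every reboot it shows up in bulk once the system has been running for a while, interspersed with entries such as: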

May 31 01:02:49 z8g4-portal dockerd: time="2025-05-31T01:02:49.706566899+08:00" level=error msg=" failed to query external DNS server" client-addr="udp:127.0.0.1:49889" dns-server="udp:127.0.0.53:53" error="read udp 127.0.0.1:49889->127.0.0.53:53: i/o timeout" question=";share.rayser.cn.\tIN\t A"
May 31 01:03:13 z8g4-portal systemd-timesyncd: Timed out waiting for reply from 198.18.26.128:123 (ntp.ubuntu.com).
May 31 01:05:01 z8g4-portal CRON: pam_unix(cron:session): session opened for user root(uid=0) by root(uid=0)
May 31 01:05:01 z8g4-portal CRON: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 31 01:05:01 z8g4-portal CRON: pam_unix(cron:session): session closed for user root
May 31 01:05:35 z8g4-portal dockerd: time="2025-05-31T01:05:35.012524229+08:00" level=error msg=" failed to query external DNS server" client-addr="udp:127.0.0.1:42567" dns-server="udp:127.0.0.53:53" error="read udp 127.0.0.1:42567->127.0.0.53:53: i/o timeout" question=";tracker.bt4g.com.\tIN\t A"

or
May 31 00:57:13 z8g4-portal dockerd: time="2025-05-31T00:57:13.910263128+08:00" level=error msg=" failed to query external DNS server" client-addr="udp:127.0.0.1:46779" dns-server="udp:127.0.0.53:53" error="read udp 127.0.0.1:46779->127.0.0.53:53: i/o timeout" question=";share.rayser.cn.\tIN\t A"
May 31 00:58:13 z8g4-portal systemd: Starting pmie_check.service - Check PMIE instances are running...
May 31 00:58:13 z8g4-portal systemd: Starting pmie_farm_check.service - Check and migrate non-primary pmie farm instances...
May 31 00:58:13 z8g4-portal systemd: Started pmie_check.service - Check PMIE instances are running.
May 31 00:58:13 z8g4-portal systemd: Started pmie_farm_check.service - Check and migrate non-primary pmie farm instances.
May 31 00:58:13 z8g4-portal systemd: pmie_farm_check.service: Deactivated successfully.
May 31 00:58:13 z8g4-portal systemd: pmie_check.service: Deactivated successfully.
May 31 01:00:01 z8g4-portal systemd: Starting sysstat-collect.service - system activity accounting tool...
May 31 01:00:01 z8g4-portal systemd: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
May 31 01:00:01 z8g4-portal systemd: sysstat-collect.service: Deactivated successfully.
May 31 01:00:01 z8g4-portal systemd: Finished sysstat-collect.service - system activity accounting tool.
May 31 01:00:01 z8g4-portal systemd: systemd-tmpfiles-clean.service: Deactivated successfully.
May 31 01:00:01 z8g4-portal systemd: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
May 31 01:01:04 z8g4-portal dockerd: time="2025-05-31T01:01:04.350097803+08:00" level=error msg=" failed to query external DNS server" client-addr="udp:127.0.0.1:59390" dns-server="udp:127.0.0.53:53" error="read udp 127.0.0.1:59390->127.0.0.53:53: i/o timeout" question=";sparkle.ghostchu.com.\tIN\t A"

Errors like the ones above look like an NTP problem.
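
To see which lookups dominate the "failed to query external DNS server" noise, the question field of each dockerd line can be tallied. A minimal Python sketch, assuming the log was dumped to a text file as described earlier in the thread and that the separators show up either as the literal two characters `\t` (as in the pasted lines) or as real whitespace. The names `QUESTION` and `failed_dns_names` are made up for this example, and the two sample lines are abbreviated from the lines quoted above:

```python
import re
from collections import Counter

# Pulls the queried name and record type out of question=";name.\tIN\t TYPE".
QUESTION = re.compile(
    r'question=";(?P<name>[^\\"\s]+?)\.(?:\\t|\s)+IN(?:\\t|\s)+(?P<qtype>[A-Z]+)"'
)


def failed_dns_names(lines):
    """Count (name, record type) pairs across failed-DNS log lines."""
    counts = Counter()
    for line in lines:
        if "failed to query external DNS server" not in line:
            continue
        match = QUESTION.search(line)
        if match:
            counts[(match.group("name"), match.group("qtype"))] += 1
    return counts


sample = [
    r'May 31 01:02:49 z8g4-portal dockerd: level=error msg=" failed to query external DNS server" question=";share.rayser.cn.\tIN\t A"',
    r'May 31 01:05:35 z8g4-portal dockerd: level=error msg=" failed to query external DNS server" question=";tracker.bt4g.com.\tIN\t A"',
]

for (name, qtype), n in failed_dns_names(sample).most_common():
    print(n, qtype, name)
```

For the full dump, passing the open file object (for example `open("all.log", errors="replace")`) instead of `sample` streams the file line by line rather than loading 500 MB at once.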

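WiiGe posted on 2025-6-1 22:50

Going by the log, there are 29 entries of the form -- Boot XXXXXXXXXXXXXXX --, i.e. journalctl recorded 29 reboots. In 14 of them, about 48%, a failed to query external DNS server error is closely followed by the reboot. I suspect this may be related to the repeated freezes I have been seeing? The general pattern is: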

Mar 04 10:18:24 z8g4-portal dockerd: time="2025-03-04T02:18:24.008409267Z" level=error msg=" failed to query external DNS server" client-addr="udp:127.0.0.1:47004" dns-server="udp:127.0.0.53:53" error="read udp 127.0.0.1:47004->127.0.0.53:53: i/o timeout" question=";seeders-paradise.org.\tIN\t AAAA"
Mar 04 10:18:24 z8g4-portal dockerd: time="2025-03-04T02:18:24.008511054Z" level=error msg=" failed to query external DNS server" client-addr="udp:127.0.0.1:57097" dns-server="udp:127.0.0.53:53" error="read udp 127.0.0.1:57097->127.0.0.53:53: i/o timeout" question=";seeders-paradise.org.\tIN\t A"
Mar 04 10:18:29 z8g4-portal dockerd: time="2025-03-04T02:18:29.053045662Z" level=error msg=" failed to query external DNS server" client-addr="udp:127.0.0.1:34618" dns-server="udp:127.0.0.53:53" error="read udp 127.0.0.1:34618->127.0.0.53:53: i/o timeout" question=";highteahop.top.\tIN\t AAAA"
Mar 04 10:18:29 z8g4-portal dockerd: time="2025-03-04T02:18:29.053119935Z" level=error msg=" failed to query external DNS server" client-addr="udp:127.0.0.1:50855" dns-server="udp:127.0.0.53:53" error="read udp 127.0.0.1:50855->127.0.0.53:53: i/o timeout" question=";highteahop.top.\tIN\t A"
Mar 04 10:20:04 z8g4-portal systemd: Starting sysstat-collect.service - system activity accounting tool...
Mar 04 10:20:04 z8g4-portal systemd: sysstat-collect.service: Deactivated successfully.
Mar 04 10:20:04 z8g4-portal systemd: Finished sysstat-collect.service - system activity accounting tool.
-- Boot b2563ed2eccc4cb0b982bd5928adb2a7 --
Mar 04 13:07:50 z8g4-portal kernel: Linux version 6.8.0-54-generic (buildd@lcy02-amd64-083) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #56-Ubuntu SMP PREEMPT_DYNAMIC Sat Feb  8 00:37:57 UTC 2025 (Ubuntu 6.8.0-54.56-generic 6.8.12)

Reboots that show Journal stopped I treat as normal maintenance reboots and do not count as freezes. There are 16 of them, about 55%:
Mar 12 07:13:44 z8g4-portal systemd-shutdown: Syncing filesystems and block devices.
Mar 12 07:13:44 z8g4-portal kernel: BTRFS info (device dm-0): last unmount of filesystem 533af4a8-6aaf-49f7-8d01-5336db1d8307
Mar 12 07:13:59 z8g4-portal systemd-shutdown: Sending SIGTERM to remaining processes...
Mar 12 07:13:59 z8g4-portal systemd-journald: Journal stopped
-- Boot 440da79b7a1d4f44b1fbc5356f736634 --
Mar 12 07:21:35 z8g4-portal kernel: Linux version 6.8.0-55-generic (buildd@lcy02-amd64-095) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #57-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 12 23:42:21 UTC 2025 (Ubuntu 6.8.0-55.57-generic 6.8.12)
Mar 12 07:21:35 z8g4-portal kernel: Command line: BOOT_IMAGE=/vmlinuz-6.8.0-55-generic root=/dev/mapper/ubuntu--vg-lv--0 ro
Mar 12 07:21:35 z8g4-portal kernel: KERNEL supported cpus:
Mar 12 07:21:35 z8g4-portal kernel: Intel GenuineIntel

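posthoc posted on 2025-6-1 23:06

Failing to connect to 127.0.0.53 might mean Ubuntu's systemd-resolved is misconfigured? Though you have been running it this long, so probably not. And it should not be enough to hang the whole machine, unless it is a NIC hardware problem (I once had a Windows 11 box that reliably froze about 10 minutes after every reboot; I noticed the network always died first, and after reseating the NIC it never happened again).

WiiGe posted on 2025-6-1 23:30

Last edited by WiiGe on 2025-6-2 00:16

Before some of these failed to query external DNS server errors I see OOM errors. In the other cases I could not really work out what happened. The OOM case shows up roughly 7-8 times and looks like this: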
May 02 11:06:54 z8g4-portal dockerd: time="2025-05-02T03:06:54.074625447Z" level=error msg=" failed to query external DNS server" client-addr="udp:127.0.0.1:35975" dns-server="udp:127.0.0.53:53" error="read udp 127.0.0.1:35975->127.0.0.53:53: i/o timeout" question=";taciturn-shadow.spb.ru.\tIN\t A"
May 02 11:09:11 z8g4-portal kernel: Plex Commercial invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
May 02 11:09:11 z8g4-portal kernel: CPU: 0 PID: 287380 Comm: Plex Commercial Tainted: P       OE    6.8.0-58-generic #60-Ubuntu
May 02 11:09:11 z8g4-portal kernel: Hardware name: HP HP Z8 G4 Workstation/81C7, BIOS P60 v02.95 11/21/2024
May 02 11:09:11 z8g4-portal kernel: Call Trace:
May 02 11:09:11 z8g4-portal kernel:<TASK>
May 02 11:09:11 z8g4-portal kernel:dump_stack_lvl+0x76/0xa0
May 02 11:09:11 z8g4-portal kernel:dump_stack+0x10/0x20
May 02 11:09:11 z8g4-portal kernel:dump_header+0x47/0x1f0
May 02 11:09:11 z8g4-portal kernel:oom_kill_process+0x118/0x280
May 02 11:09:11 z8g4-portal kernel:? oom_evaluate_task+0x143/0x1e0
May 02 11:09:11 z8g4-portal kernel:out_of_memory+0x103/0x350
May 02 11:09:11 z8g4-portal kernel:__alloc_pages_may_oom+0x10c/0x1d0
May 02 11:09:11 z8g4-portal kernel:__alloc_pages_slowpath.constprop.0+0x420/0x9f0
May 02 11:09:11 z8g4-portal kernel:__alloc_pages+0x31f/0x350
May 02 11:09:11 z8g4-portal kernel:alloc_pages_mpol+0x91/0x210
May 02 11:09:11 z8g4-portal kernel:alloc_pages+0x5b/0xd0
May 02 11:09:11 z8g4-portal kernel:folio_alloc+0x15/0x40
May 02 11:09:11 z8g4-portal kernel:filemap_alloc_folio+0xf4/0x100
May 02 11:09:11 z8g4-portal kernel:__filemap_get_folio+0x195/0x2d0
May 02 11:09:11 z8g4-portal kernel:filemap_fault+0x15c/0x8e0
May 02 11:09:11 z8g4-portal kernel:? set_pte_range+0x116/0x3f0
May 02 11:09:11 z8g4-portal kernel:__do_fault+0x3a/0x190
May 02 11:09:11 z8g4-portal kernel:do_read_fault+0x133/0x200
May 02 11:09:11 z8g4-portal kernel:do_fault+0xf0/0x260
May 02 11:09:11 z8g4-portal kernel:handle_pte_fault+0x114/0x1d0
May 02 11:09:11 z8g4-portal kernel:__handle_mm_fault+0x654/0x800
May 02 11:09:11 z8g4-portal kernel:handle_mm_fault+0x18a/0x380
May 02 11:09:11 z8g4-portal kernel:do_user_addr_fault+0x169/0x670
May 02 11:09:11 z8g4-portal kernel:exc_page_fault+0x83/0x1b0
May 02 11:09:11 z8g4-portal kernel:asm_exc_page_fault+0x27/0x30
May 02 11:09:11 z8g4-portal kernel: RIP: 0033:0x709ad07aaffa
May 02 11:09:11 z8g4-portal kernel: Code: Unable to access opcode bytes at 0x709ad07aafd0.
May 02 11:09:11 z8g4-portal kernel: RSP: 002b:00007ffdb9994370 EFLAGS: 00010202
May 02 11:09:11 z8g4-portal kernel: RAX: 0000709ad07ab1c0 RBX: 0000709aceb65080 RCX: 0000000000000010
May 02 11:09:11 z8g4-portal kernel: RDX: 0000000000000002 RSI: ffffffffffffffec RDI: 0000709ad00c7580
May 02 11:09:11 z8g4-portal kernel: RBP: 0000000000000006 R08: 0000709acea212f0 R09: 0000709acea212ec
May 02 11:09:11 z8g4-portal kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000709ad00c7580
May 02 11:09:11 z8g4-portal kernel: R13: 0000000000001800 R14: 0000709acebb7dc0 R15: 0000709aceb913c0
May 02 11:09:11 z8g4-portal kernel:</TASK>
May 02 11:09:11 z8g4-portal kernel: Mem-Info:
May 02 11:09:11 z8g4-portal kernel: active_anon:6681869 inactive_anon:24607315 isolated_anon:0
                                 active_file:30644 inactive_file:20137 isolated_file:0
                                 unevictable:11192 dirty:13 writeback:0
                                 slab_reclaimable:254395 slab_unreclaimable:296998
                                 mapped:64449 shmem:34779 pagetables:220975
                                 sec_pagetables:0 bounce:0
                                 kernel_misc_reclaimable:0
                                 free:115427 free_pcp:14330 free_cma:0
May 02 11:09:11 z8g4-portal kernel: Node 0 active_anon:7669552kB inactive_anon:53187884kB active_file:91524kB inactive_file:73772kB unevictable:27528kB isolated(anon):0kB isolated(file):0kB mapped:147512kB dirty:52kB writeback:0kB shmem:25020kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:0kB writeback_tmp:0kB kernel_stack:29912kB pagetables:580788kB sec_pagetables:0kB all_unreclaimable? no
May 02 11:09:11 z8g4-portal kernel: Node 1 active_anon:19057924kB inactive_anon:45241376kB active_file:31052kB inactive_file:6776kB unevictable:17240kB isolated(anon):0kB isolated(file):0kB mapped:110284kB dirty:0kB writeback:0kB shmem:114096kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:0kB writeback_tmp:0kB kernel_stack:35848kB pagetables:303112kB sec_pagetables:0kB all_unreclaimable? no
May 02 11:09:11 z8g4-portal kernel: Node 0 DMA free:11264kB boost:0kB min:8kB low:20kB high:32kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
May 02 11:09:11 z8g4-portal kernel: lowmem_reserve[]: 0 1595 63998 63998 63998
May 02 11:09:11 z8g4-portal kernel: Node 0 DMA32 free:247020kB boost:0kB min:1116kB low:2748kB high:4380kB reserved_highatomic:0KB active_anon:244352kB inactive_anon:1181220kB active_file:1356kB inactive_file:1204kB unevictable:0kB writepending:32kB present:1767796kB managed:1701228kB mlocked:0kB bounce:0kB free_pcp:6252kB local_pcp:0kB free_cma:0kB
May 02 11:09:11 z8g4-portal kernel: lowmem_reserve[]: 0 0 62402 62402 62402
May 02 11:09:11 z8g4-portal kernel: Node 0 Normal free:133464kB boost:257320kB min:301076kB low:364972kB high:428868kB reserved_highatomic:0KB active_anon:7425236kB inactive_anon:52006628kB active_file:89964kB inactive_file:72276kB unevictable:27528kB writepending:20kB present:65011712kB managed:63909016kB mlocked:27528kB bounce:0kB free_pcp:29556kB local_pcp:248kB free_cma:0kB
May 02 11:09:11 z8g4-portal kernel: lowmem_reserve[]: 0 0 0 0 0
May 02 11:09:11 z8g4-portal kernel: Node 1 Normal free:71976kB boost:79872kB min:125096kB low:191136kB high:257176kB reserved_highatomic:0KB active_anon:19057924kB inactive_anon:45241376kB active_file:31052kB inactive_file:6776kB unevictable:17240kB writepending:0kB present:67108864kB managed:66040244kB mlocked:17240kB bounce:0kB free_pcp:19500kB local_pcp:0kB free_cma:0kB
May 02 11:09:11 z8g4-portal kernel: lowmem_reserve[]: 0 0 0 0 0
May 02 11:09:11 z8g4-portal kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
May 02 11:09:11 z8g4-portal kernel: Node 0 DMA32: 722*4kB (UM) 1063*8kB (UME) 902*16kB (UME) 820*32kB (UME) 505*64kB (UME) 325*128kB (UME) 158*256kB (UME) 81*512kB (UME) 37*1024kB (UM) 1*2048kB (M) 0*4096kB = 247840kB
May 02 11:09:11 z8g4-portal kernel: Node 0 Normal: 4166*4kB (UME) 4475*8kB (UME) 2479*16kB (UME) 1276*32kB (UME) 22*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 134368kB
May 02 11:09:11 z8g4-portal kernel: Node 1 Normal: 261*4kB (UME) 303*8kB (UME) 1411*16kB (UME) 1401*32kB (UME) 0*64kB 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 71004kB
May 02 11:09:11 z8g4-portal kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
May 02 11:09:11 z8g4-portal kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May 02 11:09:11 z8g4-portal kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
May 02 11:09:11 z8g4-portal kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May 02 11:09:11 z8g4-portal kernel: 116714 total pagecache pages
May 02 11:09:11 z8g4-portal kernel: 26954 pages in swap cache
May 02 11:09:11 z8g4-portal kernel: Free swap= 0kB
May 02 11:09:11 z8g4-portal kernel: Total swap = 8388604kB
May 02 11:09:11 z8g4-portal kernel: 33476090 pages RAM
May 02 11:09:11 z8g4-portal kernel: 0 pages HighMem/MovableOnly
May 02 11:09:11 z8g4-portal kernel: 559628 pages reserved
May 02 11:09:11 z8g4-portal kernel: 0 pages hwpoisoned
May 02 11:09:11 z8g4-portal kernel: Tasks state (memory values in pages):
May 02 11:09:11 z8g4-portal kernel: [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
May 02 11:09:11 z8g4-portal kernel: [   1203]     0   1203    35180      160      160        0         0         139264        0          -250 systemd-journal
May 02 11:09:11 z8g4-portal kernel: [   1234]     0   1234    20117     6080     3520     2560         0          98304        0         -1000 dmeventd
(... very, very long Tasks state listing omitted ...)
May 02 11:09:11 z8g4-portal kernel: [ 478703]     0 478703     5326     1280        0     1280         0          73728        0             0 curl
May 02 11:09:11 z8g4-portal kernel: [ 478719]  1000 478719     3909        0        0        0         0          81920        0             0 sudo
May 02 11:09:11 z8g4-portal kernel: [ 478726]   999 478726    55568     1760        0     1440       320         172032      329             0 postgres
May 02 11:09:11 z8g4-portal kernel: [ 478747]   999 478747    55073      160        0      160         0         126976      329             0 postgres
May 02 11:09:11 z8g4-portal kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=docker-9ebf9b26996e7c9e659ad15d77607dc4bf5a53052370014be92d3a108c675cd2.scope,mems_allowed=0-1,global_oom,task_memcg=/system.slice/docker-82e65ed77c11c0b981e0f273c0193944b9c35d4e93f39d24ba7b5182352c1085.scope,task=ffmpeg,pid=443109,uid=0
May 02 11:09:11 z8g4-portal kernel: Out of memory: Killed process 443109 (ffmpeg) total-vm:118242576kB, anon-rss:116685432kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:228776kB oom_score_adj:0
May 02 11:09:11 z8g4-portal systemd: docker-82e65ed77c11c0b981e0f273c0193944b9c35d4e93f39d24ba7b5182352c1085.scope: A process of this unit has been killed by the OOM killer.
May 02 11:09:12 z8g4-portal dockerd: time="2025-05-02T03:09:11.892772400Z" level=warning msg="Health check for container 9ebf9b26996e7c9e659ad15d77607dc4bf5a53052370014be92d3a108c675cd2 error: timed out starting health check for container 9ebf9b26996e7c9e659ad15d77607dc4bf5a53052370014be92d3a108c675cd2"
May 02 11:09:12 z8g4-portal dockerd: time="2025-05-02T03:09:11.892973734Z" level=warning msg="Health check for container 0fb159a16e740859ef9d5cec397b0bf6588e2868c7444f87caa95fb47b9ee1ca error: timed out starting health check for container 0fb159a16e740859ef9d5cec397b0bf6588e2868c7444f87caa95fb47b9ee1ca"
May 02 11:09:12 z8g4-portal dockerd: time="2025-05-02T03:09:11.893194928Z" level=warning msg="Health check for container e69453d2b9babaf835e77c238e0d3ad151701d79fd03e3b15404d6ff803fb9d3 error: timed out starting health check for container e69453d2b9babaf835e77c238e0d3ad151701d79fd03e3b15404d6ff803fb9d3"
May 02 11:09:19 z8g4-portal kernel: oom_reaper: reaped process 443109 (ffmpeg), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
May 02 11:09:26 z8g4-portal dockerd: time="2025-05-02T03:09:26.710334282Z" level=error msg=" failed to query external DNS server" client-addr="udp:127.0.0.1:35892" dns-server="udp:127.0.0.53:53" error="read udp 127.0.0.1:35892->127.0.0.53:53: i/o timeout" question=";console.[redacted].\tIN\t AAAA"
May 02 11:09:26 z8g4-portal dockerd: time="2025-05-02T03:09:26.710435489Z" level=error msg=" failed to query external DNS server" client-addr="udp:127.0.0.1:33238" dns-server="udp:127.0.0.53:53" error="read udp 127.0.0.1:33238->127.0.0.53:53: i/o timeout" question=";console.[redacted].\tIN\t A"

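I doubt my reading of this is very thorough, so doctors with other tools are welcome to join the consultation. I set up a microbin instance as a quick file-sharing service and put a lightly redacted copy of the log output there: https://share.wiige.top/p/aAAYza

WiiGe posted on 2025-6-1 23:32

posthoc posted on 2025-6-1 23:06
Could the failed connection to 127.0.0.53 mean Ubuntu's systemd-resolved is misconfigured? Then again, you have been running it this long, so it shouldn't be. And it wouldn't ...

Indeed, `failed to query external DNS server` looks like a consequence of some other error rather than the cause, but I really can't make full sense of this log.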

痴货 posted on 2025-6-2 00:55

I looked around and there seem to be similar symptoms reported on reddit:

https://www.reddit.com/r/Ubuntu/comments/1dbt03y/ubuntu_2404_lts_freezes_randomly_sometimes/

Why not try switching to Ubuntu 22 first?

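indtability posted on 2025-6-2 02:08

Swap looks completely full. Maybe dig in that direction?

Is the DNS error a configuration problem, or something that goes wrong at runtime? I'm not familiar with docker, but going by my experience with podman, some setups do need DNS configured separately.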

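loveS1 posted on 2025-6-2 02:54

Most likely some software ate all the memory and swap, making the system so slow to respond that it looks like a hang.
You could try adding a memory limit in the docker configuration (the `--memory` / `-m` option) so that the combined memory of all containers stays below total RAM, then see whether the hangs still occur.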
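
Before picking numbers, it may help to check that the planned per-container limits really add up to less than physical memory. The following is only a rough sketch: `check_mem_budget` is a hypothetical helper, the "name limit-in-GiB" input format is an assumption made for this example, and the amount reserved for the host is a guess to tune.

```shell
#!/bin/sh
# check_mem_budget TOTAL_GIB RESERVE_GIB
# Hypothetical helper: reads "container_name limit_in_GiB" pairs on stdin and
# checks that the limits fit into TOTAL_GIB minus RESERVE_GIB (kept for the host).
check_mem_budget() {
    awk -v total="$1" -v reserve="$2" '
        NF >= 2 { sum += $2 }          # add up every planned limit
        END {
            budget = total - reserve
            if (sum <= budget)
                printf "OK: limits total %d GiB, budget %d GiB\n", sum, budget
            else
                printf "OVER: limits total %d GiB, budget %d GiB\n", sum, budget
            exit (sum <= budget ? 0 : 1)   # non-zero exit when over budget
        }'
}
```

For example, `printf 'nextcloud 16\nplex 32\ngitlab 24\n' | check_mem_budget 128 16` checks three illustrative limits against a 128 GiB machine with 16 GiB held back. Each limit can then be applied with `docker run --memory=...` when starting a container.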

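phorcys02 posted on 2025-6-2 02:55

Last edited by phorcys02 on 2025-6-2 03:00

Swap is exhausted, free memory is almost gone, and there are also kernel messages about failed memory allocation.
Most likely this is caused by OOM, because if a service can drive the system into OOM, it won't happen just once; it will happen over and over.
Keep a log of memory usage and check which of your services is eating this much memory, or whether some service has a memory leak or an oversized cache configured.

Either fix the leak, reduce memory usage, or add more memory.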

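WiiGe posted on 2025-6-2 05:42

痴货 posted on 2025-6-2 00:55
I looked around and there seem to be similar symptoms reported on reddit:

Whoa, that does look like the same thing, doesn't it? Now I'm wavering.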

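WiiGe posted on 2025-6-2 05:50

Last edited by WiiGe on 2025-6-2 06:13

phorcys02 posted on 2025-6-2 02:55
Swap is exhausted, free memory is almost gone, and there are also kernel messages about failed memory allocation.
Most likely this is caused by OOM, because if a service can ...
How can you tell? Which keywords should I look for to see that the kernel failed to allocate memory? Please teach me.
Is it this one:
Apr 24 13:06:13 z8g4-portal kernel: workqueue: work_for_cpu_fn hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
Or:
Apr 27 17:21:24 z8g4-portal kernel: workqueue: vmstat_update hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
Or perhaps:
Apr 28 18:00:07 z8g4-portal kernel: workqueue: drain_vmap_area_work hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
this one?

Should I use sar to keep the memory usage log? I don't have much experience with this. Or should I settle it once and for all by finding an old router to host Prometheus?

What you and the two posters above say matches my intuition, but I have no idea how to go about tracking down a memory leak or reducing memory usage. I'd appreciate a more detailed prescription from the doctors.

Adding memory is easy, though. Reading logs doesn't stop me from placing an order, and the board has enough empty slots, so how about going up to 512G first and seeing whether that changes how often the hangs occur?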

tsubasa9 posted on 2025-6-2 11:48

Memory blew up and ffmpeg got killed.
Can't you run fewer containers first and see whether it still hangs?
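
The ffmpeg kill can be read off the `Out of memory: Killed process` line in the log from the first post. As a rough sketch (the function name `oom_summary` is made up for this example, and the pattern is based only on the line format quoted in this thread), a filter like the following prints one line per OOM kill with the anonymous RSS converted to GiB:

```shell
#!/bin/sh
# oom_summary: read kernel log text on stdin and print pid, command name and
# anon-rss (in GiB) for every "Out of memory: Killed process" line.
oom_summary() {
    # Pull out "pid anon_rss_kB comm" from each matching line.
    sed -n -E 's/.*Out of memory: Killed process ([0-9]+) \(([^)]+)\).*anon-rss:([0-9]+)kB.*/\1 \3 \2/p' |
    awk '{
        pid = $1; kb = $2
        comm = $0
        sub(/^[0-9]+ [0-9]+ /, "", comm)   # the rest of the line is the command name
        printf "pid=%s comm=%s anon_rss=%.1fGiB\n", pid, comm, kb / 1048576
    }'
}
```

On a live system the input could come from `journalctl -k -b -1 | oom_summary`, which reads the kernel messages of the previous boot, provided the journal is kept across reboots.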

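phorcys02 posted on 2025-6-2 12:35

WiiGe posted on 2025-6-2 05:50
How can you tell? Which keywords should I look for to see that the kernel failed to allocate memory? Please teach me.
Is it this one:
Or:

Something simple that records periodically, such as atop or sar; set a higher sampling frequency and let it run.
Or going straight to Prometheus is fine too...
I feel the services you listed shouldn't be enough to fill up memory; most likely it's a configuration problem of some kind.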
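
If installing atop or tuning sysstat is not convenient right away, a stopgap logger can be put together from `/proc/meminfo`. This is only a sketch under assumptions: the function names, the interval and the log path are all made up for illustration, and it records just MemAvailable and SwapFree.

```shell
#!/bin/sh
# mem_snapshot [MEMINFO_FILE]: print MemAvailable and SwapFree in MiB.
# Defaults to /proc/meminfo; a file argument makes it easy to try on saved data.
mem_snapshot() {
    awk '
        $1 == "MemAvailable:" { avail = $2 }   # values in /proc/meminfo are in kB
        $1 == "SwapFree:"     { swap = $2 }
        END { printf "mem_available_mib=%d swap_free_mib=%d\n", avail / 1024, swap / 1024 }
    ' "${1:-/proc/meminfo}"
}

# log_mem_forever INTERVAL_SECONDS LOGFILE: append a timestamped snapshot
# every INTERVAL_SECONDS until the process is stopped.
log_mem_forever() {
    while :; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(mem_snapshot)" >> "$2"
        sleep "$1"
    done
}
```

It could be started with something like `log_mem_forever 10 /var/log/mem-snapshots.log &`. Where sysstat is already collecting data, `sar -r` reports the memory history from the same files that `sar -f` reads.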

moondigi posted on 2025-6-2 12:45

If you don't plan to shut containers down to test stability, then give every container CPU and memory limits.

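WiiGe posted on 2025-6-2 23:54

moondigi posted on 2025-6-2 12:45
If you don't plan to shut containers down to test stability, then give every container CPU and memory limits.

I start everything with docker compose, and casually adding any of
resources:
limits / reservations:
memory:
cpus:
makes the containers fail in all sorts of bizarre ways. It has honestly given me a bit of PTSD. Is there a good way to introduce them?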
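
One cautious way to introduce the limits is to keep them in a separate override file, so the original compose.yaml stays untouched and the change is easy to back out. The sketch below assumes a service named `plex` and made-up values; the `deploy.resources.limits` and `deploy.resources.reservations` keys with `cpus` and `memory` come from the Compose specification, while the file name and the helper `write_limits_override` are hypothetical.

```shell
#!/bin/sh
# write_limits_override FILE: write a Compose override that only adds limits.
# The service name and all values below are illustrative assumptions.
write_limits_override() {
    cat > "$1" <<'EOF'
services:
  plex:            # hypothetical service name, must match compose.yaml
    deploy:
      resources:
        limits:
          cpus: "8"
          memory: 16G
        reservations:
          memory: 4G
EOF
}
```

After `write_limits_override compose.limits.yaml`, running `docker compose -f compose.yaml -f compose.limits.yaml config` prints the merged result without starting anything, which makes key or indentation mistakes visible before `up -d` is run with the same two `-f` options.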

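WiiGe posted on 2025-6-2 23:55

phorcys02 posted on 2025-6-2 12:35
Something simple that records periodically, such as atop or sar; set a higher sampling frequency and let it run.
Or going straight to Prometheus is fine too...
I feel the services you ...

Does the veteran expert have any ideas for checking the configuration? Or should I wait a while until I have memory usage logs?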

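WiiGe posted on 2025-6-2 23:56

tsubasa9 posted on 2025-6-2 11:48
Memory blew up and ffmpeg got killed.
Can't you run fewer containers first and see whether it still hangs?

Well... there's no pattern, so I don't know which one to kill first. Anyway, how about I axe ffmpeg first and see what happens?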

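moondigi posted on 2025-6-3 09:29

WiiGe posted on 2025-6-2 23:54
I start everything with docker compose, and casually adding any of
resources:
limits / reservations:

I use podman, not docker. Check the official documentation to see whether your configuration is wrong, and first start the containers from the command line with CPU and memory options and test for a while.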

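tsubasa9 posted on 2025-6-3 09:39

Last edited by tsubasa9 on 2025-6-3 09:41

If there's no pattern, you should start with the machine idle and nothing running. Surely you've learned about controlling variables.
An office workstation of mine that ran services had much the same symptoms.
A reinstall with a minimal archlinux behaved the same way.
It was only fixed after I swapped the memory.
I basically concluded it was a memory compatibility problem.
Memory tests and logs showed nothing wrong at all.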

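UNICORN00 posted on 2025-6-3 09:40

WiiGe posted on 2025-6-3 09:49

moondigi posted on 2025-6-3 09:29
I use podman, not docker. Check the official documentation to see whether your configuration is wrong, and first start the containers from the command line with CPU and memory options and test for a while ...

Out of curiosity, is podman basically a non-commercial docker? From the official site it seems quite close to k8s. Does it also use containerd? I've only written some yaml for k3s and have never used podman.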

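WiiGe posted on 2025-6-3 09:55

tsubasa9 posted on 2025-6-3 09:39
If there's no pattern, you should start with the machine idle and nothing running. Surely you've learned about controlling variables.
An office workstation of mine that ran services had much the same symptoms.
A reinstall with a minima ...

Right, right, right. Problems that no test can catch but that never run smoothly are the most annoying.
My original idea for controlling variables was to get a second Z8G4, split the hardware evenly into two machines with 256G of memory each, load-balance between them and see.
But the budget isn't quite there, so I thought I'd ask on the forum first in case someone has an idea.
Running idle really isn't an option, since these services are essential to me, so it comes back to the budget not being enough, haha.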

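ChengChung posted on 2025-6-3 10:38

For machines in production it's generally recommended to get rid of swap; that way, even if something OOMs, it's less likely to drag everything else down with it.
Also, do you have out-of-band hardware monitoring?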
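
Turning swap off usually has two parts: `sudo swapoff -a` disables it for the running system, and the swap entry in `/etc/fstab` has to be commented out so it does not come back after a reboot. The sketch below covers only the second part and deliberately prints the edited file instead of modifying anything; `comment_out_swap` is a hypothetical helper, and the pattern assumes ordinary whitespace-separated fstab lines.

```shell
#!/bin/sh
# comment_out_swap FSTAB_FILE: print the file with every active swap entry
# commented out. Nothing is changed in place; review the output before use.
comment_out_swap() {
    # Match lines that are not already comments and have "swap" as a
    # whitespace-delimited field, then prefix them with "#".
    sed -E '/^[[:space:]]*[^#[:space:]].*[[:space:]]swap[[:space:]]/ s/^/#/' "$1"
}
```

A possible workflow is `comment_out_swap /etc/fstab > /tmp/fstab.new`, comparing the result with `diff /etc/fstab /tmp/fstab.new`, and copying it into place only if the single changed line is the expected one.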

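WiiGe posted on 2025-6-3 15:12

ChengChung posted on 2025-6-3 10:38
For machines in production it's generally recommended to get rid of swap; that way, even if something OOMs, it's less likely to drag everything else down with it.
Also, do you have out-of-band hardware monitoring? ...

Good question. HP workstations only have Intel AMT, that lousy vPro thing, and I don't know whether it really counts as a BMC.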

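WiiGe posted on 2025-6-11 08:54

Last edited by WiiGe on 2025-6-11 10:52

Improvements I made over the past week:

- Took ChengChung's advice and removed swap
- Took moondigi's advice and added limits to compose.yaml
- Took tsubasa9's advice and ran the machine idle for 48 hours (no anomalies seen)
- Bought a cheap second-hand mini PC to host Prometheus, but it is still with the courier
- Added two 64G memory modules, but one is not recognized and is being returned for exchange; the machine now has three 64G modules installed

Yesterday afternoon it still hung once. The kernel log has been updated in the first post.

Update: just now, at 2025-06-11 10:40, it hung again. While the machine was pulling the vllm image from docker hub, I opened a BDMV folder over WebDav, and Plex was importing into its library at the same time. Then it hung.

Could this be related to network/filesystem IO?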

WiiGe posted on 2025-6-11 13:32

Last edited by WiiGe on 2025-6-11 13:34

sar results attached: sar -f /var/log/sysstat/sa04
Linux 6.8.0-60-generic (z8g4-portal) 06/04/2025    _x86_64_    (80 CPU)

12:00:22 AM CPU %user %nice %system %iowait %steal %idle
12:10:13 AM all    0.45    0.20    0.43    0.23    0.00 98.69
12:20:12 AM all    0.44    0.18    0.45    0.14    0.00 98.79
12:30:24 AM all    0.38    0.00    0.14    0.00    0.00 99.48
12:40:06 AM all    0.38    0.00    0.14    0.00    0.00 99.48
12:50:04 AM all    0.37    0.00    0.13    0.00    0.00 99.49
01:00:23 AM all    0.35    0.00    0.12    0.00    0.00 99.53
01:10:12 AM all    0.53    0.00    0.22    0.00    0.00 99.24
01:20:10 AM all    0.36    0.00    0.13    0.00    0.00 99.51
01:30:02 AM all    0.48    0.00    0.21    0.00    0.00 99.31
01:40:17 AM all    0.41    0.00    0.18    0.00    0.00 99.41
01:50:06 AM all    0.49    0.00    0.36    0.00    0.00 99.15
02:00:05 AM all    0.47    0.00    0.32    0.00    0.00 99.21
02:10:21 AM all    0.53    0.00    0.39    0.01    0.00 99.07
02:20:02 AM all    0.45    0.00    0.29    0.00    0.00 99.26
02:30:15 AM all    0.43    0.00    0.24    0.00    0.00 99.33
02:40:24 AM all    0.45    0.00    0.28    0.00    0.00 99.27
02:50:06 AM all    0.37    0.00    0.15    0.00    0.00 99.48
03:00:16 AM all    0.34    0.00    0.12    0.00    0.00 99.54
03:10:02 AM all    0.68    0.00    0.43    0.00    0.00 98.88
03:20:05 AM all    0.58    0.00    0.32    0.00    0.00 99.10
03:30:24 AM all    0.37    0.00    0.16    0.00    0.00 99.47
03:40:14 AM all    0.37    0.00    0.15    0.00    0.00 99.48
03:50:24 AM all    0.36    0.00    0.12    0.00    0.00 99.52
04:00:02 AM all    0.35    0.00    0.12    0.00    0.00 99.53
04:10:24 AM all    0.37    0.00    0.12    0.00    0.00 99.51
04:20:04 AM all    0.36    0.00    0.12    0.00    0.00 99.52
04:30:22 AM all    0.35    0.00    0.12    0.00    0.00 99.53
04:40:14 AM all    0.49    0.00    0.24    0.03    0.00 99.24
04:50:12 AM all    0.52    0.00    0.17    0.01    0.00 99.30
05:00:16 AM all    0.36    0.00    0.12    0.00    0.00 99.51
05:10:02 AM all    0.66    0.00    0.24    0.04    0.00 99.06
05:20:08 AM all    0.49    0.00    0.19    0.07    0.00 99.25
05:30:13 AM all    0.36    0.00    0.12    0.00    0.00 99.51
05:40:22 AM all    0.37    0.00    0.13    0.00    0.00 99.50
05:50:01 AM all    0.35    0.00    0.12    0.00    0.00 99.52
06:00:12 AM all    0.35    0.00    0.12    0.00    0.00 99.54
06:10:24 AM all    0.38    0.00    0.12    0.00    0.00 99.49
06:20:02 AM all    0.35    0.00    0.12    0.00    0.00 99.54
06:30:24 AM all    0.35    0.00    0.12    0.00    0.00 99.53
06:40:24 AM all    0.39    0.00    0.13    0.00    0.00 99.47
06:50:10 AM all    0.39    0.00    0.13    0.00    0.00 99.48
07:00:24 AM all    0.40    0.00    0.13    0.00    0.00 99.46
07:10:02 AM all    0.40    0.00    0.13    0.00    0.00 99.47
07:20:08 AM all    0.35    0.00    0.12    0.00    0.00 99.53
07:30:17 AM all    0.35    0.00    0.12    0.00    0.00 99.54
07:40:22 AM all    0.39    0.00    0.12    0.00    0.00 99.49
07:50:12 AM all    0.35    0.00    0.11    0.00    0.00 99.53
08:00:00 AM all    0.35    0.00    0.12    0.00    0.00 99.53
08:10:11 AM all    0.38    0.00    0.12    0.00    0.00 99.49
08:20:10 AM all    0.35    0.00    0.11    0.00    0.00 99.54
08:30:00 AM all    0.36    0.00    0.12    0.00    0.00 99.51
08:40:24 AM all    0.36    0.00    0.12    0.00    0.00 99.52
08:50:02 AM all    0.36    0.00    0.12    0.00    0.00 99.51
09:00:24 AM all    0.36    0.00    0.14    0.00    0.00 99.50
09:10:24 AM all    0.88    0.52    0.38    0.08    0.00 98.14
09:20:01 AM all    0.36    2.08    0.35    0.00    0.00 97.21
09:30:24 AM all    0.35    2.06    0.35    0.00    0.00 97.24
09:40:12 AM all    0.36    2.06    0.36    0.00    0.00 97.21
09:50:05 AM all    0.37    2.07    0.36    0.00    0.00 97.20
10:00:02 AM all    0.35    2.07    0.36    0.00    0.00 97.21

10:00:02 AM CPU %user %nice %system %iowait %steal %idle
10:10:03 AM all    0.93    0.26    0.40    0.08    0.00 98.34
10:20:05 AM all    0.37    0.00    0.13    0.00    0.00 99.50
10:30:12 AM all    0.37    0.00    0.13    0.00    0.00 99.51
10:40:06 AM all    0.37    0.00    0.13    0.00    0.00 99.49
10:50:02 AM all    0.36    0.00    0.12    0.00    0.00 99.51
11:00:24 AM all    0.36    0.00    0.12    0.00    0.00 99.52
11:10:07 AM all    0.38    0.00    0.13    0.00    0.00 99.49
11:20:11 AM all    0.37    0.00    0.13    0.00    0.00 99.50
11:30:17 AM all    0.37    0.00    0.12    0.00    0.00 99.50
11:40:02 AM all    0.37    0.00    0.14    0.00    0.00 99.49
11:50:04 AM all    0.37    0.00    0.12    0.00    0.00 99.50
12:00:24 PM all    0.36    0.00    0.12    0.00    0.00 99.51
12:10:22 PM all    0.37    0.00    0.13    0.00    0.00 99.50
12:20:12 PM all    0.36    0.00    0.12    0.00    0.00 99.51
12:30:08 PM all    0.37    0.00    0.12    0.00    0.00 99.50
12:40:24 PM all    0.37    0.00    0.12    0.00    0.00 99.51
12:50:02 PM all    0.37    0.00    0.12    0.00    0.00 99.51
01:00:14 PM all    0.37    0.00    0.12    0.00    0.00 99.51
01:10:13 PM all    0.38    0.00    0.12    0.00    0.00 99.49
01:20:11 PM all    0.35    0.00    0.12    0.00    0.00 99.52
01:30:24 PM all    0.37    0.00    0.13    0.01    0.00 99.49
01:40:12 PM all    0.37    0.00    0.13    0.01    0.00 99.50
01:50:07 PM all    0.36    0.00    0.12    0.00    0.00 99.52
02:00:02 PM all    0.36    0.00    0.12    0.00    0.00 99.51
02:10:23 PM all    0.39    0.00    0.12    0.00    0.00 99.49
02:20:22 PM all    0.36    0.00    0.12    0.00    0.00 99.52
02:30:12 PM all    0.37    0.00    0.12    0.00    0.00 99.50
02:40:24 PM all    0.37    0.00    0.12    0.00    0.00 99.51
02:50:02 PM all    0.38    0.00    0.13    0.00    0.00 99.48
03:00:15 PM all    0.53    0.00    0.21    0.09    0.00 99.17
03:10:01 PM all    0.59    0.00    0.16    0.00    0.00 99.25
03:20:21 PM all    0.53    0.00    0.15    0.00    0.00 99.32
03:30:24 PM all    0.38    0.00    0.15    0.00    0.00 99.47
03:40:02 PM all    0.46    0.00    0.15    0.00    0.00 99.39
03:50:24 PM all    0.40    0.00    0.14    0.00    0.00 99.46
04:00:24 PM all    0.43    0.00    0.15    0.00    0.00 99.42
04:10:22 PM all    0.38    0.00    0.14    0.00    0.00 99.48
04:20:24 PM all    0.41    0.00    0.15    0.00    0.00 99.44
04:30:07 PM all    0.39    0.00    0.14    0.00    0.00 99.47
04:40:16 PM all    0.43    0.00    0.15    0.00    0.00 99.42
04:50:08 PM all    0.38    0.00    0.14    0.00    0.00 99.48
05:00:14 PM all    0.40    0.00    0.15    0.00    0.00 99.45
05:10:09 PM all    0.44    0.00    0.15    0.00    0.00 99.41
05:20:12 PM all    0.39    0.00    0.13    0.00    0.00 99.48
05:30:24 PM all    0.38    0.00    0.14    0.00    0.00 99.48
05:40:02 PM all    0.41    0.00    0.14    0.00    0.00 99.45
05:50:22 PM all    0.41    0.00    0.17    0.06    0.00 99.35
06:00:00 PM all    0.44    0.00    0.14    0.00    0.00 99.42
06:10:12 PM all    0.47    0.00    0.14    0.00    0.00 99.39
06:20:06 PM all    0.47    0.00    0.18    0.00    0.00 99.35
06:30:02 PM all    0.44    0.00    0.16    0.00    0.00 99.40
06:40:05 PM all    0.46    0.00    0.15    0.00    0.00 99.38
06:50:17 PM all    0.44    0.00    0.12    0.00    0.00 99.43
07:00:06 PM all    0.45    0.00    0.14    0.00    0.00 99.41
07:10:20 PM all    0.44    0.00    0.14    0.00    0.00 99.42
07:20:02 PM all    0.41    0.00    0.13    0.00    0.00 99.45
07:30:09 PM all    0.41    0.00    0.13    0.00    0.00 99.45
07:40:24 PM all    0.42    0.00    0.14    0.00    0.00 99.43
07:50:11 PM all    0.37    0.00    0.13    0.00    0.00 99.49
08:00:24 PM all    0.37    0.00    0.13    0.00    0.00 99.50

08:00:24 PM CPU %user %nice %system %iowait %steal %idle
08:10:12 PM all    0.37    0.00    0.13    0.00    0.00 99.50
08:20:07 PM all 30.58    1.26    0.71    0.01    0.00 67.44
08:30:02 PM all 33.38 12.77    5.61    0.08    0.00 48.15
08:40:24 PM all 27.96    7.33    2.78    0.01    0.00 61.92
08:50:16 PM all 23.59    0.99    0.48    0.00    0.00 74.94
09:00:20 PM all 23.15    0.90    0.49    0.00    0.00 75.46
09:10:24 PM all 23.69    0.96    0.53    0.00    0.00 74.82
09:20:07 PM all 23.88    0.99    0.57    0.00    0.00 74.55
09:30:24 PM all 24.19    1.00    0.56    0.00    0.00 74.25
09:40:02 PM all 24.29    0.96    0.58    0.00    0.00 74.17
09:50:18 PM all 17.12    0.55    0.41    0.00    0.00 81.91
10:00:24 PM all    0.36    0.00    0.15    0.00    0.00 99.49
10:10:12 PM all    0.37    0.00    0.14    0.00    0.00 99.49
10:20:24 PM all    0.40    0.00    0.17    0.00    0.00 99.43
10:30:02 PM all    0.40    0.00    0.14    0.00    0.00 99.46
10:40:03 PM all    0.40    0.00    0.17    0.00    0.00 99.42
10:50:24 PM all    0.38    0.00    0.16    0.00    0.00 99.46
11:00:12 PM all    0.38    0.00    0.15    0.00    0.00 99.47
11:10:07 PM all    0.38    0.00    0.16    0.00    0.00 99.46
11:20:02 PM all    0.39    0.00    0.16    0.00    0.00 99.45
11:30:01 PM all    0.43    0.00    0.16    0.00    0.00 99.40
11:40:24 PM all    0.39    0.00    0.16    0.00    0.00 99.45
11:50:05 PM all    0.39    0.00    0.15    0.00    0.00 99.46
Average:    all    2.13    0.27    0.24    0.01    0.00 97.34

sar -f /var/log/sysstat/sa10
Linux 6.8.0-60-generic (z8g4-portal) 06/10/2025    _x86_64_    (80 CPU)

12:00:04 AM CPU %user %nice %system %iowait %steal %idle
12:10:11 AM all 20.56    1.25    2.31    0.05    0.00 75.83
12:20:17 AM all 20.28    1.09    2.39    0.05    0.00 76.18
12:30:09 AM all 19.09    0.06    2.50    0.33    0.00 78.01
12:40:07 AM all 18.94    0.00    2.30    0.04    0.00 78.72
12:50:14 AM all 18.86    0.00    2.32    0.04    0.00 78.78
01:00:08 AM all 18.84    0.00    2.45    0.05    0.00 78.66
01:10:04 AM all 18.88    0.00    2.55    0.06    0.00 78.50
01:20:08 AM all 18.70    0.00    2.49    0.06    0.00 78.74
01:30:02 AM all 18.67    0.00    2.56    0.07    0.00 78.70
01:40:04 AM all 18.50    0.00    2.67    0.07    0.00 78.76
01:50:04 AM all 18.48    0.00    2.66    0.08    0.00 78.78
02:00:01 AM all 18.59    0.00    2.63    0.08    0.00 78.70
02:10:04 AM all 18.60    0.00    2.60    0.07    0.00 78.73
02:20:04 AM all 18.67    0.00    2.53    0.06    0.00 78.74
02:30:15 AM all 18.59    0.00    2.63    0.07    0.00 78.72
02:40:17 AM all 18.45    0.00    2.69    0.07    0.00 78.79
02:50:14 AM all 18.54    0.00    2.61    0.07    0.00 78.79
03:00:12 AM all 18.86    0.00    2.35    0.06    0.00 78.73
03:10:00 AM all 19.15    0.00    2.59    0.06    0.00 78.19
03:20:12 AM all 18.81    0.00    2.62    0.06    0.00 78.51
03:30:17 AM all 18.45    0.00    2.68    0.06    0.00 78.81
03:40:07 AM all 18.41    0.00    2.74    0.06    0.00 78.79
03:50:01 AM all 18.47    0.00    2.67    0.06    0.00 78.79
04:00:04 AM all 18.51    0.00    2.67    0.07    0.00 78.75
04:10:17 AM all 18.47    0.00    2.65    0.06    0.00 78.82
04:20:17 AM all 18.44    0.00    2.69    0.06    0.00 78.80
04:30:13 AM all 18.52    0.00    2.61    0.06    0.00 78.80
04:40:10 AM all 18.53    0.00    2.59    0.07    0.00 78.81
04:50:14 AM all 18.51    0.00    2.63    0.06    0.00 78.80
05:00:08 AM all 18.72    0.00    2.70    0.07    0.00 78.51
05:10:17 AM all 18.76    0.00    2.45    0.06    0.00 78.74
05:20:09 AM all 18.55    0.00    2.68    0.06    0.00 78.71
05:30:13 AM all 18.50    0.00    2.69    0.06    0.00 78.75
05:40:05 AM all 18.66    0.00    2.56    0.06    0.00 78.73
05:50:17 AM all 18.71    0.00    2.50    0.06    0.00 78.73
06:00:04 AM all 18.77    0.00    2.53    0.06    0.00 78.64
06:10:07 AM all 18.84    0.00    2.64    0.06    0.00 78.46
06:20:17 AM all 18.87    0.00    2.55    0.06    0.00 78.52
06:30:14 AM all 18.76    0.00    2.47    0.06    0.00 78.72
06:40:17 AM all 18.66    0.00    2.61    0.06    0.00 78.68
06:50:02 AM all 18.66    0.00    2.59    0.06    0.00 78.70
07:00:14 AM all 18.68    0.00    2.54    0.06    0.00 78.72
07:10:12 AM all 18.58    0.00    2.54    0.05    0.00 78.83
07:20:14 AM all 18.63    0.00    2.20    0.05    0.00 79.12
07:30:14 AM all    1.32    0.00    0.18    0.00    0.00 98.50
07:40:04 AM all    0.23    0.00    0.10    0.00    0.00 99.67
07:50:17 AM all    0.23    0.00    0.10    0.00    0.00 99.67
08:00:04 AM all    0.27    0.00    0.10    0.00    0.00 99.62
08:10:08 AM all    0.24    0.00    0.10    0.00    0.00 99.66
08:20:17 AM all    0.24    0.00    0.10    0.00    0.00 99.65
08:30:00 AM all    0.24    0.00    0.10    0.00    0.00 99.65
08:40:01 AM all    0.24    0.00    0.10    0.00    0.00 99.66
08:50:17 AM all    0.24    0.00    0.11    0.00    0.00 99.65
09:00:13 AM all    0.33    0.00    0.19    0.00    0.00 99.47
09:10:04 AM all    0.74    1.21    0.42    0.08    0.00 97.56
09:20:02 AM all    0.26    2.06    0.34    0.00    0.00 97.34
09:30:14 AM all    0.26    2.06    0.35    0.00    0.00 97.33
09:40:04 AM all    0.24    2.03    0.37    0.00    0.00 97.36
09:50:03 AM all    0.23    2.03    0.37    0.00    0.00 97.36
10:00:10 AM all    0.54    1.58    0.38    0.13    0.00 97.38

10:00:10 AM CPU %user %nice %system %iowait %steal %idle
10:10:11 AM all    0.49    0.00    0.17    0.19    0.00 99.15
10:20:17 AM all    0.25    0.00    0.11    0.00    0.00 99.63
10:30:06 AM all    0.25    0.00    0.11    0.00    0.00 99.64
10:40:17 AM all    0.26    0.00    0.11    0.00    0.00 99.62
10:50:04 AM all    0.26    0.00    0.11    0.00    0.00 99.63
11:00:13 AM all    0.31    0.00    0.12    0.00    0.00 99.56
11:10:17 AM all    0.27    0.00    0.11    0.00    0.00 99.62
11:20:14 AM all    0.39    0.42    0.34    0.16    0.00 98.68
11:30:15 AM all    0.45    3.57    0.69    0.21    0.00 95.08
11:40:04 AM all    0.25    2.06    0.30    0.05    0.00 97.35
11:50:04 AM all    0.24    2.10    0.30    0.05    0.00 97.31
12:00:13 PM all    0.33    2.08    0.35    0.04    0.00 97.20
12:10:17 PM all    0.28    2.08    0.31    0.03    0.00 97.30
12:20:17 PM all    0.25    2.11    0.31    0.04    0.00 97.30
12:30:04 PM all    0.24    2.08    0.32    0.05    0.00 97.31
12:40:17 PM all    0.25    2.09    0.30    0.05    0.00 97.32
12:50:04 PM all    0.25    2.10    0.30    0.04    0.00 97.31
01:00:15 PM all    0.30    2.09    0.31    0.04    0.00 97.24
01:10:17 PM all    0.25    2.08    0.30    0.05    0.00 97.32
01:20:14 PM all    0.25    2.09    0.29    0.01    0.00 97.37
01:30:14 PM all    0.27    2.12    0.49    0.40    0.00 96.73
01:40:17 PM all    0.25    2.11    0.27    0.01    0.00 97.37
01:50:00 PM all    0.48    7.19    2.88    0.16    0.00 89.29
02:00:14 PM all    1.42    3.94    1.77    0.47    0.00 92.40
02:10:14 PM all    1.03    4.71    1.40    0.75    0.00 92.11
02:20:17 PM all    0.73    3.71    1.33    1.48    0.00 92.75
02:30:04 PM all    0.27    1.96    0.32    0.02    0.00 97.43
02:40:17 PM all    0.26    2.07    0.36    0.04    0.00 97.26
02:50:17 PM all    0.28    2.07    0.33    0.01    0.00 97.32
03:00:14 PM all    0.30    2.07    0.34    0.00    0.00 97.28
03:10:17 PM all    0.27    2.08    0.35    0.00    0.00 97.30
03:20:04 PM all    0.26    2.10    0.33    0.00    0.00 97.31
03:30:15 PM all    0.27    2.06    0.28    0.01    0.00 97.39
03:40:17 PM all    0.28    2.16    0.25    0.00    0.00 97.31
03:50:03 PM all    0.31    2.18    0.30    0.01    0.00 97.20
04:00:13 PM all    0.37    2.25    0.31    0.02    0.00 97.05
04:10:04 PM all    0.26    2.34    0.24    0.00    0.00 97.15
04:20:17 PM all    0.25    2.33    0.28    0.00    0.00 97.14
04:30:17 PM all    0.26    2.27    0.29    0.01    0.00 97.18
04:40:09 PM all    0.24    2.17    0.24    0.02    0.00 97.33
04:50:04 PM all    0.25    2.28    0.26    0.12    0.00 97.09
05:00:14 PM all    0.31    2.28    0.26    0.00    0.00 97.15
05:10:17 PM all    0.38    2.68    0.43    0.08    0.00 96.43
Average:    all    8.24    0.98    1.31    0.08    0.00 89.39

05:30:58 PM  LINUX RESTART    (80 CPU)

05:40:07 PM CPU %user %nice %system %iowait %steal %idle
05:50:37 PM all    1.06    0.00    0.63    0.03    0.00 98.28
06:00:07 PM all    0.70    0.00    0.41    0.01    0.00 98.88
06:10:07 PM all    0.54    0.00    0.24    0.01    0.00 99.22
06:20:09 PM all    0.43    0.00    0.17    0.01    0.00 99.40
06:30:13 PM all    0.45    0.00    0.24    0.01    0.00 99.30
06:40:06 PM all    0.41    0.00    0.25    0.21    0.00 99.13
06:50:07 PM all    0.37    0.00    0.11    0.00    0.00 99.51
07:00:34 PM all    0.34    0.00    0.10    0.00    0.00 99.55
07:10:13 PM all    0.39    0.00    0.11    0.00    0.00 99.50
07:20:27 PM all    0.36    0.00    0.10    0.00    0.00 99.54
07:30:06 PM all    0.37    0.00    0.10    0.00    0.00 99.53
07:40:17 PM all    0.37    0.00    0.21    0.02    0.00 99.39
07:50:21 PM all    0.39    0.00    0.33    0.01    0.00 99.26
08:00:07 PM all    0.38    0.00    0.11    0.00    0.00 99.51
08:10:27 PM all    0.99    0.53    0.14    0.00    0.00 98.34
08:20:00 PM all    7.23    3.58    0.44    0.01    0.00 88.74
08:30:17 PM all    4.43    1.58    0.38    0.02    0.00 93.59
08:40:31 PM all    3.73    1.75    0.30    0.00    0.00 94.22
08:50:07 PM all    4.20    2.93    0.34    0.00    0.00 92.53
09:00:14 PM all    3.50    2.32    0.32    0.00    0.00 93.85
09:10:06 PM all    1.14    0.63    0.16    0.00    0.00 98.07
09:20:27 PM all    0.68    0.24    0.13    0.00    0.00 98.95
09:30:12 PM all    0.33    0.00    0.10    0.00    0.00 99.57
09:40:07 PM all    0.36    0.00    0.10    0.00    0.00 99.54
09:50:08 PM all    0.34    0.00    0.10    0.00    0.00 99.56
10:00:21 PM all    0.34    0.00    0.10    0.00    0.00 99.56
10:10:16 PM all    0.38    0.00    0.11    0.00    0.00 99.51
10:20:12 PM all    0.33    0.00    0.10    0.00    0.00 99.56
10:30:17 PM all    0.33    0.00    0.11    0.00    0.00 99.55
10:40:19 PM all    0.34    0.00    0.11    0.00    0.00 99.54
10:50:37 PM all    0.34    0.00    0.10    0.00    0.00 99.55
11:00:11 PM all    0.38    0.00    0.11    0.00    0.00 99.52
11:10:04 PM all    0.36    0.00    0.11    0.00    0.00 99.53
11:20:08 PM all    0.34    0.00    0.10    0.00    0.00 99.56
11:30:24 PM all    0.33    0.00    0.10    0.00    0.00 99.57
11:40:07 PM all    0.33    0.00    0.10    0.00    0.00 99.57
11:50:27 PM all    0.34    0.00    0.10    0.00    0.00 99.56
Average:    all    1.01    0.36    0.19    0.01    0.00 98.43
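
The listings above are the default CPU view of sar. To make the busy stretches easier to spot in output this long, a small filter can help. This is a sketch that assumes the 12-hour time format shown above (time, AM/PM, then `all`); `busy_intervals` is a made-up helper name.

```shell
#!/bin/sh
# busy_intervals MAX_IDLE: read `sar` CPU output on stdin and print the
# intervals whose %idle (the last column) is below MAX_IDLE.
busy_intervals() {
    # Data rows have "all" as the third field; header rows have "CPU" there,
    # and "Average:" rows have "all" as the second field, so both are skipped.
    awk -v max_idle="$1" '$3 == "all" && $NF + 0 < max_idle + 0 { print $1, $2, "idle=" $NF }'
}
```

For example, `sar -f /var/log/sysstat/sa04 | busy_intervals 70` lists only the samples with less than 70% idle. For the memory and I/O questions raised in this thread, `sar -r -f /var/log/sysstat/sa10` shows memory utilization and `sar -d -f /var/log/sysstat/sa10` shows block device activity from the same data file.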

Realplayer posted on 2025-6-11 14:18

Better try booting Windows from wintogo and see how it runs.

indtability posted on 2025-6-11 21:31

If you suspect a disk problem, have you tried fsck?
