Last edited by WiiGe on 2025-6-11 07:46
Log from the most recent hang: https://share.wiige.top/upload/er5yBd
=================================================================================
Updated log, lightly redacted: https://share.wiige.top/p/aAAYza
=================================================================================
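I set up an HP Z8G4 workstation as a home server. It runs Ubuntu 24 LTS, with every application containerized in Docker and isolated from the host: mainly Nextcloud, Komga, Plex, GitLab and the like.
The configuration:
Z8G4: https://support.hp.com/us-en/dri ... orkstation/16449803
Memory is 128GB DDR4 ECC (entries for empty slots omitted):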
- sudo dmidecode -t memory
- # dmidecode 3.5
- Getting SMBIOS data from sysfs.
- SMBIOS 3.2.0 present.
- Handle 0x000F, DMI type 16, 23 bytes
- Physical Memory Array
- Location: System Board Or Motherboard
- Use: System Memory
- Error Correction Type: Single-bit ECC
- Maximum Capacity: 4608 GB
- Error Information Handle: Not Provided
- Number Of Devices: 12
- Handle 0x0010, DMI type 17, 40 bytes
- Memory Device
- Array Handle: 0x000F
- Error Information Handle: Not Provided
- Total Width: 72 bits
- Data Width: 64 bits
- Size: 64 GB
- Form Factor: DIMM
- Set: None
- Locator: CPU0-DIMM1
- Bank Locator: CPU0
- Type: DDR4
- Type Detail: Synchronous LRDIMM
- Speed: 2666 MT/s
- Manufacturer: Samsung
- Serial Number: 398E97A7
- Asset Tag: Not Specified
- Part Number: M386A8K40BM2-CTD
- Rank: 4
- Configured Memory Speed: 2666 MT/s
- Minimum Voltage: 1.2 V
- Maximum Voltage: 1.2 V
- Configured Voltage: 1.2 V
- Handle 0x001D, DMI type 16, 23 bytes
- Physical Memory Array
- Location: System Board Or Motherboard
- Use: System Memory
- Error Correction Type: Single-bit ECC
- Maximum Capacity: 4608 GB
- Error Information Handle: Not Provided
- Number Of Devices: 12
- Handle 0x001E, DMI type 17, 40 bytes
- Memory Device
- Array Handle: 0x001D
- Error Information Handle: Not Provided
- Total Width: 72 bits
- Data Width: 64 bits
- Size: 64 GB
- Form Factor: DIMM
- Set: None
- Locator: CPU1-DIMM1
- Bank Locator: CPU1
- Type: DDR4
- Type Detail: Synchronous LRDIMM
- Speed: 2666 MT/s
- Manufacturer: Samsung
- Serial Number: 398F3D61
- Asset Tag: Not Specified
- Part Number: M386A8K40BM2-CTD
- Rank: 4
- Configured Memory Speed: 2666 MT/s
- Minimum Voltage: 1.2 V
- Maximum Voltage: 1.2 V
- Configured Voltage: 1.2 V
A few SSDs:
- $ sudo smartctl --all /dev/nvme0
- === START OF INFORMATION SECTION ===
- Model Number: INTEL SSDPED1D280GA
- Serial Number: PHMB75160057280CGN
- Firmware Version: E2010480
- PCI Vendor/Subsystem ID: 0x8086
- IEEE OUI Identifier: 0x5cd2e4
- Controller ID: 0
- NVMe Version: <1.2
- Number of Namespaces: 1
- Namespace 1 Size/Capacity: 280,065,171,456 [280 GB]
- Namespace 1 Formatted LBA Size: 512
- Local Time is: Sun Jun 1 01:58:46 2025 CST
- Firmware Updates (0x02): 1 Slot
- Optional Admin Commands (0x0007): Security Format Frmw_DL
- Optional NVM Commands (0x0006): Wr_Unc DS_Mngmt
- Log Page Attributes (0x0a): Cmd_Eff_Lg Telmtry_Lg
- Maximum Data Transfer Size: 32 Pages
- === START OF SMART DATA SECTION ===
- SMART overall-health self-assessment test result: PASSED
- $ sudo smartctl --all /dev/nvme1
- === START OF INFORMATION SECTION ===
- Model Number: MEMBLAZE P6530CH0384M00
- Serial Number: SH220600806
- Firmware Version: 0A101Z00
- PCI Vendor/Subsystem ID: 0x1c5f
- IEEE OUI Identifier: 0x00e004
- Total NVM Capacity: 3,840,755,982,336 [3.84 TB]
- Unallocated NVM Capacity: 0
- Controller ID: 1
- NVMe Version: 1.4
- Number of Namespaces: 1
- Namespace 1 Size/Capacity: 3,840,755,982,336 [3.84 TB]
- Namespace 1 Formatted LBA Size: 512
- Namespace 1 IEEE EUI-64: 38b19e 7418032601
- Local Time is: Sun Jun 1 01:59:07 2025 CST
- Firmware Updates (0x17): 3 Slots, Slot 1 R/O, no Reset required
- Optional Admin Commands (0x001f): Security Format Frmw_DL NS_Mngmt Self_Test
- Optional NVM Commands (0x0054): DS_Mngmt Sav/Sel_Feat Timestmp
- Log Page Attributes (0x1e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
- Maximum Data Transfer Size: 128 Pages
- Warning Comp. Temp. Threshold: 70 Celsius
- Critical Comp. Temp. Threshold: 77 Celsius
- Namespace 1 Features (0x08): No_ID_Reuse
- Supported Power States
- St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
- 0 + 25.00W - - 0 0 0 0 0 0
- 1 + 14.00W - - 0 0 0 0 0 0
- 2 + 13.00W - - 0 0 0 0 0 0
- 3 + 12.00W - - 0 0 0 0 0 0
- 4 + 11.00W - - 0 0 0 0 0 0
- 5 + 10.00W - - 0 0 0 0 0 0
- 6 + 9.00W - - 0 0 0 0 0 0
- 7 + 8.00W - - 0 0 0 0 0 0
- 8 + 7.00W - - 0 0 0 0 0 0
- 9 + 6.00W - - 0 0 0 0 0 0
- Supported LBA Sizes (NSID 0x1)
- Id Fmt Data Metadt Rel_Perf
- 0 + 512 0 2
- 1 - 4096 0 0
- 2 - 512 8 2
- 3 - 4096 8 0
- 4 - 4096 64 0
- === START OF SMART DATA SECTION ===
- SMART overall-health self-assessment test result: PASSED
- $ sudo smartctl --all /dev/nvme2
- === START OF INFORMATION SECTION ===
- Model Number: Fanxiang S500Pro 2TB
- Serial Number: FXS500Pro243954233
- Firmware Version: SN12517
- PCI Vendor/Subsystem ID: 0x1e4b
- IEEE OUI Identifier: 0x000000
- Total NVM Capacity: 2,048,408,248,320 [2.04 TB]
- Unallocated NVM Capacity: 0
- Controller ID: 0
- NVMe Version: 1.4
- Number of Namespaces: 1
- Namespace 1 Size/Capacity: 2,048,408,248,320 [2.04 TB]
- Namespace 1 Formatted LBA Size: 512
- Namespace 1 IEEE EUI-64: 000000 0243954233
- Local Time is: Sun Jun 1 01:59:09 2025 CST
- Firmware Updates (0x1a): 5 Slots, no Reset required
- Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
- Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
- Log Page Attributes (0x02): Cmd_Eff_Lg
- Maximum Data Transfer Size: 128 Pages
- Warning Comp. Temp. Threshold: 90 Celsius
- Critical Comp. Temp. Threshold: 95 Celsius
- Supported Power States
- St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
- 0 + 6.50W - - 0 0 0 0 0 0
- 1 + 5.80W - - 1 1 1 1 0 0
- 2 + 3.60W - - 2 2 2 2 0 0
- 3 - 0.7460W - - 3 3 3 3 5000 10000
- 4 - 0.7260W - - 4 4 4 4 8000 45000
- Supported LBA Sizes (NSID 0x1)
- Id Fmt Data Metadt Rel_Perf
- 0 + 512 0 0
- === START OF SMART DATA SECTION ===
- SMART overall-health self-assessment test result: PASSED
- $ sudo smartctl --all /dev/nvme3
- === START OF INFORMATION SECTION ===
- Model Number: MEMBLAZE P6530CH0384M00
- Serial Number: SH220600832
- Firmware Version: 0A101Z00
- PCI Vendor/Subsystem ID: 0x1c5f
- IEEE OUI Identifier: 0x00e004
- Total NVM Capacity: 3,840,755,982,336 [3.84 TB]
- Unallocated NVM Capacity: 0
- Controller ID: 1
- NVMe Version: 1.4
- Number of Namespaces: 1
- Namespace 1 Size/Capacity: 3,840,755,982,336 [3.84 TB]
- Namespace 1 Formatted LBA Size: 512
- Namespace 1 IEEE EUI-64: 38b19e 7418034001
- Local Time is: Sun Jun 1 01:59:11 2025 CST
- Firmware Updates (0x17): 3 Slots, Slot 1 R/O, no Reset required
- Optional Admin Commands (0x001f): Security Format Frmw_DL NS_Mngmt Self_Test
- Optional NVM Commands (0x0054): DS_Mngmt Sav/Sel_Feat Timestmp
- Log Page Attributes (0x1e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
- Maximum Data Transfer Size: 128 Pages
- Warning Comp. Temp. Threshold: 70 Celsius
- Critical Comp. Temp. Threshold: 77 Celsius
- Namespace 1 Features (0x08): No_ID_Reuse
- Supported Power States
- St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
- 0 + 25.00W - - 0 0 0 0 0 0
- 1 + 14.00W - - 0 0 0 0 0 0
- 2 + 13.00W - - 0 0 0 0 0 0
- 3 + 12.00W - - 0 0 0 0 0 0
- 4 + 11.00W - - 0 0 0 0 0 0
- 5 + 10.00W - - 0 0 0 0 0 0
- 6 + 9.00W - - 0 0 0 0 0 0
- 7 + 8.00W - - 0 0 0 0 0 0
- 8 + 7.00W - - 0 0 0 0 0 0
- 9 + 6.00W - - 0 0 0 0 0 0
- Supported LBA Sizes (NSID 0x1)
- Id Fmt Data Metadt Rel_Perf
- 0 + 512 0 2
- 1 - 4096 0 0
- 2 - 512 8 2
- 3 - 4096 8 0
- 4 - 4096 64 0
- === START OF SMART DATA SECTION ===
- SMART overall-health self-assessment test result: PASSED
The CPUs are secondhand Xeon Gold 6138s, decommissioned server pulls:
- $ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
- 80 Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
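I added a 3070 8G blower card to run some small models:
Symptoms:
The machine runs 24/7 but hangs from time to time. The interval between hangs ranges from a dozen or so hours to several months. By my count, 70% of the hangs happen between 00:00 and 09:00, and the other 30% are scattered randomly across the rest of the day. The main activity between 00:00 and 09:00 is Plex background work such as audio fingerprint analysis and media analysis.
During a hang, a directly attached monitor shows no blinking cursor and a directly attached keyboard gets no response, Caps Lock and Scroll Lock included. SSH over the LAN does not respond, and neither does Intel AMT.
No fault LEDs are lit inside the case (the interior really is nice to look at):
During a hang the fans audibly ramp up, to at least 80% of full speed by ear. It is obvious because the Z8G4 is normally very quiet.
A hard reset is the only thing that gets it going again; I have never actually solved the problem.
It started about 1-2 months after I got this setup and has now been going on for more than a year.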
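The 70% figure for the 00:00-09:00 window can be rechecked as more hangs accumulate. Below is a minimal Python sketch, assuming the hang times are noted down by hand. `share_in_window` is a hypothetical helper, and the timestamps are placeholders for illustration, not the real hang history.

```python
from datetime import datetime, time


def share_in_window(stamps, start=time(0, 0), end=time(9, 0)):
    """Return the fraction of timestamps whose time of day is in [start, end)."""
    if not stamps:
        return 0.0
    hits = sum(1 for s in stamps if start <= s.time() < end)
    return hits / len(stamps)


# Placeholder values for illustration only (NOT real hang data).
example = [
    datetime(2025, 1, 1, 2, 30),
    datetime(2025, 1, 5, 8, 59),
    datetime(2025, 1, 9, 9, 0),    # 09:00 sharp falls outside the window
    datetime(2025, 2, 1, 23, 15),
]
print(share_in_window(example))
```

With only a handful of hangs the percentage is noisy, so it is a hint about the Plex night jobs rather than proof.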
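What I have tried:
Checked the temperature inside the case: the peak is only 62°C. Average CPU utilization is below 47%, though some single cores sit at 100% for long stretches (understandable with background transcoding).
Suspected the GPU was overheating and dropping the display output: nvitop shows it staying under 60°C.
Suspected swap filled up and froze the system: the kernel log contains no OOM messages, and growing swap to 64G changed nothing.
Suspected the NIC overheated and dropped off: but the display output freezes too, so when this machine dies, everything dies together.
Suspected the room was too hot: moved it to a bedroom with the air conditioning on 24/7. Same symptoms, and the hang frequency is as unpredictable as ever.
Suspected a no-name rice-cooker power cord could not handle the 1800W PSU: swapped in a genuine spare cord, no change.
Suspected unstable memory: ran the BIOS memory test for 3 hours, all green.
Suspected the PSU: looked up the price of a genuine spare and gave up.
Suspected the motherboard, but I have no evidence. The seller's answer: sort out your cooling and stop guessing.
Scratching my head: could a single core at full load really be cooking itself to death?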
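One way to test the single-core overheating idea would be to log per-sensor temperatures to disk continuously, so the last reading before a hang survives the hard reset. Below is a minimal Python sketch, assuming the sensors are exposed through the standard Linux hwmon sysfs files (`temp*_input` in millidegrees Celsius, optional `temp*_label`). `read_temps`, `log_forever`, the `temps.log` path and the 10-second interval are all hypothetical choices for illustration.

```python
import glob
import os
import time


def read_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/label": degrees_C} for every temp*_input under root."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(root, "hwmon*", "temp*_input"))):
        chip_dir = os.path.dirname(path)
        stem = path[: -len("_input")]
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            chip = "unknown"
        try:
            with open(stem + "_label") as f:
                label = f.read().strip()
        except OSError:
            label = os.path.basename(stem)  # e.g. "temp3" when no label file exists
        try:
            with open(path) as f:
                value = int(f.read().strip()) / 1000.0  # sysfs reports millidegrees C
        except (OSError, ValueError):
            continue  # sensor unreadable right now; skip it
        # Include the hwmon directory so two CPU packages do not overwrite each other.
        temps[f"{os.path.basename(chip_dir)}:{chip}/{label}"] = value
    return temps


def log_forever(log_path="temps.log", interval_s=10):
    """Append the hottest sensor once per interval and force each line to disk."""
    with open(log_path, "a") as log:
        while True:
            temps = read_temps()
            name, value = max(temps.items(), key=lambda kv: kv[1],
                              default=("none", float("nan")))
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} hottest={name} {value:.1f}C sensors={len(temps)}\n")
            log.flush()
            os.fsync(log.fileno())  # make sure the line is on disk before a hang
            time.sleep(interval_s)
```

Running `log_forever()` in the background and reading the tail of the log after the next hard reset would show whether any core was climbing just before the hang. The `fsync` per line trades a little I/O for not losing the final readings.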
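Similar cases:
https://h30471.www3.hp.com/t5/tai-shi-dian-nao/z8g4jing-chang-si-ji/m-p/1134658 -> I have no fault LEDs like that one
https://h30471.www3.hp.com/t5/tai-shi-dian-nao/HP-Z8G4-928-PCIE-yan-zhong-chao-shi-si-ji-zhong-qi/m-p/1261763 -> I get no POST error like that one
https://h30471.www3.hp.com/t5/tai-shi-dian-nao/hui-puZ8G4gong-zuo-zhanCPU-wen-du-guo-gao-jing-chang-si-ji/m-p/1264789 -> from what I observed, my temperatures are nowhere near that high (data from Cockpit)
https://h30471.www3.hp.com/t5/tai-shi-dian-nao/HPZ8G4-pin-fan-si-ji/m-p/1077573 -> very similar to my situation; I checked and there is no dust buildup hurting the cooling either
I am out of ideas at this point, so I am posting here to pool everyone's thinking and see whether anyone has a good way to track down the cause. By rights this machine has strong cooling and a stable design, built for nothing but sustained performance, so why does it keep hanging for no reason? I don't get it (in a Fenghua accent).
If needed I can post the output of dmesg and journalctl -k, but it is long and tedious and I doubt anyone wants to read it......?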
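Since the full kernel log is long, one option is to post only the lines that look hardware-related. Below is a minimal Python sketch, assuming the journal is persistent so that `journalctl -k -b -1` can return the kernel messages of the boot that ended in the hang. `suspicious_lines`, `kernel_log` and the keyword list are hypothetical; the keywords are a starting point, not an exhaustive catalogue.

```python
import re
import subprocess

# Phrases that commonly accompany machine checks, PCIe errors, lockups,
# OOM kills, NVMe timeouts, NVIDIA Xid events and thermal throttling.
PATTERN = re.compile(
    r"machine check|mce:|hardware error|\baer\b|\bnmi\b|lockup"
    r"|blocked for more than|out of memory|oom-kill"
    r"|nvme.*(timeout|reset)|\bxid\b|throttl",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Keep only the log lines that match one of the keywords above."""
    return [ln.rstrip("\n") for ln in lines if PATTERN.search(ln)]


def kernel_log(boot="-1"):
    """Return kernel messages for one boot; "-1" means the previous boot."""
    out = subprocess.run(
        ["journalctl", "-k", "-b", boot, "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()
```

Usage would be `print("\n".join(suspicious_lines(kernel_log())))` right after a hard reset. A hard freeze may leave nothing in the log at all, in which case an empty result is itself a hint that the cause sits below the OS.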