DeepSeek R1 671B over 2 tok/sec *without* GPU on local gaming rig!
Discussion (self.LocalLLaMA)
submitted 12 hours ago by VoidAlchemy | llama.cpp
Don't rush out and buy that 5090TI just yet (if you can even find one lol)!
I just inferenced ~2.13 tok/sec with 2k context using a dynamic quant of the full R1 671B model (not a distill) after disabling my 3090TI GPU on a 96GB RAM gaming rig. The secret trick is to not load anything but kv cache into RAM and let llama.cpp use its default behavior to mmap() the model files off of a fast NVMe SSD. The rest of your system RAM acts as disk cache for the active weights.
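If you want to try it, the invocation looks roughly like this (a sketch, not my exact command; the shard filename is illustrative):

```
# Minimal sketch: let llama.cpp mmap() the weights straight off NVMe (its
# default behavior). The GGUF filename is illustrative.
#   -c 2048 -> 2k context; the KV cache is the only big RAM allocation
#   -ngl 0  -> zero layers offloaded, i.e. no GPU at all
#   -t 16   -> tune to your physical core count
./llama-server -m DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf -c 2048 -ngl 0 -t 16
# Do NOT pass --no-mmap or --mlock: the default mmap() path is what lets
# the OS page cache keep the hot expert weights in spare RAM.
```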
Yesterday a bunch of folks got the dynamic quant flavors of unsloth/DeepSeek-R1-GGUF running on gaming rigs in another thread here. I myself got the DeepSeek-R1-UD-Q2_K_XL flavor going at 1~2 tok/sec with 2k~16k context on 96GB RAM + 24GB VRAM, experimenting with context length and up to 8 concurrent inference slots for increased aggregate throughput.
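llama-server exposes the concurrent slots via -np; roughly like this (again a sketch, and IIRC the total -c context gets split evenly across the slots):

```
# Sketch of the multi-slot run: llama-server divides the total -c context
# evenly across the -np parallel slots (16384 / 8 = 2k per slot).
# Add -ngl N (however many layers fit in 24GB VRAM) to bring the GPU back.
./llama-server -m DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf -c 16384 -np 8
```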
After experimenting with various setups, the bottleneck is clearly my Gen 5 x4 NVMe SSD: the CPU never goes above ~30%, the GPU is basically idle, and the power supply fan doesn't even come on. So while it's slow, it isn't heating up the room.
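If you want to sanity-check where your own rig tops out, watch the drive while a generation is running (iostat comes from the sysstat package):

```
# Watch disk and GPU during a generation to spot the bottleneck.
iostat -x 1      # per-device read MB/s and %util; the NVMe should be pegged
nvidia-smi -l 1  # GPU should sit near 0% utilization with -ngl 0
```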
So instead of a $2k GPU, what about $1.5k for 4x NVMe SSDs on an expansion card, giving 2TB of "VRAM" with a theoretical max sequential read "memory" bandwidth of ~48GB/s (four Gen 5 x4 drives at ~12GB/s each, striped)? This less expensive setup would likely give better price/performance for big MoEs on home rigs. If you forgo a GPU, you could dedicate all 16 lanes of PCIe 5.0 to NVMe drives on gamer-class motherboards.
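Something like this is what I have in mind for the array, assuming four drives on a bifurcated x16 carrier card (device names are illustrative, and mdadm will wipe them):

```
# RAID-0 stripe across four NVMe drives so sequential reads aggregate.
# WARNING: destroys data on the listed devices; names are illustrative.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
sudo mkfs.ext4 /dev/md0 && sudo mount /dev/md0 /mnt/models
# Rough math: 4 drives x ~12GB/s (Gen 5 x4 each) ~= 48GB/s sequential read.
```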
If anyone has a drive array with fast read IOPS, I'd love to hear what kind of speeds you can get. I gotta bug Wendell over at Level1Techs lol...
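For comparable numbers, a quick fio run like this would do it (a sketch; point it at your array and adjust the size):

```
# Sequential read test, roughly like llama.cpp streaming weights off disk;
# swap in --rw=randread --bs=4k to measure IOPS instead.
fio --name=seqread --filename=/mnt/models/fio-test --size=8G \
    --rw=read --bs=1M --iodepth=32 --ioengine=libaio --direct=1
```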
P.S. In my opinion this quantized R1 671B beats the pants off any of the distill model toys. While slow and limited in context, it is still likely the best thing available for home users for many applications.
Just need to figure out how to short-circuit the <think>Blah blah</think> stuff by injecting a </think> into the assistant prompt to see if it gives decent results without all the yapping haha...
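Something like this against llama-server's /completion endpoint should test it; the <|User|>/<|Assistant|> tags are my guess at R1's chat template, so check the one embedded in the GGUF:

```
# Start the assistant turn with an already-closed think block so the model
# skips straight to the answer. Template tags are assumed, not verified.
curl http://localhost:8080/completion -d '{
  "prompt": "<|User|>What is 7 * 6?<|Assistant|><think>\n</think>",
  "n_predict": 128
}'
```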