Turns out 萝莉 already said on X herself that even at this price they are still at the break-even line:
Behind the MiMo API Price Reduction:
The deepest price cut, up to 99%, is for Input (Cache Hit). The core reason is our inference framework now supports hierarchical KV cache optimization for SWA. Production inference engine tests show this optimization increases cached token capacity by 5x, equivalent to an 80% reduction in caching costs. Combined with Cache Read Overlap among multiple Full Attention modules in the Hybrid model, actual costs are further reduced.
Prices for Input (Cache Miss) and Output are also reduced by 60%-80%. This mainly benefits from the extreme 1:7 Full:SWA sparsity ratio brought by the model architecture (the prefill compute of the 70-layer MiMo-V2.5-Pro roughly equals a 10-layer GQA model). This kept our original inference costs well below the industry average, naturally leaving a 2x-3x profit margin in pricing. This price adjustment simply reflects our decision to pass these structural cost efficiencies directly to developers.
Operating at these newly reduced API prices, our production inference engine is running at near full capacity, and we can still essentially break even. We previously advised LLM companies not to "blindly cut prices" precisely because very few model architectures and inference optimizations can keep API costs from running at a loss. If more architectures that save compute and KV cache emerge, along with better inference Infra to drive down API costs, this will form an excellent virtuous cycle in the industry.
More crucially, affordable, high-performance model APIs will drive real, sustained, and at-scale inference demand. This upstream demand pulls forward the development of the entire AI infrastructure chain—including chips, servers, optical transceivers, PCBs, liquid cooling, power, energy storage, and data centers—serving as a strategic fulcrum for a systemic revaluation of AI hardware. In the long run, this injects more affordable and accessible compute into both training and inference pipelines, accelerating the parallel evolution of global AGI across multiple regions and technical routes.
For more technical details, we will release a detailed Blog post later.
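Translation by MiMo:
The reasons behind the MiMo API price cut:
The deepest cut, up to 99%, is on Input (Cache Hit). The core reason is that our inference framework now supports hierarchical KV cache optimization for SWA (sliding window attention). Production inference engine tests show that this optimization raises cached-token capacity 5x, equivalent to an 80% reduction in caching cost. Combined with the cache read overlap mechanism among the multiple full-attention modules in the hybrid model, actual costs drop further.
Input (Cache Miss) and Output prices were also lowered by 60%-80%. This is mainly thanks to the extreme 1:7 full-attention:SWA sparsity ratio that comes from the model architecture (the prefill compute of the 70-layer MiMo-V2.5-Pro is roughly that of a 10-layer GQA model). That kept our original inference cost far below the industry average and naturally left a 2-3x profit margin in the pricing. This adjustment simply reflects our decision to hand those structural cost advantages directly to developers.
Operating at the newly reduced API prices, our production inference engine is running close to full load, and we can still basically break even. We previously advised LLM companies not to "blindly cut prices" precisely because very few model architectures and inference optimizations can keep API costs from running at a loss. If more architectures that save compute and KV cache appear, backed by better inference infrastructure that lowers API costs, this will form an excellent virtuous cycle in the industry.
More importantly, affordable, high-performance model APIs will create real, sustained, at-scale inference demand. That upstream demand will pull along the whole AI infrastructure chain (chips, servers, optical transceiver modules, PCBs, liquid cooling, power, energy storage, and data centers) and become a strategic fulcrum for a systemic revaluation of AI hardware. In the long run, it will inject more affordable and accessible compute into training and inference pipelines, accelerating the parallel evolution of AGI worldwide across different regions and technical routes.
For more technical details, we will publish a detailed blog post later.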
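For anyone who wants to sanity-check the two headline numbers in the post (5x cache capacity = 80% lower caching cost, and 70 layers at 1:7 ≈ a 10-layer model), here is a rough back-of-envelope sketch. It is not from the X post: the function names are made up, and the window size of 128 and the 128K context length are assumptions picked purely for illustration. It also only counts attention cost and ignores the FFN/MoE part, so treat it as an order-of-magnitude check and not as MiMo's actual cost model.

```python
def caching_cost_reduction(capacity_multiplier: float) -> float:
    """Fractional drop in per-token cache cost when the same hardware
    holds `capacity_multiplier` times as many cached tokens."""
    return 1.0 - 1.0 / capacity_multiplier


def equivalent_full_layers(n_layers: int, full_parts: int, swa_parts: int,
                           window: int, context_len: int) -> float:
    """Attention-only prefill cost, in units of one full-attention layer.

    A full-attention layer looks at roughly `context_len` keys per token,
    an SWA layer at no more than `window` keys, so one SWA layer costs about
    min(window, context_len) / context_len of a full layer.
    """
    full_share = full_parts / (full_parts + swa_parts)
    swa_relative_cost = min(window, context_len) / context_len
    return n_layers * (full_share + (1.0 - full_share) * swa_relative_cost)


# 5x capacity, as stated in the post
print(f"{caching_cost_reduction(5):.0%}")

# 70 layers, Full:SWA = 1:7 (from the post); window and context are assumed
print(round(equivalent_full_layers(70, 1, 7, window=128, context_len=131072), 2))
```

Under those assumptions the first line prints 80% and the second comes out at roughly 8.8 full-layer equivalents, which is in the same ballpark as the "roughly a 10-layer GQA model" claim. A larger window or a shorter context pushes the number up.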