CANN/catlass: FP8-to-FP16 Dequantization Tile Operation

张建站

2026/5/30 9:28:03

10 min read

TileCastFp8ToFp16Dequant

[Free download link] catlass: this project is the CANN operator template library. It provides template samples for high-performance matrix multiplication and related fused operators on the NPU. Project URL: https://gitcode.com/cann/catlass

Code location

Functional description

The TileCastFp8ToFp16Dequant template dequantizes FP8 quantized data (stored as int8_t) and converts it to FP16 (half), writing the result directly back to GM. It is typically used in the Prologue stage for the A matrix (weight), so that dequantization finishes before the computation starts.

Pipeline: GM(fp8) → UB → Dequant(fp8→fp16) → UB → GM(fp16)

The template uses double buffering (BUFFER_NUM=2) and 4 EventIds so that the three pipeline stages MTE2, V, and MTE3 run concurrently.

It differs from TileCastInt8ToFp16Dequant in the Dequant implementation (FP8 uses table lookup / bit operations) and in that it also supports the ColumnMajor layout.

Limitation: only the AtlasA2 architecture is supported (CATLASS_ARCH == 2201).

Template prototype

    template <
        class ArchTag,           // Architecture tag; only Arch::AtlasA2
        class SrcType_,          // Source type: Gemm::GemmType<ElementSrc, LayoutSrc>
        class DstType_,          // Destination type: Gemm::GemmType<ElementDst, LayoutDst>
        uint32_t COMPUTE_LENGTH  // Length processed by the Vector engine per pass
    >
    struct TileCastFp8ToFp16Dequant {
        using ElementSrc = typename SrcType_::Element;  // int8_t (FP8)
        using ElementDst = typename DstType_::Element;  // half (FP16)
        using LayoutTagSrc = typename SrcType_::Layout;
        using LayoutTagDst = typename DstType_::Layout;
    };

- COMPUTE_LENGTH: length of a single Dequant computation; it determines how much UB buffer is allocated.
- LayoutSrc/LayoutDst: only RowMajor or ColumnMajor is supported, and the two must be identical.

Construction and destruction

Constructor

    TileCastFp8ToFp16Dequant(Arch::Resource<ArchTag> &resource, Params const &params_);

The constructor allocates double buffers from the UB buffer of Arch::Resource:

- inputBuffer[2] × COMPUTE_LENGTH × 1 byte (FP8 input)
- outputBuffer[2] × COMPUTE_LENGTH × 2 bytes (FP16 output)
- workspace[2] × COMPUTE_LENGTH × 2 bytes (compute workspace)

Params

    struct Params {
        half scalar;     // Dequantization scale
        half zeroPoint;  // Dequantization zero point

        Params() = default;
        Params(half scalar_, half zeroPoint_);
    };

Destructor

None. The UB buffers are managed by Resource.

Call interface

Main interface (FP8 → FP16 Dequant)

    void operator()(
        AscendC::GlobalTensor<ElementDst> gmDst, LayoutDst const &layoutDst,  // GM destination (FP16)
        AscendC::GlobalTensor<ElementSrc> gmSrc, LayoutSrc const &layoutSrc,  // GM source (FP8)
        uint32_t &bufferIndex                                                 // Double-buffer index (in/out)
    );

Epilogue helper interface (FP32 → FP16 Cast)

    void EpCastFp32ToFp16(
        AscendC::GlobalTensor<half> gmDst, LayoutRowMajor layoutDst,
        AscendC::GlobalTensor<float> gmSrc, LayoutRowMajor layoutSrc
    );

This interface is used in the Epilogue stage to cast the float accumulation result to half. It supports RowMajor only.

Usage example

    #include "catlass/gemm/tile/cast_fp8_to_fp16.hpp"

    using namespace Catlass::Gemm::Tile;

    using ElementSrc = int8_t;  // FP8 payload stored as int8_t
    using ElementDst = half;    // FP16 result
    using SrcType = Gemm::GemmType<ElementSrc, layout::RowMajor>;
    using DstType = Gemm::GemmType<ElementDst, layout::RowMajor>;

    constexpr uint32_t COMPUTE_LENGTH = 16 * 1024;
    const int M = 256;
    const int K = 4096;

    // Source and destination must use the same layout.
    auto layoutSrc = layout::RowMajor::MakeLayout<ElementSrc>(M, K);
    auto layoutDst = layout::RowMajor::MakeLayout<ElementDst>(M, K);

    AscendC::GlobalTensor<ElementSrc> gmSrc;
    AscendC::GlobalTensor<ElementDst> gmDst;

    Arch::Resource<Arch::AtlasA2> resource;
    // scale = 0.5, zero point = 0.0
    TileCastFp8ToFp16Dequant<Arch::AtlasA2, SrcType, DstType, COMPUTE_LENGTH>::Params params(0.5, 0.0);
    TileCastFp8ToFp16Dequant<Arch::AtlasA2, SrcType, DstType, COMPUTE_LENGTH> castOp(resource, params);

    uint32_t bufferIndex = 0;  // Double-buffer index, passed as in/out
    castOp(gmDst, layoutDst, gmSrc, layoutSrc, bufferIndex);

Authoring statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
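As a rough check on the constructor's buffer allocation, the UB footprint follows directly from the three double-buffered regions listed for TileCastFp8ToFp16Dequant: 2 × COMPUTE_LENGTH × (1 + 2 + 2) bytes. The sketch below is plain host-side C++, not catlass code. The helper names UbBytesRequired and ChunkCount are hypothetical, the 192 KiB UB capacity is an assumption to verify against your device's specification, and the chunk count assumes the whole M × K matrix is walked in COMPUTE_LENGTH-sized pieces by a single core, which may not match how the real template splits the work.

```cpp
#include <cstdint>

// Values taken from the article: double buffering and the per-element sizes
// of the three UB regions (FP8 input, FP16 output, compute workspace).
constexpr uint32_t BUFFER_NUM = 2;
constexpr uint64_t INPUT_BYTES_PER_ELEM = 1;      // FP8 stored as int8_t
constexpr uint64_t OUTPUT_BYTES_PER_ELEM = 2;     // FP16 (half)
constexpr uint64_t WORKSPACE_BYTES_PER_ELEM = 2;  // as listed for workspace

// Hypothetical helper (not a catlass API): total UB bytes for one tile object.
constexpr uint64_t UbBytesRequired(uint32_t computeLength)
{
    return static_cast<uint64_t>(BUFFER_NUM) * computeLength *
           (INPUT_BYTES_PER_ELEM + OUTPUT_BYTES_PER_ELEM + WORKSPACE_BYTES_PER_ELEM);
}

// Hypothetical helper: number of COMPUTE_LENGTH-sized chunks needed to cover
// `elements` values (ceiling division).
constexpr uint64_t ChunkCount(uint64_t elements, uint32_t computeLength)
{
    return (elements + computeLength - 1) / computeLength;
}

// ASSUMPTION: 192 KiB of UB per vector core; check the hardware documentation.
constexpr uint64_t ASSUMED_UB_CAPACITY = 192 * 1024;

// Values from the usage example in the article.
constexpr uint32_t COMPUTE_LENGTH = 16 * 1024;
constexpr uint64_t M = 256;
constexpr uint64_t K = 4096;

// 2 * 16384 * 5 bytes = 160 KiB, which fits the assumed capacity.
static_assert(UbBytesRequired(COMPUTE_LENGTH) == 160 * 1024,
              "unexpected UB footprint");
static_assert(UbBytesRequired(COMPUTE_LENGTH) <= ASSUMED_UB_CAPACITY,
              "COMPUTE_LENGTH does not fit the assumed UB capacity");

// 256 * 4096 elements / 16384 per chunk = 64 chunks under the single-core assumption.
static_assert(ChunkCount(M * K, COMPUTE_LENGTH) == 64, "unexpected chunk count");
```

Under these assumptions, COMPUTE_LENGTH = 16 * 1024 uses 160 KiB and leaves little room for other UB consumers in the same kernel, so a smaller COMPUTE_LENGTH may be needed when the tile shares UB with other stages.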
