CANN/cannbot-skills Flash Attention Optimization Notes
Deep Note: `agent/example/kernels/a2/flash_attn_full_pj_hif8_commonub.py`

[Free download link] cannbot-skills: CANNBot is a series of agents for improving efficiency in CANN development, and this repository provides reusable Skills modules for them. Project address: https://gitcode.com/cann/cannbot-skills

Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.

Open this file only after the short catalog entry has confirmed the kernel is relevant.

What this kernel is really for

- comparing against `flash_attn_full_pj_hif8.py` after the math contract is already understood
- studying how a shared vec-side slot buffer changes queueing structure without changing the visible formula

Decisions worth copying

- move vec scratch from two plain `Tensor` views onto one shared `DBuff` family: `ub_score_pv`, `score_pv_cnt`
- keep `stage1_cnt` and `stage2_cnt` separate even though the shared scratch family exists
- treat the gain as a same-side vec `ubin` queueing improvement, not as a new cross-side ownership model
- do not expect a UB-footprint reduction here; the point is cleaner overlap between the next preload and the current vec compute

Prefer another kernel when

- you are still deriving the math contract and want the simpler, more readable baseline
- you are debugging row-max / row-sum correctness and do not want shared vec scratch lineage in the picture yet
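The buffer decisions above can be illustrated with a small plain-Python model. This is only a sketch under stated assumptions: `SharedSlots`, `SLOT_ELEMS`, `simulate`, and the strict stage 1 / stage 2 alternation are hypothetical stand-ins, not the real `DBuff` API or the kernel's actual schedule. It reuses the names `ub_score_pv`, `score_pv_cnt`, `stage1_cnt`, and `stage2_cnt` only to show which counter would index which resource.

```python
from dataclasses import dataclass, field

SLOT_ELEMS = 8  # hypothetical scratch size per slot, in elements


@dataclass
class SharedSlots:
    """Plain-Python stand-in for a two-slot shared scratch family.

    This is NOT the real DBuff API; it only models slot rotation.
    """
    depth: int = 2
    cnt: int = 0  # plays the role of score_pv_cnt
    slots: list = field(init=False)

    def __post_init__(self):
        self.slots = [[0.0] * SLOT_ELEMS for _ in range(self.depth)]

    def next_slot(self):
        """Return the slot index for the next vec step, then advance the counter."""
        idx = self.cnt % self.depth
        self.cnt += 1
        return idx

    def footprint(self):
        """Total scratch elements held by the family."""
        return sum(len(s) for s in self.slots)


def simulate(num_tiles):
    """Assumed schedule: each tile runs one stage-1 step, then one stage-2 step."""
    ub_score_pv = SharedSlots()
    stage1_cnt = 0  # stays separate: counts stage-1 work only
    stage2_cnt = 0  # stays separate: counts stage-2 work only
    trace = []
    for _ in range(num_tiles):
        trace.append(("stage1", stage1_cnt, ub_score_pv.next_slot()))
        stage1_cnt += 1
        trace.append(("stage2", stage2_cnt, ub_score_pv.next_slot()))
        stage2_cnt += 1
    return trace, stage1_cnt, stage2_cnt, ub_score_pv


trace, stage1_cnt, stage2_cnt, ub_score_pv = simulate(3)
two_plain_views = 2 * SLOT_ELEMS  # footprint of the two-plain-view layout

for step in trace:
    print(step)
```

In this model the shared counter advances on every vec step while each stage counter advances only for its own stage, so back-to-back steps always land in different slots. The family still holds two slots, which matches the note that the benefit is overlap rather than a smaller footprint.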