Lecture 12 Flash Attention - Detailed Analysis & Overview
In this video, I'll be deriving and coding FlashAttention ...
FlashAttention is an IO-aware algorithm for computing ...
Episode 67 of the Stanford MLSys Seminar “Foundation Models Limited Series”! Speaker: Tri Dao. Abstract: Transformers are slow ...
Several LLMs have used long context: GPT-4 (32k), MosaicML's MPT (65k), Anthropic's Claude (100k). But ...
Speaker: Charles Frye. From the Modal team: ...
ML Performance Reading Group Session 24 meeting recording. Paper: ...