FlashAttention is "an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM", allowing longer context windows in Transformers.
Read more about it here; I still don't get it.
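Here's a rough NumPy sketch of the tiling idea as I understand it, not the actual kernel: the real FlashAttention also tiles over queries and runs the blocks in on-chip SRAM, and the function name, shapes, and block size below are just made up for illustration. The point it tries to show is that keys/values can be processed block by block with an online softmax, so the full N×N score matrix never has to be materialized.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Single-head attention computed over key/value blocks with an
    online softmax, so the full N x N score matrix is never stored.
    Illustrative sketch only, not the real FlashAttention kernel."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)

    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per query row

    for start in range(0, n, block_size):
        Kb = K[start:start + block_size]   # one block of keys
        Vb = V[start:start + block_size]   # matching block of values

        S = (Q @ Kb.T) * scale             # scores for this block only
        block_max = S.max(axis=1)
        new_max = np.maximum(row_max, block_max)

        # Rescale what we've accumulated so far to the new running max,
        # then fold in this block's contribution (the online-softmax trick).
        correction = np.exp(row_max - new_max)
        P = np.exp(S - new_max[:, None])

        out = out * correction[:, None] + P @ Vb
        row_sum = row_sum * correction + P.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))

    # Reference: ordinary attention that materializes the full score matrix.
    S = (Q @ K.T) / np.sqrt(32)
    ref = np.exp(S - S.max(axis=1, keepdims=True))
    ref = (ref / ref.sum(axis=1, keepdims=True)) @ V

    assert np.allclose(tiled_attention(Q, K, V), ref)
```

The result matches plain attention exactly (it's an exact algorithm, not an approximation); the savings come from never writing the full score matrix out to HBM.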