发表于 2024-03-26 更新于 2024-06-04 分类于并行处理阅读次数：本文字数： 748 阅读时长 ≈ 1 分钟

# Memory bound algorithms

获取数据有时比计算的开销更大：

带宽和时延

什么样的运输受限于访存

以稠密矩阵的计算为例

通过计算\(B_c\)可以看出优化的潜力，对于本来就memory bound的类型，可优化的空间就比较小

Organization of processors, caches, and memory

多核：每个核有自己的L1,L2共享一个L3

一个node有多个 sockets

Other ways to get more bandwidth

为什么顺序访问和随机访问在L1的时候性能是一样的。因为比较小的数据一开始warm up所有的元素都可以在L1上可以找到

image-20240402144304188

image-20240402150337148

image-20240402153634304

不write alloc

image-20240409143043063

image-20240409143703463

image-20240409145054028

image-20240409150006161

image-20240409153020704

image-20240604000011031

image-20240604002823280

image-20240416150049265

image-20240416163022123

image-20240603160342812

image-20240603192339780

image-20240603194843237

image-20240604092834160

image-20240604093159626

image-20240604094252125

image-20240604095524207

image-20240604101637559

0%