SIMD stands for Single Instruction, Multiple Data.

With SIMD, the CPU can operate on multiple pieces of data in parallel using the same instruction, which is what people mean by vectorized operations. Here is a quick breakdown.

Say we have two arrays:

A = [1, 2, 3, 4]
B = [5, 6, 7, 8]

With SIMD, the CPU can perform all four additions with a single instruction, provided the vector register is wide enough to hold four integers.
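
Element-wise, the result is A + B = [1+5, 2+6, 3+7, 4+8] = [6, 8, 10, 12], computed with one vector addition instead of four scalar ones.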

SIMD requires hardware support: the CPU provides dedicated vector registers and execution units that perform the parallel operations.
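
Which extensions (SSE, AVX, AVX2, AVX-512) are available depends on the CPU. As a minimal sketch, assuming GCC or Clang on x86, you can check this at run time with the compiler builtin __builtin_cpu_supports:

#include <stdio.h>

int main(void) {
    // GCC/Clang builtin that queries the CPU for the named feature.
    if (__builtin_cpu_supports("avx2"))
        printf("AVX2 is available on this CPU\n");
    else
        printf("No AVX2; fall back to scalar code\n");
    return 0;
}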

Modern compilers can automatically detect simple patterns like the loop below and apply SIMD vectorization when it is safe to do so:

for (int i = 0; i < N; ++i) {
    C[i] = A[i] + B[i];
}
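
As a rough sketch of what helps the compiler here, the version below marks the arrays with restrict so the compiler knows they do not overlap (the function name and signature are just for illustration); with GCC, a flag like -fopt-info-vec reports which loops were vectorized:

// Compile with e.g. gcc -O3 -march=native -fopt-info-vec add.c
void add_arrays(const float *restrict A, const float *restrict B,
                float *restrict C, int N) {
    // Simple element-wise loop: a prime candidate for auto-vectorization.
    for (int i = 0; i < N; ++i)
        C[i] = A[i] + B[i];
}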

SIMD registers can be 64, 128, 256, or 512 bits wide. These are split into lanes; with single-precision floats each lane is 32 bits, so a 256-bit register holds eight floats.
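
On x86, these widths correspond to MMX (64-bit), SSE (128-bit XMM registers), AVX/AVX2 (256-bit YMM), and AVX-512 (512-bit ZMM).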

For example:

#include <immintrin.h>  // AVX intrinsics

__m256 a = _mm256_loadu_ps(A);   // load 8 floats from A (unaligned load)
__m256 b = _mm256_loadu_ps(B);   // load 8 floats from B
__m256 c = _mm256_add_ps(a, b);  // all 8 additions in a single instruction

Here A and B are float arrays with at least 8 elements each; the loads fill two 256-bit registers with eight 32-bit lanes, and the add performs all eight additions in parallel.
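
Putting it together, here is a minimal self-contained sketch (the array contents are made up for illustration); it assumes an x86 CPU with AVX and compiling with a flag such as -mavx or -march=native:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float A[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float B[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float C[8];

    __m256 a = _mm256_loadu_ps(A);   // load 8 floats from A
    __m256 b = _mm256_loadu_ps(B);   // load 8 floats from B
    __m256 c = _mm256_add_ps(a, b);  // 8 additions in one instruction
    _mm256_storeu_ps(C, c);          // write the 8 results back to C

    for (int i = 0; i < 8; ++i)
        printf("%.0f ", C[i]);       // prints: 11 22 33 44 55 66 77 88
    printf("\n");
    return 0;
}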

Real-World Use

  • Libraries such as Eigen, OpenCV, TensorFlow, and PyTorch use SIMD under the hood for fast math.
  • GPUs apply a similar idea at a much larger scale with SIMT (Single Instruction, Multiple Threads).
