All corrections
X March 3, 2026 at 09:47 PM

x.com/brianroemmele/status/2028524908779802736?s=48

3 corrections found

1
Claim
No CoreML, no Metal, no GPU. Pure, blazing ANE silicon.
Correction

The referenced ANE project is built and linked against Apple’s CoreML framework, and parts of its training loop run on the CPU, so it is neither “No CoreML” nor “Pure” ANE-only execution.

Full reasoning

The post points to the GitHub repo maderix/ANE. In the repo’s own build instructions, the main training binary is linked against CoreML ("-framework CoreML"). That directly contradicts the claim “No CoreML”.

Additionally, the repo’s README explicitly documents that multiple parts of the training step run on the CPU, which contradicts “Pure, blazing ANE silicon” (i.e., not exclusively ANE execution).

Even if the author’s intent was “no CoreML training APIs,” the post’s wording (“No CoreML”) is broader and is contradicted by the project’s documented build/runtime requirements.

2 sources
2
Claim
full neural network training– including backpropagation – directly on the Apple Neural Engine (ANE).
Correction

The repo itself says weight-gradient (dW) computation and several other training operations run on the CPU, so training is not fully “directly on the ANE.”

Full reasoning

The maderix/ANE README documents that while it runs forward and certain backward passes (dx) on ANE, weight gradients (dW) are computed on the CPU, and other training-step components also fall back to CPU.

That means the overall training loop is not fully executed “directly on the Apple Neural Engine” as claimed in the post: substantial parts of backprop/training are explicitly performed on CPU.

1 source
  • GitHub - maderix/ANE (README)

    README: “All forward and backward dx passes on ANE, dW gradients on CPU (Accelerate cblas)” and “CPU handles: RMSNorm backward... dW gradient accumulation... Adam optimizer updates.”
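The forward/backward split the README describes can be illustrated with a minimal sketch. This is not the repo's code; shapes and names are hypothetical, and NumPy stands in for both devices. The point is that the dW computation the README assigns to the CPU (via an Accelerate cblas GEMM) is an ordinary matrix multiply over the cached layer input and the output gradient:

```python
import numpy as np

# Hypothetical shapes for one linear layer: batch B, in-features I, out-features O.
B, I, O = 4, 8, 3
rng = np.random.default_rng(0)
x = rng.standard_normal((B, I)).astype(np.float32)   # layer input, cached from forward
dy = rng.standard_normal((B, O)).astype(np.float32)  # gradient w.r.t. layer output
W = rng.standard_normal((I, O)).astype(np.float32)   # layer weights

# Backward dx pass (the part the README says runs on the ANE): dx = dy @ W^T
dx = dy @ W.T

# Weight gradient (the part the README says runs on the CPU as a cblas GEMM):
# dW = x^T @ dy -- a plain matrix multiply, here done with NumPy for illustration.
dW = x.T @ dy

print(dx.shape, dW.shape)  # (4, 8) (8, 3)
```

Because dW is where the weights are actually updated, computing it on the CPU means the training step as a whole is split across devices, which is the basis of this correction.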

3
Claim
M4 ANE hits roughly 6.6 TFLOPS per watt – 80 times more efficient than an NVIDIA A100.
Correction

NVIDIA’s own A100 specs imply ~0.78–1.04 TFLOPS/W (FP16 Tensor Core), so 6.6 TFLOPS/W is about 6–8× higher—not 80×.

Full reasoning

NVIDIA’s official A100 specifications list FP16 Tensor Core performance of 312 TFLOPS and TDP of 300W (PCIe) or 400W (SXM).

From those official numbers:

  • A100 80GB PCIe: 312 TFLOPS / 300W ≈ 1.04 TFLOPS/W
  • A100 80GB SXM: 312 TFLOPS / 400W ≈ 0.78 TFLOPS/W

If the post’s “6.6 TFLOPS per watt” were accepted, that would be roughly 6.3× (vs 1.04) to 8.5× (vs 0.78) — not 80×. Therefore the “80 times” efficiency comparison is contradicted by NVIDIA’s own published A100 performance and power specs.
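The arithmetic above is simple enough to check directly. A short sketch using only NVIDIA's published spec-sheet numbers (312 FP16 Tensor Core TFLOPS; 300W PCIe / 400W SXM TDP) and the post's claimed 6.6 TFLOPS/W figure:

```python
# NVIDIA's published A100 FP16 Tensor Core specs.
a100_tflops = 312.0
pcie_watts, sxm_watts = 300.0, 400.0

pcie_eff = a100_tflops / pcie_watts  # ~1.04 TFLOPS/W
sxm_eff = a100_tflops / sxm_watts    # ~0.78 TFLOPS/W

claimed_ane_eff = 6.6  # TFLOPS/W claimed in the post for the M4 ANE

print(round(claimed_ane_eff / pcie_eff, 1))  # 6.3 (x vs PCIe)
print(round(claimed_ane_eff / sxm_eff, 1))   # 8.5 (x vs SXM)
```

Either way the ratio lands in the single digits, an order of magnitude short of the claimed 80×.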

1 source
Model: OPENAI_GPT_5 Prompt: v1.6.0