x.com/brianroemmele/status/2028524908779802736?s=48
3 corrections found
No CoreML, no Metal, no GPU. Pure, blazing ANE silicon.
The referenced ANE project is built and linked against Apple's CoreML framework and also uses CPU code for parts of training, so it is neither "No CoreML" nor "Pure" ANE-only execution.
Full reasoning
The post points to the GitHub repo maderix/ANE. In the repo’s own build instructions, the main training binary is compiled while linking against CoreML ("-framework CoreML"). That directly contradicts the claim “No CoreML”.
Additionally, the repo’s README explicitly documents that multiple parts of the training step run on the CPU, which contradicts “Pure, blazing ANE silicon” (i.e., not exclusively ANE execution).
Even if the author’s intent was “no CoreML training APIs,” the post’s wording (“No CoreML”) is broader and is contradicted by the project’s documented build/runtime requirements.
2 sources
- GitHub - maderix/ANE: Training neural networks on Apple Neural Engine via reverse-engineered private APIs
Building: “xcrun clang ... -framework IOSurface -framework CoreML -framework Accelerate ...” (repo README build command).
- GitHub - maderix/ANE (README)
README: “CPU handles: RMSNorm backward, residual connections, loss computation, dW gradient accumulation (cblas_sgemm), Adam optimizer updates.”
full neural network training – including backpropagation – directly on the Apple Neural Engine (ANE).
The repo itself says weight-gradient (dW) computation and several other training operations run on the CPU, so training is not fully “directly on the ANE.”
Full reasoning
The maderix/ANE README documents that while it runs forward and certain backward passes (dx) on ANE, weight gradients (dW) are computed on the CPU, and other training-step components also fall back to CPU.
That means the overall training loop is not fully executed “directly on the Apple Neural Engine” as claimed in the post: substantial parts of backprop/training are explicitly performed on CPU.
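The device split the README describes can be sketched as a minimal training step. This is an illustrative sketch only, assuming a single linear layer with an SGD update in place of Adam; the `ane_*` functions are hypothetical stand-ins (plain NumPy matmuls), not real APIs from the maderix/ANE repo.

```python
import numpy as np

def ane_forward(x, W):
    # Stand-in for the ANE forward pass (here just a CPU matmul).
    return x @ W

def ane_backward_dx(dy, W):
    # Stand-in for the ANE backward pass w.r.t. activations (dx).
    return dy @ W.T

def train_step(x, y, W, lr=1e-3):
    out = ane_forward(x, W)        # "ANE": forward pass
    dy = out - y                   # CPU: loss gradient (loss computation is on CPU per the README)
    _dx = ane_backward_dx(dy, W)   # "ANE": dx pass (would feed earlier layers)
    dW = x.T @ dy                  # CPU: dW accumulation (cblas_sgemm in the repo)
    return W - lr * dW             # CPU: optimizer update (Adam in the repo; plain SGD here)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 2))
y = x @ W
W2 = train_step(x, y + 0.1, W)
```

Even in this toy form, the point of the correction is visible: the weight update itself (`dW` and the optimizer step) happens on the CPU, so the loop is not executed entirely on the ANE.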
1 source
- GitHub - maderix/ANE (README)
README: “All forward and backward dx passes on ANE, dW gradients on CPU (Accelerate cblas)” and “CPU handles: RMSNorm backward... dW gradient accumulation... Adam optimizer updates.”
M4 ANE hits roughly 6.6 TFLOPS per watt – 80 times more efficient than an NVIDIA A100.
NVIDIA's own A100 specs imply ~0.78–1.04 TFLOPS/W (FP16 Tensor Core), so 6.6 TFLOPS/W is about 6–8× higher, not 80×.
Full reasoning
NVIDIA’s official A100 specifications list FP16 Tensor Core performance of 312 TFLOPS and TDP of 300W (PCIe) or 400W (SXM).
From those official numbers:
- A100 80GB PCIe: 312 TFLOPS / 300W ≈ 1.04 TFLOPS/W
- A100 80GB SXM: 312 TFLOPS / 400W ≈ 0.78 TFLOPS/W
If the post's "6.6 TFLOPS per watt" figure were accepted at face value, the advantage would be roughly 6.3× (vs. 1.04 TFLOPS/W) to 8.5× (vs. 0.78 TFLOPS/W), not 80×. The "80 times" efficiency comparison is therefore contradicted by NVIDIA's own published A100 performance and power specs.
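The arithmetic above can be reproduced directly from the two published A100 numbers. The 6.6 TFLOPS/W figure is the post's claim, taken at face value for the comparison:

```python
# Recompute the efficiency comparison from NVIDIA's published A100 numbers.
a100_fp16_tflops = 312.0           # FP16 Tensor Core, per NVIDIA's spec sheet
claimed_ane_tflops_per_watt = 6.6  # figure asserted in the post (unverified here)

for name, tdp_w in [("PCIe", 300.0), ("SXM", 400.0)]:
    a100_eff = a100_fp16_tflops / tdp_w
    ratio = claimed_ane_tflops_per_watt / a100_eff
    print(f"A100 {name}: {a100_eff:.2f} TFLOPS/W -> claimed ANE advantage ~{ratio:.1f}x")
# A100 PCIe: 1.04 TFLOPS/W -> claimed ANE advantage ~6.3x
# A100 SXM: 0.78 TFLOPS/W -> claimed ANE advantage ~8.5x
```

Either way the gap is under 10×, an order of magnitude short of the post's "80 times."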
1 source
- NVIDIA A100 | NVIDIA (Specifications)
Specs show “FP16 Tensor Core 312 TFLOPS” and “Max Thermal Design Power (TDP) 300W (PCIe) / 400W (SXM)”.