Parallel Scaling Law Visualization

$$ \text{Loss}=E+\left( \frac{A}{\text{Parameters}\times (1+k\log P)} \right)^{\alpha} $$


Select our pre-fitted parameters for two datasets

Loss for a 2.8B model when P=4 is: 1.01548. It is equivalant to:

  • A 4.33B model with P=1;
  • A 3.4B model with P=2;
  • A 2.8B model with P=4;
  • A 2.38B model with P=8;

Note: The equivalent parameters are for reference only. In some reasoning tasks, scaling the parallel streams will obtain more performance gains than the loss benefits!

Enjoy it! 😊