
The loss for a 2.8B model with P=4 is 1.01548. This is equivalent to:

A 4.33B model with P=1;
A 3.4B model with P=2;
A 2.8B model with P=4;
A 2.38B model with P=8;
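
The figures above are consistent with an effective parameter count of the form N · (k · ln P + 1). The sketch below assumes that form (it is not stated in this text) and backs out k from the listed P=1 and P=4 values rather than using any published fit. The names `N_BASE`, `N_AT_P1`, and `equivalent_params` are hypothetical, chosen for illustration.

```python
import math

# Values taken from the list above (billions of parameters).
N_BASE = 2.8    # model size at P_BASE parallel streams
P_BASE = 4
N_AT_P1 = 4.33  # equivalent model size with a single stream

# Assumption: effective parameters = N * (k * ln(P) + 1).
# k is estimated from the two figures above, not taken from a published fit.
k = (N_AT_P1 / N_BASE - 1) / math.log(P_BASE)


def equivalent_params(p: int) -> float:
    """Model size (billions) that matches the same loss with p parallel streams."""
    return N_AT_P1 / (k * math.log(p) + 1)


if __name__ == "__main__":
    for p in (1, 2, 4, 8):
        print(f"P={p}: {equivalent_params(p):.2f}B")
```

Under this assumption, the values for P=2 and P=8 land close to the 3.4B and 2.38B listed above, which suggests the list follows a logarithmic dependence on P.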

Note: The equivalent parameter counts are for reference only. On some reasoning tasks, scaling the number of parallel streams yields larger performance gains than the loss improvement alone would suggest!

Enjoy it! 😊

Parallel Scaling Law Visualization