The loss for a 2.8B model with P=4 is 1.01548. This is equivalent to:
- A 4.33B model with P=1;
- A 3.4B model with P=2;
- A 2.8B model with P=4;
- A 2.38B model with P=8.
Note: The equivalent parameter counts are for reference only. On some reasoning tasks, scaling the number of parallel streams yields larger performance gains than the loss improvement alone would suggest!
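
The equivalences listed above can be checked with a few lines of Python. This is only a sketch under two assumptions that the text does not state: that a model with N parameters and P parallel streams behaves like a model with N * (k * ln(P) + 1) parameters, and that `k` can be back-solved from the listed pair (2.8B at P=4, 4.33B at P=1). The function name `equivalent_params` and the value of `k` are illustrative, not official constants.

```python
import math


def equivalent_params(n_params, p_from, p_to, k):
    """Parameter count at p_to streams that matches n_params at p_from streams.

    Assumes the effective size is N * (k * ln(P) + 1); this form is an
    assumption of the sketch, not something stated in the text above.
    """
    return n_params * (k * math.log(p_from) + 1) / (k * math.log(p_to) + 1)


# k estimated from the listed pair: 2.8B at P=4 is equivalent to 4.33B at P=1.
k = (4.33 / 2.8 - 1) / math.log(4)

for p in (1, 2, 4, 8):
    print(f"P={p}: {equivalent_params(2.8, 4, p, k):.2f}B")
```

With `k` estimated this way, the P=2 and P=8 rows come out within rounding of the listed 3.4B and 2.38B, which suggests the list is consistent with the assumed logarithmic form.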
Enjoy it! 😊