This is a simple comparison between FSRS and SM-17. `FSRS-v-SM16-v-SM17.ipynb` is the notebook for the comparison.
Due to the differences between the workflows of SuperMemo and Anki, it is not easy to compare the two algorithms. I tried to make the comparison as fair as possible. Here are some notes:
- The first interval in SuperMemo is the duration between creating the card and the first review. In Anki, the first interval is the duration between the first review and the second review. So I removed the first record of each card in SM-17 data.
- There are six grades in SuperMemo, but only four in Anki. So I merged grades 0, 1, and 2 in SuperMemo into 1 in Anki, and mapped 3, 4, and 5 in SuperMemo to 2, 3, and 4 in Anki (see the sketch after these notes).
- I use the `R (SM17)(exp)` value recorded in `sm18/systems/{collection_name}/stats/SM16-v-SM17.csv` as the prediction of SM-17. Reference: Confusion among R(SM16), R(SM17)(exp), R(SM17), R est. and expFI.
- To ensure FSRS has the same information as SM-17, I implemented an online-learning version of FSRS, where FSRS has zero knowledge of future reviews, just as SM-17 does.
- The results are based on data from a small group of people and may differ from the results of other SuperMemo users.
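
Concretely, the preprocessing described in the first two notes can be sketched as follows. This is a rough illustration, not the notebook's exact code; the column names `card_id`, `review_time`, and `grade` are assumptions.

```python
import pandas as pd

# Grade mapping described above: SuperMemo 0/1/2 -> Anki 1 ("Again"),
# SuperMemo 3/4/5 -> Anki 2/3/4 ("Hard"/"Good"/"Easy").
SM_TO_ANKI = {0: 1, 1: 1, 2: 1, 3: 2, 4: 3, 5: 4}

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Column names ("card_id", "review_time", "grade") are assumptions,
    # not necessarily the notebook's actual schema.
    df = df.sort_values(["card_id", "review_time"])
    # Drop the first record of each card, since in SuperMemo it measures
    # the time from card creation to the first review.
    df = df[df.groupby("card_id").cumcount() > 0]
    # Map SuperMemo's six grades onto Anki's four.
    df = df.assign(anki_grade=df["grade"].map(SM_TO_ANKI))
    return df
```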
We use three metrics in the SRS benchmark to evaluate how well these algorithms work: Log Loss, AUC, and a custom RMSE that we call RMSE (bins).
- Log Loss (also known as Binary Cross Entropy): used primarily in binary classification problems, Log Loss serves as a measure of the discrepancies between predicted probabilities of recall and review outcomes (1 or 0). It quantifies how well the algorithm approximates the true recall probabilities. Log Loss ranges from 0 to infinity, lower is better.
- Root Mean Square Error in Bins (RMSE (bins)): this is a metric designed for use in the SRS benchmark. In this approach, predictions and review outcomes are grouped into bins based on three features: the interval length, the number of reviews, and the number of lapses. Within each bin, the squared difference between the average predicted probability of recall and the average recall rate is calculated. These squared differences are weighted by the sample size of each bin, and the final weighted root mean square error is calculated. This metric provides a nuanced understanding of algorithm performance across different probability ranges; a rough sketch of the calculation is shown below. For more details, you can read The Metric. RMSE (bins) ranges from 0 to 1, lower is better.
- AUC (Area under the ROC Curve): this metric tells us how much the algorithm is capable of distinguishing between classes. AUC ranges from 0 to 1, however, in practice it's almost always greater than 0.5; higher is better.
Log Loss and RMSE (bins) measure calibration: how well predicted probabilities of recall match the real data. AUC measures discrimination: how well the algorithm can tell two (or more, generally speaking) classes apart. AUC can be good (high) even if Log Loss and RMSE are poor.
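
As a rough illustration (not the benchmark's exact implementation), the three metrics can be computed along these lines. Here `y` holds review outcomes (1 = recalled, 0 = forgotten), `p` holds predicted probabilities of recall, and the bin identifiers are a simplified placeholder for the real binning by interval length, review count, and lapse count.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

def rmse_bins(y, p, bin_ids):
    """Weighted RMSE between the mean prediction and the mean recall rate per bin."""
    y, p, bin_ids = map(np.asarray, (y, p, bin_ids))
    sq_errors, weights = [], []
    for b in np.unique(bin_ids):
        mask = bin_ids == b
        sq_errors.append((p[mask].mean() - y[mask].mean()) ** 2)
        weights.append(mask.sum())  # weight each bin by its sample size
    return float(np.sqrt(np.average(sq_errors, weights=weights)))

y = np.array([1, 0, 1, 1, 0, 1])               # review outcomes
p = np.array([0.9, 0.6, 0.8, 0.7, 0.3, 0.95])  # predicted recall probabilities
bins = np.array([0, 0, 1, 1, 2, 2])            # placeholder bin ids

print(log_loss(y, p))         # lower is better
print(rmse_bins(y, p, bins))  # lower is better
print(roc_auc_score(y, p))    # higher is better
```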
Total users: 16
Total repetitions: 194,281
The following tables present the means and the 99% confidence intervals. The best result is highlighted in bold. Arrows indicate whether lower (↓) or higher (↑) values are better.
Weighted by the number of reviews:

| Algorithm | Log Loss↓ | RMSE (bins)↓ | AUC↑ |
| --- | --- | --- | --- |
| FSRS-6 | **0.36±0.077** | **0.05±0.012** | **0.68±0.056** |
| FSRS-5 | 0.37±0.084 | 0.06±0.022 | 0.68±0.061 |
| FSRS-4.5 | 0.37±0.088 | 0.06±0.023 | 0.68±0.060 |
| FSRSv4 | 0.38±0.088 | 0.06±0.024 | 0.67±0.060 |
| FSRSv3 | 0.40±0.091 | 0.08±0.020 | 0.65±0.049 |
| SM-17 | 0.41±0.098 | 0.08±0.020 | 0.62±0.038 |
| SM-16 | 0.42±0.087 | 0.11±0.026 | 0.60±0.020 |
Unweighted:

| Algorithm | Log Loss↓ | RMSE (bins)↓ | AUC↑ |
| --- | --- | --- | --- |
| FSRS-6 | **0.41±0.069** | **0.08±0.027** | **0.65±0.047** |
| FSRS-5 | 0.43±0.076 | 0.10±0.037 | 0.64±0.048 |
| FSRS-4.5 | 0.43±0.091 | 0.10±0.036 | 0.64±0.047 |
| FSRSv4 | 0.45±0.086 | 0.11±0.049 | 0.63±0.049 |
| FSRSv3 | 0.5±0.11 | 0.12±0.039 | 0.62±0.043 |
| SM-17 | 0.5±0.11 | 0.10±0.033 | 0.63±0.035 |
| SM-16 | 0.5±0.11 | 0.12±0.034 | 0.61±0.024 |
Averages weighted by the number of reviews are more representative of "best case" performance when plenty of data is available. Since almost all algorithms perform better when there's a lot of data to learn from, weighting by n(reviews) biases the average towards lower values.
Unweighted averages are more representative of "average case" performance. In reality, not every user will have hundreds of thousands of reviews, so the algorithm won't always be able to reach its full potential.
The metrics presented above can be difficult to interpret. In order to make it easier to understand how algorithms perform relative to each other, the image below shows the percentage of users for whom algorithm A (row) has a lower Log Loss than algorithm B (column). For example, FSRS-6 has a 75% superiority over SM-17, meaning that for 75% of all collections in this benchmark, FSRS-6 can estimate the probability of recall more accurately.
This table is based on 16 collections.
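
As an illustration, the superiority percentage for a pair of algorithms can be computed roughly like this. The variable names are hypothetical; `log_losses` maps each algorithm name to its per-collection Log Loss values.

```python
import numpy as np

def superiority(log_losses: dict, a: str, b: str) -> float:
    """Percentage of collections where algorithm `a` has a lower Log Loss than `b`."""
    return float(np.mean(np.asarray(log_losses[a]) < np.asarray(log_losses[b])) * 100)

# e.g. superiority(log_losses, "FSRS-6", "SM-17") == 75.0
# would mean FSRS-6 beats SM-17 on 12 of the 16 collections.
```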
The figures below show effect sizes comparing the Log Loss between all pairs of algorithms, using r-values from the Wilcoxon signed-rank test:
The colors indicate:
- Red shades indicate that the row algorithm performs worse than the column algorithm:
  - Dark red: large effect (r > 0.5)
  - Red: medium effect (0.5 ≥ r > 0.2)
  - Light red: small effect (r ≤ 0.2)
- Green shades indicate that the row algorithm performs better than the column algorithm:
  - Dark green: large effect (r > 0.5)
  - Green: medium effect (0.5 ≥ r > 0.2)
  - Light green: small effect (r ≤ 0.2)
- Grey indicates that the p-value is greater than 0.05, meaning we cannot conclude which algorithm performs better.
The Wilcoxon test considers both the sign and rank of differences between pairs, but it does not account for the varying number of reviews across collections. Therefore, while the test results are reliable for qualitative analysis, caution should be exercised when interpreting the specific magnitude of effects.
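
For reference, one common way to obtain such r-values (a sketch, not necessarily the benchmark's exact procedure) is to run the signed-rank test on the paired per-collection Log Losses and convert the resulting Z-score into r = Z / sqrt(N):

```python
import numpy as np
from scipy.stats import wilcoxon, norm

def wilcoxon_effect_size(losses_a, losses_b):
    """Effect size r for paired per-collection Log Losses of two algorithms."""
    diffs = np.asarray(losses_a) - np.asarray(losses_b)
    _, p_value = wilcoxon(losses_a, losses_b)
    n = np.count_nonzero(diffs)    # pairs with a non-zero difference
    z = norm.isf(p_value / 2)      # |Z| recovered from the two-sided p-value
    r = z / np.sqrt(n)
    # Positive r here means algorithm A has the higher (worse) Log Loss overall.
    return np.sign(np.median(diffs)) * r, p_value
```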
If you would like to support this project, please consider sharing your data with us. The shared data will be stored in the `./dataset/` folder.
You can open an issue to submit it: https://github.com/open-spaced-repetition/fsrs-vs-sm17/issues/new/choose
- leee_
- Jarrett Ye
- 天空守望者
- reallyyy
- shisuu
- Winston
- Spade7
- John Qing
- WolfSlytherin
- HyFran
- Hansel221
- 曾经沧海难为水
- Pariance
- github-gracefeng