This is a simple comparison between FSRS and SM-17. `FSRS-v-SM16-v-SM17.ipynb` is the notebook for the comparison.
Due to the differences between the workflows of SuperMemo and Anki, it is not easy to compare the two algorithms. I tried to make the comparison as fair as possible. Here are some notes:
- The first interval in SuperMemo is the duration between creating the card and the first review. In Anki, the first interval is the duration between the first review and the second review. So I removed the first record of each card in SM-17 data.
- There are six grades in SuperMemo, but only four in Anki. So I merged grades 0, 1, and 2 in SuperMemo into 1 in Anki, and mapped 3, 4, and 5 in SuperMemo to 2, 3, and 4 in Anki (see the sketch after these notes).
- I use the `R (SM17)(exp)` value recorded in `sm18/systems/{collection_name}/stats/SM16-v-SM17.csv` as the prediction of SM-17. Reference: Confusion among R(SM16), R(SM17)(exp), R(SM17), R est. and expFI.
- To ensure FSRS has the same information as SM-17, I implemented an online-learning version of FSRS, where FSRS has zero knowledge of future reviews, just as SM-17 does.
- The results are based on data from a small group of people and may differ from the results of other SuperMemo users.
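
Concretely, the preprocessing described in the first two notes can be sketched as follows. This is a rough illustration, not the notebook's exact code; the column names `card_id`, `review_time`, and `grade` are assumptions.

```python
import pandas as pd

# Grade mapping described above: SuperMemo 0/1/2 -> Anki 1 ("Again"),
# SuperMemo 3/4/5 -> Anki 2/3/4 ("Hard"/"Good"/"Easy").
SM_TO_ANKI = {0: 1, 1: 1, 2: 1, 3: 2, 4: 3, 5: 4}

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Column names ("card_id", "review_time", "grade") are assumptions,
    # not necessarily the notebook's actual schema.
    df = df.sort_values(["card_id", "review_time"])
    # Drop the first record of each card, since in SuperMemo it measures
    # the time from card creation to the first review.
    df = df[df.groupby("card_id").cumcount() > 0]
    # Map SuperMemo's six grades onto Anki's four.
    df = df.assign(anki_grade=df["grade"].map(SM_TO_ANKI))
    return df
```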
We use three metrics in the SRS benchmark to evaluate how well these algorithms work: Log Loss, AUC, and a custom RMSE that we call RMSE (bins).
- Log Loss (also known as Binary Cross Entropy): used primarily in binary classification problems, Log Loss serves as a measure of the discrepancies between predicted probabilities of recall and review outcomes (1 or 0). It quantifies how well the algorithm approximates the true recall probabilities. Log Loss ranges from 0 to infinity, lower is better.
- Root Mean Square Error in Bins (RMSE (bins)): this is a metric designed for use in the SRS benchmark. In this approach, predictions and review outcomes are grouped into bins based on three features: the interval length, the number of reviews, and the number of lapses. Within each bin, the squared difference between the average predicted probability of recall and the average recall rate is calculated. These squared differences are weighted by the sample size of each bin, and the final weighted root mean square error is calculated. This metric provides a nuanced understanding of algorithm performance across different probability ranges; a rough sketch of the calculation is shown below. For more details, you can read The Metric. RMSE (bins) ranges from 0 to 1, lower is better.
- AUC (Area under the ROC Curve): this metric tells us how much the algorithm is capable of distinguishing between classes. AUC ranges from 0 to 1, however, in practice it's almost always greater than 0.5; higher is better.
Log Loss and RMSE (bins) measure calibration: how well predicted probabilities of recall match the real data. AUC measures discrimination: how well the algorithm can tell two (or more, generally speaking) classes apart. AUC can be good (high) even if Log Loss and RMSE are poor.
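
As a rough illustration (not the benchmark's exact implementation), the three metrics can be computed along these lines. Here `y` holds review outcomes (1 = recalled, 0 = forgotten), `p` holds predicted probabilities of recall, and the bin identifiers are a simplified placeholder for the real binning by interval length, review count, and lapse count.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

def rmse_bins(y, p, bin_ids):
    """Weighted RMSE between the mean prediction and the mean recall rate per bin."""
    y, p, bin_ids = map(np.asarray, (y, p, bin_ids))
    sq_errors, weights = [], []
    for b in np.unique(bin_ids):
        mask = bin_ids == b
        sq_errors.append((p[mask].mean() - y[mask].mean()) ** 2)
        weights.append(mask.sum())  # weight each bin by its sample size
    return float(np.sqrt(np.average(sq_errors, weights=weights)))

y = np.array([1, 0, 1, 1, 0, 1])               # review outcomes
p = np.array([0.9, 0.6, 0.8, 0.7, 0.3, 0.95])  # predicted recall probabilities
bins = np.array([0, 0, 1, 1, 2, 2])            # placeholder bin ids

print(log_loss(y, p))         # lower is better
print(rmse_bins(y, p, bins))  # lower is better
print(roc_auc_score(y, p))    # higher is better
```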
Total users: 16
Total repetitions: 194,281
The following tables present the means and the 99% confidence intervals. The best result is highlighted in bold. Arrows indicate whether lower (↓) or higher (↑) values are better.
Weighted by the number of reviews:

| Algorithm | Log Loss↓ | RMSE (bins)↓ | AUC↑ |
| --- | --- | --- | --- |
| FSRS-6 | **0.36±0.077** | **0.05±0.012** | **0.68±0.056** |
| FSRS-5 | 0.37±0.084 | 0.06±0.022 | 0.68±0.061 |
| FSRS-4.5 | 0.37±0.088 | 0.06±0.023 | 0.68±0.060 |
| FSRSv4 | 0.38±0.088 | 0.06±0.024 | 0.67±0.060 |
| FSRSv3 | 0.40±0.091 | 0.08±0.020 | 0.65±0.049 |
| SM-17 | 0.41±0.098 | 0.08±0.020 | 0.62±0.038 |
| SM-16 | 0.42±0.087 | 0.11±0.026 | 0.60±0.020 |
Unweighted:

| Algorithm | Log Loss↓ | RMSE (bins)↓ | AUC↑ |
| --- | --- | --- | --- |
| FSRS-6 | **0.41±0.069** | **0.08±0.027** | **0.65±0.047** |
| FSRS-5 | 0.43±0.076 | 0.10±0.037 | 0.64±0.048 |
| FSRS-4.5 | 0.43±0.091 | 0.10±0.036 | 0.64±0.047 |
| FSRSv4 | 0.45±0.086 | 0.11±0.049 | 0.63±0.049 |
| FSRSv3 | 0.5±0.11 | 0.12±0.039 | 0.62±0.043 |
| SM-17 | 0.5±0.11 | 0.10±0.033 | 0.63±0.035 |
| SM-16 | 0.5±0.11 | 0.12±0.034 | 0.61±0.024 |
Averages weighted by the number of reviews are more representative of "best case" performance when plenty of data is available. Since almost all algorithms perform better when there's a lot of data to learn from, weighting by n(reviews) biases the average towards lower values.
Unweighted averages are more representative of "average case" performance. In reality, not every user will have hundreds of thousands of reviews, so the algorithm won't always be able to reach its full potential.
The metrics presented above can be difficult to interpret. In order to make it easier to understand how algorithms perform relative to each other, the image below shows the percentage of users for whom algorithm A (row) has a lower Log Loss than algorithm B (column). For example, FSRS-6 has a 75% superiority over SM-17, meaning that for 75% of all collections in this benchmark, FSRS-6 can estimate the probability of recall more accurately.
This table is based on 16 collections.
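
As an illustration, the superiority percentage for a pair of algorithms can be computed roughly like this. The variable names are hypothetical; `log_losses` maps each algorithm name to its per-collection Log Loss values.

```python
import numpy as np

def superiority(log_losses: dict, a: str, b: str) -> float:
    """Percentage of collections where algorithm `a` has a lower Log Loss than `b`."""
    return float(np.mean(np.asarray(log_losses[a]) < np.asarray(log_losses[b])) * 100)

# e.g. superiority(log_losses, "FSRS-6", "SM-17") == 75.0
# would mean FSRS-6 beats SM-17 on 12 of the 16 collections.
```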
The figures below show effect sizes comparing the Log Loss between all pairs of algorithms, using r-values from the Wilcoxon signed-rank test:
The colors indicate:
- Red shades indicate that the row algorithm performs worse than the column algorithm:
  - Dark red: large effect (r > 0.5)
  - Red: medium effect (0.5 ≥ r > 0.2)
  - Light red: small effect (r ≤ 0.2)
- Green shades indicate that the row algorithm performs better than the column algorithm:
  - Dark green: large effect (r > 0.5)
  - Green: medium effect (0.5 ≥ r > 0.2)
  - Light green: small effect (r ≤ 0.2)
- Grey indicates that the p-value is greater than 0.05, meaning we cannot conclude which algorithm performs better.
The Wilcoxon test considers both the sign and rank of differences between pairs, but it does not account for the varying number of reviews across collections. Therefore, while the test results are reliable for qualitative analysis, caution should be exercised when interpreting the specific magnitude of effects.
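
For reference, one common way to obtain such r-values (a sketch, not necessarily the benchmark's exact procedure) is to run the signed-rank test on the paired per-collection Log Losses and convert the resulting Z-score into r = Z / sqrt(N):

```python
import numpy as np
from scipy.stats import wilcoxon, norm

def wilcoxon_effect_size(losses_a, losses_b):
    """Effect size r for paired per-collection Log Losses of two algorithms."""
    diffs = np.asarray(losses_a) - np.asarray(losses_b)
    _, p_value = wilcoxon(losses_a, losses_b)
    n = np.count_nonzero(diffs)    # pairs with a non-zero difference
    z = norm.isf(p_value / 2)      # |Z| recovered from the two-sided p-value
    r = z / np.sqrt(n)
    # Positive r here means algorithm A has the higher (worse) Log Loss overall.
    return np.sign(np.median(diffs)) * r, p_value
```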
If you would like to support this project, please consider sharing your data with us. The shared data will be stored in the `./dataset/` folder.
You can open an issue to submit it: https://github.com/open-spaced-repetition/fsrs-vs-sm17/issues/new/choose
- leee_
- Jarrett Ye
- 天空守望者
- reallyyy
- shisuu
- Winston
- Spade7
- John Qing
- WolfSlytherin
- HyFran
- Hansel221
- 曾经沧海难为水
- Pariance
- github-gracefeng