Diffstat (limited to 'paper/main.tex')
-rw-r--r-- paper/main.tex | 16
1 files changed, 8 insertions, 8 deletions
diff --git a/paper/main.tex b/paper/main.tex
index 4ab205e..6e824b6 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -190,20 +190,20 @@ The main lesson is to decompose the evaluation question before interpreting the
 \begin{thebibliography}{10}
 
-\bibitem[Paleka et~al.(2026)Paleka, et~al.]{paleka2026pitfalls}
+\bibitem[Paleka et~al.(2026)]{paleka2026pitfalls}
 Daniel Paleka et~al.
-\newblock Pitfalls in evaluating model behavior: measurement, reporting, and
-  interpretability failures.
+\newblock Pitfalls in evaluating language model forecasters.
 \newblock In {\em International Conference on Learning Representations}, 2026.
 
-\bibitem[O'Bray et~al.(2022)O'Bray, et~al.]{obray2022evaluation}
+\bibitem[O'Bray et~al.(2022)]{obray2022evaluation}
 Leslie O'Bray et~al.
-\newblock Evaluation beyond leaderboard metrics: methodology matters.
+\newblock Evaluation metrics for graph generative models: problems, pitfalls,
+  and practical solutions.
 \newblock In {\em International Conference on Learning Representations}, 2022.
 
-\bibitem[Jordan et~al.(2020)Jordan, et~al.]{jordan2020evaluating}
-Matt Jordan et~al.
-\newblock Evaluating machine learning: tests, cases, and expectations.
+\bibitem[Jordan et~al.(2020)]{jordan2020evaluating}
+Scott~M. Jordan et~al.
+\newblock Evaluating the performance of reinforcement learning algorithms.
 \newblock In {\em International Conference on Machine Learning}, 2020.
 
 \bibitem[Lillicrap et~al.(2016)Lillicrap, Cownden, Tweed, and
