
Model Evaluation and Explanation Course at URI

In this course, we will aim to answer one key question, through a multidisciplinary examination:

We will answer this question by first looking at how to measure what an AI can do and then by looking at how to explain what an AI does.

We will begin with a review of model evaluation for simpler ML models, emphasizing the more abstract goals that underlie the standard evaluation procedures. Then we will use these underlying goals to build an understanding of the tools available for evaluating more complex systems. Next, we will study various AI benchmarks to see what specific AI abilities we can make rigorous claims about. In this section, we will also compare how these rigorous claims get translated into nonscientific literature. In the second part of the course, we will study model explanation techniques, again starting with simple models where we can build good intuition and then working up to what we can do for deep neural networks with millions to hundreds of billions of parameters.

Complete this form to request a permission number

Course structure

Spring 2025: TTh 5:00-6:15pm

There will be 1-3 papers or freely available textbook chapters to read most weeks. Most class sessions will be focused on a discussion of the paper(s) assigned for that class session.

While we study up-to-date papers on benchmarks and new model evaluation techniques, students will prepare short, lightning-talk style presentations of papers. For these class sessions, several students will each read and present a different paper, and we will discuss them as a group.

Evaluation

Students will have to:

The grade will be based on two components:

Within each of these categories, some things will be graded on completion only (full points for doing it in good faith and on time) and some will be graded on quality (a more traditional score will be provided). For quality items, a rubric will be provided 1 week or more in advance.

Remember, this is a 4 credit class with only 3 contact hours per week. This means that you should be spending a minimum of 9 hours per week on things for the course outside of class time, but most weeks probably closer to 12. When project milestones are due, expectations for community contribution will be a bit lower; between milestones, especially when I am giving feedback, more community contributions will be expected.

Community contribution evaluation

Community Contributions will be based on

All of these are grouped together because each of your contributions will look different: we are all contributing to the community, but each specific contribution will be different.

For final grading, you will receive one of:

There will be a chance to contest your standing here at the end of the semester if you do not reach meet/exceeds expectations and want something between

I will let you know at least every 2 weeks if I think your contributions are below, meeting, or exceeding expectations. You can cancel out a below expectation with an exceeding expectation in order to get your average up to on track. You cannot have more than 2 below expectations in a row and still get meet/exceeds in your final grade (so you cannot just exceed at the beginning and then quit).
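The check-in rule above can be sketched in code. This is only an illustration under stated assumptions, not the instructor's actual grading procedure: the scoring of below as -1, meeting as 0, and exceeding as +1, and the names `on_track` and `longest_below_run`, are hypothetical choices made for this example.

```python
from typing import List

BELOW, MEETING, EXCEEDING = "below", "meeting", "exceeding"

# Assumed scoring (not from the syllabus): one "exceeding" offsets one "below".
SCORE = {BELOW: -1, MEETING: 0, EXCEEDING: 1}


def longest_below_run(checkins: List[str]) -> int:
    """Return the length of the longest streak of consecutive 'below' check-ins."""
    longest = current = 0
    for status in checkins:
        current = current + 1 if status == BELOW else 0
        longest = max(longest, current)
    return longest


def on_track(checkins: List[str], max_below_run: int = 2) -> bool:
    """Hypothetical reading of the rule: the balance of check-ins must not be
    negative, and there may be no more than `max_below_run` consecutive
    'below' check-ins."""
    for status in checkins:
        if status not in SCORE:
            raise ValueError(f"unknown check-in status: {status!r}")
    balance = sum(SCORE[status] for status in checkins)
    return balance >= 0 and longest_below_run(checkins) <= max_below_run


if __name__ == "__main__":
    # One below, cancelled out by one exceeding.
    print(on_track([BELOW, EXCEEDING, MEETING]))
    # Exceeding early, then three belows in a row.
    print(on_track([EXCEEDING, EXCEEDING, EXCEEDING, BELOW, BELOW, BELOW]))
```

Under these assumptions, the second example is not on track even though its balance is zero, which matches the note that exceeding early and then quitting does not work.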

Project evaluation

This will be evaluated by the following (percentages are of the project grade, not of the final course grade):

Project

The project will be completed by meeting several milestones. These milestones are not likely to change, and any changes will be announced with plenty of warning.

| project milestone | due date | submission |
| --- | --- | --- |
| evaluate ideas based on interest from a group-created list of ideas | 2025-02-13 | course discussion |
| draft proposal of an AI benchmark component | 2025-02-20 | tba |
| peer review of proposals | 2025-03-06 | form will be provided |
| consolidation into teams[1] | 2025-03-06 | in class |
| implement at least one benchmark task | 2025-03-27 | benchmark repo + presentation |
| benchmark map | 2025-04-10 | plain text or table that indicates your plan of how many tasks |
| apply an explanation technique to better understand why at least one model performs as it does on your benchmark | 2025-04-17 | in class demo/discussion |
| better bench self assessment for peer review | 2025-04-24 | checklist template in repo or paper overleaf (use google doc and convert to md for repo) |
| draft paper[2] for peer review[3] | 2025-04-24 | benchmark repo or overleaf, team preference, registration of the location required |
| extend the benchmark with a task that gives different performance[4] | 2025-04-24,9 | benchmark repo |
| in class presentation | 2025-04-24,9 | presentation[5] and draft paper[6] |
| complete a final conference-paper[2] style report | 2025-05-06 | final paper for grading and short reflection; form will be active after May 1 |
| optional: register abstract for NeurIPS | 2025-05-11 | abstract and all authors registered in openreview |
| optional: submit paper for NeurIPS | 2025-05-15 | paper with all required sections to openreview, including checklist |
| optional: submit supplemental materials for NeurIPS | 2025-05-22 | paper with all required sections to openreview |

LLM use

All work must reflect the student's understanding. LLM assistants may be used to improve writing quality for assignments where writing quality will be assessed. However, when quality will be assessed, concision and proper style will also be required. Any submitted writing that contains classic “bot” phrasing or that is overly verbose and off topic will not be assessed and will earn zero credit.

At the instructor’s discretion, any submitted work may be re-assessed by oral exam to ensure that the student actually understands.

Tools

This class meets in person, synchronously.

Overall Schedule

| date | topic | due |
| --- | --- | --- |
| 2025-01-23 | intro | none |
| 2025-01-28 | classic evaluation in predictive systems | prep work (see prismia) |
| 2025-01-30 | novel evaluation in predictive systems + syllabus discussion | contribute informal evaluations |
| 2025-02-04 | basics of evaluation in generative systems + reading research discussion | discuss/comment on informal eval + contribute benchmark ideas + copilot arena informal audit |
| 2025-02-06 | what is a benchmark? + synthesizing research discussion | possible benchmarks to present for approval |
| 2025-02-11 | benchmark spotlights | spotlight presentations (all students 5 min presentation) |
| 2025-02-13 | benchmark common structures | decide benchmark proposal topic (rough) |
| 2025-02-18 | benchmark spotlights II (possible swap to following week) | |
| 2025-02-20 | benchmark spotlights III (possible swap to following week) | full benchmark proposal |
| 2025-02-25 | evaluating evaluation I (possible swap to previous week) | |
| 2025-02-27 | evaluating evaluation II (possible swap to previous week) | |
| 2025-03-04 | peer review (Sarah out) | proposal revisions |
| 2025-03-06 | team formation (Sarah out) | full reads of proposals |
| 2025-03-11 | spring break | none |
| 2025-03-13 | spring break | none |
| 2025-03-18 | canceled | |
| 2025-03-20 | explanation concepts | |
| 2025-03-25 | TCAV | |
| 2025-03-27 | benchmark demos | presentations |
| 2025-04-01 | evaluating explanations | |
| 2025-04-03 | communicating and contextualizing performance | paper location registered |
| 2025-04-08 | explanation techniques II | explanation technique spotlights |
| 2025-04-10 | explanation techniques III | explanation technique spotlights |
| 2025-04-15 | communicating explanations | none |
| 2025-04-17 | making a plan based on explanation results | explanation results spotlights - one per team, 5 minutes max |
| 2025-04-22 | communicating limitations | paper draft for reading |
| 2025-04-24 | COF Sensitivity, PersonaPromptBench, poison-detection-benchmark | presentations |
| 2025-04-29 | llm-webdev-rank, WABench, fairnessBench | presentations |

Footnotes
  1. Depending on enrollment and similarity of proposal topics, benchmarks may be completed in teams or individually, but the proposal stage and peer review will be individual assignments. Teams will be expected to have more tasks in their benchmarks than individuals.

  2. The paper should be in the style (tone and content level) of a CS conference. It should be 6-10 pages in the NeurIPS format (LaTeX from NeurIPS or the arxiv_nips MyST template), including figures, with unlimited additional pages for references.

  3. There is a short form to complete for each benchmark; you should take enough notes to complete the forms after class, once per presentation.

  4. If the LLM scores well on the first task, add one where it does not score well; if it scores poorly on the first, add one that it can do well.

  5. Your presentation should be 15 minutes, strictly enforced, so that there is time for questions

  6. Your draft should be a complete paper, but may have only partial results
