KEYNOTE LECTURE
From Assessment of Learning to Assessment for Learning

Lambert Schuwirth

Maastricht University
Maastricht, The Netherlands

(+) 31 43 38 85731
(+) 31 43 3885779

Most of us were educated in a setting in which assessment is mainly used to test whether students have acquired sufficient knowledge and skills during the course to proceed to the next module. This “assessment of learning” is essentially placed outside the educational process. “Assessment for learning”, on the other hand, seeks to establish assessment programmes that are inextricably connected to the educational process.1,2

With the “assessment of learning” concept, a number of assumptions and practices have become dominant. First, there is the notion of stable and generic traits. Traditionally, medical competence was defined as the combination of knowledge, skills, problem-solving skills and attitudes. Many of the developments in assessment have been aimed at finding the single best instrument for each trait. Yet one could wonder whether all aspects of medical competence are best modelled using such a stable-trait notion. Suppose we applied this notion to the construct ‘blood pressure’ and took the blood pressure of 10 patients every half hour for 24 hours. If we then found consistent differences between the 10 patients but no variability within each patient, we would logically conclude that the measurement is unreliable, simply because blood pressure is assumed to vary from moment to moment. Standard reliability theory, however, would yield a perfect reliability coefficient.
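
To make this concrete, here is a minimal sketch in classical test theory notation (the variance labels are illustrative, not estimates from an actual dataset): reliability is the proportion of total variance attributable to consistent differences between patients, while occasion-to-occasion variation within patients is treated as error.

    % Classical reliability coefficient for the blood pressure example
    \[
      \rho \;=\; \frac{\sigma^{2}_{\text{between patients}}}
                      {\sigma^{2}_{\text{between patients}} + \sigma^{2}_{\text{within patients (error)}}}
    \]
    % If the 10 patients differ consistently but show no variation across their 48
    % half-hourly readings, the error variance is zero and rho = 1: a "perfect"
    % coefficient for a blood pressure that implausibly never fluctuates.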

A second notion is the tendency to treat individual items as meaningless; they only acquire meaning through their contribution to the total score. This issue becomes especially salient in the discussion on killer stations in OSCEs: should someone who miserably fails the resuscitation station be allowed to pass by, for example, performing very well on a communication skills station, and vice versa?

A third important point is that statistics are mainly used to eliminate information as efficiently as possible. In any test, the information contained in the given answers (from which one can tell, for example, what mistakes were made) is reduced to a pass-fail decision (from which one can only infer whether enough correct answers were given). Such dichotomous information is not very useful when one wants to give students information to guide their learning activities, i.e., in assessment for learning.

The consequence of this is that many assessment programmes seek one single best instrument for each trait, instead of using a variety of instruments, each with its own strengths and weaknesses: a typical 1:1 relationship.

Currently the trend is towards defining medical competence as a combination of competencies. Many official bodies have issued their own competency documents or defined competency domains (e.g.,3,4). The risk now is that an assessment programme seeks the single best instrument for each competency domain. From an assessment for learning perspective, however, this is ill advised. Instead, the programme should be set up so that information relevant to a competency domain is extracted from parts of one test and triangulated with information from another test. Information on why a student failed the resuscitation station would, for example, be triangulated with his or her performance on cardiac anatomy and/or physiology items from a written test, or with observations of cardiac examination in a practice setting. This may seem complicated, but it is much more meaningful than combining the result with the performance on a communication skills station. It is also analogous to what every clinician does on a daily basis: combining information from various sources to determine not only whether the patient is ill or healthy, but also what additional diagnostics to order and what therapeutic actions to start. These are exactly the issues assessment for learning should address:

• is there sufficient information about a student, or should (hypothesis-driven) extra information be collected;
• what educational intervention or remediation is most indicated for this student; and
• what is the prognosis for this student?

But for this we have to learn how to combine information based on the content of the items and not based on their format (no clinician ever calculates the average of all the lab tests s/he ordered).
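
As a purely illustrative sketch of what such content-based combination could look like (the observations, domain labels and the function triangulate_by_domain below are hypothetical, not taken from any existing assessment system), information can be grouped by the competency domain it informs rather than by the instrument it came from:

    from collections import defaultdict

    # Hypothetical observations: each result is tagged with the content/competency
    # domain it informs and the instrument (format) it was collected with.
    observations = [
        {"domain": "cardiac care", "instrument": "OSCE resuscitation station", "result": "fail"},
        {"domain": "cardiac care", "instrument": "written test: cardiac physiology items", "result": "poor"},
        {"domain": "cardiac care", "instrument": "workplace observation: cardiac examination", "result": "borderline"},
        {"domain": "communication", "instrument": "OSCE communication station", "result": "excellent"},
    ]

    def triangulate_by_domain(obs):
        """Group results by content domain rather than by test format, so a
        judgement about, say, 'cardiac care' draws on all relevant sources."""
        by_domain = defaultdict(list)
        for o in obs:
            by_domain[o["domain"]].append((o["instrument"], o["result"]))
        return dict(by_domain)

    for domain, evidence in triangulate_by_domain(observations).items():
        print(domain, "->", evidence)

Aggregated in this way, the poor resuscitation performance stays connected to the related written-test and workplace evidence, instead of being averaged away against an unrelated communication score.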

I do not want to imply that this is easy, but it is doable. It will, however, require a radical change in our teacher-training programmes with respect to assessment. More importantly, it first requires further research. Fortunately, much has already been done in the field of feedback provision, but many other terrains are still quite open (cf.5).

A first terrain is the quality of programmes of assessment. There is a shared opinion that the quality of each assessment instrument is a trade-off between various quality criteria (such as reliability, validity, educational impact, cost-effectiveness and acceptability), but still very little is known about what constitutes the quality of assessment programmes. Baartman et al. have defined outcome criteria for the quality of assessment programmes, and Dijkstra et al. have created a model and defined design criteria for the quality of assessment programmes.6-8

A surprisingly poorly charted territory is which aspects of an assessment programme influence students' learning and teachers' teaching, and through which mechanisms. This is surprising precisely because opinion about it is so strongly shared. Cilliers has done important initial work to gain more insight into the sources of this impact, the mechanisms through which it works and its possible consequences.9,10 It appears to be a highly individualised process, in which individual motivational aspects play an important role.

It is one thing to describe the concessions we have to make when using the prevalent psychometric approaches, and perhaps to be critical of them, but it is quite another to come up with feasible alternatives. Fortunately, the wheel need not be reinvented completely. Interesting work was done in the 1960s and 1970s from a domain-sampling framework.11 Re-exploring these methods, refining them and extending them (e.g., by drawing on qualitative research methodology) can be helpful.12,13 After this, the next essential point on the research agenda is how to choose the most appropriate psychometric approach for each part of the assessment programme.

This brings us to the final and most important topic for further research, namely the use of human judgement in assessment. Our initial thought may be that human judgement is fallible and should therefore be eliminated from the assessment process, but this is impossible.14 A multiple-choice test, for example, may capture students' responses in a numerical way, but its topic, its content, its blueprint, the specific topics of the items and their specific wording are all the result of human judgement. There are biases in human judgement that we should probably try to counteract (for example, framing-related biases, pseudo-opinions and strategic behaviours), but some are inevitably part of human decision making as a means of reducing cognitive load.15,16 Naturalistic decision-making theories and expertise-development theories can help us understand how to improve the judgements of teachers and assessors.17-19 The first studies looking at human judgement in assessment as a diagnostic classification task, and thus as an expertise issue, show that the characteristics of judgement in assessment are highly similar to those of diagnostic expertise.20,21

In summary, this change in thinking about assessment in the educational environment has led to dramatic changes in concepts and a whole new and exciting research agenda.

REFERENCES

  1. Martinez ME, Lipson JI. Assessment for learning. Educational Leadership 1989;47(7):73-5.
  2. Shepard L. The role of assessment in a learning culture. Educational Researcher 2000;29(7):4-14.
  3. CanMEDS. http://rcpsc.medical.org/publications/index.php#canmeds. Ottawa, 1996 (accessed 30 July 2010).
  4. ACGME. http://www.acgme.org/outcome/comp/compCPRL.asp. Chicago, 2007 (accessed 30 July 2010).
  5. Shute VJ. Focus on formative feedback. Review of Educational Research 2008;78(1):153-89.
  6. Baartman LKJ. Assessing the assessment [Dissertation]. Open University, 2008.
  7. Dijkstra J, Galbraith R, Hodges B, McAvoy P, McCrorie P, Southgate L, et al. Development and validation of guidelines for designing programmes of assessment: a modified Delphi-study. (submitted).
  8. Dijkstra J, Van der Vleuten CPM, Schuwirth LWT. A new framework for designing programmes of assessment. Advances in Health Sciences Education 2009 (early online publication).
  9. Cilliers FJ, Schuwirth LWT, Herman N, Adendorff HJ, Van der Vleuten CPM. A model of the sources, consequences and mechanism of impact of summative assessment on how students learn. (submitted).
  10. Cilliers FJ, Schuwirth LWT, Adendorff HJ, Herman N, Van der Vleuten CPM. The mechanisms of impact of summative assessment on medical students’ learning. Advances in Health Sciences Education 2010 (early online publication).
  11. Berk RA. A Consumers’ Guide to Criterion-Referenced Test Reliability. Journal of Educational Measurement 1980;17(4):323-349.
  12. Driessen E, Van der Vleuten CPM, Schuwirth LWT, Van Tartwijk J, Vermunt J. The use of qualitative research criteria for portfolio assessment as an alternative to reliability evaluation: a case study. Medical Education 2005;39(2):214-20.
  13. Ricketts C. A plea for the proper use of criterion-referenced tests in medical assessment. Medical Education 2009;43:1141-6.
  14. Plous S. The psychology of judgment and decision making. New Jersey: McGraw-Hill inc., 1993.
  15. Klein G. Naturalistic Decision Making. Human Factors 2008;50(3):456-60.
  16. van Merrienboer JJ, Sweller J. Cognitive load theory in health professional education: design principles and strategies. Medical Education 2010;44(1):85-93.
  17. Ericsson KA, Charness N. Expert performance. American Psychologist 1994;49(8):725-47.
  18. Marewski JN, Gaissmaier W, Gigerenzer G. Good judgements do not require complex cognition. Cognitive Processing 2009;11(2):103-21.
  19. Schmidt HG, Boshuizen HP. On acquiring expertise in medicine. Special Issue: European educational psychology. Educational Psychology Review 1993;5(3):205-221.
  20. Govaerts MJB, Schuwirth LWT, Van der Vleuten CPM, Muijtjens AMM. Performance rating in the workplace: effects of rater expertise. (submitted).
  21. Govaerts MJB, Van der Vleuten CPM, Schuwirth LWT, Muijtjens AMM. Broadening Perspectives on Clinical Performance Assessment: Rethinking the Nature of In-training Assessment. Advances in Health Sciences Education 2007;12(2):239-60.